Commit Graph

814 Commits

Author SHA1 Message Date
Martin Stolle 35e9f9f05f Introduce attention implementation configurability.
PiperOrigin-RevId: 828971705
2025-11-06 08:43:41 -08:00
Jan Wassenberg 091b4567c9 Minor: ParallelismStrategy->Parallelism
PiperOrigin-RevId: 828936578
2025-11-06 06:56:10 -08:00
Jan Wassenberg a344a70c59 Change (old) attention behavior to disallow wraparound, enforced via assertion.
Shared kU64PerLine constant

PiperOrigin-RevId: 828072451
2025-11-04 11:52:40 -08:00
Charles Zhao 3a63a12624 Allow prefill-only runs by allowing max_prompt_size == seq_len
PiperOrigin-RevId: 827415258
2025-11-03 03:17:54 -08:00
Phil Culliton ab87807a4c Pre-compress query activations to BF16 before FlashAttention.
PiperOrigin-RevId: 826524997
2025-10-31 09:49:44 -07:00
Ray Smith 8a100c1e8d Added access to flash attention internals to TileFlashAttention4
PiperOrigin-RevId: 826011137
2025-10-30 06:50:05 -07:00
Jan Wassenberg ee7d79c0a6 Add Decompress2AndCompressInplace helper
PiperOrigin-RevId: 825966142
2025-10-30 04:04:41 -07:00
Jan Wassenberg 006999063c Fix PaliGemma matmul warning
PiperOrigin-RevId: 825627406
2025-10-29 11:15:50 -07:00
Phil Culliton ecab0cef3a Update README with Gemma 3 support and contributor acknowledgments
PiperOrigin-RevId: 825588241
2025-10-29 09:46:51 -07:00
Phil Culliton 036f91f63c Add Gemma 3 270M to gemma_test
PiperOrigin-RevId: 825582368
2025-10-29 09:31:32 -07:00
Phil Culliton 116cd6eff6 BF16 mixed-mode flash attention
PiperOrigin-RevId: 825433929
2025-10-29 01:48:28 -07:00
Jan Wassenberg 4bd465ffd3 Also update attention.h to type-erased query_norm_scale
PiperOrigin-RevId: 825014334
2025-10-28 06:48:33 -07:00
Jan Wassenberg 3cc0139ebb Fix excessive KC/MC from prior change
This could lead to stack overflow in B_storage.

Also do not require specific type for query_norm_scale,
update batch sizes for attention tensors,
more verbose Mat shape/type checks.

PiperOrigin-RevId: 824987689
2025-10-28 05:33:01 -07:00
Biruk Mammo 5a05857deb [Gemma.cpp] Allows non-owned arguments for attention methods.
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.

PiperOrigin-RevId: 824584177
2025-10-27 10:43:25 -07:00
Jan Wassenberg 86200ce224 1.01x speedup: improved autotune
Group M=4..7 into same config. Add configs for power of two sizes.
Allow odd mc to enable a single range for odd M.

io.cc: warning fix (cast).
IsBlock -> !IsOneMC
benchmark_helper: best for verbosity 3, all configs for 4
ops_test: remove unused includes
PiperOrigin-RevId: 824475104
2025-10-27 05:35:31 -07:00
Jan Wassenberg 8198e7104a Batch bench: 4 runs to give autotuning more time
Also print auto-tune info for verbosity 3.

PiperOrigin-RevId: 823555008
2025-10-24 09:14:39 -07:00
Theotime Combes 1bdde1af3c Add config flag for global timescale & rely on config to deduce wrapping
PiperOrigin-RevId: 823512377
2025-10-24 06:54:56 -07:00
Jan Wassenberg a48e614f64 1.02x speedup: improve load balance and simplify parallelFor
Remove ParallelizeOne/TwoRange, use ParallelForAcross/WithinCluster instead.

PiperOrigin-RevId: 823388890
2025-10-24 00:19:09 -07:00
Nitin Gangahar 085a34965a Update README since backprop and Adam optimizer have been deleted.
PiperOrigin-RevId: 823388833
2025-10-24 00:18:05 -07:00
Jan Wassenberg 3ed403e287 Major cleanup of profiler zones, add Caller annotation for all pool.Run
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor

PiperOrigin-RevId: 822934530
2025-10-23 01:54:24 -07:00
Nitin Gangahar 9e8ac7e2f0 Use correct offsets in BlobWriter.
Updates the FileSize() calls in BlobWriter to instead use a computed offset.
FileSize() may not work with all implementations of File which can cause issues
while writing.

PiperOrigin-RevId: 822646338
2025-10-22 10:29:04 -07:00
Copybara-Service 64a82ed645 Merge pull request #735 from Hitesh-ed:gemma.cpp-windows-build-fix
PiperOrigin-RevId: 822559272
2025-10-22 06:26:29 -07:00
Hitesh K V 027288b5e4 Merge branch 'dev' into gemma.cpp-windows-build-fix
2025-10-22 16:53:48 +05:30
Jan Wassenberg acede9d682 Warning fix (unused var), Windows build fix (missing member variable)
PiperOrigin-RevId: 822172982
2025-10-21 10:17:34 -07:00
Hitesh K V c55120fc6d Merge branch 'dev' into gemma.cpp-windows-build-fix
2025-10-16 20:18:09 +05:30
Jan Wassenberg f59eb2ed72 Remove multi-package support from topology
Also no longer assume equal-sized clusters

PiperOrigin-RevId: 820164125
2025-10-16 04:00:35 -07:00
Hitesh K V cc1d256cff Update CMakePresets.json
Adding the following cache variable in the CMakePresets.json to enforce modern policies automatically

This ensures all developers can run cmake --preset windows without hitting legacy compatibility or deprecation issues.
2025-10-16 12:08:29 +05:30
Jan Wassenberg 9b6ed1a58f gemma_batch_bench: generate more unique prompts
PiperOrigin-RevId: 819944137
2025-10-15 15:46:05 -07:00
Phil Culliton 503aaddd65 Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
Ray Smith ee18916abf Removed the PROFILER_ZONE from the most frequently called functions to reduce overhead.
PiperOrigin-RevId: 819739402
2025-10-15 07:10:04 -07:00
Ray Smith e3e8511e79 Initialization of profiler zones.
PiperOrigin-RevId: 819662587
2025-10-15 03:05:58 -07:00
Ray Smith fb6fa793f4 Added a global (to gemma) zones list so that most PROFILER_ZONE3 call sites avoid the synchronization required for the static const initialization of the zone handle.
Improved flash_attention to enable profiling using the new zones.

PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
Jan Wassenberg 035273c184 tune pool kSpin mode in threading_context
Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes.

threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order
Also enable SPR target (Zen4 is AMD-only),
update Highway version for renamed Thread()->GlobalIdx().
PiperOrigin-RevId: 816223017
2025-10-07 08:36:26 -07:00
Nitin Gangahar 9dc802c7aa Add logging to io.cc on failed write and read.
This should provide insights into any failures.

PiperOrigin-RevId: 815784482
2025-10-06 10:25:41 -07:00
Ray Smith 684a0444e9 Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines
PiperOrigin-RevId: 814241032
2025-10-02 08:15:16 -07:00
Ray Smith 14244664c8 Avoid transposing Q when it isn't needed
PiperOrigin-RevId: 814187984
2025-10-02 05:16:35 -07:00
Jan Wassenberg fe5a39990e Improve FlashAttention threading:
kFlat for RMSNorm (hierarchical is excessive),
profiler zone naming improvements.

PiperOrigin-RevId: 814144012
2025-10-02 02:37:05 -07:00
Ray Smith 6098a022b3 Increased parallelism for RMSNormAndPositionalEncoding
PiperOrigin-RevId: 813738994
2025-10-01 07:11:14 -07:00
Ray Smith 2f6cbde8ff Added a smaller tile size to flash attention for smaller batch sizes
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Ray Smith 4974f24832 Fixed bug with softcap in single flash attention
PiperOrigin-RevId: 813164938
2025-09-30 02:17:58 -07:00
Nitin Gangahar 16536996d1 Remove less useful spammy log lines.
PiperOrigin-RevId: 812694572
2025-09-29 02:28:41 -07:00
Nitin Gangahar 667a3f117a Utilize multiple cores to read weight batches.
PiperOrigin-RevId: 811893059
2025-09-26 11:28:33 -07:00
Ray Smith d15731d201 Used hn::BroadcastLane instead of Set(..., x.raw)
PiperOrigin-RevId: 811386295
2025-09-25 09:42:03 -07:00
Charles Zhao 4f0c633248 Added QueryResultAndMetrics and BatchQueryModelWithMetrics to return TimingInfo alongside query results.
PiperOrigin-RevId: 810634261
2025-09-23 17:02:29 -07:00
Jan Wassenberg fac8aac4cb Internal change
PiperOrigin-RevId: 809975026
2025-09-22 05:37:03 -07:00
Jan Wassenberg 501fdf000e Remove no longer used MatVec
PiperOrigin-RevId: 809059409
2025-09-19 09:03:22 -07:00
Jan Wassenberg b603425bf3 Fix batch inference: dangling reference
Also add more detailed asserts/error messages.

PiperOrigin-RevId: 807695421
2025-09-16 08:01:56 -07:00
Jan Wassenberg f3bc1c17da 1.03x speedup: fused FFN
matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC
matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps.
PiperOrigin-RevId: 807291701
2025-09-15 10:26:37 -07:00
Charles Zhao 59db30e209 Add const restriction in benchmark_helper.cc and paligemma_helper.cc to remove a few unnecessary copies.
PiperOrigin-RevId: 807004597
2025-09-14 16:27:26 -07:00
Ray Smith c9b8479f7d Added zero-initialization to att_out.
Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available.

PiperOrigin-RevId: 806284756
2025-09-12 07:48:23 -07:00