Commit Graph

821 Commits

Author SHA1 Message Date
Krzysztof Rymski be30473dc6 Internal changes
PiperOrigin-RevId: 835160876
2025-11-21 04:08:04 -08:00
Martin Stolle 5a500872b8 Internal change
PiperOrigin-RevId: 835115693
2025-11-21 01:17:45 -08:00
Martin Stolle 49d420aeaf Add some comments.
PiperOrigin-RevId: 834173319
2025-11-19 01:09:15 -08:00
The gemma.cpp Authors b8f6be72b1 Improves autodetection of Gemma3-1B.
Uses the key_norm and query_norm layers to disambiguate between the Gemma2-2B and Gemma3-1B models.
Since Gemma3-1B is not multimodal, ViT is not an effective disambiguator. KQ normalization is a structural disambiguator between gemma2 and gemma3.

PiperOrigin-RevId: 833213331
2025-11-17 01:12:50 -08:00
The gemma.cpp Authors 7c1656f2fc Fix NibbleCodec for AVX3_{ZEN4,DL,SPR}
PiperOrigin-RevId: 831002073
2025-11-11 11:31:25 -08:00
Jan Wassenberg 3e18db17f4 Avoid hard-coding kPatchSize. Thanks @Somet2mes for reporting. Fixes #762.
PiperOrigin-RevId: 829308896
2025-11-07 00:32:31 -08:00
Charles Zhao f8131339a7 Refactor for continuous batching. This CL does not change the current behavior of the code. It only extracts two functions that will later be called for adding continuous batching.
PiperOrigin-RevId: 829104661
2025-11-06 14:20:17 -08:00
Martin Stolle 35e9f9f05f Introduce attention implementation configurability.
PiperOrigin-RevId: 828971705
2025-11-06 08:43:41 -08:00
Jan Wassenberg 091b4567c9 Minor: ParallelismStrategy->Parallelism
PiperOrigin-RevId: 828936578
2025-11-06 06:56:10 -08:00
Jan Wassenberg a344a70c59 Change (old) attention behavior to disallow wraparound, enforced via assertion.
Shared kU64PerLine constant

PiperOrigin-RevId: 828072451
2025-11-04 11:52:40 -08:00
Charles Zhao 3a63a12624 Allow prefill-only runs by allowing max_prompt_size == seq_len
PiperOrigin-RevId: 827415258
2025-11-03 03:17:54 -08:00
Phil Culliton ab87807a4c Pre-compress query activations to BF16 before FlashAttention.
PiperOrigin-RevId: 826524997
2025-10-31 09:49:44 -07:00
Ray Smith 8a100c1e8d Added access to flash attention internals to TileFlashAttention4
PiperOrigin-RevId: 826011137
2025-10-30 06:50:05 -07:00
Jan Wassenberg ee7d79c0a6 Add Decompress2AndCompressInplace helper
PiperOrigin-RevId: 825966142
2025-10-30 04:04:41 -07:00
Jan Wassenberg 006999063c Fix PaliGemma matmul warning
PiperOrigin-RevId: 825627406
2025-10-29 11:15:50 -07:00
Phil Culliton ecab0cef3a Update README with Gemma 3 support and contributor acknowledgments
PiperOrigin-RevId: 825588241
2025-10-29 09:46:51 -07:00
Phil Culliton 036f91f63c Add Gemma 3 270M to gemma_test
PiperOrigin-RevId: 825582368
2025-10-29 09:31:32 -07:00
Phil Culliton 116cd6eff6 BF16 mixed-mode flash attention
PiperOrigin-RevId: 825433929
2025-10-29 01:48:28 -07:00
Jan Wassenberg 4bd465ffd3 Also update attention.h to type-erased query_norm_scale
PiperOrigin-RevId: 825014334
2025-10-28 06:48:33 -07:00
Jan Wassenberg 3cc0139ebb Fix excessive KC/MC from prior change
This could lead to stack overflow in B_storage.

Also do not require specific type for query_norm_scale,
update batch sizes for attention tensors,
more verbose Mat shape/type checks.

PiperOrigin-RevId: 824987689
2025-10-28 05:33:01 -07:00
Biruk Mammo 5a05857deb [Gemma.cpp] Allows non-owned arguments for attention methods.
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.

PiperOrigin-RevId: 824584177
2025-10-27 10:43:25 -07:00
Jan Wassenberg 86200ce224 1.01x speedup: improved autotune
Group M=4..7 into same config. Add configs for power of two sizes.
Allow odd mc to enable a single range for odd M.

io.cc: warning fix(cast).
IsBlock -> !IsOneMC
benchmark_helper: best for verbosity 3, all configs for 4
ops_test: remove unused includes
PiperOrigin-RevId: 824475104
2025-10-27 05:35:31 -07:00
Jan Wassenberg 8198e7104a Batch bench: 4 runs to give autotuning more time
Also print auto-tune info for verbosity 3.

PiperOrigin-RevId: 823555008
2025-10-24 09:14:39 -07:00
Theotime Combes 1bdde1af3c Add config flag for global timescale & rely on config to deduce wrapping
PiperOrigin-RevId: 823512377
2025-10-24 06:54:56 -07:00
Jan Wassenberg a48e614f64 1.02x speedup: improve load balance and simplify parallelFor
Remove ParallelizeOne/TwoRange, use ParallelForAcross/WithinCluster instead.

PiperOrigin-RevId: 823388890
2025-10-24 00:19:09 -07:00
Nitin Gangahar 085a34965a Update README since backprop and Adam optimizer have been deleted.
PiperOrigin-RevId: 823388833
2025-10-24 00:18:05 -07:00
Jan Wassenberg 3ed403e287 Major cleanup of profiler zones, add Caller annotation for all pool.Run
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor

PiperOrigin-RevId: 822934530
2025-10-23 01:54:24 -07:00
Nitin Gangahar 9e8ac7e2f0 Use correct offsets in BlobWriter.
Updates the FileSize() calls in BlobWriter to instead use a computed offset.
FileSize() may not work with all implementations of File, which can cause issues
while writing.

PiperOrigin-RevId: 822646338
2025-10-22 10:29:04 -07:00
Copybara-Service 64a82ed645 Merge pull request #735 from Hitesh-ed:gemma.cpp-windows-build-fix
PiperOrigin-RevId: 822559272
2025-10-22 06:26:29 -07:00
Hitesh K V 027288b5e4 Merge branch 'dev' into gemma.cpp-windows-build-fix
2025-10-22 16:53:48 +05:30
Jan Wassenberg acede9d682 Warning fix (unused var), Windows build fix (missing member variable)
PiperOrigin-RevId: 822172982
2025-10-21 10:17:34 -07:00
Hitesh K V c55120fc6d Merge branch 'dev' into gemma.cpp-windows-build-fix
2025-10-16 20:18:09 +05:30
Jan Wassenberg f59eb2ed72 Remove multi-package support from topology
Also no longer assume equal-sized clusters

PiperOrigin-RevId: 820164125
2025-10-16 04:00:35 -07:00
Hitesh K V cc1d256cff Update CMakePresets.json
Adding the following cache variable in the CMakePresets.json to enforce modern policies automatically

This ensures all developers can run cmake --preset windows without hitting legacy compatibility or deprecation issues.
2025-10-16 12:08:29 +05:30
Jan Wassenberg 9b6ed1a58f gemma_batch_bench: generate more unique prompts
PiperOrigin-RevId: 819944137
2025-10-15 15:46:05 -07:00
Phil Culliton 503aaddd65 Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
Ray Smith ee18916abf Removed the PROFILER_ZONE from the most frequently called functions to reduce the overhead.
PiperOrigin-RevId: 819739402
2025-10-15 07:10:04 -07:00
Ray Smith e3e8511e79 Initialization of profiler zones.
PiperOrigin-RevId: 819662587
2025-10-15 03:05:58 -07:00
Ray Smith fb6fa793f4 Added a global (to gemma) zones list so that most PROFILER_ZONE3 call sites avoid the synchronization required for the static const initialization of the zone handle.
Improved flash_attention to enable profiling using the new zones.

PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
Jan Wassenberg 035273c184 tune pool kSpin mode in threading_context
Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes.

threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order
Also enable SPR target (Zen4 is AMD-only),
update Highway version for renamed Thread()->GlobalIdx().
PiperOrigin-RevId: 816223017
2025-10-07 08:36:26 -07:00
Nitin Gangahar 9dc802c7aa Add logging to io.cc on failed write and read.
This should provide insights into any failures.

PiperOrigin-RevId: 815784482
2025-10-06 10:25:41 -07:00
Ray Smith 684a0444e9 Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines
PiperOrigin-RevId: 814241032
2025-10-02 08:15:16 -07:00
Ray Smith 14244664c8 Avoid transposing Q when it isn't needed
PiperOrigin-RevId: 814187984
2025-10-02 05:16:35 -07:00
Jan Wassenberg fe5a39990e Improve FlashAttention threading:
kFlat for RMSNorm (hierarchical is excessive),
profiler zone naming improvements.

PiperOrigin-RevId: 814144012
2025-10-02 02:37:05 -07:00
Ray Smith 6098a022b3 Increased parallelism for RMSNormAndPositionalEncoding
PiperOrigin-RevId: 813738994
2025-10-01 07:11:14 -07:00
Ray Smith 2f6cbde8ff Added a smaller tile size to flash attention for smaller batch sizes
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Ray Smith 4974f24832 Fixed bug with softcap in single flash attention
PiperOrigin-RevId: 813164938
2025-09-30 02:17:58 -07:00
Nitin Gangahar 16536996d1 Remove less useful spammy log lines.
PiperOrigin-RevId: 812694572
2025-09-29 02:28:41 -07:00
Nitin Gangahar 667a3f117a Utilize multiple cores to read weight batches.
PiperOrigin-RevId: 811893059
2025-09-26 11:28:33 -07:00
Ray Smith d15731d201 Used hn::BroadcastLane instead of Set(..., x.raw)
PiperOrigin-RevId: 811386295
2025-09-25 09:42:03 -07:00