Krzysztof Rymski
15f503e181
Internal changes
PiperOrigin-RevId: 835158854
2025-11-21 04:09:00 -08:00
Martin Stolle
5a500872b8
Internal change
PiperOrigin-RevId: 835115693
2025-11-21 01:17:45 -08:00
Martin Stolle
49d420aeaf
Add some comments.
PiperOrigin-RevId: 834173319
2025-11-19 01:09:15 -08:00
The gemma.cpp Authors
b8f6be72b1
Improves autodetection of Gemma3-1B.
Uses the key_norm and query_norm layers to disambiguate between the Gemma2-2B and Gemma3-1B models.
Since Gemma3-1B is not multimodal, the presence of ViT weights cannot distinguish it. Query/key normalization is a structural difference between Gemma 2 and Gemma 3.
PiperOrigin-RevId: 833213331
2025-11-17 01:12:50 -08:00
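The Gemma3-1B autodetection rule above can be sketched as a check over tensor names. This is a minimal illustration under assumptions, not gemma.cpp's detection code: the `Family` enum, `HasTensor` and `GuessFamily` are hypothetical, and matching on the substrings `query_norm`/`key_norm` assumes those layer names appear in the weight file.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model families, for illustration only.
enum class Family { kGemma2, kGemma3 };

// Returns true if any tensor name contains `needle`.
bool HasTensor(const std::vector<std::string>& names, const std::string& needle) {
  for (const std::string& name : names) {
    if (name.find(needle) != std::string::npos) return true;
  }
  return false;
}

// Gemma 3 adds query/key normalization; Gemma 2 does not. Unlike checking for
// ViT tensors, this also works for the text-only Gemma3-1B.
Family GuessFamily(const std::vector<std::string>& names) {
  const bool has_qk_norm =
      HasTensor(names, "query_norm") && HasTensor(names, "key_norm");
  return has_qk_norm ? Family::kGemma3 : Family::kGemma2;
}
```

For example, a name list containing both `query_norm_scale` and `key_norm_scale` yields `Family::kGemma3`, regardless of whether any vision tensors are present.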
The gemma.cpp Authors
7c1656f2fc
Fix NibbleCodec for AVX3_{ZEN4,DL,SPR}
PiperOrigin-RevId: 831002073
2025-11-11 11:31:25 -08:00
Jan Wassenberg
3e18db17f4
Avoid hard-coding kPatchSize. Thanks @Somet2mes for reporting. Fixes #762.
PiperOrigin-RevId: 829308896
2025-11-07 00:32:31 -08:00
Charles Zhao
f8131339a7
Refactor for continuous batching. This CL does not change the current behavior of the code; it only extracts two functions that will later be called when adding continuous batching.
PiperOrigin-RevId: 829104661
2025-11-06 14:20:17 -08:00
Martin Stolle
35e9f9f05f
Introduce attention implementation configurability.
PiperOrigin-RevId: 828971705
2025-11-06 08:43:41 -08:00
Jan Wassenberg
091b4567c9
Minor: ParallelismStrategy->Parallelism
PiperOrigin-RevId: 828936578
2025-11-06 06:56:10 -08:00
Jan Wassenberg
a344a70c59
Change (old) attention behavior to disallow wraparound, enforced via assertion.
Shared kU64PerLine constant
PiperOrigin-RevId: 828072451
2025-11-04 11:52:40 -08:00
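The "disallow wraparound, enforced via assertion" change can be illustrated by contrasting a modulo index with an asserted direct index. This sketch assumes a ring-buffer-style KV cache indexed by token position; `CacheRowWrapping` and `CacheRowNoWrap` are hypothetical names, not the actual attention code.

```cpp
#include <cassert>
#include <cstddef>

// Wrapping behavior (hypothetical): positions past the cache length map back
// to the start and silently overwrite the oldest entries.
size_t CacheRowWrapping(size_t pos, size_t cache_len) { return pos % cache_len; }

// Non-wrapping behavior (hypothetical): exceeding the cache is a caller bug,
// so assert instead of silently reusing rows.
size_t CacheRowNoWrap(size_t pos, size_t cache_len) {
  assert(pos < cache_len && "KV cache position out of range; wraparound disallowed");
  (void)cache_len;  // Avoids an unused-parameter warning when NDEBUG is defined.
  return pos;
}
```

Because `assert` is compiled out when `NDEBUG` is defined, this kind of check catches misuse in debug builds without adding cost to release builds.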
Charles Zhao
3a63a12624
Allow prefill-only runs by allowing max_prompt_size == seq_len
PiperOrigin-RevId: 827415258
2025-11-03 03:17:54 -08:00
Phil Culliton
ab87807a4c
Pre-compress query activations to BF16 before FlashAttention.
PiperOrigin-RevId: 826524997
2025-10-31 09:49:44 -07:00
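Pre-compressing query activations to BF16 keeps the upper 16 bits of each float32 after rounding. The scalar sketch below shows round-to-nearest-even conversion and its inverse; it is illustrative only (production code would vectorize this), `F32ToBF16Bits`/`BF16BitsToF32` are hypothetical names, and NaN inputs are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Rounds a float32 to bfloat16 (round-to-nearest-even) and returns the raw bits.
// NaN inputs are not handled specially in this sketch.
uint16_t F32ToBF16Bits(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // Adding 0x7FFF plus the lowest kept bit makes exact ties round to even.
  const uint32_t lsb = (bits >> 16) & 1u;
  bits += 0x7FFFu + lsb;
  return static_cast<uint16_t>(bits >> 16);
}

// Expands bfloat16 bits back to float32 by zero-filling the low mantissa bits.
float BF16BitsToF32(uint16_t h) {
  const uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

BF16 keeps float32's 8-bit exponent, so the conversion preserves dynamic range and only discards mantissa precision; for example, `1.0f` maps to `0x3F80`.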
Ray Smith
8a100c1e8d
Added access to flash attention internals to TileFlashAttention4
PiperOrigin-RevId: 826011137
2025-10-30 06:50:05 -07:00
Jan Wassenberg
ee7d79c0a6
Add Decompress2AndCompressInplace helper
PiperOrigin-RevId: 825966142
2025-10-30 04:04:41 -07:00
Jan Wassenberg
006999063c
Fix PaliGemma matmul warning
PiperOrigin-RevId: 825627406
2025-10-29 11:15:50 -07:00
Phil Culliton
ecab0cef3a
Update README with Gemma 3 support and contributor acknowledgments
PiperOrigin-RevId: 825588241
2025-10-29 09:46:51 -07:00
Phil Culliton
036f91f63c
Add Gemma 3 270M to gemma_test
PiperOrigin-RevId: 825582368
2025-10-29 09:31:32 -07:00
Phil Culliton
116cd6eff6
BF16 mixed-mode flash attention
PiperOrigin-RevId: 825433929
2025-10-29 01:48:28 -07:00
Jan Wassenberg
4bd465ffd3
Also update attention.h to type-erased query_norm_scale
PiperOrigin-RevId: 825014334
2025-10-28 06:48:33 -07:00
Jan Wassenberg
3cc0139ebb
Fix excessive KC/MC from prior change
This could lead to stack overflow in B_storage.
Also do not require specific type for query_norm_scale,
update batch sizes for attention tensors,
more verbose Mat shape/type checks.
PiperOrigin-RevId: 824987689
2025-10-28 05:33:01 -07:00
Biruk Mammo
5a05857deb
[Gemma.cpp] Allows non-owned arguments for attention methods.
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtr`s. Acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.
PiperOrigin-RevId: 824584177
2025-10-27 10:43:25 -07:00
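The owning-versus-view split in the non-owned attention arguments commit above can be illustrated with simplified types. `Matrix`, `MatrixView`, `Activations` and `ActivationPtrs` are hypothetical stand-ins for `MatPtrT`, `AttentionActivations` and `AttentionActivationPtrs`; the real types carry more metadata (element type, stride, name).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Owning storage: a row-major matrix backed by a std::vector.
struct Matrix {
  size_t rows = 0, cols = 0;
  std::vector<float> data;
  Matrix(size_t r, size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
};

// Non-owning view: default-constructible and cheap to copy; must not outlive
// the storage it points to.
struct MatrixView {
  float* data = nullptr;
  size_t rows = 0, cols = 0;
  MatrixView() = default;
  explicit MatrixView(Matrix& m) : data(m.data.data()), rows(m.rows), cols(m.cols) {}
  float* Row(size_t r) const { return data + r * cols; }
};

// Owning activations (analogous to AttentionActivations).
struct Activations {
  Matrix q, att_out;
  Activations(size_t batch, size_t dim) : q(batch, dim), att_out(batch, dim) {}
};

// View over activations (analogous to AttentionActivationPtrs). It can wrap an
// Activations object or buffers owned by someone else.
struct ActivationPtrs {
  MatrixView q, att_out;
  explicit ActivationPtrs(Activations& a) : q(a.q), att_out(a.att_out) {}
  ActivationPtrs(MatrixView q_view, MatrixView out_view) : q(q_view), att_out(out_view) {}
};

// Toy stand-in for an attention step that only sees views: copies q to att_out.
void CopyQToOutput(const ActivationPtrs& p) {
  for (size_t r = 0; r < p.q.rows; ++r)
    for (size_t c = 0; c < p.q.cols; ++c) p.att_out.Row(r)[c] = p.q.Row(r)[c];
}
```

Because the attention function takes only views, callers can pass either their own `Activations` or externally owned buffers without copying.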
Jan Wassenberg
86200ce224
1.01x speedup: improved autotune
Group M=4..7 into same config. Add configs for power of two sizes.
Allow odd mc to enable a single range for odd M.
io.cc: warning fix (cast).
IsBlock -> !IsOneMC
benchmark_helper: best for verbosity 3, all configs for 4
ops_test: remove unused includes
PiperOrigin-RevId: 824475104
2025-10-27 05:35:31 -07:00
Jan Wassenberg
8198e7104a
Batch bench: 4 runs to give autotuning more time
Also print auto-tune info for verbosity 3.
PiperOrigin-RevId: 823555008
2025-10-24 09:14:39 -07:00
Theotime Combes
1bdde1af3c
Add config flag for global timescale & rely on config to deduce wrapping
PiperOrigin-RevId: 823512377
2025-10-24 06:54:56 -07:00
Jan Wassenberg
a48e614f64
1.02x speedup: improve load balance and simplify parallelFor
Remove ParallelizeOne/TwoRange, use ParallelForAcross/WithinCluster instead.
PiperOrigin-RevId: 823388890
2025-10-24 00:19:09 -07:00
Nitin Gangahar
085a34965a
Update README since backprop and the Adam optimizer have been deleted.
PiperOrigin-RevId: 823388833
2025-10-24 00:18:05 -07:00
Jan Wassenberg
3ed403e287
Major cleanup of profiler zones, add Caller annotation for all pool.Run
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor
PiperOrigin-RevId: 822934530
2025-10-23 01:54:24 -07:00
Nitin Gangahar
9e8ac7e2f0
Use correct offsets in BlobWriter.
Updates the FileSize() calls in BlobWriter to use a computed offset instead.
FileSize() may not work with all File implementations, which can cause issues
while writing.
PiperOrigin-RevId: 822646338
2025-10-22 10:29:04 -07:00
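The BlobWriter offset fix above can be sketched as a writer that tracks its own position instead of asking the file for its size. `File`, `MemFile` and `OffsetTrackingWriter` are hypothetical, not gemma.cpp's io.h API; the point is only that the next write offset is the running sum of bytes already written, which does not depend on whether a given `File` implementation reports its size correctly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical minimal file interface: positional writes only.
class File {
 public:
  virtual ~File() = default;
  virtual bool Write(const void* data, uint64_t size, uint64_t offset) = 0;
};

// In-memory File used for illustration.
class MemFile : public File {
 public:
  bool Write(const void* data, uint64_t size, uint64_t offset) override {
    if (bytes.size() < offset + size) bytes.resize(offset + size);
    std::memcpy(bytes.data() + offset, data, size);
    return true;
  }
  std::vector<uint8_t> bytes;
};

// Appends blobs at a computed offset instead of querying the file size.
class OffsetTrackingWriter {
 public:
  explicit OffsetTrackingWriter(File& file) : file_(file) {}

  // Writes the blob at the current offset and returns that offset.
  uint64_t Append(const void* data, uint64_t size) {
    const uint64_t offset = next_offset_;
    const bool ok = file_.Write(data, size, offset);
    assert(ok);
    (void)ok;  // Avoids an unused-variable warning when NDEBUG is defined.
    next_offset_ += size;
    return offset;
  }

 private:
  File& file_;
  uint64_t next_offset_ = 0;
};
```

Keeping the offset in the writer also makes the blob layout deterministic, since it no longer depends on any state the file system reports between writes.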
Copybara-Service
64a82ed645
Merge pull request #735 from Hitesh-ed:gemma.cpp-windows-build-fix
PiperOrigin-RevId: 822559272
2025-10-22 06:26:29 -07:00
Hitesh K V
027288b5e4
Merge branch 'dev' into gemma.cpp-windows-build-fix
2025-10-22 16:53:48 +05:30
Jan Wassenberg
acede9d682
Warning fix (unused var), Windows build fix (missing member variable)
PiperOrigin-RevId: 822172982
2025-10-21 10:17:34 -07:00
Hitesh K V
c55120fc6d
Merge branch 'dev' into gemma.cpp-windows-build-fix
2025-10-16 20:18:09 +05:30
Jan Wassenberg
f59eb2ed72
Remove multi-package support from topology
Also no longer assume equal-sized clusters
PiperOrigin-RevId: 820164125
2025-10-16 04:00:35 -07:00
Hitesh K V
cc1d256cff
Update CMakePresets.json
Adds the following cache variable to CMakePresets.json to enforce modern policies automatically.
This ensures all developers can run `cmake --preset windows` without hitting legacy compatibility or deprecation issues.
2025-10-16 12:08:29 +05:30
Jan Wassenberg
9b6ed1a58f
gemma_batch_bench: generate more unique prompts
PiperOrigin-RevId: 819944137
2025-10-15 15:46:05 -07:00
Phil Culliton
503aaddd65
Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
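The I8Stream format itself is not described in this commit. As a hedged illustration of the general technique, the sketch below performs symmetric per-group int8 quantization, storing one float scale per group of `kGroupSize` values. The group size, the symmetric scheme and the names `QuantizedI8`, `QuantizeI8` and `DequantizeI8` are assumptions and may not match I8Stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kGroupSize = 4;  // Hypothetical; real formats use larger groups.

struct QuantizedI8 {
  std::vector<int8_t> values;
  std::vector<float> scales;  // One scale per group.
};

// Maps each group's max |x| to 127 and rounds every value to the nearest integer.
QuantizedI8 QuantizeI8(const std::vector<float>& in) {
  QuantizedI8 out;
  out.values.resize(in.size());
  for (size_t start = 0; start < in.size(); start += kGroupSize) {
    const size_t end = std::min(in.size(), start + kGroupSize);
    float max_abs = 0.0f;
    for (size_t i = start; i < end; ++i) max_abs = std::max(max_abs, std::fabs(in[i]));
    const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    out.scales.push_back(scale);
    for (size_t i = start; i < end; ++i) {
      const long q = std::lround(in[i] / scale);
      out.values[i] = static_cast<int8_t>(std::clamp(q, -127L, 127L));
    }
  }
  return out;
}

// Reconstructs an approximation of the i-th input value.
float DequantizeI8(const QuantizedI8& q, size_t i) {
  return q.values[i] * q.scales[i / kGroupSize];
}
```

Per-group scales limit the damage an outlier can do: a large value only coarsens the quantization step of its own group rather than the whole tensor.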
Ray Smith
ee18916abf
Removed the PROFILER_ZONE from the most highly called functions to reduce the overhead.
PiperOrigin-RevId: 819739402
2025-10-15 07:10:04 -07:00
Ray Smith
e3e8511e79
Initialization of profiler zones.
PiperOrigin-RevId: 819662587
2025-10-15 03:05:58 -07:00
Ray Smith
fb6fa793f4
Added a global (to gemma) zones list so that most call sites can use PROFILER_ZONE3, avoiding the synchronization required for static const initialization of the zone handle.
Improved flash_attention to enable profiling using the new zones.
PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
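The trade-off behind the global zones list above: a function-local `static const` handle must be initialized thread-safely, so each call checks an initialization guard (and the first call may block other threads). Registering all zones once up front and looking them up by enum index avoids that on the hot path. This is a hypothetical illustration; `ZoneNames`, `RegisterZone`, `ZoneId` and `Zones` are not the gemma.cpp or Highway profiler API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical registry standing in for a profiler's zone table. It uses a
// local static itself, but is only called during registration.
std::vector<std::string>& ZoneNames() {
  static std::vector<std::string> names;
  return names;
}
size_t RegisterZone(const char* name) {
  ZoneNames().push_back(name);
  return ZoneNames().size() - 1;
}

// Per-call-site pattern: every call passes the thread-safe static init guard.
size_t HandleViaLocalStatic() {
  static const size_t handle = RegisterZone("MatMul");
  return handle;
}

// Global-list pattern: all zones are registered once (e.g. at startup) and
// then looked up by index with no initialization guard.
enum class ZoneId : size_t { kAttention, kRMSNorm, kCount };

struct Zones {
  std::array<size_t, static_cast<size_t>(ZoneId::kCount)> handles;
  Zones() {
    handles[static_cast<size_t>(ZoneId::kAttention)] = RegisterZone("Attention");
    handles[static_cast<size_t>(ZoneId::kRMSNorm)] = RegisterZone("RMSNorm");
  }
  size_t Get(ZoneId id) const { return handles[static_cast<size_t>(id)]; }
};
```

A single `Zones` object constructed before worker threads start can then be shared read-only by all call sites.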
Jan Wassenberg
035273c184
tune pool kSpin mode in threading_context
Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes.
threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order
Also enable SPR target (Zen4 is AMD-only),
update Highway version for renamed Thread()->GlobalIdx().
PiperOrigin-RevId: 816223017
2025-10-07 08:36:26 -07:00
Nitin Gangahar
9dc802c7aa
Add logging to io.cc on failed write and read.
...
This should provide insights into any failures.
PiperOrigin-RevId: 815784482
2025-10-06 10:25:41 -07:00
Ray Smith
684a0444e9
Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines
PiperOrigin-RevId: 814241032
2025-10-02 08:15:16 -07:00
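Having each thread "read and write within its own cache lines", as in the TransposeQ change above, amounts to splitting work on cache-line boundaries so that no two threads write the same line (false sharing), even if that means using fewer threads. A hypothetical helper that splits a float array into per-thread ranges aligned to 64-byte lines could look like this; the 64-byte line size and the `CacheAlignedRanges` name are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr size_t kLineBytes = 64;  // Assumed cache line size.
constexpr size_t kFloatsPerLine = kLineBytes / sizeof(float);

// Splits [0, num_floats) into at most `max_threads` half-open ranges whose
// boundaries are multiples of a cache line, so no line is written by two
// threads (assuming the array itself starts on a line boundary).
std::vector<std::pair<size_t, size_t>> CacheAlignedRanges(size_t num_floats,
                                                          size_t max_threads) {
  std::vector<std::pair<size_t, size_t>> ranges;
  if (num_floats == 0 || max_threads == 0) return ranges;
  const size_t num_lines = (num_floats + kFloatsPerLine - 1) / kFloatsPerLine;
  // Use fewer threads when there are fewer lines than threads.
  const size_t threads = std::min(max_threads, num_lines);
  const size_t lines_per_thread = (num_lines + threads - 1) / threads;
  for (size_t line = 0; line < num_lines; line += lines_per_thread) {
    const size_t begin = line * kFloatsPerLine;
    const size_t end = std::min(num_floats, (line + lines_per_thread) * kFloatsPerLine);
    ranges.emplace_back(begin, end);
  }
  return ranges;
}
```

For example, 100 floats with 8 requested threads yield only 7 ranges, one per cache line, with the last range ending at 100.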
Ray Smith
14244664c8
Avoid transposing Q when it isn't needed
PiperOrigin-RevId: 814187984
2025-10-02 05:16:35 -07:00
Jan Wassenberg
fe5a39990e
Improve FlashAttention threading:
kFlat for RMSNorm (hierarchical is excessive),
profiler zone naming improvements.
PiperOrigin-RevId: 814144012
2025-10-02 02:37:05 -07:00
Ray Smith
6098a022b3
Increased parallelism for RMSNormAndPositionalEncoding
PiperOrigin-RevId: 813738994
2025-10-01 07:11:14 -07:00
Ray Smith
2f6cbde8ff
Added a smaller tile size to flash attention for smaller batch sizes
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Ray Smith
4974f24832
Fixed bug with softcap in single flash attention
PiperOrigin-RevId: 813164938
2025-09-30 02:17:58 -07:00
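For reference, Gemma-style soft-capping squashes attention logits into the open interval (-cap, cap) via cap * tanh(x / cap), applied before the softmax. The softcap commit above does not say what the bug was; this scalar sketch only shows the intended transform, and `SoftCap` and `SoftmaxWithSoftCap` are hypothetical names rather than the flash attention code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Soft-capping: smoothly limits x to the open interval (-cap, cap).
float SoftCap(float x, float cap) { return cap * std::tanh(x / cap); }

// Applies soft-capping (if cap > 0) to the logits, then a numerically stable
// softmax. Requires at least one logit.
std::vector<float> SoftmaxWithSoftCap(std::vector<float> logits, float cap) {
  assert(!logits.empty());
  if (cap > 0.0f) {
    for (float& x : logits) x = SoftCap(x, cap);
  }
  const float max_logit = *std::max_element(logits.begin(), logits.end());
  float sum = 0.0f;
  for (float& x : logits) {
    x = std::exp(x - max_logit);
    sum += x;
  }
  for (float& x : logits) x /= sum;
  return logits;
}
```

With a cap of 50, a logit of 1000 is clamped to about 50, so a competing logit of 0 still receives a small nonzero probability instead of underflowing to zero.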
Nitin Gangahar
16536996d1
Remove less useful spammy log lines.
PiperOrigin-RevId: 812694572
2025-09-29 02:28:41 -07:00
Nitin Gangahar
667a3f117a
Utilize multiple cores to read weight batches.
PiperOrigin-RevId: 811893059
2025-09-26 11:28:33 -07:00
Ray Smith
d15731d201
Used hn::BroadcastLane instead of Set(..., x.raw)
PiperOrigin-RevId: 811386295
2025-09-25 09:42:03 -07:00