Ray Smith
d2806fb1dd
Fixed MSan error by fixing the padding of k_cache and v_cache
PiperOrigin-RevId: 879644219
2026-03-06 08:11:17 -08:00
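MemorySanitizer (MSan) flags uses of uninitialized memory, and SIMD kernels that load whole vectors can pull the uninitialized padding at the end of a row into a computation. The commit does not describe its exact fix. The sketch below assumes a row-major cache whose row stride is rounded up to a multiple of a hypothetical vector width `kLanes`, with the padding zero-initialized so full-vector loads only touch initialized memory. `PaddedCache`, `kLanes` and `SumRowVectorized` are illustrative names, not gemma.cpp APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector width in floats (e.g. 16 for AVX-512); an assumption,
// not a gemma.cpp constant.
constexpr size_t kLanes = 16;

// Rounds n up to the next multiple of kLanes.
constexpr size_t RoundUpToLanes(size_t n) {
  return (n + kLanes - 1) / kLanes * kLanes;
}

// Row-major cache whose rows are padded to a multiple of kLanes, so a
// full-vector load at any lane-aligned offset stays inside the row.
// std::vector fills the whole buffer, padding included, with zeros.
class PaddedCache {
 public:
  PaddedCache(size_t rows, size_t cols)
      : rows_(rows), cols_(cols), stride_(RoundUpToLanes(cols)),
        data_(rows * stride_, 0.0f) {}

  float* Row(size_t r) {
    assert(r < rows_);
    return data_.data() + r * stride_;
  }
  size_t Cols() const { return cols_; }
  size_t Stride() const { return stride_; }

 private:
  size_t rows_, cols_, stride_;  // Declared before data_: init order matters.
  std::vector<float> data_;
};

// Sums a row in kLanes-wide chunks, deliberately reading the padding as a
// vector kernel would. The padding is zero, so the result is unchanged.
float SumRowVectorized(PaddedCache& cache, size_t r) {
  const float* row = cache.Row(r);
  float sum = 0.0f;
  for (size_t i = 0; i < cache.Stride(); i += kLanes) {
    for (size_t j = 0; j < kLanes; ++j) sum += row[i + j];
  }
  return sum;
}
```

Zero-filling the padding also keeps the extra lanes numerically harmless for sums and dot products.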
Krzysztof Rymski
539d9bb8e7
Changed to use a faster exponential function
PiperOrigin-RevId: 877981568
2026-03-03 09:16:04 -08:00
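The log does not say which exponential the commit switched to. One common way to trade a little accuracy for speed, shown here as a scalar sketch, is range reduction: write exp(x) = 2^n * exp(r) with n = round(x / ln 2) and |r| <= ln(2)/2, approximate exp(r) with a short polynomial, and apply 2^n by adjusting the exponent. `FastExp` is a hypothetical name and the degree-6 Taylor polynomial is an assumption; production code would typically use minimax coefficients and SIMD.

```cpp
#include <cassert>
#include <cmath>

// Range-reduced exponential: exp(x) = 2^n * exp(r), n = round(x / ln 2),
// |r| <= ln(2) / 2. exp(r) uses a degree-6 Taylor polynomial (Horner form).
// Illustrative only; not the function the commit adopted.
float FastExp(float x) {
  // Keep 2^n and the int conversion of n within float/int range.
  assert(x >= -87.0f && x <= 88.0f);
  constexpr float kLn2 = 0.69314718f;
  const float n = std::nearbyint(x / kLn2);
  const float r = x - n * kLn2;
  // 1 + r + r^2/2! + r^3/3! + r^4/4! + r^5/5! + r^6/6!
  float p = 1.0f / 720.0f;
  p = p * r + 1.0f / 120.0f;
  p = p * r + 1.0f / 24.0f;
  p = p * r + 1.0f / 6.0f;
  p = p * r + 0.5f;
  p = p * r + 1.0f;
  p = p * r + 1.0f;
  // Multiply by 2^n by adjusting the binary exponent.
  return std::ldexp(p, static_cast<int>(n));
}
```

Because |r| is at most about 0.35, the truncation error of the polynomial stays near float precision, so most of the remaining error comes from the reduction step.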
Ray Smith
49cb438b1e
Rollback of erroneous rollback.
PiperOrigin-RevId: 877376165
2026-03-02 06:50:26 -08:00
The gemma.cpp Authors
a3d994915f
No public description
PiperOrigin-RevId: 877333188
2026-03-02 04:32:29 -08:00
Ray Smith
16c1b29b89
Rewrote flash attention to use BF16 and transposed k and v, rewrote the task distribution, increased parallelism on decode, and doubled the number of registers used in the core of flash attention.
PiperOrigin-RevId: 877308306
2026-03-02 03:11:01 -08:00
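BF16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits of an IEEE-754 float32, so converting is a matter of rounding away the low 16 bits. The sketch below assumes one plausible reading of "transpose k": keys stored as `kt[d * num_pos + p]`, so that for each query dimension the scores for consecutive positions are updated from contiguous memory, which maps naturally onto SIMD lanes. The function names and layout are hypothetical; the commit does not document gemma.cpp's actual layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// BF16 stored as the upper 16 bits of an IEEE-754 float32.
using BF16 = uint16_t;

// float32 -> bf16 with round-to-nearest-even. NaN/overflow handling omitted.
BF16 FloatToBF16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  const uint32_t rounding_bias = 0x7FFF + ((bits >> 16) & 1);
  return static_cast<BF16>((bits + rounding_bias) >> 16);
}

// bf16 -> float32 is exact: append 16 zero bits.
float BF16ToFloat(BF16 b) {
  const uint32_t bits = static_cast<uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Computes q . k_p for every position p, with K transposed as
// kt[d * num_pos + p]. The inner loop runs over contiguous positions, so a
// SIMD version could broadcast q[d] and load many positions at once.
// Accumulation is in float32.
std::vector<float> ScoresTransposedK(const std::vector<float>& q,
                                     const std::vector<BF16>& kt,
                                     size_t num_pos) {
  assert(kt.size() == q.size() * num_pos);
  std::vector<float> scores(num_pos, 0.0f);
  for (size_t d = 0; d < q.size(); ++d) {
    const BF16* row = kt.data() + d * num_pos;
    for (size_t p = 0; p < num_pos; ++p) {
      scores[p] += q[d] * BF16ToFloat(row[p]);
    }
  }
  return scores;
}
```

Storing K in BF16 halves its memory traffic relative to float32, while accumulating in float32 limits the loss of precision in the scores.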
Krzysztof Rymski
df162ead7c
Implementation of tiled attention with BF16 and circular buffers, which reduces memory requirements by 4x for longer contexts on Gemma models.
It also improves parallelism for small batch sizes and small models.
It can also use the VDPBF16PS instruction for a 2x improvement on AVX-512.
PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00
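The commit does not spell out how the circular buffers are used. A common pattern, assumed here, is a fixed-capacity KV ring buffer: position `p` lives in slot `p % capacity`, so a layer that only attends to the most recent `capacity` positions needs memory proportional to that window rather than to the full context. `RingKVCache` is a hypothetical name, and only keys are stored to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity ring buffer of keys: absolute position p is stored in slot
// p % capacity, so only the most recent `capacity` positions are retained.
class RingKVCache {
 public:
  RingKVCache(size_t capacity, size_t dim)
      : capacity_(capacity), dim_(dim), keys_(capacity * dim) {
    assert(capacity > 0);
  }

  // Stores the key for the next position, overwriting the oldest when full.
  void Append(const float* key) {
    float* slot = keys_.data() + (next_pos_ % capacity_) * dim_;
    for (size_t i = 0; i < dim_; ++i) slot[i] = key[i];
    ++next_pos_;
  }

  // Oldest position still held, and one past the newest.
  size_t FirstPos() const {
    return next_pos_ > capacity_ ? next_pos_ - capacity_ : 0;
  }
  size_t EndPos() const { return next_pos_; }

  // Key for an absolute position in [FirstPos(), EndPos()).
  const float* Key(size_t pos) const {
    assert(pos >= FirstPos() && pos < EndPos());
    return keys_.data() + (pos % capacity_) * dim_;
  }

 private:
  size_t capacity_, dim_;
  size_t next_pos_ = 0;
  std::vector<float> keys_;
};
```

Callers keep using absolute positions; only the slot computation knows about the wraparound.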
Krzysztof Rymski
7dc98902d3
Internal changes
PiperOrigin-RevId: 872280443
2026-02-19 01:57:58 -08:00
The gemma.cpp Authors
34739fd9f0
Internal changes
PiperOrigin-RevId: 871792281
2026-02-18 04:07:36 -08:00
Krzysztof Rymski
c6696342fa
Internal changes
PiperOrigin-RevId: 871776998
2026-02-18 03:21:41 -08:00
Ray Smith
76d7951242
Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data.
Replaced the flag with attention_impl to control which attention implementation runs.
PiperOrigin-RevId: 869694868
2026-02-13 06:05:30 -08:00
Balazs Racz
baa69dfb78
Passes the entire runtime_config into the activations constructor.
PiperOrigin-RevId: 845153671
2025-12-16 01:56:52 -08:00
Krzysztof Rymski
44dfd69b9b
Internal changes
PiperOrigin-RevId: 844759322
2025-12-15 07:14:37 -08:00
Martin Stolle
5a500872b8
Internal change
PiperOrigin-RevId: 835115693
2025-11-21 01:17:45 -08:00
Martin Stolle
35e9f9f05f
Introduce attention implementation configurability.
PiperOrigin-RevId: 828971705
2025-11-06 08:43:41 -08:00
Biruk Mammo
5a05857deb
[Gemma.cpp] Allows non-owned arguments for attention methods.
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtr`s and acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.
PiperOrigin-RevId: 824584177
2025-10-27 10:43:25 -07:00
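The idea behind `AttentionActivationPtrs` and the non-owning `MatPtr`s can be sketched with a pointer-plus-shape view: the caller owns the storage, and functions accept a cheap, copyable view that can also be default-constructed empty. `OwnedMat`, `MatView` and `Scale` are hypothetical stand-ins and do not reproduce gemma.cpp's `MatPtrT` interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical owning matrix: it allocates and frees its storage.
struct OwnedMat {
  size_t rows, cols;
  std::vector<float> data;
  OwnedMat(size_t r, size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
};

// Hypothetical non-owning view: pointer plus shape. Cheap to copy,
// default-constructible as an empty view, and it does not extend the
// owner's lifetime (the owner must outlive the view).
struct MatView {
  float* ptr = nullptr;
  size_t rows = 0, cols = 0;

  MatView() = default;
  explicit MatView(OwnedMat& m)
      : ptr(m.data.data()), rows(m.rows), cols(m.cols) {}

  float* Row(size_t r) const {
    assert(ptr != nullptr && r < rows);
    return ptr + r * cols;
  }
};

// Functions taking a view work on any storage the caller owns.
void Scale(MatView m, float s) {
  for (size_t r = 0; r < m.rows; ++r) {
    for (size_t c = 0; c < m.cols; ++c) m.Row(r)[c] *= s;
  }
}
```

Passing views by value keeps call sites simple while leaving allocation decisions to whoever owns the activations or caches.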
Ray Smith
fb6fa793f4
Added a global (to gemma) zones list so that most call sites of PROFILER_ZONE3 avoid the synchronization required for the static const initialization of the zone handle.
Improved flash_attention to enable profiling using the new zones.
PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
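In C++, a function-local `static const` with a dynamic initializer is initialized thread-safely on first use; typical compilers emit a guard check on every call and synchronize on the first one. The sketch below shows the general alternative this commit describes: register every zone once, store the handles in a table indexed by an enum, and let call sites read from it without a guard. `Profiler`, `Zone`, `Zones` and `AddZone` are hypothetical and are not the gemma.cpp or Highway profiler API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using ZoneHandle = size_t;

// Hypothetical zone identifiers; kNumZones sizes the table.
enum class Zone : size_t { kFlashAttention, kMatMul, kNumZones };

// Hypothetical stand-in for a profiler that hands out zone handles.
class Profiler {
 public:
  ZoneHandle AddZone(const std::string& name) {
    names_.push_back(name);
    return names_.size() - 1;
  }
  const std::string& Name(ZoneHandle h) const { return names_[h]; }

 private:
  std::vector<std::string> names_;
};

// Instead of `static const ZoneHandle h = profiler.AddZone(...)` at each call
// site (a guard check per call plus one-time synchronization), build every
// handle once up front and give call sites the table.
class Zones {
 public:
  explicit Zones(Profiler& p) {
    handles_[Idx(Zone::kFlashAttention)] = p.AddZone("FlashAttention");
    handles_[Idx(Zone::kMatMul)] = p.AddZone("MatMul");
  }

  // Plain array read: no guard variable, no synchronization.
  ZoneHandle Get(Zone z) const {
    assert(z != Zone::kNumZones);
    return handles_[Idx(z)];
  }

 private:
  static constexpr size_t Idx(Zone z) { return static_cast<size_t>(z); }
  std::array<ZoneHandle, static_cast<size_t>(Zone::kNumZones)> handles_{};
};
```

The cost is that every zone must be listed in the enum and registered in one place, which is why a per-call-site static remains convenient for rarely used zones.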
Ray Smith
2f6cbde8ff
Added a smaller tile size to flash attention for smaller batch sizes
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Jan Wassenberg
501fdf000e
Remove no longer used MatVec
PiperOrigin-RevId: 809059409
2025-09-19 09:03:22 -07:00
Ray Smith
f10ac41a20
Added flash attention, with both a single-q function and a register-tiled function.
The register-tiled version achieves a speedup of about 9.7x over the previous attention function on an AVX3-enabled machine.
PiperOrigin-RevId: 804913784
2025-09-09 08:05:26 -07:00
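Flash attention avoids materializing the full score vector by computing the softmax online: it keeps a running maximum `m`, a running denominator `l`, and an unnormalized output, and rescales both by exp(m_old - m_new) whenever the maximum grows. The scalar single-query sketch below illustrates that recurrence, assuming row-major `[num_pos x dim]` K and V and a 1/sqrt(dim) score scale. It omits the register tiling and SIMD behind the reported speedup, and `FlashAttentionSingleQ` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Single-query attention with an online softmax. K and V are row-major
// [num_pos x dim]. Each key/value row is visited exactly once.
std::vector<float> FlashAttentionSingleQ(const std::vector<float>& q,
                                         const std::vector<float>& k,
                                         const std::vector<float>& v,
                                         size_t num_pos) {
  const size_t dim = q.size();
  assert(num_pos > 0);
  assert(k.size() == num_pos * dim && v.size() == num_pos * dim);
  const float scale = 1.0f / std::sqrt(static_cast<float>(dim));

  float m = -std::numeric_limits<float>::infinity();  // Running max score.
  float l = 0.0f;                                      // Running denominator.
  std::vector<float> acc(dim, 0.0f);                   // Unnormalized output.

  for (size_t p = 0; p < num_pos; ++p) {
    float s = 0.0f;
    for (size_t d = 0; d < dim; ++d) s += q[d] * k[p * dim + d];
    s *= scale;

    const float m_new = std::max(m, s);
    // Rescale earlier contributions to the new max (exp(-inf) == 0 at p == 0).
    const float correction = std::exp(m - m_new);
    const float w = std::exp(s - m_new);
    l = l * correction + w;
    for (size_t d = 0; d < dim; ++d) {
      acc[d] = acc[d] * correction + w * v[p * dim + d];
    }
    m = m_new;
  }
  for (float& a : acc) a /= l;  // Normalize once at the end.
  return acc;
}
```

The result matches ordinary softmax attention up to rounding; a register-tiled variant applies the same recurrence to several queries and positions at once to keep intermediate values in registers.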