Commit Graph

40 Commits

Author SHA1 Message Date
Krzysztof Rymski 0bce91c4f4 Change to use faster exponent function
PiperOrigin-RevId: 875649021
2026-02-26 07:04:02 -08:00
Krzysztof Rymski df162ead7c Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models.
It also supports better parallelism for small batch sizes / small models.
It also is able to utilize VDPBF16PS for nice 2x improvement on avx512

PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00
Krzysztof Rymski 7dc98902d3 Internal changes
PiperOrigin-RevId: 872280443
2026-02-19 01:57:58 -08:00
The gemma.cpp Authors 34739fd9f0 Internal changes
PiperOrigin-RevId: 871792281
2026-02-18 04:07:36 -08:00
Krzysztof Rymski c6696342fa Internal changes
PiperOrigin-RevId: 871776998
2026-02-18 03:21:41 -08:00
Ray Smith 76d7951242 Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data.
Replaced flag with attention_impl to control which attention to run.

PiperOrigin-RevId: 869694868
2026-02-13 06:05:30 -08:00
Krzysztof Rymski 7e5310b908 Internal changes
PiperOrigin-RevId: 867617121
2026-02-09 08:29:15 -08:00
Krzysztof Rymski 60eed010ba Internal changes
PiperOrigin-RevId: 862680527
2026-01-29 04:48:29 -08:00
Jan Wassenberg 6d43d6ee19 Build fix for Arm SVE (invalid template arg on op)
PiperOrigin-RevId: 854110884
2026-01-09 02:56:03 -08:00
Krzysztof Rymski 2ee1fac74c Internal changes
PiperOrigin-RevId: 853138600
2026-01-07 01:21:37 -08:00
Krzysztof Rymski 08a0760271 Internal changes
PiperOrigin-RevId: 846663686
2025-12-19 03:43:15 -08:00
Balazs Racz 0ac55f71ed Avoid using Row() for unaligned storage.
PiperOrigin-RevId: 846214605
2025-12-18 05:10:57 -08:00
Krzysztof Rymski 6661d3a60c Internal changes
PiperOrigin-RevId: 846140314
2025-12-18 01:26:43 -08:00
Phil Culliton b8a409dbba Use hn::Sub for vector subtraction in flash attention.
PiperOrigin-RevId: 845883321
2025-12-17 12:57:34 -08:00
Balazs Racz 596bdfe5af Separate monolithic gemma_lib library into more specific cc_library targets.
Creates new cc_library targets for :attention, :tensor_stats and :activations. Eliminates cyclic dependencies between these libraries.

PiperOrigin-RevId: 845690136
2025-12-17 03:31:16 -08:00
Krzysztof Rymski 44dfd69b9b Internal changes
PiperOrigin-RevId: 844759322
2025-12-15 07:14:37 -08:00
Martin Stolle 9689fc82f9 internal change
PiperOrigin-RevId: 842205671
2025-12-09 06:17:08 -08:00
Krzysztof Rymski 64d700cab5 Internal changes
PiperOrigin-RevId: 842194766
2025-12-09 05:42:03 -08:00
Krzysztof Rymski 6e5e4123f1 Internal changes
PiperOrigin-RevId: 837775282
2025-11-28 02:37:06 -08:00
Jan Wassenberg 37a25c9ffe Fix warning (signed vs unsigned)
PiperOrigin-RevId: 836106478
2025-11-24 00:51:17 -08:00
Martin Stolle 49d420aeaf Add some comments.
PiperOrigin-RevId: 834173319
2025-11-19 01:09:15 -08:00
Jan Wassenberg 091b4567c9 Minor: ParallelismStrategy->Parallelism
PiperOrigin-RevId: 828936578
2025-11-06 06:56:10 -08:00
Jan Wassenberg a344a70c59 Change (old) attention behavior to disallow wraparound, enforced via assertion.
Shared kU64PerLine constant

PiperOrigin-RevId: 828072451
2025-11-04 11:52:40 -08:00
Phil Culliton ab87807a4c Pre-compress query activations to BF16 before FlashAttention.
PiperOrigin-RevId: 826524997
2025-10-31 09:49:44 -07:00
Ray Smith 8a100c1e8d Added access to flash attention internals to TileFlashAttention4
PiperOrigin-RevId: 826011137
2025-10-30 06:50:05 -07:00
Phil Culliton 116cd6eff6 BF16 mixed-mode flash attention
PiperOrigin-RevId: 825433929
2025-10-29 01:48:28 -07:00
Jan Wassenberg 3cc0139ebb Fix excessive KC/MC from prior change
This could lead to stack overflow in B_storage.

Also do not require specific type for query_norm_scale,
update batch sizes for attention tensors,
more verbose Mat shape/type checks.

PiperOrigin-RevId: 824987689
2025-10-28 05:33:01 -07:00
Biruk Mammo 5a05857deb [Gemma.cpp] Allows non-owned arguments for attention methods.
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold  non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.

PiperOrigin-RevId: 824584177
2025-10-27 10:43:25 -07:00
Jan Wassenberg 3ed403e287 Major cleanup of profiler zones, add Caller annotation for all pool.Run
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor

PiperOrigin-RevId: 822934530
2025-10-23 01:54:24 -07:00
Phil Culliton 503aaddd65 Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
Ray Smith ee18916abf Removed the PROFILER_ZONE from the most highly called functions to reduce the overhead.
PiperOrigin-RevId: 819739402
2025-10-15 07:10:04 -07:00
Ray Smith fb6fa793f4 Added a global (to gemma) zones list to enable most call sites to PROFILER_ZONE3 to avoid the sychronization required for the static const initialization of the zone handle.
Improved flash_attention to enable profiling using the new zones.

PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
Ray Smith 684a0444e9 Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines
PiperOrigin-RevId: 814241032
2025-10-02 08:15:16 -07:00
Ray Smith 14244664c8 Avoid transposing Q when it isn't needed
PiperOrigin-RevId: 814187984
2025-10-02 05:16:35 -07:00
Jan Wassenberg fe5a39990e Improve FlashAttention threading:
kFlat for RMSNorm (hierarchical is excessive),
profiler zone naming improvements.

PiperOrigin-RevId: 814144012
2025-10-02 02:37:05 -07:00
Ray Smith 6098a022b3 Increased parallelism for RMSNormAndPositionalEncoding
PiperOrigin-RevId: 813738994
2025-10-01 07:11:14 -07:00
Ray Smith 2f6cbde8ff Added a smaller tile size to flash attention for smaller batch sizes
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Ray Smith 4974f24832 Fixed bug with softcap in single flash attention
PiperOrigin-RevId: 813164938
2025-09-30 02:17:58 -07:00
Ray Smith c9b8479f7d Added zero-initialization to att_out.
Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available.

PiperOrigin-RevId: 806284756
2025-09-12 07:48:23 -07:00
Ray Smith f10ac41a20 Added flash attention, with both a single-q function, and a register-tiled function.
The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine.

PiperOrigin-RevId: 804913784
2025-09-09 08:05:26 -07:00