26 Commits

Author SHA1 Message Date
Balazs Racz 85e2e8ae7f Separate monolithic gemma_lib library into more specific cc_library targets.
Creates new cc_library targets for :attention, :tensor_stats and :activations. Eliminates cyclic dependencies between these libraries.

PiperOrigin-RevId: 845238905
2025-12-16 07:14:09 -08:00
Krzysztof Rymski 44dfd69b9b Internal changes
PiperOrigin-RevId: 844759322
2025-12-15 07:14:37 -08:00
Martin Stolle 9689fc82f9 internal change
PiperOrigin-RevId: 842205671
2025-12-09 06:17:08 -08:00
Krzysztof Rymski 64d700cab5 Internal changes
PiperOrigin-RevId: 842194766
2025-12-09 05:42:03 -08:00
Krzysztof Rymski 6e5e4123f1 Internal changes
PiperOrigin-RevId: 837775282
2025-11-28 02:37:06 -08:00
Jan Wassenberg 37a25c9ffe Fix warning (signed vs unsigned)
PiperOrigin-RevId: 836106478
2025-11-24 00:51:17 -08:00
Martin Stolle 49d420aeaf Add some comments.
PiperOrigin-RevId: 834173319
2025-11-19 01:09:15 -08:00
Jan Wassenberg 091b4567c9 Minor: ParallelismStrategy->Parallelism
PiperOrigin-RevId: 828936578
2025-11-06 06:56:10 -08:00
Jan Wassenberg a344a70c59 Change (old) attention behavior to disallow wraparound, enforced via assertion.
Shared kU64PerLine constant

PiperOrigin-RevId: 828072451
2025-11-04 11:52:40 -08:00
Phil Culliton ab87807a4c Pre-compress query activations to BF16 before FlashAttention.
PiperOrigin-RevId: 826524997
2025-10-31 09:49:44 -07:00
Ray Smith 8a100c1e8d Added access to flash attention internals to TileFlashAttention4
PiperOrigin-RevId: 826011137
2025-10-30 06:50:05 -07:00
Phil Culliton 116cd6eff6 BF16 mixed-mode flash attention
PiperOrigin-RevId: 825433929
2025-10-29 01:48:28 -07:00
Jan Wassenberg 3cc0139ebb Fix excessive KC/MC from prior change
This could lead to stack overflow in B_storage.

Also do not require specific type for query_norm_scale,
update batch sizes for attention tensors,
more verbose Mat shape/type checks.

PiperOrigin-RevId: 824987689
2025-10-28 05:33:01 -07:00
Biruk Mammo 5a05857deb [Gemma.cpp] Allows non-owned arguments for attention methods.
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtr`s. Acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.

PiperOrigin-RevId: 824584177
2025-10-27 10:43:25 -07:00
Jan Wassenberg 3ed403e287 Major cleanup of profiler zones, add Caller annotation for all pool.Run
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor

PiperOrigin-RevId: 822934530
2025-10-23 01:54:24 -07:00
Phil Culliton 503aaddd65 Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
Ray Smith ee18916abf Removed the PROFILER_ZONE from the most highly called functions to reduce the overhead.
PiperOrigin-RevId: 819739402
2025-10-15 07:10:04 -07:00
Ray Smith fb6fa793f4 Added a global (to gemma) zones list to enable most call sites to PROFILER_ZONE3 to avoid the synchronization required for the static const initialization of the zone handle.
Improved flash_attention to enable profiling using the new zones.

PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
Ray Smith 684a0444e9 Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines
PiperOrigin-RevId: 814241032
2025-10-02 08:15:16 -07:00
Ray Smith 14244664c8 Avoid transposing Q when it isn't needed
PiperOrigin-RevId: 814187984
2025-10-02 05:16:35 -07:00
Jan Wassenberg fe5a39990e Improve FlashAttention threading:
kFlat for RMSNorm (hierarchical is excessive),
profiler zone naming improvements.

PiperOrigin-RevId: 814144012
2025-10-02 02:37:05 -07:00
Ray Smith 6098a022b3 Increased parallelism for RMSNormAndPositionalEncoding
PiperOrigin-RevId: 813738994
2025-10-01 07:11:14 -07:00
Ray Smith 2f6cbde8ff Added a smaller tile size to flash attention for smaller batch sizes
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Ray Smith 4974f24832 Fixed bug with softcap in single flash attention
PiperOrigin-RevId: 813164938
2025-09-30 02:17:58 -07:00
Ray Smith c9b8479f7d Added zero-initialization to att_out.
Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available.

PiperOrigin-RevId: 806284756
2025-09-12 07:48:23 -07:00
Ray Smith f10ac41a20 Added flash attention, with both a single-q function, and a register-tiled function.
The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine.

PiperOrigin-RevId: 804913784
2025-09-09 08:05:26 -07:00