Nikhil Dev Goyal
90f3de7f15
Use parallel blend chain path in FastSigmoid on architectures having >=32 registers
...
PiperOrigin-RevId: 886178215
2026-03-19 07:54:05 -07:00
Nikhil Dev Goyal
50144738f1
Change calculation from (ax+b)/(cx+d) to (x+b')/(c'x+d'). This replaces a MulAdd with an Add, reducing port contention on modern CPUs and thus increasing throughput.
...
Also reduces the need for 1 register to hold b as 1.0 here
PiperOrigin-RevId: 886170146
2026-03-19 07:36:52 -07:00
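A minimal sketch of the rewrite described in the commit above, assuming a != 0 so that numerator and denominator can both be divided by a. The struct names and the coefficient values used with them are hypothetical and are not the ones used by FastSigmoid; the point is only that b' = b/a, c' = c/a, d' = d/a gives the same value while the numerator needs an add instead of a multiply-add.

```cpp
#include <cmath>

// Hypothetical coefficient holder for illustration only; these are not the
// coefficients or types used in gemma.cpp.
struct Rational {
  float a, b, c, d;
};

// Coefficients after dividing numerator and denominator by a.
struct Normalized {
  float b1, c1, d1;  // b' = b/a, c' = c/a, d' = d/a
};

// Original form: the numerator a*x + b requires a multiply-add.
inline float EvalOriginal(const Rational& r, float x) {
  return (r.a * x + r.b) / (r.c * x + r.d);
}

// Precompute the normalized coefficients once. Assumes r.a != 0.
inline Normalized Normalize(const Rational& r) {
  return {r.b / r.a, r.c / r.a, r.d / r.a};
}

// Normalized form: the numerator x + b' is a plain add, and no value needs
// to be kept around for the leading coefficient.
inline float EvalNormalized(const Normalized& n, float x) {
  return (x + n.b1) / (n.c1 * x + n.d1);
}
```

Both forms agree wherever the denominator is nonzero; the normalization itself is done once, outside the hot loop.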
Ray Smith
bea8b1cdbd
Replaced attention in ViT with flash - 8x speedup of image tokenizer on AMD
...
PiperOrigin-RevId: 880877209
2026-03-09 08:46:04 -07:00
Nikhil Dev Goyal
5081341200
Use CappedTag to prevent potential out of bound reads.
...
PiperOrigin-RevId: 879141747
2026-03-05 10:40:52 -08:00
Nikhil Dev Goyal
6721dddf38
Implement FastSigmoid.
...
PiperOrigin-RevId: 878453196
2026-03-04 06:12:33 -08:00
Ray Smith
49cb438b1e
Rollback of erroneous rollback.
...
PiperOrigin-RevId: 877376165
2026-03-02 06:50:26 -08:00
Jan Wassenberg
fbd44cee42
Fix Windows warnings
...
PiperOrigin-RevId: 877338937
2026-03-02 04:53:25 -08:00
The gemma.cpp Authors
a3d994915f
No public description
...
PiperOrigin-RevId: 877333188
2026-03-02 04:32:29 -08:00
Ray Smith
16c1b29b89
Rewrote flash attention to use BF16, transposed k and v, rewrote the task distribution, increased parallelism on decode, and used double the registers for the core of flash attention.
...
PiperOrigin-RevId: 877308306
2026-03-02 03:11:01 -08:00
Nikhil Dev Goyal
dd268ddbe8
Add FastGelu activation function in a newly created fast_ops-inl.h file.
...
This replaces the Tanh call with FastTanh call in the Gelu function written in math-inl.h.
PiperOrigin-RevId: 876339830
2026-02-27 11:14:47 -08:00
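A rough scalar sketch of the idea in the commit above: the tanh-based Gelu approximation with the tanh function passed in, so that an exact and a cheaper version can be swapped. The names GeluWith and StandInFastTanh are hypothetical, and the stand-in below is a generic clamped rational approximation chosen for illustration; it is not the FastTanh from fast_ops-inl.h, whose implementation is not shown in this log.

```cpp
#include <cmath>

// Tanh-based Gelu approximation:
//   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
// The tanh implementation is a parameter so it can be replaced.
template <typename TanhFn>
inline float GeluWith(float x, TanhFn tanh_fn) {
  const float kSqrt2OverPi = 0.7978845608f;
  const float kMul = 0.044715f;
  const float inner = kSqrt2OverPi * (x + kMul * x * x * x);
  return 0.5f * x * (1.0f + tanh_fn(inner));
}

// Hypothetical stand-in for a fast tanh (NOT gemma.cpp's FastTanh):
// x * (27 + x^2) / (27 + 9 * x^2), clamped to +/-1 for |x| >= 3,
// where the rational expression reaches exactly +/-1.
inline float StandInFastTanh(float x) {
  if (x >= 3.0f) return 1.0f;
  if (x <= -3.0f) return -1.0f;
  const float x2 = x * x;
  return x * (27.0f + x2) / (27.0f + 9.0f * x2);
}

// Reference version using the standard library tanh.
inline float GeluExact(float x) {
  return GeluWith(x, [](float v) { return std::tanh(v); });
}

// Cheaper version using the stand-in approximation.
inline float GeluFast(float x) { return GeluWith(x, StandInFastTanh); }
```

The stand-in trades a few hundredths of absolute error in tanh for avoiding the transcendental call; a production approximation would be tuned far more carefully.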
Jan Wassenberg
c6587efe70
Improve instrumentation for ViT parts
...
PiperOrigin-RevId: 875302990
2026-02-25 13:10:44 -08:00
Krzysztof Rymski
df162ead7c
Implementation of tiled attention with bf16 and circular buffers, which reduces memory requirements by 4x for longer contexts on Gemma models.
...
It also supports better parallelism for small batch sizes / small models.
It can also use VDPBF16PS for a nice 2x improvement on AVX-512.
PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00
Jan Wassenberg
42e9cf557d
Internal change / remove unused PrintSpeed
...
PiperOrigin-RevId: 853694463
2026-01-08 05:26:31 -08:00
Jan Wassenberg
aeade052c6
Move AssertClose to test_util, add U16
...
PiperOrigin-RevId: 853321311
2026-01-07 10:33:20 -08:00
Krzysztof Rymski
6e5e4123f1
Internal changes
...
PiperOrigin-RevId: 837775282
2025-11-28 02:37:06 -08:00
Jan Wassenberg
ccb49bc82f
Add ToFloatSlow, move RandomFloat to test_util
...
PiperOrigin-RevId: 837412290
2025-11-27 00:14:51 -08:00
Martin Stolle
88a03b7ec4
Added access to softmax attention internals to regular attention
...
PiperOrigin-RevId: 835244205
2025-11-21 09:01:01 -08:00
Martin Stolle
49d420aeaf
Add some comments.
...
PiperOrigin-RevId: 834173319
2025-11-19 01:09:15 -08:00
Jan Wassenberg
091b4567c9
Minor: ParallelismStrategy->Parallelism
...
PiperOrigin-RevId: 828936578
2025-11-06 06:56:10 -08:00
Jan Wassenberg
006999063c
Fix PaliGemma matmul warning
...
PiperOrigin-RevId: 825627406
2025-10-29 11:15:50 -07:00
Phil Culliton
116cd6eff6
BF16 mixed-mode flash attention
...
PiperOrigin-RevId: 825433929
2025-10-29 01:48:28 -07:00
Jan Wassenberg
3cc0139ebb
Fix excessive KC/MC from prior change
...
This could lead to stack overflow in B_storage.
Also do not require specific type for query_norm_scale,
update batch sizes for attention tensors,
more verbose Mat shape/type checks.
PiperOrigin-RevId: 824987689
2025-10-28 05:33:01 -07:00
Biruk Mammo
5a05857deb
[Gemma.cpp] Allows non-owned arguments for attention methods.
...
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.
PiperOrigin-RevId: 824584177
2025-10-27 10:43:25 -07:00
Jan Wassenberg
86200ce224
1.01x speedup: improved autotune
...
Group M=4..7 into same config. Add configs for power of two sizes.
Allow odd mc to enable a single range for odd M.
io.cc: warning fix(cast).
IsBlock -> !IsOneMC
benchmark_helper: best for verbosity 3, all configs for 4
ops_test: remove unused includes
PiperOrigin-RevId: 824475104
2025-10-27 05:35:31 -07:00
Jan Wassenberg
a48e614f64
1.02x speedup: improve load balance and simplify parallelFor
...
Remove ParallelizeOne/TwoRange, use ParallelForAcross/WithinCluster instead.
PiperOrigin-RevId: 823388890
2025-10-24 00:19:09 -07:00
Jan Wassenberg
3ed403e287
Major cleanup of profiler zones, add Caller annotation for all pool.Run
...
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor
PiperOrigin-RevId: 822934530
2025-10-23 01:54:24 -07:00
Jan Wassenberg
f59eb2ed72
Remove multi-package support from topology
...
Also no longer assume equal-sized clusters
PiperOrigin-RevId: 820164125
2025-10-16 04:00:35 -07:00
Phil Culliton
503aaddd65
Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
...
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
Ray Smith
ee18916abf
Removed the PROFILER_ZONE from the most highly called functions to reduce the overhead.
...
PiperOrigin-RevId: 819739402
2025-10-15 07:10:04 -07:00
Ray Smith
fb6fa793f4
Added a global (to gemma) zones list to enable most call sites to PROFILER_ZONE3 to avoid the synchronization required for the static const initialization of the zone handle.
...
Improved flash_attention to enable profiling using the new zones.
PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
Ray Smith
2f6cbde8ff
Added a smaller tile size to flash attention for smaller batch sizes
...
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Ray Smith
d15731d201
Used hn::BroadcastLane instead of Set(..., x.raw)
...
PiperOrigin-RevId: 811386295
2025-09-25 09:42:03 -07:00
Jan Wassenberg
501fdf000e
Remove no longer used MatVec
...
PiperOrigin-RevId: 809059409
2025-09-19 09:03:22 -07:00
Jan Wassenberg
f3bc1c17da
1.03x speedup: fused FFN
...
matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC
matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps.
PiperOrigin-RevId: 807291701
2025-09-15 10:26:37 -07:00
Jan Wassenberg
9457258330
Refactor MatMul to accept views in the kernel functions
...
Make arg order consistent.
Move StridedView into mat.h.
Add view support to RowPtrs.
PiperOrigin-RevId: 805197381
2025-09-09 22:09:47 -07:00
Ray Smith
f10ac41a20
Added flash attention, with both a single-q function, and a register-tiled function.
...
The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine.
PiperOrigin-RevId: 804913784
2025-09-09 08:05:26 -07:00
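For readers unfamiliar with the technique named in the commit above, here is a scalar sketch of single-query attention using the online softmax recurrence that flash attention is generally built on. It is an illustration under assumptions, not the gemma.cpp implementation: the function name and the std::vector layout are hypothetical, and there is no tiling, vectorization, or score scaling.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Computes softmax(q . k[t]) weighted sum of v[t] in one pass over the keys,
// without materializing the full score vector. Hypothetical layout:
// k[t] has q.size() elements, and all v[t] have the same length.
inline std::vector<float> SingleQueryOnlineSoftmax(
    const std::vector<float>& q, const std::vector<std::vector<float>>& k,
    const std::vector<std::vector<float>>& v) {
  const size_t dim_v = v.empty() ? 0 : v[0].size();
  std::vector<float> out(dim_v, 0.0f);
  float running_max = -std::numeric_limits<float>::infinity();
  float running_sum = 0.0f;

  for (size_t t = 0; t < k.size(); ++t) {
    // Dot product of the query with key t.
    float score = 0.0f;
    for (size_t i = 0; i < q.size(); ++i) score += q[i] * k[t][i];

    // Rescale previous accumulators to the new maximum. On the first
    // iteration running_max is -inf, so rescale_old is exp(-inf) = 0.
    const float new_max = std::max(running_max, score);
    const float rescale_old = std::exp(running_max - new_max);
    const float weight = std::exp(score - new_max);

    running_sum = running_sum * rescale_old + weight;
    for (size_t j = 0; j < dim_v; ++j) {
      out[j] = out[j] * rescale_old + weight * v[t][j];
    }
    running_max = new_max;
  }

  // Final normalization by the softmax denominator.
  if (running_sum > 0.0f) {
    for (float& o : out) o /= running_sum;
  }
  return out;
}
```

A register-tiled variant would process several queries and keys per step so that the accumulators stay in vector registers, but the recurrence per query is the same.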
Jan Wassenberg
24b1760f03
Refactor: move Worker to ThreadingContext, factor out MMDecompress
...
PiperOrigin-RevId: 804909921
2025-09-09 07:56:12 -07:00
Jan Wassenberg
461a9c7d1b
Matmul refactoring towards fusion
...
MMLoops: move dispatch code out, use overloads
split build target into matmul_env (for MatMulEnv/MMOptions)
weights: no longer call BindB
Fix potential out of bounds in gemma_batch_bench
PiperOrigin-RevId: 804895985
2025-09-09 07:13:38 -07:00
Jan Wassenberg
34ceee6c30
Update MatMul comments, removing mention of partial.
...
PiperOrigin-RevId: 804872289
2025-09-09 05:57:33 -07:00
Jan Wassenberg
a5ab99e4ba
Memory use reduction: smaller/single MMStorage
...
PiperOrigin-RevId: 804865029
2025-09-09 05:32:46 -07:00
Jan Wassenberg
06e5da1e22
Cleanup: split CacheInfo from Allocator, MatMul helper functions
...
Lift DecompressA out of main autotuner to prevent interference
Also use kMaxNR / kNR constants instead of extra args
Fix: only require vector alignment, not cache alignment
PiperOrigin-RevId: 804333769
2025-09-08 02:23:58 -07:00
Jan Wassenberg
6e52a835c6
Faster startup on tsan: use hierarchical parallelism for BF16 conversion
...
Also re-enable profiler zones
PiperOrigin-RevId: 804273899
2025-09-07 22:50:31 -07:00
Jan Wassenberg
ad7d7a2713
Further adjust dot_test threshold (numerics)
...
PiperOrigin-RevId: 803428406
2025-09-05 05:50:16 -07:00
Jan Wassenberg
56186193c1
Replace mt19937 with new generator to enable parallel sampling
...
Split it into immutable AesCtrEngine and RngStream
Also add RowSpan and Logits span
PiperOrigin-RevId: 803336423
2025-09-04 23:49:10 -07:00
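The split into an immutable engine and a per-stream state, as described in the commit above, can be sketched as a counter-based generator. This is only an illustration of why such a split enables parallel sampling: HashCtrEngine and Stream are hypothetical names, and a SplitMix64-style integer mixer stands in for AES, so this is neither the actual AesCtrEngine/RngStream API nor suitable where AES-quality output is required.

```cpp
#include <cstdint>

// Immutable engine: holds only a key, so it can be shared by all threads
// without synchronization. Output depends only on (key, stream, counter).
class HashCtrEngine {
 public:
  explicit HashCtrEngine(uint64_t key) : key_(key) {}

  uint64_t operator()(uint64_t stream, uint64_t counter) const {
    return Mix(Mix(key_ ^ stream) + counter);
  }

 private:
  // SplitMix64-style bijective mixer (stand-in for an AES round function).
  static uint64_t Mix(uint64_t z) {
    z += 0x9E3779B97F4A7C15ull;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    return z ^ (z >> 31);
  }

  uint64_t key_;
};

// Mutable per-stream state: just a stream id and a counter. Each worker or
// sequence position owns its own Stream, so no state is shared or locked.
class Stream {
 public:
  Stream(const HashCtrEngine& engine, uint64_t stream)
      : engine_(&engine), stream_(stream) {}

  uint64_t operator()() { return (*engine_)(stream_, counter_++); }

 private:
  const HashCtrEngine* engine_;
  uint64_t stream_;
  uint64_t counter_ = 0;
};
```

Unlike a single mt19937 instance, whose state must be advanced sequentially, two Stream objects with the same engine and stream id reproduce the same sequence, and streams with different ids can be consumed in any order or in parallel.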
Jan Wassenberg
4be4799727
Remove kMaxPackages and per-package-related code
...
matmul: remove kMaxClusters, dynamic allocation
PiperOrigin-RevId: 802950348
2025-09-04 03:33:12 -07:00
Jan Wassenberg
7263ab8445
MatMul simplification, threading strategy improvements
...
remove MatMul f32 special case (smaller code),
types: Add u32/u64 for use by Activations
move renamed ParallelismStrategy to threading_context so can pass ctx
ensure worker index is unique across clusters
matmul.h: const member functions for renamed policy classes (easier to call)
PiperOrigin-RevId: 802848086
2025-09-03 21:45:07 -07:00
Marie White
74ffe079c4
Create separate MMStorage objects per cluster.
...
PiperOrigin-RevId: 802588625
2025-09-03 09:35:48 -07:00
Jan Wassenberg
b7b3d353db
Simplify MatMul: remove F32 special case (build time)
...
Also move kMaxM into separate kMaxBatchSize
PiperOrigin-RevId: 802086590
2025-09-02 04:29:21 -07:00
Jan Wassenberg
1e3c853e80
Add ParallelFor wrapper function and one new mode
...
Move ParallelismType from matmul.h to threading.h
Replace SmallParallelFor with ParallelFor and the new mode
PiperOrigin-RevId: 802038452
2025-09-02 01:40:09 -07:00
Marie White
3737224132
Add in-cluster parallel policy. Update policy to include cluster_idx.
...
PiperOrigin-RevId: 802016308
2025-09-02 00:16:00 -07:00