gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Nikhil Dev Goyal	90f3de7f15	Use parallel blend chain path in FastSigmoid on architectures having >=32 registers. PiperOrigin-RevId: 886178215	2026-03-19 07:54:05 -07:00
Nikhil Dev Goyal	50144738f1	Change calculation from (ax+b)/(cx+d) to (x+b')/(c'x+d'). This replaces a MulAdd with an Add, reducing port contention on modern CPUs and thus increasing throughput. Also removes the need for one register, since one coefficient is 1.0 here. PiperOrigin-RevId: 886170146	2026-03-19 07:36:52 -07:00
Ray Smith	bea8b1cdbd	Replaced attention in ViT with flash - 8x speedup of image tokenizer on AMD PiperOrigin-RevId: 880877209	2026-03-09 08:46:04 -07:00
Nikhil Dev Goyal	5081341200	Use CappedTag to prevent potential out of bound reads. PiperOrigin-RevId: 879141747	2026-03-05 10:40:52 -08:00
Nikhil Dev Goyal	6721dddf38	Implement FastSigmoid. PiperOrigin-RevId: 878453196	2026-03-04 06:12:33 -08:00
Ray Smith	49cb438b1e	Rollback of erroneous rollback. PiperOrigin-RevId: 877376165	2026-03-02 06:50:26 -08:00
Jan Wassenberg	fbd44cee42	Fix Windows warnings PiperOrigin-RevId: 877338937	2026-03-02 04:53:25 -08:00
The gemma.cpp Authors	a3d994915f	No public description PiperOrigin-RevId: 877333188	2026-03-02 04:32:29 -08:00
Ray Smith	16c1b29b89	Rewrote flash attention to use BF16, transpose k and v, rewrote the task distribution, increase parallelism on decode, and use double the registers for the core of flash attention. PiperOrigin-RevId: 877308306	2026-03-02 03:11:01 -08:00
Nikhil Dev Goyal	dd268ddbe8	Add FastGelu activation function in a newly created fast_ops-inl.h file. This replaces the Tanh call with a FastTanh call in the Gelu function written in math-inl.h. PiperOrigin-RevId: 876339830	2026-02-27 11:14:47 -08:00
Jan Wassenberg	c6587efe70	Improve instrumentation for ViT parts PiperOrigin-RevId: 875302990	2026-02-25 13:10:44 -08:00
Krzysztof Rymski	df162ead7c	Implementation of tiled attention with BF16 and circular buffers, which reduces memory requirements by 4x on longer contexts on Gemma models. It also supports better parallelism for small batch sizes / small models, and can use VDPBF16PS for a 2x improvement on AVX-512. PiperOrigin-RevId: 874517319	2026-02-24 03:26:49 -08:00
Jan Wassenberg	42e9cf557d	Internal change / remove unused PrintSpeed PiperOrigin-RevId: 853694463	2026-01-08 05:26:31 -08:00
Jan Wassenberg	aeade052c6	Move AssertClose to test_util, add U16 PiperOrigin-RevId: 853321311	2026-01-07 10:33:20 -08:00
Krzysztof Rymski	6e5e4123f1	Internal changes PiperOrigin-RevId: 837775282	2025-11-28 02:37:06 -08:00
Jan Wassenberg	ccb49bc82f	Add ToFloatSlow, move RandomFloat to test_util PiperOrigin-RevId: 837412290	2025-11-27 00:14:51 -08:00
Martin Stolle	88a03b7ec4	Added access to softmax attention internals to regular attention PiperOrigin-RevId: 835244205	2025-11-21 09:01:01 -08:00
Martin Stolle	49d420aeaf	Add some comments. PiperOrigin-RevId: 834173319	2025-11-19 01:09:15 -08:00
Jan Wassenberg	091b4567c9	Minor: ParallelismStrategy->Parallelism PiperOrigin-RevId: 828936578	2025-11-06 06:56:10 -08:00
Jan Wassenberg	006999063c	Fix PaliGemma matmul warning PiperOrigin-RevId: 825627406	2025-10-29 11:15:50 -07:00
Phil Culliton	116cd6eff6	BF16 mixed-mode flash attention PiperOrigin-RevId: 825433929	2025-10-29 01:48:28 -07:00
Jan Wassenberg	3cc0139ebb	Fix excessive KC/MC from prior change. This could lead to stack overflow in B_storage. Also do not require specific type for query_norm_scale, update batch sizes for attention tensors, more verbose Mat shape/type checks. PiperOrigin-RevId: 824987689	2025-10-28 05:33:01 -07:00
Biruk Mammo	5a05857deb	[Gemma.cpp] Allows non-owned arguments for attention methods. * Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`. * Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches. * Enables the `MatPtrT` default constructor for simpler initializations. * Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor. PiperOrigin-RevId: 824584177	2025-10-27 10:43:25 -07:00
Jan Wassenberg	86200ce224	1.01x speedup: improved autotune. Group M=4..7 into same config. Add configs for power of two sizes. Allow odd mc to enable a single range for odd M. io.cc: warning fix (cast). IsBlock -> !IsOneMC. benchmark_helper: best for verbosity 3, all configs for 4. ops_test: remove unused includes. PiperOrigin-RevId: 824475104	2025-10-27 05:35:31 -07:00
Jan Wassenberg	a48e614f64	1.02x speedup: improve load balance and simplify parallelFor. Remove ParallelizeOne/TwoRange, use ParallelForAcross/WithinCluster instead. PiperOrigin-RevId: 823388890	2025-10-24 00:19:09 -07:00
Jan Wassenberg	3ed403e287	Major cleanup of profiler zones, add Caller annotation for all pool.Run. Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones. Add GCPP_ZONE helper. Add Caller argument to pool.Run to enable new stats. Remove most direct dependencies on ThreadPool, prefer ParallelFor. PiperOrigin-RevId: 822934530	2025-10-23 01:54:24 -07:00
Jan Wassenberg	f59eb2ed72	Remove multi-package support from topology. Also no longer assume equal-sized clusters. PiperOrigin-RevId: 820164125	2025-10-16 04:00:35 -07:00
Phil Culliton	503aaddd65	Add 8-bit integer quantization (I8Stream) to Gemma.cpp. PiperOrigin-RevId: 819787856	2025-10-15 09:25:20 -07:00
Ray Smith	ee18916abf	Removed the PROFILER_ZONE from the most highly called functions to reduce the overhead. PiperOrigin-RevId: 819739402	2025-10-15 07:10:04 -07:00
Ray Smith	fb6fa793f4	Added a global (to gemma) zones list to enable most call sites to use PROFILER_ZONE3, avoiding the synchronization required for the static const initialization of the zone handle. Improved flash_attention to enable profiling using the new zones. PiperOrigin-RevId: 819235421	2025-10-14 08:30:58 -07:00
Ray Smith	2f6cbde8ff	Added a smaller tile size to flash attention for smaller batch sizes PiperOrigin-RevId: 813226193	2025-09-30 05:49:20 -07:00
Ray Smith	d15731d201	Used hn::BroadcastLane instead of Set(..., x.raw) PiperOrigin-RevId: 811386295	2025-09-25 09:42:03 -07:00
Jan Wassenberg	501fdf000e	Remove no longer used MatVec PiperOrigin-RevId: 809059409	2025-09-19 09:03:22 -07:00
Jan Wassenberg	f3bc1c17da	1.03x speedup: fused FFN. matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC. matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps. PiperOrigin-RevId: 807291701	2025-09-15 10:26:37 -07:00
Jan Wassenberg	9457258330	Refactor MatMul to accept views in the kernel functions. Make arg order consistent. Move StridedView into mat.h. Add view support to RowPtrs. PiperOrigin-RevId: 805197381	2025-09-09 22:09:47 -07:00
Ray Smith	f10ac41a20	Added flash attention, with both a single-q function, and a register-tiled function. The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine. PiperOrigin-RevId: 804913784	2025-09-09 08:05:26 -07:00
Jan Wassenberg	24b1760f03	Refactor: move Worker to ThreadingContext, factor out MMDecompress PiperOrigin-RevId: 804909921	2025-09-09 07:56:12 -07:00
Jan Wassenberg	461a9c7d1b	Matmul refactoring towards fusion. MMLoops: move dispatch code out, use overloads. Split build target into matmul_env (for MatMulEnv/MMOptions). weights: no longer call BindB. Fix potential out of bounds in gemma_batch_bench. PiperOrigin-RevId: 804895985	2025-09-09 07:13:38 -07:00
Jan Wassenberg	34ceee6c30	Update MatMul comments, removing mention of partial. PiperOrigin-RevId: 804872289	2025-09-09 05:57:33 -07:00
Jan Wassenberg	a5ab99e4ba	Memory use reduction: smaller/single MMStorage PiperOrigin-RevId: 804865029	2025-09-09 05:32:46 -07:00
Jan Wassenberg	06e5da1e22	Cleanup: split CacheInfo from Allocator, MatMul helper functions. Lift DecompressA out of main autotuner to prevent interference. Also use kMaxNR / kNR constants instead of extra args. Fix: only require vector alignment, not cache alignment. PiperOrigin-RevId: 804333769	2025-09-08 02:23:58 -07:00
Jan Wassenberg	6e52a835c6	Faster startup on tsan: use hierarchical parallelism for BF16 conversion. Also re-enable profiler zones. PiperOrigin-RevId: 804273899	2025-09-07 22:50:31 -07:00
Jan Wassenberg	ad7d7a2713	Further adjust dot_test threshold (numerics) PiperOrigin-RevId: 803428406	2025-09-05 05:50:16 -07:00
Jan Wassenberg	56186193c1	Replace mt19937 with new generator to enable parallel sampling. Split it into immutable AesCtrEngine and RngStream. Also add RowSpan and Logits span. PiperOrigin-RevId: 803336423	2025-09-04 23:49:10 -07:00
Jan Wassenberg	4be4799727	Remove kMaxPackages and per-package-related code. matmul: remove kMaxClusters, dynamic allocation. PiperOrigin-RevId: 802950348	2025-09-04 03:33:12 -07:00
Jan Wassenberg	7263ab8445	MatMul simplification, threading strategy improvements. Remove MatMul f32 special case (smaller code). types: Add u32/u64 for use by Activations. Move renamed ParallelismStrategy to threading_context so can pass ctx. Ensure worker index is unique across clusters. matmul.h: const member functions for renamed policy classes (easier to call). PiperOrigin-RevId: 802848086	2025-09-03 21:45:07 -07:00
Marie White	74ffe079c4	Create separate MMStorage objects per cluster. PiperOrigin-RevId: 802588625	2025-09-03 09:35:48 -07:00
Jan Wassenberg	b7b3d353db	Simplify MatMul: remove F32 special case (build time). Also move kMaxM into separate kMaxBatchSize. PiperOrigin-RevId: 802086590	2025-09-02 04:29:21 -07:00
Jan Wassenberg	1e3c853e80	Add ParallelFor wrapper function and one new mode. Move ParallelismType from matmul.h to threading.h. Replace SmallParallelFor with ParallelFor and the new mode. PiperOrigin-RevId: 802038452	2025-09-02 01:40:09 -07:00
Marie White	3737224132	Add in-cluster parallel policy. Update policy to include cluster_idx. PiperOrigin-RevId: 802016308	2025-09-02 00:16:00 -07:00

157 Commits