gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Krzysztof Rymski	b31e8f98e8	Internal changes PiperOrigin-RevId: 836654012	2025-11-26 04:40:28 -08:00
Krzysztof Rymski	c153d5255b	Internal changes PiperOrigin-RevId: 837001762	2025-11-26 01:05:35 -08:00
Martin Stolle	8696f6dd17	Clarify indices PiperOrigin-RevId: 836235539	2025-11-24 08:27:59 -08:00
Jan Wassenberg	37a25c9ffe	Fix warning (signed vs unsigned) PiperOrigin-RevId: 836106478	2025-11-24 00:51:17 -08:00
Charles Zhao	0e5f4cbf1b	Implement Continuous Batching. (1) A function GenerateTWithContinuousBatching is added to use continuous batching when enabled. (2) The ContinuousQBatch is added as a subclass of QBatch to manage prefill, insert, used-kv-cache-collection. (3) Also expanded the unit test to more diverse cases. PiperOrigin-RevId: 836090261	2025-11-23 23:54:02 -08:00
Martin Stolle	88a03b7ec4	Added access to softmax attention internals to regular attention PiperOrigin-RevId: 835244205	2025-11-21 09:01:01 -08:00
Martin Stolle	5a500872b8	Internal change PiperOrigin-RevId: 835115693	2025-11-21 01:17:45 -08:00
Martin Stolle	49d420aeaf	Add some comments. PiperOrigin-RevId: 834173319	2025-11-19 01:09:15 -08:00
The gemma.cpp Authors	b8f6be72b1	Improves autodetection of Gemma3-1B. Uses the key_norm and query_norm layers to disambiguate between the Gemma2-2B and Gemma3-1B models. Since Gemma3-1B is not multimodal, ViT is not an effective disambiguator. KQ normalization is a structural disambiguator between gemma2 and gemma3. PiperOrigin-RevId: 833213331	2025-11-17 01:12:50 -08:00
Jan Wassenberg	3e18db17f4	Avoid hard-coding kPatchSize. Thanks @Somet2mes for reporting. Fixes #762 . PiperOrigin-RevId: 829308896	2025-11-07 00:32:31 -08:00
Charles Zhao	f8131339a7	Refactor for continuous batching. This CL does not change the current behavior of the code. It only extracts two functions that will later be called for adding continuous batching. PiperOrigin-RevId: 829104661	2025-11-06 14:20:17 -08:00
Martin Stolle	35e9f9f05f	Introduce attention implementation configurability. PiperOrigin-RevId: 828971705	2025-11-06 08:43:41 -08:00
Jan Wassenberg	091b4567c9	Minor: ParallelismStrategy->Parallelism PiperOrigin-RevId: 828936578	2025-11-06 06:56:10 -08:00
Jan Wassenberg	a344a70c59	Change (old) attention behavior to disallow wraparound, enforced via assertion. Shared kU64PerLine constant PiperOrigin-RevId: 828072451	2025-11-04 11:52:40 -08:00
Charles Zhao	3a63a12624	Allow prefill only run by allowing max_prompt_size == seq_len PiperOrigin-RevId: 827415258	2025-11-03 03:17:54 -08:00
Phil Culliton	ab87807a4c	Pre-compress query activations to BF16 before FlashAttention. PiperOrigin-RevId: 826524997	2025-10-31 09:49:44 -07:00
Ray Smith	8a100c1e8d	Added access to flash attention internals to TileFlashAttention4 PiperOrigin-RevId: 826011137	2025-10-30 06:50:05 -07:00
Phil Culliton	116cd6eff6	BF16 mixed-mode flash attention PiperOrigin-RevId: 825433929	2025-10-29 01:48:28 -07:00
Jan Wassenberg	4bd465ffd3	Also update attention.h to type-erased query_norm_scale PiperOrigin-RevId: 825014334	2025-10-28 06:48:33 -07:00
Jan Wassenberg	3cc0139ebb	Fix excessive KC/MC from prior change This could lead to stack overflow in B_storage. Also do not require specific type for query_norm_scale, update batch sizes for attention tensors, more verbose Mat shape/type checks. PiperOrigin-RevId: 824987689	2025-10-28 05:33:01 -07:00
Biruk Mammo	5a05857deb	[Gemma.cpp] Allows non-owned arguments for attention methods. * Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`. * Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches. * Enables the `MatPtrT` default constructor for simpler initializations. * Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor. PiperOrigin-RevId: 824584177	2025-10-27 10:43:25 -07:00
Theotime Combes	1bdde1af3c	Add config flag for global timescale & rely on config to deduce wrapping PiperOrigin-RevId: 823512377	2025-10-24 06:54:56 -07:00
Jan Wassenberg	3ed403e287	Major cleanup of profiler zones, add Caller annotation for all pool.Run Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones Add GCPP_ZONE helper Add Caller argument to pool.Run to enable new stats Remove most direct dependencies on ThreadPool, prefer ParallelFor PiperOrigin-RevId: 822934530	2025-10-23 01:54:24 -07:00
Phil Culliton	503aaddd65	Add 8-bit integer quantization (I8Stream) to Gemma.cpp. PiperOrigin-RevId: 819787856	2025-10-15 09:25:20 -07:00
Ray Smith	ee18916abf	Removed the PROFILER_ZONE from the most highly called functions to reduce the overhead. PiperOrigin-RevId: 819739402	2025-10-15 07:10:04 -07:00
Ray Smith	fb6fa793f4	Added a global (to gemma) zones list to enable most call sites to PROFILER_ZONE3 to avoid the synchronization required for the static const initialization of the zone handle. Improved flash_attention to enable profiling using the new zones. PiperOrigin-RevId: 819235421	2025-10-14 08:30:58 -07:00
Jan Wassenberg	035273c184	tune pool kSpin mode in threading_context Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes. threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order Also enable SPR target (Zen4 is AMD-only), update Highway version for renamed Thread()->GlobalIdx(). PiperOrigin-RevId: 816223017	2025-10-07 08:36:26 -07:00
Ray Smith	684a0444e9	Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines PiperOrigin-RevId: 814241032	2025-10-02 08:15:16 -07:00
Ray Smith	14244664c8	Avoid transposing Q when it isn't needed PiperOrigin-RevId: 814187984	2025-10-02 05:16:35 -07:00
Jan Wassenberg	fe5a39990e	Improve FlashAttention threading: kFlat for RMSNorm (hierarchical is excessive), profiler zone naming improvements. PiperOrigin-RevId: 814144012	2025-10-02 02:37:05 -07:00
Ray Smith	6098a022b3	Increased parallelism for RMSNormAndPositionalEncoding PiperOrigin-RevId: 813738994	2025-10-01 07:11:14 -07:00
Ray Smith	2f6cbde8ff	Added a smaller tile size to flash attention for smaller batch sizes PiperOrigin-RevId: 813226193	2025-09-30 05:49:20 -07:00
Ray Smith	4974f24832	Fixed bug with softcap in single flash attention PiperOrigin-RevId: 813164938	2025-09-30 02:17:58 -07:00
Nitin Gangahar	667a3f117a	Utilize multiple cores to read weight batches. PiperOrigin-RevId: 811893059	2025-09-26 11:28:33 -07:00
Charles Zhao	4f0c633248	(1) Added QueryResultAndMetrics and BatchQueryModelWithMetrics to also return TimingInfo besides query results. PiperOrigin-RevId: 810634261	2025-09-23 17:02:29 -07:00
Jan Wassenberg	fac8aac4cb	Internal change PiperOrigin-RevId: 809975026	2025-09-22 05:37:03 -07:00
Jan Wassenberg	501fdf000e	Remove no longer used MatVec PiperOrigin-RevId: 809059409	2025-09-19 09:03:22 -07:00
Jan Wassenberg	f3bc1c17da	1.03x speedup: fused FFN matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps. PiperOrigin-RevId: 807291701	2025-09-15 10:26:37 -07:00
Ray Smith	c9b8479f7d	Added zero-initialization to att_out. Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available. PiperOrigin-RevId: 806284756	2025-09-12 07:48:23 -07:00
Jan Wassenberg	2695aab5d2	Temporarily disable flash pending msan fix PiperOrigin-RevId: 805350234	2025-09-10 07:25:41 -07:00
Jan Wassenberg	ba6131311a	Fix gemma_batch_bench for flash attention q_T rows do not change. Also repeat prefill to reflect perf after autotuning. PiperOrigin-RevId: 805319377	2025-09-10 05:32:34 -07:00
Ray Smith	f10ac41a20	Added flash attention, with both a single-q function, and a register-tiled function. The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine. PiperOrigin-RevId: 804913784	2025-09-09 08:05:26 -07:00
Jan Wassenberg	461a9c7d1b	Matmul refactoring towards fusion MMLoops: move dispatch code out, use overloads split build target into matmul_env (for MatMulEnv/MMOptions) weights: no longer call BindB Fix potential out of bounds in gemma_batch_bench PiperOrigin-RevId: 804895985	2025-09-09 07:13:38 -07:00
Jan Wassenberg	a5ab99e4ba	Memory use reduction: smaller/single MMStorage PiperOrigin-RevId: 804865029	2025-09-09 05:32:46 -07:00
Jan Wassenberg	6e52a835c6	Faster startup on tsan: use hierarchical parallelism for BF16 conversion Also re-enable profiler zones PiperOrigin-RevId: 804273899	2025-09-07 22:50:31 -07:00
Jan Wassenberg	cbe24eac51	1.15x speedup: parallel sampling, enabled by new RNG Also pass pos to SampleFunc, for seeding the RNG. PiperOrigin-RevId: 803453518	2025-09-05 07:24:02 -07:00
Jan Wassenberg	2b4c16e243	Remove Griffin support Also add IsObsolete helper PiperOrigin-RevId: 803376921	2025-09-05 02:35:40 -07:00
Jan Wassenberg	56186193c1	Replace mt19937 with new generator to enable parallel sampling Split it into immutable AesCtrEngine and RngStream Also add RowSpan and Logits span PiperOrigin-RevId: 803336423	2025-09-04 23:49:10 -07:00
Jan Wassenberg	5d1693e806	Internal change PiperOrigin-RevId: 803083229	2025-09-04 10:31:20 -07:00
Jan Wassenberg	4be4799727	Remove kMaxPackages and per-package-related code matmul: remove kMaxClusters, dynamic allocation PiperOrigin-RevId: 802950348	2025-09-04 03:33:12 -07:00

406 Commits