gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Balazs Racz	384c390181	Allow overriding hardcoded max_seq_len by cmdline argument seq_len. Adds a SetMaxSeqLen method to ModelConfig to handle updating both max_seq_len and global attention window sizes. The Gemma constructor now checks if the provided inference seq_len exceeds the model's max_seq_len and, if so, emits a warning and updates the config. This prevents clipping context to the hard-coded maximum. PiperOrigin-RevId: 853676074	2026-01-08 04:28:59 -08:00
Krzysztof Rymski	2ee1fac74c	Internal changes PiperOrigin-RevId: 853138600	2026-01-07 01:21:37 -08:00
Jan Wassenberg	1605925d1e	Add int8 quantization stats. Compute the L1 error and Shannon SNR (higher is better). PiperOrigin-RevId: 846832280	2025-12-19 12:43:03 -08:00
Krzysztof Rymski	08a0760271	Internal changes PiperOrigin-RevId: 846663686	2025-12-19 03:43:15 -08:00
Krzysztof Rymski	b73a9ede8f	Internal changes PiperOrigin-RevId: 846648337	2025-12-19 02:46:18 -08:00
Balazs Racz	0ac55f71ed	Avoid using Row() for unaligned storage. PiperOrigin-RevId: 846214605	2025-12-18 05:10:57 -08:00
Krzysztof Rymski	6661d3a60c	Internal changes PiperOrigin-RevId: 846140314	2025-12-18 01:26:43 -08:00
Phil Culliton	b8a409dbba	Use hn::Sub for vector subtraction in flash attention. PiperOrigin-RevId: 845883321	2025-12-17 12:57:34 -08:00
Balazs Racz	596bdfe5af	Separate monolithic gemma_lib library into more specific cc_library targets. Creates new cc_library targets for :attention, :tensor_stats and :activations. Eliminates cyclic dependencies between these libraries. PiperOrigin-RevId: 845690136	2025-12-17 03:31:16 -08:00
Balazs Racz	baa69dfb78	Makes the entire runtime_config passed into the activations constructor. PiperOrigin-RevId: 845153671	2025-12-16 01:56:52 -08:00
Krzysztof Rymski	44dfd69b9b	Internal changes PiperOrigin-RevId: 844759322	2025-12-15 07:14:37 -08:00
Jan Wassenberg	0c64987a96	Abort if args are unrecognized, refactor argument passing. This catches typos/incorrect usage. Refactor: group Loader/Threading/Inference into GemmaArgs. All *Args ctors now have an extra ConsumedArgs& argument. PiperOrigin-RevId: 844690553	2025-12-15 03:18:45 -08:00
Jan Wassenberg	f50550f4ce	Warning fixes (sign mismatch), switch default PiperOrigin-RevId: 844679375	2025-12-15 02:41:19 -08:00
Martin Stolle	506fb22be7	No public description PiperOrigin-RevId: 843665619	2025-12-12 06:37:17 -08:00
Balazs Racz	338cd8a36e	Factors out a new cc_library `:query` from `:gemma-lib`. Moves query-related structs/classes to gemma/query.h. This refactors PerQuery, AllQueries, and QBatch into a dedicated header file, gemma/query.h, and updates BUILD dependencies accordingly. PiperOrigin-RevId: 843604293	2025-12-12 02:53:56 -08:00
Jan Wassenberg	73c3627b67	Add tensor stats and output. tensor_info: add missing header. io: fix mode. weights.h: add layer_idx to LayerWeightsPtrs. PiperOrigin-RevId: 843531051	2025-12-11 22:52:46 -08:00
Martin Stolle	78deacc357	Make attention configurable on the command line. PiperOrigin-RevId: 842760721	2025-12-10 09:34:06 -08:00
Martin Stolle	2441ff01bf	internal change PiperOrigin-RevId: 842749037	2025-12-10 09:01:15 -08:00
Martin Stolle	9689fc82f9	internal change PiperOrigin-RevId: 842205671	2025-12-09 06:17:08 -08:00
Krzysztof Rymski	64d700cab5	Internal changes PiperOrigin-RevId: 842194766	2025-12-09 05:42:03 -08:00
Martin Stolle	14a9ecf21d	Factor out SumHeads PiperOrigin-RevId: 842138081	2025-12-09 02:23:16 -08:00
Martin Stolle	1014ae9e2a	Adding a simple test for GemmaAttention PiperOrigin-RevId: 842135414	2025-12-09 02:13:03 -08:00
Martin Stolle	b510ba2ab2	Improve clarity of indices II Sorry, didn't see this one before. PiperOrigin-RevId: 840218378	2025-12-04 06:33:33 -08:00
Martin Stolle	9348048885	Clean up toPtrs to delegate to toPtr PiperOrigin-RevId: 840214969	2025-12-04 06:22:04 -08:00
Martin Stolle	d2090fddf3	Improve clarity of indices PiperOrigin-RevId: 839805634	2025-12-03 10:11:21 -08:00
Jan Wassenberg	a084d33e41	Fix Gemma3 image: ensure A matrix is packed, preallocate. Also ignore -2 tokens. PiperOrigin-RevId: 838869988	2025-12-01 11:47:23 -08:00
Krzysztof Rymski	6e5e4123f1	Internal changes PiperOrigin-RevId: 837775282	2025-11-28 02:37:06 -08:00
Krzysztof Rymski	c153d5255b	Internal changes PiperOrigin-RevId: 837001762	2025-11-26 01:05:35 -08:00
Martin Stolle	8696f6dd17	Clarify indices PiperOrigin-RevId: 836235539	2025-11-24 08:27:59 -08:00
Jan Wassenberg	37a25c9ffe	Fix warning (signed vs unsigned) PiperOrigin-RevId: 836106478	2025-11-24 00:51:17 -08:00
Charles Zhao	0e5f4cbf1b	Implement Continuous Batching. (1) A function GenerateTWithContinuousBatching is added to use continuous batching when enabled. (2) The ContinuousQBatch is added as a subclass of QBatch to manage prefill, insert, used-kv-cache-collection. (3) Also expanded the unit test to more diverse cases. PiperOrigin-RevId: 836090261	2025-11-23 23:54:02 -08:00
Martin Stolle	88a03b7ec4	Added access to softmax attention internals to regular attention PiperOrigin-RevId: 835244205	2025-11-21 09:01:01 -08:00
Martin Stolle	5a500872b8	Internal change PiperOrigin-RevId: 835115693	2025-11-21 01:17:45 -08:00
Martin Stolle	49d420aeaf	Add some comments. PiperOrigin-RevId: 834173319	2025-11-19 01:09:15 -08:00
The gemma.cpp Authors	b8f6be72b1	Improves autodetection of Gemma3-1B. Uses the key_norm and query_norm layers to disambiguate between the Gemma2-2B and Gemma3-1B models. Since Gemma3-1B is not multimodal, ViT is not an effective disambiguator. KQ normalization is a structural disambiguator between gemma2 and gemma3. PiperOrigin-RevId: 833213331	2025-11-17 01:12:50 -08:00
Jan Wassenberg	3e18db17f4	Avoid hard-coding kPatchSize. Thanks @Somet2mes for reporting. Fixes #762. PiperOrigin-RevId: 829308896	2025-11-07 00:32:31 -08:00
Charles Zhao	f8131339a7	Refactor for continuous batching. This CL does not change the current behavior of the code. It only extracts two functions that will later be called for adding continuous batching. PiperOrigin-RevId: 829104661	2025-11-06 14:20:17 -08:00
Martin Stolle	35e9f9f05f	Introduce attention implementation configurability. PiperOrigin-RevId: 828971705	2025-11-06 08:43:41 -08:00
Jan Wassenberg	091b4567c9	Minor: ParallelismStrategy->Parallelism PiperOrigin-RevId: 828936578	2025-11-06 06:56:10 -08:00
Jan Wassenberg	a344a70c59	Change (old) attention behavior to disallow wraparound, enforced via assertion. Shared kU64PerLine constant PiperOrigin-RevId: 828072451	2025-11-04 11:52:40 -08:00
Charles Zhao	3a63a12624	Allow prefill only run by allowing max_prompt_size == seq_len PiperOrigin-RevId: 827415258	2025-11-03 03:17:54 -08:00
Phil Culliton	ab87807a4c	Pre-compress query activations to BF16 before FlashAttention. PiperOrigin-RevId: 826524997	2025-10-31 09:49:44 -07:00
Ray Smith	8a100c1e8d	Added access to flash attention internals to TileFlashAttention4 PiperOrigin-RevId: 826011137	2025-10-30 06:50:05 -07:00
Phil Culliton	116cd6eff6	BF16 mixed-mode flash attention PiperOrigin-RevId: 825433929	2025-10-29 01:48:28 -07:00
Jan Wassenberg	4bd465ffd3	Also update attention.h to type-erased query_norm_scale PiperOrigin-RevId: 825014334	2025-10-28 06:48:33 -07:00
Jan Wassenberg	3cc0139ebb	Fix excessive KC/MC from prior change. This could lead to stack overflow in B_storage. Also do not require specific type for query_norm_scale, update batch sizes for attention tensors, more verbose Mat shape/type checks. PiperOrigin-RevId: 824987689	2025-10-28 05:33:01 -07:00
Biruk Mammo	5a05857deb	[Gemma.cpp] Allows non-owned arguments for attention methods. * Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`. * Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches. * Enables the `MatPtrT` default constructor for simpler initializations. * Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor. PiperOrigin-RevId: 824584177	2025-10-27 10:43:25 -07:00
Theotime Combes	1bdde1af3c	Add config flag for global timescale & rely on config to deduce wrapping PiperOrigin-RevId: 823512377	2025-10-24 06:54:56 -07:00
Jan Wassenberg	3ed403e287	Major cleanup of profiler zones, add Caller annotation for all pool.Run. Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones. Add GCPP_ZONE helper. Add Caller argument to pool.Run to enable new stats. Remove most direct dependencies on ThreadPool, prefer ParallelFor. PiperOrigin-RevId: 822934530	2025-10-23 01:54:24 -07:00
Phil Culliton	503aaddd65	Add 8-bit integer quantization (I8Stream) to Gemma.cpp. PiperOrigin-RevId: 819787856	2025-10-15 09:25:20 -07:00
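
The most recent commit above (384c390181) describes a SetMaxSeqLen method on ModelConfig and a constructor check that warns and updates the config when the requested seq_len exceeds the model's max_seq_len. A minimal sketch of that logic follows. It is not the repository's code: the struct ModelConfigSketch, the field attention_window_sizes, the helper ApplySeqLenOverride, and the rule that a layer counts as "global" when its window equals the old maximum are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for ModelConfig; field names are assumptions.
struct ModelConfigSketch {
  size_t max_seq_len = 0;
  // One attention window size per layer. In this sketch a layer is treated
  // as global attention when its window equals max_seq_len.
  std::vector<size_t> attention_window_sizes;

  // Updates max_seq_len and widens the windows of global-attention layers
  // to match; local (smaller) windows are left unchanged.
  void SetMaxSeqLen(size_t new_max_seq_len) {
    assert(new_max_seq_len > 0);
    for (size_t& window : attention_window_sizes) {
      if (window == max_seq_len) window = new_max_seq_len;
    }
    max_seq_len = new_max_seq_len;
  }
};

// Hypothetical helper mirroring the described constructor check: if the
// requested seq_len exceeds max_seq_len, warn and update the config.
// Returns true if the config was changed.
bool ApplySeqLenOverride(ModelConfigSketch& config, size_t seq_len) {
  if (seq_len <= config.max_seq_len) return false;
  std::fprintf(stderr,
               "Warning: seq_len %zu exceeds max_seq_len %zu; updating.\n",
               seq_len, config.max_seq_len);
  config.SetMaxSeqLen(seq_len);
  return true;
}
```

Under these assumptions, a config with max_seq_len 8192 and windows {4096, 8192} that receives seq_len 32768 ends up with max_seq_len 32768 and windows {4096, 32768}, so the context is no longer clipped to the hard-coded maximum.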

432 Commits