Multiplication is done using int16*int16 multiplication instructions, avoiding the expensive conversion to f32/bf16.
~2x speedup on Zen3.
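Illustrative sketch only (not the gemma.cpp kernel, which uses Highway vector code): an int16 dot product built on AVX2's vpmaddwd (_mm256_madd_epi16), which multiplies int16 pairs and accumulates adjacent products into int32 lanes, so the inner loop never widens to f32/bf16.

  // Standalone illustration; requires -mavx2.
  #include <immintrin.h>
  #include <cstddef>
  #include <cstdint>

  int32_t DotI16(const int16_t* a, const int16_t* b, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
      const __m256i va =
          _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
      const __m256i vb =
          _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
      // vpmaddwd: multiply int16 pairs, add adjacent products -> 8 x int32.
      acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    // Horizontal reduction of the 8 int32 lanes.
    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int32_t sum = 0;
    for (int k = 0; k < 8; ++k) sum += lanes[k];
    for (; i < n; ++i) sum += int32_t(a[i]) * int32_t(b[i]);  // scalar tail
    return sum;
  }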
PiperOrigin-RevId: 888690192
This change introduces `[ BEGIN PHASE: ... ]` and `[ END PHASE: ... ]` messages printed to stderr when `timing_info.verbosity` is 2 or higher. These markers are added around the prefill, generate, image token generation, and final statistics phases to help in profiling and understanding the execution flow.
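A rough sketch of the guard (the helper and its placement are assumptions; only the marker format and the verbosity >= 2 threshold come from the change above):

  #include <cstdio>

  // Hypothetical helper; the real code may simply guard each phase inline.
  template <typename Fn>
  void TimedPhase(int verbosity, const char* name, const Fn& fn) {
    if (verbosity >= 2) fprintf(stderr, "[ BEGIN PHASE: %s ]\n", name);
    fn();  // run the phase (prefill, generate, image tokens, statistics)
    if (verbosity >= 2) fprintf(stderr, "[ END PHASE: %s ]\n", name);
  }

  // Usage: TimedPhase(timing_info.verbosity, "Prefill", [&] { /* prefill */ });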
PiperOrigin-RevId: 882556076
It also supports better parallelism for small batch sizes / small models, and can utilize VDPBF16PS for a ~2x improvement on AVX-512.
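One way to exercise that instruction, as a standalone sketch rather than the actual kernel: f32 inputs are packed to bf16 on the fly and accumulated into f32 lanes by VDPBF16PS.

  // Illustration only; requires -mavx512f -mavx512bf16.
  #include <immintrin.h>
  #include <cstddef>

  float DotViaBF16(const float* a, const float* b, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
      // Pack 32 floats into one bf16 vector; a and b use the same lane order.
      const __m512bh va = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(a + i + 16),
                                              _mm512_loadu_ps(a + i));
      const __m512bh vb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(b + i + 16),
                                              _mm512_loadu_ps(b + i));
      acc = _mm512_dpbf16_ps(acc, va, vb);  // VDPBF16PS
    }
    float sum = _mm512_reduce_add_ps(acc);
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
  }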
PiperOrigin-RevId: 874517319
Adds a SetMaxSeqLen method to ModelConfig to handle updating both max_seq_len and global attention window sizes. The Gemma constructor now checks if the provided inference seq_len exceeds the model's max_seq_len and, if so, emits a warning and updates the config.
This prevents clipping context to the hard-coded maximum.
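A rough sketch of the described check (the ModelConfig fields and the exact warning text are assumptions):

  #include <cstddef>
  #include <cstdio>

  struct ModelConfig {
    size_t max_seq_len = 4096;
    void SetMaxSeqLen(size_t len) {
      max_seq_len = len;
      // Would also update the global attention window sizes here.
    }
  };

  void MaybeRaiseSeqLen(ModelConfig& config, size_t inference_seq_len) {
    if (inference_seq_len > config.max_seq_len) {
      fprintf(stderr,
              "Warning: seq_len %zu exceeds model max_seq_len %zu; updating.\n",
              inference_seq_len, config.max_seq_len);
      config.SetMaxSeqLen(inference_seq_len);
    }
  }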
PiperOrigin-RevId: 853676074
Refactor: group Loader/Threading/Inference into GemmaArgs.
All *Args ctors now take an extra ConsumedArgs& argument; this catches typos/incorrect usage.
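A self-contained sketch of the idea behind ConsumedArgs (hypothetical interface, not the actual gemma.cpp one): each *Args parser marks the argv entries it recognizes, and anything left unconsumed at the end is reported as a typo or unsupported flag.

  #include <cstdio>
  #include <string>
  #include <vector>

  struct ConsumedArgs {
    ConsumedArgs(int argc, char** argv)
        : args(argv, argv + argc), used(argc, false) {}
    std::vector<std::string> args;
    std::vector<bool> used;

    // Returns the value of --name and marks both tokens as consumed.
    const char* Get(const std::string& name) {
      for (size_t i = 1; i + 1 < args.size(); ++i) {
        if (args[i] == "--" + name) {
          used[i] = used[i + 1] = true;
          return args[i + 1].c_str();
        }
      }
      return nullptr;
    }

    // Any unconsumed flag is likely a typo or an unsupported option.
    bool ReportUnused() const {
      bool ok = true;
      for (size_t i = 1; i < args.size(); ++i) {
        if (!used[i]) {
          fprintf(stderr, "Unrecognized argument: %s\n", args[i].c_str());
          ok = false;
        }
      }
      return ok;
    }
  };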
PiperOrigin-RevId: 844690553
(1) A function GenerateTWithContinuousBatching is added to use continuous batching when enabled.
(2) ContinuousQBatch is added as a subclass of QBatch to manage prefill, insertion, and collection of used KV cache.
(3) The unit test is also expanded to cover more diverse cases.
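A self-contained sketch of the continuous-batching idea (hypothetical types, not the QBatch/ContinuousQBatch API): when a sequence finishes, its slot and KV cache are reused for the next queued prompt instead of waiting for the whole batch to drain.

  #include <cstdio>
  #include <deque>
  #include <vector>

  struct Slot { int request_id = -1; int remaining_tokens = 0; };

  void RunContinuousBatch(std::deque<int> pending, int batch_size) {
    std::vector<Slot> slots(batch_size);
    int active = 0;
    do {
      // Insert queued requests into free slots (prefill would happen here).
      for (Slot& s : slots) {
        if (s.request_id < 0 && !pending.empty()) {
          s.request_id = pending.front();
          pending.pop_front();
          s.remaining_tokens = 4 + s.request_id % 3;  // fake decode length
          ++active;
        }
      }
      // One decode step for all active slots.
      for (Slot& s : slots) {
        if (s.request_id < 0) continue;
        if (--s.remaining_tokens == 0) {
          printf("request %d finished; slot freed for reuse\n", s.request_id);
          s.request_id = -1;  // this slot's KV cache can now be collected
          --active;
        }
      }
    } while (active > 0 || !pending.empty());
  }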
PiperOrigin-RevId: 836090261
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor
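A hypothetical ParallelFor-style helper (not the gcpp API) showing the pattern of passing a worker index to the body, which is what enables per-worker/caller statistics:

  #include <cstddef>
  #include <functional>
  #include <thread>
  #include <vector>

  void ParallelFor(size_t num_tasks, size_t num_workers,
                   const std::function<void(size_t task, size_t worker)>& body) {
    std::vector<std::thread> threads;
    for (size_t w = 0; w < num_workers; ++w) {
      threads.emplace_back([=, &body] {
        // Static partition: worker w handles tasks w, w + num_workers, ...
        for (size_t t = w; t < num_tasks; t += num_workers) body(t, w);
      });
    }
    for (auto& t : threads) t.join();
  }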
PiperOrigin-RevId: 822934530
Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes.
threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order
Also enable the SPR target (Zen4 is AMD-only), and update the Highway version for the renamed Thread()->GlobalIdx().
PiperOrigin-RevId: 816223017
Remove the MatMul f32 special case (smaller code).
types: Add u32/u64 for use by Activations.
Move the renamed ParallelismStrategy to threading_context so that ctx can be passed.
Ensure the worker index is unique across clusters.
matmul.h: const member functions for the renamed policy classes (easier to call).
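Assumed scheme, for illustration only, of making the worker index unique across clusters:

  #include <cstddef>

  // Each cluster contributes a disjoint range of global worker indices.
  size_t GlobalWorkerIndex(size_t cluster_idx, size_t workers_per_cluster,
                           size_t local_worker) {
    return cluster_idx * workers_per_cluster + local_worker;
  }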
PiperOrigin-RevId: 802848086
Ensure the streaming func's pos matches the number of calls.
Add two arguments that control pos+1 and pos+=1 behavior.
Also clean up and add comments.
run: use batch_stream_func, add assert, higher verbosity for MM autotune output
PiperOrigin-RevId: 799511163
The logic for choosing whether or not to attend to the last token during prefill wasn't completely consistent with StreamAndUpdateEOS(), causing an off-by-one error that prevented the kv_cache from being fully populated.
PiperOrigin-RevId: 797614310
(1) Directly write to the file in BlobWriter::Add and destruct the MatOwner to release the memory.
(2) Write a fake header to indicate this is V2, and write the correct header and directory at the end of the file.
(3) Tested loading sbs files written both the old way and the new way; both worked.
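Illustrative sketch of the single-pass write pattern described in (1) and (2); the struct layout and magic value are made up, and the real V2 format differs:

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct HeaderV2 {
    uint32_t magic;       // marks the file as V2
    uint64_t dir_offset;  // where the directory starts; 0 in the placeholder
  };

  bool WriteBlobs(const char* path,
                  const std::vector<std::vector<uint8_t>>& blobs) {
    FILE* f = fopen(path, "wb");
    if (f == nullptr) return false;

    HeaderV2 placeholder{0x32534253u, 0};  // fake header: V2 magic, no dir yet
    fwrite(&placeholder, sizeof(placeholder), 1, f);

    std::vector<uint64_t> offsets;
    for (const auto& blob : blobs) {           // stream each blob directly to disk;
      offsets.push_back(static_cast<uint64_t>(ftell(f)));
      fwrite(blob.data(), 1, blob.size(), f);  // its buffer could be freed here
    }

    const uint64_t dir_offset = static_cast<uint64_t>(ftell(f));
    fwrite(offsets.data(), sizeof(uint64_t), offsets.size(), f);  // directory

    HeaderV2 final_header{0x32534253u, dir_offset};
    fwrite(&final_header, sizeof(final_header), 1, f);  // correct header at the end
    return fclose(f) == 0;
  }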
PiperOrigin-RevId: 789306837
Activations was over-parallelized; use a single pool instead.
Also improve profiler zone annotations, and pass through worker args (for tracking concurrency), which are now non-optional.
PiperOrigin-RevId: 788790976
Fix a missing pos increment for the last prefill, and check for it in gemma_test.
Thanks to @ufownl for pointing this out.
Change argument lists to QBatch with accessors.
Increase default seq_len to 8k.
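Regarding the QBatch change above, a hypothetical illustration of bundling per-query state behind accessors instead of passing parallel argument lists (not the actual gcpp QBatch interface):

  #include <cstddef>
  #include <utility>
  #include <vector>

  class QBatch {
   public:
    QBatch(std::vector<std::vector<int>> prompts, std::vector<size_t> pos)
        : prompts_(std::move(prompts)), pos_(std::move(pos)) {}

    size_t Size() const { return prompts_.size(); }
    const std::vector<int>& Prompt(size_t qi) const { return prompts_[qi]; }
    size_t Pos(size_t qi) const { return pos_[qi]; }
    void SetPos(size_t qi, size_t pos) { pos_[qi] = pos; }

   private:
    std::vector<std::vector<int>> prompts_;
    std::vector<size_t> pos_;
  };

  // Callers then pass one QBatch instead of separate prompt/pos arguments:
  //   void Generate(QBatch& qbatch);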
PiperOrigin-RevId: 771937385