gemma.cpp/gemma
Krzysztof Rymski df162ead7c Implementation of tiled attention with bf16 and circular buffers, which reduces memory requirements by 4x on longer contexts for Gemma models.
It also supports better parallelism for small batch sizes and small models.
It can also utilize VDPBF16PS for a ~2x improvement on AVX-512.

PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00
bindings Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
evals Add MMLU eval to github 2024-05-20 10:20:53 -07:00
activations.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
api_client.cc Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
api_server.cc Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
attention.cc Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data. 2026-02-13 06:05:30 -08:00
attention.h Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data. 2026-02-13 06:05:30 -08:00
attention_test.cc Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data. 2026-02-13 06:05:30 -08:00
configs.cc Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data. 2026-02-13 06:05:30 -08:00
configs.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
configs_test.cc Minor: rename compression/shared -> types.h 2025-05-13 06:53:21 -07:00
flash_attention.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
flash_attention.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
flash_attention_test.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
flash_structs.h Add some comments. 2025-11-19 01:09:15 -08:00
gemma-inl.h Add tensor stats and output 2025-12-11 22:52:46 -08:00
gemma.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
gemma.h Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
gemma_args.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
gemma_args_test.cc Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
kv_cache.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
kv_cache.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
kv_cache_test.cc Internal changes 2026-01-09 06:35:36 -08:00
model_store.cc Allow overriding hardcoded max_seq_len by cmdline argument seq_len. 2026-01-08 04:28:59 -08:00
model_store.h Allow overriding hardcoded max_seq_len by cmdline argument seq_len. 2026-01-08 04:28:59 -08:00
query.h Warning fixes (sign mismatch), switch default 2025-12-15 02:41:19 -08:00
run.cc Fix paligemma: must subtract image tokens from prompt length 2026-02-05 05:59:36 -08:00
tensor_info.cc Add tensor stats and output 2025-12-11 22:52:46 -08:00
tensor_info.h Add tensor stats and output 2025-12-11 22:52:46 -08:00
tensor_info_test.cc Minor: ModelWeightsPtrs -> WeightsPtrs 2025-07-11 06:11:51 -07:00
tensor_stats.cc Add int8 quantization stats 2025-12-19 12:43:03 -08:00
tensor_stats.h Add int8 quantization stats 2025-12-19 12:43:03 -08:00
tiled_attention.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
tiled_attention.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
tiled_attention_test.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
tokenizer.cc (Resubmit) Prepare profiler annotations for new API 2025-08-13 01:38:24 -07:00
tokenizer.h 6x large-batch, short-prompt prefill speedup 2025-06-10 09:56:20 -07:00
vit.cc Fix Gemma3 image: ensure A matrix is packed, preallocate 2025-12-01 11:47:23 -08:00
vit.h Minor: ModelWeightsPtrs -> WeightsPtrs 2025-07-11 06:11:51 -07:00
weights.cc Minor: ParallelismStrategy->Parallelism 2025-11-06 06:56:10 -08:00
weights.h Add tensor stats and output 2025-12-11 22:52:46 -08:00