gemma.cpp/gemma
Krzysztof Rymski df162ead7c Implementation of tiled attention with bf16 and circular buffers, which reduces memory requirements by 4x on longer contexts for Gemma models.
It also supports better parallelism for small batch sizes and small models.
It can also utilize VDPBF16PS for a ~2x improvement on AVX-512.

PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00
bindings Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
evals Add MMLU eval to github 2024-05-20 10:20:53 -07:00
activations.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
api_client.cc Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
api_server.cc Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
attention.cc Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data. 2026-02-13 06:05:30 -08:00
attention.h Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data. 2026-02-13 06:05:30 -08:00
attention_test.cc Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data. 2026-02-13 06:05:30 -08:00
configs.cc Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data. 2026-02-13 06:05:30 -08:00
configs.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
configs_test.cc Minor: rename compression/shared -> types.h 2025-05-13 06:53:21 -07:00
flash_attention.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
flash_attention.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
flash_attention_test.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
flash_structs.h Add some comments. 2025-11-19 01:09:15 -08:00
gemma-inl.h Add tensor stats and output 2025-12-11 22:52:46 -08:00
gemma.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
gemma.h Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
gemma_args.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
gemma_args_test.cc Abort if args are unrecognized, refactor argument passing 2025-12-15 03:18:45 -08:00
kv_cache.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
kv_cache.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
kv_cache_test.cc Internal changes 2026-01-09 06:35:36 -08:00
model_store.cc Allow overriding hardcoded max_seq_len by cmdline argument seq_len. 2026-01-08 04:28:59 -08:00
model_store.h Allow overriding hardcoded max_seq_len by cmdline argument seq_len. 2026-01-08 04:28:59 -08:00
query.h Warning fixes (sign mismatch), switch default 2025-12-15 02:41:19 -08:00
run.cc Fix paligemma: must subtract image tokens from prompt length 2026-02-05 05:59:36 -08:00
tensor_info.cc Add tensor stats and output 2025-12-11 22:52:46 -08:00
tensor_info.h Add tensor stats and output 2025-12-11 22:52:46 -08:00
tensor_info_test.cc Minor: ModelWeightsPtrs -> WeightsPtrs 2025-07-11 06:11:51 -07:00
tensor_stats.cc Add int8 quantization stats 2025-12-19 12:43:03 -08:00
tensor_stats.h Add int8 quantization stats 2025-12-19 12:43:03 -08:00
tiled_attention.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
tiled_attention.h Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
tiled_attention_test.cc Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models. 2026-02-24 03:26:49 -08:00
tokenizer.cc (Resubmit) Prepare profiler annotations for new API 2025-08-13 01:38:24 -07:00
tokenizer.h 6x large-batch, short-prompt prefill speedup 2025-06-10 09:56:20 -07:00
vit.cc Fix Gemma3 image: ensure A matrix is packed, preallocate 2025-12-01 11:47:23 -08:00
vit.h Minor: ModelWeightsPtrs -> WeightsPtrs 2025-07-11 06:11:51 -07:00
weights.cc Minor: ParallelismStrategy->Parallelism 2025-11-06 06:56:10 -08:00
weights.h Add tensor stats and output 2025-12-11 22:52:46 -08:00