Commit Graph

467 Commits

Author SHA1 Message Date
Krzysztof Rymski 7efeb4fe06 Internal changes
PiperOrigin-RevId: 888724073
2026-03-25 05:02:01 -07:00
Krzysztof Rymski f56d18dd68 Improvements to inference using int8 compressed kv's
Multiplication is done using int16*int16 multiplication instructions to avoid expensive conversion to f32/bf16
2x speedup on Zen 3

PiperOrigin-RevId: 888690192
2026-03-24 08:51:30 -07:00
Krzysztof Rymski 8a5e37eeb7 Updates to tests to use kv_transcoding library to reduce their code size
PiperOrigin-RevId: 888600365
2026-03-24 05:06:01 -07:00
Jan Wassenberg 1dedcfd50d Warning fix: cast enum for HWY_ABORT %d
PiperOrigin-RevId: 886242788
2026-03-19 10:11:17 -07:00
Jan Wassenberg ceb70203f0 Add min_verbosity to MaybePrint
PiperOrigin-RevId: 886094998
2026-03-19 04:22:01 -07:00
Krzysztof Rymski 1a5226e5de Utilities to convert between different encodings of kv cache
PiperOrigin-RevId: 885553004
2026-03-18 06:16:32 -07:00
Jan Wassenberg 529c201eb6 Add/use MaybePrint; also ShowConfig in non-interactive builds
PiperOrigin-RevId: 882688835
2026-03-12 11:20:41 -07:00
Krzysztof Rymski 197c1a049c Fix int8
PiperOrigin-RevId: 882611833
2026-03-12 08:43:18 -07:00
The gemma.cpp Authors d6e836c651 Add phase markers to stderr for high verbosity levels.
This change introduces `[ BEGIN PHASE: ... ]` and `[ END PHASE: ... ]` messages printed to stderr when `timing_info.verbosity` is 2 or higher. These markers are added around the prefill, generate, image token generation, and final statistics phases to help in profiling and understanding the execution flow.

PiperOrigin-RevId: 882556076
2026-03-12 06:35:25 -07:00
Jan Wassenberg cab77f8dc7 Improved timing for image tokens
Move to TimingInfo, extra newline before profiler

PiperOrigin-RevId: 881943820
2026-03-11 04:47:56 -07:00
Jan Wassenberg 70cb9cf1c2 Separate profiler output for image token generation
PiperOrigin-RevId: 880895239
2026-03-09 09:26:50 -07:00
Ray Smith bea8b1cdbd Replaced attention in ViT with flash - 8x speedup of image tokenizer on AMD
PiperOrigin-RevId: 880877209
2026-03-09 08:46:04 -07:00
Krzysztof Rymski 029cfd0b33 Int8 + microscaling support for kv cache formats.
Right now multiplication is done by converting to the corresponding float format.
Can yield up to 2x improvements for memory-bandwidth-constrained shapes

PiperOrigin-RevId: 880748493
2026-03-09 02:50:08 -07:00
Ray Smith d2806fb1dd Fixed msan error by fixing padding of k_cache and v_cache
PiperOrigin-RevId: 879644219
2026-03-06 08:11:17 -08:00
Dani Ferreira Franco Moura d6c7576024 internal change
PiperOrigin-RevId: 879546918
2026-03-06 03:47:11 -08:00
Jan Wassenberg 8d9b9925be Fix VLM prefill batch size - prompt+tokens
PiperOrigin-RevId: 879159709
2026-03-05 11:21:55 -08:00
Ray Smith 79e640a956 Fixed tsan error.
PiperOrigin-RevId: 879069355
2026-03-05 07:59:38 -08:00
Krzysztof Rymski 539d9bb8e7 Change to use faster exponent function
PiperOrigin-RevId: 877981568
2026-03-03 09:16:04 -08:00
Ray Smith 49cb438b1e Rollback of erroneous rollback.
PiperOrigin-RevId: 877376165
2026-03-02 06:50:26 -08:00
The gemma.cpp Authors a3d994915f No public description
PiperOrigin-RevId: 877333188
2026-03-02 04:32:29 -08:00
Ray Smith 16c1b29b89 Rewrote flash attention to use BF16, transpose k and v, rewrote the task distribution, increase parallelism on decode, and use double the registers for the core of flash attention.
PiperOrigin-RevId: 877308306
2026-03-02 03:11:01 -08:00
Miguel Lobo f7f5fd5863 Add ability to load custom models which are fully described by the ModelConfig blob.
PiperOrigin-RevId: 877265257
2026-03-02 01:18:33 -08:00
Krzysztof Rymski bdba3bfa63 remove const to fix windows builds
PiperOrigin-RevId: 876232691
2026-02-27 06:56:54 -08:00
Viktor Shipitsin d8a123e4ec Use a struct to manage the mapping between `AttentionImpl` enum values and their string names, simplifying `GetAttentionImplName` function. Add a test to ensure all valid `AttentionImpl` enums have a corresponding name and can be looked up.
PiperOrigin-RevId: 876124604
2026-02-27 01:31:11 -08:00
Jan Wassenberg c6587efe70 Improve instrumentation for ViT parts
PiperOrigin-RevId: 875302990
2026-02-25 13:10:44 -08:00
Krzysztof Rymski df162ead7c Implementation of tiled attention with bf16 and circular buffers, which reduces memory requirements by 4x on longer contexts for Gemma models.
It also supports better parallelism for small batch sizes / small models.
It can also use VDPBF16PS for a 2x improvement on AVX-512

PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00
Krzysztof Rymski 7dc98902d3 Internal changes
PiperOrigin-RevId: 872280443
2026-02-19 01:57:58 -08:00
The gemma.cpp Authors 34739fd9f0 Internal changes
PiperOrigin-RevId: 871792281
2026-02-18 04:07:36 -08:00
Krzysztof Rymski c6696342fa Internal changes
PiperOrigin-RevId: 871776998
2026-02-18 03:21:41 -08:00
Ray Smith 76d7951242 Added wheat_from_chaff_test to test the ability of a model to find a needle in a haystack of data.
Replaced flag with attention_impl to control which attention to run.

PiperOrigin-RevId: 869694868
2026-02-13 06:05:30 -08:00
Krzysztof Rymski 7e5310b908 Internal changes
PiperOrigin-RevId: 867617121
2026-02-09 08:29:15 -08:00
Jan Wassenberg 2751a194be Fix paligemma: must subtract image tokens from prompt length
PiperOrigin-RevId: 865905454
2026-02-05 05:59:36 -08:00
Krzysztof Rymski 60eed010ba Internal changes
PiperOrigin-RevId: 862680527
2026-01-29 04:48:29 -08:00
Krzysztof Rymski 16a7ba2d6e Internal changes
PiperOrigin-RevId: 854171429
2026-01-09 06:35:36 -08:00
Jan Wassenberg 6d43d6ee19 Build fix for Arm SVE (invalid template arg on op)
PiperOrigin-RevId: 854110884
2026-01-09 02:56:03 -08:00
Balazs Racz 384c390181 Allow overriding hardcoded max_seq_len by cmdline argument seq_len.
Adds a SetMaxSeqLen method to ModelConfig to handle updating both max_seq_len and global attention window sizes. The Gemma constructor now checks if the provided inference seq_len exceeds the model's max_seq_len and, if so, emits a warning and updates the config.

This prevents clipping context to the hard-coded maximum.

PiperOrigin-RevId: 853676074
2026-01-08 04:28:59 -08:00
Krzysztof Rymski 2ee1fac74c Internal changes
PiperOrigin-RevId: 853138600
2026-01-07 01:21:37 -08:00
Jan Wassenberg 1605925d1e Add int8 quantization stats
Compute the L1 error and Shannon SNR (higher is better).

PiperOrigin-RevId: 846832280
2025-12-19 12:43:03 -08:00
Krzysztof Rymski 08a0760271 Internal changes
PiperOrigin-RevId: 846663686
2025-12-19 03:43:15 -08:00
Krzysztof Rymski b73a9ede8f Internal changes
PiperOrigin-RevId: 846648337
2025-12-19 02:46:18 -08:00
Balazs Racz 0ac55f71ed Avoid using Row() for unaligned storage.
PiperOrigin-RevId: 846214605
2025-12-18 05:10:57 -08:00
Krzysztof Rymski 6661d3a60c Internal changes
PiperOrigin-RevId: 846140314
2025-12-18 01:26:43 -08:00
Phil Culliton b8a409dbba Use hn::Sub for vector subtraction in flash attention.
PiperOrigin-RevId: 845883321
2025-12-17 12:57:34 -08:00
Balazs Racz 596bdfe5af Separate monolithic gemma_lib library into more specific cc_library targets.
Creates new cc_library targets for :attention, :tensor_stats and :activations. Eliminates cyclic dependencies between these libraries.

PiperOrigin-RevId: 845690136
2025-12-17 03:31:16 -08:00
Balazs Racz baa69dfb78 Pass the entire runtime_config into the activations constructor.
PiperOrigin-RevId: 845153671
2025-12-16 01:56:52 -08:00
Krzysztof Rymski 44dfd69b9b Internal changes
PiperOrigin-RevId: 844759322
2025-12-15 07:14:37 -08:00
Jan Wassenberg 0c64987a96 Abort if args are unrecognized, refactor argument passing
This catches typos/incorrect usage.
Refactor: group Loader/Threading/Inference into GemmaArgs.
All *Args ctors now have an extra ConsumedArgs& argument.
PiperOrigin-RevId: 844690553
2025-12-15 03:18:45 -08:00
Jan Wassenberg f50550f4ce Warning fixes (sign mismatch), switch default
PiperOrigin-RevId: 844679375
2025-12-15 02:41:19 -08:00
Martin Stolle 506fb22be7 No public description
PiperOrigin-RevId: 843665619
2025-12-12 06:37:17 -08:00
Balazs Racz 338cd8a36e Factors out a new cc_library `:query` from `:gemma-lib`.
Moves query-related structs/classes to gemma/query.h.

This refactors PerQuery, AllQueries, and QBatch into a dedicated header file, gemma/query.h, and updates BUILD dependencies accordingly.

PiperOrigin-RevId: 843604293
2025-12-12 02:53:56 -08:00