Commit Graph

23 Commits

Author SHA1 Message Date
Krzysztof Rymski f56d18dd68 Improvements to inference using int8 compressed kv's
Multiplication is done using int16*int16 multiplication instructions avoid expensive conversion to f32/bf16
x2 speed on zen3

PiperOrigin-RevId: 888690192
2026-03-24 08:51:30 -07:00
Krzysztof Rymski 029cfd0b33 Int8 + microscaling support for kv cache formats.
Right now multiplication is done by converting to corresponding float format.
Can yield up to 2x improvements for membw constrained shapes

PiperOrigin-RevId: 880748493
2026-03-09 02:50:08 -07:00
Ray Smith d2806fb1dd Fixed msan error by fixing padding of k_cache and v_cache
PiperOrigin-RevId: 879644219
2026-03-06 08:11:17 -08:00
Ray Smith 49cb438b1e Rollback of erroneous rollback.
PiperOrigin-RevId: 877376165
2026-03-02 06:50:26 -08:00
The gemma.cpp Authors a3d994915f No public description
PiperOrigin-RevId: 877333188
2026-03-02 04:32:29 -08:00
Ray Smith 16c1b29b89 Rewrote flash attention to use BF16, transpose k and v, rewrote the task distribution, increase parallelism on decode, and use double the registers for the core of flash attention.
PiperOrigin-RevId: 877308306
2026-03-02 03:11:01 -08:00
Krzysztof Rymski df162ead7c Implementation of tiled attention with bf16 and circular buffers which reduces memory requirements by 4x on longer context on gemma models.
It also supports better parallelism for small batch sizes / small models.
It also is able to utilize VDPBF16PS for nice 2x improvement on avx512

PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00
Krzysztof Rymski 16a7ba2d6e Internal changes
PiperOrigin-RevId: 854171429
2026-01-09 06:35:36 -08:00
Martin Stolle 9348048885 Clean up toPtrs to delegate to toPtr
PiperOrigin-RevId: 840214969
2025-12-04 06:22:04 -08:00
Krzysztof Rymski c153d5255b Internal changes
PiperOrigin-RevId: 837001762
2025-11-26 01:05:35 -08:00
Biruk Mammo 5a05857deb [Gemma.cpp] Allows non-owned arguments for attention methods.
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold  non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.

PiperOrigin-RevId: 824584177
2025-10-27 10:43:25 -07:00
Jan Wassenberg 2b4c16e243 Remove Griffin support
Also add IsObsolete helper

PiperOrigin-RevId: 803376921
2025-09-05 02:35:40 -07:00
Jan Wassenberg e76e29ce11 De-singleton ThreadingContext so callers can pass in their own
weights.cc: fix BindB argument for bf16 tensors
threading_test: enable autotune
PiperOrigin-RevId: 785763618
2025-07-22 02:08:46 -07:00
Jan Wassenberg c027a45a2e MatPtr-ify KV, shared div_seq_len, --seq_len flag
PiperOrigin-RevId: 770194455
2025-06-11 09:49:38 -07:00
Jan Wassenberg cb188d4a0e Fix RowT issue and improve Griffin (currently still broken)
Use type-safe MatPtrT via dynamic_cast, avoid/remove unsafe RowT
activations: Griffin tensors are now padded
Griffin: add batching support, fix conv1d_cache allocation
weights: bundle to TensorToRead, add kNoPad flag, fix SplitW1
const-correct fix for ForEachTensor
blob_store: move BlobIO2 to .cc and rename BlobIO
PiperOrigin-RevId: 760610094
2025-05-19 07:02:10 -07:00
Jan Wassenberg 45ad847a41 Replace RowVectorBatch with MatStorageT
KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out
optimize_test: larger vocab_size requires more steps
shared.h: Remove unused u128 type
correctly set Activation matrix rows, avoid passing as arg
ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type
mat: add OverrideRows, used by SetBatchSize
PiperOrigin-RevId: 757790736
2025-05-12 09:16:12 -07:00
Jan Wassenberg 160a5824fb Cleanup: include fixes/comments, fix leak, vector reserve
Also remove unused RowSpan
configs.cc: Assign prompt wrapping to ModelConfig
configs.h: simplify EnumValid via sentinel

PiperOrigin-RevId: 750278497
2025-04-22 12:01:46 -07:00
Apoorv Reddy 780e376023 Add KVCache.DeepCopy() . Will be useful for implementing sampling functionality like beam sampling, parallel sampling, CoT Decoding (à la https://arxiv.org/abs/2402.10200)
PiperOrigin-RevId: 725156316
2025-02-10 04:10:29 -08:00
Ray Smith 9d40f0117e Added ability to load/save a complete model file, including tokenizer.
PiperOrigin-RevId: 707914366
2024-12-19 07:59:41 -08:00
Daniel Keysers e54d9cbddd Fix Griffin model:
- use HalfRope position encodings
- zero-initialize the caches for each Generate at position 0

The lack of the latter made the tests in gemma_test dependent on each other.

PiperOrigin-RevId: 694509054
2024-11-08 08:30:53 -08:00
Ray Smith 0d68555f87 Eliminated TConfig.
Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively.
Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly.
Adjusted WeightsWrapper and ForwardLayer etc to match.
The only remaining template arg is the weight type.
This enables all the instantiations to be deleted, apart from one per type.
It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately.
Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively.

PiperOrigin-RevId: 686870060
2024-10-17 05:04:22 -07:00
Jan Wassenberg aaf51898b6 Major revamp #2 of Prefill: fix token order, parallel for multi-query
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.

PiperOrigin-RevId: 655893197
2024-07-25 03:28:55 -07:00
Jan Wassenberg 09a7e75ead Prep for sharding gemma.cc: split into kv_cache, tokenizer.
Move activations.h to backprop/ to make space for another activations.h.

PiperOrigin-RevId: 648744500
2024-07-02 09:31:06 -07:00