Slower because we now init tiles of C and accumulate into them.
Also remove unused var in optimize_test and use BF16 typedef.
PiperOrigin-RevId: 662115916
Split attention into functions, move into class.
Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV.
Sink if() into MaybeLogitsSoftCap.
PiperOrigin-RevId: 661168418
Limit thread counts to detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.
PiperOrigin-RevId: 659755311
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.
PiperOrigin-RevId: 655893197
accept_token: allow default, check if empty when using
allow mixing sample_func and stream_func, call the latter after the former
Also fix missing includes/deps.
PiperOrigin-RevId: 642240012
This changes the command line flags, but the default value retains the previous behavior.
Also add a CreateGemma helper to enable extra args without interface changes.
PiperOrigin-RevId: 641266411
With this addition the ComputeCrossEntropy function can be moved
to its own library, because now we can compute it using only the
public API functions from gemma.h
Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc
PiperOrigin-RevId: 640869202