Limit thread counts to detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.
PiperOrigin-RevId: 659755311
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.
PiperOrigin-RevId: 655893197
Helper functions to tokenize/wrap
Move LayersOutputFunc into RuntimeConfig
AcceptFunc passes the probability
Implement StringFromType using the parser, and verify results match
PiperOrigin-RevId: 643255119
accept_token: allow default, check if empty when using
allow mixing sample_func and stream_func, call the latter after the former
Also fix missing includes/deps.
PiperOrigin-RevId: 642240012
This changes the command line flags, but the default value retains the previous behavior.
Also add a CreateGemma helper to enable extra args without interface changes.
PiperOrigin-RevId: 641266411
Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc
PiperOrigin-RevId: 640869202
We only used inner_pool in the prefill FFW function, and there we
can achieve sufficient parallelism on the rows of the matrix-vector
multiplications.
Benchmark results on a 1600-token summarization task:
```
Prefill speed
Num threads BEFORE AFTER
4 9.24 t/s 9.76 t/s
18 31.41 t/s 31.16 t/s
32 31.41 t/s 45.13 t/s
64 31.03 t/s 57.85 t/s
```