Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor
PiperOrigin-RevId: 822934530
Fix missing pos increment for last prefill and check that in gemma_test.
Thanks to @ufownl for pointing this out.
Change argument lists to QBatch with accessors.
Increase default seq_len to 8k.
PiperOrigin-RevId: 771937385
BF16 att_sums and ffw_out
Support BF16 B views without decompression
Support arbitrary types in MulByConstAndAdd, AddFrom
Also update profiler annotations in ops-inl.h
PiperOrigin-RevId: 766995010
This replaces per-weight instantiations of all code with only per-MatMul/norm.
Reduces binary size by 133KiB.
WeightsOwner is no longer required for type erasing, hence it is replaced with ModelWeightsPtrs.
Also remove unused EmbedToken, replaced with EmbedMMToken.
PiperOrigin-RevId: 766497657
KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out
optimize_test: larger vocab_size requires more steps
shared.h: Remove unused u128 type
correctly set Activation matrix rows, avoid passing as arg
ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type
mat: add OverrideRows, used by SetBatchSize
PiperOrigin-RevId: 757790736
Weight handling:
- new ModelStore2 supports both pre-2025 multi-file and single-file formats
- simpler ForEachTensor with TensorArgs
- tensors are constructed with their full suffixed name
I/O:
- support mmap and stride
- Simplified SbsWriter, single insert(); add SbsReader
Misc:
- kMockTokenizer: allow creating with unavailable tokenizer
- configs.h: Simpler enum validity checks via kSentinel
- matmul.h: remove unused enable_bind (now in allocator.h)
- tensor_info: single TensorInfoRegistry class, rename from tensor_index.h
Frontends:
- Replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&)
- Deduce model/weight type, remove --model and parsing
- Replace most common.h includes with configs.h
- Remove --compressed_weights, use --weights instead
- Remove ModelInfo, replaced by ModelConfig.
Backprop:
- Reduce max loss, remove backward_scalar_test (timeout)
- Update thresholds because new RandInit changes rng eval order and thus numerics
PiperOrigin-RevId: 755317484
This change adds a basic C API that allows access to Gemma functionality from other programming languages. The functionality is exposed via a shared library (DLL on Windows), with C++ interfaces and a basic C# interop wrapper included.
To build the DLL, use the `windows-dll` preset, which includes the C and C++ sources as follows:
```
cmake --preset windows-dll
cmake --build --config Release --preset windows-dll -j 4
```
This should generate a `gemma.dll` in `<build-dir>/Release`.
To build for non-Windows, the appropriate C++ DLL linking will need to be done to generate a shared library for the target OS.
PiperOrigin-RevId: 750246272
use new ThreadingContext2 instead of monostate/init in each frontend
Add ThreadingArgs(replaces AppArgs)
backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride
compress_weights: remove, moving to py-only exporter instead
Move MatPtr to mat.h and revise interface:
- Generic MatOwner
- rename accessors to Packed*
- support stride/row accessors, fix RowPtr stride
Add TypeBits(Type)
Move GenerateMat to test_util-inl for sharing between matmul test/bench
Move internal init to gemma.cc to avoid duplication
Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage
Remove --compressed_weights, use --weights instead.
tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img
Allocator: use normal unique_ptr for AllocBytes so users can call directly
threading: use -> because AlignedPtr no longer assumes arrays
PiperOrigin-RevId: 745918637
Gemma receives a MatMulEnv arg, with comment on lifetime
Split threading into topology so the latter can be used in allocator
Add AllocClasses() for non-POD (ThreadPool)
Support binding pool to NUMA node
Update threading_test with latency measurements
Also update Highway version.
PiperOrigin-RevId: 736904748
ops.h: move CreateInvTimescale to allow calling without depending on gemma
Pass around MatMulEnv instead of pools to avoid re-creating the env
profiler.h can now be used outside SIMD code
allocator: add StepBytes and QuantumSteps
rename worker thread with package/cluster in the name
threading: add Visit* to IndexRange
PiperOrigin-RevId: 718766704
Add Extents2D, Range2D vocab types
Matmul uses ConstMat for inputs and RowPtr for output
Move RowVectorBatch to basics.h
Separate threading.cc
Fix topology string: report cores not LPs, and #HT
Move QStride/IsMHA into LayerConfig
ImageTokens does not require make_unique.
matmul_test: no longer require template args
PiperOrigin-RevId: 692963605
Improved threading.h, fix thread counts for single package/cluster systems
Temporarily forces to a single socket. Prefill 29.28 tps, decode 6.92.
Also fix benchmarks.cc build, update tensor allocator to Allocator
PiperOrigin-RevId: 687307167
Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively.
Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly.
Adjusted WeightsWrapper and ForwardLayer etc to match.
The only remaining template arg is the weight type.
This enables all the instantiations to be deleted, apart from one per type.
It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately.
Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively.
PiperOrigin-RevId: 686870060
add app.h comment
compress-inl: remove unused typedef
gemma-inl: add missing HWY_ATTR and cast
separate sum-inl.h and basics.h headers
replace more hwy::bfloat16_t with BF16
update include pragmas
update dot_test thresholds
update Highway version in Bazel for HWY_RCAST_ALIGNED fix
PiperOrigin-RevId: 684464326
See https://arxiv.org/abs/2407.07726 for a description of the model.
Because PaliGemma operates as a prefix-LM on the image+prompt, add support for that.
PiperOrigin-RevId: 677841119
Use args only in GemmaEnv constructor, store everything else in RuntimeConfig.
Add runtime option to turn off thread spinning.
PiperOrigin-RevId: 670467320
LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs.
Instead, we directly expose the Activations structure.
PiperOrigin-RevId: 663409316
Limit thread counts to detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.
PiperOrigin-RevId: 659755311