Ensure the streaming func pos matches the number of calls.
Add two arguments that control pos+1 and pos+=1 behavior.
Also cleanup/add comments.
run: use batch_stream_func, add assert, higher verbosity for MM autotune output
PiperOrigin-RevId: 799511163
Linux terminals truncate input after 4096 chars.
testdata is Frankenstein from project Gutenberg, which are long out of copyright.
Also fix loss of coherence after long context caused by incorrect IsGlobalLayer.
Move that to config.h and use max_seq_len as the initializer to make this clear.
Also avoid dynamic allocation for GriffinActivations.
PiperOrigin-RevId: 772333225
Fix missing pos increment for last prefill and check that in gemma_test.
Thanks to @ufownl for pointing this out.
Change argument lists to QBatch with accessors.
Increase default seq_len to 8k.
PiperOrigin-RevId: 771937385
BF16 att_sums and ffw_out
Support BF16 B views without decompression
Support arbitrary types in MulByConstAndAdd, AddFrom
Also update profiler annotations in ops-inl.h
PiperOrigin-RevId: 766995010
KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out
optimize_test: larger vocab_size requires more steps
shared.h: Remove unused u128 type
correctly set Activation matrix rows, avoid passing as arg
ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type
mat: add OverrideRows, used by SetBatchSize
PiperOrigin-RevId: 757790736
Weight handling:
- new ModelStore2 supports both pre-2025 multi-file and single-file formats
- simpler ForEachTensor with TensorArgs
- tensors are constructed with their full suffixed name
I/O:
- support mmap and stride
- Simplified SbsWriter, single insert(); add SbsReader
Misc:
- kMockTokenizer: allow creating with unavailable tokenizer
- configs.h: Simpler enum validity checks via kSentinel
- matmul.h: remove unused enable_bind (now in allocator.h)
- tensor_info: single TensorInfoRegistry class, rename from tensor_index.h
Frontends:
- Replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&)
- Deduce model/weight type, remove --model and parsing
- Replace most common.h includes with configs.h
- Remove --compressed_weights, use --weights instead
- Remove ModelInfo, replaced by ModelConfig.
Backprop:
- Reduce max loss, remove backward_scalar_test (timeout)
- Update thresholds because new RandInit changes rng eval order and thus numerics
PiperOrigin-RevId: 755317484
This change adds a basic C API that allows access to Gemma functionality from other programming languages. The functionality is exposed via a shared library (DLL on Windows), with C++ interfaces and a basic C# interop wrapper included.
To build the DLL, use the `windows-dll` preset, which includes the C and C++ sources as follows:
```
cmake --preset windows-dll
cmake --build --config Release --preset windows-dll -j 4
```
This should generate a `gemma.dll` in `<build-dir>/Release`.
To build for non-Windows, the appropriate C++ DLL linking will need to be done to generate a shared library for the target OS.
PiperOrigin-RevId: 750246272
- Added prompt flag to InferenceArgs for non-interactive mode
- Set user-facing options to verbosity level 1
- Fixed prompt_size declaration and variable ordering in run.cc
- Properly set prompt_size after WrapAndTokenize calls
- Moved kVerboseLogTokens block after prompt_size is set
This change adds a --prompt command-line option that allows users to
provide prompts directly without entering interactive mode, which is
useful for scripting and automation.
use new ThreadingContext2 instead of monostate/init in each frontend
Add ThreadingArgs(replaces AppArgs)
backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride
compress_weights: remove, moving to py-only exporter instead
Move MatPtr to mat.h and revise interface:
- Generic MatOwner
- rename accessors to Packed*
- support stride/row accessors, fix RowPtr stride
Add TypeBits(Type)
Move GenerateMat to test_util-inl for sharing between matmul test/bench
Move internal init to gemma.cc to avoid duplication
Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage
Remove --compressed_weights, use --weights instead.
tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img
Allocator: use normal unique_ptr for AllocBytes so users can call directly
threading: use -> because AlignedPtr no longer assumes arrays
PiperOrigin-RevId: 745918637
Gemma receives a MatMulEnv arg, with comment on lifetime
Split threading into topology so the latter can be used in allocator
Add AllocClasses() for non-POD (ThreadPool)
Support binding pool to NUMA node
Update threading_test with latency measurements
Also update Highway version.
PiperOrigin-RevId: 736904748
Add Extents2D, Range2D vocab types
Matmul uses ConstMat for inputs and RowPtr for output
Move RowVectorBatch to basics.h
Separate threading.cc
Fix topology string: report cores not LPs, and #HT
Move QStride/IsMHA into LayerConfig
ImageTokens does not require make_unique.
matmul_test: no longer require template args
PiperOrigin-RevId: 692963605
Use ModelConfig values for ImageTokens.
Output timing info for image token generation.
Add a method to copy image data into Image class directly.
Minor changes: pipe ModelTraining to more places.
PiperOrigin-RevId: 690572283
Improved threading.h, fix thread counts for single package/cluster systems
Temporarily forces to a single socket. Prefill 29.28 tps, decode 6.92.
Also fix benchmarks.cc build, update tensor allocator to Allocator
PiperOrigin-RevId: 687307167
Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively.
Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly.
Adjusted WeightsWrapper and ForwardLayer etc to match.
The only remaining template arg is the weight type.
This enables all the instantiations to be deleted, apart from one per type.
It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately.
Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively.
PiperOrigin-RevId: 686870060
Use in dot_test
app.h: add new flags and rename num_threads to max_threads
matmul: Parallelize MatMulSlow and enable spinning, more large/fewer medium test cases
PiperOrigin-RevId: 683216386
See https://arxiv.org/abs/2407.07726 for a description of the model.
Because PaliGemma operates as a prefix-LM on the image+prompt, add support for that.
PiperOrigin-RevId: 677841119