Commit Graph

303 Commits

Author SHA1 Message Date
Jan Wassenberg ec02726cf7 6x large-batch, short-prompt prefill speedup
Parallelize over queries instead of tokens
introduce non_eos so we only iterate over not yet EOS queries; remove TokenStreamer.
move RMSNormInplaceBatched out of Transformer to call the latter from prefill
Consistent arg order.

Fix gemma_test EOS handling (caught by msan), remove from tokenizer.h
Also add output to gemma_batch_bench, fix name

PiperOrigin-RevId: 769676106
2025-06-10 09:56:20 -07:00
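
The `non_eos` idea in ec02726cf7 (iterate only over queries that have not yet reached EOS) can be illustrated with a minimal sketch. This is not the gemma.cpp code: `ActiveQueries`, `DecodeStep`, and the `kEos` token id are hypothetical names chosen for illustration, and the batch is assumed to hold at most 64 queries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical EOS token id, for illustration only.
constexpr int kEos = 1;

// One bit per query; a set bit means the query is still generating.
class ActiveQueries {
 public:
  explicit ActiveQueries(size_t num_queries)
      : bits_(num_queries >= 64 ? ~uint64_t{0}
                                : ((uint64_t{1} << num_queries) - 1)) {}

  bool Any() const { return bits_ != 0; }

  // Calls func(query_idx) for every query that has not yet reached EOS.
  template <class Func>
  void Foreach(const Func& func) {
    for (size_t i = 0; i < 64; ++i) {
      if (bits_ & (uint64_t{1} << i)) func(i);
    }
  }

  // Retires a query so later steps skip it.
  void MarkEos(size_t query_idx) {
    assert(query_idx < 64);
    bits_ &= ~(uint64_t{1} << query_idx);
  }

 private:
  uint64_t bits_;
};

// One decode step over the batch. `next_token(query_idx)` stands in for the
// model; finished queries are never passed to it.
template <class NextToken>
void DecodeStep(ActiveQueries& active, const NextToken& next_token,
                std::vector<std::vector<int>>& outputs) {
  active.Foreach([&](size_t q) {
    const int token = next_token(q);
    if (token == kEos) {
      active.MarkEos(q);
    } else {
      outputs[q].push_back(token);
    }
  });
}
```

A caller would loop `while (active.Any()) DecodeStep(...)`, so the per-step cost shrinks as queries finish.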
Daniel Keysers d7b23d532a Restructure internal initialization.
PiperOrigin-RevId: 769507096
2025-06-10 01:25:31 -07:00
Jan Wassenberg 6ee628ba38 Further cleanup: separate MatMulEnv arg
move row_ptrs into MatMulEnv
Consistent arg order: layer, activations, kv_cache, env

PiperOrigin-RevId: 767886386
2025-06-05 20:48:32 -07:00
Jan Wassenberg 0e2cab5187 Avoid warning about inability to map, unless explicitly requested
PiperOrigin-RevId: 767633815
2025-06-05 09:10:08 -07:00
Jan Wassenberg 3a266c662c Split gemma-inl into separate source files
weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization.

PiperOrigin-RevId: 767562313
2025-06-05 05:36:44 -07:00
RangerUFO a82f8d5690 Fix compilation error on G++ 9.4 2025-06-04 17:39:37 +08:00
Jan Wassenberg 6897313080 3x speedup of EmbedImagePatches - GEMM, not GEMV.
Required fixes to handling of non-vector aligned A.
Also move row ptrs to MatMulEnv.

PiperOrigin-RevId: 767029036
2025-06-04 01:18:52 -07:00
Jan Wassenberg 9efdcfd45c 1.07x batch decode speedup: more BF16 weights and activations
BF16 att_sums and ffw_out
Support BF16 B views without decompression
Support arbitrary types in MulByConstAndAdd, AddFrom

Also update profiler annotations in ops-inl.h

PiperOrigin-RevId: 766995010
2025-06-03 23:30:18 -07:00
Jan Wassenberg 839a642992 Fix paligemma_test, refs #588
Detect PaliGemma models from layer names
Remove unused allocator arg from CreateInvTimescale
matmul: only warn once about dim divisibility
Print config also in tests if --verbosity 2
PiperOrigin-RevId: 766605131
2025-06-03 04:45:22 -07:00
Jan Wassenberg ad3002a21c Merge branch 'dev' into bugfix/vit_attn 2025-06-03 09:29:52 +02:00
Jan Wassenberg 794a21a4e6 Major refactor to de-templatize gemma-inl and weights
This replaces per-weight instantiations of all code with only per-MatMul/norm.
Reduces binary size by 133KiB.

WeightsOwner is no longer required for type erasing, hence it is replaced with ModelWeightsPtrs.
Also remove unused EmbedToken, replaced with EmbedMMToken.

PiperOrigin-RevId: 766497657
2025-06-02 23:01:35 -07:00
RangerUFO 93de2be938 Fix the broken VitAttention 2025-06-03 12:40:13 +08:00
Jan Wassenberg cf4d7ceb82 1.16x decode speedup: remove last MatVec in Attention
Precompute row pointers.
Remove no longer used MHA support; QStride -> qkv_dim.
Remove RowPtr from MatMul interface, use only MatPtrT.
Require opt-in define for NUQ to speed up builds.
Also fix io.cc on Windows.

PiperOrigin-RevId: 766228108
2025-06-02 09:40:29 -07:00
The gemma.cpp Authors 9c3e089b09 Internal change.
PiperOrigin-RevId: 765218260
2025-05-30 09:18:44 -07:00
The gemma.cpp Authors 1e8642f8f4 Internal change.
PiperOrigin-RevId: 765037449
2025-05-29 22:51:16 -07:00
Jan Wassenberg 3890eb5412 Remove backprop/
Also remove MatPtrT::Packed(); use PackedScale1 instead where const, or Row(0).

PiperOrigin-RevId: 764243198
2025-05-28 07:01:17 -07:00
Jan Wassenberg 627cc04db9 Decouple MatMul from gemma-inl: precompile for all input types
Call MatMulStatic instead of MatMul.

Also fix build error due to Highway's Lanes not being constexpr.

PiperOrigin-RevId: 763777269
2025-05-27 07:08:58 -07:00
Jan Wassenberg 421a2ab8ac Add comments explaining non-padded tensors, kNoPad -> kPacked
PiperOrigin-RevId: 763352173
2025-05-26 03:03:38 -07:00
RangerUFO 2771f463f9 Fix the ViT weights loading 2025-05-22 12:13:29 +08:00
RangerUFO 6debdbe341 Minor fixes for ViT 2025-05-20 22:27:10 +08:00
Jan Wassenberg cb188d4a0e Fix RowT issue and improve Griffin (currently still broken)
Use type-safe MatPtrT via dynamic_cast, avoid/remove unsafe RowT
activations: Griffin tensors are now padded
Griffin: add batching support, fix conv1d_cache allocation
weights: bundle to TensorToRead, add kNoPad flag, fix SplitW1
const-correct fix for ForEachTensor
blob_store: move BlobIO2 to .cc and rename BlobIO
PiperOrigin-RevId: 760610094
2025-05-19 07:02:10 -07:00
Jan Wassenberg e890d46f30 1.31x batch prefill, 1.24x batch decode speedup: NUMA binding
Only the weights; binding MatMul output worsens batch=1 prefill.
Update gemma_batch_bench to use --decode_qbatch.
Fix/remove prefill_activations in gemma-inl.h.

Refactor:
use BasePageBytes directly when binding
Move BindB/C to .cc by de-templatizing
Remove MatOwners::AllocateFor because it is weights-specific (binding or not)
Disband MatOwners, replace with vector
PiperOrigin-RevId: 759610477
2025-05-16 07:42:13 -07:00
Jan Wassenberg c443adee33 3.8x speedup of weights loading via preadv on Linux
Also move BlobReader reading functionality to weights.cc

PiperOrigin-RevId: 759240310
2025-05-15 11:55:15 -07:00
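
Commit c443adee33 credits `preadv` for the faster weight loading. As a rough illustration of the system call only (not the gemma.cpp loader), the sketch below fills several buffers from consecutive bytes of a file with a single call. `ReadRanges` is a hypothetical helper, and the code assumes Linux or another platform that provides `preadv` in `<sys/uio.h>`.

```cpp
#include <fcntl.h>      // open
#include <sys/types.h>  // off_t, ssize_t
#include <sys/uio.h>    // preadv, iovec
#include <unistd.h>     // close

#include <cstddef>
#include <vector>

// Hypothetical helper: fills each buffer in `buffers`, in order, from
// consecutive bytes of the file at `path` starting at `offset`.
// Returns true only if every requested byte was read.
bool ReadRanges(const char* path, off_t offset,
                std::vector<std::vector<char>>& buffers) {
  const int fd = open(path, O_RDONLY);
  if (fd < 0) return false;

  // One iovec per destination buffer; the kernel scatters the data.
  std::vector<iovec> iov;
  iov.reserve(buffers.size());
  size_t total = 0;
  for (std::vector<char>& buf : buffers) {
    iov.push_back(iovec{buf.data(), buf.size()});
    total += buf.size();
  }

  // The kernel caps the iovec count (IOV_MAX) and may return short reads;
  // a production loader would split large requests and retry.
  const ssize_t bytes_read =
      preadv(fd, iov.data(), static_cast<int>(iov.size()), offset);
  close(fd);
  return bytes_read >= 0 && static_cast<size_t>(bytes_read) == total;
}
```

Compared with one `pread` per tensor, this trades many system calls for one, which is one plausible source of a speedup of this kind.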
Jan Wassenberg 38a08d8095 Replace last ConstMat with MatPtr
This is to reduce the number of MatMul overloads in preparation for de-templatizing.

PiperOrigin-RevId: 758288589
2025-05-13 10:55:22 -07:00
RangerUFO 30ad625f42 Fix the wrapping field of the deduced model config 2025-05-13 23:02:03 +08:00
Jan Wassenberg 8a312e9b89 Split W1/W2 as a load-time preprocess.
Remove kOnlyAllocate - no longer used. Rename ReadOrAllocate -> ReadFromBlobs.
Rename Reshape -> Fixup to reflect the new scope.
Remove no longer used ShrinkRows.

This simplifies gemma-inl and is a prerequisite for removing ConstMat
(whose .ofs was previously used for merged tensors)

PiperOrigin-RevId: 758214083
2025-05-13 07:39:59 -07:00
Jan Wassenberg 2038dfd9cc Minor: rename compression/shared -> types.h
PiperOrigin-RevId: 758199851
2025-05-13 06:53:21 -07:00
Jan Wassenberg d538a6d6c6 Cleanup: remove unused kCyclic, remove 2 suffix
Also remove now unused allocator arg and fix warnings (cast, struct/class mismatch)

PiperOrigin-RevId: 758098495
2025-05-13 01:06:41 -07:00
Biruk Mammo ba21e3beb4 Adds a `GemmaAttention` constructor that takes an explicit `ThreadingContext`.
PiperOrigin-RevId: 757839682
2025-05-12 11:17:05 -07:00
Jan Wassenberg 45ad847a41 Replace RowVectorBatch with MatStorageT
KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out
optimize_test: larger vocab_size requires more steps
shared.h: Remove unused u128 type
correctly set Activation matrix rows, avoid passing as arg
ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type
mat: add OverrideRows, used by SetBatchSize
PiperOrigin-RevId: 757790736
2025-05-12 09:16:12 -07:00
Jan Wassenberg 252a4e955e Remove support for Gemma 1 and PaliGemma 1 models, superseded by (Pali)Gemma 2.
PiperOrigin-RevId: 756671308
2025-05-09 02:17:27 -07:00
Biruk Mammo d834c07042 Exposes `GemmaAttention::DotSoftmaxWeightedSum` for experimentation.
Also in this change:
* The computation for a single `q` is factored out and exposed.
* Strided `ConstMat` views into the KV caches are introduced to enable experimentation with various KV cache layouts.

PiperOrigin-RevId: 756339313
2025-05-08 09:19:04 -07:00
The gemma.cpp Authors 20757046db cleanup, new conversation methods, bugfixes
- chore: unused parameters cleaned up
- bugfix: explicitly use hwy::Span in GenerateInternal() to prevent runtime crashes due to memory layout incompatibility
- bugfix: explicit nullptr check in LogDebug
- chore: length-related parameters renamed for clarity
- feature: SaveConversation() can be optionally used to save copy of a conversation that ResetConversation() will rewind to upon request, rather than just an empty KV cache
- feature: GetCurrentConversation() can be used to query the current conversation's name

PiperOrigin-RevId: 755873147
2025-05-07 08:52:44 -07:00
Jan Wassenberg e9ecb7794d Fix gcc build error and gemma3 crash, thanks @ufownl, fixes #551
PiperOrigin-RevId: 755729478
2025-05-07 00:59:18 -07:00
Jan Wassenberg c8d92948f4 Move fields, io* and blob* from compression/ into io/
PiperOrigin-RevId: 755445712
2025-05-06 11:17:19 -07:00
Jan Wassenberg 275135d7e8 Rename-only: remove Allocator2 etc suffixes now that refactoring is complete
PiperOrigin-RevId: 755397220
2025-05-06 09:12:43 -07:00
Jan Wassenberg 8d0882b966 Huge refactor of weight handling and model loading.
Weight handling:
- new ModelStore2 supports both pre-2025 multi-file and single-file formats
- simpler ForEachTensor with TensorArgs
- tensors are constructed with their full suffixed name

I/O:
- support mmap and stride
- Simplified SbsWriter, single insert(); add SbsReader

Misc:
- kMockTokenizer: allow creating with unavailable tokenizer
- configs.h: Simpler enum validity checks via kSentinel
- matmul.h: remove unused enable_bind (now in allocator.h)
- tensor_info: single TensorInfoRegistry class, rename from tensor_index.h

Frontends:
- Replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&)
- Deduce model/weight type, remove --model and parsing
- Replace most common.h includes with configs.h
- Remove --compressed_weights, use --weights instead
- Remove ModelInfo, replaced by ModelConfig.

Backprop:
- Reduce max loss, remove backward_scalar_test (timeout)
- Update thresholds because new RandInit changes rng eval order and thus numerics
PiperOrigin-RevId: 755317484
2025-05-06 04:44:21 -07:00
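
The "enum validity checks via kSentinel" item in 8d0882b966 refers to a common pattern, which can be sketched generically as follows. The `WeightKind` enum and the `IsValid`, `FromUint`, and `Name` helpers are illustrative assumptions and are not copied from configs.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum; the last enumerator is a sentinel, not a real value.
enum class WeightKind : uint32_t { kF32 = 0, kBF16, kSFP, kSentinel };

// Valid iff the underlying value lies in [0, kSentinel). Adding an enumerator
// before kSentinel extends the valid range without touching this function.
constexpr bool IsValid(WeightKind kind) {
  return static_cast<uint32_t>(kind) <
         static_cast<uint32_t>(WeightKind::kSentinel);
}

// Converts an untrusted integer (e.g. read from a file) to the enum.
// Returns false and leaves `out` unchanged if the value is out of range.
inline bool FromUint(uint32_t value, WeightKind& out) {
  const WeightKind kind = static_cast<WeightKind>(value);
  if (!IsValid(kind)) return false;
  out = kind;
  return true;
}

// Table lookup that relies on the validity check.
inline const char* Name(WeightKind kind) {
  assert(IsValid(kind));
  static constexpr const char* kNames[] = {"f32", "bf16", "sfp"};
  return kNames[static_cast<uint32_t>(kind)];
}

static_assert(IsValid(WeightKind::kSFP), "last real enumerator is valid");
static_assert(!IsValid(WeightKind::kSentinel), "sentinel itself is invalid");
```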
Jan Wassenberg 160a5824fb Cleanup: include fixes/comments, fix leak, vector reserve
Also remove unused RowSpan
configs.cc: Assign prompt wrapping to ModelConfig
configs.h: simplify EnumValid via sentinel

PiperOrigin-RevId: 750278497
2025-04-22 12:01:46 -07:00
The gemma.cpp Authors ba10c88a94 Add C API and C# interop files
This change adds a basic C API that allows access to Gemma functionality from other programming languages. The functionality is exposed via a shared library (DLL on Windows), with C++ interfaces and a basic C# interop wrapper included.

To build the DLL, use the `windows-dll` preset, which includes the C and C++ sources as follows:
```
cmake --preset windows-dll
cmake --build --config Release --preset windows-dll -j 4
```
This should generate a `gemma.dll` in `<build-dir>/Release`.

For non-Windows targets, the equivalent shared-library linking must be set up to produce a shared library for the target OS.

PiperOrigin-RevId: 750246272
2025-04-22 10:35:47 -07:00
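
The commit message for ba10c88a94 describes exposing C++ functionality through a C API in a shared library. The general shape of such a layer is sketched below under stated assumptions: `Echo`, `DemoContext`, `DemoCreate`, `DemoGenerate`, `DemoDestroy`, and `DEMO_API` are all hypothetical names, and none of this reproduces the actual gemma.cpp C API.

```cpp
#include <cstring>
#include <string>
#include <utility>

// C++ side: an ordinary class standing in for the real model wrapper.
class Echo {
 public:
  explicit Echo(std::string prefix) : prefix_(std::move(prefix)) {}
  std::string Generate(const std::string& prompt) const {
    return prefix_ + prompt;
  }

 private:
  std::string prefix_;
};

// Export macro so the symbols are visible from the shared library.
#if defined(_WIN32)
#define DEMO_API __declspec(dllexport)
#else
#define DEMO_API __attribute__((visibility("default")))
#endif

extern "C" {
// Opaque handle: C and C# callers only ever hold a pointer to it.
typedef struct DemoContext DemoContext;

DEMO_API DemoContext* DemoCreate(const char* prefix) {
  if (prefix == nullptr) return nullptr;
  return reinterpret_cast<DemoContext*>(new Echo(prefix));
}

DEMO_API void DemoDestroy(DemoContext* ctx) {
  delete reinterpret_cast<Echo*>(ctx);
}

// Writes a NUL-terminated result into `out` (capacity `out_size` bytes).
// Returns the number of characters written, or -1 on bad arguments or if the
// buffer is too small. A real API would also catch C++ exceptions here, since
// they must not cross the C boundary.
DEMO_API int DemoGenerate(DemoContext* ctx, const char* prompt, char* out,
                          int out_size) {
  if (ctx == nullptr || prompt == nullptr || out == nullptr) return -1;
  const std::string result = reinterpret_cast<Echo*>(ctx)->Generate(prompt);
  if (static_cast<int>(result.size()) + 1 > out_size) return -1;
  std::memcpy(out, result.c_str(), result.size() + 1);
  return static_cast<int>(result.size());
}
}  // extern "C"
```

Because the exported functions use only C types and an opaque pointer, a C# wrapper could bind them with P/Invoke without knowing anything about the C++ class layout.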
prajwalc22 2407150f84 Merge branch 'feature-prompt-flag' of github.com:prajwalc22/gemma.cpp into feature-prompt-flag 2025-04-17 23:54:46 +05:30
prajwalc22 a9e56c27eb removed unnecessary threading.h import 2025-04-17 23:44:23 +05:30
Prajwal Choudhari 09dfb144c0 Merge branch 'dev' into feature-prompt-flag 2025-04-17 18:53:28 +05:30
prajwalc22 f55c321397 Address review feedback: Fix prefill_tbatch_size and variable placement issues 2025-04-17 10:15:21 +05:30
prajwalc22 27c28cc938 Address review feedback: Fix prefill_tbatch_size and variable placement issues 2025-04-17 10:15:05 +05:30
Jan Wassenberg 87a658b1c6 Minor cleanup, on-demand NUQ buffer allocation
threading_context: add profiler
compress-inl: add constexpr, on-demand alloc NUQ buffer
gemma_py: model->gemma
Move ScaleWeights to compress.cc
Move PromptWrapping to configs.h
PiperOrigin-RevId: 748347896
2025-04-16 10:49:43 -07:00
prajwalc22 8246e49199 Add non-interactive mode support
- Added prompt flag to InferenceArgs for non-interactive mode
- Set user-facing options to verbosity level 1
- Fixed prompt_size declaration and variable ordering in run.cc
- Properly set prompt_size after WrapAndTokenize calls
- Moved kVerboseLogTokens block after prompt_size is set
2025-04-16 16:26:52 +05:30
prajwalc22 cbf179990f Add --prompt flag for non-interactive mode 2025-04-16 15:34:43 +05:30
prajwalc22 f3116d2577 Add --prompt flag for non-interactive mode
This change adds a --prompt command-line option that allows users to
provide prompts directly without entering interactive mode, which is
useful for scripting and automation.
2025-04-16 09:45:02 +05:30
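
The behavior described in f3116d2577 (use the prompt from the command line if given, otherwise stay interactive) can be sketched as below. `FindPromptFlag` is a hypothetical helper written for illustration; it does not reflect how gemma.cpp parses its arguments, and it assumes C++17 for `std::optional`.

```cpp
#include <optional>
#include <string>

// Returns the value following "--prompt", or nullopt if the flag is absent or
// has no value. A caller would run once with the returned prompt, or fall
// back to the interactive loop when nothing is returned.
std::optional<std::string> FindPromptFlag(int argc, const char* const* argv) {
  for (int i = 1; i + 1 < argc; ++i) {
    if (std::string(argv[i]) == "--prompt") return std::string(argv[i + 1]);
  }
  return std::nullopt;
}
```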
The gemma.cpp Authors 7164a5e844 Internal change.
PiperOrigin-RevId: 746953110
2025-04-12 20:27:49 -07:00
Jan Wassenberg 8532da47f7 Major refactor of allocator/args:
use new ThreadingContext2 instead of monostate/init in each frontend
Add ThreadingArgs (replaces AppArgs)

backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride
compress_weights: remove, moving to py-only exporter instead

Move MatPtr to mat.h and revise interface:
- Generic MatOwner
- rename accessors to Packed*
- support stride/row accessors, fix RowPtr stride

Add TypeBits(Type)
Move GenerateMat to test_util-inl for sharing between matmul test/bench
Move internal init to gemma.cc to avoid duplication
Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage
Remove --compressed_weights, use --weights instead.
tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img
Allocator: use normal unique_ptr for AllocBytes so users can call directly
threading: use -> because AlignedPtr no longer assumes arrays
PiperOrigin-RevId: 745918637
2025-04-10 01:29:54 -07:00