Commit Graph

851 Commits

Author SHA1 Message Date
Jan Wassenberg 794a21a4e6 Major refactor to de-templatize gemma-inl and weights
This replaces per-weight instantiations of all code with only per-MatMul/norm.
Reduces binary size by 133KiB.

WeightsOwner is no longer required for type erasing, hence it is replaced with ModelWeightsPtrs.
Also remove unused EmbedToken, replaced with EmbedMMToken.

PiperOrigin-RevId: 766497657
2025-06-02 23:01:35 -07:00
RangerUFO 93de2be938 Fix the broken VitAttention 2025-06-03 12:40:13 +08:00
Jan Wassenberg cf4d7ceb82 1.16x decode speedup: remove last MatVec in Attention
Precompute row pointers.
Remove no longer used MHA support; QStride -> qkv_dim.
Remove RowPtr from MatMul interface, use only MatPtrT.
Require opt-in define for NUQ to speed up builds.
Also fix io.cc on Windows.

PiperOrigin-RevId: 766228108
2025-06-02 09:40:29 -07:00
Jan Wassenberg c4a75abe43 Cleanup gemma_batch_bench
PiperOrigin-RevId: 766177406
2025-06-02 07:04:36 -07:00
Jan Wassenberg a3f7bf0991 Fix thread name when skipping packages/clusters
PiperOrigin-RevId: 766054198
2025-06-01 23:50:11 -07:00
Jan Wassenberg 0023ff8770 Add support for arbitrary output row pointers
Useful for writing directly to KV cache.

PiperOrigin-RevId: 765615147
2025-05-31 10:55:54 -07:00
The gemma.cpp Authors 9c3e089b09 Internal change.
PiperOrigin-RevId: 765218260
2025-05-30 09:18:44 -07:00
The gemma.cpp Authors 1e8642f8f4 Internal change.
PiperOrigin-RevId: 765037449
2025-05-29 22:51:16 -07:00
Jan Wassenberg 3890eb5412 Remove backprop/
Also remove MatPtrT::Packed(); use PackedScale1 instead where const, or Row(0).

PiperOrigin-RevId: 764243198
2025-05-28 07:01:17 -07:00
Jan Wassenberg 627cc04db9 Decouple MatMul from gemma-inl: precompile for all input types
Call MatMulStatic instead of MatMul.

Also fix build error due to Highway's Lanes not being constexpr.

PiperOrigin-RevId: 763777269
2025-05-27 07:08:58 -07:00
Jan Wassenberg 421a2ab8ac Add comments explaining non-padded tensors, kNoPad -> kPacked
PiperOrigin-RevId: 763352173
2025-05-26 03:03:38 -07:00
Copybara-Service eb8a463038 Merge pull request #574 from ufownl:bugfix/vit_weights
PiperOrigin-RevId: 761948356
2025-05-22 07:04:53 -07:00
RangerUFO 2771f463f9 Fix the ViT weights loading 2025-05-22 12:13:29 +08:00
Copybara-Service 1ce89788ef Merge pull request #573 from ufownl:bugfix/vit
PiperOrigin-RevId: 761425663
2025-05-21 01:58:00 -07:00
RangerUFO 6debdbe341 Minor fixes for ViT 2025-05-20 22:27:10 +08:00
Jan Wassenberg cb188d4a0e Fix RowT issue and improve Griffin (currently still broken)
Use type-safe MatPtrT via dynamic_cast, avoid/remove unsafe RowT
activations: Griffin tensors are now padded
Griffin: add batching support, fix conv1d_cache allocation
weights: bundle to TensorToRead, add kNoPad flag, fix SplitW1
const-correct fix for ForEachTensor
blob_store: move BlobIO2 to .cc and rename BlobIO
PiperOrigin-RevId: 760610094
2025-05-19 07:02:10 -07:00
Jan Wassenberg d6cfabc2c1 Shorten gemma_test so we can run it for more models.
PiperOrigin-RevId: 759685282
2025-05-16 11:14:41 -07:00
Jan Wassenberg e890d46f30 1.31x batch prefill, 1.24x batch decode speedup: NUMA binding
Only the weights; binding MatMul output worsens batch=1 prefill.
Update gemma_batch_bench to use --decode_qbatch.
Fix/remove prefill_activations in gemma-inl.h.

Refactor:
use BasePageBytes directly when binding
Move BindB/C to .cc by de-templatizing
Remove MatOwners::AllocateFor because it is weights-specific (binding or not)
Disband MatOwners, replace with vector
PiperOrigin-RevId: 759610477
2025-05-16 07:42:13 -07:00
Jan Wassenberg c443adee33 3.8x speedup of weights loading via preadv on Linux
Also move BlobReader reading functionality to weights.cc

PiperOrigin-RevId: 759240310
2025-05-15 11:55:15 -07:00
Jan Wassenberg 38a08d8095 Replace last ConstMat with MatPtr
This is to reduce the number of MatMul overloads in preparation for de-templatizing.

PiperOrigin-RevId: 758288589
2025-05-13 10:55:22 -07:00
Copybara-Service 0a6a7e4cd6 Merge pull request #566 from ufownl:bugfix/deduced_model_wrapping
PiperOrigin-RevId: 758276145
2025-05-13 10:28:16 -07:00
RangerUFO 30ad625f42 Fix the wrapping field of the deduced model config 2025-05-13 23:02:03 +08:00
Jan Wassenberg 8a312e9b89 Split W1/W2 as a load-time preprocess.
Remove kOnlyAllocate - no longer used. Rename ReadOrAllocate -> ReadFromBlobs.
Rename Reshape -> Fixup to reflect the new scope.
Remove no longer used ShrinkRows.

This simplifies gemma-inl and is a prerequisite for removing ConstMat
(whose .ofs was previously used for merged tensors)

PiperOrigin-RevId: 758214083
2025-05-13 07:39:59 -07:00
Jan Wassenberg 2038dfd9cc Minor: rename compression/shared -> types.h
PiperOrigin-RevId: 758199851
2025-05-13 06:53:21 -07:00
Jan Wassenberg d538a6d6c6 Cleanup: remove unused kCyclic, remove 2 suffix
Also remove now unused allocator arg and fix warnings (cast, struct/class mismatch)

PiperOrigin-RevId: 758098495
2025-05-13 01:06:41 -07:00
Biruk Mammo ba21e3beb4 Adds a `GemmaAttention` constructor that takes an explicit `ThreadingContext`.
PiperOrigin-RevId: 757839682
2025-05-12 11:17:05 -07:00
Jan Wassenberg 45ad847a41 Replace RowVectorBatch with MatStorageT
KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out
optimize_test: larger vocab_size requires more steps
shared.h: Remove unused u128 type
correctly set Activation matrix rows, avoid passing as arg
ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type
mat: add OverrideRows, used by SetBatchSize
PiperOrigin-RevId: 757790736
2025-05-12 09:16:12 -07:00
Jan Wassenberg cf7dd80c17 Minor: mark command line flags as required
PiperOrigin-RevId: 757775369
2025-05-12 08:30:44 -07:00
Jan Wassenberg 252a4e955e Remove support for Gemma 1 and PaliGemma 1 models, superseded by (Pali)Gemma 2.
PiperOrigin-RevId: 756671308
2025-05-09 02:17:27 -07:00
Biruk Mammo d834c07042 Exposes `GemmaAttention::DotSoftmaxWeightedSum` for experimentation.
Also in this change:
* The computation for a single `q` is factored out and exposed.
* Strided `ConstMat` views into the KV caches are introduced to enable experimentation with various KV cache layouts.

PiperOrigin-RevId: 756339313
2025-05-08 09:19:04 -07:00
Jan Wassenberg a0ff98ea60 Entirely remove constexpr on PaddedDirEnd. Refs #551
Apparently GCC 9.4 does not handle HWY_CXX17_CONSTEXPR as we intend.

PiperOrigin-RevId: 755967709
2025-05-07 12:48:19 -07:00
Biruk Mammo d9d1709df8 Updates stale references to `compression/migrate_weights`.
PiperOrigin-RevId: 755938143
2025-05-07 11:33:59 -07:00
The gemma.cpp Authors 20757046db cleanup, new conversation methods, bugfixes
- chore: unused parameters cleaned up
- bugfix: explicitly use hwy::Span in GenerateInternal() to prevent runtime crashes due to memory layout incompatibility
- bugfix: explicit nullptr check in LogDebug
- chore: length-related parameters renamed for clarity
- feature: SaveConversation() can be optionally used to save copy of a conversation that ResetConversation() will rewind to upon request, rather than just an empty KV cache
- feature: GetCurrentConversation() can be used to query the current conversation's name

PiperOrigin-RevId: 755873147
2025-05-07 08:52:44 -07:00
Jan Wassenberg e9ecb7794d Fix gcc build error and gemma3 crash, thanks @ufownl, fixes #551
PiperOrigin-RevId: 755729478
2025-05-07 00:59:18 -07:00
Jan Wassenberg c8d92948f4 Move fields, io* and blob* from compression/ into io/
PiperOrigin-RevId: 755445712
2025-05-06 11:17:19 -07:00
Jan Wassenberg 275135d7e8 Rename-only: remove Allocator2 etc suffixes now that refactoring is complete
PiperOrigin-RevId: 755397220
2025-05-06 09:12:43 -07:00
Jan Wassenberg 8d0882b966 Huge refactor of weight handling and model loading.
Weight handling:
- new ModelStore2 supports both pre-2025 multi-file and single-file formats
- simpler ForEachTensor with TensorArgs
- tensors are constructed with their full suffixed name

I/O:
- support mmap and stride
- Simplified SbsWriter, single insert(); add SbsReader

Misc:
- kMockTokenizer: allow creating with unavailable tokenizer
- configs.h: Simpler enum validity checks via kSentinel
- matmul.h: remove unused enable_bind (now in allocator.h)
- tensor_info: single TensorInfoRegistry class, rename from tensor_index.h

Frontends:
- Replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&)
- Deduce model/weight type, remove --model and parsing
- Replace most common.h includes with configs.h
- Remove --compressed_weights, use --weights instead
- Remove ModelInfo, replaced by ModelConfig.

Backprop:
- Reduce max loss, remove backward_scalar_test (timeout)
- Update thresholds because new RandInit changes rng eval order and thus numerics
PiperOrigin-RevId: 755317484
2025-05-06 04:44:21 -07:00
Jan Wassenberg a3caf6e5d2 Add summary of optimizations/infra present in the repository
PiperOrigin-RevId: 754838402
2025-05-05 01:46:01 -07:00
Jan Wassenberg fe80f10ed7 Backprop test fixes and allocator cleanup
- Shorten backprop tests to prevent timeout
- Add line number of failing test
- matmul: remove unused enable_bind
- allocator: we will retain enable_bind there
- mat: disable cyclic padding optimization (broken)

PiperOrigin-RevId: 752656068
2025-04-29 03:01:10 -07:00
Jan Wassenberg 160a5824fb Cleanup: include fixes/comments, fix leak, vector reserve
Also remove unused RowSpan
configs.cc: Assign prompt wrapping to ModelConfig
configs.h: simplify EnumValid via sentinel

PiperOrigin-RevId: 750278497
2025-04-22 12:01:46 -07:00
The gemma.cpp Authors ba10c88a94 Add C API and C# interop files
This change adds a basic C API that allows access to Gemma functionality from other programming languages. The functionality is exposed via a shared library (DLL on Windows), with C++ interfaces and a basic C# interop wrapper included.

To build the DLL, use the `windows-dll` preset, which includes the C and C++ sources as follows:
```
cmake --preset windows-dll
cmake --build --config Release --preset windows-dll -j 4
```
This should generate a `gemma.dll` in `<build-dir>/Release`.

To build for non-Windows, the appropriate C++ DLL linking will need to be done to generate a shared library for the target OS.

PiperOrigin-RevId: 750246272
2025-04-22 10:35:47 -07:00
Copybara-Service f20da328de Merge pull request #539 from prajwalc22:feature-prompt-flag
PiperOrigin-RevId: 750118715
2025-04-22 03:09:19 -07:00
prajwalc22 2407150f84 Merge branch 'feature-prompt-flag' of github.com:prajwalc22/gemma.cpp into feature-prompt-flag 2025-04-17 23:54:46 +05:30
prajwalc22 a9e56c27eb removed unnecessary threading.h import 2025-04-17 23:44:23 +05:30
Prajwal Choudhari 09dfb144c0
Merge branch 'dev' into feature-prompt-flag 2025-04-17 18:53:28 +05:30
prajwalc22 f55c321397 Address review feedback: Fix prefill_tbatch_size and variable placement issues 2025-04-17 10:15:21 +05:30
prajwalc22 27c28cc938 Address review feedback: Fix prefill_tbatch_size and variable placement issues 2025-04-17 10:15:05 +05:30
Jan Wassenberg 87a658b1c6 Minor cleanup, on-demand NUQ buffer allocation
threading_context: add profiler
compress-inl: add constexpr, on-demand alloc NUQ buffer
gemma_py: model->gemma
Move ScaleWeights to compress.cc
Move PromptWrapping to configs.h
PiperOrigin-RevId: 748347896
2025-04-16 10:49:43 -07:00
prajwalc22 8246e49199 Add non-interactive mode support
- Added prompt flag to InferenceArgs for non-interactive mode
- Set user-facing options to verbosity level 1
- Fixed prompt_size declaration and variable ordering in run.cc
- Properly set prompt_size after WrapAndTokenize calls
- Moved kVerboseLogTokens block after prompt_size is set
2025-04-16 16:26:52 +05:30
prajwalc22 cbf179990f Add --prompt flag for non-interactive mode 2025-04-16 15:34:43 +05:30