Commit Graph

26 Commits

Author SHA1 Message Date
Jan Wassenberg c8d92948f4 Move fields, io* and blob* from compression/ into io/
PiperOrigin-RevId: 755445712
2025-05-06 11:17:19 -07:00
Jan Wassenberg 275135d7e8 Rename-only: remove Allocator2 etc. suffixes now that refactoring is complete
PiperOrigin-RevId: 755397220
2025-05-06 09:12:43 -07:00
Jan Wassenberg 8d0882b966 Huge refactor of weight handling and model loading.
Weight handling:
- new ModelStore2 supports both pre-2025 multi-file and single-file formats
- simpler ForEachTensor with TensorArgs
- tensors are constructed with their full suffixed name

I/O:
- support mmap and stride
- Simplified SbsWriter, single insert(); add SbsReader

Misc:
- kMockTokenizer: allow creating with unavailable tokenizer
- configs.h: Simpler enum validity checks via kSentinel
- matmul.h: remove unused enable_bind (now in allocator.h)
- tensor_info: single TensorInfoRegistry class, rename from tensor_index.h

Frontends:
- Replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&)
- Deduce model/weight type, remove --model and parsing
- Replace most common.h includes with configs.h
- Remove --compressed_weights, use --weights instead
- Remove ModelInfo, replaced by ModelConfig.

Backprop:
- Reduce max loss, remove backward_scalar_test (timeout)
- Update thresholds because new RandInit changes rng eval order and thus numerics
PiperOrigin-RevId: 755317484
2025-05-06 04:44:21 -07:00
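
A minimal sketch of the new frontend flow described in the commit above: the model is constructed directly from LoaderArgs and a MatMulEnv, with the model/weight type deduced from the weights file rather than a --model flag. Only the names LoaderArgs, MatMulEnv and Gemma come from the commit message; the members and bodies below are invented for illustration.
```cpp
#include <string>

struct LoaderArgs {
  std::string weights_path;  // hypothetical member: path to single-file weights
};
struct MatMulEnv {};         // hypothetical stand-in for threading/allocator state

class Gemma {
 public:
  // Frontends construct the model directly; the model/weight type is deduced
  // from the weights file instead of being passed on the command line.
  Gemma(const LoaderArgs& args, MatMulEnv& env) { (void)args; (void)env; }
};

int main() {
  MatMulEnv env;
  LoaderArgs args{"gemma.sbs"};
  Gemma model(args, env);  // replaces the removed Allocate/CreateGemma factories
}
```
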
Jan Wassenberg 8532da47f7 Major refactor of allocator/args:
use new ThreadingContext2 instead of monostate/init in each frontend
Add ThreadingArgs (replaces AppArgs)

backprop: use Packed() accessor, MakePacked factory, and row-based access to allow for stride
compress_weights: remove, moving to py-only exporter instead

Move MatPtr to mat.h and revise interface:
- Generic MatOwner
- rename accessors to Packed*
- support stride/row accessors, fix RowPtr stride

Add TypeBits(Type)
Move GenerateMat to test_util-inl for sharing between matmul test/bench
Move internal init to gemma.cc to avoid duplication
Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage
Remove --compressed_weights, use --weights instead.
tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img
Allocator: use normal unique_ptr for AllocBytes so users can call directly
threading: use -> because AlignedPtr no longer assumes arrays
PiperOrigin-RevId: 745918637
2025-04-10 01:29:54 -07:00
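
A rough illustration of the stride/row-based access mentioned in the commit above: a row accessor that works whether or not rows are padded. The type and member names are hypothetical, not the actual mat.h interface.
```cpp
#include <cstddef>
#include <vector>

struct MatView {
  float* data;
  size_t rows, cols;
  size_t stride;  // elements between row starts; >= cols when rows are padded

  float* Row(size_t r) { return data + r * stride; }
};

int main() {
  const size_t rows = 4, cols = 6;
  const size_t stride = 8;  // e.g. padded to a SIMD-friendly multiple
  std::vector<float> storage(rows * stride);
  MatView m{storage.data(), rows, cols, stride};
  m.Row(2)[5] = 1.0f;  // row-based access works regardless of padding
}
```
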
RangerUFO 3a5a6dbcad Fix the link error when building `compress_weights` with Clang on macOS 2025-02-09 00:13:25 +08:00
Phil Culliton 7ccc6abe87 Allow conversion, loading and inference with NUQ.
PiperOrigin-RevId: 723507890
2025-02-05 07:45:54 -08:00
Daniel Keysers bcdb0d65bd Assorted small cleanups.
PiperOrigin-RevId: 720548132
2025-01-28 06:09:45 -08:00
Ray Smith b93231a47d Moved the vit config fields to their own config struct
PiperOrigin-RevId: 715692800
2025-01-15 01:09:49 -08:00
Ray Smith 9d40f0117e Added ability to load/save a complete model file, including tokenizer.
PiperOrigin-RevId: 707914366
2024-12-19 07:59:41 -08:00
Ray Smith 6254f2e5ca Removed duplicated tensor sizes from weights.h by changing the constructor used for MatPtrT
PiperOrigin-RevId: 705085054
2024-12-11 06:30:28 -08:00
Jan Wassenberg 02ce1e344f Use NestedPools, add NUMA infra
Improved threading.h, fix thread counts for single package/cluster systems
Temporarily forces use of a single socket. Prefill 29.28 tps, decode 6.92 tps.

Also fix benchmarks.cc build, update tensor allocator to Allocator

PiperOrigin-RevId: 687307167
2024-10-18 08:11:18 -07:00
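
A simplified sketch of the nested-pools idea from the commit above: work is split first across packages/clusters and then across workers within each cluster, so per-cluster work can stay on NUMA-local memory. This uses plain std::thread and invented helper names; the real NestedPools/threading.h code is considerably more elaborate.
```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void ParallelFor(size_t begin, size_t end, size_t num_threads,
                 const std::function<void(size_t)>& func) {
  std::vector<std::thread> threads;
  const size_t n = end - begin;
  for (size_t t = 0; t < num_threads; ++t) {
    threads.emplace_back([=, &func] {
      // Static partition of [begin, end) among threads.
      const size_t lo = begin + n * t / num_threads;
      const size_t hi = begin + n * (t + 1) / num_threads;
      for (size_t i = lo; i < hi; ++i) func(i);
    });
  }
  for (auto& th : threads) th.join();
}

int main() {
  const size_t kClusters = 2, kWorkersPerCluster = 4, kTasks = 64;
  // Outer level: one thread per cluster; inner level: workers within it.
  ParallelFor(0, kClusters, kClusters, [&](size_t cluster) {
    const size_t lo = kTasks * cluster / kClusters;
    const size_t hi = kTasks * (cluster + 1) / kClusters;
    ParallelFor(lo, hi, kWorkersPerCluster, [&](size_t task) {
      // ... process task, ideally on memory local to this cluster (NUMA) ...
      (void)task;
    });
  });
}
```
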
Ray Smith 0d68555f87 Eliminated TConfig.
Changed CompressedLayer and CompressedWeights to be constructed with an instance of LayerConfig and WeightsConfig, respectively.
Added CompressedModel to remove ByteStorageT, eliminate most of the type casting, and allow the default destructor to be used and work properly.
Adjusted WeightsWrapper and ForwardLayer etc to match.
The only remaining template arg is the weight type.
This enables all the instantiations to be deleted, apart from one per type.
It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately.
Reduces the size of the gemma_lib and weights shared libraries by factors of 4.3 and 3.2, respectively.

PiperOrigin-RevId: 686870060
2024-10-17 05:04:22 -07:00
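
A toy illustration of the TConfig removal described above: layer dimensions become runtime values in a config struct, and the weight element type is the only remaining template parameter, so one instantiation per type suffices. The structs below are invented for the sketch and are not the actual gemma.cpp definitions.
```cpp
#include <cstddef>
#include <vector>

struct LayerConfig {   // runtime values instead of TConfig::kModelDim etc.
  size_t model_dim;
  size_t ff_hidden_dim;
};

// Only the weight element type remains a template parameter; the shapes are
// read from the config instance passed to the constructor.
template <typename Weight>
struct Layer {
  explicit Layer(const LayerConfig& c)
      : attn_out(c.model_dim * c.model_dim),
        ffw_up(c.model_dim * c.ff_hidden_dim) {}
  std::vector<Weight> attn_out;
  std::vector<Weight> ffw_up;
};

int main() {
  LayerConfig config{2048, 16384};  // values would come from the model file
  Layer<float> layer(config);
}
```
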
Ray Smith 85958f5fd3 Added MatPtr/MatPtrT/MatStorageT/MatStorage as a dynamically-sized replacement for CompressedArray.
Definition of array size is moved to the constructor.
Allocation is separate and parallelized.
All users of weights_raw.h migrated to CompressedWeights and weights_raw.h deleted.
Replaced all previous ForEachTensor functions with a single unified function.

PiperOrigin-RevId: 684451604
2024-10-10 08:22:30 -07:00
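
A small sketch of what a single unified ForEachTensor can look like, per the commit above: a type-erased tensor descriptor plus one generic visitor that each caller parameterizes with its own functor. Names and members are illustrative only.
```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct MatPtr {   // type-erased descriptor: name plus extents
  std::string name;
  size_t rows, cols;
};

struct Weights {
  std::vector<MatPtr> tensors;

  // One generic visitor replaces the per-purpose ForEachTensor variants
  // (allocation, loading, zeroing, ...); each caller passes a different Func.
  template <class Func>
  void ForEachTensor(const Func& func) {
    for (MatPtr& t : tensors) func(t);
  }
};

int main() {
  Weights w{{{"embedding", 256000, 2048}, {"ffw_up_0", 2048, 16384}}};
  size_t total = 0;
  w.ForEachTensor([&](const MatPtr& t) { total += t.rows * t.cols; });
  std::printf("total elements: %zu\n", total);
}
```
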
Jan Wassenberg 7d9fcda0d8 -467ms startup: parallel Reshape
Also split Softmax into Argmax helper, add comments;
add profiler zones + fix IDE warning

PiperOrigin-RevId: 680954573
2024-10-01 04:11:35 -07:00
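
A scalar sketch of the Softmax/Argmax split mentioned above: the max-finding step becomes a reusable helper, and Softmax subtracts that max for numerical stability. The real code is vectorized via Highway; this shows only the idea.
```cpp
#include <cmath>
#include <cstddef>
#include <vector>

size_t Argmax(const float* x, size_t n) {
  size_t best = 0;
  for (size_t i = 1; i < n; ++i) {
    if (x[i] > x[best]) best = i;
  }
  return best;
}

void Softmax(float* x, size_t n) {
  // Subtract the max (found via the shared Argmax helper) for stability.
  const float max = x[Argmax(x, n)];
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    x[i] = std::exp(x[i] - max);
    sum += x[i];
  }
  const float inv = 1.0f / sum;
  for (size_t i = 0; i < n; ++i) x[i] *= inv;
}

int main() {
  std::vector<float> logits = {1.0f, 3.0f, 2.0f};
  Softmax(logits.data(), logits.size());
}
```
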
Jan Wassenberg 897f902d28 Fix include order, required to build with profiler enabled
PiperOrigin-RevId: 680574177
2024-09-30 07:52:50 -07:00
Jan Wassenberg b831fa8482 1.3x prefill, 0.95x decode: matmul replacing last matvec
Before: prefill 38.28 tps, decode 9.17 tps (with profiler enabled, prompt = 330 tok)
```
Gen.FFW                                 :      15414 x         4692352 = 24.166318
Gen.Attention.SumHeads                  :      15414 x         1394804 =  7.183451 !!
Gen.Embedding                           :        361 x        49961894 =  6.026297
Gen.Attention.QKV                       :      15414 x         1005125 =  5.176546
Gen.Attention.DotSoftmax                :      15414 x          885480 =  4.560357
RopeAndMulBy                            :     696528 x           11867 =  2.761818
```

After: prefill 49.80 tps, decode 8.68 tps
```
Gen.FFW                                 :      14448 x         5312783 = 25.646868
Gen.Embedding                           :        338 x        63044815 =  7.119845
Gen.Attention.QKV                       :      14448 x         1115003 =  5.382557
Gen.Attention.DotSoftmax                :      14448 x          897577 =  4.332957
RopeAndMulBy                            :     673344 x           11886 =  2.674156
Gen.Attention.SumHeads                  :      14448 x          518291 =  2.501993 !!
```
PiperOrigin-RevId: 662024085
2024-08-12 03:36:01 -07:00
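
A sketch of why replacing the last matvec with a matmul helps, as in the commit above: batching input vectors into one matrix multiply lets each weight row be reused across the whole batch instead of being re-streamed per token. The functions below are naive stand-ins, not the project's MatMul.
```cpp
#include <cstddef>
#include <vector>

// One matrix-vector product per token: each weight row is re-read from memory
// for every input vector.
void MatVec(const float* W, const float* x, float* y, size_t rows, size_t cols) {
  for (size_t r = 0; r < rows; ++r) {
    float sum = 0.0f;
    for (size_t c = 0; c < cols; ++c) sum += W[r * cols + c] * x[c];
    y[r] = sum;
  }
}

// Batched form: one matrix-matrix product over `batch` vectors, so each weight
// row stays hot in cache while it is reused across the whole batch.
void MatMulBatch(const float* W, const float* X, float* Y, size_t batch,
                 size_t rows, size_t cols) {
  for (size_t r = 0; r < rows; ++r) {
    const float* w_row = W + r * cols;
    for (size_t b = 0; b < batch; ++b) {
      const float* x = X + b * cols;
      float sum = 0.0f;
      for (size_t c = 0; c < cols; ++c) sum += w_row[c] * x[c];
      Y[b * rows + r] = sum;
    }
  }
}

int main() {
  const size_t batch = 4, rows = 8, cols = 16;
  std::vector<float> W(rows * cols, 0.5f), X(batch * cols, 1.0f), Y(batch * rows);
  MatVec(W.data(), X.data(), Y.data(), rows, cols);              // old path
  MatMulBatch(W.data(), X.data(), Y.data(), batch, rows, cols);  // new path
}
```
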
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
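
A hypothetical illustration of the scale handling above: a per-tensor scale folded into the multiply, plus an accessor that asserts the scale is 1 for call sites that cannot apply it. This is not the actual CompressedArray or MatMul interface.
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct CompressedArray {
  std::vector<float> data;  // stand-in for the compressed payload
  float scale = 1.0f;       // per-tensor dequantization scale

  // Accessor for call sites that assume no rescaling is needed.
  const float* DataUnscaled() const {
    assert(scale == 1.0f);
    return data.data();
  }
};

void MatVecScaled(const CompressedArray& w, const float* x, float* y,
                  size_t rows, size_t cols) {
  for (size_t r = 0; r < rows; ++r) {
    float sum = 0.0f;
    for (size_t c = 0; c < cols; ++c) sum += w.data[r * cols + c] * x[c];
    y[r] = w.scale * sum;  // scale folded into the product
  }
}

int main() {
  CompressedArray w{std::vector<float>(4 * 8, 1.0f), 0.5f};
  std::vector<float> x(8, 1.0f), y(4);
  MatVecScaled(w, x.data(), y.data(), 4, 8);
}
```
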
Jan Wassenberg 704d936764 Further simplification to ForEachTensor, thanks I.K.
PiperOrigin-RevId: 643996210
2024-06-17 07:12:26 -07:00
Jan Wassenberg 7d0720675f Move raw_weights into separate header, used mainly by compress_weights.
Fix warnings in backprop/* (include)

PiperOrigin-RevId: 643983136
2024-06-17 06:17:02 -07:00
The gemma.cpp Authors 7dbfa44794 Refactor CompressedWeights.
PiperOrigin-RevId: 643934198
2024-06-17 02:54:54 -07:00
Zoltan Szabadka a3a75b77f9 Use CompressedWeights<TConfig<float>> in backpropagation.
kWeightsAreCompressed is removed and LoadRawWeights is moved
to compress_weights.cc
2024-06-10 14:34:24 +00:00
Jan Wassenberg f9b390b134 Support all weight types in a single binary.
This changes the command line flags, but the default value retains the previous behavior.

Also add a CreateGemma helper to enable extra args without interface changes.

PiperOrigin-RevId: 641266411
2024-06-07 09:04:45 -07:00
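
A sketch of how a single binary can support all weight types, as described above: a runtime Type value (e.g. from the command-line flag) selects the template instantiation in one central switch. The enum values and helper names are invented for illustration.
```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

enum class Type { kF32, kBF16, kSFP };

struct BF16 { unsigned short bits; };  // placeholder 16-bit type
struct SFP8 { unsigned char bits; };   // placeholder 8-bit type

template <typename Weight>
void RunGemma(const std::string& weights_path) {
  // ... load weights of type Weight and run inference ...
  std::printf("loading %s as %zu-byte elements\n", weights_path.c_str(),
              sizeof(Weight));
}

// One central switch chooses the instantiation from the flag value.
void Dispatch(Type type, const std::string& weights_path) {
  switch (type) {
    case Type::kF32:  return RunGemma<float>(weights_path);
    case Type::kBF16: return RunGemma<BF16>(weights_path);
    case Type::kSFP:  return RunGemma<SFP8>(weights_path);
  }
  throw std::runtime_error("unknown weight type");
}

int main() { Dispatch(Type::kSFP, "gemma.sbs"); }
```
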
Zoltan Szabadka c004799cdc Add Adam optimizer.
Drive-by: Fix compilation errors and tests for backprop functions.
2024-06-06 18:41:36 +00:00
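
For reference, the standard Adam update that such an optimizer implements, in a minimal single-vector form that is not the project's exact code:
```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Adam {
  float lr = 1e-3f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f;
  std::vector<float> m, v;  // first/second moment estimates
  int t = 0;

  void Step(std::vector<float>& w, const std::vector<float>& grad) {
    if (m.empty()) { m.assign(w.size(), 0.0f); v.assign(w.size(), 0.0f); }
    ++t;
    const float bc1 = 1.0f - std::pow(beta1, t);  // bias corrections
    const float bc2 = 1.0f - std::pow(beta2, t);
    for (size_t i = 0; i < w.size(); ++i) {
      m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
      v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
      const float m_hat = m[i] / bc1;
      const float v_hat = v[i] / bc2;
      w[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
    }
  }
};

int main() {
  std::vector<float> w = {1.0f, -2.0f}, g = {0.1f, -0.3f};
  Adam adam;
  adam.Step(w, g);
}
```
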
Jan Wassenberg 57c2cd8b52 Simplifications: remove GemmaInterface and GemmaImpl
Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc

PiperOrigin-RevId: 640869202
2024-06-06 05:54:21 -07:00
Zoltan Szabadka 8567978541 Address review comments 2024-06-04 08:37:54 +00:00
Zoltan Szabadka 36e4d8bbfe Add first version of backpropagation support.
This is still in progress / experimental; currently it is only
implemented for normal Gemma MQA attention layers, and no
parallelism has been added yet for the backward pass.

Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
2024-06-04 08:37:49 +00:00