Commit Graph

94 Commits

Author SHA1 Message Date
Jan Wassenberg 3ed403e287 Major cleanup of profiler zones, add Caller annotation for all pool.Run
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor

PiperOrigin-RevId: 822934530
2025-10-23 01:54:24 -07:00
Phil Culliton 503aaddd65 Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
Ray Smith fb6fa793f4 Added a global (to gemma) zones list to enable most call sites to PROFILER_ZONE3 to avoid the sychronization required for the static const initialization of the zone handle.
Improved flash_attention to enable profiling using the new zones.

PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
Jan Wassenberg 035273c184 tune pool kSpin mode in threading_context
Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes.

threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order
Also enable SPR target (Zen4 is AMD-only),
update Highway version for renamed Thread()->GlobalIdx().
PiperOrigin-RevId: 816223017
2025-10-07 08:36:26 -07:00
Jan Wassenberg 501fdf000e Remove no longer used MatVec
PiperOrigin-RevId: 809059409
2025-09-19 09:03:22 -07:00
Ray Smith f10ac41a20 Added flash attention, with both a single-q function, and a register-tiled function.
The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine.

PiperOrigin-RevId: 804913784
2025-09-09 08:05:26 -07:00
Jan Wassenberg 2b4c16e243 Remove Griffin support
Also add IsObsolete helper

PiperOrigin-RevId: 803376921
2025-09-05 02:35:40 -07:00
Jan Wassenberg afd82376a5 Add AES-CTR RNG for parallel sampling (not yet used)
PiperOrigin-RevId: 802991142
2025-09-04 05:58:42 -07:00
Junji Hashimoto 41321611fd feature: add API server and client with Google protocol 2025-08-21 11:32:48 +09:00
Jan Wassenberg 71406cf6d0 More profiler interface fixes: hwy:: plus avoid ADD_ZONE
PiperOrigin-RevId: 794493165
2025-08-13 03:15:48 -07:00
Jan Wassenberg faa4102992 (Resubmit) Prepare profiler annotations for new API
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.

PiperOrigin-RevId: 794461159
2025-08-13 01:38:24 -07:00
The gemma.cpp Authors a2d9133f7d Prepare profiler annotations for new API
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.

PiperOrigin-RevId: 793865287
2025-08-11 17:51:38 -07:00
Jan Wassenberg 4cbf63e6f0 Prepare profiler annotations for new API
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.

PiperOrigin-RevId: 793821255
2025-08-11 15:34:52 -07:00
Jan Wassenberg e1585ecaf5 Update Highway version to get NEON bf16 fix
https://github.com/google/highway/pull/2598

PiperOrigin-RevId: 774664346
2025-06-23 01:25:01 -07:00
Jan Wassenberg 4f5785b0fd Update instrumentation for new Highway wall-time profiler
Pass the thread index through and use new zone_id.

PiperOrigin-RevId: 773344242
2025-06-19 07:46:04 -07:00
Jan Wassenberg 1665ecc5c2 Remove CMake max version, fixes #623
PiperOrigin-RevId: 773265809
2025-06-19 02:30:03 -07:00
Jan Wassenberg 6ee628ba38 Further cleanup: separate MatMulEnv arg
move row_ptrs into MatMulEnv
Consistent arg order: layer, activations, kv_cache, env

PiperOrigin-RevId: 767886386
2025-06-05 20:48:32 -07:00
Jan Wassenberg 3a266c662c Split gemma-inl into separate source files
weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization.

PiperOrigin-RevId: 767562313
2025-06-05 05:36:44 -07:00
Jan Wassenberg 794a21a4e6 Major refactor to de-templatize gemma-inl and weights
This replaces per-weight instantiations of all code with only per-MatMul/norm.
Reduces binary size by 133KiB.

WeightsOwner is no longer required for type erasing, hence it is replaced with ModelWeightsPtrs.
Also remove unused EmbedToken, replaced with EmbedMMToken.

PiperOrigin-RevId: 766497657
2025-06-02 23:01:35 -07:00
Jan Wassenberg 3890eb5412 Remove backprop/
Also remove MatPtrT::Packed(); use PackedScale1 instead where const, or Row(0).

PiperOrigin-RevId: 764243198
2025-05-28 07:01:17 -07:00
Jan Wassenberg 627cc04db9 Decouple MatMul from gemma-inl: precompile for all input types
Call MatMulStatic instead of MatMul.

Also fix build error due to Highway's Lanes not being constexpr.

PiperOrigin-RevId: 763777269
2025-05-27 07:08:58 -07:00
Jan Wassenberg 2038dfd9cc Minor: rename compression/shared -> types.h
PiperOrigin-RevId: 758199851
2025-05-13 06:53:21 -07:00
Jan Wassenberg c8d92948f4 Move fields, io* and blob* from compression/ into io/
PiperOrigin-RevId: 755445712
2025-05-06 11:17:19 -07:00
Jan Wassenberg 8d0882b966 Huge refactor of weight handling and model loading.
Weight handling:
- new ModelStore2 supports both pre-2025 multi-file and single-file formats
- simpler ForEachTensor with TensorArgs
- tensors are constructed with their full suffixed name

I/O:
- support mmap and stride
- Simplified SbsWriter, single insert(); add SbsReader

Misc:
- kMockTokenizer: allow creating with unavailable tokenizer
- configs.h: Simpler enum validity checks via kSentinel
- matmul.h: remove unused enable_bind (now in allocator.h)
- tensor_info: single TensorInfoRegistry class, rename from tensor_index.h

Frontends:
- Replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&)
- Deduce model/weight type, remove --model and parsing
- Replace most common.h includes with configs.h
- Remove --compressed_weights, use --weights instead
- Remove ModelInfo, replaced by ModelConfig.

Backprop:
- Reduce max loss, remove backward_scalar_test (timeout)
- Update thresholds because new RandInit changes rng eval order and thus numerics
PiperOrigin-RevId: 755317484
2025-05-06 04:44:21 -07:00
The gemma.cpp Authors ba10c88a94 Add C API and C# interop files
This change adds a basic C API that allows access to Gemma functionality from other programming languages. The functionality is exposed via a shared library (DLL on Windows), with C++ interfaces and a basic C# interop wrapper included.

To build the DLL, use the `windows-dll` preset, which includes the C and C++ sources as follows:
```
cmake --preset windows-dll
cmake --build --config Release --preset windows-dll -j 4
```
This should generate a `gemma.dll` in `<build-dir>/Release`.

To build for non-Windows, the appropriate C++ DLL linking will need to be done to generate a shared library for the target OS.

PiperOrigin-RevId: 750246272
2025-04-22 10:35:47 -07:00
prajwalc22 87a1c76578 Update CMake configuration and documentation for --prompt flag 2025-04-16 09:45:14 +05:30
Jan Wassenberg 8532da47f7 Major refactor of allocator/args:
use new ThreadingContext2 instead of monostate/init in each frontend
Add ThreadingArgs(replaces AppArgs)

backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride
compress_weights: remove, moving to py-only exporter instead

Move MatPtr to mat.h and revise interface:
- Generic MatOwner
- rename accessors to Packed*
- support stride/row accessors, fix RowPtr stride

Add TypeBits(Type)
Move GenerateMat to test_util-inl for sharing between matmul test/bench
Move internal init to gemma.cc to avoid duplication
Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage
Remove --compressed_weights, use --weights instead.
tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img
Allocator: use normal unique_ptr for AllocBytes so users can call directly
threading: use -> because AlignedPtr no longer assumes arrays
PiperOrigin-RevId: 745918637
2025-04-10 01:29:54 -07:00
Jan Wassenberg 1b72c22345 Refactor Gemma ctor and improve pool NUMA support
Gemma receives a MatMulEnv arg, with comment on lifetime
Split threading into topology so the latter can be used in allocator
Add AllocClasses() for non-POD (ThreadPool)
Support binding pool to NUMA node
Update threading_test with latency measurements
Also update Highway version.

PiperOrigin-RevId: 736904748
2025-03-14 10:19:00 -07:00
Jan Wassenberg 9a2360d719 Move batch_bench into test section, add GTest dep. Fixes #501
PiperOrigin-RevId: 729494223
2025-02-21 05:33:52 -08:00
Jan Wassenberg f9d93e4a42 Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning
Remove empty matmul_unit_test.
Up to 25 TFLOP/s on 2xZen4 for 512,3072,24576.

PiperOrigin-RevId: 729123576
2025-02-20 08:33:46 -08:00
Jan Wassenberg a60b564b88 Infra improvements (2)
ops.h: move CreateInvTimescale to allow calling without depending on gemma
Pass around MatMulEnv instead of pools to avoid re-creating the env
profiler.h can now be used outside SIMD code
allocator: add StepBytes and QuantumSteps
rename worker thread with package/cluster in the name
threading: add Visit* to IndexRange
PiperOrigin-RevId: 718766704
2025-01-23 01:55:19 -08:00
RangerUFO 20e5ef6d2e Add the missing `migrate_weights` target for CMake 2025-01-17 18:56:43 +08:00
Jan Wassenberg 6a34e9c547 Print cache info and update Highway version for that
PiperOrigin-RevId: 702318451
2024-12-03 06:31:52 -08:00
Jan Wassenberg f74d496879 Threading/infra improvements.
* Add Parallelize*Range helpers and partitioning helpers
* Refactor Pinning class, store original affinity (required to construct another NestedPools after pinning happened)

Compress:
* prevent Compress printing stats in tests
* zero-pad tensors

Matmul:
* add matmul_unit_test (TODO) and bench_matmul
* matmul_test: change norm to row vectors (that is what is added) and include bf16 rounding error
* Prepare for L2/L3 retrieval
PiperOrigin-RevId: 700603811
2024-11-27 01:12:00 -08:00
Stanko Novakovic 109a4d9f85 Add a simple benchmark for batching.
This is a simple Gemma benchmark with a fixed batch size of 32.

PiperOrigin-RevId: 698843573
2024-11-21 10:59:49 -08:00
Ray Smith 73640d2521 Added tensor_index as a single source of truth on tensor shapes/sources and transformations
PiperOrigin-RevId: 697903886
2024-11-19 00:25:39 -08:00
Jan Wassenberg 868b01601f Simpler MatMul interface, vocab types, Tristate for use_spinning
Add Extents2D, Range2D vocab types
Matmul uses ConstMat for inputs and RowPtr for output
Move RowVectorBatch to basics.h
Separate threading.cc
Fix topology string: report cores not LPs, and #HT
Move QStride/IsMHA into LayerConfig
ImageTokens does not require make_unique.
matmul_test: no longer require template args
PiperOrigin-RevId: 692963605
2024-11-04 07:48:29 -08:00
Jan Wassenberg 52af531820 Serialization for class members for use with ModelConfig
PiperOrigin-RevId: 689720027
2024-10-25 03:12:34 -07:00
Paul Chang 4976066095 Try disabling benchmark's gtest integration
PiperOrigin-RevId: 689010657
2024-10-23 10:12:45 -07:00
Paul Chang 4197d69dfc New blob_store_test, ensure ReadOne checks actual size against requested size
PiperOrigin-RevId: 688974390
2024-10-23 08:30:46 -07:00
Jan Wassenberg 02ce1e344f Use NestedPools, add NUMA infra
Improved threading.h, fix thread counts for single package/cluster systems
Temporarily forces to a single socket. Prefill 29.28 tps, decode 6.92.

Also fix benchmarks.cc build, update tensor allocator to Allocator

PiperOrigin-RevId: 687307167
2024-10-18 08:11:18 -07:00
Ray Smith 0d68555f87 Eliminated TConfig.
Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively.
Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly.
Adjusted WeightsWrapper and ForwardLayer etc to match.
The only remaining template arg is the weight type.
This enables all the instantiations to be deleted, apart from one per type.
It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately.
Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively.

PiperOrigin-RevId: 686870060
2024-10-17 05:04:22 -07:00
Jan Wassenberg 6ab3ff5bde Minor cleanup, Windows+Bazel build fixes
add app.h comment
compress-inl: remove unused typedef
gemma-inl: add missing HWY_ATTR and cast
separate sum-inl.h and basics.h headers
replace more hwy::bfloat16_t with BF16
update include pragmas
update dot_test thresholds
update Highway version in Bazel for HWY_RCAST_ALIGNED fix
PiperOrigin-RevId: 684464326
2024-10-10 09:05:06 -07:00
Ray Smith 85958f5fd3 Added MatPtr/MatPtrT/MatStorageT/MatStorage as a dynamically-sized replacement for CompressedArray.
Definition of array size is moved to the constructor.
Allocation is separate and parallelized.
All users of weights_raw.h migrated to CompressedWeights and weights_raw.h deleted.
Replaced all previous ForEachTensor functions with a single unified function.

PiperOrigin-RevId: 684451604
2024-10-10 08:22:30 -07:00
Jan Wassenberg 2c28b18eb0 Add NestedPools: one per socket/cluster
Use in dot_test
app.h: add new flags and rename num_threads to max_threads
matmul: Parallelize MatMulSlow and enable spinning, more large/fewer medium test cases
PiperOrigin-RevId: 683216386
2024-10-07 09:40:19 -07:00
Daniel Keysers f8835fe4a4 Add support for PaliGemma Vision-LM (224x224) to gemma.cpp
See https://arxiv.org/abs/2407.07726 for a description of the model.
Because PaliGemma operates as a prefix-LM on the image+prompt, add support for that.

PiperOrigin-RevId: 677841119
2024-09-23 10:09:38 -07:00
Jan Wassenberg 8c0a8834c1 Major compression update, arbitrary-len unpack + new Dot
Compression:
* Implement {any packed} x {bf16, f32} 'Load2' and DecompressAndZeroPad
* New compression test for all packed formats, add to GEMMA_TEST_FILES, remove from sfp/nuq_test
* Decompress->DecompressAndZeroPad, use PackedSpan for args with bounds checking
* NUQ: support arbitrary-length enc/dec
* New compression/shared, remove sfp.h and nuq.h
* Move Store2 into Traits and provide Compress2 wrapper
* Remove unused Decompress()-with-pool overload
* Simplify CompressedArrayLen, rename to CompressedArrayElements
* Remove unused DistortionStats b_l1_

Misc:
* Add compensated and Kahan dot, support any length
* Use same Dot function everywhere
* Move exact arithmetic functions into fp_arith
* use FloatPtr and MatPtr typedefs in tests; less stack usage
* Rename args to packed/raw
* Remove Traits::Name, instead TypeName<T>()
* Move kMaxSFP and kClusters/kGroupSize into Sfp/NuqStream
PiperOrigin-RevId: 672868468
2024-09-10 02:22:19 -07:00
Jan Wassenberg c29e9752c7 Refactor/cleanup, remove even_odd
* New compression/shared.h, remove sfp.h
* Remove unused DistortionStats b_l1_
* Move exact arithmetic functions into fp_arith
* Remove even_odd optimization for MatVec (mostly unused)
* use BF16 typedef more widely
* Add kMaxSFP constant

PiperOrigin-RevId: 670996386
2024-09-04 09:25:13 -07:00
Jan Wassenberg 4033ed9e78 Avoid duplication of RMSNorm, support all activation/weight types
Add test for RMSNorm
Rename VectorizedRopeAndMulBy -> RopeAndMulBy

Move test_util to util/

PiperOrigin-RevId: 668332927
2024-08-28 01:26:55 -07:00
Jan Wassenberg 2308514e5a Experiment with compensated dot product.
ULP difference vs exact is 0..1, vs 200-5000 for previous.
Runtime overhead is 2.5-4x for f32 input.

PiperOrigin-RevId: 668084019
2024-08-27 12:05:35 -07:00