Jan Wassenberg
8532da47f7
Major refactor of allocator/args:
...
use new ThreadingContext2 instead of monostate/init in each frontend
Add ThreadingArgs (replaces AppArgs)
backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride
compress_weights: remove, moving to py-only exporter instead
Move MatPtr to mat.h and revise interface:
- Generic MatOwner
- rename accessors to Packed*
- support stride/row accessors, fix RowPtr stride
Add TypeBits(Type)
Move GenerateMat to test_util-inl for sharing between matmul test/bench
Move internal init to gemma.cc to avoid duplication
Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage
Remove --compressed_weights, use --weights instead.
tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img
Allocator: use normal unique_ptr for AllocBytes so users can call directly
threading: use -> because AlignedPtr no longer assumes arrays
PiperOrigin-RevId: 745918637
2025-04-10 01:29:54 -07:00
Copybara-Service
bef91a3f03
Merge pull request #529 from ufownl:refactor/wrap_and_tokenize
...
PiperOrigin-RevId: 745174371
2025-04-08 09:22:26 -07:00
Jan Wassenberg
5d4f7e0f7e
Add new singleton Allocator2 instead of monostate
...
Not yet used.
Also fix format-string warning in topology.cc.
PiperOrigin-RevId: 745166210
2025-04-08 09:00:59 -07:00
Jan Wassenberg
4e6aa36e9b
Minor cleanup: enable 0,0 Extents2D, add SerializedSpan typedef, include fixes
...
PiperOrigin-RevId: 745068776
2025-04-08 03:35:55 -07:00
RangerUFO
cc2e14e654
Improve `GemmaChatTemplate` to handle vision prompt wrapping
2025-03-29 11:31:40 +08:00
RangerUFO
c39295f497
Inline the ctor of `GemmaChatTemplate`
2025-03-29 11:31:40 +08:00
RangerUFO
d1615b56b2
Fix the prompt wrapping of gemma3-1b again
...
It seems that the previous fix was changed back due to a merge error.
2025-03-29 11:31:39 +08:00
RangerUFO
ca4ee2b63f
Refactor `WrapAndTokenize` to work properly with Gemma3
2025-03-29 11:31:39 +08:00
Jan Wassenberg
76a81ac2d6
Fix unaligned buffer causing crash on GCC. Thanks @ufownl, fixes #508
...
PiperOrigin-RevId: 741590339
2025-03-28 11:25:33 -07:00
Jan Wassenberg
e55734219d
Fix test threshold and improve warning output
...
PiperOrigin-RevId: 740738937
2025-03-26 06:11:27 -07:00
Copybara-Service
4a924f1794
Merge pull request #527 from ufownl:feature/gemma2_secondary_eos
...
PiperOrigin-RevId: 740327973
2025-03-25 06:44:41 -07:00
RangerUFO
d42deaa27c
Set the secondary EOS for Gemma2
...
So that we can remove the `<end_of_turn>` filter that was set up
specifically for Gemma2.
2025-03-22 01:32:22 +08:00
RangerUFO
2bad79f110
Fix the EOS checking
...
The secondary EOS is usually `<end_of_turn>`, which can appear in the
prompt, so we only check for it outside the prompt.
2025-03-22 01:32:22 +08:00
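A minimal sketch of the check described in the commit above, not the gemma.cpp code. `EosConfig`, `IsEos`, `pos` and `prompt_size` are hypothetical names, and treating the primary EOS as a stop token at every position is an assumption:

```cpp
#include <cstddef>

// Hypothetical container for the two stop-token ids (illustration only).
struct EosConfig {
  int primary_eos;
  int secondary_eos;
};

// Returns true if `token` at position `pos` should end generation.
// Assumption: positions [0, prompt_size) belong to the prompt.
bool IsEos(const EosConfig& cfg, int token, size_t pos, size_t prompt_size) {
  // Assumption: the primary EOS is honored at every position.
  if (token == cfg.primary_eos) return true;
  // The secondary EOS may legitimately occur inside the prompt,
  // so it only counts once we are past the prompt.
  const bool in_prompt = pos < prompt_size;
  return !in_prompt && token == cfg.secondary_eos;
}
```

With a 5-token prompt, a secondary EOS at position 2 is ignored, while the same token at position 5 stops generation.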
Jan Wassenberg
6300c123ee
Update app argument documentation
...
PiperOrigin-RevId: 739159864
2025-03-21 06:33:30 -07:00
Phil Culliton
05b1cce9f7
Add support for a secondary EOS token
...
PiperOrigin-RevId: 738898976
2025-03-20 12:28:31 -07:00
Jan Wassenberg
83219e3c68
Add note on attention length and SFP
...
PiperOrigin-RevId: 738698399
2025-03-20 00:39:06 -07:00
pculliton
3d419ec173
Merge pull request #523 from ufownl/bugfix/gemma3_1b_wrapping
...
Fix the prompt wrapping of gemma3-1b
2025-03-19 10:30:27 -04:00
RangerUFO
b16ce9a0b4
Fix the prompt wrapping of gemma3-1b
2025-03-18 16:52:38 +08:00
Jan Wassenberg
1b72c22345
Refactor Gemma ctor and improve pool NUMA support
...
Gemma receives a MatMulEnv arg, with comment on lifetime
Split threading into topology so the latter can be used in allocator
Add AllocClasses() for non-POD (ThreadPool)
Support binding pool to NUMA node
Update threading_test with latency measurements
Also update Highway version.
PiperOrigin-RevId: 736904748
2025-03-14 10:19:00 -07:00
Phil Culliton
1b1b63d560
Fix PaliGemma models.
...
PiperOrigin-RevId: 736483021
2025-03-13 06:28:29 -07:00
Quirin Niedernhuber
0ff6b3123a
Point out Gemma 3 support in README.md
...
PiperOrigin-RevId: 736125794
2025-03-12 07:33:30 -07:00
Jan Wassenberg
5898fa5eb0
Update github actions/cache version
...
PiperOrigin-RevId: 736120661
2025-03-12 07:12:55 -07:00
Phil Culliton
4ab601da10
Internal change.
...
PiperOrigin-RevId: 736015810
2025-03-11 23:20:20 -07:00
Phil Culliton
9d83ff202e
Internal change.
...
PiperOrigin-RevId: 736014152
2025-03-11 23:10:48 -07:00
Jan Wassenberg
2bdf26d81d
Support bf16 output of Matmul
...
Adds Stride to ConstMat, to support decompression of C output for test
matmul_test: add line numbers to output
Also ignore "N is not a multiple of nc" when N==nc
PiperOrigin-RevId: 731096662
2025-02-25 17:53:20 -08:00
Jan Wassenberg
b3b4b9f92f
With new matmul, much larger batch sizes are advantageous, default to 256.
...
Can still override via command line argument.
PiperOrigin-RevId: 730502653
2025-02-24 10:21:58 -08:00
Jan Wassenberg
9a2360d719
Move batch_bench into test section, add GTest dep. Fixes #501
...
PiperOrigin-RevId: 729494223
2025-02-21 05:33:52 -08:00
Jan Wassenberg
f9d93e4a42
Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning
...
Remove empty matmul_unit_test.
Up to 25 TFLOP/s on 2xZen4 for 512,3072,24576.
PiperOrigin-RevId: 729123576
2025-02-20 08:33:46 -08:00
Apoorv Reddy
d854471ae2
Use vectorized TopK using highway VQSelect
...
PiperOrigin-RevId: 728159153
2025-02-18 05:01:39 -08:00
Apoorv Reddy
0e5b59d24d
Implements FusedSoftmaxAndSampleTopK.
...
This computes softmax on the top-K logits, instead of computing softmax first and then getting the top-K probabilities, which also avoids renormalizing. Additionally, modify softmax to apply temperature scaling if temp != 1.0.
PiperOrigin-RevId: 727702149
2025-02-16 21:30:06 -08:00
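The ordering described in the commit above can be sketched as follows. This is a scalar illustration under stated assumptions, not the FusedSoftmaxAndSampleTopK implementation: `SoftmaxOverTopK` is a hypothetical helper, it uses `std::partial_sort` for the selection, and the sampling step is omitted:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns the k largest logits as (token index,
// probability) pairs. The probabilities are a softmax over those k logits
// only, after dividing by `temperature`, so no renormalization is needed.
std::vector<std::pair<size_t, float>> SoftmaxOverTopK(
    const std::vector<float>& logits, size_t k, float temperature) {
  assert(k >= 1 && k <= logits.size());
  assert(temperature > 0.0f);

  // Select the indices of the k largest logits (ties go to the lower index).
  std::vector<size_t> idx(logits.size());
  for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
  std::partial_sort(idx.begin(),
                    idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                    [&](size_t a, size_t b) {
                      return logits[a] > logits[b] ||
                             (logits[a] == logits[b] && a < b);
                    });

  // Softmax restricted to the selected logits. Subtracting the maximum
  // keeps exp() from overflowing.
  const float max_logit = logits[idx[0]];
  std::vector<std::pair<size_t, float>> result(k);
  float sum = 0.0f;
  for (size_t i = 0; i < k; ++i) {
    const float e = std::exp((logits[idx[i]] - max_logit) / temperature);
    result[i] = {idx[i], e};
    sum += e;
  }
  for (auto& r : result) r.second /= sum;
  return result;
}
```

Because the exponentials are computed for only k entries, the full-vocabulary softmax and the later renormalization over the top-K set are both skipped. A sampler would then draw from the returned pairs.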
Jan Wassenberg
bdf5d25e97
Only temporarily enable spinning in threading benchmark
...
PiperOrigin-RevId: 727114863
2025-02-14 17:15:38 -08:00
Jan Wassenberg
06c70dccd9
Less verbose threading_test output, improve formatting.
...
PiperOrigin-RevId: 726364085
2025-02-13 00:56:34 -08:00
Daniel Keysers
f173aa776e
Add conversion tool for HF safetensors to gemma.cpp for PaliGemma.
...
PiperOrigin-RevId: 725990158
2025-02-12 03:47:43 -08:00
Copybara-Service
c495b25995
Merge pull request #493 from ufownl:bugfix/compress_weights_le
...
PiperOrigin-RevId: 725585921
2025-02-11 05:10:13 -08:00
Apoorv Reddy
64cf6dfe0a
Using TimingInfo methods and cleaning up args to DecodeStepT
...
PiperOrigin-RevId: 725580125
2025-02-11 04:49:14 -08:00
Jan Wassenberg
953c877658
Fix nuq Enc() to handle groups < kGroupSize.
...
Also remove no longer required dynamic allocation.
PiperOrigin-RevId: 725203824
2025-02-10 07:17:59 -08:00
Jan Wassenberg
5563d94811
Add fork/join latency benchmark
...
PiperOrigin-RevId: 725174042
2025-02-10 05:23:44 -08:00
Apoorv Reddy
780e376023
Add KVCache.DeepCopy(). Will be useful for implementing sampling functionality like beam sampling, parallel sampling, CoT decoding (à la https://arxiv.org/abs/2402.10200)
...
PiperOrigin-RevId: 725156316
2025-02-10 04:10:29 -08:00
Apoorv Reddy
9b3e7ea8a2
Factor out DecodeStepT from GenerateT into a separate function.
...
This will be useful for adding sampling functionality like beam decoding, parallel sampling, and CoT decoding (as described in the [Chain-of-Thought Reasoning Without Prompting paper](https://arxiv.org/abs/2402.10200)).
PiperOrigin-RevId: 725151530
2025-02-10 03:53:08 -08:00
Jan Wassenberg
b0fe9a43e6
Further speed up blob_compare: single alloc, use dual sockets
...
PiperOrigin-RevId: 724947361
2025-02-09 10:53:49 -08:00
RangerUFO
3a5a6dbcad
Fix the link error when building `compress_weights` with Clang on macOS
2025-02-09 00:13:25 +08:00
Jan Wassenberg
b18bd781f6
Windows build fixes: struct vs class, unused arg/var, avoid VLA, Deleter arg, casts
...
PiperOrigin-RevId: 724340518
2025-02-07 07:38:55 -08:00
Oleh Prypin
82ca526c0c
Remove `srcs_version` and `python_version` attributes, as they already default to `"PY3"`
...
PiperOrigin-RevId: 724122259
2025-02-06 16:51:11 -08:00
Jan Wassenberg
f31e12e63b
Improved blob diff: parallel, tolerance for float
...
PiperOrigin-RevId: 724060325
2025-02-06 13:46:28 -08:00
Jan Wassenberg
9f5159ff68
Public visibility for compression/
...
PiperOrigin-RevId: 723529541
2025-02-05 08:53:51 -08:00
Phil Culliton
7ccc6abe87
Allow conversion, loading and inference with NUQ.
...
PiperOrigin-RevId: 723507890
2025-02-05 07:45:54 -08:00
Phil Culliton
8a6edff319
Base interleaved handling for 4.5-bit NUQ, specifically Enc, DecompressAndZeroPad, and Dec2. Includes tests.
...
PiperOrigin-RevId: 721821577
2025-01-31 10:35:32 -08:00
Phil Culliton
23dac72463
Simplified interface class and example for Gemma.cpp usage.
...
PiperOrigin-RevId: 720591037
2025-01-28 08:48:27 -08:00
Daniel Keysers
7af2e70321
Add python wrappers for configs and inference.
...
Enable building compression/python/compression_test using bazel.
Add default image path for image_test and paligemma_test.
PiperOrigin-RevId: 720583438
2025-01-28 08:22:03 -08:00
Daniel Keysers
bcdb0d65bd
Assorted small cleanups.
...
PiperOrigin-RevId: 720548132
2025-01-28 06:09:45 -08:00