Copybara-Service
f20da328de
Merge pull request #539 from prajwalc22:feature-prompt-flag
...
PiperOrigin-RevId: 750118715
2025-04-22 03:09:19 -07:00
prajwalc22
2407150f84
Merge branch 'feature-prompt-flag' of github.com:prajwalc22/gemma.cpp into feature-prompt-flag
2025-04-17 23:54:46 +05:30
prajwalc22
a9e56c27eb
removed unnecessary threading.h import
2025-04-17 23:44:23 +05:30
Prajwal Choudhari
09dfb144c0
Merge branch 'dev' into feature-prompt-flag
2025-04-17 18:53:28 +05:30
prajwalc22
f55c321397
Address review feedback: Fix prefill_tbatch_size and variable placement issues
2025-04-17 10:15:21 +05:30
prajwalc22
27c28cc938
Address review feedback: Fix prefill_tbatch_size and variable placement issues
2025-04-17 10:15:05 +05:30
Jan Wassenberg
87a658b1c6
Minor cleanup, on-demand NUQ buffer allocation
...
threading_context: add profiler
compress-inl: add constexpr, on-demand alloc NUQ buffer
gemma_py: model->gemma
Move ScaleWeights to compress.cc
Move PromptWrapping to configs.h
PiperOrigin-RevId: 748347896
2025-04-16 10:49:43 -07:00
prajwalc22
8246e49199
Add non-interactive mode support
...
- Added prompt flag to InferenceArgs for non-interactive mode
- Set user-facing options to verbosity level 1
- Fixed prompt_size declaration and variable ordering in run.cc
- Properly set prompt_size after WrapAndTokenize calls
- Moved kVerboseLogTokens block after prompt_size is set
2025-04-16 16:26:52 +05:30
prajwalc22
cbf179990f
Add --prompt flag for non-interactive mode
2025-04-16 15:34:43 +05:30
prajwalc22
716713f0e6
Update .gitignore to exclude build directory and model files
2025-04-16 09:52:30 +05:30
prajwalc22
01caf379ba
Update .gitignore to exclude build directory and model files
2025-04-16 09:45:14 +05:30
prajwalc22
87a1c76578
Update CMake configuration and documentation for --prompt flag
2025-04-16 09:45:14 +05:30
prajwalc22
f3116d2577
Add --prompt flag for non-interactive mode
...
This change adds a --prompt command-line option that allows users to
provide prompts directly without entering interactive mode, which is
useful for scripting and automation.
2025-04-16 09:45:02 +05:30
The gemma.cpp Authors
7164a5e844
Internal change.
...
PiperOrigin-RevId: 746953110
2025-04-12 20:27:49 -07:00
Jan Wassenberg
2e722f14f1
Add mmap support (not yet used)
...
Also: const-correct ArgsBase,
add assert to mat.h checking element_bytes_
BUILD deps update (:shared provides shared.h, not :sfp)
PiperOrigin-RevId: 746073312
2025-04-10 10:03:40 -07:00
Jan Wassenberg
8532da47f7
Major refactor of allocator/args:
...
use new ThreadingContext2 instead of monostate/init in each frontend
Add ThreadingArgs (replaces AppArgs)
backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride
compress_weights: remove, moving to py-only exporter instead
Move MatPtr to mat.h and revise interface:
- Generic MatOwner
- rename accessors to Packed*
- support stride/row accessors, fix RowPtr stride
Add TypeBits(Type)
Move GenerateMat to test_util-inl for sharing between matmul test/bench
Move internal init to gemma.cc to avoid duplication
Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage
Remove --compressed_weights, use --weights instead.
tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img
Allocator: use normal unique_ptr for AllocBytes so users can call directly
threading: use -> because AlignedPtr no longer assumes arrays
PiperOrigin-RevId: 745918637
2025-04-10 01:29:54 -07:00
Copybara-Service
bef91a3f03
Merge pull request #529 from ufownl:refactor/wrap_and_tokenize
...
PiperOrigin-RevId: 745174371
2025-04-08 09:22:26 -07:00
Jan Wassenberg
5d4f7e0f7e
Add new singleton Allocator2 instead of monostate
...
Not yet used.
Also fix format-string warning in topology.cc.
PiperOrigin-RevId: 745166210
2025-04-08 09:00:59 -07:00
Jan Wassenberg
4e6aa36e9b
Minor cleanup: enable 0,0 Extents2D, add SerializedSpan typedef, include fixes
...
PiperOrigin-RevId: 745068776
2025-04-08 03:35:55 -07:00
RangerUFO
cc2e14e654
Improve `GemmaChatTemplate` to handle vision prompt wrapping
2025-03-29 11:31:40 +08:00
RangerUFO
c39295f497
Inline the ctor of `GemmaChatTemplate`
2025-03-29 11:31:40 +08:00
RangerUFO
d1615b56b2
Fix the prompt wrapping of gemma3-1b again
...
It seems that the previous fix was changed back due to a merge error.
2025-03-29 11:31:39 +08:00
RangerUFO
ca4ee2b63f
Refactor `WrapAndTokenize` to work properly with Gemma3
2025-03-29 11:31:39 +08:00
Jan Wassenberg
76a81ac2d6
Fix unaligned buffer causing crash on GCC. Thanks @ufownl, fixes #508
...
PiperOrigin-RevId: 741590339
2025-03-28 11:25:33 -07:00
Jan Wassenberg
e55734219d
Fix test threshold and improve warning output
...
PiperOrigin-RevId: 740738937
2025-03-26 06:11:27 -07:00
Copybara-Service
4a924f1794
Merge pull request #527 from ufownl:feature/gemma2_secondary_eos
...
PiperOrigin-RevId: 740327973
2025-03-25 06:44:41 -07:00
RangerUFO
d42deaa27c
Set the secondary EOS for Gemma2
...
So that we can remove the `<end_of_turn>` filter that was set up
specifically for Gemma2.
2025-03-22 01:32:22 +08:00
RangerUFO
2bad79f110
Fix the EOS checking
...
The secondary EOS is usually `<end_of_turn>`, which can appear in the
prompt, so we only check for it outside the prompt.
2025-03-22 01:32:22 +08:00
Jan Wassenberg
6300c123ee
Update app argument documentation
...
PiperOrigin-RevId: 739159864
2025-03-21 06:33:30 -07:00
Phil Culliton
05b1cce9f7
Add support for a secondary EOS token
...
PiperOrigin-RevId: 738898976
2025-03-20 12:28:31 -07:00
Jan Wassenberg
83219e3c68
Add note on attention length and SFP
...
PiperOrigin-RevId: 738698399
2025-03-20 00:39:06 -07:00
pculliton
3d419ec173
Merge pull request #523 from ufownl/bugfix/gemma3_1b_wrapping
...
Fix the prompt wrapping of gemma3-1b
2025-03-19 10:30:27 -04:00
RangerUFO
b16ce9a0b4
Fix the prompt wrapping of gemma3-1b
2025-03-18 16:52:38 +08:00
Jan Wassenberg
1b72c22345
Refactor Gemma ctor and improve pool NUMA support
...
Gemma receives a MatMulEnv arg, with comment on lifetime
Split threading into topology so the latter can be used in allocator
Add AllocClasses() for non-POD (ThreadPool)
Support binding pool to NUMA node
Update threading_test with latency measurements
Also update Highway version.
PiperOrigin-RevId: 736904748
2025-03-14 10:19:00 -07:00
Phil Culliton
1b1b63d560
Fix PaliGemma models.
...
PiperOrigin-RevId: 736483021
2025-03-13 06:28:29 -07:00
Quirin Niedernhuber
0ff6b3123a
Point out Gemma 3 support in README.md
...
PiperOrigin-RevId: 736125794
2025-03-12 07:33:30 -07:00
Jan Wassenberg
5898fa5eb0
Update github actions/cache version
...
PiperOrigin-RevId: 736120661
2025-03-12 07:12:55 -07:00
Phil Culliton
4ab601da10
Internal change.
...
PiperOrigin-RevId: 736015810
2025-03-11 23:20:20 -07:00
Phil Culliton
9d83ff202e
Internal change.
...
PiperOrigin-RevId: 736014152
2025-03-11 23:10:48 -07:00
Jan Wassenberg
2bdf26d81d
Support bf16 output of Matmul
...
Adds Stride to ConstMat, to support decompression of C output for test
matmul_test: add line numbers to output
Also ignore "N is not a multiple of nc" when N==nc
PiperOrigin-RevId: 731096662
2025-02-25 17:53:20 -08:00
Jan Wassenberg
b3b4b9f92f
With new matmul, much larger batch sizes are advantageous; default to 256.
...
Can still override via command line argument.
PiperOrigin-RevId: 730502653
2025-02-24 10:21:58 -08:00
Jan Wassenberg
9a2360d719
Move batch_bench into test section, add GTest dep. Fixes #501
...
PiperOrigin-RevId: 729494223
2025-02-21 05:33:52 -08:00
Jan Wassenberg
f9d93e4a42
Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning
...
Remove empty matmul_unit_test.
Up to 25 TFLOP/s on 2xZen4 for 512,3072,24576.
PiperOrigin-RevId: 729123576
2025-02-20 08:33:46 -08:00
Apoorv Reddy
d854471ae2
Use vectorized TopK using highway VQSelect
...
PiperOrigin-RevId: 728159153
2025-02-18 05:01:39 -08:00
Apoorv Reddy
0e5b59d24d
Implements FusedSoftmaxAndSampleTopK.
...
This computes softmax on the top-K logits, instead of computing softmax first and then taking the top-K probabilities, which also avoids renormalizing. Additionally, modify softmax to apply temperature scaling if temp != 1.0.
PiperOrigin-RevId: 727702149
2025-02-16 21:30:06 -08:00
Jan Wassenberg
bdf5d25e97
Only temporarily enable spinning in threading benchmark
...
PiperOrigin-RevId: 727114863
2025-02-14 17:15:38 -08:00
Jan Wassenberg
06c70dccd9
Less verbose threading_test output, improve formatting.
...
PiperOrigin-RevId: 726364085
2025-02-13 00:56:34 -08:00
Daniel Keysers
f173aa776e
Add conversion tool for HF safetensors to gemma.cpp for PaliGemma.
...
PiperOrigin-RevId: 725990158
2025-02-12 03:47:43 -08:00
Copybara-Service
c495b25995
Merge pull request #493 from ufownl:bugfix/compress_weights_le
...
PiperOrigin-RevId: 725585921
2025-02-11 05:10:13 -08:00
Apoorv Reddy
64cf6dfe0a
Using TimingInfo methods and cleaning up args to DecodeStepT
...
PiperOrigin-RevId: 725580125
2025-02-11 04:49:14 -08:00
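
Note on the FusedSoftmaxAndSampleTopK commit (0e5b59d24d) above: the message says that applying softmax to the top-K logits avoids the renormalization needed when softmax is computed first and the top-K probabilities are taken afterwards. The sketch below illustrates why the two orders give the same distribution. It is not the gemma.cpp implementation: the function names `SoftmaxOfTopK` and `TopKOfSoftmax` are hypothetical, the code is scalar rather than vectorized, the temperature handling (dividing logits by `temperature`) is an assumption about what "temperature scaling" means here, and the sampling step is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not gemma.cpp's API): selects the k largest logits,
// then applies a temperature-scaled softmax to those k values only.
// Returns k probabilities in descending order that already sum to 1.
std::vector<float> SoftmaxOfTopK(std::vector<float> logits, std::size_t k,
                                 float temperature = 1.0f) {
  assert(k > 0 && k <= logits.size() && temperature > 0.0f);
  const auto kth = logits.begin() + static_cast<std::ptrdiff_t>(k);
  // Move the k largest logits to the front, sorted in descending order.
  std::partial_sort(logits.begin(), kth, logits.end(), std::greater<float>());
  logits.resize(k);
  const float max_logit = logits[0];  // subtract the max for stability
  float sum = 0.0f;
  for (float& x : logits) {
    x = std::exp((x - max_logit) / temperature);
    sum += x;
  }
  for (float& x : logits) x /= sum;
  return logits;
}

// Reference order for comparison: softmax over all logits, keep the k largest
// probabilities, then renormalize them so they sum to 1 again.
std::vector<float> TopKOfSoftmax(std::vector<float> logits, std::size_t k,
                                 float temperature = 1.0f) {
  assert(k > 0 && k <= logits.size() && temperature > 0.0f);
  const float max_logit = *std::max_element(logits.begin(), logits.end());
  float sum = 0.0f;
  for (float& x : logits) {
    x = std::exp((x - max_logit) / temperature);
    sum += x;
  }
  for (float& x : logits) x /= sum;
  const auto kth = logits.begin() + static_cast<std::ptrdiff_t>(k);
  std::partial_sort(logits.begin(), kth, logits.end(), std::greater<float>());
  logits.resize(k);
  float top_sum = 0.0f;
  for (float x : logits) top_sum += x;
  for (float& x : logits) x /= top_sum;  // the extra renormalization pass
  return logits;
}
```

Both functions return the same probabilities up to rounding, because the full-softmax denominator cancels during renormalization; the first order simply never computes it and touches only k exponentials.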