794 Commits

Author SHA1 Message Date
Jan Wassenberg fea9a07d9b Avoid affinity related warnings on Apple. Refs #625
PiperOrigin-RevId: 778895832
2025-07-03 08:22:31 -07:00
Jan Wassenberg e1585ecaf5 Update Highway version to get NEON bf16 fix
https://github.com/google/highway/pull/2598

PiperOrigin-RevId: 774664346
2025-06-23 01:25:01 -07:00
Jan Wassenberg a04cc287b2 Move MatMulEnv out of Gemma to enable concurrent calls
Also update benchmark_helper config print: add profiler, remove free mem

PiperOrigin-RevId: 774662974
2025-06-23 01:20:09 -07:00
Jan Wassenberg 0f70f285e0 1.1x prefill and decode speedup (attention/activations)
Optimizations
- Better load-balancing in attention threading
(Previously, clusters were limited by #heads)
- Add MulByConstTo to avoid zero-init
- Parallel activations

Cleanup
- Prepare for RowPtr in A or B
- Pass through thread_id to ops
- Avoid warning in bench_matmul

PiperOrigin-RevId: 773723423
2025-06-20 08:59:53 -07:00
Jan Wassenberg 7630ec0c92 batch_bench tweak: more output
PiperOrigin-RevId: 773670580
2025-06-20 06:09:18 -07:00
Jan Wassenberg 4f5785b0fd Update instrumentation for new Highway wall-time profiler
Pass the thread index through and use new zone_id.

PiperOrigin-RevId: 773344242
2025-06-19 07:46:04 -07:00
Jan Wassenberg 1665ecc5c2 Remove CMake max version, fixes #623
PiperOrigin-RevId: 773265809
2025-06-19 02:30:03 -07:00
Jan Wassenberg 834cbe5b39 linkstatic in most tests/binaries, remove fully_static_link
Also decrease "eternal" timeout to "long".

Add 2x/4x larger subsections of Frankenstein (from Gutenberg)

PiperOrigin-RevId: 773252901
2025-06-19 01:45:53 -07:00
Jan Wassenberg 7f62c2606e Fix bf16 KV recompression and Rope(), fixes #608
Also add more helpful error message for prompt > seq_len

Also update ops_test, adding coverage for Rope().

PiperOrigin-RevId: 772945644
2025-06-18 09:14:20 -07:00
Biruk Mammo 88284387db Reduce warning noise.
PiperOrigin-RevId: 772941142
2025-06-18 09:01:40 -07:00
Jan Wassenberg 343482c7ef 1.02x batch decode speedup: BF16 KV cache
ops-inl.h: Vectorize Rope(), template
Remove unused MulBy, and extra-arg overloads of MulByConst and Softmax
Fix for DecompressAndZeroPad: ensure second vector filled

PiperOrigin-RevId: 772779163
2025-06-17 23:21:59 -07:00
Mukund Aggarwal 606e22155a Gemma CPP: move PaliGemma tests' helper to a separate class
This makes it possible to use PaliGemma functionality directly for inference by providing only the tokenizer and weight paths.

Added @mukundagg to allowed authors list.

PiperOrigin-RevId: 772705238
2025-06-17 18:37:24 -07:00
Jan Wassenberg f2adbfbcab Batch inference fixes: set pos during prefill, fix assert
PiperOrigin-RevId: 772458760
2025-06-17 07:09:44 -07:00
Jan Wassenberg d342e4e7d4 Also add CMAKE_CXX_STANDARD in examples' CMake files
PiperOrigin-RevId: 772454497
2025-06-17 06:53:54 -07:00
Jan Wassenberg cd80d8b24d Speed up builds by skipping rarely used targets
Centralize previous code into GEMMA_DISABLED_TARGETS

PiperOrigin-RevId: 772433723
2025-06-17 05:44:20 -07:00
Jan Wassenberg 9a02d6be68 Add --prompt_file and testdata for it. Refs #608
Linux terminals truncate input after 4096 chars.
testdata is Frankenstein from Project Gutenberg, which is long out of copyright.

Also fix loss of coherence after long context caused by incorrect IsGlobalLayer.
Move that to config.h and use max_seq_len as the initializer to make this clear.

Also avoid dynamic allocation for GriffinActivations.

PiperOrigin-RevId: 772333225
2025-06-16 23:41:07 -07:00
Jan Wassenberg 31d2b231af Update PaliGemma Kaggle link to point to v2
PiperOrigin-RevId: 772328912
2025-06-16 23:24:57 -07:00
Biruk Mammo 5f3797f6e1 Allow creating empty `AttentionActivations` for experimental code.
PiperOrigin-RevId: 772077675
2025-06-16 10:19:11 -07:00
Jan Wassenberg 6773e4517c Split Activations into Griffin/Attention to reduce memory usage for attention-only tests.
PiperOrigin-RevId: 772025282
2025-06-16 07:52:59 -07:00
Copybara-Service 2128d076db Merge pull request #612 from ufownl:feature/allqueries_append
PiperOrigin-RevId: 772007208
2025-06-16 06:52:43 -07:00
RangerUFO 7aac765e96 Add `Append` method to `AllQueries`
2025-06-16 20:39:27 +08:00
Jan Wassenberg e5c81f64a1 Major refactor: clarify query_idx (global) vs qi. Refs #607
Fix missing pos increment for last prefill and check that in gemma_test.
Thanks to @ufownl for pointing this out.

Change argument lists to QBatch with accessors.
Increase default seq_len to 8k.

PiperOrigin-RevId: 771937385
2025-06-16 02:42:02 -07:00
Jan Wassenberg 2c72ff2aa5 Fix MatMul issue caused by autotuning bucketing, refs #608, thanks @ufownl
PiperOrigin-RevId: 771077158
2025-06-13 06:58:42 -07:00
Jan Wassenberg 01cdefeda7 1.64x batch=1 prefill speedup: nested parallelization for Attention
(DotSoftmaxWeightedSum)
Also fix tsan error in matmul (atomic_flag instead of static)

PiperOrigin-RevId: 770241705
2025-06-11 11:28:46 -07:00
Jan Wassenberg c027a45a2e MatPtr-ify KV, shared div_seq_len, --seq_len flag
PiperOrigin-RevId: 770194455
2025-06-11 09:49:38 -07:00
Jan Wassenberg bd98b43cea Rename RowPtr->StridedView, CRows->RowPtrs
PiperOrigin-RevId: 770046362
2025-06-11 02:30:53 -07:00
Jan Wassenberg b84149310b Fix paligemma, update its test
Must not pass image tokens to the EmbedMMToken used for text.
Caught by next presubmit test.

paligemma_test: move function bodies into class, regroup variables
PiperOrigin-RevId: 770040014
2025-06-11 02:12:12 -07:00
Jan Wassenberg ec02726cf7 6x large-batch, short-prompt prefill speedup
Parallelize over queries instead of tokens
Introduce non_eos so we only iterate over queries that have not yet reached EOS; remove TokenStreamer.
Move RMSNormInplaceBatched out of Transformer to call the latter from prefill
Consistent arg order.

Fix gemma_test EOS handling (caught by msan), remove from tokenizer.h
Also add output to gemma_batch_bench, fix name

PiperOrigin-RevId: 769676106
2025-06-10 09:56:20 -07:00
Daniel Keysers d7b23d532a Restructure internal initialization.
PiperOrigin-RevId: 769507096
2025-06-10 01:25:31 -07:00
Rhett Stucki 824a95793c Fix Image::WriteBinary() writing values to a file one at a time.
PiperOrigin-RevId: 767955187
2025-06-06 00:48:09 -07:00
Jan Wassenberg 6ee628ba38 Further cleanup: separate MatMulEnv arg
move row_ptrs into MatMulEnv
Consistent arg order: layer, activations, kv_cache, env

PiperOrigin-RevId: 767886386
2025-06-05 20:48:32 -07:00
Jan Wassenberg e774ddbaaa Github test: disable failing ubuntu-20.04
Also attempt to speed up bazel build.

PiperOrigin-RevId: 767667520
2025-06-05 10:30:38 -07:00
Jan Wassenberg 0e2cab5187 Avoid warning about inability to map, unless explicitly requested
PiperOrigin-RevId: 767633815
2025-06-05 09:10:08 -07:00
Jan Wassenberg 3a266c662c Split gemma-inl into separate source files
weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization.

PiperOrigin-RevId: 767562313
2025-06-05 05:36:44 -07:00
The gemma.cpp Authors dd7d4a7717 Optimize Image::GetPatch() to copy whole rows at a time instead of individual pixels.
PiperOrigin-RevId: 767436146
2025-06-04 22:31:08 -07:00
Copybara-Service eff0213e88 Merge pull request #593 from ufownl:bugfix/dc2bf16
PiperOrigin-RevId: 767098675
2025-06-04 05:21:54 -07:00
RangerUFO a82f8d5690 Fix compilation error on G++ 9.4
2025-06-04 17:39:37 +08:00
Jan Wassenberg 6897313080 3x speedup of EmbedImagePatches - GEMM, not GEMV.
Required fixes to handling of non-vector aligned A.
Also move row ptrs to MatMulEnv.

PiperOrigin-RevId: 767029036
2025-06-04 01:18:52 -07:00
Daniel Keysers 9f74a1a098 Fix a problem in run_example.py
PiperOrigin-RevId: 767017932
2025-06-04 00:42:57 -07:00
Jan Wassenberg 9efdcfd45c 1.07x batch decode speedup: more BF16 weights and activations
BF16 att_sums and ffw_out
Support BF16 B views without decompression
Support arbitrary types in MulByConstAndAdd, AddFrom

Also update profiler annotations in ops-inl.h

PiperOrigin-RevId: 766995010
2025-06-03 23:30:18 -07:00
Jan Wassenberg 839a642992 Fix paligemma_test, refs #588
Detect PaliGemma models from layer names
Remove unused allocator arg from CreateInvTimescale
matmul: only warn once about dim divisibility
Print config also in tests if --verbosity 2
PiperOrigin-RevId: 766605131
2025-06-03 04:45:22 -07:00
Copybara-Service 209009b57e Merge pull request #588 from ufownl:bugfix/vit_attn
PiperOrigin-RevId: 766528391
2025-06-03 00:43:30 -07:00
Jan Wassenberg ad3002a21c Merge branch 'dev' into bugfix/vit_attn
2025-06-03 09:29:52 +02:00
Jan Wassenberg 794a21a4e6 Major refactor to de-templatize gemma-inl and weights
This replaces per-weight instantiations of all code with only per-MatMul/norm.
Reduces binary size by 133KiB.

WeightsOwner is no longer required for type erasing, hence it is replaced with ModelWeightsPtrs.
Also remove unused EmbedToken, replaced with EmbedMMToken.

PiperOrigin-RevId: 766497657
2025-06-02 23:01:35 -07:00
RangerUFO 93de2be938 Fix the broken VitAttention
2025-06-03 12:40:13 +08:00
Jan Wassenberg cf4d7ceb82 1.16x decode speedup: remove last MatVec in Attention
Precompute row pointers.
Remove no longer used MHA support; QStride -> qkv_dim.
Remove RowPtr from MatMul interface, use only MatPtrT.
Require opt-in define for NUQ to speed up builds.
Also fix io.cc on Windows.

PiperOrigin-RevId: 766228108
2025-06-02 09:40:29 -07:00
Jan Wassenberg c4a75abe43 Cleanup gemma_batch_bench
PiperOrigin-RevId: 766177406
2025-06-02 07:04:36 -07:00
Jan Wassenberg a3f7bf0991 Fix thread name when skipping packages/clusters
PiperOrigin-RevId: 766054198
2025-06-01 23:50:11 -07:00
Jan Wassenberg 0023ff8770 Add support for arbitrary output row pointers
Useful for writing directly to KV cache.

PiperOrigin-RevId: 765615147
2025-05-31 10:55:54 -07:00
The gemma.cpp Authors 9c3e089b09 Internal change.
PiperOrigin-RevId: 765218260
2025-05-30 09:18:44 -07:00