Charles Zhao
50ee1a3e92
Write SBS progressively.
...
(1) Write directly to the file in BlobWriter::Add and destroy the MatOwner to release its RAM.
(2) Write a fake header to indicate this is V2, then write the correct header and directory at the end of the file.
(3) Tested loading SBS files written the old way and the new way; both worked.
PiperOrigin-RevId: 789306837
2025-07-31 06:05:38 -07:00
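The write order described in this commit (placeholder header first, blobs streamed as they arrive, real header and directory last) can be sketched as below. This is a minimal illustration under stated assumptions, not the actual SBS format or the BlobWriter API: `SketchHeader`, `SketchEntry`, `ProgressiveWriter`, `ReadBlob` and both magic values are hypothetical. In this sketch the placeholder only reserves space; the V2 signalling mentioned in the commit is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical layout for illustration only; NOT the real SBS format.
constexpr uint32_t kPlaceholderMagic = 0;     // written first, replaced later
constexpr uint32_t kFinalMagic = 0x12345678;  // arbitrary made-up value

struct SketchHeader {
  uint32_t magic;
  uint32_t num_blobs;
  uint64_t directory_offset;
};
struct SketchEntry {
  uint64_t offset;
  uint64_t size;
};

class ProgressiveWriter {
 public:
  explicit ProgressiveWriter(const std::string& path)
      : file_(std::fopen(path.c_str(), "wb")) {
    if (file_ == nullptr) throw std::runtime_error("Cannot open " + path);
    // Reserve space with a placeholder header; Finish() overwrites it.
    const SketchHeader placeholder = {kPlaceholderMagic, 0, 0};
    Write(&placeholder, sizeof(placeholder));
  }
  ~ProgressiveWriter() {
    if (file_ != nullptr) std::fclose(file_);
  }
  ProgressiveWriter(const ProgressiveWriter&) = delete;
  ProgressiveWriter& operator=(const ProgressiveWriter&) = delete;

  // Writes the blob immediately, so the caller can free its copy right away.
  void Add(const std::vector<uint8_t>& blob) {
    assert(file_ != nullptr);  // must not be called after Finish()
    directory_.push_back({pos_, static_cast<uint64_t>(blob.size())});
    Write(blob.data(), blob.size());
  }

  // Appends the directory, then rewrites the header at offset 0.
  void Finish() {
    assert(file_ != nullptr);
    const SketchHeader header = {
        kFinalMagic, static_cast<uint32_t>(directory_.size()), pos_};
    Write(directory_.data(), directory_.size() * sizeof(SketchEntry));
    if (std::fseek(file_, 0, SEEK_SET) != 0) {
      throw std::runtime_error("Seek failed");
    }
    Write(&header, sizeof(header));
    std::fclose(file_);
    file_ = nullptr;
  }

 private:
  void Write(const void* data, size_t bytes) {
    if (bytes != 0 && std::fwrite(data, 1, bytes, file_) != bytes) {
      throw std::runtime_error("Write failed");
    }
    pos_ += bytes;
  }

  std::FILE* file_;
  uint64_t pos_ = 0;
  std::vector<SketchEntry> directory_;
};

// Reads blob `index` back; returns an empty vector on any failure.
std::vector<uint8_t> ReadBlob(const std::string& path, uint32_t index) {
  std::vector<uint8_t> blob;
  std::FILE* f = std::fopen(path.c_str(), "rb");
  if (f == nullptr) return blob;
  SketchHeader header;
  SketchEntry entry;
  if (std::fread(&header, sizeof(header), 1, f) == 1 &&
      header.magic == kFinalMagic && index < header.num_blobs &&
      std::fseek(f, static_cast<long>(header.directory_offset +
                                      index * sizeof(entry)), SEEK_SET) == 0 &&
      std::fread(&entry, sizeof(entry), 1, f) == 1 &&
      std::fseek(f, static_cast<long>(entry.offset), SEEK_SET) == 0) {
    blob.resize(entry.size);
    if (entry.size != 0 &&
        std::fread(blob.data(), 1, blob.size(), f) != blob.size()) {
      blob.clear();
    }
  }
  std::fclose(f);
  return blob;
}
```

Only the small directory stays in memory until `Finish()`; each blob's bytes can be released as soon as `Add()` returns, which is the memory saving the commit describes.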
Jan Wassenberg
8715eda512
Improved layer idx parsing
...
PiperOrigin-RevId: 788868522
2025-07-30 05:49:45 -07:00
Jan Wassenberg
d831ddce5b
Fix file mapping: was letting the smart pointer go out of scope
...
Also save+print the IO mode used.
PiperOrigin-RevId: 788848165
2025-07-30 04:30:10 -07:00
Jan Wassenberg
2141d4788d
Add IsAppendOnly flag to file and if true, disable parallel writes
...
PiperOrigin-RevId: 788805810
2025-07-30 01:51:37 -07:00
Jan Wassenberg
d22ba2ac96
Update layer index parsing and allow tokenizer override
...
PiperOrigin-RevId: 788797948
2025-07-30 01:22:34 -07:00
Jan Wassenberg
d1638587f0
1.14x batch decode speedup: parallelize RMSNorm ops
...
Activations was over-parallelized, use single pool instead.
Also improve profiler zone annotations,
pass through worker args (for tracking concurrency), now non-optional.
PiperOrigin-RevId: 788790976
2025-07-30 00:55:45 -07:00
Jan Wassenberg
ac0d751d20
Rename GetModelConfig->Config
...
PiperOrigin-RevId: 788506480
2025-07-29 10:18:12 -07:00
Jeremiah Harmsen
33fabd4ed1
Internal change.
...
PiperOrigin-RevId: 788463042
2025-07-29 08:21:29 -07:00
Jan Wassenberg
e76e29ce11
De-singleton ThreadingContext so callers can pass in their own
...
weights.cc: fix BindB argument for bf16 tensors
threading_test: enable autotune
PiperOrigin-RevId: 785763618
2025-07-22 02:08:46 -07:00
Jan Wassenberg
5474146129
Back to f32 kv_cache, but via typedef
...
PiperOrigin-RevId: 785422614
2025-07-21 07:05:35 -07:00
Jan Wassenberg
56c9196eb6
Add blob_path to config deduction message
...
PiperOrigin-RevId: 782188689
2025-07-11 18:58:56 -07:00
Jan Wassenberg
349c86f2d9
Fix bench_matmul perf regression: A input should be padded
...
PiperOrigin-RevId: 781976414
2025-07-11 07:36:36 -07:00
Jan Wassenberg
4bc44d5678
Minor: ModelWeightsPtrs -> WeightsPtrs
...
PiperOrigin-RevId: 781954533
2025-07-11 06:11:51 -07:00
Jan Wassenberg
fea9a07d9b
Avoid affinity related warnings on Apple. Refs #625
...
PiperOrigin-RevId: 778895832
2025-07-03 08:22:31 -07:00
Jan Wassenberg
e1585ecaf5
Update Highway version to get NEON bf16 fix
...
https://github.com/google/highway/pull/2598
PiperOrigin-RevId: 774664346
2025-06-23 01:25:01 -07:00
Jan Wassenberg
a04cc287b2
Move MatMulEnv out of Gemma to enable concurrent calls
...
Also update benchmark_helper config print: add profiler, remove free mem
PiperOrigin-RevId: 774662974
2025-06-23 01:20:09 -07:00
Jan Wassenberg
0f70f285e0
1.1x prefill and decode speedup (attention/activations)
...
Optimizations
- Better load-balancing in attention threading
(Previously, clusters were limited by #heads)
- Add MulByConstTo to avoid zero-init
- Parallel activations
Cleanup
- Prepare for RowPtr in A or B
- Pass through thread_id to ops
- Avoid warning in bench_matmul
PiperOrigin-RevId: 773723423
2025-06-20 08:59:53 -07:00
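The "avoid zero-init" item above can be illustrated with a scalar sketch. An accumulate-style kernel needs its output zeroed first, whereas a write-style kernel overwrites the output and so saves that pass. The function names `ScaleAndAccumulate` and `ScaleTo` are hypothetical stand-ins; the real ops are vectorized and their signatures are not shown in this log.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Accumulating variant: `out` must hold valid values (e.g. zeros) beforehand.
void ScaleAndAccumulate(float c, const float* x, float* out, size_t n) {
  assert(n == 0 || (x != nullptr && out != nullptr));
  for (size_t i = 0; i < n; ++i) out[i] += c * x[i];
}

// Overwriting variant: `out` may be uninitialized, so no zeroing pass is
// needed before the first call.
void ScaleTo(float c, const float* x, float* out, size_t n) {
  assert(n == 0 || (x != nullptr && out != nullptr));
  for (size_t i = 0; i < n; ++i) out[i] = c * x[i];
}
```

A caller summing several scaled vectors could use the overwriting variant for the first term and the accumulating variant for the rest.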
Jan Wassenberg
7630ec0c92
batch_bench tweak: more output
...
PiperOrigin-RevId: 773670580
2025-06-20 06:09:18 -07:00
Jan Wassenberg
4f5785b0fd
Update instrumentation for new Highway wall-time profiler
...
Pass the thread index through and use new zone_id.
PiperOrigin-RevId: 773344242
2025-06-19 07:46:04 -07:00
Jan Wassenberg
1665ecc5c2
Remove CMake max version, fixes #623
...
PiperOrigin-RevId: 773265809
2025-06-19 02:30:03 -07:00
Jan Wassenberg
834cbe5b39
linkstatic in most tests/binaries, remove fully_static_link
...
Also decrease "eternal" timeout to "long".
Add 2x/4x larger subsections of Frankenstein (from Gutenberg)
PiperOrigin-RevId: 773252901
2025-06-19 01:45:53 -07:00
Jan Wassenberg
7f62c2606e
Fix bf16 KV recompression and Rope(), fixes #608
...
Also add more helpful error message for prompt > seq_len
Also update ops_test, adding coverage for Rope().
PiperOrigin-RevId: 772945644
2025-06-18 09:14:20 -07:00
Biruk Mammo
88284387db
Reduce warning noise.
...
PiperOrigin-RevId: 772941142
2025-06-18 09:01:40 -07:00
Jan Wassenberg
343482c7ef
1.02x batch decode speedup: BF16 KV cache
...
ops-inl.h: Vectorize Rope(), template
Remove unused MulBy, and extra-arg overloads of MulByConst and Softmax
Fix for DecompressAndZeroPad: ensure second vector filled
PiperOrigin-RevId: 772779163
2025-06-17 23:21:59 -07:00
Mukund Aggarwal
606e22155a
Gemma CPP: move PaliGemma tests' helper to a separate class
...
This makes it possible to use PaliGemma functionality directly for inference by providing only the tokenizer and weight paths.
Added @mukundagg to allowed authors list.
PiperOrigin-RevId: 772705238
2025-06-17 18:37:24 -07:00
Jan Wassenberg
f2adbfbcab
Batch inference fixes: set pos during prefill, fix assert
...
PiperOrigin-RevId: 772458760
2025-06-17 07:09:44 -07:00
Jan Wassenberg
d342e4e7d4
Also add CMAKE_CXX_STANDARD in examples' CMake files
...
PiperOrigin-RevId: 772454497
2025-06-17 06:53:54 -07:00
Jan Wassenberg
cd80d8b24d
Speed up builds by skipping rarely used targets
...
Centralize previous code into GEMMA_DISABLED_TARGETS
PiperOrigin-RevId: 772433723
2025-06-17 05:44:20 -07:00
Jan Wassenberg
9a02d6be68
Add --prompt_file and testdata for it. Refs #608
...
Linux terminals truncate input after 4096 chars.
testdata is Frankenstein from Project Gutenberg, which is long out of copyright.
Also fix loss of coherence after long context caused by incorrect IsGlobalLayer.
Move that to config.h and use max_seq_len as the initializer to make this clear.
Also avoid dynamic allocation for GriffinActivations.
PiperOrigin-RevId: 772333225
2025-06-16 23:41:07 -07:00
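Reading the prompt from a file sidesteps the terminal's input-length limit mentioned above. A minimal sketch of such a helper follows; `ReadWholeFile` is a hypothetical name and is not taken from the gemma.cpp sources.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the entire contents of `path` as one string.
std::string ReadWholeFile(const std::string& path) {
  assert(!path.empty());
  std::ifstream in(path, std::ios::binary);
  if (!in) throw std::runtime_error("Cannot open " + path);
  std::ostringstream contents;
  contents << in.rdbuf();  // streams the whole file, with no line-length cap
  return contents.str();
}
```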
Jan Wassenberg
31d2b231af
Update PaliGemma Kaggle link to point to v2
...
PiperOrigin-RevId: 772328912
2025-06-16 23:24:57 -07:00
Biruk Mammo
5f3797f6e1
Allow creating empty `AttentionActivations` for experimental code.
...
PiperOrigin-RevId: 772077675
2025-06-16 10:19:11 -07:00
Jan Wassenberg
6773e4517c
Split Activations into Griffin/Attention to reduce memory usage for attention-only tests.
...
PiperOrigin-RevId: 772025282
2025-06-16 07:52:59 -07:00
Copybara-Service
2128d076db
Merge pull request #612 from ufownl:feature/allqueries_append
...
PiperOrigin-RevId: 772007208
2025-06-16 06:52:43 -07:00
RangerUFO
7aac765e96
Add `Append` method to `AllQueries`
2025-06-16 20:39:27 +08:00
Jan Wassenberg
e5c81f64a1
Major refactor: clarify query_idx (global) vs qi. Refs #607
...
Fix missing pos increment for last prefill and check that in gemma_test.
Thanks to @ufownl for pointing this out.
Change argument lists to QBatch with accessors.
Increase default seq_len to 8k.
PiperOrigin-RevId: 771937385
2025-06-16 02:42:02 -07:00
Jan Wassenberg
2c72ff2aa5
Fix MatMul issue caused by autotuning bucketing, refs #608, thanks @ufownl
...
PiperOrigin-RevId: 771077158
2025-06-13 06:58:42 -07:00
Jan Wassenberg
01cdefeda7
1.64x batch=1 prefill speedup: nested parallelization for Attention
...
(DotSoftmaxWeightedSum)
Also fix tsan error in matmul (atomic_flag instead of static)
PiperOrigin-RevId: 770241705
2025-06-11 11:28:46 -07:00
Jan Wassenberg
c027a45a2e
MatPtr-ify KV, shared div_seq_len, --seq_len flag
...
PiperOrigin-RevId: 770194455
2025-06-11 09:49:38 -07:00
Jan Wassenberg
bd98b43cea
Rename RowPtr->StridedView, CRows->RowPtrs
...
PiperOrigin-RevId: 770046362
2025-06-11 02:30:53 -07:00
Jan Wassenberg
b84149310b
Fix paligemma, update its test
...
Must not pass image tokens to the EmbedMMToken used for text.
Caught by next presubmit test.
paligemma_test: move function bodies into class, regroup variables
PiperOrigin-RevId: 770040014
2025-06-11 02:12:12 -07:00
Jan Wassenberg
ec02726cf7
6x large-batch, short-prompt prefill speedup
...
Parallelize over queries instead of tokens
Introduce non_eos so we only iterate over queries that have not yet reached EOS; remove TokenStreamer.
Move RMSNormInplaceBatched out of Transformer to call the latter from prefill.
Consistent arg order.
Fix gemma_test EOS handling (caught by msan), remove from tokenizer.h
Also add output to gemma_batch_bench, fix name
PiperOrigin-RevId: 769676106
2025-06-10 09:56:20 -07:00
Daniel Keysers
d7b23d532a
Restructure internal initialization.
...
PiperOrigin-RevId: 769507096
2025-06-10 01:25:31 -07:00
Rhett Stucki
824a95793c
Fix Image::WriteBinary() writing values to a file one at a time.
...
PiperOrigin-RevId: 767955187
2025-06-06 00:48:09 -07:00
Jan Wassenberg
6ee628ba38
Further cleanup: separate MatMulEnv arg
...
move row_ptrs into MatMulEnv
Consistent arg order: layer, activations, kv_cache, env
PiperOrigin-RevId: 767886386
2025-06-05 20:48:32 -07:00
Jan Wassenberg
e774ddbaaa
GitHub test: disable failing ubuntu-20.04
...
Also attempt to speed up bazel build.
PiperOrigin-RevId: 767667520
2025-06-05 10:30:38 -07:00
Jan Wassenberg
0e2cab5187
Avoid warning about inability to map, unless explicitly requested
...
PiperOrigin-RevId: 767633815
2025-06-05 09:10:08 -07:00
Jan Wassenberg
3a266c662c
Split gemma-inl into separate source files
...
weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization.
PiperOrigin-RevId: 767562313
2025-06-05 05:36:44 -07:00
The gemma.cpp Authors
dd7d4a7717
Optimize Image::GetPatch() to copy whole rows at a time instead of individual pixels.
...
PiperOrigin-RevId: 767436146
2025-06-04 22:31:08 -07:00
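The row-wise copy described in this commit can be sketched as follows. It assumes a hypothetical row-major float image stored in a flat vector; `CopyPatchByRows` and its parameters are illustrative and do not reflect the real Image class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copies a patch_w x patch_h window whose top-left corner is (x0, y0).
// Each patch row is contiguous in the source, so one bulk copy per row
// replaces patch_w separate per-pixel assignments.
std::vector<float> CopyPatchByRows(const std::vector<float>& image,
                                   size_t width, size_t x0, size_t y0,
                                   size_t patch_w, size_t patch_h) {
  assert(width != 0 && x0 + patch_w <= width);
  assert((y0 + patch_h) * width <= image.size());
  std::vector<float> patch(patch_w * patch_h);
  for (size_t y = 0; y < patch_h; ++y) {
    const float* src = image.data() + (y0 + y) * width + x0;
    std::copy_n(src, patch_w, patch.data() + y * patch_w);
  }
  return patch;
}
```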
Copybara-Service
eff0213e88
Merge pull request #593 from ufownl:bugfix/dc2bf16
...
PiperOrigin-RevId: 767098675
2025-06-04 05:21:54 -07:00
RangerUFO
a82f8d5690
Fix compilation error on G++ 9.4
2025-06-04 17:39:37 +08:00