Theotime Combes
1bdde1af3c
Add config flag for global timescale & rely on config to deduce wrapping
...
PiperOrigin-RevId: 823512377
2025-10-24 06:54:56 -07:00
Jan Wassenberg
3ed403e287
Major cleanup of profiler zones, add Caller annotation for all pool.Run
...
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor
PiperOrigin-RevId: 822934530
2025-10-23 01:54:24 -07:00
Phil Culliton
503aaddd65
Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
...
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
Ray Smith
ee18916abf
Removed PROFILER_ZONE from the most frequently called functions to reduce overhead.
...
PiperOrigin-RevId: 819739402
2025-10-15 07:10:04 -07:00
Ray Smith
fb6fa793f4
Added a global (to gemma) zones list so that most PROFILER_ZONE3 call sites avoid the synchronization required for the static const initialization of the zone handle.
...
Improved flash_attention to enable profiling using the new zones.
PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
Jan Wassenberg
035273c184
tune pool kSpin mode in threading_context
...
Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes.
threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order
Also enable SPR target (Zen4 is AMD-only),
update Highway version for renamed Thread()->GlobalIdx().
PiperOrigin-RevId: 816223017
2025-10-07 08:36:26 -07:00
Ray Smith
684a0444e9
Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines
...
PiperOrigin-RevId: 814241032
2025-10-02 08:15:16 -07:00
Ray Smith
14244664c8
Avoid transposing Q when it isn't needed
...
PiperOrigin-RevId: 814187984
2025-10-02 05:16:35 -07:00
Jan Wassenberg
fe5a39990e
Improve FlashAttention threading:
...
kFlat for RMSNorm (hierarchical is excessive),
profiler zone naming improvements.
PiperOrigin-RevId: 814144012
2025-10-02 02:37:05 -07:00
Ray Smith
6098a022b3
Increased parallelism for RMSNormAndPositionalEncoding
...
PiperOrigin-RevId: 813738994
2025-10-01 07:11:14 -07:00
Ray Smith
2f6cbde8ff
Added a smaller tile size to flash attention for smaller batch sizes
...
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Ray Smith
4974f24832
Fixed bug with softcap in single flash attention
...
PiperOrigin-RevId: 813164938
2025-09-30 02:17:58 -07:00
Nitin Gangahar
667a3f117a
Utilize multiple cores to read weight batches.
...
PiperOrigin-RevId: 811893059
2025-09-26 11:28:33 -07:00
Charles Zhao
4f0c633248
(1) Added QueryResultAndMetrics and BatchQueryModelWithMetrics to also return TimingInfo besides query results.
...
PiperOrigin-RevId: 810634261
2025-09-23 17:02:29 -07:00
Jan Wassenberg
fac8aac4cb
Internal change
...
PiperOrigin-RevId: 809975026
2025-09-22 05:37:03 -07:00
Jan Wassenberg
501fdf000e
Remove no longer used MatVec
...
PiperOrigin-RevId: 809059409
2025-09-19 09:03:22 -07:00
Jan Wassenberg
f3bc1c17da
1.03x speedup: fused FFN
...
matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC
matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps.
PiperOrigin-RevId: 807291701
2025-09-15 10:26:37 -07:00
Ray Smith
c9b8479f7d
Added zero-initialization to att_out.
...
Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available.
PiperOrigin-RevId: 806284756
2025-09-12 07:48:23 -07:00
Jan Wassenberg
2695aab5d2
Temporarily disable flash pending msan fix
...
PiperOrigin-RevId: 805350234
2025-09-10 07:25:41 -07:00
Jan Wassenberg
ba6131311a
Fix gemma_batch_bench for flash attention
...
q_T rows do not change.
Also repeat prefill to reflect perf after autotuning.
PiperOrigin-RevId: 805319377
2025-09-10 05:32:34 -07:00
Ray Smith
f10ac41a20
Added flash attention, with both a single-q function and a register-tiled function.
...
The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine.
PiperOrigin-RevId: 804913784
2025-09-09 08:05:26 -07:00
Jan Wassenberg
461a9c7d1b
Matmul refactoring towards fusion
...
MMLoops: move dispatch code out, use overloads
split build target into matmul_env (for MatMulEnv/MMOptions)
weights: no longer call BindB
Fix potential out of bounds in gemma_batch_bench
PiperOrigin-RevId: 804895985
2025-09-09 07:13:38 -07:00
Jan Wassenberg
a5ab99e4ba
Memory use reduction: smaller/single MMStorage
...
PiperOrigin-RevId: 804865029
2025-09-09 05:32:46 -07:00
Jan Wassenberg
6e52a835c6
Faster startup on tsan: use hierarchical parallelism for BF16 conversion
...
Also re-enable profiler zones
PiperOrigin-RevId: 804273899
2025-09-07 22:50:31 -07:00
Jan Wassenberg
cbe24eac51
1.15x speedup: parallel sampling, enabled by new RNG
...
Also pass pos to SampleFunc, for seeding the RNG.
PiperOrigin-RevId: 803453518
2025-09-05 07:24:02 -07:00
Jan Wassenberg
2b4c16e243
Remove Griffin support
...
Also add IsObsolete helper
PiperOrigin-RevId: 803376921
2025-09-05 02:35:40 -07:00
Jan Wassenberg
56186193c1
Replace mt19937 with new generator to enable parallel sampling
...
Split it into immutable AesCtrEngine and RngStream
Also add RowSpan and Logits span
PiperOrigin-RevId: 803336423
2025-09-04 23:49:10 -07:00
Jan Wassenberg
5d1693e806
Internal change
...
PiperOrigin-RevId: 803083229
2025-09-04 10:31:20 -07:00
Jan Wassenberg
4be4799727
Remove kMaxPackages and per-package-related code
...
matmul: remove kMaxClusters, dynamic allocation
PiperOrigin-RevId: 802950348
2025-09-04 03:33:12 -07:00
Jan Wassenberg
7263ab8445
MatMul simplification, threading strategy improvements
...
remove MatMul f32 special case (smaller code),
types: Add u32/u64 for use by Activations
move renamed ParallelismStrategy to threading_context so we can pass ctx
ensure worker index is unique across clusters
matmul.h: const member functions for renamed policy classes (easier to call)
PiperOrigin-RevId: 802848086
2025-09-03 21:45:07 -07:00
Jan Wassenberg
b7b3d353db
Simplify MatMul: remove F32 special case (build time)
...
Also move kMaxM into separate kMaxBatchSize
PiperOrigin-RevId: 802086590
2025-09-02 04:29:21 -07:00
Jan Wassenberg
1e3c853e80
Add ParallelFor wrapper function and one new mode
...
Move ParallelismType from matmul.h to threading.h
Replace SmallParallelFor with ParallelFor and the new mode
PiperOrigin-RevId: 802038452
2025-09-02 01:40:09 -07:00
Jan Wassenberg
229bd078a1
1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage.
...
Also increase dot_test.cc range for Zen4, and matmul_test tolerance (failing in some configs)
PiperOrigin-RevId: 801789922
2025-09-01 06:34:04 -07:00
Jan Wassenberg
0ae8646731
Fix remainder handling for Paligemma
...
No longer attempt to skip the remainder handling because B might also be a non-padded view.
PiperOrigin-RevId: 800890805
2025-08-29 07:25:52 -07:00
Marie White
973e284ed6
Refactor Matmul to use a policy class for parallelization.
...
PiperOrigin-RevId: 800864489
2025-08-29 05:40:39 -07:00
Jan Wassenberg
6c39a2dea4
1.01x speedup: More bf16 activations to reduce DecompressA.
...
Also move observer call into function, format gemma_args.
PiperOrigin-RevId: 800827400
2025-08-29 03:19:01 -07:00
Jan Wassenberg
7288891439
Remove F64 partial storage in matmul.
...
Also remove no longer used kMaxN; row_ptrs only used for C
PiperOrigin-RevId: 800774757
2025-08-29 00:12:08 -07:00
Jan Wassenberg
98ddc166db
Expand ThreadingContext comments
...
PiperOrigin-RevId: 800479954
2025-08-28 08:32:10 -07:00
Marie White
6128e758ff
Change ffw_out from B16 to F32.
...
PiperOrigin-RevId: 800330411
2025-08-28 00:01:39 -07:00
Jan Wassenberg
5411fd846d
Minor: batched NotifyGenerate, fix comment/dep
...
PiperOrigin-RevId: 799889802
2025-08-26 23:33:17 -07:00
Jan Wassenberg
86afd53076
1.04x speedup: Parallelize SoftCap
...
Also require opt-in constexpr flag for observer callbacks, update zones
PiperOrigin-RevId: 799655163
2025-08-26 11:55:20 -07:00
Jan Wassenberg
ed2f0bd1b0
Fix pos assertions, refs #665
...
Ensure the streaming func pos matches the number of calls.
Add two arguments that control pos+1 and pos+=1 behavior.
Also cleanup/add comments.
run: use batch_stream_func, add assert, higher verbosity for MM autotune output
PiperOrigin-RevId: 799511163
2025-08-26 04:50:40 -07:00
Jan Wassenberg
9bf0fe4e37
Internal change
...
PiperOrigin-RevId: 799509375
2025-08-26 04:44:08 -07:00
Jan Wassenberg
d3a5ddf657
Merge pull request #663 from junjihashimoto:feature/api-server
...
PiperOrigin-RevId: 797731089
2025-08-24 11:57:05 +02:00
Rhett Stucki
73f1140dca
Fix an off-by-one error after StreamAndUpdateEOS() to remove the MSAN warning about reading an uninitialized variable in the kv_cache.
...
The logic for choosing whether or not to attend to the last token during prefill wasn't completely consistent with StreamAndUpdateEOS(), causing an off-by-one error that prevented the kv_cache from being fully populated.
PiperOrigin-RevId: 797614310
2025-08-20 22:59:58 -07:00
Junji Hashimoto
41321611fd
feature: add API server and client with Google protocol
2025-08-21 11:32:48 +09:00
Phil Culliton
78573b6718
Internal change. Add deduction for 270M.
...
PiperOrigin-RevId: 795041810
2025-08-14 08:04:38 -07:00
Phil Culliton
d044801c1d
Internal change
...
PiperOrigin-RevId: 794620076
2025-08-13 09:47:45 -07:00
Jan Wassenberg
71406cf6d0
More profiler interface fixes: hwy:: plus avoid ADD_ZONE
...
PiperOrigin-RevId: 794493165
2025-08-13 03:15:48 -07:00
Jan Wassenberg
faa4102992
(Resubmit) Prepare profiler annotations for new API
...
Pass hwy::Profiler& to low-level functions.
Use ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.
PiperOrigin-RevId: 794461159
2025-08-13 01:38:24 -07:00
The gemma.cpp Authors
a2d9133f7d
Prepare profiler annotations for new API
...
Pass hwy::Profiler& to low-level functions.
Use ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.
PiperOrigin-RevId: 793865287
2025-08-11 17:51:38 -07:00
Jan Wassenberg
4cbf63e6f0
Prepare profiler annotations for new API
...
Pass hwy::Profiler& to low-level functions.
Use ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.
PiperOrigin-RevId: 793821255
2025-08-11 15:34:52 -07:00
Jan Wassenberg
4e062d68f7
Update BlobWriter comments, WriteAll->Finalize
...
PiperOrigin-RevId: 790792133
2025-08-04 10:01:38 -07:00
Jan Wassenberg
701841897b
Default to disabling per-socket parallelization
...
weights: default to Read for small-batch (only look at qbatch, not the larger prefill tbatch)
PiperOrigin-RevId: 790787643
2025-08-04 09:49:14 -07:00
Jan Wassenberg
799c264df3
Pre-tune thread pool before matmul
...
Also improve profiler annotations - remove near-zero ones and add more for startup
PiperOrigin-RevId: 789352414
2025-07-31 08:45:26 -07:00
Charles Zhao
50ee1a3e92
Write SBS progressively.
...
(1) Write directly to the file in BlobWriter::Add and destroy the MatOwner to release its memory.
(2) Write a fake header to indicate this is V2, and write the correct header and directory at the end of the file.
(3) Tested loading SBS files written both the old way and the new way; both worked.
PiperOrigin-RevId: 789306837
2025-07-31 06:05:38 -07:00
Jan Wassenberg
8715eda512
Improved layer idx parsing
...
PiperOrigin-RevId: 788868522
2025-07-30 05:49:45 -07:00
Jan Wassenberg
d831ddce5b
Fix file mapping: was letting the smart pointer go out of scope
...
Also save+print the IO mode used.
PiperOrigin-RevId: 788848165
2025-07-30 04:30:10 -07:00
Jan Wassenberg
d22ba2ac96
Update layer index parsing and allow tokenizer override
...
PiperOrigin-RevId: 788797948
2025-07-30 01:22:34 -07:00
Jan Wassenberg
d1638587f0
1.14x batch decode speedup: parallelize RMSNorm ops
...
Activations was over-parallelized, use single pool instead.
Also improve profiler zone annotations,
pass through worker args (for tracking concurrency), now non-optional.
PiperOrigin-RevId: 788790976
2025-07-30 00:55:45 -07:00
Jan Wassenberg
ac0d751d20
Rename GetModelConfig->Config
...
PiperOrigin-RevId: 788506480
2025-07-29 10:18:12 -07:00
Jeremiah Harmsen
33fabd4ed1
Internal change.
...
PiperOrigin-RevId: 788463042
2025-07-29 08:21:29 -07:00
Jan Wassenberg
e76e29ce11
De-singleton ThreadingContext so callers can pass in their own
...
weights.cc: fix BindB argument for bf16 tensors
threading_test: enable autotune
PiperOrigin-RevId: 785763618
2025-07-22 02:08:46 -07:00
Jan Wassenberg
5474146129
Back to f32 kv_cache, but via typedef
...
PiperOrigin-RevId: 785422614
2025-07-21 07:05:35 -07:00
Jan Wassenberg
56c9196eb6
Add blob_path to config deduction message
...
PiperOrigin-RevId: 782188689
2025-07-11 18:58:56 -07:00
Jan Wassenberg
4bc44d5678
Minor: ModelWeightsPtrs -> WeightsPtrs
...
PiperOrigin-RevId: 781954533
2025-07-11 06:11:51 -07:00
Jan Wassenberg
a04cc287b2
Move MatMulEnv out of Gemma to enable concurrent calls
...
Also update benchmark_helper config print: add profiler, remove free mem
PiperOrigin-RevId: 774662974
2025-06-23 01:20:09 -07:00
Jan Wassenberg
0f70f285e0
1.1x prefill and decode speedup (attention/activations)
...
Optimizations
- Better load-balancing in attention threading
(Previously, clusters were limited by #heads)
- Add MulByConstTo to avoid zero-init
- Parallel activations
Cleanup
- Prepare for RowPtr in A or B
- Pass through thread_id to ops
- Avoid warning in bench_matmul
PiperOrigin-RevId: 773723423
2025-06-20 08:59:53 -07:00
Jan Wassenberg
4f5785b0fd
Update instrumentation for new Highway wall-time profiler
...
Pass the thread index through and use new zone_id.
PiperOrigin-RevId: 773344242
2025-06-19 07:46:04 -07:00
Jan Wassenberg
7f62c2606e
Fix bf16 KV recompression and Rope(), fixes #608
...
Also add more helpful error message for prompt > seq_len
Also update ops_test, adding coverage for Rope().
PiperOrigin-RevId: 772945644
2025-06-18 09:14:20 -07:00
Biruk Mammo
88284387db
Reduce warning noise.
...
PiperOrigin-RevId: 772941142
2025-06-18 09:01:40 -07:00
Jan Wassenberg
343482c7ef
1.02x batch decode speedup: BF16 KV cache
...
ops-inl.h: Vectorize Rope(), template
Remove unused MulBy, and extra-arg overloads of MulByConst and Softmax
Fix for DecompressAndZeroPad: ensure second vector filled
PiperOrigin-RevId: 772779163
2025-06-17 23:21:59 -07:00
Jan Wassenberg
f2adbfbcab
Batch inference fixes: set pos during prefill, fix assert
...
PiperOrigin-RevId: 772458760
2025-06-17 07:09:44 -07:00
Jan Wassenberg
cd80d8b24d
Speed up builds by skipping rarely used targets
...
Centralize previous code into GEMMA_DISABLED_TARGETS
PiperOrigin-RevId: 772433723
2025-06-17 05:44:20 -07:00
Jan Wassenberg
9a02d6be68
Add --prompt_file and testdata for it. Refs #608
...
Linux terminals truncate input after 4096 chars.
testdata is Frankenstein from Project Gutenberg, which is long out of copyright.
Also fix loss of coherence after long context caused by incorrect IsGlobalLayer.
Move that to config.h and use max_seq_len as the initializer to make this clear.
Also avoid dynamic allocation for GriffinActivations.
PiperOrigin-RevId: 772333225
2025-06-16 23:41:07 -07:00
Biruk Mammo
5f3797f6e1
Allow creating empty `AttentionActivations` for experimental code.
...
PiperOrigin-RevId: 772077675
2025-06-16 10:19:11 -07:00
Jan Wassenberg
6773e4517c
Split Activations into Griffin/Attention to reduce memory usage for attention-only tests.
...
PiperOrigin-RevId: 772025282
2025-06-16 07:52:59 -07:00
RangerUFO
7aac765e96
Add `Append` method to `AllQueries`
2025-06-16 20:39:27 +08:00
Jan Wassenberg
e5c81f64a1
Major refactor: clarify query_idx (global) vs qi. Refs #607
...
Fix missing pos increment for last prefill and check that in gemma_test.
Thanks to @ufownl for pointing this out.
Change argument lists to QBatch with accessors.
Increase default seq_len to 8k.
PiperOrigin-RevId: 771937385
2025-06-16 02:42:02 -07:00
Jan Wassenberg
01cdefeda7
1.64x batch=1 prefill speedup: nested parallelization for Attention
...
(DotSoftmaxWeightedSum)
Also fix tsan error in matmul (atomic_flag instead of static)
PiperOrigin-RevId: 770241705
2025-06-11 11:28:46 -07:00
Jan Wassenberg
c027a45a2e
MatPtr-ify KV, shared div_seq_len, --seq_len flag
...
PiperOrigin-RevId: 770194455
2025-06-11 09:49:38 -07:00
Jan Wassenberg
b84149310b
Fix paligemma, update its test
...
Must not pass image tokens to the EmbedMMToken used for text.
Caught by next presubmit test.
paligemma_test: move function bodies into class, regroup variables
PiperOrigin-RevId: 770040014
2025-06-11 02:12:12 -07:00
Jan Wassenberg
ec02726cf7
6x large-batch, short-prompt prefill speedup
...
Parallelize over queries instead of tokens
introduce non_eos so we only iterate over not yet EOS queries; remove TokenStreamer.
move RMSNormInplaceBatched out of Transformer to call the latter from prefill
Consistent arg order.
Fix gemma_test EOS handling (caught by msan), remove from tokenizer.h
Also add output to gemma_batch_bench, fix name
PiperOrigin-RevId: 769676106
2025-06-10 09:56:20 -07:00
Daniel Keysers
d7b23d532a
Restructure internal initialization.
...
PiperOrigin-RevId: 769507096
2025-06-10 01:25:31 -07:00
Jan Wassenberg
6ee628ba38
Further cleanup: separate MatMulEnv arg
...
move row_ptrs into MatMulEnv
Consistent arg order: layer, activations, kv_cache, env
PiperOrigin-RevId: 767886386
2025-06-05 20:48:32 -07:00
Jan Wassenberg
0e2cab5187
Avoid warning about inability to map, unless explicitly requested
...
PiperOrigin-RevId: 767633815
2025-06-05 09:10:08 -07:00
Jan Wassenberg
3a266c662c
Split gemma-inl into separate source files
...
weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization.
PiperOrigin-RevId: 767562313
2025-06-05 05:36:44 -07:00
RangerUFO
a82f8d5690
Fix compilation error on G++ 9.4
2025-06-04 17:39:37 +08:00
Jan Wassenberg
6897313080
3x speedup of EmbedImagePatches - GEMM, not GEMV.
...
Required fixes to handling of non-vector aligned A.
Also move row ptrs to MatMulEnv.
PiperOrigin-RevId: 767029036
2025-06-04 01:18:52 -07:00
Jan Wassenberg
9efdcfd45c
1.07x batch decode speedup: more BF16 weights and activations
...
BF16 att_sums and ffw_out
Support BF16 B views without decompression
Support arbitrary types in MulByConstAndAdd, AddFrom
Also update profiler annotations in ops-inl.h
PiperOrigin-RevId: 766995010
2025-06-03 23:30:18 -07:00
Jan Wassenberg
839a642992
Fix paligemma_test, refs #588
...
Detect PaliGemma models from layer names
Remove unused allocator arg from CreateInvTimescale
matmul: only warn once about dim divisibility
Print config also in tests if --verbosity 2
PiperOrigin-RevId: 766605131
2025-06-03 04:45:22 -07:00
Jan Wassenberg
ad3002a21c
Merge branch 'dev' into bugfix/vit_attn
2025-06-03 09:29:52 +02:00
Jan Wassenberg
794a21a4e6
Major refactor to de-templatize gemma-inl and weights
...
This replaces per-weight instantiations of all code with only per-MatMul/norm.
Reduces binary size by 133KiB.
WeightsOwner is no longer required for type erasing, hence it is replaced with ModelWeightsPtrs.
Also remove unused EmbedToken, replaced with EmbedMMToken.
PiperOrigin-RevId: 766497657
2025-06-02 23:01:35 -07:00
RangerUFO
93de2be938
Fix the broken VitAttention
2025-06-03 12:40:13 +08:00
Jan Wassenberg
cf4d7ceb82
1.16x decode speedup: remove last MatVec in Attention
...
Precompute row pointers.
Remove no longer used MHA support; QStride -> qkv_dim.
Remove RowPtr from MatMul interface, use only MatPtrT.
Require opt-in define for NUQ to speed up builds.
Also fix io.cc on Windows.
PiperOrigin-RevId: 766228108
2025-06-02 09:40:29 -07:00
The gemma.cpp Authors
9c3e089b09
Internal change.
...
PiperOrigin-RevId: 765218260
2025-05-30 09:18:44 -07:00
The gemma.cpp Authors
1e8642f8f4
Internal change.
...
PiperOrigin-RevId: 765037449
2025-05-29 22:51:16 -07:00
Jan Wassenberg
3890eb5412
Remove backprop/
...
Also remove MatPtrT::Packed(); use PackedScale1 instead where const, or Row(0).
PiperOrigin-RevId: 764243198
2025-05-28 07:01:17 -07:00
Jan Wassenberg
627cc04db9
Decouple MatMul from gemma-inl: precompile for all input types
...
Call MatMulStatic instead of MatMul.
Also fix build error due to Highway's Lanes not being constexpr.
PiperOrigin-RevId: 763777269
2025-05-27 07:08:58 -07:00
Jan Wassenberg
421a2ab8ac
Add comments explaining non-padded tensors, kNoPad -> kPacked
...
PiperOrigin-RevId: 763352173
2025-05-26 03:03:38 -07:00