Jan Wassenberg
1e3c853e80
Add ParallelFor wrapper function and one new mode
...
Move ParallelismType from matmul.h to threading.h
Replace SmallParallelFor with ParallelFor and the new mode
PiperOrigin-RevId: 802038452
2025-09-02 01:40:09 -07:00
Jan Wassenberg
229bd078a1
1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage.
...
Also increase dot_test.cc range for Zen4, and matmul_test tolerance (failing in some configs)
PiperOrigin-RevId: 801789922
2025-09-01 06:34:04 -07:00
Jan Wassenberg
7288891439
Remove F64 partial storage in matmul.
...
Also remove no longer used kMaxN; row_ptrs only used for C
PiperOrigin-RevId: 800774757
2025-08-29 00:12:08 -07:00
Jan Wassenberg
71406cf6d0
More profiler interface fixes: hwy:: plus avoid ADD_ZONE
...
PiperOrigin-RevId: 794493165
2025-08-13 03:15:48 -07:00
Jan Wassenberg
faa4102992
(Resubmit) Prepare profiler annotations for new API
...
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.
PiperOrigin-RevId: 794461159
2025-08-13 01:38:24 -07:00
The gemma.cpp Authors
a2d9133f7d
Prepare profiler annotations for new API
...
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.
PiperOrigin-RevId: 793865287
2025-08-11 17:51:38 -07:00
Jan Wassenberg
4cbf63e6f0
Prepare profiler annotations for new API
...
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.
PiperOrigin-RevId: 793821255
2025-08-11 15:34:52 -07:00
Jan Wassenberg
701841897b
Default to disabling per-socket parallelization
...
weights: default to Read for small-batch (only look at qbatch, not the larger prefill tbatch)
PiperOrigin-RevId: 790787643
2025-08-04 09:49:14 -07:00
Jan Wassenberg
d1638587f0
1.14x batch decode speedup: parallelize RMSNorm ops
...
Activations was over-parallelized, use single pool instead.
Also improve profiler zone annotations,
pass through worker args (for tracking concurrency), now non-optional.
PiperOrigin-RevId: 788790976
2025-07-30 00:55:45 -07:00
Jan Wassenberg
5474146129
Back to f32 kv_cache, but via typedef
...
PiperOrigin-RevId: 785422614
2025-07-21 07:05:35 -07:00
Jan Wassenberg
0f70f285e0
1.1x prefill and decode speedup (attention/activations)
...
Optimizations
- Better load-balancing in attention threading
(Previously, clusters were limited by #heads)
- Add MulByConstTo to avoid zero-init
- Parallel activations
Cleanup
- Prepare for RowPtr in A or B
- Pass through thread_id to ops
- Avoid warning in bench_matmul
PiperOrigin-RevId: 773723423
2025-06-20 08:59:53 -07:00
Jan Wassenberg
4f5785b0fd
Update instrumentation for new Highway wall-time profiler
...
Pass the thread index through and use new zone_id.
PiperOrigin-RevId: 773344242
2025-06-19 07:46:04 -07:00
Jan Wassenberg
7f62c2606e
Fix bf16 KV recompression and Rope(), fixes #608
...
Also add more helpful error message for prompt > seq_len
Also update ops_test, adding coverage for Rope().
PiperOrigin-RevId: 772945644
2025-06-18 09:14:20 -07:00
Jan Wassenberg
343482c7ef
1.02x batch decode speedup: BF16 KV cache
...
ops-inl.h: Vectorize Rope(), template
Remove unused MulBy, and extra-arg overloads of MulByConst and Softmax
Fix for DecompressAndZeroPad: ensure second vector filled
PiperOrigin-RevId: 772779163
2025-06-17 23:21:59 -07:00
Jan Wassenberg
cd80d8b24d
Speed up builds by skipping rarely used targets
...
Centralize previous code into GEMMA_DISABLED_TARGETS
PiperOrigin-RevId: 772433723
2025-06-17 05:44:20 -07:00
Jan Wassenberg
9a02d6be68
Add --prompt_file and testdata for it. Refs #608
...
Linux terminals truncate input after 4096 chars.
testdata is Frankenstein from project Gutenberg, which are long out of copyright.
Also fix loss of coherence after long context caused by incorrect IsGlobalLayer.
Move that to config.h and use max_seq_len as the initializer to make this clear.
Also avoid dynamic allocation for GriffinActivations.
PiperOrigin-RevId: 772333225
2025-06-16 23:41:07 -07:00
Jan Wassenberg
6773e4517c
Split Activations into Griffin/Attention to reduce memory usage for attention-only tests.
...
PiperOrigin-RevId: 772025282
2025-06-16 07:52:59 -07:00
Jan Wassenberg
e5c81f64a1
Major refactor: clarify query_idx (global) vs qi. Refs #607
...
Fix missing pos increment for last prefill and check that in gemma_test.
Thanks to @ufownl for pointing this out.
Change argument lists to QBatch with accessors.
Increase default seq_len to 8k.
PiperOrigin-RevId: 771937385
2025-06-16 02:42:02 -07:00
Jan Wassenberg
01cdefeda7
1.64x batch=1 prefill speedup: nested parallelization for Attention
...
(DotSoftmaxWeightedSum)
Also fix tsan error in matmul (atomic_flag instead of static)
PiperOrigin-RevId: 770241705
2025-06-11 11:28:46 -07:00
Jan Wassenberg
c027a45a2e
MatPtr-ify KV, shared div_seq_len, --seq_len flag
...
PiperOrigin-RevId: 770194455
2025-06-11 09:49:38 -07:00
Jan Wassenberg
6ee628ba38
Further cleanup: separate MatMulEnv arg
...
move row_ptrs into MatMulEnv
Consistent arg order: layer, activations, kv_cache, env
PiperOrigin-RevId: 767886386
2025-06-05 20:48:32 -07:00
Jan Wassenberg
3a266c662c
Split gemma-inl into separate source files
...
weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization.
PiperOrigin-RevId: 767562313
2025-06-05 05:36:44 -07:00