Phil Culliton
503aaddd65
Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
...
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
Ray Smith
ee18916abf
Removed PROFILER_ZONE from the most frequently called functions to reduce overhead.
...
PiperOrigin-RevId: 819739402
2025-10-15 07:10:04 -07:00
Ray Smith
fb6fa793f4
Added a global (to gemma) zones list so that most PROFILER_ZONE3 call sites avoid the synchronization required for the static const initialization of the zone handle.
...
Improved flash_attention to enable profiling using the new zones.
PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00
Jan Wassenberg
035273c184
tune pool kSpin mode in threading_context
...
Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes.
threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order
Also enable SPR target (Zen4 is AMD-only),
update Highway version for renamed Thread()->GlobalIdx().
PiperOrigin-RevId: 816223017
2025-10-07 08:36:26 -07:00
Ray Smith
684a0444e9
Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines
...
PiperOrigin-RevId: 814241032
2025-10-02 08:15:16 -07:00
Ray Smith
14244664c8
Avoid transposing Q when it isn't needed
...
PiperOrigin-RevId: 814187984
2025-10-02 05:16:35 -07:00
Jan Wassenberg
fe5a39990e
Improve FlashAttention threading:
...
kFlat for RMSNorm (hierarchical is excessive),
profiler zone naming improvements.
PiperOrigin-RevId: 814144012
2025-10-02 02:37:05 -07:00
Ray Smith
6098a022b3
Increased parallelism for RMSNormAndPositionalEncoding
...
PiperOrigin-RevId: 813738994
2025-10-01 07:11:14 -07:00
Ray Smith
2f6cbde8ff
Added a smaller tile size to flash attention for smaller batch sizes
...
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Ray Smith
4974f24832
Fixed bug with softcap in single flash attention
...
PiperOrigin-RevId: 813164938
2025-09-30 02:17:58 -07:00
Nitin Gangahar
667a3f117a
Utilize multiple cores to read weight batches.
...
PiperOrigin-RevId: 811893059
2025-09-26 11:28:33 -07:00
Charles Zhao
4f0c633248
(1) Added QueryResultAndMetrics and BatchQueryModelWithMetrics to also return TimingInfo besides query results.
...
PiperOrigin-RevId: 810634261
2025-09-23 17:02:29 -07:00
Jan Wassenberg
fac8aac4cb
Internal change
...
PiperOrigin-RevId: 809975026
2025-09-22 05:37:03 -07:00
Jan Wassenberg
501fdf000e
Remove no longer used MatVec
...
PiperOrigin-RevId: 809059409
2025-09-19 09:03:22 -07:00
Jan Wassenberg
f3bc1c17da
1.03x speedup: fused FFN
...
matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC
matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps.
PiperOrigin-RevId: 807291701
2025-09-15 10:26:37 -07:00
Ray Smith
c9b8479f7d
Added zero-initialization to att_out.
...
Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available.
PiperOrigin-RevId: 806284756
2025-09-12 07:48:23 -07:00
Jan Wassenberg
2695aab5d2
Temporarily disable flash pending msan fix
...
PiperOrigin-RevId: 805350234
2025-09-10 07:25:41 -07:00
Jan Wassenberg
ba6131311a
Fix gemma_batch_bench for flash attention
...
q_T rows do not change.
Also repeat prefill to reflect perf after autotuning.
PiperOrigin-RevId: 805319377
2025-09-10 05:32:34 -07:00
Ray Smith
f10ac41a20
Added flash attention, with both a single-q function and a register-tiled function.
...
The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine.
PiperOrigin-RevId: 804913784
2025-09-09 08:05:26 -07:00
Jan Wassenberg
461a9c7d1b
Matmul refactoring towards fusion
...
MMLoops: move dispatch code out, use overloads
split build target into matmul_env (for MatMulEnv/MMOptions)
weights: no longer call BindB
Fix potential out of bounds in gemma_batch_bench
PiperOrigin-RevId: 804895985
2025-09-09 07:13:38 -07:00
Jan Wassenberg
a5ab99e4ba
Memory use reduction: smaller/single MMStorage
...
PiperOrigin-RevId: 804865029
2025-09-09 05:32:46 -07:00
Jan Wassenberg
6e52a835c6
Faster startup on tsan: use hierarchical parallelism for BF16 conversion
...
Also re-enable profiler zones
PiperOrigin-RevId: 804273899
2025-09-07 22:50:31 -07:00
Jan Wassenberg
cbe24eac51
1.15x speedup: parallel sampling, enabled by new RNG
...
Also pass pos to SampleFunc, for seeding the RNG.
PiperOrigin-RevId: 803453518
2025-09-05 07:24:02 -07:00
Jan Wassenberg
2b4c16e243
Remove Griffin support
...
Also add IsObsolete helper
PiperOrigin-RevId: 803376921
2025-09-05 02:35:40 -07:00
Jan Wassenberg
56186193c1
Replace mt19937 with new generator to enable parallel sampling
...
Split it into immutable AesCtrEngine and RngStream
Also add RowSpan and Logits span
PiperOrigin-RevId: 803336423
2025-09-04 23:49:10 -07:00
Jan Wassenberg
5d1693e806
Internal change
...
PiperOrigin-RevId: 803083229
2025-09-04 10:31:20 -07:00
Jan Wassenberg
4be4799727
Remove kMaxPackages and per-package-related code
...
matmul: remove kMaxClusters, dynamic allocation
PiperOrigin-RevId: 802950348
2025-09-04 03:33:12 -07:00
Jan Wassenberg
7263ab8445
MatMul simplification, threading strategy improvements
...
remove MatMul f32 special case (smaller code),
types: Add u32/u64 for use by Activations
move renamed ParallelismStrategy to threading_context so we can pass ctx
ensure worker index is unique across clusters
matmul.h: const member functions for renamed policy classes (easier to call)
PiperOrigin-RevId: 802848086
2025-09-03 21:45:07 -07:00
Jan Wassenberg
b7b3d353db
Simplify MatMul: remove F32 special case (build time)
...
Also move kMaxM into separate kMaxBatchSize
PiperOrigin-RevId: 802086590
2025-09-02 04:29:21 -07:00
Jan Wassenberg
1e3c853e80
Add ParallelFor wrapper function and one new mode
...
Move ParallelismType from matmul.h to threading.h
Replace SmallParallelFor with ParallelFor and the new mode
PiperOrigin-RevId: 802038452
2025-09-02 01:40:09 -07:00
Jan Wassenberg
229bd078a1
1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage.
...
Also increase dot_test.cc range for Zen4, and matmul_test tolerance (failing in some configs)
PiperOrigin-RevId: 801789922
2025-09-01 06:34:04 -07:00
Jan Wassenberg
0ae8646731
Fix remainder handling for Paligemma
...
No longer attempt to skip the remainder handling because B might also be a non-padded view.
PiperOrigin-RevId: 800890805
2025-08-29 07:25:52 -07:00
Marie White
973e284ed6
Refactor Matmul to use a policy class for parallelization.
...
PiperOrigin-RevId: 800864489
2025-08-29 05:40:39 -07:00
Jan Wassenberg
6c39a2dea4
1.01x speedup: More bf16 activations to reduce DecompressA.
...
Also move observer call into function, format gemma_args.
PiperOrigin-RevId: 800827400
2025-08-29 03:19:01 -07:00
Jan Wassenberg
7288891439
Remove F64 partial storage in matmul.
...
Also remove no longer used kMaxN; row_ptrs only used for C
PiperOrigin-RevId: 800774757
2025-08-29 00:12:08 -07:00
Jan Wassenberg
98ddc166db
Expand ThreadingContext comments
...
PiperOrigin-RevId: 800479954
2025-08-28 08:32:10 -07:00
Marie White
6128e758ff
Change ffw_out from B16 to F32.
...
PiperOrigin-RevId: 800330411
2025-08-28 00:01:39 -07:00
Jan Wassenberg
5411fd846d
Minor: batched NotifyGenerate, fix comment/dep
...
PiperOrigin-RevId: 799889802
2025-08-26 23:33:17 -07:00
Jan Wassenberg
86afd53076
1.04x speedup: Parallelize SoftCap
...
Also require opt-in constexpr flag for observer callbacks, update zones
PiperOrigin-RevId: 799655163
2025-08-26 11:55:20 -07:00
Jan Wassenberg
ed2f0bd1b0
Fix pos assertions, refs #665
...
Ensure the streaming func pos matches the number of calls.
Add two arguments that control pos+1 and pos+=1 behavior.
Also clean up/add comments.
run: use batch_stream_func, add assert, higher verbosity for MM autotune output
PiperOrigin-RevId: 799511163
2025-08-26 04:50:40 -07:00
Jan Wassenberg
9bf0fe4e37
Internal change
...
PiperOrigin-RevId: 799509375
2025-08-26 04:44:08 -07:00
Jan Wassenberg
d3a5ddf657
Merge pull request #663 from junjihashimoto:feature/api-server
...
PiperOrigin-RevId: 797731089
2025-08-24 11:57:05 +02:00
Rhett Stucki
73f1140dca
Fix an off-by-one error after StreamAndUpdateEOS() to remove the MSAN warning about reading an uninitialized variable in the kv_cache.
...
The logic for choosing whether or not to attend to the last token during prefill wasn't completely consistent with StreamAndUpdateEOS(), causing an off-by-one error that prevented the kv_cache from being fully populated.
PiperOrigin-RevId: 797614310
2025-08-20 22:59:58 -07:00
Junji Hashimoto
41321611fd
feature: add API server and client with Google protocol
2025-08-21 11:32:48 +09:00
Phil Culliton
78573b6718
Internal change. Add deduction for 270M.
...
PiperOrigin-RevId: 795041810
2025-08-14 08:04:38 -07:00
Phil Culliton
d044801c1d
Internal change
...
PiperOrigin-RevId: 794620076
2025-08-13 09:47:45 -07:00
Jan Wassenberg
71406cf6d0
More profiler interface fixes: hwy:: plus avoid ADD_ZONE
...
PiperOrigin-RevId: 794493165
2025-08-13 03:15:48 -07:00
Jan Wassenberg
faa4102992
(Resubmit) Prepare profiler annotations for new API
...
Pass hwy::Profiler& to low-level functions.
Use ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.
PiperOrigin-RevId: 794461159
2025-08-13 01:38:24 -07:00
The gemma.cpp Authors
a2d9133f7d
Prepare profiler annotations for new API
...
Pass hwy::Profiler& to low-level functions.
Use ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.
PiperOrigin-RevId: 793865287
2025-08-11 17:51:38 -07:00
Jan Wassenberg
4cbf63e6f0
Prepare profiler annotations for new API
...
Pass hwy::Profiler& to low-level functions.
Use ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.
PiperOrigin-RevId: 793821255
2025-08-11 15:34:52 -07:00