826 Commits

Author SHA1 Message Date
Ray Smith 2f6cbde8ff Added a smaller tile size to flash attention for smaller batch sizes
PiperOrigin-RevId: 813226193
2025-09-30 05:49:20 -07:00
Ray Smith 4974f24832 Fixed bug with softcap in single flash attention
PiperOrigin-RevId: 813164938
2025-09-30 02:17:58 -07:00
Nitin Gangahar 16536996d1 Remove less useful spammy log lines.
PiperOrigin-RevId: 812694572
2025-09-29 02:28:41 -07:00
Nitin Gangahar 667a3f117a Utilize multiple cores to read weight batches.
PiperOrigin-RevId: 811893059
2025-09-26 11:28:33 -07:00
Ray Smith d15731d201 Used hn::BroadcastLane instead of Set(..., x.raw)
PiperOrigin-RevId: 811386295
2025-09-25 09:42:03 -07:00
Charles Zhao 4f0c633248 (1) Added QueryResultAndMetrics and BatchQueryModelWithMetrics to also return TimingInfo besides query results.
PiperOrigin-RevId: 810634261
2025-09-23 17:02:29 -07:00
Jan Wassenberg fac8aac4cb Internal change
PiperOrigin-RevId: 809975026
2025-09-22 05:37:03 -07:00
Jan Wassenberg 501fdf000e Remove no longer used MatVec
PiperOrigin-RevId: 809059409
2025-09-19 09:03:22 -07:00
Jan Wassenberg b603425bf3 Fix batch inference: dangling reference
Also add more detailed asserts/error messages.

PiperOrigin-RevId: 807695421
2025-09-16 08:01:56 -07:00
Jan Wassenberg f3bc1c17da 1.03x speedup: fused FFN
matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC
matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps.
PiperOrigin-RevId: 807291701
2025-09-15 10:26:37 -07:00
Charles Zhao 59db30e209 Add const restriction for benchmark_helper.cc and paligemma_helper.cc to remove a few unnecessary copies.
PiperOrigin-RevId: 807004597
2025-09-14 16:27:26 -07:00
Ray Smith c9b8479f7d Added zero-initialization to att_out.
Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available.

PiperOrigin-RevId: 806284756
2025-09-12 07:48:23 -07:00
Jan Wassenberg 2695aab5d2 Temporarily disable flash pending msan fix
PiperOrigin-RevId: 805350234
2025-09-10 07:25:41 -07:00
Jan Wassenberg ba6131311a Fix gemma_batch_bench for flash attention
q_T rows do not change.
Also repeat prefill to reflect perf after autotuning.

PiperOrigin-RevId: 805319377
2025-09-10 05:32:34 -07:00
Jan Wassenberg 9457258330 Refactor MatMul to accept views in the kernel functions
Make arg order consistent.
Move StridedView into mat.h.
Add view support to RowPtrs.

PiperOrigin-RevId: 805197381
2025-09-09 22:09:47 -07:00
Ray Smith f10ac41a20 Added flash attention, with both a single-q function and a register-tiled function.
The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine.

PiperOrigin-RevId: 804913784
2025-09-09 08:05:26 -07:00
Jan Wassenberg 24b1760f03 Refactor: move Worker to ThreadingContext, factor out MMDecompress
PiperOrigin-RevId: 804909921
2025-09-09 07:56:12 -07:00
Jan Wassenberg 461a9c7d1b Matmul refactoring towards fusion
MMLoops: move dispatch code out, use overloads
split build target into matmul_env (for MatMulEnv/MMOptions)
weights: no longer call BindB
Fix potential out of bounds in gemma_batch_bench
PiperOrigin-RevId: 804895985
2025-09-09 07:13:38 -07:00
Jan Wassenberg 34ceee6c30 Update MatMul comments, removing mention of partial.
PiperOrigin-RevId: 804872289
2025-09-09 05:57:33 -07:00
Jan Wassenberg a5ab99e4ba Memory use reduction: smaller/single MMStorage
PiperOrigin-RevId: 804865029
2025-09-09 05:32:46 -07:00
Jan Wassenberg 06e5da1e22 Cleanup: split CacheInfo from Allocator, MatMul helper functions
Lift DecompressA out of main autotuner to prevent interference
Also use kMaxNR / kNR constants instead of extra args
Fix: only require vector alignment, not cache alignment
PiperOrigin-RevId: 804333769
2025-09-08 02:23:58 -07:00
Jan Wassenberg 6e52a835c6 Faster startup on tsan: use hierarchical parallelism for BF16 conversion
Also re-enable profiler zones

PiperOrigin-RevId: 804273899
2025-09-07 22:50:31 -07:00
Jan Wassenberg cbe24eac51 1.15x speedup: parallel sampling, enabled by new RNG
Also pass pos to SampleFunc, for seeding the RNG.

PiperOrigin-RevId: 803453518
2025-09-05 07:24:02 -07:00
Jan Wassenberg ad7d7a2713 Further adjust dot_test threshold (numerics)
PiperOrigin-RevId: 803428406
2025-09-05 05:50:16 -07:00
Jan Wassenberg 2b4c16e243 Remove Griffin support
Also add IsObsolete helper

PiperOrigin-RevId: 803376921
2025-09-05 02:35:40 -07:00
Jan Wassenberg 56186193c1 Replace mt19937 with new generator to enable parallel sampling
Split it into immutable AesCtrEngine and RngStream
Also add RowSpan and Logits span

PiperOrigin-RevId: 803336423
2025-09-04 23:49:10 -07:00
Jan Wassenberg 5d1693e806 Internal change
PiperOrigin-RevId: 803083229
2025-09-04 10:31:20 -07:00
Jan Wassenberg afd82376a5 Add AES-CTR RNG for parallel sampling (not yet used)
PiperOrigin-RevId: 802991142
2025-09-04 05:58:42 -07:00
Jan Wassenberg 4be4799727 Remove kMaxPackages and per-package-related code
matmul: remove kMaxClusters, dynamic allocation
PiperOrigin-RevId: 802950348
2025-09-04 03:33:12 -07:00
Jan Wassenberg 7263ab8445 MatMul simplification, threading strategy improvements
remove MatMul f32 special case (smaller code)
types: Add u32/u64 for use by Activations
move renamed ParallelismStrategy to threading_context so can pass ctx
ensure worker index is unique across clusters
matmul.h: const member functions for renamed policy classes (easier to call)
PiperOrigin-RevId: 802848086
2025-09-03 21:45:07 -07:00
Marie White 74ffe079c4 Create separate MMStorage objects per cluster.
PiperOrigin-RevId: 802588625
2025-09-03 09:35:48 -07:00
Jan Wassenberg b7b3d353db Simplify MatMul: remove F32 special case (build time)
Also move kMaxM into separate kMaxBatchSize

PiperOrigin-RevId: 802086590
2025-09-02 04:29:21 -07:00
Jan Wassenberg 1e3c853e80 Add ParallelFor wrapper function and one new mode
Move ParallelismType from matmul.h to threading.h
Replace SmallParallelFor with ParallelFor and the new mode

PiperOrigin-RevId: 802038452
2025-09-02 01:40:09 -07:00
Marie White 3737224132 Add in-cluster parallel policy. Update policy to include cluster_idx.
PiperOrigin-RevId: 802016308
2025-09-02 00:16:00 -07:00
Marie White 27cb8e12d9 Handle non-threading parallel policy.
PiperOrigin-RevId: 802012517
2025-09-02 00:02:57 -07:00
Marie White 0d2e74d74a Add MMOptions as an argument to Matmul.
PiperOrigin-RevId: 802008198
2025-09-01 23:46:39 -07:00
Jan Wassenberg 229bd078a1 1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage.
Also increase dot_test.cc range for Zen4, and matmul_test tolerance (failing in some configs)

PiperOrigin-RevId: 801789922
2025-09-01 06:34:04 -07:00
Marie White bc0c0bac8b Add non-threading parallel policy.
PiperOrigin-RevId: 800913294
2025-08-29 08:39:06 -07:00
Marie White 00b70f69c5 Include parallelism type in DoMatMul. Also remove package handling.
PiperOrigin-RevId: 800902568
2025-08-29 08:04:52 -07:00
Jan Wassenberg 0ae8646731 Fix remainder handling for Paligemma
No longer attempt to skip the remainder handling because B might also be a non-padded view.

PiperOrigin-RevId: 800890805
2025-08-29 07:25:52 -07:00
Marie White 973e284ed6 Refactor Matmul to use a policy class for parallelization.
PiperOrigin-RevId: 800864489
2025-08-29 05:40:39 -07:00
Jan Wassenberg 6c39a2dea4 1.01x speedup: More bf16 activations to reduce DecompressA.
Also move observer call into function, format gemma_args.

PiperOrigin-RevId: 800827400
2025-08-29 03:19:01 -07:00
Jan Wassenberg 7288891439 Remove F64 partial storage in matmul.
Also remove no longer used kMaxN; row_ptrs only used for C

PiperOrigin-RevId: 800774757
2025-08-29 00:12:08 -07:00
Jan Wassenberg 31c09cca4c f32 LoopKC: 1.37x (M=512), 1.19x (M=128) single-K F32,BF16 matmul speedup on SKX
Add a special case for A=F32,B=BF16, used when there is no native bf16 dot product.

dot-inl: ensure bf16,f32 and f32,bf16 both get promoted to float before f64 summation
matmul.cc: update autotuning to reflect actual A size
matmul_test: add all combinations of bf16/f32, report all results, not just first difference, check non-vector-aligned K
PiperOrigin-RevId: 800487817
2025-08-28 08:55:50 -07:00
Jan Wassenberg 98ddc166db Expand ThreadingContext comments
PiperOrigin-RevId: 800479954
2025-08-28 08:32:10 -07:00
Marie White 6128e758ff Change ffw_out from BF16 to F32.
PiperOrigin-RevId: 800330411
2025-08-28 00:01:39 -07:00
Jan Wassenberg 5411fd846d Minor: batched NotifyGenerate, fix comment/dep
PiperOrigin-RevId: 799889802
2025-08-26 23:33:17 -07:00
Jan Wassenberg 86afd53076 1.04x speedup: Parallelize SoftCap
Also require opt-in constexpr flag for observer callbacks, update zones

PiperOrigin-RevId: 799655163
2025-08-26 11:55:20 -07:00
Jan Wassenberg ed2f0bd1b0 Fix pos assertions, refs #665
Ensure the streaming func pos matches the number of calls.
Add two arguments that control pos+1 and pos+=1 behavior.
Also cleanup/add comments.
run: use batch_stream_func, add assert, higher verbosity for MM autotune output
PiperOrigin-RevId: 799511163
2025-08-26 04:50:40 -07:00
Jan Wassenberg 9bf0fe4e37 Internal change
PiperOrigin-RevId: 799509375
2025-08-26 04:44:08 -07:00