Commit Graph

766 Commits

Author SHA1 Message Date
Charles Zhao 59db30e209 add const restriction for benchmark_helper.cc, and paligemma_helper.cc to remove a few uncessary copies.
PiperOrigin-RevId: 807004597
2025-09-14 16:27:26 -07:00
Ray Smith c9b8479f7d Added zero-initialization to att_out.
Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available.

PiperOrigin-RevId: 806284756
2025-09-12 07:48:23 -07:00
Jan Wassenberg 2695aab5d2 Temporarily disable flash pending msan fix
PiperOrigin-RevId: 805350234
2025-09-10 07:25:41 -07:00
Jan Wassenberg ba6131311a Fix gemma_batch_bench for flash attention
q_T rows do not change.
Also repeat prefill to reflect perf after autotuning.

PiperOrigin-RevId: 805319377
2025-09-10 05:32:34 -07:00
Jan Wassenberg 9457258330 Refactor MatMul to accept views in the kernel functions
Make arg order consistent.
Move StridedView into mat.h.
Add view support to RowPtrs.

PiperOrigin-RevId: 805197381
2025-09-09 22:09:47 -07:00
Ray Smith f10ac41a20 Added flash attention, with both a single-q function, and a register-tiled function.
The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine.

PiperOrigin-RevId: 804913784
2025-09-09 08:05:26 -07:00
Jan Wassenberg 24b1760f03 Refactor: move Worker to ThreadingContext, factor out MMDecompress
PiperOrigin-RevId: 804909921
2025-09-09 07:56:12 -07:00
Jan Wassenberg 461a9c7d1b Matmul refactoring towards fusion
MMLoops: move dispatch code out, use overloads
split build target into matmul_env (for MatMulEnv/MMOptions)
weights: no longer call BindB
Fix potential out of bounds in gemma_batch_bench
PiperOrigin-RevId: 804895985
2025-09-09 07:13:38 -07:00
Jan Wassenberg 34ceee6c30 Update MatMul comments, removing mention of partial.
PiperOrigin-RevId: 804872289
2025-09-09 05:57:33 -07:00
Jan Wassenberg a5ab99e4ba Memory use reduction: smaller/single MMStorage
PiperOrigin-RevId: 804865029
2025-09-09 05:32:46 -07:00
Jan Wassenberg 06e5da1e22 Cleanup: split CacheInfo from Allocator, MatMul helper functions
Lift DecompressA out of main autotuner to prevent interference
Also use kMaxNR / kNR constants instead of extra args
Fix: only require vector alignment, not cache alignment
PiperOrigin-RevId: 804333769
2025-09-08 02:23:58 -07:00
Jan Wassenberg 6e52a835c6 Faster startup on tsan: use hierarchical parallelism for BF16 conversion
Also re-enable profiler zones

PiperOrigin-RevId: 804273899
2025-09-07 22:50:31 -07:00
Jan Wassenberg cbe24eac51 1.15x speedup: parallel sampling, enabled by new RNG
Also pass pos to SampleFunc, for seeding the RNG.

PiperOrigin-RevId: 803453518
2025-09-05 07:24:02 -07:00
Jan Wassenberg ad7d7a2713 Further adjust dot_test threshold (numerics)
PiperOrigin-RevId: 803428406
2025-09-05 05:50:16 -07:00
Jan Wassenberg 2b4c16e243 Remove Griffin support
Also add IsObsolete helper

PiperOrigin-RevId: 803376921
2025-09-05 02:35:40 -07:00
Jan Wassenberg 56186193c1 Replace mt19937 with new generator to enable parallel sampling
Split it into immutable AesCtrEngine and RngStream
Also add RowSpan and Logits span

PiperOrigin-RevId: 803336423
2025-09-04 23:49:10 -07:00
Jan Wassenberg 5d1693e806 Internal change
PiperOrigin-RevId: 803083229
2025-09-04 10:31:20 -07:00
Jan Wassenberg afd82376a5 Add AES-CTR RNG for parallel sampling (not yet used)
PiperOrigin-RevId: 802991142
2025-09-04 05:58:42 -07:00
Jan Wassenberg 4be4799727 Remove kMaxPackages and per-package-related code
matmul: remove kMaxClusters, dynamic allocation
PiperOrigin-RevId: 802950348
2025-09-04 03:33:12 -07:00
Jan Wassenberg 7263ab8445 MatMul simplification, threading strategy improvements
remove MatMul f32 special case (smaller code),
types: Add u32/u64 for use by Activations
move renamed ParallelismStrategy to threading_context so can pass ctx
ensure worker index is unique across clusters
matmul.h: const member functions for renamed policy classes (easier to call)
PiperOrigin-RevId: 802848086
2025-09-03 21:45:07 -07:00
Marie White 74ffe079c4 Create separate MMStorage objects per cluster.
PiperOrigin-RevId: 802588625
2025-09-03 09:35:48 -07:00
Jan Wassenberg b7b3d353db Simplify MatMul: remove F32 special case (build time)
Also move kMaxM into separate kMaxBatchSize

PiperOrigin-RevId: 802086590
2025-09-02 04:29:21 -07:00
Jan Wassenberg 1e3c853e80 Add ParallelFor wrapper function and one new mode
Move ParallelismType from matmul.h to threading.h
Replace SmallParallelFor with ParallelFor and the new mode

PiperOrigin-RevId: 802038452
2025-09-02 01:40:09 -07:00
Marie White 3737224132 Add in-cluster parallel policy. Update policy to include cluster_idx.
PiperOrigin-RevId: 802016308
2025-09-02 00:16:00 -07:00
Marie White 27cb8e12d9 Handle non-threading parallel policy.
PiperOrigin-RevId: 802012517
2025-09-02 00:02:57 -07:00
Marie White 0d2e74d74a Add MMOptions as an argument to Matmul.
PiperOrigin-RevId: 802008198
2025-09-01 23:46:39 -07:00
Jan Wassenberg 229bd078a1 1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage.
Also increase dot_test.cc range for Zen4, and matmul_test tolerance (failing in some configs)

PiperOrigin-RevId: 801789922
2025-09-01 06:34:04 -07:00
Marie White bc0c0bac8b Add non-threading parallel policy.
PiperOrigin-RevId: 800913294
2025-08-29 08:39:06 -07:00
Marie White 00b70f69c5 Include parallelism type in DoMatMul. Also remove package handling.
PiperOrigin-RevId: 800902568
2025-08-29 08:04:52 -07:00
Jan Wassenberg 0ae8646731 Fix remainder handling for Paligemma
No longer attempt to skip the remainder handling because B might also be a non-padded view.

PiperOrigin-RevId: 800890805
2025-08-29 07:25:52 -07:00
Marie White 973e284ed6 Refactor Matmul to use a policy class for parallelization.
PiperOrigin-RevId: 800864489
2025-08-29 05:40:39 -07:00
Jan Wassenberg 6c39a2dea4 1.01x speedup: More bf16 activations to reduce DecompressA.
Also move observer call into function, format gemma_args.

PiperOrigin-RevId: 800827400
2025-08-29 03:19:01 -07:00
Jan Wassenberg 7288891439 Remove F64 partial storage in matmul.
Also remove no longer used kMaxN; row_ptrs only used for C

PiperOrigin-RevId: 800774757
2025-08-29 00:12:08 -07:00
Jan Wassenberg 31c09cca4c f32 LoopKC: 1.37x(M=512), 1.19(M=128) single-K F32,BF16 matmul speedup on SKX
Add a special case for A=F32,B=BF16, used when there is no native bf16 dot product.

dot-inl: ensure bf16,f32 and f32,bf16 both get promoted to float before f64 summation
matmul.cc: update autotuning to reflect actual A size
matmul_test: add all combinations of bf16/f32, report all results, not just first difference, check non-vector-aligned K
PiperOrigin-RevId: 800487817
2025-08-28 08:55:50 -07:00
Jan Wassenberg 98ddc166db Expand ThreadingContext comments
PiperOrigin-RevId: 800479954
2025-08-28 08:32:10 -07:00
Marie White 6128e758ff Change ffw_out from B16 to F32.
PiperOrigin-RevId: 800330411
2025-08-28 00:01:39 -07:00
Jan Wassenberg 5411fd846d Minor: batched NotifyGenerate, fix comment/dep
PiperOrigin-RevId: 799889802
2025-08-26 23:33:17 -07:00
Jan Wassenberg 86afd53076 1.04x speedup: Parallelize SoftCap
Also require opt-in constexpr flag for observer callbacks, update zones

PiperOrigin-RevId: 799655163
2025-08-26 11:55:20 -07:00
Jan Wassenberg ed2f0bd1b0 Fix pos assertions, refs #665
Ensure the streaming func pos matches the number of calls.
Add two arguments that control pos+1 and pos+=1 behavior.
Also cleanup/add comments.
run: use batch_stream_func, add assert, higher verbosity for MM autotune output
PiperOrigin-RevId: 799511163
2025-08-26 04:50:40 -07:00
Jan Wassenberg 9bf0fe4e37 Internal change
PiperOrigin-RevId: 799509375
2025-08-26 04:44:08 -07:00
Jan Wassenberg d3a5ddf657 Merge pull request #663 from junjihashimoto:feature/api-server
PiperOrigin-RevId: 797731089
2025-08-24 11:57:05 +02:00
Rhett Stucki 73f1140dca Fix an off-by-one error after StreamAndUpdateEOS() to remove the MSAN warning about reading an uninitialized variable in the kv_cache.
The logic for choosing whether or not to attend to the last token during prefill wasn't completely consistent with StreamAndUpdateEOS(), causing an off-by-one error that prevented the kv_cache from being fully populated.

PiperOrigin-RevId: 797614310
2025-08-20 22:59:58 -07:00
Junji Hashimoto 41321611fd feature: add API server and client with Google protocol 2025-08-21 11:32:48 +09:00
Jan Wassenberg 41a86d41a9 Fix preadv error: only enable if we have a handle
PiperOrigin-RevId: 795455020
2025-08-15 06:30:34 -07:00
Phil Culliton 78573b6718 Internal change. Add deduction for 270M.
PiperOrigin-RevId: 795041810
2025-08-14 08:04:38 -07:00
Phil Culliton d044801c1d Internal change
PiperOrigin-RevId: 794620076
2025-08-13 09:47:45 -07:00
Jan Wassenberg 71406cf6d0 More profiler interface fixes: hwy:: plus avoid ADD_ZONE
PiperOrigin-RevId: 794493165
2025-08-13 03:15:48 -07:00
Jan Wassenberg faa4102992 (Resubmit) Prepare profiler annotations for new API
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.

PiperOrigin-RevId: 794461159
2025-08-13 01:38:24 -07:00
The gemma.cpp Authors a2d9133f7d Prepare profiler annotations for new API
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.

PiperOrigin-RevId: 793865287
2025-08-11 17:51:38 -07:00
Jan Wassenberg 4cbf63e6f0 Prepare profiler annotations for new API
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.

PiperOrigin-RevId: 793821255
2025-08-11 15:34:52 -07:00