Right now multiplication is done by converting to corresponding float format.
Can yield up to 2x improvements for membw constrained shapes
PiperOrigin-RevId: 880748493
This catches typos/incorrect usage.
Refactor: group Loader/Threading/Inference into GemmaArgs.
All *Args ctors now have an extra ConsumedArgs& argument.
PiperOrigin-RevId: 844690553
This could lead to stack overflow in B_storage.
Also do not require specific type for query_norm_scale,
update batch sizes for attention tensors,
more verbose Mat shape/type checks.
PiperOrigin-RevId: 824987689
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.
PiperOrigin-RevId: 824584177
Group M=4..7 into same config. Add configs for power of two sizes.
Allow odd mc to enable a single range for odd M.
io.cc: warning fix(cast).
IsBlock -> !IsOneMC
benchmark_helper: best for verbosity 3, all configs for 4
ops_test: remove unused includes
PiperOrigin-RevId: 824475104
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor
PiperOrigin-RevId: 822934530
Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes.
threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order
Also enable SPR target (Zen4 is AMD-only),
update Highway version for renamed Thread()->GlobalIdx().
PiperOrigin-RevId: 816223017
matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC
matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps.
PiperOrigin-RevId: 807291701
MMLoops: move dispatch code out, use overloads
split build target into matmul_env (for MatMulEnv/MMOptions)
weights: no longer call BindB
Fix potential out of bounds in gemma_batch_bench
PiperOrigin-RevId: 804895985
Lift DecompressA out of main autotuner to prevent interference
Also use kMaxNR / kNR constants instead of extra args
Fix: only require vector alignment, not cache alignment
PiperOrigin-RevId: 804333769
remove MatMul f32 special case (smaller code),
types: Add u32/u64 for use by Activations
move renamed ParallelismStrategy to threading_context so can pass ctx
ensure worker index is unique across clusters
matmul.h: const member functions for renamed policy classes (easier to call)
PiperOrigin-RevId: 802848086