Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor
PiperOrigin-RevId: 822934530
matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC
matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps.
PiperOrigin-RevId: 807291701
The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine.
PiperOrigin-RevId: 804913784
remove MatMul f32 special case (smaller code),
types: Add u32/u64 for use by Activations
move renamed ParallelismStrategy to threading_context so can pass ctx
ensure worker index is unique across clusters
matmul.h: const member functions for renamed policy classes (easier to call)
PiperOrigin-RevId: 802848086
Activations was over-parallelized, use single pool instead.
Also improve profiler zone annotations,
pass through worker args (for tracking concurrency), now non-optional.
PiperOrigin-RevId: 788790976
Optimizations
- Better load-balancing in attention threading
(Previously, clusters were limited by #heads)
- Add MulByConstTo to avoid zero-init
- Parallel activations
Cleanup
- Prepare for RowPtr in A or B
- Pass through thread_id to ops
- Avoid warning in bench_matmul
PiperOrigin-RevId: 773723423
ops-inl.h: Vectorize Rope(), template
Remove unused MulBy, and extra-arg overloads of MulByConst and Softmax
Fix for DecompressAndZeroPad: ensure second vector filled
PiperOrigin-RevId: 772779163
BF16 att_sums and ffw_out
Support BF16 B views without decompression
Support arbitrary types in MulByConstAndAdd, AddFrom
Also update profiler annotations in ops-inl.h
PiperOrigin-RevId: 766995010
KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out
optimize_test: larger vocab_size requires more steps
shared.h: Remove unused u128 type
correctly set Activation matrix rows, avoid passing as arg
ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type
mat: add OverrideRows, used by SetBatchSize
PiperOrigin-RevId: 757790736
use new ThreadingContext2 instead of monostate/init in each frontend
Add ThreadingArgs(replaces AppArgs)
backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride
compress_weights: remove, moving to py-only exporter instead
Move MatPtr to mat.h and revise interface:
- Generic MatOwner
- rename accessors to Packed*
- support stride/row accessors, fix RowPtr stride
Add TypeBits(Type)
Move GenerateMat to test_util-inl for sharing between matmul test/bench
Move internal init to gemma.cc to avoid duplication
Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage
Remove --compressed_weights, use --weights instead.
tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img
Allocator: use normal unique_ptr for AllocBytes so users can call directly
threading: use -> because AlignedPtr no longer assumes arrays
PiperOrigin-RevId: 745918637
This computes softmax on the top-K logits, instead of computing softmax first and then getting top-K probs. So we end up avoiding renormalizing too. Additionally, modify softmax to do temperature scaling, if temp != 1.0
PiperOrigin-RevId: 727702149
ops.h: move CreateInvTimescale to allow calling without depending on gemma
Pass around MatMulEnv instead of pools to avoid re-creating the env
profiler.h can now be used outside SIMD code
allocator: add StepBytes and QuantumSteps
rename worker thread with package/cluster in the name
threading: add Visit* to IndexRange
PiperOrigin-RevId: 718766704
Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively.
Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly.
Adjusted WeightsWrapper and ForwardLayer etc to match.
The only remaining template arg is the weight type.
This enables all the instantiations to be deleted, apart from one per type.
It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately.
Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively.
PiperOrigin-RevId: 686870060
add app.h comment
compress-inl: remove unused typedef
gemma-inl: add missing HWY_ATTR and cast
separate sum-inl.h and basics.h headers
replace more hwy::bfloat16_t with BF16
update include pragmas
update dot_test thresholds
update Highway version in Bazel for HWY_RCAST_ALIGNED fix
PiperOrigin-RevId: 684464326
This can affect generation results after a few hundred tokens.
Also remove profiler from DecompressAndCall, use Add instead of +=,
use PackedSpan for args and remove alignment requirement.
Changing accumulation order in AssimilateCascadedSums updates dot_test thresholds.
PiperOrigin-RevId: 676891797
- Sampling was not correct for k>1 and temperature=0.
- Sampling was not correct for only negative logits.
Also restructure the code a bit for better readability and add some asserts for things that shouldn't happen.
PiperOrigin-RevId: 676043267