gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Ray Smith	8a100c1e8d	Added access to flash attention internals to TileFlashAttention4 PiperOrigin-RevId: 826011137	2025-10-30 06:50:05 -07:00
Phil Culliton	116cd6eff6	BF16 mixed-mode flash attention PiperOrigin-RevId: 825433929	2025-10-29 01:48:28 -07:00
Jan Wassenberg	4bd465ffd3	Also update attention.h to type-erased query_norm_scale PiperOrigin-RevId: 825014334	2025-10-28 06:48:33 -07:00
Jan Wassenberg	3cc0139ebb	Fix excessive KC/MC from prior change This could lead to stack overflow in B_storage. Also do not require specific type for query_norm_scale, update batch sizes for attention tensors, more verbose Mat shape/type checks. PiperOrigin-RevId: 824987689	2025-10-28 05:33:01 -07:00
Biruk Mammo	5a05857deb	[Gemma.cpp] Allows non-owned arguments for attention methods. * Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`. * Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches. * Enables the `MatPtrT` default constructor for simpler initializations. * Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor. PiperOrigin-RevId: 824584177	2025-10-27 10:43:25 -07:00
Theotime Combes	1bdde1af3c	Add config flag for global timescale & rely on config to deduce wrapping PiperOrigin-RevId: 823512377	2025-10-24 06:54:56 -07:00
Jan Wassenberg	3ed403e287	Major cleanup of profiler zones, add Caller annotation for all pool.Run Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones Add GCPP_ZONE helper Add Caller argument to pool.Run to enable new stats Remove most direct dependencies on ThreadPool, prefer ParallelFor PiperOrigin-RevId: 822934530	2025-10-23 01:54:24 -07:00
Phil Culliton	503aaddd65	Add 8-bit integer quantization (I8Stream) to Gemma.cpp. PiperOrigin-RevId: 819787856	2025-10-15 09:25:20 -07:00
Ray Smith	ee18916abf	Removed the PROFILER_ZONE from the most highly called functions to reduce the overhead. PiperOrigin-RevId: 819739402	2025-10-15 07:10:04 -07:00
Ray Smith	fb6fa793f4	Added a global (to gemma) zones list to enable most call sites to PROFILER_ZONE3 to avoid the synchronization required for the static const initialization of the zone handle. Improved flash_attention to enable profiling using the new zones. PiperOrigin-RevId: 819235421	2025-10-14 08:30:58 -07:00
Jan Wassenberg	035273c184	tune pool kSpin mode in threading_context Previously, this happened concurrently with the matmul autotune, which could lead to incorrect outcomes. threading: de-singleton Pinning (no longer stores affinity); pass PoolWorkerMapping; fix Pool dtor order Also enable SPR target (Zen4 is AMD-only), update Highway version for renamed Thread()->GlobalIdx(). PiperOrigin-RevId: 816223017	2025-10-07 08:36:26 -07:00
Ray Smith	684a0444e9	Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines PiperOrigin-RevId: 814241032	2025-10-02 08:15:16 -07:00
Ray Smith	14244664c8	Avoid transposing Q when it isn't needed PiperOrigin-RevId: 814187984	2025-10-02 05:16:35 -07:00
Jan Wassenberg	fe5a39990e	Improve FlashAttention threading: kFlat for RMSNorm (hierarchical is excessive), profiler zone naming improvements. PiperOrigin-RevId: 814144012	2025-10-02 02:37:05 -07:00
Ray Smith	6098a022b3	Increased parallelism for RMSNormAndPositionalEncoding PiperOrigin-RevId: 813738994	2025-10-01 07:11:14 -07:00
Ray Smith	2f6cbde8ff	Added a smaller tile size to flash attention for smaller batch sizes PiperOrigin-RevId: 813226193	2025-09-30 05:49:20 -07:00
Ray Smith	4974f24832	Fixed bug with softcap in single flash attention PiperOrigin-RevId: 813164938	2025-09-30 02:17:58 -07:00
Nitin Gangahar	667a3f117a	Utilize multiple cores to read weight batches. PiperOrigin-RevId: 811893059	2025-09-26 11:28:33 -07:00
Charles Zhao	4f0c633248	(1) Added QueryResultAndMetrics and BatchQueryModelWithMetrics to also return TimingInfo besides query results. PiperOrigin-RevId: 810634261	2025-09-23 17:02:29 -07:00
Jan Wassenberg	fac8aac4cb	Internal change PiperOrigin-RevId: 809975026	2025-09-22 05:37:03 -07:00
Jan Wassenberg	501fdf000e	Remove no longer used MatVec PiperOrigin-RevId: 809059409	2025-09-19 09:03:22 -07:00
Jan Wassenberg	f3bc1c17da	1.03x speedup: fused FFN matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps. PiperOrigin-RevId: 807291701	2025-09-15 10:26:37 -07:00
Ray Smith	c9b8479f7d	Added zero-initialization to att_out. Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available. PiperOrigin-RevId: 806284756	2025-09-12 07:48:23 -07:00
Jan Wassenberg	2695aab5d2	Temporarily disable flash pending msan fix PiperOrigin-RevId: 805350234	2025-09-10 07:25:41 -07:00
Jan Wassenberg	ba6131311a	Fix gemma_batch_bench for flash attention q_T rows do not change. Also repeat prefill to reflect perf after autotuning. PiperOrigin-RevId: 805319377	2025-09-10 05:32:34 -07:00
Ray Smith	f10ac41a20	Added flash attention, with both a single-q function, and a register-tiled function. The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine. PiperOrigin-RevId: 804913784	2025-09-09 08:05:26 -07:00
Jan Wassenberg	461a9c7d1b	Matmul refactoring towards fusion MMLoops: move dispatch code out, use overloads split build target into matmul_env (for MatMulEnv/MMOptions) weights: no longer call BindB Fix potential out of bounds in gemma_batch_bench PiperOrigin-RevId: 804895985	2025-09-09 07:13:38 -07:00
Jan Wassenberg	a5ab99e4ba	Memory use reduction: smaller/single MMStorage PiperOrigin-RevId: 804865029	2025-09-09 05:32:46 -07:00
Jan Wassenberg	6e52a835c6	Faster startup on tsan: use hierarchical parallelism for BF16 conversion Also re-enable profiler zones PiperOrigin-RevId: 804273899	2025-09-07 22:50:31 -07:00
Jan Wassenberg	cbe24eac51	1.15x speedup: parallel sampling, enabled by new RNG Also pass pos to SampleFunc, for seeding the RNG. PiperOrigin-RevId: 803453518	2025-09-05 07:24:02 -07:00
Jan Wassenberg	2b4c16e243	Remove Griffin support Also add IsObsolete helper PiperOrigin-RevId: 803376921	2025-09-05 02:35:40 -07:00
Jan Wassenberg	56186193c1	Replace mt19937 with new generator to enable parallel sampling Split it into immutable AesCtrEngine and RngStream Also add RowSpan and Logits span PiperOrigin-RevId: 803336423	2025-09-04 23:49:10 -07:00
Jan Wassenberg	5d1693e806	Internal change PiperOrigin-RevId: 803083229	2025-09-04 10:31:20 -07:00
Jan Wassenberg	4be4799727	Remove kMaxPackages and per-package-related code matmul: remove kMaxClusters, dynamic allocation PiperOrigin-RevId: 802950348	2025-09-04 03:33:12 -07:00
Jan Wassenberg	7263ab8445	MatMul simplification, threading strategy improvements remove MatMul f32 special case (smaller code), types: Add u32/u64 for use by Activations move renamed ParallelismStrategy to threading_context so can pass ctx ensure worker index is unique across clusters matmul.h: const member functions for renamed policy classes (easier to call) PiperOrigin-RevId: 802848086	2025-09-03 21:45:07 -07:00
Jan Wassenberg	b7b3d353db	Simplify MatMul: remove F32 special case (build time) Also move kMaxM into separate kMaxBatchSize PiperOrigin-RevId: 802086590	2025-09-02 04:29:21 -07:00
Jan Wassenberg	1e3c853e80	Add ParallelFor wrapper function and one new mode Move ParallelismType from matmul.h to threading.h Replace SmallParallelFor with ParallelFor and the new mode PiperOrigin-RevId: 802038452	2025-09-02 01:40:09 -07:00
Jan Wassenberg	229bd078a1	1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage. Also increase dot_test.cc range for Zen4, and matmul_test tolerance (failing in some configs) PiperOrigin-RevId: 801789922	2025-09-01 06:34:04 -07:00
Jan Wassenberg	0ae8646731	Fix remainder handling for Paligemma No longer attempt to skip the remainder handling because B might also be a non-padded view. PiperOrigin-RevId: 800890805	2025-08-29 07:25:52 -07:00
Marie White	973e284ed6	Refactor Matmul to use a policy class for parallelization. PiperOrigin-RevId: 800864489	2025-08-29 05:40:39 -07:00
Jan Wassenberg	6c39a2dea4	1.01x speedup: More bf16 activations to reduce DecompressA. Also move observer call into function, format gemma_args. PiperOrigin-RevId: 800827400	2025-08-29 03:19:01 -07:00
Jan Wassenberg	7288891439	Remove F64 partial storage in matmul. Also remove no longer used kMaxN; row_ptrs only used for C PiperOrigin-RevId: 800774757	2025-08-29 00:12:08 -07:00
Jan Wassenberg	98ddc166db	Expand ThreadingContext comments PiperOrigin-RevId: 800479954	2025-08-28 08:32:10 -07:00
Marie White	6128e758ff	Change ffw_out from B16 to F32. PiperOrigin-RevId: 800330411	2025-08-28 00:01:39 -07:00
Jan Wassenberg	5411fd846d	Minor: batched NotifyGenerate, fix comment/dep PiperOrigin-RevId: 799889802	2025-08-26 23:33:17 -07:00
Jan Wassenberg	86afd53076	1.04x speedup: Parallelize SoftCap Also require opt-in constexpr flag for observer callbacks, update zones PiperOrigin-RevId: 799655163	2025-08-26 11:55:20 -07:00
Jan Wassenberg	ed2f0bd1b0	Fix pos assertions, refs #665 Ensure the streaming func pos matches the number of calls. Add two arguments that control pos+1 and pos+=1 behavior. Also cleanup/add comments. run: use batch_stream_func, add assert, higher verbosity for MM autotune output PiperOrigin-RevId: 799511163	2025-08-26 04:50:40 -07:00
Jan Wassenberg	9bf0fe4e37	Internal change PiperOrigin-RevId: 799509375	2025-08-26 04:44:08 -07:00
Jan Wassenberg	d3a5ddf657	Merge pull request #663 from junjihashimoto:feature/api-server PiperOrigin-RevId: 797731089	2025-08-24 11:57:05 +02:00
Rhett Stucki	73f1140dca	Fix an off-by-one error after StreamAndUpdateEOS() to remove the MSAN warning about reading an uninitialized variable in the kv_cache. The logic for choosing whether or not to attend to the last token during prefill wasn't completely consistent with StreamAndUpdateEOS(), causing an off-by-one error that prevented the kv_cache from being fully populated. PiperOrigin-RevId: 797614310	2025-08-20 22:59:58 -07:00

390 Commits