gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	3ed403e287	Major cleanup of profiler zones, add Caller annotation for all pool.Run Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones Add GCPP_ZONE helper Add Caller argument to pool.Run to enable new stats Remove most direct dependencies on ThreadPool, prefer ParallelFor PiperOrigin-RevId: 822934530	2025-10-23 01:54:24 -07:00
Ray Smith	f10ac41a20	Added flash attention, with both a single-q function, and a register-tiled function. The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine. PiperOrigin-RevId: 804913784	2025-09-09 08:05:26 -07:00
Jan Wassenberg	faa4102992	(Resubmit) Prepare profiler annotations for new API Pass hwy::Profiler& to low-level functions. Used ThreadingContext arg instead of NestedPools. Use new PROFILER_ZONE3. PiperOrigin-RevId: 794461159	2025-08-13 01:38:24 -07:00
The gemma.cpp Authors	a2d9133f7d	Prepare profiler annotations for new API Pass hwy::Profiler& to low-level functions. Used ThreadingContext arg instead of NestedPools. Use new PROFILER_ZONE3. PiperOrigin-RevId: 793865287	2025-08-11 17:51:38 -07:00
Jan Wassenberg	4cbf63e6f0	Prepare profiler annotations for new API Pass hwy::Profiler& to low-level functions. Used ThreadingContext arg instead of NestedPools. Use new PROFILER_ZONE3. PiperOrigin-RevId: 793821255	2025-08-11 15:34:52 -07:00
Jan Wassenberg	5474146129	Back to f32 kv_cache, but via typedef PiperOrigin-RevId: 785422614	2025-07-21 07:05:35 -07:00
Jan Wassenberg	0f70f285e0	1.1x prefill and decode speedup (attention/activations) Optimizations - Better load-balancing in attention threading (Previously, clusters were limited by #heads) - Add MulByConstTo to avoid zero-init - Parallel activations Cleanup - Prepare for RowPtr in A or B - Pass through thread_id to ops - Avoid warning in bench_matmul PiperOrigin-RevId: 773723423	2025-06-20 08:59:53 -07:00
Jan Wassenberg	343482c7ef	1.02x batch decode speedup: BF16 KV cache ops-inl.h: Vectorize Rope(), template Remove unused MulBy, and extra-arg overloads of MulByConst and Softmax Fix for DecompressAndZeroPad: ensure second vector filled PiperOrigin-RevId: 772779163	2025-06-17 23:21:59 -07:00
Jan Wassenberg	6773e4517c	Split Activations into Griffin/Attention to reduce memory usage for attention-only tests. PiperOrigin-RevId: 772025282	2025-06-16 07:52:59 -07:00
Jan Wassenberg	e5c81f64a1	Major refactor: clarify query_idx (global) vs qi. Refs #607 Fix missing pos increment for last prefill and check that in gemma_test. Thanks to @ufownl for pointing this out. Change argument lists to QBatch with accessors. Increase default seq_len to 8k. PiperOrigin-RevId: 771937385	2025-06-16 02:42:02 -07:00
Jan Wassenberg	c027a45a2e	MatPtr-ify KV, shared div_seq_len, --seq_len flag PiperOrigin-RevId: 770194455	2025-06-11 09:49:38 -07:00
Jan Wassenberg	6ee628ba38	Further cleanup: separate MatMulEnv arg move row_ptrs into MatMulEnv Consistent arg order: layer, activations, kv_cache, env PiperOrigin-RevId: 767886386	2025-06-05 20:48:32 -07:00
Jan Wassenberg	3a266c662c	Split gemma-inl into separate source files weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization. PiperOrigin-RevId: 767562313	2025-06-05 05:36:44 -07:00

13 Commits