gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	0f70f285e0	1.1x prefill and decode speedup (attention/activations). Optimizations: better load-balancing in attention threading (previously, clusters were limited by #heads); add MulByConstTo to avoid zero-init; parallel activations. Cleanup: prepare for RowPtr in A or B; pass through thread_id to ops; avoid warning in bench_matmul. PiperOrigin-RevId: 773723423	2025-06-20 08:59:53 -07:00
Jan Wassenberg	4f5785b0fd	Update instrumentation for new Highway wall-time profiler. Pass the thread index through and use new zone_id. PiperOrigin-RevId: 773344242	2025-06-19 07:46:04 -07:00
Jan Wassenberg	7f62c2606e	Fix bf16 KV recompression and Rope(), fixes #608. Also add more helpful error message for prompt > seq_len. Also update ops_test, adding coverage for Rope(). PiperOrigin-RevId: 772945644	2025-06-18 09:14:20 -07:00
Jan Wassenberg	343482c7ef	1.02x batch decode speedup: BF16 KV cache. ops-inl.h: Vectorize Rope(), template. Remove unused MulBy, and extra-arg overloads of MulByConst and Softmax. Fix for DecompressAndZeroPad: ensure second vector filled. PiperOrigin-RevId: 772779163	2025-06-17 23:21:59 -07:00
Jan Wassenberg	cd80d8b24d	Speed up builds by skipping rarely used targets. Centralize previous code into GEMMA_DISABLED_TARGETS. PiperOrigin-RevId: 772433723	2025-06-17 05:44:20 -07:00
Jan Wassenberg	9a02d6be68	Add --prompt_file and testdata for it. Refs #608. Linux terminals truncate input after 4096 chars. testdata is Frankenstein from Project Gutenberg, which is long out of copyright. Also fix loss of coherence after long context caused by incorrect IsGlobalLayer. Move that to config.h and use max_seq_len as the initializer to make this clear. Also avoid dynamic allocation for GriffinActivations. PiperOrigin-RevId: 772333225	2025-06-16 23:41:07 -07:00
Jan Wassenberg	6773e4517c	Split Activations into Griffin/Attention to reduce memory usage for attention-only tests. PiperOrigin-RevId: 772025282	2025-06-16 07:52:59 -07:00
Jan Wassenberg	e5c81f64a1	Major refactor: clarify query_idx (global) vs qi. Refs #607. Fix missing pos increment for last prefill and check that in gemma_test. Thanks to @ufownl for pointing this out. Change argument lists to QBatch with accessors. Increase default seq_len to 8k. PiperOrigin-RevId: 771937385	2025-06-16 02:42:02 -07:00
Jan Wassenberg	01cdefeda7	1.64x batch=1 prefill speedup: nested parallelization for Attention (DotSoftmaxWeightedSum). Also fix tsan error in matmul (atomic_flag instead of static). PiperOrigin-RevId: 770241705	2025-06-11 11:28:46 -07:00
Jan Wassenberg	c027a45a2e	MatPtr-ify KV, shared div_seq_len, --seq_len flag PiperOrigin-RevId: 770194455	2025-06-11 09:49:38 -07:00
Jan Wassenberg	6ee628ba38	Further cleanup: separate MatMulEnv arg; move row_ptrs into MatMulEnv. Consistent arg order: layer, activations, kv_cache, env. PiperOrigin-RevId: 767886386	2025-06-05 20:48:32 -07:00
Jan Wassenberg	3a266c662c	Split gemma-inl into separate source files. weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization. PiperOrigin-RevId: 767562313	2025-06-05 05:36:44 -07:00

12 Commits