gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Krzysztof Rymski	61dedf73ed	Internal changes PiperOrigin-RevId: 841765739	2025-12-09 01:55:38 -08:00
Martin Stolle	b510ba2ab2	Improve clarity of indices II Sorry, didn't see this one before. PiperOrigin-RevId: 840218378	2025-12-04 06:33:33 -08:00
Jan Wassenberg	a084d33e41	Fix Gemma3 image: ensure A matrix is packed, preallocate Also ignore -2 tokens PiperOrigin-RevId: 838869988	2025-12-01 11:47:23 -08:00
Martin Stolle	88a03b7ec4	Added access to softmax attention internals to regular attention PiperOrigin-RevId: 835244205	2025-11-21 09:01:01 -08:00
Martin Stolle	5a500872b8	Internal change PiperOrigin-RevId: 835115693	2025-11-21 01:17:45 -08:00
Martin Stolle	49d420aeaf	Add some comments. PiperOrigin-RevId: 834173319	2025-11-19 01:09:15 -08:00
Martin Stolle	35e9f9f05f	Introduce attention implementation configurability. PiperOrigin-RevId: 828971705	2025-11-06 08:43:41 -08:00
Jan Wassenberg	a344a70c59	Change (old) attention behavior to disallow wraparound, enforced via assertion. Shared kU64PerLine constant PiperOrigin-RevId: 828072451	2025-11-04 11:52:40 -08:00
Phil Culliton	ab87807a4c	Pre-compress query activations to BF16 before FlashAttention. PiperOrigin-RevId: 826524997	2025-10-31 09:49:44 -07:00
Phil Culliton	116cd6eff6	BF16 mixed-mode flash attention PiperOrigin-RevId: 825433929	2025-10-29 01:48:28 -07:00
Jan Wassenberg	3cc0139ebb	Fix excessive KC/MC from prior change This could lead to stack overflow in B_storage. Also do not require specific type for query_norm_scale, update batch sizes for attention tensors, more verbose Mat shape/type checks. PiperOrigin-RevId: 824987689	2025-10-28 05:33:01 -07:00
Biruk Mammo	5a05857deb	[Gemma.cpp] Allows non-owned arguments for attention methods. * Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`. * Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches. * Enables the `MatPtrT` default constructor for simpler initializations. * Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor. PiperOrigin-RevId: 824584177	2025-10-27 10:43:25 -07:00
Jan Wassenberg	ba6131311a	Fix gemma_batch_bench for flash attention q_T rows do not change. Also repeat prefill to reflect perf after autotuning. PiperOrigin-RevId: 805319377	2025-09-10 05:32:34 -07:00
Ray Smith	f10ac41a20	Added flash attention, with both a single-q function, and a register-tiled function. The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine. PiperOrigin-RevId: 804913784	2025-09-09 08:05:26 -07:00
Jan Wassenberg	a5ab99e4ba	Memory use reduction: smaller/single MMStorage PiperOrigin-RevId: 804865029	2025-09-09 05:32:46 -07:00
Jan Wassenberg	cbe24eac51	1.15x speedup: parallel sampling, enabled by new RNG Also pass pos to SampleFunc, for seeding the RNG. PiperOrigin-RevId: 803453518	2025-09-05 07:24:02 -07:00
Jan Wassenberg	2b4c16e243	Remove Griffin support Also add IsObsolete helper PiperOrigin-RevId: 803376921	2025-09-05 02:35:40 -07:00
Jan Wassenberg	4be4799727	Remove kMaxPackages and per-package-related code matmul: remove kMaxClusters, dynamic allocation PiperOrigin-RevId: 802950348	2025-09-04 03:33:12 -07:00
Jan Wassenberg	7263ab8445	MatMul simplification, threading strategy improvements remove MatMul f32 special case (smaller code), types: Add u32/u64 for use by Activations move renamed ParallelismStrategy to threading_context so can pass ctx ensure worker index is unique across clusters matmul.h: const member functions for renamed policy classes (easier to call) PiperOrigin-RevId: 802848086	2025-09-03 21:45:07 -07:00
Jan Wassenberg	229bd078a1	1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage. Also increase dot_test.cc range for Zen4, and matmul_test tolerance (failing in some configs) PiperOrigin-RevId: 801789922	2025-09-01 06:34:04 -07:00
Jan Wassenberg	6c39a2dea4	1.01x speedup: More bf16 activations to reduce DecompressA. Also move observer call into function, format gemma_args. PiperOrigin-RevId: 800827400	2025-08-29 03:19:01 -07:00
Marie White	6128e758ff	Change ffw_out from B16 to F32. PiperOrigin-RevId: 800330411	2025-08-28 00:01:39 -07:00
Jan Wassenberg	d1638587f0	1.14x batch decode speedup: parallelize RMSNorm ops Activations was over-parallelized, use single pool instead. Also improve profiler zone annotations, pass through worker args (for tracking concurrency), now non-optional. PiperOrigin-RevId: 788790976	2025-07-30 00:55:45 -07:00
Jan Wassenberg	e76e29ce11	De-singleton ThreadingContext so callers can pass in their own weights.cc: fix BindB argument for bf16 tensors threading_test: enable autotune PiperOrigin-RevId: 785763618	2025-07-22 02:08:46 -07:00
Jan Wassenberg	0f70f285e0	1.1x prefill and decode speedup (attention/activations) Optimizations - Better load-balancing in attention threading (Previously, clusters were limited by #heads) - Add MulByConstTo to avoid zero-init - Parallel activations Cleanup - Prepare for RowPtr in A or B - Pass through thread_id to ops - Avoid warning in bench_matmul PiperOrigin-RevId: 773723423	2025-06-20 08:59:53 -07:00
Biruk Mammo	88284387db	Reduce warning noise. PiperOrigin-RevId: 772941142	2025-06-18 09:01:40 -07:00
Jan Wassenberg	9a02d6be68	Add --prompt_file and testdata for it. Refs #608 Linux terminals truncate input after 4096 chars. testdata is Frankenstein from project Gutenberg, which are long out of copyright. Also fix loss of coherence after long context caused by incorrect IsGlobalLayer. Move that to config.h and use max_seq_len as the initializer to make this clear. Also avoid dynamic allocation for GriffinActivations. PiperOrigin-RevId: 772333225	2025-06-16 23:41:07 -07:00
Biruk Mammo	5f3797f6e1	Allow creating empty `AttentionActivations` for experimental code. PiperOrigin-RevId: 772077675	2025-06-16 10:19:11 -07:00
Jan Wassenberg	6773e4517c	Split Activations into Griffin/Attention to reduce memory usage for attention-only tests. PiperOrigin-RevId: 772025282	2025-06-16 07:52:59 -07:00
Jan Wassenberg	e5c81f64a1	Major refactor: clarify query_idx (global) vs qi. Refs #607 Fix missing pos increment for last prefill and check that in gemma_test. Thanks to @ufownl for pointing this out. Change argument lists to QBatch with accessors. Increase default seq_len to 8k. PiperOrigin-RevId: 771937385	2025-06-16 02:42:02 -07:00
Jan Wassenberg	c027a45a2e	MatPtr-ify KV, shared div_seq_len, --seq_len flag PiperOrigin-RevId: 770194455	2025-06-11 09:49:38 -07:00
Jan Wassenberg	ec02726cf7	6x large-batch, short-prompt prefill speedup Parallelize over queries instead of tokens introduce non_eos so we only iterate over not yet EOS queries; remove TokenStreamer. move RMSNormInplaceBatched out of Transformer to call the latter from prefill Consistent arg order. Fix gemma_test EOS handling which (caught by msan), remove from tokenizer.h Also add output to gemma_batch_bench, fix name PiperOrigin-RevId: 769676106	2025-06-10 09:56:20 -07:00
Jan Wassenberg	6ee628ba38	Further cleanup: separate MatMulEnv arg move row_ptrs into MatMulEnv Consistent arg order: layer, activations, kv_cache, env PiperOrigin-RevId: 767886386	2025-06-05 20:48:32 -07:00
Jan Wassenberg	3a266c662c	Split gemma-inl into separate source files weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization. PiperOrigin-RevId: 767562313	2025-06-05 05:36:44 -07:00
Jan Wassenberg	6897313080	3x speedup of EmbedImagePatches - GEMM, not GEMV. Required fixes to handling of non-vector aligned A. Also move row ptrs to MatMulEnv. PiperOrigin-RevId: 767029036	2025-06-04 01:18:52 -07:00
Jan Wassenberg	9efdcfd45c	1.07x batch decode speedup: more BF16 weights and activations BF16 att_sums and ffw_out Support BF16 B views without decompression Support arbitrary types in MulByConstAndAdd, AddFrom Also update profiler annotations in ops-inl.h PiperOrigin-RevId: 766995010	2025-06-03 23:30:18 -07:00
Jan Wassenberg	839a642992	Fix paligemma_test, refs #588 Detect PaliGemma models from layer names Remove unused allocator arg from CreateInvTimescale matmul: only warn once about dim divisibility Print config also in tests if --verbosity 2 PiperOrigin-RevId: 766605131	2025-06-03 04:45:22 -07:00
Jan Wassenberg	ad3002a21c	Merge branch 'dev' into bugfix/vit_attn	2025-06-03 09:29:52 +02:00
Jan Wassenberg	794a21a4e6	Major refactor to de-templatize gemma-inl and weights This replaces per-weight instantiations of all code with only per-MatMul/norm. Reduces binary size by 133KiB. WeightsOwner is no longer required for type erasing, hence it is replaced with ModelWeightsPtrs. Also remove unused EmbedToken, replaced with EmbedMMToken. PiperOrigin-RevId: 766497657	2025-06-02 23:01:35 -07:00
RangerUFO	93de2be938	Fix the broken VitAttention	2025-06-03 12:40:13 +08:00
Jan Wassenberg	cf4d7ceb82	1.16x decode speedup: remove last MatVec in Attention Precompute row pointers. Remove no longer used MHA support; QStride -> qkv_dim. Remove RowPtr from MatMul interface, use only MatPtrT. Require opt-in define for NUQ to speed up builds. Also fix io.cc on Windows. PiperOrigin-RevId: 766228108	2025-06-02 09:40:29 -07:00
The gemma.cpp Authors	1e8642f8f4	Internal change. PiperOrigin-RevId: 765037449	2025-05-29 22:51:16 -07:00
Jan Wassenberg	cb188d4a0e	Fix RowT issue and improve Griffin (currently still broken) Use type-safe MatPtrT via dynamic_cast, avoid/remove unsafe RowT activations: Griffin tensors are now padded Griffin: add batching support, fix conv1d_cache allocation weights: bundle to TensorToRead, add kNoPad flag, fix SplitW1 const-correct fix for ForEachTensor blob_store: move BlobIO2 to .cc and rename BlobIO PiperOrigin-RevId: 760610094	2025-05-19 07:02:10 -07:00
Jan Wassenberg	e890d46f30	1.31x batch prefill, 1.24x batch decode speedup: NUMA binding Only the weights; binding MatMul output worsens batch=1 prefill. Update gemma_batch_bench to use --decode_qbatch. Fix/remove prefill_activations in gemma-inl.h. Refactor: use BasePageBytes directly when binding Move BindB/C to .cc by de-templatizing Remove MatOwners::AllocateFor because it is weights-specific (binding or not) Disband MatOwners, replace with vector PiperOrigin-RevId: 759610477	2025-05-16 07:42:13 -07:00
Biruk Mammo	ba21e3beb4	Adds a `GemmaAttention` constructor that takes an explicit `ThreadingContext`. PiperOrigin-RevId: 757839682	2025-05-12 11:17:05 -07:00
Jan Wassenberg	45ad847a41	Replace RowVectorBatch with MatStorageT KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out optimize_test: larger vocab_size requires more steps shared.h: Remove unused u128 type correctly set Activation matrix rows, avoid passing as arg ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type mat: add OverrideRows, used by SetBatchSize PiperOrigin-RevId: 757790736	2025-05-12 09:16:12 -07:00
Jan Wassenberg	275135d7e8	Rename-only: remove Allocator2 etc suffixes now that refactoring is complete PiperOrigin-RevId: 755397220	2025-05-06 09:12:43 -07:00
Jan Wassenberg	8532da47f7	Major refactor of allocator/args: use new ThreadingContext2 instead of monostate/init in each frontend Add ThreadingArgs(replaces AppArgs) backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride compress_weights: remove, moving to py-only exporter instead Move MatPtr to mat.h and revise interface: - Generic MatOwner - rename accessors to Packed* - support stride/row accessors, fix RowPtr stride Add TypeBits(Type) Move GenerateMat to test_util-inl for sharing between matmul test/bench Move internal init to gemma.cc to avoid duplication Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage Remove --compressed_weights, use --weights instead. tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img Allocator: use normal unique_ptr for AllocBytes so users can call directly threading: use -> because AlignedPtr no longer assumes arrays PiperOrigin-RevId: 745918637	2025-04-10 01:29:54 -07:00
Phil Culliton	4ab601da10	Internal change. PiperOrigin-RevId: 736015810	2025-03-11 23:20:20 -07:00
Jan Wassenberg	a60b564b88	Infra improvements (2) ops.h: move CreateInvTimescale to allow calling without depending on gemma Pass around MatMulEnv instead of pools to avoid re-creating the env profiler.h can now be used outside SIMD code allocator: add StepBytes and QuantumSteps rename worker thread with package/cluster in the name threading: add Visit* to IndexRange PiperOrigin-RevId: 718766704	2025-01-23 01:55:19 -08:00

1 2

69 Commits