gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	3cc0139ebb	Fix excessive KC/MC from prior change. This could lead to stack overflow in B_storage. Also do not require specific type for query_norm_scale, update batch sizes for attention tensors, more verbose Mat shape/type checks. PiperOrigin-RevId: 824987689	2025-10-28 05:33:01 -07:00
Biruk Mammo	5a05857deb	[Gemma.cpp] Allows non-owned arguments for attention methods. * Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`. * Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches. * Enables the `MatPtrT` default constructor for simpler initializations. * Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor. PiperOrigin-RevId: 824584177	2025-10-27 10:43:25 -07:00
Phil Culliton	503aaddd65	Add 8-bit integer quantization (I8Stream) to Gemma.cpp. PiperOrigin-RevId: 819787856	2025-10-15 09:25:20 -07:00
Jan Wassenberg	f3bc1c17da	1.03x speedup: fused FFN. matmul-inl: support CView=StridedView or RowPtrs; rename to C_MC_NC. matmul.cc: Allow 1 more rep for MC/NC to allow half-sized tiles, which helps. PiperOrigin-RevId: 807291701	2025-09-15 10:26:37 -07:00
Jan Wassenberg	ba6131311a	Fix gemma_batch_bench for flash attention. q_T rows do not change. Also repeat prefill to reflect perf after autotuning. PiperOrigin-RevId: 805319377	2025-09-10 05:32:34 -07:00
Jan Wassenberg	9457258330	Refactor MatMul to accept views in the kernel functions. Make arg order consistent. Move StridedView into mat.h. Add view support to RowPtrs. PiperOrigin-RevId: 805197381	2025-09-09 22:09:47 -07:00
Jan Wassenberg	461a9c7d1b	Matmul refactoring towards fusion. MMLoops: move dispatch code out, use overloads. Split build target into matmul_env (for MatMulEnv/MMOptions). weights: no longer call BindB. Fix potential out of bounds in gemma_batch_bench. PiperOrigin-RevId: 804895985	2025-09-09 07:13:38 -07:00
Jan Wassenberg	56186193c1	Replace mt19937 with new generator to enable parallel sampling. Split it into immutable AesCtrEngine and RngStream. Also add RowSpan and Logits span. PiperOrigin-RevId: 803336423	2025-09-04 23:49:10 -07:00
Jan Wassenberg	4be4799727	Remove kMaxPackages and per-package-related code. matmul: remove kMaxClusters, dynamic allocation. PiperOrigin-RevId: 802950348	2025-09-04 03:33:12 -07:00
Jan Wassenberg	e76e29ce11	De-singleton ThreadingContext so callers can pass in their own. weights.cc: fix BindB argument for bf16 tensors. threading_test: enable autotune. PiperOrigin-RevId: 785763618	2025-07-22 02:08:46 -07:00
Jan Wassenberg	0f70f285e0	1.1x prefill and decode speedup (attention/activations). Optimizations: - Better load-balancing in attention threading (previously, clusters were limited by #heads) - Add MulByConstTo to avoid zero-init - Parallel activations. Cleanup: - Prepare for RowPtr in A or B - Pass through thread_id to ops - Avoid warning in bench_matmul. PiperOrigin-RevId: 773723423	2025-06-20 08:59:53 -07:00
Jan Wassenberg	bd98b43cea	Rename RowPtr->StridedView, CRows->RowPtrs PiperOrigin-RevId: 770046362	2025-06-11 02:30:53 -07:00
Jan Wassenberg	9efdcfd45c	1.07x batch decode speedup: more BF16 weights and activations. BF16 att_sums and ffw_out. Support BF16 B views without decompression. Support arbitrary types in MulByConstAndAdd, AddFrom. Also update profiler annotations in ops-inl.h. PiperOrigin-RevId: 766995010	2025-06-03 23:30:18 -07:00
Jan Wassenberg	794a21a4e6	Major refactor to de-templatize gemma-inl and weights This replaces per-weight instantiations of all code with only per-MatMul/norm. Reduces binary size by 133KiB. WeightsOwner is no longer required for type erasing, hence it is replaced with ModelWeightsPtrs. Also remove unused EmbedToken, replaced with EmbedMMToken. PiperOrigin-RevId: 766497657	2025-06-02 23:01:35 -07:00
Jan Wassenberg	cf4d7ceb82	1.16x decode speedup: remove last MatVec in Attention. Precompute row pointers. Remove no longer used MHA support; QStride -> qkv_dim. Remove RowPtr from MatMul interface, use only MatPtrT. Require opt-in define for NUQ to speed up builds. Also fix io.cc on Windows. PiperOrigin-RevId: 766228108	2025-06-02 09:40:29 -07:00
Jan Wassenberg	3890eb5412	Remove backprop/ directory. Also remove MatPtrT::Packed(); use PackedScale1 instead where const, or Row(0). PiperOrigin-RevId: 764243198	2025-05-28 07:01:17 -07:00
Jan Wassenberg	cb188d4a0e	Fix RowT issue and improve Griffin (currently still broken). Use type-safe MatPtrT via dynamic_cast, avoid/remove unsafe RowT. activations: Griffin tensors are now padded. Griffin: add batching support, fix conv1d_cache allocation. weights: bundle to TensorToRead, add kNoPad flag, fix SplitW1. Const-correct fix for ForEachTensor. blob_store: move BlobIO2 to .cc and rename BlobIO. PiperOrigin-RevId: 760610094	2025-05-19 07:02:10 -07:00
Jan Wassenberg	e890d46f30	1.31x batch prefill, 1.24x batch decode speedup: NUMA binding. Only the weights; binding MatMul output worsens batch=1 prefill. Update gemma_batch_bench to use --decode_qbatch. Fix/remove prefill_activations in gemma-inl.h. Refactor: use BasePageBytes directly when binding. Move BindB/C to .cc by de-templatizing. Remove MatOwners::AllocateFor because it is weights-specific (binding or not). Disband MatOwners, replace with vector. PiperOrigin-RevId: 759610477	2025-05-16 07:42:13 -07:00
Jan Wassenberg	8a312e9b89	Split W1/W2 as a load-time preprocess. Remove kOnlyAllocate - no longer used. Rename ReadOrAllocate -> ReadFromBlobs. Rename Reshape -> Fixup to reflect the new scope. Remove no longer used ShrinkRows. This simplifies gemma-inl and is a prerequisite for removing ConstMat (whose .ofs was previously used for merged tensors) PiperOrigin-RevId: 758214083	2025-05-13 07:39:59 -07:00
Jan Wassenberg	2038dfd9cc	Minor: rename compression/shared -> types.h PiperOrigin-RevId: 758199851	2025-05-13 06:53:21 -07:00
Jan Wassenberg	d538a6d6c6	Cleanup: remove unused kCyclic, remove 2 suffix. Also remove now unused allocator arg and fix warnings (cast, struct/class mismatch). PiperOrigin-RevId: 758098495	2025-05-13 01:06:41 -07:00
Jan Wassenberg	45ad847a41	Replace RowVectorBatch with MatStorageT. KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out. optimize_test: larger vocab_size requires more steps. shared.h: Remove unused u128 type. Correctly set Activation matrix rows, avoid passing as arg. ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type. mat: add OverrideRows, used by SetBatchSize. PiperOrigin-RevId: 757790736	2025-05-12 09:16:12 -07:00
Jan Wassenberg	c8d92948f4	Move fields, io* and blob* from compression/ into io/ PiperOrigin-RevId: 755445712	2025-05-06 11:17:19 -07:00
Jan Wassenberg	275135d7e8	Rename-only: remove Allocator2 etc. suffixes now that refactoring is complete. PiperOrigin-RevId: 755397220	2025-05-06 09:12:43 -07:00
Jan Wassenberg	8d0882b966	Huge refactor of weight handling and model loading. Weight handling: - new ModelStore2 supports both pre-2025 multi-file and single-file formats - simpler ForEachTensor with TensorArgs - tensors are constructed with their full suffixed name I/O: - support mmap and stride - Simplified SbsWriter, single insert(); add SbsReader Misc: - kMockTokenizer: allow creating with unavailable tokenizer - configs.h: Simpler enum validity checks via kSentinel - matmul.h: remove unused enable_bind (now in allocator.h) - tensor_info: single TensorInfoRegistry class, rename from tensor_index.h Frontends: - Replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&) - Deduce model/weight type, remove --model and parsing - Replace most common.h includes with configs.h - Remove --compressed_weights, use --weights instead - Remove ModelInfo, replaced by ModelConfig. Backprop: - Reduce max loss, remove backward_scalar_test (timeout) - Update thresholds because new RandInit changes rng eval order and thus numerics PiperOrigin-RevId: 755317484	2025-05-06 04:44:21 -07:00
Jan Wassenberg	fe80f10ed7	Backprop test fixes and allocator cleanup - Shorten backprop tests to prevent timeout - Add line number of failing test - matmul: remove unused enable_bind - allocator: we will retain enable_bind there - mat: disable cyclic padding optimization (broken) PiperOrigin-RevId: 752656068	2025-04-29 03:01:10 -07:00
Jan Wassenberg	160a5824fb	Cleanup: include fixes/comments, fix leak, vector reserve. Also remove unused RowSpan. configs.cc: Assign prompt wrapping to ModelConfig. configs.h: simplify EnumValid via sentinel. PiperOrigin-RevId: 750278497	2025-04-22 12:01:46 -07:00
Jan Wassenberg	87a658b1c6	Minor cleanup, on-demand NUQ buffer allocation. threading_context: add profiler. compress-inl: add constexpr, on-demand alloc NUQ buffer. gemma_py: model->gemma. Move ScaleWeights to compress.cc. Move PromptWrapping to configs.h. PiperOrigin-RevId: 748347896	2025-04-16 10:49:43 -07:00
Jan Wassenberg	2e722f14f1	Add mmap support (not yet used). Also: const-correct ArgsBase, add assert to mat.h checking element_bytes_. BUILD deps update (:shared provides shared.h, not :sfp). PiperOrigin-RevId: 746073312	2025-04-10 10:03:40 -07:00
Jan Wassenberg	8532da47f7	Major refactor of allocator/args: use new ThreadingContext2 instead of monostate/init in each frontend. Add ThreadingArgs (replaces AppArgs). backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride. compress_weights: remove, moving to py-only exporter instead. Move MatPtr to mat.h and revise interface: - Generic MatOwner - rename accessors to Packed* - support stride/row accessors, fix RowPtr stride. Add TypeBits(Type). Move GenerateMat to test_util-inl for sharing between matmul test/bench. Move internal init to gemma.cc to avoid duplication. Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage. Remove --compressed_weights, use --weights instead. tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img. Allocator: use normal unique_ptr for AllocBytes so users can call directly. threading: use -> because AlignedPtr no longer assumes arrays. PiperOrigin-RevId: 745918637	2025-04-10 01:29:54 -07:00

30 Commits