gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	56186193c1	Replace mt19937 with new generator to enable parallel sampling Split it into immutable AesCtrEngine and RngStream Also add RowSpan and Logits span PiperOrigin-RevId: 803336423	2025-09-04 23:49:10 -07:00
Jan Wassenberg	ed2f0bd1b0	Fix pos assertions, refs #665 Ensure the streaming func pos matches the number of calls. Add two arguments that control pos+1 and pos+=1 behavior. Also cleanup/add comments. run: use batch_stream_func, add assert, higher verbosity for MM autotune output PiperOrigin-RevId: 799511163	2025-08-26 04:50:40 -07:00
Jan Wassenberg	701841897b	Default to disabling per-socket parallelization weights: default to Read for small-batch (only look at qbatch, not the larger prefill tbatch) PiperOrigin-RevId: 790787643	2025-08-04 09:49:14 -07:00
Jan Wassenberg	799c264df3	Pre-tune thread pool before matmul Also improve profiler annotations - remove near-zero ones and add more for startup PiperOrigin-RevId: 789352414	2025-07-31 08:45:26 -07:00
Jan Wassenberg	d831ddce5b	Fix file mapping: was letting the smart pointer go out of scope Also save+print the IO mode used. PiperOrigin-RevId: 788848165	2025-07-30 04:30:10 -07:00
Jan Wassenberg	ac0d751d20	Rename GetModelConfig->Config PiperOrigin-RevId: 788506480	2025-07-29 10:18:12 -07:00
Jan Wassenberg	e76e29ce11	De-singleton ThreadingContext so callers can pass in their own weights.cc: fix BindB argument for bf16 tensors threading_test: enable autotune PiperOrigin-RevId: 785763618	2025-07-22 02:08:46 -07:00
Jan Wassenberg	a04cc287b2	Move MatMulEnv out of Gemma to enable concurrent calls Also update benchmark_helper config print: add profiler, remove free mem PiperOrigin-RevId: 774662974	2025-06-23 01:20:09 -07:00
Jan Wassenberg	f2adbfbcab	Batch inference fixes: set pos during prefill, fix assert PiperOrigin-RevId: 772458760	2025-06-17 07:09:44 -07:00
Jan Wassenberg	9a02d6be68	Add --prompt_file and testdata for it. Refs #608 Linux terminals truncate input after 4096 chars. testdata is Frankenstein from project Gutenberg, which are long out of copyright. Also fix loss of coherence after long context caused by incorrect IsGlobalLayer. Move that to config.h and use max_seq_len as the initializer to make this clear. Also avoid dynamic allocation for GriffinActivations. PiperOrigin-RevId: 772333225	2025-06-16 23:41:07 -07:00
Jan Wassenberg	e5c81f64a1	Major refactor: clarify query_idx (global) vs qi. Refs #607 Fix missing pos increment for last prefill and check that in gemma_test. Thanks to @ufownl for pointing this out. Change argument lists to QBatch with accessors. Increase default seq_len to 8k. PiperOrigin-RevId: 771937385	2025-06-16 02:42:02 -07:00
Jan Wassenberg	c027a45a2e	MatPtr-ify KV, shared div_seq_len, --seq_len flag PiperOrigin-RevId: 770194455	2025-06-11 09:49:38 -07:00
Daniel Keysers	d7b23d532a	Restructure internal initialization. PiperOrigin-RevId: 769507096	2025-06-10 01:25:31 -07:00
Jan Wassenberg	6897313080	3x speedup of EmbedImagePatches - GEMM, not GEMV. Required fixes to handling of non-vector aligned A. Also move row ptrs to MatMulEnv. PiperOrigin-RevId: 767029036	2025-06-04 01:18:52 -07:00
Jan Wassenberg	9efdcfd45c	1.07x batch decode speedup: more BF16 weights and activations BF16 att_sums and ffw_out Support BF16 B views without decompression Support arbitrary types in MulByConstAndAdd, AddFrom Also update profiler annotations in ops-inl.h PiperOrigin-RevId: 766995010	2025-06-03 23:30:18 -07:00
RangerUFO	6debdbe341	Minor fixes for ViT	2025-05-20 22:27:10 +08:00
Jan Wassenberg	2038dfd9cc	Minor: rename compression/shared -> types.h PiperOrigin-RevId: 758199851	2025-05-13 06:53:21 -07:00
Jan Wassenberg	45ad847a41	Replace RowVectorBatch with MatStorageT KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out optimize_test: larger vocab_size requires more steps shared.h: Remove unused u128 type correctly set Activation matrix rows, avoid passing as arg ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type mat: add OverrideRows, used by SetBatchSize PiperOrigin-RevId: 757790736	2025-05-12 09:16:12 -07:00
Jan Wassenberg	8d0882b966	Huge refactor of weight handling and model loading. Weight handling: - new ModelStore2 supports both pre-2025 multi-file and single-file formats - simpler ForEachTensor with TensorArgs - tensors are constructed with their full suffixed name I/O: - support mmap and stride - Simplified SbsWriter, single insert(); add SbsReader Misc: - kMockTokenizer: allow creating with unavailable tokenizer - configs.h: Simpler enum validity checks via kSentinel - matmul.h: remove unused enable_bind (now in allocator.h) - tensor_info: single TensorInfoRegistry class, rename from tensor_index.h Frontends: - Replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&) - Deduce model/weight type, remove --model and parsing - Replace most common.h includes with configs.h - Remove --compressed_weights, use --weights instead - Remove ModelInfo, replaced by ModelConfig. Backprop: - Reduce max loss, remove backward_scalar_test (timeout) - Update thresholds because new RandInit changes rng eval order and thus numerics PiperOrigin-RevId: 755317484	2025-05-06 04:44:21 -07:00
The gemma.cpp Authors	ba10c88a94	Add C API and C# interop files This change adds a basic C API that allows access to Gemma functionality from other programming languages. The functionality is exposed via a shared library (DLL on Windows), with C++ interfaces and a basic C# interop wrapper included. To build the DLL, use the `windows-dll` preset, which includes the C and C++ sources as follows: ``` cmake --preset windows-dll cmake --build --config Release --preset windows-dll -j 4 ``` This should generate a `gemma.dll` in `<build-dir>/Release`. To build for non-Windows, the appropriate C++ DLL linking will need to be done to generate a shared library for the target OS. PiperOrigin-RevId: 750246272	2025-04-22 10:35:47 -07:00
prajwalc22	a9e56c27eb	removed unnecessary threading.h import	2025-04-17 23:44:23 +05:30
prajwalc22	f55c321397	Address review feedback: Fix prefill_tbatch_size and variable placement issues	2025-04-17 10:15:21 +05:30
prajwalc22	27c28cc938	Address review feedback: Fix prefill_tbatch_size and variable placement issues	2025-04-17 10:15:05 +05:30
prajwalc22	8246e49199	Add non-interactive mode support - Added prompt flag to InferenceArgs for non-interactive mode - Set user-facing options to verbosity level 1 - Fixed prompt_size declaration and variable ordering in run.cc - Properly set prompt_size after WrapAndTokenize calls - Moved kVerboseLogTokens block after prompt_size is set	2025-04-16 16:26:52 +05:30
prajwalc22	cbf179990f	Add --prompt flag for non-interactive mode	2025-04-16 15:34:43 +05:30
prajwalc22	f3116d2577	Add --prompt flag for non-interactive mode This change adds a --prompt command-line option that allows users to provide prompts directly without entering interactive mode, which is useful for scripting and automation.	2025-04-16 09:45:02 +05:30
Jan Wassenberg	8532da47f7	Major refactor of allocator/args: use new ThreadingContext2 instead of monostate/init in each frontend Add ThreadingArgs(replaces AppArgs) backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride compress_weights: remove, moving to py-only exporter instead Move MatPtr to mat.h and revise interface: - Generic MatOwner - rename accessors to Packed* - support stride/row accessors, fix RowPtr stride Add TypeBits(Type) Move GenerateMat to test_util-inl for sharing between matmul test/bench Move internal init to gemma.cc to avoid duplication Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage Remove --compressed_weights, use --weights instead. tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img Allocator: use normal unique_ptr for AllocBytes so users can call directly threading: use -> because AlignedPtr no longer assumes arrays PiperOrigin-RevId: 745918637	2025-04-10 01:29:54 -07:00
Copybara-Service	bef91a3f03	Merge pull request #529 from ufownl:refactor/wrap_and_tokenize PiperOrigin-RevId: 745174371	2025-04-08 09:22:26 -07:00
RangerUFO	ca4ee2b63f	Refactor `WrapAndTokenize` to work properly with Gemma3	2025-03-29 11:31:39 +08:00
RangerUFO	d42deaa27c	Set the secondary EOS for Gemma2 So that we can remove the `<end_of_turn>` filter that was set up specifically for Gemma2.	2025-03-22 01:32:22 +08:00
RangerUFO	2bad79f110	Fix the EOS checking The secondary eos is usually `<end_of_turn>`, which can appear in the prompt, so we can only check it not in the prompt.	2025-03-22 01:32:22 +08:00
Phil Culliton	05b1cce9f7	Add support for a secondary EOS token PiperOrigin-RevId: 738898976	2025-03-20 12:28:31 -07:00
Jan Wassenberg	1b72c22345	Refactor Gemma ctor and improve pool NUMA support Gemma receives a MatMulEnv arg, with comment on lifetime Split threading into topology so the latter can be used in allocator Add AllocClasses() for non-POD (ThreadPool) Support binding pool to NUMA node Update threading_test with latency measurements Also update Highway version. PiperOrigin-RevId: 736904748	2025-03-14 10:19:00 -07:00
Phil Culliton	1b1b63d560	Fix PaliGemma models. PiperOrigin-RevId: 736483021	2025-03-13 06:28:29 -07:00
Phil Culliton	4ab601da10	Internal change. PiperOrigin-RevId: 736015810	2025-03-11 23:20:20 -07:00
Jan Wassenberg	f9d93e4a42	Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning Remove empty matmul_unit_test. Up to 25 TFLOP/s on 2xZen4 for 512,3072,24576. PiperOrigin-RevId: 729123576	2025-02-20 08:33:46 -08:00
Ray Smith	b93231a47d	Moved the vit config fields to their own config struct PiperOrigin-RevId: 715692800	2025-01-15 01:09:49 -08:00
Daniel Keysers	62c70d6715	Rename ModelTraining to PromptWrapping which is a more accurate name. PiperOrigin-RevId: 705881500	2024-12-13 07:45:59 -08:00
Daniel Keysers	aed17396be	Make prompt wrapping more consistent and fix duplicated tokens for multi-turn. Do not echo <end_of_turn> tokens to the user. Have verbosity=0 only show the dialog. PiperOrigin-RevId: 705021391	2024-12-11 01:52:00 -08:00
Daniel Keysers	331d2ccc02	Add support for 448px resolution to PaliGemma and PaliGemma2. PiperOrigin-RevId: 704361579	2024-12-09 11:38:10 -08:00
Daniel Keysers	719699f132	Make top_k a runtime argument (instead of a model argument). PiperOrigin-RevId: 696170691	2024-11-13 09:48:59 -08:00
Jan Wassenberg	868b01601f	Simpler MatMul interface, vocab types, Tristate for use_spinning Add Extents2D, Range2D vocab types Matmul uses ConstMat for inputs and RowPtr for output Move RowVectorBatch to basics.h Separate threading.cc Fix topology string: report cores not LPs, and #HT Move QStride/IsMHA into LayerConfig ImageTokens does not require make_unique. matmul_test: no longer require template args PiperOrigin-RevId: 692963605	2024-11-04 07:48:29 -08:00
Daniel Keysers	583bd93e9a	Factor out addition of ViTConfig to a ModelConfig. Use ModelConfig values for ImageTokens. Output timing info for image token generation. Add a method to copy image data into Image class directly. Minor changes: pipe ModelTraining to more places. PiperOrigin-RevId: 690572283	2024-10-28 05:29:33 -07:00
Jan Wassenberg	02ce1e344f	Use NestedPools, add NUMA infra Improved threading.h, fix thread counts for single package/cluster systems Temporarily forces to a single socket. Prefill 29.28 tps, decode 6.92. Also fix benchmarks.cc build, update tensor allocator to Allocator PiperOrigin-RevId: 687307167	2024-10-18 08:11:18 -07:00
Ray Smith	0d68555f87	Eliminated TConfig. Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively. Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly. Adjusted WeightsWrapper and ForwardLayer etc to match. The only remaining template arg is the weight type. This enables all the instantiations to be deleted, apart from one per type. It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately. Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively. PiperOrigin-RevId: 686870060	2024-10-17 05:04:22 -07:00
Daniel Keysers	a4d6adbc43	Introduce QueryResult in GemmaEnv and add a shortcut for WrapAndTokenize. Remove max_tokens (and rely on only max_generated_tokens). PiperOrigin-RevId: 685662260	2024-10-14 04:45:21 -07:00
Jan Wassenberg	2c28b18eb0	Add NestedPools: one per socket/cluster Use in dot_test app.h: add new flags and rename num_threads to max_threads matmul: Parallelize MatMulSlow and enable spinning, more large/fewer medium test cases PiperOrigin-RevId: 683216386	2024-10-07 09:40:19 -07:00
Daniel Keysers	f8835fe4a4	Add support for PaliGemma Vision-LM (224x224) to gemma.cpp See https://arxiv.org/abs/2407.07726 for a description of the model. Because PaliGemma operates as a prefix-LM on the image+prompt, add support for that. PiperOrigin-RevId: 677841119	2024-09-23 10:09:38 -07:00
Jan Wassenberg	22995c699d	Simplify pos handling, auto-increment output arg - no longer multiply by num_queries - remove unused interleaved prompts - Rename to Queries* - Rename batch_start/interleaved_pos/pos to queries_pos PiperOrigin-RevId: 663331823	2024-08-15 09:25:26 -07:00
Jan Wassenberg	282f73ec2f	Add pin flag to disable pinning. Refs #338 PiperOrigin-RevId: 661389171	2024-08-09 13:47:12 -07:00

74 Commits