gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Daniel Keysers	493688f6f1	Allow interactive use with new single-file weight format. Add section about new weights format to README.md. Remove model_type_required parameter. Update error handling for flags. PiperOrigin-RevId: 715788822	2025-01-15 07:22:33 -08:00
Ray Smith	9d40f0117e	Added ability to load/save a complete model file, including tokenizer. PiperOrigin-RevId: 707914366	2024-12-19 07:59:41 -08:00
Daniel Keysers	62c70d6715	Rename ModelTraining to PromptWrapping which is a more accurate name. PiperOrigin-RevId: 705881500	2024-12-13 07:45:59 -08:00
Daniel Keysers	aed17396be	Make prompt wrapping more consistent and fix duplicated tokens for multi-turn. Do not echo <end_of_turn> tokens to the user. Have verbosity=0 only show the dialog. PiperOrigin-RevId: 705021391	2024-12-11 01:52:00 -08:00
Daniel Keysers	5bbe814a53	Tiny cleanup. PiperOrigin-RevId: 704636988	2024-12-10 03:34:05 -08:00
Jan Wassenberg	6a34e9c547	Print cache info and update Highway version for that PiperOrigin-RevId: 702318451	2024-12-03 06:31:52 -08:00
Jan Wassenberg	f74d496879	Threading/infra improvements. * Add ParallelizeRange helpers and partitioning helpers Refactor Pinning class, store original affinity (required to construct another NestedPools after pinning happened) Compress: * prevent Compress printing stats in tests * zero-pad tensors Matmul: * add matmul_unit_test (TODO) and bench_matmul * matmul_test: change norm to row vectors (that is what is added) and include bf16 rounding error * Prepare for L2/L3 retrieval PiperOrigin-RevId: 700603811	2024-11-27 01:12:00 -08:00
Stanko Novakovic	109a4d9f85	Add a simple benchmark for batching. This is a simple Gemma benchmark with a fixed batch size of 32. PiperOrigin-RevId: 698843573	2024-11-21 10:59:49 -08:00
Daniel Keysers	719699f132	Make top_k a runtime argument (instead of a model argument). PiperOrigin-RevId: 696170691	2024-11-13 09:48:59 -08:00
Daniel Keysers	e54d9cbddd	Fix Griffin model: - use HalfRope position encodings - zero-initialize the caches for each Generate at position 0 The lack of the latter made the tests in gemma_test dependent on each other. PiperOrigin-RevId: 694509054	2024-11-08 08:30:53 -08:00
Jan Wassenberg	868b01601f	Simpler MatMul interface, vocab types, Tristate for use_spinning Add Extents2D, Range2D vocab types Matmul uses ConstMat for inputs and RowPtr for output Move RowVectorBatch to basics.h Separate threading.cc Fix topology string: report cores not LPs, and #HT Move QStride/IsMHA into LayerConfig ImageTokens does not require make_unique. matmul_test: no longer require template args PiperOrigin-RevId: 692963605	2024-11-04 07:48:29 -08:00
Jan Wassenberg	02ce1e344f	Use NestedPools, add NUMA infra Improved threading.h, fix thread counts for single package/cluster systems Temporarily forces to a single socket. Prefill 29.28 tps, decode 6.92. Also fix benchmarks.cc build, update tensor allocator to Allocator PiperOrigin-RevId: 687307167	2024-10-18 08:11:18 -07:00
Ray Smith	0d68555f87	Eliminated TConfig. Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively. Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly. Adjusted WeightsWrapper and ForwardLayer etc to match. The only remaining template arg is the weight type. This enables all the instantiations to be deleted, apart from one per type. It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately. Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively. PiperOrigin-RevId: 686870060	2024-10-17 05:04:22 -07:00
Daniel Keysers	a4d6adbc43	Introduce QueryResult in GemmaEnv and add a shortcut for WrapAndTokenize. Remove max_tokens (and rely on only max_generated_tokens). PiperOrigin-RevId: 685662260	2024-10-14 04:45:21 -07:00
The gemma.cpp Authors	dfda53e634	Benchmark gemma.cpp with different length inputs. PiperOrigin-RevId: 684607945	2024-10-10 15:59:26 -07:00
Jan Wassenberg	2c28b18eb0	Add NestedPools: one per socket/cluster Use in dot_test app.h: add new flags and rename num_threads to max_threads matmul: Parallelize MatMulSlow and enable spinning, more large/fewer medium test cases PiperOrigin-RevId: 683216386	2024-10-07 09:40:19 -07:00
Jan Wassenberg	2d14d796e3	1.09x decode speedup for topk=1/temp0: fuse softmax and sample PiperOrigin-RevId: 680589099	2024-09-30 08:37:41 -07:00
Daniel Keysers	673673cc98	Update expected entropy values for GRIFFIN_2B model. These changed after introduction of "Cascaded summation for Softmax" PiperOrigin-RevId: 678145851	2024-09-24 02:12:59 -07:00
Jan Wassenberg	c6c10e0a53	Fix topology display for platforms where it fails (Apple) PiperOrigin-RevId: 677800053	2024-09-23 08:14:54 -07:00
Daniel Keysers	760a69449e	Add entropy expectations for Griffin-2b model in gemma_test and make sure it passes. PiperOrigin-RevId: 675564389	2024-09-17 07:46:06 -07:00
Jan Wassenberg	8c0a8834c1	Major compression update, arbitrary-len unpack + new Dot Compression: * Implement {any packed} x {bf16, f32} 'Load2' and DecompressAndZeroPad * New compression test for all packed formats, add to GEMMA_TEST_FILES, remove from sfp/nuq_test * Decompress->DecompressAndZeroPad, use PackedSpan for args with bounds checking * NUQ: support arbitrary-length enc/dec * New compression/shared, remove sfp.h and nuq.h * Move Store2 into Traits and provide Compress2 wrapper * Remove unused Decompress()-with-pool overload * Simplify CompressedArrayLen, rename to CompressedArrayElements * Remove unused DistortionStats b_l1_ Misc: * Add compensated and Kahan dot, support any length * Use same Dot function everywhere * Move exact arithmetic functions into fp_arith * use FloatPtr and MatPtr typedefs in tests; less stack usage * Rename args to packed/raw * Remove Traits::Name, instead TypeName<T>() * Move kMaxSFP and kClusters/kGroupSize into Sfp/NuqStream PiperOrigin-RevId: 672868468	2024-09-10 02:22:19 -07:00
Daniel Keysers	437e0eb9af	Internal change. Slight restructuring of gemma_test. PiperOrigin-RevId: 670529565	2024-09-03 06:16:09 -07:00
Daniel Keysers	a8e08778d4	Add an additional QueryModel() overload to GemmaEnv. Use args only in GemmaEnv constructor, store everything else in RuntimeConfig. Add runtime option to turn off thread spinning. PiperOrigin-RevId: 670467320	2024-09-03 02:25:19 -07:00
Daniel Keysers	3c17911875	Make gemma_test slightly more lenient on MultiTurn. PiperOrigin-RevId: 668097277	2024-08-27 12:40:16 -07:00
Jan Wassenberg	c4303cd89b	Fix test for 2b - update prompt PiperOrigin-RevId: 667878053	2024-08-27 00:56:47 -07:00
Daniel Keysers	18e6012872	Fix prefill for batched queries. This lets gemma_test/GeographyBatched pass now also for gemma2-27B. PiperOrigin-RevId: 664827485	2024-08-19 08:50:42 -07:00
Jan Wassenberg	22995c699d	Simplify pos handling, auto-increment output arg - no longer multiply by num_queries - remove unused interleaved prompts - Rename to Queries* - Rename batch_start/interleaved_pos/pos to queries_pos PiperOrigin-RevId: 663331823	2024-08-15 09:25:26 -07:00
RangerUFO	730b6bfc94	Implement `start_pos` per query for batch interface	2024-08-12 18:50:23 +02:00
Daniel Keysers	7316ee8f96	Fix gemma_test GeographyBatched for 2b-it and add entropy expectations for gemma2-2b-it. PiperOrigin-RevId: 662072395	2024-08-12 07:12:46 -07:00
Apoorv Reddy	fd1b0743a7	Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B. This is to make it clear that these models are part of the Gemma2 family of models. PiperOrigin-RevId: 661181682	2024-08-09 02:09:06 -07:00
The gemma.cpp Authors	27258b03e6	Improve performance logging PiperOrigin-RevId: 660534330	2024-08-07 14:15:43 -07:00
Jan Wassenberg	5e433e774a	1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism. Limit thread counts to detected. Add max_clusters arg. Update detection logic to check for smt0 - previously we pinned to some siblings. PiperOrigin-RevId: 659755311	2024-08-05 18:50:09 -07:00
Paul Chang	d37c088e44	Extend LayersOutputFunc to take query index and auxiliary int PiperOrigin-RevId: 657574814	2024-07-30 06:53:56 -07:00
Jan Wassenberg	aaf51898b6	Major revamp #2 of Prefill: fix token order, parallel for multi-query - Allocate only the required KV caches and activation batch size - Add flags for batch sizes - Const-correct interface: Span of const int. - Also clean up the KVCache arg to a span. - Move kPrefillBatchSize into RuntimeConfig and remove related global constants. PiperOrigin-RevId: 655893197	2024-07-25 03:28:55 -07:00
Jan Wassenberg	12016d31c3	Major Prefill/Generate cleanup, 1.3x Prefill speedup This fixes TTFT, which was not including prefill. PiperOrigin-RevId: 653690626	2024-07-18 11:16:46 -07:00
Daniel Keysers	cf76f0a401	Update gemma_test to also pass for the v1.1 models. Make it an error if the model cannot be loaded. PiperOrigin-RevId: 650232602	2024-07-08 06:45:37 -07:00
Jan Wassenberg	cbb67b4ee0	Move benchmark_helper to evals/, weights_raw to compression/. PiperOrigin-RevId: 650155983	2024-07-08 01:13:23 -07:00
Daniel Keysers	cdebcc3533	Update gemma_test with the expected entropy values for the IT models of size 2B/7B/9B/27B. PiperOrigin-RevId: 649662047	2024-07-05 08:58:51 -07:00
Jan Wassenberg	118e802b00	Fix gemma_test - moved to evals/. PiperOrigin-RevId: 649338633	2024-07-04 02:04:05 -07:00
Jan Wassenberg	85fcd3cd80	Cleanup: add ModelInfo struct, remove gcpp:: PiperOrigin-RevId: 648707763	2024-07-02 07:11:15 -07:00
Jan Wassenberg	af8eb2fde3	Declutter gemma/ directory, move binaries to evals/ and util/. PiperOrigin-RevId: 648400795	2024-07-01 09:51:04 -07:00

41 Commits