gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Daniel Keysers	f8835fe4a4	Add support for PaliGemma Vision-LM (224x224) to gemma.cpp See https://arxiv.org/abs/2407.07726 for a description of the model. Because PaliGemma operates as a prefix-LM on the image+prompt, add support for that. PiperOrigin-RevId: 677841119	2024-09-23 10:09:38 -07:00
Daniel Keysers	a8e08778d4	Add an additional QueryModel() overload to GemmaEnv. Use args only in GemmaEnv constructor, store everything else in RuntimeConfig. Add runtime option to turn off thread spinning. PiperOrigin-RevId: 670467320	2024-09-03 02:25:19 -07:00
Paul Chang	773333e5be	Expose underlying model configuration: number of layers, heads, etc. PiperOrigin-RevId: 663747853	2024-08-16 09:03:24 -07:00
Paul Chang	b9ed12a325	Support directly observing activations, partially replacing LayersOutputFunc LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs. Instead, we directly expose the Activations structure. PiperOrigin-RevId: 663409316	2024-08-15 12:39:07 -07:00
Jan Wassenberg	22995c699d	Simplify pos handling, auto-increment output arg - no longer multiply by num_queries - remove unused interleaved prompts - Rename to Queries* - Rename batch_start/interleaved_pos/pos to queries_pos PiperOrigin-RevId: 663331823	2024-08-15 09:25:26 -07:00
RangerUFO	730b6bfc94	Implement `start_pos` per query for batch interface	2024-08-12 18:50:23 +02:00
Jan Wassenberg	b831fa8482	1.3x prefill, 0.95x decode: matmul replacing last matvec Before 38.28, 9.17 (with profiler enabled, prompt = 330 tok) ``` Gen.FFW : 15414 x 4692352 = 24.166318 Gen.Attention.SumHeads : 15414 x 1394804 = 7.183451 !! Gen.Embedding : 361 x 49961894 = 6.026297 Gen.Attention.QKV : 15414 x 1005125 = 5.176546 Gen.Attention.DotSoftmax : 15414 x 885480 = 4.560357 RopeAndMulBy : 696528 x 11867 = 2.761818 ``` After 49.80, 8.68 ``` Gen.FFW : 14448 x 5312783 = 25.646868 Gen.Embedding : 338 x 63044815 = 7.119845 Gen.Attention.QKV : 14448 x 1115003 = 5.382557 Gen.Attention.DotSoftmax : 14448 x 897577 = 4.332957 RopeAndMulBy : 673344 x 11886 = 2.674156 Gen.Attention.SumHeads : 14448 x 518291 = 2.501993 !! ``` PiperOrigin-RevId: 662024085	2024-08-12 03:36:01 -07:00
The gemma.cpp Authors	27258b03e6	Improve performance logging PiperOrigin-RevId: 660534330	2024-08-07 14:15:43 -07:00
Jan Wassenberg	5e433e774a	1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism. Limit thread counts to detected. Add max_clusters arg. Update detection logic to check for smt0 - previously we pinned to some siblings. PiperOrigin-RevId: 659755311	2024-08-05 18:50:09 -07:00
Paul Chang	d37c088e44	Extend LayersOutputFunc to take query index and auxillary int PiperOrigin-RevId: 657574814	2024-07-30 06:53:56 -07:00
Jan Wassenberg	aaf51898b6	Major revamp #2 of Prefill: fix token order, parallel for multi-query - Allocate only the required KV caches and activation batch size - Add flags for batch sizes - Const-correct interface: Span of const int. - Also clean up the KVCache arg to a span. - Move kPrefillBatchSize into RuntimeConfig and remove related global constants. PiperOrigin-RevId: 655893197	2024-07-25 03:28:55 -07:00
Jan Wassenberg	12016d31c3	Major Prefill/Generate cleanup, 1.3x Prefill speedup This fixes TTFT, which was not including prefill. PiperOrigin-RevId: 653690626	2024-07-18 11:16:46 -07:00
Jan Wassenberg	992a2cbbc0	De-templatize Activations, add RowVectorBatch class Also remove most kBatchSize args. PiperOrigin-RevId: 653185525	2024-07-17 04:38:15 -07:00
Jan Wassenberg	c7c3daa624	7x compile time speedup: shard gemma.cc Use overloaded functions defined in gemma/instantiations. Also split out activations.h. PiperOrigin-RevId: 649053122	2024-07-03 06:35:04 -07:00
Daniel Keysers	a40165dea2	Small cleanups. Fixes gemma_test build. PiperOrigin-RevId: 649008524	2024-07-03 03:13:38 -07:00
Jan Wassenberg	09a7e75ead	Prep for sharding gemma.cc: split into kv_cache, tokenizer. Move activations.h to backprop/ to make space for another activations.h. PiperOrigin-RevId: 648744500	2024-07-02 09:31:06 -07:00
Jan Wassenberg	85fcd3cd80	Cleanup: add ModelInfo struct, remove gcpp:: PiperOrigin-RevId: 648707763	2024-07-02 07:11:15 -07:00
Jan Wassenberg	e527e7662e	Remove unused kSystemPrompt PiperOrigin-RevId: 648429567	2024-07-01 11:18:07 -07:00
The gemma.cpp Authors	da7507e6f0	Add prompt batching to Gemma.cpp. This CL adds a new function to Gemma that allows for batching of multiple prompts. The function takes a vector of prompts and returns a vector of responses. The prompts are processed in parallel, and the responses are returned in the same order as the prompts. PiperOrigin-RevId: 648367559	2024-07-01 07:51:31 -07:00
Jan Wassenberg	d3c6a45b59	Major duplicated code reduction in test/benchmarks Helper functions to tokenize/wrap Move LayersOutputFunc into RuntimeConfig AcceptFunc passes the probability Implement StringFromType using the parser, and verify results match PiperOrigin-RevId: 643255119	2024-06-14 00:16:25 -07:00
Daniel Keysers	6e67a6d8a9	Tiny cleanup: distinguish between "ids" and "pieces" in argument names when encoding. PiperOrigin-RevId: 642614278	2024-06-12 07:52:13 -07:00
Daniel Keysers	1ac9857014	Extends Transformer() to prepare for batched processing. PiperOrigin-RevId: 642603025	2024-06-12 07:01:03 -07:00
Jan Wassenberg	3e2396f98c	Use Loader/AppArgs to construct gemma_test model, simplify AcceptFunc accept_token: allow default, check if empty when using allow mixing sample_func and stream_func, call the latter after the former Also fix missing includes/deps. PiperOrigin-RevId: 642240012	2024-06-11 05:53:10 -07:00
Jan Wassenberg	f9b390b134	Support all weight types in a single binary. This changes the command line flags, but the default value retains the previous behavior. Also add a CreateGemma helper to enable extra args without interface changes. PiperOrigin-RevId: 641266411	2024-06-07 09:04:45 -07:00
Copybara-Service	24db2ff725	Merge pull request #217 from szabadka:cross-entropy PiperOrigin-RevId: 641241133	2024-06-07 07:17:35 -07:00
Daniel Keysers	06f814fc8b	Small code cleanup suggestions while reading the code. PiperOrigin-RevId: 641220788	2024-06-07 05:33:17 -07:00
Zoltan Szabadka	465998d25a	Add support for custom sampling function to runtime config. With this addition the ComputeCrossEntropy function can be moved to its own library, because now we can compute it using only the public API functions from gemma.h	2024-06-07 11:45:07 +00:00
Zoltan Szabadka	c004799cdc	Add Adam optimizer. Drive-by: Fix compilation errors and tests for backprop functions.	2024-06-06 18:41:36 +00:00
Jan Wassenberg	57c2cd8b52	Simplifications: remove GemmaInterface and GemmaImpl Split common and weights into separate lib Remove common-inl (does not have to be SIMD code), activations.cc Centralize switch(Model) to avoid duplication Move CompressWeightsT to compress_weights.cc Move LoadWeights to weights.cc PiperOrigin-RevId: 640869202	2024-06-06 05:54:21 -07:00
Zoltan Szabadka	36e4d8bbfe	Add first version of backpropagation support. This is still in progress / experimental, currently it is only implemented for normal gemma MQA attention layers, and no parallelism is added yet for backward pass. Since we need to remember all activations from all layers, the forward pass was also reimplemented with a new activation data structure.	2024-06-04 08:37:49 +00:00
Apoorv Reddy	7f4b85d00b	Add MMLU eval to github PiperOrigin-RevId: 635495178	2024-05-20 10:20:53 -07:00
Apoorv Reddy	8e641eb4cd	Add TTFT to TimingInfo PiperOrigin-RevId: 634378994	2024-05-16 07:16:53 -07:00
Apoorv Reddy	eb0b96e0a8	Pass most runtime parameters using const RuntimeConfig& PiperOrigin-RevId: 633572507	2024-05-14 07:04:53 -07:00
Apoorv Reddy	f1eab987d8	Store tokens/sec in auxiliary struct TimingInfo. PiperOrigin-RevId: 633108908	2024-05-13 00:04:19 -07:00
Zoltan Szabadka	afaca4efa8	Use more parallelism in the QKV projections in MQA mode. Instead of MatVecLoop, we use MatVec and we combine k and v into one 2 * kQKVDim long vector so that K and V projections can be combined into one MatVec operation. Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation): ``` Prefill speed Generation speed Num threads BEFORE AFTER BEFORE AFTER 4 9.81 t/s 9.96 t/s 8.39 t/s 8.46 t/s 18 31.50 t/s 36.67 t/s 23.10 t/s 25.83 t/s 32 45.36 t/s 58.91 t/s 27.60 t/s 31.25 t/s 64 57.72 t/s 80.64 t/s 35.40 t/s 39.76 t/s ```	2024-04-30 13:10:14 +00:00
Zoltan Szabadka	27117cc39f	Simplify threading: remove the use of inner_pool. We only used inner_pool in the prefill FFW function, and there we can achieve sufficient parallelism on the rows of the matrix-vector multiplications. Benchmark results on a 1600-token summarization task: ``` Prefill speed Num threads BEFORE AFTER 4 9.24 t/s 9.76 t/s 18 31.41 t/s 31.16 t/s 32 31.41 t/s 45.13 t/s 64 31.03 t/s 57.85 t/s ```	2024-04-29 16:07:30 +00:00
Jan Wassenberg	e9a0caed87	Further improve IO, enable multiple backends without -D. Move Path into io.h and use for opening files. Removes dependency of gemma_lib on args. Separate Windows codepath instead of emulating POSIX functions. Plus lint fixes. PiperOrigin-RevId: 626279004	2024-04-19 00:40:29 -07:00
Andrey Mikhaylov	4ef3da733a	Fixed minor things and added comments.	2024-04-12 15:39:16 +00:00
Andrey Mikhaylov	2c5706f159	Add comments regarding layers output usage.	2024-04-12 15:39:16 +00:00
Andrey Mikhaylov	03284d752e	Added layers output functionality to gemma and a binary debug_output to save the outputs to a json file.	2024-04-12 15:39:16 +00:00
RangerUFO	2099b37732	Change `NumGemmaLayers` and `NumGriffinLayers` to constants in configs	2024-04-09 20:44:41 +08:00
Jan Wassenberg	a982ec1287	Move code to gemma/ so we can remove error-prone copybara: comments. Also fix includes and Lint warnings. PiperOrigin-RevId: 623127487	2024-04-09 04:45:42 -07:00

42 Commits