Daniel Keysers
a8e08778d4
Add an additional QueryModel() overload to GemmaEnv.
...
Use args only in GemmaEnv constructor, store everything else in RuntimeConfig.
Add runtime option to turn off thread spinning.
PiperOrigin-RevId: 670467320
2024-09-03 02:25:19 -07:00
Paul Chang
773333e5be
Expose underlying model configuration: number of layers, heads, etc.
...
PiperOrigin-RevId: 663747853
2024-08-16 09:03:24 -07:00
Jan Wassenberg
22995c699d
Simplify pos handling, auto-increment output arg
...
- no longer multiply by num_queries
- remove unused interleaved prompts
- Rename to Queries*
- Rename batch_start/interleaved_pos/pos to queries_pos
PiperOrigin-RevId: 663331823
2024-08-15 09:25:26 -07:00
RangerUFO
730b6bfc94
Implement `start_pos` per query for batch interface
2024-08-12 18:50:23 +02:00
Jan Wassenberg
5e433e774a
1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism.
...
Limit thread counts to detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.
PiperOrigin-RevId: 659755311
2024-08-05 18:50:09 -07:00
Jan Wassenberg
aaf51898b6
Major revamp #2 of Prefill: fix token order, parallel for multi-query
...
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.
PiperOrigin-RevId: 655893197
2024-07-25 03:28:55 -07:00
Jan Wassenberg
12016d31c3
Major Prefill/Generate cleanup, 1.3x Prefill speedup
...
This fixes TTFT, which was not including prefill.
PiperOrigin-RevId: 653690626
2024-07-18 11:16:46 -07:00
Jan Wassenberg
992a2cbbc0
De-templatize Activations, add RowVectorBatch class
...
Also remove most kBatchSize args.
PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
Jan Wassenberg
c7c3daa624
7x compile time speedup: shard gemma.cc
...
Use overloaded functions defined in gemma/instantiations.
Also split out activations.h.
PiperOrigin-RevId: 649053122
2024-07-03 06:35:04 -07:00
Daniel Keysers
a40165dea2
Small cleanups. Fixes gemma_test build.
...
PiperOrigin-RevId: 649008524
2024-07-03 03:13:38 -07:00
Kan Wu
7e4b20455e
Add sliding window attention for Gemma 2.
...
PiperOrigin-RevId: 648778253
2024-07-02 11:08:03 -07:00
Jan Wassenberg
09a7e75ead
Prep for sharding gemma.cc: split into kv_cache, tokenizer.
...
Move activations.h to backprop/ to make space for another activations.h.
PiperOrigin-RevId: 648744500
2024-07-02 09:31:06 -07:00
Jan Wassenberg
85fcd3cd80
Cleanup: add ModelInfo struct, remove gcpp::
...
PiperOrigin-RevId: 648707763
2024-07-02 07:11:15 -07:00
Jan Wassenberg
b1c1ec1d59
Use benchmark_helper in py bindings (adds BOS)
...
Also remove thread clamp (OK to be zero or large).
PiperOrigin-RevId: 648657155
2024-07-02 03:27:15 -07:00
Jan Wassenberg
e588a7f45d
Add config for att/final cap, skip max-subtract. Fixes #278
...
Also update includes/deps for backprop/.
PiperOrigin-RevId: 648399222
2024-07-01 09:45:26 -07:00
The gemma.cpp Authors
da7507e6f0
Add prompt batching to Gemma.cpp.
...
This CL adds a new function to Gemma that allows for batching of multiple prompts. The function takes a vector of prompts and returns a vector of responses. The prompts are processed in parallel, and the responses are returned in the same order as the prompts.
PiperOrigin-RevId: 648367559
2024-07-01 07:51:31 -07:00
The gemma.cpp Authors
80b1347393
Skip the last RMSNormInplaceBatched in the Prefill phase.
...
That only modifies activations.x, but it is called with prefill_activations which are not used after the Prefill call.
PiperOrigin-RevId: 645391387
2024-06-21 08:04:22 -07:00
RangerUFO
d7787c8f6c
Fix KV cache size calculation error
2024-06-21 13:06:26 +08:00
The gemma.cpp Authors
a85725614a
Refactor kCachePosSize and kCacheLayerSize into separate functors.
...
PiperOrigin-RevId: 645048519
2024-06-20 08:52:08 -07:00
Jan Wassenberg
48ebba8b7a
Code cleanup
...
- Simplify template arg list, enable deduction
- missing hn:: on " Lanes"
- 1.0f suffix
- move RMSNormBatched into ops.h
- static constexpr -> constexpr
- concrete type instead of LayerT, WeightArrayT
- inline GetWeights
- remove if (runtime_config.verbosity
- merge AllocatePrefill and AllocateDecode
- remove bf_ffw_hidden
PiperOrigin-RevId: 644931277
2024-06-20 01:10:24 -07:00
The gemma.cpp Authors
658fb3e506
Move test placeholder to a later pos.
...
PiperOrigin-RevId: 644808456
2024-06-19 13:24:10 -07:00
The gemma.cpp Authors
0e612d9a20
Split out common parts (embedder and transformer block) from Prefill() and Transformer() into separate functions.
...
PiperOrigin-RevId: 644455520
2024-06-18 11:24:56 -07:00
Jan Wassenberg
15135f5b3d
Simplify Attention.
...
Shared kMHA, reuse from Activations,
inline Attn lambda, use QDim as the stride between successive Q.
PiperOrigin-RevId: 644343854
2024-06-18 05:08:12 -07:00
Jan Wassenberg
a07f60c9a1
1.15x 7b sfp prefill speedup: Matmul in attention
...
2b bf16:
prefill 114.456 -> 115.222
decode 16.8847 -> 16.9987
7b sfp:
prefill 18.8575 -> 21.7325
decode 5.68428 -> 5.79791
PiperOrigin-RevId: 644283676
2024-06-18 01:00:51 -07:00
Jan Wassenberg
7d0720675f
Move raw_weights into separate header, used mainly by compress_weights.
...
Fix warnings in backprop/* (include)
PiperOrigin-RevId: 643983136
2024-06-17 06:17:02 -07:00
Jan Wassenberg
29c0c574e6
Integrate matmul into FFW: 4.3x prefill speedup
...
```
before, bf16:
27.2929 prefill tokens / sec
17.2114 tokens / sec
after, bf16
116.496 prefill tokens / sec
17.5391 tokens / sec
```
PiperOrigin-RevId: 643328437
2024-06-14 06:32:26 -07:00
Jan Wassenberg
d3c6a45b59
Major duplicated code reduction in test/benchmarks
...
Helper functions to tokenize/wrap
Move LayersOutputFunc into RuntimeConfig
AcceptFunc passes the probability
Implement StringFromType using the parser, and verify results match
PiperOrigin-RevId: 643255119
2024-06-14 00:16:25 -07:00
Daniel Keysers
6e67a6d8a9
Tiny cleanup: distinguish between "ids" and "pieces" in argument names when encoding.
...
PiperOrigin-RevId: 642614278
2024-06-12 07:52:13 -07:00
Daniel Keysers
1ac9857014
Extends Transformer() to prepare for batched processing.
...
PiperOrigin-RevId: 642603025
2024-06-12 07:01:03 -07:00
Jan Wassenberg
3e2396f98c
Use Loader/AppArgs to construct gemma_test model, simplify AcceptFunc
...
accept_token: allow default, check if empty when using
allow mixing sample_func and stream_func, call the latter after the former
Also fix missing includes/deps.
PiperOrigin-RevId: 642240012
2024-06-11 05:53:10 -07:00
Daniel Keysers
c557ad23a8
Adds simple-loop versions of missing batched functions.
...
PiperOrigin-RevId: 642189741
2024-06-11 02:14:02 -07:00
Copybara-Service
49d814b519
Merge pull request #224 from szabadka:cleanup
...
PiperOrigin-RevId: 641922102
2024-06-10 09:11:13 -07:00
Jan Wassenberg
c1c6714ad4
Internal experiment
...
PiperOrigin-RevId: 641915024
2024-06-10 08:46:10 -07:00
Zoltan Szabadka
a3a75b77f9
Use CompressedWeights<TConfig<float>> in backpropagation.
...
kWeightsAreCompressed are removed and LoadRawWeights is moved
to compress_weights.cc
2024-06-10 14:34:24 +00:00
Jan Wassenberg
36e6915e18
Add CPU output, error if not C++17, simplify tokenizer ctor
...
PiperOrigin-RevId: 641850879
2024-06-10 04:01:11 -07:00
Jan Wassenberg
f9b390b134
Support all weight types in a single binary.
...
This changes the command line flags, but the default value retains the previous behavior.
Also add a CreateGemma helper to enable extra args without interface changes.
PiperOrigin-RevId: 641266411
2024-06-07 09:04:45 -07:00
Zoltan Szabadka
465998d25a
Add support for custom sampling function to runtime config.
...
With this addition the ComputeCrossEntropy function can be moved
to its own library, because now we can compute it using only the
public API functions from gemma.h
2024-06-07 11:45:07 +00:00
Zoltan Szabadka
c004799cdc
Add Adam optimizer.
...
Drive-by: Fix compilation errors and tests for backprop functions.
2024-06-06 18:41:36 +00:00
Jan Wassenberg
57c2cd8b52
Simplifications: remove GemmaInterface and GemmaImpl
...
Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc
PiperOrigin-RevId: 640869202
2024-06-06 05:54:21 -07:00
Jan Wassenberg
5c3e5f7038
Remove no longer required stats.h - use Highway version instead
...
PiperOrigin-RevId: 640440379
2024-06-05 01:37:48 -07:00
Zoltan Szabadka
3b4fa4a0e3
Use HWY_EXPORT_AND_DYNAMIC_DISPATCH_T where possible.
2024-06-04 09:18:56 +00:00
Zoltan Szabadka
36e4d8bbfe
Add first version of backpropagation support.
...
This is still in progress / experimental, currently it is only
implemented for normal gemma MQA attention layers, and no
parallelism is added yet for backward pass.
Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
2024-06-04 08:37:49 +00:00
Paul Chang
ed8f39c058
Refactor GemmaImpl dispatch to use Highway 1.2's HWY_DYNAMIC_DISPATCH_T
...
PiperOrigin-RevId: 639793810
2024-06-03 08:32:29 -07:00
Paul Chang
419dc34ed5
Generic MHA/MQA/GQA implementation
...
PiperOrigin-RevId: 636937885
2024-05-24 09:05:53 -07:00
Apoorv Reddy
8e641eb4cd
Add TTFT to TimingInfo
...
PiperOrigin-RevId: 634378994
2024-05-16 07:16:53 -07:00
Apoorv Reddy
eb0b96e0a8
Pass most runtime parameters using const RuntimeConfig&
...
PiperOrigin-RevId: 633572507
2024-05-14 07:04:53 -07:00
Apoorv Reddy
f1eab987d8
Store tokens/sec in auxiliary struct TimingInfo.
...
PiperOrigin-RevId: 633108908
2024-05-13 00:04:19 -07:00
Paul Chang
bacba351d4
Support additional scaling
...
PiperOrigin-RevId: 631429113
2024-05-07 08:16:25 -07:00
Jan Wassenberg
f6d02b2870
Fix RecurrentGemma (refs #166 ) - one Dot was ignoring scale.
...
Remove extra Dot() overload
MatVecAdd always adds, use MatVecT<kAdd> if conditional.
Remove ununsed MatVecAddLoop and MatVecLoop
No longer tsan-verify even_odd
PiperOrigin-RevId: 631377279
2024-05-07 04:40:42 -07:00
Zoltan Szabadka
19017fdb6d
Fix expression in DASSERT()
2024-05-03 13:54:20 +00:00