The gemma.cpp Authors
27258b03e6
Improve performance logging
...
PiperOrigin-RevId: 660534330
2024-08-07 14:15:43 -07:00
Jan Wassenberg
5e433e774a
1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism.
...
Limit thread counts to the number detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.
PiperOrigin-RevId: 659755311
2024-08-05 18:50:09 -07:00
Jan Wassenberg
a24eda8d02
Split matmul into matvec; add large matrix benchmark
...
Rename var names to row/col for more clarity.
Better estimate error tolerance via max abs col sum.
PiperOrigin-RevId: 657601791
2024-07-30 08:29:11 -07:00
Paul Chang
d37c088e44
Extend LayersOutputFunc to take query index and auxiliary int
...
PiperOrigin-RevId: 657574814
2024-07-30 06:53:56 -07:00
Jan Wassenberg
8b4915f321
Fix Windows build - macro conflict with param name
...
PiperOrigin-RevId: 657518587
2024-07-30 03:22:32 -07:00
Jan Wassenberg
6ea4232b2e
MatMul cleanup: Mat struct, simplify args.
...
Add large benchmark to test, use 4 threads, skip some targets.
Also use Traits::Name instead of typeid.
PiperOrigin-RevId: 657496185
2024-07-30 01:55:50 -07:00
Jan Wassenberg
f27683152c
1.05x prefill speedup: matvec -> matmul for !MHA
...
Also add C_stride and make shape normal non-template arguments.
PiperOrigin-RevId: 657285945
2024-07-29 12:18:06 -07:00
Jan Wassenberg
2721f54446
Add offset arg to MatMul, rename; use MatMul for logits: ~1.1x decode speedup
...
PiperOrigin-RevId: 657167257
2024-07-29 05:34:26 -07:00
Jan Wassenberg
aaf51898b6
Major revamp #2 of Prefill: fix token order, parallel for multi-query
...
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.
PiperOrigin-RevId: 655893197
2024-07-25 03:28:55 -07:00
Daniel Keysers
2346b5a434
Minor polishing: adding comments, renaming variables.
...
PiperOrigin-RevId: 655235006
2024-07-23 11:17:44 -07:00
Daniel Keysers
33334ad454
Fix msan uninitialized scale in optimize_test
...
PiperOrigin-RevId: 654817460
2024-07-22 10:50:25 -07:00
Jan Wassenberg
85cac13fb1
Split up ops.h into ops/ops-inl and matmul-inl
...
PiperOrigin-RevId: 654068303
2024-07-19 11:21:48 -07:00
Jan Wassenberg
5844e6a1e5
Cleanup: add wrapper functions and rename vars to interleaved
...
Simplifies the TransformerLayer function.
Use interleaved* instead of _and_queries.
PiperOrigin-RevId: 653929449
2024-07-19 02:04:11 -07:00
Jan Wassenberg
12016d31c3
Major Prefill/Generate cleanup, 1.3x Prefill speedup
...
This fixes TTFT, which was not including prefill.
PiperOrigin-RevId: 653690626
2024-07-18 11:16:46 -07:00
Daniel Keysers
e87e65ca45
Add scale parameter to MatMul.
...
Add accessor to CompressedArray that asserts the scale is 1 and use it.
PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
Jan Wassenberg
992a2cbbc0
De-templatize Activations, add RowVectorBatch class
...
Also remove most kBatchSize args.
PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
Daniel Keysers
ff34370aac
Simplify FFW by using MatMul_4x4_Batch_Add.
...
Affects only the Griffin model, where prefill TPS improves by about 70%.
PiperOrigin-RevId: 652878176
2024-07-16 09:41:23 -07:00
Kan Wu
f519ab6693
Refactor configurables.
...
PiperOrigin-RevId: 651259154
2024-07-10 21:30:58 -07:00
Daniel Keysers
063bbaa683
Add more comments to attention computation (and some small restructuring).
...
PiperOrigin-RevId: 650929097
2024-07-10 02:39:07 -07:00
Jan Wassenberg
c7c3daa624
7x compile time speedup: shard gemma.cc
...
Use overloaded functions defined in gemma/instantiations.
Also split out activations.h.
PiperOrigin-RevId: 649053122
2024-07-03 06:35:04 -07:00