Zelalem Aweke
9e213b3d96
Use system topology to pin threads across clusters.
...
PiperOrigin-RevId: 640151974
2024-06-04 07:50:32 -07:00
Jan Wassenberg
a44cbdadc2
Update to Highway 1.2 for topology/VQSelect
...
Also fix unused-warning in compress-inl.
PiperOrigin-RevId: 639116915
2024-05-31 12:29:10 -07:00
Paul Chang
82623bdc7f
Refer to --weights rather than --compressed_weights to simplify CLI docs
...
PiperOrigin-RevId: 634391135
2024-05-16 07:51:49 -07:00
Apoorv Reddy
8e641eb4cd
Add TTFT to TimingInfo
...
PiperOrigin-RevId: 634378994
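For illustration, a timing struct that carries TTFT (time to first token) next to the tokens/sec figures could look like the sketch below. This is an assumption-laden example, not the actual gemma.cpp definition: the field names, the ComputeTiming helper, and the use of plain seconds as timestamps are all hypothetical.

```cpp
#include <cstddef>

// Hypothetical layout for illustration only; the real TimingInfo in
// gemma.cpp may use different field names and units.
struct TimingInfo {
  double prefill_tok_sec = 0.0;      // prompt tokens processed per second
  double gen_tok_sec = 0.0;          // generated tokens per second
  double time_to_first_token = 0.0;  // seconds from start to first token
};

// Hypothetical helper: derives the three metrics from timestamps given in
// seconds. Zero-length intervals leave the corresponding rate at 0.
TimingInfo ComputeTiming(double start, double prefill_end, double first_token,
                         double gen_end, size_t prompt_tokens,
                         size_t generated_tokens) {
  TimingInfo info;
  const double prefill_secs = prefill_end - start;
  const double gen_secs = gen_end - prefill_end;
  if (prefill_secs > 0.0) {
    info.prefill_tok_sec = static_cast<double>(prompt_tokens) / prefill_secs;
  }
  if (gen_secs > 0.0) {
    info.gen_tok_sec = static_cast<double>(generated_tokens) / gen_secs;
  }
  info.time_to_first_token = first_token - start;
  return info;
}
```

Returning the metrics in one struct keeps the generation call's signature stable when a new measurement such as TTFT is added.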
2024-05-16 07:16:53 -07:00
Apoorv Reddy
eb0b96e0a8
Pass most runtime parameters using const RuntimeConfig&
...
PiperOrigin-RevId: 633572507
2024-05-14 07:04:53 -07:00
Apoorv Reddy
f1eab987d8
Store tokens/sec in auxiliary struct TimingInfo.
...
PiperOrigin-RevId: 633108908
2024-05-13 00:04:19 -07:00
Zoltan Szabadka
27117cc39f
Simplify threading: remove the use of inner_pool.
...
We only used inner_pool in the prefill FFW function, and there we
can achieve sufficient parallelism on the rows of the matrix-vector
multiplications.
Benchmark results on a 1600-token summarization task:
```
Prefill speed
Num threads   BEFORE      AFTER
 4             9.24 t/s    9.76 t/s
18            31.41 t/s   31.16 t/s
32            31.41 t/s   45.13 t/s
64            31.03 t/s   57.85 t/s
```
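The row-level parallelism described above can be sketched as follows. This is a minimal illustration using std::thread, not the code from this commit: MatVecRowParallel is a hypothetical helper, and the actual implementation presumably uses the project's own thread pool rather than spawning threads per call.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not the gemma.cpp API: computes out = mat * vec,
// where mat is row-major with rows x cols entries. Each worker handles a
// contiguous range of rows, so no synchronization on `out` is needed.
std::vector<float> MatVecRowParallel(const std::vector<float>& mat,
                                     const std::vector<float>& vec,
                                     size_t rows, size_t cols,
                                     size_t num_threads) {
  std::vector<float> out(rows, 0.0f);
  // Never use more workers than rows, and always at least one.
  num_threads = std::max<size_t>(1, std::min(num_threads, rows));
  const size_t rows_per_thread = (rows + num_threads - 1) / num_threads;

  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; ++t) {
    const size_t begin = t * rows_per_thread;
    const size_t end = std::min(rows, begin + rows_per_thread);
    if (begin >= end) break;  // no rows left for this worker
    workers.emplace_back([&mat, &vec, &out, cols, begin, end] {
      for (size_t r = begin; r < end; ++r) {
        float sum = 0.0f;
        for (size_t c = 0; c < cols; ++c) sum += mat[r * cols + c] * vec[c];
        out[r] = sum;  // each row is written by exactly one worker
      }
    });
  }
  for (std::thread& w : workers) w.join();
  return out;
}
```

With a single level of parallelism over rows, every available thread can take part in the same matrix-vector product, so no nested inner pool is needed.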
2024-04-29 16:07:30 +00:00
Jan Wassenberg
3bf22abb22
Fix sign comparison warnings
...
PiperOrigin-RevId: 627299902
2024-04-23 01:16:51 -07:00
Jan Wassenberg
a982ec1287
Move code to gemma/ so we can remove error-prone copybara: comments.
...
Also fix includes and Lint warnings.
PiperOrigin-RevId: 623127487
2024-04-09 04:45:42 -07:00