Commit Graph

6 Commits

Krzysztof Rymski f56d18dd68 Improvements to inference using int8-compressed KVs
Multiplication is done with int16*int16 multiplication instructions, avoiding the expensive conversion to f32/bf16.
2x speedup on Zen 3.

PiperOrigin-RevId: 888690192
2026-03-24 08:51:30 -07:00
Jan Wassenberg 1dedcfd50d Warning fix: cast enum for HWY_ABORT %d
PiperOrigin-RevId: 886242788
2026-03-19 10:11:17 -07:00
Krzysztof Rymski 197c1a049c Fix int8
PiperOrigin-RevId: 882611833
2026-03-12 08:43:18 -07:00
Krzysztof Rymski 029cfd0b33 Int8 + microscaling support for kv cache formats.
For now, multiplication is done by converting to the corresponding float format.
Can yield up to 2x improvement for memory-bandwidth-constrained shapes.

PiperOrigin-RevId: 880748493
2026-03-09 02:50:08 -07:00
Krzysztof Rymski bdba3bfa63 Remove const to fix Windows builds
PiperOrigin-RevId: 876232691
2026-02-27 06:56:54 -08:00
Krzysztof Rymski df162ead7c Implementation of tiled attention with bf16 and circular buffers, which reduces memory requirements by 4x at longer contexts on Gemma models.
It also supports better parallelism for small batch sizes / small models,
and can utilize VDPBF16PS for a nice 2x improvement on AVX-512.

PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00