Krzysztof Rymski
8a5e37eeb7
Updates tests to use the kv_transcoding library to reduce their code size.
PiperOrigin-RevId: 888600365
2026-03-24 05:06:01 -07:00
Ray Smith
bea8b1cdbd
Replaced attention in ViT with flash attention: 8x speedup of the image tokenizer on AMD.
PiperOrigin-RevId: 880877209
2026-03-09 08:46:04 -07:00
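The commit above swaps the ViT attention for flash attention. As a rough illustration of what "flash" means here, the sketch below contrasts a naive attention row (materialize all scores, then softmax) with a single-pass online-softmax version that keeps only a running max, a running denominator and one output accumulator. The function names and the row-major layout are hypothetical; this is a scalar sketch of the general technique, not gemma.cpp's kernel.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Reference: materializes the full score row, then applies softmax.
// q has d floats; k and v hold n rows of d floats each (row-major).
std::vector<float> NaiveAttentionRow(const std::vector<float>& q,
                                     const std::vector<float>& k,
                                     const std::vector<float>& v,
                                     std::size_t n, std::size_t d) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  std::vector<float> scores(n);
  float max_score = -std::numeric_limits<float>::infinity();
  for (std::size_t i = 0; i < n; ++i) {
    float s = 0.0f;
    for (std::size_t j = 0; j < d; ++j) s += q[j] * k[i * d + j];
    scores[i] = s * scale;
    max_score = std::max(max_score, scores[i]);
  }
  float sum = 0.0f;
  for (float& s : scores) {
    s = std::exp(s - max_score);
    sum += s;
  }
  std::vector<float> out(d, 0.0f);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < d; ++j) out[j] += scores[i] * v[i * d + j];
  for (float& x : out) x /= sum;
  return out;
}

// Flash-style: one pass over the keys with an online softmax. Whenever a new
// maximum appears, the previous denominator and accumulator are rescaled.
std::vector<float> FlashAttentionRow(const std::vector<float>& q,
                                     const std::vector<float>& k,
                                     const std::vector<float>& v,
                                     std::size_t n, std::size_t d) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  float running_max = -std::numeric_limits<float>::infinity();
  float running_sum = 0.0f;
  std::vector<float> acc(d, 0.0f);
  for (std::size_t i = 0; i < n; ++i) {
    float s = 0.0f;
    for (std::size_t j = 0; j < d; ++j) s += q[j] * k[i * d + j];
    s *= scale;
    const float new_max = std::max(running_max, s);
    // exp(-inf) == 0, so the first step simply discards the empty state.
    const float correction = std::exp(running_max - new_max);
    const float p = std::exp(s - new_max);
    running_sum = running_sum * correction + p;
    for (std::size_t j = 0; j < d; ++j)
      acc[j] = acc[j] * correction + p * v[i * d + j];
    running_max = new_max;
  }
  for (float& x : acc) x /= running_sum;
  return acc;
}
```

Both functions should agree up to float rounding; the flash version never stores the n-element score row, which is what lets real kernels tile over keys and keep the working set in registers or cache.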
Krzysztof Rymski
029cfd0b33
Int8 + microscaling support for kv cache formats.
For now, multiplication is done by converting to the corresponding float format.
This can yield up to a 2x improvement for memory-bandwidth-constrained shapes.
PiperOrigin-RevId: 880748493
2026-03-09 02:50:08 -07:00
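The int8 + microscaling commit notes that multiplication currently converts to float. A minimal sketch of that idea, assuming an MX-style layout (blocks of 32 int8 values sharing one power-of-two scale, similar to MXINT8 in the OCP Microscaling spec); the block size, the `Int8Block` struct and both function names are hypothetical and not taken from gemma.cpp:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: MX-style blocks of 32 int8 values sharing one power-of-two scale.
constexpr std::size_t kBlockSize = 32;

struct Int8Block {
  std::int8_t exponent;            // block scale is 2^exponent
  std::int8_t values[kBlockSize];  // element ~= values[i] * 2^exponent
};

// Quantizes x (size must be a multiple of kBlockSize) into scaled int8 blocks.
std::vector<Int8Block> QuantizeInt8(const std::vector<float>& x) {
  std::vector<Int8Block> blocks(x.size() / kBlockSize);
  for (std::size_t b = 0; b < blocks.size(); ++b) {
    const float* src = &x[b * kBlockSize];
    float max_abs = 0.0f;
    for (std::size_t i = 0; i < kBlockSize; ++i)
      max_abs = std::max(max_abs, std::fabs(src[i]));
    // Smallest power of two that maps max_abs into [-127, 127].
    int e = max_abs > 0.0f
                ? static_cast<int>(std::ceil(std::log2(max_abs / 127.0f)))
                : 0;
    e = std::clamp(e, -127, 127);
    blocks[b].exponent = static_cast<std::int8_t>(e);
    for (std::size_t i = 0; i < kBlockSize; ++i) {
      const float q = std::round(std::ldexp(src[i], -e));
      blocks[b].values[i] =
          static_cast<std::int8_t>(std::clamp(q, -127.0f, 127.0f));
    }
  }
  return blocks;
}

// Dot product of a float query with a quantized row: each int8 element is
// converted to float, products are summed per block, then the shared block
// scale is applied once per block.
float DotInt8(const std::vector<float>& q, const std::vector<Int8Block>& row) {
  float sum = 0.0f;
  for (std::size_t b = 0; b < row.size(); ++b) {
    float block_sum = 0.0f;
    for (std::size_t i = 0; i < kBlockSize; ++i)
      block_sum += q[b * kBlockSize + i] * static_cast<float>(row[b].values[i]);
    sum += std::ldexp(block_sum, row[b].exponent);
  }
  return sum;
}
```

Storing one byte per element plus one exponent byte per 32 elements roughly halves the bytes read compared with a 16-bit cache, which is consistent with the commit's point that the gains show up on memory-bandwidth-constrained shapes.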
Ray Smith
49cb438b1e
Rollback of erroneous rollback.
PiperOrigin-RevId: 877376165
2026-03-02 06:50:26 -08:00
The gemma.cpp Authors
a3d994915f
No public description
PiperOrigin-RevId: 877333188
2026-03-02 04:32:29 -08:00
Ray Smith
16c1b29b89
Rewrote flash attention to use BF16 and transposed K and V, rewrote the task distribution, increased parallelism on decode, and doubled the number of registers used in the core of flash attention.
PiperOrigin-RevId: 877308306
2026-03-02 03:11:01 -08:00
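For readers unfamiliar with BF16, the sketch below shows the format at the bit level: a float is narrowed to its upper 16 bits (here with round-to-nearest-even, one common choice), widening back is exact, and products are accumulated in float. `DotBF16` consumes elements in pairs, loosely modeling what one 32-bit lane of VDPBF16PS computes; the real instruction's denormal and rounding behavior is not modeled. The `bf16_t` alias and all function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

using bf16_t = std::uint16_t;  // hypothetical alias: raw BF16 bit pattern

// float -> BF16 with round-to-nearest-even on the 16 dropped mantissa bits.
bf16_t FloatToBF16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  if (std::isnan(f)) {
    return static_cast<bf16_t>((bits >> 16) | 0x0040u);  // stay a quiet NaN
  }
  const std::uint32_t bias = 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<bf16_t>((bits + bias) >> 16);
}

// BF16 -> float is exact: the 16 bits become the high half of a float.
float BF16ToFloat(bf16_t h) {
  const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// BF16 dot product accumulated in float, two elements per step.
float DotBF16(const bf16_t* a, const bf16_t* b, std::size_t n) {
  float acc = 0.0f;
  std::size_t i = 0;
  for (; i + 1 < n; i += 2) {
    acc += BF16ToFloat(a[i]) * BF16ToFloat(b[i]) +
           BF16ToFloat(a[i + 1]) * BF16ToFloat(b[i + 1]);
  }
  if (i < n) acc += BF16ToFloat(a[i]) * BF16ToFloat(b[i]);  // odd tail
  return acc;
}
```

Halving the element size relative to float halves the bytes moved for K and V, while the float accumulator keeps the summation precision of the attention scores.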
Krzysztof Rymski
df162ead7c
Implementation of tiled attention with BF16 and circular buffers, which reduces memory requirements by 4x for longer contexts on Gemma models.
It also provides better parallelism for small batch sizes and small models.
It can also use VDPBF16PS for a 2x improvement on AVX-512.
PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00
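One way a circular buffer can cap KV-cache memory is for attention that only looks back over a fixed window: position `p` is stored in slot `p % window`, so storage stays at `window * dim` floats no matter how long the context grows. The sketch below assumes that use case; the class, its methods and the sliding-window interpretation are illustrative assumptions, not a description of the commit's actual data structure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical ring buffer of K (or V) rows for the last `window` positions.
class CircularKVBuffer {
 public:
  CircularKVBuffer(std::size_t window, std::size_t dim)
      : window_(window), dim_(dim), data_(window * dim) {
    assert(window > 0 && dim > 0);
  }

  // Stores the row for the next position, overwriting the oldest when full.
  void Append(const std::vector<float>& row) {
    assert(row.size() == dim_);
    const std::size_t base = (next_pos_ % window_) * dim_;
    for (std::size_t j = 0; j < dim_; ++j) data_[base + j] = row[j];
    ++next_pos_;
  }

  // Absolute positions still held are [FirstPos(), EndPos()).
  std::size_t FirstPos() const {
    return next_pos_ > window_ ? next_pos_ - window_ : 0;
  }
  std::size_t EndPos() const { return next_pos_; }

  // Row for absolute position pos, which must be in [FirstPos(), EndPos()).
  const float* Row(std::size_t pos) const {
    assert(pos >= FirstPos() && pos < EndPos());
    return &data_[(pos % window_) * dim_];
  }

 private:
  std::size_t window_;
  std::size_t dim_;
  std::vector<float> data_;
  std::size_t next_pos_ = 0;
};
```

For example, with a window of 3, appending positions 0 through 4 leaves positions 2, 3 and 4 readable, and position 3 has overwritten the slot that held position 0.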