gemma.cpp

Author	SHA1	Message	Date
Krzysztof Rymski	8a5e37eeb7	Updates to tests to use the kv_transcoding library to reduce their code size PiperOrigin-RevId: 888600365	2026-03-24 05:06:01 -07:00
Ray Smith	bea8b1cdbd	Replaced attention in ViT with flash - 8x speedup of image tokenizer on AMD PiperOrigin-RevId: 880877209	2026-03-09 08:46:04 -07:00
Krzysztof Rymski	029cfd0b33	Int8 + microscaling support for kv cache formats. Right now multiplication is done by converting to the corresponding float format. Can yield up to 2x improvements for memory-bandwidth-constrained shapes PiperOrigin-RevId: 880748493	2026-03-09 02:50:08 -07:00
Ray Smith	49cb438b1e	Rollback of erroneous rollback. PiperOrigin-RevId: 877376165	2026-03-02 06:50:26 -08:00
The gemma.cpp Authors	a3d994915f	No public description PiperOrigin-RevId: 877333188	2026-03-02 04:32:29 -08:00
Ray Smith	16c1b29b89	Rewrote flash attention to use BF16, transposed k and v, rewrote the task distribution, increased parallelism on decode, and used double the registers for the core of flash attention. PiperOrigin-RevId: 877308306	2026-03-02 03:11:01 -08:00
Krzysztof Rymski	df162ead7c	Implementation of tiled attention with bf16 and circular buffers, which reduces memory requirements by 4x for longer contexts on gemma models. It also supports better parallelism for small batch sizes / small models, and can utilize VDPBF16PS for a nice 2x improvement on avx512 PiperOrigin-RevId: 874517319	2026-02-24 03:26:49 -08:00