Commit Graph

6 Commits

Krzysztof Rymski f56d18dd68 Improvements to inference using int8-compressed KVs
Multiplication is done with int16*int16 multiplication instructions, avoiding the expensive conversion to f32/bf16.
2x speedup on Zen 3.

PiperOrigin-RevId: 888690192
2026-03-24 08:51:30 -07:00
Jan Wassenberg 1dedcfd50d Warning fix: cast enum for HWY_ABORT %d
PiperOrigin-RevId: 886242788
2026-03-19 10:11:17 -07:00
Krzysztof Rymski 197c1a049c Fix int8
PiperOrigin-RevId: 882611833
2026-03-12 08:43:18 -07:00
Krzysztof Rymski 029cfd0b33 Int8 + microscaling support for kv cache formats.
For now, multiplication is done by converting to the corresponding float format.
Can yield up to 2x improvement for memory-bandwidth-constrained shapes.

PiperOrigin-RevId: 880748493
2026-03-09 02:50:08 -07:00
Krzysztof Rymski bdba3bfa63 Remove const to fix Windows builds
PiperOrigin-RevId: 876232691
2026-02-27 06:56:54 -08:00
Krzysztof Rymski df162ead7c Implementation of tiled attention with bf16 and circular buffers, which reduces memory requirements by 4x at longer contexts on Gemma models.
It also supports better parallelism for small batch sizes / small models,
and can utilize VDPBF16PS for a nice 2x improvement on AVX-512.

PiperOrigin-RevId: 874517319
2026-02-24 03:26:49 -08:00