llama.cpp

Progeny Alpha 88396c3923 vulkan: optimize chunked intra kernel barrier and bank conflicts	2026-03-14 22:48:11 -04:00

Remove unnecessary barrier after A-matrix dot product writes. Each thread writes only to its own row; s_A isn't read cross-thread until forward substitution. Cuts A-matrix barriers from 128 to 65 (one per broadcast + one before forward sub).

Pad s_A stride from 64 to 65 to eliminate bank conflicts in the W/U accumulation phase where all active threads read A(tid, j) with the same j value.

GDN per-op: 5205 → 5136 µs. Combined with inter fusion: 6818 → 5136 µs (-24.7%). 16/16 tests pass.
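
The stride padding described in the commit message can be illustrated with a small model of shared-memory banking. This is only a sketch under stated assumptions, not the shader itself: it assumes 32 banks of one 4-byte word each, 32 threads reading at once, and row-major indexing `tid * stride + j`. The function name `banks_touched` is hypothetical, and real hardware may use a different bank count.

```python
def banks_touched(stride, j, num_threads=32, num_banks=32):
    """Return the set of banks hit when each thread tid reads A(tid, j).

    Assumes element A(tid, j) lives at word offset tid * stride + j and
    that consecutive words map to consecutive banks (an assumption about
    the hardware, not something stated in the commit).
    """
    return {(tid * stride + j) % num_banks for tid in range(num_threads)}


for stride in (64, 65):
    # Worst case over all 64 columns: fewer distinct banks means more conflicts.
    worst = min(len(banks_touched(stride, j)) for j in range(64))
    print(f"stride {stride}: threads spread over {worst} bank(s)")
```

Under these assumptions, a stride of 64 puts every thread's read of column j in the same bank, because 64 is a multiple of 32. A stride of 65 shifts each row by one bank, so the 32 reads land in 32 different banks.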
cmake	ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094)	2025-08-07 13:45:41 +02:00
include	llama : enable chunked fused GDN path (#20340)	2026-03-11 22:46:40 +02:00
src	vulkan: optimize chunked intra kernel barrier and bank conflicts	2026-03-14 22:48:11 -04:00
.gitignore	…
CMakeLists.txt	ggml : fix typo gmml (#20512)	2026-03-13 14:36:13 +01:00