llama.cpp/ggml
Progeny Alpha 88396c3923 vulkan: optimize chunked intra kernel barrier and bank conflicts
Remove the unnecessary barrier after the A-matrix dot-product writes.
Each thread writes only to its own row, and s_A isn't read cross-thread
until forward substitution, so no synchronization is needed at that
point. This cuts A-matrix barriers from 128 to 65 (one per broadcast
plus one before forward substitution).
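
The barrier arithmetic above can be checked with a small sketch. It assumes 64 broadcast steps (inferred from the s_A stride of 64, not stated in the commit) and that the old loop issued two barriers per step; the function names are hypothetical and do not come from the shader.

```python
CHUNK_STEPS = 64  # assumption: one broadcast step per column of a 64-wide chunk


def barriers_before(steps=CHUNK_STEPS):
    """Old scheme (assumed): one barrier per broadcast plus one after each dot-product write."""
    return 2 * steps


def barriers_after(steps=CHUNK_STEPS):
    """New scheme: one barrier per broadcast, plus a single barrier before forward substitution."""
    return steps + 1


print(barriers_before(), "->", barriers_after())
```

Under these assumptions the counts match the 128 and 65 quoted in the commit message.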

Pad s_A stride from 64 to 65 to eliminate bank conflicts in the W/U
accumulation phase where all active threads read A(tid, j) with the
same j value.
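
A minimal model of why the padding helps, assuming 32 shared-memory banks of one 4-byte word each and 32 threads accessing memory together (both are assumptions about the hardware, not values taken from the commit). The helper names `bank_of` and `worst_conflict` are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32    # assumption: 32 banks, one 4-byte word per bank
NUM_THREADS = 32  # assumption: threads that access shared memory together


def bank_of(tid, j, stride, num_banks=NUM_BANKS):
    """Bank holding A(tid, j) in a row-major array with the given row stride (in words)."""
    return (tid * stride + j) % num_banks


def worst_conflict(stride, j, num_threads=NUM_THREADS, num_banks=NUM_BANKS):
    """Largest number of threads hitting one bank when every thread reads A(tid, j)."""
    hits = Counter(bank_of(tid, j, stride, num_banks) for tid in range(num_threads))
    return max(hits.values())


for stride in (64, 65):
    worst = max(worst_conflict(stride, j) for j in range(64))
    print(f"stride {stride}: worst case is {worst} thread(s) per bank")
```

With a stride of 64, every row starts in the same bank, so all threads reading column j collide. With a stride of 65, consecutive rows are shifted by one bank and the same read spreads across distinct banks in this model.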

GDN per-op: 5205 → 5136 µs (-1.3%). Combined with inter fusion:
6818 → 5136 µs (-24.7%). 16/16 tests pass.
2026-03-14 22:48:11 -04:00
..
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include llama : enable chunked fused GDN path (#20340) 2026-03-11 22:46:40 +02:00
src vulkan: optimize chunked intra kernel barrier and bank conflicts 2026-03-14 22:48:11 -04:00
.gitignore
CMakeLists.txt ggml : fix typo gmml (#20512) 2026-03-13 14:36:13 +01:00