llama.cpp

History

Progeny Alpha b0323615c9 vulkan: fused inter+output kernel for chunked GDN	2026-03-13 21:45:42 -04:00

Merge the inter-chunk state propagation and output computation into a single dispatch, reducing the chunked pipeline from 3 dispatches to 2. State lives in registers across the sequential chunk loop. vnew is computed in-kernel and passed to the coopmat GEMM via shared memory (f16, packed with subgroup shuffles).

This eliminates the VNew scratch buffer (wu_size) and H_snapshots buffer (h_size), saving ~786KB/head/seq for PP-512.

Architecture per chunk:
  Step 1: Load K, Q, gcum → shared (all 256 threads)
  Step 2: Q@K^T coopmat → sh_attn (all 256 threads)
  Step 3: Decay mask + O_inter = Q@state → dst (parallel)
  Step 4: vnew = U - W@state → sh_kv (128 threads + k_gated assist)
  Step 5: O_intra = A_decayed @ vnew coopmat GEMM → dst
  Step 6: state = exp(decay) * state + delta

Shared memory: 63,744 / 65,536 bytes. 16/16 backend tests pass.
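
The per-chunk data flow in Steps 3 to 6 can be sketched on the CPU as below. This is only an illustrative reference loop, not the Vulkan shader: the function name `fused_chunk_loop`, the dictionary keys, and the tensor shapes are hypothetical, the decay factors are assumed to be pre-applied to `q`, `a`, and `k_gated`, and `delta = k_gated^T @ vnew` is an assumption, since the commit message does not define `delta`. What it shows is why a fused loop needs neither a VNew scratch buffer nor per-chunk state snapshots: `vnew` and `state` are consumed inside the same iteration that produces them.

```python
import math


def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(r) for r in zip(*m)]


def fused_chunk_loop(chunks, d_k, d_v):
    """Process chunks in order, keeping the recurrent state in one local variable.

    Each chunk is a dict with hypothetical keys (C = chunk length):
      q        [C x d_k]  queries, assumed already scaled by the in-chunk decay
      w        [C x d_k]  W matrix from the earlier dispatch
      u        [C x d_v]  U matrix from the earlier dispatch
      a        [C x C]    Q@K^T with the decay mask applied (A_decayed)
      k_gated  [C x d_k]  keys scaled for the state update (assumed form of delta)
      g_total  float      total log-decay accumulated over the chunk
    """
    state = [[0.0] * d_v for _ in range(d_k)]  # stands in for the register-resident state
    outputs = []
    for ch in chunks:
        # Step 3: inter-chunk contribution from the state carried in.
        o_inter = matmul(ch["q"], state)
        # Step 4: vnew = U - W@state, used immediately and never stored globally.
        ws = matmul(ch["w"], state)
        vnew = [[u - s for u, s in zip(ur, sr)] for ur, sr in zip(ch["u"], ws)]
        # Step 5: intra-chunk contribution.
        o_intra = matmul(ch["a"], vnew)
        outputs.append([[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(o_inter, o_intra)])
        # Step 6: state = exp(decay) * state + delta (delta form is an assumption).
        delta = matmul(transpose(ch["k_gated"]), vnew)
        decay = math.exp(ch["g_total"])
        state = [[decay * s + d for s, d in zip(sr, dr)] for sr, dr in zip(state, delta)]
    return outputs, state


# Tiny usage example: two identical chunks of length 1 with d_k = d_v = 1.
chunk = {"q": [[1.0]], "w": [[1.0]], "u": [[2.0]], "a": [[1.0]],
         "k_gated": [[1.0]], "g_total": 0.0}
outs, final_state = fused_chunk_loop([chunk, chunk], d_k=1, d_v=1)
print(outs, final_state)
```

In the second iteration `w @ state` already equals `u`, so `vnew` becomes zero and the state stops changing, which is the delta-rule behaviour the sketch is meant to make visible.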
cmake	ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094)	2025-08-07 13:45:41 +02:00
include	llama : enable chunked fused GDN path (#20340)	2026-03-11 22:46:40 +02:00
src	vulkan: fused inter+output kernel for chunked GDN	2026-03-13 21:45:42 -04:00
.gitignore	vulkan : cmake integration (#8119)	2024-07-13 18:12:39 +02:00
CMakeLists.txt	ggml : fix typo gmml (#20512)	2026-03-13 14:36:13 +01:00