llama.cpp/ggml
Progeny Alpha 313ef74afe vulkan: add coopmat GEMM output kernel for chunked GDN
Add gated_delta_net_chunk_output_cm1.comp — a cooperative matrix variant
of the chunked output kernel that replaces the O(N²) scalar intra-chunk
loop with an f16 coopmat GEMM: A_decayed[64×64] @ vnew[64×128].

Kernel structure:
- Phase 1: Q@K^T via coopmat (unchanged from scalar variant)
- Phase 2a: Build causal decay mask → sh_adecay (f16, clamped)
- Phase 2b: Stage vnew into sh_kv (f16, pre-scaled by 1/√d)
- Pass 1: Inter-chunk Q@S → dst (scalar, 128 threads)
- Pass 2: Intra-chunk coopmat GEMM (full chunks) or scalar fallback
  (partial last chunk). 3 barriers total, 62.7KB shared memory.
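
As a rough host-side reference for what the phases above compute, here is a
NumPy sketch of one full chunk's output. This is an illustration of the chunked
gated-delta-net math, not the shader itself: the names `q`, `k`, `vnew`, `S`,
and the per-token log-decay `g` are assumptions, the inter-chunk state `S` is
assumed already normalized, and the 1/sqrt(d) scale is folded into the vnew
staging as Phase 2b describes.

```python
import numpy as np

C, D, DV = 64, 64, 128          # chunk length, head dim, value dim
rng = np.random.default_rng(0)
q    = rng.standard_normal((C, D)).astype(np.float32)
k    = rng.standard_normal((C, D)).astype(np.float32)
vnew = rng.standard_normal((C, DV)).astype(np.float32)
S    = rng.standard_normal((D, DV)).astype(np.float32)   # inter-chunk state (assumed pre-scaled)
g    = -np.abs(rng.standard_normal(C)).astype(np.float32) # hypothetical per-token log-decay
cum_g = np.cumsum(g)

# Phase 1: Q@K^T (the shader runs this through a coopmat GEMM)
attn = q @ k.T

# Phase 2a: causal decay mask -- token i sees j <= i, decayed by exp(cum_g[i] - cum_g[j])
decay = np.exp(cum_g[:, None] - cum_g[None, :])
a_decayed = attn * decay * np.tril(np.ones((C, C), dtype=np.float32))

# Phase 2b: stage vnew pre-scaled by 1/sqrt(d), matching the commit's description
vnew_scaled = vnew * (1.0 / np.sqrt(D))

# Pass 1: inter-chunk contribution Q@S, weighted by each token's cumulative decay
o = (q * np.exp(cum_g)[:, None]) @ S

# Pass 2: intra-chunk GEMM A_decayed[64x64] @ vnew[64x128]
o += a_decayed @ vnew_scaled
```

The scalar fallback for a partial last chunk would compute the same `a_decayed
@ vnew_scaled` product with an explicit per-row loop over valid tokens only.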

Pipeline registered but not yet dispatched (threshold remains disabled).
Test tolerance bumped to 5e-3 for n_seq_tokens≥64 to account for f16
intermediate precision in the coopmat path.
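
A quick host-side illustration of why f16 intermediates loosen the tolerance
(this sketches the rounding effect only; the shader's actual accumulation
behavior depends on the coopmat implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 128)).astype(np.float32)

ref = a @ b  # f32 reference

# Round the operands to f16 before the GEMM, as the coopmat path stores
# its staged operands in f16 shared memory.
f16 = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

rel_err = np.abs(f16 - ref).max() / np.abs(ref).max()
print(f"max relative error: {rel_err:.2e}")
```

With 64-element dot products the f16 operand rounding alone lands the error
well above f32 round-off, which is consistent with relaxing the test threshold
to 5e-3 for longer sequences.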

16/16 backend tests pass.
2026-03-13 21:45:42 -04:00
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include llama : enable chunked fused GDN path (#20340) 2026-03-11 22:46:40 +02:00
src vulkan: add coopmat GEMM output kernel for chunked GDN 2026-03-13 21:45:42 -04:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt ggml : fix typo gmml (#20512) 2026-03-13 14:36:13 +01:00