llama.cpp/ggml
kangletian 49a5ff40e2 mmvq: add RDNA4-specific parameter table (nwarps=8, rows=1)
Add a dedicated MMVQ_PARAMETERS_RDNA4 entry separate from RDNA2/RDNA3.
For bs=1 decode on RDNA4 (gfx1201), the optimal config is nwarps=8, rows=1:
- 8 warps × 32 threads = 256 threads per block
- blocks_per_iter = vdr*nwarps*warp_size/qi = 2*8*32/4 = 128
- For K=4096: blocks_per_row=128, so the entire K dimension is covered in a single iteration
- Maximizes memory-level parallelism on RDNA4

Benchmark (Llama 2 7B Q4_0, AMD Radeon AI PRO R9700):
  Master:   95.05 tok/s (tg128)
  nwarps=8: 104.82 tok/s (tg128) → +10.3%
  pp512: no regression (1448 vs 1449 tok/s)
2026-02-11 09:44:01 +00:00
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include ggml-virtgpu: make the code thread safe (#19204) 2026-02-04 10:46:18 +08:00
src mmvq: add RDNA4-specific parameter table (nwarps=8, rows=1) 2026-02-11 09:44:01 +00:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt Bump cmake max version (needed for Windows on Snapdragon builds) (#19188) 2026-02-01 14:13:38 -08:00