Add a dedicated MMVQ_PARAMETERS_RDNA4 entry separate from RDNA2/RDNA3.

For bs=1 decode on RDNA4 (gfx1201), the optimal config is nwarps=8, rows=1:
- 8 warps × 32 threads = 256 threads per block
- blocks_per_iter = vdr*nwarps*warp_size/qi = 2*8*32/4 = 128
- For K=4096: blocks_per_row=128, so the entire K dimension is covered in a single iteration
- Maximizes memory-level parallelism on RDNA4

Benchmark (Llama 2 7B Q4_0, AMD Radeon AI PRO R9700):
- Master: 95.05 tok/s (tg128)
- nwarps=8: 104.82 tok/s (tg128) → +10.3%
- pp512: no regression (1448 vs 1449 tok/s)
| Name |
|---|
| cmake |
| include |
| src |
| .gitignore |
| CMakeLists.txt |
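
The arithmetic in the commit message can be checked with a small standalone sketch. The struct and function names below are hypothetical and are not taken from the repository; the constants are the ones quoted in the message (vdr=2, warp_size=32, qi=4), and the Q4_0 block size of 32 weights is an assumption chosen because it reproduces the quoted blocks_per_row=128 for K=4096.

```cpp
#include <cassert>

// Hypothetical container for the launch parameters named in the commit message.
struct MmvqConfig {
    int vdr;        // "vdr" factor from the message (2 for this case)
    int nwarps;     // warps per thread block
    int warp_size;  // threads per warp (32 in the message)
    int qi;         // "qi" divisor from the message (4 for this case)
};

// Threads launched per block: nwarps * warp_size.
constexpr int threads_per_block(const MmvqConfig & c) {
    return c.nwarps * c.warp_size;
}

// Quantized blocks consumed by one thread block per loop iteration.
constexpr int blocks_per_iter(const MmvqConfig & c) {
    return c.vdr * c.nwarps * c.warp_size / c.qi;
}

// Quantized blocks in one row of length k; qk is the number of weights
// per quantized block (assumed to be 32 for Q4_0).
constexpr int blocks_per_row(int k, int qk) {
    return k / qk;
}

// Loop iterations needed to walk one row (ceiling division).
constexpr int iterations_per_row(int k, int qk, const MmvqConfig & c) {
    return (blocks_per_row(k, qk) + blocks_per_iter(c) - 1) / blocks_per_iter(c);
}

// The configuration described in the commit message: nwarps=8.
constexpr MmvqConfig rdna4_bs1 = {2, 8, 32, 4};

static_assert(threads_per_block(rdna4_bs1) == 256, "8 warps x 32 threads");
static_assert(blocks_per_iter(rdna4_bs1) == 128, "2*8*32/4");
static_assert(blocks_per_row(4096, 32) == 128, "K=4096 with 32-weight blocks");
static_assert(iterations_per_row(4096, 32, rdna4_bs1) == 1, "single iteration");
```

With these numbers, a row of K=4096 is consumed in one pass by a single thread block. Halving nwarps in the same formula would give 64 blocks per iteration and therefore two passes over the row, which illustrates the trade-off the message describes without claiming anything about what other architectures actually use.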