llama.cpp/ggml/src/ggml-cuda/vendors
commit ea4a321f2a
Author: yulo
HIP: add fattn-mma-f16 for RDNA4 (#18481)
* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update per review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-13 13:52:16 +01:00
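Several of the bullets above ("KQ_rowsum", "correct exp", "fix softmax scale") concern the online-softmax bookkeeping that flash-attention kernels use: a running row maximum and row sum, with previously accumulated values rescaled whenever a larger score appears. The following is a host-side C++ sketch of that generic technique, not the actual HIP kernel code; the struct and member names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of the online-softmax state kept per row of KQ scores.
// In the kernel this lives in registers per warp/row; here it is a
// plain struct processing one block of scores at a time.
struct OnlineSoftmax {
    float m = -INFINITY; // running row maximum of the KQ scores
    float l = 0.0f;      // running rowsum of exp(score - m)

    // Fold in one block of scores. Returns the correction factor by
    // which previously accumulated V-products must be rescaled,
    // since the reference maximum m may have grown.
    float update(const std::vector<float>& scores) {
        float m_new = m;
        for (float s : scores) m_new = std::fmax(m_new, s);
        const float correction = std::exp(m - m_new); // rescale old rowsum
        l *= correction;
        for (float s : scores) l += std::exp(s - m_new);
        m = m_new;
        return correction;
    }
};
```

Processing the scores in blocks yields the same maximum and rowsum as a single pass, which is the invariant the "fix softmax scale" commits restore when the correction factor is applied incorrectly.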
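The "use identity mat to transpose" bullet refers to a common MMA trick: multiplying a tile by the identity matrix while feeding the tile through the operand that the matrix unit reads in a swapped (column-major) layout, so the accumulator receives the transpose without a separate shuffle. Below is a host-side C++ model of that idea under those assumptions; the function name is illustrative and this is not the RDNA4 kernel itself.

```cpp
#include <cassert>
#include <vector>

// Transpose an n x n row-major tile by computing C = I * A, where A is
// indexed column-major (a[j*n + k]) as the MMA "B" operand would be.
// Then C[i][j] = sum_k I[i][k] * A[j][k] = A[j][i], i.e. the transpose.
std::vector<float> transpose_via_identity(const std::vector<float>& a, int n) {
    std::vector<float> id(n * n, 0.0f);
    for (int i = 0; i < n; ++i) id[i * n + i] = 1.0f; // identity operand
    std::vector<float> out(n * n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                out[i * n + j] += id[i * n + k] * a[j * n + k];
    return out;
}
```

On hardware the multiply is essentially free relative to a cross-lane shuffle sequence, which is why the kernel prefers an identity-operand MMA over explicit register permutation.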
File     Last commit                                                      Date
cuda.h   CUDA: experimental native mxfp4 support for blackwell (#17906)   2025-12-24 22:28:26 +08:00
hip.h    HIP: add fattn-mma-f16 for RDNA4 (#18481)                        2026-01-13 13:52:16 +01:00
musa.h   sampling : add support for backend sampling (#17004)             2026-01-04 22:22:16 +02:00