llama.cpp/ggml/src/ggml-cuda
Latest commit 66906cd82a by deepsek:
HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624)
This commit adds support for MFMA instructions to MMQ. The MFMA-enabled code path added by this commit supports CDNA1/GFX908, CDNA2/GFX90a, and CDNA3/GFX942. For now, the code path and stream-K are enabled only on CDNA3, since it fails to outperform BLAS in all cases on the other devices.
BLAS is currently outperformed consistently only on CDNA3, due to issues in the AMD-provided BLAS libraries.
This commit also makes MMQ more aware of different warp sizes; as a side effect it improves the performance of all quant formats on GCN GPUs, except q4_0 and q4_1, which regress slightly.
2025-07-27 00:28:14 +02:00
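As background for the commit above, a minimal sketch of how per-architecture gating like this can be expressed in HIP device code. The `EXAMPLE_*` names are hypothetical, not the identifiers used in `common.cuh` or `mmq.cuh`; the `__gfx*__` and `__AMDGCN_WAVEFRONT_SIZE__` macros are defined by the HIP/clang compiler for AMD targets.

```cpp
// Hypothetical gating sketch; the EXAMPLE_* macros are illustrative only.

// CDNA1/GFX908, CDNA2/GFX90a and CDNA3/GFX942 all support MFMA
// instructions, so the MFMA code path compiles for all three targets.
#if defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx942__)
#    define EXAMPLE_CDNA_MFMA 1
#endif

// Stream-K decomposition is enabled only on CDNA3, where it consistently
// outperforms the BLAS fallback.
#if defined(__gfx942__)
#    define EXAMPLE_MMQ_STREAM_K 1
#endif

// Warp-size awareness: GCN/CDNA wavefronts are 64 lanes wide while NVIDIA
// warps are 32, so tile and loop bounds should come from a compile-time
// constant rather than a hard-coded 32.
#if defined(__AMDGCN_WAVEFRONT_SIZE__)
static constexpr int example_warp_size = __AMDGCN_WAVEFRONT_SIZE__; // 64 on GCN/CDNA
#else
static constexpr int example_warp_size = 32; // NVIDIA and default
#endif
```

A kernel can then derive its tiling from `example_warp_size` instead of assuming 32 lanes, which is what lets the warp-size awareness benefit GCN.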
template-instances CUDA: FA support for Deepseek (Ampere or newer) (#13306) 2025-05-09 13:34:58 +02:00
vendors HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
CMakeLists.txt cuda: remove linking to cublasLt (#14790) 2025-07-22 07:45:26 +08:00
acc.cu llama/ggml: add LLM training support (#10544) 2025-05-12 14:44:49 +02:00
acc.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
arange.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
arange.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
argmax.cu cuda : optimize argmax (#10441) 2024-11-21 18:18:50 +01:00
argmax.cuh ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
argsort.cu ggml : reduce hash table reset cost (#8698) 2024-07-27 04:41:55 +02:00
argsort.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
binbcast.cu Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121) 2025-03-03 18:18:11 +02:00
binbcast.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
clamp.cu cuda: unary ops as float + de-duplicate (ggml/1130) 2025-03-03 18:18:11 +02:00
clamp.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
common.cuh HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
concat.cu musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
concat.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
conv-transpose-1d.cu musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
conv-transpose-1d.cuh feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854) 2024-07-08 12:23:00 +03:00
conv2d-dw.cu CUDA: add conv_2d_dw (#14265) 2025-06-20 09:50:24 +08:00
conv2d-dw.cuh CUDA: add conv_2d_dw (#14265) 2025-06-20 09:50:24 +08:00
conv2d-transpose.cu CUDA: add conv_2d_transpose (#14287) 2025-06-20 22:48:24 +08:00
conv2d-transpose.cuh CUDA: add conv_2d_transpose (#14287) 2025-06-20 22:48:24 +08:00
convert.cu CUDA: fix compilation with GGML_CUDA_F16 (#14837) 2025-07-23 18:22:30 +02:00
convert.cuh CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361) 2025-06-29 01:30:53 +08:00
count-equal.cu ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213) 2024-11-09 08:35:46 +01:00
count-equal.cuh ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
cp-async.cuh CUDA: FA support for Deepseek (Ampere or newer) (#13306) 2025-05-09 13:34:58 +02:00
cpy-utils.cuh cuda : implement bf16 cpy ops and enable bf16 cont (#14763) 2025-07-22 12:33:10 +02:00
cpy.cu musa: upgrade musa sdk to rc4.2.0 (#14498) 2025-07-24 20:05:37 +01:00
cpy.cuh ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970) 2025-04-17 15:19:42 +02:00
cross-entropy-loss.cu CUDA: add dynamic shared mem to softmax, refactor general usage (#14497) 2025-07-03 07:45:11 +08:00
cross-entropy-loss.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
dequantize.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
diagmask.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
diagmask.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-common.cuh CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn-mma-f16.cuh musa: fix build warnings (unused variable) (#14869) 2025-07-26 10:36:02 +08:00
fattn-tile-f16.cu CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn-tile-f16.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-tile-f32.cu musa: fix build warnings (unused variable) (#14869) 2025-07-26 10:36:02 +08:00
fattn-tile-f32.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-vec-f16.cuh musa: fix build warnings (unused variable) (#14869) 2025-07-26 10:36:02 +08:00
fattn-vec-f32.cuh CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn-wmma-f16.cu CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn-wmma-f16.cuh CUDA: use mma PTX instructions for FlashAttention (#11583) 2025-02-02 19:31:09 +01:00
fattn.cu CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
getrows.cu CUDA: add bf16 and i32 to getrows (#14529) 2025-07-07 21:45:43 +08:00
getrows.cuh CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199) 2025-04-30 23:12:59 +02:00
ggml-cuda.cu CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
gla.cu llama: add support for QRWKV6 model architecture (#11001) 2025-01-10 09:58:08 +08:00
gla.cuh llama: add support for QRWKV6 model architecture (#11001) 2025-01-10 09:58:08 +08:00
im2col.cu vulkan/cuda: Fix im2col when KW!=KH (#14789) 2025-07-21 13:35:40 +02:00
im2col.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
mean.cu CUDA: add mean operation (#14313) 2025-06-22 12:39:54 +08:00
mean.cuh CUDA: add mean operation (#14313) 2025-06-22 12:39:54 +08:00
mma.cuh HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
mmq.cu HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
mmq.cuh HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
mmv.cu CUDA/HIP: optimize mmv paths taken for HIP devices (#14324) 2025-06-24 01:12:56 +02:00
mmv.cuh CUDA: mul_mat_v support for batch sizes > 1 (#14262) 2025-06-23 13:11:31 +02:00
mmvq.cu CUDA: fix crash with partial offloading of MoE (#13439) 2025-05-11 16:09:33 +02:00
mmvq.cuh CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014) 2025-04-22 21:27:40 +02:00
norm.cu CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
norm.cuh CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
opt-step-adamw.cu ggml: new optimization interface (ggml/988) 2024-11-17 08:30:29 +02:00
opt-step-adamw.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
out-prod.cu CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380) 2025-01-24 12:38:31 +01:00
out-prod.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
pad.cu musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
pad.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
pool2d.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
pool2d.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
quantize.cu CUDA: fix crash on large batch size for quant. MoE (#13537) 2025-05-14 16:41:02 +02:00
quantize.cuh CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199) 2025-04-30 23:12:59 +02:00
rope.cu cuda : fix rope with partial rotation and non-cont src (#14580) 2025-07-08 10:15:21 +03:00
rope.cuh RoPE: fix back, CUDA support for back + noncont. (#11240) 2025-01-15 12:51:37 +01:00
scale.cu ggml : add ggml_scale_bias (#14417) 2025-07-09 18:16:12 +02:00
scale.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
set-rows.cu musa: fix build warnings (unused variable) (#14869) 2025-07-26 10:36:02 +08:00
set-rows.cuh CUDA: add set rows for f32 and f16 (#14551) 2025-07-12 16:31:38 +03:00
softmax.cu CUDA: add dynamic shared mem to softmax, refactor general usage (#14497) 2025-07-03 07:45:11 +08:00
softmax.cuh CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
ssm-conv.cu model : support LiquidAI LFM2 hybrid family (#14620) 2025-07-11 20:27:01 +02:00
ssm-conv.cuh ggml : faster ssm scan (#10558) 2025-03-31 18:05:13 +02:00
ssm-scan.cu cuda : support Falcon-H1 state size for SSM_SCAN (#14602) 2025-07-09 23:54:38 -04:00
ssm-scan.cuh ggml : faster ssm scan (#10558) 2025-03-31 18:05:13 +02:00
sum.cu llama/ggml: add LLM training support (#10544) 2025-05-12 14:44:49 +02:00
sum.cuh tests: add gradient tests for all backends (ggml/932) 2024-09-08 11:05:55 +03:00
sumrows.cu CUDA: add mean operation (#14313) 2025-06-22 12:39:54 +08:00
sumrows.cuh CUDA: add mean operation (#14313) 2025-06-22 12:39:54 +08:00
tsembd.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
tsembd.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
unary.cu cuda : add ELU support (#14657) 2025-07-13 11:33:16 +02:00
unary.cuh cuda : add ELU support (#14657) 2025-07-13 11:33:16 +02:00
upscale.cu CUDA: add bilinear interpolation for upscale (#14563) 2025-07-08 10:11:18 +08:00
upscale.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
vecdotq.cuh CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014) 2025-04-22 21:27:40 +02:00
wkv.cu llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00
wkv.cuh llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00