llama.cpp/ggml/src/ggml-cuda
Latest commit 66906cd82a by deepsek:
HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624)
This commit adds support for MFMA instructions to MMQ. The MFMA-enabled code path added by this commit supports CDNA1/GFX908, CDNA2/GFX90a, and CDNA3/GFX942. For now, the code path and stream-K are enabled only on CDNA3, since it fails to outperform BLAS in all cases on the other devices.
BLAS is currently outperformed consistently only on CDNA3, due to issues in the AMD-provided BLAS libraries.
This commit also makes MMQ more aware of different warp sizes; as a side effect it improves the performance of all quant formats on GCN GPUs, except q4_0 and q4_1, which regress slightly.
2025-07-27 00:28:14 +02:00
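As background for the commit above, a minimal sketch of how per-architecture gating like this can be expressed in HIP device code. The `EXAMPLE_*` names are hypothetical, not the identifiers used in `common.cuh` or `mmq.cuh`; the `__gfx*__` and `__AMDGCN_WAVEFRONT_SIZE__` macros are defined by the HIP/clang compiler for AMD targets.

```cpp
// Hypothetical gating sketch; the EXAMPLE_* macros are illustrative only.

// CDNA1/GFX908, CDNA2/GFX90a and CDNA3/GFX942 all support MFMA
// instructions, so the MFMA code path compiles for all three targets.
#if defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx942__)
#    define EXAMPLE_CDNA_MFMA 1
#endif

// Stream-K decomposition is enabled only on CDNA3, where it consistently
// outperforms the BLAS fallback.
#if defined(__gfx942__)
#    define EXAMPLE_MMQ_STREAM_K 1
#endif

// Warp-size awareness: GCN/CDNA wavefronts are 64 lanes wide while NVIDIA
// warps are 32, so tile and loop bounds should come from a compile-time
// constant rather than a hard-coded 32.
#if defined(__AMDGCN_WAVEFRONT_SIZE__)
static constexpr int example_warp_size = __AMDGCN_WAVEFRONT_SIZE__; // 64 on GCN/CDNA
#else
static constexpr int example_warp_size = 32; // NVIDIA and default
#endif
```

A kernel can then derive its tiling from `example_warp_size` instead of assuming 32 lanes, which is what lets the warp-size awareness benefit GCN.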
template-instances CUDA: FA support for Deepseek (Ampere or newer) (#13306) 2025-05-09 13:34:58 +02:00
vendors HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
CMakeLists.txt cuda: remove linking to cublasLt (#14790) 2025-07-22 07:45:26 +08:00
acc.cu llama/ggml: add LLM training support (#10544) 2025-05-12 14:44:49 +02:00
acc.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
arange.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
arange.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
argmax.cu cuda : optimize argmax (#10441) 2024-11-21 18:18:50 +01:00
argmax.cuh ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
argsort.cu ggml : reduce hash table reset cost (#8698) 2024-07-27 04:41:55 +02:00
argsort.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
binbcast.cu Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121) 2025-03-03 18:18:11 +02:00
binbcast.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
clamp.cu cuda: unary ops as float + de-duplicate (ggml/1130) 2025-03-03 18:18:11 +02:00
clamp.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
common.cuh HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
concat.cu musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
concat.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
conv-transpose-1d.cu musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
conv-transpose-1d.cuh feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854) 2024-07-08 12:23:00 +03:00
conv2d-dw.cu CUDA: add conv_2d_dw (#14265) 2025-06-20 09:50:24 +08:00
conv2d-dw.cuh CUDA: add conv_2d_dw (#14265) 2025-06-20 09:50:24 +08:00
conv2d-transpose.cu CUDA: add conv_2d_transpose (#14287) 2025-06-20 22:48:24 +08:00
conv2d-transpose.cuh CUDA: add conv_2d_transpose (#14287) 2025-06-20 22:48:24 +08:00
convert.cu CUDA: fix compilation with GGML_CUDA_F16 (#14837) 2025-07-23 18:22:30 +02:00
convert.cuh CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361) 2025-06-29 01:30:53 +08:00
count-equal.cu ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213) 2024-11-09 08:35:46 +01:00
count-equal.cuh ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
cp-async.cuh CUDA: FA support for Deepseek (Ampere or newer) (#13306) 2025-05-09 13:34:58 +02:00
cpy-utils.cuh cuda : implement bf16 cpy ops and enable bf16 cont (#14763) 2025-07-22 12:33:10 +02:00
cpy.cu musa: upgrade musa sdk to rc4.2.0 (#14498) 2025-07-24 20:05:37 +01:00
cpy.cuh ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970) 2025-04-17 15:19:42 +02:00
cross-entropy-loss.cu CUDA: add dynamic shared mem to softmax, refactor general usage (#14497) 2025-07-03 07:45:11 +08:00
cross-entropy-loss.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
dequantize.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
diagmask.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
diagmask.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-common.cuh CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn-mma-f16.cuh musa: fix build warnings (unused variable) (#14869) 2025-07-26 10:36:02 +08:00
fattn-tile-f16.cu CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn-tile-f16.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-tile-f32.cu musa: fix build warnings (unused variable) (#14869) 2025-07-26 10:36:02 +08:00
fattn-tile-f32.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-vec-f16.cuh musa: fix build warnings (unused variable) (#14869) 2025-07-26 10:36:02 +08:00
fattn-vec-f32.cuh CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn-wmma-f16.cu CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn-wmma-f16.cuh CUDA: use mma PTX instructions for FlashAttention (#11583) 2025-02-02 19:31:09 +01:00
fattn.cu CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
fattn.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
getrows.cu CUDA: add bf16 and i32 to getrows (#14529) 2025-07-07 21:45:43 +08:00
getrows.cuh CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199) 2025-04-30 23:12:59 +02:00
ggml-cuda.cu CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
gla.cu llama: add support for QRWKV6 model architecture (#11001) 2025-01-10 09:58:08 +08:00
gla.cuh llama: add support for QRWKV6 model architecture (#11001) 2025-01-10 09:58:08 +08:00
im2col.cu vulkan/cuda: Fix im2col when KW!=KH (#14789) 2025-07-21 13:35:40 +02:00
im2col.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
mean.cu CUDA: add mean operation (#14313) 2025-06-22 12:39:54 +08:00
mean.cuh CUDA: add mean operation (#14313) 2025-06-22 12:39:54 +08:00
mma.cuh HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
mmq.cu HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
mmq.cuh HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) 2025-07-27 00:28:14 +02:00
mmv.cu CUDA/HIP: optimize mmv paths taken for HIP devices (#14324) 2025-06-24 01:12:56 +02:00
mmv.cuh CUDA: mul_mat_v support for batch sizes > 1 (#14262) 2025-06-23 13:11:31 +02:00
mmvq.cu CUDA: fix crash with partial offloading of MoE (#13439) 2025-05-11 16:09:33 +02:00
mmvq.cuh CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014) 2025-04-22 21:27:40 +02:00
norm.cu CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
norm.cuh CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
opt-step-adamw.cu ggml: new optimization interface (ggml/988) 2024-11-17 08:30:29 +02:00
opt-step-adamw.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
out-prod.cu CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380) 2025-01-24 12:38:31 +01:00
out-prod.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
pad.cu musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
pad.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
pool2d.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
pool2d.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
quantize.cu CUDA: fix crash on large batch size for quant. MoE (#13537) 2025-05-14 16:41:02 +02:00
quantize.cuh CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199) 2025-04-30 23:12:59 +02:00
rope.cu cuda : fix rope with partial rotation and non-cont src (#14580) 2025-07-08 10:15:21 +03:00
rope.cuh RoPE: fix back, CUDA support for back + noncont. (#11240) 2025-01-15 12:51:37 +01:00
scale.cu ggml : add ggml_scale_bias (#14417) 2025-07-09 18:16:12 +02:00
scale.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
set-rows.cu musa: fix build warnings (unused variable) (#14869) 2025-07-26 10:36:02 +08:00
set-rows.cuh CUDA: add set rows for f32 and f16 (#14551) 2025-07-12 16:31:38 +03:00
softmax.cu CUDA: add dynamic shared mem to softmax, refactor general usage (#14497) 2025-07-03 07:45:11 +08:00
softmax.cuh CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
ssm-conv.cu model : support LiquidAI LFM2 hybrid family (#14620) 2025-07-11 20:27:01 +02:00
ssm-conv.cuh ggml : faster ssm scan (#10558) 2025-03-31 18:05:13 +02:00
ssm-scan.cu cuda : support Falcon-H1 state size for SSM_SCAN (#14602) 2025-07-09 23:54:38 -04:00
ssm-scan.cuh ggml : faster ssm scan (#10558) 2025-03-31 18:05:13 +02:00
sum.cu llama/ggml: add LLM training support (#10544) 2025-05-12 14:44:49 +02:00
sum.cuh tests: add gradient tests for all backends (ggml/932) 2024-09-08 11:05:55 +03:00
sumrows.cu CUDA: add mean operation (#14313) 2025-06-22 12:39:54 +08:00
sumrows.cuh CUDA: add mean operation (#14313) 2025-06-22 12:39:54 +08:00
tsembd.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
tsembd.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
unary.cu cuda : add ELU support (#14657) 2025-07-13 11:33:16 +02:00
unary.cuh cuda : add ELU support (#14657) 2025-07-13 11:33:16 +02:00
upscale.cu CUDA: add bilinear interpolation for upscale (#14563) 2025-07-08 10:11:18 +08:00
upscale.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
vecdotq.cuh CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014) 2025-04-22 21:27:40 +02:00
wkv.cu llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00
wkv.cuh llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00