llama.cpp/ggml/src/ggml-cuda
Latest commit: 86221cf6da by Johannes Gäßler — CUDA: fix FA kernel selection logic (#21271) — 2026-04-01 22:28:19 +03:00
template-instances ggml-cuda: Add generic NVFP4 MMQ kernel (#21074) 2026-04-01 12:04:58 +02:00
vendors ggml-cuda: Add NVFP4 dp4a kernel (#20644) 2026-03-26 09:54:03 +01:00
CMakeLists.txt ggml-cuda: native bf16 flash attention for vec kernel (#20525) 2026-03-22 11:05:51 +01:00
acc.cu
acc.cuh
add-id.cu
add-id.cuh
arange.cu
arange.cuh
argmax.cu ggml : use WARP_SIZE/2 for argmax reduction offset (#18092) 2025-12-17 11:47:01 +08:00
argmax.cuh
argsort.cu CUDA : Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 (#21181) 2026-03-30 16:20:00 +02:00
argsort.cuh sampling : add support for backend sampling (#17004) 2026-01-04 22:22:16 +02:00
binbcast.cu ggml : extend bin bcast for permuted src1 (#19484) 2026-02-11 07:52:00 +02:00
binbcast.cuh
clamp.cu
clamp.cuh
common.cuh ggml-cuda: Add generic NVFP4 MMQ kernel (#21074) 2026-04-01 12:04:58 +02:00
concat.cu
concat.cuh
conv-transpose-1d.cu
conv-transpose-1d.cuh
conv2d-dw.cu
conv2d-dw.cuh
conv2d-transpose.cu CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (#17094) 2026-03-26 10:19:14 +08:00
conv2d-transpose.cuh CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (#17094) 2026-03-26 10:19:14 +08:00
conv2d.cu
conv2d.cuh
convert.cu ggml-cuda: Add NVFP4 dp4a kernel (#20644) 2026-03-26 09:54:03 +01:00
convert.cuh CUDA: fix BF16 FA compilation (#20865) 2026-03-22 17:53:33 +01:00
count-equal.cu
count-equal.cuh
cp-async.cuh
cpy-utils.cuh cuda : support non-contiguous i32 to i32 copy (#17326) 2025-11-23 11:13:34 +01:00
cpy.cu Fix data race in CUDA's "cpy" kernel (influences GGML's DUP, CONT operations). (#20507) 2026-03-14 13:19:44 +08:00
cpy.cuh cuda : remove legacy copy-op pointer indirection code (#16485) 2025-10-14 11:53:49 +02:00
cross-entropy-loss.cu
cross-entropy-loss.cuh
cumsum.cu sampling : add support for backend sampling (#17004) 2026-01-04 22:22:16 +02:00
cumsum.cuh Add support for CUMSUM and TRI for CUDA. (#17584) 2025-12-04 22:19:51 +01:00
dequantize.cuh
diag.cu Add DIAG for CUDA (#17873) 2025-12-09 20:28:57 +01:00
diag.cuh Add DIAG for CUDA (#17873) 2025-12-09 20:28:57 +01:00
diagmask.cu
diagmask.cuh
fattn-common.cuh ggml-cuda: native bf16 flash attention for vec kernel (#20525) 2026-03-22 11:05:51 +01:00
fattn-mma-f16.cuh CUDA: Add Flash Attention Support for Head Dimension 512 (#20998) 2026-04-01 09:07:24 +02:00
fattn-tile.cu CUDA: Add Flash Attention Support for Head Dimension 512 (#20998) 2026-04-01 09:07:24 +02:00
fattn-tile.cuh CUDA: Add Flash Attention Support for Head Dimension 512 (#20998) 2026-04-01 09:07:24 +02:00
fattn-vec.cuh ggml-cuda: native bf16 flash attention for vec kernel (#20525) 2026-03-22 11:05:51 +01:00
fattn-wmma-f16.cu Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (#19591) 2026-02-16 14:46:08 +01:00
fattn-wmma-f16.cuh chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
fattn.cu CUDA: fix FA kernel selection logic (#21271) 2026-04-01 22:28:19 +03:00
fattn.cuh
fill.cu ggml : allow fill node alloc inplace (#17870) 2025-12-09 12:23:47 +01:00
fill.cuh cuda : add FILL op support (#17851) 2025-12-08 21:10:12 +08:00
gated_delta_net.cu CUDA: GDN hide memory latency (#20537) 2026-03-16 11:41:45 +08:00
gated_delta_net.cuh ggml: add GATED_DELTA_NET op (#19504) 2026-03-07 15:41:10 +08:00
getrows.cu CUDA: fix GET_ROWS for large tensors (#15882) 2025-09-09 08:11:01 +02:00
getrows.cuh
ggml-cuda.cu ggml-cuda: Add generic NVFP4 MMQ kernel (#21074) 2026-04-01 12:04:58 +02:00
gla.cu
gla.cuh
im2col.cu CUDA: fix im2col_3d to respect non-contiguous inputs (views) (#15956) 2025-09-16 00:28:31 +02:00
im2col.cuh ggml: add ops for WAN video model (cuda && cpu) (#15669) 2025-09-04 10:38:49 +02:00
mean.cu ggml-cuda: enable cuda-graphs for `n-cpu-moe` (#18934) 2026-01-24 14:25:20 +08:00
mean.cuh
mma.cuh CUDA: add CDNA3 MFMA support for flash attention MMA kernel (#19806) 2026-02-27 19:37:26 +01:00
mmf.cu HIP: add mmf for CDNA (#18896) 2026-01-29 11:10:53 +01:00
mmf.cuh HIP: add mmf for CDNA (#18896) 2026-01-29 11:10:53 +01:00
mmid.cu CUDA: add fp kernel for larger batch size MoE (#16512) 2025-10-14 13:15:15 +02:00
mmid.cuh CUDA: add fp kernel for larger batch size MoE (#16512) 2025-10-14 13:15:15 +02:00
mmq.cu ggml-cuda: Add generic NVFP4 MMQ kernel (#21074) 2026-04-01 12:04:58 +02:00
mmq.cuh ggml-cuda: Add generic NVFP4 MMQ kernel (#21074) 2026-04-01 12:04:58 +02:00
mmvf.cu CUDA: use mmvq for mul-mat-id for small batch sizes (#18958) 2026-02-03 23:31:23 +08:00
mmvf.cuh CUDA: use mmvq for mul-mat-id for small batch sizes (#18958) 2026-02-03 23:31:23 +08:00
mmvq.cu CUDA/HIP: Fix kernel selection for mmvq mmid kernel to align host selection with device launch bounds (#21238) 2026-04-01 10:21:20 +02:00
mmvq.cuh Optimize MOE GEMV kernel for BS > 1. (#20905) 2026-03-29 18:35:18 +02:00
norm.cu CUDA: Factor out and re-use `block_reduce` function (#18785) 2026-01-15 10:44:54 +08:00
norm.cuh
opt-step-adamw.cu
opt-step-adamw.cuh
opt-step-sgd.cu
opt-step-sgd.cuh
out-prod.cu
out-prod.cuh
pad.cu cuda : extend GGML_OP_PAD to work with non-cont src0 (#19429) 2026-02-10 08:07:16 +02:00
pad.cuh
pad_reflect_1d.cu musa: fix build warnings (#15611) 2025-09-26 02:56:10 +02:00
pad_reflect_1d.cuh
pool2d.cu
pool2d.cuh
quantize.cu chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
quantize.cuh CUDA: experimental native mxfp4 support for blackwell (#17906) 2025-12-24 22:28:26 +08:00
reduce_rows.cuh CUDA: Factor out and re-use `block_reduce` function (#18785) 2026-01-15 10:44:54 +08:00
roll.cu
roll.cuh
rope.cu CUDA: Fix non-contig rope (#19338) 2026-02-08 15:12:51 +02:00
rope.cuh CUDA: fuse rope + set_rows (#16884) 2025-11-13 08:50:01 +08:00
scale.cu ggml: add ops for WAN video model (cuda && cpu) (#15669) 2025-09-04 10:38:49 +02:00
scale.cuh
set-rows.cu CUDA: use fastdiv in set-rows (#16834) 2025-10-29 21:11:53 +08:00
set-rows.cuh
set.cu cuda: add SET operation support (#16804) 2025-10-28 20:10:28 +01:00
set.cuh cuda: add SET operation support (#16804) 2025-10-28 20:10:28 +01:00
softcap.cu
softcap.cuh
softmax.cu chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
softmax.cuh
solve_tri.cu chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
solve_tri.cuh SOLVE_TRI CUDA kernel for small matrices (#17457) 2025-11-28 12:15:32 +08:00
ssm-conv.cu cuda/hip: fix loop unrolling in ssm-conv (#20369) 2026-03-11 13:04:32 +08:00
ssm-conv.cuh CUDA: use shared mem for ssm_conv (#20128) 2026-03-06 23:09:59 +08:00
ssm-scan.cu ggml : optimize cuda ssm_scan using warp-level reduction (#18505) 2026-01-07 02:24:34 +08:00
ssm-scan.cuh
sum.cu
sum.cuh
sumrows.cu
sumrows.cuh
top-k.cu CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930) 2026-01-20 20:11:01 +08:00
top-k.cuh sampling : add support for backend sampling (#17004) 2026-01-04 22:22:16 +02:00
topk-moe.cu ggml-cuda: add mem check for fusion (#19916) 2026-03-07 00:05:43 +08:00
topk-moe.cuh CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (#19126) 2026-01-29 10:31:28 +08:00
tri.cu Add support for CUMSUM and TRI for CUDA. (#17584) 2025-12-04 22:19:51 +01:00
tri.cuh Add support for CUMSUM and TRI for CUDA. (#17584) 2025-12-04 22:19:51 +01:00
tsembd.cu ggml : fix padding in timestep embedding kernels (#15932) 2025-09-16 15:25:57 +02:00
tsembd.cuh
unary.cu CUDA: use shared mem for ssm_conv (#20128) 2026-03-06 23:09:59 +08:00
unary.cuh CUDA: use shared mem for ssm_conv (#20128) 2026-03-06 23:09:59 +08:00
upscale.cu model: LFM2-VL fixes (#17577) 2025-11-30 21:57:31 +01:00
upscale.cuh
vecdotq.cuh ggml-cuda: Add NVFP4 dp4a kernel (#20644) 2026-03-26 09:54:03 +01:00
wkv.cu
wkv.cuh