* Adjust to pytorch
* Add antialiasing upscale
* Increase number of patches to 1024
* Handle default marker insertion for LFM2
* Switch to flag
* Reformat
* Cuda implementation of antialias kernel
* Change placement in ops.cpp
* consistent float literals
* Pad only for LFM2
* Address PR feedback
* Rollback default marker placement changes
* Fall back to the CPU implementation of the antialias upscale
* enable mmf for rdna4
* move some mmvf to mmf
* revert lds128 for wmma loading
* Revert "revert lds128 for wmma loading"
This reverts commit db9ae8b6b4.
* Revert "enable mmf for rdna4"
This reverts commit 698c9f2418.
* Revert "move some mmvf to mmf"
This reverts commit 99b92bd665.
* enable mul_mat for rdna4
---------
Co-authored-by: zhang hui <you@example.com>
* patch failing test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) when enabling WMMA on RDNA4
* Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162
* first commit naive test to enable mmq for RDNA4
* adding appropriate WMMA instructions
* git rebase on top of master: fixing the correctness of the mat mul operations, updating layout mappings for RDNA4
* clean up merge conflicts
* add comments and code clean up
* PR clean up, addressed comments
* enable MMQ fallback on RDNA4
* addressed comments: add guards in load generic, separate wmma branch for use_mmq function
* Revert build-xcframework.sh
* Formatting: remove trailing whitespace
* revert CMake files
* clean up after rebase: remove duplicated change, revert cmake files
* clean up after rebase: revert changes from build-xcframework.sh
* clean up: remove extra space line in mma.cuh
* Revert "clean up: remove extra space line in mma.cuh"
This reverts commit b39ed57c45.
* mmf for rdna4
* align the padding for rdna4
* forbid mul_mat_f for rdna4
* fix as comment
* remove device kernels
* add constexpr for early return
* update based on review comment
* change based on the review comment
* fix compile error
* keep code consistency
---------
Co-authored-by: zhang hui <you@example.com>
* Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition
* Argh.
* Making CISC happy ;)
* Integrate CONT tests
* Use loopy loop
* Skip new tests for (B)F16 for now.
* CUDA: add fused rope
* move k forward_expand up
* create helper function instead of re-using params
* make assert statement more in line with comment
* rope_norm: coalesced writes to global mem
* vulkan : implement upscale with bicubic interpolation
* cuda : implement upscale with bicubic interpolation
* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests
* adapt OpenCL backend to not support the OP in that case so tests don't fail
* print scale mode & flags in test-backend-ops
* WIP
* added a cpy kernel specific to transposed tensors which uses smem to avoid uncoalesced access; test cases also added showing improved memory bandwidth
* added BF16 support
* more strict check to make sure src0 is a transpose
* reformulated to handle more complicated transpose cases
* bring back 2D transpose for higher performance
* allow build on windows
* transpose copy more shapes
* minor tweak
* final clean up
* restore some test cases
* keep only the kernel for the truly transposed case; updated with review suggestions
* make CI happy
* remove headers not needed
* reduced bank conflicts for fp16 and bf16
* add missing const*
* now bank conflicts free
* use padding instead of swizzling (see the sketch after this change set)
---------
Co-authored-by: bssrdf <bssrdf@gmail.com>
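The transposed-copy commits above follow the usual tiled shared-memory pattern. As a rough, hypothetical sketch (not the actual ggml kernel; names, tile sizes and launch config are illustrative), staging the tile in padded smem so that both the global read and the global write stay coalesced looks roughly like this:

```cuda
// Hypothetical sketch of a transposed copy through a padded shared-memory tile.
// Launch with blockDim = (TILE_DIM, BLOCK_ROWS); src is rows x cols row-major,
// dst is cols x rows row-major.
#define TILE_DIM   32
#define BLOCK_ROWS 8

__global__ void cpy_transposed_f32(const float * src, float * dst, int rows, int cols) {
    // +1 padding puts each tile row in a different bank, avoiding bank conflicts
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x; // source column
    int y = blockIdx.y * TILE_DIM + threadIdx.y; // source row

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < cols && (y + j) < rows) {
            tile[threadIdx.y + j][threadIdx.x] = src[(y + j) * cols + x]; // coalesced read
        }
    }
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x; // destination column (= source row)
    y = blockIdx.x * TILE_DIM + threadIdx.y; // destination row    (= source column)

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < rows && (y + j) < cols) {
            dst[(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j]; // coalesced write
        }
    }
}
```

The +1 padding on the tile's inner dimension is what the "use padding instead of swizzling" commit refers to: it shifts each row of the tile into a different shared-memory bank so the column-wise reads during write-back don't serialize.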
* CUDA: Remove unneeded bias/gate dims in fused mmvq
Pointed out
[here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989)
that only a single value is needed per target column per thread.
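As a purely hypothetical illustration of that point (generic names, not the actual mmvq code, and the real gating applies an activation that is omitted here): when each thread owns one output column, the bias and gate contributions collapse to a single scalar each, so there is no need to carry per-dimension arrays through the kernel.

```cuda
// Hypothetical sketch, not the real fused mmvq kernel: each thread produces one
// output column, so only one bias value and one gate value are ever needed.
__global__ void mv_bias_gate(const float * A, const float * x,
                             const float * bias, const float * gate,
                             float * y, int rows, int cols) {
    const int col = blockIdx.x * blockDim.x + threadIdx.x; // one column per thread
    if (col >= cols) {
        return;
    }

    float acc = 0.0f;
    for (int r = 0; r < rows; ++r) {
        acc += A[r * cols + col] * x[r]; // dot product for this column
    }

    // a single scalar per column suffices for bias and gate
    const float b = bias ? bias[col] : 0.0f;
    const float g = gate ? gate[col] : 1.0f;
    y[col] = (acc + b) * g; // real gating would apply an activation to g
}
```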
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Fix "Error 991-D: extra braces are nonstandard" during compilation
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* CUDA: Volta tensor core support for MMF
* more generic checks for hardware support
* Update ggml/src/ggml-cuda/mmf.cuh
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
This is realised by loading them into registers before computation of
the dot-product, effectively batching them together with said
dot-product. As a lot of threads are alive here, the warp scheduler has
enough threads available to effectively hide the cost of additionally
loading those two floats.
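A minimal sketch of that latency-hiding idea, with hypothetical names (the actual kernel works on quantized data with warp-level reductions): the two extra scalar loads are issued before the dot-product loop, so the warp scheduler can overlap their latency with the arithmetic of other warps.

```cuda
// Hypothetical device helper illustrating the idea above, not the actual kernel:
// pull the two extra floats into registers first, then run the dot product.
__device__ float dot_with_bias_gate(const float * w, const float * x, int n,
                                    const float * bias, const float * gate, int col) {
    // issue both loads up front; their latency is hidden behind the loop below
    const float b = bias[col];
    const float g = gate[col];

    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += w[i] * x[i];
    }
    return (acc + b) * g;
}
```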
* CUDA: Fix bug in topk-moe for gpt-oss
When using ggml_can_fuse_subgraph, the output nodes that are passed are wrong. This causes `test-backend-ops` to still fuse nodes (because the nodes are not used elsewhere in the graph),
but fusion does not actually happen in the real gpt-oss graph
* fix for qwen3 too
* change ifndef to ifdef
* ggml : fix interpolate with align-corners and ne=1
* avoid division by zero if one of the spatial dimensions is 1
* CPU, CUDA and OpenCL returned the correct result anyway due to clamping
* Vulkan didn't clamp for align-corners, so results were broken
* fix clang warning
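For reference, a hypothetical sketch of the align-corners coordinate mapping and the ne == 1 guard described above (illustrative only, not the actual ggml code):

```cuda
// Hypothetical helper, not the actual ggml implementation: map a destination
// index to a source coordinate with align-corners semantics.
__device__ __forceinline__ float src_coord_align_corners(int dst_i, int dst_ne, int src_ne) {
    // align-corners uses scale = (src_ne - 1) / (dst_ne - 1); with dst_ne == 1
    // that divides by zero, so fall back to the only valid coordinate, 0
    if (dst_ne <= 1) {
        return 0.0f;
    }
    const float scale = (float)(src_ne - 1) / (float)(dst_ne - 1);
    const float coord = dst_i * scale;
    // clamping to the valid range is what kept CPU/CUDA/OpenCL correct
    return fminf(fmaxf(coord, 0.0f), (float)(src_ne - 1));
}
```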