llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	4e010b4d7b	ggml : add ggml_build_forward_select	2026-01-02 19:26:52 +02:00
MeeMin	e86f3c2221	cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433 ) * ggml-cuda: fixed assertion in ggml_cuda_cpy (#18140) * ggml-cuda: changes in data types to int64_t * ggml-cuda: added asserts for CUDA block numbers * ggml-cuda: changed the condition for y and z dimension	2026-01-02 00:24:20 +01:00
Aman Gupta	26831bded9	ggml-cuda: remove unneccesary prints on ggml_cuda_init (#18502 )	2026-01-01 19:18:43 +08:00
Johannes Gäßler	ecc343de63	CUDA: fix KQ max calculation (#18487 )	2025-12-31 09:37:00 +01:00
Aman Gupta	d77d7c5c06	CUDA: add log line when mxfp4 acceleration is used (#18483 ) * CUDA: add log line when mxfp4 acceleration is used * add in backend_get_features	2025-12-30 17:40:46 +08:00
Johannes Gäßler	0bd1212a43	CUDA: fix replacment of bad archs in CMake (#18457 )	2025-12-29 17:58:20 +01:00
Johannes Gäßler	e70e640db3	CUDA: Blackwell features for non-native builds (#18436 )	2025-12-29 09:35:42 +01:00
Aman Gupta	5fa66c6e67	cuda: fix race condition in cumsum (#18448 ) * ggml-cuda: fix race condition in cumsum * remove unneccesary sync_threads	2025-12-29 14:07:17 +08:00
uvos	4ffc47cb20	HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202 )	2025-12-28 20:12:55 +01:00
Aman Gupta	07a0c4ba92	Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413 )" (#18426 )	2025-12-28 20:53:36 +08:00
QDelta	4fd59e8427	ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413 )	2025-12-28 09:33:14 +08:00
Aman Gupta	06705fdcb3	ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407 )	2025-12-27 19:56:27 +08:00
Aman Gupta	85c40c9b02	ggml-cuda: fix regex for arch list (#18371 ) * ggml-cuda: fix regex for arch list * make regex exact	2025-12-26 01:35:14 +08:00
Aman Gupta	83b3b1c271	cuda: optimize cumsum cub path (#18362 ) * cuda: optimize cumsum cub path * remove heavy perf test	2025-12-25 23:55:38 +08:00
Aman Gupta	b0fb0f0aee	ggml-cuda: fix blackwell native builds (#18361 ) * ggml-cuda: fix blackwell native builds Replace 12x in native architectures by 12xa * replace for GGML_NATIVE=OFF too * only replace for native * remove 120f-virtual for default compilation --------- Co-authored-by: Aman Gupta <aman>	2025-12-25 22:12:11 +08:00
Aadeshveer Singh	c54bba869d	ggml : optimize cuda cumsum fallback kernel (#18343 )	2025-12-25 12:11:13 +08:00
Aman Gupta	c8a2417d7b	CUDA: experimental native mxfp4 support for blackwell (#17906 ) * CUDA: experimental native mxfp4 support for blackwell * optimize load_tiles * optimize quantize_mxfp4 * cleanup * first pass review: formatting * use interleaved layout for mma * mmq: add assert for size * use __nv_fp4x4_e2m1 * use iter_k as 512, cleanup * Use 1200 as blackwell instead of 1000 * address review comments * mmq: fix stride * quantize.cu: use reference impl of e8m0 scale * address review comments * add 120f-virtual + minor fixes --------- Co-authored-by: Aman Gupta <aman>	2025-12-24 22:28:26 +08:00
Jeff Bolz	b365c3ff01	vulkan/cuda: fix topk_moe with exp_probs_b (#18071 ) I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn and added coverage for exp_probs_b and some other missing combinations. This exposed a bug in both CUDA and Vulkan backends where they were assuming the input to argsort and the input to get_rows are the same. I'd like to optimize this graph in another change, but for now just get it functional. CUDA also had a bug where it got n_experts from the wrong place, leading to GGML_ASSERT failures in some of the new tests.	2025-12-21 10:27:34 +01:00
Aadeshveer Singh	10b4f82d44	Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (#18212 )	2025-12-20 19:28:57 +08:00
Xuan-Son Nguyen	8ea958d4d9	model : add ASR support for LFM2-Audio-1.5B (conformer) (#18106 ) * ASR with LFM2-Audio-1.5B * Set rope_theta * Fix comment * Remove rope_theta setting * Address PR feedback * rename functions to conformer * remove some redundant ggml_cont * fix missing tensor * add prefix "a." for conv tensors * remove redundant reshape * clean up * add test model --------- Co-authored-by: Tarek Dakhran <tarek@liquid.ai>	2025-12-19 00:18:01 +01:00
yulo	54189c0d39	remove i_major_dual (#18157 ) Co-authored-by: zhang hui <you@example.com>	2025-12-18 12:50:56 +01:00
yulo	acec774ef6	HIP: Refactor mma for RDNA and CDNA (#17990 ) * mma.cuh for rdna4 * mma for rdna3 * mmq for rdna4 * mmq for rdna3 * align i-major and j-major * cdna * fix cuda error * add missing tile of mfma * fix j-major wrong ne on CDNA * fix gramma and empty spaces --------- Co-authored-by: zhang hui <you@example.com>	2025-12-17 09:34:54 +01:00
Aadeshveer Singh	58062860af	ggml : use WARP_SIZE/2 for argmax reduction offset (#18092 )	2025-12-17 11:47:01 +08:00
Johannes Gäßler	482211438d	CUDA: fix overflow in MMA kernel without stream-k (#17939 )	2025-12-12 17:43:58 +01:00
yulo	c33a58bced	HIP: enable mmf for RDNA3 (#17879 ) * enable mmf for RDNA3 * disable mmf for some shape * move some mmvf to mmf * more mmfv to mmf * 3 is good in mmvf --------- Co-authored-by: zhang hui <you@example.com>	2025-12-12 11:34:33 +01:00
Piotr Wilkin (ilintar)	53ecd4fdb9	SOLVE_TRI extension to more dimensions (#17793 ) * Extended TRI * Fix whitespace * chore: update webui build output * Just use cuBLAS for everything... * Merge both versions * Remove incorrect imports causing failures for CI * Still failing... remove all direct cublas imports and rely on common imports from "common.cuh" * Defines for hipBlas * Aaaand MUSA defines... * I hate this job... * Stupid typo... * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-11 17:20:43 +01:00
Sigbjørn Skjæret	4df6e859e9	cuda : add missing support check for xielu (#17895 )	2025-12-10 16:16:20 +01:00
Johannes Gäßler	17f7f4baad	CUDA: fix unpadded strides in MMA FA kernel (#17891 )	2025-12-10 12:39:56 +01:00
Piotr Wilkin (ilintar)	b63509262a	Add DIAG for CUDA (#17873 ) * Add DIAG for CUDA * Refactor parameters	2025-12-09 20:28:57 +01:00
Sigbjørn Skjæret	86a3f0fad8	ggml : allow fill node alloc inplace (#17870 )	2025-12-09 12:23:47 +01:00
Johannes Gäßler	0cdce38a97	CUDA: fix FP16 overflow in tile FA kernel (#17875 )	2025-12-09 09:34:02 +01:00
Jay Zenith	51e0c2d917	cuda : add FILL op support (#17851 ) * cuda : add FILL op support * cuda : add missing FILL op files	2025-12-08 21:10:12 +08:00
wsbagnsv1	5814b4dce1	cuda: optimize SOLVE_TRI using registers and FMAF (#17703 ) * ggml-cuda: optimize solve_tri_f32_fast and fix stride handling - Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts. - Implement explicit `fmaf` instructions for the reduction loop. - Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition). - Remove unused `MAX_K_FAST` definition. Small cleanup * Remove comments in solve_tri.cu * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Use const for variables in solve_tri.cu * Replace fmaf with more readable code * remove last fmaf --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-08 10:41:08 +01:00
Phylliida Dev	09c7c50e64	ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (#16985 ) * Feat: Added vulkan circular tiling support * Feat: Added cpu circular * Feat: Added cuda kernels * Added tests * Added tests * Removed non-pad operations * Removed unneded changes * removed backend non pad tests * Update test-backend-ops.cpp * Fixed comment on pad test * removed trailing whitespace * Removed unneded test in test-backend-ops * Removed removed test from calls * Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp Co-authored-by: Ruben Ortlam <picard12@live.de> * Fixed alignment * Formatting Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Format pad * Format * Clang format * format * format * don't change so much stuff * clang format and update to bool * fix duplicates * don't need to fix the padding * make circular bool * duplicate again * rename vulkan to wrap around * Don't need indent * moved to const expr * removed unneded extra line break * More readable method calls * Minor wording changes * Added final newline * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Added circular pad ext tests * Gate non circular pad devices * Cleaned gating of non-circular pad devices --------- Co-authored-by: Phylliida <phylliidadev@gmail.com> Co-authored-by: Ruben Ortlam <picard12@live.de> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-06 15:07:02 +01:00
Johannes Gäßler	f334b79494	HIP: fix RDNA3 FP16/BF16 matrix multiplication (#17817 )	2025-12-06 13:45:36 +01:00
Johannes Gäßler	6016d0bd41	HIP : fix RDNA4 build (#17792 )	2025-12-05 13:47:52 +01:00
Johannes Gäßler	e95d0bc8fd	CUDA: fix FA VKQ accumulator overflow (#17746 )	2025-12-05 09:18:10 +01:00
Jiacheng (Jason) Chen	668ed76574	HIP: enable WMMA-MMQ INT kernels for RDNA 3 (#17576 ) * enabled wmma instructions for most quantizations other than q2k * fixed the last q2_k test case failure * address comments: fix out of bound write for RDNA4, add comments after #endif * clean up rebase: fix ne error in half2 * fix the EditorConfig CI	2025-12-05 09:17:37 +01:00
Piotr Wilkin (ilintar)	96fe9badfc	Add support for CUMSUM and TRI for CUDA. (#17584 ) * Add support for CUMSUM and TRI for CUDA. * Minor optimizations. * Correct warp_prefix_inclusive_sum in float2 variant to return float2 * Optimize TRI * Whitespace * Fix strides. * Implement double loop * Whitespace * Fix HIP compilation bugs * Optimizations + big case performance tests * Implement using CUB with fallback to custom kernel * Remove error message. * Fixes from code review * Comment out CPU-unsupported F16/BF16 cases to fix CI * Fine, you win :P * Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS * Vary warp-size based on physical warp size * Add GGML_UNUSED_VARS in tri as well * Use constexpr and call prefix_inclusive with warp_size template param * Update ggml/src/ggml-cuda/cumsum.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Change to tid % warp_size * Fix strides; hardcode mask; add ggml_lane_mask_t * Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info() * Too hasty... --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-04 22:19:51 +01:00
Johannes Gäßler	2e1c9cd814	CUDA: generalized (mma) FA, add Volta support (#17505 ) * CUDA: generalized (mma) FA, add Volta support * use struct for MMA FA kernel config --------- Co-authored-by: Aman Gupta <aman>	2025-12-03 16:57:05 +01:00
Aman Gupta	ed32089927	ggml-cuda: reorder only relevant nodes (#17639 )	2025-12-02 12:36:31 +08:00
Aman Gupta	6eea666912	llama-graph: avoid expand_forward for fusion (#17633 )	2025-12-01 11:12:48 +02:00
Tarek Dakhran	2ba719519d	model: LFM2-VL fixes (#17577 ) * Adjust to pytorch * Add antialiasing upscale * Increase number of patches to 1024 * Handle default marker insertion for LFM2 * Switch to flag * Reformat * Cuda implementation of antialias kernel * Change placement in ops.cpp * consistent float literals * Pad only for LFM2 * Address PR feedback * Rollback default marker placement changes * Fallback to CPU implementation for antialias implementation of upscale	2025-11-30 21:57:31 +01:00
Aman Gupta	c7af376c29	CUDA: add stream-based concurrency (#16991 ) * CUDA: add stream-based concurrency * HIP: fix hipStreamWaitEvent define and nodiscard warnings * ggml-cuda: fix fusion inside stream * ggml-cuda: fix bug w.r.t first stream launch * ggml-cuda: format * ggml-cuda: improve assert message * ggml-cuda: use lambda instead of duplicating code * ggml-cuda: add some more comments * ggml-cuda: add more detailed comments about concurrency * ggml-cuda: rename + remove unused var * ggml-cuda: fix condition for stream launch * ggml-cuda: address review comments, add destructor * common.cuh: add is_valid for concurrent events * common.cuh: make comment better * update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * common.cuh: fix lower_bound condition + remove join_node data from write_ranges * ggml-cuda: fix overlap condition + shadowing parameter --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-11-30 08:17:55 +08:00
Mahekk Shaikh	00425e2ed1	cuda : add error checking for cudaMemcpyAsync in argsort (#17599 ) * cuda : add error checking for cudaMemcpyAsync in argsort (#12836) * fix indentation	2025-11-30 08:16:28 +08:00
R0CKSTAR	c6f7a423c8	[MUSA] enable fp16/fast_fp16/bf16_mma on PH1 (#17551 ) * [MUSA] enable fp16/fast_fp16/bf16_mma on PH1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/fattn-vec.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/fattn-vec.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/fattn-tile.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-11-28 14:08:29 +01:00
Aman Gupta	2e7ef98f18	ggml-cuda: add stricter checking for fusion (#17568 ) * ggml-cuda: make conditions for fusion more explicit * ggml-cuda: remove size check as std::equal already does it	2025-11-28 20:34:51 +08:00
Johannes Gäßler	73955f7d2a	CUDA: no FP16 arithmetic for vector FA kernel (#17558 )	2025-11-28 10:29:09 +01:00
yulo	6bca76ff5e	HIP: enable mul_mat_f for RDNA4 (#17437 ) * enable mmf for rdna4 * move some mmvf to mmf * revert lds128 for wmma loading * Revert "revert lds128 for wmma loading" This reverts commit `db9ae8b6b4`. * Revert "enable mmf for rdna4" This reverts commit `698c9f2418`. * Revert "move some mmvf to mmf" This reverts commit `99b92bd665`. * enable mul_mat for rdna4 --------- Co-authored-by: zhang hui <you@example.com>	2025-11-28 08:24:30 +01:00
Piotr Wilkin (ilintar)	cd0e3a7a3b	SOLVE_TRI CUDA kernel for small matrices (#17457 )	2025-11-28 12:15:32 +08:00

406 Commits