llama.cpp

Commit Graph

Author	SHA1	Message	Date
Ruben Ortlam	392e09a608	vulkan: fix memory allocations (#17122 )	2025-11-09 16:14:41 +01:00
Ruben Ortlam	7f3e9d339c	vulkan: iGPU memory reporting fix (#17110 ) * vulkan: use all device-local heaps for memory availability reporting Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com> * use all available heaps for iGPU memory reporting * Allow multiple memory types per buffer request for devices with split heaps --------- Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-09 09:54:47 +01:00
Ruben Ortlam	8a3519b708	vulkan: fix mmq out of bounds reads (#17108 ) * vulkan: fix mmq out of bounds reads, streamline outdated matmul host code * fix mul_mat_id quantization call * Fix compiler warnings	2025-11-09 09:52:57 +01:00
Jeff Bolz	80a6cf6347	vulkan: fuse mul_mat_id + mul (#17095 ) * vulkan: fuse mul_mat_id + mul This comes up in qwen3 moe. * split mul_mat_id fusion tests into a separate class	2025-11-09 09:48:42 +01:00
Jeff Bolz	53d7d21e61	vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978 ) * vulkan: Use spec constants for conv2d s/d/p and kernel W/H Also add some additional unroll hints, which seems to help. * lock around map lookup	2025-11-08 13:24:29 -06:00
Jeff Bolz	b4e335d8dc	vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977 ) This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.	2025-11-08 08:52:15 +01:00
Jeff Bolz	d6fe40fa00	vulkan: Fix test-thread-safety crashes (#17024 ) The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the same time, which needs to hold the lock. To be safe, hold the lock for all of ggml_vk_load_shaders.	2025-11-08 08:39:45 +01:00
Acly	ac76d36201	vulkan : refactor buffer handling in vk_op_f32 (#16840 ) * vulkan : refactor/simplify buffer handling in vk_op_* functions * Combine UMA handling into ggml_vk_tensor_subbuffer	2025-11-07 21:08:50 +01:00
Jeff Bolz	a44d77126c	vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (#16919 )	2025-11-05 19:51:03 +01:00
Jeff Bolz	ad51c0a720	vulkan: remove the need for the dryrun (#16826 ) * vulkan: remove the need for the dryrun Allocate pipelines and descriptor sets when requested. Reallocate the prealloc buffers when needed, and flush any pending work before reallocating. For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed. * remove dryrun parameters	2025-11-04 13:28:17 -06:00
Jeff Bolz	5d8bb900bc	vulkan: Fix multi_add invalid descriptor usage (#16899 )	2025-11-01 06:52:14 +01:00
Jeff Bolz	2e76e01360	vulkan: fuse mul_mat+add and mul_mat_id+add_id (#16868 ) * vulkan: fuse mul_mat+add and mul_mat_id+add_id The fusion is only applied for the mat-vec mul paths. * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix 32b build --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-01 06:45:28 +01:00
Masato Nakasaka	2976b0374d	vulkan: Fix crash when FP16 mul_mat accumulation is not supported (#16796 ) * Experimenting crash fix * added assert for aborting and fixed comment * changed to check if a pipeline is empty or not * Moved function in class definition * replaced with is_empty * Modified is_empty to check only unaligned pipelines	2025-10-31 08:18:59 +01:00
JJJYmmm	d261223d24	model: add support for qwen3vl series (#16780 ) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit `f321b9fdf1`. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-30 16:19:14 +01:00
Jeff Bolz	052df28b0e	vulkan: Handle argsort with a large number of rows (#16851 )	2025-10-30 07:27:41 +01:00
Jeff Bolz	b9ce940177	vulkan: Fuse rope+set_rows (#16769 ) This pattern appears in a lot of models, the rope operation is applied right before storing into the KV cache (usually on the K tensor). Add a path to some of the rope shaders that computes the destination address based on the set_rows tensor. Compile variants of the shader with D_TYPE of f16 (the usual KV cache type). Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs the fourth for the row indices. Add fused_ops_write_mask to indicate which intermediate tensors need to write their results to memory. Skipping writing the roped K value helps to allow more nodes to run concurrently. Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It rarely starts out that way in the graph. Add new backend tests.	2025-10-29 15:13:10 -05:00
Jeff Bolz	10fcc41290	vulkan: Update topk_moe fusion to handle gpt's late softmax (#16656 ) * vulkan: Update topk_moe fusion to handle gpt's late softmax Based on #16649. * Add ggml_check_edges * Add sync logging to show fusion effects * handle clamp added in #16655 * Update ggml/src/ggml-impl.h Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-10-29 14:44:29 +01:00
Ruben Ortlam	bcf5bda6f5	Vulkan MMQ Integer Dot Refactor and K-Quant support (#16536 ) * vulkan: add mmq q2_k integer dot support * Refactor mmq caching * Reduce mmq register use * Load 4 quant blocks into shared memory in one step * Pack q2_k blocks into caches of 32 * Use 32-bit accumulators for integer dot matmul * Add q4_k mmq * Add q3_k mmq * Add q5_k mmq * Add q6_k mmq * Add mxfp4 mmq, enable MMQ MUL_MAT_ID * Fix mmv dm loads	2025-10-29 14:39:03 +01:00
Jeff Bolz	f549b0007d	vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (#16793 ) This lets the copy to the destination device use the host-visible vidmem optimization.	2025-10-29 09:53:04 +01:00
Acly	10640e31aa	ggml : fix interpolate with align-corners and ne=1 (#16700 ) * ggml : fix interpolate with align-corners and ne=1 * avoid division by zero if one of the spatial dimensions is 1 * cpu, cuda, opencl returned correct result anyway due to clamp * vulkan didn't clamp for align-corners so results were broken * fix clang warning	2025-10-27 21:50:22 +01:00
Gilad S.	3cfa9c3f12	vulkan: deduplicate Microsoft Direct3D12 devices (#16689 ) * fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver * style: indent * fix: decrease priority * fix: switch to `||`	2025-10-26 05:37:38 +01:00
Giuseppe Scrivano	f90b4a8efe	vulkan: delete dead code (#16732 ) ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-25 10:59:54 +02:00
Jeff Bolz	8423d01931	vulkan: Optimize SSM_SCAN (#16645 )	2025-10-25 07:04:12 +02:00
Jeff Bolz	e56abd2098	vulkan: Implement topk_moe fused shader, ported from CUDA (#16641 ) This is similar to the CUDA shader from #16130, but doesn't use shared memory and handles different subgroup sizes.	2025-10-18 12:22:57 +02:00
Giuseppe Scrivano	3d4e86bbeb	vulkan: Add State Space Model (SSM) Operations Support (#16463 ) * vulkan: implement SSM scan operation Add State Space Model scan operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> * vulkan: implement SSM conv operation Add State Space Model conv operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-17 14:23:47 +02:00
Jeff Bolz	4258e0cfe7	vulkan: Support FA with K/V in F32 (#16543 )	2025-10-14 15:53:37 +02:00
Jeff Bolz	2aaf0a2a20	vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#16354 ) * vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers. The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed beyond that limit. This allows > 4GB buffers to be allocated on some implementations (e.g. NVIDIA) and tensors this large can be used for im2col and mul_mat. For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange. I'm not sure this check is ideal, but we always use these buffers as a single full size binding and the limit may be smaller than maxMemoryAllocationSize or maxBufferSize, so I think this is reasonable. Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range. The maxStorageBufferRange may be smaller than the maxBufferSize or maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and it's invalid usage if VK_WHOLE_SIZE computes a range larger than maxStorageBufferRange. With this change, it should be possible to generate videos using wan networks in stable-diffusion.cpp. * vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull	2025-10-03 12:50:46 +02:00
Jeff Bolz	e308efda8e	vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316 )	2025-10-03 10:33:08 +02:00
Eve	132d673554	vulkan: make ggml_vk_default_dispatcher support older vulkan headers (#16345 ) * make ggml_vk_default_dispatcher support older vulkan headers * simpilfy with using	2025-10-01 09:56:36 +02:00
Jeff Bolz	d8359f5fde	vulkan: 64-bit im2col (#16135 ) * vulkan: 64-bit im2col Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp. * fix validation error for large im2col	2025-09-28 08:38:37 +02:00
Jeff Bolz	1384abf8b8	vulkan: handle mat_mul with A matrix > 4GB (#16176 ) * vulkan: handle mat_mul with A matrix > 4GB This change splits mat_mul operations with huge A matrix into chunks in the M dimension. This works well for stable-diffusion use cases where the im2col matrix has very large M. Fix the order of setting the stride in mul_mm_cm2 - setting the dimension clobbers the stride, so stride should be set after. * build fixes	2025-09-27 20:36:34 -05:00
Acly	8656f5de68	vulkan : make the vulkan.hpp dynamic dispatcher instance private (#16224 ) * don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same	2025-09-27 22:41:03 +02:00
Dmytro Minochkin	0499b29c6f	vulkan: throw system error instead of SIGABRT during init on older devices (#16156 ) * Throw system error on old Vulkan driver rather than SIGABRT * Optionally handle any potential error in vulkan init	2025-09-27 18:26:46 +02:00
Jeff Bolz	3f81b4e91c	vulkan: support GET_ROWS for k-quants (#16235 ) The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few changes - add a_offset and divide iqs by 2. It's probably possible to call these functions from mul_mm_funcs and avoid the duplication, but I didn't go that far in this change.	2025-09-27 12:36:11 +02:00
Sigbjørn Skjæret	3ecb2f671a	ggml : implement set_rows with i32 index (#16159 ) * implement set_rows with i32 index * template fix * test quantized path warnings-- * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * forgotten name change * deduplicate cuda/sycl and test-fix * indent++ * vulkan: support set_rows with i32 index type (#16162) * disable i32 index for webgpu for now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-22 19:13:00 +02:00
Shin-myoung-serp	96fdca043b	Vulkan: add conv_transpose_2d operation (#16022 ) * Vulkan: add conv_transpose_2d operation * Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L) * Vulkan: fix incorrect indentation in conv_transpose_2d shader * Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation * Vulkan: revert the order of the index calculation and bound check in conv_2d shader * Vulkan: explicity check push constants limit in supports_op() for conv_transpose_2d operation. * Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.	2025-09-22 10:04:01 +02:00
Jeff Bolz	a20d810d79	vulkan: add RTE variants of exp shader (#16165 ) This fixes some failures on Turing where "round to zero" rounds to the max f16 value but the CPU reference value is infinite.	2025-09-22 07:37:17 +02:00
Giuseppe Scrivano	1eeb523c3e	vulkan: optimize UMA buffer operations and fix driver hangs (#16059 ) * vulkan: optimize UMA buffer operations and fix driver hangs The previous implementation was blocking the GPU for extended periods, causing the i915 driver to reset the context due to the hangcheck protection. [32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114] [32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang * vulkan: implement deferred_memset on UMA --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-09-21 08:31:55 +02:00
Jeff Bolz	5bb4a3edec	vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (#16086 )	2025-09-21 08:23:37 +02:00
Jeff Bolz	c0b45097c3	rename optimize_graph to graph_optimize (#16082 )	2025-09-18 13:46:17 -05:00
Eve	cb5bb6cc05	vulkan: automatically remove unsupported devices (#15976 ) * remove unsupported vulkan devices * make this happen during selection instead * pass by reference	2025-09-17 09:35:37 +02:00
Ruben Ortlam	261e6a20ff	Vulkan: Clean up mul_mm shader (#15987 ) * vulkan: move mul_mm dequantization steps into a separate file and functions * improve mul_mm vector load code * fix debug mode issues and warnings	2025-09-14 16:56:28 +02:00
Jeff Bolz	b9c9c9f789	vulkan: initialize vulkan-hpp to allow using extension function pointers (#15705 ) Use this to query register count for shader compiles on NVIDIA. Currently this is only for performance debug, but it could eventually be used in some heuristics like split_k.	2025-09-13 17:23:30 +02:00
Ruben Ortlam	304ac5693d	Vulkan iGPU device selection overhaul and PCI ID API support (#15947 ) * vulkan: implement ggml igpu device type, implement pci id support * fix compiler warning * prevent printf overflow warning	2025-09-12 13:24:21 +02:00
Mathieu Baudier	6c88ad8fa7	vulkan: Make device memory check more portable (#15939 )	2025-09-12 09:06:20 +02:00
Diego Devesa	360d6533db	ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797 ) * ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type ggml-backend : add device id to device props llama : only use iGPU devices if there are no GPU devices llama : do not use multiple devices from different backends with the same device id	2025-09-11 22:47:38 +02:00
Ruben Ortlam	ae355f6f71	vulkan: throw the oom error instead of no memory type found (#15905 )	2025-09-09 22:26:03 +02:00
Jeff Bolz	4f63cd705c	vulkan: Fix OOB accesses in soft_max_back (#15861 )	2025-09-09 14:41:15 +02:00
lksj92hs	ed54e32558	Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (#15886 )	2025-09-09 14:01:15 +02:00
Jeff Bolz	e68aa10d8f	vulkan: sort graph to allow more parallel execution (#15850 ) * vulkan: sort graph to allow more parallel execution Add a backend proc to allow the backend to modify the graph. The vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes, doesn't change the contents of any of them. With #15489, this reduces the number of synchronizations needed. * call optimize_graph per-split	2025-09-09 02:10:07 +08:00
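
The graph sort in e68aa10d8f (#15850) is described as greedily reordering nodes so that nodes which don't depend on each other sit together, without changing the nodes themselves. A minimal sketch of that idea follows. It is not the code from the commit: the `Node` struct and the `greedy_reorder` function are hypothetical names made up for illustration, and the sketch assumes the input is already in a valid execution order, with each node listing the indices of the earlier nodes it reads from.

```cpp
#include <vector>

// Hypothetical stand-in for a graph node: only its dependencies matter here.
struct Node {
    std::vector<int> deps; // indices of earlier nodes this node reads from
};

// Returns a new execution order (as indices into `nodes`) in which mutually
// independent nodes are grouped together. Assumes every dependency index is
// smaller than the index of the node that uses it.
std::vector<int> greedy_reorder(const std::vector<Node>& nodes) {
    const int n = static_cast<int>(nodes.size());
    std::vector<int> group(n, -1); // group id per node, -1 = not scheduled yet
    std::vector<int> order;
    order.reserve(nodes.size());

    int current = 0; // id of the group being filled
    for (int first = 0; first < n; ++first) {
        if (group[first] != -1) {
            continue; // already pulled forward into an earlier group
        }
        // Start a new group at the first unscheduled node, then scan forward
        // and pull in every node whose dependencies all finished in an
        // earlier group. Such nodes cannot depend on each other.
        for (int j = first; j < n; ++j) {
            if (group[j] != -1) {
                continue;
            }
            bool ready = true;
            for (int d : nodes[j].deps) {
                if (group[d] == -1 || group[d] >= current) {
                    ready = false;
                    break;
                }
            }
            if (ready) {
                group[j] = current;
                order.push_back(j);
            }
        }
        ++current;
    }
    return order;
}
```

In this sketch a synchronization point would only be needed between groups rather than between every dependent pair, which is the effect the commit message attributes to the reordering. The actual backend works on ggml graph nodes and may apply further constraints that this sketch leaves out.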


236 Commits