llama.cpp

Commit Graph

Author	SHA1	Message	Date
Jeff Bolz	db97837385	vulkan: perf_logger improvements (#17672 ) * vulkan: perf_logger improvements - Move perf_logger from device to ctx. - Add an env var to control the frequency we dump the stats. If you set a very large value, it just dumps when the ctx is destroyed. - Add a fusion info string to the tracking, only log one item per fused op. - Fix MUL_MAT_ID flops calculation. * fix vector sizes	2025-12-06 18:46:46 +01:00
Phylliida Dev	09c7c50e64	ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (#16985 ) * Feat: Added vulkan circular tiling support * Feat: Added cpu circular * Feat: Added cuda kernels * Added tests * Added tests * Removed non-pad operations * Removed unneeded changes * removed backend non pad tests * Update test-backend-ops.cpp * Fixed comment on pad test * removed trailing whitespace * Removed unneeded test in test-backend-ops * Removed removed test from calls * Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp Co-authored-by: Ruben Ortlam <picard12@live.de> * Fixed alignment * Formatting Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Format pad * Format * Clang format * format * format * don't change so much stuff * clang format and update to bool * fix duplicates * don't need to fix the padding * make circular bool * duplicate again * rename vulkan to wrap around * Don't need indent * moved to const expr * removed unneeded extra line break * More readable method calls * Minor wording changes * Added final newline * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Added circular pad ext tests * Gate non circular pad devices * Cleaned gating of non-circular pad devices --------- Co-authored-by: Phylliida <phylliidadev@gmail.com> Co-authored-by: Ruben Ortlam <picard12@live.de> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-06 15:07:02 +01:00
Jeff Bolz	2960eb2975	vulkan: Use one row per workgroup for f32 mmv (#17711 ) The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs well. I think even for larger m, f32 is so bandwidth-limited that running multiple rows doesn't help.	2025-12-06 11:12:26 +01:00
Jeff Bolz	c6c5e85979	vulkan: support solve_tri with larger N/K values (#17781 ) Split N into chunks to fit into shared memory. If K > 128, use a larger workgroup with enough invocations. Add perf tests matching qwen3next.	2025-12-06 08:56:45 +01:00
Masato Nakasaka	67788f6846	vulkan: Replace deprecated VK_EXT_validation_features (#17637 ) * replaced deprecated VK_EXT_validation_features * forgot to remove old code	2025-12-06 06:39:42 +01:00
Masato Nakasaka	d8c0a7b085	vulkan: Fix mismatch in TOPK_MOE unit test (#17541 ) * Fix shader to support 2D workgroup mapping to a single subgroup * Set required_subgroup_size topk_moe shader requires static WARP_SIZE and actual subgroup size to match	2025-12-06 06:23:30 +01:00
Jeff Bolz	933414c0b6	vulkan: add more num_blocks instantiations in rms_norm (#17701 )	2025-12-05 22:08:56 +01:00
Jeff Bolz	a0f3897d53	vulkan: fix top_k bug when there are ties in the input (#17659 ) * vulkan: Reduce temporary memory usage for TOP_K - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB. * vulkan: fix top_k bug when there are ties in the input I noticed by inspection a bug in the vulkan top_k shader where if the least value in the top_k appears multiple times we could end up writing those extra copies out rather than some larger values (if the larger values are on higher numbered threads). I rewrote the test verification to handle this case, where the final index set is not necessarily the same. * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-05 22:03:19 +01:00
Acly	e15cd06a94	vulkan : support conv-2d with large output size (#17685 )	2025-12-05 21:46:39 +01:00
Jeff Bolz	6ab0d64960	vulkan: enable mmvq for q2_k on NVIDIA (#17675 )	2025-12-05 21:21:57 +01:00
Jeff Bolz	93bb92664e	vulkan: set all memory allocations to high priority (#17624 ) * vulkan: set all memory allocations to high priority * gate by env var	2025-12-05 21:21:04 +01:00
Jeff Bolz	61bde8e21f	vulkan: Reduce temporary memory usage for TOP_K (#17623 ) - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB.	2025-12-02 19:22:04 +01:00
Tarek Dakhran	2ba719519d	model: LFM2-VL fixes (#17577 ) * Adjust to pytorch * Add antialiasing upscale * Increase number of patches to 1024 * Handle default marker insertion for LFM2 * Switch to flag * Reformat * Cuda implementation of antialias kernel * Change placement in ops.cpp * consistent float literals * Pad only for LFM2 * Address PR feedback * Rollback default marker placement changes * Fallback to CPU implementation for antialias implementation of upscale	2025-11-30 21:57:31 +01:00
Acly	385c3da5e6	vulkan : fix FA mask load with bounds check (coopmat2) (#17606 )	2025-11-30 01:03:21 +01:00
Ruben Ortlam	47a268ea50	Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (#16900 ) * vulkan: split mul_mmq_funcs for mul_mat_vecq use * add mxfp4 mmvq * add q2_k mmvq * add q3_k mmvq * add q4_k and q5_k mmvq * add q6_k mmvq * handle 4x4 quants per mmvq thread * enable MUL_MAT_ID mmvq support * enable subgroup optimizations for mul_mat_vec_id shaders * device tuning * request prealloc_y sync after quantization * fix indentation * fix llvmpipe test failures * fix mul_mat_id mmvq condition * fix unused variable warning	2025-11-29 09:37:22 +01:00
Jeff Bolz	59d8d4e963	vulkan: improve topk perf for large k, fix overflow in unit tests (#17582 )	2025-11-29 08:39:57 +01:00
Jeff Bolz	35cf8887e1	vulkan: Implement GGML_OP_TRI (#17503 ) * vulkan: Implement GGML_OP_TRI * check types match	2025-11-28 10:07:29 +01:00
Jeff Bolz	4abef75f2c	vulkan: Implement SOLVE_TRI (#17486 ) * vulkan: Implement SOLVE_TRI * load B matrix through shared memory * use FLOAT_TYPE	2025-11-27 15:48:00 +01:00
Acly	b78db3bd50	vulkan : move contiguous checks to device_supports_op (#17490 ) * vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op * im2col: remove constraints on src0 (kernel input)	2025-11-27 06:54:19 +01:00
Jeff Bolz	142df17c9c	vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (#17514 )	2025-11-27 06:32:30 +01:00
Jeff Bolz	eec1e33a9e	vulkan: allow graph_optimize for prompt processing workloads (#17475 )	2025-11-26 16:46:33 +01:00
Jeff Bolz	879d673759	vulkan: Implement top-k (#17418 ) * vulkan: Implement top-k Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10) and discards all but the top K. Repeat until only K are left. And there's a fast path when K==1 to just find the max value rather than sorting. * fix pipeline selection * vulkan: Add N-ary search algorithm for topk * microoptimizations	2025-11-26 16:45:43 +01:00
Jeff Bolz	b3b03a7baf	vulkan: Implement GGML_OP_CUMSUM (#17479 )	2025-11-26 07:08:10 +01:00
Jeff Bolz	d414db02d3	vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (#17455 )	2025-11-25 07:11:27 +01:00
Jeff Bolz	3d07caa99b	vulkan: more FA details in vk_perf_logger (#17443 )	2025-11-24 22:25:24 +01:00
Jeff Bolz	54d83bbe85	vulkan: remove a couple unnecessary switches (#17419 )	2025-11-23 06:29:40 +01:00
Jeff Bolz	f1ffbba68e	vulkan: disable async for older Intel devices (#17369 ) * vulkan: disable async for older Intel devices * update detection logic * use name string for detection	2025-11-21 09:58:17 +01:00
Giuseppe Scrivano	7d77f07325	vulkan: implement ADD1, ARANGE, FILL, SOFTPLUS, STEP, ROUND, CEIL, FLOOR, TRUNC (#17319 ) * vulkan: initialize array * vulkan: implement ADD1 * vulkan: implement ARANGE * vulkan: implement FILL * vulkan: implement SOFTPLUS * vulkan: implement STEP * vulkan: implement ROUND * vulkan: implement CEIL * vulkan: implement FLOOR * vulkan: implement TRUNC * docs: update Vulkan ops Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-19 17:29:45 +01:00
Jeff Bolz	1fa4551af0	vulkan: support larger argsort (#17313 ) * vulkan: support larger argsort This is an extension of the original bitonic sorting shader that puts the temporary values in global memory and when more than 1024 threads are needed it runs multiple workgroups and synchronizes through a pipelinebarrier. To improve the memory access pattern, a copy of the float value is kept with the index value. I've applied this same change to the original shared memory version of the shader, which is still used when ncols <= 1024. * Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost * reduce loop overhead * run multiple cols per invocation, to reduce barrier overhead	2025-11-19 17:25:50 +01:00
Jeff Bolz	2eba631b81	vulkan: Add copy_transpose shader (#17371 )	2025-11-19 16:50:43 +01:00
Ruben Ortlam	980b7cd17e	vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356 )	2025-11-19 08:46:26 +01:00
Jeff Bolz	da95bf2a85	vulkan: support noncontig i32 copy (#17328 )	2025-11-18 07:41:24 +01:00
Ruben Ortlam	38e2c1b412	vulkan: add log RTE support to fix Nvidia CI (#17320 ) * vulkan: add log RTE support to fix Nvidia CI * actually use the rte shader	2025-11-17 14:37:49 -06:00
Pavels Zaicenkovs	dbed61294a	vulkan: add LOG operation support for F32 and F16 (#17183 ) * vulkan: add LOG operation support for F32 and F16 Part of #14909. * vulkan: Fix LOG operation types * docs: Update operation support documentation for Vulkan LOG operation * vulkan: fix log_f16 shader * docs: restore missing LOG test cases and regenerate ops.md	2025-11-16 22:50:09 +01:00
Ruben Ortlam	80deff3648	vulkan: fix MMQ quantize_y condition (#17301 )	2025-11-16 19:38:17 +01:00
Jeff Bolz	24dc769f1b	vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (#17287 ) These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.	2025-11-15 19:54:23 +01:00
Ruben Ortlam	4dca015b7e	vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (#17285 )	2025-11-15 15:18:58 +01:00
Giuseppe Scrivano	1568d13c2c	vulkan: implement ABS and NEG (#17245 ) * docs: update Vulkan ops * vulkan: add NEG op * vulkan: add ABS op --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-15 12:00:29 +01:00
Jeff Bolz	439342ea0b	vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (#17244 ) * vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths * set allow_misalign	2025-11-15 11:56:15 +01:00
Jeff Bolz	234ae7d7bd	vulkan: skip all-negative-inf blocks in FA (#17186 )	2025-11-15 10:37:25 +01:00
Jeff Bolz	38eaf32af1	vulkan: change graph_compute to be async and enable get_tensor_async (#17158 ) * vulkan: change graph_compute to be async and enable get_tensor_async This allows some additional CPU/GPU overlap for large pp workloads. Also seems to help a bit for token gen, maybe getting rid of a small bubble between graph_compute and get_tensor. Async set and copy functions seem to be very rarely used, so I didn't enable them because I didn't have a good way to test them. The async commands need to be ordered against each other, so put them all on the compute queue. The non-async commands still use the transfer queue. The fence for graph_compute/get_tensor_async is submitted and waited on in ggml_vk_synchronize. * fix thread safety errors * teardown context cleanly * Handle async read to non-pinned dst	2025-11-15 09:06:41 +01:00
Ruben Ortlam	a19bd6f7ce	vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (#17219 ) * vulkan: remove shell call from vulkan-shaders-gen tool * use string vector for command execution * Fix condition * use string, remove const_cast * Fix dependency file quotation on Windows --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-11-13 14:51:21 +01:00
Eve	7d019cff74	disable rms norm mul rope for chips with no fp16 rte (#17134 )	2025-11-11 12:53:30 -06:00
Ruben Ortlam	f117be185e	vulkan: check glslc executable string (#17144 )	2025-11-10 16:59:26 +01:00
Ruben Ortlam	85234a4b3a	vulkan: fix validation issue introduced by #16868 (#17145 )	2025-11-10 16:59:10 +01:00
Acly	1032256ec9	cuda/vulkan : bicubic interpolation (#17022 ) * vulkan : implement upscale with bicubic interpolation * cuda : implement upscale with bicubic interpolation * tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests * adapt OpenCL backend to not support the OP in that case so tests don't fail * print scale mode & flags in test-backend-ops	2025-11-10 10:19:39 +01:00
Ruben Ortlam	392e09a608	vulkan: fix memory allocations (#17122 )	2025-11-09 16:14:41 +01:00
Ruben Ortlam	7f3e9d339c	vulkan: iGPU memory reporting fix (#17110 ) * vulkan: use all device-local heaps for memory availability reporting Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com> * use all available heaps for iGPU memory reporting * Allow multiple memory types per buffer request for devices with split heaps --------- Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-09 09:54:47 +01:00
Ruben Ortlam	8a3519b708	vulkan: fix mmq out of bounds reads (#17108 ) * vulkan: fix mmq out of bounds reads, streamline outdated matmul host code * fix mul_mat_id quantization call * Fix compiler warnings	2025-11-09 09:52:57 +01:00
Jeff Bolz	80a6cf6347	vulkan: fuse mul_mat_id + mul (#17095 ) * vulkan: fuse mul_mat_id + mul This comes up in qwen3 moe. * split mul_mat_id fusion tests into a separate class	2025-11-09 09:48:42 +01:00
Jeff Bolz	53d7d21e61	vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978 ) * vulkan: Use spec constants for conv2d s/d/p and kernel W/H Also add some additional unroll hints, which seems to help. * lock around map lookup	2025-11-08 13:24:29 -06:00
SavicStefan	b8a5cfd11a	vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (#16636 ) Signed-off-by: Stefan Savic <stefan.savic@huawei.com> Co-authored-by: Stefan Savic <stefan.savic@huawei.com>	2025-11-08 09:28:22 +01:00
Jeff Bolz	b4e335d8dc	vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977 ) This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.	2025-11-08 08:52:15 +01:00
Jeff Bolz	d6fe40fa00	vulkan: Fix test-thread-safety crashes (#17024 ) The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the same time, which needs to hold the lock. To be safe, hold the lock for all of ggml_vk_load_shaders.	2025-11-08 08:39:45 +01:00
Acly	ac76d36201	vulkan : refactor buffer handling in vk_op_f32 (#16840 ) * vulkan : refactor/simplify buffer handling in vk_op_* functions * Combine UMA handling into ggml_vk_tensor_subbuffer	2025-11-07 21:08:50 +01:00
Jeff Bolz	a44d77126c	vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (#16919 )	2025-11-05 19:51:03 +01:00
Jeff Bolz	ad51c0a720	vulkan: remove the need for the dryrun (#16826 ) * vulkan: remove the need for the dryrun Allocate pipelines and descriptor sets when requested. Reallocate the prealloc buffers when needed, and flush any pending work before reallocating. For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed. * remove dryrun parameters	2025-11-04 13:28:17 -06:00
Jeff Bolz	5d8bb900bc	vulkan: Fix multi_add invalid descriptor usage (#16899 )	2025-11-01 06:52:14 +01:00
Jeff Bolz	2e76e01360	vulkan: fuse mul_mat+add and mul_mat_id+add_id (#16868 ) * vulkan: fuse mul_mat+add and mul_mat_id+add_id The fusion is only applied for the mat-vec mul paths. * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix 32b build --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-01 06:45:28 +01:00
Jeff Bolz	d2d931f173	vulkan: disable spirv-opt for rope shaders (#16872 )	2025-10-31 08:34:47 +01:00
Masato Nakasaka	2976b0374d	vulkan: Fix crash when FP16 mul_mat accumulation is not supported (#16796 ) * Experimenting crash fix * added assert for aborting and fixed comment * changed to check if a pipeline is empty or not * Moved function in class definition * replaced with is_empty * Modified is_empty to check only unaligned pipelines	2025-10-31 08:18:59 +01:00
Ruben Ortlam	d2a2673dd1	vulkan: fix shmem overrun in mmq id shader (#16873 ) * vulkan: fix shmem overrun in mmq id shader * metal : fix mul_mm_id --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-31 08:14:49 +01:00
JJJYmmm	d261223d24	model: add support for qwen3vl series (#16780 ) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit `f321b9fdf1`. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-30 16:19:14 +01:00
Jeff Bolz	052df28b0e	vulkan: Handle argsort with a large number of rows (#16851 )	2025-10-30 07:27:41 +01:00
Jeff Bolz	b9ce940177	vulkan: Fuse rope+set_rows (#16769 ) This pattern appears in a lot of models, the rope operation is applied right before storing into the KV cache (usually on the K tensor). Add a path to some of the rope shaders that computes the destination address based on the set_rows tensor. Compile variants of the shader with D_TYPE of f16 (the usual KV cache type). Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs the fourth for the row indices. Add fused_ops_write_mask to indicate which intermediate tensors need to write their results to memory. Skipping writing the roped K value helps to allow more nodes to run concurrently. Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It rarely starts out that way in the graph. Add new backend tests.	2025-10-29 15:13:10 -05:00
Jeff Bolz	10fcc41290	vulkan: Update topk_moe fusion to handle gpt's late softmax (#16656 ) * vulkan: Update topk_moe fusion to handle gpt's late softmax Based on #16649. * Add ggml_check_edges * Add sync logging to show fusion effects * handle clamp added in #16655 * Update ggml/src/ggml-impl.h Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-10-29 14:44:29 +01:00
Ruben Ortlam	bcf5bda6f5	Vulkan MMQ Integer Dot Refactor and K-Quant support (#16536 ) * vulkan: add mmq q2_k integer dot support * Refactor mmq caching * Reduce mmq register use * Load 4 quant blocks into shared memory in one step * Pack q2_k blocks into caches of 32 * Use 32-bit accumulators for integer dot matmul * Add q4_k mmq * Add q3_k mmq * Add q5_k mmq * Add q6_k mmq * Add mxfp4 mmq, enable MMQ MUL_MAT_ID * Fix mmv dm loads	2025-10-29 14:39:03 +01:00
Jeff Bolz	f549b0007d	vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (#16793 ) This lets the copy to the destination device use the host-visible vidmem optimization.	2025-10-29 09:53:04 +01:00
Acly	10640e31aa	ggml : fix interpolate with align-corners and ne=1 (#16700 ) * ggml : fix interpolate with align-corners and ne=1 * avoid division by zero if one of the spatial dimensions is 1 * cpu, cuda, opencl returned correct result anyway due to clamp * vulkan didn't clamp for align-corners so results were broken * fix clang warning	2025-10-27 21:50:22 +01:00
Gilad S.	3cfa9c3f12	vulkan: deduplicate Microsoft Direct3D12 devices (#16689 ) * fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver * style: indent * fix: decrease priority * fix: switch to `||`	2025-10-26 05:37:38 +01:00
Giuseppe Scrivano	f90b4a8efe	vulkan: delete dead code (#16732 ) ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-25 10:59:54 +02:00
Jeff Bolz	8423d01931	vulkan: Optimize SSM_SCAN (#16645 )	2025-10-25 07:04:12 +02:00
Jeff Bolz	fb349848f3	vulkan: Handle FA with all -inf mask values (#16447 )	2025-10-20 22:16:08 -05:00
Jeff Bolz	e56abd2098	vulkan: Implement topk_moe fused shader, ported from CUDA (#16641 ) This is similar to the CUDA shader from #16130, but doesn't use shared memory and handles different subgroup sizes.	2025-10-18 12:22:57 +02:00
Giuseppe Scrivano	3d4e86bbeb	vulkan: Add State Space Model (SSM) Operations Support (#16463 ) * vulkan: implement SSM scan operation Add State Space Model scan operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> * vulkan: implement SSM conv operation Add State Space Model conv operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-17 14:23:47 +02:00
Jeff Bolz	b19491599d	vulkan: fix debug build (add_rms_len/data not found) (#16624 )	2025-10-17 09:31:04 +02:00
SavicStefan	ffa059034c	vulkan: Add ACC_TYPE_VEC2 implementation (#16203 ) Signed-off-by: Stefan Savic <stefan.savic@huawei.com> Co-authored-by: Stefan Savic <stefan.savic@huawei.com>	2025-10-14 19:18:05 +02:00
Jeff Bolz	4258e0cfe7	vulkan: Support FA with K/V in F32 (#16543 )	2025-10-14 15:53:37 +02:00
Jeff Bolz	7ea15bb64c	vulkan: Improve build time for MSVC (#16545 ) Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel. Enable /MP so source files are compiled in parallel.	2025-10-14 14:51:36 +02:00
Eve	86df2c9ae4	vulkan: use a more appropriate amount of threads when generating shaders (#16418 ) * use a more flexible amount of threads * fix windows compile and 0 thread case * nominmax	2025-10-04 22:04:27 +02:00
Acly	e29acf74fe	vulkan : incremental shader builds (#16341 ) * vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-04 11:42:56 +02:00
Jeff Bolz	2aaf0a2a20	vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#16354 ) * vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers. The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed beyond that limit. This allows > 4GB buffers to be allocated on some implementations (e.g. NVIDIA) and tensors this large can be used for im2col and mul_mat. For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange. I'm not sure this check is ideal, but we always use these buffers as a single full size binding and the limit may be smaller than maxMemoryAllocationSize or maxBufferSize, so I think this is reasonable. Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range. The maxStorageBufferRange may be smaller than the maxBufferSize or maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and it's invalid usage if VK_WHOLE_SIZE computes a range larger than maxStorageBufferRange. With this change, it should be possible to generate videos using wan networks in stable-diffusion.cpp. * vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull	2025-10-03 12:50:46 +02:00
Jeff Bolz	0e1f838556	vulkan: Fix FA coopmat1 invalid array indexing (#16365 ) When computing sinks, the cm1 shader was looping r from 0 to Br rather than to rows_per_thread. I must have copied this from the scalar path (where it is correct), and somehow it wasn't causing failures on current drivers.	2025-10-03 11:52:46 +02:00
Jeff Bolz	e308efda8e	vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316 )	2025-10-03 10:33:08 +02:00
Eve	132d673554	vulkan: make ggml_vk_default_dispatcher support older vulkan headers (#16345 ) * make ggml_vk_default_dispatcher support older vulkan headers * simplify with using	2025-10-01 09:56:36 +02:00
Jeff Bolz	92cd103f62	vulkan: Fix validation failure in quantized flash attention (#16292 )	2025-09-29 06:50:37 +02:00
Jeff Bolz	d8359f5fde	vulkan: 64-bit im2col (#16135 ) * vulkan: 64-bit im2col Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp. * fix validation error for large im2col	2025-09-28 08:38:37 +02:00
Jeff Bolz	1384abf8b8	vulkan: handle mat_mul with A matrix > 4GB (#16176 ) * vulkan: handle mat_mul with A matrix > 4GB This change splits mat_mul operations with huge A matrix into chunks in the M dimension. This works well for stable-diffusion use cases where the im2col matrix has very large M. Fix the order of setting the stride in mul_mm_cm2 - setting the dimension clobbers the stride, so stride should be set after. * build fixes	2025-09-27 20:36:34 -05:00
Jeff Bolz	e6d65fb02d	vulkan: support arbitrary KV dimension in flash attention (#16160 ) The "Clamp" spec constant is already based on whether KV is a multiple of Bc, so use that to control whether bounds checking is performed. Add bounds checking to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V tensors are already optionally clamped, nothing else needed to be changed).	2025-09-27 22:43:39 +02:00
Acly	8656f5de68	vulkan : make the vulkan.hpp dynamic dispatcher instance private (#16224 ) * don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same	2025-09-27 22:41:03 +02:00
Dmytro Minochkin	0499b29c6f	vulkan: throw system error instead of SIGABRT during init on older devices (#16156 ) * Throw system error on old Vulkan driver rather than SIGABRT * Optionally handle any potential error in vulkan init	2025-09-27 18:26:46 +02:00
Jeff Bolz	3f81b4e91c	vulkan: support GET_ROWS for k-quants (#16235 ) The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few changes - add a_offset and divide iqs by 2. It's probably possible to call these functions from mul_mm_funcs and avoid the duplication, but I didn't go that far in this change.	2025-09-27 12:36:11 +02:00
Sigbjørn Skjæret	3ecb2f671a	ggml : implement set_rows with i32 index (#16159 ) * implement set_rows with i32 index * template fix * test quantized path warnings-- * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * forgotten name change * deduplicate cuda/sycl and test-fix * indent++ * vulkan: support set_rows with i32 index type (#16162) * disable i32 index for webgpu for now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-22 19:13:00 +02:00
Shin-myoung-serp	96fdca043b	Vulkan: add conv_transpose_2d operation (#16022 ) * Vulkan: add conv_transpose_2d operation * Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L) * Vulkan: fix incorrect indentation in conv_transpose_2d shader * Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation * Vulkan: revert the order of the index calculation and bound check in conv_2d shader * Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation. * Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.	2025-09-22 10:04:01 +02:00
Jeff Bolz	a20d810d79	vulkan: add RTE variants of exp shader (#16165 ) This fixes some failures on Turing where "round to zero" rounds to the max f16 value but the CPU reference value is infinite.	2025-09-22 07:37:17 +02:00
Ruben Ortlam	9073a73d82	vulkan: vec dot matrix multiplication fix (#16151 ) * vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching * add odd m/n + odd k test with batching	2025-09-22 07:22:43 +02:00
Giuseppe Scrivano	1eeb523c3e	vulkan: optimize UMA buffer operations and fix driver hangs (#16059 ) * vulkan: optimize UMA buffer operations and fix driver hangs The previous implementation was blocking the GPU for extended periods, causing the i915 driver to reset the context due to the hangcheck protection. [32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114] [32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang * vulkan: implement deferred_memset on UMA --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-09-21 08:31:55 +02:00
Jeff Bolz	5bb4a3edec	vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (#16086 )	2025-09-21 08:23:37 +02:00
Ruben Ortlam	803dac2e48	vulkan: use vec dot for matrix matrix multiplications (#16056 ) * vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions * use fma instead of dot to fix Nvidia and Apple performance issues	2025-09-20 10:42:56 +02:00
Jeff Bolz	c0b45097c3	rename optimize_graph to graph_optimize (#16082 )	2025-09-18 13:46:17 -05:00
Eve	cb5bb6cc05	vulkan: automatically remove unsupported devices (#15976 ) * remove unsupported vulkan devices * make this happen during selection instead * pass by reference	2025-09-17 09:35:37 +02:00
Daniel Bevenius	3913f8730e	ggml : fix padding in timestep embedding kernels (#15932 ) * ggml : remove adding extra dim timestep embedding This commit updates the ggml_timestep_embedding function to no longer add an extra dimension when the specified dimension is odd. The motivation for this change is that this introduces an unnecessary dimension when the dimension is odd, which caused an issue in the kernels which were not expecting this extra dimension and it resulted in uninitialized memory for the second to last dimension. * ggml-cuda : fix padding in timestep embedding kernel This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension. * ggml-metal : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel * ggml-opencl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-sycl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-vulkan : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-cpu : fix padding in timestep embedding function This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension.	2025-09-16 15:25:57 +02:00
Ruben Ortlam	261e6a20ff	Vulkan: Clean up mul_mm shader (#15987 ) * vulkan: move mul_mm dequantization steps into a separate file and functions * improve mul_mm vector load code * fix debug mode issues and warnings	2025-09-14 16:56:28 +02:00
Jeff Bolz	aa0c461efe	vulkan: fix failing dequant shaders (#15862 ) * vulkan: fix failing dequant shaders * add missing const	2025-09-13 17:29:43 +02:00
Jeff Bolz	b9c9c9f789	vulkan: initialize vulkan-hpp to allow using extension function pointers (#15705 ) Use this to query register count for shader compiles on NVIDIA. Currently this is only for performance debug, but it could eventually be used in some heuristics like split_k.	2025-09-13 17:23:30 +02:00
Ruben Ortlam	304ac5693d	Vulkan iGPU device selection overhaul and PCI ID API support (#15947 ) * vulkan: implement ggml igpu device type, implement pci id support * fix compiler warning * prevent printf overflow warning	2025-09-12 13:24:21 +02:00
Mathieu Baudier	6c88ad8fa7	vulkan: Make device memory check more portable (#15939 )	2025-09-12 09:06:20 +02:00
Diego Devesa	360d6533db	ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797 ) * ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type ggml-backend : add device id to device props llama : only use iGPU devices if there are no GPU devices llama : do not use multiple devices from different backends with the same device id	2025-09-11 22:47:38 +02:00
Ruben Ortlam	ae355f6f71	vulkan: throw the oom error instead of no memory type found (#15905 )	2025-09-09 22:26:03 +02:00
Jeff Bolz	4f63cd705c	vulkan: Fix OOB accesses in soft_max_back (#15861 )	2025-09-09 14:41:15 +02:00
lksj92hs	ed54e32558	Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (#15886 )	2025-09-09 14:01:15 +02:00
Jeff Bolz	e68aa10d8f	vulkan: sort graph to allow more parallel execution (#15850 ) * vulkan: sort graph to allow more parallel execution Add a backend proc to allow the backend to modify the graph. The vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes, doesn't change the contents of any of them. With #15489, this reduces the number of synchronizations needed. * call optimize_graph per-split	2025-09-09 02:10:07 +08:00
Xuan-Son Nguyen	9fcb29f22f	ggml: allow casting between f32 and i32 (#15783 ) * ggml: allow casting between f32 and i32 * fix cuda * add vulkan * fix CPU non-cont * add non-cont test case * add note * extend test number range * correct note * add cont version for vulkan	2025-09-08 12:33:01 +02:00
Jeff Bolz	3976dfbe00	vulkan: support im2col_3d (#15795 )	2025-09-07 13:50:26 -05:00
Jeff Bolz	c97b5e5854	vulkan: Support pad_ext (#15794 )	2025-09-07 19:00:49 +02:00
Jeff Bolz	267e99867f	vulkan: Use larger loads in scalar/coopmat1 matmul (#15729 ) I think glslang will translate an access like x[i][1].z to OpAccessChain ... x, i, 1, 2 OpLoad float16_t ... rather than loading all of x[i] in a single OpLoad. Change the code to explicitly load the vector/matrix.	2025-09-07 18:53:07 +02:00
leejet	0a1b3982cd	ggml: add ops for WAN video model (cuda && cpu) (#15669 ) * add conv3d support * add ggml_pad_ext for cpu & cuda backend * cuda/cpu: add im2col_3d support * cuda: make im2col a little faster * fix cuda pad/scale/im2col3d * make im2col_3d faster * gguf: support loading tensors which n_dims > GGML_MAX_DIMS * fix cuda get_rows * avoid ggml_conv_3d conflict * correct GGML_OP_COUNT assertion * avoid build failure * avoid build failure on MacOS * cuda: remove unnecessary MIN define * fix cpu im2col_3d * adjust the code style * cuda: use simpler loop in get_rows * add test_im2col_3d to test-backend-ops * test-backend-ops.cpp: remove trailing whitespace * cpu: im2col_3d support non continuous src Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * fix test_im2col_3d * remove unused variables * cuda: get_rows: dfloat2 -> float2 * add test_pad_ext to test-backend-ops.cpp * add gguf_init_from_file_ext impl * Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS" This reverts commit `d8377a0a37`. * Revert "add gguf_init_from_file_ext impl" This reverts commit `d9f1d13208`. * update ggml_backend_vk_device_supports_op * fix ggml_backend_vk_device_supports_op * update other backend supports op for ggml_pad_ext * metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-04 10:38:49 +02:00
Ruben Ortlam	dff7551bfd	vulkan: fix mmv subgroup16 selection (#15775 )	2025-09-03 21:55:10 +01:00
Jeff Bolz	0fce7a1248	vulkan: don't use std::string in load_shaders, to improve compile time (#15724 ) * vulkan: don't use std::string in load_shaders, to improve compile time * keep the string version for those calls that use it	2025-09-03 20:33:15 +02:00
Daniel Bevenius	8227695d7a	vulkan : update ggml_vk_instance_validation_ext_available (#15666 ) * vulkan : update ggml_vk_instance_validation_ext_available This commit updates ggml_vk_instance_validation_ext_available() to check for VK_EXT_validation_features instead of VK_KHR_portability_enumeration. Based on how the returned boolean is used later in the code (to enable both the validation layer and the VK_EXT_validation_features extension), it appears the function may have been intended to check for the validation layer features extension. * remove try/catch This was a left over from a previous iteration where I was explicitly quering for a specific validation layer first, which would throw. * update warning message about validation layers	2025-09-03 20:24:50 +02:00
Shin-myoung-serp	0014fb4add	ggml vulkan: add hardsigmoid and hardswish operations (#15762 )	2025-09-03 20:22:55 +02:00
Ruben Ortlam	0a2a3841e8	vulkan: fix shaders gen when no integer dot is available (#15740 )	2025-09-02 16:02:26 +02:00
Jeff Bolz	25f1045f07	vulkan: Fix macro parameter order for f32 matmul shaders (#15716 )	2025-09-02 14:37:01 +08:00
Gilad S.	d4d8dbe383	vulkan: use memory budget extension to read memory usage (#15545 ) * vulkan: use memory budget extension to read memory usage * fix: formatting and names * formatting * fix: detect and cache memory budget extension availability on init * fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available * style: lints	2025-09-01 21:17:42 +02:00
Jeff Bolz	35a42edac8	vulkan: add missing clamps in new mul_mat_id paths (#15702 ) This is a missing interaction between #15546 and #15652	2025-09-01 21:01:10 +02:00
Ruben Ortlam	fec7911f8f	vulkan: disable large mmv subgroups on older Nvidia GPUs (#15717 )	2025-09-01 20:58:35 +02:00
Ruben Ortlam	02c1813517	Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants (#14903 ) * vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants * vulkan: use subgroup operations for quantize_q8_1 shader * vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader * vulkan: use q8_1_x4 blocks in mul_mmq shader * vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec * vulkan: tune mul_mat_vecq performance for Intel * vulkan: fix quantizing issue when tensor is not divisible by 128 * vulkan: adapt integer dot mmv to mmv small m optimization (#15355) * vulkan: allow all subgroup modes for mmv and mmvq * vulkan: use prealloc intermediate reuse for mmvq path * vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090 * vulkan: adapt mmv quantize_y path to conditional sync logic * vulkan: disable q8_0 mmvq on Nvidia * vulkan: enable q8_0 on Nvidia pre-turing * fix prealloc sync condition * fix llvmpipe subgroup 8 issue	2025-09-01 16:19:07 +02:00
Jeff Bolz	bbbf5ecccb	vulkan: handle large sizes for get_rows (#15686 )	2025-08-31 10:13:27 +02:00
Jeff Bolz	c37052ab4d	vulkan: mul_mat_id coopmat2 optimizations (#15546 ) * vulkan: mul_mat_id coopmat2 optimizations Add a path for when the tile fits in BN/2, similar to what we have for mul_mat. Only call fetch_scales/store_scales once per QUANT_K block, and once at the beginning in case start_k is not aligned. * Also add a path for BN/4 - worth a couple more percent	2025-08-31 09:06:43 +02:00
Daniel Bevenius	5c16b9c87d	vulkan : remove unused portability_enumeration_ext variable (#15679 ) This commit removes the portability_enumeration_ext variable from the ggml_vk_instance_portability_enumeration_ext_available function as it is initialized to false but never modified, making it redundant.	2025-08-31 08:46:42 +02:00
Jeff Bolz	b97c9edc59	vulkan: Allow fallback to sysmem memory when vidmem is full (#15649 ) * vulkan: Allow fallback to sysmem memory when vidmem is full * vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK	2025-08-31 08:30:54 +02:00
Jeff Bolz	94e82c7ead	vulkan: clamp matmul and FA results to the max finite value (#15652 ) * vulkan: clamp matmul and FA results to the max finite value * only clamp for fp16	2025-08-31 08:27:57 +02:00
Jeff Bolz	696fccf354	vulkan: Skip syncing for prealloc_y when it is reused (#15544 )	2025-08-30 11:11:22 +02:00
Jeff Bolz	34bdbbd7c2	vulkan: Remove splitting for mul_mat_id (#15568 ) row_ids only needs to hold the BN rows for the current tile.	2025-08-26 06:42:44 +02:00
Ruben Ortlam	4d917cd4f6	vulkan: fix min subgroup 16 condition for mmid subgroup optimization (#15565 )	2025-08-25 17:56:59 +02:00
Ruben Ortlam	043fb27d38	vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (#15524 ) * vulkan: use subgroup function for mul_mat_id shader even without coopmat * vulkan: fix compile warnings * vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id * vulkan: disable subgroup mul_mat_id on devices with subgroups < 16	2025-08-24 19:36:36 +02:00
Jeff Bolz	c9a24fb932	vulkan: Support FA with any multiple of 8 head sizes (#15537 ) The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped. Store the FA pipelines in a map, indexed by the pipeline state.	2025-08-24 11:24:25 +02:00
Ruben Ortlam	a9c6ffcbfa	vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (#15526 )	2025-08-24 10:48:53 +02:00
Jeff Bolz	e78cf0d4b1	vulkan: workaround MoltenVK compile failure in multi_add (#15506 ) * vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <picard12@live.de>	2025-08-24 10:48:21 +02:00
Jeff Bolz	611f419cff	vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (#15281 ) * vulkan: optimize rms_norm, and allow the work to spread across multiple SMs There are really two parts to this change: (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations. (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply. The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums. * Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up. * complete rebase against fused adds - multi_add shader can also compute partial sums * fix validation errors * disable add_rms_fusion for Intel due to possible driver bug * resolve against #15489, sync after clearing partial sums	2025-08-23 13:16:17 -05:00
Jeff Bolz	289bf4113e	vulkan: Rewrite synchronization to allow some overlap between nodes (#15489 ) Track a list of nodes that need synchronization, and only sync if the new node depends on them (or overwrites them). This allows some overlap which can improve performance, and centralizes a big chunk of the synchronization logic. The remaining synchronization logic involves writes to memory other than the nodes, e.g. for dequantization or split_k. Each of these allocations has a bool indicating whether they were in use and need to be synced. This should be checked before they are written to, and set to true after they are done being consumed.	2025-08-23 09:33:36 +02:00
Acly	0a9b43e507	vulkan : support ggml_mean (#15393 ) * vulkan : support ggml_mean * vulkan : support sum, sum_rows and mean with non-contiguous tensors * vulkan : fix subbuffer size not accounting for misalign offset * tests : add backend-op tests for non-contiguous sum_rows * cuda : require contiguous src for SUM_ROWS, MEAN support * sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support * require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader	2025-08-23 08:35:21 +02:00
Jeff Bolz	330c3d2d21	vulkan: optimize mul_mat_id loading row ids into shared memory (#15427 ) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two.	2025-08-23 08:31:54 +02:00
Acly	97ae5961a4	vulkan : support conv_2d_dw with f16 weights (#15392 )	2025-08-21 17:01:51 +02:00
Dong Won Kim	20c2dac8c6	vulkan: add exp operation (#15456 ) Co-authored-by: aeseulgi <kim2h7903@gmail.com>	2025-08-21 17:00:16 +02:00
Jeff Bolz	96452a3fa4	vulkan: Reuse conversion results in prealloc_y (#15410 ) * vulkan: Reuse conversion results in prealloc_y Cache the pipeline and tensor that were most recently used to fill prealloc_y, and skip the conversion if the current pipeline/tensor match. * don't use shared pointer for prealloc_y_last_pipeline_used	2025-08-21 16:55:00 +02:00
Jeff Bolz	fec9519802	vulkan: shorten pipeline name strings (#15431 ) These detailed strings were causing increased build time on gcc.	2025-08-20 16:33:14 +02:00
Jeff Bolz	ae532eac2c	vulkan: disable spirv-opt for bfloat16 shaders (#15352 )	2025-08-18 07:56:29 +02:00
Jeff Bolz	21c17b5bef	vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355 ) * vulkan: Use larger workgroups for mul_mat_vec when M is small Also use subgroup instructions for (part of) the reduction when supported. Without this, the more expensive reductions would eat into the benefits of the larger workgroups. * update heuristic for amd/intel Co-authored-by: 0cc4m <picard12@live.de> --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-08-17 18:08:57 +02:00
Dong Won Kim	19f4decae0	vulkan: support sqrt (#15370 )	2025-08-17 16:03:09 +02:00

460 Commits