Commit Graph

424 Commits

Author SHA1 Message Date
Ruben Ortlam 4dca015b7e
vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (#17285) 2025-11-15 15:18:58 +01:00
Giuseppe Scrivano 1568d13c2c
vulkan: implement ABS and NEG (#17245)
* docs: update Vulkan ops

* vulkan: add NEG op

* vulkan: add ABS op

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-15 12:00:29 +01:00
Jeff Bolz 439342ea0b
vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (#17244)
* vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths

* set allow_misalign
2025-11-15 11:56:15 +01:00
Jeff Bolz 234ae7d7bd
vulkan: skip all-negative-inf blocks in FA (#17186) 2025-11-15 10:37:25 +01:00
Jeff Bolz 38eaf32af1
vulkan: change graph_compute to be async and enable get_tensor_async (#17158)
* vulkan: change graph_compute to be async and enable get_tensor_async

This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.

Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.

The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.

The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.

* fix thread safety errors

* teardown context cleanly

* Handle async read to non-pinned dst
2025-11-15 09:06:41 +01:00
Ruben Ortlam a19bd6f7ce
vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (#17219)
* vulkan: remove shell call from vulkan-shaders-gen tool

* use string vector for command execution

* Fix condition

* use string, remove const_cast

* Fix dependency file quotation on Windows

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-11-13 14:51:21 +01:00
Eve 7d019cff74
disable rms norm mul rope for chips with no fp16 rte (#17134) 2025-11-11 12:53:30 -06:00
Ruben Ortlam f117be185e
vulkan: check glslc executable string (#17144) 2025-11-10 16:59:26 +01:00
Ruben Ortlam 85234a4b3a
vulkan: fix validation issue introduced by #16868 (#17145) 2025-11-10 16:59:10 +01:00
Acly 1032256ec9
cuda/vulkan : bicubic interpolation (#17022)
* vulkan : implement upscale with bicubic interpolation

* cuda : implement upscale with bicubic interpolation

* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests

* adapt OpenCL backend to not support the OP in that case so tests don't fail

* print scale mode & flags in test-backend-ops
2025-11-10 10:19:39 +01:00
Ruben Ortlam 392e09a608
vulkan: fix memory allocations (#17122) 2025-11-09 16:14:41 +01:00
Ruben Ortlam 7f3e9d339c
vulkan: iGPU memory reporting fix (#17110)
* vulkan: use all device-local heaps for memory availability reporting

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>

* use all available heaps for iGPU memory reporting

* Allow multiple memory types per buffer request for devices with split heaps

---------

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-09 09:54:47 +01:00
Ruben Ortlam 8a3519b708
vulkan: fix mmq out of bounds reads (#17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings
2025-11-09 09:52:57 +01:00
Jeff Bolz 80a6cf6347
vulkan: fuse mul_mat_id + mul (#17095)
* vulkan: fuse mul_mat_id + mul

This comes up in qwen3 moe.

* split mul_mat_id fusion tests into a separate class
2025-11-09 09:48:42 +01:00
Jeff Bolz 53d7d21e61
vulkan: Use spec constants for conv2d s/d/p and kernel W/H (#16978)
* vulkan: Use spec constants for conv2d s/d/p and kernel W/H

Also add some additional unroll hints, which seems to help.

* lock around map lookup
2025-11-08 13:24:29 -06:00
SavicStefan b8a5cfd11a
vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (#16636)
Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>
2025-11-08 09:28:22 +01:00
Jeff Bolz b4e335d8dc
vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977)
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
2025-11-08 08:52:15 +01:00
Jeff Bolz d6fe40fa00
vulkan: Fix test-thread-safety crashes (#17024)
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
2025-11-08 08:39:45 +01:00
Acly ac76d36201
vulkan : refactor buffer handling in vk_op_f32 (#16840)
* vulkan : refactor/simplify buffer handling in vk_op_* functions

* Combine UMA handling into ggml_vk_tensor_subbuffer
2025-11-07 21:08:50 +01:00
Jeff Bolz a44d77126c
vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (#16919) 2025-11-05 19:51:03 +01:00
Jeff Bolz ad51c0a720
vulkan: remove the need for the dryrun (#16826)
* vulkan: remove the need for the dryrun

Allocate pipelines and descriptor sets when requested.

Reallocate the prealloc buffers when needed, and flush any pending work
before reallocating.

For rms_partials and total_mul_mat_bytes, use the sizes computed the last time
the graph was executed.

* remove dryrun parameters
2025-11-04 13:28:17 -06:00
Jeff Bolz 5d8bb900bc
vulkan: Fix multi_add invalid descriptor usage (#16899) 2025-11-01 06:52:14 +01:00
Jeff Bolz 2e76e01360
vulkan: fuse mul_mat+add and mul_mat_id+add_id (#16868)
* vulkan: fuse mul_mat+add and mul_mat_id+add_id

The fusion is only applied for the mat-vec mul paths.

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix 32b build

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-01 06:45:28 +01:00
Jeff Bolz d2d931f173
vulkan: disable spirv-opt for rope shaders (#16872) 2025-10-31 08:34:47 +01:00
Masato Nakasaka 2976b0374d
vulkan: Fix crash when FP16 mul_mat accumulation is not supported (#16796)
* Experimenting crash fix

* added assert for aborting and fixed comment

* changed to check if a pipeline is empty or not

* Moved function in class definition

* replaced with is_empty

* Modified is_empty to check only unaligned pipelines
2025-10-31 08:18:59 +01:00
Ruben Ortlam d2a2673dd1
vulkan: fix shmem overrun in mmq id shader (#16873)
* vulkan: fix shmem overrun in mmq id shader

* metal : fix mul_mm_id

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-31 08:14:49 +01:00
JJJYmmm d261223d24
model: add support for qwen3vl series (#16780)
* support qwen3vl series.

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>

* bugfix: fix the arch check for qwen3vl-moe.

* use build_ffn

* optimize deepstack structure

* optimize deepstack feature saving

* Revert "optimize deepstack feature saving" for temporal fix

This reverts commit f321b9fdf1.

* code clean

* use fused qkv in clip

* clean up / rm is_deepstack_layers for simplification

* add test model

* move test model to "big" section

* fix imrope check

* remove trailing whitespace

* fix rope fail

* metal : add imrope support

* add imrope support for sycl

* vulkan: add imrope w/o check

* fix vulkan

* webgpu: add imrope w/o check

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix tensor mapping

---------

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 16:19:14 +01:00
Jeff Bolz 052df28b0e
vulkan: Handle argsort with a large number of rows (#16851) 2025-10-30 07:27:41 +01:00
Jeff Bolz b9ce940177
vulkan: Fuse rope+set_rows (#16769)
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).

Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).

Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.

Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.

Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.

Add new backend tests.
2025-10-29 15:13:10 -05:00
Jeff Bolz 10fcc41290
vulkan: Update topk_moe fusion to handle gpt's late softmax (#16656)
* vulkan: Update topk_moe fusion to handle gpt's late softmax

Based on #16649.

* Add ggml_check_edges

* Add sync logging to show fusion effects

* handle clamp added in #16655

* Update ggml/src/ggml-impl.h

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-10-29 14:44:29 +01:00
Ruben Ortlam bcf5bda6f5
Vulkan MMQ Integer Dot Refactor and K-Quant support (#16536)
* vulkan: add mmq q2_k integer dot support

* Refactor mmq caching

* Reduce mmq register use

* Load 4 quant blocks into shared memory in one step

* Pack q2_k blocks into caches of 32

* Use 32-bit accumulators for integer dot matmul

* Add q4_k mmq

* Add q3_k mmq

* Add q5_k mmq

* Add q6_k mmq

* Add mxfp4 mmq, enable MMQ MUL_MAT_ID

* Fix mmv dm loads
2025-10-29 14:39:03 +01:00
Jeff Bolz f549b0007d
vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (#16793)
This lets the copy to the destination device use the host-visible
vidmem optimization.
2025-10-29 09:53:04 +01:00
Acly 10640e31aa
ggml : fix interpolate with align-corners and ne=1 (#16700)
* ggml : fix interpolate with align-corners and ne=1

* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken

* fix clang warning
2025-10-27 21:50:22 +01:00
Gilad S. 3cfa9c3f12
vulkan: deduplicate Microsoft Direct3D12 devices (#16689)
* fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver

* style: indent

* fix: decrease priority

* fix: switch to `||`
2025-10-26 05:37:38 +01:00
Giuseppe Scrivano f90b4a8efe
vulkan: delete dead code (#16732)
ggml_vk_create_buffer_temp is not used anywhere, and it is the only
caller for ggml_vk_pool_malloc.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-10-25 10:59:54 +02:00
Jeff Bolz 8423d01931
vulkan: Optimize SSM_SCAN (#16645) 2025-10-25 07:04:12 +02:00
Jeff Bolz fb349848f3
vulkan: Handle FA with all -inf mask values (#16447) 2025-10-20 22:16:08 -05:00
Jeff Bolz e56abd2098
vulkan: Implement topk_moe fused shader, ported from CUDA (#16641)
This is similar to the CUDA shader from #16130, but doesn't use shared memory
and handles different subgroup sizes.
2025-10-18 12:22:57 +02:00
Giuseppe Scrivano 3d4e86bbeb
vulkan: Add State Space Model (SSM) Operations Support (#16463)
* vulkan: implement SSM scan operation

Add State Space Model scan operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* vulkan: implement SSM conv operation

Add State Space Model conv operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-10-17 14:23:47 +02:00
Jeff Bolz b19491599d
vulkan: fix debug build (add_rms_len/data not found) (#16624) 2025-10-17 09:31:04 +02:00
SavicStefan ffa059034c
vulkan: Add ACC_TYPE_VEC2 implementation (#16203)
Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>
2025-10-14 19:18:05 +02:00
Jeff Bolz 4258e0cfe7
vulkan: Support FA with K/V in F32 (#16543) 2025-10-14 15:53:37 +02:00
Jeff Bolz 7ea15bb64c
vulkan: Improve build time for MSVC (#16545)
Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel.

Enable /MP so source files are compiled in parallel.
2025-10-14 14:51:36 +02:00
Eve 86df2c9ae4
vulkan: use a more appropriate amount of threads when generating shaders (#16418)
* use a more flexible amount of threads

* fix windows compile and 0 thread case

* nominmax
2025-10-04 22:04:27 +02:00
Acly e29acf74fe
vulkan : incremental shader builds (#16341)
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-10-04 11:42:56 +02:00
Jeff Bolz 2aaf0a2a20
vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#16354)
* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
2025-10-03 12:50:46 +02:00
Jeff Bolz 0e1f838556
vulkan: Fix FA coopmat1 invalid array indexing (#16365)
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
2025-10-03 11:52:46 +02:00
Jeff Bolz e308efda8e
vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316) 2025-10-03 10:33:08 +02:00
Eve 132d673554
vulkan: make ggml_vk_default_dispatcher support older vulkan headers (#16345)
* make ggml_vk_default_dispatcher support older vulkan headers

* simpilfy with using
2025-10-01 09:56:36 +02:00
Jeff Bolz 92cd103f62
vulkan: Fix validation failure in quantized flash attention (#16292) 2025-09-29 06:50:37 +02:00
Jeff Bolz d8359f5fde
vulkan: 64-bit im2col (#16135)
* vulkan: 64-bit im2col

Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.

* fix validation error for large im2col
2025-09-28 08:38:37 +02:00
Jeff Bolz 1384abf8b8
vulkan: handle mat_mul with A matrix > 4GB (#16176)
* vulkan: handle mat_mul with A matrix > 4GB

This change splits mat_mul operations with huge A matrix into chunks in the M
dimension. This works well for stable-diffusion use cases where the im2col
matrix has very large M.

Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so stride should be set after.

* build fixes
2025-09-27 20:36:34 -05:00
Jeff Bolz e6d65fb02d
vulkan: support arbitrary KV dimension in flash attention (#16160)
The "Clamp" spec constant is already based on whether KV is a multiple of Bc,
so use that to control whether bounds checking is performed. Add bounds checking
to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V
tensors are already optionally clamped, nothing else needed to be changed).
2025-09-27 22:43:39 +02:00
Acly 8656f5de68
vulkan : make the vulkan.hpp dynamic dispatcher instance private (#16224)
* don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same
2025-09-27 22:41:03 +02:00
Dmytro Minochkin 0499b29c6f
vulkan: throw system error instead of SIGABRT during init on older devices (#16156)
* Throw system error on old Vulkan driver rather than SIGABRT

* Optionally handle any potential error in vulkan init
2025-09-27 18:26:46 +02:00
Jeff Bolz 3f81b4e91c
vulkan: support GET_ROWS for k-quants (#16235)
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.
2025-09-27 12:36:11 +02:00
Sigbjørn Skjæret 3ecb2f671a
ggml : implement set_rows with i32 index (#16159)
* implement set_rows with i32 index

* template fix

* test quantized path

warnings--

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* forgotten name change

* deduplicate cuda/sycl and test-fix

* indent++

* vulkan: support set_rows with i32 index type (#16162)

* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-09-22 19:13:00 +02:00
Shin-myoung-serp 96fdca043b
Vulkan: add conv_transpose_2d operation (#16022)
* Vulkan: add conv_transpose_2d operation

* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)

* Vulkan: fix incorrect indentation in conv_transpose_2d shader

* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation

* Vulkan: revert the order of the index calculation and bound check in conv_2d shader

* Vulkan: explicity check push constants limit in supports_op() for conv_transpose_2d operation.

* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
2025-09-22 10:04:01 +02:00
Jeff Bolz a20d810d79
vulkan: add RTE variants of exp shader (#16165)
This fixes some failures on Turing where "round to zero" rounds to the max f16
value but the CPU reference value is infinite.
2025-09-22 07:37:17 +02:00
Ruben Ortlam 9073a73d82
vulkan: vec dot matrix multiplication fix (#16151)
* vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching

* add odd m/n + odd k test with batching
2025-09-22 07:22:43 +02:00
Giuseppe Scrivano 1eeb523c3e
vulkan: optimize UMA buffer operations and fix driver hangs (#16059)
* vulkan: optimize UMA buffer operations and fix driver hangs

The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.

[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang

* vulkan: implement deferred_memset on UMA

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-09-21 08:31:55 +02:00
Jeff Bolz 5bb4a3edec
vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (#16086) 2025-09-21 08:23:37 +02:00
Ruben Ortlam 803dac2e48
vulkan: use vec dot for matrix matrix multiplications (#16056)
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions

* use fma instead of dot to fix Nvidia and Apple performance issues
2025-09-20 10:42:56 +02:00
Jeff Bolz c0b45097c3
rename optimize_graph to graph_optimize (#16082) 2025-09-18 13:46:17 -05:00
Eve cb5bb6cc05
vulkan: automatically remove unsupported devices (#15976)
* remove unsupported vulkan devices

* make this happen during selection instead

* pass by reference
2025-09-17 09:35:37 +02:00
Daniel Bevenius 3913f8730e
ggml : fix padding in timestep embedding kernels (#15932)
* ggml : remove adding extra dim timestep embedding

This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.

The motivation for this change is that this introduces an unnecessary
dimension when the dimension is odd, which caused an issue in the
kernels which were not expecting this extra dimension and it resulted in
uninitialized memory for the second to last dimension.

* ggml-cuda : fix padding in timestep embedding kernel

This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.

* ggml-metal : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel

* ggml-opencl : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-sycl : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-vulkan : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-cpu : fix padding in timestep embedding function

This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
2025-09-16 15:25:57 +02:00
Ruben Ortlam 261e6a20ff
Vulkan: Clean up mul_mm shader (#15987)
* vulkan: move mul_mm dequantization steps into a separate file and functions

* improve mul_mm vector load code

* fix debug mode issues and warnings
2025-09-14 16:56:28 +02:00
Jeff Bolz aa0c461efe
vulkan: fix failing dequant shaders (#15862)
* vulkan: fix failing dequant shaders

* add missing const
2025-09-13 17:29:43 +02:00
Jeff Bolz b9c9c9f789
vulkan: initialize vulkan-hpp to allow using extension function pointers (#15705)
Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
2025-09-13 17:23:30 +02:00
Ruben Ortlam 304ac5693d
Vulkan iGPU device selection overhaul and PCI ID API support (#15947)
* vulkan: implement ggml igpu device type, implement pci id support

* fix compiler warning

* prevent printf overflow warning
2025-09-12 13:24:21 +02:00
Mathieu Baudier 6c88ad8fa7
vulkan: Make device memory check more portable (#15939) 2025-09-12 09:06:20 +02:00
Diego Devesa 360d6533db
ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797)
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type

ggml-backend : add device id to device props

llama : only use iGPU devices if there are no GPU devices

llama : do not use multiple devices from different backends with the same device id
2025-09-11 22:47:38 +02:00
Ruben Ortlam ae355f6f71
vulkan: throw the oom error instead of no memory type found (#15905) 2025-09-09 22:26:03 +02:00
Jeff Bolz 4f63cd705c
vulkan: Fix OOB accesses in soft_max_back (#15861) 2025-09-09 14:41:15 +02:00
lksj92hs ed54e32558
Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (#15886) 2025-09-09 14:01:15 +02:00
Jeff Bolz e68aa10d8f
vulkan: sort graph to allow more parallel execution (#15850)
* vulkan: sort graph to allow more parallel execution

Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.

With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
2025-09-09 02:10:07 +08:00
Xuan-Son Nguyen 9fcb29f22f
ggml: allow casting between f32 and i32 (#15783)
* ggml: allow casting between f32 and i32

* fix cuda

* add vulkan

* fix CPU non-cont

* add non-cont test case

* add note

* extend test number range

* correct note

* add cont version for vulkan
2025-09-08 12:33:01 +02:00
Jeff Bolz 3976dfbe00
vulkan: support im2col_3d (#15795) 2025-09-07 13:50:26 -05:00
Jeff Bolz c97b5e5854
vulkan: Support pad_ext (#15794) 2025-09-07 19:00:49 +02:00
Jeff Bolz 267e99867f
vulkan: Use larger loads in scalar/coopmat1 matmul (#15729)
I think glslang will translate an access like x[i][1].z to
OpAccessChain ... x, i, 1, 2
OpLoad float16_t ...

rather than loading all of x[i] in a single OpLoad. Change the
code to explicitly load the vector/matrix.
2025-09-07 18:53:07 +02:00
leejet 0a1b3982cd
ggml: add ops for WAN video model (cuda && cpu) (#15669)
* add conv3d support

* add ggml_pad_ext for cpu & cuda backend

* cuda/cpu: add im2col_3d support

* cuda: make im2col a little faster

* fix cuda pad/scale/im2col3d

* make im2col_3d faster

* gguf: support loading tensors which n_dims > GGML_MAX_DIMS

* fix cuda get_rows

* avoid ggml_conv_3d conflict

* correct GGML_OP_COUNT assertion

* avoid build failure

* avoid build failure on MacOS

* cuda: remove unnecessary MIN define

* fix cpu im2col_3d

* adjust the code style

* cuda: use simpler loop in get_rows

* add test_im2col_3d to test-backend-ops

* test-backend-ops.cpp: remove trailing whitespace

* cpu: im2col_3d support non continuous src

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* fix test_im2col_3d

* remove unused variables

* cuda: get_rows: dfloat2 -> float2

* add test_pad_ext to test-backend-ops.cpp

* add gguf_init_from_file_ext impl

* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"

This reverts commit d8377a0a37.

* Revert "add gguf_init_from_file_ext impl"

This reverts commit d9f1d13208.

* update ggml_backend_vk_device_supports_op

* fix ggml_backend_vk_device_supports_op

* update other backend supports op for ggml_pad_ext

* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-09-04 10:38:49 +02:00
Ruben Ortlam dff7551bfd
vulkan: fix mmv subgroup16 selection (#15775) 2025-09-03 21:55:10 +01:00
Jeff Bolz 0fce7a1248
vulkan: don't use std::string in load_shaders, to improve compile time (#15724)
* vulkan: don't use std::string in load_shaders, to improve compile time

* keep the string version for those calls that use it
2025-09-03 20:33:15 +02:00
Daniel Bevenius 8227695d7a
vulkan : update ggml_vk_instance_validation_ext_available (#15666)
* vulkan : update ggml_vk_instance_validation_ext_available

This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.

Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation layer features extension.

* remove try/catch

This was a left over from a previous iteration where I was explicitly
quering for a specific validation layer first, which would throw.

* update warning message about validation layers
2025-09-03 20:24:50 +02:00
Shin-myoung-serp 0014fb4add
ggml vulkan: add hardsigmoid and hardswish operations (#15762) 2025-09-03 20:22:55 +02:00
Ruben Ortlam 0a2a3841e8
vulkan: fix shaders gen when no integer dot is available (#15740) 2025-09-02 16:02:26 +02:00
Jeff Bolz 25f1045f07
vulkan: Fix macro parameter order for f32 matmul shaders (#15716) 2025-09-02 14:37:01 +08:00
Gilad S. d4d8dbe383
vulkan: use memory budget extension to read memory usage (#15545)
* vulkan: use memory budget extension to read memory usage

* fix: formatting and names

* formatting

* fix: detect and cache memory budget extension availability on init

* fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available

* style: lints
2025-09-01 21:17:42 +02:00
Jeff Bolz 35a42edac8
vulkan: add missing clamps in new mul_mat_id paths (#15702)
This is a missing interaction between #15546 and #15652
2025-09-01 21:01:10 +02:00
Ruben Ortlam fec7911f8f
vulkan: disable large mmv subgroups on older Nvidia GPUs (#15717) 2025-09-01 20:58:35 +02:00
Ruben Ortlam 02c1813517
Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants (#14903)
* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants

* vulkan: use subgroup operations for quantize_q8_1 shader

* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader

* vulkan: use q8_1_x4 blocks in mul_mmq shader

* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec

* vulkan: tune mul_mat_vecq performance for Intel

* vulkan: fix quantizing issue when tensor is not divisible by 128

* vulkan: adapt integer dot mmv to mmv small m optimization (#15355)

* vulkan: allow all subgroup modes for mmv and mmvq

* vulkan: use prealloc intermediate reuse for mmvq path

* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090

* vulkan: adapt mmv quantize_y path to conditional sync logic

* vulkan: disable q8_0 mmvq on Nvidia

* vulkan: enable q8_0 on Nvidia pre-turing

* fix prealloc sync condition

* fix llvmpipe subgroup 8 issue
2025-09-01 16:19:07 +02:00
Jeff Bolz bbbf5ecccb
vulkan: handle large sizes for get_rows (#15686) 2025-08-31 10:13:27 +02:00
Jeff Bolz c37052ab4d
vulkan: mul_mat_id coopmat2 optimizations (#15546)
* vulkan: mul_mat_id coopmat2 optimizations

Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.

Only call fetch_scales/store_scales once per QUANT_K block, and once at the
beginning in case start_k is not aligned.

* Also add a path for BN/4 - worth a couple more percent
2025-08-31 09:06:43 +02:00
Daniel Bevenius 5c16b9c87d
vulkan : remove unused portability_enumeration_ext variable (#15679)
This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function as it
is initialized to false but never modified, making it redundant.
2025-08-31 08:46:42 +02:00
Jeff Bolz b97c9edc59
vulkan: Allow fallback to sysmem memory when vidmem is full (#15649)
* vulkan: Allow fallback to sysmem memory when vidmem is full

* vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK
2025-08-31 08:30:54 +02:00
Jeff Bolz 94e82c7ead
vulkan: clamp matmul and FA results to the max finite value (#15652)
* vulkan: clamp matmul and FA results to the max finite value

* only clamp for fp16
2025-08-31 08:27:57 +02:00
Jeff Bolz 696fccf354
vulkan: Skip syncing for prealloc_y when it is reused (#15544) 2025-08-30 11:11:22 +02:00
Jeff Bolz 34bdbbd7c2
vulkan: Remove splitting for mul_mat_id (#15568)
row_ids only needs to hold the BN rows for the current tile.
2025-08-26 06:42:44 +02:00
Ruben Ortlam 4d917cd4f6
vulkan: fix min subgroup 16 condition for mmid subgroup optimization (#15565) 2025-08-25 17:56:59 +02:00
Ruben Ortlam 043fb27d38
vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (#15524)
* vulkan: use subgroup function for mul_mat_id shader even without coopmat

* vulkan: fix compile warnings

* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id

* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
2025-08-24 19:36:36 +02:00
Jeff Bolz c9a24fb932
vulkan: Support FA with any multiple of 8 head sizes (#15537)
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.

Store the FA pipelines in a map, indexed by the pipeline state.
2025-08-24 11:24:25 +02:00
Ruben Ortlam a9c6ffcbfa
vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (#15526) 2025-08-24 10:48:53 +02:00
Jeff Bolz e78cf0d4b1
vulkan: workaround MoltenVK compile failure in multi_add (#15506)
* vulkan: workaround MoltenVK compile failure in multi_add

* Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-24 10:48:21 +02:00
Jeff Bolz 611f419cff
vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (#15281)
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs

There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.

The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.

* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.

* complete rebase against fused adds - multi_add shader can also compute partial sums

* fix validation errors

* disable add_rms_fusion for Intel due to possible driver bug

* resolve against #15489, sync after clearing partial sums
2025-08-23 13:16:17 -05:00
Jeff Bolz 289bf4113e
vulkan: Rewrite synchronization to allow some overlap between nodes (#15489)
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.

The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether they were in use and need to be synced. This should be
checked before they are written to, and set to true after they are done being
consumed.
2025-08-23 09:33:36 +02:00
Acly 0a9b43e507
vulkan : support ggml_mean (#15393)
* vulkan : support ggml_mean

* vulkan : support sum, sum_rows and mean with non-contiguous tensors

* vulkan : fix subbuffer size not accounting for misalign offset

* tests : add backend-op tests for non-contiguous sum_rows

* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support

* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
2025-08-23 08:35:21 +02:00
Jeff Bolz 330c3d2d21
vulkan: optimize mul_mat_id loading row ids into shared memory (#15427)
- Spread the work across the whole workgroup. Using more threads seems to
far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
2025-08-23 08:31:54 +02:00
Acly 97ae5961a4
vulkan : support conv_2d_dw with f16 weights (#15392) 2025-08-21 17:01:51 +02:00
Dong Won Kim 20c2dac8c6
vulkan: add exp operation (#15456)
Co-authored-by: aeseulgi <kim2h7903@gmail.com>
2025-08-21 17:00:16 +02:00
Jeff Bolz 96452a3fa4
vulkan: Reuse conversion results in prealloc_y (#15410)
* vulkan: Reuse conversion results in prealloc_y

Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.

* don't use shared pointer for prealloc_y_last_pipeline_used
2025-08-21 16:55:00 +02:00
Jeff Bolz fec9519802
vulkan: shorten pipeline name strings (#15431)
These detailed strings were causing increased build time on gcc.
2025-08-20 16:33:14 +02:00
Jeff Bolz ae532eac2c
vulkan: disable spirv-opt for bfloat16 shaders (#15352) 2025-08-18 07:56:29 +02:00
Jeff Bolz 21c17b5bef
vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-17 18:08:57 +02:00
Dong Won Kim 19f4decae0
vulkan: support sqrt (#15370) 2025-08-17 16:03:09 +02:00
Jeff Bolz de5627910d
vulkan: Optimize argsort (#15354)
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
2025-08-17 10:41:45 +02:00
Jeff Bolz 1fe00296f5
vulkan: fuse adds (#15252)
* vulkan: fuse adds

Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.

* check runtimeDescriptorArray feature

* disable multi_add for Intel due to likely driver bug
2025-08-16 11:48:22 -05:00
Jeff Bolz de2192794f
vulkan: Support mul_mat_id with f32 accumulators (#15337)
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id

* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up

- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).
2025-08-16 11:18:31 +02:00
Jeff Bolz 2e2b22ba66
vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id (#15334) 2025-08-16 10:58:38 +02:00
Georgi Gerganov 5edf1592fd
vulkan : fix out-of-bounds access in argmax kernel (#15342)
ggml-ci
2025-08-15 16:16:36 +02:00
Georgi Gerganov db3010bd23
vulkan : fix compile warnings on macos (#15340)
ggml-ci
2025-08-15 15:28:28 +02:00
Jeff Bolz 863d341eeb
vulkan: perf_logger improvements (#15246)
* vulkan: perf_logger improvements

- Account for batch dimension in flops calculation.
- Fix how "_VEC" is detected for mat_mul_id.
- Fix "n" dimension for mat_mul_id (in case of broadcasting).
- Include a->type in name.

* use <=mul_mat_vec_max_cols rather than ==1
2025-08-14 08:38:10 -05:00
Jonathan Graehl 5cdb27e091
finetune: SGD optimizer, more CLI args (#13873)
* examples/finetune -opt SGD (stochastic gradient descent) memory opt

add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.

support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)

(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.

note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence

new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)

cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)

since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values);  tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param store weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-14 12:03:57 +02:00
AN Long cd6983d56d
ggml : fix field name when new ggml_backend (#14944) 2025-08-08 14:37:22 +02:00
Jeff Bolz c4f53563df
vulkan: support fattn sinks (#15126) 2025-08-07 22:44:20 +02:00
Jeff Bolz a0552c8bee
vulkan: Add env var to disable host visible vidmem (#15109) 2025-08-07 22:07:11 +02:00
Georgi Gerganov fd1234cb46
llama : add gpt-oss (#15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
Jeff Bolz 5aa1105da2
vulkan: fix build when using glslang that does not support coopmat2 (#15062) 2025-08-04 07:09:19 +02:00
Jeff Bolz 6c7a441161
vulkan: Use coopmat2 for conv2d (#14982) 2025-08-03 14:23:57 +02:00
Jeff Bolz 4cb208c93c
vulkan: coopmat2 mul_mat optimizations (#14934)
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
  interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 to when >1/2 and <=2/3 of the SMs would hae been used
2025-08-02 11:21:37 +02:00
Jeff Bolz ec0b18802c
vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015) 2025-08-02 10:48:30 +02:00
Jeff Bolz a9f7541ec2
vulkan: optimizations for direct convolution (#14933)
* vulkan: optimizations for direct convolution

- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
  the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.

* Three tiles sizes for CONV_2D, and a heuristic to choose

* reallow collectives for pre-Turing

* make SHMEM_PAD a spec constant

* fixes for intel perf - no shmem padding, placeholder shader core count

* shader variants with/without unrolling

* 0cc4m's fixes for AMD perf

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-02 09:57:04 +02:00
Ruben Ortlam e08a98826b
Vulkan: Fix minor debug mode issues (#14899)
* vulkan: fix debug mode issues

* vulkan: remove broken check_results GGML_OP_SET_ROWS support
2025-07-31 17:46:54 +02:00
Kai Pastor 73a8e5ca03 vulkan : fix 32-bit builds (ggml/1313)
The pipeline member can be cast to VkPipeline.
This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit.
Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.
2025-07-30 17:33:11 +03:00
Erik Scholz 89d1029559
vulkan : add fp16 support for the conv_2d kernel (#14872)
* add f16 to conv_2d testing
* weaken conv2d test error threshold
2025-07-27 12:04:33 +02:00
Jeff Bolz f1a4e72de5
vulkan: skip empty set_rows to avoid invalid API usage (#14860) 2025-07-27 11:05:34 +02:00
Jeff Bolz 84712b6043
vulkan: fix rms_norm_mul to handle broadcasting dim0 (#14817) 2025-07-22 17:35:21 +02:00
Jeff Bolz c2e058f1b4
vulkan/cuda: Fix im2col when KW!=KH (#14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-21 13:35:40 +02:00
Ervin Áron Tasnádi a979ca22db
ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col),

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* * Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* * Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-19 21:59:08 +02:00
Peter0x44 d4b91ea7b2
vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (#14707) 2025-07-19 17:58:03 +02:00
0cc4m 83f5872404
Vulkan: Fix fprintf format-security warning (#14770) 2025-07-19 17:47:53 +02:00
Jeff Bolz ba1ceb3456
vulkan: fix noncontig check for mat_mul_id splitting (#14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
2025-07-15 21:51:09 +02:00
Jeff Bolz 10a0351a97
vulkan: add RTE variants for glu/add/sub/mul/div (#14653) 2025-07-15 21:32:11 +02:00
Georgi Gerganov 3120413ccd vulkan : remove unused vars (#0)
ggml-ci
2025-07-12 14:25:44 +03:00
Acly 74bb294591 vulkan : implement bilinear interpolation (ggml/1291)
ggml-ci
2025-07-12 14:25:44 +03:00
Acly 3e303b1107 vulkan : implement ggml_roll (ggml/1290)
ggml-ci
2025-07-12 14:25:44 +03:00
Jeff Bolz b3ad3a0191
vulkan: support SET_ROWS (#14587)
* vulkan: support SET_ROWS

Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.

* vulkan: optimize set_rows

Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
2025-07-12 12:12:26 +02:00
Jeff Bolz 98197e5c98
vulkan: optimizations for deepseek prompt processing (#14555)
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

* vulkan: increase coopmat2 mul_mat_id tile size

* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
2025-07-12 11:51:58 +02:00
Xuan-Son Nguyen 98bab638fb
ggml : add ggml_scale_bias (#14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code looks more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-09 18:16:12 +02:00
Jeff Bolz 6efcd65945
vulkan: optimize flash attention split_k_reduce (#14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-08 20:11:42 +02:00
Jeff Bolz b8eeb8741d
vulkan : fix rope with partial rotation and non-cont src (#14582) 2025-07-08 15:21:21 +02:00