* modify warptile tuning for xe3
* intel vendor check w/ coopmat support
* fix back formatting
* fix formatting change 2
* move intel check to chip specific tuning part
* Change to support both windows and linux
* modify m_warptile to l_warptile for intel
* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)
* Code style changes
* Code style changes (2)
* Code style changes (3)
* vulkan: support buffer_from_host_ptr
* hacky use of buffer_from_host_ptr for directio
* disable buffer_from_host_ptr cap
* use external memory for ggml_vk_host_malloc, revert model loader changes
* disable external_memory_host for MoltenVK
* take buffer memory types into account
* don't use external_memory_host for ggml_vk_host_malloc
* vulkan: Optimize GGML_OP_CUMSUM
There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.
In the whole-row shader, handle multiple elements per invocation.
* use 2 ELEM_PER_THREAD for AMD/Intel
* address feedback
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron
Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).
Fewer pipeline variants and spec constants, just use push constants.
In test_topk_moe, change exp_probs_b to be 1D, matching real networks.
Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.
* change test_topk_moe to allow results in arbitrary order
* disable sigmoid fusion for moltenvk
* vulkan: Use BK=32 for coopmat2 mul_mat_id
* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader
Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.
Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host visible vidmem I think the cost of
allocating/mapping vidmem moves and becomes more expensive, and I don't see a
benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
significant improvement in model loading time.
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.
CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
This implements a variation of the perf logger where rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap. This can be useful to help understand whether
individual operations need to be optimized, or if the group is already running
efficiently.
GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).
GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
When the number of cols is large, split each row across multiple workgroups.
There are three phases that communicate partial results through temp buffers:
(1) compute max partials
(2) take max of partials, compute sum(exp(x-max)) partials
(3) sum partials, compute scaled result
* vulkan: perf_logger improvements
- Move perf_logger from device to ctx.
- Add an env var to control the frequency we dump the stats. If you set a very
large value, it just dumps when the ctx is destroyed.
- Add a fusion info string to the tracking, only log one item per fused op.
- Fix MUL_MAT_ID flops calculation.
* fix vector sizes
* Feat: Added vulkan circular tiling support
* Feat: Added cpu circular
* Feat: Added cuda kernels
* Added tests
* Added tests
* Removed non-pad operations
* Removed unneded changes
* removed backend non pad tests
* Update test-backend-ops.cpp
* Fixed comment on pad test
* removed trailing whitespace
* Removed unneded test in test-backend-ops
* Removed removed test from calls
* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp
Co-authored-by: Ruben Ortlam <picard12@live.de>
* Fixed alignment
* Formatting
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Format pad
* Format
* Clang format
* format
* format
* don't change so much stuff
* clang format and update to bool
* fix duplicates
* don't need to fix the padding
* make circular bool
* duplicate again
* rename vulkan to wrap around
* Don't need indent
* moved to const expr
* removed unneded extra line break
* More readable method calls
* Minor wording changes
* Added final newline
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Added circular pad ext tests
* Gate non circular pad devices
* Cleaned gating of non-circular pad devices
---------
Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before
the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs
well. I think even for larger m, f32 is so bandwidth-limited that running
multiple rows doesn't help.
* Fix shader to support 2D workgroup mapping to a single subgroup
* Set required_subgroup_size
topk_moe shader requires static WARP_SIZE and actual subgroup size to match
* vulkan: Reduce temporary memory usage for TOP_K
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
* vulkan: fix top_k bug when there are ties in the input
I noticed by inspection a bug in the vulkan top_k shader where if the least
value in the top_k appears multiple times we could end up writing those extra
copies out rather than some larger values (if the larger values are on higher
numbered threads).
I rewrote the test verification to handle this case, where the final index set
is not necessarily the same.
* Update tests/test-backend-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
* Adjust to pytorch
* Add antialiasing upscale
* Increase number of patches to 1024
* Handle default marker insertion for LFM2
* Switch to flag
* Reformat
* Cuda implementation of antialias kernel
* Change placement in ops.cpp
* consistent float literals
* Pad only for LFM2
* Address PR feedback
* Rollback default marker placement changes
* Fallback to CPU implementation for antialias implementation of upscale
* vulkan: Implement top-k
Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.
* fix pipeline selection
* vulkan: Add N-ary search algorithm for topk
* microoptimizations
* vulkan: support larger argsort
This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory and when more than 1024 threads are needed
it runs multiple workgroups and synchronizes through a pipelinebarrier.
To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied this same change to the original shared memory
version of the shader, which is still used when ncols <= 1024.
* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost
* reduce loop overhead
* run multiple cols per invocation, to reduce barrier overhead
* vulkan: add LOG operation support for F32 and F16
Part of #14909.
* vulkan: Fix LOG operation types
* docs: Update operation support documentation for Vulkan LOG operation
* vulkan: fix log_f16 shader
* docs: restore missing LOG test cases and regenerate ops.md
* vulkan: change graph_compute to be async and enable get_tensor_async
This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.
Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.
The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.
The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.
* fix thread safety errors
* teardown context cleanly
* Handle async read to non-pinned dst
* vulkan : implement upscale with bicubic interpolation
* cuda : implement upscale with bicubic interpolation
* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests
* adapt OpenCL backend to not support the OP in that case so tests don't fail
* print scale mode & flags in test-backend-ops
* vulkan: use all device-local heaps for memory availability reporting
Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
* use all available heaps for iGPU memory reporting
* Allow multiple memory types per buffer request for devices with split heaps
---------
Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
* vulkan: remove the need for the dryrun
Allocate pipelines and descriptor sets when requested.
Reallocate the prealloc buffers when needed, and flush any pending work
before reallocating.
For rms_partials and total_mul_mat_bytes, use the sizes computed the last time
the graph was executed.
* remove dryrun parameters
* vulkan: fuse mul_mat+add and mul_mat_id+add_id
The fusion is only applied for the mat-vec mul paths.
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix 32b build
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Experimenting crash fix
* added assert for aborting and fixed comment
* changed to check if a pipeline is empty or not
* Moved function in class definition
* replaced with is_empty
* Modified is_empty to check only unaligned pipelines
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).
Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).
Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.
Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.
Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.
Add new backend tests.
* vulkan: Update topk_moe fusion to handle gpt's late softmax
Based on #16649.
* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in #16655
* Update ggml/src/ggml-impl.h
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* ggml : fix interpolate with align-corners and ne=1
* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken
* fix clang warning
* fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver
* style: indent
* fix: decrease priority
* fix: switch to `||`
ggml_vk_create_buffer_temp is not used anywhere, and it is the only
caller for ggml_vk_pool_malloc.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan: implement SSM scan operation
Add State Space Model scan operation to the Vulkan backend.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan: implement SSM conv operation
Add State Space Model conv operation to the Vulkan backend.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE
Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.
For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.
Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.
With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.
* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
* vulkan: 64-bit im2col
Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.
* fix validation error for large im2col
* vulkan: handle mat_mul with A matrix > 4GB
This change splits mat_mul operations with huge A matrix into chunks in the M
dimension. This works well for stable-diffusion use cases where the im2col
matrix has very large M.
Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so stride should be set after.
* build fixes
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.