Commit Graph

317 Commits

Author SHA1 Message Date
Ruben Ortlam f6b533d898
Vulkan Flash Attention Coopmat1 Refactor (#19075)
* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-28 18:52:45 +01:00
Oleksandr Kuvshynov 88d23ad515
vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058)
Deduplication here relied on the fact that vulkan would return unique
UUID for different physical GPUs. It is at the moment not always the case.
On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total),
MotlenVK would assign same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of 4 GPUs in such configuration.

The deduplication logic here is changed to only filter GPUs if UUID is
same but driver is different.
2026-01-28 12:35:54 +01:00
Jeff Bolz bd544c94a3
vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger
2026-01-21 18:01:40 +01:00
Jeff Bolz 33f890e579
vulkan: support flash attention GQA/split_k with small batches (#18938) 2026-01-21 17:43:43 +01:00
Masato Nakasaka 067b8d7af3
Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (#18831)
This reverts commit 980b7cd17e.
2026-01-21 17:13:43 +01:00
Jeff Bolz 50b7f076a5
vulkan: Use mul_mat_vec_id for small values of n (#18918)
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
2026-01-21 16:22:02 +01:00
Georgi Gerganov 365a3e8c31
ggml : add ggml_build_forward_select (#18550)
* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment
2026-01-19 20:03:19 +02:00
Jeff Bolz 3e4bb29666
vulkan: Check maxStorageBufferRange in supports_op (#18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-14 10:59:05 +01:00
Jeff Bolz 8e2da778da
vulkan: change memory_logger to be controlled by an env var (#18769) 2026-01-12 13:32:55 +01:00
Jeff Bolz 2bbe4c2cf8
vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-12 12:32:13 +01:00
Ruben Ortlam 1051ecd289
vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-12 07:29:35 +01:00
Ruben Ortlam 0e76501e1d
Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-11 17:33:33 +01:00
Jeff Bolz 2524c26164
vulkan: fix push constant size for quantize_q8_1 (#18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-08 15:40:58 +01:00
Jeff Bolz cb14b06995
vulkan: optimize ssm_scan (#18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-08 15:16:54 +01:00
Doctor Shotgun 9a5724dee2
ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (#18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-08 11:03:21 +02:00
Jeff Bolz ca4a8370bc
vulkan: reject ops when a tensor is too large to allocate (#18646) 2026-01-07 12:03:32 +01:00
virajwad 03023296cf
vulkan: Warptile tuning for Intel Xe2/Xe3 (#18178)
* modify warptile tuning for xe3

* intel vendor check w/ coopmat support

* fix back formatting

* fix formatting change 2

* move intel check to chip specific tuning part

* Change to support both windows and linux

* modify m_warptile to l_warptile for intel

* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)

* Code style changes

* Code style changes (2)

* Code style changes (3)
2026-01-07 11:59:47 +01:00
Jeff Bolz ea13cba850
vulkan: support buffer_from_host_ptr (#18467)
* vulkan: support buffer_from_host_ptr

* hacky use of buffer_from_host_ptr for directio

* disable buffer_from_host_ptr cap

* use external memory for ggml_vk_host_malloc, revert model loader changes

* disable external_memory_host for MoltenVK

* take buffer memory types into account

* don't use external_memory_host for ggml_vk_host_malloc
2026-01-06 17:37:07 +01:00
Jeff Bolz b37124d2d2
vulkan: handle quantize_q8_1 overflowing the max workgroup count (#18515)
* vulkan: handle quantize_q8_1 overflowing the max workgroup count

* vulkan: Fix small tile size matmul on lavapipe

* fix mul_mat_id failures
2026-01-05 11:30:14 +01:00
Jeff Bolz 18ddaea2ae
vulkan: Optimize GGML_OP_CUMSUM (#18417)
* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
2026-01-02 15:32:30 -06:00
Jeff Bolz 706e3f93a6
vulkan: Implement mmvq for iq1_s/iq1_m (#18450) 2026-01-02 20:19:04 +01:00
Jeff Bolz be47fb9285
vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-01 08:58:27 +01:00
Jeff Bolz c9ced4910b
vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352)
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
2025-12-26 16:12:58 -06:00
Jeff Bolz 7ac8902133
vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
2025-12-26 18:15:50 +01:00
Jeff Bolz 9bf20d8ac3
vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332) 2025-12-26 18:15:02 +01:00
Jeff Bolz b96b82fc85
vulkan: Support UPSCALE w/antialias (#18327) 2025-12-26 17:00:57 +01:00
Jeff Bolz 10dc500bdb
vulkan: handle rope with large number of rows (#18306) 2025-12-26 16:53:46 +01:00
Jeff Bolz 2a9ea2020c
vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (#18302) 2025-12-24 12:36:34 +01:00
Ruben Ortlam 7f459c98e7
vulkan: use fewer FA rows for small cache runs (#18280) 2025-12-24 08:59:14 +01:00
Jeff Bolz e3b35ddf1c
vulkan: Extend rope fusions to allow mrope (#18264)
Extend the test-backend-ops tests as well.
2025-12-22 11:03:13 -06:00
Jeff Bolz e1f15b454f
vulkan: Implement set_tensor_async and the event interfaces (#18047)
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host visible vidmem I think the cost of
allocating/mapping vidmem moves and becomes more expensive, and I don't see a
benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
significant improvement in model loading time.
2025-12-21 21:52:09 +01:00
Jeff Bolz fd05c51cec
vulkan: fix im2col overflowing maxworkgroupcount (#18180) 2025-12-21 10:32:58 +01:00
Jeff Bolz b365c3ff01
vulkan/cuda: fix topk_moe with exp_probs_b (#18071)
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.

CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
2025-12-21 10:27:34 +01:00
Jeff Bolz cb64222b0c
vulkan: support GGML_UNARY_OP_XIELU (#18062) 2025-12-21 10:17:58 +01:00
Jeff Bolz 6eb7081860
vulkan: in graph_optimize, try to group ADD operations (#18060)
I saw the adds not staying together in the new nemotron 3 nano model.
2025-12-21 10:05:08 +01:00
Jeff Bolz cdbada8d10
vulkan: Add perf logger mode with concurrency (#17944)
This implements a variation of the perf logger where rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap. This can be useful to help understand whether
individual operations need to be optimized, or if the group is already running
efficiently.

GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).

GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
2025-12-19 06:36:46 +01:00
Jeff Bolz 36255a2268
vulkan: support get_rows for i32 (#17941) 2025-12-13 10:12:53 +01:00
Jeff Bolz 3229a23fa6
vulkan: support GGML_OP_DIAG (#17893) 2025-12-13 10:07:49 +01:00
Jeff Bolz 303f8615e9
vulkan: Multi-pass softmax for large number of cols (#17892)
When the number of cols is large, split each row across multiple workgroups.
There are three phases that communicate partial results through temp buffers:
(1) compute max partials
(2) take max of partials, compute sum(exp(x-max)) partials
(3) sum partials, compute scaled result
2025-12-13 10:04:29 +01:00
Jeff Bolz 07a10c1090
vulkan: Allow non-pow2 n_experts in topk_moe (#17872) 2025-12-13 08:40:04 +01:00
Jeff Bolz db97837385
vulkan: perf_logger improvements (#17672)
* vulkan: perf_logger improvements

- Move perf_logger from device to ctx.
- Add an env var to control the frequency we dump the stats. If you set a very
large value, it just dumps when the ctx is destroyed.
- Add a fusion info string to the tracking, only log one item per fused op.
- Fix MUL_MAT_ID flops calculation.

* fix vector sizes
2025-12-06 18:46:46 +01:00
Phylliida Dev 09c7c50e64
ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (#16985)
* Feat: Added vulkan circular tiling support

* Feat: Added cpu circular

* Feat: Added cuda kernels

* Added tests

* Added tests

* Removed non-pad operations

* Removed unneded changes

* removed backend non pad tests

* Update test-backend-ops.cpp

* Fixed comment on pad test

* removed trailing whitespace

* Removed unneded test in test-backend-ops

* Removed removed test from calls

* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp

Co-authored-by: Ruben Ortlam <picard12@live.de>

* Fixed alignment

* Formatting

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Format pad

* Format

* Clang format

* format

* format

* don't change so much stuff

* clang format and update to bool

* fix duplicates

* don't need to fix the padding

* make circular bool

* duplicate again

* rename vulkan to wrap around

* Don't need indent

* moved to const expr

* removed unneded extra line break

* More readable method calls

* Minor wording changes

* Added final newline

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Added circular pad ext tests

* Gate non circular pad devices

* Cleaned gating of non-circular pad devices

---------

Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-06 15:07:02 +01:00
Jeff Bolz 2960eb2975
vulkan: Use one row per workgroup for f32 mmv (#17711)
The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before
the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs
well. I think even for larger m, f32 is so bandwidth-limited that running
multiple rows doesn't help.
2025-12-06 11:12:26 +01:00
Jeff Bolz c6c5e85979
vulkan: support solve_tri with larger N/K values (#17781)
Split N into chunks to fit into shared memory.
If K > 128, use a larger workgroup with enough invocations.
Add perf tests matching qwen3next.
2025-12-06 08:56:45 +01:00
Masato Nakasaka 67788f6846
vulkan: Replace deprecated VK_EXT_validation_features (#17637)
* replaced deprecated VK_EXT_validation_features

* forgot to remove old code
2025-12-06 06:39:42 +01:00
Masato Nakasaka d8c0a7b085
vulkan: Fix mismatch in TOPK_MOE unit test (#17541)
* Fix shader to support 2D workgroup mapping to a single subgroup

* Set required_subgroup_size

topk_moe shader requires static WARP_SIZE and actual subgroup size to match
2025-12-06 06:23:30 +01:00
Jeff Bolz a0f3897d53
vulkan: fix top_k bug when there are ties in the input (#17659)
* vulkan: Reduce temporary memory usage for TOP_K

- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.

* vulkan: fix top_k bug when there are ties in the input

I noticed by inspection a bug in the vulkan top_k shader where if the least
value in the top_k appears multiple times we could end up writing those extra
copies out rather than some larger values (if the larger values are on higher
numbered threads).

I rewrote the test verification to handle this case, where the final index set
is not necessarily the same.

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-05 22:03:19 +01:00
Acly e15cd06a94
vulkan : support conv-2d with large output size (#17685) 2025-12-05 21:46:39 +01:00
Jeff Bolz 6ab0d64960
vulkan: enable mmvq for q2_k on NVIDIA (#17675) 2025-12-05 21:21:57 +01:00
Jeff Bolz 93bb92664e
vulkan: set all memory allocations to high priority (#17624)
* vulkan: set all memory allocations to high priority

* gate by env var
2025-12-05 21:21:04 +01:00