Commit Graph

350 Commits

Author SHA1 Message Date
Ruben Ortlam 0b4b0d2e57 device tuning 2026-02-14 11:16:20 +01:00
Ruben Ortlam dd92b1f8d5 fix regressions 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9a8743c4 add Intel shader core count lookup-table 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ae5466aaf Use wave32 on AMD RDNA for scalar FA 2026-02-14 11:16:20 +01:00
Ruben Ortlam 16cb912442 Bc 4 for scalar FA is not a valid configuration 2026-02-14 11:16:20 +01:00
Ruben Ortlam cd54ba2b86 fixes 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3946eb657f fix rebase issues 2026-02-14 11:16:20 +01:00
Ruben Ortlam 28a3c0b859 fix shmem support function 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ed9183ac9 use minimal subgroup size on Intel 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9b701ff5 relax flash attention split_k condition to allow non-gqa use 2026-02-14 11:16:17 +01:00
Ruben Ortlam d6a004547f use smaller scalar rows size for smaller rows count 2026-02-14 07:05:36 +01:00
Ruben Ortlam 4819fd3014 dynamic subgroups for intel 2026-02-14 07:05:16 +01:00
Ruben Ortlam b626e3296d also stage V through shmem when this is done for K 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8fbd3575e0 default to Bc 32 2026-02-14 07:05:16 +01:00
Ruben Ortlam d8d536cf98 only stage through shmem on Nvidia 2026-02-14 07:05:16 +01:00
Ruben Ortlam b7b67f8742 stage K loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam 07afb5128f fixes 2026-02-14 07:04:32 +01:00
Ruben Ortlam e3bba64e82 add medium rows FA shader Br size 2026-02-14 07:03:07 +01:00
Ruben Ortlam 9b309bbc51 fix amd workgroup size issue 2026-02-14 06:57:22 +01:00
Ruben Ortlam f92d7eddab use f32 scalar FA if f16 is not supported by device 2026-02-14 06:57:22 +01:00
Ruben Ortlam 828b7e9bb1 use row_split when Br >= 4, change reductions to use shared memory if row_split == 1 2026-02-14 06:57:22 +01:00
Jeff Bolz dbb023336b
vulkan: support L2_NORM with contiguous rows (#19604) 2026-02-14 06:42:04 +01:00
Jeff Bolz 53aef25a88
vulkan: support GGML_OP_SET (#19584) 2026-02-14 06:36:38 +01:00
Sophon 2dec548094
vulkan: Add vendor id for Qualcomm drivers (#19569)
This commit allows Qualcomm native vulkan driver to be used on Windows
instead of Mesa Dozen.
2026-02-14 06:29:17 +01:00
Jeff Bolz 05a6f0e894
vulkan: restore -inf check in FA shaders (#19582) 2026-02-13 13:35:29 -06:00
ymcki 0e21991472
fix vulkan ggml_acc only works in 3d but not 4d (#19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-13 13:31:37 +01:00
Jeff Bolz f9bd518a6b
vulkan: make FA mask/softcap enables spec constants (#19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-06 08:49:58 +01:00
Jeff Bolz 449ec2ab07
vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (#19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-05 09:26:38 -06:00
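
The preprocessing idea in #19281 can be sketched roughly as follows. This is a minimal illustration, not the shader's actual code: the block size, the numeric code values, and the function names `classify_block` and `preprocess_mask` are all assumptions made for the example.

```python
import math

# Hypothetical 2-bit codes; the real shader's encoding may differ.
MIXED, ALL_NEG_INF, ALL_ZERO = 0, 1, 2


def classify_block(values):
    """Return a 2-bit code describing one block of mask values."""
    if all(v == -math.inf for v in values):
        return ALL_NEG_INF
    if all(v == 0.0 for v in values):
        return ALL_ZERO
    return MIXED


def preprocess_mask(mask, block_size):
    """Split a flat mask into fixed-size blocks and classify each one."""
    return [
        classify_block(mask[i:i + block_size])
        for i in range(0, len(mask), block_size)
    ]


# Three blocks of two values: all-zero, all-neg-inf, and mixed.
mask = [0.0, 0.0, -math.inf, -math.inf, 0.0, -math.inf]
codes = preprocess_mask(mask, 2)
```

A consumer would only need to load the mask values for blocks tagged `MIXED`; the other two codes already say what the whole block contains.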
Oleksandr Kuvshynov a498c75ad1
vulkan: fix GPU deduplication logic. (#19222)
* vulkan: fix GPU deduplication logic.

As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.

Let's just avoid filtering for MoltenVK, which is Apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.

Verified that MacOS + 4xVega still reports 4 GPUs with this version.

* vulkan: only skip dedup when both drivers are MoltenVK
2026-02-05 09:06:59 +01:00
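
The rule described in #19222 (dedup on UUID, but skip the dedup when both drivers are MoltenVK) might look like the sketch below. The `Gpu` record, the driver name strings, and the keep-the-first-device policy are assumptions for illustration; the real backend works on Vulkan physical device properties and may choose which duplicate to keep differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    uuid: str
    driver: str  # hypothetical driver label, e.g. "MoltenVK"


def dedup_devices(devices):
    """Keep the first device per UUID, unless both drivers are MoltenVK."""
    kept = []
    for dev in devices:
        duplicate = any(
            dev.uuid == other.uuid
            and not (dev.driver == "MoltenVK" and other.driver == "MoltenVK")
            for other in kept
        )
        if not duplicate:
            kept.append(dev)
    return kept


# Same GPU exposed twice with the same UUID: one entry survives.
igpu = [Gpu("iGPU", "u0", "intel"), Gpu("iGPU", "u0", "intel")]

# Four GPUs under MoltenVK where pairs share a UUID: all four survive.
vega = [Gpu(f"Vega {i}", f"u{i // 2}", "MoltenVK") for i in range(4)]
```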
Jeff Bolz 3409ab842d
vulkan: Set k_load_shmem to false when K is too large (#19301) 2026-02-05 08:48:33 +01:00
Jeff Bolz c342c3b93d
vulkan: fix non-contig rope (#19299) 2026-02-05 08:38:59 +01:00
Ruben Ortlam 32b17abdb0
vulkan: disable coopmat1 fa on Nvidia Turing (#19290) 2026-02-03 17:37:32 +01:00
Simon Redman 13f3ebfae1
Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (#19194) 2026-01-30 17:27:16 +01:00
Ruben Ortlam f6b533d898
Vulkan Flash Attention Coopmat1 Refactor (#19075)
* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-28 18:52:45 +01:00
Oleksandr Kuvshynov 88d23ad515
vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058)
Deduplication here relied on Vulkan returning a unique UUID for each
physical GPU. At the moment, that is not always the case.
On a Mac Pro 2019 running macOS with 2 Vega II Duo cards (so, 4 GPUs total),
MoltenVK would assign the same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of 4 GPUs in such a configuration.

The deduplication logic here is changed to only filter GPUs if the UUID is
the same but the driver is different.
2026-01-28 12:35:54 +01:00
Jeff Bolz bd544c94a3
vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger
2026-01-21 18:01:40 +01:00
Jeff Bolz 33f890e579
vulkan: support flash attention GQA/split_k with small batches (#18938) 2026-01-21 17:43:43 +01:00
Masato Nakasaka 067b8d7af3
Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (#18831)
This reverts commit 980b7cd17e.
2026-01-21 17:13:43 +01:00
Jeff Bolz 50b7f076a5
vulkan: Use mul_mat_vec_id for small values of n (#18918)
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
2026-01-21 16:22:02 +01:00
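
The "loop over the batch dimension" approach in #18918 can be pictured with a small host-side sketch. It assumes one selected expert per batch column and plain Python lists; the function names are hypothetical and this is not the shader's indexing code.

```python
def mat_vec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def mul_mat_vec_id(experts, ids, inputs):
    """Run one mat-vec per batch column, using the expert chosen by ids."""
    return [mat_vec(experts[e], x) for e, x in zip(ids, inputs)]


experts = [
    [[1, 0], [0, 1]],  # expert 0: identity
    [[2, 0], [0, 2]],  # expert 1: scale by 2
]
ids = [1, 0]               # expert picked for each of the n = 2 columns
inputs = [[1, 2], [3, 4]]  # one input vector per column
outputs = mul_mat_vec_id(experts, ids, inputs)
```

Each column costs one mat-vec, so the total work grows linearly with n, which matches the cost behaviour the commit message describes.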
Georgi Gerganov 365a3e8c31
ggml : add ggml_build_forward_select (#18550)
* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment
2026-01-19 20:03:19 +02:00
Jeff Bolz 3e4bb29666
vulkan: Check maxStorageBufferRange in supports_op (#18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-14 10:59:05 +01:00
Jeff Bolz 8e2da778da
vulkan: change memory_logger to be controlled by an env var (#18769) 2026-01-12 13:32:55 +01:00
Jeff Bolz 2bbe4c2cf8
vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-12 12:32:13 +01:00
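
The arithmetic behind #18678 can be checked with a few lines. The Q8_0 block layout (32 elements stored in 34 bytes: a 2-byte fp16 scale plus 32 int8 values) is ggml's; the `max_storage_buffer_range` value is a hypothetical device limit chosen for illustration.

```python
QK8_0 = 32             # elements per Q8_0 block
Q8_0_BLOCK_BYTES = 34  # 2-byte fp16 scale + 32 int8 values

# A matrix shape quoted in the commit message: 8192 x 5120 x 128.
elements = 8192 * 5120 * 128
blocks = elements // QK8_0
size_bytes = blocks * Q8_0_BLOCK_BYTES

# An element index no longer fits in 32 bits, but a block index does,
# which is why dividing by QUANT_K early keeps the index math in 32b.
element_index_fits_32b = elements < 2**32
block_index_fits_32b = blocks < 2**32

# Hypothetical device limit (an assumption, not a value from the commit).
max_storage_buffer_range = 2**32 - 1
needs_64bit_variant = size_bytes > max_storage_buffer_range
```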
Ruben Ortlam 1051ecd289
vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-12 07:29:35 +01:00
Ruben Ortlam 0e76501e1d
Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-11 17:33:33 +01:00
Jeff Bolz 2524c26164
vulkan: fix push constant size for quantize_q8_1 (#18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-08 15:40:58 +01:00
Jeff Bolz cb14b06995
vulkan: optimize ssm_scan (#18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-08 15:16:54 +01:00
Doctor Shotgun 9a5724dee2
ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (#18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-08 11:03:21 +02:00
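
The behaviour described in #18535 (read `GGML_OP_OFFLOAD_MIN_BATCH` once, default to 32, store the result in the device context) could be sketched like this. The `DeviceContext` class, the fallback on unparsable values, and the `>=` comparison are assumptions for illustration, not the backend's actual code.

```python
import os

DEFAULT_MIN_BATCH = 32  # prior hardcoded value, per the commit message


def read_min_batch(env=os.environ):
    """Parse GGML_OP_OFFLOAD_MIN_BATCH, falling back to the default."""
    raw = env.get("GGML_OP_OFFLOAD_MIN_BATCH")
    if raw is None:
        return DEFAULT_MIN_BATCH
    try:
        return int(raw)
    except ValueError:
        # Assumed behaviour: ignore values that are not integers.
        return DEFAULT_MIN_BATCH


class DeviceContext:
    """Hypothetical device context that caches the threshold once."""

    def __init__(self, env=os.environ):
        self.op_offload_min_batch = read_min_batch(env)

    def should_offload(self, batch_size):
        # Assumed comparison: offload at or above the threshold.
        return batch_size >= self.op_offload_min_batch


default_ctx = DeviceContext(env={})
tuned_ctx = DeviceContext(env={"GGML_OP_OFFLOAD_MIN_BATCH": "8"})
```

Passing the environment as a parameter keeps the sketch testable without mutating the real process environment.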
Jeff Bolz ca4a8370bc
vulkan: reject ops when a tensor is too large to allocate (#18646) 2026-01-07 12:03:32 +01:00
virajwad 03023296cf
vulkan: Warptile tuning for Intel Xe2/Xe3 (#18178)
* modify warptile tuning for xe3

* intel vendor check w/ coopmat support

* fix back formatting

* fix formatting change 2

* move intel check to chip specific tuning part

* Change to support both windows and linux

* modify m_warptile to l_warptile for intel

* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)

* Code style changes

* Code style changes (2)

* Code style changes (3)
2026-01-07 11:59:47 +01:00