Commit Graph

2056 Commits

Author SHA1 Message Date
Ruben Ortlam 32d504cd94 fix editorconfig 2026-02-14 13:02:32 +01:00
Ruben Ortlam 02ccf81496 tmpsh size fix 2026-02-14 11:43:31 +01:00
Ruben Ortlam 0b4b0d2e57 device tuning 2026-02-14 11:16:20 +01:00
Ruben Ortlam dd92b1f8d5 fix regressions 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9a8743c4 add Intel shader core count lookup-table 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ae5466aaf Use wave32 on AMD RDNA for scalar FA 2026-02-14 11:16:20 +01:00
Ruben Ortlam 16cb912442 Bc 4 for scalar FA is not a valid configuration 2026-02-14 11:16:20 +01:00
Ruben Ortlam cd54ba2b86 fixes 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3946eb657f fix rebase issues 2026-02-14 11:16:20 +01:00
Ruben Ortlam 28a3c0b859 fix shmem support function 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ed9183ac9 use minimal subgroup size on Intel 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9b701ff5 relax flash attention split_k condition to allow non-gqa use 2026-02-14 11:16:17 +01:00
Ruben Ortlam d6a004547f use smaller scalar row size for smaller row counts 2026-02-14 07:05:36 +01:00
Ruben Ortlam de6db3fed6 use float_type for dequantize4 functions 2026-02-14 07:05:36 +01:00
Ruben Ortlam 356f18c444 use vectorized stores 2026-02-14 07:05:36 +01:00
Ruben Ortlam 4819fd3014 dynamic subgroups for intel 2026-02-14 07:05:16 +01:00
Ruben Ortlam b626e3296d also stage V through shmem when this is done for K 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8fbd3575e0 default to Bc 32 2026-02-14 07:05:16 +01:00
Ruben Ortlam d8d536cf98 only stage through shmem on Nvidia 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8236c453a5 stage V loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam b7b67f8742 stage K loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam 50a420e044 fuse lf accumulation, pf and v accumulation into a loop 2026-02-14 07:05:16 +01:00
Ruben Ortlam ca5ec63cfb cache q values into registers for KQ 2026-02-14 07:05:16 +01:00
Ruben Ortlam 3c2088121c add padding to mask shmem buffer 2026-02-14 07:05:15 +01:00
Ruben Ortlam 07afb5128f fixes 2026-02-14 07:04:32 +01:00
Ruben Ortlam e3bba64e82 add medium rows FA shader Br size 2026-02-14 07:03:07 +01:00
Ruben Ortlam c0f419351c optimize masksh use 2026-02-14 07:03:06 +01:00
Ruben Ortlam 9b309bbc51 fix amd workgroup size issue 2026-02-14 06:57:22 +01:00
Ruben Ortlam f92d7eddab use f32 scalar FA if f16 is not supported by device 2026-02-14 06:57:22 +01:00
Ruben Ortlam 828b7e9bb1 use row_split when Br >= 4, change reductions to use shared memory if row_split == 1 2026-02-14 06:57:22 +01:00
Ruben Ortlam e7a758fb66 split rows inside of subgroups for faster synchronization 2026-02-14 06:57:22 +01:00
Ruben Ortlam 015d7bcd66 vulkan: allow using fp16 in coopmat1 flash attention shader 2026-02-14 06:57:21 +01:00
Jeff Bolz dbb023336b
vulkan: support L2_NORM with contiguous rows (#19604) 2026-02-14 06:42:04 +01:00
Jeff Bolz 53aef25a88
vulkan: support GGML_OP_SET (#19584) 2026-02-14 06:36:38 +01:00
Sophon 2dec548094
vulkan: Add vendor id for Qualcomm drivers (#19569)
This commit allows the Qualcomm native Vulkan driver to be used on Windows
instead of Mesa Dozen.
2026-02-14 06:29:17 +01:00
Max Krasnyansky 0ccbfdef3e
hexagon: further optimizations and refactoring for flash attention (#19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-13 16:27:30 -08:00
Jeff Bolz 05a6f0e894
vulkan: restore -inf check in FA shaders (#19582) 2026-02-13 13:35:29 -06:00
Alberto Cabrera Pérez cc2aa81513
Fix wrong memcpy length for block_interleave == 4 (#19575) 2026-02-13 20:32:14 +08:00
ymcki 0e21991472
fix vulkan ggml_acc only works in 3d but not 4d (#19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-13 13:31:37 +01:00
Aman Gupta 5065da554e
CUDA: loop over ne2*ne3 in case it overflows (#19538)
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
2026-02-13 17:01:40 +05:30
Oliver Simons 43919b7f4f
CUDA: Do not mutate cgraph for fused ADDs (#19566)
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
   step as we store the properties before modifying the graph in-place
   in the cuda-backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-02-13 15:07:55 +05:30
Georgi Gerganov 0644baefde
metal : improve concurrency (#19555) 2026-02-13 07:35:57 +02:00
Georgi Gerganov 490eb96b88
metal : support GGML_OP_SET (#19548) 2026-02-13 07:34:52 +02:00
Shupei Fan 3bb78133ab
hexagon: fix typo in vtcm_needs_release (#19545) 2026-02-12 15:07:49 -08:00
lhez 79cc0f2daf
opencl: add basic support for q4_1 (#19534)
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
2026-02-12 14:52:37 -08:00
Georgi Gerganov 3b3a948134
metal : update sum_rows kernel to support float4 (#19524) 2026-02-12 11:35:28 +02:00
Mario Limonciello 6845f7f87f
Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).

The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
 2549 |             using ARegsT = typename Impl::ARegsT;
```

Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2].  When this is actually
fixed upstream some guards can be used to detect and wrap the
version that has the fix to only apply when necessary.

Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2026-02-12 09:38:35 +01:00
Max Krasnyansky b1ff83bbb0
hexagon: further optimization and tuning of matmul and dot kernels (#19407)
* ggml-hexagon: implement 2x2 matmul kernel

* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4

* hexagon: fix editor config failures

* hexagon: refactor matmul ops to use context struct and remove wrappers

Also implement vec_dot_f16 2x2

* hexagon: refactor dyn quantizers to use mmctx

* hexagon: remove mm fastdiv from op_ctx

* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-02-11 23:04:27 -08:00
lhez 4d3daf80f8
opencl: add general Q6_K mm and Q4_K mv (#19347)
* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace
2026-02-11 10:33:13 -08:00
Georgi Gerganov 914dde72ba
ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511)
* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU
2026-02-11 18:58:43 +02:00