Commit Graph

2070 Commits

Author SHA1 Message Date
Ruben Ortlam 856d20988b
Merge 32d504cd94 into 2ba9adc093 2026-02-16 14:50:19 +01:00
Mario Limonciello 2ba9adc093
Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (#19591)
Avoids issues with ROCm 6.4.4.

Closes: https://github.com/ggml-org/llama.cpp/issues/19580
Fixes: 6845f7f87 ("Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)")

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2026-02-16 14:46:08 +01:00
abhijain1204fujitsu 267ba5a1d9
ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (#19132)
* Updated repack.cpp

* Updated repack.cpp

* Updated repack.cpp

* Added if condition to support only vector length 256.

* Changed the format, removed comments and a duplicate variable

* If SVE 256 is not present, the generic function was used for the computation, which slowed performance.

So added code to fall back to the NEON path when SVE 256 is not available (see the dispatch sketch after this entry).

* Code format change suggestion

---------

Co-authored-by: Vithule, Prashant <Prashant.Vithule@fujitsu.com>
2026-02-16 14:38:43 +08:00
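
A minimal sketch of the dispatch this commit describes, assuming hypothetical kernel names (the actual symbols live in repack.cpp); svcntb() is the SVE intrinsic that reports the runtime vector length in bytes:

```cpp
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Use the SVE kernel only when the vector length is 256 bits; otherwise fall
// back to the NEON kernel instead of the slow generic path.
void gemm_q4_k_8x8_q8_k_dispatch(/* tensor args */) {
#if defined(__ARM_FEATURE_SVE)
    if (svcntb() * 8 == 256) {           // svcntb(): vector length in bytes
        // gemm_q4_k_8x8_q8_k_sve256(...);  // valid only for VL = 256
        return;
    }
#endif
    // gemm_q4_k_8x8_q8_k_neon(...);        // NEON fallback
}
```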
Georgi Gerganov 55d58599c8 ggml : bump version to 0.9.7 (ggml/1425) 2026-02-15 22:24:29 +02:00
Georgi Gerganov 1a8c700bfd ggml : bump version to 0.9.6 (ggml/1423) 2026-02-15 22:24:29 +02:00
David Friehs 27b93cbd15
cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to Nsight
AFAICT no overflow can occur here as iq2xxs values are far too small (both rewrites are checked in the sketch after this entry)

* uint -> uint32_t

error: identifier "uint" is undefined
2026-02-15 22:38:42 +05:30
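
Both rewrites above are exact under C integer arithmetic. A standalone brute-force check (hypothetical harness, not kernel code); the sum range is a generous guess, since the commit notes iq2xxs values are far too small to overflow:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
    // identity 1: (sum * scale + sum / 2) / 4 == (sum * (scale * 2 + 1)) / 8
    for (int sum = -4096; sum <= 4096; ++sum) {
        for (int scale = 0; scale < 16; ++scale) { // scale = aux32 >> 28 is 4 bits
            assert((sum * scale + sum / 2) / 4 == (sum * (scale * 2 + 1)) / 8);
        }
    }
    // identity 2: only the top 5 bits of aux32 matter after the shifts
    for (uint32_t hi = 0; hi < 32; ++hi) {
        const uint32_t aux32 = hi << 27;
        assert(((aux32 >> 28) * 2 + 1) == (aux32 >> 27 | 1));
    }
    puts("both identities hold");
}
```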
Daniel Bevenius 57088276d4
cmake : check if KleidiAI API has been fetched (#19640)
This commit addresses a build issue with the KleidiAI backend when
building multiple CPU backends. Commit
3a00c98584 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable;
the latter handles this case because it is idempotent, while
FetchContent_Populate is not.

I missed this during my review and should not have committed without
verifying the CI failure, sorry about that.
2026-02-15 13:59:38 +01:00
Georgi Gerganov 08e6d914b8
ggml : avoid UB in gemm ukernel (#19642) 2026-02-15 14:56:35 +02:00
Aaron Teo 184c694f45
ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399) 2026-02-15 18:20:35 +08:00
Aman Gupta 684b36101c
ggml-cpu: FA add GEMM microkernel (#19422)
* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR != 0 (see the tail-handling sketch after this entry)

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations
2026-02-15 11:09:24 +05:30
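
The "DV % GGML_F32_EPR != 0" bullet is the classic vector-width remainder problem; a generic sketch of the pattern (all names here are illustrative, with the inner j-loop standing in for one SIMD fused multiply-add):

```cpp
#include <cstddef>

// Accumulate s * v into acc over a row of length DV: full-width vector
// iterations first, then a scalar tail for the DV % EPR != 0 leftovers.
void acc_scaled_row(float * acc, const float * v, float s, size_t DV) {
    const size_t EPR = 8;                    // elements per register, e.g. GGML_F32_EPR
    size_t i = 0;
    for (; i + EPR <= DV; i += EPR) {        // vectorizable main loop
        for (size_t j = 0; j < EPR; ++j) {
            acc[i + j] += s * v[i + j];
        }
    }
    for (; i < DV; ++i) {                    // scalar tail
        acc[i] += s * v[i];
    }
}
```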
SamareshSingh 3a00c98584
cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both the build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments
2026-02-15 06:22:53 +01:00
Ruben Ortlam 32d504cd94 fix editorconfig 2026-02-14 13:02:32 +01:00
Georgi Gerganov 1725e316c1
models : optimize qwen3next graph (#19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
2026-02-14 12:57:36 +02:00
Ruben Ortlam 02ccf81496 tmpsh size fix 2026-02-14 11:43:31 +01:00
Adrien Gallouët b7742cf321
ggml : fix GGML_DEBUG with OpenMP (#19599)
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-14 11:22:57 +01:00
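
A minimal sketch of the guard pattern the fix implies, with illustrative names only: debug code touching last_graph has to sit behind the same conditional that declares the field.

```cpp
// Hypothetical mirror of the situation described above, not ggml's real types.
struct compute_state {
#ifndef GGML_USE_OPENMP
    const void * last_graph;    // exists only on the non-OpenMP threading path
#endif
    int n_threads;
};

void graph_compute_thread_debug(const compute_state * st) {
#ifndef GGML_USE_OPENMP
    (void) st->last_graph;      // compiled out when OpenMP supplies the threads
#else
    (void) st;
#endif
}
```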
Ruben Ortlam 0b4b0d2e57 device tuning 2026-02-14 11:16:20 +01:00
Ruben Ortlam dd92b1f8d5 fix regressions 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9a8743c4 add Intel shader core count lookup-table 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ae5466aaf Use wave32 on AMD RDNA for scalar FA 2026-02-14 11:16:20 +01:00
Ruben Ortlam 16cb912442 Bc 4 for scalar FA is not a valid configuration 2026-02-14 11:16:20 +01:00
Ruben Ortlam cd54ba2b86 fixes 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3946eb657f fix rebase issues 2026-02-14 11:16:20 +01:00
Ruben Ortlam 28a3c0b859 fix shmem support function 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ed9183ac9 use minimal subgroup size on Intel 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9b701ff5 relax flash attention split_k condition to allow non-gqa use 2026-02-14 11:16:17 +01:00
Georgi Gerganov 6e473fb384
metal : fix ACC op (#19427) 2026-02-14 09:54:03 +02:00
Ruben Ortlam d6a004547f use smaller scalar rows size for smaller rows count 2026-02-14 07:05:36 +01:00
Ruben Ortlam de6db3fed6 use float_type for dequantize4 functions 2026-02-14 07:05:36 +01:00
Ruben Ortlam 356f18c444 use vectorized stores 2026-02-14 07:05:36 +01:00
Ruben Ortlam 4819fd3014 dynamic subgroups for intel 2026-02-14 07:05:16 +01:00
Ruben Ortlam b626e3296d also stage V through shmem when this is done for K 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8fbd3575e0 default to Bc 32 2026-02-14 07:05:16 +01:00
Ruben Ortlam d8d536cf98 only stage through shmem on Nvidia 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8236c453a5 stage V loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam b7b67f8742 stage K loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam 50a420e044 fuse lf accumulation, pf and v accumulation into a loop 2026-02-14 07:05:16 +01:00
Ruben Ortlam ca5ec63cfb cache q values into registers for KQ 2026-02-14 07:05:16 +01:00
Ruben Ortlam 3c2088121c add padding to mask shmem buffer 2026-02-14 07:05:15 +01:00
Ruben Ortlam 07afb5128f fixes 2026-02-14 07:04:32 +01:00
Ruben Ortlam e3bba64e82 add medium rows FA shader Br size 2026-02-14 07:03:07 +01:00
Ruben Ortlam c0f419351c optimize masksh use 2026-02-14 07:03:06 +01:00
Ruben Ortlam 9b309bbc51 fix amd workgroup size issue 2026-02-14 06:57:22 +01:00
Ruben Ortlam f92d7eddab use f32 scalar FA if f16 is not supported by device 2026-02-14 06:57:22 +01:00
Ruben Ortlam 828b7e9bb1 use row_split when Br >= 4, change reductions to use shared memory if row_split == 1 2026-02-14 06:57:22 +01:00
Ruben Ortlam e7a758fb66 split rows inside of subgroups for faster synchronization 2026-02-14 06:57:22 +01:00
Ruben Ortlam 015d7bcd66 vulkan: allow using fp16 in coopmat1 flash attention shader 2026-02-14 06:57:21 +01:00
Jeff Bolz dbb023336b
vulkan: support L2_NORM with contiguous rows (#19604) 2026-02-14 06:42:04 +01:00
Jeff Bolz 53aef25a88
vulkan: support GGML_OP_SET (#19584) 2026-02-14 06:36:38 +01:00
Sophon 2dec548094
vulkan: Add vendor id for Qualcomm drivers (#19569)
This commit allows the Qualcomm native Vulkan driver to be used on Windows
instead of Mesa Dozen.
2026-02-14 06:29:17 +01:00
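
A sketch of the kind of check involved, assuming Qualcomm's Vulkan vendor ID of 0x5143 ("QC" in ASCII); the helper below is illustrative, not the actual llama.cpp code:

```cpp
#include <cstdint>

enum class vk_gpu_vendor { qualcomm, other };

// Match VkPhysicalDeviceProperties::vendorID so the native Qualcomm driver
// can be preferred over the Mesa Dozen (Vulkan-on-D3D12) layer on Windows.
vk_gpu_vendor classify_vk_vendor(uint32_t vendor_id) {
    constexpr uint32_t VENDOR_ID_QUALCOMM = 0x5143;
    return vendor_id == VENDOR_ID_QUALCOMM ? vk_gpu_vendor::qualcomm
                                           : vk_gpu_vendor::other;
}
```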
Max Krasnyansky 0ccbfdef3e
hexagon: further optimizations and refactoring for flash attention (#19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax (a scalar reference sketch follows this entry).

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-13 16:27:30 -08:00
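
For reference, a scalar sketch of online softmax, the single-pass recurrence that hvx_vec_expf vectorizes on HVX (plain C++, not Hexagon code): the running maximum m and running sum s are rescaled whenever a larger element appears, so no separate max pass is needed.

```cpp
#include <cmath>
#include <cstddef>

void online_softmax(float * x, size_t n) {
    float m = -INFINITY;  // running maximum
    float s = 0.0f;       // running sum of exp(x[i] - m)
    for (size_t i = 0; i < n; ++i) {
        const float m_new = x[i] > m ? x[i] : m;
        s = s * expf(m - m_new) + expf(x[i] - m_new); // rescale old sum, add term
        m = m_new;
    }
    for (size_t i = 0; i < n; ++i) {
        x[i] = expf(x[i] - m) / s;
    }
}
```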