Commit Graph

8073 Commits

Author SHA1 Message Date
Ruben Ortlam dd92b1f8d5 fix regressions 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9a8743c4 add Intel shader core count lookup-table 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ae5466aaf Use wave32 on AMD RDNA for scalar FA 2026-02-14 11:16:20 +01:00
Ruben Ortlam 16cb912442 Bc 4 for scalar FA is not a valid configuration 2026-02-14 11:16:20 +01:00
Ruben Ortlam cd54ba2b86 fixes 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3946eb657f fix rebase issues 2026-02-14 11:16:20 +01:00
Ruben Ortlam 28a3c0b859 fix shmem support function 2026-02-14 11:16:20 +01:00
Ruben Ortlam 3ed9183ac9 use minimal subgroup size on Intel 2026-02-14 11:16:20 +01:00
Ruben Ortlam 9f9b701ff5 relax flash attention split_k condition to allow non-gqa use 2026-02-14 11:16:17 +01:00
Ruben Ortlam d6a004547f use smaller scalar rows size for smaller rows count 2026-02-14 07:05:36 +01:00
Ruben Ortlam de6db3fed6 use float_type for dequantize4 functions 2026-02-14 07:05:36 +01:00
Ruben Ortlam 356f18c444 use vectorized stores 2026-02-14 07:05:36 +01:00
Ruben Ortlam 4819fd3014 dynamic subgroups for intel 2026-02-14 07:05:16 +01:00
Ruben Ortlam b626e3296d also stage V through shmem when this is done for K 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8fbd3575e0 default to Bc 32 2026-02-14 07:05:16 +01:00
Ruben Ortlam d8d536cf98 only stage through shmem on Nvidia 2026-02-14 07:05:16 +01:00
Ruben Ortlam 8236c453a5 stage V loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam b7b67f8742 stage K loads through shmem 2026-02-14 07:05:16 +01:00
Ruben Ortlam 50a420e044 fuse lf accumulation, pf and v accumulation into a loop 2026-02-14 07:05:16 +01:00
Ruben Ortlam ca5ec63cfb cache q values into registers for KQ 2026-02-14 07:05:16 +01:00
Ruben Ortlam 3c2088121c add padding to mask shmem buffer 2026-02-14 07:05:15 +01:00
Ruben Ortlam 07afb5128f fixes 2026-02-14 07:04:32 +01:00
Ruben Ortlam e3bba64e82 add medium rows FA shader Br size 2026-02-14 07:03:07 +01:00
Ruben Ortlam c0f419351c optimize masksh use 2026-02-14 07:03:06 +01:00
Ruben Ortlam 9b309bbc51 fix amd workgroup size issue 2026-02-14 06:57:22 +01:00
Ruben Ortlam f92d7eddab use f32 scalar FA if f16 is not supported by device 2026-02-14 06:57:22 +01:00
Ruben Ortlam 828b7e9bb1 use row_split when Br >= 4, change reductions to use shared memory if row_split == 1 2026-02-14 06:57:22 +01:00
Ruben Ortlam e7a758fb66 split rows inside of subgroups for faster synchronization 2026-02-14 06:57:22 +01:00
Ruben Ortlam 015d7bcd66 vulkan: allow using fp16 in coopmat1 flash attention shader 2026-02-14 06:57:21 +01:00
Adrien Gallouët 91ea5d67f2
build : fix libtool call in build-xcframework.sh (#19605)
Run libtool via xcrun like strip and dsymutil, to have proper tool resolution.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-14 06:48:37 +01:00
Jeff Bolz dbb023336b
vulkan: support L2_NORM with contiguous rows (#19604) 2026-02-14 06:42:04 +01:00
Jeff Bolz 53aef25a88
vulkan: support GGML_OP_SET (#19584) 2026-02-14 06:36:38 +01:00
Sophon 2dec548094
vulkan: Add vendor id for Qualcomm drivers (#19569)
This commit allows Qualcomm native vulkan driver to be used on Windows
instead of Mesa Dozen.
2026-02-14 06:29:17 +01:00
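A minimal sketch of how a backend can tell the native Qualcomm driver apart by Vulkan vendor ID, which is the kind of check the commit above enables on Windows. The helper name and selection policy are illustrative, not the actual ggml-vulkan code; 0x5143 is Qualcomm's registered vendor ID.

```cpp
#include <vulkan/vulkan.h>

// Illustrative only: prefer the vendor's native driver over a layered one
// such as Mesa Dozen by checking the reported vendor ID.
static bool is_qualcomm_device(VkPhysicalDevice dev) {
    VkPhysicalDeviceProperties props{};
    vkGetPhysicalDeviceProperties(dev, &props);
    return props.vendorID == 0x5143; // Qualcomm
}
```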
Max Krasnyansky 0ccbfdef3e
hexagon: further optimizations and refactoring for flash attention (#19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling
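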

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-13 16:27:30 -08:00
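The last bullet above ("Use HVX for online softmax") refers to the online-softmax update used by flash attention. Below is a minimal scalar sketch of that update; the function and variable names are illustrative, not the ggml-hexagon ones, and the HVX vectorization is omitted.

```cpp
#include <algorithm>
#include <cmath>

// One online-softmax step: fold a new score s (with its value v) into the
// running max m, running denominator l, and running weighted sum acc.
static void online_softmax_step(float s, float v, float & m, float & l, float & acc) {
    const float m_new = std::max(m, s);
    const float scale = std::exp(m - m_new);  // rescales everything accumulated so far
    const float p     = std::exp(s - m_new);  // weight of the new score
    l   = l * scale + p;
    acc = acc * scale + p * v;
    m   = m_new;
}
```

Typical usage initializes m to -INFINITY and l, acc to 0, then divides acc by l once all scores have been folded in.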
Mengsheng Wu 94a602db66
github : add missing backends to issue templates (#19603) 2026-02-14 00:56:53 +01:00
Jeff Bolz 05a6f0e894
vulkan: restore -inf check in FA shaders (#19582) 2026-02-13 13:35:29 -06:00
Adrien Gallouët b48e80f677
common : update download code (#19573)
* common : remove legacy .json to .etag migration code

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : simplify common_download_file_single_online

This commit also forces a redownload if the file exists
but has no .etag file.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-13 15:10:46 +01:00
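A rough sketch of the "re-download if the file exists but has no .etag" rule described above, assuming a simple path-based check; the real logic lives in common_download_file_single_online and differs in detail.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: decide whether a cached download must be fetched again.
static bool needs_redownload(const std::string & path) {
    namespace fs = std::filesystem;
    const bool have_file = fs::exists(path);
    const bool have_etag = fs::exists(path + ".etag");
    // Missing file: download. File present but no .etag: force a re-download
    // so the server's ETag can be recorded for future cache validation.
    return !have_file || !have_etag;
}
```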
Xuan-Son Nguyen 752584d5f5
model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) (#19460)
* model: support GLM MoE DSA arch

* working version

* pyright

* keep indexer tensors

* add indexer gguf params

* loaded now

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* update

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* minor fix and cleanup

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-13 14:56:53 +01:00
Alberto Cabrera Pérez cc2aa81513
Fix wrong memcpy length for block_interleave == 4 (#19575) 2026-02-13 20:32:14 +08:00
ymcki 0e21991472
fix vulkan ggml_acc only works in 3d but not 4d (#19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-13 13:31:37 +01:00
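To illustrate what a 4-D ggml_acc has to index, here is a simplified host-side loop over all four dimensions with ggml-style byte strides (nb[0] being the element size). It is a stand-in for the idea, not the Vulkan shader or the actual fix.

```cpp
#include <cstddef>
#include <cstdint>

// Accumulate a contiguous src into a strided 4-D view of dst at a byte offset.
static void acc_f32(float * dst, const float * src, const int64_t ne[4],
                    const size_t nb[4], size_t offset_bytes) {
    for (int64_t i3 = 0; i3 < ne[3]; ++i3)
    for (int64_t i2 = 0; i2 < ne[2]; ++i2)
    for (int64_t i1 = 0; i1 < ne[1]; ++i1)
    for (int64_t i0 = 0; i0 < ne[0]; ++i0) {
        // All four strides are applied, including i3*nb[3] for the 4-D case.
        char * d = (char *) dst + offset_bytes
                 + i3*nb[3] + i2*nb[2] + i1*nb[1] + i0*nb[0];
        *(float *) d += *src++;
    }
}
```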
Sigbjørn Skjæret b2ecc0cdb4
support --verbose-prompt (#19576) 2026-02-13 12:49:10 +01:00
Aman Gupta 5065da554e
CUDA: loop over ne2*ne3 in case it overflows (#19538)
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
2026-02-13 17:01:40 +05:30
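A host-side illustration of the overflow concern above: the plane count ne2*ne3 is computed and iterated in 64-bit, with the div/mod that fastdiv would replace. This is schematic, not the CUDA kernel.

```cpp
#include <cstdint>

// Visit every (i2, i3) plane even when ne2*ne3 does not fit in 32 bits.
static void for_each_plane(int64_t ne2, int64_t ne3, void (*fn)(int64_t i2, int64_t i3)) {
    const int64_t nplanes = ne2 * ne3; // may exceed a 32-bit index
    for (int64_t p = 0; p < nplanes; ++p) {
        fn(p % ne2, p / ne2); // the real kernel replaces this div/mod with fastdiv
    }
}
```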
Aleksander Grygier 5174d7206f
webui: UI and routing fixes (#19586)
* chore: update webui build output

* chore: update webui build output

* fix: Scroll issues in DropdownMenuSearchable

* webui: fix redirect to root ignoring base path

* fix: Word wrapping

* fix: remove obsolete modality UI tests causing CI failures

- Remove VisionModality/AudioModality test stories
- Remove mockServerProps usage and imports
- Simplify Default test (remove dropdown interaction checks)
- Simplify FileAttachments test (remove mocks)

* feat: Improve formatting performance time

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2026-02-13 12:31:00 +01:00
Oliver Simons 43919b7f4f
CUDA: Do not mutate cgraph for fused ADDs (#19566)
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
   step as we store the properties before modifying the graph in-place
   in the cuda-backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-02-13 15:07:55 +05:30
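A schematic of the copy-instead-of-mutate pattern described above, assuming ggml.h is on the include path; the fusion step itself is elided and the helper name is made up for illustration.

```cpp
#include <type_traits>
#include "ggml.h"

// Per the commit, fusion relies on plain value copies of graph nodes.
static_assert(std::is_trivially_copyable<ggml_tensor>::value,
              "fused-ADD handling copies ggml_tensor by value");

static ggml_tensor make_fused_copy(const ggml_tensor * node) {
    ggml_tensor local = *node; // copy the node; the incoming cgraph stays untouched
    // ... adjust local.src[...] for the fused ADD chain here ...
    return local;
}
```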
Pavan Shinde 423cf0b26f
docs : fix broken link and typo (#19560) 2026-02-13 09:38:09 +01:00
ymcki 33a56f90a6
model : Kimi Linear fix conv state update (#19531)
* fix conv state update for llama-server parallel serving

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-13 09:10:18 +01:00
Adrien Gallouët 25224c8021
llama : remove deprecated codecvt (#19565)
Using the same conversion function ensures consistent matching between
the regex pattern and the text.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-13 06:43:53 +01:00
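A self-contained sketch of a single shared UTF-8-to-code-point conversion of the kind the commit relies on, applied to both the regex pattern and the text. cpts_from_utf8 is a hypothetical stand-in for llama.cpp's unicode helper, and input validation is omitted.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 bytes into code points (error handling omitted for brevity).
static std::vector<uint32_t> cpts_from_utf8(const std::string & s) {
    std::vector<uint32_t> out;
    for (size_t i = 0; i < s.size();) {
        const uint8_t c = s[i];
        uint32_t cp; int n;
        if      (c < 0x80) { cp = c;        n = 1; }
        else if (c < 0xE0) { cp = c & 0x1F; n = 2; }
        else if (c < 0xF0) { cp = c & 0x0F; n = 3; }
        else               { cp = c & 0x07; n = 4; }
        for (int k = 1; k < n && i + k < s.size(); ++k) {
            cp = (cp << 6) | (uint8_t(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += n;
    }
    return out;
}
```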
Adrien Gallouët 2f5d8f8edc
vendor : update BoringSSL to 0.20260211.0 (#19562)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-13 06:43:26 +01:00
Georgi Gerganov bb96bfd361
memory : fix kv cache size for hybrid models (#19559) 2026-02-13 07:36:24 +02:00
Georgi Gerganov 0644baefde
metal : improve concurrency (#19555) 2026-02-13 07:35:57 +02:00