Ruben Ortlam
28a3c0b859
fix shmem support function
2026-02-14 11:16:20 +01:00
Ruben Ortlam
3ed9183ac9
use minimal subgroup size on Intel
2026-02-14 11:16:20 +01:00
Ruben Ortlam
9f9b701ff5
relax flash attention split_k condition to allow non-gqa use
2026-02-14 11:16:17 +01:00
Ruben Ortlam
d6a004547f
use smaller scalar rows size for smaller rows count
2026-02-14 07:05:36 +01:00
Ruben Ortlam
de6db3fed6
use float_type for dequantize4 functions
2026-02-14 07:05:36 +01:00
Ruben Ortlam
356f18c444
use vectorized stores
2026-02-14 07:05:36 +01:00
Ruben Ortlam
4819fd3014
dynamic subgroups for intel
2026-02-14 07:05:16 +01:00
Ruben Ortlam
b626e3296d
also stage V through shmem when this is done for K
2026-02-14 07:05:16 +01:00
Ruben Ortlam
8fbd3575e0
default to Bc 32
2026-02-14 07:05:16 +01:00
Ruben Ortlam
d8d536cf98
only stage through shmem on Nvidia
2026-02-14 07:05:16 +01:00
Ruben Ortlam
8236c453a5
stage V loads through shmem
2026-02-14 07:05:16 +01:00
Ruben Ortlam
b7b67f8742
stage K loads through shmem
2026-02-14 07:05:16 +01:00
Ruben Ortlam
50a420e044
fuse lf accumulation, pf and v accumulation into a loop
2026-02-14 07:05:16 +01:00
Ruben Ortlam
ca5ec63cfb
cache q values into registers for KQ
2026-02-14 07:05:16 +01:00
Ruben Ortlam
3c2088121c
add padding to mask shmem buffer
2026-02-14 07:05:15 +01:00
Ruben Ortlam
07afb5128f
fixes
2026-02-14 07:04:32 +01:00
Ruben Ortlam
e3bba64e82
add medium rows FA shader Br size
2026-02-14 07:03:07 +01:00
Ruben Ortlam
c0f419351c
optimize masksh use
2026-02-14 07:03:06 +01:00
Ruben Ortlam
9b309bbc51
fix amd workgroup size issue
2026-02-14 06:57:22 +01:00
Ruben Ortlam
f92d7eddab
use f32 scalar FA if f16 is not supported by device
2026-02-14 06:57:22 +01:00
Ruben Ortlam
828b7e9bb1
use row_split when Br >= 4, change reductions to use shared memory if row_split == 1
2026-02-14 06:57:22 +01:00
Ruben Ortlam
e7a758fb66
split rows inside of subgroups for faster synchronization
2026-02-14 06:57:22 +01:00
Ruben Ortlam
015d7bcd66
vulkan: allow using fp16 in coopmat1 flash attention shader
2026-02-14 06:57:21 +01:00
Adrien Gallouët
91ea5d67f2
build : fix libtool call in build-xcframework.sh ( #19605 )
...
Run libtool via xcrun like strip and dsymutil, to have proper tool resolution.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-14 06:48:37 +01:00
Jeff Bolz
dbb023336b
vulkan: support L2_NORM with contiguous rows ( #19604 )
2026-02-14 06:42:04 +01:00
Jeff Bolz
53aef25a88
vulkan: support GGML_OP_SET ( #19584 )
2026-02-14 06:36:38 +01:00
Sophon
2dec548094
vulkan: Add vendor id for Qualcomm drivers ( #19569 )
...
This commit allows the Qualcomm native Vulkan driver to be used on Windows
instead of Mesa Dozen.
2026-02-14 06:29:17 +01:00
Max Krasnyansky
0ccbfdef3e
hexagon: further optimizations and refactoring for flash attention ( #19583 )
...
* ggml-hexagon: fa improvements
ggml-hexagon: optimize flash attention calculations with improved variable handling
ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32
ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements
ggml-hexagon: optimize flash attention by changing slope vector type to F16
* hexfa: fixed test-backend-ops failures due to leftover element handling
* hexagon: refactor and optimize fa to use local context struct
* ggml-hexagon: optimize flash-attention using hvx_vec_expf
Use HVX for online softmax.
---------
Co-authored-by: chraac <chraac@gmail.com>
2026-02-13 16:27:30 -08:00
Mengsheng Wu
94a602db66
github : add missing backends to issue templates ( #19603 )
2026-02-14 00:56:53 +01:00
Jeff Bolz
05a6f0e894
vulkan: restore -inf check in FA shaders ( #19582 )
2026-02-13 13:35:29 -06:00
Adrien Gallouët
b48e80f677
common : update download code ( #19573 )
...
* common : remove legacy .json to .etag migration code
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : simplify common_download_file_single_online
This commit also forces a redownload if the file exists
but has no .etag file.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-13 15:10:46 +01:00
Xuan-Son Nguyen
752584d5f5
model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) ( #19460 )
...
* model: support GLM MoE DSA arch
* working version
* pyright
* keep indexer tensors
* add indexer gguf params
* loaded now
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* update
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* minor fix and cleanup
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-13 14:56:53 +01:00
Alberto Cabrera Pérez
cc2aa81513
Fix wrong memcpy length for block_interleave == 4 ( #19575 )
2026-02-13 20:32:14 +08:00
ymcki
0e21991472
fix vulkan ggml_acc only works in 3d but not 4d ( #19426 )
...
* fix vulkan ggml_acc only works in 3d but not 4d
* removed clamp in test_acc_block
* use the correct stride and its test case
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check
* version without boundary check
* revert back to boundary check version
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-13 13:31:37 +01:00
Sigbjørn Skjæret
b2ecc0cdb4
support --verbose-prompt ( #19576 )
2026-02-13 12:49:10 +01:00
Aman Gupta
5065da554e
CUDA: loop over ne2*ne3 in case it overflows ( #19538 )
...
* CUDA: loop over ne2*ne3 in case it overflows
* use fastdiv
2026-02-13 17:01:40 +05:30
Aleksander Grygier
5174d7206f
webui: UI and routing fixes ( #19586 )
...
* chore: update webui build output
* chore: update webui build output
* fix: Scroll issues in DropdownMenuSearchable
* webui: fix redirect to root ignoring base path
* fix: Word wrapping
* fix: remove obsolete modality UI tests causing CI failures
- Remove VisionModality/AudioModality test stories
- Remove mockServerProps usage and imports
- Simplify Default test (remove dropdown interaction checks)
- Simplify FileAttachments test (remove mocks)
* feat: Improve formatting performance time
---------
Co-authored-by: Pascal <admin@serveurperso.com>
2026-02-13 12:31:00 +01:00
Oliver Simons
43919b7f4f
CUDA: Do not mutate cgraph for fused ADDs ( #19566 )
...
* Do not mutate cgraph for fused ADDs
1. We should try to minimize in-place changes to the incoming
ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
step as we store the properties before modifying the graph in-place
in the cuda-backend
* Assert ggml_tensor is trivially copyable
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-02-13 15:07:55 +05:30
Pavan Shinde
423cf0b26f
docs : fix broken link and typo ( #19560 )
2026-02-13 09:38:09 +01:00
ymcki
33a56f90a6
model : Kimi Linear fix conv state update ( #19531 )
...
* fix conv state update for llama-server parallel serving
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-13 09:10:18 +01:00
Adrien Gallouët
25224c8021
llama : remove deprecated codecvt ( #19565 )
...
Using the same conversion function ensures a consistent matching between
the regex pattern and the text.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-13 06:43:53 +01:00
Adrien Gallouët
2f5d8f8edc
vendor : update BoringSSL to 0.20260211.0 ( #19562 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-13 06:43:26 +01:00
Georgi Gerganov
bb96bfd361
memory : fix kv cache size for hybrid models ( #19559 )
2026-02-13 07:36:24 +02:00
Georgi Gerganov
0644baefde
metal : improve concurrency ( #19555 )
2026-02-13 07:35:57 +02:00
Georgi Gerganov
490eb96b88
metal : support GGML_OP_SET ( #19548 )
2026-02-13 07:34:52 +02:00
Shupei Fan
3bb78133ab
hexagon: fix typo in vtcm_needs_release ( #19545 )
2026-02-12 15:07:49 -08:00
lhez
79cc0f2daf
opencl: add basic support for q4_1 ( #19534 )
...
* opencl: add q4_1 mv
* opencl: clean up
* opencl: add flattened q4_1 mv
* opencl: clean up
* opencl: add basic q4_1 mm
* opencl: fix whitespace
* opencl: add general q4_0 mm
2026-02-12 14:52:37 -08:00
Georgi Gerganov
338085c69e
args : add -kvu to llama-parallel ( #19577 )
2026-02-12 21:52:41 +02:00
Aleksander Grygier
4c61875bf8
webui: Add switcher to Chat Message UI to show raw LLM output ( #19571 )
2026-02-12 19:55:51 +01:00
Adrien Gallouët
4b385bfcf8
vendor : update cpp-httplib ( #19537 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-12 16:11:22 +01:00