Reese Levine
8ced5f41f9
Move to no timeout for WaitAny in graph submission to avoid deadlocks in some cases on llvm-pipe backends ( #20618 )
2026-03-18 10:23:47 -07:00
Shaw Nguyen
78d550b541
ggml-cpu/x86: fix unused changemask warning in repack ( #20692 )
2026-03-18 18:45:06 +02:00
Georgi Gerganov
4efd326e71
sync : ggml
2026-03-18 15:17:28 +02:00
Georgi Gerganov
b08f7322ee
ggml : bump version to 0.9.8 (ggml/1442)
2026-03-18 15:17:28 +02:00
Georgi Gerganov
79187f2fb8
ggml : restore ggml_type_sizef() to avoid major version bump (ggml/1441)
2026-03-18 15:17:28 +02:00
Julien Chaumond
48e61238e1
webui: improve tooltip wording for attachment requirements ( #20688 )
...
* webui: improve tooltip wording for attachment requirements
Co-Authored-By: Claude <Agents+claude@huggingface.co>
* chore: update webui build output
* chore: update webui build output
---------
Co-authored-by: Claude <Agents+claude@huggingface.co>
2026-03-18 14:01:02 +01:00
Pop Flamingo
312cf03328
llama : re-enable manual LoRA adapter free ( #19983 )
...
* Re-enable manual LoRA adapter free
* Remove stale "all adapters must be loaded before context creation" comments
2026-03-18 12:03:26 +02:00
Masato Nakasaka
f4049ad735
tests : fix test-jinja-py Windows failures by bypassing command-line args [no ci] ( #20483 )
...
* Fix errors occurring on Windows
* Reverted fix
#20365 will take care of CRLF issue
* Changed to write directly to stdin
* Prevent fclose from happening twice
2026-03-18 10:43:31 +01:00
Aldehir Rojas
5e8910a0db
common : rework gpt-oss parser ( #20393 )
...
* common : rework gpt-oss parser
* cont : fix gpt-oss tests
* cont : add structured output test
* cont : rename final to final_msg
2026-03-18 10:41:25 +01:00
Aaron Teo
fe00a84b4b
tests: enable kv_unified to prevent cuda oom error on rtx 2060 ( #20645 )
...
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-03-18 17:40:22 +08:00
Aleksander Grygier
7ab321d40d
webui: Fix duplicated messages on q param ( #20715 )
...
* fix: Remove duplicate message sending on `?q` param
* chore: update webui build output
2026-03-18 10:32:43 +01:00
uvos
7533a7d509
HIP : ignore return of hipMemAdvise [no ci] ( #20696 )
2026-03-18 09:53:13 +01:00
Andreas Obersteiner
a69d54f990
context : fix graph not resetting when control vector changes ( #20381 )
2026-03-18 08:10:13 +02:00
Krishna Sridhar
cf23ee2447
hexagon: add neg, exp, sigmoid, softplus ops, cont, repeat ops ( #20701 )
...
Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.
- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback
- CONT reuses the existing CPY infrastructure since making a tensor
contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
the worker pool, supporting f32 and f16 types. The kernel parallelizes
across output rows and uses memcpy for each tile.
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-17 15:34:36 -07:00
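The element-wise ops and the tiled REPEAT copy listed in the commit above can be illustrated with a minimal scalar sketch. This is not the Hexagon/HVX kernel code: the function names are hypothetical, there is no VTCM DMA or worker pool here, and the overflow-safe rearrangement of softplus is an assumption added for the sketch rather than something the commit states.

```python
import math

def neg(xs):
    # Negation expressed as a scale by -1.0, as the commit message describes.
    return [x * -1.0 for x in xs]

def sigmoid(x):
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    # Mathematically log(1 + exp(x)); rewritten as max(x, 0) + log1p(exp(-|x|))
    # so large inputs do not overflow (an assumption of this sketch).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def repeat_rows(src, reps_rows, reps_cols):
    # Tiled copy: every output row is one source row tiled reps_cols times.
    # A real kernel could split the output rows across worker threads.
    n_rows = len(src)
    return [src[r % n_rows] * reps_cols for r in range(n_rows * reps_rows)]
```

For example, `repeat_rows([[1, 2], [3, 4]], 2, 2)` tiles the 2x2 input into a 4x4 output, one output row at a time.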
Ruben Ortlam
892e3c333a
vulkan: disable mmvq on Intel Windows driver ( #20672 )
...
* vulkan: disable mmvq on Intel Windows driver
* improve comment
2026-03-17 21:51:43 +01:00
Kevin Hannon
ee4801e5a6
ggml-blas: set mkl threads from thread context ( #20602 )
...
* ggml blas: set mkl threads from thread context
* add code to run blas locally
2026-03-18 01:16:49 +08:00
Piotr Wilkin (ilintar)
d2ecd2d1cf
common/parser: add `--skip-chat-parsing` to force a pure content parser. ( #20289 )
...
* Add `--force-pure-content` to force a pure content parser.
* Update common/arg.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Change parameter name [no ci]
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Taimur Ahmad
054d8b0f24
ggml-cpu: fix RVV checks in quants and repacking ( #20682 )
...
* ggml-cpu: refactor quants.c; add rvv check
* ggml-cpu: refactor; disable generic fallback
2026-03-17 16:03:40 +02:00
Sigbjørn Skjæret
ab0bb93748
ci : bump ccache [no ci] ( #20679 )
...
* bump ccache
* forgotten
* disable for s390x
* disable also for ppc64le
2026-03-17 14:54:31 +01:00
Ruben Ortlam
3a5cb629b1
vulkan: async and event fixes ( #20518 )
...
* vulkan: fix event wait submission, event command buffer reset
* fix event command buffer reset validation error
* also reset command buffers before reuse
* use timeline semaphores instead of fences for event_synchronize
* don't use initializer list for semaphore wait info
* use multiple events to avoid reset issues
* fix event reuse issue with multiple vectors
* add semaphore wait condition also if compute_ctx already exists
* remove event pending stage
2026-03-17 14:27:23 +01:00
Georgi Gerganov
8cc2d81264
server : fix ctx checkpoint invalidation ( #20671 )
2026-03-17 15:21:14 +02:00
Justin Bradford
627670601a
kleidiai : fix MUL_MAT support for batched (3D) inputs ( #20620 )
...
* kleidiai : fix MUL_MAT support for batched (3D) inputs
The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.
This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.
Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.
Fixes #20608
* Kleidiai supports_op() should only return true for 3D inputs, not also 4D
2026-03-17 14:03:54 +02:00
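The shape and buffer checks described in the commit above can be sketched as follows. This is a hedged illustration, not the KleidiAI backend code: `supports_mul_mat`, `matmul_2d`, `mul_mat_batched`, and the buffer-type strings are hypothetical names, and `ne` only mirrors ggml's convention that `ne[2]` and `ne[3]` are the batch dimensions.

```python
def supports_mul_mat(ne_src1, weight_buffer_type):
    # Relaxed buffer check: None models src[0]->buffer being NULL while
    # weights are still being loaded.
    if weight_buffer_type not in (None, "kleidiai"):
        return False
    # Accept 2D and batched 3D inputs (ne[2] >= 1), but reject 4D (ne[3] > 1).
    return ne_src1[3] == 1

def matmul_2d(w, x):
    # w has n rows of length k, x has m rows of length k; result is m x n.
    return [[sum(a * b for a, b in zip(w_row, x_row)) for w_row in w]
            for x_row in x]

def mul_mat_batched(w, x_batches):
    # Batched inputs are handled by looping over the batch dimension
    # (ne12 in the commit message) and reusing the 2D routine.
    return [matmul_2d(w, x) for x in x_batches]
```

With these definitions, a `(64, 8, 4, 1)` input is accepted while `(64, 8, 4, 2)` is rejected, which matches the "3D but not 4D" rule in the follow-up bullet.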
Ruben Ortlam
740a447fc3
vulkan: allow graphics queue only through env var ( #20599 )
...
* vulkan: avoid graphics queue on non-RADV AMD drivers
* avoid graphics queues on small GPUs
* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE
* reenable transfer queue if graphics queue is not used
2026-03-17 10:09:59 +01:00
Neo Zhang
b6c83aad55
[SYCL] enhance UPSCALE to support all UT cases ( #20637 )
...
* [SYCL] enhance UPSCALE to support more cases
* rm test case result of SYCL1
2026-03-17 10:01:52 +08:00
Piotr Wilkin (ilintar)
2e4a6edd4a
tools/server: support refusal content for Responses API ( #20285 )
...
* Support refusal content for Responses API
* Update tools/server/server-common.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tools/server/server-common.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 01:42:04 +01:00
Xuan-Son Nguyen
d34ff7eb5b
model: mistral small 4 support ( #20649 )
...
* model: mistral small 4 support
* fix test
* fix test (2)
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* change newline
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 00:31:14 +01:00
Georgi Gerganov
45172df4d6
ci : disable AMX jobs ( #20654 )
...
[no ci]
2026-03-16 22:38:59 +02:00
Georgi Gerganov
9b342d0a9f
benches : add Nemotron 3 Nano on DGX Spark ( #20652 )
...
[no ci]
2026-03-16 21:50:43 +02:00
Sigbjørn Skjæret
55e87026f7
tests : write to binary buffer to avoid newline translation in jinja -py [no ci] ( #20365 )
2026-03-16 20:40:22 +01:00
Martin Klacer
cf21cdf36c
kleidiai: add data type check to get_tensor_traits ( #20639 )
...
* kleidiai: add data type check to get_tensor_traits
* Added check for F16 data type into get_tensor_traits path with input data
not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8)
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7
* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
updated kleidiai.cpp file as per suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-16 21:25:54 +02:00
Sigbjørn Skjæret
0ed992973b
ci : update labeler ( #20629 )
2026-03-16 20:24:20 +01:00
Aldehir Rojas
1bbec6a75d
jinja : add capability check for object args ( #20612 )
2026-03-16 17:43:14 +01:00
Georgi Gerganov
f47a246a08
sync : ggml
2026-03-16 17:22:06 +02:00
Georgi Gerganov
c0ccbd1f86
ggml : try fix arm build (whisper/0)
2026-03-16 17:22:06 +02:00
David366AI
f6da02c3f2
ggml : extend im2col f16 (ggml/1434)
...
* examples/yolo: fix load_model memory leak
* fix/issue-1433 ggml_compute_forward_im2col_f16 assert error
* fix/issue-1433
2026-03-16 17:22:06 +02:00
Pascal
dddca026bf
webui: add model information dialog to router mode ( #20600 )
...
* webui: add model information dialog to router mode
* webui: add "Available models" section header in model list
* webui: remove nested scrollbar from chat template in model info dialog
* chore: update webui build output
* feat: UI improvements
* refactor: Cleaner rendering + UI docs
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-16 15:38:11 +01:00
Aman Gupta
3c8521c4f5
llama-graph: replace cont with reshape for alpha in qwen35 ( #20640 )
2026-03-16 22:07:13 +08:00
Aleksander Grygier
67a2209fab
webui: Add MCP CORS Proxy detection logic & UI ( #20167 )
...
* refactor: MCP store cleanup
* feat: Add MCP proxy availability detection
* fix: Sidebar icon
* chore: update webui build output
* chore: Formatting
* chore: update webui build output
* chore: Update package lock
* chore: update webui build output
* chore: update webui build output
* chore: update webui build output
2026-03-16 13:05:36 +01:00
Pascal
d65c4f2dc9
Fix model selector locked to first loaded model with multiple models ( #20580 )
...
* webui: fix model selector being locked to first loaded model
When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.
* chore: update webui build output
2026-03-16 12:04:06 +01:00
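The guard described in the commit above amounts to skipping auto-selection once the user has picked a model. A minimal logic sketch follows, written in Python rather than the webui's own language; `auto_select` and its parameters are hypothetical names, with `selected_model_id` and `loaded_model_ids` standing in for the identifiers mentioned in the message.

```python
def auto_select(selected_model_id, loaded_model_ids):
    # Guard: a model that is already selected is never overridden when the
    # list of loaded models changes.
    if selected_model_id is not None:
        return selected_model_id
    # Nothing chosen yet: fall back to the first loaded model, if any.
    return loaded_model_ids[0] if loaded_model_ids else None
```

Without the guard, every change to the loaded-model list would reset the selection to the first entry, which is the behavior the commit reports fixing.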
Woof Dog
d8c331c0af
webui: use date in more human readable exported filename ( #19939 )
...
* webui: use date in exported filename
Move conversation naming and export to utils
update index.html.gz
* webui: move literals to message export constants file
* webui: move export naming and download back to the conversation store
* chore: update webui build output
* webui: add comments to some constants
* chore: update webui build output
2026-03-16 11:18:13 +01:00
Ruben Ortlam
46dba9fce8
vulkan: fix flash attention dot product precision ( #20589 )
2026-03-16 10:45:49 +01:00
Sigbjørn Skjæret
de8f01c2d7
model : wire up Nemotron-H tensors for NVFP4 support ( #20561 )
...
* wire up Nemotron-H tensors for NVFP4 support
* add ssm tensors
* alignment
2026-03-16 09:19:16 +01:00
Richard Davison
079e5a45f0
convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization ( #20539 )
...
* support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization
* cleanup
* fallback
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-16 09:18:47 +01:00
Masato Nakasaka
d3936498a3
common : fix iterator::end() dereference ( #20445 )
2026-03-16 08:50:38 +02:00
Aman Gupta
34818ea6c0
CUDA: GDN hide memory latency ( #20537 )
2026-03-16 11:41:45 +08:00
Piotr Wilkin (ilintar)
9e2e2198b0
tools/cli: fix disable reasoning ( #20606 )
2026-03-15 22:40:53 +01:00
Georgi Gerganov
88915cb55c
server : fix wait in test_cancel_requests() test ( #20601 )
...
* server : fix wait in test_cancel_requests() test
* codeowners : add team for server tests
2026-03-15 20:54:37 +02:00
Sigbjørn Skjæret
ebbf544ed1
sycl : fix for untransposed GDA recurrent state ( #20583 )
2026-03-15 19:10:15 +01:00
Sigbjørn Skjæret
b91d7dfe5b
ci : only save openvino caches on github-hosted master ( #20593 )
...
* only save openvino ccache on master
* disable toolkit cache if self-hosted
* only cache on github-hosted runners
* remove toolkit cache [no ci]
2026-03-15 18:58:13 +01:00
Johannes Gäßler
ae40cd27c8
CUDA: limit number of FA stream-k CUDA blocks ( #20586 )
2026-03-15 18:30:47 +01:00