8657 Commits

Author SHA1 Message Date
Georgi Gerganov 0be6c7c9ce ggml : bump version to 0.9.9 (ggml/1449) 2026-03-31 14:00:41 +03:00
Adrien Gallouët 41361c8599
common : move up common_init() and fix Windows UTF-8 logs (#21176)
The build info is now only for debug, so we avoid the duplicate
with `--version`.

The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-31 12:53:41 +02:00
Neo Zhang 62278cedde
sycl : enhance fattn perf (#21185) 2026-03-31 13:31:50 +03:00
mtmcp 90aa83c6bd
common: add bounds check in common_init_result::sampler to prevent segfault on failed model load (#21082)
* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load

* Revert a308e584ca

* Add regression test

* Remove regression test for init-fail sampler check
2026-03-31 13:04:42 +03:00
SATISH K C fcc2d598c8
fix: include API key in CORS proxy requests for MCP connections (#21193)
* fix: include API key in CORS proxy requests for MCP connections

When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.

Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.

Fixes #21167

* fix: simplify headers assignment based on reviewer suggestion

Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
2026-03-31 10:52:34 +02:00
Piotr Wilkin (ilintar) 4453e77561
server/webui: cleanup dual representation approach, simplify to openai-compat (#21090)
* server/webui: cleanup dual representation approach, simplify to openai-compat

* feat: Fix regression for Agentic Loop UI

* chore: update webui build output

* refactor: Post-review code improvements

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-31 10:42:06 +02:00
Adrien Gallouët 26dac845cc
vendor : update BoringSSL to 0.20260327.0 (#21211)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-31 09:21:54 +02:00
Galunid 5ce013cd7e
common : Disable backend sampling if reasoning budget is enabled (#21209) 2026-03-31 10:14:01 +03:00
shaofeiqi 08f21453ae
opencl: add q4_K gemm and gemv kernels for Adreno (#20919)
* opencl: add q4_K gemm and gemv kernels for Adreno

* opencl: fix whitespace

* opencl: add workarounds for compiler bugs on older devices

* opencl: handle fp16 denorm on X Elite

* opencl: fix kernel build error

* opencl: fix whitespace

* opencl: make q4_K cvt kernels signature consistent

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-03-30 12:19:16 -07:00
Seungmin Kim 84ae8434d0
CI : Enable CUDA and Vulkan ARM64 runners and fix CI/CD (#21122)
* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD

Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>

* Obtain source tag name from git tag

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-30 20:24:37 +02:00
Zhihao "Zephyr" Yao ead417f01c
jinja : handle empty expressions correctly (#20913)
* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().

* Treat empty computed member expressions with Jinja2 undefined semantics

Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.

- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`

* Handle undefined computed member properties

Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.

* Use default undefined value in member access

Initialize val and then return it when property is undefined.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* empty statement parses to blank_expression instead of noop_statement

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-30 20:08:46 +02:00
Oliver Simons 64ac9ab66a
CUDA : Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 (#21181)
* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes #21162

* Reduce nrows in test case to 256, don't need 768
2026-03-30 16:20:00 +02:00
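
The off-by-one described in the commit above can be checked with plain arithmetic. The sketch below is not the CUDA code: `ceildiv` is written out by hand, `covers_last_offset` is a hypothetical helper, and it assumes one thread initializes one offset entry. It shows that when `nrows % block_size == 0`, a grid of `ceildiv(nrows, block_size)` blocks reaches indices `0..nrows-1` only, so entry `nrows` stays unwritten.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division for positive operands."""
    return (a + b - 1) // b


def covers_last_offset(nrows: int, block_size: int, fixed: bool) -> bool:
    """Return True if the launched threads reach offset index `nrows`.

    Assumption for illustration: the offsets array has nrows + 1 entries
    (indices 0..nrows) and each thread initializes exactly one entry.
    """
    entries = nrows + 1
    grid = ceildiv(entries, block_size) if fixed else ceildiv(nrows, block_size)
    return grid * block_size >= entries


# block_size = 256 is an assumed value, chosen only for the example.
for nrows in (255, 256, 512):
    print(nrows, covers_last_offset(nrows, 256, fixed=False),
          covers_last_offset(nrows, 256, fixed=True))
```

Only the row counts that are exact multiples of the block size fail with the old formula, which matches the condition named in the commit title.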
Radoslav Gerganov cad2d3884c
rpc : fix misleading error log (#21184)
When RPC is running with a remote backend that doesn't have an init_tensor
function (like CPU and Metal), the server log fills up with error
messages saying that init_tensor is being called with a null buffer, which
is incorrect. This patch fixes this.
2026-03-30 17:05:11 +03:00
Aleksander Grygier 389c7d4955
webui: Fix branching logic on edit message (#21175)
* fix: Branching logic + small refactor

* chore: update webui build output
2026-03-30 14:40:50 +02:00
Aman Gupta 278521c33a
llama-model-loader: print warning when using overrides with mmap (#20978)
* llama-model-loader: use pinned memory for tensor overrides

* change to warning
2026-03-30 17:40:17 +08:00
Sigbjørn Skjæret e2eb39e81c
ci : bump ty to 0.0.26 (#21156)
* fix incorrect type ignore comments

* bump ty to 0.0.26
2026-03-30 09:29:15 +02:00
Xuan-Son Nguyen abf9a62161
server: wrap headers for mcp proxy (#21072)
* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-30 08:59:16 +02:00
Sigbjørn Skjæret 7c203670f8
add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150) 2026-03-29 19:45:40 +02:00
Gaurav Garg ec16a072f0
Optimize MOE GEMV kernel for BS > 1. (#20905)
* Optimize MOE GEMV kernel for BS > 1.

The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. A block of (32, 4) was doing the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).

This change does not increase compilation time, as a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.

* Remove em-dashes

* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-03-29 18:35:18 +02:00
Max Krasnyansky f5d1c4179f
hexagon: dma optimizations (mostly fixing regressions) (#21137)
* hex-fa: add simple dma cache for Mask

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.

* hex-dma: unset in-order desc bit which caused significant perf regression

We don't rely on true in-order processing of the DMA descriptors anywhere.
It turns out this mode caused a significant regression of around 3-4 TPS during token gen.

* hex-rope: update comment to clarify that we don't need in-order DMA completions
2026-03-29 06:40:13 -07:00
Davi Henrique Linhares 2405d59cb6
devops: including compute-runtime for intel.Dockerfile (#21076) 2026-03-29 13:34:03 +08:00
Neo Zhang afe65aa282
[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093)
* use half cores to build, avoid OS hang

* reduce the amount of output text to shorten test time

* avoid returning 0
2026-03-29 09:02:45 +08:00
Sigbjørn Skjæret 65097181e4
fix **/x glob matching (#21129) 2026-03-28 22:27:38 +01:00
Piotr Wilkin (ilintar) 98ae0a0d36
common/parser: fix handling of tool definition with missing properties key (#21128) 2026-03-28 20:41:32 +01:00
Sigbjørn Skjæret 3a14a542f5
common : add character class support to glob_match (#21111)
* add character class support to glob_match

* remove pointless reference
2026-03-28 19:57:37 +01:00
BlueMöhre 968189729f
WebUI: Replace illegal nested button elements (#21026)
* remove/replace nested button elements

* map rest props to outer element

* solve TODO

* chore: update webui build output
2026-03-28 17:57:59 +01:00
Adrien e397d3885c
common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (#21124)
The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)
2026-03-28 17:55:38 +01:00
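
The two skipping rules listed in the fix above can be sketched as follows. This is an illustration in Python, not the C++ converter from the commit: `after_open_paren` and `skip_unsupported_group` are hypothetical names, and the sketch ignores character classes such as `[)]`, which a real converter would also need to handle.

```python
def skip_unsupported_group(pattern: str, i: int) -> int:
    """Skip to just past the ')' that closes the group opened before index i.

    Escaped characters are stepped over as a pair so that an escaped
    parenthesis does not change the nesting depth.
    """
    depth = 1
    while i < len(pattern) and depth > 0:
        c = pattern[i]
        if c == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        i += 1
    return min(i, len(pattern))


def after_open_paren(pattern: str, i: int):
    """Decide what to do at index i, the character right after '('."""
    if pattern.startswith("?:", i):
        # Non-capturing group: drop the '?:' and parse it as a regular group.
        return ("group", i + 2)
    if pattern.startswith("?", i):
        # Lookahead and other unsupported syntax: skip the whole group.
        return ("skipped", skip_unsupported_group(pattern, i))
    return ("group", i)


print(after_open_paren("(?:ab)c", 1))
print(after_open_paren("(?=a\\)b)c", 1))
```

Advancing past `?:` before recursing is what keeps the `?` from being read as a quantifier applied to an empty sequence, the failure mode named in the root-cause paragraph.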
Aldehir Rojas e6f2ec01ff
common : add reasoning_format = none support to gpt-oss (#21094) 2026-03-28 09:33:39 -05:00
Georgi Gerganov edfb440a2f
server : fix processing of multiple back-to-back mtmd chunks (#21107) 2026-03-28 16:27:36 +02:00
Adrien Gallouët 3d66da1809
ci : gracefully shut down the server (#21110)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 14:49:57 +01:00
Woof Dog 82b703f8bc
Document custom default webui preferences in server README (#19771) 2026-03-28 14:19:16 +01:00
Aleksander Grygier 51a84efc53
webui: Conversation forking + branching improvements (#21021)
* refactor: Make `DialogConfirmation` extensible with children slot

* feat: Add conversation forking logic

* feat: Conversation forking UI

* feat: Update delete/edit dialogs and logic for forks

* refactor: Improve Chat Sidebar UX and add MCP Servers entry

* refactor: Cleanup

* feat: Update message in place when editing leaf nodes

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* refactor: Post-review improvements

* chore: update webui build output

* test: Update Storybook test

* chore: update webui build output

* chore: update webui build output
2026-03-28 13:38:15 +01:00
Adrien Gallouët b0f0dd3e51
vendor : update cpp-httplib to 0.40.0 (#21100)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 08:59:44 +01:00
Ruben Ortlam 0eb4764182
vulkan: add noncontiguous GLU support (#21081)
* vulkan: add noncontiguous GLU support

* fix compile issue
2026-03-28 08:44:56 +01:00
hipudding cb15cdb020 CANN: add SOFTPLUS unary op support
Implement GGML_UNARY_OP_SOFTPLUS using aclnnSoftplus with beta=1.0
and threshold=20.0. This enables hybrid models like Qwen3.5 to run
entirely on the CANN backend without graph splitting, which fixes
graph cache instability caused by the backend scheduler fragmenting
the computation graph when SOFTPLUS falls back to CPU.
2026-03-28 07:16:07 +00:00
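
For reference, the role of the two parameters in the commit above can be shown with a scalar sketch. It assumes `aclnnSoftplus` follows the common softplus definition, as used by PyTorch for example, where the function switches to the identity once `beta * x` exceeds `threshold`. That assumption is not checked against the CANN documentation.

```python
import math


def softplus(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
    """Scalar softplus: log(1 + exp(beta * x)) / beta.

    Above the threshold the result is numerically indistinguishable from x,
    so x is returned directly, which also avoids overflow in exp().
    """
    if beta * x > threshold:
        return x
    return math.log1p(math.exp(beta * x)) / beta


for value in (-5.0, 0.0, 5.0, 25.0):
    print(value, softplus(value))
```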
hipudding 168d05f3d5 CANN: add GGML_OP_SOLVE_TRI support
Implement triangular linear system solve (AX=B) using
aclnnTriangularSolve for the lower-triangular, non-unit case.
2026-03-28 06:47:56 +00:00
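
The lower-triangular, non-unit case named in the commit above corresponds to forward substitution. The sketch below is a plain reference in Python, not the `aclnnTriangularSolve` call; `solve_lower_tri` is a hypothetical name, and matrices are nested lists for simplicity.

```python
def solve_lower_tri(A, B):
    """Solve A X = B for X, where A is n x n lower-triangular.

    "Non-unit" means the diagonal of A is not assumed to be 1, so each
    row is divided by A[i][i]. B is n x k (k right-hand sides).
    """
    n = len(A)
    k = len(B[0])
    X = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            # Subtract contributions of the rows already solved.
            partial = sum(A[i][p] * X[p][j] for p in range(i))
            X[i][j] = (B[i][j] - partial) / A[i][i]
    return X


A = [[2.0, 0.0], [1.0, 4.0]]
B = [[2.0], [9.0]]
print(solve_lower_tri(A, B))
```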
hipudding 871ffea262 CANN: add GGML_OP_DIAG support
Create diagonal matrix from vector by filling dst with zeros then
copying src onto the diagonal via a strided view with InplaceCopy.
2026-03-28 06:47:56 +00:00
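
The zero-fill followed by a strided copy, as described in the commit above, can be illustrated on a flat buffer. This is a sketch in Python rather than the CANN code: `diag_from_vector` is a hypothetical name, and the stride of `n + 1` elements is what selects the diagonal of a row-major `n x n` matrix.

```python
def diag_from_vector(src):
    """Build an n x n diagonal matrix from a length-n vector."""
    n = len(src)
    dst = [0.0] * (n * n)          # step 1: fill dst with zeros
    dst[0 : n * n : n + 1] = src   # step 2: copy src through a strided view
    # Reshape the flat row-major buffer into rows for readability.
    return [dst[r * n : (r + 1) * n] for r in range(n)]


print(diag_from_vector([1.0, 2.0, 3.0]))
```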
hipudding 4a7bb25226 CANN: add GGML_OP_FILL support
Implement FILL using aclnnInplaceFillScalar to fill a tensor with
a constant scalar value from op_params.
2026-03-28 06:47:56 +00:00
hipudding 93e0c17661 CANN: add CUMSUM and TRI op support, fix graph cache op_params matching
- Implement GGML_OP_CUMSUM using aclnnCumsum
- Implement GGML_OP_TRI with all 4 tri types (LOWER, LOWER_DIAG, UPPER, UPPER_DIAG)
  using Tril/MaskedFillScalar approach to work around CANN sparse-zero bugs
- Fix graph cache to always compare op_params for all ops, not just a whitelist
2026-03-28 06:47:56 +00:00
hipudding 11e78d8499 CANN: simplify GATED_DELTA_NET implementation
- Remove dead code: _math and _naive variants are no longer needed
- Rename _batched to the public entry point ggml_cann_gated_delta_net
- In supports_op, return false for non-contiguous / GQA / non-F32 cases
  so the framework falls back to CPU instead of running the slow naive path
- The single remaining implementation uses aclnnBatchMatMul over all H
  heads per timestep, reducing kernel launches to O(n_seqs * n_tokens)
2026-03-28 06:47:56 +00:00
hipudding 3707b58628 CANN: add GATED_DELTA_NET op support
Implement GATED_DELTA_NET for the CANN (Ascend NPU) backend using a
batched approach that groups all attention heads into a single 3-D
BatchMatMul per recurrence step, reducing kernel launches from
O(n_seqs × H × n_tokens) to O(n_seqs × n_tokens).

Key design decisions:
- Use aclnnBatchMatMul (rank-3 only) with shape [H, S_v, S_v] to batch
  all H heads together for M×k, outer-product, and M×q steps
- Pre-allocate temporary buffers (g_exp, mk, delta, outer) reused
  across all time steps to avoid per-step allocations
- Support both scalar gate (g shape [1,H]) and KDA per-dim gate
  (g shape [S_v,H]) via appropriate broadcast shapes
- Fall back to naive per-head scalar loop for permuted/GQA/non-F32
  inputs that don't meet batched path requirements
- Relax CANN precision tolerance to 1e-6 in tests to account for
  different FP32 accumulation order in BatchMatMul vs scalar loops
2026-03-28 06:47:56 +00:00
hipudding 140c5a3d1b CANN: add GATED_DELTA_NET op support 2026-03-28 06:47:56 +00:00
hipudding c0e78773e9 CANN: implement GGML_OP_SET for CANN backend
Add SET operator support using aclnnInplaceCopy, modeled after the
existing ACC implementation. This enables the scheduler to assign
SET ops to CANN when the output tensor resides on device memory,
avoiding cross-device write issues with delta-net hybrid models.

All 12 test-backend-ops SET tests pass (f32/i32, inplace/non-inplace, dim 1/2/3).
2026-03-28 06:47:56 +00:00
hipudding be1492d21f CANN: implement backend memset_tensor interface
Add ggml_backend_cann_buffer_memset_tensor and wire it into
`ggml_backend_cann_buffer_interface`.

This ensures backend tensor memset operations are supported
and avoids incorrect behavior when tensors need explicit
zero-initialization (e.g. cache buffers).
2026-03-28 06:47:56 +00:00
Piotr Wilkin (ilintar) 1f5d15e665
common/parser: fix reasoning whitespace bugs + extra parser tests (#21085)
* fix whitespace reasoning issues + add reconstruction tests

* Proper fix

* fix Nemotron autoparser test expectations to include newline in marker
2026-03-28 07:29:26 +01:00
Sigbjørn Skjæret c46758d28f
cli : add /glob command (#21084)
* add /glob command

* output error when max files reached

* support globbing outside curdir
2026-03-28 02:33:04 +01:00
Ts-sound bf934f28db
docker : fix and enable ARM64 image build (#20929)
* CI: fix ARM64 image build error & enable compilation

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: revert ggml/src/ggml-cpu/CMakeLists.txt

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04")

* CI: change cpu.Dockerfile gcc to 14;

* CI : cpu.Dockerfile , update pip install .

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

---------

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-03-28 01:45:09 +01:00
Adrien Gallouët 5c1a7b8355
server : add custom socket options to disable SO_REUSEPORT (#21056)
* server : add custom socket options to disable SO_REUSEPORT

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --reuse-port

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update tools/server/README.md (llama-gen-docs)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix windows

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 01:12:43 +01:00
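
The socket options visible in the strace output above can be reproduced with a small listener. This sketch uses Python's standard `socket` module and is not the server's implementation: `make_listener` and its `reuse_port` parameter are hypothetical, and `SO_REUSEPORT` is guarded because it is not defined on every platform (Windows, for example).

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0, reuse_port: bool = False):
    """Create a listening TCP socket, optionally enabling SO_REUSEPORT.

    port=0 asks the OS for any free port, which keeps the example from
    colliding with a server that is already running.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if reuse_port and hasattr(socket, "SO_REUSEPORT"):
        # Opt-in only: several processes may then bind the same address/port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


listener = make_listener(reuse_port=True)
print(listener.getsockname())
listener.close()
```

Leaving the option off by default means a second process that tries to bind the same port fails instead of silently sharing it.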
Aldehir Rojas 59d840209a
common : inhibit lazy grammar sampler while reasoning is active (#20970)
* common : inhibit grammar while reasoning budget is active

* cont : update force_pos in accept

* cont : fix tests

* cont : tweak should apply logic

* cont : return early not using grammar sampler

* Add tests

* cont : prevent backend sampling when reasoning budget enabled

* cont : fix typo

---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
2026-03-27 18:30:40 +01:00
Kusha Gharahi ff934e29bc
server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui (#20158)
* introduce LLAMA_SERVER_NO_WEBUI

* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI

* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE

* Missed this

* Add useWebUi to package.nix
2026-03-27 17:25:55 +01:00