Commit Graph

8602 Commits

Author SHA1 Message Date
Xuan-Son Nguyen 4a00bbfed6
server: (webui) no more gzip compression (#21073)
* webui: no more gzip

* try changing a small line

* Revert "try changing a small line"

This reverts commit 0d7a353159.

* fix lint

* fix test

* rebuild

* split into html/css/js

* lint

* chore: update webui build output

* chore: Update git hooks script

* server: update webui build output

* chore: Update pre-commit hook

* refactor: Cleanup

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-31 15:44:26 +02:00
Aldehir Rojas 624733d631
common : gpt-oss handle builtin and unsolicited tool calls (#21213) 2026-03-31 13:52:42 +02:00
lainon1 0b6ff47996
fix: correct misspellings in code comments (#21217)
- emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp,
gemma-embedding.cpp)
- imlpemented → implemented (llama-adapter.cpp)
- interere → interfere (llama-graph.cpp)
- overridde → overridden (chat.cpp)
- stastistics → statistics (ngram-map.h)
- layed → laid (llama-kv-cache.h)
- worster → worst (llama-context.cpp)
- sequantial → sequential (llama-batch.h)
2026-03-31 13:50:51 +02:00
Seungmin Kim eec6f85d7b
CI: Enable CPU and Vulkan ARM64 Release (#21207) 2026-03-31 19:02:56 +08:00
Georgi Gerganov 9281dd135d sync : ggml 2026-03-31 14:00:41 +03:00
Georgi Gerganov 0be6c7c9ce ggml : bump version to 0.9.9 (ggml/1449) 2026-03-31 14:00:41 +03:00
Adrien Gallouët 41361c8599
common : move up common_init() and fix Windows UTF-8 logs (#21176)
The build info is now only for debug, so we avoid the duplicate
with `--version`.

The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-31 12:53:41 +02:00
Neo Zhang 62278cedde
sycl : enhance fattn perf (#21185) 2026-03-31 13:31:50 +03:00
mtmcp 90aa83c6bd
common: add bounds check in common_init_result::sampler to prevent segfault on failed model load (#21082)
* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load

* Revert a308e584ca

* Add regression test

* Remove regression test for init-fail sampler check
2026-03-31 13:04:42 +03:00
SATISH K C fcc2d598c8
fix: include API key in CORS proxy requests for MCP connections (#21193)
* fix: include API key in CORS proxy requests for MCP connections

When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.

Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.

Fixes #21167

* fix: simplify headers assignment based on reviewer suggestion

Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
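
A minimal sketch of the header handling described above, written in Python rather than the WebUI's own language. The function bodies are assumptions for illustration only: `build_proxied_headers` borrows the name mentioned in the commit message but not its real implementation, and `request_headers` is a hypothetical helper. The sketch does not address how a target server's own Authorization header would be forwarded.

```python
def build_proxied_headers(target_headers, auth_headers):
    """Merge the forwarded target headers with the proxy's auth headers (assumed behavior)."""
    merged = dict(target_headers)
    merged.update(auth_headers)
    return merged


def request_headers(target_headers, api_key, use_proxy):
    """Return the headers to send: wrapped when proxied, passed through otherwise."""
    if not use_proxy:
        # Direct connection: headers go to the transport unchanged.
        return dict(target_headers)
    # Proxied connection: the /cors-proxy endpoint itself needs the Bearer token.
    auth = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    return build_proxied_headers(target_headers, auth)
```

With `use_proxy` false, the target headers are returned untouched, which mirrors the "pass headers directly to the transport otherwise" wording.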
2026-03-31 10:52:34 +02:00
Piotr Wilkin (ilintar) 4453e77561
server/webui: cleanup dual representation approach, simplify to openai-compat (#21090)
* server/webui: cleanup dual representation approach, simplify to openai-compat

* feat: Fix regression for Agentic Loop UI

* chore: update webui build output

* refactor: Post-review code improvements

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-31 10:42:06 +02:00
Adrien Gallouët 26dac845cc
vendor : update BoringSSL to 0.20260327.0 (#21211)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-31 09:21:54 +02:00
Galunid 5ce013cd7e
common : Disable backend sampling if reasoning budget is enabled (#21209) 2026-03-31 10:14:01 +03:00
shaofeiqi 08f21453ae
opencl: add q4_K gemm and gemv kernels for Adreno (#20919)
* opencl: add q4_K gemm and gemv kernels for Adreno

* opencl: fix whitespace

* opencl: add workarounds for compiler bugs on older devices

* opencl: handle fp16 denorm on X Elite

* opencl: fix kernel build error

* opencl: fix whitespace

* opencl: make q4_K cvt kernels signature consistent

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-03-30 12:19:16 -07:00
Seungmin Kim 84ae8434d0
CI : Enable CUDA and Vulkan ARM64 runners and fix CI/CD (#21122)
* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD

Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>

* Obtain source tag name from git tag

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-30 20:24:37 +02:00
Zhihao "Zephyr" Yao ead417f01c
jinja : handle empty expressions correctly (#20913)
* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().

* Treat empty computed member expressions with Jinja2 undefined semantics

Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.

- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`

* Handle undefined computed member properties

Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.

* Use default undefined value in member access

Initialize val and then return it when property is undefined.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* empty statement parses to blank_expression instead of noop_statement

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-30 20:08:46 +02:00
Oliver Simons 64ac9ab66a
CUDA : Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 (#21181)
* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.
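
The arithmetic behind this off-by-one can be checked with a small Python sketch. It assumes, for illustration only, that each launched thread initializes exactly one entry of an offsets array with `nrows + 1` entries (indices `0..nrows`); the helper names are hypothetical and are not taken from the CUDA code.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def covers_all_offsets(nrows: int, block_size: int, fixed: bool) -> bool:
    """True if the launched threads reach index nrows of the offsets array."""
    needed = nrows + 1  # entries 0..nrows inclusive
    n_blocks = ceildiv(nrows + 1 if fixed else nrows, block_size)
    return n_blocks * block_size >= needed


# Only the exactly-divisible case is left short by the old formula.
for nrows in (255, 256, 257):
    print(nrows, covers_all_offsets(nrows, 128, fixed=False),
          covers_all_offsets(nrows, 128, fixed=True))
```

When `nrows` is not a multiple of `block_size`, rounding up already leaves at least one spare thread, which is why the problem only appears for `nrows % block_size == 0`.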

Fixes #21162

* Reduce nrows in test case to 256, don't need 768
2026-03-30 16:20:00 +02:00
Radoslav Gerganov cad2d3884c
rpc : fix misleading error log (#21184)
When RPC runs with a remote backend that has no init_tensor
function (such as CPU and Metal), the server log fills up with error
messages saying that init_tensor is being called with a null buffer,
which is incorrect. This patch fixes that.
2026-03-30 17:05:11 +03:00
Aleksander Grygier 389c7d4955
webui: Fix branching logic on edit message (#21175)
* fix: Branching logic + small refactor

* chore: update webui build output
2026-03-30 14:40:50 +02:00
Aman Gupta 278521c33a
llama-model-loader: print warning when using overrides with mmap (#20978)
* llama-model-loader: use pinned memory for tensor overrides

* change to warning
2026-03-30 17:40:17 +08:00
Sigbjørn Skjæret e2eb39e81c
ci : bump ty to 0.0.26 (#21156)
* fix incorrect type ignore comments

* bump ty to 0.0.26
2026-03-30 09:29:15 +02:00
Xuan-Son Nguyen abf9a62161
server: wrap headers for mcp proxy (#21072)
* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-30 08:59:16 +02:00
Sigbjørn Skjæret 7c203670f8
add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150) 2026-03-29 19:45:40 +02:00
Gaurav Garg ec16a072f0
Optimize MOE GEMV kernel for BS > 1. (#20905)
* Optimize MOE GEMV kernel for BS > 1.

The previous MOE kernel for BS > 1 launched too many thread blocks (nrows_x, nchannels_dst, ncols_dst), each with very little work: a block of (32, 4) threads computed the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).

This change does not increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.
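
To make the launch-geometry change concrete, the following Python sketch compares the number of thread blocks under the two layouts quoted above. The tensor sizes and the value `rpb=2` are made-up illustration values, and the helper names are hypothetical; only the grid and block shapes come from the commit message.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def old_launch(nrows_x, nchannels_dst, ncols_dst):
    """Previous layout: one small block per (row, channel, column)."""
    grid = (nrows_x, nchannels_dst, ncols_dst)
    block = (32, 4)
    return grid, block


def new_launch(nrows_x, nchannels_dst, ncols_dst, rpb, warp_size=32):
    """New layout: rpb rows per block, one warp per destination column."""
    grid = (ceildiv(nrows_x, rpb), nchannels_dst)
    block = (warp_size, ncols_dst)
    return grid, block


def num_blocks(grid):
    """Total number of thread blocks in a launch grid."""
    total = 1
    for dim in grid:
        total *= dim
    return total


# Hypothetical sizes: 4096 rows, 8 channels, batch of 4 tokens.
print(num_blocks(old_launch(4096, 8, 4)[0]))
print(num_blocks(new_launch(4096, 8, 4, rpb=2)[0]))
```

Under these assumed sizes the new grid has far fewer blocks, each doing more work, which is the effect the commit message describes.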

* Remove em-dashes

* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-03-29 18:35:18 +02:00
Max Krasnyansky f5d1c4179f
hexagon: dma optimizations (mostly fixing regressions) (#21137)
* hex-fa: add simple dma cache for Mask

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.

* hex-dma: unset in-order desc bit which caused significant perf regression

We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.

* hex-rope: update comment to clarify that we don't need in-order DMA completions
2026-03-29 06:40:13 -07:00
Davi Henrique Linhares 2405d59cb6
devops: including compute-runtime for intel.Dockerfile (#21076) 2026-03-29 13:34:03 +08:00
Neo Zhang afe65aa282
[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093)
* use half cores to build, avoid OS hang

* reduce the amount of output text to shorten test time

* avoid returning 0
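
One plausible reading of these bullets is a job count of half the available cores that never drops to zero. The Python sketch below illustrates that reading only; `build_jobs` is a hypothetical helper and is not taken from the SYCL build script.

```python
import os


def build_jobs(total_cores=None):
    """Use half the cores for building, but always at least one job."""
    if total_cores is None:
        # os.cpu_count() may return None when the count is unknown.
        total_cores = os.cpu_count() or 1
    return max(1, total_cores // 2)
```

The `max(1, ...)` guard matters on single-core machines, where plain integer halving would yield 0.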
2026-03-29 09:02:45 +08:00
Sigbjørn Skjæret 65097181e4
fix **/x glob matching (#21129) 2026-03-28 22:27:38 +01:00
Piotr Wilkin (ilintar) 98ae0a0d36
common/parser: fix handling of tool definition with missing properties key (#21128) 2026-03-28 20:41:32 +01:00
Sigbjørn Skjæret 3a14a542f5
common : add character class support to glob_match (#21111)
* add character class support to glob_match

* remove pointless reference
2026-03-28 19:57:37 +01:00
BlueMöhre 968189729f
WebUI: Replace illegal nested button elements (#21026)
* remove/replace nested button elements

* map rest props to outer element

* solve TODO

* chore: update webui build output
2026-03-28 17:57:59 +01:00
Adrien e397d3885c
common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (#21124)
The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)
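
The "skip to matching ')'" step can be sketched as follows. This is a Python illustration of the idea under stated assumptions, not the C++ converter's code: `skip_group` is a hypothetical name, and the sketch ignores character classes such as `[)]`, which a full implementation would also need to consider.

```python
def skip_group(pattern: str, i: int) -> int:
    """Given i just past an opening '(', return the index just past its matching ')'.

    Escaped characters are skipped as a pair so that an escaped parenthesis
    does not change the nesting depth.
    """
    depth = 1
    while i < len(pattern) and depth > 0:
        c = pattern[i]
        if c == "\\":
            i += 2  # skip the backslash and the escaped character
            continue
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        i += 1
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return i


# The escaped ')' inside the lookahead does not close the group.
pattern = r"(?=a\)b)c"
print(pattern[skip_group(pattern, 1):])
```

Counting depth rather than searching for the first `)` is what keeps nested groups and escaped parentheses from ending the skip early.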
2026-03-28 17:55:38 +01:00
Aldehir Rojas e6f2ec01ff
common : add reasoning_format = none support to gpt-oss (#21094) 2026-03-28 09:33:39 -05:00
Georgi Gerganov edfb440a2f
server : fix processing of multiple back-to-back mtmd chunks (#21107) 2026-03-28 16:27:36 +02:00
Adrien Gallouët 3d66da1809
ci : gracefully shut down the server (#21110)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 14:49:57 +01:00
Woof Dog 82b703f8bc
Document custom default webui preferences in server README (#19771) 2026-03-28 14:19:16 +01:00
Aleksander Grygier 51a84efc53
webui: Conversation forking + branching improvements (#21021)
* refactor: Make `DialogConfirmation` extensible with children slot

* feat: Add conversation forking logic

* feat: Conversation forking UI

* feat: Update delete/edit dialogs and logic for forks

* refactor: Improve Chat Sidebar UX and add MCP Servers entry

* refactor: Cleanup

* feat: Update message in place when editing leaf nodes

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* refactor: Post-review improvements

* chore: update webui build output

* test: Update Storybook test

* chore: update webui build output

* chore: update webui build output
2026-03-28 13:38:15 +01:00
Adrien Gallouët b0f0dd3e51
vendor : update cpp-httplib to 0.40.0 (#21100)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 08:59:44 +01:00
Ruben Ortlam 0eb4764182
vulkan: add noncontiguous GLU support (#21081)
* vulkan: add noncontiguous GLU support

* fix compile issue
2026-03-28 08:44:56 +01:00
Piotr Wilkin (ilintar) 1f5d15e665
common/parser: fix reasoning whitespace bugs + extra parser tests (#21085)
* fix whitespace reasoning issues + add reconstruction tests

* Proper fix

* fix Nemotron autoparser test expectations to include newline in marker
2026-03-28 07:29:26 +01:00
Sigbjørn Skjæret c46758d28f
cli : add /glob command (#21084)
* add /glob command

* output error when max files reached

* support globbing outside curdir
2026-03-28 02:33:04 +01:00
Ts-sound bf934f28db
docker : fix and enable ARM64 image build (#20929)
* CI: fix ARM64 image build error & enable compilation

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: revert ggml/src/ggml-cpu/CMakeLists.txt

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04")

* CI: change cpu.Dockerfile gcc to 14;

* CI : cpu.Dockerfile , update pip install .

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

---------

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-03-28 01:45:09 +01:00
Adrien Gallouët 5c1a7b8355
server : add custom socket options to disable SO_REUSEPORT (#21056)
* server : add custom socket options to disable SO_REUSEPORT

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --reuse-port

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
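
The two traces above differ only in the SO_REUSEPORT call. A minimal Python sketch of the same opt-in pattern is shown below, using the standard `socket` module; `make_listener` and its `reuse_port` parameter are hypothetical names for illustration and say nothing about how llama-server or cpp-httplib implement the option. TCP_NODELAY from the trace is omitted for brevity.

```python
import socket


def make_listener(host="127.0.0.1", port=0, reuse_port=False):
    """Create a listening TCP socket; SO_REUSEPORT is set only on request."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # SO_REUSEPORT is not defined on every platform (for example Windows).
    if reuse_port and hasattr(socket, "SO_REUSEPORT"):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock
```

Passing `port=0` lets the operating system pick a free port, which keeps the sketch from colliding with a running server.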

* Update tools/server/README.md (llama-gen-docs)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix windows

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 01:12:43 +01:00
Aldehir Rojas 59d840209a
common : inhibit lazy grammar sampler while reasoning is active (#20970)
* common : inhibit grammar while reasoning budget is active

* cont : update force_pos in accept

* cont : fix tests

* cont : tweak should apply logic

* cont : return early not using grammar sampler

* Add tests

* cont : prevent backend sampling when reasoning budget enabled

* cont : fix typo

---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
2026-03-27 18:30:40 +01:00
Kusha Gharahi ff934e29bc
server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui (#20158)
* introduce LLAMA_SERVER_NO_WEBUI

* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI

* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE

* Missed this

* Add useWebUi to package.nix
2026-03-27 17:25:55 +01:00
Yiwei Shao ee051c1e4e
hexagon: support for IQ4_NL and MXFP4 (#21018)
* ggml-hexagon: add IQ4_NL and MXFP4 HMX matmul support

- Add IQ4_NL quantization type support to Hexagon backend (buffer
  set/get tensor repack, mul_mat, mul_mat_id dispatch)
- Implement HVX IQ4_NL vec_dot kernels (1x1, 2x1, 2x2) with
  LUT-based 4-bit index to int8 kvalue dequantization
- Add MXFP4 HMX dequantization path with E8M0 scale conversion,
  including batch-4 fast path and single-tile fallback
- Unify quantized row size / scale offset logic to handle Q4_0,
  Q8_0, IQ4_NL, and MXFP4 in the DMA fetch path

* ggml-hexagon: fix SKIP_QUANTIZE src1 address mismatch in mixed-quant models

* Fix the pragma indent
2026-03-27 09:22:41 -07:00
Aleksander Grygier e6f6770515
webui: Improve Chat Messages initial scroll + auto-scroll logic + add lazy loading with transitions to content blocks (#20999)
* refactor: Always use agentic content renderer for Assistant Message

* feat: Improve initial scroll + auto-scroll logic + implement fade in action for content blocks

* chore: update webui build output
2026-03-27 17:01:36 +01:00
AN Long 48cda24c11
server: remove the verbose_prompt parameter (#21059)
* server: respect the verbose_prompt parameter

* Revert "server: respect the verbose_prompt parameter"

This reverts commit 8ed885cf37.

* Remove --verbose-prompt parameter from llama-server

* Using set_examples instead of set_excludes
2026-03-27 13:36:13 +02:00
Xuan-Son Nguyen 871f1a2d2f
mtmd: add more sanity checks (#21047) 2026-03-27 11:00:52 +01:00
Xuan-Son Nguyen 20197b6fe3
server: add built-in tools backend support (#20898)
* wip: server_tools

* refactor

* displayName -> display_name

* snake_case everywhere

* rm redundant field

* change arg to --tools all

* add readme mention

* llama-gen-docs
2026-03-27 10:07:11 +01:00