Commit Graph

246 Commits

Author SHA1 Message Date
Oleksandr Kuvshynov c5fef0fcea
server: update readme to mention n_past_max metric (#16436)
https://github.com/ggml-org/llama.cpp/pull/15361 added new metric
exported, but I've missed this doc.
2025-10-06 10:53:31 +03:00
Gabe Goodhart ca71fb9b36
model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)
* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
conicidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-10-05 14:57:47 +02:00
Radoslav Gerganov 898acba681
rpc : add support for multiple devices (#16276)
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
2025-10-04 12:49:16 +03:00
ddh0 f6dcda3900
server : context checkpointing for hybrid and recurrent models (#16382)
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-03 21:34:51 +03:00
Aleksander Grygier 84c8e305e8
Fix missing messages on sibling navigation (#16408)
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
2025-10-03 12:51:40 +02:00
Aleksander Grygier 77233277c9
Capture model name only after first token (streaming) or completed request (#16405)
* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output
2025-10-03 11:30:39 +02:00
Aleksander Grygier 136bda78c5
webui : Fix messages payload sent to chat completions (#16402)
* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output
2025-10-03 10:11:34 +03:00
Pascal 5113efd34c
fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (#16356)
Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-10-03 08:01:31 +02:00
Piotr Wilkin (ilintar) 34fcc5a4ac
model : Apertus model implementation (#15852)
* First attempt

* No permute during convert (fixes qk tensors), proper norm application.

* RoPE = NeoX

* Coherence!

* Migrate xielu params from tensors to hyperparameters

* Simple CUDA kernel

* Revert stupid LLM refactorings

* Chat template support

* configchecker / flake8 errors

* Reorder unary.cu

* I do conclude that LLMs are, in fact, stupid.

* Fix after merge

* Final newline

* Make xIELU an UNARY_OP

* Final newline

* Correctly account for parameter shift

* Argh.

* Update ggml/src/ggml-cpu/unary-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Refactor: remove unused methods, inline and factorize softplus, add const modifiers

* Revert CUDA changes, implement xIELU as a separate OP

* Pesky newline

* Add float2half / half2float for F16 inputs/outputs

* CUDA variants, attempt 2

* Actually, attempt 3

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Missing convert header

* Proper formula and reference for xIELU in the comments.

* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add tensor mappings for Apertus to global list instead

* Fix lazy on scalars

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Add comment about the constraints on positive/negative alpha

* Change `softplus` to `ggml_softplus`

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-02 20:43:22 +03:00
Adrien Gallouët 4201deae9c
common: introduce http.h for httplib-based client (#16373)
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
2025-10-01 20:22:18 +03:00
Aleksander Grygier 764799279f
Conversation action dialogs as singletons from Chat Sidebar + apply conditional rendering for Actions Dropdown for Chat Conversation Items (#16369)
* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build
2025-10-01 18:18:10 +02:00
Aleksander Grygier 2a9b63383a
Improve code block color theming (#16325)
* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build
2025-10-01 15:54:42 +02:00
Aleksander Grygier 4f1575921c
Add optional setting for showing "Model used:" information (#16337)
* feat: Add a setting to include model name used to generate the message

* feat: UI improvements

* feat: Save model info along with the database message entry creation

* chore: Build webui static output
2025-10-01 12:08:16 +02:00
Aleksander Grygier aa9538a63a
webui: Remove running `llama-server` within WebUI `dev.sh` script (#16363) 2025-10-01 08:40:26 +03:00
Pascal 16b0ca0d2e
Chatapi ignore empty sampling (#16330)
* fix: skip empty sampling fields instead of coercing to 0 in chat API options

* chore: update webui build output
2025-09-30 19:18:54 +02:00
Pascal 5f7e166cbf
Fix thinking blocks with quotes + add handling `[THINK]...[/THINK]` blocks (#16326)
* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-09-29 18:49:47 +02:00
Aleksander Grygier 3a2bdcda0b
Improve Mobile UI for dialogs and action dropdowns (#16222)
* fix: Always show conversation item actions

* feat: Improve Alert Dialog and Dialog mobile UI

* feat: Add settings reset to default confirmation

* fix: Close Edit dialog on save

* chore: update webui build output

* webui: implement proper z-index system and scroll management

- Add CSS variable for centralized z-index control
- Fix dropdown positioning with Settings dialog conflicts
- Prevent external scroll interference with proper event handling
- Clean up hardcoded z-index values for maintainable architecture

* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides

* feat: Use `dvh` instead of computed px height for dialogs max height on mobile

* chore: update webui build output

* feat: Improve Settings fields UI

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-09-29 10:37:20 +02:00
Pascal 66bb7985c3
fix: preserved zero values in chat settings inputs and textareas by switching to nullish coalescing for field values and default placeholders (#16312) 2025-09-29 09:08:41 +02:00
Vinkal 2f61c0f5bf
llama-cli: prevent spurious assistant token (#16202)
* tools/main: llama-cli: prevent spurious assistant token (#13402)

During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.

Fixes #13402.

Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>

* Update tools/main/main.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* tools/main: remove outdated comment

Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>

---------

Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-29 10:03:12 +03:00
ddh0 3ffd0fae47
perplexity : show more kl-divergence data (#16321)
Adds additional percentile data for displayed in the output of `llama-perplexity --kl-divergence`:
- Added 95 percentile (mirroring existing 5 percentile)
- Added 0.1 percentile (mirroring existing 99.9 percentile)
2025-09-29 09:30:45 +03:00
Imad Saddik 2811c65286
Fixed a few typos in the README of the LLaMA.cpp HTTP Server [no ci] (#16297) 2025-09-28 13:04:46 +02:00
Aleksander Grygier 4807e8f96a
Show message actions by default (#16289) 2025-09-27 19:56:40 +02:00
Adrien Gallouët 234e2ff8ed
server : remove old LLAMA_SERVER_SSL (#16290)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-27 19:17:08 +03:00
Aleksander Grygier 807e8c6d31
Enhance text file detection logic for file attachments (#16199)
* feat: Enhances text file detection logic

* chore: Build static `webui` output

* chore: update webui build output
2025-09-26 19:25:29 +02:00
Aleksander Grygier 1a18927894
Allow viewing conversations even when llama server is down (#16255)
* webui: allow viewing conversations and sending messages even if llama-server is down

- Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable.
- Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation.
- Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data.

* feat: Add UI for `props` endpoint unavailable + cleanup logic

* webui: extend cached props fallback to offline errors

Treat connection failures (refused, DNS, timeout, fetch) the same way as
server 5xx so the warning banner shows up when cache is available, instead
of falling back to a full error screen.

* webui: Left the chat form enabled when a server warning is present so operators can keep sending messages

e.g., to restart the backend over llama-swap, even while cached /props data is in use

* chore: update webui build output

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-09-26 18:35:42 +02:00
Isaac McFadyen e0539eb6ae
webui: switch to hash-based routing (alternative of #16079) (#16157)
* Switched web UI to hash-based routing

* Added hash to missed goto function call

* Removed outdated SPA handling code

* Fixed broken sidebar home link
2025-09-26 18:36:48 +03:00
Aleksander Grygier 5d0a40f390
Always show message actions for mobile UI + improvements for user message sizing (#16076) 2025-09-26 15:59:07 +02:00
Aleksei Nikiforov cc1cfa277b
mtmd : fix uninitialized variable in bicubic_resize (#16275)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-26 15:00:44 +02:00
Daniel Bevenius d0991da39d
server : add support for external server for tests (#16243)
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.

The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
2025-09-25 11:36:47 +02:00
Douglas Hanley b5bd037832
llama : add support for qwen3 reranker (#15824) 2025-09-25 11:53:09 +03:00
Johannes Gäßler e789095502
llama: print memory breakdown on exit (#15860)
* llama: print memory breakdown on exit
2025-09-24 16:53:48 +02:00
Quentin Bramas 138c87ce8b
webui : fix handling incomplete chunks (#16107) 2025-09-22 11:53:13 +03:00
Georgi Gerganov 1d660d2fae
ci : use smaller model (#16168)
* ci : switch from gemma to qwen3 0.6b

* ci : use smaller model for some tests
2025-09-22 09:11:39 +03:00
Georgi Gerganov 4d0a7cbc61
ci : adjust params for less runtime (#16167)
* ci : adjust params for less runtime

* ci : gate BF16 on some hardware

* ci : move extra tests to Arm runner
2025-09-22 08:31:40 +03:00
Benni 459c0c2c1a
server: fix SSE and OpenAI compatibility for error messages when streaming (#16109)
* server: fix SSE and OpenAI compatibility for error messages when streaming

* server: remove obsolete event parameter and use required data fieldname instead
2025-09-20 07:56:30 +02:00
ssweens be79d9fdd9
llama-bench: add --devices and --list-devices support (#16039)
* * llama-bench: add --devices support
- Support --devices same as llama-server
- Provide for benchmarking different device combinations
- Include --list-devices like llama-server for convenience

* fix: field display ordering restored

* fix: integrated the rpc devices
- aimed to mimic the server as much as possible

* cleanup: defaults for list-devices
- handle dup device listing with RPC

* cleanup: remove dup device load calls

* docs: update llama-bench
- added the recently added n-cpu-moe option to the docs while in there

* llama-bench: rpc device simplification
* rpc servers unify with other devices earlier, simplifying code
* --list-devices made stateless and simpler
* various cleanup
2025-09-20 00:15:21 +02:00
Aleksander Grygier 4067f07fc5
feat: Improve mobile UI for Settings Dialog (#16084)
* feat: Improve mobile UI for Settings Dialog

* chore: update webui build output

* fix: Linting errors

* chore: update webui build output
2025-09-19 09:52:27 +02:00
Radoslav Gerganov 2b6b55a59f
server : include usage statistics only when user request them (#16052)
* server : include usage statistics only when user request them

When serving the OpenAI compatible API, we should check if
{"stream_options": {"include_usage": true} is set in the request when
deciding whether we should send usage statistics

closes: #16048

* add unit test
2025-09-18 10:36:57 +00:00
Aleksander Grygier a7a98e0fff
SvelteKit-based WebUI (#14839) 2025-09-17 19:29:13 +02:00
jacekpoplawski 8ff206097c
llama-bench: add --n-cpu-moe support (#15952)
* llama-bench: add --n-cpu-moe support

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.
2025-09-16 16:17:08 +02:00
Yuri Khrustalev 07808ebb07
cmake : Do not install tools on iOS targets (#15903) 2025-09-16 09:54:44 +07:00
Nikolay Popov 28c39da7c6
llama-run: Fix model download on Windows (#15988)
* llama-run: Fix model download on Windows
 * fix SSL error (SSL peer certificate or SSH remote key was not OK)
 * fix program crash on std::filesystem::rename

* llama-run: create a separate method to utilize RAII

* llama-run: handle rename exception
2025-09-15 11:08:30 +01:00
ddh0 a68f31edd7
fix KLD percentile output (#15999)
In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably).
2025-09-15 09:54:57 +02:00
Sigbjørn Skjæret 6c019cb04e
server : only attempt to enable thinking if using jinja (#15967) 2025-09-14 21:17:04 +02:00
Radoslav Gerganov 918b26f197
rpc : fix regression when --device is used (#15981)
Fix regression introduced with commit 50f4281a6
2025-09-14 12:28:18 +03:00
Radoslav Gerganov d1c6f11f47
doc : update documentation for --tensor-split (#15980)
* doc : update documentation for --tensor-split

* Update tools/main/README.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update tools/main/README.md

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-09-14 12:10:07 +03:00
Diego Devesa 50f4281a6f
llama : allow using iGPUs with --device (#15951)
* llama : allow using iGPUs with --device

* mtmd : allow iGPU

* rpc-server : allow iGPU
2025-09-13 16:49:49 +02:00
Georgi Gerganov f088b6a84f
server : adjust prompt similarity thold + add logs (#15913)
ggml-ci
2025-09-12 17:02:55 +03:00
Diego Devesa 360d6533db
ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797)
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type

ggml-backend : add device id to device props

llama : only use iGPU devices if there are no GPU devices

llama : do not use multiple devices from different backends with the same device id
2025-09-11 22:47:38 +02:00
Daniel Bevenius 70cd37dbbe
requirements : update transformers/torch for Embedding Gemma (#15828)
* requirements : update transformers/torch for Embedding Gemma

This commit updates the requirements to support converting
Embedding Gemma 300m models.

The motivation for this change is that during development I had a local
copy of the transformers package which is what I used for converting
the models. This was a mistake on my part and I should have also updated
my transformers version to the official release.

I had checked the requirements/requirements-convert_legacy_llama.txt
file and noted that the version was >=4.45.1,<5.0.0 and came to the
conculusion that no updated would be needed, this assumed that
Embedding Gemma would be in a transformers release at the time
Commit fb15d649ed ("llama : add support
for EmbeddingGemma 300m (#15798)) was merged. So anyone wanting to
convert themselves would be able to do so. However, Embedding Gemma is
a preview release and this commit updates the requirements to use this
preview release.

* resolve additional python dependencies

* fix pyright errors in tokenizer test and remove unused import
2025-09-09 06:06:52 +02:00