Commit Graph

448 Commits

Johannes Gäßler a52dc60ba3
llama_fit_params: return enum for fail vs. error (#18374) 2025-12-27 09:59:19 +01:00
o7si 4893cc07bb
server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)
* server : fix crash when seq_rm fails for hybrid/recurrent models

* server : add allow_processing param to clear_slot
2025-12-26 16:35:29 +01:00
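For context on the fix above: llama_memory_seq_rm() returns false rather than aborting when a partial range removal is unsupported, which is the case for hybrid/recurrent models. A minimal sketch of the fallback pattern, with illustrative names rather than the exact server code:

```cpp
#include "llama.h"

// Illustrative sketch only. Removing a partial KV range can fail for
// hybrid/recurrent models, so fall back to dropping the whole sequence
// instead of letting the failed removal crash the slot.
static void clear_slot_kv(llama_context * ctx, llama_seq_id seq_id, llama_pos p0) {
    llama_memory_t mem = llama_get_memory(ctx);
    if (!llama_memory_seq_rm(mem, seq_id, p0, -1)) {
        // p0/p1 = -1 removes the entire sequence, which is always supported
        llama_memory_seq_rm(mem, seq_id, -1, -1);
    }
}
```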
Xuan-Son Nguyen f5acfb2ffa
server: (router) add stop-timeout option (#18350)
* server: (router) add stop-timeout option

* also allow stop while loading

* add docs

* unload_lru: also wait for unload to complete
2025-12-24 23:47:49 +01:00
Aadeshveer Singh c184284230
fit-params : fix race condition in fit-params output (#18276) 2025-12-24 15:57:38 +01:00
Xuan-Son Nguyen 5ee4e43f26
server: return_progress to also report 0% processing state (#18305) 2025-12-23 21:49:05 +01:00
Pascal 5b6c9bc0f3
webui: apply webui_settings on first load (#18223)
* webui: apply webui_settings on first load

The webui_settings from /props were not applied on the initial load
when default_generation_settings.params was null.

Now the settings sync whenever serverProps is available, regardless of
params, and this works for both single-model and router modes.

* chore: update webui build output
2025-12-23 15:48:03 +01:00
Xuan-Son Nguyen 849d021104
server: fix crash with model not having BOS/EOS (#18321) 2025-12-23 14:39:36 +01:00
Xuan-Son Nguyen 179fd82a72
gen-docs: automatically update markdown file (#18294)
* gen-docs: automatically update markdown file

* also strip whitespace

* do not add extra newline

* update TOC
2025-12-22 19:30:19 +01:00
Xuan-Son Nguyen 6ce863c803
server: prevent data race from HTTP threads (#18263)
* server: prevent data race from HTTP threads

* fix params

* fix default_generation_settings

* nits: make handle_completions_impl look less strange

* stricter const

* fix GGML_ASSERT(idx < states.size())

* move index to be managed by server_response_reader

* http: make sure req & res lifecycle are tied together

* fix compile

* fix buggy index handling

* fix data race for lora endpoint

* nits: fix shadow variable

* nits: revert redundant changes

* nits: correct naming for json_webui_settings
2025-12-22 14:23:34 +01:00
Xuan-Son Nguyen 3997c78e33
server: fix data race in to_json_anthropic (#18283) 2025-12-22 13:21:43 +01:00
Xuan-Son Nguyen 86af848153
server: (docs) remove mention about extra_args (#18262) 2025-12-22 12:22:01 +01:00
Johannes Gäßler 147a521636
tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
Xuan-Son Nguyen ddcb75dd8a
server: add auto-sleep after N seconds of idle (#18228)
* implement sleeping at queue level

* implement server-context suspend

* add test

* add docs

* optimization: add fast path

* make sure to free llama_init

* nits

* fix use-after-free

* allow /models to be accessed during sleeping, fix use-after-free

* don't allow accessing /models during sleep, it is not thread-safe

* fix data race on accessing props and model_meta

* small clean up

* trailing whitespace

* rm outdated comments
2025-12-21 02:24:42 +01:00
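A hypothetical sketch of "sleeping at the queue level" as described above: the worker waits on the task queue with a timeout, and a timeout with no pending work tells the caller to suspend the server context. All names here are illustrative:

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

struct task_queue {
    std::mutex              mtx;
    std::condition_variable cv;
    std::deque<int>         tasks;
    int                     idle_timeout_s = 300; // would come from a CLI arg

    void push(int t) {
        { std::lock_guard<std::mutex> lk(mtx); tasks.push_back(t); }
        cv.notify_one();
    }

    // returns true if a task was popped; false means the idle timeout
    // expired and the caller should suspend the server context
    bool wait_for_task(int & out) {
        std::unique_lock<std::mutex> lk(mtx);
        if (!cv.wait_for(lk, std::chrono::seconds(idle_timeout_s),
                         [&] { return !tasks.empty(); })) {
            return false;
        }
        out = tasks.front();
        tasks.pop_front();
        return true;
    }
};
```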
Oleksandr Kuvshynov 408616adbd
server : [easy] fix per round speculative decode logging (#18211)
Currently we always log 0 drafted tokens, because slot.drafted is
cleared before the log statement runs.

To reproduce:
Run llama-server with Devstral-2 as the main model, Devstral-2-Small as
the draft model (-md), and verbose logging:

```
% ./build/bin/llama-server -v  \
  -m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
  -md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
  -c 8192 2> /tmp/llama.cpp.debug
```

Check the log:

```
slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 741
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 746
slot update_slots: id  3 | task 0 | accepted 16/0 draft tokens, new n_tokens = 763
slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 775
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 778
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 783
slot update_slots: id  3 | task 0 | accepted 8/0 draft tokens, new n_tokens = 792
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 795
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 797
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 799
slot update_slots: id  3 | task 0 | accepted 0/0 draft tokens, new n_tokens = 800
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 803
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 805
slot update_slots: id  3 | task 0 | accepted 6/0 draft tokens, new n_tokens = 812
slot update_slots: id  3 | task 0 | accepted 3/0 draft tokens, new n_tokens = 816
```

After the fix, we get correct per-round logging:

```
slot update_slots: id  3 | task 0 | accepted 7/8 draft tokens, new n_tokens = 654
slot update_slots: id  3 | task 0 | accepted 1/2 draft tokens, new n_tokens = 656
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 659
slot update_slots: id  3 | task 0 | accepted 1/16 draft tokens, new n_tokens = 661
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 664
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 681
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 698
slot update_slots: id  3 | task 0 | accepted 3/4 draft tokens, new n_tokens = 702
slot update_slots: id  3 | task 0 | accepted 5/12 draft tokens, new n_tokens = 708
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 725
slot update_slots: id  3 | task 0 | accepted 1/1 draft tokens, new n_tokens = 727
slot update_slots: id  3 | task 0 | accepted 8/16 draft tokens, new n_tokens = 736
```
2025-12-20 10:57:40 +01:00
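The bug above is a plain ordering problem; a minimal sketch with made-up names (the real code lives in tools/server):

```cpp
#include <cstdio>
#include <vector>

struct slot_t { std::vector<int> drafted; };

static void finish_decode_round(slot_t & slot, int n_accepted) {
    // buggy order: clearing first makes the denominator always 0
    //   slot.drafted.clear();
    //   printf("accepted %d/%zu draft tokens\n", n_accepted, slot.drafted.size());

    // fixed order: log first, then clear for the next round
    printf("accepted %d/%zu draft tokens\n", n_accepted, slot.drafted.size());
    slot.drafted.clear();
}
```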
Xuan-Son Nguyen 9e39a1e6a9
server: support load model on startup, support preset-only options (#18206)
* server: support autoload model, support preset-only options

* add docs

* load-on-startup

* fix

* Update common/arg.cpp

Co-authored-by: Pascal <admin@serveurperso.com>

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-20 09:25:27 +01:00
Pascal 14931a826e
arg: fix order to use short form before long form (#18196)
* arg: fix order to use short form before long form

* arg: update doc

* arg: update test-arg-parser

* arg: address review feedback from ngxson

simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified

* arg: update doc
2025-12-19 18:01:56 +01:00
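A sketch of the invariant this verifies, with hypothetical names: for each option with multiple spellings, only the first and last forms are compared, which is why middle positions in 3+ arg sets stay unverified, as noted above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// each vector holds the spellings of one option, shortest form first
static void check_arg_order(const std::vector<std::string> & forms) {
    if (forms.size() < 2) {
        return;
    }
    assert(forms.front().length() <= forms.back().length());
}

int main() {
    check_arg_order({"-m",   "--model"});        // ok: short form first
    check_arg_order({"-ngl", "--n-gpu-layers"}); // ok
}
```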
Aman Gupta cc0a04343e
server: friendlier error msg when ctx < input (#18174)
* llama-server: friendlier error msg when ctx < input

This PR adds formatted strings to the server's send_error function

* llama-server: use string_format inline

* fix test
2025-12-19 12:10:00 +01:00
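A sketch of the friendlier message, assuming a string_format-style helper like the one in common (names and wording are illustrative, not the exact diff):

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// minimal stand-in for the printf-style helper the commit mentions
static std::string string_format(const char * fmt, ...) {
    char buf[512];
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    return buf;
}

static std::string ctx_too_small_msg(int n_prompt, int n_ctx_slot) {
    return string_format(
        "input is too large: %d prompt tokens exceed the context size of %d; "
        "increase the context size or shorten the input",
        n_prompt, n_ctx_slot);
}
```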
Xuan-Son Nguyen 98c1c7a7bf
presets: refactor, allow cascade presets from different sources, add global section (#18169)
* presets: refactor, allow cascade presets from different sources

* update docs

* fix neg arg handling

* fix empty mmproj

* also filter out server-controlled args before to_ini()

* skip loading custom_models if not specified

* fix unset_reserved_args

* fix crash on windows
2025-12-19 12:08:20 +01:00
Aleksander Grygier acb73d8340
webui: Add editing attachments in user messages (#18147)
* feat: Enable editing attachments in user messages

* feat: Improvements for data handling & UI

* docs: Update Architecture diagrams

* chore: update webui build output

* refactor: Exports

* chore: update webui build output

* feat: Add handling paste for Chat Message Edit Form

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output
2025-12-19 11:14:07 +01:00
Xuan-Son Nguyen 8ea958d4d9
model : add ASR support for LFM2-Audio-1.5B (conformer) (#18106)
* ASR with LFM2-Audio-1.5B

* Set rope_theta

* Fix comment

* Remove rope_theta setting

* Address PR feedback

* rename functions to conformer

* remove some redundant ggml_cont

* fix missing tensor

* add prefix "a." for conv tensors

* remove redundant reshape

* clean up

* add test model

---------

Co-authored-by: Tarek Dakhran <tarek@liquid.ai>
2025-12-19 00:18:01 +01:00
Pascal f9ec8858ed
webui: display prompt processing stats (#18146)
* webui: display prompt processing stats

* feat: Improve UI of Chat Message Statistics

* chore: update webui build output

* refactor: Post-review improvements

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-18 17:55:03 +01:00
Aleksander Grygier 9ce64aed7d
webui: Fix selecting generated output issues during active streaming (#18091)
* draft: incremental markdown rendering with stable blocks

* refactor: Logic improvements

* refactor: DRY Markdown post-processing logic

* refactor: ID generation improvements

* fix: Remove runes

* refactor: Clean up & add JSDocs

* chore: update webui static output

* fix: Add tick to prevent race conditions for rendering Markdown blocks

Suggestion from @ServeurpersoCom

Co-authored-by: Pascal <admin@serveurperso.com>

* chore: Run `npm audit fix`

* chore: update webui static output

* feat: Improve performance using global counter & id instead of UUID

* refactor: Enhance Markdown rendering with link and code features

* chore: update webui static output

* fix: Code block content extraction

* chore: update webui static output

* chore: update webui static output

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-18 11:13:52 +01:00
Kim S. 900316da4e
webui: fix chat screen shadow width (#18010)
* webui: fix chat screen shadow width

* chore: add index.html.gz
2025-12-18 11:08:42 +01:00
Pascal 6ce3d85796
server: (webui) add --webui-config (#18028)
* server/webui: add server-side WebUI config support

Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.

Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes

Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls

Addresses feedback from @ngxson and @ggerganov

* server: address review feedback from ngxson

* server: regenerate README with llama-gen-docs
2025-12-17 21:45:45 +01:00
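A hypothetical sketch of the "parse once, cache, serve from /props" pattern the commit describes, using nlohmann::json as the server does; struct and member names are made up:

```cpp
#include <cstdio>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

struct server_ctx {
    json webui_settings; // parsed once at load time, cached thereafter

    void load_webui_config(const std::string & inline_json) {
        try {
            webui_settings = json::parse(inline_json);
        } catch (const std::exception & e) {
            fprintf(stderr, "invalid --webui-config JSON: %s\n", e.what());
            webui_settings = json::object();
        }
    }

    // the /props handler embeds the cached value; no per-request parsing
    json props() const {
        return json {{"webui_settings", webui_settings}};
    }
};
```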
Xuan-Son Nguyen e85e9d7637
server: (router) disable SSL on child process (#18141) 2025-12-17 21:39:08 +01:00
Kim S. d37fc93505
webui: fix chat header width when sidebar is closed (#17981)
* webui: fix chat header width when sidebar is closed

* chore: add index.html.gz
2025-12-17 20:05:45 +01:00
HonestQiao 15dd67d869
model: fix GLM-ASR-Nano-2512 load error (#18130) (#18142) 2025-12-17 16:34:35 +01:00
Xuan-Son Nguyen bde461de8c
server: (router) allow child process to report status via stdout (#18110)
* server: (router) allow child process to report status via stdout

* apply suggestions
2025-12-17 14:54:11 +01:00
Johannes Gäßler 4164596c76
llama-fit-params: QoL impr. for prints/errors (#18089) 2025-12-17 00:03:19 +01:00
yifant-code 59977eba7b
server: fix crash when batch > ubatch with embeddings (#17912)
* server: fix crash when batch > ubatch with embeddings (#12836)

Fixes #12836 where the server crashes with GGML_ASSERT failure when
running with embeddings enabled and n_batch > n_ubatch.

Root cause: Embeddings use non-causal attention which requires all
tokens to be processed within a single ubatch. When n_batch > n_ubatch,
the server attempts to split processing, causing assertion failure.

Solution:
- Add parameter validation in main() after common_params_parse()
- When embeddings enabled and n_batch > n_ubatch:
  * Log warnings explaining the issue
  * Automatically set n_batch = n_ubatch
  * Prevent server crash

This follows the approach suggested by @ggerganov in issue #12836.

Note: This supersedes stalled PR #12940 which attempted a runtime fix
in the old examples/server/server.cpp location. This implementation
validates at startup in tools/server/server.cpp (current location).

Testing:
- Build: Compiles successfully
- Validation triggers: Warns when -b > -ub with --embedding
- Auto-correction works: Adjusts n_batch = n_ubatch
- No false positives: Valid params don't trigger warnings
- Verified on macOS M3 Pro with embedding model

* Update tools/server/server.cpp

---------

Co-authored-by: ytian218 <ytian218@bloomberg.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 14:27:36 +02:00
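The startup validation described above reduces to a few lines; a sketch with field names following common_params, though this is an illustration rather than the exact diff:

```cpp
#include <cstdio>

struct params_t { bool embedding; int n_batch; int n_ubatch; };

static void validate_batch_sizes(params_t & params) {
    if (params.embedding && params.n_batch > params.n_ubatch) {
        // non-causal attention must process an input within one ubatch;
        // a larger logical batch would be split and hit a GGML_ASSERT
        fprintf(stderr,
                "warning: embeddings require n_batch <= n_ubatch, setting n_batch = %d\n",
                params.n_ubatch);
        params.n_batch = params.n_ubatch;
    }
}
```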
Xuan-Son Nguyen 7b1db3d3b7
arg: clarify auto kvu/np being set on server (#17997)
* arg: clarify auto kvu/np being set on server

* improve docs

* use invalid_argument
2025-12-16 12:01:27 +01:00
2114L3 5f5f9b4637
server: Update README.md incorrect argument (#18073)
The documented argument n-gpu-layer is incorrect;
the correct argument is n-gpu-layers, with the 's'.
2025-12-16 11:50:43 +01:00
Xuan-Son Nguyen 3d86c6c2b5
model: support GLM4V vision encoder (#18042)
* convert ok

* no deepstack

* less new tensors

* cgraph ok

* add mrope for text model

* faster patch merger

* add GGML_ROPE_TYPE_MRNORM

* add support for metal

* move glm4v to dedicated graph

* convert: add norm_embd

* clip: add debugging fn

* working correctly

* fix style

* use bicubic

* fix mrope metal

* improve cpu

* convert to neox ordering on conversion

* revert backend changes

* force stop if using old weight

* support moe variant

* fix conversion

* fix convert (2)

* Update tools/mtmd/clip-graph.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* process mrope_section on TextModel base class

* resolve conflict merge

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 11:25:26 +01:00
Aleksander Grygier 3034836d36
webui: Improve copy to clipboard with text attachments (#17969)
* feat: Create copy/paste user message including "pasted text" attachments

* chore: update webui build output

* chore: update webui static output

* fix: UI issues

* chore: update webui static output

* fix: Decode HTML entities using `DOMParser`

* chore: update webui build output

* chore: update webui static output
2025-12-16 07:38:46 +01:00
Aleksander Grygier a20979d433
webui: Add setting to always show sidebar on Desktop (#17809)
* feat: Add setting to always show Sidebar on Desktop

* chore: update webui build output

* feat: Add auto-show sidebar setting

* fix: Mobile settings dialog UI

* chore: update webui build output

* feat: UI label update

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output
2025-12-16 07:31:37 +01:00
Darius Lukas 40d9c394f4
Webui: Disable attachment button and model selector button when prompt textbox is disabled. (#17925)
* Pass disabled state to the file attachments button and the model
selector button.

* Update index.html.gz

* Fix model info card in non-router mode.

* Update index.html.gz
2025-12-16 07:15:49 +01:00
Pascal 0f4f35e7be
Fix unreadable user markdown colors and truncate long texts in deletion dialogs (#17555)
* webui: limit conversation name length in dialogs

* webui: fix unreadable colors on links and table cell hover in user markdown

* webui: keep table borders visible in user markdown

* webui: updating unified exports

* Update tools/server/webui/src/lib/components/app/chat/ChatAttachments/ChatAttachmentThumbnailFile.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-15 16:34:53 +01:00
Xuan-Son Nguyen 96a181a933
mtmd: refactor audio preprocessing (#17978)
* mtmd: refactor audio preprocessing

* refactor

Co-authored-by: Tarek <tdakhran@users.noreply.github.com>

* wip

* wip (2)

* improve constructor

* fix use_natural_log

* fix padding for short input

* clean up

* remove need_chunking

---------

Co-authored-by: Tarek <tdakhran@users.noreply.github.com>
2025-12-15 14:16:52 +01:00
Andrew Aladjev 4a4f7e6550
cli: fixed dead links to tools/main for cli and completion, fixed code owners (#17993)
Co-authored-by: Andrew Aladjev <andrew.aladjev@gmail.com>
2025-12-15 11:47:04 +01:00
Thomas Jarosch e73d548659
webui: add "delete all conversations" button to import/export tab (#17444)
* webui: add "delete all conversations" button to import/export tab

- Add 'Delete all conversations' functionality with confirmation dialog
- Add Trash icon and destructive styling for clear visual indication
- Redirects to "?new_chat=true#/" by using conversationsStore.deleteAll()

* chore: update webui build output
2025-12-15 11:29:29 +01:00
Johannes Gäßler b1f3a6e5db
llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-15 09:24:59 +01:00
piDack 745fa0e78b
model : add glm-asr support (#17901)
* [model] add glm-asr support

* fix format for ci

* fix convert format for ci

* update glm_asr convert script; use build_ffn for glm_asr clip; use build_stack for padding (review feedback)

* check root architecture for convert hf script

* fix conflict with upstream

* fix convert script for glm asr & format clip-impl

* format

* restore hparams text

* improved conversion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-15 03:18:46 +01:00
Haowei Wu 37f5a1093b
mtmd: enhance image resizing in llava_uhd (#18014) 2025-12-14 15:57:52 +01:00
Georgi Gerganov 254098a279
common : refactor common_sampler + grammar logic changes (#17937)
* common : refactor common_sampler + grammar logic changes

* tests : increase max_tokens to get needed response

* batched : fix uninitialized samplers
2025-12-14 10:11:13 +02:00
Sergey Fedorov 4ed2bae50d
server-models.cpp: add missing <filesystem> (#18000)
Fixes: https://github.com/ggml-org/llama.cpp/issues/17999
2025-12-13 22:02:43 +01:00
Xuan-Son Nguyen 4d5ae24c0a
arg: fix common_params_parse not accepting negated arg (#17991) 2025-12-13 12:53:37 +01:00
Xuan-Son Nguyen 380b4c984e
common: support negated args (#17919)
* args: support negated args

* update docs

* fix typo

* add more neg options

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* rm duplicated arg

* fix LLAMA_ARG_NO_HOST

* add test

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-12 23:58:53 +01:00
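A hypothetical sketch of negated-argument handling: a "--no-" prefix maps the flag back to its positive name and stores the flipped boolean. The real parser lives in common/arg.cpp; these names are illustrative:

```cpp
#include <map>
#include <string>

static void parse_bool_flag(const std::string & arg, std::map<std::string, bool> & opts) {
    const std::string neg_prefix = "--no-";
    if (arg.rfind(neg_prefix, 0) == 0) {
        // e.g. --no-mmap -> opts["--mmap"] = false
        opts["--" + arg.substr(neg_prefix.size())] = false;
    } else {
        opts[arg] = true;
    }
}
```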
Xuan-Son Nguyen e39a2ce66d
clip: move model cgraphs into their own files (#17965)
* clip: move model cgraphs into their own files

* more explicit enums

* fix linux build

* fix naming

* missing headers

* nits: add comments for contributors
2025-12-12 21:14:48 +01:00
Xuan-Son Nguyen 17158965ac
mtmd: explicitly forbid inclusion of private header and libcommon (#17946) 2025-12-12 15:16:06 +01:00
Aleksander Grygier 12280ae905
webui: Fix parsing non-LaTeX occurrences of `\(` or `\)` (#17810)
* fix: Improve latex protection logic to prevent turning non-latex `\(` into `$`

* chore: update webui build output
2025-12-12 15:13:36 +01:00