* misc : prefer ggml-org models in docs and examples
Prefer referring to known-good quantizations under ggml-org rather than
third-party uploaders.
* remove accidentally committed file
In router mode with --models-max 1, switching models kills the child
process, destroying all in-memory state including the prompt cache and
context checkpoints. This forces full prompt re-processing every time a
swap returns to a previously loaded model, which can take tens of
seconds for long prompts.
This patch adds two methods (auto_save_slots, auto_restore_slots) that
are called automatically during the child process lifecycle:
- auto_save_slots: called after start_loop() returns (before clean_up),
saves each slot's state + checkpoints to --slot-save-path using the
model filename stem as the save name.
- auto_restore_slots: called after load_model() (before start_loop),
checks if a save file exists for this model and restores it.
Combined with the checkpoint persistence from the previous commit,
this makes model hot-swapping fully transparent: the conversation
context is preserved across swaps with no client-side changes.
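The lifecycle described above can be sketched as a minimal C++ shape. Only the method names (`auto_save_slots`, `auto_restore_slots`) and the filename-stem convention come from the commit; `server_sketch`, the raw byte payload, and the `.bin` suffix are illustrative stand-ins for the real slot state:

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Illustrative sketch of the child-process lifecycle hooks.
// slot_state stands in for the saved KV state + checkpoints.
struct server_sketch {
    std::filesystem::path slot_save_path; // --slot-save-path
    std::filesystem::path model_path;
    std::vector<char>     slot_state;

    std::filesystem::path save_file() const {
        // use the model filename stem as the save name
        return slot_save_path / (model_path.stem().string() + ".bin");
    }

    // called after start_loop() returns, before clean_up()
    void auto_save_slots() const {
        std::ofstream out(save_file(), std::ios::binary);
        out.write(slot_state.data(), (std::streamsize) slot_state.size());
    }

    // called after load_model(), before start_loop()
    bool auto_restore_slots() {
        std::ifstream in(save_file(), std::ios::binary | std::ios::ate);
        if (!in) {
            return false; // no save exists for this model yet
        }
        slot_state.resize((size_t) in.tellg());
        in.seekg(0);
        in.read(slot_state.data(), (std::streamsize) slot_state.size());
        return true;
    }
};
```

Because the save name is derived from the model filename stem, a later child process loading the same model file finds and restores its own state automatically.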
Tested with Qwen3.5-27B + Qwen3.5-35B-A3B MoE in router mode:
- Swap 27B→MoE: ~7s (incl auto-save 826 MiB state + 749 MiB checkpoints)
- Swap MoE→27B: ~6s (incl auto-restore)
- cache_n after restore: 26549 (91ms vs 23s without)
For hybrid/recurrent models (Qwen3.5, Jamba, Falcon-H1), the server
creates context checkpoints during prompt processing that snapshot the
full recurrent state at regular intervals. These checkpoints are
essential to avoid full prompt re-processing when a slot is reused.
The existing /slots save/restore API persists the raw KV+recurrent
memory via llama_state_seq_{save,load}_file, and also restores the
token list. However, it does not persist the checkpoint metadata
stored in server_prompt::checkpoints. Without these, the hybrid model
cache validation logic in update_slots() cannot find any checkpoint to
restore from and falls back to full prompt re-processing.
This patch adds two small helper functions (slot_checkpoints_save and
slot_checkpoints_load) that write/read a companion file alongside the
main slot save file. The format is a versioned binary file with a
magic header.
This is particularly useful in router mode with --models-max 1, where
switching between models destroys the in-memory prompt cache. Users
can now call /slots/0?action=save before a model swap and
/slots/0?action=restore after, recovering the full cache including
checkpoints.
Tested with Qwen3.5-27B (64 layers, 16 attention + 48 recurrent):
- Without patch: cache_n=0, 23s re-processing after swap
- With patch: cache_n=26549, 75ms after swap
* server: (doc) clarify in-scope and out-of-scope features
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Two bugs in `server_models::load()` that affect router mode reliability:
**Bug 1: Deadlock when child process crashes**
When a child process is killed (e.g., SIGKILL from OS code signature
validation), the monitoring thread deadlocks on `stopping_thread.join()`
because the stopping_thread's wait predicate (`is_stopping`) is never
satisfied — the model name was never inserted into `stopping_models`.
`update_status()` is never reached and the model stays stuck in LOADING
state permanently.
Fix: extend the stopping_thread's wait predicate to also wake when the
child process is no longer alive (`!subprocess_alive()`). When woken by
a dead child, the thread skips the shutdown sequence and returns
immediately. The original `stopping_models.erase()` logic is preserved
for normal unloads.
**Bug 2: TOCTOU race bypasses `--models-max` (ref #20137)**
`unload_lru()` is called outside the mutex, then `load()` acquires the
lock afterward. Under concurrent requests, multiple threads can each
observe free capacity and all proceed to load, exceeding the limit.
Fix: re-check capacity under the lock after `unload_lru()` returns.
If another thread filled the slot in the window between `unload_lru()`
and the lock acquisition, reject with an error instead of silently
exceeding the limit.
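The re-check-under-lock pattern looks roughly like this sketch (the struct, member names, and the map of loaded models are illustrative; only the ordering of `unload_lru()`, lock acquisition, and the capacity re-check reflects the fix):

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Sketch of the TOCTOU fix: capacity observed before taking the lock
// is stale, so it must be validated again once the lock is held.
struct models_sketch {
    std::mutex mtx;
    size_t models_max = 1;
    std::unordered_map<std::string, int> loaded;

    void unload_lru() { /* may free a slot; runs without the lock */ }

    bool load(const std::string & name) {
        unload_lru(); // outside the mutex, as before
        std::lock_guard<std::mutex> lk(mtx);
        // another thread may have filled the slot in the window above
        if (loaded.size() >= models_max && loaded.count(name) == 0) {
            return false; // reject instead of exceeding --models-max
        }
        loaded[name] = 1;
        return true;
    }
};
```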
* tests : fix fetch_server_test_models.py
* server: to_json_oaicompat cached_tokens
Adds OpenAI- and Anthropic-compatible information about the
number of cached prompt tokens used in a response.
* webui: make server the source of truth for sampling defaults
* webui: fix Custom badge for sampling parameters
* webui: log user overrides after server sync
* chore: update webui build output
* fix: Default values for sampling settings config object
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* add tests for model id parser
* add test case having activated params
* add structured tests for model id parser
* add ToDo
* feat: Improve model parsing logic + tests
* chore: update webui build output
---------
Co-authored-by: bluemoehre <bluemoehre@gmx.de>
* webui: fix model selector being locked to first loaded model
When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.
* chore: update webui build output
* webui: use date in exported filename
Move conversation naming and export to utils
update index.html.gz
* webui: move literals to message export constants file
* webui: move export naming and download back to the conversation store
* chore: update webui build output
* webui: add comments to some constants
* chore: update webui build output
* llama : fix pooling assertion crash in chunked GDN detection path
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).
Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.
Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.
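The mismatch is visible in a toy version of the shape check behind the assertion. ggml requires the inner dimensions of a matrix product to match; this reduces `ggml_can_mul_mat` to that single condition (the real function also checks broadcastability of the remaining dimensions):

```cpp
#include <cstdint>

// Reduced form of ggml's mul_mat shape requirement:
// the first (inner) dimension of both operands must match.
bool inner_dims_match(int64_t a_ne0, int64_t b_ne0) {
    return a_ne0 == b_ne0;
}
```

With `n_outputs = n_seqs`, the pooling input keeps `n_tokens = 16*n_seqs` elements while `t_embd` is cut down to `n_seqs`, so the check fails; passing `n_tokens` as `n_outputs` keeps both sides at `16*n_seqs`.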
* server : add mean pooling tests to embedding test suite
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.
These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
---------
Co-authored-by: Domenico Crupi <domenico@zerovolt.it>
* server: reset kill-switch on client error
This avoids tripping the server's kill switch on client mistakes.
If a client sends a request that exceeds the configured context size,
the server returns an appropriate HTTP 400 response and generates no
tokens. But because no tokens are generated, update_slots() increments
n_empty_consecutive, so three such requests in a row would terminate
the server. Reset the counter on client errors instead.
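The counter behavior can be sketched in isolation. `watchdog_sketch` and `on_batch` are illustrative names; the counter name `n_empty_consecutive` and the three-strike threshold come from the commit:

```cpp
// Sketch: a 4xx client error is not a server stall, so the empty-batch
// counter is reset rather than incremented.
struct watchdog_sketch {
    int n_empty_consecutive = 0;
    static constexpr int N_EMPTY_MAX = 3; // threshold from the commit

    // returns true if the server should terminate
    bool on_batch(int n_tokens_generated, bool client_error) {
        if (client_error) {
            n_empty_consecutive = 0; // the client's fault, not a stall
            return false;
        }
        if (n_tokens_generated == 0) {
            return ++n_empty_consecutive >= N_EMPTY_MAX;
        }
        n_empty_consecutive = 0;
        return false;
    }
};
```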
* moved counter reset as per recommendation
* cont : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Parse port numbers from MCP server URLs
* Pass scheme to http proxy for determining whether to use SSL
* Fix download on non-standard port and re-add port to logging
* add test
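The scheme/host/port split described above can be sketched as follows. `url_parts` and `parse_url` are illustrative stand-ins for the real MCP client parser (which also handles paths, and this sketch ignores IPv6 literals); the scheme-driven SSL decision and default ports mirror the commits:

```cpp
#include <string>

// Illustrative URL split into scheme/host/port, with the scheme
// deciding both SSL use and the default port.
struct url_parts {
    std::string scheme;
    std::string host;
    int port = 0;

    bool use_ssl() const { return scheme == "https"; }
};

url_parts parse_url(const std::string & url) {
    url_parts p;
    size_t s = url.find("://");
    p.scheme = (s == std::string::npos) ? "http" : url.substr(0, s);
    std::string rest = (s == std::string::npos) ? url : url.substr(s + 3);
    std::string hostport = rest.substr(0, rest.find('/'));
    size_t colon = hostport.rfind(':');
    if (colon != std::string::npos) {
        p.host = hostport.substr(0, colon);
        p.port = std::stoi(hostport.substr(colon + 1)); // explicit port
    } else {
        p.host = hostport;
        p.port = p.use_ssl() ? 443 : 80; // scheme decides the default
    }
    return p;
}
```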
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Set C locale for consistent float formatting across all binaries.
* Add C locale setting to all tools binaries
Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/
directory to ensure consistent floating-point formatting.
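What the one-liner guarantees can be shown in a tiny sketch (`format_float` is an illustrative wrapper, not a function from the patch): with `LC_NUMERIC` pinned to `"C"`, printf-style float output always uses `.` as the decimal separator, regardless of the user's environment locale.

```cpp
#include <clocale>
#include <cstdio>
#include <string>

std::string format_float(double v) {
    std::setlocale(LC_NUMERIC, "C"); // the line added to each tool's main()
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%.2f", v);
    return buf;
}
```

Without the call, a locale such as `de_DE` would render the same value as `3,14`, breaking any consumer that parses the output.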
* Apply suggestion from @JohannesGaessler
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>