In router mode with --models-max 1, switching models kills the child
process, destroying all in-memory state including the prompt cache and
context checkpoints. This forces a full prompt re-processing on every
model swap return, which can take tens of seconds for long prompts.
This patch adds two methods (auto_save_slots, auto_restore_slots) that
are called automatically during the child process lifecycle:
- auto_save_slots: called after start_loop() returns (before clean_up),
saves each slot's state + checkpoints to --slot-save-path using the
model filename stem as the save name.
- auto_restore_slots: called after load_model() (before start_loop),
checks if a save file exists for this model and restores it.
Combined with the checkpoint persistence from the previous commit,
this makes model hot-swapping fully transparent: the conversation
context is preserved across swaps with no client-side changes.
Tested with Qwen3.5-27B + Qwen3.5-35B-A3B MoE in router mode:
- Swap 27B→MoE: ~7s (incl auto-save 826 MiB state + 749 MiB checkpoints)
- Swap MoE→27B: ~6s (incl auto-restore)
- cache_n after restore: 26549 (91ms vs 23s without)
* Set C locale for consistent float formatting across all binaries.
* Add C locale setting to all tools binaries
Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/
directory to ensure consistent floating-point formatting.
* Apply suggestion from @JohannesGaessler
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags
* from previous PR
* Make instruction(system) as first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-straming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>
* server: prevent data race from HTTP threads
* fix params
* fix default_generation_settings
* nits: make handle_completions_impl looks less strange
* stricter const
* fix GGML_ASSERT(idx < states.size())
* move index to be managed by server_response_reader
* http: make sure req & res lifecycle are tied together
* fix compile
* fix index handling buggy
* fix data race for lora endpoint
* nits: fix shadow variable
* nits: revert redundant changes
* nits: correct naming for json_webui_settings
* implement sleeping at queue level
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
* server: fix crash when batch > ubatch with embeddings (#12836)
Fixes#12836 where the server crashes with GGML_ASSERT failure when
running with embeddings enabled and n_batch > n_ubatch.
Root cause: Embeddings use non-causal attention which requires all
tokens to be processed within a single ubatch. When n_batch > n_ubatch,
the server attempts to split processing, causing assertion failure.
Solution:
- Add parameter validation in main() after common_params_parse()
- When embeddings enabled and n_batch > n_ubatch:
* Log warnings explaining the issue
* Automatically set n_batch = n_ubatch
* Prevent server crash
This follows the approach suggested by @ggerganov in issue #12836.
Note: This supersedes stalled PR #12940 which attempted a runtime fix
in the old examples/server/server.cpp location. This implementation
validates at startup in tools/server/server.cpp (current location).
Testing:
- Build: Compiles successfully
- Validation triggers: Warns when -b > -ub with --embedding
- Auto-correction works: Adjusts n_batch = n_ubatch
- No false positives: Valid params don't trigger warnings
- Verified on macOS M3 Pro with embedding model
* Update tools/server/server.cpp
---------
Co-authored-by: ytian218 <ytian218@bloomberg.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* git mv
* add server-context.h
* add server-context.h
* clean up headers
* cont : cleanup
* also expose server_response_reader (to be used by CLI)
* fix windows build
* decouple server_routes and server_http
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : add Anthropic Messages API support
* remove -@pytest.mark.slow from tool calling/jinja tests
* server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py
* server : removed redundant n field logic in anthropic_params_from_json
* server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()
* server : refactor Anthropic API to use OAI conversion
* make sure basic test always go first
* clean up
* clean up api key check, add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server: split HTTP into its own interface
* move server-http and httplib to its own file
* add the remaining endpoints
* fix exception/error handling
* renaming
* missing header
* fix missing windows header
* fix error responses from http layer
* fix slot save/restore handler
* fix case where only one stream chunk is returned
* add NOMINMAX
* do not call sink.write on empty data
* use safe_json_to_str for SSE
* clean up
* add some comments
* improve usage of next()
* bring back the "server is listening on" message
* more generic handler
* add req.headers
* move the chat template print to init()
* add req.path
* cont : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix: correct time_ms calculation in send_partial_response
The time_ms field was incorrectly calculated. The division was happening
before the subtraction leading to incorrect values.
Before: (ggml_time_us() - slot.t_start_process_prompt / 1000) After:
(ggml_time_us() - slot.t_start_process_prompt) / 1000
* docs : document time_ms field in prompt_progress
* clip : use FA
* cont : add warning about unsupported ops
* implement "auto" mode for clip flash attn
* clip : print more detailed op support info during warmup
* cont : remove obsolete comment [no ci]
* improve debugging message
* trailing space
* metal : remove stray return
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : support unified context across slots
* cont : fix speculative decoding initialization
* context : fix n_ctx_per_seq computation
* server : purge slots one by one
* tests : add unified cache server tests
* llama : update per-seq context computation
* test-thread-safety : handle tiny training context of the input model
* server : fix server_tokens clear()
* server : use 4 slots + unified KV by default
* llama : add note about context size queries
* cont : update todos [no ci]
* context : do not cap the size of the context
* tests : adjust parameters to be CI friendlier
* context : add warning
* server / ranking : add sorting and management of top_n
* Make the retro compatible if no top_n will return
all results
here is a script to make some test
```script
URL=${1:-http://127.0.0.1:8181}
curl "$URL/v1/rerank" -H "Content-Type: application/json" \
-d '{ "model": "M", "query": "What is the recipe to make bread ?",
"return_text" : true,
"texts" : true,
"top_n": 6,
"documents": [
"voici la recette pour faire du pain, il faut de la farine de l eau et du levain et du sel",
"it is a bear",
"bread recipe : floor, water, yest, salt",
"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
"here is the ingedients to bake bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
"recipe to make cookies : floor, eggs, water, chocolat",
"here is the recipe to make bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
"il fait tres beau aujourd hui",
"je n ai pas faim, je ne veux pas manger",
"je suis a paris"
] }' | jq
```
* use resize() instead for(...)
* simplify top_n init since no need to return error
result to test :
./tests.sh unit/test_rerank.py -v -x
==================================================== test session starts =====================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3
cachedir: .pytest_cache
rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.11.0
collected 8 items
unit/test_rerank.py::test_rerank PASSED [ 12%]
unit/test_rerank.py::test_rerank_tei_format PASSED [ 25%]
unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 37%]
unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 50%]
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 62%]
unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 75%]
unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 87%]
unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [100%]
===================================================== 8 passed in 4.31s ======================================================
* add rerank top_n unit test
here is the result :
./tests.sh unit/test_rerank.py -v -x
=================================================================== test session starts ===================================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3
cachedir: .pytest_cache
rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.11.0
collected 16 items
unit/test_rerank.py::test_rerank PASSED [ 6%]
unit/test_rerank.py::test_rerank_tei_format PASSED [ 12%]
unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 18%]
unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 25%]
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 31%]
unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 37%]
unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 43%]
unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [ 50%]
unit/test_rerank.py::test_rerank_top_n[None-4] PASSED [ 56%]
unit/test_rerank.py::test_rerank_top_n[2-2] PASSED [ 62%]
unit/test_rerank.py::test_rerank_top_n[4-4] PASSED [ 68%]
unit/test_rerank.py::test_rerank_top_n[99-4] PASSED [ 75%]
unit/test_rerank.py::test_rerank_tei_top_n[None-4] PASSED [ 81%]
unit/test_rerank.py::test_rerank_tei_top_n[2-2] PASSED [ 87%]
unit/test_rerank.py::test_rerank_tei_top_n[4-4] PASSED [ 93%]
unit/test_rerank.py::test_rerank_tei_top_n[99-4] PASSED [100%]
=================================================================== 16 passed in 8.84s ===================================================================
* editor config check fix