Root cause found: copy_cell crashes during find_slot because it calls
ggml_backend_tensor_copy on GPU tensors while the compute graph is
being built. Fixed by using CPU staging: tensor_get (GPU→CPU) then
tensor_set (CPU→GPU).
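The staging pattern can be illustrated with a small mock (illustrative Python, not the real ggml API, where the actual calls are `ggml_backend_tensor_get`/`ggml_backend_tensor_set`):

```python
# Mock of the staging pattern: a direct device-to-device copy is
# disallowed while the compute graph is being built, so the data is
# staged through a host buffer instead.

class Backend:
    def __init__(self):
        self.graph_building = False

    def tensor_copy(self, src, dst):
        # direct device-to-device copy: invalid mid graph build
        if self.graph_building:
            raise RuntimeError("device copy during graph build")
        dst[:] = src

    def tensor_get(self, src):
        # device -> host staging read (safe in this mock)
        return list(src)

    def tensor_set(self, dst, host_data):
        # host -> device staging write (safe in this mock)
        dst[:] = host_data

def copy_cell(backend, src, dst):
    # CPU staging: GPU -> CPU -> GPU, avoiding the direct copy
    staging = backend.tensor_get(src)
    backend.tensor_set(dst, staging)

backend = Backend()
backend.graph_building = True
src, dst = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
copy_cell(backend, src, dst)   # succeeds where tensor_copy would raise
print(dst)
```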
Also increased rs_size from 1 to 3 cells per sequence to make room
for checkpoint cells needed by speculative decoding rollback.
Results:
- No more crashes during speculative decode
- 23.8 tok/s with MTP (vs 16.7 without)
- 75% acceptance rate
- Output still garbled on long generation due to seq_rm not finding
checkpoints at the right positions (checkpoint position mismatch)
Next: fix checkpoint position tracking so seq_rm can find and restore
the correct recurrent state after draft rejection.
* server: (doc) clarify in-scope and out-of-scope features
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Two bugs in `server_models::load()` that affect router mode reliability:
**Bug 1: Deadlock when child process crashes**
When a child process is killed (e.g., SIGKILL from OS code signature
validation), the monitoring thread deadlocks on `stopping_thread.join()`
because the stopping_thread's wait predicate (`is_stopping`) is never
satisfied — the model name was never inserted into `stopping_models`.
`update_status()` is never reached and the model stays stuck in LOADING
state permanently.
Fix: extend the stopping_thread's wait predicate to also wake when the
child process is no longer alive (`!subprocess_alive()`). When woken by
a dead child, the thread skips the shutdown sequence and returns
immediately. The original `stopping_models.erase()` logic is preserved
for normal unloads.
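The extended wait predicate can be sketched as follows (a minimal Python sketch; `is_stopping`, `subprocess_alive`, and the member names are stand-ins for the actual server code):

```python
import threading

class ModelMonitor:
    def __init__(self):
        self.cond = threading.Condition()
        self.stopping_models = set()
        self.child_alive = True

    def subprocess_alive(self):
        return self.child_alive

    def is_stopping(self, name):
        return name in self.stopping_models

    def stopping_thread(self, name, timeout=5.0):
        with self.cond:
            # Old predicate: is_stopping(name) only, which deadlocks if
            # the child dies before the name ever enters stopping_models.
            # New predicate: also wake when the child is no longer alive.
            self.cond.wait_for(
                lambda: self.is_stopping(name) or not self.subprocess_alive(),
                timeout=timeout)
        if not self.is_stopping(name):
            return "child-died"              # skip the shutdown sequence
        self.stopping_models.discard(name)   # original logic, normal unload
        return "unloaded"

mon = ModelMonitor()
result = []
t = threading.Thread(target=lambda: result.append(mon.stopping_thread("m")))
t.start()
# Simulate the child being killed before any unload was requested:
with mon.cond:
    mon.child_alive = False
    mon.cond.notify_all()
t.join()
print(result[0])
```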
**Bug 2: TOCTOU race bypasses `--models-max` (ref #20137)**
`unload_lru()` is called outside the mutex, then `load()` acquires the
lock afterward. Under concurrent requests, multiple threads observe
capacity and all proceed to load, exceeding the limit.
Fix: re-check capacity under the lock after `unload_lru()` returns.
If another thread filled the slot in the window between `unload_lru()`
and the lock acquisition, reject with an error instead of silently
exceeding the limit.
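The re-check under the lock can be sketched like this (illustrative names; `_after_unload` is a test-only hook to force the racy interleaving deterministically):

```python
import threading

MAX_MODELS = 2  # stand-in for --models-max

class ModelRegistry:
    def __init__(self):
        self.lock = threading.Lock()
        self.loaded = []

    def unload_lru(self):
        # called outside the caller's critical section, as in the bug report
        with self.lock:
            if len(self.loaded) >= MAX_MODELS:
                self.loaded.pop(0)   # evict least recently used

    def load(self, name, _after_unload=None):
        self.unload_lru()
        if _after_unload:
            _after_unload()          # widen the race window for the demo
        with self.lock:
            # TOCTOU fix: another thread may have filled the freed slot
            # between unload_lru() and this lock acquisition.
            if len(self.loaded) >= MAX_MODELS:
                return False         # reject instead of exceeding the cap
            self.loaded.append(name)
            return True

reg = ModelRegistry()
reg.loaded = ["a", "b"]                                  # at capacity
# Thread B sneaks its load into A's window after A's unload_lru():
ok_a = reg.load("A", _after_unload=lambda: reg.load("B"))
print(ok_a, reg.loaded)   # A is rejected, the limit holds
```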
- Add cooldown flag to MTP speculative state: after draft rejection,
skip next proposal to force single-token decode for fresh MTP logits
- Root cause: MTP logits are from the last batch position (draft token).
When draft is rejected, next proposal uses stale/wrong logits (13% accept).
With cooldown: proposals only use fresh single-token MTP logits (95% accept).
- Simplified seq_rm fallback: log and continue instead of re-evaluating
- Added debug logging (MTP-DBG, MTP-VERIFY) for acceptance rate tracking
- Results: 95% acceptance rate, 0 restarts, no garbled output on 2048 tokens
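The cooldown state machine reduces to a one-bit flag; a minimal sketch (names are illustrative, not the actual struct members):

```python
class MTPSpecState:
    def __init__(self):
        self.cooldown = False

    def should_propose(self):
        if self.cooldown:
            # Skip this proposal: force a single-token decode so the next
            # MTP logits come from the accepted token, not a stale draft.
            self.cooldown = False
            return False
        return True

    def on_draft_result(self, accepted):
        if not accepted:
            self.cooldown = True   # rejection means stale logits next step

st = MTPSpecState()
st.on_draft_result(accepted=False)              # draft rejected
steps = [st.should_propose() for _ in range(3)]
print(steps)   # one skipped proposal, then back to normal
```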
* tests : fix fetch_server_test_models.py
* server: to_json_oaicompat cached_tokens
Adds OpenAI- and Anthropic-compatible information about the
number of cached prompt tokens used in a response.
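For reference, a sketch of where cached-token counts live in the two response schemas (field names follow the public OpenAI and Anthropic APIs as I understand them; the server's actual serialization may differ in detail):

```python
def usage_oai(n_prompt, n_gen, n_cached):
    # OpenAI-style usage object: cached tokens are nested under
    # prompt_tokens_details.cached_tokens
    return {
        "prompt_tokens": n_prompt,
        "completion_tokens": n_gen,
        "total_tokens": n_prompt + n_gen,
        "prompt_tokens_details": {"cached_tokens": n_cached},
    }

def usage_anthropic(n_prompt, n_gen, n_cached):
    # Anthropic-style usage object: cached reads are reported separately
    # from fresh input tokens
    return {
        "input_tokens": n_prompt - n_cached,
        "cache_read_input_tokens": n_cached,
        "output_tokens": n_gen,
    }

u = usage_oai(128, 32, 100)
print(u["prompt_tokens_details"]["cached_tokens"], u["total_tokens"])
```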
* webui: make server the source of truth for sampling defaults
* webui: fix Custom badge for sampling parameters
* webui: log user overrides after server sync
* chore: update webui build output
* fix: Default values for sampling settings config object
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* add tests for model id parser
* add test case with activated params
* add structured tests for model id parser
* add ToDo
* feat: Improve model parsing logic + tests
* chore: update webui build output
---------
Co-authored-by: bluemoehre <bluemoehre@gmx.de>
Add native MTP support for the dense Qwen 3.5 architecture (0.8B, 2B, 4B, 9B, 27B).
What works:
- MTP graph builder for dense qwen35 (build_mtp_head in qwen35.cpp)
- MTP tensor loading and registration for QWEN35 arch
- GGUF converter handles MTP tensors (mtp.fc, mtp.layers, mtp.norm, etc.)
- Public API: llama_get_mtp_logits(), llama_model_n_mtp_layers()
- Server auto-detects MTP from GGUF metadata
- Speculative state machine for MTP draft token generation
- PR #20075 applied: recurrent state checkpoint/restore for hybrid models
- M-RoPE position check relaxed for speculative re-evaluation
- Windows os.kill fix for gateway process detection
What needs work:
- Speculative verify loop conflicts with tool-calling requests (400 error)
- The recommended fix: bypass the speculative framework entirely and
implement MTP acceptance directly in the server generation loop
(no seq_rm/rollback needed since MTP drafts are produced in-graph)
- MTP attention skipped (projection + FFN path only) due to
inp_out_ids token count mismatch
Tested on: RTX 5060 8GB, Windows 11, CUDA 13.2
Model: Qwen3.5-9B with MTP tensors (Q4_K_M quantization)
Base: llama.cpp b8388
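The "bypass the speculative framework" fix suggested under "What needs work" amounts to greedy prefix acceptance in the generation loop; a sketch under that assumption (illustrative Python, not the server's actual code):

```python
def accept_drafts(draft_tokens, verify_tokens):
    """Accept the longest prefix of MTP draft tokens that the verifying
    model agrees with. verify_tokens[i] is the model's own next-token
    choice given the context extended by draft_tokens[:i]. No seq_rm or
    KV rollback is needed: rejected drafts are simply never appended."""
    accepted = []
    for d, v in zip(draft_tokens, verify_tokens):
        if d != v:
            break
        accepted.append(d)
    return accepted

# draft proposes [5, 9, 2]; the verifier agrees on the first two positions
print(accept_drafts([5, 9, 2], [5, 9, 7]))
```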
* webui: fix model selector being locked to first loaded model
When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.
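The guard condition is tiny; sketched in Python for brevity (the actual webui code is TypeScript, names are illustrative):

```python
def maybe_auto_select(selected_model_id, loaded_model_ids):
    # Auto-select only when nothing is chosen yet; never override a
    # manual selection when loaded_model_ids changes.
    if selected_model_id is not None:
        return selected_model_id
    return loaded_model_ids[0] if loaded_model_ids else None

print(maybe_auto_select(None, ["a", "b"]))   # auto-selects the first model
print(maybe_auto_select("b", ["a", "b"]))    # user's choice is preserved
```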
* chore: update webui build output
* webui: use date in exported filename
Move conversation naming and export to utils
update index.html.gz
* webui: move literals to message export constants file
* webui: move export naming and download back to the conversation store
* chore: update webui build output
* webui: add comments to some constants
* chore: update webui build output
* llama : fix pooling assertion crash in chunked GDN detection path
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).
Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.
Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.
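The mismatch can be modeled with a toy version of the compatibility check (a sketch: `ggml_mul_mat` requires the contracted dimensions of its operands to match; the `[contracted, other]` shape notation below is illustrative, not ggml's `ne[]` layout):

```python
def can_mul_mat(a_shape, b_shape):
    # operands must share the contracted dimension
    return a_shape[0] == b_shape[0]

n_seqs = 4
n_tokens = 16 * n_seqs
n_embd = 768

# Buggy reservation: n_outputs = n_seqs, so t_embd is reduced to n_seqs
# entries while inp_mean was built for the full n_tokens.
t_embd_bad = (n_seqs, n_embd)
inp_mean   = (n_tokens, n_seqs)
print(can_mul_mat(t_embd_bad, inp_mean))    # False: the assertion fires

# Fixed reservation: pass n_tokens as n_outputs and the dims line up.
t_embd_good = (n_tokens, n_embd)
print(can_mul_mat(t_embd_good, inp_mean))   # True
```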
* server : add mean pooling tests to embedding test suite
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.
These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
---------
Co-authored-by: Domenico Crupi <domenico@zerovolt.it>
* server: reset kill-switch on client error
This avoids falsely triggering the server kill switch.
If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is returned and no tokens are generated.
However, since no tokens are generated, update_slots() increments n_empty_consecutive; if the client sends three such requests in a row, the server terminates.
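The counter behaviour can be sketched as follows (names follow the commit message; the threshold value is illustrative):

```python
N_EMPTY_MAX = 3  # illustrative kill-switch threshold

class Server:
    def __init__(self):
        self.n_empty_consecutive = 0
        self.alive = True

    def update_slots(self, n_tokens_generated):
        if n_tokens_generated == 0:
            self.n_empty_consecutive += 1
            if self.n_empty_consecutive >= N_EMPTY_MAX:
                self.alive = False   # kill switch trips
        else:
            self.n_empty_consecutive = 0

    def on_client_error(self):
        # Fix: a rejected request (HTTP 400) does not mean the server is
        # wedged, so reset the kill-switch counter.
        self.n_empty_consecutive = 0

# Before the fix: three oversized requests in a row kill the server.
srv = Server()
for _ in range(3):
    srv.update_slots(0)
print(srv.alive)

# With the fix: each client error resets the counter, so it never trips.
srv2 = Server()
for _ in range(3):
    srv2.update_slots(0)
    srv2.on_client_error()
print(srv2.alive)
```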
* moved counter reset as per recommendation
* cont : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Parse port numbers from MCP server URLs
* Pass scheme to http proxy for determining whether to use SSL
* Fix download on non-standard port and re-add port to logging
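The parsing rules above (explicit port preserved, scheme decides SSL, defaults otherwise) can be sketched with the standard library; the actual parser in the PR may differ:

```python
from urllib.parse import urlparse

DEFAULT_PORTS = {"http": 80, "https": 443}

def parse_endpoint(url):
    u = urlparse(url)
    # keep an explicit port; fall back to the scheme's well-known default
    port = u.port or DEFAULT_PORTS[u.scheme]
    # the scheme, not the port, decides whether to use SSL
    use_ssl = u.scheme == "https"
    return u.hostname, port, use_ssl

print(parse_endpoint("https://mcp.example.com/sse"))   # default port 443
print(parse_endpoint("http://localhost:9090/mcp"))     # explicit port kept
```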
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Set C locale for consistent float formatting across all binaries.
* Add C locale setting to all tools binaries
Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/
directory to ensure consistent floating-point formatting.
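The same idea in Python form, for illustration (the C++ change calls `std::setlocale(LC_NUMERIC, "C")`; the `"C"` locale is always available, unlike named locales):

```python
import locale

# Force the "C" numeric locale so floats always use '.' as the decimal
# separator, regardless of the user's environment (e.g. de_DE uses ',').
locale.setlocale(locale.LC_NUMERIC, "C")

s = locale.format_string("%.2f", 1234.5)
print(s)   # "1234.50", never "1234,50"
```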
* Apply suggestion from @JohannesGaessler
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
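The alias/tag handling above can be sketched like this (illustrative names, not the actual `server_models` code; duplicates collapse because aliases are kept in a set):

```python
class ModelMeta:
    def __init__(self, name, alias_arg="", tags_arg=""):
        self.name = name
        # comma-separated --alias values, deduplicated via a set
        self.aliases = {a.strip() for a in alias_arg.split(",") if a.strip()}
        # --tags: informational metadata only, not used for routing
        self.tags = [t.strip() for t in tags_arg.split(",") if t.strip()]

class Router:
    def __init__(self):
        self.models = {}

    def add(self, meta):
        self.models[meta.name] = meta

    def get_meta(self, requested):
        # aliases resolve transparently: an alias routes to its model
        for meta in self.models.values():
            if requested == meta.name or requested in meta.aliases:
                return meta
        return None

    def has_model(self, requested):
        return self.get_meta(requested) is not None

r = Router()
r.add(ModelMeta("qwen3-8b-q4", alias_arg="qwen,default, qwen", tags_arg="chat"))
print(r.has_model("qwen"), r.get_meta("default").name)
```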
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags