Georgi Gerganov
e1141d1cd1
cont : remove server_prompt_checkpoint_with_size
2026-04-03 16:35:23 +03:00
Georgi Gerganov
8491e15405
cont : avoid --spec-use-checkpoints argument
2026-04-01 16:14:03 +03:00
Georgi Gerganov
63c66f1512
Merge branch 'master' into pr/19493
2026-04-01 14:17:41 +03:00
Georgi Gerganov
edfb440a2f
server : fix processing of multiple back-to-back mtmd chunks ( #21107 )
2026-03-28 16:27:36 +02:00
Sascha Rogmann
d0a856895f
server : restore sampler in spec checkpoint and clear mem
2026-03-26 23:37:05 +01:00
Xuan-Son Nguyen
49bfddeca1
server: allow router to report child instances sleep status ( #20849 )
...
* server: allow router to report child instances sleep status
* refactor
* move sleeping to state
* nits
2026-03-22 18:33:52 +01:00
Sascha Rogmann
b5b3ac3b55
server : fix server_speculative_callback (slot.id)
2026-03-20 22:56:40 +01:00
Sascha Rogmann
91932ae05b
server : n_tokens_cur and create_checkpoint in draft
2026-03-20 22:56:40 +01:00
Sascha Rogmann
fe4f859a67
speculative : checkpoints with draft model, logging
2026-03-20 22:56:40 +01:00
Sascha Rogmann
af3b630e0b
server : fix spec checkpoints, logging
2026-03-20 22:56:40 +01:00
Sascha Rogmann
bd2f7f2d7f
server : renamed spec checkpoints option
2026-03-20 22:56:40 +01:00
Sascha Rogmann
e994c4ec1f
server : refactored spec logic to speculative.cpp
2026-03-20 22:56:40 +01:00
Sascha Rogmann
01763e800d
server : log levels
2026-03-20 22:56:30 +01:00
Sascha Rogmann
e002b095e5
server : rename spec vars
2026-03-20 22:51:45 +01:00
Sascha Rogmann
3723f8e57c
server : fix draft check with checkpoints
2026-03-20 22:51:45 +01:00
Sascha Rogmann
a4237ea0f0
server : speculative decoding using checkpoints
2026-03-20 22:51:04 +01:00
Georgi Gerganov
ab9d4c3678
server : improve mtmd ctx checkpoints ( #20726 )
...
* server : improve mtmd ctx checkpoints
* server : fix off-by-one in pos_min_thold
2026-03-20 11:13:12 +02:00
Ryan Goulden
26c9ce1288
server: Add cached_tokens info to oaicompat responses ( #19361 )
...
* tests : fix fetch_server_test_models.py
* server: to_json_oaicompat cached_tokens
Adds OpenAI and Anthropic compatible information about the
number of cached prompt tokens used in a response.
2026-03-19 19:09:33 +01:00
Piotr Wilkin (ilintar)
5e54d51b19
common/parser: add proper reasoning tag prefill reading ( #20424 )
...
* Implement proper prefill extraction
* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp
* Update tools/server/server-task.cpp
* refactor: move grammars to variant, remove grammar_external, handle exception internally
* Make code less C++y
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-19 16:58:21 +01:00
Piotr Wilkin (ilintar)
d2ecd2d1cf
common/parser: add `--skip-chat-parsing` to force a pure content parser. ( #20289 )
...
* Add `--force-pure-content` to force a pure content parser.
* Update common/arg.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Change parameter name [no ci]
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Georgi Gerganov
8cc2d81264
server : fix ctx checkpoint invalidation ( #20671 )
2026-03-17 15:21:14 +02:00
SoftwareRenderer
d7ba99c485
server: reset counter related to kill-switch on client error ( #20513 )
...
* server: reset kill-switch on client error
This avoids triggering a server kill switch.
If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated.
However since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates.
* moved counter reset as per recommendation
* cont : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-13 19:58:09 +02:00
Piotr Wilkin (ilintar)
acb7c79069
common/parser: handle reasoning budget ( #20297 )
...
* v1
* Finished!
* Handlie cli
* Reasoning sampler
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Less explosive terminology :)
* Add utf-8 case and tests
* common : migrate reasoning budget sampler to common
* cont : clean up
* cont : expose state and allow passing as initial state
* cont : remove unused imports
* cont : update state machine doc string
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-03-11 10:26:12 +01:00
Georgi Gerganov
a7b3dee7a5
server : make 2 checkpoints near the end of the prompt ( #20288 )
...
* server : make 2 checkpoints near the end of the prompt
* cont : adjust checkpoints
2026-03-10 14:28:23 +02:00
Georgi Gerganov
96cfc4992c
server : fix checkpoints n_tokens calculation ( #20287 )
2026-03-09 16:47:06 +02:00
Georgi Gerganov
344ee2a38a
server : warn swa-full is not supported for non-SWA models ( #20291 )
2026-03-09 16:44:25 +02:00
Georgi Gerganov
d6e1556499
server : fix off-by-1 in server_tokens::size_up_to_pos() ( #20279 )
...
* server : fix off-by-1 in server_tokens::size_up_to_pos()
* cont : fix typo [no ci]
2026-03-09 16:43:38 +02:00
Georgi Gerganov
107d599952
server : add kill switch when server is stuck ( #20277 )
2026-03-09 10:33:12 +02:00
Georgi Gerganov
d417bc43dd
server : do not create checkpoints right after mtmd chunks ( #20232 )
2026-03-08 22:16:46 +02:00
Piotr Wilkin (ilintar)
f5ddcd1696
Checkpoint every n tokens: squash ( #20087 )
2026-03-06 11:39:26 +01:00
Pascal
2e7e638523
server : support multiple model aliases via comma-separated --alias ( #19926 )
...
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags
2026-02-27 07:05:23 +01:00
Georgi Gerganov
01cd448b8c
server : fix ctx checkpoint restore logic ( #19924 )
2026-02-26 18:20:16 +02:00
Georgi Gerganov
f20469d919
server : enable multi-modal prompt caching ( #19877 )
2026-02-25 15:15:42 +02:00
Georgi Gerganov
d7d826b3c1
server : support multi-modal context checkpoints ( #19849 )
...
* Modify llama-memory-hybrid-iswa.cpp
* Modify llama-memory-recurrent.cpp
* Modify server-common.cpp
* Modify server-common.h
* Modify server-context.cpp
* Modify server-task.h
* Added comment to llama-memory-hybrid-iswa.cpp
* Remove comment from server-context.cpp
* Stylistic fix server-context.cpp
* Fix an issue when seqrm isn't called in server-context.cpp
* cont : alternative impl
* cont : cleanup
* cont : n_tokens -> int64_t
---------
Co-authored-by: timkhronos <timkhronos@gmail.com>
2026-02-25 15:14:27 +02:00
Sigbjørn Skjæret
e8e261699a
cli : provide model with text filename ( #19783 )
2026-02-22 22:33:49 +01:00
matteo
b55dcdef5d
server: save generated text for the /slots endpoint (for LLAMA_SERVER_SLOTS_DEBUG=1) ( #19622 )
...
* save generated text for the /slots endpoint
* update debug_generated_text only when LLAMA_SERVER_SLOTS_DEBUG > 0
* Apply suggestions from code review
---------
Co-authored-by: Matteo <matteo@matteo>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-02-18 18:53:37 +01:00
손희준
820ebfa6f4
Server: log when converting requests to chat completions format ( #19457 )
...
* Log converting requests
* Print as debug instead of info [no ci]
---------
Co-authored-by: openingnow <>
2026-02-09 16:22:57 +01:00
Georgi Gerganov
eb449cdfa4
server : improve context checkpoint logic ( #19408 )
2026-02-08 09:40:04 +02:00
Georgi Gerganov
dfde5993ea
common : add common_speculative_is_compat() ( #19270 )
...
* llama : add llama_memory_can_rm_suffix()
* Revert "llama : add llama_memory_can_rm_suffix()"
This reverts commit d30e59b62a .
* spec : check if the target context is compatible for spec decoding
2026-02-06 16:47:22 +02:00
Georgi Gerganov
bbada8bfb9
server : wrap around the "id_slot" parameter ( #19207 )
...
* server : wrap around the "id_slot" parameter
* cont : minor
2026-01-30 19:46:10 +02:00
Georgi Gerganov
dabaa2e77a
spec : add ngram-mod ( #19164 )
...
* spec : add ngram-mod
* cont : simplify + keep track of occupancy
* cont : cleanup
* cont : move initialization to common/speculative
* cont : cleanup
* cont : cleanup
* cont : fix
2026-01-30 18:21:48 +02:00
Sascha Rogmann
72d3b1898a
spec : add self‑speculative decoding (no draft model required) + refactor ( #18471 )
...
* server: introduce self-speculative decoding
* server: moved self-call into speculative.cpp
* can_speculate() includes self-speculation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: can_speculate() tests self-spec
* server: replace can_speculate() with slot.can_speculate()
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common: use %zu format specifier for size_t in logging
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* server: can_speculate() requires a task instance
* common: ngram map, config self-speculative decoding
* common: add enum common_speculative_type
* common: add vector of speculative states
* common: add option --spec-draftless
* server: cleanup (remove slot.batch_spec, rename)
* common: moved self-spec impl to ngram-map
* common: cleanup (use common_speculative_state_draft)
* spec : refactor
* cont : naming
* spec: remove --spec-config
* doc: (draftless) speculative decoding
* common: print performance in spec decoding
* minor : cleanup
* common : better names
* minor : cleanup + fix build
* minor: comments
* CODEOWNERS: add common/ngram-map.* (#18471 )
* common : rename speculative.draftless_type -> speculative.type
* ngram-map : fix uninitialized values
* ngram-map : take into account the input can become shorter
* ngram-map : revert len check for now
* arg : change `--spec-draftless` -> `--spec-type`
* spec : add common_speculative_state::accept()
* spec : refactor + add common_speculative_begin()
* spec : fix begin() call with mtmd
* spec : additional refactor + remove common_speculative_params
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
Xuan-Son Nguyen
51fa458a92
server : support preserving reasoning_content in assistant message ( #18994 )
...
* support reasoning_content input
* report template caps to webui
* add docs
* rm commented code
2026-01-22 21:30:06 +01:00
손희준
fbbf3ad190
server: /v1/responses (partial) ( #18486 )
...
* from previous PR
* Make instruction(system) as first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-straming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>
2026-01-21 17:47:23 +01:00
Xuan-Son Nguyen
6df686bee6
server : refactor oai_parser_opt, move it to server_chat_params ( #18937 )
...
* server_chat_params
* move chat format into CLI
* use meta whenever possible
* clean up, no more chatml fallback
2026-01-19 23:28:01 +01:00
Lennart Austenfeld
18361c579c
server: fix memory reservations in populate_token_probs ( #18787 )
2026-01-19 19:13:31 +01:00
Xuan-Son Nguyen
c15395f73c
common : implement new jinja template engine ( #18462 )
...
* jinja vm
* lexer
* add vm types
* demo
* clean up
* parser ok
* binary_expression::execute
* shadow naming
* bin ops works!
* fix map object
* add string builtins
* add more builtins
* wip
* use mk_val
* eval with is_user_input
* render gemma tmpl ok
* track input string even after transformations
* support binded functions
* keyword arguments and slicing array
* use shared_ptr for values
* add mk_stmt
* allow print source on exception
* fix negate test
* testing more templates
* mostly works
* add filter_statement
* allow func to access ctx
* add jinja-value.cpp
* impl global_from_json
* a lot of fixes
* more tests
* more fix, more tests
* more fixes
* rm workarounds
* demo: type inferrence
* add placeholder for tojson
* improve function args handling
* rm type inference
* no more std::regex
* trailing spaces
* make testing more flexible
* make output a bit cleaner
* (wip) redirect minja calls
* test: add --output
* fix crash on macro kwargs
* add minimal caps system
* add some workarounds
* rm caps_apply_workarounds
* get rid of preprocessing
* more fixes
* fix test-chat-template
* move test-chat-jinja into test-chat-template
* rm test-chat-jinja from cmake
* test-chat-template: use common
* fix build
* fix build (2)
* rename vm --> interpreter
* improve error reporting
* correct lstrip behavior
* add tojson
* more fixes
* disable tests for COMMON_CHAT_FORMAT_GENERIC
* make sure tojson output correct order
* add object.length
* fully functional selectattr / rejectattr
* improve error reporting
* more builtins added, more fixes
* create jinja rendering tests
* fix testing.h path
* adjust whitespace rules
* more fixes
* temporary disable test for ibm-granite
* r/lstrip behavior matched with hf.js
* minimax, glm4.5 ok
* add append and pop
* kimi-k2 ok
* test-chat passed
* fix lstrip_block
* add more jinja tests
* cast to unsigned char
* allow dict key to be numeric
* nemotron: rm windows newline
* tests ok
* fix test
* rename interpreter --> runtime
* fix build
* add more checks
* bring back generic format support
* fix Apertus
* [json.exception.out_of_range.403] key 'content' not found
* rm generic test
* refactor input marking
* add docs
* fix windows build
* clarify error message
* improved tests
* split/rsplit with maxsplit
* non-inverse maxsplit
forgot to change after simplifying
* implement separators for tojson and fix indent
* i like to move it move it
* rename null -- > none
* token::eof
* some nits + comments
* add exception classes for lexer and parser
* null -> none
* rename global -> env
* rm minja
* update docs
* docs: add input marking caveats
* imlement missing jinja-tests functions
* oops
* support trim filter with args, remove bogus to_json reference
* numerous argument fixes
* updated tests
* implement optional strip chars parameter
* use new chars parameter
* float filter also has default
* always leave at least one decimal in float string
* jinja : static analysis + header cleanup + minor fixes
* add fuzz test
* add string.cpp
* fix chat_template_kwargs
* nits
* fix build
* revert
* unrevert
sorry :)
* add fuzz func_args, refactor to be safer
* fix array.map()
* loosen ensure_vals max count condition, add not impl for map(int)
* hopefully fix windows
* check if empty first
* normalize newlines
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 11:22:06 +01:00
Xuan-Son Nguyen
a04c2b06a3
server: improve slots scheduling for n_cmpl ( #18789 )
...
* server : make sure children tasks are scheduled to launch with parent
* fix
* add comment pointing to this PR
* fix
* clean up
* more debug messages
* add pop_deferred_task with specific ID version
* improve the logic
* simple approach
* no double move
* correct return type of launch_slots_with_parent_task
2026-01-15 17:10:28 +01:00
Georgi Gerganov
39173bcacb
context : reserve new scheduler when graph topology changes ( #18547 )
...
* context : reserve new scheduler when graph topology changes
* cont : fix
* cont : fix reserve
* cont : reserve only when changes occur + timing
* context : add comments
* llama : reserve on sampler changes
* common : allow null common_sampler
* server : task declares needs (embd, logits, sampling)
* server : do not init sampler if not needed
* llama : fix need_reserve when unsetting a sampler
* server : consolidate slot reset/clear logic
2026-01-15 16:39:17 +02:00
Xuan-Son Nguyen
9ac2693a30
server: fix n_cmpl not skipping processing prompt ( #18663 )
...
* server: fix n_cmpl not skipping processing
* fix infinite loop on empty batch
* cont : init child samplers + modify child logic
* cont : cleanup
* cont : improve n_cmpl logic
- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache
* cont : remove redundant function
* cont : reduce parent checks
* fix : nullptr task dereference
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-10 00:00:41 +01:00