* llama : remove write/read of output ids/logits/embeddings
This commit removes the write/read of output ids, logits and
embeddings from the llama context state.
Refs: https://github.com/ggml-org/llama.cpp/pull/18862#issuecomment-3756330941
* completion : add replying of session state
This commit updates the session handing in the completion tool to handle
the that logits are no longer stored in the session file. Instead, we
need to replay the last token to get the logits for sampling.
* common : add common_prompt_batch_decode function
This commit adds a new function which is responsible for decoding prompt
and optionally handle the saving for session data.
* update save-state.cpp to use llama_state_load_file
This commit updates the save-load-state example to utilize the new
llama_state_load_file function for loading the model state from a file.
And it also replays the last token after loading since this state is now
stored before the last token is processed.
* examples : set n_seq_max = 2 for ctx3
This commit updates the save-load-state example to set the n_seq_max
parameter to 2 when initializing the ctx3 context.
The motivation for this change is that using 1 as n_parallel/n_seq_max
the context only supports one sequence, but the test laster tries to
use a second sequence which results in the following error:
```console
main : loaded state with 4 tokens
main : seq 0 copied, 225760 bytes
main : kv cache cleared
find_slot: seq_id=1 >= n_seq_max=1 Try using a bigger --parallel value
state_read_meta: failed to find available cells in kv cache
```
This seems to only happen for recurrent/hybrid models.
* common : fix Step-3.5-Flash format detection and thinking support
Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.
Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.
Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
<parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test
Builds on: https://github.com/ggml-org/llama.cpp/pull/19283
* chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests
Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.
Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.
Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.
* chat : remove dead thinking code from qwen3_coder_xml
Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.
* mtmd : chat : Fix extra \n between text and media marker
Thanks to @tugot17 for detecting and reporting the issue.
For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation.
However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.
This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.
PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.
With this change number of input tokens is identical to HF
implementation and as a result the output is also identical.
I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`
Please propose alternative ways of fixing this issue.
* Refactor to use explicite per type ifs
* Update common/chat.cpp
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* Update common_chat_templates_apply_legacy
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
This option was introduced as a workaround because cpp-httplib could not
build on visionOS. Since it has been fixed and now compiles on all platforms,
we can remove it and simplify many things.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Refactoring to use new llama_put_adapter_loras
* cont : alternative lora API
---------
Co-authored-by: Jake Chavis <jakechavis6@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : remove legacy .json to .etag migration code
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : simplify common_download_file_single_online
This commit also force a redownload if the file exists
but has no .etag file.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
This commit removes two unused functions `common_lcp` and `common_lcs`.
The last usage of these functions was removed in
Commit 33eff40240 ("server : vision support
via libmtmd") and are no longer used anywhere in the codebase.
* jinja : add missing 'in' test to template engine (#19004)
The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".
This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.
Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.
Includes test cases for all three containment types plus
reject/select filter usage.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* reuse test_is_in in binary op
---------
Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
On macos Sequoia 15.7.3, x86_64, the build has recently started failing with
```
In file included from .../code/cpp/llama.cpp/common/jinja/string.cpp:2:
.../code/cpp/llama.cpp/common/./jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std'
478 | std::unordered_map<value, value, value_hasher, value_equivalence> unordered;
| ~~~~~^
In file included from .../code/cpp/llama.cpp/common/jinja/caps.cpp:1:
.../code/cpp/llama.cpp/common/jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std'
478 | std::unordered_map<value, value, value_hasher, value_equivalence> unordered;
| ~~~~~^
In file included from .../code/cpp/llama.cpp/common/jinja/value.cpp:1:
In file included from .../code/cpp/llama.cpp/common/jinja/runtime.h:4:
.../code/cpp/llama.cpp/common/jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std'
478 | std::unordered_map<value, value, value_hasher, value_equivalence> unordered;
[...]
```
After a bit of digging to make sure all the appropriate flags were used, I notifced that the necessary header was not included. This fixes the build for me and should not affect negatively other builds that for some reasons were already succeeding
* undefined is treated as iterable (string/array) by filters
`tojson` is not a supported `undefined` filter
* add tests
* add sequence and iterable tests
keep it DRY and fix some types
* common : clarify HTTPS build options in error message
This commit updates the https error message to provide clearer
instructions for users who encounter the "HTTPS is not supported" error.
The motivation for this is that it might not be clear to users that only
one of these options are needed to enable HTTPS support.
The LLAMA_OPENSSL option is also added to the message to cover all
possible build configurations.
* clarify that OpenSSL is the default for HTTPS support
* common : use two decimal places for float arg help messages
This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.
The motivation for this changes is that currently only having one decimal
place means that values generated using --help or llama-gen-docs will not
display the correct values.
For example, currently the value of top-p in tools/server/README.md is
`0.9`, but the default value is actually '0.95'. And running
llama-gen-docs does not update this value as it uses the output from the
help message, which shows only one decimal place, so the values look
like they are unchanged.
* docs : run llama-gen-docs to update docs