Code fixes:
- build_oai_resp_metadata accepts status param; completed_at is null
when status is in_progress (was always set to timestamp)
- response.created/in_progress events use zeroed usage (was passing
actual prompt tokens before response was logically started)
- Function call item IDs are now generated once per tool call in
update() and reused consistently across output_item.added,
function_call_arguments.delta, and output_item.done events
(was generating independent random IDs in each path)
- Clean up commented-out status checks in server-common.cpp
Test fixes:
- Assert sequence_number on every event unconditionally (was using
weak "if present" guard)
- Check actual values not just key presence in streaming created
event test (completed_at is None, usage tokens are 0, etc.)
Refs: ggml-org/llama.cpp#21174 (patrick review)
- test_responses_stream_created_event_has_full_response: verify
response.created contains all 24+ fields with status in_progress
- test_responses_stream_all_events_have_sequence_number: every event
has sequence_number and they are strictly increasing across stream
- test_responses_stream_delta_events_have_indices: output_index and
content_index present on all delta/added events
All 14 tests pass (2 original + 9 from previous commit + 3 new).
- Add sequence_number to ALL streaming events (created, in_progress,
output_item.added, content_part.added, all delta events)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Populate full response object in response.created and
response.in_progress events (was only {id, object, status})
- Add id field to function_call output_item.added events
- Add status: completed to reasoning output_item.done events
- Counter state persisted across streaming chunks via task_result_state
Fixes: spec-compliant client libraries (async-openai) that require
these fields can now parse all streaming events without error.
Refs: ggml-org/llama.cpp#21174 (fumlig review comment)
Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- output_text convenience field in streaming and non-streaming responses
Responses API compliance (ideas from #19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to streaming events
- Accept input_text type and EasyInputMessage for multi-turn input
Verified: codex -p local and codex -p fast work against local
llama.cpp with Qwen3.5 models including native tool calling.
Refs: ggml-org/llama.cpp#19138, ggml-org/llama.cpp#19720
* Optimize MOE GEMV kernel for BS > 1.
The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row.
New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).
This change doesn't increase any compilation time as a single template instance is needed per type. This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization.
* Remove em-dashes
* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits
Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* hex-fa: add simple dma cache for Mask
I noticed that we were refetch the mask rows over and over.
This simple cache avoids that.
* hex-dma: unset in-order desc bit which caused signficant perf regression
We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.
* hex-rope: update comment to clarify that we don't need in-order DMA completions
The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).
Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.
This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).
The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
implementations do not yet support this syntax)
* introduce LLAMA_SERVER_NO_WEBUI
* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI
* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE
* MIssed this
* Add useWebUi to package.nix
* server: respect the verbose_prompt parameter
* Revert "server: respect the verbose_prompt parameter"
This reverts commit 8ed885cf37.
* Remove --verbose-prompt parameter from llama-server
* Using set_examples instead of set_excludes
The compute graph may contain tensors pointing to CPU buffers. In these
cases the buffer address is serialized as 0 and sent over the wire.
However, the data pointer is serialized as-is and this prevents proper
validation on the server side. This patches fixes this by serializing
the data pointer as 0 for non-RPC buffers and doing proper validation on
the server side.
closes: #21006
The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after
decoding. Should be embd.begin(), embd.end(). Introduced in commit 2b6dfe8.
* webui: send reasoning_content back to model in context
Preserve assistant reasoning across turns by extracting it from
internal tags and sending it as a separate reasoning_content field
in the API payload. The server and Jinja templates handle native
formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...).
Adds "Exclude reasoning from context" toggle in Settings > Developer
(off by default, so reasoning is preserved). Includes unit tests.
* webui: add syncable parameter for excludeReasoningFromContext
* chore: update webui build output
Updates Metal tensor API test probe to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).
* cann: update docker images to 8.5.0
- bump CANN base image from 8.3.rc2 to 8.5.0
- bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0
Move to newer stable releases.
* cann: update CANN.md
* Update CANN.md to include BF16 support
Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions.
* Fix formatting issues in CANN.md
Fix 234: Trailing whitespace
* mtmd: refactor image pre-processing
* correct some places
* correct lfm2
* fix deepseek-ocr on server
* add comment to clarify about mtmd_image_preprocessor_dyn_size