* feat: Per-conversation loading states and tracking streaming stats
* chore: update webui build output
* refactor: Chat state management
Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states.
This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed.
Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution.
* feat: Adds loading indicator to conversation items
* chore: update webui build output
* fix: Fix aborting chat streaming
Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent.
This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks are now checked for abort signals to prevent further processing after abortion.
* refactor: Remove redundant comments
* chore: build webui static output
* refactor: Cleanup
* chore: update webui build output
* chore: update webui build output
* fix: Conversation loading indicator for regenerating messages
* chore: update webui static build
* feat: Improve configuration
* feat: Install `http-server` as dev dependency to not need to rely on `npx` in CI
* fix: added a normalization step for MathJax-style \[\] and \(\) delimiters
So inline and block equations are converted before KaTeX rendering,
enabling proper display of model-generated LaTeX in the WebUI
* chore: update webui build output
* fix: add remark plugin to render raw HTML as literal text
Implemented a missing MDAST stage to neutralize raw HTML like major LLM WebUIs
do ensuring consistent and safe Markdown rendering
Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the
Markdown AST into plain-text equivalents while preserving indentation and
line breaks. This ensures consistent rendering and prevents unintended HTML
execution, without altering valid Markdown structure
Kept 'remarkRehype' in the pipeline since it performs the required conversion
from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization
Refined the link-enhancement logic to skip unnecessary DOM rewrites,
fixing a subtle bug where extra paragraphs were injected after the first
line due to full innerHTML reconstruction, and ensuring links open in new
tabs only when required
Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml
-> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify
* fix: address review feedback from allozaur
* chore: update webui build output
* fix: make SSE client robust to premature [DONE] in agentic proxy chains
* webui: remove client-side context pre-check and rely on backend for limits
Removed the client-side context window pre-check and now simply sends messages
while keeping the dialog imports limited to core components, eliminating the
maximum context alert path
Simplified streaming and non-streaming chat error handling to surface a generic
'No response received from server' error whenever the backend returns no content
Removed the obsolete maxContextError plumbing from the chat store so state
management now focuses on the core message flow without special context-limit cases
* webui: cosmetic rename of error messages
* Update tools/server/webui/src/lib/stores/chat.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/stores/chat.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* feat: render user content as markdown option
- Add a persisted 'renderUserContentAsMarkdown' preference to the settings defaults and info metadata so the choice survives reloads like other options
- Surface the new 'Render user content as Markdown' checkbox in the General section of the chat settings dialog, beneath the PDF toggle
- Render user chat messages with 'MarkdownContent' when the new setting is enabled, matching assistant formatting while preserving the existing card styling otherwise
- chore: update webui build output
* chore: update webui build output
* server / ranking : add sorting and management of top_n
* Make the retro compatible if no top_n will return
all results
here is a script to make some test
```script
URL=${1:-http://127.0.0.1:8181}
curl "$URL/v1/rerank" -H "Content-Type: application/json" \
-d '{ "model": "M", "query": "What is the recipe to make bread ?",
"return_text" : true,
"texts" : true,
"top_n": 6,
"documents": [
"voici la recette pour faire du pain, il faut de la farine de l eau et du levain et du sel",
"it is a bear",
"bread recipe : floor, water, yest, salt",
"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
"here is the ingedients to bake bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
"recipe to make cookies : floor, eggs, water, chocolat",
"here is the recipe to make bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
"il fait tres beau aujourd hui",
"je n ai pas faim, je ne veux pas manger",
"je suis a paris"
] }' | jq
```
* use resize() instead for(...)
* simplify top_n init since no need to return error
result to test :
./tests.sh unit/test_rerank.py -v -x
==================================================== test session starts =====================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3
cachedir: .pytest_cache
rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.11.0
collected 8 items
unit/test_rerank.py::test_rerank PASSED [ 12%]
unit/test_rerank.py::test_rerank_tei_format PASSED [ 25%]
unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 37%]
unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 50%]
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 62%]
unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 75%]
unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 87%]
unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [100%]
===================================================== 8 passed in 4.31s ======================================================
* add rerank top_n unit test
here is the result :
./tests.sh unit/test_rerank.py -v -x
=================================================================== test session starts ===================================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3
cachedir: .pytest_cache
rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.11.0
collected 16 items
unit/test_rerank.py::test_rerank PASSED [ 6%]
unit/test_rerank.py::test_rerank_tei_format PASSED [ 12%]
unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 18%]
unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 25%]
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 31%]
unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 37%]
unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 43%]
unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [ 50%]
unit/test_rerank.py::test_rerank_top_n[None-4] PASSED [ 56%]
unit/test_rerank.py::test_rerank_top_n[2-2] PASSED [ 62%]
unit/test_rerank.py::test_rerank_top_n[4-4] PASSED [ 68%]
unit/test_rerank.py::test_rerank_top_n[99-4] PASSED [ 75%]
unit/test_rerank.py::test_rerank_tei_top_n[None-4] PASSED [ 81%]
unit/test_rerank.py::test_rerank_tei_top_n[2-2] PASSED [ 87%]
unit/test_rerank.py::test_rerank_tei_top_n[4-4] PASSED [ 93%]
unit/test_rerank.py::test_rerank_tei_top_n[99-4] PASSED [100%]
=================================================================== 16 passed in 8.84s ===================================================================
* editor config check fix
In streaming mode when prompt exceeds context length, the server returns
HTTP 200 status code with a JSON error in the body. This is very
confusing and inconsistent with all other inference engines which return
HTTP 4xx error in this case.
This patch fixes this problem and makes the server return HTTP 400 in
such cases.
* webui: updated the chat service to only include max_tokens in the request payload when the setting is explicitly provided, while still mapping explicit zero or null values to the infinite-token sentinel
* chore: update webui build output
* minor : code style
* server : fix prompt similarity calculation
* server : initial host-memory prompt caching
* cont
* server : refactor
* cont
* cont : make the server task of the slot const
* cont : minor [no ci]
* server : cache prompts and checkpoints only for completion tasks
* server : improve prompt caching logic
* cont : fix check for number of cached prompts [no ci]
* server : improve caching logic, add -cram CLI arg
* server : print prompt mismatch info
* cont : better naming [no ci]
* server : improve prompt cache loading logic
* server : add option to debug the slot contents (#16482)
* server : add option to debug the slot contents
* Update tools/server/server.cpp
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* server : add option to disable prompt cache
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing
- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
* refactor: implement streaming-aware universal reasoning parser
Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.
- Rework try_parse_reasoning() to track whitespace, partial tags, and
multiple reasoning segments, allowing proper separation of reasoning_content
and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments
The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.
Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
* refactor: address review feedback from allozaur
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)
- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
* refactor: address review feedback from ngxson
* debug: say goodbye to curl -N, hello one-click raw stream
- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story
- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
* npm run format
* chat-parser: address review feedback from ngxson
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* initial commit for branch 3
* generalize `swa_checkpoint` to `ctx_checkpoint`
this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite
* oops
* disable debug prints
* keep backwards compat with `--swa-checkpoints`
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update prompt re-processing message
* fix off-by-one error per GG
* keep `seq_rm` log per GG
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix checkpoint logic to support recurrent caches
* server : cleanup and fixes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat: Capture model name only after first token (streaming) or completed request (non-streaming)
* chore: update webui build output
* chore: update webui build output
* fix: Include just the currently active message branches instead of all in chat completions request
* chore: Build webui static output
* chore: Formatting
* chore: update webui build output
* feat: Add a setting to include model name used to generate the message
* feat: UI improvements
* feat: Save model info along with the database message entry creation
* chore: Build webui static output
* webui: allow viewing conversations and sending messages even if llama-server is down
- Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable.
- Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation.
- Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data.
* feat: Add UI for `props` endpoint unavailable + cleanup logic
* webui: extend cached props fallback to offline errors
Treat connection failures (refused, DNS, timeout, fetch) the same way as
server 5xx so the warning banner shows up when cache is available, instead
of falling back to a full error screen.
* webui: Left the chat form enabled when a server warning is present so operators can keep sending messages
e.g., to restart the backend over llama-swap, even while cached /props data is in use
* chore: update webui build output
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* Switched web UI to hash-based routing
* Added hash to missed goto function call
* Removed outdated SPA handling code
* Fixed broken sidebar home link
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.
The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
* server: fix SSE and OpenAI compatibility for error messages when streaming
* server: remove obsolete event parameter and use required data fieldname instead
* server : include usage statistics only when user request them
When serving the OpenAI compatible API, we should check if
{"stream_options": {"include_usage": true} is set in the request when
deciding whether we should send usage statistics
closes: #16048
* add unit test
* requirements : update transformers/torch for Embedding Gemma
This commit updates the requirements to support converting
Embedding Gemma 300m models.
The motivation for this change is that during development I had a local
copy of the transformers package which is what I used for converting
the models. This was a mistake on my part and I should have also updated
my transformers version to the official release.
I had checked the requirements/requirements-convert_legacy_llama.txt
file and noted that the version was >=4.45.1,<5.0.0 and came to the
conculusion that no updated would be needed, this assumed that
Embedding Gemma would be in a transformers release at the time
Commit fb15d649ed ("llama : add support
for EmbeddingGemma 300m (#15798)) was merged. So anyone wanting to
convert themselves would be able to do so. However, Embedding Gemma is
a preview release and this commit updates the requirements to use this
preview release.
* resolve additional python dependencies
* fix pyright errors in tokenizer test and remove unused import
* server : implement `return_progress`
* add timings.cache_n
* add progress.time_ms
* add test
* fix test for chat/completions
* readme: add docs on timings
* use ggml_time_us
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat: Add python-side constants and conversion for adapter.lora.invocation_string
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add c++ side constants for adapter.lora.invocation_string
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse invocation string for adapters from GGUF
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(python): Update conversion to alora_invocation_tokens
This is the preferred method in PEFT which is the source of ground truth
https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(cpp): Update to alora_invocation_tokens on c++ side
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add C APIs to get alora invocation token array from lora
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Initial implementation of alora cache logic in server
This does not yet do the part to identify the invocation tokens and only
apply the lora adapter afterwards, but it does seem to produce correct
results if the invocation tokens are the beginning of the uncached input.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Identify alora invocation sequences
This currently limits to a single enabled alora per slot. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling and is not really a well
studied case for alora since it's unclear if one alora can reuse cache from
previous prefill computed with a different alora.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Only reuse cache for tokens before the alora invocation start
This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Handle un-cached tokens that come before the alora activation
The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use || instead of 'or'
Too much python 🤦
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix off-by-one for limiting cached tokens to before alora start
This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Support backwards-compatibility for "invocation_string" in adapter_config.json
While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string," so this will enable backwards compatibility and
enable testing now (before PEFT PR changes have percolated everywhere).
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove duplicate logging
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Set enable_thinking IFF not disabled and supported
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix inverted logic condition for prefill error
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Always parse the enable_thinking kwarg to overwrite the default value
From what I can tell, this started as a Qwen3-specific keyword, but from
the use in `chat.cpp` translates this inputs.enable_thinking to the right
thinking kwarg for the given model, this is now more of a standardized
kwarg, so it should always override the default value when sent as part of
the chat_template_kwargs field in the API.
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Don't limit tempalte expansion check to jinja
With the use_jinja check, non-jinja models would enable thinking and always
fail assistant prefill
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add the error text to json type errors in json_value
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Explicitly reject string values for "enable_thinking"
There are too many possible "truthy" / "falsy" strings and too many
ambiguous strings that don't have a clear truthy/falsy value, so the
simplest thing to do here is to reject the request. Ideally, this would be
a 422 (Unprocessable Entity), but right now it's coming back as a 500.
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Move logic for detecting template enable_thinking support to common
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use raw pointer for common chat template function
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* sampling : optimize sorting using bucket sort in more places
ggml-ci
* sampling : do not sort in dist sampler
ggml-ci
* sampling : avoid heap allocations for sort buffers
ggml-ci
* common : add option to sort sampling candidates by probability
ggml-ci
* sampling : revert the change for preserving sort buffers
* sampling : use std::copy instead of memcpy
* sampling : clarify purpose of partial sort helpers
ggml-ci
* cont : remove wrong comment [no ci]
* common : update comment
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* server : enable /slots by default and make it secure
ggml-ci
* server : fix tests to pass `--no-slots` when necessary
* server : extend /props with info about enabled endpoints
- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types
- Add capability to model endpoint to indicate if client can send multimodal data
- Add tests.
* Update docker.yml
修改docker.yml文件中的内容使其停止周期性的运行该workflow,如果想要运行该workflow可以手动启动
* feat:Modify the header file include path
1. There's no llava directory in the tools directory.
2. Because the command `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically include the `mtmd` directory as a search path for header files. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava`` or use `target_include_directories(${TARGET} PRIVATE ../mtmd`` to explicitly require the `llama-server` target to use header files from `mtmd`.
* Restore the docker.yml file
Add tracking for high watermark cache usage and make it available in /metrics endpoint.
Use-case: Tracking largest needed cache usage under realistic workload
to better understand memory requirements and be able to adjust
cache size/quantization for model/cache accordingly.
* model : add harmony parser for gpt-oss
* gpt-oss : fix grammar trigger from causing empty stack
* gpt-oss: tweak the grammar trigger again
* gpt-oss : add support for recipient in role header
* gpt-oss : fix ungrouped tool calls in grammar
* gpt-oss : loosen function name matching during parse
* gpt-oss : clean up workarounds
* gpt-oss : add template tests
* gpt-oss : simulate thinking and tool call tags
* gpt-oss : undo think tags when reasoning_format is none
* gpt-oss : set special tokens back to user defined
* gpt-oss : update openai-gpt-oss template
* server : filter out harmony thought messages
* gpt-oss : simplify parsing
* server : add SWA checkpoints
ggml-ci
* cont : server clean-up
* server : handle state restore fails
* llama : add extended llama_state_seq_ API
* server : do not make checkpoints if --swa-full
ggml-ci
* llama : remove flags value for NONE
* server : configure number of SWA checkpoints with CLI arg
ggml-ci
* args : fix scope of new argument
* Checkpoint from VS Code for coding agent session
* Initial plan
* Fix typo in --override-tensor-draft flag implementation
* Add null termination for speculative tensor buffer overrides
* Apply suggestions from code review
* Apply suggestions from code review
* Extract tensor override parsing logic to common function (addresses @slaren's feedback)
* Apply suggestions from code review
* Apply suggestions
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* llama-server : implement universal assisted decoding
* Erase prompt tail for kv-cache
* set vocab_dft_compatible in common_speculative
* rename ctx_main to ctx_tgt
* move vocab_dft_compatible to spec struct
* clear mem_dft, remove mem
* detokenize id_last for incompatible models
* update comment
* add --spec-replace flag
* accept special tokens when translating between draft/main models
* Escape spec-replace
* clamp draft result to size to params.n_draft
* fix comment
* clean up code
* restore old example
* log common_speculative_are_compatible in speculative example
* fix
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit adds support for the `embd_normalize` parameter in the
server code.
The motivation for this is that currently if the server is started with
a pooling type that is not `none`, then Euclidean/L2 normalization will
be the normalization method used for embeddings. However, this is not
always the desired behavior, and users may want to use other
normalization (or none) and this commit allows that.
Example usage:
```console
curl --request POST \
--url http://localhost:8080/embedding \
--header "Content-Type: application/json" \
--data '{"input": "Hello world today", "embd_normalize": -1}
```
* initial commit for handling extra template kwargs
* enable_thinking and assistant prefill cannot be enabled at the same time
* can set chat_template_kwargs in command line
* added doc
* fixed formatting
* add support for extra context in generic template init
* coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestions from code review
coding standard: cosmetic changes
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix merge conflict
* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
* normalize environment variable name
* simplify code
* prefill cannot be used with thinking models
* compatibility with the new reasoning-budget parameter
* fix prefill for non thinking models
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>