* Refactor llama_model_quantize_params to expose a pure C interface
* Restore comment and cleanup struct def
* Code review refactoring
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Code review refactoring
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The build info is now only for debug, so we avoid the duplicate
with `--version`.
The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load
* Revert a308e584ca
* Add regression test
* Remove regression test for init-fail sampler check
* fix: include API key in CORS proxy requests for MCP connections
When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.
Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.
Fixes#21167
* fix: simplify headers assignment based on reviewer suggestion
Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
Accept all valid reasoning item content formats in multi-turn input:
- Array of objects: [{"type":"reasoning_text","text":"..."}] (spec format)
- Plain string: "thinking about it" (OpenCode format)
- Null: content:null with encrypted_content (Codex, openai/codex#11834)
- Omitted entirely: no content field present
Previously threw "item['content'] is not an array" for non-array formats,
breaking OpenCode multi-turn conversations. The encrypted_content field
is accepted but ignored for local models (no server-side decryption).
Add 4 tests covering each format variant.
Refs: openai/codex#11834, anomalyco/opencode#19081
Code fixes:
- build_oai_resp_metadata accepts status param; completed_at is null
when status is in_progress (was always set to timestamp)
- response.created/in_progress events use zeroed usage (was passing
actual prompt tokens before response was logically started)
- Function call item IDs are now generated once per tool call in
update() and reused consistently across output_item.added,
function_call_arguments.delta, and output_item.done events
(was generating independent random IDs in each path)
- Clean up commented-out status checks in server-common.cpp
Test fixes:
- Assert sequence_number on every event unconditionally (was using
weak "if present" guard)
- Check actual values not just key presence in streaming created
event test (completed_at is None, usage tokens are 0, etc.)
Refs: ggml-org/llama.cpp#21174 (patrick review)
- test_responses_stream_created_event_has_full_response: verify
response.created contains all 24+ fields with status in_progress
- test_responses_stream_all_events_have_sequence_number: every event
has sequence_number and they are strictly increasing across stream
- test_responses_stream_delta_events_have_indices: output_index and
content_index present on all delta/added events
All 14 tests pass (2 original + 9 from previous commit + 3 new).
- Add sequence_number to ALL streaming events (created, in_progress,
output_item.added, content_part.added, all delta events)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Populate full response object in response.created and
response.in_progress events (was only {id, object, status})
- Add id field to function_call output_item.added events
- Add status: completed to reasoning output_item.done events
- Counter state persisted across streaming chunks via task_result_state
Fixes: spec-compliant client libraries (async-openai) that require
these fields can now parse all streaming events without error.
Refs: ggml-org/llama.cpp#21174 (fumlig review comment)
Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- output_text convenience field in streaming and non-streaming responses
Responses API compliance (ideas from #19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to streaming events
- Accept input_text type and EasyInputMessage for multi-turn input
Verified: codex -p local and codex -p fast work against local
llama.cpp with Qwen3.5 models including native tool calling.
Refs: ggml-org/llama.cpp#19138, ggml-org/llama.cpp#19720
* introduce LLAMA_SERVER_NO_WEBUI
* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI
* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE
* MIssed this
* Add useWebUi to package.nix
* server: respect the verbose_prompt parameter
* Revert "server: respect the verbose_prompt parameter"
This reverts commit 8ed885cf37.
* Remove --verbose-prompt parameter from llama-server
* Using set_examples instead of set_excludes
The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after
decoding. Should be embd.begin(), embd.end(). Introduced in commit 2b6dfe8.
* webui: send reasoning_content back to model in context
Preserve assistant reasoning across turns by extracting it from
internal tags and sending it as a separate reasoning_content field
in the API payload. The server and Jinja templates handle native
formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...).
Adds "Exclude reasoning from context" toggle in Settings > Developer
(off by default, so reasoning is preserved). Includes unit tests.
* webui: add syncable parameter for excludeReasoningFromContext
* chore: update webui build output
* mtmd: refactor image pre-processing
* correct some places
* correct lfm2
* fix deepseek-ocr on server
* add comment to clarify about mtmd_image_preprocessor_dyn_size
* imatrix: fix crash when using --show-statistics with zero counts
Fixes division by zero that caused floating point exceptions when processing imatrix files with zero count values. Added checks to skip zero counts and handle empty activation vectors.
Fix for the bug #19190
* imatrix: lower log level for zero-count skip message to DBG
* mtmd: llama.cpp DeepSeekOCR support
init commit
* loading sam tensors
* mtmd: fix vision model processing
* deepseek-ocr clip-vit model impl
* mtmd: add DeepSeek-OCR LM support with standard attention
* mtmd: successfully runs DeepSeek-OCR LM in llama-cli
* mtmd: Fix RoPE type for DeepSeek-OCR LM.
* loading LM
testing Vision model loading
* sam warmup working
* sam erroneous return corrected
* clip-vit: corrected cls_embd concat
* clip-vit: model convert qkv_proj split
* corrected combining of image encoders' results
* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model
* concat image_newline and image_seperator tokens
* visual_model warmup (technically) works
* window partitioning using standard ggml ops
* sam implementation without using CPU only ops
* clip: fixed warnings
* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr
* mtmd: fix get_rel_pos
* mtmd: fixed the wrong scaler for get_rel_pos
* image encoding technically works but the output can't be checked singe image decoding fails
* mtmd: minor changed
* mtmd: add native resolution support
* - image encoding debugged
- issues fixed mainly related wrong config like n_patches etc.
- configs need to be corrected in the converter
* mtmd: correct token order
* - dynamic resizing
- changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4
* mtmd: quick fix token order
* mtmd: fix danling pointer
* mtmd: SAM numerically works
* mtmd: debug CLIP-L (vit_pre_ln)
* mtmd: debug CLIP-L & first working DeepSeek-OCR model
* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work
* mtmd: simplify SAM patch embedding
* mtmd: adapt Pillow image resizing function
* mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing
* mtmd: remove --dsocr-mode argument
* mtmd: refactor code & remove unused helper functions
* mtmd: fix tensor names for image newlines and view separator
* clean up
* reverting automatically removed spaces
* reverting automatically removed spaces
* mtmd: fixed bad ocr check in Deepseek2 (LM)
* mtmd: support combined QKV projection in buid_vit
* using common build_attn in sam
* corrected code-branch when flash-attn disabled
enabling usage of --flash-attn option
* mtmd: minor fix
* minor formatting and style
* fixed flake8 lint issues
* minor editorconfig-check fixes
* minor editorconfig-check fixes
* mtmd: simplify get_rel_pos
* mtmd: make sam hparams configurable
* mtmd: add detailed comments for resize_bicubic_pillow
* mtmd: fixed wrong input setting
* mtmd: convert model in FP16
* mtmd: minor fix
* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template
* fix: test-1.jpg ORC issue with small (640) resolution
setting min-resolution base (1024) max large (1280) for dynamic-resolution
* minor: editconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn
* minor: editconfig-check fix
* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR
* quick and (potential) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909
* refactoring, one single builder function and static helpers
* added deepseek-ocr test to tests.sh
* minor formatting fixes
* check with fixed expected resutls
* minor formatting
* editorconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042
* minor
- added GLM-4.6V to big tests
- added missing deps for python test
* convert: minor fix
* mtmd: format code
* convert: quick fix
* convert: quick fix
* minor python formatting
* fixed merge build issue
* merge resolved
- fixed issues in convert
- tested several deepseek models
* minor fix
* minor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
* - cleaning commented out code
* fixing instabilities issues reintroducing resize_bicubic_pillow
* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr
* rename fc_w --> mm_fc_w
* add links to OCR discussion
* cleaner loading code
* add missing .weight to some tensors
* add default jinja template (to be used by server)
* move test model to ggml-org
* rolling back upscale change
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: bluebread <hotbread70127@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* common : add standard Hugging Face cache support
- Use HF API to find all files
- Migrate all manifests to hugging face cache at startup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check with the quant tag
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve error handling and report API errors
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Restore common_cached_model_info and align mmproj filtering
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Prefer main when getting cached ref
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use cached files when HF API fails
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use final_path..
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check all inputs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* added support for internvl's dynamic high-resolution (Qianfan-OCR needed)
* add min/max dynamic patch to gguf meta
* clean up
* simplified handling min/max dynamic patch
* reuse llava_uhd logic for slice images
* provide default values for older models
* flake8
* prevent writing 0 value to gguf
* remove duplicated resolution candidates with a better algorithm
* fix indentation
* format
* add protection from divide by zero
* change to 0 to be safe
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>