* requirements : update transformers to 5.5.0
This commit updates the transformers dependency to version 5.5.0.
The motivation for this is that transformers 5.5.0 includes support for
Gemma4 and is required to be able to convert Gemma4 models. This is also
causing issues for user of gguf-my-repo.
Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/202
* fix huggingface_hub version
* set version of transformers to 5.5.0
* convert : add ty ignore directives to convert_hf_to_gguf.py
This commit adds `ty: ignore` directives to transformers tokenizers
field/methods to avoid type check errors. There might be better ways to
handle this and perhaps this can be done in a follow up commit.
The motivation for this is that it looks like in transformers 5.5.0
AutoTokenizer.from_pretrained can return generic tokenizer types or None
and the type checker now produces an error when the conversion script
accesses field like tokenizer.vocab.
* convert : add ty ignore to suppress type check errors
* convert : remove incorrect type ignores
* convert : fix remaining python checks
I was running a newer version of ty locally but I've switched to
version 0.0.26 which is what CI uses and I was then able to reproduce
the errors. Sorry about the noise.
* update transformers version to 5.5.1
* feat: support step3-vl-10b
* use fused QKV && mapping tensor in tensor_mapping.py
* guard hardcoded params and drop crop metadata
* get understand_projector_stride from global config
* img_u8_resize_bilinear_to_f32 move in step3vl class
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix the \r\n mess
* add width and heads to MmprojModel.set_gguf_parameters
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* webui: fix syntax highlighting lost for non-common languages after streaming
rehype-highlight uses lowlight internally, which only bundles 37 "common"
languages. The streaming code path uses highlight.js directly (192 languages),
so languages like Haskell highlight correctly while streaming but lose all
color once the code block closes. Pass the full lowlight language set to
rehype-highlight so both paths support the same languages.
* webui: rebuild static files after rebase
* Fix Arabic RTL text rendering in web UI
- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Ensure bidirectional text support for mixed Arabic/English content
* Clean up commented duplicate function
Remove the commented-out duplicate transformMdastNode function
that was left over from refactoring.
* Fix Arabic RTL text rendering in web UI
- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Minor code formatting improvements
This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI.
* Implement rehype plugin for comprehensive RTL text support
- Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children
- Replace DOMParser-based approach with efficient HAST tree processing
- Remove hardcoded element lists for better maintainability
- Ensure proper bidirectional text rendering for mixed RTL/LTR content
* Fix RTL text rendering with rehype plugin and cleanup
* fix: prettier formatting
Check the return value of sink.write() in the chunked content provider
and return false when the write fails, matching cpp-httplib's own
streaming contract. This prevents logging chunks as sent when the sink
rejected them and properly aborts the stream on connection failure.
This PR changes the logging that occurs at startup of llama-server.
Currently, it is redundant (including CPU information twice) and it is
missing the build + commit info.
* fix: Bypass API Key validation for static bundle assets
* refactor: All bypassed routes in `public_endpoints`
* test: Update static assets API Key test
* Refactor llama_model_quantize_params to expose a pure C interface
* Restore comment and cleanup struct def
* Code review refactoring
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Code review refactoring
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The build info is now only for debug, so we avoid the duplicate
with `--version`.
The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load
* Revert a308e584ca
* Add regression test
* Remove regression test for init-fail sampler check
* fix: include API key in CORS proxy requests for MCP connections
When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.
Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.
Fixes#21167
* fix: simplify headers assignment based on reviewer suggestion
Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
* introduce LLAMA_SERVER_NO_WEBUI
* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI
* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE
* MIssed this
* Add useWebUi to package.nix
* server: respect the verbose_prompt parameter
* Revert "server: respect the verbose_prompt parameter"
This reverts commit 8ed885cf37.
* Remove --verbose-prompt parameter from llama-server
* Using set_examples instead of set_excludes
The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after
decoding. Should be embd.begin(), embd.end(). Introduced in commit 2b6dfe8.
* webui: send reasoning_content back to model in context
Preserve assistant reasoning across turns by extracting it from
internal tags and sending it as a separate reasoning_content field
in the API payload. The server and Jinja templates handle native
formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...).
Adds "Exclude reasoning from context" toggle in Settings > Developer
(off by default, so reasoning is preserved). Includes unit tests.
* webui: add syncable parameter for excludeReasoningFromContext
* chore: update webui build output
* mtmd: refactor image pre-processing
* correct some places
* correct lfm2
* fix deepseek-ocr on server
* add comment to clarify about mtmd_image_preprocessor_dyn_size
* imatrix: fix crash when using --show-statistics with zero counts
Fixes division by zero that caused floating point exceptions when processing imatrix files with zero count values. Added checks to skip zero counts and handle empty activation vectors.
Fix for the bug #19190
* imatrix: lower log level for zero-count skip message to DBG
* mtmd: llama.cpp DeepSeekOCR support
init commit
* loading sam tensors
* mtmd: fix vision model processing
* deepseek-ocr clip-vit model impl
* mtmd: add DeepSeek-OCR LM support with standard attention
* mtmd: successfully runs DeepSeek-OCR LM in llama-cli
* mtmd: Fix RoPE type for DeepSeek-OCR LM.
* loading LM
testing Vision model loading
* sam warmup working
* sam erroneous return corrected
* clip-vit: corrected cls_embd concat
* clip-vit: model convert qkv_proj split
* corrected combining of image encoders' results
* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model
* concat image_newline and image_seperator tokens
* visual_model warmup (technically) works
* window partitioning using standard ggml ops
* sam implementation without using CPU only ops
* clip: fixed warnings
* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr
* mtmd: fix get_rel_pos
* mtmd: fixed the wrong scaler for get_rel_pos
* image encoding technically works but the output can't be checked singe image decoding fails
* mtmd: minor changed
* mtmd: add native resolution support
* - image encoding debugged
- issues fixed mainly related wrong config like n_patches etc.
- configs need to be corrected in the converter
* mtmd: correct token order
* - dynamic resizing
- changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4
* mtmd: quick fix token order
* mtmd: fix danling pointer
* mtmd: SAM numerically works
* mtmd: debug CLIP-L (vit_pre_ln)
* mtmd: debug CLIP-L & first working DeepSeek-OCR model
* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work
* mtmd: simplify SAM patch embedding
* mtmd: adapt Pillow image resizing function
* mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing
* mtmd: remove --dsocr-mode argument
* mtmd: refactor code & remove unused helper functions
* mtmd: fix tensor names for image newlines and view separator
* clean up
* reverting automatically removed spaces
* reverting automatically removed spaces
* mtmd: fixed bad ocr check in Deepseek2 (LM)
* mtmd: support combined QKV projection in buid_vit
* using common build_attn in sam
* corrected code-branch when flash-attn disabled
enabling usage of --flash-attn option
* mtmd: minor fix
* minor formatting and style
* fixed flake8 lint issues
* minor editorconfig-check fixes
* minor editorconfig-check fixes
* mtmd: simplify get_rel_pos
* mtmd: make sam hparams configurable
* mtmd: add detailed comments for resize_bicubic_pillow
* mtmd: fixed wrong input setting
* mtmd: convert model in FP16
* mtmd: minor fix
* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template
* fix: test-1.jpg ORC issue with small (640) resolution
setting min-resolution base (1024) max large (1280) for dynamic-resolution
* minor: editconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn
* minor: editconfig-check fix
* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR
* quick and (potential) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909
* refactoring, one single builder function and static helpers
* added deepseek-ocr test to tests.sh
* minor formatting fixes
* check with fixed expected resutls
* minor formatting
* editorconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042
* minor
- added GLM-4.6V to big tests
- added missing deps for python test
* convert: minor fix
* mtmd: format code
* convert: quick fix
* convert: quick fix
* minor python formatting
* fixed merge build issue
* merge resolved
- fixed issues in convert
- tested several deepseek models
* minor fix
* minor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
* - cleaning commented out code
* fixing instabilities issues reintroducing resize_bicubic_pillow
* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr
* rename fc_w --> mm_fc_w
* add links to OCR discussion
* cleaner loading code
* add missing .weight to some tensors
* add default jinja template (to be used by server)
* move test model to ggml-org
* rolling back upscale change
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: bluebread <hotbread70127@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* common : add standard Hugging Face cache support
- Use HF API to find all files
- Migrate all manifests to hugging face cache at startup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check with the quant tag
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve error handling and report API errors
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Restore common_cached_model_info and align mmproj filtering
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Prefer main when getting cached ref
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use cached files when HF API fails
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use final_path..
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check all inputs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* added support for internvl's dynamic high-resolution (Qianfan-OCR needed)
* add min/max dynamic patch to gguf meta
* clean up
* simplified handling min/max dynamic patch
* reuse llava_uhd logic for slice images
* provide default values for older models
* flake8
* prevent writing 0 value to gguf
* remove duplicated resolution candidates with a better algorithm
* fix indentation
* format
* add protection from divide by zero
* change to 0 to be safe
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* misc : prefer ggml-org models in docs and examples
Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.
* remove accidentally committed file
* server: (doc) clarify in-scope and out-scope features
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Two bugs in `server_models::load()` that affect router mode reliability:
**Bug 1: Deadlock when child process crashes**
When a child process is killed (e.g., SIGKILL from OS code signature
validation), the monitoring thread deadlocks on `stopping_thread.join()`
because the stopping_thread's wait predicate (`is_stopping`) is never
satisfied — the model name was never inserted into `stopping_models`.
`update_status()` is never reached and the model stays stuck in LOADING
state permanently.
Fix: extend the stopping_thread's wait predicate to also wake when the
child process is no longer alive (`!subprocess_alive()`). When woken by
a dead child, the thread skips the shutdown sequence and returns
immediately. The original `stopping_models.erase()` logic is preserved
for normal unloads.
**Bug 2: TOCTOU race bypasses `--models-max` (ref #20137)**
`unload_lru()` is called outside the mutex, then `load()` acquires the
lock afterward. Under concurrent requests, multiple threads observe
capacity and all proceed to load, exceeding the limit.
Fix: re-check capacity under the lock after `unload_lru()` returns.
If another thread filled the slot in the window between `unload_lru()`
and the lock acquisition, reject with an error instead of silently
exceeding the limit.
* tests : fix fetch_server_test_models.py
* server: to_json_oaicompat cached_tokens
Adds OpenAI and Anthropic compatible information about the
number of cached prompt tokens used in a response.
* webui: make server the source of truth for sampling defaults
* webui: fix Custom badge for sampling parameters
* webui: log user overrides after server sync
* chore: update webui build output
* fix: Default values for sampling settings config object
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* add tests for model id parser
* add test case having activated params
* add structured tests for model id parser
* add ToDo
* feat: Improve model parsing logic + tests
* chore: update webui build output
---------
Co-authored-by: bluemoehre <bluemoehre@gmx.de>
* webui: fix model selector being locked to first loaded model
When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.
* chore: update webui build output
* webui: use date in exported filename
Move conversation naming and export to utils
update index.html.gz
* webui: move literals to message export constants file
* webui: move export naming and download back to the conversation store
* chore: update webui build output
* webui: add comments to some constants
* chore: update webui build output