llama.cpp

Commit Graph

Author	SHA1	Message	Date
SamareshSingh	0fac87b157	imatrix : fix crash when using --show-statistics with zero counts (#19532 ) * imatrix: fix crash when using --show-statistics with zero counts Fixes division by zero that caused floating point exceptions when processing imatrix files with zero count values. Added checks to skip zero counts and handle empty activation vectors. Fix for the bug #19190 * imatrix: lower log level for zero-count skip message to DBG	2026-03-26 08:14:36 +01:00
Saba Fallah	a970515bdb	mtmd: Add DeepSeekOCR Support (#17400 ) * mtmd: llama.cpp DeepSeekOCR support init commit * loading sam tensors * mtmd: fix vision model processing * deepseek-ocr clip-vit model impl * mtmd: add DeepSeek-OCR LM support with standard attention * mtmd: successfully runs DeepSeek-OCR LM in llama-cli * mtmd: Fix RoPE type for DeepSeek-OCR LM. * loading LM testing Vision model loading * sam warmup working * sam erroneous return corrected * clip-vit: corrected cls_embd concat * clip-vit: model convert qkv_proj split * corrected combining of image encoders' results * fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model * concat image_newline and image_seperator tokens * visual_model warmup (technically) works * window partitioning using standard ggml ops * sam implementation without using CPU only ops * clip: fixed warnings * Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr * mtmd: fix get_rel_pos * mtmd: fixed the wrong scaler for get_rel_pos * image encoding technically works but the output can't be checked since image decoding fails * mtmd: minor changed * mtmd: add native resolution support * - image encoding debugged - issues fixed mainly related to wrong config like n_patches etc. - configs need to be corrected in the converter * mtmd: correct token order * - dynamic resizing - changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4 * mtmd: quick fix token order * mtmd: fix dangling pointer * mtmd: SAM numerically works * mtmd: debug CLIP-L (vit_pre_ln) * mtmd: debug CLIP-L & first working DeepSeek-OCR model * mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work * mtmd: simplify SAM patch embedding * mtmd: adapt Pillow image resizing function * mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing * mtmd: remove --dsocr-mode argument * mtmd: refactor code & remove unused helper functions * mtmd: fix tensor names for image newlines and view separator * clean up * reverting automatically removed spaces * reverting automatically removed spaces * mtmd: fixed bad ocr check in Deepseek2 (LM) * mtmd: support combined QKV projection in build_vit * using common build_attn in sam * corrected code-branch when flash-attn disabled enabling usage of --flash-attn option * mtmd: minor fix * minor formatting and style * fixed flake8 lint issues * minor editorconfig-check fixes * minor editorconfig-check fixes * mtmd: simplify get_rel_pos * mtmd: make sam hparams configurable * mtmd: add detailed comments for resize_bicubic_pillow * mtmd: fixed wrong input setting * mtmd: convert model in FP16 * mtmd: minor fix * mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template * fix: test-1.jpg OCR issue with small (640) resolution setting min-resolution base (1024) max large (1280) for dynamic-resolution * minor: editorconfig-check fix * merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909 added new opt to tests.sh to disable flash-attn * minor: editorconfig-check fix * testing deepseek-ocr quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR * quick and (potential) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909 * refactoring, one single builder function and static helpers * added deepseek-ocr test to tests.sh * minor formatting fixes * check with fixed expected results * minor formatting * editorconfig-check fix * merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042 * minor - added GLM-4.6V to big tests - added missing deps for python test * convert: minor fix * mtmd: format code * convert: quick fix * convert: quick fix * minor python formatting * fixed merge build issue * merge resolved - fixed issues in convert - tested several deepseek models * minor fix * minor * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * - removed clip_is_deepseekocr - removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo - simplified image-preprocessing - removed/simplified debug functions * - cleaning commented out code * fixing instability issues reintroducing resize_bicubic_pillow * - use f16 model for deepseek-ocr test - ignore llama-arch test for deepseek-ocr * rename fc_w --> mm_fc_w * add links to OCR discussion * cleaner loading code * add missing .weight to some tensors * add default jinja template (to be used by server) * move test model to ggml-org * rolling back upscale change * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: bluebread <hotbread70127@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-03-25 19:57:40 +01:00
Aman Gupta	9c600bcd4b	llama-bench: print `-n-cpu-moe` when offloaded layers > 1 (#20984 )	2026-03-25 21:17:27 +08:00
Francisco Herrera	8fc17493c3	gguf-split : clarify operation of gguf-split (#19749 ) * clarify operation of gguf-split so that you don't have to find out by trial and error * formatting	2026-03-25 13:12:50 +02:00
Aleksander Grygier	69e0ecef06	webui: Fix editing assistant message without branching (#20944 ) * fix: Editing assistant response without branching * chore: update webui build output	2026-03-25 12:47:33 +02:00
Pascal	062cca58fc	Add SLEEPING status to the WebUI model selector (#20949 ) * webui: handle sleeping model status, fix favourite -> favorite * Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: fix optional event parameter in sleeping model onclick * typo * webui: restore orange sleeping indicator dot with hover unload * chore: update webui build output * webui: move stopPropagation into ActionIcon onclick, remove svelte-ignore * chore: update webui build output * webui: fix favourite -> favorite (UK -> US spelling) everywhere Address review feedback from WhyNotHugo * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-25 11:02:32 +01:00
BlueMöhre	a94fdb090a	WebUI: fix edit msg form textarea height (#20830 ) * autoresize textarea on mount * allow textarea to grow to same height as rendered messages * add UI build file	2026-03-24 13:17:45 +01:00
Adrien Gallouët	8c7957ca33	common : add standard Hugging Face cache support (#20775 ) * common : add standard Hugging Face cache support - Use HF API to find all files - Migrate all manifests to hugging face cache at startup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check with the quant tag Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Improve error handling and report API errors Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Restore common_cached_model_info and align mmproj filtering Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Prefer main when getting cached ref Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use cached files when HF API fails Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use final_path.. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check all inputs Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 07:30:33 +01:00
Aleksander Grygier	11fb11b901	webui: Improve chat form positioning (#20901 )	2026-03-23 14:30:55 +01:00
Eric Zhang	841bc203e2	docs : rerun llama-gen-docs to include new CLI args (#20892 )	2026-03-23 12:33:38 +01:00
Xuan-Son Nguyen	31a5cf4c3f	server: use httplib dynamic threads (#20817 ) * server: use httplib dynamic threads * change to n_threads_http + 1024	2026-03-23 12:22:46 +01:00
Pascal	c44a932cf4	webui: fix --webui-config-file settings not applied on load (#20823 ) * webui: fix --webui-config-file settings not applied on load * chore: update webui build output	2026-03-23 11:25:35 +01:00
bssrdf	ec2b787ebe	mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847 ) * added support for internvl's dynamic high-resolution (Qianfan-OCR needed) * add min/max dynamic patch to gguf meta * clean up * simplified handling min/max dynamic patch * reuse llava_uhd logic for slice images * provide default values for older models * flake8 * prevent writing 0 value to gguf * remove duplicated resolution candidates with a better algorithm * fix indentation * format * add protection from divide by zero * change to 0 to be safe --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-03-23 01:06:30 +01:00
DorianRudolph	d3ac030a5d	mtmd : fix LightOnOCR image preprocessing (#20877 )	2026-03-23 01:04:14 +01:00
Xuan-Son Nguyen	49bfddeca1	server: allow router to report child instances sleep status (#20849 ) * server: allow router to report child instances sleep status * refactor * move sleeping to state * nits	2026-03-22 18:33:52 +01:00
Evgeny Kurnevsky	81bc4d3ddc	server: fix Host header (#20843 ) It should include port when it's not default.	2026-03-22 22:29:22 +08:00
ddh0	3306dbaef7	misc : prefer ggml-org models in docs and examples (#20827 ) * misc : prefer ggml-org models in docs and examples Prefer referring to known-good quantizations under ggml-org rather than 3rd-party uploaders. * remove accidentally committed file	2026-03-21 22:00:26 +01:00
Sigbjørn Skjæret	29b28a9824	ci : switch from pyright to ty (#20826 ) * type fixes * switch to ty * tweak rules * tweak more rules * more tweaks * final tweak * use common import-not-found rule	2026-03-21 08:54:34 +01:00
Piotr Wilkin (ilintar)	b1c70e2e54	common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825 )	2026-03-21 00:19:04 +01:00
Xuan-Son Nguyen	fb78ad29bb	server: (doc) clarify in-scope and out-scope features (#20794 ) * server: (doc) clarify in-scope and out-scope features * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-20 14:03:50 +01:00
Georgi Gerganov	ab9d4c3678	server : improve mtmd ctx checkpoints (#20726 ) * server : improve mtmd ctx checkpoints * server : fix off-by-one in pos_min_thold	2026-03-20 11:13:12 +02:00
Ben Racicot	c1b911654a	server: fix router mode deadlock on child crash and TOCTOU race in models_max (#20763 ) Two bugs in `server_models::load()` that affect router mode reliability: Bug 1: Deadlock when child process crashes When a child process is killed (e.g., SIGKILL from OS code signature validation), the monitoring thread deadlocks on `stopping_thread.join()` because the stopping_thread's wait predicate (`is_stopping`) is never satisfied — the model name was never inserted into `stopping_models`. `update_status()` is never reached and the model stays stuck in LOADING state permanently. Fix: extend the stopping_thread's wait predicate to also wake when the child process is no longer alive (`!subprocess_alive()`). When woken by a dead child, the thread skips the shutdown sequence and returns immediately. The original `stopping_models.erase()` logic is preserved for normal unloads. Bug 2: TOCTOU race bypasses `--models-max` (ref #20137) `unload_lru()` is called outside the mutex, then `load()` acquires the lock afterward. Under concurrent requests, multiple threads observe capacity and all proceed to load, exceeding the limit. Fix: re-check capacity under the lock after `unload_lru()` returns. If another thread filled the slot in the window between `unload_lru()` and the lock acquisition, reject with an error instead of silently exceeding the limit.	2026-03-19 22:16:05 +01:00
Tomeamis	b739738dad	docs: Update server README to reflect PR #20297 (#20560 )	2026-03-19 21:28:44 +01:00
Ryan Goulden	26c9ce1288	server: Add cached_tokens info to oaicompat responses (#19361 ) * tests : fix fetch_server_test_models.py * server: to_json_oaicompat cached_tokens Adds OpenAI and Anthropic compatible information about the number of cached prompt tokens used in a response.	2026-03-19 19:09:33 +01:00
Piotr Wilkin (ilintar)	5e54d51b19	common/parser: add proper reasoning tag prefill reading (#20424 ) * Implement proper prefill extraction * Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp * Update tools/server/server-task.cpp * refactor: move grammars to variant, remove grammar_external, handle exception internally * Make code less C++y Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-19 16:58:21 +01:00
Pascal	4065c1a3a6	Server becomes the source of truth for sampling parameter defaults (#20558 ) * webui: make server the source of truth for sampling defaults * webui: fix Custom badge for sampling parameters * webui: log user overrides after server sync * chore: update webui build output * fix: Default values for sampling settings config object * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-19 13:20:39 +01:00
Xuan-Son Nguyen	1e64534570	mtmd: add clip_graph::build_mm() (#20751 ) * clip: add build_mm() * apply to all models * add TODO for bias overload	2026-03-19 13:11:39 +01:00
Pascal	cd708db0cc	WebUI: Persist the on/off state of the MCP servers for new conversations (#20750 ) * webui: add persistent storage for MCP server on/off state in new chats * webui: simplify MCP enabled checks, remove dead server.enabled fallback * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-19 12:54:06 +01:00
Aleksander Grygier	512bba6ee0	webui: Improve model parsing logic + add unit tests (#20749 ) * add tests for model id parser * add test case having activated params * add structured tests for model id parser * add ToDo * feat: Improve model parsing logic + tests * chore: update webui build output --------- Co-authored-by: bluemoehre <bluemoehre@gmx.de>	2026-03-19 12:25:50 +01:00
crsawyer	5744d7ec43	Rebuild index.html.gz (#20724 )	2026-03-18 18:49:57 +01:00
Julien Chaumond	48e61238e1	webui: improve tooltip wording for attachment requirements (#20688 ) * webui: improve tooltip wording for attachment requirements Co-Authored-By: Claude <Agents+claude@huggingface.co> * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Claude <Agents+claude@huggingface.co>	2026-03-18 14:01:02 +01:00
Aleksander Grygier	7ab321d40d	webui: Fix duplicated messages on q param (#20715 ) * fix: Remove duplicate message sending on `?q` param * chore: update webui build output	2026-03-18 10:32:43 +01:00
Piotr Wilkin (ilintar)	d2ecd2d1cf	common/parser: add `--skip-chat-parsing` to force a pure content parser. (#20289 ) * Add `--force-pure-content` to force a pure content parser. * Update common/arg.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Change parameter name [no ci] --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-17 16:16:43 +01:00
Georgi Gerganov	8cc2d81264	server : fix ctx checkpoint invalidation (#20671 )	2026-03-17 15:21:14 +02:00
Piotr Wilkin (ilintar)	2e4a6edd4a	tools/server: support refusal content for Responses API (#20285 ) * Support refusal content for Responses API * Update tools/server/server-common.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tools/server/server-common.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-17 01:42:04 +01:00
Pascal	dddca026bf	webui: add model information dialog to router mode (#20600 ) * webui: add model information dialog to router mode * webui: add "Available models" section header in model list * webui: remove nested scrollbar from chat template in model info dialog * chore: update webui build output * feat: UI improvements * refactor: Cleaner rendering + UI docs * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-16 15:38:11 +01:00
Aleksander Grygier	67a2209fab	webui: Add MCP CORS Proxy detection logic & UI (#20167 ) * refactor: MCP store cleanup * feat: Add MCP proxy availability detection * fix: Sidebar icon * chore: update webui build output * chore: Formatting * chore: update webui build output * chore: Update package lock * chore: update webui build output * chore: update webui build output * chore: update webui build output	2026-03-16 13:05:36 +01:00
Pascal	d65c4f2dc9	Fix model selector locked to first loaded model with multiple models (#20580 ) * webui: fix model selector being locked to first loaded model When multiple models are loaded, the auto-select effect would re-fire on every loadedModelIds change, overriding the user's manual model selection. Guard with selectedModelId so auto-select only kicks in when no model is chosen yet. * chore: update webui build output	2026-03-16 12:04:06 +01:00
Woof Dog	d8c331c0af	webui: use date in more human readable exported filename (#19939 ) * webui: use date in exported filename Move conversation naming and export to utils update index.html.gz * webui: move literals to message export constants file * webui: move export naming and download back to the conversation store * chore: update webui build output * webui: add comments to some constants * chore: update webui build output	2026-03-16 11:18:13 +01:00
Piotr Wilkin (ilintar)	9e2e2198b0	tools/cli: fix disable reasoning (#20606 )	2026-03-15 22:40:53 +01:00
Georgi Gerganov	88915cb55c	server : fix wait in test_cancel_requests() test (#20601 ) * server : fix wait in test_cancel_requests() test * codeowners : add team for server tests	2026-03-15 20:54:37 +02:00
Xuan-Son Nguyen	94d0262277	mtmd: add llama-mtmd-debug binary (#20508 ) * mtmd: add llama-mtmd-debug binary * adapt * fixes * fix compile error * fix windows compile error * rm legacy clip_debug_encode() * add MTMD_API to fix build	2026-03-14 15:52:29 +01:00
Chedrian07	710878a7dd	webui: restore code preview iframe origin isolation (#20477 )	2026-03-14 11:28:28 +01:00
Adrien Gallouët	463b6a963c	tools : enable kvu in perplexity for hellaswag, winogrande, multiple-choice (#19954 ) llama-perplexity -hf unsloth/Qwen3-0.6B-GGUF:Q4_K_M -f winogrande-debiased-eval.csv --winogrande winogrande_score : tokenizing selected tasks winogrande_score : calculating winogrande score over selected tasks. split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag) decode: failed to find a memory slot for batch of size 46 failed to decode the batch, n_batch = 2048, ret = 1 winogrande_score: llama_decode() failed same for hellaswag: split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag) decode: failed to find a memory slot for batch of size 99 failed to decode the batch, n_batch = 2048, ret = 1 hellaswag_score: llama_decode() failed Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-13 21:25:57 +01:00
ZeroV0LT	f17b3be63f	llama : fix pooling assertion crash in chunked GDN detection path (#20468 ) * llama : fix pooling assertion crash in chunked GDN detection path The chunked fused Gated Delta Net detection in sched_reserve() calls graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs. This creates a dimension mismatch in build_pooling() for embedding models with mean/rank pooling: build_inp_mean() creates a tensor with shape [n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...] via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b). Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations. Regression introduced by #20340 (`d28961d`). Same class of bug as #12517, fixed by #12545. * server : add mean pooling tests to embedding test suite Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple to cover the --pooling mean codepath, which was previously untested. These tests would have caught the regression introduced by #20340 where build_pooling() crashes with a ggml_mul_mat assertion due to mismatched dimensions in the chunked GDN detection path. --------- Co-authored-by: Domenico Crupi <domenico@zerovolt.it>	2026-03-13 20:53:42 +02:00
SoftwareRenderer	d7ba99c485	server: reset counter related to kill-switch on client error (#20513 ) * server: reset kill-switch on client error This avoids triggering a server kill switch. If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated. However since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates. * moved counter reset as per recommendation * cont : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-13 19:58:09 +02:00
Daniel Bevenius	8f974d2392	mtmd : rename mtmd_get_audio_bitrate to mtmd_get_audio_sample_rate (#20105 ) This commit renames the function `mtmd_get_audio_bitrate` to `mtmd_get_audio_sample_rate` to better reflect its purpose. The motivation for this is that the function currently returns the audio sample rate, not the bitrate (sample_rate × bit_depth × channels), and that is how it is used in the code as well. This is a breaking change, but I believe mtmd is still in experimental/development phase so it might be alright to simply rename.	2026-03-13 12:30:02 +01:00
Piotr Wilkin (ilintar)	0e810413bb	tests : use `reasoning` instead of `reasoning_budget` in server tests (#20432 )	2026-03-12 13:41:01 +01:00
Pascal	de190154c8	New conversations now auto-select the first loaded model (#20403 ) * webui: auto-select first loaded model for new conversations in router mode * chore: update webui build output	2026-03-12 09:07:05 +01:00
DAN™	fdb17643d3	model : add support for Phi4ForCausalLMV (#20168 ) * Add support for Phi4ForCausalLMV. * Fix Phi-4 vision parity (correcting SigLIP2 patch-kernel export layout) and matching HF NaFlex resize behavior in mtmd. * Rename constants + fix tokenizer label * Clean-ups. * Fix GGUF export. * Set tokenizer.ggml.pre explicitly. * Default vocab name rather than forcing it. * Clean-ups. * Fix indent. * Fix subscriptable error. * remove overcomplicated code path * Clean-ups. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-03-12 00:25:54 +01:00


639 Commits
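
Commit c1b911654a describes its second fix as re-checking capacity under the lock after `unload_lru()` returns, and rejecting the load if another thread has already taken the freed slot. The pattern can be illustrated with a minimal Python sketch. This is not the llama.cpp implementation: `ModelRegistry`, `run_concurrent_loads`, and the list-based LRU are hypothetical names and simplifications chosen for illustration, and only the "evict outside the critical section, then re-check under the lock" shape is taken from the commit message.

```python
import threading


class ModelRegistry:
    """Hypothetical stand-in for a router's table of loaded models."""

    def __init__(self, models_max):
        self.models_max = models_max
        self.loaded = []  # oldest first, so index 0 is the LRU entry
        self.lock = threading.Lock()

    def unload_lru(self):
        # Evict the least recently used model only when the table is full.
        with self.lock:
            if len(self.loaded) >= self.models_max:
                self.loaded.pop(0)

    def load(self, name):
        # Step 1: make room. The lock is released again before step 2,
        # so another thread may claim the freed slot in between.
        self.unload_lru()
        # Step 2: re-check capacity under the lock and reject if the slot is gone,
        # instead of silently exceeding the limit.
        with self.lock:
            if name in self.loaded:
                return True
            if len(self.loaded) >= self.models_max:
                return False
            self.loaded.append(name)
            return True


def run_concurrent_loads(models_max, n_threads):
    """Start n_threads loads at the same moment and return the registry and results."""
    registry = ModelRegistry(models_max)
    barrier = threading.Barrier(n_threads)
    results = [None] * n_threads

    def worker(i):
        barrier.wait()  # release all threads together to provoke the race
        results[i] = registry.load(f"model-{i}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return registry, results
```

Calling `run_concurrent_loads(2, 8)` should never leave more than two entries in `registry.loaded`, because the append happens only after the capacity check made while holding the lock. Loads that lose the race return `False` rather than pushing the table past its limit.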