Commit Graph

608 Commits

Author SHA1 Message Date
David 58e88f0eb0
Merge c522288ab6 into 9e2e2198b0 2026-03-15 21:40:57 +00:00
Piotr Wilkin (ilintar) 9e2e2198b0
tools/cli: fix disable reasoning (#20606) 2026-03-15 22:40:53 +01:00
Georgi Gerganov 88915cb55c
server : fix wait in test_cancel_requests() test (#20601)
* server : fix wait in test_cancel_requests() test

* codeowners : add team for server tests
2026-03-15 20:54:37 +02:00
Xuan-Son Nguyen 94d0262277
mtmd: add llama-mtmd-debug binary (#20508)
* mtmd: add llama-mtmd-debug binary

* adapt

* fixes

* fix compile error

* fix windows compile error

* rm legacy clip_debug_encode()

* add MTMD_API to fix build
2026-03-14 15:52:29 +01:00
Chedrian07 710878a7dd
webui: restore code preview iframe origin isolation (#20477) 2026-03-14 11:28:28 +01:00
Adrien Gallouët 463b6a963c
tools : enable kvu in perplexity for hellaswag, winogrande, multiple-choice (#19954)
llama-perplexity -hf unsloth/Qwen3-0.6B-GGUF:Q4_K_M -f winogrande-debiased-eval.csv --winogrande

    winogrande_score : tokenizing selected tasks
    winogrande_score : calculating winogrande score over selected tasks.
    split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
    decode: failed to find a memory slot for batch of size 46
    failed to decode the batch, n_batch = 2048, ret = 1
    winogrande_score: llama_decode() failed

same for hellaswag:

    split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
    decode: failed to find a memory slot for batch of size 99
    failed to decode the batch, n_batch = 2048, ret = 1
    hellaswag_score: llama_decode() failed

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-13 21:25:57 +01:00
David Baker c522288ab6
Switch to storing the pointer in a std::optional<std::ofstream *> as part of the context class. 2026-03-13 19:22:02 +00:00
David Baker 80391cd613
Merge branch 'master' of https://github.com/ggerganov/llama.cpp into cli_output 2026-03-13 19:01:25 +00:00
ZeroV0LT f17b3be63f
llama : fix pooling assertion crash in chunked GDN detection path (#20468)
* llama : fix pooling assertion crash in chunked GDN detection path

The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.

* server : add mean pooling tests to embedding test suite

Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.

These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.

---------

Co-authored-by: Domenico Crupi <domenico@zerovolt.it>
2026-03-13 20:53:42 +02:00
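The dimension mismatch described in #20468 can be illustrated with a small shape check. This is only a sketch: `can_mul_mat` below is a simplified, hypothetical stand-in that compares just the shared first dimension, not the real `ggml_can_mul_mat`, and the values of `n_seqs` and `n_embd` are arbitrary.

```python
def can_mul_mat(a_shape, b_shape):
    # Simplified assumption: only the shared first dimension has to match.
    return a_shape[0] == b_shape[0]


def pooling_shapes(n_tokens, n_seqs, n_outputs, n_embd=8):
    # The mean-pooling input is built over all tokens of the batch.
    inp_mean = (n_tokens, n_seqs)
    # The embeddings are reduced to n_outputs rows via out_ids.
    t_embd = (n_outputs, n_embd)
    return t_embd, inp_mean


n_seqs = 4
n_tokens = 16 * n_seqs

# Before the fix: n_outputs = n_seqs, so the shared dimensions differ.
before = can_mul_mat(*pooling_shapes(n_tokens, n_seqs, n_outputs=n_seqs))

# After the fix: n_outputs = n_tokens, so the shared dimensions agree.
after = can_mul_mat(*pooling_shapes(n_tokens, n_seqs, n_outputs=n_tokens))
```

Under these assumptions `before` is false and `after` is true, which mirrors why the reservation asserted for mean/rank pooling and why passing n_tokens as n_outputs avoids it.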
SoftwareRenderer d7ba99c485
server: reset counter related to kill-switch on client error (#20513)
* server: reset kill-switch on client error

This avoids triggering a server kill switch.

If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated.

However, since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates.

* moved counter reset as per recommendation

* cont : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-13 19:58:09 +02:00
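The counter behaviour described in #20513 can be sketched as follows. The class and method names are hypothetical, the limit of 3 is taken from the commit message, and the exact place where the server resets the counter is an assumption here.

```python
class EmptyRunGuard:
    """Hypothetical stand-in for the consecutive-empty-iteration counter."""

    def __init__(self, limit=3):
        self.limit = limit
        self.n_empty_consecutive = 0

    def on_iteration(self, n_tokens_generated):
        """Record one loop iteration; return True if the kill switch would fire."""
        if n_tokens_generated > 0:
            self.n_empty_consecutive = 0
        else:
            self.n_empty_consecutive += 1
        return self.n_empty_consecutive >= self.limit

    def on_client_error(self):
        # The fix: a rejected request (e.g. HTTP 400) must not count as "stuck".
        self.n_empty_consecutive = 0


def three_rejected_requests(reset_on_error):
    # Simulate three oversized requests in a row; report whether the switch fired.
    guard = EmptyRunGuard()
    fired = False
    for _ in range(3):
        if reset_on_error:
            guard.on_client_error()
        fired = guard.on_iteration(0) or fired
    return fired
```

Without the reset, three rejected requests trip the switch; with it, the counter never climbs past one.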
Daniel Bevenius 8f974d2392
mtmd : rename mtmd_get_audio_bitrate to mtmd_get_audio_sample_rate (#20105)
This commit renames the function `mtmd_get_audio_bitrate` to
`mtmd_get_audio_sample_rate` to better reflect its purpose.

The motivation for this is that the function currently returns the audio
sample rate, not the bitrate (sample_rate × bit_depth × channels), and
that is how it is used in the code as well.

This is a breaking change, but I believe mtmd is still in
experimental/development phase so it might be alright to simply rename.
2026-03-13 12:30:02 +01:00
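The distinction drawn in #20105 is easy to check with arithmetic. A minimal sketch, assuming uncompressed PCM audio; the helper name and the 16 kHz / 16-bit / mono values are illustrative and not taken from mtmd.

```python
def pcm_bitrate(sample_rate_hz, bit_depth, channels):
    # Bitrate in bits per second, using the formula from the commit message:
    # sample_rate x bit_depth x channels.
    return sample_rate_hz * bit_depth * channels


sample_rate = 16_000  # what a "sample rate" getter would report
bitrate = pcm_bitrate(sample_rate, bit_depth=16, channels=1)
```

For these example values the sample rate is 16000 Hz while the bitrate is 256000 bits per second, so the two quantities are not interchangeable.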
Piotr Wilkin (ilintar) 0e810413bb
tests : use `reasoning` instead of `reasoning_budget` in server tests (#20432) 2026-03-12 13:41:01 +01:00
Pascal de190154c8
New conversations now auto-select the first loaded model (#20403)
* webui: auto-select first loaded model for new conversations in router mode

* chore: update webui build output
2026-03-12 09:07:05 +01:00
DAN™ fdb17643d3
model : add support for Phi4ForCausalLMV (#20168)
* Add support for Phi4ForCausalLMV.

* Fix Phi-4 vision parity (correcting SigLIP2 patch-kernel export layout) and matching HF NaFlex resize behavior in mtmd.

* Rename constants + fix tokenizer label

* Clean-ups.

* Fix GGUF export.

* Set tokenizer.ggml.pre explicitly.

* Default vocab name rather than forcing it.

* Clean-ups.

* Fix indent.

* Fix subscriptable error.

* remove overcomplicated code path

* Clean-ups.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-03-12 00:25:54 +01:00
David Baker f60b3cc4f0
Merge branch 'master' of https://github.com/ggerganov/llama.cpp into cli_output 2026-03-11 13:58:53 +00:00
David Baker c2baff9161
Refactor to use a common function to do file output, which both outputs to file and selects different outputs for special token and plain text cases 2026-03-11 13:53:31 +00:00
Piotr Wilkin (ilintar) acb7c79069
common/parser: handle reasoning budget (#20297)
* v1

* Finished!

* Handle cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-03-11 10:26:12 +01:00
Pascal 00de615345
Fix agentic mcp image single model (#20339)
* webui: fix MCP image attachments dropped during the agentic loop in single-model mode

* chore: update webui build output
2026-03-11 05:31:33 +01:00
Georgi Gerganov a7b3dee7a5
server : make 2 checkpoints near the end of the prompt (#20288)
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints
2026-03-10 14:28:23 +02:00
ddh0 1dab5f5a44
llama-quant : fail early on missing imatrix, refactor type selection, code cleanup (#19770)
* quantize : imatrix-fail early + code cleanup

* fix manual override printing

it's in the preliminary loop now, so needs to be on its own line

* revert header changes per ggerganov

* remove old #includes

* clarify naming

rename `tensor_quantization` to `tensor_typo_option` to describe its
functionality

* fix per barto
2026-03-10 08:16:05 +02:00
Evan Huus 23fbfcb1ad
server: Parse port numbers from MCP server URLs in CORS proxy (#20208)
* Parse port numbers from MCP server URLs

* Pass scheme to http proxy for determining whether to use SSL

* Fix download on non-standard port and re-add port to logging

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-03-09 17:47:54 +01:00
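The steps listed in #20208 (parse the port, pass the scheme along to decide on SSL) can be sketched with Python's standard URL parser. This is not the server's C++ code; `split_target` is a hypothetical helper, and falling back to ports 80 and 443 is an assumption based on the usual HTTP defaults.

```python
from urllib.parse import urlsplit


def split_target(url):
    """Return (host, port, use_ssl) for an MCP server URL."""
    parts = urlsplit(url)
    use_ssl = parts.scheme == "https"
    # Use the explicit port when present, otherwise the scheme's default.
    port = parts.port if parts.port is not None else (443 if use_ssl else 80)
    return parts.hostname, port, use_ssl


explicit = split_target("http://localhost:8123/mcp")
implicit = split_target("https://example.com/mcp")
```

Keeping the scheme next to the port matters because a non-standard port alone does not say whether the connection should use TLS.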
Georgi Gerganov 96cfc4992c
server : fix checkpoints n_tokens calculation (#20287) 2026-03-09 16:47:06 +02:00
Georgi Gerganov 344ee2a38a
server : warn swa-full is not supported for non-SWA models (#20291) 2026-03-09 16:44:25 +02:00
Georgi Gerganov d6e1556499
server : fix off-by-1 in server_tokens::size_up_to_pos() (#20279)
* server : fix off-by-1 in server_tokens::size_up_to_pos()

* cont : fix typo [no ci]
2026-03-09 16:43:38 +02:00
David Baker a0319dd06c
Merge branch 'master' of https://github.com/ggerganov/llama.cpp into cli_output 2026-03-09 10:41:09 +00:00
Georgi Gerganov 107d599952
server : add kill switch when server is stuck (#20277) 2026-03-09 10:33:12 +02:00
Aaron Teo ae87863dc1
llama-bench: introduce `-hf` and `-hff` flags & use `--mmap 1` by default (#20211) 2026-03-09 09:05:44 +08:00
Georgi Gerganov d417bc43dd
server : do not create checkpoints right after mtmd chunks (#20232) 2026-03-08 22:16:46 +02:00
Johannes Gäßler a976ff081b
llama: end-to-end tests (#19802)
* tests: add end-to-end tests per model architecture

* fixup for rebase

* fix use-after-free in llama-model-loader.cpp

* fix CI

* fix WebGPU

* fix CI

* disable CI for macOS-latest-cmake-arm64

* use expert_weights_scale only if != 0.0f

* comments
2026-03-08 12:30:21 +01:00
decahedron1 ff52ee964d
server : correct index on finish in OAI completion streams (#20226) 2026-03-08 10:08:57 +01:00
Piotr Wilkin (ilintar) 566059a26b
Autoparser - complete refactoring of parser architecture (#18675)
* Autoparser - full single commit squish

* Final pre-merge changes: minor fixes, Kimi 2.5 model parser
2026-03-06 21:01:00 +01:00
David Baker 96fc9a91ef
Added documentation line 2026-03-06 16:54:23 +00:00
Tom Vaucourt e68f2fb894
server : preserve anthropic thinking blocks in conversion (#20120)
* server : preserve anthropic thinking blocks in conversion (#20090)

* server : add tests for anthropic thinking block conversion

---------

Co-authored-by: root <root@llamacpp.home>
2026-03-06 17:41:12 +01:00
David Baker c494c70a06
Implement output flag on cli 2026-03-06 16:40:36 +00:00
Piotr Wilkin (ilintar) f5ddcd1696
Checkpoint every n tokens: squash (#20087) 2026-03-06 11:39:26 +01:00
Aleksander Grygier f6235a41ef
webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts (#18655) 2026-03-06 10:00:39 +01:00
Roj234 f7db3f3789
cli : Don't clear system prompt when using '/clear' (#20067)
* Enhance /clear command to include system prompt

Add system prompt to messages when clearing chat history.

* Use lambda
2026-03-06 06:41:11 +01:00
Sigbjørn Skjæret b5ed0e058c
cli : add command and file auto-completion (#19985) 2026-03-05 10:47:28 +01:00
Aleksander Grygier 5e335ba113
webui: Improvements for Models Selector UI (#20066) 2026-03-05 08:52:22 +01:00
Marcel Petrick 92f7da00b4
chore : correct typos [no ci] (#20041)
* fix(docs): correct typos found during code review

Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>

* Update docs/backend/CANN.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8"

This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-05 08:50:21 +01:00
Sigbjørn Skjæret d969e933e1
tools : add missing clocale include in mtmd-cli [no ci] (#20107) 2026-03-04 14:18:04 +01:00
SamareshSingh cb8f4fa3f8
Fix locale-dependent float printing in GGUF metadata (#17331)
* Set C locale for consistent float formatting across all binaries.

* Add C locale setting to all tools binaries

Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/
directory to ensure consistent floating-point formatting.

* Apply suggestion from @JohannesGaessler

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-04 09:30:40 +01:00
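The effect of pinning the numeric locale, as #17331 does with std::setlocale(LC_NUMERIC, "C"), can be sketched with Python's `locale` module. This is an analogue for illustration rather than the code added to the binaries, and the function name is hypothetical.

```python
import locale


def format_float_c_locale(value):
    # Pin numeric formatting to the "C" locale so the decimal separator is
    # always "." regardless of the user's environment.
    locale.setlocale(locale.LC_NUMERIC, "C")
    return locale.format_string("%f", value)


text = format_float_c_locale(1.5)
separator = locale.localeconv()["decimal_point"]
```

With the "C" locale in effect the decimal separator is ".", which is what keeps float values in metadata consistent across machines.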
standby24x7 54910bd4f3
completion : Fix a typo in warning message (#20082)
resuse -> reuse
2026-03-04 06:44:49 +01:00
Roj234 3e6ab244ad
server: Add pragma once to server-context.h (#19944) 2026-02-27 18:28:36 +01:00
Sami Kama 5596a35791
server: Mirroring /v1/responses to /responses to match /v1/chat/completions pattern (#19873) 2026-02-28 00:44:42 +08:00
Pascal 2e7e638523
server : support multiple model aliases via comma-separated --alias (#19926)
* server : support multiple model aliases via comma-separated --alias

* server : update --alias description and regenerate docs

* server : multiple model aliases and tags

- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields

* regenerate docs

* nits

* server : use first alias as model_name for backward compat

address review feedback from ngxson

* server : add single-model test for aliases and tags
2026-02-27 07:05:23 +01:00
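The alias handling described in #19926 can be sketched as below. The function names are hypothetical, skipping empty entries is an assumption, and the commit itself stores aliases in a std::set; this sketch additionally keeps insertion order so that "first alias" is well defined.

```python
def parse_aliases(arg):
    """Split a comma-separated --alias value into unique names, in order."""
    seen = set()
    aliases = []
    for name in arg.split(","):
        if not name or name in seen:
            continue  # assumption: ignore empty entries and duplicates
        seen.add(name)
        aliases.append(name)
    return aliases


def primary_model_name(aliases, fallback):
    # The first alias doubles as model_name for backward compatibility.
    return aliases[0] if aliases else fallback


aliases = parse_aliases("qwen,qwen-small,qwen")
name = primary_model_name(aliases, fallback="model.gguf")
```

A caller that only ever passed a single alias sees the same model name as before, which is the backward-compatibility point made in the commit.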
Georgi Gerganov 37964f44f9
mtmd : fix padding of n_tokens (#19930) 2026-02-26 18:39:49 +02:00
Georgi Gerganov 01cd448b8c
server : fix ctx checkpoint restore logic (#19924) 2026-02-26 18:20:16 +02:00
drrros efba35a860
server: fix load-on-startup not respected in ini file (#19897)
Co-authored-by: Roman Marchenko <r.marchenko@ideco.ru>
2026-02-26 12:32:31 +01:00
Maximilian Werk 66287bdaac
model : add Jina Embeddings v5 Nano (partial EuroBERT) support (#19826)
* WIP: Add EuroBERT support with autoformatting changes

This commit includes:
- EuroBERT model implementation for GGUF conversion
- C++ backend support for EuroBERT architecture
- Unintended autoformatting changes to Python files

Saving before reverting formatting-only changes.

* feat: add back eos assert when not last token pooling

* feat: removed duplicated code and cleanup

* feat: removed not working architectures and unnecessary check

* fix: typo

* fix: dynamic pooling config

* feat: added an example model for eurobert

* feat: proper llama-vocab implementation for jina-v5

* fix: removed unnecessary comments
2026-02-26 12:14:09 +01:00