Commit Graph

199 Commits

Author SHA1 Message Date
Ed Addario 09198c470b
Merge branch 'master' into quantize 2025-08-30 10:17:45 +01:00
Sergey Alirzaev d82f6aa34a
server : removed obsolete doc (#15670)
completing a4090d1174
2025-08-30 00:12:53 +02:00
ExtReMLapin 792b44f2ed
server : add documentation for `parallel_tool_calls` param (#15647)
Co-authored-by: Pierre F <no@p.e>
2025-08-29 20:25:40 +03:00
Ed Addario 556f6b04fe
Add --precise-lambda option 2025-08-28 16:08:08 +01:00
Sigbjørn Skjæret 84ab83cc0b
model : jina-embeddings-v3 support (#13693)
* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* fix vocab parsing with only tokenizer.json

* set mask token lstrip attribute

* additional unk_token_id fallback just in case [no ci]

* revert vocab_size() change [no ci]

* merge tensor loading into general bert

* rope

* add lora embedding and loading (non-functional)

* export separate lora ggufs instead

* add adapter metadata api

* use std::string

* convert_hf_to_lora compatibility

* fix assert

* apply suggestions from review

* apply suggestion from review
2025-08-28 15:49:50 +02:00
Joshua Cogliati d35a1e8c41
cli : change log to warning to explain reason for stopping (#15604)
* Change to warn instead of debug, to explain reason for stopping.

* Update tools/main/main.cpp

Fix printing --2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-28 10:48:20 +03:00
Johannes Gäßler fbef0fad7a
server: higher timeout for tests (#15621) 2025-08-27 20:58:09 +02:00
fidoriel 8ce3ff1d91
mtmd : fix mtmd ios build (#15579) 2025-08-26 20:05:50 +02:00
Georgi Gerganov b3964c1e89
metal : optimize FA vec for large sequences and BS <= 8 (#15566)
* metal : optimize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci
2025-08-26 14:22:14 +03:00
Xuan-Son Nguyen 79a546220c
mtmd : support Kimi VL model (#15458)
* convert : fix tensor naming conflict for llama 4 vision

* convert ok

* support kimi vision model

* clean up

* fix style

* fix calc number of output tokens

* refactor resize_position_embeddings

* add test case

* rename build fn

* correct a small bug
2025-08-26 12:54:19 +02:00
tc-mb c4e9239064
model : support MiniCPM-V 4.5 (#15575) 2025-08-26 10:05:55 +02:00
Georgi Gerganov 6b64f74b55
batched-bench : fix unified KV cache handling + pp timing (#15562)
* batched-bench : fix unified KV cache handling + pp timing

* cont : run dummy token only with split KV cache
2025-08-25 13:56:43 +03:00
Ed Addario ccaab24441
Merge branch 'master' into quantize 2025-08-24 20:47:53 +01:00
Ed Addario d4ac2106fb
Improve logging and some minor code refactoring 2025-08-24 13:39:10 +01:00
Georgi Gerganov 9ebebef62f
llama : remove KV cache defragmentation logic (#15473)
ggml-ci
2025-08-22 12:22:13 +03:00
65a 4afb0a746f
server : Support multimodal completion and embeddings prompts in JSON format (#15108)
- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types
- Add capability to model endpoint to indicate if client can send multimodal data
- Add tests.
2025-08-22 10:10:14 +02:00
Tarek Dakhran e288693669
readme : model : mtmd : lfm2 improvements (#15476)
* Support untied embeddings

* Increase number of image tokens to 1024

* Add LFM2-VL to readme

* Actually use untied embeddings
2025-08-22 09:29:08 +02:00
Ed Addario e6eefa68f1
Merge branch 'master' into quantize 2025-08-21 19:22:24 +01:00
Michael Giba b108e42904
ci : fix -Werror=return-type in clip.cpp so ci/run.sh can run without issue (#15221)
* Fix -Werror=return-type so ci/run.sh can run

* Update tools/mtmd/clip.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Remove false now that we have abort

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-08-21 12:06:46 +02:00
stduhpf 1b0db8f6e0
server : fix webui (#15462)
* Fix webui crash after streaming

* build webui
2025-08-21 08:19:22 +03:00
teo 1bc664a26a
server: fix OpenAI API compatibility for usage statistics in chat streams (#15444) 2025-08-21 00:10:08 +02:00
Ed Addario 69586e212e
Add F16/BF16 type 2025-08-20 13:23:11 +01:00
xiaobing318 1a99c2d948
cmake : fix target include directories (#15450)
* Update docker.yml

Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed

* feat: Modify the header file include path

1. There's no llava directory in the tools directory.
2. Because the command `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically include the `mtmd` directory as a search path for header files. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava)` or use `target_include_directories(${TARGET} PRIVATE ../mtmd)` to explicitly require the `llama-server` target to use header files from `mtmd`.

* Restore the docker.yml file
2025-08-20 13:32:05 +03:00
Ed Addario b33abae231
Merge branch 'master' into quantize 2025-08-19 23:39:07 +01:00
Georgi Gerganov d2fcd91cf9
server : disable context shift by default (#15416)
* server : disable context shift by default

ggml-ci

* server : make scope of test parameters local
2025-08-19 16:46:37 +03:00
Ed Addario 1b3d5b5744
Populate params 2025-08-19 10:56:02 +01:00
Ed Addario e877474458
Process target_bpw parameter 2025-08-19 10:54:02 +01:00
Ed Addario 0edbf0c176
Process activations 2025-08-19 10:51:58 +01:00
Ed Addario 77b818c040
Populate activations_data with imatrix activations if present 2025-08-19 10:50:37 +01:00
Ed Addario e6d55dc47b
Load activations 2025-08-19 10:49:01 +01:00
Ed Addario 5e85fb3ff3
Add parse_target_bpw() 2025-08-19 10:46:36 +01:00
Ed Addario cfec4048ab
Update usage 2025-08-19 10:43:51 +01:00
Georgi Gerganov f0d3c7405c
batched-bench : use rand tokens (#15398) 2025-08-19 08:45:12 +03:00
Xuan-Son Nguyen f08c4c0d8d
mtmd : clean up clip_n_output_tokens (#15391) 2025-08-18 22:53:52 +02:00
Sigbjørn Skjæret baa9255a45
llama : merge conts and reshapes and remove unnecessary cont (#15380)
* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes
2025-08-18 19:30:17 +02:00
davidef d1d8241600
server : fix incoming tasks not processed in order (#15395) 2025-08-18 17:51:42 +03:00
Oleksandr Kuvshynov e5155e6986
server : export max observed n_past value (#15361)
Add tracking for high watermark cache usage and make it available in /metrics endpoint.

Use-case: Tracking largest needed cache usage under realistic workload
to better understand memory requirements and be able to adjust
cache size/quantization for model/cache accordingly.
2025-08-18 00:28:58 +02:00
Tarek Dakhran 65349f26f2
model : support vision LiquidAI LFM2-VL family (#15347)
* wip lfm2 vision model

* Fix conv weight

* Implement dynamic resolution

* Fix cuda

* support LFM2-VL-450M

* happy CI

* Remove extra `ggml_conv` and put others into the right place

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-16 23:33:54 +02:00
Diego Devesa f75b830647
chat : include kwargs in template example (#15309) 2025-08-14 10:28:29 -07:00
Aldehir Rojas b204a5a234
gpt-oss: implement harmony parsing (#15181)
* model : add harmony parser for gpt-oss

* gpt-oss : fix grammar trigger from causing empty stack

* gpt-oss: tweak the grammar trigger again

* gpt-oss : add support for recipient in role header

* gpt-oss : fix ungrouped tool calls in grammar

* gpt-oss : loosen function name matching during parse

* gpt-oss : clean up workarounds

* gpt-oss : add template tests

* gpt-oss : simulate thinking and tool call tags

* gpt-oss : undo think tags when reasoning_format is none

* gpt-oss : set special tokens back to user defined

* gpt-oss : update openai-gpt-oss template

* server : filter out harmony thought messages

* gpt-oss : simplify parsing
2025-08-14 17:23:11 +03:00
Georgi Gerganov d32e03f449
server : add SWA checkpoints (#15293)
* server : add SWA checkpoints

ggml-ci

* cont : server clean-up

* server : handle state restore fails

* llama : add extended llama_state_seq_ API

* server : do not make checkpoints if --swa-full

ggml-ci

* llama : remove flags value for NONE

* server : configure number of SWA checkpoints with CLI arg

ggml-ci

* args : fix scope of new argument
2025-08-14 14:59:50 +03:00
kallewoof 3ea913f1ce
perplexity: give more information about constraints on failure (#15303)
* perplexity: give more information about constraints on failure

This checks whether -np is insufficient vs context, and provides clues as to how much is needed for each.

* log formatting

* log error and return instead of storing max_seq_exceeded int

* check if s0 is zero for -np check
2025-08-14 09:16:32 +03:00
Sigbjørn Skjæret b3e16665e1
server : enable -td and -tbd parameters (#15172) 2025-08-13 15:43:00 +02:00
Copilot d8914fc47e
common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191)
* Checkpoint from VS Code for coding agent session

* Initial plan

* Fix typo in --override-tensor-draft flag implementation

* Add null termination for speculative tensor buffer overrides

* Apply suggestions from code review

* Apply suggestions from code review

* Extract tensor override parsing logic to common function (addresses @slaren's feedback)

* Apply suggestions from code review

* Apply suggestions

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-08-13 12:44:40 +02:00
Aldehir Rojas e885445bc1
server : filter out harmony thought messages (#15278) 2025-08-13 12:28:21 +02:00
rainred cf9e5648a7
mtmd : Fix MinicpmV model converter and clip to avoid hardcoded values. (#14750)
* Fix MinicpmV model converter and clip to avoid hardcoded values.

* Code update for pr/14750

* Remove unused field, update script path in docs.

* Add version 5 for fallback code.

---------

Co-authored-by: lzhang <zhanglei@modelbest.cn>
2025-08-11 16:12:12 +02:00
Xuan-Son Nguyen 53d0a12658
server : allow specifying reasoning_format in HTTP request (#15238) 2025-08-11 14:48:41 +02:00
Daniel Bevenius 1ebbaddff2
perplexity : update comments/error msg to use decode [no ci] (#15227)
This commit updates comments and error messages to use "decode" instead
of "eval" in perplexity.cpp.

The motivation for this is that `llama_eval` was renamed to
`llama_decode` a while ago, but the comments and error messages
still referred to "eval". This change ensures consistency and clarity.
2025-08-11 11:21:24 +03:00
Daniel Bevenius 36d3f00e14
requirements : fix PyTorch uint64 compatibility (#15134)
This commit addresses an issue with the convert_hf_to_gguf script
which is currently failing with:
```console
AttributeError: module 'torch' has no attribute 'uint64'
```

This occurred because safetensors expects torch.uint64 to be available
in the public API, but PyTorch 2.2.x only provides limited support for
unsigned types beyond uint8 it seems. The torch.uint64 dtype exists but
is not exposed in the standard torch namespace
(see pytorch/pytorch#58734).

PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving
the compatibility issue with safetensors. This also required torchvision
to be updated to =0.19.0 for compatibility.

Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: https://github.com/pytorch/pytorch/issues/58734
2025-08-07 05:31:48 +02:00
Juk Armstrong 476aa3fd57
Fixed name `-override-tensors` to `-override-tensor` (#15129) 2025-08-06 17:28:48 +01:00