Commit Graph

348 Commits

Author SHA1 Message Date
Diego Devesa 360d6533db
ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797)
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type

ggml-backend : add device id to device props

llama : only use iGPU devices if there are no GPU devices

llama : do not use multiple devices from different backends with the same device id
2025-09-11 22:47:38 +02:00
Daniel Bevenius 70cd37dbbe
requirements : update transformers/torch for Embedding Gemma (#15828)
* requirements : update transformers/torch for Embedding Gemma

This commit updates the requirements to support converting
Embedding Gemma 300m models.

The motivation for this change is that during development I had a local
copy of the transformers package which is what I used for converting
the models. This was a mistake on my part and I should have also updated
my transformers version to the official release.

I had checked the requirements/requirements-convert_legacy_llama.txt
file and noted that the version was >=4.45.1,<5.0.0 and came to the
conculusion that no updated would be needed, this assumed that
Embedding Gemma would be in a transformers release at the time
Commit fb15d649ed ("llama : add support
for EmbeddingGemma 300m (#15798)) was merged. So anyone wanting to
convert themselves would be able to do so. However, Embedding Gemma is
a preview release and this commit updates the requirements to use this
preview release.

* resolve additional python dependencies

* fix pyright errors in tokenizer test and remove unused import
2025-09-09 06:06:52 +02:00
Aldehir Rojas 7057faf64b
json : support `enum` values within `allOf` (#15830) 2025-09-08 16:14:32 -05:00
Xuan-Son Nguyen 56920f5665
server : bring back timings_per_token (#15879) 2025-09-08 16:50:05 +02:00
Georgi Gerganov a885dcff11
batched-bench : fix llama_synchronize usage during prompt processing (#15835)
ggml-ci
2025-09-08 10:27:07 +03:00
Xuan-Son Nguyen 3c3635d2f2
server : speed up tests (#15836)
* server : speed up tests

* clean up

* restore timeout_seconds in some places

* flake8

* explicit offline
2025-09-06 14:45:24 +02:00
Xuan-Son Nguyen 61bdfd5298
server : implement prompt processing progress report in stream mode (#15827)
* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-06 13:35:04 +02:00
Gabe Goodhart fd621880f3
aLoRA Support (#15327)
* feat: Add python-side constants and conversion for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add c++ side constants for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse invocation string for adapters from GGUF

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(python): Update conversion to alora_invocation_tokens

This is the preferred method in PEFT which is the source of ground truth

https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(cpp): Update to alora_invocation_tokens on c++ side

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add C APIs to get alora invocation token array from lora

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Initial implementation of alora cache logic in server

This does not yet do the part to identify the invocation tokens and only
apply the lora adapter afterwards, but it does seem to produce correct
results if the invocation tokens are the beginning of the uncached input.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Identify alora invocation sequences

This currently limits to a single enabled alora per slot. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling and is not really a well
studied case for alora since it's unclear if one alora can reuse cache from
previous prefill computed with a different alora.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Only reuse cache for tokens before the alora invocation start

This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Handle un-cached tokens that come before the alora activation

The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use || instead of 'or'

Too much python 🤦

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix off-by-one for limiting cached tokens to before alora start

This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Support backwards-compatibility for "invocation_string" in adapter_config.json

While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string," so this will enable backwards compatibility and
enable testing now (before PEFT PR changes have percolated everywhere).

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove duplicate logging

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-05 17:32:39 -06:00
Gabe Goodhart 5fac79cbc7
Thinking model disabled assistant prefill (#15404)
* feat: Set enable_thinking IFF not disabled and supported

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix inverted logic condition for prefill error

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Always parse the enable_thinking kwarg to overwrite the default value

From what I can tell, this started as a Qwen3-specific keyword, but from
the use in `chat.cpp` translates this inputs.enable_thinking to the right
thinking kwarg for the given model, this is now more of a standardized
kwarg, so it should always override the default value when sent as part of
the chat_template_kwargs field in the API.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Don't limit tempalte expansion check to jinja

With the use_jinja check, non-jinja models would enable thinking and always
fail assistant prefill

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add the error text to json type errors in json_value

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Explicitly reject string values for "enable_thinking"

There are too many possible "truthy" / "falsy" strings and too many
ambiguous strings that don't have a clear truthy/falsy value, so the
simplest thing to do here is to reject the request. Ideally, this would be
a 422 (Unprocessable Entity), but right now it's coming back as a 500.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Move logic for detecting template enable_thinking support to common

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use raw pointer for common chat template function

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-09-05 14:31:24 -06:00
Xuan-Son Nguyen a68d914426
server: add exceed_context_size_error type (#15780)
* server: add exceed_context_size_error type

* change error code to 400
2025-09-04 11:50:23 +02:00
Georgi Gerganov e92d53b29e
sampling : optimize samplers by reusing bucket sort (#15665)
* sampling : optimize sorting using bucket sort in more places

ggml-ci

* sampling : do not sort in dist sampler

ggml-ci

* sampling : avoid heap allocations for sort buffers

ggml-ci

* common : add option to sort sampling candidates by probability

ggml-ci

* sampling : revert the change for preserving sort buffers

* sampling : use std::copy instead of memcpy

* sampling : clarify purpose of partial sort helpers

ggml-ci

* cont : remove wrong comment [no ci]

* common : update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-31 20:41:02 +03:00
Georgi Gerganov 0d161f021a
server : enable /slots by default and make it secure (#15630)
* server : enable /slots by default and make it secure

ggml-ci

* server : fix tests to pass `--no-slots` when necessary

* server : extend /props with info about enabled endpoints
2025-08-31 20:11:58 +03:00
Johannes Gäßler e81b8e4b7f
llama: use FA + max. GPU layers by default (#15434)
* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault
2025-08-30 16:32:10 +02:00
Sergey Alirzaev d82f6aa34a
server : removed obsolete doc (#15670)
completing a4090d1174
2025-08-30 00:12:53 +02:00
ExtReMLapin 792b44f2ed
server : add documentation for `parallel_tool_calls` param (#15647)
Co-authored-by: Pierre F <no@p.e>
2025-08-29 20:25:40 +03:00
Sigbjørn Skjæret 84ab83cc0b
model : jina-embeddings-v3 support (#13693)
* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* fix vocab parsing with only tokenizer.json

* set mask token lstrip attribute

* additional unk_token_id fallback just in case [no ci]

* revert vocab_size() change [no ci]

* merge tensor loading into general bert

* rope

* add lora embedding and loading (non-functional)

* export separate lora ggufs instead

* add adapter metadata api

* use std::string

* convert_hf_to_lora compatibility

* fix assert

* apply suggestions from review

* apply suggestion from review
2025-08-28 15:49:50 +02:00
Joshua Cogliati d35a1e8c41
cli : change log to warning to explain reason for stopping (#15604)
* Change to warn instead of debug, to explain reason for stopping.

* Update tools/main/main.cpp

Fix printing --2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-28 10:48:20 +03:00
Johannes Gäßler fbef0fad7a
server: higher timeout for tests (#15621) 2025-08-27 20:58:09 +02:00
fidoriel 8ce3ff1d91
mtmd : fix mtmd ios build (#15579) 2025-08-26 20:05:50 +02:00
Georgi Gerganov b3964c1e89
metal : optimize FA vec for large sequences and BS <= 8 (#15566)
* metal : optmize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci
2025-08-26 14:22:14 +03:00
Xuan-Son Nguyen 79a546220c
mtmd : support Kimi VL model (#15458)
* convert : fix tensor naming conflict for llama 4 vision

* convert ok

* support kimi vision model

* clean up

* fix style

* fix calc number of output tokens

* refactor resize_position_embeddings

* add test case

* rename build fn

* correct a small bug
2025-08-26 12:54:19 +02:00
tc-mb c4e9239064
model : support MiniCPM-V 4.5 (#15575) 2025-08-26 10:05:55 +02:00
Georgi Gerganov 6b64f74b55
batched-bench : fix unified KV cache handling + pp timing (#15562)
* batched-bench : fix unified KV cache handling + pp timing

* cont : run dummy token only with split KV cache
2025-08-25 13:56:43 +03:00
Georgi Gerganov 9ebebef62f
llama : remove KV cache defragmentation logic (#15473)
ggml-ci
2025-08-22 12:22:13 +03:00
65a 4afb0a746f
server : Support multimodal completion and embeddings prompts in JSON format (#15108)
- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types
- Add capability to model endpoint to indicate if client can send multimodal data
- Add tests.
2025-08-22 10:10:14 +02:00
Tarek Dakhran e288693669
readme : model : mtdm : lfm2 improvements (#15476)
* Support untied embeddings

* Increase number of image tokens to 1024

* Add LFM2-VL to readme

* Actually use untied embeddings
2025-08-22 09:29:08 +02:00
Michael Giba b108e42904
ci : fix -Werror=return-type in clip.cpp so ci/run.sh can run without issue (#15221)
* Fix -Werror=return-type so ci/run.sh can run

* Update tools/mtmd/clip.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Remove false now that we have abort

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-08-21 12:06:46 +02:00
stduhpf 1b0db8f6e0
server : fix webui (#15462)
* Fix webui crash after streaming

* build webui
2025-08-21 08:19:22 +03:00
teo 1bc664a26a
server: fix OpenAI API compatibility for usage statistics in chat streams (#15444) 2025-08-21 00:10:08 +02:00
xiaobing318 1a99c2d948
cmake : fix target include directories (#15450)
* Update docker.yml

修改docker.yml文件中的内容使其停止周期性的运行该workflow,如果想要运行该workflow可以手动启动

* feat:Modify the header file include path

1. There's no llava directory in the tools directory.
2. Because the command `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically include the `mtmd` directory as a search path for header files. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava`` or use `target_include_directories(${TARGET} PRIVATE ../mtmd`` to explicitly require the `llama-server` target to use header files from `mtmd`.

* Restore the docker.yml file
2025-08-20 13:32:05 +03:00
Georgi Gerganov d2fcd91cf9
server : disable context shift by default (#15416)
* server : disable context shift by default

ggml-ci

* server : make scopr of test parameters local
2025-08-19 16:46:37 +03:00
Georgi Gerganov f0d3c7405c
batched-bench : use rand tokens (#15398) 2025-08-19 08:45:12 +03:00
Xuan-Son Nguyen f08c4c0d8d
mtmd : clean up clip_n_output_tokens (#15391) 2025-08-18 22:53:52 +02:00
Sigbjørn Skjæret baa9255a45
llama : merge conts and reshapes and remove unnecessary cont (#15380)
* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes
2025-08-18 19:30:17 +02:00
davidef d1d8241600
server : fix incoming tasks not process in order (#15395) 2025-08-18 17:51:42 +03:00
Oleksandr Kuvshynov e5155e6986
server : export max observed n_past value (#15361)
Add tracking for high watermark cache usage and make it available in /metrics endpoint.

Use-case: Tracking largest needed cache usage under realistic workload
to better understand memory requirements and be able to adjust
cache size/quantization for model/cache accordingly.
2025-08-18 00:28:58 +02:00
Tarek Dakhran 65349f26f2
model : support vision LiquidAI LFM2-VL family (#15347)
* wip lfm2 vision model

* Fix conv weight

* Implement dynamic resolution

* Fix cuda

* support LFM2-VL-450M

* happy CI

* Remove extra `ggml_conv` and put others into the right place

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-16 23:33:54 +02:00
Diego Devesa f75b830647
chat : include kwargs in template example (#15309) 2025-08-14 10:28:29 -07:00
Aldehir Rojas b204a5a234
gpt-oss: implement harmony parsing (#15181)
* model : add harmony parser for gpt-oss

* gpt-oss : fix grammar trigger from causing empty stack

* gpt-oss: tweak the grammar trigger again

* gpt-oss : add support for recipient in role header

* gpt-oss : fix ungrouped tool calls in grammar

* gpt-oss : loosen function name matching during parse

* gpt-oss : clean up workarounds

* gpt-oss : add template tests

* gpt-oss : simulate thinking and tool call tags

* gpt-oss : undo think tags when reasoning_format is none

* gpt-oss : set special tokens back to user defined

* gpt-oss : update openai-gpt-oss template

* server : filter out harmony thought messages

* gpt-oss : simplify parsing
2025-08-14 17:23:11 +03:00
Georgi Gerganov d32e03f449
server : add SWA checkpoints (#15293)
* server : add SWA checkpoints

ggml-ci

* cont : server clean-up

* server : handle state restore fails

* llama : add extended llama_state_seq_ API

* server : do not make checkpoints if --swa-full

ggml-ci

* llama : remove flags value for NONE

* server : configure number of SWA checkpoints with CLI arg

ggml-ci

* args : fix scope of new argument
2025-08-14 14:59:50 +03:00
kallewoof 3ea913f1ce
perplexity: give more information about constraints on failure (#15303)
* perplexity: give more information about constraints on failure

This checks whether -np is insufficient vs context, and provides clues as to how much is needed for each.

* log formatting

* log error and return instead of storing max_seq_exceeded int

* check if s0 is zero for -np check
2025-08-14 09:16:32 +03:00
Sigbjørn Skjæret b3e16665e1
server : enable -td and -tbd parameters (#15172) 2025-08-13 15:43:00 +02:00
Copilot d8914fc47e
common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191)
* Checkpoint from VS Code for coding agent session

* Initial plan

* Fix typo in --override-tensor-draft flag implementation

* Add null termination for speculative tensor buffer overrides

* Apply suggestions from code review

* Apply suggestions from code review

* Extract tensor override parsing logic to common function (addresses @slaren's feedback)

* Apply suggestions from code review

* Apply suggestions

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-08-13 12:44:40 +02:00
Aldehir Rojas e885445bc1
server : filter out harmony thought messages (#15278) 2025-08-13 12:28:21 +02:00
rainred cf9e5648a7
mtmd : Fix MinicpmV model converter and clip to avoid using hardcode. (#14750)
* Fix MinicpmV model converter and clip to avoid using hardcode.

* Code update for pr/14750

* Remove unused field, update script path in docs.

* Add version 5 for fallback code.

---------

Co-authored-by: lzhang <zhanglei@modelbest.cn>
2025-08-11 16:12:12 +02:00
Xuan-Son Nguyen 53d0a12658
server : allow specifying reasoning_format in HTTP request (#15238) 2025-08-11 14:48:41 +02:00
Daniel Bevenius 1ebbaddff2
perplexity : update comments/error msg to use decode [no ci] (#15227)
This commit updates comments and error messages to use "decode" instead
of "eval" in perplexity.cpp.

The motivation for this is that `llama_eval` was renamed to
`llama_decode` a while ago, but the comments and error messages
still referred to "eval". This change ensures consistency and clarity.
2025-08-11 11:21:24 +03:00
Daniel Bevenius 36d3f00e14
requirements : fix PyTorch uint64 compatibility (#15134)
This commit addresses an issue with the convert_hf_to_gguf script
which is currently failing with:
```console
AttributeError: module 'torch' has no attribute 'uint64'
```

This occurred because safetensors expects torch.uint64 to be available
in the public API, but PyTorch 2.2.x only provides limited support for
unsigned types beyond uint8 it seems. The torch.uint64 dtype exists but
is not exposed in the standard torch namespace
(see pytorch/pytorch#58734).

PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving
the compatibility issue with safetensors. This also required torchvision
to updated to =0.19.0 for compatibility.

Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: https://github.com/pytorch/pytorch/issues/58734
2025-08-07 05:31:48 +02:00
Juk Armstrong 476aa3fd57
Fixed name `-override-tensors` to `-override-tensor` (#15129) 2025-08-06 17:28:48 +01:00
Georgi Gerganov fd1234cb46
llama : add gpt-oss (#15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
Alex Wu 22f060c9c4
webui: fix markdown table (#15081)
* webui: fix markdown table

* webui: fix table display with themes
2025-08-05 13:56:44 +02:00
compilade 19f68fa5a4
imatrix : warn when GGUF imatrix is saved without .gguf suffix (#15076)
* imatrix : add warning when suffix is not .gguf for GGUF imatrix

* imatrix : only warn about suffix when output format is unspecified
2025-08-04 23:26:52 +02:00
Sigbjørn Skjæret 2721257e3e
quantize : fix confusing error message if ftype is invalid (#15071) 2025-08-04 18:11:02 +02:00
compilade d31192b4ee
imatrix : use GGUF by default (#14842)
* imatrix : use GGUF by default

* imatrix : use GGUF regardless of the output filename

The legacy format can only be produced with --output-format dat
2025-08-03 22:00:05 +02:00
compilade 0a2f5496be
imatrix : fix 3d activation handling for hybrid and recurrent models (#14994)
* imatrix : use a single count for dense 3d tensors

* imatrix : fix 3d activations when model tensor is 2d

* imatrix : fix 3d tensor counts
2025-08-03 21:49:13 +02:00
R0CKSTAR 3025b621d1
llama-bench: rename DB table name from test to llama_bench (#15003)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-02 17:20:40 +08:00
Johannes Gäßler f906275537
server: enable token array inputs for OAI API (#15001) 2025-08-02 10:12:41 +02:00
tc-mb 952a47f455
mtmd : support MiniCPM-V 4.0 (#14983)
* support minicpm-v 4

* add md

* support MiniCPM-o 4.0

* add default location

* temp rm MiniCPM-o 4.0

* fix code

* fix "minicpmv_projector" default path
2025-07-31 17:22:17 +02:00
g2mt 94933c8c2e
server : implement universal assisted decoding (#12635)
* llama-server : implement universal assisted decoding

* Erase prompt tail for kv-cache

* set vocab_dft_compatible in common_speculative

* rename ctx_main to ctx_tgt

* move vocab_dft_compatible to spec struct

* clear mem_dft, remove mem

* detokenize id_last for incompatible models

* update comment

* add --spec-replace flag

* accept special tokens when translating between draft/main models

* Escape spec-replace

* clamp draft result to size to params.n_draft

* fix comment

* clean up code

* restore old example

* log common_speculative_are_compatible in speculative example

* fix

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-31 14:25:23 +02:00
Lukas Straub a9f77a8be3
server : add openai-style logit_bias support (#14946)
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2025-07-31 14:08:23 +02:00
Ed Addario e9192bec56
quantize : fix using combined imatrix GGUFs (multiple datasets) (#14973) 2025-07-30 21:11:56 +02:00
Daniel Bevenius 41e78c567e
server : add support for `embd_normalize` parameter (#14964)
This commit adds support for the `embd_normalize` parameter in the
server code.

The motivation for this is that currently if the server is started with
a pooling type that is not `none`, then Euclidean/L2 normalization will
be the normalization method used for embeddings. However, this is not
always the desired behavior, and users may want to use other
normalization (or none) and this commit allows that.

Example usage:
```console
curl --request POST \
    --url http://localhost:8080/embedding \
    --header "Content-Type: application/json" \
    --data '{"input": "Hello world today", "embd_normalize": -1}
```
2025-07-30 18:07:11 +02:00
Radoslav Gerganov c556418b60
llama-bench : use local GPUs along with RPC servers (#14917)
Currently if RPC servers are specified with '--rpc' and there is a local
GPU available (e.g. CUDA), the benchmark will be performed only on the
RPC device(s) but the backend result column will say "CUDA,RPC" which is
incorrect. This patch is adding all local GPU devices and makes
llama-bench consistent with llama-cli.
2025-07-28 18:59:04 +03:00
Xuan-Son Nguyen 00fa15fedc
mtmd : add support for Voxtral (#14862)
* mtmd : add support for Voxtral

* clean up

* fix python requirements

* add [BEGIN_AUDIO] token

* also support Devstral conversion

* add docs and tests

* fix regression for ultravox

* minor coding style improvement

* correct project activation fn

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-28 15:01:48 +02:00
Ed Addario 7f97599581
quantize : update README.md (#14905)
* Update README.md

* Fix trailing whitespace

* Update README.md

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-27 23:31:11 +02:00
kiwi 749e0d27f0
mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (#14503)
* [fix] Fix 32-bit narrowing issue in export-lora and mtmd clip

* Update export-lora.cpp

* Update clip.cpp

* Update export-lora.cpp

* format: use space to replace tab
2025-07-25 13:08:04 +02:00
Ed Addario d1aa0cc5d1
imatrix: add option to display importance score statistics for a given imatrix file (#12718)
* Add --show-statistics option

* Add --show-statistics logic

* Add tensor name parsing

* Tidy output format

* Fix typo in title

* Improve tensor influence ranking

* Add better statistics

* Change statistics' sort order

* Add Cosine Similarity

* Add header search path

* Change header search path to private

* Add weighted statistics per layer

* Update report title

* Refactor compute_statistics out of main

* Refactor compute_cossim out of load_imatrix

* Refactor compute_statistics out of load_imatrix

* Move imatrix statistics calculation into its own functions

* Add checks and validations

* Remove unnecessary include directory

* Rename labels

* Add m_stats getter and refactor compute_statistics out of load_imatrix

* Refactor variable names

* Minor cosmetic change

* Retrigger checks (empty commit)

* Rerun checks (empty commit)

* Fix unnecessary type promotion

Co-authored-by: compilade <git@compilade.net>

* Reverting change to improve code readability

* Rerun checks (empty commit)

* Rerun checks (empty commit)

* Rerun checks - third time's the Charm 🤞 (empty commit)

* Minor cosmetic change

* Update README

* Fix typo

* Update README

* Rerun checks (empty commit)

* Re-implement changes on top of #9400

* Update README.md

* Update README

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Update README.md

* Remove duplicate option in print_usage()

* Update README.md

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Remove input check

* Remove commented out code

---------

Co-authored-by: compilade <git@compilade.net>
2025-07-22 14:33:37 +02:00
stduhpf c8ade30036
Mtmd: add a way to select device for vision encoder (#14236)
* Mtmd: add a way to select device for vision encoder

* simplify

* format

* Warn user if manual device selection failed

* initialize backend to nullptr
2025-07-22 12:51:03 +02:00
Molly Sophia adef81781a
server : allow setting `--reverse-prompt` arg (#14799)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-22 09:24:22 +08:00
Molly Sophia c82d48ec23
llama : fix `--reverse-prompt` crashing issue (#14794)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-21 17:38:36 +08:00
IsaacDynamo b4efd77f8a
server : add parse_special option to /tokenize endpoint (#14783) 2025-07-21 10:24:51 +03:00
compilade 90083283ec
imatrix : use GGUF to store importance matrices (#9400)
* imatrix : allow processing multiple chunks per batch

* perplexity : simplify filling the batch

* imatrix : fix segfault when using a single chunk per batch

* imatrix : use GGUF to store imatrix data

* imatrix : fix conversion problems

* imatrix : use FMA and sort tensor names

* py : add requirements for legacy imatrix convert script

* perplexity : revert changes

* py : include imatrix converter requirements in toplevel requirements

* imatrix : avoid using designated initializers in C++

* imatrix : remove unused n_entries

* imatrix : allow loading mis-ordered tensors

Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* quantize : use unused imatrix chunk_size with LLAMA_TRACE

* common : use GGUF for imatrix output by default

* imatrix : two-way conversion between old format and GGUF

* convert : remove imatrix to gguf python script

* imatrix : use the function name in more error messages

* imatrix : don't use FMA explicitly

This should make comparisons between the formats easier
because this matches the behavior of the previous version.

* imatrix : avoid returning from void function save_imatrix

* imatrix : support 3d tensors with MUL_MAT

* quantize : fix dataset name loading from gguf imatrix

* common : move string_remove_suffix from quantize and imatrix

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* imatrix : add warning when legacy format is written

* imatrix : warn when writing partial data, to help guess dataset coverage

Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.

* imatrix : avoid loading model to convert or combine imatrix

* imatrix : avoid using imatrix.dat in README

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-19 12:51:22 -04:00
Georgi Gerganov 225e7a1438
llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Georgi Gerganov 6ffd4e9c44
server : pre-calculate EOG logit biases (#14721)
ggml-ci
2025-07-16 14:04:12 +03:00
Georgi Gerganov 538cc77f7f
server : fix handling of the ignore_eos flag (#14710)
ggml-ci
2025-07-16 12:13:57 +03:00
Johannes Gäßler 5cae766541
scripts: synthetic prompt mode for server-bench.py (#14695) 2025-07-16 09:33:28 +02:00
Johannes Gäßler 494c5899cb
scripts: benchmark for HTTP server throughput (#14668)
* scripts: benchmark for HTTP server throughput

* fix server connection reset
2025-07-14 13:14:30 +02:00
Douglas Hanley 0c1df14b5f
server : fix pooled embedding output (#14645) 2025-07-12 13:21:02 +03:00
Eric Zhang a457551332
cmake : do not search for curl libraries by ourselves (#14613)
* cmake : do not search for curl libraries by ourselves

* run : do not search for curl libraries by ourselves
2025-07-10 15:29:05 +03:00
Alawode Oluwandabira 17a1f0d2d4
server: Add ability to mount server at prefix (#14544)
* Add server_prefix

* Correct server path env

* Rename cli flag to --api-prefix

* Change all to api_prefix
2025-07-08 11:47:33 +03:00
Sigbjørn Skjæret ddef99522d
server : fix assistant prefilling when content is an array (#14360) 2025-07-05 09:17:14 +02:00
Sigbjørn Skjæret 28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445) 2025-07-03 23:07:22 +02:00
Vedran Miletić e9b6350e61
scripts : make the shell scripts cross-platform (#14341) 2025-06-30 10:17:18 +02:00
matteo caf5681fcb
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
* initial commit for handling extra template kwargs

* enable_thinking and assistant prefill cannot be enabled at the same time

* can set chat_template_kwargs in command line

* added doc

* fixed formatting

* add support for extra context in generic template init

* coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* coding standard:  common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestions from code review

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix merge conflict

* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)

* normalize environment variable name

* simplify code

* prefill cannot be used with thinking models

* compatibility with the new reasoning-budget parameter

* fix prefill for non thinking models

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>
2025-06-29 20:02:53 +02:00
Renat 83790b0e7e
server : fix appearance of the chats list context menu for Safari (#14322) 2025-06-29 19:29:57 +02:00
Nigel Bosch 1b809cee22
server : move no API key doc to /health (#14352) 2025-06-24 10:59:11 +02:00
Sigbjørn Skjæret abf241045d
main : honor --verbose-prompt on interactive prompts (#14350) 2025-06-24 09:31:00 +02:00
Molly Sophia 72c6bc3f3d
llama : better rwkv chat template and add missing `inputs.use_jinja` setting (#14336)
* llama-cli : add missing `inputs.use_jinja` setting

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama : better legacy chat template for rwkv

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-06-23 19:56:19 +08:00
Georgi Gerganov 7b50d589a8
kv-cells : fix tracking of seq_pos (#14339)
* kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

* cont : improve error message

ggml-ci

* cont : add more comments
2025-06-23 12:27:35 +03:00
Ed Addario fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) (#13037) 2025-06-22 23:16:26 +02:00
Ruikai Peng 66aba7aca9
run : avoid double tokenization (#14327)
* run : avoid double tokenization by adopting common_tokenize heuristic

* build : fix windows gcc and clang warnings

* lint : fixed trailing whitepace

* run : fix is_first flag
2025-06-23 01:28:06 +08:00
Georgi Gerganov f1f5e82df6
examples : fix is_first logic for tokenization (#14329)
ggml-ci
2025-06-22 20:10:07 +03:00
yuiseki 5d5c066de8
mtmd : fix Pixtral OOM with large images by capping image_size to 1024 (#14326)
Mistral Small 2506 models using Pixtral vision encoder were running out
of GPU memory when processing images larger than 1024x1024 pixels due to
exponential memory growth from unlimited image size.

This fix applies the same 1024x1024 limit used by Qwen2VL models to
prevent OOM issues while maintaining compatibility with existing models.
2025-06-22 14:44:57 +02:00
Sigbjørn Skjæret 88fc854b4b
llama : improve sep token handling (#14272) 2025-06-20 14:04:09 +02:00
Georgi Gerganov 4c9fdfbe15
ubatch : new splitting logic (#14217)
ggml-ci
2025-06-20 10:14:14 +03:00
aa956 d67341dc18
server : add server parameters for draft model cache type (#13782)
Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>
2025-06-19 16:01:03 +03:00
bashayer hijji fffcce535e
llama-bench : add --no-warmup flag (#14224) (#14270)
Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking.

- Add no_warmup boolean field to cmd_params struct

- Add --no-warmup command-line argument parsing

- Add help text documentation for the new flag

- Wrap existing warmup logic in conditional check

- Maintain full backward compatibility (warmup enabled by default)

Addresses #14224
2025-06-19 12:24:12 +02:00
Xuan-Son Nguyen 413977de32
mtmd : refactor llava-uhd preprocessing logic (#14247)
* mtmd : refactor llava-uhd preprocessing logic

* fix editorconfig
2025-06-18 10:43:57 +02:00
Georgi Gerganov 89fea80d29
server : fix incorrect usage of llama_get_embeddings() (#14225)
* server : fix incorrect usage of llama_get_embeddings()

ggml-ci

* cont : fix the fix

ggml-ci
2025-06-16 22:33:27 +03:00
Georgi Gerganov d3e64b9f49
llama : rework embeddings logic (#14208)
* llama : rework embeddings logic

ggml-ci

* cont : fix rerank

ggml-ci

* cont : engrish [no ci]

* cont : fix rerank

ggml-ci

* server : support both embeddings and completions with single model

ggml-ci

* cont : avoid embeddings_org

ggml-ci
2025-06-16 14:14:00 +03:00
Eric Curtin cd355eda7d
server : When listening on a unix domain socket don't print http:// and port (#14180)
Instead show something like this:

main: server is listening on file.sock - starting the main loop

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-06-15 23:36:22 +02:00
Georgi Gerganov ffad043973
server : fix SWA condition for full context reprocess (#14163)
ggml-ci
2025-06-13 11:18:25 +03:00
Georgi Gerganov 7d516443dd
server : re-enable SWA speculative decoding (#14131)
ggml-ci
2025-06-12 11:51:38 +03:00
Aman 7781e5fe99
webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
* webui: Wrap long numbers instead of infinite horizontal scroll

* Use tailwind class

* update index.html.gz
2025-06-11 16:42:25 +02:00
Taylor 2baf07727f
server : pass default --keep argument (#14120) 2025-06-11 13:43:43 +03:00
Juk Armstrong 3a12db23b6
Fixed spec timings to: accepted/tested instead of accepted/drafted (#14104) 2025-06-10 16:48:07 +01:00
R0CKSTAR dc0623fddb
webui: fix sidebar being covered by main content (#14082)
* webui: fix sidebar being covered by main content

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* webui: update index.html.gz

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-06-09 12:01:17 +02:00
Georgi Gerganov 87d34b381d
server : fix LRU check (#14079)
ggml-ci
2025-06-09 12:57:58 +03:00
Georgi Gerganov 745aa5319b
llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
2025-06-06 14:11:15 +03:00
Georgi Gerganov 3637576288
server : disable speculative decoding for SWA models (#13970)
* server : use swa-full fo draft context

ggml-ci

* server : disable speculative decoding for SWA models
2025-06-02 21:34:40 +03:00
Olivier Chafik c9bbc77931
`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
2025-06-02 10:15:44 -07:00
Xuan-Son Nguyen bfd322796c
mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory in mtmd_helper_eval_chunk_single

* mtmd-cli : fix mem leak

* Update tools/mtmd/mtmd-cli.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-02 16:29:28 +02:00
Max Krasnyansky 053b1539c0
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.

Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggresively which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* threading: disable SetThreadInfo() calls for older Windows versions

* Update tools/llama-bench/llama-bench.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-31 15:39:19 -07:00
Georgi Gerganov 3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sqeuence SWA contexts
2025-05-31 15:57:44 +03:00
igardev c7e0a2054b
webui : Replace alert and confirm with custom modals. (#13711)
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.

* use Modal Provider to simplify the use of confirm and alert modals.

* Increase the z index of the modal dialogs.

* Update index.html.gz

* also add showPrompt

* rebuild

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-31 11:56:08 +02:00
Georgi Gerganov 3f55f781f1
llama : auto-batch preparation (#13845)
* llama : auto-batch

ggml-ci

* context : simplify if branching
2025-05-31 12:55:57 +03:00
Xuan-Son Nguyen 51fa76f172
mtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917)
* mtmd : fix missing public header

* no object

* apply suggestion from Georgi

* rm mtmd-helper, merge it to mtmd

* missing vendor include dir
2025-05-31 10:14:29 +02:00
Georgi Gerganov 12d0188c0d
kv-cache : refactor + add llama_memory_state_i (#13746)
* kv-cache : simplify the "struct llama_kv_cache" interface

ggml-ci

* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

ggml-ci

* kv-cache : some comments

ggml-ci

* context : fix graph reserve for multiple sequences

ggml-ci

* kv-cache : fix typo [no ci]

* kv-cache : fix find_slot() logic for free slots

ggml-ci

* llama : add TODO for deprecating the defrag API in the future

* kv-cache : improve find_slot() using min/max seq pos info

ggml-ci

* llama : handle aborts and compute errors

ggml-ci

* memory : extract state into llama_memory_state

ggml-ci

* kv-cache : add comments

ggml-ci

* server : update batching logic to reset n_batch on successful decode

* server : upon full re-processing, remove the sequence from the cache

* kv-cache : add TODO for doing split_equal when split_simple fails

ggml-ci
2025-05-31 10:24:04 +03:00
Georgi Gerganov 53f925074d
sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
2025-05-30 16:25:45 +03:00
Xuan-Son Nguyen 10961339b2
mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code
2025-05-28 22:35:22 +02:00
Đinh Trọng Huy e0e3aa231d
llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 19:01:58 +02:00
Sky c962ae3382
server: fix remove 'image_url'/'input_audio' json-object effectlly for 'llama_params' in multimodal-model-mode (#13853)
[fix]: remove 'image_url'/'input_audio' effectlly for 'llama_params' in multimodal-model-mode
2025-05-28 16:33:54 +02:00
Xuan-Son Nguyen bc583e3c63
mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time

* refactor mtmd tokenizer

* fix compile

* ok, missing SinusoidsPositionEmbedding

* first working version

* fix style

* more strict validate of n_embd

* refactor if..else to switch

* fix regression

* add test for 3B

* update docs

* fix tokenizing with add_special

* add more tests

* fix test case "huge"

* rm redundant code

* set_position_mrope_1d rm n_tokens
2025-05-27 14:06:10 +02:00
Olivier Chafik 03f582ae8f
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match

* allow all parsers to parse non-tool-call content.

* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
Olivier Chafik d74e94c1b3
`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name

* fix tool_call.id (was in tool_call.function.id!) + add function type

* add tool_call.type

* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
Olivier Chafik f13847cfb5
server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are

* Add (slow) server test for completion + stream + stop
2025-05-26 14:16:37 +01:00
Georgi Gerganov 79c137f776
examples : allow extracting embeddings from decoder contexts (#13797)
ggml-ci
2025-05-26 14:03:54 +03:00
Olivier Chafik e121edc432
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 00:30:51 +01:00
Xuan-Son Nguyen 2f099b510f
webui : bump max upload file size to 500MB (#13779) 2025-05-25 18:02:18 +01:00
Percy Piper c508256db2
rpc : Fix build on OpenBSD (#13541) 2025-05-25 15:35:53 +03:00
Xuan-Son Nguyen 40aaa8a403
mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)
* mtmd : add Qwen2-Audio support

* small clean up

* update discussion link

* clarify mtmd_get_output_embd

* clarification in multimodal.md

* fix ultravox bug

* ggml_cont
2025-05-25 14:06:32 +02:00
Olivier Chafik d785f9c1fd
server: fix/test add_generation_prompt (#13770)
Co-authored-by: ochafik <ochafik@google.com>
2025-05-25 10:45:49 +01:00
Olivier Chafik f5cd27b71d
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
* add common_json w/ support for truncated json healing

* add common_chat_msg_diff

* partial common_chat_parse

* refactor parser w/ optionals

* server: wire chat diffs in stream mode

* fix trigger of thinking models (must happen after thoughts are closed)

* fix functionary v3.2 raw python!

* rename: common_chat_syntax (now contains format)

* rm common_regex.at_start

* don't return empty <think></think>

* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)

* fix QwQ 32B tool call parsing after thoughts (hermes2)

* better logs for grammar triggers

* consume spaces after parse_json_tool_calls

* fix required tool calls w/ thinking models that have pre-opened thinking tags

* fix thinking model's initial trigger + test qwq's template

* run most test_tool_call tests in stream + non-stream modes

* make functionary v3.2 parsing more strict (differentiate first match from others)

* send final diff from server, to close off raw python arguments

* support partial content streaming in Generic mode

* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

* Update function-calling.md

* Update tool_bench.py

* chat-parser: remove input from exception (llm output may contain PII)

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
2025-05-25 01:48:08 +01:00
Xuan-Son Nguyen 9ecf3e66a3
server : support audio input (#13714)
* server : support audio input

* add audio support on webui
2025-05-23 11:03:47 +02:00
Georgi Gerganov 8a1d206f1d
tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
ggml-ci
2025-05-22 22:21:07 +03:00
Xuan-Son Nguyen 797990c4bc
mtmd : add ultravox audio input (#13623)
* convert ok, load ok

* warmup ok

* test

* still does not work?

* fix padding

* temporary give up

* fix merge conflict

* build_ultravox()

* rm test

* fix merge conflict

* add necessary mtmd APIs

* first working version (only 4s of audio)

* will this monster compile?

* fix compile

* please compile

* fPIC

* fix windows

* various fixes

* clean up audio_helpers

* fix conversion

* add some debug stuff

* long audio input ok

* adapt the api

* add --audio arg

* final touch UX

* add miniaudio to readme

* fix typo

* refactor kv metadata

* mtmd_default_marker()
2025-05-22 20:42:48 +02:00
Georgi Gerganov cc74d5be99
server : pad small embedding batches (#13692)
ggml-ci
2025-05-22 16:33:39 +03:00
Georgi Gerganov 5fbfe384d4
server : improve error reporting (#13680) 2025-05-21 19:46:56 +03:00
Robin Davidsson 0d5c742161
server : Add the endpoints /api/tags and /api/chat (#13659)
* Add the endpoints /api/tags and /api/chat

Add the endpoints /api/tags and /api/chat, and improved the model metadata response

* Remove trailing whitespaces

* Removed code that is not needed for copilot to work.
2025-05-21 15:15:27 +02:00
Dorin-Andrei Geman 42158ae2e8
server : fix first message identification (#13634)
* server : fix first message identification

When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.

Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>

* server : Fix checks for first role message for stream=True

Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>

---------

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-21 15:07:57 +02:00
Georgi Gerganov 797f2ac062
kv-cache : simplify the interface (#13660)
* kv-cache : simplify the interface

ggml-ci

* context : revert llama_batch_allocr position change

ggml-ci
2025-05-21 15:11:13 +03:00
l3utterfly b7a17463ec
mtmd-helper : bug fix to token batching in mtmd (#13650)
* Update mtmd-helper.cpp

* Update tools/mtmd/mtmd-helper.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-20 18:55:30 +02:00
Georgi Gerganov e298d2fbd0
kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci
2025-05-20 08:05:46 +03:00
Nicolò Scipione f7c9429c85
sycl : Overcoming workaround for mmap() allocation on Windows (#13482)
* Remove mmap workaround on windows

After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.

* Update llama-bench README

SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
2025-05-20 08:54:43 +08:00
Xuan-Son Nguyen 92ecdcc06a
mtmd : add vision support for llama 4 (#13282)
* wip llama 4 conversion

* rm redundant __init__

* fix conversion

* fix conversion

* test impl

* try this

* reshape patch_embeddings_0

* fix view

* rm ffn_post_norm

* cgraph ok

* f32 for pos embd

* add image marker tokens

* Llama4UnfoldConvolution

* correct pixel shuffle

* fix merge conflicts

* correct

* add debug_graph

* logits matched, but it still preceives the image incorrectly

* fix style

* add image_grid_pinpoints

* handle llama 4 preprocessing

* rm load_image_size

* rm unused line

* fix

* small fix 2

* add test & docs

* fix llava-1.6 test

* test: add notion of huge models

* add comment

* add warn about degraded quality
2025-05-19 13:04:14 +02:00
Isaac McFadyen 6a2bc8bfb7
server : added --no-prefill-assistant flag (#13608)
* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md
2025-05-17 23:59:48 +02:00
Xuan-Son Nguyen 6aa892ec2a
server : do not return error out of context (with ctx shift disabled) (#13577) 2025-05-16 21:50:00 +02:00
Xuan-Son Nguyen aea9f8b4e7
webui : improve accessibility for visually impaired people (#13551)
* webui : improve accessibility for visually impaired people

* add a11y for extra contents

* fix some labels being read twice

* add skip to main content
2025-05-16 21:49:01 +02:00
Diego Devesa 6c8b91500e
llama-bench : fix -ot with dl backends (#13563) 2025-05-15 15:46:55 +02:00
Xuan-Son Nguyen 3cc1f1f1d2
webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)
* webui : handle PDF input (as text or image)

* handle the case where pdf image + server without mtmd

* fix bug missing pages
2025-05-15 14:24:50 +02:00
Piotr Wilkin (ilintar) c753d7bed0
server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540) 2025-05-15 08:40:58 +02:00
Georgi Gerganov b2838049cc
bench : handle decode errors (#13548)
ggml-ci
2025-05-15 05:57:02 +03:00
Olivier Chafik aa48e373f2
`server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
* Inject date_string in llama 3.x + fix for functionary v2

https://github.com/ggml-org/llama.cpp/issues/12729

* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-05-15 02:39:51 +01:00
Olivier Chafik 3198405e98
`common`: add partial regex support (#12808)
* move string_find_partial_stop & string_ends_with to common

* add common_regex (supports partial matches)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* partial regex: add missing iterator end checks

* string utils: use string_views

* direct throw to avoid ggml.h include

* regex-partial: replace missed ggml_asserts

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-14 19:50:57 +01:00
Georgi Gerganov 053174436f
server : passthrough the /models endpoint during loading (#13535)
* server : passthrough the /models endpoint during loading

* server : update readme + return json for "meta" field
2025-05-14 15:42:10 +03:00
Xuan-Son Nguyen 360a9c98e1
server : fix cache_tokens bug with no cache_prompt (#13533) 2025-05-14 13:35:07 +02:00
Xuan-Son Nguyen bb1681fbd5
webui : use fflate for more deterministic gzip compress (#13525)
* webui : use pako for more deterministic gzip compress

* simpler code

* use fflate instead of pako
2025-05-14 10:26:12 +02:00
Luca Stefani d486dd3e8e
webui: Allow pasting file from clipboard (#13526)
* server: Allow pasting file from clipboard

* server: Prevent default action on file paste

* update build

* format then build combined

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-14 10:07:31 +02:00
Ed Addario e5c834f718
quantize : improve tensor-type pattern matching (#13033) 2025-05-13 19:12:31 +02:00
Xuan-Son Nguyen 71bdbdb587
clip : clip.h become private API (⚠️ breaking change) (#13510) 2025-05-13 17:07:21 +02:00
Georgi Gerganov b89d605a91
batched-bench : fix pp batch contents (#13492) 2025-05-13 18:01:53 +03:00
Xuan-Son Nguyen b4726345ac
mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460)
* mtmd : remove libllava, remove clip-quantize-cli

* rm clip_model_quantize
2025-05-13 15:33:58 +02:00
Diego Devesa cf0a43bb64
llama-bench : add defrag-thold, check for invalid ranges (#13487) 2025-05-13 00:31:37 +02:00
Xuan-Son Nguyen de4c07f937
clip : cap max image size 1024 for qwen vl model (#13478) 2025-05-12 15:06:51 +02:00
Anudit Nagar 91159ee9df
server : allow content to be null in oaicompat_completion_params_parse (#13477) 2025-05-12 13:56:42 +02:00
Diego Devesa 22cdab343b
llama-bench : accept ranges for integer parameters (#13410) 2025-05-12 13:08:22 +02:00
City c104023994
mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459) 2025-05-12 00:39:06 +02:00
Anthony Umfer 9a390c4829
tools : fix uninitialized llama_batch in server (#13436)
* add constructor to initialize server_context::batch, preventing destructor's call to llama_batch_free from causing an invalid free()

* Update tools/server/server.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* use C++11 initializer syntax

* switch from Copy-list-initialization to Direct-list-initialization

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-11 17:08:26 +02:00
David Huang 7f323a589f
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386) 2025-05-11 14:18:39 +02:00
City 3eac209319
mtmd : support InternVL 3 38B and 78B mmproj (#13443)
* Support InternVL 3 38B and 78B mmproj

* Swap norms in clip.cpp

* Group variables together
2025-05-11 11:35:52 +02:00
Xuan-Son Nguyen a634d75d1b
mtmd : move helpers to dedicated file (#13442)
* mtmd : move helpers to dedicated file

* fix windows build

* rm redundant include
2025-05-11 11:34:23 +02:00
Xuan-Son Nguyen 15e6125a39
mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)
* mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl

* fix typo
2025-05-10 19:57:54 +02:00
Xuan-Son Nguyen 3b24d26c22
server : update docs (#13432) 2025-05-10 18:44:49 +02:00
Xuan-Son Nguyen 053367d149
mtmd : support InternVL 2.5 and 3 (#13422)
* convert : internvl support

* InternVL3-1B working

* fix regression

* rm mobilevlm from test

* fix conversion

* add test for internvl

* add to list of pre-quant

* restore boi/eoi check

* add clarify comment for norm eps
2025-05-10 16:26:42 +02:00
Xuan-Son Nguyen 33eff40240
server : vision support via libmtmd (#12898)
* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip :  fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* rm can_be_detokenized

* on prmpt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-09 19:29:37 +02:00
Diego Devesa 27ebfcacba
llama : do not crash if there is no CPU backend (#13395)
* llama : do not crash if there is no CPU backend

* add checks to examples
2025-05-09 13:02:07 +02:00
Bartowski efb8b47eda
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)
* Add --parse-special for enabling parsing of special tokens in imatrix calculation

* whitespace
2025-05-09 11:53:58 +02:00
R0CKSTAR 0527771dd8
llama-run: add support for downloading models from ModelScope (#13370)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-09 10:25:50 +01:00
Xuan-Son Nguyen 2189fd3b63
mtmd : fix batch_view for m-rope (#13397)
* mtmd : fix batch_view for m-rope

* nits : fix comment
2025-05-09 11:18:02 +02:00
Xuan-Son Nguyen 3f96aeff39
llama : one-off chat template fix for Mistral-Small-2503 (#13398)
* llama : one-off chat template fix for Mistral-Small-2503

* update readme

* add mistral-v7-tekken
2025-05-09 11:17:51 +02:00
Xuan-Son Nguyen d9c4accaff
server : (webui) rename has_multimodal --> modalities (#13393)
* server : (webui) rename has_multimodal --> modalities

* allow converting SVG to PNG

* less complicated code
2025-05-09 09:06:37 +02:00
Matt Clayton f05a6d71a0
mtmd : Expose helper_decode_image_chunk (#13366)
* mtmd: Expose helper_decode_image, output_embd_copy, image_tokens_copy/free

* Slim down

* Cleanups
2025-05-08 20:25:39 +02:00
Xuan-Son Nguyen ee01d71e58
server : (webui) fix a very small misalignment (#13387)
* server : (webui) fix a very small misalignment

* restore font-bold
2025-05-08 18:51:45 +02:00
Xuan-Son Nguyen 8c83449cb7
server : (webui) revamp the input area, plus many small UI improvements (#13365)
* rework the input area

* process selected file

* change all icons to heroicons

* fix thought process collapse

* move conversation more menu to sidebar

* sun icon --> moon icon

* rm default system message

* stricter upload file check, only allow image if server has mtmd

* build it

* add renaming

* better autoscroll

* build

* add conversation group

* fix scroll

* extra context first, then user input in the end

* fix <hr> tag

* clean up a bit

* build

* add mb-3 for <pre>

* throttle adjustTextareaHeight to make it less laggy

* (nits) missing padding in sidebar

* rm stray console log
2025-05-08 15:37:29 +02:00
welix 0ccc121354
mtmd : fix the calculation of n_tokens for smolvlm (#13381)
Co-authored-by: Taichi Nishimura <Taichi.A.Nishimura@sony.com>
2025-05-08 15:03:53 +02:00
Georgi Gerganov 6562e5a4d6
context : allow cache-less context for embeddings (#13108)
* context : allow cache-less context for embeddings

ggml-ci

* context : enable reranking with encode()

ggml-ci

* context : encode() clears embd_seq

ggml-ci

* examples : use llama_encode() when appropriate

ggml-ci

* models : nomic bert moe does not require KV cache

* llama : update comments for llama_decode/llama_encode

ggml-ci

* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
Georgi Gerganov 51fb96b1ff
context : remove logits_all flag (#13284)
* context : remove logits_all flag

ggml-ci

* llama : remove logits_all flag + reorder llama_context_params

ggml-ci
2025-05-08 14:26:50 +03:00
Xuan-Son Nguyen 32916a4907
clip : refactor graph builder (#13321)
* mtmd : refactor graph builder

* fix qwen2vl

* clean up siglip cgraph

* pixtral migrated

* move minicpmv to a dedicated build function

* move max_feature_layer to build_llava

* use build_attn for minicpm resampler

* fix windows build

* add comment for batch_size

* also support tinygemma3 test model

* qwen2vl does not use RMS norm

* fix qwen2vl norm (2)
2025-05-06 22:40:24 +02:00
oobabooga 233461f812
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264)
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering

* revert: sampler ordering

* revert: VS' crappy auto-formatting

* revert: VS' crappy auto-formatting pt.2

* revert: my crappy eye sight...

* sampling: add XTC to Top-nσ sampler chain

* sampling: add Dyna. Temp. to Top-nσ sampler chain

* sampling: actually remove Top-nσ from sampler(oops)

* Integrate top_n_sigma into main sampler chain

* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA

* Formatting

* Lint

* Exit early in the sampler if nsigma < 0

---------

Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>
2025-05-05 22:12:19 +02:00
igardev b34c859146
server : Webui - change setText command from parent window to also send the message. (#13309)
* setText command from parent window for llama-vscode now sends the message automatically.

* Upgrade packages versions to fix vulnerabilities with "npm audit fix" command.

* Fix code formatting.

* Add index.html.gz changes.

* Revert "Upgrade packages versions to fix vulnerabilities with "npm audit fix" command."

This reverts commit 67687b7fda.

* easier approach

* add setTimeout

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-05 16:03:31 +02:00
Xuan-Son Nguyen 9b61acf060
mtmd : rename llava directory to mtmd (#13311)
* mv llava to mtmd

* change ref everywhere
2025-05-05 16:02:55 +02:00
Xuan-Son Nguyen 5215b91e93
clip : fix confused naming ffn_up and ffn_down (#13290)
* clip :  fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff
2025-05-05 12:54:44 +02:00
Xuan-Son Nguyen 27aa259532
mtmd : add C public API (#13184)
* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* add const to various places

* add warning about breaking changes

* helper: use mtmd_image_tokens_get_n_pos
2025-05-04 23:43:42 +02:00
Diego Devesa 9fdfcdaedd
rpc : use backend registry, support dl backends (#13304) 2025-05-04 21:25:43 +02:00
Diego Devesa 86bd60d3fe
llava/mtmd : fixes to fully support dl backends (#13303) 2025-05-04 17:05:20 +02:00
Johannes Gäßler 3e959f0976
imatrix: fix oob writes if src1 is not contiguous (#13286) 2025-05-04 00:50:37 +02:00
Xuan-Son Nguyen 36667c8edc
clip : revert the change of BOI/EOI token for GLM-edge (⚠️ breaking change) (#13259) 2025-05-03 20:07:54 +02:00
Diego Devesa 1d36b3670b
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00