Commit Graph

7444 Commits

Author SHA1 Message Date
Aadeshveer Singh 58062860af
ggml : use WARP_SIZE/2 for argmax reduction offset (#18092) 2025-12-17 11:47:01 +08:00
Yuri Khrustalev 2973a65ecb
gguf-py : allow converting multi-tensor models from read-only locations (#18100) 2025-12-17 02:27:03 +01:00
Johannes Gäßler d0794e89d9
llama-fit-params: force disable mlock (#18103) 2025-12-17 00:50:12 +01:00
Johannes Gäßler 9dcac6cf9f
llama-fit-params: lower ctx size for multi GPU (#18101) 2025-12-17 00:49:34 +01:00
Johannes Gäßler 0e49a7b8b4
llama-fit-params: fix underflow for dense models (#18095) 2025-12-17 00:47:37 +01:00
Johannes Gäßler 4164596c76
llama-fit-params: QoL impr. for prints/errors (#18089) 2025-12-17 00:03:19 +01:00
Xuan-Son Nguyen ef83fb8601
model: fix LFM2 missing tensors (#18105) 2025-12-16 19:07:43 +01:00
Johannes Gäßler ec98e20021
llama: fix early stop in params_fit if ctx is set (#18070) 2025-12-16 14:24:00 +01:00
yifant-code 59977eba7b
server: fix crash when batch > ubatch with embeddings (#17912)
* server: fix crash when batch > ubatch with embeddings (#12836)

Fixes #12836 where the server crashes with GGML_ASSERT failure when
running with embeddings enabled and n_batch > n_ubatch.

Root cause: Embeddings use non-causal attention which requires all
tokens to be processed within a single ubatch. When n_batch > n_ubatch,
the server attempts to split processing, causing assertion failure.

Solution:
- Add parameter validation in main() after common_params_parse()
- When embeddings enabled and n_batch > n_ubatch:
  * Log warnings explaining the issue
  * Automatically set n_batch = n_ubatch
  * Prevent server crash

This follows the approach suggested by @ggerganov in issue #12836.

Note: This supersedes stalled PR #12940 which attempted a runtime fix
in the old examples/server/server.cpp location. This implementation
validates at startup in tools/server/server.cpp (current location).

Testing:
- Build: Compiles successfully
- Validation triggers: Warns when -b > -ub with --embedding
- Auto-correction works: Adjusts n_batch = n_ubatch
- No false positives: Valid params don't trigger warnings
- Verified on macOS M3 Pro with embedding model

* Update tools/server/server.cpp

---------

Co-authored-by: ytian218 <ytian218@bloomberg.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 14:27:36 +02:00
Daniel Bevenius 79dbae034a
model-conversion : remove -fa option in model card template [no ci] (#18088)
This commit updates the causal model card template and removes the
-fa option as it is no longer required (fa is auto detected).
2025-12-16 13:25:09 +01:00
Xuan-Son Nguyen 7f2b2f3c77
arch: refactor LLM_TENSOR_NAMES (#18051)
* arch: refactor LLM_TENSOR_NAMES

* update docs

* typo

* fix LLM_ARCH_NEMOTRON_H_MOE

* show more meaningful error message on missing tensor

* fix and tested LLM_ARCH_NEMOTRON_H_MOE
2025-12-16 13:22:30 +01:00
Xuan-Son Nguyen 7b1db3d3b7
arg: clarify auto kvu/np being set on server (#17997)
* arg: clarify auto kvu/np being set on server

* improve docs

* use invalid_argument
2025-12-16 12:01:27 +01:00
Piotr Wilkin (ilintar) a5251ca11d
Optimization: Qwen3 next autoregressive pass (#17996)
* It's Qwen3 Next, the lean mean token generation machine!

* Apply patches from thread

* Remove recurrent version, only keep chunked and autoregressive

* Remove unnecessary conts and asserts

* Remove more extra conts and asserts

* Cleanup masking
2025-12-16 11:59:53 +01:00
Andrew Aladjev fb644247de
CLI: fixed adding cli and completion into docker containers, improved docs (#18003)
Co-authored-by: Andrew Aladjev <andrew.aladjev@gmail.com>
2025-12-16 11:52:23 +01:00
2114L3 5f5f9b4637
server: Update README.md incorrect argument (#18073)
n-gpu-layer is incorrect
argument is n-gpu-layers with the 's'
2025-12-16 11:50:43 +01:00
Xuan-Son Nguyen 3d86c6c2b5
model: support GLM4V vision encoder (#18042)
* convert ok

* no deepstack

* less new tensors

* cgraph ok

* add mrope for text model

* faster patch merger

* add GGML_ROPE_TYPE_MRNORM

* add support for metal

* move glm4v do dedicated graph

* convert: add norm_embd

* clip: add debugging fn

* working correctly

* fix style

* use bicubic

* fix mrope metal

* improve cpu

* convert to neox ordering on conversion

* revert backend changes

* force stop if using old weight

* support moe variant

* fix conversion

* fix convert (2)

* Update tools/mtmd/clip-graph.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* process mrope_section on TextModel base class

* resolve conflict merge

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 11:25:26 +01:00
Daniel Bevenius 9963b81f63
model-conversion : add note about verifying previous models (#18082)
This commit adds a note to the README in the model-conversion
examples, advising developers to verify that previous versions of models
pass logits verification before adding new models from the same family.
2025-12-16 11:17:40 +01:00
Daniel Bevenius db81d5ec4b
model-conversion : use CONVERTED_EMBEDDING_MODEL for embedding_verify_logits (#18079)
This commit updates the embedding model verification script to use the
CONVERTED_EMBEDDING_MODEL environment variable instead of using the
EMBEDDING_MODEL_PATH (the original embedding model path) as the basis
for the converted model file name.

The motivation for this that currently if the converted embedding model
file name differs from the original embedding model directory/name the
verification script will look for the wrong .bin files that were
generating when running the models.
2025-12-16 11:17:20 +01:00
Aldehir Rojas c05aa69f32
common : add nemotron 3 parsing (#18077)
* common : expose json-schema functionality to extract type info

* common : fix peg parser negation during needs_more_input

* common : add some defensive measures in constructed peg parser

* common : add nemotron nano 3 support

* common : add nemotron nano 3 tests

* remove debug line
2025-12-16 04:05:23 -06:00
Francisco Herrera 279cef27c2
added note for old Intel hardware pre sycl (#18017)
* added note for old Intel hardware pre sycl

Older hardware used opencl

* typo

* use consistent terms
2025-12-16 17:45:09 +08:00
Georgi Gerganov 5ba95754ee
security : add collaborator guidance (#18081) 2025-12-16 11:17:11 +02:00
Chris Peterson 2aa45ef9e3
llama: Include algorithm header needed for C++23 (#18078) 2025-12-16 09:37:55 +02:00
Georgi Gerganov c560316440
graph : reuse SSM graphs (#16490)
* graph : reuse hybrid graphs

* graph : reuse recurrent graphs

* graph : fix reuse check for recurrent inputs

* memory : move the recurrent state into the memory context

* Revert "memory : move the recurrent state into the memory context"

This reverts commit 00f115fe810815d4a22a6dee0acc346131e970e1.

* cont : fix build
2025-12-16 09:36:21 +02:00
Sigbjørn Skjæret d6742125c3
ci : separate webui from server (#18072)
* separate webui from server

* add public to path
2025-12-16 08:17:26 +01:00
Aleksander Grygier 3034836d36
webui: Improve copy to clipboard with text attachments (#17969)
* feat: Create copy/paste user message including "pasted text" attachments

* chore: update webui build output

* chore: update webui static output

* fix: UI issues

* chore: update webui static output

* fix: Decode HTML entities using `DOMParser`

* chore: update webui build output

* chore: update webui static output
2025-12-16 07:38:46 +01:00
Aleksander Grygier a20979d433
webui: Add setting to always show sidebar on Desktop (#17809)
* feat: Add setting to always show Sidebar on Desktop

* chore: update webui build output

* feat: Add auto-show sidebar setting

* fix: Mobile settings dialog UI

* chore: update webui build output

* feat: UI label update

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output
2025-12-16 07:31:37 +01:00
Daniel Bevenius 2995341730
llama : add support for NVIDIA Nemotron 3 Nano (#18058)
* llama : add support for NVIDIA Nemotron Nano 3

This commit adds support for the NVIDIA Nemotron Nano 3 model, enabling
the conversion and running of this model.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 07:19:26 +01:00
Darius Lukas 40d9c394f4
Webui: Disable attachment button and model selector button when prompt textbox is disabled. (#17925)
* Pass disabled state to the file attachments button and the model
selector button.

* Update index.html.gz

* Fix model info card in non-router mode.

* Update index.html.gz
2025-12-16 07:15:49 +01:00
Sigbjørn Skjæret d6a1e18c65
convert : move rope_parameters to TextModel class (#18061)
* make sure to search text_config for rope parameters

* move rope_parameters to TextModel class
2025-12-15 22:03:16 +01:00
Shouyu c45f89d551
ggml-hexagon: mm for mtmd (#17894)
* feat: add run_mtmd script for hexagon

* fix: fix issue in fp16xfp32 mm

* fix: remove opt_experiment for fp16xfp32 mm

* fix: ggml-hexagon: matmul fp16xfp32 support non-contigious src0

* fix: fix syntax check for run-mtmd.sh for cli
2025-12-15 10:53:56 -08:00
HelloKS 9d52f17ae3
model : add KORMo model (#18032)
* vocab: add KORMo Tokenizer

* model: add KORMoForCausalLM

* vocab: change pretokenizer to qwen2

* lint: fix unintended line removal

* model: make qwen2 bias tensor optional

* model: use qwen2 architecture for KORMo
2025-12-15 18:51:43 +01:00
ssweens 4529c660c8
kv-cache: Fix state restore fragmented cache (#17982)
* kv-cache : fix state restore with fragmented cache (#17527)

Change find_slot to allow non-contiguous allocation during state restore. Fixes 'failed to find available cells in kv cache' error when restoring state to fragmented cache.

* tests : update logic

* cleanup: tightened state_read_meta sig, added is_contiguous case

* fix: state_read_meta arg reorder loose ends

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-15 19:28:35 +02:00
Pascal 0f4f35e7be
Fix unreadable user markdown colors and truncate long texts in deletion dialogs (#17555)
* webui: limit conversation name length in dialogs

* webui: fix unreadable colors on links and table cell hover in user markdown

* webui: keep table borders visible in user markdown

* webui: updating unified exports

* Update tools/server/webui/src/lib/components/app/chat/ChatAttachments/ChatAttachmentThumbnailFile.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-15 16:34:53 +01:00
Jeremy Demeule 165caaf5fb
metal: use shared buffers on eGPU (#17866)
* metal: use shared buffers on eGPU

With #15906, I noticed on important regression when using metal backend on eGPU.
This commit restore the previous behavior and add an option to force its activation.

* metal: use shared buffers on eGPU

* metal: use shared buffers on eGPU
2025-12-15 16:14:49 +02:00
Xuan-Son Nguyen 96a181a933
mtmd: refactor audio preprocessing (#17978)
* mtmd: refactor audio preprocessing

* refactor

Co-authored-by: Tarek <tdakhran@users.noreply.github.com>

* wip

* wip (2)

* improve constructor

* fix use_natural_log

* fix padding for short input

* clean up

* remove need_chunking

---------

Co-authored-by: Tarek <tdakhran@users.noreply.github.com>
2025-12-15 14:16:52 +01:00
Andrew Aladjev 4a4f7e6550
cli: fixed dead links to tools/main for cli and completion, fixed code owners (#17993)
Co-authored-by: Andrew Aladjev <andrew.aladjev@gmail.com>
2025-12-15 11:47:04 +01:00
Thomas Jarosch e73d548659
webui: add "delete all conversations" button to import/export tab (#17444)
* webui: add "delete all conversations" button to import/export tab

- Add 'Delete all conversations' functionality with confirmation dialog
- Add Trash icon and destructive styling for clear visual indication
- Redirects to "?new_chat=true#/" by using conversationsStore.deleteAll()

* chore: update webui build output
2025-12-15 11:29:29 +01:00
Johannes Gäßler b1f3a6e5db
llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-15 09:24:59 +01:00
Neo Zhang Jianyu 4aced7a631
[SYCL] Support gpt-oss by OPs add-id, mul_mat for mxfp4, swiglu_oai (#17826)
* support gpt-oss GPU by OP add-id, mul_mat for mxfp4, swiglu_oai, fix warning

* fix fault ut case, update ops.md

* rebase, fix format issue
2025-12-15 10:35:15 +08:00
piDack 745fa0e78b
model : add glm-asr support (#17901)
* [model] add glm-asr support

* fix format for ci

* fix convert format for ci

* update glm_asr convert script & use build_ffn for glm_asr clip & use build_stack for padding and review

* check root architecture for convert hf script

* fix conficlt with upstream

* fix convert script for glm asr & format clip-impl

* format

* restore hparams text

* improved conversion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-15 03:18:46 +01:00
Xuan-Son Nguyen 52392291b2
preset: handle negated arg, reverse the meaning if needed (#18041) 2025-12-14 22:08:10 +01:00
Sigbjørn Skjæret 5c8a717128
convert : refactor rope scaling handling (#18013)
* refactor rope scaling handling

* ws--

* missed a couple

* use find_hparam
2025-12-14 16:04:37 +01:00
Haowei Wu 37f5a1093b
mtmd: enhance image resizing in llava_uhd (#18014) 2025-12-14 15:57:52 +01:00
Ruben Ortlam 9e6649ecf2
vulkan: fix mul_mat_vec_iq1_s formatting (#18026) 2025-12-14 14:52:46 +01:00
Xuan-Son Nguyen 0759b09c90
graph: add f_attn_temp_offset (#18025) 2025-12-14 13:05:59 +01:00
Georgi Gerganov 254098a279
common : refactor common_sampler + grammar logic changes (#17937)
* common : refactor common_sampler + grammar logic changes

* tests : increase max_tokens to get needed response

* batched : fix uninitialized samplers
2025-12-14 10:11:13 +02:00
Jeff Bolz 3238b1400c
vulkan: Fix data race/hang in scalar/cm1 flash attention (#17887) 2025-12-14 09:00:00 +01:00
lovedheart 4722671641
vulkan: improve mul_mat_vec_iq1_s speed (#17874) 2025-12-14 08:47:49 +01:00
Eve d15d177f43
vulkan: faster q6_k matmul (#17813)
* q6_k faster mul mat

* 8 values

* fix comment

* switch to two at a time

* start ci for .glsl files
2025-12-14 08:29:37 +01:00
Georgi Gerganov 77ad8542bd
model-conversion : cast logits to float32 (#18009) 2025-12-14 08:58:13 +02:00