Commit Graph

840 Commits

Author SHA1 Message Date
ddh0 521a13e6c6 correct fallback logic 2026-02-16 13:14:05 -06:00
ddh0 aaf010edeb new function `llama_tensor_update_stats` 2026-02-16 12:20:16 -06:00
ddh0 0c976fafd7 Merge branch 'ggml-org:master' into llama-quant-refactor 2026-02-16 11:00:49 -06:00
Saurabh Dash 5f28c53d11
model: Add support for Tiny Aya Models (#19611)
* changes for tiny aya

* changes to hash

* changes to vocab

* fix some tokenizer regex edge cases

* update comment

* add some comments for regex

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-02-16 16:28:46 +01:00
Georgi Gerganov cc45f2ada6
models : deduplicate delta-net graphs for Qwen family (#19597)
* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments
2026-02-16 14:35:04 +02:00
Georgi Gerganov d5dfc33027
graph : fix KQ mask, lora, cvec reuse checks (#19644)
* graph : fix KQ mask reuse condition

* cont : dedup KQ mask build and can_reuse

* cont : fix build

* graph : fix adapter check for reuse
2026-02-16 09:21:11 +02:00
Georgi Gerganov 341bc7d23c context : fix output reorder with backend sampling (#19638) 2026-02-15 14:57:40 +02:00
ddh0 f14fd0c7f2 Merge branch 'ggml-org:master' into llama-quant-refactor 2026-02-14 22:36:18 -06:00
Georgi Gerganov 1725e316c1
models : optimize qwen3next graph (#19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
2026-02-14 12:57:36 +02:00
agent-enemy-2 2d8015e8a4
llama : update LoRA API. + fix excessive graph reserves (#19280)
* Refactoring to use new llama_put_adapter_loras

* cont : alternative lora API

---------

Co-authored-by: Jake Chavis <jakechavis6@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-14 10:06:27 +02:00
George eb145c0753
mmap: Fix Windows handle lifetime (#19598)
* ggml: added cleanups in ggml_quantize_free
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.

* mmap: Fix Windows handle lifetime
Move hMapping from local variable to member variable so it stays alive for the entire lifetime of the mapping.
The file mapping handle must remain valid until UnmapViewOfFile is called.
Fixes cleanup order in destructor.

* Update llama-mmap.cpp

* Update llama-mmap.cpp

Remove trailing whitespace from line 567
2026-02-14 10:05:12 +02:00
ddh0 a3bf07ea05 Merge branch 'ggml-org:master' into llama-quant-refactor 2026-02-13 21:29:21 -06:00
ddh0 7b127e126a correct function names 2026-02-13 21:17:53 -06:00
ddh0 bddc67547f correct function names 2026-02-13 21:13:53 -06:00
Xuan-Son Nguyen 752584d5f5
model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) (#19460)
* model: support GLM MoE DSA arch

* working version

* pyright

* keep indexer tensors

* add indexer gguf params

* loaded now

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* update

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* minor fix and cleanup

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-13 14:56:53 +01:00
ymcki 33a56f90a6
model : Kimi Linear fix conv state update (#19531)
* fix conv state update for llama-server parallel serving

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-13 09:10:18 +01:00
Adrien Gallouët 25224c8021
llama : remove deprecated codecvt (#19565)
Using the same conversion function ensures a consistent matching between
the regex pattern and the text.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-13 06:43:53 +01:00
Georgi Gerganov bb96bfd361 memory : fix kv cache size for hybrid models (#19559) 2026-02-13 07:36:24 +02:00
ddh0 97aefac773 update_stats guard 2026-02-12 20:00:23 -06:00
ddh0 053a28980b don't double-count `qs` 2026-02-12 18:31:59 -06:00
ddh0 fd3787ee05 typo 2026-02-12 18:24:47 -06:00
ddh0 d648629f56 remove unused `std::vector<ggml_tensor*> tensors;` 2026-02-12 18:24:16 -06:00
ddh0 6734e77662 don't throw by pointer; unify MiB formatting 2026-02-12 18:22:52 -06:00
ddh0 1f25c130de pretty error msg 2026-02-12 18:11:44 -06:00
ddh0 67e25bbae1 fix compile errors 2026-02-12 18:02:40 -06:00
ddh0 5d6c92440c initial commit for branch 2026-02-12 17:52:59 -06:00
ddh0 f58de63ec3 remove unused `params` parameter 2026-02-11 22:30:06 -06:00
ddh0 44f9fee248 remove per @compilade 2026-02-11 22:23:10 -06:00
ddh0 40528248fc comment ref #12557 2026-02-11 22:18:56 -06:00
ddh0 1658228d6a add back Q2_K edge case for imatrix 2026-02-11 21:53:07 -06:00
ddh0 1ccd7a49ba simplify for style 2026-02-11 21:41:37 -06:00
ddh0 ae786b862d simplify and rename `tensor_type_requires_imatrix` 2026-02-11 21:21:40 -06:00
ddh0 22db76409b add missing `GGML_TYPE`s 2026-02-11 21:14:19 -06:00
ddh0 55dbee2bbe fixup tensor_requires_imatrix 2026-02-11 21:03:34 -06:00
ddh0 3211a847ef logic error 2026-02-11 20:58:52 -06:00
ddh0 ea8da0503c missing __func__, move imatrix flag set 2026-02-11 20:57:16 -06:00
ddh0 2769f35207 new function `tensor_requires_imatrix`, add courtesy warning about imatrix 2026-02-11 20:49:05 -06:00
ddh0 966b21a981 show model and quant BPW when quant completes 2026-02-11 15:30:12 -06:00
ddh0 b9b32f0d2d no need to re-calculate ggml_nbytes for tensor 2026-02-11 14:45:44 -06:00
ddh0 c3f42dedd1 use 6 characters for tensor dims (cont.) 2026-02-11 14:29:22 -06:00
ddh0 56c27b13ad add --dry-run to llama-quantize 2026-02-11 14:08:17 -06:00
ddh0 0d22288f00 use 6 characters for tensor dims 2026-02-11 14:08:01 -06:00
ddh0 844ad3e326 clean slate for branch 2026-02-11 12:47:13 -06:00
Georgi Gerganov 6d95707827 model : fix wavtokenizer embedding notions (#19479) 2026-02-11 07:52:20 +02:00
Daniel Bevenius 2cce9fddb7
llama : refactor sampling_info to use buffer_view template (#19368)
* llama : refactor sampling_info to use buffer_view template

This commit updates the sampling_info struct in llama-context to use a
buffer_view template for the logits, probs, sampled tokens, and
candidates buffers.

The motivation for this is to simplify the code, improve type safety
and readability.
2026-02-11 05:38:13 +01:00
JJJYmmm fc0fe40049
models : support qwen3.5 series (#19468)
* support qwen3.5 series

* remove deepstack for now, and some code clean

* code clean

* add FULL_ATTENTION_INTERVAL metadata

* code clean

* reorder v heads for linear attention to avoid expensive interleaved repeat
2026-02-10 18:00:26 +02:00
Georgi Gerganov 972f323e73
revert : "[Model] Qwen3.5 dense and MoE support (no vision) (#19435)" (#19453)
This reverts commit 39bf692af1.
2026-02-09 14:57:51 +02:00
Piotr Wilkin (ilintar) 39bf692af1
[Model] Qwen3.5 dense and MoE support (no vision) (#19435)
* Unified delta net handling

* Remove old methods.

* Refactor and optimize

* Adapt autoregressive version from @ymcki

* Change to decay mask approach

* Fix bad permute

* Qwen 3.5 support

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes

* Use inheritance, remove unneeded conts

* Not like this!

* Remove ggml.h explicit import

* Remove transformers, fix the views

* ACTUALLY fix views, make super calls explicit in conversion.

* Fix conversion again

* Remove extra ggml.h imports

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-09 00:24:08 +01:00
forforever73 b83111815e
model : support Step3.5-Flash (#19283)
* Support Step3.5-Flash

* fix: norm.weight + 1 (HF zero_centered=true)

* step35: simplify GGUF conversion + drop redundant rope KVs

* Address review feedback

* rename limits -> clamp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* rename swiglu limits -> swiglu clamp in LLM_KV

* avoid CI fail

* Apply suggestions from code review

* Apply suggestions from code review

* disabled KV shifting for LLM_ARCH_STEP35

* Apply suggestions from code review

* mistakenly removed cmath

* add model size && apply missed suggestion

* assert partial_rotary_factors

* fix CI errors:

* load freq_base_swa

---------

Co-authored-by: lvyichen <lvyichen@stepfun.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-06 21:06:14 +01:00
Lasse Lauwerys 06bf3796f4
unicode : MSVC regex fix (#19340)
* Fix model loading regex error

* Change comments

* Use const_iterator and remove specializations

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
2026-02-06 15:56:13 +02:00