Commit Graph

828 Commits

Author SHA1 Message Date
ddh0 1cd57c81d8
Merge branch 'ggml-org:master' into llama-quantize-dry-run 2026-02-16 11:00:36 -06:00
Saurabh Dash 5f28c53d11
model: Add support for Tiny Aya Models (#19611)
* changes for tiny aya

* changes to hash

* changes to vocab

* fix some tokenizer regex edge cases

* update comment

* add some comments for regex

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-02-16 16:28:46 +01:00
Georgi Gerganov cc45f2ada6
models : deduplicate delta-net graphs for Qwen family (#19597)
* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments
2026-02-16 14:35:04 +02:00
Georgi Gerganov d5dfc33027
graph : fix KQ mask, lora, cvec reuse checks (#19644)
* graph : fix KQ mask reuse condition

* cont : dedup KQ mask build and can_reuse

* cont : fix build

* graph : fix adapter check for reuse
2026-02-16 09:21:11 +02:00
Georgi Gerganov 341bc7d23c
context : fix output reorder with backend sampling (#19638) 2026-02-15 14:57:40 +02:00
ddh0 679792b517
Merge branch 'ggml-org:master' into llama-quantize-dry-run 2026-02-14 22:36:03 -06:00
Georgi Gerganov 1725e316c1
models : optimize qwen3next graph (#19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
2026-02-14 12:57:36 +02:00
agent-enemy-2 2d8015e8a4
llama : update LoRA API. + fix excessive graph reserves (#19280)
* Refactoring to use new llama_put_adapter_loras

* cont : alternative lora API

---------

Co-authored-by: Jake Chavis <jakechavis6@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-14 10:06:27 +02:00
George eb145c0753
mmap: Fix Windows handle lifetime (#19598)
* ggml: added cleanups in ggml_quantize_free
Add missing cleanup calls in ggml_quantize_free for the IQ2_S and IQ1_M quantization types and for IQ3XS with 512 blocks.

* mmap: Fix Windows handle lifetime
Move hMapping from a local variable to a member variable so it stays alive for the entire lifetime of the mapping.
The file mapping handle must remain valid until UnmapViewOfFile is called.
Fixes the cleanup order in the destructor.

* Update llama-mmap.cpp

* Update llama-mmap.cpp

Remove trailing whitespace from line 567
2026-02-14 10:05:12 +02:00
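The ownership pattern this fix describes, as a minimal sketch: an illustrative RAII wrapper, not the actual llama_mmap code, with error handling reduced to a bool. The mapping handle lives as a member and is closed only after the view is unmapped.

```cpp
#ifdef _WIN32
#include <windows.h>

struct win_mmap {
    void * addr     = nullptr;
    HANDLE hMapping = nullptr; // member, not local: must outlive the mapped view

    bool map(HANDLE hFile) {
        hMapping = CreateFileMappingW(hFile, nullptr, PAGE_READONLY, 0, 0, nullptr);
        if (hMapping == nullptr) {
            return false;
        }
        addr = MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0);
        return addr != nullptr;
    }

    ~win_mmap() {
        // cleanup order from the commit: unmap the view first,
        // then close the file-mapping handle
        if (addr)     { UnmapViewOfFile(addr); }
        if (hMapping) { CloseHandle(hMapping); }
    }
};
#endif
```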
ddh0 5bc85aff4c
Merge branch 'ggml-org:master' into llama-quantize-dry-run 2026-02-13 21:29:05 -06:00
Xuan-Son Nguyen 752584d5f5
model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) (#19460)
* model: support GLM MoE DSA arch

* working version

* pyright

* keep indexer tensors

* add indexer gguf params

* loaded now

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* update

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* minor fix and cleanup

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-13 14:56:53 +01:00
ymcki 33a56f90a6
model : Kimi Linear fix conv state update (#19531)
* fix conv state update for llama-server parallel serving

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-13 09:10:18 +01:00
Adrien Gallouët 25224c8021
llama : remove deprecated codecvt (#19565)
Using the same conversion function ensures a consistent matching between
the regex pattern and the text.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-13 06:43:53 +01:00
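The idea behind this change, sketched: run both the regex pattern and the subject text through the same UTF-8-to-codepoint routine before matching, so the two sides can never disagree on how a byte sequence is decoded. The decoder below is an illustrative stand-in for the one in unicode.cpp and assumes well-formed UTF-8 (no validation).

```cpp
#include <cstdint>
#include <string>
#include <vector>

static std::vector<uint32_t> cpts_from_utf8(const std::string & s) {
    std::vector<uint32_t> cpts;
    for (size_t i = 0; i < s.size(); ) {
        const uint8_t b = s[i];
        uint32_t cp; int n;
        if      (b < 0x80) { cp = b;        n = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; n = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; n = 3; }
        else               { cp = b & 0x07; n = 4; }
        for (int k = 1; k < n && i + k < s.size(); ++k) {
            cp = (cp << 6) | (s[i + k] & 0x3F); // fold in continuation bytes
        }
        cpts.push_back(cp);
        i += n;
    }
    return cpts;
}

// Both sides go through the same function before codepoint-level matching:
//   const auto pat_cpts  = cpts_from_utf8(pattern);
//   const auto text_cpts = cpts_from_utf8(text);
```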
Georgi Gerganov bb96bfd361
memory : fix kv cache size for hybrid models (#19559) 2026-02-13 07:36:24 +02:00
ddh0 f58de63ec3 remove unused `params` parameter 2026-02-11 22:30:06 -06:00
ddh0 44f9fee248 remove per @compilade 2026-02-11 22:23:10 -06:00
ddh0 40528248fc comment ref #12557 2026-02-11 22:18:56 -06:00
ddh0 1658228d6a add back Q2_K edge case for imatrix 2026-02-11 21:53:07 -06:00
ddh0 1ccd7a49ba simplify for style 2026-02-11 21:41:37 -06:00
ddh0 ae786b862d simplify and rename `tensor_type_requires_imatrix` 2026-02-11 21:21:40 -06:00
ddh0 22db76409b add missing `GGML_TYPE`s 2026-02-11 21:14:19 -06:00
ddh0 55dbee2bbe fixup tensor_requires_imatrix 2026-02-11 21:03:34 -06:00
ddh0 3211a847ef logic error 2026-02-11 20:58:52 -06:00
ddh0 ea8da0503c missing __func__, move imatrix flag set 2026-02-11 20:57:16 -06:00
ddh0 2769f35207 new function `tensor_requires_imatrix`, add courtesy warning about imatrix 2026-02-11 20:49:05 -06:00
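The body of `tensor_requires_imatrix` isn't shown in this log; below is a hedged sketch of what such a predicate plausibly looks like, assuming it flags the IQ quant types that llama-quantize cannot produce without an importance matrix (the commits also mention a separate Q2_K edge case, omitted here). `ggml_type` and the `GGML_TYPE_IQ*` enum values come from ggml.h.

```cpp
#include "ggml.h"

// sketch: which target types need an imatrix to quantize at all
// (assumed type set -- the branch's actual list may differ)
static bool tensor_requires_imatrix(ggml_type type) {
    switch (type) {
        case GGML_TYPE_IQ1_S:
        case GGML_TYPE_IQ1_M:
        case GGML_TYPE_IQ2_XXS:
        case GGML_TYPE_IQ2_XS:
        case GGML_TYPE_IQ2_S:
            return true;
        default:
            return false;
    }
}
```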
ddh0 966b21a981 show model and quant BPW when quant completes 2026-02-11 15:30:12 -06:00
ddh0 b9b32f0d2d no need to re-calculate ggml_nbytes for tensor 2026-02-11 14:45:44 -06:00
ddh0 c3f42dedd1 use 6 characters for tensor dims (cont.) 2026-02-11 14:29:22 -06:00
ddh0 56c27b13ad add --dry-run to llama-quantize 2026-02-11 14:08:17 -06:00
ddh0 0d22288f00 use 6 characters for tensor dims 2026-02-11 14:08:01 -06:00
ddh0 844ad3e326 clean slate for branch 2026-02-11 12:47:13 -06:00
Georgi Gerganov 6d95707827
model : fix wavtokenizer embedding notions (#19479) 2026-02-11 07:52:20 +02:00
Daniel Bevenius 2cce9fddb7
llama : refactor sampling_info to use buffer_view template (#19368)
* llama : refactor sampling_info to use buffer_view template

This commit updates the sampling_info struct in llama-context to use a
buffer_view template for the logits, probs, sampled tokens, and
candidates buffers.

The motivation for this is to simplify the code and improve type safety
and readability.
2026-02-11 05:38:13 +01:00
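A minimal sketch of the kind of non-owning view the refactor describes (the actual template in llama-context may differ): a typed pointer plus an element count, replacing separate raw-pointer/size pairs.

```cpp
#include <cstddef>

template <typename T>
struct buffer_view {
    T *    data = nullptr; // non-owning
    size_t size = 0;       // element count, not bytes

    T &       operator[](size_t i)       { return data[i]; }
    const T & operator[](size_t i) const { return data[i]; }

    T * begin() const { return data; }
    T * end()   const { return data + size; }

    bool empty() const { return size == 0; }
};

// e.g. sampling_info could then hold (illustrative field names):
//   buffer_view<float>       logits;
//   buffer_view<float>       probs;
//   buffer_view<llama_token> sampled;
```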
JJJYmmm fc0fe40049
models : support qwen3.5 series (#19468)
* support qwen3.5 series

* remove deepstack for now, and some code clean

* code clean

* add FULL_ATTENTION_INTERVAL metadata

* code clean

* reorder v heads for linear attention to avoid expensive interleaved repeat
2026-02-10 18:00:26 +02:00
Georgi Gerganov 972f323e73
revert : "[Model] Qwen3.5 dense and MoE support (no vision) (#19435)" (#19453)
This reverts commit 39bf692af1.
2026-02-09 14:57:51 +02:00
Piotr Wilkin (ilintar) 39bf692af1
[Model] Qwen3.5 dense and MoE support (no vision) (#19435)
* Unified delta net handling

* Remove old methods.

* Refactor and optimize

* Adapt autoregressive version from @ymcki

* Change to decay mask approach

* Fix bad permute

* Qwen 3.5 support

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes

* Use inheritance, remove unneeded conts

* Not like this!

* Remove ggml.h explicit import

* Remove transformers, fix the views

* ACTUALLY fix views, make super calls explicit in conversion.

* Fix conversion again

* Remove extra ggml.h imports

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-09 00:24:08 +01:00
forforever73 b83111815e
model : support Step3.5-Flash (#19283)
* Support Step3.5-Flash

* fix: norm.weight + 1 (HF zero_centered=true)

* step35: simplify GGUF conversion + drop redundant rope KVs

* Address review feedback

* rename limits -> clamp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* rename swiglu limits -> swiglu clamp in LLM_KV

* avoid CI fail

* Apply suggestions from code review

* Apply suggestions from code review

* disabled KV shifting for LLM_ARCH_STEP35

* Apply suggestions from code review

* mistakenly removed cmath

* add model size && apply missed suggestion

* assert partial_rotary_factors

* fix CI errors

* load freq_base_swa

---------

Co-authored-by: lvyichen <lvyichen@stepfun.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-06 21:06:14 +01:00
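For context on the `norm.weight + 1` bullet: with HF's zero_centered=true, the checkpoint stores the norm gamma as w - 1, so either the converter adds 1 to the tensor once, or inference applies (1 + w) instead of w. A minimal scalar RMSNorm sketch under that assumption; names and shapes are illustrative, not the repo's code.

```cpp
#include <cmath>
#include <vector>

// assumes w.size() == x.size(); w holds gamma - 1 (zero_centered=true)
static void rms_norm_zero_centered(const std::vector<float> & x,
                                   const std::vector<float> & w,
                                   std::vector<float> & y,
                                   float eps = 1e-6f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;
    const float scale = 1.0f / std::sqrt(ss / x.size() + eps);
    y.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale * (1.0f + w[i]); // the "+ 1" from the commit
    }
}
```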
Lasse Lauwerys 06bf3796f4
unicode : MSVC regex fix (#19340)
* Fix model loading regex error

* Change comments

* Use const_iterator and remove specializations

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
2026-02-06 15:56:13 +02:00
ymcki 3688c4f504
Kimi-Linear support (backend agnostic + MLA KV cache) (#18755)
* kimi linear model implementation

* kimi linear convert_hf_to_gguf

* kimi linear constants.py tensor_mapping.py

* Kimi Linear ggml.h

* kimi linear ggml-cpu

* Kimi Linear ggml-cuda

* Kimi Linear ggml.c

* kimi linear src/llama

* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning

* remove type mismatch warning

* read MoE params

* removed some hard coded code

* removed all hard code

* use DeepseekV2 tokenizer

* removed unnecessary internal methods called by the old set_vocab of KimiLinear

* rewrite get_vocab for KimiLinear. Removed all kda_scan code

* removed all traces of kda_scan

* reduce OP count by 1 due to removal of kda_scan

* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache

* set n_embd_head_k/v to ensure kv cache works

* don't quantize conv1d of Kimi Linear

* Kimi Linear backend agnostic

* removed LOG_INFO

* naive chunking form implemented

* fixed some comments

* add Kimi-K2 specific tokens to be recognized as EOG

* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682

* replaced Akk and Aqk with mul_mat and clamp

* no clamp version

* Moved Aqk computation out of the loop

* fixed typo and split wkv_b into wk_b and wv_b

* MLA KV cache support

* fix trailing spaces

* moved const llama_model & model; around to follow qwen3next format and see if it can pass the -Wunused-private-field error

* fix trailing whitespace

* removed trailing whitespaces in empty line + make sure indentation is multiple of 4

* try to make lint happy

* remove blank lines to make lint happy

* removed at least blank line containing white space

* fixed flake8 complaints locally

* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement

* removed Kimi-Linear specific change that causes failure at server-windows

* removed private: from kimi_linear to make build checks happy

* removed unnecessary ggml_cont before ggml_reshape

* created static function causal_conv1d to abstract similar code for q/k/v

* merged dt_bias to SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.

* reverted to original

* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.

* remove DT_B from constants.py. remove one comment line in llama-model.cpp

* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. switch the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight

* remove ssm_o_norm_b

* remove ssm_o_norm_b

* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k

* removed all ggml_cont before ggml_reshape_4d

* Whitespace

* replaced all hparams.get with find_hparams

* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp

* use is_mla to switch between different mem_hybrid types

* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC

* removed if else for required parameters kv_lora_rank and qk_rope_head_dim

* add back ggml_cont for Vcur

* minor changes

* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp

* f16 gguf cannot run without context length

* made a mistake of adding back n_ctx parsing

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-06 11:39:58 +01:00
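The `causal_conv1d` helper mentioned in the commits above builds ggml graph ops; as a reference for its semantics only, here is a standalone scalar sketch of a causal (left-zero-padded) 1-D convolution, where the output at time t depends only on inputs at times <= t.

```cpp
#include <vector>

// x: input sequence of length T, w: kernel of length K
// implicit left zero-padding, so y has the same length as x
static std::vector<float> causal_conv1d(const std::vector<float> & x,
                                        const std::vector<float> & w) {
    const int T = (int) x.size();
    const int K = (int) w.size();
    std::vector<float> y(T, 0.0f);
    for (int t = 0; t < T; ++t) {
        for (int k = 0; k < K; ++k) {
            const int src = t - k; // only past and current samples
            if (src >= 0) {
                y[t] += w[k] * x[src];
            }
        }
    }
    return y;
}
```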
Daniel Bevenius e696cfc016
llama : rename llama-sampling to llama-sampler (#19363)
This commit addresses the TODO in llama-sampling.h to rename that header
and the implementation to llama-sampler.
2026-02-06 07:26:54 +01:00
Xuan-Son Nguyen e0c93af2a0
debug: make common_debug_print_tensor readable (#19331)
* debug: make common_debug_print_tensor readable

* editorconfig
2026-02-04 17:55:31 +01:00
Xuan-Son Nguyen 8abcc70a74
model: (qwen3next) correct vectorized key_gdiff calculation (#19324)
* model: (qwen3next) correct vectorized key_gdiff calculation

* move transpose to outside of loop
2026-02-04 13:09:58 +01:00
Georgi Gerganov faa1bc26ee
sampling : delegate input allocation to the scheduler (#19266)
* sampling : delegate input allocation to the scheduler

* graph : compute backend samplers only if needed
2026-02-03 22:16:16 +02:00
Sigbjørn Skjæret a6fd8ca1fe
models : remove unnecessary cont in openelm (#19289) 2026-02-03 14:20:57 +01:00
Alexey Dubrov 1efb5f7ae1
vocab: add Falcon-H1-Tiny-Coder FIM tokens (#19249) 2026-02-03 08:31:01 +02:00
Georgi Gerganov 6fdddb4987
metal : support virtual devices (#18919)
* metal : support virtual devices

* cont : manage buffer type context memory

* metal : add events

* cont : implement cpy_tensor_async
2026-02-02 14:29:44 +02:00
Christian Kastner 7a4ca3cbd9
docs : Minor cleanups (#19252)
* Update old URLs to github.com/ggml-org/

* Bump copyrights
2026-02-02 08:38:55 +02:00
Daniel Bevenius f3bc98890c
memory : clarify comments for r_l and s_l tensors [no ci] (#19203)
This commit updates the comments in state_write_data to clarify that it
is handling the R and S tensors and not Key and Value tensors.
2026-01-30 15:18:41 +01:00
Daniel Bevenius 83bcdf7217
memory : remove unused tmp_buf (#19199)
This commit removes the unused tmp_buf variable from llama-kv-cache.cpp
and llama-memory-recurrent.cpp.

The tmp_buf variable was declared but never used, but since it has a
non-trivial constructor/destructor we don't get an unused-variable
warning about it.
2026-01-30 10:37:06 +01:00
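The warning behavior the message above relies on, in a minimal example: compilers such as GCC and Clang only emit -Wunused-variable when constructing the variable has no effect, and a std::vector's non-trivial constructor/destructor (potential allocation and free) suppresses the diagnostic.

```cpp
#include <cstdint>
#include <vector>

void example() {
    int unused_int;               // -Wunused-variable fires here
    std::vector<uint8_t> tmp_buf; // silent: non-trivial ctor/dtor, so the
                                  // compiler assumes the declaration matters
}
```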
Georgi Gerganov 4fdbc1e4db
cuda : fix nkvo, offload and cuda graph node properties matching (#19165)
* cuda : fix nkvo

* cont : more robust cuda graph node property matching

* cont : restore pre-leafs implementation

* cont : comments + static_assert
2026-01-29 18:45:30 +02:00