Commit Graph

724 Commits

Author SHA1 Message Date
ddh0 85b6e52e39
Merge branch 'ggml-org:master' into power-law-sampler 2025-12-15 21:23:25 -06:00
ddh0 1c2d2e900d simplify target computation
last commit with debug logging!
2025-12-15 21:02:11 -06:00
HelloKS 9d52f17ae3
model : add KORMo model (#18032)
* vocab: add KORMo Tokenizer

* model: add KORMoForCausalLM

* vocab: change pretokenizer to qwen2

* lint: fix unintended line removal

* model: make qwen2 bias tensor optional

* model: use qwen2 architecture for KORMo
2025-12-15 18:51:43 +01:00
ssweens 4529c660c8
kv-cache: Fix state restore with fragmented cache (#17982)
* kv-cache : fix state restore with fragmented cache (#17527)

Change find_slot to allow non-contiguous allocation during state restore. Fixes the 'failed to find available cells in kv cache' error when restoring state to a fragmented cache.

* tests : update logic

* cleanup: tightened state_read_meta sig, added is_contiguous case

* fix: state_read_meta arg reorder loose ends

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-15 19:28:35 +02:00
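The fix above boils down to relaxing find_slot so that state restore no longer demands one contiguous run of free cells. A minimal sketch of that relaxation follows, with hypothetical names (`cell_used`, `n_tokens`, `contiguous`); the actual llama.cpp signature and data structures differ:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the relaxed slot search; the real llama_kv_cache
// code differs. With contiguous == false (state restore), free cells may be
// gathered from anywhere, so a fragmented cache no longer fails the restore.
static std::vector<uint32_t> find_slot(const std::vector<bool> & cell_used,
                                       uint32_t n_tokens, bool contiguous) {
    std::vector<uint32_t> slots;
    if (contiguous) {
        // original behavior: require one unbroken run of n_tokens free cells
        uint32_t run = 0;
        for (uint32_t i = 0; i < cell_used.size(); ++i) {
            run = cell_used[i] ? 0 : run + 1;
            if (run == n_tokens) {
                for (uint32_t j = i + 1 - n_tokens; j <= i; ++j) {
                    slots.push_back(j);
                }
                return slots;
            }
        }
    } else {
        // state restore: collect any free cells until enough are found
        for (uint32_t i = 0; i < cell_used.size() && slots.size() < n_tokens; ++i) {
            if (!cell_used[i]) {
                slots.push_back(i);
            }
        }
        if (slots.size() == n_tokens) {
            return slots;
        }
    }
    return {}; // empty result maps to the old 'failed to find available cells' error
}
```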
ddh0 0344068cf1
remove extraneous logging 2025-12-15 09:35:44 -06:00
ddh0 9c50b573f5
improve logging messages in llama_sampler_power_law 2025-12-15 09:25:05 -06:00
ddh0 6e66095e1f
Merge branch 'ggml-org:master' into power-law-sampler 2025-12-15 09:07:13 -06:00
Johannes Gäßler b1f3a6e5db
llama: automatically set parameters not set by the user in a way that maximizes GPU utilization (#16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-15 09:24:59 +01:00
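Most of the trail above is CI churn, but the core of the PR is a fitting pass: estimate the memory a candidate configuration needs and keep the largest one that fits in free device memory. A hypothetical sketch of that idea; the byte estimates and the greedy search are illustrative assumptions, not the PR's actual logic:

```cpp
#include <cstdint>

// Hypothetical illustration of fitting n_gpu_layers to free VRAM; the real
// llama-fit-params logic is considerably more involved.
static int fit_n_gpu_layers(int n_layers, uint64_t free_vram,
                            uint64_t bytes_per_layer, uint64_t overhead_bytes) {
    for (int n = n_layers; n >= 0; --n) {
        const uint64_t need = overhead_bytes + bytes_per_layer * (uint64_t) n;
        if (need <= free_vram) {
            return n; // largest layer count that fits
        }
    }
    return 0; // nothing fits: keep all layers on the CPU
}
```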
ddh0 4e04bd1ce2 log sampler init values 2025-12-14 23:14:51 -06:00
ddh0 4e28eb2ffe format (double) 2025-12-14 22:11:34 -06:00
ddh0 b5ed673ce9 fix logging 2025-12-14 22:08:36 -06:00
ddh0 493bf301ff silence `missing initializer for member` 2025-12-14 21:55:45 -06:00
ddh0 6934780669 optimize 2025-12-14 16:26:15 -06:00
ddh0 36b526d768
Merge branch 'master' into power-law-sampler 2025-12-14 15:43:49 -06:00
Xuan-Son Nguyen 0759b09c90
graph: add f_attn_temp_offset (#18025) 2025-12-14 13:05:59 +01:00
ddh0 ec54fe5f14 no, but does this? 2025-12-14 02:54:14 -06:00
ddh0 2a3f579d1f does this fix it? 2025-12-14 01:55:02 -06:00
ddh0 9613c48172 with logging 2025-12-14 00:36:59 -06:00
Georgi Gerganov 609a2d0268
models : fix YaRN regression + consolidate logic (#18006)
* models : fix YaRN regression + consolidate logic

* cont : fix the fix

* cont : remove header

* cont : add header
2025-12-14 08:34:56 +02:00
ddh0 a96ddd743a re-write + change parameters + simplify 2025-12-13 22:15:03 -06:00
ddh0 67a733670e
Merge branch 'ggml-org:master' into power-law-sampler 2025-12-13 17:27:35 -06:00
Jeff Bolz 5266379bca
llama_context: synchronize before reallocating output buffer (#17974) 2025-12-13 09:19:51 -06:00
ddh0 1879fc6dc6
Merge branch 'ggml-org:master' into power-law-sampler 2025-12-13 01:17:53 -06:00
ddh0 824bb3aa6e fix compiler warning, add commented-out logging per token 2025-12-13 00:23:15 -06:00
ddh0 0a19a3fd6c remove old debug log, style nit 2025-12-12 23:45:45 -06:00
ddh0 94cb883ed9 copy from author
ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069
2025-12-12 23:19:08 -06:00
Georgi Gerganov 7bed317f53
models : fix the attn_factor for mistral3 graphs + improve consistency (#17945)
* models : fix the attn_factor for mistral3 graphs

* cont : rework attn_factor correction logic

* cont : make deepseek2 consistent

* cont : add TODO

* cont : special-case DSv2

* cont : revert Mistral 3 Large changes

* cont : fix DS2 to use the original attn_factor

* cont : minor comments
2025-12-12 17:12:40 +02:00
ddh0 2d62bbea9f remove `target_range` param, make `target == 1` no-op, cleanup code 2025-12-11 22:43:10 -06:00
ddh0 b3aea57768 minor 2025-12-11 16:48:52 -06:00
ddh0 93169593b8 remove old unused code from algorithm 2025-12-11 16:46:17 -06:00
ddh0 4959878a74 improved comments 2025-12-11 16:27:14 -06:00
ddh0 ffe163911b add args, rename `queue_size` -> `window_size` 2025-12-11 15:16:11 -06:00
ddh0 374bfd4363 explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]` 2025-12-11 14:22:58 -06:00
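A minimal sketch of the clamp named in this commit; the field names come from the commit subject, while the surrounding sampler struct is assumed:

```cpp
#include <algorithm>

// Minimal sketch: keep the sampler targets inside [0.0, 1.0] as described in
// the commit subject. The enclosing sampler state is assumed.
static void clamp_targets(float & min_target, float & max_target) {
    min_target = std::clamp(min_target, 0.0f, 1.0f);
    max_target = std::clamp(max_target, 0.0f, 1.0f);
}
```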
ddh0 88fb0f3f32 add params to `struct common_params_sampling`, add reference to PR 2025-12-11 13:47:51 -06:00
ddh0 66e2d17c7f
Merge branch 'ggml-org:master' into power-law-sampler 2025-12-11 12:52:53 -06:00
Georgi Gerganov d9f8f60618
batch : fix sequence id ownership (#17915)
* batch : fix sequence id ownership

* cont : reduce allocations
2025-12-11 14:29:47 +02:00
ddh0 5ab4ff7e44 simplify constants 2025-12-10 22:30:14 -06:00
ddh0 774cf23ee5 initial commit for branch 2025-12-10 22:13:58 -06:00
Georgi Gerganov 4dff236a52
ggml : remove GGML_KQ_MASK_PAD constant (#17910)
* ggml : remove GGML_KQ_MASK_PAD constant

* cont : remove comment
2025-12-10 20:53:16 +02:00
Eric Zhang b677721819
model : Qwen3-Next-80B-A3B has 48 layers (#17898)
* model : Qwen3-Next-80B-A3B has 48 layers

* model : Add 80B-A3B type name
2025-12-10 15:22:40 +01:00
Rhys-T 63908b631a
cmake: fix Mach-O current version number (#17877)
PR #17091 set the VERSION of various libraries to 0.0.abcd, where abcd
is the LLAMA_BUILD_NUMBER. That build number is too large to fit in the
Mach-O 'current version' field's 'micro' part, which only goes up to
255. This just sets the Mach-O current version to 0 to get it building
properly again.

Fixes #17258.
2025-12-09 13:17:41 +02:00
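For context, Mach-O packs the current version into 32 bits as major.minor.micro = 16.8.8, which is why a build number in the thousands overflows the micro field. A small packing helper, written for illustration rather than taken from the PR:

```cpp
#include <cstdint>

// Mach-O stores 'current version' as major.minor.micro packed into
// 16.8.8 bits, so micro saturates at 255; a LLAMA_BUILD_NUMBER in the
// thousands cannot be encoded there. Illustrative helper, not PR code.
static uint32_t macho_pack_version(uint32_t major, uint32_t minor, uint32_t micro) {
    return (major << 16) | ((minor & 0xff) << 8) | (micro & 0xff);
}
```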
Sigbjørn Skjæret 42b12b5608
model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652)
* nit, DeepSeek V1 MoE is 16B

* base type on n_ff_exp instead
2025-12-09 12:15:06 +01:00
Aldehir Rojas e39502e74b
llama : add token matching support to llama-grammar (#17816)
* llama : add token support to llama-grammar

* fix inverse token comment

* refactor trigger_patterns to replay tokens instead of the entire string

* add token documentation

* fix test-llama-grammar

* improve test cases for tokens
2025-12-09 00:32:57 -06:00
philip-essential 1d2a1ab73d
model : support Rnj-1 (#17811)
* add support for rnj1

* refactor gemma3 to support rnj-1

* address review comments
2025-12-09 04:49:03 +01:00
Sigbjørn Skjæret c8554b66e0
graph : use fill instead of scale_bias in grouped expert selection (#17867)
* use fill instead of scale_bias in grouped expert selection

* do not explicitly use _inplace
2025-12-08 21:29:59 +01:00
Piotr Wilkin (ilintar) e4e9c4329c
Make graph_max_nodes vary by ubatch size (#17794)
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph

* Update src/llama-context.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add missing const

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:32:41 +01:00
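A hypothetical sketch of what "vary by ubatch size" can look like; the base budget and the scaling factor are invented for illustration, and the PR's actual formula lives in llama-context:

```cpp
#include <cstdint>

// Hypothetical sketch: grow the graph-node budget with the micro-batch size
// so that ops which chunk per ubatch do not exhaust the graph. The constants
// are illustrative, not the PR's values.
static uint32_t graph_max_nodes(uint32_t n_ubatch) {
    const uint32_t base       = 8192; // assumed fixed budget
    const uint32_t per_ubatch = 8;    // assumed growth factor
    return base + per_ubatch * n_ubatch;
}
```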
Xuan-Son Nguyen 4d3726278b
model: add llama 4 scaling for mistral-large (deepseek arch) (#17744) 2025-12-07 22:29:54 +01:00
Daniel Bevenius 444f00b0ec
llama : remove quantization sanity check (#17788)
* llama : remove quantization sanity check

This commit removes the quantization sanity check for attention layers.

The motivation for this is that there are hybrid models that have recurrent layers, expert layers, and attention layers. For these models the current check fails, as the expert layers are not taken into account. After consideration, it was decided that this check is not strictly necessary and can be removed to allow for more flexible model architectures.

* llama : remove unused pruned_attention_w and is_clip_model vars
2025-12-06 12:26:20 +01:00
Pascal 1be97831e4
fix: prevent segfault in tokenizer on highly repetitive input (#17786)
Add nosubs|optimize flags to std::regex constructors to prevent
catastrophic backtracking when processing prompts with repeated
identical characters (e.g., 'A' * 10000).

The nosubs flag disables subgroup capture, significantly reducing
memory usage and backtracking on uniform token sequences.
2025-12-05 13:52:23 +02:00
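This change maps directly onto the standard library. A self-contained sketch of the flag combination; the pattern below is a stand-in, not the tokenizer's actual regex:

```cpp
#include <regex>
#include <string>

int main() {
    // nosubs disables subgroup capture and optimize favors matching speed,
    // which together avoid catastrophic backtracking on uniform inputs such
    // as 10000 repeated 'A' characters. The pattern is a stand-in.
    const std::regex re("[A-Za-z]+|[0-9]+",
                        std::regex::nosubs | std::regex::optimize);
    const std::string input(10000, 'A');
    return std::regex_search(input, re) ? 0 : 1;
}
```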
Georgi Gerganov a67ef0f47f
llama : fix sanity checks during quantization (#17721) 2025-12-04 10:33:42 +02:00