Commit Graph

802 Commits

Oliver Simons 0a17687c72 Make backend dist sampler use same rnds as dist sampler
We sample in double precision and cast to float to match the random numbers of
llama_sampler_dist, which uses double precision (sampling from
std::uniform_real_distribution<double> and
std::uniform_real_distribution<float> with the same rng will produce
different sequences).
2025-12-19 11:43:19 +01:00
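
A standalone sketch of the effect described above (illustrative code, not part of the commit; the std::mt19937 engine is an assumption): with the same seed, the float and double distributions consume different amounts of engine output per draw and diverge immediately, so drawing in double and casting is what keeps the two samplers bit-identical.

```cpp
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng_f(42);
    std::mt19937 rng_d(42);

    std::uniform_real_distribution<float>  dist_f(0.0f, 1.0f);
    std::uniform_real_distribution<double> dist_d(0.0, 1.0);

    for (int i = 0; i < 4; ++i) {
        const float f = dist_f(rng_f);          // float path: typically one 32-bit draw
        const float d = (float) dist_d(rng_d);  // double path: typically two draws, then cast
        printf("draw %d: float = %.8f  double-cast = %.8f\n", i, f, d);
    }
    return 0;
}
```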
Georgi Gerganov 3b3f5fed31
common : disable backend sampling when grammar is involved 2025-12-18 10:52:21 +02:00
Georgi Gerganov eefdb0da17
Merge branch 'master' into HEAD 2025-12-18 10:12:47 +02:00
Johannes Gäßler 57c1e05643
llama: offload output layer to GPU first (#18148) 2025-12-18 08:12:18 +01:00
Julius Tischbein 4d4f4cacd1
llama : Async DirectIO model loading on Linux (#18012)
* Uncached model read

* Removing additional --mmap arg

* Removing trailing whitespaces

* Adding fallback when O_DIRECT is not supported

* Remove branching in llama-model-loader.cpp and reduce code duplication in llama-mmap.cpp

* Adding [[maybe_unused]] attribute for Mac and Windows.

* File seek aligned

* Removing all branches for direct_io in llama-model-loader.cpp

* Always use alignment from llama_file

* use_mmap=true
2025-12-18 08:27:19 +02:00
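
A minimal sketch of the O_DIRECT-with-fallback pattern the bullets above describe (hypothetical code, not the actual llama-mmap.cpp implementation; the 4096-byte alignment is an assumed block size):

```cpp
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    if (argc < 2) {
        return 1;
    }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);  // uncached (DirectIO) read path
    if (fd < 0) {
        fd = open(argv[1], O_RDONLY);             // fallback when O_DIRECT is not supported
    }
    if (fd < 0) {
        return 1;
    }

    const size_t align = 4096;  // assumed logical block size
    void * buf = nullptr;
    if (posix_memalign(&buf, align, align) != 0) {
        close(fd);
        return 1;
    }

    // O_DIRECT requires the buffer address, file offset and length to all be
    // multiples of the block size - hence the "file seek aligned" bullet above
    const ssize_t n = pread(fd, buf, align, 0);
    printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}
```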
Johannes Gäßler 8dcc3662a2
llama-fit-params: fix memory print (#18136) 2025-12-17 21:10:03 +01:00
Georgi Gerganov 4301e27319
common : restore grammar-based rejection sampling (#18137)
* common : restore grammar-based rejection sampling

* sampling : allow null samplers
2025-12-17 19:46:00 +02:00
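
A generic sketch of grammar-based rejection sampling as the title uses the term (illustrative only; grammar_accepts and the greedy pick are stand-ins, not the common/sampling.cpp code): sample normally first, and only run the grammar filter over the full candidate set when the sampled token is rejected.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct candidate {
    int   id;
    float logit;
};

static bool grammar_accepts(int id) {
    return id % 3 == 0;  // stand-in for a real grammar check
}

static int sample_greedy(const std::vector<candidate> & cands) {
    return std::max_element(cands.begin(), cands.end(),
        [](const candidate & a, const candidate & b) { return a.logit < b.logit; })->id;
}

int main() {
    std::vector<candidate> cands = { {1, 0.9f}, {2, 0.5f}, {3, 0.4f} };

    int tok = sample_greedy(cands);  // fast path: no grammar work
    if (!grammar_accepts(tok)) {
        // slow path: mask every rejected token, then sample again
        for (auto & c : cands) {
            if (!grammar_accepts(c.id)) {
                c.logit = -1e30f;
            }
        }
        tok = sample_greedy(cands);
    }
    printf("token = %d\n", tok);  // 3: the first pick (1) was rejected
    return 0;
}
```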
Tarek Dakhran 982060fadc
model: fix LFM2_MOE missing tensors (#18132) 2025-12-17 12:17:11 +01:00
Daniel Bevenius c5d44b8525
llama : fix typo in comment [no ci] 2025-12-17 09:02:30 +01:00
Johannes Gäßler d0794e89d9
llama-fit-params: force disable mlock (#18103) 2025-12-17 00:50:12 +01:00
Johannes Gäßler 9dcac6cf9f
llama-fit-params: lower ctx size for multi GPU (#18101) 2025-12-17 00:49:34 +01:00
Johannes Gäßler 0e49a7b8b4
llama-fit-params: fix underflow for dense models (#18095) 2025-12-17 00:47:37 +01:00
Xuan-Son Nguyen ef83fb8601
model: fix LFM2 missing tensors (#18105) 2025-12-16 19:07:43 +01:00
Johannes Gäßler ec98e20021
llama: fix early stop in params_fit if ctx is set (#18070) 2025-12-16 14:24:00 +01:00
Xuan-Son Nguyen 7f2b2f3c77
arch: refactor LLM_TENSOR_NAMES (#18051)
* arch: refactor LLM_TENSOR_NAMES

* update docs

* typo

* fix LLM_ARCH_NEMOTRON_H_MOE

* show more meaningful error message on missing tensor

* fix and tested LLM_ARCH_NEMOTRON_H_MOE
2025-12-16 13:22:30 +01:00
Piotr Wilkin (ilintar) a5251ca11d
Optimization: Qwen3 next autoregressive pass (#17996)
* It's Qwen3 Next, the lean mean token generation machine!

* Apply patches from thread

* Remove recurrent version, only keep chunked and autoregressive

* Remove unnecessary conts and asserts

* Remove more extra conts and asserts

* Cleanup masking
2025-12-16 11:59:53 +01:00
Xuan-Son Nguyen 3d86c6c2b5
model: support GLM4V vision encoder (#18042)
* convert ok

* no deepstack

* less new tensors

* cgraph ok

* add mrope for text model

* faster patch merger

* add GGML_ROPE_TYPE_MRNORM

* add support for metal

* move glm4v to dedicated graph

* convert: add norm_embd

* clip: add debugging fn

* working correctly

* fix style

* use bicubic

* fix mrope metal

* improve cpu

* convert to neox ordering on conversion

* revert backend changes

* force stop if using old weight

* support moe variant

* fix conversion

* fix convert (2)

* Update tools/mtmd/clip-graph.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* process mrope_section on TextModel base class

* resolve merge conflict

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 11:25:26 +01:00
Daniel Bevenius ad1b60abc4
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-16 09:45:08 +01:00
Chris Peterson 2aa45ef9e3
llama: Include algorithm header needed for C++23 (#18078) 2025-12-16 09:37:55 +02:00
Georgi Gerganov c560316440
graph : reuse SSM graphs (#16490)
* graph : reuse hybrid graphs

* graph : reuse recurrent graphs

* graph : fix reuse check for recurrent inputs

* memory : move the recurrent state into the memory context

* Revert "memory : move the recurrent state into the memory context"

This reverts commit 00f115fe810815d4a22a6dee0acc346131e970e1.

* cont : fix build
2025-12-16 09:36:21 +02:00
Daniel Bevenius 2995341730
llama : add support for NVIDIA Nemotron Nano 3 (#18058)
* llama : add support for NVIDIA Nemotron Nano 3

This commit adds support for the NVIDIA Nemotron Nano 3 model, enabling
the conversion and running of this model.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 07:19:26 +01:00
HelloKS 9d52f17ae3
model : add KORMo model (#18032)
* vocab: add KORMo Tokenizer

* model: add KORMoForCausalLM

* vocab: change pretokenizer to qwen2

* lint: fix unintended line removal

* model: make qwen2 bias tensor optional

* model: use qwen2 architecture for KORMo
2025-12-15 18:51:43 +01:00
ssweens 4529c660c8
kv-cache: Fix state restore with fragmented cache (#17982)
* kv-cache : fix state restore with fragmented cache (#17527)

Change find_slot to allow non-contiguous allocation during state restore. Fixes the 'failed to find available cells in kv cache' error when restoring state to a fragmented cache.

* tests : update logic

* cleanup: tightened state_read_meta sig, added is_contiguous case

* fix: state_read_meta arg reorder loose ends

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-15 19:28:35 +02:00
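
A minimal sketch of the allocation change (hypothetical code, not the actual find_slot implementation): when restoring state, the cells no longer have to form one contiguous run, so a fragmented cache with enough scattered free cells can still accept the sequence.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// collect up to n free cells, contiguous or not; the caller checks the count
static std::vector<uint32_t> find_free_cells(const std::vector<bool> & used, uint32_t n) {
    std::vector<uint32_t> cells;
    for (uint32_t i = 0; i < used.size() && cells.size() < n; ++i) {
        if (!used[i]) {
            cells.push_back(i);
        }
    }
    return cells;
}

int main() {
    // fragmented cache: the free cells are scattered, no run of 3 exists
    const std::vector<bool> used = { true, false, true, false, false, true, false, true };
    const auto cells = find_free_cells(used, 3);
    for (const auto c : cells) {
        printf("cell %u\n", c);  // 1, 3, 4
    }
    return 0;
}
```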
Johannes Gäßler b1f3a6e5db
llama: automatically set parameters not set by the user in a way that maximizes GPU utilization (#16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-15 09:24:59 +01:00
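
A hypothetical sketch of the general fitting idea (not the actual llama-fit-params implementation; the per-layer cost and free-VRAM figures are made up): pick the largest value of a parameter whose estimated memory use still fits into free memory.

```cpp
#include <cstdint>
#include <cstdio>

// made-up cost model: each offloaded layer costs a fixed amount of VRAM
static int64_t estimate_bytes(int n_gpu_layers) {
    const int64_t per_layer = 600LL * 1024 * 1024;
    return n_gpu_layers * per_layer;
}

int main() {
    const int64_t free_vram = 16LL * 1024 * 1024 * 1024;  // assumed 16 GiB free
    const int     n_layers  = 48;

    // binary search for the largest layer count that still fits
    int lo = 0, hi = n_layers;
    while (lo < hi) {
        const int mid = (lo + hi + 1) / 2;
        if (estimate_bytes(mid) <= free_vram) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    printf("suggested -ngl %d\n", lo);  // 27 with the figures above
    return 0;
}
```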
Georgi Gerganov 0086c246ee
Merge branch 'master' into HEAD 2025-12-14 16:44:30 +02:00
Xuan-Son Nguyen 0759b09c90
graph: add f_attn_temp_offset (#18025) 2025-12-14 13:05:59 +01:00
Georgi Gerganov 22c7f85b9c
Merge branch 'master' into HEAD 2025-12-14 10:19:58 +02:00
Georgi Gerganov 609a2d0268
models : fix YaRN regression + consolidate logic (#18006)
* models : fix YaRN regression + consolidate logic

* cont : fix the fix

* cont : remove header

* cont : add header
2025-12-14 08:34:56 +02:00
Jeff Bolz 5266379bca
llama_context: synchronize before reallocating output buffer (#17974) 2025-12-13 09:19:51 -06:00
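
A generic illustration of the ordering this fix enforces (std::async stands in for the backend; this is not llama_context code): reallocating a buffer while asynchronous work may still be writing into it pulls the memory out from under the writer, so the context must synchronize first.

```cpp
#include <future>
#include <vector>

int main() {
    std::vector<float> out(1024);

    // simulates the backend asynchronously writing the previous batch's results
    auto job = std::async(std::launch::async, [&out] {
        for (auto & x : out) {
            x = 1.0f;
        }
    });

    job.wait();        // "synchronize": wait for in-flight work to finish
    out.resize(4096);  // only now is reallocating the output buffer safe
    return 0;
}
```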
Georgi Gerganov 7bed317f53
models : fix the attn_factor for mistral3 graphs + improve consistency (#17945)
* models : fix the attn_factor for mistral3 graphs

* cont : rework attn_factor correction logic

* cont : make deepseek2 consistent

* cont : add TODO

* cont : special-case DSv2

* cont : revert Mistral 3 Large changes

* cont : fix DS2 to use the original attn_factor

* cont : minor comments
2025-12-12 17:12:40 +02:00
Georgi Gerganov 4d10b78e23
Merge branch 'master' into HEAD 2025-12-11 14:42:56 +02:00
Georgi Gerganov d9f8f60618
batch : fix sequence id ownership (#17915)
* batch : fix sequence id ownership

* cont : reduce allocations
2025-12-11 14:29:47 +02:00
Georgi Gerganov ab65b47a52
tests : run backend sampler tests always on the CPU 2025-12-11 14:14:47 +02:00
Georgi Gerganov 74b112e3e7
sampling : fix greedy 2025-12-11 13:37:02 +02:00
Georgi Gerganov 8544aba37f
sampling : generic ggml op support detection 2025-12-11 13:19:43 +02:00
Georgi Gerganov d5d16651a8
cont : fix build 2025-12-11 11:27:47 +02:00
Georgi Gerganov 54e9054017
sampling : optimize logit_bias sampler 2025-12-11 11:14:39 +02:00
Georgi Gerganov 4dff236a52
ggml : remove GGML_KQ_MASK_PAD constant (#17910)
* ggml : remove GGML_KQ_MASK_PAD constant

* cont : remove comment
2025-12-10 20:53:16 +02:00
Georgi Gerganov 804e7e3795
graph : respect sampler order for graph reuse 2025-12-10 20:40:15 +02:00
Georgi Gerganov 44d5c4b592
batch : fix sequence id ownership 2025-12-10 20:35:58 +02:00
Georgi Gerganov 38882247d3
Merge branch 'master' into HEAD 2025-12-10 17:07:21 +02:00
Eric Zhang b677721819
model : Qwen3-Next-80B-A3B has 48 layers (#17898)
* model : Qwen3-Next-80B-A3B has 48 layers

* model : Add 80B-A3B type name
2025-12-10 15:22:40 +01:00
Georgi Gerganov c02654eb7d
graph : make the compute graph constant with respect to active samplers 2025-12-10 16:19:18 +02:00
Georgi Gerganov 81cb5783c8
Merge branch 'master' into HEAD 2025-12-10 13:41:32 +02:00
Georgi Gerganov 34b407b41c
sampling : use host buffer type for inputs 2025-12-09 17:53:17 +02:00
Georgi Gerganov 92ff767918
llama : require backend samplers to be of type llama_sampler_chain 2025-12-09 15:38:37 +02:00
Rhys-T 63908b631a
cmake: fix Mach-O current version number (#17877)
PR #17091 set the VERSION of various libraries to 0.0.abcd, where abcd
is the LLAMA_BUILD_NUMBER. That build number is too large to fit in the
Mach-O 'current version' field's 'micro' part, which only goes up to
255. This just sets the Mach-O current version to 0 to get it building
properly again.

Fixes #17258.
2025-12-09 13:17:41 +02:00
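
A quick illustration of the overflow (the 16/8/8-bit major.minor.micro packing is the standard Mach-O version encoding; the build number below is made up):

```cpp
#include <cstdint>
#include <cstdio>

// the 32-bit Mach-O 'current version' packs major.minor.micro as 16/8/8 bits
static uint32_t macho_version(uint32_t major, uint32_t minor, uint32_t micro) {
    return (major << 16) | (minor << 8) | micro;  // valid only if micro <= 255
}

int main() {
    printf("0.0.255 -> 0x%08x (fits)\n", macho_version(0, 0, 255));

    const uint32_t build = 6543;  // made-up LLAMA_BUILD_NUMBER-sized value
    if (build > 255) {
        // 0.0.<build> would spill into the minor field, which is why the fix
        // pins the Mach-O current version to 0 instead
        printf("0.0.%u does not fit in the 8-bit micro part\n", build);
    }
    return 0;
}
```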
Sigbjørn Skjæret 42b12b5608
model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652)
* nit, DeepSeek V1 MoE is 16B

* base type on n_ff_exp instead
2025-12-09 12:15:06 +01:00
Georgi Gerganov 560ac16f7d
server : handle unsupported cases 2025-12-09 10:55:11 +02:00
Aldehir Rojas e39502e74b
llama : add token matching support to llama-grammar (#17816)
* llama : add token support to llama-grammar

* fix inverse token comment

* refactor trigger_patterns to replay tokens instead of the entire string

* add token documentation

* fix test-llama-grammar

* improve test cases for tokens
2025-12-09 00:32:57 -06:00