Commit Graph

720 Commits

Author SHA1 Message Date
Georgi Gerganov 88cca45bb8
sampling : fix top_p empty condition 2025-12-01 18:02:34 +02:00
Georgi Gerganov 04f2822a86
sampling : do not create empty samplers 2025-12-01 17:52:07 +02:00
Georgi Gerganov 4032ce2378
common : simplify sampler chain initialization 2025-12-01 17:11:11 +02:00
Oliver Simons 217469f07f Make backend's top_p sampler inclusive
In addition to matching the algorithm proposed in the original
[paper](https://arxiv.org/abs/1904.09751), this resolves the edge case
where `max_p > top_p` for a single logit, where the mask would
otherwise be empty (and we would thus sample from the whole vocabulary
with equal likelihood).
2025-12-01 15:28:06 +01:00
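The edge case described in this commit message can be illustrated with a small sketch. This is not the backend implementation: `top_p_keep` and its `inclusive` flag are hypothetical names, and the exclusive branch is only one plausible form of a non-inclusive mask, assumed here for comparison.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_keep(logits, top_p, inclusive=True):
    """Return the indices kept by top-p filtering, most probable first.

    Hypothetical helper for illustration only.
    """
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        if inclusive:
            # Keep the token that crosses the threshold, then stop.
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        else:
            # Assumed exclusive variant: keep a token only while the
            # running sum stays within top_p.
            cumulative += probs[i]
            if cumulative > top_p:
                break
            kept.append(i)
    return kept
```

With logits `[5.0, 0.0, 0.0]` the most probable token alone has probability above 0.9, so for `top_p = 0.9` the inclusive variant keeps `[0]` while the assumed exclusive variant keeps nothing.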
Oliver Simons ae0bb6a6da Factor out `ggml_sort` into its own function 2025-12-01 15:28:06 +01:00
Georgi Gerganov 16451d6bc3
Merge branch 'master' into HEAD 2025-12-01 14:47:50 +02:00
Xuan-Son Nguyen cd3c118908
model: support Ministral3 (#17644)
* conversion script

* support ministral 3

* maybe this is better?

* add TODO for rope_yarn_log_mul

* better ppl (tested on 14B-Instruct)

* Add Ministral3 support to Mistral format

* improve arch handling

* add sizes

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* nits

---------

Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-01 12:26:52 +01:00
Oliver Simons 8bee483c97 Fix backend_top_p_sampler
softmax(softmax) returns an almost uniform distribution, so we should
return the logits rather than the softmax.
2025-12-01 12:07:30 +01:00
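A minimal numeric sketch of why applying softmax twice flattens the distribution: the output of the first softmax lies in [0, 1], so after the second softmax no two probabilities can differ by more than a factor of e. The example logits are arbitrary and the code is plain Python, not the backend graph.

```python
import math

def softmax(values):
    # Numerically stable softmax over a plain list of floats.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 1.0, 0.0]  # arbitrary example values
once = softmax(logits)         # sharply peaked on the first token
twice = softmax(once)          # much flatter: ratios are bounded by e

print("softmax(logits):         ", [round(p, 4) for p in once])
print("softmax(softmax(logits)):", [round(p, 4) for p in twice])
```

If a downstream sampler applies its own softmax, handing it probabilities instead of logits therefore discards most of the model's preference between tokens.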
Aman Gupta 6eea666912
llama-graph: avoid expand_forward for fusion (#17633) 2025-12-01 11:12:48 +02:00
Daniel Bevenius cf0e1475c5
sampling : lower log level for output buffer reallocations [no ci]
This commit changes the logging level for output buffer reallocations
in the llama_context::output_reserve function from INFO to DEBUG.

The motivation is that these messages are currently logged at INFO
level, so when verbose logging is enabled for llama-cli they get mixed
with the output, for example:

```console
What is the capital of Sweden?output_reserve: reallocating output buffer from size 0.58 MiB to 1.74 MiB
 1. Stockholm
2. Helsinki
Based are the options
1. Stockholm
Explanation: Stockholm is the capital of
...
```
2025-12-01 09:13:47 +01:00
Georgi Gerganov 80742cbaeb
cont : naming 2025-11-30 11:24:30 +02:00
Georgi Gerganov c187003d81
llama : naming 2025-11-30 00:05:47 +02:00
Georgi Gerganov 1760bd69b3
llama : reserve graphs with samplers 2025-11-29 23:57:25 +02:00
Georgi Gerganov ff7b0bf632
llama : call backend_init once 2025-11-29 23:09:53 +02:00
Georgi Gerganov d8d98bb4bb
Merge branch 'master' into HEAD 2025-11-29 22:38:44 +02:00
Georgi Gerganov 9028ebfea8
llama : cleanup + naming 2025-11-29 22:37:07 +02:00
Georgi Gerganov fbc8f49f3c
llama : simplify 2025-11-29 17:01:00 +02:00
Diego Devesa e072b2052e
ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (#17276)
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.

* llama : update worst-case graph for unified cache

* ci : disable op offload in some tests

* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-28 17:33:23 +02:00
Georgi Gerganov 2464d1b3fc
sampling : simplify 2025-11-28 17:21:12 +02:00
Daniel Bevenius 8cac9dee45
sampling : use logits directly for min-p filtering 2025-11-28 16:12:05 +01:00
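Min-p keeps token i when p_i >= min_p * p_max. Because p_i / p_max = exp(l_i - l_max), the same test can be done on logits as l_i >= l_max + log(min_p), with no softmax needed. The sketch below checks that equivalence; the function names are hypothetical and `min_p` is assumed to be greater than zero.

```python
import math

def min_p_keep_from_logits(logits, min_p):
    # Threshold in logit space: l_max + log(min_p). Assumes min_p > 0.
    threshold = max(logits) + math.log(min_p)
    return [i for i, l in enumerate(logits) if l >= threshold]

def min_p_keep_from_probs(logits, min_p):
    # Reference version that goes through an explicit softmax.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    p_max = max(probs)
    return [i for i, p in enumerate(probs) if p >= min_p * p_max]
```

For logits `[2.0, 1.0, -1.0, -3.0]` and `min_p = 0.1`, both functions keep indices `[0, 1]`.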
Oliver Simons 333da805fe Add initial version for top-p sampling
As we only support static graphs for the time being and we don't know
the size of the output of top-p, we have to do value-scaling, the same
as for the min-p operator.

Further improvements can be applied to the unit-test (i.e. check for
equivalence of top_p happening on backend with top_p happening on cpu)
and also by constructing candidates and sorting those as opposed to
reversing the sort of the logits (this would be arange +
get_rows instead of argsort + get_rows)
2025-11-28 15:16:20 +01:00
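One common way to filter without changing the tensor shape is to keep the output the same length as the input and push rejected entries to negative infinity. The sketch below shows that idea for top-p. It is an assumption about what fixed-size filtering can look like, not a description of the value-scaling used in this commit, and `top_p_mask_logits` is a hypothetical name.

```python
import math

def top_p_mask_logits(logits, top_p):
    """Return a list of the same length with rejected logits set to -inf."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Visit tokens from most to least probable.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    masked = [-math.inf] * len(logits)
    cumulative = 0.0
    for i in order:
        masked[i] = logits[i]          # keep this token's logit unchanged
        cumulative += exps[i] / total
        if cumulative >= top_p:        # inclusive: stop after crossing top_p
            break
    return masked
```

A later softmax over the masked list assigns zero probability to the rejected tokens, so the number of surviving candidates never has to be known when the graph is built.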
Georgi Gerganov 117e2079a9
refactor : simplify and improve memory management 2025-11-28 16:09:42 +02:00
Daniel Bevenius 459b7ae7b9
squash! sampling : support intermixed backend/cpu samplers
Fix llama-save-load-state, which currently fails, by handling the case
where batch.logits is nullptr (as when loading state): space is then
allocated for all outputs as CPU logits.
2025-11-28 13:50:47 +01:00
Piotr Wilkin (ilintar) ff55414c42
model : Qwen3 Next (#16095)
* Qwen3 Next - cleaned up version

* Whitespaces and stuff

* Correct minor errors

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Misc. fixes.

* Clean up code, add missing hybrid qualifier

* Did someone transpose the SOLVE_TRI result matrix? Perhaps...

* Whitespace

* Proper tensors for cb calls

* Use llama-graph.h vertical alignment

* BROKEN: chunking

* Set new tensors as inputs.

* Proper chunk logic

* It's the circle of life...

* More shenanigans for n_seq > 1

* Nail in the coffin?

* Fix Windows build

* Eh, one fails on Windows, the other fails on Mac... just use general capture.

* quant : cleanup

* model : cleanup

* qwen3 : cleanup

* cont : cleanup

* cont : cleanup

* ggml : revert change

* qwen3 : cleanup

* cont : cleanup

* Readd cmath

* qwen3 : fix typo

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Usual suspects

* fix my bad suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-28 12:02:56 +01:00
Daniel Bevenius 9ad6522be6
squash! sampling : support intermixed backend/cpu samplers
Add a check that logits is not null, which can happen for embeddings.
2025-11-28 08:57:48 +01:00
Daniel Bevenius 74be332e24
sampling : support intermixed backend/cpu samplers
This commit updates the backend sampling implementation to support
intermixed usage of backend and CPU samplers within the same batch.

The initial implementation was developed as an all-or-nothing solution:
either perform backend sampling for the entire batch, or perform CPU
sampling for the entire batch.

The motivation for this change is to support batches with mixed
sequences. For example, we may have a backend sampler configured for
sequence 0, while sequence 1 in the same batch uses CPU sampling. This
was not supported in the initial implementation.

This issue manifested in llama-server with the webui: decoding with
backend samplers would work initially, but after changing to CPU
sampling, a slot (sequence) could still be using a backend sampler.
This meant that logits in output_reserve would not be allocated,
resulting in an error.

The solution in this commit inspects the batch to determine which
sampling modes are needed and allocates buffers accordingly. However,
there is a known inefficiency: when we have intermixed backend/CPU
samplers in the same batch, we currently copy all logits to the host,
even for sequences using backend samplers.

Added test_backend_cpu_mixed_batch to verify correct behavior with
mixed backend/CPU samplers in a single batch, including dynamic
sampler switching between decode calls.
2025-11-28 08:38:05 +01:00
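The batch inspection described above can be sketched roughly as follows. Every name here (`plan_output_buffers`, `output_seq_ids`, `backend_sampler_seqs`, the dictionary keys) is hypothetical and chosen for illustration; the real logic lives in the C++ context code and may differ.

```python
def plan_output_buffers(output_seq_ids, backend_sampler_seqs):
    """Decide which output buffers a batch needs.

    output_seq_ids:       sequence id of each output in the batch (hypothetical)
    backend_sampler_seqs: set of sequence ids with a backend sampler (hypothetical)
    """
    needs_backend = any(s in backend_sampler_seqs for s in output_seq_ids)
    needs_cpu = any(s not in backend_sampler_seqs for s in output_seq_ids)
    return {
        # Buffer for tokens already sampled on the backend.
        "sampled_tokens": needs_backend,
        # Host logits buffer for CPU sampling. In a mixed batch it would
        # hold logits for all outputs, matching the known inefficiency
        # noted in the commit message.
        "host_logits": needs_cpu,
    }
```

For example, `plan_output_buffers([0, 1], {0})` requests both buffers, which is the mixed case the commit adds support for.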
Georgi Gerganov c386114922
arch : add description about LLM_TENSOR_INFOS (#17550) 2025-11-27 16:34:13 +02:00
Georgi Gerganov 6783b11fb0
models : fix LFM2 tensors (#17548) 2025-11-27 16:04:29 +02:00
Daniel Bevenius 172208afbf
sampling : add comments about backend sampler [no ci]
This commit adds a comment to llama_context's constructor explaining why
backend samplers are initialized early in the process.
2025-11-27 14:59:52 +01:00
Daniel Bevenius d9d736102b
sampling : use argmax for min-p sampling 2025-11-27 07:38:44 +01:00
Daniel Bevenius b45d504e70
sampling : add min-p backend sampler 2025-11-26 10:50:58 +01:00
Daniel Bevenius ec047e12ee
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 15:16:44 +01:00
Georgi Gerganov 583cb83416
ggml : add ggml_top_k (#17365)
* ggml : add ggml_top_k

* cont : add ggml_argsort_top_k

* metal : add top_k support

* ggml : cleanup

* tests : add virtual err() function for test_case

* ggml : add comments
2025-11-25 15:31:43 +02:00
Daniel Bevenius 2b4c7927ee
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 06:10:33 +01:00
Aaron Teo 877566d512
llama: introduce support for model-embedded sampling parameters (#17120) 2025-11-25 09:56:07 +08:00
Daniel Bevenius 134e6940ca
llama : skip output reordering for single token batches (#17466)
This commit adds a check to skip the output reordering logic when
n_outputs == 1. With a single output token, the data is trivially
sorted and the reordering code is currently doing unnecessary work
(resetting and rebuilding output_ids to the same values).

The motivation for this change is improved code clarity and avoiding
confusion when debugging. While the performance impact is probably
negligible, this unnecessary work happens on every decode call in
llama-server when processing batches with single-token outputs.
2025-11-24 21:06:17 +01:00
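The early exit can be pictured with a small sketch. The data layout is an assumption made for illustration (a list of output rows plus the batch position each row belongs to) and `reorder_outputs` is a hypothetical name, not the function touched by this commit.

```python
def reorder_outputs(output_rows, batch_positions):
    """Return output rows ordered by their position in the batch."""
    n_outputs = len(output_rows)
    if n_outputs == 1:
        # A single output is trivially sorted: skip the rebuild entirely.
        return list(output_rows)
    order = sorted(range(n_outputs), key=lambda i: batch_positions[i])
    return [output_rows[i] for i in order]
```

The single-output branch returns the same result the general path would, so the check changes only the amount of work done, not the outcome.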
Daniel Bevenius a02adf4211
sampling : add assertions for contiguous tensors in async copy functions 2025-11-24 21:01:06 +01:00
Georgi Gerganov 883a87043a
samplers : add missing cont 2025-11-24 21:46:57 +02:00
Daniel Bevenius 25f33806d3
sampling : add debug log when backend sampler selects token
This commit adds a debug log statement in llama_sampler_sample
to indicate when a backend sampler has selected a token for a given
index.

The modification helps in tracing the sampling process and understanding
the flow of control when backend samplers are used.
2025-11-24 15:03:41 +01:00
Daniel Bevenius 8eb9b4769d
sampling : remove redundant checks for stride and size [no ci] 2025-11-24 13:53:29 +01:00
Daniel Bevenius 4a90583d7d
sampling : cleanup and clarify output_reserve 2025-11-24 13:26:18 +01:00
Daniel Bevenius 7816f0bb56
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-24 07:44:06 +01:00
william pan 4902eebe33
models : Added support for RND1 Diffusion Language Model (#17433)
* Converted RND1 model to GGUF weights

* RND1 llama.cpp support v1

* RND1 llama.cpp support v2 non causal bug

* RND1 llama.cpp support v3 documentation

* RND1 llama.cpp support v4 clean code

* linting issues

* RND1 pr fixes v1

* RND1 pr fixes v2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Diffusion documentation edits

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-24 14:16:56 +08:00
Daniel Bevenius 9e273f7aa4
sampling : fix copying both sampled tokens and logits/probs from backend
This commit fixes the issue where both sampled tokens and logits/probs
were not being copied correctly from the backend to the host when
multiple backend samplers were used.

A test for this scenario has also been added to ensure that both types
of data are copied correctly when different backend samplers are
employed.
2025-11-23 13:12:01 +01:00
Daniel Bevenius ae23d2d2c1
sampling: clarify candidate ids usage in comments 2025-11-23 11:28:19 +01:00
Daniel Bevenius 65500d05ab
sampling : add stride variable for clarity 2025-11-23 11:27:54 +01:00
Daniel Bevenius 79b8cf2a75
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-21 16:38:32 +01:00
ubergarm 23bc779a6e
model : detect GigaChat3-10-A1.8B as deepseek lite (#17420)
* Detect GigaChat3-10-A1.8B as deepseek lite

Hardcodes a check on the number of layers to detect the lite version of deepseek.

* Add comment identifying deepseek lite variants

deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B
2025-11-21 14:51:38 +01:00
Daniel Bevenius 61ffe41dc1
sampling : use pinned memory for backend sampling buffers 2025-11-21 14:02:16 +01:00
Xuan-Son Nguyen 054a45c3d3
grammar: fix regression caused by #17381 (#17412)
* grammar: fix regression caused by #17381

* more readable
2025-11-20 18:35:10 +01:00