Commit Graph

75 Commits

Author SHA1 Message Date
Georgi Gerganov 74b112e3e7
sampling : fix greedy 2025-12-11 13:37:02 +02:00
Georgi Gerganov 8544aba37f
sampling : generic ggml op support detection 2025-12-11 13:19:43 +02:00
Georgi Gerganov d5d16651a8
cont : fix build 2025-12-11 11:27:47 +02:00
Georgi Gerganov 54e9054017
sampling : optimize logit_bias sampler 2025-12-11 11:14:39 +02:00
Georgi Gerganov 92ff767918
llama : require backend samplers to be of type llama_sampler_chain 2025-12-09 15:38:37 +02:00
Georgi Gerganov 560ac16f7d
server : handle unsupported cases 2025-12-09 10:55:11 +02:00
Georgi Gerganov f3beb22b17
sampling : handle n_probs case 2025-12-08 21:30:10 +02:00
Georgi Gerganov 72e3681073
sampling : fix top-p 2025-12-07 17:11:50 +02:00
Georgi Gerganov 8ef5f900db
cont : fixes 2025-12-07 15:45:00 +02:00
Georgi Gerganov 30742a6ff5
sampling : expand support (wip) 2025-12-06 16:51:56 +02:00
Georgi Gerganov cf74b1a8ec
sampling : fix candidates logic 2025-12-05 14:24:28 +02:00
Georgi Gerganov 7864074fdb
sampling : fix outputs and device checks 2025-12-04 19:33:01 +02:00
Georgi Gerganov 6958d41366
sampling : check backend support during init 2025-12-04 17:29:08 +02:00
Georgi Gerganov 1bde70785d
sampling : remove redundant calls to ggml_build_forward_expand 2025-12-04 14:25:28 +02:00
Georgi Gerganov fce571ee51
sampling : simplify temp sampling 2025-12-04 14:23:02 +02:00
Daniel Bevenius ac9e164714
sampling : fix backend temp sampling to use logits masking 2025-12-04 09:39:20 +01:00
Georgi Gerganov cce3b2a8ad
sampling : minor cleanup 2025-12-03 15:39:44 +02:00
Daniel Bevenius aad5a6afd7
sampling : implement temp_ext_backend sampling
This commit implements the apply function for the extended temperature
sampling.
2025-12-02 17:26:04 +01:00
Daniel Bevenius db8972e251
squash! sampling : fix backend temp sampler for zero temperature
This modifies the parent commit to simply return the most probable token
instead of masking the logits.
2025-12-02 11:53:29 +01:00
Daniel Bevenius 739b597804
sampling : fix backend temp sampler for zero temperature
This commit fixes the implementation of the temperature-based sampler
for the case when the temperature is set to zero. This now correctly
selects the most probable token by masking out all other tokens in the
logits.
2025-12-02 09:13:07 +01:00
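The zero-temperature behaviour described in the commit above can be illustrated with a minimal plain-Python sketch. It is not the ggml graph code from the commit. The function name `apply_temperature` and the choice of negative infinity as the mask value are assumptions made for illustration only.

```python
import math

def apply_temperature(logits, temp):
    """Hypothetical helper: scale logits by 1/temp, or mask to the argmax when temp <= 0."""
    if temp <= 0.0:
        # Greedy case: keep only the largest logit and mask out every other token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [x if i == best else -math.inf for i, x in enumerate(logits)]
    # Regular case: dividing by the temperature sharpens (temp < 1) or flattens (temp > 1).
    return [x / temp for x in logits]

masked = apply_temperature([1.0, 3.0, 2.0], 0.0)
scaled = apply_temperature([1.0, 3.0], 2.0)
```

With every other logit masked, any later softmax puts all probability mass on the single remaining token, so sampling reduces to a greedy pick.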
Georgi Gerganov 88cca45bb8
sampling : fix top_p empty condition 2025-12-01 18:02:34 +02:00
Georgi Gerganov 04f2822a86
sampling : do not create empty samplers 2025-12-01 17:52:07 +02:00
Georgi Gerganov 4032ce2378
common : simplify sampler chain initialization 2025-12-01 17:11:11 +02:00
Oliver Simons 217469f07f
Make backend's top_p sampler inclusive
In addition to matching the algorithm proposed in the original
[paper](https://arxiv.org/abs/1904.09751), this resolves the edge case
where `max_p > top_p` for a single logit, where the mask would
otherwise be empty (and we would thus sample from the whole vocabulary
with equal likelihood).
2025-12-01 15:28:06 +01:00
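The edge case in the commit above can be reproduced with a small plain-Python sketch that compares an inclusive and an exclusive cutoff. This is an illustration under stated assumptions, not the backend implementation. The function `top_p_mask` and its `inclusive` flag are hypothetical names.

```python
def top_p_mask(probs, top_p, inclusive=True):
    """Hypothetical helper: return a keep-mask over probs for nucleus (top-p) filtering."""
    # Visit tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = [False] * len(probs)
    cum = 0.0
    for i in order:
        cum += probs[i]
        if inclusive:
            # Inclusive: the token that crosses the threshold is kept as well.
            keep[i] = True
            if cum >= top_p:
                break
        else:
            # Exclusive: stop before the token that would cross the threshold.
            if cum > top_p:
                break
            keep[i] = True
    return keep

probs = [0.9, 0.06, 0.04]
inclusive_mask = top_p_mask(probs, 0.8, inclusive=True)
exclusive_mask = top_p_mask(probs, 0.8, inclusive=False)
```

When the most probable token alone exceeds `top_p`, the exclusive variant keeps nothing, while the inclusive variant always keeps at least that token.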
Oliver Simons ae0bb6a6da
Factor out `ggml_sort` into its own function 2025-12-01 15:28:06 +01:00
Oliver Simons 8bee483c97
Fix backend_top_p_sampler
softmax(softmax) returns a nearly uniform distribution, so we should
return the logits instead of the softmax.
2025-12-01 12:07:30 +01:00
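The flattening effect mentioned in the commit above can be checked numerically with a minimal plain-Python sketch. It is independent of ggml, and the example logits are made up. Because softmax outputs lie in [0, 1], a second softmax sees inputs that differ by at most 1, so the ratio between its largest and smallest output cannot exceed e. The result is close to uniform rather than exactly uniform.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 0.0, -4.0]   # made-up example values
probs = softmax(logits)          # sharply peaked on the first token
double = softmax(probs)          # softmax applied to probabilities: almost flat

peak_ratio = max(probs) / min(probs)
flat_ratio = max(double) / min(double)
```

Here `peak_ratio` is very large while `flat_ratio` stays below e, which is why a sampler that hands probabilities to a later softmax stage loses most of the original distribution.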
Georgi Gerganov c187003d81
llama : naming 2025-11-30 00:05:47 +02:00
Georgi Gerganov 9028ebfea8
llama : cleanup + naming 2025-11-29 22:37:07 +02:00
Georgi Gerganov fbc8f49f3c
llama : simplify 2025-11-29 17:01:00 +02:00
Georgi Gerganov 117e2079a9
refactor : simplify and improve memory management 2025-11-28 16:09:42 +02:00
Daniel Bevenius 25f33806d3
sampling : add debug log when backend sampler selects token
This commit adds a debug log statement in llama_sampler_sample
to indicate when a backend sampler has selected a token for a given
index.

The modification helps in tracing the sampling process and understanding
the flow of control when backend samplers are used.
2025-11-24 15:03:41 +01:00
Daniel Bevenius 79b8cf2a75
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-21 16:38:32 +01:00
Daniel Bevenius 61ffe41dc1
sampling : use pinned memory for backend sampling buffers 2025-11-21 14:02:16 +01:00
Georgi Gerganov 196f5083ef
common : more accurate sampling timing (#17382)
* common : more accurate sampling timing

* eval-callback : minor fixes

* cont : add time_meas impl

* cont : fix log msg [no ci]

* cont : fix multiple definitions of time_meas

* llama-cli : exclude chat template init from time measurement

* cont : print percentage of unaccounted time

* cont : do not reset timings
2025-11-20 13:40:10 +02:00
Daniel Bevenius ed4345bdd9
squash! common : fix regression caused by extra memory allocations during sampling
Apply the same changes to llama-sampling.cpp, llama_sampler_sample as
were applied in commit 38f408c25.
2025-11-20 07:56:33 +01:00
Daniel Bevenius 51fee29822
sampling : always populate logits for sampled probs
This commit updates common/sampler.cpp set_logits and
src/llama-sampling.cpp llama_sampler_sample to always populate the
logits field when backend sampled probabilities are available.

The motivation for this is to ensure that CPU samplers always have
access to the logits values even when probabilities have been produced
by backend samplers.
2025-11-19 07:14:11 +01:00
Daniel Bevenius 0da7e7dccc
sampling : remove version from sampler chain
This commit removes the version field from the sampler chain and instead
uses the sampler pointer itself for change detection.
2025-11-19 06:59:03 +01:00
Daniel Bevenius 82957a90f2
sampling : always expose sampled_ids
This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.

The motivation for this is that it enables both common/sampling.cpp
and src/llama-sampling.cpp to simplify their logic.

Not all backend samplers that process logits need to set the
sampled_tokens_id, as they may not change the order of the logits. For
example, the temperature sampler only scales the logits but does not
change their order. Similarly, the logit bias sampler only adds bias to
specific token ids but does not change the order of the logits. In
these cases there will not be a device-to-host copy of the sampled
token ids, and this is the use case where having this precomputed
list is useful.
2025-11-18 15:11:59 +01:00
Daniel Bevenius 7884b0e0ac
sampling : add support for backend sampling
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to enable sampling to be performed
directly on the backend as part of the computation graph being executed,
allowing for some or all of the sampling to be done on the backend.

For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.

Currently the backend sampling works in a similar manner to how
pooling works: it is a function that is called by build_graph, and the
sampler operations become part of the model's computation graph.
2025-11-17 16:15:58 +01:00
Marek Hradil jr. 6cd0cf72ce
fix : Dangling pointer for non-empty trigger words in lazy grammar construction (#17048)
* fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (#17047)

* Replace 'static' workaround, with keeping variable in scope for longer

* Create std::array directly and pass into llama_grammar_init_impl

* Add back the trigger pattern

* Missed array include
2025-11-14 14:35:26 +02:00
Georgi Gerganov 81086cd6a3
vocab : mark EOT token for Granite models (#16499)
* vocab : mark EOT token for Granite models

* sampling : fallback to EOS when EOT is not found
2025-10-10 17:17:31 +03:00
Georgi Gerganov cdedb70a99
sampling : optimize dist sampler (#15704)
ggml-ci
2025-09-03 18:16:26 +03:00
Georgi Gerganov e92d53b29e
sampling : optimize samplers by reusing bucket sort (#15665)
* sampling : optimize sorting using bucket sort in more places

ggml-ci

* sampling : do not sort in dist sampler

ggml-ci

* sampling : avoid heap allocations for sort buffers

ggml-ci

* common : add option to sort sampling candidates by probability

ggml-ci

* sampling : revert the change for preserving sort buffers

* sampling : use std::copy instead of memcpy

* sampling : clarify purpose of partial sort helpers

ggml-ci

* cont : remove wrong comment [no ci]

* common : update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-31 20:41:02 +03:00
Georgi Gerganov f9cd68398b
sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token

ggml-ci

* sampling : same for typical sampling

* tests : sampling tests use min_keep == 0

ggml-ci
2025-05-27 12:07:52 +03:00
DocShotgun ffc727203a
sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345) 2025-05-06 22:36:24 +02:00
oobabooga 91a86a6f35
sampling : don't consider -infinity values in top_n_sigma (#13344) 2025-05-06 20:24:15 +02:00
oobabooga 233461f812
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264)
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering

* revert: sampler ordering

* revert: VS' crappy auto-formatting

* revert: VS' crappy auto-formatting pt.2

* revert: my crappy eye sight...

* sampling: add XTC to Top-nσ sampler chain

* sampling: add Dyna. Temp. to Top-nσ sampler chain

* sampling: actually remove Top-nσ from sampler (oops)

* Integrate top_n_sigma into main sampler chain

* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA

* Formatting

* Lint

* Exit early in the sampler if nsigma < 0

---------

Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>
2025-05-05 22:12:19 +02:00
Georgi Gerganov d9d398f84f
sampling : when top-k <= 0 -> noop (#13173)
ggml-ci
2025-04-29 20:22:57 +03:00
Johannes Gäßler dd373dd3bf
llama: fix error on bad grammar (#12628) 2025-03-28 18:08:52 +01:00
Olivier Chafik 669912d9a5
`tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)
* sampler: turn lazy grammar trigger words to regexes

* add scripts/tool_bench.sh & .py

* constrain llama json output regardless of function name if matches at beginning

* update relaxed newline space rule in grammar tests

* support add_generation_prompt query parameter (useful for /apply_template)

* Update src/llama-grammar.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-05 13:05:13 +00:00