Commit Graph

4 Commits

Author SHA1 Message Date
Daniel Bevenius 61ffe41dc1
sampling : use pinned memory for backend sampling buffers 2025-11-21 14:02:16 +01:00
Daniel Bevenius 311c1a347f
sampling : ensure at most one output token per seq
This commit adds a check in the batch allocator to ensure that when
backend sampling is enabled, at most one output token is specified per
sequence.
2025-11-18 16:06:23 +01:00
Daniel Bevenius 71574f9273 sampling : enable all backend sampler tests
This commit enables all exisiting backend sampler tests in the
test-backend-sampler. Previously, some tests were disabled because
there were missing ggml operation implementations.
2025-11-18 07:31:54 +01:00
Daniel Bevenius 7884b0e0ac
sampling : add support for backend sampling
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to enable sampling to be performed
directly on the backend as part of the computation graph being executed,
allowing for some or all of the sampling to be done on the backend.

For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilites need to be
transferred back to system memory for further processing by CPU
samplers.

Currently the backend sampling works in a similar manner to how
pooling works, it is a function that is called by build_graph and the
sampler operations become part of the models computation graph.
2025-11-17 16:15:58 +01:00