From 0f17ccdee7e8cd8d1d452f45d2f9d1cc8448276f Mon Sep 17 00:00:00 2001
From: Daniel Bevenius
Date: Tue, 25 Nov 2025 08:12:42 +0100
Subject: [PATCH] examples : add info about hybrid sampling in batched [no ci]

---
 examples/batched/README.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/examples/batched/README.md b/examples/batched/README.md
index de2aa41fba..f10639220e 100644
--- a/examples/batched/README.md
+++ b/examples/batched/README.md
@@ -53,4 +53,17 @@ performed on the backend device, like a GPU.
     --backend_sampling --top-k 80 --backend_dist
 ```
 The `--verbose` flag can be added to see more detailed output and also show
-that the backend samplers are being used.
+that the backend samplers are being used. The above example will perform distribution
+sampling on the backend device and only transfer the sampled token ids back to the host.
+
+It is also possible to perform partial sampling on the backend, and then allow CPU samplers
+to process those results further. This is sometimes referred to as hybrid sampling.
+For an example of this we can remove `--backend_dist` from the above command:
+```bash
+./llama-batched \
+    -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf -p "Hello my name is" \
+    -np 4 -kvu \
+    --backend_sampling --top-k 80 -v
+```
+This will perform the top-k filtering on the backend device, and then transfer the filtered logits
+back to the host for sampling.