llama.cpp/examples
Daniel Bevenius 9e5e09d087
sampling : remove backend-dist option (wip)
This commit removes the `--backend-dist` option and instead uses the
configured `--samplers` chain to determine which samplers run on the
backend.

Backend sampling is still enabled with `--backend-sampling`, and the
sampler chain, either explicitly specified using `--samplers` or the
default, is automatically analyzed to determine which samplers can run
on the backend. The system finds the longest contiguous prefix of
backend-supported samplers at the start of the sampler sequence; a
short sketch of this selection follows the examples below.
For example:

* If the chain is `top-k -> temperature -> top-p`, and both `top-k` and
  `temperature` are backend-supported but `top-p` is not, then `top-k`
  and `temperature` will run on the backend, while `top-p` and
  subsequent samplers run on the CPU.

* If all configured samplers are supported, the final distribution
  sampling will also happen on the backend, transferring only the
  sampled token IDs back to the host.

* If the sampler chain starts with an unsupported sampler (e.g.,
  `penalties`), all sampling runs on the CPU. Note that this is
  currently the case with the default sampler chain, so to use backend
  sampling an explicit sampler chain must be specified. See below for
  an example.

The following shows how llama-cli can be run with backend sampling:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature'
```
In this case all sampling happens on the backend, since both `top_k`
and `temperature` are supported backend samplers.

To enable partial backend sampling (hybrid sampling), for example
running `top_k` and `temperature` on the backend and `top_p` on the
CPU, the following sampler chain could be specified:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature;top_p'
```
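For reference, the chain from the command above can also be built
programmatically with the existing `llama.h` sampler API. This is only
a sketch: the parameter values are illustrative, and the backend/CPU
split is decided inside the library when backend sampling is enabled,
not by anything in user code:
```cpp
#include "llama.h"

int main() {
    llama_sampler_chain_params params = llama_sampler_chain_default_params();
    llama_sampler * chain = llama_sampler_chain_init(params);

    // With --backend-sampling, the first two steps would be offloaded;
    // top_p is not backend-supported and would still run on the CPU.
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));       // backend
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));      // backend
    llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, 1)); // CPU

    llama_sampler_free(chain);
    return 0;
}
```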

If this looks good then I'll follow up with updates to the llama-cli
and llama-server documentation to reflect these changes.
2025-11-25 14:01:23 +01:00
| Name | Last commit | Last commit date |
| --- | --- | --- |
| batched | sampling : remove backend-dist option (wip) | 2025-11-25 14:01:23 +01:00 |
| batched.swift | examples : remove references to `make` in examples [no ci] (#15457) | 2025-08-21 06:12:28 +02:00 |
| convert-llama2c-to-ggml | gguf: gguf_writer refactor (#15691) | 2025-09-05 11:34:28 +02:00 |
| deprecation-warning | Update deprecation-warning.cpp (#10619) | 2024-12-04 23:19:20 +01:00 |
| diffusion | models : Added support for RND1 Diffusion Language Model (#17433) | 2025-11-24 14:16:56 +08:00 |
| embedding | embedding: add raw option for --embd-output-format (#16541) | 2025-10-28 12:51:41 +02:00 |
| eval-callback | common : more accurate sampling timing (#17382) | 2025-11-20 13:40:10 +02:00 |
| gen-docs | ggml : move AMX to the CPU backend (#10570) | 2024-11-29 21:54:58 +01:00 |
| gguf | examples(gguf): GGUF example outputs (#17025) | 2025-11-05 19:58:16 +02:00 |
| gguf-hash | GGUF: C++ refactor, backend support, misc fixes (#11030) | 2025-01-07 18:01:58 +01:00 |
| llama.android | llama : deprecate llama_kv_self_ API (#14030) | 2025-06-06 14:11:15 +03:00 |
| llama.swiftui | llama : deprecate llama_kv_self_ API (#14030) | 2025-06-06 14:11:15 +03:00 |
| lookahead | lookahead : add sample command to readme (#15447) | 2025-08-20 13:30:46 +03:00 |
| lookup | llama : deprecate llama_kv_self_ API (#14030) | 2025-06-06 14:11:15 +03:00 |
| model-conversion | model-conversion : pass config to from_pretrained (#16963) | 2025-11-03 18:01:59 +01:00 |
| parallel | parallel : add option for different RNG seeds (#14757) | 2025-07-18 17:33:41 +03:00 |
| passkey | examples : remove references to `make` in examples [no ci] (#15457) | 2025-08-21 06:12:28 +02:00 |
| retrieval | examples : remove references to `make` in examples [no ci] (#15457) | 2025-08-21 06:12:28 +02:00 |
| save-load-state | tests : update for LLAMA_SET_ROWS=1 (#14961) | 2025-07-30 15:12:02 +03:00 |
| simple | examples : support encoder-decoder models in the simple example (#16002) | 2025-09-17 10:29:00 +03:00 |
| simple-chat | simple-chat : fix context-exceeded condition (#14494) | 2025-07-02 14:12:07 +03:00 |
| simple-cmake-pkg | repo : update links to new url (#11886) | 2025-02-15 16:40:57 +02:00 |
| speculative | sampling : optimize samplers by reusing bucket sort (#15665) | 2025-08-31 20:41:02 +03:00 |
| speculative-simple | common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191) | 2025-08-13 12:44:40 +02:00 |
| sycl | examples : remove references to `make` in examples [no ci] (#15457) | 2025-08-21 06:12:28 +02:00 |
| training | finetune: SGD optimizer, more CLI args (#13873) | 2025-08-14 12:03:57 +02:00 |
| CMakeLists.txt | codeowners : update + cleanup (#16174) | 2025-09-22 18:20:21 +03:00 |
| convert_legacy_llama.py | metadata: Detailed Dataset Authorship Metadata (#8875) | 2024-11-13 21:10:38 +11:00 |
| json_schema_pydantic_example.py | py : type-check all Python scripts with Pyright (#8341) | 2024-07-07 15:04:39 -04:00 |
| json_schema_to_grammar.py | grammar : support array references in json schema (#16792) | 2025-10-28 09:37:52 +01:00 |
| llama.vim | llama : remove KV cache defragmentation logic (#15473) | 2025-08-22 12:22:13 +03:00 |
| pydantic_models_to_grammar.py | pydantic : replace uses of `__annotations__` with get_type_hints (#8474) | 2024-07-14 19:51:21 -04:00 |
| pydantic_models_to_grammar_examples.py | llama : move end-user examples to tools directory (#13249) | 2025-05-02 20:27:13 +02:00 |
| reason-act.sh | scripts : make the shell scripts cross-platform (#14341) | 2025-06-30 10:17:18 +02:00 |
| regex_to_grammar.py | py : switch to snake_case (#8305) | 2024-07-05 07:53:33 +03:00 |
| server-llama2-13B.sh | scripts : make the shell scripts cross-platform (#14341) | 2025-06-30 10:17:18 +02:00 |
| server_embd.py | llama : fix FA when KV cache is not used (i.e. embeddings) (#12825) | 2025-04-08 19:54:51 +03:00 |
| ts-type-to-grammar.sh | scripts : make the shell scripts cross-platform (#14341) | 2025-06-30 10:17:18 +02:00 |