llama.cpp/examples
Daniel Bevenius 9e5e09d087
sampling : remove backend-dist option (wip)
This commit removes the `--backend-dist` option and instead uses the
configured `--samplers` chain to determine which samplers run on the
backend.

Backend sampling is still enabled with `--backend-sampling`, and the
sampler chain, either explicitly specified using `--samplers` or the
default, is automatically analyzed to determine which samplers can run
on the backend. The system finds the longest contiguous prefix of
backend-supported samplers at the start of the sampler sequence; a
short sketch of this selection follows the examples below.
For example:

* If the chain is `top-k -> temperature -> top-p`, and both `top-k` and
  `temperature` are backend-supported but `top-p` is not, then `top-k`
  and `temperature` will run on the backend, while `top-p` and
  subsequent samplers run on the CPU.

* If all configured samplers are supported, the final distribution
  sampling will also happen on the backend, transferring only the
  sampled token IDs back to the host.

* If the sampler chain starts with an unsupported sampler (e.g.,
  `penalties`), all sampling runs on the CPU. Note that this is
  currently the case with the default sampler chain, so to use backend
  sampling an explicit sampler chain must be specified. See below for
  an example.

The following shows how llama-cli can be run with backend sampling:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature'
```
In this case all sampling happens on the backend, since both `top_k`
and `temperature` are supported backend samplers.

To enable partial backend sampling (hybrid sampling), for example
running `top_k` and `temperature` on the backend and `top_p` on the
CPU, the following sampler chain could be specified:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature;top_p'
```
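For reference, the chain from the command above can also be built
programmatically with the existing `llama.h` sampler API. This is only
a sketch: the parameter values are illustrative, and the backend/CPU
split is decided inside the library when backend sampling is enabled,
not by anything in user code:
```cpp
#include "llama.h"

int main() {
    llama_sampler_chain_params params = llama_sampler_chain_default_params();
    llama_sampler * chain = llama_sampler_chain_init(params);

    // With --backend-sampling, the first two steps would be offloaded;
    // top_p is not backend-supported and would still run on the CPU.
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));       // backend
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));      // backend
    llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, 1)); // CPU

    llama_sampler_free(chain);
    return 0;
}
```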

If this looks good then I'll follow up with updates to the llama-cli
and llama-server documentation to reflect these changes.
2025-11-25 14:01:23 +01:00
| Name | Last commit | Last commit date |
| --- | --- | --- |
| batched | sampling : remove backend-dist option (wip) | 2025-11-25 14:01:23 +01:00 |
| batched.swift | examples : remove references to `make` in examples [no ci] (#15457) | 2025-08-21 06:12:28 +02:00 |
| convert-llama2c-to-ggml | gguf: gguf_writer refactor (#15691) | 2025-09-05 11:34:28 +02:00 |
| deprecation-warning | Update deprecation-warning.cpp (#10619) | 2024-12-04 23:19:20 +01:00 |
| diffusion | models : Added support for RND1 Diffusion Language Model (#17433) | 2025-11-24 14:16:56 +08:00 |
| embedding | embedding: add raw option for --embd-output-format (#16541) | 2025-10-28 12:51:41 +02:00 |
| eval-callback | common : more accurate sampling timing (#17382) | 2025-11-20 13:40:10 +02:00 |
| gen-docs | ggml : move AMX to the CPU backend (#10570) | 2024-11-29 21:54:58 +01:00 |
| gguf | examples(gguf): GGUF example outputs (#17025) | 2025-11-05 19:58:16 +02:00 |
| gguf-hash | GGUF: C++ refactor, backend support, misc fixes (#11030) | 2025-01-07 18:01:58 +01:00 |
| llama.android | llama : deprecate llama_kv_self_ API (#14030) | 2025-06-06 14:11:15 +03:00 |
| llama.swiftui | llama : deprecate llama_kv_self_ API (#14030) | 2025-06-06 14:11:15 +03:00 |
| lookahead | lookahead : add sample command to readme (#15447) | 2025-08-20 13:30:46 +03:00 |
| lookup | llama : deprecate llama_kv_self_ API (#14030) | 2025-06-06 14:11:15 +03:00 |
| model-conversion | model-conversion : pass config to from_pretrained (#16963) | 2025-11-03 18:01:59 +01:00 |
| parallel | parallel : add option for different RNG seeds (#14757) | 2025-07-18 17:33:41 +03:00 |
| passkey | examples : remove references to `make` in examples [no ci] (#15457) | 2025-08-21 06:12:28 +02:00 |
| retrieval | examples : remove references to `make` in examples [no ci] (#15457) | 2025-08-21 06:12:28 +02:00 |
| save-load-state | tests : update for LLAMA_SET_ROWS=1 (#14961) | 2025-07-30 15:12:02 +03:00 |
| simple | examples : support encoder-decoder models in the simple example (#16002) | 2025-09-17 10:29:00 +03:00 |
| simple-chat | simple-chat : fix context-exceeded condition (#14494) | 2025-07-02 14:12:07 +03:00 |
| simple-cmake-pkg | repo : update links to new url (#11886) | 2025-02-15 16:40:57 +02:00 |
| speculative | sampling : optimize samplers by reusing bucket sort (#15665) | 2025-08-31 20:41:02 +03:00 |
| speculative-simple | common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191) | 2025-08-13 12:44:40 +02:00 |
| sycl | examples : remove references to `make` in examples [no ci] (#15457) | 2025-08-21 06:12:28 +02:00 |
| training | finetune: SGD optimizer, more CLI args (#13873) | 2025-08-14 12:03:57 +02:00 |
| CMakeLists.txt | codeowners : update + cleanup (#16174) | 2025-09-22 18:20:21 +03:00 |
| convert_legacy_llama.py | metadata: Detailed Dataset Authorship Metadata (#8875) | 2024-11-13 21:10:38 +11:00 |
| json_schema_pydantic_example.py | py : type-check all Python scripts with Pyright (#8341) | 2024-07-07 15:04:39 -04:00 |
| json_schema_to_grammar.py | grammar : support array references in json schema (#16792) | 2025-10-28 09:37:52 +01:00 |
| llama.vim | llama : remove KV cache defragmentation logic (#15473) | 2025-08-22 12:22:13 +03:00 |
| pydantic_models_to_grammar.py | pydantic : replace uses of `__annotations__` with get_type_hints (#8474) | 2024-07-14 19:51:21 -04:00 |
| pydantic_models_to_grammar_examples.py | llama : move end-user examples to tools directory (#13249) | 2025-05-02 20:27:13 +02:00 |
| reason-act.sh | scripts : make the shell scripts cross-platform (#14341) | 2025-06-30 10:17:18 +02:00 |
| regex_to_grammar.py | py : switch to snake_case (#8305) | 2024-07-05 07:53:33 +03:00 |
| server-llama2-13B.sh | scripts : make the shell scripts cross-platform (#14341) | 2025-06-30 10:17:18 +02:00 |
| server_embd.py | llama : fix FA when KV cache is not used (i.e. embeddings) (#12825) | 2025-04-08 19:54:51 +03:00 |
| ts-type-to-grammar.sh | scripts : make the shell scripts cross-platform (#14341) | 2025-06-30 10:17:18 +02:00 |