llama.cpp/common
Daniel Bevenius 9e5e09d087
sampling : remove backend-dist option (wip)
This commit removes the `--backend-dist` option and instead uses the
configured --samplers chain to determine which samplers run on the
backend.

Backend sampling is still enabled with `--backend-sampling`, and the
sampler chain, either explicitly specified using `--samplers` or the
default, is automatically analyzed to determine which samplers can run
on the backend. The system finds the longest contiguous run of
backend-supported samplers from the start of the sampler sequence (a
minimal sketch of this analysis follows the examples below).
For example:

* If the chain is `top-k -> temperature -> top-p`, and both `top-k` and
  `temperature` are backend-supported but `top-p` is not, then `top-k`
  and `temperature` will run on the backend, while `top-p` and
  subsequent samplers run on the CPU.

* If all configured samplers are supported, the final distribution
  sampling will also happen on the backend, transferring only the
  sampled token IDs back to the host.

* If the sampler chain starts with an unsupported sampler (e.g.,
  `penalties`), all sampling runs on the CPU. Note that this is
  currently the case with the default sampler chain, so to use backend
  sampling a sampler chain must be specified explicitly. See below for
  an example.
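
The following is a minimal, self-contained sketch of this prefix
analysis in C++. It is not the actual implementation in
`common/sampling.cpp`; the `sampler_type` enum and the
`backend_supported`/`backend_prefix_len` helpers are hypothetical names
used only for illustration, and the set of backend-supported samplers is
assumed to match the examples above.
```cpp
// Sketch only: find the longest contiguous prefix of the configured
// sampler chain whose samplers can run on the backend. Everything after
// that prefix runs on the CPU.
#include <cstddef>
#include <vector>

enum class sampler_type { TOP_K, TOP_P, TEMPERATURE, PENALTIES, DIST };

// Assumption for this sketch: only top_k and temperature have backend
// implementations (as in the examples above).
static bool backend_supported(sampler_type t) {
    return t == sampler_type::TOP_K || t == sampler_type::TEMPERATURE;
}

// Number of samplers from the start of the chain that run on the
// backend; the remaining samplers run on the CPU.
static size_t backend_prefix_len(const std::vector<sampler_type> & chain) {
    size_t n = 0;
    while (n < chain.size() && backend_supported(chain[n])) {
        ++n;
    }
    return n;
}
```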

The following shows how llama-cli can be run with backend sampling:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature'
```
In this case all sampling will happen on the backend since both
`top_k` and `temperature` are supported backend samplers.

To enable partial backend sampling (hybrid sampling), for example
running `top_k` and `temperature` on the backend and `top_p` on the CPU,
the following sampler chain could be specified:
```console
$ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --prompt 'What is the capital of Sweden?' \
    -n 20 \
    -no-cnv \
    --verbose-prompt \
    -ngl 40 \
    --backend-sampling \
    --samplers 'top_k;temperature;top_p'
```
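
For this hybrid chain, the analysis sketched earlier would place the
first two samplers on the backend and leave `top_p` (and the final
distribution sampling) on the CPU. Appended to the sketch above (so the
hypothetical `sampler_type` and `backend_prefix_len` are in scope), a
small check could look like:
```cpp
#include <cassert>

int main() {
    // The chain corresponding to --samplers 'top_k;temperature;top_p'.
    const std::vector<sampler_type> chain = {
        sampler_type::TOP_K, sampler_type::TEMPERATURE, sampler_type::TOP_P,
    };
    // top_k and temperature run on the backend; top_p runs on the CPU.
    assert(backend_prefix_len(chain) == 2);
    return 0;
}
```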

If this looks good, I'll follow up with updates to the llama-cli and
llama-server documentation to reflect these changes.
2025-11-25 14:01:23 +01:00
| File | Last commit | Date |
| --- | --- | --- |
| CMakeLists.txt | common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) | 2025-11-18 18:54:15 +01:00 |
| arg.cpp | sampling : remove backend-dist option (wip) | 2025-11-25 14:01:23 +01:00 |
| arg.h | common: move download functions to download.(cpp\|h) (#17059) | 2025-11-07 11:23:34 +01:00 |
| base64.hpp | llava : expose as a shared library for downstream projects (#3613) | 2023-11-07 00:36:23 +03:00 |
| build-info.cpp.in | cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167) | 2025-06-13 10:38:52 +02:00 |
| chat-parser-xml-toolcall.cpp | common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) | 2025-11-18 18:54:15 +01:00 |
| chat-parser-xml-toolcall.h | common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) | 2025-11-18 18:54:15 +01:00 |
| chat-parser.cpp | common : handle unicode during partial json parsing (#16526) | 2025-10-12 16:18:47 +03:00 |
| chat-parser.h | common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) | 2025-11-18 18:54:15 +01:00 |
| chat.cpp | chat: fix int overflow, prevent size calculation in float/double (#17357) | 2025-11-18 19:11:53 +01:00 |
| chat.h | common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) | 2025-11-18 18:54:15 +01:00 |
| common.cpp | sampling : remove backend-dist option (wip) | 2025-11-25 14:01:23 +01:00 |
| common.h | sampling : remove backend-dist option (wip) | 2025-11-25 14:01:23 +01:00 |
| console.cpp | console : utf-8 fix for windows stdin (#9690) | 2024-09-30 11:23:42 +03:00 |
| console.h | gguf : new file format with flexible meta data (beta) (#2398) | 2023-08-21 23:07:43 +03:00 |
| download.cpp | cmake : move OpenSSL linking to vendor/cpp-httplib (#17177) | 2025-11-12 12:32:50 +01:00 |
| download.h | arg: add --cache-list argument to list cached models (#17073) | 2025-11-08 21:54:14 +01:00 |
| http.h | common: introduce http.h for httplib-based client (#16373) | 2025-10-01 20:22:18 +03:00 |
| json-partial.cpp | common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) | 2025-11-18 18:54:15 +01:00 |
| json-partial.h | sync : vendor (#13901) | 2025-05-30 16:25:45 +03:00 |
| json-schema-to-grammar.cpp | common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) | 2025-11-18 18:54:15 +01:00 |
| json-schema-to-grammar.h | common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) | 2025-11-18 18:54:15 +01:00 |
| llguidance.cpp | sampling : add support for backend sampling | 2025-11-17 16:15:58 +01:00 |
| log.cpp | mtmd: add mtmd_log_set (#17268) | 2025-11-14 15:56:19 +01:00 |
| log.h | mtmd: add mtmd_log_set (#17268) | 2025-11-14 15:56:19 +01:00 |
| ngram-cache.cpp | ggml : portability fixes for VS 2017 (#12150) | 2025-03-04 18:53:26 +02:00 |
| ngram-cache.h | llama : use LLAMA_TOKEN_NULL (#11062) | 2025-01-06 10:52:15 +02:00 |
| regex-partial.cpp | `common`: add partial regex support (#12808) | 2025-05-14 19:50:57 +01:00 |
| regex-partial.h | `common`: add partial regex support (#12808) | 2025-05-14 19:50:57 +01:00 |
| sampling.cpp | sampling : remove backend-dist option (wip) | 2025-11-25 14:01:23 +01:00 |
| sampling.h | sampling : add support for backend sampling | 2025-11-17 16:15:58 +01:00 |
| speculative.cpp | sampling : optimize samplers by reusing bucket sort (#15665) | 2025-08-31 20:41:02 +03:00 |
| speculative.h | server : implement universal assisted decoding (#12635) | 2025-07-31 14:25:23 +02:00 |