llama.cpp

Johannes Gäßler d6f3030047 ggml: backend-agnostic tensor parallelism (experimental) (#19378)	2026-04-09 16:42:19 +02:00

* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
  Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
  KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
  There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
  ---------
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS

---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

..
jinja	jinja : support ensure_ascii=true, string repetition and int/float self-filtering (#21623)	2026-04-09 11:28:33 +02:00
CMakeLists.txt	common : add standard Hugging Face cache support (#20775)	2026-03-24 07:30:33 +01:00
arg.cpp	ggml: backend-agnostic tensor parallelism (experimental) (#19378)	2026-04-09 16:42:19 +02:00
arg.h	vendor : update cpp-httplib to 0.30.0 (#18660)	2026-01-08 13:53:54 +01:00
base64.hpp	llava : expose as a shared library for downstream projects (#3613)	2023-11-07 00:36:23 +03:00
build-info.cpp.in	cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)	2025-06-13 10:38:52 +02:00
chat-auto-parser-generator.cpp	common : simplify autoparser tagged parser rules (#21216)	2026-04-09 12:24:20 +02:00
chat-auto-parser-helpers.cpp	common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912)	2026-03-23 22:21:47 -05:00
chat-auto-parser-helpers.h	chat : avoid including json in chat.h (#21306)	2026-04-03 09:07:59 +03:00
chat-auto-parser.h	common : add gemma 4 specialized parser (#21418)	2026-04-04 20:39:00 +02:00
chat-diff-analyzer.cpp	common : add gemma 4 specialized parser (#21418)	2026-04-04 20:39:00 +02:00
chat-peg-parser.cpp	common : add gemma 4 specialized parser (#21418)	2026-04-04 20:39:00 +02:00
chat-peg-parser.h	common : add gemma 4 specialized parser (#21418)	2026-04-04 20:39:00 +02:00
chat.cpp	common : fix ambiguous grammar rule in gemma4 (#21661)	2026-04-09 12:25:07 +02:00
chat.h	common : add gemma 4 specialized parser (#21418)	2026-04-04 20:39:00 +02:00
common.cpp	tests: allow exporting graph ops from HF file without downloading weights (#21182)	2026-04-02 18:19:20 +02:00
common.h	server: save and clear idle slots on new task (`--clear-idle`) (#20993)	2026-04-03 19:02:27 +02:00
console.cpp	cli: fix stripping of \n in multiline input (#21485)	2026-04-06 20:54:06 +02:00
console.h	cli : add command and file auto-completion (#19985)	2026-03-05 10:47:28 +01:00
debug.cpp	debug: make common_debug_print_tensor readable (#19331)	2026-02-04 17:55:31 +01:00
debug.h	chore : correct typos [no ci] (#20041)	2026-03-05 08:50:21 +01:00
download.cpp	common : skip non-primary GGUF split files when selecting model (#21633)	2026-04-09 07:28:06 +02:00
download.h	common : add standard Hugging Face cache support (#20775)	2026-03-24 07:30:33 +01:00
hf-cache.cpp	common : add getpwuid fallback for HF cache when HOME is not set (#21035)	2026-03-26 20:34:23 +01:00
hf-cache.h	common : fix split model migration (#21019)	2026-03-26 12:04:37 +01:00
http.h	server: Parse port numbers from MCP server URLs in CORS proxy (#20208)	2026-03-09 17:47:54 +01:00
json-partial.cpp	common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932)	2025-11-18 18:54:15 +01:00
json-partial.h	cli : fix reasoning responses in CLI (#18961)	2026-01-20 18:23:25 +01:00
json-schema-to-grammar.cpp	common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (#21124)	2026-03-28 17:55:38 +01:00
json-schema-to-grammar.h	common : add nemotron 3 parsing (#18077)	2025-12-16 04:05:23 -06:00
llguidance.cpp	sampling : add support for backend sampling (#17004)	2026-01-04 22:22:16 +02:00
log.cpp	cli: new CLI experience (#17824)	2025-12-10 15:28:59 +01:00
log.h	cli: new CLI experience (#17824)	2025-12-10 15:28:59 +01:00
ngram-cache.cpp	spec : add self‑speculative decoding (no draft model required) + refactor (#18471)	2026-01-28 19:42:42 +02:00
ngram-cache.h	spec : add self‑speculative decoding (no draft model required) + refactor (#18471)	2026-01-28 19:42:42 +02:00
ngram-map.cpp	llama : correct typos 'occured' and 'occurences' (#19414)	2026-02-11 07:05:31 +01:00
ngram-map.h	fix: correct misspellings in code comments (#21217)	2026-03-31 13:50:51 +02:00
ngram-mod.cpp	spec : add ngram-mod (#19164)	2026-01-30 18:21:48 +02:00
ngram-mod.h	ngram-mod : fix build [no ci] (#19216)	2026-01-30 21:27:27 +02:00
peg-parser.cpp	common : add gemma 4 specialized parser (#21418)	2026-04-04 20:39:00 +02:00
peg-parser.h	common : add gemma 4 specialized parser (#21418)	2026-04-04 20:39:00 +02:00
preset.cpp	preset: allow named remote preset (#18728)	2026-01-10 15:12:29 +01:00
preset.h	common: support remote preset (#18520)	2026-01-08 22:35:40 +01:00
reasoning-budget.cpp	common : inhibit lazy grammar sampler while reasoning is active (#20970)	2026-03-27 18:30:40 +01:00
reasoning-budget.h	common : inhibit lazy grammar sampler while reasoning is active (#20970)	2026-03-27 18:30:40 +01:00
regex-partial.cpp	common : fix iterator::end() dereference (#20445)	2026-03-16 08:50:38 +02:00
regex-partial.h	`common`: add partial regex support (#12808)	2025-05-14 19:50:57 +01:00
sampling.cpp	common : Disable backend sampling if reasoning budget is enabled (#21209)	2026-03-31 10:14:01 +03:00
sampling.h	sampling : add support for backend sampling (#17004)	2026-01-04 22:22:16 +02:00
speculative.cpp	spec : remove check rate (#19377)	2026-02-09 15:30:50 +02:00
speculative.h	common : add common_speculative_is_compat() (#19270)	2026-02-06 16:47:22 +02:00
unicode.cpp	common/parser: handle reasoning budget (#20297)	2026-03-11 10:26:12 +01:00
unicode.h	common/parser: handle reasoning budget (#20297)	2026-03-11 10:26:12 +01:00