* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the segfault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary number of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for AllReduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of AllReduce. (#11)
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies.
* Enable the previous AllReduce implementation. It is better in both performance and stability (#12)
* delay AllReduce for MoE for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVINO, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
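
The granularity fix for Qwen-30B-A3B Q4_0 listed above comes down to split arithmetic, which can be sketched as follows. This is only an illustration, not the meta backend's code: the helper `can_split_evenly` is hypothetical, two devices are assumed for the example, and the only ggml-specific fact used is that a Q4_0 block holds 32 elements.

```cpp
#include <cstdint>

// Hypothetical helper, not part of ggml: returns true if a dimension of
// n_elements can be cut into equal per-device slices, where every slice
// must consist of a whole number of granularity-sized blocks.
bool can_split_evenly(int64_t n_elements, int64_t granularity, int n_devs) {
    if (granularity <= 0 || n_devs <= 0 || n_elements % granularity != 0) {
        return false;
    }
    const int64_t n_blocks = n_elements / granularity;
    return n_blocks % n_devs == 0;
}
```

With an intermediate dimension of 768, a granularity of 256 yields 3 blocks, which two devices cannot share equally. A granularity of 32 (the Q4_0 block size) yields 24 blocks, which split evenly across 2 or 4 devices.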
The build info is now only for debug, so we avoid the duplicate
with `--version`.
The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
llama-perplexity -hf unsloth/Qwen3-0.6B-GGUF:Q4_K_M -f winogrande-debiased-eval.csv --winogrande
winogrande_score : tokenizing selected tasks
winogrande_score : calculating winogrande score over selected tasks.
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 46
failed to decode the batch, n_batch = 2048, ret = 1
winogrande_score: llama_decode() failed
Same for hellaswag:
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 99
failed to decode the batch, n_batch = 2048, ret = 1
hellaswag_score: llama_decode() failed
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Set C locale for consistent float formatting across all binaries.
* Add C locale setting to all tools binaries
Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/
directory to ensure consistent floating-point formatting.
* Apply suggestion from @JohannesGaessler
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
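
A minimal sketch of the locale change described above, assuming the call is made once near the start of each tool's `main()`. The helper names `init_numeric_locale` and `format_float` are hypothetical and exist only for this illustration; `std::setlocale(LC_NUMERIC, "C")` is the call named in the commit.

```cpp
#include <clocale>
#include <cstdio>
#include <string>

// Hypothetical wrapper around the call named in the commit message.
// Pinning LC_NUMERIC to "C" makes printf-style formatting use '.' as the
// decimal separator regardless of the user's environment.
void init_numeric_locale() {
    std::setlocale(LC_NUMERIC, "C");
}

// Hypothetical helper: formats a value the way printf-style logging would.
std::string format_float(double value) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%f", value);
    return std::string(buf);
}
```

After `init_numeric_locale()` has run, `format_float(1.5)` yields `1.500000` even if the process previously used a locale that formats decimals with a comma.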
In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and displays the 90th percentile in the second position, presumably as originally intended.
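
To illustrate what a separate 90th percentile line adds, here is a sketch of a percentile lookup over sorted values using linear interpolation. It is not the code from `perplexity.cpp`; the function name `percentile_sorted` and the interpolation scheme are assumptions made for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the value at the given fraction (0.0 to 1.0)
// of an ascending-sorted vector, interpolating linearly between neighbours.
// Returns 0.0 for an empty input.
double percentile_sorted(const std::vector<double> & sorted, double fraction) {
    if (sorted.empty()) {
        return 0.0;
    }
    const double pos = fraction * static_cast<double>(sorted.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    const double t = pos - static_cast<double>(lo);
    return (1.0 - t) * sorted[lo] + t * sorted[hi];
}
```

Calling it with `0.90` and `0.99` on the same sorted per-token KL divergence values gives two distinct summary points, which is the information lost when the 99th percentile is printed twice.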
* perplexity: give more information about constraints on failure
This checks whether -np is insufficient for the given context, and reports how much is needed for each.
* log formatting
* log error and return instead of storing max_seq_exceeded int
* check if s0 is zero for -np check
This commit updates comments and error messages to use "decode" instead
of "eval" in perplexity.cpp.
The motivation for this is that `llama_eval` was renamed to
`llama_decode` a while ago, but the comments and error messages
still referred to "eval". This change ensures consistency and clarity.
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci