* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* tests: allow exporting graph ops from HF file without downloading weights
* use unique_ptr for llama_context in HF metadata case
* fix missing non-required tensors falling back to type f32
* use unique pointers where possible
* use no_alloc instead of fixing f32 fallback
* fix missing space
* server: respect the verbose_prompt parameter
* Revert "server: respect the verbose_prompt parameter"
This reverts commit 8ed885cf37.
* Remove --verbose-prompt parameter from llama-server
* Using set_examples instead of set_excludes
The verbosity threshold was set at the end of common_params_parse_ex(),
after doing many things (like downloading files..)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : add standard Hugging Face cache support
- Use HF API to find all files
- Migrate all manifests to hugging face cache at startup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check with the quant tag
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve error handling and report API errors
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Restore common_cached_model_info and align mmproj filtering
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Prefer main when getting cached ref
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use cached files when HF API fails
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use final_path..
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check all inputs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* misc : prefer ggml-org models in docs and examples
Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.
* remove accidentally committed file
* tests: allow loading test-backend-ops tests from json
* add error threshold based on op
* add error when file cannot be read
* add graph operator json extraction tool
* add nb parameter for non-contiguous input tensors
* fix view check
* only use view if non-contiguous/permuted, use C++ random instead of rand()
* replace internal API calls with public llama_graph_reserve call
* reduce test description length
* fix nb[0] not getting set for view
* add name to tests
* fix inplace error
* use text file instead of json
* move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/
* fix missing declaration
* use pragma once
* fix indent
* fix Windows build
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags
* common : use two decimal places for float arg help messages
This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.
The motivation for this changes is that currently only having one decimal
place means that values generated using --help or llama-gen-docs will not
display the correct values.
For example, currently the value of top-p in tools/server/README.md is
`0.9`, but the default value is actually '0.95'. And running
llama-gen-docs does not update this value as it uses the output from the
help message, which shows only one decimal place, so the values look
like they are unchanged.
* docs : run llama-gen-docs to update docs
* initial commit for branch
* simplify constants
* add params to `struct common_params_sampling`, add reference to PR
* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`
* add args, rename `queue_size` -> `window_size`
* improved comments
* minor
* remove old unused code from algorithm
* minor
* add power law case to `common_sampler_init`, add sampler name mappings
* clarify behaviour when `window_size = 0`
* add missing enums
* remove `target_range` param, make `target == 1` no-op, cleanup code
* oops, straggler
* add missing parameters in `server-task.cpp`
* copy from author
ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069
* remove old debug log, style nit
* fix compiler warning, add commented-out logging per token
* re-write + change parameters + simplify
* oops forgot args.cpp
* fix leftover `window_size`
* add missing values to `common_params_sampling::print()`
* with logging
* does this fix it?
* no, but does this?
* update default decay
* optimize
* fix bad merge
my git skills are lacking
* silence `missing initializer for member`
* update default decay to 0.9
* fix logging
* format (double)
* add power law to the new `samplers` vector
* log sampler init values
* improve logging messages in llama_sampler_power_law
* remove extraneous logging
* simplify target computation
last commit with debug logging!
* remove debug logging, explicitly clamp params at init
* add `use_power_law` flag + logic, minor cleanup
* update `power-law` -> `adaptive-p`
* fix cold start EMA
- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`
this fixes a "cold start" problem with the moving average
* update `SHARPNESS` constant to `10.0f`
* minor style fixes
no functional changes
* minor style fixes cont.
* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)
* separate into `apply` + `accept` functions
* `pending_token_idx`: switch from `llama_token` to `int32`
functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but its more correct now
* don't transform logits <= -1e9f
* fix masking in backend top-p, min-p
* address review comments
* typo in comments `RND` -> `RNG`
* add docs
* add recommended values in completion docs
* address PR feedback
* remove trailing whitespace (for CI `editorconfig`)
* add to adaptive-p to `common_sampler_types_from_chars`
* server : add arg for disabling prompt caching
Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.
* address review comments
* address review comments
This commit adds the --kv-unified flag to the batched example. This flag
is currently specified in the README.md as required, but is currently
not available as a command line option for the batched example.
The motivation for this is that specifying this flag as the README
instructs, will lead to an error about the flag not being recognized,
and without this option the example fail with the following error:
```console
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
This commit introduces a mechanism to embed all licenses directly
into the compiled binaries.
This eliminates the need to distribute separate LICENSE files alongside
the executable, making the binaries self-contained and simplifying
deployment.
* arg: support remote preset
* proof reading
* allow one HF repo to point to multiple HF repos
* docs: mention about multiple GGUF use case
* correct clean_file_name
* download: also return HTTP status code
* fix case with cache file used
* fix --offline option
* Adding --direct-io flag for model loading
* Fixing read_raw() calls
* Fixing Windows read_raw_at
* Changing type off_t to size_t for windows and Renaming functions
* disable direct io when mmap is explicitly enabled
* Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL
* Fallback to std::fread in case O_DIRECT fails due to bad address
* Windows: remove const keywords and unused functions
* Update src/llama-mmap.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* examples : add debug utility/example
This commit introduces a new example named llama-debug which is a
utility that is intended to be used to assist with developing/debugging
a converted model.
The motivation for this utilitiy is to assist in model conversion work
to verify that the model produces the expected outputs. It is intended
to replace logits.cpp in examples/model-conversion.
Example usage:
```console
./build/bin/llama-debug \
-m models/Qwen2.5-0.5B-Instruct.gguf \
--prompt "Hello, my name is" \
--save-logits
...
Model add_bos: false
Input prompt: "Hello, my name is"
Token ids (5):
Hello(9707) ,(11) my(847) name(829) is(374)
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.bin
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.txt
Prompt saved to data/llamacpp-Qwen2.5-0.5B-Instruct-prompt.txt
Tokens saved to data/llamacpp-Qwen2.5-0.5B-Instruct-tokens.bin
```
For more details about the options available for this example, please
refer to examples/debug/README.md.
* throw runtime error instead of logging error
* remove params.warmup and enable the warmup/nowarmup option
* model-conversion : remove logits.cpp
This commit removes logits.cpp in favor of using llama-debug for
generating logits and embeddings.
* examples : remove model-conversion directory
This was missed in the previous commit.
* model-conversion : add support for saving prompt and token ids
This commit add support for storing the prompt and the token ids for the
prompt when running the original models.
The motivation for this is that this will allow us to compare the prompt
and the tokens generated for the prompt when verifing the converted
model. Currently it is possible that even if the same prompt is used
that the tokens generated are different if there is a difference in the
tokenization between the original and converted model which would
currently go unnoticed (the verification will most likely fail but it
might not be obvious why).
* squash! model-conversion : add support for saving prompt and token ids
fix pyright errors.
* model-conversion : add compare_tokens utility
This commit adds a script to compare token outputs between original and
converted models.
Example usage:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
And there is a verbose flag that will also print out the prompts:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16 -v
Original model prompt (pytorch-gemma-3-270m-it):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Converted model prompt (llamacpp-gemma-3-270m-it-bf16):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
* model-conversion : add token comparison to verifiction scripts
This commit add the calling of the compare_tokens function in
compare-logits.py and semantic_check.py to ensure that the token ids
that the tokenizers procoduce are the same before proceeding with
verifying the logits/embeddings.
Placing them in the existing scripts instead calling them separately
ensures that the token comparison is always done prior to the
logit/embedding verifications.
Follow up commit/pr could refactor the causal logits verification into
a single script instead of the two that exist now. This would reduce the
code and make it consistent with the embeddings verficiation which only
has a single script.
* debug : use llama_model_n_embd_out
This commit updates the debug example to use the new function
llama_model_n_embd_out instead of llama_model_n_embd.
The motivation for this change is to support late interation retriever
models, like LFM2-ColBert-350M, where the output embeddings are down
projected to a lower dimension.
* debug : add print_usage function
This commit adds a print_usage function that is passed to the
common_params_parse.
The motivation for this is that this enables a specific usage message
which will be printed after all the options, for example:
```console
example usage:
Print tensors:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --verbose
The tensors to be printed can be filtered with --tensor-filter option.
Save logits/embeddings:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --save-logits
Add --embedding to save embeddings
```