LLM inference in C/C++
Tim Burke ccea34ba41
perf : multiple fixes and enhancements, remove MSE search, expand test coverage
* fix: correct tiled flash attention SoA pointer math for multihead MXFP

The cleanup refactoring (c919bc471) extracted mxfp_dequant_head as a
shared helper but failed to update the tiled path's data pointers.
The helper expects the full SoA row base (no per-head offset), but the
tiled path was passing a pointer that already included ik2*nbk2, causing
a double head offset that produced NaN during prefill.

Add mxfp_row_ptr helper to centralize the multihead-aware pointer
calculation across both one_chunk and tiled paths. Verified with 16-chunk
perplexity on gpt-oss-20b: all four configs (f16, mxfp4, mxfp6, mxfp8)
produce exact matches with the known-good commit (23e88631c).

* perf: reduce E8M0 MSE search range from ±2 to ±1

The base estimate round(log2(amax)) is always within 1 step of optimal.
Empirically verified across 30K blocks and 6 distributions: ±1 and ±2
never disagree. This reduces the scale search from 5 to 3 candidates
(40% fewer inner loop iterations) with zero quality impact.

* perf: eliminate redundant work in MXFP quantize and flash attention

- mse_error_mxfp4: use passed inv_scale instead of recomputing 1/d
- mxfp_compute_e8m0_mse: hoist loop-invariant traits branch out of inner loop
- tiled V path: dequant directly to V32 tile, remove intermediate memcpy and dead buffer

* cleanup: fix comments, unify Hadamard condition, simplify E8M0 helpers

- EMAX_OFFSET comments: fix ceil/floor labels to match actual values
- Hadamard flag: unify write path (llama-kv-cache.cpp) and read path
  (ops.cpp) to both use DK==DV condition instead of is_mla()
- E8M0 helpers in ggml-impl.h: simplify to match ggml-common.h style,
  add cross-reference comment

* fix: MXFP8/6 flash attention tests crash on init

The view base tensors for K/V don't get named "k"/"v" but inherit the
MXFP type. The name-based filter in initialize_tensors missed them,
falling through to init_tensor_uniform which calls quantize_chunk and
aborts for KV-cache-only types. Fix by checking ggml_is_type_mxfp() for
all tensors, matching the pattern set_rows tests already use.

* test: expand MXFP set_rows coverage

- Add MXFP8/MXFP6 to all_types for non-Hadamard set_rows coverage
- Expand Hadamard set_rows tests: add views, broadcast, and multi-head configs
- Coverage: 18 → 51 MXFP set_rows tests

* perf: add AVX2 Hadamard for x86 (matches existing ARM NEON path)

* cleanup: DRY MXFP4 quantize/dequant with shared per-block helpers

Extract quantize_block_mxfp4 and dequantize_block_mxfp4 as shared
helpers used by both AoS (quantize_row_mxfp4_ref, dequantize_row_mxfp4)
and SoA (quantize_row_mxfp4_soa, dequantize_row_mxfp4_soa) paths.
Eliminates duplicated per-block logic while keeping layout-specific
pointer arithmetic in the callers.

* feat: add MXFP8/MXFP6 AoS quantize/dequant (full type support)

Extract quantize_block_mxfp / dequantize_block_mxfp per-block helpers
from the SoA generic impl and use them to build AoS row functions for
MXFP8 (E4M3) and MXFP6 (E2M3). Register to_float and from_float_ref
in type traits, add quantize_chunk dispatch, replacing the GGML_ABORT.

MXFP8 and MXFP6 are no longer KV-cache-only — they can now be used
as general quantization types. The SoA impl is also DRY'd to delegate
to the same per-block helpers.

* cleanup: remove dead soa_elems field from mxfp_kv_params

Computed but never read — leftover from an earlier design.

* feat: add MXFP8/MXFP6 vec_dot and full CPU type support

Add scalar vec_dot_mxfp8_q8_0 and vec_dot_mxfp6_q8_0 implementations,
register from_float + vec_dot + vec_dot_type in CPU traits, and add
fallback remaps for all architectures. MXFP8/6 are now fully tested:
AoS quantization error, reference match, and dot product accuracy all
pass in test-quantize-fns.

* perf: remove E8M0 MSE search — base estimate is perplexity-optimal

The MSE search over ±1 candidates around round(log2(amax)) was found to
HURT perplexity by 4-37 PPL points across all MXFP configs on gpt-oss-20b.
The base estimate alone (no search) produces better attention patterns
because minimizing per-block reconstruction error is not the same as
minimizing attention score distortion through softmax.

Removes mse_error_mxfp4, mse_error field from traits, MSE_RANGE constant,
and the entire search loop. E8M0 computation is now a single amax scan +
integer bit extraction — no inner loop, no function pointers. This also
simplifies future GPU/Metal implementations.
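
A minimal sketch of the search-free scale selection described above, not the code from this commit. The helper name `e8m0_base_estimate` and the `emax_offset` parameter are hypothetical; `emax_offset` stands in for the per-format EMAX_OFFSET adjustment, whose values are not stated here. `std::log2` is used for readability where the commit describes integer bit extraction, and inputs are assumed finite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: choose one shared E8M0 exponent for a block from its
// largest magnitude, with no candidate search.
inline uint8_t e8m0_base_estimate(const float * x, std::size_t n, int emax_offset) {
    assert(x != nullptr || n == 0);
    float amax = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        amax = std::max(amax, std::fabs(x[i]));  // single amax scan
    }
    if (amax == 0.0f) {
        return 0;  // all-zero block: use the smallest representable scale
    }
    // Base estimate round(log2(amax)), shifted by the assumed format offset.
    const int e = (int) std::lround(std::log2(amax)) - emax_offset;
    // Add the E8M0 bias of 127 and clamp to 0..254 (255 encodes NaN).
    return (uint8_t) std::min(std::max(e + 127, 0), 254);
}
```

For a block whose largest magnitude is 3.0 and an offset of 0, this gives round(log2(3)) = 2, so the stored byte is 129.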

* perf: fuse Hadamard rotation into SoA quantize (one pass, no temp buffer)

Add quantize_row_mxfp{4,8,6}_soa_hadamard that apply Hadamard and
quantize block-by-block with a 32-float stack buffer. Eliminates the
std::vector heap allocation and 2 extra memory passes over the full row.

set_rows now dispatches to the fused path when Hadamard is enabled,
falling through to the unfused quantize for non-Hadamard types.

This pattern maps directly to a CUDA kernel: global memory read →
register Hadamard → register quantize → global memory write.
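
The block-by-block pattern above can be sketched as follows. This is an illustration under stated assumptions, not the fused functions from the commit: `hadamard32_inplace` and `hadamard_blocks` are hypothetical names, the 1/sqrt(32) scaling is assumed, and the quantize step is replaced by a plain store so the example stays self-contained.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kBlock = 32;  // matches the 32-float stack buffer above

// In-place 32-point fast Walsh-Hadamard transform in butterfly form.
// The 1/sqrt(32) scaling is an assumption; it makes the transform its own inverse.
inline void hadamard32_inplace(float * x) {
    for (std::size_t len = 1; len < kBlock; len *= 2) {
        for (std::size_t i = 0; i < kBlock; i += 2 * len) {
            for (std::size_t j = i; j < i + len; ++j) {
                const float a = x[j];
                const float b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
        }
    }
    const float norm = 1.0f / std::sqrt((float) kBlock);
    for (std::size_t i = 0; i < kBlock; ++i) {
        x[i] *= norm;
    }
}

// Hypothetical fused loop: rotate one block at a time in a stack buffer and
// pass it straight to the next stage. Here that stage only stores the floats;
// a fused quantizer would encode `tmp` instead of writing it out.
inline void hadamard_blocks(const float * src, float * dst, std::size_t n) {
    assert(n % kBlock == 0);
    float tmp[kBlock];
    for (std::size_t b = 0; b < n; b += kBlock) {
        for (std::size_t i = 0; i < kBlock; ++i) tmp[i] = src[b + i];
        hadamard32_inplace(tmp);
        for (std::size_t i = 0; i < kBlock; ++i) dst[b + i] = tmp[i];
    }
}
```

Because each block is processed entirely inside `tmp`, no heap allocation or full-row temporary is needed, which is the property the commit relies on.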

* cleanup: consistent MXFP type names and variable naming

- Rename type_name "mxfp8_e4m3" → "mxfp8", "mxfp6_e2m3" → "mxfp6"
  to match "mxfp4". Only one variant of each exists — the suffix was
  unnecessary disambiguation that implied alternatives.
- Remove redundant MXFP shortcuts from arg.cpp (fallback loop handles
  all types via ggml_type_name matching).
- Rename kv_is_f32_f16_or_mxfp → k_is_f32_f16_or_mxfp (only checks K).

* perf: fuse Q preprocessing round-trip (no SoA buffer needed)

Add mxfp{4,8,6}_hadamard_roundtrip and mxfp{4,8,6}_roundtrip functions
that apply quantization error to float values without materializing SoA
bytes. Replaces the 3-step Q preprocessing (Hadamard → quantize to SoA
buffer → dequant from SoA buffer) with a single pass through per-block
round-trip helpers.

Eliminates the Q_q intermediate buffer and two function pointer calls
from the flash attention hot path. Maps directly to CUDA: Q stays in
registers, Hadamard butterfly + quantize error applied in-place.

* fix: clamp E8M0 = 255 to 254 in decode (fixes CI NaN failures)

E8M0 = 255 means NaN per MX spec, but our encode path already clamps
to 254. When test data contains random E8M0 = 255 bytes, the decode
produces Inf, and Inf * 0.0 = NaN, causing GET_ROWS and CPY tests to
fail on MXFP6 (and potentially MXFP4/8).

Fix: clamp 255 → 254 in both E8M0 decode functions:
  - ggml_e8m0_to_fp32 / ggml_e8m0_to_fp32_half (ggml-impl.h)
  - ggml_mxfp_e8m0_to_fp32 / ggml_mxfp_e8m0_to_fp32_half (ggml-common.h)

These are unfortunately duplicated across two headers because
ggml-common.h compiles for CUDA (__device__) while ggml-impl.h serves
CPU-only callers that don't include ggml-common.h.
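
A sketch of the clamped decode, assuming the usual E8M0 interpretation (value = 2^(e - 127)). The function name `e8m0_to_fp32_clamped` is hypothetical and the body is not copied from either header.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the decode helpers listed above.
inline float e8m0_to_fp32_clamped(uint8_t e) {
    if (e == 255) {
        e = 254;  // 255 encodes NaN in E8M0; clamp so decode stays finite
    }
    // e == 0 means 2^-127, which is subnormal in FP32 (exponent field 0,
    // mantissa bit 22 set). Otherwise e is the FP32 exponent field itself.
    const uint32_t bits = (e == 0) ? 0x00400000u : ((uint32_t) e << 23);
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```

Without the clamp, `255 << 23` is the FP32 bit pattern for +Inf, and multiplying that scale by a zero element yields NaN, which is the failure mode described above.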
2026-03-22 20:12:09 -04:00
.devops docker : force Python 3.13 in Vulkan container (#20530) 2026-03-14 21:37:09 +01:00
.gemini contributing: tighten AI usage policy (#18388) 2025-12-29 16:01:32 +01:00
.github ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
benches benches : add Nemotron 3 Nano on DGX Spark (#20652) 2026-03-16 21:50:43 +02:00
ci ggml-blas: set mkl threads from thread context (#20602) 2026-03-18 01:16:49 +08:00
cmake ci : add sanitizer runs for server (#19291) 2026-02-03 22:41:20 +02:00
common perf : multiple fixes and enhancements, remove MSE search, expand test coverage 2026-03-22 20:12:09 -04:00
docs docs : fix Metal backend op support status in ops.md (#20779) 2026-03-20 11:06:38 +02:00
examples ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
ggml perf : multiple fixes and enhancements, remove MSE search, expand test coverage 2026-03-22 20:12:09 -04:00
gguf-py Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730) 2026-03-21 13:35:21 +02:00
grammars docs : document that JSON Schema is not available to model when using response_format (#18492) 2025-12-30 15:13:49 -06:00
include llama : re-enable manual LoRA adapter free (#19983) 2026-03-18 12:03:26 +02:00
licenses refactor : remove libcurl, use OpenSSL when available (#18828) 2026-01-14 18:02:47 +01:00
media media : add transparent icon svg and png [no ci] (#15891) 2025-09-10 14:51:28 +03:00
models common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
pocs ggml : move AMX to the CPU backend (#10570) 2024-11-29 21:54:58 +01:00
requirements gguf-py : bump sentencepiece version (#19319) 2026-02-06 21:05:19 +01:00
scripts ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
src perf : multiple fixes and enhancements, remove MSE search, expand test coverage 2026-03-22 20:12:09 -04:00
tests perf : multiple fixes and enhancements, remove MSE search, expand test coverage 2026-03-22 20:12:09 -04:00
tools cleanup : remove unused untested code and improve consistency 2026-03-22 02:44:56 -04:00
vendor vendor : update cpp-httplib to 0.38.0 (#20578) 2026-03-15 17:30:06 +01:00
.clang-format fix: apply clang-format to CUDA macros (#16017) 2025-09-16 08:59:19 +02:00
.clang-tidy clang-tidy : disable warning about performance enum size (#16127) 2025-09-22 19:57:46 +02:00
.dockerignore ci : fix docker build number and tag name (#9638) 2024-09-25 17:26:01 +02:00
.ecrc common : Update stb_image.h to latest version (#9161) 2024-08-27 08:58:50 +03:00
.editorconfig editorconfig : ignore benches/ (#17140) 2025-11-10 12:17:19 +02:00
.flake8 llama : move end-user examples to tools directory (#13249) 2025-05-02 20:27:13 +02:00
.gitignore scripts : update get-hellaswag.sh and get-winogrande.sh (#20542) 2026-03-14 11:21:50 +01:00
.gitmodules ggml : remove kompute backend (#14501) 2025-07-03 07:48:32 +03:00
.pre-commit-config.yaml convert.py : add python logging instead of print() (#6511) 2024-05-03 22:36:41 +03:00
AGENTS.md docs : explicit about banning accounts that violates policy (#19593) 2026-03-21 15:50:16 +01:00
AUTHORS authors : update (#19263) 2026-02-02 08:51:25 +02:00
CLAUDE.md contributing: tighten AI usage policy (#18388) 2025-12-29 16:01:32 +01:00
CMakeLists.txt hexagon : fix build release (#19444) (#19587) 2026-02-20 16:40:00 -08:00
CMakePresets.json cmake : Add CMake presets for Linux and GCC (#14656) 2025-07-13 08:12:36 +03:00
CODEOWNERS server : fix wait in test_cancel_requests() test (#20601) 2026-03-15 20:54:37 +02:00
CONTRIBUTING.md docs : explicit about banning accounts that violates policy (#19593) 2026-03-21 15:50:16 +01:00
LICENSE docs : Minor cleanups (#19252) 2026-02-02 08:38:55 +02:00
Makefile make : remove make in favor of CMake (#15449) 2025-08-20 13:31:16 +03:00
README.md ggml : add OpenVINO backend (#15307) 2026-03-14 07:56:55 +02:00
SECURITY.md docs : fix broken link and typo (#19560) 2026-02-13 09:38:09 +01:00
build-xcframework.sh build : remove LLAMA_HTTPLIB option (#19623) 2026-02-15 15:38:50 +01:00
convert_hf_to_gguf.py Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730) 2026-03-21 13:35:21 +02:00
convert_hf_to_gguf_update.py model : add Jina Embeddings v5 Nano (partial EuroBERT) support (#19826) 2026-02-26 12:14:09 +01:00
convert_llama_ggml_to_gguf.py ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
convert_lora_to_gguf.py ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
flake.lock flake.lock: Update (#10470) 2024-11-24 08:03:25 -08:00
flake.nix fix(nix): remove non-functional llama-cpp cachix cache from flake.nix (#15295) 2025-08-13 11:21:31 -07:00
mypy.ini convert : partially revert PR #4818 (#5041) 2024-01-20 18:14:18 -05:00
poetry.lock build(python): Package scripts with pip-0517 compliance 2024-07-04 15:39:13 +00:00
pyproject.toml gguf-py : bump sentencepiece version (#19319) 2026-02-06 21:05:19 +01:00
pyrightconfig.json ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
requirements.txt `tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034) 2025-03-05 13:05:13 +00:00
ty.toml ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00

README.md

llama.cpp

Manifesto / ggml / ops

LLM inference in C/C++

Recent API changes

Hot topics


Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:

Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.

Example command:

# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

  • Plain C/C++ implementation without any dependencies
  • Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
  • AVX, AVX2, AVX512 and AMX support for x86 architectures
  • RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

The llama.cpp project is the main playground for developing new features for the ggml library.

Models

Typically finetunes of the base models below are supported as well.

Instructions for adding support for new models: HOWTO-add-model.md

Text-only

Multimodal

Bindings
UIs

(to have a project listed here, it should clearly state that it depends on llama.cpp)

Tools
  • akx/ggify - download PyTorch models from Hugging Face Hub and convert them to GGML
  • akx/ollama-dl - download models from the Ollama library to be used directly with llama.cpp
  • crashr/gppm - launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
  • gpustack/gguf-parser - review/check the GGUF file and estimate the memory usage
  • Styled Lines - proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example
  • unslothai/unsloth - 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)
Infrastructure
  • Paddler - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
  • GPUStack - Manage GPU clusters for running LLMs
  • llama_cpp_canister - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
  • llama-swap - transparent proxy that adds automatic model switching with llama-server
  • Kalavai - Crowdsource end to end LLM deployment at any scale
  • llmaz - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
  • LLMKube - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support
Games
  • Lucy's Labyrinth - A simple maze game where agents controlled by an AI model will try to trick you.

Supported backends

| Backend | Target devices |
| --- | --- |
| Metal | Apple Silicon |
| BLAS | All |
| BLIS | All |
| SYCL | Intel and Nvidia GPU |
| OpenVINO [In Progress] | Intel CPUs, GPUs, and NPUs |
| MUSA | Moore Threads GPU |
| CUDA | Nvidia GPU |
| HIP | AMD GPU |
| ZenDNN | AMD CPU |
| Vulkan | GPU |
| CANN | Ascend NPU |
| OpenCL | Adreno GPU |
| IBM zDNN | IBM Z & LinuxONE |
| WebGPU [In Progress] | All |
| RPC | All |
| Hexagon [In Progress] | Snapdragon |
| VirtGPU | VirtGPU APIR |

Obtaining and quantizing models

The Hugging Face platform hosts a number of LLMs compatible with llama.cpp:

You can either manually download the GGUF file or directly use any llama.cpp-compatible models from Hugging Face or other model hosting sites, such as ModelScope, by using this CLI argument: -hf <user>/<model>[:quant]. For example:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

By default, the CLI downloads from Hugging Face. To use a different host, set the environment variable MODEL_ENDPOINT. For example, to download model checkpoints from ModelScope or another model-sharing community, set MODEL_ENDPOINT=https://www.modelscope.cn/.
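
A minimal sketch of how an application could apply the same override. The helper name `resolve_model_endpoint`, the default URL, and the trailing-slash normalization are assumptions for illustration, not a description of what the CLI does internally.

```cpp
#include <string>

// Hypothetical helper. `env_value` is whatever std::getenv("MODEL_ENDPOINT")
// returned, or nullptr when the variable is unset.
inline std::string resolve_model_endpoint(const char * env_value) {
    std::string endpoint = "https://huggingface.co/";  // assumed default host
    if (env_value != nullptr && env_value[0] != '\0') {
        endpoint = env_value;
        if (endpoint.back() != '/') {
            endpoint += '/';  // normalize so paths can be appended directly
        }
    }
    return endpoint;
}
```

Taking the value as a parameter keeps the function easy to test; a caller would pass `std::getenv("MODEL_ENDPOINT")` from `<cstdlib>`.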

After downloading a model, use the CLI tools to run it locally - see below.

llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:

To learn more about model quantization, read this documentation

llama-cli

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.

  • Run in conversation mode

    Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding -cnv and specifying a suitable chat template with --chat-template NAME.

    llama-cli -m model.gguf
    
    # > hi, who are you?
    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
    #
    # > what is 1+1?
    # Easy peasy! The answer to 1+1 is... 2!
    
  • Run in conversation mode with custom chat template
    # use the "chatml" template (use -h to see the list of supported templates)
    llama-cli -m model.gguf -cnv --chat-template chatml
    
    # use a custom template
    llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
    
  • Constrain the output with a custom grammar
    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
    
    # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
    

    The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.

    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/

llama-server

A lightweight, OpenAI API compatible, HTTP server for serving LLMs.

  • Start a local HTTP server with default configuration on port 8080
    llama-server -m model.gguf --port 8080
    
    # Basic web UI can be accessed via browser: http://localhost:8080
    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
    
  • Support multiple users and parallel decoding
    # up to 4 concurrent requests, each with 4096 max context
    llama-server -m model.gguf -c 16384 -np 4
    
  • Enable speculative decoding
    # the draft.gguf model should be a small variant of the target model.gguf
    llama-server -m model.gguf -md draft.gguf
    
  • Serve an embedding model
    # use the /embedding endpoint
    llama-server -m model.gguf --embedding --pooling cls -ub 8192
    
  • Serve a reranking model
    # use the /reranking endpoint
    llama-server -m model.gguf --reranking
    
  • Constrain all outputs with a grammar
    # custom grammar
    llama-server -m model.gguf --grammar-file grammar.gbnf
    
    # JSON
    llama-server -m model.gguf --grammar-file grammars/json.gbnf
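
The parallel decoding example above pairs -c 16384 with -np 4 to give each request 4096 tokens of context. A small sketch of that arithmetic, assuming the total context is split evenly across slots as the comment in the example implies; both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Context available to each request when the total is split evenly.
inline uint32_t ctx_per_slot(uint32_t n_ctx_total, uint32_t n_parallel) {
    assert(n_parallel > 0);
    return n_ctx_total / n_parallel;
}

// Value to pass to -c so that each of n_parallel requests gets n_ctx_slot tokens.
inline uint32_t total_ctx_for(uint32_t n_ctx_slot, uint32_t n_parallel) {
    return n_ctx_slot * n_parallel;
}
```

For example, serving 8 concurrent requests at 4096 tokens each would need -c 32768 under this assumption.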
    

llama-perplexity

A tool for measuring the perplexity (and other quality metrics) of a model over a given text.

  • Measure the perplexity over a text file
    llama-perplexity -m model.gguf -f file.txt
    
    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
    # Final estimate: PPL = 5.4007 +/- 0.67339
    
  • Measure KL divergence
    # TODO
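
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below implements only that standard definition from a list of natural-log token probabilities; it does not reproduce the tool's chunking or its error estimate, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// PPL = exp( -(1/N) * sum_i log p(token_i | preceding tokens) )
inline double perplexity(const std::vector<double> & token_logprobs) {
    assert(!token_logprobs.empty());
    double sum = 0.0;
    for (double lp : token_logprobs) {
        sum += lp;  // natural-log probability of each observed token
    }
    return std::exp(-sum / (double) token_logprobs.size());
}
```

A model that assigns probability 0.5 to every token has a perplexity of 2; lower values mean the model finds the text less surprising.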
    

llama-bench

Benchmark the performance of the inference for various parameters.

  • Run default benchmark
    llama-bench -m model.gguf
    
    # Output:
    # | model               |       size |     params | backend    | threads |          test |                  t/s |
    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
    #
    # build: 3e0ba0e60 (4229)
    

llama-simple

A minimal example for implementing apps with llama.cpp. Useful for developers.

  • Basic text completion
    llama-simple -m model.gguf
    
    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
    

Contributing

  • Contributors can open PRs
  • Collaborators will be invited based on contributions
  • Maintainers can push to branches in the llama.cpp repo and merge PRs into the master branch
  • Any help with managing issues, PRs and projects is greatly appreciated!
  • See good first issues for tasks suitable for first contributions
  • Read the CONTRIBUTING.md for more information
  • Make sure to read this: Inference at the edge
  • A bit of backstory for those who are interested: Changelog podcast

Other documentation

Development documentation

Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:

XCFramework

The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS, and macOS. It can be used in Swift projects without the need to compile the library from source. For example:

// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.

import PackageDescription

let package = Package(
    name: "MyLlamaPackage",
    targets: [
        .executableTarget(
            name: "MyLlamaPackage",
            dependencies: [
                "LlamaFramework"
            ]),
        .binaryTarget(
            name: "LlamaFramework",
            url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip",
            checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab"
        )
    ]
)

The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum.

Completions

Command-line completion is available for some environments.

Bash Completion

$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash

Optionally this can be added to your .bashrc or .bash_profile to load it automatically. For example:

$ echo "source ~/.llama-completion.bash" >> ~/.bashrc

Dependencies

  • yhirose/cpp-httplib - Single-header HTTP server, used by llama-server - MIT license
  • stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
  • nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
  • miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
  • subprocess.h - Single-header process launching solution for C and C++ - Public domain