# gguf
This is a Python package for writing binary files in the GGUF (GGML Universal File) format.

See `convert_hf_to_gguf.py` as an example of its usage.
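To illustrate what the package emits, here is a minimal stdlib-only sketch of the fixed-size GGUF header (magic `GGUF`, format version, tensor count, metadata KV count, all little-endian). The field order follows the GGUF v3 layout; this is an illustration of the on-disk format, not the package's own writer:

```python
import struct

# GGUF v3 header layout (little-endian):
#   4s  magic             b"GGUF"
#   I   version           3
#   Q   tensor_count
#   Q   metadata_kv_count
HEADER = struct.Struct("<4sIQQ")

def pack_header(tensor_count: int, kv_count: int) -> bytes:
    """Pack a minimal GGUF header (no tensor or KV data follows here)."""
    return HEADER.pack(b"GGUF", 3, tensor_count, kv_count)

def unpack_header(data: bytes):
    magic, version, n_tensors, n_kv = HEADER.unpack_from(data)
    assert magic == b"GGUF", "not a GGUF file"
    return version, n_tensors, n_kv

raw = pack_header(tensor_count=0, kv_count=1)
print(unpack_header(raw))  # (3, 0, 1)
```

A real file written by this package continues with the metadata key-value pairs and tensor info blocks after this header.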
## Installation

```sh
pip install gguf
```
Optionally, you can install gguf with the `gui` extra to enable the visual GGUF editor.

```sh
pip install gguf[gui]
```
## API Examples/Simple Tools
- `examples/writer.py` — Generates `example.gguf` in the current directory to demonstrate generating a GGUF file. Note that this file cannot be used as a model.
- `examples/reader.py` — Extracts and displays key-value pairs and tensor details from a GGUF file in a readable format.
- `gguf/scripts/gguf_dump.py` — Dumps a GGUF file's metadata to the console.
- `gguf/scripts/gguf_set_metadata.py` — Allows changing simple metadata values in a GGUF file by key.
- `gguf/scripts/gguf_convert_endian.py` — Allows converting the endianness of GGUF files.
- `gguf/scripts/gguf_new_metadata.py` — Copies a GGUF file with added/modified/removed metadata values.
- `gguf/scripts/gguf_editor_gui.py` — Allows viewing, editing, adding, or removing metadata values within a GGUF file, as well as viewing its tensors, with a Qt interface.
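The metadata these scripts read and edit is stored as typed key-value pairs. A GGUF string, for instance, is a uint64 byte length followed by UTF-8 data, and each pair is a string key, a uint32 value-type code, then the value. A stdlib-only sketch of one string-valued pair (the type code for strings is an assumption taken from the GGUF specification):

```python
import struct

GGUF_TYPE_STRING = 8  # assumed type code for string values in GGUF metadata

def pack_string(s: str) -> bytes:
    # GGUF string: uint64 byte length, then UTF-8 data
    data = s.encode("utf-8")
    return struct.pack("<Q", len(data)) + data

def pack_kv_string(key: str, value: str) -> bytes:
    # key (gguf string) + value type (uint32) + value (gguf string)
    return pack_string(key) + struct.pack("<I", GGUF_TYPE_STRING) + pack_string(value)

def unpack_string(buf: bytes, off: int):
    (n,) = struct.unpack_from("<Q", buf, off)
    off += 8
    return buf[off:off + n].decode("utf-8"), off + n

def unpack_kv_string(buf: bytes):
    key, off = unpack_string(buf, 0)
    (vtype,) = struct.unpack_from("<I", buf, off)
    assert vtype == GGUF_TYPE_STRING
    value, _ = unpack_string(buf, off + 4)
    return key, value

blob = pack_kv_string("general.name", "example")
print(unpack_kv_string(blob))  # ('general.name', 'example')
```

This is the kind of record `gguf_dump.py` walks when printing a file's metadata.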
## Development
Maintainers who participate in development of this package are advised to install it in editable mode:

```sh
cd /path/to/llama.cpp/gguf-py
pip install --editable .
```
**Note**: This may require upgrading your Pip installation, indicated by a message saying that editable installation currently requires `setup.py`. In that case, upgrade Pip to the latest version:

```sh
pip install --upgrade pip
```
## Automatic publishing with CI
There's a GitHub workflow to make a release automatically upon creation of tags in a specified format.

1. Bump the version in `pyproject.toml`.
2. Create a tag named `gguf-vx.x.x` where `x.x.x` is the semantic version number.

   ```sh
   git tag -a gguf-v1.0.0 -m "Version 1.0 release"
   ```

3. Push the tags:

   ```sh
   git push origin --tags
   ```
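Before pushing, it can be worth sanity-checking that a tag matches the expected shape locally; the exact pattern the CI workflow matches is an assumption here, based on the `gguf-vx.x.x` format described above:

```python
import re

# Assumed release-tag pattern: "gguf-v" followed by a semantic version.
TAG_RE = re.compile(r"^gguf-v(\d+)\.(\d+)\.(\d+)$")

def is_release_tag(tag: str) -> bool:
    """Return True if the tag looks like a gguf release tag."""
    return TAG_RE.match(tag) is not None

print(is_release_tag("gguf-v1.0.0"))  # True
print(is_release_tag("gguf-1.0.0"))   # False (missing the "v")
```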
## Manual publishing
If you want to publish the package manually for any reason, you need to have `twine` and `build` installed:

```sh
pip install build twine
```
Then, follow these steps to release a new version:

1. Bump the version in `pyproject.toml`.
2. Build the package:

   ```sh
   python -m build
   ```

3. Upload the generated distribution archives:

   ```sh
   python -m twine upload dist/*
   ```
## Run Unit Tests

From the root of this repository, you can run this command to run all the unit tests:

```sh
python -m unittest discover ./gguf-py -v
```
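`unittest` discovery picks up any `test_*.py` module under the discovered directory. A minimal, self-contained example of the shape discovery expects (class and test names here are illustrative, not taken from the real test suite; it uses an explicit runner so it can be executed directly):

```python
import unittest

class TestBlockSize(unittest.TestCase):
    """Illustrative test case; the real tests live in gguf-py/tests."""

    def test_block_size_divides(self):
        # e.g. Q8_0 quantization packs values in blocks of 32 elements,
        # so tensor row sizes must be a multiple of 32
        self.assertEqual(4096 % 32, 0)

suite = unittest.TestLoader().loadTestsFromTestCase(TestBlockSize)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```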
## TODO

- Include conversion scripts as command-line entry points in this package.