llama.cpp/gguf-py/gguf
itigges22 19fdba56b5 feat: MTP support for dense Qwen 3.5 with FastMTP vocabulary trimming
Add Multi-Token Prediction (MTP) speculative decoding for Qwen3.5 dense
models (0.8B-27B). The MTP head uses a full transformer block (attention
+ FFN) to predict the next-next token, enabling ~28 tok/s on RTX 5060 Ti.

Key changes:
- Model loading: Qwen3.5 MTP layer tensors (nextn.eh_proj, attention
  weights, FFN) loaded into layers[n_layer-1]
- Graph builder: Full MTP head with self-attention, gated RoPE, FFN,
  and vocabulary projection. Unfiltered hidden state passed for proper
  KV cache population during prompt processing.
- FastMTP: Vocabulary trimming from 248K to 32K tokens via ggml_view_2d
  on the lm_head. Reduces draft generation from 22ms to 6ms (3.7x).
- Speculative framework: MTP auto-detection for hybrid models, fuzzy
  seq_rm checkpoint matching for DeltaNet rollback.
- Server: Two-phase decode option for hybrid/recurrent models to avoid
  DeltaNet state corruption from rejected drafts.
- Recurrent state: Fixed copy_cell, which passed a byte size where
  ggml_view_1d expects an element count, and fixed buffer assignment
  for no_alloc views.
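The FastMTP trimming above can be sketched with numpy standing in for ggml tensors. This is a hedged illustration, not the actual implementation: the real change uses ggml_view_2d on the lm_head, and the assumption that the 32K most frequent tokens occupy the low token IDs (so a prefix view suffices) is inferred, not stated in the commit. The key point is that a view shares storage, so the draft head only pays for a 32K-row matmul.

```python
import numpy as np

n_embd = 64          # toy hidden size (a real model would be e.g. 4096)
n_vocab_full = 2480  # stand-in for the full 248K vocabulary
n_vocab_trim = 320   # stand-in for the trimmed 32K FastMTP vocabulary

rng = np.random.default_rng(0)
lm_head = rng.standard_normal((n_vocab_full, n_embd)).astype(np.float32)
hidden = rng.standard_normal(n_embd).astype(np.float32)

# Analogue of ggml_view_2d: basic slicing keeps the first rows without
# copying (assumption: frequent tokens sit at low IDs).
lm_head_trim = lm_head[:n_vocab_trim]
assert lm_head_trim.base is lm_head  # a view, not a copy

# Draft logits are computed over the trimmed vocabulary only...
draft_logits = lm_head_trim @ hidden
# ...and the argmax is already a valid ID in the full vocabulary,
# because trimming preserves the original row order.
draft_token = int(np.argmax(draft_logits))
assert 0 <= draft_token < n_vocab_full
```

Drafts that fall outside the trimmed vocabulary simply get rejected and corrected by the full-vocabulary target model, which is what makes a lossy draft head safe.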
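The copy_cell fix is easiest to see with a toy model of the view call: in ggml, ggml_view_1d takes an element count as its size argument while the trailing offset is in bytes, and mixing the two over-requests by sizeof(element). The helper below is a hypothetical numpy stand-in for illustration, not the real ggml API.

```python
import numpy as np

def view_1d(tensor: np.ndarray, ne0: int, offset_bytes: int) -> np.ndarray:
    """Toy analogue of ggml_view_1d: ne0 counts ELEMENTS,
    while the offset is in BYTES."""
    start = offset_bytes // tensor.itemsize
    return tensor.reshape(-1)[start:start + ne0]

cell = np.arange(8, dtype=np.float32)  # one recurrent-state cell, 8 floats

# Buggy call: passing the byte size (8 * 4 = 32) as the element count
# asks for 4x too many elements -- here numpy silently truncates, but a
# real ggml view would run past the cell's storage.
bad = view_1d(cell, cell.nbytes, 0)

# Fixed call: pass the element count, as the commit describes.
good = view_1d(cell, cell.size, 0)
assert np.array_equal(good, cell)
```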

Results on Qwen3.5-9B Q4_K_M (RTX 5060 Ti 16GB):
- 28.1 tok/s with 82% acceptance rate (temp=0)
- 92% acceptance with two-phase decode (correct output, 15 tok/s)
- Draft generation: 6.1ms with FastMTP (vs 22.4ms full vocab)
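As a quick sanity check, the 3.7x figure quoted for FastMTP follows directly from the two draft timings above (the "22ms to 6ms" in the summary rounds the same numbers):

```python
t_full = 22.4  # ms per draft, full 248K vocabulary
t_trim = 6.1   # ms per draft, trimmed 32K FastMTP vocabulary

speedup = t_full / t_trim
assert round(speedup, 1) == 3.7  # matches the quoted "3.7x"
```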
2026-03-21 14:18:40 -04:00
scripts ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
__init__.py convert-*.py: GGUF Naming Convention Refactor and Metadata Override Refactor (#7499) 2024-07-18 20:40:15 +10:00
constants.py feat: MTP support for dense Qwen 3.5 with FastMTP vocabulary trimming 2026-03-21 14:18:40 -04:00
gguf.py gguf-py: Refactor and allow reading/modifying existing GGUF files (#3981) 2023-11-11 08:04:50 +03:00
gguf_reader.py ggml/gguf : prevent integer overflows (#19856) 2026-02-24 20:17:11 +02:00
gguf_writer.py ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
lazy.py ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
metadata.py chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
py.typed convert : various script cleanups/fixes + merges and special token handling (#2842) 2023-08-30 11:25:50 +03:00
quants.py ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
tensor_mapping.py llama : add support for Nemotron 3 Super (#20411) 2026-03-11 19:27:53 +01:00
utility.py gguf-py : do not align the data start offset (#18291) 2025-12-22 20:25:16 +01:00
vocab.py ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00