Add native multi-token prediction (MTP) support for the dense Qwen 3.5 architecture (0.8B, 2B, 4B, 9B, 27B).
What works:
- MTP graph builder for dense qwen35 (build_mtp_head in qwen35.cpp); a simplified sketch of the head structure follows this list
- MTP tensor loading and registration for QWEN35 arch
- GGUF converter handles MTP tensors (mtp.fc, mtp.layers, mtp.norm, etc.)
- Public API: llama_get_mtp_logits(), llama_model_n_mtp_layers(); a usage sketch follows this list
- Server auto-detects MTP from GGUF metadata
- Speculative state machine for MTP draft token generation
- PR #20075 applied: recurrent state checkpoint/restore for hybrid models
- M-RoPE position check relaxed for speculative re-evaluation
- Windows os.kill fix for gateway process detection
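
For reference, here is a highly simplified sketch of what an MTP head graph can look like in raw ggml ops. This is not the PR's build_mtp_head: the weight struct, tensor shapes, norm/concat/projection order, and the shared output head are all assumptions here; it only illustrates the projection + FFN path that the current graph builds (attention skipped).

```cpp
// Hypothetical sketch only -- not the actual build_mtp_head() from qwen35.cpp.
// Assumes a DeepSeek-style MTP head: norm both streams, concat, project back
// to n_embd, run a SwiGLU FFN, then apply a final norm and the output head.
#include "ggml.h"

struct mtp_weights {              // stand-in for the mtp.* tensors from GGUF
    ggml_tensor * norm_emb;       // [n_embd]
    ggml_tensor * norm_hid;       // [n_embd]
    ggml_tensor * fc;             // [2*n_embd, n_embd]  (mtp.fc, assumed shape)
    ggml_tensor * ffn_up;         // [n_embd, n_ff]
    ggml_tensor * ffn_gate;       // [n_embd, n_ff]
    ggml_tensor * ffn_down;       // [n_ff, n_embd]
    ggml_tensor * norm_out;       // [n_embd]             (mtp.norm, assumed)
    ggml_tensor * output;         // [n_embd, n_vocab]    (shared head, assumed)
};

static ggml_tensor * build_mtp_head_sketch(
        ggml_context * ctx,
        ggml_tensor  * tok_emb,   // [n_embd, n_tokens] embeddings of the current tokens
        ggml_tensor  * hidden,    // [n_embd, n_tokens] last hidden state of the main model
        const mtp_weights & w,
        float eps) {
    // RMS-norm both input streams
    ggml_tensor * e = ggml_mul(ctx, ggml_rms_norm(ctx, tok_emb, eps), w.norm_emb);
    ggml_tensor * h = ggml_mul(ctx, ggml_rms_norm(ctx, hidden,  eps), w.norm_hid);

    // fuse: concat along the embedding dim, then project back to n_embd
    ggml_tensor * cur = ggml_concat(ctx, e, h, 0);   // [2*n_embd, n_tokens]
    cur = ggml_mul_mat(ctx, w.fc, cur);              // [n_embd, n_tokens]

    // FFN-only path (MTP attention is skipped for now): SwiGLU
    ggml_tensor * gate = ggml_silu(ctx, ggml_mul_mat(ctx, w.ffn_gate, cur));
    cur = ggml_mul(ctx, gate, ggml_mul_mat(ctx, w.ffn_up, cur));
    cur = ggml_mul_mat(ctx, w.ffn_down, cur);

    // final norm + output head -> MTP draft logits [n_vocab, n_tokens]
    cur = ggml_mul(ctx, ggml_rms_norm(ctx, cur, eps), w.norm_out);
    return ggml_mul_mat(ctx, w.output, cur);
}
```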
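And a minimal usage sketch for the new public API. Only the two function names come from this PR; the parameter lists below are assumptions modeled on the existing llama_get_logits() / llama_model_n_* conventions.

```cpp
// Assumed signatures (modeled on existing llama.cpp API conventions):
//   int32_t llama_model_n_mtp_layers(const llama_model * model);  // 0 => no MTP head
//   float * llama_get_mtp_logits(llama_context * ctx);            // n_vocab floats from the last decode
#include "llama.h"
#include <cstdio>

void try_mtp_draft(const llama_model * model, llama_context * ctx, int32_t n_vocab) {
    if (llama_model_n_mtp_layers(model) == 0) {
        printf("no MTP tensors in this GGUF\n");
        return;
    }
    const float * d = llama_get_mtp_logits(ctx);
    // greedy draft token for the next position
    llama_token draft = 0;
    for (int32_t i = 1; i < n_vocab; ++i) {
        if (d[i] > d[draft]) draft = i;
    }
    printf("MTP greedy draft: %d\n", (int) draft);
}
```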
What needs work:
- Speculative verify loop conflicts with tool-calling requests (400 error)
- Recommended fix: bypass the speculative framework entirely and implement MTP
  acceptance directly in the server generation loop; no seq_rm/rollback is needed
  since MTP drafts are produced in-graph (see the sketch after this list)
- MTP attention skipped (projection + FFN path only) due to an inp_out_ids token count mismatch
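
To make the recommended fix concrete, below is a rough sketch of what an in-loop acceptance check could look like. None of this is the PR's actual server code: the helper name, batching details, and greedy-only verification are illustrative, and KV-cache bookkeeping on rejection is intentionally left to the server's existing slot logic.

```cpp
// Hypothetical sketch of direct MTP acceptance in the generation loop.
// Assumes llama_get_mtp_logits() (this PR) produced a draft for pos+1 during
// the previous decode; greedy verification only, sampling params ignored.
#include "llama.h"
#include "common.h"   // common_batch_add()

// Decode the last sampled token together with the MTP draft for the next
// position, then check whether the main model agrees with the draft.
// Returns true if the draft is accepted (two tokens advanced per decode).
bool mtp_verify_step(llama_context * ctx, llama_token last_token,
                     llama_token draft, llama_pos pos, int32_t n_vocab) {
    llama_batch batch = llama_batch_init(2, 0, 1);
    common_batch_add(batch, last_token, pos,     { 0 }, true);
    common_batch_add(batch, draft,      pos + 1, { 0 }, true);
    llama_decode(ctx, batch);
    llama_batch_free(batch);

    // main-model logits at `pos` predict the token at pos+1, i.e. the draft slot
    const float * verify = llama_get_logits_ith(ctx, 0);
    llama_token target = 0;
    for (int32_t i = 1; i < n_vocab; ++i) {
        if (verify[i] > verify[target]) target = i;
    }
    return target == draft;
    // accepted: continue from the logits at index 1 plus the fresh MTP logits
    // rejected: resample from `verify` and drop the draft position
}
```

Because the draft comes out of the same decode that produced the main logits, there is no separate draft model or context to manage, which is what lets the server skip the general speculative framework.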
Tested on: RTX 5060 8GB, Windows 11, CUDA 13.2
Model: Qwen3.5-9B with MTP tensors (Q4_K_M quantization)
Base: llama.cpp b8388