Add native multi-token prediction (MTP) support for the dense Qwen 3.5 architecture (0.8B, 2B, 4B, 9B, 27B).
What works:
- MTP graph builder for dense qwen35 (build_mtp_head in qwen35.cpp); a simplified sketch of the head structure follows this list
- MTP tensor loading and registration for QWEN35 arch
- GGUF converter handles MTP tensors (mtp.fc, mtp.layers, mtp.norm, etc.)
- Public API: llama_get_mtp_logits(), llama_model_n_mtp_layers(); a usage sketch follows this list
- Server auto-detects MTP from GGUF metadata
- Speculative state machine for MTP draft token generation
- PR #20075 applied: recurrent state checkpoint/restore for hybrid models
- M-RoPE position check relaxed for speculative re-evaluation
- Windows os.kill fix for gateway process detection
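
For reference, here is a highly simplified sketch of what an MTP head graph can look like in raw ggml ops. This is not the PR's build_mtp_head: the weight struct, tensor shapes, norm/concat/projection order, and the shared output head are all assumptions here; it only illustrates the projection + FFN path that the current graph builds (attention skipped).

```cpp
// Hypothetical sketch only -- not the actual build_mtp_head() from qwen35.cpp.
// Assumes a DeepSeek-style MTP head: norm both streams, concat, project back
// to n_embd, run a SwiGLU FFN, then apply a final norm and the output head.
#include "ggml.h"

struct mtp_weights {              // stand-in for the mtp.* tensors from GGUF
    ggml_tensor * norm_emb;       // [n_embd]
    ggml_tensor * norm_hid;       // [n_embd]
    ggml_tensor * fc;             // [2*n_embd, n_embd]  (mtp.fc, assumed shape)
    ggml_tensor * ffn_up;         // [n_embd, n_ff]
    ggml_tensor * ffn_gate;       // [n_embd, n_ff]
    ggml_tensor * ffn_down;       // [n_ff, n_embd]
    ggml_tensor * norm_out;       // [n_embd]             (mtp.norm, assumed)
    ggml_tensor * output;         // [n_embd, n_vocab]    (shared head, assumed)
};

static ggml_tensor * build_mtp_head_sketch(
        ggml_context * ctx,
        ggml_tensor  * tok_emb,   // [n_embd, n_tokens] embeddings of the current tokens
        ggml_tensor  * hidden,    // [n_embd, n_tokens] last hidden state of the main model
        const mtp_weights & w,
        float eps) {
    // RMS-norm both input streams
    ggml_tensor * e = ggml_mul(ctx, ggml_rms_norm(ctx, tok_emb, eps), w.norm_emb);
    ggml_tensor * h = ggml_mul(ctx, ggml_rms_norm(ctx, hidden,  eps), w.norm_hid);

    // fuse: concat along the embedding dim, then project back to n_embd
    ggml_tensor * cur = ggml_concat(ctx, e, h, 0);   // [2*n_embd, n_tokens]
    cur = ggml_mul_mat(ctx, w.fc, cur);              // [n_embd, n_tokens]

    // FFN-only path (MTP attention is skipped for now): SwiGLU
    ggml_tensor * gate = ggml_silu(ctx, ggml_mul_mat(ctx, w.ffn_gate, cur));
    cur = ggml_mul(ctx, gate, ggml_mul_mat(ctx, w.ffn_up, cur));
    cur = ggml_mul_mat(ctx, w.ffn_down, cur);

    // final norm + output head -> MTP draft logits [n_vocab, n_tokens]
    cur = ggml_mul(ctx, ggml_rms_norm(ctx, cur, eps), w.norm_out);
    return ggml_mul_mat(ctx, w.output, cur);
}
```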
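And a minimal usage sketch for the new public API. Only the two function names come from this PR; the parameter lists below are assumptions modeled on the existing llama_get_logits() / llama_model_n_* conventions.

```cpp
// Assumed signatures (modeled on existing llama.cpp API conventions):
//   int32_t llama_model_n_mtp_layers(const llama_model * model);  // 0 => no MTP head
//   float * llama_get_mtp_logits(llama_context * ctx);            // n_vocab floats from the last decode
#include "llama.h"
#include <cstdio>

void try_mtp_draft(const llama_model * model, llama_context * ctx, int32_t n_vocab) {
    if (llama_model_n_mtp_layers(model) == 0) {
        printf("no MTP tensors in this GGUF\n");
        return;
    }
    const float * d = llama_get_mtp_logits(ctx);
    // greedy draft token for the next position
    llama_token draft = 0;
    for (int32_t i = 1; i < n_vocab; ++i) {
        if (d[i] > d[draft]) draft = i;
    }
    printf("MTP greedy draft: %d\n", (int) draft);
}
```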
What needs work:
- Speculative verify loop conflicts with tool-calling requests (400 error)
- Recommended fix: bypass the speculative framework entirely and implement MTP
  acceptance directly in the server generation loop; no seq_rm/rollback is needed
  since MTP drafts are produced in-graph (see the sketch after this list)
- MTP attention skipped (projection + FFN path only) due to an inp_out_ids token count mismatch
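
To make the recommended fix concrete, below is a rough sketch of what an in-loop acceptance check could look like. None of this is the PR's actual server code: the helper name, batching details, and greedy-only verification are illustrative, and KV-cache bookkeeping on rejection is intentionally left to the server's existing slot logic.

```cpp
// Hypothetical sketch of direct MTP acceptance in the generation loop.
// Assumes llama_get_mtp_logits() (this PR) produced a draft for pos+1 during
// the previous decode; greedy verification only, sampling params ignored.
#include "llama.h"
#include "common.h"   // common_batch_add()

// Decode the last sampled token together with the MTP draft for the next
// position, then check whether the main model agrees with the draft.
// Returns true if the draft is accepted (two tokens advanced per decode).
bool mtp_verify_step(llama_context * ctx, llama_token last_token,
                     llama_token draft, llama_pos pos, int32_t n_vocab) {
    llama_batch batch = llama_batch_init(2, 0, 1);
    common_batch_add(batch, last_token, pos,     { 0 }, true);
    common_batch_add(batch, draft,      pos + 1, { 0 }, true);
    llama_decode(ctx, batch);
    llama_batch_free(batch);

    // main-model logits at `pos` predict the token at pos+1, i.e. the draft slot
    const float * verify = llama_get_logits_ith(ctx, 0);
    llama_token target = 0;
    for (int32_t i = 1; i < n_vocab; ++i) {
        if (verify[i] > verify[target]) target = i;
    }
    return target == draft;
    // accepted: continue from the logits at index 1 plus the fresh MTP logits
    // rejected: resample from `verify` and drop the draft position
}
```

Because the draft comes out of the same decode that produced the main logits, there is no separate draft model or context to manage, which is what lets the server skip the general speculative framework.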
Tested on: RTX 5060 8GB, Windows 11, CUDA 13.2
Model: Qwen3.5-9B with MTP tensors (Q4_K_M quantization)
Base: llama.cpp b8388