llama.cpp

itigges22 19fdba56b5 feat: MTP support for dense Qwen 3.5 with FastMTP vocabulary trimming		2026-03-21 14:18:40 -04:00

Add Multi-Token Prediction (MTP) speculative decoding for Qwen3.5 dense models (0.8B-27B). The MTP head uses a full transformer block (attention + FFN) to predict the next-next token, enabling ~28 tok/s on RTX 5060 Ti.

Key changes:
- Model loading: Qwen3.5 MTP layer tensors (nextn.eh_proj, attention weights, FFN) loaded into layers[n_layer-1]
- Graph builder: Full MTP head with self-attention, gated RoPE, FFN, and vocabulary projection. Unfiltered hidden state passed for proper KV cache population during prompt processing.
- FastMTP: Vocabulary trimming from 248K to 32K tokens via ggml_view_2d on the lm_head. Reduces draft generation from 22ms to 6ms (3.7x).
- Speculative framework: MTP auto-detection for hybrid models, fuzzy seq_rm checkpoint matching for DeltaNet rollback.
- Server: Two-phase decode option for hybrid/recurrent models to avoid DeltaNet state corruption from rejected drafts.
- Recurrent state: Fixed copy_cell (ggml_view_1d takes element count, not bytes), buffer assignment for no_alloc views.

Results on Qwen3.5-9B Q4_K_M (RTX 5060 Ti 16GB):
- 28.1 tok/s with 82% acceptance rate (temp=0)
- 92% acceptance with two-phase decode (correct output, 15 tok/s)
- Draft generation: 6.1ms with FastMTP (vs 22.4ms full vocab)
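
The FastMTP item above describes trimming the draft head's vocabulary by taking a view on the lm_head rather than copying it. A minimal sketch of that idea in plain C++ follows. It does not use ggml: `MatView`, `trim_rows`, and `argmax_logit` are hypothetical names introduced here for illustration, and the sketch assumes the kept tokens are the first rows of a row-major weight matrix, which the commit message does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2D tensor view: a non-owning window onto
// row-major data. This is not a ggml type.
struct MatView {
    const float * data;   // first element visible through the view
    std::size_t   n_rows; // rows in the view (one per vocabulary entry)
    std::size_t   n_cols; // columns per row (hidden size)
};

// Return a view of the first n_keep rows. No weights are copied
// (assumption: the trimmed vocabulary is a contiguous prefix of token ids).
MatView trim_rows(const MatView & full, std::size_t n_keep) {
    assert(n_keep <= full.n_rows);
    return MatView{ full.data, n_keep, full.n_cols };
}

// Greedy draft token: index of the row whose dot product with `hidden`
// is largest. Cost scales with head.n_rows, so fewer rows means less work.
std::size_t argmax_logit(const MatView & head, const std::vector<float> & hidden) {
    assert(head.n_rows > 0 && hidden.size() == head.n_cols);
    std::size_t best = 0;
    float best_logit = 0.0f;
    for (std::size_t r = 0; r < head.n_rows; ++r) {
        float logit = 0.0f;
        for (std::size_t c = 0; c < head.n_cols; ++c) {
            logit += head.data[r * head.n_cols + c] * hidden[c];
        }
        if (r == 0 || logit > best_logit) {
            best = r;
            best_logit = logit;
        }
    }
    return best;
}
```

With a trimmed view the draft head can only propose token ids below `n_keep`. In this sketch the projection cost is proportional to the number of rows visited, which is consistent in direction with the reported drop from 22.4ms to 6.1ms, although the real speedup depends on the backend kernels.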
jinja	jinja : fix heap OOB read in value equality comparison (#20782)	2026-03-20 07:15:17 +01:00
CMakeLists.txt	common/parser: handle reasoning budget (#20297)	2026-03-11 10:26:12 +01:00
arg.cpp	feat: MTP support for dense Qwen 3.5 with FastMTP vocabulary trimming	2026-03-21 14:18:40 -04:00
arg.h	vendor : update cpp-httplib to 0.30.0 (#18660)	2026-01-08 13:53:54 +01:00
base64.hpp	llava : expose as a shared library for downstream projects (#3613)	2023-11-07 00:36:23 +03:00
build-info.cpp.in	cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)	2025-06-13 10:38:52 +02:00
chat-auto-parser-generator.cpp	chat : handle tool calls with no required args in TAG_WITH_TAGGED format (#20764)	2026-03-19 17:53:11 +01:00
chat-auto-parser-helpers.cpp	common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825)	2026-03-21 00:19:04 +01:00
chat-auto-parser-helpers.h	common/parser: add proper reasoning tag prefill reading (#20424)	2026-03-19 16:58:21 +01:00
chat-auto-parser.h	common/parser: add proper reasoning tag prefill reading (#20424)	2026-03-19 16:58:21 +01:00
chat-diff-analyzer.cpp	common : fix typo in debug log ('extracft' -> 'extract') (#20807)	2026-03-20 18:23:18 +01:00
chat-peg-parser.cpp	common/parser: add proper reasoning tag prefill reading (#20424)	2026-03-19 16:58:21 +01:00
chat-peg-parser.h	common/parser: use nlohmann::ordered_json to preserve parameter order (#20385)	2026-03-11 10:26:51 +01:00
chat.cpp	common/parser : fix out_of_range crash in throw path (#20424 regression) (#20777)	2026-03-20 02:37:22 +01:00
chat.h	common/parser: add proper reasoning tag prefill reading (#20424)	2026-03-19 16:58:21 +01:00
common.cpp	llama : re-enable manual LoRA adapter free (#19983)	2026-03-18 12:03:26 +02:00
common.h	feat: MTP support for dense Qwen 3.5 with FastMTP vocabulary trimming	2026-03-21 14:18:40 -04:00
console.cpp	cli : add command and file auto-completion (#19985)	2026-03-05 10:47:28 +01:00
console.h	cli : add command and file auto-completion (#19985)	2026-03-05 10:47:28 +01:00
debug.cpp	debug: make common_debug_print_tensor readable (#19331)	2026-02-04 17:55:31 +01:00
debug.h	chore : correct typos [no ci] (#20041)	2026-03-05 08:50:21 +01:00
download.cpp	build : remove LLAMA_HTTPLIB option (#19623)	2026-02-15 15:38:50 +01:00
download.h	preset: allow named remote preset (#18728)	2026-01-10 15:12:29 +01:00
http.h	server: Parse port numbers from MCP server URLs in CORS proxy (#20208)	2026-03-09 17:47:54 +01:00
json-partial.cpp	common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932)	2025-11-18 18:54:15 +01:00
json-partial.h	cli : fix reasoning responses in CLI (#18961)	2026-01-20 18:23:25 +01:00
json-schema-to-grammar.cpp	common : fix incorrect uses of stoul (#20313)	2026-03-10 11:40:26 +01:00
json-schema-to-grammar.h	common : add nemotron 3 parsing (#18077)	2025-12-16 04:05:23 -06:00
llguidance.cpp	sampling : add support for backend sampling (#17004)	2026-01-04 22:22:16 +02:00
log.cpp	cli: new CLI experience (#17824)	2025-12-10 15:28:59 +01:00
log.h	cli: new CLI experience (#17824)	2025-12-10 15:28:59 +01:00
ngram-cache.cpp	spec : add self‑speculative decoding (no draft model required) + refactor (#18471)	2026-01-28 19:42:42 +02:00
ngram-cache.h	spec : add self‑speculative decoding (no draft model required) + refactor (#18471)	2026-01-28 19:42:42 +02:00
ngram-map.cpp	llama : correct typos 'occured' and 'occurences' (#19414)	2026-02-11 07:05:31 +01:00
ngram-map.h	llama : correct typos 'occured' and 'occurences' (#19414)	2026-02-11 07:05:31 +01:00
ngram-mod.cpp	spec : add ngram-mod (#19164)	2026-01-30 18:21:48 +02:00
ngram-mod.h	ngram-mod : fix build [no ci] (#19216)	2026-01-30 21:27:27 +02:00
peg-parser.cpp	common: consolidate PEG string parsers (#20263)	2026-03-10 00:29:21 +01:00
peg-parser.h	common: consolidate PEG string parsers (#20263)	2026-03-10 00:29:21 +01:00
preset.cpp	preset: allow named remote preset (#18728)	2026-01-10 15:12:29 +01:00
preset.h	common: support remote preset (#18520)	2026-01-08 22:35:40 +01:00
reasoning-budget.cpp	common/parser: add proper reasoning tag prefill reading (#20424)	2026-03-19 16:58:21 +01:00
reasoning-budget.h	common/parser: add proper reasoning tag prefill reading (#20424)	2026-03-19 16:58:21 +01:00
regex-partial.cpp	common : fix iterator::end() dereference (#20445)	2026-03-16 08:50:38 +02:00
regex-partial.h	`common`: add partial regex support (#12808)	2025-05-14 19:50:57 +01:00
sampling.cpp	feat: MTP support for dense Qwen 3.5 with FastMTP vocabulary trimming	2026-03-21 14:18:40 -04:00
sampling.h	sampling : add support for backend sampling (#17004)	2026-01-04 22:22:16 +02:00
speculative.cpp	feat: MTP support for dense Qwen 3.5 with FastMTP vocabulary trimming	2026-03-21 14:18:40 -04:00
speculative.h	common : add common_speculative_is_compat() (#19270)	2026-02-06 16:47:22 +02:00
unicode.cpp	common/parser: handle reasoning budget (#20297)	2026-03-11 10:26:12 +01:00
unicode.h	common/parser: handle reasoning budget (#20297)	2026-03-11 10:26:12 +01:00