Commit Graph

1 Commit

Author SHA1 Message Date
itigges22 4aeffc690d doc: document MTP attention requirement for higher acceptance
The MTP head has attention weights (Q/K/V), but they are currently unused
(the head runs an FFN-only path). Adding attention requires resolving ggml
buffer allocation for the MTP layer, which is configured with has_kv=false.

Approaches tried:
- build_attn with KV cache at il_kv=31: corrupts the main model's KV cache
- build_attn_inp_no_cache: GGML_ASSERT(buffer) failed
- build_attn_mha: GGML_ASSERT(buffer) failed
- Manual attention with ggml ops: GGML_ASSERT(buffer) failed
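For reference, the computation the "manual attention" approach tries to express as ggml ops is plain single-head scaled dot-product attention with a causal mask. A minimal sketch in standalone C++ (no ggml, since the blocker above is buffer allocation rather than the math; shapes and the single-head layout are illustrative assumptions, not taken from the model):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Causal single-head scaled dot-product attention:
//   out = softmax(Q K^T / sqrt(d)) V
// Q, K, V are row-major [n_tok x d] matrices.
std::vector<float> sdpa(const std::vector<float>& Q,
                        const std::vector<float>& K,
                        const std::vector<float>& V,
                        int n_tok, int d) {
    std::vector<float> out(n_tok * d, 0.0f);
    const float scale = 1.0f / std::sqrt((float)d);
    for (int i = 0; i < n_tok; ++i) {
        // scores of query i against keys j <= i (causal mask)
        std::vector<float> s(i + 1);
        float mx = -1e30f;
        for (int j = 0; j <= i; ++j) {
            float dot = 0.0f;
            for (int k = 0; k < d; ++k) dot += Q[i*d+k] * K[j*d+k];
            s[j] = dot * scale;
            mx = std::max(mx, s[j]);
        }
        // numerically stable softmax over the masked scores
        float sum = 0.0f;
        for (int j = 0; j <= i; ++j) { s[j] = std::exp(s[j] - mx); sum += s[j]; }
        // weighted sum of value rows
        for (int j = 0; j <= i; ++j) {
            const float w = s[j] / sum;
            for (int k = 0; k < d; ++k) out[i*d+k] += w * V[j*d+k];
        }
    }
    return out;
}
```

In ggml terms this is roughly a ggml_mul_mat for the scores, a softmax, and another ggml_mul_mat against V; each intermediate is a tensor the scheduler must allocate a buffer for, which is exactly where the GGML_ASSERT(buffer) failures occur.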

Root cause: the graph scheduler doesn't allocate buffers for the MTP
layer's attention ops. Need to either extend n_layer_kv_from_start to
include MTP layers, or add the MTP attention to the graph plan before
the scheduler runs.

Current state: FFN-only MTP gives a 95% acceptance rate at temp=0.6.
2026-03-20 00:52:32 -04:00
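As a back-of-envelope check on what the reported 95% acceptance rate buys: under the usual speculative-decoding model (per-token acceptance treated as independent with probability p, k draft tokens verified per target forward pass, draft cost ignored), the expected number of tokens emitted per target step is the geometric sum 1 + p + ... + p^k. The values of p and k below are assumptions for illustration; the commit only reports p ≈ 0.95.

```cpp
#include <cassert>
#include <cmath>

// Expected tokens emitted per target-model forward pass when k draft
// tokens are proposed and each is accepted independently with
// probability p: 1 + p + p^2 + ... + p^k.
double expected_tokens_per_step(double p, int k) {
    double e = 0.0, pk = 1.0;
    for (int i = 0; i <= k; ++i) {
        e += pk;   // contribution of accepting the first i draft tokens
        pk *= p;
    }
    return e;
}
```

With a single MTP draft token (k=1) and p=0.95, this gives 1.95 tokens per target step, i.e. close to the 2x ceiling for one draft head.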