The eviction was inside the is_empty() check, so it never ran when
all cells were occupied. Moved the eviction outside the check so it
always tries to free old checkpoints to make room for new ones.
Results:
- 48 successful rollbacks (44% of rejections properly restored)
- 34.4 tok/s on short, ~20 tok/s on long generation
- 65% acceptance rate with proper rollback
Remaining: 56% of rejections still can't find checkpoints because
the checkpoint was evicted before seq_rm ran. Need to either keep
checkpoints longer or increase rs_size further.
- Check next_empty_cell < size before accessing cells array
- Update next_empty_cell to freed cell after eviction
- Increase rs_size from 3 to 4 to leave more room for checkpoints
- Fix: eviction now correctly reuses freed cells for new checkpoints
Still TODO: checkpoint positions don't match what seq_rm looks for.
Checkpoints are created at the current tail position (post-update),
but seq_rm needs the pre-update position. Need to capture the position
BEFORE the speculative batch updates the tail.
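The intended ordering can be sketched like this (hypothetical names, a simplified model of the tail/position bookkeeping):

```cpp
#include <cassert>
#include <cstdint>

// Model of the bookkeeping around a speculative batch. The checkpoint
// must record the tail position BEFORE the batch advances it, because
// rollback via seq_rm searches for the pre-update position.
struct seq_state {
    int64_t tail_pos       = 0;
    int64_t checkpoint_pos = -1;

    void decode_speculative(int n_draft) {
        checkpoint_pos = tail_pos;   // capture before the update
        tail_pos      += n_draft;    // the batch advances the tail
    }

    // Rollback succeeds only if a checkpoint exists at the requested pos.
    bool seq_rm(int64_t pos) {
        if (checkpoint_pos == pos) {
            tail_pos = checkpoint_pos;
            return true;
        }
        return false;
    }
};
```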
Performance: 19.8-24 tok/s, 63-75% acceptance, no crashes.
Root cause found: copy_cell crashes during find_slot because it calls
ggml_backend_tensor_copy on GPU tensors while the compute graph is
being built. Fixed by using CPU staging: tensor_get (GPU→CPU) then
tensor_set (CPU→GPU).
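The staging pattern, modeled here with plain buffers: in the actual fix the two copies are ggml_backend_tensor_get and ggml_backend_tensor_set, replacing the direct ggml_backend_tensor_copy that crashed while the compute graph was being built.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy src -> dst through a CPU staging buffer instead of a direct
// device-to-device copy. In the real code the two steps are:
//   ggml_backend_tensor_get(src, staging.data(), 0, nbytes); // GPU -> CPU
//   ggml_backend_tensor_set(dst, staging.data(), 0, nbytes); // CPU -> GPU
void copy_cell_staged(const uint8_t * src, uint8_t * dst, size_t nbytes) {
    std::vector<uint8_t> staging(nbytes);
    std::memcpy(staging.data(), src, nbytes); // stands in for tensor_get
    std::memcpy(dst, staging.data(), nbytes); // stands in for tensor_set
}
```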
Also increased rs_size from 1 to 3 cells per sequence to make room
for checkpoint cells needed by speculative decoding rollback.
Results:
- No more crashes during speculative decode
- 23.8 tok/s with MTP (vs 16.7 without)
- 75% acceptance rate
- Output still garbled on long generation due to seq_rm not finding
checkpoints at the right positions (checkpoint position mismatch)
Next: fix checkpoint position tracking so seq_rm can find and restore
the correct recurrent state after draft rejection.
The MTP head has attention weights (Q/K/V), but they are currently unused
(FFN-only path). Adding attention requires resolving ggml buffer
allocation for the MTP layer, which is registered with has_kv=false.
Approaches tried:
- build_attn with KV cache at il_kv=31: corrupts main model KV
- build_attn_inp_no_cache: GGML_ASSERT(buffer) failed
- build_attn_mha: GGML_ASSERT(buffer) failed
- Manual attention with ggml ops: GGML_ASSERT(buffer) failed
Root cause: graph scheduler doesn't allocate buffers for MTP layer
attention ops. Need to either extend n_layer_kv_from_start to include
MTP layers, or add the MTP attention to the graph plan before
scheduler runs.
Current state: FFN-only MTP gives 95% acceptance rate at temp=0.6.
Temperature sampling from MTP logits doesn't match the main model's
distribution because the two heads define different probability
distributions over the vocabulary. Argmax gives 89-95% acceptance vs
39% with temperature sampling.
The 5% mismatch at temp=0.6 is expected: the main model sometimes
samples non-argmax tokens. This is normal speculative decoding
behavior and doesn't need fixing.
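The argmax draft is just a plain max over the MTP logits, with no sampling state involved (sketch, hypothetical helper name):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pick the MTP draft token greedily. Temperature sampling from these
// logits mixes two different distributions (MTP head vs main model);
// taking the argmax sidesteps the calibration mismatch entirely.
int argmax_token(const std::vector<float> & logits) {
    size_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) best = i;
    }
    return (int) best;
}
```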
- Add cooldown flag to MTP speculative state: after draft rejection,
skip next proposal to force single-token decode for fresh MTP logits
- Root cause: MTP logits are from the last batch position (draft token).
When draft is rejected, next proposal uses stale/wrong logits (13% accept).
With cooldown: proposals only use fresh single-token MTP logits (95% accept).
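The cooldown gate can be sketched as a two-method state machine (hypothetical names):

```cpp
#include <cassert>

// After a rejected draft, the MTP logits in hand belong to the rejected
// token, not the accepted continuation. Skipping exactly one proposal
// forces a single-token decode that refreshes the MTP logits.
struct mtp_spec_state {
    bool cooldown = false;

    bool should_propose() {
        if (cooldown) {
            cooldown = false; // consume the cooldown: decode one token normally
            return false;
        }
        return true;
    }

    void on_draft_rejected() { cooldown = true; }
};
```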
- Simplified seq_rm fallback: log and continue instead of re-evaluating
- Added debug logging (MTP-DBG, MTP-VERIFY) for acceptance rate tracking
- Results: 95% acceptance rate, 0 restarts, no garbled output on 2048 tokens
Add native MTP support for the dense Qwen 3.5 architecture (0.8B, 2B, 4B, 9B, 27B).
What works:
- MTP graph builder for dense qwen35 (build_mtp_head in qwen35.cpp)
- MTP tensor loading and registration for QWEN35 arch
- GGUF converter handles MTP tensors (mtp.fc, mtp.layers, mtp.norm, etc.)
- Public API: llama_get_mtp_logits(), llama_model_n_mtp_layers()
- Server auto-detects MTP from GGUF metadata
- Speculative state machine for MTP draft token generation
- PR #20075 applied: recurrent state checkpoint/restore for hybrid models
- M-RoPE position check relaxed for speculative re-evaluation
- Windows os.kill fix for gateway process detection
What needs work:
- Speculative verify loop conflicts with tool-calling requests (400 error)
- The recommended fix: bypass the speculative framework entirely and
implement MTP acceptance directly in the server generation loop
(no seq_rm/rollback needed since MTP drafts are produced in-graph)
- MTP attention skipped (projection + FFN path only) due to
inp_out_ids token count mismatch
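The in-loop acceptance that the recommended fix points at can be sketched as a prefix match: no rollback is needed because rejected draft tokens are simply never committed (hypothetical helper, not the server code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Accept the longest draft prefix that matches the main model's own
// picks for the same positions; generation resumes from the first
// mismatch with the main model's token instead.
size_t accept_prefix(const std::vector<int> & draft,
                     const std::vector<int> & verified) {
    size_t n = 0;
    while (n < draft.size() && n < verified.size() && draft[n] == verified[n]) {
        ++n; // tokens draft[0..n) are accepted
    }
    return n;
}
```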
Tested on: RTX 5060 8GB, Windows 11, CUDA 13.2
Model: Qwen3.5-9B with MTP tensors (Q4_K_M quantization)
Base: llama.cpp b8388
* kleidiai: add data type check to get_tensor_traits
* Added an F16 data type check to the get_tensor_traits path for input data
that is not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/Q8)
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7
* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
Updated kleidiai.cpp as per review suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* webui: fix model selector being locked to first loaded model
When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.
* chore: update webui build output
* webui: use date in exported filename
Move conversation naming and export to utils
update index.html.gz
* webui: move literals to message export constants file
* webui: move export naming and download back to the conversation store
* chore: update webui build output
* webui: add comments to some constants
* chore: update webui build output
* ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain
On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain
returns hipErrorInvalidValue because the hint is not applicable to UMA systems.
The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on
APU systems such as AMD Strix Halo (gfx1151).
Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it
without error checking and clear any resulting error with hipGetLastError().
Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory
issues on APU systems, and store totalGlobalMem in device info.
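The soft-error pattern, modeled here with stubs in place of the HIP runtime: the real code calls hipMemAdvise(..., hipMemAdviseSetCoarseGrain, ...) without wrapping it in CUDA_CHECK(), then calls hipGetLastError() to clear the sticky error state (enum values below are illustrative stubs, not the HIP headers).

```cpp
#include <cassert>

// Stub of the HIP error codes involved.
enum hipError_t { hipSuccess = 0, hipErrorInvalidValue = 1 };

static hipError_t g_last_error = hipSuccess;

// Stub: on UMA/APU devices the coarse-grain hint is not applicable and
// the runtime reports hipErrorInvalidValue.
hipError_t stub_mem_advise_coarse_grain(bool is_uma) {
    g_last_error = is_uma ? hipErrorInvalidValue : hipSuccess;
    return g_last_error;
}

// Like hipGetLastError(): returns and clears the sticky error.
hipError_t stub_get_last_error() {
    hipError_t e = g_last_error;
    g_last_error = hipSuccess;
    return e;
}

// The fix: issue the hint, ignore its result, and clear any sticky
// error so a later checked HIP call does not trip over it.
void advise_coarse_grain_soft(bool is_uma) {
    (void) stub_mem_advise_coarse_grain(is_uma);
    (void) stub_get_last_error();
}
```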
Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits
hipMallocManaged to ~64GB regardless of available system RAM. A fix has been
submitted upstream: https://github.com/ROCm/rocm-systems/pull/4077
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix
---------
Co-authored-by: moonshadow-25 <moonshadow-25@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* hexagon: fix tail corruption with row sizes not a multiple of 256
* hexagon: use different stride for repacking partial blocks
* hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks
The previous commit changed the repacking to use even:odd (0:1,2:3,..) packing
instead of the original (0:128,1:129,...) packing in order to fix tail corruption.
Since the mm kernels already handle partial tails, we can use even:odd
packing only for the last block.
This avoids the performance penalty of shuffling to zip the elements
in the common case.
* hex-mm: update rmpy x8 for better optimizations
* hex-mm: tighten supported MUL_MAT checks to avoid spurious failures
* hex-mm: use vzero to init accumulators
* hex-mm: properly call partial rmpy_x8