Root cause found: copy_cell crashes during find_slot because it calls
ggml_backend_tensor_copy on GPU tensors while the compute graph is
being built. Fixed by using CPU staging: tensor_get (GPU→CPU) then
tensor_set (CPU→GPU).
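
A minimal sketch of the staging pattern described above, not the actual
patch. FakeDeviceTensor, tensor_get, tensor_set and copy_cell_staged are
hypothetical stand-ins so the example runs without ggml; in the real code
the two calls would presumably map to ggml_backend_tensor_get and
ggml_backend_tensor_set, which take (tensor, data, offset, size). The cell
layout (cells stored contiguously, cell_bytes each) is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a tensor whose data lives in a backend (GPU) buffer.
struct FakeDeviceTensor {
    std::vector<uint8_t> device_mem;
};

// Stand-in for a device-to-host read (GPU -> CPU).
static void tensor_get(const FakeDeviceTensor & t, void * dst, size_t offset, size_t size) {
    assert(offset + size <= t.device_mem.size());
    std::memcpy(dst, t.device_mem.data() + offset, size);
}

// Stand-in for a host-to-device write (CPU -> GPU).
static void tensor_set(FakeDeviceTensor & t, const void * src, size_t offset, size_t size) {
    assert(offset + size <= t.device_mem.size());
    std::memcpy(t.device_mem.data() + offset, src, size);
}

// Copy one cell's state from src_cell to dst_cell through a host staging
// buffer, so no device-to-device copy is issued during graph construction.
static void copy_cell_staged(FakeDeviceTensor & t, size_t cell_bytes,
                             size_t src_cell, size_t dst_cell) {
    if (src_cell == dst_cell) {
        return; // nothing to do
    }
    std::vector<uint8_t> staging(cell_bytes);
    tensor_get(t, staging.data(), src_cell * cell_bytes, cell_bytes);
    tensor_set(t, staging.data(), dst_cell * cell_bytes, cell_bytes);
}
```

The staged copy costs two transfers per cell instead of one, which trades
some speed for not touching the backend copy path during find_slot.
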
Also increased rs_size from 1 to 3 cells per sequence to make room
for checkpoint cells needed by speculative decoding rollback.
Results:
- No more crashes during speculative decode
- 23.8 tok/s with MTP (vs 16.7 without, roughly a 1.43x speedup)
- 75% acceptance rate
- Output still garbled on long generations because seq_rm does not find
  checkpoints at the right positions (checkpoint position mismatch)
Next: fix checkpoint position tracking so seq_rm can find and restore
the correct recurrent state after draft rejection.
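
One possible shape for the position lookup, as a sketch only. Cell and
find_checkpoint are hypothetical names, and the convention that each cell
records the position of the last token folded into its state is an
assumption, not something taken from the current code. Under that
convention, removing positions [p0, end) needs the checkpoint whose
recorded position is exactly p0 - 1; an off-by-one in what gets recorded
would make the lookup miss.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-cell metadata for a recurrent-state cache.
struct Cell {
    int32_t seq_id; // owning sequence, -1 if the cell is unused
    int32_t pos;    // position of the last token included in this state
};

// Return the index of the cell holding the state of seq_id just before p0,
// i.e. the cell with pos == p0 - 1, or -1 if no such checkpoint exists.
static int find_checkpoint(const std::vector<Cell> & cells, int32_t seq_id, int32_t p0) {
    for (int i = 0; i < (int) cells.size(); ++i) {
        if (cells[i].seq_id == seq_id && cells[i].pos == p0 - 1) {
            return i;
        }
    }
    return -1; // caller has to fall back, e.g. re-evaluate from an older state
}
```

A -1 result is the "checkpoint position mismatch" case: a state exists for
the sequence, but not at the position the rollback needs.
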