For hybrid/recurrent models (Qwen3.5, Jamba, Falcon-H1), the server
creates context checkpoints during prompt processing that snapshot the
full recurrent state at regular intervals. These checkpoints are
essential to avoid full prompt re-processing when a slot is reused.
The existing /slots save/restore API persists the raw KV+recurrent
memory via llama_state_seq_{save,load}_file, and restores the token
list on load. However, it does not persist the checkpoint metadata
stored in server_prompt::checkpoints. Without that metadata, the
hybrid-model cache validation logic in update_slots() cannot find any
checkpoint to restore from and falls back to full prompt re-processing.
This patch adds two small helper functions (slot_checkpoints_save and
slot_checkpoints_load) that write/read a companion file alongside the
main slot save file. The format is a versioned binary file with a
magic header.
This is particularly useful in router mode with --models-max 1, where
switching between models destroys the in-memory prompt cache. Users
can now call /slots/0?action=save before a model swap and
/slots/0?action=restore after, recovering the full cache including
checkpoints.
Tested with Qwen3.5-27B (64 layers, 16 attention + 48 recurrent):
- Without patch: cache_n=0, 23s re-processing after swap
- With patch: cache_n=26549, 75ms after swap