llama.cpp/tools
European Tech d5c325051f server: persist context checkpoints across slot save/restore
For hybrid/recurrent models (Qwen3.5, Jamba, Falcon-H1), the server
creates context checkpoints during prompt processing that snapshot the
full recurrent state at regular intervals.  These checkpoints are
essential to avoid full prompt re-processing when a slot is reused.

The existing /slots save/restore API persists the raw KV+recurrent
memory via llama_state_seq_{save,load}_file, and also restores the
token list.  However, it does not persist the checkpoint metadata
stored in server_prompt::checkpoints.  Without that metadata, the
hybrid-model cache validation logic in update_slots() finds no
checkpoint to restore from once a slot has been restored, and falls
back to full prompt re-processing.
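
The fallback described above can be illustrated with a minimal sketch.
It is not the actual update_slots() code: checkpoint_sketch and
reusable_tokens are hypothetical names, and the real entries in
server_prompt::checkpoints also carry the serialized state.  The sketch
assumes that recurrent state cannot be rolled back to an arbitrary
position, so only a checkpoint that ends at or before the common prefix
is usable.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for one context checkpoint.
struct checkpoint_sketch {
    int32_t pos_max; // last token position covered by the snapshot
};

// Returns how many prompt tokens can be reused, given the length of the
// common prefix between the cached prompt and the new prompt.
// A return value of 0 means full prompt re-processing.
int32_t reusable_tokens(const std::vector<checkpoint_sketch> & checkpoints,
                        int32_t n_common) {
    assert(n_common >= 0);
    int32_t best = 0;
    for (const auto & cp : checkpoints) {
        // a snapshot covering positions 0..pos_max holds pos_max + 1 tokens
        const int32_t n = cp.pos_max + 1;
        // usable only if it does not extend past the common prefix;
        // keep the longest usable one
        if (n <= n_common && n > best) {
            best = n;
        }
    }
    return best;
}
```

With an empty checkpoint list, which is the situation after a restore
without this patch, the function returns 0 regardless of how long the
common prefix is.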

This patch adds two small helper functions (slot_checkpoints_save and
slot_checkpoints_load) that write/read a companion file alongside the
main slot save file.  The format is a versioned binary file with a
magic header.
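
A versioned binary file with a magic header could look roughly like the
sketch below.  Everything here is an assumption made for illustration:
the magic value, the version number, the field layout, and the names
checkpoint_blob, sketch_save and sketch_load are hypothetical and are
not taken from the patch.  The sketch also writes integers in host byte
order, which is only safe when the file is read back on the same kind
of machine.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical constants; the real magic and version are not stated above.
constexpr uint32_t SKETCH_MAGIC   = 0x54504B43;
constexpr uint32_t SKETCH_VERSION = 1;

// Hypothetical checkpoint record: position range plus opaque state bytes.
struct checkpoint_blob {
    int32_t pos_min;
    int32_t pos_max;
    std::vector<uint8_t> data;
};

template <typename T>
static void write_pod(std::ostream & out, const T & v) {
    out.write(reinterpret_cast<const char *>(&v), sizeof(T));
}

template <typename T>
static bool read_pod(std::istream & in, T & v) {
    in.read(reinterpret_cast<char *>(&v), sizeof(T));
    return static_cast<bool>(in);
}

// Writes: magic, version, count, then each record with a length prefix.
bool sketch_save(const std::string & path, const std::vector<checkpoint_blob> & cps) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    write_pod(out, SKETCH_MAGIC);
    write_pod(out, SKETCH_VERSION);
    write_pod(out, static_cast<uint32_t>(cps.size()));
    for (const auto & cp : cps) {
        assert(cp.pos_min <= cp.pos_max);
        write_pod(out, cp.pos_min);
        write_pod(out, cp.pos_max);
        write_pod(out, static_cast<uint64_t>(cp.data.size()));
        out.write(reinterpret_cast<const char *>(cp.data.data()),
                  static_cast<std::streamsize>(cp.data.size()));
    }
    return static_cast<bool>(out);
}

// Rejects files with a wrong magic or an unknown version; on any failure
// the output vector is left untouched.
bool sketch_load(const std::string & path, std::vector<checkpoint_blob> & cps) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    uint32_t magic = 0, version = 0, count = 0;
    if (!read_pod(in, magic)   || magic   != SKETCH_MAGIC)   return false;
    if (!read_pod(in, version) || version != SKETCH_VERSION) return false;
    if (!read_pod(in, count)) return false;
    std::vector<checkpoint_blob> tmp;
    for (uint32_t i = 0; i < count; ++i) {
        checkpoint_blob cp{};
        uint64_t size = 0;
        if (!read_pod(in, cp.pos_min) || !read_pod(in, cp.pos_max) || !read_pod(in, size)) {
            return false;
        }
        cp.data.resize(size);
        in.read(reinterpret_cast<char *>(cp.data.data()),
                static_cast<std::streamsize>(size));
        if (!in) return false;
        tmp.push_back(std::move(cp));
    }
    cps = std::move(tmp);
    return true;
}
```

Checking the magic and version before reading anything else lets a
loader refuse a stale or unrelated file cleanly.  In that case the
caller can carry on without checkpoints, which is the pre-patch
behaviour.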

This is particularly useful in router mode with --models-max 1, where
switching between models destroys the in-memory prompt cache.  Users
can now call /slots/0?action=save before a model swap and
/slots/0?action=restore after, recovering the full cache including
checkpoints.

Tested with Qwen3.5-27B (64 layers, 16 attention + 48 recurrent):
- Without patch: cache_n=0, 23s re-processing after swap
- With patch:    cache_n=26549, 75ms after swap
2026-03-20 21:16:11 +01:00
batched-bench Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
cli common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
completion common/parser: add `--skip-chat-parsing` to force a pure content parser. (#20289) 2026-03-17 16:16:43 +01:00
cvector-generator chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
export-lora Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
fit-params llama-fit-params: keep explicit --ctx-size 0 (#19070) 2026-01-24 22:13:08 +01:00
gguf-split Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
imatrix chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
llama-bench llama-bench: introduce `-hf` and `-hff` flags & use `--mmap 1` by default (#20211) 2026-03-09 09:05:44 +08:00
mtmd mtmd: add clip_graph::build_mm() (#20751) 2026-03-19 13:11:39 +01:00
parser common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
perplexity tools : enable kvu in perplexity for hellaswag, winogrande, multiple-choice (#19954) 2026-03-13 21:25:57 +01:00
quantize llama-quant : fail early on missing imatrix, refactor type selection, code cleanup (#19770) 2026-03-10 08:16:05 +02:00
results llama: end-to-end tests (#19802) 2026-03-08 12:30:21 +01:00
rpc Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
server server: persist context checkpoints across slot save/restore 2026-03-20 21:16:11 +01:00
tokenize Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
tts Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
CMakeLists.txt llama: end-to-end tests (#19802) 2026-03-08 12:30:21 +01:00