For hybrid/recurrent models (Qwen3.5, Jamba, Falcon-H1), the server
creates context checkpoints during prompt processing that snapshot the
full recurrent state at regular intervals. These checkpoints are
essential to avoid full prompt re-processing when a slot is reused.
The existing /slots save/restore API persists the raw KV+recurrent
memory via llama_state_seq_{save,load}_file, and restores the token
list on load. However, it does not persist the checkpoint metadata
stored in server_prompt::checkpoints. Without that metadata, the
hybrid-model cache validation logic in update_slots() cannot find any
checkpoint to restore from and falls back to full prompt re-processing.
This patch adds two small helper functions (slot_checkpoints_save and
slot_checkpoints_load) that write/read a companion file alongside the
main slot save file. The format is a versioned binary file with a
magic header.
This is particularly useful in router mode with --models-max 1, where
switching between models destroys the in-memory prompt cache. Users
can now call /slots/0?action=save before a model swap and
/slots/0?action=restore after, recovering the full cache including
checkpoints.
Tested with Qwen3.5-27B (64 layers, 16 attention + 48 recurrent):
- Without patch: cache_n=0, 23s re-processing after swap
- With patch: cache_n=26549, 75ms after swap