From 91e3535a0beaae7246d5a0d7c4f992ad5db3cff4 Mon Sep 17 00:00:00 2001
From: itigges22 <jitigges@vt.edu>
Date: Fri, 20 Mar 2026 18:15:13 -0400
Subject: [PATCH] fix: increase rs_size to 8 per sequence for more checkpoint room
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With rs_size=8 (rs_per_seq = 1 + 7 when MTP is enabled, one sequence),
88 of 196 rejections find a checkpoint to roll back to (45% rollback
success, up from 0%).

Output quality is significantly improved: generated code now has proper
structure, with only minor garbling. Speed drops to 12.4 tok/s because
of CPU staging copy overhead (50 MiB per checkpoint × 24 recurrent
layers × multiple checkpoints per generation step).

TODO: Replace the CPU staging copy with a direct GPU copy to restore
speed. The original GPU crash may have been caused by an out-of-bounds
access with the old rs_size=1, not by the copy itself.
---
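Note (not part of the commit message): a minimal sketch of the sizing
arithmetic in the hunk below, for checking the numbers by hand. The
helper names rs_size_for, can_checkpoint and check_sizing are
hypothetical and do not exist in llama.cpp. The 0.9 threshold is taken
only from the existing code comment (used < size*0.9); the real check
may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Mirrors the sizing expression in the hunk below.
// extra_cells is 3 before this patch and 7 after it.
constexpr uint32_t rs_size_for(uint32_t n_seq_max, uint32_t n_mtp, uint32_t extra_cells) {
    const uint32_t rs_per_seq = 1 + (n_mtp > 0 ? extra_cells : 0);
    return std::max((uint32_t) 1, n_seq_max * rs_per_seq);
}

// Hypothetical restatement of the rule quoted in the code comment
// (used < size*0.9), written with integers to avoid rounding issues.
constexpr bool can_checkpoint(uint32_t used, uint32_t size) {
    return used * 10 < size * 9;
}

// Usage: single sequence, MTP enabled.
inline void check_sizing() {
    // Before this patch: 4 cells, checkpoints allowed with up to 3 in use.
    assert(rs_size_for(1, 1, 3) == 4);
    assert( can_checkpoint(3, 4));   // 3 < 3.6
    assert(!can_checkpoint(4, 4));   // 4 >= 3.6

    // After this patch: 8 cells, checkpoints allowed with up to 7 in use.
    assert(rs_size_for(1, 1, 7) == 8);
    assert( can_checkpoint(7, 8));   // 7 < 7.2
    assert(!can_checkpoint(8, 8));   // 8 >= 7.2
}
```

Under these assumptions the patch raises the number of cells that can
be in use while still allowing a new checkpoint from 3 to 7 per
sequence. Note that rs_size scales with cparams.n_seq_max, so it is 8
only for a single sequence.
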
 src/llama-model.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index 5e1b512318..bbfc430b4d 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -8119,7 +8119,7 @@ llama_memory_i * llama_model::create_memory(const llama_memory_params & params,
                         // For MTP: need room for active cell + checkpoint cells.
                         // With size=4: active(1) + checkpoint(1) + room(2) ensures
                         // can_checkpoint (used < size*0.9 = 3.6) can fire even with 3 cells in use.
-                        const uint32_t rs_per_seq = 1 + (n_mtp > 0 ? 3 : 0);
+                        const uint32_t rs_per_seq = 1 + (n_mtp > 0 ? 7 : 0);
                         const uint32_t rs_size = std::max((uint32_t) 1, cparams.n_seq_max * rs_per_seq);
 
                         res = new llama_memory_hybrid_iswa(