fix: increase rs_size to 8 for more checkpoint room
With rs_size=8, 88 out of 196 rejections find checkpoints for proper rollback (45% rollback success, up from 0%). Output quality significantly improved — proper code structure with minor garbling. Speed reduced to 12.4 tok/s due to CPU staging copy overhead (50 MiB per checkpoint × 24 recurrent layers × multiple checkpoints per generation step). TODO: Replace CPU staging copy with direct GPU copy to restore speed. The original GPU crash may have been from the old rs_size=1 (out of bounds access), not from the copy itself.
parent 4e908332c4
commit 91e3535a0b
@@ -8119,7 +8119,7 @@ llama_memory_i * llama_model::create_memory(const llama_memory_params & params,
     // For MTP: need room for active cell + checkpoint cells.
     // With size=4: active(1) + checkpoint(1) + room(2) ensures
     // can_checkpoint (used < size*0.9 = 3.6) can fire even with 3 cells in use.
-    const uint32_t rs_per_seq = 1 + (n_mtp > 0 ? 3 : 0);
+    const uint32_t rs_per_seq = 1 + (n_mtp > 0 ? 7 : 0);
     const uint32_t rs_size = std::max((uint32_t) 1, cparams.n_seq_max * rs_per_seq);
 
     res = new llama_memory_hybrid_iswa(