From 56666fa6072f150fcc48f138433f82ad3fb76d79 Mon Sep 17 00:00:00 2001
From: Berk Idem <55372926+berkidem@users.noreply.github.com>
Date: Tue, 14 Apr 2026 06:43:06 -0400
Subject: [PATCH] common: skip reasoning budget sampler when no budget is requested (#21870)

* common: skip reasoning budget sampler when no budget is requested

After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697,
the reasoning budget sampler gets unconditionally created even when no
budget is configured (the default -1). The same applies to kimi_k2, lfm2,
lfm2_5, and ministral_3 which also set these tags. The budget gets
converted to INT_MAX, so the sampler never actually forces any tokens but
still runs per-token checks (start tag matching in IDLE state,
token-to-piece conversion + UTF-8 checks in COUNTING state).

More importantly, the mere existence of the sampler (non-null rbudget)
disables backend sampling. Backend sampling lets the GPU select tokens
directly, avoiding a full logits transfer from GPU to CPU every token.
This could explain the 30% speed regression reported in #21784 (98 t/s to
70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation
condition. When the budget is unlimited, the sampler is not created,
backend sampling stays enabled, and no per-token overhead is added. When a
budget is explicitly set (0, 128, 1024, etc.), the sampler is created and
works as before.

* common: preserve rbudget when grammar is lazy

Following up on the review feedback on #21870: keep the reasoning budget
sampler when grammar_lazy is true, so the thinking-block grammar
suppression from #20970 still works when tools are in use. This way, we
only skip the sampler when both no budget is set AND grammar is not lazy.
---
 common/sampling.cpp | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/common/sampling.cpp b/common/sampling.cpp
index 2f60be1943..526f036ff9 100644
--- a/common/sampling.cpp
+++ b/common/sampling.cpp
@@ -287,8 +287,8 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
         }
     }
 
-    // reasoning budget sampler
-    if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty()) {
+    // reasoning budget sampler (skip when budget is unlimited unless a lazy grammar is active, which needs rbudget for thinking-block suppression)
+    if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty() && (params.grammar_lazy || params.reasoning_budget_tokens >= 0)) {
         rbudget = common_reasoning_budget_init(
             vocab,
             params.reasoning_budget_start,