refactor: centralize CoT parsing in backend for streaming mode (#16394)

* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
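
As a rough illustration (written in the style of the parser tests added below; the common_chat_syntax field layout is copied from those tests, so treat this as a sketch rather than a definitive snippet), the same input now yields separated fields whether it arrives complete or as a partial stream:

    common_chat_syntax syntax = {
        /* .format = */               COMMON_CHAT_FORMAT_CONTENT_ONLY,
        /* .reasoning_format = */     COMMON_REASONING_FORMAT_DEEPSEEK,
        /* .reasoning_in_content = */ false,
        /* .thinking_forced_open = */ false,
        /* .parse_tool_calls = */     false,
    };

    // complete response: reasoning and answer land in separate fields
    auto full = common_chat_parse("<think>Plan the steps</think>Here is the answer.", /* is_partial */ false, syntax);
    // full.reasoning_content == "Plan the steps", full.content == "Here is the answer."

    // partial streaming chunk with the tag still open: the text accumulates in
    // reasoning_content instead of being dumped there together with the answer
    auto partial = common_chat_parse("<think>Plan the st", /* is_partial */ true, syntax);
    // partial.reasoning_content == "Plan the st", partial.content == ""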

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: restore the forced reasoning prefix so test-chat passes again ("[chat] All tests passed")

- store the exact opening sequence seen in the input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated 'reasoning_content' segment, then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
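
For reference, a minimal sketch of the forced-open path in the same test style (assumed setup for illustration only: the chat template has already emitted the opening tag, so just the closing tag appears in the model output; the captured prefix above is only re-injected when the opening literal itself also shows up in the output):

    common_chat_syntax syntax = {
        /* .format = */               COMMON_CHAT_FORMAT_CONTENT_ONLY,
        /* .reasoning_format = */     COMMON_REASONING_FORMAT_DEEPSEEK,
        /* .reasoning_in_content = */ false,
        /* .thinking_forced_open = */ true,
        /* .parse_tool_calls = */     false,
    };

    auto msg = common_chat_parse("Reason about the greeting.</think>Hello!", /* is_partial */ false, syntax);
    // expected: msg.reasoning_content == "Reason about the greeting.", msg.content == "Hello!"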

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Pascal 2025-10-08 22:18:41 +02:00 committed by GitHub
parent 9d0882840e
commit 12bbc3fa50
14 changed files with 276 additions and 431 deletions

View File

@@ -3432,7 +3432,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     {"--reasoning-format"}, "FORMAT",
     "controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:\n"
     "- none: leaves thoughts unparsed in `message.content`\n"
-    "- deepseek: puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`)\n"
+    "- deepseek: puts thoughts in `message.reasoning_content`\n"
+    "- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`\n"
     "(default: auto)",
     [](common_params & params, const std::string & value) {
         params.reasoning_format = common_reasoning_format_from_name(value);

View File

@@ -3,9 +3,12 @@
 #include "log.h"
 #include "regex-partial.h"

+#include <algorithm>
+#include <cctype>
 #include <optional>
 #include <stdexcept>
 #include <string>
+#include <string_view>
 #include <vector>

 using json = nlohmann::ordered_json;

@@ -166,6 +169,27 @@ void common_chat_msg_parser::consume_literal(const std::string & literal) {
 }

 bool common_chat_msg_parser::try_parse_reasoning(const std::string & start_think, const std::string & end_think) {
+    std::string pending_reasoning_prefix;
+
+    if (syntax_.reasoning_format == COMMON_REASONING_FORMAT_NONE) {
+        return false;
+    }
+
+    auto set_reasoning_prefix = [&](size_t prefix_pos) {
+        if (!syntax_.thinking_forced_open || syntax_.reasoning_in_content) {
+            return;
+        }
+        if (prefix_pos + start_think.size() > input_.size()) {
+            pending_reasoning_prefix.clear();
+            return;
+        }
+        // Capture the exact literal that opened the reasoning section so we can
+        // surface it back to callers. This ensures formats that force the
+        // reasoning tag open (e.g. DeepSeek R1) retain their original prefix
+        // instead of dropping it during parsing.
+        pending_reasoning_prefix = input_.substr(prefix_pos, start_think.size());
+    };
+
     auto handle_reasoning = [&](const std::string & reasoning, bool closed) {
         auto stripped_reasoning = string_strip(reasoning);
         if (stripped_reasoning.empty()) {

@@ -178,28 +202,116 @@ bool common_chat_msg_parser::try_parse_reasoning(const std::string & start_think
                 add_content(syntax_.reasoning_format == COMMON_REASONING_FORMAT_DEEPSEEK ? "</think>" : end_think);
             }
         } else {
+            if (!pending_reasoning_prefix.empty()) {
+                add_reasoning_content(pending_reasoning_prefix);
+                pending_reasoning_prefix.clear();
+            }
             add_reasoning_content(stripped_reasoning);
         }
     };
-    if (syntax_.reasoning_format != COMMON_REASONING_FORMAT_NONE) {
-        if (syntax_.thinking_forced_open || try_consume_literal(start_think)) {
-            if (auto res = try_find_literal(end_think)) {
-                handle_reasoning(res->prelude, /* closed */ true);
-                consume_spaces();
-                return true;
-            }
-            auto rest = consume_rest();
+
+    const size_t saved_pos = pos_;
+    const size_t saved_content_size = result_.content.size();
+    const size_t saved_reasoning_size = result_.reasoning_content.size();
+
+    auto restore_state = [&]() {
+        move_to(saved_pos);
+        result_.content.resize(saved_content_size);
+        result_.reasoning_content.resize(saved_reasoning_size);
+    };
+
+    // Allow leading whitespace to be preserved as content when reasoning is present at the start
+    size_t cursor = pos_;
+    size_t whitespace_end = cursor;
+    while (whitespace_end < input_.size() && std::isspace(static_cast<unsigned char>(input_[whitespace_end]))) {
+        ++whitespace_end;
+    }
+
+    if (whitespace_end >= input_.size()) {
+        restore_state();
+        if (syntax_.thinking_forced_open) {
+            auto rest = input_.substr(saved_pos);
             if (!rest.empty()) {
                 handle_reasoning(rest, /* closed */ !is_partial());
             }
-            // Allow unclosed thinking tags, for now (https://github.com/ggml-org/llama.cpp/issues/13812, https://github.com/ggml-org/llama.cpp/issues/13877)
-            // if (!syntax_.thinking_forced_open) {
-            //     throw common_chat_msg_partial_exception(end_think);
-            // }
+            move_to(input_.size());
             return true;
         }
+        return false;
+    }
+
+    cursor = whitespace_end;
+    const size_t remaining = input_.size() - cursor;
+    const size_t start_prefix = std::min(start_think.size(), remaining);
+    const bool has_start_tag = input_.compare(cursor, start_prefix, start_think, 0, start_prefix) == 0;
+
+    if (has_start_tag && start_prefix < start_think.size()) {
+        move_to(input_.size());
+        return true;
+    }
+
+    if (has_start_tag) {
+        if (whitespace_end > pos_) {
+            add_content(input_.substr(pos_, whitespace_end - pos_));
+        }
+        set_reasoning_prefix(cursor);
+        cursor += start_think.size();
+    } else if (syntax_.thinking_forced_open) {
+        cursor = whitespace_end;
+    } else {
+        restore_state();
+        return false;
+    }
+
+    while (true) {
+        if (cursor >= input_.size()) {
+            move_to(input_.size());
+            return true;
+        }
+
+        size_t end_pos = input_.find(end_think, cursor);
+        if (end_pos == std::string::npos) {
+            std::string_view remaining_view(input_.data() + cursor, input_.size() - cursor);
+            size_t partial_off = string_find_partial_stop(remaining_view, end_think);
+            size_t reasoning_end = partial_off == std::string::npos ? input_.size() : cursor + partial_off;
+            if (reasoning_end > cursor) {
+                handle_reasoning(input_.substr(cursor, reasoning_end - cursor), /* closed */ partial_off == std::string::npos && !is_partial());
+            }
+            move_to(input_.size());
+            return true;
+        }
+
+        if (end_pos > cursor) {
+            handle_reasoning(input_.substr(cursor, end_pos - cursor), /* closed */ true);
+        } else {
+            handle_reasoning("", /* closed */ true);
+        }
+
+        cursor = end_pos + end_think.size();
+
+        while (cursor < input_.size() && std::isspace(static_cast<unsigned char>(input_[cursor]))) {
+            ++cursor;
+        }
+
+        const size_t next_remaining = input_.size() - cursor;
+        if (next_remaining == 0) {
+            move_to(cursor);
+            return true;
+        }
+
+        const size_t next_prefix = std::min(start_think.size(), next_remaining);
+        if (input_.compare(cursor, next_prefix, start_think, 0, next_prefix) == 0) {
+            if (next_prefix < start_think.size()) {
+                move_to(input_.size());
+                return true;
+            }
+            set_reasoning_prefix(cursor);
+            cursor += start_think.size();
+            continue;
+        }
+
+        move_to(cursor);
+        return true;
     }
-    return false;
 }

 std::string common_chat_msg_parser::consume_rest() {

View File

@@ -1408,6 +1408,8 @@ static common_chat_params common_chat_params_init_apertus(const common_chat_temp
     return data;
 }

 static void common_chat_parse_llama_3_1(common_chat_msg_parser & builder, bool with_builtin_tools = false) {
+    builder.try_parse_reasoning("<think>", "</think>");
+
     if (!builder.syntax().parse_tool_calls) {
         builder.add_content(builder.consume_rest());
         return;

@@ -2862,6 +2864,7 @@ common_chat_params common_chat_templates_apply(
     }
 }

 static void common_chat_parse_content_only(common_chat_msg_parser & builder) {
+    builder.try_parse_reasoning("<think>", "</think>");
     builder.add_content(builder.consume_rest());
 }

View File

@@ -433,7 +433,7 @@ struct common_params {
     std::string chat_template = ""; // NOLINT
     bool use_jinja = false; // NOLINT
     bool enable_chat_template = true;
-    common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_AUTO;
+    common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
     int reasoning_budget = -1;
     bool prefill_assistant = true; // if true, any trailing assistant message will be prefilled into the response

View File

@@ -106,6 +106,34 @@ static void test_reasoning() {
         assert_equals("<think>Cogito</think>", builder.result().content);
         assert_equals("Ergo sum", builder.consume_rest());
     }
+    {
+        const std::string variant("content_only_inline_think");
+        common_chat_syntax syntax = {
+            /* .format = */ COMMON_CHAT_FORMAT_CONTENT_ONLY,
+            /* .reasoning_format = */ COMMON_REASONING_FORMAT_DEEPSEEK,
+            /* .reasoning_in_content = */ false,
+            /* .thinking_forced_open = */ false,
+            /* .parse_tool_calls = */ false,
+        };
+        const std::string input = "<think>Pense</think>Bonjour";
+        auto msg = common_chat_parse(input, false, syntax);
+        assert_equals(variant, std::string("Pense"), msg.reasoning_content);
+        assert_equals(variant, std::string("Bonjour"), msg.content);
+    }
+    {
+        const std::string variant("llama_3_inline_think");
+        common_chat_syntax syntax = {
+            /* .format = */ COMMON_CHAT_FORMAT_LLAMA_3_X,
+            /* .reasoning_format = */ COMMON_REASONING_FORMAT_DEEPSEEK,
+            /* .reasoning_in_content = */ false,
+            /* .thinking_forced_open = */ false,
+            /* .parse_tool_calls = */ false,
+        };
+        const std::string input = "<think>Plan</think>Réponse";
+        auto msg = common_chat_parse(input, false, syntax);
+        assert_equals(variant, std::string("Plan"), msg.reasoning_content);
+        assert_equals(variant, std::string("Réponse"), msg.content);
+    }
     // Test DeepSeek V3.1 parsing - reasoning content followed by "</think>" and then regular content
     {
         common_chat_syntax syntax = {

View File

@@ -190,7 +190,7 @@ The project is under active development, and we are [looking for feedback and co
 | `--no-slots` | disables slots monitoring endpoint<br/>(env: LLAMA_ARG_NO_ENDPOINT_SLOTS) |
 | `--slot-save-path PATH` | path to save slot kv cache (default: disabled) |
 | `--jinja` | use jinja template for chat (default: disabled)<br/>(env: LLAMA_ARG_JINJA) |
-| `--reasoning-format FORMAT` | controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:<br/>- none: leaves thoughts unparsed in `message.content`<br/>- deepseek: puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`)<br/>(default: auto)<br/>(env: LLAMA_ARG_THINK) |
+| `--reasoning-format FORMAT` | controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:<br/>- none: leaves thoughts unparsed in `message.content`<br/>- deepseek: puts thoughts in `message.reasoning_content`<br/>- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`<br/>(default: deepseek)<br/>(env: LLAMA_ARG_THINK) |
 | `--reasoning-budget N` | controls the amount of thinking allowed; currently only one of: -1 for unrestricted thinking budget, or 0 to disable thinking (default: -1)<br/>(env: LLAMA_ARG_THINK_BUDGET) |
 | `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, seed_oss, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
 | `--chat-template-file JINJA_TEMPLATE_FILE` | set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, seed_oss, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |

View File

@@ -1,7 +1,6 @@
 <script lang="ts">
   import { getDeletionInfo } from '$lib/stores/chat.svelte';
   import { copyToClipboard } from '$lib/utils/copy';
-  import { parseThinkingContent } from '$lib/utils/thinking';
   import ChatMessageAssistant from './ChatMessageAssistant.svelte';
   import ChatMessageUser from './ChatMessageUser.svelte';

@@ -47,26 +46,13 @@
   let thinkingContent = $derived.by(() => {
     if (message.role === 'assistant') {
-      if (message.thinking) {
-        return message.thinking;
-      }
-      const parsed = parseThinkingContent(message.content);
-      return parsed.thinking;
+      const trimmedThinking = message.thinking?.trim();
+
+      return trimmedThinking ? trimmedThinking : null;
     }

     return null;
   });

-  let messageContent = $derived.by(() => {
-    if (message.role === 'assistant') {
-      const parsed = parseThinkingContent(message.content);
-      return parsed.cleanContent?.replace('<|channel|>analysis', '');
-    }
-    return message.content?.replace('<|channel|>analysis', '');
-  });
-
   function handleCancelEdit() {
     isEditing = false;
     editedContent = message.content;

@@ -165,7 +151,7 @@
   {editedContent}
   {isEditing}
   {message}
-  {messageContent}
+  messageContent={message.content}
   onCancelEdit={handleCancelEdit}
   onConfirmDelete={handleConfirmDelete}
   onCopy={handleCopy}

View File

@@ -131,7 +131,11 @@
       </div>
     </div>
   {:else if message.role === 'assistant'}
-    <MarkdownContent content={messageContent || ''} />
+    {#if config().disableReasoningFormat}
+      <pre class="raw-output">{messageContent || ''}</pre>
+    {:else}
+      <MarkdownContent content={messageContent || ''} />
+    {/if}
   {:else}
     <div class="text-sm whitespace-pre-wrap">
       {messageContent}

@@ -203,4 +207,21 @@
       background-position: -200% 0;
     }
   }

+  .raw-output {
+    width: 100%;
+    max-width: 48rem;
+    margin-top: 1.5rem;
+    padding: 1rem 1.25rem;
+    border-radius: 1rem;
+    background: hsl(var(--muted) / 0.3);
+    color: var(--foreground);
+    font-family:
+      ui-monospace, SFMono-Regular, 'SF Mono', Monaco, 'Cascadia Code', 'Roboto Mono', Consolas,
+      'Liberation Mono', Menlo, monospace;
+    font-size: 0.875rem;
+    line-height: 1.6;
+    white-space: pre-wrap;
+    word-break: break-word;
+  }
 </style>

View File

@@ -148,6 +148,12 @@
       key: 'showThoughtInProgress',
       label: 'Show thought in progress',
       type: 'checkbox'
+    },
+    {
+      key: 'disableReasoningFormat',
+      label:
+        'Show raw LLM output without backend parsing and frontend Markdown rendering to inspect streaming across different models.',
+      type: 'checkbox'
     }
   ]
 },

View File

@@ -6,6 +6,7 @@ export const SETTING_CONFIG_DEFAULT: Record<string, string | number | boolean> =
   theme: 'system',
   showTokensPerSecond: false,
   showThoughtInProgress: false,
+  disableReasoningFormat: false,
   keepStatsVisible: false,
   askForTitleConfirmation: false,
   pasteLongTextToFileLen: 2500,

@@ -76,6 +77,8 @@ export const SETTING_CONFIG_INFO: Record<string, string> = {
   custom: 'Custom JSON parameters to send to the API. Must be valid JSON format.',
   showTokensPerSecond: 'Display generation speed in tokens per second during streaming.',
   showThoughtInProgress: 'Expand thought process by default when generating messages.',
+  disableReasoningFormat:
+    'Show raw LLM output without backend parsing and frontend Markdown rendering to inspect streaming across different models.',
   keepStatsVisible: 'Keep processing statistics visible after generation finishes.',
   askForTitleConfirmation:
     'Ask for confirmation before automatically changing conversation title when editing the first message.',

View File

@@ -78,6 +78,8 @@ export class ChatService {
       timings_per_token
     } = options;

+    const currentConfig = config();
+
     // Cancel any ongoing request and create a new abort controller
     this.abort();
     this.abortController = new AbortController();

@@ -117,7 +119,7 @@
       stream
     };

-    requestBody.reasoning_format = 'auto';
+    requestBody.reasoning_format = currentConfig.disableReasoningFormat ? 'none' : 'auto';

     if (temperature !== undefined) requestBody.temperature = temperature;
     // Set max_tokens to -1 (infinite) if not provided or empty

@@ -161,7 +163,6 @@
     }

     try {
-      const currentConfig = config();
       const apiKey = currentConfig.apiKey?.toString().trim();

       const response = await fetch(`./v1/chat/completions`, {

@@ -256,10 +257,8 @@
     }

     const decoder = new TextDecoder();
-    let fullResponse = '';
+    let aggregatedContent = '';
     let fullReasoningContent = '';
-    let regularContent = '';
-    let insideThinkTag = false;
     let hasReceivedData = false;
     let lastTimings: ChatMessageTimings | undefined;

@@ -277,7 +276,7 @@
         if (line.startsWith('data: ')) {
           const data = line.slice(6);
           if (data === '[DONE]') {
-            if (!hasReceivedData && fullResponse.length === 0) {
+            if (!hasReceivedData && aggregatedContent.length === 0) {
               const contextError = new Error(
                 'The request exceeds the available context size. Try increasing the context size or enable context shift.'
               );

@@ -286,7 +285,7 @@
               return;
             }

-            onComplete?.(regularContent, fullReasoningContent || undefined, lastTimings);
+            onComplete?.(aggregatedContent, fullReasoningContent || undefined, lastTimings);
             return;
           }

@@ -310,27 +309,8 @@
           if (content) {
             hasReceivedData = true;
-            fullResponse += content;
-
-            // Track the regular content before processing this chunk
-            const regularContentBefore = regularContent;
-
-            // Process content character by character to handle think tags
-            insideThinkTag = this.processContentForThinkTags(
-              content,
-              insideThinkTag,
-              () => {
-                // Think content is ignored - we don't include it in API requests
-              },
-              (regularChunk) => {
-                regularContent += regularChunk;
-              }
-            );
-
-            const newRegularContent = regularContent.slice(regularContentBefore.length);
-            if (newRegularContent) {
-              onChunk?.(newRegularContent);
-            }
+            aggregatedContent += content;
+            onChunk?.(content);
           }

           if (reasoningContent) {

@@ -345,7 +325,7 @@
       }
     }

-    if (!hasReceivedData && fullResponse.length === 0) {
+    if (!hasReceivedData && aggregatedContent.length === 0) {
       const contextError = new Error(
         'The request exceeds the available context size. Try increasing the context size or enable context shift.'
       );

@@ -552,51 +532,6 @@
     }
   }
/**
* Processes content to separate thinking tags from regular content.
* Parses <think> and </think> tags to route content to appropriate handlers.
*
* @param content - The content string to process
* @param currentInsideThinkTag - Current state of whether we're inside a think tag
* @param addThinkContent - Callback to handle content inside think tags
* @param addRegularContent - Callback to handle regular content outside think tags
* @returns Boolean indicating if we're still inside a think tag after processing
* @private
*/
private processContentForThinkTags(
content: string,
currentInsideThinkTag: boolean,
addThinkContent: (chunk: string) => void,
addRegularContent: (chunk: string) => void
): boolean {
let i = 0;
let insideThinkTag = currentInsideThinkTag;
while (i < content.length) {
if (!insideThinkTag && content.substring(i, i + 7) === '<think>') {
insideThinkTag = true;
i += 7; // Skip the <think> tag
continue;
}
if (insideThinkTag && content.substring(i, i + 8) === '</think>') {
insideThinkTag = false;
i += 8; // Skip the </think> tag
continue;
}
if (insideThinkTag) {
addThinkContent(content[i]);
} else {
addRegularContent(content[i]);
}
i++;
}
return insideThinkTag;
}
   /**
    * Aborts any ongoing chat completion request.
    * Cancels the current request and cleans up the abort controller.

View File

@@ -5,7 +5,6 @@ import { config } from '$lib/stores/settings.svelte';
 import { filterByLeafNodeId, findLeafNode, findDescendantMessages } from '$lib/utils/branching';
 import { browser } from '$app/environment';
 import { goto } from '$app/navigation';
-import { extractPartialThinking } from '$lib/utils/thinking';
 import { toast } from 'svelte-sonner';
 import type { ExportedConversations } from '$lib/types/database';

@@ -344,11 +343,9 @@
           this.currentResponse = streamedContent;
           captureModelIfNeeded();

-          const partialThinking = extractPartialThinking(streamedContent);
           const messageIndex = this.findMessageIndex(assistantMessage.id);
           this.updateMessageAtIndex(messageIndex, {
-            content: partialThinking.remainingContent || streamedContent
+            content: streamedContent
           });
         },

@@ -696,18 +693,16 @@
     if (lastMessage && lastMessage.role === 'assistant') {
       try {
-        const partialThinking = extractPartialThinking(this.currentResponse);
         const updateData: {
           content: string;
           thinking?: string;
           timings?: ChatMessageTimings;
         } = {
-          content: partialThinking.remainingContent || this.currentResponse
+          content: this.currentResponse
         };

-        if (partialThinking.thinking) {
-          updateData.thinking = partialThinking.thinking;
+        if (lastMessage.thinking?.trim()) {
+          updateData.thinking = lastMessage.thinking;
         }

         const lastKnownState = await slotsService.getCurrentState();

@@ -727,7 +722,10 @@
         await DatabaseStore.updateMessage(lastMessage.id, updateData);

-        lastMessage.content = partialThinking.remainingContent || this.currentResponse;
+        lastMessage.content = this.currentResponse;
+        if (updateData.thinking !== undefined) {
+          lastMessage.thinking = updateData.thinking;
+        }

         if (updateData.timings) {
           lastMessage.timings = updateData.timings;
         }

View File

@@ -1,143 +0,0 @@
/**
* Parses thinking content from a message that may contain <think> tags or [THINK] tags
* Returns an object with thinking content and cleaned message content
* Handles both complete blocks and incomplete blocks (streaming)
* Supports formats: <think>...</think> and [THINK]...[/THINK]
* @param content - The message content to parse
* @returns An object containing the extracted thinking content and the cleaned message content
*/
export function parseThinkingContent(content: string): {
thinking: string | null;
cleanContent: string;
} {
const incompleteThinkMatch = content.includes('<think>') && !content.includes('</think>');
const incompleteThinkBracketMatch = content.includes('[THINK]') && !content.includes('[/THINK]');
if (incompleteThinkMatch) {
const cleanContent = content.split('</think>')?.[1]?.trim();
const thinkingContent = content.split('<think>')?.[1]?.trim();
return {
cleanContent,
thinking: thinkingContent
};
}
if (incompleteThinkBracketMatch) {
const cleanContent = content.split('[/THINK]')?.[1]?.trim();
const thinkingContent = content.split('[THINK]')?.[1]?.trim();
return {
cleanContent,
thinking: thinkingContent
};
}
const completeThinkMatch = content.match(/<think>([\s\S]*?)<\/think>/);
const completeThinkBracketMatch = content.match(/\[THINK\]([\s\S]*?)\[\/THINK\]/);
if (completeThinkMatch) {
const thinkingContent = completeThinkMatch[1]?.trim() ?? '';
const cleanContent = `${content.slice(0, completeThinkMatch.index ?? 0)}${content.slice(
(completeThinkMatch.index ?? 0) + completeThinkMatch[0].length
)}`.trim();
return {
thinking: thinkingContent,
cleanContent
};
}
if (completeThinkBracketMatch) {
const thinkingContent = completeThinkBracketMatch[1]?.trim() ?? '';
const cleanContent = `${content.slice(0, completeThinkBracketMatch.index ?? 0)}${content.slice(
(completeThinkBracketMatch.index ?? 0) + completeThinkBracketMatch[0].length
)}`.trim();
return {
thinking: thinkingContent,
cleanContent
};
}
return {
thinking: null,
cleanContent: content
};
}
/**
* Checks if content contains an opening thinking tag (for streaming)
* Supports both <think> and [THINK] formats
* @param content - The message content to check
* @returns True if the content contains an opening thinking tag
*/
export function hasThinkingStart(content: string): boolean {
return (
content.includes('<think>') ||
content.includes('[THINK]') ||
content.includes('<|channel|>analysis')
);
}
/**
* Checks if content contains a closing thinking tag (for streaming)
* Supports both </think> and [/THINK] formats
* @param content - The message content to check
* @returns True if the content contains a closing thinking tag
*/
export function hasThinkingEnd(content: string): boolean {
return content.includes('</think>') || content.includes('[/THINK]');
}
/**
* Extracts partial thinking content during streaming
* Supports both <think> and [THINK] formats
* Used when we have opening tag but not yet closing tag
* @param content - The message content to extract partial thinking from
* @returns An object containing the extracted partial thinking content and the remaining content
*/
export function extractPartialThinking(content: string): {
thinking: string | null;
remainingContent: string;
} {
const thinkStartIndex = content.indexOf('<think>');
const thinkEndIndex = content.indexOf('</think>');
const bracketStartIndex = content.indexOf('[THINK]');
const bracketEndIndex = content.indexOf('[/THINK]');
const useThinkFormat =
thinkStartIndex !== -1 && (bracketStartIndex === -1 || thinkStartIndex < bracketStartIndex);
const useBracketFormat =
bracketStartIndex !== -1 && (thinkStartIndex === -1 || bracketStartIndex < thinkStartIndex);
if (useThinkFormat) {
if (thinkEndIndex === -1) {
const thinkingStart = thinkStartIndex + '<think>'.length;
return {
thinking: content.substring(thinkingStart),
remainingContent: content.substring(0, thinkStartIndex)
};
}
} else if (useBracketFormat) {
if (bracketEndIndex === -1) {
const thinkingStart = bracketStartIndex + '[THINK]'.length;
return {
thinking: content.substring(thinkingStart),
remainingContent: content.substring(0, bracketStartIndex)
};
}
} else {
return { thinking: null, remainingContent: content };
}
const parsed = parseThinkingContent(content);
return {
thinking: parsed.thinking,
remainingContent: parsed.cleanContent
};
}

View File

@@ -36,6 +36,31 @@
     children: []
   };

+  const assistantWithReasoning: DatabaseMessage = {
+    id: '3',
+    convId: 'conv-1',
+    type: 'message',
+    timestamp: Date.now() - 1000 * 60 * 2,
+    role: 'assistant',
+    content: "Here's the concise answer, now that I've thought it through carefully for you.",
+    parent: '1',
+    thinking:
+      "Let's consider the user's question step by step:\\n\\n1. Identify the core problem\\n2. Evaluate relevant information\\n3. Formulate a clear answer\\n\\nFollowing this process ensures the final response stays focused and accurate.",
+    children: []
+  };
+
+  const rawOutputMessage: DatabaseMessage = {
+    id: '6',
+    convId: 'conv-1',
+    type: 'message',
+    timestamp: Date.now() - 1000 * 60,
+    role: 'assistant',
+    content:
+      '<|channel|>analysis<|message|>User greeted me. Initiating overcomplicated analysis: Is this a trap? No, just a normal hello. Respond calmly, act like a helpful assistant, and do not start explaining quantum physics again. Confidence 0.73. Engaging socially acceptable greeting protocol...<|end|>Hello there! How can I help you today?',
+    parent: '1',
+    thinking: '',
+    children: []
+  };
+
   let processingMessage = $state({
     id: '4',
     convId: 'conv-1',

@@ -59,60 +84,6 @@
     thinking: '',
     children: []
   });
// Message with <think> format thinking content
const thinkTagMessage: DatabaseMessage = {
id: '6',
convId: 'conv-1',
type: 'message',
timestamp: Date.now() - 1000 * 60 * 2,
role: 'assistant',
content:
"<think>\nLet me analyze this step by step:\n\n1. The user is asking about thinking formats\n2. I need to demonstrate the &lt;think&gt; tag format\n3. This content should be displayed in the thinking section\n4. The main response should be separate\n\nThis is a good example of reasoning content.\n</think>\n\nHere's my response after thinking through the problem. The thinking content above should be displayed separately from this main response content.",
parent: '1',
thinking: '',
children: []
};
// Message with [THINK] format thinking content
const thinkBracketMessage: DatabaseMessage = {
id: '7',
convId: 'conv-1',
type: 'message',
timestamp: Date.now() - 1000 * 60 * 1,
role: 'assistant',
content:
'[THINK]\nThis is the DeepSeek-style thinking format:\n\n- Using square brackets instead of angle brackets\n- Should work identically to the &lt;think&gt; format\n- Content parsing should extract this reasoning\n- Display should be the same as &lt;think&gt; format\n\nBoth formats should be supported seamlessly.\n[/THINK]\n\nThis is the main response content that comes after the [THINK] block. The reasoning above should be parsed and displayed in the thinking section.',
parent: '1',
thinking: '',
children: []
};
// Streaming message for <think> format
let streamingThinkMessage = $state({
id: '8',
convId: 'conv-1',
type: 'message',
timestamp: 0, // No timestamp = streaming
role: 'assistant',
content: '',
parent: '1',
thinking: '',
children: []
});
// Streaming message for [THINK] format
let streamingBracketMessage = $state({
id: '9',
convId: 'conv-1',
type: 'message',
timestamp: 0, // No timestamp = streaming
role: 'assistant',
content: '',
parent: '1',
thinking: '',
children: []
});
 </script>

 <Story

@@ -120,6 +91,10 @@
   args={{
     message: userMessage
   }}
+  play={async () => {
+    const { updateConfig } = await import('$lib/stores/settings.svelte');
+    updateConfig('disableReasoningFormat', false);
+  }}
 />

 <Story

@@ -128,15 +103,45 @@
     class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
     message: assistantMessage
   }}
+  play={async () => {
+    const { updateConfig } = await import('$lib/stores/settings.svelte');
+    updateConfig('disableReasoningFormat', false);
+  }}
 />

 <Story
-  name="WithThinkingBlock"
+  name="AssistantWithReasoning"
+  args={{
+    class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
+    message: assistantWithReasoning
+  }}
+  play={async () => {
+    const { updateConfig } = await import('$lib/stores/settings.svelte');
+    updateConfig('disableReasoningFormat', false);
+  }}
+/>
+
+<Story
+  name="RawLlmOutput"
+  args={{
+    class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
+    message: rawOutputMessage
+  }}
+  play={async () => {
+    const { updateConfig } = await import('$lib/stores/settings.svelte');
+    updateConfig('disableReasoningFormat', true);
+  }}
+/>
+
+<Story
+  name="WithReasoningContent"
   args={{
     message: streamingMessage
   }}
   asChild
   play={async () => {
+    const { updateConfig } = await import('$lib/stores/settings.svelte');
+    updateConfig('disableReasoningFormat', false);
+
     // Phase 1: Stream reasoning content in chunks
     let reasoningText =
       'I need to think about this carefully. Let me break down the problem:\n\n1. The user is asking for help with something complex\n2. I should provide a thorough and helpful response\n3. I need to consider multiple approaches\n4. The best solution would be to explain step by step\n\nThis approach will ensure clarity and understanding.';

@@ -187,126 +192,16 @@
     message: processingMessage
   }}
   play={async () => {
+    const { updateConfig } = await import('$lib/stores/settings.svelte');
+    updateConfig('disableReasoningFormat', false);
+
     // Import the chat store to simulate loading state
     const { chatStore } = await import('$lib/stores/chat.svelte');

     // Set loading state to true to trigger the processing UI
     chatStore.isLoading = true;

     // Simulate the processing state hook behavior
     // This will show the "Generating..." text and parameter details
-    await new Promise(resolve => setTimeout(resolve, 100));
+    await new Promise((resolve) => setTimeout(resolve, 100));
   }}
 />
<Story
name="ThinkTagFormat"
args={{
class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
message: thinkTagMessage
}}
/>
<Story
name="ThinkBracketFormat"
args={{
class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
message: thinkBracketMessage
}}
/>
<Story
name="StreamingThinkTag"
args={{
message: streamingThinkMessage
}}
parameters={{
test: {
timeout: 30000
}
}}
asChild
play={async () => {
// Phase 1: Stream <think> reasoning content
const thinkingContent =
'Let me work through this problem systematically:\n\n1. First, I need to understand what the user is asking\n2. Then I should consider different approaches\n3. I need to evaluate the pros and cons\n4. Finally, I should provide a clear recommendation\n\nThis step-by-step approach will ensure accuracy.';
let currentContent = '<think>\n';
streamingThinkMessage.content = currentContent;
for (let i = 0; i < thinkingContent.length; i++) {
currentContent += thinkingContent[i];
streamingThinkMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 5));
}
// Close the thinking block
currentContent += '\n</think>\n\n';
streamingThinkMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 200));
// Phase 2: Stream main response content
const responseContent =
"Based on my analysis above, here's the solution:\n\n**Key Points:**\n- The approach should be systematic\n- We need to consider all factors\n- Implementation should be step-by-step\n\nThis ensures the best possible outcome.";
for (let i = 0; i < responseContent.length; i++) {
currentContent += responseContent[i];
streamingThinkMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 10));
}
streamingThinkMessage.timestamp = Date.now();
}}
>
<div class="w-[56rem]">
<ChatMessage message={streamingThinkMessage} />
</div>
</Story>
<Story
name="StreamingThinkBracket"
args={{
message: streamingBracketMessage
}}
parameters={{
test: {
timeout: 30000
}
}}
asChild
play={async () => {
// Phase 1: Stream [THINK] reasoning content
const thinkingContent =
'Using the DeepSeek format now:\n\n- This demonstrates the &#91;THINK&#93; bracket format\n- Should parse identically to &lt;think&gt; tags\n- The UI should display this in the thinking section\n- Main content should be separate\n\nBoth formats provide the same functionality.';
let currentContent = '[THINK]\n';
streamingBracketMessage.content = currentContent;
for (let i = 0; i < thinkingContent.length; i++) {
currentContent += thinkingContent[i];
streamingBracketMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 5));
}
// Close the thinking block
currentContent += '\n[/THINK]\n\n';
streamingBracketMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 200));
// Phase 2: Stream main response content
const responseContent =
"Here's my response after using the &#91;THINK&#93; format:\n\n**Observations:**\n- Both &lt;think&gt; and &#91;THINK&#93; formats work seamlessly\n- The parsing logic handles both cases\n- UI display is consistent across formats\n\nThis demonstrates the enhanced thinking content support.";
for (let i = 0; i < responseContent.length; i++) {
currentContent += responseContent[i];
streamingBracketMessage.content = currentContent;
await new Promise((resolve) => setTimeout(resolve, 10));
}
streamingBracketMessage.timestamp = Date.now();
}}
>
<div class="w-[56rem]">
<ChatMessage message={streamingBracketMessage} />
</div>
</Story>