llama.cpp/docs/autoparser.md

21 KiB

Unified Auto-Parser Architecture

The auto-parser automatically analyzes chat templates to determine how to parse model outputs, including content, reasoning, and tool calls.

Overview

The unified auto-parser uses a two-phase incremental analysis approach:

  1. Phase 1: Content & Reasoning Analysis - Analyzes how the template handles basic content and reasoning, without considering tools
  2. Phase 2: Tool Call Analysis - Analyzes tool calling patterns, layered on top of Phase 1

Data Structures

content_structure (Phase 1 Result)

Describes how the template handles content and reasoning:

struct content_structure {
    enum reasoning_mode_type {
        REASONING_NONE,         // No reasoning markers detected
        REASONING_OPTIONAL,     // <think>...</think> may appear before content
        REASONING_FORCED_OPEN,  // Template ends with open reasoning tag OR starts implicitly (empty start, present end)
    };

    reasoning_mode_type reasoning_mode = REASONING_NONE;
    std::string         reasoning_start;  // e.g., "<think>", "<|START_THINKING|>"
    std::string         reasoning_end;    // e.g., "</think>", "<|END_THINKING|>"

    // Content wrapping mode
    enum content_mode_type {
        CONTENT_PLAIN,                   // No content markers
        CONTENT_ALWAYS_WRAPPED,          // <response>...</response> always present
        CONTENT_WRAPPED_WITH_REASONING,  // Content wrapped only when reasoning present
    };

    content_mode_type content_mode = CONTENT_PLAIN;
    std::string       content_start;  // e.g., "<response>", "<|START_RESPONSE|>"
    std::string       content_end;    // e.g., "</response>", "<|END_RESPONSE|>"
};

tool_call_structure (Phase 2 Result)

Describes how the template formats tool calls:

struct tool_call_structure {
    bool supports_tools = false;

    // Container markers (what wraps all tool calls)
    std::string tool_section_start;  // e.g., "<tool_call>", "[TOOL_CALLS]", "<TOOLCALL>", ""
    std::string tool_section_end;    // e.g., "</tool_call>", "]", "</TOOLCALL>", ""

    // Function format (how individual functions are structured)
    enum function_format {
        FUNC_JSON_OBJECT,       // {"name": "X", "arguments": {...}}
        FUNC_TAG_WITH_NAME,     // <function=X>{...}</function>
        FUNC_TAG_NAME_ONLY,     // <X>...</X> where X is function name (rare)
        FUNC_PREFIXED_INDEXED,  // <|tool_call_begin|>functions.X:0<|tool_call_argument_begin|>{...}<|tool_call_end|>
        FUNC_NAME_AS_KEY,       // [{"function_name": {...arguments...}}] (Apertus-style)
        FUNC_BRACKET_TAG,       // [TOOL_CALLS]X[CALL_ID]id[ARGS]{...} (Mistral Small 3.2 style)
        FUNC_RECIPIENT_BASED,   // >>>recipient\n{content} where recipient is "all" (content) or function name (tools)
        FUNC_MARKDOWN_CODE_BLOCK,  // Action:\n```json\n[{"tool_name": "X", ...}]\n``` (Cohere Command-R Plus)
    };
    function_format function_format = FUNC_JSON_OBJECT;

    // For FUNC_JSON_OBJECT format - field names (may vary between templates)
    std::string name_field = "name";       // Could be "tool_name", "function"
    std::string args_field = "arguments";  // Could be "parameters", "params", "input"
    std::string id_field;                  // Optional: "id", "tool_call_id", ""

    // For FUNC_TAG_WITH_NAME format
    std::string function_prefix;  // e.g., "<function="
    std::string function_suffix;  // e.g., ">"
    std::string function_close;   // e.g., "</function>"

    // For FUNC_PREFIXED_INDEXED format (e.g., Kimi-K2)
    std::string per_call_start;      // e.g., "<|tool_call_begin|>"
    std::string function_namespace;  // e.g., "functions." (prefix before function name)
    std::string args_marker;         // e.g., "<|tool_call_argument_begin|>"
    std::string per_call_end;        // e.g., "<|tool_call_end|>"

    // For FUNC_BRACKET_TAG format (e.g., Mistral Small 3.2)
    std::string id_marker;  // e.g., "[CALL_ID]" - marker before tool call ID

    // For FUNC_MARKDOWN_CODE_BLOCK format (Cohere Command-R Plus)
    std::string code_block_marker;    // e.g., "Action:" - text marker before code block
    std::string code_block_language;  // e.g., "json" - language identifier in code fence

    // Argument format (how arguments are structured within a function)
    enum argument_format {
        ARGS_JSON,            // Standard JSON object: {"key": "value", ...}
        ARGS_TAGGED,          // XML-style: <param=key>value</param>
        ARGS_KEY_VALUE_TAGS,  // <arg_key>key</arg_key><arg_value>value</arg_value> (GLM-4.6)
    };
    argument_format argument_format = ARGS_JSON;

    // For ARGS_TAGGED format
    std::string arg_prefix;     // e.g., "<param=", "<parameter="
    std::string arg_suffix;     // e.g., ">"
    std::string arg_close;      // e.g., "</param>", "</parameter>"
    std::string arg_separator;  // e.g., "", "\n"

    // Flag: template renders null content as "None" string, requires empty string instead
    bool requires_nonnull_content = false;
};

Analysis Flow

Template
    |
    v
Phase 1: analyze_content_structure()
    |-- detect_reasoning_markers() - compare outputs with reasoning_content vs without
    |-- detect_content_markers() - render with content and detect wrapping
    |-- detect_reasoning_mode() - check if prompt ends with open tag
    |
    v
content_structure
    |
    v
Phase 2: analyze_tool_structure()
    |-- Check minja.supports_tool_calls
    |-- Differential analysis for tool patterns
    |-- Classify function format (JSON vs tagged)
    |-- Classify argument format (JSON vs tagged)
    |
    v
tool_call_structure
    |
    v
generate_parser(content_structure, tool_call_structure)
    |-- build_reasoning_block(content_structure)
    |-- build_content_block(content_structure)
    |-- build_tool_section(tool_call_structure, tools)
    |-- Compose into final parser
    |
    v
common_chat_params (parser, grammar, triggers, preserved_tokens)

Entry Point

The mechanism starts in common/chat.cpp, in common_chat_templates_apply_jinja:

// 1. Analyze the template (two-phase)
template_analysis_result analysis = template_analyzer::analyze_template(tmpl);

// 2. Generate the parser and grammar
auto auto_params = universal_peg_generator::generate_parser(analysis, tmpl, params);

// 3. Use if it provides more than basic content handling
if (auto_params.format != COMMON_CHAT_FORMAT_CONTENT_ONLY ||
    auto_params.thinking_forced_open ||
    !auto_params.parser.empty()) {
    return auto_params;
}

Builder Methods

The unified builder (common_chat_peg_unified_builder) provides high-level methods:

  • build_reasoning_block(cs, reasoning_format, thinking_forced_open) - Build reasoning parser
  • build_content_block(cs, reasoning_format) - Build content parser
  • build_tool_section(ts, tools, parallel_tool_calls, force_tool_calls) - Build tool section
  • build_function(ts, name, schema) - Build single function parser
  • build_arguments(ts, schema) - Build arguments parser

Key Templates Supported

  • Granite - <think></think> + <response></response> with tool calls
  • Nemotron - JSON tools with <TOOLCALL> wrapper
  • Qwen/Hermes - XML-style <function=X><param=key> format
  • Command-R7B - <|START_THINKING|>/<|START_RESPONSE|> + <|START_ACTION|> tools
  • DeepSeek R1 - Forced thinking + complex tools
  • Mistral Nemo - [TOOL_CALLS] wrapper
  • MiniMax - <minimax:tool_call> wrapper with XML tools
  • GLM-4.6 - <minimax:tool_call> + <tool_call>name\n<arg_key>...<arg_value>... format
  • Kimi-K2 - FUNC_PREFIXED_INDEXED format with namespace and indices
  • Mistral Small 3.2 - FUNC_BRACKET_TAG format with [TOOL_CALLS] markers
  • Functionary v3.2 - FUNC_RECIPIENT_BASED format with >>> routing

Files

File Purpose
common/chat-auto-parser.h Data structures and API declarations
common/chat-auto-parser-analyzer.cpp Phase 1 and Phase 2 analysis implementation
common/chat-auto-parser-generator.cpp PEG parser generator
common/chat-auto-parser-helpers.h/cpp Shared helper functions
common/chat-peg-parser.h/cpp Unified builder and mapper classes
common/chat.cpp Main entry point and wire-up

Algorithm Details

Phase 1: Content & Reasoning Analysis

Reasoning Detection (4 Methods)

Method 1: Differential Reasoning Content Analysis

  • Render template with reasoning_content field present vs absent
  • Compare outputs to find markers between THOUGHT_MARKER and CONTENT_MARKER
  • If only closing tag found, derive opening tag using patterns:
    • XML: </tag><tag>
    • Special tokens: <|END_X|><|START_X|>, <|/X|><|X|>
  • Handles various tag formats including XML and special token formats

Method 2: Enable-Thinking Toggle Analysis

  • Toggle enable_thinking context variable between true/false
  • Detects differences in generated prompts
  • Handles two scenarios:
    • Normal case: enable_thinking=true adds reasoning markers
    • Reverse case: enable_thinking=false adds empty thinking block (GLM-4.6 style)
  • Uses string difference analysis to extract markers
  • Validates extracted tags against blacklist of role markers

Method 3: Prompt Ending Analysis

  • Checks if prompt ends with unclosed reasoning tag
  • Looks for trailing tags in prompt with enable_thinking=true
  • Differentiates between open tags (<think>) and close tags (</think>)
  • Handles blacklisted tags (role markers, system tokens)
  • Validates reasoning-like patterns (contains "think", "reason", "thought")

Method 4: Adjacent Tag Pair Detection

  • Looks for patterns like <minimax:tool_call></think>, <|START_THINKING|><|END_THINKING|>, [think][/think]
  • Searches for predefined tag patterns in prompt
  • Validates tags are adjacent with only whitespace between
  • Supports both simple and complex token formats

Content Detection Algorithm

  1. Dual-Mode Rendering: Render template with content marker in both thinking-enabled and thinking-disabled modes
  2. Pattern Matching: Search for known content wrapper patterns:
    • <|START_RESPONSE|> / <|END_RESPONSE|>
    • <response> / </response>
    • <output> / </output>
    • <answer> / </answer>
    • <|CHATBOT_TOKEN|> / <|END_OF_TURN_TOKEN|>
  3. Mode Classification:
    • CONTENT_ALWAYS_WRAPPED: Found in both thinking modes
    • CONTENT_WRAPPED_WITH_REASONING: Found only with thinking enabled
    • CONTENT_PLAIN: No wrapping detected

Reasoning Mode Detection

  • REASONING_FORCED_OPEN:
    • Explicit: Prompt ends with reasoning start marker (e.g., <think>).
    • Implicit: reasoning end marker is present but start marker is empty (e.g., [BEGIN FINAL RESPONSE]).
  • REASONING_OPTIONAL: Markers present but not forced.
  • REASONING_NONE: No markers detected.

Phase 2: Tool Call Structure Analysis

Differential Analysis Algorithm

Test Payload Strategy:

  1. Base: User + Assistant with content only (no tools)
  2. Tool 1: User + Assistant with tool_calls (empty args)
  3. Tool 2: User + Assistant with tool_calls (with args)
  4. Tool 3: User + Assistant with multiple tool calls

Pattern Extraction Process:

  1. Compute string differences between base and tool outputs
  2. Use test_function_name as reliable search anchor (using rfind for last occurrence)
  3. Extract structural elements:
    • tool_call_opener: Common prefix before function name
    • tool_call_closer: Common suffix after function calls
    • function_opener: Tag immediately before function name
    • function_closer: Tag after function content
    • parameter_key_prefix/suffix: Argument wrapping patterns

Format Classification Logic

FORMAT_JSON_NATIVE:

  • Detected by {"name": pattern in tool_call_opener
  • Or XML markers with JSON structure

FORMAT_XML_CONSTRUCTED:

  • function_opener starts with <
  • No substantial parameter markers

FORMAT_RECIPIENT_BASED:

  • tool_call_start_marker == function_opener
  • No parameter markers
  • Opener doesn't start with structural chars

FORMAT_BRACKET_TAG:

  • function_name_suffix contains bracket tags like [CALL_ID]...[ARGS]
  • tool_call_start_marker matches [TOOL_CALLS] pattern

FORMAT_PREFIXED_INDEXED:

  • function_opener ends with . (namespace separator)
  • function_name_suffix starts with : followed by digit
  • Example: functions.name:0<|tool_call_argument_begin|>

Specialized Format Handling

FUNC_PREFIXED_INDEXED (Kimi-K2):

  • Splits function_opener at last > to get per_call_start + function_namespace
  • Extracts args_marker from function_name_suffix
  • Derives per_call_end by matching structural patterns in tool_call_closer

FUNC_TAG_WITH_NAME (Functionary/Nemotron):

  • Detects nested vs non-nested formats
  • Uses overlap detection between tool_section_start and function_prefix
  • Handles double-wrapping prevention

ARGS_KEY_VALUE_TAGS (GLM-4.6):

  • Detects <arg_key>key</arg_key><arg_value>value</arg_value> pattern
  • Cleans up suffix to extract just the key closer

FUNC_RECIPIENT_BASED (Functionary v3.2):

  • Detects >>> recipient delimiter format
  • Routes to "all" for content, function name for tools
  • Uses same delimiter for both content and tool routing

FUNC_BRACKET_TAG (Mistral Small 3.2/Devstral):

  • Detects [TOOL_CALLS]function_name[ARGS]{...} pattern
  • Optional [CALL_ID]id marker for tool call identification
  • No section wrapper - each call starts independently

Generator Algorithms

Unified Parser Building

Composition Strategy:

// Standard format
sequence({ reasoning, space(), content, space(), tools, space(), content, end() })

// With section markers
sequence({ reasoning, space(), content_until(section_start), space(), tools, space(), content, end() })

// Forced thinking handling
optional(reasoning) when thinking_forced_open && tools present

Trigger Word Detection:

  • Uses tool_section_start as primary trigger
  • Falls back to function_prefix or per_call_start
  • Raw JSON uses regex pattern trigger

Lazy Grammar Optimization:

  • Enabled by default for performance
  • Disabled when thinking forced open
  • Disabled when no clear trigger word exists

Testing & Debugging

Comprehensive Test Coverage

The test suite covers:

Reasoning Models:

  • Qwen-QwQ-32B (forced-open thinking)
  • DeepSeek R1 variants (reasoning only)
  • IBM Granite (reasoning + tools)
  • ByteDance Seed-OSS (custom reasoning tags)
  • Ministral-3-14B-Reasoning
  • llama-cpp-deepseek-r1

Tool Call Formats:

  • JSON: Llama 3.x, Mistral Nemo, Hermes, MiMo-VL
  • XML: Nemotron, Qwen3-Coder, MiniMax
  • Tagged: GLM-4.6 (key-value tags)
  • Bracket-tag: Mistral Small 3.2, Devstral
  • Prefixed-indexed: Kimi-K2 variants
  • Name-as-key: Apertus-8B
  • Recipient-based: Functionary v3.2

Edge Cases:

  • Streaming/partial parsing
  • Empty content with tools
  • Parallel tool calls
  • Forced thinking mode
  • Multi-byte Unicode markers
  • Null content handling
  • Multi-line code in tool arguments
  • Custom reasoning tags (ByteDance Seed-OSS)

Debug Tools

Template Debugger: tests/debug-template-parser.cpp

  • Usage: ./bin/debug-template-parser path/to/template.jinja
  • Shows detected format, markers, generated parser, and GBNF grammar

Debug Logging: Enable with LLAMA_LOG_VERBOSITY=2

  • Shows detailed analysis steps
  • Displays pattern extraction results
  • Lists generated parser structure

PEG Test Builder: Fluent API for creating test cases

auto tst = peg_tester("template.jinja");
tst.test("input")
   .reasoning_format(COMMON_REASONING_FORMAT_AUTO)
   .tools({tool})
   .expect(expected_message)
   .run();

Adding Support for New Templates

To support a new template format:

  1. If it follows standard patterns - The auto-parser should detect it automatically
  2. If it has unique markers - Add the markers to the detection patterns in:
    • detect_reasoning_markers() for reasoning tags
    • detect_content_markers() for content wrappers
    • extract_patterns_from_differences() for tool call patterns
  3. If it needs special handling - Add a dedicated handler in chat.cpp before the auto-parser block

Edge Cases and Quirks

  1. Forced Thinking: If enable_thinking is true but the model has already started a thought block (e.g., ended the prompt with <think>), the parser enters "forced thinking" mode where it immediately expects reasoning content.
  2. Ambiguous Content: Templates that mix content and tool calls without clear delimiters can be tricky. The analyzer tries to find "common" start/end patterns across multiple examples to be robust.
  3. Double Wrapping: Some templates (e.g., Functionary) use the same string for both the tool section start and the function prefix (e.g., <function=). The analyzer detects this overlap and prevents double-wrapping in the generated parser.
  4. Null Content Rendering: Some templates render null content as Python "None" string. The analyzer detects this and patches content to empty string.
  5. Multi-byte Unicode Markers: Some templates use special Unicode characters in markers that require careful handling in GBNF generation.

State of the Autoparser (Jan 2026)

As of January 2026, the unified auto-parser successfully handles major template families including DeepSeek V3/R1, Llama 3.x (native JSON), GLM-4/4.6, and standard XML/JSON formats. It also supports Functionary v3.1/v3.2, Mistral variants, and specialized formats like Kimi-K2's prefixed-indexed structure.

Tested Templates

The following templates have active tests in tests/test-chat.cpp:

Template Format Notes
DeepSeek V3.1 FUNC_JSON_OBJECT Forced thinking mode
DeepSeek R1 Distill (Llama/Qwen) Reasoning only Forced-open thinking
llama-cpp-deepseek-r1 Reasoning only Forced-open thinking
GLM-4.6 ARGS_KEY_VALUE_TAGS <tool_call>name\n<arg_key>...<arg_value>... format
Kimi-K2 / Kimi-K2-Instruct / Kimi-K2-Thinking FUNC_PREFIXED_INDEXED functions.name:0 with special markers
Apertus-8B-Instruct FUNC_NAME_AS_KEY {"function_name": {...}} format
MiniMax-M2 FUNC_TAG_WITH_NAME XML invoke with parameter tags
NVIDIA-Nemotron-Nano-v2 FUNC_JSON_OBJECT <TOOLCALL> wrapper (nested)
Mistral-Nemo-Instruct-2407 FUNC_JSON_OBJECT [TOOL_CALLS] wrapper with id field
Functionary v3.1 FUNC_TAG_WITH_NAME <function=X> non-nested format
Functionary v3.2 FUNC_RECIPIENT_BASED >>> recipient delimiter format
MiMo-VL / Hermes 3 / Qwen 2.5 FUNC_JSON_OBJECT <tool_call> wrapper
Apriel 1.5 FUNC_JSON_OBJECT <tool_calls> wrapper with JSON array
Apriel 1.6 Thinker Reasoning only Implicit reasoning start
Cohere Command-R7B FUNC_JSON_OBJECT START_RESPONSE/ACTION/THINKING markers
Mistral Small 3.2 FUNC_BRACKET_TAG [TOOL_CALLS]func[ARGS]{...} with ID
Devstral FUNC_BRACKET_TAG [TOOL_CALLS]func[ARGS]{...} without ID
Ministral-3-14B-Reasoning Custom reasoning [THINK]...[/THINK] tags
IBM Granite FUNC_JSON_OBJECT <think></think> + <response></response>
ByteDance Seed-OSS FUNC_TAG_WITH_NAME Custom <seed:think> and <seed:tool_call> tags
Qwen3-Coder FUNC_TAG_WITH_NAME XML-style tool format
Cohere Command-R Plus FUNC_MARKDOWN_CODE_BLOCK Action:\n\``json\n[...]\n```` format

Currently Unsupported Templates

Template Family Model / Variant Issue Description
OpenAI GPT-OSS Complex channel markers need new format

Templates Without Tool Support

Some templates genuinely don't support tool calls (this is not a detection bug):

  • Phi 3.5 Mini - The official template has no tool handling. Use Phi-4-mini-instruct for function calling, or community fine-tuned versions.
  • Google Gemma 2 2B - Pure instruction-following model without tool capabilities.

TODO / Roadmap

  • Fix OpenAI GPT-OSS: Add FUNC_CHANNEL_BASED format for channel marker structure.
  • Fix Cohere Command-R Plus: Added FUNC_MARKDOWN_CODE_BLOCK format for Action:\n\``json` structure.

Recent Additions (Dec 2025 - Jan 2026)

  • FUNC_RECIPIENT_BASED: Support for Functionary v3.2's >>> recipient delimiter format
  • FUNC_BRACKET_TAG: Support for Mistral Small 3.2 and Devstral's [TOOL_CALLS]... format
  • Enhanced Content Detection: Better handling of custom reasoning tags and content wrappers
  • Improved Streaming Support: Better handling of partial parsing for all supported formats
  • Custom Tag Support: Support for non-standard reasoning tags like <seed:think> (ByteDance)
  • Multi-line Tool Arguments: Better parsing of complex tool arguments with code blocks
  • FUNC_MARKDOWN_CODE_BLOCK: Support for Cohere Command-R Plus markdown code block format
  • Implicit Reasoning Support: Support for templates where reasoning starts implicitly without a start marker.

The auto-parser now successfully handles 25+ different template formats across reasoning-only, tool-calling, and hybrid models, with comprehensive test coverage ensuring robust parsing across streaming and non-streaming scenarios.