llama.cpp/examples/llama-eval/llama-eval-discussion.md

llama-eval Implementation Discussion

Overview

Notes from a discussion about implementing a lean evaluation tool for llama.cpp, based on ggerganov's feedback in PR #18892.

Key Requirements from ggerganov

1. Simplify and Focus on One Eval

  • Start with AIME2025 (the eval he is most familiar with)
  • Don't support multiple evals initially

2. Implement an "eval state" object

  • ID
  • List of tasks
  • Task states
  • Sampling config

3. Implement a "processor" object

  • List of endpoints
  • Threads per endpoint
  • Grade/judge type (regex, endpoint, or CLI tool)
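
The processor configuration could look something like the following; again, these names are assumptions used for illustration:

```python
# Sketch of the "processor" configuration described above.
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    url: str            # OpenAI-compatible /v1/chat/completions base URL
    n_threads: int = 8  # concurrent requests allowed against this endpoint

@dataclass
class ProcessorConfig:
    endpoints: list[Endpoint] = field(default_factory=list)
    grader_type: str = "regex"   # "regex", "endpoint", or "cli"
```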

4. Processor responsibilities

  • Accepts eval state
  • Starts processing
  • Dumps eval state periodically as it progresses

5. Real-time feedback

  • Default: show "correct / not correct" for each task
  • Verbose mode: show produced answer vs expected answer as soon as it completes

6. Grading approach

  • Abstract grading to support external "grader" or "judge"
  • Use LLM post-processing instead of regex (to avoid the extraction issues encountered in the GPT-OSS evals)

7. Output format

  • Use structured output (JSON) instead of boxed text

Current Implementation Analysis

What exists in llama-eval.py:

  • Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
  • Regex-based answer extraction
  • HTTP requests to OpenAI-compatible endpoint
  • Checkpointing/resume capability
  • Thread-based parallel execution
  • Summary reporting

What needs to be removed:

  • All task implementations except AIME
  • Regex-based grading
  • Multiple endpoint support
  • Complex task loading logic
  • Summary reporting (replace with real-time feedback)

Discussion Points

1. Eval State Object Structure

Status: Under Discussion

Questions:

  • What fields should be in the eval state object?
  • Should it include the actual prompts, or just metadata?
  • How should task states be tracked?

2. Processor Architecture

Status: Not Started

Questions:

  • Should the processor handle multiple endpoints (for distributed evaluation)?
  • What's the threading model?
  • How are endpoints configured?

3. Grader Interface

Status: Not Started

Questions:

  • How should the grader be configured?
  • Should it be a separate service, or a local LLM call?
  • What's the interface for grading?

4. Checkpointing

Status: Not Started

Questions:

  • Should the eval state be serialized to disk?
  • How often should it be dumped?
  • What format should it use?

5. Real-time Output

Status: Not Started

Questions:

  • How should progress be displayed?
  • Console output, file logging, or both?
  • What verbosity levels are needed?

6. Output Format

Status: Not Started

Questions:

  • Should responses be in JSON format?
  • How should the grader interface work with JSON output?

Next Steps

  1. Eval State Object - Currently discussing
  2. Processor Architecture
  3. Grader Interface
  4. Checkpointing
  5. Real-time Output
  6. Output Format

References

Session Work Summary

llama-server-simulator Implementation

Created:

  • llama-server-simulator.py - Standalone Python script simulating llama-server HTTP endpoint
  • test-simulator.sh - Test script for verifying simulator functionality
  • llama-server-simulator-plan.md - Implementation plan
  • simulator-summary.md - Summary of implementation

Features Implemented:

  1. HTTP Server - Flask-based /v1/chat/completions endpoint with OpenAI-compatible format
  2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
  3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
  4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
  5. Debug Logging - Helps troubleshoot matching issues

Testing Results:

  • Correct answers returned when the success-rate roll passes
  • Wrong answers returned when the success-rate roll fails
  • Requests with no matching question return an error
  • Success rate verified empirically (8/10 correct over 10 requests)
  • HuggingFace dataset caching working correctly

Key Technical Decisions:

  • Used Levenshtein distance for partial matching (threshold: 0.3)
  • Automatic caching via HuggingFace datasets library
  • Wrong answers generated by incrementing expected answer
  • Debug output written to stderr for better visibility
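
The matching decision above can be sketched as follows. The 0.3 threshold comes from the text; the distance function here is a plain stdlib implementation for illustration (the simulator may use a dedicated library instead):

```python
# Sketch of the question-matching idea: exact match first, then a
# normalized Levenshtein distance with a 0.3 threshold.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def matches(query: str, candidate: str, threshold: float = 0.3) -> bool:
    if query == candidate:
        return True
    longest = max(len(query), len(candidate)) or 1
    return levenshtein(query, candidate) / longest <= threshold
```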

Refactoring:

  • Extracted repeating question string into TEST_QUESTION variable
  • Created make_request() helper function to reduce code duplication
  • Added proper error handling for error responses
  • Fixed simulator stopping issue at script completion

llama-eval-new.py Implementation

Created:

  • llama-eval-new.py - Simplified evaluation tool focused on AIME

Features Implemented:

  1. Eval State Object - Structured dataclass with ID, tasks, task states, and sampling config
  2. Processor Object - Handles processing, grading, and state management
  3. Real-time Feedback - Shows correct/incorrect status for each case
  4. Flexible Grading System - Supports regex, CLI, and LLM-based grading
  5. Structured JSON Output - Saves complete eval state to JSON file
  6. HuggingFace Dataset Caching - Uses cached dataset path to avoid HF Hub requests
  7. Enhanced Answer Extraction - Extracts answers from full responses for display

Grading System:

  • Regex Grading: Built-in patterns for different task types
    • aime: \boxed{(\d+)}|\b(\d+)\b (handles boxed and plain text)
    • gsm8k: \b(\d+)\b (extract first number)
    • mmlu, hellaswag, arc, winogrande: [A-D] (extract single letter)
  • CLI Grading: External script interface
    • Script accepts --answer <pred> and --expected <gold>
    • Returns exit code 0 if correct, non-zero if incorrect
    • 30-second timeout to prevent hanging
  • LLM Judge: Generic answer extraction using LLM
    • Uses configured server and model for extraction
    • Includes problem statement in prompt for context
    • Case-insensitive comparison
    • Returns extracted answer for display

Configuration Options:

  • --server: llama-server URL (default: http://localhost:8033)
  • --n_cases: Number of cases to evaluate (default: all)
  • --n_predict: Max tokens to predict per prompt (default: 2048)
  • --threads: Number of threads for parallel requests (default: 32)
  • --verbose: Show detailed output for each case
  • --output: Output file for eval state (default: llama-eval-state.json)
  • --grader-type: regex, cli, or llm
  • --grader-regex-type: aime, gsm8k, mmlu, hellaswag, arc, winogrande
  • --grader-script: Path to CLI grader script
  • --judge-server: Server URL for LLM judge (default: same as main server)
  • --judge-model: Model name for LLM judge (default: same as main model)

Testing Results:

  • Works with simulator at 100% success rate (all correct)
  • Works with simulator at 0% success rate (all incorrect)
  • Works with simulator at 80% success rate (8/10 correct)
  • Real-time verbose output shows gold/pred/status for each case
  • JSON output contains complete eval state with all cases
  • HF Hub telemetry disabled (no warnings)
  • Uses cached dataset path to avoid HF Hub requests when available
  • Regex grader extracts answers correctly from various formats
  • LLM judge can extract answers with problem context
  • Response truncation focuses grading on final answer
  • Case-insensitive matching works for both regex and LLM grader
  • Judge model and server configuration propagate correctly
  • Progress table shows extracted answers instead of full responses

Key Technical Decisions:

  • Removed Levenshtein matching - eval script only sends requests and validates answers
  • Abstract grading interface for external grader support
  • Exact match requirement for regex patterns
  • Handles both boxed and plain text formats for AIME answers
  • 30-second timeout for CLI grader
  • Validates script exists before running
  • Judge parameters set once during Grader construction
  • LLM judge prompt includes problem statement for better extraction
  • Response truncation to last 2-3 lines focuses grading on final answer
  • Case-insensitive comparison for more flexible matching
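
The truncation decision above can be sketched as a one-liner: keep only the last few non-empty lines so the grader sees the final answer rather than the full chain of thought. The helper name mirrors the _truncate_response() mentioned above; the exact line count is an assumption:

```python
# Sketch of the response-truncation step applied before grading.
def truncate_response(response: str, keep_lines: int = 3) -> str:
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return "\n".join(lines[-keep_lines:])
```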

Refactoring:

  • Removed all task implementations except AIME
  • Removed regex-based grading (moved to flexible grader system)
  • Removed multiple endpoint support
  • Removed complex task loading logic
  • Removed summary reporting (replaced with real-time feedback)
  • Added HuggingFace dataset caching optimization
  • Added LLM grader support with configurable server and model
  • Added response truncation before grading
  • Refactored grader interface to return extracted answers

llama-eval-new.py Threading and Model Parameter Updates

Changes Made:

  1. Threading Support - Added ThreadPoolExecutor for parallel request processing

    • Added from concurrent.futures import ThreadPoolExecutor, as_completed
    • Created _process_single_case() method for thread-safe case processing
    • Refactored process() to use ThreadPoolExecutor with configurable thread count
    • Updated progress tracking to work with concurrent execution
    • Thread-safe eval state updates (task_states and counters)
  2. Model Parameter - Added --model argument to specify model name in request data

    • Added model_name parameter to Processor.__init__()
    • Updated _make_request() to use provided model name or default to "llama"
    • Added --model argument to argument parser
    • Model name is included in the request JSON, e.g. "model": "gpt-oss-20b-hf"
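
The threading pattern described above can be sketched like this. process_case is a stand-in for the real _process_single_case() that sends an HTTP request; the lock around the shared counter illustrates the thread-safe update mentioned in the text:

```python
# Sketch of the ThreadPoolExecutor-based processing loop.
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

def run_cases(cases, process_case, n_threads=32):
    correct = 0
    lock = threading.Lock()  # protects the shared counter

    def worker(case):
        nonlocal correct
        ok = process_case(case)
        with lock:
            if ok:
                correct += 1
        return ok

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(worker, c) for c in cases]
        for fut in as_completed(futures):
            fut.result()  # re-raise any worker exception here
    return correct
```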

Testing Results:

  • Works with 2 threads (5 cases processed in ~0.2s)
  • Works with 4 threads (slightly faster throughput)
  • Model parameter correctly added to request data
  • Thread-safe progress tracking with tqdm
  • No race conditions in eval state updates

Key Technical Decisions:

  • Used ThreadPoolExecutor for simple, effective parallelism
  • No rate limiting needed (server can handle concurrent requests)
  • Thread-safe counter updates for correct/total tracking
  • Progress bar shows completion status across all threads
  • Model parameter is optional - defaults to "llama" if not specified

Refactoring:

  • Extracted single case processing into _process_single_case() method
  • Changed from sequential loop to ThreadPoolExecutor with futures
  • Updated verbose output to show total count instead of index
  • Made eval state updates thread-safe

llama-eval-new.py Enhanced Grading System

Changes Made:

  1. Enhanced Grader Interface - Updated to return extracted answers

    • grade() method now returns Tuple[bool, Optional[str]] (correctness + extracted answer)
    • Added extracted field to TaskState dataclass
    • All grader types (regex, cli, llm) now return extracted answers
  2. Improved Regex Grader

    • New _extract_answer_regex() method extracts answers using configured patterns
    • Supports case-insensitive matching
    • Returns first valid match found
    • Handles both single values and multiple matches
  3. LLM-Based Judge

    • New _grade_llm() method for generic answer extraction
    • Includes problem statement in prompt for context
    • Configurable server URL (defaults to main server)
    • Configurable model name (defaults to main model)
    • Case-insensitive comparison
    • Returns extracted answer for display
  4. Response Truncation

    • New _truncate_response() method keeps only last 2-3 lines
    • Applied before grading to focus on final answer section
  5. CLI Grader Update

    • Now also returns extracted answer
    • Returns None if grading fails
  6. Display Updates

    • Progress table shows extracted answer instead of full response
    • Verbose mode shows full response plus extracted answer
  7. New CLI Arguments

    • --grader-type: Added "llm" option
    • --judge-server: Separate server for LLM judge
    • --judge-model: Separate model for LLM judge

Testing Results:

  • Regex grader extracts answers correctly from various formats
  • LLM judge can extract answers with problem context
  • Response truncation focuses grading on final answer
  • Case-insensitive matching works for both regex and LLM grader
  • Judge model and server configuration propagate correctly
  • Progress table shows extracted answers instead of full responses

Key Technical Decisions:

  • Judge parameters set once during Grader construction (not on each call)
  • LLM judge prompt includes problem statement for better extraction
  • Response truncation to last 2-3 lines focuses grading on final answer
  • Case-insensitive comparison for more flexible matching
  • Judge configuration propagates through Processor to Grader
  • Display shows extracted answer for cleaner output

Refactoring:

  • Removed judge parameters from grade() method calls
  • Added judge_server_url and judge_model_name to Grader class
  • Updated _grade_llm() to use instance variables instead of parameters
  • Simplified Processor initialization to pass judge config to grader
  • Updated startup info to show judge server and model

llama-eval-new.py GSM8K Dataset Support

Changes Made:

  1. GSM8K Dataset Integration - Added support for GSM8K dataset alongside AIME

    • Created Gsm8kDataset class with proper answer extraction logic
    • GSM8K uses "question" field instead of "problem" field
    • GSM8K answer field contains full reasoning with #### prefix
    • Extracts numeric answer from answer field during initialization
    • Uses same regex grader pattern as AIME (\b(\d+)\b)
  2. Dataset Type Configuration - Added dataset selection support

    • Added --dataset CLI argument with choices aime and gsm8k
    • Updated Processor class to accept dataset_type parameter
    • Dataset-specific initialization in Processor.__init__()
    • Dataset name displayed in task summary table
  3. Template Registry - Added dataset-specific prompt templates

    • AIME template: includes \boxed{} wrapper for final answer
    • GSM8K template: plain text answer without wrapper
    • Templates applied based on question["dataset_type"] field
  4. Answer Extraction Logic - Fixed GSM8K answer extraction

    • GSM8K has pre-extracted "gold" field with numeric answer
    • Gsm8kDataset.get_answer() checks for "gold" field first
    • Falls back to answer field if gold field not present
    • AimeDataset.get_answer() simplified to remove duplicate method
  5. Task ID Format - Fixed duplicate prefix in task IDs

    • Changed from f"{dataset_type}_{eval_state.id}_{chunk_idx:03d}_{i:03d}"
    • To f"{dataset_type}_{chunk_idx:03d}_{i:03d}"
    • Removed redundant eval_state.id (was "gsm8k" for GSM8K)
  6. Column Width Adjustments - Improved table formatting

    • Task ID column: 25 characters
    • Dataset column: 5 characters
    • Prompt column: 40 characters
    • Expected column: 10 characters

Testing Results:

  • GSM8K dataset loads correctly with 7473 questions
  • Numeric answers extracted from full reasoning text
  • Task summary table displays correctly with adjusted column widths
  • Task IDs show correct format (e.g., gsm8k_000_3169)
  • Both AIME and GSM8K datasets work with same script
  • Answer extraction works for both boxed and plain text formats
  • Progress tracking shows extracted answers for both datasets

Key Technical Decisions:

  • GSM8K uses "question" field instead of "problem" field
  • GSM8K answer field contains full reasoning with #### prefix
  • Numeric answer extracted during dataset initialization
  • Same regex grader pattern works for both datasets
  • Dataset selection via CLI argument for separate runs
  • Template registry supports different prompt formats per dataset
  • Task ID format simplified to avoid duplication

Refactoring:

  • Removed duplicate get_question() method from AimeDataset
  • Removed "2025" suffix from eval state ID (a remnant of the old version)
  • Removed "2025" suffix from task summary table output
  • Removed "2025" suffix from progress tracking output
  • Updated Processor.__init__() to initialize appropriate dataset based on type
  • Updated _process_single_case() to handle both "problem" and "question" fields
  • Updated process() method to display dataset name and use dataset_type for task states