llama.cpp/examples/llama-eval/llama-eval-discussion.md

llama-eval Implementation Discussion

Overview

Notes on implementing a lean evaluation tool for llama.cpp, based on ggerganov's feedback in PR #18892.

Key Requirements from ggerganov

1. Simplify and Focus on One Eval

  • Start with AIME2025 (most familiar with it)
  • Don't support multiple evals initially

2. Implement an "eval state" object

  • ID
  • List of tasks
  • Task states
  • Sampling config
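
The eval state fields above can be sketched as a small dataclass hierarchy. This is a hypothetical layout, not the PR's final design; field names like `predicted` and `status` are illustrative.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TaskState:
    """Per-task record; field names here are illustrative, not from the PR."""
    prompt: str
    expected: str
    predicted: str = ""
    status: str = "pending"  # pending | running | correct | incorrect

@dataclass
class EvalState:
    """Eval state object: ID, list of tasks (with states), sampling config."""
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    tasks: list = field(default_factory=list)  # list of TaskState
    sampling: dict = field(default_factory=lambda: {"temperature": 0.6, "top_p": 0.95})

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so TaskState serializes too
        return json.dumps(asdict(self), indent=2)

state = EvalState(tasks=[TaskState(prompt="What is 1 + 1?", expected="2")])
```

Keeping the state a plain serializable object means the processor can dump it to disk at any point without extra bookkeeping.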

3. Implement a "processor" object

  • List of endpoints
  • Threads per endpoint
  • Grade/judge type (regex, endpoint, or CLI tool)

4. Processor responsibilities

  • Accepts eval state
  • Starts processing
  • Dumps eval state periodically as it progresses
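
The processor responsibilities above could look roughly like this sketch: it accepts an eval state, fans tasks out over a thread pool sized by endpoints x threads-per-endpoint, and dumps the state periodically as tasks complete. The `send()` callable is a stand-in for the real HTTP request.

```python
import concurrent.futures
import json
import os
import tempfile
import threading

class Processor:
    """Processor sketch: a list of endpoints, N threads per endpoint, and a
    periodic dump of the eval state as tasks complete."""

    def __init__(self, endpoints, threads_per_endpoint=4, dump_every=8,
                 dump_path="eval-state.json"):
        self.endpoints = endpoints
        self.max_workers = threads_per_endpoint * max(1, len(endpoints))
        self.dump_every = dump_every
        self.dump_path = dump_path
        self._lock = threading.Lock()
        self._done = 0

    def run(self, state, send):
        with concurrent.futures.ThreadPoolExecutor(self.max_workers) as pool:
            for i, task in enumerate(state["tasks"]):
                endpoint = self.endpoints[i % len(self.endpoints)]
                pool.submit(self._process, state, task, endpoint, send)
        self._dump(state)  # final dump once every task has finished

    def _process(self, state, task, endpoint, send):
        task["predicted"] = send(endpoint, task["prompt"])
        with self._lock:
            self._done += 1
            if self._done % self.dump_every == 0:
                self._dump(state)

    def _dump(self, state):
        with open(self.dump_path, "w") as f:
            json.dump(state, f, indent=2)

# Demo with a stubbed send(); the real tool would POST to /v1/chat/completions.
demo_path = os.path.join(tempfile.gettempdir(), "eval-state-demo.json")
state = {"id": "demo", "tasks": [{"prompt": "Q1"}, {"prompt": "Q2"}, {"prompt": "Q3"}]}
proc = Processor(endpoints=["http://localhost:8033"], threads_per_endpoint=2,
                 dump_every=2, dump_path=demo_path)
proc.run(state, send=lambda endpoint, prompt: "stub answer")
```

Because tasks mutate the shared state dict in place, the periodic dump always reflects whatever has completed so far, which is what enables resume-from-checkpoint later.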

5. Real-time feedback

  • Default: show "correct / not correct" for each task
  • Verbose mode: show produced answer vs expected answer as soon as it completes

6. Grading approach

  • Abstract grading to support external "grader" or "judge"
  • Use LLM post-processing instead of regex (to avoid the answer-extraction issues seen in the GPT-OSS evals)
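
The grading abstraction could be a minimal interface with interchangeable implementations, e.g. a regex baseline and an LLM judge. Class names and the judge prompt below are illustrative; `ask` stands in for a chat-completions call.

```python
import re

class Grader:
    """Abstract grader/judge interface; concrete graders implement grade()."""
    def grade(self, predicted: str, expected: str) -> bool:
        raise NotImplementedError

class RegexGrader(Grader):
    """Regex baseline (the approach the feedback suggests moving away from)."""
    def __init__(self, pattern=r"\\boxed\{(\d+)\}"):
        self.pattern = re.compile(pattern)

    def grade(self, predicted, expected):
        m = self.pattern.search(predicted)
        return m is not None and m.group(1) == expected

class JudgeGrader(Grader):
    """LLM-as-judge grader: asks an endpoint whether the answer matches.
    `ask` is a stand-in for a request to a judge endpoint."""
    def __init__(self, ask):
        self.ask = ask

    def grade(self, predicted, expected):
        verdict = self.ask(
            f"Does the following answer equal {expected}? Reply YES or NO.\n{predicted}")
        return verdict.strip().upper().startswith("YES")
```

The processor only ever calls `grade()`, so swapping regex for an LLM judge (or a CLI tool) does not touch the processing loop.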

7. Output format

  • Use structured output (JSON) instead of boxed text
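
A sketch of what structured output could look like on the wire: the request asks the model to reply as JSON, and grading becomes a field lookup instead of a regex over free text. llama-server exposes an OpenAI-compatible API; treat the exact `response_format` support as an assumption of this sketch.

```python
import json

# Hypothetical /v1/chat/completions request body asking for JSON output.
request_body = {
    "messages": [
        {"role": "user",
         "content": 'Solve the problem. Reply as JSON with an "answer" field.'},
    ],
    "response_format": {"type": "json_object"},  # assumption: endpoint honors this
    "max_tokens": 2048,
}

# Grading then reads a field instead of parsing \boxed{...} text:
response_text = '{"answer": "204"}'  # example model reply
answer = json.loads(response_text).get("answer")
```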

Current Implementation Analysis

What exists in llama-eval.py:

  • Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
  • Regex-based answer extraction
  • HTTP requests to OpenAI-compatible endpoint
  • Checkpointing/resume capability
  • Thread-based parallel execution
  • Summary reporting

What needs to be removed:

  • All task implementations except AIME
  • Regex-based grading
  • Multiple endpoint support
  • Complex task loading logic
  • Summary reporting (replace with real-time feedback)

Discussion Points

1. Eval State Object Structure

Status: Under Discussion

Questions:

  • What fields should be in the eval state object?
  • Should it include the actual prompts, or just metadata?
  • How should task states be tracked?

2. Processor Architecture

Status: Not Started

Questions:

  • Should the processor handle multiple endpoints (for distributed evaluation)?
  • What's the threading model?
  • How are endpoints configured?

3. Grader Interface

Status: Not Started

Questions:

  • How should the grader be configured?
  • Should it be a separate service, or a local LLM call?
  • What's the interface for grading?

4. Checkpointing

Status: Not Started

Questions:

  • Should the eval state be serialized to disk?
  • How often should it be dumped?
  • What format should it use?
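
Whatever cadence and format are chosen, the dump itself should be atomic so an interrupted write never leaves a truncated checkpoint. A minimal sketch, assuming JSON on disk:

```python
import json
import os
import tempfile

def dump_eval_state(state: dict, path: str) -> None:
    """Serialize the eval state as JSON atomically: write to a temp file in
    the same directory, then rename over the target, so a crash mid-write
    never corrupts an existing checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp, path)  # atomic on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise

checkpoint = os.path.join(tempfile.gettempdir(), "llama-eval-checkpoint.json")
dump_eval_state({"id": "demo", "tasks": []}, checkpoint)
```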

5. Real-time Output

Status: Not Started

Questions:

  • How should progress be displayed?
  • Console output, file logging, or both?
  • What verbosity levels are needed?

6. Output Format

Status: Not Started

Questions:

  • Should responses be in JSON format?
  • How should the grader interface work with JSON output?

Next Steps

  1. Eval State Object - Currently discussing
  2. Processor Architecture
  3. Grader Interface
  4. Checkpointing
  5. Real-time Output
  6. Output Format

Session Work Summary

llama-server-simulator Implementation

Created:

  • llama-server-simulator.py - Standalone Python script simulating llama-server HTTP endpoint
  • test-simulator.sh - Test script for verifying simulator functionality
  • llama-server-simulator-plan.md - Implementation plan
  • simulator-summary.md - Summary of implementation

Features Implemented:

  1. HTTP Server - Flask-based /v1/chat/completions endpoint with OpenAI-compatible format
  2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
  3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
  4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
  5. Debug Logging - Helps troubleshoot matching issues

Testing Results:

  • Correct answers returned when success rate allows
  • Wrong answers returned when success rate doesn't allow
  • Requests with no matching question return an error
  • Configured success rate verified (8 of 10 requests answered correctly at 0.8)
  • HuggingFace dataset caching working correctly

Key Technical Decisions:

  • Used Levenshtein distance for partial matching (threshold: 0.3)
  • Automatic caching via HuggingFace datasets library
  • Wrong answers generated by incrementing expected answer
  • Debug output written to stderr for better visibility
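
The partial-matching decision can be sketched as a normalized edit distance with the 0.3 threshold mentioned above (pure-Python Levenshtein; the actual simulator may use a library implementation).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def matches(query: str, candidate: str, threshold: float = 0.3) -> bool:
    """Accept a candidate question when the edit distance, normalized by the
    longer string's length, is at or below the threshold."""
    denom = max(len(query), len(candidate)) or 1
    return levenshtein(query, candidate) / denom <= threshold
```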

Refactoring:

  • Extracted repeating question string into TEST_QUESTION variable
  • Created make_request() helper function to reduce code duplication
  • Added proper handling of error responses
  • Fixed simulator stopping issue at script completion

llama-eval-new.py Implementation

Created:

  • llama-eval-new.py - Simplified evaluation tool focused on AIME

Features Implemented:

  1. Eval State Object - Structured dataclass with ID, tasks, task states, and sampling config
  2. Processor Object - Handles processing, grading, and state management
  3. Real-time Feedback - Shows correct/incorrect status for each case
  4. Flexible Grading System - Supports regex and CLI-based grading
  5. Structured JSON Output - Saves complete eval state to JSON file
  6. HuggingFace Dataset Caching - Uses cached dataset path to avoid HF Hub requests

Grading System:

  • Regex Grading: Built-in patterns for different task types
    • aime: \boxed{(\d+)}|\b(\d+)\b (handles boxed and plain text)
    • gsm8k: \b(\d+)\b (extract first number)
    • mmlu, hellaswag, arc, winogrande: [A-D] (extract single letter)
  • CLI Grading: External script interface
    • Script accepts --answer <pred> and --expected <gold>
    • Returns exit code 0 if correct, non-zero if incorrect
    • 30-second timeout to prevent hanging
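
For the aime pattern specifically, one subtlety of the single alternation `\boxed{(\d+)}|\b(\d+)\b` is that a plain number appearing before the boxed answer would win. A two-step extraction that prefers the boxed form avoids this; the helper below is an illustrative sketch, not the tool's exact code.

```python
import re

def extract_aime_answer(text: str):
    """Prefer a \\boxed{N} answer anywhere in the text; fall back to the
    first plain integer; return None if neither form is present."""
    boxed = re.search(r"\\boxed\{(\d+)\}", text)
    if boxed:
        return boxed.group(1)
    plain = re.search(r"\b(\d+)\b", text)
    return plain.group(1) if plain else None
```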

Configuration Options:

  • --server: llama-server URL (default: http://localhost:8033)
  • --n_cases: Number of cases to evaluate (default: all)
  • --n_predict: Max tokens to predict per prompt (default: 2048)
  • --threads: Number of threads for parallel requests (default: 32)
  • --verbose: Show detailed output for each case
  • --output: Output file for eval state (default: llama-eval-state.json)
  • --grader-type: regex or cli
  • --grader-regex-type: aime, gsm8k, mmlu, hellaswag, arc, winogrande
  • --grader-script: Path to CLI grader script
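
The option list above maps directly onto an argparse parser; this mirror uses the stated defaults, with the `--grader-type` default being an assumption.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options and defaults."""
    p = argparse.ArgumentParser(prog="llama-eval-new")
    p.add_argument("--server", default="http://localhost:8033")
    p.add_argument("--n_cases", type=int, default=None)   # None = all cases
    p.add_argument("--n_predict", type=int, default=2048)
    p.add_argument("--threads", type=int, default=32)
    p.add_argument("--verbose", action="store_true")
    p.add_argument("--output", default="llama-eval-state.json")
    p.add_argument("--grader-type", choices=["regex", "cli"],
                   default="regex")  # default is an assumption
    p.add_argument("--grader-regex-type",
                   choices=["aime", "gsm8k", "mmlu", "hellaswag",
                            "arc", "winogrande"], default="aime")
    p.add_argument("--grader-script", default=None)
    return p
```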

Testing Results:

  • Works with simulator at 100% success rate (all correct)
  • Works with simulator at 0% success rate (all incorrect)
  • Works with simulator at 80% success rate (8/10 correct)
  • Real-time verbose output shows gold/pred/status for each case
  • JSON output contains complete eval state with all cases
  • HF Hub telemetry disabled (no warnings)
  • Uses cached dataset path to avoid HF Hub requests when available

Key Technical Decisions:

  • Removed Levenshtein matching; the eval script only sends requests and validates answers
  • Abstract grading interface for external grader support
  • Exact match requirement for regex patterns
  • Handles both boxed and plain text formats for AIME answers
  • 30-second timeout for CLI grader
  • Validates script exists before running
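
The CLI-grader invocation with its 30-second timeout can be sketched as follows; the demo grader written to disk at the end is purely for self-containment and is not part of the tool.

```python
import os
import subprocess
import sys
import tempfile

def grade_with_cli(grader_cmd, predicted, expected, timeout=30.0):
    """Run an external grader with --answer/--expected flags; exit code 0
    means correct. The timeout guards against a hung grader, which is
    treated as incorrect rather than blocking the whole eval."""
    try:
        result = subprocess.run(
            list(grader_cmd) + ["--answer", predicted, "--expected", expected],
            capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Tiny demo grader written to disk so this sketch is runnable end to end:
# it exits 0 exactly when --answer equals --expected.
demo_grader = os.path.join(tempfile.gettempdir(), "demo_grader.py")
with open(demo_grader, "w") as f:
    f.write("import sys\n"
            "a = sys.argv[sys.argv.index('--answer') + 1]\n"
            "e = sys.argv[sys.argv.index('--expected') + 1]\n"
            "sys.exit(0 if a == e else 1)\n")
```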

Refactoring:

  • Removed all task implementations except AIME
  • Removed regex-based grading (moved to flexible grader system)
  • Removed multiple endpoint support
  • Removed complex task loading logic
  • Removed summary reporting (replaced with real-time feedback)
  • Added HuggingFace dataset caching optimization