llama-eval Implementation Summary

Overview

Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).

Key Features

  • Multiple Datasets: AIME, GSM8K, GPQA with proper answer extraction
  • Flexible Grading: Regex, CLI, or LLM-based grading
  • Parallel Processing: Configurable thread count for concurrent requests
  • Sampling Parameters: Temperature, Top K, Top P, Min P (optional)
  • Real-time Feedback: Progress tracking with detailed output
  • JSON Output: Complete eval state saved for debugging
  • GPQA Support: Answer shuffling with reproducible results

Architecture

Eval State

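from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class EvalState:
    id: str                                 # identifier of this eval run
    tasks: List[str]                        # tasks included in the run
    task_states: Dict[str, Dict[str, Any]]  # per-task state (prompts, answers, correctness)
    sampling_config: Dict[str, Any]         # sampling parameters used for the run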

Processor

  • Handles processing, grading, and state management
  • Thread-safe concurrent execution
  • Configurable sampling parameters
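
The thread-safe execution described above can be sketched roughly as follows. This is a minimal illustration, assuming a worker pool and a lock around the shared results; `process_all` and `solve` are hypothetical names, not functions from the tool itself.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def process_all(
    task_ids: List[str],
    solve: Callable[[str], Dict[str, Any]],
    threads: int = 32,
) -> Dict[str, Dict[str, Any]]:
    """Run `solve` for every task on a thread pool and collect the results."""
    task_states: Dict[str, Dict[str, Any]] = {}
    lock = threading.Lock()

    def worker(task_id: str) -> None:
        result = solve(task_id)  # e.g. send a request and grade the response
        with lock:  # serialize writes to the shared dict
            task_states[task_id] = result

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() drains the iterator so that worker exceptions surface here
        list(pool.map(worker, task_ids))
    return task_states
```

Keeping the slow work (`solve`) outside the lock means threads only contend for the brief moment when a result is stored.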

Grader

  • Abstract grading interface supporting multiple types
  • Regex grader with dataset-specific patterns
  • CLI grader with external script interface
  • LLM grader with configurable server and model
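
One way such a grading interface could be laid out is shown below. It is a sketch only: the class names, the `grade` method, and the `\boxed{...}` pattern are assumptions made for illustration, not the tool's built-in patterns.

```python
import re
from abc import ABC, abstractmethod
from typing import Optional


class Grader(ABC):
    """Hypothetical base class: decide whether a response matches the expected answer."""

    @abstractmethod
    def grade(self, response: str, expected: str) -> bool:
        ...


class RegexGrader(Grader):
    """Extract an answer with a regular expression and compare it to the expected one."""

    def __init__(self, pattern: str) -> None:
        self.pattern = re.compile(pattern)

    def extract(self, response: str) -> Optional[str]:
        matches = self.pattern.findall(response)
        # Use the last match, since the final answer usually comes at the end.
        return matches[-1] if matches else None

    def grade(self, response: str, expected: str) -> bool:
        extracted = self.extract(response)
        return extracted is not None and extracted.strip() == expected.strip()


# Illustrative pattern (an assumption): an integer inside \boxed{...}
grader = RegexGrader(r"\\boxed\{(\d+)\}")
```

A CLI or LLM grader would implement the same `grade` method, so the processor does not need to know which type it is calling.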

Datasets

  • AimeDataset: 90 AIME questions
  • Aime2025Dataset: 30 AIME 2025 I & II questions
  • Gsm8kDataset: 7473 math word problems
  • GpqaDataset: 198 GPQA Diamond questions with shuffling
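
Reproducible answer shuffling for GPQA might look like the sketch below. It assumes one correct and three incorrect options per question and derives a per-question random generator from the run seed; the function name and the seed-mixing formula are hypothetical.

```python
import random
from typing import List, Tuple


def shuffle_choices(
    correct: str, incorrect: List[str], seed: int, index: int
) -> Tuple[List[str], str]:
    """Shuffle answer options and return them with the letter of the correct one."""
    options = [correct] + list(incorrect)
    # A per-question generator keeps the order independent of thread scheduling.
    rng = random.Random(seed * 1_000_003 + index)
    rng.shuffle(options)
    letter = "ABCD"[options.index(correct)]
    return options, letter
```

Calling `shuffle_choices` twice with the same seed and question index yields the same option order, so the expected letter stays stable across runs.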

Configuration

Sampling Parameters (Optional)

  • --temperature: Sampling temperature
  • --top-k: Top K sampling
  • --top-p: Top P sampling
  • --min-p: Min P sampling
  • Only passed if explicitly specified
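
The "only passed if explicitly specified" behaviour can be illustrated with a small helper. This is a sketch under the assumption that unset options are represented as `None`; the function name and the dictionary keys are illustrative.

```python
from typing import Any, Dict, Optional


def build_sampling_config(
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    min_p: Optional[float] = None,
) -> Dict[str, Any]:
    """Return only the sampling parameters that were explicitly set."""
    candidates = {
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "min_p": min_p,
    }
    # Compare against None so that explicit zeros (e.g. temperature=0) are kept.
    return {key: value for key, value in candidates.items() if value is not None}
```

Omitting unset keys lets the server apply its own defaults instead of having the client override them.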

Grading Types

  • regex: Built-in patterns for each dataset
  • cli: External script with --answer and --expected args
  • llm: LLM-based extraction with few-shot examples and configurable server/model
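
For the cli type, invoking an external script with `--answer` and `--expected` could be sketched as follows. How the script reports its verdict is not stated here, so this example assumes exit code 0 means "correct"; the function name and the timeout are also hypothetical.

```python
import subprocess
from typing import List


def grade_with_cli(
    command: List[str], answer: str, expected: str, timeout: float = 30.0
) -> bool:
    """Run an external grader; exit code 0 is treated as "correct" (an assumption)."""
    result = subprocess.run(
        command + ["--answer", answer, "--expected", expected],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```

Passing the arguments as a list avoids shell quoting problems when answers contain spaces or special characters.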

Dataset Requirements

  • AIME: Supports regex, CLI, or LLM grader
  • AIME2025: Supports regex, CLI, or LLM grader
  • GSM8K: Supports regex, CLI, or LLM grader
  • GPQA: Requires LLM grader
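
These requirements could be enforced up front with a small lookup table. The sketch below is illustrative: the lowercase dataset keys and the `check_grader` helper are assumptions, not identifiers from the tool.

```python
from typing import Dict, Set

# Grader types each dataset supports, as listed above.
SUPPORTED_GRADERS: Dict[str, Set[str]] = {
    "aime": {"regex", "cli", "llm"},
    "aime2025": {"regex", "cli", "llm"},
    "gsm8k": {"regex", "cli", "llm"},
    "gpqa": {"llm"},
}


def check_grader(dataset: str, grader_type: str) -> None:
    """Raise ValueError if the dataset does not support the grader type."""
    if dataset not in SUPPORTED_GRADERS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    allowed = SUPPORTED_GRADERS[dataset]
    if grader_type not in allowed:
        raise ValueError(
            f"dataset {dataset!r} supports only: {', '.join(sorted(allowed))}"
        )
```

Failing before any request is sent avoids spending a full run on a configuration that cannot be graded.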

Output Format

Progress Table

  Task ID             Dataset  Prompt (first 43 chars)                        Expected    Status
  gpqa_000_001         GPQA   Complete the following reactions and sel...    A          pending

Results Summary

============================================================
Results: 8/10 correct (80.0%)
============================================================
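
A summary in this shape can be produced with a few lines of formatting code. The helper below is a sketch; its name and the `width` parameter are illustrative.

```python
def format_summary(correct: int, total: int, width: int = 60) -> str:
    """Format the final results block with a rule above and below."""
    rule = "=" * width
    # Guard against division by zero when no tasks were run.
    percent = 100.0 * correct / total if total else 0.0
    return f"{rule}\nResults: {correct}/{total} correct ({percent:.1f}%)\n{rule}"
```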

JSON Output

Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
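
Saving and reloading the state might be done as in the sketch below, which reuses the `EvalState` fields from the Architecture section. The helper names are hypothetical, and the keys stored inside `task_states` are left open because they are not specified here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]]
    sampling_config: Dict[str, Any]


def dump_state(state: EvalState) -> str:
    """Serialize the full eval state to a JSON string."""
    return json.dumps(asdict(state), indent=2)


def load_state(text: str) -> EvalState:
    """Rebuild an EvalState from JSON produced by dump_state."""
    return EvalState(**json.loads(text))
```

Because every field is a plain string, list, or dict, the state round-trips through JSON without custom encoders.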

Technical Details

  • Default max tokens: -1 (no limit)
  • Default grader type: llm
  • Default seed: 1234
  • Default threads: 32
  • Prompt truncation: First 43 chars + padding + "..."
  • Response truncation: Last 10 lines for grading
  • GPQA requires the LLM grader, which returns a letter (A/B/C/D)
  • The judge model defaults to the evaluated model if not specified
  • Sample answers are defined in the SAMPLE_ANSWERS dict and used as few-shot examples
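
The two truncation rules can be sketched as follows. The prompt rule is interpreted here as a fixed 43-character column that includes the trailing "...", which matches the example row in the progress table; that reading, and both function names, are assumptions.

```python
def truncate_prompt(prompt: str, width: int = 43) -> str:
    """Collapse the prompt to a single line of exactly `width` characters."""
    flat = " ".join(prompt.split())  # drop newlines and repeated spaces
    if len(flat) <= width:
        return flat.ljust(width)  # pad short prompts so columns line up
    return flat[: width - 3] + "..."


def last_lines(response: str, count: int = 10) -> str:
    """Keep only the final `count` lines of a response for grading."""
    lines = response.splitlines()
    return "\n".join(lines[-count:]) if count > 0 else ""
```

Grading only the tail of a response keeps the grader focused on the final answer rather than on intermediate reasoning.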