llama.cpp/examples/llama-eval

README.md

llama-eval Evaluation Tool

Simple evaluation tool for llama.cpp with support for multiple datasets.

Features

  • Multiple Datasets: AIME, GSM8K, GPQA
  • Flexible Grading: Regex, CLI, or LLM-based grading
  • Parallel Processing: Configurable thread count
  • Real-time Feedback: Progress tracking with detailed output
  • Sampling Parameters: Temperature, Top K, Top P, Min P
  • JSON Output: Complete eval state saved for debugging

Usage

python llama-eval.py \
  --server http://127.0.0.1:8013 \
  --model gpt-oss-20b-hf-low \
  --judge-model gpt-oss-20b-hf-medium \
  --dataset aime \
  --n_cases 10 \
  --grader-type llm \
  --seed 42

CLI Arguments

  • --server: llama-server URL (default: http://127.0.0.1:8013)
  • --model: Model name for evaluation (default: llama)
  • --judge-model: Model name for LLM judge (default: same as main model)
  • --judge-server: Server URL for LLM judge (default: same as main server)
  • --dataset: Dataset type (aime, aime2025, gsm8k, gpqa)
  • --n_cases: Number of cases to evaluate (default: all)
  • --n_predict: Max tokens to predict per prompt (default: -1, infinite)
  • --temperature: Sampling temperature (default: not passed)
  • --top-k: Top K sampling (default: not passed)
  • --top-p: Top P sampling (default: not passed)
  • --min-p: Min P sampling (default: not passed)
  • --threads: Number of threads for parallel requests (default: 32)
  • --verbose: Show detailed output for each case
  • --output: Output file for eval state (default: llama-eval-state.json)
  • --grader-type: Grader type (regex, cli, or llm; default: llm)
  • --grader-script: Path to CLI grader script (required for --grader-type cli)
  • --seed: Random seed for shuffling (default: 1234)

Datasets

AIME

  • 90 questions from past AIME competitions
  • Answers in boxed format: \boxed{answer}
  • Requires regex grader or LLM grader

AIME2025

  • 30 questions from 2025 AIME I & II competitions
  • Answers in boxed format: \boxed{answer}
  • Supports regex, CLI, or LLM grader

GSM8K

  • 7473 math word problems
  • Answers are numeric values
  • Requires regex grader or LLM grader

GPQA

  • 198 questions from GPQA Diamond dataset
  • Multiple choice with shuffled options
  • Requires LLM grader (returns letter A, B, C, or D)
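
The combination of shuffled options and the --seed flag suggests deterministic, seeded shuffling. A minimal sketch of how that could work (the function name and shape are illustrative, not the tool's actual code):

```python
import random

# Hypothetical seeded option shuffle for GPQA-style multiple choice.
# Given the same seed, the shuffle (and hence the gold letter) is reproducible.
def shuffle_options(options, correct_index, seed):
    rng = random.Random(seed)               # deterministic per --seed
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    letter = "ABCD"[order.index(correct_index)]  # letter the gold option landed on
    return shuffled, letter
```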

Grading Types

Regex Grader

Built-in patterns for different datasets:

  • AIME: \\boxed\{(\d+)\}|\b(\d+)\b
  • AIME2025: \\boxed\{(\d+)\}|\b(\d+)\b
  • GSM8K: \b(\d+)\b
  • GPQA: Letter extraction (A, B, C, D)
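
A short sketch of how the boxed-answer patterns above can be applied in Python (note the escaping required in source; the fallback-to-last-integer behavior here is an assumption, not the tool's exact logic):

```python
import re

# Prefer an explicit \boxed{...} answer; otherwise fall back to the
# last standalone integer in the completion.
BOXED = re.compile(r"\\boxed\{(\d+)\}")
NUMBER = re.compile(r"\b(\d+)\b")

def extract_answer(text):
    m = BOXED.search(text)
    if m:
        return m.group(1)
    nums = NUMBER.findall(text)
    return nums[-1] if nums else None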

CLI Grader

External script interface:

./grader.sh --answer <pred> --expected <gold>

Returns exit code 0 if correct, non-zero if incorrect.
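
A grader honoring this interface can be written in any language. A minimal Python equivalent (the normalization here is an assumption; your grader defines its own notion of correctness):

```python
#!/usr/bin/env python3
"""Hypothetical CLI grader: exit 0 if the answer matches, 1 otherwise."""
import argparse
import sys

def is_correct(pred, gold):
    # Normalize whitespace and case before an exact comparison.
    return pred.strip().lower() == gold.strip().lower()

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--answer", required=True)
    parser.add_argument("--expected", required=True)
    args = parser.parse_args(argv)
    return 0 if is_correct(args.answer, args.expected) else 1

if __name__ == "__main__":
    sys.exit(main())
```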

LLM Grader

Uses LLM to extract and compare answers:

  • Configurable server and model
  • Includes few-shot examples from sample answers
  • Case-insensitive comparison
  • Required for GPQA dataset

Output

Progress Table

  Task ID             Dataset  Prompt (first 43 chars)                        Expected    Status
  aime_000_001         AIME   Complete the following reactions and sel...    A          pending

Results

============================================================
Results: 8/10 correct (80.0%)
============================================================

JSON Output

The complete eval state is saved to the output file, including:

  • Task IDs and correctness status
  • Prompts and extracted answers
  • Sampling configuration
  • Processing metadata