
llama-eval Implementation Summary

Overview

llama-eval is a simple evaluation tool for llama.cpp. It supports multiple datasets (AIME, GSM8K, GPQA) and three grading modes (regex, CLI, LLM).

Key Features

Multiple Datasets: AIME, GSM8K, GPQA with proper answer extraction
Flexible Grading: Regex, CLI, or LLM-based grading
Parallel Processing: Configurable thread count for concurrent requests
Sampling Parameters: Temperature, Top K, Top P, Min P (optional)
Real-time Feedback: Progress tracking with detailed output
JSON Output: Complete eval state saved for debugging
GPQA Support: Answer shuffling with reproducible results

Architecture

Eval State

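from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class EvalState:
    id: str                                 # identifier of the eval run
    tasks: List[str]                        # task names or IDs in this run
    task_states: Dict[str, Dict[str, Any]]  # per-task state, keyed by task
    sampling_config: Dict[str, Any]         # sampling parameters in use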

Processor

Handles processing, grading, and state management
Thread-safe concurrent execution
Configurable sampling parameters
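
As an illustration of the thread-safe concurrent execution described above, here is a minimal sketch using a thread pool and a lock around the shared state. The names `run_tasks` and `solve` are hypothetical and not taken from llama-eval; `solve` stands in for whatever sends one request and grades the response.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def run_tasks(
    task_ids: List[str],
    solve: Callable[[str], Dict[str, Any]],
    n_threads: int = 32,
) -> Dict[str, Dict[str, Any]]:
    """Run solve(task_id) on a thread pool and collect per-task results."""
    task_states: Dict[str, Dict[str, Any]] = {}
    lock = threading.Lock()  # guards writes to the shared task_states dict

    def worker(task_id: str) -> None:
        result = solve(task_id)  # hypothetical: one request plus grading
        with lock:
            task_states[task_id] = result

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() drains the iterator so exceptions raised in workers surface here
        list(pool.map(worker, task_ids))
    return task_states


# Stand-in solver that marks every task as correct.
states = run_tasks(["t1", "t2", "t3"], lambda tid: {"correct": True}, n_threads=2)
```

The explicit lock keeps the sketch correct even if the per-task update grows beyond a single dict assignment.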

Grader

Abstract grading interface supporting multiple types
Regex grader with dataset-specific patterns
CLI grader with external script interface
LLM grader with configurable server and model
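
One way to picture the abstract grading interface is a small base class with a regex implementation. This is a sketch only: the class names, the `grade` signature, and the `\boxed{...}` pattern are assumptions for illustration and may differ from the actual code.

```python
import re
from abc import ABC, abstractmethod
from typing import Optional


class Grader(ABC):
    """Hypothetical common interface shared by all grader types."""

    @abstractmethod
    def grade(self, response: str, expected: str) -> bool:
        ...


class RegexGrader(Grader):
    def __init__(self, pattern: str) -> None:
        # The pattern must contain exactly one capture group: the answer.
        self.pattern = re.compile(pattern)

    def extract(self, response: str) -> Optional[str]:
        matches = self.pattern.findall(response)
        # Prefer the last match, since final answers usually come at the end.
        return matches[-1] if matches else None

    def grade(self, response: str, expected: str) -> bool:
        answer = self.extract(response)
        return answer is not None and answer.strip() == expected.strip()


# Assumed pattern for answers written as \boxed{...}
boxed = RegexGrader(r"\\boxed\{([^{}]+)\}")
is_correct = boxed.grade("Therefore the answer is \\boxed{42}.", "42")
```

A CLI or LLM grader would implement the same `grade` method, so the processor can stay unaware of which type is in use.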

Datasets

AimeDataset: 90 AIME 2025 questions
Aime2025Dataset: 30 AIME 2025 I & II questions
Gsm8kDataset: 7473 math word problems
GpqaDataset: 198 GPQA Diamond questions with shuffling
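
Reproducible answer shuffling for GPQA can be achieved by seeding a private random generator per question. The sketch below assumes four choices per question and a hypothetical helper name `shuffle_choices`; how the tool actually derives its per-question seed is not stated here.

```python
import random
from typing import List, Tuple


def shuffle_choices(
    correct: str, incorrect: List[str], seed: int, index: int
) -> Tuple[List[str], str]:
    """Shuffle the answer choices and return them with the correct letter."""
    # A private generator seeded from (seed, index) gives the same order on
    # every run without touching the global random state.
    rng = random.Random(seed * 1_000_003 + index)
    choices = [correct] + list(incorrect)
    rng.shuffle(choices)
    letter = "ABCD"[choices.index(correct)]
    return choices, letter


choices, letter = shuffle_choices("x", ["p", "q", "r"], seed=1234, index=0)
```

Calling the function again with the same `seed` and `index` returns the same order, so the expected letter stays stable across runs.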

Configuration

Sampling Parameters (Optional)

--temperature: Sampling temperature
--top-k: Top K sampling
--top-p: Top P sampling
--min-p: Min P sampling
Each parameter is passed only if it is explicitly specified.
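
A minimal sketch of this "only if specified" behavior with `argparse`: every flag defaults to `None`, and unset values are dropped before the configuration is built. The function name and the dictionary keys (`top_k`, `top_p`, `min_p`) are assumptions for illustration.

```python
import argparse
from typing import Any, Dict, List


def build_sampling_config(argv: List[str]) -> Dict[str, Any]:
    parser = argparse.ArgumentParser()
    parser.add_argument("--temperature", type=float, default=None)
    parser.add_argument("--top-k", type=int, default=None)
    parser.add_argument("--top-p", type=float, default=None)
    parser.add_argument("--min-p", type=float, default=None)
    args = parser.parse_args(argv)

    candidates = {
        "temperature": args.temperature,
        "top_k": args.top_k,
        "top_p": args.top_p,
        "min_p": args.min_p,
    }
    # Keep only what the user set. Compare against None rather than
    # truthiness so that an explicit 0 or 0.0 is preserved.
    return {key: value for key, value in candidates.items() if value is not None}


config = build_sampling_config(["--temperature", "0.6", "--top-k", "20"])
```

Here `config` holds only `temperature` and `top_k`; the two flags that were not given are absent rather than set to a default.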

Grading Types

regex: Built-in patterns for each dataset
cli: External script with --answer and --expected args
llm: LLM-based extraction with few-shot examples and configurable server/model
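
For the cli type, the external script receives `--answer` and `--expected`. How the script reports its verdict is not specified above; the sketch below assumes that exit code 0 means "correct". The helper name `grade_with_cli` and the inline demo grader are hypothetical.

```python
import subprocess
import sys
from typing import List


def grade_with_cli(
    command: List[str], answer: str, expected: str, timeout: float = 30.0
) -> bool:
    """Run an external grader; exit code 0 is taken as correct (assumption)."""
    result = subprocess.run(
        command + ["--answer", answer, "--expected", expected],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0


# Demo grader: a tiny inline Python script standing in for a real one.
DEMO_GRADER = (
    "import argparse, sys\n"
    "p = argparse.ArgumentParser()\n"
    "p.add_argument('--answer')\n"
    "p.add_argument('--expected')\n"
    "a = p.parse_args()\n"
    "sys.exit(0 if a.answer.strip() == a.expected.strip() else 1)\n"
)
command = [sys.executable, "-c", DEMO_GRADER]
matched = grade_with_cli(command, "42", "42")
```

Passing the command as a list avoids shell quoting problems when the model's answer contains spaces or special characters.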

Dataset Requirements

AIME: Supports regex, CLI, or LLM grader
AIME2025: Supports regex, CLI, or LLM grader
GSM8K: Supports regex, CLI, or LLM grader
GPQA: Requires LLM grader

Output Format

Progress Table

  Task ID             Dataset  Prompt (first 43 chars)                        Expected    Status
  aime_000_001         AIME   Complete the following reactions and sel...    A          pending

Results Summary

============================================================
Results: 8/10 correct (80.0%)
============================================================
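
A summary in the shape shown above could be produced by a small formatter like the following sketch. The function name and the default bar width are assumptions.

```python
def format_summary(correct: int, total: int, width: int = 60) -> str:
    """Format a results banner; guards against division by zero."""
    percent = 100.0 * correct / total if total else 0.0
    bar = "=" * width
    return f"{bar}\nResults: {correct}/{total} correct ({percent:.1f}%)\n{bar}"


summary = format_summary(8, 10)
```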

JSON Output

Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
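
As a sketch of how such a state could be serialized, the dataclass from the Architecture section can be converted with `dataclasses.asdict` and written with `json`. The keys inside `task_states` (`prompt`, `expected`, `extracted`, `correct`) and the example values are illustrative assumptions, not the tool's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]]
    sampling_config: Dict[str, Any]


def dump_state(state: EvalState) -> str:
    """Serialize the full eval state to a JSON string."""
    return json.dumps(asdict(state), indent=2)


state = EvalState(
    id="example-run",
    tasks=["gsm8k"],
    task_states={
        "gsm8k_000": {
            "prompt": "What is 2 + 3?",  # made-up example task
            "expected": "5",
            "extracted": "5",
            "correct": True,
        }
    },
    sampling_config={"temperature": 0.6},
)
payload = dump_state(state)
```

Because every field is a plain string, list, or dict, the JSON round-trips without a custom encoder.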

Technical Details

Default max tokens: -1 (no limit)
Default grader type: llm
Default seed: 1234
Default threads: 32
Prompt truncation: First 43 chars + padding + "..."
Response truncation: Only the last 10 lines of the response are used for grading
GPQA requires LLM grader (returns letter A/B/C/D)
Judge model defaults to evaluated model if not specified
Sample answers are defined in the SAMPLE_ANSWERS dict and used as few-shot examples
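
The two truncation rules listed above can be sketched as follows. This takes one literal reading of the prompt rule (keep the first 43 characters, then append "...", padding shorter prompts to the same column width); the helper names are hypothetical and the actual display code may differ.

```python
def truncate_prompt(prompt: str, limit: int = 43) -> str:
    """Fit a prompt into a fixed-width table column."""
    flat = " ".join(prompt.split())  # collapse newlines and repeated spaces
    if len(flat) <= limit:
        return flat.ljust(limit + 3)  # pad so all rows have the same width
    return flat[:limit] + "..."


def tail_lines(response: str, n: int = 10) -> str:
    """Keep only the last n lines of a response (n must be positive)."""
    return "\n".join(response.splitlines()[-n:])


cell = truncate_prompt("x" * 100)
tail = tail_lines("\n".join(str(i) for i in range(25)))
```

Grading only the tail keeps long reasoning traces from confusing answer extraction, since the final answer normally appears near the end of the response.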
