llama-eval Implementation Summary
Overview
Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).
Key Features
- Multiple Datasets: AIME, GSM8K, GPQA with proper answer extraction
- Flexible Grading: Regex, CLI, or LLM-based grading
- Parallel Processing: Configurable thread count for concurrent requests
- Sampling Parameters: Temperature, Top K, Top P, Min P (optional)
- Real-time Feedback: Progress tracking with detailed output
- JSON Output: Complete eval state saved for debugging
- GPQA Support: Answer shuffling with reproducible results
Architecture
Eval State
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]]
    sampling_config: Dict[str, Any]
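To illustrate how a state object of this shape could be saved as JSON and loaded back, here is a minimal sketch. The field values and the per-task keys ("status", "correct") are assumptions made for the example, not the tool's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]]
    sampling_config: Dict[str, Any]


# Hypothetical run; the nested keys below are illustrative only.
state = EvalState(
    id="example-run",
    tasks=["aime"],
    task_states={"aime": {"aime_000_001": {"status": "pending", "correct": None}}},
    sampling_config={"temperature": 0.7},
)

# asdict() recursively converts the dataclass into plain dicts and lists,
# which json.dumps() can serialize directly.
serialized = json.dumps(asdict(state), indent=2)

# Loading is the reverse: parse the JSON and pass the fields back in.
restored = EvalState(**json.loads(serialized))
```

Keeping every field JSON-serializable makes it possible to inspect the saved file by hand when debugging a run.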
Processor
- Handles processing, grading, and state management
- Thread-safe concurrent execution
- Configurable sampling parameters
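One common way to get thread-safe concurrent execution is a thread pool whose workers write results under a lock. The sketch below shows that pattern under stated assumptions: `run_tasks` and the `solve` callback are hypothetical names, and the real processor may be organized differently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_tasks(
    task_ids: List[str],
    solve: Callable[[str], bool],
    threads: int = 32,
) -> Dict[str, bool]:
    """Run solve() for every task on a thread pool and collect the results."""
    results: Dict[str, bool] = {}
    lock = threading.Lock()

    def worker(task_id: str) -> None:
        correct = solve(task_id)  # slow part (e.g. an HTTP request), done outside the lock
        with lock:  # only the shared-state update is serialized
            results[task_id] = correct

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() drains the iterator so that worker exceptions are re-raised here.
        list(pool.map(worker, task_ids))
    return results


# Stand-in for a real model call: tasks with an even last digit count as correct.
demo = run_tasks(
    [f"task_{i:03d}" for i in range(10)],
    lambda task_id: int(task_id[-1]) % 2 == 0,
    threads=4,
)
```

Holding the lock only around the dictionary update keeps the requests themselves fully parallel.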
Grader
- Abstract grading interface supporting multiple types
- Regex grader with dataset-specific patterns
- CLI grader with external script interface
- LLM grader with configurable server and model
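An abstract grading interface with a regex implementation might look like the following sketch. The class names, the `grade` signature, and the `\boxed{...}` pattern are assumptions chosen for illustration; the actual dataset-specific patterns are not shown in this summary.

```python
import re
from abc import ABC, abstractmethod
from typing import Optional


class Grader(ABC):
    """Hypothetical common interface: compare a model response to the expected answer."""

    @abstractmethod
    def grade(self, response: str, expected: str) -> bool:
        ...


class RegexGrader(Grader):
    def __init__(self, pattern: str) -> None:
        # The pattern must contain exactly one capture group holding the answer.
        self.pattern = re.compile(pattern)

    def extract(self, response: str) -> Optional[str]:
        matches = self.pattern.findall(response)
        # Use the last match, since final answers usually come at the end.
        return matches[-1] if matches else None

    def grade(self, response: str, expected: str) -> bool:
        answer = self.extract(response)
        return answer is not None and answer.strip() == expected.strip()


# Illustrative pattern for answers written as \boxed{...}.
boxed = RegexGrader(r"\\boxed\{([^{}]+)\}")
```

A CLI or LLM grader would implement the same `grade` method, so the processor would not need to know which type is in use.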
Datasets
- AimeDataset: 90 AIME 2025 questions
- Aime2025Dataset: 30 AIME 2025 I & II questions
- Gsm8kDataset: 7473 math word problems
- GpqaDataset: 198 GPQA Diamond questions with shuffling
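Reproducible answer shuffling for multiple-choice questions can be achieved by seeding a separate random generator per question. The sketch below is one way to do it; the function name and the way the seed is combined with the question index are assumptions, not the tool's actual scheme.

```python
import random
from typing import List, Tuple


def shuffle_choices(
    correct: str,
    incorrect: List[str],
    seed: int,
    index: int,
) -> Tuple[List[str], str]:
    """Shuffle the answer options and return them with the correct letter."""
    # A per-question generator makes the order independent of how many
    # questions were processed before this one, or on which thread.
    rng = random.Random(f"{seed}:{index}")
    choices = [correct] + list(incorrect)
    rng.shuffle(choices)
    letter = "ABCD"[choices.index(correct)]
    return choices, letter


options, expected_letter = shuffle_choices(
    "right", ["wrong 1", "wrong 2", "wrong 3"], seed=1234, index=0
)
```

Calling the function again with the same seed and index yields the same order, so a rerun grades against the same expected letters.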
Configuration
Sampling Parameters (Optional)
- --temperature: Sampling temperature
- --top-k: Top K sampling
- --top-p: Top P sampling
- --min-p: Min P sampling
- Only passed if explicitly specified
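The "only passed if explicitly specified" behavior can be sketched by defaulting each flag to None and filtering before the request is built. The helper name and the request field names (`top_k`, `top_p`, `min_p`) are assumptions for this example.

```python
import argparse
from typing import Any, Dict, List, Optional


def build_sampling_config(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Return only the sampling parameters the user actually set."""
    parser = argparse.ArgumentParser()
    # default=None distinguishes "not given" from a real value such as 0.
    parser.add_argument("--temperature", type=float, default=None)
    parser.add_argument("--top-k", type=int, default=None)
    parser.add_argument("--top-p", type=float, default=None)
    parser.add_argument("--min-p", type=float, default=None)
    args = parser.parse_args(argv)

    candidates = {
        "temperature": args.temperature,
        "top_k": args.top_k,
        "top_p": args.top_p,
        "min_p": args.min_p,
    }
    # Compare against None rather than truthiness so that 0 and 0.0 survive.
    return {name: value for name, value in candidates.items() if value is not None}


config = build_sampling_config(["--temperature", "0.6", "--top-k", "20"])
```

Omitted parameters are left out of the request entirely, so the server's own defaults apply to them.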
Grading Types
- regex: Built-in patterns for each dataset
- cli: External script with --answer and --expected args
- llm: LLM-based extraction with few-shot examples and configurable server/model
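As an illustration of the cli grading type, an external grader script could look like the sketch below. Only the --answer and --expected arguments come from the description above; the exit-code convention (0 for correct, 1 for incorrect) and the normalization rules are assumptions.

```python
import argparse
from typing import List, Optional


def normalize(text: str) -> str:
    """Make '1,000', ' 1000 ' and '1000.0' compare equal; lowercase anything else."""
    text = text.strip().replace(",", "")
    try:
        return str(float(text))
    except ValueError:
        return text.lower()


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Example external grader")
    parser.add_argument("--answer", required=True)
    parser.add_argument("--expected", required=True)
    args = parser.parse_args(argv)
    # Assumed convention: exit status 0 means correct, 1 means incorrect.
    return 0 if normalize(args.answer) == normalize(args.expected) else 1
```

When saved as a standalone script, the file would end by passing the return value of `main()` to `sys.exit`.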
Dataset Requirements
- AIME: Supports regex, CLI, or LLM grader
- AIME2025: Supports regex, CLI, or LLM grader
- GSM8K: Supports regex, CLI, or LLM grader
- GPQA: Requires LLM grader
Output Format
Progress Table
Task ID       Dataset  Prompt (first 43 chars)                      Expected  Status
aime_000_001  AIME     Complete the following reactions and sel...  A         pending
Results Summary
============================================================
Results: 8/10 correct (80.0%)
============================================================
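A summary in this format can be produced with a small helper such as the one below. The function name and the default rule width of 60 characters are assumptions for the sketch.

```python
def format_summary(correct: int, total: int, width: int = 60) -> str:
    """Render the results banner; guards against division by zero."""
    percent = 100.0 * correct / total if total else 0.0
    rule = "=" * width
    return f"{rule}\nResults: {correct}/{total} correct ({percent:.1f}%)\n{rule}"


summary = format_summary(8, 10)
```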
JSON Output
Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
Technical Details
- Default max tokens: -1 (no limit)
- Default grader type: llm
- Default seed: 1234
- Default threads: 32
- Prompt truncation: First 43 chars + padding + "..."
- Response truncation: Last 10 lines for grading
- GPQA requires LLM grader (returns letter A/B/C/D)
- Judge model defaults to evaluated model if not specified
- Sample answers are defined in the SAMPLE_ANSWERS dict and used as few-shot examples
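The two truncation rules can be sketched as follows. This assumes the 43-character prompt column includes the trailing "...", which matches the sample row in the progress table (40 characters of prompt plus the ellipsis); both helper names are hypothetical.

```python
def truncate_prompt(prompt: str, width: int = 43) -> str:
    """Fit a prompt into a fixed-width table column."""
    flat = " ".join(prompt.split())  # collapse newlines and repeated spaces
    if len(flat) <= width:
        return flat.ljust(width)  # pad short prompts so columns stay aligned
    return flat[: width - 3] + "..."


def tail_lines(response: str, count: int = 10) -> str:
    """Keep only the last `count` lines of a response for grading."""
    return "\n".join(response.splitlines()[-count:])


cell = truncate_prompt("Complete the following reactions and select the major products")
tail = tail_lines("\n".join(f"line {i}" for i in range(25)))
```

Grading only the tail of a response keeps the grader focused on the final answer and avoids matching intermediate values from long reasoning.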