# llama-eval Implementation Summary

## Overview

A simple evaluation tool for llama.cpp that supports multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).

## Key Features

- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
- **Flexible Grading**: Regex, CLI, or LLM-based grading
- **Parallel Processing**: Configurable thread count for concurrent requests
- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
- **Real-time Feedback**: Progress tracking with detailed output
- **JSON Output**: Complete eval state saved for debugging
- **GPQA Support**: Answer shuffling with reproducible results

## Architecture

### Eval State
```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]]
    sampling_config: Dict[str, Any]
```

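Because the complete eval state is saved as JSON, a round trip through `dataclasses.asdict` is one plausible way to persist it. The following is a minimal sketch, not the tool's actual code: `dump_state`, `load_state`, the run id, and the nested layout of `task_states` are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]]
    sampling_config: Dict[str, Any]


def dump_state(state: EvalState) -> str:
    """Serialize the whole state to a JSON string (hypothetical helper)."""
    return json.dumps(asdict(state), indent=2)


def load_state(text: str) -> EvalState:
    """Rebuild an EvalState from JSON produced by dump_state (hypothetical helper)."""
    return EvalState(**json.loads(text))


state = EvalState(
    id="example-run",  # hypothetical run id
    tasks=["aime"],
    # Assumed layout: task name -> task id -> per-task fields.
    task_states={"aime": {"aime_000_001": {"status": "pending"}}},
    sampling_config={"temperature": 0.7},
)
restored = load_state(dump_state(state))
```

Keeping every field JSON-serializable means a saved run can be inspected with any JSON tool when debugging.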
### Processor
- Handles processing, grading, and state management
- Thread-safe concurrent execution
- Configurable sampling parameters

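Thread-safe concurrent execution of this kind can be sketched with a thread pool and a lock guarding the shared results. This is an illustration under assumptions, not the processor's real implementation: `run_tasks`, `fake_process`, and the result layout are hypothetical names and shapes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def run_tasks(
    task_ids: List[str],
    process_one: Callable[[str], Dict[str, Any]],
    threads: int = 32,
) -> Dict[str, Dict[str, Any]]:
    """Run process_one for every task id on a pool of worker threads."""
    results: Dict[str, Dict[str, Any]] = {}
    lock = threading.Lock()

    def worker(task_id: str) -> None:
        outcome = process_one(task_id)  # slow part runs outside the lock
        with lock:                      # only the shared write is serialized
            results[task_id] = outcome

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() drains the iterator so worker exceptions are raised here.
        list(pool.map(worker, task_ids))
    return results


def fake_process(task_id: str) -> Dict[str, Any]:
    """Stand-in for a real request plus grading step."""
    return {"correct": task_id.endswith("1")}


results = run_tasks(["t_001", "t_002", "t_011"], fake_process, threads=4)
```

Holding the lock only around the dictionary write keeps the requests themselves fully parallel.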
### Grader
- Abstract grading interface supporting multiple types
- Regex grader with dataset-specific patterns
- CLI grader with external script interface
- LLM grader with configurable server and model

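An abstract grading interface with a regex-based implementation might look roughly like the sketch below. The class names, method names, and the `\boxed{...}` pattern are assumptions chosen for illustration; the tool's real dataset-specific patterns are not shown in this summary.

```python
import re
from abc import ABC, abstractmethod
from typing import Optional


class Grader(ABC):
    """Hypothetical base class: subclasses decide how to extract an answer."""

    @abstractmethod
    def extract(self, response: str) -> Optional[str]:
        """Return the extracted answer, or None if nothing was found."""

    def grade(self, response: str, expected: str) -> bool:
        answer = self.extract(response)
        return answer is not None and answer.strip() == expected.strip()


class RegexGrader(Grader):
    """Extracts the last match of a pattern with one capture group."""

    def __init__(self, pattern: str) -> None:
        self.pattern = re.compile(pattern)

    def extract(self, response: str) -> Optional[str]:
        matches = self.pattern.findall(response)
        return matches[-1] if matches else None


# Illustrative pattern only: an integer inside \boxed{...}.
boxed = RegexGrader(r"\\boxed\{(\d+)\}")
```

Taking the last match rather than the first favors the final answer over intermediate values mentioned earlier in the response.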
### Datasets
- `AimeDataset`: 90 AIME questions
- `Aime2025Dataset`: 30 AIME 2025 I & II questions
- `Gsm8kDataset`: 7473 math word problems
- `GpqaDataset`: 198 GPQA Diamond questions with shuffling

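Reproducible answer shuffling for GPQA can be sketched by seeding a private random generator per question. The function name, its arguments, and the way the seed is combined with the question index are assumptions; the summary only states that shuffling is reproducible.

```python
import random
from typing import List, Tuple


def shuffle_choices(
    correct: str, incorrect: List[str], seed: int, index: int
) -> Tuple[List[str], str]:
    """Shuffle the answer options and return them with the correct letter."""
    # A per-question generator keeps the order stable no matter which
    # thread handles the question or in what order questions are processed.
    rng = random.Random(f"{seed}:{index}")
    choices = [correct] + list(incorrect)
    rng.shuffle(choices)
    letter = "ABCD"[choices.index(correct)]
    return choices, letter


first = shuffle_choices("Paris", ["Rome", "Oslo", "Bern"], seed=1234, index=0)
second = shuffle_choices("Paris", ["Rome", "Oslo", "Bern"], seed=1234, index=0)
```

Using a dedicated `random.Random` instance instead of the module-level functions avoids interference from other code that draws random numbers.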
## Configuration

### Sampling Parameters (Optional)
- `--temperature`: Sampling temperature
- `--top-k`: Top K sampling
- `--top-p`: Top P sampling
- `--min-p`: Min P sampling
- Each parameter is passed only if explicitly specified

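The "only passed if explicitly specified" behavior can be sketched with `argparse` defaults of `None` that are filtered out before building the configuration. The flag names come from the list above; the helper names and the dictionary keys (`top_k`, `top_p`, `min_p`) are assumptions.

```python
import argparse
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # default=None lets us distinguish "not given" from any real value.
    parser.add_argument("--temperature", type=float, default=None)
    parser.add_argument("--top-k", type=int, default=None)
    parser.add_argument("--top-p", type=float, default=None)
    parser.add_argument("--min-p", type=float, default=None)
    return parser


def sampling_config(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Return only the sampling parameters the user actually set."""
    args = build_parser().parse_args(argv)
    candidates = {
        "temperature": args.temperature,
        "top_k": args.top_k,
        "top_p": args.top_p,
        "min_p": args.min_p,
    }
    return {key: value for key, value in candidates.items() if value is not None}


config = sampling_config(["--temperature", "0.6", "--top-k", "40"])
```

Omitting unset keys leaves the server free to apply its own defaults instead of having them overridden by the client.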
### Grading Types
- **regex**: Built-in patterns for each dataset
- **cli**: External script with `--answer` and `--expected` args
- **llm**: LLM-based extraction with few-shot examples and configurable server/model

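An external script for the `cli` grading type could be as small as the sketch below. Only the `--answer` and `--expected` arguments come from this summary; signalling the verdict through the exit code (0 for correct, 1 for incorrect) and the `normalize` helper are assumptions, so check how the tool actually reads the script's result before relying on this.

```python
import argparse
from typing import List, Optional


def normalize(text: str) -> str:
    """Hypothetical normalization: trim, lowercase, drop a trailing period."""
    return text.strip().lower().rstrip(".")


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Example external grader")
    parser.add_argument("--answer", required=True)
    parser.add_argument("--expected", required=True)
    args = parser.parse_args(argv)
    # Assumed convention: exit code 0 means correct, 1 means incorrect.
    return 0 if normalize(args.answer) == normalize(args.expected) else 1
```

When used as a standalone script, the return value of `main()` would be handed to `sys.exit` under the usual `if __name__ == "__main__":` guard.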
### Dataset Requirements
- **AIME**: Supports regex, CLI, or LLM grader
- **AIME2025**: Supports regex, CLI, or LLM grader
- **GSM8K**: Supports regex, CLI, or LLM grader
- **GPQA**: Requires LLM grader

## Output Format

### Progress Table
```
Task ID       Dataset  Prompt (first 43 chars)                      Expected  Status
aime_000_001  AIME     Complete the following reactions and sel...  A         pending
```

### Results Summary
```
============================================================
Results: 8/10 correct (80.0%)
============================================================
```

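The summary block can be reproduced with a few lines of formatting. This is a sketch only: `format_summary` is a hypothetical helper, and the 60-character rule width is an assumption read off the sample output.

```python
def format_summary(correct: int, total: int, width: int = 60) -> str:
    """Render the final accuracy line between two horizontal rules."""
    # Guard against an empty run so we never divide by zero.
    percent = 100.0 * correct / total if total else 0.0
    rule = "=" * width
    return f"{rule}\nResults: {correct}/{total} correct ({percent:.1f}%)\n{rule}"


summary = format_summary(8, 10)
```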
### JSON Output
Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.

## Technical Details

- Default max tokens: -1 (no limit)
- Default grader type: llm
- Default seed: 1234
- Default threads: 32
- Prompt truncation: First 43 chars + padding + "..."
- Response truncation: Last 10 lines for grading
- GPQA requires LLM grader (returns letter A/B/C/D)
- Judge model defaults to evaluated model if not specified
- Sample answers defined in the `SAMPLE_ANSWERS` dict for few-shot prompting
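
Truncating a response to its last 10 lines before grading can be sketched as follows. The helper name `tail_lines` is hypothetical; only the "last 10 lines" rule comes from the list above.

```python
def tail_lines(response: str, max_lines: int = 10) -> str:
    """Keep only the last max_lines lines of a model response."""
    if max_lines <= 0:
        return ""
    lines = response.splitlines()
    return "\n".join(lines[-max_lines:])


long_response = "\n".join(f"step {i}" for i in range(25))
graded_part = tail_lines(long_response)
```

Final answers usually appear at the end of a response, so grading only the tail keeps the grader's input short without dropping the part that matters.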