add gpqa + sampling + docs

This commit is contained in:
Georgi Gerganov 2026-02-16 00:52:17 +02:00
parent e8a807519a
commit cffd268bb3
No known key found for this signature in database
GPG Key ID: 449E073F9DC10735
8 changed files with 444 additions and 765 deletions


@ -0,0 +1,85 @@
# llama-eval Implementation Summary
## Overview
Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).
## Key Features
- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
- **Flexible Grading**: Regex, CLI, or LLM-based grading
- **Parallel Processing**: Configurable thread count for concurrent requests
- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
- **Real-time Feedback**: Progress tracking with detailed output
- **JSON Output**: Complete eval state saved for debugging
- **GPQA Support**: Answer shuffling with reproducible results
## Architecture
### Eval State
```python
@dataclass
class EvalState:
id: str
tasks: List[str]
task_states: Dict[str, Dict[str, Any]]
sampling_config: Dict[str, Any]
```
### Processor
- Handles processing, grading, and state management
- Thread-safe concurrent execution
- Configurable sampling parameters
### Grader
- Abstract grading interface supporting multiple types
- Regex grader with dataset-specific patterns
- CLI grader with external script interface
- LLM grader with configurable server and model
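A minimal sketch of such a dispatch interface (the names here are illustrative, not the actual implementation — a trivial exact-match backend stands in for the real regex/CLI/LLM backends):
```python
from typing import Callable, Dict, Optional, Tuple

# Each backend returns (is_correct, extracted_answer).
GradeFn = Callable[[str, str], Tuple[bool, Optional[str]]]

def exact_match(pred: str, gold: str) -> Tuple[bool, Optional[str]]:
    """Trivial stand-in backend: case-insensitive exact comparison."""
    extracted = pred.strip()
    return extracted.lower() == gold.strip().lower(), extracted

# The real "regex", "cli", and "llm" backends would be registered here.
GRADER_BACKENDS: Dict[str, GradeFn] = {"exact": exact_match}

def grade(grader_type: str, pred: str, gold: str) -> Tuple[bool, Optional[str]]:
    if grader_type not in GRADER_BACKENDS:
        raise ValueError(f"Unknown grader type: {grader_type}")
    return GRADER_BACKENDS[grader_type](pred, gold)
```
Returning the extracted answer alongside correctness is what lets the progress table display a short answer instead of the full response.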
### Datasets
- `AimeDataset`: 90 AIME 2025 questions
- `Gsm8kDataset`: 7473 math word problems
- `GpqaDataset`: 198 GPQA Diamond questions with shuffling
## Configuration
### Sampling Parameters (Optional)
- `--temperature`: Sampling temperature
- `--top-k`: Top K sampling
- `--top-p`: Top P sampling
- `--min-p`: Min P sampling
- Only passed if explicitly specified
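The "only passed if explicitly specified" behavior amounts to filtering out unset (`None`) parameters before building the request payload; a sketch, with parameter names following the flags above:
```python
from typing import Any, Dict, Optional

def build_sampling_params(
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    min_p: Optional[float] = None,
) -> Dict[str, Any]:
    """Return only the sampling parameters that were explicitly set."""
    candidates = {
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "min_p": min_p,
    }
    return {k: v for k, v in candidates.items() if v is not None}
```
Parameters left unset this way fall back to the server's own defaults.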
### Grading Types
- **regex**: Built-in patterns for each dataset
- **cli**: External script with `--answer` and `--expected` args
- **llm**: LLM-based extraction with configurable server/model
## Output Format
### Progress Table
```
Task ID Dataset Prompt (first 43 chars) Expected Status
aime_000_001 AIME Complete the following reactions and sel... A pending
```
### Results Summary
```
============================================================
Results: 8/10 correct (80.0%)
============================================================
```
### JSON Output
Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
## Technical Details
- Default max tokens: -1 (infinite)
- Default grader type: llm
- Default seed: 1234
- Default threads: 32
- Prompt truncation: First 43 chars + padding + "..."
- GPQA requires LLM grader (returns letter A/B/C/D)
- Judge model defaults to evaluated model if not specified
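The prompt-truncation rule above can be sketched as follows (first line only; long lines are cut and suffixed with `...`, short lines are padded to the same width first):
```python
def truncate_prompt(prompt: str, width: int = 43) -> str:
    """Truncate a prompt for display: first line, fixed width, '...' suffix."""
    first_line = prompt.split("\n")[0]
    if len(first_line) > width:
        return first_line[:width] + "..."
    return first_line.ljust(width) + "..."
```
Every displayed prompt therefore has the same length, which keeps the progress table columns aligned.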


@ -0,0 +1,105 @@
# llama-eval Evaluation Tool
Simple evaluation tool for llama.cpp with support for multiple datasets.
## Features
- **Multiple Datasets**: AIME, GSM8K, GPQA
- **Flexible Grading**: Regex, CLI, or LLM-based grading
- **Parallel Processing**: Configurable thread count
- **Real-time Feedback**: Progress tracking with detailed output
- **Sampling Parameters**: Temperature, Top K, Top P, Min P
- **JSON Output**: Complete eval state saved for debugging
## Usage
```bash
python llama-eval-new.py \
--server http://127.0.0.1:8013 \
--model gpt-oss-20b-hf-low \
--judge-model gpt-oss-20b-hf-medium \
--dataset aime \
--n_cases 10 \
--grader-type llm \
--seed 42
```
## CLI Arguments
- `--server`: llama-server URL (default: http://127.0.0.1:8013)
- `--model`: Model name for evaluation (default: llama)
- `--judge-model`: Model name for LLM judge (default: same as main model)
- `--judge-server`: Server URL for LLM judge (default: same as main server)
- `--dataset`: Dataset type (aime, gsm8k, gpqa)
- `--n_cases`: Number of cases to evaluate (default: all)
- `--n_predict`: Max tokens to predict per prompt (default: -1, infinite)
- `--temperature`: Sampling temperature (default: not passed)
- `--top-k`: Top K sampling (default: not passed)
- `--top-p`: Top P sampling (default: not passed)
- `--min-p`: Min P sampling (default: not passed)
- `--threads`: Number of threads for parallel requests (default: 32)
- `--verbose`: Show detailed output for each case
- `--output`: Output file for eval state (default: llama-eval-state.json)
- `--grader-type`: Grader type (regex, cli, llm, default: llm)
- `--grader-script`: Path to CLI grader script (required for --grader-type cli)
- `--seed`: Random seed for shuffling (default: 1234)
## Datasets
### AIME
- 90 questions from 2025 AIME competition
- Answers in boxed format: `\boxed{answer}`
- Requires regex grader or LLM grader
### GSM8K
- 7473 math word problems
- Answers are numeric values
- Requires regex grader or LLM grader
### GPQA
- 198 questions from GPQA Diamond dataset
- Multiple choice with shuffled options
- Requires LLM grader (returns letter A, B, C, or D)
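Reproducible option shuffling, as implied by the `--seed` flag, can be sketched like this (the actual script seeds one RNG for the whole dataset; seeding per call, as here, is a simplification):
```python
import random
from typing import List, Tuple

def shuffle_options(correct: str, incorrect: List[str], seed: int) -> Tuple[List[str], str]:
    """Shuffle the four answer options deterministically; report the correct letter."""
    rng = random.Random(seed)
    options = incorrect + [correct]
    rng.shuffle(options)
    letter = chr(ord("A") + options.index(correct))
    return options, letter
```
The same seed always yields the same option order, so runs are comparable across models.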
## Grading Types
### Regex Grader
Built-in patterns for different datasets:
- AIME: `\boxed{(\d+)}|\b(\d+)\b`
- GSM8K: `\b(\d+)\b`
- GPQA: Letter extraction (A, B, C, D)
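In Python's `re` module the backslash and braces of the AIME pattern need escaping; a sketch of extraction that returns the first non-empty capture group:
```python
import re
from typing import Optional

# The AIME pattern above, escaped for Python's re module:
# literal \boxed{digits}, or a bare whole number as fallback.
AIME_PATTERN = r"\\boxed\{(\d+)\}|\b(\d+)\b"

def extract_answer(text: str, pattern: str = AIME_PATTERN) -> Optional[str]:
    """Return the first non-empty capture group of the first match, if any."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return next((g for g in m.groups() if g), m.group(0))
```
The boxed alternative is tried first because it appears earlier in the string than the digits it wraps.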
### CLI Grader
External script interface:
```bash
./grader.sh --answer <pred> --expected <gold>
```
Returns exit code 0 if correct, non-zero if incorrect.
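From the caller's side this contract is a subprocess invocation checked by exit code; a sketch, including the 30-second timeout guard the tool uses for CLI grading:
```python
import subprocess

def grade_cli(script: str, pred: str, gold: str, timeout_s: float = 30.0) -> bool:
    """Invoke an external grader script; exit code 0 means the answer is correct."""
    try:
        result = subprocess.run(
            [script, "--answer", pred, "--expected", gold],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hung grader counts as incorrect rather than blocking the eval.
        return False
    return result.returncode == 0
```
Any executable honoring the `--answer`/`--expected` convention can be plugged in via `--grader-script`.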
### LLM Grader
Uses LLM to extract and compare answers:
- Configurable server and model
- Includes problem context in prompt
- Case-insensitive comparison
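The judge call is an ordinary `/v1/chat/completions` request; a sketch of building the payload, mirroring the extraction prompt the script uses:
```python
from typing import Any, Dict

def build_judge_request(pred: str, judge_model: str) -> Dict[str, Any]:
    """Build the chat-completion payload asking the judge to extract an answer."""
    prompt = (
        "Extract the answer from this response:\n"
        "===\n"
        f"Response: {pred}\n"
        "===\n"
        "Please provide only the extracted answer, nothing else. "
        "If there is no clear answer in the response, reply with 'no answer'."
    )
    return {
        "model": judge_model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic extraction
        "max_tokens": 256,
    }
```
The payload is POSTed to `{judge_server}/v1/chat/completions` and the reply is compared case-insensitively against the gold answer.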
## Output
### Progress Table
```
Task ID Dataset Prompt (first 43 chars) Expected Status
aime_000_001 AIME Complete the following reactions and sel... A pending
```
### Results
```
============================================================
Results: 8/10 correct (80.0%)
============================================================
```
### JSON Output
Complete eval state saved to output file with:
- Task IDs and correctness status
- Prompts and extracted answers
- Sampling configuration
- Processing metadata
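A minimal sketch of serializing such a state object (field names follow the implementation summary; this is not the exact schema):
```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    sampling_config: Dict[str, Any] = field(default_factory=dict)

def save_state(state: EvalState, path: str) -> None:
    """Dump the complete eval state as pretty-printed JSON."""
    with open(path, "w") as f:
        json.dump(asdict(state), f, indent=2)
```
Dumping the state periodically while processing is what makes debugging and post-hoc inspection of individual tasks possible.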


@ -1,395 +0,0 @@
# llama-eval Implementation Discussion
## Overview
Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.
## Key Requirements from ggerganov
### 1. Simplify and Focus on One Eval
- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially
### 2. Implement an "eval state" object
- ID
- List of tasks
- Task states
- Sampling config
### 3. Implement a "processor" object
- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)
### 4. Processor responsibilities
- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses
### 5. Real-time feedback
- Default: show "correct / not correct" for each task
- Verbose mode: show produced answer vs expected answer as soon as it completes
### 6. Grading approach
- Abstract grading to support external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals)
### 7. Output format
- Use structured output (JSON) instead of boxed text
## Current Implementation Analysis
### What exists in llama-eval.py:
- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting
### What needs to be removed:
- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)
## Discussion Points
### 1. Eval State Object Structure
**Status: Under Discussion**
Questions:
- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?
### 2. Processor Architecture
**Status: Not Started**
Questions:
- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?
### 3. Grader Interface
**Status: Not Started**
Questions:
- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?
### 4. Checkpointing
**Status: Not Started**
Questions:
- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?
### 5. Real-time Output
**Status: Not Started**
Questions:
- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?
### 6. Output Format
**Status: Not Started**
Questions:
- Should responses be in JSON format?
- How should the grader interface work with JSON output?
## Next Steps
1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format
## References
- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195
## Session Work Summary
### llama-server-simulator Implementation
**Created:**
- `llama-server-simulator.py` - Standalone Python script simulating llama-server HTTP endpoint
- `test-simulator.sh` - Test script for verifying simulator functionality
- `llama-server-simulator-plan.md` - Implementation plan
- `simulator-summary.md` - Summary of implementation
**Features Implemented:**
1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with OpenAI-compatible format
2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
5. Debug Logging - Helps troubleshoot matching issues
**Testing Results:**
- ✅ Correct answers returned when success rate allows
- ✅ Wrong answers returned when success rate doesn't allow
- ✅ Requests with no matching question return an error
- ✅ Success rate verified (80% in 10 requests)
- ✅ HuggingFace dataset caching working correctly
**Key Technical Decisions:**
- Used Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via HuggingFace datasets library
- Wrong answers generated by incrementing expected answer
- Debug output written to stderr for better visibility
**Refactoring:**
- Extracted repeating question string into TEST_QUESTION variable
- Created make_request() helper function to reduce code duplication
- Added proper error handling for error responses
- Fixed simulator stopping issue at script completion
### llama-eval-new.py Implementation
**Created:**
- `llama-eval-new.py` - Simplified evaluation tool focused on AIME
**Features Implemented:**
1. **Eval State Object** - Structured dataclass with ID, tasks, task states, and sampling config
2. **Processor Object** - Handles processing, grading, and state management
3. **Real-time Feedback** - Shows correct/incorrect status for each case
4. **Flexible Grading System** - Supports regex, CLI, and LLM-based grading
5. **Structured JSON Output** - Saves complete eval state to JSON file
6. **HuggingFace Dataset Caching** - Uses cached dataset path to avoid HF Hub requests
7. **Enhanced Answer Extraction** - Extracts answers from full responses for display
**Grading System:**
- **Regex Grading**: Built-in patterns for different task types
- `aime`: `\boxed{(\d+)}|\b(\d+)\b` (handles boxed and plain text)
- `gsm8k`: `\b(\d+)\b` (extract first number)
- `mmlu`, `hellaswag`, `arc`, `winogrande`: `[A-D]` (extract single letter)
- **CLI Grading**: External script interface
- Script accepts `--answer <pred>` and `--expected <gold>`
- Returns exit code 0 if correct, non-zero if incorrect
- 30-second timeout to prevent hanging
- **LLM Judge**: Generic answer extraction using LLM
- Uses configured server and model for extraction
- Includes problem statement in prompt for context
- Case-insensitive comparison
- Returns extracted answer for display
**Configuration Options:**
- `--server`: llama-server URL (default: http://localhost:8033)
- `--n_cases`: Number of cases to evaluate (default: all)
- `--n_predict`: Max tokens to predict per prompt (default: 2048)
- `--threads`: Number of threads for parallel requests (default: 32)
- `--verbose`: Show detailed output for each case
- `--output`: Output file for eval state (default: llama-eval-state.json)
- `--grader-type`: `regex`, `cli`, or `llm`
- `--grader-regex-type`: aime, gsm8k, mmlu, hellaswag, arc, winogrande
- `--grader-script`: Path to CLI grader script
- `--judge-server`: Server URL for LLM judge (default: same as main server)
- `--judge-model`: Model name for LLM judge (default: same as main model)
**Testing Results:**
- ✅ Works with simulator at 100% success rate (all correct)
- ✅ Works with simulator at 0% success rate (all incorrect)
- ✅ Works with simulator at 80% success rate (8/10 correct)
- ✅ Real-time verbose output shows gold/pred/status for each case
- ✅ JSON output contains complete eval state with all cases
- ✅ HF Hub telemetry disabled (no warnings)
- ✅ Uses cached dataset path to avoid HF Hub requests when available
- ✅ Regex grader extracts answers correctly from various formats
- ✅ LLM judge can extract answers with problem context
- ✅ Response truncation focuses grading on final answer
- ✅ Case-insensitive matching works for both regex and LLM grader
- ✅ Judge model and server configuration propagate correctly
- ✅ Progress table shows extracted answers instead of full responses
**Key Technical Decisions:**
- Removed Levenshtein matching - eval script only sends requests and validates answers
- Abstract grading interface for external grader support
- Exact match requirement for regex patterns
- Handles both boxed and plain text formats for AIME answers
- 30-second timeout for CLI grader
- Validates script exists before running
- Judge parameters set once during Grader construction
- LLM judge prompt includes problem statement for better extraction
- Response truncation to last 2-3 lines focuses grading on final answer
- Case-insensitive comparison for more flexible matching
**Refactoring:**
- Removed all task implementations except AIME
- Removed regex-based grading (moved to flexible grader system)
- Removed multiple endpoint support
- Removed complex task loading logic
- Removed summary reporting (replaced with real-time feedback)
- Added HuggingFace dataset caching optimization
- Added LLM grader support with configurable server and model
- Added response truncation before grading
- Refactored grader interface to return extracted answers
### llama-eval-new.py Threading and Model Parameter Updates
**Changes Made:**
1. **Threading Support** - Added ThreadPoolExecutor for parallel request processing
- Added `from concurrent.futures import ThreadPoolExecutor, as_completed`
- Created `_process_single_case()` method for thread-safe case processing
- Refactored `process()` to use ThreadPoolExecutor with configurable thread count
- Updated progress tracking to work with concurrent execution
- Thread-safe eval state updates (task_states and counters)
2. **Model Parameter** - Added `--model` argument to specify model name in request data
- Added `model_name` parameter to Processor.__init__()
- Updated `_make_request()` to use provided model name or default to "llama"
- Added `--model` argument to argument parser
- Model name is included in request JSON as `"model": "gpt-oss-20b-hf"`
**Testing Results:**
- ✅ Works with 2 threads (5 cases processed in ~0.2s)
- ✅ Works with 4 threads (slightly faster throughput)
- ✅ Model parameter correctly added to request data
- ✅ Thread-safe progress tracking with tqdm
- ✅ No race conditions in eval state updates
**Key Technical Decisions:**
- Used ThreadPoolExecutor for simple, effective parallelism
- No rate limiting needed (server can handle concurrent requests)
- Thread-safe counter updates for correct/total tracking
- Progress bar shows completion status across all threads
- Model parameter is optional - defaults to "llama" if not specified
**Refactoring:**
- Extracted single case processing into `_process_single_case()` method
- Changed from sequential loop to ThreadPoolExecutor with futures
- Updated verbose output to show total count instead of index
- Made eval state updates thread-safe
### llama-eval-new.py Enhanced Grading System
**Changes Made:**
1. **Enhanced Grader Interface** - Updated to return extracted answers
- `grade()` method now returns `Tuple[bool, Optional[str]]` (correctness + extracted answer)
- Added `extracted` field to `TaskState` dataclass
- All grader types (regex, cli, llm) now return extracted answers
2. **Improved Regex Grader**
- New `_extract_answer_regex()` method extracts answers using configured patterns
- Supports case-insensitive matching
- Returns first valid match found
- Handles both single values and multiple matches
3. **LLM-Based Judge**
- New `_grade_llm()` method for generic answer extraction
- Includes problem statement in prompt for context
- Configurable server URL (defaults to main server)
- Configurable model name (defaults to main model)
- Case-insensitive comparison
- Returns extracted answer for display
4. **Response Truncation**
- New `_truncate_response()` method keeps only last 2-3 lines
- Applied before grading to focus on final answer section
5. **CLI Grader Update**
- Now also returns extracted answer
- Returns None if grading fails
6. **Display Updates**
- Progress table shows extracted answer instead of full response
- Verbose mode shows full response plus extracted answer
7. **New CLI Arguments**
- `--grader-type`: Added "llm" option
- `--judge-server`: Separate server for LLM judge
- `--judge-model`: Separate model for LLM judge
**Testing Results:**
- ✅ Regex grader extracts answers correctly from various formats
- ✅ LLM judge can extract answers with problem context
- ✅ Response truncation focuses grading on final answer
- ✅ Case-insensitive matching works for both regex and LLM grader
- ✅ Judge model and server configuration propagate correctly
- ✅ Progress table shows extracted answers instead of full responses
**Key Technical Decisions:**
- Judge parameters set once during Grader construction (not on each call)
- LLM judge prompt includes problem statement for better extraction
- Response truncation to last 2-3 lines focuses grading on final answer
- Case-insensitive comparison for more flexible matching
- Judge configuration propagates through Processor to Grader
- Display shows extracted answer for cleaner output
**Refactoring:**
- Removed judge parameters from `grade()` method calls
- Added `judge_server_url` and `judge_model_name` to Grader class
- Updated `_grade_llm()` to use instance variables instead of parameters
- Simplified Processor initialization to pass judge config to grader
- Updated startup info to show judge server and model
### llama-eval-new.py GSM8K Dataset Support
**Changes Made:**
1. **GSM8K Dataset Integration** - Added support for GSM8K dataset alongside AIME
- Created `Gsm8kDataset` class with proper answer extraction logic
- GSM8K uses `"question"` field instead of `"problem"` field
- GSM8K answer field contains full reasoning with `####` prefix
- Extracts numeric answer from answer field during initialization
- Uses same regex grader pattern as AIME (`\b(\d+)\b`)
2. **Dataset Type Configuration** - Added dataset selection support
- Added `--dataset` CLI argument with choices `aime` and `gsm8k`
- Updated `Processor` class to accept `dataset_type` parameter
- Dataset-specific initialization in `Processor.__init__()`
- Dataset name displayed in task summary table
3. **Template Registry** - Added dataset-specific prompt templates
- AIME template: includes `\boxed{}` wrapper for final answer
- GSM8K template: plain text answer without wrapper
- Templates applied based on `question["dataset_type"]` field
4. **Answer Extraction Logic** - Fixed GSM8K answer extraction
- GSM8K has pre-extracted `"gold"` field with numeric answer
- `Gsm8kDataset.get_answer()` checks for `"gold"` field first
- Falls back to answer field if gold field not present
- `AimeDataset.get_answer()` simplified to remove duplicate method
5. **Task ID Format** - Fixed duplicate prefix in task IDs
- Changed from `f"{dataset_type}_{eval_state.id}_{chunk_idx:03d}_{i:03d}"`
- To `f"{dataset_type}_{chunk_idx:03d}_{i:03d}"`
- Removed redundant `eval_state.id` (was "gsm8k" for GSM8K)
6. **Column Width Adjustments** - Improved table formatting
- Task ID column: 25 characters
- Dataset column: 5 characters
- Prompt column: 40 characters
- Expected column: 10 characters
**Testing Results:**
- ✅ GSM8K dataset loads correctly with 7473 questions
- ✅ Numeric answers extracted from full reasoning text
- ✅ Task summary table displays correctly with adjusted column widths
- ✅ Task IDs show correct format (e.g., `gsm8k_000_3169`)
- ✅ Both AIME and GSM8K datasets work with same script
- ✅ Answer extraction works for both boxed and plain text formats
- ✅ Progress tracking shows extracted answers for both datasets
**Key Technical Decisions:**
- GSM8K uses `"question"` field instead of `"problem"` field
- GSM8K answer field contains full reasoning with `####` prefix
- Numeric answer extracted during dataset initialization
- Same regex grader pattern works for both datasets
- Dataset selection via CLI argument for separate runs
- Template registry supports different prompt formats per dataset
- Task ID format simplified to avoid duplication
**Refactoring:**
- Removed duplicate `get_question()` method from `AimeDataset`
- Removed "2025" suffix from eval state ID (was remnant from old version)
- Removed "2025" suffix from task summary table output
- Removed "2025" suffix from progress tracking output
- Updated `Processor.__init__()` to initialize appropriate dataset based on type
- Updated `_process_single_case()` to handle both `"problem"` and `"question"` fields
- Updated `process()` method to display dataset name and use `dataset_type` for task states


@ -5,6 +5,7 @@ import json
import os
import re
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, asdict
@ -34,6 +35,15 @@ Please reason step by step, and put your final answer within \\boxed{{}}.
""",
"gsm8k": """{question}
Please reason step by step, and provide your final answer.
""",
"gpqa": """{Question}
(A) {A}
(B) {B}
(C) {C}
(D) {D}
Express your final answer as the corresponding option 'A', 'B', 'C', or 'D'.
""",
}
@ -96,6 +106,15 @@ class AimeDataset:
return str(normalized) if normalized is not None else answer
return str(answer)
def get_prompt(self, question: Dict) -> str:
"""Get formatted prompt for the question"""
if question["dataset_type"] == "gpqa":
return TEMPLATE_REGISTRY["gpqa"].format(**question)
else:
return TEMPLATE_REGISTRY[question["dataset_type"]].format(
question=question["problem"] if "problem" in question else question["question"]
)
class Gsm8kDataset:
def __init__(self, split: str = "train"):
self.split = split
@ -146,17 +165,87 @@ class Gsm8kDataset:
return str(normalized) if normalized is not None else answer
return str(answer)
def get_prompt(self, question: Dict) -> str:
"""Get formatted prompt for the question"""
return TEMPLATE_REGISTRY[question["dataset_type"]].format(
question=question["problem"] if "problem" in question else question["question"]
)
class GpqaDataset:
def __init__(self, variant: str = "diamond", seed: int = 1234):
self.variant = variant
self.seed = seed
self.questions: List[Dict] = []
self._load_dataset()
def _load_dataset(self):
print(f"Loading GPQA dataset (variant: {self.variant})...")
import pandas as pd
url = f"https://openaipublic.blob.core.windows.net/simple-evals/gpqa_{self.variant}.csv"
df = pd.read_csv(url)
rng = random.Random(self.seed)
self.questions = []
for _, row in df.iterrows():
question = row.to_dict()
question["dataset_type"] = "gpqa"
# Shuffle the answer options
correct_answer = question["Correct Answer"]
incorrect_answers = [
question["Incorrect Answer 1"],
question["Incorrect Answer 2"],
question["Incorrect Answer 3"]
]
# Create list of (answer, is_correct) tuples
options = [(ans, ans == correct_answer) for ans in incorrect_answers]
options.append((correct_answer, True))
# Shuffle the options
rng.shuffle(options)
# Extract shuffled answers and determine correct letter
shuffled_answers = [ans for ans, _ in options]
correct_letter = chr(ord('A') + options.index((correct_answer, True)))
# Store shuffled answers and correct letter
question["shuffled_answers"] = shuffled_answers
question["correct_letter"] = correct_letter
self.questions.append(question)
print(f"GPQA dataset loaded: {len(self.questions)} questions")
def get_question(self, index: int) -> Dict:
"""Get question by index"""
return self.questions[index]
def get_answer(self, question: Dict) -> str:
# GPQA returns the correct letter (A, B, C, or D)
return question["correct_letter"]
def get_prompt(self, question: Dict) -> str:
"""Get formatted prompt for the question"""
return TEMPLATE_REGISTRY["gpqa"].format(
Question=question["Question"],
A=question["shuffled_answers"][0],
B=question["shuffled_answers"][1],
C=question["shuffled_answers"][2],
D=question["shuffled_answers"][3]
)
class Grader:
def __init__(
self,
grader_type: str = "regex",
grader_regex_type: str = "aime",
grader_type: str = "llm",
grader_script: Optional[str] = None,
judge_model_name: Optional[str] = None,
judge_server_url: str = ""
):
self.grader_type = grader_type
self.grader_regex_type = grader_regex_type
self.grader_script = grader_script
self.judge_model_name = judge_model_name
self.judge_server_url = judge_server_url
@ -164,9 +253,7 @@ class Grader:
def _get_pattern(self) -> Optional[str]:
if self.grader_type == "regex":
if self.grader_regex_type not in GRADER_PATTERNS:
raise ValueError(f"Unknown grader regex type: {self.grader_regex_type}")
return GRADER_PATTERNS[self.grader_regex_type]
return GRADER_PATTERNS.get("aime") # Default to aime pattern
return None
def _extract_answer_regex(self, pred: str) -> Optional[str]:
@ -221,18 +308,21 @@ class Grader:
"""Grade using LLM-based extraction"""
prompt = f"""Extract the answer from this response:
Response: {pred}
Expected answer: {gold}
Please provide only the extracted answer, nothing else."""
===
Response: {pred}
===
Please provide only the extracted answer, nothing else. If there is no clear answer in the response, reply with 'no answer'."""
url = f"{self.judge_server_url}/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
"model": self.judge_model_name,
"messages": [{"role": "user", "content": prompt}],
"temperature": 0,
"max_tokens": 256
}
try:
@ -264,14 +354,16 @@ class Processor:
def __init__(
self,
server_url: str,
n_predict: int = 2048,
n_predict: int = -1,
threads: int = 32,
verbose: bool = False,
grader: Optional[Grader] = None,
model_name: Optional[str] = None,
judge_server_url: str = "",
judge_model_name: Optional[str] = None,
dataset_type: str = "aime"
dataset_type: str = "aime",
seed: int = 1234,
sampling_config: Optional[Dict[str, Any]] = None
):
self.server_url = server_url
self.n_predict = n_predict
@ -281,12 +373,14 @@ class Processor:
self.judge_server_url = judge_server_url if judge_server_url else server_url
self.judge_model_name = judge_model_name
self.dataset_type = dataset_type
self.seed = seed
self.grader = grader or Grader()
self.sampling_config = sampling_config or {"n_predict": n_predict}
self.eval_state = EvalState(
id=dataset_type,
tasks=[dataset_type],
task_states={},
sampling_config={"temperature": 0, "max_tokens": n_predict}
sampling_config=self.sampling_config
)
# Pass judge configuration to grader if using LLM grader
@ -301,6 +395,8 @@ class Processor:
self.dataset = AimeDataset()
elif dataset_type == "gsm8k":
self.dataset = Gsm8kDataset()
elif dataset_type == "gpqa":
self.dataset = GpqaDataset(variant="diamond", seed=self.seed)
else:
raise ValueError(f"Unknown dataset type: {dataset_type}")
@ -311,9 +407,16 @@ class Processor:
data = {
"model": self.model_name if self.model_name else "llama",
"messages": [{"role": "user", "content": prompt}],
"temperature": 0,
"max_tokens": self.n_predict
"n_predict": self.n_predict
}
if self.sampling_config.get("temperature") is not None:
data["temperature"] = self.sampling_config["temperature"]
if self.sampling_config.get("top_k") is not None:
data["top_k"] = self.sampling_config["top_k"]
if self.sampling_config.get("top_p") is not None:
data["top_p"] = self.sampling_config["top_p"]
if self.sampling_config.get("min_p") is not None:
data["min_p"] = self.sampling_config["min_p"]
response = requests.post(url, headers=headers, json=data)
response.raise_for_status()
@ -322,14 +425,9 @@ class Processor:
def _process_single_case(self, i: int, task_id: str) -> TaskState:
"""Process a single case (thread-safe)"""
question = self.dataset.get_question(i)
dataset_id = f"{self.dataset_type}_{self.dataset.split}_{i}"
dataset_id = f"{self.dataset_type}_{i}"
gold = self.dataset.get_answer(question)
# Apply template if available
if question["dataset_type"] in TEMPLATE_REGISTRY:
prompt = TEMPLATE_REGISTRY[question["dataset_type"]].format(question=question["problem"] if "problem" in question else question["question"])
else:
prompt = question["problem"] if "problem" in question else question["question"]
prompt = self.dataset.get_prompt(question)
task_state = TaskState(
case_id=task_id,
@ -361,12 +459,15 @@ class Processor:
n_cases = len(self.dataset.questions)
print(f"\nProcessing {n_cases} {self.dataset_type.upper()} questions...")
print(f"Server: {self.server_url}")
print(f"Server: {self.server_url} (model: {self.model_name})")
print(f"Threads: {self.threads}")
print(f"Max tokens: {self.n_predict}")
print(f"Seed: {self.seed}")
print(f"Sampling: temp={self.sampling_config.get('temperature', 'skip')}, top-k={self.sampling_config.get('top_k', 'skip')}, top-p={self.sampling_config.get('top_p', 'skip')}, min-p={self.sampling_config.get('min_p', 'skip')}")
print(f"Grader: {self.grader.grader_type}", end="")
if self.grader.grader_type == "llm":
print(f" (judge server: {self.judge_server_url}, model: {self.judge_model_name})", end="")
judge_model = self.judge_model_name if self.judge_model_name else self.model_name
print(f" (judge server: {self.judge_server_url}, model: {judge_model})", end="")
print()
print()
@ -389,9 +490,14 @@ class Processor:
print(" Task ID Dataset Prompt (first 40 chars) Expected Status")
for i, task_id in task_list:
question = self.dataset.get_question(i)
prompt = question["problem"] if "problem" in question else question["question"]
prompt = self.dataset.get_prompt(question)
gold = self.dataset.get_answer(question)
truncated_prompt = prompt[:40] + "..." if len(prompt) > 40 else prompt
first_line = prompt.split('\n')[0]
truncated_prompt = first_line[:43]
if len(first_line) > 43:
truncated_prompt += "..."
else:
truncated_prompt = truncated_prompt.ljust(43) + "..."
print(f" {task_id:<20} {self.dataset_type.upper()} {truncated_prompt:<40} {gold:<10} pending")
print()
@ -413,7 +519,13 @@ class Processor:
# Print task completion status
extracted_display = task_state.extracted if task_state.extracted else "N/A"
success_ratio = correct / total if total > 0 else 0.0
print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {task_state.prompt[:40]:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]")
first_line = task_state.prompt.split('\n')[0]
truncated_prompt = first_line[:43]
if len(first_line) > 43:
truncated_prompt += "..."
else:
truncated_prompt = truncated_prompt.ljust(43) + "..."
print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {truncated_prompt:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]")
if self.verbose:
print(f"\nCase {total}: {task_state.correct}")
@@ -456,7 +568,7 @@ def main():
"--dataset",
type=str,
default="aime",
choices=["aime", "gsm8k", "gpqa"],
help="Dataset type (default: aime)"
)
parser.add_argument(
@@ -474,8 +586,32 @@ def main():
parser.add_argument(
"--n_predict",
type=int,
default=-1,
help="Max tokens to predict per prompt (default: -1, infinite)"
)
parser.add_argument(
"--temperature",
type=float,
default=None,
help="Sampling temperature (default: not passed)"
)
parser.add_argument(
"--top-k",
type=int,
default=None,
help="Top K sampling (default: not passed)"
)
parser.add_argument(
"--top-p",
type=float,
default=None,
help="Top P sampling (default: not passed)"
)
parser.add_argument(
"--min-p",
type=float,
default=None,
help="Min P sampling (default: not passed)"
)
parser.add_argument(
"--threads",
@@ -503,16 +639,9 @@ def main():
parser.add_argument(
"--grader-type",
type=str,
default="llm",
choices=["regex", "cli", "llm"],
help="Grader type: regex, cli, or llm (default: llm)"
)
parser.add_argument(
"--grader-script",
@@ -529,21 +658,37 @@ def main():
parser.add_argument(
"--judge-model",
type=str,
default="",
help="Model name for LLM judge (default: same as main model)"
)
args = parser.parse_args()
# Validate grader type for GPQA
if args.dataset == "gpqa" and args.grader_type != "llm":
print("Error: GPQA dataset requires --grader-type llm")
parser.print_help()
sys.exit(1)
grader = Grader(
grader_type=args.grader_type,
grader_script=args.grader_script,
judge_model_name=args.judge_model if args.judge_model else args.model
)
if args.grader_type == "llm" and not args.judge_server:
print("Warning: Using same server for LLM judge (no --judge-server specified)")
sampling_config = {"n_predict": args.n_predict}
if args.temperature is not None:
sampling_config["temperature"] = args.temperature
if args.top_k is not None:
sampling_config["top_k"] = args.top_k
if args.top_p is not None:
sampling_config["top_p"] = args.top_p
if args.min_p is not None:
sampling_config["min_p"] = args.min_p
processor = Processor(
server_url=args.server,
n_predict=args.n_predict,
@@ -553,7 +698,8 @@ def main():
model_name=args.model,
judge_server_url=args.judge_server,
judge_model_name=args.judge_model,
dataset_type=args.dataset,
sampling_config=sampling_config
)
eval_state = processor.process(n_cases=args.n_cases, seed=args.seed)


@@ -0,0 +1,29 @@
{
"id": "gpqa",
"tasks": [
"gpqa"
],
"task_states": {
"gpqa": {
"total": 1,
"correct": 0,
"cases": {
"gpqa": [
{
"case_id": "gpqa_000_184",
"prompt": "Consider a system with Hamiltonian operator $H = \\varepsilon \\vec{\\sigma}.\\vec{n}$. Here, $\\vec{n}$ is an arbitrary unit vector, $\\varepsilon $ is a constant of dimension energy, and components of $\\vec{\\sigma}$ are the Pauli spin matrices. What are the eigenvalues of the Hamiltonian operator?\n\n\n(A) +\\hbar/2, -\\hbar/2\n(B) +1, -1\n(C) +\\varepsilon \\hbar/2, - \\varepsilon \\hbar/2\n(D) + \\varepsilon, -\\varepsilon\n\n\nExpress your final answer as the corresponding option 'A', 'B', 'C', or 'D'.\n",
"gold": "+ \\varepsilon, -\\varepsilon\n",
"pred": null,
"extracted": null,
"correct": false,
"status": "error: HTTPConnectionPool(host='localhost', port=8034): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError(\"HTTPConnection(host='localhost', port=8034): Failed to establish a new connection: [Errno 61] Connection refused\"))"
}
]
}
}
},
"sampling_config": {
"temperature": 0,
"max_tokens": 2048
}
}


@@ -0,0 +1,36 @@
# llama-server-simulator
A standalone Python script that simulates the llama-server HTTP endpoint for testing the eval script.
## Features
- HTTP Server with OpenAI-compatible `/v1/chat/completions` endpoint
- AIME Dataset Integration - Loads 90 questions from HuggingFace
- Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
- Configurable Success Rate - Control correct/wrong answer generation (0-1)
- Debug Logging - Troubleshoot matching issues
## Usage
```bash
python llama-server-simulator.py --success-rate 0.8
```
## Arguments
- `--success-rate`: Probability of returning correct answer (0.0-1.0, default: 0.8)
- `--port`: Server port (default: 8033)
- `--debug`: Enable debug logging (default: False)
## Testing
```bash
./test-simulator.sh
```
## Implementation Details
- Uses Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via HuggingFace datasets library
- Wrong answers generated by incrementing expected answer
- Debug output written to stderr
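The response-generation behavior described above might look roughly like this (a minimal sketch; `simulate_answer` is a hypothetical name and the script's actual logic may differ):

```python
import random

def simulate_answer(expected: str, success_rate: float, rng=random) -> str:
    """Return the gold answer with probability success_rate; otherwise return a
    deliberately wrong answer produced by incrementing the expected number."""
    if rng.random() < success_rate:
        return expected
    return str(int(expected) + 1)  # e.g. "116" -> "117"
```

With `--success-rate 0.0` every reply is wrong by exactly one, which makes grading failures easy to spot.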


@@ -1,189 +0,0 @@
# llama-server-simulator Implementation Plan
## Overview
Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
## Goals
1. Simulate llama-server's `/v1/chat/completions` endpoint
2. Accept requests and respond with expected answers from AIME dataset
3. Implement configurable success rate (sometimes right, sometimes wrong)
4. Use regex matching to find questions in incoming requests
5. Test with curl requests before integrating with eval script
## Implementation Plan
### Phase 1: Basic Simulator Structure
- Create `llama-server-simulator.py` script
- Set up Flask/FastAPI HTTP server
- Implement `/v1/chat/completions` endpoint
- Handle basic request/response format
### Phase 2: AIME Dataset Integration
- Load AIME dataset
- Store questions and expected answers
- Implement regex matching to find questions in incoming requests
- Extract expected answer from matched question
### Phase 3: Response Generation
- Implement success rate configuration
- Randomly determine if response should be correct or incorrect
- Generate appropriate response based on success determination
- Format response in OpenAI-compatible format
### Phase 4: Testing
- Write curl commands to test basic functionality
- Test correct responses
- Test incorrect responses
- Test edge cases (no question found, etc.)
## Technical Details
### Server Framework
- Use Flask for simplicity
- Listen on configurable port
- Support JSON request/response format
### Request Format
```json
{
"model": "llama",
"messages": [
{"role": "user", "content": "Question text here"}
],
"temperature": 0,
"max_tokens": 2048
}
```
### Response Format
```json
{
"id": "chatcmpl-xxx",
"object": "chat.completion",
"created": 1234567890,
"model": "llama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Answer text here"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 50,
"total_tokens": 150
}
}
```
### AIME Dataset Integration
- Load from HuggingFace: "AI-MO/aimo-validation-aime"
- Store in memory for fast lookup
- Regex pattern to find question text in request
- Extract answer from matched question
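The in-memory lookup could be as simple as the sketch below (the field names are assumed from the HuggingFace rows; the helper name is hypothetical):

```python
def build_lookup(rows):
    """Map each question's text to its expected answer for O(1) retrieval."""
    return {row["problem"]: str(row["answer"]) for row in rows}

# rows shaped like entries of the AIME split
example_rows = [{"problem": "What is 2 + 2?", "answer": 4}]
answers = build_lookup(example_rows)
```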
### Success Rate Configuration
- Command-line argument: `--success-rate 0.8` (80% success rate)
- Randomly determine correctness based on rate
- Log when responses are correct vs incorrect
### Testing Strategy
1. Start simulator with default settings
2. Send curl request with known question
3. Verify response contains expected answer
4. Test with different success rates
5. Test edge cases
## Implementation Steps
### Step 1: Basic Server Setup
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    payload = request.get_json()  # parse the incoming OpenAI-style request
    response = {}                 # built from the matched question (later steps)
    return jsonify(response)
```
### Step 2: Load AIME Dataset
```python
import datasets
ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train")
# Store in memory
```
### Step 3: Regex Matching
```python
import re
def find_question_in_request(request_text):
# Regex pattern to find question
pattern = r"question:\s*(.*?)\n"
match = re.search(pattern, request_text, re.DOTALL)
return match.group(1) if match else None
```
### Step 4: Response Generation
```python
import random
def generate_response(question, success_rate):
if random.random() < success_rate:
return get_expected_answer(question)
else:
return get_wrong_answer(question)
```
### Step 5: Testing with Curl
```bash
curl -X POST http://localhost:8033/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [{"role": "user", "content": "Question text"}]
}'
```
## Configuration Options
- `--port`: Server port (default: 8033)
- `--success-rate`: Success rate 0-1 (default: 0.8)
- `--host`: Server host (default: localhost)
- `--dataset-split`: AIME split to use (default: train)
## Expected Output
```
=== llama-server-simulator ===
Server running on http://localhost:8033
Success rate: 0.8
AIME dataset loaded: 90 questions
```
## Testing Checklist
- [ ] Server starts successfully
- [ ] Basic request/response works
- [ ] Correct answer returned when success rate allows
- [ ] Wrong answer returned when success rate doesn't allow
- [ ] No question found returns error
- [ ] Multiple requests work correctly
- [ ] Different success rates work as expected
## Next Steps
1. ✓ Implement basic server structure
2. ✓ Load AIME dataset
3. ✓ Implement regex matching
4. ✓ Add response generation with success rate
5. ✓ Test with curl commands
6. ✓ Integrate with eval script once simulator works
7. ✓ Implement eval state object
8. ✓ Implement processor object
9. ✓ Add real-time progress reporting
10. ✓ Add enhanced grading system with LLM judge


@@ -1,138 +0,0 @@
# llama-server-simulator Implementation Summary
## Overview
Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
## Features Implemented
### 1. HTTP Server
- Flask-based `/v1/chat/completions` endpoint
- OpenAI-compatible response format
- Configurable port and host
### 2. AIME Dataset Integration
- Loads AIME dataset from HuggingFace
- In-memory storage for fast lookup
- 90 questions loaded from train split
### 3. Intelligent Question Matching
- **Exact matching**: Direct string comparison
- **LaTeX removal**: Removes `$...$` formatting for flexible matching
- **Levenshtein distance**: Calculates similarity between strings
- **Partial matching**: Finds best match even with small differences
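A plausible shape for the LaTeX-removal step (the simulator's actual regex may differ):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop inline $...$ LaTeX, and collapse whitespace so
    near-identical prompts compare equal."""
    text = re.sub(r"\$[^$]*\$", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()
```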
### 4. Response Generation
- Configurable success rate (0-1)
- Returns correct answers when success rate allows
- Returns wrong answers when success rate doesn't allow
- Wrong answers are generated by incrementing the expected answer
### 5. Debug Logging
- Debug messages written to stderr
- Logs request content, matching results, and distances
- Helps troubleshoot matching issues
## Configuration Options
```bash
python3 llama-server-simulator.py \
--port 8034 \
--host localhost \
--success-rate 0.8 \
--dataset-split train
```
## Testing Results
### Test 1: Correct Answer
- **Success rate**: 0.8
- **Expected answer**: 116
- **Result**: ✓ Correct (116)
### Test 2: Wrong Answer
- **Success rate**: 0.0
- **Expected answer**: 116
- **Result**: ✓ Wrong (117)
### Test 3: No Matching Question
- **Request**: "What is the capital of France?"
- **Result**: ✓ Returns error "No matching question found"
### Test 4: Success Rate Verification
- **Success rate**: 0.8
- **Requests**: 10
- **Correct answers**: 8/10 (80%)
- **Result**: ✓ Success rate working as expected
## Technical Details
### Matching Algorithm
1. Try exact match (case-insensitive)
2. Try match after removing LaTeX formatting
3. Calculate Levenshtein distance for partial matches
4. Return best match if distance < 0.3 (30% difference)
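The four steps reduce to an edit-distance helper plus a threshold check; a minimal sketch (the script's actual implementation may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def best_match(query, questions):
    """Return the closest question, or None if the best normalized distance >= 0.3."""
    dist, match = min((levenshtein(query, q) / max(len(query), len(q), 1), q)
                      for q in questions)
    return match if dist < 0.3 else None
```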
### Response Format
```json
{
"id": "chatcmpl-1769864875",
"object": "chat.completion",
"created": 1769864875,
"model": "llama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "116"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 50,
"total_tokens": 150
}
}
```
## Files Created
1. `llama-server-simulator.py` - Main simulator script
2. `test-simulator.sh` - Basic test script
3. `test-simulator-comprehensive.sh` - Comprehensive test script
4. `llama-server-simulator-plan.md` - Implementation plan
5. `llama-eval-discussion.md` - Discussion notes
## Next Steps
1. ✓ Basic simulator structure
2. ✓ AIME dataset integration
3. ✓ Question matching with Levenshtein distance
4. ✓ Response generation with configurable success rate
5. ✓ Testing with curl requests
6. ✓ Integrate with eval script
7. ✓ Implement eval state object
8. ✓ Implement processor object
9. ✓ Add real-time progress reporting
10. ✓ Add enhanced grading system with LLM judge
## Known Limitations
1. Only supports AIME dataset (train split)
2. Matching is case-insensitive
3. Wrong answers are simple increments (not realistic)
4. No support for multiple endpoints
5. No distributed evaluation
## Future Enhancements
1. Support multiple datasets
2. More sophisticated wrong answer generation
3. Multiple endpoint support
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization
7. Enhanced grading with LLM judge
8. Response truncation for better answer extraction