add gpqa + sampling + docs

parent e8a807519a
commit cffd268bb3

@ -0,0 +1,85 @@
# llama-eval Implementation Summary

## Overview

Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).

## Key Features

- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
- **Flexible Grading**: Regex, CLI, or LLM-based grading
- **Parallel Processing**: Configurable thread count for concurrent requests
- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
- **Real-time Feedback**: Progress tracking with detailed output
- **JSON Output**: Complete eval state saved for debugging
- **GPQA Support**: Answer shuffling with reproducible results

## Architecture

### Eval State
```python
@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]]
    sampling_config: Dict[str, Any]
```
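Because `EvalState` is a plain dataclass, the whole eval can be serialized with `dataclasses.asdict` plus `json.dumps`; a minimal sketch (the field values are illustrative, and the class mirrors the summary above rather than the exact source):

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, Dict, List

# Mirror of the EvalState fields shown above (illustrative, not the exact source).
@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]]
    sampling_config: Dict[str, Any]

state = EvalState(
    id="aime",
    tasks=["aime"],
    task_states={"aime_000_001": {"status": "pending"}},
    sampling_config={"temperature": 0.7, "top_p": 0.95},
)

# asdict() turns the whole state into plain dicts, ready for periodic JSON dumps.
serialized = json.dumps(asdict(state), indent=2)
```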
### Processor
- Handles processing, grading, and state management
- Thread-safe concurrent execution
- Configurable sampling parameters

### Grader
- Abstract grading interface supporting multiple types
- Regex grader with dataset-specific patterns
- CLI grader with external script interface
- LLM grader with configurable server and model

### Datasets
- `AimeDataset`: 90 AIME 2025 questions
- `Gsm8kDataset`: 7473 math word problems
- `GpqaDataset`: 198 GPQA Diamond questions with shuffling

## Configuration

### Sampling Parameters (Optional)
- `--temperature`: Sampling temperature
- `--top-k`: Top K sampling
- `--top-p`: Top P sampling
- `--min-p`: Min P sampling
- Only passed if explicitly specified
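The "only passed if explicitly specified" behavior can be sketched as a small filter over the optional flags (the helper name is illustrative):

```python
# Sketch: forward optional sampling parameters only when the user set them,
# so the server keeps its own defaults otherwise (helper name is illustrative).
def build_sampling_config(temperature=None, top_k=None, top_p=None, min_p=None):
    params = {"temperature": temperature, "top_k": top_k,
              "top_p": top_p, "min_p": min_p}
    # Drop anything that was not explicitly specified.
    return {k: v for k, v in params.items() if v is not None}

print(build_sampling_config(temperature=0.7, top_k=40))
# {'temperature': 0.7, 'top_k': 40}
```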
### Grading Types
- **regex**: Built-in patterns for each dataset
- **cli**: External script with `--answer` and `--expected` args
- **llm**: LLM-based extraction with configurable server/model

## Output Format

### Progress Table
```
Task ID                  Dataset Prompt (first 43 chars)                     Expected   Status
gpqa_000_001             GPQA    Complete the following reactions and sel... A          pending
```

### Results Summary
```
============================================================
Results: 8/10 correct (80.0%)
============================================================
```

### JSON Output
Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.

## Technical Details

- Default max tokens: -1 (infinite)
- Default grader type: llm
- Default seed: 1234
- Default threads: 32
- Prompt truncation: First 43 chars + padding + "..."
- GPQA requires LLM grader (returns letter A/B/C/D)
- Judge model defaults to evaluated model if not specified
@ -0,0 +1,105 @@
# llama-eval Evaluation Tool

Simple evaluation tool for llama.cpp with support for multiple datasets.

## Features

- **Multiple Datasets**: AIME, GSM8K, GPQA
- **Flexible Grading**: Regex, CLI, or LLM-based grading
- **Parallel Processing**: Configurable thread count
- **Real-time Feedback**: Progress tracking with detailed output
- **Sampling Parameters**: Temperature, Top K, Top P, Min P
- **JSON Output**: Complete eval state saved for debugging

## Usage

```bash
python llama-eval-new.py \
    --server http://127.0.0.1:8013 \
    --model gpt-oss-20b-hf-low \
    --judge-model gpt-oss-20b-hf-medium \
    --dataset aime \
    --n_cases 10 \
    --grader-type llm \
    --seed 42
```

## CLI Arguments

- `--server`: llama-server URL (default: http://127.0.0.1:8013)
- `--model`: Model name for evaluation (default: llama)
- `--judge-model`: Model name for LLM judge (default: same as main model)
- `--judge-server`: Server URL for LLM judge (default: same as main server)
- `--dataset`: Dataset type (aime, gsm8k, gpqa)
- `--n_cases`: Number of cases to evaluate (default: all)
- `--n_predict`: Max tokens to predict per prompt (default: -1, infinite)
- `--temperature`: Sampling temperature (default: not passed)
- `--top-k`: Top K sampling (default: not passed)
- `--top-p`: Top P sampling (default: not passed)
- `--min-p`: Min P sampling (default: not passed)
- `--threads`: Number of threads for parallel requests (default: 32)
- `--verbose`: Show detailed output for each case
- `--output`: Output file for eval state (default: llama-eval-state.json)
- `--grader-type`: Grader type (regex, cli, llm; default: llm)
- `--grader-script`: Path to CLI grader script (required for --grader-type cli)
- `--seed`: Random seed for shuffling (default: 1234)

## Datasets

### AIME
- 90 questions from the 2025 AIME competition
- Answers in boxed format: `\boxed{answer}`
- Requires regex grader or LLM grader

### GSM8K
- 7473 math word problems
- Answers are numeric values
- Requires regex grader or LLM grader

### GPQA
- 198 questions from the GPQA Diamond dataset
- Multiple choice with shuffled options
- Requires LLM grader (returns letter A, B, C, or D)

## Grading Types

### Regex Grader
Built-in patterns for different datasets:
- AIME: `\boxed{(\d+)}|\b(\d+)\b`
- GSM8K: `\b(\d+)\b`
- GPQA: Letter extraction (A, B, C, D)
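As a sketch of how such a pattern is applied (the helper is illustrative; the pattern is the AIME one from above, escaped for use as a Python regex — group 1 captures the boxed answer, group 2 the plain-number fallback):

```python
import re

# AIME pattern from the table above, escaped so it matches a literal \boxed{...}.
AIME_PATTERN = r"\\boxed\{(\d+)\}|\b(\d+)\b"

def extract_answer(response):
    """Return the first boxed or plain numeric answer found, or None."""
    m = re.search(AIME_PATTERN, response)
    if m is None:
        return None
    # The boxed capture takes priority; otherwise fall back to the plain number.
    return m.group(1) or m.group(2)

print(extract_answer(r"The final answer is \boxed{204}."))  # 204
```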
### CLI Grader
External script interface:
```bash
./grader.sh --answer <pred> --expected <gold>
```
Returns exit code 0 if correct, non-zero if incorrect.
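A minimal grader satisfying this contract could look like the following Python sketch (the script could equally be a shell script; the case-insensitive comparison is an illustrative choice). As an executable it would call `grade(sys.argv[1:])` and exit with the returned code:

```python
import argparse

# Minimal CLI grader sketch matching the ./grader.sh contract above:
# return 0 when --answer equals --expected, 1 otherwise.
def grade(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--answer", required=True)
    parser.add_argument("--expected", required=True)
    args = parser.parse_args(argv)
    # Exit code 0 means correct, non-zero means incorrect.
    return 0 if args.answer.strip().lower() == args.expected.strip().lower() else 1

print(grade(["--answer", "204", "--expected", "204"]))  # 0
```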
### LLM Grader
Uses an LLM to extract and compare answers:
- Configurable server and model
- Includes problem context in prompt
- Case-insensitive comparison
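The request the LLM grader sends can be sketched as an OpenAI-compatible chat-completions payload (the prompt wording mirrors the tool's extraction prompt; the helper itself is illustrative):

```python
import json

# Sketch of the judge request payload, POSTed to {judge_server}/v1/chat/completions.
def build_judge_request(judge_model, response_text):
    prompt = (
        "Extract the answer from this response:\n\n===\n\n"
        f"Response: {response_text}\n\n===\n\n"
        "Please provide only the extracted answer, nothing else. "
        "If there is no clear answer in the response, reply with 'no answer'."
    )
    return {
        "model": judge_model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,   # deterministic extraction
        "max_tokens": 256,
    }

payload = build_judge_request("gpt-oss-20b-hf-medium", "So the final answer is 42.")
body = json.dumps(payload)  # request body sent to the judge server
```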
## Output

### Progress Table
```
Task ID                  Dataset Prompt (first 43 chars)                     Expected   Status
gpqa_000_001             GPQA    Complete the following reactions and sel... A          pending
```

### Results
```
============================================================
Results: 8/10 correct (80.0%)
============================================================
```

### JSON Output
Complete eval state saved to the output file with:
- Task IDs and correctness status
- Prompts and extracted answers
- Sampling configuration
- Processing metadata
@ -1,395 +0,0 @@
# llama-eval Implementation Discussion

## Overview
Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.

## Key Requirements from ggerganov

### 1. Simplify and Focus on One Eval
- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially

### 2. Implement an "eval state" object
- ID
- List of tasks
- Task states
- Sampling config

### 3. Implement a "processor" object
- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)

### 4. Processor responsibilities
- Accepts an eval state
- Starts processing
- Dumps the eval state periodically as it progresses

### 5. Real-time feedback
- Default: show "correct / not correct" for each task
- Verbose mode: show the produced answer vs the expected answer as soon as a task completes

### 6. Grading approach
- Abstract grading to support an external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid issues seen in GPT-OSS evals)

### 7. Output format
- Use structured output (JSON) instead of boxed text

## Current Implementation Analysis

### What exists in llama-eval.py:
- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to an OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting

### What needs to be removed:
- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)

## Discussion Points

### 1. Eval State Object Structure
**Status: Under Discussion**

Questions:
- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?

### 2. Processor Architecture
**Status: Not Started**

Questions:
- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?

### 3. Grader Interface
**Status: Not Started**

Questions:
- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?

### 4. Checkpointing
**Status: Not Started**

Questions:
- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?

### 5. Real-time Output
**Status: Not Started**

Questions:
- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?

### 6. Output Format
**Status: Not Started**

Questions:
- Should responses be in JSON format?
- How should the grader interface work with JSON output?

## Next Steps

1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format

## References
- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195

## Session Work Summary

### llama-server-simulator Implementation

**Created:**
- `llama-server-simulator.py` - Standalone Python script simulating the llama-server HTTP endpoint
- `test-simulator.sh` - Test script for verifying simulator functionality
- `llama-server-simulator-plan.md` - Implementation plan
- `simulator-summary.md` - Summary of the implementation

**Features Implemented:**
1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with OpenAI-compatible format
2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
5. Debug Logging - Helps troubleshoot matching issues

**Testing Results:**
- ✅ Correct answers returned when the success rate allows
- ✅ Wrong answers returned when the success rate doesn't allow
- ✅ Requests with no matching question return errors
- ✅ Success rate verified (80% over 10 requests)
- ✅ HuggingFace dataset caching working correctly

**Key Technical Decisions:**
- Used Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via the HuggingFace datasets library
- Wrong answers generated by incrementing the expected answer
- Debug output written to stderr for better visibility

**Refactoring:**
- Extracted the repeated question string into a TEST_QUESTION variable
- Created a make_request() helper function to reduce code duplication
- Added proper error handling for error responses
- Fixed the simulator stopping issue at script completion

### llama-eval-new.py Implementation

**Created:**
- `llama-eval-new.py` - Simplified evaluation tool focused on AIME

**Features Implemented:**
1. **Eval State Object** - Structured dataclass with ID, tasks, task states, and sampling config
2. **Processor Object** - Handles processing, grading, and state management
3. **Real-time Feedback** - Shows correct/incorrect status for each case
4. **Flexible Grading System** - Supports regex, CLI, and LLM-based grading
5. **Structured JSON Output** - Saves the complete eval state to a JSON file
6. **HuggingFace Dataset Caching** - Uses the cached dataset path to avoid HF Hub requests
7. **Enhanced Answer Extraction** - Extracts answers from full responses for display

**Grading System:**
- **Regex Grading**: Built-in patterns for different task types
  - `aime`: `\boxed{(\d+)}|\b(\d+)\b` (handles boxed and plain text)
  - `gsm8k`: `\b(\d+)\b` (extract first number)
  - `mmlu`, `hellaswag`, `arc`, `winogrande`: `[A-D]` (extract single letter)
- **CLI Grading**: External script interface
  - Script accepts `--answer <pred>` and `--expected <gold>`
  - Returns exit code 0 if correct, non-zero if incorrect
  - 30-second timeout to prevent hanging
- **LLM Judge**: Generic answer extraction using an LLM
  - Uses the configured server and model for extraction
  - Includes the problem statement in the prompt for context
  - Case-insensitive comparison
  - Returns the extracted answer for display
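The CLI grading call with its 30-second timeout can be sketched with `subprocess.run` (the helper name is illustrative; the dummy grader below stands in for an external script):

```python
import subprocess
import sys

# Sketch of the CLI grading call: run the external grader with a timeout so a
# hung script cannot stall the whole eval (helper name is illustrative).
def grade_cli(script_argv, pred, gold):
    try:
        result = subprocess.run(
            [*script_argv, "--answer", pred, "--expected", gold],
            timeout=30,            # 30-second timeout, as documented above
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0  # exit code 0 means "correct"

# Dummy grader: a one-liner that exits 0 iff the two argument values match.
dummy = [sys.executable, "-c",
         "import sys; sys.exit(0 if sys.argv[2] == sys.argv[4] else 1)"]
print(grade_cli(dummy, "42", "42"))  # True
```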
**Configuration Options:**
- `--server`: llama-server URL (default: http://localhost:8033)
- `--n_cases`: Number of cases to evaluate (default: all)
- `--n_predict`: Max tokens to predict per prompt (default: 2048)
- `--threads`: Number of threads for parallel requests (default: 32)
- `--verbose`: Show detailed output for each case
- `--output`: Output file for eval state (default: llama-eval-state.json)
- `--grader-type`: `regex`, `cli`, or `llm`
- `--grader-regex-type`: aime, gsm8k, mmlu, hellaswag, arc, winogrande
- `--grader-script`: Path to CLI grader script
- `--judge-server`: Server URL for LLM judge (default: same as main server)
- `--judge-model`: Model name for LLM judge (default: same as main model)

**Testing Results:**
- ✅ Works with the simulator at 100% success rate (all correct)
- ✅ Works with the simulator at 0% success rate (all incorrect)
- ✅ Works with the simulator at 80% success rate (8/10 correct)
- ✅ Real-time verbose output shows gold/pred/status for each case
- ✅ JSON output contains the complete eval state with all cases
- ✅ HF Hub telemetry disabled (no warnings)
- ✅ Uses the cached dataset path to avoid HF Hub requests when available
- ✅ Regex grader extracts answers correctly from various formats
- ✅ LLM judge can extract answers with problem context
- ✅ Response truncation focuses grading on the final answer
- ✅ Case-insensitive matching works for both regex and LLM grader
- ✅ Judge model and server configuration propagate correctly
- ✅ Progress table shows extracted answers instead of full responses

**Key Technical Decisions:**
- Removed Levenshtein matching - the eval script only sends requests and validates answers
- Abstract grading interface for external grader support
- Exact-match requirement for regex patterns
- Handles both boxed and plain-text formats for AIME answers
- 30-second timeout for the CLI grader
- Validates that the grader script exists before running
- Judge parameters set once during Grader construction
- LLM judge prompt includes the problem statement for better extraction
- Response truncation to the last 2-3 lines focuses grading on the final answer
- Case-insensitive comparison for more flexible matching

**Refactoring:**
- Removed all task implementations except AIME
- Removed regex-based grading (moved to the flexible grader system)
- Removed multiple endpoint support
- Removed complex task loading logic
- Removed summary reporting (replaced with real-time feedback)
- Added HuggingFace dataset caching optimization
- Added LLM grader support with configurable server and model
- Added response truncation before grading
- Refactored the grader interface to return extracted answers

### llama-eval-new.py Threading and Model Parameter Updates

**Changes Made:**
1. **Threading Support** - Added ThreadPoolExecutor for parallel request processing
   - Added `from concurrent.futures import ThreadPoolExecutor, as_completed`
   - Created a `_process_single_case()` method for thread-safe case processing
   - Refactored `process()` to use ThreadPoolExecutor with a configurable thread count
   - Updated progress tracking to work with concurrent execution
   - Thread-safe eval state updates (task_states and counters)

2. **Model Parameter** - Added a `--model` argument to specify the model name in request data
   - Added a `model_name` parameter to `Processor.__init__()`
   - Updated `_make_request()` to use the provided model name or default to "llama"
   - Added the `--model` argument to the argument parser
   - The model name is included in the request JSON, e.g. `"model": "gpt-oss-20b-hf"`
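The threading model above can be sketched as follows; `process_case` stands in for `_process_single_case()` and the grading result is dummy data:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in for _process_single_case(): returns (case_id, is_correct).
def process_case(case_id):
    return case_id, case_id % 2 == 0  # dummy grading result

cases = list(range(10))
correct = 0
# max_workers mirrors the --threads argument.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process_case, c) for c in cases]
    for fut in as_completed(futures):   # results arrive as they finish
        _case_id, ok = fut.result()
        correct += ok
print(f"Results: {correct}/{len(cases)} correct")  # Results: 5/10 correct
```

In the real tool the per-case function also updates shared eval state, which is why the thread-safe counter updates mentioned above matter.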
**Testing Results:**
- ✅ Works with 2 threads (5 cases processed in ~0.2s)
- ✅ Works with 4 threads (slightly faster throughput)
- ✅ Model parameter correctly added to request data
- ✅ Thread-safe progress tracking with tqdm
- ✅ No race conditions in eval state updates

**Key Technical Decisions:**
- Used ThreadPoolExecutor for simple, effective parallelism
- No rate limiting needed (the server can handle concurrent requests)
- Thread-safe counter updates for correct/total tracking
- Progress bar shows completion status across all threads
- Model parameter is optional - defaults to "llama" if not specified

**Refactoring:**
- Extracted single-case processing into a `_process_single_case()` method
- Changed from a sequential loop to ThreadPoolExecutor with futures
- Updated verbose output to show the total count instead of the index
- Made eval state updates thread-safe

### llama-eval-new.py Enhanced Grading System

**Changes Made:**
1. **Enhanced Grader Interface** - Updated to return extracted answers
   - The `grade()` method now returns `Tuple[bool, Optional[str]]` (correctness + extracted answer)
   - Added an `extracted` field to the `TaskState` dataclass
   - All grader types (regex, cli, llm) now return extracted answers

2. **Improved Regex Grader**
   - New `_extract_answer_regex()` method extracts answers using the configured patterns
   - Supports case-insensitive matching
   - Returns the first valid match found
   - Handles both single values and multiple matches

3. **LLM-Based Judge**
   - New `_grade_llm()` method for generic answer extraction
   - Includes the problem statement in the prompt for context
   - Configurable server URL (defaults to the main server)
   - Configurable model name (defaults to the main model)
   - Case-insensitive comparison
   - Returns the extracted answer for display

4. **Response Truncation**
   - New `_truncate_response()` method keeps only the last 2-3 lines
   - Applied before grading to focus on the final answer section

5. **CLI Grader Update**
   - Now also returns the extracted answer
   - Returns None if grading fails

6. **Display Updates**
   - Progress table shows the extracted answer instead of the full response
   - Verbose mode shows the full response plus the extracted answer

7. **New CLI Arguments**
   - `--grader-type`: Added an "llm" option
   - `--judge-server`: Separate server for the LLM judge
   - `--judge-model`: Separate model for the LLM judge
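The response truncation described above can be sketched as follows (the exact behavior of `_truncate_response()` may differ; this keeps a configurable number of non-empty trailing lines):

```python
# Sketch: keep only the last few non-empty lines of a response so grading
# focuses on the final-answer section (the kept count is illustrative).
def truncate_response(response, keep_lines=3):
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return "\n".join(lines[-keep_lines:])

full = "Step 1: pair terms.\nStep 2: sum pairs.\nStep 3: multiply.\nSo the final answer is 5050."
print(truncate_response(full, keep_lines=2))
```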
**Testing Results:**
- ✅ Regex grader extracts answers correctly from various formats
- ✅ LLM judge can extract answers with problem context
- ✅ Response truncation focuses grading on the final answer
- ✅ Case-insensitive matching works for both regex and LLM grader
- ✅ Judge model and server configuration propagate correctly
- ✅ Progress table shows extracted answers instead of full responses

**Key Technical Decisions:**
- Judge parameters set once during Grader construction (not on each call)
- LLM judge prompt includes the problem statement for better extraction
- Response truncation to the last 2-3 lines focuses grading on the final answer
- Case-insensitive comparison for more flexible matching
- Judge configuration propagates through the Processor to the Grader
- Display shows the extracted answer for cleaner output

**Refactoring:**
- Removed judge parameters from `grade()` method calls
- Added `judge_server_url` and `judge_model_name` to the Grader class
- Updated `_grade_llm()` to use instance variables instead of parameters
- Simplified Processor initialization to pass the judge config to the grader
- Updated startup info to show the judge server and model

### llama-eval-new.py GSM8K Dataset Support

**Changes Made:**
1. **GSM8K Dataset Integration** - Added support for the GSM8K dataset alongside AIME
   - Created a `Gsm8kDataset` class with proper answer extraction logic
   - GSM8K uses a `"question"` field instead of a `"problem"` field
   - The GSM8K answer field contains the full reasoning with a `####` prefix
   - Extracts the numeric answer from the answer field during initialization
   - Uses the same regex grader pattern as AIME (`\b(\d+)\b`)

2. **Dataset Type Configuration** - Added dataset selection support
   - Added a `--dataset` CLI argument with choices `aime` and `gsm8k`
   - Updated the `Processor` class to accept a `dataset_type` parameter
   - Dataset-specific initialization in `Processor.__init__()`
   - Dataset name displayed in the task summary table

3. **Template Registry** - Added dataset-specific prompt templates
   - AIME template: includes a `\boxed{}` wrapper for the final answer
   - GSM8K template: plain-text answer without a wrapper
   - Templates applied based on the `question["dataset_type"]` field

4. **Answer Extraction Logic** - Fixed GSM8K answer extraction
   - GSM8K has a pre-extracted `"gold"` field with the numeric answer
   - `Gsm8kDataset.get_answer()` checks for the `"gold"` field first
   - Falls back to the answer field if the gold field is not present
   - `AimeDataset.get_answer()` simplified to remove a duplicate method

5. **Task ID Format** - Fixed a duplicate prefix in task IDs
   - Changed from `f"{dataset_type}_{eval_state.id}_{chunk_idx:03d}_{i:03d}"`
   - To `f"{dataset_type}_{chunk_idx:03d}_{i:03d}"`
   - Removed the redundant `eval_state.id` (was "gsm8k" for GSM8K)

6. **Column Width Adjustments** - Improved table formatting
   - Task ID column: 25 characters
   - Dataset column: 5 characters
   - Prompt column: 40 characters
   - Expected column: 10 characters
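The `####`-prefixed gold answer mentioned above can be extracted during initialization with a small regex (the helper is illustrative; the sample text follows the GSM8K answer-field format):

```python
import re

# Sketch of GSM8K gold-answer extraction: the dataset's answer field ends with
# a line like "#### 72", so take the number after the marker. Commas that can
# appear in large numbers (e.g. "#### 1,200") are stripped.
def extract_gsm8k_gold(answer_field):
    m = re.search(r"####\s*([\d,]+)", answer_field)
    return m.group(1).replace(",", "") if m else answer_field.strip()

raw = "Natalia sold 48 clips in April and half as many in May.\n48 + 24 = 72\n#### 72"
print(extract_gsm8k_gold(raw))  # 72
```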
**Testing Results:**
- ✅ GSM8K dataset loads correctly with 7473 questions
- ✅ Numeric answers extracted from the full reasoning text
- ✅ Task summary table displays correctly with adjusted column widths
- ✅ Task IDs show the correct format (e.g., `gsm8k_000_3169`)
- ✅ Both AIME and GSM8K datasets work with the same script
- ✅ Answer extraction works for both boxed and plain-text formats
- ✅ Progress tracking shows extracted answers for both datasets

**Key Technical Decisions:**
- GSM8K uses a `"question"` field instead of a `"problem"` field
- The GSM8K answer field contains the full reasoning with a `####` prefix
- The numeric answer is extracted during dataset initialization
- The same regex grader pattern works for both datasets
- Dataset selection via CLI argument for separate runs
- The template registry supports different prompt formats per dataset
- Task ID format simplified to avoid duplication

**Refactoring:**
- Removed the duplicate `get_question()` method from `AimeDataset`
- Removed the "2025" suffix from the eval state ID (a remnant from the old version)
- Removed the "2025" suffix from the task summary table output
- Removed the "2025" suffix from the progress tracking output
- Updated `Processor.__init__()` to initialize the appropriate dataset based on type
- Updated `_process_single_case()` to handle both `"problem"` and `"question"` fields
- Updated the `process()` method to display the dataset name and use `dataset_type` for task states
@ -5,6 +5,7 @@ import json
import os
import re
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, asdict
@ -34,6 +35,15 @@ Please reason step by step, and put your final answer within \\boxed{{}}.
""",
    "gsm8k": """{question}
Please reason step by step, and provide your final answer.
""",
    "gpqa": """{Question}

(A) {A}
(B) {B}
(C) {C}
(D) {D}

Express your final answer as the corresponding option 'A', 'B', 'C', or 'D'.
""",
}
@ -96,6 +106,15 @@ class AimeDataset:
            return str(normalized) if normalized is not None else answer
        return str(answer)

    def get_prompt(self, question: Dict) -> str:
        """Get formatted prompt for the question"""
        if question["dataset_type"] == "gpqa":
            return TEMPLATE_REGISTRY["gpqa"].format(**question)
        else:
            return TEMPLATE_REGISTRY[question["dataset_type"]].format(
                question=question["problem"] if "problem" in question else question["question"]
            )

class Gsm8kDataset:
    def __init__(self, split: str = "train"):
        self.split = split
@ -146,17 +165,87 @@ class Gsm8kDataset:
            return str(normalized) if normalized is not None else answer
        return str(answer)

    def get_prompt(self, question: Dict) -> str:
        """Get formatted prompt for the question"""
        return TEMPLATE_REGISTRY[question["dataset_type"]].format(
            question=question["problem"] if "problem" in question else question["question"]
        )

class GpqaDataset:
    def __init__(self, variant: str = "diamond", seed: int = 1234):
        self.variant = variant
        self.seed = seed
        self.questions: List[Dict] = []
        self._load_dataset()

    def _load_dataset(self):
        print(f"Loading GPQA dataset (variant: {self.variant})...")
        import pandas as pd

        url = f"https://openaipublic.blob.core.windows.net/simple-evals/gpqa_{self.variant}.csv"
        df = pd.read_csv(url)

        rng = random.Random(self.seed)

        self.questions = []
        for _, row in df.iterrows():
            question = row.to_dict()
            question["dataset_type"] = "gpqa"

            # Shuffle the answer options
            correct_answer = question["Correct Answer"]
            incorrect_answers = [
                question["Incorrect Answer 1"],
                question["Incorrect Answer 2"],
                question["Incorrect Answer 3"]
            ]

            # Create a list of (answer, is_correct) tuples
            options = [(ans, ans == correct_answer) for ans in incorrect_answers]
            options.append((correct_answer, True))

            # Shuffle the options
            rng.shuffle(options)

            # Extract shuffled answers and determine the correct letter
            shuffled_answers = [ans for ans, _ in options]
            correct_letter = chr(ord('A') + options.index((correct_answer, True)))

            # Store shuffled answers and the correct letter
            question["shuffled_answers"] = shuffled_answers
            question["correct_letter"] = correct_letter

            self.questions.append(question)

        print(f"GPQA dataset loaded: {len(self.questions)} questions")

    def get_question(self, index: int) -> Dict:
        """Get question by index"""
        return self.questions[index]

    def get_answer(self, question: Dict) -> str:
        # GPQA returns the correct letter (A, B, C, or D)
        return question["correct_letter"]

    def get_prompt(self, question: Dict) -> str:
        """Get formatted prompt for the question"""
        return TEMPLATE_REGISTRY["gpqa"].format(
            Question=question["Question"],
            A=question["shuffled_answers"][0],
            B=question["shuffled_answers"][1],
            C=question["shuffled_answers"][2],
            D=question["shuffled_answers"][3]
        )

class Grader:
    def __init__(
        self,
        grader_type: str = "llm",
        grader_regex_type: str = "aime",
        grader_script: Optional[str] = None,
        judge_model_name: Optional[str] = None,
        judge_server_url: str = ""
    ):
        self.grader_type = grader_type
        self.grader_regex_type = grader_regex_type
        self.grader_script = grader_script
        self.judge_model_name = judge_model_name
        self.judge_server_url = judge_server_url
@ -164,9 +253,7 @@ class Grader:

    def _get_pattern(self) -> Optional[str]:
        if self.grader_type == "regex":
            if self.grader_regex_type not in GRADER_PATTERNS:
                raise ValueError(f"Unknown grader regex type: {self.grader_regex_type}")
            return GRADER_PATTERNS[self.grader_regex_type]
        return None

    def _extract_answer_regex(self, pred: str) -> Optional[str]:
@ -221,18 +308,21 @@ class Grader:
        """Grade using LLM-based extraction"""
        prompt = f"""Extract the answer from this response:

===

Response: {pred}

===

Please provide only the extracted answer, nothing else. If there is no clear answer in the response, reply with 'no answer'."""
        url = f"{self.judge_server_url}/v1/chat/completions"
        headers = {"Content-Type": "application/json"}
        data = {
            "model": self.judge_model_name,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
            "max_tokens": 256
        }

        try:
@ -264,14 +354,16 @@ class Processor:
    def __init__(
        self,
        server_url: str,
        n_predict: int = -1,
        threads: int = 32,
        verbose: bool = False,
        grader: Optional[Grader] = None,
        model_name: Optional[str] = None,
        judge_server_url: str = "",
        judge_model_name: Optional[str] = None,
        dataset_type: str = "aime",
        seed: int = 1234,
        sampling_config: Optional[Dict[str, Any]] = None
    ):
        self.server_url = server_url
        self.n_predict = n_predict
@@ -281,12 +373,14 @@ class Processor:
        self.judge_server_url = judge_server_url if judge_server_url else server_url
        self.judge_model_name = judge_model_name
        self.dataset_type = dataset_type
+       self.seed = seed
        self.grader = grader or Grader()
+       self.sampling_config = sampling_config or {"n_predict": n_predict}
        self.eval_state = EvalState(
            id=dataset_type,
            tasks=[dataset_type],
            task_states={},
-           sampling_config={"temperature": 0, "max_tokens": n_predict}
+           sampling_config=self.sampling_config
        )

        # Pass judge configuration to grader if using LLM grader
@@ -301,6 +395,8 @@ class Processor:
            self.dataset = AimeDataset()
        elif dataset_type == "gsm8k":
            self.dataset = Gsm8kDataset()
+       elif dataset_type == "gpqa":
+           self.dataset = GpqaDataset(variant="diamond", seed=self.seed)
        else:
            raise ValueError(f"Unknown dataset type: {dataset_type}")
@@ -311,9 +407,16 @@ class Processor:
        data = {
            "model": self.model_name if self.model_name else "llama",
            "messages": [{"role": "user", "content": prompt}],
-           "temperature": 0,
-           "max_tokens": self.n_predict
+           "n_predict": self.n_predict
        }
+       if self.sampling_config.get("temperature") is not None:
+           data["temperature"] = self.sampling_config["temperature"]
+       if self.sampling_config.get("top_k") is not None:
+           data["top_k"] = self.sampling_config["top_k"]
+       if self.sampling_config.get("top_p") is not None:
+           data["top_p"] = self.sampling_config["top_p"]
+       if self.sampling_config.get("min_p") is not None:
+           data["min_p"] = self.sampling_config["min_p"]

        response = requests.post(url, headers=headers, json=data)
        response.raise_for_status()
@@ -322,14 +425,9 @@ class Processor:
    def _process_single_case(self, i: int, task_id: str) -> TaskState:
        """Process a single case (thread-safe)"""
        question = self.dataset.get_question(i)
-       dataset_id = f"{self.dataset_type}_{self.dataset.split}_{i}"
+       dataset_id = f"{self.dataset_type}_{i}"
        gold = self.dataset.get_answer(question)

-       # Apply template if available
-       if question["dataset_type"] in TEMPLATE_REGISTRY:
-           prompt = TEMPLATE_REGISTRY[question["dataset_type"]].format(question=question["problem"] if "problem" in question else question["question"])
-       else:
-           prompt = question["problem"] if "problem" in question else question["question"]
+       prompt = self.dataset.get_prompt(question)

        task_state = TaskState(
            case_id=task_id,
@@ -361,12 +459,15 @@ class Processor:
        n_cases = len(self.dataset.questions)

        print(f"\nProcessing {n_cases} {self.dataset_type.upper()} questions...")
-       print(f"Server: {self.server_url}")
+       print(f"Server: {self.server_url} (model: {self.model_name})")
        print(f"Threads: {self.threads}")
        print(f"Max tokens: {self.n_predict}")
+       print(f"Seed: {self.seed}")
+       print(f"Sampling: temp={self.sampling_config.get('temperature', 'skip')}, top-k={self.sampling_config.get('top_k', 'skip')}, top-p={self.sampling_config.get('top_p', 'skip')}, min-p={self.sampling_config.get('min_p', 'skip')}")
        print(f"Grader: {self.grader.grader_type}", end="")
        if self.grader.grader_type == "llm":
-           print(f" (judge server: {self.judge_server_url}, model: {self.judge_model_name})", end="")
+           judge_model = self.judge_model_name if self.judge_model_name else self.model_name
+           print(f" (judge server: {self.judge_server_url}, model: {judge_model})", end="")
        print()
        print()
@@ -389,9 +490,14 @@ class Processor:
        print("    Task ID              Dataset  Prompt (first 40 chars)                     Expected   Status")
        for i, task_id in task_list:
            question = self.dataset.get_question(i)
-           prompt = question["problem"] if "problem" in question else question["question"]
+           prompt = self.dataset.get_prompt(question)
            gold = self.dataset.get_answer(question)
-           truncated_prompt = prompt[:40] + "..." if len(prompt) > 40 else prompt
+           first_line = prompt.split('\n')[0]
+           truncated_prompt = first_line[:43]
+           if len(first_line) > 43:
+               truncated_prompt += "..."
+           else:
+               truncated_prompt = truncated_prompt.ljust(43) + "..."
            print(f"    {task_id:<20} {self.dataset_type.upper()}  {truncated_prompt:<40} {gold:<10} pending")
        print()
@@ -413,7 +519,13 @@ class Processor:
        # Print task completion status
        extracted_display = task_state.extracted if task_state.extracted else "N/A"
        success_ratio = correct / total if total > 0 else 0.0
-       print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {task_state.prompt[:40]:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]")
+       first_line = task_state.prompt.split('\n')[0]
+       truncated_prompt = first_line[:43]
+       if len(first_line) > 43:
+           truncated_prompt += "..."
+       else:
+           truncated_prompt = truncated_prompt.ljust(43) + "..."
+       print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {truncated_prompt:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]")

        if self.verbose:
            print(f"\nCase {total}: {task_state.correct}")
@@ -456,7 +568,7 @@ def main():
        "--dataset",
        type=str,
        default="aime",
-       choices=["aime", "gsm8k"],
+       choices=["aime", "gsm8k", "gpqa"],
        help="Dataset type (default: aime)"
    )
    parser.add_argument(
@@ -474,8 +586,32 @@ def main():
    parser.add_argument(
        "--n_predict",
        type=int,
-       default=2048,
-       help="Max tokens to predict per prompt (default: 2048)"
+       default=-1,
+       help="Max tokens to predict per prompt (default: -1, infinite)"
    )
+   parser.add_argument(
+       "--temperature",
+       type=float,
+       default=None,
+       help="Sampling temperature (default: not passed)"
+   )
+   parser.add_argument(
+       "--top-k",
+       type=int,
+       default=None,
+       help="Top K sampling (default: not passed)"
+   )
+   parser.add_argument(
+       "--top-p",
+       type=float,
+       default=None,
+       help="Top P sampling (default: not passed)"
+   )
+   parser.add_argument(
+       "--min-p",
+       type=float,
+       default=None,
+       help="Min P sampling (default: not passed)"
+   )
    parser.add_argument(
        "--threads",
@@ -503,16 +639,9 @@ def main():
    parser.add_argument(
        "--grader-type",
        type=str,
-       default="regex",
+       default="llm",
        choices=["regex", "cli", "llm"],
-       help="Grader type: regex, cli, or llm (default: regex)"
-   )
-   parser.add_argument(
-       "--grader-regex-type",
-       type=str,
-       default="aime",
-       choices=list(GRADER_PATTERNS.keys()),
-       help="Regex grader type (default: aime)"
+       help="Grader type: regex, cli, or llm (default: llm)"
    )
    parser.add_argument(
        "--grader-script",
@@ -529,21 +658,37 @@ def main():
    parser.add_argument(
        "--judge-model",
        type=str,
-       default=None,
+       default="",
        help="Model name for LLM judge (default: same as main model)"
    )

    args = parser.parse_args()

+   # Validate grader type for GPQA
+   if args.dataset == "gpqa" and args.grader_type != "llm":
+       print("Error: GPQA dataset requires --grader-type llm")
+       parser.print_help()
+       sys.exit(1)
+
    grader = Grader(
        grader_type=args.grader_type,
        grader_regex_type=args.grader_regex_type,
-       grader_script=args.grader_script
+       grader_script=args.grader_script,
+       judge_model_name=args.judge_model if args.judge_model else args.model
    )

    if args.grader_type == "llm" and not args.judge_server:
        print("Warning: Using same server for LLM judge (no --judge-server specified)")

+   sampling_config = {"n_predict": args.n_predict}
+   if args.temperature is not None:
+       sampling_config["temperature"] = args.temperature
+   if args.top_k is not None:
+       sampling_config["top_k"] = args.top_k
+   if args.top_p is not None:
+       sampling_config["top_p"] = args.top_p
+   if args.min_p is not None:
+       sampling_config["min_p"] = args.min_p
+
    processor = Processor(
        server_url=args.server,
        n_predict=args.n_predict,
@@ -553,7 +698,8 @@ def main():
        model_name=args.model,
        judge_server_url=args.judge_server,
        judge_model_name=args.judge_model,
-       dataset_type=args.dataset
+       dataset_type=args.dataset,
+       sampling_config=sampling_config
    )

    eval_state = processor.process(n_cases=args.n_cases, seed=args.seed)
@@ -0,0 +1,29 @@
{
  "id": "gpqa",
  "tasks": [
    "gpqa"
  ],
  "task_states": {
    "gpqa": {
      "total": 1,
      "correct": 0,
      "cases": {
        "gpqa": [
          {
            "case_id": "gpqa_000_184",
            "prompt": "Consider a system with Hamiltonian operator $H = \\varepsilon \\vec{\\sigma}.\\vec{n}$. Here, $\\vec{n}$ is an arbitrary unit vector, $\\varepsilon $ is a constant of dimension energy, and components of $\\vec{\\sigma}$ are the Pauli spin matrices. What are the eigenvalues of the Hamiltonian operator?\n\n\n(A) +\\hbar/2, -\\hbar/2\n(B) +1, -1\n(C) +\\varepsilon \\hbar/2, - \\varepsilon \\hbar/2\n(D) + \\varepsilon, -\\varepsilon\n\n\nExpress your final answer as the corresponding option 'A', 'B', 'C', or 'D'.\n",
            "gold": "+ \\varepsilon, -\\varepsilon\n",
            "pred": null,
            "extracted": null,
            "correct": false,
            "status": "error: HTTPConnectionPool(host='localhost', port=8034): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError(\"HTTPConnection(host='localhost', port=8034): Failed to establish a new connection: [Errno 61] Connection refused\"))"
          }
        ]
      }
    }
  },
  "sampling_config": {
    "temperature": 0,
    "max_tokens": 2048
  }
}
@@ -0,0 +1,36 @@
# llama-server-simulator

Standalone Python script simulating a llama-server HTTP endpoint for testing.

## Features

- HTTP server with OpenAI-compatible `/v1/chat/completions` endpoint
- AIME dataset integration - loads 90 questions from HuggingFace
- Intelligent question matching - uses exact matching, LaTeX removal, and Levenshtein distance
- Configurable success rate - control correct/wrong answer generation (0-1)
- Debug logging - troubleshoot matching issues

## Usage

```bash
python llama-server-simulator.py --success-rate 0.8
```

## Arguments

- `--success-rate`: Probability of returning the correct answer (0.0-1.0, default: 0.8)
- `--port`: Server port (default: 8033)
- `--debug`: Enable debug logging (default: False)

## Testing

```bash
./test-simulator.sh
```

## Implementation Details

- Uses Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via the HuggingFace datasets library
- Wrong answers generated by incrementing the expected answer
- Debug output written to stderr
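The wrong-answer generation noted under Implementation Details can be sketched as follows (a minimal illustration; the function name is hypothetical and the real script may differ):

```python
def make_wrong_answer(expected: str) -> str:
    # AIME answers are integers (0-999), so incrementing the expected
    # answer yields a plausible but incorrect response
    return str(int(expected) + 1)
```

For example, when the matched question's expected answer is "116", the simulator replies "117" on an unsuccessful draw.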
@@ -1,189 +0,0 @@
# llama-server-simulator Implementation Plan

## Overview
Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.

## Goals
1. Simulate llama-server's `/v1/chat/completions` endpoint
2. Accept requests and respond with expected answers from the AIME dataset
3. Implement a configurable success rate (sometimes right, sometimes wrong)
4. Use regex matching to find questions in incoming requests
5. Test with curl requests before integrating with the eval script

## Implementation Plan

### Phase 1: Basic Simulator Structure
- Create `llama-server-simulator.py` script
- Set up Flask/FastAPI HTTP server
- Implement `/v1/chat/completions` endpoint
- Handle basic request/response format

### Phase 2: AIME Dataset Integration
- Load AIME dataset
- Store questions and expected answers
- Implement regex matching to find questions in incoming requests
- Extract expected answer from matched question

### Phase 3: Response Generation
- Implement success rate configuration
- Randomly determine if response should be correct or incorrect
- Generate appropriate response based on success determination
- Format response in OpenAI-compatible format

### Phase 4: Testing
- Write curl commands to test basic functionality
- Test correct responses
- Test incorrect responses
- Test edge cases (no question found, etc.)

## Technical Details

### Server Framework
- Use Flask for simplicity
- Listen on configurable port
- Support JSON request/response format

### Request Format
```json
{
  "model": "llama",
  "messages": [
    {"role": "user", "content": "Question text here"}
  ],
  "temperature": 0,
  "max_tokens": 2048
}
```

### Response Format
```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "llama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Answer text here"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 50,
    "total_tokens": 150
  }
}
```

### AIME Dataset Integration
- Load from HuggingFace: "AI-MO/aimo-validation-aime"
- Store in memory for fast lookup
- Regex pattern to find question text in request
- Extract answer from matched question

### Success Rate Configuration
- Command-line argument: `--success-rate 0.8` (80% success rate)
- Randomly determine correctness based on rate
- Log when responses are correct vs incorrect
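The success-rate draw described above amounts to one Bernoulli trial per request; a minimal sketch (function name hypothetical):

```python
import random

def should_answer_correctly(success_rate: float, rng: random.Random) -> bool:
    # one Bernoulli trial per request: True means "return the expected answer"
    return rng.random() < success_rate

# sanity check: with success_rate=0.8, roughly 80% of draws come back True
rng = random.Random(0)
draws = [should_answer_correctly(0.8, rng) for _ in range(10_000)]
observed = sum(draws) / len(draws)
```

Seeding the RNG makes the correct/wrong pattern reproducible across test runs.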
### Testing Strategy
1. Start simulator with default settings
2. Send curl request with known question
3. Verify response contains expected answer
4. Test with different success rates
5. Test edge cases

## Implementation Steps

### Step 1: Basic Server Setup
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    body = request.get_json()
    # Match the question and build a response (filled in by Steps 2-4)
    response = {}
    return jsonify(response)
```
### Step 2: Load AIME Dataset
```python
import datasets

ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train")
# Store in memory for fast lookup (assumes the dataset's problem/answer columns)
answers = {row["problem"]: row["answer"] for row in ds}
```

### Step 3: Regex Matching
```python
import re

def find_question_in_request(request_text):
    # Regex pattern to find question
    pattern = r"question:\s*(.*?)\n"
    match = re.search(pattern, request_text, re.DOTALL)
    return match.group(1) if match else None
```

### Step 4: Response Generation
```python
import random

def generate_response(question, success_rate):
    # get_expected_answer/get_wrong_answer look up the in-memory dataset
    if random.random() < success_rate:
        return get_expected_answer(question)
    else:
        return get_wrong_answer(question)
```

### Step 5: Testing with Curl
```bash
curl -X POST http://localhost:8033/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Question text"}]
  }'
```

## Configuration Options
- `--port`: Server port (default: 8033)
- `--success-rate`: Success rate 0-1 (default: 0.8)
- `--host`: Server host (default: localhost)
- `--dataset-split`: AIME split to use (default: train)

## Expected Output
```
=== llama-server-simulator ===
Server running on http://localhost:8033
Success rate: 0.8
AIME dataset loaded: 1000 questions
```

## Testing Checklist
- [ ] Server starts successfully
- [ ] Basic request/response works
- [ ] Correct answer returned when success rate allows
- [ ] Wrong answer returned when success rate doesn't allow
- [ ] No question found returns error
- [ ] Multiple requests work correctly
- [ ] Different success rates work as expected

## Next Steps

1. ✓ Implement basic server structure
2. ✓ Load AIME dataset
3. ✓ Implement regex matching
4. ✓ Add response generation with success rate
5. ✓ Test with curl commands
6. ✓ Integrate with eval script once simulator works
7. ✓ Implement eval state object
8. ✓ Implement processor object
9. ✓ Add real-time progress reporting
10. ✓ Add enhanced grading system with LLM judge
@@ -1,138 +0,0 @@
# llama-server-simulator Implementation Summary

## Overview
Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.

## Features Implemented

### 1. HTTP Server
- Flask-based `/v1/chat/completions` endpoint
- OpenAI-compatible response format
- Configurable port and host

### 2. AIME Dataset Integration
- Loads the AIME dataset from HuggingFace
- In-memory storage for fast lookup
- 90 questions loaded from the train split

### 3. Intelligent Question Matching
- **Exact matching**: Direct string comparison
- **LaTeX removal**: Removes `$...$` formatting for flexible matching
- **Levenshtein distance**: Calculates similarity between strings
- **Partial matching**: Finds the best match even with small differences

### 4. Response Generation
- Configurable success rate (0-1)
- Returns correct answers when the success rate allows
- Returns wrong answers when it doesn't
- Wrong answers are generated by incrementing the expected answer

### 5. Debug Logging
- Debug messages written to stderr
- Logs request content, matching results, and distances
- Helps troubleshoot matching issues

## Configuration Options

```bash
python3 llama-server-simulator.py \
  --port 8034 \
  --host localhost \
  --success-rate 0.8 \
  --dataset-split train
```

## Testing Results

### Test 1: Correct Answer
- **Success rate**: 0.8
- **Expected answer**: 116
- **Result**: ✓ Correct (116)

### Test 2: Wrong Answer
- **Success rate**: 0.0
- **Expected answer**: 116
- **Result**: ✓ Wrong (117)

### Test 3: No Matching Question
- **Request**: "What is the capital of France?"
- **Result**: ✓ Returns error "No matching question found"

### Test 4: Success Rate Verification
- **Success rate**: 0.8
- **Requests**: 10
- **Correct answers**: 8/10 (80%)
- **Result**: ✓ Success rate working as expected

## Technical Details

### Matching Algorithm
1. Try exact match (case-insensitive)
2. Try match after removing LaTeX formatting
3. Calculate Levenshtein distance for partial matches
4. Return best match if distance < 0.3 (30% difference)
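The threshold step above can be sketched with a plain edit-distance implementation (a minimal illustration; the actual script's helper names may differ):

```python
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance, computed one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    # 0.0 means identical; scale by the longer string so the 0.3 cutoff
    # reads as "at most 30% of the characters differ"
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def best_match(query, candidates, threshold=0.3):
    # pick the closest candidate, but only accept it under the threshold
    dist, match = min((normalized_distance(query.lower(), c.lower()), c)
                      for c in candidates)
    return match if dist < threshold else None
```

This is what allows requests that differ only in punctuation or small rewordings to still resolve to the right AIME question.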
### Response Format
```json
{
  "id": "chatcmpl-1769864875",
  "object": "chat.completion",
  "created": 1769864875,
  "model": "llama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "116"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 50,
    "total_tokens": 150
  }
}
```

## Files Created

1. `llama-server-simulator.py` - Main simulator script
2. `test-simulator.sh` - Basic test script
3. `test-simulator-comprehensive.sh` - Comprehensive test script
4. `llama-server-simulator-plan.md` - Implementation plan
5. `llama-eval-discussion.md` - Discussion notes

## Next Steps

1. ✓ Basic simulator structure
2. ✓ AIME dataset integration
3. ✓ Question matching with Levenshtein distance
4. ✓ Response generation with configurable success rate
5. ✓ Testing with curl requests
6. ✓ Integrate with eval script
7. ✓ Implement eval state object
8. ✓ Implement processor object
9. ✓ Add real-time progress reporting
10. ✓ Add enhanced grading system with LLM judge

## Known Limitations

1. Only supports the AIME dataset (train split)
2. Matching is case-insensitive
3. Wrong answers are simple increments (not realistic)
4. No support for multiple endpoints
5. No distributed evaluation

## Future Enhancements

1. Support multiple datasets
2. More sophisticated wrong answer generation
3. Multiple endpoint support
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization
7. Enhanced grading with LLM judge
8. Response truncation for better answer extraction