# llama-eval Implementation Discussion
## Overview

Discussion about implementing a lean evaluation tool for llama.cpp, based on ggerganov's feedback in PR #18892.

## Key Requirements from ggerganov

### 1. Simplify and Focus on One Eval

- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially

### 2. Implement an "eval state" object

- ID
- List of tasks
- Task states
- Sampling config

### 3. Implement a "processor" object

- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)

### 4. Processor responsibilities

- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses

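Items 3 and 4 together suggest a small object along these lines. The names (`dump_every`, `run_task`) are illustrative, and the real tool would fan tasks out across the endpoints rather than loop sequentially:

```python
from dataclasses import dataclass

@dataclass
class Processor:
    endpoints: list                # e.g. ["http://localhost:8033"]
    threads_per_endpoint: int = 4
    grader_type: str = "regex"     # "regex", "endpoint", or "cli"
    dump_every: int = 10           # checkpoint the eval state every N tasks

    def process(self, state, run_task, dump):
        # accepts an eval state, processes it, dumps it periodically
        for i, task in enumerate(state.task_states, start=1):
            run_task(task)                 # send request + grade the answer
            if i % self.dump_every == 0:
                dump(state)                # periodic checkpoint
        dump(state)                        # final dump
```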
### 5. Real-time feedback

- Default: show "correct / not correct" for each task
- Verbose mode: show the produced answer vs. the expected answer as soon as a task completes

### 6. Grading approach

- Abstract grading to support an external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid the issues seen in the GPT-OSS evals)

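The abstraction could be as small as one method. A sketch with a regex grader and a stand-in for an LLM judge (a real judge would POST to an endpoint instead of calling a local function):

```python
import re
from abc import ABC, abstractmethod

class Grader(ABC):
    @abstractmethod
    def grade(self, predicted: str, expected: str) -> bool: ...

class RegexGrader(Grader):
    """Extracts the last integer from the response and compares exactly."""
    def grade(self, predicted: str, expected: str) -> bool:
        nums = re.findall(r"\d+", predicted)
        return bool(nums) and nums[-1] == str(int(expected))

class JudgeGrader(Grader):
    """Delegates the verdict to a judge; `ask_judge` stands in for the HTTP call."""
    def __init__(self, ask_judge):
        self.ask_judge = ask_judge
    def grade(self, predicted: str, expected: str) -> bool:
        return self.ask_judge(predicted, expected)
```

Because both variants implement the same `grade()` signature, the processor never needs to know which one it is holding.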
### 7. Output format

- Use structured output (JSON) instead of boxed text

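If the model is instructed to answer with a small JSON object (e.g. `{"answer": 204}`) instead of `\boxed{204}`, grading reduces to parsing. A sketch of that parsing, assuming a single un-nested object in the response:

```python
import json
import re

def parse_structured_answer(text: str):
    """Pull the first JSON object out of a response and return its "answer" field.

    Returns None if no parseable object is found. The non-greedy pattern
    assumes the object contains no nested braces, which holds for a flat
    {"answer": ...} reply.
    """
    match = re.search(r"\{.*?\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0)).get("answer")
    except json.JSONDecodeError:
        return None
```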
## Current Implementation Analysis

### What exists in llama-eval.py:

- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to an OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting

### What needs to be removed:

- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)

## Discussion Points

### 1. Eval State Object Structure

**Status: Under Discussion**

Questions:

- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?

### 2. Processor Architecture

**Status: Not Started**

Questions:

- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?

### 3. Grader Interface

**Status: Not Started**

Questions:

- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?

### 4. Checkpointing

**Status: Not Started**

Questions:

- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?

### 5. Real-time Output

**Status: Not Started**

Questions:

- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?

### 6. Output Format

**Status: Not Started**

Questions:

- Should responses be in JSON format?
- How should the grader interface work with JSON output?

## Next Steps

1. **Eval State Object** - currently under discussion
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format

## References

- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195

## Session Work Summary

### llama-server-simulator Implementation

**Created:**

- `llama-server-simulator.py` - Standalone Python script simulating the llama-server HTTP endpoint
- `test-simulator.sh` - Test script for verifying simulator functionality
- `llama-server-simulator-plan.md` - Implementation plan
- `simulator-summary.md` - Summary of the implementation

**Features Implemented:**

1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with OpenAI-compatible format
2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
5. Debug Logging - Helps troubleshoot matching issues

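The response-generation mechanic in item 4 amounts to a biased coin flip. A sketch (the notes below say wrong answers are produced by incrementing the expected one; `rng` is injectable only to keep the sketch testable):

```python
import random

def make_answer(expected: int, success_rate: float, rng=random.random) -> int:
    """Return the gold answer with probability success_rate, else a wrong one."""
    if rng() < success_rate:
        return expected
    return expected + 1  # deliberately wrong: increment the expected answer
```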
**Testing Results:**

- ✅ Correct answers returned when the success rate allows
- ✅ Wrong answers returned when the success rate doesn't allow
- ✅ Requests with no matching question return errors
- ✅ Success rate verified (80% across 10 requests)
- ✅ HuggingFace dataset caching working correctly

**Key Technical Decisions:**

- Used Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via the HuggingFace datasets library
- Wrong answers generated by incrementing the expected answer
- Debug output written to stderr for better visibility

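The partial-matching decision can be sketched as the classic dynamic-programming edit distance, normalized by the longer string and accepted under the 0.3 threshold (the simulator may well use a library implementation instead):

```python
def levenshtein(a: str, b: str) -> int:
    # classic DP edit distance, O(len(a) * len(b)) time, O(len(b)) space
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def is_match(query: str, candidate: str, threshold: float = 0.3) -> bool:
    # normalized distance: 0.0 = identical, 1.0 = completely different
    dist = levenshtein(query, candidate)
    return dist / max(len(query), len(candidate), 1) <= threshold
```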
**Refactoring:**

- Extracted the repeated question string into a `TEST_QUESTION` variable
- Created a `make_request()` helper function to reduce code duplication
- Added proper handling of error responses
- Fixed the simulator shutdown issue at script completion

### llama-eval-new.py Implementation

**Created:**

- `llama-eval-new.py` - Simplified evaluation tool focused on AIME

**Features Implemented:**

1. **Eval State Object** - Structured dataclass with ID, tasks, task states, and sampling config
2. **Processor Object** - Handles processing, grading, and state management
3. **Real-time Feedback** - Shows correct/incorrect status for each case
4. **Flexible Grading System** - Supports regex and CLI-based grading
5. **Structured JSON Output** - Saves the complete eval state to a JSON file
6. **HuggingFace Dataset Caching** - Uses the cached dataset path to avoid HF Hub requests

**Grading System:**

- **Regex Grading**: Built-in patterns for the different task types
  - `aime`: `\boxed{(\d+)}|\b(\d+)\b` (handles boxed and plain-text answers)
  - `gsm8k`: `\b(\d+)\b` (extracts the first number)
  - `mmlu`, `hellaswag`, `arc`, `winogrande`: `[A-D]` (extracts a single letter)
- **CLI Grading**: External script interface
  - Script accepts `--answer <pred>` and `--expected <gold>`
  - Returns exit code 0 if correct, non-zero if incorrect
  - 30-second timeout to prevent hanging

**Configuration Options:**

- `--server`: llama-server URL (default: `http://localhost:8033`)
- `--n_cases`: number of cases to evaluate (default: all)
- `--n_predict`: max tokens to predict per prompt (default: 2048)
- `--threads`: number of threads for parallel requests (default: 32)
- `--verbose`: show detailed output for each case
- `--output`: output file for the eval state (default: `llama-eval-state.json`)
- `--grader-type`: `regex` or `cli`
- `--grader-regex-type`: `aime`, `gsm8k`, `mmlu`, `hellaswag`, `arc`, or `winogrande`
- `--grader-script`: path to the CLI grader script

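The options above map directly onto `argparse`; a condensed sketch with the defaults taken from the list (help strings and everything else are illustrative):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Lean AIME eval for llama.cpp")
    p.add_argument("--server", default="http://localhost:8033", help="llama-server URL")
    p.add_argument("--n_cases", type=int, default=None, help="cases to evaluate (default: all)")
    p.add_argument("--n_predict", type=int, default=2048, help="max tokens per prompt")
    p.add_argument("--threads", type=int, default=32, help="parallel request threads")
    p.add_argument("--verbose", action="store_true", help="detailed per-case output")
    p.add_argument("--output", default="llama-eval-state.json", help="eval state output file")
    p.add_argument("--grader-type", choices=["regex", "cli"], default="regex")
    p.add_argument("--grader-regex-type", default="aime",
                   choices=["aime", "gsm8k", "mmlu", "hellaswag", "arc", "winogrande"])
    p.add_argument("--grader-script", default=None, help="path to CLI grader script")
    return p
```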
**Testing Results:**

- ✅ Works with the simulator at 100% success rate (all correct)
- ✅ Works with the simulator at 0% success rate (all incorrect)
- ✅ Works with the simulator at 80% success rate (8/10 correct)
- ✅ Real-time verbose output shows gold/pred/status for each case
- ✅ JSON output contains the complete eval state with all cases
- ✅ HF Hub telemetry disabled (no warnings)
- ✅ Uses the cached dataset path to avoid HF Hub requests when available

**Key Technical Decisions:**

- Removed Levenshtein matching - the eval script only sends requests and validates answers
- Abstract grading interface for external grader support
- Exact-match requirement for regex patterns
- Handles both boxed and plain-text formats for AIME answers
- 30-second timeout for the CLI grader
- Validates that the grader script exists before running it

**Refactoring:**

- Removed all task implementations except AIME
- Removed hard-coded regex grading (folded into the flexible grader system)
- Removed multiple endpoint support
- Removed the complex task loading logic
- Removed summary reporting (replaced with real-time feedback)
- Added the HuggingFace dataset caching optimization

### llama-eval-new.py Threading and Model Parameter Updates

**Changes Made:**

1. **Threading Support** - Added ThreadPoolExecutor for parallel request processing
   - Added `from concurrent.futures import ThreadPoolExecutor, as_completed`
   - Created a `_process_single_case()` method for thread-safe case processing
   - Refactored `process()` to use ThreadPoolExecutor with a configurable thread count
   - Updated progress tracking to work with concurrent execution
   - Made eval state updates (task_states and counters) thread-safe

2. **Model Parameter** - Added `--model` argument to specify model name in request data
   - Added a `model_name` parameter to `Processor.__init__()`
   - Updated `_make_request()` to use the provided model name or default to `"llama"`
   - Added the `--model` argument to the argument parser
   - The model name is included in the request JSON, e.g. `"model": "gpt-oss-20b-hf"`

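The threading change in item 1 boils down to a familiar pattern: submit every case to a pool, consume results with `as_completed` for live progress, and guard the shared counters with a lock. A reduced sketch, with `process_case` standing in for the HTTP request plus grading:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock

def run_eval(cases, process_case, n_threads=4):
    """process_case(case) -> bool (correct or not). Returns (correct, total)."""
    lock = Lock()
    correct = total = 0
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(process_case, c) for c in cases]
        for fut in as_completed(futures):   # results arrive as they complete
            ok = fut.result()
            with lock:                      # thread-safe counter updates
                total += 1
                correct += ok
    return correct, total
```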
**Testing Results:**

- ✅ Works with 2 threads (5 cases processed in ~0.2 s)
- ✅ Works with 4 threads (slightly higher throughput)
- ✅ Model parameter correctly added to the request data
- ✅ Thread-safe progress tracking with tqdm
- ✅ No race conditions in eval state updates

**Key Technical Decisions:**

- Used ThreadPoolExecutor for simple, effective parallelism
- No rate limiting needed (the server can handle concurrent requests)
- Thread-safe counter updates for correct/total tracking
- Progress bar shows completion status across all threads
- Model parameter is optional - defaults to `"llama"` if not specified

**Refactoring:**

- Extracted single-case processing into a `_process_single_case()` method
- Changed from a sequential loop to ThreadPoolExecutor with futures
- Updated verbose output to show the total count instead of the index
- Made eval state updates thread-safe