# llama-eval Implementation Discussion

## Overview

Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.
## Key Requirements from ggerganov

### 1. Simplify and Focus on One Eval

- Start with AIME2025 (the eval ggerganov is most familiar with)
- Don't support multiple evals initially
### 2. Implement an "eval state" object

- ID
- List of tasks
- Task states
- Sampling config
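A minimal sketch of what such an object could look like in Python (field and type names are illustrative assumptions, not taken from the PR):

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    CORRECT = "correct"
    INCORRECT = "incorrect"


@dataclass
class TaskState:
    task_id: int
    question: str
    expected_answer: str
    produced_answer: str | None = None
    status: TaskStatus = TaskStatus.PENDING


@dataclass
class EvalState:
    eval_id: str                                   # e.g. "aime2025-run-01"
    sampling: dict = field(default_factory=dict)   # temperature, top_p, seed, ...
    tasks: list[TaskState] = field(default_factory=list)
```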
### 3. Implement a "processor" object

- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)
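A possible shape for the processor's configuration (all names and defaults here are assumptions for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class ProcessorConfig:
    # OpenAI-compatible endpoints to spread tasks across
    endpoints: list[str] = field(default_factory=lambda: ["http://localhost:8080/v1"])
    threads_per_endpoint: int = 4
    # "regex", "endpoint" (LLM judge), or "cli" (external tool)
    grader: str = "endpoint"
    grader_endpoint: str | None = None  # only used when grader == "endpoint"
```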
### 4. Processor responsibilities

- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses
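One way the processing loop could look, reusing the `EvalState` and `ProcessorConfig` sketches above (`process_task` is a hypothetical per-task worker; `dump_state` is sketched under Checkpointing below):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run(state: EvalState, cfg: ProcessorConfig,
        dump_path: str = "eval-state.json", dump_interval: float = 30.0) -> None:
    """Process all pending tasks, dumping the eval state periodically."""
    last_dump = time.monotonic()
    workers = len(cfg.endpoints) * cfg.threads_per_endpoint
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(process_task, task, cfg): task
            for task in state.tasks if task.status == TaskStatus.PENDING
        }
        for future in as_completed(futures):
            futures[future].status = future.result()  # CORRECT or INCORRECT
            if time.monotonic() - last_dump >= dump_interval:
                dump_state(state, dump_path)
                last_dump = time.monotonic()
    dump_state(state, dump_path)  # final dump
```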
### 5. Real-time feedback

- Default: show "correct / not correct" for each task
- Verbose mode: show the produced answer vs. the expected answer as soon as each task completes
### 6. Grading approach

- Abstract grading to support an external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid the extraction issues seen in the GPT-OSS evals)
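The abstraction could be as small as a single-method interface (a sketch; the actual design is still open):

```python
from typing import Protocol


class Grader(Protocol):
    def grade(self, question: str, expected: str, produced: str) -> bool:
        """Return True if the produced answer matches the expected one."""
        ...
```

A regex grader, an LLM judge, or a CLI-tool wrapper would all satisfy the same interface, so the processor never needs to know which one it is using.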
### 7. Output format

- Use structured output (JSON) instead of boxed text
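For example, instead of parsing `\boxed{...}` out of free-form text, the request could constrain the reply with a JSON schema via the OpenAI-compatible `response_format` field (the schema below is an illustrative assumption):

```python
question = "..."  # an AIME problem statement

request_body = {
    "messages": [{"role": "user", "content": question}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "aime_answer",
            "schema": {
                "type": "object",
                "properties": {"answer": {"type": "integer"}},
                "required": ["answer"],
            },
        },
    },
}
```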
## Current Implementation Analysis

### What exists in llama-eval.py:

- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to an OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting
### What needs to be removed:

- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)
## Discussion Points

### 1. Eval State Object Structure

**Status: Under Discussion**

Questions:

- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?
### 2. Processor Architecture

**Status: Not Started**

Questions:

- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?
### 3. Grader Interface

**Status: Not Started**

Questions:

- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?
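If the judge ends up being an LLM call, it could reuse the same OpenAI-compatible endpoint handling as the eval itself. A hedged sketch (class name, prompt, and JSON reply convention are all assumptions):

```python
import json

import requests


class EndpointJudge:
    """Grades by asking an OpenAI-compatible endpoint to compare answers."""

    def __init__(self, base_url: str = "http://localhost:8080/v1"):
        self.base_url = base_url

    def grade(self, question: str, expected: str, produced: str) -> bool:
        prompt = (
            f"Question: {question}\n"
            f"Expected answer: {expected}\n"
            f"Produced answer: {produced}\n"
            'Are they equivalent? Reply with JSON only: {"match": true} or {"match": false}'
        )
        r = requests.post(
            f"{self.base_url}/chat/completions",
            json={"messages": [{"role": "user", "content": prompt}],
                  "temperature": 0.0},
            timeout=120,
        )
        r.raise_for_status()
        content = r.json()["choices"][0]["message"]["content"]
        try:
            return bool(json.loads(content)["match"])
        except (json.JSONDecodeError, KeyError, TypeError):
            return False  # treat unparseable judge output as a failed grade
```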
### 4. Checkpointing

**Status: Not Started**

Questions:

- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?
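Whatever the answers, the dump itself can be made crash-safe with a write-to-temp-then-rename pattern. A sketch of the `dump_state` helper referenced in the processor loop above (JSON is assumed, matching the structured-output direction; the filename is illustrative):

```python
import json
import os
from dataclasses import asdict


def dump_state(state: EvalState, path: str = "eval-state.json") -> None:
    """Atomically serialize the eval state so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(asdict(state), f, indent=2, default=str)  # default=str handles enums
    os.replace(tmp, path)  # atomic rename on POSIX and Windows
```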
### 5. Real-time Output

**Status: Not Started**

Questions:

- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?
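Whatever the final design, the per-task feedback from requirement 5 could be as simple as this sketch (the `verbose` flag is an assumption):

```python
def report(task: TaskState, verbose: bool = False) -> None:
    mark = "correct" if task.status == TaskStatus.CORRECT else "not correct"
    line = f"[task {task.task_id}] {mark}"
    if verbose:
        line += f"  produced={task.produced_answer!r}  expected={task.expected_answer!r}"
    print(line, flush=True)  # flush so progress shows up immediately when piped
```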
### 6. Output Format

**Status: Not Started**

Questions:

- Should responses be in JSON format?
- How should the grader interface work with JSON output?
## Next Steps

1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format
## References

- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195
## Session Work Summary

### llama-server-simulator Implementation

**Created:**

- `llama-server-simulator.py` - Standalone Python script simulating the llama-server HTTP endpoint
- `test-simulator.sh` - Test script for verifying simulator functionality
- `llama-server-simulator-plan.md` - Implementation plan
- `simulator-summary.md` - Summary of the implementation
**Features Implemented:**

1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with OpenAI-compatible format
2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance (sketched below)
4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
5. Debug Logging - Helps troubleshoot matching issues
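A rough sketch of how that matching could work (the `normalize` heuristic and the way the threshold is applied are assumptions; the 0.3 threshold comes from the decisions below):

```python
import re

import Levenshtein  # pip install python-Levenshtein


def normalize(text: str) -> str:
    """Crude normalization: drop simple LaTeX wrappers and collapse whitespace."""
    text = re.sub(r"\\[a-zA-Z]+\{([^}]*)\}", r"\1", text)  # \text{abc} -> abc
    return re.sub(r"\s+", " ", text).strip().lower()


def match_question(incoming: str, dataset_questions: list[str]) -> str | None:
    """Return the closest dataset question, or None if nothing is close enough."""
    target = normalize(incoming)
    best, best_score = None, 1.0
    for q in dataset_questions:
        cand = normalize(q)
        if cand == target:
            return q  # exact match after normalization
        score = Levenshtein.distance(cand, target) / max(len(cand), len(target), 1)
        if score < best_score:
            best, best_score = q, score
    return best if best_score <= 0.3 else None
```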
**Testing Results:**

- ✅ Correct answers returned when the success rate allows
- ✅ Wrong answers returned when the success rate doesn't allow
- ✅ Requests with no matching question return errors
- ✅ Success rate verified (80% over 10 requests)
- ✅ HuggingFace dataset caching working correctly
**Key Technical Decisions:**

- Used Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via the HuggingFace datasets library
- Wrong answers generated by incrementing the expected answer
- Debug output written to stderr for better visibility
**Refactoring:**

- Extracted the repeated question string into a `TEST_QUESTION` variable
- Created a `make_request()` helper function to reduce code duplication
- Added proper handling of error responses
- Fixed an issue with stopping the simulator at script completion