# llama-server-simulator Implementation Summary
## Overview
This summary describes `llama-server-simulator.py`, a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
## Features Implemented
### 1. HTTP Server
- Flask-based `/v1/chat/completions` endpoint
- OpenAI-compatible response format
- Configurable port and host
### 2. AIME Dataset Integration
- Loads AIME dataset from HuggingFace
- In-memory storage for fast lookup
- 90 questions loaded from train split
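
The in-memory lookup could be sketched as a plain dictionary keyed by question text. This is a minimal sketch, not the simulator's actual code: the record field names `problem` and `answer` and the helper `build_question_index` are assumptions for illustration, and the sample record stands in for rows of the loaded dataset.

```python
def build_question_index(records):
    """Map each lowercased problem text to its answer for constant-time exact lookup."""
    index = {}
    for record in records:
        # Assumed field names; the real dataset schema may differ.
        key = record["problem"].strip().lower()
        index[key] = str(record["answer"])
    return index


# Hypothetical record standing in for a row of the loaded dataset.
sample_records = [
    {"problem": "Find $x$ if $x + 1 = 117$.", "answer": 116},
]

index = build_question_index(sample_records)
print(index["find $x$ if $x + 1 = 117$."])
```

Storing answers as strings keeps them ready to drop into the `content` field of a chat completion response.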
### 3. Intelligent Question Matching
- **Exact matching**: Direct string comparison
- **LaTeX removal**: Removes `$...$` formatting for flexible matching
- **Levenshtein distance**: Calculates similarity between strings
- **Partial matching**: Finds best match even with small differences
### 4. Response Generation
- Configurable success rate between 0 and 1
- Returns the correct answer for roughly the configured fraction of requests
- Returns a wrong answer for the remaining requests
- Wrong answers are generated by incrementing the expected answer (for example, 116 becomes 117)
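
One way to realize this behavior is a random draw per request. The sketch below assumes integer answers and an independent draw for every request; the function name `pick_answer` is hypothetical, and the simulator may decide differently (for example, with a deterministic schedule).

```python
import random


def pick_answer(expected: int, success_rate: float, rng: random.Random) -> int:
    """Return the expected answer with probability success_rate, else expected + 1."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    # rng.random() is in [0, 1), so 0.0 never succeeds and 1.0 always succeeds.
    if rng.random() < success_rate:
        return expected
    # Assumption: the wrong answer is the expected answer incremented by one.
    return expected + 1


rng = random.Random(42)  # seeded so repeated runs give the same sequence
answers = [pick_answer(116, 0.8, rng) for _ in range(10)]
print(answers)
```

With a random draw, the observed fraction of correct answers only approaches the configured rate over many requests, so a short run will not always hit it exactly.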
### 5. Debug Logging
- Debug messages written to stderr
- Logs request content, matching results, and distances
- Helps troubleshoot matching issues
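
A debug line covering the request, the match, and the distance might look like the sketch below. The `[DEBUG]` prefix, the 60-character preview, and both helper names are assumptions made for illustration, not the simulator's actual log format.

```python
import sys


def format_match_log(request_text, matched_question, distance):
    """Build one debug line describing a matching attempt."""
    preview = request_text[:60]  # keep log lines short
    if matched_question is None:
        return f"[DEBUG] request={preview!r} match=None"
    return (
        f"[DEBUG] request={preview!r} "
        f"match={matched_question[:60]!r} distance={distance:.3f}"
    )


def debug(message):
    """Write to stderr so debug output stays separate from stdout."""
    print(message, file=sys.stderr, flush=True)


debug(format_match_log("Find x if x + 1 = 117", "Find $x$ if $x + 1 = 117$.", 0.045))
```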
## Configuration Options
```bash
python3 llama-server-simulator.py \
    --port 8034 \
    --host localhost \
    --success-rate 0.8 \
    --dataset-split train
```
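
These flags could be parsed with `argparse` along the following lines. This is a sketch only: the default values are borrowed from the example invocation above and are assumptions, as is the function name `build_parser`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Define the four command-line options shown in the example invocation."""
    parser = argparse.ArgumentParser(description="Simulated llama-server endpoint")
    # Defaults below mirror the example command; the script's real defaults may differ.
    parser.add_argument("--port", type=int, default=8034)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--success-rate", type=float, default=0.8)
    parser.add_argument("--dataset-split", default="train")
    return parser


args = build_parser().parse_args(["--port", "9000", "--success-rate", "0.5"])
print(args.port, args.host, args.success_rate, args.dataset_split)
```

`argparse` converts dashes in option names to underscores, so `--success-rate` is read back as `args.success_rate`.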
## Testing Results
### Test 1: Correct Answer
- **Success rate**: 0.8
- **Expected answer**: 116
- **Result**: ✓ Correct (116)
### Test 2: Wrong Answer
- **Success rate**: 0.0
- **Expected answer**: 116
- **Result**: ✓ Wrong (117)
### Test 3: No Matching Question
- **Request**: "What is the capital of France?"
- **Result**: ✓ Returns error "No matching question found"
### Test 4: Success Rate Verification
- **Success rate**: 0.8
- **Requests**: 10
- **Correct answers**: 8/10 (80%)
- **Result**: ✓ Success rate working as expected
## Technical Details
### Matching Algorithm
1. Try exact match (case-insensitive)
2. Try match after removing LaTeX formatting
3. Calculate the Levenshtein distance to each question to find partial matches
4. Return the best match if its distance, as a fraction of string length, is below 0.3 (less than 30% difference)
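
The four steps can be sketched as a single lookup function. Several details here are assumptions rather than the simulator's confirmed behavior: the distance is normalized by the length of the longer string, LaTeX removal only drops the `$` delimiters, and the names `find_answer`, `normalized_distance`, and `strip_latex` are hypothetical.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def strip_latex(text: str) -> str:
    """Assumption: only the `$` delimiters are removed, the math content stays."""
    return re.sub(r"\$", "", text)


def find_answer(request_text, questions, threshold=0.3):
    """questions maps question text to answer; returns None when nothing matches."""
    wanted = request_text.strip().lower()
    # Step 1: exact, case-insensitive match.
    for question, answer in questions.items():
        if question.strip().lower() == wanted:
            return answer
    # Step 2: exact match after removing LaTeX delimiters.
    wanted_plain = strip_latex(wanted)
    for question, answer in questions.items():
        if strip_latex(question.strip().lower()) == wanted_plain:
            return answer
    # Steps 3 and 4: closest question, accepted only below the threshold.
    best_answer, best_distance = None, threshold
    for question, answer in questions.items():
        distance = normalized_distance(strip_latex(question.strip().lower()), wanted_plain)
        if distance < best_distance:
            best_answer, best_distance = answer, distance
    return best_answer


questions = {"Find $x$ if $x + 1 = 117$.": "116"}
print(find_answer("Find x if x + 1 = 117", questions))
print(find_answer("What is the capital of France?", questions))
```

The cheap exact checks run first, so the quadratic-time distance computation is only reached for requests that differ from every stored question.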
### Response Format
```json
{
  "id": "chatcmpl-1769864875",
  "object": "chat.completion",
  "created": 1769864875,
  "model": "llama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "116"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 50,
    "total_tokens": 150
  }
}
```
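
A payload of this shape could be assembled with a small helper like the one below. It is a sketch under assumptions: the `id` is derived from the `created` timestamp because the two agree in the example, and the token counts are copied from the example as fixed placeholders. The function name `build_chat_completion` is hypothetical.

```python
import json
import time
from typing import Optional


def build_chat_completion(content: str, model: str = "llama",
                          created: Optional[int] = None) -> dict:
    """Return a dict shaped like the example chat completion response."""
    if created is None:
        created = int(time.time())
    return {
        "id": f"chatcmpl-{created}",  # assumption: id reuses the timestamp
        "object": "chat.completion",
        "created": created,
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": "stop",
            }
        ],
        # Placeholder counts copied from the example, not measured token usage.
        "usage": {"prompt_tokens": 100, "completion_tokens": 50, "total_tokens": 150},
    }


response = build_chat_completion("116", created=1769864875)
print(json.dumps(response, indent=2))
```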
## Files Created
1. `llama-server-simulator.py` - Main simulator script
2. `test-simulator.sh` - Basic test script
3. `test-simulator-comprehensive.sh` - Comprehensive test script
4. `llama-server-simulator-plan.md` - Implementation plan
5. `llama-eval-discussion.md` - Discussion notes
## Next Steps
1. Basic simulator structure (done)
2. AIME dataset integration (done)
3. Question matching with Levenshtein distance (done)
4. Response generation with configurable success rate (done)
5. Testing with curl requests (done)
6. Integrate with eval script
7. Implement eval state object
8. Implement processor object
9. Add real-time progress reporting
10. Add enhanced grading system with LLM judge
## Known Limitations
1. Only supports AIME dataset (train split)
2. Matching is case-insensitive
3. Wrong answers are simple increments (not realistic)
4. No support for multiple endpoints
5. No distributed evaluation
## Future Enhancements
1. Support multiple datasets
2. More sophisticated wrong answer generation
3. Multiple endpoint support
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization
7. Enhanced grading with LLM judge
8. Response truncation for better answer extraction