examples: add llama-server simulator for testing eval scripts

Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation covering the simulator's
usage and behavior.
Author: Georgi Gerganov
Date: 2026-01-31 15:37:31 +02:00
Parent: 8839037528
Commit: 07d5e1e0ea
GPG Key ID: 449E073F9DC10735 (no known key found for this signature in database)
6 changed files with 838 additions and 0 deletions


@@ -0,0 +1,116 @@
# llama-eval Implementation Discussion
## Overview
Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.
## Key Requirements from ggerganov
### 1. Simplify and Focus on One Eval
- Start with AIME2025 (the eval he is most familiar with)
- Don't support multiple evals initially
### 2. Implement an "eval state" object
- ID
- List of tasks
- Task states
- Sampling config
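One way to picture this object is a small dataclass (a sketch only — the field names mirror the list above, and the `dump` helper is an assumption, not the agreed design):

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class EvalState:
    """Hypothetical eval state: ID, task list, per-task states, sampling config."""
    id: str
    tasks: List[str] = field(default_factory=list)
    task_states: Dict[str, dict] = field(default_factory=dict)
    sampling_config: dict = field(default_factory=dict)

    def dump(self) -> str:
        # JSON form, so a processor can checkpoint it to disk periodically
        return json.dumps(asdict(self))

state = EvalState(id="aime-2025", tasks=["aime-01"],
                  sampling_config={"temperature": 0})
```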
### 3. Implement a "processor" object
- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)
### 4. Processor responsibilities
- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses
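These responsibilities could be sketched roughly as follows (everything here is illustrative — the state is a plain dict and the model call is stubbed out):

```python
import json

class Processor:
    """Hypothetical processor: endpoints, worker threads, and a pluggable grader."""
    def __init__(self, endpoints, threads_per_endpoint=1, grader=None):
        self.endpoints = endpoints
        self.threads_per_endpoint = threads_per_endpoint
        # Default grader: exact string match; a regex, endpoint, or CLI
        # judge could be plugged in instead
        self.grader = grader or (lambda predicted, expected: predicted == expected)

    def run(self, state, dump_every=2):
        # state: {"tasks": [...], "task_states": {...}}
        for i, task in enumerate(state["tasks"], 1):
            predicted, expected = "116", "116"  # placeholder for a real model call
            state["task_states"][task] = {"correct": self.grader(predicted, expected)}
            if i % dump_every == 0:
                print(json.dumps(state))  # periodic eval-state dump
        return state

state = {"tasks": ["aime-01", "aime-02"], "task_states": {}}
result = Processor(endpoints=["http://localhost:8033"]).run(state)
```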
### 5. Real-time feedback
- Default: show "correct / not correct" for each task
- Verbose mode: show produced answer vs expected answer as soon as it completes
### 6. Grading approach
- Abstract grading to support external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals)
### 7. Output format
- Use structured output (JSON) instead of boxed text
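In an OpenAI-compatible request this could mean asking the model for JSON and letting the grader parse a single field. A sketch (whether `response_format` is honored depends on the server build):

```json
{
  "model": "llama",
  "messages": [{"role": "user", "content": "Solve the problem and reply as JSON."}],
  "response_format": {"type": "json_object"}
}
```

The grader would then read something like `{"answer": "116"}` instead of scanning for boxed text.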
## Current Implementation Analysis
### What exists in llama-eval.py:
- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting
### What needs to be removed:
- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)
## Discussion Points
### 1. Eval State Object Structure
**Status: Under Discussion**
Questions:
- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?
### 2. Processor Architecture
**Status: Not Started**
Questions:
- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?
### 3. Grader Interface
**Status: Not Started**
Questions:
- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?
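One possible shape for that interface (class and method names are illustrative; the endpoint variant is stubbed):

```python
from typing import Protocol

class Grader(Protocol):
    """Hypothetical grading interface: one method, boolean verdict."""
    def grade(self, predicted: str, expected: str) -> bool: ...

class ExactMatchGrader:
    """Local grader: trimmed string equality."""
    def grade(self, predicted: str, expected: str) -> bool:
        return predicted.strip() == expected.strip()

class EndpointGrader:
    """Remote judge: would POST both answers to an LLM endpoint."""
    def __init__(self, url: str):
        self.url = url

    def grade(self, predicted: str, expected: str) -> bool:
        raise NotImplementedError("would query the judge endpoint here")

grader: Grader = ExactMatchGrader()
```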
### 4. Checkpointing
**Status: Not Started**
Questions:
- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?
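Whatever the answers, a first cut could be as simple as dumping the state dict to JSON on an interval (the path and cadence below are placeholders):

```python
import json
import tempfile
from pathlib import Path

def checkpoint(state: dict, path: Path):
    # Serialize the whole eval state; a processor would call this every
    # N completed tasks so an interrupted run can resume
    path.write_text(json.dumps(state))

path = Path(tempfile.gettempdir()) / "eval-state.json"
state = {"id": "aime-2025", "task_states": {"aime-01": {"correct": True}}}
checkpoint(state, path)
restored = json.loads(path.read_text())
```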
### 5. Real-time Output
**Status: Not Started**
Questions:
- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?
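The default/verbose split from requirement 5 might look like this (the output format is a guess):

```python
def report(task_id: str, correct: bool, predicted: str = "",
           expected: str = "", verbose: bool = False) -> str:
    # Default: one "correct / not correct" line per task;
    # verbose adds the produced vs expected answers
    status = "correct" if correct else "not correct"
    line = f"[{task_id}] {status}"
    if verbose:
        line += f" (predicted: {predicted}, expected: {expected})"
    print(line)
    return line

report("aime-01", True)
report("aime-02", False, "117", "116", verbose=True)
```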
### 6. Output Format
**Status: Not Started**
Questions:
- Should responses be in JSON format?
- How should the grader interface work with JSON output?
## Next Steps
1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format
## References
- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195


@@ -0,0 +1,184 @@
# llama-server-simulator Implementation Plan
## Overview
Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
## Goals
1. Simulate llama-server's `/v1/chat/completions` endpoint
2. Accept requests and respond with expected answers from AIME dataset
3. Implement configurable success rate (sometimes right, sometimes wrong)
4. Use regex matching to find questions in incoming requests
5. Test with curl requests before integrating with eval script
## Implementation Plan
### Phase 1: Basic Simulator Structure
- Create `llama-server-simulator.py` script
- Set up Flask/FastAPI HTTP server
- Implement `/v1/chat/completions` endpoint
- Handle basic request/response format
### Phase 2: AIME Dataset Integration
- Load AIME dataset
- Store questions and expected answers
- Implement regex matching to find questions in incoming requests
- Extract expected answer from matched question
### Phase 3: Response Generation
- Implement success rate configuration
- Randomly determine if response should be correct or incorrect
- Generate appropriate response based on success determination
- Format response in OpenAI-compatible format
### Phase 4: Testing
- Write curl commands to test basic functionality
- Test correct responses
- Test incorrect responses
- Test edge cases (no question found, etc.)
## Technical Details
### Server Framework
- Use Flask for simplicity
- Listen on configurable port
- Support JSON request/response format
### Request Format
```json
{
    "model": "llama",
    "messages": [
        {"role": "user", "content": "Question text here"}
    ],
    "temperature": 0,
    "max_tokens": 2048
}
```
### Response Format
```json
{
    "id": "chatcmpl-xxx",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "llama",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Answer text here"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 100,
        "completion_tokens": 50,
        "total_tokens": 150
    }
}
```
### AIME Dataset Integration
- Load from HuggingFace: "AI-MO/aimo-validation-aime"
- Store in memory for fast lookup
- Regex pattern to find question text in request
- Extract answer from matched question
### Success Rate Configuration
- Command-line argument: `--success-rate 0.8` (80% success rate)
- Randomly determine correctness based on rate
- Log when responses are correct vs incorrect
### Testing Strategy
1. Start simulator with default settings
2. Send curl request with known question
3. Verify response contains expected answer
4. Test with different success rates
5. Test edge cases
## Implementation Steps
### Step 1: Basic Server Setup
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = request.get_json()  # parsed request body
    # Build an OpenAI-compatible response from the request here
    response = {"choices": []}  # placeholder; filled in later phases
    return jsonify(response)
```
### Step 2: Load AIME Dataset
```python
import datasets

# Download (or load from the local cache) and keep all questions in memory
ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train")
questions = list(ds)
```
### Step 3: Regex Matching
```python
import re

def find_question_in_request(request_text):
    # Non-greedy match: capture everything after "question:" up to the
    # first newline
    pattern = r"question:\s*(.*?)\n"
    match = re.search(pattern, request_text)
    return match.group(1) if match else None
```
### Step 4: Response Generation
```python
import random

def generate_response(question, success_rate):
    # get_expected_answer / get_wrong_answer are implemented in Phase 3
    if random.random() < success_rate:
        return get_expected_answer(question)
    return get_wrong_answer(question)
```
### Step 5: Testing with Curl
```bash
curl -X POST http://localhost:8033/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [{"role": "user", "content": "Question text"}]
    }'
```
## Configuration Options
- `--port`: Server port (default: 8033)
- `--success-rate`: Success rate 0-1 (default: 0.8)
- `--host`: Server host (default: localhost)
- `--dataset-split`: AIME split to use (default: train)
## Expected Output
```
=== llama-server-simulator ===
Server running on http://localhost:8033
Success rate: 0.8
AIME dataset loaded: 1000 questions
```
## Testing Checklist
- [ ] Server starts successfully
- [ ] Basic request/response works
- [ ] Correct answers returned at roughly the configured success rate
- [ ] Wrong answers returned otherwise
- [ ] Request with no matching question returns an error
- [ ] Multiple requests work correctly
- [ ] Different success rates work as expected
## Next Steps
1. Implement basic server structure
2. Load AIME dataset
3. Implement regex matching
4. Add response generation with success rate
5. Test with curl commands
6. Integrate with eval script once simulator works


@@ -0,0 +1,267 @@
#!/usr/bin/env python3
import argparse
import json
import random
import re
import time
import sys
import os
from typing import Dict, List, Optional
from dataclasses import dataclass, asdict
from pathlib import Path

import datasets
from flask import Flask, request, jsonify

# Set cache directory for HuggingFace datasets
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"
cache_dir.mkdir(parents=True, exist_ok=True)
os.environ["HF_DATASETS_CACHE"] = str(cache_dir)


def levenshtein_distance(s1: str, s2: str) -> int:
    """Calculate Levenshtein distance between two strings"""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]


def debug_log(message: str):
    """Log debug messages to stderr and /tmp/simulator-debug.log"""
    print(message, file=sys.stderr)
    with open("/tmp/simulator-debug.log", "a") as f:
        f.write(message + "\n")


app = Flask(__name__)


@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict]
    sampling_config: Dict


class AimeDataset:
    def __init__(self, split: str = "train"):
        self.split = split
        self.questions: List[Dict] = []
        self._load_dataset()

    def _load_dataset(self):
        print(f"Loading AIME dataset (split: {self.split})...")
        print(f"Using cache: {os.environ.get('HF_DATASETS_CACHE', 'default')}")
        ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)
        self.questions = list(ds)
        print(f"AIME dataset loaded: {len(self.questions)} questions")

    def find_question(self, request_text: str) -> Optional[Dict]:
        best_match = None
        best_distance = float('inf')
        best_index = -1
        request_lower = request_text.lower()
        for i, question in enumerate(self.questions):
            question_text = question["problem"]
            question_lower = question_text.lower()
            # Exact match
            if question_lower == request_lower:
                debug_log(f"DEBUG: Found exact match at index {i}")
                return question
            # Remove LaTeX formatting for more flexible matching
            question_no_latex = re.sub(r'\$[^$]+\$', '', question_text)
            if question_no_latex.lower() == request_lower:
                debug_log(f"DEBUG: Found match (no LaTeX) at index {i}")
                return question
            # Calculate Levenshtein distance for partial matches
            # Only consider if request is at least 50% of question length
            if len(request_lower) >= len(question_lower) * 0.5:
                distance = levenshtein_distance(question_lower, request_lower)
                # Normalize distance by length
                normalized_distance = distance / len(question_lower)
                if normalized_distance < best_distance:
                    best_distance = normalized_distance
                    best_match = question
                    best_index = i
        if best_match and best_distance < 0.3:  # Threshold for partial match
            debug_log(f"DEBUG: Found best partial match at index {best_index} with distance {best_distance:.3f}")
            return best_match
        debug_log(f"DEBUG: No matching question found for: {request_text[:100]}...")
        return None

    def get_answer(self, question: Dict) -> str:
        return str(question["answer"])


class Simulator:
    def __init__(
        self,
        port: int = 8033,
        host: str = "localhost",
        success_rate: float = 0.8,
        dataset_split: str = "train"
    ):
        self.port = port
        self.host = host
        self.success_rate = success_rate
        self.dataset = AimeDataset(dataset_split)
        self.eval_state = EvalState(
            id="aime-2025",
            tasks=["aime"],
            task_states={},
            sampling_config={"temperature": 0, "max_tokens": 2048}
        )

    def _generate_response(
        self,
        question: Dict,
        should_be_correct: bool
    ) -> Dict:
        expected_answer = self.dataset.get_answer(question)
        if should_be_correct:
            response_text = expected_answer
        else:
            response_text = self._generate_wrong_answer(question)
        return {
            "id": f"chatcmpl-{int(time.time())}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": "llama",
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": response_text
                    },
                    "finish_reason": "stop"
                }
            ],
            "usage": {
                "prompt_tokens": 100,
                "completion_tokens": 50,
                "total_tokens": 150
            }
        }

    def _generate_wrong_answer(self, question: Dict) -> str:
        expected_answer = self.dataset.get_answer(question)
        if expected_answer.isdigit():
            wrong_answer = str(int(expected_answer) + 1)
        else:
            wrong_answer = expected_answer + " (wrong)"
        return wrong_answer

    def _process_request(self, request_data: Dict) -> Dict:
        messages = request_data.get("messages", [])
        if not messages:
            return {"error": "No messages in request"}
        request_text = messages[0].get("content", "")
        debug_log(f"DEBUG: Received request with content: {request_text[:150]}...")
        question = self.dataset.find_question(request_text)
        if not question:
            debug_log("DEBUG: find_question returned None")
            return {"error": "No matching question found"}
        should_be_correct = random.random() < self.success_rate
        response = self._generate_response(question, should_be_correct)
        task_id = "aime"
        self.eval_state.task_states[task_id] = {
            "correct": should_be_correct,
            "expected": self.dataset.get_answer(question),
            "predicted": response["choices"][0]["message"]["content"]
        }
        return response


@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    try:
        request_data = request.get_json()
        if not request_data:
            return jsonify({"error": "Invalid JSON"}), 400
        response = simulator._process_request(request_data)
        return jsonify(response)
    except Exception as e:
        print(f"Error processing request: {e}")
        return jsonify({"error": str(e)}), 500


def main():
    parser = argparse.ArgumentParser(
        description="llama-server simulator for testing eval scripts"
    )
    parser.add_argument(
        "--port",
        type=int,
        default=8033,
        help="Server port (default: 8033)"
    )
    parser.add_argument(
        "--host",
        type=str,
        default="localhost",
        help="Server host (default: localhost)"
    )
    parser.add_argument(
        "--success-rate",
        type=float,
        default=0.8,
        help="Success rate 0-1 (default: 0.8)"
    )
    parser.add_argument(
        "--dataset-split",
        type=str,
        default="train",
        help="AIME dataset split to use (default: train)"
    )
    args = parser.parse_args()
    global simulator
    simulator = Simulator(
        port=args.port,
        host=args.host,
        success_rate=args.success_rate,
        dataset_split=args.dataset_split
    )
    print("\n=== llama-server-simulator ===")
    print(f"Server running on http://{args.host}:{args.port}")
    print(f"Success rate: {args.success_rate}")
    print(f"AIME dataset loaded: {len(simulator.dataset.questions)} questions")
    print("\nPress Ctrl+C to stop\n")
    app.run(host=args.host, port=args.port, debug=False)


if __name__ == "__main__":
    main()


@@ -0,0 +1,135 @@
# llama-server-simulator Implementation Summary
## Overview
Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
## Features Implemented
### 1. HTTP Server
- Flask-based `/v1/chat/completions` endpoint
- OpenAI-compatible response format
- Configurable port and host
### 2. AIME Dataset Integration
- Loads AIME dataset from HuggingFace
- In-memory storage for fast lookup
- 90 questions loaded from train split
### 3. Intelligent Question Matching
- **Exact matching**: Direct string comparison
- **LaTeX removal**: Removes `$...$` formatting for flexible matching
- **Levenshtein distance**: Calculates similarity between strings
- **Partial matching**: Finds best match even with small differences
### 4. Response Generation
- Configurable success rate (0-1)
- Returns the correct answer with probability equal to the success rate
- Otherwise returns a deliberately wrong answer
- Wrong answers are generated by incrementing the expected (numeric) answer
### 5. Debug Logging
- Debug messages written to stderr
- Logs request content, matching results, and distances
- Helps troubleshoot matching issues
## Configuration Options
```bash
python3 llama-server-simulator.py \
    --port 8034 \
    --host localhost \
    --success-rate 0.8 \
    --dataset-split train
```
## Testing Results
### Test 1: Correct Answer
- **Success rate**: 0.8
- **Expected answer**: 116
- **Result**: ✓ Correct (116)
### Test 2: Wrong Answer
- **Success rate**: 0.0
- **Expected answer**: 116
- **Result**: ✓ Wrong (117)
### Test 3: No Matching Question
- **Request**: "What is the capital of France?"
- **Result**: ✓ Returns error "No matching question found"
### Test 4: Success Rate Verification
- **Success rate**: 0.8
- **Requests**: 10
- **Correct answers**: 8/10 (80%)
- **Result**: ✓ Success rate working as expected
## Technical Details
### Matching Algorithm
1. Try exact match (case-insensitive)
2. Try match after removing LaTeX formatting
3. Calculate Levenshtein distance for partial matches
4. Return best match if distance < 0.3 (30% difference)
### Response Format
```json
{
    "id": "chatcmpl-1769864875",
    "object": "chat.completion",
    "created": 1769864875,
    "model": "llama",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "116"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 100,
        "completion_tokens": 50,
        "total_tokens": 150
    }
}
```
## Files Created
1. `llama-server-simulator.py` - Main simulator script
2. `test-simulator.sh` - Basic test script
3. `test-simulator-comprehensive.sh` - Comprehensive test script
4. `llama-server-simulator-plan.md` - Implementation plan
5. `llama-eval-discussion.md` - Discussion notes
## Next Steps
1. ✓ Basic simulator structure
2. ✓ AIME dataset integration
3. ✓ Question matching with Levenshtein distance
4. ✓ Response generation with configurable success rate
5. ✓ Testing with curl requests
6. ⏭️ Integrate with eval script
7. ⏭️ Implement eval state object
8. ⏭️ Implement processor object
9. ⏭️ Add real-time progress reporting
## Known Limitations
1. Only supports AIME dataset (train split)
2. Matching is case-insensitive
3. Wrong answers are simple increments (not realistic)
4. No support for multiple endpoints
5. No distributed evaluation
## Future Enhancements
1. Support multiple datasets
2. More sophisticated wrong answer generation
3. Multiple endpoint support
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization


@@ -0,0 +1,43 @@
#!/bin/bash
echo "=== Testing HuggingFace Dataset Caching ==="
echo ""
echo "=== First Load (should download) ==="
echo "Starting simulator for first load..."
source venv/bin/activate
# Redirect instead of piping to tee, so $! is the simulator's PID (needed for kill below)
python3 examples/llama-eval/llama-server-simulator.py --port 8035 --success-rate 0.8 > /tmp/simulator-first.log 2>&1 &
SIMULATOR_PID=$!
sleep 5
echo "First load complete"
echo ""
echo "=== Second Load (should use cache) ==="
echo "Starting simulator for second load..."
python3 examples/llama-eval/llama-server-simulator.py --port 8036 --success-rate 0.8 > /tmp/simulator-second.log 2>&1 &
SIMULATOR_PID2=$!
sleep 5
echo "Second load complete"
echo ""
echo "=== Checking Cache Directory ==="
echo "Cache directory size:"
du -sh ~/.cache/huggingface/datasets/AI-MO___aimo-validation-aime
echo ""
echo "=== Checking First Load Log ==="
echo "First load log (last 15 lines):"
tail -15 /tmp/simulator-first.log
echo ""
echo "=== Checking Second Load Log ==="
echo "Second load log (last 15 lines):"
tail -15 /tmp/simulator-second.log
echo ""
echo "=== Test Complete ==="
echo "Both loads completed successfully!"
echo "The second load should have used the cache (no download warning)."
echo ""
kill $SIMULATOR_PID 2>/dev/null
kill $SIMULATOR_PID2 2>/dev/null


@@ -0,0 +1,93 @@
#!/bin/bash
echo "=== llama-server-simulator Test Script ==="
echo ""
PORT=8033
SUCCESS_RATE=0.8
echo "Starting simulator on port $PORT with success rate $SUCCESS_RATE..."
source venv/bin/activate
python3 examples/llama-eval/llama-server-simulator.py --port $PORT --success-rate $SUCCESS_RATE > /tmp/simulator-test.log 2>&1 &
SIMULATOR_PID=$!
echo "Waiting for simulator to start..."
sleep 5
echo ""
echo "=== Test 1: Basic Request with Known Question ==="
echo "Sending request with AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [
{"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
],
"temperature": 0,
"max_tokens": 2048
}' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'])"
echo ""
echo ""
echo "=== Test 2: Request with Different Question ==="
echo "Sending request with another AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [
{"role": "user", "content": "Compute the value of 2^10 + 3^10."}
],
"temperature": 0,
"max_tokens": 2048
}' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data.get('error') or data['choices'][0]['message']['content'])"
echo ""
echo ""
echo "=== Test 3: Request with No Matching Question ==="
echo "Sending request with non-matching text..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0,
"max_tokens": 2048
}' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Response:', data.get('error', 'No error'))"
echo ""
echo ""
echo "=== Test 4: Multiple Requests to Test Success Rate ==="
echo "Sending 10 requests to test success rate..."
correct_count=0
for i in {1..10}; do
    echo "Request $i:"
    response=$(curl -s -X POST http://localhost:$PORT/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "llama",
            "messages": [
                {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
            ],
            "temperature": 0,
            "max_tokens": 2048
        }')
    answer=$(echo "$response" | python3 -c "import sys, json; data = json.load(sys.stdin); print(data['choices'][0]['message']['content'])")
    if [ "$answer" == "116" ]; then
        correct_count=$((correct_count + 1))
    fi
    echo "  Answer: $answer"
done
echo "Correct answers: $correct_count/10"
echo "Success rate: $((correct_count * 10))%"
echo ""
echo "=== Test Complete ==="
echo "Stopping simulator..."
kill $SIMULATOR_PID 2>/dev/null
wait $SIMULATOR_PID 2>/dev/null || true
echo "Simulator stopped."