examples: add llama-server simulator for testing eval scripts

Add a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding the simulator functionality.
This commit is contained in: parent 8839037528, commit 07d5e1e0ea
@ -0,0 +1,116 @@
# llama-eval Implementation Discussion

## Overview

Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.

## Key Requirements from ggerganov

### 1. Simplify and Focus on One Eval

- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially

### 2. Implement an "eval state" object

- ID
- List of tasks
- Task states
- Sampling config
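A minimal sketch of how these fields might map to a Python dataclass (the names and JSON round-trip are assumptions, not the final design):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class EvalState:
    # Unique identifier for this eval run
    id: str
    # Task IDs included in the run
    tasks: list
    # Per-task progress/results, keyed by task ID
    task_states: dict = field(default_factory=dict)
    # Sampling parameters forwarded to the endpoint
    sampling_config: dict = field(default_factory=dict)

state = EvalState(id="aime-2025", tasks=["aime-1"],
                  sampling_config={"temperature": 0, "max_tokens": 2048})
# The dataclass round-trips through JSON, which is what makes
# periodic checkpointing (see below) straightforward.
restored = EvalState(**json.loads(json.dumps(asdict(state))))
```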
### 3. Implement a "processor" object

- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)

### 4. Processor responsibilities

- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses
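One way the processor loop could look, assuming a plain-dict eval state and a hypothetical `run_task` callable (a sketch, not the final architecture):

```python
import json
from concurrent.futures import ThreadPoolExecutor

def process(state, run_task, dump_path, threads=4, dump_every=2):
    """Run all tasks in a thread pool and periodically persist the eval state."""
    done = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results pair up with task IDs
        for task_id, result in zip(state["tasks"],
                                   pool.map(run_task, state["tasks"])):
            state["task_states"][task_id] = result
            done += 1
            if done % dump_every == 0:  # periodic checkpoint
                with open(dump_path, "w") as f:
                    json.dump(state, f)
    with open(dump_path, "w") as f:  # final dump
        json.dump(state, f)
```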
### 5. Real-time feedback

- Default: show "correct / not correct" for each task
- Verbose mode: show produced answer vs expected answer as soon as it completes

### 6. Grading approach

- Abstract grading to support external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals)
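A sketch of what an LLM-judge request payload could look like against any OpenAI-compatible endpoint; the prompt wording, helper name, and use of `response_format` are assumptions, not a settled interface:

```python
def build_judge_request(question, expected, produced, model="llama"):
    """Build a chat-completions payload asking a judge model for a verdict."""
    prompt = (
        "You are grading a math answer.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Produced answer: {produced}\n"
        'Reply with JSON: {"correct": true} or {"correct": false}.'
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        # Asking for a JSON object keeps verdict parsing trivial
        # compared to regex extraction from free-form text.
        "response_format": {"type": "json_object"},
    }

payload = build_judge_request("1+1?", "2", "2")
```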
### 7. Output format

- Use structured output (JSON) instead of boxed text
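For example, if the model is asked to answer in JSON rather than `\boxed{}` text, extraction reduces to a `json.loads` call (a sketch; the `{"answer": ...}` schema is an assumption):

```python
import json

def extract_answer(completion_text):
    """Parse a structured {"answer": ...} reply; None if it is not valid JSON."""
    try:
        return json.loads(completion_text).get("answer")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON at all, or JSON that is not an object
        return None
```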
## Current Implementation Analysis

### What exists in llama-eval.py:

- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting

### What needs to be removed:

- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)

## Discussion Points

### 1. Eval State Object Structure

**Status: Under Discussion**

Questions:

- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?

### 2. Processor Architecture

**Status: Not Started**

Questions:

- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?

### 3. Grader Interface

**Status: Not Started**

Questions:

- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?

### 4. Checkpointing

**Status: Not Started**

Questions:

- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?
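If the answer is JSON-on-disk, an atomic write avoids corrupting the checkpoint when the process is interrupted mid-dump (a sketch; the helper names and cadence are assumptions):

```python
import json
import os
import tempfile

def dump_checkpoint(state, path):
    """Write state to path atomically: write a temp file, then rename over it."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```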
### 5. Real-time Output

**Status: Not Started**

Questions:

- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?

### 6. Output Format

**Status: Not Started**

Questions:

- Should responses be in JSON format?
- How should the grader interface work with JSON output?

## Next Steps

1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format

## References

- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195
@ -0,0 +1,184 @@
# llama-server-simulator Implementation Plan

## Overview

Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.

## Goals

1. Simulate llama-server's `/v1/chat/completions` endpoint
2. Accept requests and respond with expected answers from the AIME dataset
3. Implement a configurable success rate (sometimes right, sometimes wrong)
4. Use regex matching to find questions in incoming requests
5. Test with curl requests before integrating with the eval script

## Implementation Plan

### Phase 1: Basic Simulator Structure
- Create `llama-server-simulator.py` script
- Set up a Flask/FastAPI HTTP server
- Implement the `/v1/chat/completions` endpoint
- Handle the basic request/response format

### Phase 2: AIME Dataset Integration
- Load the AIME dataset
- Store questions and expected answers
- Implement regex matching to find questions in incoming requests
- Extract the expected answer from the matched question

### Phase 3: Response Generation
- Implement success rate configuration
- Randomly determine if the response should be correct or incorrect
- Generate the appropriate response based on that determination
- Format the response in OpenAI-compatible format

### Phase 4: Testing
- Write curl commands to test basic functionality
- Test correct responses
- Test incorrect responses
- Test edge cases (no question found, etc.)

## Technical Details

### Server Framework
- Use Flask for simplicity
- Listen on a configurable port
- Support JSON request/response format

### Request Format
```json
{
    "model": "llama",
    "messages": [
        {"role": "user", "content": "Question text here"}
    ],
    "temperature": 0,
    "max_tokens": 2048
}
```

### Response Format
```json
{
    "id": "chatcmpl-xxx",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "llama",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Answer text here"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 100,
        "completion_tokens": 50,
        "total_tokens": 150
    }
}
```

### AIME Dataset Integration
- Load from HuggingFace: "AI-MO/aimo-validation-aime"
- Store in memory for fast lookup
- Regex pattern to find the question text in the request
- Extract the answer from the matched question

### Success Rate Configuration
- Command-line argument: `--success-rate 0.8` (80% success rate)
- Randomly determine correctness based on the rate
- Log when responses are correct vs incorrect

### Testing Strategy
1. Start the simulator with default settings
2. Send a curl request with a known question
3. Verify the response contains the expected answer
4. Test with different success rates
5. Test edge cases

## Implementation Steps

### Step 1: Basic Server Setup
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    # Handle the request and build an OpenAI-compatible response dict
    return jsonify(response)
```

### Step 2: Load AIME Dataset
```python
import datasets

ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train")
# Store in memory
```

### Step 3: Regex Matching
```python
import re

def find_question_in_request(request_text):
    # Capture everything after a "question:" marker up to the end of that line
    # (no trailing-newline requirement, so the last line of a request matches too)
    pattern = r"question:\s*(.*)"
    match = re.search(pattern, request_text)
    return match.group(1) if match else None
```

### Step 4: Response Generation
```python
import random

def generate_response(question, success_rate):
    if random.random() < success_rate:
        return get_expected_answer(question)
    else:
        return get_wrong_answer(question)
```

### Step 5: Testing with Curl
```bash
curl -X POST http://localhost:8033/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [{"role": "user", "content": "Question text"}]
    }'
```

## Configuration Options
- `--port`: Server port (default: 8033)
- `--success-rate`: Success rate 0-1 (default: 0.8)
- `--host`: Server host (default: localhost)
- `--dataset-split`: AIME split to use (default: train)

## Expected Output
```
=== llama-server-simulator ===
Server running on http://localhost:8033
Success rate: 0.8
AIME dataset loaded: 1000 questions
```

## Testing Checklist
- [ ] Server starts successfully
- [ ] Basic request/response works
- [ ] Correct answer returned when success rate allows
- [ ] Wrong answer returned when success rate doesn't allow
- [ ] No question found returns error
- [ ] Multiple requests work correctly
- [ ] Different success rates work as expected

## Next Steps
1. Implement basic server structure
2. Load AIME dataset
3. Implement regex matching
4. Add response generation with success rate
5. Test with curl commands
6. Integrate with eval script once simulator works
@ -0,0 +1,267 @@
#!/usr/bin/env python3

import argparse
import os
import random
import re
import sys
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional

import datasets
from flask import Flask, request, jsonify

# Set cache directory for HuggingFace datasets
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"
cache_dir.mkdir(parents=True, exist_ok=True)
os.environ["HF_DATASETS_CACHE"] = str(cache_dir)


def levenshtein_distance(s1: str, s2: str) -> int:
    """Calculate the Levenshtein distance between two strings"""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]


def debug_log(message: str):
    """Log debug messages to both stderr and a file"""
    print(message, file=sys.stderr)
    with open("/tmp/simulator-debug.log", "a") as f:
        f.write(message + "\n")


app = Flask(__name__)


@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict]
    sampling_config: Dict


class AimeDataset:
    def __init__(self, split: str = "train"):
        self.split = split
        self.questions: List[Dict] = []
        self._load_dataset()

    def _load_dataset(self):
        print(f"Loading AIME dataset (split: {self.split})...")
        print(f"Using cache: {os.environ.get('HF_DATASETS_CACHE', 'default')}")

        ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)
        self.questions = list(ds)
        print(f"AIME dataset loaded: {len(self.questions)} questions")

    def find_question(self, request_text: str) -> Optional[Dict]:
        best_match = None
        best_distance = float('inf')
        best_index = -1

        request_lower = request_text.lower()

        for i, question in enumerate(self.questions):
            question_text = question["problem"]
            question_lower = question_text.lower()

            # Exact match
            if question_lower == request_lower:
                debug_log(f"DEBUG: Found exact match at index {i}")
                return question

            # Remove LaTeX formatting for more flexible matching
            question_no_latex = re.sub(r'\$[^$]+\$', '', question_text)
            if question_no_latex.lower() == request_lower:
                debug_log(f"DEBUG: Found match (no LaTeX) at index {i}")
                return question

            # Calculate Levenshtein distance for partial matches.
            # Only consider if the request is at least 50% of the question length.
            if len(request_lower) >= len(question_lower) * 0.5:
                distance = levenshtein_distance(question_lower, request_lower)
                # Normalize distance by question length
                normalized_distance = distance / len(question_lower)

                if normalized_distance < best_distance:
                    best_distance = normalized_distance
                    best_match = question
                    best_index = i

        if best_match and best_distance < 0.3:  # Threshold for partial match
            debug_log(f"DEBUG: Found best partial match at index {best_index} "
                      f"with distance {best_distance:.3f}")
            return best_match

        debug_log(f"DEBUG: No matching question found for: {request_text[:100]}...")
        return None

    def get_answer(self, question: Dict) -> str:
        return str(question["answer"])


class Simulator:
    def __init__(
        self,
        port: int = 8033,
        host: str = "localhost",
        success_rate: float = 0.8,
        dataset_split: str = "train"
    ):
        self.port = port
        self.host = host
        self.success_rate = success_rate
        self.dataset = AimeDataset(dataset_split)
        self.eval_state = EvalState(
            id="aime-2025",
            tasks=["aime"],
            task_states={},
            sampling_config={"temperature": 0, "max_tokens": 2048}
        )

    def _generate_response(
        self,
        question: Dict,
        should_be_correct: bool
    ) -> Dict:
        expected_answer = self.dataset.get_answer(question)

        if should_be_correct:
            response_text = expected_answer
        else:
            response_text = self._generate_wrong_answer(question)

        return {
            "id": f"chatcmpl-{int(time.time())}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": "llama",
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": response_text
                    },
                    "finish_reason": "stop"
                }
            ],
            "usage": {
                "prompt_tokens": 100,
                "completion_tokens": 50,
                "total_tokens": 150
            }
        }

    def _generate_wrong_answer(self, question: Dict) -> str:
        expected_answer = self.dataset.get_answer(question)

        if expected_answer.isdigit():
            wrong_answer = str(int(expected_answer) + 1)
        else:
            wrong_answer = expected_answer + " (wrong)"

        return wrong_answer

    def _process_request(self, request_data: Dict) -> Dict:
        messages = request_data.get("messages", [])
        if not messages:
            return {"error": "No messages in request"}

        request_text = messages[0].get("content", "")
        debug_log(f"DEBUG: Received request with content: {request_text[:150]}...")

        question = self.dataset.find_question(request_text)
        if not question:
            debug_log("DEBUG: find_question returned None")
            return {"error": "No matching question found"}

        should_be_correct = random.random() < self.success_rate

        response = self._generate_response(question, should_be_correct)

        task_id = "aime"
        self.eval_state.task_states[task_id] = {
            "correct": should_be_correct,
            "expected": self.dataset.get_answer(question),
            "predicted": response["choices"][0]["message"]["content"]
        }

        return response


@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    try:
        request_data = request.get_json()

        if not request_data:
            return jsonify({"error": "Invalid JSON"}), 400

        response = simulator._process_request(request_data)

        return jsonify(response)

    except Exception as e:
        print(f"Error processing request: {e}")
        return jsonify({"error": str(e)}), 500


def main():
    parser = argparse.ArgumentParser(
        description="llama-server simulator for testing eval scripts"
    )
    parser.add_argument(
        "--port", type=int, default=8033,
        help="Server port (default: 8033)"
    )
    parser.add_argument(
        "--host", type=str, default="localhost",
        help="Server host (default: localhost)"
    )
    parser.add_argument(
        "--success-rate", type=float, default=0.8,
        help="Success rate 0-1 (default: 0.8)"
    )
    parser.add_argument(
        "--dataset-split", type=str, default="train",
        help="AIME dataset split to use (default: train)"
    )

    args = parser.parse_args()

    global simulator
    simulator = Simulator(
        port=args.port,
        host=args.host,
        success_rate=args.success_rate,
        dataset_split=args.dataset_split
    )

    print("\n=== llama-server-simulator ===")
    print(f"Server running on http://{args.host}:{args.port}")
    print(f"Success rate: {args.success_rate}")
    print(f"AIME dataset loaded: {len(simulator.dataset.questions)} questions")
    print("\nPress Ctrl+C to stop\n")

    app.run(host=args.host, port=args.port, debug=False)


if __name__ == "__main__":
    main()
@ -0,0 +1,135 @@
# llama-server-simulator Implementation Summary

## Overview

Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.

## Features Implemented

### 1. HTTP Server
- Flask-based `/v1/chat/completions` endpoint
- OpenAI-compatible response format
- Configurable port and host

### 2. AIME Dataset Integration
- Loads AIME dataset from HuggingFace
- In-memory storage for fast lookup
- 90 questions loaded from train split

### 3. Intelligent Question Matching
- **Exact matching**: Direct string comparison
- **LaTeX removal**: Removes `$...$` formatting for flexible matching
- **Levenshtein distance**: Calculates similarity between strings
- **Partial matching**: Finds best match even with small differences

### 4. Response Generation
- Configurable success rate (0-1)
- Returns correct answers when the success rate allows
- Returns wrong answers when it doesn't
- Wrong answers are generated by incrementing the expected answer

### 5. Debug Logging
- Debug messages written to stderr
- Logs request content, matching results, and distances
- Helps troubleshoot matching issues

## Configuration Options

```bash
python3 llama-server-simulator.py \
    --port 8034 \
    --host localhost \
    --success-rate 0.8 \
    --dataset-split train
```

## Testing Results

### Test 1: Correct Answer
- **Success rate**: 0.8
- **Expected answer**: 116
- **Result**: ✓ Correct (116)

### Test 2: Wrong Answer
- **Success rate**: 0.0
- **Expected answer**: 116
- **Result**: ✓ Wrong (117)

### Test 3: No Matching Question
- **Request**: "What is the capital of France?"
- **Result**: ✓ Returns error "No matching question found"

### Test 4: Success Rate Verification
- **Success rate**: 0.8
- **Requests**: 10
- **Correct answers**: 8/10 (80%)
- **Result**: ✓ Success rate working as expected

## Technical Details

### Matching Algorithm
1. Try exact match (case-insensitive)
2. Try match after removing LaTeX formatting
3. Calculate Levenshtein distance for partial matches
4. Return best match if distance < 0.3 (30% difference)
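The cascade can be sketched in isolation as follows; this is a simplified version of what the script does (it omits the 50%-length gate) with a plain edit-distance helper:

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a):
        cur = [i + 1]
        for j, cb in enumerate(b):
            cur.append(min(prev[j + 1] + 1, cur[j] + 1, prev[j] + (ca != cb)))
        prev = cur
    return prev[-1]

def match(request_text, questions, threshold=0.3):
    """Return the best-matching question, or None if nothing is close enough."""
    req = request_text.lower()
    best, best_d = None, float("inf")
    for q in questions:
        ql = q.lower()
        # Steps 1-2: exact or LaTeX-stripped match wins immediately
        if ql == req or re.sub(r"\$[^$]+\$", "", q).lower() == req:
            return q
        # Step 3: normalized edit distance for partial matches
        d = edit_distance(ql, req) / len(ql)
        if d < best_d:
            best, best_d = q, d
    # Step 4: accept the best candidate only under the threshold
    return best if best_d < threshold else None
```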
### Response Format

```json
{
    "id": "chatcmpl-1769864875",
    "object": "chat.completion",
    "created": 1769864875,
    "model": "llama",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "116"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 100,
        "completion_tokens": 50,
        "total_tokens": 150
    }
}
```

## Files Created

1. `llama-server-simulator.py` - Main simulator script
2. `test-simulator.sh` - Basic test script
3. `test-simulator-comprehensive.sh` - Comprehensive test script
4. `llama-server-simulator-plan.md` - Implementation plan
5. `llama-eval-discussion.md` - Discussion notes

## Next Steps

1. ✓ Basic simulator structure
2. ✓ AIME dataset integration
3. ✓ Question matching with Levenshtein distance
4. ✓ Response generation with configurable success rate
5. ✓ Testing with curl requests
6. ⏭️ Integrate with eval script
7. ⏭️ Implement eval state object
8. ⏭️ Implement processor object
9. ⏭️ Add real-time progress reporting

## Known Limitations

1. Only supports AIME dataset (train split)
2. Matching is case-insensitive
3. Wrong answers are simple increments (not realistic)
4. No support for multiple endpoints
5. No distributed evaluation

## Future Enhancements

1. Support multiple datasets
2. More sophisticated wrong answer generation
3. Multiple endpoint support
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization
@ -0,0 +1,43 @@
#!/bin/bash

echo "=== Testing HuggingFace Dataset Caching ==="
echo ""

echo "=== First Load (should download) ==="
echo "Starting simulator for first load..."
source venv/bin/activate && python3 examples/llama-eval/llama-server-simulator.py --port 8035 --success-rate 0.8 2>&1 | tee /tmp/simulator-first.log &
SIMULATOR_PID=$!
sleep 5
echo "First load complete"
echo ""

echo "=== Second Load (should use cache) ==="
echo "Starting simulator for second load..."
source venv/bin/activate && python3 examples/llama-eval/llama-server-simulator.py --port 8036 --success-rate 0.8 2>&1 | tee /tmp/simulator-second.log &
SIMULATOR_PID2=$!
sleep 5
echo "Second load complete"
echo ""

echo "=== Checking Cache Directory ==="
echo "Cache directory size:"
du -sh ~/.cache/huggingface/datasets/AI-MO___aimo-validation-aime
echo ""

echo "=== Checking First Load Log ==="
echo "First load log (last 15 lines):"
tail -15 /tmp/simulator-first.log
echo ""

echo "=== Checking Second Load Log ==="
echo "Second load log (last 15 lines):"
tail -15 /tmp/simulator-second.log
echo ""

echo "=== Test Complete ==="
echo "Both loads completed successfully!"
echo "The second load should have used the cache (no download warning)."
echo ""

# $! captures the PID of the last command in the background pipeline (tee),
# so also kill any leftover simulator processes directly.
kill $SIMULATOR_PID $SIMULATOR_PID2 2>/dev/null
pkill -f llama-server-simulator.py 2>/dev/null
@ -0,0 +1,93 @@
#!/bin/bash

echo "=== llama-server-simulator Test Script ==="
echo ""

PORT=8033
SUCCESS_RATE=0.8

echo "Starting simulator on port $PORT with success rate $SUCCESS_RATE..."
source venv/bin/activate
python3 examples/llama-eval/llama-server-simulator.py --port $PORT --success-rate $SUCCESS_RATE > /tmp/simulator-test.log 2>&1 &
SIMULATOR_PID=$!

echo "Waiting for simulator to start..."
sleep 5

echo ""
echo "=== Test 1: Basic Request with Known Question ==="
echo "Sending request with AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [
            {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
        ],
        "temperature": 0,
        "max_tokens": 2048
    }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'])"

echo ""
echo ""
echo "=== Test 2: Request with Different Question ==="
echo "Sending request with another AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [
            {"role": "user", "content": "Compute the value of 2^10 + 3^10."}
        ],
        "temperature": 0,
        "max_tokens": 2048
    }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'])"

echo ""
echo ""
echo "=== Test 3: Request with No Matching Question ==="
echo "Sending request with non-matching text..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0,
        "max_tokens": 2048
    }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Response:', data.get('error', 'No error'))"

echo ""
echo ""
echo "=== Test 4: Multiple Requests to Test Success Rate ==="
echo "Sending 10 requests to test success rate..."
correct_count=0
for i in {1..10}; do
    echo "Request $i:"
    response=$(curl -s -X POST http://localhost:$PORT/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "llama",
            "messages": [
                {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
            ],
            "temperature": 0,
            "max_tokens": 2048
        }')
    answer=$(echo "$response" | python3 -c "import sys, json; data = json.load(sys.stdin); print(data['choices'][0]['message']['content'])")
    if [ "$answer" == "116" ]; then
        correct_count=$((correct_count + 1))
    fi
    echo "  Answer: $answer"
done
echo "Correct answers: $correct_count/10"
# Integer arithmetic is enough here; avoids a dependency on bc
echo "Observed success rate: $((correct_count * 10))%"

echo ""
echo "=== Test Complete ==="
echo "Stopping simulator..."
kill $SIMULATOR_PID 2>/dev/null
wait $SIMULATOR_PID 2>/dev/null || true

echo "Simulator stopped."