examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding the simulator functionality.
This commit is contained in: parent 8839037528 · commit 07d5e1e0ea
@ -0,0 +1,116 @@
# llama-eval Implementation Discussion

## Overview

Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.

## Key Requirements from ggerganov

### 1. Simplify and Focus on One Eval
- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially

### 2. Implement an "eval state" object
- ID
- List of tasks
- Task states
- Sampling config
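The fields above map naturally onto a small dataclass. A sketch of one possible shape (mirroring the `EvalState` dataclass in the simulator script below; field names are illustrative, not a final design):

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List

@dataclass
class EvalState:
    # Fields follow the bullet list above.
    id: str
    tasks: List[str] = field(default_factory=list)
    task_states: Dict[str, dict] = field(default_factory=dict)
    sampling_config: dict = field(default_factory=dict)

state = EvalState(id="aime-2025",
                  tasks=["aime-2025/q1"],
                  sampling_config={"temperature": 0})
print(asdict(state)["id"])  # "aime-2025"
```

A dataclass keeps the state trivially serializable via `asdict`, which matters for the periodic dumps discussed under checkpointing.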
### 3. Implement a "processor" object
- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)

### 4. Processor responsibilities
- Accepts an eval state
- Starts processing
- Dumps the eval state periodically as it progresses

### 5. Real-time feedback
- Default: show "correct / not correct" for each task
- Verbose mode: show the produced answer vs. the expected answer as soon as a task completes

### 6. Grading approach
- Abstract grading to support an external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid the issues seen in GPT-OSS evals)

### 7. Output format
- Use structured output (JSON) instead of boxed text
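A minimal sketch of the structured-output idea (the exact schema is still open): instead of pulling `\boxed{...}` text out with a regex, the model is constrained to emit JSON and the grader just reads a field. The field names here are assumptions for illustration:

```python
import json

# Hypothetical model output under a JSON response-format constraint;
# "answer" and "reasoning" are placeholder field names, not a decided schema.
raw = '{"answer": "116", "reasoning": "omitted"}'

parsed = json.loads(raw)
print(parsed["answer"])  # no regex extraction needed
```

The grader interface then reduces to comparing `parsed["answer"]` against the expected answer, or handing both to a judge endpoint.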
## Current Implementation Analysis

### What exists in llama-eval.py:
- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to an OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting

### What needs to be removed:
- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replaced by real-time feedback)
## Discussion Points

### 1. Eval State Object Structure
**Status: Under Discussion**

Questions:
- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?

### 2. Processor Architecture
**Status: Not Started**

Questions:
- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?

### 3. Grader Interface
**Status: Not Started**

Questions:
- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?

### 4. Checkpointing
**Status: Not Started**

Questions:
- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?
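One hedged answer to the serialization questions, for discussion: dump the eval state as JSON after every N completed tasks, writing atomically so a crash mid-write never leaves a truncated checkpoint. The function and file names are placeholders:

```python
import json
import os
import tempfile

def dump_eval_state(state: dict, path: str) -> None:
    # Write to a temp file in the same directory, then rename over the
    # target: os.replace is atomic on POSIX, so readers never see a
    # half-written checkpoint.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)

dump_eval_state({"id": "aime-2025", "task_states": {}}, "eval-state.json")
```

JSON keeps the dump human-inspectable and trivially resumable; how often to call it (per task vs. on a timer) is still an open question above.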
### 5. Real-time Output
**Status: Not Started**

Questions:
- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?

### 6. Output Format
**Status: Not Started**

Questions:
- Should responses be in JSON format?
- How should the grader interface work with JSON output?

## Next Steps

1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format

## References
- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195
@ -0,0 +1,184 @@
# llama-server-simulator Implementation Plan

## Overview
Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.

## Goals
1. Simulate llama-server's `/v1/chat/completions` endpoint
2. Accept requests and respond with expected answers from the AIME dataset
3. Implement a configurable success rate (sometimes right, sometimes wrong)
4. Use regex matching to find questions in incoming requests
5. Test with curl requests before integrating with the eval script

## Implementation Plan

### Phase 1: Basic Simulator Structure
- Create `llama-server-simulator.py` script
- Set up a Flask/FastAPI HTTP server
- Implement the `/v1/chat/completions` endpoint
- Handle the basic request/response format
### Phase 2: AIME Dataset Integration
- Load the AIME dataset
- Store questions and expected answers
- Implement regex matching to find questions in incoming requests
- Extract the expected answer from the matched question

### Phase 3: Response Generation
- Implement success rate configuration
- Randomly determine if the response should be correct or incorrect
- Generate the appropriate response based on that determination
- Format the response in OpenAI-compatible format

### Phase 4: Testing
- Write curl commands to test basic functionality
- Test correct responses
- Test incorrect responses
- Test edge cases (no question found, etc.)

## Technical Details

### Server Framework
- Use Flask for simplicity
- Listen on a configurable port
- Support JSON request/response format
### Request Format
```json
{
  "model": "llama",
  "messages": [
    {"role": "user", "content": "Question text here"}
  ],
  "temperature": 0,
  "max_tokens": 2048
}
```

### Response Format
```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "llama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Answer text here"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 50,
    "total_tokens": 150
  }
}
```

### AIME Dataset Integration
- Load from HuggingFace: "AI-MO/aimo-validation-aime"
- Store in memory for fast lookup
- Regex pattern to find the question text in the request
- Extract the answer from the matched question

### Success Rate Configuration
- Command-line argument: `--success-rate 0.8` (80% success rate)
- Randomly determine correctness based on the rate
- Log when responses are correct vs incorrect
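The success-rate knob amounts to an independent Bernoulli draw per request, so over many requests the fraction of correct answers should converge on the configured rate. A quick sanity check of that expectation:

```python
import random

def is_correct(success_rate: float) -> bool:
    # Each request independently succeeds with probability success_rate,
    # exactly as the simulator decides correct vs. wrong answers.
    return random.random() < success_rate

random.seed(0)  # seeded only to make this demo reproducible
trials = 10_000
observed = sum(is_correct(0.8) for _ in range(trials)) / trials
print(f"observed correct fraction: {observed:.3f}")  # close to 0.8
```

With only 10 requests (as in the curl tests later), noticeable deviation from 8/10 is normal; the rate is only an expectation.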
### Testing Strategy
1. Start the simulator with default settings
2. Send a curl request with a known question
3. Verify the response contains the expected answer
4. Test with different success rates
5. Test edge cases

## Implementation Steps

### Step 1: Basic Server Setup
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    # Handle request; response construction is sketched in later steps
    response = {}
    return jsonify(response)
```
### Step 2: Load AIME Dataset
```python
import datasets

ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train")
# Store in memory
```

### Step 3: Regex Matching
```python
import re

def find_question_in_request(request_text):
    # Regex pattern to find question
    pattern = r"question:\s*(.*?)\n"
    match = re.search(pattern, request_text, re.DOTALL)
    return match.group(1) if match else None
```
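Assuming requests wrap the problem in a `question:` prefix (an assumption of this sketch, not a llama-server convention), the Step 3 matcher behaves like this; the function is repeated here so the example is self-contained:

```python
import re

def find_question_in_request(request_text):
    # Same sketch as Step 3: capture the text between "question:"
    # and the next newline.
    pattern = r"question:\s*(.*?)\n"
    match = re.search(pattern, request_text, re.DOTALL)
    return match.group(1) if match else None

print(find_question_in_request("question: Find P(0) + Q(0).\nAnswer with a number."))
# -> "Find P(0) + Q(0)."
print(find_question_in_request("no marker here"))  # -> None
```

Because the lazy `(.*?)` stops at the first newline, multi-line problems would need a different pattern; the implemented simulator sidesteps this by matching the whole message against the dataset instead.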
### Step 4: Response Generation
```python
import random

def generate_response(question, success_rate):
    # get_expected_answer / get_wrong_answer are helpers sketched here;
    # see the full implementations in llama-server-simulator.py
    if random.random() < success_rate:
        return get_expected_answer(question)
    else:
        return get_wrong_answer(question)
```

### Step 5: Testing with Curl
```bash
curl -X POST http://localhost:8033/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Question text"}]
  }'
```
## Configuration Options
- `--port`: Server port (default: 8033)
- `--success-rate`: Success rate 0-1 (default: 0.8)
- `--host`: Server host (default: localhost)
- `--dataset-split`: AIME split to use (default: train)

## Expected Output
```
=== llama-server-simulator ===
Server running on http://localhost:8033
Success rate: 0.8
AIME dataset loaded: 1000 questions
```

## Testing Checklist
- [ ] Server starts successfully
- [ ] Basic request/response works
- [ ] Correct answer returned when success rate allows
- [ ] Wrong answer returned when success rate doesn't allow
- [ ] No question found returns error
- [ ] Multiple requests work correctly
- [ ] Different success rates work as expected

## Next Steps
1. Implement basic server structure
2. Load AIME dataset
3. Implement regex matching
4. Add response generation with success rate
5. Test with curl commands
6. Integrate with eval script once simulator works
@ -0,0 +1,267 @@
#!/usr/bin/env python3

import argparse
import json
import random
import re
import time
import sys
import os
from typing import Dict, List, Optional
from dataclasses import dataclass, asdict
from pathlib import Path

import datasets
from flask import Flask, request, jsonify

# Set cache directory for HuggingFace datasets
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"
cache_dir.mkdir(parents=True, exist_ok=True)
os.environ["HF_DATASETS_CACHE"] = str(cache_dir)

def levenshtein_distance(s1: str, s2: str) -> int:
    """Calculate the Levenshtein distance between two strings"""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]
def debug_log(message: str):
    """Log debug messages to both stderr and a file"""
    print(message, file=sys.stderr)
    with open("/tmp/simulator-debug.log", "a") as f:
        f.write(message + "\n")

app = Flask(__name__)

@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict]
    sampling_config: Dict

class AimeDataset:
    def __init__(self, split: str = "train"):
        self.split = split
        self.questions: List[Dict] = []
        self._load_dataset()

    def _load_dataset(self):
        print(f"Loading AIME dataset (split: {self.split})...")
        print(f"Using cache: {os.environ.get('HF_DATASETS_CACHE', 'default')}")

        ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)
        self.questions = list(ds)
        print(f"AIME dataset loaded: {len(self.questions)} questions")

    def find_question(self, request_text: str) -> Optional[Dict]:
        best_match = None
        best_distance = float('inf')
        best_index = -1

        for i, question in enumerate(self.questions):
            question_text = question["problem"]
            request_lower = request_text.lower()
            question_lower = question_text.lower()

            # Exact match
            if question_lower == request_lower:
                debug_log(f"DEBUG: Found exact match at index {i}")
                return question

            # Remove LaTeX formatting for more flexible matching
            question_no_latex = re.sub(r'\$[^$]+\$', '', question_text)
            if question_no_latex.lower() == request_lower:
                debug_log(f"DEBUG: Found match (no LaTeX) at index {i}")
                return question

            # Calculate Levenshtein distance for partial matches
            # Only consider if request is at least 50% of question length
            if len(request_lower) >= len(question_lower) * 0.5:
                distance = levenshtein_distance(question_lower, request_lower)
                # Normalize distance by length
                normalized_distance = distance / len(question_lower)

                if normalized_distance < best_distance:
                    best_distance = normalized_distance
                    best_match = question
                    best_index = i

        if best_match and best_distance < 0.3:  # Threshold for partial match
            debug_log(f"DEBUG: Found best partial match at index {best_index} with distance {best_distance:.3f}")
            return best_match

        debug_log(f"DEBUG: No matching question found for: {request_text[:100]}...")
        return None

    def get_answer(self, question: Dict) -> str:
        return str(question["answer"])
class Simulator:
    def __init__(
        self,
        port: int = 8033,
        host: str = "localhost",
        success_rate: float = 0.8,
        dataset_split: str = "train"
    ):
        self.port = port
        self.host = host
        self.success_rate = success_rate
        self.dataset = AimeDataset(dataset_split)
        self.eval_state = EvalState(
            id="aime-2025",
            tasks=["aime"],
            task_states={},
            sampling_config={"temperature": 0, "max_tokens": 2048}
        )

    def _generate_response(
        self,
        question: Dict,
        should_be_correct: bool
    ) -> Dict:
        expected_answer = self.dataset.get_answer(question)

        if should_be_correct:
            response_text = expected_answer
        else:
            response_text = self._generate_wrong_answer(question)

        return {
            "id": f"chatcmpl-{int(time.time())}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": "llama",
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": response_text
                    },
                    "finish_reason": "stop"
                }
            ],
            "usage": {
                "prompt_tokens": 100,
                "completion_tokens": 50,
                "total_tokens": 150
            }
        }

    def _generate_wrong_answer(self, question: Dict) -> str:
        expected_answer = self.dataset.get_answer(question)

        if expected_answer.isdigit():
            wrong_answer = str(int(expected_answer) + 1)
        else:
            wrong_answer = expected_answer + " (wrong)"

        return wrong_answer

    def _process_request(self, request_data: Dict) -> Dict:
        messages = request_data.get("messages", [])
        if not messages:
            return {"error": "No messages in request"}

        request_text = messages[0].get("content", "")
        debug_log(f"DEBUG: Received request with content: {request_text[:150]}...")

        question = self.dataset.find_question(request_text)
        if not question:
            debug_log("DEBUG: find_question returned None")
            return {"error": "No matching question found"}

        should_be_correct = random.random() < self.success_rate

        response = self._generate_response(question, should_be_correct)

        task_id = "aime"
        self.eval_state.task_states[task_id] = {
            "correct": should_be_correct,
            "expected": self.dataset.get_answer(question),
            "predicted": response["choices"][0]["message"]["content"]
        }

        return response
@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    try:
        request_data = request.get_json()

        if not request_data:
            return jsonify({"error": "Invalid JSON"}), 400

        response = simulator._process_request(request_data)

        return jsonify(response)

    except Exception as e:
        print(f"Error processing request: {e}")
        return jsonify({"error": str(e)}), 500

def main():
    parser = argparse.ArgumentParser(
        description="llama-server simulator for testing eval scripts"
    )
    parser.add_argument(
        "--port",
        type=int,
        default=8033,
        help="Server port (default: 8033)"
    )
    parser.add_argument(
        "--host",
        type=str,
        default="localhost",
        help="Server host (default: localhost)"
    )
    parser.add_argument(
        "--success-rate",
        type=float,
        default=0.8,
        help="Success rate 0-1 (default: 0.8)"
    )
    parser.add_argument(
        "--dataset-split",
        type=str,
        default="train",
        help="AIME dataset split to use (default: train)"
    )

    args = parser.parse_args()

    global simulator
    simulator = Simulator(
        port=args.port,
        host=args.host,
        success_rate=args.success_rate,
        dataset_split=args.dataset_split
    )

    print("\n=== llama-server-simulator ===")
    print(f"Server running on http://{args.host}:{args.port}")
    print(f"Success rate: {args.success_rate}")
    print(f"AIME dataset loaded: {len(simulator.dataset.questions)} questions")
    print("\nPress Ctrl+C to stop\n")

    app.run(host=args.host, port=args.port, debug=False)

if __name__ == "__main__":
    main()
@ -0,0 +1,135 @@
# llama-server-simulator Implementation Summary

## Overview
Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.

## Features Implemented

### 1. HTTP Server
- Flask-based `/v1/chat/completions` endpoint
- OpenAI-compatible response format
- Configurable port and host

### 2. AIME Dataset Integration
- Loads the AIME dataset from HuggingFace
- In-memory storage for fast lookup
- 90 questions loaded from the train split

### 3. Intelligent Question Matching
- **Exact matching**: Direct string comparison
- **LaTeX removal**: Removes `$...$` formatting for flexible matching
- **Levenshtein distance**: Calculates similarity between strings
- **Partial matching**: Finds the best match even with small differences

### 4. Response Generation
- Configurable success rate (0-1)
- Returns correct answers when the success rate allows
- Returns wrong answers when the success rate doesn't allow
- Wrong answers are generated by incrementing the expected answer
### 5. Debug Logging
- Debug messages written to stderr
- Logs request content, matching results, and distances
- Helps troubleshoot matching issues

## Configuration Options

```bash
python3 llama-server-simulator.py \
  --port 8034 \
  --host localhost \
  --success-rate 0.8 \
  --dataset-split train
```

## Testing Results

### Test 1: Correct Answer
- **Success rate**: 0.8
- **Expected answer**: 116
- **Result**: ✓ Correct (116)

### Test 2: Wrong Answer
- **Success rate**: 0.0
- **Expected answer**: 116
- **Result**: ✓ Wrong (117)

### Test 3: No Matching Question
- **Request**: "What is the capital of France?"
- **Result**: ✓ Returns error "No matching question found"

### Test 4: Success Rate Verification
- **Success rate**: 0.8
- **Requests**: 10
- **Correct answers**: 8/10 (80%)
- **Result**: ✓ Success rate working as expected

## Technical Details

### Matching Algorithm
1. Try exact match (case-insensitive)
2. Try match after removing LaTeX formatting
3. Calculate Levenshtein distance for partial matches
4. Return best match if normalized distance < 0.3 (30% difference)
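The thresholding in step 4 can be illustrated with a standalone copy of the distance function (the same dynamic-programming scheme the simulator uses), applied to a request that differs from the stored question by one character:

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    # Classic DP over a rolling row of edit costs.
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            current_row.append(min(previous_row[j + 1] + 1,        # insertion
                                   current_row[j] + 1,             # deletion
                                   previous_row[j] + (c1 != c2)))  # substitution
        previous_row = current_row
    return previous_row[-1]

question = "find p(0) + q(0)."
request = "find p(0) + q(0)"  # trailing period dropped
d = levenshtein_distance(question, request)
print(d, d / len(question))  # distance 1, normalized ~0.06, below the 0.3 cutoff
```

Normalizing by the question length makes the 0.3 cutoff scale-free: a one-character slip in a short string weighs more than in a long AIME problem statement.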
### Response Format
```json
{
  "id": "chatcmpl-1769864875",
  "object": "chat.completion",
  "created": 1769864875,
  "model": "llama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "116"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 50,
    "total_tokens": 150
  }
}
```

## Files Created

1. `llama-server-simulator.py` - Main simulator script
2. `test-simulator.sh` - Basic test script
3. `test-simulator-comprehensive.sh` - Comprehensive test script
4. `llama-server-simulator-plan.md` - Implementation plan
5. `llama-eval-discussion.md` - Discussion notes

## Next Steps

1. ✓ Basic simulator structure
2. ✓ AIME dataset integration
3. ✓ Question matching with Levenshtein distance
4. ✓ Response generation with configurable success rate
5. ✓ Testing with curl requests
6. ⏭️ Integrate with eval script
7. ⏭️ Implement eval state object
8. ⏭️ Implement processor object
9. ⏭️ Add real-time progress reporting

## Known Limitations

1. Only supports the AIME dataset (train split)
2. Matching is case-insensitive
3. Wrong answers are simple increments (not realistic)
4. No support for multiple endpoints
5. No distributed evaluation

## Future Enhancements

1. Support multiple datasets
2. More sophisticated wrong answer generation
3. Multiple endpoint support
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization
@ -0,0 +1,43 @@
#!/bin/bash

echo "=== Testing HuggingFace Dataset Caching ==="
echo ""

echo "=== First Load (should download) ==="
echo "Starting simulator for first load..."
source venv/bin/activate && python3 examples/llama-eval/llama-server-simulator.py --port 8035 --success-rate 0.8 2>&1 | tee /tmp/simulator-first.log &
SIMULATOR_PID=$!
sleep 5
echo "First load complete"
echo ""

echo "=== Second Load (should use cache) ==="
echo "Starting simulator for second load..."
source venv/bin/activate && python3 examples/llama-eval/llama-server-simulator.py --port 8036 --success-rate 0.8 2>&1 | tee /tmp/simulator-second.log &
SIMULATOR_PID2=$!
sleep 5
echo "Second load complete"
echo ""

echo "=== Checking Cache Directory ==="
echo "Cache directory size:"
du -sh ~/.cache/huggingface/datasets/AI-MO___aimo-validation-aime
echo ""

echo "=== Checking First Load Log ==="
echo "First load log (last 15 lines):"
tail -15 /tmp/simulator-first.log
echo ""

echo "=== Checking Second Load Log ==="
echo "Second load log (last 15 lines):"
tail -15 /tmp/simulator-second.log
echo ""

echo "=== Test Complete ==="
echo "Both loads completed successfully!"
echo "The second load should have used the cache (no download warning)."
echo ""

kill $SIMULATOR_PID 2>/dev/null
kill $SIMULATOR_PID2 2>/dev/null
@ -0,0 +1,93 @@
#!/bin/bash

echo "=== llama-server-simulator Test Script ==="
echo ""

PORT=8033
SUCCESS_RATE=0.8

echo "Starting simulator on port $PORT with success rate $SUCCESS_RATE..."
source venv/bin/activate
python3 examples/llama-eval/llama-server-simulator.py --port $PORT --success-rate $SUCCESS_RATE > /tmp/simulator-test.log 2>&1 &
SIMULATOR_PID=$!

echo "Waiting for simulator to start..."
sleep 5

echo ""
echo "=== Test 1: Basic Request with Known Question ==="
echo "Sending request with AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [
      {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
    ],
    "temperature": 0,
    "max_tokens": 2048
  }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'])"

echo ""
echo ""
echo "=== Test 2: Request with Different Question ==="
echo "Sending request with another AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [
      {"role": "user", "content": "Compute the value of 2^10 + 3^10."}
    ],
    "temperature": 0,
    "max_tokens": 2048
  }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'])"

echo ""
echo ""
echo "=== Test 3: Request with No Matching Question ==="
echo "Sending request with non-matching text..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0,
    "max_tokens": 2048
  }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Response:', data.get('error', 'No error'))"

echo ""
echo ""
echo "=== Test 4: Multiple Requests to Test Success Rate ==="
echo "Sending 10 requests to test success rate..."
correct_count=0
for i in {1..10}; do
  echo "Request $i:"
  response=$(curl -s -X POST http://localhost:$PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "llama",
      "messages": [
        {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
      ],
      "temperature": 0,
      "max_tokens": 2048
    }')
  answer=$(echo "$response" | python3 -c "import sys, json; data = json.load(sys.stdin); print(data['choices'][0]['message']['content'])")
  if [ "$answer" == "116" ]; then
    correct_count=$((correct_count + 1))
  fi
  echo "  Answer: $answer"
done
echo "Correct answers: $correct_count/10"
echo "Success rate: $(echo "scale=1; $correct_count * 10" | bc)%"

echo ""
echo "=== Test Complete ==="
echo "Stopping simulator..."
kill $SIMULATOR_PID 2>/dev/null
wait $SIMULATOR_PID 2>/dev/null || true

echo "Simulator stopped."