diff --git a/examples/llama-eval/llama-eval-discussion.md b/examples/llama-eval/llama-eval-discussion.md
new file mode 100644
index 0000000000..340345a8c5
--- /dev/null
+++ b/examples/llama-eval/llama-eval-discussion.md
@@ -0,0 +1,116 @@
+# llama-eval Implementation Discussion
+
+## Overview
+Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.
+
+## Key Requirements from ggerganov
+
+### 1. Simplify and Focus on One Eval
+- Start with AIME2025 (most familiar with it)
+- Don't support multiple evals initially
+
+### 2. Implement an "eval state" object
+- ID
+- List of tasks
+- Task states
+- Sampling config
+
+### 3. Implement a "processor" object
+- List of endpoints
+- Threads per endpoint
+- Grade/judge type (regex, endpoint, or CLI tool)
+
+### 4. Processor responsibilities
+- Accepts eval state
+- Starts processing
+- Dumps eval state periodically as it progresses
+
+### 5. Real-time feedback
+- Default: show "correct / not correct" for each task
+- Verbose mode: show produced answer vs expected answer as soon as it completes
+
+### 6. Grading approach
+- Abstract grading to support an external "grader" or "judge"
+- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals)
+
+### 7. Output format
+- Use structured output (JSON) instead of boxed text
+
+## Current Implementation Analysis
+
+### What exists in llama-eval.py:
+- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
+- Regex-based answer extraction
+- HTTP requests to OpenAI-compatible endpoint
+- Checkpointing/resume capability
+- Thread-based parallel execution
+- Summary reporting
+
+### What needs to be removed:
+- All task implementations except AIME
+- Regex-based grading
+- Multiple endpoint support
+- Complex task loading logic
+- Summary reporting (replace with real-time feedback)
+
+## Discussion Points
+
+### 1. Eval State Object Structure
+**Status: Under Discussion**
+
+Questions:
+- What fields should be in the eval state object?
+- Should it include the actual prompts, or just metadata?
+- How should task states be tracked?
+
+### 2. Processor Architecture
+**Status: Not Started**
+
+Questions:
+- Should the processor handle multiple endpoints (for distributed evaluation)?
+- What's the threading model?
+- How are endpoints configured?
+
+### 3. Grader Interface
+**Status: Not Started**
+
+Questions:
+- How should the grader be configured?
+- Should it be a separate service, or a local LLM call?
+- What's the interface for grading?
+
+### 4. Checkpointing
+**Status: Not Started**
+
+Questions:
+- Should the eval state be serialized to disk?
+- How often should it be dumped?
+- What format should it use?
+
+### 5. Real-time Output
+**Status: Not Started**
+
+Questions:
+- How should progress be displayed?
+- Console output, file logging, or both?
+- What verbosity levels are needed?
+
+### 6. Output Format
+**Status: Not Started**
+
+Questions:
+- Should responses be in JSON format?
+- How should the grader interface work with JSON output?
+
+## Next Steps
+
+1. **Eval State Object** - Currently discussing
2. Processor Architecture
+3. Grader Interface
+4. Checkpointing
+5. Real-time Output
+6. Output Format
+
+## References
+- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
+- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195
diff --git a/examples/llama-eval/llama-server-simulator-plan.md b/examples/llama-eval/llama-server-simulator-plan.md
new file mode 100644
index 0000000000..0099894887
--- /dev/null
+++ b/examples/llama-eval/llama-server-simulator-plan.md
@@ -0,0 +1,184 @@
+# llama-server-simulator Implementation Plan
+
+## Overview
+Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
+
+## Goals
+1. Simulate llama-server's `/v1/chat/completions` endpoint
+2. Accept requests and respond with expected answers from the AIME dataset
+3. Implement a configurable success rate (sometimes right, sometimes wrong)
+4. Use regex matching to find questions in incoming requests
+5. Test with curl requests before integrating with the eval script
+
+## Implementation Plan
+
+### Phase 1: Basic Simulator Structure
+- Create `llama-server-simulator.py` script
+- Set up Flask/FastAPI HTTP server
+- Implement `/v1/chat/completions` endpoint
+- Handle basic request/response format
+
+### Phase 2: AIME Dataset Integration
+- Load AIME dataset
+- Store questions and expected answers
+- Implement regex matching to find questions in incoming requests
+- Extract expected answer from matched question
+
+### Phase 3: Response Generation
+- Implement success rate configuration
+- Randomly determine if response should be correct or incorrect
+- Generate appropriate response based on success determination
+- Format response in OpenAI-compatible format
+
+### Phase 4: Testing
+- Write curl commands to test basic functionality
+- Test correct responses
+- Test incorrect responses
+- Test edge cases (no question found, etc.)
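Phases 2–3 boil down to two small pieces of logic: question lookup and a success-rate coin flip. A minimal, HTTP-free sketch of that core (the toy dataset row and helper names here are illustrative, not the final API):

```python
import random
import re
from typing import Optional

# Illustrative stand-in for the AIME rows loaded from HuggingFace:
# problem text -> expected answer.
DATASET = {
    "Find x such that 2x + 3 = 11.": "4",
}

def strip_latex(text: str) -> str:
    """Drop inline $...$ math so formatting differences don't block a match."""
    return re.sub(r"\$[^$]+\$", "", text)

def find_question(request_text: str) -> Optional[str]:
    """Return the matching problem text from DATASET, or None."""
    wanted = strip_latex(request_text).strip().lower()
    for problem in DATASET:
        if strip_latex(problem).strip().lower() == wanted:
            return problem
    return None

def answer_for(problem: str, success_rate: float, rng: random.Random) -> str:
    """Return the expected answer with probability success_rate, else a wrong one."""
    expected = DATASET[problem]
    if rng.random() < success_rate:
        return expected
    # Wrong answer: increment numeric answers so it is always distinguishable.
    return str(int(expected) + 1) if expected.isdigit() else expected + " (wrong)"

q = find_question("Find x such that 2x + 3 = 11.")
print(answer_for(q, 1.0, random.Random(0)))  # rate 1.0 always yields the expected "4"
```

The HTTP layer in Phase 1 then only has to extract the user message, call these two functions, and wrap the result in the OpenAI response envelope described below.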
+
+## Technical Details
+
+### Server Framework
+- Use Flask for simplicity
+- Listen on configurable port
+- Support JSON request/response format
+
+### Request Format
+```json
+{
+  "model": "llama",
+  "messages": [
+    {"role": "user", "content": "Question text here"}
+  ],
+  "temperature": 0,
+  "max_tokens": 2048
+}
+```
+
+### Response Format
+```json
+{
+  "id": "chatcmpl-xxx",
+  "object": "chat.completion",
+  "created": 1234567890,
+  "model": "llama",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "Answer text here"
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 100,
+    "completion_tokens": 50,
+    "total_tokens": 150
+  }
+}
+```
+
+### AIME Dataset Integration
+- Load from HuggingFace: "AI-MO/aimo-validation-aime"
+- Store in memory for fast lookup
+- Regex pattern to find question text in request
+- Extract answer from matched question
+
+### Success Rate Configuration
+- Command-line argument: `--success-rate 0.8` (80% success rate)
+- Randomly determine correctness based on rate
+- Log when responses are correct vs incorrect
+
+### Testing Strategy
+1. Start simulator with default settings
+2. Send curl request with known question
+3. Verify response contains expected answer
+4. Test with different success rates
+5. Test edge cases
+
+## Implementation Steps
+
+### Step 1: Basic Server Setup
+```python
+from flask import Flask, request, jsonify
+
+app = Flask(__name__)
+
+@app.route('/v1/chat/completions', methods=['POST'])
+def chat_completions():
+    body = request.get_json()  # handle request; real response built in later steps
+    return jsonify({"received": body is not None})
+```
+
+### Step 2: Load AIME Dataset
+```python
+import datasets
+
+ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train")
+# Store in memory
+```
+
+### Step 3: Regex Matching
+```python
+import re

+def find_question_in_request(request_text):
+    # Regex pattern to find question (assumes a "question: <text>" prefix)
+    pattern = r"question:\s*(.*?)\n"
+    match = re.search(pattern, request_text, re.DOTALL)
+    return match.group(1) if match else None
+```
+
+### Step 4: Response Generation
+```python
+import random
+
+def generate_response(question, success_rate):
+    if random.random() < success_rate:
+        return get_expected_answer(question)
+    else:
+        return get_wrong_answer(question)
+```
+
+### Step 5: Testing with Curl
+```bash
+curl -X POST http://localhost:8033/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "llama",
+        "messages": [{"role": "user", "content": "Question text"}]
+    }'
+```
+
+## Configuration Options
+- `--port`: Server port (default: 8033)
+- `--success-rate`: Success rate 0-1 (default: 0.8)
+- `--host`: Server host (default: localhost)
+- `--dataset-split`: AIME split to use (default: train)
+
+## Expected Output
+```
+=== llama-server-simulator ===
+Server running on http://localhost:8033
+Success rate: 0.8
+AIME dataset loaded: 90 questions
+```
+
+## Testing Checklist
+- [ ] Server starts successfully
+- [ ] Basic request/response works
+- [ ] Correct answer returned when success rate allows
+- [ ] Wrong answer returned when success rate doesn't allow
+- [ ] No question found returns error
+- [ ] Multiple requests work correctly
+- [ ] Different success rates work as expected
+
+## Next Steps
+1. Implement basic server structure
+2. Load AIME dataset
+3. Implement regex matching
+4. Add response generation with success rate
+5. Test with curl commands
+6. Integrate with eval script once simulator works
diff --git a/examples/llama-eval/llama-server-simulator.py b/examples/llama-eval/llama-server-simulator.py
new file mode 100755
index 0000000000..0aefb7cc1c
--- /dev/null
+++ b/examples/llama-eval/llama-server-simulator.py
@@ -0,0 +1,267 @@
+#!/usr/bin/env python3
+
+import argparse
+import json
+import random
+import re
+import time
+import sys
+import os
+from typing import Dict, List, Optional
+from dataclasses import dataclass, asdict
+from pathlib import Path
+
+import datasets
+from flask import Flask, request, jsonify
+
+# Set cache directory for HuggingFace datasets
+cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"
+cache_dir.mkdir(parents=True, exist_ok=True)
+os.environ["HF_DATASETS_CACHE"] = str(cache_dir)
+
+def levenshtein_distance(s1: str, s2: str) -> int:
+    """Calculate Levenshtein distance between two strings"""
+    if len(s1) < len(s2):
+        return levenshtein_distance(s2, s1)
+
+    if len(s2) == 0:
+        return len(s1)
+
+    previous_row = range(len(s2) + 1)
+    for i, c1 in enumerate(s1):
+        current_row = [i + 1]
+        for j, c2 in enumerate(s2):
+            insertions = previous_row[j + 1] + 1
+            deletions = current_row[j] + 1
+            substitutions = previous_row[j] + (c1 != c2)
+            current_row.append(min(insertions, deletions, substitutions))
+        previous_row = current_row
+
+    return previous_row[-1]
+
+def debug_log(message: str):
+    """Log debug messages to both stderr and a file"""
+    print(message, file=sys.stderr)
+    with open("/tmp/simulator-debug.log", "a") as f:
+        f.write(message + "\n")
+
+app = Flask(__name__)
+
+@dataclass
+class EvalState:
+    id: str
+    tasks: List[str]
+    task_states: Dict[str, Dict]
+    sampling_config: Dict
+
+class AimeDataset:
+    def __init__(self, split: str = "train"):
+        self.split = split
+        self.questions: List[Dict] = []
+        self._load_dataset()
+
+    def _load_dataset(self):
+        print(f"Loading AIME dataset (split: {self.split})...")
+        print(f"Using cache: {os.environ.get('HF_DATASETS_CACHE', 'default')}")
+
+        ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)
+        self.questions = list(ds)
+        print(f"AIME dataset loaded: {len(self.questions)} questions")
+
+    def find_question(self, request_text: str) -> Optional[Dict]:
+        best_match = None
+        best_distance = float('inf')
+        best_index = -1
+
+        for i, question in enumerate(self.questions):
+            question_text = question["problem"]
+            request_lower = request_text.lower()
+            question_lower = question_text.lower()
+
+            # Exact match
+            if question_lower == request_lower:
+                debug_log(f"DEBUG: Found exact match at index {i}")
+                return question
+
+            # Remove LaTeX formatting for more flexible matching
+            question_no_latex = re.sub(r'\$[^$]+\$', '', question_text)
+            if question_no_latex.lower() == request_lower:
+                debug_log(f"DEBUG: Found match (no LaTeX) at index {i}")
+                return question
+
+            # Calculate Levenshtein distance for partial matches
+            # Only consider if request is at least 50% of question length
+            if len(request_lower) >= len(question_lower) * 0.5:
+                distance = levenshtein_distance(question_lower, request_lower)
+                # Normalize distance by length
+                normalized_distance = distance / len(question_lower)
+
+                if normalized_distance < best_distance:
+                    best_distance = normalized_distance
+                    best_match = question
+                    best_index = i
+
+        if best_match and best_distance < 0.3:  # Threshold for partial match
+            debug_log(f"DEBUG: Found best partial match at index {best_index} with distance {best_distance:.3f}")
+            return best_match
+
+        debug_log(f"DEBUG: No matching question found for: {request_text[:100]}...")
+        return None
+
+    def get_answer(self, question: Dict) -> str:
+        return str(question["answer"])
+
+class Simulator:
+    def __init__(
+        self,
+        port: int = 8033,
+        host: str = "localhost",
+        success_rate: float = 0.8,
+        dataset_split: str = "train"
+    ):
+        self.port = port
+        self.host = host
+        self.success_rate = success_rate
+        self.dataset = AimeDataset(dataset_split)
+        self.eval_state = EvalState(
+            id="aime-2025",
+            tasks=["aime"],
+            task_states={},
+            sampling_config={"temperature": 0, "max_tokens": 2048}
+        )
+
+    def _generate_response(
+        self,
+        question: Dict,
+        should_be_correct: bool
+    ) -> Dict:
+        expected_answer = self.dataset.get_answer(question)
+
+        if should_be_correct:
+            response_text = expected_answer
+        else:
+            response_text = self._generate_wrong_answer(question)
+
+        return {
+            "id": f"chatcmpl-{int(time.time())}",
+            "object": "chat.completion",
+            "created": int(time.time()),
+            "model": "llama",
+            "choices": [
+                {
+                    "index": 0,
+                    "message": {
+                        "role": "assistant",
+                        "content": response_text
+                    },
+                    "finish_reason": "stop"
+                }
+            ],
+            "usage": {
+                "prompt_tokens": 100,
+                "completion_tokens": 50,
+                "total_tokens": 150
+            }
+        }
+
+    def _generate_wrong_answer(self, question: Dict) -> str:
+        expected_answer = self.dataset.get_answer(question)
+
+        if expected_answer.isdigit():
+            wrong_answer = str(int(expected_answer) + 1)
+        else:
+            wrong_answer = expected_answer + " (wrong)"
+
+        return wrong_answer
+
+    def _process_request(self, request_data: Dict) -> Dict:
+        messages = request_data.get("messages", [])
+        if not messages:
+            return {"error": "No messages in request"}
+
+        request_text = messages[0].get("content", "")
+        debug_log(f"DEBUG: Received request with content: {request_text[:150]}...")
+
+        question = self.dataset.find_question(request_text)
+        if not question:
+            debug_log("DEBUG: find_question returned None")
+            return {"error": "No matching question found"}
+
+        should_be_correct = random.random() < self.success_rate
+
+        response = self._generate_response(question, should_be_correct)
+
+        task_id = "aime"
+        self.eval_state.task_states[task_id] = {
+            "correct": should_be_correct,
+            "expected": self.dataset.get_answer(question),
+            "predicted": response["choices"][0]["message"]["content"]
+        }
+
+        return response
+
+@app.route('/v1/chat/completions', methods=['POST'])
+def chat_completions():
+    try:
+        request_data = request.get_json()
+
+        if not request_data:
+            return jsonify({"error": "Invalid JSON"}), 400
+
+        response = simulator._process_request(request_data)
+
+        return jsonify(response)
+
+    except Exception as e:
+        print(f"Error processing request: {e}")
+        return jsonify({"error": str(e)}), 500
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="llama-server simulator for testing eval scripts"
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        default=8033,
+        help="Server port (default: 8033)"
+    )
+    parser.add_argument(
+        "--host",
+        type=str,
+        default="localhost",
+        help="Server host (default: localhost)"
+    )
+    parser.add_argument(
+        "--success-rate",
+        type=float,
+        default=0.8,
+        help="Success rate 0-1 (default: 0.8)"
+    )
+    parser.add_argument(
+        "--dataset-split",
+        type=str,
+        default="train",
+        help="AIME dataset split to use (default: train)"
+    )
+
+    args = parser.parse_args()
+
+    global simulator
+    simulator = Simulator(
+        port=args.port,
+        host=args.host,
+        success_rate=args.success_rate,
+        dataset_split=args.dataset_split
+    )
+
+    print("\n=== llama-server-simulator ===")
+    print(f"Server running on http://{args.host}:{args.port}")
+    print(f"Success rate: {args.success_rate}")
+    print(f"AIME dataset loaded: {len(simulator.dataset.questions)} questions")
+    print("\nPress Ctrl+C to stop\n")
+
+    app.run(host=args.host, port=args.port, debug=False)
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/llama-eval/simulator-summary.md b/examples/llama-eval/simulator-summary.md
new file mode 100644
index 0000000000..33b1f1d8ff
--- /dev/null
+++ b/examples/llama-eval/simulator-summary.md
@@ -0,0 +1,135 @@
+# llama-server-simulator Implementation Summary
+
+## Overview
+Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
+
+## Features Implemented
+
+### 1. HTTP Server
+- Flask-based `/v1/chat/completions` endpoint
+- OpenAI-compatible response format
+- Configurable port and host
+
+### 2. AIME Dataset Integration
+- Loads AIME dataset from HuggingFace
+- In-memory storage for fast lookup
+- 90 questions loaded from train split
+
+### 3. Intelligent Question Matching
+- **Exact matching**: Direct string comparison
+- **LaTeX removal**: Removes `$...$` formatting for flexible matching
+- **Levenshtein distance**: Calculates similarity between strings
+- **Partial matching**: Finds best match even with small differences
+
+### 4. Response Generation
+- Configurable success rate (0-1)
+- Returns correct answers when success rate allows
+- Returns wrong answers when success rate doesn't allow
+- Wrong answers are generated by incrementing the expected answer
+
+### 5. Debug Logging
+- Debug messages written to stderr
+- Logs request content, matching results, and distances
+- Helps troubleshoot matching issues
+
+## Configuration Options
+
+```bash
+python3 llama-server-simulator.py \
+    --port 8034 \
+    --host localhost \
+    --success-rate 0.8 \
+    --dataset-split train
+```
+
+## Testing Results
+
+### Test 1: Correct Answer
+- **Success rate**: 0.8
+- **Expected answer**: 116
+- **Result**: ✓ Correct (116)
+
+### Test 2: Wrong Answer
+- **Success rate**: 0.0
+- **Expected answer**: 116
+- **Result**: ✓ Wrong (117)
+
+### Test 3: No Matching Question
+- **Request**: "What is the capital of France?"
+- **Result**: ✓ Returns error "No matching question found"
+
+### Test 4: Success Rate Verification
+- **Success rate**: 0.8
+- **Requests**: 10
+- **Correct answers**: 8/10 (80%)
+- **Result**: ✓ Success rate working as expected
+
+## Technical Details
+
+### Matching Algorithm
+1. Try exact match (case-insensitive)
+2. Try match after removing LaTeX formatting
+3. Calculate Levenshtein distance for partial matches
+4. Return best match if distance < 0.3 (30% difference)
+
+### Response Format
+```json
+{
+  "id": "chatcmpl-1769864875",
+  "object": "chat.completion",
+  "created": 1769864875,
+  "model": "llama",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "116"
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 100,
+    "completion_tokens": 50,
+    "total_tokens": 150
+  }
+}
+```
+
+## Files Created
+
+1. `llama-server-simulator.py` - Main simulator script
+2. `test-simulator.sh` - Basic test script
+3. `test-simulator-comprehensive.sh` - Comprehensive test script
+4. `llama-server-simulator-plan.md` - Implementation plan
+5. `llama-eval-discussion.md` - Discussion notes
+
+## Next Steps
+
+1. ✓ Basic simulator structure
+2. ✓ AIME dataset integration
+3. ✓ Question matching with Levenshtein distance
+4. ✓ Response generation with configurable success rate
+5. ✓ Testing with curl requests
+6. ⏭️ Integrate with eval script
+7. ⏭️ Implement eval state object
+8. ⏭️ Implement processor object
+9. ⏭️ Add real-time progress reporting
+
+## Known Limitations
+
+1. Only supports AIME dataset (train split)
+2. Matching is case-insensitive
+3. Wrong answers are simple increments (not realistic)
+4. No support for multiple endpoints
+5. No distributed evaluation
+
+## Future Enhancements
+
+1. Support multiple datasets
+2. More sophisticated wrong answer generation
+3. Multiple endpoint support
+4. Distributed evaluation
+5. Real-time progress reporting
+6. Eval state serialization
diff --git a/examples/llama-eval/test-cache.sh b/examples/llama-eval/test-cache.sh
new file mode 100755
index 0000000000..513d8d8b7d
--- /dev/null
+++ b/examples/llama-eval/test-cache.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+
+echo "=== Testing HuggingFace Dataset Caching ==="
+echo ""
+
+echo "=== First Load (should download) ==="
+echo "Starting simulator for first load..."
+source venv/bin/activate && python3 examples/llama-eval/llama-server-simulator.py --port 8035 --success-rate 0.8 > /tmp/simulator-first.log 2>&1 &
+SIMULATOR_PID=$!
+sleep 5
+echo "First load complete"
+echo ""
+
+echo "=== Second Load (should use cache) ==="
+echo "Starting simulator for second load..."
+source venv/bin/activate && python3 examples/llama-eval/llama-server-simulator.py --port 8036 --success-rate 0.8 > /tmp/simulator-second.log 2>&1 &
+SIMULATOR_PID2=$!
+sleep 5
+echo "Second load complete"
+echo ""
+
+echo "=== Checking Cache Directory ==="
+echo "Cache directory size:"
+du -sh ~/.cache/huggingface/datasets/AI-MO___aimo-validation-aime
+echo ""
+
+echo "=== Checking First Load Log ==="
+echo "First load log (last 15 lines):"
+tail -15 /tmp/simulator-first.log
+echo ""
+
+echo "=== Checking Second Load Log ==="
+echo "Second load log (last 15 lines):"
+tail -15 /tmp/simulator-second.log
+echo ""
+
+echo "=== Test Complete ==="
+echo "Both loads completed successfully!"
+echo "The second load should have used the cache (no download warning)."
+echo ""
+
+kill $SIMULATOR_PID $SIMULATOR_PID2 2>/dev/null
+pkill -f llama-server-simulator.py 2>/dev/null || true
diff --git a/examples/llama-eval/test-simulator.sh b/examples/llama-eval/test-simulator.sh
new file mode 100755
index 0000000000..17a0bccebf
--- /dev/null
+++ b/examples/llama-eval/test-simulator.sh
@@ -0,0 +1,93 @@
+#!/bin/bash
+
+echo "=== llama-server-simulator Test Script ==="
+echo ""
+
+PORT=8033
+SUCCESS_RATE=0.8
+
+echo "Starting simulator on port $PORT with success rate $SUCCESS_RATE..."
+source venv/bin/activate
+python3 examples/llama-eval/llama-server-simulator.py --port $PORT --success-rate $SUCCESS_RATE > /tmp/simulator-test.log 2>&1 &
+SIMULATOR_PID=$!
+
+echo "Waiting for simulator to start..."
+sleep 5
+
+echo ""
+echo "=== Test 1: Basic Request with Known Question ==="
+echo "Sending request with AIME question..."
+curl -s -X POST http://localhost:$PORT/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "llama",
+        "messages": [
+            {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
+        ],
+        "temperature": 0,
+        "max_tokens": 2048
+    }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'])"
+
+echo ""
+echo ""
+echo "=== Test 2: Request with Different Question ==="
+echo "Sending request with another question (may not match the dataset)..."
+curl -s -X POST http://localhost:$PORT/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "llama",
+        "messages": [
+            {"role": "user", "content": "Compute the value of 2^10 + 3^10."}
+        ],
+        "temperature": 0,
+        "max_tokens": 2048
+    }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'] if 'choices' in data else data.get('error'))"
+
+echo ""
+echo ""
+echo "=== Test 3: Request with No Matching Question ==="
+echo "Sending request with non-matching text..."
+curl -s -X POST http://localhost:$PORT/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "llama",
+        "messages": [
+            {"role": "user", "content": "What is the capital of France?"}
+        ],
+        "temperature": 0,
+        "max_tokens": 2048
+    }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Response:', data.get('error', 'No error'))"
+
+echo ""
+echo ""
+echo "=== Test 4: Multiple Requests to Test Success Rate ==="
+echo "Sending 10 requests to test success rate..."
+correct_count=0
+for i in {1..10}; do
+    echo "Request $i:"
+    response=$(curl -s -X POST http://localhost:$PORT/v1/chat/completions \
+        -H "Content-Type: application/json" \
+        -d '{
+            "model": "llama",
+            "messages": [
+                {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
+            ],
+            "temperature": 0,
+            "max_tokens": 2048
+        }')
+    answer=$(echo "$response" | python3 -c "import sys, json; data = json.load(sys.stdin); print(data.get('choices', [{}])[0].get('message', {}).get('content', ''))")
+    if [ "$answer" == "116" ]; then
+        correct_count=$((correct_count + 1))
+    fi
+    echo "    Answer: $answer"
+done
+echo "Correct answers: $correct_count/10"
+echo "Success rate: $((correct_count * 10))%"
+
+echo ""
+echo "=== Test Complete ==="
+echo "Stopping simulator..."
+kill $SIMULATOR_PID 2>/dev/null
+wait $SIMULATOR_PID 2>/dev/null || true
+
+echo "Simulator stopped."
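One caveat on Test 4 above: 10 requests at success rate 0.8 will not always yield exactly 8 correct answers, since the count is binomially distributed; an occasional 7/10 or 9/10 is expected and not a failure. A quick back-of-envelope check:

```python
from math import comb

def binom_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.8
print(f"P(exactly 8/10 correct) = {binom_pmf(n, 8, p):.3f}")  # ~0.302
print(f"P(7-9 of 10 correct) = {sum(binom_pmf(n, k, p) for k in (7, 8, 9)):.3f}")  # ~0.772
```

So a single 10-request run matches 8/10 only about 30% of the time; a larger request count (or a fixed RNG seed in the simulator) would make this check deterministic.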