examples: add llama-server simulator for testing eval scripts

Add a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding the simulator functionality.
This commit is contained in: parent 8839037528, commit 07d5e1e0ea
@ -0,0 +1,116 @@
# llama-eval Implementation Discussion

## Overview

Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.

## Key Requirements from ggerganov

### 1. Simplify and Focus on One Eval

- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially

### 2. Implement an "eval state" object

- ID
- List of tasks
- Task states
- Sampling config
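A minimal sketch of how these fields might map to a Python dataclass (the names and JSON round-trip are assumptions, not the final design):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class EvalState:
    # Unique identifier for this eval run
    id: str
    # Task IDs included in the run
    tasks: list
    # Per-task progress/results, keyed by task ID
    task_states: dict = field(default_factory=dict)
    # Sampling parameters forwarded to the endpoint
    sampling_config: dict = field(default_factory=dict)

state = EvalState(id="aime-2025", tasks=["aime-1"],
                  sampling_config={"temperature": 0, "max_tokens": 2048})
# The dataclass round-trips through JSON, which is what makes
# periodic checkpointing (see below) straightforward.
restored = EvalState(**json.loads(json.dumps(asdict(state))))
```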
### 3. Implement a "processor" object

- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)

### 4. Processor responsibilities

- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses
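One way the processor loop could look, assuming a plain-dict eval state and a hypothetical `run_task` callable (a sketch, not the final architecture):

```python
import json
from concurrent.futures import ThreadPoolExecutor

def process(state, run_task, dump_path, threads=4, dump_every=2):
    """Run all tasks in a thread pool and periodically persist the eval state."""
    done = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results pair up with task IDs
        for task_id, result in zip(state["tasks"],
                                   pool.map(run_task, state["tasks"])):
            state["task_states"][task_id] = result
            done += 1
            if done % dump_every == 0:  # periodic checkpoint
                with open(dump_path, "w") as f:
                    json.dump(state, f)
    with open(dump_path, "w") as f:  # final dump
        json.dump(state, f)
```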
### 5. Real-time feedback

- Default: show "correct / not correct" for each task
- Verbose mode: show produced answer vs expected answer as soon as it completes

### 6. Grading approach

- Abstract grading to support external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals)
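A sketch of what an LLM-judge request payload could look like against any OpenAI-compatible endpoint; the prompt wording, helper name, and use of `response_format` are assumptions, not a settled interface:

```python
def build_judge_request(question, expected, produced, model="llama"):
    """Build a chat-completions payload asking a judge model for a verdict."""
    prompt = (
        "You are grading a math answer.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Produced answer: {produced}\n"
        'Reply with JSON: {"correct": true} or {"correct": false}.'
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        # Asking for a JSON object keeps verdict parsing trivial
        # compared to regex extraction from free-form text.
        "response_format": {"type": "json_object"},
    }

payload = build_judge_request("1+1?", "2", "2")
```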
### 7. Output format

- Use structured output (JSON) instead of boxed text
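For example, if the model is asked to answer in JSON rather than `\boxed{}` text, extraction reduces to a `json.loads` call (a sketch; the `{"answer": ...}` schema is an assumption):

```python
import json

def extract_answer(completion_text):
    """Parse a structured {"answer": ...} reply; None if it is not valid JSON."""
    try:
        return json.loads(completion_text).get("answer")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON at all, or JSON that is not an object
        return None
```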
## Current Implementation Analysis

### What exists in llama-eval.py:

- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting

### What needs to be removed:

- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)

## Discussion Points

### 1. Eval State Object Structure

**Status: Under Discussion**

Questions:

- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?

### 2. Processor Architecture

**Status: Not Started**

Questions:

- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?

### 3. Grader Interface

**Status: Not Started**

Questions:

- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?

### 4. Checkpointing

**Status: Not Started**

Questions:

- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?
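If the answer is JSON-on-disk, an atomic write avoids corrupting the checkpoint when the process is interrupted mid-dump (a sketch; the helper names and cadence are assumptions):

```python
import json
import os
import tempfile

def dump_checkpoint(state, path):
    """Write state to path atomically: write a temp file, then rename over it."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```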
### 5. Real-time Output

**Status: Not Started**

Questions:

- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?

### 6. Output Format

**Status: Not Started**

Questions:

- Should responses be in JSON format?
- How should the grader interface work with JSON output?

## Next Steps

1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format

## References

- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195
@ -0,0 +1,184 @@
# llama-server-simulator Implementation Plan

## Overview

Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.

## Goals

1. Simulate llama-server's `/v1/chat/completions` endpoint
2. Accept requests and respond with expected answers from the AIME dataset
3. Implement a configurable success rate (sometimes right, sometimes wrong)
4. Use regex matching to find questions in incoming requests
5. Test with curl requests before integrating with the eval script

## Implementation Plan

### Phase 1: Basic Simulator Structure
- Create `llama-server-simulator.py` script
- Set up a Flask/FastAPI HTTP server
- Implement the `/v1/chat/completions` endpoint
- Handle the basic request/response format

### Phase 2: AIME Dataset Integration
- Load the AIME dataset
- Store questions and expected answers
- Implement regex matching to find questions in incoming requests
- Extract the expected answer from the matched question

### Phase 3: Response Generation
- Implement success rate configuration
- Randomly determine if the response should be correct or incorrect
- Generate the appropriate response based on that determination
- Format the response in OpenAI-compatible format

### Phase 4: Testing
- Write curl commands to test basic functionality
- Test correct responses
- Test incorrect responses
- Test edge cases (no question found, etc.)

## Technical Details

### Server Framework
- Use Flask for simplicity
- Listen on a configurable port
- Support JSON request/response format

### Request Format
```json
{
    "model": "llama",
    "messages": [
        {"role": "user", "content": "Question text here"}
    ],
    "temperature": 0,
    "max_tokens": 2048
}
```

### Response Format
```json
{
    "id": "chatcmpl-xxx",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "llama",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Answer text here"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 100,
        "completion_tokens": 50,
        "total_tokens": 150
    }
}
```

### AIME Dataset Integration
- Load from HuggingFace: "AI-MO/aimo-validation-aime"
- Store in memory for fast lookup
- Regex pattern to find the question text in the request
- Extract the answer from the matched question

### Success Rate Configuration
- Command-line argument: `--success-rate 0.8` (80% success rate)
- Randomly determine correctness based on the rate
- Log when responses are correct vs incorrect

### Testing Strategy
1. Start the simulator with default settings
2. Send a curl request with a known question
3. Verify the response contains the expected answer
4. Test with different success rates
5. Test edge cases

## Implementation Steps

### Step 1: Basic Server Setup
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    # Handle the request and build an OpenAI-compatible response dict
    return jsonify(response)
```

### Step 2: Load AIME Dataset
```python
import datasets

ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train")
# Store in memory
```

### Step 3: Regex Matching
```python
import re

def find_question_in_request(request_text):
    # Capture everything after a "question:" marker up to the end of that line
    # (no trailing-newline requirement, so the last line of a request matches too)
    pattern = r"question:\s*(.*)"
    match = re.search(pattern, request_text)
    return match.group(1) if match else None
```

### Step 4: Response Generation
```python
import random

def generate_response(question, success_rate):
    if random.random() < success_rate:
        return get_expected_answer(question)
    else:
        return get_wrong_answer(question)
```

### Step 5: Testing with Curl
```bash
curl -X POST http://localhost:8033/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [{"role": "user", "content": "Question text"}]
    }'
```

## Configuration Options
- `--port`: Server port (default: 8033)
- `--success-rate`: Success rate 0-1 (default: 0.8)
- `--host`: Server host (default: localhost)
- `--dataset-split`: AIME split to use (default: train)

## Expected Output
```
=== llama-server-simulator ===
Server running on http://localhost:8033
Success rate: 0.8
AIME dataset loaded: 1000 questions
```

## Testing Checklist
- [ ] Server starts successfully
- [ ] Basic request/response works
- [ ] Correct answer returned when success rate allows
- [ ] Wrong answer returned when success rate doesn't allow
- [ ] No question found returns error
- [ ] Multiple requests work correctly
- [ ] Different success rates work as expected

## Next Steps
1. Implement basic server structure
2. Load AIME dataset
3. Implement regex matching
4. Add response generation with success rate
5. Test with curl commands
6. Integrate with eval script once simulator works
@ -0,0 +1,267 @@
#!/usr/bin/env python3

import argparse
import os
import random
import re
import sys
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional

import datasets
from flask import Flask, request, jsonify

# Set cache directory for HuggingFace datasets
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"
cache_dir.mkdir(parents=True, exist_ok=True)
os.environ["HF_DATASETS_CACHE"] = str(cache_dir)


def levenshtein_distance(s1: str, s2: str) -> int:
    """Calculate the Levenshtein distance between two strings"""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]


def debug_log(message: str):
    """Log debug messages to both stderr and a file"""
    print(message, file=sys.stderr)
    with open("/tmp/simulator-debug.log", "a") as f:
        f.write(message + "\n")


app = Flask(__name__)


@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict]
    sampling_config: Dict


class AimeDataset:
    def __init__(self, split: str = "train"):
        self.split = split
        self.questions: List[Dict] = []
        self._load_dataset()

    def _load_dataset(self):
        print(f"Loading AIME dataset (split: {self.split})...")
        print(f"Using cache: {os.environ.get('HF_DATASETS_CACHE', 'default')}")

        ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)
        self.questions = list(ds)
        print(f"AIME dataset loaded: {len(self.questions)} questions")

    def find_question(self, request_text: str) -> Optional[Dict]:
        best_match = None
        best_distance = float('inf')
        best_index = -1

        request_lower = request_text.lower()

        for i, question in enumerate(self.questions):
            question_text = question["problem"]
            question_lower = question_text.lower()

            # Exact match
            if question_lower == request_lower:
                debug_log(f"DEBUG: Found exact match at index {i}")
                return question

            # Remove LaTeX formatting for more flexible matching
            question_no_latex = re.sub(r'\$[^$]+\$', '', question_text)
            if question_no_latex.lower() == request_lower:
                debug_log(f"DEBUG: Found match (no LaTeX) at index {i}")
                return question

            # Calculate Levenshtein distance for partial matches.
            # Only consider if the request is at least 50% of the question length.
            if len(request_lower) >= len(question_lower) * 0.5:
                distance = levenshtein_distance(question_lower, request_lower)
                # Normalize distance by question length
                normalized_distance = distance / len(question_lower)

                if normalized_distance < best_distance:
                    best_distance = normalized_distance
                    best_match = question
                    best_index = i

        if best_match and best_distance < 0.3:  # Threshold for partial match
            debug_log(f"DEBUG: Found best partial match at index {best_index} "
                      f"with distance {best_distance:.3f}")
            return best_match

        debug_log(f"DEBUG: No matching question found for: {request_text[:100]}...")
        return None

    def get_answer(self, question: Dict) -> str:
        return str(question["answer"])


class Simulator:
    def __init__(
        self,
        port: int = 8033,
        host: str = "localhost",
        success_rate: float = 0.8,
        dataset_split: str = "train"
    ):
        self.port = port
        self.host = host
        self.success_rate = success_rate
        self.dataset = AimeDataset(dataset_split)
        self.eval_state = EvalState(
            id="aime-2025",
            tasks=["aime"],
            task_states={},
            sampling_config={"temperature": 0, "max_tokens": 2048}
        )

    def _generate_response(
        self,
        question: Dict,
        should_be_correct: bool
    ) -> Dict:
        expected_answer = self.dataset.get_answer(question)

        if should_be_correct:
            response_text = expected_answer
        else:
            response_text = self._generate_wrong_answer(question)

        return {
            "id": f"chatcmpl-{int(time.time())}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": "llama",
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": response_text
                    },
                    "finish_reason": "stop"
                }
            ],
            "usage": {
                "prompt_tokens": 100,
                "completion_tokens": 50,
                "total_tokens": 150
            }
        }

    def _generate_wrong_answer(self, question: Dict) -> str:
        expected_answer = self.dataset.get_answer(question)

        if expected_answer.isdigit():
            wrong_answer = str(int(expected_answer) + 1)
        else:
            wrong_answer = expected_answer + " (wrong)"

        return wrong_answer

    def _process_request(self, request_data: Dict) -> Dict:
        messages = request_data.get("messages", [])
        if not messages:
            return {"error": "No messages in request"}

        request_text = messages[0].get("content", "")
        debug_log(f"DEBUG: Received request with content: {request_text[:150]}...")

        question = self.dataset.find_question(request_text)
        if not question:
            debug_log("DEBUG: find_question returned None")
            return {"error": "No matching question found"}

        should_be_correct = random.random() < self.success_rate

        response = self._generate_response(question, should_be_correct)

        task_id = "aime"
        self.eval_state.task_states[task_id] = {
            "correct": should_be_correct,
            "expected": self.dataset.get_answer(question),
            "predicted": response["choices"][0]["message"]["content"]
        }

        return response


@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    try:
        request_data = request.get_json()

        if not request_data:
            return jsonify({"error": "Invalid JSON"}), 400

        response = simulator._process_request(request_data)

        return jsonify(response)

    except Exception as e:
        print(f"Error processing request: {e}")
        return jsonify({"error": str(e)}), 500


def main():
    parser = argparse.ArgumentParser(
        description="llama-server simulator for testing eval scripts"
    )
    parser.add_argument(
        "--port", type=int, default=8033,
        help="Server port (default: 8033)"
    )
    parser.add_argument(
        "--host", type=str, default="localhost",
        help="Server host (default: localhost)"
    )
    parser.add_argument(
        "--success-rate", type=float, default=0.8,
        help="Success rate 0-1 (default: 0.8)"
    )
    parser.add_argument(
        "--dataset-split", type=str, default="train",
        help="AIME dataset split to use (default: train)"
    )

    args = parser.parse_args()

    global simulator
    simulator = Simulator(
        port=args.port,
        host=args.host,
        success_rate=args.success_rate,
        dataset_split=args.dataset_split
    )

    print("\n=== llama-server-simulator ===")
    print(f"Server running on http://{args.host}:{args.port}")
    print(f"Success rate: {args.success_rate}")
    print(f"AIME dataset loaded: {len(simulator.dataset.questions)} questions")
    print("\nPress Ctrl+C to stop\n")

    app.run(host=args.host, port=args.port, debug=False)


if __name__ == "__main__":
    main()
@ -0,0 +1,135 @@
# llama-server-simulator Implementation Summary

## Overview

Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.

## Features Implemented

### 1. HTTP Server
- Flask-based `/v1/chat/completions` endpoint
- OpenAI-compatible response format
- Configurable port and host

### 2. AIME Dataset Integration
- Loads AIME dataset from HuggingFace
- In-memory storage for fast lookup
- 90 questions loaded from train split

### 3. Intelligent Question Matching
- **Exact matching**: Direct string comparison
- **LaTeX removal**: Removes `$...$` formatting for flexible matching
- **Levenshtein distance**: Calculates similarity between strings
- **Partial matching**: Finds best match even with small differences

### 4. Response Generation
- Configurable success rate (0-1)
- Returns correct answers when the success rate allows
- Returns wrong answers when it doesn't
- Wrong answers are generated by incrementing the expected answer

### 5. Debug Logging
- Debug messages written to stderr
- Logs request content, matching results, and distances
- Helps troubleshoot matching issues

## Configuration Options

```bash
python3 llama-server-simulator.py \
    --port 8034 \
    --host localhost \
    --success-rate 0.8 \
    --dataset-split train
```

## Testing Results

### Test 1: Correct Answer
- **Success rate**: 0.8
- **Expected answer**: 116
- **Result**: ✓ Correct (116)

### Test 2: Wrong Answer
- **Success rate**: 0.0
- **Expected answer**: 116
- **Result**: ✓ Wrong (117)

### Test 3: No Matching Question
- **Request**: "What is the capital of France?"
- **Result**: ✓ Returns error "No matching question found"

### Test 4: Success Rate Verification
- **Success rate**: 0.8
- **Requests**: 10
- **Correct answers**: 8/10 (80%)
- **Result**: ✓ Success rate working as expected

## Technical Details

### Matching Algorithm
1. Try exact match (case-insensitive)
2. Try match after removing LaTeX formatting
3. Calculate Levenshtein distance for partial matches
4. Return best match if distance < 0.3 (30% difference)
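The cascade can be sketched in isolation as follows; this is a simplified version of what the script does (it omits the 50%-length gate) with a plain edit-distance helper:

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a):
        cur = [i + 1]
        for j, cb in enumerate(b):
            cur.append(min(prev[j + 1] + 1, cur[j] + 1, prev[j] + (ca != cb)))
        prev = cur
    return prev[-1]

def match(request_text, questions, threshold=0.3):
    """Return the best-matching question, or None if nothing is close enough."""
    req = request_text.lower()
    best, best_d = None, float("inf")
    for q in questions:
        ql = q.lower()
        # Steps 1-2: exact or LaTeX-stripped match wins immediately
        if ql == req or re.sub(r"\$[^$]+\$", "", q).lower() == req:
            return q
        # Step 3: normalized edit distance for partial matches
        d = edit_distance(ql, req) / len(ql)
        if d < best_d:
            best, best_d = q, d
    # Step 4: accept the best candidate only under the threshold
    return best if best_d < threshold else None
```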
### Response Format

```json
{
    "id": "chatcmpl-1769864875",
    "object": "chat.completion",
    "created": 1769864875,
    "model": "llama",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "116"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 100,
        "completion_tokens": 50,
        "total_tokens": 150
    }
}
```

## Files Created

1. `llama-server-simulator.py` - Main simulator script
2. `test-simulator.sh` - Basic test script
3. `test-simulator-comprehensive.sh` - Comprehensive test script
4. `llama-server-simulator-plan.md` - Implementation plan
5. `llama-eval-discussion.md` - Discussion notes

## Next Steps

1. ✓ Basic simulator structure
2. ✓ AIME dataset integration
3. ✓ Question matching with Levenshtein distance
4. ✓ Response generation with configurable success rate
5. ✓ Testing with curl requests
6. ⏭️ Integrate with eval script
7. ⏭️ Implement eval state object
8. ⏭️ Implement processor object
9. ⏭️ Add real-time progress reporting

## Known Limitations

1. Only supports AIME dataset (train split)
2. Matching is case-insensitive
3. Wrong answers are simple increments (not realistic)
4. No support for multiple endpoints
5. No distributed evaluation

## Future Enhancements

1. Support multiple datasets
2. More sophisticated wrong answer generation
3. Multiple endpoint support
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization
@ -0,0 +1,43 @@
#!/bin/bash

echo "=== Testing HuggingFace Dataset Caching ==="
echo ""

echo "=== First Load (should download) ==="
echo "Starting simulator for first load..."
source venv/bin/activate && python3 examples/llama-eval/llama-server-simulator.py --port 8035 --success-rate 0.8 2>&1 | tee /tmp/simulator-first.log &
SIMULATOR_PID=$!
sleep 5
echo "First load complete"
echo ""

echo "=== Second Load (should use cache) ==="
echo "Starting simulator for second load..."
source venv/bin/activate && python3 examples/llama-eval/llama-server-simulator.py --port 8036 --success-rate 0.8 2>&1 | tee /tmp/simulator-second.log &
SIMULATOR_PID2=$!
sleep 5
echo "Second load complete"
echo ""

echo "=== Checking Cache Directory ==="
echo "Cache directory size:"
du -sh ~/.cache/huggingface/datasets/AI-MO___aimo-validation-aime
echo ""

echo "=== Checking First Load Log ==="
echo "First load log (last 15 lines):"
tail -15 /tmp/simulator-first.log
echo ""

echo "=== Checking Second Load Log ==="
echo "Second load log (last 15 lines):"
tail -15 /tmp/simulator-second.log
echo ""

echo "=== Test Complete ==="
echo "Both loads completed successfully!"
echo "The second load should have used the cache (no download warning)."
echo ""

# $! captures the PID of the last command in the background pipeline (tee),
# so also kill any leftover simulator processes directly.
kill $SIMULATOR_PID $SIMULATOR_PID2 2>/dev/null
pkill -f llama-server-simulator.py 2>/dev/null
@ -0,0 +1,93 @@
#!/bin/bash

echo "=== llama-server-simulator Test Script ==="
echo ""

PORT=8033
SUCCESS_RATE=0.8

echo "Starting simulator on port $PORT with success rate $SUCCESS_RATE..."
source venv/bin/activate
python3 examples/llama-eval/llama-server-simulator.py --port $PORT --success-rate $SUCCESS_RATE > /tmp/simulator-test.log 2>&1 &
SIMULATOR_PID=$!

echo "Waiting for simulator to start..."
sleep 5

echo ""
echo "=== Test 1: Basic Request with Known Question ==="
echo "Sending request with AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [
            {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
        ],
        "temperature": 0,
        "max_tokens": 2048
    }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'])"

echo ""
echo ""
echo "=== Test 2: Request with Different Question ==="
echo "Sending request with another AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [
            {"role": "user", "content": "Compute the value of 2^10 + 3^10."}
        ],
        "temperature": 0,
        "max_tokens": 2048
    }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'])"

echo ""
echo ""
echo "=== Test 3: Request with No Matching Question ==="
echo "Sending request with non-matching text..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0,
        "max_tokens": 2048
    }' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Response:', data.get('error', 'No error'))"

echo ""
echo ""
echo "=== Test 4: Multiple Requests to Test Success Rate ==="
echo "Sending 10 requests to test success rate..."
correct_count=0
for i in {1..10}; do
    echo "Request $i:"
    response=$(curl -s -X POST http://localhost:$PORT/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "llama",
            "messages": [
                {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
            ],
            "temperature": 0,
            "max_tokens": 2048
        }')
    answer=$(echo "$response" | python3 -c "import sys, json; data = json.load(sys.stdin); print(data['choices'][0]['message']['content'])")
    if [ "$answer" == "116" ]; then
        correct_count=$((correct_count + 1))
    fi
    echo "  Answer: $answer"
done
echo "Correct answers: $correct_count/10"
# Integer arithmetic is enough here; avoids a dependency on bc
echo "Observed success rate: $((correct_count * 10))%"

echo ""
echo "=== Test Complete ==="
echo "Stopping simulator..."
kill $SIMULATOR_PID 2>/dev/null
wait $SIMULATOR_PID 2>/dev/null || true

echo "Simulator stopped."