examples: add llama-server simulator for testing eval scripts

Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation covering the simulator's
usage and behavior.
Author: Georgi Gerganov
Date: 2026-01-31 15:37:31 +02:00
Parent: 8839037528
Commit: 07d5e1e0ea
GPG Key ID: 449E073F9DC10735 (no known key found for this signature in database)
6 changed files with 838 additions and 0 deletions


@@ -0,0 +1,116 @@
# llama-eval Implementation Discussion
## Overview
Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.
## Key Requirements from ggerganov
### 1. Simplify and Focus on One Eval
- Start with AIME2025 (the eval he is most familiar with)
- Don't support multiple evals initially
### 2. Implement an "eval state" object
- ID
- List of tasks
- Task states
- Sampling config
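One way to picture this object is a small dataclass (a sketch only — the field names mirror the list above, and the `dump` helper is an assumption, not the agreed design):

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class EvalState:
    """Hypothetical eval state: ID, task list, per-task states, sampling config."""
    id: str
    tasks: List[str] = field(default_factory=list)
    task_states: Dict[str, dict] = field(default_factory=dict)
    sampling_config: dict = field(default_factory=dict)

    def dump(self) -> str:
        # JSON form, so a processor can checkpoint it to disk periodically
        return json.dumps(asdict(self))

state = EvalState(id="aime-2025", tasks=["aime-01"],
                  sampling_config={"temperature": 0})
```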
### 3. Implement a "processor" object
- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)
### 4. Processor responsibilities
- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses
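These responsibilities could be sketched roughly as follows (everything here is illustrative — the state is a plain dict and the model call is stubbed out):

```python
import json

class Processor:
    """Hypothetical processor: endpoints, worker threads, and a pluggable grader."""
    def __init__(self, endpoints, threads_per_endpoint=1, grader=None):
        self.endpoints = endpoints
        self.threads_per_endpoint = threads_per_endpoint
        # Default grader: exact string match; a regex, endpoint, or CLI
        # judge could be plugged in instead
        self.grader = grader or (lambda predicted, expected: predicted == expected)

    def run(self, state, dump_every=2):
        # state: {"tasks": [...], "task_states": {...}}
        for i, task in enumerate(state["tasks"], 1):
            predicted, expected = "116", "116"  # placeholder for a real model call
            state["task_states"][task] = {"correct": self.grader(predicted, expected)}
            if i % dump_every == 0:
                print(json.dumps(state))  # periodic eval-state dump
        return state

state = {"tasks": ["aime-01", "aime-02"], "task_states": {}}
result = Processor(endpoints=["http://localhost:8033"]).run(state)
```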
### 5. Real-time feedback
- Default: show "correct / not correct" for each task
- Verbose mode: show produced answer vs expected answer as soon as it completes
### 6. Grading approach
- Abstract grading to support external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals)
### 7. Output format
- Use structured output (JSON) instead of boxed text
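In an OpenAI-compatible request this could mean asking the model for JSON and letting the grader parse a single field. A sketch (whether `response_format` is honored depends on the server build):

```json
{
  "model": "llama",
  "messages": [{"role": "user", "content": "Solve the problem and reply as JSON."}],
  "response_format": {"type": "json_object"}
}
```

The grader would then read something like `{"answer": "116"}` instead of scanning for boxed text.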
## Current Implementation Analysis
### What exists in llama-eval.py:
- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting
### What needs to be removed:
- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)
## Discussion Points
### 1. Eval State Object Structure
**Status: Under Discussion**
Questions:
- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?
### 2. Processor Architecture
**Status: Not Started**
Questions:
- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?
### 3. Grader Interface
**Status: Not Started**
Questions:
- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?
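One possible shape for that interface (class and method names are illustrative; the endpoint variant is stubbed):

```python
from typing import Protocol

class Grader(Protocol):
    """Hypothetical grading interface: one method, boolean verdict."""
    def grade(self, predicted: str, expected: str) -> bool: ...

class ExactMatchGrader:
    """Local grader: trimmed string equality."""
    def grade(self, predicted: str, expected: str) -> bool:
        return predicted.strip() == expected.strip()

class EndpointGrader:
    """Remote judge: would POST both answers to an LLM endpoint."""
    def __init__(self, url: str):
        self.url = url

    def grade(self, predicted: str, expected: str) -> bool:
        raise NotImplementedError("would query the judge endpoint here")

grader: Grader = ExactMatchGrader()
```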
### 4. Checkpointing
**Status: Not Started**
Questions:
- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?
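Whatever the answers, a first cut could be as simple as dumping the state dict to JSON on an interval (the path and cadence below are placeholders):

```python
import json
import tempfile
from pathlib import Path

def checkpoint(state: dict, path: Path):
    # Serialize the whole eval state; a processor would call this every
    # N completed tasks so an interrupted run can resume
    path.write_text(json.dumps(state))

path = Path(tempfile.gettempdir()) / "eval-state.json"
state = {"id": "aime-2025", "task_states": {"aime-01": {"correct": True}}}
checkpoint(state, path)
restored = json.loads(path.read_text())
```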
### 5. Real-time Output
**Status: Not Started**
Questions:
- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?
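The default/verbose split from requirement 5 might look like this (the output format is a guess):

```python
def report(task_id: str, correct: bool, predicted: str = "",
           expected: str = "", verbose: bool = False) -> str:
    # Default: one "correct / not correct" line per task;
    # verbose adds the produced vs expected answers
    status = "correct" if correct else "not correct"
    line = f"[{task_id}] {status}"
    if verbose:
        line += f" (predicted: {predicted}, expected: {expected})"
    print(line)
    return line

report("aime-01", True)
report("aime-02", False, "117", "116", verbose=True)
```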
### 6. Output Format
**Status: Not Started**
Questions:
- Should responses be in JSON format?
- How should the grader interface work with JSON output?
## Next Steps
1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format
## References
- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195


@@ -0,0 +1,184 @@
# llama-server-simulator Implementation Plan
## Overview
Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
## Goals
1. Simulate llama-server's `/v1/chat/completions` endpoint
2. Accept requests and respond with expected answers from AIME dataset
3. Implement configurable success rate (sometimes right, sometimes wrong)
4. Use regex matching to find questions in incoming requests
5. Test with curl requests before integrating with eval script
## Implementation Plan
### Phase 1: Basic Simulator Structure
- Create `llama-server-simulator.py` script
- Set up Flask/FastAPI HTTP server
- Implement `/v1/chat/completions` endpoint
- Handle basic request/response format
### Phase 2: AIME Dataset Integration
- Load AIME dataset
- Store questions and expected answers
- Implement regex matching to find questions in incoming requests
- Extract expected answer from matched question
### Phase 3: Response Generation
- Implement success rate configuration
- Randomly determine if response should be correct or incorrect
- Generate appropriate response based on success determination
- Format response in OpenAI-compatible format
### Phase 4: Testing
- Write curl commands to test basic functionality
- Test correct responses
- Test incorrect responses
- Test edge cases (no question found, etc.)
## Technical Details
### Server Framework
- Use Flask for simplicity
- Listen on configurable port
- Support JSON request/response format
### Request Format
```json
{
    "model": "llama",
    "messages": [
        {"role": "user", "content": "Question text here"}
    ],
    "temperature": 0,
    "max_tokens": 2048
}
```
### Response Format
```json
{
    "id": "chatcmpl-xxx",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "llama",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Answer text here"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 100,
        "completion_tokens": 50,
        "total_tokens": 150
    }
}
```
### AIME Dataset Integration
- Load from HuggingFace: "AI-MO/aimo-validation-aime"
- Store in memory for fast lookup
- Regex pattern to find question text in request
- Extract answer from matched question
### Success Rate Configuration
- Command-line argument: `--success-rate 0.8` (80% success rate)
- Randomly determine correctness based on rate
- Log when responses are correct vs incorrect
### Testing Strategy
1. Start simulator with default settings
2. Send curl request with known question
3. Verify response contains expected answer
4. Test with different success rates
5. Test edge cases
## Implementation Steps
### Step 1: Basic Server Setup
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = request.get_json()  # parsed request body
    # Build an OpenAI-compatible response from the request here
    response = {"choices": []}  # placeholder; filled in later phases
    return jsonify(response)
```
### Step 2: Load AIME Dataset
```python
import datasets

# Download (or load from the local cache) and keep all questions in memory
ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train")
questions = list(ds)
```
### Step 3: Regex Matching
```python
import re

def find_question_in_request(request_text):
    # Non-greedy match: capture everything after "question:" up to the
    # first newline
    pattern = r"question:\s*(.*?)\n"
    match = re.search(pattern, request_text)
    return match.group(1) if match else None
```
### Step 4: Response Generation
```python
import random

def generate_response(question, success_rate):
    # get_expected_answer / get_wrong_answer are implemented in Phase 3
    if random.random() < success_rate:
        return get_expected_answer(question)
    return get_wrong_answer(question)
```
### Step 5: Testing with Curl
```bash
curl -X POST http://localhost:8033/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [{"role": "user", "content": "Question text"}]
    }'
```
## Configuration Options
- `--port`: Server port (default: 8033)
- `--success-rate`: Success rate 0-1 (default: 0.8)
- `--host`: Server host (default: localhost)
- `--dataset-split`: AIME split to use (default: train)
## Expected Output
```
=== llama-server-simulator ===
Server running on http://localhost:8033
Success rate: 0.8
AIME dataset loaded: 1000 questions
```
## Testing Checklist
- [ ] Server starts successfully
- [ ] Basic request/response works
- [ ] Correct answers returned at roughly the configured success rate
- [ ] Wrong answers returned otherwise
- [ ] Request with no matching question returns an error
- [ ] Multiple requests work correctly
- [ ] Different success rates work as expected
## Next Steps
1. Implement basic server structure
2. Load AIME dataset
3. Implement regex matching
4. Add response generation with success rate
5. Test with curl commands
6. Integrate with eval script once simulator works


@@ -0,0 +1,267 @@
#!/usr/bin/env python3
import argparse
import json
import random
import re
import time
import sys
import os
from typing import Dict, List, Optional
from dataclasses import dataclass, asdict
from pathlib import Path

import datasets
from flask import Flask, request, jsonify

# Set cache directory for HuggingFace datasets
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"
cache_dir.mkdir(parents=True, exist_ok=True)
os.environ["HF_DATASETS_CACHE"] = str(cache_dir)


def levenshtein_distance(s1: str, s2: str) -> int:
    """Calculate Levenshtein distance between two strings"""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]


def debug_log(message: str):
    """Log debug messages to stderr and /tmp/simulator-debug.log"""
    print(message, file=sys.stderr)
    with open("/tmp/simulator-debug.log", "a") as f:
        f.write(message + "\n")


app = Flask(__name__)


@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict]
    sampling_config: Dict


class AimeDataset:
    def __init__(self, split: str = "train"):
        self.split = split
        self.questions: List[Dict] = []
        self._load_dataset()

    def _load_dataset(self):
        print(f"Loading AIME dataset (split: {self.split})...")
        print(f"Using cache: {os.environ.get('HF_DATASETS_CACHE', 'default')}")
        ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)
        self.questions = list(ds)
        print(f"AIME dataset loaded: {len(self.questions)} questions")

    def find_question(self, request_text: str) -> Optional[Dict]:
        best_match = None
        best_distance = float('inf')
        best_index = -1
        request_lower = request_text.lower()
        for i, question in enumerate(self.questions):
            question_text = question["problem"]
            question_lower = question_text.lower()
            # Exact match
            if question_lower == request_lower:
                debug_log(f"DEBUG: Found exact match at index {i}")
                return question
            # Remove LaTeX formatting for more flexible matching
            question_no_latex = re.sub(r'\$[^$]+\$', '', question_text)
            if question_no_latex.lower() == request_lower:
                debug_log(f"DEBUG: Found match (no LaTeX) at index {i}")
                return question
            # Calculate Levenshtein distance for partial matches
            # Only consider if request is at least 50% of question length
            if len(request_lower) >= len(question_lower) * 0.5:
                distance = levenshtein_distance(question_lower, request_lower)
                # Normalize distance by length
                normalized_distance = distance / len(question_lower)
                if normalized_distance < best_distance:
                    best_distance = normalized_distance
                    best_match = question
                    best_index = i
        if best_match and best_distance < 0.3:  # Threshold for partial match
            debug_log(f"DEBUG: Found best partial match at index {best_index} with distance {best_distance:.3f}")
            return best_match
        debug_log(f"DEBUG: No matching question found for: {request_text[:100]}...")
        return None

    def get_answer(self, question: Dict) -> str:
        return str(question["answer"])


class Simulator:
    def __init__(
        self,
        port: int = 8033,
        host: str = "localhost",
        success_rate: float = 0.8,
        dataset_split: str = "train"
    ):
        self.port = port
        self.host = host
        self.success_rate = success_rate
        self.dataset = AimeDataset(dataset_split)
        self.eval_state = EvalState(
            id="aime-2025",
            tasks=["aime"],
            task_states={},
            sampling_config={"temperature": 0, "max_tokens": 2048}
        )

    def _generate_response(
        self,
        question: Dict,
        should_be_correct: bool
    ) -> Dict:
        expected_answer = self.dataset.get_answer(question)
        if should_be_correct:
            response_text = expected_answer
        else:
            response_text = self._generate_wrong_answer(question)
        return {
            "id": f"chatcmpl-{int(time.time())}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": "llama",
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": response_text
                    },
                    "finish_reason": "stop"
                }
            ],
            "usage": {
                "prompt_tokens": 100,
                "completion_tokens": 50,
                "total_tokens": 150
            }
        }

    def _generate_wrong_answer(self, question: Dict) -> str:
        expected_answer = self.dataset.get_answer(question)
        if expected_answer.isdigit():
            wrong_answer = str(int(expected_answer) + 1)
        else:
            wrong_answer = expected_answer + " (wrong)"
        return wrong_answer

    def _process_request(self, request_data: Dict) -> Dict:
        messages = request_data.get("messages", [])
        if not messages:
            return {"error": "No messages in request"}
        request_text = messages[0].get("content", "")
        debug_log(f"DEBUG: Received request with content: {request_text[:150]}...")
        question = self.dataset.find_question(request_text)
        if not question:
            debug_log("DEBUG: find_question returned None")
            return {"error": "No matching question found"}
        should_be_correct = random.random() < self.success_rate
        response = self._generate_response(question, should_be_correct)
        task_id = "aime"
        self.eval_state.task_states[task_id] = {
            "correct": should_be_correct,
            "expected": self.dataset.get_answer(question),
            "predicted": response["choices"][0]["message"]["content"]
        }
        return response


@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    try:
        request_data = request.get_json()
        if not request_data:
            return jsonify({"error": "Invalid JSON"}), 400
        response = simulator._process_request(request_data)
        return jsonify(response)
    except Exception as e:
        print(f"Error processing request: {e}")
        return jsonify({"error": str(e)}), 500


def main():
    parser = argparse.ArgumentParser(
        description="llama-server simulator for testing eval scripts"
    )
    parser.add_argument(
        "--port",
        type=int,
        default=8033,
        help="Server port (default: 8033)"
    )
    parser.add_argument(
        "--host",
        type=str,
        default="localhost",
        help="Server host (default: localhost)"
    )
    parser.add_argument(
        "--success-rate",
        type=float,
        default=0.8,
        help="Success rate 0-1 (default: 0.8)"
    )
    parser.add_argument(
        "--dataset-split",
        type=str,
        default="train",
        help="AIME dataset split to use (default: train)"
    )
    args = parser.parse_args()
    global simulator
    simulator = Simulator(
        port=args.port,
        host=args.host,
        success_rate=args.success_rate,
        dataset_split=args.dataset_split
    )
    print("\n=== llama-server-simulator ===")
    print(f"Server running on http://{args.host}:{args.port}")
    print(f"Success rate: {args.success_rate}")
    print(f"AIME dataset loaded: {len(simulator.dataset.questions)} questions")
    print("\nPress Ctrl+C to stop\n")
    app.run(host=args.host, port=args.port, debug=False)


if __name__ == "__main__":
    main()


@@ -0,0 +1,135 @@
# llama-server-simulator Implementation Summary
## Overview
Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
## Features Implemented
### 1. HTTP Server
- Flask-based `/v1/chat/completions` endpoint
- OpenAI-compatible response format
- Configurable port and host
### 2. AIME Dataset Integration
- Loads AIME dataset from HuggingFace
- In-memory storage for fast lookup
- 90 questions loaded from train split
### 3. Intelligent Question Matching
- **Exact matching**: Direct string comparison
- **LaTeX removal**: Removes `$...$` formatting for flexible matching
- **Levenshtein distance**: Calculates similarity between strings
- **Partial matching**: Finds best match even with small differences
### 4. Response Generation
- Configurable success rate (0-1)
- Returns the correct answer with probability equal to the success rate
- Otherwise returns a deliberately wrong answer
- Wrong answers are generated by incrementing the expected (numeric) answer
### 5. Debug Logging
- Debug messages written to stderr
- Logs request content, matching results, and distances
- Helps troubleshoot matching issues
## Configuration Options
```bash
python3 llama-server-simulator.py \
    --port 8034 \
    --host localhost \
    --success-rate 0.8 \
    --dataset-split train
```
## Testing Results
### Test 1: Correct Answer
- **Success rate**: 0.8
- **Expected answer**: 116
- **Result**: ✓ Correct (116)
### Test 2: Wrong Answer
- **Success rate**: 0.0
- **Expected answer**: 116
- **Result**: ✓ Wrong (117)
### Test 3: No Matching Question
- **Request**: "What is the capital of France?"
- **Result**: ✓ Returns error "No matching question found"
### Test 4: Success Rate Verification
- **Success rate**: 0.8
- **Requests**: 10
- **Correct answers**: 8/10 (80%)
- **Result**: ✓ Success rate working as expected
## Technical Details
### Matching Algorithm
1. Try exact match (case-insensitive)
2. Try match after removing LaTeX formatting
3. Calculate Levenshtein distance for partial matches
4. Return best match if distance < 0.3 (30% difference)
### Response Format
```json
{
    "id": "chatcmpl-1769864875",
    "object": "chat.completion",
    "created": 1769864875,
    "model": "llama",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "116"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 100,
        "completion_tokens": 50,
        "total_tokens": 150
    }
}
```
## Files Created
1. `llama-server-simulator.py` - Main simulator script
2. `test-simulator.sh` - Basic test script
3. `test-simulator-comprehensive.sh` - Comprehensive test script
4. `llama-server-simulator-plan.md` - Implementation plan
5. `llama-eval-discussion.md` - Discussion notes
## Next Steps
1. ✓ Basic simulator structure
2. ✓ AIME dataset integration
3. ✓ Question matching with Levenshtein distance
4. ✓ Response generation with configurable success rate
5. ✓ Testing with curl requests
6. ⏭️ Integrate with eval script
7. ⏭️ Implement eval state object
8. ⏭️ Implement processor object
9. ⏭️ Add real-time progress reporting
## Known Limitations
1. Only supports AIME dataset (train split)
2. Matching is case-insensitive
3. Wrong answers are simple increments (not realistic)
4. No support for multiple endpoints
5. No distributed evaluation
## Future Enhancements
1. Support multiple datasets
2. More sophisticated wrong answer generation
3. Multiple endpoint support
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization


@@ -0,0 +1,43 @@
#!/bin/bash
echo "=== Testing HuggingFace Dataset Caching ==="
echo ""
echo "=== First Load (should download) ==="
echo "Starting simulator for first load..."
source venv/bin/activate
# Redirect instead of piping to tee, so $! is the simulator's PID (needed for kill below)
python3 examples/llama-eval/llama-server-simulator.py --port 8035 --success-rate 0.8 > /tmp/simulator-first.log 2>&1 &
SIMULATOR_PID=$!
sleep 5
echo "First load complete"
echo ""
echo "=== Second Load (should use cache) ==="
echo "Starting simulator for second load..."
python3 examples/llama-eval/llama-server-simulator.py --port 8036 --success-rate 0.8 > /tmp/simulator-second.log 2>&1 &
SIMULATOR_PID2=$!
sleep 5
echo "Second load complete"
echo ""
echo "=== Checking Cache Directory ==="
echo "Cache directory size:"
du -sh ~/.cache/huggingface/datasets/AI-MO___aimo-validation-aime
echo ""
echo "=== Checking First Load Log ==="
echo "First load log (last 15 lines):"
tail -15 /tmp/simulator-first.log
echo ""
echo "=== Checking Second Load Log ==="
echo "Second load log (last 15 lines):"
tail -15 /tmp/simulator-second.log
echo ""
echo "=== Test Complete ==="
echo "Both loads completed successfully!"
echo "The second load should have used the cache (no download warning)."
echo ""
kill $SIMULATOR_PID 2>/dev/null
kill $SIMULATOR_PID2 2>/dev/null


@@ -0,0 +1,93 @@
#!/bin/bash
echo "=== llama-server-simulator Test Script ==="
echo ""
PORT=8033
SUCCESS_RATE=0.8
echo "Starting simulator on port $PORT with success rate $SUCCESS_RATE..."
source venv/bin/activate
python3 examples/llama-eval/llama-server-simulator.py --port $PORT --success-rate $SUCCESS_RATE > /tmp/simulator-test.log 2>&1 &
SIMULATOR_PID=$!
echo "Waiting for simulator to start..."
sleep 5
echo ""
echo "=== Test 1: Basic Request with Known Question ==="
echo "Sending request with AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [
{"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
],
"temperature": 0,
"max_tokens": 2048
}' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data['choices'][0]['message']['content'])"
echo ""
echo ""
echo "=== Test 2: Request with Different Question ==="
echo "Sending request with another AIME question..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [
{"role": "user", "content": "Compute the value of 2^10 + 3^10."}
],
"temperature": 0,
"max_tokens": 2048
}' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Answer:', data.get('error') or data['choices'][0]['message']['content'])"
echo ""
echo ""
echo "=== Test 3: Request with No Matching Question ==="
echo "Sending request with non-matching text..."
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0,
"max_tokens": 2048
}' | python3 -c "import sys, json; data = json.load(sys.stdin); print('Response:', data.get('error', 'No error'))"
echo ""
echo ""
echo "=== Test 4: Multiple Requests to Test Success Rate ==="
echo "Sending 10 requests to test success rate..."
correct_count=0
for i in {1..10}; do
    echo "Request $i:"
    response=$(curl -s -X POST http://localhost:$PORT/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "llama",
            "messages": [
                {"role": "user", "content": "Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."}
            ],
            "temperature": 0,
            "max_tokens": 2048
        }')
    answer=$(echo "$response" | python3 -c "import sys, json; data = json.load(sys.stdin); print(data['choices'][0]['message']['content'])")
    if [ "$answer" == "116" ]; then
        correct_count=$((correct_count + 1))
    fi
    echo "  Answer: $answer"
done
echo "Correct answers: $correct_count/10"
echo "Success rate: $((correct_count * 10))%"
echo ""
echo "=== Test Complete ==="
echo "Stopping simulator..."
kill $SIMULATOR_PID 2>/dev/null
wait $SIMULATOR_PID 2>/dev/null || true
echo "Simulator stopped."