diff --git a/examples/llama-eval/IMPLEMENTATION.md b/examples/llama-eval/IMPLEMENTATION.md
new file mode 100644
index 0000000000..c9542f005d
--- /dev/null
+++ b/examples/llama-eval/IMPLEMENTATION.md
@@ -0,0 +1,85 @@
+# llama-eval Implementation Summary
+
+## Overview
+
+Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).
+
+## Key Features
+
+- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
+- **Flexible Grading**: Regex, CLI, or LLM-based grading
+- **Parallel Processing**: Configurable thread count for concurrent requests
+- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
+- **Real-time Feedback**: Progress tracking with detailed output
+- **JSON Output**: Complete eval state saved for debugging
+- **GPQA Support**: Answer shuffling with reproducible results
+
+## Architecture
+
+### Eval State
+```python
+@dataclass
+class EvalState:
+    id: str
+    tasks: List[str]
+    task_states: Dict[str, Dict[str, Any]]
+    sampling_config: Dict[str, Any]
+```
+
+### Processor
+- Handles processing, grading, and state management
+- Thread-safe concurrent execution
+- Configurable sampling parameters
+
+### Grader
+- Abstract grading interface supporting multiple types
+- Regex grader with dataset-specific patterns
+- CLI grader with external script interface
+- LLM grader with configurable server and model
+
+### Datasets
+- `AimeDataset`: 90 AIME 2025 questions
+- `Gsm8kDataset`: 7473 math word problems
+- `GpqaDataset`: 198 GPQA Diamond questions with shuffling
+
+## Configuration
+
+### Sampling Parameters (Optional)
+- `--temperature`: Sampling temperature
+- `--top-k`: Top K sampling
+- `--top-p`: Top P sampling
+- `--min-p`: Min P sampling
+- Only passed if explicitly specified
+
+### Grading Types
+- **regex**: Built-in patterns for each dataset
+- **cli**: External script with `--answer` and `--expected` args
+- **llm**: LLM-based extraction with 
configurable server/model
+
+## Output Format
+
+### Progress Table
+```
+ Task ID               Dataset  Prompt (first 43 chars)                       Expected   Status
+ gpqa_000_001          GPQA     Complete the following reactions and sel...   A          pending
+```
+
+### Results Summary
+```
+============================================================
+Results: 8/10 correct (80.0%)
+============================================================
+```
+
+### JSON Output
+Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
+
+## Technical Details
+
+- Default max tokens: -1 (infinite)
+- Default grader type: llm
+- Default seed: 1234
+- Default threads: 32
+- Prompt truncation: First 43 chars + padding + "..."
+- GPQA requires LLM grader (returns letter A/B/C/D)
+- Judge model defaults to evaluated model if not specified
diff --git a/examples/llama-eval/README.md b/examples/llama-eval/README.md
new file mode 100644
index 0000000000..1c96cc6a1f
--- /dev/null
+++ b/examples/llama-eval/README.md
@@ -0,0 +1,105 @@
+# llama-eval Evaluation Tool
+
+Simple evaluation tool for llama.cpp with support for multiple datasets. 
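The tool drives llama-server through its OpenAI-compatible `/v1/chat/completions` endpoint, and the optional sampling parameters are only forwarded when they are explicitly specified on the command line, so the server's own defaults apply otherwise. A minimal sketch of that payload construction (a hypothetical helper that mirrors the tool's behavior, not the actual implementation):

```python
def build_request_body(prompt, model="llama", n_predict=-1,
                       temperature=None, top_k=None, top_p=None, min_p=None):
    """Build a /v1/chat/completions payload; sampling params are added only when set."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "n_predict": n_predict,  # -1 means no token limit
    }
    # Unspecified sampling parameters are omitted entirely, so the
    # server falls back to its own defaults instead of receiving nulls.
    for key, value in (("temperature", temperature), ("top_k", top_k),
                       ("top_p", top_p), ("min_p", min_p)):
        if value is not None:
            body[key] = value
    return body

# Example: only temperature is set, so top_k/top_p/min_p stay omitted.
payload = build_request_body("What is 2 + 2?", temperature=0.7)
# POSTing it would look like:
#   requests.post("http://127.0.0.1:8013/v1/chat/completions", json=payload)
```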
+
+## Features
+
+- **Multiple Datasets**: AIME, GSM8K, GPQA
+- **Flexible Grading**: Regex, CLI, or LLM-based grading
+- **Parallel Processing**: Configurable thread count
+- **Real-time Feedback**: Progress tracking with detailed output
+- **Sampling Parameters**: Temperature, Top K, Top P, Min P
+- **JSON Output**: Complete eval state saved for debugging
+
+## Usage
+
+```bash
+python llama-eval-new.py \
+    --server http://127.0.0.1:8013 \
+    --model gpt-oss-20b-hf-low \
+    --judge-model gpt-oss-20b-hf-medium \
+    --dataset aime \
+    --n_cases 10 \
+    --grader-type llm \
+    --seed 42
+```
+
+## CLI Arguments
+
+- `--server`: llama-server URL (default: http://127.0.0.1:8013)
+- `--model`: Model name for evaluation (default: llama)
+- `--judge-model`: Model name for LLM judge (default: same as main model)
+- `--judge-server`: Server URL for LLM judge (default: same as main server)
+- `--dataset`: Dataset type (aime, gsm8k, gpqa)
+- `--n_cases`: Number of cases to evaluate (default: all)
+- `--n_predict`: Max tokens to predict per prompt (default: -1, infinite)
+- `--temperature`: Sampling temperature (default: not passed)
+- `--top-k`: Top K sampling (default: not passed)
+- `--top-p`: Top P sampling (default: not passed)
+- `--min-p`: Min P sampling (default: not passed)
+- `--threads`: Number of threads for parallel requests (default: 32)
+- `--verbose`: Show detailed output for each case
+- `--output`: Output file for eval state (default: llama-eval-state.json)
+- `--grader-type`: Grader type (regex, cli, llm, default: llm)
+- `--grader-script`: Path to CLI grader script (required for --grader-type cli)
+- `--seed`: Random seed for shuffling (default: 1234)
+
+## Datasets
+
+### AIME
+- 90 questions from 2025 AIME competition
+- Answers in boxed format: `\boxed{answer}`
+- Requires regex grader or LLM grader
+
+### GSM8K
+- 7473 math word problems
+- Answers are numeric values
+- Requires regex grader or LLM grader
+
+### GPQA
+- 198 questions from GPQA Diamond 
dataset
+- Multiple choice with shuffled options
+- Requires LLM grader (returns letter A, B, C, or D)
+
+## Grading Types
+
+### Regex Grader
+Built-in patterns for different datasets:
+- AIME: `\boxed{(\d+)}|\b(\d+)\b`
+- GSM8K: `\b(\d+)\b`
+- GPQA: Letter extraction (A, B, C, D)
+
+### CLI Grader
+External script interface:
+```bash
+./grader.sh --answer <answer> --expected <expected>
+```
+Returns exit code 0 if correct, non-zero if incorrect.
+
+### LLM Grader
+Uses LLM to extract and compare answers:
+- Configurable server and model
+- Includes problem context in prompt
+- Case-insensitive comparison
+
+## Output
+
+### Progress Table
+```
+ Task ID               Dataset  Prompt (first 43 chars)                       Expected   Status
+ gpqa_000_001          GPQA     Complete the following reactions and sel...   A          pending
+```
+
+### Results
+```
+============================================================
+Results: 8/10 correct (80.0%)
+============================================================
+```
+
+### JSON Output
+Complete eval state saved to output file with:
+- Task IDs and correctness status
+- Prompts and extracted answers
+- Sampling configuration
+- Processing metadata
diff --git a/examples/llama-eval/llama-eval-discussion.md b/examples/llama-eval/llama-eval-discussion.md
deleted file mode 100644
index 1747aa0655..0000000000
--- a/examples/llama-eval/llama-eval-discussion.md
+++ /dev/null
@@ -1,395 +0,0 @@
-# llama-eval Implementation Discussion
-
-## Overview
-Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.
-
-## Key Requirements from ggerganov
-
-### 1. Simplify and Focus on One Eval
-- Start with AIME2025 (most familiar with it)
-- Don't support multiple evals initially
-
-### 2. Implement an "eval state" object
-- ID
-- List of tasks
-- Task states
-- Sampling config
-
-### 3. Implement a "processor" object
-- List of endpoints
-- Threads per endpoint
-- Grade/judge type (regex, endpoint, or CLI tool)
-
-### 4. 
Processor responsibilities -- Accepts eval state -- Starts processing -- Dumps eval state periodically as it progresses - -### 5. Real-time feedback -- Default: show "correct / not correct" for each task -- Verbose mode: show produced answer vs expected answer as soon as it completes - -### 6. Grading approach -- Abstract grading to support external "grader" or "judge" -- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals) - -### 7. Output format -- Use structured output (JSON) instead of boxed text - -## Current Implementation Analysis - -### What exists in llama-eval.py: -- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande) -- Regex-based answer extraction -- HTTP requests to OpenAI-compatible endpoint -- Checkpointing/resume capability -- Thread-based parallel execution -- Summary reporting - -### What needs to be removed: -- All task implementations except AIME -- Regex-based grading -- Multiple endpoint support -- Complex task loading logic -- Summary reporting (replace with real-time feedback) - -## Discussion Points - -### 1. Eval State Object Structure -**Status: Under Discussion** - -Questions: -- What fields should be in the eval state object? -- Should it include the actual prompts, or just metadata? -- How should task states be tracked? - -### 2. Processor Architecture -**Status: Not Started** - -Questions: -- Should the processor handle multiple endpoints (for distributed evaluation)? -- What's the threading model? -- How are endpoints configured? - -### 3. Grader Interface -**Status: Not Started** - -Questions: -- How should the grader be configured? -- Should it be a separate service, or a local LLM call? -- What's the interface for grading? - -### 4. Checkpointing -**Status: Not Started** - -Questions: -- Should the eval state be serialized to disk? -- How often should it be dumped? -- What format should it use? - -### 5. 
Real-time Output -**Status: Not Started** - -Questions: -- How should progress be displayed? -- Console output, file logging, or both? -- What verbosity levels are needed? - -### 6. Output Format -**Status: Not Started** - -Questions: -- Should responses be in JSON format? -- How should the grader interface work with JSON output? - -## Next Steps - -1. **Eval State Object** - Currently discussing -2. Processor Architecture -3. Grader Interface -4. Checkpointing -5. Real-time Output -6. Output Format - -## References -- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892 -- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195 - -## Session Work Summary - -### llama-server-simulator Implementation - -**Created:** -- `llama-server-simulator.py` - Standalone Python script simulating llama-server HTTP endpoint -- `test-simulator.sh` - Test script for verifying simulator functionality -- `llama-server-simulator-plan.md` - Implementation plan -- `simulator-summary.md` - Summary of implementation - -**Features Implemented:** -1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with OpenAI-compatible format -2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching -3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance -4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation -5. 
Debug Logging - Helps troubleshoot matching issues - -**Testing Results:** -- ✅ Correct answers returned when success rate allows -- ✅ Wrong answers returned when success rate doesn't allow -- ✅ No matching questions return errors -- ✅ Success rate verified (80% in 10 requests) -- ✅ HuggingFace dataset caching working correctly - -**Key Technical Decisions:** -- Used Levenshtein distance for partial matching (threshold: 0.3) -- Automatic caching via HuggingFace datasets library -- Wrong answers generated by incrementing expected answer -- Debug output written to stderr for better visibility - -**Refactoring:** -- Extracted repeating question string into TEST_QUESTION variable -- Created make_request() helper function to reduce code duplication -- Added proper error handling for error responses -- Fixed simulator stopping issue at script completion - -### llama-eval-new.py Implementation - -**Created:** -- `llama-eval-new.py` - Simplified evaluation tool focused on AIME - -**Features Implemented:** -1. **Eval State Object** - Structured dataclass with ID, tasks, task states, and sampling config -2. **Processor Object** - Handles processing, grading, and state management -3. **Real-time Feedback** - Shows correct/incorrect status for each case -4. **Flexible Grading System** - Supports regex, CLI, and LLM-based grading -5. **Structured JSON Output** - Saves complete eval state to JSON file -6. **HuggingFace Dataset Caching** - Uses cached dataset path to avoid HF Hub requests -7. 
**Enhanced Answer Extraction** - Extracts answers from full responses for display - -**Grading System:** -- **Regex Grading**: Built-in patterns for different task types - - `aime`: `\boxed{(\d+)}|\b(\d+)\b` (handles boxed and plain text) - - `gsm8k`: `\b(\d+)\b` (extract first number) - - `mmlu`, `hellaswag`, `arc`, `winogrande`: `[A-D]` (extract single letter) -- **CLI Grading**: External script interface - - Script accepts `--answer ` and `--expected ` - - Returns exit code 0 if correct, non-zero if incorrect - - 30-second timeout to prevent hanging -- **LLM Judge**: Generic answer extraction using LLM - - Uses configured server and model for extraction - - Includes problem statement in prompt for context - - Case-insensitive comparison - - Returns extracted answer for display - -**Configuration Options:** -- `--server`: llama-server URL (default: http://localhost:8033) -- `--n_cases`: Number of cases to evaluate (default: all) -- `--n_predict`: Max tokens to predict per prompt (default: 2048) -- `--threads`: Number of threads for parallel requests (default: 32) -- `--verbose`: Show detailed output for each case -- `--output`: Output file for eval state (default: llama-eval-state.json) -- `--grader-type`: `regex`, `cli`, or `llm` -- `--grader-regex-type`: aime, gsm8k, mmlu, hellaswag, arc, winogrande -- `--grader-script`: Path to CLI grader script -- `--judge-server`: Server URL for LLM judge (default: same as main server) -- `--judge-model`: Model name for LLM judge (default: same as main model) - -**Testing Results:** -- ✅ Works with simulator at 100% success rate (all correct) -- ✅ Works with simulator at 0% success rate (all incorrect) -- ✅ Works with simulator at 80% success rate (8/10 correct) -- ✅ Real-time verbose output shows gold/pred/status for each case -- ✅ JSON output contains complete eval state with all cases -- ✅ HF Hub telemetry disabled (no warnings) -- ✅ Uses cached dataset path to avoid HF Hub requests when available -- ✅ Regex grader 
extracts answers correctly from various formats -- ✅ LLM judge can extract answers with problem context -- ✅ Response truncation focuses grading on final answer -- ✅ Case-insensitive matching works for both regex and LLM grader -- ✅ Judge model and server configuration propagate correctly -- ✅ Progress table shows extracted answers instead of full responses - -**Key Technical Decisions:** -- Removed Levenshtein matching - eval script only sends requests and validates answers -- Abstract grading interface for external grader support -- Exact match requirement for regex patterns -- Handles both boxed and plain text formats for AIME answers -- 30-second timeout for CLI grader -- Validates script exists before running -- Judge parameters set once during Grader construction -- LLM judge prompt includes problem statement for better extraction -- Response truncation to last 2-3 lines focuses grading on final answer -- Case-insensitive comparison for more flexible matching - -**Refactoring:** -- Removed all task implementations except AIME -- Removed regex-based grading (moved to flexible grader system) -- Removed multiple endpoint support -- Removed complex task loading logic -- Removed summary reporting (replaced with real-time feedback) -- Added HuggingFace dataset caching optimization -- Added LLM grader support with configurable server and model -- Added response truncation before grading -- Refactored grader interface to return extracted answers - -### llama-eval-new.py Threading and Model Parameter Updates - -**Changes Made:** -1. 
**Threading Support** - Added ThreadPoolExecutor for parallel request processing - - Added `from concurrent.futures import ThreadPoolExecutor, as_completed` - - Created `_process_single_case()` method for thread-safe case processing - - Refactored `process()` to use ThreadPoolExecutor with configurable thread count - - Updated progress tracking to work with concurrent execution - - Thread-safe eval state updates (task_states and counters) - -2. **Model Parameter** - Added `--model` argument to specify model name in request data - - Added `model_name` parameter to Processor.__init__() - - Updated `_make_request()` to use provided model name or default to "llama" - - Added `--model` argument to argument parser - - Model name is included in request JSON as `"model": "gpt-oss-20b-hf"` - -**Testing Results:** -- ✅ Works with 2 threads (5 cases processed in ~0.2s) -- ✅ Works with 4 threads (slightly faster throughput) -- ✅ Model parameter correctly added to request data -- ✅ Thread-safe progress tracking with tqdm -- ✅ No race conditions in eval state updates - -**Key Technical Decisions:** -- Used ThreadPoolExecutor for simple, effective parallelism -- No rate limiting needed (server can handle concurrent requests) -- Thread-safe counter updates for correct/total tracking -- Progress bar shows completion status across all threads -- Model parameter is optional - defaults to "llama" if not specified - -**Refactoring:** -- Extracted single case processing into `_process_single_case()` method -- Changed from sequential loop to ThreadPoolExecutor with futures -- Updated verbose output to show total count instead of index -- Made eval state updates thread-safe - -### llama-eval-new.py Enhanced Grading System - -**Changes Made:** -1. 
**Enhanced Grader Interface** - Updated to return extracted answers - - `grade()` method now returns `Tuple[bool, Optional[str]]` (correctness + extracted answer) - - Added `extracted` field to `TaskState` dataclass - - All grader types (regex, cli, llm) now return extracted answers - -2. **Improved Regex Grader** - - New `_extract_answer_regex()` method extracts answers using configured patterns - - Supports case-insensitive matching - - Returns first valid match found - - Handles both single values and multiple matches - -3. **LLM-Based Judge** - - New `_grade_llm()` method for generic answer extraction - - Includes problem statement in prompt for context - - Configurable server URL (defaults to main server) - - Configurable model name (defaults to main model) - - Case-insensitive comparison - - Returns extracted answer for display - -4. **Response Truncation** - - New `_truncate_response()` method keeps only last 2-3 lines - - Applied before grading to focus on final answer section - -5. **CLI Grader Update** - - Now also returns extracted answer - - Returns None if grading fails - -6. **Display Updates** - - Progress table shows extracted answer instead of full response - - Verbose mode shows full response plus extracted answer - -7. 
**New CLI Arguments** - - `--grader-type`: Added "llm" option - - `--judge-server`: Separate server for LLM judge - - `--judge-model`: Separate model for LLM judge - -**Testing Results:** -- ✅ Regex grader extracts answers correctly from various formats -- ✅ LLM judge can extract answers with problem context -- ✅ Response truncation focuses grading on final answer -- ✅ Case-insensitive matching works for both regex and LLM grader -- ✅ Judge model and server configuration propagate correctly -- ✅ Progress table shows extracted answers instead of full responses - -**Key Technical Decisions:** -- Judge parameters set once during Grader construction (not on each call) -- LLM judge prompt includes problem statement for better extraction -- Response truncation to last 2-3 lines focuses grading on final answer -- Case-insensitive comparison for more flexible matching -- Judge configuration propagates through Processor to Grader -- Display shows extracted answer for cleaner output - -**Refactoring:** -- Removed judge parameters from `grade()` method calls -- Added `judge_server_url` and `judge_model_name` to Grader class -- Updated `_grade_llm()` to use instance variables instead of parameters -- Simplified Processor initialization to pass judge config to grader -- Updated startup info to show judge server and model - -### llama-eval-new.py GSM8K Dataset Support - -**Changes Made:** -1. **GSM8K Dataset Integration** - Added support for GSM8K dataset alongside AIME - - Created `Gsm8kDataset` class with proper answer extraction logic - - GSM8K uses `"question"` field instead of `"problem"` field - - GSM8K answer field contains full reasoning with `####` prefix - - Extracts numeric answer from answer field during initialization - - Uses same regex grader pattern as AIME (`\b(\d+)\b`) - -2. 
**Dataset Type Configuration** - Added dataset selection support - - Added `--dataset` CLI argument with choices `aime` and `gsm8k` - - Updated `Processor` class to accept `dataset_type` parameter - - Dataset-specific initialization in `Processor.__init__()` - - Dataset name displayed in task summary table - -3. **Template Registry** - Added dataset-specific prompt templates - - AIME template: includes `\boxed{}` wrapper for final answer - - GSM8K template: plain text answer without wrapper - - Templates applied based on `question["dataset_type"]` field - -4. **Answer Extraction Logic** - Fixed GSM8K answer extraction - - GSM8K has pre-extracted `"gold"` field with numeric answer - - `Gsm8kDataset.get_answer()` checks for `"gold"` field first - - Falls back to answer field if gold field not present - - `AimeDataset.get_answer()` simplified to remove duplicate method - -5. **Task ID Format** - Fixed duplicate prefix in task IDs - - Changed from `f"{dataset_type}_{eval_state.id}_{chunk_idx:03d}_{i:03d}"` - - To `f"{dataset_type}_{chunk_idx:03d}_{i:03d}"` - - Removed redundant `eval_state.id` (was "gsm8k" for GSM8K) - -6. 
**Column Width Adjustments** - Improved table formatting - - Task ID column: 25 characters - - Dataset column: 5 characters - - Prompt column: 40 characters - - Expected column: 10 characters - -**Testing Results:** -- ✅ GSM8K dataset loads correctly with 7473 questions -- ✅ Numeric answers extracted from full reasoning text -- ✅ Task summary table displays correctly with adjusted column widths -- ✅ Task IDs show correct format (e.g., `gsm8k_000_3169`) -- ✅ Both AIME and GSM8K datasets work with same script -- ✅ Answer extraction works for both boxed and plain text formats -- ✅ Progress tracking shows extracted answers for both datasets - -**Key Technical Decisions:** -- GSM8K uses `"question"` field instead of `"problem"` field -- GSM8K answer field contains full reasoning with `####` prefix -- Numeric answer extracted during dataset initialization -- Same regex grader pattern works for both datasets -- Dataset selection via CLI argument for separate runs -- Template registry supports different prompt formats per dataset -- Task ID format simplified to avoid duplication - -**Refactoring:** -- Removed duplicate `get_question()` method from `AimeDataset` -- Removed "2025" suffix from eval state ID (was remnant from old version) -- Removed "2025" suffix from task summary table output -- Removed "2025" suffix from progress tracking output -- Updated `Processor.__init__()` to initialize appropriate dataset based on type -- Updated `_process_single_case()` to handle both `"problem"` and `"question"` fields -- Updated `process()` method to display dataset name and use `dataset_type` for task states diff --git a/examples/llama-eval/llama-eval-new.py b/examples/llama-eval/llama-eval-new.py index 8426dae724..eacbe3d887 100755 --- a/examples/llama-eval/llama-eval-new.py +++ b/examples/llama-eval/llama-eval-new.py @@ -5,6 +5,7 @@ import json import os import re import subprocess +import sys import time from concurrent.futures import ThreadPoolExecutor, as_completed from 
dataclasses import dataclass, asdict @@ -34,6 +35,15 @@ Please reason step by step, and put your final answer within \\boxed{{}}. """, "gsm8k": """{question} Please reason step by step, and provide your final answer. +""", + "gpqa": """{Question} + +(A) {A} +(B) {B} +(C) {C} +(D) {D} + +Express your final answer as the corresponding option 'A', 'B', 'C', or 'D'. """, } @@ -96,6 +106,15 @@ class AimeDataset: return str(normalized) if normalized is not None else answer return str(answer) + def get_prompt(self, question: Dict) -> str: + """Get formatted prompt for the question""" + if question["dataset_type"] == "gpqa": + return TEMPLATE_REGISTRY["gpqa"].format(**question) + else: + return TEMPLATE_REGISTRY[question["dataset_type"]].format( + question=question["problem"] if "problem" in question else question["question"] + ) + class Gsm8kDataset: def __init__(self, split: str = "train"): self.split = split @@ -146,17 +165,87 @@ class Gsm8kDataset: return str(normalized) if normalized is not None else answer return str(answer) + def get_prompt(self, question: Dict) -> str: + """Get formatted prompt for the question""" + return TEMPLATE_REGISTRY[question["dataset_type"]].format( + question=question["problem"] if "problem" in question else question["question"] + ) + +class GpqaDataset: + def __init__(self, variant: str = "diamond", seed: int = 1234): + self.variant = variant + self.seed = seed + self.questions: List[Dict] = [] + self._load_dataset() + + def _load_dataset(self): + print(f"Loading GPQA dataset (variant: {self.variant})...") + import pandas as pd + + url = f"https://openaipublic.blob.core.windows.net/simple-evals/gpqa_{self.variant}.csv" + df = pd.read_csv(url) + + rng = random.Random(self.seed) + + self.questions = [] + for _, row in df.iterrows(): + question = row.to_dict() + question["dataset_type"] = "gpqa" + + # Shuffle the answer options + correct_answer = question["Correct Answer"] + incorrect_answers = [ + question["Incorrect Answer 1"], + 
question["Incorrect Answer 2"], + question["Incorrect Answer 3"] + ] + + # Create list of (answer, is_correct) tuples + options = [(ans, ans == correct_answer) for ans in incorrect_answers] + options.append((correct_answer, True)) + + # Shuffle the options + rng.shuffle(options) + + # Extract shuffled answers and determine correct letter + shuffled_answers = [ans for ans, _ in options] + correct_letter = chr(ord('A') + options.index((correct_answer, True))) + + # Store shuffled answers and correct letter + question["shuffled_answers"] = shuffled_answers + question["correct_letter"] = correct_letter + + self.questions.append(question) + + print(f"GPQA dataset loaded: {len(self.questions)} questions") + + def get_question(self, index: int) -> Dict: + """Get question by index""" + return self.questions[index] + + def get_answer(self, question: Dict) -> str: + # GPQA returns the correct letter (A, B, C, or D) + return question["correct_letter"] + + def get_prompt(self, question: Dict) -> str: + """Get formatted prompt for the question""" + return TEMPLATE_REGISTRY["gpqa"].format( + Question=question["Question"], + A=question["shuffled_answers"][0], + B=question["shuffled_answers"][1], + C=question["shuffled_answers"][2], + D=question["shuffled_answers"][3] + ) + class Grader: def __init__( self, - grader_type: str = "regex", - grader_regex_type: str = "aime", + grader_type: str = "llm", grader_script: Optional[str] = None, judge_model_name: Optional[str] = None, judge_server_url: str = "" ): self.grader_type = grader_type - self.grader_regex_type = grader_regex_type self.grader_script = grader_script self.judge_model_name = judge_model_name self.judge_server_url = judge_server_url @@ -164,9 +253,7 @@ class Grader: def _get_pattern(self) -> Optional[str]: if self.grader_type == "regex": - if self.grader_regex_type not in GRADER_PATTERNS: - raise ValueError(f"Unknown grader regex type: {self.grader_regex_type}") - return GRADER_PATTERNS[self.grader_regex_type] + return 
GRADER_PATTERNS.get("aime") # Default to aime pattern return None def _extract_answer_regex(self, pred: str) -> Optional[str]: @@ -221,18 +308,21 @@ class Grader: """Grade using LLM-based extraction""" prompt = f"""Extract the answer from this response: -Response: {pred} - Expected answer: {gold} -Please provide only the extracted answer, nothing else.""" +=== + +Response: {pred} + +=== + +Please provide only the extracted answer, nothing else. If there is no clear answer in the response, reply with 'no answer'.""" url = f"{self.judge_server_url}/v1/chat/completions" headers = {"Content-Type": "application/json"} data = { "model": self.judge_model_name, "messages": [{"role": "user", "content": prompt}], "temperature": 0, - "max_tokens": 256 } try: @@ -264,14 +354,16 @@ class Processor: def __init__( self, server_url: str, - n_predict: int = 2048, + n_predict: int = -1, threads: int = 32, verbose: bool = False, grader: Optional[Grader] = None, model_name: Optional[str] = None, judge_server_url: str = "", judge_model_name: Optional[str] = None, - dataset_type: str = "aime" + dataset_type: str = "aime", + seed: int = 1234, + sampling_config: Optional[Dict[str, Any]] = None ): self.server_url = server_url self.n_predict = n_predict @@ -281,12 +373,14 @@ class Processor: self.judge_server_url = judge_server_url if judge_server_url else server_url self.judge_model_name = judge_model_name self.dataset_type = dataset_type + self.seed = seed self.grader = grader or Grader() + self.sampling_config = sampling_config or {"n_predict": n_predict} self.eval_state = EvalState( id=dataset_type, tasks=[dataset_type], task_states={}, - sampling_config={"temperature": 0, "max_tokens": n_predict} + sampling_config=self.sampling_config ) # Pass judge configuration to grader if using LLM grader @@ -301,6 +395,8 @@ class Processor: self.dataset = AimeDataset() elif dataset_type == "gsm8k": self.dataset = Gsm8kDataset() + elif dataset_type == "gpqa": + self.dataset = 
GpqaDataset(variant="diamond", seed=self.seed) else: raise ValueError(f"Unknown dataset type: {dataset_type}") @@ -311,9 +407,16 @@ class Processor: data = { "model": self.model_name if self.model_name else "llama", "messages": [{"role": "user", "content": prompt}], - "temperature": 0, - "max_tokens": self.n_predict + "n_predict": self.n_predict } + if self.sampling_config.get("temperature") is not None: + data["temperature"] = self.sampling_config["temperature"] + if self.sampling_config.get("top_k") is not None: + data["top_k"] = self.sampling_config["top_k"] + if self.sampling_config.get("top_p") is not None: + data["top_p"] = self.sampling_config["top_p"] + if self.sampling_config.get("min_p") is not None: + data["min_p"] = self.sampling_config["min_p"] response = requests.post(url, headers=headers, json=data) response.raise_for_status() @@ -322,14 +425,9 @@ class Processor: def _process_single_case(self, i: int, task_id: str) -> TaskState: """Process a single case (thread-safe)""" question = self.dataset.get_question(i) - dataset_id = f"{self.dataset_type}_{self.dataset.split}_{i}" + dataset_id = f"{self.dataset_type}_{i}" gold = self.dataset.get_answer(question) - - # Apply template if available - if question["dataset_type"] in TEMPLATE_REGISTRY: - prompt = TEMPLATE_REGISTRY[question["dataset_type"]].format(question=question["problem"] if "problem" in question else question["question"]) - else: - prompt = question["problem"] if "problem" in question else question["question"] + prompt = self.dataset.get_prompt(question) task_state = TaskState( case_id=task_id, @@ -361,12 +459,15 @@ class Processor: n_cases = len(self.dataset.questions) print(f"\nProcessing {n_cases} {self.dataset_type.upper()} questions...") - print(f"Server: {self.server_url}") + print(f"Server: {self.server_url} (model: {self.model_name})") print(f"Threads: {self.threads}") print(f"Max tokens: {self.n_predict}") + print(f"Seed: {self.seed}") + print(f"Sampling: 
temp={self.sampling_config.get('temperature', 'skip')}, top-k={self.sampling_config.get('top_k', 'skip')}, top-p={self.sampling_config.get('top_p', 'skip')}, min-p={self.sampling_config.get('min_p', 'skip')}") print(f"Grader: {self.grader.grader_type}", end="") if self.grader.grader_type == "llm": - print(f" (judge server: {self.judge_server_url}, model: {self.judge_model_name})", end="") + judge_model = self.judge_model_name if self.judge_model_name else self.model_name + print(f" (judge server: {self.judge_server_url}, model: {judge_model})", end="") print() print() @@ -389,9 +490,14 @@ class Processor: print(" Task ID Dataset Prompt (first 40 chars) Expected Status") for i, task_id in task_list: question = self.dataset.get_question(i) - prompt = question["problem"] if "problem" in question else question["question"] + prompt = self.dataset.get_prompt(question) gold = self.dataset.get_answer(question) - truncated_prompt = prompt[:40] + "..." if len(prompt) > 40 else prompt + first_line = prompt.split('\n')[0] + truncated_prompt = first_line[:43] + if len(first_line) > 43: + truncated_prompt += "..." + else: + truncated_prompt = truncated_prompt.ljust(43) + "..." print(f" {task_id:<20} {self.dataset_type.upper()} {truncated_prompt:<40} {gold:<10} pending") print() @@ -413,7 +519,13 @@ class Processor: # Print task completion status extracted_display = task_state.extracted if task_state.extracted else "N/A" success_ratio = correct / total if total > 0 else 0.0 - print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {task_state.prompt[:40]:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]") + first_line = task_state.prompt.split('\n')[0] + truncated_prompt = first_line[:43] + if len(first_line) > 43: + truncated_prompt += "..." + else: + truncated_prompt = truncated_prompt.ljust(43) + "..." 
+ print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {truncated_prompt:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]") if self.verbose: print(f"\nCase {total}: {task_state.correct}") @@ -456,7 +568,7 @@ def main(): "--dataset", type=str, default="aime", - choices=["aime", "gsm8k"], + choices=["aime", "gsm8k", "gpqa"], help="Dataset type (default: aime)" ) parser.add_argument( @@ -474,8 +586,32 @@ def main(): parser.add_argument( "--n_predict", type=int, - default=2048, - help="Max tokens to predict per prompt (default: 2048)" + default=-1, + help="Max tokens to predict per prompt (default: -1, infinite)" + ) + parser.add_argument( + "--temperature", + type=float, + default=None, + help="Sampling temperature (default: not passed)" + ) + parser.add_argument( + "--top-k", + type=int, + default=None, + help="Top K sampling (default: not passed)" + ) + parser.add_argument( + "--top-p", + type=float, + default=None, + help="Top P sampling (default: not passed)" + ) + parser.add_argument( + "--min-p", + type=float, + default=None, + help="Min P sampling (default: not passed)" ) parser.add_argument( "--threads", @@ -503,16 +639,9 @@ def main(): parser.add_argument( "--grader-type", type=str, - default="regex", + default="llm", choices=["regex", "cli", "llm"], - help="Grader type: regex, cli, or llm (default: regex)" - ) - parser.add_argument( - "--grader-regex-type", - type=str, - default="aime", - choices=list(GRADER_PATTERNS.keys()), - help="Regex grader type (default: aime)" + help="Grader type: regex, cli, or llm (default: llm)" ) parser.add_argument( "--grader-script", @@ -529,21 +658,37 @@ def main(): parser.add_argument( "--judge-model", type=str, - default=None, + default="", help="Model name for LLM judge (default: same as main model)" ) args = parser.parse_args() + # Validate grader type for GPQA + if args.dataset == "gpqa" and args.grader_type 
!= "llm": + print("Error: GPQA dataset requires --grader-type llm") + parser.print_help() + sys.exit(1) + grader = Grader( grader_type=args.grader_type, - grader_regex_type=args.grader_regex_type, - grader_script=args.grader_script + grader_script=args.grader_script, + judge_model_name=args.judge_model if args.judge_model else args.model ) if args.grader_type == "llm" and not args.judge_server: print("Warning: Using same server for LLM judge (no --judge-server specified)") + sampling_config = {"n_predict": args.n_predict} + if args.temperature is not None: + sampling_config["temperature"] = args.temperature + if args.top_k is not None: + sampling_config["top_k"] = args.top_k + if args.top_p is not None: + sampling_config["top_p"] = args.top_p + if args.min_p is not None: + sampling_config["min_p"] = args.min_p + processor = Processor( server_url=args.server, n_predict=args.n_predict, @@ -553,7 +698,8 @@ def main(): model_name=args.model, judge_server_url=args.judge_server, judge_model_name=args.judge_model, - dataset_type=args.dataset + dataset_type=args.dataset, + sampling_config=sampling_config ) eval_state = processor.process(n_cases=args.n_cases, seed=args.seed) diff --git a/examples/llama-eval/llama-eval-state.json b/examples/llama-eval/llama-eval-state.json new file mode 100644 index 0000000000..add0f626a3 --- /dev/null +++ b/examples/llama-eval/llama-eval-state.json @@ -0,0 +1,29 @@ +{ + "id": "gpqa", + "tasks": [ + "gpqa" + ], + "task_states": { + "gpqa": { + "total": 1, + "correct": 0, + "cases": { + "gpqa": [ + { + "case_id": "gpqa_000_184", + "prompt": "Consider a system with Hamiltonian operator $H = \\varepsilon \\vec{\\sigma}.\\vec{n}$. Here, $\\vec{n}$ is an arbitrary unit vector, $\\varepsilon $ is a constant of dimension energy, and components of $\\vec{\\sigma}$ are the Pauli spin matrices. 
What are the eigenvalues of the Hamiltonian operator?\n\n\n(A) +\\hbar/2, -\\hbar/2\n(B) +1, -1\n(C) +\\varepsilon \\hbar/2, - \\varepsilon \\hbar/2\n(D) + \\varepsilon, -\\varepsilon\n\n\nExpress your final answer as the corresponding option 'A', 'B', 'C', or 'D'.\n", + "gold": "+ \\varepsilon, -\\varepsilon\n", + "pred": null, + "extracted": null, + "correct": false, + "status": "error: HTTPConnectionPool(host='localhost', port=8034): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError(\"HTTPConnection(host='localhost', port=8034): Failed to establish a new connection: [Errno 61] Connection refused\"))" + } + ] + } + } + }, + "sampling_config": { + "temperature": 0, + "max_tokens": 2048 + } +} \ No newline at end of file diff --git a/examples/llama-eval/llama-server-simulator-README.md b/examples/llama-eval/llama-server-simulator-README.md new file mode 100644 index 0000000000..bd69e2615c --- /dev/null +++ b/examples/llama-eval/llama-server-simulator-README.md @@ -0,0 +1,36 @@ +# llama-server-simulator + +Standalone Python script simulating llama-server HTTP endpoint for testing. 
+ +## Features + +- HTTP Server with OpenAI-compatible `/v1/chat/completions` endpoint +- AIME Dataset Integration - Loads 90 questions from HuggingFace +- Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance +- Configurable Success Rate - Control correct/wrong answer generation (0-1) +- Debug Logging - Troubleshoot matching issues + +## Usage + +```bash +python llama-server-simulator.py --success-rate 0.8 +``` + +## Arguments + +- `--success-rate`: Probability of returning correct answer (0.0-1.0, default: 0.8) +- `--port`: Server port (default: 8033) +- `--debug`: Enable debug logging (default: False) + +## Testing + +```bash +./test-simulator.sh +``` + +## Implementation Details + +- Uses Levenshtein distance for partial matching (threshold: 0.3) +- Automatic caching via HuggingFace datasets library +- Wrong answers generated by incrementing expected answer +- Debug output written to stderr diff --git a/examples/llama-eval/llama-server-simulator-plan.md b/examples/llama-eval/llama-server-simulator-plan.md deleted file mode 100644 index ac7dfad060..0000000000 --- a/examples/llama-eval/llama-server-simulator-plan.md +++ /dev/null @@ -1,189 +0,0 @@ -# llama-server-simulator Implementation Plan - -## Overview -Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. - -## Goals -1. Simulate llama-server's `/v1/chat/completions` endpoint -2. Accept requests and respond with expected answers from AIME dataset -3. Implement configurable success rate (sometimes right, sometimes wrong) -4. Use regex matching to find questions in incoming requests -5. 
Test with curl requests before integrating with eval script - -## Implementation Plan - -### Phase 1: Basic Simulator Structure -- Create `llama-server-simulator.py` script -- Set up Flask/FastAPI HTTP server -- Implement `/v1/chat/completions` endpoint -- Handle basic request/response format - -### Phase 2: AIME Dataset Integration -- Load AIME dataset -- Store questions and expected answers -- Implement regex matching to find questions in incoming requests -- Extract expected answer from matched question - -### Phase 3: Response Generation -- Implement success rate configuration -- Randomly determine if response should be correct or incorrect -- Generate appropriate response based on success determination -- Format response in OpenAI-compatible format - -### Phase 4: Testing -- Write curl commands to test basic functionality -- Test correct responses -- Test incorrect responses -- Test edge cases (no question found, etc.) - -## Technical Details - -### Server Framework -- Use Flask for simplicity -- Listen on configurable port -- Support JSON request/response format - -### Request Format -```json -{ - "model": "llama", - "messages": [ - {"role": "user", "content": "Question text here"} - ], - "temperature": 0, - "max_tokens": 2048 -} -``` - -### Response Format -```json -{ - "id": "chatcmpl-xxx", - "object": "chat.completion", - "created": 1234567890, - "model": "llama", - "choices": [ - { - "index": 0, - "message": { - "role": "assistant", - "content": "Answer text here" - }, - "finish_reason": "stop" - } - ], - "usage": { - "prompt_tokens": 100, - "completion_tokens": 50, - "total_tokens": 150 - } -} -``` - -### AIME Dataset Integration -- Load from HuggingFace: "AI-MO/aimo-validation-aime" -- Store in memory for fast lookup -- Regex pattern to find question text in request -- Extract answer from matched question - -### Success Rate Configuration -- Command-line argument: `--success-rate 0.8` (80% success rate) -- Randomly determine correctness based on rate -- 
Log when responses are correct vs incorrect - -### Testing Strategy -1. Start simulator with default settings -2. Send curl request with known question -3. Verify response contains expected answer -4. Test with different success rates -5. Test edge cases - -## Implementation Steps - -### Step 1: Basic Server Setup -```python -from flask import Flask, request, jsonify - -app = Flask(__name__) - -@app.route('/v1/chat/completions', methods=['POST']) -def chat_completions(): - # Handle request - return jsonify(response) -``` - -### Step 2: Load AIME Dataset -```python -import datasets - -ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train") -# Store in memory -``` - -### Step 3: Regex Matching -```python -import re - -def find_question_in_request(request_text): - # Regex pattern to find question - pattern = r"question:\s*(.*?)\n" - match = re.search(pattern, request_text, re.DOTALL) - return match.group(1) if match else None -``` - -### Step 4: Response Generation -```python -import random - -def generate_response(question, success_rate): - if random.random() < success_rate: - return get_expected_answer(question) - else: - return get_wrong_answer(question) -``` - -### Step 5: Testing with Curl -```bash -curl -X POST http://localhost:8033/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "llama", - "messages": [{"role": "user", "content": "Question text"}] - }' -``` - -## Configuration Options -- `--port`: Server port (default: 8033) -- `--success-rate`: Success rate 0-1 (default: 0.8) -- `--host`: Server host (default: localhost) -- `--dataset-split`: AIME split to use (default: train) - -## Expected Output -``` -=== llama-server-simulator === -Server running on http://localhost:8033 -Success rate: 0.8 -AIME dataset loaded: 1000 questions -``` - -## Testing Checklist -- [ ] Server starts successfully -- [ ] Basic request/response works -- [ ] Correct answer returned when success rate allows -- [ ] Wrong answer returned 
when success rate doesn't allow -- [ ] No question found returns error -- [ ] Multiple requests work correctly -- [ ] Different success rates work as expected - -## Next Steps - -1. ✓ Implement basic server structure -2. ✓ Load AIME dataset -3. ✓ Implement regex matching -4. ✓ Add response generation with success rate -5. ✓ Test with curl commands -6. ✓ Integrate with eval script once simulator works -7. ✓ Implement eval state object -8. ✓ Implement processor object -9. ✓ Add real-time progress reporting -10. ✓ Add enhanced grading system with LLM judge diff --git a/examples/llama-eval/simulator-summary.md b/examples/llama-eval/simulator-summary.md deleted file mode 100644 index 3ea6af5530..0000000000 --- a/examples/llama-eval/simulator-summary.md +++ /dev/null @@ -1,138 +0,0 @@ -# llama-server-simulator Implementation Summary - -## Overview -Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. - -## Features Implemented - -### 1. HTTP Server -- Flask-based `/v1/chat/completions` endpoint -- OpenAI-compatible response format -- Configurable port and host - -### 2. AIME Dataset Integration -- Loads AIME dataset from HuggingFace -- In-memory storage for fast lookup -- 90 questions loaded from train split - -### 3. Intelligent Question Matching -- **Exact matching**: Direct string comparison -- **LaTeX removal**: Removes `$...$` formatting for flexible matching -- **Levenshtein distance**: Calculates similarity between strings -- **Partial matching**: Finds best match even with small differences - -### 4. Response Generation -- Configurable success rate (0-1) -- Returns correct answers when success rate allows -- Returns wrong answers when success rate doesn't allow -- Wrong answers are generated by incrementing the expected answer - -### 5. 
Debug Logging -- Debug messages written to stderr -- Logs request content, matching results, and distances -- Helps troubleshoot matching issues - -## Configuration Options - -```bash -python3 llama-server-simulator.py \ - --port 8034 \ - --host localhost \ - --success-rate 0.8 \ - --dataset-split train -``` - -## Testing Results - -### Test 1: Correct Answer -- **Success rate**: 0.8 -- **Expected answer**: 116 -- **Result**: ✓ Correct (116) - -### Test 2: Wrong Answer -- **Success rate**: 0.0 -- **Expected answer**: 116 -- **Result**: ✓ Wrong (117) - -### Test 3: No Matching Question -- **Request**: "What is the capital of France?" -- **Result**: ✓ Returns error "No matching question found" - -### Test 4: Success Rate Verification -- **Success rate**: 0.8 -- **Requests**: 10 -- **Correct answers**: 8/10 (80%) -- **Result**: ✓ Success rate working as expected - -## Technical Details - -### Matching Algorithm -1. Try exact match (case-insensitive) -2. Try match after removing LaTeX formatting -3. Calculate Levenshtein distance for partial matches -4. Return best match if distance < 0.3 (30% difference) - -### Response Format -```json -{ - "id": "chatcmpl-1769864875", - "object": "chat.completion", - "created": 1769864875, - "model": "llama", - "choices": [ - { - "index": 0, - "message": { - "role": "assistant", - "content": "116" - }, - "finish_reason": "stop" - } - ], - "usage": { - "prompt_tokens": 100, - "completion_tokens": 50, - "total_tokens": 150 - } -} -``` - -## Files Created - -1. `llama-server-simulator.py` - Main simulator script -2. `test-simulator.sh` - Basic test script -3. `test-simulator-comprehensive.sh` - Comprehensive test script -4. `llama-server-simulator-plan.md` - Implementation plan -5. `llama-eval-discussion.md` - Discussion notes - -## Next Steps - -1. ✓ Basic simulator structure -2. ✓ AIME dataset integration -3. ✓ Question matching with Levenshtein distance -4. ✓ Response generation with configurable success rate -5. 
✓ Testing with curl requests -6. ✓ Integrate with eval script -7. ✓ Implement eval state object -8. ✓ Implement processor object -9. ✓ Add real-time progress reporting -10. ✓ Add enhanced grading system with LLM judge - -## Known Limitations - -1. Only supports AIME dataset (train split) -2. Matching is case-insensitive -3. Wrong answers are simple increments (not realistic) -4. No support for multiple endpoints -5. No distributed evaluation - -## Future Enhancements - -1. Support multiple datasets -2. More sophisticated wrong answer generation -3. Multiple endpoint support -4. Distributed evaluation -5. Real-time progress reporting -6. Eval state serialization -7. Enhanced grading with LLM judge -8. Response truncation for better answer extraction
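The fuzzy question matching described in the summary above (exact case-insensitive match first, then normalized Levenshtein distance with a 0.3 threshold) can be sketched roughly as follows. This is an illustrative reconstruction, not the simulator's actual code: the function names, the normalization by the longer string's length, and the return values are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(request_text: str, questions: list[str], threshold: float = 0.3):
    """Find the stored question closest to the incoming request.

    Tries an exact (case-insensitive) substring match first; otherwise
    falls back to normalized Levenshtein distance and returns the best
    candidate only if it is within the threshold, else None.
    """
    text = request_text.lower()
    best, best_dist = None, 1.0
    for q in questions:
        qn = q.lower()
        if qn in text:  # exact match short-circuits the fuzzy search
            return q
        dist = levenshtein(text, qn) / max(len(text), len(qn), 1)
        if dist < best_dist:
            best, best_dist = q, dist
    return best if best_dist < threshold else None
```

With this shape, a request containing a known AIME question returns that question, while an unrelated prompt (e.g. "What is the capital of France?") falls outside the 0.3 threshold and yields `None`, matching the "No matching question found" behavior tested above.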