add gpqa + sampling + docs

This commit is contained in:
Georgi Gerganov 2026-02-16 00:52:17 +02:00
parent e8a807519a
commit cffd268bb3
No known key found for this signature in database
GPG Key ID: 449E073F9DC10735
8 changed files with 444 additions and 765 deletions


@ -0,0 +1,85 @@
# llama-eval Implementation Summary
## Overview
Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).
## Key Features
- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
- **Flexible Grading**: Regex, CLI, or LLM-based grading
- **Parallel Processing**: Configurable thread count for concurrent requests
- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
- **Real-time Feedback**: Progress tracking with detailed output
- **JSON Output**: Complete eval state saved for debugging
- **GPQA Support**: Answer shuffling with reproducible results
## Architecture
### Eval State
```python
@dataclass
class EvalState:
id: str
tasks: List[str]
task_states: Dict[str, Dict[str, Any]]
sampling_config: Dict[str, Any]
```
### Processor
- Handles processing, grading, and state management
- Thread-safe concurrent execution
- Configurable sampling parameters
### Grader
- Abstract grading interface supporting multiple types
- Regex grader with dataset-specific patterns
- CLI grader with external script interface
- LLM grader with configurable server and model
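A minimal sketch of such a dispatch interface (the names here are illustrative, not the actual implementation — a trivial exact-match backend stands in for the real regex/CLI/LLM backends):
```python
from typing import Callable, Dict, Optional, Tuple

# Each backend returns (is_correct, extracted_answer).
GradeFn = Callable[[str, str], Tuple[bool, Optional[str]]]

def exact_match(pred: str, gold: str) -> Tuple[bool, Optional[str]]:
    """Trivial stand-in backend: case-insensitive exact comparison."""
    extracted = pred.strip()
    return extracted.lower() == gold.strip().lower(), extracted

# The real "regex", "cli", and "llm" backends would be registered here.
GRADER_BACKENDS: Dict[str, GradeFn] = {"exact": exact_match}

def grade(grader_type: str, pred: str, gold: str) -> Tuple[bool, Optional[str]]:
    if grader_type not in GRADER_BACKENDS:
        raise ValueError(f"Unknown grader type: {grader_type}")
    return GRADER_BACKENDS[grader_type](pred, gold)
```
Returning the extracted answer alongside correctness is what lets the progress table display a short answer instead of the full response.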
### Datasets
- `AimeDataset`: 90 AIME 2025 questions
- `Gsm8kDataset`: 7473 math word problems
- `GpqaDataset`: 198 GPQA Diamond questions with shuffling
## Configuration
### Sampling Parameters (Optional)
- `--temperature`: Sampling temperature
- `--top-k`: Top K sampling
- `--top-p`: Top P sampling
- `--min-p`: Min P sampling
- Only passed if explicitly specified
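The "only passed if explicitly specified" behavior amounts to filtering out unset (`None`) parameters before building the request payload; a sketch, with parameter names following the flags above:
```python
from typing import Any, Dict, Optional

def build_sampling_params(
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    min_p: Optional[float] = None,
) -> Dict[str, Any]:
    """Return only the sampling parameters that were explicitly set."""
    candidates = {
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "min_p": min_p,
    }
    return {k: v for k, v in candidates.items() if v is not None}
```
Parameters left unset this way fall back to the server's own defaults.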
### Grading Types
- **regex**: Built-in patterns for each dataset
- **cli**: External script with `--answer` and `--expected` args
- **llm**: LLM-based extraction with configurable server/model
## Output Format
### Progress Table
```
Task ID Dataset Prompt (first 43 chars) Expected Status
aime_000_001 AIME Complete the following reactions and sel... A pending
```
### Results Summary
```
============================================================
Results: 8/10 correct (80.0%)
============================================================
```
### JSON Output
Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
## Technical Details
- Default max tokens: -1 (infinite)
- Default grader type: llm
- Default seed: 1234
- Default threads: 32
- Prompt truncation: First 43 chars + padding + "..."
- GPQA requires LLM grader (returns letter A/B/C/D)
- Judge model defaults to evaluated model if not specified
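The prompt-truncation rule above can be sketched as follows (first line only; long lines are cut and suffixed with `...`, short lines are padded to the same width first):
```python
def truncate_prompt(prompt: str, width: int = 43) -> str:
    """Truncate a prompt for display: first line, fixed width, '...' suffix."""
    first_line = prompt.split("\n")[0]
    if len(first_line) > width:
        return first_line[:width] + "..."
    return first_line.ljust(width) + "..."
```
Every displayed prompt therefore has the same length, which keeps the progress table columns aligned.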


@ -0,0 +1,105 @@
# llama-eval Evaluation Tool
Simple evaluation tool for llama.cpp with support for multiple datasets.
## Features
- **Multiple Datasets**: AIME, GSM8K, GPQA
- **Flexible Grading**: Regex, CLI, or LLM-based grading
- **Parallel Processing**: Configurable thread count
- **Real-time Feedback**: Progress tracking with detailed output
- **Sampling Parameters**: Temperature, Top K, Top P, Min P
- **JSON Output**: Complete eval state saved for debugging
## Usage
```bash
python llama-eval-new.py \
--server http://127.0.0.1:8013 \
--model gpt-oss-20b-hf-low \
--judge-model gpt-oss-20b-hf-medium \
--dataset aime \
--n_cases 10 \
--grader-type llm \
--seed 42
```
## CLI Arguments
- `--server`: llama-server URL (default: http://127.0.0.1:8013)
- `--model`: Model name for evaluation (default: llama)
- `--judge-model`: Model name for LLM judge (default: same as main model)
- `--judge-server`: Server URL for LLM judge (default: same as main server)
- `--dataset`: Dataset type (aime, gsm8k, gpqa)
- `--n_cases`: Number of cases to evaluate (default: all)
- `--n_predict`: Max tokens to predict per prompt (default: -1, infinite)
- `--temperature`: Sampling temperature (default: not passed)
- `--top-k`: Top K sampling (default: not passed)
- `--top-p`: Top P sampling (default: not passed)
- `--min-p`: Min P sampling (default: not passed)
- `--threads`: Number of threads for parallel requests (default: 32)
- `--verbose`: Show detailed output for each case
- `--output`: Output file for eval state (default: llama-eval-state.json)
- `--grader-type`: Grader type (regex, cli, llm, default: llm)
- `--grader-script`: Path to CLI grader script (required for --grader-type cli)
- `--seed`: Random seed for shuffling (default: 1234)
## Datasets
### AIME
- 90 questions from 2025 AIME competition
- Answers in boxed format: `\boxed{answer}`
- Requires regex grader or LLM grader
### GSM8K
- 7473 math word problems
- Answers are numeric values
- Requires regex grader or LLM grader
### GPQA
- 198 questions from GPQA Diamond dataset
- Multiple choice with shuffled options
- Requires LLM grader (returns letter A, B, C, or D)
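Reproducible option shuffling, as implied by the `--seed` flag, can be sketched like this (the actual script seeds one RNG for the whole dataset; seeding per call, as here, is a simplification):
```python
import random
from typing import List, Tuple

def shuffle_options(correct: str, incorrect: List[str], seed: int) -> Tuple[List[str], str]:
    """Shuffle the four answer options deterministically; report the correct letter."""
    rng = random.Random(seed)
    options = incorrect + [correct]
    rng.shuffle(options)
    letter = chr(ord("A") + options.index(correct))
    return options, letter
```
The same seed always yields the same option order, so runs are comparable across models.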
## Grading Types
### Regex Grader
Built-in patterns for different datasets:
- AIME: `\boxed{(\d+)}|\b(\d+)\b`
- GSM8K: `\b(\d+)\b`
- GPQA: Letter extraction (A, B, C, D)
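In Python's `re` module the backslash and braces of the AIME pattern need escaping; a sketch of extraction that returns the first non-empty capture group:
```python
import re
from typing import Optional

# The AIME pattern above, escaped for Python's re module:
# literal \boxed{digits}, or a bare whole number as fallback.
AIME_PATTERN = r"\\boxed\{(\d+)\}|\b(\d+)\b"

def extract_answer(text: str, pattern: str = AIME_PATTERN) -> Optional[str]:
    """Return the first non-empty capture group of the first match, if any."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return next((g for g in m.groups() if g), m.group(0))
```
The boxed alternative is tried first because it appears earlier in the string than the digits it wraps.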
### CLI Grader
External script interface:
```bash
./grader.sh --answer <pred> --expected <gold>
```
Returns exit code 0 if correct, non-zero if incorrect.
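From the caller's side this contract is a subprocess invocation checked by exit code; a sketch, including the 30-second timeout guard the tool uses for CLI grading:
```python
import subprocess

def grade_cli(script: str, pred: str, gold: str, timeout_s: float = 30.0) -> bool:
    """Invoke an external grader script; exit code 0 means the answer is correct."""
    try:
        result = subprocess.run(
            [script, "--answer", pred, "--expected", gold],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hung grader counts as incorrect rather than blocking the eval.
        return False
    return result.returncode == 0
```
Any executable honoring the `--answer`/`--expected` convention can be plugged in via `--grader-script`.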
### LLM Grader
Uses LLM to extract and compare answers:
- Configurable server and model
- Includes problem context in prompt
- Case-insensitive comparison
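The judge call is an ordinary `/v1/chat/completions` request; a sketch of building the payload, mirroring the extraction prompt the script uses:
```python
from typing import Any, Dict

def build_judge_request(pred: str, judge_model: str) -> Dict[str, Any]:
    """Build the chat-completion payload asking the judge to extract an answer."""
    prompt = (
        "Extract the answer from this response:\n"
        "===\n"
        f"Response: {pred}\n"
        "===\n"
        "Please provide only the extracted answer, nothing else. "
        "If there is no clear answer in the response, reply with 'no answer'."
    )
    return {
        "model": judge_model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic extraction
        "max_tokens": 256,
    }
```
The payload is POSTed to `{judge_server}/v1/chat/completions` and the reply is compared case-insensitively against the gold answer.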
## Output
### Progress Table
```
Task ID Dataset Prompt (first 43 chars) Expected Status
aime_000_001 AIME Complete the following reactions and sel... A pending
```
### Results
```
============================================================
Results: 8/10 correct (80.0%)
============================================================
```
### JSON Output
Complete eval state saved to output file with:
- Task IDs and correctness status
- Prompts and extracted answers
- Sampling configuration
- Processing metadata
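A minimal sketch of serializing such a state object (field names follow the implementation summary; this is not the exact schema):
```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

@dataclass
class EvalState:
    id: str
    tasks: List[str]
    task_states: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    sampling_config: Dict[str, Any] = field(default_factory=dict)

def save_state(state: EvalState, path: str) -> None:
    """Dump the complete eval state as pretty-printed JSON."""
    with open(path, "w") as f:
        json.dump(asdict(state), f, indent=2)
```
Dumping the state periodically while processing is what makes debugging and post-hoc inspection of individual tasks possible.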


@ -1,395 +0,0 @@
# llama-eval Implementation Discussion
## Overview
Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.
## Key Requirements from ggerganov
### 1. Simplify and Focus on One Eval
- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially
### 2. Implement an "eval state" object
- ID
- List of tasks
- Task states
- Sampling config
### 3. Implement a "processor" object
- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)
### 4. Processor responsibilities
- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses
### 5. Real-time feedback
- Default: show "correct / not correct" for each task
- Verbose mode: show produced answer vs expected answer as soon as it completes
### 6. Grading approach
- Abstract grading to support external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals)
### 7. Output format
- Use structured output (JSON) instead of boxed text
## Current Implementation Analysis
### What exists in llama-eval.py:
- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting
### What needs to be removed:
- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)
## Discussion Points
### 1. Eval State Object Structure
**Status: Under Discussion**
Questions:
- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?
### 2. Processor Architecture
**Status: Not Started**
Questions:
- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?
### 3. Grader Interface
**Status: Not Started**
Questions:
- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?
### 4. Checkpointing
**Status: Not Started**
Questions:
- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?
### 5. Real-time Output
**Status: Not Started**
Questions:
- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?
### 6. Output Format
**Status: Not Started**
Questions:
- Should responses be in JSON format?
- How should the grader interface work with JSON output?
## Next Steps
1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format
## References
- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195
## Session Work Summary
### llama-server-simulator Implementation
**Created:**
- `llama-server-simulator.py` - Standalone Python script simulating llama-server HTTP endpoint
- `test-simulator.sh` - Test script for verifying simulator functionality
- `llama-server-simulator-plan.md` - Implementation plan
- `simulator-summary.md` - Summary of implementation
**Features Implemented:**
1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with OpenAI-compatible format
2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
5. Debug Logging - Helps troubleshoot matching issues
**Testing Results:**
- ✅ Correct answers returned when success rate allows
- ✅ Wrong answers returned when success rate doesn't allow
- ✅ Requests with no matching question return an error
- ✅ Success rate verified (80% in 10 requests)
- ✅ HuggingFace dataset caching working correctly
**Key Technical Decisions:**
- Used Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via HuggingFace datasets library
- Wrong answers generated by incrementing expected answer
- Debug output written to stderr for better visibility
**Refactoring:**
- Extracted repeating question string into TEST_QUESTION variable
- Created make_request() helper function to reduce code duplication
- Added proper error handling for error responses
- Fixed simulator stopping issue at script completion
### llama-eval-new.py Implementation
**Created:**
- `llama-eval-new.py` - Simplified evaluation tool focused on AIME
**Features Implemented:**
1. **Eval State Object** - Structured dataclass with ID, tasks, task states, and sampling config
2. **Processor Object** - Handles processing, grading, and state management
3. **Real-time Feedback** - Shows correct/incorrect status for each case
4. **Flexible Grading System** - Supports regex, CLI, and LLM-based grading
5. **Structured JSON Output** - Saves complete eval state to JSON file
6. **HuggingFace Dataset Caching** - Uses cached dataset path to avoid HF Hub requests
7. **Enhanced Answer Extraction** - Extracts answers from full responses for display
**Grading System:**
- **Regex Grading**: Built-in patterns for different task types
- `aime`: `\boxed{(\d+)}|\b(\d+)\b` (handles boxed and plain text)
- `gsm8k`: `\b(\d+)\b` (extract first number)
- `mmlu`, `hellaswag`, `arc`, `winogrande`: `[A-D]` (extract single letter)
- **CLI Grading**: External script interface
- Script accepts `--answer <pred>` and `--expected <gold>`
- Returns exit code 0 if correct, non-zero if incorrect
- 30-second timeout to prevent hanging
- **LLM Judge**: Generic answer extraction using LLM
- Uses configured server and model for extraction
- Includes problem statement in prompt for context
- Case-insensitive comparison
- Returns extracted answer for display
**Configuration Options:**
- `--server`: llama-server URL (default: http://localhost:8033)
- `--n_cases`: Number of cases to evaluate (default: all)
- `--n_predict`: Max tokens to predict per prompt (default: 2048)
- `--threads`: Number of threads for parallel requests (default: 32)
- `--verbose`: Show detailed output for each case
- `--output`: Output file for eval state (default: llama-eval-state.json)
- `--grader-type`: `regex`, `cli`, or `llm`
- `--grader-regex-type`: aime, gsm8k, mmlu, hellaswag, arc, winogrande
- `--grader-script`: Path to CLI grader script
- `--judge-server`: Server URL for LLM judge (default: same as main server)
- `--judge-model`: Model name for LLM judge (default: same as main model)
**Testing Results:**
- ✅ Works with simulator at 100% success rate (all correct)
- ✅ Works with simulator at 0% success rate (all incorrect)
- ✅ Works with simulator at 80% success rate (8/10 correct)
- ✅ Real-time verbose output shows gold/pred/status for each case
- ✅ JSON output contains complete eval state with all cases
- ✅ HF Hub telemetry disabled (no warnings)
- ✅ Uses cached dataset path to avoid HF Hub requests when available
- ✅ Regex grader extracts answers correctly from various formats
- ✅ LLM judge can extract answers with problem context
- ✅ Response truncation focuses grading on final answer
- ✅ Case-insensitive matching works for both regex and LLM grader
- ✅ Judge model and server configuration propagate correctly
- ✅ Progress table shows extracted answers instead of full responses
**Key Technical Decisions:**
- Removed Levenshtein matching - eval script only sends requests and validates answers
- Abstract grading interface for external grader support
- Exact match requirement for regex patterns
- Handles both boxed and plain text formats for AIME answers
- 30-second timeout for CLI grader
- Validates script exists before running
- Judge parameters set once during Grader construction
- LLM judge prompt includes problem statement for better extraction
- Response truncation to last 2-3 lines focuses grading on final answer
- Case-insensitive comparison for more flexible matching
**Refactoring:**
- Removed all task implementations except AIME
- Removed regex-based grading (moved to flexible grader system)
- Removed multiple endpoint support
- Removed complex task loading logic
- Removed summary reporting (replaced with real-time feedback)
- Added HuggingFace dataset caching optimization
- Added LLM grader support with configurable server and model
- Added response truncation before grading
- Refactored grader interface to return extracted answers
### llama-eval-new.py Threading and Model Parameter Updates
**Changes Made:**
1. **Threading Support** - Added ThreadPoolExecutor for parallel request processing
- Added `from concurrent.futures import ThreadPoolExecutor, as_completed`
- Created `_process_single_case()` method for thread-safe case processing
- Refactored `process()` to use ThreadPoolExecutor with configurable thread count
- Updated progress tracking to work with concurrent execution
- Thread-safe eval state updates (task_states and counters)
2. **Model Parameter** - Added `--model` argument to specify model name in request data
- Added `model_name` parameter to Processor.__init__()
- Updated `_make_request()` to use provided model name or default to "llama"
- Added `--model` argument to argument parser
- Model name is included in request JSON as `"model": "gpt-oss-20b-hf"`
**Testing Results:**
- ✅ Works with 2 threads (5 cases processed in ~0.2s)
- ✅ Works with 4 threads (slightly faster throughput)
- ✅ Model parameter correctly added to request data
- ✅ Thread-safe progress tracking with tqdm
- ✅ No race conditions in eval state updates
**Key Technical Decisions:**
- Used ThreadPoolExecutor for simple, effective parallelism
- No rate limiting needed (server can handle concurrent requests)
- Thread-safe counter updates for correct/total tracking
- Progress bar shows completion status across all threads
- Model parameter is optional - defaults to "llama" if not specified
**Refactoring:**
- Extracted single case processing into `_process_single_case()` method
- Changed from sequential loop to ThreadPoolExecutor with futures
- Updated verbose output to show total count instead of index
- Made eval state updates thread-safe
### llama-eval-new.py Enhanced Grading System
**Changes Made:**
1. **Enhanced Grader Interface** - Updated to return extracted answers
- `grade()` method now returns `Tuple[bool, Optional[str]]` (correctness + extracted answer)
- Added `extracted` field to `TaskState` dataclass
- All grader types (regex, cli, llm) now return extracted answers
2. **Improved Regex Grader**
- New `_extract_answer_regex()` method extracts answers using configured patterns
- Supports case-insensitive matching
- Returns first valid match found
- Handles both single values and multiple matches
3. **LLM-Based Judge**
- New `_grade_llm()` method for generic answer extraction
- Includes problem statement in prompt for context
- Configurable server URL (defaults to main server)
- Configurable model name (defaults to main model)
- Case-insensitive comparison
- Returns extracted answer for display
4. **Response Truncation**
- New `_truncate_response()` method keeps only last 2-3 lines
- Applied before grading to focus on final answer section
5. **CLI Grader Update**
- Now also returns extracted answer
- Returns None if grading fails
6. **Display Updates**
- Progress table shows extracted answer instead of full response
- Verbose mode shows full response plus extracted answer
7. **New CLI Arguments**
- `--grader-type`: Added "llm" option
- `--judge-server`: Separate server for LLM judge
- `--judge-model`: Separate model for LLM judge
**Testing Results:**
- ✅ Regex grader extracts answers correctly from various formats
- ✅ LLM judge can extract answers with problem context
- ✅ Response truncation focuses grading on final answer
- ✅ Case-insensitive matching works for both regex and LLM grader
- ✅ Judge model and server configuration propagate correctly
- ✅ Progress table shows extracted answers instead of full responses
**Key Technical Decisions:**
- Judge parameters set once during Grader construction (not on each call)
- LLM judge prompt includes problem statement for better extraction
- Response truncation to last 2-3 lines focuses grading on final answer
- Case-insensitive comparison for more flexible matching
- Judge configuration propagates through Processor to Grader
- Display shows extracted answer for cleaner output
**Refactoring:**
- Removed judge parameters from `grade()` method calls
- Added `judge_server_url` and `judge_model_name` to Grader class
- Updated `_grade_llm()` to use instance variables instead of parameters
- Simplified Processor initialization to pass judge config to grader
- Updated startup info to show judge server and model
### llama-eval-new.py GSM8K Dataset Support
**Changes Made:**
1. **GSM8K Dataset Integration** - Added support for GSM8K dataset alongside AIME
- Created `Gsm8kDataset` class with proper answer extraction logic
- GSM8K uses `"question"` field instead of `"problem"` field
- GSM8K answer field contains full reasoning with `####` prefix
- Extracts numeric answer from answer field during initialization
- Uses same regex grader pattern as AIME (`\b(\d+)\b`)
2. **Dataset Type Configuration** - Added dataset selection support
- Added `--dataset` CLI argument with choices `aime` and `gsm8k`
- Updated `Processor` class to accept `dataset_type` parameter
- Dataset-specific initialization in `Processor.__init__()`
- Dataset name displayed in task summary table
3. **Template Registry** - Added dataset-specific prompt templates
- AIME template: includes `\boxed{}` wrapper for final answer
- GSM8K template: plain text answer without wrapper
- Templates applied based on `question["dataset_type"]` field
4. **Answer Extraction Logic** - Fixed GSM8K answer extraction
- GSM8K has pre-extracted `"gold"` field with numeric answer
- `Gsm8kDataset.get_answer()` checks for `"gold"` field first
- Falls back to answer field if gold field not present
- `AimeDataset.get_answer()` simplified to remove duplicate method
5. **Task ID Format** - Fixed duplicate prefix in task IDs
- Changed from `f"{dataset_type}_{eval_state.id}_{chunk_idx:03d}_{i:03d}"`
- To `f"{dataset_type}_{chunk_idx:03d}_{i:03d}"`
- Removed redundant `eval_state.id` (was "gsm8k" for GSM8K)
6. **Column Width Adjustments** - Improved table formatting
- Task ID column: 25 characters
- Dataset column: 5 characters
- Prompt column: 40 characters
- Expected column: 10 characters
**Testing Results:**
- ✅ GSM8K dataset loads correctly with 7473 questions
- ✅ Numeric answers extracted from full reasoning text
- ✅ Task summary table displays correctly with adjusted column widths
- ✅ Task IDs show correct format (e.g., `gsm8k_000_3169`)
- ✅ Both AIME and GSM8K datasets work with same script
- ✅ Answer extraction works for both boxed and plain text formats
- ✅ Progress tracking shows extracted answers for both datasets
**Key Technical Decisions:**
- GSM8K uses `"question"` field instead of `"problem"` field
- GSM8K answer field contains full reasoning with `####` prefix
- Numeric answer extracted during dataset initialization
- Same regex grader pattern works for both datasets
- Dataset selection via CLI argument for separate runs
- Template registry supports different prompt formats per dataset
- Task ID format simplified to avoid duplication
**Refactoring:**
- Removed duplicate `get_question()` method from `AimeDataset`
- Removed "2025" suffix from eval state ID (was remnant from old version)
- Removed "2025" suffix from task summary table output
- Removed "2025" suffix from progress tracking output
- Updated `Processor.__init__()` to initialize appropriate dataset based on type
- Updated `_process_single_case()` to handle both `"problem"` and `"question"` fields
- Updated `process()` method to display dataset name and use `dataset_type` for task states


@ -5,6 +5,7 @@ import json
import os
import re
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, asdict
@ -34,6 +35,15 @@ Please reason step by step, and put your final answer within \\boxed{{}}.
""",
"gsm8k": """{question}
Please reason step by step, and provide your final answer.
""",
"gpqa": """{Question}
(A) {A}
(B) {B}
(C) {C}
(D) {D}
Express your final answer as the corresponding option 'A', 'B', 'C', or 'D'.
""",
}
@ -96,6 +106,15 @@ class AimeDataset:
return str(normalized) if normalized is not None else answer
return str(answer)
def get_prompt(self, question: Dict) -> str:
"""Get formatted prompt for the question"""
if question["dataset_type"] == "gpqa":
return TEMPLATE_REGISTRY["gpqa"].format(**question)
else:
return TEMPLATE_REGISTRY[question["dataset_type"]].format(
question=question["problem"] if "problem" in question else question["question"]
)
class Gsm8kDataset:
def __init__(self, split: str = "train"):
self.split = split
@ -146,17 +165,87 @@ class Gsm8kDataset:
return str(normalized) if normalized is not None else answer
return str(answer)
def get_prompt(self, question: Dict) -> str:
"""Get formatted prompt for the question"""
return TEMPLATE_REGISTRY[question["dataset_type"]].format(
question=question["problem"] if "problem" in question else question["question"]
)
class GpqaDataset:
def __init__(self, variant: str = "diamond", seed: int = 1234):
self.variant = variant
self.seed = seed
self.questions: List[Dict] = []
self._load_dataset()
def _load_dataset(self):
print(f"Loading GPQA dataset (variant: {self.variant})...")
import pandas as pd
url = f"https://openaipublic.blob.core.windows.net/simple-evals/gpqa_{self.variant}.csv"
df = pd.read_csv(url)
rng = random.Random(self.seed)
self.questions = []
for _, row in df.iterrows():
question = row.to_dict()
question["dataset_type"] = "gpqa"
# Shuffle the answer options
correct_answer = question["Correct Answer"]
incorrect_answers = [
question["Incorrect Answer 1"],
question["Incorrect Answer 2"],
question["Incorrect Answer 3"]
]
# Create list of (answer, is_correct) tuples
options = [(ans, ans == correct_answer) for ans in incorrect_answers]
options.append((correct_answer, True))
# Shuffle the options
rng.shuffle(options)
# Extract shuffled answers and determine correct letter
shuffled_answers = [ans for ans, _ in options]
correct_letter = chr(ord('A') + options.index((correct_answer, True)))
# Store shuffled answers and correct letter
question["shuffled_answers"] = shuffled_answers
question["correct_letter"] = correct_letter
self.questions.append(question)
print(f"GPQA dataset loaded: {len(self.questions)} questions")
def get_question(self, index: int) -> Dict:
"""Get question by index"""
return self.questions[index]
def get_answer(self, question: Dict) -> str:
# GPQA returns the correct letter (A, B, C, or D)
return question["correct_letter"]
def get_prompt(self, question: Dict) -> str:
"""Get formatted prompt for the question"""
return TEMPLATE_REGISTRY["gpqa"].format(
Question=question["Question"],
A=question["shuffled_answers"][0],
B=question["shuffled_answers"][1],
C=question["shuffled_answers"][2],
D=question["shuffled_answers"][3]
)
class Grader:
def __init__(
self,
grader_type: str = "regex",
grader_regex_type: str = "aime",
grader_type: str = "llm",
grader_script: Optional[str] = None,
judge_model_name: Optional[str] = None,
judge_server_url: str = ""
):
self.grader_type = grader_type
self.grader_regex_type = grader_regex_type
self.grader_script = grader_script
self.judge_model_name = judge_model_name
self.judge_server_url = judge_server_url
@ -164,9 +253,7 @@ class Grader:
def _get_pattern(self) -> Optional[str]:
if self.grader_type == "regex":
if self.grader_regex_type not in GRADER_PATTERNS:
raise ValueError(f"Unknown grader regex type: {self.grader_regex_type}")
return GRADER_PATTERNS[self.grader_regex_type]
return GRADER_PATTERNS.get("aime") # Default to aime pattern
return None
def _extract_answer_regex(self, pred: str) -> Optional[str]:
@ -221,18 +308,21 @@ class Grader:
"""Grade using LLM-based extraction"""
prompt = f"""Extract the answer from this response:
Response: {pred}
Expected answer: {gold}
Please provide only the extracted answer, nothing else."""
===
Response: {pred}
===
Please provide only the extracted answer, nothing else. If there is no clear answer in the response, reply with 'no answer'."""
url = f"{self.judge_server_url}/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
"model": self.judge_model_name,
"messages": [{"role": "user", "content": prompt}],
"temperature": 0,
"max_tokens": 256
}
try:
@ -264,14 +354,16 @@ class Processor:
def __init__(
self,
server_url: str,
n_predict: int = 2048,
n_predict: int = -1,
threads: int = 32,
verbose: bool = False,
grader: Optional[Grader] = None,
model_name: Optional[str] = None,
judge_server_url: str = "",
judge_model_name: Optional[str] = None,
dataset_type: str = "aime"
dataset_type: str = "aime",
seed: int = 1234,
sampling_config: Optional[Dict[str, Any]] = None
):
self.server_url = server_url
self.n_predict = n_predict
@ -281,12 +373,14 @@ class Processor:
self.judge_server_url = judge_server_url if judge_server_url else server_url
self.judge_model_name = judge_model_name
self.dataset_type = dataset_type
self.seed = seed
self.grader = grader or Grader()
self.sampling_config = sampling_config or {"n_predict": n_predict}
self.eval_state = EvalState(
id=dataset_type,
tasks=[dataset_type],
task_states={},
sampling_config={"temperature": 0, "max_tokens": n_predict}
sampling_config=self.sampling_config
)
# Pass judge configuration to grader if using LLM grader
@ -301,6 +395,8 @@ class Processor:
self.dataset = AimeDataset()
elif dataset_type == "gsm8k":
self.dataset = Gsm8kDataset()
elif dataset_type == "gpqa":
self.dataset = GpqaDataset(variant="diamond", seed=self.seed)
else:
raise ValueError(f"Unknown dataset type: {dataset_type}")
@ -311,9 +407,16 @@ class Processor:
data = {
"model": self.model_name if self.model_name else "llama",
"messages": [{"role": "user", "content": prompt}],
"temperature": 0,
"max_tokens": self.n_predict
"n_predict": self.n_predict
}
if self.sampling_config.get("temperature") is not None:
data["temperature"] = self.sampling_config["temperature"]
if self.sampling_config.get("top_k") is not None:
data["top_k"] = self.sampling_config["top_k"]
if self.sampling_config.get("top_p") is not None:
data["top_p"] = self.sampling_config["top_p"]
if self.sampling_config.get("min_p") is not None:
data["min_p"] = self.sampling_config["min_p"]
response = requests.post(url, headers=headers, json=data)
response.raise_for_status()
@ -322,14 +425,9 @@ class Processor:
def _process_single_case(self, i: int, task_id: str) -> TaskState:
"""Process a single case (thread-safe)"""
question = self.dataset.get_question(i)
dataset_id = f"{self.dataset_type}_{self.dataset.split}_{i}"
dataset_id = f"{self.dataset_type}_{i}"
gold = self.dataset.get_answer(question)
# Apply template if available
if question["dataset_type"] in TEMPLATE_REGISTRY:
prompt = TEMPLATE_REGISTRY[question["dataset_type"]].format(question=question["problem"] if "problem" in question else question["question"])
else:
prompt = question["problem"] if "problem" in question else question["question"]
prompt = self.dataset.get_prompt(question)
task_state = TaskState(
case_id=task_id,
@ -361,12 +459,15 @@ class Processor:
n_cases = len(self.dataset.questions)
print(f"\nProcessing {n_cases} {self.dataset_type.upper()} questions...")
print(f"Server: {self.server_url}")
print(f"Server: {self.server_url} (model: {self.model_name})")
print(f"Threads: {self.threads}")
print(f"Max tokens: {self.n_predict}")
print(f"Seed: {self.seed}")
print(f"Sampling: temp={self.sampling_config.get('temperature', 'skip')}, top-k={self.sampling_config.get('top_k', 'skip')}, top-p={self.sampling_config.get('top_p', 'skip')}, min-p={self.sampling_config.get('min_p', 'skip')}")
print(f"Grader: {self.grader.grader_type}", end="")
if self.grader.grader_type == "llm":
print(f" (judge server: {self.judge_server_url}, model: {self.judge_model_name})", end="")
judge_model = self.judge_model_name if self.judge_model_name else self.model_name
print(f" (judge server: {self.judge_server_url}, model: {judge_model})", end="")
print()
print()
@ -389,9 +490,14 @@ class Processor:
print(" Task ID Dataset Prompt (first 40 chars) Expected Status")
for i, task_id in task_list:
question = self.dataset.get_question(i)
prompt = question["problem"] if "problem" in question else question["question"]
prompt = self.dataset.get_prompt(question)
gold = self.dataset.get_answer(question)
truncated_prompt = prompt[:40] + "..." if len(prompt) > 40 else prompt
first_line = prompt.split('\n')[0]
truncated_prompt = first_line[:43]
if len(first_line) > 43:
truncated_prompt += "..."
else:
truncated_prompt = truncated_prompt.ljust(43) + "..."
print(f" {task_id:<20} {self.dataset_type.upper()} {truncated_prompt:<40} {gold:<10} pending")
print()
@ -413,7 +519,13 @@ class Processor:
# Print task completion status
extracted_display = task_state.extracted if task_state.extracted else "N/A"
success_ratio = correct / total if total > 0 else 0.0
print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {task_state.prompt[:40]:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]")
first_line = task_state.prompt.split('\n')[0]
truncated_prompt = first_line[:43]
if len(first_line) > 43:
truncated_prompt += "..."
else:
truncated_prompt = truncated_prompt.ljust(43) + "..."
print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {truncated_prompt:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]")
if self.verbose:
print(f"\nCase {total}: {task_state.correct}")
@@ -456,7 +568,7 @@ def main():
"--dataset",
type=str,
default="aime",
choices=["aime", "gsm8k", "gpqa"],
help="Dataset type (default: aime)"
)
parser.add_argument(
@@ -474,8 +586,32 @@ def main():
parser.add_argument(
"--n_predict",
type=int,
default=-1,
help="Max tokens to predict per prompt (default: -1, infinite)"
)
parser.add_argument(
"--temperature",
type=float,
default=None,
help="Sampling temperature (default: not passed)"
)
parser.add_argument(
"--top-k",
type=int,
default=None,
help="Top K sampling (default: not passed)"
)
parser.add_argument(
"--top-p",
type=float,
default=None,
help="Top P sampling (default: not passed)"
)
parser.add_argument(
"--min-p",
type=float,
default=None,
help="Min P sampling (default: not passed)"
)
parser.add_argument(
"--threads",
@@ -503,16 +639,9 @@ def main():
parser.add_argument(
"--grader-type",
type=str,
default="llm",
choices=["regex", "cli", "llm"],
help="Grader type: regex, cli, or llm (default: llm)"
)
parser.add_argument(
"--grader-script",
@@ -529,21 +658,37 @@ def main():
parser.add_argument(
"--judge-model",
type=str,
default="",
help="Model name for LLM judge (default: same as main model)"
)
args = parser.parse_args()
# Validate grader type for GPQA
if args.dataset == "gpqa" and args.grader_type != "llm":
print("Error: GPQA dataset requires --grader-type llm")
parser.print_help()
sys.exit(1)
grader = Grader(
grader_type=args.grader_type,
grader_script=args.grader_script,
judge_model_name=args.judge_model if args.judge_model else args.model
)
if args.grader_type == "llm" and not args.judge_server:
print("Warning: Using same server for LLM judge (no --judge-server specified)")
sampling_config = {"n_predict": args.n_predict}
if args.temperature is not None:
sampling_config["temperature"] = args.temperature
if args.top_k is not None:
sampling_config["top_k"] = args.top_k
if args.top_p is not None:
sampling_config["top_p"] = args.top_p
if args.min_p is not None:
sampling_config["min_p"] = args.min_p
processor = Processor(
server_url=args.server,
n_predict=args.n_predict,
@@ -553,7 +698,8 @@ def main():
model_name=args.model,
judge_server_url=args.judge_server,
judge_model_name=args.judge_model,
dataset_type=args.dataset,
sampling_config=sampling_config
)
eval_state = processor.process(n_cases=args.n_cases, seed=args.seed)


@@ -0,0 +1,29 @@
{
"id": "gpqa",
"tasks": [
"gpqa"
],
"task_states": {
"gpqa": {
"total": 1,
"correct": 0,
"cases": {
"gpqa": [
{
"case_id": "gpqa_000_184",
"prompt": "Consider a system with Hamiltonian operator $H = \\varepsilon \\vec{\\sigma}.\\vec{n}$. Here, $\\vec{n}$ is an arbitrary unit vector, $\\varepsilon $ is a constant of dimension energy, and components of $\\vec{\\sigma}$ are the Pauli spin matrices. What are the eigenvalues of the Hamiltonian operator?\n\n\n(A) +\\hbar/2, -\\hbar/2\n(B) +1, -1\n(C) +\\varepsilon \\hbar/2, - \\varepsilon \\hbar/2\n(D) + \\varepsilon, -\\varepsilon\n\n\nExpress your final answer as the corresponding option 'A', 'B', 'C', or 'D'.\n",
"gold": "+ \\varepsilon, -\\varepsilon\n",
"pred": null,
"extracted": null,
"correct": false,
"status": "error: HTTPConnectionPool(host='localhost', port=8034): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError(\"HTTPConnection(host='localhost', port=8034): Failed to establish a new connection: [Errno 61] Connection refused\"))"
}
]
}
}
},
"sampling_config": {
"temperature": 0,
"max_tokens": 2048
}
}


@@ -0,0 +1,36 @@
# llama-server-simulator
A standalone Python script that simulates the llama-server HTTP endpoint for testing the eval script.
## Features
- HTTP Server with OpenAI-compatible `/v1/chat/completions` endpoint
- AIME Dataset Integration - Loads 90 questions from HuggingFace
- Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
- Configurable Success Rate - Control correct/wrong answer generation (0-1)
- Debug Logging - Troubleshoot matching issues
## Usage
```bash
python llama-server-simulator.py --success-rate 0.8
```
## Arguments
- `--success-rate`: Probability of returning correct answer (0.0-1.0, default: 0.8)
- `--port`: Server port (default: 8033)
- `--debug`: Enable debug logging (default: False)
## Testing
```bash
./test-simulator.sh
```
## Implementation Details
- Uses Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via HuggingFace datasets library
- Wrong answers generated by incrementing expected answer
- Debug output written to stderr
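The response-generation behavior described above might look roughly like this (a minimal sketch; `simulate_answer` is a hypothetical name and the script's actual logic may differ):

```python
import random

def simulate_answer(expected: str, success_rate: float, rng=random) -> str:
    """Return the gold answer with probability success_rate; otherwise return a
    deliberately wrong answer produced by incrementing the expected number."""
    if rng.random() < success_rate:
        return expected
    return str(int(expected) + 1)  # e.g. "116" -> "117"
```

With `--success-rate 0.0` every reply is wrong by exactly one, which makes grading failures easy to spot.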


@@ -1,189 +0,0 @@
# llama-server-simulator Implementation Plan
## Overview
Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
## Goals
1. Simulate llama-server's `/v1/chat/completions` endpoint
2. Accept requests and respond with expected answers from AIME dataset
3. Implement configurable success rate (sometimes right, sometimes wrong)
4. Use regex matching to find questions in incoming requests
5. Test with curl requests before integrating with eval script
## Implementation Plan
### Phase 1: Basic Simulator Structure
- Create `llama-server-simulator.py` script
- Set up Flask/FastAPI HTTP server
- Implement `/v1/chat/completions` endpoint
- Handle basic request/response format
### Phase 2: AIME Dataset Integration
- Load AIME dataset
- Store questions and expected answers
- Implement regex matching to find questions in incoming requests
- Extract expected answer from matched question
### Phase 3: Response Generation
- Implement success rate configuration
- Randomly determine if response should be correct or incorrect
- Generate appropriate response based on success determination
- Format response in OpenAI-compatible format
### Phase 4: Testing
- Write curl commands to test basic functionality
- Test correct responses
- Test incorrect responses
- Test edge cases (no question found, etc.)
## Technical Details
### Server Framework
- Use Flask for simplicity
- Listen on configurable port
- Support JSON request/response format
### Request Format
```json
{
"model": "llama",
"messages": [
{"role": "user", "content": "Question text here"}
],
"temperature": 0,
"max_tokens": 2048
}
```
### Response Format
```json
{
"id": "chatcmpl-xxx",
"object": "chat.completion",
"created": 1234567890,
"model": "llama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Answer text here"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 50,
"total_tokens": 150
}
}
```
### AIME Dataset Integration
- Load from HuggingFace: "AI-MO/aimo-validation-aime"
- Store in memory for fast lookup
- Regex pattern to find question text in request
- Extract answer from matched question
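The in-memory lookup could be as simple as the sketch below (the field names are assumed from the HuggingFace rows; the helper name is hypothetical):

```python
def build_lookup(rows):
    """Map each question's text to its expected answer for O(1) retrieval."""
    return {row["problem"]: str(row["answer"]) for row in rows}

# rows shaped like entries of the AIME split
example_rows = [{"problem": "What is 2 + 2?", "answer": 4}]
answers = build_lookup(example_rows)
```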
### Success Rate Configuration
- Command-line argument: `--success-rate 0.8` (80% success rate)
- Randomly determine correctness based on rate
- Log when responses are correct vs incorrect
### Testing Strategy
1. Start simulator with default settings
2. Send curl request with known question
3. Verify response contains expected answer
4. Test with different success rates
5. Test edge cases
## Implementation Steps
### Step 1: Basic Server Setup
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    payload = request.get_json()  # parse the incoming OpenAI-style request
    response = {}                 # built from the matched question (later steps)
    return jsonify(response)
```
### Step 2: Load AIME Dataset
```python
import datasets
ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train")
# Store in memory
```
### Step 3: Regex Matching
```python
import re
def find_question_in_request(request_text):
# Regex pattern to find question
pattern = r"question:\s*(.*?)\n"
match = re.search(pattern, request_text, re.DOTALL)
return match.group(1) if match else None
```
### Step 4: Response Generation
```python
import random
def generate_response(question, success_rate):
if random.random() < success_rate:
return get_expected_answer(question)
else:
return get_wrong_answer(question)
```
### Step 5: Testing with Curl
```bash
curl -X POST http://localhost:8033/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [{"role": "user", "content": "Question text"}]
}'
```
## Configuration Options
- `--port`: Server port (default: 8033)
- `--success-rate`: Success rate 0-1 (default: 0.8)
- `--host`: Server host (default: localhost)
- `--dataset-split`: AIME split to use (default: train)
## Expected Output
```
=== llama-server-simulator ===
Server running on http://localhost:8033
Success rate: 0.8
AIME dataset loaded: 90 questions
```
## Testing Checklist
- [ ] Server starts successfully
- [ ] Basic request/response works
- [ ] Correct answer returned when success rate allows
- [ ] Wrong answer returned when success rate doesn't allow
- [ ] No question found returns error
- [ ] Multiple requests work correctly
- [ ] Different success rates work as expected
## Next Steps
1. ✓ Implement basic server structure
2. ✓ Load AIME dataset
3. ✓ Implement regex matching
4. ✓ Add response generation with success rate
5. ✓ Test with curl commands
6. ✓ Integrate with eval script once simulator works
7. ✓ Implement eval state object
8. ✓ Implement processor object
9. ✓ Add real-time progress reporting
10. ✓ Add enhanced grading system with LLM judge


@@ -1,138 +0,0 @@
# llama-server-simulator Implementation Summary
## Overview
Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script.
## Features Implemented
### 1. HTTP Server
- Flask-based `/v1/chat/completions` endpoint
- OpenAI-compatible response format
- Configurable port and host
### 2. AIME Dataset Integration
- Loads AIME dataset from HuggingFace
- In-memory storage for fast lookup
- 90 questions loaded from train split
### 3. Intelligent Question Matching
- **Exact matching**: Direct string comparison
- **LaTeX removal**: Removes `$...$` formatting for flexible matching
- **Levenshtein distance**: Calculates similarity between strings
- **Partial matching**: Finds best match even with small differences
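A plausible shape for the LaTeX-removal step (the simulator's actual regex may differ):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop inline $...$ LaTeX, and collapse whitespace so
    near-identical prompts compare equal."""
    text = re.sub(r"\$[^$]*\$", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()
```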
### 4. Response Generation
- Configurable success rate (0-1)
- Returns correct answers when success rate allows
- Returns wrong answers when success rate doesn't allow
- Wrong answers are generated by incrementing the expected answer
### 5. Debug Logging
- Debug messages written to stderr
- Logs request content, matching results, and distances
- Helps troubleshoot matching issues
## Configuration Options
```bash
python3 llama-server-simulator.py \
--port 8034 \
--host localhost \
--success-rate 0.8 \
--dataset-split train
```
## Testing Results
### Test 1: Correct Answer
- **Success rate**: 0.8
- **Expected answer**: 116
- **Result**: ✓ Correct (116)
### Test 2: Wrong Answer
- **Success rate**: 0.0
- **Expected answer**: 116
- **Result**: ✓ Wrong (117)
### Test 3: No Matching Question
- **Request**: "What is the capital of France?"
- **Result**: ✓ Returns error "No matching question found"
### Test 4: Success Rate Verification
- **Success rate**: 0.8
- **Requests**: 10
- **Correct answers**: 8/10 (80%)
- **Result**: ✓ Success rate working as expected
## Technical Details
### Matching Algorithm
1. Try exact match (case-insensitive)
2. Try match after removing LaTeX formatting
3. Calculate Levenshtein distance for partial matches
4. Return best match if distance < 0.3 (30% difference)
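The four steps reduce to an edit-distance helper plus a threshold check; a minimal sketch (the script's actual implementation may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def best_match(query, questions):
    """Return the closest question, or None if the best normalized distance >= 0.3."""
    dist, match = min((levenshtein(query, q) / max(len(query), len(q), 1), q)
                      for q in questions)
    return match if dist < 0.3 else None
```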
### Response Format
```json
{
"id": "chatcmpl-1769864875",
"object": "chat.completion",
"created": 1769864875,
"model": "llama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "116"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 50,
"total_tokens": 150
}
}
```
## Files Created
1. `llama-server-simulator.py` - Main simulator script
2. `test-simulator.sh` - Basic test script
3. `test-simulator-comprehensive.sh` - Comprehensive test script
4. `llama-server-simulator-plan.md` - Implementation plan
5. `llama-eval-discussion.md` - Discussion notes
## Next Steps
1. ✓ Basic simulator structure
2. ✓ AIME dataset integration
3. ✓ Question matching with Levenshtein distance
4. ✓ Response generation with configurable success rate
5. ✓ Testing with curl requests
6. ✓ Integrate with eval script
7. ✓ Implement eval state object
8. ✓ Implement processor object
9. ✓ Add real-time progress reporting
10. ✓ Add enhanced grading system with LLM judge
## Known Limitations
1. Only supports AIME dataset (train split)
2. Matching is case-insensitive
3. Wrong answers are simple increments (not realistic)
4. No support for multiple endpoints
5. No distributed evaluation
## Future Enhancements
1. Support multiple datasets
2. More sophisticated wrong answer generation
3. Multiple endpoint support
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization
7. Enhanced grading with LLM judge
8. Response truncation for better answer extraction