diff --git a/examples/llama-eval/IMPLEMENTATION.md b/examples/llama-eval/IMPLEMENTATION.md
new file mode 100644
index 0000000000..c9542f005d
--- /dev/null
+++ b/examples/llama-eval/IMPLEMENTATION.md
@@ -0,0 +1,85 @@
+# llama-eval Implementation Summary
+
+## Overview
+
+Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).
+
+## Key Features
+
+- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
+- **Flexible Grading**: Regex, CLI, or LLM-based grading
+- **Parallel Processing**: Configurable thread count for concurrent requests
+- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
+- **Real-time Feedback**: Progress tracking with detailed output
+- **JSON Output**: Complete eval state saved for debugging
+- **GPQA Support**: Answer shuffling with reproducible results
+
+## Architecture
+
+### Eval State
+```python
+@dataclass
+class EvalState:
+    id: str
+    tasks: List[str]
+    task_states: Dict[str, Dict[str, Any]]
+    sampling_config: Dict[str, Any]
+```
+
+### Processor
+- Handles processing, grading, and state management
+- Thread-safe concurrent execution
+- Configurable sampling parameters
+
+### Grader
+- Abstract grading interface supporting multiple types
+- Regex grader with dataset-specific patterns
+- CLI grader with external script interface
+- LLM grader with configurable server and model
+
+### Datasets
+- `AimeDataset`: 90 AIME 2025 questions
+- `Gsm8kDataset`: 7473 math word problems
+- `GpqaDataset`: 198 GPQA Diamond questions with shuffling
+
+## Configuration
+
+### Sampling Parameters (Optional)
+- `--temperature`: Sampling temperature
+- `--top-k`: Top K sampling
+- `--top-p`: Top P sampling
+- `--min-p`: Min P sampling
+- Only passed if explicitly specified
+
+### Grading Types
+- **regex**: Built-in patterns for each dataset
+- **cli**: External script with `--answer` and `--expected` args
+- **llm**: LLM-based extraction with 
configurable server/model
+
+## Output Format
+
+### Progress Table
+```
+ Task ID               Dataset  Prompt (first 43 chars)                       Expected   Status
+ gpqa_000_001          GPQA     Complete the following reactions and sel...   A          pending
+```
+
+### Results Summary
+```
+============================================================
+Results: 8/10 correct (80.0%)
+============================================================
+```
+
+### JSON Output
+Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
+
+## Technical Details
+
+- Default max tokens: -1 (infinite)
+- Default grader type: llm
+- Default seed: 1234
+- Default threads: 32
+- Prompt truncation: First 43 chars + padding + "..."
+- GPQA requires LLM grader (returns letter A/B/C/D)
+- Judge model defaults to evaluated model if not specified
diff --git a/examples/llama-eval/README.md b/examples/llama-eval/README.md
new file mode 100644
index 0000000000..1c96cc6a1f
--- /dev/null
+++ b/examples/llama-eval/README.md
@@ -0,0 +1,105 @@
+# llama-eval Evaluation Tool
+
+Simple evaluation tool for llama.cpp with support for multiple datasets. 
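The tool drives llama-server through its OpenAI-compatible `/v1/chat/completions` endpoint, and the optional sampling parameters are only forwarded when they are explicitly specified on the command line, so the server's own defaults apply otherwise. A minimal sketch of that payload construction (a hypothetical helper that mirrors the tool's behavior, not the actual implementation):

```python
def build_request_body(prompt, model="llama", n_predict=-1,
                       temperature=None, top_k=None, top_p=None, min_p=None):
    """Build a /v1/chat/completions payload; sampling params are added only when set."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "n_predict": n_predict,  # -1 means no token limit
    }
    # Unspecified sampling parameters are omitted entirely, so the
    # server falls back to its own defaults instead of receiving nulls.
    for key, value in (("temperature", temperature), ("top_k", top_k),
                       ("top_p", top_p), ("min_p", min_p)):
        if value is not None:
            body[key] = value
    return body

# Example: only temperature is set, so top_k/top_p/min_p stay omitted.
payload = build_request_body("What is 2 + 2?", temperature=0.7)
# POSTing it would look like:
#   requests.post("http://127.0.0.1:8013/v1/chat/completions", json=payload)
```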
+
+## Features
+
+- **Multiple Datasets**: AIME, GSM8K, GPQA
+- **Flexible Grading**: Regex, CLI, or LLM-based grading
+- **Parallel Processing**: Configurable thread count
+- **Real-time Feedback**: Progress tracking with detailed output
+- **Sampling Parameters**: Temperature, Top K, Top P, Min P
+- **JSON Output**: Complete eval state saved for debugging
+
+## Usage
+
+```bash
+python llama-eval-new.py \
+    --server http://127.0.0.1:8013 \
+    --model gpt-oss-20b-hf-low \
+    --judge-model gpt-oss-20b-hf-medium \
+    --dataset aime \
+    --n_cases 10 \
+    --grader-type llm \
+    --seed 42
+```
+
+## CLI Arguments
+
+- `--server`: llama-server URL (default: http://127.0.0.1:8013)
+- `--model`: Model name for evaluation (default: llama)
+- `--judge-model`: Model name for LLM judge (default: same as main model)
+- `--judge-server`: Server URL for LLM judge (default: same as main server)
+- `--dataset`: Dataset type (aime, gsm8k, gpqa)
+- `--n_cases`: Number of cases to evaluate (default: all)
+- `--n_predict`: Max tokens to predict per prompt (default: -1, infinite)
+- `--temperature`: Sampling temperature (default: not passed)
+- `--top-k`: Top K sampling (default: not passed)
+- `--top-p`: Top P sampling (default: not passed)
+- `--min-p`: Min P sampling (default: not passed)
+- `--threads`: Number of threads for parallel requests (default: 32)
+- `--verbose`: Show detailed output for each case
+- `--output`: Output file for eval state (default: llama-eval-state.json)
+- `--grader-type`: Grader type (regex, cli, llm, default: llm)
+- `--grader-script`: Path to CLI grader script (required for --grader-type cli)
+- `--seed`: Random seed for shuffling (default: 1234)
+
+## Datasets
+
+### AIME
+- 90 questions from 2025 AIME competition
+- Answers in boxed format: `\boxed{answer}`
+- Requires regex grader or LLM grader
+
+### GSM8K
+- 7473 math word problems
+- Answers are numeric values
+- Requires regex grader or LLM grader
+
+### GPQA
+- 198 questions from GPQA Diamond 
dataset
+- Multiple choice with shuffled options
+- Requires LLM grader (returns letter A, B, C, or D)
+
+## Grading Types
+
+### Regex Grader
+Built-in patterns for different datasets:
+- AIME: `\boxed{(\d+)}|\b(\d+)\b`
+- GSM8K: `\b(\d+)\b`
+- GPQA: Letter extraction (A, B, C, D)
+
+### CLI Grader
+External script interface:
+```bash
+./grader.sh --answer <answer> --expected <expected>
+```
+Returns exit code 0 if correct, non-zero if incorrect.
+
+### LLM Grader
+Uses LLM to extract and compare answers:
+- Configurable server and model
+- Includes problem context in prompt
+- Case-insensitive comparison
+
+## Output
+
+### Progress Table
+```
+ Task ID               Dataset  Prompt (first 43 chars)                       Expected   Status
+ gpqa_000_001          GPQA     Complete the following reactions and sel...   A          pending
+```
+
+### Results
+```
+============================================================
+Results: 8/10 correct (80.0%)
+============================================================
+```
+
+### JSON Output
+Complete eval state saved to output file with:
+- Task IDs and correctness status
+- Prompts and extracted answers
+- Sampling configuration
+- Processing metadata
diff --git a/examples/llama-eval/llama-eval-discussion.md b/examples/llama-eval/llama-eval-discussion.md
deleted file mode 100644
index 1747aa0655..0000000000
--- a/examples/llama-eval/llama-eval-discussion.md
+++ /dev/null
@@ -1,395 +0,0 @@
-# llama-eval Implementation Discussion
-
-## Overview
-Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.
-
-## Key Requirements from ggerganov
-
-### 1. Simplify and Focus on One Eval
-- Start with AIME2025 (most familiar with it)
-- Don't support multiple evals initially
-
-### 2. Implement an "eval state" object
-- ID
-- List of tasks
-- Task states
-- Sampling config
-
-### 3. Implement a "processor" object
-- List of endpoints
-- Threads per endpoint
-- Grade/judge type (regex, endpoint, or CLI tool)
-
-### 4. 
Processor responsibilities -- Accepts eval state -- Starts processing -- Dumps eval state periodically as it progresses - -### 5. Real-time feedback -- Default: show "correct / not correct" for each task -- Verbose mode: show produced answer vs expected answer as soon as it completes - -### 6. Grading approach -- Abstract grading to support external "grader" or "judge" -- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals) - -### 7. Output format -- Use structured output (JSON) instead of boxed text - -## Current Implementation Analysis - -### What exists in llama-eval.py: -- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande) -- Regex-based answer extraction -- HTTP requests to OpenAI-compatible endpoint -- Checkpointing/resume capability -- Thread-based parallel execution -- Summary reporting - -### What needs to be removed: -- All task implementations except AIME -- Regex-based grading -- Multiple endpoint support -- Complex task loading logic -- Summary reporting (replace with real-time feedback) - -## Discussion Points - -### 1. Eval State Object Structure -**Status: Under Discussion** - -Questions: -- What fields should be in the eval state object? -- Should it include the actual prompts, or just metadata? -- How should task states be tracked? - -### 2. Processor Architecture -**Status: Not Started** - -Questions: -- Should the processor handle multiple endpoints (for distributed evaluation)? -- What's the threading model? -- How are endpoints configured? - -### 3. Grader Interface -**Status: Not Started** - -Questions: -- How should the grader be configured? -- Should it be a separate service, or a local LLM call? -- What's the interface for grading? - -### 4. Checkpointing -**Status: Not Started** - -Questions: -- Should the eval state be serialized to disk? -- How often should it be dumped? -- What format should it use? - -### 5. 
Real-time Output -**Status: Not Started** - -Questions: -- How should progress be displayed? -- Console output, file logging, or both? -- What verbosity levels are needed? - -### 6. Output Format -**Status: Not Started** - -Questions: -- Should responses be in JSON format? -- How should the grader interface work with JSON output? - -## Next Steps - -1. **Eval State Object** - Currently discussing -2. Processor Architecture -3. Grader Interface -4. Checkpointing -5. Real-time Output -6. Output Format - -## References -- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892 -- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195 - -## Session Work Summary - -### llama-server-simulator Implementation - -**Created:** -- `llama-server-simulator.py` - Standalone Python script simulating llama-server HTTP endpoint -- `test-simulator.sh` - Test script for verifying simulator functionality -- `llama-server-simulator-plan.md` - Implementation plan -- `simulator-summary.md` - Summary of implementation - -**Features Implemented:** -1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with OpenAI-compatible format -2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching -3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance -4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation -5. 
Debug Logging - Helps troubleshoot matching issues - -**Testing Results:** -- ✅ Correct answers returned when success rate allows -- ✅ Wrong answers returned when success rate doesn't allow -- ✅ No matching questions return errors -- ✅ Success rate verified (80% in 10 requests) -- ✅ HuggingFace dataset caching working correctly - -**Key Technical Decisions:** -- Used Levenshtein distance for partial matching (threshold: 0.3) -- Automatic caching via HuggingFace datasets library -- Wrong answers generated by incrementing expected answer -- Debug output written to stderr for better visibility - -**Refactoring:** -- Extracted repeating question string into TEST_QUESTION variable -- Created make_request() helper function to reduce code duplication -- Added proper error handling for error responses -- Fixed simulator stopping issue at script completion - -### llama-eval-new.py Implementation - -**Created:** -- `llama-eval-new.py` - Simplified evaluation tool focused on AIME - -**Features Implemented:** -1. **Eval State Object** - Structured dataclass with ID, tasks, task states, and sampling config -2. **Processor Object** - Handles processing, grading, and state management -3. **Real-time Feedback** - Shows correct/incorrect status for each case -4. **Flexible Grading System** - Supports regex, CLI, and LLM-based grading -5. **Structured JSON Output** - Saves complete eval state to JSON file -6. **HuggingFace Dataset Caching** - Uses cached dataset path to avoid HF Hub requests -7. 
**Enhanced Answer Extraction** - Extracts answers from full responses for display - -**Grading System:** -- **Regex Grading**: Built-in patterns for different task types - - `aime`: `\boxed{(\d+)}|\b(\d+)\b` (handles boxed and plain text) - - `gsm8k`: `\b(\d+)\b` (extract first number) - - `mmlu`, `hellaswag`, `arc`, `winogrande`: `[A-D]` (extract single letter) -- **CLI Grading**: External script interface - - Script accepts `--answer ` and `--expected ` - - Returns exit code 0 if correct, non-zero if incorrect - - 30-second timeout to prevent hanging -- **LLM Judge**: Generic answer extraction using LLM - - Uses configured server and model for extraction - - Includes problem statement in prompt for context - - Case-insensitive comparison - - Returns extracted answer for display - -**Configuration Options:** -- `--server`: llama-server URL (default: http://localhost:8033) -- `--n_cases`: Number of cases to evaluate (default: all) -- `--n_predict`: Max tokens to predict per prompt (default: 2048) -- `--threads`: Number of threads for parallel requests (default: 32) -- `--verbose`: Show detailed output for each case -- `--output`: Output file for eval state (default: llama-eval-state.json) -- `--grader-type`: `regex`, `cli`, or `llm` -- `--grader-regex-type`: aime, gsm8k, mmlu, hellaswag, arc, winogrande -- `--grader-script`: Path to CLI grader script -- `--judge-server`: Server URL for LLM judge (default: same as main server) -- `--judge-model`: Model name for LLM judge (default: same as main model) - -**Testing Results:** -- ✅ Works with simulator at 100% success rate (all correct) -- ✅ Works with simulator at 0% success rate (all incorrect) -- ✅ Works with simulator at 80% success rate (8/10 correct) -- ✅ Real-time verbose output shows gold/pred/status for each case -- ✅ JSON output contains complete eval state with all cases -- ✅ HF Hub telemetry disabled (no warnings) -- ✅ Uses cached dataset path to avoid HF Hub requests when available -- ✅ Regex grader 
extracts answers correctly from various formats -- ✅ LLM judge can extract answers with problem context -- ✅ Response truncation focuses grading on final answer -- ✅ Case-insensitive matching works for both regex and LLM grader -- ✅ Judge model and server configuration propagate correctly -- ✅ Progress table shows extracted answers instead of full responses - -**Key Technical Decisions:** -- Removed Levenshtein matching - eval script only sends requests and validates answers -- Abstract grading interface for external grader support -- Exact match requirement for regex patterns -- Handles both boxed and plain text formats for AIME answers -- 30-second timeout for CLI grader -- Validates script exists before running -- Judge parameters set once during Grader construction -- LLM judge prompt includes problem statement for better extraction -- Response truncation to last 2-3 lines focuses grading on final answer -- Case-insensitive comparison for more flexible matching - -**Refactoring:** -- Removed all task implementations except AIME -- Removed regex-based grading (moved to flexible grader system) -- Removed multiple endpoint support -- Removed complex task loading logic -- Removed summary reporting (replaced with real-time feedback) -- Added HuggingFace dataset caching optimization -- Added LLM grader support with configurable server and model -- Added response truncation before grading -- Refactored grader interface to return extracted answers - -### llama-eval-new.py Threading and Model Parameter Updates - -**Changes Made:** -1. 
**Threading Support** - Added ThreadPoolExecutor for parallel request processing - - Added `from concurrent.futures import ThreadPoolExecutor, as_completed` - - Created `_process_single_case()` method for thread-safe case processing - - Refactored `process()` to use ThreadPoolExecutor with configurable thread count - - Updated progress tracking to work with concurrent execution - - Thread-safe eval state updates (task_states and counters) - -2. **Model Parameter** - Added `--model` argument to specify model name in request data - - Added `model_name` parameter to Processor.__init__() - - Updated `_make_request()` to use provided model name or default to "llama" - - Added `--model` argument to argument parser - - Model name is included in request JSON as `"model": "gpt-oss-20b-hf"` - -**Testing Results:** -- ✅ Works with 2 threads (5 cases processed in ~0.2s) -- ✅ Works with 4 threads (slightly faster throughput) -- ✅ Model parameter correctly added to request data -- ✅ Thread-safe progress tracking with tqdm -- ✅ No race conditions in eval state updates - -**Key Technical Decisions:** -- Used ThreadPoolExecutor for simple, effective parallelism -- No rate limiting needed (server can handle concurrent requests) -- Thread-safe counter updates for correct/total tracking -- Progress bar shows completion status across all threads -- Model parameter is optional - defaults to "llama" if not specified - -**Refactoring:** -- Extracted single case processing into `_process_single_case()` method -- Changed from sequential loop to ThreadPoolExecutor with futures -- Updated verbose output to show total count instead of index -- Made eval state updates thread-safe - -### llama-eval-new.py Enhanced Grading System - -**Changes Made:** -1. 
**Enhanced Grader Interface** - Updated to return extracted answers - - `grade()` method now returns `Tuple[bool, Optional[str]]` (correctness + extracted answer) - - Added `extracted` field to `TaskState` dataclass - - All grader types (regex, cli, llm) now return extracted answers - -2. **Improved Regex Grader** - - New `_extract_answer_regex()` method extracts answers using configured patterns - - Supports case-insensitive matching - - Returns first valid match found - - Handles both single values and multiple matches - -3. **LLM-Based Judge** - - New `_grade_llm()` method for generic answer extraction - - Includes problem statement in prompt for context - - Configurable server URL (defaults to main server) - - Configurable model name (defaults to main model) - - Case-insensitive comparison - - Returns extracted answer for display - -4. **Response Truncation** - - New `_truncate_response()` method keeps only last 2-3 lines - - Applied before grading to focus on final answer section - -5. **CLI Grader Update** - - Now also returns extracted answer - - Returns None if grading fails - -6. **Display Updates** - - Progress table shows extracted answer instead of full response - - Verbose mode shows full response plus extracted answer - -7. 
**New CLI Arguments** - - `--grader-type`: Added "llm" option - - `--judge-server`: Separate server for LLM judge - - `--judge-model`: Separate model for LLM judge - -**Testing Results:** -- ✅ Regex grader extracts answers correctly from various formats -- ✅ LLM judge can extract answers with problem context -- ✅ Response truncation focuses grading on final answer -- ✅ Case-insensitive matching works for both regex and LLM grader -- ✅ Judge model and server configuration propagate correctly -- ✅ Progress table shows extracted answers instead of full responses - -**Key Technical Decisions:** -- Judge parameters set once during Grader construction (not on each call) -- LLM judge prompt includes problem statement for better extraction -- Response truncation to last 2-3 lines focuses grading on final answer -- Case-insensitive comparison for more flexible matching -- Judge configuration propagates through Processor to Grader -- Display shows extracted answer for cleaner output - -**Refactoring:** -- Removed judge parameters from `grade()` method calls -- Added `judge_server_url` and `judge_model_name` to Grader class -- Updated `_grade_llm()` to use instance variables instead of parameters -- Simplified Processor initialization to pass judge config to grader -- Updated startup info to show judge server and model - -### llama-eval-new.py GSM8K Dataset Support - -**Changes Made:** -1. **GSM8K Dataset Integration** - Added support for GSM8K dataset alongside AIME - - Created `Gsm8kDataset` class with proper answer extraction logic - - GSM8K uses `"question"` field instead of `"problem"` field - - GSM8K answer field contains full reasoning with `####` prefix - - Extracts numeric answer from answer field during initialization - - Uses same regex grader pattern as AIME (`\b(\d+)\b`) - -2. 
**Dataset Type Configuration** - Added dataset selection support - - Added `--dataset` CLI argument with choices `aime` and `gsm8k` - - Updated `Processor` class to accept `dataset_type` parameter - - Dataset-specific initialization in `Processor.__init__()` - - Dataset name displayed in task summary table - -3. **Template Registry** - Added dataset-specific prompt templates - - AIME template: includes `\boxed{}` wrapper for final answer - - GSM8K template: plain text answer without wrapper - - Templates applied based on `question["dataset_type"]` field - -4. **Answer Extraction Logic** - Fixed GSM8K answer extraction - - GSM8K has pre-extracted `"gold"` field with numeric answer - - `Gsm8kDataset.get_answer()` checks for `"gold"` field first - - Falls back to answer field if gold field not present - - `AimeDataset.get_answer()` simplified to remove duplicate method - -5. **Task ID Format** - Fixed duplicate prefix in task IDs - - Changed from `f"{dataset_type}_{eval_state.id}_{chunk_idx:03d}_{i:03d}"` - - To `f"{dataset_type}_{chunk_idx:03d}_{i:03d}"` - - Removed redundant `eval_state.id` (was "gsm8k" for GSM8K) - -6. 
**Column Width Adjustments** - Improved table formatting - - Task ID column: 25 characters - - Dataset column: 5 characters - - Prompt column: 40 characters - - Expected column: 10 characters - -**Testing Results:** -- ✅ GSM8K dataset loads correctly with 7473 questions -- ✅ Numeric answers extracted from full reasoning text -- ✅ Task summary table displays correctly with adjusted column widths -- ✅ Task IDs show correct format (e.g., `gsm8k_000_3169`) -- ✅ Both AIME and GSM8K datasets work with same script -- ✅ Answer extraction works for both boxed and plain text formats -- ✅ Progress tracking shows extracted answers for both datasets - -**Key Technical Decisions:** -- GSM8K uses `"question"` field instead of `"problem"` field -- GSM8K answer field contains full reasoning with `####` prefix -- Numeric answer extracted during dataset initialization -- Same regex grader pattern works for both datasets -- Dataset selection via CLI argument for separate runs -- Template registry supports different prompt formats per dataset -- Task ID format simplified to avoid duplication - -**Refactoring:** -- Removed duplicate `get_question()` method from `AimeDataset` -- Removed "2025" suffix from eval state ID (was remnant from old version) -- Removed "2025" suffix from task summary table output -- Removed "2025" suffix from progress tracking output -- Updated `Processor.__init__()` to initialize appropriate dataset based on type -- Updated `_process_single_case()` to handle both `"problem"` and `"question"` fields -- Updated `process()` method to display dataset name and use `dataset_type` for task states diff --git a/examples/llama-eval/llama-eval-new.py b/examples/llama-eval/llama-eval-new.py index 8426dae724..eacbe3d887 100755 --- a/examples/llama-eval/llama-eval-new.py +++ b/examples/llama-eval/llama-eval-new.py @@ -5,6 +5,7 @@ import json import os import re import subprocess +import sys import time from concurrent.futures import ThreadPoolExecutor, as_completed from 
dataclasses import dataclass, asdict @@ -34,6 +35,15 @@ Please reason step by step, and put your final answer within \\boxed{{}}. """, "gsm8k": """{question} Please reason step by step, and provide your final answer. +""", + "gpqa": """{Question} + +(A) {A} +(B) {B} +(C) {C} +(D) {D} + +Express your final answer as the corresponding option 'A', 'B', 'C', or 'D'. """, } @@ -96,6 +106,15 @@ class AimeDataset: return str(normalized) if normalized is not None else answer return str(answer) + def get_prompt(self, question: Dict) -> str: + """Get formatted prompt for the question""" + if question["dataset_type"] == "gpqa": + return TEMPLATE_REGISTRY["gpqa"].format(**question) + else: + return TEMPLATE_REGISTRY[question["dataset_type"]].format( + question=question["problem"] if "problem" in question else question["question"] + ) + class Gsm8kDataset: def __init__(self, split: str = "train"): self.split = split @@ -146,17 +165,87 @@ class Gsm8kDataset: return str(normalized) if normalized is not None else answer return str(answer) + def get_prompt(self, question: Dict) -> str: + """Get formatted prompt for the question""" + return TEMPLATE_REGISTRY[question["dataset_type"]].format( + question=question["problem"] if "problem" in question else question["question"] + ) + +class GpqaDataset: + def __init__(self, variant: str = "diamond", seed: int = 1234): + self.variant = variant + self.seed = seed + self.questions: List[Dict] = [] + self._load_dataset() + + def _load_dataset(self): + print(f"Loading GPQA dataset (variant: {self.variant})...") + import pandas as pd + + url = f"https://openaipublic.blob.core.windows.net/simple-evals/gpqa_{self.variant}.csv" + df = pd.read_csv(url) + + rng = random.Random(self.seed) + + self.questions = [] + for _, row in df.iterrows(): + question = row.to_dict() + question["dataset_type"] = "gpqa" + + # Shuffle the answer options + correct_answer = question["Correct Answer"] + incorrect_answers = [ + question["Incorrect Answer 1"], + 
question["Incorrect Answer 2"], + question["Incorrect Answer 3"] + ] + + # Create list of (answer, is_correct) tuples + options = [(ans, ans == correct_answer) for ans in incorrect_answers] + options.append((correct_answer, True)) + + # Shuffle the options + rng.shuffle(options) + + # Extract shuffled answers and determine correct letter + shuffled_answers = [ans for ans, _ in options] + correct_letter = chr(ord('A') + options.index((correct_answer, True))) + + # Store shuffled answers and correct letter + question["shuffled_answers"] = shuffled_answers + question["correct_letter"] = correct_letter + + self.questions.append(question) + + print(f"GPQA dataset loaded: {len(self.questions)} questions") + + def get_question(self, index: int) -> Dict: + """Get question by index""" + return self.questions[index] + + def get_answer(self, question: Dict) -> str: + # GPQA returns the correct letter (A, B, C, or D) + return question["correct_letter"] + + def get_prompt(self, question: Dict) -> str: + """Get formatted prompt for the question""" + return TEMPLATE_REGISTRY["gpqa"].format( + Question=question["Question"], + A=question["shuffled_answers"][0], + B=question["shuffled_answers"][1], + C=question["shuffled_answers"][2], + D=question["shuffled_answers"][3] + ) + class Grader: def __init__( self, - grader_type: str = "regex", - grader_regex_type: str = "aime", + grader_type: str = "llm", grader_script: Optional[str] = None, judge_model_name: Optional[str] = None, judge_server_url: str = "" ): self.grader_type = grader_type - self.grader_regex_type = grader_regex_type self.grader_script = grader_script self.judge_model_name = judge_model_name self.judge_server_url = judge_server_url @@ -164,9 +253,7 @@ class Grader: def _get_pattern(self) -> Optional[str]: if self.grader_type == "regex": - if self.grader_regex_type not in GRADER_PATTERNS: - raise ValueError(f"Unknown grader regex type: {self.grader_regex_type}") - return GRADER_PATTERNS[self.grader_regex_type] + return 
GRADER_PATTERNS.get("aime") # Default to aime pattern return None def _extract_answer_regex(self, pred: str) -> Optional[str]: @@ -221,18 +308,21 @@ class Grader: """Grade using LLM-based extraction""" prompt = f"""Extract the answer from this response: -Response: {pred} - Expected answer: {gold} -Please provide only the extracted answer, nothing else.""" +=== + +Response: {pred} + +=== + +Please provide only the extracted answer, nothing else. If there is no clear answer in the response, reply with 'no answer'.""" url = f"{self.judge_server_url}/v1/chat/completions" headers = {"Content-Type": "application/json"} data = { "model": self.judge_model_name, "messages": [{"role": "user", "content": prompt}], "temperature": 0, - "max_tokens": 256 } try: @@ -264,14 +354,16 @@ class Processor: def __init__( self, server_url: str, - n_predict: int = 2048, + n_predict: int = -1, threads: int = 32, verbose: bool = False, grader: Optional[Grader] = None, model_name: Optional[str] = None, judge_server_url: str = "", judge_model_name: Optional[str] = None, - dataset_type: str = "aime" + dataset_type: str = "aime", + seed: int = 1234, + sampling_config: Optional[Dict[str, Any]] = None ): self.server_url = server_url self.n_predict = n_predict @@ -281,12 +373,14 @@ class Processor: self.judge_server_url = judge_server_url if judge_server_url else server_url self.judge_model_name = judge_model_name self.dataset_type = dataset_type + self.seed = seed self.grader = grader or Grader() + self.sampling_config = sampling_config or {"n_predict": n_predict} self.eval_state = EvalState( id=dataset_type, tasks=[dataset_type], task_states={}, - sampling_config={"temperature": 0, "max_tokens": n_predict} + sampling_config=self.sampling_config ) # Pass judge configuration to grader if using LLM grader @@ -301,6 +395,8 @@ class Processor: self.dataset = AimeDataset() elif dataset_type == "gsm8k": self.dataset = Gsm8kDataset() + elif dataset_type == "gpqa": + self.dataset = 
GpqaDataset(variant="diamond", seed=self.seed) else: raise ValueError(f"Unknown dataset type: {dataset_type}") @@ -311,9 +407,16 @@ class Processor: data = { "model": self.model_name if self.model_name else "llama", "messages": [{"role": "user", "content": prompt}], - "temperature": 0, - "max_tokens": self.n_predict + "n_predict": self.n_predict } + if self.sampling_config.get("temperature") is not None: + data["temperature"] = self.sampling_config["temperature"] + if self.sampling_config.get("top_k") is not None: + data["top_k"] = self.sampling_config["top_k"] + if self.sampling_config.get("top_p") is not None: + data["top_p"] = self.sampling_config["top_p"] + if self.sampling_config.get("min_p") is not None: + data["min_p"] = self.sampling_config["min_p"] response = requests.post(url, headers=headers, json=data) response.raise_for_status() @@ -322,14 +425,9 @@ class Processor: def _process_single_case(self, i: int, task_id: str) -> TaskState: """Process a single case (thread-safe)""" question = self.dataset.get_question(i) - dataset_id = f"{self.dataset_type}_{self.dataset.split}_{i}" + dataset_id = f"{self.dataset_type}_{i}" gold = self.dataset.get_answer(question) - - # Apply template if available - if question["dataset_type"] in TEMPLATE_REGISTRY: - prompt = TEMPLATE_REGISTRY[question["dataset_type"]].format(question=question["problem"] if "problem" in question else question["question"]) - else: - prompt = question["problem"] if "problem" in question else question["question"] + prompt = self.dataset.get_prompt(question) task_state = TaskState( case_id=task_id, @@ -361,12 +459,15 @@ class Processor: n_cases = len(self.dataset.questions) print(f"\nProcessing {n_cases} {self.dataset_type.upper()} questions...") - print(f"Server: {self.server_url}") + print(f"Server: {self.server_url} (model: {self.model_name})") print(f"Threads: {self.threads}") print(f"Max tokens: {self.n_predict}") + print(f"Seed: {self.seed}") + print(f"Sampling: 
temp={self.sampling_config.get('temperature', 'skip')}, top-k={self.sampling_config.get('top_k', 'skip')}, top-p={self.sampling_config.get('top_p', 'skip')}, min-p={self.sampling_config.get('min_p', 'skip')}") print(f"Grader: {self.grader.grader_type}", end="") if self.grader.grader_type == "llm": - print(f" (judge server: {self.judge_server_url}, model: {self.judge_model_name})", end="") + judge_model = self.judge_model_name if self.judge_model_name else self.model_name + print(f" (judge server: {self.judge_server_url}, model: {judge_model})", end="") print() print() @@ -389,9 +490,14 @@ class Processor: print(" Task ID Dataset Prompt (first 40 chars) Expected Status") for i, task_id in task_list: question = self.dataset.get_question(i) - prompt = question["problem"] if "problem" in question else question["question"] + prompt = self.dataset.get_prompt(question) gold = self.dataset.get_answer(question) - truncated_prompt = prompt[:40] + "..." if len(prompt) > 40 else prompt + first_line = prompt.split('\n')[0] + truncated_prompt = first_line[:43] + if len(first_line) > 43: + truncated_prompt += "..." + else: + truncated_prompt = truncated_prompt.ljust(43) + "..." print(f" {task_id:<20} {self.dataset_type.upper()} {truncated_prompt:<40} {gold:<10} pending") print() @@ -413,7 +519,13 @@ class Processor: # Print task completion status extracted_display = task_state.extracted if task_state.extracted else "N/A" success_ratio = correct / total if total > 0 else 0.0 - print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {task_state.prompt[:40]:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]") + first_line = task_state.prompt.split('\n')[0] + truncated_prompt = first_line[:43] + if len(first_line) > 43: + truncated_prompt += "..." + else: + truncated_prompt = truncated_prompt.ljust(43) + "..." 
+ print(f"{total:3}/{n_cases:3} {task_state.case_id:<20} {self.dataset_type.upper()} {truncated_prompt:<40} {task_state.gold:<10} {extracted_display:<10} {'✓' if task_state.correct else '✗'} [{correct:3}/{total:3}, {success_ratio:.3f}]") if self.verbose: print(f"\nCase {total}: {task_state.correct}") @@ -456,7 +568,7 @@ def main(): "--dataset", type=str, default="aime", - choices=["aime", "gsm8k"], + choices=["aime", "gsm8k", "gpqa"], help="Dataset type (default: aime)" ) parser.add_argument( @@ -474,8 +586,32 @@ def main(): parser.add_argument( "--n_predict", type=int, - default=2048, - help="Max tokens to predict per prompt (default: 2048)" + default=-1, + help="Max tokens to predict per prompt (default: -1, infinite)" + ) + parser.add_argument( + "--temperature", + type=float, + default=None, + help="Sampling temperature (default: not passed)" + ) + parser.add_argument( + "--top-k", + type=int, + default=None, + help="Top K sampling (default: not passed)" + ) + parser.add_argument( + "--top-p", + type=float, + default=None, + help="Top P sampling (default: not passed)" + ) + parser.add_argument( + "--min-p", + type=float, + default=None, + help="Min P sampling (default: not passed)" ) parser.add_argument( "--threads", @@ -503,16 +639,9 @@ def main(): parser.add_argument( "--grader-type", type=str, - default="regex", + default="llm", choices=["regex", "cli", "llm"], - help="Grader type: regex, cli, or llm (default: regex)" - ) - parser.add_argument( - "--grader-regex-type", - type=str, - default="aime", - choices=list(GRADER_PATTERNS.keys()), - help="Regex grader type (default: aime)" + help="Grader type: regex, cli, or llm (default: llm)" ) parser.add_argument( "--grader-script", @@ -529,21 +658,37 @@ def main(): parser.add_argument( "--judge-model", type=str, - default=None, + default="", help="Model name for LLM judge (default: same as main model)" ) args = parser.parse_args() + # Validate grader type for GPQA + if args.dataset == "gpqa" and args.grader_type 
!= "llm": + print("Error: GPQA dataset requires --grader-type llm") + parser.print_help() + sys.exit(1) + grader = Grader( grader_type=args.grader_type, - grader_regex_type=args.grader_regex_type, - grader_script=args.grader_script + grader_script=args.grader_script, + judge_model_name=args.judge_model if args.judge_model else args.model ) if args.grader_type == "llm" and not args.judge_server: print("Warning: Using same server for LLM judge (no --judge-server specified)") + sampling_config = {"n_predict": args.n_predict} + if args.temperature is not None: + sampling_config["temperature"] = args.temperature + if args.top_k is not None: + sampling_config["top_k"] = args.top_k + if args.top_p is not None: + sampling_config["top_p"] = args.top_p + if args.min_p is not None: + sampling_config["min_p"] = args.min_p + processor = Processor( server_url=args.server, n_predict=args.n_predict, @@ -553,7 +698,8 @@ def main(): model_name=args.model, judge_server_url=args.judge_server, judge_model_name=args.judge_model, - dataset_type=args.dataset + dataset_type=args.dataset, + sampling_config=sampling_config ) eval_state = processor.process(n_cases=args.n_cases, seed=args.seed) diff --git a/examples/llama-eval/llama-eval-state.json b/examples/llama-eval/llama-eval-state.json new file mode 100644 index 0000000000..add0f626a3 --- /dev/null +++ b/examples/llama-eval/llama-eval-state.json @@ -0,0 +1,29 @@ +{ + "id": "gpqa", + "tasks": [ + "gpqa" + ], + "task_states": { + "gpqa": { + "total": 1, + "correct": 0, + "cases": { + "gpqa": [ + { + "case_id": "gpqa_000_184", + "prompt": "Consider a system with Hamiltonian operator $H = \\varepsilon \\vec{\\sigma}.\\vec{n}$. Here, $\\vec{n}$ is an arbitrary unit vector, $\\varepsilon $ is a constant of dimension energy, and components of $\\vec{\\sigma}$ are the Pauli spin matrices. 
What are the eigenvalues of the Hamiltonian operator?\n\n\n(A) +\\hbar/2, -\\hbar/2\n(B) +1, -1\n(C) +\\varepsilon \\hbar/2, - \\varepsilon \\hbar/2\n(D) + \\varepsilon, -\\varepsilon\n\n\nExpress your final answer as the corresponding option 'A', 'B', 'C', or 'D'.\n", + "gold": "+ \\varepsilon, -\\varepsilon\n", + "pred": null, + "extracted": null, + "correct": false, + "status": "error: HTTPConnectionPool(host='localhost', port=8034): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError(\"HTTPConnection(host='localhost', port=8034): Failed to establish a new connection: [Errno 61] Connection refused\"))" + } + ] + } + } + }, + "sampling_config": { + "temperature": 0, + "max_tokens": 2048 + } +} \ No newline at end of file diff --git a/examples/llama-eval/llama-server-simulator-README.md b/examples/llama-eval/llama-server-simulator-README.md new file mode 100644 index 0000000000..bd69e2615c --- /dev/null +++ b/examples/llama-eval/llama-server-simulator-README.md @@ -0,0 +1,36 @@ +# llama-server-simulator + +Standalone Python script simulating llama-server HTTP endpoint for testing. 
+ +## Features + +- HTTP Server with OpenAI-compatible `/v1/chat/completions` endpoint +- AIME Dataset Integration - Loads 90 questions from HuggingFace +- Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance +- Configurable Success Rate - Control correct/wrong answer generation (0-1) +- Debug Logging - Troubleshoot matching issues + +## Usage + +```bash +python llama-server-simulator.py --success-rate 0.8 +``` + +## Arguments + +- `--success-rate`: Probability of returning correct answer (0.0-1.0, default: 0.8) +- `--port`: Server port (default: 8033) +- `--debug`: Enable debug logging (default: False) + +## Testing + +```bash +./test-simulator.sh +``` + +## Implementation Details + +- Uses Levenshtein distance for partial matching (threshold: 0.3) +- Automatic caching via HuggingFace datasets library +- Wrong answers generated by incrementing expected answer +- Debug output written to stderr diff --git a/examples/llama-eval/llama-server-simulator-plan.md b/examples/llama-eval/llama-server-simulator-plan.md deleted file mode 100644 index ac7dfad060..0000000000 --- a/examples/llama-eval/llama-server-simulator-plan.md +++ /dev/null @@ -1,189 +0,0 @@ -# llama-server-simulator Implementation Plan - -## Overview -Create a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. - -## Goals -1. Simulate llama-server's `/v1/chat/completions` endpoint -2. Accept requests and respond with expected answers from AIME dataset -3. Implement configurable success rate (sometimes right, sometimes wrong) -4. Use regex matching to find questions in incoming requests -5. 
Test with curl requests before integrating with eval script - -## Implementation Plan - -### Phase 1: Basic Simulator Structure -- Create `llama-server-simulator.py` script -- Set up Flask/FastAPI HTTP server -- Implement `/v1/chat/completions` endpoint -- Handle basic request/response format - -### Phase 2: AIME Dataset Integration -- Load AIME dataset -- Store questions and expected answers -- Implement regex matching to find questions in incoming requests -- Extract expected answer from matched question - -### Phase 3: Response Generation -- Implement success rate configuration -- Randomly determine if response should be correct or incorrect -- Generate appropriate response based on success determination -- Format response in OpenAI-compatible format - -### Phase 4: Testing -- Write curl commands to test basic functionality -- Test correct responses -- Test incorrect responses -- Test edge cases (no question found, etc.) - -## Technical Details - -### Server Framework -- Use Flask for simplicity -- Listen on configurable port -- Support JSON request/response format - -### Request Format -```json -{ - "model": "llama", - "messages": [ - {"role": "user", "content": "Question text here"} - ], - "temperature": 0, - "max_tokens": 2048 -} -``` - -### Response Format -```json -{ - "id": "chatcmpl-xxx", - "object": "chat.completion", - "created": 1234567890, - "model": "llama", - "choices": [ - { - "index": 0, - "message": { - "role": "assistant", - "content": "Answer text here" - }, - "finish_reason": "stop" - } - ], - "usage": { - "prompt_tokens": 100, - "completion_tokens": 50, - "total_tokens": 150 - } -} -``` - -### AIME Dataset Integration -- Load from HuggingFace: "AI-MO/aimo-validation-aime" -- Store in memory for fast lookup -- Regex pattern to find question text in request -- Extract answer from matched question - -### Success Rate Configuration -- Command-line argument: `--success-rate 0.8` (80% success rate) -- Randomly determine correctness based on rate -- 
Log when responses are correct vs incorrect - -### Testing Strategy -1. Start simulator with default settings -2. Send curl request with known question -3. Verify response contains expected answer -4. Test with different success rates -5. Test edge cases - -## Implementation Steps - -### Step 1: Basic Server Setup -```python -from flask import Flask, request, jsonify - -app = Flask(__name__) - -@app.route('/v1/chat/completions', methods=['POST']) -def chat_completions(): - # Handle request - return jsonify(response) -``` - -### Step 2: Load AIME Dataset -```python -import datasets - -ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split="train") -# Store in memory -``` - -### Step 3: Regex Matching -```python -import re - -def find_question_in_request(request_text): - # Regex pattern to find question - pattern = r"question:\s*(.*?)\n" - match = re.search(pattern, request_text, re.DOTALL) - return match.group(1) if match else None -``` - -### Step 4: Response Generation -```python -import random - -def generate_response(question, success_rate): - if random.random() < success_rate: - return get_expected_answer(question) - else: - return get_wrong_answer(question) -``` - -### Step 5: Testing with Curl -```bash -curl -X POST http://localhost:8033/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "llama", - "messages": [{"role": "user", "content": "Question text"}] - }' -``` - -## Configuration Options -- `--port`: Server port (default: 8033) -- `--success-rate`: Success rate 0-1 (default: 0.8) -- `--host`: Server host (default: localhost) -- `--dataset-split`: AIME split to use (default: train) - -## Expected Output -``` -=== llama-server-simulator === -Server running on http://localhost:8033 -Success rate: 0.8 -AIME dataset loaded: 1000 questions -``` - -## Testing Checklist -- [ ] Server starts successfully -- [ ] Basic request/response works -- [ ] Correct answer returned when success rate allows -- [ ] Wrong answer returned 
when success rate doesn't allow -- [ ] No question found returns error -- [ ] Multiple requests work correctly -- [ ] Different success rates work as expected - -## Next Steps - -1. ✓ Implement basic server structure -2. ✓ Load AIME dataset -3. ✓ Implement regex matching -4. ✓ Add response generation with success rate -5. ✓ Test with curl commands -6. ✓ Integrate with eval script once simulator works -7. ✓ Implement eval state object -8. ✓ Implement processor object -9. ✓ Add real-time progress reporting -10. ✓ Add enhanced grading system with LLM judge diff --git a/examples/llama-eval/simulator-summary.md b/examples/llama-eval/simulator-summary.md deleted file mode 100644 index 3ea6af5530..0000000000 --- a/examples/llama-eval/simulator-summary.md +++ /dev/null @@ -1,138 +0,0 @@ -# llama-server-simulator Implementation Summary - -## Overview -Successfully implemented a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. - -## Features Implemented - -### 1. HTTP Server -- Flask-based `/v1/chat/completions` endpoint -- OpenAI-compatible response format -- Configurable port and host - -### 2. AIME Dataset Integration -- Loads AIME dataset from HuggingFace -- In-memory storage for fast lookup -- 90 questions loaded from train split - -### 3. Intelligent Question Matching -- **Exact matching**: Direct string comparison -- **LaTeX removal**: Removes `$...$` formatting for flexible matching -- **Levenshtein distance**: Calculates similarity between strings -- **Partial matching**: Finds best match even with small differences - -### 4. Response Generation -- Configurable success rate (0-1) -- Returns correct answers when success rate allows -- Returns wrong answers when success rate doesn't allow -- Wrong answers are generated by incrementing the expected answer - -### 5. 
Debug Logging -- Debug messages written to stderr -- Logs request content, matching results, and distances -- Helps troubleshoot matching issues - -## Configuration Options - -```bash -python3 llama-server-simulator.py \ - --port 8034 \ - --host localhost \ - --success-rate 0.8 \ - --dataset-split train -``` - -## Testing Results - -### Test 1: Correct Answer -- **Success rate**: 0.8 -- **Expected answer**: 116 -- **Result**: ✓ Correct (116) - -### Test 2: Wrong Answer -- **Success rate**: 0.0 -- **Expected answer**: 116 -- **Result**: ✓ Wrong (117) - -### Test 3: No Matching Question -- **Request**: "What is the capital of France?" -- **Result**: ✓ Returns error "No matching question found" - -### Test 4: Success Rate Verification -- **Success rate**: 0.8 -- **Requests**: 10 -- **Correct answers**: 8/10 (80%) -- **Result**: ✓ Success rate working as expected - -## Technical Details - -### Matching Algorithm -1. Try exact match (case-insensitive) -2. Try match after removing LaTeX formatting -3. Calculate Levenshtein distance for partial matches -4. Return best match if distance < 0.3 (30% difference) - -### Response Format -```json -{ - "id": "chatcmpl-1769864875", - "object": "chat.completion", - "created": 1769864875, - "model": "llama", - "choices": [ - { - "index": 0, - "message": { - "role": "assistant", - "content": "116" - }, - "finish_reason": "stop" - } - ], - "usage": { - "prompt_tokens": 100, - "completion_tokens": 50, - "total_tokens": 150 - } -} -``` - -## Files Created - -1. `llama-server-simulator.py` - Main simulator script -2. `test-simulator.sh` - Basic test script -3. `test-simulator-comprehensive.sh` - Comprehensive test script -4. `llama-server-simulator-plan.md` - Implementation plan -5. `llama-eval-discussion.md` - Discussion notes - -## Next Steps - -1. ✓ Basic simulator structure -2. ✓ AIME dataset integration -3. ✓ Question matching with Levenshtein distance -4. ✓ Response generation with configurable success rate -5. 
✓ Testing with curl requests -6. ✓ Integrate with eval script -7. ✓ Implement eval state object -8. ✓ Implement processor object -9. ✓ Add real-time progress reporting -10. ✓ Add enhanced grading system with LLM judge - -## Known Limitations - -1. Only supports AIME dataset (train split) -2. Matching is case-insensitive -3. Wrong answers are simple increments (not realistic) -4. No support for multiple endpoints -5. No distributed evaluation - -## Future Enhancements - -1. Support multiple datasets -2. More sophisticated wrong answer generation -3. Multiple endpoint support -4. Distributed evaluation -5. Real-time progress reporting -6. Eval state serialization -7. Enhanced grading with LLM judge -8. Response truncation for better answer extraction
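The fuzzy question matching described in the summary above (exact case-insensitive match first, then normalized Levenshtein distance with a 0.3 threshold) can be sketched roughly as follows. This is an illustrative reconstruction, not the simulator's actual code: the function names, the normalization by the longer string's length, and the return values are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(request_text: str, questions: list[str], threshold: float = 0.3):
    """Find the stored question closest to the incoming request.

    Tries an exact (case-insensitive) substring match first; otherwise
    falls back to normalized Levenshtein distance and returns the best
    candidate only if it is within the threshold, else None.
    """
    text = request_text.lower()
    best, best_dist = None, 1.0
    for q in questions:
        qn = q.lower()
        if qn in text:  # exact match short-circuits the fuzzy search
            return q
        dist = levenshtein(text, qn) / max(len(text), len(qn), 1)
        if dist < best_dist:
            best, best_dist = q, dist
    return best if best_dist < threshold else None
```

With this shape, a request containing a known AIME question returns that question, while an unrelated prompt (e.g. "What is the capital of France?") falls outside the 0.3 threshold and yields `None`, matching the "No matching question found" behavior tested above.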