From 7751ae2796e6c3cba3ce499d39b9a63b5edf6010 Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Sun, 15 Feb 2026 22:15:50 +0200
Subject: [PATCH] docs

---
 examples/llama-eval/llama-eval-discussion.md  | 87 ++++++++++++++++++-
 .../llama-eval/llama-server-simulator-plan.md | 17 ++--
 examples/llama-eval/simulator-summary.md      | 11 ++-
 3 files changed, 103 insertions(+), 12 deletions(-)

diff --git a/examples/llama-eval/llama-eval-discussion.md b/examples/llama-eval/llama-eval-discussion.md
index 8069ea1625..57bcda138f 100644
--- a/examples/llama-eval/llama-eval-discussion.md
+++ b/examples/llama-eval/llama-eval-discussion.md
@@ -160,9 +160,10 @@ Questions:
 1. **Eval State Object** - Structured dataclass with ID, tasks, task states, and sampling config
 2. **Processor Object** - Handles processing, grading, and state management
 3. **Real-time Feedback** - Shows correct/incorrect status for each case
-4. **Flexible Grading System** - Supports regex and CLI-based grading
+4. **Flexible Grading System** - Supports regex, CLI, and LLM-based grading
 5. **Structured JSON Output** - Saves complete eval state to JSON file
 6. **HuggingFace Dataset Caching** - Uses cached dataset path to avoid HF Hub requests
+7. **Enhanced Answer Extraction** - Extracts answers from full responses for display
 
 **Grading System:**
 - **Regex Grading**: Built-in patterns for different task types
@@ -173,6 +174,11 @@
 - Script accepts `--answer <answer>` and `--expected <expected>`
 - Returns exit code 0 if correct, non-zero if incorrect
 - 30-second timeout to prevent hanging
+- **LLM Judge**: Generic answer extraction using an LLM
+  - Uses configured server and model for extraction
+  - Includes problem statement in prompt for context
+  - Case-insensitive comparison
+  - Returns extracted answer for display
 
 **Configuration Options:**
 - `--server`: llama-server URL (default: http://localhost:8033)
@@ -181,9 +187,11 @@
 - `--threads`: Number of threads for parallel requests (default: 32)
 - `--verbose`: Show detailed output for each case
 - `--output`: Output file for eval state (default: llama-eval-state.json)
-- `--grader-type`: `regex` or `cli`
+- `--grader-type`: `regex`, `cli`, or `llm`
 - `--grader-regex-type`: aime, gsm8k, mmlu, hellaswag, arc, winogrande
 - `--grader-script`: Path to CLI grader script
+- `--judge-server`: Server URL for LLM judge (default: same as main server)
+- `--judge-model`: Model name for LLM judge (default: same as main model)
 
 **Testing Results:**
 - ✅ Works with simulator at 100% success rate (all correct)
@@ -193,6 +201,12 @@
 - ✅ JSON output contains complete eval state with all cases
 - ✅ HF Hub telemetry disabled (no warnings)
 - ✅ Uses cached dataset path to avoid HF Hub requests when available
+- ✅ Regex grader extracts answers correctly from various formats
+- ✅ LLM judge can extract answers with problem context
+- ✅ Response truncation focuses grading on the final answer
+- ✅ Case-insensitive matching works for both regex and LLM grader
+- ✅ Judge model and server configuration propagate correctly
+- ✅ Progress table shows extracted answers instead of full responses
 
 **Key Technical Decisions:**
 - Removed Levenshtein matching - eval script only sends requests and validates answers
@@ -201,6 +215,10 @@
 - Handles both boxed and plain text formats for AIME answers
 - 30-second timeout for CLI grader
 - Validates script exists before running
+- Judge parameters set once during Grader construction
+- LLM judge prompt includes problem statement for better extraction
+- Response truncation to the last 2-3 lines focuses grading on the final answer
+- Case-insensitive comparison for more flexible matching
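+
+A minimal sketch of that flow; the helper names mirror the methods mentioned
+above, but the exact patterns in `llama-eval-new.py` may differ:
+
+```python
+import re
+from typing import Optional, Tuple
+
+def truncate_response(response: str, keep_lines: int = 3) -> str:
+    """Keep only the last few non-empty lines, where the final answer usually is."""
+    lines = [ln for ln in response.splitlines() if ln.strip()]
+    return "\n".join(lines[-keep_lines:])
+
+# Handles both boxed and plain-text AIME formats, case-insensitively.
+AIME_PATTERNS = [
+    re.compile(r"\\boxed\{(\d{1,3})\}"),
+    re.compile(r"answer\s*(?:is|:)\s*(\d{1,3})", re.IGNORECASE),
+]
+
+def grade_regex(response: str, expected: str) -> Tuple[bool, Optional[str]]:
+    tail = truncate_response(response)
+    for pattern in AIME_PATTERNS:
+        m = pattern.search(tail)
+        if m:
+            extracted = m.group(1)
+            return extracted.lower() == expected.strip().lower(), extracted
+    return False, None  # nothing extracted: graded as incorrect
+```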
 
 **Refactoring:**
 - Removed all task implementations except AIME
@@ -209,6 +227,9 @@
 - Removed complex task loading logic
 - Removed summary reporting (replaced with real-time feedback)
 - Added HuggingFace dataset caching optimization
+- Added LLM grader support with configurable server and model
+- Added response truncation before grading
+- Refactored grader interface to return extracted answers
 
 ### llama-eval-new.py Threading and Model Parameter Updates
 
@@ -245,3 +266,65 @@
 - Changed from sequential loop to ThreadPoolExecutor with futures
 - Updated verbose output to show total count instead of index
 - Made eval state updates thread-safe
+
+### llama-eval-new.py Enhanced Grading System
+
+**Changes Made:**
+1. **Enhanced Grader Interface** - Updated to return extracted answers (see the sketch after this list)
+   - `grade()` method now returns `Tuple[bool, Optional[str]]` (correctness + extracted answer)
+   - Added `extracted` field to `TaskState` dataclass
+   - All grader types (regex, cli, llm) now return extracted answers
+
+2. **Improved Regex Grader**
+   - New `_extract_answer_regex()` method extracts answers using configured patterns
+   - Supports case-insensitive matching
+   - Returns the first valid match found
+   - Handles both single values and multiple matches
+
+3. **LLM-Based Judge**
+   - New `_grade_llm()` method for generic answer extraction
+   - Includes problem statement in prompt for context
+   - Configurable server URL (defaults to main server)
+   - Configurable model name (defaults to main model)
+   - Case-insensitive comparison
+   - Returns extracted answer for display
+
+4. **Response Truncation**
+   - New `_truncate_response()` method keeps only the last 2-3 lines
+   - Applied before grading to focus on the final answer section
+
+5. **CLI Grader Update**
+   - Now also returns the extracted answer
+   - Returns `None` if grading fails
+
+6. **Display Updates**
+   - Progress table shows extracted answer instead of full response
+   - Verbose mode shows full response plus extracted answer
+
+7. **New CLI Arguments**
+   - `--grader-type`: Added "llm" option
+   - `--judge-server`: Separate server for LLM judge
+   - `--judge-model`: Separate model for LLM judge
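+
+A minimal sketch of the updated interface; the field and method shapes follow
+the notes above, but the real `llama-eval-new.py` code may differ:
+
+```python
+from dataclasses import dataclass
+from typing import Optional, Tuple
+
+@dataclass
+class TaskState:
+    question: str
+    expected: str
+    response: str = ""
+    extracted: Optional[str] = None  # new field: answer pulled out of the response
+    correct: Optional[bool] = None
+
+class Grader:
+    """Every grader type returns (correct, extracted_answer) from grade()."""
+
+    def __init__(self, grader_type: str = "regex",
+                 judge_server_url: Optional[str] = None,
+                 judge_model_name: Optional[str] = None):
+        # Judge parameters are fixed once at construction, not passed per call.
+        self.grader_type = grader_type
+        self.judge_server_url = judge_server_url
+        self.judge_model_name = judge_model_name
+
+    def grade(self, task: TaskState) -> Tuple[bool, Optional[str]]:
+        extracted = self._extract(task)
+        if extracted is None:
+            return False, None  # extraction failed: graded as incorrect
+        # Case-insensitive comparison for more flexible matching.
+        return extracted.lower() == task.expected.strip().lower(), extracted
+
+    def _extract(self, task: TaskState) -> Optional[str]:
+        # Stand-in for the regex/cli/llm dispatch: take the last non-empty line.
+        lines = [ln for ln in task.response.splitlines() if ln.strip()]
+        return lines[-1].strip() if lines else None
+```
+
+The processor stores both values on the task state, e.g.
+`task.correct, task.extracted = grader.grade(task)`, so the progress table can
+show the extracted answer instead of the full response.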
+
+**Testing Results:**
+- ✅ Regex grader extracts answers correctly from various formats
+- ✅ LLM judge can extract answers with problem context
+- ✅ Response truncation focuses grading on the final answer
+- ✅ Case-insensitive matching works for both regex and LLM grader
+- ✅ Judge model and server configuration propagate correctly
+- ✅ Progress table shows extracted answers instead of full responses
+
+**Key Technical Decisions:**
+- Judge parameters set once during Grader construction (not on each call)
+- LLM judge prompt includes problem statement for better extraction
+- Response truncation to the last 2-3 lines focuses grading on the final answer
+- Case-insensitive comparison for more flexible matching
+- Judge configuration propagates through Processor to Grader
+- Display shows the extracted answer for cleaner output
+
+**Refactoring:**
+- Removed judge parameters from `grade()` method calls
+- Added `judge_server_url` and `judge_model_name` to Grader class
+- Updated `_grade_llm()` to use instance variables instead of parameters
+- Simplified Processor initialization to pass judge config to grader
+- Updated startup info to show judge server and model
diff --git a/examples/llama-eval/llama-server-simulator-plan.md b/examples/llama-eval/llama-server-simulator-plan.md
index 0099894887..ac7dfad060 100644
--- a/examples/llama-eval/llama-server-simulator-plan.md
+++ b/examples/llama-eval/llama-server-simulator-plan.md
@@ -176,9 +176,14 @@ AIME dataset loaded: 1000 questions
 - [ ] Different success rates work as expected
 
 ## Next Steps
-1. Implement basic server structure
-2. Load AIME dataset
-3. Implement regex matching
-4. Add response generation with success rate
-5. Test with curl commands
-6. Integrate with eval script once simulator works
+
+1. ✓ Implement basic server structure
+2. ✓ Load AIME dataset
+3. ✓ Implement regex matching
+4. ✓ Add response generation with success rate
+5. ✓ Test with curl commands
+6. ✓ Integrate with eval script once simulator works
+7. ✓ Implement eval state object
+8. ✓ Implement processor object
+9. ✓ Add real-time progress reporting
+10. ✓ Add enhanced grading system with LLM judge
diff --git a/examples/llama-eval/simulator-summary.md b/examples/llama-eval/simulator-summary.md
index 33b1f1d8ff..3ea6af5530 100644
--- a/examples/llama-eval/simulator-summary.md
+++ b/examples/llama-eval/simulator-summary.md
@@ -112,10 +112,11 @@ python3 llama-server-simulator.py \
 3. ✓ Question matching with Levenshtein distance
 4. ✓ Response generation with configurable success rate
 5. ✓ Testing with curl requests
-6. ⏭️ Integrate with eval script
-7. ⏭️ Implement eval state object
-8. ⏭️ Implement processor object
-9. ⏭️ Add real-time progress reporting
+6. ✓ Integrate with eval script
+7. ✓ Implement eval state object
+8. ✓ Implement processor object
+9. ✓ Add real-time progress reporting
+10. ✓ Add enhanced grading system with LLM judge
 
 ## Known Limitations
 
@@ -133,3 +134,5 @@
 4. Distributed evaluation
 5. Real-time progress reporting
 6. Eval state serialization
+7. Enhanced grading with LLM judge
+8. Response truncation for better answer extraction
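+
+As a sketch of enhancement 7, the judge call can be a plain request to an
+OpenAI-compatible endpoint such as llama-server's `/v1/chat/completions`; the
+prompt wording and defaults below are illustrative, not the final design:
+
+```python
+import requests
+
+def judge_extract(problem: str, response_tail: str,
+                  judge_server: str = "http://localhost:8033",
+                  judge_model: str = "default") -> str:
+    """Ask the judge model to pull the final answer out of a response."""
+    prompt = (
+        "You are grading a math problem. Problem statement:\n"
+        f"{problem}\n\n"
+        "Model response (last lines only):\n"
+        f"{response_tail}\n\n"
+        "Reply with the final answer only, nothing else."
+    )
+    r = requests.post(
+        f"{judge_server}/v1/chat/completions",
+        json={
+            "model": judge_model,
+            "messages": [{"role": "user", "content": prompt}],
+            "temperature": 0.0,
+        },
+        timeout=30,
+    )
+    r.raise_for_status()
+    # The caller compares this, case-insensitively, against the expected answer.
+    return r.json()["choices"][0]["message"]["content"].strip()
+```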