This commit is contained in:
Georgi Gerganov 2026-02-15 22:15:50 +02:00
parent d2b10302ce
commit 7751ae2796
No known key found for this signature in database
GPG Key ID: 449E073F9DC10735
3 changed files with 103 additions and 12 deletions


@ -160,9 +160,10 @@ Questions:
1. **Eval State Object** - Structured dataclass with ID, tasks, task states, and sampling config
2. **Processor Object** - Handles processing, grading, and state management
3. **Real-time Feedback** - Shows correct/incorrect status for each case
4. **Flexible Grading System** - Supports regex and CLI-based grading
4. **Flexible Grading System** - Supports regex, CLI, and LLM-based grading
5. **Structured JSON Output** - Saves complete eval state to JSON file
6. **HuggingFace Dataset Caching** - Uses cached dataset path to avoid HF Hub requests
7. **Enhanced Answer Extraction** - Extracts answers from full responses for display
**Grading System:**
- **Regex Grading**: Built-in patterns for different task types
@ -173,6 +174,11 @@ Questions:
- Script accepts `--answer <pred>` and `--expected <gold>`
- Returns exit code 0 if correct, non-zero if incorrect
- 30-second timeout to prevent hanging
- **LLM Judge**: Generic answer extraction using an LLM
- Uses configured server and model for extraction
- Includes problem statement in prompt for context
- Case-insensitive comparison
- Returns extracted answer for display
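The CLI grader contract above can be sketched as a subprocess call. This is a minimal sketch: the `--answer`/`--expected` flags, the exit-code convention, the existence check, and the 30-second timeout come from the notes, while the function name `grade_cli` and the use of stdout as the extracted answer are assumptions.

```python
import os
import subprocess
from typing import Optional, Tuple

def grade_cli(script_path: str, answer: str, expected: str) -> Tuple[bool, Optional[str]]:
    """Run an external grader script; exit code 0 means the answer is correct."""
    if not os.path.isfile(script_path):  # validate script exists before running
        raise FileNotFoundError(f"grader script not found: {script_path}")
    try:
        result = subprocess.run(
            [script_path, "--answer", answer, "--expected", expected],
            capture_output=True, text=True,
            timeout=30,  # 30-second timeout to prevent hanging
        )
    except subprocess.TimeoutExpired:
        return False, None  # treat a hung grader as incorrect, nothing extracted
    # Any stdout is treated as the extracted answer for display (an assumption).
    extracted = result.stdout.strip() or None
    return result.returncode == 0, extracted
```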
**Configuration Options:**
- `--server`: llama-server URL (default: http://localhost:8033)
@ -181,9 +187,11 @@ Questions:
- `--threads`: Number of threads for parallel requests (default: 32)
- `--verbose`: Show detailed output for each case
- `--output`: Output file for eval state (default: llama-eval-state.json)
- `--grader-type`: `regex` or `cli`
- `--grader-type`: `regex`, `cli`, or `llm`
- `--grader-regex-type`: aime, gsm8k, mmlu, hellaswag, arc, winogrande
- `--grader-script`: Path to CLI grader script
- `--judge-server`: Server URL for LLM judge (default: same as main server)
- `--judge-model`: Model name for LLM judge (default: same as main model)
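The flags above can be wired together with `argparse`; a sketch under stated assumptions, covering only the options listed here (the script has additional flags elided by the hunk, and the exact `help` strings and `choices` handling are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="llama-eval-new.py options (sketch)")
    p.add_argument("--server", default="http://localhost:8033",
                   help="llama-server URL")
    p.add_argument("--threads", type=int, default=32,
                   help="number of threads for parallel requests")
    p.add_argument("--verbose", action="store_true",
                   help="show detailed output for each case")
    p.add_argument("--output", default="llama-eval-state.json",
                   help="output file for eval state")
    p.add_argument("--grader-type", choices=["regex", "cli", "llm"],
                   default="regex")
    p.add_argument("--grader-regex-type",
                   choices=["aime", "gsm8k", "mmlu", "hellaswag",
                            "arc", "winogrande"])
    p.add_argument("--grader-script", help="path to CLI grader script")
    p.add_argument("--judge-server", default=None,
                   help="LLM judge server URL (default: same as --server)")
    p.add_argument("--judge-model", default=None,
                   help="LLM judge model (default: same as main model)")
    return p
```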
**Testing Results:**
- ✅ Works with simulator at 100% success rate (all correct)
@ -193,6 +201,12 @@ Questions:
- ✅ JSON output contains complete eval state with all cases
- ✅ HF Hub telemetry disabled (no warnings)
- ✅ Uses cached dataset path to avoid HF Hub requests when available
- ✅ Regex grader extracts answers correctly from various formats
- ✅ LLM judge can extract answers with problem context
- ✅ Response truncation focuses grading on final answer
- ✅ Case-insensitive matching works for both regex and LLM grader
- ✅ Judge model and server configuration propagate correctly
- ✅ Progress table shows extracted answers instead of full responses
**Key Technical Decisions:**
- Removed Levenshtein matching - eval script only sends requests and validates answers
@ -201,6 +215,10 @@ Questions:
- Handles both boxed and plain text formats for AIME answers
- 30-second timeout for CLI grader
- Validates script exists before running
- Judge parameters set once during Grader construction
- LLM judge prompt includes problem statement for better extraction
- Response truncation to last 2-3 lines focuses grading on final answer
- Case-insensitive comparison for more flexible matching
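Handling both boxed and plain-text AIME formats might look like the sketch below. The notes only state that both formats are handled; the concrete patterns, the function name `extract_aime_answer`, and the boxed-first preference are illustrative assumptions (AIME answers are integers from 0 to 999).

```python
import re
from typing import Optional

# Prefer a LaTeX \boxed{...} answer; fall back to a plain "answer is N" form.
BOXED_RE = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")
PLAIN_RE = re.compile(r"answer\s*(?:is|:)?\s*(\d{1,3})", re.IGNORECASE)

def extract_aime_answer(response: str) -> Optional[str]:
    """Extract an AIME answer from a response, or None if nothing matches."""
    m = BOXED_RE.search(response) or PLAIN_RE.search(response)
    return m.group(1) if m else None
```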
**Refactoring:**
- Removed all task implementations except AIME
@ -209,6 +227,9 @@ Questions:
- Removed complex task loading logic
- Removed summary reporting (replaced with real-time feedback)
- Added HuggingFace dataset caching optimization
- Added LLM grader support with configurable server and model
- Added response truncation before grading
- Refactored grader interface to return extracted answers
### llama-eval-new.py Threading and Model Parameter Updates
@ -245,3 +266,65 @@ Questions:
- Changed from sequential loop to ThreadPoolExecutor with futures
- Updated verbose output to show total count instead of index
- Made eval state updates thread-safe
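The ThreadPoolExecutor change above can be sketched as follows. The notes confirm the move to futures and thread-safe state updates; the function name `run_eval`, the `process_one(task) -> (task_id, result)` signature, and the lock-guarded dict are assumptions about how that looks in code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_eval(tasks, process_one, n_threads=32):
    """Fan tasks out to a thread pool and collect results as they complete."""
    state = {}
    lock = threading.Lock()
    done = 0
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(process_one, t) for t in tasks]
        for fut in as_completed(futures):
            task_id, result = fut.result()
            with lock:  # eval state updates must be thread-safe
                state[task_id] = result
                done += 1
                # verbose output shows total count instead of index
                print(f"[{done}/{len(tasks)}] task {task_id} done")
    return state
```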
### llama-eval-new.py Enhanced Grading System
**Changes Made:**
1. **Enhanced Grader Interface** - Updated to return extracted answers
- `grade()` method now returns `Tuple[bool, Optional[str]]` (correctness + extracted answer)
- Added `extracted` field to `TaskState` dataclass
- All grader types (regex, cli, llm) now return extracted answers
2. **Improved Regex Grader**
- New `_extract_answer_regex()` method extracts answers using configured patterns
- Supports case-insensitive matching
- Returns first valid match found
- Handles both single values and multiple matches
3. **LLM-Based Judge**
- New `_grade_llm()` method for generic answer extraction
- Includes problem statement in prompt for context
- Configurable server URL (defaults to main server)
- Configurable model name (defaults to main model)
- Case-insensitive comparison
- Returns extracted answer for display
4. **Response Truncation**
- New `_truncate_response()` method keeps only the last 2-3 lines
- Applied before grading to focus on final answer section
5. **CLI Grader Update**
- Now also returns extracted answer
- Returns None if grading fails
6. **Display Updates**
- Progress table shows extracted answer instead of full response
- Verbose mode shows full response plus extracted answer
7. **New CLI Arguments**
- `--grader-type`: Added "llm" option
- `--judge-server`: Separate server for LLM judge
- `--judge-model`: Separate model for LLM judge
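The enhanced grader interface described in the list above can be sketched like this. `Grader`, `TaskState`, the `extracted` field, `_truncate_response()`, the `grade()` return type, and the `judge_server_url`/`judge_model_name` construction-time attributes are all named in the notes; the default regex pattern, the `keep=3` cutoff, and the fact that only the regex path is shown (CLI and LLM paths omitted) are simplifications for this sketch.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TaskState:
    task_id: int
    expected: str
    response: str = ""
    extracted: Optional[str] = None  # answer shown in the progress table
    correct: bool = False

class Grader:
    def __init__(self, grader_type: str = "regex",
                 pattern: str = r"(\d+)\s*$",
                 judge_server_url: Optional[str] = None,
                 judge_model_name: Optional[str] = None):
        # Judge parameters are set once during construction, not on each call.
        self.grader_type = grader_type
        self.pattern = re.compile(pattern)
        self.judge_server_url = judge_server_url
        self.judge_model_name = judge_model_name

    def _truncate_response(self, response: str, keep: int = 3) -> str:
        """Keep only the last few non-empty lines, focusing on the final answer."""
        lines = [ln for ln in response.strip().splitlines() if ln.strip()]
        return "\n".join(lines[-keep:])

    def grade(self, response: str, expected: str) -> Tuple[bool, Optional[str]]:
        """Return (correct, extracted answer); only the regex path is sketched."""
        tail = self._truncate_response(response)
        m = self.pattern.search(tail)
        extracted = m.group(1) if m else None
        # Case-insensitive comparison for more flexible matching.
        correct = extracted is not None and extracted.lower() == expected.lower()
        return correct, extracted
```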
**Testing Results:**
- ✅ Regex grader extracts answers correctly from various formats
- ✅ LLM judge can extract answers with problem context
- ✅ Response truncation focuses grading on final answer
- ✅ Case-insensitive matching works for both regex and LLM grader
- ✅ Judge model and server configuration propagate correctly
- ✅ Progress table shows extracted answers instead of full responses
**Key Technical Decisions:**
- Judge parameters set once during Grader construction (not on each call)
- LLM judge prompt includes problem statement for better extraction
- Response truncation to last 2-3 lines focuses grading on final answer
- Case-insensitive comparison for more flexible matching
- Judge configuration propagates through Processor to Grader
- Display shows extracted answer for cleaner output
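The decision to include the problem statement in the judge prompt might be realized as below; the exact prompt wording and the helper name `build_judge_prompt` are assumptions, since the notes only state that the problem statement is included for context.

```python
def build_judge_prompt(problem: str, response_tail: str) -> str:
    """Build an extraction prompt for the LLM judge, including the problem."""
    return (
        "You are grading a model's answer to a problem.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Model response (final lines):\n{response_tail}\n\n"
        "Reply with the final answer only, with no extra text."
    )
```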
**Refactoring:**
- Removed judge parameters from `grade()` method calls
- Added `judge_server_url` and `judge_model_name` to Grader class
- Updated `_grade_llm()` to use instance variables instead of parameters
- Simplified Processor initialization to pass judge config to grader
- Updated startup info to show judge server and model


@ -176,9 +176,14 @@ AIME dataset loaded: 1000 questions
- [ ] Different success rates work as expected
## Next Steps
1. Implement basic server structure
2. Load AIME dataset
3. Implement regex matching
4. Add response generation with success rate
5. Test with curl commands
6. Integrate with eval script once simulator works
1. ✓ Implement basic server structure
2. ✓ Load AIME dataset
3. ✓ Implement regex matching
4. ✓ Add response generation with success rate
5. ✓ Test with curl commands
6. ✓ Integrate with eval script once simulator works
7. ✓ Implement eval state object
8. ✓ Implement processor object
9. ✓ Add real-time progress reporting
10. ✓ Add enhanced grading system with LLM judge


@ -112,10 +112,11 @@ python3 llama-server-simulator.py \
3. ✓ Question matching with Levenshtein distance
4. ✓ Response generation with configurable success rate
5. ✓ Testing with curl requests
6. ⏭️ Integrate with eval script
7. ⏭️ Implement eval state object
8. ⏭️ Implement processor object
9. ⏭️ Add real-time progress reporting
6. ✓ Integrate with eval script
7. ✓ Implement eval state object
8. ✓ Implement processor object
9. ✓ Add real-time progress reporting
10. ✓ Add enhanced grading system with LLM judge
## Known Limitations
@ -133,3 +134,5 @@ python3 llama-server-simulator.py \
4. Distributed evaluation
5. Real-time progress reporting
6. Eval state serialization
7. Enhanced grading with LLM judge
8. Response truncation for better answer extraction