# llama-eval Implementation Discussion

## Overview

Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.

## Key Requirements from ggerganov

### 1. Simplify and Focus on One Eval
- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially

### 2. Implement an "eval state" object
- ID
- List of tasks
- Task states
- Sampling config

### 3. Implement a "processor" object
- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)

### 4. Processor responsibilities
- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses

### 5. Real-time feedback
- Default: show "correct / not correct" for each task
- Verbose mode: show the produced answer vs. the expected answer as soon as a task completes

### 6. Grading approach
- Abstract grading to support an external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid the issues seen in GPT-OSS evals)

### 7. Output format
- Use structured output (JSON) instead of boxed text

## Current Implementation Analysis

### What exists in llama-eval.py
- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to an OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting

### What needs to be removed
- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)

## Discussion Points

### 1. Eval State Object Structure

**Status: Under Discussion**

Questions:
- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?

### 2. Processor Architecture

**Status: Not Started**

Questions:
- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?

### 3. Grader Interface

**Status: Not Started**

Questions:
- How should the grader be configured?
- Should it be a separate service or a local LLM call?
- What's the interface for grading?

### 4. Checkpointing

**Status: Not Started**

Questions:
- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?

### 5. Real-time Output

**Status: Not Started**

Questions:
- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?

### 6. Output Format

**Status: Not Started**

Questions:
- Should responses be in JSON format?
- How should the grader interface work with JSON output?

## Next Steps

1. **Eval State Object** - currently under discussion
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format

## References

- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195

## Session Work Summary

### llama-server-simulator Implementation

**Created:**
- `llama-server-simulator.py` - Standalone Python script simulating the llama-server HTTP endpoint
- `test-simulator.sh` - Test script for verifying simulator functionality
- `llama-server-simulator-plan.md` - Implementation plan
- `simulator-summary.md` - Summary of the implementation

**Features Implemented:**
1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with OpenAI-compatible format
2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
5. Debug Logging - Helps troubleshoot matching issues

**Testing Results:**
- ✅ Correct answers returned when the success rate allows
- ✅ Wrong answers returned when the success rate doesn't allow
- ✅ Errors returned when no question matches
- ✅ Success rate verified (80% over 10 requests)
- ✅ HuggingFace dataset caching working correctly

**Key Technical Decisions:**
- Used Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via the HuggingFace datasets library
- Wrong answers generated by incrementing the expected answer
- Debug output written to stderr for better visibility

**Refactoring:**
- Extracted the repeated question string into a TEST_QUESTION variable
- Created a make_request() helper function to reduce code duplication
- Added proper error handling for error responses
- Fixed the simulator not shutting down at script completion

### llama-eval-new.py Implementation

**Created:**
- `llama-eval-new.py` - Simplified evaluation tool focused on AIME

**Features Implemented:**
1. **Eval State Object** - Structured dataclass with ID, tasks, task states, and sampling config
2. **Processor Object** - Handles processing, grading, and state management
3. **Real-time Feedback** - Shows correct/incorrect status for each case
4. **Flexible Grading System** - Supports regex- and CLI-based grading
5. **Structured JSON Output** - Saves the complete eval state to a JSON file
6. **HuggingFace Dataset Caching** - Uses the cached dataset path to avoid HF Hub requests

**Grading System:**
- **Regex Grading**: Built-in patterns for different task types
  - `aime`: `\boxed{(\d+)}|\b(\d+)\b` (handles boxed and plain-text answers)
  - `gsm8k`: `\b(\d+)\b` (extracts the first number)
  - `mmlu`, `hellaswag`, `arc`, `winogrande`: `[A-D]` (extracts a single letter)
- **CLI Grading**: External script interface
  - Script accepts `--answer` and `--expected` arguments
  - Returns exit code 0 if correct, non-zero if incorrect
  - 30-second timeout to prevent hanging

**Configuration Options:**
- `--server`: llama-server URL (default: http://localhost:8033)
- `--n_cases`: Number of cases to evaluate (default: all)
- `--n_predict`: Max tokens to predict per prompt (default: 2048)
- `--threads`: Number of threads for parallel requests (default: 32)
- `--verbose`: Show detailed output for each case
- `--output`: Output file for the eval state (default: llama-eval-state.json)
- `--grader-type`: `regex` or `cli`
- `--grader-regex-type`: `aime`, `gsm8k`, `mmlu`, `hellaswag`, `arc`, `winogrande`
- `--grader-script`: Path to a CLI grader script

**Testing Results:**
- ✅ Works with the simulator at 100% success rate (all correct)
- ✅ Works with the simulator at 0% success rate (all incorrect)
- ✅ Works with the simulator at 80% success rate (8/10 correct)
- ✅ Real-time verbose output shows gold/pred/status for each case
- ✅ JSON output contains the complete eval state with all cases
- ✅ HF Hub telemetry disabled (no warnings)
- ✅ Uses the cached dataset path to avoid HF Hub requests when available

**Key Technical Decisions:**
- Removed Levenshtein matching - the eval script only sends requests and validates answers
- Abstract grading interface for external grader support
- Exact-match requirement for regex patterns
- Handles both boxed and plain-text formats for AIME answers
- 30-second timeout for the CLI grader
- Validates that the grader script exists before running it

**Refactoring:**
- Removed all task implementations except AIME
- Removed regex-based grading
  (moved to the flexible grader system)
- Removed multiple endpoint support
- Removed complex task loading logic
- Removed summary reporting (replaced with real-time feedback)
- Added HuggingFace dataset caching optimization

### llama-eval-new.py Threading and Model Parameter Updates

**Changes Made:**

1. **Threading Support**
   - Added ThreadPoolExecutor for parallel request processing
   - Added `from concurrent.futures import ThreadPoolExecutor, as_completed`
   - Created a `_process_single_case()` method for thread-safe case processing
   - Refactored `process()` to use ThreadPoolExecutor with a configurable thread count
   - Updated progress tracking to work with concurrent execution
   - Made eval state updates thread-safe (task_states and counters)

2. **Model Parameter**
   - Added a `--model` argument to the argument parser to specify the model name in request data
   - Added a `model_name` parameter to `Processor.__init__()`
   - Updated `_make_request()` to use the provided model name or default to "llama"
   - The model name is included in the request JSON, e.g. `"model": "gpt-oss-20b-hf"`

**Testing Results:**
- ✅ Works with 2 threads (5 cases processed in ~0.2s)
- ✅ Works with 4 threads (slightly higher throughput)
- ✅ Model parameter correctly added to request data
- ✅ Thread-safe progress tracking with tqdm
- ✅ No race conditions in eval state updates

**Key Technical Decisions:**
- Used ThreadPoolExecutor for simple, effective parallelism
- No rate limiting needed (the server can handle concurrent requests)
- Thread-safe counter updates for correct/total tracking
- Progress bar shows completion status across all threads
- The model parameter is optional and defaults to "llama" if not specified

**Refactoring:**
- Extracted single-case processing into the `_process_single_case()` method
- Changed from a sequential loop to ThreadPoolExecutor with futures
- Updated verbose output to show the total count instead of the index
- Made eval state updates thread-safe
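## Appendix: Illustrative Sketches

The "eval state" object discussed above (ID, list of tasks, task states, sampling config) together with the periodic-dump requirement could be sketched as plain dataclasses. Everything below is an assumption for illustration: the class names, fields, and `dump()` method are not the actual llama-eval-new.py definitions.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical sketch of the "eval state" object described above;
# names and fields are illustrative, not the real implementation.
@dataclass
class TaskState:
    prompt: str
    expected: str
    predicted: str = ""
    correct: bool = False
    done: bool = False

@dataclass
class EvalState:
    id: str
    tasks: list = field(default_factory=list)     # TaskState entries
    sampling: dict = field(default_factory=dict)  # e.g. temperature, top_p
    n_correct: int = 0
    n_done: int = 0

    def dump(self, path: str) -> None:
        # periodic checkpoint: serialize the whole state as JSON
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

state = EvalState(id="aime2025-run1", sampling={"temperature": 0.6})
state.tasks.append(TaskState(prompt="What is 1+1?", expected="2"))
state.dump("llama-eval-state.json")
```

Because `asdict()` recurses into nested dataclasses, the dumped JSON contains every case, which matches the requirement that the output file hold the complete eval state.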
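The two grading paths described in the grading-system section could be sketched as follows. The AIME regex is the one quoted in the notes; `grade_regex`, `grade_cli`, and the script invocation are hypothetical names illustrating the documented contract (exit code 0 = correct, 30-second timeout), not the tool's real functions.

```python
import re
import subprocess

# AIME pattern from the notes: boxed answer first, plain number as fallback.
AIME_RE = re.compile(r"\\boxed\{(\d+)\}|\b(\d+)\b")

def grade_regex(answer: str, expected: str) -> bool:
    m = AIME_RE.search(answer)
    if m is None:
        return False
    extracted = m.group(1) or m.group(2)
    return extracted == expected  # exact-match requirement

def grade_cli(script: str, answer: str, expected: str) -> bool:
    # External grader contract: exit code 0 means correct,
    # with a 30-second timeout to prevent hanging.
    try:
        result = subprocess.run(
            [script, "--answer", answer, "--expected", expected],
            timeout=30,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

An external grader plugged in via `grade_cli` could itself be an LLM judge, which is the direction ggerganov suggested for moving beyond regex extraction.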
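The ThreadPoolExecutor-based processing loop from the threading update could look roughly like this. `process_case` and the lock-guarded counters are stand-ins for `_process_single_case()` and the thread-safe eval state updates; a dummy grader replaces the real HTTP requests.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

lock = threading.Lock()
results = {"correct": 0, "total": 0}

def process_case(case_id: int) -> bool:
    # Stand-in for: send the prompt to the server, grade the response.
    # Here, even-numbered cases are "correct" for demonstration.
    return case_id % 2 == 0

def process(case_ids, n_threads=4):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = {pool.submit(process_case, c): c for c in case_ids}
        for fut in as_completed(futures):
            ok = fut.result()
            with lock:  # thread-safe counter updates
                results["total"] += 1
                results["correct"] += int(ok)
            # real-time feedback as each case completes
            print(f"case {futures[fut]}: {'correct' if ok else 'incorrect'}")

process(range(10), n_threads=4)
```

`as_completed()` yields futures in completion order rather than submission order, which is what allows the correct/incorrect status to be printed as soon as each case finishes.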