# llama-eval Implementation Discussion

## Overview

Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.

## Key Requirements from ggerganov

### 1. Simplify and Focus on One Eval
- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially

### 2. Implement an "eval state" object
- ID
- List of tasks
- Task states
- Sampling config

### 3. Implement a "processor" object
- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)

### 4. Processor responsibilities
- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses

### 5. Real-time feedback
- Default: show "correct / not correct" for each task
- Verbose mode: show produced answer vs. expected answer as soon as it completes

### 6. Grading approach
- Abstract grading to support an external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals)

### 7. Output format
- Use structured output (JSON) instead of boxed text

## Current Implementation Analysis

### What exists in llama-eval.py
- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to an OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting

### What needs to be removed
- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)

## Discussion Points

### 1. Eval State Object Structure
**Status: Under Discussion**

Questions:
- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?

### 2. Processor Architecture
**Status: Not Started**

Questions:
- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?

### 3. Grader Interface
**Status: Not Started**

Questions:
- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?

### 4. Checkpointing
**Status: Not Started**

Questions:
- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?

### 5. Real-time Output
**Status: Not Started**

Questions:
- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?

### 6. Output Format
**Status: Not Started**

Questions:
- Should responses be in JSON format?
- How should the grader interface work with JSON output?

## Next Steps

1. **Eval State Object** - currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format

## References

- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195

## Session Work Summary

### llama-server-simulator Implementation

**Created:**
- `llama-server-simulator.py` - Standalone Python script simulating the llama-server HTTP endpoint
- `test-simulator.sh` - Test script for verifying simulator functionality
- `llama-server-simulator-plan.md` - Implementation plan
- `simulator-summary.md` - Summary of the implementation
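To make the shape of the simulator concrete, here is a minimal sketch of what its core endpoint might look like. This is illustrative only, not the contents of `llama-server-simulator.py`: the `QUESTION_BANK`, `SUCCESS_RATE`, and `find_answer` names are hypothetical stand-ins, the question bank is hardcoded instead of being loaded from HuggingFace, and `difflib` stands in for the real Levenshtein-based matching.

```python
# Hypothetical sketch of the simulator's core endpoint -- not the actual
# llama-server-simulator.py. The real script loads the 90 AIME questions
# from HuggingFace and matches them with Levenshtein distance.
import difflib
import random
import sys

from flask import Flask, jsonify, request

app = Flask(__name__)

SUCCESS_RATE = 0.8  # probability of returning the correct answer (0-1)

# Stand-in for the HuggingFace-backed AIME question bank: question -> answer.
QUESTION_BANK = {
    "Example AIME question text": "42",
}


def find_answer(question: str) -> str | None:
    """Return the expected answer for the closest matching question, if any."""
    # The real script tries exact matching and LaTeX removal first, then
    # falls back to Levenshtein distance; difflib is used here as a stand-in.
    best = difflib.get_close_matches(question, QUESTION_BANK.keys(), n=1, cutoff=0.7)
    return QUESTION_BANK[best[0]] if best else None


@app.route("/v1/chat/completions", methods=["POST"])
def chat_completions():
    body = request.get_json(force=True)
    question = body["messages"][-1]["content"]
    answer = find_answer(question)
    if answer is None:
        print(f"no match for question: {question!r}", file=sys.stderr)
        return jsonify({"error": {"message": "no matching AIME question"}}), 400
    if random.random() >= SUCCESS_RATE:
        # Wrong answers are generated by incrementing the expected answer.
        answer = str(int(answer) + 1)
    return jsonify({
        "object": "chat.completion",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": answer},
            "finish_reason": "stop",
        }],
    })


if __name__ == "__main__":
    app.run(port=8080)
```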
**Features Implemented:**
1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with an OpenAI-compatible format
2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
5. Debug Logging - Helps troubleshoot matching issues

**Testing Results:**
- ✅ Correct answers returned when the success rate allows
- ✅ Wrong answers returned when the success rate doesn't allow
- ✅ Questions with no match return an error
- ✅ Success rate verified (80% across 10 requests)
- ✅ HuggingFace dataset caching working correctly

**Key Technical Decisions:**
- Used Levenshtein distance for partial matching (threshold: 0.3; see the sketch at the end of this summary)
- Automatic caching via the HuggingFace `datasets` library
- Wrong answers generated by incrementing the expected answer
- Debug output written to stderr for better visibility

**Refactoring:**
- Extracted the repeated question string into a `TEST_QUESTION` variable
- Created a `make_request()` helper function to reduce code duplication
- Added proper error handling for error responses
- Fixed the issue where the simulator did not stop at script completion
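For reference, below is a sketch of how the Levenshtein-based matching decision could work. Only the 0.3 threshold comes from the notes above; the edit-distance implementation, the normalization by string length, and the helper names are assumptions rather than the actual simulator code.

```python
# Hypothetical sketch of the Levenshtein-based partial matching decision.
# The 0.3 threshold is from the notes above; the normalization is an assumption.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                    # deletion
                current[j - 1] + 1,                 # insertion
                previous[j - 1] + (ca != cb),       # substitution
            ))
        previous = current
    return previous[-1]


def is_match(query: str, candidate: str, threshold: float = 0.3) -> bool:
    """Accept the candidate question if the normalized edit distance is small."""
    distance = levenshtein(query.lower(), candidate.lower())
    return distance / max(len(query), len(candidate), 1) <= threshold
```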