llama.cpp/examples/llama-eval/llama-eval-discussion.md

# llama-eval Implementation Discussion

## Overview
Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.

## Key Requirements from ggerganov

### 1. Simplify and Focus on One Eval
- Start with AIME2025 (most familiar with it)
- Don't support multiple evals initially

### 2. Implement an "eval state" object
- ID
- List of tasks
- Task states
- Sampling config

### 3. Implement a "processor" object
- List of endpoints
- Threads per endpoint
- Grade/judge type (regex, endpoint, or CLI tool)

### 4. Processor responsibilities
- Accepts eval state
- Starts processing
- Dumps eval state periodically as it progresses

### 5. Real-time feedback
- Default: show "correct / not correct" for each task
- Verbose mode: show produced answer vs expected answer as soon as it completes

### 6. Grading approach
- Abstract grading to support external "grader" or "judge"
- Use LLM post-processing instead of regex (to avoid issues from GPT-OSS evals)

### 7. Output format
- Use structured output (JSON) instead of boxed text

## Current Implementation Analysis

### What exists in llama-eval.py:
- Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
- Regex-based answer extraction
- HTTP requests to OpenAI-compatible endpoint
- Checkpointing/resume capability
- Thread-based parallel execution
- Summary reporting

### What needs to be removed:
- All task implementations except AIME
- Regex-based grading
- Multiple endpoint support
- Complex task loading logic
- Summary reporting (replace with real-time feedback)

## Discussion Points

### 1. Eval State Object Structure
**Status: Under Discussion**

Questions:
- What fields should be in the eval state object?
- Should it include the actual prompts, or just metadata?
- How should task states be tracked?

### 2. Processor Architecture
**Status: Not Started**

Questions:
- Should the processor handle multiple endpoints (for distributed evaluation)?
- What's the threading model?
- How are endpoints configured?

### 3. Grader Interface
**Status: Not Started**

Questions:
- How should the grader be configured?
- Should it be a separate service, or a local LLM call?
- What's the interface for grading?

### 4. Checkpointing
**Status: Not Started**

Questions:
- Should the eval state be serialized to disk?
- How often should it be dumped?
- What format should it use?

### 5. Real-time Output
**Status: Not Started**

Questions:
- How should progress be displayed?
- Console output, file logging, or both?
- What verbosity levels are needed?

### 6. Output Format
**Status: Not Started**

Questions:
- Should responses be in JSON format?
- How should the grader interface work with JSON output?

## Next Steps

1. **Eval State Object** - Currently discussing
2. Processor Architecture
3. Grader Interface
4. Checkpointing
5. Real-time Output
6. Output Format

## References
- PR #18892: https://github.com/ggml-org/llama.cpp/pull/18892
- Discussion #18195: https://github.com/ggml-org/llama.cpp/discussions/18195

## Session Work Summary

### llama-server-simulator Implementation

**Created:**
- `llama-server-simulator.py` - Standalone Python script simulating llama-server HTTP endpoint
- `test-simulator.sh` - Test script for verifying simulator functionality
- `llama-server-simulator-plan.md` - Implementation plan
- `simulator-summary.md` - Summary of implementation

**Features Implemented:**
1. HTTP Server - Flask-based `/v1/chat/completions` endpoint with OpenAI-compatible format
2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
5. Debug Logging - Helps troubleshoot matching issues

**Testing Results:**
- ✅ Correct answers returned when success rate allows
- ✅ Wrong answers returned when success rate doesn't allow
- ✅ No matching questions return errors
- ✅ Success rate verified (80% in 10 requests)
- ✅ HuggingFace dataset caching working correctly

**Key Technical Decisions:**
- Used Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via HuggingFace datasets library
- Wrong answers generated by incrementing expected answer
- Debug output written to stderr for better visibility

**Refactoring:**
- Extracted repeating question string into TEST_QUESTION variable
- Created make_request() helper function to reduce code duplication
- Added proper error handling for error responses
- Fixed simulator stopping issue at script completion