llama.cpp/examples/llama-eval/llama-eval-discussion.md

llama-eval Implementation Discussion

Overview

Notes on implementing a lean evaluation tool for llama.cpp, based on ggerganov's feedback in PR #18892.

Key Requirements from ggerganov

1. Simplify and Focus on One Eval

  • Start with AIME2025 (most familiar with it)
  • Don't support multiple evals initially

2. Implement an "eval state" object

  • ID
  • List of tasks
  • Task states
  • Sampling config
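
The eval state fields above can be sketched as a small dataclass hierarchy. This is a hypothetical layout, not the PR's final design; field names like `predicted` and `status` are illustrative.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TaskState:
    """Per-task record; field names here are illustrative, not from the PR."""
    prompt: str
    expected: str
    predicted: str = ""
    status: str = "pending"  # pending | running | correct | incorrect

@dataclass
class EvalState:
    """Eval state object: ID, list of tasks (with states), sampling config."""
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    tasks: list = field(default_factory=list)  # list of TaskState
    sampling: dict = field(default_factory=lambda: {"temperature": 0.6, "top_p": 0.95})

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so TaskState serializes too
        return json.dumps(asdict(self), indent=2)

state = EvalState(tasks=[TaskState(prompt="What is 1 + 1?", expected="2")])
```

Keeping the state a plain serializable object means the processor can dump it to disk at any point without extra bookkeeping.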

3. Implement a "processor" object

  • List of endpoints
  • Threads per endpoint
  • Grade/judge type (regex, endpoint, or CLI tool)

4. Processor responsibilities

  • Accepts eval state
  • Starts processing
  • Dumps eval state periodically as it progresses
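
The processor responsibilities above could look roughly like this sketch: it accepts an eval state, fans tasks out over a thread pool sized by endpoints x threads-per-endpoint, and dumps the state periodically as tasks complete. The `send()` callable is a stand-in for the real HTTP request.

```python
import concurrent.futures
import json
import os
import tempfile
import threading

class Processor:
    """Processor sketch: a list of endpoints, N threads per endpoint, and a
    periodic dump of the eval state as tasks complete."""

    def __init__(self, endpoints, threads_per_endpoint=4, dump_every=8,
                 dump_path="eval-state.json"):
        self.endpoints = endpoints
        self.max_workers = threads_per_endpoint * max(1, len(endpoints))
        self.dump_every = dump_every
        self.dump_path = dump_path
        self._lock = threading.Lock()
        self._done = 0

    def run(self, state, send):
        with concurrent.futures.ThreadPoolExecutor(self.max_workers) as pool:
            for i, task in enumerate(state["tasks"]):
                endpoint = self.endpoints[i % len(self.endpoints)]
                pool.submit(self._process, state, task, endpoint, send)
        self._dump(state)  # final dump once every task has finished

    def _process(self, state, task, endpoint, send):
        task["predicted"] = send(endpoint, task["prompt"])
        with self._lock:
            self._done += 1
            if self._done % self.dump_every == 0:
                self._dump(state)

    def _dump(self, state):
        with open(self.dump_path, "w") as f:
            json.dump(state, f, indent=2)

# Demo with a stubbed send(); the real tool would POST to /v1/chat/completions.
demo_path = os.path.join(tempfile.gettempdir(), "eval-state-demo.json")
state = {"id": "demo", "tasks": [{"prompt": "Q1"}, {"prompt": "Q2"}, {"prompt": "Q3"}]}
proc = Processor(endpoints=["http://localhost:8033"], threads_per_endpoint=2,
                 dump_every=2, dump_path=demo_path)
proc.run(state, send=lambda endpoint, prompt: "stub answer")
```

Because tasks mutate the shared state dict in place, the periodic dump always reflects whatever has completed so far, which is what enables resume-from-checkpoint later.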

5. Real-time feedback

  • Default: show "correct / not correct" for each task
  • Verbose mode: show produced answer vs expected answer as soon as it completes

6. Grading approach

  • Abstract grading to support external "grader" or "judge"
  • Use LLM post-processing instead of regex (to avoid the answer-extraction issues seen in the GPT-OSS evals)
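
The grading abstraction could be a minimal interface with interchangeable implementations, e.g. a regex baseline and an LLM judge. Class names and the judge prompt below are illustrative; `ask` stands in for a chat-completions call.

```python
import re

class Grader:
    """Abstract grader/judge interface; concrete graders implement grade()."""
    def grade(self, predicted: str, expected: str) -> bool:
        raise NotImplementedError

class RegexGrader(Grader):
    """Regex baseline (the approach the feedback suggests moving away from)."""
    def __init__(self, pattern=r"\\boxed\{(\d+)\}"):
        self.pattern = re.compile(pattern)

    def grade(self, predicted, expected):
        m = self.pattern.search(predicted)
        return m is not None and m.group(1) == expected

class JudgeGrader(Grader):
    """LLM-as-judge grader: asks an endpoint whether the answer matches.
    `ask` is a stand-in for a request to a judge endpoint."""
    def __init__(self, ask):
        self.ask = ask

    def grade(self, predicted, expected):
        verdict = self.ask(
            f"Does the following answer equal {expected}? Reply YES or NO.\n{predicted}")
        return verdict.strip().upper().startswith("YES")
```

The processor only ever calls `grade()`, so swapping regex for an LLM judge (or a CLI tool) does not touch the processing loop.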

7. Output format

  • Use structured output (JSON) instead of boxed text
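
A sketch of what structured output could look like on the wire: the request asks the model to reply as JSON, and grading becomes a field lookup instead of a regex over free text. llama-server exposes an OpenAI-compatible API; treat the exact `response_format` support as an assumption of this sketch.

```python
import json

# Hypothetical /v1/chat/completions request body asking for JSON output.
request_body = {
    "messages": [
        {"role": "user",
         "content": 'Solve the problem. Reply as JSON with an "answer" field.'},
    ],
    "response_format": {"type": "json_object"},  # assumption: endpoint honors this
    "max_tokens": 2048,
}

# Grading then reads a field instead of parsing \boxed{...} text:
response_text = '{"answer": "204"}'  # example model reply
answer = json.loads(response_text).get("answer")
```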

Current Implementation Analysis

What exists in llama-eval.py:

  • Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
  • Regex-based answer extraction
  • HTTP requests to OpenAI-compatible endpoint
  • Checkpointing/resume capability
  • Thread-based parallel execution
  • Summary reporting

What needs to be removed:

  • All task implementations except AIME
  • Regex-based grading
  • Multiple endpoint support
  • Complex task loading logic
  • Summary reporting (replace with real-time feedback)

Discussion Points

1. Eval State Object Structure

Status: Under Discussion

Questions:

  • What fields should be in the eval state object?
  • Should it include the actual prompts, or just metadata?
  • How should task states be tracked?

2. Processor Architecture

Status: Not Started

Questions:

  • Should the processor handle multiple endpoints (for distributed evaluation)?
  • What's the threading model?
  • How are endpoints configured?

3. Grader Interface

Status: Not Started

Questions:

  • How should the grader be configured?
  • Should it be a separate service, or a local LLM call?
  • What's the interface for grading?

4. Checkpointing

Status: Not Started

Questions:

  • Should the eval state be serialized to disk?
  • How often should it be dumped?
  • What format should it use?
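
Whatever cadence and format are chosen, the dump itself should be atomic so an interrupted write never leaves a truncated checkpoint. A minimal sketch, assuming JSON on disk:

```python
import json
import os
import tempfile

def dump_eval_state(state: dict, path: str) -> None:
    """Serialize the eval state as JSON atomically: write to a temp file in
    the same directory, then rename over the target, so a crash mid-write
    never corrupts an existing checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp, path)  # atomic on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise

checkpoint = os.path.join(tempfile.gettempdir(), "llama-eval-checkpoint.json")
dump_eval_state({"id": "demo", "tasks": []}, checkpoint)
```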

5. Real-time Output

Status: Not Started

Questions:

  • How should progress be displayed?
  • Console output, file logging, or both?
  • What verbosity levels are needed?

6. Output Format

Status: Not Started

Questions:

  • Should responses be in JSON format?
  • How should the grader interface work with JSON output?

Next Steps

  1. Eval State Object - Currently discussing
  2. Processor Architecture
  3. Grader Interface
  4. Checkpointing
  5. Real-time Output
  6. Output Format

Session Work Summary

llama-server-simulator Implementation

Created:

  • llama-server-simulator.py - Standalone Python script simulating llama-server HTTP endpoint
  • test-simulator.sh - Test script for verifying simulator functionality
  • llama-server-simulator-plan.md - Implementation plan
  • simulator-summary.md - Summary of implementation

Features Implemented:

  1. HTTP Server - Flask-based /v1/chat/completions endpoint with OpenAI-compatible format
  2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
  3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
  4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
  5. Debug Logging - Helps troubleshoot matching issues

Testing Results:

  • Correct answers returned when success rate allows
  • Wrong answers returned when success rate doesn't allow
  • Requests with no matching question return an error
  • Configured success rate verified (8 of 10 requests answered correctly at 0.8)
  • HuggingFace dataset caching working correctly

Key Technical Decisions:

  • Used Levenshtein distance for partial matching (threshold: 0.3)
  • Automatic caching via HuggingFace datasets library
  • Wrong answers generated by incrementing expected answer
  • Debug output written to stderr for better visibility
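
The partial-matching decision can be sketched as a normalized edit distance with the 0.3 threshold mentioned above (pure-Python Levenshtein; the actual simulator may use a library implementation).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def matches(query: str, candidate: str, threshold: float = 0.3) -> bool:
    """Accept a candidate question when the edit distance, normalized by the
    longer string's length, is at or below the threshold."""
    denom = max(len(query), len(candidate)) or 1
    return levenshtein(query, candidate) / denom <= threshold
```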

Refactoring:

  • Extracted repeating question string into TEST_QUESTION variable
  • Created make_request() helper function to reduce code duplication
  • Added proper handling of error responses
  • Fixed simulator stopping issue at script completion

llama-eval-new.py Implementation

Created:

  • llama-eval-new.py - Simplified evaluation tool focused on AIME

Features Implemented:

  1. Eval State Object - Structured dataclass with ID, tasks, task states, and sampling config
  2. Processor Object - Handles processing, grading, and state management
  3. Real-time Feedback - Shows correct/incorrect status for each case
  4. Flexible Grading System - Supports regex and CLI-based grading
  5. Structured JSON Output - Saves complete eval state to JSON file
  6. HuggingFace Dataset Caching - Uses cached dataset path to avoid HF Hub requests

Grading System:

  • Regex Grading: Built-in patterns for different task types
    • aime: \boxed{(\d+)}|\b(\d+)\b (handles boxed and plain text)
    • gsm8k: \b(\d+)\b (extract first number)
    • mmlu, hellaswag, arc, winogrande: [A-D] (extract single letter)
  • CLI Grading: External script interface
    • Script accepts --answer <pred> and --expected <gold>
    • Returns exit code 0 if correct, non-zero if incorrect
    • 30-second timeout to prevent hanging
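
For the aime pattern specifically, one subtlety of the single alternation `\boxed{(\d+)}|\b(\d+)\b` is that a plain number appearing before the boxed answer would win. A two-step extraction that prefers the boxed form avoids this; the helper below is an illustrative sketch, not the tool's exact code.

```python
import re

def extract_aime_answer(text: str):
    """Prefer a \\boxed{N} answer anywhere in the text; fall back to the
    first plain integer; return None if neither form is present."""
    boxed = re.search(r"\\boxed\{(\d+)\}", text)
    if boxed:
        return boxed.group(1)
    plain = re.search(r"\b(\d+)\b", text)
    return plain.group(1) if plain else None
```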

Configuration Options:

  • --server: llama-server URL (default: http://localhost:8033)
  • --n_cases: Number of cases to evaluate (default: all)
  • --n_predict: Max tokens to predict per prompt (default: 2048)
  • --threads: Number of threads for parallel requests (default: 32)
  • --verbose: Show detailed output for each case
  • --output: Output file for eval state (default: llama-eval-state.json)
  • --grader-type: regex or cli
  • --grader-regex-type: aime, gsm8k, mmlu, hellaswag, arc, winogrande
  • --grader-script: Path to CLI grader script
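
The option list above maps directly onto an argparse parser; this mirror uses the stated defaults, with the `--grader-type` default being an assumption.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options and defaults."""
    p = argparse.ArgumentParser(prog="llama-eval-new")
    p.add_argument("--server", default="http://localhost:8033")
    p.add_argument("--n_cases", type=int, default=None)   # None = all cases
    p.add_argument("--n_predict", type=int, default=2048)
    p.add_argument("--threads", type=int, default=32)
    p.add_argument("--verbose", action="store_true")
    p.add_argument("--output", default="llama-eval-state.json")
    p.add_argument("--grader-type", choices=["regex", "cli"],
                   default="regex")  # default is an assumption
    p.add_argument("--grader-regex-type",
                   choices=["aime", "gsm8k", "mmlu", "hellaswag",
                            "arc", "winogrande"], default="aime")
    p.add_argument("--grader-script", default=None)
    return p
```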

Testing Results:

  • Works with simulator at 100% success rate (all correct)
  • Works with simulator at 0% success rate (all incorrect)
  • Works with simulator at 80% success rate (8/10 correct)
  • Real-time verbose output shows gold/pred/status for each case
  • JSON output contains complete eval state with all cases
  • HF Hub telemetry disabled (no warnings)
  • Uses cached dataset path to avoid HF Hub requests when available

Key Technical Decisions:

  • Removed Levenshtein matching; the eval script only sends requests and validates answers
  • Abstract grading interface for external grader support
  • Exact match requirement for regex patterns
  • Handles both boxed and plain text formats for AIME answers
  • 30-second timeout for CLI grader
  • Validates script exists before running
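
The CLI-grader invocation with its 30-second timeout can be sketched as follows; the demo grader written to disk at the end is purely for self-containment and is not part of the tool.

```python
import os
import subprocess
import sys
import tempfile

def grade_with_cli(grader_cmd, predicted, expected, timeout=30.0):
    """Run an external grader with --answer/--expected flags; exit code 0
    means correct. The timeout guards against a hung grader, which is
    treated as incorrect rather than blocking the whole eval."""
    try:
        result = subprocess.run(
            list(grader_cmd) + ["--answer", predicted, "--expected", expected],
            capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Tiny demo grader written to disk so this sketch is runnable end to end:
# it exits 0 exactly when --answer equals --expected.
demo_grader = os.path.join(tempfile.gettempdir(), "demo_grader.py")
with open(demo_grader, "w") as f:
    f.write("import sys\n"
            "a = sys.argv[sys.argv.index('--answer') + 1]\n"
            "e = sys.argv[sys.argv.index('--expected') + 1]\n"
            "sys.exit(0 if a == e else 1)\n")
```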

Refactoring:

  • Removed all task implementations except AIME
  • Removed regex-based grading (moved to flexible grader system)
  • Removed multiple endpoint support
  • Removed complex task loading logic
  • Removed summary reporting (replaced with real-time feedback)
  • Added HuggingFace dataset caching optimization