llama.cpp/examples/llama-eval/llama-eval-discussion.md

llama-eval Implementation Discussion

Overview

Notes from a discussion about implementing a lean evaluation tool for llama.cpp, based on ggerganov's feedback in PR #18892.

Key Requirements from ggerganov

1. Simplify and Focus on One Eval

  • Start with AIME2025 (the eval he is most familiar with)
  • Don't support multiple evals initially

2. Implement an "eval state" object

  • ID
  • List of tasks
  • Task states
  • Sampling config

3. Implement a "processor" object

  • List of endpoints
  • Threads per endpoint
  • Grade/judge type (regex, endpoint, or CLI tool)
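
The processor configuration could look something like the following; again, these names are assumptions used for illustration:

```python
# Sketch of the "processor" configuration described above.
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    url: str            # OpenAI-compatible /v1/chat/completions base URL
    n_threads: int = 8  # concurrent requests allowed against this endpoint

@dataclass
class ProcessorConfig:
    endpoints: list[Endpoint] = field(default_factory=list)
    grader_type: str = "regex"   # "regex", "endpoint", or "cli"
```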

4. Processor responsibilities

  • Accepts eval state
  • Starts processing
  • Dumps eval state periodically as it progresses

5. Real-time feedback

  • Default: show "correct / not correct" for each task
  • Verbose mode: show produced answer vs expected answer as soon as it completes

6. Grading approach

  • Abstract grading to support external "grader" or "judge"
  • Use LLM post-processing instead of regex (to avoid the extraction issues encountered in the GPT-OSS evals)

7. Output format

  • Use structured output (JSON) instead of boxed text

Current Implementation Analysis

What exists in llama-eval.py:

  • Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
  • Regex-based answer extraction
  • HTTP requests to OpenAI-compatible endpoint
  • Checkpointing/resume capability
  • Thread-based parallel execution
  • Summary reporting

What needs to be removed:

  • All task implementations except AIME
  • Regex-based grading
  • Multiple endpoint support
  • Complex task loading logic
  • Summary reporting (replace with real-time feedback)

Discussion Points

1. Eval State Object Structure

Status: Under Discussion

Questions:

  • What fields should be in the eval state object?
  • Should it include the actual prompts, or just metadata?
  • How should task states be tracked?

2. Processor Architecture

Status: Not Started

Questions:

  • Should the processor handle multiple endpoints (for distributed evaluation)?
  • What's the threading model?
  • How are endpoints configured?

3. Grader Interface

Status: Not Started

Questions:

  • How should the grader be configured?
  • Should it be a separate service, or a local LLM call?
  • What's the interface for grading?

4. Checkpointing

Status: Not Started

Questions:

  • Should the eval state be serialized to disk?
  • How often should it be dumped?
  • What format should it use?

5. Real-time Output

Status: Not Started

Questions:

  • How should progress be displayed?
  • Console output, file logging, or both?
  • What verbosity levels are needed?

6. Output Format

Status: Not Started

Questions:

  • Should responses be in JSON format?
  • How should the grader interface work with JSON output?

Next Steps

  1. Eval State Object - Currently discussing
  2. Processor Architecture
  3. Grader Interface
  4. Checkpointing
  5. Real-time Output
  6. Output Format

References

Session Work Summary

llama-server-simulator Implementation

Created:

  • llama-server-simulator.py - Standalone Python script simulating llama-server HTTP endpoint
  • test-simulator.sh - Test script for verifying simulator functionality
  • llama-server-simulator-plan.md - Implementation plan
  • simulator-summary.md - Summary of implementation

Features Implemented:

  1. HTTP Server - Flask-based /v1/chat/completions endpoint with OpenAI-compatible format
  2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
  3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
  4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
  5. Debug Logging - Helps troubleshoot matching issues

Testing Results:

  • Correct answers returned when the success-rate roll passes
  • Wrong answers returned when the success-rate roll fails
  • Requests with no matching question return an error
  • Success rate verified empirically (8/10 correct over 10 requests)
  • HuggingFace dataset caching working correctly

Key Technical Decisions:

  • Used Levenshtein distance for partial matching (threshold: 0.3)
  • Automatic caching via HuggingFace datasets library
  • Wrong answers generated by incrementing expected answer
  • Debug output written to stderr for better visibility
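
The matching decision above can be sketched as follows. The 0.3 threshold comes from the text; the distance function here is a plain stdlib implementation for illustration (the simulator may use a dedicated library instead):

```python
# Sketch of the question-matching idea: exact match first, then a
# normalized Levenshtein distance with a 0.3 threshold.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def matches(query: str, candidate: str, threshold: float = 0.3) -> bool:
    if query == candidate:
        return True
    longest = max(len(query), len(candidate)) or 1
    return levenshtein(query, candidate) / longest <= threshold
```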

Refactoring:

  • Extracted repeating question string into TEST_QUESTION variable
  • Created make_request() helper function to reduce code duplication
  • Added proper error handling for error responses
  • Fixed simulator stopping issue at script completion

llama-eval-new.py Implementation

Created:

  • llama-eval-new.py - Simplified evaluation tool focused on AIME

Features Implemented:

  1. Eval State Object - Structured dataclass with ID, tasks, task states, and sampling config
  2. Processor Object - Handles processing, grading, and state management
  3. Real-time Feedback - Shows correct/incorrect status for each case
  4. Flexible Grading System - Supports regex, CLI, and LLM-based grading
  5. Structured JSON Output - Saves complete eval state to JSON file
  6. HuggingFace Dataset Caching - Uses cached dataset path to avoid HF Hub requests
  7. Enhanced Answer Extraction - Extracts answers from full responses for display

Grading System:

  • Regex Grading: Built-in patterns for different task types
    • aime: \boxed{(\d+)}|\b(\d+)\b (handles boxed and plain text)
    • gsm8k: \b(\d+)\b (extract first number)
    • mmlu, hellaswag, arc, winogrande: [A-D] (extract single letter)
  • CLI Grading: External script interface
    • Script accepts --answer <pred> and --expected <gold>
    • Returns exit code 0 if correct, non-zero if incorrect
    • 30-second timeout to prevent hanging
  • LLM Judge: Generic answer extraction using LLM
    • Uses configured server and model for extraction
    • Includes problem statement in prompt for context
    • Case-insensitive comparison
    • Returns extracted answer for display

Configuration Options:

  • --server: llama-server URL (default: http://localhost:8033)
  • --n_cases: Number of cases to evaluate (default: all)
  • --n_predict: Max tokens to predict per prompt (default: 2048)
  • --threads: Number of threads for parallel requests (default: 32)
  • --verbose: Show detailed output for each case
  • --output: Output file for eval state (default: llama-eval-state.json)
  • --grader-type: regex, cli, or llm
  • --grader-regex-type: aime, gsm8k, mmlu, hellaswag, arc, winogrande
  • --grader-script: Path to CLI grader script
  • --judge-server: Server URL for LLM judge (default: same as main server)
  • --judge-model: Model name for LLM judge (default: same as main model)

Testing Results:

  • Works with simulator at 100% success rate (all correct)
  • Works with simulator at 0% success rate (all incorrect)
  • Works with simulator at 80% success rate (8/10 correct)
  • Real-time verbose output shows gold/pred/status for each case
  • JSON output contains complete eval state with all cases
  • HF Hub telemetry disabled (no warnings)
  • Uses cached dataset path to avoid HF Hub requests when available
  • Regex grader extracts answers correctly from various formats
  • LLM judge can extract answers with problem context
  • Response truncation focuses grading on final answer
  • Case-insensitive matching works for both regex and LLM grader
  • Judge model and server configuration propagate correctly
  • Progress table shows extracted answers instead of full responses

Key Technical Decisions:

  • Removed Levenshtein matching - eval script only sends requests and validates answers
  • Abstract grading interface for external grader support
  • Exact match requirement for regex patterns
  • Handles both boxed and plain text formats for AIME answers
  • 30-second timeout for CLI grader
  • Validates script exists before running
  • Judge parameters set once during Grader construction
  • LLM judge prompt includes problem statement for better extraction
  • Response truncation to last 2-3 lines focuses grading on final answer
  • Case-insensitive comparison for more flexible matching
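
The truncation decision above can be sketched as a one-liner: keep only the last few non-empty lines so the grader sees the final answer rather than the full chain of thought. The helper name mirrors the _truncate_response() mentioned above; the exact line count is an assumption:

```python
# Sketch of the response-truncation step applied before grading.
def truncate_response(response: str, keep_lines: int = 3) -> str:
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return "\n".join(lines[-keep_lines:])
```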

Refactoring:

  • Removed all task implementations except AIME
  • Removed regex-based grading (moved to flexible grader system)
  • Removed multiple endpoint support
  • Removed complex task loading logic
  • Removed summary reporting (replaced with real-time feedback)
  • Added HuggingFace dataset caching optimization
  • Added LLM grader support with configurable server and model
  • Added response truncation before grading
  • Refactored grader interface to return extracted answers

llama-eval-new.py Threading and Model Parameter Updates

Changes Made:

  1. Threading Support - Added ThreadPoolExecutor for parallel request processing

    • Added from concurrent.futures import ThreadPoolExecutor, as_completed
    • Created _process_single_case() method for thread-safe case processing
    • Refactored process() to use ThreadPoolExecutor with configurable thread count
    • Updated progress tracking to work with concurrent execution
    • Thread-safe eval state updates (task_states and counters)
  2. Model Parameter - Added --model argument to specify model name in request data

    • Added model_name parameter to Processor.__init__()
    • Updated _make_request() to use provided model name or default to "llama"
    • Added --model argument to argument parser
    • Model name is included in the request JSON, e.g. "model": "gpt-oss-20b-hf"
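
The threading pattern described above can be sketched like this. process_case is a stand-in for the real _process_single_case() that sends an HTTP request; the lock around the shared counter illustrates the thread-safe update mentioned in the text:

```python
# Sketch of the ThreadPoolExecutor-based processing loop.
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

def run_cases(cases, process_case, n_threads=32):
    correct = 0
    lock = threading.Lock()  # protects the shared counter

    def worker(case):
        nonlocal correct
        ok = process_case(case)
        with lock:
            if ok:
                correct += 1
        return ok

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(worker, c) for c in cases]
        for fut in as_completed(futures):
            fut.result()  # re-raise any worker exception here
    return correct
```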

Testing Results:

  • Works with 2 threads (5 cases processed in ~0.2s)
  • Works with 4 threads (slightly faster throughput)
  • Model parameter correctly added to request data
  • Thread-safe progress tracking with tqdm
  • No race conditions in eval state updates

Key Technical Decisions:

  • Used ThreadPoolExecutor for simple, effective parallelism
  • No rate limiting needed (server can handle concurrent requests)
  • Thread-safe counter updates for correct/total tracking
  • Progress bar shows completion status across all threads
  • Model parameter is optional - defaults to "llama" if not specified

Refactoring:

  • Extracted single case processing into _process_single_case() method
  • Changed from sequential loop to ThreadPoolExecutor with futures
  • Updated verbose output to show total count instead of index
  • Made eval state updates thread-safe

llama-eval-new.py Enhanced Grading System

Changes Made:

  1. Enhanced Grader Interface - Updated to return extracted answers

    • grade() method now returns Tuple[bool, Optional[str]] (correctness + extracted answer)
    • Added extracted field to TaskState dataclass
    • All grader types (regex, cli, llm) now return extracted answers
  2. Improved Regex Grader

    • New _extract_answer_regex() method extracts answers using configured patterns
    • Supports case-insensitive matching
    • Returns first valid match found
    • Handles both single values and multiple matches
  3. LLM-Based Judge

    • New _grade_llm() method for generic answer extraction
    • Includes problem statement in prompt for context
    • Configurable server URL (defaults to main server)
    • Configurable model name (defaults to main model)
    • Case-insensitive comparison
    • Returns extracted answer for display
  4. Response Truncation

    • New _truncate_response() method keeps only last 2-3 lines
    • Applied before grading to focus on final answer section
  5. CLI Grader Update

    • Now also returns extracted answer
    • Returns None if grading fails
  6. Display Updates

    • Progress table shows extracted answer instead of full response
    • Verbose mode shows full response plus extracted answer
  7. New CLI Arguments

    • --grader-type: Added "llm" option
    • --judge-server: Separate server for LLM judge
    • --judge-model: Separate model for LLM judge

Testing Results:

  • Regex grader extracts answers correctly from various formats
  • LLM judge can extract answers with problem context
  • Response truncation focuses grading on final answer
  • Case-insensitive matching works for both regex and LLM grader
  • Judge model and server configuration propagate correctly
  • Progress table shows extracted answers instead of full responses

Key Technical Decisions:

  • Judge parameters set once during Grader construction (not on each call)
  • LLM judge prompt includes problem statement for better extraction
  • Response truncation to last 2-3 lines focuses grading on final answer
  • Case-insensitive comparison for more flexible matching
  • Judge configuration propagates through Processor to Grader
  • Display shows extracted answer for cleaner output

Refactoring:

  • Removed judge parameters from grade() method calls
  • Added judge_server_url and judge_model_name to Grader class
  • Updated _grade_llm() to use instance variables instead of parameters
  • Simplified Processor initialization to pass judge config to grader
  • Updated startup info to show judge server and model

llama-eval-new.py GSM8K Dataset Support

Changes Made:

  1. GSM8K Dataset Integration - Added support for GSM8K dataset alongside AIME

    • Created Gsm8kDataset class with proper answer extraction logic
    • GSM8K uses "question" field instead of "problem" field
    • GSM8K answer field contains full reasoning with #### prefix
    • Extracts numeric answer from answer field during initialization
    • Uses same regex grader pattern as AIME (\b(\d+)\b)
  2. Dataset Type Configuration - Added dataset selection support

    • Added --dataset CLI argument with choices aime and gsm8k
    • Updated Processor class to accept dataset_type parameter
    • Dataset-specific initialization in Processor.__init__()
    • Dataset name displayed in task summary table
  3. Template Registry - Added dataset-specific prompt templates

    • AIME template: includes \boxed{} wrapper for final answer
    • GSM8K template: plain text answer without wrapper
    • Templates applied based on question["dataset_type"] field
  4. Answer Extraction Logic - Fixed GSM8K answer extraction

    • GSM8K has pre-extracted "gold" field with numeric answer
    • Gsm8kDataset.get_answer() checks for "gold" field first
    • Falls back to answer field if gold field not present
    • AimeDataset.get_answer() simplified to remove duplicate method
  5. Task ID Format - Fixed duplicate prefix in task IDs

    • Changed from f"{dataset_type}_{eval_state.id}_{chunk_idx:03d}_{i:03d}"
    • To f"{dataset_type}_{chunk_idx:03d}_{i:03d}"
    • Removed redundant eval_state.id (was "gsm8k" for GSM8K)
  6. Column Width Adjustments - Improved table formatting

    • Task ID column: 25 characters
    • Dataset column: 5 characters
    • Prompt column: 40 characters
    • Expected column: 10 characters

Testing Results:

  • GSM8K dataset loads correctly with 7473 questions
  • Numeric answers extracted from full reasoning text
  • Task summary table displays correctly with adjusted column widths
  • Task IDs show correct format (e.g., gsm8k_000_3169)
  • Both AIME and GSM8K datasets work with same script
  • Answer extraction works for both boxed and plain text formats
  • Progress tracking shows extracted answers for both datasets

Key Technical Decisions:

  • GSM8K uses "question" field instead of "problem" field
  • GSM8K answer field contains full reasoning with #### prefix
  • Numeric answer extracted during dataset initialization
  • Same regex grader pattern works for both datasets
  • Dataset selection via CLI argument for separate runs
  • Template registry supports different prompt formats per dataset
  • Task ID format simplified to avoid duplication

Refactoring:

  • Removed duplicate get_question() method from AimeDataset
  • Removed "2025" suffix from eval state ID (a remnant of the old version)
  • Removed "2025" suffix from task summary table output
  • Removed "2025" suffix from progress tracking output
  • Updated Processor.__init__() to initialize appropriate dataset based on type
  • Updated _process_single_case() to handle both "problem" and "question" fields
  • Updated process() method to display dataset name and use dataset_type for task states