
llama-eval Implementation Discussion

Overview

Discussion about implementing a lean evaluation tool for llama.cpp based on ggerganov's feedback in PR #18892.

Key Requirements from ggerganov

1. Simplify and Focus on One Eval

  • Start with AIME2025 (the eval he is most familiar with)
  • Don't support multiple evals initially

2. Implement an "eval state" object

  • ID
  • List of tasks
  • Task states
  • Sampling config

3. Implement a "processor" object

  • List of endpoints
  • Threads per endpoint
  • Grade/judge type (regex, endpoint, or CLI tool)

4. Processor responsibilities

  • Accepts eval state
  • Starts processing
  • Dumps eval state periodically as it progresses
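Points 3-5 together can be sketched as a small processor class. This is an illustrative sketch, not the PR's design: tasks are plain dicts, `generate` stands in for the HTTP call to an endpoint, and `grader` is any callable `(produced, expected) -> bool`:

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor

class Processor:
    """Hypothetical sketch: drives a task list to completion over N threads,
    dumping the full state after every finished task."""

    def __init__(self, endpoints, threads_per_endpoint, grader, dump_path):
        self.endpoints = endpoints
        self.n_threads = threads_per_endpoint * len(endpoints)
        self.grader = grader
        self.dump_path = dump_path
        self._lock = threading.Lock()

    def _dump(self, tasks):
        # periodic state dump: rewrite the checkpoint after each task
        with self._lock, open(self.dump_path, "w") as f:
            json.dump(tasks, f, indent=2)

    def _run_task(self, tasks, task, generate):
        task["produced"] = generate(task["prompt"])   # HTTP call in a real impl
        task["correct"] = self.grader(task["produced"], task["expected"])
        task["status"] = "done"
        # default real-time feedback: one line per completed task
        print(f"task {task['id']}: {'correct' if task['correct'] else 'not correct'}")
        self._dump(tasks)

    def run(self, tasks, generate):
        with ThreadPoolExecutor(max_workers=self.n_threads) as ex:
            for task in tasks:
                if task["status"] != "done":   # resume: skip finished tasks
                    ex.submit(self._run_task, tasks, task, generate)
```

Accepting the eval state as input and skipping already-done tasks gives checkpoint/resume almost for free.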

5. Real-time feedback

  • Default: show "correct / not correct" for each task
  • Verbose mode: show produced answer vs expected answer as soon as it completes

6. Grading approach

  • Abstract grading to support external "grader" or "judge"
  • Use LLM post-processing instead of regex (to avoid the extraction issues seen in GPT-OSS evals)
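The abstraction might look like a small grader interface with interchangeable implementations, as listed under the processor object (regex, endpoint, or CLI tool). A sketch, with the judge prompt and the `complete` callable being assumptions of this example rather than anything from the PR:

```python
import re
from abc import ABC, abstractmethod

class Grader(ABC):
    @abstractmethod
    def grade(self, produced: str, expected: str) -> bool: ...

class RegexGrader(Grader):
    """Legacy approach: extract the last \\boxed{...} span and compare."""
    ANSWER_RE = re.compile(r"\\boxed\{(.+?)\}")

    def grade(self, produced, expected):
        matches = self.ANSWER_RE.findall(produced)
        return bool(matches) and matches[-1].strip() == expected.strip()

class JudgeGrader(Grader):
    """LLM-as-judge: `complete` is any callable that sends a prompt to an
    OpenAI-compatible endpoint and returns the reply text."""

    def __init__(self, complete):
        self.complete = complete

    def grade(self, produced, expected):
        reply = self.complete(
            f"Expected answer: {expected}\nModel output: {produced}\n"
            "Do they express the same final answer? Reply YES or NO.")
        return reply.strip().upper().startswith("YES")
```

With this shape, a CLI-tool grader would just be a third subclass that shells out and interprets the exit code or stdout.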

7. Output format

  • Use structured output (JSON) instead of boxed text
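One way to get structured output from an OpenAI-compatible endpoint is a `response_format` JSON schema constraining the reply to a single answer field; this payload is a sketch under that assumption (schema name and prompt are illustrative):

```python
import json

# Hypothetical request payload: constrain the model's reply to a JSON
# object instead of free-form boxed text.
payload = {
    "messages": [
        {"role": "user",
         "content": "Solve the problem. Give only the final answer."},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "aime_answer",
            "schema": {
                "type": "object",
                "properties": {"answer": {"type": "integer"}},
                "required": ["answer"],
            },
        },
    },
}

def extract_answer(content: str) -> int:
    # with structured output, extraction reduces to parsing a JSON field
    return json.loads(content)["answer"]
```

This is what makes the regex extractor removable: the grader compares a parsed field instead of scraping text.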

Current Implementation Analysis

What exists in llama-eval.py:

  • Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
  • Regex-based answer extraction
  • HTTP requests to OpenAI-compatible endpoint
  • Checkpointing/resume capability
  • Thread-based parallel execution
  • Summary reporting

What needs to be removed:

  • All task implementations except AIME
  • Regex-based grading
  • Multiple endpoint support
  • Complex task loading logic
  • Summary reporting (replace with real-time feedback)

Discussion Points

1. Eval State Object Structure

Status: Under Discussion

Questions:

  • What fields should be in the eval state object?
  • Should it include the actual prompts, or just metadata?
  • How should task states be tracked?

2. Processor Architecture

Status: Not Started

Questions:

  • Should the processor handle multiple endpoints (for distributed evaluation)?
  • What's the threading model?
  • How are endpoints configured?

3. Grader Interface

Status: Not Started

Questions:

  • How should the grader be configured?
  • Should it be a separate service, or a local LLM call?
  • What's the interface for grading?

4. Checkpointing

Status: Not Started

Questions:

  • Should the eval state be serialized to disk?
  • How often should it be dumped?
  • What format should it use?
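One possible answer to the serialization questions, sketched here as an assumption rather than a decision: dump the state to JSON atomically (write a temp file, then rename), so a crash mid-dump can never corrupt the checkpoint, and resume by loading the file if it exists:

```python
import json
import os
import tempfile

def dump_state(state: dict, path: str) -> None:
    """Atomically write the eval state: a crash mid-write leaves the
    previous checkpoint intact because the rename is the commit point."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)

def load_state(path: str):
    """Resume from a previous run if a checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None
```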

5. Real-time Output

Status: Not Started

Questions:

  • How should progress be displayed?
  • Console output, file logging, or both?
  • What verbosity levels are needed?

6. Output Format

Status: Not Started

Questions:

  • Should responses be in JSON format?
  • How should the grader interface work with JSON output?

Next Steps

  1. Eval State Object - Currently discussing
  2. Processor Architecture
  3. Grader Interface
  4. Checkpointing
  5. Real-time Output
  6. Output Format

References

Session Work Summary

llama-server-simulator Implementation

Created:

  • llama-server-simulator.py - Standalone Python script simulating llama-server HTTP endpoint
  • test-simulator.sh - Test script for verifying simulator functionality
  • llama-server-simulator-plan.md - Implementation plan
  • simulator-summary.md - Summary of implementation

Features Implemented:

  1. HTTP Server - Flask-based /v1/chat/completions endpoint with OpenAI-compatible format
  2. AIME Dataset Integration - Loads 90 questions from HuggingFace with automatic local caching
  3. Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
  4. Response Generation - Configurable success rate (0-1) for correct/wrong answer generation
  5. Debug Logging - Helps troubleshoot matching issues

Testing Results:

  • Correct answers returned when the success rate allows
  • Wrong answers returned otherwise
  • Requests with no matching question return an error
  • Success rate verified (80% in 10 requests)
  • HuggingFace dataset caching working correctly

Key Technical Decisions:

  • Used Levenshtein distance for partial matching (threshold: 0.3)
  • Automatic caching via HuggingFace datasets library
  • Wrong answers generated by incrementing expected answer
  • Debug output written to stderr for better visibility
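The matching and wrong-answer decisions above can be sketched in plain Python. This is an illustration of the stated decisions (normalized edit distance under 0.3, wrong answer by incrementing), not the simulator's actual code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def matches(query: str, question: str, threshold: float = 0.3) -> bool:
    """Partial match: exact hit, or normalized edit distance below threshold."""
    if query == question:
        return True
    dist = levenshtein(query, question)
    return dist / max(len(query), len(question), 1) < threshold

def wrong_answer(expected: str) -> str:
    """Deliberately wrong reply: increment the expected numeric answer."""
    return str(int(expected) + 1)
```

Incrementing the expected answer guarantees the wrong reply is plausible (a nearby integer) but never accidentally correct.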

Refactoring:

  • Extracted repeating question string into TEST_QUESTION variable
  • Created make_request() helper function to reduce code duplication
  • Added proper handling of error responses
  • Fixed an issue with stopping the simulator at script completion