llama.cpp/examples/llama-eval/llama-eval-discussion.md


llama-eval Implementation Discussion

Overview

Notes on implementing a lean evaluation tool for llama.cpp, based on ggerganov's feedback in PR #18892.

Key Requirements from ggerganov

1. Simplify and Focus on One Eval

  • Start with AIME2025 (the eval ggerganov is most familiar with)
  • Don't support multiple evals initially

2. Implement an "eval state" object

  • ID
  • List of tasks
  • Task states
  • Sampling config

3. Implement a "processor" object

  • List of endpoints
  • Threads per endpoint
  • Grade/judge type (regex, endpoint, or CLI tool)
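
A minimal sketch of the processor configuration, mirroring the three bullets above (names and defaults are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class ProcessorConfig:
    # OpenAI-compatible base URLs, e.g. a local llama-server instance
    endpoints: list[str] = field(default_factory=list)
    threads_per_endpoint: int = 4
    # "regex" (legacy), "endpoint" (LLM judge), or "cli" (external tool)
    grader: str = "regex"
```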

4. Processor responsibilities

  • Accepts eval state
  • Starts processing
  • Dumps eval state periodically as it progresses
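
The three responsibilities could look roughly like this single-threaded loop (a sketch only; `send_request` and `grade` are placeholder stubs for the endpoint call and the configured grader):

```python
import json
import time

def send_request(prompt: str) -> str:
    # Placeholder: a real processor would POST to an OpenAI-compatible endpoint.
    return ""

def grade(task: dict) -> bool:
    # Placeholder: a real processor would dispatch to the configured grader.
    return task["answer"] == task.get("expected")

def run(state: dict, dump_path: str = "eval-state.json", dump_interval: float = 30.0) -> None:
    """Accept an eval state, process its tasks, and dump the state periodically."""
    last_dump = time.monotonic()
    for task in state["tasks"]:
        if task["status"] == "done":
            continue  # resume support: skip tasks completed in a previous run
        task["answer"] = send_request(task["prompt"])
        task["correct"] = grade(task)
        task["status"] = "done"
        if time.monotonic() - last_dump >= dump_interval:
            with open(dump_path, "w") as f:
                json.dump(state, f, indent=2)
            last_dump = time.monotonic()
    # final dump so the state on disk always reflects the finished run
    with open(dump_path, "w") as f:
        json.dump(state, f, indent=2)
```

Skipping `done` tasks on entry gives checkpoint/resume for free once the state is loaded from the last dump.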

5. Real-time feedback

  • Default: show "correct / not correct" for each task
  • Verbose mode: show produced answer vs expected answer as soon as it completes
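
The two modes above could share one formatter, sketched here (field names match the hypothetical task-state dict used elsewhere in these notes):

```python
def report(task: dict, verbose: bool = False) -> str:
    """One line per completed task: verdict by default, answers in verbose mode."""
    verdict = "correct" if task["correct"] else "not correct"
    line = f"[{task['task_id']}] {verdict}"
    if verbose:
        line += f" (got: {task['answer']!r}, expected: {task['expected']!r})"
    return line
```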

6. Grading approach

  • Abstract grading to support external "grader" or "judge"
  • Use LLM post-processing instead of regex extraction (to avoid the answer-extraction issues encountered with GPT-OSS evals)
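
One way to abstract grading is a small interface with interchangeable implementations. This is a sketch: `RegexGrader` illustrates the legacy approach, and `LLMGrader` is a stub for the judge-based approach (the class names and judge protocol are assumptions):

```python
import re
from abc import ABC, abstractmethod

class Grader(ABC):
    @abstractmethod
    def grade(self, produced: str, expected: str) -> bool: ...

class RegexGrader(Grader):
    """Legacy approach: take the last integer in the response as the answer."""
    def grade(self, produced: str, expected: str) -> bool:
        nums = re.findall(r"-?\d+", produced)
        return bool(nums) and nums[-1] == expected

class LLMGrader(Grader):
    """Preferred approach: ask a judge model whether the answers match."""
    def __init__(self, endpoint: str):
        self.endpoint = endpoint
    def grade(self, produced: str, expected: str) -> bool:
        # Would POST a judge prompt to self.endpoint; out of scope for this sketch.
        raise NotImplementedError
```

A CLI-tool grader would be a third implementation that shells out to an external program and interprets its exit code or output.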

7. Output format

  • Use structured output (JSON) instead of parsing boxed answers out of free-form text

Current Implementation Analysis

What exists in llama-eval.py:

  • Multiple task implementations (AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande)
  • Regex-based answer extraction
  • HTTP requests to OpenAI-compatible endpoint
  • Checkpointing/resume capability
  • Thread-based parallel execution
  • Summary reporting

What needs to be removed:

  • All task implementations except AIME
  • Regex-based grading
  • Multiple endpoint support
  • Complex task loading logic
  • Summary reporting (replace with real-time feedback)

Discussion Points

1. Eval State Object Structure

Status: Under Discussion

Questions:

  • What fields should be in the eval state object?
  • Should it include the actual prompts, or just metadata?
  • How should task states be tracked?

2. Processor Architecture

Status: Not Started

Questions:

  • Should the processor handle multiple endpoints (for distributed evaluation)?
  • What's the threading model?
  • How are endpoints configured?

3. Grader Interface

Status: Not Started

Questions:

  • How should the grader be configured?
  • Should it be a separate service, or a local LLM call?
  • What's the interface for grading?

4. Checkpointing

Status: Not Started

Questions:

  • Should the eval state be serialized to disk?
  • How often should it be dumped?
  • What format should it use?

5. Real-time Output

Status: Not Started

Questions:

  • How should progress be displayed?
  • Console output, file logging, or both?
  • What verbosity levels are needed?

6. Output Format

Status: Not Started

Questions:

  • Should responses be in JSON format?
  • How should the grader interface work with JSON output?

Next Steps

  1. Eval State Object - Currently discussing
  2. Processor Architecture
  3. Grader Interface
  4. Checkpointing
  5. Real-time Output
  6. Output Format

References