llama.cpp/examples/llama-eval

Latest commit 5cc2258e82 by Georgi Gerganov (2026-02-15):

examples: add simplified llama-eval-new.py for AIME evaluation

- Create a new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract the grading interface to support external graders
- Use structured JSON output for the eval state
- Cache the HuggingFace dataset to avoid repeated downloads
- Remove Levenshtein matching; the eval script only sends requests and validates answers
Files:

  README.md                        add checkpointing
  llama-eval-discussion.md         docs: update llama-eval-discussion.md with session work summary
  llama-eval-new.py                examples: add simplified llama-eval-new.py for AIME evaluation
  llama-eval.py                    add checkpointing
  llama-server-simulator-plan.md   examples: add llama-server simulator for testing eval scripts
  llama-server-simulator.py        examples: add llama-server simulator for testing eval scripts
  simulator-summary.md             examples: add llama-server simulator for testing eval scripts
  test-simulator.sh                examples: refactor test-simulator.sh for better readability

README.md

llama.cpp/examples/llama-eval

llama-eval.py is a single-script evaluation runner that sends prompts to any OpenAI-compatible HTTP server (by default, llama-server) and validates the responses.

./llama-server -m model.gguf --port 8033
python examples/llama-eval/llama-eval.py --path_server http://localhost:8033 --n_prompts 100 --prompt_source arc
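Each eval case ultimately reduces to one HTTP round trip against the server started above. As an illustration only (the payload fields below are generic OpenAI-style assumptions, not copied from llama-eval.py), a single case might look like:

```python
import json
import urllib.request

def build_request(server: str, prompt: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat endpoint that llama-server exposes."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic output keeps grading simple
    }
    return urllib.request.Request(
        f"{server}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def ask(server: str, prompt: str) -> str:
    """Send one prompt and return the model's reply text."""
    with urllib.request.urlopen(build_request(server, prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Pointing `--path_server` at a different host works because only this one endpoint is assumed; any OpenAI-compatible server will answer the same request shape.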

The supported tasks are:

  • GSM8K — grade-school math
  • AIME — competition math (integer answers)
  • MMLU — multi-domain multiple choice
  • HellaSwag — commonsense reasoning multiple choice
  • ARC — grade-school science multiple choice
  • WinoGrande — commonsense coreference multiple choice
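Grading for these tasks reduces to answer matching: a single letter for the multiple-choice sets and an integer for AIME. As a sketch only (this is not the script's actual grading logic, whose exact rules are not shown here), the two cases might be checked like this:

```python
import re

def grade_multiple_choice(response: str, correct: str) -> bool:
    """Take the last standalone A-D letter in the response as the model's pick."""
    picks = re.findall(r"\b([A-D])\b", response.upper())
    return bool(picks) and picks[-1] == correct.upper()

def grade_integer(response: str, correct: int) -> bool:
    """AIME answers are integers; compare against the last number in the response."""
    nums = re.findall(r"-?\d+", response)
    return bool(nums) and int(nums[-1]) == correct
```

Matching on the last candidate rather than the first tolerates chain-of-thought text before the final answer; responses with no candidate at all are simply marked incorrect.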