llama.cpp/examples/llama-eval
Georgi Gerganov 5a1be6ce37
examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
2026-02-15 21:08:23 +02:00
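The two grading modes listed in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in llama-eval-new.py: the function names, the AIME regexes, and the convention that the CLI grader signals a correct answer with exit code 0 are all assumptions. Only the `--answer <pred> --expected <gold>` argument shape, the 30-second timeout, the exact-match requirement, and the boxed/plain AIME handling come from the commit message.

```python
import re
import subprocess
import sys
from typing import Optional

# Hypothetical patterns; the regexes in the real script may differ.
AIME_BOXED = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")
AIME_PLAIN = re.compile(r"\b(\d{1,3})\b")


def extract_aime_answer(text: str) -> Optional[str]:
    """Prefer the last \\boxed{...} value; fall back to the last bare 1-3 digit integer."""
    boxed = AIME_BOXED.findall(text)
    if boxed:
        return boxed[-1]
    plain = AIME_PLAIN.findall(text)
    return plain[-1] if plain else None


def grade_regex(pred_text: str, expected: str) -> bool:
    """Exact match on the extracted integer (leading zeros are ignored)."""
    answer = extract_aime_answer(pred_text)
    if answer is None:
        return False
    return int(answer) == int(expected)


def grade_cli(script: str, pred: str, expected: str, timeout: float = 30.0) -> bool:
    """Run `python script --answer <pred> --expected <gold>` with a timeout.

    Assumption: exit code 0 means "correct". A timeout is treated as incorrect.
    """
    cmd = [sys.executable, script, "--answer", pred, "--expected", expected]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Under these assumptions, `grade_regex(r"so the answer is \boxed{42}", "42")` accepts a boxed answer and `grade_regex("The answer is 042.", "42")` accepts a plain-text one. Taking the last match rather than the first is a design choice made here so that intermediate numbers in a worked solution do not shadow the final answer.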
llama-eval-discussion.md docs: update llama-eval-discussion.md with session work summary 2026-02-15 21:08:22 +02:00
llama-eval-new.py examples: implement flexible grader system for answer validation 2026-02-15 21:08:23 +02:00
llama-eval.py add checkpointing 2026-02-15 21:08:22 +02:00
llama-server-simulator-plan.md examples: add llama-server simulator for testing eval scripts 2026-02-15 21:08:22 +02:00
llama-server-simulator.py examples: add llama-server simulator for testing eval scripts 2026-02-15 21:08:22 +02:00
simulator-summary.md examples: add llama-server simulator for testing eval scripts 2026-02-15 21:08:22 +02:00
test-grader.py examples: implement flexible grader system for answer validation 2026-02-15 21:08:23 +02:00
test-simulator.sh examples: refactor test-simulator.sh for better readability 2026-02-15 21:08:22 +02:00