llama.cpp/examples/llama-eval
llama-eval.py is a single-script evaluation runner that sends prompts to any OpenAI-compatible HTTP server (by default, llama-server) and validates the returned answers. Start the server, then point the script at it:
```sh
./llama-server -m model.gguf --port 8033
python examples/llama-eval/llama-eval.py --path_server http://localhost:8033 --n_prompts 100 --prompt_source arc
```
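
Under the hood, each case amounts to one chat-completions request followed by an answer check. Below is a minimal sketch of that loop, assuming the standard `/v1/chat/completions` endpoint; `run_case` and `extract_answer` are illustrative names, not the script's actual interface:

```python
import requests

def run_case(server: str, prompt: str, reference: str) -> bool:
    # POST one chat completion to the server (llama-server does not
    # require a "model" field, since the model is fixed at launch).
    resp = requests.post(
        f"{server}/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
        },
        timeout=300,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    return extract_answer(text) == reference

def extract_answer(text: str) -> str:
    # Naive final-answer extraction: last non-empty line of the response.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""
```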
The supported tasks are (a grading sketch follows the list):
- GSM8K — grade-school math
- AIME — competition math (integer answers)
- MMLU — multi-domain multiple choice
- HellaSwag — commonsense reasoning multiple choice
- ARC — grade-school science multiple choice
- WinoGrande — commonsense coreference multiple choice
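
Grading reduces to extracting a final answer from the model's response and comparing it with the reference: a choice label for the multiple-choice tasks, and a number for GSM8K and AIME. The sketch below shows one plausible regex-based grader for each format; this is an assumption for illustration, not necessarily how llama-eval.py extracts answers:

```python
import re

def grade_choice(response: str, correct: str) -> bool:
    # Take the last standalone choice label (A-E) in the response.
    labels = re.findall(r"\b([A-E])\b", response.upper())
    return bool(labels) and labels[-1] == correct.upper()

def grade_integer(response: str, correct: int) -> bool:
    # AIME answers are integers (0-999); take the last integer mentioned.
    numbers = re.findall(r"-?\d+", response)
    return bool(numbers) and int(numbers[-1]) == correct
```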