History

Georgi Gerganov c87af1d527 docs: update llama-eval-discussion.md with session work summary Add summary of llama-server-simulator implementation work including features, testing results, technical decisions, and refactoring.		2026-02-15 21:08:22 +02:00
..
README.md	add checkpointing	2026-02-15 21:08:22 +02:00
llama-eval-discussion.md	docs: update llama-eval-discussion.md with session work summary	2026-02-15 21:08:22 +02:00
llama-eval.py	add checkpointing	2026-02-15 21:08:22 +02:00
llama-server-simulator-plan.md	examples: add llama-server simulator for testing eval scripts	2026-02-15 21:08:22 +02:00
llama-server-simulator.py	examples: add llama-server simulator for testing eval scripts	2026-02-15 21:08:22 +02:00
simulator-summary.md	examples: add llama-server simulator for testing eval scripts	2026-02-15 21:08:22 +02:00
test-cache.sh	examples: add llama-server simulator for testing eval scripts	2026-02-15 21:08:22 +02:00
test-simulator.sh	examples: refactor test-simulator.sh for better readability	2026-02-15 21:08:22 +02:00

README.md

llama.cpp/example/llama-eval

llama-eval.py is a single-script evaluation runner that sends prompt/response pairs to any OpenAI-compatible HTTP server (the default llama-server).

./llama-server -m model.gguf --port 8033
python examples/llama-eval/llama-eval.py --path_server http://localhost:8033 --n_prompts 100 --prompt_source arc

The supported tasks are:

GSM8K — grade-school math
AIME — competition math (integer answers)
MMLU — multi-domain multiple choice
HellaSwag — commonsense reasoning multiple choice
ARC — grade-school science multiple choice
WinoGrande — commonsense coreference multiple choice