191 lines
5.4 KiB
Markdown
191 lines
5.4 KiB
Markdown
# llama-eval Codebase Guidelines
|
|
|
|
## Overview
|
|
|
|
This directory contains Python evaluation tools for llama.cpp:
|
|
- `llama-eval.py` - Main evaluation tool with multiple datasets (AIME, AIME2025, GSM8K, GPQA)
|
|
- `llama-server-simulator.py` - Flask-based server simulator for testing
|
|
- `test-simulator.sh` - Test script for the simulator
|
|
|
|
## Build/Run Commands
|
|
|
|
### Virtual Environment
|
|
The project uses a virtual environment located at `venv/`:
|
|
```bash
|
|
source venv/bin/activate
|
|
```
|
|
|
|
### Running the Main Evaluator
|
|
```bash
|
|
python llama-eval.py \
|
|
--server http://127.0.0.1:8013 \
|
|
--model gpt-oss-20b-hf-low \
|
|
--dataset aime \
|
|
--n_cases 10 \
|
|
--grader-type llm \
|
|
--seed 42
|
|
```
|
|
|
|
### Running the Simulator (for testing)
|
|
```bash
|
|
python llama-server-simulator.py --port 8033 --success-rate 0.8
|
|
```
|
|
|
|
### Running Tests
|
|
```bash
|
|
./test-simulator.sh
|
|
```
|
|
|
|
## Code Style Guidelines
|
|
|
|
### Imports
|
|
- Standard library imports first (argparse, json, os, re, subprocess, sys, time)
|
|
- Third-party imports (requests, tqdm, datasets, flask) after standard library
|
|
- Relative imports not used
|
|
- Group imports by category with blank line between groups
|
|
|
|
### Formatting
|
|
- 4-space indentation
|
|
- Max line length: 125 characters (per parent project's .flake8)
|
|
- Use double quotes for strings
|
|
- Use triple double quotes for docstrings
|
|
- Binary operators at the beginning of continued lines
|
|
|
|
### Naming Conventions
|
|
- Classes: PascalCase (e.g., `AimeDataset`, `Grader`, `Processor`)
|
|
- Functions: snake_case (e.g., `normalize_number`, `get_prompt`)
|
|
- Variables: snake_case (e.g., `question_text`, `correct_count`)
|
|
- Constants: UPPER_SNAKE_CASE (e.g., `GRADER_PATTERNS`, `TEMPLATE_REGISTRY`)
|
|
- Private methods: prefix with underscore (e.g., `_load_dataset`, `_grade_regex`)
|
|
|
|
### Types
|
|
- Use type hints for all function signatures
|
|
- Import from `typing` module: `Dict`, `List`, `Optional`, `Any`, `Tuple`
|
|
- Use `@dataclass` for data structures
|
|
- Prefer `Optional[T]` over `Union[T, None]`
|
|
|
|
### Error Handling
|
|
- Use try/except for network requests and file operations
|
|
- Return `None` or `False` on errors when appropriate
|
|
- Use `ValueError` for invalid arguments
|
|
- Use `FileNotFoundError` for missing files
|
|
- CLI scripts should handle exceptions gracefully
|
|
|
|
### Dataclasses
|
|
- Use `@dataclass` for structured data
|
|
- Define fields with explicit types
|
|
- Use `Optional[T]` for nullable fields
|
|
- Provide default values where appropriate
|
|
|
|
### String Formatting
|
|
- Use f-strings for formatting (Python 3.6+)
|
|
- Use triple double quotes for multi-line strings
|
|
- Escape backslashes in regex patterns: `r'\\boxed{(\d+)}'`
|
|
|
|
### File Paths
|
|
- Use `pathlib.Path` instead of string paths
|
|
- Create directories with `mkdir(parents=True, exist_ok=True)`
|
|
- Use `Path.home()` for user home directory
|
|
|
|
### Logging
|
|
- Use `print()` for user-facing output
|
|
- Use `sys.stderr` for debug logging
|
|
- Simulator writes debug logs to `/tmp/simulator-debug.log`
|
|
|
|
### Testing
|
|
|
|
- Test script uses bash with `set -e` for strict error handling
|
|
- Simulator runs in background with PID tracking
|
|
- Tests verify correct answers, error cases, and edge cases
|
|
- Use `curl` for HTTP testing in shell scripts
|
|
|
|
### Whitespace Cleanup
|
|
- Remove trailing whitespace from all lines
|
|
- When making edits, do not leave trailing whitespace
|
|
|
|
## Dataset Support
|
|
|
|
### AIME Dataset
|
|
- 90 questions from 2025 AIME competition
|
|
- Answers in `\boxed{answer}` format
|
|
- Supports regex, CLI, and LLM grading
|
|
|
|
### AIME2025 Dataset
|
|
- 30 questions from 2025 AIME I & II
|
|
- Answers in `\boxed{answer}` format
|
|
- Requires loading two config parts
|
|
|
|
### GSM8K Dataset
|
|
- 7473 math word problems
|
|
- Answers numeric values with `####` separator
|
|
- Supports regex, CLI, and LLM grading
|
|
|
|
### GPQA Dataset
|
|
- 198 questions from GPQA Diamond
|
|
- Multiple choice with shuffled options (A, B, C, D)
|
|
- **Requires LLM grader** (returns letter A/B/C/D)
|
|
|
|
## Grading Types
|
|
|
|
### Regex Grader
|
|
- Built-in patterns per dataset
|
|
- Prioritizes `\boxed{}` for AIME datasets
|
|
- Extracts last number for GSM8K
|
|
|
|
### CLI Grader
|
|
- External script interface
|
|
- Call: `grader.sh --answer <pred> --expected <gold>`
|
|
- Exit code 0 = correct, non-zero = incorrect
|
|
|
|
### LLM Grader
|
|
- Uses judge model for answer extraction
|
|
- Includes few-shot examples
|
|
- Case-insensitive comparison
|
|
- Required for GPQA
|
|
|
|
## Configuration
|
|
|
|
### Sampling Parameters (Optional)
|
|
- `--temperature`: Sampling temperature
|
|
- `--top-k`: Top K sampling
|
|
- `--top-p`: Top P sampling
|
|
- `--min-p`: Min P sampling
|
|
- Only passed to API if explicitly specified
|
|
|
|
### Default Values
|
|
- `--n_predict`: -1 (infinite)
|
|
- `--grader-type`: llm
|
|
- `--seed`: 1234
|
|
- `--threads`: 32
|
|
- `--output`: llama-eval-state.json
|
|
|
|
## Output Format
|
|
|
|
### Progress Table
|
|
- Shows task ID, dataset, prompt (truncated to 43 chars), expected answer, status
|
|
- Uses `tqdm` for progress bars
|
|
|
|
### Results Summary
|
|
- Format: `Results: X/Y correct (Z%)`
|
|
- Displayed after all tasks complete
|
|
|
|
### JSON Output
|
|
- Complete eval state saved to output file
|
|
- Contains: task IDs, correctness, prompts, extracted answers, sampling config
|
|
- Uses `dataclasses.asdict()` for serialization
|
|
|
|
## HuggingFace Datasets
|
|
|
|
- Cache directory: `~/.cache/huggingface/datasets`
|
|
- Set via `HF_DATASETS_CACHE` environment variable
|
|
- Telemetry disabled via `HF_HUB_DISABLE_TELEMETRY=1`
|
|
- Datasets loaded with `datasets.load_dataset()`
|
|
|
|
## Flask Simulator
|
|
|
|
- Runs on configurable port (default: 5000)
|
|
- Endpoint: `/v1/chat/completions` (OpenAI-compatible)
|
|
- Uses Dice coefficient for question matching
|
|
- Configurable success rate for testing
|
|
- Debug logs to `/tmp/simulator-debug.log`
|