remove junk
This commit is contained in:
parent
a3405d4260
commit
1c128d941e
|
|
@ -1,190 +0,0 @@
|
|||
# llama-eval Codebase Guidelines
|
||||
|
||||
## Overview
|
||||
|
||||
This directory contains Python evaluation tools for llama.cpp:
|
||||
- `llama-eval.py` - Main evaluation tool with multiple datasets (AIME, AIME2025, GSM8K, GPQA)
|
||||
- `llama-server-simulator.py` - Flask-based server simulator for testing
|
||||
- `test-simulator.sh` - Test script for the simulator
|
||||
|
||||
## Build/Run Commands
|
||||
|
||||
### Virtual Environment
|
||||
The project uses a virtual environment located at `venv/`:
|
||||
```bash
|
||||
source venv/bin/activate
|
||||
```
|
||||
|
||||
### Running the Main Evaluator
|
||||
```bash
|
||||
python llama-eval.py \
|
||||
--server http://127.0.0.1:8013 \
|
||||
--model gpt-oss-20b-hf-low \
|
||||
--dataset aime \
|
||||
--n_cases 10 \
|
||||
--grader-type llm \
|
||||
--seed 42
|
||||
```
|
||||
|
||||
### Running the Simulator (for testing)
|
||||
```bash
|
||||
python llama-server-simulator.py --port 8033 --success-rate 0.8
|
||||
```
|
||||
|
||||
### Running Tests
|
||||
```bash
|
||||
./test-simulator.sh
|
||||
```
|
||||
|
||||
## Code Style Guidelines
|
||||
|
||||
### Imports
|
||||
- Standard library imports first (argparse, json, os, re, subprocess, sys, time)
|
||||
- Third-party imports (requests, tqdm, datasets, flask) after standard library
|
||||
- Relative imports not used
|
||||
- Group imports by category with blank line between groups
|
||||
|
||||
### Formatting
|
||||
- 4-space indentation
|
||||
- Max line length: 125 characters (per parent project's .flake8)
|
||||
- Use double quotes for strings
|
||||
- Use triple double quotes for docstrings
|
||||
- Binary operators at the beginning of continued lines
|
||||
|
||||
### Naming Conventions
|
||||
- Classes: PascalCase (e.g., `AimeDataset`, `Grader`, `Processor`)
|
||||
- Functions: snake_case (e.g., `normalize_number`, `get_prompt`)
|
||||
- Variables: snake_case (e.g., `question_text`, `correct_count`)
|
||||
- Constants: UPPER_SNAKE_CASE (e.g., `GRADER_PATTERNS`, `TEMPLATE_REGISTRY`)
|
||||
- Private methods: prefix with underscore (e.g., `_load_dataset`, `_grade_regex`)
|
||||
|
||||
### Types
|
||||
- Use type hints for all function signatures
|
||||
- Import from `typing` module: `Dict`, `List`, `Optional`, `Any`, `Tuple`
|
||||
- Use `@dataclass` for data structures
|
||||
- Prefer `Optional[T]` over `Union[T, None]`
|
||||
|
||||
### Error Handling
|
||||
- Use try/except for network requests and file operations
|
||||
- Return `None` or `False` on errors when appropriate
|
||||
- Use `ValueError` for invalid arguments
|
||||
- Use `FileNotFoundError` for missing files
|
||||
- CLI scripts should handle exceptions gracefully
|
||||
|
||||
### Dataclasses
|
||||
- Use `@dataclass` for structured data
|
||||
- Define fields with explicit types
|
||||
- Use `Optional[T]` for nullable fields
|
||||
- Provide default values where appropriate
|
||||
|
||||
### String Formatting
|
||||
- Use f-strings for formatting (Python 3.6+)
|
||||
- Use triple double quotes for multi-line strings
|
||||
- Escape backslashes in regex patterns: `r'\\boxed{(\d+)}'`
|
||||
|
||||
### File Paths
|
||||
- Use `pathlib.Path` instead of string paths
|
||||
- Create directories with `mkdir(parents=True, exist_ok=True)`
|
||||
- Use `Path.home()` for user home directory
|
||||
|
||||
### Logging
|
||||
- Use `print()` for user-facing output
|
||||
- Use `sys.stderr` for debug logging
|
||||
- Simulator writes debug logs to `/tmp/simulator-debug.log`
|
||||
|
||||
### Testing
|
||||
|
||||
- Test script uses bash with `set -e` for strict error handling
|
||||
- Simulator runs in background with PID tracking
|
||||
- Tests verify correct answers, error cases, and edge cases
|
||||
- Use `curl` for HTTP testing in shell scripts
|
||||
|
||||
### Whitespace Cleanup
|
||||
- Remove trailing whitespace from all lines
|
||||
- When making edits, do not leave trailing whitespace
|
||||
|
||||
## Dataset Support
|
||||
|
||||
### AIME Dataset
|
||||
- 90 questions from 2025 AIME competition
|
||||
- Answers in `\boxed{answer}` format
|
||||
- Supports regex, CLI, and LLM grading
|
||||
|
||||
### AIME2025 Dataset
|
||||
- 30 questions from 2025 AIME I & II
|
||||
- Answers in `\boxed{answer}` format
|
||||
- Requires loading two config parts
|
||||
|
||||
### GSM8K Dataset
|
||||
- 7473 math word problems
|
||||
- Answers numeric values with `####` separator
|
||||
- Supports regex, CLI, and LLM grading
|
||||
|
||||
### GPQA Dataset
|
||||
- 198 questions from GPQA Diamond
|
||||
- Multiple choice with shuffled options (A, B, C, D)
|
||||
- **Requires LLM grader** (returns letter A/B/C/D)
|
||||
|
||||
## Grading Types
|
||||
|
||||
### Regex Grader
|
||||
- Built-in patterns per dataset
|
||||
- Prioritizes `\boxed{}` for AIME datasets
|
||||
- Extracts last number for GSM8K
|
||||
|
||||
### CLI Grader
|
||||
- External script interface
|
||||
- Call: `grader.sh --answer <pred> --expected <gold>`
|
||||
- Exit code 0 = correct, non-zero = incorrect
|
||||
|
||||
### LLM Grader
|
||||
- Uses judge model for answer extraction
|
||||
- Includes few-shot examples
|
||||
- Case-insensitive comparison
|
||||
- Required for GPQA
|
||||
|
||||
## Configuration
|
||||
|
||||
### Sampling Parameters (Optional)
|
||||
- `--temperature`: Sampling temperature
|
||||
- `--top-k`: Top K sampling
|
||||
- `--top-p`: Top P sampling
|
||||
- `--min-p`: Min P sampling
|
||||
- Only passed to API if explicitly specified
|
||||
|
||||
### Default Values
|
||||
- `--n_predict`: -1 (infinite)
|
||||
- `--grader-type`: llm
|
||||
- `--seed`: 1234
|
||||
- `--threads`: 32
|
||||
- `--output`: llama-eval-state.json
|
||||
|
||||
## Output Format
|
||||
|
||||
### Progress Table
|
||||
- Shows task ID, dataset, prompt (truncated to 43 chars), expected answer, status
|
||||
- Uses `tqdm` for progress bars
|
||||
|
||||
### Results Summary
|
||||
- Format: `Results: X/Y correct (Z%)`
|
||||
- Displayed after all tasks complete
|
||||
|
||||
### JSON Output
|
||||
- Complete eval state saved to output file
|
||||
- Contains: task IDs, correctness, prompts, extracted answers, sampling config
|
||||
- Uses `dataclasses.asdict()` for serialization
|
||||
|
||||
## HuggingFace Datasets
|
||||
|
||||
- Cache directory: `~/.cache/huggingface/datasets`
|
||||
- Set via `HF_DATASETS_CACHE` environment variable
|
||||
- Telemetry disabled via `HF_HUB_DISABLE_TELEMETRY=1`
|
||||
- Datasets loaded with `datasets.load_dataset()`
|
||||
|
||||
## Flask Simulator
|
||||
|
||||
- Runs on configurable port (default: 5000)
|
||||
- Endpoint: `/v1/chat/completions` (OpenAI-compatible)
|
||||
- Uses Dice coefficient for question matching
|
||||
- Configurable success rate for testing
|
||||
- Debug logs to `/tmp/simulator-debug.log`
|
||||
|
|
@ -1,94 +0,0 @@
|
|||
# llama-eval Implementation Summary
|
||||
|
||||
## Overview
|
||||
|
||||
Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
|
||||
- **Flexible Grading**: Regex, CLI, or LLM-based grading
|
||||
- **Parallel Processing**: Configurable thread count for concurrent requests
|
||||
- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
|
||||
- **Real-time Feedback**: Progress tracking with detailed output
|
||||
- **JSON Output**: Complete eval state saved for debugging
|
||||
- **GPQA Support**: Answer shuffling with reproducible results
|
||||
|
||||
## Architecture
|
||||
|
||||
### Eval State
|
||||
```python
|
||||
@dataclass
|
||||
class EvalState:
|
||||
id: str
|
||||
tasks: List[str]
|
||||
task_states: Dict[str, Dict[str, Any]]
|
||||
sampling_config: Dict[str, Any]
|
||||
```
|
||||
|
||||
### Processor
|
||||
- Handles processing, grading, and state management
|
||||
- Thread-safe concurrent execution
|
||||
- Configurable sampling parameters
|
||||
|
||||
### Grader
|
||||
- Abstract grading interface supporting multiple types
|
||||
- Regex grader with dataset-specific patterns
|
||||
- CLI grader with external script interface
|
||||
- LLM grader with configurable server and model
|
||||
|
||||
### Datasets
|
||||
- `AimeDataset`: 90 AIME 2025 questions
|
||||
- `Aime2025Dataset`: 30 AIME 2025 I & II questions
|
||||
- `Gsm8kDataset`: 7473 math word problems
|
||||
- `GpqaDataset`: 198 GPQA Diamond questions with shuffling
|
||||
|
||||
## Configuration
|
||||
|
||||
### Sampling Parameters (Optional)
|
||||
- `--temperature`: Sampling temperature
|
||||
- `--top-k`: Top K sampling
|
||||
- `--top-p`: Top P sampling
|
||||
- `--min-p`: Min P sampling
|
||||
- Only passed if explicitly specified
|
||||
|
||||
### Grading Types
|
||||
- **regex**: Built-in patterns for each dataset
|
||||
- **cli**: External script with `--answer` and `--expected` args
|
||||
- **llm**: LLM-based extraction with few-shot examples and configurable server/model
|
||||
|
||||
### Dataset Requirements
|
||||
- **AIME**: Supports regex, CLI, or LLM grader
|
||||
- **AIME2025**: Supports regex, CLI, or LLM grader
|
||||
- **GSM8K**: Supports regex, CLI, or LLM grader
|
||||
- **GPQA**: Requires LLM grader
|
||||
|
||||
## Output Format
|
||||
|
||||
### Progress Table
|
||||
```
|
||||
Task ID Dataset Prompt (first 43 chars) Expected Status
|
||||
aime_000_001 AIME Complete the following reactions and sel... A pending
|
||||
```
|
||||
|
||||
### Results Summary
|
||||
```
|
||||
============================================================
|
||||
Results: 8/10 correct (80.0%)
|
||||
============================================================
|
||||
```
|
||||
|
||||
### JSON Output
|
||||
Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
|
||||
|
||||
## Technical Details
|
||||
|
||||
- Default max tokens: -1 (infinite)
|
||||
- Default grader type: llm
|
||||
- Default seed: 1234
|
||||
- Default threads: 32
|
||||
- Prompt truncation: First 43 chars + padding + "..."
|
||||
- Response truncation: Last 10 lines for grading
|
||||
- GPQA requires LLM grader (returns letter A/B/C/D)
|
||||
- Judge model defaults to evaluated model if not specified
|
||||
- Sample answers defined in SAMPLE_ANSWERS dict for few-shot learning
|
||||
|
|
@ -1,112 +1,5 @@
|
|||
# llama-eval Evaluation Tool
|
||||
# llama-eval
|
||||
|
||||
Simple evaluation tool for llama.cpp with support for multiple datasets.
|
||||
|
||||
## Features
|
||||
|
||||
- **Multiple Datasets**: AIME, GSM8K, GPQA
|
||||
- **Flexible Grading**: Regex, CLI, or LLM-based grading
|
||||
- **Parallel Processing**: Configurable thread count
|
||||
- **Real-time Feedback**: Progress tracking with detailed output
|
||||
- **Sampling Parameters**: Temperature, Top K, Top P, Min P
|
||||
- **JSON Output**: Complete eval state saved for debugging
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
python llama-eval.py \
|
||||
--server http://127.0.0.1:8013 \
|
||||
--model gpt-oss-20b-hf-low \
|
||||
--judge-model gpt-oss-20b-hf-medium \
|
||||
--dataset aime \
|
||||
--n_cases 10 \
|
||||
--grader-type llm \
|
||||
--seed 42
|
||||
```
|
||||
|
||||
## CLI Arguments
|
||||
|
||||
- `--server`: llama-server URL (default: http://127.0.0.1:8013)
|
||||
- `--model`: Model name for evaluation (default: llama)
|
||||
- `--judge-model`: Model name for LLM judge (default: same as main model)
|
||||
- `--judge-server`: Server URL for LLM judge (default: same as main server)
|
||||
- `--dataset`: Dataset type (aime, aime2025, gsm8k, gpqa)
|
||||
- `--n_cases`: Number of cases to evaluate (default: all)
|
||||
- `--n_predict`: Max tokens to predict per prompt (default: -1, infinite)
|
||||
- `--temperature`: Sampling temperature (default: not passed)
|
||||
- `--top-k`: Top K sampling (default: not passed)
|
||||
- `--top-p`: Top P sampling (default: not passed)
|
||||
- `--min-p`: Min P sampling (default: not passed)
|
||||
- `--threads`: Number of threads for parallel requests (default: 32)
|
||||
- `--verbose`: Show detailed output for each case
|
||||
- `--output`: Output file for eval state (default: llama-eval-state.json)
|
||||
- `--grader-type`: Grader type (regex, cli, llm, default: llm)
|
||||
- `--grader-script`: Path to CLI grader script (required for --grader-type cli)
|
||||
- `--seed`: Random seed for shuffling (default: 1234)
|
||||
|
||||
## Datasets
|
||||
|
||||
### AIME
|
||||
- 90 questions from 2025 AIME competition
|
||||
- Answers in boxed format: `\boxed{answer}`
|
||||
- Requires regex grader or LLM grader
|
||||
|
||||
### AIME2025
|
||||
- 30 questions from 2025 AIME I & II competitions
|
||||
- Answers in boxed format: `\boxed{answer}`
|
||||
- Supports regex, CLI, or LLM grader
|
||||
|
||||
### GSM8K
|
||||
- 7473 math word problems
|
||||
- Answers are numeric values
|
||||
- Requires regex grader or LLM grader
|
||||
|
||||
### GPQA
|
||||
- 198 questions from GPQA Diamond dataset
|
||||
- Multiple choice with shuffled options
|
||||
- Requires LLM grader (returns letter A, B, C, or D)
|
||||
|
||||
## Grading Types
|
||||
|
||||
### Regex Grader
|
||||
Built-in patterns for different datasets:
|
||||
- AIME: `\boxed{(\d+)}|\b(\d+)\b`
|
||||
- AIME2025: `\boxed{(\d+)}|\b(\d+)\b`
|
||||
- GSM8K: `\b(\d+)\b`
|
||||
- GPQA: Letter extraction (A, B, C, D)
|
||||
|
||||
### CLI Grader
|
||||
External script interface:
|
||||
```bash
|
||||
./grader.sh --answer <pred> --expected <gold>
|
||||
```
|
||||
Returns exit code 0 if correct, non-zero if incorrect.
|
||||
|
||||
### LLM Grader
|
||||
Uses LLM to extract and compare answers:
|
||||
- Configurable server and model
|
||||
- Includes few-shot examples from sample answers
|
||||
- Case-insensitive comparison
|
||||
- Required for GPQA dataset
|
||||
|
||||
## Output
|
||||
|
||||
### Progress Table
|
||||
```
|
||||
Task ID Dataset Prompt (first 43 chars) Expected Status
|
||||
aime_000_001 AIME Complete the following reactions and sel... A pending
|
||||
```
|
||||
|
||||
### Results
|
||||
```
|
||||
============================================================
|
||||
Results: 8/10 correct (80.0%)
|
||||
============================================================
|
||||
```
|
||||
|
||||
### JSON Output
|
||||
Complete eval state saved to output file with:
|
||||
- Task IDs and correctness status
|
||||
- Prompts and extracted answers
|
||||
- Sampling configuration
|
||||
- Processing metadata
|
||||
TODO: add usage
|
||||
|
|
|
|||
|
|
@ -1,36 +0,0 @@
|
|||
# llama-server-simulator
|
||||
|
||||
Standalone Python script simulating llama-server HTTP endpoint for testing.
|
||||
|
||||
## Features
|
||||
|
||||
- HTTP Server with OpenAI-compatible `/v1/chat/completions` endpoint
|
||||
- AIME Dataset Integration - Loads 90 questions from HuggingFace
|
||||
- Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
|
||||
- Configurable Success Rate - Control correct/wrong answer generation (0-1)
|
||||
- Debug Logging - Troubleshoot matching issues
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
python llama-server-simulator.py --success-rate 0.8
|
||||
```
|
||||
|
||||
## Arguments
|
||||
|
||||
- `--success-rate`: Probability of returning correct answer (0.0-1.0, default: 0.8)
|
||||
- `--port`: Server port (default: 8033)
|
||||
- `--debug`: Enable debug logging (default: False)
|
||||
|
||||
## Testing
|
||||
|
||||
```bash
|
||||
./test-simulator.sh
|
||||
```
|
||||
|
||||
## Implementation Details
|
||||
|
||||
- Uses Levenshtein distance for partial matching (threshold: 0.3)
|
||||
- Automatic caching via HuggingFace datasets library
|
||||
- Wrong answers generated by incrementing expected answer
|
||||
- Debug output written to stderr
|
||||
Loading…
Reference in New Issue