From 1c128d941ee447344984b825dfa34d9f09a30b13 Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Sun, 29 Mar 2026 17:31:04 +0300
Subject: [PATCH] remove junk

---
 examples/llama-eval/AGENTS.md                 | 190 ------------------
 examples/llama-eval/IMPLEMENTATION.md         |  94 ---------
 examples/llama-eval/README.md                 | 111 +---------
 .../llama-server-simulator-README.md          |  36 ----
 4 files changed, 2 insertions(+), 429 deletions(-)
 delete mode 100644 examples/llama-eval/AGENTS.md
 delete mode 100644 examples/llama-eval/IMPLEMENTATION.md
 delete mode 100644 examples/llama-eval/llama-server-simulator-README.md

diff --git a/examples/llama-eval/AGENTS.md b/examples/llama-eval/AGENTS.md
deleted file mode 100644
index 60700aefc7..0000000000
--- a/examples/llama-eval/AGENTS.md
+++ /dev/null
@@ -1,190 +0,0 @@
-# llama-eval Codebase Guidelines
-
-## Overview
-
-This directory contains Python evaluation tools for llama.cpp:
-- `llama-eval.py` - Main evaluation tool with multiple datasets (AIME, AIME2025, GSM8K, GPQA)
-- `llama-server-simulator.py` - Flask-based server simulator for testing
-- `test-simulator.sh` - Test script for the simulator
-
-## Build/Run Commands
-
-### Virtual Environment
-The project uses a virtual environment located at `venv/`:
-```bash
-source venv/bin/activate
-```
-
-### Running the Main Evaluator
-```bash
-python llama-eval.py \
-    --server http://127.0.0.1:8013 \
-    --model gpt-oss-20b-hf-low \
-    --dataset aime \
-    --n_cases 10 \
-    --grader-type llm \
-    --seed 42
-```
-
-### Running the Simulator (for testing)
-```bash
-python llama-server-simulator.py --port 8033 --success-rate 0.8
-```
-
-### Running Tests
-```bash
-./test-simulator.sh
-```
-
-## Code Style Guidelines
-
-### Imports
-- Standard library imports first (argparse, json, os, re, subprocess, sys, time)
-- Third-party imports (requests, tqdm, datasets, flask) after standard library
-- Relative imports not used
-- Group imports by category with blank line between groups
-
-### Formatting
-- 4-space indentation
-- Max line length: 125 characters (per parent project's .flake8)
-- Use double quotes for strings
-- Use triple double quotes for docstrings
-- Binary operators at the beginning of continued lines
-
-### Naming Conventions
-- Classes: PascalCase (e.g., `AimeDataset`, `Grader`, `Processor`)
-- Functions: snake_case (e.g., `normalize_number`, `get_prompt`)
-- Variables: snake_case (e.g., `question_text`, `correct_count`)
-- Constants: UPPER_SNAKE_CASE (e.g., `GRADER_PATTERNS`, `TEMPLATE_REGISTRY`)
-- Private methods: prefix with underscore (e.g., `_load_dataset`, `_grade_regex`)
-
-### Types
-- Use type hints for all function signatures
-- Import from `typing` module: `Dict`, `List`, `Optional`, `Any`, `Tuple`
-- Use `@dataclass` for data structures
-- Prefer `Optional[T]` over `Union[T, None]`
-
-### Error Handling
-- Use try/except for network requests and file operations
-- Return `None` or `False` on errors when appropriate
-- Use `ValueError` for invalid arguments
-- Use `FileNotFoundError` for missing files
-- CLI scripts should handle exceptions gracefully
-
-### Dataclasses
-- Use `@dataclass` for structured data
-- Define fields with explicit types
-- Use `Optional[T]` for nullable fields
-- Provide default values where appropriate
-
-### String Formatting
-- Use f-strings for formatting (Python 3.6+)
-- Use triple double quotes for multi-line strings
-- Escape backslashes in regex patterns: `r'\\boxed{(\d+)}'`
-
-### File Paths
-- Use `pathlib.Path` instead of string paths
-- Create directories with `mkdir(parents=True, exist_ok=True)`
-- Use `Path.home()` for user home directory
-
-### Logging
-- Use `print()` for user-facing output
-- Use `sys.stderr` for debug logging
-- Simulator writes debug logs to `/tmp/simulator-debug.log`
-
-### Testing
-
-- Test script uses bash with `set -e` for strict error handling
-- Simulator runs in background with PID tracking
-- Tests verify correct answers, error cases, and edge cases
-- Use `curl` for HTTP testing in shell scripts
-
-### Whitespace Cleanup
-- Remove trailing whitespace from all lines
-- When making edits, do not leave trailing whitespace
-
-## Dataset Support
-
-### AIME Dataset
-- 90 questions from 2025 AIME competition
-- Answers in `\boxed{answer}` format
-- Supports regex, CLI, and LLM grading
-
-### AIME2025 Dataset
-- 30 questions from 2025 AIME I & II
-- Answers in `\boxed{answer}` format
-- Requires loading two config parts
-
-### GSM8K Dataset
-- 7473 math word problems
-- Answers numeric values with `####` separator
-- Supports regex, CLI, and LLM grading
-
-### GPQA Dataset
-- 198 questions from GPQA Diamond
-- Multiple choice with shuffled options (A, B, C, D)
-- **Requires LLM grader** (returns letter A/B/C/D)
-
-## Grading Types
-
-### Regex Grader
-- Built-in patterns per dataset
-- Prioritizes `\boxed{}` for AIME datasets
-- Extracts last number for GSM8K
-
-### CLI Grader
-- External script interface
-- Call: `grader.sh --answer <answer> --expected <expected>`
-- Exit code 0 = correct, non-zero = incorrect
-
-### LLM Grader
-- Uses judge model for answer extraction
-- Includes few-shot examples
-- Case-insensitive comparison
-- Required for GPQA
-
-## Configuration
-
-### Sampling Parameters (Optional)
-- `--temperature`: Sampling temperature
-- `--top-k`: Top K sampling
-- `--top-p`: Top P sampling
-- `--min-p`: Min P sampling
-- Only passed to API if explicitly specified
-
-### Default Values
-- `--n_predict`: -1 (infinite)
-- `--grader-type`: llm
-- `--seed`: 1234
-- `--threads`: 32
-- `--output`: llama-eval-state.json
-
-## Output Format
-
-### Progress Table
-- Shows task ID, dataset, prompt (truncated to 43 chars), expected answer, status
-- Uses `tqdm` for progress bars
-
-### Results Summary
-- Format: `Results: X/Y correct (Z%)`
-- Displayed after all tasks complete
-
-### JSON Output
-- Complete eval state saved to output file
-- Contains: task IDs, correctness, prompts, extracted answers, sampling config
-- Uses `dataclasses.asdict()` for serialization
-
-## HuggingFace Datasets
-
-- Cache directory: `~/.cache/huggingface/datasets`
-- Set via `HF_DATASETS_CACHE` environment variable
-- Telemetry disabled via `HF_HUB_DISABLE_TELEMETRY=1`
-- Datasets loaded with `datasets.load_dataset()`
-
-## Flask Simulator
-
-- Runs on configurable port (default: 5000)
-- Endpoint: `/v1/chat/completions` (OpenAI-compatible)
-- Uses Dice coefficient for question matching
-- Configurable success rate for testing
-- Debug logs to `/tmp/simulator-debug.log`
diff --git a/examples/llama-eval/IMPLEMENTATION.md b/examples/llama-eval/IMPLEMENTATION.md
deleted file mode 100644
index 9ce2bdc3f9..0000000000
--- a/examples/llama-eval/IMPLEMENTATION.md
+++ /dev/null
@@ -1,94 +0,0 @@
-# llama-eval Implementation Summary
-
-## Overview
-
-Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).
-
-## Key Features
-
-- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
-- **Flexible Grading**: Regex, CLI, or LLM-based grading
-- **Parallel Processing**: Configurable thread count for concurrent requests
-- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
-- **Real-time Feedback**: Progress tracking with detailed output
-- **JSON Output**: Complete eval state saved for debugging
-- **GPQA Support**: Answer shuffling with reproducible results
-
-## Architecture
-
-### Eval State
-```python
-@dataclass
-class EvalState:
-    id: str
-    tasks: List[str]
-    task_states: Dict[str, Dict[str, Any]]
-    sampling_config: Dict[str, Any]
-```
-
-### Processor
-- Handles processing, grading, and state management
-- Thread-safe concurrent execution
-- Configurable sampling parameters
-
-### Grader
-- Abstract grading interface supporting multiple types
-- Regex grader with dataset-specific patterns
-- CLI grader with external script interface
-- LLM grader with configurable server and model
-
-### Datasets
-- `AimeDataset`: 90 AIME 2025 questions
-- `Aime2025Dataset`: 30 AIME 2025 I & II questions
-- `Gsm8kDataset`: 7473 math word problems
-- `GpqaDataset`: 198 GPQA Diamond questions with shuffling
-
-## Configuration
-
-### Sampling Parameters (Optional)
-- `--temperature`: Sampling temperature
-- `--top-k`: Top K sampling
-- `--top-p`: Top P sampling
-- `--min-p`: Min P sampling
-- Only passed if explicitly specified
-
-### Grading Types
-- **regex**: Built-in patterns for each dataset
-- **cli**: External script with `--answer` and `--expected` args
-- **llm**: LLM-based extraction with few-shot examples and configurable server/model
-
-### Dataset Requirements
-- **AIME**: Supports regex, CLI, or LLM grader
-- **AIME2025**: Supports regex, CLI, or LLM grader
-- **GSM8K**: Supports regex, CLI, or LLM grader
-- **GPQA**: Requires LLM grader
-
-## Output Format
-
-### Progress Table
-```
-  Task ID       Dataset  Prompt (first 43 chars)                      Expected  Status
-  aime_000_001  AIME     Complete the following reactions and sel...  A         pending
-```
-
-### Results Summary
-```
-============================================================
-Results: 8/10 correct (80.0%)
-============================================================
-```
-
-### JSON Output
-Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
-
-## Technical Details
-
-- Default max tokens: -1 (infinite)
-- Default grader type: llm
-- Default seed: 1234
-- Default threads: 32
-- Prompt truncation: First 43 chars + padding + "..."
-- Response truncation: Last 10 lines for grading
-- GPQA requires LLM grader (returns letter A/B/C/D)
-- Judge model defaults to evaluated model if not specified
-- Sample answers defined in SAMPLE_ANSWERS dict for few-shot learning
diff --git a/examples/llama-eval/README.md b/examples/llama-eval/README.md
index 4409f9c90b..82ba6c46f2 100644
--- a/examples/llama-eval/README.md
+++ b/examples/llama-eval/README.md
@@ -1,112 +1,5 @@
-# llama-eval Evaluation Tool
+# llama-eval
 
 Simple evaluation tool for llama.cpp with support for multiple datasets.
 
-## Features
-
-- **Multiple Datasets**: AIME, GSM8K, GPQA
-- **Flexible Grading**: Regex, CLI, or LLM-based grading
-- **Parallel Processing**: Configurable thread count
-- **Real-time Feedback**: Progress tracking with detailed output
-- **Sampling Parameters**: Temperature, Top K, Top P, Min P
-- **JSON Output**: Complete eval state saved for debugging
-
-## Usage
-
-```bash
-python llama-eval.py \
-    --server http://127.0.0.1:8013 \
-    --model gpt-oss-20b-hf-low \
-    --judge-model gpt-oss-20b-hf-medium \
-    --dataset aime \
-    --n_cases 10 \
-    --grader-type llm \
-    --seed 42
-```
-
-## CLI Arguments
-
-- `--server`: llama-server URL (default: http://127.0.0.1:8013)
-- `--model`: Model name for evaluation (default: llama)
-- `--judge-model`: Model name for LLM judge (default: same as main model)
-- `--judge-server`: Server URL for LLM judge (default: same as main server)
-- `--dataset`: Dataset type (aime, aime2025, gsm8k, gpqa)
-- `--n_cases`: Number of cases to evaluate (default: all)
-- `--n_predict`: Max tokens to predict per prompt (default: -1, infinite)
-- `--temperature`: Sampling temperature (default: not passed)
-- `--top-k`: Top K sampling (default: not passed)
-- `--top-p`: Top P sampling (default: not passed)
-- `--min-p`: Min P sampling (default: not passed)
-- `--threads`: Number of threads for parallel requests (default: 32)
-- `--verbose`: Show detailed output for each case
-- `--output`: Output file for eval state (default: llama-eval-state.json)
-- `--grader-type`: Grader type (regex, cli, llm, default: llm)
-- `--grader-script`: Path to CLI grader script (required for --grader-type cli)
-- `--seed`: Random seed for shuffling (default: 1234)
-
-## Datasets
-
-### AIME
-- 90 questions from 2025 AIME competition
-- Answers in boxed format: `\boxed{answer}`
-- Requires regex grader or LLM grader
-
-### AIME2025
-- 30 questions from 2025 AIME I & II competitions
-- Answers in boxed format: `\boxed{answer}`
-- Supports regex, CLI, or LLM grader
-
-### GSM8K
-- 7473 math word problems
-- Answers are numeric values
-- Requires regex grader or LLM grader
-
-### GPQA
-- 198 questions from GPQA Diamond dataset
-- Multiple choice with shuffled options
-- Requires LLM grader (returns letter A, B, C, or D)
-
-## Grading Types
-
-### Regex Grader
-Built-in patterns for different datasets:
-- AIME: `\boxed{(\d+)}|\b(\d+)\b`
-- AIME2025: `\boxed{(\d+)}|\b(\d+)\b`
-- GSM8K: `\b(\d+)\b`
-- GPQA: Letter extraction (A, B, C, D)
-
-### CLI Grader
-External script interface:
-```bash
-./grader.sh --answer <answer> --expected <expected>
-```
-Returns exit code 0 if correct, non-zero if incorrect.
-
-### LLM Grader
-Uses LLM to extract and compare answers:
-- Configurable server and model
-- Includes few-shot examples from sample answers
-- Case-insensitive comparison
-- Required for GPQA dataset
-
-## Output
-
-### Progress Table
-```
-  Task ID       Dataset  Prompt (first 43 chars)                      Expected  Status
-  aime_000_001  AIME     Complete the following reactions and sel...  A         pending
-```
-
-### Results
-```
-============================================================
-Results: 8/10 correct (80.0%)
-============================================================
-```
-
-### JSON Output
-Complete eval state saved to output file with:
-- Task IDs and correctness status
-- Prompts and extracted answers
-- Sampling configuration
-- Processing metadata
+TODO: add usage
diff --git a/examples/llama-eval/llama-server-simulator-README.md b/examples/llama-eval/llama-server-simulator-README.md
deleted file mode 100644
index bd69e2615c..0000000000
--- a/examples/llama-eval/llama-server-simulator-README.md
+++ /dev/null
@@ -1,36 +0,0 @@
-# llama-server-simulator
-
-Standalone Python script simulating llama-server HTTP endpoint for testing.
-
-## Features
-
-- HTTP Server with OpenAI-compatible `/v1/chat/completions` endpoint
-- AIME Dataset Integration - Loads 90 questions from HuggingFace
-- Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
-- Configurable Success Rate - Control correct/wrong answer generation (0-1)
-- Debug Logging - Troubleshoot matching issues
-
-## Usage
-
-```bash
-python llama-server-simulator.py --success-rate 0.8
-```
-
-## Arguments
-
-- `--success-rate`: Probability of returning correct answer (0.0-1.0, default: 0.8)
-- `--port`: Server port (default: 8033)
-- `--debug`: Enable debug logging (default: False)
-
-## Testing
-
-```bash
-./test-simulator.sh
-```
-
-## Implementation Details
-
-- Uses Levenshtein distance for partial matching (threshold: 0.3)
-- Automatic caching via HuggingFace datasets library
-- Wrong answers generated by incrementing expected answer
-- Debug output written to stderr