From 1c128d941ee447344984b825dfa34d9f09a30b13 Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Sun, 29 Mar 2026 17:31:04 +0300
Subject: [PATCH] remove junk

---
 examples/llama-eval/AGENTS.md                 | 190 ------------------
 examples/llama-eval/IMPLEMENTATION.md         |  94 ---------
 examples/llama-eval/README.md                 | 111 +---------
 .../llama-server-simulator-README.md          |  36 ----
 4 files changed, 2 insertions(+), 429 deletions(-)
 delete mode 100644 examples/llama-eval/AGENTS.md
 delete mode 100644 examples/llama-eval/IMPLEMENTATION.md
 delete mode 100644 examples/llama-eval/llama-server-simulator-README.md

diff --git a/examples/llama-eval/AGENTS.md b/examples/llama-eval/AGENTS.md
deleted file mode 100644
index 60700aefc7..0000000000
--- a/examples/llama-eval/AGENTS.md
+++ /dev/null
@@ -1,190 +0,0 @@
-# llama-eval Codebase Guidelines
-
-## Overview
-
-This directory contains Python evaluation tools for llama.cpp:
-- `llama-eval.py` - Main evaluation tool with multiple datasets (AIME, AIME2025, GSM8K, GPQA)
-- `llama-server-simulator.py` - Flask-based server simulator for testing
-- `test-simulator.sh` - Test script for the simulator
-
-## Build/Run Commands
-
-### Virtual Environment
-The project uses a virtual environment located at `venv/`:
-```bash
-source venv/bin/activate
-```
-
-### Running the Main Evaluator
-```bash
-python llama-eval.py \
-    --server http://127.0.0.1:8013 \
-    --model gpt-oss-20b-hf-low \
-    --dataset aime \
-    --n_cases 10 \
-    --grader-type llm \
-    --seed 42
-```
-
-### Running the Simulator (for testing)
-```bash
-python llama-server-simulator.py --port 8033 --success-rate 0.8
-```
-
-### Running Tests
-```bash
-./test-simulator.sh
-```
-
-## Code Style Guidelines
-
-### Imports
-- Standard library imports first (argparse, json, os, re, subprocess, sys, time)
-- Third-party imports (requests, tqdm, datasets, flask) after standard library
-- Relative imports not used
-- Group imports by category with blank line between groups
-
-### Formatting
-- 4-space indentation
-- Max line length: 125 characters (per parent project's .flake8)
-- Use double quotes for strings
-- Use triple double quotes for docstrings
-- Binary operators at the beginning of continued lines
-
-### Naming Conventions
-- Classes: PascalCase (e.g., `AimeDataset`, `Grader`, `Processor`)
-- Functions: snake_case (e.g., `normalize_number`, `get_prompt`)
-- Variables: snake_case (e.g., `question_text`, `correct_count`)
-- Constants: UPPER_SNAKE_CASE (e.g., `GRADER_PATTERNS`, `TEMPLATE_REGISTRY`)
-- Private methods: prefix with underscore (e.g., `_load_dataset`, `_grade_regex`)
-
-### Types
-- Use type hints for all function signatures
-- Import from `typing` module: `Dict`, `List`, `Optional`, `Any`, `Tuple`
-- Use `@dataclass` for data structures
-- Prefer `Optional[T]` over `Union[T, None]`
-
-### Error Handling
-- Use try/except for network requests and file operations
-- Return `None` or `False` on errors when appropriate
-- Use `ValueError` for invalid arguments
-- Use `FileNotFoundError` for missing files
-- CLI scripts should handle exceptions gracefully
-
-### Dataclasses
-- Use `@dataclass` for structured data
-- Define fields with explicit types
-- Use `Optional[T]` for nullable fields
-- Provide default values where appropriate
-
-### String Formatting
-- Use f-strings for formatting (Python 3.6+)
-- Use triple double quotes for multi-line strings
-- Escape backslashes in regex patterns: `r'\\boxed{(\d+)}'`
-
-### File Paths
-- Use `pathlib.Path` instead of string paths
-- Create directories with `mkdir(parents=True, exist_ok=True)`
-- Use `Path.home()` for user home directory
-
-### Logging
-- Use `print()` for user-facing output
-- Use `sys.stderr` for debug logging
-- Simulator writes debug logs to `/tmp/simulator-debug.log`
-
-### Testing
-
-- Test script uses bash with `set -e` for strict error handling
-- Simulator runs in background with PID tracking
-- Tests verify correct answers, error cases, and edge cases
-- Use `curl` for HTTP testing in shell scripts
-
-### Whitespace Cleanup
-- Remove trailing whitespace from all lines
-- When making edits, do not leave trailing whitespace
-
-## Dataset Support
-
-### AIME Dataset
-- 90 questions from 2025 AIME competition
-- Answers in `\boxed{answer}` format
-- Supports regex, CLI, and LLM grading
-
-### AIME2025 Dataset
-- 30 questions from 2025 AIME I & II
-- Answers in `\boxed{answer}` format
-- Requires loading two config parts
-
-### GSM8K Dataset
-- 7473 math word problems
-- Answers numeric values with `####` separator
-- Supports regex, CLI, and LLM grading
-
-### GPQA Dataset
-- 198 questions from GPQA Diamond
-- Multiple choice with shuffled options (A, B, C, D)
-- **Requires LLM grader** (returns letter A/B/C/D)
-
-## Grading Types
-
-### Regex Grader
-- Built-in patterns per dataset
-- Prioritizes `\boxed{}` for AIME datasets
-- Extracts last number for GSM8K
-
-### CLI Grader
-- External script interface
-- Call: `grader.sh --answer <answer> --expected <expected>`
-- Exit code 0 = correct, non-zero = incorrect
-
-### LLM Grader
-- Uses judge model for answer extraction
-- Includes few-shot examples
-- Case-insensitive comparison
-- Required for GPQA
-
-## Configuration
-
-### Sampling Parameters (Optional)
-- `--temperature`: Sampling temperature
-- `--top-k`: Top K sampling
-- `--top-p`: Top P sampling
-- `--min-p`: Min P sampling
-- Only passed to API if explicitly specified
-
-### Default Values
-- `--n_predict`: -1 (infinite)
-- `--grader-type`: llm
-- `--seed`: 1234
-- `--threads`: 32
-- `--output`: llama-eval-state.json
-
-## Output Format
-
-### Progress Table
-- Shows task ID, dataset, prompt (truncated to 43 chars), expected answer, status
-- Uses `tqdm` for progress bars
-
-### Results Summary
-- Format: `Results: X/Y correct (Z%)`
-- Displayed after all tasks complete
-
-### JSON Output
-- Complete eval state saved to output file
-- Contains: task IDs, correctness, prompts, extracted answers, sampling config
-- Uses `dataclasses.asdict()` for serialization
-
-## HuggingFace Datasets
-
-- Cache directory: `~/.cache/huggingface/datasets`
-- Set via `HF_DATASETS_CACHE` environment variable
-- Telemetry disabled via `HF_HUB_DISABLE_TELEMETRY=1`
-- Datasets loaded with `datasets.load_dataset()`
-
-## Flask Simulator
-
-- Runs on configurable port (default: 5000)
-- Endpoint: `/v1/chat/completions` (OpenAI-compatible)
-- Uses Dice coefficient for question matching
-- Configurable success rate for testing
-- Debug logs to `/tmp/simulator-debug.log`
diff --git a/examples/llama-eval/IMPLEMENTATION.md b/examples/llama-eval/IMPLEMENTATION.md
deleted file mode 100644
index 9ce2bdc3f9..0000000000
--- a/examples/llama-eval/IMPLEMENTATION.md
+++ /dev/null
@@ -1,94 +0,0 @@
-# llama-eval Implementation Summary
-
-## Overview
-
-Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).
-
-## Key Features
-
-- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
-- **Flexible Grading**: Regex, CLI, or LLM-based grading
-- **Parallel Processing**: Configurable thread count for concurrent requests
-- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
-- **Real-time Feedback**: Progress tracking with detailed output
-- **JSON Output**: Complete eval state saved for debugging
-- **GPQA Support**: Answer shuffling with reproducible results
-
-## Architecture
-
-### Eval State
-```python
-@dataclass
-class EvalState:
-    id: str
-    tasks: List[str]
-    task_states: Dict[str, Dict[str, Any]]
-    sampling_config: Dict[str, Any]
-```
-
-### Processor
-- Handles processing, grading, and state management
-- Thread-safe concurrent execution
-- Configurable sampling parameters
-
-### Grader
-- Abstract grading interface supporting multiple types
-- Regex grader with dataset-specific patterns
-- CLI grader with external script interface
-- LLM grader with configurable server and model
-
-### Datasets
-- `AimeDataset`: 90 AIME 2025 questions
-- `Aime2025Dataset`: 30 AIME 2025 I & II questions
-- `Gsm8kDataset`: 7473 math word problems
-- `GpqaDataset`: 198 GPQA Diamond questions with shuffling
-
-## Configuration
-
-### Sampling Parameters (Optional)
-- `--temperature`: Sampling temperature
-- `--top-k`: Top K sampling
-- `--top-p`: Top P sampling
-- `--min-p`: Min P sampling
-- Only passed if explicitly specified
-
-### Grading Types
-- **regex**: Built-in patterns for each dataset
-- **cli**: External script with `--answer` and `--expected` args
-- **llm**: LLM-based extraction with few-shot examples and configurable server/model
-
-### Dataset Requirements
-- **AIME**: Supports regex, CLI, or LLM grader
-- **AIME2025**: Supports regex, CLI, or LLM grader
-- **GSM8K**: Supports regex, CLI, or LLM grader
-- **GPQA**: Requires LLM grader
-
-## Output Format
-
-### Progress Table
-```
-  Task ID       Dataset  Prompt (first 43 chars)                      Expected  Status
-  aime_000_001  AIME     Complete the following reactions and sel...  A         pending
-```
-
-### Results Summary
-```
-============================================================
-Results: 8/10 correct (80.0%)
-============================================================
-```
-
-### JSON Output
-Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
-
-## Technical Details
-
-- Default max tokens: -1 (infinite)
-- Default grader type: llm
-- Default seed: 1234
-- Default threads: 32
-- Prompt truncation: First 43 chars + padding + "..."
-- Response truncation: Last 10 lines for grading
-- GPQA requires LLM grader (returns letter A/B/C/D)
-- Judge model defaults to evaluated model if not specified
-- Sample answers defined in SAMPLE_ANSWERS dict for few-shot learning
diff --git a/examples/llama-eval/README.md b/examples/llama-eval/README.md
index 4409f9c90b..82ba6c46f2 100644
--- a/examples/llama-eval/README.md
+++ b/examples/llama-eval/README.md
@@ -1,112 +1,5 @@
-# llama-eval Evaluation Tool
+# llama-eval
 
 Simple evaluation tool for llama.cpp with support for multiple datasets.
 
-## Features
-
-- **Multiple Datasets**: AIME, GSM8K, GPQA
-- **Flexible Grading**: Regex, CLI, or LLM-based grading
-- **Parallel Processing**: Configurable thread count
-- **Real-time Feedback**: Progress tracking with detailed output
-- **Sampling Parameters**: Temperature, Top K, Top P, Min P
-- **JSON Output**: Complete eval state saved for debugging
-
-## Usage
-
-```bash
-python llama-eval.py \
-    --server http://127.0.0.1:8013 \
-    --model gpt-oss-20b-hf-low \
-    --judge-model gpt-oss-20b-hf-medium \
-    --dataset aime \
-    --n_cases 10 \
-    --grader-type llm \
-    --seed 42
-```
-
-## CLI Arguments
-
-- `--server`: llama-server URL (default: http://127.0.0.1:8013)
-- `--model`: Model name for evaluation (default: llama)
-- `--judge-model`: Model name for LLM judge (default: same as main model)
-- `--judge-server`: Server URL for LLM judge (default: same as main server)
-- `--dataset`: Dataset type (aime, aime2025, gsm8k, gpqa)
-- `--n_cases`: Number of cases to evaluate (default: all)
-- `--n_predict`: Max tokens to predict per prompt (default: -1, infinite)
-- `--temperature`: Sampling temperature (default: not passed)
-- `--top-k`: Top K sampling (default: not passed)
-- `--top-p`: Top P sampling (default: not passed)
-- `--min-p`: Min P sampling (default: not passed)
-- `--threads`: Number of threads for parallel requests (default: 32)
-- `--verbose`: Show detailed output for each case
-- `--output`: Output file for eval state (default: llama-eval-state.json)
-- `--grader-type`: Grader type (regex, cli, llm, default: llm)
-- `--grader-script`: Path to CLI grader script (required for --grader-type cli)
-- `--seed`: Random seed for shuffling (default: 1234)
-
-## Datasets
-
-### AIME
-- 90 questions from 2025 AIME competition
-- Answers in boxed format: `\boxed{answer}`
-- Requires regex grader or LLM grader
-
-### AIME2025
-- 30 questions from 2025 AIME I & II competitions
-- Answers in boxed format: `\boxed{answer}`
-- Supports regex, CLI, or LLM grader
-
-### GSM8K
-- 7473 math word problems
-- Answers are numeric values
-- Requires regex grader or LLM grader
-
-### GPQA
-- 198 questions from GPQA Diamond dataset
-- Multiple choice with shuffled options
-- Requires LLM grader (returns letter A, B, C, or D)
-
-## Grading Types
-
-### Regex Grader
-Built-in patterns for different datasets:
-- AIME: `\boxed{(\d+)}|\b(\d+)\b`
-- AIME2025: `\boxed{(\d+)}|\b(\d+)\b`
-- GSM8K: `\b(\d+)\b`
-- GPQA: Letter extraction (A, B, C, D)
-
-### CLI Grader
-External script interface:
-```bash
-./grader.sh --answer <answer> --expected <expected>
-```
-Returns exit code 0 if correct, non-zero if incorrect.
-
-### LLM Grader
-Uses LLM to extract and compare answers:
-- Configurable server and model
-- Includes few-shot examples from sample answers
-- Case-insensitive comparison
-- Required for GPQA dataset
-
-## Output
-
-### Progress Table
-```
-  Task ID       Dataset  Prompt (first 43 chars)                      Expected  Status
-  aime_000_001  AIME     Complete the following reactions and sel...  A         pending
-```
-
-### Results
-```
-============================================================
-Results: 8/10 correct (80.0%)
-============================================================
-```
-
-### JSON Output
-Complete eval state saved to output file with:
-- Task IDs and correctness status
-- Prompts and extracted answers
-- Sampling configuration
-- Processing metadata
+TODO: add usage
diff --git a/examples/llama-eval/llama-server-simulator-README.md b/examples/llama-eval/llama-server-simulator-README.md
deleted file mode 100644
index bd69e2615c..0000000000
--- a/examples/llama-eval/llama-server-simulator-README.md
+++ /dev/null
@@ -1,36 +0,0 @@
-# llama-server-simulator
-
-Standalone Python script simulating llama-server HTTP endpoint for testing.
-
-## Features
-
-- HTTP Server with OpenAI-compatible `/v1/chat/completions` endpoint
-- AIME Dataset Integration - Loads 90 questions from HuggingFace
-- Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
-- Configurable Success Rate - Control correct/wrong answer generation (0-1)
-- Debug Logging - Troubleshoot matching issues
-
-## Usage
-
-```bash
-python llama-server-simulator.py --success-rate 0.8
-```
-
-## Arguments
-
-- `--success-rate`: Probability of returning correct answer (0.0-1.0, default: 0.8)
-- `--port`: Server port (default: 8033)
-- `--debug`: Enable debug logging (default: False)
-
-## Testing
-
-```bash
-./test-simulator.sh
-```
-
-## Implementation Details
-
-- Uses Levenshtein distance for partial matching (threshold: 0.3)
-- Automatic caching via HuggingFace datasets library
-- Wrong answers generated by incrementing expected answer
-- Debug output written to stderr