- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
- Add ThreadPoolExecutor for parallel request processing controlled by --threads (see the sketch after this list)
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
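The pattern can be sketched as follows. Only _process_single_case(), process(), and the --threads/--model flags come from the changes above; the payload shape, lock-guarded progress counter, and placeholder cases are illustrative assumptions:
```python
# Hypothetical sketch of the threading pattern; only the names mentioned in
# the commit (_process_single_case, process, --threads, --model) are real.
import argparse
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

class Processor:
    def __init__(self, model: str):
        self.model = model
        self._lock = threading.Lock()  # guards the shared progress counter
        self.done = 0

    def _process_single_case(self, case: dict) -> bool:
        # Each worker builds its own request; no shared mutable state here.
        payload = {
            "model": self.model,  # model name injected via --model
            "messages": [{"role": "user", "content": case["question"]}],
        }
        ok = bool(payload)  # placeholder for send-request-and-validate logic
        with self._lock:  # progress updates are serialized across threads
            self.done += 1
            print(f"[{self.done}] {'correct' if ok else 'incorrect'}")
        return ok

    def process(self, cases: list, threads: int) -> int:
        # Fan the cases out over a bounded pool of worker threads.
        with ThreadPoolExecutor(max_workers=threads) as pool:
            futures = [pool.submit(self._process_single_case, c) for c in cases]
            return sum(f.result() for f in as_completed(futures))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--threads", type=int, default=4)
    parser.add_argument("--model", default="default-model")
    args = parser.parse_args()
    print(Processor(args.model).process([{"question": "1+1?"}] * 8, args.threads))
```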
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management (sketched after this list)
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
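A minimal sketch of this structure, assuming hypothetical field names and an exact-match grader standing in for the external grading interface (the real Processor dataclass is omitted):
```python
# Illustrative shape only: EvalState and the Grader interface mirror the
# bullets above, but every field name here is an assumption.
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass, field

class Grader(ABC):
    """Abstract grading interface so external graders can be plugged in."""
    @abstractmethod
    def grade(self, predicted: str, expected: str) -> bool: ...

class ExactMatchGrader(Grader):
    def grade(self, predicted: str, expected: str) -> bool:
        return predicted.strip() == expected.strip()

@dataclass
class EvalState:
    total: int = 0
    correct: int = 0
    results: list = field(default_factory=list)

    def record(self, case_id: int, ok: bool) -> None:
        self.total += 1
        self.correct += int(ok)
        self.results.append({"case": case_id, "correct": ok})
        print(f"case {case_id}: {'correct' if ok else 'incorrect'}")  # live feedback

    def to_json(self) -> str:
        # Structured JSON output for the eval state
        return json.dumps(asdict(self), indent=2)

grader = ExactMatchGrader()
state = EvalState()
state.record(1, grader.grade("204", "204"))
print(state.to_json())
```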
Extract the repeated question string into a TEST_QUESTION variable and
create a make_request() helper function to reduce code duplication.
Add proper handling of error responses.
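A sketch of what that refactor can look like; TEST_QUESTION and make_request() are named above, while the endpoint URL, payload shape, and error format are assumptions:
```python
# Illustrative sketch only: TEST_QUESTION and make_request() follow the
# commit message, but the endpoint URL and payload shape are assumptions.
import json
import urllib.error
import urllib.request

TEST_QUESTION = "What is 2 + 2?"  # single definition instead of repeated literals

def make_request(url: str, question: str = TEST_QUESTION) -> dict:
    payload = json.dumps({
        "messages": [{"role": "user", "content": question}],
    }).encode("utf-8")
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as e:
        # Error responses are surfaced to the caller instead of crashing.
        return {"error": {"code": e.code, "message": e.reason}}
    except urllib.error.URLError as e:
        return {"error": {"message": str(e.reason)}}

if __name__ == "__main__":
    print(make_request("http://localhost:8080/v1/chat/completions"))
```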
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:
- Implements /v1/chat/completions endpoint with OpenAI-compatible format (see the sketch below)
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting
Also includes test scripts and documentation for testing and understanding
the simulator functionality.
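A stdlib-only sketch of such a simulator; the dataset loading, Levenshtein matching, CLI flags, and debug logging of the real script are elided, with a hard-coded answer table standing in:
```python
# Minimal sketch of an OpenAI-compatible simulator endpoint; the real script
# loads the AIME dataset and uses Levenshtein matching, both elided here.
import json
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

SUCCESS_RATE = 0.8  # configurable probability of returning the correct answer
ANSWERS = {"What is 2 + 2?": "4"}  # stand-in for the dataset lookup

class SimHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        question = body["messages"][-1]["content"]
        answer = ANSWERS.get(question, "unknown")
        if random.random() >= SUCCESS_RATE:
            answer = "wrong answer"  # simulate an incorrect model response
        reply = {
            "object": "chat.completion",
            "choices": [{"index": 0,
                         "message": {"role": "assistant", "content": answer},
                         "finish_reason": "stop"}],
        }
        data = json.dumps(reply).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), SimHandler).serve_forever()
```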
This commit adds a new Python script that can be used to print tensor
information from a safetensors model.
The motivation for this is that during model conversion work it can
sometimes be useful to verify the shape of tensors in the original
model. While it is possible to print the tensors when loading the model,
this can be slow when working with larger models.
With this script it is possible to quickly query tensor shapes.
Example usage:
```console
(venv) $ ./scripts/utils/tensor-info.py --help
usage: tensor-info.py [-h] [-m MODEL_PATH] [-l] [tensor_name]

Print tensor information from a safetensors model

positional arguments:
  tensor_name           Name of the tensor to inspect

options:
  -h, --help            show this help message and exit
  -m MODEL_PATH, --model-path MODEL_PATH
                        Path to the model directory (default: MODEL_PATH environment variable)
  -l, --list            List unique tensor patterns in the model (layer numbers replaced with #)
```
Listing tensor names:
```console
(venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m -l
embed_tokens.weight
layers.#.input_layernorm.weight
layers.#.mlp.down_proj.weight
layers.#.mlp.gate_proj.weight
layers.#.mlp.up_proj.weight
layers.#.post_attention_layernorm.weight
layers.#.post_feedforward_layernorm.weight
layers.#.pre_feedforward_layernorm.weight
layers.#.self_attn.k_norm.weight
layers.#.self_attn.k_proj.weight
layers.#.self_attn.o_proj.weight
layers.#.self_attn.q_norm.weight
layers.#.self_attn.q_proj.weight
layers.#.self_attn.v_proj.weight
norm.weight
```
Printing a specific tensor's information:
```console
(venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m layers.0.input_layernorm.weight
Tensor: layers.0.input_layernorm.weight
File: model.safetensors
Shape: [768]
```
This commit adds a debug option to the model conversion script to enable
using the Python debugger (pdb) during model conversion.
The motivation for this is that I've found myself adding this a few
times now and it would be quicker to have this flag as an option and a
makefile target/recipe for it.
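A sketch of how such a flag is commonly wired, with a hypothetical entry point; only the pdb idea mirrors the commit:
```python
# Hypothetical sketch of the --debug flag; the actual conversion script and
# its entry point differ, this only shows the pdb wiring.
import argparse
import pdb

def convert_model(path: str) -> None:
    print(f"converting {path} ...")  # placeholder for the real conversion

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("model_path")
    parser.add_argument("--debug", action="store_true",
                        help="drop into pdb before running the conversion")
    args = parser.parse_args()
    if args.debug:
        pdb.set_trace()  # step through the conversion interactively
    convert_model(args.model_path)
```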
* lookup, lookahead: fix crash when n_ctx not specified
Since PR #16653 (Dec 15, 2025), the default n_ctx is 0 to enable automatic
GPU memory fitting. This causes llama-lookup and llama-lookahead to crash
when run without explicit -c flag:
GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded")
Root cause: Both examples use params.n_ctx directly for batch initialization,
but params.n_ctx remains 0 even after the context is properly initialized
to n_ctx_train internally.
Bug history:
- Nov 2023: lookahead.cpp created (PR #4207) with params.n_ctx pattern
- Dec 2023: lookup.cpp created (PR #4484) with same pattern
- Nov 2024: default n_ctx changed to 4096 (PR #10136) - bug dormant
- Dec 2025: default n_ctx changed to 0 (PR #16653) - bug activated
The bug was dormant for 2+ years because params.n_ctx defaulted to 512,
then 4096. PR #16653 changed it to 0 for GPU auto-fitting, triggering
the crash.
Fix: Use llama_n_ctx(ctx) to get the actual runtime context size, matching
the pattern already used elsewhere in lookup.cpp (line 72) and in
speculative.cpp/speculative-simple.cpp.
Tested: llama-lookup now works without -c flag (12.5% acceptance on
Gemma-3-1B).
Note: llama-lookahead has a separate pre-existing issue with sequence
initialization (n_seq_max=1 vs W+G+1 needed) that is unrelated to this fix.
* lookahead: fix n_seq_max and kv_unified configuration
Lookahead decoding requires:
- W + G + 1 = 31 sequences for parallel Jacobi decoding
- Unified KV cache for coupled sequences in batch splitting
These requirements were broken after PR #14482 changed validation logic.
Consolidates fix from PR #18730 per maintainer request.
Commit message drafted with Claude.
This commit modifies all the utility scripts to use an optional
BUILD_DIR variable/argument to specify the build directory.
The motivation for this is that commit
3d55846a5c ("model-conversion : add
BUILD_DIR variable to run-converted-model scripts") introduced this
variable to the causal and embeddings scripts, but I missed the scripts
in the utils directory.