# llama.cpp/tools/imatrix
Compute an importance matrix for a model and a given text dataset. The matrix can be used during quantization to enhance the quality of the quantized models. More information is available in https://github.com/ggml-org/llama.cpp/pull/4861.
## Usage
```bash
./llama-imatrix \
    -m model.gguf -f some-text.txt [-o imatrix.gguf] [--output-format {gguf,dat}] [--no-ppl] \
    [--process-output] [--chunk 123] [--save-frequency 0] [--output-frequency 10] \
    [--in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf ...] [--parse-special] \
    [--activation-statistics] [--show-statistics] [...]
```
Here `-m | --model` with a model name and `-f | --file` with a file containing calibration data (e.g. `wiki.train.raw`) are mandatory.
The parameters in square brackets are optional and have the following meaning:
* `-h | --help` shows usage information and exits.
* `-lv | --verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, each time the results are saved a message is written to `stderr`. If `>=2`, a message is output each time data is collected for any tensor. Default verbosity level is `1`.
* `-o | --output-file` specifies the name of the file where the computed data will be stored. If missing, `imatrix.gguf` is used.
* `-ofreq | --output-frequency` specifies how often the results computed so far are saved to disk. Default is 10 (i.e., every 10 chunks).
* `--output-format` specifies the output format of the generated imatrix file. Either `gguf` or `dat` (the legacy format). Defaults to `gguf`.
* `--save-frequency` specifies how often to save a copy of the imatrix in a separate file. Default is 0 (i.e., never).
* `--process-output` specifies if data will be collected for the `output.weight` tensor. Typically, it is better not to utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
* `--in-file` one or more existing imatrix files to load and combine. Useful for merging files from multiple runs/datasets.
* `--parse-special` enables parsing of special tokens (e.g., `<|im_start|>` in some models). Useful for models with custom tokenizers.
* `--chunk | --from-chunk` to skip the first `n` chunks of tokens from the input data. Useful for resuming or skipping initial low-quality data.
* `--chunks` maximum number of chunks to process. Default is `-1` for all available chunks.
* `--no-ppl` disables the calculation of perplexity for the processed chunks. Useful if you want to speed up the processing and do not care about perplexity.
* `--show-statistics` displays the imatrix file's statistics.
* `--activation-statistics` enables the collection of activation statistics for each tensor. If set, the imatrix file size will double, but reported statistics will be more accurate.
For faster computation, make sure to use GPU offloading via the `-ngl | --n-gpu-layers` argument.

Versions `b5942` and newer of `llama-imatrix` store data in GGUF format by default. For the legacy format, use `--output-format dat` when saving the output file. More information is available in https://github.com/ggml-org/llama.cpp/pull/9400.
## Examples
```bash
# generate importance matrix using the default filename (imatrix.gguf), offloading 99 layers to GPU
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -ngl 99

# use the imatrix to perform a Q4_K_M quantization
./llama-quantize --imatrix imatrix.gguf ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m

# generate and save the imatrix using the legacy format
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --output-format dat -o imatrix-legacy-format.dat -ngl 99

# convert legacy (binary) imatrix format to new (GGUF) format
./llama-imatrix --in-file imatrix-legacy-format.dat -o imatrix-new-format.gguf

# convert new (GGUF) imatrix format to legacy (binary) format
./llama-imatrix --in-file imatrix-new-format.gguf --output-format dat -o imatrix-legacy-format.dat

# combine existing imatrices
./llama-imatrix --in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf -o imatrix-combined.gguf

# skip the first 5 chunks, save intermediates every 20 chunks and snapshots every 50, parsing special tokens
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --chunk 5 --output-frequency 20 --save-frequency 50 --parse-special

# generate imatrix and enable activation-based statistics
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --activation-statistics -ngl 99

# analyse an imatrix file and display summary statistics instead of running inference
./llama-imatrix --in-file imatrix.gguf --show-statistics
```
## Statistics
Beginning with version , `--show-statistics` has two modes. If `--activation-statistics` was used at imatrix creation time and `--output-format` was set to `gguf`, it reports precise statistics. Otherwise, it reports less accurate, albeit still useful, metrics based on average squared activations.
### Per tensor
- Σ(Act²) (legacy mode) / L₂ Norm (preferred): In legacy mode, the raw sum of squared activations ($\sum \text{Act}^2$). In preferred mode, the Euclidean distance (L₂ norm) between this tensor's average activations and those of the previous layer.
- Min / Max / μ / σ: the tensor elements' minimum, maximum, mean, and standard deviation.
- N: number of tensor elements considered.
- H Norm: Shannon entropy normalized over $\log_2 N$, defined as $H_{\text{Norm}} = \frac{-\sum_{i=1}^N p_i \log_2 p_i}{\log_2 N}$. Used to determine how well a prompt "exercises" the model's capabilities.
- H (legacy mode) / ECS (preferred): In legacy mode, Shannon entropy, defined as $H = -\sum_{i=1}^N p_i \log_2 p_i$. In preferred mode, the Euclidean-Cosine Score, defined as $ECS = K \cdot e^{-\alpha a} \cdot |b|^{\gamma}$, where $a$ is the L₂ norm and $b$ the cosine similarity between this tensor's elements and those of the previous layer, with $\alpha = 0.01$ and $\gamma = 10$. A higher score means more similarity and less change.
- ZD: percentage of elements whose Z-score is > 1.0 in magnitude (an indicator of outliers), as described in section 3.1 Layer Importance Scores of Layer-Wise Quantization.
- CosSim: cosine similarity between this tensor's elements and those of the previous layer.
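As a rough illustration, the metrics above can be sketched in pure Python over plain lists of activation values. This is not the actual llama.cpp implementation; in particular, the scaling constant $K$ in the ECS formula is not specified here, so the default below (`k=100.0`) is an assumption.

```python
import math

def shannon_entropy(p):
    """H = -sum(p_i * log2 p_i) for a probability vector p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def normalized_entropy(p):
    """H Norm: entropy divided by log2(N); 1.0 means a uniform distribution."""
    n = len(p)
    return shannon_entropy(p) / math.log2(n) if n > 1 else 0.0

def zd_score(values, threshold=1.0):
    """ZD: fraction of elements whose |Z-score| exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return 0.0
    return sum(1 for v in values if abs((v - mean) / std) > threshold) / n

def cosine_similarity(a, b):
    """CosSim between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ecs(curr, prev, k=100.0, alpha=0.01, gamma=10.0):
    """ECS = K * exp(-alpha * a) * |b|**gamma, with a = L2 distance and
    b = cosine similarity between this tensor and the previous layer's.
    The value of K here is an assumption, not taken from llama.cpp."""
    a = l2_distance(curr, prev)
    b = cosine_similarity(curr, prev)
    return k * math.exp(-alpha * a) * abs(b) ** gamma
```

For identical activation vectors the distance is 0 and the cosine similarity is 1, so ECS reduces to $K$, its maximum; any change between layers decays the score.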
### Per layer
Aggregated metrics per block/layer:
- Σ(Act²) (legacy mode) / L₂ Norm (preferred): In legacy mode, the sum of squared activations for the layer's concatenated tensors. In preferred mode, the Euclidean distance (L₂ norm) between this layer's average concatenated tensor activations and those of the previous layer.
- ZD: percentage of this layer's concatenated tensors' elements with |Z| > 1.
- CosSim: cosine similarity between this layer's concatenated tensors' elements and the previous layer's.
- ECS (preferred only): the Euclidean-Cosine Score applied to the layer.
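The per-layer variants apply the same metrics to the layer's tensors concatenated into a single flat vector. A minimal, self-contained sketch of that aggregation step (illustrative only, not the llama.cpp implementation):

```python
import math

def layer_cosine_similarity(curr_tensors, prev_tensors):
    """Flatten and concatenate each layer's tensors into one vector,
    then compute the cosine similarity between consecutive layers."""
    a = [x for t in curr_tensors for x in t]
    b = [x for t in prev_tensors for x in t]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```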
More information is available in https://github.com/ggml-org/llama.cpp/pull/14891.