llama.cpp/examples/model-conversion
Daniel Bevenius ffba4f29e6
examples : add debug utility/example (#18464)
* examples : add debug utility/example

This commit introduces a new example named llama-debug, a utility
intended to assist with developing/debugging a converted model.

The motivation for this utility is to assist in model conversion work
to verify that the model produces the expected outputs. It is intended
to replace logits.cpp in examples/model-conversion.

Example usage:
```console
./build/bin/llama-debug \
    -m models/Qwen2.5-0.5B-Instruct.gguf \
    --prompt "Hello, my name is" \
    --save-logits
...
Model add_bos: false
Input prompt: "Hello, my name is"
Token ids (5):
Hello(9707) ,(11)  my(847)  name(829)  is(374)
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.bin
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.txt
Prompt saved to data/llamacpp-Qwen2.5-0.5B-Instruct-prompt.txt
Tokens saved to data/llamacpp-Qwen2.5-0.5B-Instruct-tokens.bin
```

For more details about the options available for this example, please
refer to examples/debug/README.md.

* throw runtime error instead of logging error

* remove params.warmup and enable the warmup/nowarmup option

* model-conversion : remove logits.cpp

This commit removes logits.cpp in favor of using llama-debug for
generating logits and embeddings.

* examples : remove model-conversion directory

This was missed in the previous commit.

* model-conversion : add support for saving prompt and token ids

This commit adds support for storing the prompt and the token ids for the
prompt when running the original models.

The motivation for this is that it allows us to compare the prompt
and the tokens generated for the prompt when verifying the converted
model. Currently, even if the same prompt is used, the generated tokens
can differ if there is a difference in tokenization between the
original and converted model, which would go unnoticed (the
verification will most likely fail but it might not be obvious why).

* squash! model-conversion : add support for saving prompt and token ids

fix pyright errors.

* model-conversion : add compare_tokens utility

This commit adds a script to compare token outputs between original and
converted models.

Example usage:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16

Comparing tokens between:
  Original : pytorch-gemma-3-270m-it (6 tokens)
  Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)

 All 6 tokens match!
```
And there is a verbose flag that will also print out the prompts:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16 -v

Original model prompt (pytorch-gemma-3-270m-it):
  prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563

Converted model prompt (llamacpp-gemma-3-270m-it-bf16):
  prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563

Comparing tokens between:
  Original : pytorch-gemma-3-270m-it (6 tokens)
  Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)

 All 6 tokens match!
```

* model-conversion : add token comparison to verification scripts

This commit adds calls to the compare_tokens function in
compare-logits.py and semantic_check.py to ensure that the token ids
that the tokenizers produce are the same before proceeding with
verifying the logits/embeddings.

Placing the calls in the existing scripts instead of calling them
separately ensures that the token comparison is always done prior to
the logit/embedding verification.

A follow-up commit/PR could refactor the causal logits verification
into a single script instead of the two that exist now. This would
reduce the code and make it consistent with the embeddings
verification, which only has a single script.

* debug : use llama_model_n_embd_out

This commit updates the debug example to use the new function
llama_model_n_embd_out instead of llama_model_n_embd.

The motivation for this change is to support late interaction retriever
models, like LFM2-ColBert-350M, where the output embeddings are
down-projected to a lower dimension.

* debug : add print_usage function

This commit adds a print_usage function that is passed to
common_params_parse.

The motivation for this is that it enables a specific usage message
which is printed after all the options, for example:
```console
example usage:

  Print tensors:

  ./build/bin/llama-debug -m model.gguf -p "Hello my name is" --verbose

  The tensors to be printed can be filtered with --tensor-filter option.

  Save logits/embeddings:

  ./build/bin/llama-debug -m model.gguf -p "Hello my name is" --save-logits

  Add --embedding to save embeddings
```

README.md

Model Conversion Example

This directory contains scripts and code to help in the process of converting HuggingFace PyTorch models to GGUF format.

The motivation for having this is that conversion is often an iterative process: the original model is inspected, converted, updates are made to llama.cpp, the model is converted again, and so on. Once the model has been converted it needs to be verified against the original model, then optionally quantized, and in some cases the perplexity of the quantized model needs to be checked. Finally, the model/models need to be uploaded to the ggml-org on Hugging Face. This tool/example tries to help with this process.

📝 Note: When adding a new model from an existing family, verify that the previous version passes logits verification first. Existing models can have subtle numerical differences that don't affect generation quality but cause logits mismatches. Identifying these upfront, whether they exist in llama.cpp, the conversion script, or an upstream implementation, can save significant debugging time.

Overview

The idea is that the Makefile targets and scripts here can be used in the development/conversion process, assisting with things like:

  • inspect/run the original model to figure out how it works
  • convert the original model to GGUF format
  • inspect/run the converted model
  • verify the logits produced by the original model and the converted model
  • quantize the model to GGUF format
  • run perplexity evaluation to verify that the quantized model is performing as expected
  • upload the model to HuggingFace to make it available for others

Setup

Create virtual python environment

$ python3.11 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt

Causal Language Model Conversion

This section describes the steps to convert a causal language model to GGUF and to verify that the conversion was successful.

Download the original model

First, clone the original model to some local directory:

$ mkdir models && cd models
$ git clone https://huggingface.co/user/model_name
$ cd model_name
$ git lfs install
$ git lfs pull

Set the MODEL_PATH

The path to the downloaded model can be provided in two ways:

Option 1: Environment variable (recommended for iterative development)

export MODEL_PATH=~/work/ai/models/some_model

Option 2: Command line argument (for one-off tasks)

make causal-convert-model MODEL_PATH=~/work/ai/models/some_model

Command line arguments take precedence over environment variables when both are provided.

In cases where the transformers implementation for the model has not been released yet, it is possible to set the environment variable UNRELEASED_MODEL_NAME, which will cause the model implementation to be loaded explicitly instead of using AutoModelForCausalLM:

export UNRELEASED_MODEL_NAME=SomeNewModel
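
As a rough sketch of the pattern this enables (the module path and class-name construction below are illustrative assumptions, not the exact logic of the run scripts):

```python
# Hedged sketch of explicit model loading when UNRELEASED_MODEL_NAME is set.
# The module path and class-name construction are assumptions for illustration.
import importlib
import os

from transformers import AutoModelForCausalLM

model_path = os.environ["MODEL_PATH"]
unreleased = os.environ.get("UNRELEASED_MODEL_NAME")

if unreleased:
    # e.g. UNRELEASED_MODEL_NAME=SomeNewModel -> transformers.models.somenewmodel
    module = importlib.import_module(f"transformers.models.{unreleased.lower()}")
    model_cls = getattr(module, f"{unreleased}ForCausalLM")
    model = model_cls.from_pretrained(model_path)
else:
    model = AutoModelForCausalLM.from_pretrained(model_path)
```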

Inspecting the original tensors

# Using environment variable
(venv) $ make causal-inspect-original-model

# Or using command line argument
(venv) $ make causal-inspect-original-model MODEL_PATH=~/work/ai/models/some_model

Running the original model

This is mainly to verify that the original model works, and to compare the output from the converted model.

# Using environment variable
(venv) $ make causal-run-original-model

# Or using command line argument
(venv) $ make causal-run-original-model MODEL_PATH=~/work/ai/models/some_model

This command will save two files to the data directory: a binary file containing logits, which will be used for comparison with the converted model later, and a text file, which allows for manual visual inspection.
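
If you want to peek at the saved logits outside of the provided scripts, a minimal sketch could look like the following (the file name is a placeholder and the raw float32 layout is an assumption):

```python
# Minimal sketch: inspect the saved logits binary.
# File name and float32 layout are assumptions for illustration.
import numpy as np

logits = np.fromfile("data/pytorch-model_name.bin", dtype=np.float32)
print(f"{logits.shape[0]} logits, first 5: {logits[:5]}")
```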

Model conversion

After updates have been made to gguf-py to add support for the new model, the model can be converted to GGUF format using the following command:

# Using environment variable
(venv) $ make causal-convert-model

# Or using command line argument
(venv) $ make causal-convert-model MODEL_PATH=~/work/ai/models/some_model

Inspecting the converted model

The converted model can be inspected using the following command:

(venv) $ make causal-inspect-converted-model

Running the converted model

(venv) $ make causal-run-converted-model

Model logits verification

The following target will run the original model and the converted model and compare the logits:

(venv) $ make causal-verify-logits
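
Conceptually, the verification boils down to comparing the two saved logits arrays. A rough standalone sketch is shown below; the file names and the float32 layout are assumptions, and the actual verification scripts may use different metrics:

```python
# Rough sketch of a logits comparison; not the actual compare-logits.py logic.
import numpy as np

org = np.fromfile("data/pytorch-model_name.bin", dtype=np.float32)
conv = np.fromfile("data/llamacpp-model_name.bin", dtype=np.float32)

print("max abs diff :", np.max(np.abs(org - conv)))
print("top-10 match :", np.array_equal(np.argsort(org)[-10:], np.argsort(conv)[-10:]))
```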

Quantizing the model

The causal model can be quantized to GGUF format using the following command:

(venv) $ make causal-quantize-Q8_0
Quantized model saved to: /path/to/quantized/model-Q8_0.gguf
Export the quantized model path to QUANTIZED_MODEL variable in your environment

This will show the path to the quantized model in the terminal, which can then be used to set the QUANTIZED_MODEL environment variable:

export QUANTIZED_MODEL=/path/to/quantized/model-Q8_0.gguf

Then the quantized model can be run using the following command:

(venv) $ make causal-run-quantized-model

Quantizing QAT (Quantization Aware Training) models

When quantizing to Q4_0, the default data type for the token embedding weights will be Q6_K. For models that are going to be uploaded to ggml-org it is recommended to use Q8_0 instead for the embeddings and output tensors. The reason is that although Q6_K is smaller in size, it requires more compute to unpack, which can hurt performance during output generation when the entire embedding matrix must be dequantized to compute vocabulary logits. Q8_0 provides practically full quality with better computational efficiency.

(venv) $ make causal-quantize-qat-Q4_0

Embedding Language Model Conversion

Download the original model

$ mkdir models && cd models
$ git clone https://huggingface.co/user/model_name
$ cd model_name
$ git lfs install
$ git lfs pull

The path to the embedding model can be provided in two ways:

Option 1: Environment variable (recommended for iterative development)

export EMBEDDING_MODEL_PATH=~/path/to/embedding_model

Option 2: Command line argument (for one-off tasks)

make embedding-convert-model EMBEDDING_MODEL_PATH=~/path/to/embedding_model

Command line arguments take precedence over environment variables when both are provided.

Running the original model

This is mainly to verify that the original model works and to compare the output with the output from the converted model.

# Using environment variable
(venv) $ make embedding-run-original-model

# Or using command line argument
(venv) $ make embedding-run-original-model EMBEDDING_MODEL_PATH=~/path/to/embedding_model

This command will save two files to the data directory: a binary file containing logits, which will be used for comparison with the converted model, and a text file, which allows for manual visual inspection.

Using SentenceTransformer with numbered layers

For models that have numbered SentenceTransformer layers (01_Pooling, 02_Dense, 03_Dense, 04_Normalize), use the -st targets to apply all these layers:

# Run original model with SentenceTransformer (applies all numbered layers)
(venv) $ make embedding-run-original-model-st

# Run converted model with pooling enabled
(venv) $ make embedding-run-converted-model-st

This will use the SentenceTransformer library to load and run the model, which automatically applies all the numbered layers in the correct order. This is particularly useful when comparing with models that should include these additional transformation layers beyond just the base model output.
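
For reference, the SentenceTransformer path roughly amounts to the following minimal sketch (the model path is a placeholder):

```python
# Minimal sketch of running an embedding model through SentenceTransformer,
# which applies the numbered layers (pooling, dense, normalize) automatically.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("models/embedding_model")  # placeholder path
embeddings = model.encode(["Hello, my name is"])
print(embeddings.shape)
```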

Model conversion

After updates have been made to gguf-py to add support for the new model, the model can be converted to GGUF format using the following command:

(venv) $ make embedding-convert-model

Run the converted model

(venv) $ make embedding-run-converted-model

Model logits verification

The following target will run the original model and the converted model (which was done manually in the previous steps) and compare the logits:

(venv) $ make embedding-verify-logits

For models with SentenceTransformer layers, use the -st verification target:

(venv) $ make embedding-verify-logits-st

This convenience target automatically runs both the original model with SentenceTransformer and the converted model with pooling enabled, then compares the results.

llama-server verification

To verify that the converted model works with llama-server, the following command can be used:

(venv) $ make embedding-start-embedding-server

Then open another terminal and set the EMBEDDING_MODEL_PATH environment variable as this will not be inherited by the new terminal:

(venv) $ make embedding-curl-embedding-endpoint

This will call the embedding endpoint and the output will be piped into the same verification script as used by the embedding-verify-logits target.
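
If curl is inconvenient, the endpoint can also be called from Python. A minimal sketch is shown below; the port and the OpenAI-compatible /v1/embeddings payload are assumptions about the default server configuration:

```python
# Minimal sketch of calling the embeddings endpoint of a running llama-server.
# Port and request shape are assumptions for illustration.
import requests

resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={"input": "Hello, my name is"},
)
print(resp.json()["data"][0]["embedding"][:5])
```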

The causal model can also be used to produce embeddings and this can be verified using the following commands:

(venv) $ make causal-start-embedding-server

Then open another terminal and set the MODEL_PATH environment variable as this will not be inherited by the new terminal:

(venv) $ make casual-curl-embedding-endpoint

Quantizing the model

The embedding model can be quantized to GGUF format using the following command:

(venv) $ make embedding-quantize-Q8_0
Quantized model saved to: /path/to/quantized/model-Q8_0.gguf
Export the quantized model path to QUANTIZED_EMBEDDING_MODEL variable in your environment

This will show the path to the quantized model in the terminal, which can then be used to set the QUANTIZED_EMBEDDING_MODEL environment variable:

export QUANTIZED_EMBEDDING_MODEL=/path/to/quantized/model-Q8_0.gguf

Then the quantized model can be run using the following command:

(venv) $ make embedding-run-quantized-model

Quantizing QAT (Quantization Aware Training) models

When quantizing to Q4_0, the default data type for the token embedding weights will be Q6_K. For models that are going to be uploaded to ggml-org it is recommended to use Q8_0 instead for the embeddings and output tensors. The reason is that although Q6_K is smaller in size, it requires more compute to unpack, which can hurt performance during output generation when the entire embedding matrix must be dequantized to compute vocabulary logits. Q8_0 provides practically full quality with better computational efficiency.

(venv) $ make embedding-quantize-qat-Q4_0

Perplexity Evaluation

Simple perplexity evaluation

This allows running the perplexity evaluation without having to generate a token/logits file:

(venv) $ make perplexity-run QUANTIZED_MODEL=~/path/to/quantized/model.gguf

This will use the wikitext dataset to run the perplexity evaluation and output the perplexity score to the terminal. This value can then be compared with the perplexity score of the unquantized model.

Full perplexity evaluation

First use the converted, non-quantized, model to generate the perplexity evaluation dataset using the following command:

$ make perplexity-data-gen CONVERTED_MODEL=~/path/to/converted/model.gguf

This will generate a file in the data directory named after the model and with a .kld suffix which contains the tokens and the logits for the wikitext dataset.

After the dataset has been generated, the perplexity evaluation can be run using the quantized model:

$ make perplexity-run-full QUANTIZED_MODEL=~/path/to/quantized/model-Qxx.gguf LOGITS_FILE=data/model.gguf.ppl

📝 Note: The LOGITS_FILE generated by the previous command can be very large, so make sure you have enough disk space available.

HuggingFace utilities

The following targets are useful for creating collections and model repositories on Hugging Face in the ggml-org. These can be used when preparing a release to script the process for new model releases.

For the following targets a HF_TOKEN environment variable is required.

📝 Note: Don't forget to log out from Hugging Face after running these commands, otherwise you might have issues pulling/cloning repositories as the token will still be in use: $ huggingface-cli logout followed by $ unset HF_TOKEN

Create a new Hugging Face Model (model repository)

This will create a new model repository on Hugging Face with the specified model name.

(venv) $ make hf-create-model MODEL_NAME='TestModel' NAMESPACE="danbev" ORIGINAL_BASE_MODEL="some-base-model"
Repository ID:  danbev/TestModel-GGUF
Repository created: https://huggingface.co/danbev/TestModel-GGUF

Note that we append a -GGUF suffix to the model name to ensure a consistent naming convention for GGUF models.

An embedding model can be created using the following command:

(venv) $ make hf-create-model-embedding MODEL_NAME='TestEmbeddingModel' NAMESPACE="danbev" ORIGINAL_BASE_MODEL="some-base-model"

The only difference is that the model card for an embedding model shows a different llama-server command and different instructions for accessing/calling the embedding endpoint.

Upload a GGUF model to model repository

The following target uploads a model to an existing Hugging Face model repository.

(venv) $ make hf-upload-gguf-to-model MODEL_PATH=dummy-model1.gguf REPO_ID=danbev/TestModel-GGUF
📤 Uploading dummy-model1.gguf to danbev/TestModel-GGUF/dummy-model1.gguf
✅ Upload successful!
🔗 File available at: https://huggingface.co/danbev/TestModel-GGUF/blob/main/dummy-model1.gguf

This command can also be used to update an existing model file in a repository.
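
The same upload can also be scripted directly with the huggingface_hub library if needed; a minimal sketch (repository and file names are just the example values from above, and this is not necessarily how the provided scripts implement it):

```python
# Minimal sketch of uploading a GGUF file with huggingface_hub.
# Assumes HF_TOKEN is set in the environment (or huggingface-cli login was run).
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="dummy-model1.gguf",
    path_in_repo="dummy-model1.gguf",
    repo_id="danbev/TestModel-GGUF",
)
```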

Create a new Collection

(venv) $ make hf-new-collection NAME=TestCollection DESCRIPTION="Collection for testing scripts" NAMESPACE=danbev
🚀 Creating Hugging Face Collection
Title: TestCollection
Description: Collection for testing scripts
Namespace: danbev
Private: False
✅ Authenticated as: danbev
📚 Creating collection: 'TestCollection'...
✅ Collection created successfully!
📋 Collection slug: danbev/testcollection-68930fcf73eb3fc200b9956d
🔗 Collection URL: https://huggingface.co/collections/danbev/testcollection-68930fcf73eb3fc200b9956d

🎉 Collection created successfully!
Use this slug to add models: danbev/testcollection-68930fcf73eb3fc200b9956d

Add model to a Collection

(venv) $ make hf-add-model-to-collection COLLECTION=danbev/testcollection-68930fcf73eb3fc200b9956d MODEL=danbev/TestModel-GGUF
✅ Authenticated as: danbev
🔍 Checking if model exists: danbev/TestModel-GGUF
✅ Model found: danbev/TestModel-GGUF
📚 Adding model to collection...
✅ Model added to collection successfully!
🔗 Collection URL: https://huggingface.co/collections/danbev/testcollection-68930fcf73eb3fc200b9956d

🎉 Model added successfully!
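
These collection targets can likewise be reproduced directly with the huggingface_hub library; a minimal standalone sketch (names are the example values from above, not necessarily the scripts' actual implementation):

```python
# Minimal sketch of creating a collection and adding a model to it.
# Assumes HF_TOKEN is set in the environment.
from huggingface_hub import add_collection_item, create_collection

collection = create_collection(
    title="TestCollection",
    namespace="danbev",
    description="Collection for testing scripts",
)
add_collection_item(collection.slug, item_id="danbev/TestModel-GGUF", item_type="model")
print("Collection slug:", collection.slug)
```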