From 2948e6049a4ad0f96a4ab15246db2d2086b80703 Mon Sep 17 00:00:00 2001
From: "Piotr Wilkin (ilintar)"
Date: Fri, 13 Mar 2026 12:21:33 +0100
Subject: [PATCH] general: CONTRIBUTING.md - guidelines for quantization
 schemes (#19762)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Guidelines for quantization schemes

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler

* Change required precision from Q8 to FP16/BF16

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler

* Update CONTRIBUTING.md [no ci]

* Update CONTRIBUTING.md [no ci]

---------

Co-authored-by: Johannes Gäßler
---
 CONTRIBUTING.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 996f34ed82..fc26289aec 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -30,14 +30,19 @@
 Before submitting your PR:
 - Search for existing PRs to prevent duplicating efforts
 - llama.cpp uses the ggml tensor library for model evaluation. If you are unfamiliar with ggml, consider taking a look at the [examples in the ggml repository](https://github.com/ggml-org/ggml/tree/master/examples/). [simple](https://github.com/ggml-org/ggml/tree/master/examples/simple) shows the bare minimum for using ggml. [gpt-2](https://github.com/ggml-org/ggml/tree/master/examples/gpt-2) has minimal implementations for language model inference using GPT-2. [mnist](https://github.com/ggml-org/ggml/tree/master/examples/mnist) demonstrates how to train and evaluate a simple image classifier
 - Test your changes:
-    - Execute [the full CI locally on your machine](ci/README.md) before publishing
-    - Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
-    - If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
-    - If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
+  - Execute [the full CI locally on your machine](ci/README.md) before publishing
+  - Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
+  - If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
+  - If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
 - Create separate PRs for each feature or fix:
-    - Avoid combining unrelated changes in a single PR
-    - For intricate features, consider opening a feature request first to discuss and align expectations
-    - When adding support for a new model or feature, focus on **CPU support only** in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs
+  - Avoid combining unrelated changes in a single PR
+  - For intricate features, consider opening a feature request first to discuss and align expectations
+  - When adding support for a new model or feature, focus on **CPU support only** in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs
+  - In particular, adding new data types (extension of the `ggml_type` enum) carries with it a disproportionate maintenance burden. As such, to add a new quantization type you will need to meet the following *additional* criteria *at minimum*:
+    - convert a small model to GGUF using the new type and upload it to HuggingFace
+    - provide [perplexity](https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity) comparisons to FP16/BF16 (whichever is the native precision) as well as to types of similar size
+    - provide KL divergence data calculated vs. the FP16/BF16 (whichever is the native precision) version for both the new type as well as types of similar size
+    - provide [performance data](https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench) for the new type in comparison to types of similar size on pure CPU
 - Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
 - If you are a new contributor, limit your open PRs to 1.
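
For the `test-backend-ops` checks referenced in the diff, the tool is built alongside the other llama.cpp binaries and compares each enabled backend against the CPU backend. A minimal sketch, assuming a build with at least two backends and using `MUL_MAT` as a stand-in for whichever operator you touched:

```sh
# Check that all enabled backends produce results consistent with the CPU backend.
./build/bin/test-backend-ops test

# Restrict the run to a single operator (MUL_MAT is an illustrative choice).
./build/bin/test-backend-ops test -o MUL_MAT

# Measure per-operator performance instead of checking correctness.
./build/bin/test-backend-ops perf -o MUL_MAT
```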
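
The quantization criteria in the diff map onto existing tools. Below is a minimal sketch of the evidence-gathering workflow; `Q4_X` is a placeholder for the proposed type, and the model path, output file names, and evaluation text (WikiText-2) are illustrative assumptions:

```sh
# 1. Create the native-precision reference and quantize it with the new type
#    (Q4_X is hypothetical; substitute the name of the proposed ggml_type).
python convert_hf_to_gguf.py models/my-model/ --outtype bf16 --outfile my-model-bf16.gguf
./build/bin/llama-quantize my-model-bf16.gguf my-model-q4_x.gguf Q4_X

# 2. Save the BF16 logits once, then compute perplexity and KL divergence against
#    them; repeat the second command for existing types of similar size (e.g. Q4_K_M).
./build/bin/llama-perplexity -m my-model-bf16.gguf -f wiki.test.raw --kl-divergence-base my-model-bf16.kld
./build/bin/llama-perplexity -m my-model-q4_x.gguf --kl-divergence-base my-model-bf16.kld --kl-divergence

# 3. Compare CPU-only performance against a similarly sized type; on a GPU-enabled
#    build, -ngl 0 keeps the measurement on the CPU.
./build/bin/llama-bench -m my-model-q4_x.gguf -m my-model-q4_k_m.gguf -ngl 0
```

The `.kld` logits file only needs to be generated once per model and evaluation text; every quantized variant is then scored against the same reference logits.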