Commit Graph

211 Commits

Author SHA1 Message Date
Ed Addario feda897fa2
Enhanced scaling factor 2026-03-12 20:05:29 +00:00
Ed Addario d1ed43ca62
Revert to weighted cosine error estimation 2026-03-12 20:05:07 +00:00
Ed Addario 6e68a04797
Angle-aware/Magnitude-aware (MSE/WCE) hybrid error estimation 2026-03-12 19:59:26 +00:00
Ed Addario 9bb8e17e04
Remove wce flag 2026-03-12 19:55:56 +00:00
Ed Addario 0ccf5e5f21
Test removing unused headers 2026-03-12 16:04:36 +00:00
Ed Addario fd64e639ab
Merge branch 'master' into quantize 2026-03-12 15:43:01 +00:00
ddh0 10e5b148b0
llama-quant : correct `n_attention_wv` usage (#20357)
* llama-quant : correct `n_attention_wv` usage

In #19770, I introduced a regression in the way the
`quantize_state_impl` counter values were initialized. I was
incrementing and using `n_attention_wv` in the same loop, when its
value should already be final by the time we're deciding tensor types
in `llama_tensor_get_type_impl` (for `use_more_bits`).

I never observed a difference in any of [my
tests](https://github.com/ggml-org/llama.cpp/pull/19770#issuecomment-4000424712)
- it was only after @bartowski kindly pointed this out that I realized
it was incorrect. (Thanks!)

* simplify
2026-03-10 21:43:29 +02:00
ddh0 1dab5f5a44
llama-quant : fail early on missing imatrix, refactor type selection, code cleanup (#19770)
* quantize : imatrix-fail early + code cleanup

* fix manual override printing

it's in the preliminary loop now, so needs to be on its own line

* revert header changes per ggerganov

* remove old #includes

* clarify naming

rename `tensor_quantization` to `tensor_typo_option` to describe its
functionality

* fix per barto
2026-03-10 08:16:05 +02:00
ddh0 b518195101
llama-quant : left-align tensor names in output (#20117) 2026-03-09 09:28:41 +02:00
Johannes Gäßler a976ff081b
llama: end-to-end tests (#19802)
* tests: add end-to-end tests per model architecture

* fixup for rebase

* fix use-after-free in llama-model-loader.cpp

* fix CI

* fix WebGPU

* fix CI

* disable CI for macOS-latest-cmake-arm64

* use expert_weights_scale only if != 0.0f

* comments
2026-03-08 12:30:21 +01:00
Ed Addario d6a718e55a
Fix scale factor overwrite bug 2026-03-02 18:40:39 +00:00
Ed Addario 6773bd59ad
Expected Output Error MSE 2026-03-01 09:22:15 +00:00
Ed Addario 06d3b50b03
Improve WCE to be magnitude-aware 2026-03-01 09:19:55 +00:00
Ed Addario a057d827ca
Minor refactoring 2026-02-21 10:10:32 +00:00
Ed Addario 9e460f1c0f
Refactor is_quantizable() 2026-02-21 10:08:20 +00:00
Ed Addario 6729dedbb5
Merge branch 'master' into quantize 2026-02-20 16:47:26 +00:00
Ed Addario f2a719b14a
Change tensor importance score logic 2026-02-20 15:05:46 +00:00
Ed Addario 551463e2e8
Minor refactoring 2026-02-20 15:03:56 +00:00
ddh0 492bc31978
quantize : add --dry-run option (#19526)
* clean slate for branch

* use 6 characters for tensor dims

* add --dry-run to llama-quantize

* use 6 characters for tensor dims (cont.)

* no need to re-calculate ggml_nbytes for tensor

* fix indent

* show model and quant BPW when quant completes

* add example to --help

* new function `tensor_requires_imatrix`, add courtesy warning about imatrix

* missing __func__, move imatrix flag set

* logic error

* fixup tensor_requires_imatrix

* add missing `GGML_TYPE`s

* simplify and rename `tensor_type_requires_imatrix`

* simplify for style

* add back Q2_K edge case for imatrix

* guard ftype imatrix warning

* comment ref #12557

* remove per @compilade

* remove unused `params` parameter

* move `bool dry_run` per GG

* move `bool dry_run` per GG

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-20 09:20:16 +01:00
Ed Addario 6029c6ea17
Merge branch 'master' into quantize 2026-02-07 16:49:52 +00:00
ymcki 3688c4f504
Kimi-Linear support (backend agnostic + MLA KV cache) (#18755)
* kimi linear model implementation

* kimi linear convert_hf_to_gguf

* kimi linear constants.py tensor_mapping.py

* Kimi Linear ggml.h

* kimi linear ggml-cpu

* Kimi Linear ggml-cuda

* Kimi Linear ggml.c

* kimi linear src/llama

* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning

* remove type mismatch warning

* read MoE params

* removed some hard coded code

* removed all hard code

* use DeepseekV2 tokenizer

* removed unnecessary internal methods called by the old set_vocab of KimiLinear

* rewrite get_vocab for KimiLinear. Removed all kda_scan code

* removed all traces of kda_scan

* reduce OP count by 1 due to removal of kda_scan

* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache

* set n_embd_head_k/v to ensure kv cache works

* don't quantize conv1d of Kimi Linear

* Kimi Linear backend agnostic

* removed LOG_INFO

* naive chunking form implemented

* fixed some comments

* add Kimi-K2 specific tokens to be recognized as EOG

* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682

* replaced Akk and Aqk with mul_mat and clamp

* no clamp version

* Moved Aqk computation out of the loop

* fixed typo and split wkv_b into wk_b and wv_b

* MLA KV cache support

* fix trailing spaces

* moved const llama_model & model; around to follow qwen3next format and see if it can pass the -Wunused-private-field error

* fix trailing whitespace

* removed trailing whitespace in empty lines + make sure indentation is a multiple of 4

* try to make lint happy

* remove blank lines to make lint happy

* removed the last blank line containing whitespace

* fixed flake8 complaints locally

* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement

* removed Kimi-Linear specific change that causes failure at server-windows

* removed private: from kimi_linear to make build checks happy

* removed unnecessary ggml_cont before ggml_reshape

* created static function causal_conv1d to abstract similar code for q/k/v

* merged dt_bias to SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.

* reverted to original

* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.

* remove DT_B from constants.py. remove one comment line in llama-model.cpp

* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. switch the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight

* remove ssm_o_norm_b

* remove ssm_o_norm_b

* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k

* removed all ggml_cont b4 ggml_reshape_4d

* Whitespace

* replaced all hparams.get with find_hparams

* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp

* use is_mla to switch between different mem_hybrid types

* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC

* removed if else for required parameters kv_lora_rank and qk_rope_head_dim

* add back ggml_cont for Vcur

* minor changes

* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp

* f16 gguf cannot run without context length

* made a mistake of adding back n_ctx parsing

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-06 11:39:58 +01:00
Ed Addario 7a02dafaa7
Fix state file name generation and simplify_pareto() lambda capture 2026-02-04 10:40:50 +00:00
Ed Addario 462d3dab82
Merge branch 'master' into quantize 2026-02-03 10:57:05 +00:00
Georgi Gerganov c5c64f72ac
llama : disable Direct IO by default (#19109)
* llama : disable Direct IO by default

* cont : override mmap if supported
2026-01-28 09:11:13 +02:00
Ed Addario 220df5f1ff
Update output log 2026-01-22 23:19:26 +00:00
Ed Addario 0b5030d704
Merge branch 'master' into quantize 2026-01-22 15:45:07 +00:00
Georgi Gerganov 0e4ebeb057
quant : manual overrides of tensor types take precedence (#18952) 2026-01-22 16:17:06 +02:00
Ed Addario ff3b9b4cae
Memory optimisations (AI assisted) 2026-01-22 11:39:26 +00:00
Ed Addario 2ede173218
Performance optimisations (AI assisted) 2026-01-22 10:38:16 +00:00
Ed Addario 1c23a6fbd2
Add experimental entropy-modulated weighted cosine error (WCE) 2026-01-21 18:28:37 +00:00
Ed Addario 0b63f50463
Major refactor 2026-01-21 18:26:41 +00:00
Ed Addario 41ff6f95ee
Merge branch 'master' into quantize 2026-01-11 18:39:28 +00:00
Julius Tischbein 2038101bd9
llama : add `use_direct_io` flag for model loading (#18166)
* Adding --direct-io flag for model loading

* Fixing read_raw() calls

* Fixing Windows read_raw_at

* Changing type off_t to size_t for Windows and renaming functions

* disable direct io when mmap is explicitly enabled

* Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL

* Fallback to std::fread in case O_DIRECT fails due to bad address

* Windows: remove const keywords and unused functions

* Update src/llama-mmap.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-08 08:35:30 +02:00
Ed Addario d18ddbac9c
Refactor variable names 2026-01-07 18:28:57 +00:00
Ed Addario 774ba01367
Remove file deletion 2026-01-07 18:28:26 +00:00
Ed Addario 06f46afedc
Improve file handling 2026-01-07 18:27:39 +00:00
Ed Addario c09fa60daa
Update parameter names 2026-01-07 18:26:16 +00:00
Ed Addario bdd7ec7f56
Implement target_size logic 2026-01-07 18:11:51 +00:00
Ed Addario 960ef96141
Prepare for future optimization algorithms 2026-01-01 13:44:59 +00:00
Ed Addario 91846ee79b
Change checkpoint file magic 2025-12-29 13:02:06 +00:00
Ed Addario b6d718a4a6
Add code comments 2025-12-25 15:47:44 +00:00
Ed Addario 5f7bba7828
Improve state checkpoint filename 2025-12-25 15:47:18 +00:00
Ed Addario dfa79a9484
Merge branch 'master' into quantize 2025-12-16 13:57:54 +01:00
Johannes Gäßler b1f3a6e5db
llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-15 09:24:59 +01:00
Ed Addario e3d9b340ca
Merge branch 'master' into quantize 2025-12-06 15:07:36 +01:00
Daniel Bevenius 444f00b0ec
llama : remove quantization sanity check (#17788)
* llama : remove quantization sanity check

This commit removes the quantization sanity check for attention layers.

The motivation for this is that there are hybrid models that have
recurrent layers, expert layers, and attention layers. For these
models the current check fails, as the expert layers are not taken
into account. After consideration, it was decided that this check
is not strictly necessary and can be removed to allow for more
flexible model architectures.

* llama : remove unused pruned_attention_w and is_clip_model vars
2025-12-06 12:26:20 +01:00
Georgi Gerganov a67ef0f47f
llama : fix sanity checks during quantization (#17721) 2025-12-04 10:33:42 +02:00
Ed Addario 3f7842c645
Merge branch 'master' into quantize 2025-11-30 13:01:54 +00:00
Ed Addario 37cf51ebd0
Process bpw targets up to B/F16 2025-11-30 00:29:35 +00:00
Ed Addario 229109f329
Increase importance boost for final pass 2025-11-29 10:31:39 +00:00