Ed Addario
14fae69a7b
General refactoring
2025-09-20 21:31:31 +01:00
Jie Fu (傅杰)
745cbcf2fe
llama-quant : fix the verification of attention layers for encoder-decoder models (#16023)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-17 09:30:55 +02:00
Ed Addario
ad70fca5b2
Merge branch 'quantize' of https://github.com/EAddario/llama.cpp into quantize
2025-09-15 07:42:37 +01:00
Ed Addario
9b857e3984
Merge branch 'ggml-org:master' into quantize
2025-09-14 23:35:43 +01:00
Ed Addario
c709e1a335
Fix MoE tensor estimation
2025-09-14 22:38:27 +01:00
Ed Addario
8503d59ee4
Increase IQ options
2025-09-13 11:49:18 +01:00
Ed Addario
2b516068e2
"Convexify" candidate list
2025-09-13 09:41:52 +01:00
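A minimal sketch of what the "convexify" step above could look like (the `candidate` struct and function names are hypothetical, not the branch's actual code): with per-tensor candidates sorted by ascending bits-per-weight, only points on the lower convex hull of (bpw, error) can ever be selected by the Lagrangian sweep introduced in the entry below, so the remaining candidates can be pruned up front.
```cpp
#include <vector>

struct candidate { double bpw; double error; };

// cross product of (b - a) x (c - a); <= 0 means b lies on or above the a-c chord
static double cross(const candidate & a, const candidate & b, const candidate & c) {
    return (b.bpw - a.bpw) * (c.error - a.error) - (b.error - a.error) * (c.bpw - a.bpw);
}

// keep only the lower convex hull of candidates sorted by ascending bpw
std::vector<candidate> convexify(const std::vector<candidate> & sorted_by_bpw) {
    std::vector<candidate> hull;
    for (const candidate & c : sorted_by_bpw) {
        while (hull.size() >= 2 && cross(hull[hull.size() - 2], hull.back(), c) <= 0) {
            hull.pop_back(); // middle point is never optimal for any lambda
        }
        hull.push_back(c);
    }
    return hull;
}
```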
Ed Addario
12e816b511
Replace greedy allocator with Lagrangian relaxation
2025-09-13 09:24:23 +01:00
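A hedged sketch of the Lagrangian-relaxation idea named above, assuming illustrative `tensor`/`option` types rather than the actual llama-quant data structures: relax the global bits-per-weight budget with a multiplier lambda, solve each tensor independently for a fixed lambda, and bisect lambda until the budget is met. Unlike a greedy allocator, this equalizes the marginal error-per-bit across tensors.
```cpp
#include <cstdint>
#include <vector>

struct option { double bpw; double error; };            // one candidate quant type
struct tensor { int64_t n_elements; std::vector<option> options; };

// for a fixed lambda, each tensor independently minimizes error + lambda * bits
static size_t pick(const tensor & t, double lambda) {
    size_t best = 0;
    double best_cost = 0.0;
    for (size_t i = 0; i < t.options.size(); ++i) {
        const double cost = t.options[i].error + lambda * t.options[i].bpw * (double) t.n_elements;
        if (i == 0 || cost < best_cost) { best = i; best_cost = cost; }
    }
    return best;
}

std::vector<size_t> allocate(const std::vector<tensor> & tensors, double target_bpw) {
    std::vector<size_t> choice(tensors.size());
    double lo = 0.0, hi = 1.0;                          // assumed-sufficient lambda bracket
    for (int iter = 0; iter < 64; ++iter) {             // bisect lambda to hit the budget
        const double lambda = 0.5 * (lo + hi);
        double bits = 0.0, weights = 0.0;
        for (size_t i = 0; i < tensors.size(); ++i) {
            choice[i] = pick(tensors[i], lambda);
            bits     += tensors[i].options[choice[i]].bpw * (double) tensors[i].n_elements;
            weights  += (double) tensors[i].n_elements;
        }
        if (bits / weights > target_bpw) { lo = lambda; } // over budget: penalize bits harder
        else                             { hi = lambda; }
    }
    return choice;
}
```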
Ed Addario
7d85993f26
Minor refactoring
2025-09-13 08:44:41 +01:00
Ed Addario
4dff85fbe5
Improve precise_lambda() efficiency
2025-09-13 08:41:37 +01:00
Ed Addario
bc8762f27f
Capture surrounding function name
2025-09-13 08:33:22 +01:00
Ed Addario
886536d80a
Increase error type precision
2025-09-13 08:27:23 +01:00
ddh0
df082f5630
nitpick : correct MB to MiB (#15934)
MB was incorrectly used for 1024 x 1024 bytes instead of MiB
2025-09-11 19:12:34 +02:00
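For reference, the two units this commit distinguishes differ by about 4.9%:
```cpp
constexpr long long MB  = 1000LL * 1000;  // 1 MB  = 1,000,000 bytes (SI)
constexpr long long MiB = 1024LL * 1024;  // 1 MiB = 1,048,576 bytes (IEC)
```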
Ed Addario
04c07b3272
Add better control over MSE and directional bias computation
2025-09-10 18:00:56 +01:00
Ed Addario
eab8708244
Minor factoring for efficiency and correctness
2025-08-30 10:14:46 +01:00
Ed Addario
556f6b04fe
Add --precise-lambda option
2025-08-28 16:08:08 +01:00
Ed Addario
66aff8fa1e
Add precise_lambda()
2025-08-28 16:06:42 +01:00
Ed Addario
8df1d00ae4
Add directional scaling
2025-08-28 16:04:28 +01:00
Ed Addario
04946114c9
Refactor epsilon into a function-wide variable
2025-08-28 16:01:03 +01:00
Ed Addario
4286690019
Minor comment update
2025-08-26 21:39:40 +01:00
Ed Addario
d4ac2106fb
Improve logging and some minor code refactoring
2025-08-24 13:39:10 +01:00
Ed Addario
61c0e01f50
Execute bpw_overrides() only if an imatrix file is provided
2025-08-24 13:36:03 +01:00
Ed Addario
3856d60328
Restrict quant types per family
2025-08-23 14:45:07 +01:00
Ed Addario
decafae270
Adjust bias_lambda
2025-08-23 11:30:11 +01:00
Ed Addario
68ae5e66ce
Improve list of candidate types
2025-08-23 02:50:55 +01:00
Ed Addario
73124a9921
Refactor estimate_error()
2025-08-23 02:17:22 +01:00
Ed Addario
f75265f55b
Fix typo
2025-08-23 01:08:37 +01:00
Ed Addario
9a4b115497
Explicitly adding <atomic> include
2025-08-23 01:08:01 +01:00
Ed Addario
6d17889add
Log if override is from tensor-type or from bpw-target
2025-08-22 16:58:46 +01:00
Ed Addario
fea99d051a
Refactor and combine lambdas
2025-08-22 16:57:58 +01:00
Ed Addario
f05c8483d8
Improve dequantized_buffer fill
2025-08-22 09:17:58 +01:00
Ed Addario
897decbe8a
Show skipped IQ tensors
2025-08-22 09:15:11 +01:00
Ed Addario
01c927fb94
Improve Pareto-efficient candidate selection
2025-08-22 09:14:14 +01:00
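An illustration of the selection criterion (types and names hypothetical): a candidate is Pareto-efficient when no other candidate offers both fewer bits and lower error, so after sorting by bits-per-weight only candidates with strictly decreasing error survive.
```cpp
#include <algorithm>
#include <vector>

struct candidate { double bpw; double error; };

// keep only candidates where spending more bits actually buys lower error
std::vector<candidate> pareto_filter(std::vector<candidate> cands) {
    std::sort(cands.begin(), cands.end(), [](const candidate & a, const candidate & b) {
        return a.bpw != b.bpw ? a.bpw < b.bpw : a.error < b.error;
    });
    std::vector<candidate> front;
    for (const candidate & c : cands) {
        if (front.empty() || c.error < front.back().error) {
            front.push_back(c); // not dominated by any cheaper candidate
        }
    }
    return front;
}
```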
Ed Addario
47cdbe2155
Reduce sampling window to speedup process
2025-08-22 09:11:11 +01:00
Ed Addario
2f13fee795
Parameterise type
2025-08-22 09:05:55 +01:00
Ed Addario
bb0d912c1f
Update comments
2025-08-22 09:02:56 +01:00
Ed Addario
35c1504441
Fix byte count for 3d or higher tensors
2025-08-22 09:01:57 +01:00
Ed Addario
ec0afbe79f
Include embeddings and output tensors
2025-08-22 01:46:09 +01:00
Ed Addario
5b6f1e9fde
General code refactor
2025-08-21 19:18:54 +01:00
Ed Addario
9e11f82e8f
Precompute error denominator in estimate_error()
2025-08-21 16:25:31 +01:00
Ed Addario
887490c5ec
Dequantise sampled rows only
2025-08-21 15:11:49 +01:00
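The row-sampling idea behind "Dequantise sampled rows only" (and the reduced sampling window above) could look roughly like this strided selection, shown purely for illustration:
```cpp
#include <cstddef>
#include <vector>

// pick an evenly strided subset of rows; candidate errors are then estimated
// on this sample instead of dequantizing the whole tensor
std::vector<size_t> sample_rows(size_t n_rows, size_t max_samples) {
    std::vector<size_t> rows;
    const size_t stride = n_rows > max_samples ? n_rows / max_samples : 1;
    for (size_t r = 0; r < n_rows; r += stride) {
        rows.push_back(r);
    }
    return rows;
}
```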
Ed Addario
e01dad886b
Parallelise candidate evaluation
2025-08-21 12:47:13 +01:00
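A sketch of the parallelisation pattern, using std::async for brevity (llama.cpp has its own thread handling; `evaluate_candidate` is a hypothetical stand-in):
```cpp
#include <future>
#include <vector>

static double evaluate_candidate(int type) {
    (void) type; // placeholder: would quantize the sampled rows with this type and estimate error
    return 0.0;
}

std::vector<double> evaluate_all(const std::vector<int> & types) {
    std::vector<std::future<double>> jobs;
    jobs.reserve(types.size());
    for (int t : types) {
        jobs.push_back(std::async(std::launch::async, evaluate_candidate, t));
    }
    std::vector<double> errors;
    errors.reserve(jobs.size());
    for (std::future<double> & j : jobs) {
        errors.push_back(j.get()); // candidates are independent, so any order works
    }
    return errors;
}
```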
Ed Addario
95b2ab2800
Change error estimate to use normalised weighted MSE
2025-08-21 10:46:37 +01:00
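One plausible form of a normalised weighted MSE, shown purely for illustration (the weights would come from the importance matrix; the exact formula in the branch may differ): err = Σᵢ wᵢ(xᵢ − qᵢ)² / Σᵢ wᵢxᵢ². Accumulating in double and guarding the denominator also echoes the nearby "Increase precision for error calculation" and "Avoid division by zero if truncation occurs" entries.
```cpp
#include <cstddef>

// err = sum_i w[i] * (x[i] - q[i])^2 / sum_i w[i] * x[i]^2
double weighted_mse(const float * x, const float * q, const float * w, size_t n) {
    double num = 0.0, den = 0.0; // accumulate in double for precision
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) x[i] - (double) q[i];
        num += (double) w[i] * d * d;
        den += (double) w[i] * (double) x[i] * (double) x[i];
    }
    return den > 0.0 ? num / den : 0.0; // guard against a zero denominator
}
```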
Ed Addario
5ef493ea1a
Exclude embeddings and output tensor
2025-08-21 09:48:29 +01:00
Ed Addario
35ad0fc4ad
Improve error estimation using weighted MSE
2025-08-20 23:27:20 +01:00
Ed Addario
b0b33b7ccb
Optimise tensor sampling
2025-08-20 20:58:26 +01:00
Ed Addario
3f0118d602
Fix bias lambda bug
2025-08-20 17:26:37 +01:00
Ed Addario
52da4a4f8c
Skip if output.weight or type is COPY
2025-08-20 17:26:05 +01:00
Ed Addario
43caadf783
Add better fallbacks for IQ mixes
2025-08-20 17:24:48 +01:00
Ed Addario
29b2dc3ec0
Do not mix K and IQ quants
2025-08-20 13:27:01 +01:00
Ed Addario
5cd69a6809
Add F16/BF16 type
2025-08-20 09:41:39 +01:00
Ed Addario
936294f6af
Increase precision for error calculation
2025-08-19 23:31:22 +01:00
Ed Addario
f22b3097eb
Avoid division by zero if truncation occurs
2025-08-19 22:34:01 +01:00
Ed Addario
ee05d6bc0b
Update comments
2025-08-19 22:32:53 +01:00
Ed Addario
5aceb9e3ae
Refactor variable names
2025-08-19 22:29:27 +01:00
Ed Addario
1187f6aa9e
Implement bpw_overrides call
2025-08-19 11:07:03 +01:00
Ed Addario
92f49ab399
Add target_bpw_type() logic
2025-08-19 11:05:01 +01:00
Ed Addario
017945a3b2
Validate if imatrix contains activations
2025-08-19 11:03:52 +01:00
Ed Addario
9adae08789
Add is_iq()
2025-08-19 11:00:50 +01:00
Ed Addario
c96b8eef94
Add fallback_type enum
2025-08-19 11:00:05 +01:00
Ed Addario
a22a9deeee
Refactor variable and add target_bpw
2025-08-19 10:57:44 +01:00
Xuan-Son Nguyen
50aa938901
convert : support non-mxfp4 HF model (#15153)
* convert : support non-mxfp4 HF model
* rm redundant check
* disable debug check
2025-08-07 23:26:03 +02:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss (#15091)
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com>
change kvalues_mxfp4 table to match e2m1 (#6)
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
Ed Addario
daf2dd7880
quantize : skip tensor override when in fallback mode (#14995)
2025-07-31 21:32:18 +02:00
Ed Addario
982e347255
quantize : fix minor logic flaw in --tensor-type (#14572)
2025-07-13 18:02:17 +02:00
Tarek Dakhran
f5e96b368f
model : support LiquidAI LFM2 hybrid family (#14620)
**Important**
LFM2 was [merged](https://github.com/huggingface/transformers/pull/39340) into transformers, but has not yet been released.
To convert into gguf, install transformers from source
```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```
2025-07-11 20:27:01 +02:00
Xuan-Son Nguyen
8846aace49
model : gemma3n text-only (#14400)
* gemma3n
* add llm_graph_input_one
2025-06-26 20:34:02 +03:00
Ed Addario
fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) (#13037)
2025-06-22 23:16:26 +02:00
Ed Addario
30e5b01de2
quantize : change int to unsigned int for KV overrides (#14197)
2025-06-15 18:53:45 +02:00
Ed Addario
e5c834f718
quantize : improve tensor-type pattern matching (#13033)
2025-05-13 19:12:31 +02:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
Ed Addario
71e90e8813
quantize: Handle user-defined quantization levels for additional tensors (#12511)
* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Minor refactoring as per the contributors' coding guidelines
* Update descriptions to match existing style
* Implement general --tensor-type instead of tensor-specific command option
* Fix implied type bug
* Restore missing #includes
* Add regex capability for tensor selection
* Refactor function name and update ALLOWED_TENSOR_TYPE
* Add missing #include
* Handle edge case when tensor name is cls.output
* Minor logging improvement
2025-04-13 21:29:28 +03:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers
* ggml : fix possible underflow in ggml_nbytes
2025-04-02 14:52:01 +02:00
Molly Sophia
7dfad387e3
llama: Add support for RWKV v7 architecture (#12412)
* ggml: Add op l2_norm
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml: Add op rwkv_wkv7
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: Add support for RWKV7 and ARWKV7 models
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix inference with RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: add more (a)rwkv7 variants in size
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Apply code-format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* fix MUSA build
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix shape error with rwkv using llama-parallel
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-18 07:27:50 +08:00
Xuan Son Nguyen
681149ced2
llama : add `llama_model_load_from_splits` (#11255)
* llama : add `llama_model_load_from_splits`
* update
2025-01-16 13:54:08 +01:00
Georgi Gerganov
afa8a9ec9b
llama : add `llama_vocab`, functions -> methods, naming (#11110)
* llama : functions -> methods (#11110)
* llama : add struct llama_vocab to the API (#11156)
ggml-ci
* hparams : move vocab params to llama_vocab (#11159)
ggml-ci
* vocab : more pimpl (#11165)
ggml-ci
* vocab : minor tokenization optimizations (#11160)
ggml-ci
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* lora : update API names (#11167)
ggml-ci
* llama : update API names to use correct prefix (#11174)
* llama : update API names to use correct prefix
ggml-ci
* cont
ggml-ci
* cont
ggml-ci
* minor [no ci]
* vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174)
ggml-ci
* vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174)
ggml-ci
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-12 11:32:42 +02:00
Molly Sophia
ee7136c6d1
llama: add support for QRWKV6 model architecture (#11001)
* WIP: Add support for RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV: Some graph simplification
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Add support for RWKV6Qwen2 with cpu and cuda GLA
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix some typos
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* code format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix wkv test & add gla test
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix cuda warning
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update README.md
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update ggml/src/ggml-cuda/gla.cu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix fused lerp weights loading with RWKV6
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* better sanity check skipping for QRWKV6 in llama-quant
thanks @compilade
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-10 09:58:08 +08:00
Georgi Gerganov
c07d437bbd
llama : avoid hardcoded QK_K (#11061)
ggml-ci
2025-01-08 16:19:36 +02:00
Johannes Gäßler
53ff6b9b9f
GGUF: C++ refactor, backend support, misc fixes (#11030)
* GGUF: C++ refactor, backend support, misc fixes
remove ggml_tensor.backend
update CODEOWNERS [no ci]
remove gguf_get_data from API
revise GGUF API data types
2025-01-07 18:01:58 +01:00
Georgi Gerganov
5047dd3546
llama : use _impl suffix instead of _internal (#11060)
ggml-ci
2025-01-06 10:52:01 +02:00
Georgi Gerganov
f66f582927
llama : refactor `src/llama.cpp` (#10902)
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap
ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF
ggml-ci
* llama : arch (cont)
ggml-ci
* llama : chat
ggml-ci
* llama : model
ggml-ci
* llama : hparams
ggml-ci
* llama : adapter
ggml-ci
* examples : fix
ggml-ci
* rebase
ggml-ci
* minor
* llama : kv cache
ggml-ci
* llama : impl
ggml-ci
* llama : batch
ggml-ci
* cont
ggml-ci
* llama : context
ggml-ci
* minor
* llama : context (cont)
ggml-ci
* llama : model loader
ggml-ci
* common : update lora
ggml-ci
* llama : quant
ggml-ci
* llama : quant (cont)
ggml-ci
* minor [no ci]
2025-01-03 10:18:53 +02:00