Ed Addario
dfa79a9484
Merge branch 'master' into quantize
2025-12-16 13:57:54 +01:00
Xuan-Son Nguyen
6c2131773c
cli: new CLI experience (#17824)
* wip
* wip
* fix logging, add display info
* handle commands
* add args
* wip
* move old cli to llama-completion
* rm deprecation notice
* move server to a shared library
* move ci to llama-completion
* add loading animation
* add --show-timings arg
* add /read command, improve LOG_ERR
* add args for speculative decoding, enable show timings by default
* add arg --image and --audio
* fix windows build
* support reasoning_content
* fix llama2c workflow
* color default is auto
* fix merge conflicts
* properly fix color problem
Co-authored-by: bandoti <bandoti@users.noreply.github.com>
* better loading spinner
* make sure to clean color on force-exit
* also clear input files on "/clear"
* simplify common_log_flush
* add warning in mtmd-cli
* implement console writer
* fix data race
* add attribute
* fix llama-completion and mtmd-cli
* add some notes about console::log
* fix compilation
---------
Co-authored-by: bandoti <bandoti@users.noreply.github.com>
2025-12-10 15:28:59 +01:00
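The console writer this commit introduces serializes terminal output across threads (see "implement console writer" and "fix data race" above). A minimal sketch of what a mutex-guarded console::log could look like, assuming ANSI escapes for color; the name console::log comes from the commit notes, but the body is an illustration, not the PR's code:

```cpp
// Hypothetical sketch of a thread-safe console writer; not the actual
// implementation from the PR.
#include <cstdarg>
#include <cstdio>
#include <mutex>

namespace console {
    static std::mutex g_mutex; // one writer at a time avoids interleaved output

    static void log(const char * fmt, ...) {
        std::lock_guard<std::mutex> lock(g_mutex);
        va_list args;
        va_start(args, fmt);
        vfprintf(stdout, fmt, args);
        va_end(args);
        fflush(stdout);
    }

    // Reset terminal colors (e.g. on force-exit) so the shell is not left tinted.
    static void reset_color() {
        std::lock_guard<std::mutex> lock(g_mutex);
        fputs("\x1b[0m", stdout);
        fflush(stdout);
    }
}
```

A single mutex around every write is the simplest way to keep colored output and the loading spinner from interleaving.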
Ed Addario
b97cda6289
Add B/F16 to get_ftype()
2025-11-29 23:52:51 +00:00
Ed Addario
69a32b6f50
Relax target bpw range
2025-11-29 10:28:43 +00:00
Ed Addario
6616008420
Use more descriptive option naming
2025-11-24 18:26:45 +00:00
Ed Addario
1c9993e131
Add --disable-tensor-importance option
2025-11-23 17:51:04 +00:00
Ed Addario
9ec3e6e262
Remove processing statistics_data
2025-11-23 17:49:53 +00:00
Ed Addario
6e32244a06
Read statistics from imatrix
2025-10-30 21:53:07 +00:00
Ed Addario
00ddf039b3
Update usage
2025-10-20 21:38:49 +01:00
Ed Addario
0b3e930d52
Add option to override bpw state file name
2025-10-16 11:41:26 +01:00
Ed Addario
cd734b89ce
Update quant types
2025-10-13 15:15:23 +01:00
Ed Addario
ca282302b5
Add --keep-bpw-state option
2025-10-12 18:23:23 +01:00
Ed Addario
c93131cef6
Remove --no-bias option
2025-10-10 13:26:51 +01:00
Ed Addario
66d4aed173
Minor refactoring
2025-10-04 08:21:01 +01:00
Ed Addario
940db63144
Select quantization type if target_bpw is set, unless the user specifies type and threads
2025-10-03 11:08:02 +01:00
Ed Addario
dd4f4bd0b8
Reduce bpw range
2025-09-27 17:23:48 +01:00
Ed Addario
29bb30c4ed
Merge branch 'master' into quantize
2025-09-25 19:55:31 +01:00
Georgi Gerganov
1d660d2fae
ci : use smaller model (#16168)
* ci : switch from gemma to qwen3 0.6b
* ci : use smaller model for some tests
2025-09-22 09:11:39 +03:00
Ed Addario
9e74f83411
Replace --bpw-bias flag with --no-bias
2025-09-20 23:06:37 +01:00
Ed Addario
ab02bb1f3e
Merge branch 'master' into quantize
2025-09-20 21:41:25 +01:00
Yuri Khrustalev
07808ebb07
cmake : Do not install tools on iOS targets (#15903)
2025-09-16 09:54:44 +07:00
Ed Addario
04c07b3272
Add better control over MSE and directional bias computation
2025-09-10 18:00:56 +01:00
Ed Addario
556f6b04fe
Add --precise-lambda option
2025-08-28 16:08:08 +01:00
Ed Addario
d4ac2106fb
Improve logging and do some minor code refactoring
2025-08-24 13:39:10 +01:00
Ed Addario
69586e212e
Add F16/BF16 type
2025-08-20 13:23:11 +01:00
Ed Addario
1b3d5b5744
Populate params
2025-08-19 10:56:02 +01:00
Ed Addario
e877474458
Process target_bpw parameter
2025-08-19 10:54:02 +01:00
Ed Addario
0edbf0c176
Process activations
2025-08-19 10:51:58 +01:00
Ed Addario
77b818c040
Populate activations_data with imatrix activations if present
2025-08-19 10:50:37 +01:00
Ed Addario
e6d55dc47b
Load activations
2025-08-19 10:49:01 +01:00
Ed Addario
5e85fb3ff3
Add parse_target_bpw()
2025-08-19 10:46:36 +01:00
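This series (load activations, process target_bpw, parse_target_bpw()) adds a bits-per-weight target to llama-quantize. A hedged sketch of a parse_target_bpw()-style helper, assuming it takes the raw CLI string; the real signature and accepted range in the branch may differ (later commits above tighten and then relax the bpw range):

```cpp
// Illustrative only: parse a bits-per-weight target from a CLI argument.
#include <cstdlib>
#include <stdexcept>
#include <string>

static float parse_target_bpw(const std::string & arg) {
    char * end = nullptr;
    const float bpw = std::strtof(arg.c_str(), &end);
    if (end == arg.c_str() || *end != '\0') {
        throw std::invalid_argument("target bpw is not a number: " + arg);
    }
    // Assumed guard: common GGUF quant types land roughly in this range.
    if (bpw < 1.0f || bpw > 16.0f) {
        throw std::out_of_range("target bpw out of range: " + arg);
    }
    return bpw;
}
```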
Ed Addario
cfec4048ab
Update usage
2025-08-19 10:43:51 +01:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss (#15091)
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com>
change kvalues_mxfp4 table to match e2m1 (#6)
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
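MXFP4, added here for gpt-oss, packs blocks of 4-bit e2m1 values under one shared e8m0 scale, a single exponent-only byte. A sketch of the e8m0 round trip under the usual bias-127 convention; the "use e8m0 conversion instead of powf" note above refers to ggml's own version, which this does not reproduce:

```cpp
// Hedged sketch of an e8m0 (exponent-only, bias-127) block scale,
// as used by MXFP4; not the exact ggml implementation.
#include <cmath>
#include <cstdint>

// Encode: keep only a power-of-two exponent, clamped to 8 bits.
static uint8_t e8m0_from_float(float x) {
    int e = 0;
    std::frexp(x, &e);              // x = m * 2^e, with m in [0.5, 1)
    int biased = (e - 1) + 127;     // so 2^(e-1) <= x < 2^e
    if (biased <   0) biased =   0;
    if (biased > 255) biased = 255;
    return (uint8_t) biased;
}

// Decode: 2^(v - 127).
static float e8m0_to_float(uint8_t v) {
    return std::ldexp(1.0f, (int) v - 127);
}
```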
Sigbjørn Skjæret
2721257e3e
quantize : fix confusing error message if ftype is invalid (#15071)
2025-08-04 18:11:02 +02:00
Ed Addario
e9192bec56
quantize : fix using combined imatrix GGUFs (multiple datasets) (#14973)
2025-07-30 21:11:56 +02:00
Ed Addario
7f97599581
quantize : update README.md (#14905)
* Update README.md
* Fix trailing whitespace
* Update README.md
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-27 23:31:11 +02:00
compilade
90083283ec
imatrix : use GGUF to store importance matrices (#9400)
* imatrix : allow processing multiple chunks per batch
* perplexity : simplify filling the batch
* imatrix : fix segfault when using a single chunk per batch
* imatrix : use GGUF to store imatrix data
* imatrix : fix conversion problems
* imatrix : use FMA and sort tensor names
* py : add requirements for legacy imatrix convert script
* perplexity : revert changes
* py : include imatrix converter requirements in toplevel requirements
* imatrix : avoid using designated initializers in C++
* imatrix : remove unused n_entries
* imatrix : allow loading mis-ordered tensors
Sums and counts tensors no longer need to be consecutive.
* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* quantize : use unused imatrix chunk_size with LLAMA_TRACE
* common : use GGUF for imatrix output by default
* imatrix : two-way conversion between old format and GGUF
* convert : remove imatrix to gguf python script
* imatrix : use the function name in more error messages
* imatrix : don't use FMA explicitly
This should make comparisons between the formats easier
because this matches the behavior of the previous version.
* imatrix : avoid returning from void function save_imatrix
* imatrix : support 3d tensors with MUL_MAT
* quantize : fix dataset name loading from gguf imatrix
* common : move string_remove_suffix from quantize and imatrix
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* imatrix : add warning when legacy format is written
* imatrix : warn when writing partial data, to help guess dataset coverage
Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.
* imatrix : avoid loading model to convert or combine imatrix
* imatrix : avoid using imatrix.dat in README
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-19 12:51:22 -04:00
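Since this change, an importance matrix is an ordinary GGUF file carrying per-tensor sums and counts, so its contents can be listed with ggml's C gguf API. A minimal sketch, assuming gguf.h from the ggml tree is on the include path:

```cpp
// Sketch: list the tensors in an imatrix GGUF file; the per-tensor
// naming scheme used by imatrix is not reproduced here.
#include <cstdint>
#include <cstdio>
#include "gguf.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s imatrix.gguf\n", argv[0]);
        return 1;
    }
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }
    const int64_t n = gguf_get_n_tensors(ctx);
    for (int64_t i = 0; i < n; ++i) {
        printf("%s\n", gguf_get_tensor_name(ctx, i));
    }
    gguf_free(ctx);
    return 0;
}
```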
Vedran Miletić
e9b6350e61
scripts : make the shell scripts cross-platform (#14341)
2025-06-30 10:17:18 +02:00
Ed Addario
fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) (#13037)
2025-06-22 23:16:26 +02:00
Ed Addario
e5c834f718
quantize : improve tensor-type pattern matching (#13033)
2025-05-13 19:12:31 +02:00
Diego Devesa
1d36b3670b
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00