Commit Graph

33 Commits

Ed Addario b97cda6289
Add B/F16 to get_ftype() 2025-11-29 23:52:51 +00:00
Ed Addario 69a32b6f50
Relax target bpw range 2025-11-29 10:28:43 +00:00
Ed Addario 6616008420
Use more descriptive option naming 2025-11-24 18:26:45 +00:00
Ed Addario 1c9993e131
Add --disable-tensor-importance option 2025-11-23 17:51:04 +00:00
Ed Addario 9ec3e6e262
Remove processing statistics_data 2025-11-23 17:49:53 +00:00
Ed Addario 6e32244a06
Read statistics from imatrix 2025-10-30 21:53:07 +00:00
Ed Addario 00ddf039b3
Update usage 2025-10-20 21:38:49 +01:00
Ed Addario 0b3e930d52
Add option to override bpw state file name 2025-10-16 11:41:26 +01:00
Ed Addario cd734b89ce
Update quant types 2025-10-13 15:15:23 +01:00
Ed Addario ca282302b5
Add --keep-bpw-state option 2025-10-12 18:23:23 +01:00
Ed Addario c93131cef6
Remove --no-bias option 2025-10-10 13:26:51 +01:00
Ed Addario 66d4aed173
Minor refactoring 2025-10-04 08:21:01 +01:00
Ed Addario 940db63144
Select quantization type if target_bpw is set unless user specifies type and threads 2025-10-03 11:08:02 +01:00
Ed Addario dd4f4bd0b8
Reduce bpw range 2025-09-27 17:23:48 +01:00
Ed Addario 9e74f83411
Replace --bpw-bias flag with --no-bias 2025-09-20 23:06:37 +01:00
Ed Addario 04c07b3272
Add better control over MSE and directional bias computation 2025-09-10 18:00:56 +01:00
Ed Addario 556f6b04fe
Add --precise-lambda option 2025-08-28 16:08:08 +01:00
Ed Addario d4ac2106fb
Improve logging and some minor code refactoring 2025-08-24 13:39:10 +01:00
Ed Addario 69586e212e
Add F16/BF16 type 2025-08-20 13:23:11 +01:00
Ed Addario 1b3d5b5744
Populate params 2025-08-19 10:56:02 +01:00
Ed Addario e877474458
Process target_bpw parameter 2025-08-19 10:54:02 +01:00
Ed Addario 0edbf0c176
Process activations 2025-08-19 10:51:58 +01:00
Ed Addario 77b818c040
Populate activations_data with imatrix activations if present 2025-08-19 10:50:37 +01:00
Ed Addario e6d55dc47b
Load activations 2025-08-19 10:49:01 +01:00
Ed Addario 5e85fb3ff3
Add parse_target_bpw() 2025-08-19 10:46:36 +01:00
Ed Addario cfec4048ab
Update usage 2025-08-19 10:43:51 +01:00
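The run of commits above builds a target bits-per-weight (bpw) mode into the quantize tool: given a requested overall bpw, it mixes per-tensor quantization types so the weighted average lands on the target. The bookkeeping behind that target is just a weighted average of per-type bit widths; the sketch below illustrates the arithmetic only (it is not the actual quantize.cpp code, the tensor split is made up, and only the per-type bpw values are the standard k-quant block sizes):

```cpp
// Sketch only: compute the overall bits-per-weight achieved by a mix of
// per-tensor quantization types. Not the actual quantize.cpp logic.
#include <cstdint>
#include <cstdio>
#include <vector>

struct tensor_plan {
    const char * name;
    int64_t      n_elements; // number of weights in the tensor
    double       bpw;        // bits per weight of the chosen quant type
};

int main() {
    // Per-type bpw values are the well-known k-quant block sizes
    // (Q4_K = 4.5, Q6_K = 6.5625); the tensors and counts are invented.
    std::vector<tensor_plan> plan = {
        { "blk.0.attn_q.weight",   16777216, 4.5000 }, // Q4_K
        { "blk.0.ffn_down.weight", 58720256, 6.5625 }, // Q6_K
        { "output.weight",        544997376, 6.5625 }, // Q6_K
    };

    double  bits    = 0.0;
    int64_t weights = 0;
    for (const auto & t : plan) {
        bits    += t.bpw * (double) t.n_elements;
        weights += t.n_elements;
        printf("%-24s %12lld weights @ %.4f bpw\n", t.name, (long long) t.n_elements, t.bpw);
    }

    // The type-selection search has to make this ratio land close to the
    // requested target bpw.
    printf("achieved bpw = %.4f over %lld weights\n", bits / weights, (long long) weights);
    return 0;
}
```

How the tool actually searches for the per-tensor type assignment (importance weighting, bias terms, the bpw state file) is what the individual commits above iterate on.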
Georgi Gerganov fd1234cb46
llama : add gpt-oss (#15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
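One detail buried in the mxfp4 additions above is the E8M0 block scale ("ggml : use e8m0 conversion instead of powf"): E8M0 stores only a biased power-of-two exponent, so it can be expanded to a float by writing the exponent field of an IEEE-754 single directly rather than calling powf. A minimal sketch of that idea, not ggml's actual implementation (the helper name is invented and the e == 0 / e == 0xFF edge cases are glossed over):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Sketch: decode an E8M0 scale byte (biased power-of-two exponent, bias 127)
// without powf, by building the IEEE-754 bit pattern directly.
// Edge cases (e == 0 -> denormal 2^-127, e == 0xFF -> NaN) are ignored here.
static inline float e8m0_to_fp32(uint8_t e) {
    uint32_t bits = (uint32_t) e << 23; // sign = 0, exponent = e, mantissa = 0
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    // 127 -> 2^0 = 1.0, 130 -> 2^3 = 8.0, 120 -> 2^-7 = 0.0078125
    printf("%g %g %g\n", e8m0_to_fp32(127), e8m0_to_fp32(130), e8m0_to_fp32(120));
    return 0;
}
```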
Sigbjørn Skjæret 2721257e3e
quantize : fix confusing error message if ftype is invalid (#15071) 2025-08-04 18:11:02 +02:00
Ed Addario e9192bec56
quantize : fix using combined imatrix GGUFs (multiple datasets) (#14973) 2025-07-30 21:11:56 +02:00
compilade 90083283ec
imatrix : use GGUF to store importance matrices (#9400)
* imatrix : allow processing multiple chunks per batch

* perplexity : simplify filling the batch

* imatrix : fix segfault when using a single chunk per batch

* imatrix : use GGUF to store imatrix data

* imatrix : fix conversion problems

* imatrix : use FMA and sort tensor names

* py : add requirements for legacy imatrix convert script

* perplexity : revert changes

* py : include imatrix converter requirements in toplevel requirements

* imatrix : avoid using designated initializers in C++

* imatrix : remove unused n_entries

* imatrix : allow loading mis-ordered tensors

Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* quantize : use unused imatrix chunk_size with LLAMA_TRACE

* common : use GGUF for imatrix output by default

* imatrix : two-way conversion between old format and GGUF

* convert : remove imatrix to gguf python script

* imatrix : use the function name in more error messages

* imatrix : don't use FMA explicitly

This should make comparisons between the formats easier
because this matches the behavior of the previous version.

* imatrix : avoid returning from void function save_imatrix

* imatrix : support 3d tensors with MUL_MAT

* quantize : fix dataset name loading from gguf imatrix

* common : move string_remove_suffix from quantize and imatrix

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* imatrix : add warning when legacy format is written

* imatrix : warn when writing partial data, to help guess dataset coverage

Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.

* imatrix : avoid loading model to convert or combine imatrix

* imatrix : avoid using imatrix.dat in README

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-19 12:51:22 -04:00
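Since this change, importance matrices are ordinary GGUF files, so they can be inspected with the stock gguf C API rather than a bespoke loader. The sketch below only lists the tensors stored in an imatrix file; the include path and the int64_t index type are assumptions about the current ggml headers, and the per-tensor layout (activation sums and chunk counts stored per weight tensor) follows the commit description above rather than anything verified here:

```cpp
#include <cstdio>
#include "gguf.h" // ggml's GGUF reader API (header location assumed; older trees declare it in ggml.h)

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s imatrix.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to read %s\n", argv[1]);
        return 1;
    }

    // Each weight tensor covered by the imatrix contributes its own entries
    // (activation sums and chunk counts) as named GGUF tensors.
    const int64_t n = gguf_get_n_tensors(ctx);
    for (int64_t i = 0; i < n; ++i) {
        printf("%s\n", gguf_get_tensor_name(ctx, i));
    }

    gguf_free(ctx);
    return 0;
}
```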
Ed Addario fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) (#13037) 2025-06-22 23:16:26 +02:00
Ed Addario e5c834f718
quantize : improve tensor-type pattern matching (#13033) 2025-05-13 19:12:31 +02:00
Diego Devesa 1d36b3670b
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00
Renamed from examples/quantize/quantize.cpp