Commit Graph

194 Commits

Author SHA1 Message Date
Ed Addario f2a719b14a
Change tensor importance score logic 2026-02-20 15:05:46 +00:00
Ed Addario 551463e2e8
Minor refactoring 2026-02-20 15:03:56 +00:00
Ed Addario 6029c6ea17
Merge branch 'master' into quantize 2026-02-07 16:49:52 +00:00
ymcki 3688c4f504
Kimi-Linear support (backend agnostic + MLA KV cache) (#18755)
* kimi linear model implementation

* kimi linear convert_hf_to_gguf

* kimi linear constants.py tensor_mapping.py

* Kimi Linear ggml.h

* kimi linear ggml-cpu

* Kimi Linear ggml-cuda

* Kimi Linear ggml.c

* kimi linear src/llama

* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning

* remove type mismatch warning

* read MoE params

* removed some hard coded code

* removed all hard code

* use DeepseekV2 tokenizer

* removed unnecessary internal methods called by the old set_vocab of KimiLinear

* rewrite get_vocab for KimiLinear. Removed all kda_scan code

* removed all traces of kda_scan

* reduce OP count by 1 due to removal of kda_scan

* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache

* set n_embd_head_k/v to ensure kv cache works

* don't quantize conv1d of Kimi Linear

* Kimi Linear backend agnostic

* removed LOG_INFO

* naive chunking form implemented

* fixed some comments

* add Kimi-K2 specific tokens to be recognized as EOG

* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682

* replaced Akk and Aqk with mul_mat and clamp

* no clamp version

* Moved Aqk computation out of the loop

* fixed typo and split wkv_b into wk_b and wv_b

* MLA KV cache support

* fix trailing spaces

* moved const llama_model & model; around to follow qwen3next format and see if it can pass the -Wunused-private-field check

* fix trailing whitespace

* removed trailing whitespace in empty lines + make sure indentation is a multiple of 4

* try to make lint happy

* remove blank lines to make lint happy

* removed last blank line containing whitespace

* fixed flake8 complaints locally

* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement

* removed Kimi-Linear specific change that causes failure at server-windows

* removed private: from kimi_linear to make build checks happy

* removed unnecessary ggml_cont before ggml_reshape

* created static function causal_conv1d to abstract similar code for q/k/v

* merged dt_bias to SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.

* reverted to original

* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.

* remove DT_B from constants.py. remove one comment line in llama-model.cpp

* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. switch the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight

* remove ssm_o_norm_b

* remove ssm_o_norm_b

* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k

* removed all ggml_cont before ggml_reshape_4d

* Whitespace

* replaced all hparams.get with find_hparams

* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp

* use is_mla to switch between different mem_hybrid types

* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC

* removed if else for required parameters kv_lora_rank and qk_rope_head_dim

* add back ggml_cont for Vcur

* minor changes

* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp

* f16 gguf cannot run without context length

* made a mistake of adding back n_ctx parsing

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-06 11:39:58 +01:00
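One bullet above mentions doing `-exp(log_A)` in convert_hf_to_gguf.py. The idea, common to Mamba-style SSM conversions, is to fold the stored log-decay parameter into its final form at conversion time so inference never recomputes it. A minimal sketch (hypothetical helper name, not the actual converter code):

```python
import math

def fold_log_a(log_a: list[float]) -> list[float]:
    # The checkpoint stores log(A); the runtime wants A = -exp(log A),
    # which is strictly negative so the discrete decay exp(A * dt)
    # stays in (0, 1) for any positive step size dt.
    return [-math.exp(x) for x in log_a]
```

Precomputing this at conversion time trades a one-time transform in the converter for one fewer elementwise op on every inference step.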
Ed Addario 7a02dafaa7
Fix state file name generation and simplify_pareto() lambda capture 2026-02-04 10:40:50 +00:00
Ed Addario 462d3dab82
Merge branch 'master' into quantize 2026-02-03 10:57:05 +00:00
Georgi Gerganov c5c64f72ac
llama : disable Direct IO by default (#19109)
* llama : disable Direct IO by default

* cont : override mmap if supported
2026-01-28 09:11:13 +02:00
Ed Addario 220df5f1ff
Update output log 2026-01-22 23:19:26 +00:00
Ed Addario 0b5030d704
Merge branch 'master' into quantize 2026-01-22 15:45:07 +00:00
Georgi Gerganov 0e4ebeb057
quant : manual overrides of tensor types take precedence (#18952) 2026-01-22 16:17:06 +02:00
Ed Addario ff3b9b4cae
Memory optimisations (AI assisted) 2026-01-22 11:39:26 +00:00
Ed Addario 2ede173218
Performance optimisations (AI assisted) 2026-01-22 10:38:16 +00:00
Ed Addario 1c23a6fbd2
Add experimental entropy-modulated weighted cosine error (WCE) 2026-01-21 18:28:37 +00:00
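The experimental metric above is described only by name. As a purely hypothetical illustration of the weighted-cosine-error part (the entropy modulation would scale this score per tensor and is omitted here), one plausible form is 1 minus a per-element importance-weighted cosine similarity between the original and dequantized values:

```python
import math

# Hypothetical sketch only -- not the actual WCE implementation.
# w holds per-element importance weights (e.g. from an imatrix),
# x the original tensor values, y the dequantized values.
def weighted_cosine_error(x, y, w):
    dot = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    nx = math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))
    ny = math.sqrt(sum(wi * yi * yi for wi, yi in zip(w, y)))
    if nx == 0.0 or ny == 0.0:
        return 0.0
    return 1.0 - dot / (nx * ny)
```

A score near 0 means quantization preserved the tensor's direction under the importance weighting; larger scores flag tensors that may deserve a higher-precision type.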
Ed Addario 0b63f50463
Major refactor 2026-01-21 18:26:41 +00:00
Ed Addario 41ff6f95ee
Merge branch 'master' into quantize 2026-01-11 18:39:28 +00:00
Julius Tischbein 2038101bd9
llama : add `use_direct_io` flag for model loading (#18166)
* Adding --direct-io flag for model loading

* Fixing read_raw() calls

* Fixing Windows read_raw_at

* Changing type off_t to size_t for windows and Renaming functions

* disable direct io when mmap is explicitly enabled

* Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL

* Fallback to std::fread in case O_DIRECT fails due to bad address

* Windows: remove const keywords and unused functions

* Update src/llama-mmap.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-08 08:35:30 +02:00
Ed Addario d18ddbac9c
Refactor variable names 2026-01-07 18:28:57 +00:00
Ed Addario 774ba01367
Remove file deletion 2026-01-07 18:28:26 +00:00
Ed Addario 06f46afedc
Improve file handling 2026-01-07 18:27:39 +00:00
Ed Addario c09fa60daa
Update parameter names 2026-01-07 18:26:16 +00:00
Ed Addario bdd7ec7f56
Implement target_size logic 2026-01-07 18:11:51 +00:00
Ed Addario 960ef96141
Prepare for future optimization algorithms 2026-01-01 13:44:59 +00:00
Ed Addario 91846ee79b
Change checkpoint file magic 2025-12-29 13:02:06 +00:00
Ed Addario b6d718a4a6
Add code comments 2025-12-25 15:47:44 +00:00
Ed Addario 5f7bba7828
Improve state checkpoint filename 2025-12-25 15:47:18 +00:00
Ed Addario dfa79a9484
Merge branch 'master' into quantize 2025-12-16 13:57:54 +01:00
Johannes Gäßler b1f3a6e5db
llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-15 09:24:59 +01:00
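The core of "automatically fit args to free memory" can be illustrated with a toy version of the layer-offload decision: given an estimate of free VRAM and a per-layer memory cost, pick the largest layer count that still fits. This is a hypothetical sketch of the idea, not the llama-fit-params algorithm:

```python
# Hypothetical illustration: choose n_gpu_layers from a VRAM budget.
# overhead_bytes stands in for non-layer allocations (KV cache, compute
# buffers, etc.) that must fit before any layer is offloaded.
def fit_gpu_layers(free_bytes: int, layer_bytes: int, n_layers: int,
                   overhead_bytes: int = 0) -> int:
    budget = free_bytes - overhead_bytes
    if budget <= 0 or layer_bytes <= 0:
        return 0
    return min(n_layers, budget // layer_bytes)
```

The real tool must also avoid reallocation after the choice is made (one of the bullets above), which is why it errs on the side of a conservative estimate rather than probing until an allocation fails.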
Ed Addario e3d9b340ca
Merge branch 'master' into quantize 2025-12-06 15:07:36 +01:00
Daniel Bevenius 444f00b0ec
llama : remove quantization sanity check (#17788)
* llama : remove quantization sanity check

This commit removes the quantization sanity check for attention layers.

The motivation for this is that there are hybrid models that have
recurrent layers, expert layers, and attention layers. For these
models the current check fails, as the expert layers are not taken
into account. After consideration, it was decided that this check
is not strictly necessary, and can be removed to allow for more flexible
model architectures.

* llama : remove unused pruned_attention_w and is_clip_model vars
2025-12-06 12:26:20 +01:00
Georgi Gerganov a67ef0f47f
llama : fix sanity checks during quantization (#17721) 2025-12-04 10:33:42 +02:00
Ed Addario 3f7842c645
Merge branch 'master' into quantize 2025-11-30 13:01:54 +00:00
Ed Addario 37cf51ebd0
Process bpw targets up to B/F16 2025-11-30 00:29:35 +00:00
Ed Addario 229109f329
Increase importance boost for final pass 2025-11-29 10:31:39 +00:00
Ed Addario 5b557ca958
Minor refactoring 2025-11-29 10:30:20 +00:00
Piotr Wilkin (ilintar) ff55414c42
model : Qwen3 Next (#16095)
* Qwen3 Next - cleaned up version

* Whitespaces and stuff

* Correct minor errors

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Misc. fixes.

* Clean up code, add missing hybrid qualifier

* Did someone transpose the SOLVE_TRI result matrix? Perhaps...

* Whitespace

* Proper tensors for cb calls

* Use llama-graph.h vertical alignment

* BROKEN: chunking

* Set new tensors as inputs.

* Proper chunk logic

* It's the circle of life...

* More shenanigans for n_seq > 1

* Nail in the coffin?

* Fix Windows build

* Eh, one fails on Windows, the other fails on Mac... just use general capture.

* quant : cleanup

* model : cleanup

* qwen3 : cleanup

* cont : cleanup

* cont : cleanup

* ggml : revert change

* qwen3 : cleanup

* cont : cleanup

* Readd cmath

* qwen3 : fix typo

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Usual suspects

* fix my bad suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-28 12:02:56 +01:00
Ed Addario 6616008420
Use more descriptive option naming 2025-11-24 18:26:45 +00:00
Ed Addario 1c9993e131
Add --disable-tensor-importance option 2025-11-23 17:51:04 +00:00
Ed Addario 9ec3e6e262
Remove processing statistics_data 2025-11-23 17:49:53 +00:00
Ed Addario a0ba913613
Fix lambda capture bug in Windows and initialise candidate_types struct 2025-11-19 11:19:44 +00:00
Ed Addario ac8cfbdd12
Improved is_important() logic 2025-11-17 18:03:09 +00:00
Ed Addario b02b1b2304
Merge branch 'master' into quantize 2025-10-31 23:20:17 +00:00
Ed Addario c59bb6d49d
Add Euclidean-Cosine score to identify important tensors 2025-10-30 22:11:40 +00:00
Ed Addario 6e32244a06
Read statistics from imatrix 2025-10-30 21:53:07 +00:00
Jan Boon d7395115ba
llama : use std::abs instead of abs (#16853) 2025-10-30 08:30:58 +02:00
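The rationale for this kind of change is a classic C++ pitfall: C's `abs` (from `<stdlib.h>`) takes an `int`, so calling it on a floating-point value can silently truncate, whereas `std::abs` from `<cmath>` is overloaded for `float`, `double`, and `long double`. A minimal illustration (hypothetical helper name):

```cpp
#include <cmath>

// std::abs resolves to the floating-point overload here; plain ::abs
// could convert the argument to int first (losing the fraction) or
// fail to compile, depending on which headers are in scope.
inline double magnitude(double x) {
    return std::abs(x);  // keeps the fractional part
}
```

Using the `std::`-qualified name everywhere makes overload resolution explicit and portable across standard library implementations.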
Ed Addario f8863b9a80
Minor refactoring 2025-10-28 15:22:32 +00:00
Ed Addario 5303212324
Simplify tensor selection 2025-10-26 17:40:52 +00:00
Ed Addario d6ccd5649a
Finetune heuristics 2025-10-25 12:09:20 +01:00
Ed Addario 04561d5782
Update epsilon specifier 2025-10-21 12:53:26 +01:00
Ed Addario 27bf25e93c
Fix lambda capture 2025-10-20 22:04:35 +01:00
Ed Addario 543b5a99db
Fix lambda capture 2025-10-20 21:57:03 +01:00