llama.cpp/src
ymcki 3688c4f504
Kimi-Linear support (backend agnostic + MLA KV cache) (#18755)
* kimi linear model implementation

* kimi linear convert_hf_to_gguf

* kimi linear constants.py tensor_mapping.py

* Kimi Linear ggml.h

* kimi linear ggml-cpu

* Kimi Linear ggml-cuda

* Kimi Linear ggml.c

* kimi linear src/llama

* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning

* remove type mismatch warning

* read MoE params

* removed some hard-coded values

* removed all remaining hard-coded values

* use DeepseekV2 tokenizer

* removed unnecessary internal methods called by the old set_vocab of KimiLinear

* rewrote set_vocab for KimiLinear; removed all kda_scan code

* removed all traces of kda_scan

* reduce OP count by 1 due to removal of kda_scan

* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache

* set n_embd_head_k/v to ensure kv cache works

* don't quantize conv1d of Kimi Linear

* Kimi Linear backend agnostic

* removed LOG_INFO

* naive chunking form implemented

* fixed some comments

* add Kimi-K2 specific tokens to be recognized as EOG

* implemented build_kda_autoregressive to replace build_kda_recurrent for faster inference; synced to b7682

* replaced Akk and Aqk with mul_mat and clamp

* no clamp version

* Moved Aqk computation out of the loop

* fixed typo and split wkv_b into wk_b and wv_b

* MLA KV cache support (see the second sketch after this list)

* fix trailing spaces

* moved const llama_model & model; around to follow the qwen3next format and see if it can pass the -Wunused-private-field check

* fix trailing whitespace

* removed trailing whitespace on empty lines and made sure indentation is a multiple of 4

* try to make lint happy

* remove blank lines to make lint happy

* removed the last blank line containing whitespace

* fixed flake8 complaints locally

* return a ggml_tensor * pair from kda_autoregressive and kda_chunking, as in ngxson's Qwen3Next improvement

* removed a Kimi-Linear-specific change that caused a failure on server-windows

* removed private: from kimi_linear to make build checks happy

* removed unnecessary ggml_cont before ggml_reshape

* created static function causal_conv1d to abstract the similar code for q/k/v (see the first sketch after this list)

* merged dt_bias into SSM_DT; compute -exp(log_A) in convert_hf_to_gguf.py

* reverted to original

* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.

* removed DT_B from constants.py; removed one comment line in llama-model.cpp

* added new class llm_graph_input_mem_hybrid_k to get around the new MLA change; switched the concat order of the ggml_concat calls in kimi-linear.cpp to accommodate the MLA changes; removed support for exp_probs_b.weight

* remove ssm_o_norm_b

* remove ssm_o_norm_b

* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k

* removed all ggml_cont before ggml_reshape_4d

* Whitespace

* replaced all hparams.get calls with find_hparam

* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp

* use is_mla to switch between different mem_hybrid types

* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC

* removed the if/else for the required parameters kv_lora_rank and qk_rope_head_dim

* add back ggml_cont for Vcur

* minor changes

* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp

* F16 GGUF cannot run without a context length

* fixed the mistake of adding back n_ctx parsing
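
The shared causal_conv1d helper mentioned above follows the usual ggml recurrent-layer pattern: prepend the cached conv state, run a depthwise short convolution, apply SiLU. A minimal sketch, assuming typical Mamba-style shapes; names and layouts are illustrative, not the actual kimi-linear.cpp code:

```cpp
#include "ggml.h"

// Hypothetical sketch of a shared causal conv helper for the q/k/v streams.
// Shapes and names are assumptions, not the actual kimi-linear.cpp code.
static ggml_tensor * causal_conv1d(
        ggml_context * ctx,
        ggml_tensor  * conv_state, // last (d_conv - 1) columns from the recurrent cache
        ggml_tensor  * cur,        // current activations, shape [n_tokens, d_inner, n_seqs]
        ggml_tensor  * conv_w) {   // depthwise kernel, shape [d_conv, d_inner]
    // prepend the cached columns so the convolution stays causal across ubatches
    ggml_tensor * sx = ggml_concat(ctx, conv_state, cur, 0);

    // depthwise short convolution over the token dimension
    ggml_tensor * out = ggml_ssm_conv(ctx, sx, conv_w);

    // SiLU activation, as in Mamba-style conv branches
    return ggml_silu(ctx, out);
}
```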

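The wk_b/wv_b split mentioned in the bullets above is what enables the MLA KV cache: only the compressed latent is stored, and K/V are re-expanded from it at attention time, as in DeepSeek-V2-style MLA. A hedged sketch of that idea, with the per-head expansions collapsed into single 2D matrices for brevity:

```cpp
#include "ggml.h"

// Hedged sketch of MLA-style K/V expansion from a cached latent.
// Shapes and names are assumptions, not the actual kimi-linear.cpp code.
static void mla_expand_kv(
        ggml_context * ctx,
        ggml_tensor  * kv_cmpr,  // cached latent, shape [kv_lora_rank, n_kv]
        ggml_tensor  * wk_b,     // key expansion,   shape [kv_lora_rank, n_head*d_nope]
        ggml_tensor  * wv_b,     // value expansion, shape [kv_lora_rank, n_head*d_v]
        ggml_tensor ** k_nope,   // out: non-RoPE part of K, [n_head*d_nope, n_kv]
        ggml_tensor ** v) {      // out: values, [n_head*d_v, n_kv]
    // expand the latent into the non-RoPE part of K and into V
    // (the RoPE'd key part is typically cached separately and concatenated on)
    *k_nope = ggml_mul_mat(ctx, wk_b, kv_cmpr);
    *v      = ggml_mul_mat(ctx, wv_b, kv_cmpr);
}
```
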
---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-06 11:39:58 +01:00
models Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
CMakeLists.txt Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-adapter.cpp lora: make sure model keep track of associated adapters (#18490) 2026-01-15 10:24:28 +01:00
llama-adapter.h lora: make sure model keep track of associated adapters (#18490) 2026-01-15 10:24:28 +01:00
llama-arch.cpp Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-arch.h Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-batch.cpp batch : fix sequence id ownership (#17915) 2025-12-11 14:29:47 +02:00
llama-batch.h batch : fix sequence id ownership (#17915) 2025-12-11 14:29:47 +02:00
llama-chat.cpp docs : Minor cleanups (#19252) 2026-02-02 08:38:55 +02:00
llama-chat.h model : add EXAONE MoE (#18543) 2026-01-13 23:28:38 +01:00
llama-context.cpp Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-context.h sampling : remove sampling branching in output_reserve (#18811) 2026-01-28 05:59:30 +01:00
llama-cparams.cpp cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188) 2025-06-15 10:08:58 +03:00
llama-cparams.h context : reserve new scheduler when graph topology changes (#18547) 2026-01-15 16:39:17 +02:00
llama-grammar.cpp llama : rename llama-sampling to llama-sampler (#19363) 2026-02-06 07:26:54 +01:00
llama-grammar.h common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342) 2026-01-03 16:02:43 -06:00
llama-graph.cpp Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-graph.h Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-hparams.cpp Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-hparams.h Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-impl.cpp llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653) 2025-12-15 09:24:59 +01:00
llama-impl.h ggml, llama : use defaulted constructors/destructors (#17649) 2025-12-03 07:12:18 +01:00
llama-io.cpp llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) 2025-03-13 12:35:44 +02:00
llama-io.h llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) 2025-03-13 12:35:44 +02:00
llama-kv-cache-iswa.cpp kv-cache : pad the cache size to 256 for performance (#17046) 2025-11-07 20:03:25 +02:00
llama-kv-cache-iswa.h llama: print memory breakdown on exit (#15860) 2025-09-24 16:53:48 +02:00
llama-kv-cache.cpp memory : remove unused tmp_buf (#19199) 2026-01-30 10:37:06 +01:00
llama-kv-cache.h kv-cache : optimize KQ mask construction (#18842) 2026-01-17 15:42:42 +02:00
llama-kv-cells.h llama: store mrope data in KV cell (#16825) 2025-10-29 18:09:18 +01:00
llama-memory-hybrid-iswa.cpp memory : add llama_memory_hybrid_iswa (#18601) 2026-01-21 14:30:23 +02:00
llama-memory-hybrid-iswa.h memory : add llama_memory_hybrid_iswa (#18601) 2026-01-21 14:30:23 +02:00
llama-memory-hybrid.cpp graph : reuse SSM graphs (#16490) 2025-12-16 09:36:21 +02:00
llama-memory-hybrid.h llama: print memory breakdown on exit (#15860) 2025-09-24 16:53:48 +02:00
llama-memory-recurrent.cpp memory : clarify comments for r_l and s_l tensors [no ci] (#19203) 2026-01-30 15:18:41 +01:00
llama-memory-recurrent.h llama: consistent ctx <-> buf order for KV cache (#16746) 2025-10-28 11:23:54 +01:00
llama-memory.cpp memory : correctly handle failure in apply() (#14438) 2025-06-30 18:03:03 +03:00
llama-memory.h llama: print memory breakdown on exit (#15860) 2025-09-24 16:53:48 +02:00
llama-mmap.cpp llama : Extend fallback, fix fileno for dio file, exclude case that mmap uses dio file (#18887) 2026-01-18 18:35:57 +02:00
llama-mmap.h llama : add `use_direct_io` flag for model loading (#18166) 2026-01-08 08:35:30 +02:00
llama-model-loader.cpp llama : disable Direct IO by default (#19109) 2026-01-28 09:11:13 +02:00
llama-model-loader.h llama : add `use_direct_io` flag for model loading (#18166) 2026-01-08 08:35:30 +02:00
llama-model-saver.cpp kv-cache : support V-less cache (#19067) 2026-01-25 15:48:56 +02:00
llama-model-saver.h llama/ggml: add LLM training support (#10544) 2025-05-12 14:44:49 +02:00
llama-model.cpp Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-model.h Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-quant.cpp Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-quant.h llama : refactor `src/llama.cpp` (#10902) 2025-01-03 10:18:53 +02:00
llama-sampler.cpp llama : rename llama-sampling to llama-sampler (#19363) 2026-02-06 07:26:54 +01:00
llama-sampler.h llama : rename llama-sampling to llama-sampler (#19363) 2026-02-06 07:26:54 +01:00
llama-vocab.cpp Kimi-Linear support (backend agnostic + MLA KV cache) (#18755) 2026-02-06 11:39:58 +01:00
llama-vocab.h model : add EXAONE MoE (#18543) 2026-01-13 23:28:38 +01:00
llama.cpp llama: fix integer type consistency in split helpers (#18894) 2026-01-25 09:10:52 +02:00
unicode-data.cpp server : better security control for public deployments (#9776) 2024-10-08 13:27:04 +02:00
unicode-data.h llama : reduce compile time and binary size (#9712) 2024-10-02 15:49:55 +02:00
unicode.cpp model: support youtu-vl model (#18479) 2026-01-01 19:25:54 +01:00
unicode.h devops: add s390x & ppc64le CI (#15925) 2025-09-27 02:03:33 +08:00