Commit Graph

32 Commits

Author SHA1 Message Date
Yee Man Chan 06f0728984 replace ggml_acc with ggml_set for Vulkan compatibility 2026-02-07 13:28:02 +08:00
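For context, a minimal sketch of this swap, assuming the standard ggml API (tensor names are illustrative, not from the commit): ggml_acc adds src into a strided view of dst, while ggml_set overwrites that view, so the two agree whenever the destination region is written rather than accumulated into, and ggml_set is supported by the Vulkan backend.

```c
#include "ggml.h"

// Minimal sketch, not the commit's actual code: writing a block of
// `src` into `dst` at a byte offset. ggml_acc would add into the view;
// ggml_set overwrites it, which is equivalent when the region is being
// written for the first time, and is supported by the Vulkan backend.
static struct ggml_tensor * write_block(
        struct ggml_context * ctx,
        struct ggml_tensor  * dst,
        struct ggml_tensor  * src,
        size_t                offset) {
    // before: return ggml_acc(ctx, dst, src, dst->nb[1], dst->nb[2], dst->nb[3], offset);
    return ggml_set(ctx, dst, src, dst->nb[1], dst->nb[2], dst->nb[3], offset);
}
```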
Yee Man Chan 97f229c8df sync to latest plus replace chunkify with get_slice_2d 2026-02-06 19:26:24 +08:00
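get_slice_2d presumably returns a zero-copy 2-D view in place of the per-chunk copies that chunkify made; a hypothetical reconstruction on top of ggml_view_2d (the helper's real signature in this branch may differ):

```c
#include "ggml.h"

// Hypothetical reconstruction (the real helper may differ): return a
// zero-copy 2-D view of rows [i1, i1 + n1) of `a`, reusing its stride,
// instead of materializing each chunk as a separate tensor.
static struct ggml_tensor * get_slice_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int64_t               i1,   // first row of the slice
        int64_t               n1) { // number of rows
    return ggml_view_2d(ctx, a, a->ne[0], n1, a->nb[1], i1*a->nb[1]);
}
```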
Yee Man Chan 17cd6e8514 4x4 16x16 blocks computation for Akk and Aqk 2026-02-06 19:03:09 +08:00
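As a generic illustration of the tiling named in this commit (the actual Akk/Aqk kernel layout may differ): a score matrix C = A·Bᵀ computed in 16x16 tiles of 4x4 micro-blocks, the usual shape of such kernels.

```c
// Generic illustration only; the commit's actual Akk/Aqk kernels may
// differ. Compute C = A * B^T in 16x16 tiles of 4x4 micro-blocks
// (n and k are assumed to be multiples of 16 for brevity).
enum { TILE = 16, MICRO = 4 };

void matmul_blocked(const float * A, const float * B, float * C, int n, int k) {
    for (int ti = 0; ti < n; ti += TILE)
    for (int tj = 0; tj < n; tj += TILE)               // one 16x16 tile of C
    for (int bi = ti; bi < ti + TILE; bi += MICRO)
    for (int bj = tj; bj < tj + TILE; bj += MICRO)     // one 4x4 micro-block
    for (int i = bi; i < bi + MICRO; i++)
    for (int j = bj; j < bj + MICRO; j++) {
        float sum = 0.0f;
        for (int p = 0; p < k; p++) {
            sum += A[i*k + p] * B[j*k + p];            // B accessed transposed
        }
        C[i*n + j] = sum;
    }
}
```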
ymcki 3688c4f504
Kimi-Linear support (backend agnostic + MLA KV cache) (#18755)
* kimi linear model implementation

* kimi linear convert_hf_to_gguf

* kimi linear constants.py tensor_mapping.py

* Kimi Linear ggml.h

* kimi linear ggml-cpu

* Kimi Linear ggml-cuda

* Kimi Linear ggml.c

* kimi linear src/llama

* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning

* remove type mismatch warning

* read MoE params

* removed some hard coded code

* removed all hard code

* use DeepseekV2 tokenizer

* removed unnecessary internal methods called by the old set_vocab of KimiLinear

* rewrite get_vocab for KimiLinear. Removed all kda_scan code

* removed all traces of kda_scan

* reduce OP count by 1 due to removal of kda_scan

* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache

* set n_embd_head_k/v to ensure kv cache works

* don't quantize conv1d of Kimi Linear

* Kimi Linear backend agnostic

* removed LOG_INFO

* naive chunking form implemented

* fixed some comments

* add Kimi-K2 specific tokens to be recognized as EOG

* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. Synced to b7682

* replaced Akk and Aqk with mul_mat and clamp

* no clamp version

* Moved Aqk computation out of the loop

* fixed typo and split wkv_b into wk_b and wv_b

* MLA KV cache support

* fix trailing spaces

* moved const llama_model & model; around to follow qwen3next format and see if it can pass the -Wunused-private-field error

* fix trailing whitespace

* removed trailing whitespace in empty lines + made sure indentation is a multiple of 4

* try to make lint happy

* remove blank lines to make lint happy

* removed the last blank line containing whitespace

* fixed flake8 complaints locally

* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement

* removed Kimi-Linear specific change that causes failure at server-windows

* removed private: from kimi_linear to make build checks happy

* removed unnecessary ggml_cont before ggml_reshape

* created static function causal_conv1d to abstract similar code for q/k/v

* merged dt_bias into SSM_DT. Compute -exp(log_A) in convert_hf_to_gguf.py.

* reverted to original

* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.

* remove DT_B from constants.py. remove one comment line in llama-model.cpp

* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. Switched the concat order of ggml_concat calls in kimi-linear.cpp to accommodate the MLA changes. Removed support for exp_probs_b.weight

* remove ssm_o_norm_b

* remove ssm_o_norm_b

* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k

* removed all ggml_cont before ggml_reshape_4d

* Whitespace

* replaced all hparams.get with find_hparams

* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp

* use is_mla to switch between different mem_hybrid types

* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC

* removed if else for required parameters kv_lora_rank and qk_rope_head_dim

* add back ggml_cont for Vcur

* minor changes

* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp

* f16 gguf cannot run without context length

* made a mistake of adding back n_ctx parsing

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-06 11:39:58 +01:00
Yee Man Chan 000fded1ea add back ggml_cont for Vcur 2026-02-03 18:42:17 +08:00
Yee Man Chan 11282a0f60 use is_mla to switch between different mem_hybrid types 2026-02-01 20:12:20 +08:00
Yee Man Chan 2c8cd844d0 added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp 2026-02-01 08:42:01 +08:00
Yee Man Chan 6216273ede removed all ggml_cont before ggml_reshape_4d 2026-01-29 08:46:33 +08:00
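A sketch of the pattern this commit removes, with illustrative names rather than the PR's exact code: ggml_reshape_4d asserts that its input is contiguous, so when the producing op already yields a contiguous tensor, the preceding ggml_cont is a redundant full copy.

```c
#include "ggml.h"

// Sketch with illustrative names, not the PR's exact code: the extra
// ggml_cont below copies the whole tensor even when `x` is already
// contiguous. ggml_reshape_4d asserts contiguity itself, so dropping
// the cont still fails loudly on a genuinely non-contiguous input.
static struct ggml_tensor * reshape_heads(
        struct ggml_context * ctx, struct ggml_tensor * x,
        int64_t head_dim, int64_t n_head, int64_t n_tokens, int64_t n_seqs) {
    // before: x = ggml_cont(ctx, x);
    return ggml_reshape_4d(ctx, x, head_dim, n_head, n_tokens, n_seqs);
}
```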
Yee Man Chan a6b2c450c8 changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k 2026-01-29 08:35:35 +08:00
Yee Man Chan 0444a4faa0 remove ssm_o_norm_b 2026-01-27 13:19:55 +08:00
Yee Man Chan f1525b3695 new class llm_graph_input_mem_hybrid_k to get around the new MLA change. Switched the concat order of ggml_concat calls in kimi-linear.cpp to accommodate the MLA changes. Removed support for exp_probs_b.weight 2026-01-27 11:25:13 +08:00
Yee Man Chan 560190af97 fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms. 2026-01-21 22:12:21 +08:00
Yee Man Chan 0aea18e718 merged dt_bias into SSM_DT. Compute -exp(log_A) in convert_hf_to_gguf.py. 2026-01-16 12:02:27 +08:00
Yee Man Chan c163dff4c0 synced fork and fixed comments in kimi-linear.cpp 2026-01-14 18:01:44 +08:00
Yee Man Chan 2882915258 created static function causal_conv1d to abstract similar code for q/k/v 2026-01-14 17:26:00 +08:00
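A plausible shape for the helper (the PR's actual signature may differ), assuming the short depthwise causal convolution that ggml already exposes as ggml_ssm_conv:

```c
#include "ggml.h"

// Plausible shape only; the PR's signature may differ. One static
// helper replaces three near-identical causal-convolution blocks for
// the q/k/v paths. The SiLU activation is an assumption.
static struct ggml_tensor * causal_conv1d(
        struct ggml_context * ctx,
        struct ggml_tensor  * conv_x,   // input with conv state prepended
        struct ggml_tensor  * conv_w) { // depthwise conv1d kernel
    return ggml_silu(ctx, ggml_ssm_conv(ctx, conv_x, conv_w));
}
```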
Yee Man Chan 18ae7f4684 removed unnecessary ggml_cont before ggml_reshape 2026-01-14 03:22:53 +08:00
Yee Man Chan 22bc582a82 return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement 2026-01-12 20:32:19 +08:00
Yee Man Chan 59182f5e06 fix trailing whitespace 2026-01-11 22:06:48 +08:00
Yee Man Chan 93afbedc96 moved const llama_model & model; around to follow qwen3next format and see if it can pass the -Wunused-private-field error 2026-01-11 21:44:54 +08:00
Yee Man Chan 6ae66fc40d fix trailing spaces 2026-01-11 21:31:35 +08:00
Yee Man Chan b9360c7fe1 MLA KV cache support 2026-01-11 15:58:46 +08:00
Yee Man Chan d26fe50178 Moved Aqk computation out of the loop 2026-01-10 08:45:57 +08:00
Yee Man Chan 6150bb7b17 no clamp version 2026-01-09 20:11:45 +08:00
Yee Man Chan f99913dd5f replaced Akk and Aqk with mul_mat and clamp 2026-01-08 13:40:17 +08:00
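A sketch of this pattern (tensor names and clamp bounds are assumptions, not from the commit): the score matrix is built with a single ggml_mul_mat and bounded with ggml_clamp; the follow-up "no clamp version" commit above then drops the clamp.

```c
#include "ggml.h"

// Sketch only; tensor names and clamp bounds are assumptions. The score
// matrix q @ k^T is built with a single ggml_mul_mat and bounded with
// ggml_clamp; the follow-up "no clamp version" commit drops the clamp.
static struct ggml_tensor * build_scores(
        struct ggml_context * ctx,
        struct ggml_tensor  * k,    // [head_dim, n_tokens, n_head, n_seqs]
        struct ggml_tensor  * q) {  // same layout as k
    struct ggml_tensor * a = ggml_mul_mat(ctx, k, q); // rows of q against rows of k
    return ggml_clamp(ctx, a, -50.0f, 50.0f);         // assumed bounds
}
```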
Yee Man Chan 1099cbf694 build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. Synced to b7682 2026-01-07 18:42:31 +08:00
Yee Man Chan e3542ff8a2 fixed some comments 2026-01-06 11:35:25 +08:00
Yee Man Chan cfed14e31b naive chunking form implemented 2026-01-06 11:23:53 +08:00
Yee Man Chan aba181ebad removed LOG_INFO 2026-01-05 19:21:06 +08:00
Yee Man Chan 66c0c5d8d4 Kimi Linear backend agnostic 2026-01-05 16:35:19 +08:00
Yee Man Chan a0269af292 removed all hard code 2025-12-06 11:51:16 +08:00
Yee Man Chan 9f1265fec1 removed some hard coded code 2025-12-05 19:51:02 +08:00
Yee Man Chan 27baad43d5 kimi linear model implementation 2025-12-02 08:35:14 +08:00