Details
Kimi-Linear support (backend agnostic + MLA KV cache) (#18755)
- Kimi Linear model implementation
- Kimi Linear convert_hf_to_gguf
- Kimi Linear constants.py / tensor_mapping.py
- Kimi Linear ggml.h
- Kimi Linear ggml-cpu
- Kimi Linear ggml-cuda
- Kimi Linear ggml.c
- Kimi Linear src/llama
- Removed `const int64_t n_seq_tokens = q->ne[2];` to get rid of an unused-variable warning
- Removed a type-mismatch warning
- Read MoE params
- Removed some hard-coded values
- Removed all remaining hard-coded values
- Use the DeepseekV2 tokenizer
- Removed unnecessary internal methods called by the old set_vocab of KimiLinear
- Rewrote get_vocab for KimiLinear; removed all kda_scan code
- Removed all traces of kda_scan
- Reduced the OP count by 1 due to the removal of kda_scan
- Moved KIMI_LINEAR to llm_arch_is_hybrid to enable the KV cache
- Set n_embd_head_k/v to ensure the KV cache works
- Don't quantize the conv1d of Kimi Linear
- Made Kimi Linear backend agnostic
- Removed LOG_INFO
- Implemented the naive chunking form
- Fixed some comments
- Added Kimi-K2-specific tokens to be recognized as EOG
- Implemented build_kda_autoregressive to replace build_kda_recurrent for faster inference; synced to b7682
- Replaced Akk and Aqk with mul_mat and clamp
- No-clamp version
- Moved the Aqk computation out of the loop
- Fixed a typo and split wkv_b into wk_b and wv_b
- MLA KV cache support (see the latent-cache sketch after this list)
- Fixed trailing spaces
- Moved `const llama_model & model;` around to follow the qwen3next format and see if it can get past the -Wunused-private-field error
- Fixed trailing whitespace
- Removed trailing whitespace on empty lines and made sure indentation is a multiple of 4
- Tried to make lint happy
- Removed blank lines to make lint happy
- Removed blank lines containing whitespace
- Fixed flake8 complaints locally
- Return a ggml_tensor * pair in kda_autoregressive and kda_chunking, as in ngxson's Qwen3Next improvement
- Removed a Kimi-Linear-specific change that caused a failure at server-windows
- Removed private: from kimi_linear to make the build checks happy
- Removed unnecessary ggml_cont before ggml_reshape
- Created a static function causal_conv1d to abstract the similar code for q/k/v (see the reference sketch after this list)
- Merged dt_bias into SSM_DT; do -exp(log_A) in convert_hf_to_gguf.py (see the conversion sketch after this list)
- Reverted to the original
- Fixed find_hparam calls; fixed e_score_correction_bias to use bias instead of weight; removed all ssm_conv bias terms
- Removed DT_B from constants.py; removed one comment line in llama-model.cpp
- New class llm_graph_input_mem_hybrid_k to get around the new MLA change; switched the order of the ggml_concat calls in kimi-linear.cpp to accommodate the MLA changes; removed support for exp_probs_b.weight
- Removed ssm_o_norm_b
- Changed hparams.kda_head_dim to hparams.n_embd_head_kda; added a TODO comment for class llama_graph_mem_hybrid_k
- Removed all ggml_cont before ggml_reshape_4d
- Whitespace
- Replaced all hparams.get calls with find_hparams
- Added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py; removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp
- Use is_mla to switch between the different mem_hybrid types
- Fixed logical errors in convert_hf_to_gguf.py pointed out by CISC
- Removed the if/else for the required parameters kv_lora_rank and qk_rope_head_dim
- Added back ggml_cont for Vcur
- Minor changes
- Removed an extra line in llama-vocab.cpp; added back the comment in llama-graph.cpp
- An f16 GGUF cannot run without the context length
- Made a mistake of adding back n_ctx parsing
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
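
The "Merged dt_bias into SSM_DT; do -exp(log_A) in convert_hf_to_gguf.py" entry refers to baking part of the math into the conversion step. The sketch below only illustrates that idea; it is not the actual convert_hf_to_gguf.py code, and all tensor names used here are assumptions.

```python
# Minimal sketch (assumed names, not the real converter) of the two
# conversion-time choices: store A = -exp(A_log) in the GGUF so inference
# never re-applies the exponential, and export dt_bias under the SSM_DT
# name instead of keeping a separate DT_B tensor.
import torch

def convert_kda_tensor_sketch(name: str, data: torch.Tensor):
    if name.endswith("A_log"):
        return "ssm_a", -torch.exp(data)   # bake the -exp() in once
    if name.endswith("dt_bias"):
        return "ssm_dt", data              # merged into SSM_DT, no DT_B
    return name, data
```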
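
The "causal_conv1d" helper mentioned in the list factors the identical short-convolution code for q/k/v into one place. Below is a plain NumPy reference of what a causal depthwise convolution computes; the function name, shapes, and layout are assumptions for illustration, while the real code builds ggml ops.

```python
# Reference semantics of a causal depthwise conv1d: each channel is convolved
# with its own small kernel, left-padded so position t never sees inputs > t.
import numpy as np

def causal_conv1d_ref(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # x: (n_tokens, n_channels), w: (kernel_size, n_channels)
    k = w.shape[0]
    xp = np.pad(x, ((k - 1, 0), (0, 0)))   # pad on the left only
    return np.stack([(xp[t:t + k] * w).sum(axis=0) for t in range(x.shape[0])])

x = np.random.default_rng(1).standard_normal((5, 3))
w = np.ones((4, 3)) / 4.0                  # kernel_size = 4
y = causal_conv1d_ref(x, w)                # same shape as x
```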
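
The "MLA KV cache support" and "split wkv_b into wk_b and wv_b" entries amount to caching one compressed latent per token and expanding K/V per head at attention time. The NumPy sketch below shows that idea only; the shapes, the names wk_b/wv_b/c_kv, and the omission of the RoPE part are assumptions, not the llama.cpp implementation.

```python
# Rough sketch of MLA-style KV caching: store a compressed latent per token
# and expand it per head with the split projections wk_b / wv_b.
import numpy as np

n_head, d_latent, d_head = 4, 64, 32
rng = np.random.default_rng(0)
wk_b = rng.standard_normal((n_head, d_latent, d_head))  # latent -> K per head
wv_b = rng.standard_normal((n_head, d_latent, d_head))  # latent -> V per head

kv_cache = []                        # grows by one latent vector per token

def step(c_kv):                      # c_kv: compressed latent of the new token
    kv_cache.append(c_kv)
    lat = np.stack(kv_cache)                    # (n_tokens, d_latent)
    k = np.einsum('td,hde->hte', lat, wk_b)     # (n_head, n_tokens, d_head)
    v = np.einsum('td,hde->hte', lat, wv_b)
    return k, v

k, v = step(rng.standard_normal(d_latent))
```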
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: