ggml-org/llama.cpp b7957

Kimi-Linear support (backend agnostic + MLA KV cache) (#18755)

  • Kimi Linear model implementation

  • Kimi Linear support in convert_hf_to_gguf.py

  • Kimi Linear entries in constants.py and tensor_mapping.py

  • Kimi Linear changes in ggml.h

  • Kimi Linear changes in ggml-cpu

  • Kimi Linear changes in ggml-cuda

  • Kimi Linear changes in ggml.c

  • Kimi Linear changes in src/llama

  • remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning

  • remove type mismatch warning

  • read MoE params

  • removed some hard-coded values

  • removed all remaining hard-coded values

  • use DeepseekV2 tokenizer

  • removed unnecessary internal methods called by the old set_vocab of KimiLinear

  • rewrite get_vocab for KimiLinear. Removed all kda_scan code

  • removed all traces of kda_scan

  • reduce OP count by 1 due to removal of kda_scan

  • Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache

  • set n_embd_head_k/v to ensure kv cache works

  • don't quantize conv1d of Kimi Linear

  • Kimi Linear backend agnostic

  • removed LOG_INFO

  • naive chunking form implemented

  • fixed some comments

  • add Kimi-K2 specific tokens to be recognized as EOG

  • build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682

  • replaced Akk and Aqk with mul_mat and clamp (see the first sketch after this list)

  • no clamp version

  • Moved Aqk computation out of the loop

  • fixed typo and split wkv_b into wk_b and wv_b

  • MLA KV cache support

  • fix trailing spaces

  • moved const llama_model & model; around to follow the Qwen3Next format and see if it can pass the -Wunused-private-field error

  • fix trailing whitespace

  • removed trailing whitespace on empty lines + made sure indentation is a multiple of 4

  • try to make lint happy

  • remove blank lines to make lint happy

  • removed at least one blank line containing whitespace

  • fixed flake8 complaints locally

  • return a ggml_tensor * pair from kda_autoregressive and kda_chunking, as in ngxson's Qwen3Next improvement (see the second sketch after this list)

  • removed a Kimi-Linear-specific change that caused a failure in the server-windows check

  • removed private: from kimi_linear to make build checks happy

  • removed unnecessary ggml_cont before ggml_reshape

  • created static function causal_conv1d to abstract similar code for q/k/v (see the third sketch after this list)

  • merged dt_bias into SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.

  • reverted to original

  • fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.

  • remove DT_B from constants.py. remove one comment line in llama-model.cpp

  • new class llm_graph_input_mem_hybrid_k to get around the new MLA change. Switched the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight

  • remove ssm_o_norm_b

  • remove ssm_o_norm_b

  • changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k

  • removed all ggml_cont before ggml_reshape_4d

  • Whitespace

  • replaced all hparams.get with find_hparams

  • added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp

  • use is_mla to switch between different mem_hybrid types

  • fixed logical errors in convert_hf_to_gguf.py pointed out by CISC

  • removed if else for required parameters kv_lora_rank and qk_rope_head_dim

  • add back ggml_cont for Vcur

  • minor changes

  • removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp

  • f16 GGUF cannot run without a context length

  • made a mistake of adding back n_ctx parsing
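
The Akk/Aqk bullet above refers to replacing a custom scan-style computation with stock ggml ops. Below is a minimal, hypothetical sketch of that pattern, assuming Aqk is a per-head score matrix between the q and k projections; the helper name, tensor shapes, and clamp bounds are illustrative, not taken from the PR.

```cpp
#include "ggml.h"

// Hypothetical sketch: build an Aqk-style score matrix from generic ggml ops
// instead of a dedicated kda_scan kernel. Shapes and bounds are assumptions.
static ggml_tensor * build_aqk_sketch(
        ggml_context * ctx,
        ggml_tensor  * q,    // assumed [head_dim, n_tokens, n_head]
        ggml_tensor  * k) {  // assumed [head_dim, n_tokens, n_head]
    // ggml_mul_mat contracts over dim 0 (head_dim), so
    // scores[i, j, h] = k[:, i, h] . q[:, j, h]  ->  [n_tokens, n_tokens, n_head]
    ggml_tensor * scores = ggml_mul_mat(ctx, k, q);

    // clamp for numerical stability of the recurrence; the bounds are
    // illustrative, and a later commit ("no clamp version") drops this step
    return ggml_clamp(ctx, scores, -50.0f, 50.0f);
}
```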

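For the ggml_tensor * pair bullet, the sketch below shows the shape of such an interface in isolation: the builder returns both the layer output and the updated recurrent state, and the caller unpacks the two. Function and variable names are placeholders, not the PR's actual code.

```cpp
#include <utility>
#include "ggml.h"

// Hypothetical sketch: a KDA builder that returns {layer output, new state}
// so the caller decides how the state gets written back into the cache.
static std::pair<ggml_tensor *, ggml_tensor *> build_kda_sketch(
        ggml_context * ctx,
        ggml_tensor  * cur,        // current hidden states
        ggml_tensor  * state_in) { // recurrent state loaded from the cache
    // ... the actual KDA recurrence would be built here ...
    ggml_tensor * out       = cur;       // placeholder for the layer output
    ggml_tensor * state_out = state_in;  // placeholder for the updated state
    return {out, state_out};
}

// Caller side: unpack the pair; the new state would then be copied back into
// its cache slot (e.g. via ggml_cpy) when the graph is assembled.
static ggml_tensor * call_kda_sketch(ggml_context * ctx, ggml_tensor * cur, ggml_tensor * state) {
    auto [out, new_state] = build_kda_sketch(ctx, cur, state);
    (void) new_state; // state write-back elided in this sketch
    return out;
}
```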

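For the causal_conv1d bullet, here is a hedged sketch of what a shared helper for the q/k/v short convolutions can look like when expressed with ggml_ssm_conv. The layouts are assumptions chosen to match what ggml_ssm_conv expects, not a description of the PR's exact shapes.

```cpp
#include "ggml.h"

// Hypothetical sketch of a shared causal conv1d helper for the q/k/v branches.
// Assumed layout: x is [d_inner, n_tokens, n_seqs], conv_state holds the last
// (d_conv - 1) time steps as [d_conv - 1, d_inner, n_seqs], and conv_weight is
// [d_conv, d_inner], which is the layout ggml_ssm_conv expects.
static ggml_tensor * causal_conv1d(
        ggml_context * ctx,
        ggml_tensor  * x,
        ggml_tensor  * conv_state,
        ggml_tensor  * conv_weight) {
    // put the time dimension first so the cached state can be prepended
    ggml_tensor * xt = ggml_cont(ctx, ggml_transpose(ctx, x)); // [n_tokens, d_inner, n_seqs]

    // prepend the cached tail of the previous ubatch to keep the conv causal
    ggml_tensor * sx = ggml_concat(ctx, conv_state, xt, 0);    // [d_conv - 1 + n_tokens, d_inner, n_seqs]

    // storing the new tail of sx back into the recurrent cache is elided here
    return ggml_ssm_conv(ctx, sx, conv_weight);                // [d_inner, n_tokens, n_seqs]
}
```

Factoring this out is plain deduplication: the same prepend-state-then-convolve step runs once per q/k/v branch, only with different weights.
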
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

