ggml-org/llama.cpp b7957

Kimi-Linear support (backend agnostic + MLA KV cache) (#18755)

  • Kimi Linear model implementation

  • Kimi Linear support in convert_hf_to_gguf.py

  • Kimi Linear entries in constants.py and tensor_mapping.py

  • Kimi Linear changes in ggml.h

  • Kimi Linear changes in ggml-cpu

  • Kimi Linear changes in ggml-cuda

  • Kimi Linear changes in ggml.c

  • Kimi Linear changes in src/llama

  • remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning

  • remove type mismatch warning

  • read MoE params

  • removed some hard-coded values

  • removed all remaining hard-coded values

  • use DeepseekV2 tokenizer

  • removed unnecessary internal methods called by the old set_vocab of KimiLinear

  • rewrite get_vocab for KimiLinear. Removed all kda_scan code

  • removed all traces of kda_scan

  • reduce OP count by 1 due to removal of kda_scan

  • Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache

  • set n_embd_head_k/v to ensure kv cache works

  • don't quantize conv1d of Kimi Linear

  • Kimi Linear backend agnostic

  • removed LOG_INFO

  • naive chunking form implemented

  • fixed some comments

  • add Kimi-K2 specific tokens to be recognized as EOG

  • build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682

  • replaced Akk and Aqk with mul_mat and clamp (see the first sketch after this list)

  • no clamp version

  • Moved Aqk computation out of the loop

  • fixed typo and split wkv_b into wk_b and wv_b

  • MLA KV cache support

  • fix trailing spaces

  • moved const llama_model & model; around to follow the Qwen3Next format and see if it can pass the -Wunused-private-field error

  • fix trailing whitespace

  • removed trailing whitespace on empty lines + made sure indentation is a multiple of 4

  • try to make lint happy

  • remove blank lines to make lint happy

  • removed at least one blank line containing whitespace

  • fixed flake8 complaints locally

  • return a ggml_tensor * pair from kda_autoregressive and kda_chunking, as in ngxson's Qwen3Next improvement (see the second sketch after this list)

  • removed a Kimi-Linear-specific change that caused a failure in the server-windows check

  • removed private: from kimi_linear to make build checks happy

  • removed unnecessary ggml_cont before ggml_reshape

  • created static function causal_conv1d to abstract similar code for q/k/v (see the third sketch after this list)

  • merged dt_bias into SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.

  • reverted to original

  • fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.

  • remove DT_B from constants.py. remove one comment line in llama-model.cpp

  • new class llm_graph_input_mem_hybrid_k to get around the new MLA change. Switched the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight

  • remove ssm_o_norm_b

  • remove ssm_o_norm_b

  • changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k

  • removed all ggml_cont before ggml_reshape_4d

  • Whitespace

  • replaced all hparams.get with find_hparams

  • added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp

  • use is_mla to switch between different mem_hybrid types

  • fixed logical errors in convert_hf_to_gguf.py pointed out by CISC

  • removed if else for required parameters kv_lora_rank and qk_rope_head_dim

  • add back ggml_cont for Vcur

  • minor changes

  • removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp

  • f16 GGUF cannot run without a context length

  • made a mistake of adding back n_ctx parsing
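
The Akk/Aqk bullet above refers to replacing a custom scan-style computation with stock ggml ops. Below is a minimal, hypothetical sketch of that pattern, assuming Aqk is a per-head score matrix between the q and k projections; the helper name, tensor shapes, and clamp bounds are illustrative, not taken from the PR.

```cpp
#include "ggml.h"

// Hypothetical sketch: build an Aqk-style score matrix from generic ggml ops
// instead of a dedicated kda_scan kernel. Shapes and bounds are assumptions.
static ggml_tensor * build_aqk_sketch(
        ggml_context * ctx,
        ggml_tensor  * q,    // assumed [head_dim, n_tokens, n_head]
        ggml_tensor  * k) {  // assumed [head_dim, n_tokens, n_head]
    // ggml_mul_mat contracts over dim 0 (head_dim), so
    // scores[i, j, h] = k[:, i, h] . q[:, j, h]  ->  [n_tokens, n_tokens, n_head]
    ggml_tensor * scores = ggml_mul_mat(ctx, k, q);

    // clamp for numerical stability of the recurrence; the bounds are
    // illustrative, and a later commit ("no clamp version") drops this step
    return ggml_clamp(ctx, scores, -50.0f, 50.0f);
}
```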

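For the ggml_tensor * pair bullet, the sketch below shows the shape of such an interface in isolation: the builder returns both the layer output and the updated recurrent state, and the caller unpacks the two. Function and variable names are placeholders, not the PR's actual code.

```cpp
#include <utility>
#include "ggml.h"

// Hypothetical sketch: a KDA builder that returns {layer output, new state}
// so the caller decides how the state gets written back into the cache.
static std::pair<ggml_tensor *, ggml_tensor *> build_kda_sketch(
        ggml_context * ctx,
        ggml_tensor  * cur,        // current hidden states
        ggml_tensor  * state_in) { // recurrent state loaded from the cache
    // ... the actual KDA recurrence would be built here ...
    ggml_tensor * out       = cur;       // placeholder for the layer output
    ggml_tensor * state_out = state_in;  // placeholder for the updated state
    return {out, state_out};
}

// Caller side: unpack the pair; the new state would then be copied back into
// its cache slot (e.g. via ggml_cpy) when the graph is assembled.
static ggml_tensor * call_kda_sketch(ggml_context * ctx, ggml_tensor * cur, ggml_tensor * state) {
    auto [out, new_state] = build_kda_sketch(ctx, cur, state);
    (void) new_state; // state write-back elided in this sketch
    return out;
}
```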

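For the causal_conv1d bullet, here is a hedged sketch of what a shared helper for the q/k/v short convolutions can look like when expressed with ggml_ssm_conv. The layouts are assumptions chosen to match what ggml_ssm_conv expects, not a description of the PR's exact shapes.

```cpp
#include "ggml.h"

// Hypothetical sketch of a shared causal conv1d helper for the q/k/v branches.
// Assumed layout: x is [d_inner, n_tokens, n_seqs], conv_state holds the last
// (d_conv - 1) time steps as [d_conv - 1, d_inner, n_seqs], and conv_weight is
// [d_conv, d_inner], which is the layout ggml_ssm_conv expects.
static ggml_tensor * causal_conv1d(
        ggml_context * ctx,
        ggml_tensor  * x,
        ggml_tensor  * conv_state,
        ggml_tensor  * conv_weight) {
    // put the time dimension first so the cached state can be prepended
    ggml_tensor * xt = ggml_cont(ctx, ggml_transpose(ctx, x)); // [n_tokens, d_inner, n_seqs]

    // prepend the cached tail of the previous ubatch to keep the conv causal
    ggml_tensor * sx = ggml_concat(ctx, conv_state, xt, 0);    // [d_conv - 1 + n_tokens, d_inner, n_seqs]

    // storing the new tail of sx back into the recurrent cache is elided here
    return ggml_ssm_conv(ctx, sx, conv_weight);                // [d_inner, n_tokens, n_seqs]
}
```

Factoring this out is plain deduplication: the same prepend-state-then-convolve step runs once per q/k/v branch, only with different weights.
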
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

