KTransformers v0.5.3

🚀 Core Highlights

  • AVX2-Only MoE Inference Support in kt-kernel: Added AVX2-only inference support for bf16, fp8, and gptq-int4 MoE workloads, expanding deployment coverage to CPUs without AMX while preserving CPU-GPU heterogeneous inference workflows.
  • CPU Weight Conversion for GLM-5 & MiniMax-M2.5: Added new tooling to convert model weights for CPU-side deployment, making GLM-5 and MiniMax-M2.5 integration and packaging smoother in production workflows.
  • NUMA-Aware Deployment Improvements: Added explicit --numa-nodes support for finer-grained NUMA mapping in multi-socket environments, followed by fixes to improve correctness and deployment stability.
  • Lower Idle CPU Overhead & Better Runtime Behavior: Fixed worker pool / task queue idle spinning issues that could cause unnecessary 100% CPU usage when the system was idle, improving runtime efficiency for long-running services.
  • Speculative Decode Enhancements: Added more complete speculative decode support across EAGLE-3 / MTP / STANDALONE / NGRAM, with better configuration, observability, and runtime stability.
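
The AVX2-only path means kernel selection can now fall back gracefully on CPUs without AMX. A minimal sketch of that dispatch idea, assuming a Linux host; the function names and backend labels here are illustrative, not the actual kt-kernel API:

```python
def select_moe_backend(cpu_flags: set[str]) -> str:
    """Prefer AMX kernels when available; otherwise use the AVX2-only path."""
    if "amx_tile" in cpu_flags:   # AMX-capable CPU (e.g. Sapphire Rapids)
        return "amx"
    if "avx2" in cpu_flags:       # AVX2-only path added in this release
        return "avx2"
    raise RuntimeError("no supported SIMD extension for MoE inference")

def linux_cpu_flags() -> set[str]:
    """Read CPU feature flags from /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

On an AVX2-only desktop CPU, `select_moe_backend(linux_cpu_flags())` would pick the new `"avx2"` path while AMX machines keep their existing kernels.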

📌 Models, Hardware & Tooling

  • Model & conversion updates

    • Add CPU weight conversion support for GLM-5 and MiniMax-M2.5 (#1853).
    • Add utility script to merge loose layer weights into safetensors for easier packaging and deployment (#1886).
    • Improve deployment readiness for AVX2-only MoE inference on broader CPU hardware.
    • Continue compatibility improvements around expert loading and runtime integration.
  • Kernel & hardware improvements

    • Add AVX2-only MoE kernels for bf16, fp8, and gptq-int4 inference (#1892).
    • Add explicit numa_nodes parameter for deployment control on multi-socket systems (#1891).
    • Fix --numa-nodes handling in runtime configuration (#1904).
    • Improve SGLang / kt-kernel detection timing and integration behavior (#1887).
  • Speculative decoding & serving acceleration

    • Add more complete speculative decode support across EAGLE-3 / MTP / STANDALONE / NGRAM and related worker paths (#1903, #1876).
    • Improve speculative worker stability and refresh docs/examples for major model families.
  • Runtime & stability

    • Fix worker pool idle CPU usage (#1902).
    • Fix TaskQueue worker thread 100% CPU spin when idle (#1899).
    • Improve speculative worker / MoE layer stability in the serving runtime.
    • Improve overall stability for long-lived inference processes.
  • Tooling & integration

    • Add utility script to merge loose layer weights to safetensors for easier release packaging (#1886).
    • Sync bundled sglang submodule to the latest compatible revision (#1903, #1876).
    • Keep submodule-based integration aligned with the current KTransformers runtime stack.
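
Among the speculative paths listed above, NGRAM is the one that needs no draft model: it proposes continuation tokens by matching the trailing n-gram of the generated sequence against earlier context and copying what followed. A self-contained sketch of that idea, not the actual kt-kernel/SGLang implementation:

```python
def ngram_draft(tokens: list[int], n: int = 3, max_draft: int = 4) -> list[int]:
    """Propose draft tokens by finding the latest earlier occurrence of the
    trailing n-gram and copying the tokens that followed it. Returns [] when
    there is no match; the target model then decodes normally."""
    if len(tokens) < n + 1:
        return []
    tail = tokens[-n:]
    # Scan backwards over earlier positions for the same n-gram.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            follow = tokens[start + n:start + n + max_draft]
            if follow:
                return follow
    return []
```

The drafted tokens are then verified in a single forward pass of the target model, so repetitive spans (code, tables, boilerplate) decode several tokens per step at no quality cost.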

📝 Docs & Community

Documentation updates

  • Add AVX2 tutorial (EN): doc/en/kt-kernel/AVX2-Tutorial.md
  • Add AVX2 tutorial (ZH): doc/zh/AVX2-Tutorial_zh.md
  • Refresh speculative decoding docs/examples for newer model families and acceleration paths.
  • Update summary / docs navigation for the new AVX2 documentation.
  • Small README alignment updates for the latest supported capabilities.

🐛 Bug Fixes

  • Fix --numa-nodes handling (#1904).
  • Fix worker pool idle CPU overhead (#1902).
  • Fix TaskQueue idle spin causing 100% CPU usage (#1899).
  • Improve SGLang / kt-kernel detection timing (#1887).
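
The idle-spin fixes (#1899, #1902) replace busy polling with blocking waits. A generic illustration of the pattern using the Python stdlib (not the actual KTransformers worker code): the worker blocks in `queue.Queue.get`, so an idle thread sleeps in the kernel instead of burning a core at 100%.

```python
import queue
import threading

def worker(tasks: queue.Queue, results: list) -> None:
    while True:
        task = tasks.get()        # blocks while idle: no busy-wait CPU spin
        if task is None:          # sentinel: shut down cleanly
            tasks.task_done()
            return
        results.append(task * 2)  # stand-in for real inference work
        tasks.task_done()

tasks: queue.Queue = queue.Queue()
results: list = []
t = threading.Thread(target=worker, args=(tasks, results))
t.start()
for i in range(3):
    tasks.put(i)
tasks.put(None)  # request shutdown
tasks.join()     # wait until every queued item is processed
t.join()
# results == [0, 2, 4]
```

The anti-pattern being fixed is a loop of the form `while True: if not q.empty(): ...` with no blocking call, which keeps a core fully busy even when the service has no work.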

🌟 Contributors

  • Thanks to all contributors who helped ship this release.

Full Changelog: v0.5.2.post1...v0.5.3

CC: @ouqingliang @ErvinXie @chenht2022 @KMSorSMS @ovowei @SkqLiao @JimmyPeilinLi @mrhaoxx @yyj6666667 @james0zan
