KTransformers v0.5.3

🚀 Core Highlights

  • AVX2-Only MoE Inference Support in kt-kernel: Added AVX2-only inference support for bf16, fp8, and gptq-int4 MoE workloads, expanding deployment coverage to CPUs without AMX while preserving CPU-GPU heterogeneous inference workflows.
  • CPU Weight Conversion for GLM-5 & MiniMax-M2.5: Added new tooling to convert model weights for CPU-side deployment, making GLM-5 and MiniMax-M2.5 integration and packaging smoother in production workflows.
  • NUMA-Aware Deployment Improvements: Added explicit --numa-nodes support for finer-grained NUMA mapping in multi-socket environments, followed by fixes to improve correctness and deployment stability.
  • Lower Idle CPU Overhead & Better Runtime Behavior: Fixed worker pool / task queue idle spinning issues that could cause unnecessary 100% CPU usage when the system was idle, improving runtime efficiency for long-running services.
  • Speculative Decode Enhancements: Added more complete speculative decode support across EAGLE-3 / MTP / STANDALONE / NGRAM, with better configuration, observability, and runtime stability.
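
The AVX2-only path means kernel selection can now fall back gracefully on CPUs without AMX. A minimal sketch of that dispatch idea, assuming a Linux host; the function names and backend labels here are illustrative, not the actual kt-kernel API:

```python
def select_moe_backend(cpu_flags: set[str]) -> str:
    """Prefer AMX kernels when available; otherwise use the AVX2-only path."""
    if "amx_tile" in cpu_flags:   # AMX-capable CPU (e.g. Sapphire Rapids)
        return "amx"
    if "avx2" in cpu_flags:       # AVX2-only path added in this release
        return "avx2"
    raise RuntimeError("no supported SIMD extension for MoE inference")

def linux_cpu_flags() -> set[str]:
    """Read CPU feature flags from /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

On an AVX2-only desktop CPU, `select_moe_backend(linux_cpu_flags())` would pick the new `"avx2"` path while AMX machines keep their existing kernels.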

📌 Models, Hardware & Tooling

  • Model & conversion updates

    • Add CPU weight conversion support for GLM-5 and MiniMax-M2.5 (#1853).
    • Add utility script to merge loose layer weights into safetensors for easier packaging and deployment (#1886).
    • Improve deployment readiness for AVX2-only MoE inference on broader CPU hardware.
    • Continue compatibility improvements around expert loading and runtime integration.
  • Kernel & hardware improvements

    • Add AVX2-only MoE kernels for bf16, fp8, and gptq-int4 inference (#1892).
    • Add explicit numa_nodes parameter for deployment control on multi-socket systems (#1891).
    • Fix --numa-nodes handling in runtime configuration (#1904).
    • Improve SGLang / kt-kernel detection timing and integration behavior (#1887).
  • Speculative decoding & serving acceleration

    • Add more complete speculative decode support across EAGLE-3 / MTP / STANDALONE / NGRAM and related worker paths (#1903, #1876).
    • Improve speculative worker stability and refresh docs/examples for major model families.
  • Runtime & stability

    • Fix worker pool idle CPU usage (#1902).
    • Fix TaskQueue worker thread 100% CPU spin when idle (#1899).
    • Improve speculative worker / MoE layer stability in the serving runtime.
    • Improve overall stability for long-lived inference processes.
  • Tooling & integration

    • Add utility script to merge loose layer weights to safetensors for easier release packaging (#1886).
    • Sync bundled sglang submodule to the latest compatible revision (#1903, #1876).
    • Keep submodule-based integration aligned with the current KTransformers runtime stack.
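
Among the speculative paths listed above, NGRAM is the one that needs no draft model: it proposes continuation tokens by matching the trailing n-gram of the generated sequence against earlier context and copying what followed. A self-contained sketch of that idea, not the actual kt-kernel/SGLang implementation:

```python
def ngram_draft(tokens: list[int], n: int = 3, max_draft: int = 4) -> list[int]:
    """Propose draft tokens by finding the latest earlier occurrence of the
    trailing n-gram and copying the tokens that followed it. Returns [] when
    there is no match; the target model then decodes normally."""
    if len(tokens) < n + 1:
        return []
    tail = tokens[-n:]
    # Scan backwards over earlier positions for the same n-gram.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            follow = tokens[start + n:start + n + max_draft]
            if follow:
                return follow
    return []
```

The drafted tokens are then verified in a single forward pass of the target model, so repetitive spans (code, tables, boilerplate) decode several tokens per step at no quality cost.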

📝 Docs & Community

Documentation updates

  • Add AVX2 tutorial (EN): doc/en/kt-kernel/AVX2-Tutorial.md
  • Add AVX2 tutorial (ZH): doc/zh/AVX2-Tutorial_zh.md
  • Refresh speculative decoding docs/examples for newer model families and acceleration paths.
  • Update summary / docs navigation for the new AVX2 documentation.
  • Small README alignment updates for the latest supported capabilities.

🐛 Bug Fixes

  • Fix --numa-nodes handling (#1904).
  • Fix worker pool idle CPU overhead (#1902).
  • Fix TaskQueue idle spin causing 100% CPU usage (#1899).
  • Improve SGLang / kt-kernel detection timing (#1887).
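
The idle-spin fixes (#1899, #1902) replace busy polling with blocking waits. A generic illustration of the pattern using the Python stdlib (not the actual KTransformers worker code): the worker blocks in `queue.Queue.get`, so an idle thread sleeps in the kernel instead of burning a core at 100%.

```python
import queue
import threading

def worker(tasks: queue.Queue, results: list) -> None:
    while True:
        task = tasks.get()        # blocks while idle: no busy-wait CPU spin
        if task is None:          # sentinel: shut down cleanly
            tasks.task_done()
            return
        results.append(task * 2)  # stand-in for real inference work
        tasks.task_done()

tasks: queue.Queue = queue.Queue()
results: list = []
t = threading.Thread(target=worker, args=(tasks, results))
t.start()
for i in range(3):
    tasks.put(i)
tasks.put(None)  # request shutdown
tasks.join()     # wait until every queued item is processed
t.join()
# results == [0, 2, 4]
```

The anti-pattern being fixed is a loop of the form `while True: if not q.empty(): ...` with no blocking call, which keeps a core fully busy even when the service has no work.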

🌟 Contributors

  • Thanks to all contributors who helped ship this release.

Full Changelog: v0.5.2.post1...v0.5.3

CC: @ouqingliang @ErvinXie @chenht2022 @KMSorSMS @ovowei @SkqLiao @JimmyPeilinLi @mrhaoxx @yyj6666667 @james0zan
