## 🚀 Core Highlights
- AVX2-Only MoE Inference Support in kt-kernel: Added AVX2-only inference support for bf16, fp8, and gptq-int4 MoE workloads, expanding deployment coverage to CPUs without AMX while preserving CPU-GPU heterogeneous inference workflows.
- CPU Weight Conversion for GLM-5 & MiniMax-M2.5: Added new tooling to convert model weights for CPU-side deployment, making GLM-5 and MiniMax-M2.5 integration and packaging smoother in production workflows.
- NUMA-Aware Deployment Improvements: Added explicit `--numa-nodes` support for finer-grained NUMA mapping in multi-socket environments, followed by fixes to improve correctness and deployment stability.
- Lower Idle CPU Overhead & Better Runtime Behavior: Fixed worker pool / task queue idle spinning issues that could cause unnecessary 100% CPU usage when the system was idle, improving runtime efficiency for long-running services.
- Speculative Decode Enhancements: Added more complete speculative decode support across EAGLE-3 / MTP / STANDALONE / NGRAM, with better configuration, observability, and runtime stability.
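The NGRAM mode listed above drafts tokens without a separate draft model, by matching the longest recent suffix of the context against an earlier occurrence and copying the tokens that followed it. A minimal stdlib-only sketch of that idea (an illustration, not kt-kernel's implementation):

```python
def ngram_draft(tokens, max_ngram=3, num_draft=4):
    """Propose draft tokens: find the longest suffix n-gram that also
    occurs earlier in the context, then copy what followed it."""
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan earlier context for the same n-gram, most recent match first.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                return tokens[i + n:i + n + num_draft]
    return []  # no match: fall back to normal one-token decoding

print(ngram_draft([5, 6, 7, 8, 5, 6, 7]))  # → [8, 5, 6, 7]
```

The drafted tokens are then verified in a single target-model forward pass, so a correct draft turns several decode steps into one.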
## 📌 Models, Hardware & Tooling
- **Model & conversion updates**
- Add CPU weight conversion support for GLM-5 and MiniMax-M2.5 (#1853).
- Add utility script to merge loose layer weights into safetensors for easier packaging and deployment (#1886).
- Improve deployment readiness for AVX2-only MoE inference on broader CPU hardware.
- Continue compatibility improvements around expert loading and runtime integration.
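Deploying on AVX2-only hosts typically starts with a runtime capability check before selecting a kernel path. A minimal, Linux-only sketch (this is an illustration, not kt-kernel's actual detection code):

```python
def has_avx2(cpuinfo_text):
    """Return True if the 'flags' line in /proc/cpuinfo-style text
    lists the avx2 feature bit."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "avx2" in line.split(":", 1)[1].split()
    return False

# On Linux, check the running CPU before choosing an AVX2-only kernel path:
# with open("/proc/cpuinfo") as f:
#     print(has_avx2(f.read()))
```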
- **Kernel & hardware improvements**
- Add AVX2-only MoE kernels for bf16, fp8, and gptq-int4 inference (#1892).
- Add explicit `numa_nodes` parameter for deployment control on multi-socket systems (#1891).
- Fix `--numa-nodes` handling in runtime configuration (#1904).
- Improve SGLang / kt-kernel detection timing and integration behavior (#1887).
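One way to picture the `numa_nodes` parameter is as a worker-to-node mapping, so that memory allocation and thread pinning can stay node-local on multi-socket machines. A hypothetical round-robin sketch (the helper name and policy are assumptions, not kt-kernel's real mapping logic):

```python
def assign_numa_nodes(num_workers, numa_nodes):
    """Round-robin workers across the NUMA nodes given on the command
    line (e.g. --numa-nodes 0,1). Hypothetical helper for illustration."""
    nodes = [int(n) for n in numa_nodes.split(",")]
    return {worker: nodes[worker % len(nodes)] for worker in range(num_workers)}

print(assign_numa_nodes(4, "0,1"))  # → {0: 0, 1: 1, 2: 0, 3: 1}
```

Keeping each worker's allocations on its assigned node avoids cross-socket memory traffic, which is the usual motivation for exposing the node list explicitly.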
- **Speculative decoding & serving acceleration**
- **Runtime & stability**
- **Tooling & integration**
## 📝 Docs & Community
**Documentation updates**
- Add AVX2 tutorial (EN): `doc/en/kt-kernel/AVX2-Tutorial.md`
- Add AVX2 tutorial (ZH): `doc/zh/AVX2-Tutorial_zh.md`
- Refresh speculative decoding docs/examples for newer model families and acceleration paths.
- Update summary / docs navigation for the new AVX2 documentation.
- Small README alignment updates for the latest supported capabilities.
## 🐛 Bug Fixes
- Fix `--numa-nodes` handling (#1904).
- Fix worker pool idle CPU overhead (#1902).
- Fix TaskQueue idle spin causing 100% CPU usage (#1899).
- Improve SGLang / kt-kernel detection timing (#1887).
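The idle-spin fixes follow a standard pattern: replace a polling loop with a blocking wait, so an idle worker sleeps in the kernel instead of burning a core. A stdlib-only sketch of that pattern (not the actual kt-kernel worker pool):

```python
import queue
import threading

def worker(tasks, stop):
    """Block on the queue with a timeout instead of spinning; the
    timeout exists only so the stop flag is re-checked periodically."""
    while not stop.is_set():
        try:
            task = tasks.get(timeout=0.1)  # sleeps while the queue is empty
        except queue.Empty:
            continue  # woke up only to re-check the stop flag
        task()
        tasks.task_done()

tasks, stop = queue.Queue(), threading.Event()
t = threading.Thread(target=worker, args=(tasks, stop))
t.start()
results = []
tasks.put(lambda: results.append("done"))
tasks.join()  # returns once the task has been fully processed
stop.set()
t.join()
print(results)  # → ['done']
```

A spinning loop that calls a non-blocking `get` in a tight `while True` shows up as 100% CPU on an idle service; the blocking `get(timeout=...)` above is the usual cure.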
## 🌟 Contributors
- Thanks to all contributors who helped ship this release.
Full Changelog: v0.5.2.post1...v0.5.3
CC: @ouqingliang @ErvinXie @chenht2022 @KMSorSMS @ovowei @SkqLiao @JimmyPeilinLi @mrhaoxx @yyj6666667 @james0zan