Core Highlights
- Native FP8 MoE Kernel: Introducing native FP8 precision support for MoE inference with a new AVX-based kernel. Run FP8 models directly, without precision-conversion overhead, preserving the original model accuracy while maximizing hardware efficiency (a decoding sketch follows this list).
- kt-cli for Effortless Local Inference: A new CLI tool designed for simplicity and ease of use. Model management, automatic configuration, seamless chat/completions workflows, and built-in SGLang environment detection let you get started with local LLM inference in minutes.
- Enhanced Layerwise Prefill: Improved layerwise prefill performance through expert-by-expert pipelining. The layerwise prefill architecture enables efficient memory streaming during prefill, significantly improving throughput and reducing latency for long-context workloads (a pipelining sketch follows this list).
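To make the native-FP8 item concrete, here is a minimal, illustrative Python sketch of the scalar math behind decoding an FP8 (e4m3) weight byte and dequantizing a row with a per-row scale. This is not the release's AVX kernel (which operates on packed vectors in optimized C++); the function names and the NumPy usage here are assumptions for illustration only.

```python
import math
import numpy as np

def fp8_e4m3_to_float(byte: int) -> float:
    """Decode one FP8 e4m3 value: 1 sign bit, 4 exponent bits (bias 7),
    3 mantissa bits; 0x7F / 0xFF encode NaN and there are no infinities."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:                 # NaN encoding in e4m3
        return math.nan
    if exp == 0:                                  # subnormal: (man / 8) * 2**-6
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)   # normal value

def dequant_fp8_row(raw: np.ndarray, scale: float) -> np.ndarray:
    """Dequantize a row of FP8 weights with a per-row scale, as a
    native-FP8 MoE expert matmul would do on the fly (illustrative only)."""
    return scale * np.array([fp8_e4m3_to_float(int(b)) for b in raw], dtype=np.float32)

# Example: 0x38 -> exponent 0b0111 (= bias), mantissa 0 -> 1.0
assert fp8_e4m3_to_float(0x38) == 1.0
```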
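The layerwise-prefill item describes overlapping the fetch of the next expert's weights with computation on the current expert, so weights stream through memory rather than all being resident at once. The sketch below is only a schematic of that pipelining pattern, with hypothetical fetch_expert / compute_expert helpers and duck-typed layer objects; it is not KTransformers' actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def layerwise_prefill(hidden, layers, fetch_expert, compute_expert):
    """Schematic expert-by-expert pipelining for prefill.

    fetch_expert(layer, idx)  -> expert weights (I/O bound: disk/DRAM streaming)
    compute_expert(hidden, w) -> partial output  (compute bound)

    While expert i is being computed, expert i + 1 is prefetched on a
    background thread, so memory streaming overlaps with compute.
    """
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        for layer in layers:
            n_experts = layer.num_experts
            pending = prefetcher.submit(fetch_expert, layer, 0)   # prime the pipeline
            acc = None
            for i in range(n_experts):
                weights = pending.result()                        # wait for current expert
                if i + 1 < n_experts:                             # overlap the next fetch
                    pending = prefetcher.submit(fetch_expert, layer, i + 1)
                out = compute_expert(hidden, weights)
                acc = out if acc is None else acc + out
            hidden = acc
    return hidden
```

Because only one or two experts' weights need to be resident at a time under this pattern, the same streaming idea is what the reduced-DRAM note in the kernel section refers to.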
Models, Hardware & Tooling
- Model support updates
  - Extend the FP8 enablement path in this release, focusing on native FP8 MoE support and compatibility improvements.
  - Add native MiniMax-M2, MiniMax-M2.1, and DeepSeek-V3.2 support and related enablement.
- Kernel & hardware improvements
  - Add an AVX-based FP8 MoE kernel.
  - Reduce DRAM requirements for most models during CPU prefill.
  - Improve layerwise prefill for better throughput.
- Tooling & integration
  - Introduce kt-cli, a new unified CLI for model management, chat, automatic configuration, and inference server management.
- Deployment & installation
  - Refactor installation workflows/scripts for the new CLI/tooling path (including cleanup of legacy install steps).
  - Improve CPU instruction-set auto-detection (a detection sketch follows this list).
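For the auto-detection item, a best-effort sketch of how a CPU's vector-instruction support can be probed on Linux is shown below, using /proc/cpuinfo flags to pick a kernel path. The specific feature names and selection logic are illustrative assumptions, not the release's actual detection code.

```python
import platform
from pathlib import Path

def detect_cpu_features():
    """Best-effort CPU feature detection on Linux via /proc/cpuinfo.

    Returns the subset of features relevant for choosing a kernel path
    (e.g. an AVX-512 path vs. a plain AVX2 fallback).
    """
    wanted = {"avx2", "avx512f", "avx512bw", "avx512_vnni", "amx_tile", "amx_int8"}
    if platform.system() != "Linux":
        return set()  # other platforms would need a cpuid-based probe instead
    for line in Path("/proc/cpuinfo").read_text().splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return wanted & flags
    return set()

features = detect_cpu_features()
kernel = "avx512" if "avx512f" in features else "avx2" if "avx2" in features else "generic"
print(f"detected features: {sorted(features)} -> selecting {kernel} kernel path")
```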
Docs & Community
- Add MiniMax-M2.1 end-to-end tutorial.
- Refine DPO tutorial.
Contributors
- Thanks to all contributors who helped ship this release.
Full Changelog: v0.4.4...v0.5.0
CC: @ouqingliang @ErvinXie @chenht2022 @KMSorSMS @ovowei @SkqLiao @JimmyPeilinLi @mrhaoxx @james0zan