Core Highlights
- Native FP8 MoE Kernel: Introducing native FP8 precision support for MoE inference with a new AVX-based kernel. Run FP8 models directly, without precision-conversion overhead, preserving the original model accuracy while maximizing hardware efficiency (a decoding sketch follows this list).
- kt-cli for Effortless Local Inference: A new CLI tool designed for simplicity and ease of use. Model management, automatic configuration, seamless chat/completions workflows, and built-in SGLang environment detection let you get started with local LLM inference in minutes.
- Enhanced Layerwise Prefill: Improved layerwise prefill performance through expert-by-expert pipelining. The layerwise prefill architecture enables efficient memory streaming during prefill, significantly improving throughput and reducing latency for long-context workloads (a pipelining sketch follows this list).
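To make the native-FP8 item concrete, here is a minimal, illustrative Python sketch of the scalar math behind decoding an FP8 (e4m3) weight byte and dequantizing a row with a per-row scale. This is not the release's AVX kernel (which operates on packed vectors in optimized C++); the function names and the NumPy usage here are assumptions for illustration only.

```python
import math
import numpy as np

def fp8_e4m3_to_float(byte: int) -> float:
    """Decode one FP8 e4m3 value: 1 sign bit, 4 exponent bits (bias 7),
    3 mantissa bits; 0x7F / 0xFF encode NaN and there are no infinities."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:                 # NaN encoding in e4m3
        return math.nan
    if exp == 0:                                  # subnormal: (man / 8) * 2**-6
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)   # normal value

def dequant_fp8_row(raw: np.ndarray, scale: float) -> np.ndarray:
    """Dequantize a row of FP8 weights with a per-row scale, as a
    native-FP8 MoE expert matmul would do on the fly (illustrative only)."""
    return scale * np.array([fp8_e4m3_to_float(int(b)) for b in raw], dtype=np.float32)

# Example: 0x38 -> exponent 0b0111 (= bias), mantissa 0 -> 1.0
assert fp8_e4m3_to_float(0x38) == 1.0
```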
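The layerwise-prefill item describes overlapping the fetch of the next expert's weights with computation on the current expert, so weights stream through memory rather than all being resident at once. The sketch below is only a schematic of that pipelining pattern, with hypothetical fetch_expert / compute_expert helpers and duck-typed layer objects; it is not KTransformers' actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def layerwise_prefill(hidden, layers, fetch_expert, compute_expert):
    """Schematic expert-by-expert pipelining for prefill.

    fetch_expert(layer, idx)  -> expert weights (I/O bound: disk/DRAM streaming)
    compute_expert(hidden, w) -> partial output  (compute bound)

    While expert i is being computed, expert i + 1 is prefetched on a
    background thread, so memory streaming overlaps with compute.
    """
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        for layer in layers:
            n_experts = layer.num_experts
            pending = prefetcher.submit(fetch_expert, layer, 0)   # prime the pipeline
            acc = None
            for i in range(n_experts):
                weights = pending.result()                        # wait for current expert
                if i + 1 < n_experts:                             # overlap the next fetch
                    pending = prefetcher.submit(fetch_expert, layer, i + 1)
                out = compute_expert(hidden, weights)
                acc = out if acc is None else acc + out
            hidden = acc
    return hidden
```

Because only one or two experts' weights need to be resident at a time under this pattern, the same streaming idea is what the reduced-DRAM note in the kernel section refers to.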
Models, Hardware & Tooling
- Model support updates
  - Extend the FP8 enablement path in this release, focusing on native FP8 MoE support and compatibility improvements.
  - Add native MiniMax-M2, MiniMax-M2.1, and DeepSeek-V3.2 support and related enablement.
- Kernel & hardware improvements
  - Add an AVX-based FP8 MoE kernel.
  - Reduce DRAM requirements for most models during CPU prefill.
  - Improve layerwise prefill for better throughput.
- Tooling & integration
  - Introduce kt-cli, a new unified CLI for model management, chat, automatic configuration, and inference server management.
- Deployment & installation
  - Refactor installation workflows/scripts for the new CLI/tooling path (including cleanup of legacy install steps).
  - Improve CPU instruction-set auto-detection (a detection sketch follows this list).
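For the auto-detection item, a best-effort sketch of how a CPU's vector-instruction support can be probed on Linux is shown below, using /proc/cpuinfo flags to pick a kernel path. The specific feature names and selection logic are illustrative assumptions, not the release's actual detection code.

```python
import platform
from pathlib import Path

def detect_cpu_features():
    """Best-effort CPU feature detection on Linux via /proc/cpuinfo.

    Returns the subset of features relevant for choosing a kernel path
    (e.g. an AVX-512 path vs. a plain AVX2 fallback).
    """
    wanted = {"avx2", "avx512f", "avx512bw", "avx512_vnni", "amx_tile", "amx_int8"}
    if platform.system() != "Linux":
        return set()  # other platforms would need a cpuid-based probe instead
    for line in Path("/proc/cpuinfo").read_text().splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return wanted & flags
    return set()

features = detect_cpu_features()
kernel = "avx512" if "avx512f" in features else "avx2" if "avx2" in features else "generic"
print(f"detected features: {sorted(features)} -> selecting {kernel} kernel path")
```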
Docs & Community
- Add MiniMax-M2.1 end-to-end tutorial.
- Refine DPO tutorial.
Contributors
- Thanks to all contributors who helped ship this release.
Full Changelog: v0.4.4...v0.5.0
CC: @ouqingliang @ErvinXie @chenht2022 @KMSorSMS @ovowei @SkqLiao @JimmyPeilinLi @mrhaoxx @james0zan