🚀 Core Highlights
- Optimized CPU-GPU Expert Scheduling: Introducing a flexible GPU expert mask system that enables intelligent placement of MoE experts across CPU and GPU. The new scheduler supports multiple placement strategies (frequency-based, uniform, front-loading, random) and dynamic expert updates during inference, improving throughput by up to 30% at lower GPU expert ratios.
- Native Precision MoE Support with CI: Expanded native precision support for FP8 and BF16 MoE models. Run Qwen3-BF16, GLM-4.7, GLM-4.7-FP8 and more models directly in their native precision without conversion overhead, now with comprehensive CI coverage.
- Unified Fine-tuning & Inference Pipeline: New end-to-end tutorial for cost-effective large model fine-tuning and inference using AutoDL cloud infrastructure. Complete the full LoRA fine-tuning and inference loop for models from 14B to 235B with minimal GPU resources.
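As a rough sketch of how the new scheduling options fit into a launch command (the launcher name and model path below are placeholders, not the real CLI; only the two `--kt-*` flags come from this release):

```shell
# Sketch only: "ktransformers-launcher" and the model path are placeholders.
# Global ratio: keep ~60% of each MoE layer's experts resident on GPU.
ktransformers-launcher --model-path /path/to/model \
    --kt-gpu-experts-ratio 0.6

# Or pin an exact per-layer GPU expert count instead of a global ratio.
ktransformers-launcher --model-path /path/to/model \
    --kt-num-gpu-experts 32
```

The two flags are alternatives: the ratio form scales with the model's expert count, while the per-layer count gives exact control over GPU memory use.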
📌 Models, Hardware & Tooling
- Model support updates
- Add native precision support for MiniMax-M2, MiniMax-M2.1, MiMo, DeepSeek-V3.2, GLM-4.7-FP8.
- Extend FP8 and BF16 MoE enablement path with CI validation.
- Kernel & hardware improvements
- Introduce GPU expert mask system for flexible per-layer expert placement control.
- Add dual-stream CPU-GPU parallel optimization to hide CPU overhead when experts are fully on GPU.
- Implement dynamic expert update for runtime adaptive optimization during layerwise prefill.
- New parameters: `--kt-num-gpu-experts` (per-layer expert count) and `--kt-gpu-experts-ratio` (global ratio, 0.0-1.0).
- Add expert placement strategies: frequency, uniform, front-loading, random.
- Tooling & integration
- Add inference statistics and analysis functionality for GPU expert hit rate monitoring.
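To make the placement strategies and hit-rate monitoring concrete, here is a minimal illustrative sketch. The function names, signatures, and exact selection logic are assumptions for illustration only and do not reflect the actual kt-kernel API; only the four strategy names and the GPU-hit-rate concept come from this release.

```python
import random

def gpu_expert_mask(num_experts, num_gpu_experts, strategy="frequency",
                    freqs=None, seed=0):
    """Build a per-layer boolean mask; mask[i] is True if expert i sits on GPU.

    Hypothetical helper: the real kt-kernel mask system is configured via
    CLI flags, not this function.
    """
    if strategy == "frequency":
        # Place the most frequently routed experts on GPU
        # (requires recorded activation statistics).
        order = sorted(range(num_experts), key=lambda i: -freqs[i])
        chosen = order[:num_gpu_experts]
    elif strategy == "uniform":
        # Spread GPU-resident experts evenly across the expert index range.
        step = num_experts / num_gpu_experts
        chosen = [int(i * step) for i in range(num_gpu_experts)]
    elif strategy == "front-loading":
        # Keep the first experts of each layer on GPU.
        chosen = range(num_gpu_experts)
    elif strategy == "random":
        chosen = random.Random(seed).sample(range(num_experts), num_gpu_experts)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    mask = [False] * num_experts
    for i in chosen:
        mask[i] = True
    return mask

def gpu_hit_rate(mask, routed_expert_ids):
    """Fraction of routing decisions that landed on a GPU-resident expert."""
    return sum(mask[e] for e in routed_expert_ids) / len(routed_expert_ids)
```

In this picture, dynamic expert update amounts to recomputing the mask from fresh routing statistics between layers during layerwise prefill, and the hit rate is the statistic one would monitor to judge whether a placement is working.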
📝 Docs & Community
- Add a [CPU-GPU Expert Scheduling tutorial](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/experts-sched-Tutorial.md).
- Add an [AutoDL unified fine-tuning and inference tutorial](https://github.com/kvcache-ai/ktransformers/blob/main/doc/zh/【云端低价训推】%20KTransformers+AutoDL+LlamaFactory:随用随租的低成本超大模型「微调+推理」一体化流程.md) (in Chinese; roughly "Low-cost cloud training and inference: a rent-on-demand KTransformers + AutoDL + LlamaFactory pipeline for fine-tuning and serving very large models").
🐛 Bug Fixes
- Fix environment mismatch issues in AutoDL community image for fine-tuning and inference.
- Fix various stability issues in kt-kernel.
- Improve error handling and logging for expert distribution recording.
🌟 Contributors
- Thanks to all contributors who helped ship this release.
Full Changelog: v0.5.0...v0.5.1
CC: @ouqingliang @ErvinXie @chenht2022 @KMSorSMS @ovowei @SkqLiao @JimmyPeilinLi @mrhaoxx @james0zan