🚀 Core Highlights
- Optimized CPU-GPU Expert Scheduling: Introducing a flexible GPU expert mask system that enables intelligent placement of MoE experts across CPU and GPU. The new scheduler supports multiple placement strategies (frequency-based, uniform, front-loading, random) and dynamic expert updates during inference, improving throughput by up to 30% at lower GPU expert ratios.
- Native Precision MoE Support with CI: Expanded native precision support for FP8 and BF16 MoE models. Run Qwen3-BF16, GLM-4.7, GLM-4.7-FP8 and more models directly in their native precision without conversion overhead, now with comprehensive CI coverage.
- Unified Fine-tuning & Inference Pipeline: New end-to-end tutorial for cost-effective large model fine-tuning and inference using AutoDL cloud infrastructure. Complete the full LoRA fine-tuning and inference loop for models from 14B to 235B with minimal GPU resources.
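As a rough sketch of how the new scheduling options fit into a launch command (the launcher name and model path below are placeholders, not the real CLI; only the two `--kt-*` flags come from this release):

```shell
# Sketch only: "ktransformers-launcher" and the model path are placeholders.
# Global ratio: keep ~60% of each MoE layer's experts resident on GPU.
ktransformers-launcher --model-path /path/to/model \
    --kt-gpu-experts-ratio 0.6

# Or pin an exact per-layer GPU expert count instead of a global ratio.
ktransformers-launcher --model-path /path/to/model \
    --kt-num-gpu-experts 32
```

The two flags are alternatives: the ratio form scales with the model's expert count, while the per-layer count gives exact control over GPU memory use.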
📌 Models, Hardware & Tooling
- Model support updates
- Add native precision support for MiniMax-M2, MiniMax-M2.1, MiMo, DeepSeek-V3.2, GLM-4.7-FP8.
- Extend FP8 and BF16 MoE enablement path with CI validation.
- Kernel & hardware improvements
- Introduce GPU expert mask system for flexible per-layer expert placement control.
- Add dual-stream CPU-GPU parallel optimization to hide CPU overhead when experts are fully on GPU.
- Implement dynamic expert update for runtime adaptive optimization during layerwise prefill.
- New parameters: `--kt-num-gpu-experts` (per-layer expert count) and `--kt-gpu-experts-ratio` (global ratio, 0.0-1.0).
- Add expert placement strategies: frequency, uniform, front-loading, random.
- Tooling & integration
- Add inference statistics and analysis functionality for GPU expert hit rate monitoring.
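To make the placement strategies and hit-rate monitoring concrete, here is a minimal illustrative sketch. The function names, signatures, and exact selection logic are assumptions for illustration only and do not reflect the actual kt-kernel API; only the four strategy names and the GPU-hit-rate concept come from this release.

```python
import random

def gpu_expert_mask(num_experts, num_gpu_experts, strategy="frequency",
                    freqs=None, seed=0):
    """Build a per-layer boolean mask; mask[i] is True if expert i sits on GPU.

    Hypothetical helper: the real kt-kernel mask system is configured via
    CLI flags, not this function.
    """
    if strategy == "frequency":
        # Place the most frequently routed experts on GPU
        # (requires recorded activation statistics).
        order = sorted(range(num_experts), key=lambda i: -freqs[i])
        chosen = order[:num_gpu_experts]
    elif strategy == "uniform":
        # Spread GPU-resident experts evenly across the expert index range.
        step = num_experts / num_gpu_experts
        chosen = [int(i * step) for i in range(num_gpu_experts)]
    elif strategy == "front-loading":
        # Keep the first experts of each layer on GPU.
        chosen = range(num_gpu_experts)
    elif strategy == "random":
        chosen = random.Random(seed).sample(range(num_experts), num_gpu_experts)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    mask = [False] * num_experts
    for i in chosen:
        mask[i] = True
    return mask

def gpu_hit_rate(mask, routed_expert_ids):
    """Fraction of routing decisions that landed on a GPU-resident expert."""
    return sum(mask[e] for e in routed_expert_ids) / len(routed_expert_ids)
```

In this picture, dynamic expert update amounts to recomputing the mask from fresh routing statistics between layers during layerwise prefill, and the hit rate is the statistic one would monitor to judge whether a placement is working.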
📝 Docs & Community
- Add a [CPU-GPU Expert Scheduling tutorial](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/experts-sched-Tutorial.md).
- Add an [AutoDL unified fine-tuning and inference tutorial](https://github.com/kvcache-ai/ktransformers/blob/main/doc/zh/【云端低价训推】%20KTransformers+AutoDL+LlamaFactory:随用随租的低成本超大模型「微调+推理」一体化流程.md) (in Chinese; roughly "Low-cost cloud training and inference: a rent-on-demand KTransformers + AutoDL + LlamaFactory pipeline for fine-tuning and serving very large models").
🐛 Bug Fixes
- Fix environment mismatch issues in AutoDL community image for fine-tuning and inference.
- Fix various stability issues in kt-kernel.
- Improve error handling and logging for expert distribution recording.
🌟 Contributors
- Thanks to all contributors who helped ship this release.
Full Changelog: v0.5.0...v0.5.1
CC: @ouqingliang @ErvinXie @chenht2022 @KMSorSMS @ovowei @SkqLiao @JimmyPeilinLi @mrhaoxx @james0zan