## Core Highlights
- Add RL-DPO training support to kt-sft, enabling preference-based reinforcement learning fine-tuning on top of KTransformers' MoE stack (a generic sketch of the objective follows below).
  - Includes critical PEFT adaptations and bug fixes for RL workflows.
  - Example configurations and end-to-end usage: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DPO_tutorial.md
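
For orientation, here is a minimal, generic sketch of the DPO objective that preference-based fine-tuning optimizes. This is not the kt-sft implementation; the function and tensor names are illustrative assumptions, and the per-sequence log-probabilities are assumed to be precomputed for the chosen/rejected responses under both the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Generic DPO objective (not kt-sft code): push the policy to prefer the
    chosen response over the rejected one, measured relative to a frozen
    reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy margin - reference margin)), averaged over the batch
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

In practice the policy is the PEFT-adapted model being trained while the reference model stays frozen; beta controls how strongly the policy is pulled away from the reference.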
- Improve large-scale MoE stability and efficiency (see the prefill sketch below):
  - Significantly reduce CPU memory usage during large-chunk prefill.
  - Fix Kimi-K2 MoE decode bugs related to buffer management.
  - Refine NUMA-aware buffer writing and memory-handling paths.
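
To make the large-chunk prefill item concrete, below is a heavily hedged sketch of the general chunked-prefill pattern: the prompt is fed in fixed-size chunks so transient activation buffers scale with the chunk size rather than the full prompt length. The `model.prefill` call and `past_key_values` argument are hypothetical names, not the KTransformers API, and the actual memory reduction in this release comes from internal buffer-management changes.

```python
def chunked_prefill(model, token_ids, chunk_size=4096):
    """Illustrative only: process the prompt in fixed-size chunks so per-step
    activation buffers are bounded by chunk_size instead of the prompt length.
    `model.prefill` and `past_key_values` are hypothetical names."""
    past_key_values = None
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Each call extends the KV cache with this chunk and returns the updated cache.
        past_key_values = model.prefill(chunk, past_key_values=past_key_values)
    return past_key_values
```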
## Models, Hardware & Tooling
- Model support updates:
  - Add GLM-4.6V support via refactored CPU weight conversion utilities.
  - Extend and stabilize Qwen3 / Qwen3-MoE support on NPU (Ascend), including attention, LN, MLP, cache, and expert operators.
- Deployment & installation (see the backend-detection sketch below):
  - Add Docker-based deployment support and automatic deployment workflows.
  - Improve CPU instruction-set handling (e.g., automatic BLIS detection on AMD CPUs).
  - Polish PyPI release workflows and installation instructions for smoother setup.
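
As an illustration of vendor-aware BLAS backend selection (not the project's actual detection code), here is a minimal sketch that prefers BLIS when the host CPU reports an AMD vendor ID; the function name and fallback behavior are assumptions.

```python
def prefer_blis_backend() -> bool:
    """Illustrative heuristic only: pick the BLIS BLAS backend when the CPU
    vendor is AMD; real build tooling typically probes more CPU features."""
    try:
        with open("/proc/cpuinfo") as f:
            return "AuthenticAMD" in f.read()
    except OSError:
        # Non-Linux host or unreadable cpuinfo: fall back to the default backend.
        return False
```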
## Docs & Community
- Update and polish Kimi-K2 / Kimi-K2-Thinking documentation, including installation steps, prefill strategy, and performance metrics.
- Add and refine NPU benchmarks, prerequisites, and Qwen3-NPU guides.
- Fix README assets, image links, and path issues, and reorganize the documentation structure.
## Contributors
- Thanks to all contributors who helped ship this release.
- Special thanks to @mrhaoxx and @poryfly for enabling RL-DPO support, and to all community members for kernel fixes, model adaptations, documentation, and tooling improvements.
Full Changelog: v0.4.3...v0.4.4
CC: @JimmyPeilinLi @mrhaoxx @ovowei @SkqLiao @KMSorSMS @poryfly @ouqingliang @Azure-Tang @Atream @chenht2022 @qiyuxinlin @ErvinXie @james0zan