kvcache-ai/ktransformers v0.2.4 on GitHub

KTransformers v0.2.4 Release Notes

We are excited to announce the official release of the long-awaited KTransformers v0.2.4!
In this version, we’ve added highly desired multi-concurrency support to the community through a major refactor of the whole architecture, updating more than 10,000 lines of code.
By drawing inspiration from the excellent architecture of sglang, we have implemented high-performance asynchronous concurrent scheduling in C++, including features like continuous batching, chunked prefill, and more. Thanks to GPU sharing in concurrent scenarios, overall throughput is also improved to a certain extent. The following is a demonstration:

v0.2.4.mp4

🚀 Key Updates

Multi-Concurrency Support
- Added capability to handle multiple concurrent inference requests. Supports receiving and executing multiple tasks simultaneously.
- We implemented custom_flashinfer based on the high-performance and highly flexible operator library flashinfer, and achieved a variable batch size CUDA Graph, which further enhances flexibility while reducing memory and padding overhead.
- In our benchmarks, overall throughput improved by approximately 130% under 4-way concurrency.
- With support from Intel, we tested KTransformers v0.2.4 on the latest Xeon6 + MRDIMM-8800 platform. By increasing concurrency, the total output throughput increased from 17 tokens/s to 40 tokens/s. We observed that the bottleneck has now shifted to the GPU. Using a higher-end GPU than the 4090D could further improve performance.
Engine Architecture Optimization

Inspired by the scheduling framework of sglang, we refactored KTransformers with a clearer three-layer architecture through an update of 11,000 lines of code, now supporting full multi-concurrency:
- Server: Handles user requests and serves the OpenAI-compatible API.
- Inference Engine: Executes model inference and supports chunked prefill.
- Scheduler: Manages task scheduling and requests orchestration. Supports continuous batching by organizing queued requests into batches in the FCFS manner and sending them to the inference engine.
Project Structure Reorganization
All C/C++ code is now centralized under the /csrc directory.
Parameter Adjustments
Removed some legacy and deprecated launch parameters for a cleaner configuration experience.
We plan to provide a complete parameter list and detailed documentation in future releases to facilitate flexible configuration and debugging.

📚 Upgrade Notes

Due to parameter changes, users who have installed previous versions are advised to delete the ~/.ktransformers directory and reinitialize.
To enable multi-concurrency, please refer to the latest documentation for configuration examples.

What's Changed

Implemented custom_flashinfer @Atream @ovowei @qiyuxinlin
Implemented balance_serve engine based on FlashInfer @qiyuxinlin @ovowei
Implemented a continuous batching scheduler in C++ @ErvinXie
release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie @qiyuxinlin @ovowei @KMSorSMS @SkqLiao

Warning

⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!