# Balance Serve backend (multi-concurrency) for ktransformers
## KTransformers v0.2.4 Release Notes
We are excited to announce the official release of the long-awaited **KTransformers v0.2.4**!
In this version, we have added the community's highly requested **multi-concurrency** support through a major refactor of the entire architecture, updating more than 10,000 lines of code.
By drawing inspiration from the excellent architecture of sglang, we have implemented high-performance asynchronous concurrent scheduling in C++, including features like continuous batching, chunked prefill, and more. Because concurrent requests share the GPU, overall throughput also improves to a certain extent. The following is a demonstration:
- Added the capability to handle multiple concurrent inference requests, receiving and executing multiple tasks simultaneously.
- We implemented [custom_flashinfer](https://github.com/kvcache-ai/custom_flashinfer/tree/fix-precision-mla-merge-main) on top of the high-performance and highly flexible operator library [flashinfer](https://github.com/flashinfer-ai/flashinfer/), and achieved a variable batch size CUDA Graph, which further enhances flexibility while reducing memory and padding overhead; a short sketch of the fixed-shape baseline this avoids follows this list.
- In our benchmarks, overall throughput improved by approximately 130% under 4-way concurrency.
- With support from Intel, we tested KTransformers v0.2.4 on the latest Xeon6 + MRDIMM-8800 platform. By increasing concurrency, the total output throughput increased from 17 tokens/s to 40 tokens/s. We observed that the bottleneck has now shifted to the GPU. Using a higher-end GPU than the 4090D could further improve performance.
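For context on the "padding overhead" mentioned above: a plain CUDA Graph replays one fixed input shape, so a baseline serving loop has to pad every batch up to the captured size. The PyTorch snippet below is a minimal, hypothetical sketch of that fixed-shape baseline (it is not custom_flashinfer's implementation); the variable-batch-size graphs in v0.2.4 avoid this padding.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
MAX_BS = 8
static_in = torch.zeros(MAX_BS, 1024, device="cuda")

# warm-up on a side stream, as required before CUDA Graph capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_out = model(static_in)
torch.cuda.current_stream().wait_stream(s)

# a plain graph is captured for one fixed batch size (MAX_BS)...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_in)

def forward(batch: torch.Tensor) -> torch.Tensor:
    # ...so every smaller batch is padded up to MAX_BS before replay;
    # the wasted rows are the padding overhead that a variable-batch-size
    # graph does not pay.
    n = batch.shape[0]
    static_in.zero_()
    static_in[:n].copy_(batch)
    graph.replay()
    return static_out[:n].clone()

print(forward(torch.randn(3, 1024, device="cuda")).shape)  # torch.Size([3, 1024])
```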
Inspired by the scheduling framework of sglang, we refactored KTransformers with a clearer three-layer architecture through an update of 11,000 lines of code, now supporting full multi-concurrency:
- Server: Handles user requests and serves the OpenAI-compatible API.
- Inference Engine: Executes model inference and supports chunked prefill.
- Scheduler: Manages task scheduling and request orchestration. Supports continuous batching by organizing queued requests into batches in a first-come-first-served (FCFS) manner and sending them to the inference engine (a simplified sketch of this loop follows the list).
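The loop below is a simplified Python sketch of that FCFS continuous-batching behavior, included only to illustrate the scheduling policy; the real scheduler is the C++ implementation under `/csrc`, and the `Request` fields, chunk size, and batch size here are illustrative placeholders.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list                          # prompt tokens not yet prefilled
    max_new_tokens: int
    output: list = field(default_factory=list)

    def done(self) -> bool:
        return not self.prompt and len(self.output) >= self.max_new_tokens

def serve(requests, max_batch=4, chunk=256):
    waiting, running = deque(requests), []
    while waiting or running:
        # continuous batching: top up free slots from the FCFS queue every step
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # build one mixed batch of chunked-prefill and decode work
        batch = []
        for req in running:
            if req.prompt:                           # chunked prefill
                batch.append((req, req.prompt[:chunk]))
            else:                                    # decode a single token
                batch.append((req, req.output[-1:]))

        # a real engine would run the model here; we just fake one step of progress
        for req, tokens in batch:
            if req.prompt:
                req.prompt = req.prompt[len(tokens):]
            else:
                req.output.append(0)                 # placeholder token id

        # retire finished requests immediately so new ones can join the next step
        running = [r for r in running if not r.done()]

# e.g. serve([Request(prompt=list(range(1000)), max_new_tokens=8) for _ in range(10)])
```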
3. Project Structure Reorganization
All C/C++ code is now centralized under the `/csrc` directory.
4. Parameter Adjustments
Removed some legacy and deprecated launch parameters for a cleaner configuration experience.
We plan to provide a complete parameter list and detailed documentation in future releases to facilitate flexible configuration and debugging.
### 📚 Upgrade Notes
- Due to parameter changes, users who have installed previous versions are advised to delete the `~/.ktransformers` directory and reinitialize.
- To enable multi-concurrency, please refer to the latest documentation for configuration examples.
- Implemented **balance_serve** engine based on **FlashInfer** @qiyuxinlin @ovowei
- Implemented a **continuous batching** scheduler in C++ @ErvinXie
- release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie @qiyuxinlin @ovowei @KMSorSMS @SkqLiao
## Installation Guide
⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!
### 1. Set Up Conda Environment
We recommend using Miniconda3/Anaconda3 for environment management:
We provide a server script that supports multi-concurrency in v0.2.4. It features the following arguments:
- `--backend_type`: `balance_serve` is the multi-concurrency backend engine introduced in v0.2.4. The original single-concurrency engine is `ktransformers`.
### 2. Access Server
```
# example request; adjust the "model" field to the model you are serving
curl -X POST http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-V2-Lite-Chat", "messages": [{"role": "user", "content": "hello"}]}'
```
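Because the endpoint is OpenAI-compatible, you can also call it from Python with the official `openai` client. The snippet below is a minimal sketch: the model name is a placeholder for whatever model you launched the server with, and the API key is a dummy value on the assumption that the local server does not validate it.

```python
from openai import OpenAI

# point the standard OpenAI client at the local ktransformers server
client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="DeepSeek-V2-Lite-Chat",  # placeholder: the model you launched the server with
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```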
- For Linux
```shell
# USE_NUMA=1 is intended for dual-socket machines; omit it on a single-socket system
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
```
- For Windows (Windows native temporarily deprecated, please try WSL)
```shell
install.bat
```
* If you are a developer, you can use the makefile to compile and format the code. <br> The detailed usage of the makefile is documented [here](./makefile_usage.md).
- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat", which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). If you already have local files, you can use that path directly to initialize the model.