# Balance Serve backend (multi-concurrency) for ktransformers

## KTransformers v0.2.4 Release Notes
We are excited to announce the official release of the long-awaited **KTransformers v0.2.4**!
In this version, we've added the community's most-requested feature, **multi-concurrency** support, through a major refactor of the whole architecture that touches more than 10,000 lines of code.
Drawing inspiration from the excellent architecture of sglang, we implemented high-performance asynchronous concurrent scheduling in C++, including continuous batching, chunked prefill, and more. Because concurrent requests share the GPU, overall throughput also improves. The following is a demonstration:

https://github.com/user-attachments/assets/faa3bda2-928b-45a7-b44f-21e12ec84b8a

### 🚀 Key Updates
1. Multi-Concurrency Support
   - Added the capability to handle multiple concurrent inference requests: the server can receive and execute multiple tasks simultaneously.
   - We implemented [custom_flashinfer](https://github.com/kvcache-ai/custom_flashinfer/tree/fix-precision-mla-merge-main) based on the high-performance and highly flexible operator library [flashinfer](https://github.com/flashinfer-ai/flashinfer/), and achieved a variable batch size CUDA Graph, which further enhances flexibility while reducing memory and padding overhead.
   - In our benchmarks, overall throughput improved by approximately 130% under 4-way concurrency.
   - With support from Intel, we tested KTransformers v0.2.4 on the latest Xeon6 + MRDIMM-8800 platform. By increasing concurrency, the total output throughput increased from 17 tokens/s to 40 tokens/s. We observed that the bottleneck has now shifted to the GPU. Using a higher-end GPU than the 4090D could further improve performance.
2. Engine Architecture Optimization
   ![image](https://github.com/user-attachments/assets/f5f001fa-dca7-4377-a01a-32192902aa47)
   Inspired by the scheduling framework of sglang, we refactored KTransformers into a clearer three-layer architecture in an 11,000-line update; it now supports full multi-concurrency:
   - Server: Handles user requests and serves the OpenAI-compatible API.
   - Inference Engine: Executes model inference and supports chunked prefill.
   - Scheduler: Manages task scheduling and request orchestration. Supports continuous batching by organizing queued requests into batches in a first-come-first-served (FCFS) manner and sending them to the inference engine.
3. Project Structure Reorganization
   All C/C++ code is now centralized under the `/csrc` directory.
4. Parameter Adjustments
   Removed some legacy and deprecated launch parameters for a cleaner configuration experience.
   We plan to provide a complete parameter list and detailed documentation in future releases to facilitate flexible configuration and debugging.

### 📚 Upgrade Notes

- Due to parameter changes, users who installed a previous version are advised to delete the `~/.ktransformers` directory and reinitialize (see the cleanup command below).
- To enable multi-concurrency, please refer to the latest documentation for configuration examples.
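
A minimal cleanup sketch, assuming the configuration lives in the default `~/.ktransformers` location; the directory is recreated with the new defaults on the next launch:

```bash
# Remove the legacy KTransformers configuration directory so that
# v0.2.4 regenerates it with the new parameter scheme
rm -rf ~/.ktransformers
```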

### What's Changed
- Implemented **custom_flashinfer** @Atream @ovowei @qiyuxinlin
- Implemented the **balance_serve** engine based on **FlashInfer** @qiyuxinlin @ovowei
- Implemented a **continuous batching** scheduler in C++ @ErvinXie
- Release: bump version to v0.2.4 by @Atream @Azure-Tang @ErvinXie @qiyuxinlin @ovowei @KMSorSMS @SkqLiao

## Installation Guide
⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!

### 1. Set Up Conda Environment

We recommend using Miniconda3/Anaconda3 for environment management:

```bash
# Download and run the Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Create environment
conda create --name ktransformers python=3.11
conda activate ktransformers

# Install required libraries
conda install -c conda-forge libstdcxx-ng

# Verify GLIBCXX version (should include 3.4.32)
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
```

> **Note:** Adjust the Anaconda path if your installation directory differs from `~/anaconda3`

### 2. Install Dependencies

```bash
sudo apt install libtbb-dev libssl-dev libcurl4-openssl-dev libaio1 libaio-dev libfmt-dev libgflags-dev zlib1g-dev patchelf
```

### 3. Build ktransformers

```bash
# Clone repository
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive


# Install single-NUMA dependencies
USE_BALANCE_SERVE=1 bash ./install.sh

# Or install dual-NUMA dependencies
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
```
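
After the script completes, a quick sanity check (a sketch; it only confirms that the package is importable from the active environment):

```bash
# Verify the installation by importing the package
python -c "import ktransformers; print('ktransformers import OK')"
```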

## Running DeepSeek-R1-Q4KM Models

### 1. Run for 24GB VRAM GPUs

Use our optimized configuration for constrained VRAM:

```bash
python ktransformers/server/main.py \
  --port 10002 \
  --model_path <path_to_safetensor_config> \
  --gguf_path <path_to_gguf_files> \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --backend_type balance_serve
```

The launch command takes the following arguments:

- `--max_new_tokens`: Maximum number of tokens generated per request.
- `--cache_lens`: Total length of the KV cache allocated by the scheduler. All requests share a single KV cache space corresponding to 32768 tokens, and the space a request occupies is released once it completes. With `--max_batch_size 4`, the four concurrent requests all draw from this shared 32768-token budget (a scaled-up variation is sketched after this list).
- `--chunk_size`: Maximum number of tokens processed in a single run by the engine.
- `--max_batch_size`: Maximum number of requests (prefill + decode) processed in a single run by the engine. (Supported only by `balance_serve`)
- `--backend_type`: `balance_serve` is a multi-concurrency backend engine introduced in version v0.2.4. The original single-concurrency engine is `ktransformers`.
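
A hypothetical scaled-up variation: to serve more simultaneous requests, raise `--max_batch_size` and grow `--cache_lens` so the shared budget keeps pace. The values below are illustrative, not tuned recommendations, and the larger cache will need more memory than the 24GB configuration above:

```bash
# Illustrative only: 8-way concurrency with a proportionally larger
# shared KV cache (roughly 65536 / 8 = 8192 tokens per request)
python ktransformers/server/main.py \
  --port 10002 \
  --model_path <path_to_safetensor_config> \
  --gguf_path <path_to_gguf_files> \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 65536 \
  --chunk_size 256 \
  --max_batch_size 8 \
  --backend_type balance_serve
```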

### 2. Access the Server

```bash
curl -X POST http://localhost:10002/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "model": "DeepSeek-R1",
    "temperature": 0.3,
    "top_p": 1.0,
    "stream": true
  }'
```
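
To see the multi-concurrency scheduling in action, you can fire several requests at once. The sketch below sends four parallel requests, matching the `--max_batch_size 4` used above; it assumes the endpoint also accepts non-streaming requests (`"stream": false`), as an OpenAI-compatible API normally does:

```bash
# Send four requests in parallel to exercise continuous batching
for i in 1 2 3 4; do
  curl -s -X POST http://localhost:10002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
      \"messages\": [{\"role\": \"user\", \"content\": \"Request $i: hello\"}],
      \"model\": \"DeepSeek-R1\",
      \"stream\": false
    }" &
done
wait  # block until all four responses have returned
```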