# DeepSeek V3.1/V3/R1 Support

The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs **from day one**. SGLang also supports [MLA optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) and [DP attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), making SGLang one of the best open-source LLM engines for running DeepSeek models. SGLang is the inference engine recommended by the official [DeepSeek team](https://github.com/deepseek-ai/DeepSeek-V3/tree/main?tab=readme-ov-file#62-inference-with-sglang-recommended).

Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.

For details on SGLang's optimizations for the DeepSeek model series, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.ai/basic_usage/deepseek.html).

## Installation & Launch

If you encounter errors when starting the server, make sure the model weights have finished downloading. It's recommended to download them beforehand, or to restart the server until all weights are downloaded.
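
If you prefer to pre-download, one option is the Hugging Face CLI (a minimal sketch; it populates the local cache that the Docker commands below mount at `~/.cache/huggingface`):

```bash
pip install -U "huggingface_hub[cli]"
# Download the DeepSeek-V3 weights into the local Hugging Face cache
huggingface-cli download deepseek-ai/DeepSeek-V3
```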

### Using Docker (Recommended)

```bash
# Pull latest image
# https://hub.docker.com/r/lmsysorg/sglang/tags
docker pull lmsysorg/sglang:latest

# Launch
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```

If you are using RDMA, please note that:

1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
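
For example, on a RoCE fabric the variable can be passed into the container with `-e` (a sketch based on the launch command above; the GID index value depends on your network configuration):

```bash
docker run --gpus all --shm-size 32g -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host --network=host --privileged \
    -e NCCL_IB_GID_INDEX=3 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```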

Add [performance optimization options](#performance-optimization-options) as needed.

### Using pip

```bash
# Installation
pip install "sglang[all]>=0.5.3rc0"

# Launch
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
```

Add [performance optimization options](#performance-optimization-options) as needed.

<a id="option_args"></a>

### Performance Optimization Options

[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations that can be enabled as needed.

- [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
- [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add the `--enable-torch-compile` argument to enable it. This adds some time to server startup. The maximum batch size for torch.compile optimization can be controlled with `--torch-compile-max-bs`; it's recommended to set it between `1` and `8` (e.g., `--torch-compile-max-bs 8`). See the combined example after this list.
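
A single-node launch with both options enabled could look like this (a sketch; keep only the flags that match your workload):

```bash
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile --torch-compile-max-bs 8
```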

### Usage: Chat with DeepSeek

#### DeepSeek V3/R1

```python3
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
```
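
The same request can also be sent with plain `curl` against the OpenAI-compatible endpoint (a minimal sketch mirroring the Python example above):

```bash
curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "List 3 countries and their capitals."}
    ],
    "temperature": 0,
    "max_tokens": 64
  }'
```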

#### DeepSeek V3.1
In addition to the basic usage shown in the DeepSeek V3/R1 example, DeepSeek V3.1 supports a per-request thinking/non-thinking toggle. Simply switch the `"thinking"` field in `extra_body={"chat_template_kwargs": {"thinking": True}}` to enable or disable thinking mode.

##### Non-Thinking
```python3
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "Answer the following with the second letter of the correct answer only: What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=1024,
    extra_body = {"chat_template_kwargs": {"thinking": False}}
)
print(response.choices[0].message.content)
```
Answer:
```
h
```
* The correct response is 'A' (the second letter of 'Paris'); with thinking disabled, the model answered incorrectly.
##### Thinking
```python3
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "Answer the following with the second letter of the correct answer only: What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=1024,
    extra_body = {"chat_template_kwargs": {"thinking": True}}
)
print(response)
```
Answer:
```
First, the question is: "What is the capital of France?" I know that the capital of France is Paris.

The user says: "Answer the following with the second letter of the correct answer only." So, I need to provide only the second letter of the correct answer.

The correct answer is "Paris". Now, I need to find the second letter of "Paris".

Let's spell it out: P-A-R-I-S.

- First letter: P

- Second letter: A

- Third letter: R

- Fourth letter: I

- Fifth letter: S

So, the second letter is "A".

I should only output the second letter, which is "A". No additional text or explanation, just the letter.

The user emphasized "the second letter of the correct answer only", so my response should be just "A".

Finally, I need to make sure that this is the correct answer. Yes, Paris is indeed the capital of France.</think>A
```
* The response contains a thinking trace ending in `</think>`, and the model was able to derive the correct answer from it.

### Example: Serving with two H20\*8 nodes

Suppose there are two H20 nodes, each with 8 GPUs. The first node's IP is `10.0.0.1`, and the second node's IP is `10.0.0.2`. Please **use the first node's IP** in both commands.

If the command fails, try setting the `GLOO_SOCKET_IFNAME` environment variable. For more information, see [Common Environment Variables](https://pytorch.org/docs/stable/distributed.html#common-environment-variables).

If the nodes are connected with NVIDIA InfiniBand and you encounter hanging issues during startup, consider setting the environment variable `NCCL_IB_GID_INDEX`, for example `export NCCL_IB_GID_INDEX=3`. For more information, see [this](https://github.com/sgl-project/sglang/issues/3516#issuecomment-2668493307).
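
For example, the following could be set on every node before launching (a sketch; `eth0` is a placeholder for the interface that actually carries your cluster traffic):

```bash
# Set on each node before launching the server
export GLOO_SOCKET_IFNAME=eth0   # placeholder: use your actual network interface
export NCCL_IB_GID_INDEX=3       # only if your InfiniBand/RoCE setup requires it
```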

```bash
# node 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```

If you have two H100 nodes, the usage is the same as for the H20 nodes above.

> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).

### Example: Serving with two H200\*8 nodes and Docker

There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`.
A single H200 node with 8 GPUs can run DeepSeek V3; the dual-node setup here is only to demonstrate multi-node usage.

```bash
# node 1
docker run --gpus all \
    --shm-size 32g \
    --network=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name sglang_multinode1 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
```

```bash
# node 2
docker run --gpus all \
    --shm-size 32g \
    --network=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name sglang_multinode2 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000
```

To verify that the deployment works, run a benchmark from a client Docker container.

```bash
docker run --gpus all \
    --shm-size 32g \
    --network=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name sglang_multinode_client \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 1 --host 0.0.0.0 --port 40000 --output-file "deepseekv3_multinode.jsonl"
```
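
You can also verify that the server is reachable from the client by listing the served models through the OpenAI-compatible API (adjust the IP and port to your setup):

```bash
curl http://192.168.114.10:40000/v1/models
```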

> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).

### Example: Serving with four A100\*8 nodes

To serve DeepSeek-V3 with A100 GPUs, first convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 with [this script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).
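
A typical conversion run looks like the following (a sketch; run it after installing the script's dependencies, the paths are placeholders, and the argument names follow the upstream script, so check its usage if they have changed):

```bash
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
# Convert the FP8 checkpoint to BF16 (paths are placeholders)
python fp8_cast_bf16.py \
    --input-fp8-hf-path /path/to/DeepSeek-V3 \
    --output-bf16-hf-path /path/to/DeepSeek-V3-BF16
```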

Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with eight 80 GB GPUs. Assuming the first node's IP is `10.0.0.1` and the converted model is stored at `/path/to/DeepSeek-V3-BF16`, the server can be launched with the following commands.

```bash
# node 1
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 30000

# node 2
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 1 --trust-remote-code

# node 3
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 2 --trust-remote-code

# node 4
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 3 --trust-remote-code
```

> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).

Then we can benchmark the accuracy and latency by accessing the first node's exposed port with the following example commands.

```bash
# bench accuracy
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0.0.1 --port 30000

# bench latency
python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128
```


### Example: Serving with 8 A100/A800 with AWQ Quantization

**Recommended Usage**

Add the `--quantization moe_wna16` flag to enable the MoE wna16 kernel for better performance.
One example is as follows:

```bash
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --quantization moe_wna16
```

Alternatively, you can use `--quantization awq_marlin` as follows:

```bash
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --quantization awq_marlin --dtype float16
```

Note that `awq_marlin` currently only supports `float16`, which may lead to some precision loss.

### Example: Serving with 16 A100/A800 with int8 Quantization

There are block-wise and per-channel quantization methods, and the quantized checkpoints have already been uploaded to Hugging Face:

- [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8)
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)

Assuming the master node's IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-INT8`, and the port is `5000`, the server can be launched with the following commands:
```bash
#master
python3 -m sglang.launch_server \
	--model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
	MASTER_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
#cluster
python3 -m sglang.launch_server \
	--model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
	MASTER_IP:5000 --nnodes 2 --node-rank 1 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
```

> **Note that the launch command here enables `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).

Then on the **master node**, supposing the ShareGPT data is located at `/path/to/ShareGPT_V3_unfiltered_cleaned_split.json`, you can run the following commands to benchmark the launched server:

```bash
# bench accuracy
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319

# bench serving
python3 -m sglang.bench_serving --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random  --random-input 128 --random-output 128 --num-prompts 1000 --request-rate 128 --random-range-ratio 1.0
```

> **Note: using `--parallel 200` can accelerate accuracy benchmarking**.
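
For example, the accuracy benchmark above can be run as:

```bash
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 200
```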

### Example: Serving with 32 L40S with int8 Quantization

Run with the per-channel quantized model:

- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)

Assuming the master node's IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8`, and the port is `5000`, the server can be launched with the following commands:

```bash
#master
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 0 --trust-remote-code \
    --enable-torch-compile --torch-compile-max-bs 32
#cluster
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 1 --trust-remote-code \
    --enable-torch-compile --torch-compile-max-bs 32
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 2 --trust-remote-code \
    --enable-torch-compile --torch-compile-max-bs 32
python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --quantization w8a8_int8 \
    --dist-init-addr MASTER_IP:5000 --nnodes 4 --node-rank 3 --trust-remote-code \
    --enable-torch-compile --torch-compile-max-bs 32
```

The benchmarking method is the same as described in the previous [16 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization) example.

### Example: Serving on any cloud or Kubernetes with SkyPilot

SkyPilot helps you find the cheapest available GPUs across clouds or existing Kubernetes clusters and launch distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1).

To serve on multiple nodes:

```bash
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot
# Serve on 2 H100/H200x8 nodes
sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B.yaml --retry-until-up
# Serve on 4 A100x8 nodes
sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B-A100.yaml --retry-until-up
```

#### Troubleshooting

If you encounter the following error with an fp16/bf16 checkpoint:

```bash
ValueError: Weight output_partition_size = 576 is not divisible by weight quantization block_n = 128.
```

Edit your `config.json` and remove the `quantization_config` block. For example:

```json
"quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [128, 128]
},
```

Removing this block typically resolves the error. For more details, see the discussion in [sgl-project/sglang#3491](https://github.com/sgl-project/sglang/issues/3491#issuecomment-2650779851).
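
If you prefer to patch the file programmatically instead of editing it by hand, a small script along these lines works (a sketch; the path is a placeholder for your local checkpoint directory):

```python3
import json

# Placeholder: point this at the config.json of your converted checkpoint
config_path = "/path/to/DeepSeek-V3-BF16/config.json"

with open(config_path) as f:
    config = json.load(f)

# Drop the FP8 quantization metadata so the BF16 weights load without the block-size check
config.pop("quantization_config", None)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```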

## DeepSeek V3 Optimization Plan

https://github.com/sgl-project/sglang/issues/2591