dsr1-wideep-gb200.md 5.77 KB
Newer Older
ishandhanani's avatar
ishandhanani committed
1
2
3
4
5
6
7
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Running DeepSeek-R1 Disaggregated with WideEP on GB200s

8
Dynamo supports SGLang's GB200 implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-wideep` and a sample configuration that demonstrates WideEP and P/D  disaggregation. To run the exact configuration shown in the blog post, you can view the commands created by the SGLang team [here](https://github.com/sgl-project/sglang/issues/7227). In this example, we will run 1 prefill worker on 2 GB200 nodes (4 GPUs each) and 1 decode worker on 2 GB200 nodes (total 8 GPUs).
ishandhanani's avatar
ishandhanani committed
9
10
11

## Instructions

12
13
14
15
1. Build the Dynamo container using the latest published dynamo version and stable sglang version. If you want to build from a local dynamo repo, you can add `--build-arg BRANCH_TYPE=local` to the build command. If you want to build from a remote dynamo repo, you can add `--build-arg BRANCH_TYPE=remote` to the build command. If you want to use a specific tag for the default sglang version, you can add `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.

> [!Note]
> Please ensure that you are building this on an ARM64 machine. The correct SGLang image will be selected automatically via the multi-arch manifest.
ishandhanani's avatar
ishandhanani committed
16

17
18
19
> [!Note]
> Please use `--build-arg SGLANG_IMAGE_TAG=nightly-dev-20251019-fda0cb2a` to build the container due to a bug that we found with the DeepEP version being installed. This was fixed in [PR 11773](https://github.com/sgl-project/sglang/pull/11773). When SGLang releases a version > `0.5.3.post3` we will update these instructions.

ishandhanani's avatar
ishandhanani committed
20
21
22
23
24
```bash
cd $DYNAMO_ROOT
docker build \
  -f container/Dockerfile.sglang-wideep \
  -t dynamo-wideep-gb200 \
25
  --build-arg SGLANG_IMAGE_TAG=nightly-dev-20251019-fda0cb2a \
26
  --no-cache \
ishandhanani's avatar
ishandhanani committed
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
  .
```

2. You can run this container on each 4xGB200 node using the following command.

> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)

```bash
docker run \
    --gpus all \
    -it \
    --rm \
    --network host \
    --volume /PATH_TO_DSR1_MODEL/:/model/ \
    --shm-size=10G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --ulimit nofile=65536:65536 \
    --cap-add CAP_SYS_PTRACE \
    --ipc host \
    dynamo-wideep-gb200:latest
```

51
3. Run the ingress and prefill worker
ishandhanani's avatar
ishandhanani committed
52
53
54
55
56

```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# run prefill worker
57
DYN_SKIP_SGLANG_LOG_FORMATTING=1 \
ishandhanani's avatar
ishandhanani committed
58
59
60
61
62
63
64
65
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
MC_FORCE_MNNVL=1 \
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
66
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
ishandhanani's avatar
ishandhanani committed
67
PYTHONUNBUFFERED=1 \
68
python3 -m dynamo.sglang \
ishandhanani's avatar
ishandhanani committed
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --model-path /model/ \
  --skip-tokenizer-init \
  --trust-remote-code \
  --disaggregation-mode prefill \
  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
  --disaggregation-bootstrap-port 30001 \
  --nnodes 2 \
  --node-rank 0 \
  --tp-size 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --host 0.0.0.0 \
  --decode-log-interval 1 \
  --max-running-requests 6144 \
  --context-length 2716 \
  --disable-radix-cache \
86
87
88
  --moe-a2a-backend deepep \
  --load-balance-method round_robin \
  --deepep-mode normal \
ishandhanani's avatar
ishandhanani committed
89
90
91
92
93
94
95
96
97
98
99
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --disable-shared-experts-fusion \
  --ep-num-redundant-experts 32 \
  --ep-dispatch-algorithm static \
  --eplb-algorithm deepseek \
  --attention-backend cutlass_mla \
  --watchdog-timeout 1000000 \
  --disable-cuda-graph \
  --chunked-prefill-size 16384 \
  --max-total-tokens 32768 \
100
  --mem-fraction-static 0.82 \
101
102
  --log-level debug \
  --disaggregation-transfer-backend nixl
ishandhanani's avatar
ishandhanani committed
103
104
```

105
106
On the other prefill nodes (this example has 2 total prefill nodes), run the same command but change `--node-rank` to 1

107
4. Run the decode worker on the head decode node
ishandhanani's avatar
ishandhanani committed
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122

```bash
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
123
python3 -m dynamo.sglang \
ishandhanani's avatar
ishandhanani committed
124
125
126
127
128
129
130
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --model-path /model/ \
  --skip-tokenizer-init \
  --trust-remote-code \
  --disaggregation-mode decode \
  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
  --disaggregation-bootstrap-port 30001 \
131
  --nnodes 2 \
ishandhanani's avatar
ishandhanani committed
132
  --node-rank 0 \
133
134
  --tp-size 8 \
  --dp-size 8 \
ishandhanani's avatar
ishandhanani committed
135
136
137
138
139
140
  --enable-dp-attention \
  --host 0.0.0.0 \
  --decode-log-interval 1 \
  --max-running-requests 36864 \
  --context-length 2716 \
  --disable-radix-cache \
141
142
  --moe-a2a-backend deepep \
  --prefill-round-robin-balance \
ishandhanani's avatar
ishandhanani committed
143
144
145
  --deepep-mode low_latency \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
146
  --cuda-graph-max-bs 256 \
ishandhanani's avatar
ishandhanani committed
147
148
149
150
151
152
153
154
  --disable-shared-experts-fusion \
  --ep-num-redundant-experts 32 \
  --ep-dispatch-algorithm static \
  --eplb-algorithm deepseek \
  --attention-backend cutlass_mla \
  --watchdog-timeout 1000000 \
  --chunked-prefill-size 36864 \
  --mem-fraction-static 0.82 \
155
156
  --log-level debug \
  --disaggregation-transfer-backend nixl
ishandhanani's avatar
ishandhanani committed
157
158
```

159
On the other decode nodes (this example has 2 total decode nodes), run the same command but change `--node-rank` to 1.