SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running DeepSeek-R1 Disaggregated with WideEP on GB200s
Dynamo supports SGLang's GB200 implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. We provide a sample configuration that demonstrates WideEP and P/D disaggregation. To run the exact configuration shown in the blog post, you can view the commands created by the SGLang team [here](https://github.com/sgl-project/sglang/issues/7227). In this example, we will run 1 prefill worker on 2 GB200 nodes (4 GPUs each) and 1 decode worker on 2 GB200 nodes (total 8 GPUs).
## Instructions
1. Build the Dynamo container for ARM64 (GB200) using the `build.sh` script.
> [!Note]
> Please ensure that you are building this on an ARM64 machine. The build script will automatically configure the correct platform and build arguments for SGLang on ARM64/GB200.
```bash
cd$DYNAMO_ROOT
./container/build.sh \
--framework SGLANG \
--platform linux/arm64 \
--tag dynamo-wideep-gb200:latest
```
2. You can run this container on each 4xGB200 node using the following command.
> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
```bash
docker run \
--gpus all \
-it\
--rm\
--network host \
--volume /PATH_TO_DSR1_MODEL/:/model/ \
--shm-size=10G \
--ulimitmemlock=-1\
--ulimitstack=67108864 \
--ulimitnofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-wideep-gb200:latest
```
In each container, you should be in the /sgl-workspace/dynamo/examples/backends/sglang directory.
On the other prefill nodes (this example has 2 total prefill nodes), run the same command but change `--node-rank` to 1
> [!IMPORTANT]
> If you encounter random CPU recv timeout issues during the warm-up phase in multi-GPU or multi-node setups, they are likely caused by DeepGEMM kernel compilation overhead.
> To avoid these non-deterministic timeouts, it's strongly recommended to precompile the DeepGEMM kernels before launching the SGLang engine. This ensures all kernels are cached and ready, preventing long initialization delays or distributed timeout errors. To precompile and use cached kernels, please execute the following commands:
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running DeepSeek-R1 Disaggregated with WideEP on H100s
Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-05-05-large-scale-ep/) for more details. We provide a sample configuration that demonstrates WideEP and P/D disaggregation. To run the exact configuration shown in the blog post, you can view the commands created by the SGLang team [here](https://github.com/sgl-project/sglang/issues/6017). In this example, we will run 1 prefill worker on 4 H100 nodes (32 GPUs each) and 1 decode worker on 4 H100 nodes (total 64 GPUs).
## Instructions
1. Build the Dynamo container for AMD64/x86_64 (H100) using the `build.sh` script.
> [!Note]
> Please ensure that you are building this on an AMD64 (x86_64) machine. The build script will automatically configure the correct platform for SGLang.
```bash
cd$DYNAMO_ROOT
./container/build.sh \
--framework SGLANG \
--tag dynamo-wideep:latest \
```
2. You can run this container on each 8xH100 node using the following command.
> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
```bash
docker run \
--gpus all \
-it\
--rm\
--network host \
--volume /PATH_TO_DSR1_MODEL/:/model/ \
--shm-size=10G \
--ulimitmemlock=-1\
--ulimitstack=67108864 \
--ulimitnofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-wideep:latest
```
In each container, you should be in the `/sgl-workspace/dynamo/examples/backends/sglang` directory.
3. Run the ingress and prefill worker
```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# run prefill worker
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init\
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--host 0.0.0.0 \
--disaggregation-bootstrap-port 30001 \
--dist-init-addr${HEAD_PREFILL_NODE_IP}:29500 \
--nnodes 4 \
--node-rank 0 \
--tp-size 32 \
--dp-size 32 \
--enable-dp-attention\
--decode-log-interval 1000 \
--moe-a2a-backend deepep \
--load-balance-method round_robin \
--page-size 1 \
--trust-remote-code\
--moe-dense-tp-size 1 \
--enable-dp-lm-head\
--disable-radix-cache\
--watchdog-timeout 1000000 \
--enable-two-batch-overlap\
--deepep-mode normal \
--mem-fraction-static 0.85 \
--deepep-config /configs/deepep.json \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm dynamic \
--eplb-algorithm deepseek
```
On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3
> [!IMPORTANT]
> If you encounter random CPU recv timeout issues during the warm-up phase in multi-GPU or multi-node setups, they are likely caused by DeepGEMM kernel compilation overhead.
> To avoid these non-deterministic timeouts, it's strongly recommended to precompile the DeepGEMM kernels before launching the SGLang engine. This ensures all kernels are cached and ready, preventing long initialization delays or distributed timeout errors. To precompile and use cached kernels, please execute the following commands:
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Multinode Examples
## Multi-node sized models
SGLang allows you to deploy multi-node sized models by adding in the `dist-init-addr`, `nnodes`, and `node-rank` arguments. Below we demonstrate and example of deploying DeepSeek R1 for disaggregated serving across 4 nodes. This example requires 4 nodes of 8xH100 GPUs.
**Prerequisite**: Building the Dynamo container.
```bash
cd$DYNAMO_ROOT
./container/build.sh \
--framework SGLANG \
--tag dynamo-wideep:latest \
```
You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
**Step 1**: Ensure that your configuration file has the required arguments. Here's an example configuration that runs prefill and the model in TP16:
Node 1: Run HTTP ingress, processor, and 8 shards of the prefill worker
```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# run prefill worker
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr${HEAD_PREFILL_NODE_IP}:29500 \
--nnodes 2 \
--node-rank 0 \
--enable-dp-attention\
--trust-remote-code\
--skip-tokenizer-init\
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--load-balance-method round_robin \
--host 0.0.0.0 \
--mem-fraction-static 0.82
```
Node 2: Run the remaining 8 shards of the prefill worker
```bash
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr${HEAD_PREFILL_NODE_IP}:29500 \
--nnodes 2 \
--node-rank 1 \
--enable-dp-attention\
--trust-remote-code\
--skip-tokenizer-init\
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--host 0.0.0.0 \
--load-balance-method round_robin \
--mem-fraction-static 0.82
```
Node 3: Run the first 8 shards of the decode worker
```bash
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr${HEAD_DECODE_NODE_IP}:29500 \
--nnodes 2 \
--node-rank 0 \
--enable-dp-attention\
--trust-remote-code\
--skip-tokenizer-init\
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--host 0.0.0.0 \
--prefill-round-robin-balance\
--mem-fraction-static 0.82 \
--cuda-graph-max-bs 8
```
Node 4: Run the remaining 8 shards of the decode worker
```bash
python3 -m dynamo.sglang \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--tp 16 \
--dp-size 16 \
--dist-init-addr${HEAD_DECODE_NODE_IP}:29500 \
--nnodes 2 \
--node-rank 1 \
--enable-dp-attention\
--trust-remote-code\
--skip-tokenizer-init\
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--host 0.0.0.0 \
--prefill-round-robin-balance\
--mem-fraction-static 0.82 \
--cuda-graph-max-bs 8
```
**Step 2**: Run inference
SGLang typically requires a warmup period to ensure the DeepGEMM kernels are loaded. We recommend running a few warmup requests and ensuring that the DeepGEMM kernels load in.
"content": "In the heart of the tennis world, where champions rise and fall with each Grand Slam, lies the legend of the Golden Racket of Wimbledon. Once wielded by the greatest players of antiquity, this mythical racket is said to bestow unparalleled precision, grace, and longevity upon its rightful owner. For centuries, it remained hidden, its location lost to all but the most dedicated scholars of the sport. You are Roger Federer, the Swiss maestro whose elegant play and sportsmanship have already cemented your place among the legends, but whose quest for perfection remains unquenched even as time marches on. Recent dreams have brought you visions of this ancient artifact, along with fragments of a map that seems to lead to its resting place. Your journey will take you through the hallowed grounds of tennis history, from the clay courts of Roland Garros to the hidden training grounds of forgotten champions, and finally to a secret chamber beneath Centre Court itself. Character Background: Develop a detailed background for Roger Federer in this quest. Describe his motivations for seeking the Golden Racket, his tennis skills and personal weaknesses, and any connections to the legends of the sport that came before him. Is he driven by a desire to extend his career, to secure his legacy as the greatest of all time, or perhaps by something more personal? What price might he be willing to pay to claim this artifact, and what challenges from rivals past and present might stand in his way?"