<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Running DeepSeek-R1 Disaggregated with WideEP on GB200s
Dynamo supports SGLang's GB200 implementation of wide expert parallelism and large-scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. Full end-to-end optimization is still a work in progress, but you can get this up and running with the following steps. In this example, we will run 1 prefill worker on 2 GB200 nodes and 1 decode worker on 12 GB200 nodes (4 GPUs per node, for a total of 56 GPUs across all 14 nodes).
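As a quick sanity check on the topology described above, the GPU counts can be worked out as follows. This is only an illustrative sketch: the variable names are hypothetical, and the node and GPU counts are the ones quoted in the paragraph above.

```python
# Hypothetical names; the counts below are taken from the example topology in the text.
GPUS_PER_NODE = 4   # each GB200 node in this example exposes 4 GPUs
PREFILL_NODES = 2   # the single prefill worker spans 2 nodes
DECODE_NODES = 12   # the single decode worker spans 12 nodes

prefill_gpus = PREFILL_NODES * GPUS_PER_NODE
decode_gpus = DECODE_NODES * GPUS_PER_NODE
total_gpus = prefill_gpus + decode_gpus

print(f"prefill: {prefill_gpus} GPUs, decode: {decode_gpus} GPUs, total: {total_gpus} GPUs")
```

The prefill worker therefore runs on 8 GPUs and the decode worker on 48, which together give the 56 GPUs mentioned above.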
2. You can run this container on each 4xGB200 node using the following command.
> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
```bash
docker run \
    --gpus all \
    -it \
    --rm \
    --network host \
    --volume /PATH_TO_DSR1_MODEL/:/model/ \
    --shm-size=10G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --ulimit nofile=65536:65536 \
    --cap-add CAP_SYS_PTRACE \
    --ipc host \
    dynamo-wideep-gb200:latest
```
3. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
```bash
./utils/gen_env_vars.sh
```
4. Run the ingress and prefill worker
```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
...
```
You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
2. You can run this container on each 8xH100 node using the following command.
> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
...
In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory.
3. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
```bash
./utils/gen_env_vars.sh
```
4. Run the ingress and prefill worker
```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
...
```
On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
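To make the per-node change explicit, the sketch below lists the `--node-rank` value for each remaining decode node. It is illustrative only: `node_rank_flags` is a hypothetical helper, and it assumes the head decode node already uses rank 0, so the other eight nodes take ranks 1 through 8.

```python
# Hypothetical helper; only the --node-rank flag and the 9-node count come from the text above.
DECODE_NODES = 9  # total decode nodes in this example, including the head node (assumed rank 0)


def node_rank_flags(num_nodes: int) -> dict[int, str]:
    """Map each non-head node rank to the flag value that changes on that node."""
    return {rank: f"--node-rank {rank}" for rank in range(1, num_nodes)}


flags = node_rank_flags(DECODE_NODES)
for rank, flag in flags.items():
    print(f"decode node {rank}: same command, with {flag}")
```

Every other argument of the decode command stays identical across nodes; only the rank differs.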
...
In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
prefill:
```bash
...
--max-running-requests 8192 \
...
```
decode:
```bash
...
--max-running-requests 18432 \
...
```
We currently provide two ways to run an end-to-end benchmark that includes our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single-batch workloads in the future.
1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
We've found that 8k ISL 256 OSL provides a good baseline for measuring end-to-end disaggregated serving performance for DSR1. Because WideEP allows for higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up, so we also provide a short ramping warmup script.
```bash
...
curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
...
```
2. **GenAI Perf to benchmark completions with custom dataset**
We provide a script that generates a JSONL file of the ShareGPT dataset and then uses GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup. Note, however, that you will then have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
For simplicity of the example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes that are performantly
...
## Usage
> [!NOTE]
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.
1. **Submit a benchmark job**:
```bash
python submit_job_script.py \
  --template job_script_template.j2 \
  ...
```
**Required arguments**:
- `--template`: Path to Jinja2 template file
- `--model-dir`: Model directory path
- `--config-dir`: Config directory path
...
- `--account`: SLURM account
**Optional arguments**:
- `--prefill-nodes`: Number of prefill nodes (default: `2`)
- `--decode-nodes`: Number of decode nodes (default: `2`)
- `--gpus-per-node`: Number of GPUs per node (default: `8`)
- `--network-interface`: Network interface to use (default: `eth3`)
- `--job-name`: SLURM job name (default: `dynamo_setup`)
- `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)
- `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`)
- `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`)

**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
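To illustrate how these optional flags and their defaults fit together, here is a minimal `argparse` sketch. It is not the actual `submit_job_script.py`: `build_parser` and `total_nodes` are hypothetical names, the required arguments are omitted, and treating the total as the plain sum of prefill and decode nodes is an assumption based on the note above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the optional flags and defaults listed above."""
    parser = argparse.ArgumentParser(description="Sketch of the documented optional flags")
    parser.add_argument("--prefill-nodes", type=int, default=2)
    parser.add_argument("--decode-nodes", type=int, default=2)
    parser.add_argument("--gpus-per-node", type=int, default=8)
    parser.add_argument("--network-interface", default="eth3")
    parser.add_argument("--job-name", default="dynamo_setup")
    parser.add_argument("--time-limit", default="01:00:00")
    parser.add_argument("--gpu-type", choices=["h100", "gb200"], default="h100")
    parser.add_argument("--use-sglang-commands", action="store_true")
    return parser


def total_nodes(args: argparse.Namespace) -> int:
    # Assumption: the total is simply prefill nodes plus decode nodes.
    return args.prefill_nodes + args.decode_nodes


# Example: a GB200 layout with 4 GPUs per node and 12 decode nodes.
args = build_parser().parse_args(
    ["--gpu-type", "gb200", "--gpus-per-node", "4", "--decode-nodes", "12"]
)
print(total_nodes(args), "nodes,", total_nodes(args) * args.gpus_per_node, "GPUs")
```

With the defaults left untouched, the same sketch yields 4 nodes of 8 GPUs each on `h100`.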
2. **Example with different GPU types**:
```bash
# For H100 with Dynamo (default)
python submit_job_script.py \
  --template job_script_template.j2 \
  --model-dir /path/to/model \
  --config-dir /path/to/configs \
  --container-image container-image-uri \
  --account your-slurm-account \
  --gpu-type h100

# For GB200 with SGLang
python submit_job_script.py \
  --template job_script_template.j2 \
  --model-dir /path/to/model \
  --config-dir /path/to/configs \
  --container-image container-image-uri \
  --account your-slurm-account \
  --gpu-type gb200 \
  --use-sglang-commands \
  --gpus-per-node 4
```
3. **Monitor job progress**:
```bash
squeue -u $USER
```
4. **Check logs in real-time**:
```bash
tail -f logs/{JOB_ID}/log.out
```
You can view logs of all prefill or decode workers simultaneously by running: