Commit 9d7624f1 authored by ishandhanani, committed by GitHub

docs: instructions to run DSR1 with SGLang wideep on 104+ GPUs (#1583)


Co-authored-by: kkranen <kyle.kranen@gmail.com>
parent 68d74615
......@@ -134,15 +134,17 @@ RUN if [ "$ARCH" = "arm64" ]; then \
fi
# Install sglang
# Once either 0.4.6post6 or 0.4.7 is released, we can switch back to using the published version
# This commit references a fix to add DP attention based routing along with other perf fixes https://github.com/sgl-project/sglang/pull/6884
ARG SGLANG_COMMIT="f1569876d54dd3b6601f5280f12652e9fbb1375c"
# This commit references a NIXL fix that was released after the 0.4.8.post1 release https://github.com/sgl-project/sglang/pull/7330
ARG SGLANG_COMMIT="bb9b608c86ebad7d9d01e29fe058bc184dc7285f"
RUN --mount=type=cache,target=/root/.cache/uv \
git clone https://github.com/sgl-project/sglang.git && \
cd sglang && \
git checkout ${SGLANG_COMMIT} && \
uv pip install -e "python[all]"
# Set env var that allows for forceful shutdown of inflight requests in SGL's TokenizerManager
ENV SGL_FORCE_SHUTDOWN=1
# Common dependencies
RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \
uv pip install --requirement /tmp/requirements.txt
......
......@@ -71,7 +71,8 @@ RUN rm -rf /opt/hpcx/ucx && \
ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/ucx/lib:$LD_LIBRARY_PATH
# Pinning to NIXL 0.2.1 right now
# TODO: investigate pip install failure with 0.3.0 release
# There is a fix that was merged into SGLang after 0.4.8.post1
# TODO: Investigate perf hit of that change before we bump to up to date NIXL
ARG NIXL_COMMIT="5e4c179ee850d482a83cb2a211e0947e46281060"
RUN git clone https://github.com/ai-dynamo/nixl.git && cd nixl && git checkout ${NIXL_COMMIT} && pip install --break-system-packages . --config-settings=setup-args="-Ducx_path=/usr/local/ucx"
......@@ -79,18 +80,18 @@ WORKDIR /sgl-workspace
RUN pip uninstall --break-system-packages -y sglang
RUN rm -rf sglang
# 0.4.8 has a bug with CUDA graphs and decode worker
# Pinning to 0.4.8.post1 for now which solves a TBO issue
# https://github.com/sgl-project/sglang/issues/7511
RUN pip install --break-system-packages "sglang==0.4.7.post1"
RUN pip install --break-system-packages "sglang==0.4.8.post1"
# Allow forceful shutdown of inflight requests
ENV SGL_FORCE_SHUTDOWN=1
WORKDIR /sgl-workspace
# https://github.com/ai-dynamo/dynamo/pull/1510
ARG DYNAMO_COMMIT="382e3aedc421b3b3abc338062b332b54b5aa8529"
ARG DYNAMO_BRANCH="ishan/cmpl-token-id"
RUN git clone https://github.com/ai-dynamo/dynamo.git && cd dynamo && git checkout ${DYNAMO_BRANCH}
# support batch completions for SGL benchmarking
# https://github.com/ai-dynamo/dynamo/pull/1626
ARG DYNAMO_COMMIT="fc16a79bfc5a4c4f58503d3c36f2013340244cac"
RUN git clone https://github.com/ai-dynamo/dynamo.git && cd dynamo && git checkout ${DYNAMO_COMMIT}
# install dynamo in editable mode
WORKDIR /sgl-workspace/dynamo
......@@ -149,6 +150,23 @@ RUN wget --tries=3 --waitretry=5 https://github.com/etcd-io/etcd/releases/downlo
rm /tmp/etcd.tar.gz
ENV PATH=/usr/local/bin/etcd/:$PATH
COPY examples/sglang/configs/deepep/* /sgl-workspace/dynamo/examples/sglang/configs/
# Install perf_analyzer and genai-perf
RUN apt-get update -y && \
apt-get install -y --no-install-recommends \
rapidjson-dev \
zlib1g-dev
RUN git clone --depth=1 https://github.com/triton-inference-server/perf_analyzer.git && \
mkdir perf_analyzer/build && \
cmake -B perf_analyzer/build -S perf_analyzer && \
cmake --build perf_analyzer/build -- -j8
ENV PATH=/sgl-workspace/perf_analyzer/build/perf_analyzer/src/perf-analyzer-build:$PATH
RUN pip install --break-system-packages genai-perf
COPY examples/sglang/configs/deepseek-r1-wideep/* /sgl-workspace/dynamo/examples/sglang/configs/
COPY examples/sglang/utils/deepseek-r1-wideep/* /sgl-workspace/dynamo/examples/sglang/utils/
WORKDIR /sgl-workspace/dynamo/examples/sglang
......@@ -73,7 +73,10 @@ dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
#### Disaggregated
SGLang uses a mini load balancer to route requests to handle disaggregated serving. The load balancer functions as follows
<details>
<summary>SGLang Load Balancer vs Dynamo Discovery</summary>
SGLang uses a mini load balancer to route requests to handle disaggregated serving. The load balancer functions as follows:
1. The load balancer receives a request from the client
2. A random `(prefill, decode)` pair is selected from the pool of available workers
......@@ -82,6 +85,8 @@ SGLang uses a mini load balancer to route requests to handle disaggregated servi
Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead, we first route to a random prefill worker, select a random decode worker, and then send the request to both. Internally, SGLang's bootstrap server (which is a part of the `tokenizer_manager`) is used in conjunction with NIXL to handle the KV transfer.
</details>
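The Dynamo routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dynamo's actual implementation: `pick_workers` and the worker names are hypothetical, and the real system obtains its candidate workers from the discovery mechanism rather than from static lists.

```python
import random


def pick_workers(prefill_workers, decode_workers, rng=random):
    """Pick one random prefill worker and one random decode worker.

    Hypothetical helper for illustration only; in Dynamo the candidate
    workers come from the discovery mechanism, not from static lists.
    """
    if not prefill_workers or not decode_workers:
        raise ValueError("need at least one prefill and one decode worker")
    prefill = rng.choice(prefill_workers)
    decode = rng.choice(decode_workers)
    # The same request is then sent to both workers; the KV cache moves
    # from prefill to decode via SGLang's bootstrap server and NIXL.
    return prefill, decode


if __name__ == "__main__":
    print(pick_workers(["prefill-0", "prefill-1"], ["decode-0", "decode-1"]))
```

Because both choices are uniform and independent, no central load balancer has to track `(prefill, decode)` pairs.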
> [!IMPORTANT]
> Disaggregated serving in SGLang currently requires each worker to have the same tensor parallel size [unless you are using an MLA based model](https://github.com/sgl-project/sglang/pull/5922)
......@@ -90,7 +95,7 @@ cd /workspace/examples/sglang
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
```
##### Disaggregated with MoE and DP attention
##### Disaggregated with MoE models and DP attention
SGLang also supports DP attention for MoE models. We provide an example config for this in `configs/disagg-dp-attention.yaml` which is based on the [DeepSeek-R1-Small-2layers](https://huggingface.co/silence09/DeepSeek-R1-Small-2layers) model. You can use this configuration to test out disaggregated serving on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.
......@@ -100,145 +105,8 @@ cd /workspace/examples/sglang
dynamo serve graphs.disagg:Frontend -f ./configs/disagg-dp-attention.yaml
```
##### Disaggregated with WideEP
Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 2 H100 nodes and 1 decode worker on 4 H100 nodes (48 total GPUs). You can easily scale this to 96 GPUs or more by simply changing the configuration files.
Steps to run:
1. Build the SGLang DeepEP container.
```bash
git clone -b v0.4.8 https://github.com/sgl-project/sglang.git
cd sglang/docker
docker build -f Dockerfile -t deepep .
```
You will now have a `deepep:latest` image
2. Build the Dynamo container
```bash
cd $DYNAMO_ROOT
docker build -f container/Dockerfile.sglang-deepep . -t dynamo-deepep --no-cache
```
3. You can run this container on each 8xH100 node using the following command.
> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--volume /PATH_TO_DSR1_MODEL/:/model/ \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-deepep:latest
```
In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang` directory.
4. On the head prefill node, start `nats-server` and `etcd` using the following commands
```bash
nats-server -js &
etcd --listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://0.0.0.0:2379 \
--listen-peer-urls http://0.0.0.0:2380 \
--initial-cluster default=http://HEAD_PREFILL_NODE_IP:2380 &
```
5. On every other node, go ahead and export the `NATS_SERVER` and `ETCD_ENDPOINTS` environment variables
> [!IMPORTANT]
> You will need the IP address of your head prefill node and head decode node for the configuration files
```bash
# run this on every other node
export NATS_SERVER=nats://HEAD_PREFILL_NODE_IP:4222
export ETCD_ENDPOINTS=http://HEAD_PREFILL_NODE_IP:2379
```
6. Configure each configuration file to use the correct `dist-init-addr`, and `node-rank`
Each container contains the configuration file in `configs/dsr1.yaml`. For our example, we will make the following changes:
On the prefill head node, `vim` into the configs and change the following section of the `SGLangWorker`:
```yaml
SGLangWorker:
...
dist-init-addr: HEAD_PREFILL_NODE_IP
nnodes: 2
node-rank: 0
...
```
On the other prefill node (since this example has 2 prefill nodes), change the following section of the `SGLangWorker`:
```yaml
SGLangWorker:
...
dist-init-addr: HEAD_PREFILL_NODE_IP
nnodes: 2
node-rank: 1
...
```
On the decode head node, `vim` into the configs and change the following section of the `SGLangDecodeWorker`:
```yaml
SGLangDecodeWorker:
...
dist-init-addr: HEAD_DECODE_NODE_IP
nnodes: 4
node-rank: 0
...
```
On the other decode nodes (this example has 4 decode nodes), change the following section of the `SGLangDecodeWorker`:
```yaml
SGLangDecodeWorker:
...
dist-init-addr: HEAD_DECODE_NODE_IP
nnodes: 4
# depending on which node this will be 1, 2, and 3
node-rank: 1
```
7. Start up the workers using the following commands
On prefill head node
In order to scale to the full DeepSeek-R1 model, you can follow the instructions in the [multinode-examples.md](./multinode-examples.md) file.
```bash
dynamo serve graphs.agg:Frontend -f configs/dsr1.yaml
```
On prefill child node
```bash
dynamo serve graphs.agg:Frontend -f configs/dsr1.yaml --service-name SGLangWorker
```
On all decode nodes
```bash
dynamo serve graphs.disagg:Frontend -f configs/dsr1.yaml --service-name SGLangDecodeWorker
```
8. Run the warmup script to warm up the model
DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
##### Disaggregated with WideEP
```bash
./warmup.sh HEAD_PREFILL_NODE_IP
```
Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can find detailed deployment and benchmarking instructions [here](./dsr1-wideep.md)
......@@ -19,7 +19,7 @@ import logging
import sglang as sgl
from utils.protocol import DisaggPreprocessedRequest
from utils.sglang import parse_sglang_args
from utils.sgl_utils import parse_sglang_args
from dynamo.sdk import endpoint, service
......
......@@ -34,7 +34,7 @@ import sglang as sgl
from components.decode_worker import SGLangDecodeWorker
from sglang.srt.utils import get_ip
from utils.protocol import DisaggPreprocessedRequest, PreprocessedRequest
from utils.sglang import parse_sglang_args
from utils.sgl_utils import parse_sglang_args
from dynamo.llm import ModelType, register_llm
from dynamo.sdk import async_on_start, depends, dynamo_context, endpoint, service
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [ $# -lt 1 ]; then
echo "Usage: $0 <ip> [port]"
echo "port defaults to 8000 if not specified"
exit 1
fi
IP=$1
PORT=${2:-8000}
echo "Running initial warmup 5 times with 5 seconds between each request"
for i in {1..5}; do
echo "Running iteration $i..."
curl ${IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the worldIn the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. 
Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for"
}
],
"stream":true,
"max_tokens": 100
}'
echo "Sleeping for 5 seconds..."
sleep 5
done
echo "Increasing output length to 500 tokens and running same request 10 times"
for i in {1..10}; do
echo "Running iteration $i..."
curl ${IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the worldIn the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. 
Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for"
}
],
"stream":true,
"max_tokens": 500
}'
echo "Sleeping for 5 seconds..."
sleep 5
done
echo "Running 5 parallel requests with up to 1000 tokens each"
for i in {1..5}; do
curl ${IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the worldIn the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. 
Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for"
}
],
"stream":true,
"max_tokens": 1000
}' &
done
wait
echo "Parallel requests complete"
......@@ -26,10 +26,10 @@ SGLangWorker:
disaggregation-transfer-backend: nixl
disaggregation-bootstrap-port: 30001
dist-init-addr: HEAD_PREFILL_NODE_IP:29500
nnodes: 2
nnodes: 4
node-rank: 0
tp-size: 16
dp-size: 16
tp-size: 32
dp-size: 32
enable-dp-attention: true
decode-log-interval: 1
# when MoE is enabled ep-size == tp-size
......@@ -43,13 +43,16 @@ SGLangWorker:
enable-two-batch-overlap: true
deepep-mode: normal
mem-fraction-static: 0.85
# SGLang's instructions for benchmarking include these flags
# ------------------------------------------------------------------------------------------------
# If you are trying to repro SGLang's blog post benchmarking - you will need to add these flags
# The `init-expert-location` configs can be found in the SGL blog post repro instructions
#max-running-requests: 8192
#max-total-tokens: 131072
#context-length: 8192
#init-expert-location: /configs/prefill_in4096.json
#deepep-config: /configs/deepep.json
chunked-prefill-size: 524288
#chunked-prefill-size: 524288
# ------------------------------------------------------------------------------------------------
deepep-config: /configs/deepep.json
ep-num-redundant-experts: 32
ep-dispatch-algorithm: dynamic
eplb-algorithm: deepseek
......@@ -69,10 +72,10 @@ SGLangDecodeWorker:
disaggregation-transfer-backend: nixl
disaggregation-bootstrap-port: 30001
dist-init-addr: HEAD_DECODE_NODE_IP:29500
nnodes: 4
nnodes: 9
node-rank: 0
tp-size: 32
dp-size: 32
tp-size: 72
dp-size: 72
enable-dp-attention: true
decode-log-interval: 1
enable-deepep-moe: true
......@@ -86,10 +89,13 @@ SGLangDecodeWorker:
enable-two-batch-overlap: true
deepep-mode: low_latency
mem-fraction-static: 0.835
# SGLang's instructions for benchmarking include these flags
# ------------------------------------------------------------------------------------------------
# If you are trying to repro SGLang's blog post benchmarking - you will need to add these flags
# The `init-expert-location` configs can be found in the SGL blog post repro instructions
#max-running-requests: 18432
#context-length: 4500
#init-expert-location: /configs/decode_in2000out100.json
# ------------------------------------------------------------------------------------------------
ep-num-redundant-experts: 32
cuda-graph-bs: 256
ServiceArgs:
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
served_model_name: deepseek-ai/DeepSeek-R1
endpoint: dynamo.SGLangWorker.generate
port: 8000
SGLangWorker:
model-path: /model/
served-model-name: deepseek-ai/DeepSeek-R1
tp: 16
dp-size: 16
dist-init-addr: HEAD_PREFILL_NODE_IP:29500
nnodes: 2
node-rank: 0
enable-dp-attention: true
trust-remote-code: true
skip-tokenizer-init: true
disaggregation-mode: prefill
disaggregation-transfer-backend: nixl
mem-fraction-static: 0.82
ServiceArgs:
workers: 1
resources:
gpu: 8
SGLangDecodeWorker:
model-path: /model/
served-model-name: deepseek-ai/DeepSeek-R1
tp: 16
dp-size: 16
dist-init-addr: HEAD_DECODE_NODE_IP:29500
nnodes: 2
node-rank: 0
enable-dp-attention: true
trust-remote-code: true
skip-tokenizer-init: true
disaggregation-mode: decode
disaggregation-transfer-backend: nixl
mem-fraction-static: 0.82
ServiceArgs:
workers: 1
resources:
gpu: 8
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Running DeepSeek-R1 Disaggregated with WideEP
Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
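The GPU count quoted above follows from simple arithmetic, sketched below. The helper name `total_gpus` is hypothetical, and the sketch assumes every node is an 8xH100 machine, as in this example.

```python
GPUS_PER_NODE = 8  # assumption: each node in this example is an 8xH100 machine


def total_gpus(prefill_nodes, decode_nodes, gpus_per_node=GPUS_PER_NODE):
    """Return (prefill_gpus, decode_gpus, total) for a P/D deployment."""
    prefill = prefill_nodes * gpus_per_node
    decode = decode_nodes * gpus_per_node
    return prefill, decode, prefill + decode


if __name__ == "__main__":
    prefill, decode, total = total_gpus(prefill_nodes=4, decode_nodes=9)
    print(f"prefill GPUs: {prefill}, decode GPUs: {decode}, total: {total}")
    # prints "prefill GPUs: 32, decode GPUs: 72, total: 104"
```

The per-worker GPU counts (32 and 72) are also the values you would expect for `tp-size` and `dp-size` of the prefill and decode workers in the provided config.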
## Instructions
1. Build the SGLang DeepEP container.
```bash
git clone -b v0.4.8.post1 https://github.com/sgl-project/sglang.git
cd sglang/docker
docker build -f Dockerfile -t deepep .
```
You will now have a `deepep:latest` image.
2. Build the Dynamo container
```bash
cd $DYNAMO_ROOT
docker build -f container/Dockerfile.sglang-deepep . -t dynamo-deepep --no-cache
```
3. You can run this container on each 8xH100 node using the following command.
> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--volume /PATH_TO_DSR1_MODEL/:/model/ \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-deepep:latest
```
In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang` directory.
4. On the head prefill node, start `nats-server` and `etcd` using the following commands
```bash
nats-server -js &
etcd --listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://0.0.0.0:2379 \
--listen-peer-urls http://0.0.0.0:2380 \
--initial-cluster default=http://HEAD_PREFILL_NODE_IP:2380 &
```
5. On every other node, go ahead and export the `NATS_SERVER` and `ETCD_ENDPOINTS` environment variables
> [!IMPORTANT]
> You will need the IP address of your head prefill node and head decode node for the configuration files
```bash
# run this on every other node
export NATS_SERVER=nats://HEAD_PREFILL_NODE_IP:4222
export ETCD_ENDPOINTS=http://HEAD_PREFILL_NODE_IP:2379
```
6. Configure each configuration file to use the correct `dist-init-addr` and `node-rank`
Each container contains the configuration file in `configs/dsr1-wideep.yaml`. For our example, we will make the following changes:
On the prefill head node, `vim` into the configs and change the following section of the `SGLangWorker`:
```yaml
SGLangWorker:
  ...
  dist-init-addr: HEAD_PREFILL_NODE_IP:29500
  nnodes: 4
  node-rank: 0
  ...
```
On the other prefill nodes (this example has 4 prefill nodes), change the following section of the `SGLangWorker`:
```yaml
SGLangWorker:
  ...
  dist-init-addr: HEAD_PREFILL_NODE_IP:29500
  nnodes: 4
  # depending on which node, this will be 1, 2, or 3
  node-rank: 1
  ...
```
On the decode head node, `vim` into the configs and change the following section of the `SGLangDecodeWorker`:
```yaml
SGLangDecodeWorker:
  ...
  dist-init-addr: HEAD_DECODE_NODE_IP:29500
  nnodes: 9
  node-rank: 0
  ...
```
On the other decode nodes (this example has 9 decode nodes), change the following section of the `SGLangDecodeWorker`:
```yaml
SGLangDecodeWorker:
  ...
  dist-init-addr: HEAD_DECODE_NODE_IP:29500
  nnodes: 9
  # depending on which node, this will be 1 through 8
  node-rank: 1
```
7. Start up the workers using the following commands
On prefill head node
```bash
dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml
```
On each prefill child node
```bash
dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangWorker
```
On all decode nodes
```bash
dynamo serve graphs.disagg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangDecodeWorker
```
8. Run the warmup script to warm up the model
DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
```bash
./warmup.sh HEAD_PREFILL_NODE_IP
```
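The per-node settings from step 6 follow a fixed pattern: every node in a worker group shares the head node's `dist-init-addr` and the group's `nnodes`, and only `node-rank` differs. The sketch below computes those values for the 4 prefill nodes and 9 decode nodes of this example. The helper `node_overrides` is hypothetical and only prints values to copy into the YAML; the port 29500 is assumed from the `dist-init-addr` entries in the provided config.

```python
def node_overrides(head_ip, nnodes, port=29500):
    """Return one dict of settings per node in a worker group.

    Hypothetical helper: it only computes the values to put in the YAML,
    it does not edit any file.
    """
    if nnodes < 1:
        raise ValueError("nnodes must be at least 1")
    return [
        {
            "dist-init-addr": f"{head_ip}:{port}",
            "nnodes": nnodes,
            "node-rank": rank,  # 0 is the head node of the group
        }
        for rank in range(nnodes)
    ]


if __name__ == "__main__":
    # Placeholder IPs; substitute your own head node addresses.
    for settings in node_overrides("HEAD_PREFILL_NODE_IP", nnodes=4):
        print("SGLangWorker", settings)
    for settings in node_overrides("HEAD_DECODE_NODE_IP", nnodes=9):
        print("SGLangDecodeWorker", settings)
```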
## Benchmarking
In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for the prefill stress test) and a batch of 40000 requests with ISL 2000 and OSL 100 (for the decode stress test). If you want to repro these benchmarks, you will need to uncomment the labeled flags in the `configs/dsr1-wideep.yaml` file inside of the container.
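To get a feel for the size of these two batches, the short sketch below multiplies the request counts by the input and output lengths quoted above. The helper `batch_tokens` is a hypothetical name, and the sketch assumes every request in a batch has exactly the stated ISL and OSL.

```python
def batch_tokens(num_requests, isl, osl):
    """Return (input_tokens, output_tokens) for a batch of identical requests."""
    return num_requests * isl, num_requests * osl


if __name__ == "__main__":
    # (number of requests, ISL, OSL) as quoted in the repro instructions
    workloads = {
        "prefill stress test": (8192, 4096, 5),
        "decode stress test": (40000, 2000, 100),
    }
    for name, (n, isl, osl) in workloads.items():
        tokens_in, tokens_out = batch_tokens(n, isl, osl)
        print(f"{name}: {tokens_in:,} input tokens, {tokens_out:,} output tokens")
```

The prefill batch generates only 5 tokens per request, so almost all of its work is prompt processing, while the decode batch produces far more output tokens in total.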
We currently provide 2 different ways to perform an end to end benchmark which includes using our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single batch workloads in the future.
1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used.
Example usage:
```bash
# warmup
./utils/bench.sh HEAD_PREFILL_NODE_IP --type warmup
# run benchmark
./utils/bench.sh HEAD_PREFILL_NODE_IP --type e2e
```
2. **GenAI Perf to benchmark completions with custom dataset**
We provide a script that generates a JSONL file of the ShareGPT dataset and then uses GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup. Note that you will then have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at 4096 ISL and 5 OSL.
Example usage:
```bash
# generate data
python3 utils/generate_bench_data.py --output data.jsonl --num-prompts 8192 --input-len 4096 --output-len 5 --model deepseek-ai/DeepSeek-R1
# run benchmark
./utils/bench.sh HEAD_PREFILL_NODE_IP --type custom_completions
```
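For planning how long an `e2e` run will take, it helps to know how many requests each concurrency level issues. The sketch below mirrors the sizing used by the `e2e` loop in `utils/bench.sh` (request count of 10x the concurrency, 12x dataset entries). The helper `sweep_plan` is a hypothetical name and is not part of the repository.

```python
# Concurrency levels taken from CONCURRENCY_ARRAY in utils/bench.sh
CONCURRENCIES = (1, 2, 4, 16, 64, 256, 512, 1024, 2048, 4096, 8192)


def sweep_plan(concurrencies=CONCURRENCIES):
    """Return the per-level request sizing used by the e2e sweep."""
    return [
        {
            "concurrency": c,
            "request_count": c * 10,        # passed as --request-count
            "num_dataset_entries": c * 12,  # passed as --num-dataset-entries
        }
        for c in concurrencies
    ]


if __name__ == "__main__":
    for level in sweep_plan():
        print(level)
```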
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
usage() {
echo "Usage: $0 <ip> [port] [--type e2e|custom_completions|warmup]"
echo " ip: server IP address"
echo " port: server port (defaults to 8000)"
echo " --type: endpoint type - 'e2e' for chat completions, 'custom_completions' for completions, 'warmup' for warmup phases"
exit 1
}
if [ $# -lt 1 ]; then
usage
fi
IP=$1
PORT=8000
TYPE="e2e"
# Check if second argument is a port number or an option
if [[ $# -gt 1 && $2 =~ ^[0-9]+$ ]]; then
PORT=$2
shift 2
else
shift 1
fi
# Parse remaining arguments
while [[ $# -gt 0 ]]; do
case $1 in
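--type)
# Require a value after --type; otherwise `shift 2` fails and the loop never terminates
[[ $# -ge 2 ]] || usage
TYPE="$2"
shift 2
;;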
*)
usage
;;
esac
done
if [[ "$TYPE" != "e2e" && "$TYPE" != "custom_completions" && "$TYPE" != "warmup" ]]; then
echo "Error: --type must be 'e2e', 'custom_completions', or 'warmup'"
usage
fi
MODEL="deepseek-ai/DeepSeek-R1"
ARTIFACT_DIR="/benchmarks/"
if [[ "$TYPE" == "e2e" ]]; then
# E2E chat completions configuration
ISL=8000
OSL=256
CONCURRENCY_ARRAY=(1 2 4 16 64 256 512 1024 2048 4096 8192)
for concurrency in "${CONCURRENCY_ARRAY[@]}"; do
echo "Run e2e concurrency: $concurrency"
genai-perf profile \
--model ${MODEL} \
--tokenizer ${MODEL} \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url ${IP}:${PORT} \
--synthetic-input-tokens-mean ${ISL} \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean ${OSL} \
--output-tokens-stddev 0 \
--extra-inputs max_tokens:${OSL} \
--extra-inputs min_tokens:${OSL} \
--extra-inputs ignore_eos:true \
--extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
--concurrency ${concurrency} \
--request-count $(($concurrency*10)) \
--num-dataset-entries $(($concurrency*12)) \
--random-seed 100 \
--artifact-dir ${ARTIFACT_DIR} \
-- \
-v \
--max-threads ${concurrency} \
-H 'Authorization: Bearer NOT USED' \
-H 'Accept: text/event-stream'
done
elif [[ "$TYPE" == "warmup" ]]; then
echo "Starting warmup phases..."
# Phase configurations: "ISL OSL CONCURRENCY_LIST"
PHASES=(
"500 100 1,2,4,8"
"2000 100 1,2,4,8"
"4000 256 1,2,8,64"
)
for i in "${!PHASES[@]}"; do
phase_num=$((i + 1))
phase_config=(${PHASES[$i]})
ISL=${phase_config[0]}
OSL=${phase_config[1]}
concurrency_list=${phase_config[2]}
echo "Phase $phase_num: ISL=$ISL, OSL=$OSL"
# Convert comma-separated list to array
IFS=',' read -ra CONCURRENCY_ARRAY <<< "$concurrency_list"
for concurrency in "${CONCURRENCY_ARRAY[@]}"; do
echo "Run warmup phase $phase_num, concurrency: $concurrency, ISL: $ISL, OSL: $OSL"
genai-perf profile \
--model ${MODEL} \
--tokenizer ${MODEL} \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url ${IP}:${PORT} \
--synthetic-input-tokens-mean ${ISL} \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean ${OSL} \
--output-tokens-stddev 0 \
--extra-inputs max_tokens:${OSL} \
--extra-inputs min_tokens:${OSL} \
--extra-inputs ignore_eos:true \
--extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
--concurrency ${concurrency} \
--request-count $(($concurrency)) \
--warmup-request-count $(($concurrency)) \
--num-dataset-entries $(($concurrency*12)) \
--random-seed 100 \
--artifact-dir ${ARTIFACT_DIR} \
-- \
-v \
--max-threads ${concurrency} \
-H 'Authorization: Bearer NOT USED' \
-H 'Accept: text/event-stream'
echo "Sleeping for 5 seconds..."
sleep 5
done
echo "Phase $phase_num complete"
done
else
# Custom completions configuration
OSL=5
INPUT_FILE=data.jsonl
CONCURRENCY_ARRAY=(8192)
for concurrency in "${CONCURRENCY_ARRAY[@]}"; do
echo "Run custom_completions concurrency: $concurrency"
genai-perf profile \
--model ${MODEL} \
--tokenizer ${MODEL} \
--endpoint-type completions \
--streaming \
--url ${IP}:${PORT} \
--input-file ${INPUT_FILE} \
--extra-inputs max_tokens:${OSL} \
--extra-inputs min_tokens:${OSL} \
--extra-inputs ignore_eos:true \
--concurrency ${concurrency} \
--request-count ${concurrency} \
--random-seed 100 \
--artifact-dir ${ARTIFACT_DIR} \
--warmup-request-count 10 \
-- \
-v \
--max-threads 256 \
-H 'Authorization: Bearer NOT USED' \
-H 'Accept: text/event-stream'
done
fi
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import random

import numpy as np
from sglang.bench_serving import sample_random_requests
from transformers import AutoTokenizer, PreTrainedTokenizerBase

"""
Helper script that uses SGLang's random request generator to sample ShareGPT data
and then converts it to a jsonl file that can be used by GenAI perf for benchmarking

Example usage:
python3 generate_bench_data.py --model deepseek-ai/DeepSeek-R1 --output data.jsonl
"""


def main():
    parser = argparse.ArgumentParser(
        description="Use sglang.sample_random_requests to generate token-based JSONL for GenAI-Perf"
    )
    parser.add_argument(
        "--dataset-path", type=str, default="", help="Path or URL to ShareGPT JSON"
    )
    parser.add_argument(
        "--output", type=str, required=True, help="Output JSONL filename"
    )
    parser.add_argument(
        "--model",
        type=str,
        required=True,
        help="Model identifier for payloads and tokenizer name",
    )
    parser.add_argument(
        "--num-prompts", type=int, default=8192, help="Total number of samples"
    )
    parser.add_argument(
        "--input-len", type=int, default=4096, help="Target input token length"
    )
    parser.add_argument(
        "--output-len", type=int, default=5, help="Target output token length"
    )
    parser.add_argument(
        "--range-ratio", type=float, default=1.0, help="Sampling length range ratio"
    )
    parser.add_argument(
        "--random-seed", type=int, default=1, help="Random seed for reproducibility"
    )
    args = parser.parse_args()

    random.seed(args.random_seed)
    np.random.seed(args.random_seed)

    tokenizer: PreTrainedTokenizerBase = AutoTokenizer.from_pretrained(
        args.model, trust_remote_code=True
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # this is what SGL uses in their benchmarking
    # https://github.com/sgl-project/sglang/blob/b783c1cb829ec451639d1a3ce68380fb7a7be4a3/python/sglang/bench_one_batch_server.py#L131
    # We return text instead of returning raw tokens as GenAI Perf expects text during benchmarking
    samples = sample_random_requests(
        input_len=args.input_len,
        output_len=args.output_len,
        num_prompts=args.num_prompts,
        range_ratio=args.range_ratio,
        tokenizer=tokenizer,
        dataset_path=args.dataset_path,
        random_sample=True,
        return_text=True,
    )

    with open(args.output, "w", encoding="utf-8") as fout:
        for row in samples:
            # genai-perf expects this format
            payload = {
                "text": row.prompt,
                "output_length": row.output_len,
            }
            fout.write(json.dumps(payload) + "\n")


if __name__ == "__main__":
    main()