docs: instructions to run DSR1 with SGLang wideep on 104+ GPUs (#1583)

Co-authored-by: kkranen <kyle.kranen@gmail.com>

Unverified Commit 9d7624f1 authored Jun 27, 2025 by ishandhanani Committed by GitHub Jun 27, 2025
13 changed files
--- a/container/Dockerfile.sglang
+++ b/container/Dockerfile.sglang
@@ -134,15 +134,17 @@ RUN if [ "$ARCH" = "arm64" ]; then \
    fi

 # Install sglang
-# Once either 0.4.6post6 or 0.4.7 is released, we can switch back to using the published version
-# This commit references a fix to add DP attention based routing along with other perf fixes https://github.com/sgl-project/sglang/pull/6884
-ARG SGLANG_COMMIT="f1569876d54dd3b6601f5280f12652e9fbb1375c"
+# This commit references a NIXL fix that was released after the 0.4.8.post1 release https://github.com/sgl-project/sglang/pull/7330
+ARG SGLANG_COMMIT="bb9b608c86ebad7d9d01e29fe058bc184dc7285f"
 RUN --mount=type=cache,target=/root/.cache/uv \
    git clone https://github.com/sgl-project/sglang.git && \
    cd sglang && \
    git checkout ${SGLANG_COMMIT} && \
    uv pip install -e "python[all]"

+# Set env var that allows for forceful shutdown of inflight requests in SGL's TokenizerManager
+ENV SGL_FORCE_SHUTDOWN=1
+
 # Common dependencies
 RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \
    uv pip install --requirement /tmp/requirements.txt

--- a/container/Dockerfile.sglang-deepep
+++ b/container/Dockerfile.sglang-deepep
@@ -71,7 +71,8 @@ RUN rm -rf /opt/hpcx/ucx && \
 ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/ucx/lib:$LD_LIBRARY_PATH

 # Pinning to NIXL 0.2.1 right now
-# TODO: investigate pip install failure with 0.3.0 release
+# There is a fix that was merged into SGLang after 0.4.8.post1
+# TODO: Investigate perf hit of that change before we bump to an up-to-date NIXL
 ARG NIXL_COMMIT="5e4c179ee850d482a83cb2a211e0947e46281060"
 RUN git clone https://github.com/ai-dynamo/nixl.git && cd nixl && git checkout ${NIXL_COMMIT} && pip install --break-system-packages . --config-settings=setup-args="-Ducx_path=/usr/local/ucx"

@@ -79,18 +80,18 @@ WORKDIR /sgl-workspace

 RUN pip uninstall --break-system-packages -y sglang
 RUN rm -rf sglang
-# 0.4.8 has a bug with CUDA graphs and decode worker
+# Pinning to 0.4.8.post1 for now which solves a TBO issue
 # https://github.com/sgl-project/sglang/issues/7511
-RUN pip install --break-system-packages "sglang==0.4.7.post1"
+RUN pip install --break-system-packages "sglang==0.4.8.post1"

 # Allow forceful shutdown of inflight requests
 ENV SGL_FORCE_SHUTDOWN=1

 WORKDIR /sgl-workspace
-# https://github.com/ai-dynamo/dynamo/pull/1510
-ARG DYNAMO_COMMIT="382e3aedc421b3b3abc338062b332b54b5aa8529"
-ARG DYNAMO_BRANCH="ishan/cmpl-token-id"
-RUN git clone https://github.com/ai-dynamo/dynamo.git && cd dynamo && git checkout ${DYNAMO_BRANCH}
+# support batch completions for SGL benchmarking
+# https://github.com/ai-dynamo/dynamo/pull/1626
+ARG DYNAMO_COMMIT="fc16a79bfc5a4c4f58503d3c36f2013340244cac"
+RUN git clone https://github.com/ai-dynamo/dynamo.git && cd dynamo && git checkout ${DYNAMO_COMMIT}

 # install dynamo in editable mode
 WORKDIR /sgl-workspace/dynamo
@@ -149,6 +150,23 @@ RUN wget --tries=3 --waitretry=5 https://github.com/etcd-io/etcd/releases/downlo
    rm /tmp/etcd.tar.gz
 ENV PATH=/usr/local/bin/etcd/:$PATH

-COPY examples/sglang/configs/deepep/* /sgl-workspace/dynamo/examples/sglang/configs/
+# Install perf_analyzer and genai-perf
+RUN apt-get update -y && \
+    apt-get install -y --no-install-recommends \
+    rapidjson-dev \
+    zlib1g-dev
+
+RUN git clone --depth=1 https://github.com/triton-inference-server/perf_analyzer.git && \
+    mkdir perf_analyzer/build && \
+    cmake -B perf_analyzer/build -S perf_analyzer && \
+    cmake --build perf_analyzer/build -- -j8
+
+ENV PATH=/sgl-workspace/perf_analyzer/build/perf_analyzer/src/perf-analyzer-build:$PATH
+
+RUN pip install --break-system-packages genai-perf
+
+COPY examples/sglang/configs/deepseek-r1-wideep/* /sgl-workspace/dynamo/examples/sglang/configs/
+COPY examples/sglang/utils/deepseek-r1-wideep/* /sgl-workspace/dynamo/examples/sglang/utils/

 WORKDIR /sgl-workspace/dynamo/examples/sglang
+
--- a/examples/sglang/README.md
+++ b/examples/sglang/README.md
@@ -73,7 +73,10 @@ dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml

 #### Disaggregated

-SGLang uses a mini load balancer to route requests to handle disaggregated serving. The load balancer functions as follows
+<details>
+<summary>SGLang Load Balancer vs Dynamo Discovery</summary>
+
+SGLang uses a mini load balancer to route requests to handle disaggregated serving. The load balancer functions as follows:

 1. The load balancer receives a request from the client
 2. A random `(prefill, decode)` pair is selected from the pool of available workers
@@ -82,6 +85,8 @@ SGLang uses a mini load balancer to route requests to handle disaggregated servi

 Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead, we first route to a random prefill worker, select a random decode worker, and then send the request to both. Internally, SGLang's bootstrap server (which is a part of the `tokenizer_manager`) is used in conjunction with NIXL to handle the KV transfer.

+</details>
+
 > [!IMPORTANT]
 > Disaggregated serving in SGLang currently requires each worker to have the same tensor parallel size [unless you are using an MLA based model](https://github.com/sgl-project/sglang/pull/5922)

@@ -90,7 +95,7 @@ cd /workspace/examples/sglang
 dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
 ```

-##### Disaggregated with MoE and DP attention
+##### Disaggregated with MoE models and DP attention

 SGLang also supports DP attention for MoE models. We provide an example config for this in `configs/disagg-dp-attention.yaml` which is based on the [DeepSeek-R1-Small-2layers](https://huggingface.co/silence09/DeepSeek-R1-Small-2layers) model. You can use this configuration to test out disaggregated serving on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.

@@ -100,145 +105,8 @@ cd /workspace/examples/sglang
 dynamo serve graphs.disagg:Frontend -f ./configs/disagg-dp-attention.yaml
 ```

-##### Disaggregated with WideEP
-
-Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 2 H100 nodes and 1 decode worker on 4 H100 nodes (48 total GPUs). You can easily scale this to 96 GPUs or more by simply changing the configuration files.
-
-Steps to run:
-
-1. Build the SGLang DeepEP container.
-
-```bash
-git clone -b v0.4.8 https://github.com/sgl-project/sglang.git
-cd sglang/docker
-docker build -f Dockerfile -t deepep .
-```
-
-You will now have a `deepep:latest` image
-
-2. Build the Dynamo container
-
-```bash
-cd $DYNAMO_ROOT
-docker build -f container/Dockerfile.sglang-deepep . -t dynamo-deepep --no-cache
-```
-
-3. You can run this container on each 8xH100 node using the following command.
-
-> [!IMPORTANT]
-> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
-
-```bash
-docker run \
-    --gpus all \
-    -it \
-    --rm \
-    --network host \
-    --volume /PATH_TO_DSR1_MODEL/:/model/ \
-    --shm-size=10G \
-    --ulimit memlock=-1 \
-    --ulimit stack=67108864 \
-    --ulimit nofile=65536:65536 \
-    --cap-add CAP_SYS_PTRACE \
-    --ipc host \
-    dynamo-deepep:latest
-```
-
-In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang` directory.
-
-4. On the head prefill node, start `nats-server` and `etcd` using the following commands
-
-```bash
-nats-server -js &
-etcd --listen-client-urls http://0.0.0.0:2379 \
-     --advertise-client-urls http://0.0.0.0:2379 \
-     --listen-peer-urls http://0.0.0.0:2380 \
-     --initial-cluster default=http://HEAD_PREFILL_NODE_IP:2380 &
-```
-
-5. On every other node, go ahead and export the `NATS_SERVER` and `ETCD_ENDPOINTS` environment variables
-
-> [!IMPORTANT]
-> You will need the IP address of your head prefill node and head decode node for the configuration files
-
-```bash
-# run this on every other node
-export NATS_SERVER=nats://HEAD_PREFILL_NODE_IP:4222
-export ETCD_ENDPOINTS=http://HEAD_PREFILL_NODE_IP:2379
-```
-
-6. Configure each configuration file to use the correct `dist-init-addr`, and `node-rank`
-
-Each container contains the configuration file in `configs/dsr1.yaml`. For our example, we will make the following changes:
-
-On the prefill head node, `vim` into the configs and change the following section of the `SGLangWorker`:
-
-```yaml
-SGLangWorker:
-    ...
-    dist-init-addr: HEAD_PREFILL_NODE_IP
-    nnodes: 2
-    node-rank: 0
-    ...
-```
-
-On the other prefill node (since this example has 2 prefill nodes), change the following section of the `SGLangWorker`:
-
-```yaml
-SGLangWorker:
-    ...
-    dist-init-addr: HEAD_PREFILL_NODE_IP
-    nnodes: 2
-    node-rank: 1
-    ...
-```
-
-On the decode head node, `vim` into the configs and change the following section of the `SGLangDecodeWorker`:
-
-```yaml
-SGLangDecodeWorker:
-    ...
-    dist-init-addr: HEAD_DECODE_NODE_IP
-    nnodes: 4
-    node-rank: 0
-    ...
-```
-
-On the other decode nodes (this example has 4 decode nodes), change the following section of the `SGLangDecodeWorker`:
-
-```yaml
-SGLangDecodeWorker:
-    ...
-    dist-init-addr: HEAD_DECODE_NODE_IP
-    nnodes: 4
-    # depending on which node this will be 1, 2, and 3
-    node-rank: 1
-```
-
-7. Start up the workers using the following commands
-
-On prefill head node
+In order to scale to the full DeepSeek-R1 model, you can follow the instructions in the [multinode-examples.md](./multinode-examples.md) file.

-```bash
-dynamo serve graphs.agg:Frontend -f configs/dsr1.yaml
-```
-
-On prefill child node
-
-```bash
-dynamo serve graphs.agg:Frontend -f configs/dsr1.yaml --service-name SGLangWorker
-```
-
-On all decode nodes
-
-```bash
-dynamo serve graphs.disagg:Frontend -f configs/dsr1.yaml --service-name SGLangDecodeWorker
-```
-
-8. Run the warmup script to warm up the model
-
-DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
+##### Disaggregated with WideEP

-```bash
-./warmup.sh HEAD_PREFILL_NODE_IP
-```
+Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can find detailed deployment and benchmarking instructions [here](./dsr1-wideep.md).
--- a/examples/sglang/components/decode_worker.py
+++ b/examples/sglang/components/decode_worker.py
@@ -19,7 +19,7 @@ import logging

 import sglang as sgl
 from utils.protocol import DisaggPreprocessedRequest
-from utils.sglang import parse_sglang_args
+from utils.sgl_utils import parse_sglang_args

 from dynamo.sdk import endpoint, service


--- a/examples/sglang/components/worker.py
+++ b/examples/sglang/components/worker.py
@@ -34,7 +34,7 @@ import sglang as sgl
 from components.decode_worker import SGLangDecodeWorker
 from sglang.srt.utils import get_ip
 from utils.protocol import DisaggPreprocessedRequest, PreprocessedRequest
-from utils.sglang import parse_sglang_args
+from utils.sgl_utils import parse_sglang_args

 from dynamo.llm import ModelType, register_llm
 from dynamo.sdk import async_on_start, depends, dynamo_context, endpoint, service

--- a/examples/sglang/configs/deepep/warmup.sh
+++ b/examples/sglang/configs/deepep/warmup.sh
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-#!/bin/bash
-
-if [ $# -lt 1 ]; then
-    echo "Usage: $0 <ip> [port]"
-    echo "port defaults to 8000 if not specified"
-    exit 1
-fi
-
-IP=$1
-PORT=${2:-8000}
-
-echo "Running initial warmup 5 times with 5 seconds between each request"
-for i in {1..5}; do
-  echo "Running iteration $i..."
-  curl ${IP}:${PORT}/v1/chat/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-      "model": "deepseek-ai/DeepSeek-R1",
-      "messages": [
-      {
-          "role": "user",
-          "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the worldIn the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. 
Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for"
-      }
-      ],
-      "stream":true,
-      "max_tokens": 100
-    }'
-  echo "Sleeping for 5 seconds..."
-  sleep 5
-done
-
-echo "Increasing output length to 500 tokens and running same request 10 times"
-for i in {1..10}; do
-  echo "Running iteration $i..."
-  curl ${IP}:${PORT}/v1/chat/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-      "model": "deepseek-ai/DeepSeek-R1",
-      "messages": [
-      {
-          "role": "user",
-          "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the worldIn the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. 
Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for"
-      }
-      ],
-      "stream":true,
-      "max_tokens": 500
-    }'
-  echo "Sleeping for 5 seconds..."
-  sleep 5
-done
-
-echo "Running 5 parallel requests with 500 tokens each"
-for i in {1..5}; do
-  curl ${IP}:${PORT}/v1/chat/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-      "model": "deepseek-ai/DeepSeek-R1",
-      "messages": [
-      {
-          "role": "user",
-          "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the worldIn the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. 
Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden.Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for"
-      }
-      ],
-      "stream":true,
-      "max_tokens": 1000
-    }' &
-done
-
-wait
-echo "Parallel requests complete"
--- a/examples/sglang/configs/deepep/deepep.json
+++ b/examples/sglang/configs/deepep/deepep.json
--- a/examples/sglang/configs/deepep/dsr1.yaml
+++ b/examples/sglang/configs/deepep/dsr1.yaml
@@ -26,10 +26,10 @@ SGLangWorker:
  disaggregation-transfer-backend: nixl
  disaggregation-bootstrap-port: 30001
  dist-init-addr: HEAD_PREFILL_NODE_IP:29500
-  nnodes: 2
+  nnodes: 4
  node-rank: 0
-  tp-size: 16
-  dp-size: 16
+  tp-size: 32
+  dp-size: 32
  enable-dp-attention: true
  decode-log-interval: 1
  # when MoE is enabled ep-size == tp-size
@@ -43,13 +43,16 @@ SGLangWorker:
  enable-two-batch-overlap: true
  deepep-mode: normal
  mem-fraction-static: 0.85
-  # SGLang's instructions for benchmarking include these flags
+  # ------------------------------------------------------------------------------------------------
+  # If you are trying to repro SGLang's blog post benchmarking - you will need to add these flags
+  # The `init-expert-location` configs can be found in the SGL blog post repro instructions
  #max-running-requests: 8192
  #max-total-tokens: 131072
  #context-length: 8192
  #init-expert-location: /configs/prefill_in4096.json
-  #deepep-config: /configs/deepep.json
-  chunked-prefill-size: 524288
+  #chunked-prefill-size: 524288
+  # ------------------------------------------------------------------------------------------------
+  deepep-config: /configs/deepep.json
  ep-num-redundant-experts: 32
  ep-dispatch-algorithm: dynamic
  eplb-algorithm: deepseek
@@ -69,10 +72,10 @@ SGLangDecodeWorker:
  disaggregation-transfer-backend: nixl
  disaggregation-bootstrap-port: 30001
  dist-init-addr: HEAD_DECODE_NODE_IP:29500
-  nnodes: 4
+  nnodes: 9
  node-rank: 0
-  tp-size: 32
-  dp-size: 32
+  tp-size: 72
+  dp-size: 72
  enable-dp-attention: true
  decode-log-interval: 1
  enable-deepep-moe: true
@@ -86,10 +89,13 @@ SGLangDecodeWorker:
  enable-two-batch-overlap: true
  deepep-mode: low_latency
  mem-fraction-static: 0.835
-  # SGLang's instructions for benchmarking include these flags
+  # ------------------------------------------------------------------------------------------------
+  # If you are trying to repro SGLang's blog post benchmarking - you will need to add these flags
+  # The `init-expert-location` configs can be found in the SGL blog post repro instructions
  #max-running-requests: 18432
  #context-length: 4500
  #init-expert-location: /configs/decode_in2000out100.json
+  # ------------------------------------------------------------------------------------------------
  ep-num-redundant-experts: 32
  cuda-graph-bs: 256
  ServiceArgs:

--- a/examples/sglang/configs/dsr1.yaml
+++ b/examples/sglang/configs/dsr1.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+Frontend:
+  served_model_name: deepseek-ai/DeepSeek-R1
+  endpoint: dynamo.SGLangWorker.generate
+  port: 8000
+
+SGLangWorker:
+  model-path: /model/
+  served-model-name: deepseek-ai/DeepSeek-R1
+  tp: 16
+  dp-size: 16
+  dist-init-addr: HEAD_PREFILL_NODE_IP:29500
+  nnodes: 2
+  node-rank: 0
+  enable-dp-attention: true
+  trust-remote-code: true
+  skip-tokenizer-init: true
+  disaggregation-mode: prefill
+  disaggregation-transfer-backend: nixl
+  mem-fraction-static: 0.82
+  ServiceArgs:
+    workers: 1
+    resources:
+      gpu: 8
+
+SGLangDecodeWorker:
+  model-path: /model/
+  served-model-name: deepseek-ai/DeepSeek-R1
+  tp: 16
+  dp-size: 16
+  dist-init-addr: HEAD_DECODE_NODE_IP:29500
+  nnodes: 2
+  node-rank: 0
+  enable-dp-attention: true
+  trust-remote-code: true
+  skip-tokenizer-init: true
+  disaggregation-mode: decode
+  disaggregation-transfer-backend: nixl
+  mem-fraction-static: 0.82
+  ServiceArgs:
+    workers: 1
+    resources:
+      gpu: 8
\ No newline at end of file
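
The worker configs in this commit pair each `nnodes` value with a tensor-parallel size. As a rough sanity check, the sketch below recomputes the GPU totals from those values, assuming 8 GPUs per node (the walkthrough uses 8xH100 nodes) and assuming the tensor-parallel size is meant to equal `nnodes` times GPUs per node, as it does in every config shown here. The `check` helper and the dictionary labels are hypothetical names used only for illustration.

```python
# Assumption: every node is an 8-GPU machine (8xH100 in this walkthrough).
GPUS_PER_NODE = 8

# (nnodes, tensor-parallel size) pairs copied from the configs in this commit.
workers = {
    "wideep prefill (SGLangWorker)": (4, 32),
    "wideep decode (SGLangDecodeWorker)": (9, 72),
    "dsr1.yaml prefill (SGLangWorker)": (2, 16),
    "dsr1.yaml decode (SGLangDecodeWorker)": (2, 16),
}


def check(workers, gpus_per_node=GPUS_PER_NODE):
    """Return GPUs per worker; raise if tp size and node count disagree."""
    totals = {}
    for name, (nnodes, tp_size) in workers.items():
        expected = nnodes * gpus_per_node
        if tp_size != expected:
            raise ValueError(f"{name}: tp size {tp_size} != {nnodes} nodes x {gpus_per_node} GPUs")
        totals[name] = expected
    return totals


totals = check(workers)
wideep_total = (
    totals["wideep prefill (SGLangWorker)"]
    + totals["wideep decode (SGLangDecodeWorker)"]
)
print(wideep_total)  # 104
```

The WideEP total of 32 prefill GPUs plus 72 decode GPUs matches the "104+ GPUs" in the commit title.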
--- a/examples/sglang/dsr1-wideep.md
+++ b/examples/sglang/dsr1-wideep.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Running DeepSeek-R1 Disaggregated with WideEP
+
+Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
+
+## Instructions
+
+1. Build the SGLang DeepEP container.
+
+```bash
+git clone -b v0.4.8.post1 https://github.com/sgl-project/sglang.git
+cd sglang/docker
+docker build -f Dockerfile -t deepep .
+```
+
+You will now have a `deepep:latest` image
+
+2. Build the Dynamo container
+
+```bash
+cd $DYNAMO_ROOT
+docker build -f container/Dockerfile.sglang-deepep . -t dynamo-deepep --no-cache
+```
+
+3. You can run this container on each 8xH100 node using the following command.
+
+> [!IMPORTANT]
+> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
+
+```bash
+docker run \
+    --gpus all \
+    -it \
+    --rm \
+    --network host \
+    --volume /PATH_TO_DSR1_MODEL/:/model/ \
+    --shm-size=10G \
+    --ulimit memlock=-1 \
+    --ulimit stack=67108864 \
+    --ulimit nofile=65536:65536 \
+    --cap-add CAP_SYS_PTRACE \
+    --ipc host \
+    dynamo-deepep:latest
+```
+
+In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang` directory.
+
+4. On the head prefill node, start `nats-server` and `etcd` using the following commands
+
+```bash
+nats-server -js &
+etcd --listen-client-urls http://0.0.0.0:2379 \
+     --advertise-client-urls http://0.0.0.0:2379 \
+     --listen-peer-urls http://0.0.0.0:2380 \
+     --initial-cluster default=http://HEAD_PREFILL_NODE_IP:2380 &
+```
+
+5. On every other node, go ahead and export the `NATS_SERVER` and `ETCD_ENDPOINTS` environment variables
+
+> [!IMPORTANT]
+> You will need the IP address of your head prefill node and head decode node for the configuration files
+
+```bash
+# run this on every other node
+export NATS_SERVER=nats://HEAD_PREFILL_NODE_IP:4222
+export ETCD_ENDPOINTS=http://HEAD_PREFILL_NODE_IP:2379
+```
+
+6. Configure each configuration file to use the correct `dist-init-addr` and `node-rank`
+
+Each container contains the configuration file in `configs/dsr1-wideep.yaml`. For our example, we will make the following changes:
+
+On the prefill head node, open the config file (for example with `vim`) and change the following section of the `SGLangWorker`:
+
+```yaml
+SGLangWorker:
+    ...
+    dist-init-addr: HEAD_PREFILL_NODE_IP
+    nnodes: 4
+    node-rank: 0
+    ...
+```
+
+On the other prefill nodes (this example has 4 prefill nodes), change the following section of the `SGLangWorker`:
+
+```yaml
+SGLangWorker:
+    ...
+    dist-init-addr: HEAD_PREFILL_NODE_IP
+    nnodes: 4
+    # depending on the node, this will be 1, 2, or 3
+    node-rank: 1
+    ...
+```
+
+On the decode head node, open the config file (for example with `vim`) and change the following section of the `SGLangDecodeWorker`:
+
+```yaml
+SGLangDecodeWorker:
+    ...
+    dist-init-addr: HEAD_DECODE_NODE_IP
+    nnodes: 9
+    node-rank: 0
+    ...
+```
+
+On the other decode nodes (this example has 9 decode nodes), change the following section of the `SGLangDecodeWorker`:
+
+```yaml
+SGLangDecodeWorker:
+    ...
+    dist-init-addr: HEAD_DECODE_NODE_IP
+    nnodes: 9
+    # depending on the node, this will be 1 through 8
+    node-rank: 1
+```
+
+7. Start up the workers using the following commands
+
+On prefill head node
+
+```bash
+dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml
+```
+
+On each prefill child node
+
+```bash
+dynamo serve graphs.agg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangWorker
+```
+
+On all decode nodes
+
+```bash
+dynamo serve graphs.disagg:Frontend -f configs/dsr1-wideep.yaml --service-name SGLangDecodeWorker
+```
+
+8. Run the warmup script to warm up the model
+
+DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
+
+```bash
+./warmup.sh HEAD_PREFILL_NODE_IP
+```
+
+## Benchmarking
+
+In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark its prefill and decode workers. It does this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for the prefill stress test) and a batch of 40000 requests with ISL 2000 and OSL 100 (for the decode stress test). If you want to repro these benchmarks, you will need to uncomment the labeled flags in the `configs/dsr1-wideep.yaml` file inside the container.
+
+We currently provide 2 different ways to perform an end to end benchmark which includes using our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single batch workloads in the future.
+
+1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
+We've found that 8k ISL 256 OSL provides a good baseline for measuring end-to-end disaggregated serving performance for DSR1. Because WideEP allows for higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can take a while to warm up, so the same script also offers a short ramping warmup mode.
+
+Example usage:
+```bash
+# warmup
+./utils/bench.sh HEAD_PREFILL_NODE_IP --type warmup
+# run benchmark
+./utils/bench.sh HEAD_PREFILL_NODE_IP --type e2e
+```
+
+2. **GenAI Perf to benchmark completions with custom dataset**
+We provide a script that generates a JSONL file from the ShareGPT dataset, and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup. Note, however, that you will then have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at 4096 ISL and 5 OSL.
+
+Example usage:
+```bash
+# generate data
+python3 utils/generate_bench_data.py --output data.jsonl --num-prompts 8192 --input-len 4096 --output-len 5 --model deepseek-ai/DeepSeek-R1
+# run benchmark
+./utils/bench.sh HEAD_PREFILL_NODE_IP --type custom_completions
+```
--- a/examples/sglang/utils/deepseek-r1-wideep/bench.sh
+++ b/examples/sglang/utils/deepseek-r1-wideep/bench.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+usage() {
+    echo "Usage: $0 <ip> [port] [--type e2e|custom_completions|warmup]"
+    echo "  ip: server IP address"
+    echo "  port: server port (defaults to 8000)"
+    echo "  --type: endpoint type - 'e2e' for chat completions, 'custom_completions' for completions, 'warmup' for warmup phases"
+    exit 1
+}
+
+if [ $# -lt 1 ]; then
+    usage
+fi
+
+IP=$1
+PORT=8000
+TYPE="e2e"
+
+# Check if second argument is a port number or an option
+if [[ $# -gt 1 && $2 =~ ^[0-9]+$ ]]; then
+    PORT=$2
+    shift 2
+else
+    shift 1
+fi
+
+# Parse remaining arguments
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --type)
+            # Require a value; otherwise `shift 2` fails and the loop never ends
+            [[ $# -ge 2 ]] || usage
+            TYPE="$2"
+            shift 2
+            ;;
+        *)
+            usage
+            ;;
+    esac
+done
+
+if [[ "$TYPE" != "e2e" && "$TYPE" != "custom_completions" && "$TYPE" != "warmup" ]]; then
+    echo "Error: --type must be 'e2e', 'custom_completions', or 'warmup'"
+    usage
+fi
+
+MODEL="deepseek-ai/DeepSeek-R1"
+ARTIFACT_DIR="/benchmarks/"
+
+if [[ "$TYPE" == "e2e" ]]; then
+    # E2E chat completions configuration
+    ISL=8000
+    OSL=256
+    CONCURRENCY_ARRAY=(1 2 4 16 64 256 512 1024 2048 4096 8192)
+
+    for concurrency in "${CONCURRENCY_ARRAY[@]}"; do
+        echo "Run e2e concurrency: $concurrency"
+
+        genai-perf profile \
+            --model ${MODEL} \
+            --tokenizer ${MODEL} \
+            --endpoint-type chat \
+            --endpoint /v1/chat/completions \
+            --streaming \
+            --url ${IP}:${PORT} \
+            --synthetic-input-tokens-mean ${ISL} \
+            --synthetic-input-tokens-stddev 0 \
+            --output-tokens-mean ${OSL} \
+            --output-tokens-stddev 0 \
+            --extra-inputs max_tokens:${OSL} \
+            --extra-inputs min_tokens:${OSL} \
+            --extra-inputs ignore_eos:true \
+            --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
+            --concurrency ${concurrency} \
+            --request-count $(($concurrency*10)) \
+            --num-dataset-entries $(($concurrency*12)) \
+            --random-seed 100 \
+            --artifact-dir ${ARTIFACT_DIR} \
+            -- \
+            -v \
+            --max-threads ${concurrency} \
+            -H 'Authorization: Bearer NOT USED' \
+            -H 'Accept: text/event-stream'
+    done
+
+elif [[ "$TYPE" == "warmup" ]]; then
+    echo "Starting warmup phases..."
+
+    # Phase configurations: "ISL OSL CONCURRENCY_LIST"
+    PHASES=(
+        "500 100 1,2,4,8"
+        "2000 100 1,2,4,8"
+        "4000 256 1,2,8,64"
+    )
+
+    for i in "${!PHASES[@]}"; do
+        phase_num=$((i + 1))
+        phase_config=(${PHASES[$i]})
+        ISL=${phase_config[0]}
+        OSL=${phase_config[1]}
+        concurrency_list=${phase_config[2]}
+
+        echo "Phase $phase_num: ISL=$ISL, OSL=$OSL"
+
+        # Convert comma-separated list to array
+        IFS=',' read -ra CONCURRENCY_ARRAY <<< "$concurrency_list"
+
+        for concurrency in "${CONCURRENCY_ARRAY[@]}"; do
+            echo "Run warmup phase $phase_num, concurrency: $concurrency, ISL: $ISL, OSL: $OSL"
+
+            genai-perf profile \
+                --model ${MODEL} \
+                --tokenizer ${MODEL} \
+                --endpoint-type chat \
+                --endpoint /v1/chat/completions \
+                --streaming \
+                --url ${IP}:${PORT} \
+                --synthetic-input-tokens-mean ${ISL} \
+                --synthetic-input-tokens-stddev 0 \
+                --output-tokens-mean ${OSL} \
+                --output-tokens-stddev 0 \
+                --extra-inputs max_tokens:${OSL} \
+                --extra-inputs min_tokens:${OSL} \
+                --extra-inputs ignore_eos:true \
+                --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
+                --concurrency ${concurrency} \
+                --request-count $(($concurrency)) \
+                --warmup-request-count $(($concurrency)) \
+                --num-dataset-entries $(($concurrency*12)) \
+                --random-seed 100 \
+                --artifact-dir ${ARTIFACT_DIR} \
+                -- \
+                -v \
+                --max-threads ${concurrency} \
+                -H 'Authorization: Bearer NOT USED' \
+                -H 'Accept: text/event-stream'
+
+            echo "Sleeping for 5 seconds..."
+            sleep 5
+        done
+
+        echo "Phase $phase_num complete"
+    done
+
+else
+    # Custom completions configuration
+    OSL=5
+    INPUT_FILE=data.jsonl
+    CONCURRENCY_ARRAY=(8192)
+
+    for concurrency in "${CONCURRENCY_ARRAY[@]}"; do
+        echo "Run custom_completions concurrency: $concurrency"
+
+        genai-perf profile \
+            --model ${MODEL} \
+            --tokenizer ${MODEL} \
+            --endpoint-type completions \
+            --streaming \
+            --url ${IP}:${PORT} \
+            --input-file ${INPUT_FILE} \
+            --extra-inputs max_tokens:${OSL} \
+            --extra-inputs min_tokens:${OSL} \
+            --extra-inputs ignore_eos:true \
+            --concurrency ${concurrency} \
+            --request-count ${concurrency} \
+            --random-seed 100 \
+            --artifact-dir ${ARTIFACT_DIR} \
+            --warmup-request-count 10 \
+            -- \
+            -v \
+            --max-threads 256 \
+            -H 'Authorization: Bearer NOT USED' \
+            -H 'Accept: text/event-stream'
+    done
+fi
\ No newline at end of file
--- a/examples/sglang/utils/deepseek-r1-wideep/generate_bench_data.py
+++ b/examples/sglang/utils/deepseek-r1-wideep/generate_bench_data.py
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+
+"""
+Helper script that uses SGLang's random request generator to sample ShareGPT data
+and then converts it to a jsonl file that can be used by GenAI perf for benchmarking
+
+Example usage:
+python3 generate_bench_data.py --model deepseek-ai/DeepSeek-R1 --output data.jsonl
+"""
+
+import argparse
+import json
+import random
+
+import numpy as np
+from sglang.bench_serving import sample_random_requests
+from transformers import AutoTokenizer, PreTrainedTokenizerBase
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Use sglang.bench_serving.sample_random_requests to generate token-based JSONL for GenAI-Perf"
+    )
+    parser.add_argument(
+        "--dataset-path", type=str, default="", help="Path or URL to ShareGPT JSON"
+    )
+    parser.add_argument(
+        "--output", type=str, required=True, help="Output JSONL filename"
+    )
+    parser.add_argument(
+        "--model",
+        type=str,
+        required=True,
+        help="Model identifier for payloads and tokenizer name",
+    )
+    parser.add_argument(
+        "--num-prompts", type=int, default=8192, help="Total number of samples"
+    )
+    parser.add_argument(
+        "--input-len", type=int, default=4096, help="Target input token length"
+    )
+    parser.add_argument(
+        "--output-len", type=int, default=5, help="Target output token length"
+    )
+    parser.add_argument(
+        "--range-ratio", type=float, default=1.0, help="Sampling length range ratio"
+    )
+    parser.add_argument(
+        "--random-seed", type=int, default=1, help="Random seed for reproducibility"
+    )
+    args = parser.parse_args()
+
+    random.seed(args.random_seed)
+    np.random.seed(args.random_seed)
+
+    tokenizer: PreTrainedTokenizerBase = AutoTokenizer.from_pretrained(
+        args.model, trust_remote_code=True
+    )
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    # this is what SGL uses in their benchmarking
+    # https://github.com/sgl-project/sglang/blob/b783c1cb829ec451639d1a3ce68380fb7a7be4a3/python/sglang/bench_one_batch_server.py#L131
+    # We return text instead of returning raw tokens as GenAI Perf expects text during benchmarking
+    samples = sample_random_requests(
+        input_len=args.input_len,
+        output_len=args.output_len,
+        num_prompts=args.num_prompts,
+        range_ratio=args.range_ratio,
+        tokenizer=tokenizer,
+        dataset_path=args.dataset_path,
+        random_sample=True,
+        return_text=True,
+    )
+
+    with open(args.output, "w", encoding="utf-8") as fout:
+        for row in samples:
+            # genai-perf expects this format
+            payload = {
+                "text": row.prompt,
+                "output_length": row.output_len,
+            }
+            fout.write(json.dumps(payload) + "\n")
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/sglang/utils/sglang.py
+++ b/examples/sglang/utils/sglang.py