feat: sglang + gb200 (#2223)

Commit 959f810f, authored Aug 01, 2025 by ishandhanani, committed by GitHub Aug 01, 2025
10 changed files
--- a/components/backends/sglang/README.md
+++ b/components/backends/sglang/README.md
@@ -43,11 +43,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 ### Large Scale P/D and WideEP Features
-| Feature            | SGLang | Notes                                                                 |
+| Feature             | SGLang | Notes                                                        |
-|--------------------|--------|-----------------------------------------------------------------------|
+|---------------------|--------|--------------------------------------------------------------|
-| **WideEP**         | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556)                                     |
+| **WideEP**          | ✅     | Full support on H100s/GB200                                  |
-| **DP Rank Routing**| 🚧    | Direct routing supported. Process per DP rank is not supported        |
+| **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not route to DP workers |
-| **GB200 Support**  | 🚧    | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
+| **GB200 Support**   | ✅     |                                                              |
 ## Quick Start
@@ -155,7 +155,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
-### Run on multi-node
+### Run a multi-node sized model
 - **[Run a multi-node model](docs/multinode-examples.md)**
 ### Large scale P/D disaggregation with WideEP

--- a/components/backends/sglang/docs/dsr1-wideep-gb200.md
+++ b/components/backends/sglang/docs/dsr1-wideep-gb200.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Running DeepSeek-R1 Disaggregated with WideEP on GB200s
+Dynamo supports SGLang's GB200 implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. Full end to end optimization is still a work in progress, but you can get this up and running with the following steps. In this example, we will run 1 prefill worker on 2 GB200 nodes (4 GPUs each) and 1 decode worker on 12 GB200 nodes (56 GPUs in total).
+## Instructions
+1. Build the Dynamo container
+```bash
+cd $DYNAMO_ROOT
+docker build \
+  -f container/Dockerfile.sglang-wideep \
+  -t dynamo-wideep-gb200 \
+  --build-arg MODE=blackwell \
+  --build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \
+  --build-arg ARCH=arm64 \
+  --build-arg ARCH_ALT=aarch64 \
+  .
+```
+2. You can run this container on each 4xGB200 node using the following command.
+> [!IMPORTANT]
+> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
+```bash
+docker run \
+    --gpus all \
+    -it \
+    --rm \
+    --network host \
+    --volume /PATH_TO_DSR1_MODEL/:/model/ \
+    --shm-size=10G \
+    --ulimit memlock=-1 \
+    --ulimit stack=67108864 \
+    --ulimit nofile=65536:65536 \
+    --cap-add CAP_SYS_PTRACE \
+    --ipc host \
+    dynamo-wideep-gb200:latest
+```
+3. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
+```bash
+./utils/gen_env_vars.sh
+```
+4. Run the ingress and prefill worker
+```bash
+# run ingress
+python3 -m dynamo.frontend --http-port=8000 &
+# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
+python3 utils/sgl_http_server.py --ns dynamo &
+# run prefill worker
+SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
+MC_TE_METRIC=true \
+SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+MC_FORCE_MNNVL=1 \
+NCCL_MNNVL_ENABLE=1 \
+NCCL_CUMEM_ENABLE=1 \
+SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+PYTHONUNBUFFERED=1 \
+python3 components/worker.py \
+  --served-model-name deepseek-ai/DeepSeek-R1 \
+  --model-path /model/ \
+  --skip-tokenizer-init \
+  --trust-remote-code \
+  --disaggregation-mode prefill \
+  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
+  --disaggregation-bootstrap-port 30001 \
+  --disaggregation-transfer-backend nixl \
+  --nnodes 2 \
+  --node-rank 0 \
+  --tp-size 8 \
+  --dp-size 8 \
+  --enable-dp-attention \
+  --host 0.0.0.0 \
+  --decode-log-interval 1 \
+  --max-running-requests 6144 \
+  --context-length 2716 \
+  --disable-radix-cache \
+  --enable-deepep-moe \
+  --deepep-mode low_latency \
+  --moe-dense-tp-size 1 \
+  --enable-dp-lm-head \
+  --disable-shared-experts-fusion \
+  --ep-num-redundant-experts 32 \
+  --ep-dispatch-algorithm static \
+  --eplb-algorithm deepseek \
+  --attention-backend cutlass_mla \
+  --watchdog-timeout 1000000 \
+  --disable-cuda-graph \
+  --chunked-prefill-size 16384 \
+  --max-total-tokens 32768 \
+  --mem-fraction-static 0.8 \
+  --log-level debug
+```
+On the other prefill node (this example has 2 total prefill nodes), run the same prefill worker command but change `--node-rank` to 1
+5. Run the decode worker on the head decode node
+```bash
+SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
+MC_TE_METRIC=true \
+SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
+SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+NCCL_MNNVL_ENABLE=1 \
+MC_FORCE_MNNVL=1 \
+NCCL_CUMEM_ENABLE=1 \
+SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+PYTHONUNBUFFERED=1 \
+python3 components/decode_worker.py \
+  --served-model-name deepseek-ai/DeepSeek-R1 \
+  --model-path /model/ \
+  --skip-tokenizer-init \
+  --trust-remote-code \
+  --disaggregation-mode decode \
+  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
+  --disaggregation-bootstrap-port 30001 \
+  --nnodes 12 \
+  --node-rank 0 \
+  --tp-size 48 \
+  --dp-size 48 \
+  --enable-dp-attention \
+  --host 0.0.0.0 \
+  --decode-log-interval 1 \
+  --max-running-requests 36864 \
+  --context-length 2716 \
+  --disable-radix-cache \
+  --enable-deepep-moe \
+  --deepep-mode low_latency \
+  --moe-dense-tp-size 1 \
+  --enable-dp-lm-head \
+  --cuda-graph-bs 768 \
+  --disable-shared-experts-fusion \
+  --ep-num-redundant-experts 32 \
+  --ep-dispatch-algorithm static \
+  --eplb-algorithm deepseek \
+  --attention-backend cutlass_mla \
+  --watchdog-timeout 1000000 \
+  --chunked-prefill-size 36864 \
+  --mem-fraction-static 0.82 \
+  --log-level debug
+```
+On the other decode nodes (this example has 12 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
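
The parallelism sizes in the GB200 commands above follow from the node layout: each GB200 node in this example has 4 GPUs, so the 2-node prefill worker uses `--tp-size 8` and the 12-node decode worker uses `--tp-size 48`. As a minimal sketch of that arithmetic (the function and variable names below are illustrative and are not part of Dynamo or SGLang), the sizes and per-node ranks can be checked like this:

```bash
#!/usr/bin/env bash
# Illustrative helpers (not part of the repo): derive sizes from the node layout.
GPUS_PER_NODE=4      # GB200 nodes in this example expose 4 GPUs each
PREFILL_NODES=2
DECODE_NODES=12

# total_gpus <nodes>: GPUs spanned by one worker, i.e. the value used for
# both --tp-size and --dp-size in the commands above.
total_gpus() {
  echo $(( $1 * GPUS_PER_NODE ))
}

# node_ranks <nodes>: the --node-rank values to use across a worker's nodes,
# printed on one line (rank 0 is the head node).
node_ranks() {
  # Unquoted command substitution joins seq's lines with single spaces.
  echo $(seq 0 $(( $1 - 1 )))
}

echo "prefill tp/dp size: $(total_gpus "$PREFILL_NODES")"
echo "decode tp/dp size:  $(total_gpus "$DECODE_NODES")"
echo "decode node ranks:  $(node_ranks "$DECODE_NODES")"
```

If the node counts change, `--nnodes`, `--tp-size`, and `--dp-size` would need to change together so that they keep describing the same set of GPUs.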
--- a/components/backends/sglang/docs/dsr1-wideep-h100.md
+++ b/components/backends/sglang/docs/dsr1-wideep-h100.md
@@ -9,22 +9,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca
 ## Instructions
-1. Pull the SGLang release `v0.4.8.post1` container. We are actively working on validating newer releases.
+1. Build the Dynamo container
-```bash
-docker pull lmsysorg/sglang:v0.4.8.post1-cu126
-```
-You can also pull a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags)
-2. Build the Dynamo container
 ```bash
 cd $DYNAMO_ROOT
 docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
 ```
-3. You can run this container on each 8xH100 node using the following command.
+You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
+2. You can run this container on each 8xH100 node using the following command.
 > [!IMPORTANT]
 > We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
@@ -47,17 +41,17 @@ docker run \
 In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory.
-4. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
+3. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
 ```bash
 ./utils/gen_env_vars.sh
 ```
-5. Run the ingress and prefill worker
+4. Run the ingress and prefill worker
 ```bash
 # run ingress
-dynamo run in=http out=dyn &
+python3 -m dynamo.frontend --http-port=8000 &
 # optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
 python3 utils/sgl_http_server.py --ns dynamo &
 # run prefill worker
@@ -93,7 +87,7 @@ python3 -m dynamo.sglang.worker \
 On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3
-7. Run the decode worker on the head decode node
+5. Run the decode worker on the head decode node
 ```bash
 python3 -m dynamo.sglang.decode_worker \
@@ -121,7 +115,7 @@ python3 -m dynamo.sglang.decode_worker \
  --deepep-mode low_latency \
  --mem-fraction-static 0.835 \
  --ep-num-redundant-experts 32 \
-  --cuda-graph-bs 256
+  --cuda-graph-bs 128
 ```
 On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
@@ -131,6 +125,7 @@ On the other decode nodes (this example has 9 total decode nodes), run the same
 In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
 prefill:
 ```bash
 ...
 --max-running-requests 8192 \
@@ -142,6 +137,7 @@ prefill:
 ```
 decode:
 ```bash
 ...
 --max-running-requests 18432 \
@@ -152,9 +148,10 @@ decode:
 We currently provide 2 different ways to perform an end to end benchmark which includes using our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single batch workloads in the future.
 1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
-We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used.
+   We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used.
 Example usage:
 ```bash
 # warmup
 ./utils/bench.sh HEAD_PREFILL_NODE_IP --type warmup
@@ -165,9 +162,10 @@ curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
 ```
 2. **GenAI Perf to benchmark completions with custom dataset**
-We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAIPerf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
+   We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup. Note, however, that you will then have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
 Example usage:
 ```bash
 # generate data
 python3 src/dynamo/sglang/utils/generate_bench_data.py --output data.jsonl --num-prompts 8192 --input-len 4096 --output-len 5 --model deepseek-ai/DeepSeek-R1

--- a/components/backends/sglang/slurm_jobs/README.md
+++ b/components/backends/sglang/slurm_jobs/README.md
@@ -45,6 +45,7 @@ logs/
 ## Setup
 For simplicity of the example, we will make some assumptions about your SLURM cluster:
 1. We assume you have access to a SLURM cluster with multiple GPU nodes
   available. For functional testing, most setups should be fine. For performance
   testing, you should aim to allocate groups of nodes that are performantly
@@ -61,7 +62,11 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
 ## Usage
+> [!NOTE]
+> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.
 1. **Submit a benchmark job**:
   ```bash
   python submit_job_script.py \
     --template job_script_template.j2 \
@@ -72,6 +77,7 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
   ```
   **Required arguments**:
   - `--template`: Path to Jinja2 template file
   - `--model-dir`: Model directory path
   - `--config-dir`: Config directory path
@@ -79,26 +85,65 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
   - `--account`: SLURM account
   **Optional arguments**:
   - `--prefill-nodes`: Number of prefill nodes (default: `2`)
   - `--decode-nodes`: Number of decode nodes (default: `2`)
   - `--gpus-per-node`: Number of GPUs per node (default: `8`)
   - `--network-interface`: Network interface to use (default: `eth3`)
   - `--job-name`: SLURM job name (default: `dynamo_setup`)
   - `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)
+   - `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`)
+   - `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`)
   **Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
-2. **Monitor job progress**:
+2. **Example with different GPU types**:
+   ```bash
+   # For H100 with Dynamo (default)
+   python submit_job_script.py \
+     --template job_script_template.j2 \
+     --model-dir /path/to/model \
+     --config-dir /path/to/configs \
+     --container-image container-image-uri \
+     --account your-slurm-account \
+     --gpu-type h100
+   # For GB200 with SGLang
+   python submit_job_script.py \
+     --template job_script_template.j2 \
+     --model-dir /path/to/model \
+     --config-dir /path/to/configs \
+     --container-image container-image-uri \
+     --account your-slurm-account \
+     --gpu-type gb200 \
+     --use-sglang-commands \
+     --gpus-per-node 4
+   ```
+3. **Monitor job progress**:
   ```bash
   squeue -u $USER
   ```
-3. **Check logs in real-time**:
+4. **Check logs in real-time**:
   ```bash
   tail -f logs/{JOB_ID}/log.out
   ```
-4. **Monitor GPU utilization**:
+   You can view logs of all prefill or decode workers simultaneously by running:
+   ```bash
+   # prefill workers err (or .out)
+   tail -f logs/{JOB_ID}/*_prefill.err
+   # decode workers err (or .out)
+   tail -f logs/{JOB_ID}/*_decode.err
+   ```
+5. **Monitor GPU utilization**:
   ```bash
   tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
   ```

--- a/components/backends/sglang/slurm_jobs/job_script_template.j2
+++ b/components/backends/sglang/slurm_jobs/job_script_template.j2
@@ -7,6 +7,7 @@
 #SBATCH --time={{ time_limit }}
 #SBATCH --output=logs/%j/log.out
 #SBATCH --error=logs/%j/log.err
+#SBATCH --partition={{ partition }}
 # Constants
 PREFILL_NODES={{ prefill_nodes }}
@@ -20,6 +21,8 @@ MODEL_DIR="{{ model_dir }}"
 CONFIG_DIR="{{ config_dir }}"
 CONTAINER_IMAGE="{{ container_image }}"
 NETWORK_INTERFACE="{{ network_interface }}"
+GPU_TYPE="{{ gpu_type | default('h100') }}"
+USE_SGLANG_COMMANDS="{{ use_sglang_commands | default(false) }}"
 {% raw %}
@@ -36,14 +39,14 @@ for i in "${!nodes[@]}"; do
    echo "Node $i: ${nodes[$i]}"
 done
-PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+')
+PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ip route get $(getent ahosts ${nodes[0]} | grep STREAM | head -1 | awk '{print $1}') | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}')
 if [ -z "$PREFILL_HOST_IP" ]; then
    echo "Error: Could not retrieve IP address for prefill host ${nodes[0]} on interface $NETWORK_INTERFACE"
    exit 1
 fi
 echo "Prefill host IP address: $PREFILL_HOST_IP"
-DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+')
+DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ip route get $(getent ahosts ${nodes[$PREFILL_NODES]} | grep STREAM | head -1 | awk '{print $1}') | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}')
 if [ -z "$DECODE_HOST_IP" ]; then
    echo "Error: Could not retrieve IP address for decode host ${nodes[$PREFILL_NODES]} on interface $NETWORK_INTERFACE"
    exit 1
@@ -54,21 +57,25 @@ echo "Decode host IP address: $DECODE_HOST_IP"
 ENROOT_ARGS="\
    --container-image=${CONTAINER_IMAGE} \
    --no-container-entrypoint \
-    --container-mount-home \
+    --no-container-mount-home \
-    --no-container-remap-root \
    --container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \
 "
+# Build common worker arguments
+WORKER_ARGS="--gpu_type ${GPU_TYPE} --gpus_per_node ${GPUS_PER_NODE}"
+if [ "$USE_SGLANG_COMMANDS" = "True" ]; then
+    WORKER_ARGS="${WORKER_ARGS} --use-sglang-commands"
+fi
 # Launch prefill tasks on the first PREFILL_NODES nodes
 for i in $(seq 0 $((PREFILL_NODES - 1))); do
    node=${nodes[$i]}
    rank=$i
    echo "Launching prefill task on node ${i} (rank ${rank}): $node"
-    echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err"
-    echo "Command: python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &"
+    cmd="srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log ${WORKER_ARGS}"
-    srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \
+    echo "$cmd"
-    --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err \
+    $cmd &
-    python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &
 done
 # Launch decode tasks on the next DECODE_NODES nodes
@@ -76,11 +83,10 @@ for i in $(seq $PREFILL_NODES $((PREFILL_NODES + DECODE_NODES - 1))); do
    node=${nodes[$i]}
    rank=$((i - PREFILL_NODES))
    echo "Launching decode task on node ${i} (rank ${rank}): $node"
-    echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err"
-    echo "Command: python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &"
+    cmd="srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log ${WORKER_ARGS}"
-    srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \
+    echo "$cmd"
-    --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err \
+    $cmd &
-    python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &
 done
 echo ""

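The new host IP lookup in `job_script_template.j2` chains two parsing steps: it takes the first `STREAM` address from `getent ahosts`, then reads the token after `src` in the `ip route get` output. The sketch below isolates those two steps so they can be tried offline against sample text. The function names and sample addresses are made up for illustration, and real output on your cluster may be shaped differently.

```bash
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names) mirroring the template's parsing.

# first_stream_addr: read `getent ahosts <host>` output on stdin and print
# the address column of the first STREAM entry.
first_stream_addr() {
  grep STREAM | head -1 | awk '{print $1}'
}

# route_src: read `ip route get <addr>` output on stdin and print the token
# that follows "src", i.e. the source address selected for that route.
route_src() {
  awk '{for (i = 1; i <= NF; i++) if ($i == "src") print $(i + 1)}'
}

# Sample inputs with made-up addresses, so no network access is needed.
sample_ahosts='10.0.0.12  STREAM node-a
10.0.0.12  DGRAM
10.0.0.12  RAW'
sample_route='local 10.0.0.12 dev lo src 10.0.0.12 uid 1000
    cache'

printf '%s\n' "$sample_ahosts" | first_stream_addr
printf '%s\n' "$sample_route" | route_src
```

When the route output contains no `src` token, `route_src` prints nothing, which is the case the template's `-z` checks on `PREFILL_HOST_IP` and `DECODE_HOST_IP` are there to catch.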
--- a/components/backends/sglang/slurm_jobs/scripts/gb200.sh
+++ b/components/backends/sglang/slurm_jobs/scripts/gb200.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Function to print usage
+print_usage() {
+    echo "Usage: $0 <mode> <cmd>"
+    echo "  mode: prefill or decode"
+    echo "  cmd:  dynamo or sglang"
+    echo ""
+    echo "Examples:"
+    echo "  $0 prefill dynamo"
+    echo "  $0 decode sglang"
+    exit 1
+}
+# Check if correct number of arguments provided
+if [ $# -ne 2 ]; then
+    echo "Error: Expected 2 arguments, got $#"
+    print_usage
+fi
+# Parse arguments
+mode=$1
+cmd=$2
+# Validate mode argument
+if [ "$mode" != "prefill" ] && [ "$mode" != "decode" ]; then
+    echo "Error: mode must be 'prefill' or 'decode', got '$mode'"
+    print_usage
+fi
+# Validate cmd argument
+if [ "$cmd" != "dynamo" ] && [ "$cmd" != "sglang" ]; then
+    echo "Error: cmd must be 'dynamo' or 'sglang', got '$cmd'"
+    print_usage
+fi
+echo "Mode: $mode"
+echo "Command: $cmd"
+# Check if required environment variables are set
+if [ -z "$HOST_IP" ]; then
+    echo "Error: HOST_IP environment variable is not set"
+    exit 1
+fi
+if [ -z "$PORT" ]; then
+    echo "Error: PORT environment variable is not set"
+    exit 1
+fi
+if [ -z "$TOTAL_GPUS" ]; then
+    echo "Error: TOTAL_GPUS environment variable is not set"
+    exit 1
+fi
+if [ -z "$RANK" ]; then
+    echo "Error: RANK environment variable is not set"
+    exit 1
+fi
+if [ -z "$TOTAL_NODES" ]; then
+    echo "Error: TOTAL_NODES environment variable is not set"
+    exit 1
+fi
+# TODO: since the args for sglang and dynamo are the same, we can be a bit cleaner here
+# Construct command based on mode and cmd
+if [ "$mode" = "prefill" ]; then
+    if [ "$cmd" = "dynamo" ]; then
+        # We are not using a init-expert-location file for e2e benchmarking
+        # We also don't currently have a --deepep-config file for GB200
+        # Need to increase --context-length to 10k for 8k1k benchmarking
+        SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
+        MC_TE_METRIC=true \
+        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+        SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+        SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+        MC_FORCE_MNNVL=1 \
+        NCCL_MNNVL_ENABLE=1 \
+        NCCL_CUMEM_ENABLE=1 \
+        SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+        SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+        PYTHONUNBUFFERED=1 \
+        python3 components/worker.py \
+            --served-model-name deepseek-ai/DeepSeek-R1 \
+            --model-path /model/ \
+            --skip-tokenizer-init \
+            --trust-remote-code \
+            --disaggregation-mode prefill \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --disaggregation-bootstrap-port 30001 \
+            --disaggregation-transfer-backend nixl \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --host 0.0.0.0 \
+            --decode-log-interval 1 \
+            --max-running-requests 6144 \
+            --context-length 2716 \
+            --disable-radix-cache \
+            --enable-deepep-moe \
+            --deepep-mode low_latency \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --disable-shared-experts-fusion \
+            --ep-num-redundant-experts 32 \
+            --ep-dispatch-algorithm static \
+            --eplb-algorithm deepseek \
+            --attention-backend cutlass_mla \
+            --watchdog-timeout 1000000 \
+            --disable-cuda-graph \
+            --chunked-prefill-size 16384 \
+            --max-total-tokens 32768 \
+            --mem-fraction-static 0.8 \
+            --log-level debug
+    elif [ "$cmd" = "sglang" ]; then
+        # GB200 sglang prefill command
+        # We are not using a init-expert-location file for e2e benchmarking
+        # We also don't currently have a --deepep-config file for GB200
+        # Need to increase --context-length to 10k for 8k1k benchmarking
+        SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
+        MC_TE_METRIC=true \
+        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+        SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+        SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+        NCCL_MNNVL_ENABLE=1 \
+        MC_FORCE_MNNVL=1 \
+        NCCL_CUMEM_ENABLE=1 \
+        SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+        SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+        PYTHONUNBUFFERED=1 \
+        python3 -m sglang.launch_server \
+            --served-model-name deepseek-ai/DeepSeek-R1 \
+            --model-path /model/ \
+            --trust-remote-code \
+            --disaggregation-mode prefill \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --disaggregation-bootstrap-port 30001 \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --host 0.0.0.0 \
+            --decode-log-interval 1 \
+            --max-running-requests 6144 \
+            --context-length 2716 \
+            --disable-radix-cache \
+            --enable-deepep-moe \
+            --deepep-mode low_latency \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --disable-shared-experts-fusion \
+            --ep-num-redundant-experts 32 \
+            --ep-dispatch-algorithm static \
+            --eplb-algorithm deepseek \
+            --attention-backend cutlass_mla \
+            --watchdog-timeout 1000000 \
+            --disable-cuda-graph \
+            --chunked-prefill-size 16384 \
+            --max-total-tokens 32768 \
+            --mem-fraction-static 0.8 \
+            --log-level debug
+    fi
+elif [ "$mode" = "decode" ]; then
+    if [ "$cmd" = "dynamo" ]; then
+        # Need to increase --context-length to 10k for 8k1k benchmarking
+        # We are not using a init-expert-location file for e2e benchmarking
+        SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
+        MC_TE_METRIC=true \
+        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+        SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+        SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
+        SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+        NCCL_MNNVL_ENABLE=1 \
+        MC_FORCE_MNNVL=1 \
+        NCCL_CUMEM_ENABLE=1 \
+        SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+        SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+        PYTHONUNBUFFERED=1 \
+        python3 components/decode_worker.py \
+            --served-model-name deepseek-ai/DeepSeek-R1 \
+            --model-path /model/ \
+            --skip-tokenizer-init \
+            --trust-remote-code \
+            --disaggregation-mode decode \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --disaggregation-bootstrap-port 30001 \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --host 0.0.0.0 \
+            --decode-log-interval 1 \
+            --max-running-requests 36864 \
+            --context-length 2716 \
+            --disable-radix-cache \
+            --enable-deepep-moe \
+            --deepep-mode low_latency \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --cuda-graph-bs 768 \
+            --disable-shared-experts-fusion \
+            --ep-num-redundant-experts 32 \
+            --ep-dispatch-algorithm static \
+            --eplb-algorithm deepseek \
+            --attention-backend cutlass_mla \
+            --watchdog-timeout 1000000 \
+            --chunked-prefill-size 36864 \
+            --mem-fraction-static 0.82 \
+            --log-level debug
+    elif [ "$cmd" = "sglang" ]; then
+        # GB200 sglang decode command
+        # Need to increase --context-length to 10k for 8k1k benchmarking
+        # We are not using a init-expert-location file for e2e benchmarking
+        SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
+        MC_TE_METRIC=true \
+        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+        SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+        SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
+        SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+        NCCL_MNNVL_ENABLE=1 \
+        MC_FORCE_MNNVL=1 \
+        NCCL_CUMEM_ENABLE=1 \
+        SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+        SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+        PYTHONUNBUFFERED=1 \
+        python3 -m sglang.launch_server \
+            --model-path /model/ \
+            --trust-remote-code \
+            --disaggregation-mode decode \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --disaggregation-bootstrap-port 30001 \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --host 0.0.0.0 \
+            --decode-log-interval 1 \
+            --max-running-requests 36864 \
+            --context-length 2716 \
+            --disable-radix-cache \
+            --enable-deepep-moe \
+            --deepep-mode low_latency \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --cuda-graph-bs 768 \
+            --disable-shared-experts-fusion \
+            --ep-num-redundant-experts 32 \
+            --ep-dispatch-algorithm static \
+            --eplb-algorithm deepseek \
+            --attention-backend cutlass_mla \
+            --watchdog-timeout 1000000 \
+            --chunked-prefill-size 36864 \
+            --mem-fraction-static 0.82 \
+            --log-level debug
+    fi
+fi
--- a/components/backends/sglang/slurm_jobs/scripts/h100.sh
+++ b/components/backends/sglang/slurm_jobs/scripts/h100.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Function to print usage
+print_usage() {
+    echo "Usage: $0 <mode> <cmd>"
+    echo "  mode: prefill or decode"
+    echo "  cmd:  dynamo or sglang"
+    echo ""
+    echo "Examples:"
+    echo "  $0 prefill dynamo"
+    echo "  $0 decode sglang"
+    exit 1
+}
+# Check if correct number of arguments provided
+if [ $# -ne 2 ]; then
+    echo "Error: Expected 2 arguments, got $#"
+    print_usage
+fi
+# Parse arguments
+mode=$1
+cmd=$2
+# Validate mode argument
+if [ "$mode" != "prefill" ] && [ "$mode" != "decode" ]; then
+    echo "Error: mode must be 'prefill' or 'decode', got '$mode'"
+    print_usage
+fi
+# Validate cmd argument
+if [ "$cmd" != "dynamo" ] && [ "$cmd" != "sglang" ]; then
+    echo "Error: cmd must be 'dynamo' or 'sglang', got '$cmd'"
+    print_usage
+fi
+echo "Mode: $mode"
+echo "Command: $cmd"
+# Check if required environment variables are set
+if [ -z "$HOST_IP" ]; then
+    echo "Error: HOST_IP environment variable is not set"
+    exit 1
+fi
+if [ -z "$PORT" ]; then
+    echo "Error: PORT environment variable is not set"
+    exit 1
+fi
+if [ -z "$TOTAL_GPUS" ]; then
+    echo "Error: TOTAL_GPUS environment variable is not set"
+    exit 1
+fi
+if [ -z "$RANK" ]; then
+    echo "Error: RANK environment variable is not set"
+    exit 1
+fi
+if [ -z "$TOTAL_NODES" ]; then
+    echo "Error: TOTAL_NODES environment variable is not set"
+    exit 1
+fi
+# Construct command based on mode and cmd
+if [ "$mode" = "prefill" ]; then
+    if [ "$cmd" = "dynamo" ]; then
+        # H100 dynamo prefill command
+        python3 components/worker.py \
+            --model-path /model/ \
+            --served-model-name deepseek-ai/DeepSeek-R1 \
+            --skip-tokenizer-init \
+            --disaggregation-mode prefill \
+            --disaggregation-transfer-backend nixl \
+            --disaggregation-bootstrap-port 30001 \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --decode-log-interval 1 \
+            --enable-deepep-moe \
+            --page-size 1 \
+            --trust-remote-code \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --disable-radix-cache \
+            --watchdog-timeout 1000000 \
+            --enable-two-batch-overlap \
+            --deepep-mode normal \
+            --mem-fraction-static 0.85 \
+            --deepep-config /configs/deepep.json \
+            --ep-num-redundant-experts 32 \
+            --ep-dispatch-algorithm dynamic \
+            --eplb-algorithm deepseek
+    elif [ "$cmd" = "sglang" ]; then
+        # H100 sglang prefill command
+        python3 -m sglang.launch_server \
+            --model-path /model/ \
+            --served-model-name deepseek-ai/DeepSeek-R1 \
+            --disaggregation-transfer-backend nixl \
+            --disaggregation-mode prefill \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --decode-log-interval 1 \
+            --enable-deepep-moe \
+            --page-size 1 \
+            --host 0.0.0.0 \
+            --trust-remote-code \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --disable-radix-cache \
+            --watchdog-timeout 1000000 \
+            --enable-two-batch-overlap \
+            --deepep-mode normal \
+            --mem-fraction-static 0.85 \
+            --ep-num-redundant-experts 32 \
+            --ep-dispatch-algorithm dynamic \
+            --eplb-algorithm deepseek \
+            --deepep-config /configs/deepep.json
+    fi
+elif [ "$mode" = "decode" ]; then
+    if [ "$cmd" = "dynamo" ]; then
+        # H100 dynamo decode command
+        python3 components/decode_worker.py \
+            --model-path /model/ \
+            --served-model-name deepseek-ai/DeepSeek-R1 \
+            --skip-tokenizer-init \
+            --disaggregation-mode decode \
+            --disaggregation-transfer-backend nixl \
+            --disaggregation-bootstrap-port 30001 \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --decode-log-interval 1 \
+            --enable-deepep-moe \
+            --page-size 1 \
+            --trust-remote-code \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --disable-radix-cache \
+            --watchdog-timeout 1000000 \
+            --enable-two-batch-overlap \
+            --deepep-mode low_latency \
+            --mem-fraction-static 0.835 \
+            --ep-num-redundant-experts 32 \
+            --cuda-graph-bs 256
+    elif [ "$cmd" = "sglang" ]; then
+        # H100 sglang decode command
+        python3 -m sglang.launch_server \
+            --model-path /model/ \
+            --disaggregation-transfer-backend nixl \
+            --disaggregation-mode decode \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --decode-log-interval 1 \
+            --enable-deepep-moe \
+            --page-size 1 \
+            --host 0.0.0.0 \
+            --trust-remote-code \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --disable-radix-cache \
+            --watchdog-timeout 1000000 \
+            --enable-two-batch-overlap \
+            --deepep-mode low_latency \
+            --mem-fraction-static 0.835 \
+            --ep-num-redundant-experts 32 \
+            --cuda-graph-bs 256
+    fi
+fi
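The argument and environment checks at the top of `h100.sh` can be mirrored in Python, for example to pre-validate a launch before handing it to SLURM. This is a minimal sketch, not part of the PR: `validate_launch` is a hypothetical helper, and unlike the shell script (which exits at the first failure) it collects every problem it finds.

```python
# Hypothetical helper mirroring the checks in h100.sh; not part of the PR.
REQUIRED_ENV_VARS = ("HOST_IP", "PORT", "TOTAL_GPUS", "RANK", "TOTAL_NODES")
VALID_MODES = ("prefill", "decode")
VALID_CMDS = ("dynamo", "sglang")


def validate_launch(mode: str, cmd: str, env: dict[str, str]) -> list[str]:
    """Return the problems the script would report; an empty list means OK."""
    errors: list[str] = []
    if mode not in VALID_MODES:
        errors.append(f"mode must be 'prefill' or 'decode', got '{mode}'")
    if cmd not in VALID_CMDS:
        errors.append(f"cmd must be 'dynamo' or 'sglang', got '{cmd}'")
    for name in REQUIRED_ENV_VARS:
        # Treat unset and empty the same way, like `[ -z "$VAR" ]` in bash.
        if not env.get(name):
            errors.append(f"{name} environment variable is not set")
    return errors


# Example values are made up for illustration.
example_env = {"HOST_IP": "10.0.0.1", "PORT": "29500", "TOTAL_GPUS": "16", "RANK": "0"}
print(validate_launch("decode", "sglang", example_env))
# prints ['TOTAL_NODES environment variable is not set']
```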
--- a/components/backends/sglang/slurm_jobs/scripts/worker_setup.py
+++ b/components/backends/sglang/slurm_jobs/scripts/worker_setup.py
@@ -8,8 +8,8 @@ benchmark_dynamo.sh script.
 The script will:
 - Setup the environment
- Update the YAML config file
+- Generate the python3 command to run the prefill or decode worker
- Start Dynamo graphs.disagg service
+- Start Dynamo (or SGLang)
 - Monitor the GPU utilization
 """
@@ -165,6 +165,19 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
        default=None,
        help="File to log GPU utilization (default: None)",
    )
+    parser.add_argument(
+        "--use-sglang-commands",
+        action="store_true",
+        default=False,
+        help="Launch SGLang servers instead of Dynamo workers; useful for benchmarking SGLang directly",
+    )
+    parser.add_argument(
+        "--gpu_type",
+        type=str,
+        choices=["h100", "gb200"],
+        default="h100",
+        help="Type of GPU to use",
+    )
    return parser.parse_args(args)
@@ -181,73 +194,114 @@ def _validate_args(args: argparse.Namespace) -> None:
        raise ValueError("GPUs per node must be at least 1")
-def setup_prefill_node(
+def get_sglang_mini_lb_command_args(prefill_host_ip: str, decode_host_ip: str) -> str:
-    rank: int, prefill_host_ip: str, total_nodes: int, total_gpus: int
+    cmd = (
-) -> int:
+        "python3 -m sglang.srt.disaggregation.launch_lb "
+        f"--prefill http://{prefill_host_ip}:30000 "
+        f"--decode http://{decode_host_ip}:30000 "
+        "--host 0.0.0.0 "
+        "--port 8000 "
+        "--timeout 3600"
+    )
+    return cmd
+def setup_env_vars_for_gpu_script(
+    host_ip: str,
+    rank: int,
+    total_gpus: int,
+    total_nodes: int,
+    port: int = DIST_INIT_PORT,
+) -> None:
+    """Set up the environment variables required by the GPU scripts (h100.sh, gb200.sh)"""
+    os.environ["HOST_IP"] = host_ip
+    os.environ["PORT"] = str(port)
+    os.environ["TOTAL_GPUS"] = str(total_gpus)
+    os.environ["RANK"] = str(rank)
+    os.environ["TOTAL_NODES"] = str(total_nodes)
+    logging.info(f"Set HOST_IP: {host_ip}")
+    logging.info(f"Set PORT: {port}")
+    logging.info(f"Set TOTAL_GPUS: {total_gpus}")
+    logging.info(f"Set RANK: {rank}")
+    logging.info(f"Set TOTAL_NODES: {total_nodes}")
+def get_gpu_command(worker_type: str, use_sglang_commands: bool, gpu_type: str) -> str:
+    """Generate command to run the appropriate GPU script"""
+    script_name = f"{gpu_type}.sh"
+    script_path = Path(__file__).parent / script_name
+    mode = worker_type  # "prefill" or "decode"
+    cmd = "sglang" if use_sglang_commands else "dynamo"
+    return f"bash {script_path} {mode} {cmd}"
+def setup_head_prefill_node(prefill_host_ip: str) -> None:
    """
-    Setup the prefill node.
+    Set up NATS, etcd, ingress, and HTTP servers on the prefill host node.
    """
-    if rank == 0:
+    logging.info(f"Starting nats server on node {prefill_host_ip}")
-        logging.info(f"Setting up host prefill node: {rank}")
-        logging.info(f"Starting nats server on node {rank} with IP {prefill_host_ip}")
+    nats_process = run_command("nats-server -js", background=True)
+    if not nats_process:
-        nats_process = run_command("nats-server -js", background=True)
+        raise RuntimeError("Failed to start nats-server")
-        if not nats_process:
-            raise RuntimeError("Failed to start nats-server")
+    logging.info(f"Starting etcd server on node {prefill_host_ip}")
+    etcd_cmd = (
-        etcd_cmd = (
+        f"etcd --listen-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} "
-            f"etcd --listen-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} "
+        f"--advertise-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} "
-            f"--advertise-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} "
+        f"--listen-peer-urls {ETCD_LISTEN_ADDR}:{ETCD_PEER_PORT} "
-            f"--listen-peer-urls {ETCD_LISTEN_ADDR}:{ETCD_PEER_PORT} "
+        f"--initial-cluster default=http://{prefill_host_ip}:{ETCD_PEER_PORT}"
-            f"--initial-cluster default=http://{prefill_host_ip}:{ETCD_PEER_PORT}"
+    )
-        )
-        etcd_process = run_command(etcd_cmd, background=True)
+    etcd_process = run_command(etcd_cmd, background=True)
-        if not etcd_process:
+    if not etcd_process:
-            raise RuntimeError("Failed to start etcd")
+        raise RuntimeError("Failed to start etcd")
-        ingress_process = run_command("dynamo run in=http out=dyn", background=True)
+    logging.info(f"Starting ingress server on node {prefill_host_ip}")
-        if not ingress_process:
+    ingress_process = run_command(
-            raise RuntimeError("Failed to start ingress")
+        "dynamo run in=http out=dyn --http-port=8000", background=True
+    )
+    if not ingress_process:
+        raise RuntimeError("Failed to start ingress")
+    logging.info(
+        f"Starting http server on port 9001 for flush_cache endpoint on node {prefill_host_ip}"
+    )
+    cache_flush_server_cmd = "python3 utils/sgl_http_server.py --ns dynamo"
+    cache_flush_server_process = run_command(cache_flush_server_cmd, background=True)
+    if not cache_flush_server_process:
+        raise RuntimeError("Failed to start cache flush server")
+def setup_prefill_node(
+    rank: int,
+    prefill_host_ip: str,
+    total_nodes: int,
+    total_gpus: int,
+    use_sglang_commands: bool,
+    gpu_type: str,
+) -> int:
+    """
+    Setup the prefill node.
+    """
+    if not use_sglang_commands:
+        if rank == 0:
+            setup_head_prefill_node(prefill_host_ip)
+        else:
+            logging.info(f"Setting up child prefill node: {rank}")
+            if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"):
+                raise RuntimeError("Failed to connect to etcd")
    else:
-        logging.info(f"Setting up child prefill node: {rank}")
+        logging.info("Using SGLang servers. No need to set up etcd or NATS")
-        if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"):
-            raise RuntimeError("Failed to connect to etcd")
-    # NOTE: This implements the example in examples/sglang/dsr1-wideep.md
+    # Setup environment variables for GPU script
-    # For other examples, the command might have to be modified.
+    setup_env_vars_for_gpu_script(prefill_host_ip, rank, total_gpus, total_nodes)
-    dynamo_cmd = (
-        f"python3 -m dynamo.sglang.worker "
+    # Use appropriate GPU script instead of generating command directly
-        "--model-path /model/ "
+    cmd_to_run = get_gpu_command("prefill", use_sglang_commands, gpu_type)
-        "--served-model-name deepseek-ai/DeepSeek-R1 "
+    return run_command(cmd_to_run)
-        "--skip-tokenizer-init "
-        "--disaggregation-mode prefill "
-        "--disaggregation-transfer-backend nixl "
-        "--disaggregation-bootstrap-port 30001 "
-        f"--dist-init-addr {prefill_host_ip}:{DIST_INIT_PORT} "
-        f"--nnodes {total_nodes} "
-        f"--node-rank {rank} "
-        f"--tp-size {total_gpus} "
-        f"--dp-size {total_gpus} "
-        "--enable-dp-attention "
-        "--decode-log-interval 1 "
-        "--enable-deepep-moe "
-        "--page-size 1 "
-        "--trust-remote-code "
-        "--moe-dense-tp-size 1 "
-        "--enable-dp-lm-head "
-        "--disable-radix-cache "
-        "--watchdog-timeout 1000000 "
-        "--enable-two-batch-overlap "
-        "--deepep-mode normal "
-        "--mem-fraction-static 0.85 "
-        "--deepep-config /configs/deepep.json "
-        "--ep-num-redundant-experts 32 "
-        "--ep-dispatch-algorithm dynamic "
-        "--eplb-algorithm deepseek "
-    )
-    return run_command(dynamo_cmd)
 def setup_decode_node(
@@ -256,45 +310,29 @@ def setup_decode_node(
    prefill_host_ip: str,
    total_nodes: int,
    total_gpus: int,
+    use_sglang_commands: bool,
+    gpu_type: str,
 ) -> int:
    """
    Setup the decode node.
    """
    logging.info(f"Setting up child decode node: {rank}")
-    if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"):
+    if use_sglang_commands:
-        raise RuntimeError("Failed to connect to etcd")
+        sgl_mini_lb_cmd = get_sglang_mini_lb_command_args(
+            prefill_host_ip, decode_host_ip
-    dynamo_cmd = (
+        )
-        "python3 -m dynamo.sglang.decode_worker "
+        run_command(sgl_mini_lb_cmd, background=True)
-        "--model-path /model/ "
+    else:
-        "--served-model-name deepseek-ai/DeepSeek-R1 "
+        if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"):
-        "--skip-tokenizer-init "
+            raise RuntimeError("Failed to connect to etcd")
-        "--disaggregation-mode decode "
-        "--disaggregation-transfer-backend nixl "
+    # Setup environment variables for GPU script
-        "--disaggregation-bootstrap-port 30001 "
+    setup_env_vars_for_gpu_script(decode_host_ip, rank, total_gpus, total_nodes)
-        f"--dist-init-addr {decode_host_ip}:{DIST_INIT_PORT} "
-        f"--nnodes {total_nodes} "
-        f"--node-rank {rank} "
-        f"--tp-size {total_gpus} "
-        f"--dp-size {total_gpus} "
-        "--enable-dp-attention "
-        "--decode-log-interval 1 "
-        "--enable-deepep-moe "
-        "--page-size 1 "
-        "--trust-remote-code "
-        "--moe-dense-tp-size 1 "
-        "--enable-dp-lm-head "
-        "--disable-radix-cache "
-        "--watchdog-timeout 1000000 "
-        "--enable-two-batch-overlap "
-        "--deepep-mode low_latency "
-        "--mem-fraction-static 0.835 "
-        "--ep-num-redundant-experts 32 "
-        "--cuda-graph-bs 256 "
-    )
-    return run_command(dynamo_cmd)
+    # Use appropriate GPU script instead of generating command directly
+    cmd_to_run = get_gpu_command("decode", use_sglang_commands, gpu_type)
+    return run_command(cmd_to_run)
 def setup_env(prefill_host_ip: str):
@@ -321,6 +359,7 @@ def main(input_args: list[str] | None = None):
    logging.info(f"Prefill host IP: {args.prefill_host_ip}")
    logging.info(f"Decode host IP: {args.decode_host_ip}")
    logging.info(f"Rank: {args.rank}")
+    logging.info(f"Use SGLang commands: {args.use_sglang_commands}")
    setup_env(args.prefill_host_ip)
    if args.worker_type == "prefill":
@@ -329,6 +368,8 @@ def main(input_args: list[str] | None = None):
            args.prefill_host_ip,
            args.total_nodes,
            args.total_nodes * args.gpus_per_node,
+            args.use_sglang_commands,
+            args.gpu_type,
        )
    else:
        setup_decode_node(
@@ -337,6 +378,8 @@ def main(input_args: list[str] | None = None):
            args.prefill_host_ip,
            args.total_nodes,
            args.total_nodes * args.gpus_per_node,
+            args.use_sglang_commands,
+            args.gpu_type,
        )
    logging.info(f"{args.worker_type.capitalize()} node setup complete")
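The refactor above replaces the inline command strings with two steps: export the five variables the GPU scripts expect, then call `bash <gpu_type>.sh <mode> <cmd>`. The sketch below combines both steps into one pure function so the result can be inspected without touching `os.environ`. It is an illustration only: `build_gpu_launch`, the `/scripts` directory, and the port value are assumptions, not code from the PR.

```python
from pathlib import PurePosixPath


def build_gpu_launch(
    worker_type: str,
    use_sglang_commands: bool,
    gpu_type: str,
    host_ip: str,
    rank: int,
    total_gpus: int,
    total_nodes: int,
    port: int,
    script_dir: str,
) -> tuple[str, dict[str, str]]:
    """Return (command, env) for a worker launch (hypothetical helper)."""
    # Same five variables that h100.sh / gb200.sh check before starting.
    env = {
        "HOST_IP": host_ip,
        "PORT": str(port),
        "TOTAL_GPUS": str(total_gpus),
        "RANK": str(rank),
        "TOTAL_NODES": str(total_nodes),
    }
    backend = "sglang" if use_sglang_commands else "dynamo"
    script_path = PurePosixPath(script_dir) / f"{gpu_type}.sh"
    return f"bash {script_path} {worker_type} {backend}", env


# Example values are made up for illustration.
cmd, env = build_gpu_launch(
    "decode", True, "gb200", "10.0.0.2", 1, 8, 2, port=29500, script_dir="/scripts"
)
print(cmd)  # prints: bash /scripts/gb200.sh decode sglang
```

Returning the environment as a dictionary, rather than mutating the process environment as `setup_env_vars_for_gpu_script` does, makes the launch easier to unit-test; whether that trade-off suits the real script is a design choice left to the authors.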

--- a/components/backends/sglang/slurm_jobs/submit_job_script.py
+++ b/components/backends/sglang/slurm_jobs/submit_job_script.py
@@ -86,7 +86,7 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
    parser.add_argument("--config-dir", required=True, help="Config directory path")
    parser.add_argument("--container-image", required=True, help="Container image")
    parser.add_argument(
-        "--time-limit", default="01:00:00", help="Time limit (HH:MM:SS)"
+        "--time-limit", default="04:00:00", help="Time limit (HH:MM:SS)"
    )
    parser.add_argument(
        "--prefill-nodes", type=int, default=2, help="Number of prefill nodes"
@@ -100,6 +100,20 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
    parser.add_argument(
        "--network-interface", default="eth3", help="Network interface to use"
    )
+    parser.add_argument(
+        "--gpu-type", choices=["h100", "gb200"], default="h100", help="GPU type to use"
+    )
+    parser.add_argument(
+        "--use-sglang-commands",
+        action="store_true",
+        default=False,
+        help="Use SGLang commands instead of Dynamo",
+    )
+    parser.add_argument(
+        "--partition",
+        default="batch",
+        help="SLURM partition to use",
+    )
    return parser.parse_args(args)
@@ -120,6 +134,9 @@ def main(input_args: list[str] | None = None):
        "container_image": args.container_image,
        "gpus_per_node": args.gpus_per_node,
        "network_interface": args.network_interface,
+        "gpu_type": args.gpu_type,
+        "use_sglang_commands": args.use_sglang_commands,
+        "partition": args.partition,
    }
    with tempfile.NamedTemporaryFile(mode="w", suffix=".sh") as temp_file:
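As a quick check of how the three new options behave, the sketch below rebuilds only those arguments in a standalone parser and maps them onto the same template keys used above. `parse_new_flags` is a hypothetical name, and the real script defines many more arguments than are shown here.

```python
import argparse


def parse_new_flags(argv: list[str]) -> argparse.Namespace:
    """Parse only the options added in this change (illustrative subset)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu-type", choices=["h100", "gb200"], default="h100")
    parser.add_argument("--use-sglang-commands", action="store_true")
    parser.add_argument("--partition", default="batch")
    return parser.parse_args(argv)


args = parse_new_flags(["--gpu-type", "gb200", "--use-sglang-commands"])
template_vars = {
    "gpu_type": args.gpu_type,
    "use_sglang_commands": args.use_sglang_commands,
    "partition": args.partition,
}
print(template_vars)
# prints {'gpu_type': 'gb200', 'use_sglang_commands': True, 'partition': 'batch'}
```

Note that `worker_setup.py` spells its option `--gpu_type` (underscore) while this script uses `--gpu-type` (hyphen); argparse stores both under `gpu_type`, but the command-line spellings are not interchangeable.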

--- a/container/Dockerfile.sglang-wideep
+++ b/container/Dockerfile.sglang-wideep
@@ -13,160 +13,132 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-# This should be pinned to the sglang version that is installed with Dynamo
+ARG SGLANG_IMAGE_TAG="v0.4.10-cu126"
-# in the pyproject.toml
-FROM lmsysorg/sglang:v0.4.8.post1-cu126
-# Add NIXL build dependencies
+FROM lmsysorg/sglang:${SGLANG_IMAGE_TAG}
-RUN apt-get update -y && \
-    apt-get install -y \
-    cmake \
-    meson \
-    ninja-build \
-    pybind11-dev \
-    patchelf \
-    net-tools
-# Install Python build dependencies
-RUN pip install --break-system-packages meson-python wheel build
-# Add architecture args for NIXL build
+ARG MODE="hopper"
-ARG ARCH=amd64
+ARG ARCH="amd64"
-ARG ARCH_ALT=x86_64
+ARG ARCH_ALT="x86_64"
+ARG NIXL_UCX_REF="v1.19.x"
-WORKDIR /sgl-workspace
+ARG NIXL_TAG="0.4.1"
+ARG CMAKE_VERSION="3.31.8"
+ARG RUST_VERSION="1.87.0"
+ARG CARGO_BUILD_JOBS="16"
-# Install UCX dependencies
 RUN apt-get update -y && \
-    apt-get install -y --no-install-recommends \
+    apt-get install -y \
-    --reinstall libibverbs-dev rdma-core ibverbs-utils libibumad-dev \
+      cmake meson ninja-build pybind11-dev patchelf net-tools \
-    libnuma-dev librdmacm-dev ibverbs-providers \
+      build-essential protobuf-compiler libssl-dev pkg-config \
-    autoconf libtool
+      clang libclang-dev git rapidjson-dev zlib1g-dev && \
+    pip install --break-system-packages meson-python wheel build
-# Build UCX from source
-ARG NIXL_UCX_REF=v1.19.x
+# Build UCX + NIXL for x86/hopper until it's fully tested on GB200
-RUN rm -rf /opt/hpcx/ucx && \
+RUN if [ "$MODE" = "hopper" ]; then \
-    rm -rf /usr/local/ucx && \
+      apt-get install -y --no-install-recommends \
-    cd /usr/local/src && \
+        libibverbs-dev rdma-core ibverbs-utils libibumad-dev \
-    git clone https://github.com/openucx/ucx.git && \
+        libnuma-dev librdmacm-dev ibverbs-providers autoconf libtool && \
-    cd ucx && \
+      # UCX from source
-    git checkout $NIXL_UCX_REF && \
+      rm -rf /opt/hpcx/ucx /usr/local/ucx && \
-    ./autogen.sh && ./configure \
+      cd /usr/local/src && \
-    --prefix=/usr/local/ucx \
+      git clone https://github.com/openucx/ucx.git && \
-    --enable-shared \
+      cd ucx && git checkout $NIXL_UCX_REF && \
-    --disable-static \
+      ./autogen.sh && \
-    --disable-doxygen-doc \
+      ./configure \
-    --enable-optimizations \
+        --prefix=/usr/local/ucx \
-    --enable-cma \
+        --enable-shared \
-    --enable-devel-headers \
+        --disable-static \
-    --with-cuda=/usr/local/cuda \
+        --disable-doxygen-doc \
-    --with-verbs \
+        --enable-optimizations \
-    --with-efa \
+        --enable-cma \
-    --with-dm \
+        --enable-devel-headers \
-    --with-gdrcopy=/usr/local \
+        --with-cuda=/usr/local/cuda \
-    --enable-mt && \
+        --with-verbs \
-    make -j && \
+        --with-efa \
-    make -j install-strip && \
+        --with-dm \
-    ldconfig
+        --with-gdrcopy=/usr/local \
+        --enable-mt && \
+      make -j && make install-strip && ldconfig && \
+      # NIXL
+      git clone https://github.com/ai-dynamo/nixl.git /opt/nixl && \
+      cd /opt/nixl && git checkout $NIXL_TAG && \
+      pip install --break-system-packages . \
+        --config-settings="setup-args=-Ducx_path=/usr/local/ucx"; \
+    fi
 ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/ucx/lib:$LD_LIBRARY_PATH
-ARG NIXL_TAG=0.4.1
+# Dynamo
-RUN git clone https://github.com/ai-dynamo/nixl.git && cd nixl && git checkout ${NIXL_TAG} && pip install --break-system-packages . --config-settings=setup-args="-Ducx_path=/usr/local/ucx"
-WORKDIR /sgl-workspace
-# Allow forceful shutdown of inflight requests
-ENV SGL_FORCE_SHUTDOWN=1
 WORKDIR /sgl-workspace
 RUN git clone https://github.com/ai-dynamo/dynamo.git
-# install dynamo in editable mode
-WORKDIR /sgl-workspace/dynamo
-# Rust build/dev dependencies
-RUN apt update -y && \
-    apt install --no-install-recommends -y \
-    build-essential \
-    protobuf-compiler \
-    cmake \
-    libssl-dev \
-    pkg-config \
-    clang \
-    libclang-dev \
-    git
-# Define Rust target based on ARCH_ALT ARG
-ARG RUSTARCH=${ARCH_ALT}-unknown-linux-gnu
 ENV RUSTUP_HOME=/usr/local/rustup \
    CARGO_HOME=/usr/local/cargo \
-    PATH=/usr/local/cargo/bin:$PATH \
+    PATH=/usr/local/cargo/bin:$PATH
-    RUST_VERSION=1.86.0
-# Install Rust using RUSTARCH derived from ARCH_ALT
+RUN wget --tries=3 --waitretry=5 \
-RUN wget --tries=3 --waitretry=5 "https://static.rust-lang.org/rustup/archive/1.28.1/${RUSTARCH}/rustup-init" && \
+    "https://static.rust-lang.org/rustup/archive/1.28.1/${ARCH_ALT}-unknown-linux-gnu/rustup-init" && \
-    # TODO: Add SHA check back based on RUSTARCH
    chmod +x rustup-init && \
-    ./rustup-init -y --no-modify-path --profile minimal --default-toolchain $RUST_VERSION --default-host ${RUSTARCH} && \
+    ./rustup-init -y \
+      --no-modify-path \
+      --profile minimal \
+      --default-toolchain $RUST_VERSION \
+      --default-host ${ARCH_ALT}-unknown-linux-gnu && \
    rm rustup-init && \
    chmod -R a+w $RUSTUP_HOME $CARGO_HOME
 ARG CARGO_BUILD_JOBS
-# Set CARGO_BUILD_JOBS to 16 if not provided
+ENV CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS}
-# This is to prevent cargo from building $(nproc) jobs in parallel,
-# which might exceed the number of opened files limit.
+RUN cd dynamo && cargo build --release
-ENV CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS:-16}
-RUN cargo build --release
+RUN cd dynamo/lib/bindings/python && \
+    pip install --break-system-packages -e . && \
+    cd /sgl-workspace/dynamo && \
+    pip install --break-system-packages .
-RUN cd lib/bindings/python && pip install --break-system-packages -e . && cd ../../..
+RUN pip install --break-system-packages sglang-router==0.1.5
-RUN pip install --break-system-packages .
-RUN wget --tries=3 --waitretry=5 https://github.com/nats-io/nats-server/releases/download/v2.10.28/nats-server-v2.10.28-${ARCH}.deb && \
+RUN wget --tries=3 --waitretry=5 \
+      https://github.com/nats-io/nats-server/releases/download/v2.10.28/\
+nats-server-v2.10.28-${ARCH}.deb && \
    dpkg -i nats-server-v2.10.28-${ARCH}.deb && rm nats-server-v2.10.28-${ARCH}.deb
 ENV ETCD_VERSION="v3.5.21"
-RUN wget --tries=3 --waitretry=5 https://github.com/etcd-io/etcd/releases/download/$ETCD_VERSION/etcd-$ETCD_VERSION-linux-${ARCH}.tar.gz -O /tmp/etcd.tar.gz && \
+RUN wget --tries=3 --waitretry=5 \
+      https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/\
+etcd-${ETCD_VERSION}-linux-${ARCH}.tar.gz -O /tmp/etcd.tar.gz && \
    mkdir -p /usr/local/bin/etcd && \
-    tar -xvf /tmp/etcd.tar.gz -C /usr/local/bin/etcd --strip-components=1 && \
+    tar -xzf /tmp/etcd.tar.gz \
+        -C /usr/local/bin/etcd --strip-components=1 && \
    rm /tmp/etcd.tar.gz
-ENV PATH=/usr/local/bin/etcd/:$PATH
-ARG CMAKE_VERSION=3.31.8
+ENV PATH=/usr/local/bin/etcd:$PATH
-RUN mkdir /sgl-workspace/cmake_build
-WORKDIR /sgl-workspace/cmake_build
-# uninstall CMake
+# Newer CMake, then perf_analyzer and GenAI-Perf
 RUN apt-get purge -y cmake
-# download newer version of CMake
-RUN wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \
-    tar -xvzf cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \
-    mv cmake-${CMAKE_VERSION}-linux-$(uname -m) custom_cmake
-ENV PATH=/sgl-workspace/cmake_build/custom_cmake/bin:$PATH
-# should be 3.31.8
+RUN mkdir /sgl-workspace/cmake_build && \
+    cd /sgl-workspace/cmake_build && \
+    wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/\
+cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \
+    tar -xzf cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \
+    mv cmake-${CMAKE_VERSION}-linux-$(uname -m) custom_cmake && \
+    rm cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz
+ENV PATH=/sgl-workspace/cmake_build/custom_cmake/bin:$PATH
 RUN cmake --version
-# Install perf_analyzer and genai-perf
+RUN git clone --depth=1 \
-RUN apt-get update -y && \
+      https://github.com/triton-inference-server/perf_analyzer.git && \
-    apt-get install -y --no-install-recommends \
-    rapidjson-dev \
-    # jq and curl for polling various endpoints and health checks
-    jq \
-    curl \
-    zlib1g-dev
-RUN git clone --depth=1 https://github.com/triton-inference-server/perf_analyzer.git && \
    mkdir perf_analyzer/build && \
    cmake -B perf_analyzer/build -S perf_analyzer && \
-    cmake --build perf_analyzer/build -- -j8
+    cmake --build perf_analyzer/build -- -j$(nproc)
 ENV PATH=/sgl-workspace/perf_analyzer/build/perf_analyzer/src/perf-analyzer-build:$PATH
 RUN pip install --break-system-packages genai-perf
-# https://pypi.org/project/sglang-router/0.1.5 is latest
+# Enable forceful shutdown of inflight requests
-RUN pip install sglang-router==0.1.5
+ENV SGL_FORCE_SHUTDOWN=1
 WORKDIR /sgl-workspace/dynamo/components/backends/sglang
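Because the Dockerfile now exposes `MODE`, `ARCH`, `ARCH_ALT`, and `SGLANG_IMAGE_TAG` as build arguments, a build for a given target is a matter of passing the right `--build-arg` values. The sketch below assembles such a `docker build` invocation. It is illustrative only: `docker_build_command` and the image tag `dynamo-wideep:hopper` are hypothetical, and the example uses the Dockerfile's own defaults because the values needed for GB200 are not shown in this diff.

```python
def docker_build_command(
    dockerfile: str,
    image_tag: str,
    build_args: dict[str, str],
    context: str = ".",
) -> list[str]:
    """Assemble a `docker build` argv list (hypothetical helper)."""
    cmd = ["docker", "build", "-f", dockerfile, "-t", image_tag]
    for key, value in build_args.items():
        cmd += ["--build-arg", f"{key}={value}"]
    cmd.append(context)
    return cmd


# Defaults taken from the ARG lines in Dockerfile.sglang-wideep.
hopper_defaults = {"MODE": "hopper", "ARCH": "amd64", "ARCH_ALT": "x86_64"}
build_cmd = docker_build_command(
    "container/Dockerfile.sglang-wideep", "dynamo-wideep:hopper", hopper_defaults
)
print(" ".join(build_cmd))
```

The returned list can be passed directly to `subprocess.run(build_cmd, check=True)` from the repository root.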