chore: update slurm scripts for better warmup and bump sgl version (#3291)

Commit 0ab6bc2b, authored Sep 30, 2025 by ishandhanani; committed by GitHub Sep 30, 2025
18 changed files
--- a/README.md
+++ b/README.md
@@ -14,6 +14,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
 ![Dynamo banner](./docs/images/frontpage-banner.png)
 [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
@@ -29,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative
 ## Latest News
-* [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
+- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
 ## The Era of Multi-GPU, Multi-Node
@@ -53,16 +54,17 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
 ## Framework Support Matrix
-| Feature | vLLM | SGLang | TensorRT-LLM |
+| Feature                                                                                           | vLLM | SGLang | TensorRT-LLM |
-|---------|----------------------|----------------------------|----------------------------------------|
+| ------------------------------------------------------------------------------------------------- | ---- | ------ | ------------ |
-| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
+| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md)                                 | ✅   | ✅     | ✅           |
-| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
+| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧   | 🚧     | 🚧           |
-| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
+| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md)                                    | ✅   | ✅     | ✅           |
-| [**Load Based Planner**](/docs/architecture/load_planner.md) | 🚧 | 🚧 | 🚧 |
+| [**Load Based Planner**](/docs/architecture/load_planner.md)                                      | 🚧   | 🚧     | 🚧           |
-| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | ✅ |
+| [**SLA-Based Planner**](/docs/architecture/sla_planner.md)                                        | ✅   | ✅     | ✅           |
-| [**KVBM**](/docs/architecture/kvbm_architecture.md) | ✅ | 🚧 | ✅ |
+| [**KVBM**](/docs/architecture/kvbm_architecture.md)                                               | ✅   | 🚧     | ✅           |
 To learn more about each framework and their capabilities, check out each framework's README!
 - **[vLLM](components/backends/vllm/README.md)**
 - **[SGLang](components/backends/sglang/README.md)**
 - **[TensorRT-LLM](components/backends/trtllm/README.md)**
@@ -77,6 +79,7 @@ Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/support_matrix.md](
 ## 1. Initial setup
 The Dynamo team recommends the `uv` Python package manager, although any way works. Install uv:
 ```
 curl -LsSf https://astral.sh/uv/install.sh | sh
 ```
@@ -89,6 +92,7 @@ To coordinate across a data center, Dynamo relies on etcd and NATS. To run Dynam
 - [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`.
 To quickly setup etcd & NATS, you can also run:
 ```
 # At the root of the repository:
 # Edit deploy/docker-compose.yml to comment out "runtime: nvidia" of the dcgm-exporter service if the nvidia container runtime isn't deployed or to be used.
@@ -125,7 +129,7 @@ python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key
 # Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
 # both for the same model and for multiple models. The frontend node will discover them.
-python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --skip-tokenizer-init
+python -m dynamo.sglang --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
 ```
 #### Send a Request
@@ -156,8 +160,8 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
 Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
-* **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
+- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
-* **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
+- **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
 # Engines
@@ -170,6 +174,7 @@ uv pip install ai-dynamo[vllm]
 ```
 Run the backend/worker like this:
 ```
 python -m dynamo.vllm --help
 ```
@@ -188,8 +193,9 @@ uv pip install ai-dynamo[sglang]
 ```
 Run the backend/worker like this:
 ```
-python -m dynamo.sglang.worker --help
+python -m dynamo.sglang --help
 ```
 You can pass any SGLang flags directly to this worker; see https://docs.sglang.ai/advanced_features/server_arguments.html for the full list, including the flags for using multiple GPUs.
@@ -207,6 +213,7 @@ It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/
 > Launch container with the following additional settings `--shm-size=1g --ulimit memlock=-1`
 ### Install prerequisites
 ```
 # Optional step: Only required for Blackwell and Grace Hopper
 uv pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
@@ -221,11 +228,13 @@ sudo apt-get -y install libopenmpi-dev
 > You can learn more about these prerequisites and known issues with the TensorRT-LLM pip-based installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
 ### After installing the pre-requisites above, install Dynamo
 ```
 uv pip install ai-dynamo[trtllm]
 ```
 Run the backend/worker like this:
 ```
 python -m dynamo.trtllm --help
 ```
@@ -237,16 +246,20 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
 ## 1. Install libraries
 **Ubuntu:**
 ```
 sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
 ```
 **macOS:**
 - [Homebrew](https://brew.sh/)
 ```
 # if brew is not installed on your system, install it
 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
 ```
 - [Xcode](https://developer.apple.com/xcode/)
 ```
@@ -255,8 +268,8 @@ brew install cmake protobuf
 ## Check that Metal is accessible
 xcrun -sdk macosx metal
 ```
-If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
+If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
 ## 2. Install Rust
@@ -270,11 +283,13 @@ source $HOME/.cargo/env
 Follow the instructions in [uv installation](https://docs.astral.sh/uv/#installation) guide to install uv if you don't have `uv` installed. Once uv is installed, create a virtual environment and activate it.
 - Install uv
 ```bash
 curl -LsSf https://astral.sh/uv/install.sh | sh
 ```
 - Create a virtual environment
 ```bash
 uv venv dynamo
 source dynamo/bin/activate

--- a/components/backends/sglang/docs/dsr1-wideep-gb200.md
+++ b/components/backends/sglang/docs/dsr1-wideep-gb200.md
@@ -29,7 +29,7 @@ docker build \
  -f container/Dockerfile.sglang-wideep \
  -t dynamo-wideep-gb200 \
  --build-arg MODE=blackwell \
-  --build-arg SGLANG_IMAGE_TAG=v0.5.0rc0-cu129-gb200 \
+  --build-arg SGLANG_IMAGE_TAG=v0.5.3rc0-cu129-gb200 \
  --build-arg ARCH=arm64 \
  --build-arg ARCH_ALT=aarch64 \
  .

--- a/components/backends/sglang/slurm_jobs/README.md
+++ b/components/backends/sglang/slurm_jobs/README.md
-# Example: Deploy Multi-node SGLang with Dynamo on SLURM
+# Example: Deploy DeepSeek R1 - FP8 with Dynamo and SGLang on SLURM
-This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster.
+This folder lets you deploy the SGLang DeepSeek-R1 Disaggregated with WideEP example on a GB200 SLURM cluster.
-## Overview
+## SLURM Prerequisites
-The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode.
+For this example, we will make some assumptions about your SLURM cluster:
-The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.
-## Scripts
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
-## Logs Folder Structure
-Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
-### Log File Structure
-```
-logs/
-├── 3062824/                    # Job ID directory
-│   ├── log.out                 # Main job output (node allocation, IP addresses, launch commands)
-│   ├── log.err                 # Main job errors
-│   ├── node0197_prefill.out     # Prefill node stdout (node0197)
-│   ├── node0197_prefill.err     # Prefill node stderr (node0197)
-│   ├── node0200_prefill.out     # Prefill node stdout (node0200)
-│   ├── node0200_prefill.err     # Prefill node stderr (node0200)
-│   ├── node0201_decode.out      # Decode node stdout (node0201)
-│   ├── node0201_decode.err      # Decode node stderr (node0201)
-│   ├── node0204_decode.out      # Decode node stdout (node0204)
-│   ├── node0204_decode.err      # Decode node stderr (node0204)
-│   ├── node0197_prefill_gpu_utilization.log    # GPU utilization monitoring (node0197)
-│   ├── node0200_prefill_gpu_utilization.log    # GPU utilization monitoring (node0200)
-│   ├── node0201_decode_gpu_utilization.log     # GPU utilization monitoring (node0201)
-│   └── node0204_decode_gpu_utilization.log     # GPU utilization monitoring (node0204)
-├── 3063137/                    # Another job ID directory
-├── 3062689/                    # Another job ID directory
-└── ...
-```
-## Setup
-For simplicity of the example, we will make some assumptions about your SLURM cluster:
 1. We assume you have access to a SLURM cluster with multiple GPU nodes
   available. For functional testing, most setups should be fine. For performance
@@ -58,97 +17,96 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
   If your cluster supports similar container based plugins, you may be able to
   modify the template to use that instead.
 3. We assume you have already built a recent Dynamo+SGLang container image as
-   described [here](../docs/dsr1-wideep-h100.md#instructions).
+   described [here](../docs/dsr1-wideep-gb200.md#instructions).
   This is the image that can be passed to the `--container-image` argument in later steps.
+## Scripts Overview
+- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
+- **`job_script_template.j2`**: Jinja2 template for generating SLURM sbatch scripts
+- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
+- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
+## Logs Folder Structure
+Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
 ## Usage
 > [!NOTE]
-> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.
+> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `ip addr show $NETWORK_INTERFACE` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions are always welcome.
 1. **Submit a benchmark job**:
   ```bash
-   python submit_job_script.py \
+   python3 submit_job_script.py \
     --template job_script_template.j2 \
-     --model-dir /path/to/model \
+     --model-dir <path-to>/deepseek-r1-0528 \
-     --config-dir /path/to/configs \
+     --container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh \
-     --container-image container-image-uri \
+     --gpus-per-node 4 \
-     --account your-slurm-account
+     --config-dir <path-to>/klconfigs \
+     --gpu-type gb200-fp8 \
+     --network-interface enP6p9s0np0 \
+     --prefill-nodes 6 \
+     --decode-nodes 12 \
+     --prefill-workers 3 \
+     --decode-workers 1 \
+     --account <account> \
+     --partition <partition> \
+     --time-limit 4:00:00 \
+     --enable-multiple-frontends \
+     --num-additional-frontends 9 \
+     --profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"
   ```
-   **Required arguments**:
+   This command will deploy 3 prefill workers and 1 decode worker with 9 additional frontends load-balanced by nginx. Diving deeper into the command:
-   - `--template`: Path to Jinja2 template file
+   - `--template job_script_template.j2`: Path to Jinja2 template file (this shouldn't change unless you want to modify the template)
-   - `--model-dir`: Model directory path
+   - `--model-dir <path-to>/deepseek-r1-0528`: Path to DSR1-FP8 model directory
-   - `--config-dir`: Config directory path
+   - `--container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh`: Enroot container image URI
-   - `--container-image`: Container image URI (e.g., `registry/repository:tag`)
+   - `--gpus-per-node 4`: Number of GPUs per node (each GB200 tray has 4 GPUs)
-   - `--account`: SLURM account
+   - `--config-dir <path-to>/klconfigs`: Various configs (see explanation below)
+   - `--gpu-type gb200-fp8`: GPU type to use, choices: `gb200-fp8`
-   **Optional arguments**:
+   - `--network-interface enP6p9s0np0`: Network interface to use (depends on your cluster)
+   - `--prefill-nodes 6`: Number of prefill nodes
-   - `--prefill-nodes`: Number of prefill nodes (default: `2`)
+   - `--decode-nodes 12`: Number of decode nodes
-   - `--decode-nodes`: Number of decode nodes (default: `2`)
+   - `--prefill-workers 3`: Number of prefill workers
-   - `--gpus-per-node`: Number of GPUs per node (default: `8`)
+   - `--decode-workers 1`: Number of decode workers
-   - `--network-interface`: Network interface to use (default: `eth3`)
+   - `--account <account>`: SLURM account
-   - `--job-name`: SLURM job name (default: `dynamo_setup`)
+   - `--partition <partition>`: SLURM partition
-   - `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)
+   - `--time-limit 4:00:00`: Time limit in HH:MM:SS format
-   - `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`)
+   - `--enable-multiple-frontends`: Enable multiple frontend architecture with nginx load balancer
-   - `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`)
+   - `--num-additional-frontends 9`: Number of additional frontends
+   - `--profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"`: Profiler configurations (see explanation below)
   **Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
-2. **Example with different GPU types**:
+2. **Check logs in real-time**:
-   ```bash
-   # For H100 with Dynamo (default)
-   python submit_job_script.py \
-     --template job_script_template.j2 \
-     --model-dir /path/to/model \
-     --config-dir /path/to/configs \
-     --container-image container-image-uri \
-     --account your-slurm-account \
-     --gpu-type h100
-   # For GB200 with SGLang
-   python submit_job_script.py \
-     --template job_script_template.j2 \
-     --model-dir /path/to/model \
-     --config-dir /path/to/configs \
-     --container-image container-image-uri \
-     --account your-slurm-account \
-     --gpu-type gb200 \
-     --use-sglang-commands
-     --gpus-per-node 4
-   ```
-3. **Monitor job progress**:
   ```bash
-   squeue -u $USER
+   cd logs/{JOB_ID}
+   tail -f *_prefill_*.err *_decode_*.err
   ```
-4. **Check logs in real-time**:
+## Configs directory
-   ```bash
-   tail -f logs/{JOB_ID}/log.out
-   ```
-   You can view logs of all prefill or decode workers simultaneously by running:
+The `--config-dir` argument is used to specify the directory containing the various configs that are used when running this model. Here are the current configs that are in our directory.
-   ```bash
+```bash
-   # prefill workers err (or .out)
+klconfigs/
-   tail -f logs/{JOB_ID}/*_prefill.err
+├── decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json
+├── deepep_config.json
+├── dgcache/
+└── prefill_dsr1-0528_in1000out1000_num40000.json
+```
-   # decode workers err (or .out)
+1. `decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json`: `init-expert-location` for decode worker
-   tail -f logs/{JOB_ID}/*_decode.err
+2. `deepep_config.json`: DeepEP config file for GB200
-   ```
+3. `dgcache/`: DeepGEMM kernel cache directory. Instructions for creating this can be found [here](https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174)
+4. `prefill_dsr1-0528_in1000out1000_num40000.json`: `init-expert-location` for prefill worker
-5. **Monitor GPU utilization**:
+**Note**: The expert locations are collected using the instructions [here](https://github.com/sgl-project/sglang/issues/6017). See the section titled "Create expert distribution data". Note that this is sensitive to your data, and performance results may differ if you don't benchmark with the same data that was used to collect the expert locations.
-   ```bash
-   tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
-   ```
-## Outputs
+## Profiler
-Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
+If you provide the `--profiler` argument, the sbatch script will automatically warm up the model and run the vLLM benchmarking script. Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
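For illustration, the `--profiler` value shown in the README is a semicolon-separated list of `key=value` pairs, with the `concurrencies` entry joined by `x`. The sketch below shows one way such a string could be split apart. The function name `parse_profiler_spec` is hypothetical, and the parsing rules are an assumption inferred from the example string, not the logic used by `submit_job_script.py` or the sbatch template.

```python
def parse_profiler_spec(spec: str) -> dict:
    """Split a 'key=value; key=value' string into a dict (illustrative only)."""
    parsed = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        parsed[key.strip()] = value.strip()
    # Assumption: 'concurrencies' is an 'x'-separated list of integers.
    if "concurrencies" in parsed:
        parsed["concurrencies"] = [int(c) for c in parsed["concurrencies"].split("x")]
    return parsed


spec = "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"
print(parse_profiler_spec(spec))
```

Values other than `concurrencies` are left as strings here, so `req-rate=inf` passes through unchanged.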
--- a/components/backends/sglang/slurm_jobs/scripts/benchmark_utils.sh
+++ b/components/backends/sglang/slurm_jobs/scripts/benchmark_utils.sh
@@ -50,24 +50,30 @@ warmup_model() {
    model_path=$4
    config=$5
-    IFS='x' read -r -a config_list <<< "$config"
+    model_name="deepseek-ai/DeepSeek-R1"
-    isl=${config_list[0]}
+    model_path="deepseek-ai/DeepSeek-R1-0528"
-    osl=${config_list[1]}
+    head_node="localhost"
-    num_prompts=${config_list[2]}
+    head_port="8000"
-    concurrency=${config_list[3]}
+    chosen_isl=1024
-    request_rate=${config_list[4]}
+    chosen_osl=1024
+    chosen_req_rate="inf"
+    chosen_concurrencies=(1 2 4 8 16 32 64 128)
-    command=(
+	for concurrency in "${chosen_concurrencies[@]}"
-        python3 -m sglang.bench_serving
+	do
-        --base-url "http://${service_host}:${service_port}"
+	    num_prompts=$((concurrency * 5))
-        --model ${served_model_name} --tokenizer ${model_path}
-        --backend sglang-oai
-        --dataset-name random --random-input ${isl} --random-output ${osl}
-        --random-range-ratio 1
-        --num-prompts ${num_prompts} --request-rate ${request_rate} --max-concurrency ${concurrency}
-    )
-    echo "Config ${config}. Running command ${command[@]}"
+	    command=(
+		python3 -m sglang.bench_serving
+		--base-url "http://${head_node}:${head_port}"
+		--model ${model_name} --tokenizer ${model_path}
+		--backend sglang-oai
+		--dataset-name random --random-input ${chosen_isl} --random-output ${chosen_osl}
+		--random-range-ratio 1
+		--num-prompts ${num_prompts} --request-rate ${chosen_req_rate} --max-concurrency ${concurrency}
+	    )
-    ${command[@]}
+	    echo "Running with concurrency: ${concurrency}, num_prompts: ${num_prompts}"
-}
+	    "${command[@]}"
+	done
+}
\ No newline at end of file
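As a reading aid, the new `warmup_model` loop steps through a fixed list of concurrency levels and sends `concurrency * 5` prompts at each level. The sketch below reproduces only that arithmetic in Python; the names `warmup_schedule` and `prompts_per_slot` are hypothetical and do not appear in `benchmark_utils.sh`.

```python
def warmup_schedule(concurrencies=(1, 2, 4, 8, 16, 32, 64, 128), prompts_per_slot=5):
    """Return (concurrency, num_prompts) pairs mirroring the shell loop."""
    return [(c, c * prompts_per_slot) for c in concurrencies]


schedule = warmup_schedule()
for concurrency, num_prompts in schedule:
    print(f"concurrency={concurrency} num_prompts={num_prompts}")

# Total number of warmup requests across all levels.
total_prompts = sum(num_prompts for _, num_prompts in schedule)
print(f"total warmup prompts: {total_prompts}")
```

With the values from the diff, the sweep ends at 128 concurrent requests and 640 prompts for the last level.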
--- a/components/backends/sglang/slurm_jobs/scripts/gb200-fp8.sh
+++ b/components/backends/sglang/slurm_jobs/scripts/gb200-fp8.sh
@@ -32,8 +32,8 @@ echo "Mode: $mode"
 echo "Command: dynamo"
 # Check if required environment variables are set
-if [ -z "$HOST_IP" ]; then
+if [ -z "$HOST_IP_MACHINE" ]; then
-    echo "Error: HOST_IP environment variable is not set"
+    echo "Error: HOST_IP_MACHINE environment variable is not set"
    exit 1
 fi
@@ -67,6 +67,9 @@ if [ "$mode" = "prefill" ]; then
    # GB200 dynamo prefill command
    set -x
    # SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
+    # timeouts and kernel cache
+    export TORCH_DISTRIBUTED_DEFAULT_TIMEOUT=1800
+    export SGL_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
    if [[ "${USE_INIT_LOCATIONS,,}" == "true" ]]; then command_suffix="--init-expert-location /configs/prefill_dsr1-0528_in1000out1000_num40000.json"; fi
@@ -80,15 +83,15 @@ if [ "$mode" = "prefill" ]; then
    NCCL_MNNVL_ENABLE=1 \
    NCCL_CUMEM_ENABLE=1 \
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
-    SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
    PYTHONUNBUFFERED=1 \
-    python3 -m dynamo.sglang.worker \
+    python3 -m dynamo.sglang \
        --served-model-name deepseek-ai/DeepSeek-R1 \
        --model-path /model/ \
        --skip-tokenizer-init \
        --trust-remote-code \
        --disaggregation-mode prefill \
-        --dist-init-addr "$HOST_IP:$PORT" \
+        --dist-init-addr "$HOST_IP_MACHINE:$PORT" \
        --disaggregation-bootstrap-port 30001 \
        --nnodes "$TOTAL_NODES" \
        --node-rank "$RANK" \
@@ -100,7 +103,8 @@ if [ "$mode" = "prefill" ]; then
        --max-running-requests 12288 \
        --context-length 9600 \
        --disable-radix-cache \
-        --enable-deepep-moe \
+        --moe-a2a-backend deepep \
+        --load-balance-method round_robin \
        --deepep-mode normal \
        --ep-dispatch-algorithm dynamic \
        --moe-dense-tp-size 1 \
@@ -122,6 +126,10 @@ elif [ "$mode" = "decode" ]; then
    command_suffix=""
    if [[ "${USE_INIT_LOCATIONS,,}" == "true" ]]; then command_suffix="--init-expert-location /configs/decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json"; fi
+    # timeouts and kernel cache
+    export TORCH_DISTRIBUTED_DEFAULT_TIMEOUT=1800
+    export SGL_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
    # GB200 dynamo decode command
    DYN_SKIP_SGLANG_LOG_FORMATTING=1 \
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=512 \
@@ -135,15 +143,15 @@ elif [ "$mode" = "decode" ]; then
    MC_FORCE_MNNVL=1 \
    NCCL_CUMEM_ENABLE=1 \
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
-    SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
    PYTHONUNBUFFERED=1 \
-    python3 -m dynamo.sglang.decode_worker \
+    python3 -m dynamo.sglang \
        --served-model-name deepseek-ai/DeepSeek-R1 \
        --model-path /model/ \
        --skip-tokenizer-init \
        --trust-remote-code \
        --disaggregation-mode decode \
-        --dist-init-addr "$HOST_IP:$PORT" \
+        --dist-init-addr "$HOST_IP_MACHINE:$PORT" \
        --disaggregation-bootstrap-port 30001 \
        --nnodes "$TOTAL_NODES" \
        --node-rank "$RANK" \
@@ -155,7 +163,8 @@ elif [ "$mode" = "decode" ]; then
        --max-running-requests 36864 \
        --context-length 9600 \
        --disable-radix-cache \
-        --enable-deepep-moe \
+        --moe-a2a-backend deepep \
+        --prefill-round-robin-balance \
        --deepep-mode low_latency \
        --moe-dense-tp-size 1 \
        --enable-dp-lm-head \

--- a/components/backends/sglang/slurm_jobs/scripts/worker_setup.py
+++ b/components/backends/sglang/slurm_jobs/scripts/worker_setup.py
@@ -175,8 +175,8 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
    parser.add_argument(
        "--gpu_type",
        type=str,
-        choices=["h100", "gb200-fp8"],
+        choices=["gb200-fp8"],
-        default="h100",
+        default="gb200-fp8",
        help="Type of GPU to use",
    )
@@ -237,8 +237,8 @@ def setup_env_vars_for_gpu_script(
    port: int = DIST_INIT_PORT,
    use_init_locations: bool = True,
 ):
-    """Setup environment variables required by GPU scripts (h100.sh, gb200-fp8.sh, gb200-fp4.sh)"""
+    """Setup environment variables required by GPU scripts (gb200-fp8.sh)"""
-    os.environ["HOST_IP"] = host_ip
+    os.environ["HOST_IP_MACHINE"] = host_ip
    os.environ["PORT"] = str(port)
    os.environ["TOTAL_GPUS"] = str(total_gpus)
    os.environ["RANK"] = str(local_rank)

--- a/components/backends/sglang/slurm_jobs/submit_job_script.py
+++ b/components/backends/sglang/slurm_jobs/submit_job_script.py
@@ -142,8 +142,8 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
    )
    parser.add_argument(
        "--gpu-type",
-        choices=["h100", "gb200-fp8"],
+        choices=["gb200-fp8"],
-        default="h100",
+        default="gb200-fp8",
        help="GPU type to use",
    )
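
The slurm_jobs README states that the script calculates the total number of nodes from `--prefill-nodes` and `--decode-nodes`. A minimal sketch of that arithmetic for the README's example command follows. It assumes the total is the plain sum of the two counts and that nodes divide evenly among workers; the real script may handle frontends or uneven splits differently. The function `plan_nodes` is hypothetical.

```python
def plan_nodes(prefill_nodes, decode_nodes, prefill_workers, decode_workers, gpus_per_node):
    """Derive per-worker node and GPU counts (illustrative assumptions only)."""
    if prefill_nodes % prefill_workers or decode_nodes % decode_workers:
        raise ValueError("nodes must divide evenly among workers in this sketch")
    nodes_per_prefill = prefill_nodes // prefill_workers
    nodes_per_decode = decode_nodes // decode_workers
    return {
        "total_nodes": prefill_nodes + decode_nodes,  # assumption: plain sum
        "nodes_per_prefill_worker": nodes_per_prefill,
        "gpus_per_prefill_worker": nodes_per_prefill * gpus_per_node,
        "nodes_per_decode_worker": nodes_per_decode,
        "gpus_per_decode_worker": nodes_per_decode * gpus_per_node,
    }


# Values taken from the example command in the README diff.
plan = plan_nodes(prefill_nodes=6, decode_nodes=12,
                  prefill_workers=3, decode_workers=1, gpus_per_node=4)
print(plan)
```

Under these assumptions the example asks for 18 nodes, with each prefill worker spanning 2 nodes (8 GPUs) and the single decode worker spanning 12 nodes (48 GPUs).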

--- a/components/backends/sglang/src/dynamo/sglang/decode_worker/__init__.py
+++ b/components/backends/sglang/src/dynamo/sglang/decode_worker/__init__.py
-#  SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-#  SPDX-License-Identifier: Apache-2.0
-# This module is deprecated. Use `python3 -m dynamo.sglang` instead.
--- a/components/backends/sglang/src/dynamo/sglang/decode_worker/__main__.py
+++ b/components/backends/sglang/src/dynamo/sglang/decode_worker/__main__.py
-#  SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-#  SPDX-License-Identifier: Apache-2.0
-import logging
-from dynamo.runtime.logging import configure_dynamo_logging
-from dynamo.sglang.main import main
-if __name__ == "__main__":
-    configure_dynamo_logging()
-    logging.warning(
-        "DEPRECATION WARNING: `python3 -m dynamo.sglang.decode_worker` is deprecated and will be removed in dynamo v0.5.0."
-        "Use `python3 -m dynamo.sglang` instead.",
-    )
-    main()
--- a/components/backends/sglang/src/dynamo/sglang/worker/__init__.py
+++ b/components/backends/sglang/src/dynamo/sglang/worker/__init__.py
-#  SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-#  SPDX-License-Identifier: Apache-2.0
-# This module is deprecated. Use `python3 -m dynamo.sglang` instead.
--- a/components/backends/sglang/src/dynamo/sglang/worker/__main__.py
+++ b/components/backends/sglang/src/dynamo/sglang/worker/__main__.py
-#  SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-#  SPDX-License-Identifier: Apache-2.0
-import logging
-from dynamo.runtime.logging import configure_dynamo_logging
-from dynamo.sglang.main import main
-if __name__ == "__main__":
-    configure_dynamo_logging()
-    logging.warning(
-        "DEPRECATION WARNING: `python3 -m dynamo.sglang.worker` is deprecated and will be removed in dynamo v0.5.0."
-        "Use `python3 -m dynamo.sglang` instead.",
-    )
-    main()
--- a/container/Dockerfile.sglang
+++ b/container/Dockerfile.sglang
 # SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+# Note: This Dockerfile will be deprecated in favor of Dockerfile.sglang-wideep soon. Please build the container with that Dockerfile instead.
 ARG BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base"
 # TODO OPS-612: NCCL will hang with 25.03, so use 25.01 for now
 # Please check https://github.com/ai-dynamo/dynamo/pull/1065

--- a/container/Dockerfile.sglang-wideep
+++ b/container/Dockerfile.sglang-wideep
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-ARG SGLANG_IMAGE_TAG="v0.5.0rc2-cu126"
+ARG SGLANG_IMAGE_TAG="v0.5.3rc0-cu126"
 FROM lmsysorg/sglang:${SGLANG_IMAGE_TAG}

--- a/deploy/cloud/operator/internal/dynamo/backend_sglang_test.go
+++ b/deploy/cloud/operator/internal/dynamo/backend_sglang_test.go
@@ -68,9 +68,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
 			role:              RoleMain,
 			multinodeDeployer: &MockSimpleDeployer{},
 			initialCommand:    []string{"python3"},
-			initialArgs:       []string{"-m", "dynamo.sglang.worker"},
+			initialArgs:       []string{"-m", "dynamo.sglang"},
 			expectedCommand:   []string{"python3"},
-			expectedArgs:      []string{"-m", "dynamo.sglang.worker"},
+			expectedArgs:      []string{"-m", "dynamo.sglang"},
 			description:       "Single node should not modify python commands",
 		},
 		{
@@ -79,9 +79,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
 			role:              RoleWorker,
 			multinodeDeployer: &MockSimpleDeployer{},
 			initialCommand:    []string{"python3"},
-			initialArgs:       []string{"-m", "dynamo.sglang.worker", "--model", "llama"},
+			initialArgs:       []string{"-m", "dynamo.sglang", "--model", "llama"},
 			expectedCommand:   []string{"python3"},
-			expectedArgs:      []string{"-m", "dynamo.sglang.worker", "--model", "llama", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
+			expectedArgs:      []string{"-m", "dynamo.sglang", "--model", "llama", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
 			description:       "Direct python command with simple deployer should append flags",
 		},
 		{
@@ -90,9 +90,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
 			role:              RoleWorker,
 			multinodeDeployer: &MockShellDeployer{},
 			initialCommand:    []string{"python3"},
-			initialArgs:       []string{"-m", "dynamo.sglang.worker", "--model", "llama"},
+			initialArgs:       []string{"-m", "dynamo.sglang", "--model", "llama"},
 			expectedCommand:   []string{"sh", "-c"},
-			expectedArgs:      []string{"exec python3 -m dynamo.sglang.worker --model llama --dist-init-addr $(LEADER_HOST):29500 --nnodes 2 --node-rank $(WORKER_INDEX)"},
+			expectedArgs:      []string{"exec python3 -m dynamo.sglang --model llama --dist-init-addr $(LEADER_HOST):29500 --nnodes 2 --node-rank $(WORKER_INDEX)"},
 			description:       "Direct python command with shell deployer should wrap with sh -c exec",
 		},
 		{
@@ -101,9 +101,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
 			role:              RoleLeader,
 			multinodeDeployer: &MockShellDeployer{},
 			initialCommand:    []string{"python"},
-			initialArgs:       []string{"-m", "dynamo.sglang.worker"},
+			initialArgs:       []string{"-m", "dynamo.sglang"},
 			expectedCommand:   []string{"python"},
-			expectedArgs:      []string{"-m", "dynamo.sglang.worker", "--dist-init-addr", "$(LEADER_HOST):29500", "--nnodes", "3", "--node-rank", "0"},
+			expectedArgs:      []string{"-m", "dynamo.sglang", "--dist-init-addr", "$(LEADER_HOST):29500", "--nnodes", "3", "--node-rank", "0"},
 			description:       "Leader role should never use shell wrapping",
 		},
 		{
@@ -112,9 +112,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
 			role:              RoleWorker,
 			multinodeDeployer: &MockSimpleDeployer{},
 			initialCommand:    []string{"python3.11"},
-			initialArgs:       []string{"-m", "dynamo.sglang.worker"},
+			initialArgs:       []string{"-m", "dynamo.sglang"},
 			expectedCommand:   []string{"python3.11"},
-			expectedArgs:      []string{"-m", "dynamo.sglang.worker", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
+			expectedArgs:      []string{"-m", "dynamo.sglang", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
 			description:       "Python version variants should be recognized",
 		},
 		{
@@ -202,8 +202,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
 			role:              RoleMain,
 			multinodeDeployer: &GroveMultinodeDeployer{},
 			initialCommand:    []string{"sh", "-c"},
-			initialArgs:       []string{"python -m dynamo.sglang.worker"},
+			initialArgs:       []string{"python -m dynamo.sglang"},
-			expectedArgs:      []string{"python -m dynamo.sglang.worker"},
+			expectedArgs:      []string{"python -m dynamo.sglang"},
 			description:       "Single node should not modify shell commands",
 		},
 		{
@@ -212,8 +212,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
 			role:              RoleLeader,
 			multinodeDeployer: &GroveMultinodeDeployer{},
 			initialCommand:    []string{"sh", "-c"},
-			initialArgs:       []string{"python -m dynamo.sglang.worker"},
+			initialArgs:       []string{"python -m dynamo.sglang"},
-			expectedArgs:      []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0"},
+			expectedArgs:      []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0"},
 			description:       "Shell commands should use regex injection for python commands",
 		},
 		{
@@ -222,8 +222,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
 			role:              RoleLeader,
 			multinodeDeployer: &GroveMultinodeDeployer{},
 			initialCommand:    []string{"sh", "-c"},
-			initialArgs:       []string{"echo blah | wc -l && python -m dynamo.sglang.worker && ls -al"},
+			initialArgs:       []string{"echo blah | wc -l && python -m dynamo.sglang && ls -al"},
-			expectedArgs:      []string{"echo blah | wc -l && python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 && ls -al"},
+			expectedArgs:      []string{"echo blah | wc -l && python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 && ls -al"},
 			description:       "Complex shell commands should inject flags only into python part",
 		},
 		{
@@ -232,8 +232,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
 			role:              RoleWorker,
 			multinodeDeployer: &GroveMultinodeDeployer{},
 			initialCommand:    []string{"sh", "-c"},
-			initialArgs:       []string{"python -m dynamo.sglang.worker"},
+			initialArgs:       []string{"python -m dynamo.sglang"},
-			expectedArgs:      []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1))"},
+			expectedArgs:      []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1))"},
 			description:       "Shell command worker should get grove env vars in node rank",
 		},
 		{
@@ -242,8 +242,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
 			role:              RoleLeader,
 			multinodeDeployer: &LWSMultinodeDeployer{},
 			initialCommand:    []string{"sh", "-c"},
-			initialArgs:       []string{"python -m dynamo.sglang.worker"},
+			initialArgs:       []string{"python -m dynamo.sglang"},
-			expectedArgs:      []string{"python -m dynamo.sglang.worker --dist-init-addr $(LWS_LEADER_ADDRESS):29500 --nnodes 2 --node-rank 0"},
+			expectedArgs:      []string{"python -m dynamo.sglang --dist-init-addr $(LWS_LEADER_ADDRESS):29500 --nnodes 2 --node-rank 0"},
 			description:       "LWS shell commands should use LWS variables",
 		},
 		{
@@ -252,8 +252,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
 			role:              RoleLeader,
 			multinodeDeployer: &GroveMultinodeDeployer{},
 			initialCommand:    []string{"sh", "-c"},
-			initialArgs:       []string{"python -m dynamo.sglang.worker | tee /tmp/log"},
+			initialArgs:       []string{"python -m dynamo.sglang | tee /tmp/log"},
-			expectedArgs:      []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 | tee /tmp/log"},
+			expectedArgs:      []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 | tee /tmp/log"},
 			description:       "Shell commands with pipes should inject flags before pipe",
 		},
 		{
@@ -262,8 +262,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
 			role:              RoleLeader,
 			multinodeDeployer: &GroveMultinodeDeployer{},
 			initialCommand:    []string{"sh", "-c"},
-			initialArgs:       []string{"echo start", "python -m dynamo.sglang.worker", "echo done"},
+			initialArgs:       []string{"echo start", "python -m dynamo.sglang", "echo done"},
-			expectedArgs:      []string{"echo start", "python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "echo done"},
+			expectedArgs:      []string{"echo start", "python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "echo done"},
 			description:       "Shell commands with multiple args should process each individually, modify only the python arg",
 		},
 		{
@@ -282,8 +282,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
 			role:              RoleLeader,
 			multinodeDeployer: &GroveMultinodeDeployer{},
 			initialCommand:    []string{"sh", "-c"},
-			initialArgs:       []string{"python -m dynamo.sglang.worker", "python -m dynamo.sglang.worker --other-flags"},
+			initialArgs:       []string{"python -m dynamo.sglang", "python -m dynamo.sglang --other-flags"},
-			expectedArgs:      []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "python -m dynamo.sglang.worker --other-flags"},
+			expectedArgs:      []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "python -m dynamo.sglang --other-flags"},
 			description:       "Should stop processing after first successful python flag injection",
 		},
 	}
@@ -444,7 +444,7 @@ func TestSGLangBackend_ProbeRemoval(t *testing.T) {
 			startupProbe := &corev1.Probe{InitialDelaySeconds: 5}
 			container := &corev1.Container{
-				Args:           []string{"python -m dynamo.sglang.worker"},
+				Args:           []string{"python -m dynamo.sglang"},
 				LivenessProbe:  livenessProbe,
 				ReadinessProbe: readinessProbe,
 				StartupProbe:   startupProbe,
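
Note on the shell-injection cases above: the updated expectations all place the multinode flags directly after the `dynamo.sglang` module name, before any pipe or `&&`, and only in the first argument that contains a matching python invocation. The following is a minimal sketch of that behavior, not the operator's actual implementation. The names `sglangCmd`, `injectFlags`, and `injectIntoArgs` are hypothetical, and the regular expression is an assumption chosen to satisfy the test strings shown in this diff.

```go
package main

import (
	"fmt"
	"regexp"
)

// sglangCmd is a hypothetical pattern: a python interpreter (python, python3,
// python3.11, ...) followed by "-m dynamo.sglang" and any dotted submodule.
var sglangCmd = regexp.MustCompile(`\bpython[0-9.]*\s+-m\s+dynamo\.sglang[\w.]*`)

// injectFlags inserts flags right after the first matching python invocation
// in a single shell string. It reports whether anything was injected.
func injectFlags(arg, flags string) (string, bool) {
	loc := sglangCmd.FindStringIndex(arg)
	if loc == nil {
		return arg, false
	}
	return arg[:loc[1]] + " " + flags + arg[loc[1]:], true
}

// injectIntoArgs walks the args in order and stops after the first
// successful injection, leaving every other argument untouched.
func injectIntoArgs(args []string, flags string) []string {
	out := make([]string, len(args))
	copy(out, args)
	for i, a := range out {
		if updated, ok := injectFlags(a, flags); ok {
			out[i] = updated
			break
		}
	}
	return out
}

func main() {
	flags := "--dist-init-addr $(LEADER_HOST):29500 --nnodes 2 --node-rank 0"
	args := []string{"echo start", "python -m dynamo.sglang | tee /tmp/log", "echo done"}
	for _, arg := range injectIntoArgs(args, flags) {
		fmt.Println(arg)
	}
}
```

Because the pattern also accepts a dotted suffix, the same sketch would handle the old `dynamo.sglang.worker` entry point, which may be useful while both forms are still in circulation.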

--- a/deploy/cloud/operator/internal/dynamo/graph_test.go
+++ b/deploy/cloud/operator/internal/dynamo/graph_test.go
@@ -1675,7 +1675,7 @@ func TestGenerateGrovePodCliqueSet(t *testing.T) {
 												"-c",
 											},
 											Args: []string{
-												"python3 -m dynamo.sglang.worker --custom-flag custom-value",
+												"python3 -m dynamo.sglang --custom-flag custom-value",
 											},
 										},
 									},
@@ -1828,7 +1828,7 @@ func TestGenerateGrovePodCliqueSet(t *testing.T) {
 													"-c",
 												},
 												Args: []string{
-													"python3 -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank 0 --custom-flag custom-value",
+													"python3 -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank 0 --custom-flag custom-value",
 												},
 												Ports: []corev1.ContainerPort{
 													{
@@ -1980,7 +1980,7 @@ func TestGenerateGrovePodCliqueSet(t *testing.T) {
 													"-c",
 												},
 												Args: []string{
-													"python3 -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --custom-flag custom-value",
+													"python3 -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --custom-flag custom-value",
 												},
 												Ports: []corev1.ContainerPort{
 													{
@@ -3207,7 +3207,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
 					ComponentType: commonconsts.ComponentTypeWorker,
 					ExtraPodSpec: &common.ExtraPodSpec{
 						MainContainer: &corev1.Container{
-							Args: []string{"python3 -m dynamo.sglang.worker"},
+							Args: []string{"python3 -m dynamo.sglang"},
 						},
 					},
 				},
@@ -3216,7 +3216,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
 			role:              RoleMain,
 			numberOfNodes:     1,
 			expectError:       false,
-			expectContains:    []string{"python3", "-m", "dynamo.sglang.worker"},
+			expectContains:    []string{"python3", "-m", "dynamo.sglang"},
 			expectNotContains: []string{"dist-init-addr", "nnodes", "tp-size"},
 		},
 		{
@@ -3226,7 +3226,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
 					ComponentType: commonconsts.ComponentTypeWorker,
 					ExtraPodSpec: &common.ExtraPodSpec{
 						MainContainer: &corev1.Container{
-							Args: []string{"python3 -m dynamo.sglang.worker"},
+							Args: []string{"python3 -m dynamo.sglang"},
 						},
 					},
 				},
@@ -3235,7 +3235,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
 			role:             RoleLeader,
 			numberOfNodes:    3,
 			expectError:      false,
-			expectContains:   []string{"python3", "-m", "dynamo.sglang.worker", "dist-init-addr", "nnodes", "node-rank"},
+			expectContains:   []string{"python3", "-m", "dynamo.sglang", "dist-init-addr", "nnodes", "node-rank"},
 		},
 		{
 			name: "SGLang multinode worker",
@@ -3244,7 +3244,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
 					ComponentType: commonconsts.ComponentTypeWorker,
 					ExtraPodSpec: &common.ExtraPodSpec{
 						MainContainer: &corev1.Container{
-							Args: []string{"python3 -m dynamo.sglang.worker"},
+							Args: []string{"python3 -m dynamo.sglang"},
 						},
 					},
 				},
@@ -3253,7 +3253,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
 			role:             RoleWorker,
 			numberOfNodes:    3,
 			expectError:      false,
-			expectContains:   []string{"python3", "-m", "dynamo.sglang.worker", "dist-init-addr", "nnodes", "node-rank"},
+			expectContains:   []string{"python3", "-m", "dynamo.sglang", "dist-init-addr", "nnodes", "node-rank"},
 		},
 		{
 			name: "SGLang with user command override",
@@ -3685,7 +3685,7 @@ func TestDetectBackendFrameworkFromArgs(t *testing.T) {
 		{
 			name:     "detect SGLang from args",
 			command:  []string{"/bin/sh", "-c"},
-			args:     []string{"python -m dynamo.sglang.worker --model test"},
+			args:     []string{"python -m dynamo.sglang --model test"},
 			expected: BackendFrameworkSGLang,
 		},
 		{
@@ -3703,7 +3703,7 @@ func TestDetectBackendFrameworkFromArgs(t *testing.T) {
 		{
 			name:     "detect from python3.11",
 			command:  []string{},
-			args:     []string{"python3.11 -m dynamo.sglang.decode_worker"},
+			args:     []string{"python3.11 -m dynamo.sglang"},
 			expected: BackendFrameworkSGLang,
 		},
 		{
@@ -3715,7 +3715,7 @@ func TestDetectBackendFrameworkFromArgs(t *testing.T) {
 		{
 			name:        "multiple backends detected",
 			command:     []string{},
-			args:        []string{"python -m dynamo.vllm.worker && python -m dynamo.sglang.worker"},
+			args:        []string{"python -m dynamo.vllm.worker && python -m dynamo.sglang"},
 			expectError: true,
 		},
 	}
@@ -3777,7 +3777,7 @@ func TestDetermineBackendFramework(t *testing.T) {
 		{
 			name:                     "worker with detected matching explicit",
 			componentType:            "worker",
-			args:                     []string{"python -m dynamo.sglang.worker"},
+			args:                     []string{"python -m dynamo.sglang"},
 			explicitBackendFramework: "sglang",
 			expected:                 BackendFrameworkSGLang,
 		},
@@ -3881,7 +3881,7 @@ func TestGetBackendFrameworkFromComponent(t *testing.T) {
 					ComponentType: "worker", // Worker component
 					ExtraPodSpec: &common.ExtraPodSpec{
 						MainContainer: &corev1.Container{
-							Args: []string{"python -m dynamo.sglang.worker"},
+							Args: []string{"python -m dynamo.sglang"},
 						},
 					},
 				},
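
Note on the detection cases above: `TestDetectBackendFrameworkFromArgs` expects `python -m dynamo.sglang` (with or without a versioned interpreter such as `python3.11`) to resolve to SGLang, and expects an error when more than one backend module appears in the same command. A rough sketch of that rule follows. The function name `detectBackend`, the pattern, the returned strings, and the "nothing detected" error are all assumptions made for illustration; the operator's real code may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// backendModule is a hypothetical pattern covering only the two backends that
// appear in the test cases of this diff.
var backendModule = regexp.MustCompile(`\bpython[0-9.]*\s+-m\s+dynamo\.(sglang|vllm)\b`)

// detectBackend joins command and args, collects every distinct backend
// module that is launched, and fails unless exactly one is found.
func detectBackend(command, args []string) (string, error) {
	joined := strings.Join(append(append([]string{}, command...), args...), " ")

	seen := map[string]bool{}
	for _, m := range backendModule.FindAllStringSubmatch(joined, -1) {
		seen[m[1]] = true
	}

	if len(seen) == 0 {
		return "", fmt.Errorf("no backend framework detected")
	}
	if len(seen) > 1 {
		names := make([]string, 0, len(seen))
		for name := range seen {
			names = append(names, name)
		}
		sort.Strings(names) // deterministic error message
		return "", fmt.Errorf("multiple backends detected: %s", strings.Join(names, ", "))
	}
	for name := range seen {
		return name, nil
	}
	return "", nil // unreachable: seen has exactly one entry here
}

func main() {
	backend, err := detectBackend([]string{"/bin/sh", "-c"}, []string{"python -m dynamo.sglang --model test"})
	fmt.Println(backend, err)

	_, err = detectBackend(nil, []string{"python -m dynamo.vllm.worker && python -m dynamo.sglang"})
	fmt.Println(err)
}
```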

--- a/docs/support_matrix.md
+++ b/docs/support_matrix.md
@@ -30,7 +30,6 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 | **NVIDIA Ada Lovelace Architecture** | Supported  |
 | **NVIDIA Ampere Architecture**       | Supported  |
 ## Platform Architecture Compatibility
 **Dynamo** is compatible with the following platforms:
@@ -51,16 +50,15 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 > [!Caution]
 > KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
 ## Software Compatibility
 ### Runtime Dependency
-| **Python Package** | **Version**   | glibc version                        | CUDA Version |
+| **Python Package** | **Version** | glibc version                         | CUDA Version |
-| :----------------- | :------------ | :----------------------------------- | :----------- |
+| :----------------- | :---------- | :------------------------------------ | :----------- |
-| ai-dynamo          | 0.5.1         | >=2.28                               |              |
+| ai-dynamo          | 0.5.1       | >=2.28                                |              |
-| ai-dynamo-runtime  | 0.5.1         | >=2.28 (Python 3.12 has known issues)|              |
+| ai-dynamo-runtime  | 0.5.1       | >=2.28 (Python 3.12 has known issues) |              |
-| NIXL               | 0.4.1         | >=2.27                               | >=11.8       |
+| NIXL               | 0.4.1       | >=2.27                                | >=11.8       |
 ### Build Dependency
@@ -69,7 +67,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 | **TensorRT-LLM**     | 1.1.0rc5                                                                         |
 | **NIXL**             | 0.4.1                                                                            |
 | **vLLM**             | 0.10.1.1                                                                         |
-| **SGLang**           | 0.5.0rc2                                                                         |
+| **SGLang**           | 0.5.3rc0                                                                         |
 > [!Important]
 > Specific versions of TensorRT-LLM supported by Dynamo are subject to change. Currently TensorRT-LLM does not support Python 3.11 so installation of the ai-dynamo[trtllm] will fail.
@@ -78,27 +76,25 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 ### AWS
-| **Host Operating System** | **Version** | **Architecture** | **Status**   |
+| **Host Operating System** | **Version** | **Architecture** | **Status** |
-| :------------------------ | :---------- | :--------------- | :----------- |
+| :------------------------ | :---------- | :--------------- | :--------- |
-| **Amazon Linux**          | 2023        | x86_64           | Supported¹   |
+| **Amazon Linux**          | 2023        | x86_64           | Supported¹ |
 > [!Caution]
 > ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
 ## Build Support
 **Dynamo** currently provides build support in the following ways:
 - **Wheels**: Pre-built Python wheels are only available for **x86_64 Linux**.
-   No wheels are available for other platforms at this time.
+  No wheels are available for other platforms at this time.
 - **Runtime Container Images**: We distribute only **AMD64** images of the runtime target on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) for [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime), [vLLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime), and [SGLang](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime).
-    Users must build the container image from source if they require an **ARM64** image.
+  Users must build the container image from source if they require an **ARM64** image.
 - **Deployment-supportive Images**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the [Dynamo kubernetes-operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) to simplify deployments of Dynamo Graphs.
-    It is currently provided as an **AMD64** image only.
+  It is currently provided as an **AMD64** image only.
 - **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo. [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds), [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform), and [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph) are available.

--- a/examples/basics/multinode/README.md
+++ b/examples/basics/multinode/README.md
@@ -3,12 +3,14 @@
 This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
 For more information about the core concepts, see:
 - [Dynamo Disaggregated Serving](../../../docs/architecture/disagg_serving.md)
 - [KV Cache Routing Architecture](../../../docs/architecture/kv_cache_routing.md)
 ## Architecture Overview
 The multi-node setup consists of:
 - **1 Frontend**: Receives HTTP requests and uses KV routing to distribute them
 - **2 Model Replicas**: Each with dedicated prefill and decode workers
 - **Smart KV-Aware Routing**: Intelligently routes requests based on KV cache locality across **all workers**
@@ -57,6 +59,7 @@ KV-aware routing optimizes LLM inference by directing requests to workers that a
 - **Balances load**: Considers both cache efficiency and worker utilization when making routing decisions
 This is particularly beneficial for:
 - **Shared system prompts**: Cached across workers and reused efficiently
 - **Multi-turn conversations**: Full conversation history benefits from caching
 - **Similar queries**: Common prefixes are computed once and reused
@@ -90,6 +93,7 @@ For more information about the SGLang backend and its integration with Dynamo, s
 ### 3. Network Requirements
 Ensure the following ports are accessible between nodes:
 - **2379**: etcd client port
 - **4222**: NATS client port
 - **8000**: Frontend HTTP port (only needed on frontend node)
@@ -98,6 +102,7 @@ Ensure the following ports are accessible between nodes:
 ### 4. Hardware Setup
 This example assumes:
 - **Node 1**: At least 2 GPUs (for Replica 1's decode and prefill workers)
 - **Node 2**: At least 2 GPUs (for Replica 2's decode and prefill workers)
 - **Frontend Node**: Can be on Node 1, Node 2, or a separate node (no GPU required)
@@ -131,7 +136,7 @@ Open a terminal on Node 1 and launch both workers:
 ```bash
 # Launch prefill worker in background
-CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
+CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
    --model-path Qwen/Qwen3-0.6B \
    --served-model-name Qwen/Qwen3-0.6B \
    --page-size 16 \
@@ -141,7 +146,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend nixl &
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
    --model-path Qwen/Qwen3-0.6B \
    --served-model-name Qwen/Qwen3-0.6B \
    --page-size 16 \
@@ -153,6 +158,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
 ```
 > [!INFO]
+>
 > - `CUDA_VISIBLE_DEVICES`: Controls which GPU each worker uses (0 and 1 for different GPUs)
 > - `--page-size 16`: Sets the KV cache block size - must be identical across all workers
 > - `--disaggregation-mode`: Separates prefill (prompt processing) from decode (token generation)
@@ -165,7 +171,7 @@ Open a terminal on Node 2 and launch both workers:
 ```bash
 # Launch prefill worker in background
-CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
+CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
    --model-path Qwen/Qwen3-0.6B \
    --served-model-name Qwen/Qwen3-0.6B \
    --page-size 16 \
@@ -176,7 +182,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
    --disaggregation-transfer-backend nixl &
 # Launch decode worker in foreground
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
    --model-path Qwen/Qwen3-0.6B \
    --served-model-name Qwen/Qwen3-0.6B \
    --page-size 16 \
@@ -206,6 +212,7 @@ hostname -I | awk '{print $1}'
 ```
 The frontend will:
 - Discover all available decode workers via etcd
 - Enable KV-aware routing for intelligent request distribution
 - Monitor worker health and adjust routing accordingly
@@ -418,6 +425,7 @@ curl http://${DYN_FRONTEND_IP}:8000/health
 ### Workers Not Discovering Each Other
 1. Verify etcd connectivity from all nodes:
   ```bash
   etcdctl --endpoints=$ETCD_ENDPOINTS endpoint health
   ```
@@ -461,9 +469,11 @@ Stop all components in reverse order:
 1. Stop Frontend (Ctrl+C in the frontend terminal)
 2. Stop workers on each node:
   - On Node 1: Press Ctrl+C in the terminal (this stops the decode worker)
   - On Node 2: Press Ctrl+C in the terminal (this stops the decode worker)
   - To stop the background prefill workers, use one of these methods:
     ```bash
     # Method 1: Kill background jobs in the same terminal
     jobs           # See background jobs
@@ -473,8 +483,9 @@ Stop all components in reverse order:
     exit
     # Method 3: Kill by process name (from any terminal)
-     pkill -f "dynamo.sglang.worker.*prefill"
+     pkill -f "dynamo.sglang.*prefill"
     ```
 3. Stop infrastructure services:
   ```bash
   docker compose -f deploy/docker-compose.yml down

--- a/pyproject.toml
+++ b/pyproject.toml
@@ -60,7 +60,7 @@ vllm = [
 sglang = [
    "uvloop",
    "nixl<=0.4.1",
-    "sglang[all]==0.5.0rc2",
+    "sglang[all]==0.5.3rc0",
 ]
 llama_cpp = [