feat: sglang to 0.5.9 + updated docs (#6518)

Co-authored-by: baihuitian <baihuitian.bht@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Unverified Commit 6642e23e authored Feb 24, 2026 by ishandhanani Committed by GitHub Feb 24, 2026
20 changed files
--- a/examples/backends/sglang/slurm_jobs/.gitignore
+++ b/examples/backends/sglang/slurm_jobs/.gitignore
-logs/*
-outputs/*
--- a/examples/backends/sglang/slurm_jobs/README.md
+++ b/examples/backends/sglang/slurm_jobs/README.md
-# Example: Deploy DeepSeek R1 - FP8 with Dynamo and SGLang on SLURM
-
-This folder allows you to deploy the SGLang DeepSeek-R1 Disaggregated with WideEP on a GB200 SLURM cluster.
-
-## SLURM Prerequisites
-
-For this example, we will make some assumptions about your SLURM cluster:
-
-1. We assume you have access to a SLURM cluster with multiple GPU nodes
-   available. For functional testing, most setups should be fine. For performance
-   testing, you should aim to allocate groups of nodes that are performantly
-   inter-connected, such as those in an NVL72 setup.
-2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
-   SPANK plugin setup. In particular, the `job_script_template.j2` template in this
-   example will use `srun` arguments like `--container-image`,
-   `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
-   If your cluster supports similar container based plugins, you may be able to
-   modify the template to use that instead.
-3. We assume you have already built a recent Dynamo+SGLang container image as
-   described [here](../../../../docs/pages/backends/sglang/README.md#using-docker-containers).
-   This is the image that can be passed to the `--container-image` argument in later steps.
-
-## Scripts Overview
-
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM sbatch scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
-
-## Logs Folder Structure
-
-Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
-
-## Usage
-
-> [!NOTE]
-> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `ip addr show $NETWORK_INTERFACE` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions are always welcome.
-
-1. **Submit a benchmark job**:
-
-   ```bash
-   python3 submit_job_script.py \
-     --template job_script_template.j2 \
-     --model-dir <path-to>/deepseek-r1-0528 \
-     --container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh \
-     --gpus-per-node 4 \
-     --config-dir <path-to>/klconfigs \
-     --gpu-type gb200-fp8 \
-     --network-interface enP6p9s0np0 \
-     --prefill-nodes 6 \
-     --decode-nodes 12 \
-     --prefill-workers 3 \
-     --decode-workers 1 \
-     --account <account> \
-     --partition <partition> \
-     --time-limit 4:00:00 \
-     --enable-multiple-frontends \
-     --num-additional-frontends 9 \
-     --profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"
-   ```
-
-   This command will deploy 3 prefill workers and 1 decode worker with 9 additional frontends load-balanced by nginx. Diving deeper into the command:
-
-   - `--template job_script_template.j2`: Path to Jinja2 template file (this shouldn't change unless you want to modify the template)
-   - `--model-dir <path-to>/deepseek-r1-0528`: Path to DSR1-FP8 model directory
-   - `--container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh`: Enroot container image URI
-   - `--gpus-per-node 4`: Number of GPUs per node (each GB200 tray has 4 GPUs)
-   - `--config-dir <path-to>/klconfigs`: Various configs (see explanation below)
-   - `--gpu-type gb200-fp8`: GPU type to use, choices: `gb200-fp8`
-   - `--network-interface enP6p9s0np0`: Network interface to use (depends on your cluster)
-   - `--prefill-nodes 6`: Number of prefill nodes
-   - `--decode-nodes 12`: Number of decode nodes
-   - `--prefill-workers 3`: Number of prefill workers
-   - `--decode-workers 1`: Number of decode workers
-   - `--account <account>`: SLURM account
-   - `--partition <partition>`: SLURM partition
-   - `--time-limit 4:00:00`: Time limit in HH:MM:SS format
-   - `--enable-multiple-frontends`: Enable multiple frontend architecture with nginx load balancer
-   - `--num-additional-frontends 9`: Number of additional frontends
-   - `--profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"`: Profiler configurations (see explanation below)
-
-   **Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
-
-2. **Check logs in real-time**:
-
-   ```bash
-   cd logs/{JOB_ID}
-   tail -f *_prefill_*.err *_decode_*.err
-   ```
-
-## Configs directory
-
-The `--config-dir` argument is used to specify the directory containing the various configs that are used when running this model. Here are the current configs that are in our directory.
-
-```bash
-klconfigs/
-├── decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json
-├── deepep_config.json
-├── dgcache/
-└── prefill_dsr1-0528_in1000out1000_num40000.json
-```
-
-1. `decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json`: `init-expert-location` for decode worker
-2. `deepep_config.json`: DeepEP config file for GB2009
-3. `dgcache/`: DeepGEMM kernel cache directory. Instructions for creating this can be found [here](https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174)
-4. `prefill_dsr1-0528_in1000out1000_num40000.json`: `init-expert-location` for prefill worker
-
-**Note**: The expert locations are collected using the instructions [here](https://github.com/sgl-project/sglang/issues/6017). See the section titled "Create expert distribution data". Note that this is sensitive to your data and performance results may differ if you dont benchmark with the same data that was used to collect the expert locations.
-
-## Profiler
-
-If you provide the `--profiler` command, the sbatch script will automatically warmup the model and run the vllm benchmarking script. Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
--- a/examples/backends/sglang/slurm_jobs/job_script_template_agg.j2
+++ b/examples/backends/sglang/slurm_jobs/job_script_template_agg.j2
-#!/bin/bash
-#SBATCH --job-name={{ job_name }}
-#SBATCH --nodes={{ total_nodes }}
-#SBATCH --ntasks={{ total_nodes }}
-#SBATCH --ntasks-per-node=1
-#SBATCH --account={{ account }}
-#SBATCH --time={{ time_limit }}
-#SBATCH --output=logs/%j_{{ agg_workers }}A_{{ timestamp }}/log.out
-#SBATCH --error=logs/%j_{{ agg_workers }}A_{{ timestamp }}/log.err
-#SBATCH --partition={{ partition }}
-
-# Constants
-set -x
-AGG_NODES={{ agg_nodes }}
-AGG_WORKERS={{ agg_workers }}
-TOTAL_NODES={{ total_nodes }}
-GPUS_PER_NODE={{ gpus_per_node }}
-TOTAL_GPUS=$((AGG_NODES * GPUS_PER_NODE))
-PREFILL_GPUS=0
-DECODE_GPUS=$TOTAL_GPUS
-AGG_NODES_PER_WORKER=$((AGG_NODES / AGG_WORKERS))
-LOG_DIR="${SLURM_SUBMIT_DIR}/logs/${SLURM_JOB_ID}_{{ agg_workers }}A_{{ timestamp }}"
-SCRIPT_DIR="${SLURM_SUBMIT_DIR}/scripts"
-OUTPUT_DIR="${SLURM_SUBMIT_DIR}/outputs"
-MODEL_DIR="{{ model_dir }}"
-CONFIG_DIR="{{ config_dir }}"
-CONTAINER_IMAGE="{{ container_image }}"
-NETWORK_INTERFACE="{{ network_interface }}"
-GPU_TYPE="{{ gpu_type | default('h100') }}"
-set +x
-
-{% raw %}
-
-mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}"
-
-nodes=($(scontrol show hostnames $SLURM_NODELIST))
-if [ ${#nodes[@]} -ne $TOTAL_NODES ]; then
-    echo "Error: Expected $TOTAL_NODES nodes but got ${#nodes[@]} nodes"
-    exit 1
-fi
-
-# Print node information
-for i in "${!nodes[@]}"; do
-    echo "Node $i: ${nodes[$i]}"
-done
-
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-# Multiple frontend architecture
-# Node 0: nginx + aggregated worker shard
-# Node 1: NATS/ETCD + first frontend
-# Node 2+: aggregated workers + optional additional frontends
-
-NGINX_NODE=${nodes[0]}
-MASTER_NODE=${nodes[1]}
-MASTER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$MASTER_NODE ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
-if [ -z "$MASTER_IP" ]; then
-    echo "Error: Could not retrieve IP address for master host $MASTER_NODE on interface $NETWORK_INTERFACE"
-    exit 1
-fi
-echo "Master IP address (node 1): $MASTER_IP"
-echo "Nginx node (node 0): $NGINX_NODE"
-
-# Generate frontend IP list for nginx config
-frontend_hosts=()
-frontend_ips=()
-# Node 1 always has a frontend (with NATS/ETCD)
-frontend_hosts+=("$MASTER_NODE")
-frontend_ips+=("$MASTER_IP")
-
-# Add additional frontends based on num_additional_frontends
-{% endraw %}ADDITIONAL_FRONTENDS={{ num_additional_frontends }}{% raw %}
-if [ "$ADDITIONAL_FRONTENDS" -gt 0 ]; then
-    # Calculate which nodes get additional frontends
-    # We have AGG_NODES aggregated worker nodes, distribute additional frontends across them
-    nodes_per_frontend=$(( (AGG_NODES - 1 + ADDITIONAL_FRONTENDS - 1) / ADDITIONAL_FRONTENDS ))  # ceil division
-    frontend_node_idx=2  # Start from node 2 (node 1 already has frontend)
-
-    for i in $(seq 1 $ADDITIONAL_FRONTENDS); do
-        if [ $frontend_node_idx -lt $TOTAL_NODES ]; then
-            node_name=${nodes[$frontend_node_idx]}
-            node_ip=$(srun --nodes=1 --ntasks=1 --nodelist=$node_name ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
-            frontend_hosts+=("$node_name")
-            frontend_ips+=("$node_ip")
-            echo "Additional frontend $i on node $frontend_node_idx: $node_name ($node_ip)"
-            frontend_node_idx=$((frontend_node_idx + nodes_per_frontend))
-        fi
-    done
-fi
-
-echo "Frontend hosts: ${frontend_hosts[@]}"
-echo "Frontend IPs: ${frontend_ips[@]}"
-
-# Generate nginx configuration
-# Build a Python list literal of frontend hosts from the bash array
-FRONTEND_LIST=$(printf "'%s'," "${frontend_ips[@]}")
-FRONTEND_LIST="[${FRONTEND_LIST%,}]"
-export FRONTEND_LIST SCRIPT_DIR LOG_DIR
-python3 - <<'PY'
-import os
-from jinja2 import Template
-
-template_path = os.path.join(os.environ['SCRIPT_DIR'], 'nginx.conf.j2')
-output_path = os.path.join(os.environ['LOG_DIR'], 'nginx.conf')
-
-with open(template_path, 'r') as f:
-    tmpl = Template(f.read())
-
-frontend_hosts = eval(os.environ['FRONTEND_LIST'])
-config = tmpl.render(frontend_hosts=frontend_hosts)
-
-with open(output_path, 'w') as f:
-    f.write(config)
-PY
-
-{% endraw %}
-{% else %}
-{% raw %}
-# Traditional architecture - first aggregated worker node handles everything
-MASTER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
-if [ -z "$MASTER_IP" ]; then
-    echo "Error: Could not retrieve IP address for master host ${nodes[0]} on interface $NETWORK_INTERFACE"
-    exit 1
-fi
-echo "Master IP address: $MASTER_IP"
-{% endraw %}
-{% endif %}
-{% raw %}
-
-# Compute leader nodes for each aggregated worker
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-# With multiple frontends: keep offset 0; nginx coexists on node 0
-WORKER_NODE_OFFSET=0
-{% endraw %}
-{% else %}
-{% raw %}
-# Traditional: workers start from node 0
-WORKER_NODE_OFFSET=0
-{% endraw %}
-{% endif %}
-{% raw %}
-
-agg_leaders=()
-for i in $(seq 0 $((AGG_WORKERS - 1))); do
-    leader_idx=$((WORKER_NODE_OFFSET + i * AGG_NODES_PER_WORKER))
-    agg_leaders[$i]=$leader_idx
-done
-
-echo "Aggregated worker leaders: ${agg_leaders[@]}"
-
-# Prepare enroot arguments to pass to srun commands
-ENROOT_ARGS="\
-    --container-image=${CONTAINER_IMAGE} \
-    --no-container-entrypoint \
-    --no-container-mount-home \
-    --container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \
-"
-
-# Build common worker arguments
-{% endraw %}
-SCRIPT_VARIANT="{{ script_variant | default('default') }}"
-{% raw %}
-WORKER_ARGS="--gpu_type ${GPU_TYPE} --script-variant ${SCRIPT_VARIANT} --gpus_per_node ${GPUS_PER_NODE} --master_ip ${MASTER_IP}"
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-# Add multiple frontends flag for worker setup
-WORKER_ARGS="$WORKER_ARGS --multiple-frontends-enabled"
-{% endraw %}
-{% endif %}
-{% if run_in_ci %}
-{% raw %}
-# Add CI mode flag for worker setup
-WORKER_ARGS="$WORKER_ARGS --run-in-ci"
-{% endraw %}
-{% endif %}
-{% raw %}
-
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-# Launch nginx on node 0
-echo "Launching nginx on ${NGINX_NODE}"
-cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$NGINX_NODE --output=${LOG_DIR}/${NGINX_NODE}_nginx.out --error=${LOG_DIR}/${NGINX_NODE}_nginx.err python /scripts/worker_setup.py --worker_type nginx --nginx_config /logs/nginx.conf ${WORKER_ARGS}"
-echo "$cmd"
-$cmd &
-
-# Launch frontend on master node (node 1) - this will also start NATS/ETCD
-echo "Launching frontend + NATS/ETCD on master node ${MASTER_NODE}"
-cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$MASTER_NODE --output=${LOG_DIR}/${MASTER_NODE}_frontend_0.out --error=${LOG_DIR}/${MASTER_NODE}_frontend.err python /scripts/worker_setup.py --worker_type frontend --worker_idx 0 ${WORKER_ARGS}"
-echo "$cmd"
-$cmd &
-
-# Launch additional frontends on designated nodes
-if [ "$ADDITIONAL_FRONTENDS" -gt 0 ]; then
-    frontend_idx=1  # Start from 1 since node 1 is frontend 0
-    nodes_per_frontend=$(( (TOTAL_NODES - 2 + ADDITIONAL_FRONTENDS - 1) / ADDITIONAL_FRONTENDS ))
-    frontend_node_idx=2
-
-    for i in $(seq 1 $ADDITIONAL_FRONTENDS); do
-        if [ $frontend_node_idx -lt $TOTAL_NODES ]; then
-            node=${nodes[$frontend_node_idx]}
-            echo "Launching additional frontend $frontend_idx on node $frontend_node_idx: $node"
-            cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_frontend_${frontend_idx}.out --error=${LOG_DIR}/${node}_frontend_${frontend_idx}.err python /scripts/worker_setup.py --worker_type frontend --worker_idx ${frontend_idx} ${WORKER_ARGS}"
-            echo "$cmd"
-            $cmd &
-            frontend_idx=$((frontend_idx + 1))
-            frontend_node_idx=$((frontend_node_idx + nodes_per_frontend))
-        fi
-    done
-fi
-{% endraw %}
-{% else %}
-{% raw %}
-# Traditional: first aggregated worker node also runs frontend + NATS/ETCD
-# This is handled in setup_aggregated_worker when worker_idx=0 and local_rank=0
-{% endraw %}
-{% endif %}
-{% raw %}
-
-# Launch aggregated workers
-for worker_idx in $(seq 0 $((AGG_WORKERS - 1))); do
-    leader_idx=${agg_leaders[$worker_idx]}
-    leader_node=${nodes[$leader_idx]}
-
-    # Get leader IP for this worker group
-    LEADER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$leader_node ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
-    echo "Aggregated worker $worker_idx leader: $leader_node ($LEADER_IP)"
-
-    # Launch all nodes for this worker
-    for node_idx in $(seq 0 $((AGG_NODES_PER_WORKER - 1))); do
-        global_node_idx=$((leader_idx + node_idx))
-        node=${nodes[$global_node_idx]}
-        local_rank=$node_idx
-
-        echo "Launching aggregated worker $worker_idx, node $global_node_idx (local_rank $local_rank): $node"
-{% endraw %}
-{% if enable_config_dump %}
-{% raw %}
-        CONFIG_DUMP_ARG="--dump-config-path /logs/${node}_config.json"
-{% endraw %}
-{% else %}
-{% raw %}
-        CONFIG_DUMP_ARG=""
-{% endraw %}
-{% endif %}
-{% raw %}
-        cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_agg_w${worker_idx}.out --error=${LOG_DIR}/${node}_agg_w${worker_idx}.err python /scripts/worker_setup.py --leader_ip ${LEADER_IP} --worker_idx ${worker_idx} --local_rank ${local_rank} --nodes_per_worker ${AGG_NODES_PER_WORKER} --worker_type aggregated --gpu_utilization_log /logs/${node}_agg_w${worker_idx}_gpu_utilization.log ${CONFIG_DUMP_ARG} ${WORKER_ARGS}"
-        echo "$cmd"
-        $cmd &
-    done
-done
-
-echo ""
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-echo "Frontend available at: http://${NGINX_NODE}:8000"
-echo "To connect to the nginx node:"
-echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${NGINX_NODE} --overlap --pty bash"
-echo "To connect to the master node (NATS/ETCD):"
-echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${MASTER_NODE} --overlap --pty bash"
-{% endraw %}
-{% else %}
-{% raw %}
-echo "To connect to the master node:"
-echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --overlap --pty bash"
-{% endraw %}
-{% endif %}
-{% raw %}
-
-echo ""
-echo "Make sure to cancel the job at the end:"
-echo "scancel $SLURM_JOB_ID"
-
-# Instead of waiting for all tasks to complete, wait for profile.sh to complete and then exit.
-
-{% endraw %}
-
-PROFILER_TYPE={{ profiler_type }}
-PROFILER_ARGS="{{ profiler_arg }}"
-
-{% if do_profile %}
-{% raw %}
-srun --nodes=1 --ntasks=1 $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --output=${LOG_DIR}/profile.out --error=${LOG_DIR}/profile.err --overlap bash /scripts/${PROFILER_TYPE}/bench.sh 0 $AGG_WORKERS $PREFILL_GPUS $DECODE_GPUS $TOTAL_GPUS ${PROFILER_ARGS} &
-{% endraw %}
-{% endif %}
-
-{% raw %}
-wait -n
-first_exit_code=$?
-echo "Script finished at $(date) with exit code ${first_exit_code}"
-exit $first_exit_code
-{% endraw %}
-
--- a/examples/backends/sglang/slurm_jobs/job_script_template_disagg.j2
+++ b/examples/backends/sglang/slurm_jobs/job_script_template_disagg.j2
-#!/bin/bash
-#SBATCH --job-name={{ job_name }}
-#SBATCH --nodes={{ total_nodes }}
-#SBATCH --ntasks={{ total_nodes }}
-#SBATCH --ntasks-per-node=1
-#SBATCH --account={{ account }}
-#SBATCH --time={{ time_limit }}
-#SBATCH --output=logs/%j_{{ prefill_workers }}P_{{ decode_workers }}D_{{ timestamp }}/log.out
-#SBATCH --error=logs/%j_{{ prefill_workers }}P_{{ decode_workers }}D_{{ timestamp }}/log.err
-#SBATCH --partition={{ partition }}
-
-# Constants
-set -x
-PREFILL_NODES={{ prefill_nodes }}
-DECODE_NODES={{ decode_nodes }}
-PREFILL_WORKERS={{ prefill_workers }}
-DECODE_WORKERS={{ decode_workers }}
-TOTAL_NODES=$((PREFILL_NODES + DECODE_NODES))
-GPUS_PER_NODE={{ gpus_per_node }}
-TOTAL_GPUS=$((TOTAL_NODES * GPUS_PER_NODE))
-PREFILL_GPUS=$((PREFILL_NODES * GPUS_PER_NODE))
-DECODE_GPUS=$((DECODE_NODES * GPUS_PER_NODE))
-PREFILL_NODES_PER_WORKER=$((PREFILL_NODES / PREFILL_WORKERS))
-DECODE_NODES_PER_WORKER=$((DECODE_NODES / DECODE_WORKERS))
-LOG_DIR="${SLURM_SUBMIT_DIR}/logs/${SLURM_JOB_ID}_{{ prefill_workers }}P_{{ decode_workers }}D_{{ timestamp }}"
-SCRIPT_DIR="${SLURM_SUBMIT_DIR}/scripts"
-OUTPUT_DIR="${SLURM_SUBMIT_DIR}/outputs"
-MODEL_DIR="{{ model_dir }}"
-CONFIG_DIR="{{ config_dir }}"
-CONTAINER_IMAGE="{{ container_image }}"
-NETWORK_INTERFACE="{{ network_interface }}"
-GPU_TYPE="{{ gpu_type | default('h100') }}"
-set +x
-
-{% raw %}
-
-mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}"
-
-nodes=($(scontrol show hostnames $SLURM_NODELIST))
-if [ ${#nodes[@]} -ne $TOTAL_NODES ]; then
-    echo "Error: Expected $TOTAL_NODES nodes but got ${#nodes[@]} nodes"
-    exit 1
-fi
-
-# Print node information
-for i in "${!nodes[@]}"; do
-    echo "Node $i: ${nodes[$i]}"
-done
-
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-# Multiple frontend architecture
-# Node 0: nginx only + prefill shard
-# Node 1: NATS/ETCD + first frontend + prefill shard
-# Node 2+: prefill/decode workers + optional additional frontends
-
-NGINX_NODE=${nodes[0]}
-MASTER_NODE=${nodes[1]}
-MASTER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$MASTER_NODE ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
-if [ -z "$MASTER_IP" ]; then
-    echo "Error: Could not retrieve IP address for master host $MASTER_NODE on interface $NETWORK_INTERFACE"
-    exit 1
-fi
-echo "Master IP address (node 1): $MASTER_IP"
-echo "Nginx node (node 0): $NGINX_NODE"
-
-# Generate frontend IP list for nginx config
-frontend_hosts=()
-frontend_ips=()
-# Node 1 always has a frontend (with NATS/ETCD)
-frontend_hosts+=("$MASTER_NODE")
-frontend_ips+=("$MASTER_IP")
-
-# Add additional frontends based on num_additional_frontends
-{% endraw %}ADDITIONAL_FRONTENDS={{ num_additional_frontends }}{% raw %}
-if [ "$ADDITIONAL_FRONTENDS" -gt 0 ]; then
-    # Calculate which nodes get additional frontends
-    # We have TOTAL_NODES prefill/decode nodes, distribute additional frontends across them
-    nodes_per_frontend=$(( (TOTAL_NODES - 1 + ADDITIONAL_FRONTENDS - 1) / ADDITIONAL_FRONTENDS ))  # ceil division
-    frontend_node_idx=2  # Start from node 2 (node 1 already has frontend)
-
-    for i in $(seq 1 $ADDITIONAL_FRONTENDS); do
-        if [ $frontend_node_idx -lt $TOTAL_NODES ]; then
-            node_name=${nodes[$frontend_node_idx]}
-            node_ip=$(srun --nodes=1 --ntasks=1 --nodelist=$node_name ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
-            frontend_hosts+=("$node_name")
-            frontend_ips+=("$node_ip")
-            echo "Additional frontend $i on node $frontend_node_idx: $node_name ($node_ip)"
-            frontend_node_idx=$((frontend_node_idx + nodes_per_frontend))
-        fi
-    done
-fi
-
-echo "Frontend hosts: ${frontend_hosts[@]}"
-echo "Frontend IPs: ${frontend_ips[@]}"
-
-# Generate nginx configuration
-# Build a Python list literal of frontend hosts from the bash array
-FRONTEND_LIST=$(printf "'%s'," "${frontend_ips[@]}")
-FRONTEND_LIST="[${FRONTEND_LIST%,}]"
-export FRONTEND_LIST SCRIPT_DIR LOG_DIR
-python3 - <<'PY'
-import os
-from jinja2 import Template
-
-template_path = os.path.join(os.environ['SCRIPT_DIR'], 'nginx.conf.j2')
-output_path = os.path.join(os.environ['LOG_DIR'], 'nginx.conf')
-
-with open(template_path, 'r') as f:
-    tmpl = Template(f.read())
-
-frontend_hosts = eval(os.environ['FRONTEND_LIST'])
-config = tmpl.render(frontend_hosts=frontend_hosts)
-
-with open(output_path, 'w') as f:
-    f.write(config)
-PY
-
-{% endraw %}
-{% else %}
-{% raw %}
-# Traditional architecture - first prefill node handles everything
-MASTER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
-if [ -z "$MASTER_IP" ]; then
-    echo "Error: Could not retrieve IP address for master host ${nodes[0]} on interface $NETWORK_INTERFACE"
-    exit 1
-fi
-echo "Master IP address: $MASTER_IP"
-{% endraw %}
-{% endif %}
-{% raw %}
-
-# Compute leader nodes for each worker
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-# With multiple frontends: keep offset 0; nginx coexists on node 0
-WORKER_NODE_OFFSET=0
-{% endraw %}
-{% else %}
-{% raw %}
-# Traditional: workers start from node 0
-WORKER_NODE_OFFSET=0
-{% endraw %}
-{% endif %}
-{% raw %}
-
-prefill_leaders=()
-for i in $(seq 0 $((PREFILL_WORKERS - 1))); do
-    leader_idx=$((WORKER_NODE_OFFSET + i * PREFILL_NODES_PER_WORKER))
-    prefill_leaders[$i]=$leader_idx
-done
-
-decode_leaders=()
-for i in $(seq 0 $((DECODE_WORKERS - 1))); do
-    leader_idx=$((WORKER_NODE_OFFSET + PREFILL_NODES + i * DECODE_NODES_PER_WORKER))
-    decode_leaders[$i]=$leader_idx
-done
-
-echo "Prefill worker leaders: ${prefill_leaders[@]}"
-echo "Decode worker leaders: ${decode_leaders[@]}"
-
-# Prepare enroot arguments to pass to srun commands
-ENROOT_ARGS="\
-    --container-image=${CONTAINER_IMAGE} \
-    --no-container-entrypoint \
-    --no-container-mount-home \
-    --container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \
-"
-
-# Build common worker arguments
-{% endraw %}
-SCRIPT_VARIANT="{{ script_variant | default('default') }}"
-{% raw %}
-WORKER_ARGS="--gpu_type ${GPU_TYPE} --script-variant ${SCRIPT_VARIANT} --gpus_per_node ${GPUS_PER_NODE} --master_ip ${MASTER_IP}"
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-# Add multiple frontends flag for worker setup
-WORKER_ARGS="$WORKER_ARGS --multiple-frontends-enabled"
-{% endraw %}
-{% endif %}
-{% if use_init_location %}
-{% raw %}
-# Add multiple frontends flag for worker setup
-WORKER_ARGS="$WORKER_ARGS --use_init_locations"
-{% endraw %}
-{% endif %}
-{% if run_in_ci %}
-{% raw %}
-# Add CI mode flag for worker setup
-WORKER_ARGS="$WORKER_ARGS --run-in-ci"
-{% endraw %}
-{% endif %}
-{% raw %}
-
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-# Launch nginx on node 0
-echo "Launching nginx on ${NGINX_NODE}"
-cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$NGINX_NODE --output=${LOG_DIR}/${NGINX_NODE}_nginx.out --error=${LOG_DIR}/${NGINX_NODE}_nginx.err python /scripts/worker_setup.py --worker_type nginx --nginx_config /logs/nginx.conf ${WORKER_ARGS}"
-echo "$cmd"
-$cmd &
-
-# Launch frontend on master node (node 1) - this will also start NATS/ETCD
-echo "Launching frontend + NATS/ETCD on master node ${MASTER_NODE}"
-cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$MASTER_NODE --output=${LOG_DIR}/${MASTER_NODE}_frontend_0.out --error=${LOG_DIR}/${MASTER_NODE}_frontend.err python /scripts/worker_setup.py --worker_type frontend --worker_idx 0 ${WORKER_ARGS}"
-echo "$cmd"
-$cmd &
-
-# Launch additional frontends on designated nodes
-if [ "$ADDITIONAL_FRONTENDS" -gt 0 ]; then
-    frontend_idx=1  # Start from 1 since node 1 is frontend 0
-    nodes_per_frontend=$(( (TOTAL_NODES - 2 + ADDITIONAL_FRONTENDS - 1) / ADDITIONAL_FRONTENDS ))
-    frontend_node_idx=2
-
-    for i in $(seq 1 $ADDITIONAL_FRONTENDS); do
-        if [ $frontend_node_idx -lt $TOTAL_NODES ]; then
-            node=${nodes[$frontend_node_idx]}
-            echo "Launching additional frontend $frontend_idx on node $frontend_node_idx: $node"
-            cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_frontend_${frontend_idx}.out --error=${LOG_DIR}/${node}_frontend_${frontend_idx}.err python /scripts/worker_setup.py --worker_type frontend --worker_idx ${frontend_idx} ${WORKER_ARGS}"
-            echo "$cmd"
-            $cmd &
-            frontend_idx=$((frontend_idx + 1))
-            frontend_node_idx=$((frontend_node_idx + nodes_per_frontend))
-        fi
-    done
-fi
-{% endraw %}
-{% endif %}
-{% raw %}
-
-# Launch prefill workers
-for worker_idx in $(seq 0 $((PREFILL_WORKERS - 1))); do
-    leader_idx=${prefill_leaders[$worker_idx]}
-    leader_node=${nodes[$leader_idx]}
-
-    # Get leader IP for this worker group
-    LEADER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$leader_node ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
-    echo "Prefill worker $worker_idx leader: $leader_node ($LEADER_IP)"
-
-    # Launch all nodes for this worker
-    for node_idx in $(seq 0 $((PREFILL_NODES_PER_WORKER - 1))); do
-        global_node_idx=$((leader_idx + node_idx))
-        node=${nodes[$global_node_idx]}
-        local_rank=$node_idx
-
-        echo "Launching prefill worker $worker_idx, node $global_node_idx (local_rank $local_rank): $node"
-{% endraw %}
-{% if enable_config_dump %}
-{% raw %}
-        CONFIG_DUMP_ARG="--dump-config-path /logs/${node}_config.json"
-{% endraw %}
-{% else %}
-{% raw %}
-        CONFIG_DUMP_ARG=""
-{% endraw %}
-{% endif %}
-{% raw %}
-        cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill_w${worker_idx}.out --error=${LOG_DIR}/${node}_prefill_w${worker_idx}.err python /scripts/worker_setup.py --leader_ip ${LEADER_IP} --worker_idx ${worker_idx} --local_rank ${local_rank} --nodes_per_worker ${PREFILL_NODES_PER_WORKER} --worker_type prefill --gpu_utilization_log /logs/${node}_prefill_w${worker_idx}_gpu_utilization.log ${WORKER_ARGS} ${CONFIG_DUMP_ARG}"
-        echo "$cmd"
-        $cmd &
-    done
-done
-
-# Launch decode workers
-for worker_idx in $(seq 0 $((DECODE_WORKERS - 1))); do
-    leader_idx=${decode_leaders[$worker_idx]}
-    leader_node=${nodes[$leader_idx]}
-
-    # Get leader IP for this worker group
-    LEADER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$leader_node ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
-    echo "Decode worker $worker_idx leader: $leader_node ($LEADER_IP)"
-
-    # Launch all nodes for this worker
-    for node_idx in $(seq 0 $((DECODE_NODES_PER_WORKER - 1))); do
-        global_node_idx=$((leader_idx + node_idx))
-        node=${nodes[$global_node_idx]}
-        local_rank=$node_idx
-
-        echo "Launching decode worker $worker_idx, node $global_node_idx (local_rank $local_rank): $node"
-{% endraw %}
-{% if enable_config_dump %}
-{% raw %}
-        CONFIG_DUMP_ARG="--dump-config-path /logs/${node}_config.json"
-{% endraw %}
-{% else %}
-{% raw %}
-        CONFIG_DUMP_ARG=""
-{% endraw %}
-{% endif %}
-{% raw %}
-        cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode_w${worker_idx}.out --error=${LOG_DIR}/${node}_decode_w${worker_idx}.err python /scripts/worker_setup.py --leader_ip ${LEADER_IP} --worker_idx ${worker_idx} --local_rank ${local_rank} --nodes_per_worker ${DECODE_NODES_PER_WORKER} --worker_type decode --gpu_utilization_log /logs/${node}_decode_w${worker_idx}_gpu_utilization.log ${CONFIG_DUMP_ARG} ${WORKER_ARGS}"
-        echo "$cmd"
-        $cmd &
-    done
-done
-
-echo ""
-{% endraw %}
-{% if enable_multiple_frontends %}
-{% raw %}
-echo "Frontend available at: http://${NGINX_NODE}:8000"
-echo "To connect to the nginx node:"
-echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${NGINX_NODE} --overlap --pty bash"
-echo "To connect to the master node (NATS/ETCD):"
-echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${MASTER_NODE} --overlap --pty bash"
-{% endraw %}
-{% else %}
-{% raw %}
-echo "To connect to the host prefill node:"
-echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --overlap --pty bash"
-{% endraw %}
-{% endif %}
-{% raw %}
-
-echo ""
-echo "Make sure to cancel the job at the end:"
-echo "scancel $SLURM_JOB_ID"
-
-# Instead of waiting for all tasks to complete, wait for profile.sh to complete and then exit.
-
-{% endraw %}
-
-PROFILER_TYPE={{ profiler_type }}
-PROFILER_ARGS="{{ profiler_arg }}"
-
-{% if do_profile %}
-{% raw %}
-srun --nodes=1 --ntasks=1 $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --output=${LOG_DIR}/profile.out --error=${LOG_DIR}/profile.err --overlap bash /scripts/${PROFILER_TYPE}/bench.sh $PREFILL_WORKERS $DECODE_WORKERS $PREFILL_GPUS $DECODE_GPUS ${PROFILER_ARGS} &
-{% endraw %}
-{% endif %}
-
-{% raw %}
-wait -n
-first_exit_code=$?
-echo "Script finished at $(date) with exit code ${first_exit_code}"
-exit $first_exit_code
-{% endraw %}
--- a/examples/backends/sglang/slurm_jobs/parse.py
+++ b/examples/backends/sglang/slurm_jobs/parse.py
-# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-# ruff: noqa
-# pylint: skip-file
-
-import json
-import os
-import re
-
-### Slurm configs
-SLURM_JOB_ID = "slurm id"
-### Model Deployment configurations
-PREFILL_TP = "Prefill TP"
-PREFILL_DP = "Prefill DP"
-DECODE_TP = "Decode TP"
-DECODE_DP = "Decode DP"
-FRONTENDS = "Frontends"
-### Profiler configs
-PROFILER_TYPE = "Profiler type"
-ISL = "ISL"
-OSL = "OSL"
-REQUEST_RATE = "Request rate"
-CONCURRENCIES = "Concurrencies"
-OUTPUT_TPS = "Output TPS"
-OUTPUT_TPS_PER_USER = "Output TPS/User"
-ITL = "Mean ITL (ms)"
-TTFT = "Mean TTFT (ms)"
-TPOT = "Mean TPOT (ms)"
-### FORMAT PRINT ORDERS
-KEY_PRINT_ORDER = [
-    SLURM_JOB_ID,
-    PREFILL_TP,
-    PREFILL_DP,
-    DECODE_TP,
-    DECODE_DP,
-    FRONTENDS,
-    PROFILER_TYPE,
-    ISL,
-    OSL,
-    REQUEST_RATE,
-    CONCURRENCIES,
-    OUTPUT_TPS,
-    OUTPUT_TPS_PER_USER,
-    ITL,
-    TTFT,
-    TPOT,
-]
-
-
-def format_key_order():
-    report = "================\nThe following log will be reported according to this order:\n----\n"
-    for key in KEY_PRINT_ORDER:
-        report += f"{key}\n"
-    print(report[:-1])
-
-
-def format_print(result):
-    report = "================\n"
-    for key in KEY_PRINT_ORDER:
-        report += f"{result.get(key, '')}\n"
-    print(report[:-1])
-
-
-def analyze_sgl_out(folder):
-    result = []
-    for file in os.listdir(folder):
-        with open(f"{folder}/{file}", "r") as f:
-            content = json.load(f)
-            res = [
-                content["max_concurrency"],
-                content["output_throughput"],
-                content["mean_itl_ms"],
-                content["mean_ttft_ms"],
-                content["request_rate"],
-            ]
-
-            if "mean_tpot_ms" in content:
-                res.append(content["mean_tpot_ms"])
-            result.append(res)
-    out = {
-        REQUEST_RATE: [],
-        CONCURRENCIES: [],
-        OUTPUT_TPS: [],
-        ITL: [],
-        TTFT: [],
-        TPOT: [],
-    }
-
-    for data in sorted(result, key=lambda x: x[0]):
-        con, tps, itl, ttft, req_rate = data[0:5]
-        out[CONCURRENCIES].append(con)
-        out[OUTPUT_TPS].append(tps)
-        out[ITL].append(itl)
-        out[TTFT].append(ttft)
-        out[REQUEST_RATE].append(req_rate)
-
-        if len(data) >= 6:
-            # TPOT is always initialized in `out` above, so no membership check is needed.
-            out[TPOT].append(data[5])
-
-    return out
-
-
-def analyze_gap_out(folder):
-    result = []
-    for file in os.listdir(folder):
-        with open(f"{folder}/{file}", "r") as f:
-            content = json.load(f)
-            result.append(
-                (
-                    content["input_config"]["perf_analyzer"]["stimulus"]["concurrency"],
-                    content["output_token_throughput_per_user"]["avg"],
-                    content["output_token_throughput"]["avg"],
-                )
-            )
-
-    out = {CONCURRENCIES: [], OUTPUT_TPS: [], OUTPUT_TPS_PER_USER: []}
-
-    for con, tpspuser, tps in sorted(result, key=lambda x: x[0]):
-        out[CONCURRENCIES].append(con)
-        out[OUTPUT_TPS].append(tps)
-        out[OUTPUT_TPS_PER_USER].append(tpspuser)
-
-    return out
-
-
-def analyze(p):
-    files = os.listdir(p)
-
-    prefill_nodes = {}
-    decode_nodes = {}
-    frontends = []
-
-    profile_result = {}
-
-    for file in files:
-        p_re = re.search(
-            r"([-_A-Za-z0-9]+)_(prefill|decode|nginx|frontend)_([a-zA-Z0-9]+)\.out", file
-        )
-        if p_re is not None:
-            _, node_type, number = p_re.groups()
-            if node_type == "prefill":
-                if number not in prefill_nodes:
-                    prefill_nodes[number] = []
-                prefill_nodes[number].append(file)
-            elif node_type == "decode":
-                if number not in decode_nodes:
-                    decode_nodes[number] = []
-                decode_nodes[number].append(file)
-            elif node_type == "frontend":
-                frontends.append(file)
-
-        profiler_match = re.match("(sglang|vllm|gap)_isl_([0-9]+)_osl_([0-9]+)", file)
-        if profiler_match:
-            profiler, isl, osl = profiler_match.groups()
-            if profiler == "gap":
-                profile_result = analyze_gap_out(f"{p}/{file}")
-            else:
-                profile_result = analyze_sgl_out(f"{p}/{file}")
-
-            profile_result[PROFILER_TYPE] = profiler
-            profile_result[ISL] = isl
-            profile_result[OSL] = osl
-
-    config = {SLURM_JOB_ID: p}
-    if len(prefill_nodes.values()) != 0:
-        config[PREFILL_TP] = f"{len(list(prefill_nodes.values())[0]) * 4}"
-        config[PREFILL_DP] = f"{len(prefill_nodes.keys())}"
-
-    if len(decode_nodes.values()) != 0:
-        config[DECODE_TP] = f"{len(list(decode_nodes.values())[0]) * 4}"
-        config[DECODE_DP] = f"{len(decode_nodes.keys())}"
-
-    if len(frontends) != 0:
-        config[FRONTENDS] = f"{len(frontends)}"
-
-    result = {**config}
-    for key, value in profile_result.items():
-        result[key] = (
-            value
-            if not isinstance(value, list)
-            else ", ".join(str(x) for x in value)
-        )
-    return result
-
-
-paths = [x for x in os.listdir(".") if ".py" not in x and os.path.isdir(x)]
-format_key_order()
-
-
-def extract_job_id(dirname):
-    """Extract job ID from directory name for sorting.
-
-    Handles formats like:
-    - 12345_3P_1D_20250104_123456 (disaggregated)
-    - 12345_4A_20250104_123456 (aggregated)
-    - 12345 (legacy format)
-    """
-    try:
-        return int(dirname.split("_")[0])
-    except (ValueError, IndexError):
-        # If directory name doesn't match expected format, return -1
-        return -1
-
-
-for path in sorted(paths, key=extract_job_id, reverse=True):
-    result = analyze(path)
-    # Only report runs that produced profiler output.
-    if OUTPUT_TPS in result:
-        format_print(result)
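
Review note on the removed `parse.py`: the script groups log files by role and worker id using a filename regex, then orders run directories by the leading job id. The sketch below restates that logic in isolation so its behavior is easy to check. It is a minimal illustration, not the removed implementation: the `classify` helper, the anchored `\.out$` pattern, and the sample file names are assumptions made for the example.

```python
import re

# Same shape as the pattern in the removed parse.py, with the dot escaped and
# the match anchored at the end of the name (both are assumptions of this sketch).
LOG_NAME = re.compile(
    r"([-_A-Za-z0-9]+)_(prefill|decode|nginx|frontend)_([a-zA-Z0-9]+)\.out$"
)


def classify(filenames):
    """Group log file names as role -> worker id -> list of files."""
    groups = {"prefill": {}, "decode": {}, "nginx": {}, "frontend": {}}
    for name in filenames:
        match = LOG_NAME.search(name)
        if match is None:
            continue  # not a worker log, e.g. *.err or profiler output
        _, role, worker = match.groups()
        groups[role].setdefault(worker, []).append(name)
    return groups


def extract_job_id(dirname):
    """Return the leading integer job id of a run directory, or -1."""
    try:
        return int(dirname.split("_")[0])
    except ValueError:
        return -1


# Hypothetical file and directory names, for illustration only.
groups = classify(
    [
        "node01_prefill_w0.out",
        "node02_prefill_w0.out",
        "node03_decode_w0.out",
        "node01_prefill_w0.err",
        "notes.txt",
    ]
)
ordered = sorted(
    ["12345_3P_1D_20250104_123456", "12346_4A_20250104_123456", "notes", "12344"],
    key=extract_job_id,
    reverse=True,
)
```

In the removed script, the number of files per worker was multiplied by 4 to report TP, which hard-codes four GPUs per node; the sketch leaves that factor out.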
--- a/examples/backends/sglang/slurm_jobs/scripts/aiperf/bench.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/aiperf/bench.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-
-prefill_workers=$1
-decode_workers=$2
-prefill_gpus=$3
-decode_gpus=$4
-total_gpus=$((prefill_gpus+decode_gpus))
-
-chosen_isl=$5
-chosen_osl=$6
-chosen_concurrencies=$7
-
-echo "Profiling for model with PrefillDP=${prefill_workers}, DecodeDP=${decode_workers}"
-
-head_node="localhost"
-head_port="8000"
-
-SERVED_MODEL_NAME="deepseek-ai/DeepSeek-R1"
-MODEL_PATH=/model/
-
-random_seed=$RANDOM
-echo "Chosen random seed ${random_seed}"
-
-source /scripts/benchmark_utils.sh
-
-wait_for_model $head_node $head_port $prefill_workers $decode_workers 5 900 60
-
-set -e
-warmup_model $head_node $head_port $SERVED_MODEL_NAME $MODEL_PATH "${chosen_isl}x${chosen_osl}x10000x10000x250"
-set +e
-
-aiperf_warmup_workers=$(python3 -c "print(max(${DP:-0}, ${prefill_workers:-0}, ${decode_workers:-0}))")
-
-IFS='x' read -r -a concurrency_list <<< "$chosen_concurrencies"
-
-profile_folder="/logs/gap_isl_${chosen_isl}_osl_${chosen_osl}"
-mkdir -p $profile_folder
-
-tmp_work_dir=$(mktemp -d -t aiperf-XXXXXXXX)
-for concurrency in "${concurrency_list[@]}"; do
-    export_folder="${tmp_work_dir}/concurrency_${concurrency}"
-    mkdir -p $export_folder
-    export_model_name=${SERVED_MODEL_NAME//\//_}
-    export_file="${export_model_name}_generation_${concurrency}.json"
-
-    echo "Run benchmark for concurrency $concurrency; ISL $chosen_isl; OSL $chosen_osl"
-    command=(
-        aiperf profile
-        -m ${SERVED_MODEL_NAME}
-        --tokenizer ${MODEL_PATH}
-        --endpoint-type chat
-        --endpoint /v1/chat/completions
-        --url "${head_node}:${head_port}"
-        --streaming
-
-        --concurrency ${concurrency}
-        --warmup-request-count $(( 2*aiperf_warmup_workers ))
-        --request-count $(( 5*concurrency ))
-
-        --synthetic-input-tokens-mean ${chosen_isl} --synthetic-input-tokens-stddev 0
-        --output-tokens-mean ${chosen_osl} --output-tokens-stddev 0
-        --extra-inputs "max_tokens:${chosen_osl}" --extra-inputs "min_tokens:${chosen_osl}"
-
-        --artifact-dir ${export_folder}
-        --profile-export-file ${export_file}
-
-        --random-seed ${random_seed}
-
-        --tokenizer-trust-remote-code
-        --num-dataset-entries 3000
-    )
-
-    set -e
-    "${command[@]}"
-    set +e
-
-    cp $export_folder/*/*_aiperf.json $profile_folder
-done
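
Review note on the removed `aiperf/bench.sh`: the script takes the concurrency sweep as a single `x`-separated argument and derives the warmup and request counts from it. The arithmetic can be sketched as follows. This is an illustration under stated assumptions: `plan_runs` and its dictionary keys are hypothetical names, and `dp` stands in for the optional `DP` environment variable the script reads with a default of 0.

```python
def plan_runs(chosen_concurrencies, prefill_workers, decode_workers, dp=0):
    """Mirror the per-run counts computed in the removed bench.sh loop."""
    # Warmup count is fixed for the whole sweep: twice the largest worker count.
    warmup = 2 * max(dp, prefill_workers, decode_workers)
    runs = []
    for token in chosen_concurrencies.split("x"):
        if not token:
            continue  # skip empty fields such as a trailing "x"
        concurrency = int(token)
        runs.append(
            {
                "concurrency": concurrency,
                "warmup_request_count": warmup,
                "request_count": 5 * concurrency,
            }
        )
    return runs


# Example sweep with 3 prefill workers and 1 decode worker (illustrative values).
runs = plan_runs("1x8x64", prefill_workers=3, decode_workers=1)
```

Each run issues five requests per unit of concurrency, so the highest concurrency levels dominate the sweep's total runtime.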
--- a/examples/backends/sglang/slurm_jobs/scripts/benchmark_utils.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/benchmark_utils.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/check_server_health.py
+++ b/examples/backends/sglang/slurm_jobs/scripts/check_server_health.py
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/1k1k-low-latency.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/1k1k-low-latency.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/1k1k-max-tpt.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/1k1k-max-tpt.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/1k1k-middle-curve.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/1k1k-middle-curve.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/8k1k-low-latency.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/8k1k-low-latency.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/8k1k-max-tpt.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/8k1k-max-tpt.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/8k1k-middle-curve.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp4/disagg/8k1k-middle-curve.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1k1k-low-latency.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1k1k-low-latency.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1k1k-max-tpt.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/1k1k-max-tpt.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/8k1k-low-latency.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/8k1k-low-latency.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/8k1k-max-tpt.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/gb200-fp8/disagg/8k1k-max-tpt.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh
+++ b/examples/backends/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh
--- a/examples/backends/sglang/slurm_jobs/scripts/nginx.conf.j2
+++ b/examples/backends/sglang/slurm_jobs/scripts/nginx.conf.j2