Commit 6642e23e authored by ishandhanani, committed by GitHub

feat: sglang to 0.5.9 + updated docs (#6518)


Co-authored-by: baihuitian <baihuitian.bht@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 1df620b4
# Example: Deploy DeepSeek R1 - FP8 with Dynamo and SGLang on SLURM
This folder contains the scripts for deploying DeepSeek-R1 with SGLang in disaggregated mode with WideEP on a GB200 SLURM cluster.
## SLURM Prerequisites
For this example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes that are performantly
inter-connected, such as those in an NVL72 setup.
2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin setup. In particular, the `job_script_template.j2` template in this
example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container based plugins, you may be able to
modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../../../../docs/pages/backends/sglang/README.md#using-docker-containers).
This is the image that can be passed to the `--container-image` argument in later steps.
## Scripts Overview
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM sbatch scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
## Logs Folder Structure
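Each SLURM job creates a unique log directory under `logs/` whose name begins with the job ID. The sbatch templates in this example name it `logs/<job_id>_<P>P_<D>D_<timestamp>/` for disaggregated jobs and `logs/<job_id>_<A>A_<timestamp>/` for aggregated jobs. The legacy layout used the job ID alone; for example, job ID `3062824` created the directory `logs/3062824/`.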
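As a rough illustration of the naming scheme, the sketch below splits a log directory name into its parts. The helper name `parse_log_dir` and the returned keys are hypothetical; the three formats are assumed from the sbatch templates and from the docstring in the log analysis script of this example.

```python
import re

# Hypothetical helper: the patterns mirror the directory names produced by the
# sbatch templates in this example (an assumption, not a documented contract).
_DISAGG = re.compile(r"^(\d+)_(\d+)P_(\d+)D_(.+)$")
_AGG = re.compile(r"^(\d+)_(\d+)A_(.+)$")


def parse_log_dir(name):
    """Split a log directory name into job ID, worker counts, and timestamp."""
    m = _DISAGG.match(name)
    if m:
        job_id, prefill, decode, stamp = m.groups()
        return {"job_id": int(job_id), "mode": "disaggregated",
                "prefill_workers": int(prefill), "decode_workers": int(decode),
                "timestamp": stamp}
    m = _AGG.match(name)
    if m:
        job_id, agg, stamp = m.groups()
        return {"job_id": int(job_id), "mode": "aggregated",
                "agg_workers": int(agg), "timestamp": stamp}
    if name.isdigit():  # legacy layout: logs/<job_id>/
        return {"job_id": int(name), "mode": "legacy"}
    return None  # not a log directory this sketch recognizes


print(parse_log_dir("3062824_3P_1D_20250104_123456"))
print(parse_log_dir("3062824"))
```

Returning `None` for unrecognized names lets a caller skip unrelated folders instead of failing.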
## Usage
> [!NOTE]
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `ip addr show $NETWORK_INTERFACE` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions are always welcome.
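
To make the IP lookup easier to reason about when adapting it, here is a small Python sketch of what the template's `ip addr show ... | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1` pipeline extracts. The function name `ipv4_addresses` and the sample text are made up for illustration; real `ip addr show` output on your cluster may look different.

```python
def ipv4_addresses(ip_addr_output):
    """Return the IPv4 addresses found in `ip addr show <iface>` output."""
    addresses = []
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Keep IPv4 lines only; 'inet6' lines have a different first field.
        if len(fields) >= 2 and fields[0] == "inet":
            addresses.append(fields[1].split("/")[0])  # drop the /prefix length
    return addresses


# Made-up sample output for a single interface (illustrative only).
sample = """\
2: enP6p9s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 state UP
    link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.12/24 brd 10.0.0.255 scope global enP6p9s0np0
    inet6 fe80::1/64 scope link
"""
print(ipv4_addresses(sample))  # → ['10.0.0.12']
```

If an interface carries more than one IPv4 address, the shell pipeline prints one line per address, so it is worth checking that yours returns exactly one.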
1. **Submit a benchmark job**:
```bash
python3 submit_job_script.py \
--template job_script_template.j2 \
--model-dir <path-to>/deepseek-r1-0528 \
--container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh \
--gpus-per-node 4 \
--config-dir <path-to>/klconfigs \
--gpu-type gb200-fp8 \
--network-interface enP6p9s0np0 \
--prefill-nodes 6 \
--decode-nodes 12 \
--prefill-workers 3 \
--decode-workers 1 \
--account <account> \
--partition <partition> \
--time-limit 4:00:00 \
--enable-multiple-frontends \
--num-additional-frontends 9 \
--profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"
```
This command will deploy 3 prefill workers and 1 decode worker with 9 additional frontends load-balanced by nginx. Diving deeper into the command:
- `--template job_script_template.j2`: Path to Jinja2 template file (this shouldn't change unless you want to modify the template)
- `--model-dir <path-to>/deepseek-r1-0528`: Path to DSR1-FP8 model directory
- `--container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh`: Enroot container image URI
- `--gpus-per-node 4`: Number of GPUs per node (each GB200 tray has 4 GPUs)
- `--config-dir <path-to>/klconfigs`: Various configs (see explanation below)
- `--gpu-type gb200-fp8`: GPU type to use, choices: `gb200-fp8`
- `--network-interface enP6p9s0np0`: Network interface to use (depends on your cluster)
- `--prefill-nodes 6`: Number of prefill nodes
- `--decode-nodes 12`: Number of decode nodes
- `--prefill-workers 3`: Number of prefill workers
- `--decode-workers 1`: Number of decode workers
- `--account <account>`: SLURM account
- `--partition <partition>`: SLURM partition
- `--time-limit 4:00:00`: Time limit in HH:MM:SS format
- `--enable-multiple-frontends`: Enable multiple frontend architecture with nginx load balancer
- `--num-additional-frontends 9`: Number of additional frontends
- `--profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"`: Profiler configurations (see explanation below)
**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
2. **Check logs in real-time**:
```bash
cd logs/{JOB_ID}
tail -f *_prefill_*.err *_decode_*.err
```
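The node and worker counts passed in step 1 determine how nodes are grouped into workers. The sketch below mirrors the integer arithmetic that the disaggregated sbatch template performs on these values; the function name `plan_nodes` and the returned keys are hypothetical and exist only for this illustration.

```python
def plan_nodes(prefill_nodes, decode_nodes, prefill_workers, decode_workers,
               gpus_per_node):
    """Mirror the integer arithmetic used by the disaggregated template."""
    total_nodes = prefill_nodes + decode_nodes
    prefill_per_worker = prefill_nodes // prefill_workers  # integer division
    decode_per_worker = decode_nodes // decode_workers
    return {
        "total_nodes": total_nodes,
        "total_gpus": total_nodes * gpus_per_node,
        "prefill_nodes_per_worker": prefill_per_worker,
        "decode_nodes_per_worker": decode_per_worker,
        # Leader node index of each worker: prefill workers come first,
        # decode workers start right after the last prefill node.
        "prefill_leaders": [i * prefill_per_worker
                            for i in range(prefill_workers)],
        "decode_leaders": [prefill_nodes + i * decode_per_worker
                           for i in range(decode_workers)],
    }


# Values from the example command: 6 prefill nodes, 12 decode nodes,
# 3 prefill workers, 1 decode worker, 4 GPUs per node.
print(plan_nodes(6, 12, 3, 1, 4))
```

For the example command this gives 18 nodes and 72 GPUs, with each prefill worker spanning 2 nodes and the single decode worker spanning 12. Because the division is integer division, node counts that are not a multiple of the worker count would leave some nodes without a worker.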
## Configs directory
The `--config-dir` argument is used to specify the directory containing the various configs that are used when running this model. Here are the current configs that are in our directory.
```bash
klconfigs/
├── decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json
├── deepep_config.json
├── dgcache/
└── prefill_dsr1-0528_in1000out1000_num40000.json
```
1. `decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json`: `init-expert-location` for decode worker
2. `deepep_config.json`: DeepEP config file for GB200
3. `dgcache/`: DeepGEMM kernel cache directory. Instructions for creating this can be found [here](https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174)
4. `prefill_dsr1-0528_in1000out1000_num40000.json`: `init-expert-location` for prefill worker
**Note**: The expert locations are collected using the instructions [here](https://github.com/sgl-project/sglang/issues/6017). See the section titled "Create expert distribution data". The expert locations are sensitive to the data used to collect them, so performance results may differ if you don't benchmark with the same data.
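Before submitting a job, it can help to confirm that the config directory is complete. The following is a minimal pre-flight sketch, not part of the scripts in this folder: the helper name `missing_config_entries` is hypothetical, and treating all four entries from the listing above as required is an assumption.

```python
import os
import tempfile

# Entry names copied from the listing above; treating all four as required is
# an assumption of this sketch.
EXPECTED_ENTRIES = [
    "decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json",
    "deepep_config.json",
    "dgcache",
    "prefill_dsr1-0528_in1000out1000_num40000.json",
]


def missing_config_entries(config_dir, expected=EXPECTED_ENTRIES):
    """Return the expected entries that are absent from config_dir."""
    return [name for name in expected
            if not os.path.exists(os.path.join(config_dir, name))]


# Demo on a throwaway directory that only contains the DeepEP config.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "deepep_config.json"), "w").close()
    print(missing_config_entries(tmp))
```

An empty list means every expected entry was found.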
## Profiler
If you provide the `--profiler` argument, the sbatch script automatically warms up the model and runs the vLLM benchmarking script. Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
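The `--profiler` value is a semicolon-separated list of `key=value` pairs, with the concurrency levels joined by `x`. The sketch below shows one way such a string could be parsed; `parse_profiler_spec` is a hypothetical helper and not the parser used by `submit_job_script.py`. Splitting on `x` follows what the benchmark script in this example does with its concurrency argument.

```python
def parse_profiler_spec(spec):
    """Parse 'key=value; key=value' into a dict, splitting concurrencies on 'x'."""
    parsed = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, _, value = part.partition("=")
        parsed[key.strip()] = value.strip()
    if "concurrencies" in parsed:
        parsed["concurrencies"] = [
            int(c) for c in parsed["concurrencies"].split("x")
        ]
    return parsed


spec = "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"
print(parse_profiler_spec(spec))
```

All values other than the concurrency list are kept as strings, since entries such as `req-rate=inf` are not plain integers.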
#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --nodes={{ total_nodes }}
#SBATCH --ntasks={{ total_nodes }}
#SBATCH --ntasks-per-node=1
#SBATCH --account={{ account }}
#SBATCH --time={{ time_limit }}
#SBATCH --output=logs/%j_{{ agg_workers }}A_{{ timestamp }}/log.out
#SBATCH --error=logs/%j_{{ agg_workers }}A_{{ timestamp }}/log.err
#SBATCH --partition={{ partition }}
# Constants
set -x
AGG_NODES={{ agg_nodes }}
AGG_WORKERS={{ agg_workers }}
TOTAL_NODES={{ total_nodes }}
GPUS_PER_NODE={{ gpus_per_node }}
TOTAL_GPUS=$((AGG_NODES * GPUS_PER_NODE))
PREFILL_GPUS=0
DECODE_GPUS=$TOTAL_GPUS
AGG_NODES_PER_WORKER=$((AGG_NODES / AGG_WORKERS))
LOG_DIR="${SLURM_SUBMIT_DIR}/logs/${SLURM_JOB_ID}_{{ agg_workers }}A_{{ timestamp }}"
SCRIPT_DIR="${SLURM_SUBMIT_DIR}/scripts"
OUTPUT_DIR="${SLURM_SUBMIT_DIR}/outputs"
MODEL_DIR="{{ model_dir }}"
CONFIG_DIR="{{ config_dir }}"
CONTAINER_IMAGE="{{ container_image }}"
NETWORK_INTERFACE="{{ network_interface }}"
GPU_TYPE="{{ gpu_type | default('h100') }}"
set +x
{% raw %}
mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}"
nodes=($(scontrol show hostnames $SLURM_NODELIST))
if [ ${#nodes[@]} -ne $TOTAL_NODES ]; then
echo "Error: Expected $TOTAL_NODES nodes but got ${#nodes[@]} nodes"
exit 1
fi
# Print node information
for i in "${!nodes[@]}"; do
echo "Node $i: ${nodes[$i]}"
done
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
# Multiple frontend architecture
# Node 0: nginx + aggregated worker shard
# Node 1: NATS/ETCD + first frontend
# Node 2+: aggregated workers + optional additional frontends
NGINX_NODE=${nodes[0]}
MASTER_NODE=${nodes[1]}
MASTER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$MASTER_NODE ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
if [ -z "$MASTER_IP" ]; then
echo "Error: Could not retrieve IP address for master host $MASTER_NODE on interface $NETWORK_INTERFACE"
exit 1
fi
echo "Master IP address (node 1): $MASTER_IP"
echo "Nginx node (node 0): $NGINX_NODE"
# Generate frontend IP list for nginx config
frontend_hosts=()
frontend_ips=()
# Node 1 always has a frontend (with NATS/ETCD)
frontend_hosts+=("$MASTER_NODE")
frontend_ips+=("$MASTER_IP")
# Add additional frontends based on num_additional_frontends
{% endraw %}ADDITIONAL_FRONTENDS={{ num_additional_frontends }}{% raw %}
if [ "$ADDITIONAL_FRONTENDS" -gt 0 ]; then
# Calculate which nodes get additional frontends
# Nodes 0 and 1 already host nginx and the first frontend, so spread the additional frontends across the remaining TOTAL_NODES - 2 nodes
nodes_per_frontend=$(( (TOTAL_NODES - 2 + ADDITIONAL_FRONTENDS - 1) / ADDITIONAL_FRONTENDS )) # ceil division; must match the stride used by the frontend launch loop below
frontend_node_idx=2 # Start from node 2 (node 1 already has frontend)
for i in $(seq 1 $ADDITIONAL_FRONTENDS); do
if [ $frontend_node_idx -lt $TOTAL_NODES ]; then
node_name=${nodes[$frontend_node_idx]}
node_ip=$(srun --nodes=1 --ntasks=1 --nodelist=$node_name ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
frontend_hosts+=("$node_name")
frontend_ips+=("$node_ip")
echo "Additional frontend $i on node $frontend_node_idx: $node_name ($node_ip)"
frontend_node_idx=$((frontend_node_idx + nodes_per_frontend))
fi
done
fi
echo "Frontend hosts: ${frontend_hosts[@]}"
echo "Frontend IPs: ${frontend_ips[@]}"
# Generate nginx configuration
# Build a Python list literal of frontend hosts from the bash array
FRONTEND_LIST=$(printf "'%s'," "${frontend_ips[@]}")
FRONTEND_LIST="[${FRONTEND_LIST%,}]"
export FRONTEND_LIST SCRIPT_DIR LOG_DIR
python3 - <<'PY'
import ast
import os
from jinja2 import Template

template_path = os.path.join(os.environ['SCRIPT_DIR'], 'nginx.conf.j2')
output_path = os.path.join(os.environ['LOG_DIR'], 'nginx.conf')
with open(template_path, 'r') as f:
    tmpl = Template(f.read())
# literal_eval parses the list literal without executing arbitrary code
frontend_hosts = ast.literal_eval(os.environ['FRONTEND_LIST'])
config = tmpl.render(frontend_hosts=frontend_hosts)
with open(output_path, 'w') as f:
    f.write(config)
PY
{% endraw %}
{% else %}
{% raw %}
# Traditional architecture - first aggregated worker node handles everything
MASTER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
if [ -z "$MASTER_IP" ]; then
echo "Error: Could not retrieve IP address for master host ${nodes[0]} on interface $NETWORK_INTERFACE"
exit 1
fi
echo "Master IP address: $MASTER_IP"
{% endraw %}
{% endif %}
{% raw %}
# Compute leader nodes for each aggregated worker
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
# With multiple frontends: keep offset 0; nginx coexists on node 0
WORKER_NODE_OFFSET=0
{% endraw %}
{% else %}
{% raw %}
# Traditional: workers start from node 0
WORKER_NODE_OFFSET=0
{% endraw %}
{% endif %}
{% raw %}
agg_leaders=()
for i in $(seq 0 $((AGG_WORKERS - 1))); do
leader_idx=$((WORKER_NODE_OFFSET + i * AGG_NODES_PER_WORKER))
agg_leaders[$i]=$leader_idx
done
echo "Aggregated worker leaders: ${agg_leaders[@]}"
# Prepare enroot arguments to pass to srun commands
ENROOT_ARGS="\
--container-image=${CONTAINER_IMAGE} \
--no-container-entrypoint \
--no-container-mount-home \
--container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \
"
# Build common worker arguments
{% endraw %}
SCRIPT_VARIANT="{{ script_variant | default('default') }}"
{% raw %}
WORKER_ARGS="--gpu_type ${GPU_TYPE} --script-variant ${SCRIPT_VARIANT} --gpus_per_node ${GPUS_PER_NODE} --master_ip ${MASTER_IP}"
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
# Add multiple frontends flag for worker setup
WORKER_ARGS="$WORKER_ARGS --multiple-frontends-enabled"
{% endraw %}
{% endif %}
{% if run_in_ci %}
{% raw %}
# Add CI mode flag for worker setup
WORKER_ARGS="$WORKER_ARGS --run-in-ci"
{% endraw %}
{% endif %}
{% raw %}
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
# Launch nginx on node 0
echo "Launching nginx on ${NGINX_NODE}"
cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$NGINX_NODE --output=${LOG_DIR}/${NGINX_NODE}_nginx.out --error=${LOG_DIR}/${NGINX_NODE}_nginx.err python /scripts/worker_setup.py --worker_type nginx --nginx_config /logs/nginx.conf ${WORKER_ARGS}"
echo "$cmd"
$cmd &
# Launch frontend on master node (node 1) - this will also start NATS/ETCD
echo "Launching frontend + NATS/ETCD on master node ${MASTER_NODE}"
cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$MASTER_NODE --output=${LOG_DIR}/${MASTER_NODE}_frontend_0.out --error=${LOG_DIR}/${MASTER_NODE}_frontend_0.err python /scripts/worker_setup.py --worker_type frontend --worker_idx 0 ${WORKER_ARGS}"
echo "$cmd"
$cmd &
# Launch additional frontends on designated nodes
if [ "$ADDITIONAL_FRONTENDS" -gt 0 ]; then
frontend_idx=1 # Start from 1 since node 1 is frontend 0
nodes_per_frontend=$(( (TOTAL_NODES - 2 + ADDITIONAL_FRONTENDS - 1) / ADDITIONAL_FRONTENDS ))
frontend_node_idx=2
for i in $(seq 1 $ADDITIONAL_FRONTENDS); do
if [ $frontend_node_idx -lt $TOTAL_NODES ]; then
node=${nodes[$frontend_node_idx]}
echo "Launching additional frontend $frontend_idx on node $frontend_node_idx: $node"
cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_frontend_${frontend_idx}.out --error=${LOG_DIR}/${node}_frontend_${frontend_idx}.err python /scripts/worker_setup.py --worker_type frontend --worker_idx ${frontend_idx} ${WORKER_ARGS}"
echo "$cmd"
$cmd &
frontend_idx=$((frontend_idx + 1))
frontend_node_idx=$((frontend_node_idx + nodes_per_frontend))
fi
done
fi
{% endraw %}
{% else %}
{% raw %}
# Traditional: first aggregated worker node also runs frontend + NATS/ETCD
# This is handled in setup_aggregated_worker when worker_idx=0 and local_rank=0
{% endraw %}
{% endif %}
{% raw %}
# Launch aggregated workers
for worker_idx in $(seq 0 $((AGG_WORKERS - 1))); do
leader_idx=${agg_leaders[$worker_idx]}
leader_node=${nodes[$leader_idx]}
# Get leader IP for this worker group
LEADER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$leader_node ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
echo "Aggregated worker $worker_idx leader: $leader_node ($LEADER_IP)"
# Launch all nodes for this worker
for node_idx in $(seq 0 $((AGG_NODES_PER_WORKER - 1))); do
global_node_idx=$((leader_idx + node_idx))
node=${nodes[$global_node_idx]}
local_rank=$node_idx
echo "Launching aggregated worker $worker_idx, node $global_node_idx (local_rank $local_rank): $node"
{% endraw %}
{% if enable_config_dump %}
{% raw %}
CONFIG_DUMP_ARG="--dump-config-path /logs/${node}_config.json"
{% endraw %}
{% else %}
{% raw %}
CONFIG_DUMP_ARG=""
{% endraw %}
{% endif %}
{% raw %}
cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_agg_w${worker_idx}.out --error=${LOG_DIR}/${node}_agg_w${worker_idx}.err python /scripts/worker_setup.py --leader_ip ${LEADER_IP} --worker_idx ${worker_idx} --local_rank ${local_rank} --nodes_per_worker ${AGG_NODES_PER_WORKER} --worker_type aggregated --gpu_utilization_log /logs/${node}_agg_w${worker_idx}_gpu_utilization.log ${CONFIG_DUMP_ARG} ${WORKER_ARGS}"
echo "$cmd"
$cmd &
done
done
echo ""
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
echo "Frontend available at: http://${NGINX_NODE}:8000"
echo "To connect to the nginx node:"
echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${NGINX_NODE} --overlap --pty bash"
echo "To connect to the master node (NATS/ETCD):"
echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${MASTER_NODE} --overlap --pty bash"
{% endraw %}
{% else %}
{% raw %}
echo "To connect to the master node:"
echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --overlap --pty bash"
{% endraw %}
{% endif %}
{% raw %}
echo ""
echo "Make sure to cancel the job at the end:"
echo "scancel $SLURM_JOB_ID"
# Instead of waiting for all tasks to complete, exit as soon as the first background task finishes (normally the bench.sh profiler run).
{% endraw %}
PROFILER_TYPE={{ profiler_type }}
PROFILER_ARGS="{{ profiler_arg }}"
{% if do_profile %}
{% raw %}
srun --nodes=1 --ntasks=1 $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --output=${LOG_DIR}/profile.out --error=${LOG_DIR}/profile.err --overlap bash /scripts/${PROFILER_TYPE}/bench.sh 0 $AGG_WORKERS $PREFILL_GPUS $DECODE_GPUS $TOTAL_GPUS ${PROFILER_ARGS} &
{% endraw %}
{% endif %}
{% raw %}
wait -n
first_exit_code=$?
echo "Script finished at $(date) with exit code ${first_exit_code}"
exit $first_exit_code
{% endraw %}
#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --nodes={{ total_nodes }}
#SBATCH --ntasks={{ total_nodes }}
#SBATCH --ntasks-per-node=1
#SBATCH --account={{ account }}
#SBATCH --time={{ time_limit }}
#SBATCH --output=logs/%j_{{ prefill_workers }}P_{{ decode_workers }}D_{{ timestamp }}/log.out
#SBATCH --error=logs/%j_{{ prefill_workers }}P_{{ decode_workers }}D_{{ timestamp }}/log.err
#SBATCH --partition={{ partition }}
# Constants
set -x
PREFILL_NODES={{ prefill_nodes }}
DECODE_NODES={{ decode_nodes }}
PREFILL_WORKERS={{ prefill_workers }}
DECODE_WORKERS={{ decode_workers }}
TOTAL_NODES=$((PREFILL_NODES + DECODE_NODES))
GPUS_PER_NODE={{ gpus_per_node }}
TOTAL_GPUS=$((TOTAL_NODES * GPUS_PER_NODE))
PREFILL_GPUS=$((PREFILL_NODES * GPUS_PER_NODE))
DECODE_GPUS=$((DECODE_NODES * GPUS_PER_NODE))
PREFILL_NODES_PER_WORKER=$((PREFILL_NODES / PREFILL_WORKERS))
DECODE_NODES_PER_WORKER=$((DECODE_NODES / DECODE_WORKERS))
LOG_DIR="${SLURM_SUBMIT_DIR}/logs/${SLURM_JOB_ID}_{{ prefill_workers }}P_{{ decode_workers }}D_{{ timestamp }}"
SCRIPT_DIR="${SLURM_SUBMIT_DIR}/scripts"
OUTPUT_DIR="${SLURM_SUBMIT_DIR}/outputs"
MODEL_DIR="{{ model_dir }}"
CONFIG_DIR="{{ config_dir }}"
CONTAINER_IMAGE="{{ container_image }}"
NETWORK_INTERFACE="{{ network_interface }}"
GPU_TYPE="{{ gpu_type | default('h100') }}"
set +x
{% raw %}
mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}"
nodes=($(scontrol show hostnames $SLURM_NODELIST))
if [ ${#nodes[@]} -ne $TOTAL_NODES ]; then
echo "Error: Expected $TOTAL_NODES nodes but got ${#nodes[@]} nodes"
exit 1
fi
# Print node information
for i in "${!nodes[@]}"; do
echo "Node $i: ${nodes[$i]}"
done
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
# Multiple frontend architecture
# Node 0: nginx + prefill shard
# Node 1: NATS/ETCD + first frontend + prefill shard
# Node 2+: prefill/decode workers + optional additional frontends
NGINX_NODE=${nodes[0]}
MASTER_NODE=${nodes[1]}
MASTER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$MASTER_NODE ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
if [ -z "$MASTER_IP" ]; then
echo "Error: Could not retrieve IP address for master host $MASTER_NODE on interface $NETWORK_INTERFACE"
exit 1
fi
echo "Master IP address (node 1): $MASTER_IP"
echo "Nginx node (node 0): $NGINX_NODE"
# Generate frontend IP list for nginx config
frontend_hosts=()
frontend_ips=()
# Node 1 always has a frontend (with NATS/ETCD)
frontend_hosts+=("$MASTER_NODE")
frontend_ips+=("$MASTER_IP")
# Add additional frontends based on num_additional_frontends
{% endraw %}ADDITIONAL_FRONTENDS={{ num_additional_frontends }}{% raw %}
if [ "$ADDITIONAL_FRONTENDS" -gt 0 ]; then
# Calculate which nodes get additional frontends
# Nodes 0 and 1 already host nginx and the first frontend, so spread the additional frontends across the remaining TOTAL_NODES - 2 nodes
nodes_per_frontend=$(( (TOTAL_NODES - 2 + ADDITIONAL_FRONTENDS - 1) / ADDITIONAL_FRONTENDS )) # ceil division; must match the stride used by the frontend launch loop below
frontend_node_idx=2 # Start from node 2 (node 1 already has frontend)
for i in $(seq 1 $ADDITIONAL_FRONTENDS); do
if [ $frontend_node_idx -lt $TOTAL_NODES ]; then
node_name=${nodes[$frontend_node_idx]}
node_ip=$(srun --nodes=1 --ntasks=1 --nodelist=$node_name ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
frontend_hosts+=("$node_name")
frontend_ips+=("$node_ip")
echo "Additional frontend $i on node $frontend_node_idx: $node_name ($node_ip)"
frontend_node_idx=$((frontend_node_idx + nodes_per_frontend))
fi
done
fi
echo "Frontend hosts: ${frontend_hosts[@]}"
echo "Frontend IPs: ${frontend_ips[@]}"
# Generate nginx configuration
# Build a Python list literal of frontend hosts from the bash array
FRONTEND_LIST=$(printf "'%s'," "${frontend_ips[@]}")
FRONTEND_LIST="[${FRONTEND_LIST%,}]"
export FRONTEND_LIST SCRIPT_DIR LOG_DIR
python3 - <<'PY'
import ast
import os
from jinja2 import Template

template_path = os.path.join(os.environ['SCRIPT_DIR'], 'nginx.conf.j2')
output_path = os.path.join(os.environ['LOG_DIR'], 'nginx.conf')
with open(template_path, 'r') as f:
    tmpl = Template(f.read())
# literal_eval parses the list literal without executing arbitrary code
frontend_hosts = ast.literal_eval(os.environ['FRONTEND_LIST'])
config = tmpl.render(frontend_hosts=frontend_hosts)
with open(output_path, 'w') as f:
    f.write(config)
PY
{% endraw %}
{% else %}
{% raw %}
# Traditional architecture - first prefill node handles everything
MASTER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
if [ -z "$MASTER_IP" ]; then
echo "Error: Could not retrieve IP address for master host ${nodes[0]} on interface $NETWORK_INTERFACE"
exit 1
fi
echo "Master IP address: $MASTER_IP"
{% endraw %}
{% endif %}
{% raw %}
# Compute leader nodes for each worker
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
# With multiple frontends: keep offset 0; nginx coexists on node 0
WORKER_NODE_OFFSET=0
{% endraw %}
{% else %}
{% raw %}
# Traditional: workers start from node 0
WORKER_NODE_OFFSET=0
{% endraw %}
{% endif %}
{% raw %}
prefill_leaders=()
for i in $(seq 0 $((PREFILL_WORKERS - 1))); do
leader_idx=$((WORKER_NODE_OFFSET + i * PREFILL_NODES_PER_WORKER))
prefill_leaders[$i]=$leader_idx
done
decode_leaders=()
for i in $(seq 0 $((DECODE_WORKERS - 1))); do
leader_idx=$((WORKER_NODE_OFFSET + PREFILL_NODES + i * DECODE_NODES_PER_WORKER))
decode_leaders[$i]=$leader_idx
done
echo "Prefill worker leaders: ${prefill_leaders[@]}"
echo "Decode worker leaders: ${decode_leaders[@]}"
# Prepare enroot arguments to pass to srun commands
ENROOT_ARGS="\
--container-image=${CONTAINER_IMAGE} \
--no-container-entrypoint \
--no-container-mount-home \
--container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \
"
# Build common worker arguments
{% endraw %}
SCRIPT_VARIANT="{{ script_variant | default('default') }}"
{% raw %}
WORKER_ARGS="--gpu_type ${GPU_TYPE} --script-variant ${SCRIPT_VARIANT} --gpus_per_node ${GPUS_PER_NODE} --master_ip ${MASTER_IP}"
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
# Add multiple frontends flag for worker setup
WORKER_ARGS="$WORKER_ARGS --multiple-frontends-enabled"
{% endraw %}
{% endif %}
{% if use_init_location %}
{% raw %}
# Add init expert locations flag for worker setup
WORKER_ARGS="$WORKER_ARGS --use_init_locations"
{% endraw %}
{% endif %}
{% if run_in_ci %}
{% raw %}
# Add CI mode flag for worker setup
WORKER_ARGS="$WORKER_ARGS --run-in-ci"
{% endraw %}
{% endif %}
{% raw %}
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
# Launch nginx on node 0
echo "Launching nginx on ${NGINX_NODE}"
cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$NGINX_NODE --output=${LOG_DIR}/${NGINX_NODE}_nginx.out --error=${LOG_DIR}/${NGINX_NODE}_nginx.err python /scripts/worker_setup.py --worker_type nginx --nginx_config /logs/nginx.conf ${WORKER_ARGS}"
echo "$cmd"
$cmd &
# Launch frontend on master node (node 1) - this will also start NATS/ETCD
echo "Launching frontend + NATS/ETCD on master node ${MASTER_NODE}"
cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$MASTER_NODE --output=${LOG_DIR}/${MASTER_NODE}_frontend_0.out --error=${LOG_DIR}/${MASTER_NODE}_frontend_0.err python /scripts/worker_setup.py --worker_type frontend --worker_idx 0 ${WORKER_ARGS}"
echo "$cmd"
$cmd &
# Launch additional frontends on designated nodes
if [ "$ADDITIONAL_FRONTENDS" -gt 0 ]; then
frontend_idx=1 # Start from 1 since node 1 is frontend 0
nodes_per_frontend=$(( (TOTAL_NODES - 2 + ADDITIONAL_FRONTENDS - 1) / ADDITIONAL_FRONTENDS ))
frontend_node_idx=2
for i in $(seq 1 $ADDITIONAL_FRONTENDS); do
if [ $frontend_node_idx -lt $TOTAL_NODES ]; then
node=${nodes[$frontend_node_idx]}
echo "Launching additional frontend $frontend_idx on node $frontend_node_idx: $node"
cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_frontend_${frontend_idx}.out --error=${LOG_DIR}/${node}_frontend_${frontend_idx}.err python /scripts/worker_setup.py --worker_type frontend --worker_idx ${frontend_idx} ${WORKER_ARGS}"
echo "$cmd"
$cmd &
frontend_idx=$((frontend_idx + 1))
frontend_node_idx=$((frontend_node_idx + nodes_per_frontend))
fi
done
fi
{% endraw %}
{% endif %}
{% raw %}
# Launch prefill workers
for worker_idx in $(seq 0 $((PREFILL_WORKERS - 1))); do
leader_idx=${prefill_leaders[$worker_idx]}
leader_node=${nodes[$leader_idx]}
# Get leader IP for this worker group
LEADER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$leader_node ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
echo "Prefill worker $worker_idx leader: $leader_node ($LEADER_IP)"
# Launch all nodes for this worker
for node_idx in $(seq 0 $((PREFILL_NODES_PER_WORKER - 1))); do
global_node_idx=$((leader_idx + node_idx))
node=${nodes[$global_node_idx]}
local_rank=$node_idx
echo "Launching prefill worker $worker_idx, node $global_node_idx (local_rank $local_rank): $node"
{% endraw %}
{% if enable_config_dump %}
{% raw %}
CONFIG_DUMP_ARG="--dump-config-path /logs/${node}_config.json"
{% endraw %}
{% else %}
{% raw %}
CONFIG_DUMP_ARG=""
{% endraw %}
{% endif %}
{% raw %}
cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill_w${worker_idx}.out --error=${LOG_DIR}/${node}_prefill_w${worker_idx}.err python /scripts/worker_setup.py --leader_ip ${LEADER_IP} --worker_idx ${worker_idx} --local_rank ${local_rank} --nodes_per_worker ${PREFILL_NODES_PER_WORKER} --worker_type prefill --gpu_utilization_log /logs/${node}_prefill_w${worker_idx}_gpu_utilization.log ${WORKER_ARGS} ${CONFIG_DUMP_ARG}"
echo "$cmd"
$cmd &
done
done
# Launch decode workers
for worker_idx in $(seq 0 $((DECODE_WORKERS - 1))); do
leader_idx=${decode_leaders[$worker_idx]}
leader_node=${nodes[$leader_idx]}
# Get leader IP for this worker group
LEADER_IP=$(srun --nodes=1 --ntasks=1 --nodelist=$leader_node ip addr show $NETWORK_INTERFACE | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1)
echo "Decode worker $worker_idx leader: $leader_node ($LEADER_IP)"
# Launch all nodes for this worker
for node_idx in $(seq 0 $((DECODE_NODES_PER_WORKER - 1))); do
global_node_idx=$((leader_idx + node_idx))
node=${nodes[$global_node_idx]}
local_rank=$node_idx
echo "Launching decode worker $worker_idx, node $global_node_idx (local_rank $local_rank): $node"
{% endraw %}
{% if enable_config_dump %}
{% raw %}
CONFIG_DUMP_ARG="--dump-config-path /logs/${node}_config.json"
{% endraw %}
{% else %}
{% raw %}
CONFIG_DUMP_ARG=""
{% endraw %}
{% endif %}
{% raw %}
cmd="srun --overlap $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode_w${worker_idx}.out --error=${LOG_DIR}/${node}_decode_w${worker_idx}.err python /scripts/worker_setup.py --leader_ip ${LEADER_IP} --worker_idx ${worker_idx} --local_rank ${local_rank} --nodes_per_worker ${DECODE_NODES_PER_WORKER} --worker_type decode --gpu_utilization_log /logs/${node}_decode_w${worker_idx}_gpu_utilization.log ${CONFIG_DUMP_ARG} ${WORKER_ARGS}"
echo "$cmd"
$cmd &
done
done
echo ""
{% endraw %}
{% if enable_multiple_frontends %}
{% raw %}
echo "Frontend available at: http://${NGINX_NODE}:8000"
echo "To connect to the nginx node:"
echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${NGINX_NODE} --overlap --pty bash"
echo "To connect to the master node (NATS/ETCD):"
echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${MASTER_NODE} --overlap --pty bash"
{% endraw %}
{% else %}
{% raw %}
echo "To connect to the host prefill node:"
echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --overlap --pty bash"
{% endraw %}
{% endif %}
{% raw %}
echo ""
echo "Make sure to cancel the job at the end:"
echo "scancel $SLURM_JOB_ID"
# Instead of waiting for all tasks to complete, exit as soon as the first background task finishes (normally the bench.sh profiler run).
{% endraw %}
PROFILER_TYPE={{ profiler_type }}
PROFILER_ARGS="{{ profiler_arg }}"
{% if do_profile %}
{% raw %}
srun --nodes=1 --ntasks=1 $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --output=${LOG_DIR}/profile.out --error=${LOG_DIR}/profile.err --overlap bash /scripts/${PROFILER_TYPE}/bench.sh $PREFILL_WORKERS $DECODE_WORKERS $PREFILL_GPUS $DECODE_GPUS ${PROFILER_ARGS} &
{% endraw %}
{% endif %}
{% raw %}
wait -n
first_exit_code=$?
echo "Script finished at $(date) with exit code ${first_exit_code}"
exit $first_exit_code
{% endraw %}
# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# ruff: noqa
# pylint: skip-file
import json
import os
import re
### Slurm configs
SLURM_JOB_ID = "slurm id"
### Model Deployment configurations
PREFILL_TP = "Prefill TP"
PREFILL_DP = "Prefill DP"
DECODE_TP = "Decode TP"
DECODE_DP = "Decode DP"
FRONTENDS = "Frontends"
### Profiler configs
PROFILER_TYPE = "Profiler type"
ISL = "ISL"
OSL = "OSL"
REQUEST_RATE = "Request rate"
CONCURRENCIES = "Concurrencies"
OUTPUT_TPS = "Output TPS"
OUTPUT_TPS_PER_USER = "Output TPS/User"
ITL = "Mean ITL (ms)"
TTFT = "Mean TTFT (ms)"
TPOT = "Mean TPOT (ms)"
### FORMAT PRINT ORDERS
KEY_PRINT_ORDER = [
SLURM_JOB_ID,
PREFILL_TP,
PREFILL_DP,
DECODE_TP,
DECODE_DP,
FRONTENDS,
PROFILER_TYPE,
ISL,
OSL,
REQUEST_RATE,
CONCURRENCIES,
OUTPUT_TPS,
OUTPUT_TPS_PER_USER,
ITL,
TTFT,
TPOT,
]
def format_key_order():
    report = "================\nThe following log will be reported according to this order:\n----\n"
    for key in KEY_PRINT_ORDER:
        report += f"{key}\n"
    print(report[:-1])


def format_print(result):
    report = "================\n"
    for key in KEY_PRINT_ORDER:
        report += f"{result.get(key, '')}\n"
    print(report[:-1])
def analyze_sgl_out(folder):
    result = []
    for file in os.listdir(folder):
        with open(f"{folder}/{file}", "r") as f:
            content = json.load(f)
            res = [
                content["max_concurrency"],
                content["output_throughput"],
                content["mean_itl_ms"],
                content["mean_ttft_ms"],
                content["request_rate"],
            ]
            if "mean_tpot_ms" in content:
                res.append(content["mean_tpot_ms"])
            result.append(res)
    out = {
        REQUEST_RATE: [],
        CONCURRENCIES: [],
        OUTPUT_TPS: [],
        ITL: [],
        TTFT: [],
        TPOT: [],
    }
    # Report the runs in order of increasing concurrency
    for data in sorted(result, key=lambda x: x[0]):
        con, tps, itl, ttft, req_rate = data[0:5]
        out[CONCURRENCIES].append(con)
        out[OUTPUT_TPS].append(tps)
        out[ITL].append(itl)
        out[TTFT].append(ttft)
        out[REQUEST_RATE].append(req_rate)
        if len(data) >= 6:
            if TPOT not in out:
                out[TPOT] = []
            out[TPOT].append(data[5])
    return out
def analyze_gap_out(folder):
    result = []
    for file in os.listdir(folder):
        with open(f"{folder}/{file}", "r") as f:
            content = json.load(f)
            result.append(
                (
                    content["input_config"]["perf_analyzer"]["stimulus"]["concurrency"],
                    content["output_token_throughput_per_user"]["avg"],
                    content["output_token_throughput"]["avg"],
                )
            )
    out = {CONCURRENCIES: [], OUTPUT_TPS: [], OUTPUT_TPS_PER_USER: []}
    # Report the runs in order of increasing concurrency
    for con, tpspuser, tps in sorted(result, key=lambda x: x[0]):
        out[CONCURRENCIES].append(con)
        out[OUTPUT_TPS].append(tps)
        out[OUTPUT_TPS_PER_USER].append(tpspuser)
    return out
def analyze(p):
    files = os.listdir(p)
    prefill_nodes = {}
    decode_nodes = {}
    frontends = []
    profile_result = {}
    for file in files:
        # Per-node logs are named <node>_<type>_<id>.out
        p_re = re.search(
            r"([-_A-Za-z0-9]+)_(prefill|decode|nginx|frontend)_([a-zA-Z0-9]+)\.out", file
        )
        if p_re is not None:
            _, node_type, number = p_re.groups()
            if node_type == "prefill":
                if number not in prefill_nodes:
                    prefill_nodes[number] = []
                prefill_nodes[number].append(file)
            elif node_type == "decode":
                if number not in decode_nodes:
                    decode_nodes[number] = []
                decode_nodes[number].append(file)
            elif node_type == "frontend":
                frontends.append(file)
        # Benchmark results live in folders named <profiler>_isl_<N>_osl_<M>
        profiler_match = re.match(r"(sglang|vllm|gap)_isl_([0-9]+)_osl_([0-9]+)", file)
        if profiler_match:
            profiler, isl, osl = profiler_match.groups()
            if profiler == "gap":
                profile_result = analyze_gap_out(f"{p}/{file}")
            else:
                profile_result = analyze_sgl_out(f"{p}/{file}")
            profile_result[PROFILER_TYPE] = profiler
            profile_result[ISL] = isl
            profile_result[OSL] = osl
    config = {SLURM_JOB_ID: p}
    # TP size = nodes in the first worker * 4 (hardcoded: 4 GPUs per GB200 node)
    if len(prefill_nodes.values()) != 0:
        config[PREFILL_TP] = f"{len(list(prefill_nodes.values())[0]) * 4}"
        config[PREFILL_DP] = f"{len(prefill_nodes.keys())}"
    if len(decode_nodes.values()) != 0:
        config[DECODE_TP] = f"{len(list(decode_nodes.values())[0]) * 4}"
        config[DECODE_DP] = f"{len(decode_nodes.keys())}"
    if len(frontends) != 0:
        config[FRONTENDS] = f"{len(frontends)}"
    result = {**config}
    for key, value in profile_result.items():
        # Flatten list-valued metrics into a comma-separated string
        result[key] = (
            ", ".join([str(x) for x in value]) if isinstance(value, list) else value
        )
    return result
paths = [x for x in os.listdir(".") if ".py" not in x and os.path.isdir(x)]
format_key_order()


def extract_job_id(dirname):
    """Extract job ID from directory name for sorting.

    Handles formats like:
    - 12345_3P_1D_20250104_123456 (disaggregated)
    - 12345_4A_20250104_123456 (aggregated)
    - 12345 (legacy format)
    """
    try:
        return int(dirname.split("_")[0])
    except (ValueError, IndexError):
        # If directory name doesn't match expected format, return -1
        return -1


for path in sorted(paths, key=extract_job_id, reverse=True):
    result = analyze(path)
    # Only report runs that produced benchmark results
    if OUTPUT_TPS in result:
        format_print(result)
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
prefill_workers=$1
decode_workers=$2
prefill_gpus=$3
decode_gpus=$4
total_gpus=$((prefill_gpus+decode_gpus))
chosen_isl=$5
chosen_osl=$6
chosen_concurrencies=$7
echo "Profiling for model with PrefillDP=${prefill_workers}, DecodeDP=${decode_workers}"
head_node="localhost"
head_port="8000"
SERVED_MODEL_NAME="deepseek-ai/DeepSeek-R1"
MODEL_PATH=/model/
random_seed=$RANDOM
echo "Chosen random seed ${random_seed}"
source /scripts/benchmark_utils.sh
wait_for_model $head_node $head_port $prefill_workers $decode_workers 5 900 60
set -e
warmup_model $head_node $head_port $SERVED_MODEL_NAME $MODEL_PATH "${chosen_isl}x${chosen_osl}x10000x10000x250"
set +e
aiperf_warmup_workers=$(python3 -c "print(max(${DP:-0}, ${prefill_workers:-0}, ${decode_workers:-0}))")
IFS='x' read -r -a concurrency_list <<< "$chosen_concurrencies"
profile_folder="/logs/gap_isl_${chosen_isl}_osl_${chosen_osl}"
mkdir -p $profile_folder
tmp_work_dir=$(mktemp -d -t aiperf-XXXXXXXX)
for concurrency in ${concurrency_list[@]}; do
export_folder="${tmp_work_dir}/concurrency_${concurrency}"
mkdir -p $export_folder
export_model_name=${SERVED_MODEL_NAME//\//_}
export_file="${export_model_name}_generation_${concurrency}.json"
echo "Run benchmark for concurrency $concurrency; ISL $chosen_isl; OSL $chosen_osl"
command=(
aiperf profile
-m ${SERVED_MODEL_NAME}
--tokenizer ${MODEL_PATH}
--endpoint-type chat
--endpoint /v1/chat/completions
--url "${head_node}:${head_port}"
--streaming
--concurrency ${concurrency}
--warmup-request-count $(( 2*aiperf_warmup_workers ))
--request-count $(( 5*concurrency ))
--synthetic-input-tokens-mean ${chosen_isl} --synthetic-input-tokens-stddev 0
--output-tokens-mean ${chosen_osl} --output-tokens-stddev 0
--extra-inputs "max_tokens:${chosen_osl}" --extra-inputs "min_tokens:${chosen_osl}"
--artifact-dir ${export_folder}
--profile-export-file ${export_file}
--random-seed ${random_seed}
--tokenizer-trust-remote-code
--num-dataset-entries 3000
)
set -e
"${command[@]}"
set +e
cp $export_folder/*/*_aiperf.json $profile_folder
done