Commit 959f810f authored by ishandhanani, committed by GitHub

feat: sglang + gb200 (#2223)

parent ae51b3f4
...@@ -43,11 +43,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -43,11 +43,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
### Large Scale P/D and WideEP Features ### Large Scale P/D and WideEP Features
| Feature | SGLang | Notes | | Feature | SGLang | Notes |
|--------------------|--------|-----------------------------------------------------------------------| |---------------------|--------|--------------------------------------------------------------|
| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) | | **WideEP** | ✅ | Full support on H100s/GB200 |
| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported | | **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not router to DP worker |
| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) | | **GB200 Support** | | |
## Quick Start ## Quick Start
...@@ -155,7 +155,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ ...@@ -155,7 +155,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example! Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
### Run on multi-node ### Run a multi-node sized model
- **[Run a multi-node model](docs/multinode-examples.md)** - **[Run a multi-node model](docs/multinode-examples.md)**
### Large scale P/D disaggregation with WideEP ### Large scale P/D disaggregation with WideEP
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Running DeepSeek-R1 Disaggregated with WideEP on GB200s
Dynamo supports SGLang's GB200 implementation of wide expert parallelism (WideEP) and large-scale P/D disaggregation for DeepSeek-R1. See the [SGLang blog post](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. Full end-to-end optimization is still a work in progress, but you can get this up and running with the following steps. In this example, we run 1 prefill worker on 2 GB200 nodes (4 GPUs each, 8 GPUs) and 1 decode worker on 12 GB200 nodes (4 GPUs each, 48 GPUs), for a total of 56 GPUs.
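The topology above determines the `--nnodes`, `--tp-size`, and `--dp-size` values used in the worker commands in this guide. As a minimal sketch (the variable and function names are illustrative and not part of Dynamo or SGLang), the sizes can be derived like this:

```bash
#!/bin/bash
# Illustrative only: derive the parallelism sizes for this example's topology.
GPUS_PER_NODE=4     # each GB200 node in this example exposes 4 GPUs
PREFILL_NODES=2
DECODE_NODES=12

# total_gpus <nodes> <gpus_per_node>: prints nodes * gpus_per_node
total_gpus() {
  echo $(( $1 * $2 ))
}

PREFILL_GPUS=$(total_gpus "$PREFILL_NODES" "$GPUS_PER_NODE")
DECODE_GPUS=$(total_gpus "$DECODE_NODES" "$GPUS_PER_NODE")

echo "prefill: --nnodes $PREFILL_NODES --tp-size $PREFILL_GPUS --dp-size $PREFILL_GPUS"
echo "decode:  --nnodes $DECODE_NODES --tp-size $DECODE_GPUS --dp-size $DECODE_GPUS"
echo "total GPUs: $(( PREFILL_GPUS + DECODE_GPUS ))"
```

With the values from this example, the prefill worker spans 8 GPUs and the decode worker spans 48 GPUs, which matches the `--tp-size` and `--dp-size` flags used in this guide.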
## Instructions
1. Build the Dynamo container
```bash
cd $DYNAMO_ROOT
docker build \
-f container/Dockerfile.sglang-wideep \
-t dynamo-wideep-gb200 \
--build-arg MODE=blackwell \
--build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \
--build-arg ARCH=arm64 \
--build-arg ARCH_ALT=aarch64 \
.
```
2. You can run this container on each 4xGB200 node using the following command.
> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--volume /PATH_TO_DSR1_MODEL/:/model/ \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-wideep-gb200:latest
```
3. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script also prints the environment variables to export on each node, which makes deployment easier.
```bash
./utils/gen_env_vars.sh
```
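Before launching the workers, it can help to confirm that the variables referenced by the worker commands in this guide are actually exported. A minimal sketch, assuming bash: `require_env` is a hypothetical helper, not part of the Dynamo utilities, and the variable names checked here are the ones that appear in the prefill and decode commands.

```bash
#!/bin/bash
# Hypothetical helper: report every named variable that is unset or empty.
# Returns 0 if all are set, 1 otherwise.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or empty if it is unset.
    if [ -z "${!name:-}" ]; then
      echo "Error: $name environment variable is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

if require_env HEAD_PREFILL_NODE_IP HEAD_DECODE_NODE_IP; then
  echo "Environment looks complete"
else
  echo "Export the missing variables before starting the workers" >&2
fi
```

Checking all names before returning means a single run reports every missing variable rather than stopping at the first one.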
4. Run the ingress and prefill worker
```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 utils/sgl_http_server.py --ns dynamo &
# run prefill worker
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
MC_FORCE_MNNVL=1 \
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/worker.py \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode prefill \
--dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
--disaggregation-bootstrap-port 30001 \
--disaggregation-transfer-backend nixl \
--nnodes 2 \
--node-rank 0 \
--tp-size 8 \
--dp-size 8 \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 6144 \
--context-length 2716 \
--disable-radix-cache \
--enable-deepep-moe \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--disable-cuda-graph \
--chunked-prefill-size 16384 \
--max-total-tokens 32768 \
--mem-fraction-static 0.8 \
--log-level debug
```
On the other prefill node (this example has 2 total prefill nodes), run the same command but change `--node-rank` to 1.
5. Run the decode worker on the head decode node
```bash
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/decode_worker.py \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode decode \
--dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
--disaggregation-bootstrap-port 30001 \
--nnodes 12 \
--node-rank 0 \
--tp-size 48 \
--dp-size 48 \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 36864 \
--context-length 2716 \
--disable-radix-cache \
--enable-deepep-moe \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--cuda-graph-bs 768 \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--chunked-prefill-size 36864 \
--mem-fraction-static 0.82 \
--log-level debug
```
On the other decode nodes (this example has 12 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11, one rank per node.
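The only per-node difference is the `--node-rank` value. As a rough sketch (the `decode_rank_flags` function is hypothetical, and how you reach each node, for example SSH or a scheduler, is left out because it depends on your cluster), the flags that vary across the non-head decode nodes can be enumerated like this:

```bash
#!/bin/bash
DECODE_NODES=12   # from this example: 12 decode nodes, ranks 0..11

# Hypothetical helper: print the node-specific flags for every non-head rank.
decode_rank_flags() {
  local nnodes=$1 rank
  for rank in $(seq 1 $(( nnodes - 1 ))); do
    echo "--nnodes ${nnodes} --node-rank ${rank}"
  done
}

decode_rank_flags "$DECODE_NODES"
```

Each printed line corresponds to one additional decode node. All other flags and environment variables stay identical to the head decode node's command.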
...@@ -9,22 +9,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca ...@@ -9,22 +9,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca
## Instructions ## Instructions
1. Pull the SGLang release `v0.4.8.post1` container. We are actively working on validating newer releases. 1. Build the Dynamo container
```bash
docker pull lmsysorg/sglang:v0.4.8.post1-cu126
```
You can also pull a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags)
2. Build the Dynamo container
```bash ```bash
cd $DYNAMO_ROOT cd $DYNAMO_ROOT
docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
``` ```
3. You can run this container on each 8xH100 node using the following command. You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
2. You can run this container on each 8xH100 node using the following command.
> [!IMPORTANT] > [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1) > We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
...@@ -47,17 +41,17 @@ docker run \ ...@@ -47,17 +41,17 @@ docker run \
In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory. In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory.
4. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier. 3. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
```bash ```bash
./utils/gen_env_vars.sh ./utils/gen_env_vars.sh
``` ```
5. Run the ingress and prefill worker 4. Run the ingress and prefill worker
```bash ```bash
# run ingress # run ingress
dynamo run in=http out=dyn & python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below) # optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 utils/sgl_http_server.py --ns dynamo & python3 utils/sgl_http_server.py --ns dynamo &
# run prefill worker # run prefill worker
...@@ -93,7 +87,7 @@ python3 -m dynamo.sglang.worker \ ...@@ -93,7 +87,7 @@ python3 -m dynamo.sglang.worker \
On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3 On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3
7. Run the decode worker on the head decode node 5. Run the decode worker on the head decode node
```bash ```bash
python3 -m dynamo.sglang.decode_worker \ python3 -m dynamo.sglang.decode_worker \
...@@ -121,7 +115,7 @@ python3 -m dynamo.sglang.decode_worker \ ...@@ -121,7 +115,7 @@ python3 -m dynamo.sglang.decode_worker \
--deepep-mode low_latency \ --deepep-mode low_latency \
--mem-fraction-static 0.835 \ --mem-fraction-static 0.835 \
--ep-num-redundant-experts 32 \ --ep-num-redundant-experts 32 \
--cuda-graph-bs 256 --cuda-graph-bs 128
``` ```
On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8 On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
...@@ -131,6 +125,7 @@ On the other decode nodes (this example has 9 total decode nodes), run the same ...@@ -131,6 +125,7 @@ On the other decode nodes (this example has 9 total decode nodes), run the same
In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands: In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
prefill: prefill:
```bash ```bash
... ...
--max-running-requests 8192 \ --max-running-requests 8192 \
...@@ -142,6 +137,7 @@ prefill: ...@@ -142,6 +137,7 @@ prefill:
``` ```
decode: decode:
```bash ```bash
... ...
--max-running-requests 18432 \ --max-running-requests 18432 \
...@@ -152,9 +148,10 @@ decode: ...@@ -152,9 +148,10 @@ decode:
We currently provide 2 different ways to perform an end to end benchmark which includes using our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single batch workloads in the future. We currently provide 2 different ways to perform an end to end benchmark which includes using our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single batch workloads in the future.
1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL** 1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used. We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used.
Example usage: Example usage:
```bash ```bash
# warmup # warmup
./utils/bench.sh HEAD_PREFILL_NODE_IP --type warmup ./utils/bench.sh HEAD_PREFILL_NODE_IP --type warmup
...@@ -165,9 +162,10 @@ curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache ...@@ -165,9 +162,10 @@ curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
``` ```
2. **GenAI Perf to benchmark completions with custom dataset** 2. **GenAI Perf to benchmark completions with custom dataset**
We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAIPerf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL. We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAI Perf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
Example usage: Example usage:
```bash ```bash
# generate data # generate data
python3 src/dynamo/sglang/utils/generate_bench_data.py --output data.jsonl --num-prompts 8192 --input-len 4096 --output-len 5 --model deepseek-ai/DeepSeek-R1 python3 src/dynamo/sglang/utils/generate_bench_data.py --output data.jsonl --num-prompts 8192 --input-len 4096 --output-len 5 --model deepseek-ai/DeepSeek-R1
......
...@@ -45,6 +45,7 @@ logs/ ...@@ -45,6 +45,7 @@ logs/
## Setup ## Setup
For simplicity of the example, we will make some assumptions about your SLURM cluster: For simplicity of the example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes 1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes that are performantly testing, you should aim to allocate groups of nodes that are performantly
...@@ -61,7 +62,11 @@ For simplicity of the example, we will make some assumptions about your SLURM cl ...@@ -61,7 +62,11 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
## Usage ## Usage
> [!NOTE]
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.
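To make the parsing step concrete, here is a minimal sketch of the `awk` filter on its own, fed with a sample line in the format `ip route get` typically prints. The `extract_src_ip` name and the sample addresses are illustrative. The filter expression is the one used in the template.

```bash
#!/bin/bash
# Illustrative helper: read `ip route get <addr>` output on stdin and print
# the address that follows the "src" keyword.
extract_src_ip() {
  awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}'
}

# Sample line (made-up addresses) in the usual `ip route get` output format
sample="10.0.0.7 via 10.0.0.1 dev eth0 src 10.0.0.5 uid 1000"
echo "$sample" | extract_src_ip
```

If the filter prints nothing on your cluster, the route lookup returned no `src` field. That is the case the template's empty-IP check guards against.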
1. **Submit a benchmark job**: 1. **Submit a benchmark job**:
```bash ```bash
python submit_job_script.py \ python submit_job_script.py \
--template job_script_template.j2 \ --template job_script_template.j2 \
...@@ -72,6 +77,7 @@ For simplicity of the example, we will make some assumptions about your SLURM cl ...@@ -72,6 +77,7 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
``` ```
**Required arguments**: **Required arguments**:
- `--template`: Path to Jinja2 template file - `--template`: Path to Jinja2 template file
- `--model-dir`: Model directory path - `--model-dir`: Model directory path
- `--config-dir`: Config directory path - `--config-dir`: Config directory path
...@@ -79,26 +85,65 @@ For simplicity of the example, we will make some assumptions about your SLURM cl ...@@ -79,26 +85,65 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
- `--account`: SLURM account - `--account`: SLURM account
**Optional arguments**: **Optional arguments**:
- `--prefill-nodes`: Number of prefill nodes (default: `2`) - `--prefill-nodes`: Number of prefill nodes (default: `2`)
- `--decode-nodes`: Number of decode nodes (default: `2`) - `--decode-nodes`: Number of decode nodes (default: `2`)
- `--gpus-per-node`: Number of GPUs per node (default: `8`) - `--gpus-per-node`: Number of GPUs per node (default: `8`)
- `--network-interface`: Network interface to use (default: `eth3`) - `--network-interface`: Network interface to use (default: `eth3`)
- `--job-name`: SLURM job name (default: `dynamo_setup`) - `--job-name`: SLURM job name (default: `dynamo_setup`)
- `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`) - `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)
- `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`)
- `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`)
**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters. **Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
2. **Monitor job progress**: 2. **Example with different GPU types**:
```bash
# For H100 with Dynamo (default)
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type h100
# For GB200 with SGLang
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type gb200 \
--use-sglang-commands \
--gpus-per-node 4
```
3. **Monitor job progress**:
```bash ```bash
squeue -u $USER squeue -u $USER
``` ```
3. **Check logs in real-time**: 4. **Check logs in real-time**:
```bash ```bash
tail -f logs/{JOB_ID}/log.out tail -f logs/{JOB_ID}/log.out
``` ```
4. **Monitor GPU utilization**: You can view logs of all prefill or decode workers simultaneously by running:
```bash
# prefill workers err (or .out)
tail -f logs/{JOB_ID}/*_prefill.err
# decode workers err (or .out)
tail -f logs/{JOB_ID}/*_decode.err
```
5. **Monitor GPU utilization**:
```bash ```bash
tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
``` ```
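The node accounting described in the note on `--prefill-nodes` and `--decode-nodes` can be sketched as follows. This illustrates the arithmetic only, using the default values listed under the optional arguments. It is not the code in `submit_job_script.py`.

```bash
#!/bin/bash
PREFILL_NODES=2   # default for --prefill-nodes
DECODE_NODES=2    # default for --decode-nodes
GPUS_PER_NODE=8   # default for --gpus-per-node

TOTAL_NODES=$(( PREFILL_NODES + DECODE_NODES ))
TOTAL_GPUS=$(( TOTAL_NODES * GPUS_PER_NODE ))

# Prefill workers take the first PREFILL_NODES entries of the node list,
# decode workers take the next DECODE_NODES entries.
echo "total nodes: $TOTAL_NODES (prefill 0..$(( PREFILL_NODES - 1 )), decode $PREFILL_NODES..$(( TOTAL_NODES - 1 )))"
echo "total GPUs:  $TOTAL_GPUS"
```

For a GB200 run, set `GPUS_PER_NODE=4` to match the `--gpus-per-node 4` flag shown in the GB200 example.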
......
...@@ -7,6 +7,7 @@ ...@@ -7,6 +7,7 @@
#SBATCH --time={{ time_limit }} #SBATCH --time={{ time_limit }}
#SBATCH --output=logs/%j/log.out #SBATCH --output=logs/%j/log.out
#SBATCH --error=logs/%j/log.err #SBATCH --error=logs/%j/log.err
#SBATCH --partition={{ partition }}
# Constants # Constants
PREFILL_NODES={{ prefill_nodes }} PREFILL_NODES={{ prefill_nodes }}
...@@ -20,6 +21,8 @@ MODEL_DIR="{{ model_dir }}" ...@@ -20,6 +21,8 @@ MODEL_DIR="{{ model_dir }}"
CONFIG_DIR="{{ config_dir }}" CONFIG_DIR="{{ config_dir }}"
CONTAINER_IMAGE="{{ container_image }}" CONTAINER_IMAGE="{{ container_image }}"
NETWORK_INTERFACE="{{ network_interface }}" NETWORK_INTERFACE="{{ network_interface }}"
GPU_TYPE="{{ gpu_type | default('h100') }}"
USE_SGLANG_COMMANDS="{{ use_sglang_commands | default(false) }}"
{% raw %} {% raw %}
...@@ -36,14 +39,14 @@ for i in "${!nodes[@]}"; do ...@@ -36,14 +39,14 @@ for i in "${!nodes[@]}"; do
echo "Node $i: ${nodes[$i]}" echo "Node $i: ${nodes[$i]}"
done done
PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+') PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ip route get $(getent ahosts ${nodes[0]} | grep STREAM | head -1 | awk '{print $1}') | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}')
if [ -z "$PREFILL_HOST_IP" ]; then if [ -z "$PREFILL_HOST_IP" ]; then
echo "Error: Could not retrieve IP address for prefill host ${nodes[0]} on interface $NETWORK_INTERFACE" echo "Error: Could not retrieve IP address for prefill host ${nodes[0]} on interface $NETWORK_INTERFACE"
exit 1 exit 1
fi fi
echo "Prefill host IP address: $PREFILL_HOST_IP" echo "Prefill host IP address: $PREFILL_HOST_IP"
DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+') DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ip route get $(getent ahosts ${nodes[$PREFILL_NODES]} | grep STREAM | head -1 | awk '{print $1}') | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}')
if [ -z "$DECODE_HOST_IP" ]; then if [ -z "$DECODE_HOST_IP" ]; then
echo "Error: Could not retrieve IP address for decode host ${nodes[$PREFILL_NODES]} on interface $NETWORK_INTERFACE" echo "Error: Could not retrieve IP address for decode host ${nodes[$PREFILL_NODES]} on interface $NETWORK_INTERFACE"
exit 1 exit 1
...@@ -54,21 +57,25 @@ echo "Decode host IP address: $DECODE_HOST_IP" ...@@ -54,21 +57,25 @@ echo "Decode host IP address: $DECODE_HOST_IP"
ENROOT_ARGS="\ ENROOT_ARGS="\
--container-image=${CONTAINER_IMAGE} \ --container-image=${CONTAINER_IMAGE} \
--no-container-entrypoint \ --no-container-entrypoint \
--container-mount-home \ --no-container-mount-home \
--no-container-remap-root \
--container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \ --container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \
" "
# Build common worker arguments
WORKER_ARGS="--gpu_type ${GPU_TYPE} --gpus_per_node ${GPUS_PER_NODE}"
if [ "$USE_SGLANG_COMMANDS" = "True" ]; then
WORKER_ARGS="${WORKER_ARGS} --use-sglang-commands"
fi
# Launch prefill tasks on the first PREFILL_NODES nodes # Launch prefill tasks on the first PREFILL_NODES nodes
for i in $(seq 0 $((PREFILL_NODES - 1))); do for i in $(seq 0 $((PREFILL_NODES - 1))); do
node=${nodes[$i]} node=${nodes[$i]}
rank=$i rank=$i
echo "Launching prefill task on node ${i} (rank ${rank}): $node" echo "Launching prefill task on node ${i} (rank ${rank}): $node"
echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err"
echo "Command: python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &" cmd="srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log ${WORKER_ARGS}"
srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \ echo "$cmd"
--output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err \ $cmd &
python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &
done done
# Launch decode tasks on the next DECODE_NODES nodes # Launch decode tasks on the next DECODE_NODES nodes
...@@ -76,11 +83,10 @@ for i in $(seq $PREFILL_NODES $((PREFILL_NODES + DECODE_NODES - 1))); do ...@@ -76,11 +83,10 @@ for i in $(seq $PREFILL_NODES $((PREFILL_NODES + DECODE_NODES - 1))); do
node=${nodes[$i]} node=${nodes[$i]}
rank=$((i - PREFILL_NODES)) rank=$((i - PREFILL_NODES))
echo "Launching decode task on node ${i} (rank ${rank}): $node" echo "Launching decode task on node ${i} (rank ${rank}): $node"
echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err"
echo "Command: python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &" cmd="srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log ${WORKER_ARGS}"
srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \ echo "$cmd"
--output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err \ $cmd &
python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &
done done
echo "" echo ""
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Function to print usage
print_usage() {
echo "Usage: $0 <mode> <cmd>"
echo " mode: prefill or decode"
echo " cmd: dynamo or sglang"
echo ""
echo "Examples:"
echo " $0 prefill dynamo"
echo " $0 decode sglang"
exit 1
}
# Check if correct number of arguments provided
if [ $# -ne 2 ]; then
echo "Error: Expected 2 arguments, got $#"
print_usage
fi
# Parse arguments
mode=$1
cmd=$2
# Validate mode argument
if [ "$mode" != "prefill" ] && [ "$mode" != "decode" ]; then
echo "Error: mode must be 'prefill' or 'decode', got '$mode'"
print_usage
fi
# Validate cmd argument
if [ "$cmd" != "dynamo" ] && [ "$cmd" != "sglang" ]; then
echo "Error: cmd must be 'dynamo' or 'sglang', got '$cmd'"
print_usage
fi
echo "Mode: $mode"
echo "Command: $cmd"
# Check if required environment variables are set
if [ -z "$HOST_IP" ]; then
echo "Error: HOST_IP environment variable is not set"
exit 1
fi
if [ -z "$PORT" ]; then
echo "Error: PORT environment variable is not set"
exit 1
fi
if [ -z "$TOTAL_GPUS" ]; then
echo "Error: TOTAL_GPUS environment variable is not set"
exit 1
fi
if [ -z "$RANK" ]; then
echo "Error: RANK environment variable is not set"
exit 1
fi
if [ -z "$TOTAL_NODES" ]; then
echo "Error: TOTAL_NODES environment variable is not set"
exit 1
fi
# TODO: since the args for sglang and dynamo are the same, we can be a bit cleaner here
# Construct command based on mode and cmd
if [ "$mode" = "prefill" ]; then
if [ "$cmd" = "dynamo" ]; then
# We are not using an init-expert-location file for e2e benchmarking
# We also don't currently have a --deepep-config file for GB200
# Need to increase --context-length to 10k for 8k1k benchmarking
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
MC_FORCE_MNNVL=1 \
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/worker.py \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode prefill \
--dist-init-addr "$HOST_IP:$PORT" \
--disaggregation-bootstrap-port 30001 \
--disaggregation-transfer-backend nixl \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
--tp-size "$TOTAL_GPUS" \
--dp-size "$TOTAL_GPUS" \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 6144 \
--context-length 2716 \
--disable-radix-cache \
--enable-deepep-moe \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--disable-cuda-graph \
--chunked-prefill-size 16384 \
--max-total-tokens 32768 \
--mem-fraction-static 0.8 \
--log-level debug
elif [ "$cmd" = "sglang" ]; then
# GB200 sglang prefill command
# We are not using an init-expert-location file for e2e benchmarking
# We also don't currently have a --deepep-config file for GB200
# Need to increase --context-length to 10k for 8k1k benchmarking
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m sglang.launch_server \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--trust-remote-code \
--disaggregation-mode prefill \
--dist-init-addr "$HOST_IP:$PORT" \
--disaggregation-bootstrap-port 30001 \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
--tp-size "$TOTAL_GPUS" \
--dp-size "$TOTAL_GPUS" \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 6144 \
--context-length 2716 \
--disable-radix-cache \
--enable-deepep-moe \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--disable-cuda-graph \
--chunked-prefill-size 16384 \
--max-total-tokens 32768 \
--mem-fraction-static 0.8 \
--log-level debug
fi
elif [ "$mode" = "decode" ]; then
if [ "$cmd" = "dynamo" ]; then
# Need to increase --context-length to 10k for 8k1k benchmarking
# We are not using an init-expert-location file for e2e benchmarking
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/decode_worker.py \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode decode \
--dist-init-addr "$HOST_IP:$PORT" \
--disaggregation-bootstrap-port 30001 \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
--tp-size "$TOTAL_GPUS" \
--dp-size "$TOTAL_GPUS" \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 36864 \
--context-length 2716 \
--disable-radix-cache \
--enable-deepep-moe \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--cuda-graph-bs 768 \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--chunked-prefill-size 36864 \
--mem-fraction-static 0.82 \
--log-level debug
elif [ "$cmd" = "sglang" ]; then
# GB200 sglang decode command
# Need to increase --context-length to 10k for 8k1k benchmarking
# We are not using an init-expert-location file for e2e benchmarking
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m sglang.launch_server \
--model-path /model/ \
--trust-remote-code \
--disaggregation-mode decode \
--dist-init-addr "$HOST_IP:$PORT" \
--disaggregation-bootstrap-port 30001 \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
--tp-size "$TOTAL_GPUS" \
--dp-size "$TOTAL_GPUS" \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 36864 \
--context-length 2716 \
--disable-radix-cache \
--enable-deepep-moe \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--cuda-graph-bs 768 \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--chunked-prefill-size 36864 \
--mem-fraction-static 0.82 \
--log-level debug
fi
fi
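The GB200 `sglang` prefill and decode commands above share most of their flags and differ in a handful of tuning values. As a minimal sketch, the Python below tabulates only those differences, with values copied from the script. The names `GB200_TUNING`, `mode_flags`, and `mode_env` are hypothetical and not part of the commit.

```python
# Illustrative only: per-mode tuning values copied from the GB200 sglang
# commands in the script above. All names here are hypothetical helpers.
GB200_TUNING = {
    "prefill": {
        "dispatch_tokens_per_rank": 2048,
        "max_running_requests": 6144,
        "chunked_prefill_size": 16384,
        "mem_fraction_static": 0.8,
        "cuda_graph_bs": None,      # prefill passes --disable-cuda-graph
        "max_total_tokens": 32768,
    },
    "decode": {
        "dispatch_tokens_per_rank": 768,
        "max_running_requests": 36864,
        "chunked_prefill_size": 36864,
        "mem_fraction_static": 0.82,
        "cuda_graph_bs": 768,
        "max_total_tokens": None,   # decode does not pass --max-total-tokens
    },
}


def mode_flags(mode: str) -> list[str]:
    """Return only the launch flags whose values differ between modes."""
    tuning = GB200_TUNING[mode]
    flags = [
        "--max-running-requests", str(tuning["max_running_requests"]),
        "--chunked-prefill-size", str(tuning["chunked_prefill_size"]),
        "--mem-fraction-static", str(tuning["mem_fraction_static"]),
    ]
    if tuning["max_total_tokens"] is not None:
        flags += ["--max-total-tokens", str(tuning["max_total_tokens"])]
    if tuning["cuda_graph_bs"] is None:
        flags.append("--disable-cuda-graph")
    else:
        flags += ["--cuda-graph-bs", str(tuning["cuda_graph_bs"])]
    return flags


def mode_env(mode: str) -> dict[str, str]:
    """Return the one environment variable whose value differs between modes."""
    tokens = GB200_TUNING[mode]["dispatch_tokens_per_rank"]
    return {"SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK": str(tokens)}
```

The sketch leaves out flags and variables that are identical in both modes, as well as `SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1`, which the script sets only for decode.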
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Function to print usage
print_usage() {
echo "Usage: $0 <mode> <cmd>"
echo " mode: prefill or decode"
echo " cmd: dynamo or sglang"
echo ""
echo "Examples:"
echo " $0 prefill dynamo"
echo " $0 decode sglang"
exit 1
}
# Check if correct number of arguments provided
if [ $# -ne 2 ]; then
echo "Error: Expected 2 arguments, got $#"
print_usage
fi
# Parse arguments
mode=$1
cmd=$2
# Validate mode argument
if [ "$mode" != "prefill" ] && [ "$mode" != "decode" ]; then
echo "Error: mode must be 'prefill' or 'decode', got '$mode'"
print_usage
fi
# Validate cmd argument
if [ "$cmd" != "dynamo" ] && [ "$cmd" != "sglang" ]; then
echo "Error: cmd must be 'dynamo' or 'sglang', got '$cmd'"
print_usage
fi
echo "Mode: $mode"
echo "Command: $cmd"
# Check if required environment variables are set
if [ -z "$HOST_IP" ]; then
echo "Error: HOST_IP environment variable is not set"
exit 1
fi
if [ -z "$PORT" ]; then
echo "Error: PORT environment variable is not set"
exit 1
fi
if [ -z "$TOTAL_GPUS" ]; then
echo "Error: TOTAL_GPUS environment variable is not set"
exit 1
fi
if [ -z "$RANK" ]; then
echo "Error: RANK environment variable is not set"
exit 1
fi
if [ -z "$TOTAL_NODES" ]; then
echo "Error: TOTAL_NODES environment variable is not set"
exit 1
fi
# Construct command based on mode and cmd
if [ "$mode" = "prefill" ]; then
if [ "$cmd" = "dynamo" ]; then
# H100 dynamo prefill command
python3 components/worker.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--dist-init-addr "$HOST_IP:$PORT" \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
--tp-size "$TOTAL_GPUS" \
--dp-size "$TOTAL_GPUS" \
--enable-dp-attention \
--decode-log-interval 1 \
--enable-deepep-moe \
--page-size 1 \
--trust-remote-code \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-radix-cache \
--watchdog-timeout 1000000 \
--enable-two-batch-overlap \
--deepep-mode normal \
--mem-fraction-static 0.85 \
--deepep-config /configs/deepep.json \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm dynamic \
--eplb-algorithm deepseek
elif [ "$cmd" = "sglang" ]; then
# H100 sglang prefill command
python3 -m sglang.launch_server \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--disaggregation-transfer-backend nixl \
--disaggregation-mode prefill \
--dist-init-addr "$HOST_IP:$PORT" \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
--tp-size "$TOTAL_GPUS" \
--dp-size "$TOTAL_GPUS" \
--enable-dp-attention \
--decode-log-interval 1 \
--enable-deepep-moe \
--page-size 1 \
--host 0.0.0.0 \
--trust-remote-code \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-radix-cache \
--watchdog-timeout 1000000 \
--enable-two-batch-overlap \
--deepep-mode normal \
--mem-fraction-static 0.85 \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm dynamic \
--eplb-algorithm deepseek \
--deepep-config /configs/deepep.json
fi
elif [ "$mode" = "decode" ]; then
if [ "$cmd" = "dynamo" ]; then
# H100 dynamo decode command
python3 components/decode_worker.py \
--model-path /model/ \
--served-model-name deepseek-ai/DeepSeek-R1 \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--disaggregation-bootstrap-port 30001 \
--dist-init-addr "$HOST_IP:$PORT" \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
--tp-size "$TOTAL_GPUS" \
--dp-size "$TOTAL_GPUS" \
--enable-dp-attention \
--decode-log-interval 1 \
--enable-deepep-moe \
--page-size 1 \
--trust-remote-code \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-radix-cache \
--watchdog-timeout 1000000 \
--enable-two-batch-overlap \
--deepep-mode low_latency \
--mem-fraction-static 0.835 \
--ep-num-redundant-experts 32 \
--cuda-graph-bs 256
elif [ "$cmd" = "sglang" ]; then
# H100 sglang decode command
python3 -m sglang.launch_server \
--model-path /model/ \
--disaggregation-transfer-backend nixl \
--disaggregation-mode decode \
--dist-init-addr "$HOST_IP:$PORT" \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
--tp-size "$TOTAL_GPUS" \
--dp-size "$TOTAL_GPUS" \
--enable-dp-attention \
--decode-log-interval 1 \
--enable-deepep-moe \
--page-size 1 \
--host 0.0.0.0 \
--trust-remote-code \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-radix-cache \
--watchdog-timeout 1000000 \
--enable-two-batch-overlap \
--deepep-mode low_latency \
--mem-fraction-static 0.835 \
--ep-num-redundant-experts 32 \
--cuda-graph-bs 256
fi
fi
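The checks at the top of the H100 script (two positional arguments, then five required environment variables) can also be expressed in Python, for example to validate inputs before shelling out. This is a sketch under stated assumptions: the function name `validate_launch_inputs` is hypothetical, and unlike the shell script, which exits at the first failure, it collects every problem it finds.

```python
# Hypothetical pre-flight check mirroring the validation in the shell script.
REQUIRED_ENV = ("HOST_IP", "PORT", "TOTAL_GPUS", "RANK", "TOTAL_NODES")
VALID_MODES = ("prefill", "decode")
VALID_CMDS = ("dynamo", "sglang")


def validate_launch_inputs(mode: str, cmd: str, env: dict[str, str]) -> list[str]:
    """Return a list of error messages; an empty list means the inputs are valid."""
    errors = []
    if mode not in VALID_MODES:
        errors.append(f"mode must be 'prefill' or 'decode', got '{mode}'")
    if cmd not in VALID_CMDS:
        errors.append(f"cmd must be 'dynamo' or 'sglang', got '{cmd}'")
    for name in REQUIRED_ENV:
        # An empty string counts as unset, matching the shell's `-z` test.
        if not env.get(name):
            errors.append(f"{name} environment variable is not set")
    return errors
```

Passing `dict(os.environ)` as `env` would check the current process environment; an explicit dictionary keeps the function easy to test.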
@@ -8,8 +8,8 @@ benchmark_dynamo.sh script.
 The script will:
 - Setup the environment
-- Update the YAML config file
-- Start Dynamo graphs.disagg service
+- Generate the python3 command to run the prefill or decode worker
+- Start dynamo (or sglang)
 - Monitor the GPU utilization
 """
@@ -165,6 +165,19 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
         default=None,
         help="File to log GPU utilization (default: None)",
     )
+    parser.add_argument(
+        "--use-sglang-commands",
+        action="store_true",
+        default=False,
+        help="Helper to spin up SGLang servers instead of dynamo. This is helpful for benchmarking SGLang as well",
+    )
+    parser.add_argument(
+        "--gpu_type",
+        type=str,
+        choices=["h100", "gb200"],
+        default="h100",
+        help="Type of GPU to use",
+    )
     return parser.parse_args(args)
@@ -181,73 +194,114 @@ def _validate_args(args: argparse.Namespace) -> None:
         raise ValueError("GPUs per node must be at least 1")


-def setup_prefill_node(
-    rank: int, prefill_host_ip: str, total_nodes: int, total_gpus: int
-) -> int:
+def get_sglang_mini_lb_command_args(prefill_host_ip: str, decode_host_ip: str) -> str:
+    cmd = (
+        f"python3 -m sglang.srt.disaggregation.launch_lb "
+        f"--prefill http://{prefill_host_ip}:30000 "
+        f"--decode http://{decode_host_ip}:30000 "
+        "--host 0.0.0.0 "
+        "--port 8000 "
+        "--timeout 3600"
+    )
+    return cmd
+
+
+def setup_env_vars_for_gpu_script(
+    host_ip: str,
+    rank: int,
+    total_gpus: int,
+    total_nodes: int,
+    port: int = DIST_INIT_PORT,
+):
+    """Setup environment variables required by GPU scripts (h100.sh, gb200.sh)"""
+    os.environ["HOST_IP"] = host_ip
+    os.environ["PORT"] = str(port)
+    os.environ["TOTAL_GPUS"] = str(total_gpus)
+    os.environ["RANK"] = str(rank)
+    os.environ["TOTAL_NODES"] = str(total_nodes)
+    logging.info(f"Set HOST_IP: {host_ip}")
+    logging.info(f"Set PORT: {port}")
+    logging.info(f"Set TOTAL_GPUS: {total_gpus}")
+    logging.info(f"Set RANK: {rank}")
+    logging.info(f"Set TOTAL_NODES: {total_nodes}")
+
+
+def get_gpu_command(worker_type: str, use_sglang_commands: bool, gpu_type: str) -> str:
+    """Generate command to run the appropriate GPU script"""
+    script_name = f"{gpu_type}.sh"
+    script_path = Path(__file__).parent / script_name
+    mode = worker_type  # "prefill" or "decode"
+    cmd = "sglang" if use_sglang_commands else "dynamo"
+    return f"bash {script_path} {mode} {cmd}"
+
+
+def setup_head_prefill_node(prefill_host_ip: str) -> None:
     """
-    Setup the prefill node.
+    Setup NATS, etcd, ingress, and http servers on the prefill host node.
     """
-    if rank == 0:
-        logging.info(f"Setting up host prefill node: {rank}")
-        logging.info(f"Starting nats server on node {rank} with IP {prefill_host_ip}")
+    logging.info(f"Starting nats server on node {prefill_host_ip}")

-        nats_process = run_command("nats-server -js", background=True)
-        if not nats_process:
-            raise RuntimeError("Failed to start nats-server")
+    nats_process = run_command("nats-server -js", background=True)
+    if not nats_process:
+        raise RuntimeError("Failed to start nats-server")

-        etcd_cmd = (
-            f"etcd --listen-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} "
-            f"--advertise-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} "
-            f"--listen-peer-urls {ETCD_LISTEN_ADDR}:{ETCD_PEER_PORT} "
-            f"--initial-cluster default=http://{prefill_host_ip}:{ETCD_PEER_PORT}"
-        )
+    logging.info(f"Starting etcd server on node {prefill_host_ip}")
+    etcd_cmd = (
+        f"etcd --listen-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} "
+        f"--advertise-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} "
+        f"--listen-peer-urls {ETCD_LISTEN_ADDR}:{ETCD_PEER_PORT} "
+        f"--initial-cluster default=http://{prefill_host_ip}:{ETCD_PEER_PORT}"
+    )

-        etcd_process = run_command(etcd_cmd, background=True)
-        if not etcd_process:
-            raise RuntimeError("Failed to start etcd")
+    etcd_process = run_command(etcd_cmd, background=True)
+    if not etcd_process:
+        raise RuntimeError("Failed to start etcd")

-        ingress_process = run_command("dynamo run in=http out=dyn", background=True)
-        if not ingress_process:
-            raise RuntimeError("Failed to start ingress")
+    logging.info(f"Starting ingress server on node {prefill_host_ip}")
+    ingress_process = run_command(
+        "dynamo run in=http out=dyn --http-port=8000", background=True
+    )
+    if not ingress_process:
+        raise RuntimeError("Failed to start ingress")

+    logging.info(
+        f"Starting http server on port 9001 for flush_cache endpoint on node {prefill_host_ip}"
+    )
+    cache_flush_server_cmd = "python3 utils/sgl_http_server.py --ns dynamo"
+    cache_flush_server_process = run_command(cache_flush_server_cmd, background=True)
+    if not cache_flush_server_process:
+        raise RuntimeError("Failed to start cache flush server")
+
+
+def setup_prefill_node(
+    rank: int,
+    prefill_host_ip: str,
+    total_nodes: int,
+    total_gpus: int,
+    use_sglang_commands: bool,
+    gpu_type: str,
+) -> int:
+    """
+    Setup the prefill node.
+    """
+    if not use_sglang_commands:
+        if rank == 0:
+            setup_head_prefill_node(prefill_host_ip)
+        else:
+            logging.info(f"Setting up child prefill node: {rank}")
+            if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"):
+                raise RuntimeError("Failed to connect to etcd")
     else:
-        logging.info(f"Setting up child prefill node: {rank}")
-        if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"):
-            raise RuntimeError("Failed to connect to etcd")
+        logging.info("Using SGLang servers. No need to setup etcd or nats")

-    # NOTE: This implements the example in examples/sglang/dsr1-wideep.md
-    # For other examples, the command might have to be modified.
-    dynamo_cmd = (
-        f"python3 -m dynamo.sglang.worker "
-        "--model-path /model/ "
-        "--served-model-name deepseek-ai/DeepSeek-R1 "
-        "--skip-tokenizer-init "
-        "--disaggregation-mode prefill "
-        "--disaggregation-transfer-backend nixl "
-        "--disaggregation-bootstrap-port 30001 "
-        f"--dist-init-addr {prefill_host_ip}:{DIST_INIT_PORT} "
-        f"--nnodes {total_nodes} "
-        f"--node-rank {rank} "
-        f"--tp-size {total_gpus} "
-        f"--dp-size {total_gpus} "
-        "--enable-dp-attention "
-        "--decode-log-interval 1 "
-        "--enable-deepep-moe "
-        "--page-size 1 "
-        "--trust-remote-code "
-        "--moe-dense-tp-size 1 "
-        "--enable-dp-lm-head "
-        "--disable-radix-cache "
-        "--watchdog-timeout 1000000 "
-        "--enable-two-batch-overlap "
-        "--deepep-mode normal "
-        "--mem-fraction-static 0.85 "
-        "--deepep-config /configs/deepep.json "
-        "--ep-num-redundant-experts 32 "
-        "--ep-dispatch-algorithm dynamic "
-        "--eplb-algorithm deepseek "
-    )
-    return run_command(dynamo_cmd)
+    # Setup environment variables for GPU script
+    setup_env_vars_for_gpu_script(prefill_host_ip, rank, total_gpus, total_nodes)
+
+    # Use appropriate GPU script instead of generating command directly
+    cmd_to_run = get_gpu_command("prefill", use_sglang_commands, gpu_type)
+    return run_command(cmd_to_run)


 def setup_decode_node(
@@ -256,45 +310,29 @@ def setup_decode_node(
     prefill_host_ip: str,
     total_nodes: int,
     total_gpus: int,
+    use_sglang_commands: bool,
+    gpu_type: str,
 ) -> int:
     """
     Setup the decode node.
     """
     logging.info(f"Setting up child decode node: {rank}")
-    if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"):
-        raise RuntimeError("Failed to connect to etcd")
+    if use_sglang_commands:
+        sgl_mini_lb_cmd = get_sglang_mini_lb_command_args(
+            prefill_host_ip, decode_host_ip
+        )
+        run_command(sgl_mini_lb_cmd, background=True)
+    else:
+        if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"):
+            raise RuntimeError("Failed to connect to etcd")

-    dynamo_cmd = (
-        "python3 -m dynamo.sglang.decode_worker "
-        "--model-path /model/ "
-        "--served-model-name deepseek-ai/DeepSeek-R1 "
-        "--skip-tokenizer-init "
-        "--disaggregation-mode decode "
-        "--disaggregation-transfer-backend nixl "
-        "--disaggregation-bootstrap-port 30001 "
-        f"--dist-init-addr {decode_host_ip}:{DIST_INIT_PORT} "
-        f"--nnodes {total_nodes} "
-        f"--node-rank {rank} "
-        f"--tp-size {total_gpus} "
-        f"--dp-size {total_gpus} "
-        "--enable-dp-attention "
-        "--decode-log-interval 1 "
-        "--enable-deepep-moe "
-        "--page-size 1 "
-        "--trust-remote-code "
-        "--moe-dense-tp-size 1 "
-        "--enable-dp-lm-head "
-        "--disable-radix-cache "
-        "--watchdog-timeout 1000000 "
-        "--enable-two-batch-overlap "
-        "--deepep-mode low_latency "
-        "--mem-fraction-static 0.835 "
-        "--ep-num-redundant-experts 32 "
-        "--cuda-graph-bs 256 "
-    )
-    return run_command(dynamo_cmd)
+    # Setup environment variables for GPU script
+    setup_env_vars_for_gpu_script(decode_host_ip, rank, total_gpus, total_nodes)
+
+    # Use appropriate GPU script instead of generating command directly
+    cmd_to_run = get_gpu_command("decode", use_sglang_commands, gpu_type)
+    return run_command(cmd_to_run)


 def setup_env(prefill_host_ip: str):
@@ -321,6 +359,7 @@ def main(input_args: list[str] | None = None):
     logging.info(f"Prefill host IP: {args.prefill_host_ip}")
     logging.info(f"Decode host IP: {args.decode_host_ip}")
     logging.info(f"Rank: {args.rank}")
+    logging.info(f"Use SGLang commands: {args.use_sglang_commands}")
     setup_env(args.prefill_host_ip)
     if args.worker_type == "prefill":
@@ -329,6 +368,8 @@ def main(input_args: list[str] | None = None):
             args.prefill_host_ip,
             args.total_nodes,
             args.total_nodes * args.gpus_per_node,
+            args.use_sglang_commands,
+            args.gpu_type,
         )
     else:
         setup_decode_node(
@@ -337,6 +378,8 @@ def main(input_args: list[str] | None = None):
             args.prefill_host_ip,
             args.total_nodes,
             args.total_nodes * args.gpus_per_node,
+            args.use_sglang_commands,
+            args.gpu_type,
         )
     logging.info(f"{args.worker_type.capitalize()} node setup complete")
......
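The worker setup change above reduces each node launch to two steps: export five variables, then run `bash <gpu_type>.sh <mode> <cmd>`. A minimal sketch of that composition follows. It is not the commit's code: `build_gpu_invocation` and the `script_dir` parameter are hypothetical, and the function returns the environment as a dictionary instead of mutating `os.environ`.

```python
from pathlib import PurePosixPath


def build_gpu_invocation(
    script_dir: str,
    worker_type: str,          # "prefill" or "decode"
    use_sglang_commands: bool,
    gpu_type: str,             # "h100" or "gb200"
    host_ip: str,
    port: int,
    rank: int,
    total_gpus: int,
    total_nodes: int,
) -> tuple[list[str], dict[str, str]]:
    """Return (argv, env) for launching the per-GPU shell script."""
    cmd = "sglang" if use_sglang_commands else "dynamo"
    script_path = PurePosixPath(script_dir) / f"{gpu_type}.sh"
    argv = ["bash", str(script_path), worker_type, cmd]
    # The five variables the shell scripts require before they build a command.
    env = {
        "HOST_IP": host_ip,
        "PORT": str(port),
        "TOTAL_GPUS": str(total_gpus),
        "RANK": str(rank),
        "TOTAL_NODES": str(total_nodes),
    }
    return argv, env
```

A caller could pass the result to `subprocess.run(argv, env={**os.environ, **env}, check=True)`, which scopes the variables to the child process. Returning an argument list also avoids quoting problems if a script path ever contains spaces.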
@@ -86,7 +86,7 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
     parser.add_argument("--config-dir", required=True, help="Config directory path")
     parser.add_argument("--container-image", required=True, help="Container image")
     parser.add_argument(
-        "--time-limit", default="01:00:00", help="Time limit (HH:MM:SS)"
+        "--time-limit", default="04:00:00", help="Time limit (HH:MM:SS)"
     )
     parser.add_argument(
         "--prefill-nodes", type=int, default=2, help="Number of prefill nodes"
@@ -100,6 +100,20 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
     parser.add_argument(
         "--network-interface", default="eth3", help="Network interface to use"
     )
+    parser.add_argument(
+        "--gpu-type", choices=["h100", "gb200"], default="h100", help="GPU type to use"
+    )
+    parser.add_argument(
+        "--use-sglang-commands",
+        action="store_true",
+        default=False,
+        help="Use SGLang commands instead of Dynamo",
+    )
+    parser.add_argument(
+        "--partition",
+        default="batch",
+        help="SLURM partition to use",
+    )
     return parser.parse_args(args)
@@ -120,6 +134,9 @@ def main(input_args: list[str] | None = None):
         "container_image": args.container_image,
         "gpus_per_node": args.gpus_per_node,
         "network_interface": args.network_interface,
+        "gpu_type": args.gpu_type,
+        "use_sglang_commands": args.use_sglang_commands,
+        "partition": args.partition,
     }
     with tempfile.NamedTemporaryFile(mode="w", suffix=".sh") as temp_file:
......
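The two scripts in this commit spell the GPU flag differently: the worker setup script registers `--gpu_type`, while the job submission script registers `--gpu-type`. Because `argparse` derives the attribute name by replacing hyphens with underscores, both end up as `args.gpu_type`, but the command-line spellings are not interchangeable. The sketch below illustrates this with two stand-in parsers; `worker_parser` and `submit_parser` are hypothetical names that reproduce only the one argument.

```python
import argparse


def worker_parser() -> argparse.ArgumentParser:
    """Stand-in for the worker setup parser (underscore spelling)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--gpu_type", type=str, choices=["h100", "gb200"], default="h100"
    )
    return parser


def submit_parser() -> argparse.ArgumentParser:
    """Stand-in for the job submission parser (hyphen spelling)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu-type", choices=["h100", "gb200"], default="h100")
    return parser


# Each parser accepts only its own spelling, yet both expose `.gpu_type`.
worker_args = worker_parser().parse_args(["--gpu_type", "gb200"])
submit_args = submit_parser().parse_args(["--gpu-type", "gb200"])
```

Any template that forwards the value from one script to the other therefore has to use the receiving script's spelling.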
...@@ -13,160 +13,132 @@ ...@@ -13,160 +13,132 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# This should be pinned to the sglang version that is installed with Dynamo ARG SGLANG_IMAGE_TAG="v0.4.10-cu126"
# in the pyproject.toml
FROM lmsysorg/sglang:v0.4.8.post1-cu126
# Add NIXL build dependencies FROM lmsysorg/sglang:${SGLANG_IMAGE_TAG}
RUN apt-get update -y && \
apt-get install -y \
cmake \
meson \
ninja-build \
pybind11-dev \
patchelf \
net-tools
# Install Python build dependencies
RUN pip install --break-system-packages meson-python wheel build
# Add architecture args for NIXL build ARG MODE="hopper"
ARG ARCH=amd64 ARG ARCH="amd64"
ARG ARCH_ALT=x86_64 ARG ARCH_ALT="x86_64"
ARG NIXL_UCX_REF="v1.19.x"
WORKDIR /sgl-workspace ARG NIXL_TAG="0.4.1"
ARG CMAKE_VERSION="3.31.8"
ARG RUST_VERSION="1.87.0"
ARG CARGO_BUILD_JOBS="16"
# Install UCX dependencies
RUN apt-get update -y && \ RUN apt-get update -y && \
apt-get install -y --no-install-recommends \ apt-get install -y \
--reinstall libibverbs-dev rdma-core ibverbs-utils libibumad-dev \ cmake meson ninja-build pybind11-dev patchelf net-tools \
libnuma-dev librdmacm-dev ibverbs-providers \ build-essential protobuf-compiler libssl-dev pkg-config \
autoconf libtool clang libclang-dev git rapidjson-dev zlib1g-dev && \
pip install --break-system-packages meson-python wheel build
# Build UCX from source
ARG NIXL_UCX_REF=v1.19.x # Build UCX + NIXL for x86/hopper until it's fully tested on GB200
RUN rm -rf /opt/hpcx/ucx && \ RUN if [ "$MODE" = "hopper" ]; then \
rm -rf /usr/local/ucx && \ apt-get install -y --no-install-recommends \
cd /usr/local/src && \ libibverbs-dev rdma-core ibverbs-utils libibumad-dev \
git clone https://github.com/openucx/ucx.git && \ libnuma-dev librdmacm-dev ibverbs-providers autoconf libtool && \
cd ucx && \ # UCX from source
git checkout $NIXL_UCX_REF && \ rm -rf /opt/hpcx/ucx /usr/local/ucx && \
./autogen.sh && ./configure \ cd /usr/local/src && \
--prefix=/usr/local/ucx \ git clone https://github.com/openucx/ucx.git && \
--enable-shared \ cd ucx && git checkout $NIXL_UCX_REF && \
--disable-static \ ./autogen.sh && \
--disable-doxygen-doc \ ./configure \
--enable-optimizations \ --prefix=/usr/local/ucx \
--enable-cma \ --enable-shared \
--enable-devel-headers \ --disable-static \
--with-cuda=/usr/local/cuda \ --disable-doxygen-doc \
--with-verbs \ --enable-optimizations \
--with-efa \ --enable-cma \
--with-dm \ --enable-devel-headers \
--with-gdrcopy=/usr/local \ --with-cuda=/usr/local/cuda \
--enable-mt && \ --with-verbs \
make -j && \ --with-efa \
make -j install-strip && \ --with-dm \
ldconfig --with-gdrcopy=/usr/local \
--enable-mt && \
make -j && make install-strip && ldconfig && \
# NIXL
git clone https://github.com/ai-dynamo/nixl.git /opt/nixl && \
cd /opt/nixl && git checkout $NIXL_TAG && \
pip install --break-system-packages . \
--config-settings="setup-args=-Ducx_path=/usr/local/ucx"; \
fi
ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/ucx/lib:$LD_LIBRARY_PATH ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/ucx/lib:$LD_LIBRARY_PATH
ARG NIXL_TAG=0.4.1 # Dynamo
RUN git clone https://github.com/ai-dynamo/nixl.git && cd nixl && git checkout ${NIXL_TAG} && pip install --break-system-packages . --config-settings=setup-args="-Ducx_path=/usr/local/ucx"
WORKDIR /sgl-workspace
# Allow forceful shutdown of inflight requests
ENV SGL_FORCE_SHUTDOWN=1
WORKDIR /sgl-workspace WORKDIR /sgl-workspace
RUN git clone https://github.com/ai-dynamo/dynamo.git RUN git clone https://github.com/ai-dynamo/dynamo.git
# install dynamo in editable mode
WORKDIR /sgl-workspace/dynamo
# Rust build/dev dependencies
RUN apt update -y && \
apt install --no-install-recommends -y \
build-essential \
protobuf-compiler \
cmake \
libssl-dev \
pkg-config \
clang \
libclang-dev \
git
# Define Rust target based on ARCH_ALT ARG
ARG RUSTARCH=${ARCH_ALT}-unknown-linux-gnu
ENV RUSTUP_HOME=/usr/local/rustup \ ENV RUSTUP_HOME=/usr/local/rustup \
CARGO_HOME=/usr/local/cargo \ CARGO_HOME=/usr/local/cargo \
PATH=/usr/local/cargo/bin:$PATH \ PATH=/usr/local/cargo/bin:$PATH
RUST_VERSION=1.86.0
# Install Rust using RUSTARCH derived from ARCH_ALT RUN wget --tries=3 --waitretry=5 \
RUN wget --tries=3 --waitretry=5 "https://static.rust-lang.org/rustup/archive/1.28.1/${RUSTARCH}/rustup-init" && \ "https://static.rust-lang.org/rustup/archive/1.28.1/${ARCH_ALT}-unknown-linux-gnu/rustup-init" && \
# TODO: Add SHA check back based on RUSTARCH
chmod +x rustup-init && \ chmod +x rustup-init && \
./rustup-init -y --no-modify-path --profile minimal --default-toolchain $RUST_VERSION --default-host ${RUSTARCH} && \ ./rustup-init -y \
--no-modify-path \
--profile minimal \
--default-toolchain $RUST_VERSION \
--default-host ${ARCH_ALT}-unknown-linux-gnu && \
rm rustup-init && \ rm rustup-init && \
chmod -R a+w $RUSTUP_HOME $CARGO_HOME chmod -R a+w $RUSTUP_HOME $CARGO_HOME
ARG CARGO_BUILD_JOBS ARG CARGO_BUILD_JOBS
# Set CARGO_BUILD_JOBS to 16 if not provided ENV CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS}
# This is to prevent cargo from building $(nproc) jobs in parallel,
# which might exceed the number of opened files limit. RUN cd dynamo && cargo build --release
ENV CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS:-16}
RUN cargo build --release RUN cd dynamo/lib/bindings/python && \
pip install --break-system-packages -e . && \
cd /sgl-workspace/dynamo && \
pip install --break-system-packages .
RUN cd lib/bindings/python && pip install --break-system-packages -e . && cd ../../.. RUN pip install --break-system-packages sglang-router==0.1.5
RUN pip install --break-system-packages .
RUN wget --tries=3 --waitretry=5 https://github.com/nats-io/nats-server/releases/download/v2.10.28/nats-server-v2.10.28-${ARCH}.deb && \ RUN wget --tries=3 --waitretry=5 \
https://github.com/nats-io/nats-server/releases/download/v2.10.28/\
nats-server-v2.10.28-${ARCH}.deb && \
dpkg -i nats-server-v2.10.28-${ARCH}.deb && rm nats-server-v2.10.28-${ARCH}.deb dpkg -i nats-server-v2.10.28-${ARCH}.deb && rm nats-server-v2.10.28-${ARCH}.deb
ENV ETCD_VERSION="v3.5.21" ENV ETCD_VERSION="v3.5.21"
RUN wget --tries=3 --waitretry=5 https://github.com/etcd-io/etcd/releases/download/$ETCD_VERSION/etcd-$ETCD_VERSION-linux-${ARCH}.tar.gz -O /tmp/etcd.tar.gz && \ RUN wget --tries=3 --waitretry=5 \
https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/\
etcd-${ETCD_VERSION}-linux-${ARCH}.tar.gz -O /tmp/etcd.tar.gz && \
mkdir -p /usr/local/bin/etcd && \ mkdir -p /usr/local/bin/etcd && \
tar -xvf /tmp/etcd.tar.gz -C /usr/local/bin/etcd --strip-components=1 && \ tar -xzf /tmp/etcd.tar.gz \
-C /usr/local/bin/etcd --strip-components=1 && \
rm /tmp/etcd.tar.gz rm /tmp/etcd.tar.gz
ENV PATH=/usr/local/bin/etcd/:$PATH
ARG CMAKE_VERSION=3.31.8 ENV PATH=/usr/local/bin/etcd:$PATH
RUN mkdir /sgl-workspace/cmake_build
WORKDIR /sgl-workspace/cmake_build
# uninstall CMake # GenAI Perf
RUN apt-get purge -y cmake RUN apt-get purge -y cmake
# download newer version of CMake
RUN wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \
tar -xvzf cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \
mv cmake-${CMAKE_VERSION}-linux-$(uname -m) custom_cmake
ENV PATH=/sgl-workspace/cmake_build/custom_cmake/bin:$PATH
# should be 3.31.8 RUN mkdir /sgl-workspace/cmake_build && \
cd /sgl-workspace/cmake_build && \
wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/\
cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \
tar -xzf cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \
mv cmake-${CMAKE_VERSION}-linux-$(uname -m) custom_cmake && \
rm cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz
ENV PATH=/sgl-workspace/cmake_build/custom_cmake/bin:$PATH
RUN cmake --version RUN cmake --version
# Install perf_analyzer and genai-perf RUN git clone --depth=1 \
RUN apt-get update -y && \ https://github.com/triton-inference-server/perf_analyzer.git && \
apt-get install -y --no-install-recommends \
rapidjson-dev \
# jq and curl for polling various endpoints and health checks
jq \
curl \
zlib1g-dev
RUN git clone --depth=1 https://github.com/triton-inference-server/perf_analyzer.git && \
mkdir perf_analyzer/build && \ mkdir perf_analyzer/build && \
cmake -B perf_analyzer/build -S perf_analyzer && \ cmake -B perf_analyzer/build -S perf_analyzer && \
cmake --build perf_analyzer/build -- -j8 cmake --build perf_analyzer/build -- -j$(nproc)
ENV PATH=/sgl-workspace/perf_analyzer/build/perf_analyzer/src/perf-analyzer-build:$PATH ENV PATH=/sgl-workspace/perf_analyzer/build/perf_analyzer/src/perf-analyzer-build:$PATH
RUN pip install --break-system-packages genai-perf RUN pip install --break-system-packages genai-perf
# https://pypi.org/project/sglang-router/0.1.5 is latest # Enable forceful shutdown of inflight requests
RUN pip install sglang-router==0.1.5 ENV SGL_FORCE_SHUTDOWN=1
WORKDIR /sgl-workspace/dynamo/components/backends/sglang WORKDIR /sgl-workspace/dynamo/components/backends/sglang