Unverified commit 0ab6bc2b, authored by ishandhanani, committed by GitHub

chore: update slurm scripts for better warmup and bump sgl version (#3291)

parent 87190db0
......@@ -14,6 +14,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
![Dynamo banner](./docs/images/frontpage-banner.png)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
......@@ -29,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative
## Latest News
* [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
## The Era of Multi-GPU, Multi-Node
......@@ -54,7 +55,7 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
## Framework Support Matrix
| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| ------------------------------------------------------------------------------------------------- | ---- | ------ | ------------ |
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
......@@ -63,6 +64,7 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | ✅ | 🚧 | ✅ |
To learn more about each framework and their capabilities, check out each framework's README!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**
......@@ -77,6 +79,7 @@ Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/support_matrix.md](
## 1. Initial setup
The Dynamo team recommends the `uv` Python package manager, although any installation method works. Install uv:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
......@@ -89,6 +92,7 @@ To coordinate across a data center, Dynamo relies on etcd and NATS. To run Dynam
- [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`.
To quickly set up etcd and NATS, you can also run:
```
# At the root of the repository:
# Edit deploy/docker-compose.yml to comment out "runtime: nvidia" of the dcgm-exporter service if the nvidia container runtime isn't deployed or to be used.
......@@ -125,7 +129,7 @@ python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key
# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --skip-tokenizer-init
python -m dynamo.sglang --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```
#### Send a Request
......@@ -156,8 +160,8 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
* **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
* **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
- **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
# Engines
......@@ -170,6 +174,7 @@ uv pip install ai-dynamo[vllm]
```
Run the backend/worker like this:
```
python -m dynamo.vllm --help
```
......@@ -188,8 +193,9 @@ uv pip install ai-dynamo[sglang]
```
Run the backend/worker like this:
```
python -m dynamo.sglang.worker --help
python -m dynamo.sglang --help
```
You can pass any SGLang flags directly to this worker; see the [server arguments reference](https://docs.sglang.ai/advanced_features/server_arguments.html), which also describes how to use multiple GPUs.
......@@ -207,6 +213,7 @@ It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/
> Launch container with the following additional settings `--shm-size=1g --ulimit memlock=-1`
### Install prerequisites
```
# Optional step: Only required for Blackwell and Grace Hopper
uv pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
......@@ -221,11 +228,13 @@ sudo apt-get -y install libopenmpi-dev
> You can learn more about these prerequisites and known issues with TensorRT-LLM pip based installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
### After installing the pre-requisites above, install Dynamo
```
uv pip install ai-dynamo[trtllm]
```
Run the backend/worker like this:
```
python -m dynamo.trtllm --help
```
......@@ -237,16 +246,20 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
## 1. Install libraries
**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```
**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)
```
......@@ -255,8 +268,8 @@ brew install cmake protobuf
## Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
## 2. Install Rust
......@@ -270,11 +283,13 @@ source $HOME/.cargo/env
Follow the instructions in [uv installation](https://docs.astral.sh/uv/#installation) guide to install uv if you don't have `uv` installed. Once uv is installed, create a virtual environment and activate it.
- Install uv
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
- Create a virtual environment
```bash
uv venv dynamo
source dynamo/bin/activate
......
......@@ -29,7 +29,7 @@ docker build \
-f container/Dockerfile.sglang-wideep \
-t dynamo-wideep-gb200 \
--build-arg MODE=blackwell \
--build-arg SGLANG_IMAGE_TAG=v0.5.0rc0-cu129-gb200 \
--build-arg SGLANG_IMAGE_TAG=v0.5.3rc0-cu129-gb200 \
--build-arg ARCH=arm64 \
--build-arg ARCH_ALT=aarch64 \
.
......
# Example: Deploy Multi-node SGLang with Dynamo on SLURM
# Example: Deploy DeepSeek R1 - FP8 with Dynamo and SGLang on SLURM
This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster.
This folder allows you to deploy the SGLang DeepSeek-R1 Disaggregated with WideEP example on a GB200 SLURM cluster.
## Overview
## SLURM Prerequisites
The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode.
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.
## Scripts
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
## Logs Folder Structure
Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
### Log File Structure
```
logs/
├── 3062824/ # Job ID directory
│ ├── log.out # Main job output (node allocation, IP addresses, launch commands)
│ ├── log.err # Main job errors
│ ├── node0197_prefill.out # Prefill node stdout (node0197)
│ ├── node0197_prefill.err # Prefill node stderr (node0197)
│ ├── node0200_prefill.out # Prefill node stdout (node0200)
│ ├── node0200_prefill.err # Prefill node stderr (node0200)
│ ├── node0201_decode.out # Decode node stdout (node0201)
│ ├── node0201_decode.err # Decode node stderr (node0201)
│ ├── node0204_decode.out # Decode node stdout (node0204)
│ ├── node0204_decode.err # Decode node stderr (node0204)
│ ├── node0197_prefill_gpu_utilization.log # GPU utilization monitoring (node0197)
│ ├── node0200_prefill_gpu_utilization.log # GPU utilization monitoring (node0200)
│ ├── node0201_decode_gpu_utilization.log # GPU utilization monitoring (node0201)
│ └── node0204_decode_gpu_utilization.log # GPU utilization monitoring (node0204)
├── 3063137/ # Another job ID directory
├── 3062689/ # Another job ID directory
└── ...
```
## Setup
For simplicity of the example, we will make some assumptions about your SLURM cluster:
For this example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
......@@ -58,97 +17,96 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
If your cluster supports similar container based plugins, you may be able to
modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../docs/dsr1-wideep-h100.md#instructions).
described [here](../docs/dsr1-wideep-gb200.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps.
## Scripts Overview
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM sbatch scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
## Logs Folder Structure
Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
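The per-job layout can be inspected programmatically. The following is a minimal sketch, not part of the repository's scripts: `group_logs_by_role` is a hypothetical helper, and it assumes per-node files are named like `<node>_<role>.out` or `<node>_<role>.err`, with `prefill` or `decode` appearing as an underscore-separated token. Adjust the matching if your cluster's log names differ.

```python
from pathlib import Path

# Roles that the per-node log file names are assumed to contain.
ROLES = ("prefill", "decode")


def group_logs_by_role(job_dir: Path) -> dict[str, list[str]]:
    """Group stdout/stderr logs in one job directory by worker role."""
    groups: dict[str, list[str]] = {role: [] for role in ROLES}
    for path in sorted(job_dir.iterdir()):
        # Only consider stdout/stderr files; skip anything else in the folder.
        if path.suffix not in (".out", ".err"):
            continue
        # "node0197_prefill" -> ["node0197", "prefill"]
        tokens = path.stem.split("_")
        for role in ROLES:
            if role in tokens:
                groups[role].append(path.name)
                break
    return groups
```

For example, `group_logs_by_role(Path("logs") / "3062824")` would return the prefill and decode log names for job `3062824`, leaving out the main `log.out` and `log.err` files.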
## Usage
> [!NOTE]
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `ip addr show $NETWORK_INTERFACE` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions are always welcome.
1. **Submit a benchmark job**:
```bash
python submit_job_script.py \
python3 submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account
--model-dir <path-to>/deepseek-r1-0528 \
--container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh \
--gpus-per-node 4 \
--config-dir <path-to>/klconfigs \
--gpu-type gb200-fp8 \
--network-interface enP6p9s0np0 \
--prefill-nodes 6 \
--decode-nodes 12 \
--prefill-workers 3 \
--decode-workers 1 \
--account <account> \
--partition <partition> \
--time-limit 4:00:00 \
--enable-multiple-frontends \
--num-additional-frontends 9 \
--profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"
```
**Required arguments**:
- `--template`: Path to Jinja2 template file
- `--model-dir`: Model directory path
- `--config-dir`: Config directory path
- `--container-image`: Container image URI (e.g., `registry/repository:tag`)
- `--account`: SLURM account
**Optional arguments**:
- `--prefill-nodes`: Number of prefill nodes (default: `2`)
- `--decode-nodes`: Number of decode nodes (default: `2`)
- `--gpus-per-node`: Number of GPUs per node (default: `8`)
- `--network-interface`: Network interface to use (default: `eth3`)
- `--job-name`: SLURM job name (default: `dynamo_setup`)
- `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)
- `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`)
- `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`)
This command will deploy 3 prefill workers and 1 decode worker with 9 additional frontends load-balanced by nginx. Diving deeper into the command:
- `--template job_script_template.j2`: Path to Jinja2 template file (this shouldn't change unless you want to modify the template)
- `--model-dir <path-to>/deepseek-r1-0528`: Path to DSR1-FP8 model directory
- `--container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh`: Enroot container image URI
- `--gpus-per-node 4`: Number of GPUs per node (each GB200 tray has 4 GPUs)
- `--config-dir <path-to>/klconfigs`: Various configs (see explanation below)
- `--gpu-type gb200-fp8`: GPU type to use, choices: `gb200-fp8`
- `--network-interface enP6p9s0np0`: Network interface to use (depends on your cluster)
- `--prefill-nodes 6`: Number of prefill nodes
- `--decode-nodes 12`: Number of decode nodes
- `--prefill-workers 3`: Number of prefill workers
- `--decode-workers 1`: Number of decode workers
- `--account <account>`: SLURM account
- `--partition <partition>`: SLURM partition
- `--time-limit 4:00:00`: Time limit in HH:MM:SS format
- `--enable-multiple-frontends`: Enable multiple frontend architecture with nginx load balancer
- `--num-additional-frontends 9`: Number of additional frontends
- `--profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"`: Profiler configurations (see explanation below)
**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
2. **Example with different GPU types**:
```bash
# For H100 with Dynamo (default)
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type h100
# For GB200 with SGLang
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type gb200 \
--use-sglang-commands
--gpus-per-node 4
```
3. **Monitor job progress**:
2. **Check logs in real-time**:
```bash
squeue -u $USER
cd logs/{JOB_ID}
tail -f *_prefill_*.err *_decode_*.err
```
4. **Check logs in real-time**:
```bash
tail -f logs/{JOB_ID}/log.out
```
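The node accounting behind the submit command in step 1 can be sketched as follows. This is an illustration under stated assumptions, not the logic in `submit_job_script.py`: the `Layout` class is hypothetical, the total is assumed to be the sum of prefill and decode nodes, and each role's nodes are assumed to be split evenly across its workers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    prefill_nodes: int
    decode_nodes: int
    prefill_workers: int
    decode_workers: int
    gpus_per_node: int

    def total_nodes(self) -> int:
        # Assumption: one allocation covers prefill and decode nodes only.
        return self.prefill_nodes + self.decode_nodes

    def gpus_per_worker(self, role: str) -> int:
        nodes, workers = {
            "prefill": (self.prefill_nodes, self.prefill_workers),
            "decode": (self.decode_nodes, self.decode_workers),
        }[role]
        # Assumption: nodes are divided evenly among the workers of a role.
        if nodes % workers:
            raise ValueError(
                f"{nodes} {role} nodes do not divide evenly across {workers} workers"
            )
        return (nodes // workers) * self.gpus_per_node


# Values taken from the example command in step 1.
layout = Layout(
    prefill_nodes=6,
    decode_nodes=12,
    prefill_workers=3,
    decode_workers=1,
    gpus_per_node=4,
)
print(layout.total_nodes())             # 18
print(layout.gpus_per_worker("prefill"))  # 8
print(layout.gpus_per_worker("decode"))   # 48
```

Under these assumptions, the example command spans 18 nodes, with each prefill worker using 8 GPUs and the single decode worker using 48.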
## Configs directory
You can view logs of all prefill or decode workers simultaneously by running:
The `--config-dir` argument is used to specify the directory containing the various configs that are used when running this model. Here are the current configs that are in our directory.
```bash
# prefill workers err (or .out)
tail -f logs/{JOB_ID}/*_prefill.err
```bash
klconfigs/
├── decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json
├── deepep_config.json
├── dgcache/
└── prefill_dsr1-0528_in1000out1000_num40000.json
```
# decode workers err (or .out)
tail -f logs/{JOB_ID}/*_decode.err
```
1. `decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json`: `init-expert-location` for decode worker
2. `deepep_config.json`: DeepEP config file for GB200
3. `dgcache/`: DeepGEMM kernel cache directory. Instructions for creating this can be found [here](https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174)
4. `prefill_dsr1-0528_in1000out1000_num40000.json`: `init-expert-location` for prefill worker
5. **Monitor GPU utilization**:
```bash
tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
```
**Note**: The expert locations are collected using the instructions [here](https://github.com/sgl-project/sglang/issues/6017). See the section titled "Create expert distribution data". The expert locations are sensitive to the data used to collect them, so performance results may differ if you don't benchmark with that same data.
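Before submitting a job, it can help to confirm that the directory passed to `--config-dir` contains the entries listed above. A minimal sketch, assuming the file names from the `klconfigs/` listing; `missing_config_entries` is a hypothetical helper and is not part of the repository:

```python
from pathlib import Path

# Entry names copied from the klconfigs/ listing; adjust for your own configs.
EXPECTED_ENTRIES = (
    "decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json",
    "deepep_config.json",
    "dgcache",
    "prefill_dsr1-0528_in1000out1000_num40000.json",
)


def missing_config_entries(config_dir: Path) -> list[str]:
    """Return the expected entries that are absent from config_dir."""
    return [name for name in EXPECTED_ENTRIES if not (config_dir / name).exists()]
```

An empty list means every expected file and the `dgcache/` directory are present.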
## Outputs
## Profiler
Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
If you provide the `--profiler` argument, the sbatch script will automatically warm up the model and run the vllm benchmarking script. Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
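To make the shape of the `--profiler` value concrete, here is a minimal parsing sketch. It is not the parser used by the job scripts: `parse_profiler_spec` is a hypothetical helper, and it assumes the value is a `;`-separated list of `key=value` pairs in which `concurrencies` holds `x`-separated integers.

```python
def parse_profiler_spec(spec: str) -> dict[str, object]:
    """Split a profiler string into a dict, expanding the concurrency list."""
    fields: dict[str, object] = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        fields[key.strip()] = value.strip()
    if "concurrencies" in fields:
        # Assumption: "16x2048" means two separate runs, at 16 and at 2048.
        fields["concurrencies"] = [
            int(item) for item in str(fields["concurrencies"]).split("x")
        ]
    return fields


spec = "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"
print(parse_profiler_spec(spec)["concurrencies"])  # [16, 2048, 4096, 8192]
```

Other fields are kept as strings, so a value such as `req-rate=inf` passes through unchanged.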
......@@ -50,24 +50,30 @@ warmup_model() {
model_path=$4
config=$5
IFS='x' read -r -a config_list <<< "$config"
isl=${config_list[0]}
osl=${config_list[1]}
num_prompts=${config_list[2]}
concurrency=${config_list[3]}
request_rate=${config_list[4]}
model_name="deepseek-ai/DeepSeek-R1"
model_path="deepseek-ai/DeepSeek-R1-0528"
head_node="localhost"
head_port="8000"
chosen_isl=1024
chosen_osl=1024
chosen_req_rate="inf"
chosen_concurrencies=(1 2 4 8 16 32 64 128)
for concurrency in ${chosen_concurrencies[@]}
do
num_prompts=$((concurrency * 5))
command=(
python3 -m sglang.bench_serving
--base-url "http://${service_host}:${service_port}"
--model ${served_model_name} --tokenizer ${model_path}
--base-url "http://${head_node}:${head_port}"
--model ${model_name} --tokenizer ${model_path}
--backend sglang-oai
--dataset-name random --random-input ${isl} --random-output ${osl}
--dataset-name random --random-input ${chosen_isl} --random-output ${chosen_osl}
--random-range-ratio 1
--num-prompts ${num_prompts} --request-rate ${request_rate} --max-concurrency ${concurrency}
--num-prompts ${num_prompts} --request-rate ${chosen_req_rate} --max-concurrency ${concurrency}
)
echo "Config ${config}. Running command ${command[@]}"
${command[@]}
echo "Running with concurrency: ${concurrency}, num_prompts: ${num_prompts}"
"${command[@]}"
done
}
\ No newline at end of file
......@@ -32,8 +32,8 @@ echo "Mode: $mode"
echo "Command: dynamo"
# Check if required environment variables are set
if [ -z "$HOST_IP" ]; then
echo "Error: HOST_IP environment variable is not set"
if [ -z "$HOST_IP_MACHINE" ]; then
echo "Error: HOST_IP_MACHINE environment variable is not set"
exit 1
fi
......@@ -67,6 +67,9 @@ if [ "$mode" = "prefill" ]; then
# GB200 dynamo prefill command
set -x
# SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
# timeouts and kernel cache
export TORCH_DISTRIBUTED_DEFAULT_TIMEOUT=1800
export SGL_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
if [[ "${USE_INIT_LOCATIONS,,}" == "true" ]]; then command_suffix="--init-expert-location /configs/prefill_dsr1-0528_in1000out1000_num40000.json"; fi
......@@ -80,15 +83,15 @@ if [ "$mode" = "prefill" ]; then
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang.worker \
python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode prefill \
--dist-init-addr "$HOST_IP:$PORT" \
--dist-init-addr "$HOST_IP_MACHINE:$PORT" \
--disaggregation-bootstrap-port 30001 \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
......@@ -100,7 +103,8 @@ if [ "$mode" = "prefill" ]; then
--max-running-requests 12288 \
--context-length 9600 \
--disable-radix-cache \
--enable-deepep-moe \
--moe-a2a-backend deepep \
--load-balance-method round_robin \
--deepep-mode normal \
--ep-dispatch-algorithm dynamic \
--moe-dense-tp-size 1 \
......@@ -122,6 +126,10 @@ elif [ "$mode" = "decode" ]; then
command_suffix=""
if [[ "${USE_INIT_LOCATIONS,,}" == "true" ]]; then command_suffix="--init-expert-location /configs/decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json"; fi
# timeouts and kernel cache
export TORCH_DISTRIBUTED_DEFAULT_TIMEOUT=1800
export SGL_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
# GB200 dynamo decode command
DYN_SKIP_SGLANG_LOG_FORMATTING=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=512 \
......@@ -135,15 +143,15 @@ elif [ "$mode" = "decode" ]; then
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang.decode_worker \
python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode decode \
--dist-init-addr "$HOST_IP:$PORT" \
--dist-init-addr "$HOST_IP_MACHINE:$PORT" \
--disaggregation-bootstrap-port 30001 \
--nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \
......@@ -155,7 +163,8 @@ elif [ "$mode" = "decode" ]; then
--max-running-requests 36864 \
--context-length 9600 \
--disable-radix-cache \
--enable-deepep-moe \
--moe-a2a-backend deepep \
--prefill-round-robin-balance \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
......
......@@ -175,8 +175,8 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
parser.add_argument(
"--gpu_type",
type=str,
choices=["h100", "gb200-fp8"],
default="h100",
choices=["gb200-fp8"],
default="gb200-fp8",
help="Type of GPU to use",
)
......@@ -237,8 +237,8 @@ def setup_env_vars_for_gpu_script(
port: int = DIST_INIT_PORT,
use_init_locations: bool = True,
):
"""Setup environment variables required by GPU scripts (h100.sh, gb200-fp8.sh, gb200-fp4.sh)"""
os.environ["HOST_IP"] = host_ip
"""Setup environment variables required by GPU scripts (gb200-fp8.sh)"""
os.environ["HOST_IP_MACHINE"] = host_ip
os.environ["PORT"] = str(port)
os.environ["TOTAL_GPUS"] = str(total_gpus)
os.environ["RANK"] = str(local_rank)
......
......@@ -142,8 +142,8 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
)
parser.add_argument(
"--gpu-type",
choices=["h100", "gb200-fp8"],
default="h100",
choices=["gb200-fp8"],
default="gb200-fp8",
help="GPU type to use",
)
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# This module is deprecated. Use `python3 -m dynamo.sglang` instead.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import logging
from dynamo.runtime.logging import configure_dynamo_logging
from dynamo.sglang.main import main
if __name__ == "__main__":
    configure_dynamo_logging()
    logging.warning(
        "DEPRECATION WARNING: `python3 -m dynamo.sglang.decode_worker` is deprecated and will be removed in dynamo v0.5.0. "
        "Use `python3 -m dynamo.sglang` instead.",
    )
    main()
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# This module is deprecated. Use `python3 -m dynamo.sglang` instead.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import logging
from dynamo.runtime.logging import configure_dynamo_logging
from dynamo.sglang.main import main
if __name__ == "__main__":
    configure_dynamo_logging()
    logging.warning(
        "DEPRECATION WARNING: `python3 -m dynamo.sglang.worker` is deprecated and will be removed in dynamo v0.5.0. "
        "Use `python3 -m dynamo.sglang` instead.",
    )
    main()
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Note: This Dockerfile will be deprecated in favor of Dockerfile.sglang-wideep soon. Please build the container with that Dockerfile instead.
ARG BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base"
# TODO OPS-612: NCCL will hang with 25.03, so use 25.01 for now
# Please check https://github.com/ai-dynamo/dynamo/pull/1065
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
ARG SGLANG_IMAGE_TAG="v0.5.0rc2-cu126"
ARG SGLANG_IMAGE_TAG="v0.5.3rc0-cu126"
FROM lmsysorg/sglang:${SGLANG_IMAGE_TAG}
......
......@@ -68,9 +68,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
role: RoleMain,
multinodeDeployer: &MockSimpleDeployer{},
initialCommand: []string{"python3"},
initialArgs: []string{"-m", "dynamo.sglang.worker"},
initialArgs: []string{"-m", "dynamo.sglang"},
expectedCommand: []string{"python3"},
expectedArgs: []string{"-m", "dynamo.sglang.worker"},
expectedArgs: []string{"-m", "dynamo.sglang"},
description: "Single node should not modify python commands",
},
{
......@@ -79,9 +79,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
role: RoleWorker,
multinodeDeployer: &MockSimpleDeployer{},
initialCommand: []string{"python3"},
initialArgs: []string{"-m", "dynamo.sglang.worker", "--model", "llama"},
initialArgs: []string{"-m", "dynamo.sglang", "--model", "llama"},
expectedCommand: []string{"python3"},
expectedArgs: []string{"-m", "dynamo.sglang.worker", "--model", "llama", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
expectedArgs: []string{"-m", "dynamo.sglang", "--model", "llama", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
description: "Direct python command with simple deployer should append flags",
},
{
......@@ -90,9 +90,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
role: RoleWorker,
multinodeDeployer: &MockShellDeployer{},
initialCommand: []string{"python3"},
initialArgs: []string{"-m", "dynamo.sglang.worker", "--model", "llama"},
initialArgs: []string{"-m", "dynamo.sglang", "--model", "llama"},
expectedCommand: []string{"sh", "-c"},
expectedArgs: []string{"exec python3 -m dynamo.sglang.worker --model llama --dist-init-addr $(LEADER_HOST):29500 --nnodes 2 --node-rank $(WORKER_INDEX)"},
expectedArgs: []string{"exec python3 -m dynamo.sglang --model llama --dist-init-addr $(LEADER_HOST):29500 --nnodes 2 --node-rank $(WORKER_INDEX)"},
description: "Direct python command with shell deployer should wrap with sh -c exec",
},
{
......@@ -101,9 +101,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
role: RoleLeader,
multinodeDeployer: &MockShellDeployer{},
initialCommand: []string{"python"},
initialArgs: []string{"-m", "dynamo.sglang.worker"},
initialArgs: []string{"-m", "dynamo.sglang"},
expectedCommand: []string{"python"},
expectedArgs: []string{"-m", "dynamo.sglang.worker", "--dist-init-addr", "$(LEADER_HOST):29500", "--nnodes", "3", "--node-rank", "0"},
expectedArgs: []string{"-m", "dynamo.sglang", "--dist-init-addr", "$(LEADER_HOST):29500", "--nnodes", "3", "--node-rank", "0"},
description: "Leader role should never use shell wrapping",
},
{
......@@ -112,9 +112,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
role: RoleWorker,
multinodeDeployer: &MockSimpleDeployer{},
initialCommand: []string{"python3.11"},
initialArgs: []string{"-m", "dynamo.sglang.worker"},
initialArgs: []string{"-m", "dynamo.sglang"},
expectedCommand: []string{"python3.11"},
expectedArgs: []string{"-m", "dynamo.sglang.worker", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
expectedArgs: []string{"-m", "dynamo.sglang", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
description: "Python version variants should be recognized",
},
{
......@@ -202,8 +202,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
role: RoleMain,
multinodeDeployer: &GroveMultinodeDeployer{},
initialCommand: []string{"sh", "-c"},
initialArgs: []string{"python -m dynamo.sglang.worker"},
expectedArgs: []string{"python -m dynamo.sglang.worker"},
initialArgs: []string{"python -m dynamo.sglang"},
expectedArgs: []string{"python -m dynamo.sglang"},
description: "Single node should not modify shell commands",
},
{
......@@ -212,8 +212,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
role: RoleLeader,
multinodeDeployer: &GroveMultinodeDeployer{},
initialCommand: []string{"sh", "-c"},
initialArgs: []string{"python -m dynamo.sglang.worker"},
expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0"},
initialArgs: []string{"python -m dynamo.sglang"},
expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0"},
description: "Shell commands should use regex injection for python commands",
},
{
......@@ -222,8 +222,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
role: RoleLeader,
multinodeDeployer: &GroveMultinodeDeployer{},
initialCommand: []string{"sh", "-c"},
initialArgs: []string{"echo blah | wc -l && python -m dynamo.sglang.worker && ls -al"},
expectedArgs: []string{"echo blah | wc -l && python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 && ls -al"},
initialArgs: []string{"echo blah | wc -l && python -m dynamo.sglang && ls -al"},
expectedArgs: []string{"echo blah | wc -l && python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 && ls -al"},
description: "Complex shell commands should inject flags only into python part",
},
{
......@@ -232,8 +232,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
role: RoleWorker,
multinodeDeployer: &GroveMultinodeDeployer{},
initialCommand: []string{"sh", "-c"},
initialArgs: []string{"python -m dynamo.sglang.worker"},
expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1))"},
initialArgs: []string{"python -m dynamo.sglang"},
expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1))"},
description: "Shell command worker should get grove env vars in node rank",
},
{
......@@ -242,8 +242,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
role: RoleLeader,
multinodeDeployer: &LWSMultinodeDeployer{},
initialCommand: []string{"sh", "-c"},
initialArgs: []string{"python -m dynamo.sglang.worker"},
expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(LWS_LEADER_ADDRESS):29500 --nnodes 2 --node-rank 0"},
initialArgs: []string{"python -m dynamo.sglang"},
expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(LWS_LEADER_ADDRESS):29500 --nnodes 2 --node-rank 0"},
description: "LWS shell commands should use LWS variables",
},
{
......@@ -252,8 +252,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
role: RoleLeader,
multinodeDeployer: &GroveMultinodeDeployer{},
initialCommand: []string{"sh", "-c"},
initialArgs: []string{"python -m dynamo.sglang.worker | tee /tmp/log"},
expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 | tee /tmp/log"},
initialArgs: []string{"python -m dynamo.sglang | tee /tmp/log"},
expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 | tee /tmp/log"},
description: "Shell commands with pipes should inject flags before pipe",
},
{
......@@ -262,8 +262,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
role: RoleLeader,
multinodeDeployer: &GroveMultinodeDeployer{},
initialCommand: []string{"sh", "-c"},
initialArgs: []string{"echo start", "python -m dynamo.sglang.worker", "echo done"},
expectedArgs: []string{"echo start", "python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "echo done"},
initialArgs: []string{"echo start", "python -m dynamo.sglang", "echo done"},
expectedArgs: []string{"echo start", "python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "echo done"},
description: "Shell commands with multiple args should process each individually, modify only the python arg",
},
{
......@@ -282,8 +282,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
role: RoleLeader,
multinodeDeployer: &GroveMultinodeDeployer{},
initialCommand: []string{"sh", "-c"},
initialArgs: []string{"python -m dynamo.sglang.worker", "python -m dynamo.sglang.worker --other-flags"},
expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "python -m dynamo.sglang.worker --other-flags"},
initialArgs: []string{"python -m dynamo.sglang", "python -m dynamo.sglang --other-flags"},
expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "python -m dynamo.sglang --other-flags"},
description: "Should stop processing after first successful python flag injection",
},
}
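The cases in this table imply a simple rule: find the first `python -m dynamo.sglang` invocation in the args, insert the multinode flags directly after the module name (before any pipe, `&&`, or user flags), and leave every later arg untouched. The sketch below illustrates that rule in Python. It is not the operator's Go implementation; the name `inject_flags` and the regular expression are assumptions made for this example.

```python
import re

# Matches "python", "python3", "python3.11", ... followed by "-m dynamo.sglang".
# The pattern is an assumption for this sketch, not taken from the operator code.
_PY_MODULE = re.compile(r"python[0-9.]*\s+-m\s+dynamo\.sglang\S*")


def inject_flags(args, flags):
    """Insert `flags` right after the first matching python invocation.

    Each arg is scanned in order; once one arg has been modified, the
    remaining args are copied through unchanged.
    """
    out = []
    injected = False
    for arg in args:
        if not injected:
            # A callable replacement avoids escape handling for "$(...)" text.
            arg, n = _PY_MODULE.subn(lambda m: f"{m.group(0)} {flags}", arg, count=1)
            injected = n > 0
        out.append(arg)
    return out
```

For example, `inject_flags(["python -m dynamo.sglang | tee /tmp/log"], "--nnodes 2 --node-rank 0")` returns `["python -m dynamo.sglang --nnodes 2 --node-rank 0 | tee /tmp/log"]`, matching the "inject flags before pipe" case.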
......@@ -444,7 +444,7 @@ func TestSGLangBackend_ProbeRemoval(t *testing.T) {
startupProbe := &corev1.Probe{InitialDelaySeconds: 5}
container := &corev1.Container{
Args: []string{"python -m dynamo.sglang.worker"},
Args: []string{"python -m dynamo.sglang"},
LivenessProbe: livenessProbe,
ReadinessProbe: readinessProbe,
StartupProbe: startupProbe,
......
......@@ -1675,7 +1675,7 @@ func TestGenerateGrovePodCliqueSet(t *testing.T) {
"-c",
},
Args: []string{
"python3 -m dynamo.sglang.worker --custom-flag custom-value",
"python3 -m dynamo.sglang --custom-flag custom-value",
},
},
},
......@@ -1828,7 +1828,7 @@ func TestGenerateGrovePodCliqueSet(t *testing.T) {
"-c",
},
Args: []string{
"python3 -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank 0 --custom-flag custom-value",
"python3 -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank 0 --custom-flag custom-value",
},
Ports: []corev1.ContainerPort{
{
......@@ -1980,7 +1980,7 @@ func TestGenerateGrovePodCliqueSet(t *testing.T) {
"-c",
},
Args: []string{
"python3 -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --custom-flag custom-value",
"python3 -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --custom-flag custom-value",
},
Ports: []corev1.ContainerPort{
{
......@@ -3207,7 +3207,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
ComponentType: commonconsts.ComponentTypeWorker,
ExtraPodSpec: &common.ExtraPodSpec{
MainContainer: &corev1.Container{
Args: []string{"python3 -m dynamo.sglang.worker"},
Args: []string{"python3 -m dynamo.sglang"},
},
},
},
......@@ -3216,7 +3216,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
role: RoleMain,
numberOfNodes: 1,
expectError: false,
expectContains: []string{"python3", "-m", "dynamo.sglang.worker"},
expectContains: []string{"python3", "-m", "dynamo.sglang"},
expectNotContains: []string{"dist-init-addr", "nnodes", "tp-size"},
},
{
......@@ -3226,7 +3226,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
ComponentType: commonconsts.ComponentTypeWorker,
ExtraPodSpec: &common.ExtraPodSpec{
MainContainer: &corev1.Container{
Args: []string{"python3 -m dynamo.sglang.worker"},
Args: []string{"python3 -m dynamo.sglang"},
},
},
},
......@@ -3235,7 +3235,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
role: RoleLeader,
numberOfNodes: 3,
expectError: false,
expectContains: []string{"python3", "-m", "dynamo.sglang.worker", "dist-init-addr", "nnodes", "node-rank"},
expectContains: []string{"python3", "-m", "dynamo.sglang", "dist-init-addr", "nnodes", "node-rank"},
},
{
name: "SGLang multinode worker",
......@@ -3244,7 +3244,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
ComponentType: commonconsts.ComponentTypeWorker,
ExtraPodSpec: &common.ExtraPodSpec{
MainContainer: &corev1.Container{
Args: []string{"python3 -m dynamo.sglang.worker"},
Args: []string{"python3 -m dynamo.sglang"},
},
},
},
......@@ -3253,7 +3253,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
role: RoleWorker,
numberOfNodes: 3,
expectError: false,
expectContains: []string{"python3", "-m", "dynamo.sglang.worker", "dist-init-addr", "nnodes", "node-rank"},
expectContains: []string{"python3", "-m", "dynamo.sglang", "dist-init-addr", "nnodes", "node-rank"},
},
{
name: "SGLang with user command override",
......@@ -3685,7 +3685,7 @@ func TestDetectBackendFrameworkFromArgs(t *testing.T) {
{
name: "detect SGLang from args",
command: []string{"/bin/sh", "-c"},
args: []string{"python -m dynamo.sglang.worker --model test"},
args: []string{"python -m dynamo.sglang --model test"},
expected: BackendFrameworkSGLang,
},
{
......@@ -3703,7 +3703,7 @@ func TestDetectBackendFrameworkFromArgs(t *testing.T) {
{
name: "detect from python3.11",
command: []string{},
args: []string{"python3.11 -m dynamo.sglang.decode_worker"},
args: []string{"python3.11 -m dynamo.sglang"},
expected: BackendFrameworkSGLang,
},
{
......@@ -3715,7 +3715,7 @@ func TestDetectBackendFrameworkFromArgs(t *testing.T) {
{
name: "multiple backends detected",
command: []string{},
args: []string{"python -m dynamo.vllm.worker && python -m dynamo.sglang.worker"},
args: []string{"python -m dynamo.vllm.worker && python -m dynamo.sglang"},
expectError: true,
},
}
......@@ -3777,7 +3777,7 @@ func TestDetermineBackendFramework(t *testing.T) {
{
name: "worker with detected matching explicit",
componentType: "worker",
args: []string{"python -m dynamo.sglang.worker"},
args: []string{"python -m dynamo.sglang"},
explicitBackendFramework: "sglang",
expected: BackendFrameworkSGLang,
},
......@@ -3881,7 +3881,7 @@ func TestGetBackendFrameworkFromComponent(t *testing.T) {
ComponentType: "worker", // Worker component
ExtraPodSpec: &common.ExtraPodSpec{
MainContainer: &corev1.Container{
Args: []string{"python -m dynamo.sglang.worker"},
Args: []string{"python -m dynamo.sglang"},
},
},
},
......
......@@ -30,7 +30,6 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| **NVIDIA Ada Lovelace Architecture** | Supported |
| **NVIDIA Ampere Architecture** | Supported |
## Platform Architecture Compatibility
**Dynamo** is compatible with the following platforms:
......@@ -51,15 +50,14 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
> [!Caution]
> KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
## Software Compatibility
### Runtime Dependency
| **Python Package** | **Version** | glibc version | CUDA Version |
| :----------------- | :------------ | :----------------------------------- | :----------- |
| :----------------- | :---------- | :------------------------------------ | :----------- |
| ai-dynamo | 0.5.1 | >=2.28 | |
| ai-dynamo-runtime | 0.5.1 | >=2.28 (Python 3.12 has known issues)| |
| ai-dynamo-runtime | 0.5.1 | >=2.28 (Python 3.12 has known issues) | |
| NIXL | 0.4.1 | >=2.27 | >=11.8 |
### Build Dependency
......@@ -69,7 +67,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| **TensorRT-LLM** | 1.1.0rc5 |
| **NIXL** | 0.4.1 |
| **vLLM** | 0.10.1.1 |
| **SGLang** | 0.5.0rc2 |
| **SGLang** | 0.5.3rc0 |
> [!Important]
> Specific versions of TensorRT-LLM supported by Dynamo are subject to change. TensorRT-LLM does not currently support Python 3.11, so installing `ai-dynamo[trtllm]` on Python 3.11 will fail.
......@@ -79,14 +77,12 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
### AWS
| **Host Operating System** | **Version** | **Architecture** | **Status** |
| :------------------------ | :---------- | :--------------- | :----------- |
| :------------------------ | :---------- | :--------------- | :--------- |
| **Amazon Linux** | 2023 | x86_64 | Supported¹ |
> [!Caution]
> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
## Build Support
**Dynamo** currently provides build support in the following ways:
......
......@@ -3,12 +3,14 @@
This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
For more information about the core concepts, see:
- [Dynamo Disaggregated Serving](../../../docs/architecture/disagg_serving.md)
- [KV Cache Routing Architecture](../../../docs/architecture/kv_cache_routing.md)
## Architecture Overview
The multi-node setup consists of:
- **1 Frontend**: Receives HTTP requests and uses KV routing to distribute them
- **2 Model Replicas**: Each with dedicated prefill and decode workers
- **Smart KV-Aware Routing**: Intelligently routes requests based on KV cache locality across **all workers**
......@@ -57,6 +59,7 @@ KV-aware routing optimizes LLM inference by directing requests to workers that a
- **Balances load**: Considers both cache efficiency and worker utilization when making routing decisions
This is particularly beneficial for:
- **Shared system prompts**: Cached across workers and reused efficiently
- **Multi-turn conversations**: Full conversation history benefits from caching
- **Similar queries**: Common prefixes are computed once and reused
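To make the trade-off between cache reuse and load concrete, here is a toy scoring sketch. It is not Dynamo's router: the function names, the `load_weight` parameter, and the scoring formula are all hypothetical. The only thing taken from this guide is the block size of 16, which mirrors `--page-size 16`.

```python
def shared_prefix_blocks(tokens, cached_tokens, block_size=16):
    """Count whole KV blocks shared by a request and a cached sequence."""
    matched = 0
    for a, b in zip(tokens, cached_tokens):
        if a != b:
            break
        matched += 1
    # Only complete blocks can be reused, so round down.
    return matched // block_size


def pick_worker(tokens, workers, load_weight=1.0, block_size=16):
    """Pick the worker with the best (hypothetical) cache-vs-load score.

    `workers` maps a worker name to (cached_tokens, active_requests).
    """
    def score(item):
        _, (cached, load) = item
        return shared_prefix_blocks(tokens, cached, block_size) - load_weight * load

    return max(workers.items(), key=score)[0]
```

With a 64-token request, a worker that already holds the first 48 tokens scores 3 reusable blocks. It wins over an idle worker with an empty cache unless its own load outweighs that benefit.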
......@@ -90,6 +93,7 @@ For more information about the SGLang backend and its integration with Dynamo, s
### 3. Network Requirements
Ensure the following ports are accessible between nodes:
- **2379**: etcd client port
- **4222**: NATS client port
- **8000**: Frontend HTTP port (only needed on frontend node)
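A quick way to confirm these ports are reachable before launching workers is a plain TCP connect test. The sketch below uses only the Python standard library; the helper names `port_open` and `check_ports` are made up for this example, and the host you pass in is whichever node runs etcd, NATS, or the frontend in your setup.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


def check_ports(host, ports=(2379, 4222, 8000)):
    """Map each port to whether it accepted a connection."""
    return {port: port_open(host, port) for port in ports}
```

Run `check_ports("<infrastructure-node-ip>")` from each worker node. A `False` entry usually points to a firewall rule or a service that has not started yet.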
......@@ -98,6 +102,7 @@ Ensure the following ports are accessible between nodes:
### 4. Hardware Setup
This example assumes:
- **Node 1**: At least 2 GPUs (for Replica 1's decode and prefill workers)
- **Node 2**: At least 2 GPUs (for Replica 2's decode and prefill workers)
- **Frontend Node**: Can be on Node 1, Node 2, or a separate node (no GPU required)
......@@ -131,7 +136,7 @@ Open a terminal on Node 1 and launch both workers:
```bash
# Launch prefill worker in background
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
......@@ -141,7 +146,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl &
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
......@@ -153,6 +158,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
```
> [!INFO]
>
> - `CUDA_VISIBLE_DEVICES`: Controls which GPU each worker uses (0 and 1 for different GPUs)
> - `--page-size 16`: Sets the KV cache block size - must be identical across all workers
> - `--disaggregation-mode`: Separates prefill (prompt processing) from decode (token generation)
......@@ -165,7 +171,7 @@ Open a terminal on Node 2 and launch both workers:
```bash
# Launch prefill worker in background
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
......@@ -176,7 +182,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
--disaggregation-transfer-backend nixl &
# Launch decode worker in foreground
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
......@@ -206,6 +212,7 @@ hostname -I | awk '{print $1}'
```
The frontend will:
- Discover all available decode workers via etcd
- Enable KV-aware routing for intelligent request distribution
- Monitor worker health and adjust routing accordingly
......@@ -418,6 +425,7 @@ curl http://${DYN_FRONTEND_IP}:8000/health
### Workers Not Discovering Each Other
1. Verify etcd connectivity from all nodes:
```bash
etcdctl --endpoints=$ETCD_ENDPOINTS endpoint health
```
......@@ -461,9 +469,11 @@ Stop all components in reverse order:
1. Stop Frontend (Ctrl+C in the frontend terminal)
2. Stop workers on each node:
- On Node 1: Press Ctrl+C in the terminal (this stops the decode worker)
- On Node 2: Press Ctrl+C in the terminal (this stops the decode worker)
- To stop the background prefill workers, use one of these methods:
```bash
# Method 1: Kill background jobs in the same terminal
jobs # See background jobs
......@@ -473,8 +483,9 @@ Stop all components in reverse order:
exit
# Method 3: Kill by process name (from any terminal)
pkill -f "dynamo.sglang.worker.*prefill"
pkill -f "dynamo.sglang.*prefill"
```
3. Stop infrastructure services:
```bash
docker compose -f deploy/docker-compose.yml down
......
......@@ -60,7 +60,7 @@ vllm = [
sglang = [
"uvloop",
"nixl<=0.4.1",
"sglang[all]==0.5.0rc2",
"sglang[all]==0.5.3rc0",
]
llama_cpp = [
......