Commit 0ab6bc2b authored by ishandhanani, committed by GitHub

chore: update slurm scripts for better warmup and bump sgl version (#3291)

parent 87190db0
...@@ -14,6 +14,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ...@@ -14,6 +14,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and See the License for the specific language governing permissions and
limitations under the License. limitations under the License.
--> -->
![Dynamo banner](./docs/images/frontpage-banner.png) ![Dynamo banner](./docs/images/frontpage-banner.png)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
...@@ -29,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative ...@@ -29,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative
## Latest News ## Latest News
* [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md) - [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
## The Era of Multi-GPU, Multi-Node ## The Era of Multi-GPU, Multi-Node
...@@ -53,16 +54,17 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa ...@@ -53,16 +54,17 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
## Framework Support Matrix ## Framework Support Matrix
| Feature | vLLM | SGLang | TensorRT-LLM | | Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------| | ------------------------------------------------------------------------------------------------- | ---- | ------ | ------------ |
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ | | [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 | | [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ | | [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | 🚧 | 🚧 | 🚧 | | [**Load Based Planner**](/docs/architecture/load_planner.md) | 🚧 | 🚧 | 🚧 |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | ✅ | | [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | ✅ | 🚧 | ✅ | | [**KVBM**](/docs/architecture/kvbm_architecture.md) | ✅ | 🚧 | ✅ |
To learn more about each framework and their capabilities, check out each framework's README! To learn more about each framework and their capabilities, check out each framework's README!
- **[vLLM](components/backends/vllm/README.md)** - **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)** - **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)** - **[TensorRT-LLM](components/backends/trtllm/README.md)**
...@@ -77,6 +79,7 @@ Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/support_matrix.md]( ...@@ -77,6 +79,7 @@ Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [docs/support_matrix.md](
## 1. Initial setup ## 1. Initial setup
The Dynamo team recommends the `uv` Python package manager, although any way works. Install uv: The Dynamo team recommends the `uv` Python package manager, although any way works. Install uv:
``` ```
curl -LsSf https://astral.sh/uv/install.sh | sh curl -LsSf https://astral.sh/uv/install.sh | sh
``` ```
...@@ -89,6 +92,7 @@ To coordinate across a data center, Dynamo relies on etcd and NATS. To run Dynam ...@@ -89,6 +92,7 @@ To coordinate across a data center, Dynamo relies on etcd and NATS. To run Dynam
- [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`. - [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`.
To quickly setup etcd & NATS, you can also run: To quickly setup etcd & NATS, you can also run:
``` ```
# At the root of the repository: # At the root of the repository:
# Edit deploy/docker-compose.yml to comment out "runtime: nvidia" of the dcgm-exporter service if the nvidia container runtime isn't deployed or to be used. # Edit deploy/docker-compose.yml to comment out "runtime: nvidia" of the dcgm-exporter service if the nvidia container runtime isn't deployed or to be used.
...@@ -125,7 +129,7 @@ python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key ...@@ -125,7 +129,7 @@ python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key
# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these, # Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them. # both for the same model and for multiple models. The frontend node will discover them.
python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --skip-tokenizer-init python -m dynamo.sglang --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
``` ```
#### Send a Request #### Send a Request
...@@ -156,8 +160,8 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res ...@@ -156,8 +160,8 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments: Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
* **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf - **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
* **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements - **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
# Engines # Engines
...@@ -170,6 +174,7 @@ uv pip install ai-dynamo[vllm] ...@@ -170,6 +174,7 @@ uv pip install ai-dynamo[vllm]
``` ```
Run the backend/worker like this: Run the backend/worker like this:
``` ```
python -m dynamo.vllm --help python -m dynamo.vllm --help
``` ```
...@@ -188,8 +193,9 @@ uv pip install ai-dynamo[sglang] ...@@ -188,8 +193,9 @@ uv pip install ai-dynamo[sglang]
``` ```
Run the backend/worker like this: Run the backend/worker like this:
``` ```
python -m dynamo.sglang.worker --help python -m dynamo.sglang --help
``` ```
You can pass any sglang flags directly to this worker, see https://docs.sglang.ai/advanced_features/server_arguments.html . See there to use multiple GPUs. You can pass any sglang flags directly to this worker, see https://docs.sglang.ai/advanced_features/server_arguments.html . See there to use multiple GPUs.
...@@ -207,6 +213,7 @@ It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/ ...@@ -207,6 +213,7 @@ It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/
> Launch container with the following additional settings `--shm-size=1g --ulimit memlock=-1` > Launch container with the following additional settings `--shm-size=1g --ulimit memlock=-1`
### Install prerequisites ### Install prerequisites
``` ```
# Optional step: Only required for Blackwell and Grace Hopper # Optional step: Only required for Blackwell and Grace Hopper
uv pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 uv pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
...@@ -221,11 +228,13 @@ sudo apt-get -y install libopenmpi-dev ...@@ -221,11 +228,13 @@ sudo apt-get -y install libopenmpi-dev
> You can learn more about these prerequisites and known issues with TensorRT-LLM pip based installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html). > You can learn more about these prerequisites and known issues with TensorRT-LLM pip based installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
### After installing the pre-requisites above, install Dynamo ### After installing the pre-requisites above, install Dynamo
``` ```
uv pip install ai-dynamo[trtllm] uv pip install ai-dynamo[trtllm]
``` ```
Run the backend/worker like this: Run the backend/worker like this:
``` ```
python -m dynamo.trtllm --help python -m dynamo.trtllm --help
``` ```
...@@ -237,16 +246,20 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`. ...@@ -237,16 +246,20 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
## 1. Install libraries ## 1. Install libraries
**Ubuntu:** **Ubuntu:**
``` ```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
``` ```
**macOS:** **macOS:**
- [Homebrew](https://brew.sh/) - [Homebrew](https://brew.sh/)
``` ```
# if brew is not installed on your system, install it # if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
``` ```
- [Xcode](https://developer.apple.com/xcode/) - [Xcode](https://developer.apple.com/xcode/)
``` ```
...@@ -255,8 +268,8 @@ brew install cmake protobuf ...@@ -255,8 +268,8 @@ brew install cmake protobuf
## Check that Metal is accessible ## Check that Metal is accessible
xcrun -sdk macosx metal xcrun -sdk macosx metal
``` ```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
## 2. Install Rust ## 2. Install Rust
...@@ -270,11 +283,13 @@ source $HOME/.cargo/env ...@@ -270,11 +283,13 @@ source $HOME/.cargo/env
Follow the instructions in [uv installation](https://docs.astral.sh/uv/#installation) guide to install uv if you don't have `uv` installed. Once uv is installed, create a virtual environment and activate it. Follow the instructions in [uv installation](https://docs.astral.sh/uv/#installation) guide to install uv if you don't have `uv` installed. Once uv is installed, create a virtual environment and activate it.
- Install uv - Install uv
```bash ```bash
curl -LsSf https://astral.sh/uv/install.sh | sh curl -LsSf https://astral.sh/uv/install.sh | sh
``` ```
- Create a virtual environment - Create a virtual environment
```bash ```bash
uv venv dynamo uv venv dynamo
source dynamo/bin/activate source dynamo/bin/activate
......
...@@ -29,7 +29,7 @@ docker build \ ...@@ -29,7 +29,7 @@ docker build \
-f container/Dockerfile.sglang-wideep \ -f container/Dockerfile.sglang-wideep \
-t dynamo-wideep-gb200 \ -t dynamo-wideep-gb200 \
--build-arg MODE=blackwell \ --build-arg MODE=blackwell \
--build-arg SGLANG_IMAGE_TAG=v0.5.0rc0-cu129-gb200 \ --build-arg SGLANG_IMAGE_TAG=v0.5.3rc0-cu129-gb200 \
--build-arg ARCH=arm64 \ --build-arg ARCH=arm64 \
--build-arg ARCH_ALT=aarch64 \ --build-arg ARCH_ALT=aarch64 \
. .
......
# Example: Deploy Multi-node SGLang with Dynamo on SLURM # Example: Deploy DeepSeek R1 - FP8 with Dynamo and SGLang on SLURM
This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster. This folder allows you to deploy the SGLang DeepSeek-R1 Disaggregated with WideEP on a GB200 SLURM cluster.
## Overview ## SLURM Prerequisites
The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode. For this example, we will make some assumptions about your SLURM cluster:
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.
## Scripts
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
## Logs Folder Structure
Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
### Log File Structure
```
logs/
├── 3062824/ # Job ID directory
│ ├── log.out # Main job output (node allocation, IP addresses, launch commands)
│ ├── log.err # Main job errors
│ ├── node0197_prefill.out # Prefill node stdout (node0197)
│ ├── node0197_prefill.err # Prefill node stderr (node0197)
│ ├── node0200_prefill.out # Prefill node stdout (node0200)
│ ├── node0200_prefill.err # Prefill node stderr (node0200)
│ ├── node0201_decode.out # Decode node stdout (node0201)
│ ├── node0201_decode.err # Decode node stderr (node0201)
│ ├── node0204_decode.out # Decode node stdout (node0204)
│ ├── node0204_decode.err # Decode node stderr (node0204)
│ ├── node0197_prefill_gpu_utilization.log # GPU utilization monitoring (node0197)
│ ├── node0200_prefill_gpu_utilization.log # GPU utilization monitoring (node0200)
│ ├── node0201_decode_gpu_utilization.log # GPU utilization monitoring (node0201)
│ └── node0204_decode_gpu_utilization.log # GPU utilization monitoring (node0204)
├── 3063137/ # Another job ID directory
├── 3062689/ # Another job ID directory
└── ...
```
## Setup
For simplicity of the example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes 1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance available. For functional testing, most setups should be fine. For performance
...@@ -58,97 +17,96 @@ For simplicity of the example, we will make some assumptions about your SLURM cl ...@@ -58,97 +17,96 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
If your cluster supports similar container based plugins, you may be able to If your cluster supports similar container based plugins, you may be able to
modify the template to use that instead. modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as 3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../docs/dsr1-wideep-h100.md#instructions). described [here](../docs/dsr1-wideep-gb200.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps. This is the image that can be passed to the `--container-image` argument in later steps.
## Scripts Overview
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM sbatch scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
## Logs Folder Structure
Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
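As a minimal sketch of how the per-job log layout can be navigated from Python, the helper below collects the log files of one role for a given job ID. The function name `worker_logs` and the assumption that worker log file names contain the role (`prefill` or `decode`) are illustrative; adjust the pattern if your cluster names files differently.

```python
from pathlib import Path


def worker_logs(job_id: str, role: str, logs_root: str = "logs", suffix: str = ".err") -> list[Path]:
    """Return the log files of one role ("prefill" or "decode") for a job, sorted by name.

    Hypothetical helper: it assumes logs live in <logs_root>/<job_id>/ and that
    each worker log file name contains the role.
    """
    job_dir = Path(logs_root) / job_id
    if not job_dir.is_dir():
        # Unknown job ID or logs not written yet.
        return []
    # Match any regular file that mentions the role and ends with the suffix.
    return sorted(p for p in job_dir.glob(f"*{role}*{suffix}") if p.is_file())


if __name__ == "__main__":
    for path in worker_logs("3062824", "prefill"):
        print(path)
```

Passing `suffix=".out"` lists the stdout files instead of the stderr files.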
## Usage ## Usage
> [!NOTE] > [!NOTE]
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome. > The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `ip addr show $NETWORK_INTERFACE` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions are always welcome.
1. **Submit a benchmark job**: 1. **Submit a benchmark job**:
```bash ```bash
python submit_job_script.py \ python3 submit_job_script.py \
--template job_script_template.j2 \ --template job_script_template.j2 \
--model-dir /path/to/model \ --model-dir <path-to>/deepseek-r1-0528 \
--config-dir /path/to/configs \ --container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh \
--container-image container-image-uri \ --gpus-per-node 4 \
--account your-slurm-account --config-dir <path-to>/klconfigs \
--gpu-type gb200-fp8 \
--network-interface enP6p9s0np0 \
--prefill-nodes 6 \
--decode-nodes 12 \
--prefill-workers 3 \
--decode-workers 1 \
--account <account> \
--partition <partition> \
--time-limit 4:00:00 \
--enable-multiple-frontends \
--num-additional-frontends 9 \
--profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"
``` ```
**Required arguments**: This command will deploy 3 prefill workers and 1 decode worker with 9 additional frontends load-balanced by nginx. Diving deeper into the command:
- `--template`: Path to Jinja2 template file - `--template job_script_template.j2`: Path to Jinja2 template file (this shouldn't change unless you want to modify the template)
- `--model-dir`: Model directory path - `--model-dir <path-to>/deepseek-r1-0528`: Path to DSR1-FP8 model directory
- `--config-dir`: Config directory path - `--container-image <path-to>/dynamo-sglang+v0.5.3rc1-v0.3.12.sqsh`: Enroot container image URI
- `--container-image`: Container image URI (e.g., `registry/repository:tag`) - `--gpus-per-node 4`: Number of GPUs per node (each GB200 tray has 4 GPUs)
- `--account`: SLURM account - `--config-dir <path-to>/klconfigs`: Various configs (see explanation below)
- `--gpu-type gb200-fp8`: GPU type to use, choices: `gb200-fp8`
**Optional arguments**: - `--network-interface enP6p9s0np0`: Network interface to use (depends on your cluster)
- `--prefill-nodes 6`: Number of prefill nodes
- `--prefill-nodes`: Number of prefill nodes (default: `2`) - `--decode-nodes 12`: Number of decode nodes
- `--decode-nodes`: Number of decode nodes (default: `2`) - `--prefill-workers 3`: Number of prefill workers
- `--gpus-per-node`: Number of GPUs per node (default: `8`) - `--decode-workers 1`: Number of decode workers
- `--network-interface`: Network interface to use (default: `eth3`) - `--account <account>`: SLURM account
- `--job-name`: SLURM job name (default: `dynamo_setup`) - `--partition <partition>`: SLURM partition
- `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`) - `--time-limit 4:00:00`: Time limit in HH:MM:SS format
- `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`) - `--enable-multiple-frontends`: Enable multiple frontend architecture with nginx load balancer
- `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`) - `--num-additional-frontends 9`: Number of additional frontends
- `--profiler "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"`: Profiler configurations (see explanation below)
**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters. **Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
2. **Example with different GPU types**: 2. **Check logs in real-time**:
```bash
# For H100 with Dynamo (default)
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type h100
# For GB200 with SGLang
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type gb200 \
--use-sglang-commands
--gpus-per-node 4
```
3. **Monitor job progress**:
```bash ```bash
squeue -u $USER cd logs/{JOB_ID}
tail -f *_prefill_*.err *_decode_*.err
``` ```
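To make the node accounting of the submit command concrete, here is a small sketch that derives the totals from the example values (6 prefill nodes, 12 decode nodes, 3 prefill workers, 1 decode worker, 4 GPUs per node). The `Topology` class is hypothetical, and it assumes that the total is the sum of prefill and decode nodes and that nodes are split evenly across the workers of a role; the actual logic lives in `submit_job_script.py` and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical description of a disaggregated deployment."""

    prefill_nodes: int
    decode_nodes: int
    prefill_workers: int
    decode_workers: int
    gpus_per_node: int

    @property
    def total_nodes(self) -> int:
        # Assumption: the job needs exactly the prefill plus decode nodes.
        return self.prefill_nodes + self.decode_nodes

    def gpus_per_worker(self, role: str) -> int:
        """GPUs available to one worker of the given role ("prefill" or "decode")."""
        if role == "prefill":
            nodes, workers = self.prefill_nodes, self.prefill_workers
        elif role == "decode":
            nodes, workers = self.decode_nodes, self.decode_workers
        else:
            raise ValueError(f"unknown role: {role!r}")
        if workers <= 0 or nodes % workers != 0:
            # Assumption: nodes are divided evenly among workers.
            raise ValueError(f"{nodes} {role} nodes cannot be split evenly across {workers} workers")
        return (nodes // workers) * self.gpus_per_node


if __name__ == "__main__":
    topo = Topology(prefill_nodes=6, decode_nodes=12, prefill_workers=3, decode_workers=1, gpus_per_node=4)
    print(topo.total_nodes)                 # 18
    print(topo.gpus_per_worker("prefill"))  # 8
    print(topo.gpus_per_worker("decode"))   # 48
```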
4. **Check logs in real-time**: ## Configs directory
```bash
tail -f logs/{JOB_ID}/log.out
```
You can view logs of all prefill or decode workers simultaneously by running: The `--config-dir` argument is used to specify the directory containing the various configs that are used when running this model. Here are the current configs that are in our directory.
```bash ```bash
# prefill workers err (or .out) klconfigs/
tail -f logs/{JOB_ID}/*_prefill.err ├── decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json
├── deepep_config.json
├── dgcache/
└── prefill_dsr1-0528_in1000out1000_num40000.json
```
# decode workers err (or .out) 1. `decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json`: `init-expert-location` for decode worker
tail -f logs/{JOB_ID}/*_decode.err 2. `deepep_config.json`: DeepEP config file for GB200
``` 3. `dgcache/`: DeepGEMM kernel cache directory. Instructions for creating this can be found [here](https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174)
4. `prefill_dsr1-0528_in1000out1000_num40000.json`: `init-expert-location` for prefill worker
5. **Monitor GPU utilization**: **Note**: The expert locations are collected using the instructions [here](https://github.com/sgl-project/sglang/issues/6017). See the section titled "Create expert distribution data". Note that this is sensitive to your data, and performance results may differ if you don't benchmark with the same data that was used to collect the expert locations.
```bash
tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
```
## Outputs ## Profiler
Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container. If you provide the `--profiler` argument, the sbatch script will automatically warm up the model and run the vLLM benchmarking script. Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
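The `--profiler` value in the usage example is a semicolon-separated list of `key=value` pairs. The sketch below shows one way such a string could be parsed; the function names are hypothetical, and treating `x` as the separator between concurrency levels is an assumption inferred from the example value, not a documented contract of `submit_job_script.py`.

```python
def parse_profiler_spec(spec: str) -> dict[str, str]:
    """Split 'key=value; key=value' into a dict of stripped strings."""
    fields: dict[str, str] = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        fields[key.strip()] = value.strip()
    return fields


def parse_concurrencies(value: str) -> list[int]:
    """Assumption: concurrency levels are joined with 'x', e.g. '16x2048'."""
    return [int(item) for item in value.split("x")]


if __name__ == "__main__":
    spec = "type=vllm; isl=8192; osl=1024; concurrencies=16x2048x4096x8192; req-rate=inf"
    fields = parse_profiler_spec(spec)
    print(fields["type"], fields["isl"], fields["osl"], fields["req-rate"])
    print(parse_concurrencies(fields["concurrencies"]))
```

Values are kept as strings on purpose, since `req-rate=inf` is not an integer and each consumer can convert the fields it needs.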
...@@ -50,24 +50,30 @@ warmup_model() { ...@@ -50,24 +50,30 @@ warmup_model() {
model_path=$4 model_path=$4
config=$5 config=$5
IFS='x' read -r -a config_list <<< "$config" model_name="deepseek-ai/DeepSeek-R1"
isl=${config_list[0]} model_path="deepseek-ai/DeepSeek-R1-0528"
osl=${config_list[1]} head_node="localhost"
num_prompts=${config_list[2]} head_port="8000"
concurrency=${config_list[3]} chosen_isl=1024
request_rate=${config_list[4]} chosen_osl=1024
chosen_req_rate="inf"
chosen_concurrencies=(1 2 4 8 16 32 64 128)
command=( for concurrency in ${chosen_concurrencies[@]}
python3 -m sglang.bench_serving do
--base-url "http://${service_host}:${service_port}" num_prompts=$((concurrency * 5))
--model ${served_model_name} --tokenizer ${model_path}
--backend sglang-oai
--dataset-name random --random-input ${isl} --random-output ${osl}
--random-range-ratio 1
--num-prompts ${num_prompts} --request-rate ${request_rate} --max-concurrency ${concurrency}
)
echo "Config ${config}. Running command ${command[@]}" command=(
python3 -m sglang.bench_serving
--base-url "http://${head_node}:${head_port}"
--model ${model_name} --tokenizer ${model_path}
--backend sglang-oai
--dataset-name random --random-input ${chosen_isl} --random-output ${chosen_osl}
--random-range-ratio 1
--num-prompts ${num_prompts} --request-rate ${chosen_req_rate} --max-concurrency ${concurrency}
)
${command[@]} echo "Running with concurrency: ${concurrency}, num_prompts: ${num_prompts}"
} "${command[@]}"
done
}
\ No newline at end of file
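The new warmup loop sweeps a fixed list of concurrency levels and sends five prompts per concurrency slot at each level. As a rough illustration of how much traffic that generates, the sketch below reproduces the schedule in Python; `warmup_schedule` is a hypothetical name and the code only mirrors the arithmetic of the shell loop, not the benchmark itself.

```python
def warmup_schedule(
    concurrencies: tuple[int, ...] = (1, 2, 4, 8, 16, 32, 64, 128),
    prompts_per_slot: int = 5,
) -> list[tuple[int, int]]:
    """Return (concurrency, num_prompts) pairs, with num_prompts = concurrency * prompts_per_slot."""
    return [(c, c * prompts_per_slot) for c in concurrencies]


if __name__ == "__main__":
    schedule = warmup_schedule()
    for concurrency, num_prompts in schedule:
        print(f"concurrency={concurrency} num_prompts={num_prompts}")
    # Total prompts sent across the whole warmup sweep.
    print(sum(n for _, n in schedule))  # 1275
```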
...@@ -32,8 +32,8 @@ echo "Mode: $mode" ...@@ -32,8 +32,8 @@ echo "Mode: $mode"
echo "Command: dynamo" echo "Command: dynamo"
# Check if required environment variables are set # Check if required environment variables are set
if [ -z "$HOST_IP" ]; then if [ -z "$HOST_IP_MACHINE" ]; then
echo "Error: HOST_IP environment variable is not set" echo "Error: HOST_IP_MACHINE environment variable is not set"
exit 1 exit 1
fi fi
...@@ -67,6 +67,9 @@ if [ "$mode" = "prefill" ]; then ...@@ -67,6 +67,9 @@ if [ "$mode" = "prefill" ]; then
# GB200 dynamo prefill command # GB200 dynamo prefill command
set -x set -x
# SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \ # SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
# timeouts and kernel cache
export TORCH_DISTRIBUTED_DEFAULT_TIMEOUT=1800
export SGL_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
if [[ "${USE_INIT_LOCATIONS,,}" == "true" ]]; then command_suffix="--init-expert-location /configs/prefill_dsr1-0528_in1000out1000_num40000.json"; fi if [[ "${USE_INIT_LOCATIONS,,}" == "true" ]]; then command_suffix="--init-expert-location /configs/prefill_dsr1-0528_in1000out1000_num40000.json"; fi
...@@ -80,15 +83,15 @@ if [ "$mode" = "prefill" ]; then ...@@ -80,15 +83,15 @@ if [ "$mode" = "prefill" ]; then
NCCL_MNNVL_ENABLE=1 \ NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \ NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \ SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \ SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \ PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang.worker \ python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \ --served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \ --model-path /model/ \
--skip-tokenizer-init \ --skip-tokenizer-init \
--trust-remote-code \ --trust-remote-code \
--disaggregation-mode prefill \ --disaggregation-mode prefill \
--dist-init-addr "$HOST_IP:$PORT" \ --dist-init-addr "$HOST_IP_MACHINE:$PORT" \
--disaggregation-bootstrap-port 30001 \ --disaggregation-bootstrap-port 30001 \
--nnodes "$TOTAL_NODES" \ --nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \ --node-rank "$RANK" \
...@@ -100,7 +103,8 @@ if [ "$mode" = "prefill" ]; then ...@@ -100,7 +103,8 @@ if [ "$mode" = "prefill" ]; then
--max-running-requests 12288 \ --max-running-requests 12288 \
--context-length 9600 \ --context-length 9600 \
--disable-radix-cache \ --disable-radix-cache \
--enable-deepep-moe \ --moe-a2a-backend deepep \
--load-balance-method round_robin \
--deepep-mode normal \ --deepep-mode normal \
--ep-dispatch-algorithm dynamic \ --ep-dispatch-algorithm dynamic \
--moe-dense-tp-size 1 \ --moe-dense-tp-size 1 \
...@@ -122,6 +126,10 @@ elif [ "$mode" = "decode" ]; then ...@@ -122,6 +126,10 @@ elif [ "$mode" = "decode" ]; then
command_suffix="" command_suffix=""
if [[ "${USE_INIT_LOCATIONS,,}" == "true" ]]; then command_suffix="--init-expert-location /configs/decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json"; fi if [[ "${USE_INIT_LOCATIONS,,}" == "true" ]]; then command_suffix="--init-expert-location /configs/decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json"; fi
# timeouts and kernel cache
export TORCH_DISTRIBUTED_DEFAULT_TIMEOUT=1800
export SGL_DG_CACHE_DIR="/configs/dgcache/3p1dcache"
# GB200 dynamo decode command # GB200 dynamo decode command
DYN_SKIP_SGLANG_LOG_FORMATTING=1 \ DYN_SKIP_SGLANG_LOG_FORMATTING=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=512 \ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=512 \
...@@ -135,15 +143,15 @@ elif [ "$mode" = "decode" ]; then ...@@ -135,15 +143,15 @@ elif [ "$mode" = "decode" ]; then
MC_FORCE_MNNVL=1 \ MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \ NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \ SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \ SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \ PYTHONUNBUFFERED=1 \
python3 -m dynamo.sglang.decode_worker \ python3 -m dynamo.sglang \
--served-model-name deepseek-ai/DeepSeek-R1 \ --served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \ --model-path /model/ \
--skip-tokenizer-init \ --skip-tokenizer-init \
--trust-remote-code \ --trust-remote-code \
--disaggregation-mode decode \ --disaggregation-mode decode \
--dist-init-addr "$HOST_IP:$PORT" \ --dist-init-addr "$HOST_IP_MACHINE:$PORT" \
--disaggregation-bootstrap-port 30001 \ --disaggregation-bootstrap-port 30001 \
--nnodes "$TOTAL_NODES" \ --nnodes "$TOTAL_NODES" \
--node-rank "$RANK" \ --node-rank "$RANK" \
...@@ -155,7 +163,8 @@ elif [ "$mode" = "decode" ]; then ...@@ -155,7 +163,8 @@ elif [ "$mode" = "decode" ]; then
--max-running-requests 36864 \ --max-running-requests 36864 \
--context-length 9600 \ --context-length 9600 \
--disable-radix-cache \ --disable-radix-cache \
--enable-deepep-moe \ --moe-a2a-backend deepep \
--prefill-round-robin-balance \
--deepep-mode low_latency \ --deepep-mode low_latency \
--moe-dense-tp-size 1 \ --moe-dense-tp-size 1 \
--enable-dp-lm-head \ --enable-dp-lm-head \
......
...@@ -175,8 +175,8 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac ...@@ -175,8 +175,8 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
parser.add_argument( parser.add_argument(
"--gpu_type", "--gpu_type",
type=str, type=str,
choices=["h100", "gb200-fp8"], choices=["gb200-fp8"],
default="h100", default="gb200-fp8",
help="Type of GPU to use", help="Type of GPU to use",
) )
...@@ -237,8 +237,8 @@ def setup_env_vars_for_gpu_script( ...@@ -237,8 +237,8 @@ def setup_env_vars_for_gpu_script(
port: int = DIST_INIT_PORT, port: int = DIST_INIT_PORT,
use_init_locations: bool = True, use_init_locations: bool = True,
): ):
"""Setup environment variables required by GPU scripts (h100.sh, gb200-fp8.sh, gb200-fp4.sh)""" """Setup environment variables required by GPU scripts (gb200-fp8.sh)"""
os.environ["HOST_IP"] = host_ip os.environ["HOST_IP_MACHINE"] = host_ip
os.environ["PORT"] = str(port) os.environ["PORT"] = str(port)
os.environ["TOTAL_GPUS"] = str(total_gpus) os.environ["TOTAL_GPUS"] = str(total_gpus)
os.environ["RANK"] = str(local_rank) os.environ["RANK"] = str(local_rank)
......
...@@ -142,8 +142,8 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac ...@@ -142,8 +142,8 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac
) )
parser.add_argument( parser.add_argument(
"--gpu-type", "--gpu-type",
choices=["h100", "gb200-fp8"], choices=["gb200-fp8"],
default="h100", default="gb200-fp8",
help="GPU type to use", help="GPU type to use",
) )
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# This module is deprecated. Use `python3 -m dynamo.sglang` instead.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import logging
from dynamo.runtime.logging import configure_dynamo_logging
from dynamo.sglang.main import main
if __name__ == "__main__":
    configure_dynamo_logging()
    logging.warning(
        "DEPRECATION WARNING: `python3 -m dynamo.sglang.decode_worker` is deprecated and will be removed in dynamo v0.5.0. "
        "Use `python3 -m dynamo.sglang` instead.",
    )
    main()
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# This module is deprecated. Use `python3 -m dynamo.sglang` instead.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import logging

from dynamo.runtime.logging import configure_dynamo_logging
from dynamo.sglang.main import main

if __name__ == "__main__":
    configure_dynamo_logging()
    # The trailing space keeps the two adjacent string literals from running together.
    logging.warning(
        "DEPRECATION WARNING: `python3 -m dynamo.sglang.worker` is deprecated and will be removed in dynamo v0.5.0. "
        "Use `python3 -m dynamo.sglang` instead.",
    )
    main()
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
# Note: This Dockerfile will be deprecated in favor of Dockerfile.sglang-wideep soon. Please build the container with that Dockerfile instead.
ARG BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base" ARG BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base"
# TODO OPS-612: NCCL will hang with 25.03, so use 25.01 for now # TODO OPS-612: NCCL will hang with 25.03, so use 25.01 for now
# Please check https://github.com/ai-dynamo/dynamo/pull/1065 # Please check https://github.com/ai-dynamo/dynamo/pull/1065
......
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-ARG SGLANG_IMAGE_TAG="v0.5.0rc2-cu126"
+ARG SGLANG_IMAGE_TAG="v0.5.3rc0-cu126"
 FROM lmsysorg/sglang:${SGLANG_IMAGE_TAG}
......
@@ -68,9 +68,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
         role: RoleMain,
         multinodeDeployer: &MockSimpleDeployer{},
         initialCommand: []string{"python3"},
-        initialArgs: []string{"-m", "dynamo.sglang.worker"},
+        initialArgs: []string{"-m", "dynamo.sglang"},
         expectedCommand: []string{"python3"},
-        expectedArgs: []string{"-m", "dynamo.sglang.worker"},
+        expectedArgs: []string{"-m", "dynamo.sglang"},
         description: "Single node should not modify python commands",
     },
     {
@@ -79,9 +79,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
         role: RoleWorker,
         multinodeDeployer: &MockSimpleDeployer{},
         initialCommand: []string{"python3"},
-        initialArgs: []string{"-m", "dynamo.sglang.worker", "--model", "llama"},
+        initialArgs: []string{"-m", "dynamo.sglang", "--model", "llama"},
         expectedCommand: []string{"python3"},
-        expectedArgs: []string{"-m", "dynamo.sglang.worker", "--model", "llama", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
+        expectedArgs: []string{"-m", "dynamo.sglang", "--model", "llama", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
         description: "Direct python command with simple deployer should append flags",
     },
     {
@@ -90,9 +90,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
         role: RoleWorker,
         multinodeDeployer: &MockShellDeployer{},
         initialCommand: []string{"python3"},
-        initialArgs: []string{"-m", "dynamo.sglang.worker", "--model", "llama"},
+        initialArgs: []string{"-m", "dynamo.sglang", "--model", "llama"},
         expectedCommand: []string{"sh", "-c"},
-        expectedArgs: []string{"exec python3 -m dynamo.sglang.worker --model llama --dist-init-addr $(LEADER_HOST):29500 --nnodes 2 --node-rank $(WORKER_INDEX)"},
+        expectedArgs: []string{"exec python3 -m dynamo.sglang --model llama --dist-init-addr $(LEADER_HOST):29500 --nnodes 2 --node-rank $(WORKER_INDEX)"},
         description: "Direct python command with shell deployer should wrap with sh -c exec",
     },
     {
@@ -101,9 +101,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
         role: RoleLeader,
         multinodeDeployer: &MockShellDeployer{},
         initialCommand: []string{"python"},
-        initialArgs: []string{"-m", "dynamo.sglang.worker"},
+        initialArgs: []string{"-m", "dynamo.sglang"},
         expectedCommand: []string{"python"},
-        expectedArgs: []string{"-m", "dynamo.sglang.worker", "--dist-init-addr", "$(LEADER_HOST):29500", "--nnodes", "3", "--node-rank", "0"},
+        expectedArgs: []string{"-m", "dynamo.sglang", "--dist-init-addr", "$(LEADER_HOST):29500", "--nnodes", "3", "--node-rank", "0"},
         description: "Leader role should never use shell wrapping",
     },
     {
@@ -112,9 +112,9 @@ func TestSGLangBackend_PythonCommandInjection(t *testing.T) {
         role: RoleWorker,
         multinodeDeployer: &MockSimpleDeployer{},
         initialCommand: []string{"python3.11"},
-        initialArgs: []string{"-m", "dynamo.sglang.worker"},
+        initialArgs: []string{"-m", "dynamo.sglang"},
         expectedCommand: []string{"python3.11"},
-        expectedArgs: []string{"-m", "dynamo.sglang.worker", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
+        expectedArgs: []string{"-m", "dynamo.sglang", "--dist-init-addr", "leader.example.com:29500", "--nnodes", "2", "--node-rank", "1"},
         description: "Python version variants should be recognized",
     },
     {
@@ -202,8 +202,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
         role: RoleMain,
         multinodeDeployer: &GroveMultinodeDeployer{},
         initialCommand: []string{"sh", "-c"},
-        initialArgs: []string{"python -m dynamo.sglang.worker"},
-        expectedArgs: []string{"python -m dynamo.sglang.worker"},
+        initialArgs: []string{"python -m dynamo.sglang"},
+        expectedArgs: []string{"python -m dynamo.sglang"},
         description: "Single node should not modify shell commands",
     },
     {
@@ -212,8 +212,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
         role: RoleLeader,
         multinodeDeployer: &GroveMultinodeDeployer{},
         initialCommand: []string{"sh", "-c"},
-        initialArgs: []string{"python -m dynamo.sglang.worker"},
-        expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0"},
+        initialArgs: []string{"python -m dynamo.sglang"},
+        expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0"},
         description: "Shell commands should use regex injection for python commands",
     },
     {
@@ -222,8 +222,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
         role: RoleLeader,
         multinodeDeployer: &GroveMultinodeDeployer{},
         initialCommand: []string{"sh", "-c"},
-        initialArgs: []string{"echo blah | wc -l && python -m dynamo.sglang.worker && ls -al"},
-        expectedArgs: []string{"echo blah | wc -l && python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 && ls -al"},
+        initialArgs: []string{"echo blah | wc -l && python -m dynamo.sglang && ls -al"},
+        expectedArgs: []string{"echo blah | wc -l && python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 && ls -al"},
         description: "Complex shell commands should inject flags only into python part",
     },
     {
@@ -232,8 +232,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
         role: RoleWorker,
         multinodeDeployer: &GroveMultinodeDeployer{},
         initialCommand: []string{"sh", "-c"},
-        initialArgs: []string{"python -m dynamo.sglang.worker"},
-        expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1))"},
+        initialArgs: []string{"python -m dynamo.sglang"},
+        expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1))"},
         description: "Shell command worker should get grove env vars in node rank",
     },
     {
@@ -242,8 +242,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
         role: RoleLeader,
         multinodeDeployer: &LWSMultinodeDeployer{},
         initialCommand: []string{"sh", "-c"},
-        initialArgs: []string{"python -m dynamo.sglang.worker"},
-        expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(LWS_LEADER_ADDRESS):29500 --nnodes 2 --node-rank 0"},
+        initialArgs: []string{"python -m dynamo.sglang"},
+        expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(LWS_LEADER_ADDRESS):29500 --nnodes 2 --node-rank 0"},
         description: "LWS shell commands should use LWS variables",
     },
     {
@@ -252,8 +252,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
         role: RoleLeader,
         multinodeDeployer: &GroveMultinodeDeployer{},
         initialCommand: []string{"sh", "-c"},
-        initialArgs: []string{"python -m dynamo.sglang.worker | tee /tmp/log"},
-        expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 | tee /tmp/log"},
+        initialArgs: []string{"python -m dynamo.sglang | tee /tmp/log"},
+        expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0 | tee /tmp/log"},
         description: "Shell commands with pipes should inject flags before pipe",
     },
     {
@@ -262,8 +262,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
         role: RoleLeader,
         multinodeDeployer: &GroveMultinodeDeployer{},
         initialCommand: []string{"sh", "-c"},
-        initialArgs: []string{"echo start", "python -m dynamo.sglang.worker", "echo done"},
-        expectedArgs: []string{"echo start", "python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "echo done"},
+        initialArgs: []string{"echo start", "python -m dynamo.sglang", "echo done"},
+        expectedArgs: []string{"echo start", "python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "echo done"},
         description: "Shell commands with multiple args should process each individually, modify only the python arg",
     },
     {
@@ -282,8 +282,8 @@ func TestSGLangBackend_ShellCommandInjection(t *testing.T) {
         role: RoleLeader,
         multinodeDeployer: &GroveMultinodeDeployer{},
         initialCommand: []string{"sh", "-c"},
-        initialArgs: []string{"python -m dynamo.sglang.worker", "python -m dynamo.sglang.worker --other-flags"},
-        expectedArgs: []string{"python -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "python -m dynamo.sglang.worker --other-flags"},
+        initialArgs: []string{"python -m dynamo.sglang", "python -m dynamo.sglang --other-flags"},
+        expectedArgs: []string{"python -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-test-service-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 2 --node-rank 0", "python -m dynamo.sglang --other-flags"},
         description: "Should stop processing after first successful python flag injection",
     },
 }
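The shell-injection cases above all describe the same behaviour: find the first `python ... -m dynamo.sglang` invocation, append the multinode flags directly after it, and leave everything else (pipes, `&&` chains, later args) untouched. The following Python sketch illustrates that behaviour only; it is not the operator's Go implementation, and the regular expression and function names are assumptions made for illustration.

```python
import re

# Assumed pattern: an interpreter name (python, python3, python3.11, ...)
# followed by "-m dynamo.sglang" and any dotted submodule suffix.
PYTHON_SGLANG = re.compile(r"\bpython[0-9.]*\s+-m\s+dynamo\.sglang\S*")


def inject_flags(command: str, flags: str) -> tuple[str, bool]:
    """Append flags right after the first matching python invocation."""
    # A function replacement avoids backslash/group interpretation in `flags`.
    new_command, count = PYTHON_SGLANG.subn(
        lambda m: f"{m.group(0)} {flags}", command, count=1
    )
    return new_command, count > 0


def inject_into_args(args: list[str], flags: str) -> list[str]:
    """Process args in order and stop after the first successful injection."""
    result = list(args)
    for i, arg in enumerate(result):
        updated, changed = inject_flags(arg, flags)
        if changed:
            result[i] = updated
            break
    return result


flags = "--dist-init-addr $(LWS_LEADER_ADDRESS):29500 --nnodes 2 --node-rank 0"
print(inject_into_args(["python -m dynamo.sglang | tee /tmp/log"], flags))
print(inject_into_args(["echo start", "python -m dynamo.sglang", "echo done"], flags))
```

Inserting immediately after the module name, rather than at the end of the string, is what keeps the flags ahead of a trailing `| tee` or `&& ls -al`.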
@@ -444,7 +444,7 @@ func TestSGLangBackend_ProbeRemoval(t *testing.T) {
     startupProbe := &corev1.Probe{InitialDelaySeconds: 5}
     container := &corev1.Container{
-        Args: []string{"python -m dynamo.sglang.worker"},
+        Args: []string{"python -m dynamo.sglang"},
         LivenessProbe: livenessProbe,
         ReadinessProbe: readinessProbe,
         StartupProbe: startupProbe,
......
@@ -1675,7 +1675,7 @@ func TestGenerateGrovePodCliqueSet(t *testing.T) {
                 "-c",
             },
             Args: []string{
-                "python3 -m dynamo.sglang.worker --custom-flag custom-value",
+                "python3 -m dynamo.sglang --custom-flag custom-value",
             },
         },
     },
@@ -1828,7 +1828,7 @@ func TestGenerateGrovePodCliqueSet(t *testing.T) {
         "-c",
     },
     Args: []string{
-        "python3 -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank 0 --custom-flag custom-value",
+        "python3 -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank 0 --custom-flag custom-value",
     },
     Ports: []corev1.ContainerPort{
         {
@@ -1980,7 +1980,7 @@ func TestGenerateGrovePodCliqueSet(t *testing.T) {
         "-c",
     },
     Args: []string{
-        "python3 -m dynamo.sglang.worker --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --custom-flag custom-value",
+        "python3 -m dynamo.sglang --dist-init-addr $(GROVE_PCSG_NAME)-$(GROVE_PCSG_INDEX)-worker-ldr-0.$(GROVE_HEADLESS_SERVICE):29500 --nnodes 3 --node-rank $((GROVE_PCLQ_POD_INDEX + 1)) --custom-flag custom-value",
     },
     Ports: []corev1.ContainerPort{
         {
@@ -3207,7 +3207,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
         ComponentType: commonconsts.ComponentTypeWorker,
         ExtraPodSpec: &common.ExtraPodSpec{
             MainContainer: &corev1.Container{
-                Args: []string{"python3 -m dynamo.sglang.worker"},
+                Args: []string{"python3 -m dynamo.sglang"},
             },
         },
     },
@@ -3216,7 +3216,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
         role: RoleMain,
         numberOfNodes: 1,
         expectError: false,
-        expectContains: []string{"python3", "-m", "dynamo.sglang.worker"},
+        expectContains: []string{"python3", "-m", "dynamo.sglang"},
         expectNotContains: []string{"dist-init-addr", "nnodes", "tp-size"},
     },
     {
@@ -3226,7 +3226,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
         ComponentType: commonconsts.ComponentTypeWorker,
         ExtraPodSpec: &common.ExtraPodSpec{
             MainContainer: &corev1.Container{
-                Args: []string{"python3 -m dynamo.sglang.worker"},
+                Args: []string{"python3 -m dynamo.sglang"},
             },
         },
     },
@@ -3235,7 +3235,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
         role: RoleLeader,
         numberOfNodes: 3,
         expectError: false,
-        expectContains: []string{"python3", "-m", "dynamo.sglang.worker", "dist-init-addr", "nnodes", "node-rank"},
+        expectContains: []string{"python3", "-m", "dynamo.sglang", "dist-init-addr", "nnodes", "node-rank"},
     },
     {
         name: "SGLang multinode worker",
@@ -3244,7 +3244,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
         ComponentType: commonconsts.ComponentTypeWorker,
         ExtraPodSpec: &common.ExtraPodSpec{
             MainContainer: &corev1.Container{
-                Args: []string{"python3 -m dynamo.sglang.worker"},
+                Args: []string{"python3 -m dynamo.sglang"},
             },
         },
     },
@@ -3253,7 +3253,7 @@ func TestGeneratePodSpecForComponent_SGLang(t *testing.T) {
         role: RoleWorker,
         numberOfNodes: 3,
         expectError: false,
-        expectContains: []string{"python3", "-m", "dynamo.sglang.worker", "dist-init-addr", "nnodes", "node-rank"},
+        expectContains: []string{"python3", "-m", "dynamo.sglang", "dist-init-addr", "nnodes", "node-rank"},
     },
     {
         name: "SGLang with user command override",
@@ -3685,7 +3685,7 @@ func TestDetectBackendFrameworkFromArgs(t *testing.T) {
     {
         name: "detect SGLang from args",
         command: []string{"/bin/sh", "-c"},
-        args: []string{"python -m dynamo.sglang.worker --model test"},
+        args: []string{"python -m dynamo.sglang --model test"},
         expected: BackendFrameworkSGLang,
     },
     {
@@ -3703,7 +3703,7 @@ func TestDetectBackendFrameworkFromArgs(t *testing.T) {
     {
         name: "detect from python3.11",
         command: []string{},
-        args: []string{"python3.11 -m dynamo.sglang.decode_worker"},
+        args: []string{"python3.11 -m dynamo.sglang"},
         expected: BackendFrameworkSGLang,
     },
     {
@@ -3715,7 +3715,7 @@ func TestDetectBackendFrameworkFromArgs(t *testing.T) {
     {
         name: "multiple backends detected",
         command: []string{},
-        args: []string{"python -m dynamo.vllm.worker && python -m dynamo.sglang.worker"},
+        args: []string{"python -m dynamo.vllm.worker && python -m dynamo.sglang"},
         expectError: true,
     },
 }
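The detection cases above imply a simple rule: scan the command and args for a `python -m dynamo.<backend>` invocation and treat more than one distinct backend as an error. A rough Python sketch of that rule follows; it is not the operator's Go code, it lists only the two module prefixes that appear in these test cases, and returning `None` when nothing matches is an assumption rather than documented behaviour.

```python
import re

# Only the module prefixes visible in the test cases above are listed here.
BACKEND_PATTERNS = {
    "vllm": re.compile(r"python[0-9.]*\s+-m\s+dynamo\.vllm\b"),
    "sglang": re.compile(r"python[0-9.]*\s+-m\s+dynamo\.sglang\b"),
}


def detect_backend(command, args):
    """Return the single backend referenced by command + args, if any."""
    text = " ".join([*command, *args])
    found = [name for name, pattern in BACKEND_PATTERNS.items() if pattern.search(text)]
    if len(found) > 1:
        raise ValueError(f"multiple backends detected: {found}")
    # Assumption: no match is reported as None instead of raising.
    return found[0] if found else None


print(detect_backend(["/bin/sh", "-c"], ["python -m dynamo.sglang --model test"]))
print(detect_backend([], ["python3.11 -m dynamo.sglang"]))
```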
@@ -3777,7 +3777,7 @@ func TestDetermineBackendFramework(t *testing.T) {
     {
         name: "worker with detected matching explicit",
         componentType: "worker",
-        args: []string{"python -m dynamo.sglang.worker"},
+        args: []string{"python -m dynamo.sglang"},
         explicitBackendFramework: "sglang",
         expected: BackendFrameworkSGLang,
     },
@@ -3881,7 +3881,7 @@ func TestGetBackendFrameworkFromComponent(t *testing.T) {
         ComponentType: "worker", // Worker component
         ExtraPodSpec: &common.ExtraPodSpec{
             MainContainer: &corev1.Container{
-                Args: []string{"python -m dynamo.sglang.worker"},
+                Args: []string{"python -m dynamo.sglang"},
             },
         },
     },
......
@@ -30,7 +30,6 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 | **NVIDIA Ada Lovelace Architecture** | Supported |
 | **NVIDIA Ampere Architecture** | Supported |
 ## Platform Architecture Compatibility
 **Dynamo** is compatible with the following platforms:
@@ -51,16 +50,15 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 > [!Caution]
 > KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
 ## Software Compatibility
 ### Runtime Dependency
 | **Python Package** | **Version** | glibc version | CUDA Version |
-| :----------------- | :------------ | :----------------------------------- | :----------- |
+| :----------------- | :---------- | :------------------------------------ | :----------- |
 | ai-dynamo | 0.5.1 | >=2.28 | |
-| ai-dynamo-runtime | 0.5.1 | >=2.28 (Python 3.12 has known issues)| |
+| ai-dynamo-runtime | 0.5.1 | >=2.28 (Python 3.12 has known issues) | |
 | NIXL | 0.4.1 | >=2.27 | >=11.8 |
 ### Build Dependency
@@ -69,7 +67,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 | **TensorRT-LLM** | 1.1.0rc5 |
 | **NIXL** | 0.4.1 |
 | **vLLM** | 0.10.1.1 |
-| **SGLang** | 0.5.0rc2 |
+| **SGLang** | 0.5.3rc0 |
 > [!Important]
 > Specific versions of TensorRT-LLM supported by Dynamo are subject to change. Currently TensorRT-LLM does not support Python 3.11 so installation of the ai-dynamo[trtllm] will fail.
@@ -78,27 +76,25 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
 ### AWS
 | **Host Operating System** | **Version** | **Architecture** | **Status** |
-| :------------------------ | :---------- | :--------------- | :----------- |
+| :------------------------ | :---------- | :--------------- | :--------- |
 | **Amazon Linux** | 2023 | x86_64 | Supported¹ |
 > [!Caution]
 > ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
 ## Build Support
 **Dynamo** currently provides build support in the following ways:
 - **Wheels**: Pre-built Python wheels are only available for **x86_64 Linux**.
   No wheels are available for other platforms at this time.
 - **Runtime Container Images**: We distribute only **AMD64** images of the runtime target on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) for [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime), [vLLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime), and [SGLang](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime).
   Users must build the container image from source if they require an **ARM64** image.
 - **Deployment-supportive Images**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the [Dynamo kubernetes-operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) to simplify deployments of Dynamo Graphs.
   It is currently provided as an **AMD64** image only.
 - **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo. [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds), [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform), and [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph) are available.
......
@@ -3,12 +3,14 @@
 This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
 For more information about the core concepts, see:
 - [Dynamo Disaggregated Serving](../../../docs/architecture/disagg_serving.md)
 - [KV Cache Routing Architecture](../../../docs/architecture/kv_cache_routing.md)
 ## Architecture Overview
 The multi-node setup consists of:
 - **1 Frontend**: Receives HTTP requests and uses KV routing to distribute them
 - **2 Model Replicas**: Each with dedicated prefill and decode workers
 - **Smart KV-Aware Routing**: Intelligently routes requests based on KV cache locality across **all workers**
@@ -57,6 +59,7 @@ KV-aware routing optimizes LLM inference by directing requests to workers that a
 - **Balances load**: Considers both cache efficiency and worker utilization when making routing decisions
 This is particularly beneficial for:
 - **Shared system prompts**: Cached across workers and reused efficiently
 - **Multi-turn conversations**: Full conversation history benefits from caching
 - **Similar queries**: Common prefixes are computed once and reused
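
To make the prefix-reuse idea above concrete, here is a rough sketch of the arithmetic. It assumes that cached prefixes are reused only in whole KV cache blocks, with the block size set by `--page-size` (16 in this example's launch commands). The function name `reusable_tokens` is hypothetical and is not part of Dynamo or SGLang, and the real router's scoring also weighs worker load.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a Dynamo or SGLang API): estimate how many tokens
# of a shared prefix fall into complete KV cache blocks.
# Assumption: reuse happens only at whole-block granularity.
reusable_tokens() {
  local shared_prefix_tokens=$1
  local page_size=${2:-16}   # defaults to the --page-size used in this example
  echo $(( (shared_prefix_tokens / page_size) * page_size ))
}

# A 100-token shared system prompt with 16-token blocks: 6 full blocks.
reusable_tokens 100 16   # prints 96
```

Under this assumption, a shared prefix shorter than one block yields no reuse at all, which is one reason the page size has to match across workers.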
@@ -90,6 +93,7 @@ For more information about the SGLang backend and its integration with Dynamo, s
 ### 3. Network Requirements
 Ensure the following ports are accessible between nodes:
 - **2379**: etcd client port
 - **4222**: NATS client port
 - **8000**: Frontend HTTP port (only needed on frontend node)
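
A quick way to confirm the ports listed above are reachable from a worker node is a small TCP probe. The sketch below is illustrative only: `check_port` and `INFRA_HOST` are hypothetical names introduced here (the host running etcd and NATS is an assumption), and it relies on bash's `/dev/tcp` redirection plus the coreutils `timeout` command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a TCP connection to host:port opens
# within 2 seconds. Uses bash's built-in /dev/tcp pseudo-device.
check_port() {
  local host=$1 port=$2
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# INFRA_HOST is a placeholder for the node running etcd and NATS (assumption).
INFRA_HOST=${INFRA_HOST:-127.0.0.1}
DYN_FRONTEND_IP=${DYN_FRONTEND_IP:-127.0.0.1}

for target in "${INFRA_HOST}:2379" "${INFRA_HOST}:4222" "${DYN_FRONTEND_IP}:8000"; do
  host=${target%%:*}
  port=${target##*:}
  if check_port "$host" "$port"; then
    echo "${target}: reachable"
  else
    echo "${target}: NOT reachable"
  fi
done
```

Port 8000 only answers once the frontend is running, so a failure there before launch is expected.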
@@ -98,6 +102,7 @@ Ensure the following ports are accessible between nodes:
 ### 4. Hardware Setup
 This example assumes:
 - **Node 1**: At least 2 GPUs (for Replica 1's decode and prefill workers)
 - **Node 2**: At least 2 GPUs (for Replica 2's decode and prefill workers)
 - **Frontend Node**: Can be on Node 1, Node 2, or a separate node (no GPU required)
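
Before launching workers, it can help to check that a node really exposes the two GPUs assumed above. This is a minimal sketch: `has_enough_gpus` is a hypothetical helper, and the count comes from `nvidia-smi --query-gpu=index --format=csv,noheader`, which prints one line per visible GPU.

```bash
#!/usr/bin/env bash
# Hypothetical helper: true when the available GPU count meets the requirement.
has_enough_gpus() {
  local available=$1 required=$2
  [ "$available" -ge "$required" ]
}

# Count GPUs with nvidia-smi when it is installed; otherwise assume none.
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_count=$(( $(nvidia-smi --query-gpu=index --format=csv,noheader 2>/dev/null | wc -l) ))
else
  gpu_count=0
fi

if has_enough_gpus "$gpu_count" 2; then
  echo "OK: ${gpu_count} GPU(s) found, enough for one prefill and one decode worker"
else
  echo "Only ${gpu_count} GPU(s) found; this example expects at least 2 per worker node"
fi
```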
@@ -131,7 +136,7 @@ Open a terminal on Node 1 and launch both workers:
 ```bash
 # Launch prefill worker in background
-CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
+CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
   --model-path Qwen/Qwen3-0.6B \
   --served-model-name Qwen/Qwen3-0.6B \
   --page-size 16 \
@@ -141,7 +146,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
   --disaggregation-mode prefill \
   --disaggregation-transfer-backend nixl &
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
   --model-path Qwen/Qwen3-0.6B \
   --served-model-name Qwen/Qwen3-0.6B \
   --page-size 16 \
@@ -153,6 +158,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
 ```
 > [!INFO]
+>
 > - `CUDA_VISIBLE_DEVICES`: Controls which GPU each worker uses (0 and 1 for different GPUs)
 > - `--page-size 16`: Sets the KV cache block size - must be identical across all workers
 > - `--disaggregation-mode`: Separates prefill (prompt processing) from decode (token generation)
@@ -165,7 +171,7 @@ Open a terminal on Node 2 and launch both workers:
 ```bash
 # Launch prefill worker in background
-CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
+CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
   --model-path Qwen/Qwen3-0.6B \
   --served-model-name Qwen/Qwen3-0.6B \
   --page-size 16 \
@@ -176,7 +182,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
   --disaggregation-transfer-backend nixl &
 # Launch decode worker in foreground
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
+CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
   --model-path Qwen/Qwen3-0.6B \
   --served-model-name Qwen/Qwen3-0.6B \
   --page-size 16 \
@@ -206,6 +212,7 @@ hostname -I | awk '{print $1}'
 ```
 The frontend will:
 - Discover all available decode workers via etcd
 - Enable KV-aware routing for intelligent request distribution
 - Monitor worker health and adjust routing accordingly
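
Because the frontend needs a moment to start and discover workers, a small retry loop around the `/health` endpoint (the same `curl http://${DYN_FRONTEND_IP}:8000/health` check used in this guide's troubleshooting steps) can gate follow-up requests. The sketch below is illustrative; `wait_for` is a hypothetical helper, not something Dynamo ships.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command> [args...]
wait_for() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" >/dev/null 2>&1; then
      return 0            # command succeeded
    fi
    sleep "$delay"        # pause before the next attempt
  done
  return 1                # every attempt failed
}

# Example (commented out so the sketch has no network side effects):
# wait_for 30 2 curl -sf "http://${DYN_FRONTEND_IP}:8000/health" \
#   && echo "frontend is healthy"
```

With 30 attempts and a 2-second delay, the loop gives up after roughly a minute; both numbers are arbitrary and worth tuning to the model's load time.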
@@ -418,6 +425,7 @@ curl http://${DYN_FRONTEND_IP}:8000/health
 ### Workers Not Discovering Each Other
 1. Verify etcd connectivity from all nodes:
    ```bash
    etcdctl --endpoints=$ETCD_ENDPOINTS endpoint health
    ```
@@ -461,9 +469,11 @@ Stop all components in reverse order:
 1. Stop Frontend (Ctrl+C in the frontend terminal)
 2. Stop workers on each node:
    - On Node 1: Press Ctrl+C in the terminal (this stops the decode worker)
    - On Node 2: Press Ctrl+C in the terminal (this stops the decode worker)
    - To stop the background prefill workers, use one of these methods:
    ```bash
    # Method 1: Kill background jobs in the same terminal
    jobs # See background jobs
@@ -473,8 +483,9 @@ Stop all components in reverse order:
    exit
    # Method 3: Kill by process name (from any terminal)
-   pkill -f "dynamo.sglang.worker.*prefill"
+   pkill -f "dynamo.sglang.*prefill"
    ```
 3. Stop infrastructure services:
    ```bash
    docker compose -f deploy/docker-compose.yml down
......
@@ -60,7 +60,7 @@ vllm = [
 sglang = [
     "uvloop",
     "nixl<=0.4.1",
-    "sglang[all]==0.5.0rc2",
+    "sglang[all]==0.5.3rc0",
 ]
 llama_cpp = [
......