@@ -29,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative
...
@@ -29,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative
## Latest News
## Latest News
* [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
## The Era of Multi-GPU, Multi-Node
## The Era of Multi-GPU, Multi-Node
...
@@ -53,16 +54,17 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
...
@@ -53,16 +54,17 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
# Edit deploy/docker-compose.yml to comment out "runtime: nvidia" of the dcgm-exporter service if the NVIDIA container runtime isn't deployed or won't be used.
# Edit deploy/docker-compose.yml to comment out "runtime: nvidia" of the dcgm-exporter service if the NVIDIA container runtime isn't deployed or won't be used.
@@ -156,8 +160,8 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
...
@@ -156,8 +160,8 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
* **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
* **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
- **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
You can pass any SGLang flags directly to this worker. See https://docs.sglang.ai/advanced_features/server_arguments.html for the available flags, including those for using multiple GPUs.
You can pass any SGLang flags directly to this worker. See https://docs.sglang.ai/advanced_features/server_arguments.html for the available flags, including those for using multiple GPUs.
...
@@ -207,6 +213,7 @@ It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/
...
@@ -207,6 +213,7 @@ It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/
> Launch container with the following additional settings `--shm-size=1g --ulimit memlock=-1`
> Launch container with the following additional settings `--shm-size=1g --ulimit memlock=-1`
### Install prerequisites
### Install prerequisites
```
```
# Optional step: Only required for Blackwell and Grace Hopper
# Optional step: Only required for Blackwell and Grace Hopper
> You can learn more about these prerequisites and known issues with TensorRT-LLM pip-based installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
> You can learn more about these prerequisites and known issues with TensorRT-LLM pip-based installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
### After installing the prerequisites above, install Dynamo
### After installing the prerequisites above, install Dynamo
```
```
uv pip install ai-dynamo[trtllm]
uv pip install ai-dynamo[trtllm]
```
```
Run the backend/worker like this:
Run the backend/worker like this:
```
```
python -m dynamo.trtllm --help
python -m dynamo.trtllm --help
```
```
...
@@ -237,16 +246,20 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
...
@@ -237,16 +246,20 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
## 2. Install Rust
## 2. Install Rust
...
@@ -270,11 +283,13 @@ source $HOME/.cargo/env
...
@@ -270,11 +283,13 @@ source $HOME/.cargo/env
If you don't have `uv` installed, follow the [uv installation](https://docs.astral.sh/uv/#installation) guide. Once `uv` is installed, create a virtual environment and activate it.
If you don't have `uv` installed, follow the [uv installation](https://docs.astral.sh/uv/#installation) guide. Once `uv` is installed, create a virtual environment and activate it.
# Example: Deploy Multi-node SGLang with Dynamo on SLURM
# Example: Deploy DeepSeek R1 - FP8 with Dynamo and SGLang on SLURM
This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster.
This folder allows you to deploy the SGLang DeepSeek-R1 Disaggregated with WideEP on a GB200 SLURM cluster.
## Overview
## SLURM Prerequisites
The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode.
For this example, we will make some assumptions about your SLURM cluster:
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.
## Scripts
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
## Logs Folder Structure
Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
### Log File Structure
```
logs/
├── 3062824/ # Job ID directory
│ ├── log.out # Main job output (node allocation, IP addresses, launch commands)
For simplicity of the example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes
1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
available. For functional testing, most setups should be fine. For performance
...
@@ -58,97 +17,96 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
...
@@ -58,97 +17,96 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
If your cluster supports similar container based plugins, you may be able to
If your cluster supports similar container based plugins, you may be able to
modify the template to use that instead.
modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../docs/dsr1-wideep-h100.md#instructions).
described [here](../docs/dsr1-wideep-gb200.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps.
This is the image that can be passed to the `--container-image` argument in later steps.
## Scripts Overview
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM sbatch scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
## Logs Folder Structure
Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
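As a minimal sketch of this layout, the Python helpers below resolve a job's log directory and group the per-worker `.err` files by role. The function names `job_log_dir` and `worker_err_files` are hypothetical and not part of the scripts in this folder, and the `*_prefill_*.err` / `*_decode_*.err` patterns are an assumption about how the worker logs are named on your cluster.

```python
from pathlib import Path


def job_log_dir(job_id: str, root: str = "logs") -> Path:
    """Return the per-job log directory, e.g. logs/3062824."""
    return Path(root) / str(job_id)


def worker_err_files(job_id: str, root: str = "logs") -> dict[str, list[Path]]:
    """Group stderr logs by worker role (file name patterns are assumptions)."""
    log_dir = job_log_dir(job_id, root)
    return {
        role: sorted(log_dir.glob(f"*_{role}_*.err"))
        for role in ("prefill", "decode")
    }


if __name__ == "__main__":
    # Prints whatever worker logs exist for the example job ID.
    for role, files in worker_err_files("3062824").items():
        print(role, [f.name for f in files])
```

If the directory does not exist yet, both lists are simply empty, so the helper can be called while a job is still queued.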
## Usage
## Usage
> [!NOTE]
> [!NOTE]
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `ip addr show $NETWORK_INTERFACE` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions are always welcome.
**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
2. **Example with different GPU types**:
2. **Check logs in real-time**:
```bash
# For H100 with Dynamo (default)
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type h100
# For GB200 with SGLang
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type gb200 \
--use-sglang-commands \
--gpus-per-node 4
```
3. **Monitor job progress**:
```bash
```bash
squeue -u $USER
cd logs/{JOB_ID}
tail -f *_prefill_*.err *_decode_*.err
```
```
4. **Check logs in real-time**:
## Configs directory
```bash
tail -f logs/{JOB_ID}/log.out
```
You can view logs of all prefill or decode workers simultaneously by running:
The `--config-dir` argument is used to specify the directory containing the various configs that are used when running this model. Here are the current configs that are in our directory.
1. `decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json`: `init-expert-location` for decode worker
tail -f logs/{JOB_ID}/*_decode.err
2. `deepep_config.json`: DeepEP config file for GB200
```
3. `dgcache/`: DeepGEMM kernel cache directory. Instructions for creating this can be found [here](https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174)
4. `prefill_dsr1-0528_in1000out1000_num40000.json`: `init-expert-location` for prefill worker
5. **Monitor GPU utilization**:
**Note**: The expert locations are collected using the instructions [here](https://github.com/sgl-project/sglang/issues/6017). See the section titled "Create expert distribution data". The expert locations are sensitive to your data, so performance results may differ if you don't benchmark with the same data that was used to collect them.
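Before submitting a job, it can help to confirm that the directory passed to `--config-dir` actually contains the entries listed above. The following is a minimal sketch, not part of the submission scripts: the constant `EXPECTED_ENTRIES` and the function `missing_config_entries` are hypothetical names, and the file names are copied from the list above and should be adjusted if your configs differ.

```python
from pathlib import Path

# Entry names copied from the config list above; adjust for your own directory.
EXPECTED_ENTRIES = (
    "decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json",
    "deepep_config.json",
    "dgcache",
    "prefill_dsr1-0528_in1000out1000_num40000.json",
)


def missing_config_entries(config_dir: str) -> list[str]:
    """Return the expected files or directories that are absent from config_dir."""
    root = Path(config_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_config_entries("/path/to/configs")
    if missing:
        print("Missing config entries:", ", ".join(missing))
    else:
        print("All expected config entries are present.")
```

The check only tests for existence; it does not validate the contents of the JSON files or the DeepGEMM cache.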
Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
If you provide the `--profiler` argument, the sbatch script will automatically warm up the model and run the vLLM benchmarking script. Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
| ai-dynamo-runtime | 0.5.1 | >=2.28 (Python 3.12 has known issues)| |
| ai-dynamo-runtime | 0.5.1 | >=2.28 (Python 3.12 has known issues)| |
| NIXL | 0.4.1 | >=2.27 | >=11.8 |
| NIXL | 0.4.1 | >=2.27 | >=11.8 |
### Build Dependency
### Build Dependency
...
@@ -69,7 +67,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
...
@@ -69,7 +67,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| **TensorRT-LLM** | 1.1.0rc5 |
| **TensorRT-LLM** | 1.1.0rc5 |
| **NIXL** | 0.4.1 |
| **NIXL** | 0.4.1 |
| **vLLM** | 0.10.1.1 |
| **vLLM** | 0.10.1.1 |
| **SGLang** | 0.5.0rc2 |
| **SGLang** | 0.5.3rc0 |
> [!Important]
> [!Important]
> Specific versions of TensorRT-LLM supported by Dynamo are subject to change. Currently TensorRT-LLM does not support Python 3.11, so installing `ai-dynamo[trtllm]` on Python 3.11 will fail.
> Specific versions of TensorRT-LLM supported by Dynamo are subject to change. Currently TensorRT-LLM does not support Python 3.11, so installing `ai-dynamo[trtllm]` on Python 3.11 will fail.
...
@@ -78,27 +76,25 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
...
@@ -78,27 +76,25 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
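To illustrate the workaround in the footnote, the sketch below builds the `-p host:container` arguments that would replace `--network host` in a `docker run` command. The port numbers come from the footnote; the helper name `publish_flags` and the `REQUIRED_PORTS` mapping are hypothetical, and the one-to-one host/container mapping is an assumption that may not suit every deployment.

```python
# Ports taken from the footnote above; the service names are labels only.
REQUIRED_PORTS = {"nats": [4222], "etcd": [2379, 2380], "frontend": [8000]}


def publish_flags(ports_by_service: dict[str, list[int]]) -> list[str]:
    """Build `-p host:container` arguments for `docker run`."""
    flags: list[str] = []
    for ports in ports_by_service.values():
        for port in ports:
            # Map each port to the same number on the host (an assumption).
            flags += ["-p", f"{port}:{port}"]
    return flags


print(" ".join(publish_flags(REQUIRED_PORTS)))
# prints: -p 4222:4222 -p 2379:2379 -p 2380:2380 -p 8000:8000
```

The printed flags can be pasted into the `docker run` invocation in place of `--network host`.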
## Build Support
## Build Support
**Dynamo** currently provides build support in the following ways:
**Dynamo** currently provides build support in the following ways:
- **Wheels**: Pre-built Python wheels are only available for **x86_64 Linux**.
- **Wheels**: Pre-built Python wheels are only available for **x86_64 Linux**.
No wheels are available for other platforms at this time.
No wheels are available for other platforms at this time.
- **Runtime Container Images**: We distribute only **AMD64** images of the runtime target on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) for [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime), [vLLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime), and [SGLang](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime).
- **Runtime Container Images**: We distribute only **AMD64** images of the runtime target on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) for [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime), [vLLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime), and [SGLang](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime).
Users must build the container image from source if they require an **ARM64** image.
Users must build the container image from source if they require an **ARM64** image.
- **Deployment-supportive Images**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the [Dynamo kubernetes-operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) to simplify deployments of Dynamo Graphs.
- **Deployment-supportive Images**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the [Dynamo kubernetes-operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) to simplify deployments of Dynamo Graphs.
It is currently provided as an **AMD64** image only.
It is currently provided as an **AMD64** image only.
- **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo. [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds), [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform), and [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph) are available.
- **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo. [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds), [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform), and [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph) are available.
This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
For more information about the core concepts, see:
For more information about the core concepts, see: