@@ -29,7 +30,7 @@ High-throughput, low-latency inference framework designed for serving generative
## Latest News
* [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
- [08/05] Deploy `openai/gpt-oss-120b` with disaggregated serving on NVIDIA Blackwell GPUs using Dynamo [➡️ link](./components/backends/trtllm/gpt-oss.md)
## The Era of Multi-GPU, Multi-Node
...
...
@@ -54,7 +55,7 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
# Edit deploy/docker-compose.yml to comment out "runtime: nvidia" of the dcgm-exporter service if the NVIDIA container runtime isn't deployed or won't be used.
@@ -156,8 +160,8 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
* **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
* **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
- **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
You can pass any SGLang flags directly to this worker. See https://docs.sglang.ai/advanced_features/server_arguments.html for the full list, including the flags for using multiple GPUs.
...
...
@@ -207,6 +213,7 @@ It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/
> Launch container with the following additional settings `--shm-size=1g --ulimit memlock=-1`
### Install prerequisites
```
# Optional step: Only required for Blackwell and Grace Hopper
> You can learn more about these prerequisites and known issues with the pip-based TensorRT-LLM installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
### After installing the pre-requisites above, install Dynamo
```
uv pip install ai-dynamo[trtllm]
```
Run the backend/worker like this:
```
python -m dynamo.trtllm --help
```
...
...
@@ -237,16 +246,20 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
## 2. Install Rust
...
...
@@ -270,11 +283,13 @@ source $HOME/.cargo/env
If you don't have `uv` installed, follow the [uv installation](https://docs.astral.sh/uv/#installation) guide. Once `uv` is installed, create a virtual environment and activate it.
# Example: Deploy Multi-node SGLang with Dynamo on SLURM
# Example: Deploy DeepSeek R1 - FP8 with Dynamo and SGLang on SLURM
This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster.
This folder allows you to deploy the SGLang DeepSeek-R1 Disaggregated with WideEP on a GB200 SLURM cluster.
## Overview
## SLURM Prerequisites
The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode.
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.
## Scripts
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
## Logs Folder Structure
Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
### Log File Structure
```
logs/
├── 3062824/ # Job ID directory
│ ├── log.out # Main job output (node allocation, IP addresses, launch commands)
For simplicity of the example, we will make some assumptions about your SLURM cluster:
For this example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
...
...
@@ -58,97 +17,96 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
If your cluster supports similar container based plugins, you may be able to
modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../docs/dsr1-wideep-h100.md#instructions).
described [here](../docs/dsr1-wideep-gb200.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps.
## Scripts Overview
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM sbatch scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`submit_disagg.sh`**: A simple one-liner script that invokes the `submit_job_script.py`
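
To make the template-driven flow concrete, here is a minimal sketch of rendering an sbatch script from a template. It is not the contents of `job_script_template.j2` or `submit_job_script.py`: it uses the standard library's `string.Template` as a stand-in for Jinja2 so that it runs without extra dependencies, and the template body, the `render_job_script` helper, and the placeholder names are all assumptions made for illustration.

```python
from string import Template

# Hypothetical, heavily simplified sbatch template. The real
# job_script_template.j2 is a Jinja2 template and contains much more.
SBATCH_TEMPLATE = Template("""#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --nodes=$total_nodes
#SBATCH --account=$account

echo "prefill nodes: $prefill_nodes, decode nodes: $decode_nodes"
""")


def render_job_script(job_name: str, account: str,
                      prefill_nodes: int, decode_nodes: int) -> str:
    """Fill the template with values that would come from the CLI."""
    return SBATCH_TEMPLATE.substitute(
        job_name=job_name,
        account=account,
        prefill_nodes=prefill_nodes,
        decode_nodes=decode_nodes,
        # Assumption: one node per prefill/decode node, nothing extra.
        total_nodes=prefill_nodes + decode_nodes,
    )


script = render_job_script("dsr1-disagg", "your-slurm-account", 2, 12)
print(script)
```

In a real submission flow the rendered text would be written to a file and handed to `sbatch`. The sketch stops at rendering so it can be run anywhere.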
## Logs Folder Structure
Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.
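
As a small illustration of this layout, the sketch below builds the per-job directory path and lists the worker error logs in it. The `*_prefill_*.err` and `*_decode_*.err` patterns are the ones used by the `tail -f` command in the usage steps. The helper names and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def job_log_dir(job_id, root="logs"):
    """Return the per-job log directory, e.g. logs/3062824."""
    return Path(root) / str(job_id)


def worker_logs(log_dir, role):
    """List error logs for one worker role ('prefill' or 'decode')."""
    return sorted(p.name for p in Path(log_dir).glob(f"*_{role}_*.err"))


# Demo against a throwaway directory; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = job_log_dir("3062824", root=tmp)
    log_dir.mkdir(parents=True)
    for name in ("node1_prefill_w0.err", "node2_decode_w0.err", "log.out"):
        (log_dir / name).touch()
    print(worker_logs(log_dir, "prefill"))  # ['node1_prefill_w0.err']
```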
## Usage
> [!NOTE]
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `ip addr show $NETWORK_INTERFACE` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions are always welcome.
**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
2. **Example with different GPU types**:
```bash
# For H100 with Dynamo (default)
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type h100
# For GB200 with SGLang
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type gb200 \
--use-sglang-commands
--gpus-per-node 4
```
3. **Monitor job progress**:
2. **Check logs in real-time**:
```bash
squeue -u $USER
cd logs/{JOB_ID}
tail -f *_prefill_*.err *_decode_*.err
```
4. **Check logs in real-time**:
```bash
tail -f logs/{JOB_ID}/log.out
```
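
The note about node counts in the steps above can be sketched as follows. This is an assumption about what `submit_job_script.py` does, not a copy of it: the sketch simply adds `--prefill-nodes` and `--decode-nodes`, and the real script may account for additional nodes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parse only the two flags named in the note above."""
    parser = argparse.ArgumentParser(description="Toy node-count calculation")
    parser.add_argument("--prefill-nodes", type=int, required=True)
    parser.add_argument("--decode-nodes", type=int, required=True)
    return parser


def total_nodes(prefill_nodes: int, decode_nodes: int) -> int:
    """Assumed rule: total = prefill + decode, each at least 1."""
    if prefill_nodes < 1 or decode_nodes < 1:
        raise ValueError("need at least one prefill and one decode node")
    return prefill_nodes + decode_nodes


args = build_parser().parse_args(["--prefill-nodes", "2", "--decode-nodes", "12"])
print(total_nodes(args.prefill_nodes, args.decode_nodes))  # 14
```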
## Configs directory
You can view logs of all prefill or decode workers simultaneously by running:
The `--config-dir` argument specifies the directory containing the configs used when running this model. Our directory currently contains:
1. `decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json`: `init-expert-location` for decode worker
2. `deepep_config.json`: DeepEP config file for GB200
3. `dgcache/`: DeepGEMM kernel cache directory. Instructions for creating this can be found [here](https://github.com/sgl-project/sglang/issues/9867#issuecomment-3336551174).
4. `prefill_dsr1-0528_in1000out1000_num40000.json`: `init-expert-location` for prefill worker
**Note**: The expert locations are collected using the instructions [here](https://github.com/sgl-project/sglang/issues/6017); see the section titled "Create expert distribution data". The expert locations are sensitive to your data, so performance results may differ if you don't benchmark with the same data that was used to collect them.
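
Before submitting a job, it can help to confirm that the directory passed to `--config-dir` contains the entries listed above. The check below is a hypothetical convenience, not part of the repository's scripts. The entry names are taken from the list above, and the helper name is an assumption.

```python
import tempfile
from pathlib import Path

# Entry names copied from the config list above.
EXPECTED_ENTRIES = [
    "decode_dsr1-0528_loadgen_in1024out1024_num2000_2p12d.json",
    "deepep_config.json",
    "dgcache",
    "prefill_dsr1-0528_in1000out1000_num40000.json",
]


def missing_config_entries(config_dir):
    """Return the expected files/directories that are absent."""
    root = Path(config_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


# Demo: a directory that only has the DeepEP config and the kernel cache.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "deepep_config.json").write_text("{}")
    (Path(tmp) / "dgcache").mkdir()
    for name in missing_config_entries(tmp):
        print("missing:", name)
```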
## Outputs
## Profiler
Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
If you provide the `--profiler` argument, the sbatch script will automatically warm up the model and run the vLLM benchmarking script. Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
| ai-dynamo-runtime | 0.5.1 | >=2.28 (Python 3.12 has known issues)| |
| NIXL | 0.4.1 | >=2.27 | >=11.8 |
### Build Dependency
...
...
@@ -69,7 +67,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| **TensorRT-LLM** | 1.1.0rc5 |
| **NIXL** | 0.4.1 |
| **vLLM** | 0.10.1.1 |
| **SGLang** | 0.5.0rc2 |
| **SGLang** | 0.5.3rc0 |
> [!Important]
> Specific versions of TensorRT-LLM supported by Dynamo are subject to change. TensorRT-LLM does not currently support Python 3.11, so installing `ai-dynamo[trtllm]` on Python 3.11 will fail.
...
...
@@ -79,14 +77,12 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
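
As a sketch of the workaround in the footnote above, the snippet below builds the `-p host:container` flags for the ports it lists (4222 for NATS, 2379/2380 for etcd, 8000 for the frontend). The one-to-one host-to-container mapping and the `<image>` placeholder are assumptions. Adjust both for your setup.

```python
# Ports named in the footnote above, grouped by service.
PORTS = {"nats": [4222], "etcd": [2379, 2380], "frontend": [8000]}


def publish_flags(ports):
    """Build `docker run` publish flags, mapping each port to itself."""
    flags = []
    for service_ports in ports.values():
        for port in service_ports:
            flags += ["-p", f"{port}:{port}"]
    return flags


# <image> is a placeholder, not a real image name.
print(" ".join(["docker", "run", *publish_flags(PORTS), "<image>"]))
```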
## Build Support
**Dynamo** currently provides build support in the following ways:
This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
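
To give a feel for what routing on KV-cache state means, here is a toy illustration. It is not Dynamo's router and does not use any Dynamo API: the `Replica` class, the prefix-overlap score, and the tie-break on active requests are all simplifying assumptions. The idea is only that a request is sent to the replica that already holds the longest matching token prefix, with load as a tie-breaker.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Replica:
    name: str
    active_requests: int = 0
    cached_prefixes: list = field(default_factory=list)

    def best_overlap(self, tokens):
        """Longest prefix of `tokens` already cached on this replica."""
        return max(
            (shared_prefix_len(p, tokens) for p in self.cached_prefixes),
            default=0,
        )


def pick_replica(replicas, tokens):
    """Prefer the largest cache overlap; break ties by lower load."""
    return max(replicas, key=lambda r: (r.best_overlap(tokens), -r.active_requests))


replicas = [
    Replica("replica-a", active_requests=3, cached_prefixes=[[1, 2, 3, 4]]),
    Replica("replica-b", active_requests=1, cached_prefixes=[[1, 2]]),
]
print(pick_replica(replicas, [1, 2, 3, 9]).name)  # replica-a (overlap 3 vs 2)
print(pick_replica(replicas, [7, 8]).name)        # replica-b (no overlap, less loaded)
```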
For more information about the core concepts, see: