OpenDAS / dynamo
Commit 40ca062f (Unverified)
Authored Jun 14, 2025 by Ryan McCormick; committed by GitHub Jun 13, 2025
docs: Add multi-node TRTLLM worker example (Deepseek R1) (#1511)
Parent: 382e3aed
Showing 6 changed files with 317 additions and 1 deletion
examples/tensorrt_llm/README.md (+7 -1)
examples/tensorrt_llm/configs/deepseek_r1/multinode/README.md (+174 -0)
examples/tensorrt_llm/configs/deepseek_r1/multinode/agg_DEP16_dsr1.yaml (+24 -0)
examples/tensorrt_llm/configs/deepseek_r1/multinode/srun_script.sh (+70 -0)
examples/tensorrt_llm/configs/deepseek_r1/multinode/start_frontend_services.sh (+16 -0)
examples/tensorrt_llm/configs/deepseek_r1/multinode/start_trtllm_worker.sh (+26 -0)
examples/tensorrt_llm/README.md
@@ -154,6 +154,12 @@ You can find the example Deepseek R1 configs for GB200
 [here](configs/deepseek_r1), but the config settings can be customized for testing
 other hardware configurations or parallelism strategies.
 
+This "multi-node" example demonstrates how to generally connect dynamo workers from
+different nodes, but for simplicity, each worker individually fits on a single node.
+For details on how to launch a worker that spans multiple nodes due to sheer model
+size, or for features like large scale expert parallelism, see the
+[multinode worker example](configs/deepseek_r1/multinode).
+
 ##### Head Node
 
 Start nats/etcd:
@@ -294,7 +300,7 @@ Remaining tasks:
 - [x] Add support for the disaggregated serving.
 - [x] Add multi-node support.
 - [x] Add instructions for benchmarking.
+- [x] Use processor from dynamo-llm framework.
 - [ ] Add integration test coverage.
 - [ ] Merge the code base with llm example to reduce the code duplication.
-- [ ] Use processor from dynamo-llm framework.
 - [ ] Enable NIXL integration with TensorRT-LLM once available. Currently, TensorRT-LLM uses UCX to transfer KV cache.
examples/tensorrt_llm/configs/deepseek_r1/multinode/README.md
0 → 100644
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
To run a single Dynamo+TRTLLM Worker that spans multiple nodes (e.g. TP16),
the set of nodes needs to be launched together in the same MPI world, such as
via `mpirun` or `srun`. This is true regardless of whether the worker is
aggregated, prefill-only, or decode-only.

In this document we will demonstrate an example of launching a multi-node TP16/EP16
aggregated worker on a slurm cluster with `srun`.

NOTE: Some of the scripts used in this example, like `start_frontend_services.sh` and
`start_trtllm_worker.sh`, should be translatable to other environments like Kubernetes, or
to using `mpirun` directly, with relative ease.

## Setup
For simplicity of the example, we will make some assumptions about your slurm cluster:
1. First, we assume you have access to a slurm cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes with a fast interconnect
between them, such as those in an NVL72 setup.
2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin set up. In particular, the `srun_script.sh` script in this
example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container-based plugins, you may be able to
modify the script to use those instead.
3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as
described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker).
This is the image that the `IMAGE` environment variable should be set to in later steps.
4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
will allocate 4 nodes below as a reference command. This is technically not
a requirement, but it makes iterating on tests and experiments easier when
you have a reserved set of nodes for a period of time. Make sure to set your
`PARTITION` and `ACCOUNT` according to your slurm cluster setup:
```bash
# Set partition manually based on your slurm cluster's partition names
PARTITION=""
# Set account manually if this command doesn't work on your cluster
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
salloc \
  --partition="${PARTITION}" \
  --account="${ACCOUNT}" \
  --job-name="${ACCOUNT}-dynamo.trtllm" \
  -t 05:00:00 \
  --nodes 4
```
5. Lastly, we will assume you are inside an interactive shell on one of your allocated
nodes, which should be the default behavior after executing the `salloc` command above.
If not, then you should SSH into one of the allocated nodes.
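
Before moving on, it can help to confirm that the current shell is really inside the allocation. The sketch below assumes Slurm has exported `SLURM_JOB_ID` and `SLURMD_NODENAME` into the interactive shell, since `srun_script.sh` reads both. The function name `check_slurm_env` is hypothetical and is not part of the example scripts.

```bash
# Hypothetical helper (not part of this example's scripts): report which of the
# Slurm variables read by srun_script.sh are missing from the current shell.
check_slurm_env() {
  slurm_env_status=0
  for slurm_env_name in SLURM_JOB_ID SLURMD_NODENAME; do
    # Indirect variable lookup that also works in plain POSIX sh.
    eval "slurm_env_value=\${${slurm_env_name}:-}"
    if [ -z "${slurm_env_value}" ]; then
      echo "ERROR: ${slurm_env_name} is not set in this shell." >&2
      slurm_env_status=1
    fi
  done
  return "${slurm_env_status}"
}
```

Calling `check_slurm_env` before `./srun_script.sh` would surface a missing variable early. If it fails, the variables may need to be set by hand for your cluster.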
## Launching Slurm Jobs
This example aims to automate as much of the environment setup as possible,
but all slurm clusters and environments are different, and you may need to
dive into the scripts to make modifications based on your specific environment.
Assuming you have already allocated at least 4 nodes via `salloc`, and are
inside an interactive shell on one of the allocated nodes:
```bash
# NOTE: IMAGE must be set manually for now
# To build an image, see the steps here:
# https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker
export IMAGE="<dynamo_trtllm_image>"

# NOTE: In general, Deepseek R1 is very large, so it is recommended to
# pre-download the model weights and save them in some shared location,
# NFS storage, HF_CACHE, etc. and set `MODEL_PATH` below
# to reuse the pre-downloaded weights instead.
#
# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
#
# On Hopper systems, FP4 isn't supported so you'll need to use the default weights:
# https://huggingface.co/deepseek-ai/DeepSeek-R1
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"

# NOTE: This path assumes you have mounted the config file into /mnt inside
# the container. See the MOUNTS variable in srun_script.sh
export ENGINE_CONFIG="/mnt/agg_DEP16_dsr1.yaml"

# Launches frontend + etcd/nats on current (head) node.
# Launches one large trtllm worker across multiple nodes via MPI tasks.
./srun_script.sh
```
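
Because `ENGINE_CONFIG` is a path inside the container, a typo there only shows up after the worker job has started. As a rough pre-flight sketch, the helpers below check the three exported variables and map the container path back to the host. This assumes the default `MOUNTS="$PWD:/mnt"` from `srun_script.sh`. Both `container_to_host_path` and `preflight` are hypothetical names, not part of the example.

```bash
# Hypothetical helper: map a container path under /mnt back to the host path,
# assuming the default MOUNTS="$PWD:/mnt" from srun_script.sh.
container_to_host_path() {
  case "$1" in
    /mnt/*) printf '%s/%s\n' "${PWD}" "${1#/mnt/}" ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: fail early if a required variable is empty or if the
# engine config cannot be found on the host side of the mount.
preflight() {
  for preflight_name in IMAGE MODEL_PATH ENGINE_CONFIG; do
    eval "preflight_value=\${${preflight_name}:-}"
    [ -n "${preflight_value}" ] || {
      echo "ERROR: ${preflight_name} is not set." >&2
      return 1
    }
  done
  preflight_config="$(container_to_host_path "${ENGINE_CONFIG}")" || {
    echo "WARNING: ${ENGINE_CONFIG} is not under /mnt; skipping file check." >&2
    return 0
  }
  [ -f "${preflight_config}" ] || {
    echo "ERROR: ${preflight_config} not found on the host." >&2
    return 1
  }
}
```

Running `preflight && ./srun_script.sh` from the directory that holds the config would then catch a missing file before any job step is launched. If you customize `MOUNTS`, the path mapping needs to change with it.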
## Understanding the Output
1. The `srun_script.sh` launches two `srun` jobs. The first launches
etcd, NATS, and the OpenAI frontend on the head node only, which is
called "node1" in the example output below. The second launches
a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, with each node
using 4 GPUs.
```
# Frontend/etcd/nats services
srun: launching StepId=453374.17 on host node1, 1 tasks: 0
...
# TP16 TRTLLM worker split across 4 nodes with 4 gpus each
srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3]
srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7]
srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11]
srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15]
```
2. The OpenAI frontend will listen for and dynamically discover workers as
they register themselves with Dynamo's distributed runtime:
```
0: 2025-06-13T02:36:48.160Z INFO dynamo_run::input::http: Watching for remote model at models
0: 2025-06-13T02:36:48.161Z INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
```
3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each
GPU on each node, which will each output their progress while loading the model.
You can see each rank's output prefixed with the rank at the start of each log line
until the model successfully finishes loading:
```
8: rank8 run mgmn worker node with mpi_world_size: 16 ...
10: rank10 run mgmn worker node with mpi_world_size: 16 ...
9: rank9 run mgmn worker node with mpi_world_size: 16 ...
11: rank11 run mgmn worker node with mpi_world_size: 16 ...
...
15: Model init total -- 55.42s
11: Model init total -- 55.91s
12: Model init total -- 55.24s
```
4. After the model fully finishes loading on all ranks, the worker will register itself,
and the OpenAI frontend will detect it, signaled by this output:
```
0: 2025-06-13T02:46:35.040Z INFO dynamo_llm::discovery::watcher: added model model_name="Deepseek-R1-FP4"
```
5. At this point, with the worker fully initialized and detected by the frontend,
it is now ready for inference.
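
The rank-to-node layout shown in step 1 can be reproduced with a little arithmetic, which is useful when changing `NUM_NODES` or `NUM_GPUS_TOTAL` in `srun_script.sh`. This is a sketch only: it assumes ranks are placed on nodes in contiguous, equally sized blocks, as in the example output, and `rank_range` is a hypothetical helper name.

```bash
# Hypothetical helper: print the MPI rank range hosted by one node, assuming
# ranks are assigned to nodes in contiguous, equally sized blocks.
rank_range() {
  rr_node_index="$1"  # zero-based index of the node within the allocation
  rr_num_nodes="$2"   # NUM_NODES in srun_script.sh
  rr_num_tasks="$3"   # NUM_GPUS_TOTAL in srun_script.sh
  if [ $((rr_num_tasks % rr_num_nodes)) -ne 0 ]; then
    echo "ERROR: ${rr_num_tasks} tasks do not divide evenly across ${rr_num_nodes} nodes" >&2
    return 1
  fi
  rr_per_node=$((rr_num_tasks / rr_num_nodes))
  rr_first=$((rr_node_index * rr_per_node))
  rr_last=$((rr_first + rr_per_node - 1))
  echo "${rr_first}-${rr_last}"
}
```

With 4 nodes and 16 tasks, `rank_range 0 4 16` through `rank_range 3 4 16` give `0-3`, `4-7`, `8-11`, and `12-15`, matching the task lists in the `srun` output above.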
## Example Request
To verify the deployed model is working, send a `curl` request:
```bash
# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
HOST=localhost
PORT=8000
MODEL=Deepseek-R1-FP4
curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "'${MODEL}'",
  "messages": [
  {
    "role": "user",
    "content": "Tell me a story as if we were playing dungeons and dragons."
  }
  ],
  "stream": true,
  "max_tokens": 30
}'
```
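
The `MODEL` value in the request above matches the `model_name` that the frontend logged when it discovered the worker. If you serve different weights, that name will presumably differ. As a small sketch, the filter below pulls the name out of the frontend's "added model" log lines. It assumes the log format shown earlier, and `registered_models` is a hypothetical helper name.

```bash
# Hypothetical helper: read frontend log lines on stdin and print the
# model_name value from every "added model" line.
registered_models() {
  sed -n 's/.*added model model_name="\([^"]*\)".*/\1/p'
}
```

For example, if you redirected the frontend output to a file with `srun --output` (the file name `frontend.log` here is an assumption), `MODEL="$(registered_models < frontend.log | tail -n 1)"` would pick up the most recently registered model.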
## Cleanup
To clean up background `srun` processes launched by `srun_script.sh`, you can run:
```bash
pkill srun
```
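
Note that `pkill srun` matches every `srun` process it is allowed to signal on the node, not only the two from this example. A narrower alternative, sketched below, is to record the PIDs of the background commands you start and signal only those. This assumes you launch the `srun` commands from a shell that stays alive, unlike `srun_script.sh`, which backgrounds them and returns. `launch_bg` and `cleanup_bg` are hypothetical names.

```bash
# Hypothetical helpers: remember the PIDs of background commands started from
# this shell so that cleanup can target only them.
BG_PIDS=""

launch_bg() {
  "$@" &
  BG_PIDS="${BG_PIDS} $!"
}

cleanup_bg() {
  for bg_pid in ${BG_PIDS}; do
    # Send SIGTERM, then reap the child so it does not linger as a zombie.
    kill "${bg_pid}" 2>/dev/null || true
    wait "${bg_pid}" 2>/dev/null || true
  done
  BG_PIDS=""
}
```

A wrapper could call `launch_bg srun ...` for each job step and register `cleanup_bg` with `trap cleanup_bg EXIT` so the steps are torn down when the wrapper exits.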
## Known Issues
- This example has only been tested on a 4xGB200 node setup with 16 GPUs using
FP4 weights. In theory, the example should work on alternative setups such as
H100 nodes with FP8 weights, but this hasn't been tested yet.
- This example only tests an aggregated model setup for now. A disaggregated
serving example will be added in the near future.
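
When adapting the example to a different node or GPU count, one easy mistake is changing `NUM_GPUS_TOTAL` in `srun_script.sh` without updating the engine config, or the reverse. Since this example runs one MPI rank per GPU for a TP16 worker, the sketch below compares `tensor_parallel_size` from the config with the task count. It is a plain-text match on flat `key: value` lines, not a real YAML parser, and both function names are hypothetical.

```bash
# Hypothetical helper: print the value of a top-level "key: value" entry from a
# flat YAML file. Nested (indented) keys and trailing comments are not handled.
yaml_value() {
  awk -v key="$2" -F': *' '$1 == key { print $2; exit }' "$1"
}

# Hypothetical helper: succeed only when tensor_parallel_size in the config
# file ($1) equals the number of MPI tasks ($2).
check_tp_matches_tasks() {
  tp_size="$(yaml_value "$1" tensor_parallel_size)"
  [ "${tp_size}" = "$2" ] || {
    echo "ERROR: tensor_parallel_size=${tp_size:-unset} but $2 tasks requested" >&2
    return 1
  }
}
```

For the configuration in this example, `check_tp_matches_tasks agg_DEP16_dsr1.yaml 16` should succeed. Setups that add other parallelism dimensions would need a different check.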
examples/tensorrt_llm/configs/deepseek_r1/multinode/agg_DEP16_dsr1.yaml
0 → 100644
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
backend: pytorch
tensor_parallel_size: 16
moe_expert_parallel_size: 16
enable_attention_dp: true
max_batch_size: 256
max_num_tokens: 256
max_seq_len: 8448
kv_cache_config:
  free_gpu_memory_fraction: 0.8
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
kv_cache_dtype: fp8
examples/tensorrt_llm/configs/deepseek_r1/multinode/srun_script.sh
0 → 100755
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# This is one of the only variables that must be set currently, most of the rest may
# just work out of the box if following the steps in the README.
IMAGE="${IMAGE:-""}"

# Set to mount current host directory to /mnt inside the container as an example,
# but you may freely customize the mounts based on your cluster. A common practice
# is to mount paths to NFS storage for common scripts, model weights, etc.
# NOTE: This can be a comma separated list of multiple mounts as well.
MOUNTS="$PWD:/mnt"

# Example values, assuming 4 nodes with 4 GPUs on each node, such as 4xGB200 nodes.
# For 8xH100 nodes as an example, you may set this to 2 nodes x 16 gpus, or 4 nodes x 32 gpus instead.
NUM_NODES=4
NUM_GPUS_TOTAL=16

# Automate settings of certain variables for convenience, but you are free
# to manually set these for more control as well.
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
export HEAD_NODE="${SLURMD_NODENAME}"
export HEAD_NODE_IP="$(hostname -i)"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
export NATS_SERVER="${HEAD_NODE_IP}:4222"

if [[ -z ${IMAGE} ]]; then
  echo "ERROR: You need to set the IMAGE environment variable to the " \
       "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \
       "See how to build one from source here: " \
       "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker"
  exit 1
fi

# NOTE: Output streamed to stdout for ease of understanding the example, but
# in practice you would probably set `srun --output ... --error ...` to pipe
# the stdout/stderr to files.
echo "Launching frontend services in background."
srun \
  --overlap \
  --container-image "${IMAGE}" \
  --container-mounts "${MOUNTS}" \
  --verbose \
  --label \
  -A "${ACCOUNT}" \
  -J "${ACCOUNT}-dynamo.trtllm" \
  --nodelist "${HEAD_NODE}" \
  --nodes 1 \
  --jobid "${SLURM_JOB_ID}" \
  /mnt/start_frontend_services.sh &

# NOTE: Output streamed to stdout for ease of understanding the example, but
# in practice you would probably set `srun --output ... --error ...` to pipe
# the stdout/stderr to files.
echo "Launching multi-node worker in background."
srun \
  --mpi pmix \
  --oversubscribe \
  --container-image "${IMAGE}" \
  --container-mounts "${MOUNTS}" \
  --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \
  --verbose \
  --label \
  -A "${ACCOUNT}" \
  -J "${ACCOUNT}-dynamo.trtllm" \
  --nodes "${NUM_NODES}" \
  --ntasks "${NUM_GPUS_TOTAL}" \
  --jobid "${SLURM_JOB_ID}" \
  /mnt/start_trtllm_worker.sh &
examples/tensorrt_llm/configs/deepseek_r1/multinode/start_frontend_services.sh
0 → 100755
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Start NATS
nats-server -js &

# Start etcd
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &

# Wait for NATS/etcd to startup
sleep 3

# Start OpenAI Frontend which will dynamically discover workers when they startup
# NOTE: This is a blocking call.
dynamo-run in=http out=dyn --http-port 8000
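
The fixed `sleep 3` in `start_frontend_services.sh` is a simple way to wait for NATS and etcd, but it may be too short on a slow node. One possible alternative, shown as a sketch only, is a small retry loop around a health check. `retry_until` is a hypothetical helper, and the health-check command you pass to it depends on which tools are available inside your container.

```bash
# Hypothetical helper: run a command until it succeeds, trying at most
# max_attempts times and sleeping delay_seconds between attempts.
retry_until() {
  retry_max_attempts="$1"
  retry_delay_seconds="$2"
  shift 2
  retry_attempt=1
  while ! "$@"; do
    if [ "${retry_attempt}" -ge "${retry_max_attempts}" ]; then
      return 1
    fi
    retry_attempt=$((retry_attempt + 1))
    sleep "${retry_delay_seconds}"
  done
}
```

For instance, `retry_until 30 1 etcdctl endpoint health` would poll etcd for up to roughly 30 seconds, assuming `etcdctl` is installed in the image.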
examples/tensorrt_llm/configs/deepseek_r1/multinode/start_trtllm_worker.sh
0 → 100755
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

if [[ -z ${MODEL_PATH} ]]; then
  echo "ERROR: MODEL_PATH was not set."
  echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally " \
       "downloaded path to the model weights. Since Deepseek R1 is large, it is " \
       "recommended to pre-download them to a shared location and provide the path."
  exit 1
fi

if [[ -z ${ENGINE_CONFIG} ]]; then
  echo "ERROR: ENGINE_CONFIG was not set."
  echo "ERROR: ENGINE_CONFIG must be set to a valid Dynamo+TRTLLM engine config file."
  exit 1
fi

# NOTE: trtllm_inc.py is a standalone python script that launches a Dynamo+TRTLLM
# worker and registers itself with the runtime. It is currently easier to wrap
# this standalone script with `trtllm-llmapi-launch` for MPI handling purposes,
# but this may be refactored into 'dynamo serve' in the future.
trtllm-llmapi-launch \
  python3 /workspace/launch/dynamo-run/src/subprocess/trtllm_inc.py \
    --model-path "${MODEL_PATH}" \
    --extra-engine-args "${ENGINE_CONFIG}"