docs: Add multi-node TRTLLM worker example (Deepseek R1) (#1511)

Unverified Commit 40ca062f authored Jun 14, 2025 by Ryan McCormick Committed by GitHub Jun 13, 2025
6 changed files
--- a/examples/tensorrt_llm/README.md
+++ b/examples/tensorrt_llm/README.md
@@ -154,6 +154,12 @@ You can find the example Deepseek R1 configs for GB200
 [here](configs/deepseek_r1), but the config settings can be customized for testing
 other hardware configurations or parallelism strategies.
+This "multi-node" example demonstrates how to generally connect dynamo workers from
+different nodes, but for simplicity, each worker individually fits on a single node.
+For details on how to launch a worker that spans multiple nodes due to sheer model
+size, or for features like large scale expert parallelism, see the
+[multinode worker example](configs/deepseek_r1/multinode).
 ##### Head Node
 Start nats/etcd:
@@ -294,7 +300,7 @@ Remaining tasks:
 - [x] Add support for the disaggregated serving.
 - [x] Add multi-node support.
 - [x] Add instructions for benchmarking.
+- [x] Use processor from dynamo-llm framework.
 - [ ] Add integration test coverage.
 - [ ] Merge the code base with llm example to reduce the code duplication.
-- [ ] Use processor from dynamo-llm framework.
 - [ ] Enable NIXL integration with TensorRT-LLM once available. Currently, TensorRT-LLM uses UCX to transfer KV cache.
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/README.md
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
+To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),
+the set of nodes needs to be launched together in the same MPI world, such as
+via `mpirun` or `srun`. This is true regardless of whether the worker is
+aggregated, prefill-only, or decode-only.
+In this document we will demonstrate an example of launching a multi-node TP16/EP16
+aggregated worker on a slurm cluster with `srun`.
+NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
+`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
+using `mpirun` directly, with relative ease.
+## Setup
+For simplicity of the example, we will make some assumptions about your slurm cluster:
+1. First, we assume you have access to a slurm cluster with multiple GPU nodes
+   available. For functional testing, most setups should be fine. For performance
+   testing, you should aim to allocate groups of nodes that share a
+   high-bandwidth interconnect, such as those in an NVL72 setup.
+2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
+   SPANK plugin set up. In particular, the `srun_script.sh` script in this
+   example will use `srun` arguments like `--container-image`,
+   `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
+   If your cluster supports similar container based plugins, you may be able to
+   modify the script to use that instead.
+3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as
+   described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker).
+   This is the image that can be set to the `IMAGE` environment variable in later steps.
+4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
+   will allocate 4 nodes below as a reference command. This is technically not
+   a requirement, but makes iterations of testing/experimenting easier when
+   you have a reserved set of nodes for a period of time. Make sure to set your
+   `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
+    ```bash
+    # Set partition manually based on your slurm cluster's partition names
+    PARTITION=""
+    # Set account manually if this command doesn't work on your cluster
+    ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
+    salloc \
+      --partition="${PARTITION}" \
+      --account="${ACCOUNT}" \
+      --job-name="${ACCOUNT}-dynamo.trtllm" \
+      -t 05:00:00 \
+      --nodes 4
+    ```
+5. Lastly, we will assume you are inside an interactive shell on one of your allocated
+   nodes, which should be the default behavior after executing the `salloc` command above.
+   If not, then you should SSH into one of the allocated nodes.
+## Launching Slurm Jobs
+This example aims to automate as much of the environment setup as possible,
+but all slurm clusters and environments are different, and you may need to
+dive into the scripts to make modifications based on your specific environment.
+Assuming you have already allocated at least 4 nodes via `salloc`, and are
+inside an interactive shell on one of the allocated nodes:
+```bash
+# NOTE: IMAGE must be set manually for now
+# To build an image, see the steps here:
+# https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker
+export IMAGE="<dynamo_trtllm_image>"
+# NOTE: In general, Deepseek R1 is very large, so it is recommended to
+# pre-download the model weights and save them in some shared location,
+# NFS storage, HF_CACHE, etc. and modify the `--model-path` below
+# to reuse the pre-downloaded weights instead.
+#
+# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
+# https://huggingface.co/nvidia/DeepSeek-R1-FP4
+#
+# On Hopper systems, FP4 isn't supported so you'll need to use the default weights:
+# https://huggingface.co/deepseek-ai/DeepSeek-R1
+export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
+# NOTE: This path assumes you have mounted the config file into /mnt inside
+# the container. See the MOUNTS variable in srun_script.sh
+export ENGINE_CONFIG="/mnt/agg_DEP16_dsr1.yaml"
+# Launches frontend + etcd/nats on current (head) node.
+# Launches one large trtllm worker across multiple nodes via MPI tasks.
+./srun_script.sh
+```
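The comments above recommend pre-downloading the weights to a shared location, and `srun_script.sh` notes that `MOUNTS` accepts a comma-separated list. A minimal sketch of combining the two follows. The host path `/shared/models/DeepSeek-R1-FP4`, the container path `/models/DeepSeek-R1-FP4`, and the `join_mounts` helper are hypothetical names chosen for illustration, and `srun_script.sh` currently hard-codes `MOUNTS`, so the value would need to be edited there.

```shell
#!/bin/bash
# Hypothetical helper: join "SRC:DST" pairs with commas, the list format
# that srun_script.sh passes to `--container-mounts`.
join_mounts() {
  local IFS=","
  echo "$*"
}

# Assumption: weights were pre-downloaded to this shared host directory.
MODEL_DIR_HOST="/shared/models/DeepSeek-R1-FP4"

# Mount the current directory (scripts and engine config) plus the weights.
MOUNTS="$(join_mounts "${PWD}:/mnt" "${MODEL_DIR_HOST}:/models/DeepSeek-R1-FP4")"

# Point the worker at the path as seen from inside the container.
export MODEL_PATH="/models/DeepSeek-R1-FP4"

echo "MOUNTS=${MOUNTS}"
```

With this layout, `MODEL_PATH` refers to the in-container path rather than the HuggingFace ID, so the weights are read from the shared mount.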
+## Understanding the Output
+1. The `srun_script.sh` launches two `srun` jobs. The first launches
+   etcd, NATS, and the OpenAI frontend on the head node only, which is
+   called "node1" in the example output below. The second launches
+   a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, with each node
+   using 4 GPUs.
+    ```
+    # Frontend/etcd/nats services
+    srun: launching StepId=453374.17 on host node1, 1 tasks: 0
+    ...
+    # TP16 TRTLLM worker split across 4 nodes with 4 gpus each
+    srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3]
+    srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7]
+    srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11]
+    srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15]
+    ```
+2. The OpenAI frontend will listen for and dynamically discover workers as
+   they register themselves with Dynamo's distributed runtime:
+   ```
+   0: 2025-06-13T02:36:48.160Z  INFO dynamo_run::input::http: Watching for remote model at models
+   0: 2025-06-13T02:36:48.161Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
+   ```
+3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each
+   GPU on each node, which will each output their progress while loading the model.
+   You can see each rank's output prefixed with the rank at the start of each log line
+   until the model successfully finishes loading:
+    ```
+     8: rank8 run mgmn worker node with mpi_world_size: 16 ...
+    10: rank10 run mgmn worker node with mpi_world_size: 16 ...
+     9: rank9 run mgmn worker node with mpi_world_size: 16 ...
+    11: rank11 run mgmn worker node with mpi_world_size: 16 ...
+    ...
+    15: Model init total -- 55.42s
+    11: Model init total -- 55.91s
+    12: Model init total -- 55.24s
+    ```
+4. After the model fully finishes loading on all ranks, the worker will register itself,
+   and the OpenAI frontend will detect it, signaled by this output:
+    ```
+    0: 2025-06-13T02:46:35.040Z  INFO dynamo_llm::discovery::watcher: added model model_name="Deepseek-R1-FP4"
+    ```
+5. At this point, with the worker fully initialized and detected by the frontend,
+   it is now ready for inference.
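Since model loading can take several minutes, it may be convenient to poll for readiness instead of watching the logs. The sketch below retries a check until it succeeds. It assumes the frontend lists registered models at `/v1/models` (not confirmed by this example), and the `retry` and `model_is_listed` helpers are hypothetical names.

```shell
#!/bin/bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds; returns 1 if every attempt fails.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${delay}"
  done
  return 1
}

# Assumption: the OpenAI frontend exposes /v1/models and the response
# contains the registered model name.
model_is_listed() {
  curl -sf "http://${HOST:-localhost}:${PORT:-8000}/v1/models" \
    | grep -q "${MODEL:-Deepseek-R1-FP4}"
}

# Example usage: poll every 10 seconds for up to 10 minutes.
# retry 60 10 model_is_listed && echo "Model is ready for inference."
```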
+## Example Request
+To verify the deployed model is working, send a `curl` request:
+```bash
+# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
+HOST=localhost
+PORT=8000
+MODEL=Deepseek-R1-FP4
+curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+  "model": "'${MODEL}'",
+  "messages": [
+  {
+    "role": "user",
+    "content": "Tell me a story as if we were playing dungeons and dragons."
+  }
+  ],
+  "stream": true,
+  "max_tokens": 30
+}'
+```
+## Cleanup
+To clean up background `srun` processes launched by `srun_script.sh`, you can run:
+```bash
+pkill srun
+```
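Note that `pkill srun` stops every `srun` process owned by the current user, including ones from unrelated jobs. A narrower alternative is to record the PIDs of the background commands when they are launched and stop only those. The following is a sketch of that pattern, using `sleep` as a stand-in for the two `srun ... &` calls; `launch_bg` and `cleanup_bg` are hypothetical helpers that are not part of `srun_script.sh`.

```shell
#!/bin/bash
# Space-separated PIDs of background commands started by this shell.
PIDS=""

# Start a command in the background and remember its PID.
launch_bg() {
  "$@" &
  PIDS="${PIDS} $!"
}

# Stop and reap only the commands started via launch_bg.
cleanup_bg() {
  local pid
  for pid in ${PIDS}; do
    kill "${pid}" 2>/dev/null || true
    wait "${pid}" 2>/dev/null || true
  done
  PIDS=""
}

# Stand-ins for the frontend and worker `srun` commands.
launch_bg sleep 60
launch_bg sleep 60
```

Calling `cleanup_bg` from the same shell (or from a `trap cleanup_bg EXIT`) then stops just these two processes.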
+## Known Issues
+- This example has only been tested on a 4xGB200 node setup with 16 GPUs using
+  FP4 weights. In theory, the example should work on alternative setups such as
+  H100 nodes with FP8 weights, but this hasn't been tested yet.
+- This example only tests an aggregated model setup for now. A disaggregated
+  serving example will be added in the near future.
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/agg_DEP16_dsr1.yaml
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/agg_DEP16_dsr1.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+backend: pytorch
+tensor_parallel_size: 16
+moe_expert_parallel_size: 16
+enable_attention_dp: true
+max_batch_size: 256
+max_num_tokens: 256
+max_seq_len: 8448
+kv_cache_config:
+  free_gpu_memory_fraction: 0.8
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+kv_cache_dtype: fp8
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/srun_script.sh
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/srun_script.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# This is one of the only variables that must be set currently. Most of the rest may
+# just work out of the box if following the steps in the README.
+IMAGE="${IMAGE:-""}"
+# Set to mount current host directory to /mnt inside the container as an example,
+# but you may freely customize the mounts based on your cluster. A common practice
+# is to mount paths to NFS storage for common scripts, model weights, etc.
+# NOTE: This can be a comma separated list of multiple mounts as well.
+MOUNTS="$PWD:/mnt"
+# Example values, assuming 4 nodes with 4 GPUs on each node, such as 4xGB200 nodes.
+# For 8xH100 nodes as an example, you may instead use 2 nodes with 16 GPUs total, or 4 nodes with 32 GPUs total.
+NUM_NODES=4
+NUM_GPUS_TOTAL=16
+# Automate settings of certain variables for convenience, but you are free
+# to manually set these for more control as well.
+ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
+export HEAD_NODE="${SLURMD_NODENAME}"
+export HEAD_NODE_IP="$(hostname -i)"
+export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
+export NATS_SERVER="${HEAD_NODE_IP}:4222"
+if [[ -z ${IMAGE} ]]; then
+  echo "ERROR: You need to set the IMAGE environment variable to the " \
+       "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import'. " \
+       "See how to build one from source here: " \
+       "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker"
+  exit 1
+fi
+# NOTE: Output streamed to stdout for ease of understanding the example, but
+# in practice you would probably set `srun --output ... --error ...` to pipe
+# the stdout/stderr to files.
+echo "Launching frontend services in background."
+srun \
+  --overlap \
+  --container-image "${IMAGE}" \
+  --container-mounts "${MOUNTS}" \
+  --verbose \
+  --label \
+  -A "${ACCOUNT}" \
+  -J "${ACCOUNT}-dynamo.trtllm" \
+  --nodelist "${HEAD_NODE}" \
+  --nodes 1 \
+  --jobid "${SLURM_JOB_ID}" \
+  /mnt/start_frontend_services.sh &
+# NOTE: Output streamed to stdout for ease of understanding the example, but
+# in practice you would probably set `srun --output ... --error ...` to pipe
+# the stdout/stderr to files.
+echo "Launching multi-node worker in background."
+srun \
+  --mpi pmix \
+  --oversubscribe \
+  --container-image "${IMAGE}" \
+  --container-mounts "${MOUNTS}" \
+  --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \
+  --verbose \
+  --label \
+  -A "${ACCOUNT}" \
+  -J "${ACCOUNT}-dynamo.trtllm" \
+  --nodes "${NUM_NODES}" \
+  --ntasks "${NUM_GPUS_TOTAL}" \
+  --jobid "${SLURM_JOB_ID}" \
+  /mnt/start_trtllm_worker.sh &
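The worker runs one MPI rank per GPU, so `NUM_GPUS_TOTAL` is expected to match `tensor_parallel_size` in the engine config (16 in this example). A minimal pre-launch sanity check could look like the sketch below. It assumes a pipeline parallel size of 1, a flat `key: value` YAML layout like `agg_DEP16_dsr1.yaml`, and an even split of ranks across nodes; `get_tp_size` and `check_world_size` are hypothetical helpers.

```shell
#!/bin/bash
# Print the value of `tensor_parallel_size` from a flat YAML file.
get_tp_size() {
  awk -F': *' '$1 == "tensor_parallel_size" { print $2; exit }' "$1"
}

# check_world_size NUM_NODES NUM_GPUS_TOTAL ENGINE_CONFIG
check_world_size() {
  local nodes="$1" gpus_total="$2" config="$3"
  local tp
  tp="$(get_tp_size "${config}")"
  if [ -z "${tp}" ]; then
    echo "ERROR: tensor_parallel_size not found in ${config}" >&2
    return 1
  fi
  if [ $((gpus_total % nodes)) -ne 0 ]; then
    echo "ERROR: ${gpus_total} ranks do not divide evenly across ${nodes} nodes" >&2
    return 1
  fi
  if [ "${tp}" -ne "${gpus_total}" ]; then
    echo "ERROR: tensor_parallel_size=${tp} but ${gpus_total} ranks requested" >&2
    return 1
  fi
  echo "OK: ${nodes} nodes x $((gpus_total / nodes)) ranks per node = ${gpus_total} ranks"
}

# Example usage before the worker `srun`:
# check_world_size "${NUM_NODES}" "${NUM_GPUS_TOTAL}" ./agg_DEP16_dsr1.yaml || exit 1
```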
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/start_frontend_services.sh
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/start_frontend_services.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Start NATS
+nats-server -js &
+# Start etcd
+etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
+# Wait for NATS/etcd to start up
+sleep 3
+# Start OpenAI Frontend which will dynamically discover workers when they start up
+# NOTE: This is a blocking call.
+dynamo-run in=http out=dyn --http-port 8000
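The fixed `sleep 3` above may be too short on a slow node. One alternative is to wait until the NATS and etcd ports accept connections. The sketch below relies on bash's `/dev/tcp` redirection, so it assumes the script runs under bash; `wait_for_port` is a hypothetical helper, and ports 4222 and 2379 are taken from `srun_script.sh`.

```shell
#!/bin/bash
# wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 once HOST:PORT accepts a TCP connection, or 1 after the timeout.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-30}"
  local waited=0
  # bash-specific: opening /dev/tcp/<host>/<port> attempts a TCP connection.
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    if [ "${waited}" -ge "${timeout}" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example usage in place of `sleep 3`:
# wait_for_port localhost 4222 && wait_for_port localhost 2379 || exit 1
```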
--- a/examples/tensorrt_llm/configs/deepseek_r1/multinode/start_trtllm_worker.sh
+++ b/examples/tensorrt_llm/configs/deepseek_r1/multinode/start_trtllm_worker.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+if [[ -z ${MODEL_PATH} ]]; then
+    echo "ERROR: MODEL_PATH was not set."
+    echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally " \
+         "downloaded path to the model weights. Since Deepseek R1 is large, it is " \
+         "recommended to pre-download them to a shared location and provide the path."
+    exit 1
+fi
+if [[ -z ${ENGINE_CONFIG} ]]; then
+    echo "ERROR: ENGINE_CONFIG was not set."
+    echo "ERROR: ENGINE_CONFIG must be set to a valid Dynamo+TRTLLM engine config file."
+    exit 1
+fi
+# NOTE: trtllm_inc.py is a standalone python script that launches a Dynamo+TRTLLM
+# worker and registers itself with the runtime. It is currently easier to wrap
+# this standalone script with `trtllm-llmapi-launch` for MPI handling purposes,
+# but this may be refactored into 'dynamo serve' in the future.
+trtllm-llmapi-launch \
+  python3 /workspace/launch/dynamo-run/src/subprocess/trtllm_inc.py \
+    --model-path "${MODEL_PATH}" \
+    --extra-engine-args "${ENGINE_CONFIG}"