README.md 13.2 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# LLM Deployment Examples using TensorRT-LLM

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

22
23
24
25
26
27
28
29
30
31
32
## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
33
34
35
36
37
38

## Deployment Architectures

See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture.
Note that this TensorRT-LLM version does not support all the options yet.

39
40
41
42
43
44
45
46
Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregate or disaggregated serving.

## Getting Started

1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts

47
48
### Prerequisites

49
Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
50
```bash
51
docker compose -f deploy/metrics/docker-compose.yml up -d
52
53
54
55
56
```

### Build docker

```bash
57
58
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
59

60
# On an x86 machine:
61
./container/build.sh --framework tensorrtllm
62
63
64

# On an ARM machine:
./container/build.sh --framework tensorrtllm --platform linux/arm64
65
66
67
68
69

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
70
71
```

72
73
74
> [!NOTE]
> Because of a known issue of C++11 ABI compatibility within the NGC pytorch container,
> we rebuild TensorRT-LLM from source. See [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html)
Akash's avatar
Akash committed
75
> for more information.
76
77
78
>
> Hence, when running this script for the first time, the time taken by this script can be
> quite long.
79

80

81
82
83
84
85
86
87
### Run container

```
./container/run.sh --framework tensorrtllm -it
```
## Run Deployment

88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
This figure shows an overview of the major components to deploy:



```

+------+      +-----------+      +------------------+             +---------------+
| HTTP |----->| processor |----->|      Worker      |------------>|     Prefill   |
|      |<-----|           |<-----|                  |<------------|     Worker    |
+------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+

```

Note: The above architecture illustrates all the components. The final components
that get spawned depend upon the chosen graph.

111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
### Example architectures

#### Aggregated serving
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
```

#### Aggregated serving with KV Routing
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg_router:Frontend -f ./configs/agg_router.yaml
```

#### Disaggregated serving
```bash
127
cd /workspace/examples/tensorrt_llm
128
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
129
130
131
132
```

#### Disaggregated serving with KV Routing
```bash
133
cd /workspace/examples/tensorrt_llm
134
dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
135
```
136

137
#### Aggregated serving with Multi-Token Prediction (MTP) and DeepSeek R1
138
139
140
141
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
```
142

143
Notes:
144
145
146
147
- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script.

  Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit`

148
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
149
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
150

151
152
153
154
155
156
157
158
#### Multi-Node Disaggregated Serving

In the following example, we will demonstrate how to run a Disaggregated Serving
deployment across multiple nodes. For simplicity, we will demonstrate how to
deploy a single Decode worker on one node, and a single Prefill worker on the other node.
However, the instance counts, TP sizes, other configs, and responsibilities of each node
can be customized and deployed in similar ways.

159
160
161
162
163
164
165
For example, to deploy Deepseek R1, you could replace the referenced example
configs (`configs/agg.yaml`, `configs/disagg.yaml`) with corresponding Deepseek R1
example configs (`configs/deepseek_r1/agg.yaml`, `configs/deepseek_r1/disagg.yaml`).
You can find the example Deepseek R1 configs for GB200
[here](configs/deepseek_r1), but the config settings can be customized for testing
other hardware configurations or parallelism strategies.

166
167
168
169
170
171
This "multi-node" example demonstrates how to generally connect dynamo workers from
different nodes, but for simplicity, each worker individually fits on a single node.
For details on how to launch a worker that spans multiple nodes due to sheer model
size, or for features like large scale expert parallelism, see the
[multinode worker example](configs/deepseek_r1/multinode).

172
173
174
175
176
177
178
179
180
181
182
183
184
185
##### Head Node

Start nats/etcd:
```bash
# NATS data persisted to /tmp/nats/jetstream by default
nats-server -js &

# Persist data to /tmp/etcd, otherwise defaults to ${PWD}/default.etcd if left unspecified
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &

# NOTE: Clearing out the etcd and nats jetstream data directories across runs
#       helps to guarantee a clean and reproducible results.
```

186
Launch graph of Frontend and TensorRTLLMWorker (decode) on head node:
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218

```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f ./configs/disagg.yaml &
```

Notes:
- The aggregated graph (`graphs.agg`) is chosen here because it also describes
  our desired deployment settings for the head node: launching the utility components
  (Frontend, Processor), and only the decode worker (TensorRTLLMWorker configured with
  `remote-prefill` enabled). We plan to launch the `TensorRTLLMPrefillWorker`
  independently on a separate node in the next step of this demonstration.
  You are free to customize the graph and configuration of components launched on
  each node.
- The disaggregated config `configs/disagg.yaml` is intentionally chosen here as a
  single source of truth to be used for deployments on all of our nodes, describing
  the configurations for all of our components, including both decode and prefill
  workers, but can be customized based on your deployment needs.

##### Worker Node(s)

Set environment variables pointing at the etcd/nats endpoints on the head node
so the Dynamo Distributed Runtime can orchestrate communication and
discoverability between the head node and worker nodes:
```bash
# if not head node
export HEAD_NODE_IP="<head-node-ip>"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```

Deploy a Prefill worker:
219
```bash
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
cd /workspace/examples/tensorrt_llm
dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker &
```

Now you have a 2-node deployment with 1 Decode worker on the head node, and 1 Prefill worker on a worker node!

##### Additional Notes for Multi-Node Deployments

Notes:
- To include a router in this deployment, change the graph to one that includes the router, such as `graphs.agg_router`,
  and change the config to one that includes the router, such as `configs/disagg_router.yaml`
- This step is assuming you're disaggregated serving and planning to launch prefill workers on separate nodes.
  Howerver, for an aggregated deployment with additional aggregated worker replicas on other nodes, this step
  remains mostly the same. The primary difference between aggregation and disaggregation for this step is
  whether or not the `TensorRTLLMWorker` is configured to do `remote-prefill` or not in the config file
  (ex: `configs/disagg.yaml` vs `configs/agg.yaml`).
- To apply the same concept for launching additional decode workers on worker nodes, you can
  directly start them, similar to the prefill worker step above:
  ```bash
  # Example: deploy decode worker only
  cd /workspace/examples/tensorrt_llm
  dynamo serve components.worker:TensorRTLLMWorker -f ./configs/disagg.yaml --service-name TensorRTLLMWorker &
  ```
243
244
245
246
247
248
249
250
- If you see an error about MPI Spawn failing during TRTLLM Worker initialziation on a Slurm-based cluster,
  try unsetting the following environment variables before launching the TRTLLM worker. If you intend to
  run other slurm-based commands or processes on the same node after deploying the TRTLLM worker, you may
  want to save these values into temporary variables and then restore them afterwards.
  ```bash
  # Workaround for error: `mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes`
  unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST
  ```
251

252
#### Multi-Node Disaggregated Serving with Multi-Token Prediction (MTP) and DeepSeek R1
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286

Most of the steps remain the same as the above example, but this time we will have `dynamo serve` point to different config files that contains the MTP configurations

##### Head Node

Start nats/etcd
```bash
nats-server -js &
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
```

Launch graph of Frontend and TensorRTLLMWorker (decode) on head node:

```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_disagg.yaml  &
```

##### Worker Node(s)

Set environment variables pointing at the etcd/nats endpoints on the head node.
```bash
export HEAD_NODE_IP="<head-node-ip>"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```

Deploy a Prefill worker:
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/deepseek_r1/mtp/mtp_disagg.yaml --service-name TensorRTLLMPrefillWorker &
```

Notes:
287
288
289
- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script.

  Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit`
290
291
292
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

293

294
295
296
297
### Client

See [client](../llm/README.md#client) section to learn how to send request to the deployment.

298
299
NOTE: To send a request to a multi-node deployment, target the node which deployed the `Frontend` component.

300
301
### Close deployment

302
See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment.
303

304
### Benchmarking
305

306
To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
307
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
308
309
310
311

### Future Work

Remaining tasks:
312
- [x] Add support for the disaggregated serving.
313
314
- [x] Add multi-node support.
- [x] Add instructions for benchmarking.
315
- [x] Use processor from dynamo-llm framework.
316
317
- [ ] Add integration test coverage.
- [ ] Merge the code base with llm example to reduce the code duplication.
318
- [ ] Enable NIXL integration with TensorRT-LLM once available. Currently, TensorRT-LLM uses UCX to transfer KV cache.