"vllm/vscode:/vscode.git/clone" did not exist on "7b5ecf79bd94aab0d782c70126d0dcc37c16bc60"
README.md 12.4 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

Anish's avatar
Anish committed
18
# LLM Deployment using TensorRT-LLM
19
20
21

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

22
23
24
25
26
27
28
29
30
31
32
## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
33

Anish's avatar
Anish committed
34
35
36
37
38
---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
39
- [Single Node Examples](#single-node-examples)
Anish's avatar
Anish committed
40
41
42
43
44
45
46
47
48
- [Advanced Examples](#advanced-examples)
- [Disaggregation Strategy](#disaggregation-strategy)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)

## Feature Support Matrix

### Core Dynamo Features
49

Anish's avatar
Anish committed
50
51
| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
52
53
54
55
56
57
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | 🚧 | Planned |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
58

Anish's avatar
Anish committed
59
### Large Scale P/D and WideEP Features
60

Anish's avatar
Anish committed
61
62
63
64
65
| Feature            | TensorRT-LLM | Notes                                                                 |
|--------------------|--------------|-----------------------------------------------------------------------|
| **WideEP**         | ✅           |                                                                 |
| **DP Rank Routing**| ✅           |                                                                 |
| **GB200 Support**  | ✅           |                                                                 |
66

Anish's avatar
Anish committed
67
## Quick Start
68

Anish's avatar
Anish committed
69
70
71
72
73
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
74
75

```bash
76
docker compose -f deploy/docker-compose.yml up -d
77
78
```

Anish's avatar
Anish committed
79
### Build container
80
81

```bash
82
83
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
84

85
# On an x86 machine:
86
./container/build.sh --framework tensorrtllm
87
88
89

# On an ARM machine:
./container/build.sh --framework tensorrtllm --platform linux/arm64
90
91
92
93
94

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
95
96
97
98
```

### Run container

Anish's avatar
Anish committed
99
```bash
100
101
102
./container/run.sh --framework tensorrtllm -it
```

Anish's avatar
Anish committed
103
## Single Node Examples
104

Anish's avatar
Anish committed
105
106
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
107

Anish's avatar
Anish committed
108
This figure shows an overview of the major components to deploy:
109
110
111

```
+------+      +-----------+      +------------------+             +---------------+
112
113
| HTTP |----->| processor |----->|      Worker1     |------------>|    Worker2    |
|      |<-----|           |<-----|                  |<------------|               |
114
115
116
117
118
119
120
121
122
123
+------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+
```

124
**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.
125

Anish's avatar
Anish committed
126
### Aggregated
127
```bash
128
cd $DYNAMO_HOME/components/backends/trtllm
129
./launch/agg.sh
130
```
131

Anish's avatar
Anish committed
132
### Aggregated with KV Routing
133
```bash
134
cd $DYNAMO_HOME/components/backends/trtllm
135
./launch/agg_router.sh
136
```
137

Anish's avatar
Anish committed
138
### Disaggregated
139

140
141
> [!IMPORTANT]
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
142

143
```bash
144
cd $DYNAMO_HOME/components/backends/trtllm
145
./launch/disagg.sh
146
147
```

Anish's avatar
Anish committed
148
### Disaggregated with KV Routing
149

150
151
> [!IMPORTANT]
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, you can simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
152

153
```bash
154
cd $DYNAMO_HOME/components/backends/trtllm
155
./launch/disagg_router.sh
156
157
```

Anish's avatar
Anish committed
158
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
159
```bash
160
cd $DYNAMO_HOME/components/backends/trtllm
161

162
163
164
165
166
export AGG_ENGINE_ARGS=./engine_configs/deepseek_r1/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
167
168
169
```

Notes:
170
171
172
- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script.

  Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit`
173

174
175
176
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

Anish's avatar
Anish committed
177
## Advanced Examples
178

Anish's avatar
Anish committed
179
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
180

Anish's avatar
Anish committed
181
### Multinode Deployment
182

Anish's avatar
Anish committed
183
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
184

Anish's avatar
Anish committed
185
186
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**
187

188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:

- `agg.yaml` - Aggregated serving
- `agg_router.yaml` - Aggregated serving with KV routing
- `disagg.yaml` - Disaggregated serving
- `disagg_router.yaml` - Disaggregated serving with KV routing

#### Prerequisites

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/trtllm-runtime`. If you don't have access, build and push your own image:
  ```bash
  ./container/build.sh --framework tensorrtllm
  # Tag and push to your container registry
  # Update the image references in the YAML files
  ```

- **Port Forwarding**: After deployment, forward the frontend service to access the API:
  ```bash
  kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
  ```

#### Deploy to Kubernetes

Example with disagg:
Export the NAMESPACE  you used in your Dynamo Cloud Installation.

```bash
cd dynamo
cd components/backends/trtllm/deploy
kubectl apply -f disagg.yaml -n $NAMESPACE
```

To change `DYN_LOG` level, edit the yaml file by adding

```yaml
...
spec:
  envs:
    - name: DYN_LOG
      value: "debug" # or other log levels
  ...
```

### Client

See [client](../llm/README.md#client) section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.

### Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)


247
## Disaggregation Strategy
248

249
The disaggregation strategy controls how requests are distributed between the prefill and decode workers in a disaggregated deployment.
250

251
By default, Dynamo uses a `decode first` strategy: incoming requests are initially routed to the decode worker, which then forwards them to the prefill worker in round-robin fashion. The prefill worker processes the request and returns results to the decode worker for any remaining decode operations.
252

253
When using KV routing, however, Dynamo switches to a `prefill first` strategy. In this mode, requests are routed directly to the prefill worker, which can help maximize KV cache reuse and improve overall efficiency for certain workloads. Choosing the appropriate strategy can have a significant impact on performance, depending on your use case.
254

255
The disaggregation strategy can be set using the `DISAGGREGATION_STRATEGY` environment variable. You can set the strategy before launching your deployment, for example:
256
```bash
257
DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh
258
259
```

260
## KV Cache Transfer in Disaggregated Serving
261

262
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-tranfer.md).
263

264
265
## Request Migration

266
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
267
268
269
270
271

```bash
python3 -m dynamo.trtllm ... --migration-limit=3
```

272
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
273

Anish's avatar
Anish committed
274
275
276
277
278
## Client

See [client](../llm/README.md#client) section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
279

Anish's avatar
Anish committed
280
281
282
283
## Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)