README.md 13.5 KB
Newer Older
1
<!--
2
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

Anish's avatar
Anish committed
18
# LLM Deployment using TensorRT-LLM
19
20
21

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

22
23
24
25
26
27
28
29
30
31
32
## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
33

Anish's avatar
Anish committed
34
35
36
37
38
---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
39
- [Single Node Examples](#single-node-examples)
Anish's avatar
Anish committed
40
41
42
43
- [Advanced Examples](#advanced-examples)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)
44
- [Multimodal Support](#multimodal-support)
45
- [Logits Processing](#logits-processing)
46
- [Performance Sweep](#performance-sweep)
Anish's avatar
Anish committed
47
48
49
50

## Feature Support Matrix

### Core Dynamo Features
51

Anish's avatar
Anish committed
52
53
| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
54
55
56
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ |  |
57
58
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
59
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
60

Anish's avatar
Anish committed
61
### Large Scale P/D and WideEP Features
62

63
64
| Feature            | TensorRT-LLM | Notes                                                           |
|--------------------|--------------|-----------------------------------------------------------------|
Anish's avatar
Anish committed
65
66
67
| **WideEP**         | ✅           |                                                                 |
| **DP Rank Routing**| ✅           |                                                                 |
| **GB200 Support**  | ✅           |                                                                 |
68

69
## TensorRT-LLM Quick Start
70

Anish's avatar
Anish committed
71
72
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

73
### Start Infrastructure Services (Local Development Only)
Anish's avatar
Anish committed
74

75
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
76
77

```bash
78
docker compose -f deploy/docker-compose.yml up -d
79
80
```

81
82
83
84
85
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)

Anish's avatar
Anish committed
86
### Build container
87
88

```bash
89
90
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
91

92
# On an x86 machine:
93
./container/build.sh --framework trtllm
94
95

# On an ARM machine:
96
./container/build.sh --framework trtllm --platform linux/arm64
97
98
99
100

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
101
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit main
102
103
104
105
```

### Run container

Anish's avatar
Anish committed
106
```bash
107
./container/run.sh --framework trtllm -it
108
109
```

Anish's avatar
Anish committed
110
## Single Node Examples
111

Anish's avatar
Anish committed
112
113
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
114

115
For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv_cache_routing.md).
116

Anish's avatar
Anish committed
117
### Aggregated
118
```bash
119
cd $DYNAMO_HOME/examples/backends/trtllm
120
./launch/agg.sh
121
```
122

Anish's avatar
Anish committed
123
### Aggregated with KV Routing
124
```bash
125
cd $DYNAMO_HOME/examples/backends/trtllm
126
./launch/agg_router.sh
127
```
128

Anish's avatar
Anish committed
129
### Disaggregated
130
131

```bash
132
cd $DYNAMO_HOME/examples/backends/trtllm
133
./launch/disagg.sh
134
135
```

Anish's avatar
Anish committed
136
### Disaggregated with KV Routing
137

138
> [!IMPORTANT]
139
> In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
140

141
```bash
142
cd $DYNAMO_HOME/examples/backends/trtllm
143
./launch/disagg_router.sh
144
145
```

Anish's avatar
Anish committed
146
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
147
```bash
148
cd $DYNAMO_HOME/examples/backends/trtllm
149

150
export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
151
152
153
154
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
155
156
157
```

Notes:
158
159
160
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

Anish's avatar
Anish committed
161
## Advanced Examples
162

Anish's avatar
Anish committed
163
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
164

Anish's avatar
Anish committed
165
### Multinode Deployment
166

Anish's avatar
Anish committed
167
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
168

Anish's avatar
Anish committed
169
170
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**
171

172
173
### Kubernetes Deployment

174
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](../../../examples/backends/trtllm/deploy/README.md).
175
176
177

### Client

178
See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
179

180
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
181
182
183

### Benchmarking

184
To benchmark your deployment with AIPerf, see this utility script, configuring the
185
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
186

187
## KV Cache Transfer in Disaggregated Serving
188

189
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
190

191

192
193
## Request Migration

194
You can enable [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
195
196

```bash
197
# For decode and aggregated workers
198
199
200
python3 -m dynamo.trtllm ... --migration-limit=3
```

201
202
203
204
> [!IMPORTANT]
> **Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.

See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for details on how this works.
205

206
207
208
209
210
211
212
213
214
## Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

### Cancellation Support Matrix

| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
215
| **Disaggregated** | ✅ | ✅ |
216

217
For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation.md) documentation.
218

Anish's avatar
Anish committed
219
220
## Client

221
See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
Anish's avatar
Anish committed
222
223

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
224

Anish's avatar
Anish committed
225
226
## Benchmarking

227
To benchmark your deployment with AIPerf, see this utility script, configuring the
Anish's avatar
Anish committed
228
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
229
230
231

## Multimodal support

232
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).
233

234
235
236
237
238
239
240
241
242
243
244
245
246
## Logits Processing

Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

### How it works
- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](../../../lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](../../../lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).

### Quick test: HelloWorld processor
You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.

```bash
247
cd $DYNAMO_HOME/examples/backends/trtllm
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```

Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- Expected chat response contains "Hello world".

### Bring your own processor
Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:

```python
from typing import Sequence
import torch
from dynamo.logits_processing import BaseLogitsProcessor

class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        if self.temperature == 1.0:
            return
        logits.div_(self.temperature)
```

Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:

```python
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor

processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
```

### Current limitations
- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).

291
292
## Performance Sweep

293
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](../../../examples/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
294
295
296
297
298

## Dynamo KV Block Manager Integration

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

299
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/trtllm-setup.md) .