README.md 12.9 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

Anish's avatar
Anish committed
18
# LLM Deployment using TensorRT-LLM
19
20
21

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

22
23
24
25
26
27
28
29
30
31
32
## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
33

Anish's avatar
Anish committed
34
35
36
37
38
---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
39
- [Single Node Examples](#single-node-examples)
Anish's avatar
Anish committed
40
41
42
43
- [Advanced Examples](#advanced-examples)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)
44
- [Multimodal Support](#multimodal-support)
45
- [Logits Processing](#logits-processing)
46
- [Performance Sweep](#performance-sweep)
Anish's avatar
Anish committed
47
48
49
50

## Feature Support Matrix

### Core Dynamo Features
51

Anish's avatar
Anish committed
52
53
| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
54
55
56
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ |  |
57
58
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
59
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
60

Anish's avatar
Anish committed
61
### Large Scale P/D and WideEP Features
62

63
64
| Feature            | TensorRT-LLM | Notes                                                           |
|--------------------|--------------|-----------------------------------------------------------------|
Anish's avatar
Anish committed
65
66
67
| **WideEP**         | ✅           |                                                                 |
| **DP Rank Routing**| ✅           |                                                                 |
| **GB200 Support**  | ✅           |                                                                 |
68

69
## TensorRT-LLM Quick Start
70

Anish's avatar
Anish committed
71
72
73
74
75
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
76
77

```bash
78
docker compose -f deploy/docker-compose.yml up -d
79
80
```

Anish's avatar
Anish committed
81
### Build container
82
83

```bash
84
85
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
86

87
# On an x86 machine:
88
./container/build.sh --framework trtllm
89
90

# On an ARM machine:
91
./container/build.sh --framework trtllm --platform linux/arm64
92
93
94
95

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
96
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit main
97
98
99
100
```

### Run container

Anish's avatar
Anish committed
101
```bash
102
./container/run.sh --framework trtllm -it
103
104
```

Anish's avatar
Anish committed
105
## Single Node Examples
106

Anish's avatar
Anish committed
107
108
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
109

110
For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv_cache_routing.md).
111

Anish's avatar
Anish committed
112
### Aggregated
113
```bash
114
cd $DYNAMO_HOME/examples/backends/trtllm
115
./launch/agg.sh
116
```
117

Anish's avatar
Anish committed
118
### Aggregated with KV Routing
119
```bash
120
cd $DYNAMO_HOME/examples/backends/trtllm
121
./launch/agg_router.sh
122
```
123

Anish's avatar
Anish committed
124
### Disaggregated
125
126

```bash
127
cd $DYNAMO_HOME/examples/backends/trtllm
128
./launch/disagg.sh
129
130
```

Anish's avatar
Anish committed
131
### Disaggregated with KV Routing
132

133
> [!IMPORTANT]
134
> In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
135

136
```bash
137
cd $DYNAMO_HOME/examples/backends/trtllm
138
./launch/disagg_router.sh
139
140
```

Anish's avatar
Anish committed
141
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
142
```bash
143
cd $DYNAMO_HOME/examples/backends/trtllm
144

145
export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
146
147
148
149
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
150
151
152
```

Notes:
153
154
155
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

Anish's avatar
Anish committed
156
## Advanced Examples
157

Anish's avatar
Anish committed
158
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
159

Anish's avatar
Anish committed
160
### Multinode Deployment
161

Anish's avatar
Anish committed
162
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
163

Anish's avatar
Anish committed
164
165
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**
166

167
168
### Kubernetes Deployment

169
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](../../../examples/backends/trtllm/deploy/README.md).
170
171
172

### Client

173
See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
174

175
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
176
177
178

### Benchmarking

179
To benchmark your deployment with AIPerf, see this utility script, configuring the
180
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
181

182
## KV Cache Transfer in Disaggregated Serving
183

184
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
185

186

187
188
## Request Migration

189
You can enable [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
190
191

```bash
192
# For decode and aggregated workers
193
194
195
python3 -m dynamo.trtllm ... --migration-limit=3
```

196
197
198
199
> [!IMPORTANT]
> **Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.

See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for details on how this works.
200

201
202
203
204
205
206
207
208
209
## Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

### Cancellation Support Matrix

| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
210
| **Disaggregated** | ✅ | ✅ |
211

212
For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation.md) documentation.
213

Anish's avatar
Anish committed
214
215
## Client

216
See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
Anish's avatar
Anish committed
217
218

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
219

Anish's avatar
Anish committed
220
221
## Benchmarking

222
To benchmark your deployment with AIPerf, see this utility script, configuring the
Anish's avatar
Anish committed
223
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
224
225
226

## Multimodal support

227
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).
228

229
230
231
232
233
234
235
236
237
238
239
240
241
## Logits Processing

Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

### How it works
- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](../../../lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](../../../lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).

### Quick test: HelloWorld processor
You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.

```bash
242
cd $DYNAMO_HOME/examples/backends/trtllm
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```

Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- Expected chat response contains "Hello world".

### Bring your own processor
Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:

```python
from typing import Sequence
import torch
from dynamo.logits_processing import BaseLogitsProcessor

class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        if self.temperature == 1.0:
            return
        logits.div_(self.temperature)
```

Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:

```python
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor

processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
```

### Current limitations
- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).

286
287
## Performance Sweep

288
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](../../../examples/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
289
290
291
292
293

## Dynamo KV Block Manager Integration

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

294
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/trtllm-setup.md) .