README.md 14.4 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

Anish's avatar
Anish committed
18
# LLM Deployment using TensorRT-LLM
19
20
21

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

22
23
24
25
26
27
28
29
30
31
32
## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
33

Anish's avatar
Anish committed
34
35
36
37
38
---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
39
- [Single Node Examples](#single-node-examples)
Anish's avatar
Anish committed
40
41
42
43
44
- [Advanced Examples](#advanced-examples)
- [Disaggregation Strategy](#disaggregation-strategy)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)
45
- [Multimodal Support](#multimodal-support)
46
- [Logits Processing](#logits-processing)
47
- [Performance Sweep](#performance-sweep)
Anish's avatar
Anish committed
48
49
50
51

## Feature Support Matrix

### Core Dynamo Features
52

Anish's avatar
Anish committed
53
54
| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
55
56
57
58
59
60
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | 🚧 | Planned |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
61

Anish's avatar
Anish committed
62
### Large Scale P/D and WideEP Features
63

Anish's avatar
Anish committed
64
65
66
67
68
| Feature            | TensorRT-LLM | Notes                                                                 |
|--------------------|--------------|-----------------------------------------------------------------------|
| **WideEP**         | ✅           |                                                                 |
| **DP Rank Routing**| ✅           |                                                                 |
| **GB200 Support**  | ✅           |                                                                 |
69

70
## TensorRT-LLM Quick Start
71

Anish's avatar
Anish committed
72
73
74
75
76
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
77
78

```bash
79
docker compose -f deploy/docker-compose.yml up -d
80
81
```

Anish's avatar
Anish committed
82
### Build container
83
84

```bash
85
86
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
87

88
# On an x86 machine:
89
./container/build.sh --framework trtllm
90
91

# On an ARM machine:
92
./container/build.sh --framework trtllm --platform linux/arm64
93
94
95
96

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
97
./container/build.sh --framework trtllm --use-default-experimental-tensorrtllm-commit
98
99
100
101
```

### Run container

Anish's avatar
Anish committed
102
```bash
103
./container/run.sh --framework trtllm -it
104
105
```

Anish's avatar
Anish committed
106
## Single Node Examples
107

Anish's avatar
Anish committed
108
109
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
110

Anish's avatar
Anish committed
111
This figure shows an overview of the major components to deploy:
112
113
114

```
+------+      +-----------+      +------------------+             +---------------+
115
116
| HTTP |----->| processor |----->|      Worker1     |------------>|    Worker2    |
|      |<-----|           |<-----|                  |<------------|               |
117
118
119
120
121
122
123
124
125
126
+------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+
```

127
**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.
128

Anish's avatar
Anish committed
129
### Aggregated
130
```bash
131
cd $DYNAMO_HOME/components/backends/trtllm
132
./launch/agg.sh
133
```
134

Anish's avatar
Anish committed
135
### Aggregated with KV Routing
136
```bash
137
cd $DYNAMO_HOME/components/backends/trtllm
138
./launch/agg_router.sh
139
```
140

Anish's avatar
Anish committed
141
### Disaggregated
142

143
144
> [!IMPORTANT]
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
145

146
```bash
147
cd $DYNAMO_HOME/components/backends/trtllm
148
./launch/disagg.sh
149
150
```

Anish's avatar
Anish committed
151
### Disaggregated with KV Routing
152

153
154
> [!IMPORTANT]
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, you can simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
155

156
```bash
157
cd $DYNAMO_HOME/components/backends/trtllm
158
./launch/disagg_router.sh
159
160
```

Anish's avatar
Anish committed
161
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
162
```bash
163
cd $DYNAMO_HOME/components/backends/trtllm
164

165
166
167
168
169
export AGG_ENGINE_ARGS=./engine_configs/deepseek_r1/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
170
171
172
```

Notes:
173
174
175
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

Anish's avatar
Anish committed
176
## Advanced Examples
177

Anish's avatar
Anish committed
178
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
179

Anish's avatar
Anish committed
180
### Multinode Deployment
181

Anish's avatar
Anish committed
182
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
183

Anish's avatar
Anish committed
184
185
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**
186

187
188
### Kubernetes Deployment

189
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](deploy/README.md).
190
191
192

### Client

193
See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
194

195
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
196
197
198
199

### Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
200
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
201
202


203
## Disaggregation Strategy
204

205
The disaggregation strategy controls how requests are distributed between the prefill and decode workers in a disaggregated deployment.
206

207
By default, Dynamo uses a `decode first` strategy: incoming requests are initially routed to the decode worker, which then forwards them to the prefill worker in round-robin fashion. The prefill worker processes the request and returns results to the decode worker for any remaining decode operations.
208

209
When using KV routing, however, Dynamo switches to a `prefill first` strategy. In this mode, requests are routed directly to the prefill worker, which can help maximize KV cache reuse and improve overall efficiency for certain workloads. Choosing the appropriate strategy can have a significant impact on performance, depending on your use case.
210

211
The disaggregation strategy can be set using the `DISAGGREGATION_STRATEGY` environment variable. You can set the strategy before launching your deployment, for example:
212
```bash
213
DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh
214
215
```

216
## KV Cache Transfer in Disaggregated Serving
217

218
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
219

220

221
222
## Request Migration

223
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
224
225
226
227
228

```bash
python3 -m dynamo.trtllm ... --migration-limit=3
```

229
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
230

Anish's avatar
Anish committed
231
232
## Client

233
See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
Anish's avatar
Anish committed
234
235

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
236

Anish's avatar
Anish committed
237
238
239
240
## Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
241
242
243

## Multimodal support

244
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [Multimodal Support Guide](./multimodal_support.md).
245

246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
## Logits Processing

Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

### How it works
- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](../../../lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](../../../lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).

### Quick test: HelloWorld processor
You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.

```bash
cd $DYNAMO_HOME/components/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```

Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- Expected chat response contains "Hello world".

### Bring your own processor
Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:

```python
from typing import Sequence
import torch
from dynamo.logits_processing import BaseLogitsProcessor

class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        if self.temperature == 1.0:
            return
        logits.div_(self.temperature)
```

Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:

```python
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor

processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
```

### Current limitations
- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).

303
304
305
## Performance Sweep

For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](./performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.