"lib/llm/vscode:/vscode.git/clone" did not exist on "6afa679c5d3debe06b0a7d2886da104216c92754"
README.md 9.37 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# LLM Deployment Examples using TensorRT-LLM

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# User Documentation

- [Deployment Architectures](#deployment-architectures)
- [Getting Started](#getting-started)
  - [Prerequisites](#prerequisites)
  - [Build docker](#build-docker)
  - [Run container](#run-container)
  - [Run deployment](#run-deployment)
    - [Single Node deployment](#single-node-deployments)
    - [Multinode deployment](#multinode-deployment)
  - [Client](#client)
  - [Benchmarking](#benchmarking)
- [Disaggregation Strategy](#disaggregation-strategy)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [More Example Architectures](#more-example-architectures)
  - [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)

# Quick Start

41
42
43
44
45
46
47
48
49
50
51
## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
52
53
54
55
56

## Deployment Architectures

See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture.

57
Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can configure the deployment to always use either aggregate or disaggregated serving.
58
59
60
61
62
63
64

## Getting Started

1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts

65
66
### Prerequisites

67
Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
68
```bash
69
docker compose -f deploy/metrics/docker-compose.yml up -d
70
71
72
73
74
```

### Build docker

```bash
75
76
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
77

78
# On an x86 machine:
79
./container/build.sh --framework tensorrtllm
80
81
82

# On an ARM machine:
./container/build.sh --framework tensorrtllm --platform linux/arm64
83
84
85
86
87

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
88
89
90
91
92
93
94
95
96
```

### Run container

```
./container/run.sh --framework tensorrtllm -it
```
## Run Deployment

97
98
99
100
101
102
103
This figure shows an overview of the major components to deploy:



```

+------+      +-----------+      +------------------+             +---------------+
104
105
| HTTP |----->| processor |----->|      Worker1     |------------>|    Worker2    |
|      |<-----|           |<-----|                  |<------------|               |
106
107
108
109
110
111
112
113
114
115
116
+------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+

```

117
**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.
118

119
### Single-Node Deployments
120

121
122
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `dynamo-run` to start up the ingress and using `python3` to start up the workers. You can easily take each command and run them in separate terminals.
123

124
#### Aggregated
125
```bash
126
127
cd $DYNAMO_ROOT/examples/tensorrt_llm
./launch/agg.sh
128
```
129

130
#### Aggregated with KV Routing
131
```bash
132
133
cd $DYNAMO_ROOT/examples/tensorrt_llm
./launch/agg_router.sh
134
```
135

136
#### Disaggregated
137

138
139
> [!IMPORTANT]
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
140

141
```bash
142
143
cd $DYNAMO_ROOT/examples/tensorrt_llm
./launch/disagg.sh
144
145
```

146
#### Disaggregated with KV Routing
147

148
149
> [!IMPORTANT]
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, you can simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
150

151
```bash
152
153
cd $DYNAMO_ROOT/examples/tensorrt_llm
./launch/disagg_router.sh
154
155
```

156
#### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
157
```bash
158
cd $DYNAMO_ROOT/examples/tensorrt_llm
159

160
161
162
163
164
export AGG_ENGINE_ARGS=./engine_configs/deepseek_r1/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
165
166
167
```

Notes:
168
169
170
- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script.

  Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit`
171

172
173
174
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

175
176
177
### Multinode Deployment

For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
178

179
180
181
182
### Client

See [client](../llm/README.md#client) section to learn how to send request to the deployment.

183
NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.
184

185
### Benchmarking
186

187
To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
188
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
189
190


191
## Disaggregation Strategy
192

193
The disaggregation strategy controls how requests are distributed between the prefill and decode workers in a disaggregated deployment.
194

195
By default, Dynamo uses a `decode first` strategy: incoming requests are initially routed to the decode worker, which then forwards them to the prefill worker in round-robin fashion. The prefill worker processes the request and returns results to the decode worker for any remaining decode operations.
196

197
When using KV routing, however, Dynamo switches to a `prefill first` strategy. In this mode, requests are routed directly to the prefill worker, which can help maximize KV cache reuse and improve overall efficiency for certain workloads. Choosing the appropriate strategy can have a significant impact on performance, depending on your use case.
198

199
The disaggregation strategy can be set using the `DISAGGREGATION_STRATEGY` environment variable. You can set the strategy before launching your deployment, for example:
200
```bash
201
DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh
202
203
```

204
## KV Cache Transfer in Disaggregated Serving
205

206
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-tranfer.md).
207

208
## More Example Architectures
209

210
- [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)