---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: TensorRT-LLM
---

## Use the Latest Release

We recommend using the [latest stable release](https://github.com/ai-dynamo/dynamo/releases/latest) of Dynamo to avoid breaking changes.

---

Dynamo TensorRT-LLM integrates [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.

## Feature Support Matrix

### Core Dynamo Features

| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ |  |
| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
| [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ |  |
| [**Load-Based Planner**](../../components/planner/README.md) | 🚧 | Planned |
| [**KVBM**](../../components/kvbm/README.md) | ✅ | |

### Large Scale P/D and WideEP Features

| Feature            | TensorRT-LLM | Notes                                                           |
|--------------------|--------------|-----------------------------------------------------------------|
| **WideEP**         | ✅           |                                                                 |
| **DP Rank Routing**| ✅           |                                                                 |
| **GB200 Support**  | ✅           |                                                                 |

## Quick Start

**Step 1 (host terminal):** Start infrastructure services:

```bash
docker compose -f deploy/docker-compose.yml up -d
```

**Step 2 (host terminal):** Pull and run the prebuilt container:

```bash
DYNAMO_VERSION=1.0.0
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
docker run --gpus all -it --network host --ipc host \
  nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
```

> [!NOTE]
> The `DYNAMO_VERSION` variable above can be set to any specific available version of the container.
> To find the available `tensorrtllm-runtime` versions for Dynamo, visit the [NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime).

**Step 3 (inside the container):** Launch an aggregated serving deployment (uses `Qwen/Qwen3-0.6B` by default):

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```

The launch script automatically downloads the model and starts the TensorRT-LLM engine. To serve a different model, set the `MODEL_PATH` and `SERVED_MODEL_NAME` environment variables before running the script.
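
As a minimal sketch of that override, the snippet below exports both variables before launching. The served name `qwen3-demo` is a hypothetical alias, and passing a Hugging Face model ID as `MODEL_PATH` is an assumption based on the default model named in Step 3; check the launch script for the forms it actually accepts.

```bash
# Hypothetical values for illustration: swap in your own model and alias.
export MODEL_PATH="Qwen/Qwen3-0.6B"      # assumed to accept a Hugging Face model ID
export SERVED_MODEL_NAME="qwen3-demo"    # hypothetical name that clients would request

echo "Will serve ${MODEL_PATH} as ${SERVED_MODEL_NAME}"

# Then, inside the container, launch as in Step 3:
#   cd "$DYNAMO_HOME/examples/backends/trtllm" && ./launch/agg.sh
```

If the served name changes, requests would presumably need to pass that name in the `model` field instead of `Qwen/Qwen3-0.6B`.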

**Step 4 (host terminal):** Verify the deployment:

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
    "stream": true,
    "max_tokens": 30
  }'
```
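
The request above may fail with a connection error if it is sent before the model has finished downloading and the engine has started. One way to handle this is a small retry helper, sketched below. The `retry` function is a hypothetical helper, not part of Dynamo. The commented probe assumes the frontend serves an OpenAI-style `/v1/models` listing on the same port as `/v1/chat/completions`.

```bash
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Run the probe command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example probe (assumed endpoint; -f makes curl exit non-zero on HTTP errors):
#   retry 30 10 curl -sf localhost:8000/v1/models
```

Taking the probe as an argument keeps the helper independent of any particular endpoint, so the same loop can wrap the chat completions request itself.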

### Kubernetes Deployment

You can deploy TensorRT-LLM with Dynamo on Kubernetes using a `DynamoGraphDeployment`. For more details, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).

## Next Steps

- **[Reference Guide](trtllm-reference-guide.md)**: Features, configuration, and operational details
- **[Examples](trtllm-examples.md)**: All deployment patterns with launch scripts
- **[KV Cache Transfer](trtllm-kv-cache-transfer.md)**: KV cache transfer methods for disaggregated serving
- **[Prometheus Metrics](trtllm-prometheus.md)**: Metrics and monitoring
- **[Multinode Examples](multinode/trtllm-multinode-examples.md)**: Multi-node deployment with SLURM
- **[Deploying TensorRT-LLM with Dynamo on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)**: Kubernetes deployment guide