trtllm-reference-guide.md 3.53 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Reference Guide
subtitle: Features, configuration, and operational details for the TensorRT-LLM backend
---

## Building a Custom Container

To build a TensorRT-LLM container from source (e.g., for custom modifications or a different CUDA version), see the [Building a Custom Container](./trtllm-building-custom-container.md) guide.

## KV Cache Transfer

Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV Cache Transfer Guide](./trtllm-kv-cache-transfer.md).

## Request Migration

Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.

## Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

### Cancellation Support Matrix

| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |

For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.

## Multimodal Support

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md).

37
## Diffusion Support (Experimental)
38

39
Dynamo supports video and image generation using diffusion models through TensorRT-LLM. For requirements, supported models, API usage, and configuration options, see the [Diffusion Guide](./trtllm-diffusion.md).
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56

## Logits Processing

Logits processors let you modify the next-token logits at every decoding step. Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM. For the API, examples, and how to bring your own processor, see the [Logits Processing Guide](./trtllm-logits-processing.md).

## DP Rank Routing (Attention Data Parallelism)

TensorRT-LLM supports attention data parallelism for models like DeepSeek, enabling KV-cache-aware routing to specific DP ranks. For configuration and usage details, see the [DP Rank Routing Guide](./trtllm-dp-rank-routing.md).

## KVBM Integration

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

See the instructions here: [Running KVBM in TensorRT-LLM](../../components/kvbm/kvbm-guide.md#run-kvbm-in-dynamo-with-tensorrt-llm).

## Observability

57
TensorRT-LLM exposes Prometheus metrics for monitoring inference performance. For detailed metrics reference, collection setup, and Grafana integration, see the [Observability Guide](./trtllm-observability.md).
58
59
60
61

## Known Issues and Mitigations

For known issues, workarounds, and mitigations, see the [Known Issues and Mitigations](./trtllm-known-issues.md) page.