Unverified commit 1bb28d6f, authored by Tanmay Verma, committed by GitHub

docs: Refactor TensorRT-LLM backend docs (#6782)

parent 8bd5966b
@@ -4,28 +4,13 @@
title: TensorRT-LLM
---
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
## Use the Latest Release
We recommend using the [latest stable release](https://github.com/ai-dynamo/dynamo/releases/latest) of Dynamo to avoid breaking changes.
---
Dynamo TensorRT-LLM integrates [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#single-node-examples)
- [Advanced Examples](#advanced-examples)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)
- [Multimodal Support](#multimodal-support)
- [Video Diffusion Support](#video-diffusion-support-experimental)
- [Logits Processing](#logits-processing)
- [DP Rank Routing](#dp-rank-routing-attention-data-parallelism)
- [Performance Sweep](#performance-sweep)
- [Known Issues and Mitigations](#known-issues-and-mitigations)
## Feature Support Matrix
@@ -48,341 +33,58 @@ We recommend using the [latest stable release](https://github.com/ai-dynamo/dyna
| **DP Rank Routing**| ✅ | |
| **GB200 Support** | ✅ | |
## Quick Start
Below we provide a guide that lets you run all of the common deployment patterns on a single node.
### Start Infrastructure Services (Local Development Only)
**Step 1 (host terminal):** Start infrastructure services:
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use `--no-router-kv-events` on the frontend for prediction-based routing without events.
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
### Build container
```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
# On an x86 machine:
python container/render.py --framework=trtllm --target=runtime --output-short-filename --cuda-version=13.1
docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
# On an ARM machine:
python container/render.py --framework=trtllm --target=runtime --platform=arm64 --output-short-filename --cuda-version=13.1
docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
```
### Run container
```bash
./container/run.sh --framework trtllm -it
```
**Step 2 (host terminal):** Pull and run the prebuilt container:
```bash
DYNAMO_VERSION=0.9.0
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
docker run --gpus all -it --network host --ipc host \
  nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
```
> [!NOTE]
> The `DYNAMO_VERSION` variable above can be set to any specific available version of the container.
> To find the available `tensorrtllm-runtime` versions for Dynamo, visit the [NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime).

## Single Node Examples
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `python3 -m dynamo.frontend <args>` to start up the ingress and `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router-guide.md).
### Aggregated
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```
### Aggregated with KV Routing
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh
```
### Disaggregated
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh
```
### Disaggregated with KV Routing
> [!IMPORTANT]
> In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh
```
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
```
Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

**Step 3 (inside the container):** Launch an aggregated serving deployment (uses `Qwen/Qwen3-0.6B` by default):
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```
The launch script will automatically download the model and start the TensorRT-LLM engine. You can override the model by setting `MODEL_PATH` and `SERVED_MODEL_NAME` environment variables before running the script.
## Advanced Examples
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
### Multinode Deployment
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4-plus-eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4-plus-eagle.md)**
### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
### Client
See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
### Benchmarking
To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
## KV Cache Transfer in Disaggregated Serving
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
## Request Migration
Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
## Client
See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
## Benchmarking
To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
## Multimodal support
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md).
## Video Diffusion Support (Experimental)
Dynamo supports video generation using diffusion models through the `--modality video_diffusion` flag.
### Requirements
- **TensorRT-LLM with visual_gen**: The `visual_gen` module is part of TensorRT-LLM (`tensorrt_llm._torch.visual_gen`). Install TensorRT-LLM following the [official instructions](https://github.com/NVIDIA/TensorRT-LLM#installation).
- **imageio with ffmpeg**: Required for encoding generated frames to MP4 video:
```bash
pip install imageio[ffmpeg]
```
- **dynamo-runtime with video API**: The Dynamo runtime must include `ModelType.Videos` support. Ensure you're using a compatible version.
### Supported Models
| Diffusers Pipeline | Description | Example Model |
|--------------------|-------------|---------------|
| `WanPipeline` | Wan 2.1/2.2 Text-to-Video | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` |
The pipeline type is **auto-detected** from the model's `model_index.json` — no `--model-type` flag is needed.
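As a rough illustration of what detection from `model_index.json` could look like, the sketch below reads the `_class_name` field that Diffusers records in that file. The helper name `detect_pipeline_class` and the stand-in model directory are assumptions made for this example; this is not Dynamo's actual detection code.

```python
import json
import tempfile
from pathlib import Path


def detect_pipeline_class(model_dir: str) -> str:
    """Return the pipeline class recorded in a Diffusers model_index.json."""
    index_path = Path(model_dir) / "model_index.json"
    with index_path.open(encoding="utf-8") as f:
        index = json.load(f)
    # Diffusers stores the pipeline class name under the "_class_name" key.
    return index["_class_name"]


# Demo with a stand-in model directory (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "model_index.json"
    index_file.write_text(json.dumps({"_class_name": "WanPipeline"}), encoding="utf-8")
    detected = detect_pipeline_class(tmp)
    print(detected)
```

A real checkpoint downloaded from the Hub would already contain this file, so only the directory path would change.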
### Quick Start
```bash
python -m dynamo.trtllm \
--modality video_diffusion \
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--media-output-fs-url file:///tmp/dynamo_media
```
### API Endpoint
Video generation uses the `/v1/videos` endpoint:
```bash
curl -X POST http://localhost:8000/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A cat playing piano",
    "model": "wan_t2v",
    "seconds": 4,
    "size": "832x480",
    "nvext": {
      "fps": 24
    }
  }'
```
**Step 4 (host terminal):** Verify the deployment:
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
    "stream": true,
    "max_tokens": 30
  }'
```
### Configuration Options ### Kubernetes Deployment
| Flag | Description | Default |
|------|-------------|---------|
| `--media-output-fs-url` | Filesystem URL for storing generated media | `file:///tmp/dynamo_media` |
| `--default-height` | Default video height | `480` |
| `--default-width` | Default video width | `832` |
| `--default-num-frames` | Default frame count | `81` |
| `--enable-teacache` | Enable TeaCache optimization | `False` |
| `--disable-torch-compile` | Disable torch.compile | `False` |
### Limitations
- Video diffusion is experimental and not recommended for production use
- Only text-to-video is supported in this release (image-to-video planned)
- Requires GPU with sufficient VRAM for the diffusion model
## Logits Processing
Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.
### How it works
- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
### Quick test: HelloWorld processor
You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```
Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- Expected chat response contains "Hello world".
### Bring your own processor
Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:
```python
from typing import Sequence

import torch

from dynamo.logits_processing import BaseLogitsProcessor


class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        if self.temperature == 1.0:
            return
        logits.div_(self.temperature)
```
Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:
```python
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor
processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
```
### Current limitations
- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
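The in-place requirement is easy to get wrong, so here is a minimal sketch of the difference between mutating the caller's logits and rebinding a local name. To stay dependency-free it uses a plain Python list as a stand-in for `torch.Tensor`; the classes `BanTokenProcessor` and `BrokenProcessor` are hypothetical and only mimic the `__call__(input_ids, logits)` shape described above.

```python
from typing import List, Sequence


class BanTokenProcessor:
    """Toy processor with the same call shape as BaseLogitsProcessor."""

    def __init__(self, banned_token_id: int):
        self.banned_token_id = banned_token_id

    def __call__(self, input_ids: Sequence[int], logits: List[float]) -> None:
        # Correct: mutate the object the caller passed in.
        logits[self.banned_token_id] = float("-inf")


class BrokenProcessor(BanTokenProcessor):
    def __call__(self, input_ids: Sequence[int], logits: List[float]) -> None:
        # Incorrect: this builds a new list and rebinds the local name,
        # so the caller's logits are left unchanged.
        logits = [
            float("-inf") if i == self.banned_token_id else value
            for i, value in enumerate(logits)
        ]


good_logits = [0.1, 0.5, 0.2]
BanTokenProcessor(banned_token_id=1)([], good_logits)

bad_logits = [0.1, 0.5, 0.2]
BrokenProcessor(banned_token_id=1)([], bad_logits)

print(good_logits)
print(bad_logits)
```

With a real tensor the same distinction applies: in-place operations such as `logits.div_(...)` modify the caller's tensor, while an expression like `logits = logits / t` only rebinds the local variable.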
## DP Rank Routing (Attention Data Parallelism)
TensorRT-LLM supports [attention data parallelism](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models) (attention DP) for models like DeepSeek. When enabled, multiple attention DP ranks run within a single worker, each with its own KV cache. Dynamo can route requests to specific DP ranks based on KV cache state.
### Dynamo vs TRT-LLM Internal Routing
- **Dynamo DP Rank Routing**: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (`attention_dp_relax=False`). Use this with `--router-mode kv` for cache-aware routing.
- **TRT-LLM Internal Routing**: TRT-LLM's scheduler assigns DP ranks internally. Use this with `--router-mode round-robin` or `random` when KV-aware routing isn't needed.
### Enabling DP Rank Routing
```bash
# Worker with attention DP
# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.trtllm \
--model-path <MODEL_PATH> \
--tensor-parallel-size 2 \
--enable-attention-dp \
--publish-events-and-metrics
# Frontend with KV routing
python3 -m dynamo.frontend --router-mode kv
```
The `--enable-attention-dp` flag sets `attention_dp_size = tensor_parallel_size` and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each `(worker_id, dp_rank)` combination.
> [!NOTE]
> Attention DP requires TRT-LLM's PyTorch backend. AutoDeploy does not support attention DP.
## Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
## Dynamo KV Block Manager Integration
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
For instructions, see [Running KVBM in TensorRT-LLM](../../components/kvbm/kvbm-guide.md#run-kvbm-in-dynamo-with-tensorrt-llm).
## Known Issues and Mitigations
### KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)
**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
**Symptoms:**
- Workers function normally initially but hang after heavy load testing
- Inference requests get stuck and eventually timeout
- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`
**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to timeout, leaving workers stuck waiting for phantom transfers and entering an irrecoverable deadlock state.
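The condition in the root cause can be checked before deployment. The sketch below is one way to do that, assuming the engine config has already been loaded into a dictionary; the function name `buffer_covers_max_isl` and the choice to treat an unset value as a failed check are assumptions for illustration, not part of Dynamo or TensorRT-LLM.

```python
def buffer_covers_max_isl(engine_config: dict, max_isl: int) -> bool:
    """Return True if max_tokens_in_buffer exceeds the maximum expected ISL."""
    transceiver = engine_config.get("cache_transceiver_config", {})
    buffer_tokens = transceiver.get("max_tokens_in_buffer")
    if buffer_tokens is None:
        # Assumption: an unset value is treated as "not verified".
        return False
    return buffer_tokens > max_isl


# Stand-in for a parsed prefill.yaml or decode.yaml.
config = {
    "cache_transceiver_config": {
        "backend": "DEFAULT",
        "max_tokens_in_buffer": 65536,
    }
}

print(buffer_covers_max_isl(config, max_isl=32768))
print(buffer_covers_max_isl(config, max_isl=131072))
```

Running the same check against both the prefill and decode configs catches a mismatch on either side.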
**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):
```yaml
cache_transceiver_config:
  backend: DEFAULT
  max_tokens_in_buffer: 65536  # Must exceed max ISL
```
For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.

**Related Issue:** [#4327](https://github.com/ai-dynamo/dynamo/issues/4327)

You can deploy TensorRT-LLM with Dynamo on Kubernetes using a `DynamoGraphDeployment`. For more details, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).

## Next Steps
- **[Reference Guide](trtllm-reference-guide.md)**: Features, configuration, and operational details
- **[Examples](trtllm-examples.md)**: All deployment patterns with launch scripts
- **[KV Cache Transfer](trtllm-kv-cache-transfer.md)**: KV cache transfer methods for disaggregated serving
- **[Prometheus Metrics](trtllm-prometheus.md)**: Metrics and monitoring
- **[Multinode Examples](multinode/trtllm-multinode-examples.md)**: Multi-node deployment with SLURM
- **[Deploying TensorRT-LLM with Dynamo on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)**: Kubernetes deployment guide
@@ -4,6 +4,10 @@
title: Multinode Examples
---
For general TensorRT-LLM features and configuration, see the [Reference Guide](../trtllm-reference-guide.md).
---
> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),
@@ -36,8 +40,8 @@ For simplicity of the example, we will make some assumptions about your slurm cl
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container based plugins, you may be able to
modify the script to use that instead.
3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as 3. Third, we assume you have a Dynamo+TRTLLM container image available.
described [here](../README.md#build-container). You can use the [prebuilt container](../README.md#quick-start) or [build a custom one](../trtllm-building-custom-container.md).
This is the image that can be set to the `IMAGE` environment variable in later steps.
4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
will allocate 8 nodes below as a reference command to have enough capacity
@@ -75,8 +79,11 @@ inside an interactive shell on one of the allocated nodes, set the
following environment variables based:
```bash
# NOTE: IMAGE must be set manually for now
# To build an image, see the steps here: # Use the prebuilt container from NGC (see ../README.md#quick-start):
# ../README.md#build-container # export IMAGE="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
# Or build a custom one (see ../trtllm-building-custom-container.md)
# Or you can also download the image to shared storage and point
# IMAGE to the local path.
export IMAGE="<dynamo_trtllm_image>"
# MOUNTS are the host:container path pairs that are mounted into the containers
@@ -241,7 +248,7 @@ curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
"messages": [
{
"role": "user",
"content": "Tell me a story as if we were playing dungeons and dragons." "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
}
],
"stream": true,
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Building a Custom TensorRT-LLM Container
---
For the prebuilt container, see the [TensorRT-LLM Quick Start](README.md#quick-start).
## Building a Custom Container
If you need to build a container from source (e.g., for custom modifications or a different CUDA version):
```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
# On an x86 machine:
python container/render.py --framework=trtllm --target=runtime --output-short-filename --cuda-version=13.1
docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
# On an ARM machine:
python container/render.py --framework=trtllm --target=runtime --platform=arm64 --output-short-filename --cuda-version=13.1
docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
```
Run the custom container:
```bash
./container/run.sh --framework trtllm -it
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: DP Rank Routing (Attention Data Parallelism)
---
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
---
TensorRT-LLM supports [attention data parallelism](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models) (attention DP) for models like DeepSeek. When enabled, multiple attention DP ranks run within a single worker, each with its own KV cache. Dynamo can route requests to specific DP ranks based on KV cache state.
### Dynamo vs TRT-LLM Internal Routing
- **Dynamo DP Rank Routing**: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (`attention_dp_relax=False`). Use this with `--router-mode kv` for cache-aware routing.
- **TRT-LLM Internal Routing**: TRT-LLM's scheduler assigns DP ranks internally. Use this with `--router-mode round-robin` or `random` when KV-aware routing isn't needed.
### Enabling DP Rank Routing
```bash
# Worker with attention DP
# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.trtllm \
--model-path <MODEL_PATH> \
--tensor-parallel-size 2 \
--enable-attention-dp \
--publish-events-and-metrics
# Frontend with KV routing
python3 -m dynamo.frontend --router-mode kv
```
The `--enable-attention-dp` flag sets `attention_dp_size = tensor_parallel_size` and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each `(worker_id, dp_rank)` combination.
<Note>
Attention DP requires TRT-LLM's PyTorch backend. AutoDeploy does not support attention DP.
</Note>
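To make the `(worker_id, dp_rank)` targets concrete, the following sketch enumerates them for a set of workers and picks the one with the largest KV cache overlap. It is a simplified model under stated assumptions: the function names, worker IDs, and overlap counts are hypothetical, and the real router's scoring may weigh more than overlap alone.

```python
from itertools import product
from typing import Dict, List, Tuple

Target = Tuple[str, int]  # (worker_id, dp_rank)


def routing_targets(
    worker_ids: List[str], tensor_parallel_size: int, enable_attention_dp: bool
) -> List[Target]:
    """Enumerate one routing target per (worker_id, dp_rank) combination."""
    # With attention DP enabled, attention_dp_size equals tensor_parallel_size.
    dp_size = tensor_parallel_size if enable_attention_dp else 1
    return list(product(worker_ids, range(dp_size)))


def pick_target(overlap_blocks: Dict[Target, int]) -> Target:
    """Pick the target whose KV cache overlaps the request the most."""
    return max(overlap_blocks, key=lambda target: overlap_blocks[target])


targets = routing_targets(["worker-0", "worker-1"], 2, enable_attention_dp=True)
print(targets)

# Hypothetical overlap counts (matching KV blocks) for one incoming request.
overlaps = dict(zip(targets, [3, 7, 0, 5]))
print(pick_target(overlaps))
```

Two workers with `--tensor-parallel-size 2` therefore yield four targets, and a request is pinned to a single DP rank rather than to the worker as a whole.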
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Examples
---
For quick start instructions, see the [TensorRT-LLM README](README.md). This document provides all deployment patterns for running TensorRT-LLM with Dynamo, including single-node, multi-node, and Kubernetes deployments.
## Table of Contents
- [Infrastructure Setup](#infrastructure-setup)
- [Single Node Examples](#single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Client](#client)
- [Benchmarking](#benchmarking)
## Infrastructure Setup
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
<Note>
- **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
- **NATS** is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use `--no-router-kv-events` on the frontend for prediction-based routing without events.
- **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
</Note>
<Tip>
Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. Each shell script simply runs `python3 -m dynamo.frontend <args>` to start up the ingress and `python3 -m dynamo.trtllm <args>` to start up the workers.
</Tip>
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router-guide.md).
## Single Node Examples
### Aggregated
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```
### Aggregated with KV Routing
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh
```
### Disaggregated
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh
```
### Disaggregated with KV Routing
<Note>
In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
</Note>
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh
```
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
```
<Note>
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
</Note>
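To see why the acceptance rate matters so much for MTP, a common back-of-the-envelope model for speculative decoding estimates how many tokens each decoding step yields. The sketch below assumes every draft token is accepted independently with the same probability, which is a simplification; the function name is hypothetical and nothing here is a TensorRT-LLM API.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per step when draft tokens are accepted i.i.d."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the one token the step always yields.
        return float(num_draft_tokens + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)


for rate in (0.0, 0.5, 0.9):
    print(rate, expected_tokens_per_step(rate, num_draft_tokens=3))
```

Under this model, an artificially inflated acceptance rate (for example from repetitive output produced with `ignore_eos` enabled) directly inflates the apparent throughput gain, which is why realistic prompts matter when benchmarking.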
## Advanced Examples
### Multinode Deployment
For comprehensive instructions on multinode serving, see the [Multinode Examples](./multinode/trtllm-multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see the [Llama4 + Eagle](./trtllm-llama4-plus-eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./trtllm-llama4-plus-eagle.md)**
### Model-Specific Guides
- **[Gemma3 with Sliding Window Attention](./trtllm-gemma3-sliding-window-attention.md)**
- **[GPT-OSS-120b](./trtllm-gpt-oss.md)** — Reasoning model with tool calling support
### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
### Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md).
## Client
See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
<Note>
To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
</Note>
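For readers who prefer a scripted client over `curl`, here is a minimal sketch that builds a request for the frontend's `/v1/chat/completions` endpoint using only the standard library. The host, port, and model name mirror the Quick Start defaults and should be adjusted to your deployment; the helper `build_chat_request` is hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    host: str, port: int, model: str, prompt: str, max_tokens: int = 30
) -> urllib.request.Request:
    """Build a POST request for the frontend's chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# For a multi-node deployment, `host` is the node running dynamo.frontend.
request = build_chat_request("localhost", 8000, "Qwen/Qwen3-0.6B", "Hello!")
print(request.full_url)

# Sending the request needs a running deployment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Separating request construction from sending keeps the payload easy to inspect before pointing it at a live frontend.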
## Benchmarking
To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
@@ -4,13 +4,16 @@
title: Gemma3 Sliding Window
---
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
---
This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.
> [!Note]
> - Ensure that required services such as `nats` and `etcd` are running before starting.
> - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
## Aggregated Serving
```bash
......
...@@ -4,6 +4,10 @@ ...@@ -4,6 +4,10 @@
title: GPT-OSS title: GPT-OSS
--- ---
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
---
Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs. Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
## Overview ## Overview
...@@ -170,7 +174,7 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.trtllm \ ...@@ -170,7 +174,7 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.trtllm \
--expert-parallel-size 4 --expert-parallel-size 4
``` ```
### 6. Verify the Deployment is Ready ### 5. Verify the Deployment is Ready
Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started: Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started:
``` ```
...@@ -190,7 +194,7 @@ Make sure that both of the endpoints are available before sending an inference r ...@@ -190,7 +194,7 @@ Make sure that both of the endpoints are available before sending an inference r
If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress. If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
### 7. Test the Deployment ### 6. Test the Deployment
Send a test request to verify the deployment: Send a test request to verify the deployment:
...@@ -207,10 +211,10 @@ curl -X POST http://localhost:8000/v1/responses \ ...@@ -207,10 +211,10 @@ curl -X POST http://localhost:8000/v1/responses \
The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like `max_tokens`, `temperature`, and others according to your needs. The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like `max_tokens`, `temperature`, and others according to your needs.
### 8. Reasoning and Tool Calling ### 7. Reasoning and Tool Calling
Dynamo has supported reasoning and tool calling in OpenAI Chat Completion endpoint. A typical workflow for application built on top of Dynamo Dynamo has supported reasoning and tool calling in OpenAI Chat Completion endpoint. A typical workflow for application built on top of Dynamo
is that the application has a set of tools to aid the assistant provide accurate answer, and it is ususally is that the application has a set of tools to aid the assistant provide accurate answer, and it is usually
multi-turn as it involves tool selection and generation based on the tool result. multi-turn as it involves tool selection and generation based on the tool result.
In addition, the reasoning effort can be configured through ```chat_template_args```. Increasing the reasoning effort makes the model more accurate but also slower. It supports three levels: ```low```, ```medium```, and ```high```. In addition, the reasoning effort can be configured through ```chat_template_args```. Increasing the reasoning effort makes the model more accurate but also slower. It supports three levels: ```low```, ```medium```, and ```high```.
...@@ -514,7 +518,7 @@ flowchart TD ...@@ -514,7 +518,7 @@ flowchart TD
```bash ```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "openai/gpt-oss-120b", "model": "openai/gpt-oss-120b",
"messages": [{"role": "user", "content": "Hello"}], "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
"chat_template_args": { "chat_template_args": {
"reasoning_effort": "high" "reasoning_effort": "high"
}, },
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Known Issues and Mitigations
---
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
---
### KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)
**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
**Symptoms:**
- Workers function normally initially but hang after heavy load testing
- Inference requests get stuck and eventually time out
- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`
**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to time out, leaving workers waiting for transfers that never complete and stuck in an unrecoverable deadlock.
**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):
```yaml
cache_transceiver_config:
  backend: DEFAULT
  max_tokens_in_buffer: 65536  # Must exceed max ISL
```
For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.
**Related Issue:** [#4327](https://github.com/ai-dynamo/dynamo/issues/4327)
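The mitigation can be checked before launch with a small validation step. The sketch below is a minimal illustration, assuming the engine config has already been loaded into a Python dict with the same structure as the YAML above. The function name `check_transceiver_buffer` is hypothetical, and treating an unset value as an error is a choice made by this example, not documented TensorRT-LLM behavior.

```python
def check_transceiver_buffer(engine_config: dict, max_isl: int) -> None:
    """Raise ValueError unless max_tokens_in_buffer exceeds max_isl."""
    transceiver = engine_config.get("cache_transceiver_config") or {}
    buffer_tokens = transceiver.get("max_tokens_in_buffer")
    if buffer_tokens is None:
        raise ValueError(
            "cache_transceiver_config.max_tokens_in_buffer is not set"
        )
    if buffer_tokens <= max_isl:
        raise ValueError(
            f"max_tokens_in_buffer={buffer_tokens} must exceed "
            f"the maximum ISL of {max_isl}"
        )


# Same structure as the YAML snippet, written as a dict.
config = {
    "cache_transceiver_config": {
        "backend": "DEFAULT",
        "max_tokens_in_buffer": 65536,
    }
}
check_transceiver_buffer(config, max_isl=32768)  # passes silently
```

Running the same check against both `prefill.yaml` and `decode.yaml` covers both worker types.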
...@@ -4,11 +4,15 @@ ...@@ -4,11 +4,15 @@
title: KV Cache Transfer title: KV Cache Transfer
--- ---
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
---
In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer: In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
## Using NIXL for KV Cache Transfer ## Using NIXL for KV Cache Transfer
Start the disaggregated service: See [Disaggregated Serving](./README.md#disaggregated) to learn how to start the deployment. Start the disaggregated service: See [Disaggregated Serving](./trtllm-examples.md#disaggregated) to learn how to start the deployment.
## Default Method: NIXL ## Default Method: NIXL
By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments. By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
......
...@@ -4,7 +4,7 @@ ...@@ -4,7 +4,7 @@
title: Llama4 + Eagle title: Llama4 + Eagle
--- ---
This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples.md) to set up the environment for the following scenarios: This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/trtllm-multinode-examples.md) to set up the environment for the following scenarios:
- **Aggregated Serving:** - **Aggregated Serving:**
Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving. Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
...@@ -33,7 +33,7 @@ export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" ...@@ -33,7 +33,7 @@ export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
``` ```
See [this](./multinode/multinode-examples.md#setup) section from multinode guide to learn more about the above options. See the [multinode setup instructions](./multinode/trtllm-multinode-examples.md#setup) to learn more about the above options.
## Aggregated Serving ## Aggregated Serving
...@@ -55,12 +55,12 @@ export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4 ...@@ -55,12 +55,12 @@ export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4
## Example Request ## Example Request
See [here](./multinode/multinode-examples.md#example-request) to learn how to send a request to the deployment. See the [example request section](./multinode/trtllm-multinode-examples.md#example-request) to learn how to send a request to the deployment.
``` ```
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8", "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
"messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}], "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
"max_tokens": 1024 "max_tokens": 1024
}' -w "\n" }' -w "\n"
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Logits Processing
---
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
---
Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.
### How it works
- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
### Quick test: HelloWorld processor
You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```
<Note>
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- Expected chat response contains "Hello world".
</Note>
### Bring your own processor
Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:
```python
from typing import Sequence

import torch

from dynamo.logits_processing import BaseLogitsProcessor


class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        if self.temperature == 1.0:
            return
        logits.div_(self.temperature)
```
Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:
```python
from dynamo.logits_processing.examples import TemperatureProcessor
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters

# `sampling_params` is the request's existing SamplingParams object.
processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
```
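As a second illustration of the in-place contract, the sketch below masks a fixed set of token IDs. `BanTokensProcessor` is a hypothetical name, and the example assumes the logits can be indexed by token ID as a one-dimensional sequence, which may not match the tensor shape the adapter actually passes. To stay runnable without Dynamo or PyTorch installed, it does not subclass `BaseLogitsProcessor` and uses a plain list in place of a tensor.

```python
import math
from typing import MutableSequence, Sequence


class BanTokensProcessor:
    """Hypothetical processor that masks a fixed set of token IDs.

    In a real deployment this would subclass
    dynamo.logits_processing.BaseLogitsProcessor; the base class is
    omitted here so the sketch runs standalone.
    """

    def __init__(self, banned_token_ids: Sequence[int]):
        self.banned_token_ids = tuple(banned_token_ids)

    def __call__(self, input_ids: Sequence[int],
                 logits: MutableSequence[float]) -> None:
        # Modify in place and return nothing, as the interface requires.
        for token_id in self.banned_token_ids:
            logits[token_id] = -math.inf


# A plain list stands in for the logits tensor.
logits = [0.5, 1.5, -0.2, 2.0]
BanTokensProcessor([1, 3])(input_ids=[0], logits=logits)
```

After the call, positions 1 and 3 hold negative infinity, so those tokens can no longer be sampled, while the other entries are unchanged.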
### Current limitations
- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
...@@ -4,6 +4,10 @@ ...@@ -4,6 +4,10 @@
title: Prometheus title: Prometheus
--- ---
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
---
## Overview ## Overview
When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint. When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
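To show how the two prefixes can be told apart once the `/metrics` output has been fetched, here is a minimal sketch that groups metric names from Prometheus text output by prefix. The function name `group_metric_names` is hypothetical, and the metric names in the sample are placeholders, not real TensorRT-LLM or Dynamo metrics.

```python
from collections import defaultdict


def group_metric_names(metrics_text: str) -> dict:
    """Group metric names from Prometheus text output by source prefix."""
    groups = defaultdict(set)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("trtllm_"):
            groups["engine"].add(name)
        elif name.startswith("dynamo_"):
            groups["runtime"].add(name)
        else:
            groups["other"].add(name)
    return {key: sorted(names) for key, names in groups.items()}


# Placeholder sample; these are not real metric names.
sample = """\
# HELP trtllm_placeholder_total Placeholder metric
trtllm_placeholder_total{model="m"} 3
dynamo_placeholder_seconds 0.25
"""
grouped = group_metric_names(sample)
```

In practice, `metrics_text` would be the response body from the worker's `/metrics` endpoint.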
...@@ -52,7 +56,7 @@ curl -H 'Content-Type: application/json' \ ...@@ -52,7 +56,7 @@ curl -H 'Content-Type: application/json' \
-d '{ -d '{
"model": "<model_name>", "model": "<model_name>",
"max_completion_tokens": 100, "max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}] "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}]
}' \ }' \
http://localhost:8000/v1/chat/completions http://localhost:8000/v1/chat/completions
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Reference Guide
subtitle: Features, configuration, and operational details for the TensorRT-LLM backend
---
## Building a Custom Container
To build a TensorRT-LLM container from source (e.g., for custom modifications or a different CUDA version), see the [Building a Custom Container](./trtllm-building-custom-container.md) guide.
## KV Cache Transfer
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV Cache Transfer Guide](./trtllm-kv-cache-transfer.md).
## Request Migration
Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
## Multimodal Support
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md).
## Video Diffusion Support (Experimental)
Dynamo supports video generation using diffusion models through TensorRT-LLM. For requirements, supported models, API usage, and configuration options, see the [Video Diffusion Guide](./trtllm-video-diffusion.md).
## Logits Processing
Logits processors let you modify the next-token logits at every decoding step. Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM. For the API, examples, and how to bring your own processor, see the [Logits Processing Guide](./trtllm-logits-processing.md).
## DP Rank Routing (Attention Data Parallelism)
TensorRT-LLM supports attention data parallelism for models like DeepSeek, enabling KV-cache-aware routing to specific DP ranks. For configuration and usage details, see the [DP Rank Routing Guide](./trtllm-dp-rank-routing.md).
## KVBM Integration
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
See the instructions here: [Running KVBM in TensorRT-LLM](../../components/kvbm/kvbm-guide.md#run-kvbm-in-dynamo-with-tensorrt-llm).
## Observability
TensorRT-LLM exposes Prometheus metrics for monitoring inference performance. For detailed metrics reference, collection setup, and Grafana integration, see the [Prometheus Metrics Guide](./trtllm-prometheus.md).
## Known Issues and Mitigations
For known issues, workarounds, and mitigations, see the [Known Issues and Mitigations](./trtllm-known-issues.md) page.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Video Diffusion Support (Experimental)
---
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
---
Dynamo supports video generation using diffusion models through the `--modality video_diffusion` flag.
## Requirements
- **TensorRT-LLM with visual_gen**: The `visual_gen` module is part of TensorRT-LLM (`tensorrt_llm._torch.visual_gen`). Install TensorRT-LLM following the [official instructions](https://github.com/NVIDIA/TensorRT-LLM#installation).
- **imageio with ffmpeg**: Required for encoding generated frames to MP4 video:
```bash
pip install "imageio[ffmpeg]"
```
- **dynamo-runtime with video API**: The Dynamo runtime must include `ModelType.Videos` support. Ensure you're using a compatible version.
## Supported Models
| Diffusers Pipeline | Description | Example Model |
|--------------------|-------------|---------------|
| `WanPipeline` | Wan 2.1/2.2 Text-to-Video | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` |
The pipeline type is **auto-detected** from the model's `model_index.json` — no `--model-type` flag is needed.
## Quick Start
```bash
python -m dynamo.trtllm \
  --modality video_diffusion \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --media-output-fs-url file:///tmp/dynamo_media
```
## API Endpoint
Video generation uses the `/v1/videos` endpoint:
```bash
curl -X POST http://localhost:8000/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A cat playing piano",
    "model": "wan_t2v",
    "seconds": 4,
    "size": "832x480",
    "nvext": {
      "fps": 24
    }
  }'
```
## Configuration Options
| Flag | Description | Default |
|------|-------------|---------|
| `--media-output-fs-url` | Filesystem URL for storing generated media | `file:///tmp/dynamo_media` |
| `--default-height` | Default video height | `480` |
| `--default-width` | Default video width | `832` |
| `--default-num-frames` | Default frame count | `81` |
| `--enable-teacache` | Enable TeaCache optimization | `False` |
| `--disable-torch-compile` | Disable torch.compile | `False` |
## Limitations
- Video diffusion is experimental and not recommended for production use
- Only text-to-video is supported in this release (image-to-video planned)
- Requires GPU with sufficient VRAM for the diffusion model
...@@ -146,8 +146,19 @@ navigation: ...@@ -146,8 +146,19 @@ navigation:
path: backends/sglang/sglang-observability.md path: backends/sglang/sglang-observability.md
- page: Agentic Workloads - page: Agentic Workloads
path: backends/sglang/agents.md path: backends/sglang/agents.md
- page: TensorRT-LLM - section: TensorRT-LLM
path: backends/trtllm/README.md path: backends/trtllm/README.md
contents:
- page: Reference Guide
path: backends/trtllm/trtllm-reference-guide.md
- page: Examples
path: backends/trtllm/trtllm-examples.md
- page: Prometheus Metrics
path: backends/trtllm/trtllm-prometheus.md
- page: Video Diffusion (Experimental)
path: backends/trtllm/trtllm-video-diffusion.md
- page: Known Issues and Mitigations
path: backends/trtllm/trtllm-known-issues.md
- page: vLLM - page: vLLM
path: backends/vllm/README.md path: backends/vllm/README.md
...@@ -301,18 +312,22 @@ navigation: ...@@ -301,18 +312,22 @@ navigation:
path: backends/vllm/vllm-omni.md path: backends/vllm/vllm-omni.md
- section: TensorRT-LLM Details - section: TensorRT-LLM Details
contents: contents:
- page: Building a Custom Container
path: backends/trtllm/trtllm-building-custom-container.md
- page: KV Cache Transfer
path: backends/trtllm/trtllm-kv-cache-transfer.md
- page: Logits Processing
path: backends/trtllm/trtllm-logits-processing.md
- page: DP Rank Routing
path: backends/trtllm/trtllm-dp-rank-routing.md
- page: Multinode Examples - page: Multinode Examples
path: backends/trtllm/multinode/multinode-examples.md path: backends/trtllm/multinode/trtllm-multinode-examples.md
- page: Llama4 + Eagle - page: Llama4 + Eagle
path: backends/trtllm/llama4-plus-eagle.md path: backends/trtllm/trtllm-llama4-plus-eagle.md
- page: KV Cache Transfer
path: backends/trtllm/kv-cache-transfer.md
- page: Gemma3 Sliding Window - page: Gemma3 Sliding Window
path: backends/trtllm/gemma3-sliding-window-attention.md path: backends/trtllm/trtllm-gemma3-sliding-window-attention.md
- page: GPT-OSS - page: GPT-OSS
path: backends/trtllm/gpt-oss.md path: backends/trtllm/trtllm-gpt-oss.md
- page: Prometheus
path: backends/trtllm/prometheus.md
# -- Features (hidden sub-pages) -- # -- Features (hidden sub-pages) --
- section: Speculative Decoding - section: Speculative Decoding
path: features/speculative-decoding/README.md path: features/speculative-decoding/README.md
......
...@@ -84,7 +84,7 @@ Dynamo exposes several categories of metrics: ...@@ -84,7 +84,7 @@ Dynamo exposes several categories of metrics:
- **Frontend Metrics** (`dynamo_frontend_*`) - Request handling, token processing, and latency measurements - **Frontend Metrics** (`dynamo_frontend_*`) - Request handling, token processing, and latency measurements
- **Component Metrics** (`dynamo_component_*`) - Request counts, processing times, byte transfers, and system uptime - **Component Metrics** (`dynamo_component_*`) - Request counts, processing times, byte transfers, and system uptime
- **Specialized Component Metrics** (e.g., `dynamo_preprocessor_*`) - Component-specific metrics - **Specialized Component Metrics** (e.g., `dynamo_preprocessor_*`) - Component-specific metrics
- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/vllm-observability.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-observability.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm_*`) - **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/vllm-observability.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-observability.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/trtllm-prometheus.md) (`trtllm_*`)
## Runtime Hierarchy ## Runtime Hierarchy
......
...@@ -241,7 +241,7 @@ TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving ...@@ -241,7 +241,7 @@ TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving
- **UCX** (default): Standard method for KV cache transfer - **UCX** (default): Standard method for KV cache transfer
- **NIXL** (experimental): Alternative transfer method - **NIXL** (experimental): Alternative transfer method
For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/kv-cache-transfer.md). For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/trtllm-kv-cache-transfer.md).
## Request Migration ## Request Migration
...@@ -269,8 +269,8 @@ Configure the `model` name and `host` based on your deployment. ...@@ -269,8 +269,8 @@ Configure the `model` name and `host` based on your deployment.
- **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation-guide.md) - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation-guide.md)
- **Examples**: [Deployment Examples](../../../../docs/getting-started/examples.md) - **Examples**: [Deployment Examples](../../../../docs/getting-started/examples.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design-docs/disagg-serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md) - **Architecture Docs**: [Disaggregated Serving](../../../../docs/design-docs/disagg-serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md)
- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md) - **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/trtllm-multinode-examples.md)
- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4-plus-eagle.md) - **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/trtllm-llama4-plus-eagle.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
## Troubleshooting ## Troubleshooting
......
...@@ -41,7 +41,7 @@ Please note that: ...@@ -41,7 +41,7 @@ Please note that:
3. `post_process.py` - Scan the aiperf results to produce a json with entries to each config point. 3. `post_process.py` - Scan the aiperf results to produce a json with entries to each config point.
4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization. 4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization.
For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide. For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../../../../docs/backends/trtllm/multinode/trtllm-multinode-examples.md). This guide shares similar assumption to the multinode examples guide.
## Usage ## Usage
......