docs: Refactor TensorRT-LLM backend docs (#6782)

Commit 1bb28d6f, authored Mar 02, 2026 by Tanmay Verma; committed by GitHub Mar 03, 2026
18 changed files
--- a/docs/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -4,28 +4,13 @@
 title: TensorRT-LLM
 ---
-This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
 ## Use the Latest Release
 We recommend using the [latest stable release](https://github.com/ai-dynamo/dynamo/releases/latest) of Dynamo to avoid breaking changes.
 ---
-## Table of Contents
+Dynamo TensorRT-LLM integrates [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#single-node-examples)
- [Advanced Examples](#advanced-examples)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)
- [Multimodal Support](#multimodal-support)
- [Video Diffusion Support](#video-diffusion-support-experimental)
- [Logits Processing](#logits-processing)
- [DP Rank Routing](#dp-rank-routing-attention-data-parallelism)
- [Performance Sweep](#performance-sweep)
- [Known Issues and Mitigations](#known-issues-and-mitigations)
 ## Feature Support Matrix
@@ -48,341 +33,58 @@ We recommend using the [latest stable release](https://github.com/ai-dynamo/dyna
 | **DP Rank Routing**| ✅           |                                                                 |
 | **GB200 Support**  | ✅           |                                                                 |
-## TensorRT-LLM Quick Start
+## Quick Start
-Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
-### Start Infrastructure Services (Local Development Only)
+**Step 1 (host terminal):** Start infrastructure services:
-For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):
 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```
-> [!NOTE]
+**Step 2 (host terminal):** Pull and run the prebuilt container:
-> - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
-> - **NATS** is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use `--no-router-kv-events` on the frontend for prediction-based routing without events
-> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
-### Build container
-```bash
-# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
-apt-get update && apt-get -y install git git-lfs
-# On an x86 machine:
-python container/render.py --framework=trtllm --target=runtime --output-short-filename --cuda-version=13.1
-docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
-# On an ARM machine:
-python container/render.py --framework=trtllm --target=runtime --platform=arm64 --output-short-filename --cuda-version=13.1
-docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
-```
-### Run container
 ```bash
-./container/run.sh --framework trtllm -it
+DYNAMO_VERSION=0.9.0
+docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
+docker run --gpus all -it --network host --ipc host \
+  nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
 ```
-## Single Node Examples
+> [!NOTE]
+> The `DYNAMO_VERSION` variable above can be set to any specific available version of the container.
-> [!IMPORTANT]
+> To find the available `tensorrtllm-runtime` versions for Dynamo, visit the [NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime).
-> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
-For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router-guide.md).
-### Aggregated
-```bash
-cd $DYNAMO_HOME/examples/backends/trtllm
-./launch/agg.sh
-```
-### Aggregated with KV Routing
-```bash
-cd $DYNAMO_HOME/examples/backends/trtllm
-./launch/agg_router.sh
-```
-### Disaggregated
-```bash
-cd $DYNAMO_HOME/examples/backends/trtllm
-./launch/disagg.sh
-```
-### Disaggregated with KV Routing
-> [!IMPORTANT]
-> In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
-```bash
+**Step 3 (inside the container):** Launch an aggregated serving deployment (uses `Qwen/Qwen3-0.6B` by default):
-cd $DYNAMO_HOME/examples/backends/trtllm
-./launch/disagg_router.sh
-```
-### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
 ```bash
 cd $DYNAMO_HOME/examples/backends/trtllm
-export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
-export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
-# nvidia/DeepSeek-R1-FP4 is a large model
-export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
 ./launch/agg.sh
 ```
-Notes:
+The launch script will automatically download the model and start the TensorRT-LLM engine. You can override the model by setting `MODEL_PATH` and `SERVED_MODEL_NAME` environment variables before running the script.
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
-## Advanced Examples
-Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
-### Multinode Deployment
-For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4-plus-eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
-### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4-plus-eagle.md)**
-### Kubernetes Deployment
-For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
-### Client
-See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
-NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
-### Benchmarking
-To benchmark your deployment with AIPerf, see this utility script, configuring the
-`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
-## KV Cache Transfer in Disaggregated Serving
-Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
-## Request Migration
-Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
-## Request Cancellation
-When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
-### Cancellation Support Matrix
-| | Prefill | Decode |
-|-|---------|--------|
-| **Aggregated** | ✅ | ✅ |
-| **Disaggregated** | ✅ | ✅ |
-For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
-## Client
-See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
-NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
-## Benchmarking
-To benchmark your deployment with AIPerf, see this utility script, configuring the
-`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
-## Multimodal support
-Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md).
-## Video Diffusion Support (Experimental)
-Dynamo supports video generation using diffusion models through the `--modality video_diffusion` flag.
-### Requirements
- **TensorRT-LLM with visual_gen**: The `visual_gen` module is part of TensorRT-LLM (`tensorrt_llm._torch.visual_gen`). Install TensorRT-LLM following the [official instructions](https://github.com/NVIDIA/TensorRT-LLM#installation).
- **imageio with ffmpeg**: Required for encoding generated frames to MP4 video:
-  ```bash
-  pip install imageio[ffmpeg]
-  ```
- **dynamo-runtime with video API**: The Dynamo runtime must include `ModelType.Videos` support. Ensure you're using a compatible version.
-### Supported Models
-| Diffusers Pipeline | Description | Example Model |
-|--------------------|-------------|---------------|
-| `WanPipeline` | Wan 2.1/2.2 Text-to-Video | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` |
-The pipeline type is **auto-detected** from the model's `model_index.json` — no `--model-type` flag is needed.
-### Quick Start
-```bash
-python -m dynamo.trtllm \
-  --modality video_diffusion \
-  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
-  --media-output-fs-url file:///tmp/dynamo_media
-```
-### API Endpoint
-Video generation uses the `/v1/videos` endpoint:
+**Step 4 (host terminal):** Verify the deployment:
 ```bash
-curl -X POST http://localhost:8000/v1/videos \
+curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "prompt": "A cat playing piano",
+    "model": "Qwen/Qwen3-0.6B",
-    "model": "wan_t2v",
+    "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
-    "seconds": 4,
+    "stream": true,
-    "size": "832x480",
+    "max_tokens": 30
-    "nvext": {
-      "fps": 24
-    }
  }'
 ```
-### Configuration Options
+### Kubernetes Deployment
-| Flag | Description | Default |
-|------|-------------|---------|
-| `--media-output-fs-url` | Filesystem URL for storing generated media | `file:///tmp/dynamo_media` |
-| `--default-height` | Default video height | `480` |
-| `--default-width` | Default video width | `832` |
-| `--default-num-frames` | Default frame count | `81` |
-| `--enable-teacache` | Enable TeaCache optimization | `False` |
-| `--disable-torch-compile` | Disable torch.compile | `False` |
-### Limitations
- Video diffusion is experimental and not recommended for production use
- Only text-to-video is supported in this release (image-to-video planned)
- Requires GPU with sufficient VRAM for the diffusion model
-## Logits Processing
-Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.
-### How it works
- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
-### Quick test: HelloWorld processor
-You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.
-```bash
-cd $DYNAMO_HOME/examples/backends/trtllm
-export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
-./launch/agg.sh
-```
-Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- Expected chat response contains "Hello world".
-### Bring your own processor
-Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:
-```python
-from typing import Sequence
-import torch
-from dynamo.logits_processing import BaseLogitsProcessor
-class TemperatureProcessor(BaseLogitsProcessor):
-    def __init__(self, temperature: float = 1.0):
-        if temperature <= 0:
-            raise ValueError("Temperature must be positive")
-        self.temperature = temperature
-    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
-        if self.temperature == 1.0:
-            return
-        logits.div_(self.temperature)
-```
-Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:
-```python
-from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
-from dynamo.logits_processing.examples import TemperatureProcessor
-processors = [TemperatureProcessor(temperature=0.7)]
-sampling_params.logits_processor = create_trtllm_adapters(processors)
-```
-### Current limitations
- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
-## DP Rank Routing (Attention Data Parallelism)
-TensorRT-LLM supports [attention data parallelism](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models) (attention DP) for models like DeepSeek. When enabled, multiple attention DP ranks run within a single worker, each with its own KV cache. Dynamo can route requests to specific DP ranks based on KV cache state.
-### Dynamo vs TRT-LLM Internal Routing
- **Dynamo DP Rank Routing**: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (`attention_dp_relax=False`). Use this with `--router-mode kv` for cache-aware routing.
- **TRT-LLM Internal Routing**: TRT-LLM's scheduler assigns DP ranks internally. Use this with `--router-mode round-robin` or `random` when KV-aware routing isn't needed.
-### Enabling DP Rank Routing
-```bash
-# Worker with attention DP
-# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
-CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.trtllm \
-  --model-path <MODEL_PATH> \
-  --tensor-parallel-size 2 \
-  --enable-attention-dp \
-  --publish-events-and-metrics
-# Frontend with KV routing
-python3 -m dynamo.frontend --router-mode kv
-```
-The `--enable-attention-dp` flag sets `attention_dp_size = tensor_parallel_size` and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each `(worker_id, dp_rank)` combination.
-> [!NOTE]
-> Attention DP requires TRT-LLM's PyTorch backend. AutoDeploy does not support attention DP.
-## Performance Sweep
-For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
-## Dynamo KV Block Manager Integration
-Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
-Here is the instruction: [Running KVBM in TensorRT-LLM](../../components/kvbm/kvbm-guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
-## Known Issues and Mitigations
-### KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)
-**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
-**Symptoms:**
- Workers function normally initially but hang after heavy load testing
- Inference requests get stuck and eventually timeout
- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`
-**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to timeout, leaving workers stuck waiting for phantom transfers and entering an irrecoverable deadlock state.
-**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):
-```yaml
+You can deploy TensorRT-LLM with Dynamo on Kubernetes using a `DynamoGraphDeployment`. For more details, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
-cache_transceiver_config:
-  backend: DEFAULT
-  max_tokens_in_buffer: 65536  # Must exceed max ISL
-```
-For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.
+## Next Steps
-**Related Issue:** [#4327](https://github.com/ai-dynamo/dynamo/issues/4327)
+- **[Reference Guide](trtllm-reference-guide.md)**: Features, configuration, and operational details
+- **[Examples](trtllm-examples.md)**: All deployment patterns with launch scripts
+- **[KV Cache Transfer](trtllm-kv-cache-transfer.md)**: KV cache transfer methods for disaggregated serving
+- **[Prometheus Metrics](trtllm-prometheus.md)**: Metrics and monitoring
+- **[Multinode Examples](multinode/trtllm-multinode-examples.md)**: Multi-node deployment with SLURM
+- **[Deploying TensorRT-LLM with Dynamo on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)**: Kubernetes deployment guide
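The Step 4 verification request in the quick start above sets `"stream": true`, so the reply arrives as a sequence of chunks rather than a single JSON body. As a rough sketch of how a client might reassemble that stream, the Python below builds the same request body and joins the streamed text. It assumes the frontend emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`), which this diff does not confirm; the helper names `build_request`, `extract_text`, and `stream_chat` are hypothetical and not part of Dynamo.

```python
import json
import urllib.request


def build_request(model="Qwen/Qwen3-0.6B", prompt="Hello", max_tokens=30):
    """Build a JSON body shaped like the curl example in Step 4."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": max_tokens,
    }


def extract_text(sse_lines):
    """Join the delta.content pieces found in 'data: ...' lines.

    Assumes the OpenAI-style streaming format; stops at 'data: [DONE]'.
    """
    pieces = []
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                pieces.append(content)
    return "".join(pieces)


def stream_chat(url="http://localhost:8000/v1/chat/completions", **kwargs):
    """Send the request and return the assembled reply (needs a running deployment)."""
    body = json.dumps(build_request(**kwargs)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(line.decode("utf-8") for line in resp)
```

With the aggregated deployment from Step 3 running, `print(stream_chat(prompt="Say hi"))` would print the assembled reply. `extract_text` can also be exercised offline on captured lines, which is handy when checking the wire format.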
--- a/docs/backends/trtllm/multinode/multinode-examples.md
+++ b/docs/backends/trtllm/multinode/multinode-examples.md
@@ -4,6 +4,10 @@
 title: Multinode Examples
 ---
+For general TensorRT-LLM features and configuration, see the [Reference Guide](../trtllm-reference-guide.md).
+---
 > **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
 To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),
@@ -36,8 +40,8 @@ For simplicity of the example, we will make some assumptions about your slurm cl
   `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
   If your cluster supports similar container based plugins, you may be able to
   modify the script to use that instead.
-3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as
+3. Third, we assume you have a Dynamo+TRTLLM container image available.
-   described [here](../README.md#build-container).
+   You can use the [prebuilt container](../README.md#quick-start) or [build a custom one](../trtllm-building-custom-container.md).
   This is the image that can be set to the `IMAGE` environment variable in later steps.
 4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
   will allocate 8 nodes below as a reference command to have enough capacity
@@ -75,8 +79,11 @@ inside an interactive shell on one of the allocated nodes, set the
 following environment variables based:
 ```bash
 # NOTE: IMAGE must be set manually for now
-# To build an iamge, see the steps here:
+# Use the prebuilt container from NGC (see ../README.md#quick-start):
-# ../README.md#build-container
+#   export IMAGE="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
+# Or build a custom one (see ../trtllm-building-custom-container.md)
+# Or you can also download the image to shared storage and point
+# IMAGE to the local path.
 export IMAGE="<dynamo_trtllm_image>"
 # MOUNTS are the host:container path pairs that are mounted into the containers
@@ -241,7 +248,7 @@ curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
  "messages": [
  {
    "role": "user",
-    "content": "Tell me a story as if we were playing dungeons and dragons."
+    "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
  }
  ],
  "stream": true,

--- a/docs/backends/trtllm/trtllm-building-custom-container.md
+++ b/docs/backends/trtllm/trtllm-building-custom-container.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Building a Custom TensorRT-LLM Container
+---
+For the prebuilt container, see the [TensorRT-LLM Quick Start](README.md#quick-start).
+## Building a Custom Container
+If you need to build a container from source (e.g., for custom modifications or a different CUDA version):
+```bash
+# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
+apt-get update && apt-get -y install git git-lfs
+# On an x86 machine:
+python container/render.py --framework=trtllm --target=runtime --output-short-filename --cuda-version=13.1
+docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
+# On an ARM machine:
+python container/render.py --framework=trtllm --target=runtime --platform=arm64 --output-short-filename --cuda-version=13.1
+docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
+```
+Run the custom container:
+```bash
+./container/run.sh --framework trtllm -it
+```
--- a/docs/backends/trtllm/trtllm-dp-rank-routing.md
+++ b/docs/backends/trtllm/trtllm-dp-rank-routing.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: DP Rank Routing (Attention Data Parallelism)
+---
+For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
+---
+TensorRT-LLM supports [attention data parallelism](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models) (attention DP) for models like DeepSeek. When enabled, multiple attention DP ranks run within a single worker, each with its own KV cache. Dynamo can route requests to specific DP ranks based on KV cache state.
+### Dynamo vs TRT-LLM Internal Routing
+- **Dynamo DP Rank Routing**: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (`attention_dp_relax=False`). Use this with `--router-mode kv` for cache-aware routing.
+- **TRT-LLM Internal Routing**: TRT-LLM's scheduler assigns DP ranks internally. Use this with `--router-mode round-robin` or `random` when KV-aware routing isn't needed.
+### Enabling DP Rank Routing
+```bash
+# Worker with attention DP
+# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
+CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.trtllm \
+  --model-path <MODEL_PATH> \
+  --tensor-parallel-size 2 \
+  --enable-attention-dp \
+  --publish-events-and-metrics
+# Frontend with KV routing
+python3 -m dynamo.frontend --router-mode kv
+```
+The `--enable-attention-dp` flag sets `attention_dp_size = tensor_parallel_size` and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each `(worker_id, dp_rank)` combination.
+<Note>
+Attention DP requires TRT-LLM's PyTorch backend. AutoDeploy does not support attention DP.
+</Note>
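To make the `(worker_id, dp_rank)` routing targets described above concrete, here is a small illustrative Python sketch. It is not Dynamo's router implementation: `routing_targets`, `pick_target`, and the `overlap` score dictionary are hypothetical names used only for this example, and a real router may weigh factors other than cache overlap. The only detail taken from the text is that `--enable-attention-dp` sets `attention_dp_size = tensor_parallel_size`.

```python
from itertools import product


def routing_targets(worker_ids, tensor_parallel_size):
    """Enumerate one target per (worker_id, dp_rank) pair.

    With --enable-attention-dp, attention_dp_size equals
    tensor_parallel_size, so each worker contributes that many ranks.
    """
    attention_dp_size = tensor_parallel_size
    return list(product(worker_ids, range(attention_dp_size)))


def pick_target(targets, overlap):
    """Pick the target with the highest KV cache overlap score.

    `overlap` maps (worker_id, dp_rank) to a hypothetical score such as
    the number of cached prefix blocks matched; missing targets score 0.
    Ties go to the earliest target in the list.
    """
    return max(targets, key=lambda target: overlap.get(target, 0))


targets = routing_targets(["worker-a", "worker-b"], tensor_parallel_size=2)
best = pick_target(targets, {("worker-b", 0): 5, ("worker-a", 1): 3})
print(targets)
print(best)
```

Two workers launched with `--tensor-parallel-size 2` give four targets here, and the request goes to the rank whose cache already holds the most matching blocks.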
--- a/docs/backends/trtllm/trtllm-examples.md
+++ b/docs/backends/trtllm/trtllm-examples.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Examples
+---
+For quick start instructions, see the [TensorRT-LLM README](README.md). This document provides all deployment patterns for running TensorRT-LLM with Dynamo, including single-node, multi-node, and Kubernetes deployments.
+## Table of Contents
+- [Infrastructure Setup](#infrastructure-setup)
+- [Single Node Examples](#single-node-examples)
+- [Advanced Examples](#advanced-examples)
+- [Client](#client)
+- [Benchmarking](#benchmarking)
+## Infrastructure Setup
+For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
+```bash
+docker compose -f deploy/docker-compose.yml up -d
+```
+<Note>
+- **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
+- **NATS** is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use `--no-router-kv-events` on the frontend for prediction-based routing without events.
+- **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
+</Note>
+<Tip>
+Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. Each shell script simply runs `python3 -m dynamo.frontend <args>` to start up the ingress and `python3 -m dynamo.trtllm <args>` to start up the workers.
+</Tip>
+For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router-guide.md).
+## Single Node Examples
+### Aggregated
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/agg.sh
+```
+### Aggregated with KV Routing
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/agg_router.sh
+```
+### Disaggregated
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/disagg.sh
+```
+### Disaggregated with KV Routing
+<Note>
+In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
+</Note>
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/disagg_router.sh
+```
+### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
+export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
+# nvidia/DeepSeek-R1-FP4 is a large model
+export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
+./launch/agg.sh
+```
+<Note>
+- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
+- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
+</Note>
+## Advanced Examples
+### Multinode Deployment
+For comprehensive instructions on multinode serving, see the [Multinode Examples](./multinode/trtllm-multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see the [Llama4 + Eagle](./trtllm-llama4-plus-eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.
+### Speculative Decoding
+- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./trtllm-llama4-plus-eagle.md)**
+### Model-Specific Guides
+- **[Gemma3 with Sliding Window Attention](./trtllm-gemma3-sliding-window-attention.md)**
+- **[GPT-OSS-120b](./trtllm-gpt-oss.md)** — Reasoning model with tool calling support
+### Kubernetes Deployment
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
+### Performance Sweep
+For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md).
+## Client
+See the [client](../sglang/README.md#testing-the-deployment) section to learn how to send requests to the deployment.
+<Note>
+To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
+</Note>
+## Benchmarking
+To benchmark your deployment with AIPerf, see this utility script, configuring the
+`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
--- a/docs/backends/trtllm/gemma3-sliding-window-attention.md
+++ b/docs/backends/trtllm/gemma3-sliding-window-attention.md
@@ -4,13 +4,16 @@
 title: Gemma3 Sliding Window
 ---
+For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
+---
 This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
 VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.
 > [!Note]
 > - Ensure that required services such as `nats` and `etcd` are running before starting.
 > - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
-> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
 ## Aggregated Serving
 ```bash

--- a/docs/backends/trtllm/gpt-oss.md
+++ b/docs/backends/trtllm/gpt-oss.md
@@ -4,6 +4,10 @@
 title: GPT-OSS
 ---
+For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
+---
 Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
 ## Overview
@@ -170,7 +174,7 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.trtllm \
  --expert-parallel-size 4
 ```
-### 6. Verify the Deployment is Ready
+### 5. Verify the Deployment is Ready
 Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started:
 ```
@@ -190,7 +194,7 @@ Make sure that both of the endpoints are available before sending an inference r
 If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
-### 7. Test the Deployment
+### 6. Test the Deployment
 Send a test request to verify the deployment:
@@ -207,10 +211,10 @@ curl -X POST http://localhost:8000/v1/responses \
 The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like `max_tokens`, `temperature`, and others according to your needs.
-### 8. Reasoning and Tool Calling
+### 7. Reasoning and Tool Calling
 Dynamo has supported reasoning and tool calling in OpenAI Chat Completion endpoint. A typical workflow for application built on top of Dynamo
-is that the application has a set of tools to aid the assistant provide accurate answer, and it is ususally
+is that the application has a set of tools to help the assistant provide accurate answers, and it is usually
 multi-turn as it involves tool selection and generation based on the tool result.
 In addition, the reasoning effort can be configured through ```chat_template_args```. Increasing the reasoning effort makes the model more accurate but also slower. It supports three levels: ```low```, ```medium```, and ```high```.
@@ -514,7 +518,7 @@ flowchart TD
     ```bash
     curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
       "model": "openai/gpt-oss-120b",
-       "messages": [{"role": "user", "content": "Hello"}],
+       "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
       "chat_template_args": {
          "reasoning_effort": "high"
        },

--- a/docs/backends/trtllm/trtllm-known-issues.md
+++ b/docs/backends/trtllm/trtllm-known-issues.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Known Issues and Mitigations
+---
+For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
+---
+### KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)
+**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
+**Symptoms:**
+- Workers function normally initially but hang after heavy load testing
+- Inference requests get stuck and eventually time out
+- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
+- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`
+**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. Context transfers then time out, leaving workers waiting on transfers that will never complete and putting them in an unrecoverable deadlock.
+**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):
+```yaml
+cache_transceiver_config:
+  backend: DEFAULT
+  max_tokens_in_buffer: 65536  # Must exceed max ISL
+```
+For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.
+**Related Issue:** [#4327](https://github.com/ai-dynamo/dynamo/issues/4327)
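The mitigation above can be checked before launching workers. The following is a minimal sketch, assuming the engine config has already been parsed into a Python dict (for example with a YAML loader). The helper name `check_transfer_buffer` and its `max_isl` argument are hypothetical and are not part of Dynamo or TensorRT-LLM.

```python
def check_transfer_buffer(engine_config: dict, max_isl: int) -> None:
    """Raise ValueError if max_tokens_in_buffer does not exceed max_isl.

    Hypothetical helper: `engine_config` is the parsed content of a file such
    as prefill.yaml or decode.yaml, and `max_isl` is the longest input
    sequence length (in tokens) you expect to serve.
    """
    transceiver = engine_config.get("cache_transceiver_config") or {}
    buffer_tokens = transceiver.get("max_tokens_in_buffer")
    if buffer_tokens is None:
        raise ValueError("cache_transceiver_config.max_tokens_in_buffer is not set")
    if buffer_tokens <= max_isl:
        raise ValueError(
            f"max_tokens_in_buffer={buffer_tokens} must exceed max ISL={max_isl}"
        )


# Mirrors the YAML snippet above as a dict.
config = {
    "cache_transceiver_config": {
        "backend": "DEFAULT",
        "max_tokens_in_buffer": 65536,
    }
}
check_transfer_buffer(config, max_isl=32768)  # passes silently
```

Running a check like this against both the prefill and decode configs catches an undersized buffer at startup rather than after a load test has already wedged the workers.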
--- a/docs/backends/trtllm/kv-cache-transfer.md
+++ b/docs/backends/trtllm/kv-cache-transfer.md
@@ -4,11 +4,15 @@
 title: KV Cache Transfer
 ---
+For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
+---
 In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
 ## Using NIXL for KV Cache Transfer
-Start the disaggregated service: See [Disaggregated Serving](./README.md#disaggregated) to learn how to start the deployment.
+Start the disaggregated service: See [Disaggregated Serving](./trtllm-examples.md#disaggregated) to learn how to start the deployment.
 ## Default Method: NIXL
 By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.

--- a/docs/backends/trtllm/llama4-plus-eagle.md
+++ b/docs/backends/trtllm/llama4-plus-eagle.md
@@ -4,7 +4,7 @@
 title: Llama4 + Eagle
 ---
-This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples.md) to set up the environment for the following scenarios:
+This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/trtllm-multinode-examples.md) to set up the environment for the following scenarios:
 - **Aggregated Serving:**
  Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
@@ -33,7 +33,7 @@ export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
 export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
 ```
-See [this](./multinode/multinode-examples.md#setup) section from multinode guide to learn more about the above options.
+See the [multinode setup instructions](./multinode/trtllm-multinode-examples.md#setup) to learn more about the above options.
 ## Aggregated Serving
@@ -55,12 +55,12 @@ export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4
 ## Example Request
-See [here](./multinode/multinode-examples.md#example-request) to learn how to send a request to the deployment.
+See the [example request section](./multinode/trtllm-multinode-examples.md#example-request) to learn how to send a request to the deployment.
 ```
 curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
-        "messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
+        "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
        "max_tokens": 1024
    }' -w "\n"

--- a/docs/backends/trtllm/trtllm-logits-processing.md
+++ b/docs/backends/trtllm/trtllm-logits-processing.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Logits Processing
+---
+For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
+---
+Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.
+### How it works
+- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
+- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
+- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
+### Quick test: HelloWorld processor
+You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
+./launch/agg.sh
+```
+<Note>
+- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
+- Expected chat response contains "Hello world".
+</Note>
+### Bring your own processor
+Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:
+```python
+from typing import Sequence
+import torch
+from dynamo.logits_processing import BaseLogitsProcessor
+class TemperatureProcessor(BaseLogitsProcessor):
+    """Scale next-token logits by a fixed temperature, in place."""
+    def __init__(self, temperature: float = 1.0):
+        if temperature <= 0:
+            raise ValueError("Temperature must be positive")
+        self.temperature = temperature
+    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
+        if self.temperature == 1.0:
+            return  # Dividing by 1.0 is a no-op, so skip the work.
+        logits.div_(self.temperature)  # In-place division; no new tensor is returned.
+```
+Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:
+```python
+from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
+from dynamo.logits_processing.examples import TemperatureProcessor
+processors = [TemperatureProcessor(temperature=0.7)]
+sampling_params.logits_processor = create_trtllm_adapters(processors)
+```
+### Current limitations
+- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
+- Processors must modify logits in-place and not return a new tensor.
+- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
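The in-place requirement in the limitations above is easy to violate by accident. The sketch below illustrates the difference using a plain Python list as a stand-in for the logits tensor, so it runs without `torch` or Dynamo installed. Both class names are hypothetical. With a real `torch.Tensor`, the analogous pair would be `logits.div_(t)` (in place) versus `logits = logits / t` (rebinds a local name).

```python
from typing import List, Sequence


class InPlaceTemperature:
    """Scales logits in place, so the caller's buffer is updated."""

    def __init__(self, temperature: float):
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: List[float]) -> None:
        for i in range(len(logits)):
            logits[i] /= self.temperature


class RebindingTemperature:
    """Anti-pattern: builds a new list and rebinds the local name only."""

    def __init__(self, temperature: float):
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: List[float]) -> None:
        # The caller's buffer is untouched; only the local variable changes.
        logits = [x / self.temperature for x in logits]


buf = [2.0, 4.0]
RebindingTemperature(2.0)([], buf)
unchanged = list(buf)  # the rebinding processor had no visible effect
InPlaceTemperature(2.0)([], buf)
scaled = list(buf)     # the in-place processor updated the caller's buffer
```

Because the processor's return value is ignored, a processor written in the rebinding style silently does nothing, which is why the in-place form is required.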
--- a/docs/backends/trtllm/prometheus.md
+++ b/docs/backends/trtllm/prometheus.md
@@ -4,6 +4,10 @@
 title: Prometheus
 ---
+For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
+---
 ## Overview
 When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
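Because both metric families arrive on the same endpoint, a consumer can separate them by prefix. The following is a minimal sketch that groups Prometheus text-format sample lines by the `trtllm_` and `dynamo_` prefixes described above. The function name and the sample metric names are hypothetical placeholders, not real Dynamo or TensorRT-LLM metrics.

```python
from collections import defaultdict
from typing import Dict, List


def group_metrics_by_prefix(metrics_text: str) -> Dict[str, List[str]]:
    """Group Prometheus sample lines into engine, runtime, and other."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        if line.startswith("trtllm_"):
            groups["engine"].append(line)
        elif line.startswith("dynamo_"):
            groups["runtime"].append(line)
        else:
            groups["other"].append(line)
    return dict(groups)


# Hypothetical scrape output; the metric names are placeholders.
sample = """\
# HELP trtllm_example_total Placeholder engine metric
trtllm_example_total 3
dynamo_example_total 5
"""
grouped = group_metrics_by_prefix(sample)
```

In practice the input text would come from an HTTP GET against the worker's `/metrics` endpoint.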
@@ -52,7 +56,7 @@ curl -H 'Content-Type: application/json' \
 -d '{
  "model": "<model_name>",
  "max_completion_tokens": 100,
-  "messages": [{"role": "user", "content": "Hello"}]
+  "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}]
 }' \
 http://localhost:8000/v1/chat/completions

--- a/docs/backends/trtllm/trtllm-reference-guide.md
+++ b/docs/backends/trtllm/trtllm-reference-guide.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Reference Guide
+subtitle: Features, configuration, and operational details for the TensorRT-LLM backend
+---
+## Building a Custom Container
+To build a TensorRT-LLM container from source (e.g., for custom modifications or a different CUDA version), see the [Building a Custom Container](./trtllm-building-custom-container.md) guide.
+## KV Cache Transfer
+Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV Cache Transfer Guide](./trtllm-kv-cache-transfer.md).
+## Request Migration
+Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
+## Request Cancellation
+When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
+### Cancellation Support Matrix
+| | Prefill | Decode |
+|-|---------|--------|
+| **Aggregated** | ✅ | ✅ |
+| **Disaggregated** | ✅ | ✅ |
+For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
+## Multimodal Support
+Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md).
+## Video Diffusion Support (Experimental)
+Dynamo supports video generation using diffusion models through TensorRT-LLM. For requirements, supported models, API usage, and configuration options, see the [Video Diffusion Guide](./trtllm-video-diffusion.md).
+## Logits Processing
+Logits processors let you modify the next-token logits at every decoding step. Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM. For the API, examples, and how to bring your own processor, see the [Logits Processing Guide](./trtllm-logits-processing.md).
+## DP Rank Routing (Attention Data Parallelism)
+TensorRT-LLM supports attention data parallelism for models like DeepSeek, enabling KV-cache-aware routing to specific DP ranks. For configuration and usage details, see the [DP Rank Routing Guide](./trtllm-dp-rank-routing.md).
+## KVBM Integration
+Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
+See the instructions here: [Running KVBM in TensorRT-LLM](../../components/kvbm/kvbm-guide.md#run-kvbm-in-dynamo-with-tensorrt-llm).
+## Observability
+TensorRT-LLM exposes Prometheus metrics for monitoring inference performance. For detailed metrics reference, collection setup, and Grafana integration, see the [Prometheus Metrics Guide](./trtllm-prometheus.md).
+## Known Issues and Mitigations
+For known issues, workarounds, and mitigations, see the [Known Issues and Mitigations](./trtllm-known-issues.md) page.
--- a/docs/backends/trtllm/trtllm-video-diffusion.md
+++ b/docs/backends/trtllm/trtllm-video-diffusion.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Video Diffusion Support (Experimental)
+---
+For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).
+---
+Dynamo supports video generation using diffusion models through the `--modality video_diffusion` flag.
+## Requirements
+- **TensorRT-LLM with visual_gen**: The `visual_gen` module is part of TensorRT-LLM (`tensorrt_llm._torch.visual_gen`). Install TensorRT-LLM following the [official instructions](https://github.com/NVIDIA/TensorRT-LLM#installation).
+- **imageio with ffmpeg**: Required for encoding generated frames to MP4 video:
+  ```bash
+  pip install imageio[ffmpeg]
+  ```
+- **dynamo-runtime with video API**: The Dynamo runtime must include `ModelType.Videos` support. Ensure you're using a compatible version.
+## Supported Models
+| Diffusers Pipeline | Description | Example Model |
+|--------------------|-------------|---------------|
+| `WanPipeline` | Wan 2.1/2.2 Text-to-Video | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` |
+The pipeline type is **auto-detected** from the model's `model_index.json` — no `--model-type` flag is needed.
+## Quick Start
+```bash
+python -m dynamo.trtllm \
+  --modality video_diffusion \
+  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
+  --media-output-fs-url file:///tmp/dynamo_media
+```
+## API Endpoint
+Video generation uses the `/v1/videos` endpoint:
+```bash
+curl -X POST http://localhost:8000/v1/videos \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "A cat playing piano",
+    "model": "wan_t2v",
+    "seconds": 4,
+    "size": "832x480",
+    "nvext": {
+      "fps": 24
+    }
+  }'
+```
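The same request can be built from Python. This is a minimal sketch using only the standard library. It reuses the payload fields from the `curl` example above, and the helper name `build_video_request` is hypothetical. The response format is not described here, so the sketch stops at constructing the request.

```python
import json
import urllib.request


def build_video_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for /v1/videos with the fields from the curl example."""
    payload = {
        "prompt": prompt,
        "model": "wan_t2v",
        "seconds": 4,
        "size": "832x480",
        "nvext": {"fps": 24},
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_video_request("http://localhost:8000", "A cat playing piano")
# To send it against a running deployment:
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```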
+## Configuration Options
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--media-output-fs-url` | Filesystem URL for storing generated media | `file:///tmp/dynamo_media` |
+| `--default-height` | Default video height | `480` |
+| `--default-width` | Default video width | `832` |
+| `--default-num-frames` | Default frame count | `81` |
+| `--enable-teacache` | Enable TeaCache optimization | `False` |
+| `--disable-torch-compile` | Disable torch.compile | `False` |
+## Limitations
+- Video diffusion is experimental and not recommended for production use
+- Only text-to-video is supported in this release (image-to-video planned)
+- Requires GPU with sufficient VRAM for the diffusion model
--- a/docs/index.yml
+++ b/docs/index.yml
@@ -146,8 +146,19 @@ navigation:
            path: backends/sglang/sglang-observability.md
          - page: Agentic Workloads
            path: backends/sglang/agents.md
-      - page: TensorRT-LLM
+      - section: TensorRT-LLM
        path: backends/trtllm/README.md
+        contents:
+          - page: Reference Guide
+            path: backends/trtllm/trtllm-reference-guide.md
+          - page: Examples
+            path: backends/trtllm/trtllm-examples.md
+          - page: Prometheus Metrics
+            path: backends/trtllm/trtllm-prometheus.md
+          - page: Video Diffusion (Experimental)
+            path: backends/trtllm/trtllm-video-diffusion.md
+          - page: Known Issues and Mitigations
+            path: backends/trtllm/trtllm-known-issues.md
      - page: vLLM
        path: backends/vllm/README.md
@@ -301,18 +312,22 @@ navigation:
            path: backends/vllm/vllm-omni.md
      - section: TensorRT-LLM Details
        contents:
+          - page: Building a Custom Container
+            path: backends/trtllm/trtllm-building-custom-container.md
+          - page: KV Cache Transfer
+            path: backends/trtllm/trtllm-kv-cache-transfer.md
+          - page: Logits Processing
+            path: backends/trtllm/trtllm-logits-processing.md
+          - page: DP Rank Routing
+            path: backends/trtllm/trtllm-dp-rank-routing.md
          - page: Multinode Examples
-            path: backends/trtllm/multinode/multinode-examples.md
+            path: backends/trtllm/multinode/trtllm-multinode-examples.md
          - page: Llama4 + Eagle
-            path: backends/trtllm/llama4-plus-eagle.md
+            path: backends/trtllm/trtllm-llama4-plus-eagle.md
-          - page: KV Cache Transfer
-            path: backends/trtllm/kv-cache-transfer.md
          - page: Gemma3 Sliding Window
-            path: backends/trtllm/gemma3-sliding-window-attention.md
+            path: backends/trtllm/trtllm-gemma3-sliding-window-attention.md
          - page: GPT-OSS
-            path: backends/trtllm/gpt-oss.md
+            path: backends/trtllm/trtllm-gpt-oss.md
-          - page: Prometheus
-            path: backends/trtllm/prometheus.md
      # -- Features (hidden sub-pages) --
      - section: Speculative Decoding
        path: features/speculative-decoding/README.md

--- a/docs/observability/metrics.md
+++ b/docs/observability/metrics.md
@@ -84,7 +84,7 @@ Dynamo exposes several categories of metrics:
 - **Frontend Metrics** (`dynamo_frontend_*`) - Request handling, token processing, and latency measurements
 - **Component Metrics** (`dynamo_component_*`) - Request counts, processing times, byte transfers, and system uptime
 - **Specialized Component Metrics** (e.g., `dynamo_preprocessor_*`) - Component-specific metrics
- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/vllm-observability.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-observability.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm_*`)
+- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/vllm-observability.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-observability.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/trtllm-prometheus.md) (`trtllm_*`)
 ## Runtime Hierarchy

--- a/examples/backends/trtllm/deploy/README.md
+++ b/examples/backends/trtllm/deploy/README.md
@@ -241,7 +241,7 @@ TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving
 - **UCX** (default): Standard method for KV cache transfer
 - **NIXL** (experimental): Alternative transfer method
-For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/kv-cache-transfer.md).
+For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/trtllm-kv-cache-transfer.md).
 ## Request Migration
@@ -269,8 +269,8 @@ Configure the `model` name and `host` based on your deployment.
 - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation-guide.md)
 - **Examples**: [Deployment Examples](../../../../docs/getting-started/examples.md)
 - **Architecture Docs**: [Disaggregated Serving](../../../../docs/design-docs/disagg-serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md)
- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
+- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/trtllm-multinode-examples.md)
- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4-plus-eagle.md)
+- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/trtllm-llama4-plus-eagle.md)
 - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
 ## Troubleshooting

--- a/examples/backends/trtllm/performance_sweeps/README.md
+++ b/examples/backends/trtllm/performance_sweeps/README.md
@@ -41,7 +41,7 @@ Please note that:
 3. `post_process.py` - Scan the aiperf results to produce a json with entries to each config point.
 4. `plot_performance_comparison.py` - Takes the json result file for disaggregated and/or aggregated configuration sweeps and plots a pareto line for better visualization.
-For more finer grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 slurm, please refer [multinode-examples.md](../../../../docs/backends/trtllm/multinode/multinode-examples.md). This guide shares similar assumption to the multinode examples guide.
+For finer-grained details on how to launch TRTLLM backend workers with DeepSeek R1 on GB200 Slurm, refer to [multinode-examples.md](../../../../docs/backends/trtllm/multinode/trtllm-multinode-examples.md). This guide shares similar assumptions with the multinode examples guide.
 ## Usage