# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "LLM Deployment using TensorRT-LLM"
---
# LLM Deployment using TensorRT-LLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
Below we provide a guide that lets you run all of the common deployment patterns on a single node.
### Start NATS and ETCD in the background
### Start Infrastructure Services (Local Development Only)
Start using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` for file-system-based discovery.
> - **NATS** is optional and only needed for KV routing with events (the default). You can disable it with the `--no-kv-events` flag to use prediction-based routing.
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
</Warning>
> [!WARNING]
> Below we provide some simple shell scripts that run the components for each configuration. Each script simply runs `python3 -m dynamo.frontend <args>` to start the ingress and `python3 -m dynamo.trtllm <args>` to start the workers. You can easily run each command in a separate terminal.
For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv-cache-routing.md).
...
...
@@ -118,9 +123,8 @@ cd $DYNAMO_HOME/examples/backends/trtllm
### Disaggregated with KV Routing
<Warning>
In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
</Warning>
> [!WARNING]
> In the disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
...
...
@@ -182,9 +186,8 @@ You can enable [request migration](../../fault-tolerance/request-migration.md) t
python3 -m dynamo.trtllm ... --migration-limit=3
```
<Warning>
**Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
</Warning>
> [!WARNING]
> **Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
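As a minimal sketch of the rule above, a launch script could derive the flag from the worker's role. The helper `migration_limit_for` and the `ROLE` variable are hypothetical names introduced here for illustration; the value `3` simply mirrors the example above.

```bash
# Hypothetical helper: pick a --migration-limit value from the worker role.
# Prefill workers must stay at 0 (the default); 3 mirrors the example above.
migration_limit_for() {
  case "$1" in
    prefill) echo 0 ;;
    *)       echo 3 ;;
  esac
}

ROLE="${ROLE:-decode}"   # assumed variable naming this worker's role
MIGRATION_FLAG="--migration-limit=$(migration_limit_for "$ROLE")"
echo "$MIGRATION_FLAG"
```

The resulting `$MIGRATION_FLAG` can then be appended to the worker command line alongside your other arguments.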
See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for details on how this works.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Gemma 3 with Variable Sliding Window Attention"
---
# Gemma 3 with Variable Sliding Window Attention
This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.
<Note>
- Ensure that required services such as `nats` and `etcd` are running before starting.
- Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
- It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
</Note>
> [!NOTE]
> - Ensure that required services such as `nats` and `etcd` are running before starting.
> - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
# Running gpt-oss-120b Disaggregated with TensorRT-LLM
Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
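One way to picture the 4+4 split is to pin each worker to its own half of the node with `CUDA_VISIBLE_DEVICES`. This is only a sketch: the GPU indices 0-7, the `with_gpus` wrapper, and the stand-in `echo` commands are assumptions for illustration, not part of the guide's launch scripts.

```bash
# Assumed GPU indices 0-7 on the single node; the split mirrors the 4+4 layout above.
PREFILL_GPUS="0,1,2,3"
DECODE_GPUS="4,5,6,7"

# Hypothetical wrapper: run a command with only the given GPUs visible.
with_gpus() {
  local gpus="$1"; shift
  CUDA_VISIBLE_DEVICES="$gpus" "$@"
}

# Stand-in commands; replace them with the real prefill and decode worker launches.
with_gpus "$PREFILL_GPUS" sh -c 'echo "prefill sees GPUs: $CUDA_VISIBLE_DEVICES"'
with_gpus "$DECODE_GPUS"  sh -c 'echo "decode sees GPUs: $CUDA_VISIBLE_DEVICES"'
```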
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "KV Cache Transfer in Disaggregated Serving"
---
# KV Cache Transfer in Disaggregated Serving
In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
## Using NIXL for KV Cache Transfer
Start the disaggregated service: See [Disaggregated Serving](./README.md#disaggregated) to learn how to start the deployment.
## Default Method: NIXL
By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
### Specify Backends for NIXL
TODO: Add instructions for how to specify different backends for NIXL.
NIXL supports multiple communication backends that can be configured via environment variables. By default, UCX is used if no backends are explicitly specified.
## Alternative Method: UCX
**Environment Variable Format:**
```bash
DYN_KVBM_NIXL_BACKEND_<BACKEND>=<value>
```
**Supported Backends:**
- `UCX` - Unified Communication X (default)
- `GDS` - GPU Direct Storage
**Examples:**
```bash
# Enable UCX backend (default behavior)
export DYN_KVBM_NIXL_BACKEND_UCX=true
TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. There are two ways to enable UCX as the KV cache transfer backend:
# Enable GDS backend
export DYN_KVBM_NIXL_BACKEND_GDS=true
# Enable multiple backends
export DYN_KVBM_NIXL_BACKEND_UCX=true
export DYN_KVBM_NIXL_BACKEND_GDS=true
# Explicitly disable a backend
export DYN_KVBM_NIXL_BACKEND_GDS=false
```
**Valid Values:**
- `true`, `1`, `on`, `yes` - Enable the backend
- `false`, `0`, `off`, `no` - Disable the backend
> [!NOTE]
> If no `DYN_KVBM_NIXL_BACKEND_*` environment variables are set, UCX is used as the default backend.
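To check what a given environment would select, the documented values can be interpreted with a few lines of shell. This is a sketch of the rules listed above, not Dynamo's actual parser: the helper names are hypothetical, and lowercasing the value before matching is an assumption.

```bash
# Sketch only: mirrors the documented values, not Dynamo's actual parser.
# Lowercasing first (case-insensitive matching) is an assumption.
is_enabled() {
  local value
  value="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$value" in
    true|1|on|yes) return 0 ;;
    *)             return 1 ;;
  esac
}

# Print the backends the current environment enables, one per line.
enabled_nixl_backends() {
  local backend value any_set=0
  for backend in UCX GDS; do
    # printenv fails when the variable is unset, so unset and "false" are distinguishable.
    if value="$(printenv "DYN_KVBM_NIXL_BACKEND_${backend}")"; then
      any_set=1
      if is_enabled "$value"; then echo "$backend"; fi
    fi
  done
  # Documented default: UCX when no DYN_KVBM_NIXL_BACKEND_* variable is set.
  if [ "$any_set" -eq 0 ]; then echo "UCX"; fi
}

export DYN_KVBM_NIXL_BACKEND_UCX=true
export DYN_KVBM_NIXL_BACKEND_GDS=off
enabled_nixl_backends
```

With the two exports shown, only `UCX` is printed; unsetting both variables also yields `UCX` through the default branch.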
## Alternative Method: UCX
1. **Recommended:** Set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
2. Alternatively, set the environment variable `TRTLLM_USE_UCX_KV_CACHE=1` and configure `cache_transceiver_config.backend: DEFAULT` in the engine configuration YAML.
TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
This flexibility allows users to choose the most suitable method for their deployment and compatibility requirements.
> [!NOTE]
> The environment variable `TRTLLM_USE_UCX_KV_CACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration.
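The dotted name `cache_transceiver_config.backend` corresponds to a nested YAML mapping. The sketch below writes a minimal fragment containing only that setting. The file name is a placeholder chosen for illustration, and a real engine configuration will hold other keys as well.

```bash
# Placeholder file name; point this at your own engine configuration instead.
CONFIG_FILE="${CONFIG_FILE:-engine_config_ucx.yaml}"

# Write a fresh fragment holding only the transceiver setting described above.
cat > "$CONFIG_FILE" <<'EOF'
cache_transceiver_config:
  backend: UCX
EOF

# Confirm the backend line is present.
grep -n 'backend: UCX' "$CONFIG_FILE"
```

If you are editing an existing configuration, merge the two lines into it by hand rather than overwriting the file.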
# Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM
This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](multinode/multinode-examples.md) to set up the environment for the following scenarios:
# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
To run a single Dynamo+TRTLLM Worker that spans multiple nodes (e.g., TP16),
...
...
@@ -148,10 +149,9 @@ Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
following the setup above, follow these steps below to launch a **disaggregated**
deployment across 8 nodes:
<Tip>
Make sure you have a fresh environment and don't still have the aggregated
example above still deployed on the same set of nodes.
</Tip>
> [!TIP]
> Make sure you have a fresh environment and that the aggregated
> example above is no longer deployed on the same set of nodes.
```bash
# Defaults set in srun_disaggregated.sh, but can customize here.
...
...
@@ -176,10 +176,9 @@ example above still deployed on the same set of nodes.
./srun_disaggregated.sh
```
<Tip>
To launch multiple replicas of the configured prefill/decode workers, you can set
NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).
</Tip>
> [!TIP]
> To launch multiple replicas of the configured prefill/decode workers, you can set
> `NUM_PREFILL_WORKERS` and `NUM_DECODE_WORKERS` respectively (default: 1).
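Before raising the replica counts, it can help to estimate how many nodes the allocation needs. The sketch below assumes each worker spans 4 nodes, as in the 8-node example above. `NODES_PER_WORKER` and `TOTAL_NODES` are hypothetical names used only for this calculation.

```bash
# Replica counts read by srun_disaggregated.sh (default: 1 each).
export NUM_PREFILL_WORKERS=2
export NUM_DECODE_WORKERS=2

# Assumption taken from the 8-node example: each worker spans 4 nodes.
NODES_PER_WORKER=4
TOTAL_NODES=$(( (NUM_PREFILL_WORKERS + NUM_DECODE_WORKERS) * NODES_PER_WORKER ))
echo "Nodes required: ${TOTAL_NODES}"

# Then launch as before:
# ./srun_disaggregated.sh
```

With two replicas of each worker type, this prints `Nodes required: 16`.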
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "TensorRT-LLM Prometheus Metrics"
---
# TensorRT-LLM Prometheus Metrics
## Overview
When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
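Because both families share one endpoint, a prefix filter is a quick way to separate them. The following is a sketch: `filter_metrics` is a hypothetical helper, and the sample metric names are made up to illustrate the Prometheus text format.

```bash
# Keep only metric samples whose name starts with the given prefix.
# Comment lines (# HELP / # TYPE) are dropped.
filter_metrics() {
  grep -v '^#' | grep "^$1"
}

# Made-up sample in Prometheus text format; real metric names will differ.
sample_metrics='# HELP trtllm_example_total Illustrative engine counter
trtllm_example_total 3
dynamo_example_total 7'

printf '%s\n' "$sample_metrics" | filter_metrics trtllm_

# Against a running worker (default port from the text above):
# curl -s localhost:8081/metrics | filter_metrics trtllm_
```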
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "LMCache Integration in Dynamo"
---
# LMCache Integration in Dynamo
## Introduction
LMCache is a high-performance KV cache layer that supercharges LLM serving by enabling **prefill-once, reuse-everywhere** semantics. As described in the [official documentation](https://docs.lmcache.ai/index.html), LMCache lets LLMs prefill each text only once by storing the KV caches of all reusable texts, allowing reuse of KV caches for any reused text (not necessarily prefix) across any serving engine instance.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "LLM Deployment using vLLM"
---
# LLM Deployment using vLLM
This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
Below we provide a guide that lets you run all of the common deployment patterns on a single node.
### Start NATS and ETCD in the background
### Start Infrastructure Services (Local Development Only)
Start using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` for file-system-based discovery.
> - **NATS** is optional and only needed for KV routing with events (the default). You can disable it with the `--no-kv-events` flag to use prediction-based routing.
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
### Pull or build container
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
...
...
@@ -81,9 +87,8 @@ This includes the specific commit [vllm-project/vllm#19790](https://github.com/v
## Run Single Node Examples
<Warning>
Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
</Warning>
> [!WARNING]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
### Aggregated Serving
...
...
@@ -127,9 +132,8 @@ cd examples/backends/vllm
bash launch/dep.sh
```
<Tip>
Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
</Tip>
> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Running Deepseek R1 with Wide EP"
---
Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a seperate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`
# Running Deepseek R1 with Wide EP
Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a separate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Running gpt-oss-120b Disaggregated with vLLM"
---
# Running gpt-oss-120b Disaggregated with vLLM
Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "vLLM Prometheus Metrics"
---
# vLLM Prometheus Metrics
## Overview
When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
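To get a quick sense of how many samples each family contributes, the scrape output can be tallied by prefix. This is a sketch under stated assumptions: `count_by_family` is a hypothetical helper and the sample metric names are invented for illustration.

```bash
# Count metric samples per family: vLLM engine (vllm:) vs. Dynamo runtime (dynamo_).
count_by_family() {
  awk '
    /^#/       { next }        # skip HELP/TYPE comment lines
    /^vllm:/   { engine++ }
    /^dynamo_/ { runtime++ }
    END { printf "vllm=%d dynamo=%d\n", engine, runtime }
  '
}

# Made-up sample in Prometheus text format; real metric names will differ.
sample_metrics='# TYPE vllm:example_a gauge
vllm:example_a 1
vllm:example_b 2
dynamo_example_total 7'

printf '%s\n' "$sample_metrics" | count_by_family

# Against a running worker (default port from the text above):
# curl -s localhost:8081/metrics | count_by_family
```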
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Prompt Embeddings"
---
# Prompt Embeddings
Dynamo supports prompt embeddings (also known as prompt embeds) as a secure alternative input method to traditional text prompts. By allowing applications to use pre-computed embeddings for inference, this feature not only offers greater flexibility in prompt engineering but also significantly enhances privacy and data security. With prompt embeddings, sensitive user data can be transformed into embeddings before ever reaching the inference server, reducing the risk of exposing confidential information during the AI workflow.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo KV Smart Router A/B Benchmarking Guide"
---
# Dynamo KV Smart Router A/B Benchmarking Guide
This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.
**New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](../planner/sla-planner-quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.
</Tip>
# SLA-Driven Profiling with DynamoGraphDeploymentRequest
> [!TIP]
> **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](../planner/sla-planner-quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.
## Overview
...
...
@@ -39,9 +39,8 @@ Specifically, the profiler sweeps over the following parallelization mapping for
| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
| Other Models | TP | TP |
<Note>
- Exact model x parallelization mapping support is dependent on the backend. The profiler does not guarantee that the recommended P/D engine configuration is supported and bug-free by the backend.
</Note>
> [!NOTE]
> - Exact model x parallelization mapping support is dependent on the backend. The profiler does not guarantee that the recommended P/D engine configuration is supported and bug-free by the backend.
## Using DGDR for Profiling (Recommended)
...
...
@@ -344,9 +343,8 @@ profilingConfig:
- **num_gpus_per_node**: Determines the upper bound on the number of GPUs per node for dense models and configures Grove for multi-node MoE engines.
- **gpu_type**: Informational, auto-detected by the controller.
<Tip>
If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
</Tip>
> [!TIP]
> If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "High Level Architecture"
---
# High Level Architecture
Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
# Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
The prefill and decode phases of LLM requests have different computation characteristics and memory footprints. Disaggregating these phases into specialized LLM engines allows for better hardware allocation, improved scalability, and better overall performance. For example, using a larger TP for the memory-bound decode phase and a smaller TP for the compute-bound prefill phase lets both phases run efficiently. In addition, for long-context requests, moving the prefill phase into dedicated prefill engines lets ongoing decode requests proceed without being blocked by these long prefills.
Disaggregated execution of a request has three main steps: