# SLA-Driven Profiling and Planner Deployment Quick Start Guide
Complete workflow to deploy SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). This guide shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs).
Complete workflow to deploy SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). This guide shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs).
<Warning>
> [!WARNING]
**Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](../kubernetes/installation-guide.md).
> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](../kubernetes/installation-guide.md).
</Warning>
## Overview
## Overview
...
@@ -130,9 +130,8 @@ spec:
...
@@ -130,9 +130,8 @@ spec:
autoApply:true# Auto-deploy after profiling
autoApply:true# Auto-deploy after profiling
```
```
<Tip>
> [!TIP]
For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](../benchmarks/sla-driven-profiling.md#dgdr-configuration-reference).
> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](../benchmarks/sla-driven-profiling.md#dgdr-configuration-reference).
-`Failed`: Error occurred (check events for details)
-`Failed`: Error occurred (check events for details)
<Note>
> [!NOTE]
With AI Configurator, profiling completes in **20-30 seconds**! This is much faster than online profiling which takes 2-4 hours.
> With AI Configurator, profiling completes in **20-30 seconds**! This is much faster than online profiling which takes 2-4 hours.
</Note>
### Step 4: Access Your Deployment
### Step 4: Access Your Deployment
...
@@ -202,9 +200,8 @@ The dashboard displays:
...
@@ -202,9 +200,8 @@ The dashboard displays:
-**Predicted Metrics**: Planner's load predictions and recommended replica counts
-**Predicted Metrics**: Planner's load predictions and recommended replica counts
-**Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
-**Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
<Tip>
> [!TIP]
Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your specific deployment namespace.
> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your specific deployment namespace.
</Tip>
## DGDR Configuration Details
## DGDR Configuration Details
...
@@ -260,9 +257,8 @@ sweep:
...
@@ -260,9 +257,8 @@ sweep:
aic_backend_version:"0.20.0"
aic_backend_version:"0.20.0"
```
```
<Note>
> [!NOTE]
For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](../benchmarks/sla-driven-profiling.md#profiling-method).
> For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](../benchmarks/sla-driven-profiling.md#profiling-method).
| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
<Note>
> [!NOTE]
For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](../benchmarks/sla-driven-profiling.md#troubleshooting).
> For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](../benchmarks/sla-driven-profiling.md#troubleshooting).
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# SPDX-License-Identifier: Apache-2.0
title:"SLA-basedPlanner"
---
---
<Tip>
# SLA-based Planner
**New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Profiling + Planner Quick Start Guide](sla-planner-quickstart.md).
</Tip>
> [!TIP]
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Profiling + Planner Quick Start Guide](sla-planner-quickstart.md).
This document covers information regarding the SLA-based planner in `examples/common/utils/planner_core.py`.
This document covers information regarding the SLA-based planner in `examples/common/utils/planner_core.py`.
The SLA (Service Level Agreement)-based planner is an intelligent autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT and ITL targets. Unlike the load-based planner that scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to proactively scale the workers.
The SLA (Service Level Agreement)-based planner is an intelligent autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT and ITL targets. Unlike the load-based planner that scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to proactively scale the workers.
<Note>
> [!NOTE]
Currently, SLA-based planner only supports disaggregated setup.
> Currently, SLA-based planner only supports disaggregated setup.
</Note>
<Warning>
> [!WARNING]
Bare metal deployment with local connector is deprecated. Please deploy the SLA planner in k8s.
> Bare metal deployment with local connector is deprecated. Please deploy the SLA planner in k8s.
Finally, SLA planner applies the change by scaling up/down the number of prefill and decode workers to the calculated number of replica in the next interval.
Finally, SLA planner applies the change by scaling up/down the number of prefill and decode workers to the calculated number of replica in the next interval.
<Note>
> [!NOTE]
SLA-planner scales up/down the P/D engines non-blockingly. If `adjustment-interval` is too short, the previous scaling operations may not finish before the new scaling operations are issued. Make sure to set a large enough `adjustment-interval`.
> SLA-planner scales up/down the P/D engines non-blockingly. If `adjustment-interval` is too short, the previous scaling operations may not finish before the new scaling operations are issued. Make sure to set a large enough `adjustment-interval`.
</Note>
## Deploying
## Deploying
For complete deployment instructions, see the [SLA Planner Quick Start Guide](sla-planner-quickstart.md).
For complete deployment instructions, see the [SLA Planner Quick Start Guide](sla-planner-quickstart.md).
<Note>
> [!NOTE]
The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically.
> The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# SPDX-License-Identifier: Apache-2.0
title:"DynamoRun"
---
---
# Dynamo Run
`dynamo-run` is a Rust binary that lets you easily run a model, explore the Dynamo components, and demonstrates the Rust API. It supports the `mistral.rs` engines, as well as testing engines `echo` and `mocker`.
`dynamo-run` is a Rust binary that lets you easily run a model, explore the Dynamo components, and demonstrates the Rust API. It supports the `mistral.rs` engines, as well as testing engines `echo` and `mocker`.
It is primarily for development and rapid prototyping. For production use we recommend the Python wrapped components, see the main project README.
It is primarily for development and rapid prototyping. For production use we recommend the Python wrapped components, see the main project README.
vLLM offers the broadest feature coverage in Dynamo, with full support for disaggregated serving, KV-aware routing, KV block management, LoRA adapters, and multimodal inference including video and audio.
> 1. **Multimodal + KV-Aware Routing**: The KV router uses token-based hashing and does not yet support image/video hashes, so it falls back to random/round-robin routing. ([Source](../router/kv-cache-routing.md))
> 2. **KV-Aware LoRA Routing**: vLLM supports routing requests based on LoRA adapter affinity.
> 4. **Video Support**: vLLM supports video input with frame sampling. ([Source](../multimodal/vllm.md))
> 5. **Speculative Decoding**: Eagle3 support documented. ([Source](../backends/vllm/speculative-decoding.md))
## 2. SGLang Backend
SGLang is optimized for high-throughput serving with fast primitives, providing robust support for disaggregated serving, KV-aware routing, and request migration.
> 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source](../router/kv-cache-routing.md))
> 2. **Multimodal Patterns**: Supports **E/PD** and **E/P/D** only (requires separate vision encoder). Does **not** support simple Aggregated (EPD) or Traditional Disagg (EP/D). ([Source](../multimodal/sglang.md))
> 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source](../backends/sglang/README.md))
> 4. **Speculative Decoding**: Code hooks exist (`spec_decode_stats` in publisher), but no examples or documentation yet.
## 3. TensorRT-LLM Backend
TensorRT-LLM delivers maximum inference performance and optimization, with full KVBM integration and robust disaggregated serving support.
> 1. **Multimodal Disaggregation**: Fully supports **EP/D** (Traditional) pattern. **E/P/D** (Full Disaggregation) is WIP and currently supports pre-computed embeddings only. ([Source](../multimodal/trtllm.md))
> 2. **Multimodal + KV-Aware Routing**: Not supported. The KV router currently tracks token-based blocks only. ([Source](../router/kv-cache-routing.md))
> 3. **Request Migration**: Supported on **Decode/Aggregated** workers only. **Prefill** workers do not support migration. ([Source](../backends/trtllm/README.md))
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# SPDX-License-Identifier: Apache-2.0
title:"NVIDIADynamoGlossary"
---
---
# NVIDIA Dynamo Glossary
## B
## B
**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.
**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Dynamo Release Artifacts
This document provides a comprehensive inventory of all Dynamo release artifacts including container images, Python wheels, Helm charts, and Rust crates.
**See also:**[Support Matrix](support-matrix.md) for hardware and platform compatibility | [Feature Matrix](feature-matrix.md) for backend feature support
Release history in this document begins at v0.6.0.
| `dynamo-frontend:0.8.1` | API gateway with Endpoint Prediction Protocol (EPP) | — | — | AMD64/ARM64 | [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/dynamo-frontend?version=0.8.1) | |
| `kubernetes-operator:0.8.1` | Kubernetes operator for Dynamo deployments | — | — | AMD64/ARM64 | [link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator?version=0.8.1) | |
\* Multimodal inference on CUDA 13 images: works on AMD64 for all backends; works on ARM64 only for TensorRT-LLM (`vllm-runtime:*-cuda13` and `sglang-runtime:*-cuda13` do not support multimodality on ARM64).
### Python Wheels
We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtllm]` wheel. See the [NGC container collection](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) for supported images.
For detailed run instructions, see the [Container README](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md) or backend-specific guides: [vLLM](../backends/vllm/README.md) | [SGLang](../backends/sglang/README.md) | [TensorRT-LLM](../backends/trtllm/README.md)
For API documentation, see each crate on [docs.rs](https://docs.rs/). To build Dynamo from source, see [Building from Source](https://github.com/ai-dynamo/dynamo#building-from-source).
```bash
cargo add dynamo-runtime@0.8.1
cargo add dynamo-llm@0.8.1
cargo add dynamo-async-openai@0.8.1
cargo add dynamo-parsers@0.8.1
cargo add dynamo-memory@0.8.1
cargo add dynamo-config@0.8.1
```
## CUDA and Driver Requirements
For detailed CUDA toolkit versions and minimum driver requirements for each container image, see the [Support Matrix](support-matrix.md#cuda-and-driver-requirements).
## Known Issues
For a complete list of known issues, refer to the release notes for each patch:
| v0.8.1 | `vllm-runtime:0.8.1-cuda13` | Container fails to launch. | Known issue |
| v0.8.1 | `sglang-runtime:0.8.1-cuda13`, `vllm-runtime:0.8.1-cuda13` | Multimodality not expected to work on ARM64. Works on AMD64. | Known limitation |
| v0.8.0 | `sglang-runtime:0.8.0-cuda13` | CuDNN installation issue caused PyTorch `v2.9.1` compatibility problems with `nn.Conv3d`, resulting in performance degradation and excessive memory usage in multimodal workloads. | Fixed in v0.8.1 ([#5461](https://github.com/ai-dynamo/dynamo/pull/5461)) |
---
## Release History
-**v0.8.1.post1 Patch**: Updated TRT-LLM to `v1.2.0rc6.post2` (PyPI wheels and TRT-LLM container only)
-**Standalone Frontend Container**: `dynamo-frontend` added in v0.8.0
-**CUDA 13 Runtimes**: Experimental CUDA 13 runtime for vLLM and SGLang in v0.8.0
-**New Rust Crates**: `dynamo-memory` and `dynamo-config` added in v0.8.0
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# SPDX-License-Identifier: Apache-2.0
title:"DynamoSupportMatrix"
---
---
# Dynamo Support Matrix
This document provides the support matrix for Dynamo, including hardware, software and build instructions.
This document provides the support matrix for Dynamo, including hardware, software and build instructions.
**See also:**[Release Artifacts](release-artifacts.md) for container images, wheels, Helm charts, and crates | [Feature Matrix](feature-matrix.md) for backend feature support
## Backend Dependencies
The following table shows the backend framework versions included with each Dynamo release:
**main (ToT)** reflects the current development branch. **v0.8.1.post1** is a patch release for PyPI wheels and TRT-LLM container only (no GitHub release).
> [!WARNING]
> Currently TensorRT-LLM does not support Python 3.11 so installation of the ai-dynamo[trtllm] Python wheel will fail.
| **Dynamo 0.8.1** | CUDA 12.9, CUDA 13.0 (Experimental) | CUDA 13.0 | CUDA 12.9, CUDA 13.0 (Experimental) |
| **Dynamo 0.8.0** | CUDA 12.9, CUDA 13.0 (Experimental) | CUDA 13.0 | CUDA 12.9, CUDA 13.0 (Experimental) |
| **Dynamo 0.7.1** | CUDA 12.8 | CUDA 13.0 | CUDA 12.9 |
| **Dynamo 0.7.0** | CUDA 12.9 | CUDA 13.0 | CUDA 12.8 |
Patch versions (e.g., v0.8.1.post1, v0.7.0.post1) have the same CUDA support as their base version.
For detailed artifact versions and NGC links (including container images, Python wheels, Helm charts, and Rust crates), see the [Release Artifacts](release-artifacts.md) page.
## Hardware Compatibility
## Hardware Compatibility
| **CPU Architecture** | **Status** |
| **CPU Architecture** | **Status** |
...
@@ -13,6 +43,7 @@ This document provides the support matrix for Dynamo, including hardware, softwa
...
@@ -13,6 +43,7 @@ This document provides the support matrix for Dynamo, including hardware, softwa
| **x86_64** | Supported |
| **x86_64** | Supported |
| **ARM64** | Supported |
| **ARM64** | Supported |
Dynamo provides multi-arch container images supporting both AMD64 (x86_64) and ARM64 architectures. See [Release Artifacts](release-artifacts.md) for available images.
### GPU Compatibility
### GPU Compatibility
...
@@ -36,49 +67,49 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
...
@@ -36,49 +67,49 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| **Ubuntu** | 24.04 | ARM64 | Supported |
| **Ubuntu** | 24.04 | ARM64 | Supported |
| **CentOS Stream** | 9 | x86_64 | Experimental |
| **CentOS Stream** | 9 | x86_64 | Experimental |
<Note>
Wheels are built using a manylinux_2_28-compatible environment and validated on CentOS Stream 9 and Ubuntu (22.04, 24.04). Compatibility with other Linux distributions is expected but not officially verified.
Wheels are built using a manylinux_2_28-compatible environment and they have been validated on CentOS 9 and Ubuntu (22.04, 24.04).
Compatibility with other Linux distributions is expected but has not been officially verified yet.
</Note>
<Error>
> [!CAUTION]
KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
> KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
</Error>
## Software Compatibility
## Software Compatibility
### Runtime Dependency
### CUDA and Driver Requirements
| **Python Package** | **Version** | glibc version | CUDA Version |
Dynamo container images include CUDA toolkit libraries. The host machine must have a compatible NVIDIA GPU driver installed.
For extended driver compatibility beyond the minimum versions listed above, consider using `cuda-compat` packages on the host. See [Forward Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html) for details.
Specific versions of TensorRT-LLM supported by Dynamo are subject to change. Currently TensorRT-LLM does not support Python 3.11 so installation of the ai-dynamo[trtllm] will fail.
There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
> **AL2023 TensorRT-LLM Limitation:** There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
</Error>
## Build Support
## Build Support
For version-specific artifact details, installation commands, and release history, see [Release Artifacts](release-artifacts.md).
**Dynamo** currently provides build support in the following ways:
**Dynamo** currently provides build support in the following ways:
-**Wheels**: We distribute Python wheels of Dynamo and KV Block Manager:
-**Wheels**: We distribute Python wheels of Dynamo and KV Block Manager:
-**New as of Dynamo v0.7.0:**[kvbm](https://pypi.org/project/kvbm/) as a standalone implementation.
-[kvbm](https://pypi.org/project/kvbm/) as a standalone implementation.
-**Dynamo Runtime Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Runtime for each of the LLM inference frameworks on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
-**Dynamo Container Images**: We distribute multi-arch images (x86 & ARM64 compatible) on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
-**Dynamo Kubernetes Operator Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Operator on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
-[kubernetes-operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) to simplify deployments of Dynamo Graphs.
-[dynamo-config](https://crates.io/crates/dynamo-config/)*(New in v0.8.0)*
-[dynamo-memory](https://crates.io/crates/dynamo-memory/)*(New in v0.8.0)*
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the instructions in the [Quick Start Guide](https://github.com/ai-dynamo/dynamo/blob/main/README.md#installation).
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the [Local Quick Start](https://github.com/ai-dynamo/dynamo/blob/main/README.md#local-quick-start) in the README.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# SPDX-License-Identifier: Apache-2.0
title:"KVRouter"
---
---
# KV Router
## Overview
## Overview
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# SPDX-License-Identifier: Apache-2.0
title:"KVCacheRouting"
---
---
# KV Cache Routing
This document explains how Dynamo's Key-Value (KV) cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data, while maintaining load balance through worker utilization metrics.
This document explains how Dynamo's Key-Value (KV) cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data, while maintaining load balance through worker utilization metrics.
To enable KV cache aware routing start the frontend node like this:
To enable KV cache aware routing start the frontend node like this:
The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch/disagg_router.sh).
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch/disagg_router.sh).