Commit b19de4ed authored by dagil-nvidia, committed by GitHub

docs: cleanup of docs refactor for components, integrations, and features (#6019)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 80e7bafd
...@@ -58,8 +58,8 @@ Quickstart ...@@ -58,8 +58,8 @@ Quickstart
:hidden: :hidden:
:caption: User Guides :caption: User Guides
KV Cache Offloading <kvbm/kvbm_guide.md> KV Cache Offloading <components/kvbm/kvbm_guide.md>
KV Aware Routing <router/router_guide.md> KV Aware Routing <components/router/router_guide.md>
Tool Calling <agents/tool-calling.md> Tool Calling <agents/tool-calling.md>
Multimodality Support <features/multimodal/README.md> Multimodality Support <features/multimodal/README.md>
LoRA Adapters <features/lora/README.md> LoRA Adapters <features/lora/README.md>
...@@ -76,11 +76,11 @@ Quickstart ...@@ -76,11 +76,11 @@ Quickstart
:caption: Components :caption: Components
Backends <_sections/backends> Backends <_sections/backends>
Frontends <_sections/frontends> Frontend <components/frontend/README>
Router <router/README> Router <components/router/README>
Planner <planner/planner_intro> Planner <components/planner/README>
Profiler <components/profiler/README> Profiler <components/profiler/README>
KVBM <kvbm/kvbm_intro> KVBM <components/kvbm/README>
.. toctree:: .. toctree::
:hidden: :hidden:
......
...@@ -285,6 +285,6 @@ Each event in the payload is a dictionary with `type` field (`BlockStored`, `Blo ...@@ -285,6 +285,6 @@ Each event in the payload is a dictionary with `type` field (`BlockStored`, `Blo
## See Also ## See Also
- **[Router README](../router/README.md)**: Quick start guide for the KV Router - **[Router README](../components/router/README.md)**: Quick start guide for the KV Router
- **[Router Guide](../router/router_guide.md)**: Configuration, tuning, and production setup - **[Router Guide](../components/router/router_guide.md)**: Configuration, tuning, and production setup
- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes - **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes
...@@ -117,7 +117,7 @@ kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE} ...@@ -117,7 +117,7 @@ kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models curl http://localhost:8000/v1/models
``` ```
For SLA-based autoscaling, see [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md). For SLA-based autoscaling, see [SLA Planner Guide](/docs/components/planner/planner_guide.md).
## Understanding Dynamo's Custom Resources ## Understanding Dynamo's Custom Resources
......
...@@ -163,14 +163,14 @@ Planner is deployed as a service component within your DGD. It: ...@@ -163,14 +163,14 @@ Planner is deployed as a service component within your DGD. It:
**Deployment:** **Deployment:**
The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../planner/sla_planner_quickstart.md) for complete instructions. The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../components/planner/planner_guide.md) for complete instructions.
Example configurations with Planner: Example configurations with Planner:
- `examples/backends/vllm/deploy/disagg_planner.yaml` - `examples/backends/vllm/deploy/disagg_planner.yaml`
- `examples/backends/sglang/deploy/disagg_planner.yaml` - `examples/backends/sglang/deploy/disagg_planner.yaml`
- `examples/backends/trtllm/deploy/disagg_planner.yaml` - `examples/backends/trtllm/deploy/disagg_planner.yaml`
For more details, see the [SLA Planner documentation](../planner/sla_planner.md). For more details, see the [SLA Planner documentation](../components/planner/planner_guide.md).
## Autoscaling with Kubernetes HPA ## Autoscaling with Kubernetes HPA
...@@ -725,7 +725,7 @@ If you see unstable scaling: ...@@ -725,7 +725,7 @@ If you see unstable scaling:
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) - [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [KEDA Documentation](https://keda.sh/) - [KEDA Documentation](https://keda.sh/)
- [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter) - [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter)
- [Planner Documentation](../planner/sla_planner.md) - [Planner Documentation](../components/planner/planner_guide.md)
- [Dynamo Metrics Reference](../observability/metrics.md) - [Dynamo Metrics Reference](../observability/metrics.md)
- [Prometheus and Grafana Setup](../observability/prometheus-grafana.md) - [Prometheus and Grafana Setup](../observability/prometheus-grafana.md)
...@@ -292,7 +292,7 @@ kubectl get pods -n ${NAMESPACE} ...@@ -292,7 +292,7 @@ kubectl get pods -n ${NAMESPACE}
3. **Optional:** 3. **Optional:**
- [Set up Prometheus & Grafana](./observability/metrics.md) - [Set up Prometheus & Grafana](./observability/metrics.md)
- [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling) - [SLA Planner Guide](../components/planner/planner_guide.md) (for SLA-aware scheduling and autoscaling)
## Troubleshooting ## Troubleshooting
......
..
    SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    SPDX-License-Identifier: Apache-2.0

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

KV Block Manager
================
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM, SGLang, and TRT-LLM.
It offers:
* A **unified memory API** that spans GPU memory (future), pinned host memory, remote RDMA-accessible memory, a local or distributed pool of SSDs, and remote file, object, or cloud storage systems.
* Support for evolving **block lifecycles** (allocate → register → match) with event-based state transitions that storage can subscribe to.
* Integration with **NIXL**, a dynamic memory exchange layer used for remote registration, sharing, and access of memory blocks over RDMA/NVLink.
The Dynamo KV Block Manager serves as a reference implementation that emphasizes modularity and extensibility. Its pluggable design enables developers to customize components and optimize for specific performance, memory, and deployment needs.
.. list-table::
   :widths: 20 5 75
   :header-rows: 1

   * -
     -
     - Feature
   * - **Backend**
     - ✅
     - Local
   * -
     - ✅
     - Kubernetes
   * - **LLM Framework**
     - ✅
     - vLLM
   * -
     - ✅
     - TensorRT-LLM
   * -
     - ❌
     - SGLang
   * - **Serving Type**
     - ✅
     - Aggregated
   * -
     - ✅
     - Disaggregated

.. toctree::
   :hidden:

   Overview <self>
   Quick Start <README.md>
   User Guide <kvbm_guide.md>
   Design <kvbm_design.md>
   LMCache Integration <../integrations/lmcache_integration.md>
   FlexKV Integration <../integrations/flexkv_integration.md>
   SGLang HiCache <../integrations/sglang_hicache.md>
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
> [!NOTE]
> **This content has moved.** The canonical location for this documentation is now
> [docs/features/multimodal/](../features/multimodal/README.md).
> This file will be removed in a future release.
# Multimodal Inference in Dynamo
Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
> [!IMPORTANT]
> **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
> See the relevant documentation for each backend for the necessary flags.
>
> This prevents unintended processing of multimodal data from untrusted sources.
## Backend Documentation
```{toctree}
:maxdepth: 1
vLLM Multimodal <vllm.md>
TensorRT-LLM Multimodal <trtllm.md>
SGLang Multimodal <sglang.md>
```
## Support Matrix
### Backend Capabilities
| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
|-------|------|-------|------|-----|-------|-------|-------|
| **[vLLM](vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
| **[TRT-LLM](trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
| **[SGLang](sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668))
**Pattern Key:**
- **EPD** - All-in-one worker (Simple Aggregated)
- **E/PD** - Separate encode, combined prefill+decode
- **E/P/D** - All stages separate
- **EP/D** - Combined encode+prefill, separate decode
**Status:** ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported
### Input Format Support
| Format | vLLM | TRT-LLM | SGLang |
|--------|------|---------|--------|
| HTTP/HTTPS URL | ✅ | ✅ | ✅ |
| Data URL (Base64) | ✅ | ❌ | ❌ |
| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |
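
The matrix above can be restated as a small lookup, which is handy for rejecting an unsupported input format before a request is sent. This is only a sketch that transcribes the table; the names `INPUT_FORMAT_SUPPORT` and `backends_supporting` are illustrative and are not part of Dynamo.

```python
# Input-format support, transcribed from the table above.
# INPUT_FORMAT_SUPPORT and backends_supporting are hypothetical names.
INPUT_FORMAT_SUPPORT = {
    "http_url": {"vLLM": True, "TRT-LLM": True, "SGLang": True},
    "data_url": {"vLLM": True, "TRT-LLM": False, "SGLang": False},
    "precomputed_embeddings": {"vLLM": False, "TRT-LLM": True, "SGLang": False},
}


def backends_supporting(input_format: str) -> list[str]:
    """Return the backends that accept the given input format."""
    row = INPUT_FORMAT_SUPPORT[input_format]
    return [backend for backend, supported in row.items() if supported]
```

For example, `backends_supporting("data_url")` lists only vLLM, matching the Data URL row.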
## Architecture Patterns
Dynamo supports several deployment patterns for multimodal inference based on two dimensions:
1. **Encoding**: Is media encoding handled inline (within prefill) or by a separate **Encode Worker**?
   - *Inline*: Simpler setup, encoding happens in the prefill worker
   - *Separate (EPD)*: Dedicated encode worker transfers embeddings via **NIXL (RDMA)**, enabling independent scaling
2. **Prefill/Decode**: Are prefill and decode in the same worker or separate?
   - *Aggregated*: Single worker handles both prefill and decode
   - *Disaggregated*: Separate workers for prefill and decode, with KV cache transfer between them
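
One way to read the two dimensions is as a pair of booleans that selects a pattern name from the Pattern Key. The sketch below assumes exactly that mapping; the function name and its parameters are illustrative and do not correspond to any Dynamo API.

```python
def deployment_pattern(separate_encode: bool, separate_prefill_decode: bool) -> str:
    """Map the two design choices to a pattern name from the Pattern Key.

    Hypothetical helper for illustration only.
    """
    patterns = {
        (False, False): "EPD",   # all-in-one worker
        (True, False): "E/PD",   # separate encode, combined prefill+decode
        (True, True): "E/P/D",   # all stages separate
        (False, True): "EP/D",   # combined encode+prefill, separate decode
    }
    return patterns[(separate_encode, separate_prefill_decode)]
```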
These combine into four deployment patterns:
### EPD - Simple Aggregated
All processing happens within a single worker - the simplest setup.
```text
HTTP Frontend (Rust)
Worker (Python)
↓ image load + encode + prefill + decode
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing |
| Worker | Complete inference pipeline (encode + prefill + decode) |
**When to use:** Quick setup, smaller models, development/testing.
### E/PD - Encode Separate
Encoding happens in a separate worker; prefill and decode share the same engine.
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
PD Worker (Python)
↓ receives embeddings via NIXL, prefill + decode
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| PD Worker | Prefill + Decode with embeddings |
**When to use:** Offload vision encoding to separate GPU, scale encode workers independently.
### E/P/D - Full Disaggregation
Full disaggregation with separate workers for encoding, prefill, and decode.
There are two variants of this workflow:
- Prefill-first, used by vLLM
- Decode-first, used by SGLang
Prefill-first:
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
OR
Decode-first:
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Decode Worker (Python)
↓ Bootstraps prefill worker
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| Prefill Worker | Prefill only, transfers KV cache |
| Decode Worker | Decode only, token generation |
**When to use:** Maximum optimization, multi-node deployment, independent scaling of each phase.
### EP/D - Traditional Disaggregated
Encoding is combined with prefill, with decode separate.
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode+Prefill Worker (Python)
↓ downloads media, encodes inline, prefill, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs (vLLM only) |
| Encode+Prefill Worker | Combined encoding and prefill |
| Decode Worker | Decode only, token generation |
> **Note:** TRT-LLM's EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker.
> For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored.
**When to use:** Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.
## Example Workflows
You can find example workflows and reference implementations for deploying multimodal models in:
- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch)
- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch)
- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch)
- [Advanced multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# SGLang Multimodal
This document provides a comprehensive guide for multimodal inference using the SGLang backend in Dynamo. SGLang multimodal supports **EPD**, **E/PD**, and **E/P/D** flows, with NIXL (RDMA) for zero-copy tensor transfer in disaggregated modes.
## Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Vision encoder generates embeddings |
| **Image** | Data URL (Base64) | No | No | |
| **Video** | HTTP/HTTPS URL | No | No | |
| **Audio** | HTTP/HTTPS URL | No | No | |
### Supported URL Formats
| Format | Example | Description |
|--------|---------|-------------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
## Deployment Patterns
SGLang supports EPD, E/PD, and E/P/D patterns. See [Multimodal Architecture Patterns](index.md#architecture-patterns) for detailed explanations.
| Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------|
| EPD (Simple Aggregated) | ✅ | `agg.sh` | Internal encoding |
| E/PD (Encode Separate) | ✅ | `multimodal_epd.sh` | Vision encoder separate |
| E/P/D (Full Disaggregation) | ✅ | `multimodal_disagg.sh` | KV cache via bootstrap |
| EP/D (Traditional Disaggregated) | ❌ | N/A | Not supported |
### Component Flags
| Component | Flag | Purpose |
|-----------|------|---------|
| Processor | `--multimodal-processor` | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | Vision encoder, embeddings generation |
| PD Worker | `--multimodal-worker` | Prefill + Decode with embeddings |
| Decode Worker | `--multimodal-worker --serving-mode=decode` | Entry point for disaggregation |
| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | Called by Decode, bootstrap coordination |
### SGLang-Specific Characteristics
- **Vision Encoder in Python**: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor)
- **Token Expansion**: Single `<|image_pad|>` token replaced with N tokens based on embedding shape
- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL
- **No Rust Processing**: All tokenization and image handling happens in Python
## Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the most recent tag with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## EPD Serving (Simple Aggregated)
### Components
- worker: [DecodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/llm/decode_handler.py) handles encoding, prefilling, and decoding in a single process.
### Workflow
The `DecodeWorkerHandler` receives multimodal requests with image URLs and passes them directly to SGLang's engine. SGLang's internal `mm_data_processor` handles image fetching, loading, encoding, and token expansion.
```mermaid
flowchart LR
HTTP --> worker
worker --tokenized text + image_urls--> SGLang[SGLang Engine]
```
### Launch
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct --chat-template qwen2-vl
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe the image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 50,
    "stream": false
  }' | jq
```
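The same request can be issued from Python. The sketch below uses only the standard library and assumes the frontend is reachable at `http://localhost:8000`, as in the curl command; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of Dynamo.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, image_url: str, max_tokens: int = 50) -> dict:
    """Build a chat completion payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the frontend (base_url is an assumption) and parse the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send_chat_request(build_chat_request("Qwen/Qwen2.5-VL-7B-Instruct", "Describe the image.", "http://images.cocodataset.org/test2017/000000155781.jpg"))` mirrors the curl example once the deployment is running.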
## E/PD Serving (Encode Separate)
### Components
- workers:
  - [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
  - [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding
- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py)
  - tokenizes the prompt using the chat template
  - passes the text and image URL to the MultimodalEncodeWorker
### Workflow
The `MultimodalEncodeWorker` downloads and encodes the image, then passes the embeddings to the `MultimodalWorker`. The work-complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The `MultimodalWorker` then prefills and decodes the prompt in the same engine, as in the [LLM aggregated serving](../backends/sglang/README.md) example. Only the processor registers with the Dynamo frontend as an available endpoint. The workers do not register; they are internal components that communicate via NATS.
```mermaid
flowchart LR
HTTP --> processor
processor --tokenized request + image_url--> encode_worker
encode_worker --request + embeddings--> worker
worker -.-> encode_worker
encode_worker -.-> processor
processor -.-> HTTP
```
### Launch
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/multimodal_epd.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe the image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 50,
    "stream": false
  }' | jq
```
## E/P/D Serving (Full Disaggregation)
### Components
- workers:
  - [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
  - [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding
  - [MultimodalPrefillWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling
- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py) tokenizes the prompt and passes it to the MultimodalEncodeWorker.
### Workflow
In models like Qwen2.5-VL, embeddings are only required during the prefill stage. The image embeddings are transferred via NIXL from the Encode Worker to the Decode Worker (the entry point for disaggregation), which then coordinates with the Prefill Worker. The Prefill Worker processes the embeddings and forwards the KV cache back to the Decode Worker for token generation.
```mermaid
flowchart LR
HTTP --> processor
processor --tokenized request + image_url--> encode_worker
encode_worker --request + embeddings--> worker
worker --request + embeddings--> prefill_worker
prefill_worker --KV Cache--> worker
encode_worker -.-> processor
worker -.-> encode_worker
processor -.-> HTTP
```
### Launch
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/multimodal_disagg.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe the image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 50,
    "stream": false
  }' | jq
```
## Bootstrap Coordination
SGLang disaggregation uses a bootstrap mechanism for P->D coordination:
### Request Flow (Important)
```text
Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
                                         ↑ Entry point for disaggregation!
```
### Bootstrap Process
1. **Decode Worker** receives request from Encode Worker
2. **Decode Worker** calls Prefill Worker via NATS to request bootstrap info
3. **Prefill Worker** generates `{host, port, room}` and returns immediately
4. **Both workers** connect to same "room" using bootstrap coordinates
5. **SGLang internally** transfers KV cache state via bootstrap connection (not NIXL)
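
The handshake in steps 1 to 4 can be sketched as a toy simulation. Everything here is hypothetical: the class names, the direct method call standing in for the NATS request, and the random room ID are assumptions made for illustration, and the KV cache transfer in step 5 is not modeled.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class BootstrapInfo:
    """The {host, port, room} coordinates returned by the prefill side."""
    host: str
    port: int
    room: int


class PrefillWorker:
    """Hypothetical stand-in for the prefill worker."""

    def __init__(self, host: str, port: int) -> None:
        self.host = host
        self.port = port
        self.joined_rooms: set[int] = set()

    def request_bootstrap(self) -> BootstrapInfo:
        # Step 3: generate coordinates and return immediately.
        room = secrets.randbits(63)  # room ID scheme is an assumption
        self.joined_rooms.add(room)  # Step 4, prefill side
        return BootstrapInfo(self.host, self.port, room)


class DecodeWorker:
    """Hypothetical stand-in for the decode worker (the entry point)."""

    def __init__(self, prefill: PrefillWorker) -> None:
        self.prefill = prefill
        self.joined_rooms: set[int] = set()

    def handle(self, request_id: str) -> BootstrapInfo:
        # Steps 1-2: receive the request, then ask prefill for bootstrap info.
        # A plain method call replaces the NATS round trip in this sketch.
        info = self.prefill.request_bootstrap()
        self.joined_rooms.add(info.room)  # Step 4, decode side
        return info
```

After `DecodeWorker(prefill).handle("req-1")`, both workers hold the same room ID, which is the precondition for the engine-internal transfer described in step 5.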
### Key Difference from vLLM
- vLLM: Frontend → Prefill → Decode (Prefill is entry point)
- SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is entry point)
## Inter-Component Communication
### Control Flow (NATS)
All component-to-component communication happens via NATS:
#### E/PD Mode (Encode Separate)
```text
Processor → Encode Worker → PD Worker
(NATS) (NATS + NIXL embeddings)
```
#### E/P/D Mode (Full Disaggregation)
```text
Processor → Encode Worker → DECODE Worker → Prefill Worker
(NATS) (NATS) (NATS)
Decode requests bootstrap
Prefill returns {host, port, room}
Both connect via bootstrap
SGLang internal KV cache transfer
```
### Detailed Message Flow
```text
Processor → Encode Worker:
- NATS round_robin with SglangMultimodalRequest
- Contains: tokenized input_ids, image URL, sampling params
Encode Worker → Decode/PD Worker:
- NATS round_robin to "backend" component
- Contains: expanded token_ids, NIXL metadata, embeddings shape
- NIXL transfer: embeddings tensor
Decode Worker → Prefill Worker (disagg only):
- NATS call to "prefill" component
- Decode requests bootstrap coordinates
- Prefill returns: {bootstrap_host, bootstrap_port, bootstrap_room}
Prefill ↔ Decode (via bootstrap):
- SGLang internal connection (not NATS)
- KV cache state shared via bootstrap mechanism
```
### Data Transfer (NIXL)
NIXL is used only for embedding transfer:
```python
# Encode Worker
descriptor = connect.Descriptor(precomputed_embeddings)
with connector.create_readable(descriptor) as readable:
    request.serialized_request = readable.metadata()
    await pd_worker_client.round_robin(request)
    await readable.wait_for_completion()

# PD Worker
embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
descriptor = connect.Descriptor(embeddings)
read_op = await connector.begin_read(request.serialized_request, descriptor)
await read_op.wait_for_completion()
```
## Vision Encoding Details
### Encode Worker Components
The encode worker loads and runs the vision model in Python:
```python
self.image_processor = AutoImageProcessor.from_pretrained(
    model_path, trust_remote_code=True
)
self.vision_model = AutoModel.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
```
### Token Expansion Process
1. Processor inserts single image token (e.g., `<|image_pad|>`)
2. Encode worker generates embeddings: `shape = (batch, num_patches, hidden_dim)`
3. Encode worker replaces single token with `num_patches` tokens
4. Downstream worker receives expanded token sequence
Example:
```python
# Before: ["Hello", "<|image_pad|>", "world"]
# After: ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]
```
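The expansion step can be sketched as a plain list transformation. This is a minimal illustration that works on token strings rather than token IDs and assumes every placeholder expands to the same `num_patches`; the function name is hypothetical and the actual encode worker logic is model-specific.

```python
def expand_image_tokens(
    tokens: list[str],
    num_patches: int,
    image_token: str = "<|image_pad|>",
) -> list[str]:
    """Replace each image placeholder with num_patches copies of itself."""
    expanded: list[str] = []
    for token in tokens:
        if token == image_token:
            expanded.extend([image_token] * num_patches)
        else:
            expanded.append(token)
    return expanded
```

With `num_patches=576`, the three-token input from the example above grows to 578 tokens: 576 placeholders plus the two text tokens.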
## Chat Template Processing
SGLang uses its own chat template system:
```python
from sglang.srt.parser.conversation import chat_templates
conv = chat_templates["qwen2-vl"].copy()
conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image")
processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")
```
Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.
## NIXL Usage
| Use Case | NIXL Used? | Data Transfer | Notes |
|----------|------------|---------------|-------|
| EPD (Simple Aggregated) | No | N/A | All processing internal to SGLang |
| E/PD (Encode Separate) | Yes | Encoder → PD (embeddings) | Vision encoder separate |
| E/P/D (Full Disaggregation) | Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |
**Key Difference:** SGLang P/D transfers the KV cache through its bootstrap mechanism, whereas vLLM uses NIXL for KV cache transfer.
## Known Limitations
- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, `.bin` embedding files; vision encoder runs for every request
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **Only Processor registers with Dynamo** - Workers are internal components, frontend routes to Processor only
- **Disaggregated routing** - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers
- **Limited model generalization** - Token expansion logic is model-specific; adding new models may require implementation updates
## Supported Models
SGLang multimodal **only supports image-based vision-language models**:
- **Qwen2-VL** / **Qwen2.5-VL** (primary support)
- Models with `AutoImageProcessor` and vision tower
- Models compatible with SGLang's image embedding format
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/sglang/main.py` | Component initialization, only Processor registers |
| `components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py` | Processor implementation, OpenAI→SGLang |
| `components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py` | Vision encoder, embeddings generation |
| `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | PD/Prefill/Decode workers, NIXL read |
| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing |
| `components/src/dynamo/sglang/protocol.py` | Request/response data structures |
| `components/src/dynamo/sglang/register.py` | Registration logic (only called for Processor) |
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# TensorRT-LLM Multimodal
This document provides a comprehensive guide for multimodal inference using the TensorRT-LLM backend in Dynamo.
You can provide multimodal inputs in the following ways:
- By sending image URLs
- By providing paths to pre-computed embedding files
> **Note:** Provide **either image URLs or embedding file paths** in a single request, not both.
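
A client can enforce this rule before sending a request. The sketch below classifies each input as an HTTP/HTTPS URL or as an embedding file, using the `.pt`, `.pth`, and `.bin` extensions listed in the support matrix, and rejects mixed requests. It is a client-side illustration only; the function names are hypothetical and this is not how Dynamo itself validates input.

```python
from urllib.parse import urlparse

# Extensions taken from the support matrix; the constant name is hypothetical.
EMBEDDING_SUFFIXES = (".pt", ".pth", ".bin")


def classify_input(url: str) -> str:
    """Return "image_url" for HTTP/HTTPS URLs or "embedding_file" for embedding paths."""
    if urlparse(url).scheme.lower() in ("http", "https"):
        return "image_url"
    if url.lower().endswith(EMBEDDING_SUFFIXES):
        return "embedding_file"
    raise ValueError(f"unsupported multimodal input: {url!r}")


def check_single_input_kind(urls: list[str]) -> str:
    """Ensure all inputs in one request are of the same kind and return that kind."""
    if not urls:
        raise ValueError("no multimodal inputs given")
    kinds = {classify_input(url) for url in urls}
    if len(kinds) != 1:
        raise ValueError("provide either image URLs or embedding file paths, not both")
    return kinds.pop()
```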
## Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files |
| **Video** | HTTP/HTTPS URL | No | No | Not implemented |
| **Audio** | HTTP/HTTPS URL | No | No | Not implemented |
### Supported URL Formats
| Format | Example | Description |
|--------|---------|-------------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) |
## Deployment Patterns
TRT-LLM supports aggregated and traditional disaggregated patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations.
| Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------|
| Aggregated | ✅ | `agg.sh` | Easiest setup, single worker |
| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal.sh` | Prefill handles encoding, 2 workers |
| E/P/D (Full - Image URLs) | ✅ | `epd_multimodal_image_and_embeddings.sh` | Standalone encoder with `MultimodalEncoder`, 3 workers |
| E/P/D (Full - Pre-computed Embeddings) | ✅ | `epd_multimodal_image_and_embeddings.sh` | Standalone encoder with NIXL transfer, 3 workers |
| E/P/D (Large Models) | ✅ | `epd_disagg.sh` | For Llama-4 Scout/Maverick, multi-node |
### Component Flags
| Component | Flag | Purpose |
|-----------|------|---------|
| Worker | `--modality multimodal` | Complete pipeline (aggregated) |
| Prefill Worker | `--disaggregation-mode prefill` | Image processing + Prefill (multimodal tokenization happens here) |
| Decode Worker | `--disaggregation-mode decode` | Decode only |
| Encode Worker | `--disaggregation-mode encode` | Image encoding (E/P/D flow) |
## Aggregated Serving
Quick steps to launch Llama-4 Maverick BF16 in aggregated mode:
```bash
cd $DYNAMO_HOME
export AGG_ENGINE_ARGS=./examples/backends/trtllm/engine_configs/llama4/multimodal/agg.yaml
export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
./examples/backends/trtllm/launch/agg.sh
```
**Client:**
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image"
},
{
"type": "image_url",
"image_url": {
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
}
}
]
}
],
"stream": false,
"max_tokens": 160
}'
```
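
The same request can be built from Python. This is a minimal sketch using only the standard library; `build_chat_request` and `send_chat_request` are illustrative names, and the endpoint is assumed to be the `localhost:8000` address from the curl example.

```python
import json
import urllib.request


def build_chat_request(model, prompt, image_url, max_tokens=160):
    """Build the chat-completions payload used in the curl example."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "stream": False,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload; requires a running frontend (not called here)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request(
    "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "Describe the image",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png",
)
body = json.dumps(payload)
```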
## Disaggregated Serving
Example using `Qwen/Qwen2-VL-7B-Instruct`:
```bash
cd $DYNAMO_HOME
export MODEL_PATH="Qwen/Qwen2-VL-7B-Instruct"
export SERVED_MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
export PREFILL_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"
export DECODE_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"
export MODALITY="multimodal"
./examples/backends/trtllm/launch/disagg.sh
```
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image"
},
{
"type": "image_url",
"image_url": {
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
}
}
]
}
],
"stream": false,
"max_tokens": 160
}'
```
For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving (see [Multi-node Deployment](#multi-node-deployment-slurm) below), while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
## Full E/P/D Flow (Image URLs)
For high-performance multimodal inference, Dynamo supports a standalone encoder with an **Encode-Prefill-Decode (E/P/D)** flow using TRT-LLM's `MultimodalEncoder`. This separates the vision encoding stage from prefill and decode, enabling better GPU utilization and scalability.
### Supported Input Formats
| Format | Example | Description |
|--------|---------|-------------|
| **HTTP/HTTPS URL** | `https://example.com/image.jpg` | Remote image files |
| **Base64 Data URL** | `data:image/jpeg;base64,...` | Inline base64-encoded images |
### How It Works
In the full E/P/D flow:
1. **Encode Worker**: Runs TRT-LLM's `MultimodalEncoder.generate()` to process image URLs through the vision encoder and projector
2. **Prefill Worker**: Receives `disaggregated_params` containing multimodal embedding handles, processes context and generates KV cache
3. **Decode Worker**: Performs streaming token generation using the KV cache
The encode worker uses TRT-LLM's `MultimodalEncoder` class (which inherits from `BaseLLM`) and only requires the model path and batch size - no KV cache configuration is needed since it only runs the vision encoder + projector.
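
The hand-off between the three stages can be pictured with a toy pipeline. The sketch below does not call TRT-LLM: the `DisaggregatedParams` dataclass, its fields, and the three worker functions are hypothetical stand-ins that only show which data moves from one stage to the next.

```python
from dataclasses import dataclass, field


@dataclass
class DisaggregatedParams:
    """Hypothetical stand-in for the params passed between workers."""
    multimodal_embedding_handles: list = field(default_factory=list)
    kv_cache_ready: bool = False


def encode_worker(image_urls):
    # Placeholder for vision encoder + projector: one opaque handle per image.
    handles = [f"handle-{i}" for i, _ in enumerate(image_urls)]
    return DisaggregatedParams(multimodal_embedding_handles=handles)


def prefill_worker(prompt, params):
    # Placeholder for context processing; marks the KV cache as generated.
    params.kv_cache_ready = True
    return params


def decode_worker(params, max_tokens):
    # Decode only works once prefill has produced a KV cache.
    if not params.kv_cache_ready:
        raise RuntimeError("decode requires a KV cache from prefill")
    for i in range(max_tokens):
        yield f"token-{i}"


params = encode_worker(["https://example.com/image.jpg"])
params = prefill_worker("Describe the image", params)
tokens = list(decode_worker(params, max_tokens=3))
```

The encode stage returns only handles, which matches the point above that it needs no KV cache configuration of its own.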
### How to Launch
```bash
cd $DYNAMO_HOME
# Launch 3-worker E/P/D flow with image URL support
./examples/backends/trtllm/launch/epd_multimodal_image_and_embeddings.sh
```
### Example Request
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llava-v1.6-mistral-7b-hf",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image"},
{
"type": "image_url",
"image_url": {
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
}
}
]
}
],
"max_tokens": 160
}'
```
### E/P/D Architecture (Image URLs)
```mermaid
sequenceDiagram
participant Client
participant Frontend
participant PrefillWorker as "Prefill Worker"
participant EncodeWorker as "Encode Worker"
participant DecodeWorker as "Decode Worker"
Client->>Frontend: POST /v1/chat/completions (image URL)
Frontend->>PrefillWorker: Route to prefill worker
PrefillWorker->>EncodeWorker: Send request (image URL)
Note over EncodeWorker: MultimodalEncoder.generate()<br/>runs vision encoder + projector
EncodeWorker->>PrefillWorker: Return disaggregated_params<br/>(multimodal_embedding_handles)
Note over PrefillWorker: Process context with embeddings<br/>Generate KV cache
PrefillWorker->>Frontend: Return prefill response
Frontend->>DecodeWorker: Route to decode worker
DecodeWorker->>Frontend: Stream response chunks
Frontend->>Client: Stream response
```
### Key Differences from EP/D (Traditional Disaggregated)
| Aspect | EP/D (Traditional) | E/P/D (Full) |
|--------|-------------------|--------------|
| **Encoding** | Prefill worker handles image encoding | Dedicated encode worker |
| **Prefill Load** | Higher (encoding + prefill) | Lower (prefill only) |
| **Use Case** | Simpler setup | Better scalability for vision-heavy workloads |
| **Launch Script** | `disagg_multimodal.sh` | `epd_multimodal_image_and_embeddings.sh` |
## Pre-computed Embeddings with E/P/D Flow
For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (E/P/D)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
### Supported File Types
- `.pt` - PyTorch tensor files
- `.pth` - PyTorch checkpoint files
- `.bin` - Binary tensor files
### Embedding File Formats
TRT-LLM supports two formats for embedding files:
**1. Simple Tensor Format**
Direct tensor saved as `.pt` file containing only the embedding tensor:
```python
import torch

embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
torch.save(embedding_tensor, "embedding.pt")
```
**2. Dictionary Format with Auxiliary Data**
Dictionary containing multiple keys, used by models like Llama-4 that require additional metadata:
```python
import torch

embedding_dict = {
    "mm_embeddings": torch.rand(1, 576, 4096),
    "special_tokens": [128256, 128257],
    "image_token_offsets": [[0, 576]],
    # ... other model-specific metadata
}
torch.save(embedding_dict, "llama4_embedding.pt")
```
- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter
- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data
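
The two loading rules above can be sketched as a single function. This is an illustration rather than the worker's actual loader: `split_embedding_payload` is a hypothetical name, and nested lists stand in for tensors so the example runs without PyTorch.

```python
def split_embedding_payload(loaded):
    """Return (mm_embeddings, auxiliary_data) for either file format."""
    if isinstance(loaded, dict):
        if "mm_embeddings" not in loaded:
            raise KeyError("dictionary format requires an 'mm_embeddings' key")
        # Every key other than the main tensor is kept as auxiliary data.
        auxiliary = {k: v for k, v in loaded.items() if k != "mm_embeddings"}
        return loaded["mm_embeddings"], auxiliary
    # Simple tensor format: the object itself is the embedding.
    return loaded, {}


embeddings, auxiliary = split_embedding_payload(
    {"mm_embeddings": [[0.1, 0.2]], "special_tokens": [128256, 128257]}
)
```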
### How to Launch
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
# Launch 3-worker E/P/D flow with NIXL
./launch/epd_disagg.sh
```
> **Note:** This script is designed for 8-node H200 with `Llama-4-Scout-17B-16E-Instruct` model and assumes you have a model-specific embedding file ready.
### Configuration
```bash
# Encode endpoint for Prefill → Encode communication
export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
# Security: Allowed directory for embedding files (default: /tmp)
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
# Security: Max file size to prevent DoS attacks (default: 50MB)
export MAX_FILE_SIZE_MB=50
```
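
The two security settings imply a check like the one sketched below: resolve the requested path, confirm it sits inside the allowed directory, then compare its size with the limit. `validate_embedding_path` is a hypothetical helper written for illustration; the real worker may enforce these limits differently.

```python
import os
import tempfile


def validate_embedding_path(path, allowed_dir=None, max_size_mb=None):
    """Return the resolved path if it passes both checks, else raise."""
    if allowed_dir is None:
        allowed_dir = os.environ.get("ALLOWED_LOCAL_MEDIA_PATH", "/tmp")
    if max_size_mb is None:
        max_size_mb = int(os.environ.get("MAX_FILE_SIZE_MB", "50"))
    real_allowed = os.path.realpath(allowed_dir)
    real_path = os.path.realpath(path)  # resolves symlinks and ".." segments
    if os.path.commonpath([real_allowed, real_path]) != real_allowed:
        raise PermissionError(f"{path} is outside {allowed_dir}")
    if os.path.getsize(real_path) > max_size_mb * 1024 * 1024:
        raise ValueError(f"{path} exceeds {max_size_mb} MB")
    return real_path


with tempfile.TemporaryDirectory() as allowed:
    inside = os.path.join(allowed, "embedding.pt")
    with open(inside, "wb") as f:
        f.write(b"\x00" * 16)
    ok_path = validate_embedding_path(inside, allowed_dir=allowed, max_size_mb=1)
    try:
        escape = os.path.join(allowed, "..", "escape.pt")
        validate_embedding_path(escape, allowed_dir=allowed, max_size_mb=1)
        escaped = True
    except PermissionError:
        escaped = False
```

Resolving the path before comparing it is what stops `..` segments and symlinks from escaping the allowed directory.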
### Example Request with Pre-computed Embeddings
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image"},
{"type": "image_url", "image_url": {"url": "/path/to/embedding.pt"}}
]
}
],
"max_tokens": 160
}'
```
### E/P/D Architecture
The E/P/D flow implements a **3-worker architecture**:
- **Encode Worker**: Loads pre-computed embeddings, transfers via NIXL
- **Prefill Worker**: Receives embeddings, handles context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation
```mermaid
sequenceDiagram
participant Client
participant Frontend
participant PrefillWorker as "Prefill Worker"
participant EncodeWorker as "Encode Worker"
participant DecodeWorker as "Decode Worker"
participant NIXL as "NIXL (RDMA)"
Client->>Frontend: POST /v1/chat/completions
Frontend->>PrefillWorker: Route to prefill worker
PrefillWorker->>EncodeWorker: Send request (embedding paths)
EncodeWorker->>NIXL: Create readable operation
EncodeWorker->>PrefillWorker: Send metadata + NIXL info
PrefillWorker->>NIXL: Begin read operation
NIXL-->>PrefillWorker: Zero-copy transfer complete
PrefillWorker->>Frontend: Return prefill response
Frontend->>DecodeWorker: Route to decode worker
DecodeWorker->>Frontend: Stream response chunks
Frontend->>Client: Stream response
```
## Multi-node Deployment (Slurm)
This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm.
> **Note:** The scripts referenced in this section can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
### Environment Setup
Assuming you have allocated your nodes via `salloc` and are inside an interactive shell:
```bash
# Container image (build using docs/backends/trtllm/README.md#build-container)
export IMAGE="<dynamo_trtllm_image>"
# Host:container path pairs for mounting
export MOUNTS="${PWD}/../../../../:/mnt"
# Model configuration
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export MODALITY=${MODALITY:-"multimodal"}
```
### Multi-node Disaggregated Launch
For four 4xGB200 nodes (2 for prefill, 2 for decode):
```bash
# Customize parallelism to match your engine configs
# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml"
# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml"
# export NUM_PREFILL_NODES=2
# export NUM_DECODE_NODES=2
# export NUM_GPUS_PER_NODE=4
# Launches frontend + etcd/nats on head node, plus prefill and decode workers
./srun_disaggregated.sh
```
### Understanding the Output
1. `srun_disaggregated.sh` launches three srun jobs: frontend, prefill worker, and decode worker
2. The OpenAI frontend will dynamically discover workers as they register:
```
INFO dynamo_run::input::http: Watching for remote model at models
INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000
```
3. TRT-LLM workers output progress from each MPI rank while loading
4. When ready, the frontend logs:
```
INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
```
### Cleanup
```bash
pkill srun
```
## NIXL Usage
| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
| Aggregated | `agg.sh` | No | All in one worker |
| EP/D (Traditional Disaggregated) | `disagg_multimodal.sh` | Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E/P/D (Image URLs) | `epd_multimodal_image_and_embeddings.sh` | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) |
| E/P/D (Pre-computed Embeddings) | `epd_multimodal_image_and_embeddings.sh` | Yes | Encoder → Prefill (embeddings via NIXL RDMA) |
| E/P/D (Large Models) | `epd_disagg.sh` | Yes | Encoder → Prefill (embeddings via NIXL), Prefill → Decode (KV cache) |
> **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
## ModelInput Types and Registration
TRT-LLM workers register with Dynamo using:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Tokens` | Rust frontend may tokenize, but multimodal flows re-tokenize and build inputs in the Python worker; Rust token_ids are ignored | All TRT-LLM workers |
```python
# TRT-LLM Worker - Register with Tokens
await register_llm(
ModelInput.Tokens, # Rust does minimal preprocessing
model_type, # ModelType.Chat or ModelType.Prefill
generate_endpoint,
model_name,
...
)
```
## Inter-Component Communication
| Transfer Stage | Message | NIXL Transfer |
|----------------|---------|---------------|
| **Frontend → Prefill** | Request with image URL or embedding path | No |
| **Prefill → Encode (Image URL)** | Request with image URL | No |
| **Encode → Prefill (Image URL)** | `ep_disaggregated_params` with `multimodal_embedding_handles`, processed prompt, and token IDs | No |
| **Prefill → Encode (Embedding Path)** | Request with embedding file path | No |
| **Encode → Prefill (Embedding Path)** | NIXL readable metadata + shape/dtype + auxiliary data | Yes (Embeddings tensor via RDMA) |
| **Prefill → Decode** | `disaggregated_params` with `_epd_metadata` (prompt, token IDs) | Configurable (KV cache: NIXL default, UCX optional) |
## Known Limitations
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **Multimodal preprocessing/tokenization happens in Python** - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker
- **Multi-node H100 limitation** - Loading `meta-llama/Llama-4-Maverick-17B-128E-Instruct` with 8 nodes of H100 with TP=16 is not possible due to head count divisibility (`num_attention_heads: 40` not divisible by `tp_size: 16`)
- **llava-v1.6-mistral-7b-hf model crash** - Known issue with TRT-LLM backend compatibility with `TensorRT LLM version: 1.2.0rc6.post1`. To use the LLaVA model, download revision `revision='52320fb52229'` locally using HF.
- **Embeddings file crash** - Known issue with TRT-LLM backend compatibility with `TensorRT LLM version: 1.2.0rc6.post1`. Embedding file parsing crashes in `attach_multimodal_embeddings()`. To be fixed in the next TRT-LLM upgrade.
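
The head-count limitation above is a plain divisibility check. As a worked example using the numbers from that bullet (`heads_divisible` is an illustrative helper, not a TRT-LLM function):

```python
def heads_divisible(num_attention_heads: int, tp_size: int) -> bool:
    """True when attention heads split evenly across tensor-parallel ranks."""
    return num_attention_heads % tp_size == 0


NUM_ATTENTION_HEADS = 40  # value quoted in the limitation above

remainder_at_tp16 = NUM_ATTENTION_HEADS % 16
valid_tp_sizes = [tp for tp in (1, 2, 4, 8, 16) if heads_divisible(NUM_ATTENTION_HEADS, tp)]
```

With 40 heads, `tp_size` values of 1, 2, 4, and 8 divide evenly, while 16 leaves a remainder of 8.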
## Supported Models
Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
Common examples:
- **Llama 4 Vision models** (Maverick, Scout) - Recommended for large-scale deployments
- **LLaVA models** (e.g., `llava-hf/llava-v1.6-mistral-7b-hf`) - Default model for E/P/D examples
- **Qwen2-VL models** - Supported in traditional disaggregated mode
- Other vision-language models with TRT-LLM support
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/trtllm/engine.py` | TensorRTLLMEngine wrapper (LLM and MultimodalEncoder) |
| `components/src/dynamo/trtllm/constants.py` | DisaggregationMode enum (AGGREGATED, PREFILL, DECODE, ENCODE) |
| `components/src/dynamo/trtllm/encode_helper.py` | Encode worker request processing (embedding-path and full EPD flows) |
| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing |
| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handlers (EncodeHandler, PrefillHandler, DecodeHandler) |
| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler with disaggregated params encoding/decoding |
| `components/src/dynamo/trtllm/utils/disagg_utils.py` | DisaggregatedParamsCodec for network transfer |
| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# vLLM Multimodal
This document provides a comprehensive guide for multimodal inference using the vLLM backend in Dynamo.
> [!IMPORTANT]
> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
> This flag is analogous to `--enable-mm-embeds` in `vllm serve`, but extends it to all multimodal content (URLs, embeddings, Base64).
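
The startup rule can be sketched with `argparse`. This is an assumption about how such a check might look, not Dynamo's argument parser: only the three flag names come from this document, and `parse_worker_args` and `startup_fails` are hypothetical.

```python
import argparse


def parse_worker_args(argv):
    """Parse flags and refuse multimodal roles without --enable-multimodal."""
    parser = argparse.ArgumentParser(prog="worker-sketch")
    parser.add_argument("--enable-multimodal", action="store_true")
    parser.add_argument("--multimodal-worker", action="store_true")
    parser.add_argument("--multimodal-processor", action="store_true")
    args = parser.parse_args(argv)
    uses_multimodal = args.multimodal_worker or args.multimodal_processor
    if uses_multimodal and not args.enable_multimodal:
        # parser.error prints a usage message and exits with status 2.
        parser.error("multimodal flags require --enable-multimodal")
    return args


def startup_fails(argv):
    """Return True when the argument check aborts startup."""
    try:
        parse_worker_args(argv)
    except SystemExit:
        return True
    return False


args = parse_worker_args(["--enable-multimodal", "--multimodal-worker"])
```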
## Support Matrix
| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images |
| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing |
| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies |
### Supported URL Formats
| Format | Example | Description |
|--------|---------|-------------|
| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files |
| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data |
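
A data URL of the form shown in the table can be produced from raw bytes with the standard library. This is a small sketch; `to_data_url` is an illustrative helper name.

```python
import base64


def to_data_url(data: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw media bytes as an inline Base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The first six bytes of a typical JPEG file encode to the "/9j/4AAQ" prefix.
jpeg_prefix_url = to_data_url(b"\xff\xd8\xff\xe0\x00\x10")
```

The resulting string goes in the `url` field of an `image_url` content part, in place of an HTTP/HTTPS URL.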
## Deployment Patterns
vLLM supports all multimodal deployment patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations.
| Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------|
| EPD (Simple Aggregated) | ✅ | `agg_multimodal.sh` | Easiest setup |
| E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker |
| E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate |
| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models |
| E/PD (EC Connector) | ✅ | `agg_multimodal_ec_connector.sh` | vLLM-native encoder with ECConnector |
### Component Flags
| Component | Flag | Purpose |
|-----------|------|---------|
| Processor | `--multimodal-processor` | HTTP entry, tokenization |
| Encode Worker | `--multimodal-encode-worker` | Media encoding |
| PD Worker | `--multimodal-worker` | Prefill + Decode |
| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Prefill only |
| Decode Worker | `--multimodal-decode-worker` | Decode only |
| Encode+Prefill Worker | `--multimodal-encode-prefill-worker --is-prefill-worker` | Combined (Llama 4) |
| vLLM Native Encoder | `--vllm-native-encoder-worker` | vLLM-native encoding with ECConnector |
## Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Image Serving
### E/PD Serving (Encode Separate)
**Components:**
- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
The EncodeWorkerHandler encodes the image and passes the embeddings to the MultimodalPDWorkerHandler via NATS and RDMA. The work-complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> pd_worker
pd_worker --> encode_worker
```
> **Note:** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct. Disaggregated serving is currently only confirmed for LLaVA.
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
# Serve a LLaVA 1.5 7B model:
bash launch/agg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
# Serve a Qwen2.5-VL model:
bash launch/agg_multimodal_epd.sh --model Qwen/Qwen2.5-VL-7B-Instruct
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```
### E/P/D Serving (Full Disaggregation)
**Components:**
- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [MultimodalDecodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
For the LLaVA model, embeddings are only required during the prefill stage. The EncodeWorkerHandler is connected directly to the prefill worker, encoding the image and passing embeddings via NATS and RDMA. The prefill worker performs the prefilling step and forwards the KV cache to the decode worker.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker
encode_worker --> processor
encode_worker --embeddings--> prefill_worker
prefill_worker --> encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
```
> [!NOTE]
> Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
## ECConnector Serving
ECConnector is vLLM's native connector for transferring multimodal embeddings via an Embedding Cache. The encoder worker acts as a **producer** (writes embeddings), while the PD worker acts as a **consumer** (reads embeddings).
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor[EC Processor]
processor --image_url--> encoder[vLLM Native Encoder<br/>Producer]
encoder --writes--> cache[(Embedding Cache)]
cache --reads--> pd[PD Worker<br/>Consumer]
pd --> processor
processor --> HTTP
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_multimodal_ec_connector.sh --model llava-hf/llava-1.5-7b-hf
# Custom storage path for Embedding Cache
bash launch/agg_multimodal_ec_connector.sh --ec-storage-path /shared/encoder-cache
```
**Client:** Same as [E/PD Serving](#epd-serving-encode-separate)
## Llama 4 Serving
The Llama 4 model family is natively multimodal. Unlike LLaVA, these models do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encoder worker is not used and encoding is done alongside prefill.
Example model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` on H100x8.
### Llama 4 Aggregated Serving
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> pd_worker
pd_worker --> processor
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_multimodal_llama.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/test2017/000000155781.jpg"
}
}
]
}
],
"max_tokens": 300,
"temperature": 0.0,
"stream": false
}'
```
### Llama 4 Disaggregated Serving
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> prefill_worker
prefill_worker --> processor
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_llama.sh --head-node
# On a separate node with NATS_SERVER and ETCD_ENDPOINTS pointing to head node:
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_llama.sh
```
## Video Serving
### Video Aggregated Serving
**Components:**
- workers: [VideoEncodeWorker](../../examples/multimodal/components/video_encode_worker.py) for decoding video into frames, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
The VideoEncodeWorker decodes the video into frames. Unlike the image pipeline which generates embeddings, this pipeline passes raw frames directly to the VllmPDWorker via NATS and RDMA.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --video_url--> video_encode_worker
video_encode_worker --> processor
video_encode_worker --frames--> pd_worker
pd_worker --> video_encode_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/multimodal
bash launch/video_agg.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the video in detail"
},
{
"type": "video_url",
"video_url": {
"url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
}
}
]
}
],
"max_tokens": 300,
"stream": false
}' | jq
```
### Video Disaggregated Serving
**Workflow:**
For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. The VideoEncodeWorker is connected directly to the prefill worker, decoding the video into frames and passing them via RDMA.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --video_url--> video_encode_worker
video_encode_worker --> processor
video_encode_worker --frames--> prefill_worker
prefill_worker --> video_encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/multimodal
bash launch/video_disagg.sh
```
## Audio Serving
### Audio Aggregated Serving
**Components:**
- workers: [AudioEncodeWorker](../../examples/multimodal/components/audio_encode_worker.py) for decoding audio into embeddings, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
- frontend: HTTP endpoint to handle incoming requests.
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --audio_url--> audio_encode_worker
audio_encode_worker --> processor
audio_encode_worker --embeddings--> pd_worker
pd_worker --> audio_encode_worker
```
**Launch:**
```bash
pip install "vllm[audio]" accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_agg.sh
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-Audio-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is recited in the audio?"
},
{
"type": "audio_url",
"audio_url": {
"url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"
}
}
]
}
],
"max_tokens": 6000,
"temperature": 0.8,
"stream": false
}' | jq
```
### Audio Disaggregated Serving
**Workflow:**
For the Qwen2-Audio model, audio embeddings are only required during the prefill stage. The AudioEncodeWorker is connected directly to the prefill worker.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --audio_url--> audio_encode_worker
audio_encode_worker --> processor
audio_encode_worker --embeddings--> prefill_worker
prefill_worker --> audio_encode_worker
prefill_worker --> decode_worker
decode_worker --> prefill_worker
```
**Launch:**
```bash
pip install "vllm[audio]" accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_disagg.sh
```
## NIXL Usage
| Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------|
| EPD (Simple Aggregated) | `agg_multimodal.sh` | No | All in one worker |
| E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) |
| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) |
| EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) |
| E/PD (EC Connector) | `agg_multimodal_ec_connector.sh` | No | ECConnector via Embedding Cache |
## ModelInput Types and Registration
Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests:
| ModelInput Type | Preprocessing | Use Case |
|-----------------|---------------|----------|
| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves |
| `ModelInput.Tokens` | Rust SDK would tokenize (but bypassed in multimodal) | Components expecting pre-tokenized input |
**Registration Pattern:**
```python
# Processor - Entry point from HTTP frontend
await register_llm(
ModelInput.Text, # Frontend sends raw text
ModelType.Chat,
generate_endpoint,
model_name,
...
)
# Workers - Internal components
await register_llm(
ModelInput.Tokens, # Expect pre-tokenized input
ModelType.Chat, # or ModelType.Prefill for prefill workers
generate_endpoint,
model_name,
...
)
```
## Known Limitations
- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
## Supported Models
The following models have been tested with Dynamo's vLLM multimodal backend:
- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct`
- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`
- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf`
- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`
- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf`
- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct`
For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested.
## Key Files
| File | Description |
|------|-------------|
| `components/src/dynamo/vllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/vllm/args.py` | Command-line argument parsing |
| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation |
| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementations (custom and vLLM-native) |
| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation |
...@@ -151,5 +151,5 @@ docker run -it --rm nvcr.io/nvidia/aiconfigurator:latest \
## Learn More
- [Dynamo Installation Guide](/docs/kubernetes/installation_guide.md)
- [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md) - [SLA Planner Guide](/docs/components/planner/planner_guide.md)
- [Benchmarking Guide](/docs/benchmarks/benchmarking.md) - [Benchmarking Guide](/docs/benchmarks/benchmarking.md)
\ No newline at end of file
# Load-based Planner
This document covers the load-based planner in `examples/llm/components/planner.py`.
> [!WARNING]
> The load-based planner is currently inoperable because none of the vLLM, SGLang, or TensorRT-LLM examples use prefill queues. Please use the SLA planner for now.
> [!WARNING]
> Bare metal deployment with the local connector is deprecated. The load-based planner can only be deployed on Kubernetes. We will update the examples in this document soon.
## Load-based Scaling Up/Down Prefill/Decode Workers
To adjust the number of prefill/decode workers, planner monitors the following metrics:
* Prefill worker: planner monitors the number of requests pending in the prefill queue to estimate the prefill workload.
* Decode/aggregated worker: planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload.
Every `metric-pulling-interval`, planner gathers these metrics. Every `adjustment-interval`, planner compares the metrics aggregated over that interval with preset thresholds and decides whether to scale prefill/decode workers up or down. To avoid over-compensation, planner changes the number of workers by at most 1 per adjustment interval. In addition, while the number of workers is being adjusted, planner blocks metric pulling and adjustment.
To scale up a prefill/decode worker, planner only needs to launch the worker in the correct namespace. The auto-discovery mechanism picks up the new workers and adds them to the routers. To scale down a prefill worker, planner sends a SIGTERM signal to the prefill worker. The prefill worker records the signal and exits once it finishes the current request pulled from the prefill queue, which ensures that no remote prefill request is dropped. To scale down a decode worker, planner revokes the decode worker's etcd lease. Once the lease is revoked, the decode worker is immediately removed from the router and receives no new requests. The decode worker then finishes all in-flight requests in their original streams and exits gracefully.
There are two additional rules set by planner to prevent over-compensation:
1. After a new decode worker is added, it needs time to populate its KV cache, so planner does not scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
1. Planner does not scale up prefill workers if, following the trend observed in the current adjustment interval, the prefill queue size is projected to fall below `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals.
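
The two rules above, together with the one-worker-per-interval limit, can be sketched roughly as follows. This is a minimal illustration and not the code in `planner.py`: the function names, the linear trend extrapolation, and the convention that `intervals_since_scale_up` equals 1 at the first adjustment after a worker was added are all assumptions. The thresholds reuse the default values listed under Usage.

```python
# Hypothetical sketch of the load-based rules; names are not taken from planner.py.
NEW_DECODE_WORKER_GRACE_PERIOD = 3
NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD = 3


def next_decode_worker_count(current, avg_kv_util, intervals_since_scale_up,
                             scale_up=0.9, scale_down=0.5,
                             min_workers=1, max_workers=8):
    """Return the decode worker count for the next interval (changes by at most 1)."""
    if avg_kv_util > scale_up and current < max_workers:
        return current + 1
    # Assumed convention: the counter is 1 at the first adjustment after a scale-up,
    # so values 1..GRACE_PERIOD fall inside the grace period.
    in_grace = intervals_since_scale_up <= NEW_DECODE_WORKER_GRACE_PERIOD
    if avg_kv_util < scale_down and not in_grace and current > min_workers:
        return current - 1
    return current


def should_scale_up_prefill(queue_start, queue_end, threshold=0.5):
    """Skip scale-up if the queue is projected to drain below the threshold."""
    if queue_end <= threshold:
        return False
    trend = queue_end - queue_start  # change observed over the current interval
    projected = queue_end + trend * NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD
    return projected >= threshold
```

Keeping both checks as pure functions makes the over-compensation rules easy to unit test separately from worker launch and teardown.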
## SLA-based Scaling Up/Down Prefill/Decode Workers
See [SLA-Driven Profiling](../benchmarks/sla_driven_profiling.md) for more details.
## Usage
The planner integration with the new frontend + worker architecture is currently a work in progress. This documentation will be updated with the new deployment patterns and code examples once the planner component has been fully adapted to the new workflow.
Configuration options:
* `namespace` (str, default: "dynamo"): Target namespace for planner operations
* `environment` (str, default: "local"): Target environment (local, kubernetes)
* `no-operation` (bool, default: false): Run in observation mode only
* `log-dir` (str, default: None): Tensorboard log directory
* `adjustment-interval` (int, default: 30): Seconds between adjustments
* `metric-pulling-interval` (int, default: 1): Seconds between metric pulls
* `max-gpu-budget` (int, default: 8): Maximum GPUs for all workers
* `min-gpu-budget` (int, default: 1): Minimum GPUs per worker type
* `decode-kv-scale-up-threshold` (float, default: 0.9): KV cache threshold for scale-up
* `decode-kv-scale-down-threshold` (float, default: 0.5): KV cache threshold for scale-down
* `prefill-queue-scale-up-threshold` (float, default: 0.5): Queue threshold for scale-up
* `prefill-queue-scale-down-threshold` (float, default: 0.2): Queue threshold for scale-down
* `decode-engine-num-gpu` (int, default: 1): GPUs per decode engine
* `prefill-engine-num-gpu` (int, default: 1): GPUs per prefill engine
Run as standalone process:
```bash
PYTHONPATH=/workspace/examples/llm python components/planner.py --namespace=dynamo --served-model-name=vllm --no-operation --log-dir=log/planner
```
Monitor metrics with Tensorboard:
```bash
tensorboard --logdir=<path-to-tensorboard-log-dir>
```
..
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Planner
=======
The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.
Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and prefill queue size.
Key features include:
* **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets
* **Graceful scaling** that ensures no requests are dropped during scale-down operations
.. admonition:: 🚀 Quick Start
:class: seealso
**New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for a complete, step-by-step workflow.
**Prerequisites**: SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon or a few minutes using simulator) before deployment. The Quick Start guide includes everything you need.
.. list-table::
:widths: 20 5 75
:header-rows: 1
* -
-
- Feature
* - **Backend**
- ❌
- Local
* -
- ✅
- Kubernetes
* - **LLM Framework**
- ✅
- vLLM
* -
- ✅
- TensorRT-LLM
* -
- ✅
- SGLang
* - **Serving Type**
- ✅
- Aggregated
* -
- ✅
- Disaggregated
* - **Planner Actions**
- ❌
- Load-based scaling up/down prefill/decode workers
* -
- ✅
- SLA-based scaling up/down prefill/decode workers [1]_
* -
- ❌
- Adjusting engine knobs
.. [1] Supported with some limitations.
.. toctree::
:hidden:
Overview <self>
Planner README <README>
Planner Guide <planner_guide>
Planner Examples <planner_examples>
SLA Planner Quick Start <sla_planner_quickstart>
SLA-Driven Profiling <../benchmarks/sla_driven_profiling.md>
SLA-based Planner <sla_planner.md>
# SLA-based Planner
> [!TIP]
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Profiling + Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).
This document describes the SLA-based planner in `examples/common/utils/planner_core.py`.
The SLA (Service Level Agreement)-based planner is an intelligent autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT and ITL targets. Unlike the load-based planner that scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to proactively scale the workers.
> [!NOTE]
> Currently, the SLA-based planner supports only disaggregated setups.
> [!WARNING]
> Bare metal deployment with local connector is deprecated. Please deploy the SLA planner in k8s.
## Architecture Overview
**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s (by default, can be updated in the podmonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: prefill and backend workers handle inference
The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this [file](/components/src/dynamo/planner/defaults.py).
```mermaid
flowchart LR
Frontend --"/metrics"--> Prometheus
Planner --"query API"--> Prometheus
Planner --"scaling decisions"--> Workers
Frontend -.->|"requests"| Workers
```
## Features
* **SLA-driven scaling**: Automatically scales prefill/decode workers to meet TTFT and ITL targets
* **Predictive load forecasting**: Uses ARIMA, Prophet, Kalman, or constant predictors to forecast future load
* **Performance interpolation**: Uses pre-deployment profiling results to make accurate scaling decisions
* **Correction factors**: Adapts to real-world performance deviations from profiled data
## Design
The SLA planner consists of several key components:
1. **Load Predictors**: Forecast future request patterns (number of requests, input/output sequence lengths)
2. **Performance Interpolators**: Estimate TTFT and ITL based on profiled performance data
3. **Correction Factors**: Adjust predictions based on observed vs. expected performance
4. **Scaling Logic**: Calculate optimal number of prefill/decode replicas to meet SLA targets
## SLA-Driven Pre-Deployment Profiling
**Prerequisite**: SLA-based planner requires pre-deployment profiling to be completed before deployment. The profiling process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters that the planner will use during operation.
See [Pre-Deployment Profiling](../benchmarks/sla_driven_profiling.md) for detailed instructions on running the profiling process.
## Load Prediction
The SLA planner uses a load predictor to forecast the number of requests, ISL, and OSL in the next adjustment interval. Currently, four load prediction models are supported:
### Constant Predictor
- **Use case**: Stable load with long prediction intervals
- **Behavior**: Assumes next load equals current load
- **Configuration**: `load-predictor: "constant"`
### ARIMA Predictor
- **Use case**: Time-series data with trends and seasonality
- **Behavior**: Uses auto-ARIMA to fit optimal model parameters
- **Configuration**: `load-predictor: "arima"`
- **Tunable parameters**:
- `--load-predictor-log1p`: model `log1p(y)` instead of `y`. If not set, ARIMA starts in raw space, and if it collapses to `(0,d,0)`, it falls back to `log1p` automatically.
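
To illustrate what modeling `log1p(y)` instead of `y` means, here is a minimal sketch that wraps an arbitrary predictor: the history is transformed with `log1p`, the forecast is made in that space, and the result is mapped back with `expm1`. The function name and the `predict` callback are hypothetical and are not part of the planner's API. A constant predictor (next load equals current load) serves as the example model.

```python
import math


def forecast_with_log1p(history, predict):
    """Forecast in log1p space and map the result back to the raw scale.

    `predict` is a hypothetical callback that takes the transformed history
    and returns a one-step-ahead forecast in the same (log1p) space.
    """
    transformed = [math.log1p(y) for y in history]
    y_hat = predict(transformed)
    # expm1 inverts log1p; clamp at zero because counts cannot be negative.
    return max(0.0, math.expm1(y_hat))


def constant_predictor(values):
    """Assume the next value equals the most recent one."""
    return values[-1]


next_requests = forecast_with_log1p([10, 20, 40], constant_predictor)
```

The transform compresses large spikes in count-like series, which is why it often helps request-rate data.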
### Kalman Predictor
- **Use case**: Low-latency online forecasting (observe 1 → predict 1) with smooth adaptation
- **Behavior**: Local linear trend Kalman filter (fast online updates; good default when ARIMA collapses to mean-only)
- **Configuration**: `load-predictor: "kalman"`
- **Tunable parameters**:
- `--kalman-q-level`: process noise for level (higher = more responsive)
- `--kalman-q-trend`: process noise for trend (higher = trend changes faster)
- `--kalman-r`: measurement noise (lower = trusts new measurements more)
- `--kalman-min-points`: minimum points before forecasting
- `--load-predictor-log1p`: model `log1p(y)` instead of `y` (often helps request-rate/count series)
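
As a rough illustration of how the tunable parameters interact, the following is a minimal pure-Python local linear trend Kalman filter. It is a sketch under stated assumptions, not the planner's implementation: the class name, the default values, and the initial covariance are all hypothetical. The constructor arguments correspond conceptually to `--kalman-q-level`, `--kalman-q-trend`, `--kalman-r`, and `--kalman-min-points`.

```python
class LocalLinearTrendKalman:
    """Two-state (level, trend) Kalman filter for a scalar series."""

    def __init__(self, q_level=1.0, q_trend=0.1, r=10.0, min_points=5):
        self.q_level, self.q_trend, self.r = q_level, q_trend, r
        self.min_points = min_points
        self.n = 0
        self.level = 0.0
        self.trend = 0.0
        # Covariance entries [p00, p01, p10, p11]; start with high uncertainty.
        self.p = [100.0, 0.0, 0.0, 100.0]

    def observe(self, z):
        """Fold one new measurement into the state estimate."""
        self.n += 1
        if self.n == 1:
            self.level = float(z)
            return
        p00, p01, p10, p11 = self.p
        # Predict: level advances by trend; covariance becomes F P F^T + Q.
        level = self.level + self.trend
        a00 = p00 + p10 + p01 + p11 + self.q_level
        a01 = p01 + p11
        a10 = p10 + p11
        a11 = p11 + self.q_trend
        # Update: only the level is measured (H = [1, 0]).
        innovation = z - level
        s = a00 + self.r
        k0, k1 = a00 / s, a10 / s
        self.level = level + k0 * innovation
        self.trend = self.trend + k1 * innovation
        self.p = [(1 - k0) * a00, (1 - k0) * a01,
                  a10 - k1 * a00, a11 - k1 * a01]

    def forecast(self):
        """One-step-ahead forecast, or None until enough points are seen."""
        if self.n < self.min_points:
            return None
        return max(0.0, self.level + self.trend)
```

Each `observe` call is a constant-time update, which is what makes this style of filter suitable for the observe 1, predict 1 pattern. Raising `q_level` or `q_trend` relative to `r` makes the filter follow new measurements more closely.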
### Prophet Predictor
- **Use case**: Complex seasonal patterns and trend changes
- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting
- **Configuration**: `load-predictor: "prophet"`
- **Tunable parameters**:
- `--prophet-window-size`: bounds internal history to control refit cost
- `--load-predictor-log1p`: model `log1p(y)` instead of `y`
### Warm-starting Load Predictors (Optional)
You can warm-start the load predictors with a mooncake-style JSONL trace file to provide historical context before live traffic is observed:
- **CLI argument**: `--load-predictor-warmup-trace <path/to/trace.jsonl>`
- **Effect**: preloads the predictors with historical request-count / ISL / OSL samples extracted from the trace.
## Scaling Algorithm
At each adjustment interval, SLA planner performs the following operations:
### 1. Metric Collection
Every adjustment interval, collect:
- Average Time to First Token (TTFT)
- Average Inter-Token Latency (ITL)
- Request count and duration
- Input/Output sequence lengths
### 2. Correction Factor Calculation
Using the collected metrics, SLA planner applies the interpolator to compute the expected TTFT/ITL and calibrate the interpolation model. This step is important because the actual TTFT/ITL often differ from the profiled ideal:
- **TTFT**: actual TTFT depends heavily on request queueing and prefix cache hit rate (if KV reuse is enabled). For example, if all requests arrive at the beginning of the adjustment interval, they queue heavily and TTFT is significantly higher. If the prefix cache hit rate is very high, the actual number of tokens in the prefill is very low and TTFT is significantly lower.
- **ITL**: actual ITL may be affected by small chunked prefill requests in the decode engine.
- **Metric variances**: large variances in request rate, ISL, and OSL may lead to inaccurate TTFT/ITL estimates, since SLA planner considers only the averages when interpolating.
SLA planner calculates the correction factors as:
- **Prefill correction**: `actual_ttft / expected_ttft`
- **Decode correction**: `actual_itl / expected_itl`
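
As a small worked example of the two ratios above (the helper name and the numbers are illustrative only), a factor above 1 means the system is running slower than the profiled expectation, and a factor below 1 means it is running faster:

```python
def correction_factor(actual, expected):
    """Ratio of observed latency to the interpolated (expected) latency."""
    if expected <= 0:
        raise ValueError("expected latency must be positive")
    return actual / expected


# Illustrative values in milliseconds; not measurements from a real deployment.
prefill_correction = correction_factor(actual=260.0, expected=200.0)
decode_correction = correction_factor(actual=18.0, expected=20.0)
```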
### 3. Load Prediction
SLA planner forecasts these metrics for the next interval using the load predictor:
- Number of requests
- Input sequence length
- Output sequence length
### 4. Calculating Number of Replicas
**Prefill replicas**: SLA planner assumes the prefill correction factor has a linear effect on the prefill throughput per GPU, as prefill is single-batched.
```
predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
```
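
The prefill formula can be written as a small runnable function. This is a sketch that mirrors the pseudocode above. The function name is hypothetical, and it assumes `interpolated_throughput` is expressed in tokens per second per GPU and `interval_s` in seconds.

```python
import math


def prefill_replicas(next_requests, next_isl, interval_s,
                     prefill_correction, interpolated_throughput,
                     gpus_per_engine):
    """Number of prefill replicas needed for the predicted load."""
    # Predicted prefill load in tokens/s; the correction is capped at 1.
    predicted_load = (next_requests * next_isl / interval_s
                      * min(1.0, prefill_correction))
    return math.ceil(predicted_load / interpolated_throughput / gpus_per_engine)


# 600 requests x 3000 tokens over 180 s is 10,000 tokens/s of prefill load.
# At an assumed 4,000 tokens/s per GPU with 1 GPU per engine this gives 3.
replicas = prefill_replicas(600, 3000, 180, 1.2, 4000.0, 1)
```

Because of the `min(1, prefill_correction)` term, a correction factor above 1 does not inflate the predicted load, while a factor below 1 reduces it.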
**Decode replicas**:
```python
# 1. apply d_correction_factor to the ITL SLA
corrected_itl = self.args.itl / self.d_correction_factor
# 2. reversely find out what is best throughput/gpu that can achieve corrected_itl under the predicted context length
pred_decode_thpt_per_gpu = self.decode_interpolator.find_best_throughput_per_gpu(
itl=corrected_itl,
context_length=next_isl + next_osl / 2
)
# 3. compute number of decode replicas needed
next_num_d = math.ceil(next_num_req * next_osl / self.args.adjustment_interval / pred_decode_thpt_per_gpu / self.args.decode_engine_num_gpu)
```
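
For experimentation, the decode calculation can be reproduced outside the planner with a stand-in interpolator. In the sketch below, `TableInterpolator` is a hypothetical substitute for the real decode interpolator: it ignores `context_length` and simply picks the highest throughput among a few made-up profiled points whose ITL does not exceed the target.

```python
import math


class TableInterpolator:
    """Hypothetical stand-in: lookup over (itl_ms, throughput_per_gpu) points."""

    def __init__(self, points):
        self.points = points

    def find_best_throughput_per_gpu(self, itl, context_length):
        # context_length is ignored in this simplified stand-in.
        feasible = [thpt for point_itl, thpt in self.points if point_itl <= itl]
        if not feasible:
            raise ValueError("no profiled point meets the corrected ITL")
        return max(feasible)


def decode_replicas(interp, next_num_req, next_isl, next_osl, interval_s,
                    itl_sla_ms, d_correction, gpus_per_engine):
    """Mirror of the three decode steps: correct ITL, look up, divide load."""
    corrected_itl = itl_sla_ms / d_correction
    thpt = interp.find_best_throughput_per_gpu(
        itl=corrected_itl, context_length=next_isl + next_osl / 2)
    return math.ceil(next_num_req * next_osl / interval_s / thpt / gpus_per_engine)


# Made-up profile: (ITL in ms, tokens/s per GPU).
interp = TableInterpolator([(10, 500.0), (20, 1200.0), (40, 2000.0)])
replicas = decode_replicas(interp, 1800, 3000, 150, 180, 20.0, 1.25, 1)
```

With a decode correction of 1.25, the 20 ms target tightens to 16 ms, so only the 10 ms profile point qualifies and more replicas are required than with a correction of 1.0.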
### 5. Scaling
Finally, SLA planner applies the change by scaling the number of prefill and decode workers up or down to the calculated number of replicas for the next interval.
> [!NOTE]
> SLA planner scales the P/D engines up and down without blocking. If `adjustment-interval` is too short, previous scaling operations may not finish before new ones are issued. Make sure to set a sufficiently large `adjustment-interval`.
## Deploying
For complete deployment instructions, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).
> [!NOTE]
> The SLA planner requires a frontend that reports the number of requests, ISL, OSL, TTFT, and ITL in the correct format at the `/metrics` HTTP endpoint. The Dynamo frontend provides these metrics automatically.
### Virtual Deployment
The SLA planner supports a virtual deployment mode for custom environments (e.g., a custom cluster) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing the deployment infrastructure.
The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of directly scaling Kubernetes resources, it writes scaling decisions and waits for the deployment environment to acknowledge completion.
#### Scaling Decision Flow
1. **Decision Generation**: The planner calculates optimal worker counts
2. **Change Detection**: The planner skips scaling if the target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
3. **Readiness Check**: Before making new decisions, the planner verifies that previous scaling operations have completed by checking if `scaled_decision_id >= decision_id`
4. **Timeout Handling**: If a scaling decision isn't acknowledged within 30 minutes (1800 seconds), the planner proceeds with new decisions anyway
5. **Completion Tracking**: The planner can optionally wait for scaling completion confirmation (blocking mode)
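
The change-detection, readiness, and timeout steps can be summarized in a short sketch. The function names and argument layout are hypothetical. Only the `scaled_decision_id >= decision_id` comparison and the 1800-second timeout come from the description above.

```python
SCALING_TIMEOUT_S = 1800  # 30 minutes, as described above


def needs_scaling(current, target):
    """Change detection: skip when target counts equal current counts."""
    return current != target


def ready_for_new_decision(decision_id, scaled_decision_id, seconds_since_decision):
    """Readiness check with timeout fallback."""
    if scaled_decision_id >= decision_id:
        return True  # previous decision acknowledged by the environment
    # Not acknowledged yet: proceed anyway once the timeout has elapsed.
    return seconds_since_decision >= SCALING_TIMEOUT_S


current = {"prefill": 2, "decode": 4}
target = {"prefill": 2, "decode": 5}
proceed = needs_scaling(current, target) and ready_for_new_decision(3, 3, 12.0)
```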
#### Configuration
To use virtual deployment mode:
```yaml
environment: "virtual"
backend: "vllm" # or "sglang"
```
#### Deployment Environment Requirements
The external deployment environment must use `VirtualConnectorClient`:
```python
from dynamo._core import DistributedRuntime, VirtualConnectorClient
client = VirtualConnectorClient(distributed_runtime, namespace)
```
1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()`. This blocks until there is a change.
2. **Parse Decisions**: Read `num_prefill_workers` and `num_decode_workers` values: `decision = await client.get()`
3. **Execute Scaling**: Apply the scaling decisions to the actual deployment infrastructure
4. **Acknowledge Completion**: Mark the decision completed when scaling is finished: `await client.complete(decision)`
A scaling decision (returned by `client.get()`) contains the following fields, which are -1 if not set yet:
- `num_prefill_workers`: Integer specifying the target number of prefill workers
- `num_decode_workers`: Integer specifying the target number of decode workers
- `decision_id`: Integer with incremental ID for each scaling decision
See `components/planner/test/test_virtual_connector.py` for a full example.
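
The four steps above amount to a simple loop. The sketch below shows its shape using an in-memory fake so that it runs without a Dynamo runtime. `FakeVirtualConnectorClient`, `Decision`, `handle_decisions`, and the `apply_scaling` callback are hypothetical, and attribute-style access to the decision fields is an assumption. Only the `wait()`, `get()`, and `complete()` calls and the three field names come from this document.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical stand-in for the object returned by client.get()."""
    num_prefill_workers: int
    num_decode_workers: int
    decision_id: int


class FakeVirtualConnectorClient:
    """In-memory fake that mimics wait() / get() / complete() for this sketch."""

    def __init__(self, decisions):
        self._decisions = list(decisions)
        self.completed = []

    async def wait(self):
        await asyncio.sleep(0)  # the real client blocks until there is a change

    async def get(self):
        return self._decisions.pop(0)

    async def complete(self, decision):
        self.completed.append(decision.decision_id)


async def handle_decisions(client, apply_scaling, max_decisions):
    """Monitor, parse, execute, and acknowledge scaling decisions."""
    for _ in range(max_decisions):
        await client.wait()
        decision = await client.get()
        if decision.num_prefill_workers < 0 or decision.num_decode_workers < 0:
            continue  # fields are -1 until the planner has set them
        await apply_scaling(decision.num_prefill_workers,
                            decision.num_decode_workers)
        await client.complete(decision)
```

In a real environment, `apply_scaling` would call the infrastructure's own scaling API, and the loop would run indefinitely rather than for a fixed number of decisions.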
# SLA-Driven Profiling and Planner Deployment Quick Start Guide
This guide walks through the complete workflow for deploying SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). It shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs).
> [!IMPORTANT]
> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](/docs/kubernetes/installation_guide.md).
## Overview
The DGDR workflow automates the entire process from SLA specification to deployment:
1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information in a DGDR Custom Resource
2. **Automatic Profiling**: The Dynamo Operator automatically profiles your model to find optimal configurations
3. **Auto-Deploy**: The system automatically deploys the optimal configuration that meets your SLAs
```mermaid
flowchart TD
A[Create DGDR] --> B[DGDR Controller]
B --> C{Profiling Method}
C -->|Online| D[Run Profiling Job<br/>2-4 hours]
C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
D --> F[Generate DGD Config]
E --> F
F --> G[Auto-Deploy DGD]
G --> H[Monitor & Scale]
style A fill:#e1f5fe
style D fill:#fff3e0
style E fill:#e8f5e8
style G fill:#f3e5f5
style H fill:#fff8e1
```
## What is a DynamoGraphDeploymentRequest (DGDR)?
A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. Think of it as a "deployment order" where you specify:
- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
**Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
## Prerequisites
Before creating a DGDR, ensure:
- **Dynamo platform installed** with the operator running (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running** (required for SLA planner)
- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
- **Sufficient GPU resources** available in your cluster for profiling
- **Runtime images available** that contain both profiler and runtime components
### Container Images
Each DGDR requires you to specify container images for the profiling and deployment process:
**profilingConfig.profilerImage** (Required):
Specifies the container image used for the profiling job itself. This image must contain the profiler code and dependencies needed for SLA-based profiling.
**deploymentOverrides.workersImage** (Optional):
Specifies the container image used for DynamoGraphDeployment worker components (frontend, workers, planner). This image is used for:
- Temporary DGDs created during online profiling (for performance measurements)
- The final DGD deployed after profiling completes
If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. You may use our public images (0.6.1 and later) or build and push your own.
```yaml
spec:
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Optional
```
## Quick Start: Deploy with DGDR
### Step 1: Create Your DGDR
Dynamo provides sample DGDR configurations in `benchmarks/profiler/deploy/`. You can use these as starting points:
**Available Sample DGDRs:**
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
Alternatively, create your own DGDR to suit your needs.
> **Important - Profiling Config Cases**: Prior to 0.8.1, fields under `profilingConfig.config` use snake_case. Starting with 0.8.1, they use camelCase for uniformity. snake_case is still accepted for backward compatibility, but because all example DGDRs use camelCase, anyone on a release prior to 0.8.1 must manually convert the config fields in the examples to snake_case.
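
For releases prior to 0.8.1, the key conversion can be done mechanically. The following is a minimal sketch with hypothetical helper names. It assumes each snake_case field name is the direct mechanical translation of its camelCase counterpart, which should be verified against the configuration reference for your release.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase key such as 'useAiConfigurator' to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_case_keys(obj):
    """Recursively convert all dictionary keys; values are left untouched."""
    if isinstance(obj, dict):
        return {camel_to_snake(k): snake_case_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [snake_case_keys(v) for v in obj]
    return obj


config = {"sweep": {"useAiConfigurator": True, "aicSystem": "h200_sxm"}}
converted = snake_case_keys(config)
```

Apply the conversion only to the contents of `profilingConfig.config`. Other DGDR fields such as `profilerImage` or `autoApply` are not covered by the note above.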
> [!TIP]
> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
### Step 2: Apply the DGDR
The rest of this quickstart will use the DGDR sample that uses AIC profiling. If you use a different DGDR file and/or name, be sure to adjust the commands accordingly.
```bash
export NAMESPACE=your-namespace
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
The Dynamo Operator will immediately begin processing your request.
### Step 3: Monitor Progress
Watch the DGDR status:
```bash
# View status
kubectl get dgdr -n $NAMESPACE
# Detailed status
kubectl describe dgdr sla-aic -n $NAMESPACE
# Watch profiling job logs
kubectl logs -f job/profile-sla-aic -n $NAMESPACE
```
**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)
> [!NOTE]
> With AI Configurator, profiling completes in **20-30 seconds**, compared with 2-4 hours for online profiling.
### Step 4: Access Your Deployment
Once the DGDR reaches `Ready` state, your model is deployed and ready to serve:
```bash
# Find the frontend service
kubectl get svc -n $NAMESPACE | grep trtllm-disagg
# Port-forward to access locally
kubectl port-forward svc/trtllm-disagg-frontend 8000:8000 -n $NAMESPACE
# Test the endpoint
curl http://localhost:8000/v1/models
```
### Step 5 (Optional): Access the Planner Grafana Dashboard
If you want to monitor the SLA Planner's decision-making in real-time, you can deploy the Planner Grafana dashboard.
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```
Follow the instructions in [Dynamo Metrics Collection on Kubernetes](../kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.
The dashboard displays:
- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
- **Predicted Metrics**: Planner's load predictions and recommended replica counts
- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
> [!TIP]
> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your specific deployment namespace.
## DGDR Configuration Details
### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `spec.model` | string | Model identifier (e.g., "meta-llama/Llama-3-70b") |
| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
| `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
### Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `spec.deploymentOverrides.workersImage` | string | Container image for DGD worker components. If omitted, uses image from base config file. |
| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
| `spec.deploymentOverrides` | object | Customize metadata (name, namespace, labels, annotations) and image for auto-created DGD |
### SLA Configuration
The `sla` section defines performance requirements and workload characteristics:
```yaml
sla:
isl: 3000 # Average input sequence length (tokens)
osl: 150 # Average output sequence length (tokens)
ttft: 200 # Target Time To First Token (milliseconds, float)
itl: 20 # Target Inter-Token Latency (milliseconds, float)
```
**Choosing SLA Values:**
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed)
- **ITL**: Token generation latency target (lower = more GPUs needed)
- **Trade-offs**: Tighter SLAs require more GPU resources
### Profiling Methods
Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds):
```yaml
# Online Profiling (Default)
sweep:
useAiConfigurator: false
# Offline Profiling (AI Configurator)
sweep:
useAiConfigurator: true
aicSystem: h200_sxm
aicHfId: Qwen/Qwen3-32B
aicBackendVersion: "0.20.0"
```
> [!NOTE]
> For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-methods).
### Hardware Configuration
For details on hardware configuration and GPU discovery options, see [Hardware Configuration in SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md#hardware-configuration).
### Advanced Configuration
#### Using Existing DGD Configs (Recommended for Custom Setups)
If you have an existing DynamoGraphDeployment config (e.g., from `examples/backends/*/deploy/disagg.yaml` or custom recipes), you can reference it via ConfigMap:
**Step 1: Create ConfigMap from your DGD config file:**
```bash
kubectl create configmap deepseek-r1-config \
--from-file=disagg.yaml=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```
**Step 2: Reference the ConfigMap in your DGDR:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: deepseek-r1
spec:
model: deepseek-ai/DeepSeek-R1
backend: sglang
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
configMapRef:
name: deepseek-r1-config
key: disagg.yaml # Must match the key used in --from-file
config:
sla:
isl: 4000
osl: 500
ttft: 300
itl: 10
sweep:
useAiConfigurator: true
aicSystem: h200_sxm
aicHfId: deepseek-ai/DeepSeek-V3
aicBackendVersion: "0.20.0"
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
autoApply: true
```
> **What's happening**: The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` into `deployment.model` and `spec.backend` into `engine.backend` in the final configuration.
#### Inline Configuration (Simple Use Cases)
For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler will auto-generate a basic DGD configuration from your `model` and `backend`:
```yaml
profilingConfig:
config:
# SLA targets (required for profiling)
sla:
isl: 8000 # Input sequence length
osl: 200 # Output sequence length
ttft: 200.0 # Time To First Token (ms)
itl: 10.0 # Inter-Token Latency (ms)
# Hardware constraints (optional)
hardware:
minNumGpusPerEngine: 2
maxNumGpusPerEngine: 8
gpuType: h200_sxm
# Profiling sweep settings (optional)
sweep:
prefillInterpolationGranularity: 16 # Number of samples for prefill ISL sweep
decodeInterpolationGranularity: 6 # Number of samples for decode sweep
```
> **Note**: `engine.config` is a **file path** to a DGD YAML file, not inline configuration. Use ConfigMapRef (recommended) or leave it unset to auto-generate.
#### Planner Configuration Passthrough
Add planner-specific settings:
```yaml
profilingConfig:
config:
planner:
plannerMinEndpoint: 2
```
## Understanding Profiling Results
For details about the profiling process, performance plots, and interpolation data, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-process-details).
## Advanced Topics
### Mocker Deployment
Instead of a real DGD that uses GPU resources, you can deploy a mocker deployment that uses simulated engines rather than GPUs. Mocker is available in all backend images and uses profiling data to simulate realistic GPU timing behavior. It is useful for:
- Large-scale experiments without GPU resources
- Testing Planner behavior and infrastructure
- Validating deployment configurations
To deploy mocker instead of the real backend, set `useMocker: true`:
```yaml
spec:
model: <model-name>
backend: trtllm # Real backend for profiling (vllm, sglang, or trtllm)
useMocker: true # Deploy mocker instead of real backend
profilingConfig:
profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
...
autoApply: true
```
Profiling still runs against the real backend (via GPUs or AIC) to collect performance data. The mocker deployment then uses this data to simulate realistic timing behavior.
### Using a Model Cache PVC (0.8.1 or later)
Starting in Dynamo 0.8.1, for large models, you can use a pre-populated PVC containing model weights instead of downloading from HuggingFace. See [Model Cache PVC](/docs/benchmarks/sla_driven_profiling.md#model-cache-pvc-advanced) for configuration details.
### DGDR Immutability
DGDRs are **immutable** - if you need to update SLAs or configuration:
1. Delete the existing DGDR: `kubectl delete dgdr sla-aic -n $NAMESPACE`
2. Create a new DGDR with updated specifications
### Manual Deployment Control
There are two ways to manually control deployment after profiling:
#### Option 1: Use DGDR-Generated Configuration (Recommended)
Disable auto-deployment to review the generated DGD before applying:
```yaml
spec:
  autoApply: false
```
Then manually extract and apply the generated DGD:
```bash
# Extract generated DGD from DGDR status
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -f -
# Or save to file first for review/modification
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
vi my-dgd.yaml
kubectl apply -f my-dgd.yaml -n $NAMESPACE
```
The generated DGD includes optimized configurations and the SLA planner component. The required `planner-profile-data` ConfigMap is automatically created when profiling completes, so the DGD will deploy successfully.
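If you script the manual apply step, it may help to confirm that the `planner-profile-data` ConfigMap exists first. The helper below is a sketch under assumptions: the function name `apply_generated_dgd` is hypothetical, `NAMESPACE` must be exported, and the file is the one saved in the previous step.

```bash
# Sketch: apply a saved DGD only if the planner profiling ConfigMap is present.
# Assumption: NAMESPACE is exported; the helper name is illustrative only.
apply_generated_dgd() {
  dgd_file="$1"  # e.g. my-dgd.yaml saved from the DGDR status
  if ! kubectl get configmap planner-profile-data -n "$NAMESPACE" >/dev/null 2>&1; then
    echo "planner-profile-data ConfigMap not found in $NAMESPACE; has profiling completed?" >&2
    return 1
  fi
  kubectl apply -f "$dgd_file" -n "$NAMESPACE"
}

# Usage:
# apply_generated_dgd my-dgd.yaml
```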
#### Option 2: Use Standalone Planner Templates (Advanced)
For advanced use cases, you can manually deploy using the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:
```bash
# After profiling completes, profiling data is automatically stored in ConfigMaps
# OPTIONAL: Inspect profiling results stored in ConfigMaps
# View the generated DGD configuration
kubectl get configmap dgdr-output-<dgdr-name> -n $NAMESPACE -o yaml
# View the planner profiling data (JSON format)
kubectl get configmap planner-profile-data -n $NAMESPACE -o yaml
# Update the PROMETHEUS_ENDPOINT environment variable in the planner template
# to match your cluster's Prometheus service location (see comments in the template)
# Update backend planner manifest as needed, then deploy
kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
```
> **Note**: The standalone templates are provided as examples and may need customization for your model and requirements. The DGDR-generated configuration (Option 1) is recommended as it's automatically tuned to your profiling results and SLA targets.
>
> **Important - Prometheus Configuration**: The planner queries Prometheus to get frontend request metrics for scaling decisions. If you see errors like "Failed to resolve prometheus service", ensure the `PROMETHEUS_ENDPOINT` environment variable in your planner configuration correctly points to your Prometheus service. See the comments in the example templates for details.
### Relationship to DynamoGraphDeployment (DGD)
- **DGDR**: High-level "intent" - what you want deployed
- **DGD**: Low-level "implementation" - how it's deployed
The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs
The generated DGD is tracked via labels:
```yaml
metadata:
  labels:
    dgdr.nvidia.com/name: sla-aic
    dgdr.nvidia.com/namespace: your-namespace
```
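These labels can be used with a standard kubectl label selector to find the DGD generated by a given DGDR. A minimal sketch: the helper name `find_generated_dgd` is hypothetical, and the resource name `dynamographdeployment` is assumed from the DynamoGraphDeployment kind, so adjust it if your cluster registers the CRD differently.

```bash
# Sketch: list DGDs generated by a given DGDR using the tracking label.
# Assumptions: NAMESPACE is exported; "dynamographdeployment" is the resource
# name for the DynamoGraphDeployment kind in your cluster.
find_generated_dgd() {
  dgdr_name="$1"  # e.g. sla-aic
  kubectl get dynamographdeployment -n "$NAMESPACE" \
    -l "dgdr.nvidia.com/name=${dgdr_name}"
}

# Usage:
# find_generated_dgd sla-aic
```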
### Accessing Detailed Profiling Artifacts
By default, profiling jobs save essential data to ConfigMaps for planner integration. Advanced users who need detailed artifacts (logs, performance plots, AIPerf results, etc.) can configure the DGDR to use `dynamo-pvc`. This is optional and does not affect how the profiler or Planner functions.
**What's available in ConfigMaps (always created):**
- Generated DGD configuration
- Profiling data for Planner (`.json` files)
**What's available in PVC if attached to DGDR (optional):**
- Performance plots (PNGs)
- DGD configuration and logs of all services for each profiled deployment
- AIPerf profiling artifacts for each AIPerf run
- Raw profiling data (`.npz` files)
- Profiler log
**Setup:**
1. Set up the benchmarking PVC:
```bash
export NAMESPACE=your-namespace
deploy/utils/setup_benchmarking_resources.sh
```
2. Add `outputPVC` to your DGDR's `profilingConfig`:
```yaml
spec:
  profilingConfig:
    outputPVC: "dynamo-pvc"
    config:
      # ... rest of config
```
3. After profiling completes, access results:
```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
## Troubleshooting
### Quick Diagnostics
```bash
# Check DGDR status and events
kubectl describe dgdr sla-aic -n $NAMESPACE
# Check operator logs
kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dynamo-operator --tail=100
# Check profiling job logs
kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
```
### Common Issues
| Issue | Quick Fix |
|-------|-----------|
| **DGDR stuck in Pending** | Check GPU availability: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'` |
| **Image pull errors** | Verify secret exists: `kubectl get secret nvcr-imagepullsecret -n $NAMESPACE` |
| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
> [!NOTE]
> For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/benchmarks/sla_driven_profiling.md#troubleshooting).
## Configuration Reference
For comprehensive documentation of all DGDR configuration options, see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
This includes detailed explanations of:
- **SLA Configuration**: ISL, OSL, TTFT, ITL with use cases and trade-offs
- **Hardware Configuration**: GPU constraints and search space control
- **Sweep Configuration**: Profiling behavior and interpolation settings
- **AI Configurator Configuration**: System types, model mappings, backend versions
- **Planner Configuration**: Autoscaling and adjustment parameters
- **Complete Examples**: Full DGDRs for online, offline (AIC), and MoE profiling
## Related Documentation
- [DGDR API Reference](/docs/kubernetes/api_reference.md)
- [Pre-Deployment Profiling Details](/docs/benchmarks/sla_driven_profiling.md)
- [SLA Planner Architecture](/docs/planner/sla_planner.md)
- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)
...@@ -119,19 +119,19 @@ TensorRT-LLM delivers maximum inference performance and optimization, with full ...@@ -119,19 +119,19 @@ TensorRT-LLM delivers maximum inference performance and optimization, with full
<!-- Design Docs --> <!-- Design Docs -->
[disagg]: docs/design_docs/disagg_serving.md [disagg]: docs/design_docs/disagg_serving.md
[kv-routing]: docs/router/README.md [kv-routing]: docs/components/router/router_guide.md
[planner]: docs/planner/planner_intro.rst [planner]: docs/components/planner/README.md
[kvbm]: docs/kvbm/kvbm_intro.rst [kvbm]: docs/components/kvbm/README.md
[migration]: docs/fault_tolerance/request_migration.md [migration]: docs/fault_tolerance/request_migration.md
[tools]: docs/agents/tool-calling.md [tools]: docs/agents/tool-calling.md
<!-- Multimodal --> <!-- Multimodal -->
[mm]: docs/multimodal/index.md [mm]: docs/features/multimodal/README.md
[mm-vllm]: docs/multimodal/vllm.md [mm-vllm]: docs/features/multimodal/multimodal_vllm.md
[mm-trtllm]: docs/multimodal/trtllm.md [mm-trtllm]: docs/features/multimodal/multimodal_trtllm.md
[mm-sglang]: docs/multimodal/sglang.md [mm-sglang]: docs/features/multimodal/multimodal_sglang.md
<!-- Feature-specific --> <!-- Feature-specific -->
[lora]: docs/kubernetes/deployment/dynamomodel-guide.md [lora]: docs/kubernetes/deployment/dynamomodel-guide.md
[vllm-spec]: docs/backends/vllm/speculative_decoding.md [vllm-spec]: docs/features/speculative_decoding/speculative_decoding_vllm.md
[trtllm-eagle]: docs/backends/trtllm/llama4_plus_eagle.md [trtllm-eagle]: docs/backends/trtllm/llama4_plus_eagle.md
...@@ -53,7 +53,7 @@ Advanced disaggregated deployment with SLA-based automatic scaling. ...@@ -53,7 +53,7 @@ Advanced disaggregated deployment with SLA-based automatic scaling.
- `TRTLLMPrefillWorker`: Specialized prefill-only worker - `TRTLLMPrefillWorker`: Specialized prefill-only worker
> [!NOTE] > [!NOTE]
> This deployment requires pre-deployment profiling to be completed first. See [Pre-Deployment Profiling](../../../../docs/benchmarks/sla_driven_profiling.md) for detailed instructions. > This deployment requires pre-deployment profiling to be completed first. See [Pre-Deployment Profiling](../../../../docs/components/profiler/profiler_guide.md) for detailed instructions.
## CRD Structure ## CRD Structure
...@@ -266,7 +266,7 @@ Configure the `model` name and `host` based on your deployment. ...@@ -266,7 +266,7 @@ Configure the `model` name and `host` based on your deployment.
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md) - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
- **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md) - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md) - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/README.md) - **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md)
- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md) - **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md) - **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
......
...@@ -109,7 +109,7 @@ We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/ ...@@ -109,7 +109,7 @@ We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/
### Pre-Deployment Profiling (SLA Planner Only) ### Pre-Deployment Profiling (SLA Planner Only)
If using the SLA Planner deployment (`disagg_planner.yaml`), follow the [pre-deployment profiling guide](../../../../docs/benchmarks/sla_driven_profiling.md) to run pre-deployment profiling. If using the SLA Planner deployment (`disagg_planner.yaml`), follow the [pre-deployment profiling guide](../../../../docs/components/profiler/profiler_guide.md) to run pre-deployment profiling.
## Usage ## Usage
...@@ -247,9 +247,9 @@ args: ...@@ -247,9 +247,9 @@ args:
- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/deployment/create_deployment.md) - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/deployment/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md) - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
- **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md) - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md)
- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/planner/sla_planner_quickstart.md) - **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/components/planner/planner_guide.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md) - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/README.md) - **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md)
## Troubleshooting ## Troubleshooting
......
...@@ -5,7 +5,7 @@ This example demonstrates running Dynamo across multiple nodes with **KV-aware r ...@@ -5,7 +5,7 @@ This example demonstrates running Dynamo across multiple nodes with **KV-aware r
For more information about the core concepts, see: For more information about the core concepts, see:
- [Dynamo Disaggregated Serving](../../../docs/design_docs/disagg_serving.md) - [Dynamo Disaggregated Serving](../../../docs/design_docs/disagg_serving.md)
- [KV Cache Routing](../../../docs/router/README.md) - [KV Cache Routing](../../../docs/components/router/README.md)
## Architecture Overview ## Architecture Overview
...@@ -65,7 +65,7 @@ This is particularly beneficial for: ...@@ -65,7 +65,7 @@ This is particularly beneficial for:
- **Similar queries**: Common prefixes are computed once and reused - **Similar queries**: Common prefixes are computed once and reused
- **Batch processing**: Related requests can be routed to workers with shared context - **Batch processing**: Related requests can be routed to workers with shared context
For detailed technical information about how KV routing works, see the [Router Guide](../../../docs/router/router_guide.md). For detailed technical information about how KV routing works, see the [Router Guide](../../../docs/components/router/router_guide.md).
## Prerequisites ## Prerequisites
...@@ -475,7 +475,7 @@ python -m dynamo.frontend \ ...@@ -475,7 +475,7 @@ python -m dynamo.frontend \
--router-temperature 0.0 # Temperature for probabilistic routing (0 = deterministic) --router-temperature 0.0 # Temperature for probabilistic routing (0 = deterministic)
``` ```
For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [Router Guide](../../../docs/router/router_guide.md). For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [Router Guide](../../../docs/components/router/router_guide.md).
## Cleanup ## Cleanup
......
...@@ -114,7 +114,7 @@ DYN_KVBM_CPU_CACHE_GB=100 vllm serve \ ...@@ -114,7 +114,7 @@ DYN_KVBM_CPU_CACHE_GB=100 vllm serve \
Qwen/Qwen3-8B Qwen/Qwen3-8B
``` ```
For more detailed integration with dynamo, disaggregated serving support and benchmarking, please check [vllm-setup](../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-vllm) For more detailed integration with dynamo, disaggregated serving support and benchmarking, please check [vllm-setup](../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-vllm)
### TensorRT-LLM ### TensorRT-LLM
...@@ -136,11 +136,11 @@ DYN_KVBM_CPU_CACHE_GB=100 trtllm-serve Qwen/Qwen3-8B \ ...@@ -136,11 +136,11 @@ DYN_KVBM_CPU_CACHE_GB=100 trtllm-serve Qwen/Qwen3-8B \
--extra_llm_api_options /tmp/kvbm_llm_api_config.yaml --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
``` ```
For more detailed integration with dynamo and benchmarking, please check [trtllm-setup](../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) For more detailed integration with dynamo and benchmarking, please check [trtllm-setup](../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm)
## 📚 Docs ## 📚 Docs
- [Architecture](../../../docs/kvbm/README.md#architecture) - [Architecture](../../../docs/components/kvbm/README.md#architecture)
- [Design Deepdive](../../../docs/kvbm/kvbm_design.md) - [Design Deepdive](../../../docs/design_docs/kvbm_design.md)
- [NIXL Overview](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md) - [NIXL Overview](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)