Unverified Commit b19de4ed authored by dagil-nvidia's avatar dagil-nvidia Committed by GitHub
Browse files

docs: cleanup of docs refactor for components, integrations, and features (#6019)


Signed-off-by: default avatarDan Gil <dagil@nvidia.com>
Co-authored-by: default avatarCursor <cursoragent@cursor.com>
parent 80e7bafd
......@@ -58,8 +58,8 @@ Quickstart
:hidden:
:caption: User Guides
KV Cache Offloading <kvbm/kvbm_guide.md>
KV Aware Routing <router/router_guide.md>
KV Cache Offloading <components/kvbm/kvbm_guide.md>
KV Aware Routing <components/router/router_guide.md>
Tool Calling <agents/tool-calling.md>
Multimodality Support <features/multimodal/README.md>
LoRA Adapters <features/lora/README.md>
......@@ -76,11 +76,11 @@ Quickstart
:caption: Components
Backends <_sections/backends>
Frontends <_sections/frontends>
Router <router/README>
Planner <planner/planner_intro>
Frontend <components/frontend/README>
Router <components/router/README>
Planner <components/planner/README>
Profiler <components/profiler/README>
KVBM <kvbm/kvbm_intro>
KVBM <components/kvbm/README>
.. toctree::
:hidden:
......
......@@ -285,6 +285,6 @@ Each event in the payload is a dictionary with `type` field (`BlockStored`, `Blo
## See Also
- **[Router README](../router/README.md)**: Quick start guide for the KV Router
- **[Router Guide](../router/router_guide.md)**: Configuration, tuning, and production setup
- **[Router README](../components/router/README.md)**: Quick start guide for the KV Router
- **[Router Guide](../components/router/router_guide.md)**: Configuration, tuning, and production setup
- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes
......@@ -117,7 +117,7 @@ kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```
For SLA-based autoscaling, see [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).
For SLA-based autoscaling, see [SLA Planner Guide](/docs/components/planner/planner_guide.md).
## Understanding Dynamo's Custom Resources
......
......@@ -163,14 +163,14 @@ Planner is deployed as a service component within your DGD. It:
**Deployment:**
The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../planner/sla_planner_quickstart.md) for complete instructions.
The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../components/planner/planner_guide.md) for complete instructions.
Example configurations with Planner:
- `examples/backends/vllm/deploy/disagg_planner.yaml`
- `examples/backends/sglang/deploy/disagg_planner.yaml`
- `examples/backends/trtllm/deploy/disagg_planner.yaml`
For more details, see the [SLA Planner documentation](../planner/sla_planner.md).
For more details, see the [SLA Planner documentation](../components/planner/planner_guide.md).
## Autoscaling with Kubernetes HPA
......@@ -725,7 +725,7 @@ If you see unstable scaling:
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [KEDA Documentation](https://keda.sh/)
- [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter)
- [Planner Documentation](../planner/sla_planner.md)
- [Planner Documentation](../components/planner/planner_guide.md)
- [Dynamo Metrics Reference](../observability/metrics.md)
- [Prometheus and Grafana Setup](../observability/prometheus-grafana.md)
......@@ -292,7 +292,7 @@ kubectl get pods -n ${NAMESPACE}
3. **Optional:**
- [Set up Prometheus & Grafana](./observability/metrics.md)
- [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
- [SLA Planner Guide](../components/planner/planner_guide.md) (for SLA-aware scheduling and autoscaling)
## Troubleshooting
......
..
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
KV Block Manager
================
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM, SGLang, and TRT-LLM.
It offers:
* A **unified memory API** that spans GPU memory (future), pinned host memory, remote RDMA-accessible memory, local or distributed pool of SSDs and remote file/object/cloud storage systems.
* Support for evolving **block lifecycles** (allocate → register → match) with event-based state transitions that storage can subscribe to.
* Integration with **NIXL**, a dynamic memory exchange layer used for remote registration, sharing, and access of memory blocks over RDMA/NVLink.
The Dynamo KV Block Manager serves as a reference implementation that emphasizes modularity and extensibility. Its pluggable design enables developers to customize components and optimize for specific performance, memory, and deployment needs.
.. list-table::
:widths: 20 5 75
:header-rows: 1
* -
-
- Feature
* - **Backend**
- ✅
- Local
* -
- ✅
- Kubernetes
* - **LLM Framework**
- ✅
- vLLM
* -
- ✅
- TensorRT-LLM
* -
- ❌
- SGLang
* - **Serving Type**
- ✅
- Aggregated
* -
- ✅
- Disaggregated
.. toctree::
:hidden:
Overview <self>
Quick Start <README.md>
User Guide <kvbm_guide.md>
Design <kvbm_design.md>
LMCache Integration <../integrations/lmcache_integration.md>
FlexKV Integration <../integrations/flexkv_integration.md>
SGLang HiCache <../integrations/sglang_hicache.md>
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
> [!NOTE]
> **This content has moved.** The canonical location for this documentation is now
> [docs/features/multimodal/](../features/multimodal/README.md).
> This file will be removed in a future release.
# Multimodal Inference in Dynamo
Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
> [!IMPORTANT]
> **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
> See the relevant documentation for each backend for the necessary flags.
>
> This prevents unintended processing of multimodal data from untrusted sources.
## Backend Documentation
```{toctree}
:maxdepth: 1
vLLM Multimodal <vllm.md>
TensorRT-LLM Multimodal <trtllm.md>
SGLang Multimodal <sglang.md>
```
## Support Matrix
### Backend Capabilities
| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
|-------|------|-------|------|-----|-------|-------|-------|
| **[vLLM](vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
| **[TRT-LLM](trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
| **[SGLang](sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668))
**Pattern Key:**
- **EPD** - All-in-one worker (Simple Aggregated)
- **E/PD** - Separate encode, combined prefill+decode
- **E/P/D** - All stages separate
- **EP/D** - Combined encode+prefill, separate decode
**Status:** ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported
### Input Format Support
| Format | vLLM | TRT-LLM | SGLang |
|--------|------|---------|--------|
| HTTP/HTTPS URL | ✅ | ✅ | ✅ |
| Data URL (Base64) | ✅ | ❌ | ❌ |
| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |
## Architecture Patterns
Dynamo supports several deployment patterns for multimodal inference based on two dimensions:
1. **Encoding**: Is media encoding handled inline (within prefill) or by a separate **Encode Worker**?
- *Inline*: Simpler setup, encoding happens in the prefill worker
- *Separate (EPD)*: Dedicated encode worker transfers embeddings via **NIXL (RDMA)**, enabling independent scaling
2. **Prefill/Decode**: Are prefill and decode in the same worker or separate?
- *Aggregated*: Single worker handles both prefill and decode
- *Disaggregated*: Separate workers for prefill and decode, with KV cache transfer between them
These combine into four deployment patterns:
### EPD - Simple Aggregated
All processing happens within a single worker - the simplest setup.
```text
HTTP Frontend (Rust)
Worker (Python)
↓ image load + encode + prefill + decode
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing |
| Worker | Complete inference pipeline (encode + prefill + decode) |
**When to use:** Quick setup, smaller models, development/testing.
### E/PD - Encode Separate
Encoding happens in a separate worker; prefill and decode share the same engine.
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
PD Worker (Python)
↓ receives embeddings via NIXL, prefill + decode
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| PD Worker | Prefill + Decode with embeddings |
**When to use:** Offload vision encoding to separate GPU, scale encode workers independently.
### E/P/D - Full Disaggregation
Full disaggregation with separate workers for encoding, prefill, and decode.
There are two variants of this workflow:
- Prefill-first, used by vLLM
- Decode-first, used by SGlang
Prefill-first:
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
OR
Decode-first:
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode Worker (Python)
↓ downloads media, generates embeddings, NIXL transfer
Decode Worker (Python)
↓ Bootstraps prefill worker
Prefill Worker (Python)
↓ receives embeddings via NIXL, prefill only, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs |
| Encode Worker | Media encoding, embeddings generation |
| Prefill Worker | Prefill only, transfers KV cache |
| Decode Worker | Decode only, token generation |
**When to use:** Maximum optimization, multi-node deployment, independent scaling of each phase.
### EP/D - Traditional Disaggregated
Encoding is combined with prefill, with decode separate.
```text
HTTP Frontend (Rust)
Processor (Python)
↓ tokenizes, extracts media URL
Encode+Prefill Worker (Python)
↓ downloads media, encodes inline, prefill, KV cache transfer
Decode Worker (Python)
↓ decode only, token generation
Response
```
| Component | Purpose |
|-----------|---------|
| Frontend (Rust) | HTTP entry point |
| Processor (Python) | Tokenization, extracts media URLs (vLLM only) |
| Encode+Prefill Worker | Combined encoding and prefill |
| Decode Worker | Decode only, token generation |
> **Note:** TRT-LLM's EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker.
> For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored.
**When to use:** Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.
## Example Workflows
You can find example workflows and reference implementations for deploying multimodal models in:
- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch)
- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch)
- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch)
- [Advanced multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
......@@ -151,5 +151,5 @@ docker run -it --rm nvcr.io/nvidia/aiconfigurator:latest \
## Learn More
- [Dynamo Installation Guide](/docs/kubernetes/installation_guide.md)
- [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md)
- [SLA Planner Guide](/docs/components/planner/planner_guide.md)
- [Benchmarking Guide](/docs/benchmarks/benchmarking.md)
\ No newline at end of file
# Load-based Planner
This document covers load-based planner in `examples/llm/components/planner.py`.
> [!WARNING]
> Load-based planner is inoperable as vllm, sglang, and trtllm examples all do not use prefill queues. Please use SLA planner for now.
> [!WARNING]
> Bare metal deployment with local connector is deprecated. The only option to deploy load-based planner is via k8s. We will update the examples in this document soon.
## Load-based Scaling Up/Down Prefill/Decode Workers
To adjust the number of prefill/decode workers, planner monitors the following metrics:
* Prefill worker: planner monitors the number of requests pending in the prefill queue to estimate the prefill workload.
* Decode/aggregated worker: planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload.
Every `metric-pulling-interval`, planner gathers the aforementioned metrics. Every `adjustment-interval`, planner compares the aggregated metrics in this interval with pre-set thresholds and decide to scale up/down prefill/decode workers. To avoid over-compensation, planner only changes the number of workers by 1 in one adjustment interval. In addition, when the number of workers is being adjusted, the planner blocks the metric pulling and adjustment.
To scale up a prefill/decode worker, planner just need to launch the worker in the correct namespace. The auto-discovery mechanism picks up the workers and add them to the routers. To scale down a prefill worker, planner send a SIGTERM signal to the prefill worker. The prefill worker store the signal and exit when it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, planner revokes the etcd lease of the decode worker. When the etcd lease is revoked, the corresponding decode worker is immediately removed from the router and won't get any new requests. The decode worker then finishes all the current requests in their original stream and exits gracefully.
There are two additional rules set by planner to prevent over-compensation:
1. After a new decode worker is added, since it needs time to populate the kv cache, planner doesn't scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
1. We do not scale up prefill worker if the prefill queue size is estimated to reduce below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals following the trend observed in the current adjustment interval.
## SLA-based Scaling Up/Down Prefill/Decode Workers
See [SLA-Driven Profiling](../benchmarks/sla_driven_profiling.md) for more details.
## Usage
The planner integration with the new frontend + worker architecture is currently a work in progress. This documentation will be updated with the new deployment patterns and code examples once the planner component has been fully adapted to the new workflow.
Configuration options:
* `namespace` (str, default: "dynamo"): Target namespace for planner operations
* `environment` (str, default: "local"): Target environment (local, kubernetes)
* `no-operation` (bool, default: false): Run in observation mode only
* `log-dir` (str, default: None): Tensorboard log directory
* `adjustment-interval` (int, default: 30): Seconds between adjustments
* `metric-pulling-interval` (int, default: 1): Seconds between metric pulls
* `max-gpu-budget` (int, default: 8): Maximum GPUs for all workers
* `min-gpu-budget` (int, default: 1): Minimum GPUs per worker type
* `decode-kv-scale-up-threshold` (float, default: 0.9): KV cache threshold for scale-up
* `decode-kv-scale-down-threshold` (float, default: 0.5): KV cache threshold for scale-down
* `prefill-queue-scale-up-threshold` (float, default: 0.5): Queue threshold for scale-up
* `prefill-queue-scale-down-threshold` (float, default: 0.2): Queue threshold for scale-down
* `decode-engine-num-gpu` (int, default: 1): GPUs per decode engine
* `prefill-engine-num-gpu` (int, default: 1): GPUs per prefill engine
Run as standalone process:
```bash
PYTHONPATH=/workspace/examples/llm python components/planner.py --namespace=dynamo --served-model-name=vllm --no-operation --log-dir=log/planner
```
Monitor metrics with Tensorboard:
```bash
tensorboard --logdir=<path-to-tensorboard-log-dir>
```
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment