Commit 2c3066bd authored by dagil-nvidia, committed by GitHub

docs: full migration of docs/ to fern format in fern/ (#6050)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent d59b9d72
...@@ -9,98 +9,66 @@ The prefill and decode phases of LLM requests have different computation charact ...@@ -9,98 +9,66 @@ The prefill and decode phases of LLM requests have different computation charact
Disaggregated execution of a request has three main steps: Disaggregated execution of a request has three main steps:
1. Prefill engine computes prefill phase and generates KV cache 1. Prefill engine computes prefill phase and generates KV cache
2. Prefill engine transfers the KV cache to decode engine, and 2. Prefill engine transfers the KV cache to decode engine
3. Decode engine computes decode phase. 3. Decode engine computes decode phase.
However, not all requests’ prefill phases need to be computed in the remote prefill engine. If the prefill is short or the decode engine has a high prefix cache hit, often it is more efficient to prefill locally in the decode engine. The disaggregation design in Dynamo accounts for all these scenarios and features a flexible framework that delivers strong performance across various conditions. The disaggregation design in Dynamo features a flexible framework that delivers strong performance across various conditions.
## Efficient KV Transfer
## Design The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. The KV transfer is non-blocking, allowing GPU forward passes to continue serving other requests during the transfer.
```mermaid
sequenceDiagram
participant D as Worker
participant Q as PrefillQueue
participant P as PrefillWorker
Note over D: Request is routed to decode
D->>D: Decide if prefill should be done locally or remotely
D->>D: Allocate KV blocks
D->>Q: Put RemotePrefillRequest on the queue
P->>Q: Pull request from the queue
P-->>D: Read cached KVs from Decode
D->>D: Decode other requests
P->>P: Run prefill
P-->>D: Write prefilled KVs into allocated blocks
P->>D: Send completion notification
Note over D: Notification received when prefill is done
D->>D: Schedule decoding
```
There are four main components in Dynamo disaggregation:
- Worker: execute prefill and decode requests
- Prefill worker: execute prefill requests only
- Disaggregated router: decide whether to prefill locally or remotely
- Prefill queue: cache and load balance the remote prefill requests
When a worker receives a request, it first decides if the prefill should be done locally or remotely using the disaggregated router and allocates the KV blocks. If prefilling remotely, it then pushes a remote prefill request to the prefill queue. After that, the prefill worker pulls from the prefill queue, reads the KV blocks with prefix cache hits from the worker, computes the prefill, and writes the computed KV blocks back to the worker. Finally, the worker completes the remaining decoding.
## Conditional Disaggregation
Not all requests’ prefill phases need to be computed in the remote prefill engine. The disaggregated router decides at runtime whether the prefill phase of a request should be computed locally or remotely, based on the prefill length and prefill queue status. Specifically, a request is sent to the remote prefill engine if the following two conditions are met:
1. The absolute prefill length without prefix cache hit is greater than a preset threshold. On the one hand, if the prefill length of a request is short, it can be efficiently computed in the decode engine by piggybacking chunked prefill requests with ongoing decode requests. On the other hand, if the prefix cache hit is long, the prefill becomes memory bound and hence can be more efficiently computed in the decode engine.
2. The number of remote prefill requests in the prefill queue is less than a preset threshold. When the prefill queue has a large number of prefill requests, it indicates that the prefill workers are lagging behind, and it is better to prefill locally until more prefill workers join.
Conditional disaggregation allows Dynamo to achieve high performance for dynamic workloads.
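
The two conditions above can be sketched as a single predicate. This is a minimal illustration, not Dynamo's router code: the names `DisaggThresholds` and `should_prefill_remotely` and both threshold values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggThresholds:
    # Both values are illustrative assumptions, not Dynamo defaults.
    max_local_prefill_length: int = 512
    max_prefill_queue_size: int = 8


def should_prefill_remotely(
    prompt_length: int,
    prefix_hit_length: int,
    queue_size: int,
    thresholds: DisaggThresholds = DisaggThresholds(),
) -> bool:
    """Return True when both conditions for remote prefill hold."""
    # Condition 1: the prefill length *without* the prefix cache hit
    # must exceed the preset threshold.
    uncached_length = max(prompt_length - prefix_hit_length, 0)
    long_enough = uncached_length > thresholds.max_local_prefill_length

    # Condition 2: the prefill queue must not be backed up.
    queue_has_room = queue_size < thresholds.max_prefill_queue_size

    return long_enough and queue_has_room
```

With these placeholder thresholds, a 4096-token prompt with no cache hit and a short queue goes remote, while the same prompt with a 4000-token prefix hit, or with a full queue, is prefilled locally.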
## Prefill Queue
Prefill requests are computation bound (except for very short prefills) and should be executed in their dedicated iterations without any other requests to ensure fast TTFT. To balance the load across multiple prefill engines, Dynamo adopts a global prefill queue where workers push remote prefill requests and prefill workers pull and complete the requests one by one. The global prefill queue is implemented based on NATS stream to ensure high performance and availability. ### Router Orchestration
## Efficient KV Transfer The disaggregated serving flow is orchestrated by the `PrefillRouter`:
```mermaid ```mermaid
sequenceDiagram sequenceDiagram
participant D as Worker participant Client
participant SD as WorkerScheduler participant Frontend
participant SP as PrefillWorkerScheduler participant Router as PrefillRouter
participant P as PrefillWorker participant Prefill as Prefill Worker
participant Decode as Decode Worker
Client->>Frontend: Request
Frontend->>Router: Preprocessed Request
Router->>Router: Select prefill worker
Router->>Prefill: Prefill request
Prefill->>Prefill: Compute KV cache
Prefill-->>Router: disaggregated_params
Router->>Router: Select decode worker
Router->>Decode: Decode request + transfer metadata
Decode<<->>Prefill: KV transfer (NIXL)
Decode->>Decode: Generate tokens
Decode-->>Frontend: Stream tokens
Frontend-->>Client: Response
```
Note over SD: KV blocks allocated 1. **Worker Selection**: The router selects a prefill worker using KV-aware routing (based on cache overlap scores and load) or simple load balancing.
SD->>SP: Issue remote prefill request <br> with KV block descriptors via prefill queue
SP->>P: Add to in-flight batch
P-->>D: Remote NIXL read for prefix hit KV blocks (non-block) 2. **Prefill Execution**: The router sends the prefill request to the selected prefill worker. The prefill worker computes the KV cache and returns `disaggregated_params` containing backend-specific transfer metadata.
P->>P: Execute prefill
P-->>D: Remote NIXL write for computed KV blocks (non-block)
P->>SP: Notify finish 3. **Decode Routing**: The router injects the prefill result into the decode request, then routes to the decode worker.
SP->>SD: Notify finish
SD->>D: Add to in-flight batch
D->>D: Execute decode 4. **KV Transfer**: The decode worker uses the transfer metadata to coordinate with the prefill worker. NIXL handles the direct GPU-to-GPU transfer using the optimal available transport (NVLink, InfiniBand/UCX, etc.).
```
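
The four orchestration steps (worker selection, prefill execution, decode routing, KV transfer) can be sketched in a few lines. This is a rough illustration under stated assumptions: `Request`, `StubPrefillWorker`, `StubDecodeWorker`, and `route_disaggregated` are hypothetical stand-ins, and the contents of `disaggregated_params` are invented for the example; the real `PrefillRouter` and backend metadata differ.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Request:
    token_ids: List[int]
    # Filled in by the router after prefill; empty for a fresh request.
    disaggregated_params: Dict[str, Any] = field(default_factory=dict)


class StubPrefillWorker:
    """Hypothetical prefill worker that returns made-up transfer metadata."""

    def __init__(self, name: str) -> None:
        self.name = name

    def prefill(self, request: Request) -> Dict[str, Any]:
        return {"prefill_worker": self.name, "num_tokens": len(request.token_ids)}


class StubDecodeWorker:
    """Hypothetical decode worker; a real one would pull the KV cache via NIXL."""

    def decode(self, request: Request) -> List[str]:
        source = request.disaggregated_params["prefill_worker"]
        return [f"decoded-after-{source}"]


def route_disaggregated(
    request: Request,
    select_prefill: Callable[[Request], StubPrefillWorker],
    select_decode: Callable[[Request], StubDecodeWorker],
) -> List[str]:
    # 1. Worker selection: choose a prefill worker.
    prefill_worker = select_prefill(request)
    # 2. Prefill execution: the worker returns transfer metadata.
    params = prefill_worker.prefill(request)
    # 3. Decode routing: inject the prefill result into the decode request.
    decode_request = Request(request.token_ids, dict(params))
    decode_worker = select_decode(decode_request)
    # 4. KV transfer and decoding happen inside the decode worker.
    return decode_worker.decode(decode_request)
```

The selection callbacks are where KV-aware routing or simple load balancing would plug in; here they are plain functions so the flow stays visible.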
The key to high-performance disaggregation is efficient KV transfer. Dynamo leverage NIXL to transfer KV cache directly from the VRAM of prefill engine to the VRAM of decode engine. In addition, the KV transfer is non-blocking, allowing GPU forward pass to serve other requests in addition to the KV transfer. ### Backend-Specific Transfer Metadata
After the KV blocks are allocated, the worker scheduler sends the remote prefill requests, which contain the memory descriptors for the allocated KV blocks, to the prefill worker scheduler via prefill queue. This allows the prefill worker to read and write from the remote KV blocks without explicit handling in the remote worker engine, thanks to the RDMA read and write NIXL operations. Once the remote prefill is done, worker scheduler simply adds the decode request to the worker in-flight. This allows workers to execute forward passes of ongoing decode/prefill requests while waiting for the remote prefill to finish. The transfer metadata format varies by backend:
To reduce the size of memory descriptors, Dynamo applies two optimizations: - **SGLang**: Uses `bootstrap_info` (host, port, room_id) for RDMA bootstrap coordination. SGLang prefill workers publish their bootstrap endpoint to the discovery service during initialization. With this mechanism, prefill can run as a background task, allowing the decode phase to begin immediately while the KV transfer proceeds in parallel.
1. After each worker finishes its initialization and allocates all the KV cache pool, it stores the memory descriptor of all blocks (which is also referred to as the NIXL metadata) in ETCD, a distributed key-value store. Prefill workers load and cache the memory descriptors in one worker at the first time that it serves a remote prefill request issued by this worker. Thus, only the KV block ID instead of the full memory descriptor is needed when issuing the remote prefill request.
2. Dynamo promotes the memory allocator in the prefill engine to allocate continuous blocks and merge continuous blocks into larger blocks to reduce the total number of KV blocks. - **vLLM**: Uses `kv_transfer_params` containing block IDs and remote worker connection info. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
- **TRTLLM**: Uses `opaque_state` containing serialized TRT-LLM internal metadata. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
For decode and prefill with different KV layouts (i.e., due to different TP), Dynamo applies a high-performance kernel that transposes the KV blocks into their matching layout in the KV receiver after the NIXL reads and before the NIXL writes.
## Runtime-Reconfigurable xPyD ## Runtime-Reconfigurable xPyD
The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows runtime-reconfigurable xPyD. Workers and prefill workers can be added and removed at runtime without any system-level synchronization or overheads. New and existing prefill workers both just simply pull remote prefill requests from NATS prefill queue. The NIXL metadata of the new or existing workers (for new prefill workers) are lazily loaded and cached when necessary. Specifically, adding and removing workers and prefill workers is as easy as: Dynamo's disaggregation design supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added and removed at runtime:
- **Add worker**: Worker registers with the discovery service and publishes its `RuntimeConfig` (including KV capacity).
- **Remove worker**: Worker drains active requests and deregisters from discovery.
- Add worker: add NIXL metadata in ETCD. The router automatically discovers new workers via the discovery service and incorporates them into routing decisions.
- Remove worker: flush engine and delete NIXL metadata in ETCD.
- Add prefill worker: no explicit action needed.
- Delete prefill worker: flush engine.
...@@ -7,56 +7,81 @@ ...@@ -7,56 +7,81 @@
## Overview ## Overview
Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure: Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., Python bindings can be found in `/lib/bindings/python`). The runtime supports multiple discovery backends (Kubernetes-native or etcd) and request planes (TCP, HTTP, or NATS). `DistributedRuntime` follows a hierarchical structure:
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens. - `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It manages connections to discovery backends (K8s API or etcd) and optional messaging (NATS for KV events), and handles lifecycle with cancellation tokens.
- `Namespace`: A `Namespace` is a logical grouping of components that isolates different model deployments from each other. - `Namespace`: A `Namespace` is a logical grouping of components that isolates different model deployments from each other.
- `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers. - `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
- `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function. - `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other. While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.
For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple workers: For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components:
- `Frontend`: Starts an HTTP server and handles incoming requests. The HTTP server routes all requests to the `Processor`. - `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality.
- `Processor`: When a new request arrives, `Processor` applies the chat template and performs the tokenization. - `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
Then, it routes the request to the `Worker`.
- `Worker` components (e.g., `VllmDecodeWorker`, `SGLangDecodeWorker`, `TrtllmWorker`): Perform the actual computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
Since the workers are deployed in different processes, each of them has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-agg`). Then, under their namespace, they have their own `Component`s: `Frontend` uses the `make_engine` function which handles HTTP serving and routing automatically, while worker components create components with names like `worker`, `decode`, or `prefill` and register endpoints like `generate`, `flush_cache`, or `clear_kv_blocks`. The `Frontend` component doesn't explicitly create endpoints - instead, the `make_engine` function handles the HTTP server and worker discovery. Worker components create their endpoints programmatically using the `component.endpoint()` method. Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("worker")`), and their `Endpoint`s are created using the `component.endpoint()` method. Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under their namespace, each has its own `Component`:
- `Frontend` uses the `make_engine` function which handles HTTP serving, request preprocessing, and worker discovery automatically
- Worker components register with names like `backend`, `prefill`, `decode`, or `encoder` depending on their role
- Workers register endpoints like `generate`, `clear_kv_blocks`, or `load_metrics`
Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("backend")`), and their `Endpoint`s are created using the `component.endpoint()` method.
## Initialization ## Initialization
In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic modes, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is basically dynamic mode without registration and discovery and hence does not rely on etcd. In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are multiple modes for `DistributedRuntime` initialization based on the deployment environment.
```{caution}
The hierarchy and naming may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
```
### Service Discovery Backends
The `DistributedRuntime` supports two service discovery backends, configured via `DYN_DISCOVERY_BACKEND`:
:::caution - **KV Store Discovery** (`DYN_DISCOVERY_BACKEND=kv_store`): Uses etcd for service discovery. **This is the global default** for all deployments unless explicitly overridden.
The hierarchy and naming in etcd and NATS may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
:::
- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections to the following two services: - **Kubernetes Discovery** (`DYN_DISCOVERY_BACKEND=kubernetes`): Uses native Kubernetes resources (DynamoWorkerMetadata CRD, EndpointSlices) for service discovery. **Must be explicitly set.** The Dynamo operator automatically sets this environment variable for Kubernetes deployments. **No etcd required.**
- etcd (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without etcd.
- NATS (both static and dynamic mode): for messaging.
where etcd and NATS are two global services (there could be multiple etcd and NATS services for high availability). > **Note:** There is no automatic detection of the deployment environment. The runtime always defaults to `kv_store`. For Kubernetes deployments, the operator injects `DYN_DISCOVERY_BACKEND=kubernetes` into pod environments.
For etcd, it also creates a primary lease and spins up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` use this lease_id to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task fails, the primary lease is revoked or expires, and the KV pairs stored with this lease_id are removed. When using Kubernetes discovery, the KV store backend automatically switches to in-memory storage since etcd is not needed.
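
The backend selection rule (no auto-detection, `kv_store` unless `DYN_DISCOVERY_BACKEND` says otherwise) can be sketched as follows. The helper names are hypothetical, and rejecting unknown values with an exception is an assumption made for the sketch; the actual runtime may handle them differently.

```python
import os
from typing import Mapping, Optional

VALID_BACKENDS = ("kv_store", "kubernetes")


def resolve_discovery_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the discovery backend from the environment, defaulting to kv_store."""
    env = os.environ if env is None else env
    backend = env.get("DYN_DISCOVERY_BACKEND", "kv_store")
    if backend not in VALID_BACKENDS:
        # Assumption: unknown values are treated as a configuration error.
        raise ValueError(f"unsupported DYN_DISCOVERY_BACKEND: {backend!r}")
    return backend


def kv_store_storage(backend: str) -> str:
    """Kubernetes discovery needs no etcd, so the KV store falls back to memory."""
    return "memory" if backend == "kubernetes" else "etcd"
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.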
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism and is not registered in etcd. It provides the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` as the service identifier and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`. ### Runtime Initialization
- `Endpoint`: When an Endpoint object is created and started, it performs two key registrations:
- NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`. - `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections based on the discovery backend:
- etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`. - **Kubernetes mode**: Uses K8s API for service registration via DynamoWorkerMetadata CRD. No external dependencies required.
- **KV Store mode**: Connects to etcd for service discovery. Creates a primary lease with a background keep-alive task. All objects registered under this `DistributedRuntime` use this lease_id to maintain their lifecycle.
- **NATS** (optional): Used for KV event messaging when using KV-aware routing. Can be disabled via `--no-kv-events` flag, which enables prediction-based routing without event persistence.
- **Request Plane**: TCP by default. Can be configured to use HTTP or NATS via `DYN_REQUEST_PLANE` environment variable.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism. They provide the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, it registers a service in the internal registry of the `DistributedRuntime`, which tracks all services and endpoints.
- `Endpoint`: When an Endpoint object is created and started, it performs registration based on the discovery backend:
- **Kubernetes mode**: Endpoint information is stored in DynamoWorkerMetadata CRD resources, which are watched by other components for discovery.
- **KV Store mode**: Endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id`.
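
To make the KV Store naming concrete, the sketch below builds keys of the form `/services/{namespace}/{component}/{endpoint}-{lease_id}` and shows how two workers of the same type differ only in the lease suffix. The function names are hypothetical, and rendering the lease ID in hexadecimal is an assumption borrowed from the `{lease_id_hex}` suffix used in the NATS subject naming.

```python
from typing import Iterable, List


def endpoint_prefix(namespace: str, component: str, endpoint: str) -> str:
    """Prefix shared by every instance of an endpoint."""
    return f"/services/{namespace}/{component}/{endpoint}"


def endpoint_key(namespace: str, component: str, endpoint: str, lease_id: int) -> str:
    """Full key for one instance; the lease ID is what makes it unique."""
    # Assumption: the lease ID is written in lowercase hex.
    return f"{endpoint_prefix(namespace, component, endpoint)}-{lease_id:x}"


def instances_under(prefix: str, keys: Iterable[str]) -> List[str]:
    """Filter keys the way a prefix watch on the endpoint would."""
    return sorted(k for k in keys if k.startswith(prefix + "-"))


# Two prefill workers in one deployment: same names, different leases.
keys = [
    endpoint_key("dynamo", "prefill", "generate", 0x1A2B),
    endpoint_key("dynamo", "prefill", "generate", 0x3C4D),
    endpoint_key("dynamo", "decode", "generate", 0x5E6F),
]
prefill_instances = instances_under(
    endpoint_prefix("dynamo", "prefill", "generate"), keys
)
```

Here `prefill_instances` contains only the two prefill keys, which is the set a client watching that prefix would see.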
## Calling Endpoints ## Calling Endpoints
Dynamo uses `Client` object to call an endpoint. When a `Client` objected is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with the information, including `lease_id` and NATS subject of the available `Endpoint`s. Dynamo uses a `Client` object to call an endpoint. When a `Client` is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then watches for endpoint changes:
- **Kubernetes mode**: Watches DynamoWorkerMetadata CRD resources for endpoint updates.
- **KV Store mode**: Sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`.
The watcher continuously updates the `Client` with information about available `Endpoint`s.
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](https://github.com/ai-dynamo/dynamo/tree/main/lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies: The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](https://github.com/ai-dynamo/dynamo/tree/main/lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
- `random`: randomly select an endpoint to hit - `random`: randomly select an endpoint to hit
- `round_robin`: select endpoints in round-robin order - `round_robin`: select endpoints in round-robin order
- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint - `direct`: direct the request to a specific endpoint by specifying the instance ID
After selecting which endpoint to hit, the `Client` sends the request using the configured request plane (TCP by default). The request plane handles the actual transport:
After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and creates a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection. - **TCP** (default): Direct TCP connection with connection pooling
- **HTTP**: HTTP/2-based transport
- **NATS**: Message broker-based transport (legacy)
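
The three load balancing strategies (`random`, `round_robin`, `direct`) can be illustrated with a small selector. This is a sketch in Python rather than the Rust implementation in `push_router.rs`; the class name `EndpointPicker` and the addresses are hypothetical.

```python
import itertools
import random
from typing import Dict, Optional


class EndpointPicker:
    """Chooses one endpoint instance out of the set a Client currently knows."""

    def __init__(self, instances: Dict[int, str], seed: Optional[int] = None) -> None:
        # Maps instance ID -> address (addresses here are placeholders).
        self._instances = dict(instances)
        self._rng = random.Random(seed)
        self._cycle = itertools.cycle(sorted(self._instances))

    def pick(self, strategy: str, instance_id: Optional[int] = None) -> str:
        if not self._instances:
            raise LookupError("no endpoint instances available")
        if strategy == "random":
            return self._instances[self._rng.choice(sorted(self._instances))]
        if strategy == "round_robin":
            return self._instances[next(self._cycle)]
        if strategy == "direct":
            # The caller names the exact instance to hit.
            if instance_id not in self._instances:
                raise KeyError(f"unknown instance ID: {instance_id!r}")
            return self._instances[instance_id]
        raise ValueError(f"unknown strategy: {strategy!r}")


picker = EndpointPicker({1: "10.0.0.1:9000", 2: "10.0.0.2:9000"})
first_three = [picker.pick("round_robin") for _ in range(3)]
```

The sketch keeps a fixed instance set for brevity; in the real system the watcher keeps adding and removing instances, so the selector would need to rebuild its rotation on each update.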
## Examples ## Examples
......
...@@ -5,249 +5,189 @@ ...@@ -5,249 +5,189 @@
# Dynamo Architecture Flow # Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm). Color-coded flows indicate different types of operations: This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.
## 🔵 Main Request Flow (Blue) ## 🔵 Main Request Flow (Blue)
The primary user journey through the system: The primary user journey through the system:
1. **Discovery (S1)**: Client discovers the service endpoint 1. **Request (S1)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000) 2. **Preprocess (S2)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing 3. **Route to Prefill (S3)**: PrefillRouter selects a prefill worker using KV-aware routing or load balancing
4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
## 🟠 Decision and Allocation Flow (Orange) ## 🟢 Prefill Flow (Green)
The system's intelligent routing and resource allocation: The prefill processing pipeline:
4. **Query (S4)**: Decode Worker queries for prefix cache hits to optimize processing 4. **Prefill (S4)**: Prefill worker executes the prefill computation on the input tokens and generates KV cache
5. **Disagg Decision (S5)**: Based on prefill length and queue size, the system decides whether it needs remote prefill 5. **Return Metadata (S5)**: Prefill worker returns `disaggregated_params` containing backend-specific transfer metadata
5a. **Allocate (S5a)**: Decode Worker pre-allocates KV cache blocks in its local GPU memory
6. **Queue (S6)**: If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
## 🟢 Prefill Worker Flow (Green) ## 🟠 Decode Routing Flow (Orange)
The dedicated prefill processing pipeline: Router orchestration to decode phase:
7. **NATS Pull (S7)**: PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers 6. **Route to Decode (S6)**: PrefillRouter injects prefill result into decode request and routes to decode worker
8. **Load Metadata (S8)**: PrefillWorker loads NIXL metadata from ETCD to establish GPU communication 7. **KV Transfer (S7)**: Decode worker coordinates with prefill worker for direct GPU-to-GPU KV cache transfer via NIXL
9. **Prefill (S9)**: Worker executes the prefill computation on the input tokens
10. **NIXL Transfer (S10)**: Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks
## 🟣 Completion Flow (Purple) ## 🟣 Completion Flow (Purple)
The response generation and delivery: The response generation and delivery:
11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker 8. **Decode (S8)**: Decode worker generates tokens using the transferred KV cache
12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data 9. **Response (S9)**: Generated tokens stream back through Frontend for post-processing (detokenization) and delivery to Client
13. **Response (S13)**: The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
## 🔗 Infrastructure Connections (Dotted lines) ## 🔗 Infrastructure Connections (Dotted lines)
Coordination and messaging support: Coordination and messaging support:
### ETCD Connections (Gray, dotted) ### Service Discovery
- **Frontend, Processor, Planner**: Service discovery and registration - **On Kubernetes** (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
- **Decode Worker, PrefillWorker**: NIXL metadata storage for GPU communication setup - **On bare metal**: Uses etcd or filesystem for service discovery and endpoint registration.
### NATS Connections (Teal, dotted) ### Request Plane
- **PrefillQueue**: JetStream consumer group for reliable work distribution - **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport.
- **Processor**: Load balancing across workers - **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`.
### NATS Connections (Optional, for KV routing)
- **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)
### Planning Connections (Gold, dotted) ### Planning Connections (Gold, dotted)
- **Frontend → Planner**: Metrics collection for auto-scaling decisions - **Frontend → Planner**: Metrics collection for auto-scaling decisions
- **Planner → Workers**: Resource scaling commands for both Decode Worker and PrefillWorker - **Planner → Workers**: Resource scaling commands for workers
## Technical Implementation Details ## Technical Implementation Details
### PrefillRouter Orchestration:
- The `PrefillRouter` sits between the Frontend and workers, orchestrating disaggregated serving
- Selects prefill workers using KV-aware routing (cache overlap scores + load) or simple load balancing
- Injects transfer metadata into decode requests for KV cache coordination
### NIXL (NVIDIA Interchange Library): ### NIXL (NVIDIA Interchange Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe - Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination - Transfer metadata exchanged via `disaggregated_params` in prefill response
- PrefillWorker loads metadata to establish direct communication channels - Backend-specific coordination: SGLang uses bootstrap connections, vLLM uses block IDs, TRTLLM uses opaque state
- Block-based transfers (64–128 tokens per block) for efficient batching
### Disaggregated KV Cache: ### Disaggregated KV Cache:
- Each Decode Worker maintains local KV cache in its GPU memory - Each worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks—all transfers are direct worker-to-worker - No shared storage bottlenecks—transfers are direct worker-to-worker via NIXL
- Pre-allocated blocks ensure deterministic memory layout and performance - Non-blocking transfers allow GPU forward passes to continue during KV transfer
```mermaid ```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%% %%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
graph TD graph TD
%% Top Layer - Client & Frontend %% Top Layer - Client & Frontend
Client["<b>HTTP Client</b>"] Client["<b>HTTP Client</b>"]
S1[["<b>1 DISCOVERY</b>"]]
Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"] Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
S2[["<b>2 REQUEST</b>"]] S1[["<b>1 REQUEST</b>"]]
S2[["<b>2 PREPROCESS</b>"]]
%% Processing Layer %% Router Layer
Processor["<b>Processor</b><br/><i>Request Handler & Router</i>"] PrefillRouter["<b>PrefillRouter</b><br/><i>Orchestrates Disaggregated Serving</i>"]
S3[["<b>3 VALIDATE</b>"]] S3[["<b>3 ROUTE TO PREFILL</b>"]]
%% Infrastructure - Positioned strategically to minimize crossings %% Infrastructure
subgraph INF["<b>Infrastructure Layer</b>"] subgraph INF["<b>Infrastructure Layer</b>"]
ETCD[("<b>ETCD</b><br/><i>Service Discovery &<br/>NIXL Metadata</i>")] Discovery[("<b>Discovery</b><br/><i>Service Registry<br/>(ETCD or K8s)</i>")]
NATS[("<b>NATS</b><br/><i>Message Broker</i>")] NATS[("<b>NATS</b><br/><i>KV Events<br/>(Optional)</i>")]
Planner["<b>Planner</b><br/><i>Resource Management<br/>Auto-scaling</i>"] Planner["<b>Planner</b><br/><i>Auto-scaling</i>"]
end end
%% Worker Layer - Main processing %% Worker Layer
subgraph WL["<b>Worker Layer</b>"] subgraph WL["<b>Worker Layer</b>"]
%% VllmWorker section %% Prefill Worker
VllmWorker["<b>Decode Worker</b><br/><i>Handles Decoding & Disagg Decisions</i>"] PrefillWorker["<b>Prefill Worker</b><br/><i>Computes KV Cache</i>"]
S4[["<b>4 QUERY</b>"]] S4[["<b>4 PREFILL</b>"]]
S5[["<b>5 DISAGG DECISION</b>"]] S5[["<b>5 RETURN METADATA</b>"]]
S5a[["<b>5a ALLOCATE</b>"]]
S12[["<b>12 DECODE</b>"]] %% Decode Worker
S6[["<b>6 QUEUE</b>"]] DecodeWorker["<b>Decode Worker</b><br/><i>Token Generation</i>"]
S13[["<b>13 RESPONSE</b>"]] S6[["<b>6 ROUTE TO DECODE</b>"]]
S7[["<b>7 KV TRANSFER</b>"]]
%% Storage positioned near workers S8[["<b>8 DECODE</b>"]]
LocalKVCache[("<b>Local KV Cache</b><br/><i>Pre-allocated Blocks</i>")] S9[["<b>9 RESPONSE</b>"]]
%% Prefill System - Right side to minimize crossings %% KV Cache
subgraph PS["<b>Prefill System</b>"] PrefillKVCache[("<b>Prefill KV Cache</b><br/><i>GPU VRAM</i>")]
PrefillQueue["<b>Prefill Queue</b><br/><i>NATS JetStream<br/>Consumer Group</i>"] DecodeKVCache[("<b>Decode KV Cache</b><br/><i>GPU VRAM</i>")]
PrefillWorker["<b>Prefill Worker</b><br/><i>Dedicated Prefill Processing<br/>(Multiple Instances)</i>"]
S7[["<b>7 NATS PULL</b>"]]
S8[["<b>8 LOAD METADATA</b>"]]
S9[["<b>9 PREFILL</b>"]]
S10[["<b>10 NIXL TRANSFER</b>"]]
S11[["<b>11 NOTIFY</b>"]]
end
end end
%% Main Request Flow (Blue) - Clean vertical flow %% Main Request Flow (Blue)
Client -.-> S1 Client --> S1
S1 -->|HTTP API Call| Frontend S1 -->|HTTP API Call| Frontend
Frontend -.-> S2 Frontend --> S2
S2 -->|Process & Validate| Processor S2 -->|Tokenize & Validate| PrefillRouter
Processor -.-> S3 PrefillRouter --> S3
S3 -->|Route to Worker| VllmWorker S3 -->|Select Prefill Worker| PrefillWorker
%% VllmWorker Internal Flow (Orange) %% Prefill Flow (Green)
VllmWorker -.-> S4 PrefillWorker --> S4
S4 -->|Query Prefix Cache Hit| S5 S4 -->|Compute KV Cache| PrefillKVCache
S5 -->|Prefill Length & Queue Check| S5a PrefillWorker --> S5
S5a -->|Continue to Decode| S12 S5 -->|disaggregated_params| PrefillRouter
%% Allocation & Queuing (Orange) - Minimize crossings %% Decode Routing Flow (Orange)
S5a -->|Allocate KV Cache Blocks| LocalKVCache PrefillRouter --> S6
VllmWorker --> S6 S6 -->|Inject Transfer Metadata| DecodeWorker
S6 -->|Put RemotePrefillRequest| PrefillQueue DecodeWorker --> S7
S7 -->|NIXL GPU-to-GPU| PrefillKVCache
%% Prefill Worker Flow (Green) - Self-contained within PS PrefillKVCache -.->|Direct Transfer| DecodeKVCache
PrefillQueue -.-> S7
S7 -->|Consumer Group Pull| PrefillWorker %% Completion Flow (Purple)
PrefillWorker -.-> S8 DecodeWorker --> S8
PrefillWorker -.-> S9 S8 -->|Generate Tokens| DecodeKVCache
S9 -->|Execute Prefill| S10 DecodeWorker --> S9
S10 -->|Direct GPU Transfer| LocalKVCache S9 -->|Stream Tokens| Frontend
PrefillWorker --> S11 Frontend -->|HTTP Response| Client
%% Return Flow (Purple) - Clean return path %% Infrastructure Connections
S11 -->|Completion Notification| S12 Frontend -.->|Service Discovery| Discovery
S12 -->|Decode from KV Cache| S13 PrefillRouter -.->|Worker Discovery| Discovery
S13 -->|Post-process Response| Processor PrefillWorker -.->|Register| Discovery
Processor -->|HTTP Response| Frontend DecodeWorker -.->|Register| Discovery
Frontend -->|Final Response| Client Planner -.->|Service Discovery| Discovery
%% Infrastructure Connections - Organized to avoid crossings %% NATS for KV events (optional)
%% ETCD Connections - Grouped by proximity PrefillWorker -.->|KV Events| NATS
Frontend -.->|Service Discovery| ETCD DecodeWorker -.->|KV Events| NATS
Processor -.->|Service Discovery| ETCD
VllmWorker -.->|NIXL Metadata| ETCD %% Planning Connections
PrefillWorker -.->|NIXL Metadata| ETCD
S8 -.->|Load NIXL Metadata| ETCD
Planner -.->|Service Discovery| ETCD
%% NATS Connections - Direct to queue system
PrefillQueue -.->|JetStream| NATS
Processor -.->|Load Balancing| NATS
%% Planning Connections - Strategic positioning
Frontend -.->|Metrics| Planner Frontend -.->|Metrics| Planner
Planner -.->|Auto-scaling| VllmWorker
Planner -.->|Auto-scaling| PrefillWorker Planner -.->|Auto-scaling| PrefillWorker
Planner -.->|Auto-scaling| DecodeWorker
%% Styling - Each component with unique colors %% Styling
classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
classDef processor fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px classDef router fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
classDef prefillQueue fill:#fff8e1,stroke:#E65100,stroke-width:3px classDef prefillWorker fill:#e8f5e9,stroke:#388E3C,stroke-width:3px
classDef prefillWorker fill:#fce4ec,stroke:#C2185B,stroke-width:3px
classDef prefillBox fill:#eceff1,stroke:#455A64,stroke-width:3px
classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
classDef etcd fill:#fff9c4,stroke:#F9A825,stroke-width:3px classDef discovery fill:#fff9c4,stroke:#F9A825,stroke-width:3px
classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
class Client client class Client client
class Frontend frontend class Frontend frontend
class Processor processor class PrefillRouter router
class VllmWorker worker class DecodeWorker worker
class PrefillQueue prefillQueue
class PrefillWorker prefillWorker class PrefillWorker prefillWorker
class Planner planner class Planner planner
class LocalKVCache storage class PrefillKVCache,DecodeKVCache storage
class ETCD etcd class Discovery discovery
class NATS nats class NATS nats
class PS prefillBox
class INF infraLayer class INF infraLayer
class WL workerLayer class WL workerLayer
%% Flow Colors
%% Main Request Flow - Blue
linkStyle 0,1,2,3,4,5 stroke:#1565C0,stroke-width:4px
%% Prefill Flow - Green
linkStyle 6,7,8,9 stroke:#2E7D32,stroke-width:4px
%% Decode Routing Flow - Orange
linkStyle 10,11,12,13,14 stroke:#E65100,stroke-width:4px
%% Completion Flow - Purple
linkStyle 15,16,17,18,19 stroke:#6A1B9A,stroke-width:4px
%% Flow Colors - Different line styles to reduce visual clutter %% Infrastructure - Gray dotted
%% Main Request Flow - Blue (solid) linkStyle 20,21,22,23,24,25,26,27,28,29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 0 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 1 stroke:#1565C0,stroke-width:4px
linkStyle 2 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 3 stroke:#1565C0,stroke-width:4px
linkStyle 4 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 5 stroke:#1565C0,stroke-width:4px
%% Decision & Allocation Flow - Orange (mixed)
linkStyle 6 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 7 stroke:#E65100,stroke-width:4px
linkStyle 8 stroke:#E65100,stroke-width:4px
linkStyle 9 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
%% KV Cache & Queue - Orange (solid)
linkStyle 10 stroke:#E65100,stroke-width:4px
linkStyle 11 stroke:#E65100,stroke-width:4px
linkStyle 12 stroke:#E65100,stroke-width:4px
%% Prefill Worker Flow - Green (mixed)
linkStyle 13 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 14 stroke:#2E7D32,stroke-width:4px
linkStyle 15 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 16 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 17 stroke:#2E7D32,stroke-width:4px
linkStyle 18 stroke:#2E7D32,stroke-width:4px
linkStyle 19 stroke:#2E7D32,stroke-width:4px
%% Completion Flow - Purple (mixed)
linkStyle 20 stroke:#6A1B9A,stroke-width:4px
linkStyle 21 stroke:#6A1B9A,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 22 stroke:#6A1B9A,stroke-width:4px
linkStyle 23 stroke:#6A1B9A,stroke-width:4px
linkStyle 24 stroke:#6A1B9A,stroke-width:4px
%% Infrastructure Flows - Lighter and dotted to reduce visual noise
%% ETCD Connections - Gray (dotted, thinner)
linkStyle 25 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 26 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 27 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 28 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 30 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
%% NATS Connections - Teal (dotted, thinner)
linkStyle 31 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 32 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
%% Planning Connections - Gold (dotted, thinner)
linkStyle 33 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 34 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 35 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
``` ```
...@@ -16,8 +16,7 @@ Dynamo's coordination layer adapts to the deployment environment: ...@@ -16,8 +16,7 @@ Dynamo's coordination layer adapts to the deployment environment:
| **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP | | **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP |
| **Bare metal / Local** (default) | etcd | NATS (optional) | TCP | | **Bare metal / Local** (default) | etcd | NATS (optional) | TCP |
> [!NOTE] > **Note:** The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.
> The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.
``` ```
┌─────────────────────────────────────────────────────────────────────┐ ┌─────────────────────────────────────────────────────────────────────┐
...@@ -51,8 +50,7 @@ The operator explicitly sets: ...@@ -51,8 +50,7 @@ The operator explicitly sets:
DYN_DISCOVERY_BACKEND=kubernetes DYN_DISCOVERY_BACKEND=kubernetes
``` ```
> [!WARNING] > **Important:** This must be explicitly configured. The runtime defaults to `kv_store` in all environments.
> This must be explicitly configured. The runtime defaults to `kv_store` in all environments.
### How It Works ### How It Works
...@@ -461,5 +459,5 @@ This provides KV-aware routing with reduced accuracy but no NATS dependency. ...@@ -461,5 +459,5 @@ This provides KV-aware routing with reduced accuracy but no NATS dependency.
## Related Documentation ## Related Documentation
- [Distributed Runtime](distributed-runtime.md) - Runtime architecture - [Distributed Runtime](distributed-runtime.md) - Runtime architecture
- [Request Plane](../guides/request-plane.md) - Request transport configuration - [Request Plane](request-plane.md) - Request transport configuration
- [Fault Tolerance](../fault-tolerance/request-cancellation.md) - Failure handling - [Fault Tolerance](../fault-tolerance/README.md) - Failure handling
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# KVBM Design
This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in vLLM and SGLang, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading).
## KVBM Components
![Internal Components of Dynamo KVBM](/assets/img/kvbm-components.png)
*Internal Components of Dynamo KVBM*
### Core
- **KvBlockManager**: Public facade. Constructs/owns the internal state and exposes the pools and onboarding APIs.
- **Scheduler**: Gates transfer execution relative to model progress (iteration/layer completion) when integrated with a framework connector (e.g., vLLM V1).
- **Config (config.rs)**: Describes model dims, page size, layout choices, and runtime flags used to build pools and layouts.
- **KvBlockManagerState**: Central object wiring together layouts, storage backends, and pools; owns the OffloadManager, metrics, and events hooks.
- **Events/Metrics**: Observability components emitting counters/gauges and event hooks for integration/testing.
### Layouts and Blocks
- **LayoutConfig & LayoutType**: Translate tensor shapes into KV cache layouts (layer-separated or fully-contiguous), including block counts and geometry.
- **Blocks & Metadata**: Typed block handles (mutable/immutable), metadata (e.g., priority), and views by layer/outer dims; used to allocate, register, and match by `sequence_hash`.
### Transfer Manager
- **TransferManager**: Asynchronous transfer orchestrator with per-path queues (Device→Host, Host→Disk, Host→Device, Disk→Device).
### Storage & Pools
- **Device Pool (G1)**: GPU-resident KV block pool. Allocates mutable GPU blocks, registers completed blocks (immutable), serves lookups by sequence hash, and is the target for onboarding (Host→Device, Disk→Device).
- **Host Pool (G2)**: CPU pinned-memory KV block pool. Receives Device offloads (Device→Host), can onboard to Device (Host→Device), and offloads to Disk. Uses pinned (page-locked) memory for efficient CUDA transfers and NIXL I/O.
- **Disk Pool (G3)**: Local NVMe SSD-backed KV block pool. Receives Host offloads (Host→Disk) and provides blocks for onboarding to Device (Disk→Device). NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GDS.
- **Remote Storage (G4)**: Remote or cloud-backed KV block storage. KVBM treats G4 as an opaque blob store accessed through NIXL, unaware of internal layout optimizations.
## KVBM Data Flows
![KVBM Data Flows](/assets/img/kvbm-data-flows.png)
*KVBM Data Flows from device to other memory hierarchies*
### Device → Host (Offload)
- Triggered when the connector scheduler explicitly requests an offload
- Worker allocates a Host block and performs CUDA D2H/Custom Kernel copy
- Host pool registers the new immutable block (dedup by sequence hash)
### Host → Disk (Offload)
- **Local Disk (G3)**: NIXL Write via POSIX; GDS when available
- **Remote Disk (G4)** (Network FS like NFS/Lustre/GPFS): NIXL Write via POSIX to the mounted FS; batching and concurrency are identical to the local-disk path
- Triggered on registered host blocks or explicit offload requests
- Worker allocates a Disk block and performs NIXL Write (Host→Disk)
- Disk pool registers the new immutable block (dedup by sequence hash)
### Host → Device (Onboard)
- Called to bring a host block into GPU memory
- Worker uses provided Device targets and performs CUDA H2D/Custom Kernel copy
- Device pool registers the new immutable block
### Disk → Device (Onboard)
- Called to bring a disk block directly into GPU memory
- Worker uses provided Device targets and performs NIXL Read (Disk→Device), possibly via GDS
- Device pool registers the new immutable block
## Internal Architecture Deep Dive
![Internal architecture and key modules in the Dynamo KVBM](/assets/img/kvbm-internal-arch.png)
*Internal architecture and key modules in the Dynamo KVBM*
### KvBlockManager as Orchestration Layer
The `KvBlockManager<H, D>` acts as a coordinator across memory tiers—host (CPU), device (GPU), and remote—by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store, unaware of internal layout optimizations.
`KvBlockManager<H, D>` owns:
- A device-side `BlockPool<Device>`
- A host-side `BlockPool<Host>`
- A remote NIXL agent that supports communication and memory sharing across nodes
- A block set registry for remote lookup and import/export of block metadata
Implementation-wise, `KvBlockManagerState` holds the logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configurations. `NixlOptions` injects remote awareness.
### Block Layout and Memory Mapping
Each block is a 2D array `[num_layers][page_size × inner_dim]`. The `BlockLayout` trait abstracts the memory layout. The default implementation, `FullyContiguous`, stores all layers for all blocks in one region with alignment-aware stride computation:
```text
block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
```
Both CPU and GPU pools share this memory layout, but they use storage-specific backends:
- `DeviceStorage` → CUDA device buffer
- `PinnedStorage` → page-locked host memory
- `SystemStorage` → CPU heap memory (fallback/test)
- `NixlStorage` → remote memory through NIXL RDMA handles (includes storage)
Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated using a `StorageAllocator`.
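The stride formula above can be sketched in a few lines of Python. This is only an illustration: the function names and the example dimensions are hypothetical, and the assumption that a block's layers sit back to back inside the block is ours, not a statement about KVBM's Rust implementation.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the nearest multiple of alignment."""
    if alignment < 1:
        raise ValueError("alignment must be a positive integer")
    return (value + alignment - 1) // alignment * alignment


def block_stride_in_bytes(num_layers: int, layer_stride: int, alignment: int) -> int:
    """Bytes from the start of one block to the start of the next."""
    return align_up(num_layers * layer_stride, alignment)


def layer_offset(block_idx: int, layer_idx: int, num_layers: int,
                 layer_stride: int, alignment: int) -> int:
    """Byte offset of one layer of one block from the region base.

    Assumes layers are stored contiguously inside each block (illustrative).
    """
    if not 0 <= layer_idx < num_layers:
        raise IndexError("layer_idx out of range")
    stride = block_stride_in_bytes(num_layers, layer_stride, alignment)
    return block_idx * stride + layer_idx * layer_stride


# Hypothetical dimensions: 3 layers, 70 bytes per layer, 64-byte alignment.
stride = block_stride_in_bytes(num_layers=3, layer_stride=70, alignment=64)
offset = layer_offset(block_idx=2, layer_idx=1, num_layers=3,
                      layer_stride=70, alignment=64)
```

With these numbers the raw block size is 210 bytes, so the aligned stride includes 46 bytes of padding per block.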
### BlockPool and Memory Pools (Active and Inactive)
Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, etc.) tracks two sub-pools:
- **ActivePool**: Contains blocks currently in use by sequences
- **InactivePool**: Recycled blocks ready for allocation (free list)
When a token block is requested (e.g., `get_mutable_block()`), the allocator pops from `InactivePool`, transitions its state, and returns a writable handle. On sequence commit or eviction, the system resets blocks and returns them to the inactive pool.
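A minimal Python sketch of the two sub-pools follows. The class and the `release()` method are hypothetical stand-ins; only `get_mutable_block()` is a name taken from the text, and the real pool returns typed block handles rather than integer IDs.

```python
from collections import deque


class BlockPool:
    """Toy pool that tracks block IDs in an inactive free list and an active set."""

    def __init__(self, num_blocks: int):
        self.inactive = deque(range(num_blocks))  # recycled blocks, ready to allocate
        self.active = set()                       # blocks currently in use

    def get_mutable_block(self):
        """Pop a block from the inactive pool, or return None if exhausted."""
        if not self.inactive:
            return None
        block_id = self.inactive.popleft()
        self.active.add(block_id)
        return block_id

    def release(self, block_id: int) -> None:
        """Return a block to the inactive pool (hypothetical helper)."""
        if block_id not in self.active:
            raise KeyError(f"block {block_id} is not active")
        self.active.remove(block_id)
        self.inactive.append(block_id)


pool = BlockPool(num_blocks=2)
first = pool.get_mutable_block()
second = pool.get_mutable_block()
exhausted = pool.get_mutable_block()
pool.release(first)
reused = pool.get_mutable_block()
```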
### Block State Machine
The state machine (`BlockState`) tracks block lifecycle transitions:
| State | Description | Ownership | Valid Actions/Transitions |
|-------|-------------|-----------|---------------------------|
| Reset | Block hasn't been initialized or was reset. No associated sequence. | Held in InactivePool, reusable | `init_sequence(salt_hash)` → Partial |
| Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | `add_token()` / `add_tokens()` (accumulate), `commit()` → Complete, `reset()` → Reset |
| Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | `register()` → Registered, `reset()` → Reset |
| Registered | Block is finalized and visible for reuse. Available in the deduplication cache. | Shared ownership (global registry) | Auto `drop()` → triggers Remove event and transitions to Reset |
#### Valid State Transitions
| From → To | Trigger | Validation |
|-----------|---------|------------|
| Reset → Partial | `init_sequence(salt_hash)` | Must not be in use |
| Partial → Complete | `commit()` | Must be full |
| Complete → Registered | `register()` | Must be finalized |
| Registered → Reset | Drop of `RegistrationHandle` | Automatic |
| Partial → Reset | Aborted sequence | Explicit or drop |
| Complete → Reset | Invalidated | Explicit or drop |
#### Example Block Lifecycle
A sequence requests a new KV block:
1. Allocator pops from InactivePool → Block is in Reset
2. `init_sequence()` → Transitions to Partial
3. Tokens are appended → State remains Partial
4. On full → `commit()` → State becomes Complete
5. `register()` → Block is hashed and moved to Registered. Blocks can now be used for lookup.
6. On eviction or end-of-life → `drop()` of RAII handle returns block to Reset
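The lifecycle above can be modeled as a small state machine. The Python sketch below mirrors the transitions in the tables; it is an assumption-laden illustration (the class layout, the SHA-256 stand-in for the sequence hash, and the use of `reset()` to model the handle drop are all ours), not the Rust `BlockState` implementation.

```python
import hashlib
from enum import Enum, auto


class BlockState(Enum):
    RESET = auto()
    PARTIAL = auto()
    COMPLETE = auto()
    REGISTERED = auto()


class Block:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.state = BlockState.RESET
        self.tokens = []
        self.salt_hash = None

    def _require(self, expected: BlockState) -> None:
        if self.state is not expected:
            raise RuntimeError(f"expected {expected.name}, found {self.state.name}")

    def init_sequence(self, salt_hash: int) -> None:
        self._require(BlockState.RESET)           # Reset -> Partial
        self.salt_hash = salt_hash
        self.state = BlockState.PARTIAL

    def add_token(self, token: int) -> None:
        self._require(BlockState.PARTIAL)         # stays Partial
        if len(self.tokens) >= self.capacity:
            raise ValueError("block is already full")
        self.tokens.append(token)

    def commit(self) -> None:
        self._require(BlockState.PARTIAL)         # Partial -> Complete
        if len(self.tokens) != self.capacity:
            raise ValueError("commit requires a full block")
        self.state = BlockState.COMPLETE

    def register(self) -> str:
        self._require(BlockState.COMPLETE)        # Complete -> Registered
        self.state = BlockState.REGISTERED
        payload = repr((self.salt_hash, self.tokens)).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()  # stand-in for sequence_hash

    def reset(self) -> None:
        # Models an explicit reset or the drop of the registration handle.
        self.tokens = []
        self.salt_hash = None
        self.state = BlockState.RESET
```

Calling the methods out of order (for example `commit()` on a block that is not full) raises instead of silently changing state, which is the property the validation column in the transition table describes.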
### Lifecycle Management using RAII and Event Plane
The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`. On registration and drop:
- Creating a `PublishHandle` triggers a Register event
- Dropping the handle triggers a Remove event
This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform.
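The register-on-create, remove-on-drop pattern can be approximated in Python with a context manager. Python has no deterministic `Drop` like Rust, so the `with` block stands in for the handle's scope here. The `EventManager.publish` signature and the event tuples are hypothetical.

```python
class EventManager:
    """Collects published events in memory (stand-in for the events plane)."""

    def __init__(self):
        self.events = []

    def publish(self, kind: str, sequence_hash: int) -> None:
        self.events.append((kind, sequence_hash))


class PublishHandle:
    """Publishes a register event on creation and a remove event on release."""

    def __init__(self, manager: EventManager, sequence_hash: int):
        self._manager = manager
        self._sequence_hash = sequence_hash
        self._released = False
        manager.publish("register", sequence_hash)

    def release(self) -> None:
        # Idempotent, so a double release never emits two remove events.
        if not self._released:
            self._released = True
            self._manager.publish("remove", self._sequence_hash)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False  # do not swallow exceptions


manager = EventManager()
with PublishHandle(manager, sequence_hash=0xABC) as handle:
    events_inside = list(manager.events)
handle.release()  # no effect: already released when the scope ended
```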
### Remote Memory Integration using NIXL
The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include:
- `nixl_register()`: Registers memory region with NIXL runtime
- `serialize() / deserialize()`: Converts layout and memory into transferable descriptors
- `import_remote_blockset()`: Loads remote node's block layouts into the manager
- `get_remote_blocks_mutable()`: Fetches transferable memory views from another node
`RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends).
#### Remote Memory Registration Protocol
The following describes a bidirectional remote memory registration and layout synchronization protocol between workers (e.g., Worker 1 and Worker 2) using NIXL:
**1. Agent Creation & Memory Registration**
Each worker independently sets up a NixlAgent:
- Registers its memory regions (i.e., device memory) through `nixl_register()`
- These regions correspond to blocks managed in the local BlockPool
Once the worker registers the memory, NIXL creates remote-accessible descriptors, which it binds to the memory layout.
**2. Metadata Exchange**
After memory registration, workers exchange serialized layout metadata, encapsulated in a `SerializedNixlBlockLayout`.
Why is this step critical?
- LLM inference workloads often differ in *tensor parallel (TP)* configurations:
- Worker 1 might have TP=4, while Worker 2 has TP=8
- Even if both systems use similar `FullyContiguous` layouts, their internal slicing and alignment assumptions differ
- The metadata exchange bridges this semantic mismatch by sharing:
- LayoutConfig (num_layers, page_size, inner_dim, dtype)
- BlockSetID
- Base address + stride information (including alignment)
- Device ID + memory type (host/device)
- Once workers share metadata, each can reconstruct the layout on its side using `deserialize()`
This enables NIXL to:
- Understand where each layer/block resides
- Perform correct gather-scatter operations during RDMA-like transfers
Without this step, remote fetches would result in data corruption or misaligned tokens.
**3. Serialization & Deserialization: Making Layouts Portable**
In the serialization stage, KVBM exports the layout, and `FullyContiguous::serialize()` encodes:
- FullyContiguousConfig
- base_offset
- Physical memory descriptors (NixlStorage), including:
- Memory type (VRAM, DRAM)
- Address & size
- Device ID
The system sends this using NIXL transfer and then injects it into a KVBM scheduler state.
In the deserialization stage, `SerializedNixlBlockLayout::deserialize()` rehydrates this into:
- A fully reconstructed memory layout view
- Local representation of a remote memory slice with correct offsets and size semantics
It also enables direct access to remote memory with consistent logical semantics. This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
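To make the round trip concrete, here is a hedged Python sketch of a portable layout descriptor. The `SerializedLayout` class, its field set, and the JSON wire format are assumptions chosen for illustration; the actual `SerializedNixlBlockLayout` encoding is not described in this document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SerializedLayout:
    num_layers: int
    page_size: int
    inner_dim: int
    dtype_bytes: int
    alignment: int
    base_address: int
    device_id: int
    memory_type: str  # e.g. "VRAM" or "DRAM"

    def serialize(self) -> bytes:
        """Encode the descriptor so a peer can rebuild the same view."""
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def deserialize(cls, payload: bytes) -> "SerializedLayout":
        return cls(**json.loads(payload.decode("utf-8")))

    def layer_address(self, block_idx: int, layer_idx: int) -> int:
        """Address of one layer of one block, derived only from the descriptor."""
        layer_stride = self.page_size * self.inner_dim * self.dtype_bytes
        raw = self.num_layers * layer_stride
        block_stride = (raw + self.alignment - 1) // self.alignment * self.alignment
        return self.base_address + block_idx * block_stride + layer_idx * layer_stride


local = SerializedLayout(num_layers=2, page_size=4, inner_dim=8, dtype_bytes=2,
                         alignment=256, base_address=0x1000, device_id=0,
                         memory_type="VRAM")
remote_view = SerializedLayout.deserialize(local.serialize())
```

Because the peer computes addresses from the descriptor alone, both sides agree on where each layer lives without sharing any other state.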
**4. Ownership Handles and Lifetime Tracking**
Memory ownership in NIXL is tightly coupled with RAII-based handles:
- When a block is registered, it returns a `PublishHandle` which wraps a `RegistrationHandle`
- On drop of this handle, an automatic Remove event is published, which:
- Deregisters the block from the NIXL layer
- Removes it from the remote block registry
- This ensures that once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes
This mechanism avoids:
- Stale memory access
- Dangling pointers on GPU or host
- Manual deregistration bugs
The system can batch and publish registration events using a Publisher, optimizing performance under high concurrency.
### Storage Backends and Pluggability
You can integrate KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers.
```mermaid
---
title: Example KVBM System Architecture
---
flowchart TD
A["Distributed Inference Engine"] --> B["Dynamo KV Block Manager"]
B --> C["NIXL Storage Agent<br/>- Volume registration<br/>- get()/put() abstraction"]
B --> D["Event Plane<br/>- NATS-based Pub/Sub<br/>- StoreEvent / RemoveEvent"]
C --> E["G4 Storage Infrastructure<br/>(SSD, Object store, etc.)<br/>- Store KV blocks"]
D --> F["Storage Provider Subscriber<br/>- Parse Events<br/>- Build fast tree/index<br/>- Optimize G4 tiering"]
```
#### NIXL Storage Interface (for Backend Integration)
The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
- `registerVolume(descriptor)`: Register a logical volume for KV cache data
- `unregisterVolume()`: Cleanly deregister and release volume mappings
- `get() / put()`: Block-level APIs used by KVBM to fetch and store token blocks
These abstractions allow backends to be integrated without tying into the host's file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Note that these APIs are still being finalized.
#### Dynamo Event Plane (Pub/Sub Coordination Layer)
To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations:
- **StoreEvent**: Emitted when a KV block is registered
- **RemoveEvent**: Emitted when a KV block is released or evicted
Each KVEvent (~100 bytes) contains:
| Field | Description |
|-------|-------------|
| `sequence_hash` | Unique identifier of the KV block |
| `prefix_hash` | Prefix grouping for query-level aggregation |
| `block_size` | Size in bytes |
| `storage_location` | Logical volume identifier |
| `event_type` | Store or Remove |
| `extra_metadata` | Reserved fields for partner-specific optimization |
For scalability, the system batches and publishes these events periodically (e.g., every ~10s, or dynamically based on system load).
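The event record and the batching behavior might look like the following Python sketch. The field names come from the table above; the `EventBatcher` class, its thresholds, and the explicit `now` argument (used instead of a wall clock to keep the example deterministic) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KVEvent:
    sequence_hash: int
    prefix_hash: int
    block_size: int
    storage_location: str
    event_type: str  # "store" or "remove"
    extra_metadata: dict = field(default_factory=dict)


class EventBatcher:
    """Buffers events and publishes them when a size or age limit is reached."""

    def __init__(self, publish, max_events: int = 100, max_age_s: float = 10.0):
        self._publish = publish
        self._max_events = max_events
        self._max_age_s = max_age_s
        self._pending = []
        self._oldest = None  # timestamp of the first buffered event

    def add(self, event: KVEvent, now: float) -> None:
        if not self._pending:
            self._oldest = now
        self._pending.append(event)
        too_many = len(self._pending) >= self._max_events
        too_old = now - self._oldest >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            batch, self._pending = self._pending, []
            self._oldest = None
            self._publish(batch)


published = []
batcher = EventBatcher(published.append, max_events=3, max_age_s=10.0)
events = [KVEvent(i, 1, 4096, "vol-0", "store") for i in range(5)]
for event, now in zip(events, [0.0, 1.0, 2.0, 3.0, 14.0]):
    batcher.add(event, now)
```

In this run the first batch is flushed by the size limit and the second by the age limit.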
#### Conceptual Design of a Storage Advisor
This section provides an overview for storage providers interested in integrating as a custom backend to KVBM. **This is optional for KVBM integration with a backend.**
External storage systems are not tightly coupled with Dynamo's execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
1. Storage volumes are pre-provisioned and mounted by the storage provider
2. These volumes are registered with Dynamo through the NIXL Storage Agent using `registerVolume()` APIs
3. Dynamo KV Block Manager interacts only with logical block-level APIs (`get()` and `put()`)
4. The Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel
5. Storage vendors implement a lightweight subscriber process that listens to these events
To enable fast lookup and dynamic tiering, storage vendors may build internal data structures using the received event stream:
- On receiving a **StoreEvent**: Insert a record into an internal prefix tree, hash map, or LRU index with `prefix_hash`, `sequence_hash`, and associated metadata
- On receiving a **RemoveEvent**: Delete or prune the corresponding record, optionally triggering cleanup or tier migration workflows
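A subscriber-side index for these two event types could be sketched as below. This is a hypothetical illustration: events are modeled as plain dictionaries using the field names from the KVEvent table, and a hash map keyed by prefix stands in for whatever prefix tree or LRU structure a vendor would actually build.

```python
from collections import defaultdict


class BlockIndex:
    """Tracks stored KV blocks by sequence hash and groups them by prefix hash."""

    def __init__(self):
        self._by_hash = {}
        self._by_prefix = defaultdict(set)

    def apply(self, event: dict) -> None:
        seq = event["sequence_hash"]
        kind = event["event_type"]
        if kind == "store":
            self._by_hash[seq] = event
            self._by_prefix[event["prefix_hash"]].add(seq)
        elif kind == "remove":
            stored = self._by_hash.pop(seq, None)
            if stored is not None:  # ignore removes for unknown blocks
                siblings = self._by_prefix[stored["prefix_hash"]]
                siblings.discard(seq)
                if not siblings:
                    del self._by_prefix[stored["prefix_hash"]]
        else:
            raise ValueError(f"unknown event_type: {kind!r}")

    def blocks_for_prefix(self, prefix_hash: int) -> set:
        return set(self._by_prefix.get(prefix_hash, ()))

    def __len__(self) -> int:
        return len(self._by_hash)


index = BlockIndex()
index.apply({"sequence_hash": 10, "prefix_hash": 1, "event_type": "store"})
index.apply({"sequence_hash": 11, "prefix_hash": 1, "event_type": "store"})
index.apply({"sequence_hash": 10, "prefix_hash": 1, "event_type": "remove"})
index.apply({"sequence_hash": 99, "prefix_hash": 5, "event_type": "remove"})
```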
With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies:
- **Hot block promotion**: Frequently accessed KV blocks can be migrated to fast SSD volumes
- **Cold block demotion**: Infrequently used blocks can be demoted to slower storage (HDDs, cloud object storage)
- **Proactive compaction**: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks
This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
## Framework Integrations
KVBM integrates with inference frameworks (vLLM, TensorRT-LLM, SGLang) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
### Connector Architecture
There are two components of the interface:
- **Scheduler (Leader)**: Orchestrates KV block offload/onboard and builds the metadata that tells workers which data to transfer. It also maintains hooks for handling asynchronous transfer completion.
- **Worker**: Reads the metadata built by the scheduler (leader) and performs asynchronous onboarding/offloading at the end of the forward pass.
![vLLM KVBM Integration](/assets/img/kvbm-integrations.png)
*Typical integration of KVBM with inference frameworks (vLLM shown as example)*
### Onboarding Operations
![Onboarding blocks from Host to Device](/assets/img/kvbm-onboard-host2device.png)
*Onboarding blocks from Host to Device*
![Onboarding blocks from Disk to Device](/assets/img/kvbm-onboard-disk2device.png)
*Onboarding blocks from Disk to Device*
### Offloading Operations
![Offloading blocks from Device to Host & Disk](/assets/img/kvbm-offload.png)
*Offloading blocks from Device to Host & Disk*
## Further Reading
- [vLLM Automatic Prefix Caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)
- [SGLang HiCache Benchmarks](https://github.com/sgl-project/sglang/tree/main/benchmark/hicache)
- [EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal](https://arxiv.org/abs/2006.06890)
## See Also
- [KVBM Overview](../components/kvbm/README.md)
- [KVBM Guide](../components/kvbm/kvbm-guide.md)
- [NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Planner Design
> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/components/planner/](../components/planner/README.md).
## Overview
The Planner is Dynamo's autoscaling controller. It observes system metrics, predicts future load, and adjusts prefill/decode worker replica counts to proactively meet SLA targets. This document covers the internal architecture, algorithms, and design trade-offs.
## Architecture
```text
┌──────────────────────────────────────────────────────────┐
│ Planner Component │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌────────────────┐ │
│ │ Metric │ │ Load │ │ Performance │ │
│ │ Collector │ │ Predictor │ │ Interpolator │ │
│ │ (Prometheus) │ │ (ARIMA/etc.) │ │ (JSON data) │ │
│ └───────┬───────┘ └───────┬───────┘ └───────┬────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Scaling Algorithm │ │
│ └───────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌───────────────────────▼───────────────────────────┐ │
│ │ Connector Layer │ │
│ │ ┌───────────────────┐ ┌───────────────────────┐ │ │
│ │ │ KubernetesConn. │ │ VirtualConn. │ │ │
│ │ │ (PATCH DGD) │ │ (Runtime bridge) │ │ │
│ │ └───────────────────┘ └───────────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```
## Scaling Algorithm
### Step 1: Metric Collection
Every `adjustment_interval` seconds, the planner queries Prometheus for:
- Average TTFT and ITL over the interval
- Total request count
- Average input sequence length (ISL) and output sequence length (OSL)
The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes histograms and counters.
### Step 2: Correction Factor Calculation
The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
```text
prefill_correction = actual_ttft / expected_ttft
decode_correction = actual_itl / expected_itl
```
These factors account for hard-to-model effects such as:
- **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
- **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
- **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL
- **Metric variance**: Average ISL/OSL may not represent the actual distribution
The correction factors are applied as multipliers to the next scaling decision. Setting `--no-correction` disables this for debugging or when cold-start artifacts dominate.
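As a small worked example of the ratios above, the sketch below computes both correction factors from made-up latencies in milliseconds. The helper name `correction_factor` is an assumption for illustration and is not taken from the planner code.

```python
def correction_factor(actual, expected):
    """Ratio of observed to profiled latency; 1.0 means profiling matched reality."""
    if expected <= 0:
        raise ValueError("expected latency must be positive")
    return actual / expected


# Illustrative numbers only (milliseconds).
prefill_correction = correction_factor(actual=300, expected=200)  # TTFT slower than profiled
decode_correction = correction_factor(actual=18, expected=20)     # ITL faster than profiled
```

A factor above 1 means the deployment is slower than the profile predicted, and a factor below 1 means it is faster.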
### Step 3: Load Prediction
The planner forecasts three values for the next interval:
- `next_num_req`: Number of requests
- `next_isl`: Average input sequence length
- `next_osl`: Average output sequence length
Four predictor implementations are available:
| Predictor | Algorithm | Best For |
| ------------ | ---------------------------------------- | -------------------------------- |
| **Constant** | `next = current` | Stable workloads, long intervals |
| **ARIMA** | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns |
| **Kalman** | Local linear trend Kalman filter | Bursty traffic |
| **Prophet** | Facebook Prophet time-series model | Complex seasonality |
All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).
### Step 4: Replica Calculation
**Prefill replicas:**
```python
predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
```
The prefill correction factor has a linear effect on throughput because prefill is single-batched.
**Decode replicas:**
```python
# Apply correction to the ITL SLA target
corrected_itl = target_itl / decode_correction
# Find best throughput/GPU that achieves corrected ITL at predicted context length
throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
itl=corrected_itl,
context_length=next_isl + next_osl / 2
)
# Calculate required replicas
decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
```
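To make the arithmetic concrete, here is a minimal runnable sketch of both formulas with made-up inputs. It assumes the interpolated throughput is expressed in tokens per second per GPU and is passed in directly rather than looked up through an interpolator; the function names are illustrative.

```python
from math import ceil


def prefill_replicas(next_num_req, next_isl, interval, prefill_correction,
                     interpolated_throughput, gpus_per_engine):
    # Predicted prefill load in tokens/s; the correction is capped at 1.
    predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
    return ceil(predicted_load / interpolated_throughput / gpus_per_engine)


def decode_replicas(next_num_req, next_osl, interval,
                    throughput_per_gpu, gpus_per_engine):
    # Predicted decode load in tokens/s divided by per-engine capacity.
    return ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)


# 600 requests of 3000 input tokens over a 180 s interval -> 10000 tokens/s.
p = prefill_replicas(600, 3000, 180, prefill_correction=1.5,
                     interpolated_throughput=4000, gpus_per_engine=1)  # ceil(2.5) = 3

# 600 requests of 300 output tokens over 180 s -> 1000 tokens/s.
d = decode_replicas(600, 300, 180, throughput_per_gpu=400, gpus_per_engine=1)  # ceil(2.5) = 3
```

Rounding up with `ceil` means the planner provisions for the predicted load rather than falling slightly short of it.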
### Step 5: Scaling Execution
The planner calls `connector.set_component_replicas()` with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
## Connector Design
### Interface
```python
class PlannerConnector(ABC):
async def add_component(self, component_name)
async def remove_component(self, component_name)
# Extended interface (not on ABC, but implemented by both connectors):
async def set_component_replicas(self, targets, blocking)
async def validate_deployment(self, ...)
async def wait_for_deployment_ready(self)
```
### KubernetesConnector
Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
**Design decisions:**
- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
- Validates deployment structure on startup: checks that prefill and decode services exist and model names match
### VirtualConnector
For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
**Scaling decision flow:**
1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
2. External system reads decision via `client.wait()`
3. External system executes scaling
4. External system reports completion via `client.complete(decision)`
5. Planner sees `scaled_decision_id >= decision_id` and proceeds
**Timeout**: If scaling isn't acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.
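The handshake can be sketched with an in-memory stand-in for the runtime state. This is not the `VirtualConnectorCoordinator` or `VirtualConnectorClient` API: every class and function name below is hypothetical, and only the `scaled_decision_id >= decision_id` check and the 1800s timeout come from the text.

```python
class InMemoryDecisionStore:
    """Single-process stand-in for the distributed runtime state (hypothetical)."""

    def __init__(self):
        self.decision_id = -1
        self.scaled_decision_id = -1
        self.num_prefill = 0
        self.num_decode = 0


def planner_write(store, num_prefill, num_decode):
    # Step 1: the planner publishes a new decision with a fresh ID.
    store.decision_id += 1
    store.num_prefill, store.num_decode = num_prefill, num_decode
    return store.decision_id


def external_complete(store, decision_id):
    # Step 4: the external system acknowledges the decision it executed.
    store.scaled_decision_id = max(store.scaled_decision_id, decision_id)


def planner_may_proceed(store, decision_id, waited_s, timeout_s=1800):
    # Step 5: proceed once acknowledged, or after the timeout elapses.
    return store.scaled_decision_id >= decision_id or waited_s >= timeout_s


store = InMemoryDecisionStore()
decision = planner_write(store, num_prefill=2, num_decode=4)
blocked = planner_may_proceed(store, decision, waited_s=10)
external_complete(store, decision)
unblocked = planner_may_proceed(store, decision, waited_s=10)
```

Comparing IDs with `>=` rather than `==` lets a late acknowledgement of a newer decision also release an older wait.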
## Performance Interpolation
The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
Two interpolators are maintained:
- **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT
- **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL
The precision of the interpolators is determined by the granularity of the profiling sweep. Finer granularity requires more profiling samples but yields more accurate interpolation.
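The lookup idea can be illustrated with a one-dimensional simplification. The real interpolators work over two inputs and load NPZ data; the sketch below instead assumes a fixed context length, hand-written sample tables, and a hypothetical `interp` helper.

```python
from bisect import bisect_left


def interp(x, xs, ys):
    """Piecewise-linear interpolation with clamping; xs must be sorted ascending."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # xs[i - 1] < x <= xs[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Made-up prefill samples: ISL (tokens) -> TTFT (ms).
isl_samples = [1000, 2000, 4000]
ttft_samples = [100, 220, 500]
expected_ttft = interp(3000, isl_samples, ttft_samples)

# Made-up decode samples at one context length: throughput/GPU -> ITL (ms).
throughput_samples = [100, 200, 400]
itl_samples = [10, 14, 30]
# ITL rises with throughput here, so inverting the table gives the highest
# throughput whose interpolated ITL still meets the target.
best_throughput = interp(22, itl_samples, throughput_samples)
```

With only three samples per table, any query between points relies on the linear assumption, which is why a finer sweep improves accuracy.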
## Initialization
The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
After the delay:
1. Initialize the connector (K8s or Virtual based on `--environment`)
2. Validate deployment structure
3. Load profiling results
4. Build interpolators
5. Initialize load predictor
6. Enter main scaling loop
## Performance Considerations
- **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals.
- **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
## Known Limitations
1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
2. **Adjustment interval vs scaling latency**: If `adjustment_interval` \< time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
5. **Load-based planner deprecated**: The load-based code path exists but is non-functional with current backends (no prefill queue metrics).
## Future Work
- Support aggregated (non-disaggregated) scaling mode for single-worker deployments
- Multi-DGD coordination for shared-cluster scenarios
- Distribution-aware interpolation (beyond mean ISL/OSL)
- Adaptive adjustment interval based on observed scaling latency
## File Map
| File | Size | Purpose |
| ---------------------------- | ---- | ----------------------------------------------------- |
| `planner_core.py` | 36k | Main scaling loop, algorithm implementation |
| `perf_interpolation.py` | 13k | NPZ data loading and throughput/latency interpolation |
| `load_predictor.py` | 16k | ARIMA, Prophet, Kalman, Constant predictors |
| `pre_swept_results_utils.py` | 12k | Pre-computed H100/H200 profiling data loader |
| `kubernetes_connector.py` | 11k | K8s API integration for DGD scaling |
| `kube.py` | 7.4k | Low-level K8s client wrapper |
| `exceptions.py` | 7.2k | Custom exception hierarchy |
| `prometheus.py` | 7.3k | Prometheus query builder and client |
| `defaults.py` | 8.1k | Default configs, backend name mappings |
| `planner_argparse.py` | 6.2k | CLI argument definitions |
...@@ -32,7 +32,7 @@ Dynamo has **two independent communication planes**: ...@@ -32,7 +32,7 @@ Dynamo has **two independent communication planes**:
- **Request plane** (**`DYN_REQUEST_PLANE`**): how **RPC requests** flow between components (frontend → router → worker), via `tcp`, `http`, or `nats`. - **Request plane** (**`DYN_REQUEST_PLANE`**): how **RPC requests** flow between components (frontend → router → worker), via `tcp`, `http`, or `nats`.
- **KV event plane** (currently only **NATS** is supported): how **KV cache events** (and optional router replica sync) are distributed/persisted for KV-aware routing. - **KV event plane** (currently only **NATS** is supported): how **KV cache events** (and optional router replica sync) are distributed/persisted for KV-aware routing.
**Note:** if you are using `tcp` or `http` request plane and choose to use NATS for KV events, you must still configure NATS server using `NATS_SERVER` environment variable, e.g. `NATS_SERVER=nats://nats-hostname:port`. **Note:** If you are using `tcp` or `http` request plane with KV events enabled (default), NATS is automatically initialized. You can optionally configure `NATS_SERVER` environment variable (e.g., `NATS_SERVER=nats://nats-hostname:port`) to specify a custom NATS server; otherwise, it defaults to `localhost:4222`. To completely disable NATS, use `--no-kv-events` on the frontend.
Because they are independent, you can mix them. Because they are independent, you can mix them.
...@@ -88,7 +88,7 @@ DYN_REQUEST_PLANE=tcp python -m dynamo.vllm --model Qwen/Qwen3-0.6B ...@@ -88,7 +88,7 @@ DYN_REQUEST_PLANE=tcp python -m dynamo.vllm --model Qwen/Qwen3-0.6B
**When to use TCP:** **When to use TCP:**
- Simple deployments with direct service-to-service communication (e.g. frontend to backend) - Simple deployments with direct service-to-service communication (e.g. frontend to backend)
- Minimal infrastructure requirements (**no NATS needed unless you enable KV-event-backed routing/replica sync**) - Minimal infrastructure requirements (NATS is initialized by default for KV events but can be disabled with `--no-kv-events`)
- Low-latency requirements - Low-latency requirements
**TCP Configuration Options:** **TCP Configuration Options:**
...@@ -160,7 +160,7 @@ DYN_REQUEST_PLANE=nats python -m dynamo.vllm --model Qwen/Qwen3-0.6B ...@@ -160,7 +160,7 @@ DYN_REQUEST_PLANE=nats python -m dynamo.vllm --model Qwen/Qwen3-0.6B
**When to use NATS:** **When to use NATS:**
- Production deployments with service discovery - Production deployments with service discovery
- Currently KV based routing require NATS. If you want to completely disable NATS, KV based routing won't be available - KV-aware routing with accurate cache state tracking (requires NATS for event transport). Note: approximate mode (`--no-kv-events`) provides KV routing without NATS but with reduced accuracy.
- Need for message replay and persistence features - Need for message replay and persistence features
Limitations: Limitations:
...@@ -289,6 +289,6 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -289,6 +289,6 @@ curl http://localhost:8000/v1/chat/completions \
### Resource Usage ### Resource Usage
- **TCP**: Minimal infrastructure (no additional services required) - **TCP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
- **HTTP**: Minimal infrastructure (no additional services required) - **HTTP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
- **NATS**: Requires running NATS server (additional memory/CPU) - **NATS**: Requires running NATS server (additional memory/CPU)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Router Design
This document describes the internal architecture of the Dynamo KV Router, including block tracking mechanisms, the KV cache optimization system, event handling, and transport modes.
## KV Router Architecture
The KV Router tracks two key metrics for each worker:
1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
2. **Potential New Prefill Blocks**: The number of blocks that must be prefilled from scratch on a worker, derived from the new prefill tokens:
- New prefill tokens = Total input tokens - (Overlap blocks × Block size)
- Potential prefill blocks = New prefill tokens / Block size
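The two formulas above translate directly into code. The function name is illustrative, and the clamp to zero is an added safety assumption for the case where the reported overlap exceeds the request length.

```python
def potential_prefill_blocks(total_input_tokens, overlap_blocks, block_size):
    # New prefill tokens = total input tokens - (overlap blocks * block size)
    new_prefill_tokens = max(0, total_input_tokens - overlap_blocks * block_size)
    # Potential prefill blocks = new prefill tokens / block size
    return new_prefill_tokens / block_size


# A 160-token request with 16-token blocks and 5 cached blocks on the worker
# leaves 80 tokens, or 5 blocks, to prefill.
blocks = potential_prefill_blocks(total_input_tokens=160, overlap_blocks=5, block_size=16)
```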
### Block Tracking Mechanisms
The router maintains block information through two complementary systems:
- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
- Incremented when adding a new request
- Updated during token generation
- Decremented upon request completion
- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
## KV Cache Router
Today's leading Large Language Models (LLMs) are auto-regressive and based on the [transformer architecture](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). One key inference optimization technique is to cache the already computed keys and values and reuse them for future tokens. This is called the [KV Cache](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/#key-value_caching).
### KV Cache Routing and Load Balancing
```mermaid
graph TD
T[Tokens] --> R[KV Aware Router]
R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
style T fill:#fff3e0,stroke:#333,color:#333
style R fill:#2e8b57,stroke:#333,color:#fff
style W1 fill:#f3e5f5,stroke:#333,color:#333
style W2 fill:#c8e6c9,stroke:#333,color:#333
style W3 fill:#f3e5f5,stroke:#333,color:#333
linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
```
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions.
#### Cost Calculation
1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
- Lower costs indicate better routing choices
- `overlap_score_weight` balances cache hit optimization against load distribution
- Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
#### Worker Selection
The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
Example calculation with `overlap_score_weight = 1.0`:
- Worker 1: cost = 1.0 * 8 + 10 = 18
- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
- Worker 3: cost = 1.0 * 2 + 9 = 11
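The example calculation can be reproduced with a short sketch. The cost formula is taken from the text; the softmax branch is an assumption about one plausible reading of "normalized cost logits" (costs divided by the maximum cost, negated, then scaled by temperature) and may differ from the router's actual normalization.

```python
import math
import random


def routing_cost(prefill_blocks, decode_blocks, overlap_score_weight=1.0):
    return overlap_score_weight * prefill_blocks + decode_blocks


def select_worker(costs, temperature=0.0, rng=random):
    if temperature == 0.0:
        # Deterministic: pick the lowest-cost worker.
        return min(costs, key=costs.get)
    # Assumed normalization: scale costs to [0, 1], negate so that lower
    # cost means a higher logit, then apply the temperature.
    max_cost = max(costs.values()) or 1.0
    logits = {w: -(c / max_cost) / temperature for w, c in costs.items()}
    peak = max(logits.values())  # subtract the peak for numerical stability
    weights = {w: math.exp(logit - peak) for w, logit in logits.items()}
    workers = list(weights)
    return rng.choices(workers, weights=[weights[w] for w in workers], k=1)[0]


costs = {
    "worker-1": routing_cost(8, 10),  # 18.0
    "worker-2": routing_cost(5, 5),   # 10.0
    "worker-3": routing_cost(2, 9),   # 11.0
}
chosen = select_worker(costs)  # "worker-2"
sampled = select_worker(costs, temperature=0.5, rng=random.Random(0))
```

With a non-zero temperature, worker 3 (cost 11) is chosen nearly as often as worker 2 (cost 10), which spreads load across workers with similar costs.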
### KV Cache Optimizations
Every inference framework maintains a KV Cache on each worker. [vLLM](https://github.com/vllm-project/vllm), a popular inference framework, contributed [PagedAttention](https://arxiv.org/abs/2309.06180), which manages the KV Cache efficiently by chunking requests into blocks.
Another popular inference framework, [SGLang](https://github.com/sgl-project/sglang), contributed [RadixAttention](https://arxiv.org/abs/2312.07104), which introduced a prefix tree that allows efficient matching, insertion, and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse.
In Dynamo, we introduce a KVPublisher which emits KV Cache events that occur at each worker and a KVIndexer which keeps track of these events globally.
### KV Block Management Flow
To get a feel for how KV Cache management works on a single worker with KV Cache reuse turned on and where the KVPublisher gets plugged in, we can walk through the KV Block management flow:
1. **Request tokenization**: The incoming prompt is converted into tokens
2. **Block partitioning**: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block)
3. **Block hashing**: Each block of tokens is hashed to create a unique identifier
4. **Cache lookup**:
- For each block, the system checks if a matching block already exists in the KV cache
- If a match is found, the existing KV cache block is reused
- If no match is found, the system proceeds to the next step
5. **Resource allocation**:
- For blocks without matches, the system attempts to allocate new memory space
- If sufficient memory is available, allocate memory space and proceed to step 7
- If memory is constrained, proceed to step 6
6. **Cache eviction** (when necessary):
- The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal
- Selected blocks are evicted from the cache
- **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.**
- Alternatively, some systems may offload less-frequently used blocks to CPU memory.
7. **KV computation**:
- For new blocks, the model computes key and value tensors
- These tensors are stored in the newly allocated cache blocks
 - **KVPublisher emits a KV stored event notifying KVIndexer about newly stored blocks.**
Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching) and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/).
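Steps 2 through 4 of the flow can be sketched as follows. The hashing scheme is an assumption: each block hash is chained to the hash of the preceding blocks so that identical token blocks at different positions do not collide, and SHA-256 is used purely for illustration. Real engines use their own hash functions and block layouts.

```python
import hashlib


def block_hashes(tokens, block_size):
    """Hash each full block, chaining in the previous block's hash."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex())
    return hashes


def count_cached_prefix(hashes, cache):
    """Number of leading blocks already present in the cache."""
    matched = 0
    for h in hashes:
        if h not in cache:
            break
        matched += 1
    return matched


first_request = list(range(40))                # 2 full blocks of 16, 8 tokens left over
second_request = list(range(32)) + [99] * 16   # shares the first 2 blocks

cache = set(block_hashes(first_request, 16))
reused = count_cached_prefix(block_hashes(second_request, 16), cache)
```

Only the blocks past the reused prefix need allocation and computation, which is where the stored and removed events would be emitted.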
## Events
### KVPublisher
The KVPublisher can be initialized and then called in the inference framework where blocks are allocated and removed.
The two types of events are:
- KV stored event
- KV removed event
The publisher can be initialized and used through C bindings or Python bindings.
### Deterministic Event IDs
Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KvIndexer's internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions/configuration. If your engine relies on Python's built-in `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect.
### KVIndexer
The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.
The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary mapping each worker ID to its number of matched KV blocks.
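A minimal sketch of a prefix tree that stores worker IDs on each node is shown below. It is not the KvIndexer implementation: the class and method names are hypothetical, it takes precomputed block hashes rather than tokens, and it omits removal events.

```python
class _Node:
    def __init__(self):
        self.children = {}   # block hash -> child node
        self.workers = set() # workers that hold this block


class PrefixIndex:
    def __init__(self):
        self.root = _Node()

    def store(self, worker_id, block_hashes):
        # Walk (or create) the path for this sequence and tag each node.
        node = self.root
        for h in block_hashes:
            node = node.children.setdefault(h, _Node())
            node.workers.add(worker_id)

    def find_matches(self, block_hashes):
        # Count, per worker, how many leading blocks of the request it holds.
        scores, node = {}, self.root
        for h in block_hashes:
            node = node.children.get(h)
            if node is None:
                break
            for w in node.workers:
                scores[w] = scores.get(w, 0) + 1
        return scores


idx = PrefixIndex()
idx.store("worker-1", ["a", "b", "c"])
idx.store("worker-2", ["a", "b"])
matches = idx.find_matches(["a", "b", "c", "d"])
```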
### Inter-Router Communication
In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types:
1. **AddRequest**: Notifies other routers when a request is assigned to a worker. Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system.
2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens.
3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers.
Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams.
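The three event types and the self-event filter can be sketched as below. The dictionary-based event shape and the `ActiveBlockTracker` class are assumptions for illustration; the real events carry token sequence blocks and an overlap score rather than a plain block count.

```python
class ActiveBlockTracker:
    """Per-router view of active requests (hypothetical sketch)."""

    def __init__(self, router_id):
        self.router_id = router_id
        self.requests = {}      # request_id -> (worker_id, blocks)
        self.prefilling = set() # request_ids still in the prefill phase

    def record_local(self, event):
        # Events this router produced itself are applied directly.
        self._apply(event)

    def apply(self, event):
        # Events from the sync channel: skip our own to avoid double counting.
        if event["router_id"] == self.router_id:
            return False
        self._apply(event)
        return True

    def _apply(self, event):
        kind, rid = event["type"], event["request_id"]
        if kind == "AddRequest":
            self.requests[rid] = (event["worker_id"], event["blocks"])
            self.prefilling.add(rid)
        elif kind == "MarkPrefillCompleted":
            self.prefilling.discard(rid)
        elif kind == "Free":
            self.requests.pop(rid, None)
            self.prefilling.discard(rid)

    def active_blocks(self, worker_id):
        return sum(b for w, b in self.requests.values() if w == worker_id)


r1, r2 = ActiveBlockTracker("router-1"), ActiveBlockTracker("router-2")
add = {"type": "AddRequest", "router_id": "router-1",
       "request_id": "A", "worker_id": "w1", "blocks": 10}
r1.record_local(add)
applied_remotely = r2.apply(add)   # applied on the other replica
applied_on_self = r1.apply(add)    # ignored: same router ID
```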
## Event Transport Modes
The router supports two event transport modes for KV cache state synchronization:
- **JetStream (default)**: Persistent event stream with durable consumers. State persists across router restarts via snapshots in NATS object store. Best for production with multi-replica consistency.
- **NATS Core with Local Indexer** (`--enable-local-indexer` on workers): Fire-and-forget pub/sub where workers maintain local radix trees. Router rebuilds state by querying workers on startup. Lower latency, simpler setup.
### JetStream Mode
KV events are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts.
- **Best for**: Production deployments requiring durability and multi-replica router consistency
- **Tradeoffs**: Requires JetStream setup; slightly higher latency due to persistence guarantees
```mermaid
graph TD
subgraph Engines
E1[Engine 1<br/>KVPublisher]
E2[Engine 2<br/>KVPublisher]
E3[Engine 3<br/>KVPublisher]
end
subgraph "NATS JetStream"
JS[(Persistent KV Events Stream<br/>- Block created<br/>- Block removed)]
end
subgraph "NATS Object Store"
OS[(Radix Tree<br/>State Snapshot)]
end
subgraph "Router Replicas"
R1[Router 1<br/>KVIndexer]
R2[Router 2<br/>KVIndexer]
end
E1 -->|Publish Events| JS
E2 -->|Publish Events| JS
E3 -->|Publish Events| JS
JS -->|Consume as Durable Consumer| R1
JS -->|Consume as Durable Consumer| R2
JS -->|Periodic Snapshot| OS
style JS fill:#e1f5fe,stroke:#333,color:#333
style OS fill:#e1f5fe,stroke:#333,color:#333
style E1 fill:#f3e5f5,stroke:#333,color:#333
style E2 fill:#f3e5f5,stroke:#333,color:#333
style E3 fill:#f3e5f5,stroke:#333,color:#333
style R1 fill:#2e8b57,stroke:#333,color:#fff
style R2 fill:#2e8b57,stroke:#333,color:#fff
linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px
```
### NATS Core with Local Indexer
When workers are started with `--enable-local-indexer`, each worker maintains its own local radix tree (local indexer) and publishes events over NATS Core (fire-and-forget pub/sub) instead of JetStream. Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker's local indexer directly.
- **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios
- **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available
- **Enable with**: `--enable-local-indexer` flag on workers (vLLM, mocker)
```mermaid
graph TD
subgraph Engines
E1[Engine 1<br/>LocalKvIndexer]
E2[Engine 2<br/>LocalKvIndexer]
E3[Engine 3<br/>LocalKvIndexer]
end
subgraph "NATS Core"
NC[KV Events Pub/Sub<br/>- Block created<br/>- Block removed]
end
subgraph "Router Replicas"
R1[Router 1<br/>KVIndexer]
R2[Router 2<br/>KVIndexer]
end
E1 -->|Publish Events| NC
E2 -->|Publish Events| NC
E3 -->|Publish Events| NC
NC -->|Subscribe| R1
NC -->|Subscribe| R2
style NC fill:#e1f5fe,stroke:#333,color:#333
style E1 fill:#f3e5f5,stroke:#333,color:#333
style E2 fill:#f3e5f5,stroke:#333,color:#333
style E3 fill:#f3e5f5,stroke:#333,color:#333
style R1 fill:#2e8b57,stroke:#333,color:#fff
style R2 fill:#2e8b57,stroke:#333,color:#fff
linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px
```
**How gap detection works:**
1. Each worker assigns monotonically increasing event IDs starting from 0
2. The router tracks the last received event ID per worker
3. If an event arrives with `event_id > last_id + 1`, the router detects a gap
4. The router queries the worker's local indexer for the missing event range `[last_id+1, event_id-1]`
5. On worker discovery (Added event), the router dumps the worker's entire local indexer state
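Steps 1 through 4 can be sketched as a small tracker. The class name and return convention are hypothetical, and starting each worker at `-1` (so that event `0` is not treated as a gap) is an assumption; the actual router also ingests a full state dump on worker discovery.

```python
class GapDetector:
    def __init__(self):
        self.last_id = {}  # worker_id -> last event ID seen

    def observe(self, worker_id, event_id):
        """Return the inclusive (first, last) range to re-fetch, or None."""
        last = self.last_id.get(worker_id, -1)
        if event_id <= last:
            return None  # duplicate or stale event; nothing to recover
        self.last_id[worker_id] = event_id
        if event_id > last + 1:
            # Missing range is [last_id + 1, event_id - 1].
            return (last + 1, event_id - 1)
        return None


detector = GapDetector()
results = [detector.observe("worker-1", i) for i in (0, 1, 4, 3, 5)]
```

In this run only the jump from event 1 to event 4 reports a gap, covering events 2 and 3; the late arrival of event 3 is treated as stale.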
**Startup behavior:**
- When a worker is discovered, the router queries and ingests its full local indexer state
- When a worker is removed, the router removes all its blocks from the global radix tree
>[!Note]
> The router automatically selects the transport mode based on worker configuration. If all connected workers have `enable_local_indexer=true`, the router uses NATS Core mode. Otherwise, it uses JetStream mode.
### Local Active Block Management with Replica Sync
In addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when:
- The router receives and routes a request
- The first token is generated (prefill complete)
- The response ends (request freed)
This is managed locally in each router via a "slot manager". To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging.
```mermaid
sequenceDiagram
participant C1 as Client 1
participant R1 as Router 1<br/>(Slot Manager)
participant R2 as Router 2<br/>(Slot Manager)
participant C2 as Client 2
Note over R1,R2: Router Replica Sync Enabled
C1->>R1: Request A
activate R1
R1->>R1: Predict blocks & route to worker
R1-->>R2: Sync: AddRequest(A)
C2->>R2: Request B
activate R2
R2->>R2: Predict blocks & route to worker
R2-->>R1: Sync: AddRequest(B)
R1->>R1: First token received<br/>(prefill complete)
R1-->>R2: Sync: MarkPrefillCompleted(A)
R1->>C1: Stream response
R2->>R2: First token received<br/>(prefill complete)
R2-->>R1: Sync: MarkPrefillCompleted(B)
R2->>C2: Stream response
R1->>R1: Response complete<br/>(free blocks)
R1-->>R2: Sync: Free(A)
deactivate R1
R2->>R2: Response complete<br/>(free blocks)
R2-->>R1: Sync: Free(B)
deactivate R2
Note over R1,R2: Both routers have consistent<br/>view of active blocks
```
This dual-layer approach—persistent global KV cache state via JetStream and ephemeral active block synchronization via router replicas—enables the system to make optimal routing decisions that balance cache reuse with load distribution.
## See Also
- **[Router README](../components/router/README.md)**: Quick start guide for the KV Router
- **[Router Guide](../components/router/router-guide.md)**: Configuration, tuning, and production setup
- **[Router Examples](../components/router/router-examples.md)**: Python API usage and custom routing patterns
- **[KV Event Publishing for Custom Engines](../integrations/kv-events-custom-engines.md)**: Integrate custom inference engines with KV-aware routing
...@@ -72,7 +72,6 @@ The `model_type` can be: ...@@ -72,7 +72,6 @@ The `model_type` can be:
- `model_name`: The name to call the model. Your incoming HTTP requests model name must match this. Defaults to the hugging face repo name or the folder name. - `model_name`: The name to call the model. Your incoming HTTP requests model name must match this. Defaults to the hugging face repo name or the folder name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM. - `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16. - `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../fault-tolerance/request-migration.md). Defaults to 0.
- `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None. - `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.
See `examples/backends` for full code examples. See `examples/backends` for full code examples.
......
@@ -61,7 +61,7 @@ be operating within your distributed runtime.
The current examples use a hard-coded `namespace`. We will address the `namespace` collisions later.
Most examples require `etcd` for service discovery. `nats.io` is required for KV-aware routing with event tracking; for approximate mode (`--no-kv-events`), NATS is optional.
#### Rust `hello_world`
......
@@ -71,30 +71,16 @@ generate_endpoint.serve_endpoint(
|-----------|------------------|-----------|
| **Frontend** | N/A (HTTP server) | HTTP server handles its own shutdown |
| **Prefill Workers** | `graceful_shutdown=True` | Prefill operations must complete to avoid wasted computation |
| **Decode Workers** | `graceful_shutdown=True` | Decode operations should complete to avoid wasted computation |
| **Router** | `graceful_shutdown=True` | Ensure routing decisions complete |
### Migration Integration
Backend workers always use `graceful_shutdown=True`, meaning they wait for in-flight requests to complete until the engine is stopped. Request migration is configured at the **frontend** level via `--migration-limit`:
- When migration is enabled at the frontend, disconnected streams from failed workers are automatically retried on healthy workers
- Workers don't need to know about migration configuration - they simply complete their work or signal incomplete streams
- See [Request Migration Architecture](./request-migration.md) for details on how migration works
## Resource Cleanup
@@ -218,18 +204,18 @@ Kubernetes uses health endpoints to determine pod readiness:
### 1. Set Appropriate Grace Periods
Match `terminationGracePeriodSeconds` to your expected request completion time:
- Short requests (\< 10s): 30s grace period
- Long generation (> 30s): 120s+ grace period
### 2. Enable Request Migration
Enable migration at the frontend to allow request recovery when workers shut down:
```bash
python3 -m dynamo.frontend ... --migration-limit 3  # Allow up to 3 migration attempts
```
This allows the frontend to automatically retry disconnected streams on healthy workers.
### 3. Monitor Shutdown Metrics
......
@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0
---
# Request Cancellation Architecture
This document describes how Dynamo implements request cancellation to cancel in-flight requests between Dynamo workers. Request cancellation allows in-flight requests to terminate early, saving computational resources that would otherwise be spent on responses that are no longer needed.
## AsyncEngineContext Trait
@@ -50,7 +48,7 @@ The Python `Context` class wraps the Rust `AsyncEngineContext` and exposes the f
- **`stop_generating()`**: Issues a stop generating signal, equivalent to the Rust method
- **`async_killed_or_stopped()`**: An async method that completes when the context becomes either killed or stopped, whichever happens first. This combines the functionality of the Rust `killed()` and `stopped()` async methods using `tokio::select!`.
For a working example of request cancellation, see the [cancellation demo](https://github.com/ai-dynamo/dynamo/tree/main/examples/custom-backend/cancellation/README.md).
### Context Usage in Python
......
@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0
---
# Request Migration Architecture
This document describes how Dynamo implements request migration to handle worker failures gracefully during LLM text generation. Request migration allows in-progress requests to continue on different workers when the original worker becomes unavailable, providing fault tolerance and improved user experience.
## Overview
@@ -25,12 +23,11 @@ Key responsibilities:
### Migration Limit Configuration
The migration limit is configured at the **frontend** level and applies globally to all models served by that frontend. This parameter specifies the maximum number of times a request can be migrated to another worker:
- Default behavior: no migration allowed (migration_limit=0)
- Set via `--migration-limit` flag on the frontend
- Applies to all models served by the frontend
## Token State Tracking and Request Migration
@@ -106,9 +103,7 @@ This token accumulation mechanism ensures that migrations are truly seamless, pr
The migration system is designed with several important architectural considerations:
**Multi-Model Support**: Since a frontend may serve multiple models simultaneously, the migration limit is configured at the frontend level and applies uniformly to all models, simplifying operational management.
**State Management**: The system carefully tracks not only token sequences but also metadata such as remaining token budgets, stop conditions, and sampling parameters to ensure complete state preservation.
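As a rough illustration of the behavior described above, the following toy simulation retries a dropped stream on another worker, carrying over the tokens generated so far and the remaining token budget. It is a sketch under stated assumptions, not Dynamo's implementation: `Request`, `WorkerDisconnected`, `generate_with_migration`, and the two fake workers are hypothetical names, and real migration happens inside the frontend's request pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


class WorkerDisconnected(Exception):
    """Hypothetical error: a worker's response stream dropped mid-generation."""


@dataclass
class Request:
    token_ids: List[int]  # prompt tokens plus tokens generated so far
    max_tokens: int       # remaining output-token budget


Worker = Callable[[Request], Iterator[int]]


def generate_with_migration(request: Request, workers: List[Worker],
                            migration_limit: int) -> List[int]:
    """Stream tokens, moving to the next worker when a stream drops."""
    output: List[int] = []
    migrations = 0
    while True:
        worker = workers[migrations % len(workers)]
        try:
            for token in worker(request):
                # Accumulate state so a later worker can resume from here.
                output.append(token)
                request.token_ids.append(token)
                request.max_tokens -= 1
            return output
        except WorkerDisconnected:
            if migrations >= migration_limit:
                raise  # limit reached (0 means migration is disabled)
            migrations += 1


def flaky_worker(request: Request) -> Iterator[int]:
    """Emits two tokens, then simulates a failure."""
    yield 100
    yield 101
    raise WorkerDisconnected()


def healthy_worker(request: Request) -> Iterator[int]:
    """Emits one token per unit of remaining budget."""
    for i in range(request.max_tokens):
        yield 200 + i


request = Request(token_ids=[1, 2, 3], max_tokens=5)
tokens = generate_with_migration(request, [flaky_worker, healthy_worker], migration_limit=1)
print(tokens)             # tokens from both workers, in order
print(request.token_ids)  # prompt followed by every generated token
```

With `migration_limit=0` the same call would re-raise `WorkerDisconnected` on the first failure, which mirrors the default of no migration.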
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Disaggregated Serving Guide
[AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.
## Why Use AIConfigurator?
When deploying LLMs with Dynamo, you need to make several critical decisions:
- **Aggregated vs Disaggregated**: Which architecture gives better performance for your workload?
- **Worker Configuration**: How many prefill and decode workers to deploy?
- **Parallelism Settings**: What tensor/pipeline parallel configuration to use?
- **SLA Compliance**: How to meet your TTFT and TPOT targets?
AIConfigurator answers these questions in seconds, providing:
- Recommended configurations that meet your SLA requirements
- Ready-to-deploy Dynamo configuration files (including Kubernetes manifests)
- Performance comparisons between different deployment strategies
- Up to 1.7x better throughput compared to manual configuration
### End-to-End Workflow
![AIConfigurator end-to-end workflow](/assets/img/e2e-workflow.svg)
### Aggregated vs Disaggregated Architecture
AIConfigurator evaluates two deployment architectures and recommends the best one for your workload:
![Aggregated vs Disaggregated architecture comparison](/assets/img/arch-comparison.svg)
### When to Use Each Architecture
![Decision flowchart for choosing aggregated vs disaggregated](/assets/img/decision-flowchart.svg)
## Quick Start
```bash
# Install
pip3 install aiconfigurator
# Find optimal configuration for vLLM backend
aiconfigurator cli default \
--model_path Qwen/Qwen3-32B-FP8 \
--total_gpus 8 \
--system h200_sxm \
--backend vllm \
--backend_version 0.12.0 \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 16.67 \
--save_dir ./results_vllm
# Deploy on Kubernetes
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```
## Complete Walkthrough: vLLM on H200
This section walks through a validated example deploying Qwen3-32B-FP8 on 8× H200 GPUs using vLLM.
### Step 1: Run AIConfigurator
```bash
aiconfigurator cli default \
--model_path Qwen/Qwen3-32B-FP8 \
--system h200_sxm \
--total_gpus 8 \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 16.67 \
--backend vllm \
--backend_version 0.12.0 \
--save_dir ./results_vllm
```
**Parameters explained:**
- `--model_path`: HuggingFace model ID or local path (e.g., `Qwen/Qwen3-32B-FP8`)
- `--system`: GPU system type (`h200_sxm`, `h100_sxm`, `a100_sxm`)
- `--total_gpus`: Number of GPUs available for deployment
- `--isl` / `--osl`: Input/Output sequence lengths in tokens
- `--ttft` / `--tpot`: SLA targets - Time To First Token (ms) and Time Per Output Token (ms)
- `--backend`: Inference backend (`vllm`, `trtllm`, or `sglang`)
- `--backend_version`: Backend version (e.g., `0.12.0` for vLLM)
- `--save_dir`: Directory to save generated deployment configs
### Step 2: Review the Results
AIConfigurator outputs a comparison of aggregated vs disaggregated deployment strategies:
```text
********************************************************************************
* Dynamo aiconfigurator Final Results *
********************************************************************************
----------------------------------------------------------------------------
Input Configuration & SLA Target:
Model: Qwen/Qwen3-32B-FP8 (is_moe: False)
Total GPUs: 8
Best Experiment Chosen: disagg at 521.77 tokens/s/gpu
----------------------------------------------------------------------------
Overall Best Configuration:
- Best Throughput: 4,174.16 tokens/s
- Per-GPU Throughput: 521.77 tokens/s/gpu
- Per-User Throughput: 76.96 tokens/s/user
- TTFT: 388.11ms
- TPOT: 12.99ms
----------------------------------------------------------------------------
```
AIConfigurator (AIC) also prints the top-ranked configurations for each architecture:
```text
agg Top Configurations: (Sorted by tokens/s/gpu)
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+----------+----+
| Rank | tokens/s/gpu | tokens/s/user | TTFT   | request_latency | concurrency | total_gpus (used) | replicas | parallel | bs |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+----------+----+
| 1    | 397.31       | 60.66         | 509.14 | 8734.68         | 56 (=14x4)  | 8 (8=4x2)         | 4        | tp2pp1   | 14 |
| 2    | 349.90       | 60.98         | 412.58 | 8596.28         | 48 (=24x2)  | 8 (8=2x4)         | 2        | tp4pp1   | 24 |
| 3    | 235.62       | 62.71         | 482.57 | 8439.41         | 32 (=32x1)  | 8 (8=1x8)         | 1        | tp8pp1   | 32 |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+----------+----+

disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+-------------+-------------+
| Rank | tokens/s/gpu | tokens/s/user | TTFT   | request_latency | concurrency | total_gpus (used) | replicas | (p)parallel | (d)parallel |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+-------------+-------------+
| 1    | 521.77       | 76.96         | 388.11 | 6871.61         | 60 (=60x1)  | 8 (8=1x8)         | 1        | tp2pp1      | tp4pp1      |
| 2    | 521.77       | 63.29         | 388.11 | 8272.31         | 80 (=40x2)  | 8 (8=2x4)         | 2        | tp2pp1      | tp2pp1      |
| 3    | 260.89       | 62.81         | 388.11 | 8332.18         | 42 (=42x1)  | 8 (8=1x8)         | 1        | tp2pp1      | tp1pp1      |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+-------------+-------------+
```
**Reading the output:**
- **tokens/s/gpu**: Overall throughput efficiency — higher is better
- **tokens/s/user**: Per-request generation speed (inverse of TPOT)
- **TTFT**: Predicted time to first token
- **concurrency**: Total concurrent requests across all replicas (e.g., `56 (=14x4)` means batch size 14 × 4 replicas)
- **agg Rank 1** recommends TP2 with 4 replicas — simpler to deploy
- **disagg Rank 1** recommends 2 prefill workers (TP2) + 1 decode worker (TP4) — higher throughput but requires RDMA
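The columns are related by simple arithmetic, which is a useful sanity check when comparing rows. The sketch below recomputes a few derived values from the agg Rank 1 row. The formulas, in particular request latency ≈ TTFT + (OSL - 1) × TPOT, are inferred from the numbers shown above and may not match AIConfigurator's internal model exactly; the helper name `derive_metrics` is hypothetical.

```python
def derive_metrics(tokens_per_s_gpu, tokens_per_s_user, ttft_ms, total_gpus, osl):
    """Recompute derived values from one row of the AIC results table."""
    tpot_ms = 1000.0 / tokens_per_s_user         # inverse of per-user speed
    total_tps = tokens_per_s_gpu * total_gpus    # cluster-wide output tokens/s
    req_per_s = total_tps / osl                  # each request emits `osl` tokens
    latency_ms = ttft_ms + (osl - 1) * tpot_ms   # first token + remaining tokens
    return tpot_ms, total_tps, req_per_s, latency_ms


# agg Rank 1 row from the table above (8 GPUs, OSL 500)
tpot, tps, rps, latency = derive_metrics(397.31, 60.66, 509.14, total_gpus=8, osl=500)
print(f"TPOT ~ {tpot:.2f} ms, total ~ {tps:.0f} tokens/s, "
      f"~ {rps:.1f} req/s, request latency ~ {latency:.0f} ms")
```

These values line up with the AIC predictions used for validation in Step 4 (TPOT 16.49 ms, about 3,178 output tokens/s, about 6.3 requests/s).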
### Step 3: Deploy on Kubernetes
The `--save_dir` option writes Kubernetes manifests and engine configs for the top-ranked configuration of each architecture:
```
results_vllm/
├── agg/
│   └── top1/
│       └── agg/
│           ├── k8s_deploy.yaml       # Kubernetes DynamoGraphDeployment
│           └── agg_config.yaml       # Engine configuration
├── disagg/
│   └── top1/
│       └── disagg/
│           ├── k8s_deploy.yaml
│           ├── prefill_config.yaml
│           └── decode_config.yaml
└── pareto_frontier.png
```
#### Prerequisites
Before deploying, ensure you have:
1. **HuggingFace Token Secret** (for gated models):
```bash
kubectl create secret generic hf-token-secret \
-n your-namespace \
--from-literal=HF_TOKEN="your-huggingface-token"
```
2. **Model Cache PVC** (recommended for faster restarts):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: your-namespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```
#### Deploy the Configuration
The generated `k8s_deploy.yaml` provides a starting point. You'll typically need to customize it for your environment:
```bash
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```
**Complete deployment example** with model cache and production settings:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-agg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false  # Use existing PVC
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          imagePullPolicy: IfNotPresent
    VLLMWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 4
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi  # Required for vLLM
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--no-enable-prefix-caching"
            - "--tensor-parallel-size"
            - "2"
            - "--pipeline-parallel-size"
            - "1"
            - "--data-parallel-size"
            - "1"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-model-len"
            - "6000"
            - "--max-num-seqs"
            - "1024"
```
**Key deployment settings:**
| Setting | Purpose | Notes |
|---------|---------|-------|
| `backendFramework: vllm` | Tells Dynamo which runtime to use | Required at spec level |
| `pvcs` + `volumeMounts` | Caches model weights across restarts | Mount at `/opt/models` (not `/root/`) |
| `HF_HOME` env var | Points HuggingFace to cache location | Must match `mountPoint` |
| `sharedMemory.size: 16Gi` | IPC memory for vLLM | 16Gi for vLLM, 80Gi for TRT-LLM |
| `envFromSecret` | Injects HF_TOKEN | Required for gated models |
### Step 4: Validate with AIPerf
After deployment, validate the predictions against actual performance using [AIPerf](https://github.com/ai-dynamo/aiperf).
<Tip>
Run AIPerf **inside the cluster** to avoid network latency affecting measurements. A Kubernetes Job, like the one shown later in this step, is a convenient way to do this.
</Tip>
#### Deriving AIPerf Parameters from AIC Output
To use AIPerf to benchmark an AIC-recommended configuration, you'll need to translate AIC parameters into AIPerf profiling arguments (we are working to automate this):
![AIC-to-AIPerf parameter mapping](/assets/img/param-mapping.svg)
| AIC Output | AIPerf Parameter | Notes |
|------------|-----------------|-------|
| `concurrency: 56 (=14x4)` | `--concurrency 56` | Use total concurrency when benchmarking via the frontend |
| ISL/OSL targets | `--isl 4000 --osl 500` | Match your AIC inputs |
| - | `--num-requests 800` | Value used in this walkthrough; for more statistical stability, aim for at least `concurrency × 40` (2,240 at concurrency 56) |
| - | `--extra-inputs "ignore_eos:true"` | Ensures exact OSL tokens generated |
> **Note on concurrency**: AIC reports concurrency as `total (=bs × replicas)`. When benchmarking through the frontend (which routes to all replicas), use the total value. If benchmarking a single replica directly, use the per-replica `bs` value instead.
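Until the translation is automated, it can be scripted. The following is a minimal sketch that parses an AIC `concurrency` cell and emits the AIPerf flags listed in the table; the function name `aiperf_args` and its parameters are hypothetical, and `requests_per_slot=40` follows the `concurrency × 40` guideline, which yields a larger request count than the 800 used in this walkthrough.

```python
import re


def aiperf_args(concurrency_cell, isl, osl, requests_per_slot=40, single_replica=False):
    """Translate an AIC `concurrency` cell such as '56 (=14x4)' into AIPerf flags."""
    match = re.fullmatch(r"\s*(\d+)\s*\(=(\d+)x(\d+)\)\s*", concurrency_cell)
    if match is None:
        raise ValueError(f"unexpected concurrency format: {concurrency_cell!r}")
    total, batch_size, replicas = (int(g) for g in match.groups())
    if total != batch_size * replicas:
        raise ValueError("total concurrency does not equal bs x replicas")
    # Through the frontend use the total; against one replica use its batch size.
    concurrency = batch_size if single_replica else total
    return [
        "--concurrency", str(concurrency),
        "--isl", str(isl),
        "--osl", str(osl),
        "--num-requests", str(concurrency * requests_per_slot),
        "--extra-inputs", "ignore_eos:true",
    ]


print(" ".join(aiperf_args("56 (=14x4)", isl=4000, osl=500)))
```

Passing `single_replica=True` switches to the per-replica batch size, matching the note on concurrency above.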
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: your-namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.10
          command:
            - /bin/bash
            - -c
            - |
              pip install aiperf
              aiperf profile \
                -m Qwen/Qwen3-32B-FP8 \
                --endpoint-type chat \
                -u http://dynamo-agg-frontend:8000 \
                --isl 4000 --isl-stddev 0 \
                --osl 500 --osl-stddev 0 \
                --num-requests 800 \
                --concurrency 56 \
                --streaming \
                --extra-inputs "ignore_eos:true" \
                --num-warmup-requests 40 \
                --ui-type simple
```
```bash
# Save the Job manifest above as aiperf-job.yaml, then:
kubectl apply -f aiperf-job.yaml
kubectl logs -f -n your-namespace -l job-name=aiperf-benchmark
```
**Validated results** (Qwen3-32B-FP8, 8× H200, TP2×4 replicas, aggregated):
| Metric | AIC Prediction | Actual (avg) | Status |
|--------|---------------|--------------|--------|
| TTFT (ms) | 509 | 209 | Better than target |
| ITL/TPOT (ms) | 16.49 | 15.06 | Within 10% |
| Throughput (req/s) | ~6.3 | 6.9 | Within 10% |
| Total Output TPS | ~3,178 | 3,462 | Within 10% |
<Note>
Actual throughput typically reaches ~85-90% of AIC predictions, although the run above slightly exceeded them; ITL/TPOT tends to be the most accurately predicted metric. Expect some variance between benchmark runs; running multiple times is recommended. Enable prefix caching (`--enable-prefix-caching`) for additional TTFT improvements with repeated prompts.
</Note>
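To make the "Within 10%" judgments reproducible across runs, the comparison can be computed directly. This is a minimal sketch using the figures from the table above; `relative_error` is a hypothetical helper and the 10% threshold simply mirrors the table's wording.

```python
def relative_error(predicted, actual):
    """Signed deviation of the measurement from the prediction, as a fraction."""
    return (actual - predicted) / predicted


# (predicted, actual) pairs copied from the table above
results = {
    "TTFT (ms)": (509, 209),
    "ITL/TPOT (ms)": (16.49, 15.06),
    "Throughput (req/s)": (6.3, 6.9),
    "Total Output TPS": (3178, 3462),
}

for metric, (predicted, actual) in results.items():
    error = relative_error(predicted, actual)
    verdict = "within 10%" if abs(error) <= 0.10 else "outside 10%"
    print(f"{metric:<20} {error:+.1%}  {verdict}")
```

TTFT falls outside the 10% band, but on the favorable side, since lower latency is better.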
## Fine-Tuning Your Deployment
AIConfigurator provides a strong starting point. Here's how to iterate for production:
### Adjusting for Actual Workload
If your real workload differs from the benchmark parameters:
```bash
# For longer outputs (chat/code generation):
# increase OSL, relax TTFT target
aiconfigurator cli default \
--model_path Qwen/Qwen3-32B-FP8 \
--total_gpus 8 \
--system h200_sxm \
--backend vllm \
--backend_version 0.12.0 \
--isl 2000 \
--osl 2000 \
--ttft 1000 \
--tpot 10 \
--save_dir ./results_long_output
```
### Exploring Alternative Configurations
Use `exp` mode to compare custom configurations:
```yaml
# custom_exp.yaml
exps:
  - exp_tp2
  - exp_tp4
exp_tp2:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [2]
exp_tp4:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [4]
```
```bash
aiconfigurator cli exp --yaml_path custom_exp.yaml --save_dir ./results_custom
```
> **Critical**: Disaggregated deployments **require RDMA** for KV cache transfer. Without RDMA, performance degrades by **40x** (TTFT increases from 355ms to 10+ seconds). See "Deploying Disaggregated (RDMA Required)" below.
### Deploying Disaggregated (RDMA Required)
Disaggregated deployments transfer KV cache between prefill and decode workers. **Without RDMA, this transfer becomes a severe bottleneck**, causing 40x performance degradation.
#### Prerequisites for Disaggregated
1. **RDMA-capable network** (InfiniBand or RoCE)
2. **RDMA device plugin** installed on the cluster (provides `rdma/ib` resources)
3. **ETCD and NATS** deployed (for coordination)
#### Disaggregated DGD with RDMA
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-disagg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          imagePullPolicy: IfNotPresent
    VLLMPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 2
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: UCX_TLS
          value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"  # Enable RDMA transports
        - name: UCX_RNDV_SCHEME
          value: "get_zcopy"
        - name: UCX_RNDV_THRESH
          value: "0"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]  # Required for RDMA memory registration
          resources:
            limits:
              rdma/ib: "2"  # Request RDMA resources
            requests:
              rdma/ib: "2"
          command: ["python3", "-m", "dynamo.vllm"]
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--tensor-parallel-size"
            - "2"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-num-seqs"
            - "1"  # Prefill workers use batch size 1
            - --is-prefill-worker
    VLLMDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "4"
      sharedMemory:
        size: 16Gi
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: UCX_TLS
          value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
        - name: UCX_RNDV_SCHEME
          value: "get_zcopy"
        - name: UCX_RNDV_THRESH
          value: "0"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          resources:
            limits:
              rdma/ib: "4"
            requests:
              rdma/ib: "4"
          command: ["python3", "-m", "dynamo.vllm"]
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--tensor-parallel-size"
            - "4"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-num-seqs"
            - "1024"  # Decode workers handle high concurrency
            - --is-decode-worker
```
**Critical RDMA settings:**
| Setting | Purpose |
|---------|---------|
| `rdma/ib: "N"` | Request N RDMA resources (match TP size) |
| `IPC_LOCK` capability | Required for RDMA memory registration |
| `UCX_TLS` env var | Enables RDMA transports (rc_x, dc_x) |
| `UCX_RNDV_SCHEME=get_zcopy` | Zero-copy RDMA transfers |
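Before applying a disaggregated manifest, it can help to check that the worker plan fits the GPU budget and that each worker's `rdma/ib` request matches its tensor-parallel size, as the table suggests. The sketch below does this on plain dictionaries; `check_disagg_plan` and the dictionary layout are hypothetical, and the check assumes pipeline and data parallelism of 1, so GPUs per worker equal the TP size.

```python
def check_disagg_plan(workers, total_gpus):
    """Return a list of problems found in a disaggregated worker plan."""
    problems = []
    used = sum(w["replicas"] * w["gpus"] for w in workers.values())
    if used > total_gpus:
        problems.append(f"plan needs {used} GPUs but only {total_gpus} are available")
    for name, w in workers.items():
        if w["gpus"] != w["tp"]:
            problems.append(f"{name}: gpu limit {w['gpus']} != tensor-parallel size {w['tp']}")
        if w["rdma_ib"] != w["tp"]:
            problems.append(f"{name}: rdma/ib {w['rdma_ib']} != tensor-parallel size {w['tp']}")
    return problems


# Values copied from the DynamoGraphDeployment above
plan = {
    "VLLMPrefillWorker": {"replicas": 2, "gpus": 2, "tp": 2, "rdma_ib": 2},
    "VLLMDecodeWorker": {"replicas": 1, "gpus": 4, "tp": 4, "rdma_ib": 4},
}
print(check_disagg_plan(plan, total_gpus=8) or "plan is consistent")
```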
#### Verifying RDMA is Active
After deployment, check the worker logs for UCX initialization:
```bash
kubectl logs <prefill-worker-pod> | grep -i "UCX\|NIXL"
```
You should see:
```
NIXL INFO Backend UCX was instantiated
```
If you see only TCP transports, RDMA is not active - check your RDMA device plugin and resource requests.
### Tuning vLLM-Specific Parameters
Override vLLM engine parameters with `--generator-set`:
```bash
aiconfigurator cli default \
--model_path Qwen/Qwen3-32B-FP8 \
--total_gpus 8 \
--system h200_sxm \
--backend vllm \
--backend_version 0.12.0 \
--isl 4000 --osl 500 \
--ttft 600 --tpot 16.67 \
--save_dir ./results_tuned \
--generator-set Workers.agg.kv_cache_free_gpu_memory_fraction=0.85 \
--generator-set Workers.agg.max_num_seqs=2048
```
Run `aiconfigurator cli default --generator-help` to see all available parameters.
### Prefix Caching Considerations
Whether prefix caching helps depends on how often requests share a prefix (e.g., a common system prompt):
- **Enable prefix caching** when you have high prefix hit rates
- **Disable prefix caching** (`--no-enable-prefix-caching`) for diverse prompts
AIConfigurator's default predictions assume no prefix caching. Enable it post-deployment if your workload benefits.
## Supported Configurations
### Backends and Versions
| Backend | Versions | Status |
|---------|----------|--------|
| TensorRT-LLM | 1.0.0rc3, 1.2.0rc5 | Production |
| vLLM | 0.12.0 | Production |
| SGLang | 0.5.6.post2 | Production |
### Systems
| GPU System | TensorRT-LLM | vLLM | SGLang |
|------------|--------------|------|--------|
| H200 SXM | Yes | Yes | Yes |
| H100 SXM | Yes | Yes | Yes |
| A100 SXM | Yes | Yes | -- |
| B200 SXM | Yes | -- | Yes |
| GB200 SXM | Yes | -- | -- |
### Models
- **Dense**: GPT, LLAMA2/3, QWEN2.5/3
- **MoE**: Mixtral, DEEPSEEK_V3
## Common Use Cases
```bash
# Strict latency SLAs (real-time chat)
aiconfigurator cli default \
--model_path meta-llama/Llama-3.1-70B \
--total_gpus 16 \
--system h200_sxm \
--backend vllm \
--backend_version 0.12.0 \
--ttft 200 --tpot 8
# High throughput (batch processing)
aiconfigurator cli default \
--model_path Qwen/Qwen3-32B-FP8 \
--total_gpus 32 \
--system h200_sxm \
--backend trtllm \
--ttft 2000 --tpot 50
# Request latency constraint (end-to-end SLA)
aiconfigurator cli default \
--model_path Qwen/Qwen3-32B-FP8 \
--total_gpus 16 \
--system h200_sxm \
--backend vllm \
--backend_version 0.12.0 \
--request_latency 12000 \
--isl 4000 --osl 500
```
## Additional Options
```bash
# Web interface for interactive exploration
pip3 install "aiconfigurator[webapp]"  # quotes keep shells such as zsh from expanding the brackets
aiconfigurator webapp # Visit http://127.0.0.1:7860
# Quick config generation (no parameter sweep)
aiconfigurator cli generate \
--model_path Qwen/Qwen3-32B-FP8 \
--total_gpus 8 \
--system h200_sxm \
--backend vllm
# Check model/system support
aiconfigurator cli support \
--model_path Qwen/Qwen3-32B-FP8 \
--system h200_sxm \
--backend vllm
```
## Troubleshooting
### AIConfigurator Issues
**Model not found**: Use the full HuggingFace path (e.g., `Qwen/Qwen3-32B-FP8` not `QWEN3_32B`)
**Backend version mismatch**: Check supported versions with `aiconfigurator cli support --model_path <model> --system <system> --backend <backend>`
### Deployment Issues
**Pods crash with "Permission denied" on cache directory**:
- Mount the PVC at `/opt/models` instead of `/root/.cache/huggingface`
- Set `HF_HOME=/opt/models` environment variable
- Ensure the PVC has `ReadWriteMany` access mode
**Workers stuck in CrashLoopBackOff**:
- Check logs: `kubectl logs <pod-name> --previous`
- Verify `sharedMemory.size` is set (16Gi for vLLM, 80Gi for TRT-LLM)
- Ensure HuggingFace token secret exists and is named correctly
**Model download slow on every restart**:
- Add PVC for model caching (see deployment example above)
- Verify `volumeMounts` and `HF_HOME` are configured on workers
**"Context stopped or killed" errors (disaggregated only)**:
- Deploy ETCD and NATS infrastructure (required for KV cache transfer)
- See [Dynamo Kubernetes Guide](../../kubernetes/README.md) for platform setup
### Performance Issues
**OOM errors**: Reduce `--max-num-seqs` or increase tensor parallelism
**Performance below predictions**:
- Verify warmup requests are sufficient (40+ recommended)
- Check for competing workloads on the cluster
- Ensure KV cache memory fraction is optimized
- Run benchmarks from inside the cluster to eliminate network latency
**Disaggregated TTFT extremely high (10+ seconds)**:
This is almost always caused by **missing RDMA configuration**. Without RDMA, KV cache transfer falls back to TCP and becomes a severe bottleneck.
To diagnose:
```bash
# Check if RDMA resources are allocated
kubectl get pod <worker-pod> -o yaml | grep -A5 "resources:"
# Check UCX transport in logs
kubectl logs <worker-pod> | grep -i "UCX\|transport"
```
To fix:
1. Ensure your cluster has RDMA device plugin installed
2. Add `rdma/ib` resource requests to worker pods
3. Add `IPC_LOCK` capability to security context
4. Add UCX environment variables (see "Deploying Disaggregated (RDMA Required)" above)
**Disaggregated working but throughput lower than aggregated**:
For balanced workloads (ISL/OSL ratio between 2:1 and 10:1), aggregated is often better. Disaggregated shines for:
- Very long inputs (ISL > 8000) with short outputs
- Workloads needing independent prefill/decode scaling
## Learn More
- [AIConfigurator CLI Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/cli_user_guide.md)
- [Dynamo Deployment Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/dynamo_deployment_guide.md)
- [Dynamo Installation Guide](../../kubernetes/installation-guide.md)
- [Benchmarking Guide](../../benchmarks/benchmarking.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# LoRA Adapters
LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.
## Backend Support
| Backend | Status | Notes |
|---------|--------|-------|
| vLLM | ✅ | Full support including KV-aware routing |
| SGLang | 🚧 | In progress |
| TensorRT-LLM | ❌ | Not yet supported |
See the [Feature Matrix](../../reference/feature-matrix.md) for full compatibility details.
## Overview
Dynamo's LoRA implementation provides:
- **Dynamic loading**: Load and unload LoRA adapters at runtime without restarting workers
- **Multiple sources**: Load from local filesystem (`file://`), S3-compatible storage (`s3://`), or Hugging Face Hub (`hf://`)
- **Automatic caching**: Downloaded adapters are cached locally to avoid repeated downloads
- **Discovery integration**: Loaded LoRAs are automatically registered and discoverable via `/v1/models`
- **KV-aware routing**: Route requests to workers with the appropriate LoRA loaded
- **Kubernetes native**: Declarative LoRA management via the `DynamoModel` CRD
### Architecture
```text
┌─────────────────────────────────────────────────────────────────┐
│                        LoRA Architecture                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
│  │   Frontend   │────▶│    Router    │────▶│   Workers    │     │
│  │  /v1/models  │     │  LoRA-aware  │     │ LoRA-loaded  │     │
│  └──────────────┘     └──────────────┘     └──────────────┘     │
│                                                   │             │
│                                                   ▼             │
│                      ┌─────────────────────────────────┐        │
│                      │          LoRA Manager           │        │
│                      │  ┌───────────┐  ┌─────────────┐ │        │
│                      │  │ Downloader│  │    Cache    │ │        │
│                      │  └───────────┘  └─────────────┘ │        │
│                      └─────────────────────────────────┘        │
│                                       │                         │
│                   ┌───────────────────┼───────────────────┐     │
│                   ▼                   ▼                   ▼     │
│             ┌────────────┐      ┌────────────┐       ┌─────────┐│
│             │  file://   │      │   s3://    │       │  hf://  ││
│             │   Local    │      │  S3/MinIO  │       │(custom) ││
│             └────────────┘      └────────────┘       └─────────┘│
└─────────────────────────────────────────────────────────────────┘
```
The LoRA system consists of:
- **Rust Core** (`lib/llm/src/lora/`): High-performance downloading, caching, and validation
- **Python Manager** (`components/src/dynamo/common/lora/`): Extensible wrapper with custom source support
- **Worker Handlers** (`components/src/dynamo/vllm/handlers.py`): Load/unload API and inference integration
## Quick Start
### Prerequisites
- Dynamo installed with vLLM support
- For S3 sources: AWS credentials configured
- A LoRA adapter compatible with your base model
### Local Development
**1. Start Dynamo with LoRA support:**
```bash
# Start vLLM worker with LoRA flags
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
    python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
    --connector none \
    --enable-lora \
    --max-lora-rank 64
```
**2. Load a LoRA adapter:**
```bash
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-lora",
    "source": {
      "uri": "file:///path/to/my-lora"
    }
  }'
```
**3. Run inference with the LoRA:**
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-lora",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
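For scripted setups, steps 2 and 3 can also be expressed in Python. The sketch below only builds the two HTTP requests with the standard library, using the same endpoints and payloads as the `curl` commands; the helper names `load_lora_request` and `chat_request` are hypothetical and are not part of Dynamo.

```python
import json
import urllib.request


def load_lora_request(system_url: str, lora_name: str, uri: str) -> urllib.request.Request:
    """Build the POST /v1/loras request for a worker's system port."""
    body = {"lora_name": lora_name, "source": {"uri": uri}}
    return urllib.request.Request(
        url=f"{system_url.rstrip('/')}/v1/loras",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_request(frontend_url: str, lora_name: str, prompt: str,
                 max_tokens: int = 100) -> urllib.request.Request:
    """Build a chat completion request that selects the LoRA via `model`."""
    body = {
        "model": lora_name,  # the LoRA is addressed by its lora_name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{frontend_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing either object to `urllib.request.urlopen` sends it. With the ports used in this guide, the load request goes to `http://localhost:8081` and the chat request to `http://localhost:8000`.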
### S3-Compatible Storage
For production deployments, store LoRA adapters in S3-compatible storage:
```bash
# Configure S3 credentials
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT=http://minio:9000  # For MinIO
export AWS_ALLOW_HTTP=true             # The endpoint above is plain HTTP (see AWS_ALLOW_HTTP under Configuration)
export AWS_REGION=us-east-1

# Load LoRA from S3
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "customer-support-lora",
    "source": {
      "uri": "s3://my-loras/customer-support-v1"
    }
  }'
```
## Configuration
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `DYN_LORA_ENABLED` | Enable LoRA adapter support | `false` |
| `DYN_LORA_PATH` | Local cache directory for downloaded LoRAs | `~/.cache/dynamo_loras` |
| `AWS_ACCESS_KEY_ID` | S3 access key (for `s3://` URIs) | - |
| `AWS_SECRET_ACCESS_KEY` | S3 secret key (for `s3://` URIs) | - |
| `AWS_ENDPOINT` | Custom S3 endpoint (for MinIO, etc.) | - |
| `AWS_REGION` | AWS region | `us-east-1` |
| `AWS_ALLOW_HTTP` | Allow HTTP (non-TLS) connections | `false` |
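
Before starting a worker, it can help to check how these variables will resolve. The sketch below applies the defaults from the table and flags two likely misconfigurations for `s3://` sources. The function names are hypothetical, and the boolean parsing (`true` or `1`) and the idea that missing keys block a load are assumptions; Dynamo's own handling may differ, for example when credentials come from another provider.

```python
import os
from typing import Mapping, Optional


def lora_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the LoRA-related variables, applying the documented defaults."""
    env = os.environ if env is None else env

    def flag(name: str) -> bool:
        # Assumption: "true" or "1" (case-insensitive) means enabled.
        return env.get(name, "false").strip().lower() in ("true", "1")

    return {
        "enabled": flag("DYN_LORA_ENABLED"),
        "cache_dir": os.path.expanduser(env.get("DYN_LORA_PATH", "~/.cache/dynamo_loras")),
        "endpoint": env.get("AWS_ENDPOINT"),
        "region": env.get("AWS_REGION", "us-east-1"),
        "allow_http": flag("AWS_ALLOW_HTTP"),
        "has_credentials": bool(
            env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY")
        ),
    }


def check_s3_settings(settings: dict) -> list:
    """Return human-readable problems that would likely block an s3:// load."""
    problems = []
    if not settings["has_credentials"]:
        problems.append("AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are not both set")
    endpoint = settings["endpoint"] or ""
    if endpoint.startswith("http://") and not settings["allow_http"]:
        problems.append("AWS_ENDPOINT uses plain HTTP but AWS_ALLOW_HTTP is not enabled")
    return problems
```

Calling `check_s3_settings(lora_settings())` in the worker's environment returns an empty list when neither problem is detected.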
### vLLM Arguments
| Argument | Description |
|----------|-------------|
| `--enable-lora` | Enable LoRA adapter support in vLLM |
| `--max-lora-rank` | Maximum LoRA rank (must be >= your LoRA's rank) |
| `--max-loras` | Maximum number of LoRAs to load simultaneously |
## Backend API Reference
### Load LoRA
Load a LoRA adapter from a source URI.
```text
POST /v1/loras
```
**Request:**
```json
{
  "lora_name": "string",
  "source": {
    "uri": "string"
  }
}
```
**Response:**
```json
{
  "status": "success",
  "message": "LoRA adapter 'my-lora' loaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}
```
### List LoRAs
List all loaded LoRA adapters.
```text
GET /v1/loras
```
**Response:**
```json
{
  "status": "success",
  "loras": {
    "my-lora": 1207343256,
    "another-lora": 987654321
  },
  "count": 2
}
```
### Unload LoRA
Unload a LoRA adapter from the worker.
```text
DELETE /v1/loras/{lora_name}
```
**Response:**
```json
{
  "status": "success",
  "message": "LoRA adapter 'my-lora' unloaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}
```
## Kubernetes Deployment
For Kubernetes deployments, use the `DynamoModel` Custom Resource to declaratively manage LoRA adapters.
### DynamoModel CRD
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: customer-support-lora
  namespace: dynamo-system
spec:
  modelName: customer-support-adapter-v1
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in DGD
  modelType: lora
  source:
    uri: s3://my-models-bucket/loras/customer-support/v1
```
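
When many adapters share a base model, generating the manifests can be less error-prone than copying YAML. The following sketch builds the same fields as the manifest above. The helper names are hypothetical, and only the fields shown in this section are emitted; consult the CRD for anything further.

```python
import json


def dynamo_model_manifest(name: str, namespace: str, model_name: str,
                          base_model_name: str, uri: str) -> dict:
    """Build a DynamoModel manifest with the fields used in this guide."""
    return {
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "DynamoModel",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "modelName": model_name,
            "baseModelName": base_model_name,  # must match the deployed base model
            "modelType": "lora",
            "source": {"uri": uri},
        },
    }


def as_list(manifests: list) -> dict:
    """Wrap several manifests in a v1 List so they can be applied together."""
    return {"apiVersion": "v1", "kind": "List", "items": manifests}


def to_json(obj: dict) -> str:
    """Serialize for kubectl, which accepts JSON as well as YAML."""
    return json.dumps(obj, indent=2)
```

The JSON produced by `to_json` can be piped to `kubectl apply -f -`.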
### How It Works
When you create a `DynamoModel`:
1. **Discovers endpoints**: Finds all pods running your `baseModelName`
2. **Creates service**: Automatically creates a Kubernetes Service
3. **Loads LoRA**: Calls the LoRA load API on each endpoint
4. **Updates status**: Reports which endpoints are ready
### Verify Deployment
```bash
# Check LoRA status
kubectl get dynamomodel customer-support-lora
# Expected output:
# NAME TOTAL READY AGE
# customer-support-lora 2 2 30s
```
For complete Kubernetes deployment details, see:
- [Managing Models with DynamoModel](../../kubernetes/deployment/dynamomodel-guide.md)
- [Kubernetes LoRA Deployment Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/lora/README.md)
## Examples
| Example | Description |
|---------|-------------|
| [Local LoRA with MinIO](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/lora/README.md) | Local development with S3-compatible storage |
| [Kubernetes LoRA Deployment](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/lora/README.md) | Production deployment with DynamoModel CRD |
## Troubleshooting
### LoRA Fails to Load
**Check S3 connectivity:**
```bash
# Verify LoRA exists in S3
aws --endpoint-url=$AWS_ENDPOINT s3 ls s3://my-loras/ --recursive
```
**Check cache directory:**
```bash
ls -la ~/.cache/dynamo_loras/
```
**Check worker logs:**
```bash
# Look for LoRA-related messages
kubectl logs deployment/my-worker | grep -i lora
```
### Model Not Found After Loading
- Verify the LoRA name matches exactly (case-sensitive)
- Check if the LoRA is listed: `curl http://localhost:8081/v1/loras`
- Ensure discovery registration succeeded (check worker logs)
### Inference Returns Base Model Response
- Verify the `model` field in your request matches the `lora_name`
- Check that the LoRA is loaded on the worker handling your request
- For disaggregated serving, ensure both prefill and decode workers have the LoRA
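
The last check can be automated once the `GET /v1/loras` response from each worker has been collected. The sketch below assumes the response shape shown in the Backend API Reference; the function name and the worker labels are hypothetical.

```python
def workers_missing_lora(list_responses: dict, lora_name: str) -> list:
    """Return the workers whose GET /v1/loras response lacks `lora_name`.

    list_responses maps a worker label (for example "prefill-0") to the parsed
    JSON body returned by that worker's GET /v1/loras endpoint.
    """
    missing = []
    for worker, body in sorted(list_responses.items()):
        # Treat a non-success response as "nothing loaded".
        loaded = body.get("loras", {}) if body.get("status") == "success" else {}
        # LoRA names are case-sensitive, so compare exactly.
        if lora_name not in loaded:
            missing.append(worker)
    return missing
```

An empty result means every listed worker reports the adapter; any worker named in the result is a candidate for returning base-model output.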
## See Also
- [Feature Matrix](../../reference/feature-matrix.md) - Backend compatibility overview
- [vLLM Backend](../../backends/vllm/README.md) - vLLM-specific configuration
- [Dynamo Operator](../../kubernetes/dynamo-operator.md) - Kubernetes operator overview
- [KV-Aware Routing](../../components/router/router-guide.md) - LoRA-aware request routing
...@@ -7,24 +7,22 @@ ...@@ -7,24 +7,22 @@
Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models. Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
> [!WARNING] > [!IMPORTANT]
> **Security Requirement**: Multimodal processing must be explicitly enabled at startup. > **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
> See the relevant documentation for each backend for the necessary flags. > See the relevant documentation for each backend for the necessary flags.
>
> This prevents unintended processing of multimodal data from untrusted sources. > This prevents unintended processing of multimodal data from untrusted sources.
## Backend Documentation ## Backend Documentation
## Support Matrix ## Support Matrix
### Backend Capabilities ### Backend Capabilities
| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio | | Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
|-------|------|-------|------|-----|-------|-------|-------| |-------|------|-------|------|-----|-------|-------|-------|
| **[vLLM](vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 | | **[vLLM](multimodal-vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
| **[TRT-LLM](trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ | | **[TRT-LLM](multimodal-trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
| **[SGLang](sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | | **[SGLang](multimodal-sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668)) \* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668))
...@@ -108,7 +106,7 @@ Response ...@@ -108,7 +106,7 @@ Response
Full disaggregation with separate workers for encoding, prefill, and decode. Full disaggregation with separate workers for encoding, prefill, and decode.
There are two variants of this workflow: There are two variants of this workflow:
- Prefill-first, used by vLLM - Prefill-first, used by vLLM
- Decode-first, used by SGlang - Decode-first, used by SGLang
Prefill-first: Prefill-first:
......
...@@ -5,7 +5,7 @@ ...@@ -5,7 +5,7 @@
# SGLang Multimodal # SGLang Multimodal
This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. SGLang multimodal uses specialized **E/PD or E/P/D** flows with **NIXL (RDMA)** for zero-copy tensor transfer. This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. SGLang multimodal supports **EPD**, **E/PD**, and **E/P/D** flows, with NIXL (RDMA) for zero-copy tensor transfer in disaggregated modes.
## Support Matrix ## Support Matrix
...@@ -24,12 +24,12 @@ This document provides a comprehensive guide for multimodal inference using SGLa ...@@ -24,12 +24,12 @@ This document provides a comprehensive guide for multimodal inference using SGLa
## Deployment Patterns ## Deployment Patterns
SGLang supports E/PD and E/P/D patterns only (always has a separate encode worker). See [Multimodal Architecture Patterns](index.md#architecture-patterns) for detailed explanations. SGLang supports EPD, E/PD, and E/P/D patterns. See [Multimodal Architecture Patterns](README.md#architecture-patterns) for detailed explanations.
| Pattern | Supported | Launch Script | Notes | | Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------| |---------|-----------|---------------|-------|
| EPD (Simple Aggregated) | | N/A | Not supported | | EPD (Simple Aggregated) | | `agg.sh` | Internal encoding |
| E/PD (Encode Separate) | ✅ | `multimodal_agg.sh` | Vision encoder separate | | E/PD (Encode Separate) | ✅ | `multimodal_epd.sh` | Vision encoder separate |
| E/P/D (Full Disaggregation) | ✅ | `multimodal_disagg.sh` | KV cache via bootstrap | | E/P/D (Full Disaggregation) | ✅ | `multimodal_disagg.sh` | KV cache via bootstrap |
| EP/D (Traditional Disaggregated) | ❌ | N/A | Not supported | | EP/D (Traditional Disaggregated) | ❌ | N/A | Not supported |
...@@ -62,20 +62,72 @@ You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/l ...@@ -62,20 +62,72 @@ You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/l
git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
``` ```
## EPD Serving (Simple Aggregated)
### Components
- worker: [DecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/sglang/request_handlers/llm/decode_handler.py) handles encoding, prefilling, and decoding in a single process.
### Workflow
The `DecodeWorkerHandler` receives multimodal requests with image URLs and passes them directly to SGLang's engine. SGLang's internal `mm_data_processor` handles image fetching, loading, encoding, and token expansion.
```mermaid
flowchart LR
HTTP --> worker
worker --tokenized text + image_urls--> SGLang[SGLang Engine]
```
### Launch
```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct --chat-template qwen2-vl
```
**Client:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe the image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 50,
    "stream": false
  }' | jq
```
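
The same request body can be assembled in Python when requests are generated programmatically. This is a minimal sketch that mirrors the JSON in the `curl` example; `image_chat_payload` is a hypothetical helper name, not a Dynamo or SGLang API.

```python
import json


def image_chat_payload(model: str, text: str, image_url: str,
                       max_tokens: int = 50) -> dict:
    """Build a chat completion body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }


def to_body(payload: dict) -> bytes:
    """Encode the payload as a UTF-8 JSON request body."""
    return json.dumps(payload).encode("utf-8")
```

The bytes from `to_body` are what an HTTP client would POST to `/v1/chat/completions` with a `Content-Type: application/json` header.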
## E/PD Serving (Encode Separate) ## E/PD Serving (Encode Separate)
### Components ### Components
- workers: - workers:
- [MultimodalEncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding - [MultimodalEncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
- [MultimodalWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding. - [MultimodalWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding.
- processor: [MultimodalProcessorHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py) - processor: [MultimodalProcessorHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py)
- tokenizes the prompt using the chat template - tokenizes the prompt using the chat template
- passes the text and image url to the MultimodalEncodeWorker. - passes the text and image url to the MultimodalEncodeWorker.
### Workflow ### Workflow
The `MultimodalEncodeWorker` downloads and encodes the image and passes the embeddings to the MultimodalWorker. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The `MultimodalWorker` then prefills and decodes the prompt in the same engine, as in the [LLM aggregated serving](../backends/sglang/README.md) example. Only the processor is registered to the Dynamo frontend as an available endpoint. Workers do NOT register - they are internal components and communicate via NATS. The `MultimodalEncodeWorker` downloads and encodes the image and passes the embeddings to the MultimodalWorker. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The `MultimodalWorker` then prefills and decodes the prompt in the same engine, as in the [LLM aggregated serving](../../backends/sglang/README.md) example. Only the processor is registered to the Dynamo frontend as an available endpoint. Workers do NOT register - they are internal components and communicate via NATS.
```mermaid ```mermaid
flowchart LR flowchart LR
...@@ -93,7 +145,7 @@ flowchart LR ...@@ -93,7 +145,7 @@ flowchart LR
```bash ```bash
cd $DYNAMO_HOME/examples/backends/sglang cd $DYNAMO_HOME/examples/backends/sglang
./launch/multimodal_agg.sh ./launch/multimodal_epd.sh
``` ```
**Client:** **Client:**
...@@ -130,10 +182,10 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -130,10 +182,10 @@ curl http://localhost:8000/v1/chat/completions \
### Components ### Components
- workers: - workers:
- [MultimodalEncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding - [MultimodalEncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding
- [MultimodalWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding - [MultimodalWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding
- [MultimodalPrefillWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling - [MultimodalPrefillWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling
- processor: [MultimodalProcessorHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py) tokenizes the prompt and passes it to the MultimodalEncodeWorker. - processor: [MultimodalProcessorHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py) tokenizes the prompt and passes it to the MultimodalEncodeWorker.
### Workflow ### Workflow
...@@ -332,6 +384,7 @@ Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc. ...@@ -332,6 +384,7 @@ Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.
| Use Case | NIXL Used? | Data Transfer | Notes | | Use Case | NIXL Used? | Data Transfer | Notes |
|----------|------------|---------------|-------| |----------|------------|---------------|-------|
| EPD (Simple Aggregated) | No | N/A | All processing internal to SGLang |
| E/PD (Encode Separate) | Yes | Encoder → PD (embeddings) | Vision encoder separate | | E/PD (Encode Separate) | Yes | Encoder → PD (embeddings) | Vision encoder separate |
| E/P/D (Full Disaggregation) | Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap | | E/P/D (Full Disaggregation) | Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |
......
...@@ -31,14 +31,15 @@ You can provide multimodal inputs in the following ways: ...@@ -31,14 +31,15 @@ You can provide multimodal inputs in the following ways:
## Deployment Patterns ## Deployment Patterns
TRT-LLM supports aggregated and traditional disaggregated patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations. TRT-LLM supports aggregated and traditional disaggregated patterns. See [Architecture Patterns](README.md#architecture-patterns) for detailed explanations.
| Pattern | Supported | Launch Script | Notes | | Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------| |---------|-----------|---------------|-------|
| EPD (Simple Aggregated) | ✅ | `agg.sh` | Easiest setup | | Aggregated | ✅ | `agg.sh` | Easiest setup, single worker |
| E/PD (Encode Separate) | ❌ | N/A | Not supported | | EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal.sh` | Prefill handles encoding, 2 workers |
| E/P/D (Full Disaggregation) | 🚧 WIP | N/A | PR #4668 in progress | | E/P/D (Full - Image URLs) | ✅ | `epd_multimodal_image_and_embeddings.sh` | Standalone encoder with `MultimodalEncoder`, 3 workers |
| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal.sh` | Prefill handles encoding | | E/P/D (Full - Pre-computed Embeddings) | ✅ | `epd_multimodal_image_and_embeddings.sh` | Standalone encoder with NIXL transfer, 3 workers |
| E/P/D (Large Models) | ✅ | `epd_disagg.sh` | For Llama-4 Scout/Maverick, multi-node |
### Component Flags ### Component Flags
...@@ -47,7 +48,7 @@ TRT-LLM supports aggregated and traditional disaggregated patterns. See [Archite ...@@ -47,7 +48,7 @@ TRT-LLM supports aggregated and traditional disaggregated patterns. See [Archite
| Worker | `--modality multimodal` | Complete pipeline (aggregated) | | Worker | `--modality multimodal` | Complete pipeline (aggregated) |
| Prefill Worker | `--disaggregation-mode prefill` | Image processing + Prefill (multimodal tokenization happens here) | | Prefill Worker | `--disaggregation-mode prefill` | Image processing + Prefill (multimodal tokenization happens here) |
| Decode Worker | `--disaggregation-mode decode` | Decode only | | Decode Worker | `--disaggregation-mode decode` | Decode only |
| Encode Worker (WIP) | `--disaggregation-mode encode` | Image encoding (E/P/D flow) | | Encode Worker | `--disaggregation-mode encode` | Image encoding (E/P/D flow) |
## Aggregated Serving ## Aggregated Serving
...@@ -131,6 +132,90 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d ' ...@@ -131,6 +132,90 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving (see [Multi-node Deployment](#multi-node-deployment-slurm) below), while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs. For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving (see [Multi-node Deployment](#multi-node-deployment-slurm) below), while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.
## Full E/P/D Flow (Image URLs)
For high-performance multimodal inference, Dynamo supports a standalone encoder with an **Encode-Prefill-Decode (E/P/D)** flow using TRT-LLM's `MultimodalEncoder`. This separates the vision encoding stage from prefill and decode, enabling better GPU utilization and scalability.
### Supported Input Formats
| Format | Example | Description |
|--------|---------|-------------|
| **HTTP/HTTPS URL** | `https://example.com/image.jpg` | Remote image files |
| **Base64 Data URL** | `data:image/jpeg;base64,...` | Inline base64-encoded images |
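
A local image can be turned into the Base64 data URL form with the standard library. The sketch below is illustrative only: the helper names are hypothetical, and the prefix check is a simple string test rather than the validation the backend actually performs.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in an image_url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def input_format(url: str) -> str:
    """Classify a URL into one of the two formats listed in the table above."""
    lowered = url.lower()
    if lowered.startswith(("http://", "https://")):
        return "HTTP/HTTPS URL"
    if lowered.startswith("data:image/") and ";base64," in lowered:
        return "Base64 Data URL"
    raise ValueError(f"unsupported image URL: {url[:32]}")
```

For example, reading a file with `open(path, "rb").read()` and passing the bytes to `to_data_url` yields a string that can be placed in the `url` field of an `image_url` content part.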
### How It Works
In the full E/P/D flow:
1. **Encode Worker**: Runs TRT-LLM's `MultimodalEncoder.generate()` to process image URLs through the vision encoder and projector
2. **Prefill Worker**: Receives `disaggregated_params` containing multimodal embedding handles, processes context and generates KV cache
3. **Decode Worker**: Performs streaming token generation using the KV cache
The encode worker uses TRT-LLM's `MultimodalEncoder` class (which inherits from `BaseLLM`) and only requires the model path and batch size - no KV cache configuration is needed since it only runs the vision encoder + projector.
### How to Launch
```bash
cd $DYNAMO_HOME
# Launch 3-worker E/P/D flow with image URL support
./examples/backends/trtllm/launch/epd_multimodal_image_and_embeddings.sh
```
### Example Request
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llava-v1.6-mistral-7b-hf",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe the image"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
          }
        }
      ]
    }
  ],
  "max_tokens": 160
}'
```
### E/P/D Architecture (Image URLs)
```mermaid
sequenceDiagram
participant Client
participant Frontend
participant PrefillWorker as "Prefill Worker"
participant EncodeWorker as "Encode Worker"
participant DecodeWorker as "Decode Worker"
Client->>Frontend: POST /v1/chat/completions (image URL)
Frontend->>PrefillWorker: Route to prefill worker
PrefillWorker->>EncodeWorker: Send request (image URL)
Note over EncodeWorker: MultimodalEncoder.generate()<br/>runs vision encoder + projector
EncodeWorker->>PrefillWorker: Return disaggregated_params<br/>(multimodal_embedding_handles)
Note over PrefillWorker: Process context with embeddings<br/>Generate KV cache
PrefillWorker->>Frontend: Return prefill response
Frontend->>DecodeWorker: Route to decode worker
DecodeWorker->>Frontend: Stream response chunks
Frontend->>Client: Stream response
```
### Key Differences from EP/D (Traditional Disaggregated)
| Aspect | EP/D (Traditional) | E/P/D (Full) |
|--------|-------------------|--------------|
| **Encoding** | Prefill worker handles image encoding | Dedicated encode worker |
| **Prefill Load** | Higher (encoding + prefill) | Lower (prefill only) |
| **Use Case** | Simpler setup | Better scalability for vision-heavy workloads |
| **Launch Script** | `disagg_multimodal.sh` | `epd_multimodal_image_and_embeddings.sh` |
## Pre-computed Embeddings with E/P/D Flow ## Pre-computed Embeddings with E/P/D Flow
For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (E/P/D)** flow using **NIXL (RDMA)** for zero-copy tensor transfer. For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (E/P/D)** flow using **NIXL (RDMA)** for zero-copy tensor transfer.
...@@ -286,13 +371,13 @@ For 4 4xGB200 nodes (2 for prefill, 2 for decode): ...@@ -286,13 +371,13 @@ For 4 4xGB200 nodes (2 for prefill, 2 for decode):
1. `srun_disaggregated.sh` launches three srun jobs: frontend, prefill worker, and decode worker 1. `srun_disaggregated.sh` launches three srun jobs: frontend, prefill worker, and decode worker
2. The OpenAI frontend will dynamically discover workers as they register: 2. The OpenAI frontend will dynamically discover workers as they register:
``` ```text
INFO dynamo_run::input::http: Watching for remote model at models INFO dynamo_run::input::http: Watching for remote model at models
INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000
``` ```
3. TRT-LLM workers output progress from each MPI rank while loading 3. TRT-LLM workers output progress from each MPI rank while loading
4. When ready, the frontend logs: 4. When ready, the frontend logs:
``` ```text
INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct" INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
``` ```
...@@ -306,10 +391,11 @@ pkill srun ...@@ -306,10 +391,11 @@ pkill srun
| Use Case | Script | NIXL Used? | Data Transfer | | Use Case | Script | NIXL Used? | Data Transfer |
|----------|--------|------------|---------------| |----------|--------|------------|---------------|
| EPD (Simple Aggregated) | `agg.sh` | No | All in one worker | | Aggregated | `agg.sh` | No | All in one worker |
| EP/D (Traditional Disaggregated) | `disagg_multimodal.sh` | Optional | Prefill → Decode (KV cache via UCX or NIXL) | | EP/D (Traditional Disaggregated) | `disagg_multimodal.sh` | Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E/P/D (pre-computed embeddings) | `epd_disagg.sh` | Yes | Encoder → Prefill (embeddings via NIXL) | | E/P/D (Image URLs) | `epd_multimodal_image_and_embeddings.sh` | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) |
| E/P/D (WIP) | N/A | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) | | E/P/D (Pre-computed Embeddings) | `epd_multimodal_image_and_embeddings.sh` | Yes | Encoder → Prefill (embeddings via NIXL RDMA) |
| E/P/D (Large Models) | `epd_disagg.sh` | Yes | Encoder → Prefill (embeddings via NIXL), Prefill → Decode (KV cache) |
> **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture. > **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
...@@ -337,26 +423,29 @@ await register_llm( ...@@ -337,26 +423,29 @@ await register_llm(
| Transfer Stage | Message | NIXL Transfer | | Transfer Stage | Message | NIXL Transfer |
|----------------|---------|---------------| |----------------|---------|---------------|
| **Frontend → Prefill** | Request with image URL or embedding path | No | | **Frontend → Prefill** | Request with image URL or embedding path | No |
| **Encode → Prefill (pre-computed)** | NIXL metadata | Yes (Embeddings tensor) | | **Prefill → Encode (Image URL)** | Request with image URL | No |
| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No | | **Encode → Prefill (Image URL)** | `ep_disaggregated_params` with `multimodal_embedding_handles`, processed prompt, and token IDs | No |
| **Prefill → Decode** | Disaggregated params | Configurable (KV cache: NIXL default, UCX optional) | | **Prefill → Encode (Embedding Path)** | Request with embedding file path | No |
| **Encode → Prefill (Embedding Path)** | NIXL readable metadata + shape/dtype + auxiliary data | Yes (Embeddings tensor via RDMA) |
| **Prefill → Decode** | `disaggregated_params` with `_epd_metadata` (prompt, token IDs) | Configurable (KV cache: NIXL default, UCX optional) |
## Known Limitations ## Known Limitations
- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported
- **No video support** - No video encoder implementation - **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation - **No audio support** - No audio encoder implementation
- **Multimodal preprocessing/tokenization happens in Python** - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker - **Multimodal preprocessing/tokenization happens in Python** - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker
- **E/P/D mode is WIP** - Full E/P/D with image URLs under development
- **Multi-node H100 limitation** - Loading `meta-llama/Llama-4-Maverick-17B-128E-Instruct` with 8 nodes of H100 with TP=16 is not possible due to head count divisibility (`num_attention_heads: 40` not divisible by `tp_size: 16`) - **Multi-node H100 limitation** - Loading `meta-llama/Llama-4-Maverick-17B-128E-Instruct` with 8 nodes of H100 with TP=16 is not possible due to head count divisibility (`num_attention_heads: 40` not divisible by `tp_size: 16`)
- **llava-v1.6-mistral-7b-hf model crash** - Known compatibility issue between the TRT-LLM backend and TensorRT-LLM version `1.2.0rc6.post1`. To use the LLaVA model, download revision `52320fb52229` locally from Hugging Face.
- **Embeddings file crash** - Known compatibility issue between the TRT-LLM backend and TensorRT-LLM version `1.2.0rc6.post1`. Embedding file parsing crashes in `attach_multimodal_embeddings()`. To be fixed in the next TRT-LLM upgrade.
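
The head-count limitation in the list above is a plain divisibility check, which can be run before allocating nodes. The sketch below covers only that one constraint; the function name is hypothetical, and other limits (memory, KV head count, node topology) are not modeled.

```python
def valid_tp_sizes(num_attention_heads: int, max_tp: int) -> list:
    """List tensor-parallel sizes up to max_tp that divide the head count evenly."""
    if num_attention_heads <= 0 or max_tp <= 0:
        raise ValueError("num_attention_heads and max_tp must be positive")
    return [tp for tp in range(1, max_tp + 1) if num_attention_heads % tp == 0]


# With num_attention_heads = 40, TP=16 is rejected because 40 % 16 == 8.
print(valid_tp_sizes(40, 16))  # [1, 2, 4, 5, 8, 10]
```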
## Supported Models ## Supported Models
Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo. Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
Common examples: Common examples:
- Llama 4 Vision models (Maverick, Scout) - **Llama 4 Vision models** (Maverick, Scout) - Recommended for large-scale deployments
- Qwen2-VL models - **LLaVA models** (e.g., `llava-hf/llava-v1.6-mistral-7b-hf`) - Default model for E/P/D examples
- **Qwen2-VL models** - Supported in traditional disaggregated mode
- Other vision-language models with TRT-LLM support - Other vision-language models with TRT-LLM support
## Key Files ## Key Files
...@@ -364,8 +453,12 @@ Common examples: ...@@ -364,8 +453,12 @@ Common examples:
| File | Description | | File | Description |
|------|-------------| |------|-------------|
| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup | | `components/src/dynamo/trtllm/main.py` | Worker initialization and setup |
| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing | | `components/src/dynamo/trtllm/engine.py` | TensorRTLLMEngine wrapper (LLM and MultimodalEncoder) |
| `components/src/dynamo/trtllm/constants.py` | DisaggregationMode enum (AGGREGATED, PREFILL, DECODE, ENCODE) |
| `components/src/dynamo/trtllm/encode_helper.py` | Encode worker request processing (embedding-path and full EPD flows) |
| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing | | `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing |
| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handler factory | | `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handlers (EncodeHandler, PrefillHandler, DecodeHandler) |
| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler and disaggregation modes | | `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler with disaggregated params encoding/decoding |
| `components/src/dynamo/trtllm/utils/disagg_utils.py` | DisaggregatedParamsCodec for network transfer |
| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing |
...@@ -7,7 +7,7 @@ ...@@ -7,7 +7,7 @@
This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo. This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.
> [!WARNING] > [!IMPORTANT]
> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`. > **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
> This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64). > This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
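
The startup rule described above can be pictured as a guard over parsed arguments. This is a minimal sketch, not Dynamo's actual implementation: it assumes the three flags are simple boolean switches, and `parse_worker_args` is a hypothetical name introduced for illustration.

```python
import argparse


def parse_worker_args(argv):
    """Parse flags and reject multimodal modes that lack the explicit opt-in.

    Hypothetical sketch: the flag names come from the text above, but
    treating them as boolean switches is an assumption.
    """
    parser = argparse.ArgumentParser(prog="worker-sketch")
    parser.add_argument("--enable-multimodal", action="store_true")
    parser.add_argument("--multimodal-worker", action="store_true")
    parser.add_argument("--multimodal-processor", action="store_true")
    args = parser.parse_args(argv)

    uses_multimodal = args.multimodal_worker or args.multimodal_processor
    if uses_multimodal and not args.enable_multimodal:
        # Fail fast at startup instead of processing untrusted multimodal data.
        raise ValueError("multimodal flags require --enable-multimodal")
    return args


# Accepted: the opt-in flag accompanies the multimodal mode flag.
args = parse_worker_args(["--enable-multimodal", "--multimodal-worker"])
```

Calling `parse_worker_args(["--multimodal-worker"])` alone raises `ValueError` in this sketch, mirroring the documented startup failure.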
...@@ -29,7 +29,7 @@ This document provides a comprehensive guide for multimodal inference using vLLM ...@@ -29,7 +29,7 @@ This document provides a comprehensive guide for multimodal inference using vLLM
## Deployment Patterns ## Deployment Patterns
vLLM supports all multimodal deployment patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations. vLLM supports all multimodal deployment patterns. See [Architecture Patterns](README.md#architecture-patterns) for detailed explanations.
| Pattern | Supported | Launch Script | Notes | | Pattern | Supported | Launch Script | Notes |
|---------|-----------|---------------|-------| |---------|-----------|---------------|-------|
...@@ -69,7 +69,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) ...@@ -69,7 +69,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
**Components:** **Components:**
- workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [MultimodalPDWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding. - workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [MultimodalPDWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling and decoding.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler. - processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests. - frontend: HTTP endpoint to handle incoming requests.
...@@ -133,7 +133,7 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -133,7 +133,7 @@ curl http://localhost:8000/v1/chat/completions \
**Components:** **Components:**
- workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [MultimodalDecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling. - workers: [EncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [MultimodalDecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling.
- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler. - processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler.
- frontend: HTTP endpoint to handle incoming requests. - frontend: HTTP endpoint to handle incoming requests.
...@@ -160,8 +160,7 @@ cd $DYNAMO_HOME/examples/backends/vllm ...@@ -160,8 +160,7 @@ cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
``` ```
> [!NOTE] > [!NOTE] Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
> Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
## ECConnector Serving ## ECConnector Serving
...@@ -381,7 +380,7 @@ flowchart LR ...@@ -381,7 +380,7 @@ flowchart LR
**Launch:** **Launch:**
```bash ```bash
pip install vllm["audio"] accelerate # multimodal audio models dependency pip install 'vllm[audio]' accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_agg.sh bash launch/audio_agg.sh
``` ```
...@@ -437,7 +436,7 @@ flowchart LR ...@@ -437,7 +436,7 @@ flowchart LR
**Launch:** **Launch:**
```bash ```bash
pip install vllm["audio"] accelerate # multimodal audio models dependency pip install 'vllm[audio]' accelerate # multimodal audio models dependency
cd $DYNAMO_HOME/examples/multimodal cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_disagg.sh bash launch/audio_disagg.sh
``` ```