@@ -10,98 +10,66 @@ The prefill and decode phases of LLM requests have different computation charact
...
@@ -10,98 +10,66 @@ The prefill and decode phases of LLM requests have different computation charact
Disaggregated execution of a request has three main steps:
Disaggregated execution of a request has three main steps:
1. Prefill engine computes prefill phase and generates KV cache
1. Prefill engine computes prefill phase and generates KV cache
2. Prefill engine transfers the KV cache to decode engine, and
2. Prefill engine transfers the KV cache to decode engine
3. Decode engine computes decode phase.
3. Decode engine computes decode phase.
However, not every request's prefill phase needs to be computed in the remote prefill engine. If the prefill is short or the decode engine has a high prefix cache hit rate, it is often more efficient to prefill locally in the decode engine. The disaggregation design in Dynamo accounts for all these scenarios and features a flexible framework that delivers strong performance across various conditions.
The disaggregation design in Dynamo features a flexible framework that delivers strong performance across various conditions.
## Efficient KV Transfer
## Design
The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. The KV transfer is non-blocking, allowing GPU forward passes to continue serving other requests during the transfer.
```mermaid
sequenceDiagram
participant D as Worker
participant Q as PrefillQueue
participant P as PrefillWorker
Note over D: Request is routed to decode
D->>D: Decide if prefill should be done locally or remotely
D->>D: Allocate KV blocks
D->>Q: Put RemotePrefillRequest on the queue
P->>Q: Pull request from the queue
P-->>D: Read cached KVs from Decode
D->>D: Decode other requests
P->>P: Run prefill
P-->>D: Write prefilled KVs into allocated blocks
P->>D: Send completion notification
Note over D: Notification received when prefill is done
D->>D: Schedule decoding
```
There are four main components in Dynamo disaggregation:
- Worker: executes prefill and decode requests
- Prefill worker: executes prefill requests only
- Disaggregated router: decides whether to prefill locally or remotely
- Prefill queue: caches and load balances the remote prefill requests
When a worker receives a request, it first uses the disaggregated router to decide whether the prefill should be done locally or remotely, and allocates the KV blocks. If prefilling remotely, it pushes a remote prefill request to the prefill queue. A prefill worker then pulls the request from the prefill queue, reads the KV blocks with prefix cache hits from the worker, computes the prefill, and writes the computed KV blocks back to the worker. Finally, the worker completes the remaining decoding.
## Conditional Disaggregation
Not every request's prefill phase needs to be computed in the remote prefill engine. The disaggregated router decides at runtime whether the prefill phase of a request should be computed locally or remotely, based on the prefill length and the prefill queue status. Specifically, a request is sent to the remote prefill engine if the following two conditions are met:
1. The absolute prefill length without prefix cache hit is greater than a preset threshold. On the one hand, if the prefill length of a request is short, it can be efficiently computed in the decode engine by piggybacking chunked prefill requests with ongoing decode requests. On the other hand, if the prefix cache hit is long, the prefill becomes memory bound and hence can be more efficiently computed in the decode engine.
2. The number of remote prefill requests in the prefill queue is less than a preset threshold. When the prefill queue has a large number of prefill requests, it indicates that the prefill workers are lagging behind, and it is better to prefill locally until more prefill workers join.
Conditional disaggregation allows Dynamo to achieve high performance for dynamic workloads.
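The two conditions above can be sketched as a single predicate. This is a minimal illustration, not Dynamo's actual router code: the class name `DisaggRouterConfig`, the function `should_prefill_remotely`, and both threshold values are hypothetical, since the text only states that the thresholds are preset.

```python
from dataclasses import dataclass


@dataclass
class DisaggRouterConfig:
    # Hypothetical names and defaults; the source only says both thresholds are "preset".
    max_local_prefill_length: int = 512  # tokens
    max_prefill_queue_size: int = 4      # pending remote prefill requests


def should_prefill_remotely(
    prompt_len: int,
    prefix_hit_len: int,
    prefill_queue_size: int,
    cfg: DisaggRouterConfig,
) -> bool:
    """Return True only if both conditions for remote prefill hold."""
    # Condition 1: the prefill length excluding the prefix cache hit is long enough.
    effective_len = max(prompt_len - prefix_hit_len, 0)
    long_enough = effective_len > cfg.max_local_prefill_length
    # Condition 2: the prefill queue is not backed up.
    queue_has_room = prefill_queue_size < cfg.max_prefill_queue_size
    return long_enough and queue_has_room
```

With these assumed thresholds, a 4096-token prompt with no cache hit and an empty queue goes remote, while the same prompt with a 3900-token prefix hit stays local because only 196 tokens remain to be prefilled.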
## Prefill Queue
Prefill requests are computation bound (except for very short prefills) and should be executed in their own dedicated iterations, without any other requests, to ensure fast TTFT. To balance the load across multiple prefill engines, Dynamo adopts a global prefill queue: workers push remote prefill requests, and prefill workers pull and complete the requests one by one. The global prefill queue is built on a NATS stream to ensure high performance and availability.
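The pull-based load balancing can be illustrated with an in-process stand-in. The sketch below uses Python's `queue.Queue` in place of the NATS stream, and `RemotePrefillRequest`, `PrefillQueue`, and `drain` are hypothetical names chosen for illustration; it only shows the push/pull pattern, not Dynamo's implementation.

```python
import queue
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RemotePrefillRequest:
    request_id: str
    block_ids: List[int]  # KV blocks pre-allocated on the requesting worker


class PrefillQueue:
    """In-process stand-in for the global prefill queue (NATS in Dynamo)."""

    def __init__(self) -> None:
        self._q: "queue.Queue[RemotePrefillRequest]" = queue.Queue()

    def push(self, req: RemotePrefillRequest) -> None:
        self._q.put(req)

    def pull(self) -> Optional[RemotePrefillRequest]:
        # Non-blocking pull; returns None when no request is pending.
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def __len__(self) -> int:
        return self._q.qsize()


def drain(q: PrefillQueue, prefill_workers: List[str]) -> Dict[str, List[str]]:
    """Each prefill worker pulls one request at a time until the queue is empty."""
    served: Dict[str, List[str]] = {w: [] for w in prefill_workers}
    while len(q) > 0:
        for w in prefill_workers:
            req = q.pull()
            if req is None:
                break
            served[w].append(req.request_id)
    return served
```

Because prefill workers pull rather than receive pushed work, a slow prefill worker simply pulls less often and the remaining requests go to whichever worker is free next.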
### Router Orchestration
## Efficient KV Transfer
The disaggregated serving flow is orchestrated by the `PrefillRouter`:
```mermaid
```mermaid
sequenceDiagram
sequenceDiagram
participant D as Worker
participant Client
participant SD as WorkerScheduler
participant Frontend
participant SP as PrefillWorkerScheduler
participant Router as PrefillRouter
participant P as PrefillWorker
participant Prefill as Prefill Worker
participant Decode as Decode Worker
Client->>Frontend: Request
Frontend->>Router: Preprocessed Request
Router->>Router: Select prefill worker
Router->>Prefill: Prefill request
Prefill->>Prefill: Compute KV cache
Prefill-->>Router: disaggregated_params
Router->>Router: Select decode worker
Router->>Decode: Decode request + transfer metadata
Decode<<->>Prefill: KV transfer (NIXL)
Decode->>Decode: Generate tokens
Decode-->>Frontend: Stream tokens
Frontend-->>Client: Response
```
Note over SD: KV blocks allocated
1. **Worker Selection**: The router selects a prefill worker using KV-aware routing (based on cache overlap scores and load) or simple load balancing.
SD->>SP: Issue remote prefill request <br> with KV block descriptors via prefill queue
SP->>P: Add to in-flight batch
P-->>D: Remote NIXL read for prefix hit KV blocks (non-block)
2. **Prefill Execution**: The router sends the prefill request to the selected prefill worker. The prefill worker computes the KV cache and returns `disaggregated_params` containing backend-specific transfer metadata.
P->>P: Execute prefill
P-->>D: Remote NIXL write for computed KV blocks (non-block)
P->>SP: Notify finish
3. **Decode Routing**: The router injects the prefill result into the decode request, then routes to the decode worker.
SP->>SD: Notify finish
SD->>D: Add to in-flight batch
D->>D: Execute decode
4. **KV Transfer**: The decode worker uses the transfer metadata to coordinate with the prefill worker. NIXL handles the direct GPU-to-GPU transfer using the optimal available transport (NVLink, InfiniBand/UCX, etc.).
```
The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer the KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. In addition, the KV transfer is non-blocking, so GPU forward passes can keep serving other requests while the transfer is in progress.
### Backend-Specific Transfer Metadata
After the KV blocks are allocated, the worker scheduler sends the remote prefill request, which contains the memory descriptors for the allocated KV blocks, to the prefill worker scheduler via the prefill queue. Thanks to the RDMA read and write operations in NIXL, the prefill worker can read from and write to the remote KV blocks without explicit handling in the remote worker engine. Once the remote prefill is done, the worker scheduler simply adds the decode request to the worker's in-flight batch. This allows workers to execute forward passes of ongoing decode/prefill requests while waiting for the remote prefill to finish.
The transfer metadata format varies by backend:
To reduce the size of memory descriptors, Dynamo applies two optimizations:
- **SGLang**: Uses `bootstrap_info` (host, port, room_id) for RDMA bootstrap coordination. SGLang prefill workers publish their bootstrap endpoint to the discovery service during initialization. With this mechanism, prefill can run as a background task, allowing the decode phase to begin immediately while the KV transfer proceeds in parallel.
1. After each worker finishes initialization and allocates its KV cache pool, it stores the memory descriptors of all blocks (also referred to as the NIXL metadata) in ETCD, a distributed key-value store. A prefill worker loads and caches a worker's memory descriptors the first time it serves a remote prefill request issued by that worker. Thus, a remote prefill request only needs to carry KV block IDs instead of full memory descriptors.
2. Dynamo encourages the memory allocator in the prefill engine to allocate contiguous blocks and merges contiguous blocks into larger blocks to reduce the total number of KV blocks.
- **vLLM**: Uses `kv_transfer_params` containing block IDs and remote worker connection info. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
- **TRTLLM**: Uses `opaque_state` containing serialized TRT-LLM internal metadata. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
For decode and prefill with different KV layouts (e.g., due to different TP sizes), Dynamo applies a high-performance kernel that transposes the KV blocks into the matching layout on the KV receiver side, after the NIXL reads and before the NIXL writes.
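The block-merging optimization described earlier in this section (merging contiguous KV blocks into larger blocks to shrink the descriptor list) can be sketched as a run-length merge over block IDs. The function name `merge_contiguous_blocks` and the `(start_block, num_blocks)` representation are assumptions made for illustration, not Dynamo's actual data structures.

```python
from typing import List, Tuple


def merge_contiguous_blocks(block_ids: List[int]) -> List[Tuple[int, int]]:
    """Collapse a request's block IDs into (start_block, num_blocks) runs.

    The logical order of the blocks is preserved: only blocks that are
    adjacent in the list AND consecutive in ID are merged.
    """
    runs: List[Tuple[int, int]] = []
    for block_id in block_ids:
        if runs and runs[-1][0] + runs[-1][1] == block_id:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)  # extend the current run
        else:
            runs.append((block_id, 1))      # start a new run
    return runs
```

For a request holding blocks `[4, 5, 6, 9, 10, 20]`, six per-block descriptors collapse into three runs, so a contiguity-friendly allocator directly reduces the number of transfer descriptors.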
## Runtime-Reconfigurable xPyD
## Runtime-Reconfigurable xPyD
The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows runtime-reconfigurable xPyD. Workers and prefill workers can be added and removed at runtime without any system-level synchronization or overhead. New and existing prefill workers simply pull remote prefill requests from the NATS prefill queue. The NIXL metadata of new workers (or of existing workers, for new prefill workers) is lazily loaded and cached when needed. Specifically, adding and removing workers and prefill workers is as easy as:
Dynamo's disaggregation design supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added and removed at runtime:
- **Add worker**: Worker registers with the discovery service and publishes its `RuntimeConfig` (including KV capacity).
- **Remove worker**: Worker drains active requests and deregisters from discovery.
- Add worker: add NIXL metadata in ETCD.
The router automatically discovers new workers via the discovery service and incorporates them into routing decisions.
- Remove worker: flush engine and delete NIXL metadata in ETCD.
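The add and remove operations can be pictured with a toy in-memory registry. This is only a sketch under assumptions: `DiscoveryService`, its methods, and the `kv_capacity_blocks` field are hypothetical stand-ins, and the real `RuntimeConfig` and discovery service in Dynamo may look quite different.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RuntimeConfig:
    kv_capacity_blocks: int  # hypothetical field standing in for "KV capacity"


class DiscoveryService:
    """Toy registry; a router would consult it on every routing decision."""

    def __init__(self) -> None:
        self._workers: Dict[str, Tuple[str, RuntimeConfig]] = {}

    def register(self, worker_id: str, role: str, config: RuntimeConfig) -> None:
        # role is "prefill" or "decode" in this sketch
        self._workers[worker_id] = (role, config)

    def deregister(self, worker_id: str) -> None:
        # Called after the worker has drained its active requests.
        self._workers.pop(worker_id, None)

    def workers(self, role: str) -> List[str]:
        return sorted(w for w, (r, _) in self._workers.items() if r == role)

    def topology(self) -> str:
        """Current xPyD shape, e.g. '2P1D'."""
        return f"{len(self.workers('prefill'))}P{len(self.workers('decode'))}D"
```

Since the router reads the registry at routing time, registering or deregistering a worker changes the xPyD shape without restarting or synchronizing the other workers.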
@@ -17,255 +17,189 @@ limitations under the License.
...
@@ -17,255 +17,189 @@ limitations under the License.
# Dynamo Architecture Flow
# Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](../../examples/backends/vllm). Color-coded flows indicate different types of operations.
This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.
> **Note**: The "Processor" shown in the diagram represents the request processing logic (tokenization, chat template application, routing) that runs within the Frontend component. It is not a separate deployment—the Frontend handles both HTTP serving and request preprocessing via the `make_engine` function.
## 🔵 Main Request Flow (Blue)
## 🔵 Main Request Flow (Blue)
The primary user journey through the system:
The primary user journey through the system:
1. **Discovery (S1)**: Client discovers the service endpoint
1. **Request (S1)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
2. **Preprocess (S2)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
3. **Validate (S3)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
3. **Route to Prefill (S3)**: PrefillRouter selects a prefill worker using KV-aware routing or load balancing
4. **Route (S3)**: Frontend routes the validated request to appropriate Decode Worker
## 🟠 Decision and Allocation Flow (Orange)
## 🟢 Prefill Flow (Green)
The system's intelligent routing and resource allocation:
The prefill processing pipeline:
4. **Query (S4)**: Decode Worker queries for prefix cache hits to optimize processing
4. **Prefill (S4)**: Prefill worker executes the prefill computation on the input tokens and generates KV cache
5. **Disagg Decision (S5)**: Based on prefill length and queue size, the system decides whether it needs remote prefill