Commit bed29a16 authored by Ryan McCormick, committed by GitHub

docs: Update disagg and request flow design docs based on latest code (#5993)

parent dde23cc6
...@@ -10,98 +10,66 @@ The prefill and decode phases of LLM requests have different computation charact ...@@ -10,98 +10,66 @@ The prefill and decode phases of LLM requests have different computation charact
Disaggregated execution of a request has three main steps: Disaggregated execution of a request has three main steps:
1. Prefill engine computes prefill phase and generates KV cache 1. Prefill engine computes prefill phase and generates KV cache
2. Prefill engine transfers the KV cache to decode engine, and 2. Prefill engine transfers the KV cache to decode engine
3. Decode engine computes decode phase. 3. Decode engine computes decode phase.
However, not all requests’ prefill phases need to be computed in the remote prefill engine. If the prefill is short or the decode engine has a high prefix cache hit, often it is more efficient to prefill locally in the decode engine. The disaggregation design in Dynamo accounts for all these scenarios and features a flexible framework that delivers strong performance across various conditions. The disaggregation design in Dynamo features a flexible framework that delivers strong performance across various conditions.
## Efficient KV Transfer
## Design The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. The KV transfer is non-blocking, allowing GPU forward passes to continue serving other requests during the transfer.
```mermaid
sequenceDiagram
participant D as Worker
participant Q as PrefillQueue
participant P as PrefillWorker
Note over D: Request is routed to decode
D->>D: Decide if prefill should be done locally or remotely
D->>D: Allocate KV blocks
D->>Q: Put RemotePrefillRequest on the queue
P->>Q: Pull request from the queue
P-->>D: Read cached KVs from Decode
D->>D: Decode other requests
P->>P: Run prefill
P-->>D: Write prefilled KVs into allocated blocks
P->>D: Send completion notification
Note over D: Notification received when prefill is done
D->>D: Schedule decoding
```
There are four main components in Dynamo disaggregation:
- Worker: execute prefill and decode requests
- Prefill worker: execute prefill requests only
- Disaggregated router: decide whether to prefill locally or remotely
- Prefill queue: cache and load balance the remote prefill requests
When a worker receives a request, it first uses the disaggregated router to decide whether the prefill should be done locally or remotely, and allocates the KV blocks. If prefilling remotely, it pushes a remote prefill request to the prefill queue. The prefill worker then pulls the request from the prefill queue, reads the KV blocks with a prefix cache hit from the worker, computes the prefill, and writes the computed KV blocks back to the worker. Finally, the worker completes the remaining decoding.
## Conditional Disaggregation
Not all requests’ prefill phases need to be computed in the remote prefill engine. The disaggregated router decides at runtime whether the prefill phase of a request should be computed locally or remotely, based on the prefill length and the prefill queue status. Specifically, a request is sent to a remote prefill engine if the following two conditions are met:
1. The absolute prefill length, excluding the prefix cache hit, is greater than a preset threshold. On the one hand, if the prefill length of a request is short, it can be efficiently computed in the decode engine by piggybacking chunked prefill requests with ongoing decode requests. On the other hand, if the prefix cache hit is long, the prefill becomes memory bound and hence can be more efficiently computed in the decode engine.
2. The number of remote prefill requests in the prefill queue is less than a preset threshold. When the prefill queue has a large number of prefill requests, it indicates that the prefill workers are lagging behind, and it is better to prefill locally until more prefill workers join.
Conditional disaggregation allows Dynamo to achieve high performance for dynamic workloads.
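The two conditions described in this (removed) section can be sketched as a small predicate. This is a minimal illustration, not Dynamo's implementation: the names `DisaggThresholds`, `max_local_prefill_length`, `max_prefill_queue_size`, and `should_prefill_remotely`, as well as the default values, are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggThresholds:
    # Hypothetical knobs; names and defaults are illustrative assumptions.
    max_local_prefill_length: int = 512
    max_prefill_queue_size: int = 8


def should_prefill_remotely(
    prompt_tokens: int,
    prefix_hit_tokens: int,
    queue_size: int,
    thresholds: DisaggThresholds = DisaggThresholds(),
) -> bool:
    """Return True only when both conditions from the text hold."""
    # Condition 1: prefill length excluding the prefix cache hit exceeds the threshold.
    effective_prefill = max(prompt_tokens - prefix_hit_tokens, 0)
    long_enough = effective_prefill > thresholds.max_local_prefill_length
    # Condition 2: the prefill queue is not backed up.
    queue_has_room = queue_size < thresholds.max_prefill_queue_size
    return long_enough and queue_has_room


if __name__ == "__main__":
    print(should_prefill_remotely(4096, 0, 2))     # long prompt, short queue
    print(should_prefill_remotely(4096, 4000, 2))  # long prefix cache hit
    print(should_prefill_remotely(4096, 0, 8))     # queue at its limit
```

In this sketch a long prefix cache hit and a full queue both fall back to local prefill, which mirrors the reasoning given for each condition.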
## Prefill Queue
Prefill requests are computation bound (except for very short prefills) and should be executed in their dedicated iterations without any other requests to ensure fast TTFT. To balance the load across multiple prefill engines, Dynamo adopts a global prefill queue where workers push remote prefill requests and prefill workers pull and complete the requests one by one. The global prefill queue is implemented based on NATS stream to ensure high performance and availability. ### Router Orchestration
## Efficient KV Transfer The disaggregated serving flow is orchestrated by the `PrefillRouter`:
```mermaid ```mermaid
sequenceDiagram sequenceDiagram
participant D as Worker participant Client
participant SD as WorkerScheduler participant Frontend
participant SP as PrefillWorkerScheduler participant Router as PrefillRouter
participant P as PrefillWorker participant Prefill as Prefill Worker
participant Decode as Decode Worker
Client->>Frontend: Request
Frontend->>Router: Preprocessed Request
Router->>Router: Select prefill worker
Router->>Prefill: Prefill request
Prefill->>Prefill: Compute KV cache
Prefill-->>Router: disaggregated_params
Router->>Router: Select decode worker
Router->>Decode: Decode request + transfer metadata
Decode<<->>Prefill: KV transfer (NIXL)
Decode->>Decode: Generate tokens
Decode-->>Frontend: Stream tokens
Frontend-->>Client: Response
```
Note over SD: KV blocks allocated 1. **Worker Selection**: The router selects a prefill worker using KV-aware routing (based on cache overlap scores and load) or simple load balancing.
SD->>SP: Issue remote prefill request <br> with KV block descriptors via prefill queue
SP->>P: Add to in-flight batch
P-->>D: Remote NIXL read for prefix hit KV blocks (non-block) 2. **Prefill Execution**: The router sends the prefill request to the selected prefill worker. The prefill worker computes the KV cache and returns `disaggregated_params` containing backend-specific transfer metadata.
P->>P: Execute prefill
P-->>D: Remote NIXL write for computed KV blocks (non-block)
P->>SP: Notify finish 3. **Decode Routing**: The router injects the prefill result into the decode request, then routes to the decode worker.
SP->>SD: Notify finish
SD->>D: Add to in-flight batch
D->>D: Execute decode 4. **KV Transfer**: The decode worker uses the transfer metadata to coordinate with the prefill worker. NIXL handles the direct GPU-to-GPU transfer using the optimal available transport (NVLink, InfiniBand/UCX, etc.).
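The four orchestration steps listed for the `PrefillRouter` can be sketched with stub workers. Everything below is an assumption made for illustration: `StubPrefillWorker`, `StubDecodeWorker`, `least_loaded`, and `route_request` are hypothetical names, the metadata dictionary is a placeholder for the backend-specific `disaggregated_params`, and simple least-loaded selection stands in for KV-aware routing. No real KV cache or NIXL transfer is involved.

```python
from dataclasses import dataclass


@dataclass
class StubPrefillWorker:
    name: str
    load: int = 0

    def prefill(self, token_ids):
        # A real worker would compute the KV cache here; the stub only
        # returns placeholder transfer metadata.
        return {"prefill_worker": self.name, "num_tokens": len(token_ids)}


@dataclass
class StubDecodeWorker:
    name: str
    load: int = 0

    def decode(self, request):
        # A real worker would use the metadata to pull the KV cache from
        # the prefill worker before generating tokens.
        params = request["disaggregated_params"]
        tokens = [f"tok{i}" for i in range(request["max_tokens"])]
        return {
            "decode_worker": self.name,
            "kv_source": params["prefill_worker"],
            "tokens": tokens,
        }


def least_loaded(workers):
    # Simple load balancing; stands in for KV-aware worker selection.
    return min(workers, key=lambda w: w.load)


def route_request(token_ids, max_tokens, prefill_workers, decode_workers):
    prefill_worker = least_loaded(prefill_workers)            # 1. worker selection
    disaggregated_params = prefill_worker.prefill(token_ids)  # 2. prefill execution
    decode_worker = least_loaded(decode_workers)
    decode_request = {                                        # 3. inject prefill result
        "token_ids": token_ids,
        "max_tokens": max_tokens,
        "disaggregated_params": disaggregated_params,
    }
    return decode_worker.decode(decode_request)               # 4. KV transfer, then decode


if __name__ == "__main__":
    prefill_pool = [StubPrefillWorker("p0", load=3), StubPrefillWorker("p1", load=1)]
    decode_pool = [StubDecodeWorker("d0")]
    print(route_request([1, 2, 3], 2, prefill_pool, decode_pool))
```

The point of the sketch is the ordering: the decode request is only built after the prefill worker has returned its metadata, so the decode worker always knows where to fetch the KV cache from.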
```
The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. In addition, the KV transfer is non-blocking, allowing GPU forward passes to serve other requests while the transfer is in flight. ### Backend-Specific Transfer Metadata
After the KV blocks are allocated, the worker scheduler sends the remote prefill requests, which contain the memory descriptors for the allocated KV blocks, to the prefill worker scheduler via prefill queue. This allows the prefill worker to read and write from the remote KV blocks without explicit handling in the remote worker engine, thanks to the RDMA read and write NIXL operations. Once the remote prefill is done, worker scheduler simply adds the decode request to the worker in-flight. This allows workers to execute forward passes of ongoing decode/prefill requests while waiting for the remote prefill to finish. The transfer metadata format varies by backend:
To reduce the size of memory descriptors, Dynamo applies two optimizations: - **SGLang**: Uses `bootstrap_info` (host, port, room_id) for RDMA bootstrap coordination. SGLang prefill workers publish their bootstrap endpoint to the discovery service during initialization. With this mechanism, prefill can run as a background task, allowing the decode phase to begin immediately while the KV transfer proceeds in parallel.
1. After each worker finishes its initialization and allocates its entire KV cache pool, it stores the memory descriptors of all blocks (also referred to as the NIXL metadata) in ETCD, a distributed key-value store. A prefill worker loads and caches a worker's memory descriptors the first time it serves a remote prefill request issued by that worker. Thus, only the KV block IDs, instead of the full memory descriptors, are needed when issuing a remote prefill request.
2. Dynamo promotes the memory allocator in the prefill engine to allocate continuous blocks and merge continuous blocks into larger blocks to reduce the total number of KV blocks. - **vLLM**: Uses `kv_transfer_params` containing block IDs and remote worker connection info. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
- **TRTLLM**: Uses `opaque_state` containing serialized TRT-LLM internal metadata. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
For decode and prefill with different KV layouts (i.e., due to different TP), Dynamo applies a high-performance kernel that transposes the KV blocks into their matching layout in the KV receiver after the NIXL reads and before the NIXL writes.
## Runtime-Reconfigurable xPyD ## Runtime-Reconfigurable xPyD
The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows runtime-reconfigurable xPyD. Workers and prefill workers can be added and removed at runtime without any system-level synchronization or overheads. New and existing prefill workers both just simply pull remote prefill requests from NATS prefill queue. The NIXL metadata of the new or existing workers (for new prefill workers) are lazily loaded and cached when necessary. Specifically, adding and removing workers and prefill workers is as easy as: Dynamo's disaggregation design supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added and removed at runtime:
- **Add worker**: Worker registers with the discovery service and publishes its `RuntimeConfig` (including KV capacity).
- **Remove worker**: Worker drains active requests and deregisters from discovery.
- Add worker: add NIXL metadata in ETCD. The router automatically discovers new workers via the discovery service and incorporates them into routing decisions.
- Remove worker: flush engine and delete NIXL metadata in ETCD.
- Add prefill worker: no explicit action needed.
- Delete prefill worker: flush engine.
...@@ -17,255 +17,189 @@ limitations under the License. ...@@ -17,255 +17,189 @@ limitations under the License.
# Dynamo Architecture Flow # Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](../../examples/backends/vllm). Color-coded flows indicate different types of operations. This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.
> **Note**: The "Processor" shown in the diagram represents the request processing logic (tokenization, chat template application, routing) that runs within the Frontend component. It is not a separate deployment—the Frontend handles both HTTP serving and request preprocessing via the `make_engine` function.
## 🔵 Main Request Flow (Blue) ## 🔵 Main Request Flow (Blue)
The primary user journey through the system: The primary user journey through the system:
1. **Discovery (S1)**: Client discovers the service endpoint 1. **Request (S1)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000) 2. **Preprocess (S2)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
3. **Validate (S3)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it 3. **Route to Prefill (S3)**: PrefillRouter selects a prefill worker using KV-aware routing or load balancing
4. **Route (S3)**: Frontend routes the validated request to appropriate Decode Worker
## 🟠 Decision and Allocation Flow (Orange) ## 🟢 Prefill Flow (Green)
The system's intelligent routing and resource allocation: The prefill processing pipeline:
4. **Query (S4)**: Decode Worker queries for prefix cache hits to optimize processing 4. **Prefill (S4)**: Prefill worker executes the prefill computation on the input tokens and generates KV cache
5. **Disagg Decision (S5)**: Based on prefill length and queue size, the system decides whether it needs remote prefill 5. **Return Metadata (S5)**: Prefill worker returns `disaggregated_params` containing backend-specific transfer metadata
5a. **Allocate (S5a)**: Decode Worker pre-allocates KV cache blocks in its local GPU memory
6. **Queue (S6)**: If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
## 🟢 Prefill Worker Flow (Green) ## 🟠 Decode Routing Flow (Orange)
The dedicated prefill processing pipeline: Router orchestration to decode phase:
7. **NATS Pull (S7)**: PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers 6. **Route to Decode (S6)**: PrefillRouter injects prefill result into decode request and routes to decode worker
8. **Load Metadata (S8)**: PrefillWorker loads NIXL metadata from ETCD to establish GPU communication 7. **KV Transfer (S7)**: Decode worker coordinates with prefill worker for direct GPU-to-GPU KV cache transfer via NIXL
9. **Prefill (S9)**: Worker executes the prefill computation on the input tokens
10. **NIXL Transfer (S10)**: Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks
## 🟣 Completion Flow (Purple) ## 🟣 Completion Flow (Purple)
The response generation and delivery: The response generation and delivery:
11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker 8. **Decode (S8)**: Decode worker generates tokens using the transferred KV cache
12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data 9. **Response (S9)**: Generated tokens stream back through Frontend for post-processing (detokenization) and delivery to Client
13. **Response (S13)**: The generated response flows back through the Frontend for post-processing (detokenization) and delivery to the Client
## 🔗 Infrastructure Connections (Dotted lines) ## 🔗 Infrastructure Connections (Dotted lines)
Coordination and messaging support: Coordination and messaging support:
### Service Discovery ### Service Discovery
- **On Kubernetes** (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required. - **On Kubernetes** (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
- **On bare metal**: Uses etcd for service discovery and endpoint registration. - **On bare metal**: Uses etcd or filesystem for service discovery and endpoint registration.
### Request Plane ### Request Plane
- **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport. - **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport.
- **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`. - **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`.
### NATS Connections (Optional, for KV routing) ### NATS Connections (Optional, for KV routing)
- **PrefillQueue**: JetStream consumer group for reliable work distribution in disaggregated serving
- **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`) - **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)
### Planning Connections (Gold, dotted) ### Planning Connections (Gold, dotted)
- **Frontend → Planner**: Metrics collection for auto-scaling decisions - **Frontend → Planner**: Metrics collection for auto-scaling decisions
- **Planner → Workers**: Resource scaling commands for both Decode Worker and PrefillWorker - **Planner → Workers**: Resource scaling commands for workers
## Technical Implementation Details ## Technical Implementation Details
### PrefillRouter Orchestration:
- The `PrefillRouter` sits between the Frontend and workers, orchestrating disaggregated serving
- Selects prefill workers using KV-aware routing (cache overlap scores + load) or simple load balancing
- Injects transfer metadata into decode requests for KV cache coordination
### NIXL (NVIDIA Interchange Library): ### NIXL (NVIDIA Interchange Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe - Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination - Transfer metadata exchanged via `disaggregated_params` in prefill response
- PrefillWorker loads metadata to establish direct communication channels - Backend-specific coordination: SGLang uses bootstrap connections, vLLM uses block IDs, TRTLLM uses opaque state
- Block-based transfers (64–128 tokens per block) for efficient batching
### Disaggregated KV Cache: ### Disaggregated KV Cache:
- Each Decode Worker maintains local KV cache in its GPU memory - Each worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks—all transfers are direct worker-to-worker - No shared storage bottlenecks—transfers are direct worker-to-worker via NIXL
- Pre-allocated blocks ensure deterministic memory layout and performance - Non-blocking transfers allow GPU forward passes to continue during KV transfer
```mermaid ```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%% %%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
graph TD graph TD
%% Top Layer - Client & Frontend %% Top Layer - Client & Frontend
Client["<b>HTTP Client</b>"] Client["<b>HTTP Client</b>"]
S1[["<b>1 DISCOVERY</b>"]]
Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"] Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
S2[["<b>2 REQUEST</b>"]] S1[["<b>1 REQUEST</b>"]]
S2[["<b>2 PREPROCESS</b>"]]
%% Processing Layer %% Router Layer
Processor["<b>Processor</b><br/><i>Request Handler & Router</i>"] PrefillRouter["<b>PrefillRouter</b><br/><i>Orchestrates Disaggregated Serving</i>"]
S3[["<b>3 VALIDATE</b>"]] S3[["<b>3 ROUTE TO PREFILL</b>"]]
%% Infrastructure - Positioned strategically to minimize crossings %% Infrastructure
subgraph INF["<b>Infrastructure Layer</b>"] subgraph INF["<b>Infrastructure Layer</b>"]
ETCD[("<b>ETCD</b><br/><i>Service Discovery &<br/>NIXL Metadata</i>")] Discovery[("<b>Discovery</b><br/><i>Service Registry<br/>(ETCD or K8s)</i>")]
NATS[("<b>NATS</b><br/><i>Message Broker</i>")] NATS[("<b>NATS</b><br/><i>KV Events<br/>(Optional)</i>")]
Planner["<b>Planner</b><br/><i>Resource Management<br/>Auto-scaling</i>"] Planner["<b>Planner</b><br/><i>Auto-scaling</i>"]
end end
%% Worker Layer - Main processing %% Worker Layer
subgraph WL["<b>Worker Layer</b>"] subgraph WL["<b>Worker Layer</b>"]
%% VllmWorker section %% Prefill Worker
VllmWorker["<b>Decode Worker</b><br/><i>Handles Decoding & Disagg Decisions</i>"] PrefillWorker["<b>Prefill Worker</b><br/><i>Computes KV Cache</i>"]
S4[["<b>4 QUERY</b>"]] S4[["<b>4 PREFILL</b>"]]
S5[["<b>5 DISAGG DECISION</b>"]] S5[["<b>5 RETURN METADATA</b>"]]
S5a[["<b>5a ALLOCATE</b>"]]
S12[["<b>12 DECODE</b>"]] %% Decode Worker
S6[["<b>6 QUEUE</b>"]] DecodeWorker["<b>Decode Worker</b><br/><i>Token Generation</i>"]
S13[["<b>13 RESPONSE</b>"]] S6[["<b>6 ROUTE TO DECODE</b>"]]
S7[["<b>7 KV TRANSFER</b>"]]
%% Storage positioned near workers S8[["<b>8 DECODE</b>"]]
LocalKVCache[("<b>Local KV Cache</b><br/><i>Pre-allocated Blocks</i>")] S9[["<b>9 RESPONSE</b>"]]
%% Prefill System - Right side to minimize crossings %% KV Cache
subgraph PS["<b>Prefill System</b>"] PrefillKVCache[("<b>Prefill KV Cache</b><br/><i>GPU VRAM</i>")]
PrefillQueue["<b>Prefill Queue</b><br/><i>NATS JetStream<br/>Consumer Group</i>"] DecodeKVCache[("<b>Decode KV Cache</b><br/><i>GPU VRAM</i>")]
PrefillWorker["<b>Prefill Worker</b><br/><i>Dedicated Prefill Processing<br/>(Multiple Instances)</i>"]
S7[["<b>7 NATS PULL</b>"]]
S8[["<b>8 LOAD METADATA</b>"]]
S9[["<b>9 PREFILL</b>"]]
S10[["<b>10 NIXL TRANSFER</b>"]]
S11[["<b>11 NOTIFY</b>"]]
end
end end
%% Main Request Flow (Blue) - Clean vertical flow %% Main Request Flow (Blue)
Client -.-> S1 Client --> S1
S1 -->|HTTP API Call| Frontend S1 -->|HTTP API Call| Frontend
Frontend -.-> S2 Frontend --> S2
S2 -->|Process & Validate| Processor S2 -->|Tokenize & Validate| PrefillRouter
Processor -.-> S3 PrefillRouter --> S3
S3 -->|Route to Worker| VllmWorker S3 -->|Select Prefill Worker| PrefillWorker
%% VllmWorker Internal Flow (Orange) %% Prefill Flow (Green)
VllmWorker -.-> S4 PrefillWorker --> S4
S4 -->|Query Prefix Cache Hit| S5 S4 -->|Compute KV Cache| PrefillKVCache
S5 -->|Prefill Length & Queue Check| S5a PrefillWorker --> S5
S5a -->|Continue to Decode| S12 S5 -->|disaggregated_params| PrefillRouter
%% Allocation & Queuing (Orange) - Minimize crossings %% Decode Routing Flow (Orange)
S5a -->|Allocate KV Cache Blocks| LocalKVCache PrefillRouter --> S6
VllmWorker --> S6 S6 -->|Inject Transfer Metadata| DecodeWorker
S6 -->|Put RemotePrefillRequest| PrefillQueue DecodeWorker --> S7
S7 -->|NIXL GPU-to-GPU| PrefillKVCache
%% Prefill Worker Flow (Green) - Self-contained within PS PrefillKVCache -.->|Direct Transfer| DecodeKVCache
PrefillQueue -.-> S7
S7 -->|Consumer Group Pull| PrefillWorker %% Completion Flow (Purple)
PrefillWorker -.-> S8 DecodeWorker --> S8
PrefillWorker -.-> S9 S8 -->|Generate Tokens| DecodeKVCache
S9 -->|Execute Prefill| S10 DecodeWorker --> S9
S10 -->|Direct GPU Transfer| LocalKVCache S9 -->|Stream Tokens| Frontend
PrefillWorker --> S11 Frontend -->|HTTP Response| Client
%% Return Flow (Purple) - Clean return path %% Infrastructure Connections
S11 -->|Completion Notification| S12 Frontend -.->|Service Discovery| Discovery
S12 -->|Decode from KV Cache| S13 PrefillRouter -.->|Worker Discovery| Discovery
S13 -->|Post-process Response| Processor PrefillWorker -.->|Register| Discovery
Processor -->|HTTP Response| Frontend DecodeWorker -.->|Register| Discovery
Frontend -->|Final Response| Client Planner -.->|Service Discovery| Discovery
%% Infrastructure Connections - Organized to avoid crossings %% NATS for KV events (optional)
%% ETCD Connections - Grouped by proximity PrefillWorker -.->|KV Events| NATS
Frontend -.->|Service Discovery| ETCD DecodeWorker -.->|KV Events| NATS
Processor -.->|Service Discovery| ETCD
VllmWorker -.->|NIXL Metadata| ETCD %% Planning Connections
PrefillWorker -.->|NIXL Metadata| ETCD
S8 -.->|Load NIXL Metadata| ETCD
Planner -.->|Service Discovery| ETCD
%% NATS Connections - Direct to queue system
PrefillQueue -.->|JetStream| NATS
Processor -.->|Load Balancing| NATS
%% Planning Connections - Strategic positioning
Frontend -.->|Metrics| Planner Frontend -.->|Metrics| Planner
Planner -.->|Auto-scaling| VllmWorker
Planner -.->|Auto-scaling| PrefillWorker Planner -.->|Auto-scaling| PrefillWorker
Planner -.->|Auto-scaling| DecodeWorker
%% Styling - Each component with unique colors %% Styling
classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
classDef processor fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px classDef router fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
classDef prefillQueue fill:#fff8e1,stroke:#E65100,stroke-width:3px classDef prefillWorker fill:#e8f5e9,stroke:#388E3C,stroke-width:3px
classDef prefillWorker fill:#fce4ec,stroke:#C2185B,stroke-width:3px
classDef prefillBox fill:#eceff1,stroke:#455A64,stroke-width:3px
classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
classDef etcd fill:#fff9c4,stroke:#F9A825,stroke-width:3px classDef discovery fill:#fff9c4,stroke:#F9A825,stroke-width:3px
classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
class Client client class Client client
class Frontend frontend class Frontend frontend
class Processor processor class PrefillRouter router
class VllmWorker worker class DecodeWorker worker
class PrefillQueue prefillQueue
class PrefillWorker prefillWorker class PrefillWorker prefillWorker
class Planner planner class Planner planner
class LocalKVCache storage class PrefillKVCache,DecodeKVCache storage
class ETCD etcd class Discovery discovery
class NATS nats class NATS nats
class PS prefillBox
class INF infraLayer class INF infraLayer
class WL workerLayer class WL workerLayer
%% Flow Colors
%% Main Request Flow - Blue
linkStyle 0,1,2,3,4,5 stroke:#1565C0,stroke-width:4px
%% Prefill Flow - Green
linkStyle 6,7,8,9 stroke:#2E7D32,stroke-width:4px
%% Decode Routing Flow - Orange
linkStyle 10,11,12,13,14 stroke:#E65100,stroke-width:4px
%% Completion Flow - Purple
linkStyle 15,16,17,18,19 stroke:#6A1B9A,stroke-width:4px
%% Flow Colors - Different line styles to reduce visual clutter %% Infrastructure - Gray dotted
%% Main Request Flow - Blue (solid) linkStyle 20,21,22,23,24,25,26,27,28,29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 0 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 1 stroke:#1565C0,stroke-width:4px
linkStyle 2 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 3 stroke:#1565C0,stroke-width:4px
linkStyle 4 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 5 stroke:#1565C0,stroke-width:4px
%% Decision & Allocation Flow - Orange (mixed)
linkStyle 6 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 7 stroke:#E65100,stroke-width:4px
linkStyle 8 stroke:#E65100,stroke-width:4px
linkStyle 9 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
%% KV Cache & Queue - Orange (solid)
linkStyle 10 stroke:#E65100,stroke-width:4px
linkStyle 11 stroke:#E65100,stroke-width:4px
linkStyle 12 stroke:#E65100,stroke-width:4px
%% Prefill Worker Flow - Green (mixed)
linkStyle 13 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 14 stroke:#2E7D32,stroke-width:4px
linkStyle 15 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 16 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 17 stroke:#2E7D32,stroke-width:4px
linkStyle 18 stroke:#2E7D32,stroke-width:4px
linkStyle 19 stroke:#2E7D32,stroke-width:4px
%% Completion Flow - Purple (mixed)
linkStyle 20 stroke:#6A1B9A,stroke-width:4px
linkStyle 21 stroke:#6A1B9A,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 22 stroke:#6A1B9A,stroke-width:4px
linkStyle 23 stroke:#6A1B9A,stroke-width:4px
linkStyle 24 stroke:#6A1B9A,stroke-width:4px
%% Infrastructure Flows - Lighter and dotted to reduce visual noise
%% ETCD Connections - Gray (dotted, thinner)
linkStyle 25 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 26 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 27 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 28 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 30 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
%% NATS Connections - Teal (dotted, thinner)
linkStyle 31 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 32 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
%% Planning Connections - Gold (dotted, thinner)
linkStyle 33 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 34 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 35 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
``` ```