docs: Update disagg and request flow design docs based on latest code (#5993)

Commit bed29a16, authored Feb 07, 2026 by Ryan McCormick, committed by GitHub Feb 07, 2026

Showing with 153 additions and 251 deletions

docs/design_docs/disagg_serving.md +40 -72

docs/design_docs/dynamo_flow.md +113 -179

--- a/docs/design_docs/disagg_serving.md
+++ b/docs/design_docs/disagg_serving.md
@@ -10,98 +10,66 @@ The prefill and decode phases of LLM requests have different computation charact
 Disaggregated execution of a request has three main steps:
 1. Prefill engine computes prefill phase and generates KV cache
-2. Prefill engine transfers the KV cache to decode engine, and
+2. Prefill engine transfers the KV cache to decode engine
 3. Decode engine computes decode phase.
-However, not all requests’ prefill phases need to be computed in the remote prefill engine. If the prefill is short or the decode engine has a high prefix cache hit, often it is more efficient to prefill locally in the decode engine. The disaggregation design in Dynamo accounts for all these scenarios and features a flexible framework that delivers strong performance across various conditions.
+The disaggregation design in Dynamo features a flexible framework that delivers strong performance across various conditions.
+## Efficient KV Transfer
-## Design
+The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. The KV transfer is non-blocking, allowing GPU forward passes to continue serving other requests during the transfer.
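The non-blocking behavior described in the added paragraph can be pictured with a small scheduling sketch. This is not NIXL or Dynamo code: `transfer_kv` and `forward_passes` are hypothetical stand-ins, and `asyncio` only imitates the overlap between an in-flight transfer and ongoing forward passes.

```python
import asyncio


async def transfer_kv(request_id: str, log: list[str]) -> None:
    # Hypothetical stand-in for an asynchronous KV transfer. The sleep
    # yields control, so other work runs while the "transfer" is in flight.
    log.append(f"transfer start {request_id}")
    await asyncio.sleep(0.01)
    log.append(f"transfer done {request_id}")


async def forward_passes(steps: int, log: list[str]) -> None:
    # Hypothetical stand-in for forward passes that serve other requests.
    for i in range(steps):
        log.append(f"forward pass {i}")
        await asyncio.sleep(0)


async def main() -> list[str]:
    log: list[str] = []
    # Both coroutines run concurrently; neither blocks the other.
    await asyncio.gather(transfer_kv("req-1", log), forward_passes(3, log))
    return log


events = asyncio.run(main())
```

In the resulting `events` list the forward passes are recorded between the start and the end of the transfer, which is the overlap the paragraph refers to.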
-```mermaid
-sequenceDiagram
-    participant D as Worker
-    participant Q as PrefillQueue
-    participant P as PrefillWorker
-    Note over D: Request is routed to decode
-    D->>D: Decide if prefill should be done locally or remotely
-        D->>D: Allocate KV blocks
-        D->>Q: Put RemotePrefillRequest on the queue
-        P->>Q: Pull request from the queue
-        P-->>D: Read cached KVs from Decode
-        D->>D: Decode other requests
-        P->>P: Run prefill
-        P-->>D: Write prefilled KVs into allocated blocks
-        P->>D: Send completion notification
-        Note over D: Notification received when prefill is done
-        D->>D: Schedule decoding
-```
-There are four main components in Dynamo disaggregation:
- Worker: execute prefill and decode requests
- Prefill worker: execute prefill requests only
- Disaggregated router: decide whether to prefill locally or remotely
- Prefill queue: cache and load balance the remote prefill requests
-When worker receives a request, it first decides if the prefill should be done locally or remotely using the disaggregated router and allocates the KV blocks. If prefilling remotely, it then pushes a remote prefill request to the prefill queue. After that, the prefill worker pulls from prefill queue, reads KV blocks with prefix cache hit from the worker, computes the prefill, and writes the computed KV blocks back to the worker. Finally, the worker completes the remaining decoding.
-## Conditional Disaggregation
-Not all requests’ prefill phases need to be computed in the remote prefill engine. Disaggregated router decides whether the prefill phase of a request should be computed locally and globally at runtime based on the prefill length and prefill queue status. Specifically, a request is sent to remote prefill engine if the following two conditions are met:
-1. The absolute prefill length without prefix cache hit is greater than a preset threshold. On the one hand, if the prefill length of a request is short, it can be efficiently computed in the decode engine by piggybacking chunked prefill requests with ongoing decode requests. On the other hand, if the prefix cache hit is long, the prefill becomes memory bound and hence can be more efficiently computed in the decode engine.
-2. The number of remote prefill requests in the prefill queue is less than a preset threshold. When the prefill queue has a large number of prefill requests, it indicates that the prefill workers are lagging behind, and it is better to prefill locally until more prefill workers join.
-Conditional disaggregation allows Dynamo to achieve high performance for dynamic workloads
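For readers comparing the old and new designs, the two removed conditions can be summarized as a single predicate. This is a minimal sketch of the rule as the deleted text states it, not the code that was removed: the function name, the field names, and the default thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggThresholds:
    # Hypothetical names and values; the deleted text only says "preset threshold".
    max_local_prefill_length: int = 512
    max_prefill_queue_size: int = 8


def should_prefill_remotely(
    prompt_tokens: int,
    prefix_hit_tokens: int,
    queue_size: int,
    thresholds: DisaggThresholds = DisaggThresholds(),
) -> bool:
    # Condition 1: the prefill length without the prefix cache hit exceeds the threshold.
    uncached_tokens = prompt_tokens - prefix_hit_tokens
    long_enough = uncached_tokens > thresholds.max_local_prefill_length
    # Condition 2: the prefill queue is shorter than its threshold.
    queue_has_room = queue_size < thresholds.max_prefill_queue_size
    # Both conditions must hold for the request to go to a remote prefill engine.
    return long_enough and queue_has_room
```

Under these assumed thresholds, a 4096-token prompt with no cache hit and a short queue is sent to remote prefill, while the same prompt with 4000 cached tokens, or with a full queue, is prefilled locally.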
-## Prefill Queue
-Prefill requests are computation bound (except for very short prefills) and should be executed in their dedicated iterations without any other requests to ensure fast TTFT. To balance the load across multiple prefill engines, Dynamo adopts a global prefill queue where workers push remote prefill requests and prefill workers pull and complete the requests one by one. The global prefill queue is implemented based on NATS stream to ensure high performance and availability.
+### Router Orchestration
-## Efficient KV Transfer
+The disaggregated serving flow is orchestrated by the `PrefillRouter`:
 ```mermaid
 sequenceDiagram
-    participant D as Worker
+    participant Client
-    participant SD as WorkerScheduler
+    participant Frontend
-    participant SP as PrefillWorkerScheduler
+    participant Router as PrefillRouter
-    participant P as PrefillWorker
+    participant Prefill as Prefill Worker
+    participant Decode as Decode Worker
+    Client->>Frontend: Request
+    Frontend->>Router: Preprocessed Request
+    Router->>Router: Select prefill worker
+    Router->>Prefill: Prefill request
+    Prefill->>Prefill: Compute KV cache
+    Prefill-->>Router: disaggregated_params
+    Router->>Router: Select decode worker
+    Router->>Decode: Decode request + transfer metadata
+    Decode<<->>Prefill: KV transfer (NIXL)
+    Decode->>Decode: Generate tokens
+    Decode-->>Frontend: Stream tokens
+    Frontend-->>Client: Response
+```
-    Note over SD: KV blocks allocated
+1. **Worker Selection**: The router selects a prefill worker using KV-aware routing (based on cache overlap scores and load) or simple load balancing.
-    SD->>SP: Issue remote prefill request <br> with KV block descriptors via prefill queue
-    SP->>P: Add to in-flight batch
-    P-->>D: Remote NIXL read for prefix hit KV blocks (non-block)
+2. **Prefill Execution**: The router sends the prefill request to the selected prefill worker. The prefill worker computes the KV cache and returns `disaggregated_params` containing backend-specific transfer metadata.
-    P->>P: Execute prefill
-    P-->>D: Remote NIXL write for comptued KV blocks (non-block)
-    P->>SP: Notify finish
+3. **Decode Routing**: The router injects the prefill result into the decode request, then routes to the decode worker.
-    SP->>SD: Notify finish
-    SD->>D: Add to in-flight batch
-    D->>D: Execute decode
+4. **KV Transfer**: The decode worker uses the transfer metadata to coordinate with the prefill worker. NIXL handles the direct GPU-to-GPU transfer using the optimal available transport (NVLink, InfiniBand/UCX, etc.).
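The four added steps can be sketched as plain Python. This is an illustration under stated assumptions, not the `PrefillRouter` implementation: `Worker`, `select_worker`, `route`, and the scoring policy (highest cache overlap first, then lowest load) are hypothetical, and only the name `disaggregated_params` comes from the document.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Worker:
    name: str
    cache_overlap: float = 0.0  # hypothetical cache overlap score for this request
    load: int = 0               # hypothetical count of in-flight requests


def select_worker(workers: list[Worker]) -> Worker:
    # Assumed policy: highest cache overlap wins, ties go to the lighter load.
    return max(workers, key=lambda w: (w.cache_overlap, -w.load))


def route(
    request: dict,
    prefill_pool: list[Worker],
    decode_pool: list[Worker],
    run_prefill: Callable[[Worker, dict], dict],
) -> tuple[Worker, dict]:
    prefill_worker = select_worker(prefill_pool)                 # 1. worker selection
    disaggregated_params = run_prefill(prefill_worker, request)  # 2. prefill execution
    # 3. decode routing: inject the prefill result into the decode request
    decode_request = {**request, "disaggregated_params": disaggregated_params}
    decode_worker = select_worker(decode_pool)
    # 4. the KV transfer itself is carried out by the decode worker, not here
    return decode_worker, decode_request


def fake_prefill(worker: Worker, request: dict) -> dict:
    # Placeholder for a real prefill call; returns made-up transfer metadata.
    return {"prefill_worker": worker.name}


decode_worker, decode_request = route(
    {"token_ids": [1, 2, 3]},
    [Worker("prefill-0", 0.2, 1), Worker("prefill-1", 0.8, 3)],
    [Worker("decode-0", load=2), Worker("decode-1", load=0)],
    fake_prefill,
)
```

With these inputs the sketch picks `prefill-1` for its higher overlap score and `decode-1` for its lower load, and the decode request carries the metadata returned by the prefill step.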
-```
-The key to high-performance disaggregation is efficient KV transfer. Dynamo leverage NIXL to transfer KV cache directly from the VRAM of prefill engine to the VRAM of decode engine. In addition, the KV transfer is non-blocking, allowing GPU forward pass to serve other requests in addition to the KV transfer.
+### Backend-Specific Transfer Metadata
-After the KV blocks are allocated, the worker scheduler sends the remote prefill requests, which contain the memory descriptors for the allocated KV blocks, to the prefill worker scheduler via prefill queue. This allows the prefill worker to read and write from the remote KV blocks without explicit handling in the remote worker engine, thanks to the RDMA read and write NIXL operations. Once the remote prefill is done, worker scheduler simply adds the decode request to the worker in-flight. This allows workers to execute forward passes of ongoing decode/prefill requests while waiting for the remote prefill to finish.
+The transfer metadata format varies by backend:
-To reduce the size of memory descriptors, Dynamo applies two optimizations:
+- **SGLang**: Uses `bootstrap_info` (host, port, room_id) for RDMA bootstrap coordination. SGLang prefill workers publish their bootstrap endpoint to the discovery service during initialization. With this mechanism, prefill can run as a background task, allowing the decode phase to begin immediately while the KV transfer proceeds in parallel.
-1. After each worker finishes its initialization and allocates all the KV cache pool, it stores the memory descriptor of all blocks (which is also referred to as the NIXL metadata) in ETCD, a distributed key-value store. Prefill workers load and cache the memory descriptors in one worker at the first time that it serves a remote prefill request issued by this worker. Thus, only the KV block ID instead of the full memory descriptor is needed when issuing the remote prefill request.
-2. Dynamo promotes the memory allocator in the prefill engine to allocate continuous blocks and merge continuous blocks into larger blocks to reduce the total number of KV blocks.
+- **vLLM**: Uses `kv_transfer_params` containing block IDs and remote worker connection info. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
+- **TRTLLM**: Uses `opaque_state` containing serialized TRT-LLM internal metadata. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
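One way to read the three added bullets is as a dispatch on which metadata key is present. The sketch below assumes the keys `bootstrap_info`, `kv_transfer_params`, and `opaque_state` sit at the top level of `disaggregated_params`; that layout, the inner fields of `kv_transfer_params`, the sample values, and both helper functions are assumptions for illustration only.

```python
# Sample payloads. Only the key names and the (host, port, room_id) triple
# come from the document; every value and the vLLM inner fields are made up.
SGLANG_PARAMS = {"bootstrap_info": {"host": "10.0.0.5", "port": 8998, "room_id": 42}}
VLLM_PARAMS = {"kv_transfer_params": {"block_ids": [0, 1, 2], "remote_worker": "prefill-0"}}
TRTLLM_PARAMS = {"opaque_state": b"serialized-metadata"}


def backend_of(disaggregated_params: dict) -> str:
    # Identify the backend from the metadata key it uses.
    if "bootstrap_info" in disaggregated_params:
        return "sglang"
    if "kv_transfer_params" in disaggregated_params:
        return "vllm"
    if "opaque_state" in disaggregated_params:
        return "trtllm"
    raise ValueError("unrecognized transfer metadata")


def decode_waits_for_prefill(disaggregated_params: dict) -> bool:
    # Per the bullets: vLLM and TRTLLM prefill synchronously, while SGLang
    # lets decode begin while the KV transfer proceeds in parallel.
    return backend_of(disaggregated_params) in {"vllm", "trtllm"}
```

Keeping the backend check in one helper means the rest of a router or worker could stay backend-agnostic and only ask whether decode has to wait.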
-For decode and prefill with different KV layouts (i.e., due to different TP), Dynamo applies a high-performance kernel that transposes the KV blocks into their matching layout in the KV receiver after the NIXL reads and before the NIXL writes.
 ## Runtime-Reconfigurable xPyD
-The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows runtime-reconfigurable xPyD. Workers and prefill workers can be added and removed at runtime without any system-level synchronization or overheads. New and existing prefill workers both just simply pull remote prefill requests from NATS prefill queue. The NIXL metadata of the new or existing workers (for new prefill workers) are lazily loaded and cached when necessary. Specifically, adding and removing workers and prefill workers is as easy as:
+Dynamo's disaggregation design supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added and removed at runtime:
+- **Add worker**: Worker registers with the discovery service and publishes its `RuntimeConfig` (including KV capacity).
+- **Remove worker**: Worker drains active requests and deregisters from discovery.
- Add worker: add NIXL metadata in ETCD.
+The router automatically discovers new workers via the discovery service and incorporates them into routing decisions.
- Remove worker: flush engine and delete NIXL metadata in ETCD.
- Add prefill worker: no explicit action needed.
- Delete prefill worker: flush engine.
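The add and remove steps in the new text can be modeled with an in-memory registry. This is a toy sketch, not the discovery service: the `Discovery` class, its methods, and the fields on `RuntimeConfig` (beyond the document's mention of KV capacity) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RuntimeConfig:
    role: str                 # "prefill" or "decode" (assumed field)
    kv_capacity: int          # KV capacity; the unit is an assumption
    active_requests: int = 0  # assumed field used to model draining


class Discovery:
    """Toy stand-in for the discovery service."""

    def __init__(self) -> None:
        self._workers: dict[str, RuntimeConfig] = {}

    def register(self, worker_id: str, config: RuntimeConfig) -> None:
        # Add worker: register and publish its runtime configuration.
        self._workers[worker_id] = config

    def deregister(self, worker_id: str) -> None:
        # Remove worker: only allowed once active requests have drained.
        if self._workers[worker_id].active_requests > 0:
            raise RuntimeError("drain active requests before deregistering")
        del self._workers[worker_id]

    def topology(self) -> str:
        # Summarize the current xPyD shape as seen by a router.
        roles = [config.role for config in self._workers.values()]
        return f"{roles.count('prefill')}P{roles.count('decode')}D"


discovery = Discovery()
discovery.register("prefill-0", RuntimeConfig("prefill", kv_capacity=1024))
discovery.register("prefill-1", RuntimeConfig("prefill", kv_capacity=1024))
discovery.register("decode-0", RuntimeConfig("decode", kv_capacity=4096))
shape_before = discovery.topology()
discovery.deregister("prefill-1")
shape_after = discovery.topology()
```

In this sketch the shape changes from `2P1D` to `1P1D` with no coordination beyond the registry update, which mirrors the claim that workers can join and leave at runtime.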
--- a/docs/design_docs/dynamo_flow.md
+++ b/docs/design_docs/dynamo_flow.md
@@ -17,255 +17,189 @@ limitations under the License.
 # Dynamo Architecture Flow
-This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](../../examples/backends/vllm). Color-coded flows indicate different types of operations.
+This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.
-> **Note**: The "Processor" shown in the diagram represents the request processing logic (tokenization, chat template application, routing) that runs within the Frontend component. It is not a separate deployment—the Frontend handles both HTTP serving and request preprocessing via the `make_engine` function.
 ## 🔵 Main Request Flow (Blue)
 The primary user journey through the system:
-1. **Discovery (S1)**: Client discovers the service endpoint
+1. **Request (S1)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
-2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
+2. **Preprocess (S2)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
-3. **Validate (S3)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
+3. **Route to Prefill (S3)**: PrefillRouter selects a prefill worker using KV-aware routing or load balancing
-4. **Route (S3)**: Frontend routes the validated request to appropriate Decode Worker
-## 🟠 Decision and Allocation Flow (Orange)
+## 🟢 Prefill Flow (Green)
-The system's intelligent routing and resource allocation:
+The prefill processing pipeline:
-4. **Query (S4)**: Decode Worker queries for prefix cache hits to optimize processing
+4. **Prefill (S4)**: Prefill worker executes the prefill computation on the input tokens and generates KV cache
-5. **Disagg Decision (S5)**: Based on prefill length and queue size, the system decides whether it needs remote prefill
+5. **Return Metadata (S5)**: Prefill worker returns `disaggregated_params` containing backend-specific transfer metadata
-5a. **Allocate (S5a)**: Decode Worker pre-allocates KV cache blocks in its local GPU memory
-6. **Queue (S6)**: If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
-## 🟢 Prefill Worker Flow (Green)
+## 🟠 Decode Routing Flow (Orange)
-The dedicated prefill processing pipeline:
+Router orchestration to decode phase:
-7. **NATS Pull (S7)**: PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
+6. **Route to Decode (S6)**: PrefillRouter injects prefill result into decode request and routes to decode worker
-8. **Load Metadata (S8)**: PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
+7. **KV Transfer (S7)**: Decode worker coordinates with prefill worker for direct GPU-to-GPU KV cache transfer via NIXL
-9. **Prefill (S9)**: Worker executes the prefill computation on the input tokens
-10. **NIXL Transfer (S10)**: Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks
 ## 🟣 Completion Flow (Purple)
 The response generation and delivery:
-11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker
+8. **Decode (S8)**: Decode worker generates tokens using the transferred KV cache
-12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data
+9. **Response (S9)**: Generated tokens stream back through Frontend for post-processing (detokenization) and delivery to Client
-13. **Response (S13)**: The generated response flows back through the Frontend for post-processing (detokenization) and delivery to the Client
 ## 🔗 Infrastructure Connections (Dotted lines)
 Coordination and messaging support:
 ### Service Discovery
 - **On Kubernetes** (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
- **On bare metal**: Uses etcd for service discovery and endpoint registration.
+- **On bare metal**: Uses etcd or filesystem for service discovery and endpoint registration.
 ### Request Plane
 - **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport.
 - **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`.
 ### NATS Connections (Optional, for KV routing)
- **PrefillQueue**: JetStream consumer group for reliable work distribution in disaggregated serving
 - **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)
 ### Planning Connections (Gold, dotted)
 - **Frontend → Planner**: Metrics collection for auto-scaling decisions
- **Planner → Workers**: Resource scaling commands for both Decode Worker and PrefillWorker
+- **Planner → Workers**: Resource scaling commands for workers
 ## Technical Implementation Details
+### PrefillRouter Orchestration:
+- The `PrefillRouter` sits between the Frontend and workers, orchestrating disaggregated serving
+- Selects prefill workers using KV-aware routing (cache overlap scores + load) or simple load balancing
+- Injects transfer metadata into decode requests for KV cache coordination
 ### NIXL (NVIDIA Interchange Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
+- Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
+- Transfer metadata exchanged via `disaggregated_params` in prefill response
- PrefillWorker loads metadata to establish direct communication channels
+- Backend-specific coordination: SGLang uses bootstrap connections, vLLM uses block IDs, TRTLLM uses opaque state
- Block-based transfers (64–128 tokens per block) for efficient batching
 ### Disaggregated KV Cache:
- Each Decode Worker maintains local KV cache in its GPU memory
+- Each worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks—all transfers are direct worker-to-worker
+- No shared storage bottlenecks—transfers are direct worker-to-worker via NIXL
- Pre-allocated blocks ensure deterministic memory layout and performance
+- Non-blocking transfers allow GPU forward passes to continue during KV transfer
 ```mermaid
 %%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
 graph TD
    %% Top Layer - Client & Frontend
    Client["<b>HTTP Client</b>"]
-    S1[["<b>1 DISCOVERY</b>"]]
    Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
-    S2[["<b>2 REQUEST</b>"]]
+    S1[["<b>1 REQUEST</b>"]]
+    S2[["<b>2 PREPROCESS</b>"]]
-    %% Processing Layer
+    %% Router Layer
-    Processor["<b>Processor</b><br/><i>Request Handler & Router</i>"]
+    PrefillRouter["<b>PrefillRouter</b><br/><i>Orchestrates Disaggregated Serving</i>"]
-    S3[["<b>3 VALIDATE</b>"]]
+    S3[["<b>3 ROUTE TO PREFILL</b>"]]
-    %% Infrastructure - Positioned strategically to minimize crossings
+    %% Infrastructure
    subgraph INF["<b>Infrastructure Layer</b>"]
-        ETCD[("<b>ETCD</b><br/><i>Service Discovery &<br/>NIXL Metadata</i>")]
+        Discovery[("<b>Discovery</b><br/><i>Service Registry<br/>(ETCD or K8s)</i>")]
-        NATS[("<b>NATS</b><br/><i>Message Broker</i>")]
+        NATS[("<b>NATS</b><br/><i>KV Events<br/>(Optional)</i>")]
-        Planner["<b>Planner</b><br/><i>Resource Management<br/>Auto-scaling</i>"]
+        Planner["<b>Planner</b><br/><i>Auto-scaling</i>"]
    end
-    %% Worker Layer - Main processing
+    %% Worker Layer
    subgraph WL["<b>Worker Layer</b>"]
-        %% VllmWorker section
+        %% Prefill Worker
-        VllmWorker["<b>Decode Worker</b><br/><i>Handles Decoding & Disagg Decisions</i>"]
+        PrefillWorker["<b>Prefill Worker</b><br/><i>Computes KV Cache</i>"]
-        S4[["<b>4 QUERY</b>"]]
+        S4[["<b>4 PREFILL</b>"]]
-        S5[["<b>5 DISAGG DECISION</b>"]]
+        S5[["<b>5 RETURN METADATA</b>"]]
-        S5a[["<b>5a ALLOCATE</b>"]]
-        S12[["<b>12 DECODE</b>"]]
+        %% Decode Worker
-        S6[["<b>6 QUEUE</b>"]]
+        DecodeWorker["<b>Decode Worker</b><br/><i>Token Generation</i>"]
-        S13[["<b>13 RESPONSE</b>"]]
+        S6[["<b>6 ROUTE TO DECODE</b>"]]
+        S7[["<b>7 KV TRANSFER</b>"]]
-        %% Storage positioned near workers
+        S8[["<b>8 DECODE</b>"]]
-        LocalKVCache[("<b>Local KV Cache</b><br/><i>Pre-allocated Blocks</i>")]
+        S9[["<b>9 RESPONSE</b>"]]
-        %% Prefill System - Right side to minimize crossings
+        %% KV Cache
-        subgraph PS["<b>Prefill System</b>"]
+        PrefillKVCache[("<b>Prefill KV Cache</b><br/><i>GPU VRAM</i>")]
-            PrefillQueue["<b>Prefill Queue</b><br/><i>NATS JetStream<br/>Consumer Group</i>"]
+        DecodeKVCache[("<b>Decode KV Cache</b><br/><i>GPU VRAM</i>")]
-            PrefillWorker["<b>Prefill Worker</b><br/><i>Dedicated Prefill Processing<br/>(Multiple Instances)</i>"]
-            S7[["<b>7 NATS PULL</b>"]]
-            S8[["<b>8 LOAD METADATA</b>"]]
-            S9[["<b>9 PREFILL</b>"]]
-            S10[["<b>10 NIXL TRANSFER</b>"]]
-            S11[["<b>11 NOTIFY</b>"]]
-        end
    end
-    %% Main Request Flow (Blue) - Clean vertical flow
+    %% Main Request Flow (Blue)
-    Client -.-> S1
+    Client --> S1
    S1 -->|HTTP API Call| Frontend
-    Frontend -.-> S2
+    Frontend --> S2
-    S2 -->|Process & Validate| Processor
+    S2 -->|Tokenize & Validate| PrefillRouter
-    Processor -.-> S3
+    PrefillRouter --> S3
-    S3 -->|Route to Worker| VllmWorker
+    S3 -->|Select Prefill Worker| PrefillWorker
-    %% VllmWorker Internal Flow (Orange)
+    %% Prefill Flow (Green)
-    VllmWorker -.-> S4
+    PrefillWorker --> S4
-    S4 -->|Query Prefix Cache Hit| S5
+    S4 -->|Compute KV Cache| PrefillKVCache
-    S5 -->|Prefill Length & Queue Check| S5a
+    PrefillWorker --> S5
-    S5a -->|Continue to Decode| S12
+    S5 -->|disaggregated_params| PrefillRouter
-    %% Allocation & Queuing (Orange) - Minimize crossings
+    %% Decode Routing Flow (Orange)
-    S5a -->|Allocate KV Cache Blocks| LocalKVCache
+    PrefillRouter --> S6
-    VllmWorker --> S6
+    S6 -->|Inject Transfer Metadata| DecodeWorker
-    S6 -->|Put RemotePrefillRequest| PrefillQueue
+    DecodeWorker --> S7
+    S7 -->|NIXL GPU-to-GPU| PrefillKVCache
-    %% Prefill Worker Flow (Green) - Self-contained within PS
+    PrefillKVCache -.->|Direct Transfer| DecodeKVCache
-    PrefillQueue -.-> S7
-    S7 -->|Consumer Group Pull| PrefillWorker
+    %% Completion Flow (Purple)
-    PrefillWorker -.-> S8
+    DecodeWorker --> S8
-    PrefillWorker -.-> S9
+    S8 -->|Generate Tokens| DecodeKVCache
-    S9 -->|Execute Prefill| S10
+    DecodeWorker --> S9
-    S10 -->|Direct GPU Transfer| LocalKVCache
+    S9 -->|Stream Tokens| Frontend
-    PrefillWorker --> S11
+    Frontend -->|HTTP Response| Client
-    %% Return Flow (Purple) - Clean return path
+    %% Infrastructure Connections
-    S11 -->|Completion Notification| S12
+    Frontend -.->|Service Discovery| Discovery
-    S12 -->|Decode from KV Cache| S13
+    PrefillRouter -.->|Worker Discovery| Discovery
-    S13 -->|Post-process Response| Processor
+    PrefillWorker -.->|Register| Discovery
-    Processor -->|HTTP Response| Frontend
+    DecodeWorker -.->|Register| Discovery
-    Frontend -->|Final Response| Client
+    Planner -.->|Service Discovery| Discovery
-    %% Infrastructure Connections - Organized to avoid crossings
+    %% NATS for KV events (optional)
-    %% ETCD Connections - Grouped by proximity
+    PrefillWorker -.->|KV Events| NATS
-    Frontend -.->|Service Discovery| ETCD
+    DecodeWorker -.->|KV Events| NATS
-    Processor -.->|Service Discovery| ETCD
-    VllmWorker -.->|NIXL Metadata| ETCD
+    %% Planning Connections
-    PrefillWorker -.->|NIXL Metadata| ETCD
-    S8 -.->|Load NIXL Metadata| ETCD
-    Planner -.->|Service Discovery| ETCD
-    %% NATS Connections - Direct to queue system
-    PrefillQueue -.->|JetStream| NATS
-    Processor -.->|Load Balancing| NATS
-    %% Planning Connections - Strategic positioning
    Frontend -.->|Metrics| Planner
-    Planner -.->|Auto-scaling| VllmWorker
    Planner -.->|Auto-scaling| PrefillWorker
+    Planner -.->|Auto-scaling| DecodeWorker
-    %% Styling - Each component with unique colors
+    %% Styling
    classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
    classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
-    classDef processor fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
+    classDef router fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
    classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
-    classDef prefillQueue fill:#fff8e1,stroke:#E65100,stroke-width:3px
+    classDef prefillWorker fill:#e8f5e9,stroke:#388E3C,stroke-width:3px
-    classDef prefillWorker fill:#fce4ec,stroke:#C2185B,stroke-width:3px
-    classDef prefillBox fill:#eceff1,stroke:#455A64,stroke-width:3px
    classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
    classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
-    classDef etcd fill:#fff9c4,stroke:#F9A825,stroke-width:3px
+    classDef discovery fill:#fff9c4,stroke:#F9A825,stroke-width:3px
    classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
    classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
    classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
    class Client client
    class Frontend frontend
-    class Processor processor
+    class PrefillRouter router
-    class VllmWorker worker
+    class DecodeWorker worker
-    class PrefillQueue prefillQueue
    class PrefillWorker prefillWorker
    class Planner planner
-    class LocalKVCache storage
+    class PrefillKVCache,DecodeKVCache storage
-    class ETCD etcd
+    class Discovery discovery
    class NATS nats
-    class PS prefillBox
    class INF infraLayer
    class WL workerLayer
+    %% Flow Colors
+    %% Main Request Flow - Blue
+    linkStyle 0,1,2,3,4,5 stroke:#1565C0,stroke-width:4px
+    %% Prefill Flow - Green
+    linkStyle 6,7,8,9 stroke:#2E7D32,stroke-width:4px
+    %% Decode Routing Flow - Orange
+    linkStyle 10,11,12,13,14 stroke:#E65100,stroke-width:4px
+    %% Completion Flow - Purple
+    linkStyle 15,16,17,18,19 stroke:#6A1B9A,stroke-width:4px
-    %% Flow Colors - Different line styles to reduce visual clutter
+    %% Infrastructure - Gray dotted
-    %% Main Request Flow - Blue (solid)
+    linkStyle 20,21,22,23,24,25,26,27,28,29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
-    linkStyle 0 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
-    linkStyle 1 stroke:#1565C0,stroke-width:4px
-    linkStyle 2 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
-    linkStyle 3 stroke:#1565C0,stroke-width:4px
-    linkStyle 4 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
-    linkStyle 5 stroke:#1565C0,stroke-width:4px
-    %% Decision & Allocation Flow - Orange (mixed)
-    linkStyle 6 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
-    linkStyle 7 stroke:#E65100,stroke-width:4px
-    linkStyle 8 stroke:#E65100,stroke-width:4px
-    linkStyle 9 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
-    %% KV Cache & Queue - Orange (solid)
-    linkStyle 10 stroke:#E65100,stroke-width:4px
-    linkStyle 11 stroke:#E65100,stroke-width:4px
-    linkStyle 12 stroke:#E65100,stroke-width:4px
-    %% Prefill Worker Flow - Green (mixed)
-    linkStyle 13 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
-    linkStyle 14 stroke:#2E7D32,stroke-width:4px
-    linkStyle 15 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
-    linkStyle 16 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
-    linkStyle 17 stroke:#2E7D32,stroke-width:4px
-    linkStyle 18 stroke:#2E7D32,stroke-width:4px
-    linkStyle 19 stroke:#2E7D32,stroke-width:4px
-    %% Completion Flow - Purple (mixed)
-    linkStyle 20 stroke:#6A1B9A,stroke-width:4px
-    linkStyle 21 stroke:#6A1B9A,stroke-width:3px,stroke-dasharray: 3 3
-    linkStyle 22 stroke:#6A1B9A,stroke-width:4px
-    linkStyle 23 stroke:#6A1B9A,stroke-width:4px
-    linkStyle 24 stroke:#6A1B9A,stroke-width:4px
-    %% Infrastructure Flows - Lighter and dotted to reduce visual noise
-    %% ETCD Connections - Gray (dotted, thinner)
-    linkStyle 25 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
-    linkStyle 26 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
-    linkStyle 27 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
-    linkStyle 28 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
-    linkStyle 29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
-    linkStyle 30 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
-    %% NATS Connections - Teal (dotted, thinner)
-    linkStyle 31 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
-    linkStyle 32 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
-    %% Planning Connections - Gold (dotted, thinner)
-    linkStyle 33 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
-    linkStyle 34 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
-    linkStyle 35 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
 ```