Commit 70266ec8 authored by ptarasiewiczNV, committed by GitHub

docs: Add disaggregated architecture mermaid diagram (#190)


Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
parent aca25898
......@@ -33,6 +33,29 @@ Single-instance deployment where both prefill and decode are done by the same wo
### Disaggregated
Distributed deployment where prefill and decode are done by separate workers that can scale independently.
```mermaid
sequenceDiagram
participant D as VllmWorker
participant Q as PrefillQueue
participant P as PrefillWorker
Note over D: Request is routed to decode
D->>D: Decide if prefill should be done locally or remotely
D->>D: Allocate KV blocks
D->>Q: Put RemotePrefillRequest on the queue
P->>Q: Pull request from the queue
P-->>D: Read cached KVs from Decode
D->>D: Decode other requests
P->>P: Run prefill
P-->>D: Write prefilled KVs into allocated blocks
P->>D: Send completion notification
Note over D: Notification received when prefill is done
D->>D: Schedule decoding
```
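The sequence above can be approximated with a small in-process sketch. This is only an illustration of the control flow, not the project's implementation: `DecodeWorker`, `PrefillWorker`, the `local_prefill_max_tokens` threshold, and the fields on `RemotePrefillRequest` are all hypothetical, the KV cache is modeled as plain Python lists, a standard `queue.Queue` stands in for the prefill queue, and the "read cached KVs from decode" step is omitted.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass
class RemotePrefillRequest:
    # Hypothetical fields; the real request type may carry different data.
    request_id: str
    prompt_tokens: list
    block_ids: list  # KV blocks pre-allocated by the decode worker


class DecodeWorker:
    """Stand-in for VllmWorker: owns the KV blocks and schedules decoding."""

    def __init__(self, num_blocks, block_size, local_prefill_max_tokens):
        self.block_size = block_size
        # Assumed policy: short prompts are prefilled locally.
        self.local_prefill_max_tokens = local_prefill_max_tokens
        self.free_blocks = list(range(num_blocks))
        self.kv_blocks = {}  # block_id -> token chunk (placeholder for KV tensors)
        self.ready = []      # request ids that can be scheduled for decoding

    def allocate(self, num_tokens):
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise RuntimeError("not enough free KV blocks")
        blocks = self.free_blocks[:needed]
        self.free_blocks = self.free_blocks[needed:]
        return blocks

    def write_kv(self, block_ids, tokens):
        # Fill each allocated block with its slice of the prompt.
        for i, block_id in enumerate(block_ids):
            start = i * self.block_size
            self.kv_blocks[block_id] = tokens[start:start + self.block_size]

    def submit(self, request_id, prompt_tokens, prefill_queue):
        # Decide local vs. remote first, then allocate KV blocks.
        remote = len(prompt_tokens) > self.local_prefill_max_tokens
        block_ids = self.allocate(len(prompt_tokens))
        if remote:
            prefill_queue.put(
                RemotePrefillRequest(request_id, prompt_tokens, block_ids)
            )
            return "remote"
        self.write_kv(block_ids, prompt_tokens)
        self.ready.append(request_id)
        return "local"

    def notify_prefill_done(self, request_id):
        # Completion notification: the request can now be scheduled.
        self.ready.append(request_id)


class PrefillWorker:
    """Pulls one request, 'runs prefill', and writes into the decode blocks."""

    def step(self, prefill_queue, decode_worker):
        request = prefill_queue.get()
        # Placeholder for the actual prefill computation.
        prefilled = list(request.prompt_tokens)
        decode_worker.write_kv(request.block_ids, prefilled)
        decode_worker.notify_prefill_done(request.request_id)
```

In this sketch, calling `submit` with a long prompt enqueues a `RemotePrefillRequest` and returns immediately, so the decode worker is free to keep serving other requests until a later `PrefillWorker.step` call fills the allocated blocks and sends the notification.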
## Getting Started
1. Choose a deployment architecture based on your requirements
......