docs: Add disaggregated architecture mermaid diagram (#190)

Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>

Commit 70266ec8, authored Mar 17, 2025 by ptarasiewiczNV, committed by GitHub Mar 17, 2025

Showing with 23 additions and 0 deletions

deploy/examples/llm/README.md +23 -0

--- a/deploy/examples/llm/README.md
+++ b/deploy/examples/llm/README.md
@@ -33,6 +33,29 @@ Single-instance deployment where both prefill and decode are done by the same wo
 ### Disaggregated
 Distributed deployment where prefill and decode are done by separate workers that can scale independently.
+```mermaid
+sequenceDiagram
+    participant D as VllmWorker
+    participant Q as PrefillQueue
+    participant P as PrefillWorker
+    Note over D: Request is routed to decode
+    D->>D: Decide if prefill should be done locally or remotely
+        D->>D: Allocate KV blocks
+        D->>Q: Put RemotePrefillRequest on the queue
+        P->>Q: Pull request from the queue
+        P-->>D: Read cached KVs from Decode
+        D->>D: Decode other requests
+        P->>P: Run prefill
+        P-->>D: Write prefilled KVs into allocated blocks
+        P->>D: Send completion notification
+        Note over D: Notification received when prefill is done
+        D->>D: Schedule decoding
+```
 ## Getting Started
 1. Choose a deployment architecture based on your requirements
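
The sequence in the diagram added by this commit can be sketched as a small single-process simulation. This is only an illustration of the message flow, not the project's implementation: `DecodeWorker` (standing in for the diagram's `VllmWorker`), `PrefillWorker.step`, `fill_blocks`, the token-count threshold used to decide between local and remote prefill, and the block sizes are all hypothetical. A plain `deque` stands in for the `PrefillQueue`, KV data is faked with token chunks, and the "Read cached KVs from Decode" step is omitted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RemotePrefillRequest:
    request_id: str
    prompt_tokens: list
    block_ids: list  # KV blocks pre-allocated on the decode side


def fill_blocks(request, decode, block_size):
    # Stand-in for running prefill: each block receives one chunk of tokens
    # instead of real KV tensors.
    for i, block_id in enumerate(request.block_ids):
        chunk = request.prompt_tokens[i * block_size:(i + 1) * block_size]
        decode.write_kv(block_id, chunk)


class DecodeWorker:
    def __init__(self, num_blocks, block_size, local_prefill_max_tokens):
        self.block_size = block_size
        # Assumption: short prompts are prefilled locally. The diagram only
        # says that a decision is made, not how.
        self.local_prefill_max_tokens = local_prefill_max_tokens
        self.free_blocks = list(range(num_blocks))
        self.kv_blocks = {}  # block_id -> fake KV payload
        self.ready = []      # request ids that can be scheduled for decoding

    def allocate(self, num_tokens):
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise RuntimeError("not enough free KV blocks")
        return [self.free_blocks.pop() for _ in range(needed)]

    def handle(self, request_id, prompt_tokens, queue):
        block_ids = self.allocate(len(prompt_tokens))
        request = RemotePrefillRequest(request_id, prompt_tokens, block_ids)
        if len(prompt_tokens) <= self.local_prefill_max_tokens:
            fill_blocks(request, self, self.block_size)
            self.ready.append(request_id)
            return "local"
        queue.append(request)  # put RemotePrefillRequest on the queue
        return "remote"

    def write_kv(self, block_id, payload):
        self.kv_blocks[block_id] = payload

    def notify_done(self, request_id):
        # Completion notification: the request can now be decoded.
        self.ready.append(request_id)


class PrefillWorker:
    def step(self, queue, decode):
        """Pull one request, prefill it, write KVs back, then notify."""
        if not queue:
            return False
        request = queue.popleft()
        fill_blocks(request, decode, decode.block_size)
        decode.notify_done(request.request_id)
        return True


if __name__ == "__main__":
    queue = deque()
    decode = DecodeWorker(num_blocks=8, block_size=4, local_prefill_max_tokens=4)
    prefill = PrefillWorker()
    print(decode.handle("short", [1, 2, 3], queue))
    print(decode.handle("long", list(range(10)), queue))
    prefill.step(queue, decode)
    print(decode.ready)
```

In this sketch the decode side owns block allocation and the prefill side only writes into blocks it was handed, which mirrors the ordering in the diagram: allocation happens before the request is queued, and decoding is scheduled only after the completion notification arrives.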