docs: update mm embedding cache graphs (#6697)

Commit 626fb5dd, authored Feb 27, 2026 by Qi Wang; committed by GitHub Feb 27, 2026
Showing with 47 additions and 17 deletions

docs/pages/features/multimodal/multimodal-trtllm.md +21 -2

docs/pages/features/multimodal/multimodal-vllm.md +26 -15

--- a/docs/pages/features/multimodal/multimodal-trtllm.md
+++ b/docs/pages/features/multimodal/multimodal-trtllm.md
@@ -392,12 +392,31 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings fo
 | Setting | Implementation | Launch Script | Status |
 |---------|---------------|---------------|--------|
-| **Disaggregated (E/PD)** | Dynamo-managed cache in the PD worker layer on top of TRT-LLM engine | `disagg_e_pd.sh` + `--multimodal-embedding-cache-capacity-gb` | Supported |
+| **Disaggregated Encoder** | Dynamo-managed cache in the PD worker layer on top of TRT-LLM engine | `disagg_e_pd.sh` + `--multimodal-embedding-cache-capacity-gb` | Supported |
 | **Aggregated** | N/A | N/A | Not yet supported |
 The cache uses `MultimodalEmbeddingCacheManager` to maintain an LRU cache of encoder embeddings on CPU. When the same image is seen again, the cached embedding is reused instead of re-encoding.
-### Disaggregated (E/PD)
+### Disaggregated Encoder (Embedding Cache in Prefill Worker)
+In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (`EmbeddingCacheManager`). On each request P checks the cache first — on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the TRT-LLM Instance for prefill.
+```mermaid
+---
+title: Embedding Cache — Disaggregated Encoder
+---
+flowchart LR
+    req[Request] --> cpu_check{"CPU cache hit?<br/>(EmbeddingCacheManager)"}
+    subgraph P ["Prefill Worker (P)"]
+        cpu_check -. hit .-> use[Use cached embedding]
+        use --> trtllm[TRT-LLM Instance]
+    end
+    cpu_check -- miss --> E["Encode Worker (E)"]
+    E -- "embeddings via NIXL" --> save["Save to cache"]
+    save --> trtllm
+```
 The `disagg_e_pd.sh` script launches a separate encode worker and a PD worker. Extra arguments are forwarded to the PD worker. Enable embedding cache by passing `--multimodal-embedding-cache-capacity-gb`:
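
Reviewer note: the hit/miss flow that the new paragraph and diagram describe for the Prefill Worker can be sketched roughly as below. This is an illustration only, not Dynamo's `EmbeddingCacheManager`: the class `SketchEmbeddingCache`, the helper `get_embedding`, and the `encode` callback (standing in for the round trip to the Encode Worker over NIXL) are all hypothetical names, and the byte-based capacity accounting is an assumption.

```python
from collections import OrderedDict
from typing import Callable, Optional


class SketchEmbeddingCache:
    """Hypothetical byte-bounded LRU cache (not Dynamo's EmbeddingCacheManager)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        # Oldest entry first, most recently used entry last.
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        value = self._entries.get(key)
        if value is not None:
            # A hit refreshes the entry's recency.
            self._entries.move_to_end(key)
        return value

    def put(self, key: str, value: bytes) -> None:
        if len(value) > self.capacity_bytes:
            return  # Larger than the whole cache: skip caching.
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        # Evict least recently used entries until the new value fits.
        while self.used_bytes + len(value) > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._entries[key] = value
        self.used_bytes += len(value)


def get_embedding(
    cache: SketchEmbeddingCache,
    image_key: str,
    encode: Callable[[str], bytes],
) -> bytes:
    """Check the cache first; call the encoder only on a miss, then save."""
    cached = cache.get(image_key)
    if cached is not None:
        return cached  # Hit: the encode step is skipped entirely.
    embedding = encode(image_key)  # Miss: stand-in for the Encode Worker.
    cache.put(image_key, embedding)
    return embedding
```

In this sketch the cache key is an opaque string; how the real implementation derives a key from an image is not stated in the diff.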

--- a/docs/pages/features/multimodal/multimodal-vllm.md
+++ b/docs/pages/features/multimodal/multimodal-vllm.md
@@ -419,18 +419,23 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings:
 | Setting | Implementation | Launch Script |
 |---------|---------------|---------------|
 | **Aggregated** | `ec_both` via upstream vLLM (main or v0.17.0+) | `vllm_serve_embedding_cache.sh` |
-| **Disaggregated** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
+| **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
-### ec_both (Aggregated)
+### Aggregated Worker
 A single vLLM instance acts as both **producer** (encodes and saves embeddings to CPU cache) and **consumer** (loads cached embeddings back to GPU). Repeated images skip encoding entirely.
 ```mermaid
+---
+title: Embedding Cache — Aggregated Encoder (e.g. aggregated EP or EPD node)
+---
 flowchart LR
-  HTTP --> vllm[vLLM Instance<br/>ec_both]
+  req[Multimodal Request] --> gpu{GPU Encoder Cache<br/>hit?}
-  vllm --save--> cache[(CPU Embedding Cache<br/>LRU)]
+  gpu -- yes --> skip[Use cached GPU embedding<br/>no encoder, no connector]
-  cache --load--> vllm
+  gpu -- no --> cpu{CPU Embedding Cache<br/>hit?}
-  vllm --> HTTP
+  cpu -- yes --> load[Load: CPU → GPU<br/>skip encoder]
+  cpu -- no --> encode[Run Encoder]
+  encode -- save: GPU → CPU --> store[(CPU Embedding Cache<br/>LRU)]
 ```
 **Launch:**
@@ -442,19 +447,25 @@ bash launch/vllm_serve_embedding_cache.sh --multimodal-embedding-cache-capacity-
 This configures `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector` automatically. The capacity parameter controls the CPU-side LRU cache size in GB (0 = disabled).
-### Disaggregated Embedding Cache
+### Disaggregated Encoder (Embedding Cache in Prefill Worker)
-In the disaggregated setting, Dynamo maintains the embedding cache in its own worker layer on top of the vLLM engine. The encode worker computes embeddings and writes them to the cache, while the PD worker reads cached embeddings instead of re-encoding.
+In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (`EmbeddingCacheManager`). On each request P checks the cache first — on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the vLLM Instance for prefill.
 ```mermaid
+---
+title: Embedding Cache — Disaggregated Encoder
+---
 flowchart LR
-  HTTP --> processor
+    req[Request] --> cpu_check{"CPU cache hit?<br/>(EmbeddingCacheManager)"}
-  processor --> HTTP
-  processor --image_url--> encode_worker[Encode Worker]
+    subgraph P ["Prefill Worker (P)"]
-  encode_worker --> processor
+        cpu_check -. hit .-> use[Use cached embedding]
-  encode_worker --embeddings--> cache[(Dynamo Embedding Cache)]
+        use --> vllm[vLLM Instance]
-  cache --load--> pd_worker[PD Worker]
+    end
-  pd_worker --> encode_worker
+    cpu_check -- miss --> E["Encode Worker (E)"]
+    E -- "embeddings via NIXL" --> save["Save to cache"]
+    save --> vllm
 ```
 **Launch:**