Commit 626fb5dd (unverified), authored by Qi Wang, committed by GitHub

docs: update mm embedding cache graphs (#6697)

parent 880db75d
@@ -392,12 +392,31 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings fo
| Setting | Implementation | Launch Script | Status |
|---------|---------------|---------------|--------|
-| **Disaggregated (E/PD)** | Dynamo-managed cache in the PD worker layer on top of TRT-LLM engine | `disagg_e_pd.sh` + `--multimodal-embedding-cache-capacity-gb` | Supported |
+| **Disaggregated Encoder** | Dynamo-managed cache in the PD worker layer on top of TRT-LLM engine | `disagg_e_pd.sh` + `--multimodal-embedding-cache-capacity-gb` | Supported |
| **Aggregated** | N/A | N/A | Not yet supported |
The cache uses `MultimodalEmbeddingCacheManager` to maintain an LRU cache of encoder embeddings on CPU. When the same image is seen again, the cached embedding is reused instead of re-encoding.
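To make the LRU behavior described above concrete, here is a minimal sketch of a byte-budgeted LRU cache. It is not the `MultimodalEmbeddingCacheManager` implementation: the class name `LruEmbeddingCache`, its methods, and the use of raw `bytes` as a stand-in for embeddings are all assumptions made for illustration.

```python
from collections import OrderedDict


class LruEmbeddingCache:
    """Hypothetical byte-budgeted LRU cache; not Dynamo's actual class."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> bytes, oldest entry first

    def get(self, key):
        value = self._entries.get(key)
        if value is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value: bytes):
        if len(value) > self.capacity_bytes:
            return  # an entry larger than the whole budget is never stored
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = value
        self.used_bytes += len(value)
        # Evict least recently used entries until the budget is respected.
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)


cache = LruEmbeddingCache(capacity_bytes=8)
cache.put("img-a", b"aaaa")
cache.put("img-b", b"bbbb")
cache.get("img-a")            # touch img-a so img-b becomes the oldest entry
cache.put("img-c", b"cccc")   # over budget: img-b is evicted
```

A real cache would key on a content hash of the image and store tensors rather than bytes; the eviction order is the only point this sketch is meant to show.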
-### Disaggregated (E/PD)
+### Disaggregated Encoder (Embedding Cache in Prefill Worker)
+In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (`EmbeddingCacheManager`). On each request P checks the cache first — on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the TRT-LLM Instance for prefill.
+```mermaid
+---
+title: Embedding Cache — Disaggregated Encoder
+---
+flowchart LR
+req[Request] --> cpu_check{"CPU cache hit?<br/>(EmbeddingCacheManager)"}
+subgraph P ["Prefill Worker (P)"]
+cpu_check -. hit .-> use[Use cached embedding]
+use --> trtllm[TRT-LLM Instance]
+end
+cpu_check -- miss --> E["Encode Worker (E)"]
+E -- "embeddings via NIXL" --> save["Save to cache"]
+save --> trtllm
+```
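The hit/miss flow in the diagram above can be sketched in a few lines. This is an illustrative sketch only: `resolve_embedding`, `fake_encode`, and the plain `dict` cache are hypothetical names, and the callable passed as `encode_remote` stands in for the round trip to the Encode Worker (the NIXL transfer is not modeled).

```python
from typing import Callable, Dict, List

Embedding = List[float]


def resolve_embedding(
    image_key: str,
    cache: Dict[str, Embedding],
    encode_remote: Callable[[str], Embedding],
) -> Embedding:
    """Return the embedding for image_key, calling the encoder only on a miss."""
    cached = cache.get(image_key)
    if cached is not None:
        return cached  # hit: the encode worker is skipped
    embedding = encode_remote(image_key)  # miss: stand-in for the E round trip
    cache[image_key] = embedding  # save before handing off to prefill
    return embedding


encoder_calls: List[str] = []


def fake_encode(image_key: str) -> Embedding:
    # Hypothetical encoder that records each call so skips are observable.
    encoder_calls.append(image_key)
    return [float(len(image_key))]


cache: Dict[str, Embedding] = {}
first = resolve_embedding("cat.png", cache, fake_encode)
second = resolve_embedding("cat.png", cache, fake_encode)
```

The second call returns the cached value without invoking `fake_encode`, which mirrors the "hit" edge in the flowchart.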
The `disagg_e_pd.sh` script launches a separate encode worker and a PD worker. Extra arguments are forwarded to the PD worker. Enable embedding cache by passing `--multimodal-embedding-cache-capacity-gb`:
......
@@ -419,18 +419,23 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings:
| Setting | Implementation | Launch Script |
|---------|---------------|---------------|
| **Aggregated** | `ec_both` via upstream vLLM (main or v0.17.0+) | `vllm_serve_embedding_cache.sh` |
-| **Disaggregated** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
+| **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
-### ec_both (Aggregated)
+### Aggregated Worker
A single vLLM instance acts as both **producer** (encodes and saves embeddings to CPU cache) and **consumer** (loads cached embeddings back to GPU). Repeated images skip encoding entirely.
```mermaid
+---
+title: Embedding Cache — Aggregated Encoder (e.g. aggregated EP or EPD node)
+---
flowchart LR
-HTTP --> vllm[vLLM Instance<br/>ec_both]
-vllm --save--> cache[(CPU Embedding Cache<br/>LRU)]
-cache --load--> vllm
-vllm --> HTTP
+req[Multimodal Request] --> gpu{GPU Encoder Cache<br/>hit?}
+gpu -- yes --> skip[Use cached GPU embedding<br/>no encoder, no connector]
+gpu -- no --> cpu{CPU Embedding Cache<br/>hit?}
+cpu -- yes --> load[Load: CPU → GPU<br/>skip encoder]
+cpu -- no --> encode[Run Encoder]
+encode -- save: GPU → CPU --> store[(CPU Embedding Cache<br/>LRU)]
```
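The updated aggregated diagram describes a two-tier lookup: GPU encoder cache first, then the CPU embedding cache, then the encoder. A rough sketch of that decision order follows. The function name `lookup_two_tier`, the dict-based caches, and the status strings are hypothetical, and keeping the fresh embedding in the GPU tier after encoding is an assumption not stated in the diff.

```python
from typing import Callable, Dict, List, Tuple

Embedding = List[float]


def lookup_two_tier(
    key: str,
    gpu_cache: Dict[str, Embedding],
    cpu_cache: Dict[str, Embedding],
    run_encoder: Callable[[str], Embedding],
) -> Tuple[Embedding, str]:
    """Return (embedding, status) following the GPU -> CPU -> encoder order."""
    if key in gpu_cache:
        return gpu_cache[key], "gpu_hit"  # no encoder, no connector
    if key in cpu_cache:
        gpu_cache[key] = cpu_cache[key]  # load: CPU -> GPU, skip encoder
        return gpu_cache[key], "cpu_hit"
    embedding = run_encoder(key)
    gpu_cache[key] = embedding  # assumption: result stays in the GPU tier
    cpu_cache[key] = embedding  # save: GPU -> CPU
    return embedding, "encoded"


gpu: Dict[str, Embedding] = {}
cpu: Dict[str, Embedding] = {}
encode = lambda key: [1.0, 2.0]  # hypothetical encoder stub

_, s1 = lookup_two_tier("img", gpu, cpu, encode)
_, s2 = lookup_two_tier("img", gpu, cpu, encode)
gpu.clear()  # simulate eviction from the GPU tier
_, s3 = lookup_two_tier("img", gpu, cpu, encode)
```

Clearing the GPU dict between calls shows the case the CPU tier exists for: the embedding survives GPU eviction and is reloaded without re-encoding.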
**Launch:**
@@ -442,19 +447,25 @@ bash launch/vllm_serve_embedding_cache.sh --multimodal-embedding-cache-capacity-
This configures `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector` automatically. The capacity parameter controls the CPU-side LRU cache size in GB (0 = disabled).
-### Disaggregated Embedding Cache
+### Disaggregated Encoder (Embedding Cache in Prefill Worker)
-In the disaggregated setting, Dynamo maintains the embedding cache in its own worker layer on top of the vLLM engine. The encode worker computes embeddings and writes them to the cache, while the PD worker reads cached embeddings instead of re-encoding.
+In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (`EmbeddingCacheManager`). On each request P checks the cache first — on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the vLLM Instance for prefill.
```mermaid
+---
+title: Embedding Cache — Disaggregated Encoder
+---
flowchart LR
-HTTP --> processor
-processor --> HTTP
-processor --image_url--> encode_worker[Encode Worker]
-encode_worker --> processor
-encode_worker --embeddings--> cache[(Dynamo Embedding Cache)]
-cache --load--> pd_worker[PD Worker]
-pd_worker --> encode_worker
+req[Request] --> cpu_check{"CPU cache hit?<br/>(EmbeddingCacheManager)"}
+subgraph P ["Prefill Worker (P)"]
+cpu_check -. hit .-> use[Use cached embedding]
+use --> vllm[vLLM Instance]
+end
+cpu_check -- miss --> E["Encode Worker (E)"]
+E -- "embeddings via NIXL" --> save["Save to cache"]
+save --> vllm
```
**Launch:**
......