| **Disaggregated (E/PD)** | Dynamo-managed cache in the PD worker layer on top of TRT-LLM engine | `disagg_e_pd.sh` + `--multimodal-embedding-cache-capacity-gb` | Supported |
| **Disaggregated Encoder** | Dynamo-managed cache in the PD worker layer on top of TRT-LLM engine | `disagg_e_pd.sh` + `--multimodal-embedding-cache-capacity-gb` | Supported |
The cache uses `MultimodalEmbeddingCacheManager` to maintain an LRU cache of encoder embeddings on CPU. When the same image is seen again, the cached embedding is reused instead of re-encoding.
The cache uses `MultimodalEmbeddingCacheManager` to maintain an LRU cache of encoder embeddings on CPU. When the same image is seen again, the cached embedding is reused instead of re-encoding.
### Disaggregated (E/PD)
### Disaggregated Encoder (Embedding Cache in Prefill Worker)
In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (`EmbeddingCacheManager`). On each request P checks the cache first — on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the TRT-LLM Instance for prefill.
E -- "embeddings via NIXL" --> save["Save to cache"]
save --> trtllm
```
The `disagg_e_pd.sh` script launches a separate encode worker and a PD worker. Extra arguments are forwarded to the PD worker. Enable embedding cache by passing `--multimodal-embedding-cache-capacity-gb`:
The `disagg_e_pd.sh` script launches a separate encode worker and a PD worker. Extra arguments are forwarded to the PD worker. Enable embedding cache by passing `--multimodal-embedding-cache-capacity-gb`:
@@ -419,18 +419,23 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings:
...
@@ -419,18 +419,23 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings:
| Setting | Implementation | Launch Script |
| Setting | Implementation | Launch Script |
|---------|---------------|---------------|
|---------|---------------|---------------|
| **Aggregated** | `ec_both` via upstream vLLM (main or v0.17.0+) | `vllm_serve_embedding_cache.sh` |
| **Aggregated** | `ec_both` via upstream vLLM (main or v0.17.0+) | `vllm_serve_embedding_cache.sh` |
| **Disaggregated** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
| **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
### ec_both (Aggregated)
### Aggregated Worker
A single vLLM instance acts as both **producer** (encodes and saves embeddings to CPU cache) and **consumer** (loads cached embeddings back to GPU). Repeated images skip encoding entirely.
A single vLLM instance acts as both **producer** (encodes and saves embeddings to CPU cache) and **consumer** (loads cached embeddings back to GPU). Repeated images skip encoding entirely.
```mermaid
```mermaid
---
title: Embedding Cache — Aggregated Encoder (e.g. aggregated EP or EPD node)
This configures `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector` automatically. The capacity parameter controls the CPU-side LRU cache size in GB (0 = disabled).
This configures `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector` automatically. The capacity parameter controls the CPU-side LRU cache size in GB (0 = disabled).
### Disaggregated Embedding Cache
### Disaggregated Encoder (Embedding Cache in Prefill Worker)
In the disaggregated setting, Dynamo maintains the embedding cache in its own worker layer on top of the vLLM engine. The encode worker computes embeddings and writes them to the cache, while the PD worker reads cached embeddings instead of re-encoding.
In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (`EmbeddingCacheManager`). On each request P checks the cache first — on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the vLLM Instance for prefill.