docs: embedding cache in vLLM and TRT-LLM (#6555)

Unverified Commit d5add7ff authored Feb 25, 2026 by Qi Wang Committed by GitHub Feb 25, 2026
4 changed files
--- a/docs/pages/features/multimodal/multimodal-trtllm.md
+++ b/docs/pages/features/multimodal/multimodal-trtllm.md
@@ -388,6 +388,26 @@ For 4 4xGB200 nodes (2 for prefill, 2 for decode):
 pkill srun
 ```
+## Embedding Cache
+Dynamo supports an embedding cache for TRT-LLM in the disaggregated (E/PD) setting; the aggregated setting is not yet supported:
+| Setting | Implementation | Launch Script | Status |
+|---------|---------------|---------------|--------|
+| **Disaggregated (E/PD)** | Dynamo-managed cache in the PD worker layer on top of TRT-LLM engine | `disagg_e_pd.sh` + `--multimodal-embedding-cache-capacity-gb` | Supported |
+| **Aggregated** | N/A | N/A | Not yet supported |
+The cache uses `MultimodalEmbeddingCacheManager` to maintain an LRU cache of encoder embeddings in CPU memory. When the same image is seen again, the cached embedding is reused instead of running the encoder again.
+### Disaggregated (E/PD)
+The `disagg_e_pd.sh` script launches a separate encode worker and a PD worker, and forwards any extra command-line arguments to the PD worker. To enable the embedding cache, pass `--multimodal-embedding-cache-capacity-gb` with the cache capacity in GB:
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/disagg_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
+```
 ## NIXL Usage
 | Use Case | Script | NIXL Used? | Data Transfer |

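The LRU behavior described in the Embedding Cache section above can be illustrated with a small bash sketch. This is not Dynamo's `MultimodalEmbeddingCacheManager`: the names `lru_put`, `lru_get`, `lru_touch`, `LRU_CAPACITY`, and `LRU_RESULT` are hypothetical, the capacity counts entries rather than GB, and plain strings stand in for embeddings. It assumes bash 4.2 or newer for associative arrays.

```bash
#!/usr/bin/env bash
# Hypothetical LRU sketch; strings stand in for encoder embeddings.
declare -gA LRU_VALUES=()   # key (e.g. image hash) -> cached value
LRU_ORDER=()                # keys, least recently used first
LRU_CAPACITY=2              # max entries (the real cache is sized in GB)
LRU_RESULT=""               # set by lru_get on a hit

# Move (or append) a key to the most-recently-used end of LRU_ORDER.
lru_touch() {
    local key="$1" k kept=()
    for k in "${LRU_ORDER[@]}"; do
        if [[ "$k" != "$key" ]]; then
            kept+=("$k")
        fi
    done
    LRU_ORDER=("${kept[@]}" "$key")
}

# Store a value, evicting the least recently used entry when the cache is full.
lru_put() {
    local key="$1" value="$2"
    if [[ -z "${LRU_VALUES[$key]+set}" ]] && (( ${#LRU_ORDER[@]} >= LRU_CAPACITY )); then
        local oldest="${LRU_ORDER[0]}"
        unset "LRU_VALUES[$oldest]"
        LRU_ORDER=("${LRU_ORDER[@]:1}")
    fi
    LRU_VALUES["$key"]="$value"
    lru_touch "$key"
}

# On a hit, store the value in LRU_RESULT and return 0; on a miss, return 1.
# The result goes through a variable because calling this in $(...) would
# update the recency order only inside a subshell.
lru_get() {
    local key="$1"
    if [[ -z "${LRU_VALUES[$key]+set}" ]]; then
        return 1
    fi
    lru_touch "$key"
    LRU_RESULT="${LRU_VALUES[$key]}"
}

lru_put img-a emb-a
lru_put img-b emb-b
lru_get img-a          # hit: img-a becomes the most recently used key
lru_put img-c emb-c    # cache is full, so img-b (least recently used) is evicted
```

After these calls the cache holds `img-a` and `img-c`; a lookup for `img-b` misses, which in the real system would correspond to running the encoder again.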
--- a/docs/pages/features/multimodal/multimodal-vllm.md
+++ b/docs/pages/features/multimodal/multimodal-vllm.md
@@ -38,7 +38,6 @@ vLLM supports all multimodal deployment patterns. See [Architecture Patterns](RE
 | E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker |
 | E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate |
 | EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models |
-| E/PD (EC Connector) | ✅ | `agg_multimodal_ec_connector.sh` | vLLM-native encoder with ECConnector |
 ### Component Flags
@@ -161,34 +160,6 @@ bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
 > [!NOTE] Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
-## ECConnector Serving
-ECConnector is vLLM's native connector for transferring multimodal embeddings via an Embedding Cache. The encoder worker acts as a **producer** (writes embeddings), while the PD worker acts as a **consumer** (reads embeddings).
-**Workflow:**
-```mermaid
-flowchart LR
-  HTTP --> processor[EC Processor]
-  processor --image_url--> encoder[vLLM Native Encoder<br/>Producer]
-  encoder --writes--> cache[(Embedding Cache)]
-  cache --reads--> pd[PD Worker<br/>Consumer]
-  pd --> processor
-  processor --> HTTP
-```
-**Launch:**
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/agg_multimodal_ec_connector.sh --model llava-hf/llava-1.5-7b-hf
-# Custom storage path for Embedding Cache
-bash launch/agg_multimodal_ec_connector.sh --ec-storage-path /shared/encoder-cache
-```
-**Client:** Same as [E/PD Serving](#epd-serving-encode-separate)
 ## Llama 4 Serving
 The Llama 4 model family is natively multimodal. Unlike LLaVA, they do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encoder worker is not used and encoding is done alongside prefill.
@@ -440,6 +411,60 @@ cd $DYNAMO_HOME/examples/multimodal
 bash launch/audio_disagg.sh
 ```
+## Embedding Cache
+Dynamo supports an embedding cache for vLLM in both aggregated and disaggregated settings:
+| Setting | Implementation | Launch Script |
+|---------|---------------|---------------|
+| **Aggregated** | `ec_both` via upstream vLLM (main or v0.17.0+) | `vllm_serve_embedding_cache.sh` |
+| **Disaggregated** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
+### ec_both (Aggregated)
+A single vLLM instance acts as both **producer** (encodes and saves embeddings to CPU cache) and **consumer** (loads cached embeddings back to GPU). Repeated images skip encoding entirely.
+```mermaid
+flowchart LR
+  HTTP --> vllm[vLLM Instance<br/>ec_both]
+  vllm --save--> cache[(CPU Embedding Cache<br/>LRU)]
+  cache --load--> vllm
+  vllm --> HTTP
+```
+**Launch:**
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/vllm_serve_embedding_cache.sh --multimodal-embedding-cache-capacity-gb 10
+```
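+This configures `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector` automatically. `--multimodal-embedding-cache-capacity-gb` sets the CPU-side LRU cache size in GB; the script defaults to 10, and a value of 0 disables the cache by omitting `--ec-transfer-config`.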
+### Disaggregated Embedding Cache
+In the disaggregated setting, Dynamo maintains the embedding cache in its own worker layer on top of the vLLM engine. The encode worker computes embeddings and writes them to the cache, while the PD worker reads cached embeddings instead of re-encoding.
+```mermaid
+flowchart LR
+  HTTP --> processor
+  processor --> HTTP
+  processor --image_url--> encode_worker[Encode Worker]
+  encode_worker --> processor
+  encode_worker --embeddings--> cache[(Dynamo Embedding Cache)]
+  cache --load--> pd_worker[PD Worker]
+  pd_worker --> encode_worker
+```
+**Launch:**
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
+```
+**Client:** Same as [E/PD Serving](#epd-serving-encode-separate)
 ## NIXL Usage
 | Use Case | Script | NIXL Used? | Data Transfer |
@@ -448,7 +473,7 @@ bash launch/audio_disagg.sh
 | E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) |
 | E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) |
 | EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) |
-| E/PD (EC Connector) | `agg_multimodal_ec_connector.sh` | No | ECConnector via Embedding Cache |
+| EC Both (Local Node) | `vllm_serve_embedding_cache.sh` | No | ECConnector via CPU Embedding Cache |
 ## ModelInput Types and Registration

--- a/examples/backends/trtllm/launch/e_pd_disagg.sh
+++ b/examples/backends/trtllm/launch/e_pd_disagg.sh
@@ -17,9 +17,11 @@ export ENCODE_ENDPOINT=${ENCODE_ENDPOINT:-"dyn://dynamo.tensorrt_llm_encode.gene
 export MODALITY=${MODALITY:-"multimodal"}
 export ALLOWED_LOCAL_MEDIA_PATH=${ALLOWED_LOCAL_MEDIA_PATH:-"/tmp"}
 export MAX_FILE_SIZE_MB=${MAX_FILE_SIZE_MB:-50}
-export DYN_ENCODER_CACHE_CAPACITY_GB=${DYN_ENCODER_CACHE_CAPACITY_GB:-4}
 export CUSTOM_TEMPLATE=${CUSTOM_TEMPLATE:-"$DYNAMO_HOME/examples/backends/trtllm/templates/llava_multimodal.jinja"}
+# Extra arguments forwarded to the PD worker (e.g. --multimodal-embedding-cache-capacity-gb 10)
+EXTRA_PD_ARGS=("$@")
 # Setup cleanup trap
 cleanup() {
    echo "Cleaning up background processes..."
@@ -54,7 +56,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.trtllm \
  --custom-jinja-template "$CUSTOM_TEMPLATE" \
  --encode-endpoint "$ENCODE_ENDPOINT" \
  --disaggregation-mode prefill_and_decode \
-  --dyn-encoder-cache-capacity-gb "$DYN_ENCODER_CACHE_CAPACITY_GB" &
+  "${EXTRA_PD_ARGS[@]}" &
 PD_PID_1=$!
 wait $DYNAMO_PID
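The `EXTRA_PD_ARGS=("$@")` pattern in the hunk above can be tried in isolation with the sketch below. `fake_pd_worker` and `launch` are hypothetical stand-ins (the real script runs `python3 -m dynamo.trtllm`); the point is only that a quoted array expansion forwards each argument unchanged, and forwards nothing when no arguments are given.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the PD worker command: prints one argument per line.
fake_pd_worker() {
    printf '%s\n' "$@"
}

launch() {
    # Capture the caller's arguments exactly as given, like EXTRA_PD_ARGS in the diff.
    local extra_pd_args=("$@")
    # The quoted [@] expansion yields one word per element, or no words if empty.
    fake_pd_worker --disaggregation-mode prefill_and_decode "${extra_pd_args[@]}"
}

launch --multimodal-embedding-cache-capacity-gb 10
```

Note that with this change the script no longer passes a cache-capacity flag by default (the removed `DYN_ENCODER_CACHE_CAPACITY_GB` defaulted to 4), so whether the cache is active without the flag depends on the worker's own default, which this diff does not show.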
--- a/examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
+++ b/examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+CAPACITY_GB=10
+EXTRA_ARGS=()
+while [[ $# -gt 0 ]]; do
+    case "$1" in
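+        --multimodal-embedding-cache-capacity-gb)
+            # Without this check, a missing value makes `shift 2` fail and the loop never ends.
+            if [[ $# -lt 2 ]]; then
+                echo "error: $1 requires a value" >&2; exit 1
+            fi
+            CAPACITY_GB="$2"; shift 2 ;;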
+        *)
+            EXTRA_ARGS+=("$1"); shift ;;
+    esac
+done
+EC_ARGS=()
+if [[ "$CAPACITY_GB" != "0" ]]; then
+    EC_ARGS=(--ec-transfer-config "{
+        \"ec_role\": \"ec_both\",
+        \"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
+        \"ec_connector_module_path\": \"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector\",
+        \"ec_connector_extra_config\": {\"multimodal_embedding_cache_capacity_gb\": $CAPACITY_GB}
+    }")
+fi
+CUDA_VISIBLE_DEVICES=2 \
+vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
+    --enable-log-requests \
+    --max-model-len 16384 \
+    --gpu-memory-utilization .9 \
+    "${EC_ARGS[@]}" \
+    "${EXTRA_ARGS[@]}"
\ No newline at end of file
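Because the script interpolates `$CAPACITY_GB` directly into a JSON string, a malformed value would produce a config that `vllm serve` cannot parse, and a value such as `0.0` would not match the `!= "0"` check. One possible way to harden this is sketched below. `build_ec_args` is a hypothetical helper that is not part of the commit; the JSON keys and values are copied from the script above.

```bash
#!/usr/bin/env bash
# Sketch only: build_ec_args is a hypothetical helper, not part of the script above.
# It fills the global EC_ARGS array from a capacity value in GB.
build_ec_args() {
    local capacity_gb="$1"
    local number_re='^(0|[1-9][0-9]*)(\.[0-9]+)?$'   # plain JSON-style number
    local zero_re='^0(\.0+)?$'                        # 0, 0.0, 0.00, ...
    EC_ARGS=()

    # Refuse values that would produce invalid JSON once interpolated.
    if [[ ! "$capacity_gb" =~ $number_re ]]; then
        echo "error: capacity must be a non-negative number, got '$capacity_gb'" >&2
        return 1
    fi

    # Zero means "cache disabled": pass no --ec-transfer-config at all.
    if [[ "$capacity_gb" =~ $zero_re ]]; then
        return 0
    fi

    # Keys and values below are copied from the script in the diff.
    local fmt='{"ec_role": "ec_both", '
    fmt+='"ec_connector": "DynamoMultimodalEmbeddingCacheConnector", '
    fmt+='"ec_connector_module_path": "dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector", '
    fmt+='"ec_connector_extra_config": {"multimodal_embedding_cache_capacity_gb": %s}}'
    local config
    printf -v config "$fmt" "$capacity_gb"
    EC_ARGS=(--ec-transfer-config "$config")
}

build_ec_args 10 && printf '%s\n' "${EC_ARGS[@]}"
```

In the launch script, such a helper would replace the `if [[ "$CAPACITY_GB" != "0" ]]` block, with `"${EC_ARGS[@]}"` passed to `vllm serve` as before.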