Unverified Commit d5add7ff authored by Qi Wang's avatar Qi Wang Committed by GitHub
Browse files

docs: embedding cache in vLLM and TRT-LLM (#6555)

parent 5d958623
...@@ -388,6 +388,26 @@ For 4 4xGB200 nodes (2 for prefill, 2 for decode): ...@@ -388,6 +388,26 @@ For 4 4xGB200 nodes (2 for prefill, 2 for decode):
pkill srun pkill srun
``` ```
## Embedding Cache
Dynamo supports embedding cache in both aggregated and disaggregated settings for TRT-LLM:
| Setting | Implementation | Launch Script | Status |
|---------|---------------|---------------|--------|
| **Disaggregated (E/PD)** | Dynamo-managed cache in the PD worker layer on top of TRT-LLM engine | `disagg_e_pd.sh` + `--multimodal-embedding-cache-capacity-gb` | Supported |
| **Aggregated** | N/A | N/A | Not yet supported |
The cache uses `MultimodalEmbeddingCacheManager` to maintain an LRU cache of encoder embeddings on CPU. When the same image is seen again, the cached embedding is reused instead of re-encoding.
### Disaggregated (E/PD)
The `disagg_e_pd.sh` script launches a separate encode worker and a PD worker. Extra arguments are forwarded to the PD worker. Enable embedding cache by passing `--multimodal-embedding-cache-capacity-gb`:
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
```
## NIXL Usage ## NIXL Usage
| Use Case | Script | NIXL Used? | Data Transfer | | Use Case | Script | NIXL Used? | Data Transfer |
......
...@@ -38,7 +38,6 @@ vLLM supports all multimodal deployment patterns. See [Architecture Patterns](RE ...@@ -38,7 +38,6 @@ vLLM supports all multimodal deployment patterns. See [Architecture Patterns](RE
| E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker | | E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker |
| E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate | | E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate |
| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models | | EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models |
| E/PD (EC Connector) | ✅ | `agg_multimodal_ec_connector.sh` | vLLM-native encoder with ECConnector |
### Component Flags ### Component Flags
...@@ -161,34 +160,6 @@ bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf ...@@ -161,34 +160,6 @@ bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
> [!NOTE] Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported. > [!NOTE] Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
## ECConnector Serving
ECConnector is vLLM's native connector for transferring multimodal embeddings via an Embedding Cache. The encoder worker acts as a **producer** (writes embeddings), while the PD worker acts as a **consumer** (reads embeddings).
**Workflow:**
```mermaid
flowchart LR
HTTP --> processor[EC Processor]
processor --image_url--> encoder[vLLM Native Encoder<br/>Producer]
encoder --writes--> cache[(Embedding Cache)]
cache --reads--> pd[PD Worker<br/>Consumer]
pd --> processor
processor --> HTTP
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_multimodal_ec_connector.sh --model llava-hf/llava-1.5-7b-hf
# Custom storage path for Embedding Cache
bash launch/agg_multimodal_ec_connector.sh --ec-storage-path /shared/encoder-cache
```
**Client:** Same as [E/PD Serving](#epd-serving-encode-separate)
## Llama 4 Serving ## Llama 4 Serving
The Llama 4 model family is natively multimodal. Unlike LLaVA, they do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encoder worker is not used and encoding is done alongside prefill. The Llama 4 model family is natively multimodal. Unlike LLaVA, they do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encoder worker is not used and encoding is done alongside prefill.
...@@ -440,6 +411,60 @@ cd $DYNAMO_HOME/examples/multimodal ...@@ -440,6 +411,60 @@ cd $DYNAMO_HOME/examples/multimodal
bash launch/audio_disagg.sh bash launch/audio_disagg.sh
``` ```
## Embedding Cache
Dynamo supports embedding cache in both aggregated and disaggregated settings:
| Setting | Implementation | Launch Script |
|---------|---------------|---------------|
| **Aggregated** | `ec_both` via upstream vLLM (main or v0.17.0+) | `vllm_serve_embedding_cache.sh` |
| **Disaggregated** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
### ec_both (Aggregated)
A single vLLM instance acts as both **producer** (encodes and saves embeddings to CPU cache) and **consumer** (loads cached embeddings back to GPU). Repeated images skip encoding entirely.
```mermaid
flowchart LR
HTTP --> vllm[vLLM Instance<br/>ec_both]
vllm --save--> cache[(CPU Embedding Cache<br/>LRU)]
cache --load--> vllm
vllm --> HTTP
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/vllm_serve_embedding_cache.sh --multimodal-embedding-cache-capacity-gb 10
```
This configures `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector` automatically. The capacity parameter controls the CPU-side LRU cache size in GB (0 = disabled).
### Disaggregated Embedding Cache
In the disaggregated setting, Dynamo maintains the embedding cache in its own worker layer on top of the vLLM engine. The encode worker computes embeddings and writes them to the cache, while the PD worker reads cached embeddings instead of re-encoding.
```mermaid
flowchart LR
HTTP --> processor
processor --> HTTP
processor --image_url--> encode_worker[Encode Worker]
encode_worker --> processor
encode_worker --embeddings--> cache[(Dynamo Embedding Cache)]
cache --load--> pd_worker[PD Worker]
pd_worker --> encode_worker
```
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
```
**Client:** Same as [E/PD Serving](#epd-serving-encode-separate)
## NIXL Usage ## NIXL Usage
| Use Case | Script | NIXL Used? | Data Transfer | | Use Case | Script | NIXL Used? | Data Transfer |
...@@ -448,7 +473,7 @@ bash launch/audio_disagg.sh ...@@ -448,7 +473,7 @@ bash launch/audio_disagg.sh
| E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) | | E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) |
| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) | | E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) |
| EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) | | EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) |
| E/PD (EC Connector) | `agg_multimodal_ec_connector.sh` | No | ECConnector via Embedding Cache | | EC Both (Local Node) | `vllm_serve_embedding_cache.sh` | No | ECConnector via CPU Embedding Cache |
## ModelInput Types and Registration ## ModelInput Types and Registration
......
...@@ -17,9 +17,11 @@ export ENCODE_ENDPOINT=${ENCODE_ENDPOINT:-"dyn://dynamo.tensorrt_llm_encode.gene ...@@ -17,9 +17,11 @@ export ENCODE_ENDPOINT=${ENCODE_ENDPOINT:-"dyn://dynamo.tensorrt_llm_encode.gene
export MODALITY=${MODALITY:-"multimodal"} export MODALITY=${MODALITY:-"multimodal"}
export ALLOWED_LOCAL_MEDIA_PATH=${ALLOWED_LOCAL_MEDIA_PATH:-"/tmp"} export ALLOWED_LOCAL_MEDIA_PATH=${ALLOWED_LOCAL_MEDIA_PATH:-"/tmp"}
export MAX_FILE_SIZE_MB=${MAX_FILE_SIZE_MB:-50} export MAX_FILE_SIZE_MB=${MAX_FILE_SIZE_MB:-50}
export DYN_ENCODER_CACHE_CAPACITY_GB=${DYN_ENCODER_CACHE_CAPACITY_GB:-4}
export CUSTOM_TEMPLATE=${CUSTOM_TEMPLATE:-"$DYNAMO_HOME/examples/backends/trtllm/templates/llava_multimodal.jinja"} export CUSTOM_TEMPLATE=${CUSTOM_TEMPLATE:-"$DYNAMO_HOME/examples/backends/trtllm/templates/llava_multimodal.jinja"}
# Extra arguments forwarded to the PD worker (e.g. --multimodal-embedding-cache-capacity-gb 10)
EXTRA_PD_ARGS=("$@")
# Setup cleanup trap # Setup cleanup trap
cleanup() { cleanup() {
echo "Cleaning up background processes..." echo "Cleaning up background processes..."
...@@ -54,7 +56,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.trtllm \ ...@@ -54,7 +56,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.trtllm \
--custom-jinja-template "$CUSTOM_TEMPLATE" \ --custom-jinja-template "$CUSTOM_TEMPLATE" \
--encode-endpoint "$ENCODE_ENDPOINT" \ --encode-endpoint "$ENCODE_ENDPOINT" \
--disaggregation-mode prefill_and_decode \ --disaggregation-mode prefill_and_decode \
--dyn-encoder-cache-capacity-gb "$DYN_ENCODER_CACHE_CAPACITY_GB" & "${EXTRA_PD_ARGS[@]}" &
PD_PID_1=$! PD_PID_1=$!
wait $DYNAMO_PID wait $DYNAMO_PID
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
CAPACITY_GB=10
EXTRA_ARGS=()
while [[ $# -gt 0 ]]; do
case "$1" in
--multimodal-embedding-cache-capacity-gb)
CAPACITY_GB="$2"; shift 2 ;;
*)
EXTRA_ARGS+=("$1"); shift ;;
esac
done
EC_ARGS=()
if [[ "$CAPACITY_GB" != "0" ]]; then
EC_ARGS=(--ec-transfer-config "{
\"ec_role\": \"ec_both\",
\"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
\"ec_connector_module_path\": \"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector\",
\"ec_connector_extra_config\": {\"multimodal_embedding_cache_capacity_gb\": $CAPACITY_GB}
}")
fi
CUDA_VISIBLE_DEVICES=2 \
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
--enable-log-requests \
--max-model-len 16384 \
--gpu-memory-utilization .9 \
"${EC_ARGS[@]}" \
"${EXTRA_ARGS[@]}"
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment