docs: delete script and add instructions (#6763)

Commit 552ae186, authored Mar 03, 2026 by Qi Wang, committed by GitHub Mar 03, 2026
Showing 2 changed files with 21 additions and 66 deletions

docs/features/multimodal/multimodal-vllm.md +21 -4

examples/backends/vllm/launch/vllm_serve_embedding_cache.sh +0 -62

--- a/docs/features/multimodal/multimodal-vllm.md
+++ b/docs/features/multimodal/multimodal-vllm.md
@@ -418,12 +418,12 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings:

 | Setting | Implementation | Launch Script |
 |---------|---------------|---------------|
-| **Aggregated** | `ec_both` via upstream vLLM (main or v0.17.0+) | `vllm_serve_embedding_cache.sh` |
 | **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
+| **Aggregated** | Experimental via vLLM git patches | N/A |

 ### Aggregated Worker

-A single vLLM instance acts as both **producer** (encodes and saves embeddings to CPU cache) and **consumer** (loads cached embeddings back to GPU). Repeated images skip encoding entirely.
+A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. Experimental — requires vLLM patches (see below).

 ```mermaid
 ---
@@ -441,8 +441,25 @@ flowchart LR
 **Launch:**

 ```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-bash launch/vllm_serve_embedding_cache.sh --multimodal-embedding-cache-capacity-gb 10
+
+cd /opt/dynamo/venv/lib/python3.12/site-packages
+
+curl -sL https://github.com/vllm-project/vllm/pull/34182.diff | patch -p1
+
+curl -sL https://github.com/vllm-project/vllm/pull/34783.diff | python3 -c "
+import sys
+chunks = sys.stdin.read().split('diff --git ')
+filtered = [c for c in chunks if c.startswith('a/vllm/')]
+print(''.join('diff --git ' + c for c in filtered))
+" | patch -p1
+
+vllm serve $model \
+    --ec-transfer-config "{
+        \"ec_role\": \"ec_both\",
+        \"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
+        \"ec_connector_module_path\": \"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector\",
+        \"ec_connector_extra_config\": {\"multimodal_embedding_cache_capacity_gb\": 10}
+    }"
 ```

 This configures `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector` automatically. The capacity parameter controls the CPU-side LRU cache size in GB (0 = disabled).
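
Note on the inline `python3 -c` filter in the added launch instructions: it splits the second PR's diff into per-file sections and keeps only those whose old path starts with `a/vllm/`, presumably so that `patch -p1` is not asked to touch files that are absent from an installed `site-packages` tree. A minimal standalone sketch of the same logic follows; the function name `filter_diff` and the sample file paths are hypothetical and are not part of the commit.

```python
def filter_diff(diff_text: str, prefix: str = "a/vllm/") -> str:
    """Keep only the per-file sections of a git diff whose old path starts with `prefix`."""
    # Each section of `git diff` output begins with "diff --git a/<path> b/<path>".
    # After splitting on that marker, every chunk starts directly with "a/<path> ...".
    # Caveat: a patch whose body contains the literal marker text would be split there too.
    chunks = diff_text.split("diff --git ")
    kept = [chunk for chunk in chunks if chunk.startswith(prefix)]
    # Restore the marker that split() consumed.
    return "".join("diff --git " + chunk for chunk in kept)


# Hypothetical two-file diff: one file inside the package, one outside it.
sample = (
    "diff --git a/vllm/config.py b/vllm/config.py\n"
    "--- a/vllm/config.py\n"
    "+++ b/vllm/config.py\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
    "diff --git a/tests/test_config.py b/tests/test_config.py\n"
    "--- a/tests/test_config.py\n"
    "+++ b/tests/test_config.py\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
)

filtered = filter_diff(sample)
print(filtered)
```

Only the `a/vllm/config.py` section survives; the `tests/` section is dropped before the text reaches `patch`.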

--- a/examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
+++ b/examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-
-
-MODEL="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
-CAPACITY_GB=10
-EXTRA_ARGS=()
-
-while [[ $# -gt 0 ]]; do
-    case "$1" in
-        --multimodal-embedding-cache-capacity-gb)
-            CAPACITY_GB="$2"; shift 2 ;;
-        *)
-            EXTRA_ARGS+=("$1"); shift ;;
-    esac
-done
-
-EC_ARGS=()
-if [[ "$CAPACITY_GB" != "0" ]]; then
-    EC_ARGS=(--ec-transfer-config "{
-        \"ec_role\": \"ec_both\",
-        \"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
-        \"ec_connector_module_path\": \"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector\",
-        \"ec_connector_extra_config\": {\"multimodal_embedding_cache_capacity_gb\": $CAPACITY_GB}
-    }")
-fi
-
-HTTP_PORT="${DYN_HTTP_PORT:-8000}"
-echo "=========================================="
-echo "Launching vLLM Serve + Embedding Cache (1 GPU)"
-echo "=========================================="
-echo "Model:       $MODEL"
-echo "Server:      http://localhost:$HTTP_PORT"
-echo "=========================================="
-echo ""
-echo "Example test command:"
-echo ""
-echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
-echo "    -H 'Content-Type: application/json' \\"
-echo "    -d '{"
-echo "      \"model\": \"${MODEL}\","
-echo "      \"messages\": [{"
-echo "        \"role\": \"user\","
-echo "        \"content\": ["
-echo "          {\"type\": \"text\", \"text\": \"Describe the image.\"},"
-echo "          {\"type\": \"image_url\", \"image_url\": {\"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/480px-Cat03.jpg\"}}"
-echo "        ]"
-echo "      }],"
-echo "      \"max_tokens\": 50"
-echo "    }'"
-echo ""
-echo "=========================================="
-
-CUDA_VISIBLE_DEVICES=2 \
-vllm serve $MODEL \
-    --port "$HTTP_PORT" \
-    --enable-log-requests \
-    --max-model-len 16384 \
-    --gpu-memory-utilization .9 \
-    "${EC_ARGS[@]}" \
-    "${EXTRA_ARGS[@]}"
\ No newline at end of file
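
Note on the removed script: its main job was to build the `--ec-transfer-config` JSON and to omit the flag entirely when the capacity is 0. For readers who still want that behavior without hand-escaped quotes, here is a minimal Python sketch. The helper name `build_ec_args` is hypothetical; the config keys and values are copied from the diff above, and the model name is the default from the deleted script. The script compared the string `"0"`, whereas this sketch compares numerically, which is an assumption about the intended behavior.

```python
import json
import shlex


def build_ec_args(capacity_gb: float) -> list[str]:
    """Return the vllm serve arguments for the embedding cache, or [] when disabled."""
    # Capacity 0 means "disabled": pass no --ec-transfer-config flag at all.
    if capacity_gb == 0:
        return []
    config = {
        "ec_role": "ec_both",
        "ec_connector": "DynamoMultimodalEmbeddingCacheConnector",
        "ec_connector_module_path": (
            "dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector"
        ),
        "ec_connector_extra_config": {
            "multimodal_embedding_cache_capacity_gb": capacity_gb
        },
    }
    # json.dumps handles the quoting that the shell version escapes by hand.
    return ["--ec-transfer-config", json.dumps(config)]


# Default model from the deleted script.
model = "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
command = ["vllm", "serve", model, *build_ec_args(10)]

# shlex.join produces a shell-safe command line for copy and paste.
print(shlex.join(command))
```

The printed command can be pasted into a shell, or the `command` list can be handed to `subprocess.run` directly.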