chore: remove vLLM patches for agg embedding cache (#7799)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Commit eac5e463, authored Apr 03, 2026 by Qi Wang, committed by GitHub Apr 03, 2026
7 changed files
--- a/benchmarks/multimodal/sweep/README.md
+++ b/benchmarks/multimodal/sweep/README.md
@@ -47,11 +47,11 @@ input_files:
 # Each config launches the workflow with its own extra_args
 configs:
  - label: cache-off
-    workflow: examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
+    workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
    extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "0"]
  - label: cache-on
-    workflow: examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
+    workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
    extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "10"]
 ```

--- a/benchmarks/multimodal/sweep/experiments/embedding_cache/vllm_serve.yaml
+++ b/benchmarks/multimodal/sweep/experiments/embedding_cache/vllm_serve.yaml
@@ -30,9 +30,9 @@ input_files:
 configs:
  - label: cache-off
-    workflow: examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
+    workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
    extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "0"]
  - label: cache-on
-    workflow: examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
+    workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
    extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "10"]
--- a/examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
+++ b/examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Minimal vllm serve wrapper for benchmark sweeps.
+# Launched by the sweep orchestrator via: bash vllm_serve.sh --model <model> [extra_args...]
-MODEL="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
+MODEL=""
-CAPACITY_GB=10
+CAPACITY_GB=0
 EXTRA_ARGS=()
 while [[ $# -gt 0 ]]; do
@@ -17,7 +20,11 @@ while [[ $# -gt 0 ]]; do
    esac
 done
-# Need vLLM main or v0.17+
+if [[ -z "$MODEL" ]]; then
+    echo "ERROR: --model is required" >&2
+    exit 1
+fi
 EC_ARGS=()
 if [[ "$CAPACITY_GB" != "0" ]]; then
    EC_ARGS=(--ec-transfer-config "{
@@ -36,7 +43,6 @@ else
    GPU_MEM_ARGS="--gpu-memory-utilization $GPU_MEM_UTIL"
 fi
-CUDA_VISIBLE_DEVICES=2 \
 vllm serve "$MODEL" \
    --enable-log-requests \
    --max-model-len 16384 \

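Reviewer note: the header comment added to the wrapper says the sweep orchestrator launches it as `bash vllm_serve.sh --model <model> [extra_args...]`. A minimal sketch of how one `configs` entry could be turned into that command line is shown here. The `build_command` helper is hypothetical (the orchestrator's real code is not part of this diff), and the model name is borrowed from the docs changes purely for illustration.

```python
import shlex


def build_command(workflow, model, extra_args):
    """Compose argv in the shape: bash <workflow> --model <model> [extra_args...].

    Hypothetical helper; it only mirrors the invocation described in the
    wrapper's header comment.
    """
    return ["bash", workflow, "--model", model, *extra_args]


# One entry shaped like the `cache-on` config in the sweep YAML.
config = {
    "label": "cache-on",
    "workflow": "benchmarks/multimodal/sweep/workflows/vllm_serve.sh",
    "extra_args": [
        "--no-enable-prefix-caching",
        "--multimodal-embedding-cache-capacity-gb",
        "10",
    ],
}

# Illustrative model name, not taken from the sweep config.
argv = build_command(
    config["workflow"],
    "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",
    config["extra_args"],
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Because the wrapper now defaults `MODEL` to an empty string and exits when it is unset, a caller built this way always has to pass `--model` explicitly.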
--- a/docs/features/multimodal/embedding-cache.md
+++ b/docs/features/multimodal/embedding-cache.md
@@ -27,7 +27,7 @@ If your workload consists entirely of unique images, the cache provides no benef
 | **TRT-LLM** | ❌ | ✅ | Dynamo `MultimodalEmbeddingCacheManager` in PD worker |
 | **SGLang** | ❌ | ❌ | Not supported yet |
-This support requires vLLM `0.18.0` or newer.
+This support requires vLLM `0.17.0` or newer.
 ## How It Works

--- a/docs/features/multimodal/multimodal-vllm.md
+++ b/docs/features/multimodal/multimodal-vllm.md
@@ -228,12 +228,12 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings:
 | Setting                   | Implementation                                                 | Launch Script               |
 | ------------------------- | -------------------------------------------------------------- | --------------------------- |
-| **Aggregated**            | Supported via vLLM ECConnector in vLLM 0.18+                   | `agg_multimodal.sh` (or with `vllm serve` directly) |
+| **Aggregated**            | Supported via vLLM ECConnector in vLLM 0.17+                   | `agg_multimodal.sh` (or with `vllm serve` directly) |
 | **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
 ### Aggregated Worker
-A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely.
+A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. Supported natively with vLLM 0.17+.
 ```mermaid
 ---
@@ -248,12 +248,20 @@ flowchart LR
  encode -- save: GPU → CPU --> store[(CPU Embedding Cache<br/>LRU)]
 ```
-**Launch:**
+**Launch with Dynamo:**
+```bash
+bash examples/backends/vllm/launch/agg_multimodal.sh \
+    --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
+    --multimodal-embedding-cache-capacity-gb 10
+```
+`dynamo.vllm` automatically configures `ec_both` mode with the `DynamoMultimodalEmbeddingCacheConnector` when the capacity is > 0.
-<!-- TODO: Add an example of Dynamo+vLLM Agg worker + Embedding Cache -->
+**Launch with `vllm serve` (standalone, no Dynamo):**
 ```bash
-vllm serve $model \
+vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
    --ec-transfer-config "{
        \"ec_role\": \"ec_both\",
        \"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
@@ -262,7 +270,7 @@ vllm serve $model \
    }"
 ```
-This configures `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector` automatically. The capacity parameter controls the CPU-side LRU cache size in GB (0 = disabled).
+The `multimodal_embedding_cache_capacity_gb` parameter controls the CPU-side LRU cache size in GB (0 = disabled). Requires vLLM 0.17+.
 ### Disaggregated Encoder (Embedding Cache in Prefill Worker)

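Reviewer note: the docs now state that the aggregated embedding cache requires vLLM 0.17+. A deployment could guard on that before enabling the cache; the following is a minimal sketch, not something this commit adds. The function names and the `MIN_VLLM` constant are hypothetical, and the comparison looks only at the major and minor numbers, so a pre-release such as `0.17.0rc1` would pass.

```python
import re

# Minimum (major, minor) stated in the updated docs.
MIN_VLLM = (0, 17)


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '0.17.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_native_ec_both(version, minimum=MIN_VLLM):
    """Return True when the version meets the documented minimum."""
    return parse_major_minor(version) >= minimum


for candidate in ("0.16.3", "0.17.0", "0.18.1"):
    print(candidate, supports_native_ec_both(candidate))
```

In a real environment the version string could come from `importlib.metadata.version("vllm")`, which reads the installed package metadata without importing vLLM itself.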
--- a/recipes/qwen3-vl-30b/README.md
+++ b/recipes/qwen3-vl-30b/README.md
@@ -48,7 +48,7 @@ kubectl apply -f data-gen/generate-datasets-job.yaml -n ${NAMESPACE}
 1. Exact cache hit rates cannot be explicitly controlled via dataset due to potential LRU embedding cache eviction policies; however, decreasing the image pool relative to the number of requests allows for proportionally higher probabilities of seeing duplicate images and cache hits. Increasing the embedding cache capacity also allows for higher cache hit rate because it will evict less.
-**2. Agg embedding cache requires `ec_both` ECConnector role in vLLM, but that functionality was merged post 1.0.0 release. The worker startup in `vllm/agg-embedding-cache/deploy.yaml` applies the required upstream vLLM patches inline at runtime. See [multimodal-vllm.md](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md#embedding-cache) for more details.**
+**2. Agg embedding cache uses vLLM's native `ec_both` ECConnector role, supported in vLLM 0.17+. No patches required. See [multimodal-vllm.md](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md#embedding-cache) for more details.**
 3. Replace placeholders in `*.yaml` before running:
   - `storageClassName: "your-storage-class-name"` in `model-cache/model-cache.yaml`

--- a/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/deploy.yaml
+++ b/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/deploy.yaml
@@ -42,17 +42,6 @@ spec:
          args:
            - |
              set -euo pipefail
-              SITE_PACKAGES="$(python3 -c 'import pathlib, vllm; print(pathlib.Path(vllm.__file__).resolve().parent.parent)')"
-              cd "${SITE_PACKAGES}"
-              curl -sL https://github.com/vllm-project/vllm/pull/34182.diff | patch -p1
-              curl -sL https://github.com/vllm-project/vllm/pull/34783.diff | python3 -c "
-              import sys
-              chunks = sys.stdin.read().split('diff --git ')
-              filtered = [c for c in chunks if c.startswith('a/vllm/')]
-              print(''.join('diff --git ' + c for c in filtered))
-              " | patch -p1
-              cd /workspace
              python3 -m dynamo.vllm \
                --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
                --enable-multimodal \