Unverified Commit eac5e463 authored by Qi Wang, committed by GitHub

chore: remove vLLM patches for agg embedding cache (#7799)


Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent a72bd22d
...@@ -47,11 +47,11 @@ input_files: ...@@ -47,11 +47,11 @@ input_files:
# Each config launches the workflow with its own extra_args # Each config launches the workflow with its own extra_args
configs: configs:
- label: cache-off - label: cache-off
workflow: examples/backends/vllm/launch/vllm_serve_embedding_cache.sh workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "0"] extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "0"]
- label: cache-on - label: cache-on
workflow: examples/backends/vllm/launch/vllm_serve_embedding_cache.sh workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "10"] extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "10"]
``` ```
......
...@@ -30,9 +30,9 @@ input_files: ...@@ -30,9 +30,9 @@ input_files:
configs: configs:
- label: cache-off - label: cache-off
workflow: examples/backends/vllm/launch/vllm_serve_embedding_cache.sh workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "0"] extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "0"]
- label: cache-on - label: cache-on
workflow: examples/backends/vllm/launch/vllm_serve_embedding_cache.sh workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "10"] extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "10"]
#!/bin/bash #!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
#
# Minimal vllm serve wrapper for benchmark sweeps.
# Launched by the sweep orchestrator via: bash vllm_serve.sh --model <model> [extra_args...]
MODEL="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" MODEL=""
CAPACITY_GB=10 CAPACITY_GB=0
EXTRA_ARGS=() EXTRA_ARGS=()
while [[ $# -gt 0 ]]; do while [[ $# -gt 0 ]]; do
...@@ -17,7 +20,11 @@ while [[ $# -gt 0 ]]; do ...@@ -17,7 +20,11 @@ while [[ $# -gt 0 ]]; do
esac esac
done done
# Need vLLM main or v0.17+ if [[ -z "$MODEL" ]]; then
echo "ERROR: --model is required" >&2
exit 1
fi
EC_ARGS=() EC_ARGS=()
if [[ "$CAPACITY_GB" != "0" ]]; then if [[ "$CAPACITY_GB" != "0" ]]; then
EC_ARGS=(--ec-transfer-config "{ EC_ARGS=(--ec-transfer-config "{
...@@ -36,7 +43,6 @@ else ...@@ -36,7 +43,6 @@ else
GPU_MEM_ARGS="--gpu-memory-utilization $GPU_MEM_UTIL" GPU_MEM_ARGS="--gpu-memory-utilization $GPU_MEM_UTIL"
fi fi
CUDA_VISIBLE_DEVICES=2 \
vllm serve "$MODEL" \ vllm serve "$MODEL" \
--enable-log-requests \ --enable-log-requests \
--max-model-len 16384 \ --max-model-len 16384 \
......
...@@ -27,7 +27,7 @@ If your workload consists entirely of unique images, the cache provides no benef ...@@ -27,7 +27,7 @@ If your workload consists entirely of unique images, the cache provides no benef
| **TRT-LLM** | ❌ | ✅ | Dynamo `MultimodalEmbeddingCacheManager` in PD worker | | **TRT-LLM** | ❌ | ✅ | Dynamo `MultimodalEmbeddingCacheManager` in PD worker |
| **SGLang** | ❌ | ❌ | Not supported yet | | **SGLang** | ❌ | ❌ | Not supported yet |
This support requires vLLM `0.18.0` or newer. This support requires vLLM `0.17.0` or newer.
## How It Works ## How It Works
......
...@@ -228,12 +228,12 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings: ...@@ -228,12 +228,12 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings:
| Setting | Implementation | Launch Script | | Setting | Implementation | Launch Script |
| ------------------------- | -------------------------------------------------------------- | --------------------------- | | ------------------------- | -------------------------------------------------------------- | --------------------------- |
| **Aggregated** | Supported via vLLM ECConnector in vLLM 0.18+ | `agg_multimodal.sh` (or with `vllm serve` directly) | | **Aggregated** | Supported via vLLM ECConnector in vLLM 0.17+ | `agg_multimodal.sh` (or with `vllm serve` directly) |
| **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` | | **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
### Aggregated Worker ### Aggregated Worker
A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. Supported natively with vLLM 0.17+.
```mermaid ```mermaid
--- ---
...@@ -248,12 +248,20 @@ flowchart LR ...@@ -248,12 +248,20 @@ flowchart LR
encode -- save: GPU → CPU --> store[(CPU Embedding Cache<br/>LRU)] encode -- save: GPU → CPU --> store[(CPU Embedding Cache<br/>LRU)]
``` ```
**Launch:** **Launch with Dynamo:**
```bash
bash examples/backends/vllm/launch/agg_multimodal.sh \
--model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
--multimodal-embedding-cache-capacity-gb 10
```
`dynamo.vllm` automatically configures `ec_both` mode with the `DynamoMultimodalEmbeddingCacheConnector` when the capacity is > 0.
<!-- TODO: Add an example of Dynamo+vLLM Agg worker + Embedding Cache --> **Launch with `vllm serve` (standalone, no Dynamo):**
```bash ```bash
vllm serve $model \ vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
--ec-transfer-config "{ --ec-transfer-config "{
\"ec_role\": \"ec_both\", \"ec_role\": \"ec_both\",
\"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\", \"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
...@@ -262,7 +270,7 @@ vllm serve $model \ ...@@ -262,7 +270,7 @@ vllm serve $model \
}" }"
``` ```
This configures `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector` automatically. The capacity parameter controls the CPU-side LRU cache size in GB (0 = disabled). The `multimodal_embedding_cache_capacity_gb` parameter controls the CPU-side LRU cache size in GB (0 = disabled). Requires vLLM 0.17+.
### Disaggregated Encoder (Embedding Cache in Prefill Worker) ### Disaggregated Encoder (Embedding Cache in Prefill Worker)
......
...@@ -48,7 +48,7 @@ kubectl apply -f data-gen/generate-datasets-job.yaml -n ${NAMESPACE} ...@@ -48,7 +48,7 @@ kubectl apply -f data-gen/generate-datasets-job.yaml -n ${NAMESPACE}
1. Exact cache hit rates cannot be explicitly controlled via dataset due to potential LRU embedding cache eviction policies; however, decreasing the image pool relative to the number of requests allows for proportionally higher probabilities of seeing duplicate images and cache hits. Increasing the embedding cache capacity also allows for higher cache hit rate because it will evict less. 1. Exact cache hit rates cannot be explicitly controlled via dataset due to potential LRU embedding cache eviction policies; however, decreasing the image pool relative to the number of requests allows for proportionally higher probabilities of seeing duplicate images and cache hits. Increasing the embedding cache capacity also allows for higher cache hit rate because it will evict less.
**2. Agg embedding cache requires `ec_both` ECConnector role in vLLM, but that functionality was merged post 1.0.0 release. The worker startup in `vllm/agg-embedding-cache/deploy.yaml` applies the required upstream vLLM patches inline at runtime. See [multimodal-vllm.md](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md#embedding-cache) for more details.** **2. Agg embedding cache uses vLLM's native `ec_both` ECConnector role, supported in vLLM 0.17+. No patches required. See [multimodal-vllm.md](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md#embedding-cache) for more details.**
3. Replace placeholders in `*.yaml` before running: 3. Replace placeholders in `*.yaml` before running:
- `storageClassName: "your-storage-class-name"` in `model-cache/model-cache.yaml` - `storageClassName: "your-storage-class-name"` in `model-cache/model-cache.yaml`
......
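Note 1 above ties the cache hit rate to the ratio of image pool size to request count and to the cache capacity. That relationship can be explored offline with a rough simulation. The sketch below is not part of this commit or of Dynamo. It assumes images are drawn uniformly at random from the pool and measures capacity in number of cached entries rather than GB. The function name `simulate_hit_rate` and all of its parameters are hypothetical.

```python
import random
from collections import OrderedDict


def simulate_hit_rate(pool_size, num_requests, capacity, seed=0):
    """Estimate the hit rate of an LRU cache under uniform random image requests.

    Assumptions (not taken from the benchmark itself):
      - each request carries one image drawn uniformly from `pool_size` images
      - `capacity` is the number of embeddings the cache can hold (entries, not GB)
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for _ in range(num_requests):
        image_id = rng.randrange(pool_size)
        if image_id in cache:
            hits += 1
            cache.move_to_end(image_id)  # mark as most recently used
        else:
            cache[image_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / num_requests


if __name__ == "__main__":
    # Shrinking the pool relative to the request count raises the hit rate.
    for pool_size in (10, 100, 1000):
        rate = simulate_hit_rate(pool_size, num_requests=1000, capacity=50)
        print(f"pool={pool_size:5d} capacity=50 hit_rate={rate:.3f}")
```

Because the seed fixes the request sequence, runs with different `capacity` values see the same requests, which makes it easy to compare how much eviction a given capacity causes for a given pool size.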
...@@ -42,17 +42,6 @@ spec: ...@@ -42,17 +42,6 @@ spec:
args: args:
- | - |
set -euo pipefail set -euo pipefail
SITE_PACKAGES="$(python3 -c 'import pathlib, vllm; print(pathlib.Path(vllm.__file__).resolve().parent.parent)')"
cd "${SITE_PACKAGES}"
curl -sL https://github.com/vllm-project/vllm/pull/34182.diff | patch -p1
curl -sL https://github.com/vllm-project/vllm/pull/34783.diff | python3 -c "
import sys
chunks = sys.stdin.read().split('diff --git ')
filtered = [c for c in chunks if c.startswith('a/vllm/')]
print(''.join('diff --git ' + c for c in filtered))
" | patch -p1
cd /workspace
python3 -m dynamo.vllm \ python3 -m dynamo.vllm \
--model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \ --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
--enable-multimodal \ --enable-multimodal \
......