Unverified commit 552ae186, authored by Qi Wang, committed by GitHub

docs: delete script and add instructions (#6763)

parent bcbb4d4c
@@ -418,12 +418,12 @@ Dynamo supports embedding cache in both aggregated and disaggregated settings:
| Setting | Implementation | Launch Script |
|---------|---------------|---------------|
| **Aggregated** | `ec_both` via upstream vLLM (main or v0.17.0+) | `vllm_serve_embedding_cache.sh` |
| **Disaggregated encoder** | Dynamo-managed cache in the worker layer on top of vLLM engine | `disagg_multimodal_e_pd.sh` |
| **Aggregated** | Experimental via vLLM git patches | N/A |
### Aggregated Worker
A single vLLM instance acts as both **producer** (encodes and saves embeddings to CPU cache) and **consumer** (loads cached embeddings back to GPU). Repeated images skip encoding entirely.
A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. Experimental — requires vLLM patches (see below).
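To illustrate how a CPU-side cache lets repeated images skip encoding, here is a minimal sketch of a byte-bounded LRU cache keyed by a content hash. It is not the Dynamo or vLLM implementation: `EmbeddingLRUCache`, `get_or_encode`, and `fake_encode` are hypothetical names, and embeddings are modeled as plain `bytes` rather than tensors.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class EmbeddingLRUCache:
    """Hypothetical byte-bounded LRU cache (illustration only)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # content hash -> embedding bytes

    def get_or_encode(self, image: bytes, encode: Callable[[bytes], bytes]) -> bytes:
        """Return the cached embedding, or run `encode` once and cache the result."""
        key = hashlib.sha256(image).hexdigest()
        if key in self._entries:
            # Cache hit: mark as most recently used and skip the encoder.
            self._entries.move_to_end(key)
            return self._entries[key]
        embedding = encode(image)
        self._store(key, embedding)
        return embedding

    def _store(self, key: str, embedding: bytes) -> None:
        size = len(embedding)
        if size > self.capacity_bytes:
            # Entry does not fit; with capacity 0 nothing non-empty is ever cached.
            return
        # Evict least recently used entries until the new one fits.
        while self.used_bytes + size > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._entries[key] = embedding
        self.used_bytes += size


if __name__ == "__main__":
    calls = []

    def fake_encode(image: bytes) -> bytes:
        # Stand-in for the vision encoder; records each invocation.
        calls.append(image)
        return image * 4

    cache = EmbeddingLRUCache(capacity_bytes=64)
    cache.get_or_encode(b"cat", fake_encode)
    cache.get_or_encode(b"cat", fake_encode)  # served from cache
    print(len(calls))  # prints 1
```

The second request for the same image never reaches the encoder, which is the behavior described above. A real cache would hold tensors and account for their actual memory footprint.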
```mermaid
---
@@ -441,8 +441,25 @@ flowchart LR
**Launch:**
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/vllm_serve_embedding_cache.sh --multimodal-embedding-cache-capacity-gb 10
cd /opt/dynamo/venv/lib/python3.12/site-packages
curl -sL https://github.com/vllm-project/vllm/pull/34182.diff | patch -p1
curl -sL https://github.com/vllm-project/vllm/pull/34783.diff | python3 -c "
import sys
chunks = sys.stdin.read().split('diff --git ')
filtered = [c for c in chunks if c.startswith('a/vllm/')]
print(''.join('diff --git ' + c for c in filtered))
" | patch -p1
vllm serve $model \
--ec-transfer-config "{
\"ec_role\": \"ec_both\",
\"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
\"ec_connector_module_path\": \"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector\",
\"ec_connector_extra_config\": {\"multimodal_embedding_cache_capacity_gb\": 10}
}"
```
This runs `vllm serve` with `ec_role=ec_both` and the `DynamoMultimodalEmbeddingCacheConnector`. The `multimodal_embedding_cache_capacity_gb` value sets the size of the CPU-side LRU cache in GB (0 = disabled).
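Hand-escaping the JSON passed to `--ec-transfer-config` is easy to get wrong. As a sketch, the same configuration can be generated in Python and quoted for the shell. The helper names `build_ec_transfer_args` and `build_command` are hypothetical. Omitting the flag when the capacity is 0 mirrors the removed launch script and is an assumption about how you want "disabled" to behave. The model name is the one used in that script.

```python
import json
import shlex


def build_ec_transfer_args(capacity_gb: float) -> list[str]:
    """Return the --ec-transfer-config arguments, or [] when caching is disabled.

    Hypothetical helper; the config keys are copied from the instructions above.
    """
    if capacity_gb <= 0:
        # Assumption: like the removed script, skip the flag entirely for capacity 0.
        return []
    config = {
        "ec_role": "ec_both",
        "ec_connector": "DynamoMultimodalEmbeddingCacheConnector",
        "ec_connector_module_path": (
            "dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector"
        ),
        "ec_connector_extra_config": {
            "multimodal_embedding_cache_capacity_gb": capacity_gb
        },
    }
    return ["--ec-transfer-config", json.dumps(config)]


def build_command(model: str, capacity_gb: float) -> list[str]:
    """Assemble the argv list for `vllm serve` (hypothetical helper)."""
    return ["vllm", "serve", model, *build_ec_transfer_args(capacity_gb)]


if __name__ == "__main__":
    cmd = build_command("Qwen/Qwen3-VL-30B-A3B-Instruct-FP8", 10)
    # shlex.join quotes the JSON so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run` directly avoids shell quoting altogether; `shlex.join` is only needed when the command is pasted into a terminal.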
......
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
MODEL="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
CAPACITY_GB=10
EXTRA_ARGS=()
while [[ $# -gt 0 ]]; do
case "$1" in
--multimodal-embedding-cache-capacity-gb)
CAPACITY_GB="$2"; shift 2 ;;
*)
EXTRA_ARGS+=("$1"); shift ;;
esac
done
EC_ARGS=()
if [[ "$CAPACITY_GB" != "0" ]]; then
EC_ARGS=(--ec-transfer-config "{
\"ec_role\": \"ec_both\",
\"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
\"ec_connector_module_path\": \"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector\",
\"ec_connector_extra_config\": {\"multimodal_embedding_cache_capacity_gb\": $CAPACITY_GB}
}")
fi
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching vLLM Serve + Embedding Cache (1 GPU)"
echo "=========================================="
echo "Model: $MODEL"
echo "Server: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{"
echo " \"role\": \"user\","
echo " \"content\": ["
echo " {\"type\": \"text\", \"text\": \"Describe the image.\"},"
echo " {\"type\": \"image_url\", \"image_url\": {\"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/480px-Cat03.jpg\"}}"
echo " ]"
echo " }],"
echo " \"max_tokens\": 50"
echo " }'"
echo ""
echo "=========================================="
CUDA_VISIBLE_DEVICES=2 \
vllm serve $MODEL \
--port "$HTTP_PORT" \
--enable-log-requests \
--max-model-len 16384 \
--gpu-memory-utilization .9 \
"${EC_ARGS[@]}" \
"${EXTRA_ARGS[@]}"
\ No newline at end of file