"docs/vscode:/vscode.git/clone" did not exist on "622d45128c02e5296e1177481c65199754eab396"
Commit 5e51d6dd authored by Elijah Soba, committed by GitHub

feat: Qwen3-VL-30B recipe for agg embedding cache with vLLM patch (#6919)


Signed-off-by: Elijah Soba <esoba@nvidia.com>
parent fa474d36
# Qwen3-VL-30B-A3B-Instruct-FP8: Aggregated Embedding Cache On vs Off Comparison
This recipe measures the performance impact of enabling the embedding cache for multimodal payloads. It includes guidance on creating a synthetic dataset with user-defined image reuse, and production-ready deployments for `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`.
## Results
| Metric | Cache ON | Cache OFF | Delta |
|----------------------|---------:|----------:|-------:|
| Output TPS (tok/s) | 3575.6 | 3072.3 | +16.4% |
| TTFT avg (ms) | 526.0 | 727.5 | -27.7% |
| TTFT p50 (ms) | 356.8 | 510.8 | -30.1% |
| ITL avg (ms) | 14.1 | 15.5 | -8.8% |
| Req Latency avg (ms) | 2630.0 | 3035.7 | -13.4% |
**On a single aggregated GB200 replica using the vLLM backend, enabling the embedding cache for `Qwen3-VL-30B-A3B-Instruct-FP8` improved output throughput by about 16%, reduced average TTFT by about 28%, and reduced average request latency by about 13%.**
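The Delta column can be rechecked from the two measurement columns. The following is a minimal Python sketch, assuming Delta is the relative change of Cache ON against Cache OFF; the helper name `relative_delta` is illustrative and not part of the recipe.

```python
# Recompute the Delta column from the published Cache ON / Cache OFF values.
# Assumption: Delta = (cache_on - cache_off) / cache_off, expressed in percent.

RESULTS = {
    "Output TPS (tok/s)": (3575.6, 3072.3),
    "TTFT avg (ms)": (526.0, 727.5),
    "TTFT p50 (ms)": (356.8, 510.8),
    # The table inputs are rounded to one decimal, so this row recomputes to
    # about -9.0% rather than the published -8.8%.
    "ITL avg (ms)": (14.1, 15.5),
    "Req Latency avg (ms)": (2630.0, 3035.7),
}


def relative_delta(cache_on: float, cache_off: float) -> float:
    """Relative change of the cache-on value against the cache-off baseline."""
    return (cache_on - cache_off) / cache_off * 100.0


deltas = {
    metric: round(relative_delta(on, off), 1)
    for metric, (on, off) in RESULTS.items()
}

for metric, delta in deltas.items():
    print(f"{metric:<22} {delta:+.1f}%")
```

For throughput a positive delta is an improvement, while for the latency metrics (TTFT, ITL, request latency) a negative delta is the improvement.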
## Pre-requisites
To reproduce the results in the table, you need:
1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **GB200** - one GPU for the single aggregated replica
3. **HuggingFace token** configured:
```bash
export NAMESPACE=your-namespace
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token" \
  -n ${NAMESPACE}
```
## Dataset Generation
`data-gen/generate-datasets-job.yaml` creates a synthetic text + image dataset with 80% image overlap. The generator controls the overlap through two quantities: the total number of image slots and the size of the image pool.
The total number of slots is `num_requests * images_per_request`, that is, how many images the benchmark sends in total. The image pool is the set of distinct images that can be attached to a request.
The `data-gen/generate-datasets-job.yaml` Job generates 1000 requests with 1 image per request and an image pool of 200. Each request draws an image from the pool without replacement, and the generator loops back through the pool once it is exhausted. The first 200 requests therefore carry images the inference engine has not seen, while the remaining 800 carry images it has already seen (800 / 1000 = 80% overlap). Refer to the jsonl [documentation](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/multimodal/jsonl) for more details on data generation.
The user-input text length is hardcoded to 400 tokens (`--user-text-tokens 400`).
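The slot and pool arithmetic can be checked with a short Python sketch. It is not the recipe's generator: `assign_images` is a hypothetical helper, and it assumes the pool is walked in a fixed order on every pass, which the real generator may not do.

```python
from itertools import cycle, islice


def assign_images(num_requests: int, images_per_request: int, pool_size: int) -> list[int]:
    """Assign one pool index per image slot, wrapping around when the pool runs out."""
    total_slots = num_requests * images_per_request
    # Assumption: the pool is replayed in the same order on every pass.
    return list(islice(cycle(range(pool_size)), total_slots))


slots = assign_images(num_requests=1000, images_per_request=1, pool_size=200)

seen: set[int] = set()
repeats = 0
for image_id in slots:
    if image_id in seen:
        repeats += 1  # image was already sent by an earlier slot
    else:
        seen.add(image_id)

unique = len(seen)
overlap = repeats / len(slots)
print(f"slots={len(slots)} unique={unique} repeats={repeats} overlap={overlap:.0%}")
```

Under these assumptions the overlap is `1 - pool_size / total_slots` whenever the pool is smaller than the number of slots, so shrinking the pool or adding requests raises the share of repeated images.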
To generate the dataset, run:
```bash
kubectl apply -f data-gen/generate-datasets-job.yaml -n ${NAMESPACE}
```
## Notes
1. Exact cache hit rates cannot be controlled through the dataset alone, because the embedding cache may evict entries (for example under an LRU policy). However, shrinking the image pool relative to the number of requests raises the probability of duplicate images and therefore of cache hits. Increasing the embedding cache capacity also raises the hit rate, because fewer entries are evicted.
2. **Agg embedding cache requires the `ec_both` ECConnector role in vLLM, but that functionality was merged after the 1.0.0 release. The worker startup in `vllm/agg-embedding-cache/deploy.yaml` applies the required upstream vLLM patches inline at runtime. See [multimodal-vllm.md](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md#embedding-cache) for more details.**
3. Replace placeholders in `*.yaml` before running:
   - `storageClassName: "your-storage-class-name"` in `model-cache/model-cache.yaml`
   - `image: <your-dynamo-image>` in all `vllm/*/deploy.yaml` files
   - `NAMESPACE=your-namespace` and `HF_TOKEN="your-token"` in the setup commands
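To illustrate the first note, here is a small Python simulation of how eviction can decouple the hit rate from the dataset's overlap. It is a sketch under stated assumptions, not the Dynamo or vLLM cache: eviction is strict LRU, capacity is counted in entries rather than gigabytes, requests arrive one at a time, and the 200-image pool is replayed in a fixed order. `lru_hit_rate` is a hypothetical helper.

```python
from collections import OrderedDict
from itertools import cycle, islice


def lru_hit_rate(accesses: list[int], capacity: int) -> float:
    """Fraction of accesses served from an LRU cache holding `capacity` entries."""
    cache: OrderedDict[int, bool] = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(accesses)


# Same shape as the recipe's dataset: 1000 requests over a pool of 200 images.
accesses = list(islice(cycle(range(200)), 1000))

for capacity in (100, 200, 400):
    print(f"capacity={capacity:>3} entries -> hit rate {lru_hit_rate(accesses, capacity):.0%}")
```

With a fixed replay order, a cache smaller than the pool evicts every image just before it is requested again, so the hit rate collapses even though 80% of the images are repeats. A cache that holds the whole pool reaches the full 80%. Real runs with concurrent requests or a different replay order would likely fall somewhere between these extremes.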
## Directory setup
This recipe has three top-level components: `model-cache/` for PVC/model prep, `data-gen/` for dataset creation, and `vllm/agg-embedding-cache/` for deployment and benchmarking with [AIPerf](https://github.com/ai-dynamo/aiperf).
```text
qwen3-vl-30b/
├── data-gen/
│   └── generate-datasets-job.yaml
├── model-cache/
│   ├── model-cache.yaml
│   └── model-download.yaml
└── vllm/
    └── agg-embedding-cache/
        ├── deploy.yaml
        ├── perf.yaml
        └── run-benchmark.sh
```
`deploy.yaml` sets `DYN_MULTIMODAL_EMBEDDING_CACHE_GB=10` by default, which is the embedding cache **on** configuration. To turn the cache off, set the variable to `0`.
Similarly, `perf.yaml` exposes a `CACHE_MODE` environment variable that selects the subdirectory where AIPerf writes its results. Set it to `cache_on` or `cache_off` to match your deployment.
## Quick Start
### 1. Set Namespace and Create Storage
```bash
export NAMESPACE=your-namespace
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl get pvc -n ${NAMESPACE}
```
### 2. Download Model and Generate Datasets
```bash
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
kubectl apply -f data-gen/generate-datasets-job.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/qwen3-vl-30b-generate-datasets -n ${NAMESPACE} --timeout=3600s
kubectl logs job/qwen3-vl-30b-generate-datasets -n ${NAMESPACE}
```
### 3. Deploy and Benchmark (`agg-embedding-cache`)
```bash
# deploy.yaml defaults to cache ON (DYN_MULTIMODAL_EMBEDDING_CACHE_GB=10)
kubectl apply -f vllm/agg-embedding-cache/deploy.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Ready dynamographdeployment/qwen3-vl-agg -n ${NAMESPACE} --timeout=900s
kubectl apply -f vllm/agg-embedding-cache/perf.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Ready pod/qwen3-vl-agg-benchmark -n ${NAMESPACE} --timeout=300s
```
Optional: to run cache OFF, change `DYN_MULTIMODAL_EMBEDDING_CACHE_GB` to `0` in `vllm/agg-embedding-cache/deploy.yaml` and set `CACHE_MODE=cache_off` in `vllm/agg-embedding-cache/perf.yaml` before applying.
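If you prefer to script this edit rather than change the files by hand, the toggle can be expressed as a text-level rewrite of the `value:` line that follows a given `name:` line. The Python sketch below is illustrative only: `set_env_value` is a hypothetical helper, `SAMPLE` is a made-up fragment that combines entries from both manifests, and the approach assumes `value:` sits on the line directly after `name:`, as it does in this recipe's manifests.

```python
import re


def set_env_value(manifest: str, name: str, value: str) -> str:
    """Return the manifest text with the `value:` line after `name: <name>` replaced."""
    lines = manifest.splitlines()
    name_pattern = re.compile(rf"name:\s*{re.escape(name)}\s*$")
    for i in range(len(lines) - 1):
        if name_pattern.search(lines[i]):
            match = re.match(r"(\s*)value:", lines[i + 1])
            if match is None:
                raise ValueError(f"no value line follows {name}")
            # Keep the original indentation so the YAML structure is unchanged.
            lines[i + 1] = f'{match.group(1)}value: "{value}"'
            return "\n".join(lines) + "\n"
    raise KeyError(name)


# Hypothetical fragment; in the recipe these entries live in separate files.
SAMPLE = """\
env:
  - name: DYN_MULTIMODAL_EMBEDDING_CACHE_GB
    value: "10"
  - name: CACHE_MODE
    value: cache_on
"""

cache_off = set_env_value(SAMPLE, "DYN_MULTIMODAL_EMBEDDING_CACHE_GB", "0")
cache_off = set_env_value(cache_off, "CACHE_MODE", "cache_off")
print(cache_off)
```

A text-level edit keeps comments and the inline shell script in the manifest untouched, which a parse-and-dump round trip through a YAML library might not.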
### 4. Monitor Benchmark Progress
```bash
kubectl get pods -n ${NAMESPACE} -l app=benchmark
# Follow benchmark logs in real time
kubectl logs -f qwen3-vl-agg-benchmark -n ${NAMESPACE}
# Wait for completion
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/qwen3-vl-agg-benchmark -n ${NAMESPACE} --timeout=7200s
```
The run is finished when the logs print `Run complete. Artifacts in /perf-cache/artifacts/qwen3_vl_30b_embedding_cache/agg/<cache_mode>`.
`vllm/agg-embedding-cache/run-benchmark.sh` is also provided as a helper to launch cache-on/cache-off runs.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: qwen3-vl-30b-generate-datasets
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: generate-datasets
    spec:
      restartPolicy: Never
      securityContext:
        runAsUser: 0
        runAsGroup: 0
        fsGroup: 0
      containers:
        - name: generate-datasets
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          imagePullPolicy: IfNotPresent
          command:
            - /bin/bash
            - -lc
            - |
              set -euo pipefail
              GENERATOR_MAIN="/workspace/benchmarks/multimodal/jsonl/main.py"
              OUTPUT_DIR="/perf-cache/datasets"
              if [[ ! -f "${GENERATOR_MAIN}" ]]; then
                echo "Generator not found at ${GENERATOR_MAIN}"
                exit 1
              fi
              mkdir -p "${OUTPUT_DIR}"
              python3 "${GENERATOR_MAIN}" \
                -n 1000 \
                --images-per-request 1 \
                --images-pool 200 \
                --user-text-tokens 400 \
                --image-mode http \
                --image-dir /perf-cache/images \
                -o "${OUTPUT_DIR}/qwen3_vl_1000req_1img_pool200.jsonl"
              echo "Dataset generation complete in ${OUTPUT_DIR}"
          volumeMounts:
            - name: perf-cache
              mountPath: /perf-cache
      volumes:
        - name: perf-cache
          persistentVolumeClaim:
            claimName: perf-cache
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: "your-storage-class-name"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: "your-storage-class-name"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: "your-storage-class-name"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" # Remove FP8 for BF16 variant
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
            - name: MODEL_REVISION
              value: "main"
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub hf_transfer
              hf download "$MODEL_NAME" --revision "$MODEL_REVISION"
          volumeMounts:
            - name: model-cache
              mountPath: /home/dynamo/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-vl-agg
spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    Frontend:
      componentType: frontend
      envs:
        - name: HF_HOME
          value: /home/dynamo/.cache/huggingface
        - name: DYN_REQUEST_PLANE
          value: tcp
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          imagePullPolicy: IfNotPresent
          workingDir: /workspace
      replicas: 1
      resources:
        requests:
          cpu: "1"
        limits:
          cpu: "1"
      subComponentType: null
    VllmWorker:
      componentType: worker
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          command:
            - /bin/bash
            - -lc
          args:
            - |
              set -euo pipefail
              SITE_PACKAGES="$(python3 -c 'import pathlib, vllm; print(pathlib.Path(vllm.__file__).resolve().parent.parent)')"
              cd "${SITE_PACKAGES}"
              curl -sL https://github.com/vllm-project/vllm/pull/34182.diff | patch -p1
              curl -sL https://github.com/vllm-project/vllm/pull/34783.diff | python3 -c "
              import sys
              chunks = sys.stdin.read().split('diff --git ')
              filtered = [c for c in chunks if c.startswith('a/vllm/')]
              print(''.join('diff --git ' + c for c in filtered))
              " | patch -p1
              cd /workspace
              python3 -m dynamo.vllm \
                --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
                --enable-multimodal \
                --tensor-parallel-size 1 \
                --gpu-memory-utilization 0.85 \
                --max-model-len 16384 \
                --disable-log-requests \
                --enable-prefix-caching \
                --multimodal-embedding-cache-capacity-gb "${DYN_MULTIMODAL_EMBEDDING_CACHE_GB}"
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          imagePullPolicy: IfNotPresent
          env:
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
            - name: DYN_REQUEST_PLANE
              value: tcp
            - name: DYN_VLLM_EMBEDDING_TRANSFER_MODE
              value: nixl-write
            - name: DYN_MULTIMODAL_EMBEDDING_CACHE_GB
              value: "10"
          workingDir: /workspace
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          gpu: "1"
      subComponentType: null
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
          useAsCompilationCache: true
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: Pod
metadata:
  name: qwen3-vl-agg-benchmark
  labels:
    app: benchmark
spec:
  containers:
    - name: benchmark
      image: python:3.11
      command:
        - /bin/bash
        - -lc
        - |
          set -euo pipefail
          ulimit -n 1048576
          ulimit -u 65536
          apt update && apt install -y tmux curl jq
          pip install aiperf
          echo "Waiting for model '${MODEL_NAME}' at http://${FRONTEND}:8000/v1/models..."
          until curl -s "http://${FRONTEND}:8000/v1/models" | jq -e --arg model "${MODEL_NAME}" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
            echo "[$(date '+%H:%M:%S')] Model not ready, retrying in 5s..."
            sleep 5
          done
          echo "Model '${MODEL_NAME}' is ready."
          INPUT_FILE="${DATASET_DIR}/qwen3_vl_1000req_1img_pool200.jsonl"
          if [ ! -f "${INPUT_FILE}" ]; then
            echo "Dataset not found: ${INPUT_FILE}"
            exit 1
          fi
          RUN_DIR="${ARTIFACT_BASE_DIR}/${CACHE_MODE}"
          mkdir -p "${RUN_DIR}"
          echo "Running benchmark ..."
          aiperf profile \
            --model "${MODEL_NAME}" \
            --input-file "${INPUT_FILE}" \
            --custom-dataset-type single_turn \
            --url "http://${FRONTEND}:8000" \
            --streaming \
            --ui-type none \
            --request-count "${REQUEST_COUNT}" \
            --concurrency "${CONCURRENCY}" \
            --request-rate-mode constant \
            --warmup-request-count "${WARMUP_REQUEST_COUNT}" \
            --artifact-dir "${RUN_DIR}" \
            --extra-inputs "max_tokens:${MAX_TOKENS}" \
            --extra-inputs "min_tokens:${MAX_TOKENS}" \
            --extra-inputs "ignore_eos:true"
          echo "Run complete. Artifacts in ${RUN_DIR}"
          sleep 3600
      env:
        - name: MODEL_NAME
          value: Qwen/Qwen3-VL-30B-A3B-Instruct-FP8
        - name: FRONTEND
          value: qwen3-vl-agg-frontend
        - name: CACHE_MODE
          value: cache_on
        - name: MAX_TOKENS
          value: "150"
        - name: REQUEST_COUNT
          value: "1000"
        - name: CONCURRENCY
          value: "64"
        - name: WARMUP_REQUEST_COUNT
          value: "3"
        - name: DATASET_DIR
          value: /perf-cache/datasets
        - name: ARTIFACT_BASE_DIR
          value: /perf-cache/artifacts/qwen3_vl_30b_embedding_cache/agg
      resources:
        requests:
          cpu: "8"
          memory: 16Gi
        limits:
          cpu: "16"
          memory: 32Gi
      volumeMounts:
        - name: perf-cache
          mountPath: /perf-cache
  volumes:
    - name: perf-cache
      persistentVolumeClaim:
        claimName: perf-cache
  restartPolicy: Never
#!/usr/bin/env bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Usage:
#   ./run-benchmark.sh on    # benchmark with embedding cache ON (10GB)
#   ./run-benchmark.sh off   # benchmark with embedding cache OFF
#
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
NAMESPACE="${NAMESPACE:-dynamo}"
if [[ $# -ne 1 ]] || [[ "$1" != "on" && "$1" != "off" ]]; then
  echo "Usage: $0 <on|off>"
  exit 1
fi
MODE="$1"
if [[ "${MODE}" == "on" ]]; then
  CACHE_GB="10"
  CACHE_MODE="cache_on"
else
  CACHE_GB="0"
  CACHE_MODE="cache_off"
fi
echo "==> Embedding cache: ${MODE} (${CACHE_GB}GB)"
# Patch deploy.yaml: set DYN_MULTIMODAL_EMBEDDING_CACHE_GB value.
# sub() rewrites the value in place so the line keeps its YAML indentation.
awk -v cache_gb="${CACHE_GB}" '
  /name: DYN_MULTIMODAL_EMBEDDING_CACHE_GB/ {
    print
    getline
    sub(/value:.*/, "value: \"" cache_gb "\"")
    print
    next
  }
  { print }
' "${SCRIPT_DIR}/deploy.yaml" | \
  kubectl apply -f - -n "${NAMESPACE}"
echo "==> Waiting for worker to be ready..."
kubectl wait --for=condition=Ready \
  dynamographdeployment/qwen3-vl-agg \
  -n "${NAMESPACE}" --timeout=600s
# Delete old benchmark pod if exists
kubectl delete pod qwen3-vl-agg-benchmark \
  -n "${NAMESPACE}" --ignore-not-found
# Patch perf.yaml: replace CACHE_MODE value
sed 's/value: cache_o[nf]*/value: '"${CACHE_MODE}"'/' \
  "${SCRIPT_DIR}/perf.yaml" | \
  kubectl apply -f - -n "${NAMESPACE}"
echo "==> Benchmark pod launched (cache ${MODE})"
echo "    Monitor with: kubectl logs -f qwen3-vl-agg-benchmark -n ${NAMESPACE}"