feat: Qwen3-VL-30B recipe for agg embedding cache with vLLM patch (#6919)

Signed-off-by: Elijah Soba <esoba@nvidia.com>

Commit 5e51d6dd authored Mar 10, 2026 by Elijah Soba, committed by GitHub Mar 10, 2026
7 changed files
--- a/recipes/qwen3-vl-30b/README.md
+++ b/recipes/qwen3-vl-30b/README.md
+# Qwen3-VL-30B-A3B-Instruct-FP8: Aggregated Embedding Cache On vs Off Comparison
+This recipe demonstrates the performance difference when embedding cache is enabled for multi-modal payloads. It includes guidance on creating an artificial dataset with user-defined image re-use, and production-ready deployments for `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`.
+## Results
+| Metric               | Cache ON | Cache OFF | Delta  |
+|----------------------|---------:|----------:|-------:|
+| Output TPS (tok/s)   |   3575.6 |    3072.3 | +16.4% |
+| TTFT avg (ms)        |    526.0 |     727.5 | -27.7% |
+| TTFT p50 (ms)        |    356.8 |     510.8 | -30.1% |
+| ITL avg (ms)         |     14.1 |      15.5 |  -8.8% |
+| Req Latency avg (ms) |   2630.0 |    3035.7 | -13.4% |
+**On a single aggregated replica running on GB200 with the vLLM backend, enabling the embedding cache for `Qwen3-VL-30B-A3B-Instruct-FP8` improves average throughput by about 16%, average TTFT by about 28%, and average request latency by about 13%.**
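The Delta column follows from the two measurement columns as `(on - off) / off`. The sketch below recomputes it from the rounded table values; `RESULTS` and `delta_pct` are illustrative names, not part of the recipe. From the rounded inputs the ITL delta comes out near -9.0% rather than the reported -8.8%, which presumably reflects rounding of the underlying measurements.

```python
# Rounded values copied from the results table: (cache ON, cache OFF).
RESULTS = {
    "Output TPS (tok/s)": (3575.6, 3072.3),
    "TTFT avg (ms)": (526.0, 727.5),
    "TTFT p50 (ms)": (356.8, 510.8),
    "ITL avg (ms)": (14.1, 15.5),
    "Req Latency avg (ms)": (2630.0, 3035.7),
}


def delta_pct(cache_on: float, cache_off: float) -> float:
    """Relative change of cache ON versus cache OFF, in percent."""
    return (cache_on - cache_off) / cache_off * 100.0


if __name__ == "__main__":
    for metric, (on, off) in RESULTS.items():
        print(f"{metric:<22} {delta_pct(on, off):+.1f}%")
```

A positive delta is an improvement for throughput, while a negative delta is an improvement for the latency metrics.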
+## Prerequisites
+To reproduce the results in the table, you need the following:
+1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **GB200 GPU** - one GPU is enough for this recipe, since `deploy.yaml` requests a single GPU for the worker
+3. **HuggingFace token** configured:
+   ```bash
+   export NAMESPACE=your-namespace
+   kubectl create secret generic hf-token-secret \
+     --from-literal=HF_TOKEN="your-token" \
+     -n ${NAMESPACE}
+   ```
+## Dataset Generation
+`data-gen/generate-datasets-job.yaml` creates a synthetic text + image dataset in which 80% of requests reuse an image that an earlier request already sent. The generator controls reuse through two quantities: the total number of image slots and the size of the image pool.
+The total number of slots is `num_requests * images_per_request`, that is, how many images the benchmark attaches across all requests. The image pool is the number of distinct images available to attach to a request.
+The job creates a dataset of 1000 requests with 1 image per request and an image pool of 200. Requests draw images from the pool without replacement and loop back through the pool once it is exhausted. The first 200 of the 1000 requests therefore carry unique images, while the remaining 800 carry images the inference engine has already seen. Refer to the [jsonl documentation](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/multimodal/jsonl) for more details on data generation.
+Each request carries 400 tokens of user-input text, set by `--user-text-tokens 400` in the job.
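The slot and pool arithmetic can be sketched in a few lines of Python. This is not the generator in `benchmarks/multimodal/jsonl`; `assign_images` and `reuse_fraction` are hypothetical helpers that only mimic the described behavior (draw without replacement, then cycle through the pool again).

```python
import random
from itertools import cycle, islice


def assign_images(num_requests: int, images_per_request: int,
                  pool_size: int, seed: int = 0) -> list[int]:
    """Return one image id per slot, cycling through a shuffled pool."""
    total_slots = num_requests * images_per_request
    pool = list(range(pool_size))
    random.Random(seed).shuffle(pool)  # order within the pool is arbitrary
    return list(islice(cycle(pool), total_slots))


def reuse_fraction(slots: list[int]) -> float:
    """Fraction of slots whose image already appeared in an earlier slot."""
    seen: set[int] = set()
    repeats = 0
    for image_id in slots:
        if image_id in seen:
            repeats += 1
        seen.add(image_id)
    return repeats / len(slots)


if __name__ == "__main__":
    slots = assign_images(num_requests=1000, images_per_request=1, pool_size=200)
    print(f"total slots: {len(slots)}, reuse fraction: {reuse_fraction(slots):.0%}")
```

With the recipe's parameters the reuse fraction is `(1000 - 200) / 1000`, which is the 80% overlap quoted above.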
+To generate the dataset, run:
+```bash
+kubectl apply -f data-gen/generate-datasets-job.yaml -n ${NAMESPACE}
+```
+## Notes
+1. The dataset alone cannot fix the exact cache hit rate, because the embedding cache may evict entries (for example under an LRU policy). Shrinking the image pool relative to the number of requests raises the probability of duplicate images and therefore of cache hits. Increasing the embedding cache capacity also raises the hit rate, because fewer entries are evicted.
+2. **Agg embedding cache requires the `ec_both` ECConnector role in vLLM, but that functionality was merged after the 1.0.0 release. The worker startup in `vllm/agg-embedding-cache/deploy.yaml` applies the required upstream vLLM patches inline at runtime. See [multimodal-vllm.md](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md#embedding-cache) for more details.**
+3. Replace placeholders in `*.yaml` before running:
+   - `storageClassName: "your-storage-class-name"` in `model-cache/model-cache.yaml`
+   - `image: <your-dynamo-image>` in all `vllm/*/deploy.yaml` files
+   - `NAMESPACE=your-namespace` and `HF_TOKEN="your-token"` in the setup commands
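To make note 1 concrete, the sketch below simulates an LRU cache over the cyclic access pattern of the generated dataset. It rests on several assumptions: the policy is strict LRU, capacity is counted in entries rather than GB, and requests arrive in dataset order (the benchmark actually runs 64 requests concurrently). `lru_hit_rate` is a hypothetical helper, not a Dynamo or vLLM API.

```python
from collections import OrderedDict


def lru_hit_rate(accesses: list[int], capacity: int) -> float:
    """Fraction of accesses served from an LRU cache holding `capacity` entries."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        elif capacity > 0:
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(accesses)


if __name__ == "__main__":
    # 1000 requests cycling through a pool of 200 images, as in the recipe.
    accesses = [i % 200 for i in range(1000)]
    for capacity in (0, 100, 200):
        print(f"capacity={capacity:>3} entries -> hit rate {lru_hit_rate(accesses, capacity):.0%}")
```

Under these assumptions a cache that holds the whole pool reaches the 80% ceiling, while a cache smaller than the pool never hits on a strictly cyclic pattern, since each image is evicted before it comes around again.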
+## Directory Setup
+This recipe has three top-level components: `model-cache/` for PVC/model prep, `data-gen/` for dataset creation, and `vllm/agg-embedding-cache/` for deployment and benchmarking with [AIPerf](https://github.com/ai-dynamo/aiperf).
+```text
+qwen3-vl-30b/
+├── data-gen/
+│   └── generate-datasets-job.yaml
+├── model-cache/
+│   ├── model-cache.yaml
+│   └── model-download.yaml
+└── vllm/
+    └── agg-embedding-cache/
+        ├── deploy.yaml
+        ├── perf.yaml
+        └── run-benchmark.sh
+```
+The `deploy.yaml` manifest sets `DYN_MULTIMODAL_EMBEDDING_CACHE_GB=10` by default, which runs the worker with the embedding cache **on**. To turn the cache off, set the env variable to `0`.
+Similarly, `perf.yaml` exposes a `CACHE_MODE` env variable that selects the subdirectory where AIPerf writes its results. Set it to `cache_on` or `cache_off` to match your deployment.
+## Quick Start
+### 1. Set Namespace and Create Storage
+```bash
+export NAMESPACE=your-namespace
+kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
+kubectl get pvc -n ${NAMESPACE}
+```
+### 2. Download Model and Generate Datasets
+```bash
+kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
+kubectl apply -f data-gen/generate-datasets-job.yaml -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/qwen3-vl-30b-generate-datasets -n ${NAMESPACE} --timeout=3600s
+kubectl logs job/qwen3-vl-30b-generate-datasets -n ${NAMESPACE}
+```
+### 3. Deploy and Benchmark (`agg-embedding-cache`)
+```bash
+# deploy.yaml defaults to cache ON (DYN_MULTIMODAL_EMBEDDING_CACHE_GB=10)
+kubectl apply -f vllm/agg-embedding-cache/deploy.yaml -n ${NAMESPACE}
+kubectl wait --for=condition=Ready dynamographdeployment/qwen3-vl-agg -n ${NAMESPACE} --timeout=900s
+kubectl apply -f vllm/agg-embedding-cache/perf.yaml -n ${NAMESPACE}
+kubectl wait --for=condition=Ready pod/qwen3-vl-agg-benchmark -n ${NAMESPACE} --timeout=300s
+```
+Optional: to run cache OFF, change `DYN_MULTIMODAL_EMBEDDING_CACHE_GB` to `0` in `vllm/agg-embedding-cache/deploy.yaml` and set `CACHE_MODE=cache_off` in `vllm/agg-embedding-cache/perf.yaml` before applying.
+### 4. Monitor Benchmark Progress
+```bash
+kubectl get pods -n ${NAMESPACE} -l app=benchmark
+# Follow benchmark logs in real time
+kubectl logs -f qwen3-vl-agg-benchmark -n ${NAMESPACE}
+# Wait for completion
+kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/qwen3-vl-agg-benchmark -n ${NAMESPACE} --timeout=7200s
+```
+The benchmark has finished when the log prints `Run complete. Artifacts in /perf-cache/artifacts/qwen3_vl_30b_embedding_cache/agg/<cache_mode>`.
+`vllm/agg-embedding-cache/run-benchmark.sh` is also provided as a helper to launch cache-on/cache-off runs.
\ No newline at end of file
--- a/recipes/qwen3-vl-30b/data-gen/generate-datasets-job.yaml
+++ b/recipes/qwen3-vl-30b/data-gen/generate-datasets-job.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: qwen3-vl-30b-generate-datasets
+spec:
+  backoffLimit: 1
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: generate-datasets
+    spec:
+      restartPolicy: Never
+      securityContext:
+        runAsUser: 0
+        runAsGroup: 0
+        fsGroup: 0
+      containers:
+        - name: generate-datasets
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          imagePullPolicy: IfNotPresent
+          command:
+            - /bin/bash
+            - -lc
+            - |
+              set -euo pipefail
+              GENERATOR_MAIN="/workspace/benchmarks/multimodal/jsonl/main.py"
+              OUTPUT_DIR="/perf-cache/datasets"
+              if [[ ! -f "${GENERATOR_MAIN}" ]]; then
+                echo "Generator not found at ${GENERATOR_MAIN}"
+                exit 1
+              fi
+              mkdir -p "${OUTPUT_DIR}"
+              python3 "${GENERATOR_MAIN}" \
+                -n 1000 \
+                --images-per-request 1 \
+                --images-pool 200 \
+                --user-text-tokens 400 \
+                --image-mode http \
+                --image-dir /perf-cache/images \
+                -o "${OUTPUT_DIR}/qwen3_vl_1000req_1img_pool200.jsonl"
+              echo "Dataset generation complete in ${OUTPUT_DIR}"
+          volumeMounts:
+            - name: perf-cache
+              mountPath: /perf-cache
+      volumes:
+        - name: perf-cache
+          persistentVolumeClaim:
+            claimName: perf-cache
--- a/recipes/qwen3-vl-30b/model-cache/model-cache.yaml
+++ b/recipes/qwen3-vl-30b/model-cache/model-cache.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: model-cache
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 100Gi
+  storageClassName: "your-storage-class-name"
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: compilation-cache
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 10Gi
+  storageClassName: "your-storage-class-name"
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: perf-cache
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 50Gi
+  storageClassName: "your-storage-class-name"
--- a/recipes/qwen3-vl-30b/model-cache/model-download.yaml
+++ b/recipes/qwen3-vl-30b/model-cache/model-download.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download
+spec:
+  backoffLimit: 3
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: model-download
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: model-download
+          image: python:3.10-slim
+          command: ["sh", "-c"]
+          envFrom:
+            - secretRef:
+                name: hf-token-secret
+          env:
+            - name: MODEL_NAME
+              value: "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"  # Remove FP8 for BF16 variant
+            - name: HF_HOME
+              value: /home/dynamo/.cache/huggingface
+            - name: HF_HUB_ENABLE_HF_TRANSFER
+              value: "1"
+            - name: MODEL_REVISION
+              value: "main"
+          args:
+            - |
+              set -eux
+              pip install --no-cache-dir huggingface_hub hf_transfer
+              hf download "$MODEL_NAME" --revision "$MODEL_REVISION"
+          volumeMounts:
+            - name: model-cache
+              mountPath: /home/dynamo/.cache/huggingface
+      volumes:
+        - name: model-cache
+          persistentVolumeClaim:
+            claimName: model-cache
--- a/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/deploy.yaml
+++ b/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: qwen3-vl-agg
+spec:
+  pvcs:
+    - create: false
+      name: model-cache
+    - create: false
+      name: compilation-cache
+  services:
+    Frontend:
+      componentType: frontend
+      envs:
+        - name: HF_HOME
+          value: /home/dynamo/.cache/huggingface
+        - name: DYN_REQUEST_PLANE
+          value: tcp
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          imagePullPolicy: IfNotPresent
+          workingDir: /workspace
+      replicas: 1
+      resources:
+        requests:
+          cpu: "1"
+        limits:
+          cpu: "1"
+      subComponentType: null
+    VllmWorker:
+      componentType: worker
+      envFromSecret: hf-token-secret
+      extraPodSpec:
+        mainContainer:
+          command:
+            - /bin/bash
+            - -lc
+          args:
+            - |
+              set -euo pipefail
+              SITE_PACKAGES="$(python3 -c 'import pathlib, vllm; print(pathlib.Path(vllm.__file__).resolve().parent.parent)')"
+              cd "${SITE_PACKAGES}"
+              curl -sL https://github.com/vllm-project/vllm/pull/34182.diff | patch -p1
+              curl -sL https://github.com/vllm-project/vllm/pull/34783.diff | python3 -c "
+              import sys
+              chunks = sys.stdin.read().split('diff --git ')
+              filtered = [c for c in chunks if c.startswith('a/vllm/')]
+              print(''.join('diff --git ' + c for c in filtered))
+              " | patch -p1
+              cd /workspace
+              python3 -m dynamo.vllm \
+                --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
+                --enable-multimodal \
+                --tensor-parallel-size 1 \
+                --gpu-memory-utilization 0.85 \
+                --max-model-len 16384 \
+                --disable-log-requests \
+                --enable-prefix-caching \
+                --multimodal-embedding-cache-capacity-gb "${DYN_MULTIMODAL_EMBEDDING_CACHE_GB}"
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          imagePullPolicy: IfNotPresent
+          env:
+            - name: HF_HOME
+              value: /home/dynamo/.cache/huggingface
+            - name: DYN_REQUEST_PLANE
+              value: tcp
+            - name: DYN_VLLM_EMBEDDING_TRANSFER_MODE
+              value: nixl-write
+            - name: DYN_MULTIMODAL_EMBEDDING_CACHE_GB
+              value: "10"
+          workingDir: /workspace
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+        requests:
+          gpu: "1"
+      subComponentType: null
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /home/dynamo/.cache/huggingface
+        - name: compilation-cache
+          mountPoint: /home/dynamo/.cache/vllm
+          useAsCompilationCache: true
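The inline Python in the worker startup filters the second upstream diff so that only per-file sections under `vllm/` reach `patch`, presumably because other paths in the pull request do not exist under the installed package directory. A standalone sketch of the same filter follows; the function name and the file names in `SAMPLE` are made up for illustration.

```python
def keep_vllm_sections(diff_text: str) -> str:
    """Keep only the per-file sections of a git diff whose path starts with vllm/."""
    # Each per-file section of a git diff begins with a "diff --git a/<path> b/<path>" line.
    chunks = diff_text.split("diff --git ")
    kept = [chunk for chunk in chunks if chunk.startswith("a/vllm/")]
    return "".join("diff --git " + chunk for chunk in kept)


# Hypothetical two-file diff: one test file, one file inside the vllm package.
SAMPLE = (
    "diff --git a/tests/test_example.py b/tests/test_example.py\n"
    "--- a/tests/test_example.py\n"
    "+++ b/tests/test_example.py\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
    "diff --git a/vllm/example.py b/vllm/example.py\n"
    "--- a/vllm/example.py\n"
    "+++ b/vllm/example.py\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
)

if __name__ == "__main__":
    print(keep_vllm_sections(SAMPLE), end="")
```

One limitation shared with the inline version: the split is purely textual, so a changed line that itself contains the string `diff --git ` would be cut into a separate chunk.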
--- a/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/perf.yaml
+++ b/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/perf.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: Pod
+metadata:
+  name: qwen3-vl-agg-benchmark
+  labels:
+    app: benchmark
+spec:
+  containers:
+    - name: benchmark
+      image: python:3.11
+      command:
+        - /bin/bash
+        - -lc
+        - |
+          set -euo pipefail
+          ulimit -n 1048576
+          ulimit -u 65536
+          apt update && apt install -y tmux curl jq
+          pip install aiperf
+          echo "Waiting for model '${MODEL_NAME}' at http://${FRONTEND}:8000/v1/models..."
+          until curl -s "http://${FRONTEND}:8000/v1/models" | jq -e --arg model "${MODEL_NAME}" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+            echo "[$(date '+%H:%M:%S')] Model not ready, retrying in 5s..."
+            sleep 5
+          done
+          echo "Model '${MODEL_NAME}' is ready."
+          INPUT_FILE="${DATASET_DIR}/qwen3_vl_1000req_1img_pool200.jsonl"
+          if [ ! -f "${INPUT_FILE}" ]; then
+            echo "Dataset not found: ${INPUT_FILE}"
+            exit 1
+          fi
+          RUN_DIR="${ARTIFACT_BASE_DIR}/${CACHE_MODE}"
+          mkdir -p "${RUN_DIR}"
+          echo "Running benchmark ..."
+          aiperf profile \
+            --model "${MODEL_NAME}" \
+            --input-file "${INPUT_FILE}" \
+            --custom-dataset-type single_turn \
+            --url "http://${FRONTEND}:8000" \
+            --streaming \
+            --ui-type none \
+            --request-count "${REQUEST_COUNT}" \
+            --concurrency "${CONCURRENCY}" \
+            --request-rate-mode constant \
+            --warmup-request-count "${WARMUP_REQUEST_COUNT}" \
+            --artifact-dir "${RUN_DIR}" \
+            --extra-inputs "max_tokens:${MAX_TOKENS}" \
+            --extra-inputs "min_tokens:${MAX_TOKENS}" \
+            --extra-inputs "ignore_eos:true"
+          echo "Run complete. Artifacts in ${RUN_DIR}"
+          sleep 3600
+      env:
+        - name: MODEL_NAME
+          value: Qwen/Qwen3-VL-30B-A3B-Instruct-FP8
+        - name: FRONTEND
+          value: qwen3-vl-agg-frontend
+        - name: CACHE_MODE
+          value: cache_on
+        - name: MAX_TOKENS
+          value: "150"
+        - name: REQUEST_COUNT
+          value: "1000"
+        - name: CONCURRENCY
+          value: "64"
+        - name: WARMUP_REQUEST_COUNT
+          value: "3"
+        - name: DATASET_DIR
+          value: /perf-cache/datasets
+        - name: ARTIFACT_BASE_DIR
+          value: /perf-cache/artifacts/qwen3_vl_30b_embedding_cache/agg
+      resources:
+        requests:
+          cpu: "8"
+          memory: 16Gi
+        limits:
+          cpu: "16"
+          memory: 32Gi
+      volumeMounts:
+        - name: perf-cache
+          mountPath: /perf-cache
+  volumes:
+    - name: perf-cache
+      persistentVolumeClaim:
+        claimName: perf-cache
+  restartPolicy: Never
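The readiness loop in `perf.yaml` polls `/v1/models` and uses `jq` to check that some entry in `data` has an `id` equal to `MODEL_NAME`. The same check can be sketched in Python on the response body alone, with no network call; `model_is_listed` is a hypothetical helper and the response shape is assumed from the `jq` filter.

```python
import json


def model_is_listed(models_response: str, model_name: str) -> bool:
    """True if the JSON body of /v1/models lists `model_name` under data[].id."""
    try:
        payload = json.loads(models_response)
    except json.JSONDecodeError:
        return False  # empty or partial body while the frontend is still starting
    if not isinstance(payload, dict):
        return False
    data = payload.get("data") or []
    if not isinstance(data, list):
        return False
    return any(isinstance(entry, dict) and entry.get("id") == model_name for entry in data)


if __name__ == "__main__":
    body = '{"object": "list", "data": [{"id": "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"}]}'
    print(model_is_listed(body, "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"))
```

As in the shell loop, a missing `data` field or an unparsable body counts as not ready rather than as an error.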
--- a/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/run-benchmark.sh
+++ b/recipes/qwen3-vl-30b/vllm/agg-embedding-cache/run-benchmark.sh
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Usage:
+#   ./run-benchmark.sh on   # benchmark with embedding cache ON (10GB)
+#   ./run-benchmark.sh off  # benchmark with embedding cache OFF
+#
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+NAMESPACE="${NAMESPACE:-dynamo}"
+if [[ $# -ne 1 ]] || [[ "$1" != "on" && "$1" != "off" ]]; then
+  echo "Usage: $0 <on|off>"
+  exit 1
+fi
+MODE="$1"
+if [[ "${MODE}" == "on" ]]; then
+  CACHE_GB="10"
+  CACHE_MODE="cache_on"
+else
+  CACHE_GB="0"
+  CACHE_MODE="cache_off"
+fi
+echo "==> Embedding cache: ${MODE} (${CACHE_GB}GB)"
+# Patch deploy.yaml: set DYN_MULTIMODAL_EMBEDDING_CACHE_GB value
+awk -v cache_gb="${CACHE_GB}" '
+  /name: DYN_MULTIMODAL_EMBEDDING_CACHE_GB/ { print; getline; print "              value: \"" cache_gb "\""; next }
+  { print }
+' "${SCRIPT_DIR}/deploy.yaml" | \
+  kubectl apply -f - -n "${NAMESPACE}"
+echo "==> Waiting for worker to be ready..."
+kubectl wait --for=condition=Ready \
+  dynamographdeployment/qwen3-vl-agg \
+  -n "${NAMESPACE}" --timeout=600s
+# Delete old benchmark pod if exists
+kubectl delete pod qwen3-vl-agg-benchmark \
+  -n "${NAMESPACE}" --ignore-not-found
+# Patch perf.yaml: replace CACHE_MODE value
+sed 's/value: cache_o[nf]*/value: '"${CACHE_MODE}"'/' \
+  "${SCRIPT_DIR}/perf.yaml" | \
+  kubectl apply -f - -n "${NAMESPACE}"
+echo "==> Benchmark pod launched (cache ${MODE})"
+echo "    Monitor with: kubectl logs -f qwen3-vl-agg-benchmark -n ${NAMESPACE}"
\ No newline at end of file
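The two text substitutions that `run-benchmark.sh` performs with `awk` and `sed` can be sketched in Python as plain string transforms. The helper names are hypothetical. Unlike the `awk` command, which writes a fixed indentation, this version reuses the indentation of the line it replaces, and `re.sub` replaces every match rather than the first match per line.

```python
import re


def set_cache_gb(manifest: str, cache_gb: str) -> str:
    """Rewrite the `value:` line that follows the DYN_MULTIMODAL_EMBEDDING_CACHE_GB entry."""
    lines = manifest.splitlines()
    for i, line in enumerate(lines[:-1]):
        if "name: DYN_MULTIMODAL_EMBEDDING_CACHE_GB" in line:
            next_line = lines[i + 1]
            indent = next_line[: len(next_line) - len(next_line.lstrip())]
            lines[i + 1] = f'{indent}value: "{cache_gb}"'
    return "\n".join(lines) + "\n"


def set_cache_mode(manifest: str, cache_mode: str) -> str:
    """Replace `value: cache_on` or `value: cache_off` with the requested mode."""
    return re.sub(r"value: cache_o[nf]*", f"value: {cache_mode}", manifest)


if __name__ == "__main__":
    deploy_snippet = (
        "            - name: DYN_MULTIMODAL_EMBEDDING_CACHE_GB\n"
        '              value: "10"\n'
    )
    perf_snippet = "    - name: CACHE_MODE\n      value: cache_on\n"
    print(set_cache_gb(deploy_snippet, "0"), end="")
    print(set_cache_mode(perf_snippet, "cache_off"), end="")
```

Both approaches match on text rather than parsing YAML, so they depend on the `value:` line directly following the `name:` line in the manifest.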