feat(recipes): add Qwen3-32B-FP8 vLLM disaggregated single-node recipe (#7915)

Signed-off-by: enfinity <festusowumi@gmail.com>

Commit df53b7a2 (unverified), authored Apr 16, 2026 by Festus Ayobami Owumi, committed by GitHub Apr 16, 2026
4 changed files
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -35,6 +35,7 @@ These recipes demonstrate aggregated or disaggregated serving:
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
 | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x H100/H200/A100 | ✅ | ✅ | FP8 quantization | ❌ |
 | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x H100/H200/A100 | ✅ | ✅ | Prefill + Decode separation | ❌ |
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/vllm/disagg/)** | vLLM | Disagg (Single-Node) | 8x A100 | ✅ | ✅ | 2× TP2 prefill + 1× TP4 decode, NixlConnector KV transfer | ❌ |
 | **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
 | **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
 | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |

--- a/recipes/qwen3-32b-fp8/README.md
+++ b/recipes/qwen3-32b-fp8/README.md
 # Qwen3-32B-FP8 Recipes
-Production-ready deployments for **Qwen3-32B** with FP8 quantization using TensorRT-LLM.
+Production-ready deployments for **Qwen3-32B-FP8** with FP8 quantization using TensorRT-LLM and vLLM.
 ## Available Configurations
@@ -8,6 +8,7 @@ Production-ready deployments for **Qwen3-32B** with FP8 quantization using Tenso
 |--------------|------|------|-------------|
 | [**trtllm/agg**](trtllm/agg/) | 2x GPU | Aggregated | TP2, round-robin routing |
 | [**trtllm/disagg**](trtllm/disagg/) | 8x GPU | Disaggregated | Prefill/decode separation |
+| [**vllm/disagg**](vllm/disagg/) | 8x GPU | Disaggregated | 2× TP2 prefill + 1× TP4 decode |
 ## Prerequisites
@@ -34,13 +35,19 @@ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeo
 # Deploy (choose one configuration)
 kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
 # OR: kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
+# OR: kubectl apply -f vllm/disagg/deploy.yaml -n ${NAMESPACE}
 ```
 ## Test the Deployment
 ```bash
 # Port-forward the frontend
+# If deployed trtllm/agg:
 kubectl port-forward svc/qwen3-32b-fp8-agg-frontend 8000:8000 -n ${NAMESPACE}
+# If deployed trtllm/disagg:
+# kubectl port-forward svc/qwen3-32b-fp8-disagg-frontend 8000:8000 -n ${NAMESPACE}
+# If deployed vllm/disagg:
+# kubectl port-forward svc/qwen3-32b-fp8-vllm-disagg-frontend 8000:8000 -n ${NAMESPACE}
 # Send a test request
 curl http://localhost:8000/v1/chat/completions \
@@ -55,12 +62,16 @@ curl http://localhost:8000/v1/chat/completions \
 ## Model Details
 - **Model**: `Qwen/Qwen3-32B-FP8`
- **Backend**: TensorRT-LLM (PyTorch backend)
+- **Backends**: TensorRT-LLM (PyTorch backend) and vLLM
 - **Quantization**: FP8
- **Tensor Parallel**: 2
+- **TensorRT-LLM aggregated**: TP=2
+- **TensorRT-LLM disaggregated**: 4× prefill TP=1 + 2× decode TP=2
+- **vLLM disaggregated**: 2× prefill TP=2 + 1× decode TP=4
 ## Notes
 - Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
 - The aggregated config uses CUDA graphs for optimized inference
 - KV cache uses FP8 dtype for memory efficiency
+- The `vllm/disagg` config splits 8 GPUs as 2× prefill (TP=2) + 1× decode (TP=4) using NixlConnector KV transfer; all workers must be co-located on one node
+- `--max-model-len 8192` is set in `vllm/disagg/deploy.yaml` for A100 40 GB compatibility; remove or increase this flag on H100/H200
\ No newline at end of file
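
Reviewer note: the tensor-parallel splits listed under Model Details can be cross-checked against the GPU counts in the configuration table (2x, 8x, 8x). Below is a minimal sketch of that arithmetic in shell. The helper name `gpus_for` is hypothetical and not part of the recipe, and the `trtllm/agg` line assumes a single TP=2 worker, which is what the 2x GPU row suggests.

```bash
# Hypothetical helper: total GPUs for a deployment given as
# "<replicas> <tp>" pairs, e.g. `gpus_for 2 2 1 4` for 2x TP2 + 1x TP4.
gpus_for() {
  total=0
  while [ "$#" -ge 2 ]; do
    total=$((total + $1 * $2))
    shift 2
  done
  echo "$total"
}

echo "trtllm/agg:    $(gpus_for 1 2)"      # assumed: 1 worker at TP=2
echo "trtllm/disagg: $(gpus_for 4 1 2 2)"  # 4x prefill TP=1 + 2x decode TP=2
echo "vllm/disagg:   $(gpus_for 2 2 1 4)"  # 2x prefill TP=2 + 1x decode TP=4
```

With these inputs the three lines report 2, 8, and 8 GPUs, which agrees with the table.
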
--- a/recipes/qwen3-32b-fp8/vllm/disagg/deploy.yaml
+++ b/recipes/qwen3-32b-fp8/vllm/disagg/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: qwen3-32b-fp8-vllm-disagg
+spec:
+  backendFramework: vllm
+  pvcs:
+    - name: model-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          workingDir: /workspace/examples/backends/vllm
+      envs:
+        - name: HF_HOME
+          value: /opt/models
+      replicas: 1
+    VllmPrefillWorker:
+      componentType: worker
+      subComponentType: prefill
+      envFromSecret: hf-token-secret
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 40Gi
+      extraPodSpec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - weight: 100
+              podAffinityTerm:
+                labelSelector:
+                  matchExpressions:
+                  - key: nvidia.com/dynamo-component-type
+                    operator: In
+                    values:
+                    - worker
+                topologyKey: kubernetes.io/hostname
+        mainContainer:
+          env:
+            - name: SERVED_MODEL_NAME
+              value: "Qwen/Qwen3-32B-FP8"
+            - name: MODEL_PATH
+              value: "Qwen/Qwen3-32B-FP8"
+            - name: HF_HOME
+              value: /opt/models
+            - name: UCX_TLS
+              value: "rc_x,rc,cuda_copy,cuda_ipc"
+            - name: UCX_NET_DEVICES
+              value: "mlx5_0:1"
+            - name: UCX_IB_ADDR_TYPE
+              value: "eth"
+            - name: UCX_RNDV_SCHEME
+              value: "get_zcopy"
+            - name: UCX_RNDV_THRESH
+              value: "0"
+          args:
+          - |
+            ulimit -l unlimited && python3 -m dynamo.vllm \
+              --model $MODEL_PATH \
+              --served-model-name $SERVED_MODEL_NAME \
+              --tensor-parallel-size 2 \
+              --data-parallel-size 1 \
+              --disaggregation-mode prefill \
+              --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
+              --gpu-memory-utilization 0.90 \
+              --max-model-len 8192 \
+              --no-enable-prefix-caching \
+              --block-size 128
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          workingDir: /workspace/examples/backends/vllm
+          securityContext:
+            runAsUser: 0
+            capabilities:
+              add:
+                - IPC_LOCK
+                - SYS_RESOURCE
+      replicas: 2
+      resources:
+        limits:
+          gpu: "2"
+          custom:
+            rdma/ib: "2"
+        requests:
+          gpu: "2"
+          custom:
+            rdma/ib: "2"
+    VllmDecodeWorker:
+      componentType: worker
+      subComponentType: decode
+      envFromSecret: hf-token-secret
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /opt/models
+      sharedMemory:
+        size: 40Gi
+      extraPodSpec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - weight: 100
+              podAffinityTerm:
+                labelSelector:
+                  matchExpressions:
+                  - key: nvidia.com/dynamo-component-type
+                    operator: In
+                    values:
+                    - worker
+                topologyKey: kubernetes.io/hostname
+        mainContainer:
+          env:
+            - name: SERVED_MODEL_NAME
+              value: "Qwen/Qwen3-32B-FP8"
+            - name: MODEL_PATH
+              value: "Qwen/Qwen3-32B-FP8"
+            - name: HF_HOME
+              value: /opt/models
+            - name: UCX_TLS
+              value: "rc_x,rc,cuda_copy,cuda_ipc"
+            - name: UCX_NET_DEVICES
+              value: "mlx5_0:1"
+            - name: UCX_IB_ADDR_TYPE
+              value: "eth"
+            - name: UCX_RNDV_SCHEME
+              value: "get_zcopy"
+            - name: UCX_RNDV_THRESH
+              value: "0"
+          args:
+          - |
+            ulimit -l unlimited && python3 -m dynamo.vllm \
+              --model $MODEL_PATH \
+              --served-model-name $SERVED_MODEL_NAME \
+              --tensor-parallel-size 4 \
+              --data-parallel-size 1 \
+              --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
+              --gpu-memory-utilization 0.90 \
+              --max-model-len 8192 \
+              --no-enable-prefix-caching \
+              --block-size 128
+          command:
+          - /bin/sh
+          - -c
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
+          workingDir: /workspace/examples/backends/vllm
+          securityContext:
+            runAsUser: 0
+            capabilities:
+              add:
+                - IPC_LOCK
+                - SYS_RESOURCE
+      replicas: 1
+      resources:
+        limits:
+          gpu: "4"
+          custom:
+            rdma/ib: "2"
+        requests:
+          gpu: "4"
+          custom:
+            rdma/ib: "2"
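
Reviewer note: the README says all workers must be co-located on one node, but both worker specs above use `preferredDuringSchedulingIgnoredDuringExecution`, which Kubernetes treats as a soft scheduling preference rather than a guarantee. One way to check the outcome after deploying is to count the distinct nodes that host worker pods. The sketch below assumes the worker pods carry the `nvidia.com/dynamo-component-type=worker` label that the affinity selector matches on, and that `NAMESPACE` is set. `count_distinct_nodes` is a hypothetical helper, not part of the recipe.

```bash
# Hypothetical helper: reads one node name per line on stdin and prints
# how many distinct, non-empty names it saw (unscheduled pods are skipped).
count_distinct_nodes() {
  awk 'NF && !seen[$0]++ { n++ } END { print n + 0 }'
}

# Assumed usage against a live cluster (label taken from the affinity selector):
# kubectl get pods -n "${NAMESPACE}" \
#   -l nvidia.com/dynamo-component-type=worker \
#   -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
#   | count_distinct_nodes
```

A result of 1 means the worker pods landed together. A higher number means the scheduler placed pods on more than one node despite the preference. Note that the label selector also matches workers from other deployments in the same namespace, if any exist.
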
--- a/recipes/qwen3-32b-fp8/vllm/disagg/perf.yaml
+++ b/recipes/qwen3-32b-fp8/vllm/disagg/perf.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: qwen3-32b-fp8-vllm-disagg-perf
+spec:
+  backoffLimit: 1
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: qwen3-32b-fp8-vllm-disagg-perf
+    spec:
+      containers:
+      - command:
+        - /bin/sh
+        - -c
+        - |
+          apt-get update && apt-get install -y curl jq procps git && apt-get clean
+          pip install git+https://github.com/ai-dynamo/aiperf.git@54cd6dc820bff8bfebc875da104e59d745e14f75;
+          echo "aiperf installation completed";
+          sysctl -w net.ipv4.ip_local_port_range="1024 65000"
+          cat /proc/sys/net/ipv4/ip_local_port_range
+          export COLUMNS=200
+          EPOCH=$(date +%s)
+          ## utility functions -- can be moved to a bash script / configmap
+          wait_for_model_ready() {
+            echo "Waiting for model '$TARGET_MODEL' at $ENDPOINT/v1/models (checking every 5s)..."
+            while ! curl -s "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+                echo "[$(date '+%H:%M:%S')] Model not ready yet, sleeping 5s before checking again http://$ENDPOINT/v1/models"
+                sleep 5
+            done
+            echo "✅ Model '$TARGET_MODEL' is now available!"
+            curl -s "http://$ENDPOINT/v1/models" | jq .
+          }
+          run_perf() {
+            local concurrency=$1
+            local isl=$2
+            local osl=$3
+            key=concurrency_${concurrency}
+            export ARTIFACT_DIR="${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/${key}"
+            mkdir -p "$ARTIFACT_DIR"
+            echo "ARTIFACT_DIR: $ARTIFACT_DIR"
+            aiperf profile --artifact-dir $ARTIFACT_DIR \
+                --model $TARGET_MODEL \
+                --tokenizer $TARGET_MODEL \
+                --endpoint-type chat  \
+                --endpoint /v1/chat/completions \
+                --streaming \
+                --url http://$ENDPOINT \
+                --synthetic-input-tokens-mean $isl \
+                --synthetic-input-tokens-stddev 0 \
+                --output-tokens-mean $osl \
+                --output-tokens-stddev 0 \
+                --extra-inputs "max_tokens:$osl" \
+                --extra-inputs "min_tokens:$osl" \
+                --extra-inputs "ignore_eos:true" \
+                --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
+                --extra-inputs "repetition_penalty:1.0" \
+                --extra-inputs "temperature:0.0" \
+                --concurrency $concurrency \
+                --request-count $((10*concurrency)) \
+                --warmup-request-count $concurrency \
+                --num-dataset-entries 12800 \
+                --random-seed 100 \
+                --workers-max $concurrency \
+                -H 'Authorization: Bearer NOT USED' \
+                -H 'Accept: text/event-stream' \
+                --record-processors 32 \
+                --ui simple
+            echo "ARTIFACT_DIR: $ARTIFACT_DIR"
+            ls -la $ARTIFACT_DIR
+          }
+          #### Actual execution ####
+          wait_for_model_ready
+          mkdir -p "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}"
+          # Calculate total concurrency based on per-GPU concurrency and GPU count
+          TOTAL_CONCURRENCY=$((CONCURRENCY_PER_GPU * DEPLOYMENT_GPU_COUNT))
+          echo "Calculated total concurrency: $TOTAL_CONCURRENCY (${CONCURRENCY_PER_GPU} per GPU × ${DEPLOYMENT_GPU_COUNT} GPUs)"
+          # Write input_config.json
+          cat > "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/input_config.json" <<EOF
+          {
+            "gpu_count": $DEPLOYMENT_GPU_COUNT,
+            "concurrency_per_gpu": $CONCURRENCY_PER_GPU,
+            "total_concurrency": $TOTAL_CONCURRENCY,
+            "mode": "$DEPLOYMENT_MODE",
+            "isl": $ISL,
+            "osl": $OSL,
+            "endpoint": "$ENDPOINT",
+            "model endpoint": "$TARGET_MODEL"
+          }
+          EOF
+          # Run perf with calculated total concurrency
+          run_perf $TOTAL_CONCURRENCY $ISL $OSL
+          echo "done with concurrency $TOTAL_CONCURRENCY"
+        env:
+        - name: TARGET_MODEL
+          value: "Qwen/Qwen3-32B-FP8"
+        - name: ENDPOINT
+          value: qwen3-32b-fp8-vllm-disagg-frontend:8000
+        - name: CONCURRENCY_PER_GPU
+          value: "1"
+        - name: DEPLOYMENT_GPU_COUNT
+          value: "8"
+        - name: ISL
+          value: "2000"
+        - name: OSL
+          value: "500"
+        - name: DEPLOYMENT_MODE
+          value: vllm-disagg
+        - name: AIPERF_HTTP_CONNECTION_LIMIT
+          value: "200"
+        - name: JOB_NAME
+          valueFrom:
+            fieldRef:
+              apiVersion: v1
+              fieldPath: metadata.labels['job-name']
+        - name: ROOT_ARTIFACT_DIR
+          value: /model-cache/perf
+        - name: HF_HOME
+          value: /model-cache
+        - name: PYTHONUNBUFFERED
+          value: "1"
+        image: python:3.12-slim
+        imagePullPolicy: IfNotPresent
+        name: perf
+        securityContext:
+          privileged: true
+        volumeMounts:
+        - name: model-cache
+          mountPath: /model-cache
+        workingDir: /workspace
+      imagePullSecrets:
+      - name: nvcrimagepullsecret
+      restartPolicy: Never
+      volumes:
+      - name: model-cache
+        persistentVolumeClaim:
+          claimName: model-cache
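
Reviewer note: with the env values above, the load shape follows from the script's own arithmetic. Total concurrency is `CONCURRENCY_PER_GPU * DEPLOYMENT_GPU_COUNT`, the request count is ten times that, and the warmup count equals it. The sketch below reproduces that arithmetic. It also shows one possible way to bound the readiness wait, since `wait_for_model_ready` as written keeps polling until the model appears. `wait_with_timeout` and its arguments are hypothetical and not part of the recipe.

```bash
# Values copied from the env section of perf.yaml.
CONCURRENCY_PER_GPU=1
DEPLOYMENT_GPU_COUNT=8

TOTAL_CONCURRENCY=$((CONCURRENCY_PER_GPU * DEPLOYMENT_GPU_COUNT))
REQUEST_COUNT=$((10 * TOTAL_CONCURRENCY))   # mirrors --request-count
WARMUP_COUNT=$TOTAL_CONCURRENCY             # mirrors --warmup-request-count
echo "concurrency=$TOTAL_CONCURRENCY requests=$REQUEST_COUNT warmup=$WARMUP_COUNT"

# Hypothetical helper: retry a probe command every <interval> seconds and
# return non-zero once <timeout> seconds of sleeping have accumulated.
# Only sleep time is counted, so slow probes extend the real wall-clock wait.
wait_with_timeout() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}
```

The echo line prints `concurrency=8 requests=80 warmup=8`. In the job script, the existing `curl | jq` check could be wrapped in a small function and passed as the probe, for example `wait_with_timeout 1800 5 model_listed || exit 1`, where `model_listed` and the 1800-second budget are illustrative choices only.
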