Commit df53b7a2 (unverified), authored by Festus Ayobami Owumi, committed by GitHub

feat(recipes): add Qwen3-32B-FP8 vLLM disaggregated single-node recipe (#7915)


Signed-off-by: enfinity <festusowumi@gmail.com>
parent ccd1711c
@@ -35,6 +35,7 @@ These recipes demonstrate aggregated or disaggregated serving:
 | **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
 | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x H100/H200/A100 | ✅ | ✅ | FP8 quantization | ❌ |
 | **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x H100/H200/A100 | ✅ | ✅ | Prefill + Decode separation | ❌ |
+| **[Qwen3-32B-FP8](qwen3-32b-fp8/vllm/disagg/)** | vLLM | Disagg (Single-Node) | 8x A100 | ✅ | ✅ | 2× TP2 prefill + 1× TP4 decode, NixlConnector KV transfer | ❌ |
 | **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
 | **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
 | **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
......
 # Qwen3-32B-FP8 Recipes
-Production-ready deployments for **Qwen3-32B** with FP8 quantization using TensorRT-LLM.
+Production-ready deployments for **Qwen3-32B-FP8** with FP8 quantization using TensorRT-LLM and vLLM.
 ## Available Configurations
@@ -8,6 +8,7 @@ Production-ready deployments for **Qwen3-32B** with FP8 quantization using Tenso
 |--------------|------|------|-------------|
 | [**trtllm/agg**](trtllm/agg/) | 2x GPU | Aggregated | TP2, round-robin routing |
 | [**trtllm/disagg**](trtllm/disagg/) | 8x GPU | Disaggregated | Prefill/decode separation |
+| [**vllm/disagg**](vllm/disagg/) | 8x GPU | Disaggregated | 2× TP2 prefill + 1× TP4 decode |
 ## Prerequisites
@@ -34,13 +35,19 @@ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeo
 # Deploy (choose one configuration)
 kubectl apply -f trtllm/agg/deploy.yaml -n ${NAMESPACE}
 # OR: kubectl apply -f trtllm/disagg/deploy.yaml -n ${NAMESPACE}
+# OR: kubectl apply -f vllm/disagg/deploy.yaml -n ${NAMESPACE}
 ```
 ## Test the Deployment
 ```bash
 # Port-forward the frontend
+# If deployed trtllm/agg:
 kubectl port-forward svc/qwen3-32b-fp8-agg-frontend 8000:8000 -n ${NAMESPACE}
+# If deployed trtllm/disagg:
+# kubectl port-forward svc/qwen3-32b-fp8-disagg-frontend 8000:8000 -n ${NAMESPACE}
+# If deployed vllm/disagg:
+# kubectl port-forward svc/qwen3-32b-fp8-vllm-disagg-frontend 8000:8000 -n ${NAMESPACE}
 # Send a test request
 curl http://localhost:8000/v1/chat/completions \
@@ -55,12 +62,16 @@ curl http://localhost:8000/v1/chat/completions \
 ## Model Details
 - **Model**: `Qwen/Qwen3-32B-FP8`
-- **Backend**: TensorRT-LLM (PyTorch backend)
+- **Backends**: TensorRT-LLM (PyTorch backend) and vLLM
 - **Quantization**: FP8
-- **Tensor Parallel**: 2
+- **TensorRT-LLM aggregated**: TP=2
+- **TensorRT-LLM disaggregated**: 4× prefill TP=1 + 2× decode TP=2
+- **vLLM disaggregated**: 2× prefill TP=2 + 1× decode TP=4
 ## Notes
 - Update `storageClassName` in `model-cache/model-cache.yaml` before deploying
 - The aggregated config uses CUDA graphs for optimized inference
 - KV cache uses FP8 dtype for memory efficiency
+- The `vllm/disagg` config splits 8 GPUs as 2× prefill (TP=2) + 1× decode (TP=4) using NixlConnector KV transfer; all workers must be co-located on one node
+- `--max-model-len 8192` is set in `vllm/disagg/deploy.yaml` for A100 40 GB compatibility; remove or increase this flag on H100/H200
\ No newline at end of file
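The 8-GPU split described in the README notes is plain arithmetic: each worker group consumes replicas × tensor-parallel size GPUs. A minimal sketch of that check, assuming a hypothetical helper named `gpus_for` that is not part of the recipe:

```shell
#!/bin/sh
# Hypothetical helper (not part of the recipe): GPUs used by one worker group.
# $1 = replica count, $2 = tensor-parallel size
gpus_for() {
  echo $(( $1 * $2 ))
}

prefill_gpus=$(gpus_for 2 2)   # 2x prefill workers at TP=2
decode_gpus=$(gpus_for 1 4)    # 1x decode worker at TP=4
total_gpus=$(( prefill_gpus + decode_gpus ))

echo "prefill=${prefill_gpus} decode=${decode_gpus} total=${total_gpus}"
```

With the values from the README this prints `prefill=4 decode=4 total=8`, which matches the 8x GPU requirement listed for `vllm/disagg`.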
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-32b-fp8-vllm-disagg
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          workingDir: /workspace/examples/backends/vllm
      envs:
        - name: HF_HOME
          value: /opt/models
      replicas: 1
    VllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 40Gi
      extraPodSpec:
        affinity:
          podAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                      - key: nvidia.com/dynamo-component-type
                        operator: In
                        values:
                          - worker
                  topologyKey: kubernetes.io/hostname
        mainContainer:
          env:
            - name: SERVED_MODEL_NAME
              value: "Qwen/Qwen3-32B-FP8"
            - name: MODEL_PATH
              value: "Qwen/Qwen3-32B-FP8"
            - name: HF_HOME
              value: /opt/models
            - name: UCX_TLS
              value: "rc_x,rc,cuda_copy,cuda_ipc"
            - name: UCX_NET_DEVICES
              value: "mlx5_0:1"
            - name: UCX_IB_ADDR_TYPE
              value: "eth"
            - name: UCX_RNDV_SCHEME
              value: "get_zcopy"
            - name: UCX_RNDV_THRESH
              value: "0"
          args:
            - |
              ulimit -l unlimited && python3 -m dynamo.vllm \
                --model $MODEL_PATH \
                --served-model-name $SERVED_MODEL_NAME \
                --tensor-parallel-size 2 \
                --data-parallel-size 1 \
                --disaggregation-mode prefill \
                --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
                --gpu-memory-utilization 0.90 \
                --max-model-len 8192 \
                --no-enable-prefix-caching \
                --block-size 128
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          workingDir: /workspace/examples/backends/vllm
          securityContext:
            runAsUser: 0
            capabilities:
              add:
                - IPC_LOCK
                - SYS_RESOURCE
      replicas: 2
      resources:
        limits:
          gpu: "2"
          custom:
            rdma/ib: "2"
        requests:
          gpu: "2"
          custom:
            rdma/ib: "2"
    VllmDecodeWorker:
      componentType: worker
      subComponentType: decode
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      sharedMemory:
        size: 40Gi
      extraPodSpec:
        affinity:
          podAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                      - key: nvidia.com/dynamo-component-type
                        operator: In
                        values:
                          - worker
                  topologyKey: kubernetes.io/hostname
        mainContainer:
          env:
            - name: SERVED_MODEL_NAME
              value: "Qwen/Qwen3-32B-FP8"
            - name: MODEL_PATH
              value: "Qwen/Qwen3-32B-FP8"
            - name: HF_HOME
              value: /opt/models
            - name: UCX_TLS
              value: "rc_x,rc,cuda_copy,cuda_ipc"
            - name: UCX_NET_DEVICES
              value: "mlx5_0:1"
            - name: UCX_IB_ADDR_TYPE
              value: "eth"
            - name: UCX_RNDV_SCHEME
              value: "get_zcopy"
            - name: UCX_RNDV_THRESH
              value: "0"
          args:
            - |
              ulimit -l unlimited && python3 -m dynamo.vllm \
                --model $MODEL_PATH \
                --served-model-name $SERVED_MODEL_NAME \
                --tensor-parallel-size 4 \
                --data-parallel-size 1 \
                --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
                --gpu-memory-utilization 0.90 \
                --max-model-len 8192 \
                --no-enable-prefix-caching \
                --block-size 128
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          workingDir: /workspace/examples/backends/vllm
          securityContext:
            runAsUser: 0
            capabilities:
              add:
                - IPC_LOCK
                - SYS_RESOURCE
      replicas: 1
      resources:
        limits:
          gpu: "4"
          custom:
            rdma/ib: "2"
        requests:
          gpu: "4"
          custom:
            rdma/ib: "2"
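The README states that all workers must share one node, while the affinity rule in this manifest is `preferredDuringSchedulingIgnoredDuringExecution`, which Kubernetes treats as a soft preference rather than a guarantee. It may therefore be worth confirming placement after deploying. The sketch below is one way to do that; `count_unique_nodes` and `worker_nodes` are hypothetical helper names, and it assumes the worker pods actually carry the `nvidia.com/dynamo-component-type=worker` label that the affinity selector matches on.

```shell
#!/bin/sh
# Hypothetical helper (not part of the recipe): count distinct node names
# read from stdin as a whitespace-separated list.
count_unique_nodes() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u | wc -l | tr -d ' '
}

# Assumption: worker pods carry the label that the affinity rule selects on.
# Requires kubectl access to the cluster and NAMESPACE to be set.
worker_nodes() {
  kubectl get pods -n "${NAMESPACE}" \
    -l nvidia.com/dynamo-component-type=worker \
    -o jsonpath='{.items[*].spec.nodeName}'
}

# Usage (against a live cluster):
#   worker_nodes | count_unique_nodes
# A result of 1 means every worker pod landed on the same node.
```

If the count is greater than 1, the scheduler could not honor the preference, and the single-node assumption behind this recipe does not hold for that rollout.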
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: qwen3-32b-fp8-vllm-disagg-perf
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: qwen3-32b-fp8-vllm-disagg-perf
    spec:
      containers:
        - command:
            - /bin/sh
            - -c
            - |
              apt-get update && apt-get install -y curl jq procps git && apt-get clean
              pip install git+https://github.com/ai-dynamo/aiperf.git@54cd6dc820bff8bfebc875da104e59d745e14f75;
              echo "aiperf installation completed";
              sysctl -w net.ipv4.ip_local_port_range="1024 65000"
              cat /proc/sys/net/ipv4/ip_local_port_range
              export COLUMNS=200
              EPOCH=$(date +%s)
              ## utility functions -- can be moved to a bash script / configmap
              wait_for_model_ready() {
                echo "Waiting for model '$TARGET_MODEL' at $ENDPOINT/v1/models (checking every 5s)..."
                while ! curl -s "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
                  echo "[$(date '+%H:%M:%S')] Model not ready yet, sleeping 5s before checking again http://$ENDPOINT/v1/models"
                  sleep 5
                done
                echo "✅ Model '$TARGET_MODEL' is now available!"
                echo "Model '$TARGET_MODEL' is now available!"
                curl -s "http://$ENDPOINT/v1/models" | jq .
              }
              run_perf() {
                local concurrency=$1
                local isl=$2
                local osl=$3
                key=concurrency_${concurrency}
                export ARTIFACT_DIR="${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/${key}"
                mkdir -p "$ARTIFACT_DIR"
                echo "ARTIFACT_DIR: $ARTIFACT_DIR"
                aiperf profile --artifact-dir $ARTIFACT_DIR \
                  --model $TARGET_MODEL \
                  --tokenizer $TARGET_MODEL \
                  --endpoint-type chat \
                  --endpoint /v1/chat/completions \
                  --streaming \
                  --url http://$ENDPOINT \
                  --synthetic-input-tokens-mean $isl \
                  --synthetic-input-tokens-stddev 0 \
                  --output-tokens-mean $osl \
                  --output-tokens-stddev 0 \
                  --extra-inputs "max_tokens:$osl" \
                  --extra-inputs "min_tokens:$osl" \
                  --extra-inputs "ignore_eos:true" \
                  --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
                  --extra-inputs "repetition_penalty:1.0" \
                  --extra-inputs "temperature: 0.0" \
                  --concurrency $concurrency \
                  --request-count $((10*concurrency)) \
                  --warmup-request-count $concurrency \
                  --num-dataset-entries 12800 \
                  --random-seed 100 \
                  --workers-max $concurrency \
                  -H 'Authorization: Bearer NOT USED' \
                  -H 'Accept: text/event-stream' \
                  --record-processors 32 \
                  --ui simple
                echo "ARTIFACT_DIR: $ARTIFACT_DIR"
                ls -la $ARTIFACT_DIR
              }
              #### Actual execution ####
              wait_for_model_ready
              mkdir -p "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}"
              # Calculate total concurrency based on per-GPU concurrency and GPU count
              TOTAL_CONCURRENCY=$((CONCURRENCY_PER_GPU * DEPLOYMENT_GPU_COUNT))
              echo "Calculated total concurrency: $TOTAL_CONCURRENCY (${CONCURRENCY_PER_GPU} per GPU × ${DEPLOYMENT_GPU_COUNT} GPUs)"
              # Write input_config.json
              cat > "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/input_config.json" <<EOF
              {
                "gpu_count": $DEPLOYMENT_GPU_COUNT,
                "concurrency_per_gpu": $CONCURRENCY_PER_GPU,
                "total_concurrency": $TOTAL_CONCURRENCY,
                "mode": "$DEPLOYMENT_MODE",
                "isl": $ISL,
                "osl": $OSL,
                "endpoint": "$ENDPOINT",
                "model endpoint": "$TARGET_MODEL"
              }
              EOF
              # Run perf with calculated total concurrency
              run_perf $TOTAL_CONCURRENCY $ISL $OSL
              echo "done with concurrency $TOTAL_CONCURRENCY"
          env:
            - name: TARGET_MODEL
              value: "Qwen/Qwen3-32B-FP8"
            - name: ENDPOINT
              value: qwen3-32b-fp8-vllm-disagg-frontend:8000
            - name: CONCURRENCY_PER_GPU
              value: "1"
            - name: DEPLOYMENT_GPU_COUNT
              value: "8"
            - name: ISL
              value: "2000"
            - name: OSL
              value: "500"
            - name: DEPLOYMENT_MODE
              value: vllm-disagg
            - name: AIPERF_HTTP_CONNECTION_LIMIT
              value: "200"
            - name: JOB_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.labels['job-name']
            - name: ROOT_ARTIFACT_DIR
              value: /model-cache/perf
            - name: HF_HOME
              value: /model-cache
            - name: PYTHONUNBUFFERED
              value: "1"
          image: python:3.12-slim
          imagePullPolicy: IfNotPresent
          name: perf
          securityContext:
            privileged: true
          volumeMounts:
            - name: model-cache
              mountPath: /model-cache
          workingDir: /workspace
      imagePullSecrets:
        - name: nvcrimagepullsecret
      restartPolicy: Never
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
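The load that this Job generates follows directly from two environment variables: total concurrency is `CONCURRENCY_PER_GPU * DEPLOYMENT_GPU_COUNT`, the request count is ten times that, and the warmup count equals it. The sketch below reproduces that arithmetic outside the cluster so other settings can be previewed before editing the manifest; `load_plan` is a hypothetical helper name, not something the Job defines.

```shell
#!/bin/sh
# Hypothetical helper mirroring the arithmetic in the Job script.
# $1 = concurrency per GPU, $2 = GPU count
load_plan() {
  total=$(( $1 * $2 ))
  # requests corresponds to --request-count, warmup to --warmup-request-count
  echo "concurrency=${total} requests=$(( 10 * total )) warmup=${total}"
}

load_plan 1 8   # the Job's defaults: CONCURRENCY_PER_GPU=1, DEPLOYMENT_GPU_COUNT=8
```

With the defaults this prints `concurrency=8 requests=80 warmup=8`. Raising `CONCURRENCY_PER_GPU` scales all three values linearly.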