Commit c81c86e4, authored by ptarasiewiczNV, committed by GitHub

feat: vLLM DSR1 recipe (#4463)


Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
parent 77841e7b
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
### DeepSeek-R1 with vLLM — Disaggregated on 8x Hopper
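This recipe deploys DeepSeek-R1 using vLLM in a disaggregated prefill/decode setup on Hopper nodes with 8 GPUs each. The manifest in this recipe runs the prefill worker and the decode worker on two nodes apiece (16 GPUs per worker).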
- Model cache PVC + download job: `recipes/deepseek-r1/model-cache/`
- Deployment manifest: `recipes/deepseek-r1/vllm/disagg/deploy_hopper_16gpu.yaml`
### 0) Prerequisites: Install the platform
Follow the Kubernetes deployment guide to install the Dynamo platform and prerequisites (CRDs/operator, etc.):
- `docs/kubernetes/README.md`
Ensure you have a GPU-enabled cluster with sufficient capacity (8x H100/H200 “Hopper”), and that the NVIDIA GPU Operator is healthy.
### 1) Set namespace
```bash
export NAMESPACE=dynamo-system
kubectl create namespace ${NAMESPACE} || true
```
### 2) Apply Hugging Face secret
Add your Hugging Face token to the provided secret manifest and apply it, or create the secret directly:
```bash
# Option A: Apply YAML (edit the file to set your token)
kubectl apply -f ../../hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
# Option B: Create directly
# kubectl create secret generic hf-token-secret \
# --from-literal=HF_TOKEN="<your-hf-token>" \
# -n ${NAMESPACE}
```
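Before starting the download job, it can help to confirm that the secret exists and actually carries a token. The helper below is a minimal sketch: `check_hf_secret` is a hypothetical name, and it assumes the secret is called `hf-token-secret` with an `HF_TOKEN` key, as in Option B above. Adjust both if the YAML from Option A uses different names.

```bash
# Hypothetical helper: succeeds only if the secret exists and its HF_TOKEN key is non-empty.
# Assumes the secret name and key shown in Option B above.
check_hf_secret() {
  local namespace="$1"
  local secret_name="${2:-hf-token-secret}"
  local token_b64
  # Values under .data are base64-encoded; an empty result means the key is missing or blank.
  token_b64="$(kubectl get secret "${secret_name}" -n "${namespace}" \
    -o jsonpath='{.data.HF_TOKEN}')" || return 1
  [ -n "${token_b64}" ]
}
```

For example, `check_hf_secret "${NAMESPACE}" && echo "HF token secret found"` prints the message only when the key is populated.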
### 3) Provision model cache and download models
Update `storageClassName` in `recipes/deepseek-r1/model-cache/model-cache.yaml` to match your cluster, then apply:
```bash
# PVC for model cache
# Ensure storageClassName in model-cache.yaml matches an available StorageClass on your cluster
kubectl apply -f ../../../deepseek-r1/model-cache/model-cache.yaml -n ${NAMESPACE}
# Download DeepSeek-R1 weights into the cache
kubectl apply -f ../../../deepseek-r1/model-cache/model-download.yaml -n ${NAMESPACE}
# Wait for download job to finish
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
```
This will populate:
- `/model-cache/deepseek-r1`
- `/model-cache/deepseek-r1-fp4`
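If the download job stays pending, the PVC is a common culprit. The following sketch checks whether the claim is bound. `pvc_is_bound` is a hypothetical helper, and the default claim name `model-cache-pvc` is an assumption taken from the `pvcs` entry in the deployment manifest. Note that some StorageClasses only bind a claim once a pod consumes it, so a `Pending` phase before the job is scheduled is not necessarily an error.

```bash
# Hypothetical helper: succeeds when the named PVC reports phase "Bound".
# The default PVC name is an assumption based on the deployment manifest.
pvc_is_bound() {
  local namespace="$1"
  local pvc_name="${2:-model-cache-pvc}"
  local phase
  phase="$(kubectl get pvc "${pvc_name}" -n "${namespace}" \
    -o jsonpath='{.status.phase}')" || return 1
  [ "${phase}" = "Bound" ]
}
```

A typical use is `pvc_is_bound "${NAMESPACE}" || kubectl describe pvc model-cache-pvc -n "${NAMESPACE}"`, which prints the claim's events only when it is not bound.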
### 4) Deploy vLLM (Disaggregated, Prefill DEP16, Decode DEP16)
Apply the disaggregated deployment manifest:
```bash
kubectl apply -f ./deploy_hopper_16gpu.yaml -n ${NAMESPACE}
```
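The manifest runs separate prefill and decode workers, each mounting the shared model cache at `/model-cache`, with settings tuned for Hopper. Both workers use the same parallelism flags; they differ mainly in the all-to-all backend (`deepep_high_throughput` for prefill, `deepep_low_latency` for decode) and in the `--is-prefill-worker` flag.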
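Each worker in the manifest requests `nodeCount: 2` with `gpu: "8"` per node and launches vLLM with `--tensor-parallel-size 1 --data-parallel-size 16`. The sketch below is a quick sanity check to run when changing any of these values. It assumes that, without pipeline parallelism, a worker needs one GPU per TP × DP rank; `check_parallel_layout` is a hypothetical helper, not part of Dynamo or vLLM.

```bash
# Hypothetical helper: compare GPUs available to one worker with the ranks vLLM will launch.
# Arguments: NODES GPUS_PER_NODE TENSOR_PARALLEL_SIZE DATA_PARALLEL_SIZE
check_parallel_layout() {
  local nodes="$1" gpus_per_node="$2" tp="$3" dp="$4"
  local available=$(( nodes * gpus_per_node ))
  local required=$(( tp * dp ))
  if [ "${available}" -ne "${required}" ]; then
    echo "mismatch: ${available} GPUs available, ${required} ranks requested" >&2
    return 1
  fi
  echo "ok: ${required} ranks on ${nodes} node(s)"
}
```

With the values from this recipe, `check_parallel_layout 2 8 1 16` prints `ok: 16 ranks on 2 node(s)`.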
Test the deployment locally by port-forwarding and sending a request:
```bash
# Port-forward the frontend Service to localhost:8000.
# The Service name below assumes the deployment is named test3-vllm-dsr1; adjust it if yours differs.
kubectl port-forward svc/test3-vllm-dsr1-frontend 8000:8000 -n ${NAMESPACE} &
```
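Because the port-forward runs in the background and the model can take a long time to load, a request sent immediately may fail. The sketch below polls the frontend until it answers. It assumes the frontend serves `/health` on port 8000, which is the path the manifest's startup probe uses; `wait_for_health` is a hypothetical helper.

```bash
# Hypothetical helper: poll a URL until it returns a successful HTTP status.
# Arguments: URL (default http://localhost:8000/health), ATTEMPTS (default 30), DELAY seconds (default 2)
wait_for_health() {
  local url="${1:-http://localhost:8000/health}"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=0
  while [ "${i}" -lt "${attempts}" ]; do
    # -f makes curl exit non-zero on HTTP error codes; output is discarded.
    if curl -sf -o /dev/null "${url}"; then
      return 0
    fi
    sleep "${delay}"
    i=$(( i + 1 ))
  done
  return 1
}
```

For example, `wait_for_health && echo "frontend is ready"` waits up to about a minute with the defaults.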
```bash
curl -sS http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer dummy' \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [{"role":"user","content":"Say hello!"}],
"max_tokens": 64
}'
```
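The endpoint returns an OpenAI-style chat completion object, so the generated text sits under `choices[0].message.content`. A minimal sketch for pulling out just that text follows; `extract_reply` is a hypothetical helper and relies on `python3` being available locally for JSON parsing.

```bash
# Hypothetical helper: read a chat-completions JSON response on stdin and print
# the first choice's message content. python3 is used only to parse the JSON.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
```

Pipe the `curl` command above into it, for example `curl -sS ... | extract_reply`.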
### Notes
- If your cluster/network requires specific interfaces, adjust environment variables (e.g., `NCCL_SOCKET_IFNAME`) in the manifest accordingly.
- If your storage class differs, update `storageClassName` before applying the PVC.
- **If you want to run multinode deployments, IBGDA (InfiniBand GPU Direct Async) must be enabled on your nodes.** To enable IBGDA, you can follow this configuration script: [configure_system_drivers.sh](https://github.com/vllm-project/vllm/blob/v0.11.2/tools/ep_kernels/configure_system_drivers.sh). The script configures NVIDIA driver parameters and requires a system reboot to take effect.
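To choose a value for variables such as `NCCL_SOCKET_IFNAME` or `GLOO_SOCKET_IFNAME`, it helps to see which interface names exist on the node or inside the pod. The sketch below assumes a Linux host that exposes interfaces under `/sys/class/net`. `list_interfaces` is a hypothetical helper, and it takes the directory as an optional argument only so it can be tried against a scratch directory.

```bash
# Hypothetical helper: print one network interface name per line.
# Argument: sysfs directory to scan (default /sys/class/net).
list_interfaces() {
  local sysfs="${1:-/sys/class/net}"
  local path
  for path in "${sysfs}"/*; do
    # Skip the unexpanded glob when the directory is empty or missing.
    [ -e "${path}" ] || continue
    basename "${path}"
  done
}
```

Inside a running worker pod, the same information is available with `kubectl exec` and `ls /sys/class/net`.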
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: test3-vllm-dsr1
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache-pvc
      create: false
  services:
    Frontend:
      dynamoNamespace: vllm-dsr1
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            timeoutSeconds: 1800
            failureThreshold: 60
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
    decode:
      dynamoNamespace: vllm-dsr1
      componentType: worker
      replicas: 1
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "8"
          custom:
            rdma/ib: "8"
      volumeMounts:
        - name: model-cache-pvc
          mountPoint: /model-cache
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 600
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/dynamo
          env:
            - name: VLLM_USE_DEEP_GEMM
              value: "1"
            - name: VLLM_ALL2ALL_BACKEND
              value: deepep_low_latency
            - name: VLLM_MOE_DP_CHUNK_SIZE
              value: "512"
            - name: VLLM_SKIP_P2P_CHECK
              value: "1"
            - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
              value: "1"
            - name: NVIDIA_GDRCOPY
              value: enabled
            - name: VLLM_MOE_ROUTING_SIMULATION_STRATEGY
              value: "uniform_random"
            - name: NVSHMEM_QP_DEPTH
              value: "1512"
            - name: GLOO_SOCKET_IFNAME
              value: eth0
          command:
            - /bin/bash
            - -c
          args:
            - |
              exec python3 -m dynamo.vllm \
                --model /model-cache/deepseek-r1 \
                --served-model-name deepseek-ai/DeepSeek-R1 \
                --data-parallel-hybrid-lb \
                --tensor-parallel-size 1 \
                --data-parallel-size 16 \
                --enable-expert-parallel \
                --no-enable-prefix-caching \
                --max-model-len 16384 \
                --enable-dbo \
                --dbo-decode-token-threshold 32 \
                --async-scheduling \
                --enable-eplb \
                --eplb-config '{"window_size":"1000","step_interval":"3000","num_redundant_experts":"32","log_balancedness":"False"}' \
                --max-num-seqs 512 \
                --compilation_config '{"pass_config":{"enable_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY"}'
    prefill:
      dynamoNamespace: vllm-dsr1
      componentType: worker
      replicas: 1
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "8"
          custom:
            rdma/ib: "8"
      volumeMounts:
        - name: model-cache-pvc
          mountPoint: /model-cache
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        mainContainer:
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 600
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/dynamo
          env:
            - name: VLLM_USE_DEEP_GEMM
              value: "1"
            - name: VLLM_ALL2ALL_BACKEND
              value: deepep_high_throughput
            - name: VLLM_MOE_DP_CHUNK_SIZE
              value: "512"
            - name: VLLM_SKIP_P2P_CHECK
              value: "1"
            - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
              value: "1"
            - name: NVIDIA_GDRCOPY
              value: enabled
            - name: VLLM_MOE_ROUTING_SIMULATION_STRATEGY
              value: "uniform_random"
            - name: NVSHMEM_QP_DEPTH
              value: "1512"
            - name: GLOO_SOCKET_IFNAME
              value: eth0
          command:
            - /bin/bash
            - -c
          args:
            - |
              exec python3 -m dynamo.vllm \
                --model /model-cache/deepseek-r1 \
                --is-prefill-worker \
                --served-model-name deepseek-ai/DeepSeek-R1 \
                --data-parallel-hybrid-lb \
                --tensor-parallel-size 1 \
                --data-parallel-size 16 \
                --enable-expert-parallel \
                --no-enable-prefix-caching \
                --max-model-len 16384 \
                --enable-dbo \
                --dbo-decode-token-threshold 32 \
                --async-scheduling \
                --enable-eplb \
                --eplb-config '{"window_size":"1000","step_interval":"3000","num_redundant_experts":"32","log_balancedness":"False"}' \
                --max-num-seqs 512 \
                --compilation_config '{"pass_config":{"enable_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY"}'