Unverified commit 0bfe9822, authored by ptarasiewiczNV, committed by GitHub

chore: Clean up vLLM DSR1 wideEP recipe (#5389)


Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
parent 69e44e98
@@ -89,10 +89,10 @@ curl -sS http://localhost:8000/v1/chat/completions \
### Notes
- For more details on expert parallel and advanced deployment configurations, refer to [vLLM Expert Parallel Deployment Documentation](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/).
- If your cluster/network requires specific interfaces, adjust environment variables (e.g., `NCCL_SOCKET_IFNAME`) in the manifest accordingly.
- If your storage class differs, update `storageClassName` before applying the PVC.
- **If you want to run multinode deployments, IBGDA (InfiniBand GPU Direct Async) must be enabled on your nodes.** To enable IBGDA, you can follow this configuration script: [configure_system_drivers.sh](https://github.com/vllm-project/vllm/blob/v0.11.2/tools/ep_kernels/configure_system_drivers.sh). The script configures NVIDIA driver parameters and requires a system reboot to take effect.
- `VLLM_MOE_DP_CHUNK_SIZE` can be tuned further. The value 384 was chosen to be largest possible that still can be deployed on 16 H200s. This value should be greater than per rank concurrency.
- Starting with vLLM v0.12.0 (Dynamo v0.8.0) DeepSeek-R1 in this configuration might return gibberish outputs, please track the upstream issue [vLLM #32190](https://github.com/vllm-project/vllm/issues/32190).
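The sizing rule in the `VLLM_MOE_DP_CHUNK_SIZE` note (the chunk size should exceed per-rank concurrency) can be sanity-checked with a short sketch. The helper names and the total concurrency of 4096 are hypothetical, and the even split of requests across data-parallel ranks is an assumption; real load balancing may be uneven, so treat the result as a lower bound rather than a guarantee.

```python
import math


def per_rank_concurrency(total_concurrency: int, data_parallel_size: int) -> int:
    """Requests each DP rank sees, ASSUMING an even split across ranks."""
    return math.ceil(total_concurrency / data_parallel_size)


def chunk_size_is_sufficient(
    chunk_size: int, total_concurrency: int, data_parallel_size: int
) -> bool:
    """True when the chunk size is strictly greater than per-rank concurrency."""
    return chunk_size > per_rank_concurrency(total_concurrency, data_parallel_size)


if __name__ == "__main__":
    # 384 and 16 come from the recipe; 4096 is a hypothetical client concurrency.
    chunk_size = 384
    dp_size = 16
    total = 4096
    print(per_rank_concurrency(total, dp_size))
    print(chunk_size_is_sufficient(chunk_size, total, dp_size))
```

Under these assumptions, doubling the hypothetical client concurrency to 8192 would push per-rank concurrency past 384, at which point the note suggests revisiting the chunk size or the number of ranks.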
@@ -54,8 +54,6 @@ spec:
  env:
  - name: VLLM_USE_DEEP_GEMM
    value: "1"
- - name: VLLM_ALL2ALL_BACKEND
-   value: deepep_low_latency
  - name: VLLM_MOE_DP_CHUNK_SIZE
    value: "384"
  - name: VLLM_SKIP_P2P_CHECK
@@ -64,8 +62,6 @@ spec:
    value: "1"
  - name: NVIDIA_GDRCOPY
    value: enabled
- - name: VLLM_MOE_ROUTING_SIMULATION_STRATEGY
-   value: "uniform_random"
  - name: GLOO_SOCKET_IFNAME
    value: eth0
  command:
@@ -76,11 +72,11 @@ spec:
  exec python3 -m dynamo.vllm \
    --model /model-cache/deepseek-r1 \
    --served-model-name deepseek-ai/DeepSeek-R1 \
+   --all2all-backend deepep_low_latency \
    --data-parallel-hybrid-lb \
    --tensor-parallel-size 1 \
    --data-parallel-size 16 \
    --enable-expert-parallel \
-   --no-enable-prefix-caching \
    --max-model-len 16384 \
    --enable-dbo \
    --dbo-decode-token-threshold 32 \
@@ -119,18 +115,12 @@ spec:
  env:
  - name: VLLM_USE_DEEP_GEMM
    value: "1"
- - name: VLLM_ALL2ALL_BACKEND
-   value: deepep_high_throughput
- - name: VLLM_MOE_DP_CHUNK_SIZE
-   value: "384"
  - name: VLLM_SKIP_P2P_CHECK
    value: "1"
  - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
    value: "1"
  - name: NVIDIA_GDRCOPY
    value: enabled
- - name: VLLM_MOE_ROUTING_SIMULATION_STRATEGY
-   value: "uniform_random"
  - name: GLOO_SOCKET_IFNAME
    value: eth0
  command:
@@ -142,17 +132,15 @@ spec:
    --model /model-cache/deepseek-r1 \
    --is-prefill-worker \
    --served-model-name deepseek-ai/DeepSeek-R1 \
+   --all2all-backend deepep_high_throughput \
    --data-parallel-hybrid-lb \
    --tensor-parallel-size 1 \
    --data-parallel-size 16 \
    --enable-expert-parallel \
-   --no-enable-prefix-caching \
    --max-model-len 16384 \
    --enable-dbo \
    --dbo-decode-token-threshold 32 \
    --async-scheduling \
    --enable-eplb \
    --eplb-config '{"window_size":"1000","step_interval":"3000","num_redundant_experts":"32","log_balancedness":"False"}' \
-   --max-num-seqs 512 \
-   --compilation_config '{"pass_config":{"enable_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY"}'
+   --max-num-seqs 512