chore: Clean up vLLM DSR1 wideEP recipe (#5389)

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>

Commit 0bfe9822 (unverified), authored Jan 13, 2026 by ptarasiewiczNV, committed by GitHub Jan 13, 2026
Showing with 4 additions and 16 deletions

recipes/deepseek-r1/vllm/disagg/README.md +1 -1

recipes/deepseek-r1/vllm/disagg/deploy_hopper_16gpu.yaml +3 -15

--- a/recipes/deepseek-r1/vllm/disagg/README.md
+++ b/recipes/deepseek-r1/vllm/disagg/README.md
@@ -89,10 +89,10 @@ curl -sS http://localhost:8000/v1/chat/completions \
 ### Notes
+- For more details on expert parallel and advanced deployment configurations, refer to [vLLM Expert Parallel Deployment Documentation](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/).
 - If your cluster/network requires specific interfaces, adjust environment variables (e.g., `NCCL_SOCKET_IFNAME`) in the manifest accordingly.
 - If your storage class differs, update `storageClassName` before applying the PVC.
 - **If you want to run multinode deployments, IBGDA (InfiniBand GPU Direct Async) must be enabled on your nodes.** To enable IBGDA, you can follow this configuration script: [configure_system_drivers.sh](https://github.com/vllm-project/vllm/blob/v0.11.2/tools/ep_kernels/configure_system_drivers.sh). The script configures NVIDIA driver parameters and requires a system reboot to take effect.
 - `VLLM_MOE_DP_CHUNK_SIZE` can be tuned further. The value 384 was chosen to be largest possible that still can be deployed on 16 H200s. This value should be greater than per rank concurrency.
- Starting with vLLM v0.12.0 (Dynamo v0.8.0) DeepSeek-R1 in this configuration might return gibberish outputs, please track the upstream issue [vLLM #32190](https://github.com/vllm-project/vllm/issues/32190).
--- a/recipes/deepseek-r1/vllm/disagg/deploy_hopper_16gpu.yaml
+++ b/recipes/deepseek-r1/vllm/disagg/deploy_hopper_16gpu.yaml
@@ -54,8 +54,6 @@ spec:
          env:
            - name: VLLM_USE_DEEP_GEMM
              value: "1"
-            - name: VLLM_ALL2ALL_BACKEND
-              value: deepep_low_latency
            - name: VLLM_MOE_DP_CHUNK_SIZE
              value: "384"
            - name: VLLM_SKIP_P2P_CHECK
@@ -64,8 +62,6 @@ spec:
              value: "1"
            - name: NVIDIA_GDRCOPY
              value: enabled
-            - name: VLLM_MOE_ROUTING_SIMULATION_STRATEGY
-              value: "uniform_random"
            - name: GLOO_SOCKET_IFNAME
              value: eth0
          command:
@@ -76,11 +72,11 @@ spec:
              exec python3 -m dynamo.vllm \
                --model /model-cache/deepseek-r1 \
                --served-model-name deepseek-ai/DeepSeek-R1 \
+                --all2all-backend deepep_low_latency \
                --data-parallel-hybrid-lb \
                --tensor-parallel-size 1 \
                --data-parallel-size 16 \
                --enable-expert-parallel \
-                --no-enable-prefix-caching \
                --max-model-len 16384 \
                --enable-dbo \
                --dbo-decode-token-threshold 32 \
@@ -119,18 +115,12 @@ spec:
          env:
            - name: VLLM_USE_DEEP_GEMM
              value: "1"
-            - name: VLLM_ALL2ALL_BACKEND
-              value: deepep_high_throughput
-            - name: VLLM_MOE_DP_CHUNK_SIZE
-              value: "384"
            - name: VLLM_SKIP_P2P_CHECK
              value: "1"
            - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
              value: "1"
            - name: NVIDIA_GDRCOPY
              value: enabled
-            - name: VLLM_MOE_ROUTING_SIMULATION_STRATEGY
-              value: "uniform_random"
            - name: GLOO_SOCKET_IFNAME
              value: eth0
          command:
@@ -142,17 +132,15 @@ spec:
                --model /model-cache/deepseek-r1 \
                --is-prefill-worker \
                --served-model-name deepseek-ai/DeepSeek-R1 \
+                --all2all-backend deepep_high_throughput \
                --data-parallel-hybrid-lb \
                --tensor-parallel-size 1 \
                --data-parallel-size 16 \
                --enable-expert-parallel \
-                --no-enable-prefix-caching \
                --max-model-len 16384 \
                --enable-dbo \
                --dbo-decode-token-threshold 32 \
                --async-scheduling \
                --enable-eplb \
                --eplb-config '{"window_size":"1000","step_interval":"3000","num_redundant_experts":"32","log_balancedness":"False"}' \
-                --max-num-seqs 512 \
+                --max-num-seqs 512
-                --compilation_config '{"pass_config":{"enable_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY"}'
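
Review note: the README hunk above says `VLLM_MOE_DP_CHUNK_SIZE` should be greater than per-rank concurrency, and after this change the variable is set only in the first env block touched in the manifest. A rough sanity check for that rule could look like the sketch below. It assumes requests are spread evenly across data-parallel ranks, which real load balancing may not guarantee. The helper names are hypothetical and are not part of vLLM or Dynamo, and the total concurrency figures are illustrative only.

```python
import math


def per_rank_concurrency(total_concurrency: int, data_parallel_size: int) -> int:
    """Requests each data-parallel rank sees, assuming an even split (assumption)."""
    return math.ceil(total_concurrency / data_parallel_size)


def chunk_size_is_sufficient(
    chunk_size: int, total_concurrency: int, data_parallel_size: int
) -> bool:
    """True when the chunk size exceeds the estimated per-rank concurrency."""
    return chunk_size > per_rank_concurrency(total_concurrency, data_parallel_size)


if __name__ == "__main__":
    # 384 and 16 come from the manifest (VLLM_MOE_DP_CHUNK_SIZE, --data-parallel-size).
    # The total concurrency values are made up for illustration.
    for total in (4096, 8192):
        ok = chunk_size_is_sufficient(384, total, 16)
        print(f"total={total} per_rank={per_rank_concurrency(total, 16)} ok={ok}")
```

Under the even-split assumption, a chunk size of 384 covers up to 383 concurrent requests per rank; beyond that the value would need retuning, subject to the memory limit the README mentions for 16 H200s.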