[doc] Add more details for Ray-based DP (#20948)

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

d9127818 · Rui Qiao · GitHub · 20149d84 · d9127818
Unverified Commit d9127818 authored Jul 15, 2025 by Rui Qiao Committed by GitHub Jul 15, 2025

Showing with 10 additions and 2 deletions

docs/serving/data_parallel_deployment.md +10 -2

--- a/docs/serving/data_parallel_deployment.md
+++ b/docs/serving/data_parallel_deployment.md
@@ -57,12 +57,20 @@ vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
 ```
 
-This DP mode can also be used with Ray, in which case only a single launch command is needed irrespective of the number of nodes:
+This DP mode can also be used with Ray by specifying `--data-parallel-backend=ray`:
 
 ```bash
-vllm serve $MODEL --data-parallel-size 16 --tensor-parallel-size 2 --data-parallel-backend=ray
+vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
+                  --data-parallel-backend=ray
 ```
 
+There are several notable differences when using Ray:
+
+- A single launch command (on any node) is needed to start all local and remote DP ranks, therefore it is more convenient compared to launching on each node
+- There is no need to specify `--data-parallel-address`, and the node where the command is run is used as `--data-parallel-address`
+- There is no need to specify `--data-parallel-rpc-port`
+- Remote DP ranks will be allocated based on node resources of the Ray cluster
+
 Currently, the internal DP load balancing is done within the API server process(es) and is based on the running and waiting queues in each of the engines. This could be made more sophisticated in future by incorporating KV cache aware logic.
 
 When deploying large DP sizes using this method, the API server process can become a bottleneck. In this case, the orthogonal `--api-server-count` command line option can be used to scale this out (for example `--api-server-count=4`). This is transparent to users - a single HTTP endpoint / port is still exposed. Note that this API server scale-out is "internal" and still confined to the "head" node.
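
Note on the load-balancing paragraph in the hunk above: the doc says an engine is chosen from the running and waiting queues of each engine, but it does not give the formula. The following is a minimal sketch of one way such a choice could work, not vLLM's actual implementation. The `EngineLoad` class, the `pick_engine` function, and the `WAITING_WEIGHT` value are all hypothetical and introduced only for illustration; the assumption is that a waiting request should count for more than a running one.

```python
from dataclasses import dataclass


@dataclass
class EngineLoad:
    """Queue sizes reported by one DP engine (hypothetical structure)."""
    rank: int
    running: int  # requests currently being processed
    waiting: int  # requests queued but not yet scheduled


# Assumed weight: one waiting request counts as much as four running ones.
WAITING_WEIGHT = 4


def pick_engine(loads: list[EngineLoad]) -> int:
    """Return the rank of the engine with the lowest load score.

    Ties are broken in favour of the lowest rank.
    """
    if not loads:
        raise ValueError("no engines available")
    best = min(
        loads,
        key=lambda e: (e.waiting * WAITING_WEIGHT + e.running, e.rank),
    )
    return best.rank


loads = [
    EngineLoad(rank=0, running=8, waiting=2),   # score 16
    EngineLoad(rank=1, running=10, waiting=0),  # score 10
    EngineLoad(rank=2, running=3, waiting=3),   # score 15
    EngineLoad(rank=3, running=12, waiting=1),  # score 16
]
print(pick_engine(loads))  # prints 1
```

Rank 1 wins even though it has more running requests than ranks 0 and 2, because nothing is waiting in its queue. A KV cache aware policy, which the doc mentions as possible future work, would need additional inputs beyond these two counters.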