Unverified commit 7c16f3fb authored by Isotr0py, committed by GitHub

[Doc] Add documents for multi-node distributed serving with MP backend (#30509)


Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
parent ddbfbe52
@@ -62,7 +62,7 @@ If a single node lacks sufficient GPUs to hold the model, deploy vLLM across mul
### What is Ray?
-Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments require Ray as the runtime engine.
+Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments can use Ray as the runtime engine.
vLLM uses Ray to manage the distributed execution of tasks across multiple nodes and control where execution happens.
@@ -130,6 +130,28 @@ vllm serve /path/to/the/model/in/the/container \
--distributed-executor-backend ray
``` ```
### Running vLLM with MultiProcessing
Besides Ray, multi-node vLLM deployments can also use `multiprocessing` as the runtime engine. The following example deploys a model across 2 nodes (8 GPUs per node) with `tp_size=8` and `pp_size=2`.
Choose one node as the head node and run:
```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 --pipeline-parallel-size 2 \
--nnodes 2 --node-rank 0 \
--master-addr <HEAD_NODE_IP>
```
On the other node (the worker), run:
```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 --pipeline-parallel-size 2 \
--nnodes 2 --node-rank 1 \
--master-addr <HEAD_NODE_IP> --headless
```
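The two commands above differ only in `--node-rank` and the trailing `--headless`. As a rough sketch of that pattern, the hypothetical helper below (`build_serve_commands` is not part of vLLM) builds one command string per node using only the flags shown in this section. The check that `tp_size * pp_size` equals the total GPU count is an assumption drawn from the example (8 × 2 = 2 nodes × 8 GPUs), not a documented rule, and `10.0.0.1` is a placeholder for `<HEAD_NODE_IP>`.

```python
import shlex


def build_serve_commands(model_path, tp_size, pp_size, nnodes, master_addr,
                         gpus_per_node=None):
    """Return one `vllm serve` command string per node rank (hypothetical helper)."""
    # Assumed sanity check: the example uses tp_size * pp_size == total GPUs.
    if gpus_per_node is not None and tp_size * pp_size != nnodes * gpus_per_node:
        raise ValueError(
            f"tp_size * pp_size = {tp_size * pp_size}, but the cluster has "
            f"{nnodes * gpus_per_node} GPUs"
        )
    commands = []
    for rank in range(nnodes):
        argv = [
            "vllm", "serve", model_path,
            "--tensor-parallel-size", str(tp_size),
            "--pipeline-parallel-size", str(pp_size),
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),
            "--master-addr", master_addr,
        ]
        # Rank 0 is the head node; every other rank is launched with --headless.
        if rank != 0:
            argv.append("--headless")
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for cmd in build_serve_commands(
        "/path/to/the/model/in/the/container", 8, 2, 2, "10.0.0.1", gpus_per_node=8
    ):
        print(cmd)
```

Generating the commands from a single set of parameters keeps the parallelism flags and head address identical on every node, which is the part most likely to drift when the commands are typed by hand.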
## Optimizing network communication for tensor parallelism
Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
......
@@ -124,9 +124,7 @@ class MultiprocExecutor(Executor):
# Set multiprocessing envs
set_multiprocessing_worker_envs()
-# Multiprocessing-based executor does not support multi-node setting.
-# Since it only works for single node, we can use the loopback address
-# get_loopback_ip() for communication.
+# use the loopback address get_loopback_ip() for communication.
distributed_init_method = get_distributed_init_method(
    get_loopback_ip(), get_open_port()
)
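For readers unfamiliar with this pattern, here is a minimal standalone sketch of what pairing an address with a free port typically looks like. The names `find_open_port` and `make_init_method` are hypothetical stand-ins, not the vLLM helpers called in the hunk, and the sketch assumes the result is a `tcp://host:port` URL of the kind `torch.distributed` accepts as an `init_method`.

```python
import socket


def find_open_port() -> int:
    """Ask the OS for a currently free TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 lets the kernel pick an unused port.
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def make_init_method(ip: str, port: int) -> str:
    """Format a tcp:// rendezvous URL; IPv6 literals are wrapped in brackets."""
    host = f"[{ip}]" if ":" in ip else ip
    return f"tcp://{host}:{port}"


if __name__ == "__main__":
    print(make_init_method("127.0.0.1", find_open_port()))
```

A port found this way is only free at the moment of the check; another process could claim it before the rendezvous starts, so this is a sketch of the idea and not a race-free allocation scheme.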
......