OpenDAS / vllm_cscc
Commit 7c16f3fb (Unverified)
Authored Dec 14, 2025 by Isotr0py
Committed by GitHub, Dec 13, 2025
[Doc] Add documents for multi-node distributed serving with MP backend (#30509)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Parent: ddbfbe52
Showing 2 changed files with 24 additions and 4 deletions:
docs/serving/parallelism_scaling.md (+23 -1)
vllm/v1/executor/multiproc_executor.py (+1 -3)
docs/serving/parallelism_scaling.md @ 7c16f3fb
    @@ -62,7 +62,7 @@ If a single node lacks sufficient GPUs to hold the model, deploy vLLM across mul
     ### What is Ray?
    -Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments require Ray as the runtime engine.
    +Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments can use Ray as the runtime engine.
     vLLM uses Ray to manage the distributed execution of tasks across multiple nodes and control where execution happens.
    @@ -130,6 +130,28 @@ vllm serve /path/to/the/model/in/the/container \
         --distributed-executor-backend ray
     ```
    +### Running vLLM with MultiProcessing
    +Besides Ray, Multi-node vLLM deployments can also use `multiprocessing` as the runtime engine. Here's an example to deploy model across 2 nodes (8 GPUs per node) with `tp_size=8` and `pp_size=2`.
    +Choose one node as the head node and run:
    +```bash
    +vllm serve /path/to/the/model/in/the/container \
    +    --tensor-parallel-size 8 --pipeline-parallel-size 2 \
    +    --nnodes 2 --node-rank 0 \
    +    --master-addr <HEAD_NODE_IP>
    +```
    +On the other worker node, run:
    +```bash
    +vllm serve /path/to/the/model/in/the/container \
    +    --tensor-parallel-size 8 --pipeline-parallel-size 2 \
    +    --nnodes 2 --node-rank 1 \
    +    --master-addr <HEAD_NODE_IP> --headless
    +```
     ## Optimizing network communication for tensor parallelism
     Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
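Review note: the two commands added in this hunk differ only in `--node-rank` and the trailing `--headless` flag. A minimal Python sketch of that pattern follows, using only the flags that appear in the diff. The helper name `build_node_commands` is hypothetical, and applying `--headless` to every non-zero rank when there are more than two nodes is an assumption extrapolated from the two-node example, not something the commit states.

```python
import shlex


def build_node_commands(model_path, master_addr, nnodes, tp_size, pp_size):
    """Return one `vllm serve` argument list per node, head node (rank 0) first.

    Hypothetical helper for illustration; it only assembles strings and does
    not launch anything.
    """
    commands = []
    for rank in range(nnodes):
        cmd = [
            "vllm", "serve", model_path,
            "--tensor-parallel-size", str(tp_size),
            "--pipeline-parallel-size", str(pp_size),
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),
            "--master-addr", master_addr,
        ]
        if rank != 0:
            # Assumption: as in the documented example, every non-head
            # node is started with --headless.
            cmd.append("--headless")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Mirrors the documented example: 2 nodes, tp_size=8, pp_size=2.
    for argv in build_node_commands(
        "/path/to/the/model/in/the/container", "<HEAD_NODE_IP>",
        nnodes=2, tp_size=8, pp_size=2,
    ):
        print(shlex.join(argv))
```

In the documented example, `tp_size * pp_size` is 16, which matches the 2 nodes with 8 GPUs each.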
vllm/v1/executor/multiproc_executor.py @ 7c16f3fb
    @@ -124,9 +124,7 @@ class MultiprocExecutor(Executor):
     # Set multiprocessing envs
     set_multiprocessing_worker_envs()
    -# Multiprocessing-based executor does not support multi-node setting.
    -# Since it only works for single node, we can use the loopback address
    -# get_loopback_ip() for communication.
    +# use the loopback address get_loopback_ip() for communication.
     distributed_init_method = get_distributed_init_method(
         get_loopback_ip(), get_open_port()
     )
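Review note: the hunk keeps the loopback-plus-open-port rendezvous and only trims the comment. The idea can be sketched with the standard library alone. `find_open_port` and `make_init_method` below are hypothetical stand-ins, not vLLM's `get_open_port` or `get_distributed_init_method`, and the `tcp://` URL format is an assumption.

```python
import socket


def find_open_port() -> int:
    """Ask the OS for a currently free TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


def make_init_method(ip: str, port: int) -> str:
    """Format a TCP rendezvous URL (assumed format, for illustration only)."""
    return f"tcp://{ip}:{port}"


# Build a loopback rendezvous address for processes on the same machine.
init_method = make_init_method("127.0.0.1", find_open_port())
print(init_method)
```

The port is released before the URL is used, so another process could claim it in between; this sketch does not guard against that race.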