OpenDAS / vllm_cscc
Commit 7c16f3fb (unverified)
Authored Dec 14, 2025 by Isotr0py; committed by GitHub Dec 13, 2025
[Doc] Add documents for multi-node distributed serving with MP backend (#30509)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Parent: ddbfbe52
Showing 2 changed files with 24 additions and 4 deletions:
docs/serving/parallelism_scaling.md (+23, -1)
vllm/v1/executor/multiproc_executor.py (+1, -3)
docs/serving/parallelism_scaling.md @ 7c16f3fb
...
...
@@ -62,7 +62,7 @@ If a single node lacks sufficient GPUs to hold the model, deploy vLLM across mul
### What is Ray?
-Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments require Ray as the runtime engine.
+Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments can use Ray as the runtime engine.
vLLM uses Ray to manage the distributed execution of tasks across multiple nodes and control where execution happens.
...
...
@@ -130,6 +130,28 @@ vllm serve /path/to/the/model/in/the/container \
    --distributed-executor-backend ray
```
### Running vLLM with MultiProcessing
Besides Ray, Multi-node vLLM deployments can also use `multiprocessing` as the runtime engine. Here's an example to deploy model across 2 nodes (8 GPUs per node) with `tp_size=8` and `pp_size=2`.
Choose one node as the head node and run:
```bash
vllm serve /path/to/the/model/in/the/container \
  --tensor-parallel-size 8 --pipeline-parallel-size 2 \
  --nnodes 2 --node-rank 0 \
  --master-addr <HEAD_NODE_IP>
```
On the other worker node, run:
```bash
vllm serve /path/to/the/model/in/the/container \
  --tensor-parallel-size 8 --pipeline-parallel-size 2 \
  --nnodes 2 --node-rank 1 \
  --master-addr <HEAD_NODE_IP> --headless
```
## Optimizing network communication for tensor parallelism
Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
...
...
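Reviewer note, not part of the commit: the two commands added above differ only in `--node-rank` and in the trailing `--headless` on the non-head node. A minimal Python sketch of that pattern follows. The helper name `build_serve_command`, the `gpus_per_node` parameter, the example IP, and the check that `tp_size * pp_size` equals the total GPU count are assumptions for illustration (the check simply mirrors the 2 nodes x 8 GPUs = 8 x 2 arithmetic in the example); only the CLI flags themselves come from the diff.

```python
import shlex


def build_serve_command(model_path, tp_size, pp_size, nnodes, node_rank,
                        master_addr, gpus_per_node=8):
    """Build the `vllm serve` argv for one node (hypothetical helper)."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    # Assumed sanity check: the example uses every GPU on every node.
    if tp_size * pp_size != nnodes * gpus_per_node:
        raise ValueError("tp_size * pp_size must equal nnodes * gpus_per_node")
    args = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(tp_size),
        "--pipeline-parallel-size", str(pp_size),
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--master-addr", master_addr,
    ]
    # In the documented example, only the non-head node passes --headless.
    if node_rank != 0:
        args.append("--headless")
    return args


# Print one command per node; 10.0.0.1 is a placeholder head-node IP.
for rank in range(2):
    cmd = build_serve_command("/path/to/the/model/in/the/container",
                              tp_size=8, pp_size=2, nnodes=2,
                              node_rank=rank, master_addr="10.0.0.1")
    print(shlex.join(cmd))
```

Generating both commands from one function keeps the shared flags identical across nodes, which is the part most likely to drift when the commands are typed by hand.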
vllm/v1/executor/multiproc_executor.py @ 7c16f3fb
...
...
@@ -124,9 +124,7 @@ class MultiprocExecutor(Executor):
 # Set multiprocessing envs
 set_multiprocessing_worker_envs()
-# Multiprocessing-based executor does not support multi-node setting.
-# Since it only works for single node, we can use the loopback address
-# get_loopback_ip() for communication.
+# use the loopback address get_loopback_ip() for communication.
 distributed_init_method = get_distributed_init_method(
     get_loopback_ip(), get_open_port()
 )
...
...
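Reviewer note, not part of the commit: the hunk above builds an init method from a loopback address and a free port. The sketch below shows one common way such a string can be assembled with the standard library. The `sketch_*` functions are hypothetical stand-ins, not vLLM's `get_loopback_ip`, `get_open_port`, or `get_distributed_init_method`, and the `tcp://host:port` shape is an assumption based on the usual `torch.distributed` TCP init-method convention.

```python
import socket


def sketch_loopback_ip() -> str:
    """Return the IPv4 loopback address (stand-in, IPv4 only)."""
    return "127.0.0.1"


def sketch_open_port() -> int:
    """Ask the OS for a currently free TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((sketch_loopback_ip(), 0))
        # The port is released when the socket closes, so another process
        # could claim it before it is reused; this is only a sketch.
        return sock.getsockname()[1]


def sketch_init_method(ip: str, port: int) -> str:
    """Format a TCP rendezvous URL (assumed format)."""
    return f"tcp://{ip}:{port}"


init_method = sketch_init_method(sketch_loopback_ip(), sketch_open_port())
print(init_method)
```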