OpenDAS / vllm_cscc
Commit 4ab3ac28
Authored Jun 27, 2025 by Michael Goin; committed by GitHub on Jun 27, 2025
[Bugfix] Fix flaky failure when getting DP ports (#20151)
Signed-off-by: mgoin <mgoin64@gmail.com>
Parent: d1c956dc
Changes: 1
Showing 1 changed file with 32 additions and 9 deletions
vllm/config.py (+32, -9)
@@ -1878,18 +1878,41 @@ class ParallelConfig:
         return answer

     def stateless_init_dp_group(self) -> "ProcessGroup":
+        # NOTE: In high-concurrency scenarios multiple processes
+        # can pick the same (currently free) port through a race
+        # condition when calling `get_open_port()`. When the first
+        # process binds the port the others will subsequently fail
+        # with `torch.distributed.DistNetworkError: EADDRINUSE`.
+        # To make the initialization more robust we retry a few times
+        # with a fresh port whenever this specific error is observed.
+        from torch.distributed import DistNetworkError
+
         from vllm.distributed.utils import (
             stateless_init_torch_distributed_process_group)

+        max_retries = 5
+        last_exc: Optional[Exception] = None
+        for _ in range(max_retries):
+            try:
                 # use gloo since the engine process might not have cuda device
-        dp_group = stateless_init_torch_distributed_process_group(
+                return stateless_init_torch_distributed_process_group(
                     self.data_parallel_master_ip,
                     self.get_next_dp_init_port(),
                     self.data_parallel_rank,
                     self.data_parallel_size,
                     backend="gloo")
-        return dp_group
+            except DistNetworkError as e:
+                # We only want to retry when the root cause is EADDRINUSE.
+                if "EADDRINUSE" in str(e):
+                    logger.warning(
+                        "Address already in use. Retrying with a new port.")
+                    last_exc = e
+                    continue  # try again with a new port
+                raise e
+
+        # If we get here all retries have failed.
+        assert last_exc is not None
+        raise last_exc

     @staticmethod
     def has_unfinished_dp(dp_group: "ProcessGroup",
...
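The race described in the NOTE comment, and the retry pattern the commit applies, can be reproduced with plain sockets. The following is a minimal sketch using only the Python standard library, not vLLM or torch code: `find_free_port` and `bind_with_retry` are hypothetical names introduced here for illustration, and `find_free_port` only approximates what a helper such as `get_open_port()` might do. The point it illustrates is that a port reported as free can be taken before the caller binds it, so the caller retries with a fresh port only when the failure is EADDRINUSE.

```python
import errno
import socket


def find_free_port() -> int:
    """Hypothetical helper: ask the OS for a free port, then release it.

    The port is only known to be free at this instant. Another process
    may bind it before the caller does, which is the race in question.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose
        return s.getsockname()[1]


def bind_with_retry(pick_port, max_retries: int = 5) -> socket.socket:
    """Bind a listening socket, retrying with a new port on EADDRINUSE only."""
    last_exc = None
    for _ in range(max_retries):
        port = pick_port()  # a fresh candidate on every attempt
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            sock.listen(1)
            return sock
        except OSError as e:
            sock.close()
            if e.errno == errno.EADDRINUSE:
                last_exc = e
                continue  # try again with a new port
            raise  # any other error is not retried
    # All attempts hit EADDRINUSE: surface the last failure.
    assert last_exc is not None
    raise last_exc
```

A caller would typically pass the port picker itself, for example `bind_with_retry(find_free_port)`, so that each attempt requests a new candidate instead of reusing the one that just failed. As in the commit, errors other than EADDRINUSE propagate immediately.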