OpenDAS / vllm_cscc
Commit 14f46a65, authored Apr 23, 2025 by lizhg1
Fix NCCL timeout. Cause: dual-thread scheduling shut down the remote worker prematurely, so the remote worker ran abnormally after warmup and rank0 timed out.
http://hpczentao.sugon.com/bug-view-93029.html
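
The ordering argument behind this fix can be sketched with a toy model. This is not the vLLM code: `ToyEngine`, `sem_s2m`, `remote_loop_active`, `events`, and `run_demo` are hypothetical names introduced for illustration, and only `sem_m2s`, `thread_running`, `finish_thread`, and `stop_remote_worker_execution_loop` echo identifiers from the diff. The sketch assumes a synchronous handoff between the main thread and the scheduling thread. It shows that when the scheduling thread itself stops the remote loop on shutdown, the stop is ordered after every step that thread has issued.

```python
import threading


class ToyEngine:
    """Toy stand-in for an engine with a separate scheduling thread."""

    def __init__(self):
        self.sem_m2s = threading.Semaphore(0)  # main -> scheduler signal
        self.sem_s2m = threading.Semaphore(0)  # scheduler -> main reply (assumed)
        self.thread_running = True
        self.remote_loop_active = True  # models the remote worker loop state
        self.events = []  # records the order of operations
        self._thread = threading.Thread(target=self._scheduler_loop, daemon=True)
        self._thread.start()

    def _stop_remote_worker_execution_loop(self):
        # Stand-in for model_executor.stop_remote_worker_execution_loop().
        self.remote_loop_active = False
        self.events.append("stop_remote_loop")

    def _scheduler_loop(self):
        while True:
            self.sem_m2s.acquire()
            if not self.thread_running:
                # Stop the remote loop here, in the thread that issues steps,
                # so the stop cannot overtake a step this thread is running.
                self._stop_remote_worker_execution_loop()
                break
            # A step that ran after the stop would be the failure case.
            self.events.append(
                "step" if self.remote_loop_active else "step_after_stop")
            self.sem_s2m.release()

    def step(self):
        # Hand one step to the scheduler and wait for it to complete.
        self.sem_m2s.release()
        self.sem_s2m.acquire()

    def finish_thread(self):
        # Ask the scheduler to shut down; it stops the remote loop itself.
        self.thread_running = False
        self.sem_m2s.release()
        self._thread.join()


def run_demo(num_steps=3):
    engine = ToyEngine()
    for _ in range(num_steps):
        engine.step()
    engine.finish_thread()
    return engine.events
```

In this model, `run_demo()` records every step before the stop event, and no step is ever tagged `step_after_stop`. The real engine may hand work between threads differently, so treat this only as an illustration of why the stop call was moved.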
parent ba2ca2db
Showing 1 changed file with 5 additions and 2 deletions
vllm/engine/llm_engine.py @ 14f46a65
...
...
@@ -1328,6 +1328,8 @@ class LLMEngine:
         while True:
             self.sem_m2s.acquire()
             if not self.thread_running:
+                logger.debug("Stopping remote worker execution loop.")
+                self.model_executor.stop_remote_worker_execution_loop()
                 break
             virtual_engine = 0
...
...
@@ -1438,8 +1440,9 @@ class LLMEngine:
             # torch.distributed ops which may otherwise timeout, and unblocks
             # the RPC thread in the workers so that they can process any other
             # queued control plane messages, such as add/remove lora adapters.
-            logger.debug("Stopping remote worker execution loop.")
-            self.model_executor.stop_remote_worker_execution_loop()
+            # logger.debug("Stopping remote worker execution loop.")
+            # self.model_executor.stop_remote_worker_execution_loop()
+            self.finish_thread()
         return ctx.request_outputs

     def step(self) -> List[Union[RequestOutput, PoolingRequestOutput]]:
...
...