Commit 14f46a65 authored by lizhg1

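Fix NCCL timeout error. Cause: two-thread scheduling shut down the remote worker prematurely; after warmup the remote worker then ran abnormally, which made rank 0 time out.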

http://hpczentao.sugon.com/bug-view-93029.html
parent ba2ca2db
@@ -1328,6 +1328,8 @@ class LLMEngine:
         while True:
             self.sem_m2s.acquire()
             if not self.thread_running:
+                logger.debug("Stopping remote worker execution loop.")
+                self.model_executor.stop_remote_worker_execution_loop()
                 break
             virtual_engine = 0
@@ -1438,8 +1440,9 @@ class LLMEngine:
             # torch.distributed ops which may otherwise timeout, and unblocks
             # the RPC thread in the workers so that they can process any other
             # queued control plane messages, such as add/remove lora adapters.
-            logger.debug("Stopping remote worker execution loop.")
-            self.model_executor.stop_remote_worker_execution_loop()
+            # logger.debug("Stopping remote worker execution loop.")
+            # self.model_executor.stop_remote_worker_execution_loop()
+            self.finish_thread()
         return ctx.request_outputs
     def step(self) -> List[Union[RequestOutput, PoolingRequestOutput]]:
......
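
The ordering this change relies on can be illustrated with a small standalone sketch. It is not the code from this repository: `FakeExecutor`, `Engine`, `submit`, `sem_s2m`, `run_demo`, and the body of `finish_thread` are hypothetical names and assumptions made for illustration; only `sem_m2s`, `thread_running`, and `stop_remote_worker_execution_loop` appear in the diff. The idea shown is that the stop call is issued by the same thread that executes model steps, so it cannot overtake a step that is still in flight.

```python
import threading


class FakeExecutor:
    """Hypothetical stand-in for the model executor; it only records calls."""

    def __init__(self):
        self.events = []

    def execute_model(self, batch):
        self.events.append(f"execute:{batch}")

    def stop_remote_worker_execution_loop(self):
        self.events.append("stop")


class Engine:
    """Toy engine with one step thread, loosely modelled on the diff above."""

    def __init__(self, executor):
        self.model_executor = executor
        self.sem_m2s = threading.Semaphore(0)  # main thread -> step thread
        self.sem_s2m = threading.Semaphore(0)  # step thread -> main (assumed name)
        self.thread_running = True
        self.pending = None
        self.thread = threading.Thread(target=self._step_loop, daemon=True)
        self.thread.start()

    def _step_loop(self):
        while True:
            self.sem_m2s.acquire()
            if not self.thread_running:
                # The stop is issued from the step thread itself, so it is
                # serialized after every execute_model call made by this loop.
                self.model_executor.stop_remote_worker_execution_loop()
                break
            self.model_executor.execute_model(self.pending)
            self.sem_s2m.release()

    def submit(self, batch):
        # Hypothetical helper: hand one batch to the step thread and wait.
        self.pending = batch
        self.sem_m2s.release()
        self.sem_s2m.acquire()

    def finish_thread(self):
        # Assumed behaviour: clear the flag, wake the loop, wait for it to exit.
        self.thread_running = False
        self.sem_m2s.release()
        self.thread.join()


def run_demo():
    executor = FakeExecutor()
    engine = Engine(executor)
    engine.submit("warmup")
    engine.submit("batch-1")
    engine.finish_thread()
    return executor.events
```

In this sketch the main thread never calls `stop_remote_worker_execution_loop()` directly; it only asks the step thread to finish, which matches the direction of the change in the second hunk.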