Unverified Commit fddbb84d authored by Yuewei Na's avatar Yuewei Na Committed by GitHub
Browse files

fix: snapshot children before process group kill to prevent GPU memory leaks (#7122)


Signed-off-by: default avatarYuewei Na <yna@nvidia.com>
Signed-off-by: default avatarYuewei Na <nv-yna@users.noreply.github.com>
Co-authored-by: default avatarYuewei Na <nv-yna@users.noreply.github.com>
parent e96f71ba
...@@ -260,14 +260,17 @@ class ManagedProcess: ...@@ -260,14 +260,17 @@ class ManagedProcess:
Termination Strategy (Graceful → Escalate to Force): Termination Strategy (Graceful → Escalate to Force):
===================================================== =====================================================
1. Send SIGTERM to process group immediately (no delay before SIGTERM) 1. Snapshot child processes (before killing parent, so we can find them)
2. Wait up to 2s (poll every 0.1s) for processes to exit 2. Send SIGTERM to process group immediately (no delay before SIGTERM)
3. If still alive after 2s: Send SIGKILL (force kill) 3. Wait up to 2s (poll every 0.1s) for processes to exit
4. Terminate individual processes (self.proc, tee, sed): 4. If still alive after 2s: Send SIGKILL (force kill)
5. Terminate individual processes (self.proc, tee, sed):
- Send SIGTERM immediately (no delay) - Send SIGTERM immediately (no delay)
- Wait up to 10s for exit - Wait up to 10s for exit
- If still alive after 10s: Send SIGKILL (force kill) - If still alive after 10s: Send SIGKILL (force kill)
5. Clean up straggler processes (if configured) 6. Kill any orphaned child processes that escaped the process group
(e.g. TRT-LLM engine workers started via MPI in separate PGIDs)
7. Clean up straggler processes (if configured)
Signal Details: Signal Details:
- SIGTERM (15): Graceful - allows cleanup handlers to run - SIGTERM (15): Graceful - allows cleanup handlers to run
...@@ -280,6 +283,19 @@ class ManagedProcess: ...@@ -280,6 +283,19 @@ class ManagedProcess:
This ALWAYS runs regardless of terminate_all_matching_process_names setting. This ALWAYS runs regardless of terminate_all_matching_process_names setting.
""" """
try: try:
# Snapshot all child processes BEFORE killing the parent.
# Some frameworks (e.g. TRT-LLM via MPI) spawn children in separate
# process groups. After the parent dies these children become orphans
# reparented to init, so psutil can no longer find them via the parent
# PID. We must capture them now while the parent is still alive.
orphan_candidates = []
if self.proc:
try:
parent = psutil.Process(self.proc.pid)
orphan_candidates = parent.children(recursive=True)
except psutil.NoSuchProcess:
pass
self._terminate_process_group() self._terminate_process_group()
process_list = [self.proc, self._tee_proc, self._sed_proc] process_list = [self.proc, self._tee_proc, self._sed_proc]
...@@ -294,6 +310,22 @@ class ManagedProcess: ...@@ -294,6 +310,22 @@ class ManagedProcess:
process.wait() process.wait()
except Exception as e: except Exception as e:
self._logger.warning("Error terminating process: %s", e) self._logger.warning("Error terminating process: %s", e)
# Kill any child processes that survived the process group kill.
# This catches children in different PGIDs (e.g. MPI workers, engine
# cores) that os.killpg() could not reach.
for child in orphan_candidates:
try:
if child.is_running():
self._logger.info(
"Killing orphaned child process: PID %s name=%s",
child.pid,
child.name(),
)
terminate_process_tree(child.pid, self._logger)
except (psutil.NoSuchProcess, psutil.AccessDenied):
pass
if self.data_dir: if self.data_dir:
self._remove_directory(self.data_dir) self._remove_directory(self.data_dir)
finally: finally:
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment