Unverified commit c617f988, authored by Yunxuan Xiao, committed by GitHub

Clean up staging tmp checkpoint directory (#28848)



clean up remaining tmp checkpoint dir
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
parent 136cd893
@@ -2468,6 +2468,10 @@ class Trainer:
                 # Solely rely on numerical checkpoint id for rotation.
                 # mtime is not reliable especially on some fuse fs in cloud environments.
                 self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
+        elif self.is_local_process_zero():
+            # Clean up the remaining staging checkpoint folders on other nodes
+            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
+                shutil.rmtree(staging_output_dir)
         self.args.distributed_state.wait_for_everyone()
......
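The effect of the added branch can be illustrated with a small standalone sketch. It is not the Trainer code: `finalize_checkpoint`, `demo`, the directory names, and the two boolean flags are hypothetical stand-ins for `is_world_process_zero()` / `is_local_process_zero()`, and the rename in the main-process branch is an assumption about what happens outside the lines shown in this hunk. Each simulated node gets its own directory, as if the nodes did not share a filesystem.

```python
import os
import shutil
import tempfile


def finalize_checkpoint(staging_dir, final_dir, is_main, is_local_main):
    """Hypothetical helper mirroring the if/elif shape of the patched code."""
    if is_main:
        # Assumption: the main process promotes the staging directory
        # to the final checkpoint location.
        if staging_dir != final_dir and os.path.exists(staging_dir):
            os.rename(staging_dir, final_dir)
    elif is_local_main:
        # The branch added by this commit: the first process on every other
        # node removes the staging directory that nobody will promote.
        if staging_dir != final_dir and os.path.exists(staging_dir):
            shutil.rmtree(staging_dir)


def demo():
    """Simulate two nodes and report (staging exists, final exists) per node."""
    root = tempfile.mkdtemp()
    roles = {"node0": (True, True), "node1": (False, True)}
    results = {}
    for node, (is_main, is_local_main) in roles.items():
        staging_dir = os.path.join(root, node, "tmp-checkpoint-10")
        final_dir = os.path.join(root, node, "checkpoint-10")
        os.makedirs(staging_dir)
        finalize_checkpoint(staging_dir, final_dir, is_main, is_local_main)
        results[node] = (os.path.exists(staging_dir), os.path.exists(final_dir))
    shutil.rmtree(root)
    return results


if __name__ == "__main__":
    print(demo())
```

In this sketch no staging directory survives on either node. Without the `elif` branch, `node1` would keep its `tmp-checkpoint-10` folder after every save.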