Unverified Commit 1c286be5 authored by dumpmemory's avatar dumpmemory Committed by GitHub
Browse files

Fix bug for checkpoint saving on multi-node training setting (#28078)

* add multi-node training setting

* fix style
parent dec84b32
@@ -2386,7 +2386,9 @@ class Trainer:
         self.args.distributed_state.wait_for_everyone()
         # Then go through the rewriting process starting on process 0
         if staging_output_dir != output_dir:
-            with self.args.main_process_first(desc="Renaming model checkpoint folder to true location"):
+            with self.args.main_process_first(
+                desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
+            ):
                 if os.path.exists(staging_output_dir):
                     os.rename(staging_output_dir, output_dir)
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or sign in to comment