Unverified Commit da005253 authored by Sylvain Gugger, committed by GitHub

Proper map location for optimizer load (#22273)

* Proper map location for optimizer load

* What happened to my code?
parent 786092a3
@@ -2433,8 +2433,12 @@ class Trainer:
                         self.model_wrapped.register_post_step_hook(opt_load_hook)
                 else:
+                    # We use the CPU when training on one GPU to avoid OOM for GPU RAM when training big models.
+                    # In distributed training however, we load directly on each GPU and risk the GPU OOM as it's more
+                    # likely to get OOM on CPU (since we load num_gpu times the optimizer state).
+                    map_location = self.args.device if self.args.world_size > 1 else "cpu"
                     self.optimizer.load_state_dict(
-                        torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location="cpu")
+                        torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
                     )
                 with warnings.catch_warnings(record=True) as caught_warnings:
                     self.lr_scheduler.load_state_dict(torch.load(os.path.join(checkpoint, SCHEDULER_NAME)))
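For illustration, a minimal standalone sketch of the same map_location logic outside the Trainer class. The function name, parameters, and the "optimizer.pt" filename below are assumptions chosen for the example; they are not part of this commit, which only changes how `map_location` is picked inside `Trainer`.

    import os
    import torch

    def load_optimizer_state(optimizer, checkpoint_dir, device, world_size):
        # Single-GPU training: deserialize onto CPU first so the loaded state
        # does not compete with the model for GPU RAM on big models.
        # Distributed training: load directly onto each process's GPU, since
        # holding world_size copies of the optimizer state in CPU RAM is the
        # more likely source of OOM.
        map_location = device if world_size > 1 else "cpu"
        state = torch.load(os.path.join(checkpoint_dir, "optimizer.pt"), map_location=map_location)
        optimizer.load_state_dict(state)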