Neuron: When save_safetensors=False, no need to move model to CPU (#29703)

save_safetensors=True has been the default since release 4.35.0, which required the TPU hotfix https://github.com/huggingface/transformers/pull/27799 (issue https://github.com/huggingface/transformers/issues/27578). However, when the save_safetensors flag is set to False (compatibility mode), moving the model to CPU causes too many graphs to be generated during checkpointing (https://github.com/huggingface/transformers/issues/28438). This PR disables moving the model to CPU when save_safetensors=False.
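
The control flow this change introduces can be illustrated with a minimal sketch. It is not the Trainer code itself: RecordingModel, save_checkpoint_sketch, write_weights, and moves_for are hypothetical names introduced only for this illustration, and the "xla" device string is an assumed placeholder. The sketch only shows that the CPU round trip happens when save_safetensors is True and is skipped otherwise.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecordingModel:
    """Hypothetical stand-in for a model; records every device move."""

    moves: List[str] = field(default_factory=list)

    def to(self, device: str) -> "RecordingModel":
        self.moves.append(device)
        return self


def save_checkpoint_sketch(
    model: RecordingModel,
    save_safetensors: bool,
    device: str,
    write_weights: Callable[[RecordingModel], None],
) -> None:
    # Move to CPU only when the safetensors path is in use (mirrors the diff).
    if save_safetensors:
        model.to("cpu")

    # Placeholder for the actual weight-writing step.
    write_weights(model)

    # Move back to the accelerator only if the model was moved away from it.
    if save_safetensors:
        model.to(device)


def moves_for(save_safetensors: bool) -> List[str]:
    """Return the device moves performed for a given flag value."""
    model = RecordingModel()
    save_checkpoint_sketch(model, save_safetensors, "xla", lambda m: None)
    return model.moves
```

Calling moves_for(False) records no device moves, while moves_for(True) records a move to "cpu" followed by a move back to the assumed "xla" device.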

Unverified Commit d1d94d79 authored Apr 24, 2024 by jeffhataws Committed by GitHub Apr 24, 2024

Showing with 4 additions and 2 deletions

src/transformers/trainer.py +4 -2

--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -3267,7 +3267,8 @@ class Trainer:
        logger.info(f"Saving model checkpoint to {output_dir}")
        model = self.model
        xm.mark_step()
-        model.to("cpu")
+        if self.args.save_safetensors:
+            model.to("cpu")

        if xm.is_master_ordinal():
            os.makedirs(output_dir, exist_ok=True)
@@ -3302,7 +3303,8 @@ class Trainer:

        # We moved the model from TPU -> CPU for saving the weights.
        # Now we should move it back to subsequent compute still works.
-        model.to(self.args.device)
+        if self.args.save_safetensors:
+            model.to(self.args.device)

    def _save(self, output_dir: Optional[str] = None, state_dict=None):
        # If we are executing this function, we are the process zero, so we don't check for that.