Commit d1d94d79 authored by jeffhataws, committed by GitHub

Neuron: When save_safetensor=False, no need to move model to CPU (#29703)

save_safetensors=True has been the default since release 4.35.0, which then
required the TPU hotfix https://github.com/huggingface/transformers/pull/27799
(issue https://github.com/huggingface/transformers/issues/27578).
However, when save_safetensors is set to False (compatibility mode),
moving the model to CPU causes too many graphs to be generated
during checkpointing (https://github.com/huggingface/transformers/issues/28438).
This PR disables moving the model to CPU when save_safetensors=False.
parent 661190b4
@@ -3267,6 +3267,7 @@ class Trainer:
         logger.info(f"Saving model checkpoint to {output_dir}")
         model = self.model
         xm.mark_step()
+        if self.args.save_safetensors:
             model.to("cpu")

         if xm.is_master_ordinal():
@@ -3302,6 +3303,7 @@ class Trainer:

         # We moved the model from TPU -> CPU for saving the weights.
         # Now we should move it back to subsequent compute still works.
+        if self.args.save_safetensors:
             model.to(self.args.device)

     def _save(self, output_dir: Optional[str] = None, state_dict=None):
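
The guarded device move in the diff above can be illustrated with a small, library-free sketch. `FakeModel` and `save_checkpoint` are hypothetical stand-ins introduced only for illustration; they are not part of transformers or torch_xla, and the real `Trainer` code additionally calls `xm.mark_step()` and writes the weights.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeModel:
    """Hypothetical stand-in for a model; it only records device moves."""
    device: str = "xla"
    moves: List[str] = field(default_factory=list)

    def to(self, device: str) -> "FakeModel":
        self.device = device
        self.moves.append(device)
        return self


def save_checkpoint(model: FakeModel, save_safetensors: bool, device: str = "xla") -> List[str]:
    """Hypothetical helper mirroring the control flow of the patched save path."""
    if save_safetensors:
        # The safetensors path needs the weights on CPU.
        model.to("cpu")

    # ... the weights would be written to disk here ...

    if save_safetensors:
        # Move the model back so subsequent compute still works.
        model.to(device)
    return model.moves


moves_with_safetensors = save_checkpoint(FakeModel(), save_safetensors=True)
moves_without_safetensors = save_checkpoint(FakeModel(), save_safetensors=False)
```

With `save_safetensors=False` the sketch performs no device moves at all, which is the behavior this commit introduces to avoid the extra graph generation described in the commit message.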