"docs/source/vscode:/vscode.git/clone" did not exist on "27e907386ad108b3fc6d5a9fc77aac8ba474592b"
Unverified Commit e54a1b49, authored by Atharva Ingle, committed by GitHub

`model.tie_weights()` should be applied after `accelerator.prepare()` (#18676)

* `model.tie_weights()` should be applied after `accelerator.prepare`

Weight tying should be done after the model has been moved to the XLA device, as noted in the PyTorch/XLA troubleshooting guide [here](https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#xla-tensor-quirks). A minimal sketch of the corrected ordering follows this list.

* format code
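
For context, the sketch below shows the corrected ordering in isolation: `accelerator.prepare()` first, then `tie_weights()`. The tiny GPT-2 checkpoint, dummy data, and hyperparameters are illustrative stand-ins, not the objects used in the example script itself.

```python
# Minimal sketch of the corrected ordering; model, data, and hyperparameters
# here are illustrative stand-ins, not the ones from the example script.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DistributedType
from transformers import AutoModelForCausalLM

accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy batches so the sketch is self-contained.
input_ids = torch.randint(0, model.config.vocab_size, (8, 16))
train_dataloader = DataLoader(TensorDataset(input_ids), batch_size=4)

# prepare() moves the model to the target device (the XLA device when running on TPU).
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

# On TPU the device move disconnects the tied input/output embeddings,
# so the ties must be restored only *after* prepare().
if accelerator.distributed_type == DistributedType.TPU:
    model.tie_weights()
```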
parent bbbb453e
@@ -518,10 +518,6 @@ def main():
     ]
     optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.learning_rate)
-    # On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
-    if accelerator.distributed_type == DistributedType.TPU:
-        model.tie_weights()
     # Note -> the training dataloader needs to be prepared before we grab his length below (cause its length will be
     # shorter in multiprocess)
@@ -544,6 +540,10 @@ def main():
         model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
     )
+    # On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
+    if accelerator.distributed_type == DistributedType.TPU:
+        model.tie_weights()
     # We need to recalculate our total training steps as the size of the training dataloader may have changed.
     num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
     if overrode_max_train_steps: