Fix TPU Convergence bug introduced by PR#6151 (#6488)

With the bug introduced by PR#6151, the TPU path takes two optimizer steps per batch: a global one, in which `xm.optimizer_step` injects a cross-replica sum (CRS) across all training cores, and a second one without it. This has been hurting training accuracy (for example, XLNet on GLUE MNLI does not converge).
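
The control-flow difference can be sketched without TPU hardware. The snippet below is only an illustration, not the `Trainer` code: `buggy_calls` and `fixed_calls` are hypothetical helpers that record which step call would run for one batch, and the final `else` branch is assumed to be a plain `optimizer.step()` since the diff is cut off at that point.

```python
from typing import List


def buggy_calls(on_tpu: bool, native_amp: bool) -> List[str]:
    """Mimic the if / if / else chain from before the fix."""
    calls = []
    if on_tpu:
        calls.append("xm.optimizer_step")
    if native_amp:  # independent `if`: still evaluated on TPU
        calls.append("scaler.step")
    else:
        # Assumption: the truncated else branch is a plain optimizer step.
        calls.append("optimizer.step")
    return calls


def fixed_calls(on_tpu: bool, native_amp: bool) -> List[str]:
    """Mimic the if / elif / else chain after the fix."""
    calls = []
    if on_tpu:
        calls.append("xm.optimizer_step")
    elif native_amp:  # only reached when not on TPU
        calls.append("scaler.step")
    else:
        calls.append("optimizer.step")
    return calls


if __name__ == "__main__":
    # On TPU without native AMP, the old chain steps twice per batch.
    print(buggy_calls(on_tpu=True, native_amp=False))
    print(fixed_calls(on_tpu=True, native_amp=False))
```

Off TPU the two chains behave identically, which is consistent with the regression showing up only in TPU training.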

24107c2c · Jin Young (Daniel) Sohn · GitHub · 895ed8f4 · 24107c2c
Commit 24107c2c (unverified), authored Aug 14, 2020 by Jin Young (Daniel) Sohn; committed by GitHub Aug 14, 2020

Showing with 1 addition and 1 deletion

src/transformers/trainer.py +1 -1

--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -572,7 +572,7 @@ class Trainer:

                    if is_torch_tpu_available():
                        xm.optimizer_step(self.optimizer)
-                    if self.args.fp16 and _use_native_amp:
+                    elif self.args.fp16 and _use_native_amp:
                        self.scaler.step(self.optimizer)
                        self.scaler.update()
                    else: