Fix TPU Convergence bug introduced by PR#6151 (#6488)
Because of the bug introduced by that PR, we currently take two optimizer steps per batch: a global one, in which `xm.optimizer_step` injects a CRS (cross-replica sum) across all cores in training, and a second one without it. This has been hurting training accuracy; for example, XLNet on the GLUE MNLI task does not converge.
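
The double step can be illustrated with a toy, pure-Python sketch. It is not the actual Trainer code: `CountingSGD`, `synced_step`, and `train_batch` are hypothetical names, and `synced_step` only approximates a cross-replica step by averaging per-core gradients before a single update.

```python
class CountingSGD:
    """Toy scalar optimizer (hypothetical stand-in, not torch.optim)."""

    def __init__(self, weight, lr):
        self.weight = weight
        self.lr = lr
        self.grad = 0.0
        self.steps = 0  # number of updates applied so far

    def step(self):
        # Plain SGD update using whatever gradient is currently stored.
        self.weight -= self.lr * self.grad
        self.steps += 1


def synced_step(optimizer, replica_grads):
    """Assumed stand-in for a cross-replica step: combine the per-core
    gradients (averaged here), then update exactly once."""
    optimizer.grad = sum(replica_grads) / len(replica_grads)
    optimizer.step()


def train_batch(optimizer, replica_grads, buggy):
    """Process one batch; `buggy=True` mimics the extra optimizer step."""
    synced_step(optimizer, replica_grads)
    if buggy:
        # Second update for the same batch, with no new synchronization.
        optimizer.step()


fixed = CountingSGD(weight=1.0, lr=0.1)
train_batch(fixed, [0.5, 1.5], buggy=False)

broken = CountingSGD(weight=1.0, lr=0.1)
train_batch(broken, [0.5, 1.5], buggy=True)

print(fixed.steps, broken.steps)
```

In this sketch the buggy path moves the weight twice as far per batch as the fixed path, which is the kind of drift that can keep a model from converging.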