    Fix all-reduce for new versions of PyTorch · d7d82715
    Myle Ott authored
    We previously assumed that once a model parameter's gradient buffer was allocated, it stayed fixed during training.
    However, recent versions of PyTorch may reallocate the gradient buffer during training, so this is no longer a safe
    assumption to make.
    
    This is primarily relevant when we do the all-reduce, since we all-reduce a flattened (i.e., contiguous) copy of the
    gradients. We can make this more robust by copying the result of the all-reduce back into the model parameters'
    gradient buffers after each update, as sketched below. Intra-device copies are cheap, so this doesn't affect performance.
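
    A minimal sketch of the idea, not the actual fairseq code: the helper name `all_reduce_and_copy_back_grads` is
    hypothetical, and it assumes `torch.distributed` has already been initialized. It all-reduces a flattened copy of
    the gradients and then copies the result back into each parameter's .grad buffer instead of assuming the flattened
    buffer still aliases the original storage.

        import torch
        import torch.distributed as dist

        def all_reduce_and_copy_back_grads(params):
            # Hypothetical helper, for illustration only.
            grads = [p.grad.data for p in params if p.grad is not None]

            # Flatten into one contiguous buffer so a single all-reduce suffices.
            flat = torch.cat([g.contiguous().view(-1) for g in grads])
            dist.all_reduce(flat)

            # Copy the reduced values back into the (possibly reallocated)
            # per-parameter gradient buffers; intra-device copies are cheap.
            offset = 0
            for g in grads:
                numel = g.numel()
                g.copy_(flat[offset:offset + numel].view_as(g))
                offset += numel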
multiprocessing_trainer.py