Don't project unmasked tokens for MLM loss (#859)
Summary: This saves ~4-5 GB of GPU memory while training RoBERTa-large with `seq_len=512`. I am able to fit `--max-sentences=16` on `volta32gb` for `roberta-large`.

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/859

Differential Revision: D17435814

fbshipit-source-id: 2663909768fac0ef0102107613770ee01b1f8c00
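The saving comes from applying the vocabulary projection only to the masked positions instead of the full sequence, so the large `(batch, seq_len, vocab_size)` logits tensor is never materialized. A minimal sketch of the idea in PyTorch; the names `masked_lm_loss`, `features`, `lm_head`, and `masked_tokens` are illustrative, not the actual fairseq code:

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(features, lm_head, targets, masked_tokens):
    """Compute the MLM cross-entropy on masked positions only.

    features:      (batch, seq_len, dim) encoder output
    lm_head:       nn.Linear(dim, vocab_size) vocabulary projection
    targets:       (batch, seq_len) gold token ids
    masked_tokens: (batch, seq_len) bool tensor marking the MLM positions
    """
    # Gather only the masked hidden states: (num_masked, dim).
    # Projecting this small tensor instead of the whole sequence avoids
    # allocating a (batch, seq_len, vocab_size) logits tensor, which is
    # where the GPU memory saving comes from.
    masked_features = features[masked_tokens]
    logits = lm_head(masked_features)  # (num_masked, vocab_size)
    return F.cross_entropy(logits, targets[masked_tokens])
```

With the usual ~15% masking rate, only about one token in seven needs the vocabulary projection, which is consistent with the several-GB saving reported above.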