    added multilingual masked LM training (#849) · 32335404
    Naman Goyal authored
    Summary:
    The multilingual RoBERTa training is working with aconneau's XLM data.
    
    Two pieces remaining:
    
    1) `XLM` limits each batch to samples from the same language. I am not 100% sure of the reason for that, but it should be easy to implement: we can add a `batch_by_size_and_language` function in place of the default `batch_by_size`. If it's not critical, I would prefer to leave it out, since that keeps the code clean and simple.
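    For illustration, a homogeneous-language batching variant could look roughly like this. This is a hypothetical sketch, not the actual fairseq `batch_by_size` implementation; `languages`, `num_tokens`, and `max_tokens` are assumed inputs (a per-index language id, a per-index token-count function, and a token budget per batch):

    ```python
    from collections import defaultdict

    def batch_by_size_and_language(indices, languages, num_tokens, max_tokens):
        """Group sample indices by language, then form batches within each
        group, so every batch contains samples from a single language and
        the total token count per batch stays within max_tokens.
        Hypothetical sketch; not the fairseq API."""
        by_lang = defaultdict(list)
        for idx in indices:
            by_lang[languages[idx]].append(idx)

        batches = []
        for lang_indices in by_lang.values():
            batch, tokens = [], 0
            for idx in lang_indices:
                n = num_tokens(idx)
                # start a new batch when adding this sample would overflow
                if batch and tokens + n > max_tokens:
                    batches.append(batch)
                    batch, tokens = [], 0
                batch.append(idx)
                tokens += n
            if batch:
                batches.append(batch)
        return batches
    ```

    The key difference from plain `batch_by_size` is only the initial per-language grouping; within each group the usual token-budget batching applies unchanged.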
    
    2) `sample_ratio` in `ConcatDataset` works with `int` values by tiling the datasets according to the ratio. Currently I handle this by rounding the ratio to the first decimal place and then multiplying by `10`. We can see whether such simple heuristics are good enough; there are other options (we can talk about them offline).
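    The rounding heuristic in (2) can be sketched as follows. `ratios_to_tile_counts` is a hypothetical helper name, not part of fairseq; the `max(1, ...)` guard (keeping at least one copy of each dataset) is also an assumption of this sketch:

    ```python
    def ratios_to_tile_counts(ratios):
        """Convert float sampling ratios to integer tile counts by rounding
        each ratio to the first decimal place and scaling by 10, so that
        ConcatDataset can tile each dataset an integer number of times.
        Hypothetical sketch of the heuristic described above."""
        return [max(1, round(round(r, 1) * 10)) for r in ratios]

    # e.g. ratios of 0.34 and 1.0 become tile counts of 3 and 10
    print(ratios_to_tile_counts([0.34, 1.0]))
    ```

    The obvious cost is quantization: any ratio finer than one decimal place (say 0.25) is distorted, which is why finer-grained alternatives may be worth discussing.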
    Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/849
    
    Differential Revision: D17162460
    
    fbshipit-source-id: d967f3d872f7a1f0aa4ea418bd362b68af9e432f