    added multilingual masked LM training (#849) · 32335404
    Naman Goyal authored
    Summary:
    The multilingual RoBERTa training is working with aconneau's XLM data.
    
    Two pieces remaining:
    
    1) `XLM` limits each batch to samples from the same language. I am not 100% sure of the reason for that, but it should be easy to implement: we can add a `batch_by_size_and_language` function in place of the default `batch_by_size`. If it's not critical, I would prefer to leave it out, since that keeps the code clean and simple.
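    For illustration, a homogeneous-language batching variant could look roughly like this. This is a hypothetical sketch, not the actual fairseq `batch_by_size` implementation; `languages`, `num_tokens`, and `max_tokens` are assumed inputs (a per-index language id, a per-index token-count function, and a token budget per batch):

    ```python
    from collections import defaultdict

    def batch_by_size_and_language(indices, languages, num_tokens, max_tokens):
        """Group sample indices by language, then form batches within each
        group, so every batch contains samples from a single language and
        the total token count per batch stays within max_tokens.
        Hypothetical sketch; not the fairseq API."""
        by_lang = defaultdict(list)
        for idx in indices:
            by_lang[languages[idx]].append(idx)

        batches = []
        for lang_indices in by_lang.values():
            batch, tokens = [], 0
            for idx in lang_indices:
                n = num_tokens(idx)
                # start a new batch when adding this sample would overflow
                if batch and tokens + n > max_tokens:
                    batches.append(batch)
                    batch, tokens = [], 0
                batch.append(idx)
                tokens += n
            if batch:
                batches.append(batch)
        return batches
    ```

    The key difference from plain `batch_by_size` is only the initial per-language grouping; within each group the usual token-budget batching applies unchanged.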
    
    2) `sample_ratio` in `ConcatDataset` works with `int` values by tiling the datasets according to the ratio. Currently I handle this by rounding the ratio to the first decimal place and then multiplying by `10`. We can see whether such simple heuristics are good enough; there are other options (we can talk about them offline).
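    The rounding heuristic in (2) can be sketched as follows. `ratios_to_tile_counts` is a hypothetical helper name, not part of fairseq; the `max(1, ...)` guard (keeping at least one copy of each dataset) is also an assumption of this sketch:

    ```python
    def ratios_to_tile_counts(ratios):
        """Convert float sampling ratios to integer tile counts by rounding
        each ratio to the first decimal place and scaling by 10, so that
        ConcatDataset can tile each dataset an integer number of times.
        Hypothetical sketch of the heuristic described above."""
        return [max(1, round(round(r, 1) * 10)) for r in ratios]

    # e.g. ratios of 0.34 and 1.0 become tile counts of 3 and 10
    print(ratios_to_tile_counts([0.34, 1.0]))
    ```

    The obvious cost is quantization: any ratio finer than one decimal place (say 0.25) is distorted, which is why finer-grained alternatives may be worth discussing.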
    Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/849
    
    Differential Revision: D17162460
    
    fbshipit-source-id: d967f3d872f7a1f0aa4ea418bd362b68af9e432f