    allowing sharded dataset (#696) · 0add50c2
    Naman Goyal authored
    
    
    Summary:
    Co-authored-by: myleott <myleott@fb.com>
    
    Changes `data` to be a `str` holding a colon-separated list, for loading sharded datasets. This is useful for large datasets that cannot fit into memory: the dataset can be split into shards, and one shard is then loaded per epoch in round-robin order.
    
    For example, with `5` shards of data and `10` epochs, the shards are visited in the order `[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]`.
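
The round-robin selection described above can be sketched as follows. This is a minimal illustration, not the code from this change: the helper names `split_shards` and `shard_for_epoch` are hypothetical, the entries of `data` are assumed to be shard paths, and epochs are assumed to be counted from zero.

```python
from typing import List


def split_shards(data: str) -> List[str]:
    """Split a colon-separated `data` string into its entries.

    Hypothetical helper; assumes each entry is the path of one shard.
    """
    return data.split(":")


def shard_for_epoch(data: str, epoch: int) -> str:
    """Return the shard to load for a zero-based `epoch`, round-robin."""
    shards = split_shards(data)
    # The modulo wraps back to the first shard after the last one is used.
    return shards[epoch % len(shards)]


if __name__ == "__main__":
    # Illustrative paths only; any colon-separated string works.
    data = "shard0:shard1:shard2:shard3:shard4"
    order = [split_shards(data).index(shard_for_epoch(data, e)) for e in range(10)]
    print(order)
```

With five entries and ten epochs, `order` reproduces the sequence given in the example. A single path with no colon degrades to one shard that is reused every epoch.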
    
    @myleott: we need to look into `translation.py`, as it currently expects a list and then concatenates the datasets.
    Pull Request resolved: https://github.com/pytorch/fairseq/pull/696
    
    Differential Revision: D15214049
    
    fbshipit-source-id: 03e43a7b69c7aefada2ca668abf1eac1969fe013
test_train.py 4.01 KB