allowing sharded dataset (#696)
Summary: Co-authored-by:myleott <myleott@fb.com> Changing `data` to be `str` with colon separated list for loading sharded datasets. This change is useful for loading large datasets that cannot fit into, memory. The large dataset can be sharded and then each shard is loaded in one epoch in roudrobin manner. For example, if there are `5` shards of data and `10` epochs then the shards will be iterated upon `[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]`. myleott We need to look into `translation.py` as it currently already expects a list and then concats the datasets. Pull Request resolved: https://github.com/pytorch/fairseq/pull/696 Differential Revision: D15214049 fbshipit-source-id: 03e43a7b69c7aefada2ca668abf1eac1969fe013
Showing
Please register or sign in to comment