Commit 8b514b9f authored by Liang Wang, committed by Facebook Github Bot

Fix dataset loading when there are multiple valid subsets (#835)

Summary:
When we have multiple valid subsets, say `valid`, `valid1` and `valid2`, and `combine=True` holds, loading the `valid` subset will try to locate and load `valid`, `valid1`, `valid2`... and then combine them into one dataset. Setting `combine` to `False` solves this issue.

In my experiment, I have 3 valid subsets with 3000, 5000 and 7801 examples. With the argument `--valid-subset valid,valid1,valid2`, the log is as follows:

```
......
| ./mix_data/bin valid src-trg 3000 examples
| ./mix_data/bin valid1 src-trg 5000 examples
| ./mix_data/bin valid2 src-trg 7801 examples
| ./mix_data/bin valid1 src-trg 5000 examples
| ./mix_data/bin valid2 src-trg 7801 examples
......
```

As shown above, `valid1` and `valid2` subsets are incorrectly loaded twice.
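
The duplicate loading can be reproduced with a minimal sketch. This is a simplified model of the behaviour described in the summary, not fairseq's actual implementation: `AVAILABLE`, `load_dataset` and `load_all` are hypothetical names introduced only for illustration, and the example counts are taken from the log above.

```python
import itertools

# Hypothetical stand-in for the binarized files on disk:
# subset name -> number of examples (counts taken from the log).
AVAILABLE = {"valid": 3000, "valid1": 5000, "valid2": 7801}


def load_dataset(split, combine, available=AVAILABLE):
    """Return the names of the subsets that would be read for `split`.

    Assumption: with combine=True the loader keeps probing
    split, split1, split2, ... until one of them is missing.
    With combine=False it stops after the first subset.
    """
    loaded = []
    for k in itertools.count():
        name = split + (str(k) if k > 0 else "")
        if name not in available:
            break
        loaded.append(name)
        if not combine:
            break
    return loaded


def load_all(valid_subset, combine):
    """Mimic the loop over `--valid-subset` and record every read."""
    reads = []
    for split in valid_subset.split(","):
        reads.extend(load_dataset(split, combine=combine))
    return reads


if __name__ == "__main__":
    for combine in (True, False):
        print(combine, load_all("valid,valid1,valid2", combine))
```

Under these assumptions, `combine=True` reads `valid1` and `valid2` twice (once while expanding `valid`, once for their own entries), while `combine=False` reads each subset exactly once.
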
Pull Request resolved: https://github.com/pytorch/fairseq/pull/835

Differential Revision: D16006343

Pulled By: myleott

fbshipit-source-id: ece7fee3a00f97a6b3409defbf7f7ffaf0a54fdc
parent ab2fa185
@@ -43,7 +43,7 @@ def main(args, init_distributed=False):
     # Load valid dataset (we load training data below, based on the latest checkpoint)
     for valid_sub_split in args.valid_subset.split(','):
-        task.load_dataset(valid_sub_split, combine=True, epoch=0)
+        task.load_dataset(valid_sub_split, combine=False, epoch=0)
     # Build model and criterion
     model = task.build_model(args)
......