    Fix dataset loading when there are multiple valid subsets (#835) · 8b514b9f
    Liang Wang authored
    Summary:
    When there are multiple valid subsets, say `valid`, `valid1` and `valid2`, and `combine=True` holds, loading the `valid` subset tries to locate and load `valid`, `valid1`, `valid2`, ... and then combines them into one dataset. Setting `combine` to `False` solves this issue.
    
    In my experiment, I have 3 valid subsets with 3000, 5000 and 7801 examples. With the argument `--valid-subset valid,valid1,valid2`, the log is as follows:
    
    ```
    ......
    | ./mix_data/bin valid src-trg 3000 examples
    | ./mix_data/bin valid1 src-trg 5000 examples
    | ./mix_data/bin valid2 src-trg 7801 examples
    | ./mix_data/bin valid1 src-trg 5000 examples
    | ./mix_data/bin valid2 src-trg 7801 examples
    ......
    ```
    
    As shown above, `valid1` and `valid2` subsets are incorrectly loaded twice.
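
The duplication can be reproduced with a minimal sketch. It assumes that, with `combine=True`, the loader keeps appending a numeric suffix to the requested split name until no matching shard is found. The helper `resolve_shards` and the `available` set are hypothetical and exist only for illustration; they are not fairseq APIs.

```python
from itertools import count


def resolve_shards(available, split, combine):
    """Return the shard names that would be loaded for `split`.

    Hypothetical stand-in for the loader: with combine=True it probes
    split, split1, split2, ... and stops at the first missing name;
    with combine=False it loads only the exact split name.
    """
    loaded = []
    for k in count():
        name = split + (str(k) if k > 0 else "")
        if name not in available:
            break
        loaded.append(name)
        if not combine:
            break
    return loaded


# Assumed on-disk shards, mirroring the subsets named in the log above.
available = {"valid", "valid1", "valid2"}
subsets = ["valid", "valid1", "valid2"]  # as in --valid-subset valid,valid1,valid2

with_combine = [s for sub in subsets for s in resolve_shards(available, sub, True)]
without_combine = [s for sub in subsets for s in resolve_shards(available, sub, False)]

print(with_combine)
print(without_combine)
```

Under this assumption, loading `valid` with `combine=True` already pulls in `valid1` and `valid2`, so loading them again as separate subsets produces the repeated log lines, while `combine=False` loads each subset exactly once.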
    Pull Request resolved: https://github.com/pytorch/fairseq/pull/835
    
    Differential Revision: D16006343
    
    Pulled By: myleott
    
    fbshipit-source-id: ece7fee3a00f97a6b3409defbf7f7ffaf0a54fdc