Support dataset upsampling / relative ratio in PytorchTranslateTask (#494)

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/494
Pull Request resolved: https://github.com/pytorch/fairseq/pull/657

Library side change split from D14924942.

Added 2 arguments for load_dataset in PytorchTranslateTask:

1. dataset_upsampling. A nested dictionary {direction: {dataset: upsampling_ratio}}. An upsampling_ratio larger than one means that the bitext is observed more often than it is actually present in the combined bitext and synthetic training corpus.
2. dataset_relative_ratio. A tuple (dataset, ratio). The ratio represents how often a certain dataset gets sampled relative to the rest of the corpora map.

At most one of them can be specified.

Reviewed By: liezl200
Differential Revision: D15041293
fbshipit-source-id: 92daad29895c234e26d1b19f121106118a3957ad
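
To make the shapes of the two arguments in the summary concrete, here is a minimal sketch. The direction and dataset names ("en-de", "bitext", "synthetic"), the numeric values, and the helper `check_sampling_args` are all hypothetical and do not come from the pull request; only the dictionary/tuple layout and the "at most one" rule follow the summary.

```python
# Hypothetical example values; names and numbers are made up for illustration.
# Layout from the summary: {direction: {dataset: upsampling_ratio}}
dataset_upsampling = {"en-de": {"bitext": 2.0, "synthetic": 1.0}}

# Layout from the summary: (dataset, ratio)
dataset_relative_ratio = ("bitext", 0.5)


def check_sampling_args(dataset_upsampling=None, dataset_relative_ratio=None):
    """Hypothetical helper: enforce that at most one argument is given."""
    if dataset_upsampling is not None and dataset_relative_ratio is not None:
        raise ValueError(
            "Specify at most one of dataset_upsampling and dataset_relative_ratio"
        )
    return True


# Passing either argument alone (or neither) is accepted.
check_sampling_args(dataset_upsampling=dataset_upsampling)
check_sampling_args(dataset_relative_ratio=dataset_relative_ratio)
```

Passing both arguments at once raises `ValueError` in this sketch; how the actual task reports the conflict is not shown in this commit.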

Commit ff74ca94 authored May 01, 2019 by Ning Dong Committed by Facebook Github Bot May 01, 2019

Showing with 4 additions and 1 deletion

fairseq/data/data_utils.py +4 -1
--- a/fairseq/data/data_utils.py
+++ b/fairseq/data/data_utils.py
@@ -8,7 +8,7 @@
 import contextlib
 import os
 import numpy as np
-
+from collections import Iterable

 def infer_language_pair(path):
    """Infer language pair from filename: <split>.<lang1>-<lang2>.(...).idx"""
@@ -96,6 +96,9 @@ def filter_by_size(indices, size_fn, max_positions, raise_exception=False):
                for key in intersect_keys
            )
        else:
+            # For MultiCorpusSampledDataset, will generalize it later
+            if not isinstance(size_fn(idx), Iterable):
+                return all(size_fn(idx) <= b for b in max_positions)
            return all(a is None or b is None or a <= b
                       for a, b in zip(size_fn(idx), max_positions))
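
The branch added above can be read in isolation as follows. This is a standalone sketch, not the fairseq function itself: `fits_max_positions` is a hypothetical name, and `size` stands in for the result of `size_fn(idx)`. It imports `Iterable` from `collections.abc`, since the `collections.Iterable` alias used in the diff was removed in Python 3.10.

```python
from collections.abc import Iterable


def fits_max_positions(size, max_positions):
    """Sketch of the tuple/list branch of check_size shown in the diff.

    `size` plays the role of size_fn(idx); the function name is hypothetical.
    """
    if not isinstance(size, Iterable):
        # Scalar size (the MultiCorpusSampledDataset case in the diff):
        # the single value must not exceed any of the limits.
        return all(size <= b for b in max_positions)
    # Per-side sizes: compare pairwise, treating None as "no limit".
    return all(a is None or b is None or a <= b
               for a, b in zip(size, max_positions))


print(fits_max_positions(10, (1024, 1024)))          # True
print(fits_max_positions((10, 2000), (1024, 1024)))  # False
```

Note that, as written in the diff, the scalar branch does not skip `None` limits the way the pairwise branch does, so a `None` entry in `max_positions` would raise a `TypeError` there under Python 3.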