Commit ff74ca94 authored by Ning Dong, committed by Facebook Github Bot

Support dataset upsampling / relative ratio in PytorchTranslateTask (#494)

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/494

Pull Request resolved: https://github.com/pytorch/fairseq/pull/657

Library side change split from D14924942

Added two arguments to load_dataset in PytorchTranslateTask:
1. dataset_upsampling. A nested dictionary {direction: {dataset: upsampling_ratio}}. An upsampling_ratio larger than one means that the bitext is observed more often than it is actually present in the combined bitext and synthetic training corpus.

2. dataset_relative_ratio. A tuple (dataset, ratio). The ratio is the frequency with which the given dataset is sampled relative to the rest of the corpora map.

At most one of the two can be specified.
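
The argument shapes and the mutual-exclusion rule above can be sketched as follows. This is an illustrative sketch only, not the code from this commit: the helper names `validate_sampling_args` and `upsampled_probabilities`, the dataset names, and the "multiply size by ratio, then normalize" reading of upsampling_ratio are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def validate_sampling_args(
    dataset_upsampling: Optional[Dict[str, Dict[str, float]]] = None,
    dataset_relative_ratio: Optional[Tuple[str, float]] = None,
) -> None:
    """Hypothetical helper: enforce that at most one argument is given."""
    if dataset_upsampling is not None and dataset_relative_ratio is not None:
        raise ValueError(
            "Specify at most one of dataset_upsampling and dataset_relative_ratio"
        )


def upsampled_probabilities(
    sizes: Dict[str, int], upsampling: Dict[str, float]
) -> Dict[str, float]:
    """Hypothetical helper: one possible reading of an upsampling ratio.

    Each dataset's size is scaled by its ratio (default 1.0) and the scaled
    sizes are normalized into sampling probabilities.
    """
    weighted = {
        name: size * upsampling.get(name, 1.0) for name, size in sizes.items()
    }
    total = sum(weighted.values())
    return {name: weight / total for name, weight in weighted.items()}


# Assumed example data: a small bitext and a larger synthetic corpus.
sizes = {"bitext": 100, "synthetic": 300}
probs = upsampled_probabilities(sizes, {"bitext": 3.0})
```

Under this reading, a ratio of 3.0 makes the 100-example bitext weigh as much as the 300-example synthetic corpus, so each is sampled with equal probability.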

Reviewed By: liezl200

Differential Revision: D15041293

fbshipit-source-id: 92daad29895c234e26d1b19f121106118a3957ad
parent da9e493e
......@@ -8,7 +8,7 @@
import contextlib
import os
import numpy as np
from collections import Iterable
def infer_language_pair(path):
    """Infer language pair from filename: <split>.<lang1>-<lang2>.(...).idx"""
......@@ -96,6 +96,9 @@ def filter_by_size(indices, size_fn, max_positions, raise_exception=False):
                for key in intersect_keys
            )
        else:
            # For MultiCorpusSampledDataset, will generalize it later
            if not isinstance(size_fn(idx), Iterable):
                return all(size_fn(idx) <= b for b in max_positions)
            return all(a is None or b is None or a <= b
                       for a, b in zip(size_fn(idx), max_positions))
......
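
The new branch in the hunk above can be exercised in isolation with a small sketch. The function name `fits` is hypothetical, and it takes the size directly instead of calling `size_fn(idx)`. Note that the sketch imports `Iterable` from `collections.abc`, since the `collections.Iterable` alias used in the diff was removed in Python 3.10.

```python
from collections.abc import Iterable


def fits(size, max_positions):
    """Hypothetical stand-alone version of the size check in the hunk above."""
    # Scalar size (the MultiCorpusSampledDataset case in the diff):
    # the single size must not exceed any of the bounds.
    if not isinstance(size, Iterable):
        return all(size <= b for b in max_positions)
    # Tuple size: compare element-wise, treating None as "no limit".
    return all(
        a is None or b is None or a <= b for a, b in zip(size, max_positions)
    )


scalar_ok = fits(5, (10, 20))          # 5 is within both bounds
scalar_too_long = fits(15, (10, 20))   # 15 exceeds the first bound
pair_too_long = fits((5, 30), (10, 20))  # target length 30 exceeds 20
pair_unbounded = fits((5, None), (10, 20))  # None skips the comparison
```

As in the diff, the scalar branch compares against every entry of max_positions, so a None bound there would raise a TypeError instead of being skipped.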