1. 10 May, 2019 1 commit
  2. 08 May, 2019 5 commits
  3. 07 May, 2019 4 commits
  4. 06 May, 2019 2 commits
    • allowing sharded dataset (#696) · 0add50c2
      Naman Goyal authored
      
      
      Summary:
      Co-authored-by: myleott <myleott@fb.com>
      
      Changing `data` to be a `str` with a colon-separated list for loading sharded datasets. This change is useful for loading large datasets that cannot fit into memory. The large dataset can be sharded, and each shard is then loaded in one epoch in a round-robin manner.
      
      For example, if there are `5` shards of data and `10` epochs, the shards will be iterated over in the order `[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]`.
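      As a rough sketch of this scheme (not the actual fairseq code; `shard_for_epoch` and its arguments are illustrative names), the colon-separated `data` string can be split into shard paths and the shard for a given epoch picked with a modulo:

      ```python
      def shard_for_epoch(data: str, epoch: int) -> str:
          """`data` is a colon-separated list of shard paths, e.g.
          "/data/shard0:/data/shard1:/data/shard2:/data/shard3:/data/shard4"."""
          paths = data.split(":")
          assert len(paths) > 0
          # With 5 shards and 10 epochs (0-based), the visiting order is
          # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4], i.e. round-robin over the shards.
          return paths[epoch % len(paths)]

      # Example: shard_for_epoch("/d/s0:/d/s1:/d/s2:/d/s3:/d/s4", 7) -> "/d/s2"
      ```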
      
      myleott: we need to look into `translation.py`, as it currently already expects a list and then concatenates the datasets.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/696
      
      Differential Revision: D15214049
      
      fbshipit-source-id: 03e43a7b69c7aefada2ca668abf1eac1969fe013
    • added masked_lm task (#697) · e1ffea87
      Naman Goyal authored
      
      
      Summary:
      Co-authored-by: jingfeidu <jingfeidu@fb.com>
      
      1) Adding a `masked_lm` task for BERT-like training. Code mostly taken from jingfeidu's implementation.
      
      2) Added a `has_eos` option to `block_pair_dataset` for working with datasets that have been preprocessed to include `eos`.
      
      Depends on: https://github.com/pytorch/fairseq/pull/696
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/697
      
      Differential Revision: D15214050
      
      fbshipit-source-id: c179ce2d70e59d2ddc941b13ceda99d929878931
  5. 04 May, 2019 2 commits
  6. 01 May, 2019 4 commits
  7. 30 Apr, 2019 3 commits
  8. 27 Apr, 2019 1 commit
  9. 17 Apr, 2019 3 commits
  10. 16 Apr, 2019 1 commit
  11. 10 Apr, 2019 1 commit
  12. 03 Apr, 2019 1 commit
    • sort dictionary items lexicographically for consistency · 10ad7495
      Paco Guzman authored
      Summary: Sorts dictionary items lexicographically before creating the counter. This makes distributed preprocessing deterministic.
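      A minimal sketch of the idea (not fairseq's `Dictionary` code; `build_counter` and `token_counts` are made-up names): counting tokens in a fixed, lexicographic order means that tokens with tied counts always end up in the same insertion order, no matter how the per-worker counts were gathered.

      ```python
      from collections import Counter

      def build_counter(token_counts: dict) -> Counter:
          """Build a Counter from {token: count}, inserting tokens in sorted order."""
          counter = Counter()
          for token in sorted(token_counts):  # fixed, lexicographic insertion order
              counter[token] += token_counts[token]
          return counter

      # Tokens with tied counts now come out of most_common() in a stable order,
      # since most_common() uses a stable sort and so preserves insertion order:
      # build_counter({"b": 2, "a": 2}).most_common() == [("a", 2), ("b", 2)]
      ```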
      
      Reviewed By: myleott
      
      Differential Revision: D14678214
      
      fbshipit-source-id: 7a9e2f0cb367e8fb76da01e108dda4c6c5aab505
  13. 02 Apr, 2019 1 commit
  14. 15 Mar, 2019 1 commit
    • 0.6.1 -> 0.6.2 (#577) · e6422528
      Myle Ott authored
      Summary:
      Changelog:
      - 998ba4f: Add language models from Baevski & Auli (2018)
      - 4294c4f6: Add mixture of experts code from Shen et al. (2019)
      - 00493490: Add example for multilingual training
      - 48d9afbe: Speed improvements, including fused operators from apex
      - 44d27e64: Add Tensorboard support
      - d17fa851: Add Adadelta optimizer
      - 9e1c880f: Add `FairseqEncoderModel`
      - b65c579b: Add `FairseqTask.inference_step` to modularize generate.py
      - 2ad1178e: Add back `--curriculum`
      - Misc bug fixes and other features
      
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/577
      
      Differential Revision: D14481233
      
      Pulled By: myleott
      
      fbshipit-source-id: 4ff8625ef1c0b24273fc65df7c5658e3c932e8b7
  15. 01 Mar, 2019 2 commits
    • Fixed the issue that no space in string converted from tensor · 88bf8b56
      James King authored
      Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/548
      
      Differential Revision: D14286021
      
      Pulled By: myleott
      
      fbshipit-source-id: 7c725304185e63787220371a812ec860e178872c
    • Refactor BERTDataset to the more general MaskedLMDataset · 92a6c548
      Kartikay Khandelwal authored
      Summary: The current BERTDataset has a lot of the components needed for generic MaskedLM training, but it is too restrictive in the assumptions it makes: two blocks being masked, the special tokens used for the sentence embedding as well as the separator, etc. In this diff I refactor this dataset and, at the same time, make some of the parameters configurable, including the probabilities associated with masking.
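      As an illustrative sketch of what "configurable masking" can look like (assumed names and defaults, not the actual MaskedLMDataset API), the probabilities are passed in rather than hard-coded:

      ```python
      import numpy as np

      def mask_tokens(tokens, mask_idx, vocab_size,
                      masking_ratio=0.15, mask_prob=0.8, random_token_prob=0.1):
          """BERT-style masking with configurable probabilities (illustrative only)."""
          tokens = np.array(tokens)  # work on a copy
          # Choose which positions the model has to predict.
          masked = np.random.rand(len(tokens)) < masking_ratio
          targets = np.where(masked, tokens, -1)  # -1 = ignored by the loss
          # Of the chosen positions: mask_prob -> [MASK], random_token_prob -> a
          # random token, the remainder -> keep the original token.
          rand = np.random.rand(len(tokens))
          tokens[masked & (rand < mask_prob)] = mask_idx
          use_random = masked & (rand >= mask_prob) & (rand < mask_prob + random_token_prob)
          tokens[use_random] = np.random.randint(0, vocab_size, use_random.sum())
          return tokens, targets
      ```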
      
      Reviewed By: rutyrinott
      
      Differential Revision: D14222467
      
      fbshipit-source-id: e9f78788dfe7f56646ba09c62967c4c0bd30aed8
  16. 28 Feb, 2019 1 commit
  17. 26 Feb, 2019 2 commits
  18. 22 Feb, 2019 1 commit
  19. 19 Feb, 2019 1 commit
    • moving masking logic to collate · 08e866f9
      Ruty Rinott authored
      Summary: Move masking logic to data_utils
      
      Reviewed By: kartikayk, jingfeidu
      
      Differential Revision: D14098403
      
      fbshipit-source-id: c7b7e811ab48b9c5a12662dc1e2f2ed694724176
  20. 16 Feb, 2019 1 commit
  21. 30 Jan, 2019 1 commit
    • Merge internal changes (#483) · 42be3ebd
      Myle Ott authored
      Summary:
      Changelog:
      - `4889802`: can now detokenize sentencepiece output with `--remove-bpe=sentencepiece` (fixes #331). Also added `--sacrebleu` for computing detokenized BLEU.
      - `0d76427`: fix an assertion error when training a language model on a dataset containing empty sentences
      - minor bug and style fixes
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/483
      
      Differential Revision: D13867899
      
      Pulled By: myleott
      
      fbshipit-source-id: 25c940b847fe270262ac8f5ac838407b3977fdda
  22. 24 Jan, 2019 1 commit
    • Enforce UTF-8 when open() text files (#460) · 38f1dee9
      Davide Caroselli authored
      Summary:
      When opening text files without specifying the encoding (i.e. `open(path, "r")` or `open(path, "w")`), Python 3 uses the preferred locale encoding (`locale.getpreferredencoding()`), so the result is platform dependent and can change from one machine to another.
      
      I believe fairseq should enforce its own standard (UTF-8 seems like the best choice to me). This pull request explicitly specifies UTF-8 encoding when reading text files.
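      In code terms the change amounts to the following (the path is illustrative):

      ```python
      path = "train.en"  # illustrative path

      # Before: the encoding comes from locale.getpreferredencoding(), so the
      # result differs between machines.
      with open(path, "r") as f:
          lines = f.readlines()

      # After: what this pull request enforces, an explicit UTF-8 encoding.
      with open(path, "r", encoding="utf-8") as f:
          lines = f.readlines()
      ```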
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/460
      
      Differential Revision: D13802525
      
      Pulled By: myleott
      
      fbshipit-source-id: 672fd55707ee559ab36d74bc1c24026166ea2367