- 25 Oct, 2019 1 commit
-
-
Halil Akin authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/888 We want to simplify multihead attention and get rid of the dynamic in_proj_weight logic. Sending the diff early for feedback, will have further changes as I try to fix breaking tests Reviewed By: edunov Differential Revision: D17912661 fbshipit-source-id: 0e6319fc694d8ec5187d1c2fefe5839d9d522186
-
- 24 Oct, 2019 4 commits
-
-
Ning Dong authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1299 LevT calls into a tracing-compliant transformer that we did not originally plan to open-source. This is a workaround to unbreak master. We will revisit and simplify the code later. Reviewed By: pipibjc Differential Revision: D18110339 fbshipit-source-id: 3bb51c56c2c20f45db1d5786d030b374b412eab1
-
Jerry Ma authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/892 Differential Revision: D18109685 Pulled By: jma127 fbshipit-source-id: f96e1080a5577b8ee0748dfdd956bf72bed47474
-
Jerry Ma authored
Summary: Makes more sense to reset either both meters or neither of them. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/891 Differential Revision: D18109027 Pulled By: jma127 fbshipit-source-id: f63baed9a6b928a6f591a76e69ef6e9c524e4398
-
Ning Dong authored
Summary: NAT productionization diff: (1) Integrate NAT model training / evaluation into the LATTE base training workflow. (2) Make NAT tracing compliant. Since it calls into the Fairseq transformer, we needed to refactor the code, and I created a near-copy of it named fb_tracing_transformer. (3) Decoder-side C++ code was landed in an earlier diff. Reviewed By: xianxl Differential Revision: D17888324 fbshipit-source-id: ef4ef195fddd360da921502adcef82b087e46ce6
-
- 23 Oct, 2019 1 commit
-
-
Yilei Li authored
Summary: Enables the reduce_on_plateau schedule with an optional warmup phase, where we linearly increase the learning rate from some initial learning rate (``--warmup-init-lr``) up to the configured learning rate (``--lr``). Thereafter the lr is adjusted according to the original reduce_on_plateau scheme. During warmup:: lrs = torch.linspace(args.warmup_init_lr, args.lr, args.warmup_updates); lr = lrs[update_num] Reviewed By: yqwangustc Differential Revision: D17779925 fbshipit-source-id: c3bfb3321c76850824fc42df4fac4e5dcf73fbf8
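A minimal sketch of the warmup-then-plateau behaviour described above, assuming the argument names from the commit message; the actual fairseq scheduler may differ:

```python
import torch

class WarmupReduceOnPlateauSketch:
    """Linear warmup for the first `warmup_updates` steps, then hand control
    to PyTorch's ReduceLROnPlateau (a sketch, not fairseq's implementation)."""

    def __init__(self, optimizer, warmup_init_lr, lr, warmup_updates, patience=0):
        self.optimizer = optimizer
        self.warmup_updates = warmup_updates
        # Pre-compute the linear warmup schedule.
        self.warmup_lrs = torch.linspace(warmup_init_lr, lr, warmup_updates)
        # After warmup, the plateau scheduler adjusts the lr on validation loss.
        self.plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, patience=patience)

    def step_update(self, update_num):
        # Called every update: apply the warmup lr while still warming up.
        if update_num < self.warmup_updates:
            lr = self.warmup_lrs[update_num].item()
            for group in self.optimizer.param_groups:
                group["lr"] = lr

    def step(self, val_loss):
        # Called every validation epoch: original reduce_on_plateau behaviour.
        self.plateau.step(val_loss)
```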
-
- 22 Oct, 2019 3 commits
-
-
Changhan Wang authored
Summary: Bugfix for inconsistent scores on the same input sentences. This only affects the displayed scores in `generate.py` and does not affect the model outputs. Reviewed By: MultiPath Differential Revision: D17799343 fbshipit-source-id: 2b868ac03097a4db27db736e126a61d50958acc5
-
Louis MARTIN authored
Summary: Very small change. The previous message was misleading: the length of TokenBlocksDataset is the number of "blocks" or "streams", but not, strictly speaking, the number of batches, if I am not mistaken. I use the notion of batch from RoBERTa: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md. It took me some time to understand what was going on; I hope it saves some time for others. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1279 Differential Revision: D18051476 fbshipit-source-id: 71fa35f21b9dbc8d6bde28cd3a487723690aadee
-
Louis MARTIN authored
Summary: Fix for https://github.com/pytorch/fairseq/issues/1240 Tested with MaskedLMTask. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1281 Differential Revision: D18051472 fbshipit-source-id: 0aeff60c71489655f5e621349f780ba9cd8c027a
-
- 20 Oct, 2019 2 commits
-
-
Jiatao Gu authored
Summary: The diff contains two fixes: (1) enabling non-shared decoder layers for deletion/insertion; (2) adding options to perform sampling instead of argmax when learning the deletion. Reviewed By: kahne Differential Revision: D18011220 fbshipit-source-id: c60815fb7bc3a0004c81249504f7a641536ae2d8
-
Jiatao Gu authored
Summary: Fix typos in the examples Reviewed By: kahne Differential Revision: D18030097 fbshipit-source-id: 84f0cbafd85e50ffd5033738835373935e3b83d4
-
- 18 Oct, 2019 3 commits
-
-
Spencer Poff authored
Summary: In https://github.com/fairinternal/fairseq-py/pull/877, sequence_generator began calling `model.forward_decoder`, but not all decoder models were given an implementation of that function. Reviewed By: okhonko Differential Revision: D17863751 fbshipit-source-id: ea70b636c9dafcf87f5d5e49631d0c4b7cf14984
-
dikshameghwal authored
Summary: removed redundant quotes in the filename assigned for dev dataset for GLUE tasks Pull Request resolved: https://github.com/pytorch/fairseq/pull/1270 Differential Revision: D18013071 fbshipit-source-id: 35f00162e117c6584dc859f760503ca32dcb706e
-
Changhan Wang authored
Summary: When the `if` statements in the Levenshtein transformer decoder forward are removed, `attn` may end up with a batch size inconsistent with the output tokens. This is a fix. Reviewed By: cndn Differential Revision: D17936411 fbshipit-source-id: a1583f3806dc9f41caeb783c043429e247035803
-
- 15 Oct, 2019 2 commits
-
-
Nayan Singhal authored
Summary: This unit test guards the BMUF code. Change: 1. distributed_init assumes we are always using a CUDA device, which is not the case if you are using the "gloo" backend on a CPU machine. Reviewed By: jay-mahadeokar Differential Revision: D17821391 fbshipit-source-id: 28e1bb39f7a4889b1dc6bd636b7c499e55bfc69a
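A hypothetical sketch of the fix described above, selecting the backend and device based on CUDA availability instead of assuming CUDA; the function and argument names are illustrative, not fairseq's actual distributed_init:

```python
import torch
import torch.distributed as dist

def distributed_init_sketch(init_method, world_size, rank, backend="nccl"):
    # Fall back to gloo on CPU-only machines instead of assuming a CUDA device.
    if not torch.cuda.is_available():
        backend = "gloo"
    dist.init_process_group(
        backend=backend, init_method=init_method,
        world_size=world_size, rank=rank)
    # Only pin a CUDA device when one actually exists.
    if torch.cuda.is_available():
        torch.cuda.set_device(rank % torch.cuda.device_count())
```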
-
Changhan Wang authored
Summary: Bring back the changes in D17661768 Reviewed By: ailzhang Differential Revision: D17920299 fbshipit-source-id: be3f93a044a8710c8b475012c39e36a3e6507fad
-
- 12 Oct, 2019 1 commit
-
-
Sujit Verma authored
Summary: Added option to save checkpoints using Path Manager. Reviewed By: hudeven Differential Revision: D17392754 fbshipit-source-id: 4b8e556ef8455a1548e5a083d779ed809cd785be
-
- 11 Oct, 2019 2 commits
-
-
Jiatao Gu authored
Summary: The original implementation of the random mask is different from what the paper stated. Reviewed By: kahne Differential Revision: D17652564 fbshipit-source-id: 238a9158041b3ff2482ee50ce6151c3f77f0b2c1
-
Jiatao Gu authored
Summary: Implementation of the Levenshtein Transformer paper. Add a new helper function "new_arange" to create arange tensors easily. Fix bugs in returning attn values for NAT models. Delete files which are unnecessary or experimental. Reviewed By: kahne Differential Revision: D17652009 fbshipit-source-id: 436bbb5d45de2f8067003232de4f2bd51e87719c
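An illustrative sketch of what a helper like "new_arange" can look like (an assumption about its shape, not necessarily the exact fairseq implementation):

```python
import torch

def new_arange(x, *size):
    """Return an arange tensor on the same device as `x`, expanded to `size`
    (or to x's own shape when no size is given)."""
    if len(size) == 0:
        size = x.size()
    return torch.arange(size[-1], device=x.device).expand(*size).contiguous()

# Example: positions for each token in a padded batch of token ids.
tokens = torch.zeros(2, 5, dtype=torch.long)
positions = new_arange(tokens)   # shape (2, 5), each row is 0..4
```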
-
- 10 Oct, 2019 2 commits
-
-
Dmytro Okhonko authored
Summary: Adds CTC loss and corresponding transformer ctc based models. Tested with `CUDA_VISIBLE_DEVICES=0 python train.py $DATA_PATH --save-dir $SAVE_DIR --max-epoch 30 --task speech_recognition --arch vggtransformer_enc_1 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 10000 --log-format json --log-interval 1 --criterion ctc_loss --user-dir examples/speech_recognition/ --validate-interval=10` Pull Request resolved: https://github.com/pytorch/fairseq/pull/1233 Reviewed By: jcai1 Differential Revision: D17856824 Pulled By: okhonko fbshipit-source-id: f3eac64d3fdd0c37cf8c539dd360cfb610d8a6ef
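A hedged sketch of what a CTC criterion built on torch.nn.functional.ctc_loss looks like; tensor shapes and the blank index are assumptions, not the exact fairseq criterion added here:

```python
import torch
import torch.nn.functional as F

def ctc_loss_sketch(encoder_out, encoder_lengths, targets, target_lengths, blank_idx=0):
    """encoder_out: (T, B, V) logits over the output vocabulary;
    targets: (B, S) label indices; lengths are 1-D tensors of size B."""
    log_probs = F.log_softmax(encoder_out, dim=-1)   # CTC expects log-probabilities
    return F.ctc_loss(
        log_probs, targets, encoder_lengths, target_lengths,
        blank=blank_idx, reduction="sum", zero_infinity=True)
```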
-
Jeff Cai authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/846 Reviewed By: jcai1 Differential Revision: D17845996 Pulled By: okhonko fbshipit-source-id: 3826fd9a4418496916bf1835c319dd85c89945cc
-
- 09 Oct, 2019 1 commit
-
-
Alex Xiao authored
Summary: We currently shard data when creating the batch iterator. This means we first load all indices/frame lengths/handles into memory, and then do the sharding. This makes it impossible to train on large datasets with a high number of workers because each worker will need to load the entire dataset into memory. For training on a million hours of data (i.e. semi-supervised or unsupervised approaches) this data loading just makes it flat out impossible to use 8 GPUs. 3 changes: 1. This diff modifies the data loading such that we do the sharding while we read the handles file, rather than later. This modification is done on a task-by-task basis, since the task specifies how the data is loaded. I've tried to make the code compatible with both sharding during handle loading and sharding during batch iteration. I've currently only done the sharding during handle loading for the aligned_training task. 2. To support data sharding at data loading time and the requirement that all shards must have exactly the same # of batches, I've added a method to do this synchronization where all shards with too many batches simply truncate the extra ones, similar to what we already do. 3. In fairspeq/train.py, we are actually loading the training dataset and batch iterator twice, once in train.py and once when loading the checkpoint (which we always do regardless of whether there is a checkpoint). This means double the loading time, which can be painful for very large files. I've removed the extraneous loading in this diff as well. Reviewed By: yqwangustc Differential Revision: D17750715 fbshipit-source-id: 0e6e3d363525fa5661f1c784303390ea13f46377
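A rough sketch of the two mechanisms described in points 1 and 2 above — sharding while reading the handles file, and truncating every shard to a common number of batches; the file format and helper names are assumptions:

```python
import torch
import torch.distributed as dist

def load_shard_sketch(handles_path, shard_id, num_shards):
    """Read only this worker's slice of the handles file instead of loading
    everything and sharding later (hypothetical format: one handle per line)."""
    shard = []
    with open(handles_path) as f:
        for i, line in enumerate(f):
            if i % num_shards == shard_id:
                shard.append(line.rstrip("\n"))
    return shard

def truncate_to_common_length(batches):
    """All shards must yield exactly the same number of batches, so shards
    with too many batches drop the extras (assumes the process group is up)."""
    n = torch.tensor([len(batches)])
    dist.all_reduce(n, op=dist.ReduceOp.MIN)
    return batches[: n.item()]
```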
-
- 08 Oct, 2019 3 commits
-
-
Jerry Ma authored
Summary: PyTorch now has more comprehensive memory instrumentation, added in https://github.com/pytorch/pytorch/pull/27361 . This PR makes fairseq print a summary table of the memory state when an OOM occurs. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/885 Differential Revision: D17820445 Pulled By: jma127 fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0
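A minimal sketch of the pattern, assuming a PyTorch recent enough to provide torch.cuda.memory_summary(); this is illustrative, not fairseq's exact OOM handler:

```python
import torch

def train_step_with_oom_report(step_fn, *args, **kwargs):
    """Run a training step; on CUDA OOM, print the allocator summary table
    for each device before re-raising."""
    try:
        return step_fn(*args, **kwargs)
    except RuntimeError as e:
        if "out of memory" in str(e) and torch.cuda.is_available():
            for device_id in range(torch.cuda.device_count()):
                print(torch.cuda.memory_summary(device=device_id, abbreviated=False))
        raise
```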
-
Jungo Kasai authored
Summary: Add ensemble wrappers to the Levenshtein NAT: a final-softmax ensemble over the pipeline of three steps: 1. Deletion, 2. Placeholder Insertion, 3. Word Selection. Each step involves scoring, averaging the scores over the ensemble, and then making hard decisions with argmax; then the next step follows. We cannot do the three steps in parallel by design. Reviewed By: kahne Differential Revision: D17723202 fbshipit-source-id: 05f7a4fcd922a972cc4796ca397e8220f0b4d53e
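A hedged sketch of the per-step ensembling pattern (score with each model, average, argmax); the method name and inputs are hypothetical, while the real wrappers apply this to the decoder's three sequential steps:

```python
import torch

def ensemble_step_sketch(models, step_name, *inputs):
    """Run one decoding step (deletion, placeholder insertion, or word
    selection) with every model, average the scores, and take the argmax.
    Each per-model scoring method is assumed to return scores of equal shape."""
    scores = torch.stack([getattr(m, step_name)(*inputs) for m in models], dim=0)
    avg_scores = scores.mean(dim=0)      # average over the ensemble
    return avg_scores.argmax(dim=-1)     # hard decision for this step
```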
-
Changhan Wang authored
Summary: Fix the max length calculation in Levenshtein Transformer Reviewed By: jhcross Differential Revision: D17672946 fbshipit-source-id: e5efbe7e56cf879d3e822864e4398f99f45b04d4
-
- 07 Oct, 2019 1 commit
-
-
Nayan Singhal authored
Summary: In all our final settings we use global_sync = 50, and we get comparable results with DDP and caffe2. This sets the default global-sync-iter to 50, so users can just pass --use-bmuf to enable it for training. Reviewed By: skritika Differential Revision: D17765094 fbshipit-source-id: 369591eeff266d757f89e1fc8dda01711146fdbc
-
- 05 Oct, 2019 1 commit
-
-
alexeib authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/884 Differential Revision: D17774515 Pulled By: alexeib fbshipit-source-id: d1ffe8ab723fa284c69b067bbd43d699eaa2f02f
-
- 04 Oct, 2019 2 commits
-
-
Jerry Ma authored
Summary: This adds a periodic call to `torch.cuda.empty_cache()` in order to mitigate memory fragmentation in the PyTorch CUDA cached allocator that can cause OOMs on models approaching GPU memory limit. By default, this will occur every 64 updates. Performance considerations: - I've benchmarked this on a reasonably large model with memory footprint 16 GB, and the overhead with the default setting is <0.2%. With `update-freq > 1`, the cost is mitigated even further. - This behavior can be disabled with a value of zero. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/882 Differential Revision: D17742386 Pulled By: jma127 fbshipit-source-id: 68d8f93f798d6818b5efc3d67d43b52dfb8b2865
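A minimal sketch of the periodic cache flush described above; the function name and the way the interval is passed in are assumptions:

```python
import torch

def maybe_empty_cache(num_updates, interval=64):
    """Flush the CUDA caching allocator every `interval` updates to mitigate
    fragmentation; an interval of 0 disables the behaviour."""
    if interval > 0 and num_updates % interval == 0 and torch.cuda.is_available():
        torch.cuda.empty_cache()
```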
-
Debojeet Chatterjee authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/879 Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1023 Pull Request resolved: https://github.com/pytorch/fairseq/pull/1211 Added a new native op that does wordpiece tokenization while additionally returning token start and end indices in the raw text as required by BertSquadQA. Includes Unit Tests for the native op and also to check its parity with the PyText Wordpiece Tokenizer. Also combined is a torchscript implementation of the Bert SQUAD QA Model. There are scripts for evaluation and testing of the torchscript code as well. Reviewed By: borguz, hikushalhere Differential Revision: D17455985 fbshipit-source-id: c2617c7ecbce0f733b31d04558da965d0b62637b
-
- 01 Oct, 2019 1 commit
-
-
Chenyang Yu authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1180 Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/874 Extract FP16OptimizerMixin to share the same logic with PyText. Reviewed By: hudeven Differential Revision: D17594102 fbshipit-source-id: 8625a4e4f3e09cbaba6ae92599c1121b86ed4e78
-
- 30 Sep, 2019 2 commits
-
-
Sarthak Garg authored
Implementation of the paper "Jointly Learning to Align and Translate with Transformer Models" (#877) Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/877 This PR implements guided alignment training described in "Jointly Learning to Align and Translate with Transformer Models (https://arxiv.org/abs/1909.02074)". In summary, it allows for training selected heads of the Transformer Model with external alignments computed by Statistical Alignment Toolkits. During inference, attention probabilities from the trained heads can be used to extract reliable alignments. In our work, we did not see any regressions in the translation performance because of guided alignment training. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1095 Differential Revision: D17170337 Pulled By: myleott fbshipit-source-id: daa418bef70324d7088dbb30aa2adf9f95774859
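A hedged sketch of a guided-alignment (supervised attention) loss of the kind the paper describes — cross-entropy between a supervised head's attention distribution and external alignment targets; tensor names and shapes are assumptions, not the PR's exact code:

```python
import torch

def alignment_loss_sketch(attn_probs, align_targets, pad_mask, eps=1e-9):
    """attn_probs: (B, tgt_len, src_len) attention from the supervised head(s);
    align_targets: (B, tgt_len) index of the aligned source position per target
    token; pad_mask: (B, tgt_len) True where the target position is padding."""
    gathered = attn_probs.gather(-1, align_targets.unsqueeze(-1)).squeeze(-1)
    nll = -torch.log(gathered + eps)          # cross-entropy vs. one-hot alignment
    nll = nll.masked_fill(pad_mask, 0.0)      # ignore padded target positions
    return nll.sum() / (~pad_mask).sum()
```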
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/878 Differential Revision: D17661768 Pulled By: myleott fbshipit-source-id: 1e4c5f09eb14c40d491ca2459fd2adb8382fb6d2
-
- 29 Sep, 2019 2 commits
-
-
Guntupalli Venkata Sai Kalyan authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1200 Differential Revision: D17659658 Pulled By: myleott fbshipit-source-id: 1863e6d60a439dbb7e71e5da68817c9d53649737
-
Stephan Peitz authored
Summary: This PR implements a new attention module which combines cross-attention (encoder-decoder attention) and the decoder self-attention. This work was accepted as an abstract at WeCNLP 2019 (https://www.wecnlp.ai/wecnlp-2019). Cross+Self-Attention reduces the number of parameters and increases inference speed without any degradation in translation quality. More details can be found in the attached [abstract](https://github.com/pytorch/fairseq/files/3561282/paper.pdf) Pull Request resolved: https://github.com/pytorch/fairseq/pull/1097 Differential Revision: D17653168 Pulled By: myleott fbshipit-source-id: deb834c2c78a229d7418ffbfea20ba3ce252991c
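An illustrative sketch of the idea, assuming the decoder queries attend over the concatenation of encoder outputs and decoder states in a single attention call; this is my illustration, not the PR's exact module:

```python
import torch
import torch.nn as nn

class CrossSelfAttentionSketch(nn.Module):
    """One attention over [encoder states ; decoder states] replaces the
    separate encoder-decoder and decoder self-attention sub-layers."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, dec_x, enc_out, key_padding_mask=None, attn_mask=None):
        # dec_x: (tgt_len, B, D); enc_out: (src_len, B, D).
        kv = torch.cat([enc_out, dec_x], dim=0)   # keys/values span both sources
        # attn_mask must cover the concatenated key length, i.e. shape
        # (tgt_len, src_len + tgt_len), and be causal over the decoder part.
        out, _ = self.attn(dec_x, kv, kv,
                           key_padding_mask=key_padding_mask,
                           attn_mask=attn_mask)
        return out
```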
-
- 28 Sep, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1197 Differential Revision: D17651374 Pulled By: myleott fbshipit-source-id: 5feb986de1e682eb83c4479f419ad51325718572
-
- 27 Sep, 2019 5 commits
-
-
Aditya Chetan authored
Summary: For batched predictions in Roberta, the README was giving an example that was pretty unclear. After a thorough discussion with ngoyal2707 in issue https://github.com/pytorch/fairseq/issues/1167, he gave a clear example of how batched predictions are supposed to be done. Since I spent a lot of time on this inconsistency, I thought that it might benefit the community if his solution were in the official README 😄! For details, see issue https://github.com/pytorch/fairseq/issues/1167 Pull Request resolved: https://github.com/pytorch/fairseq/pull/1195 Differential Revision: D17639354 Pulled By: myleott fbshipit-source-id: 3eb60c5804a6481f533b19073da7880dfd0d522d
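A hedged sketch of batched prediction along the lines discussed in the issue — encode each input, pad with collate_tokens, run one forward pass; the checkpoint path and classification head name are placeholders:

```python
import torch
from fairseq.data.data_utils import collate_tokens
from fairseq.models.roberta import RobertaModel

# Placeholder path and head name; adapt to your fine-tuned checkpoint.
roberta = RobertaModel.from_pretrained('/path/to/roberta.large.mnli',
                                       checkpoint_file='model.pt')
roberta.eval()

pairs = [('Roberta is heavily optimized.', 'Roberta is not very optimized.'),
         ('Roberta is heavily optimized.', 'Roberta is based on BERT.')]
# Encode each pair, then pad to a common length (pad index 1 for RoBERTa).
batch = collate_tokens([roberta.encode(a, b) for a, b in pairs], pad_idx=1)
with torch.no_grad():
    logits = roberta.predict('mnli', batch)
print(logits.argmax(dim=1))
```
-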
Changhan Wang authored
Summary: Code for our NeurIPS paper [Levenshtein Transformer](https://arxiv.org/abs/1905.11006) * Added Levenshtein Transformer model, task and criterion class * Added iterative NAT Transformer, insertion Transformer and CMLM Transformer model class for baselines * Add an option for prepending BOS to dictionary class and translation task class Reviewed By: myleott Differential Revision: D17297372 fbshipit-source-id: 54eca60831ae95dc721c2c34e882e1810ee575c7
-
Nayan Singhal authored
Summary: BMUF sync started happening even before warmup is done. This diff fixes the behavior and does the BMUF sync once warmup is done, or immediately if warmup is zero. TODO: write a unit test case so that these problems can be figured out faster. Reviewed By: jay-mahadeokar Differential Revision: D17356277 fbshipit-source-id: 21500e6ed1225b97794e4ee203e5d7d04a2840f8
-
Louis Martin authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1174 Differential Revision: D17627767 Pulled By: myleott fbshipit-source-id: 7b5f77146b8776a5967699e430136039c066c851
-
Zhanghao Wu authored
Summary: Hi, I think there is a minor mistake in the doc. The `--distributed-no-spawn` argument is needed for distributed training on multiple machines without `slurm`. Otherwise, the program will start 8 jobs on each GPU when `nproc_per_node=8`. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1188 Differential Revision: D17627778 Pulled By: myleott fbshipit-source-id: 35ab6b650dc1132d7cb2d150e80d2ebf0caf3e69
-