1. 26 Oct, 2019 1 commit
• fix a type mismatch in NAT quantization run · eb68afca
      Xian Li authored
      Summary:
Fix a type mismatch found after patching NAT on top of quantization.
Ning suggested this fix. We still need to understand why the mismatch only appears after patching the quantization diff.
      
      Reviewed By: kahne, jhcross
      
      Differential Revision: D18147726
      
      fbshipit-source-id: a51becc9ad58a637a0180074eaa2b46990ab9f84
  2. 25 Oct, 2019 2 commits
  3. 24 Oct, 2019 4 commits
  4. 23 Oct, 2019 1 commit
• Add warmup support in reduce_on_plateau lr schedule · 8defa9d9
      Yilei Li authored
      Summary:
Enables the reduce_on_plateau schedule with an optional warmup phase, during which the learning rate increases linearly from an initial value (``--warmup-init-lr``) to the configured rate (``--lr``). Thereafter the lr is adjusted according to the original reduce_on_plateau scheme.
      During warmup::
      
            lrs = torch.linspace(args.warmup_init_lr, args.lr, args.warmup_updates)
            lr = lrs[update_num]
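
A minimal sketch of how the two phases could combine (hypothetical function; the actual scheduler class wires this into the existing reduce_on_plateau logic)::

      import torch

      def get_lr(update_num, args, plateau_lr):
          # Linear warmup from --warmup-init-lr to --lr over --warmup-updates
          # steps; afterwards, defer to the reduce_on_plateau-managed rate.
          if update_num < args.warmup_updates:
              lrs = torch.linspace(args.warmup_init_lr, args.lr, args.warmup_updates)
              return lrs[update_num].item()
          return plateau_lr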
      
      Reviewed By: yqwangustc
      
      Differential Revision: D17779925
      
      fbshipit-source-id: c3bfb3321c76850824fc42df4fac4e5dcf73fbf8
  5. 22 Oct, 2019 3 commits
  6. 20 Oct, 2019 2 commits
• Enable separate models for insertion and deletion · 66d24dc2
      Jiatao Gu authored
      Summary:
The diff contains two fixes:
(1) enabling non-shared decoder layers for deletion/insertion
(2) adding an option to sample instead of taking the argmax when learning deletion (see the sketch below)
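
A hypothetical illustration of the sampling option (not the diff's exact code): instead of always committing to the mode of the deletion classifier, draw from its distribution during training::

      import torch

      def choose_deletion(logits, sampling=False):
          # logits: (batch, seq_len, 2) keep/delete scores per position.
          if sampling:
              # Sample from the predicted distribution rather than argmax,
              # exposing training to more diverse deletion decisions.
              probs = torch.softmax(logits, dim=-1)
              return torch.multinomial(probs.view(-1, 2), 1).view(logits.shape[:-1])
          return logits.argmax(dim=-1)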
      
      Reviewed By: kahne
      
      Differential Revision: D18011220
      
      fbshipit-source-id: c60815fb7bc3a0004c81249504f7a641536ae2d8
• Fix typos in examples for nonautoregressive translation · a3c629b5
      Jiatao Gu authored
      Summary: Fix typos in the examples
      
      Reviewed By: kahne
      
      Differential Revision: D18030097
      
      fbshipit-source-id: 84f0cbafd85e50ffd5033738835373935e3b83d4
  7. 18 Oct, 2019 3 commits
  8. 15 Oct, 2019 2 commits
• Add unit test cases for BMUF · b5f41f82
      Nayan Singhal authored
      Summary:
This unit test guards the BMUF code.

Change:
1. distributed_init assumed we are always using a CUDA device, which is not the case when using the "gloo" backend on a CPU-only machine (see the sketch below).
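
A minimal sketch of backend-aware initialization (illustrative names, not the diff's exact code)::

      import torch
      import torch.distributed as dist

      def distributed_init(backend, init_method, world_size, rank):
          dist.init_process_group(backend=backend, init_method=init_method,
                                  world_size=world_size, rank=rank)
          # Only pin a CUDA device for CUDA-capable backends; "gloo" is
          # commonly used for CPU-only runs where set_device would fail.
          if backend == "nccl" and torch.cuda.is_available():
              torch.cuda.set_device(rank % torch.cuda.device_count())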
      
      Reviewed By: jay-mahadeokar
      
      Differential Revision: D17821391
      
      fbshipit-source-id: 28e1bb39f7a4889b1dc6bd636b7c499e55bfc69a
• fix libnat imports · e3a40d9d
      Changhan Wang authored
      Summary: Bring back the changes in D17661768
      
      Reviewed By: ailzhang
      
      Differential Revision: D17920299
      
      fbshipit-source-id: be3f93a044a8710c8b475012c39e36a3e6507fad
  9. 12 Oct, 2019 1 commit
  10. 11 Oct, 2019 2 commits
• fix the random mask function for the CMLM model · 02b74c58
      Jiatao Gu authored
Summary: The original implementation of the random mask differed from what the paper stated (a corrected sketch follows).
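
A simplified sketch of the paper's masking scheme (Mask-Predict/CMLM), ignoring special-token handling; names are illustrative::

      import torch

      def random_mask(target_tokens, pad_idx, unk_idx):
          # Sample how many tokens to mask uniformly from [1, length], then
          # pick that many positions at random within each sentence.
          target_masks = target_tokens.ne(pad_idx)
          lengths = target_masks.sum(1).float()
          num_mask = (lengths * lengths.clone().uniform_() + 1).long()
          # Random score per real token; padding gets 2.0 so it is never picked.
          scores = target_tokens.clone().float().uniform_().masked_fill_(~target_masks, 2.0)
          ranks = scores.argsort(1)
          cutoff = torch.arange(target_tokens.size(1),
                                device=target_tokens.device)[None, :] < num_mask[:, None]
          mask = cutoff.scatter(1, ranks, cutoff)  # map sorted positions back
          return target_tokens.masked_fill(mask, unk_idx)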
      
      Reviewed By: kahne
      
      Differential Revision: D17652564
      
      fbshipit-source-id: 238a9158041b3ff2482ee50ce6151c3f77f0b2c1
• add new_arange function + fix bugs of returning attn values · cce92bdd
      Jiatao Gu authored
      Summary:
Implementation of the Levenshtein Transformer paper.
Add a new helper function "new_arange" to create arange tensors easily (sketch below).
Fix bugs in returning attn values for NAT models.
Delete files that are unnecessary or experimental.
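
The helper amounts to roughly the following (a sketch of the idea)::

      import torch

      def new_arange(x, *size):
          # Return an arange tensor of `size` on the device of `x`;
          # if size is omitted, use x's own size.
          if len(size) == 0:
              size = x.size()
          return torch.arange(size[-1], device=x.device).expand(*size).contiguous()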
      
      Reviewed By: kahne
      
      Differential Revision: D17652009
      
      fbshipit-source-id: 436bbb5d45de2f8067003232de4f2bd51e87719c
  11. 10 Oct, 2019 2 commits
• Add CTC loss to ASR task (#1233) · c4893ca6
      Dmytro Okhonko authored
      Summary:
Adds CTC loss and corresponding Transformer CTC-based models.
      
      Tested with
      `CUDA_VISIBLE_DEVICES=0 python train.py $DATA_PATH --save-dir $SAVE_DIR --max-epoch 30 --task speech_recognition --arch vggtransformer_enc_1 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0  --max-tokens 10000 --log-format json --log-interval 1 --criterion ctc_loss --user-dir examples/speech_recognition/ --validate-interval=10`
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/1233
      
      Reviewed By: jcai1
      
      Differential Revision: D17856824
      
      Pulled By: okhonko
      
      fbshipit-source-id: f3eac64d3fdd0c37cf8c539dd360cfb610d8a6ef
• wav2letter integration · 33646ac9
      Jeff Cai authored
      Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/846
      
      Reviewed By: jcai1
      
      Differential Revision: D17845996
      
      Pulled By: okhonko
      
      fbshipit-source-id: 3826fd9a4418496916bf1835c319dd85c89945cc
  12. 09 Oct, 2019 1 commit
• Fix data loading memory issue in pyspeech · b6e001f6
      Alex Xiao authored
      Summary:
We currently shard data when creating the batch iterator: we first load all indices/frame lengths/handles into memory and only then shard. This makes it impossible to train on large datasets with a large number of workers, because each worker must load the entire dataset into memory. For training on a million hours of data (i.e. semi-supervised or unsupervised approaches), this data loading makes it flat-out impossible to use 8 GPUs.
      
      3 changes:
      
1. This diff modifies data loading so that we shard while reading the handles file, rather than later. The modification is done on a task-by-task basis, since the task specifies how its data is loaded. I've tried to keep the code compatible with both sharding during handle loading and sharding during batch iteration; currently only the aligned_training task shards during handle loading.

2. To support sharding at data loading time, together with the requirement that all shards have exactly the same number of batches, I've added a method that synchronizes shard sizes: shards with too many batches truncate the extras, similar to what we already do (see the sketch after this list).

3. In fairspeq/train.py, we actually load the training dataset and batch iterator twice: once in train.py and once when loading the checkpoint (which we always do, regardless of whether a checkpoint exists). This doubles the loading time, which can be painful for very large files; I've removed the extraneous loading in this diff as well.
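
A minimal sketch of the two sharding pieces (illustrative names, not the diff's exact code)::

      import torch
      import torch.distributed as dist

      def load_handles(path, shard_id, num_shards):
          # Keep only this worker's shard while streaming the handles file,
          # instead of materializing the full dataset and sharding later.
          samples = []
          with open(path) as f:
              for i, line in enumerate(f):
                  if i % num_shards == shard_id:
                      samples.append(line.rstrip("\n"))
          return samples

      def equalize_num_batches(num_local_batches):
          # All shards must expose the same number of batches; truncate down
          # to the global minimum across workers.
          t = torch.tensor([num_local_batches])
          dist.all_reduce(t, op=dist.ReduceOp.MIN)
          return int(t.item())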
      
      Reviewed By: yqwangustc
      
      Differential Revision: D17750715
      
      fbshipit-source-id: 0e6e3d363525fa5661f1c784303390ea13f46377
  13. 08 Oct, 2019 3 commits
• Add printing of PyTorch memory summary on OOM (#885) · 63b6b3f4
      Jerry Ma authored
      Summary:
PyTorch now has more comprehensive memory instrumentation, added in https://github.com/pytorch/pytorch/pull/27361. This PR makes fairseq print a summary table of the memory state when an OOM occurs (roughly the pattern sketched below).
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/885
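
Roughly the pattern involved (a sketch; fairseq's actual OOM handler differs in detail)::

      import torch

      def train_step(step_fn, *args):
          try:
              return step_fn(*args)
          except RuntimeError as e:
              if "out of memory" in str(e) and torch.cuda.is_available():
                  # torch.cuda.memory_summary() is the instrumentation added
                  # in pytorch/pytorch#27361; print it before re-raising.
                  print(torch.cuda.memory_summary())
              raise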
      
      Differential Revision: D17820445
      
      Pulled By: jma127
      
      fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0
• ensemble levts · 34e79c58
      Jungo Kasai authored
      Summary:
Add ensemble wrappers to the Levenshtein NAT model.
Final softmax ensembling over the pipeline of three steps:
1. Deletion
2. Placeholder insertion
3. Word selection

Each step scores with every model, averages the scores over the ensemble, and then makes hard decisions with argmax; only then does the next step follow. By design, the three steps cannot run in parallel (a sketch follows).
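
A sketch of the per-step ensembling (illustrative; the actual wrappers operate inside the Levenshtein decoding loop)::

      import math
      import torch

      def ensemble_decisions(per_model_lprobs):
          # per_model_lprobs: list of (batch, seq_len, vocab) log-probs,
          # one per ensemble member. Average in probability space, then
          # commit to a hard argmax before the next step runs.
          stacked = torch.stack(per_model_lprobs, dim=0)
          avg_lprobs = torch.logsumexp(stacked, dim=0) - math.log(len(per_model_lprobs))
          return avg_lprobs.argmax(dim=-1)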
      
      Reviewed By: kahne
      
      Differential Revision: D17723202
      
      fbshipit-source-id: 05f7a4fcd922a972cc4796ca397e8220f0b4d53e
• fix max lengths in Levenshtein Transformer · c2165224
      Changhan Wang authored
Summary: Fix the max length calculation in the Levenshtein Transformer (idea sketched below).
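
A sketch of the idea (names approximate, not the diff's exact code): bound each output length by a ratio of its true source length rather than a padded constant::

      # encoder_padding_mask marks pad positions in the source batch.
      src_lens = (~encoder_padding_mask).sum(1)            # true source lengths
      max_lens = (src_lens.float() * max_ratio).clamp(min=10).long()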
      
      Reviewed By: jhcross
      
      Differential Revision: D17672946
      
      fbshipit-source-id: e5efbe7e56cf879d3e822864e4398f99f45b04d4
  14. 07 Oct, 2019 1 commit
• Setting global sync to 50 in BMUF · 6f58e15e
      Nayan Singhal authored
      Summary:
In all our final settings we use global_sync = 50, and we get results comparable to DDP and Caffe2.

This sets the default --global-sync-iter to 50, so users can simply pass --use-bmuf to enable BMUF for training (illustrative usage below).
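
With the new default, enabling BMUF reduces to something like (illustrative command; other training flags elided)::

      python train.py $DATA_PATH --use-bmuf ...  # --global-sync-iter defaults to 50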
      
      Reviewed By: skritika
      
      Differential Revision: D17765094
      
      fbshipit-source-id: 369591eeff266d757f89e1fc8dda01711146fdbc
  15. 05 Oct, 2019 1 commit
  16. 04 Oct, 2019 2 commits
  17. 01 Oct, 2019 1 commit
  18. 30 Sep, 2019 2 commits
  19. 29 Sep, 2019 2 commits
  20. 28 Sep, 2019 1 commit
  21. 27 Sep, 2019 3 commits
• Fixing example of batched predictions for Roberta (#1195) · 1cb267ed
      Aditya Chetan authored
      Summary:
The README example for batched predictions with Roberta was unclear. After a thorough discussion with ngoyal2707 in issue https://github.com/pytorch/fairseq/issues/1167, he gave a clear example of how batched predictions are supposed to be done. Since I spent a lot of time on this inconsistency, I thought the community might benefit from having his solution in the official README 😄 (the gist is sketched below)!
      
For details, see issue https://github.com/pytorch/fairseq/issues/1167
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/1195
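
The gist of the recipe (as I understand the README fix; pad_idx=1 assumes the default RoBERTa dictionary)::

      import torch
      from fairseq.data.data_utils import collate_tokens

      roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
      roberta.eval()

      batch_of_pairs = [
          ['Roberta is a heavily optimized version of BERT.',
           'Roberta is not very optimized.'],
          ['Roberta is a heavily optimized version of BERT.',
           'Roberta is based on BERT.'],
      ]
      # Encode each sentence pair, then pad into a single batch tensor.
      batch = collate_tokens(
          [roberta.encode(pair[0], pair[1]) for pair in batch_of_pairs],
          pad_idx=1,
      )
      logprobs = roberta.predict('mnli', batch)
      print(logprobs.argmax(dim=1))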
      
      Differential Revision: D17639354
      
      Pulled By: myleott
      
      fbshipit-source-id: 3eb60c5804a6481f533b19073da7880dfd0d522d
• Levenshtein Transformer paper code · 86857a58
      Changhan Wang authored
      Summary:
      Code for our NeurIPS paper [Levenshtein Transformer](https://arxiv.org/abs/1905.11006)
* Added Levenshtein Transformer model, task and criterion classes
* Added iterative NAT Transformer, insertion Transformer and CMLM Transformer model classes as baselines
* Added an option for prepending BOS to the dictionary class and translation task class
      
      Reviewed By: myleott
      
      Differential Revision: D17297372
      
      fbshipit-source-id: 54eca60831ae95dc721c2c34e882e1810ee575c7
• Fixing BMUF warmup and sync strategy · 6c1da0f7
      Nayan Singhal authored
      Summary:
BMUF sync started happening even before warmup was done.
This diff fixes the behavior: BMUF sync happens once warmup is done, or immediately if warmup is zero (a sketch of the condition follows).

TODO: write a unit test case so that such problems can be figured out faster.
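
A minimal sketch of the intended condition (illustrative names, not the diff's exact code)::

      def should_sync(num_updates, warmup_iterations, global_sync_iter):
          # No block sync during warmup; once warmup is done (or if it is
          # zero), sync every global_sync_iter updates.
          if num_updates < warmup_iterations:
              return False
          return (num_updates - warmup_iterations) % global_sync_iter == 0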
      
      Reviewed By: jay-mahadeokar
      
      Differential Revision: D17356277
      
      fbshipit-source-id: 21500e6ed1225b97794e4ee203e5d7d04a2840f8