1. 07 Nov, 2019 2 commits
  2. 06 Nov, 2019 2 commits
  3. 05 Nov, 2019 2 commits
    • XLM-R code and model release (#900) · e23e5eaa
      ngoyal2707 authored
      Summary:
      TODO:
      1) Need to update bibtex entry
      2) Need to upload models, spm_vocab, and dict.txt to a public S3 location.
      
      For Future:
      
      1) I will probably add instructions for fine-tuning on XNLI, NER, POS, etc., but there is currently no timeline for that.
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/900
      
      Reviewed By: myleott
      
      Differential Revision: D18333076
      
      Pulled By: myleott
      
      fbshipit-source-id: 3f3d3716fcc41c78d2dd4525f60b519abbd0459c
    • Fixing key padding mask during transformer generation · 68dd3e17
      Spencer Poff authored
      Summary:
      https://github.com/pytorch/fairseq/pull/1097 added key padding mask history in TransformerDecoderLayer, but in an edge case where only the current or only the previous key_padding_mask exists, the resulting key_padding_mask has the wrong size.

      This diff adds empty columns in that case to ensure key_padding_mask has a usable size.
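
      A minimal sketch of the fix's idea (names are illustrative, not the exact diff): when only one of the two masks exists, fill in the missing one with "no padding" (all-False) columns so the concatenated mask covers the full key length::

            import torch

            def combine_key_padding_masks(prev_mask, curr_mask, bsz, prev_len, curr_len):
                # If one side is missing, substitute an all-False block of the
                # right shape so torch.cat produces a correctly sized mask.
                if prev_mask is None:
                    prev_mask = torch.zeros(bsz, prev_len, dtype=torch.bool)
                if curr_mask is None:
                    curr_mask = torch.zeros(bsz, curr_len, dtype=torch.bool)
                return torch.cat([prev_mask, curr_mask], dim=1)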
      
      Reviewed By: myleott
      
      Differential Revision: D18224313
      
      fbshipit-source-id: c9fb7266baf0a2d79a66704e00a5ea8bd2987ff6
  4. 02 Nov, 2019 1 commit
  5. 01 Nov, 2019 2 commits
  6. 31 Oct, 2019 2 commits
  7. 30 Oct, 2019 1 commit
    • layer drop · 856d8b82
      Xian Li authored
      Summary: This diff enables layer drop in the transformer decoder in the production training pipeline (ptt_transformer). It builds on top of the fairseq implementation D18094657 added by Angela Fan, and adds logic to drop the corresponding layers at test time in the exported model.
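
      As a rough illustration of the mechanism (not the production code; names are illustrative), LayerDrop skips each decoder layer with some probability at training time and runs all retained layers deterministically at test time::

            import torch

            def run_decoder_layers(x, layers, layerdrop=0.2, training=True):
                # During training, each layer is skipped independently with
                # probability `layerdrop`; at inference every layer is applied.
                for layer in layers:
                    if training and torch.rand(1).item() < layerdrop:
                        continue
                    x = layer(x)
                return x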
      
      Reviewed By: jhcross
      
      Differential Revision: D18165586
      
      fbshipit-source-id: 373ac00268a25fa9e412edcb483becdfe792d992
  8. 28 Oct, 2019 1 commit
    • Fix LevT generator interface · 50cf3bb5
      Ning Dong authored
      Summary: Revert the interface change for iterative_refinement_generator
      
      Reviewed By: kahne
      
      Differential Revision: D18165103
      
      fbshipit-source-id: 075c276746eb90d7c359b6ad92e1ef25e8452bcc
  9. 27 Oct, 2019 1 commit
    • adding layerdrop code for training, pruning, and readme (#890) · dabbef46
      Angela Fan authored
      Summary:
      TEST 1: EVALUATION TIME WORKS
      checked
      achieves correct model perplexity: 18.68
      
      TEST 2: TRAINING NEW MODEL WORKS
      checked
      
      without layerdrop:
      --decoder-layerdrop 0 OR no flag at all
      | epoch 001:     10 / 11201 loss=27.469, nll_loss=27.469, ppl=185799477.36, wps=1764, ups=0, wpb=9216.000, bsz=3.000, num_updates=7, lr=0.0004376, gnorm=25.471, clip=1.000, oom=0.000, loss_scale=8.000, wall=37, train_wall=30
      | epoch 001:     20 / 11201 loss=27.443, nll_loss=27.443, ppl=182500427.22, wps=2449, ups=0, wpb=9216.000, bsz=3.000, num_updates=17, lr=0.0010626, gnorm=25.273, clip=1.000, oom=0.000, loss_scale=8.000, wall=64, train_wall=57
      | epoch 001:     30 / 11201 loss=27.404, nll_loss=27.404, ppl=177612215.78, wps=2720, ups=0, wpb=9216.000, bsz=3.000, num_updates=27, lr=0.0016876, gnorm=25.136, clip=1.000, oom=0.000, loss_scale=8.000, wall=91, train_wall=84
      | epoch 001:     40 / 11201 loss=27.009, nll_loss=27.009, ppl=135079983.00, wps=2865, ups=0, wpb=9216.000, bsz=3.000, num_updates=37, lr=0.0023126, gnorm=24.311, clip=1.000, oom=0.000, loss_scale=8.000, wall=119, train_wall=112
      | epoch 001:     50 / 11201 loss=26.418, nll_loss=26.418, ppl=89680259.41, wps=2952, ups=0, wpb=9216.000, bsz=3.000, num_updates=47, lr=0.0029376, gnorm=22.775, clip=1.000, oom=0.000, loss_scale=8.000, wall=147, train_wall=140
      
      with layerdrop (regularization effect should be seen in PPL):
      --decoder-layerdrop 0.2
      
      | epoch 001:     10 / 11201 loss=25.186, nll_loss=25.186, ppl=38182937.27, wps=2428, ups=0, wpb=9216.000, bsz=3.000, num_updates=8, lr=0.0005001, gnorm=17.082, clip=1.000, oom=0.000, loss_scale=16.000, wall=30, train_wall=24
      | epoch 001:     20 / 11201 loss=25.270, nll_loss=25.270, ppl=40451933.50, wps=3173, ups=0, wpb=9216.000, bsz=3.000, num_updates=18, lr=0.0011251, gnorm=17.162, clip=1.000, oom=0.000, loss_scale=16.000, wall=52, train_wall=45
      | epoch 001:     30 / 11201 loss=25.349, nll_loss=25.349, ppl=42752256.68, wps=3454, ups=0, wpb=9216.000, bsz=3.000, num_updates=28, lr=0.0017501, gnorm=17.370, clip=1.000, oom=0.000, loss_scale=16.000, wall=75, train_wall=68
      | epoch 001:     40 / 11201 loss=25.115, nll_loss=25.115, ppl=36343806.30, wps=3619, ups=0, wpb=9216.000, bsz=3.000, num_updates=38, lr=0.0023751, gnorm=16.945, clip=1.000, oom=0.000, loss_scale=16.000, wall=97, train_wall=90
      | epoch 001:     50 / 11201 loss=24.804, nll_loss=24.804, ppl=29284345.78, wps=3716, ups=0, wpb=9216.000, bsz=3.000, num_updates=48, lr=0.0030001, gnorm=16.406, clip=1.000, oom=0.000, loss_scale=16.000, wall=119, train_wall=112
      
      TEST 3: PICKING UP TRAINING FROM EXISTING MODEL
      checked
      
      | loaded checkpoint /checkpoint/angelafan/structured_0.1_block_8_sd02/checkpoint_last.pt (epoch 272 @ 381066 updates)
      | loading train data for epoch 272
      | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
      
      TEST 4: EVALUATING EXISTING BERT MODEL REPROS RESULTS
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Accuracy:  0.9231651376146789
      achieves correct accuracy on SST2 for this model
      
      TEST 5: TRAINING NEW BERT MODEL WORKS
      checked and works
      
      TEST 6: NMT
      
      without layerdrop
      --encoder-layerdrop 0 --decoder-layerdrop 0 OR combinations of flag specified and not specified
      
      | epoch 001:     10 / 92203 loss=15.820, nll_loss=15.830, ppl=58267.93, wps=4902, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=7.207, clip=0.000, oom=0.000, loss_scale=128.000, wall=60, train_wall=3
      | epoch 001:     20 / 92203 loss=15.523, nll_loss=15.501, ppl=46359.29, wps=5037, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.869, clip=0.000, oom=0.000, loss_scale=128.000, wall=63, train_wall=6
      | epoch 001:     30 / 92203 loss=15.185, nll_loss=15.123, ppl=35695.79, wps=5085, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.186, clip=0.000, oom=0.000, loss_scale=128.000, wall=66, train_wall=9
      | epoch 001:     40 / 92203 loss=14.940, nll_loss=14.849, ppl=29505.60, wps=5116, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=5.610, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=12
      | epoch 001:     50 / 92203 loss=14.745, nll_loss=14.630, ppl=25346.87, wps=5070, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.104, clip=0.000, oom=0.000, loss_scale=128.000, wall=71, train_wall=15
      
      with layerdrop (regularization effect should be seen in PPL)
      
      A) works with --encoder-layerdrop 0.2 --decoder-layerdrop 0.2
      B) works with different settings --encoder-layerdrop 0.3 --decoder-layerdrop 0.5
      C) works with one on and one off --encoder-layerdrop 0.2 --decoder-layerdrop 0
      
      | epoch 001:     10 / 92203 loss=15.817, nll_loss=15.828, ppl=58158.54, wps=5355, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=6.959, clip=0.000, oom=0.000, loss_scale=128.000, wall=59, train_wall=3
      | epoch 001:     20 / 92203 loss=15.650, nll_loss=15.641, ppl=51111.63, wps=5515, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.825, clip=0.000, oom=0.000, loss_scale=128.000, wall=61, train_wall=6
      | epoch 001:     30 / 92203 loss=15.440, nll_loss=15.408, ppl=43491.58, wps=5602, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.576, clip=0.000, oom=0.000, loss_scale=128.000, wall=64, train_wall=8
      | epoch 001:     40 / 92203 loss=15.247, nll_loss=15.193, ppl=37457.14, wps=5676, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=6.124, clip=0.000, oom=0.000, loss_scale=128.000, wall=67, train_wall=11
      | epoch 001:     50 / 92203 loss=15.055, nll_loss=14.977, ppl=32259.92, wps=5598, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.661, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=14
      
      TEST 7: PRUNING TESTCASES
      
      A) after adding the pruning flags, model can evaluate as a full model
      checked, reaches correct PPL
      num. model params: 246933504
      | Evaluated 217646 tokens in 196.3s (1108.99 tokens/s)
      | Loss: 2.9275, Perplexity: 18.68
      
      B) after adding pruning flags, model can be pruned. this works with multiple flag settings
      checked three cases:
      num. model params: 146163712
      | Evaluated 217646 tokens in 106.0s (2054.07 tokens/s)
      | Loss: 3.0932, Perplexity: 22.05
      
      num. model params: 209144832
      | Evaluated 217646 tokens in 162.8s (1336.99 tokens/s)
      | Loss: 2.9526, Perplexity: 19.16
      
      C) model can pick up training if you want to finetune the pruned model
      checked:
      | loading train data for epoch 272
      | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
      | WARNING: overflow detected, setting loss scale to: 64.0
      | WARNING: overflow detected, setting loss scale to: 32.0
      | epoch 272:   1500 / 5601 loss=5.015, nll_loss=5.015, ppl=32.33, wps=11598, ups=1, wpb=18432.000, bsz=6.000, num_updates=98, lr=0.0061251, gnorm=0.613, clip=1.000, oom=0.000, loss_scale=32.000, wall=156, train_wall=252396
      
      D) works with BERT
      checked:
      without specifying any flags, reproduces the correct standard accuracy
      with flags, produces the correct pruned accuracy
      
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Accuracy:  0.9231651376146789
      
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Pruning model to specified layer configuration - this works best if the model was trained with LayerDrop
      | Accuracy:  0.9220183486238532
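
      A hedged sketch of what "pruning to a specified layer configuration" amounts to: keep only a chosen subset of the trained layers (the indices below are illustrative, not the actual flags)::

            import torch.nn as nn

            def prune_layers(layers, layers_to_keep):
                # E.g. keep every other layer of a 16-layer decoder trained
                # with LayerDrop: layers_to_keep = range(0, 16, 2).
                keep = set(layers_to_keep)
                return nn.ModuleList([l for i, l in enumerate(layers) if i in keep])
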
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/890
      
      Reviewed By: edunov
      
      Differential Revision: D18094657
      
      Pulled By: huihuifan
      
      fbshipit-source-id: 2bbaa2ff0039e906782694fc2038b8c17a8693e7
  10. 26 Oct, 2019 1 commit
    • fix a type mismatch in NAT quantization run · eb68afca
      Xian Li authored
      Summary:
      Fix a type mismatch which was found after patching NAT on top of quantization.
      Ning suggested this fix. Need to further understand why this only appears after patching the quantization diff.
      
      Reviewed By: kahne, jhcross
      
      Differential Revision: D18147726
      
      fbshipit-source-id: a51becc9ad58a637a0180074eaa2b46990ab9f84
  11. 25 Oct, 2019 2 commits
  12. 24 Oct, 2019 4 commits
  13. 23 Oct, 2019 1 commit
    • Add warmup support in reduce_on_plateau lr schedule · 8defa9d9
      Yilei Li authored
      Summary:
      Enables the reduce_on_plateau schedule with an optional warmup phase, where we linearly increase the learning rate from an initial learning rate (``--warmup-init-lr``) up to the configured learning rate (``--lr``). Thereafter the lr is adjusted according to the original reduce_on_plateau scheme.
      During warmup::
      
            lrs = torch.linspace(args.warmup_init_lr, args.lr, args.warmup_updates)
            lr = lrs[update_num]
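
      For reference, a minimal sketch of how the two phases can fit together (the class name is illustrative; only the flag names above come from the diff)::

            import torch

            class WarmupThenPlateau:
                """Linear warmup to --lr, then defer to ReduceLROnPlateau."""

                def __init__(self, optimizer, warmup_init_lr, lr, warmup_updates):
                    self.optimizer = optimizer
                    self.warmup_updates = warmup_updates
                    self.lrs = torch.linspace(warmup_init_lr, lr, warmup_updates)
                    self.plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

                def step_update(self, update_num):
                    # Per-update hook: still warming up, so set the lr directly.
                    if update_num < self.warmup_updates:
                        for group in self.optimizer.param_groups:
                            group["lr"] = self.lrs[update_num].item()

                def step(self, val_loss):
                    # Per-epoch hook: after warmup, reduce the lr on plateaus.
                    self.plateau.step(val_loss)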
      
      Reviewed By: yqwangustc
      
      Differential Revision: D17779925
      
      fbshipit-source-id: c3bfb3321c76850824fc42df4fac4e5dcf73fbf8
  14. 22 Oct, 2019 3 commits
  15. 20 Oct, 2019 2 commits
    • Enable separate models for insertion and deletion; · 66d24dc2
      Jiatao Gu authored
      Summary:
      The diff contains two fixes:
      (1) enabling non-shared decoder layers for deletion/insertion
      (2) adding options to perform sampling instead of argmax when learning deletion
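
      For fix (2), the difference between the two decision rules might look like this (a sketch, not the diff itself)::

            import torch

            def choose_deletions(del_logits, sampling=False):
                # del_logits: (batch, length, 2) keep/delete scores per token.
                probs = torch.softmax(del_logits, dim=-1)
                if sampling:
                    # Sample a keep/delete decision per token...
                    return torch.multinomial(probs.view(-1, 2), 1).view(probs.shape[:-1])
                # ...instead of always taking the argmax.
                return probs.argmax(dim=-1)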
      
      Reviewed By: kahne
      
      Differential Revision: D18011220
      
      fbshipit-source-id: c60815fb7bc3a0004c81249504f7a641536ae2d8
    • Fix typos on Examples for Nonautoregressive translation · a3c629b5
      Jiatao Gu authored
      Summary: Fix typos in the examples
      
      Reviewed By: kahne
      
      Differential Revision: D18030097
      
      fbshipit-source-id: 84f0cbafd85e50ffd5033738835373935e3b83d4
  16. 18 Oct, 2019 3 commits
  17. 15 Oct, 2019 2 commits
    • Add Unit test cases for BMUF · b5f41f82
      Nayan Singhal authored
      Summary:
      This unit test guards the BMUF code.
      
      Change:
      1. distributed_init assumes we are always using a CUDA device, which is not the case when using the "gloo" backend on a CPU machine.
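
      For context, a minimal sketch of initializing the process group without assuming CUDA (illustrative, not the fairseq distributed_init code)::

            import torch
            import torch.distributed as dist

            def init_distributed(backend, init_method, world_size, rank, device_id=0):
                dist.init_process_group(backend=backend, init_method=init_method,
                                        world_size=world_size, rank=rank)
                # Guard the CUDA call: the "gloo" backend may run on CPU-only hosts.
                if torch.cuda.is_available():
                    torch.cuda.set_device(device_id)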
      
      Reviewed By: jay-mahadeokar
      
      Differential Revision: D17821391
      
      fbshipit-source-id: 28e1bb39f7a4889b1dc6bd636b7c499e55bfc69a
    • fix libnat imports · e3a40d9d
      Changhan Wang authored
      Summary: Bring back the changes in D17661768
      
      Reviewed By: ailzhang
      
      Differential Revision: D17920299
      
      fbshipit-source-id: be3f93a044a8710c8b475012c39e36a3e6507fad
  18. 12 Oct, 2019 1 commit
  19. 11 Oct, 2019 2 commits
    • fix the random mask function for CMLM model · 02b74c58
      Jiatao Gu authored
      Summary: The original implementation of the random mask differs from what was stated in the paper.
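
      A sketch of the masking scheme the paper describes (uniformly sample how many tokens to mask, then mask that many random positions); this is illustrative, not the exact fix::

            import torch

            def random_mask(tokens, pad_idx, mask_idx):
                non_pad = tokens.ne(pad_idx)
                lengths = non_pad.sum(dim=1)
                masked = tokens.clone()
                for i in range(tokens.size(0)):
                    # k ~ Uniform{1, ..., num_non_pad_tokens}
                    k = torch.randint(1, int(lengths[i]) + 1, (1,)).item()
                    candidates = non_pad[i].nonzero(as_tuple=True)[0]
                    chosen = candidates[torch.randperm(candidates.numel())[:k]]
                    masked[i, chosen] = mask_idx
                return masked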
      
      Reviewed By: kahne
      
      Differential Revision: D17652564
      
      fbshipit-source-id: 238a9158041b3ff2482ee50ce6151c3f77f0b2c1
    • add new_arange function + FIX BUGS of returning attn values · cce92bdd
      Jiatao Gu authored
      Summary:
      Implementation of the Levenshtein Transformer paper.
      Add a new helper function "new_arange" to create arange tensors easily.
      Fix bugs in returning attn values for NAT models.
      Delete files which are unnecessary or experimental.
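
      The helper roughly amounts to the following (a sketch of the described utility, not necessarily the exact code)::

            import torch

            def new_arange(x, *size):
                # Build an arange tensor on x's device, broadcast to `size`
                # (defaulting to x's own size), so callers avoid device plumbing.
                if len(size) == 0:
                    size = x.size()
                return torch.arange(size[-1], device=x.device).expand(*size).contiguous()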
      
      Reviewed By: kahne
      
      Differential Revision: D17652009
      
      fbshipit-source-id: 436bbb5d45de2f8067003232de4f2bd51e87719c
  20. 10 Oct, 2019 2 commits
    • Add ctc loss to ASR task (#1233) · c4893ca6
      Dmytro Okhonko authored
      Summary:
      Adds CTC loss and corresponding CTC-based transformer models.
      
      Tested with
      `CUDA_VISIBLE_DEVICES=0 python train.py $DATA_PATH --save-dir $SAVE_DIR --max-epoch 30 --task speech_recognition --arch vggtransformer_enc_1 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0  --max-tokens 10000 --log-format json --log-interval 1 --criterion ctc_loss --user-dir examples/speech_recognition/ --validate-interval=10`
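
      For reference, CTC training in PyTorch boils down to something like this minimal sketch using torch.nn.CTCLoss (shapes are illustrative; this is not the criterion code itself)::

            import torch
            import torch.nn as nn

            T, N, C = 50, 4, 32                      # frames, batch, vocab (0 = blank)
            log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
            targets = torch.randint(1, C, (N, 10), dtype=torch.long)
            input_lengths = torch.full((N,), T, dtype=torch.long)
            target_lengths = torch.full((N,), 10, dtype=torch.long)

            ctc = nn.CTCLoss(blank=0, zero_infinity=True)
            loss = ctc(log_probs, targets, input_lengths, target_lengths)
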
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/1233
      
      Reviewed By: jcai1
      
      Differential Revision: D17856824
      
      Pulled By: okhonko
      
      fbshipit-source-id: f3eac64d3fdd0c37cf8c539dd360cfb610d8a6ef
    • wav2letter integration · 33646ac9
      Jeff Cai authored
      Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/846
      
      Reviewed By: jcai1
      
      Differential Revision: D17845996
      
      Pulled By: okhonko
      
      fbshipit-source-id: 3826fd9a4418496916bf1835c319dd85c89945cc
  21. 09 Oct, 2019 1 commit
    • Fix data loading memory issue in pyspeech · b6e001f6
      Alex Xiao authored
      Summary:
      We currently shard data when creating the batch iterator. This means we first load all indices/frame lengths/handles into memory, and then do the sharding. This makes it impossible to train on large datasets with a large number of workers, because each worker needs to load the entire dataset into memory. For training on a million hours of data (i.e. semi-supervised or unsupervised approaches), this data loading makes it flat-out impossible to use 8 GPUs.
      
      3 changes:
      
      1. This diff modifies the data loading so that we do the sharding while we read the handles file, rather than later (see the sketch after this list). This modification is done on a task-by-task basis, since the task specifies how the data is loaded. I've tried to make the code compatible with both sharding during handle loading and sharding during batch iteration. I've currently only done the sharding during handle loading for the aligned_training task.

      2. To support data sharding at data loading time together with the requirement that all shards have exactly the same number of batches, I've added a method to synchronize them: shards with too many batches simply truncate the extra ones, similar to what we already do.

      3. In fairspeq/train.py, we are actually loading the training dataset and batch iterator twice: once in train.py and once when loading the checkpoint (which we always do regardless of whether there is a checkpoint). This means double the loading time, which can be painful for very large files. I've removed the extraneous loading in this diff as well.
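
      A rough sketch of change 1 (shard while reading the handles file) and the truncation from change 2; the file format and names are illustrative::

            def load_handles_sharded(path, shard_id, num_shards):
                # Keep only every num_shards-th line for this worker while
                # reading, instead of loading everything and sharding later.
                handles = []
                with open(path) as f:
                    for i, line in enumerate(f):
                        if i % num_shards == shard_id:
                            handles.append(line.rstrip("\n"))
                return handles

            def truncate_to_min_batches(batches, min_num_batches):
                # Every shard must expose the same number of batches, so
                # shards with extras simply drop them.
                return batches[:min_num_batches]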
      
      Reviewed By: yqwangustc
      
      Differential Revision: D17750715
      
      fbshipit-source-id: 0e6e3d363525fa5661f1c784303390ea13f46377
  22. 08 Oct, 2019 2 commits
    • Add printing of PyTorch memory summary on OOM (#885) · 63b6b3f4
      Jerry Ma authored
      Summary:
      PyTorch now has more comprehensive memory instrumentation, added in https://github.com/pytorch/pytorch/pull/27361 . This PR makes fairseq print a summary table of the memory state when an OOM occurs.
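
      The gist, as a simplified sketch rather than the exact patch: catch the CUDA OOM, dump torch.cuda.memory_summary(), and re-raise::

            import torch

            def forward_with_oom_report(step_fn, *args):
                try:
                    return step_fn(*args)
                except RuntimeError as e:
                    if "out of memory" in str(e) and torch.cuda.is_available():
                        # Print PyTorch's per-device memory summary table to
                        # show where memory went before the OOM.
                        print(torch.cuda.memory_summary(), flush=True)
                    raise
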
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/885
      
      Differential Revision: D17820445
      
      Pulled By: jma127
      
      fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0
    • ensemble levts · 34e79c58
      Jungo Kasai authored
      Summary:
      Add ensemble wrappers to the Levenshtein NAT.
      Final softmax ensemble over the pipeline of three steps:
      1. Deletion
      2. Placeholder Insertion
      3. Word Selection

      Each step involves scoring, averaging the scores over the ensemble, and then making hard decisions with argmax; then the next step follows. We cannot do the three steps in parallel by design.
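
      A hedged sketch of the per-step ensembling: score with every model, average the scores, take the argmax, and only then move to the next step (function names are illustrative)::

            import torch

            def ensemble_decisions(models, score_fn, *inputs):
                # Score the same inputs with each model, average over the
                # ensemble, then make this step's hard decision with argmax.
                scores = torch.stack([score_fn(m, *inputs) for m in models], dim=0)
                return scores.mean(dim=0).argmax(dim=-1)

      The deletion, placeholder-insertion, and word-selection steps each go through this in turn, since the output of one step is the input to the next.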
      
      Reviewed By: kahne
      
      Differential Revision: D17723202
      
      fbshipit-source-id: 05f7a4fcd922a972cc4796ca397e8220f0b4d53e