1. 13 Nov, 2019 1 commit
  2. 10 Nov, 2019 1 commit
  3. 09 Nov, 2019 1 commit
  4. 08 Nov, 2019 1 commit
    • Fix LevT edge cases · e9171ce1
      Xian Li authored
      Summary:
      Avoid the case where `can_ins_mask` is all False, which makes `max_lengths` have size [0, 1] and fail the `expand_as` operator. Move the computation back into the skipping branch in the scripted code.

      The same fix applies to deletion and ins_word.
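
      A minimal sketch of the guard described above (the tensor names are illustrative, not the actual LevT variables): compute and expand the maximum lengths only for the rows that can actually insert, so `expand_as` never sees an empty selection.

      ```python
      import torch

      def masked_insertion_lengths(scores, can_ins_mask, max_len=255):
          # scores: (batch, n_positions) placeholder-insertion scores
          # can_ins_mask: (batch,) bool, True where insertion is allowed
          out = scores.new_zeros(scores.size(), dtype=torch.long)
          if can_ins_mask.any():  # skip the branch entirely when every entry is False
              selected = scores[can_ins_mask]  # (n_selected, n_positions)
              max_lengths = selected.new_full((selected.size(0), 1), max_len)
              out[can_ins_mask] = max_lengths.expand_as(selected).long()
          return out
      ```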
      
      Reviewed By: kahne
      
      Differential Revision: D18365340
      
      fbshipit-source-id: 509ac21d7d6fd9083d0710697288203977314c52
  5. 05 Nov, 2019 2 commits
    • XLM-R code and model release (#900) · e23e5eaa
      ngoyal2707 authored
      Summary:
      TODO:
      1) Need to update the bibtex entry.
      2) Need to upload models, spm_vocab and dict.txt to a public S3 location.
      
      For Future:
      
      1) I will probably add instructions for fine-tuning on XNLI, NER, POS, etc., but there is currently no timeline for that.
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/900
      
      Reviewed By: myleott
      
      Differential Revision: D18333076
      
      Pulled By: myleott
      
      fbshipit-source-id: 3f3d3716fcc41c78d2dd4525f60b519abbd0459c
    • Fixing key padding mask during transformer generation · 68dd3e17
      Spencer Poff authored
      Summary:
      https://github.com/pytorch/fairseq/pull/1097 added key padding mask history to TransformerDecoderLayer, but in the edge case where only the current or only the previous key_padding_mask exists, the resulting key_padding_mask has the wrong size.

      This diff adds empty columns in that case to ensure the key_padding_mask has a usable size.
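
      A sketch of that padding idea (not the exact fairseq code; the helper name is hypothetical): when only one of the previous or current masks exists, the missing one is filled with all-False columns so the two can be concatenated over the full key length.

      ```python
      import torch

      def combine_key_padding_masks(prev_mask, curr_mask, batch_size, prev_len, curr_len, device):
          # prev_mask / curr_mask: (batch, prev_len) / (batch, curr_len) bool tensors or None
          if prev_mask is None and curr_mask is None:
              return None
          if prev_mask is None:
              prev_mask = torch.zeros(batch_size, prev_len, dtype=torch.bool, device=device)
          if curr_mask is None:
              curr_mask = torch.zeros(batch_size, curr_len, dtype=torch.bool, device=device)
          # The combined mask covers the saved-state keys plus the current keys.
          return torch.cat([prev_mask, curr_mask], dim=1)
      ```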
      
      Reviewed By: myleott
      
      Differential Revision: D18224313
      
      fbshipit-source-id: c9fb7266baf0a2d79a66704e00a5ea8bd2987ff6
  6. 31 Oct, 2019 1 commit
  7. 27 Oct, 2019 1 commit
    • adding layerdrop code for training, pruning, and readme (#890) · dabbef46
      Angela Fan authored
      Summary:
      TEST 1: EVALUATION TIME WORKS
      checked
      achieves correct model perplexity: 18.68
      
      TEST 2: TRAINING NEW MODEL WORKS
      checked
      
      without layerdrop:
      --decoder-layerdrop 0 OR no flag at all
      | epoch 001:     10 / 11201 loss=27.469, nll_loss=27.469, ppl=185799477.36, wps=1764, ups=0, wpb=9216.000, bsz=3.000, num_updates=7, lr=0.0004376, gnorm=25.471, clip=1.000, oom=0.000, loss_scale=8.000, wall=37, train_wall=30
      | epoch 001:     20 / 11201 loss=27.443, nll_loss=27.443, ppl=182500427.22, wps=2449, ups=0, wpb=9216.000, bsz=3.000, num_updates=17, lr=0.0010626, gnorm=25.273, clip=1.000, oom=0.000, loss_scale=8.000, wall=64, train_wall=57
      | epoch 001:     30 / 11201 loss=27.404, nll_loss=27.404, ppl=177612215.78, wps=2720, ups=0, wpb=9216.000, bsz=3.000, num_updates=27, lr=0.0016876, gnorm=25.136, clip=1.000, oom=0.000, loss_scale=8.000, wall=91, train_wall=84
      | epoch 001:     40 / 11201 loss=27.009, nll_loss=27.009, ppl=135079983.00, wps=2865, ups=0, wpb=9216.000, bsz=3.000, num_updates=37, lr=0.0023126, gnorm=24.311, clip=1.000, oom=0.000, loss_scale=8.000, wall=119, train_wall=112
      | epoch 001:     50 / 11201 loss=26.418, nll_loss=26.418, ppl=89680259.41, wps=2952, ups=0, wpb=9216.000, bsz=3.000, num_updates=47, lr=0.0029376, gnorm=22.775, clip=1.000, oom=0.000, loss_scale=8.000, wall=147, train_wall=140
      
      with layerdrop (the regularization effect should be visible in PPL; a minimal sketch of the mechanism follows these logs):
      --decoder-layerdrop 0.2
      
      | epoch 001:     10 / 11201 loss=25.186, nll_loss=25.186, ppl=38182937.27, wps=2428, ups=0, wpb=9216.000, bsz=3.000, num_updates=8, lr=0.0005001, gnorm=17.082, clip=1.000, oom=0.000, loss_scale=16.000, wall=30, train_wall=24
      | epoch 001:     20 / 11201 loss=25.270, nll_loss=25.270, ppl=40451933.50, wps=3173, ups=0, wpb=9216.000, bsz=3.000, num_updates=18, lr=0.0011251, gnorm=17.162, clip=1.000, oom=0.000, loss_scale=16.000, wall=52, train_wall=45
      | epoch 001:     30 / 11201 loss=25.349, nll_loss=25.349, ppl=42752256.68, wps=3454, ups=0, wpb=9216.000, bsz=3.000, num_updates=28, lr=0.0017501, gnorm=17.370, clip=1.000, oom=0.000, loss_scale=16.000, wall=75, train_wall=68
      | epoch 001:     40 / 11201 loss=25.115, nll_loss=25.115, ppl=36343806.30, wps=3619, ups=0, wpb=9216.000, bsz=3.000, num_updates=38, lr=0.0023751, gnorm=16.945, clip=1.000, oom=0.000, loss_scale=16.000, wall=97, train_wall=90
      | epoch 001:     50 / 11201 loss=24.804, nll_loss=24.804, ppl=29284345.78, wps=3716, ups=0, wpb=9216.000, bsz=3.000, num_updates=48, lr=0.0030001, gnorm=16.406, clip=1.000, oom=0.000, loss_scale=16.000, wall=119, train_wall=112
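
      For reference, a minimal sketch of the LayerDrop mechanism exercised above (an illustrative module, not the fairseq implementation): during training each layer is skipped with probability p, while at evaluation time every layer runs.

      ```python
      import torch
      import torch.nn as nn

      class LayerDropStack(nn.Module):
          """Illustrative LayerDrop wrapper: each sub-layer is skipped with prob p at train time."""

          def __init__(self, layers, p=0.2):
              super().__init__()
              self.layers = nn.ModuleList(layers)
              self.p = p

          def forward(self, x):
              for layer in self.layers:
                  # Dropping whole layers acts as a regularizer during training;
                  # at inference time (self.training == False) nothing is dropped.
                  if self.training and torch.rand(1).item() < self.p:
                      continue
                  x = layer(x)
              return x
      ```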
      
      TEST 3: PICKING UP TRAINING FROM EXISTING MODEL
      checked
      
      | loaded checkpoint /checkpoint/angelafan/structured_0.1_block_8_sd02/checkpoint_last.pt (epoch 272 @ 381066 updates)
      | loading train data for epoch 272
      | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
      
      TEST 4: EVALUATING EXISTING BERT MODEL REPROS RESULTS
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Accuracy:  0.9231651376146789
      achieves correct accuracy on SST2 for this model
      
      TEST 5: TRAINING NEW BERT MODEL WORKS
      checked and works
      
      TEST 6: NMT
      
      without layerdrop
      --encoder-layerdrop 0 --decoder-layerdrop 0 OR any combination of flags specified and unspecified
      
      | epoch 001:     10 / 92203 loss=15.820, nll_loss=15.830, ppl=58267.93, wps=4902, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=7.207, clip=0.000, oom=0.000, loss_scale=128.000, wall=60, train_wall=3
      | epoch 001:     20 / 92203 loss=15.523, nll_loss=15.501, ppl=46359.29, wps=5037, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.869, clip=0.000, oom=0.000, loss_scale=128.000, wall=63, train_wall=6
      | epoch 001:     30 / 92203 loss=15.185, nll_loss=15.123, ppl=35695.79, wps=5085, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.186, clip=0.000, oom=0.000, loss_scale=128.000, wall=66, train_wall=9
      | epoch 001:     40 / 92203 loss=14.940, nll_loss=14.849, ppl=29505.60, wps=5116, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=5.610, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=12
      | epoch 001:     50 / 92203 loss=14.745, nll_loss=14.630, ppl=25346.87, wps=5070, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.104, clip=0.000, oom=0.000, loss_scale=128.000, wall=71, train_wall=15
      
      with layerdrop (regularization effect should be seen in PPL)
      
      A) works with --encoder-layerdrop 0.2 --decoder-layerdrop 0.2
      B) works with different settings --encoder-layerdrop 0.3 --decoder-layerdrop 0.5
      C) works with one on and one off --encoder-layerdrop 0.2 --decoder-layerdrop 0
      
      | epoch 001:     10 / 92203 loss=15.817, nll_loss=15.828, ppl=58158.54, wps=5355, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=6.959, clip=0.000, oom=0.000, loss_scale=128.000, wall=59, train_wall=3
      | epoch 001:     20 / 92203 loss=15.650, nll_loss=15.641, ppl=51111.63, wps=5515, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.825, clip=0.000, oom=0.000, loss_scale=128.000, wall=61, train_wall=6
      | epoch 001:     30 / 92203 loss=15.440, nll_loss=15.408, ppl=43491.58, wps=5602, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.576, clip=0.000, oom=0.000, loss_scale=128.000, wall=64, train_wall=8
      | epoch 001:     40 / 92203 loss=15.247, nll_loss=15.193, ppl=37457.14, wps=5676, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=6.124, clip=0.000, oom=0.000, loss_scale=128.000, wall=67, train_wall=11
      | epoch 001:     50 / 92203 loss=15.055, nll_loss=14.977, ppl=32259.92, wps=5598, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.661, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=14
      
      TEST 7: PRUNING TESTCASES
      
      A) after adding the pruning flags, the model can still be evaluated as a full model
      checked, reaches correct PPL
      num. model params: 246933504
      | Evaluated 217646 tokens in 196.3s (1108.99 tokens/s)
      | Loss: 2.9275, Perplexity: 18.68
      
      B) after adding the pruning flags, the model can be pruned; this works with multiple flag settings (a sketch of the pruning step follows these numbers)
      checked three cases:
      num. model params: 146163712
      | Evaluated 217646 tokens in 106.0s (2054.07 tokens/s)
      | Loss: 3.0932, Perplexity: 22.05
      
      num. model params: 209144832
      | Evaluated 217646 tokens in 162.8s (1336.99 tokens/s)
      | Loss: 2.9526, Perplexity: 19.16
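
      A sketch of the pruning step checked here, under the assumption that pruning keeps an evenly spaced subset of the trained layers (as recommended for LayerDrop-trained models); the helper below is illustrative, not the actual flag implementation.

      ```python
      import torch.nn as nn

      def prune_layers(layers, layers_to_keep):
          """Keep `layers_to_keep` evenly spaced layers out of the trained stack."""
          stride = len(layers) / layers_to_keep
          keep = {int(i * stride) for i in range(layers_to_keep)}
          return nn.ModuleList(layer for i, layer in enumerate(layers) if i in keep)
      ```

      Fewer layers means fewer parameters and faster evaluation at a small cost in perplexity, which matches the numbers above.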
      
      C) the model can pick up training if you want to fine-tune the pruned model
      checked:
      | loading train data for epoch 272
      | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
      | WARNING: overflow detected, setting loss scale to: 64.0
      | WARNING: overflow detected, setting loss scale to: 32.0
      | epoch 272:   1500 / 5601 loss=5.015, nll_loss=5.015, ppl=32.33, wps=11598, ups=1, wpb=18432.000, bsz=6.000, num_updates=98, lr=0.0061251, gnorm=0.613, clip=1.000, oom=0.000, loss_scale=32.000, wall=156, train_wall=252396
      
      D) works with BERT
      checked:
      without specifying any flags, reproduces the correct standard accuracy
      with flags, produces the correct pruned accuracy
      
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Accuracy:  0.9231651376146789
      
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Pruning model to specified layer configuration - this works best if the model was trained with LayerDrop
      | Accuracy:  0.9220183486238532
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/890
      
      Reviewed By: edunov
      
      Differential Revision: D18094657
      
      Pulled By: huihuifan
      
      fbshipit-source-id: 2bbaa2ff0039e906782694fc2038b8c17a8693e7
  8. 26 Oct, 2019 1 commit
    • fix a type mismatch in NAT quantization run · eb68afca
      Xian Li authored
      Summary:
      Fix a type mismatch that was found after patching NAT on top of quantization.
      Ning suggested this fix. We still need to understand why this only appears after patching the quantization diff.
      
      Reviewed By: kahne, jhcross
      
      Differential Revision: D18147726
      
      fbshipit-source-id: a51becc9ad58a637a0180074eaa2b46990ab9f84
  9. 24 Oct, 2019 2 commits
    • OSS tracing compliant transformer to unbreak master (#1299) · 5b086a0c
      Ning Dong authored
      Summary:
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/1299
      
      LevT calls into the tracing-compliant transformer that we did not originally plan to open-source. This is a workaround to unbreak master; we will revisit and simplify the code later.
      
      Reviewed By: pipibjc
      
      Differential Revision: D18110339
      
      fbshipit-source-id: 3bb51c56c2c20f45db1d5786d030b374b412eab1
    • NAT productionization · 5a2f76ed
      Ning Dong authored
      Summary:
      NAT productionization diff:

      (1) Integrate NAT model training / evaluation into the LATTE base training workflow.
      (2) Make NAT tracing compliant. Since it calls into the fairseq transformer, the code needed refactoring, so I created a near-copy of it named fb_tracing_transformer.
      (3) The decoder-side C++ code landed in an earlier diff.
      
      Reviewed By: xianxl
      
      Differential Revision: D17888324
      
      fbshipit-source-id: ef4ef195fddd360da921502adcef82b087e46ce6
  10. 22 Oct, 2019 1 commit
    • fix score · e49b302a
      Changhan Wang authored
      Summary: Bugfix for inconsistent scores on the same input sentences. This only affects the displayed scores in `generate.py` and does not affect the model outputs.
      
      Reviewed By: MultiPath
      
      Differential Revision: D17799343
      
      fbshipit-source-id: 2b868ac03097a4db27db736e126a61d50958acc5
  11. 20 Oct, 2019 1 commit
    • Enable separate models for insertion and deletion · 66d24dc2
      Jiatao Gu authored
      Summary:
      The diff contains two fixes:
      (1) enabling non-shared decoder layers for deletion/insertion
      (2) adding an option to perform sampling instead of argmax when learning deletion
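
      A sketch of the option in (2), assuming the deletion head produces per-token keep/delete logits (names are hypothetical): during training the deletion decision can be sampled from the softmax instead of taken greedily with argmax.

      ```python
      import torch
      import torch.nn.functional as F

      def choose_deletion(word_del_logits, sampling=False):
          # word_del_logits: (batch, seq_len, 2) per-token keep/delete scores
          if sampling:
              probs = F.softmax(word_del_logits, dim=-1)
              flat = probs.reshape(-1, probs.size(-1))
              choice = torch.multinomial(flat, 1).reshape(probs.shape[:-1])
          else:
              choice = word_del_logits.argmax(dim=-1)
          return choice == 1  # True where the decision is "delete"
      ```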
      
      Reviewed By: kahne
      
      Differential Revision: D18011220
      
      fbshipit-source-id: c60815fb7bc3a0004c81249504f7a641536ae2d8
  12. 18 Oct, 2019 2 commits
    • add missing function to FairseqLanguageModel · b8d024e9
      Spencer Poff authored
      Summary: In https://github.com/fairinternal/fairseq-py/pull/877, sequence_generator began calling `model.forward_decoder`, but not all decoder models were given an implementation of that function.
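
      The missing hook is presumably a thin delegation to the decoder; a sketch of that shape (the exact fairseq signature may differ):

      ```python
      from fairseq.models import FairseqLanguageModel

      class PatchedLanguageModel(FairseqLanguageModel):
          """Illustrative: a language model is decoder-only, so forward_decoder,
          which sequence_generator calls, can simply delegate to the decoder."""

          def forward_decoder(self, prev_output_tokens, **kwargs):
              return self.decoder(prev_output_tokens, **kwargs)
      ```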
      
      Reviewed By: okhonko
      
      Differential Revision: D17863751
      
      fbshipit-source-id: ea70b636c9dafcf87f5d5e49631d0c4b7cf14984
    • fix levenshtein transformer attn · 3dcb5c77
      Changhan Wang authored
      Summary: When the `if` statements in the Levenshtein transformer decoder forward are removed, `attn` can end up with a batch size inconsistent with the output tokens. This is a fix.
      
      Reviewed By: cndn
      
      Differential Revision: D17936411
      
      fbshipit-source-id: a1583f3806dc9f41caeb783c043429e247035803
  13. 15 Oct, 2019 1 commit
    • fix libnat imports · e3a40d9d
      Changhan Wang authored
      Summary: Bring back the changes in D17661768
      
      Reviewed By: ailzhang
      
      Differential Revision: D17920299
      
      fbshipit-source-id: be3f93a044a8710c8b475012c39e36a3e6507fad
  14. 11 Oct, 2019 1 commit
    • add new_arange function + FIX BUGS of returning attn values · cce92bdd
      Jiatao Gu authored
      Summary:
      Implementation of the Levenshtein Transformer paper.
      Add a new helper function `new_arange` to create an arange tensor easily.
      Fix bugs in returning attn values for NAT models.
      Delete files that are unnecessary or experimental.
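
      A sketch of what such a helper typically looks like (the actual fairseq version may differ slightly): it builds an arange over the last dimension on the same device as a reference tensor.

      ```python
      import torch

      def new_arange(x, *size):
          """Return a tensor of `size` (default: x.size()) whose last dimension
          holds 0..size[-1]-1, created on the same device as x."""
          if len(size) == 0:
              size = x.size()
          return torch.arange(size[-1], device=x.device).expand(*size).contiguous()
      ```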
      
      Reviewed By: kahne
      
      Differential Revision: D17652009
      
      fbshipit-source-id: 436bbb5d45de2f8067003232de4f2bd51e87719c
  15. 08 Oct, 2019 2 commits
    • ensemble levts · 34e79c58
      Jungo Kasai authored
      Summary:
      Add ensemble wrappers to the Levenshtein NAT: a final-softmax ensemble over the pipeline of three steps:
      1. Deletion
      2. Placeholder Insertion
      3. Word Selection

      Each step involves scoring, averaging the scores over the ensemble, and then making hard decisions with argmax before the next step follows. By design, the three steps cannot run in parallel.
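
      A sketch of how one such step can be ensembled (the hook name and shapes are hypothetical): each model scores the step, the scores are averaged in probability space, and the hard decision is the argmax of the average.

      ```python
      import math
      import torch

      def ensemble_step_decision(models, score_fn_name, *inputs):
          # Each model's hook is assumed to return normalized log-probabilities
          # of shape (batch, ..., num_choices) for the current step.
          log_probs = torch.stack([getattr(m, score_fn_name)(*inputs) for m in models], dim=0)
          avg_log_probs = torch.logsumexp(log_probs, dim=0) - math.log(len(models))
          return avg_log_probs.argmax(dim=-1)
      ```

      The steps stay sequential because placeholder insertion consumes the hard deletion decision and word selection consumes the hard insertion decision.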
      
      Reviewed By: kahne
      
      Differential Revision: D17723202
      
      fbshipit-source-id: 05f7a4fcd922a972cc4796ca397e8220f0b4d53e
    • fix max lengths in Levenshtein Transformer · c2165224
      Changhan Wang authored
      Summary: Fix the max length calculation in Levenshtein Transformer
      
      Reviewed By: jhcross
      
      Differential Revision: D17672946
      
      fbshipit-source-id: e5efbe7e56cf879d3e822864e4398f99f45b04d4
  16. 30 Sep, 2019 2 commits
  17. 29 Sep, 2019 1 commit
  18. 27 Sep, 2019 1 commit
    • Levenshtein Transformer paper code · 86857a58
      Changhan Wang authored
      Summary:
      Code for our NeurIPS paper [Levenshtein Transformer](https://arxiv.org/abs/1905.11006)
      * Added the Levenshtein Transformer model, task, and criterion classes
      * Added iterative NAT Transformer, Insertion Transformer, and CMLM Transformer model classes as baselines
      * Added an option for prepending BOS to the dictionary class and translation task class
      
      Reviewed By: myleott
      
      Differential Revision: D17297372
      
      fbshipit-source-id: 54eca60831ae95dc721c2c34e882e1810ee575c7
  19. 26 Sep, 2019 1 commit
  20. 20 Sep, 2019 1 commit
    • added multilingual masked LM training (#849) · 32335404
      Naman Goyal authored
      Summary:
      The multilingual RoBERTa training is working with aconneau's XLM data.
      
      Two pieces remaining:
      
      1) `XLM` limits each batch to a single language. I am not 100% sure about the reason for that, but it should be easy to implement: basically we can add a `batch_by_size_and_language` function instead of the default `batch_by_size`. If it's not critical, I would prefer to leave it out, as that keeps the code very clean and simple.

      2) `sample_ratio` in `ConcatDataset` only works with `int` values, tiling the datasets based on the ratio. Currently I am handling it by rounding the ratio to the first decimal and then multiplying by 10 (see the sketch below). We can see whether such simple heuristics are good enough; there are other options (we can talk about them offline).
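
      A sketch of the heuristic in (2), under the stated rounding rule (the helper name is hypothetical): float sampling ratios are converted to integer tiling counts by working in tenths.

      ```python
      def ratios_to_tiling_counts(float_ratios):
          # Round each ratio to one decimal place by working in tenths,
          # e.g. [1.0, 0.3, 0.62] -> [10, 3, 6].
          return [max(1, int(round(r * 10))) for r in float_ratios]
      ```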
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/849
      
      Differential Revision: D17162460
      
      fbshipit-source-id: d967f3d872f7a1f0aa4ea418bd362b68af9e432f
  21. 18 Sep, 2019 1 commit
  22. 05 Sep, 2019 1 commit
    • Return predicted token for RoBERTa filling mask · 3e3fe722
      Roman Rädle authored
      Summary:
      Added the `predicted_token` to each `topk` fill-mask output item.

      Updated the RoBERTa fill-mask example in README.md.
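
      A hedged usage sketch based on the README example this diff updates (the example sentence and outputs are illustrative): each top-k result now carries the predicted token alongside the filled sentence and its score.

      ```python
      import torch

      roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
      roberta.eval()

      # Returns (filled_sentence, score, predicted_token) tuples as of this change.
      for filled_sentence, score, predicted_token in roberta.fill_mask(
          'The first Star Wars movie came out in <mask>', topk=3
      ):
          print(filled_sentence, score, predicted_token)
      ```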
      
      Reviewed By: myleott
      
      Differential Revision: D17188810
      
      fbshipit-source-id: 5fdc57ff2c13239dabf13a8dad43ae9a55e8931c
  23. 21 Aug, 2019 1 commit
    • Parameterized criterions (#808) · ba5f829f
      Jeff Cai authored
      Summary:
      Support criterions with parameters, such as the AutoSegmentationCriterion (ASG) used in wav2letter, which has a transition matrix parameter. This is needed to integrate wav2letter's ASG into PySpeech.
      
      With this diff, parameters in criterions will be:
      (1) updated by optimizers, with a configurable learning rate
      (2) saved and loaded from checkpoints, preserving backward compatibility for criterions without parameters
      (3) synchronized across nodes in distributed training.
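
      A minimal sketch of the idea (a toy criterion, not the wav2letter ASG code): making the criterion an nn.Module that carries an nn.Parameter is what gives it points (1)-(3) above.

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class ToyParameterizedCriterion(nn.Module):
          """Toy criterion with a learnable transition matrix. Because `transitions`
          is an nn.Parameter, it is visible to the optimizer, stored in the
          checkpoint state_dict, and synchronized like any other parameter in
          distributed training."""

          def __init__(self, num_labels):
              super().__init__()
              self.transitions = nn.Parameter(torch.zeros(num_labels, num_labels))

          def forward(self, emissions, prev_labels, labels):
              # emissions: (batch, num_labels); prev_labels, labels: (batch,)
              emission_loss = F.cross_entropy(emissions, labels)
              transition_score = self.transitions[prev_labels, labels].mean()
              return emission_loss - transition_score
      ```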
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/808
      
      Reviewed By: jcai1
      
      Differential Revision: D16934097
      
      Pulled By: okhonko
      
      fbshipit-source-id: 121ec9382459385c6f9cbef3a8274bec1a434038
  24. 19 Aug, 2019 1 commit
  25. 16 Aug, 2019 1 commit
  26. 15 Aug, 2019 1 commit
  27. 14 Aug, 2019 1 commit
  28. 13 Aug, 2019 1 commit
  29. 12 Aug, 2019 2 commits
  30. 10 Aug, 2019 2 commits
  31. 08 Aug, 2019 1 commit
  32. 07 Aug, 2019 2 commits