1. 21 Nov, 2019 1 commit
• Refactor data sharding to be specified via caller of task rather than task itself · 99fbd317
      Alex Xiao authored
Summary: Modifying the number of shards internally to disable data sharding for batch iteration is dangerous, because the callers of these tasks are not limited to fairspeq/train. We should therefore put the onus of sharding the data properly on the caller rather than on the task itself.
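A minimal sketch of the intended contract, assuming a simplified task/iterator interface (the class and method names are illustrative stand-ins, not the actual fairspeq API): the caller passes its own num_shards/shard_id, and the task never overrides them.

```python
class Task:
    def __init__(self, samples):
        self.samples = samples

    def get_batch_iterator(self, num_shards=1, shard_id=0):
        # Honor whatever sharding the caller requests; never override
        # num_shards internally to switch sharding on or off.
        return [s for i, s in enumerate(self.samples) if i % num_shards == shard_id]


task = Task(samples=list(range(10)))

# A distributed training caller shards the data across its workers.
worker0_batches = task.get_batch_iterator(num_shards=2, shard_id=0)

# A different caller (e.g. single-process batch iteration) explicitly asks
# for no sharding instead of relying on the task to disable it.
all_batches = task.get_batch_iterator(num_shards=1, shard_id=0)

assert all_batches == list(range(10))
assert worker0_batches == [0, 2, 4, 6, 8]
```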
      
      Reviewed By: myleott
      
      Differential Revision: D18456424
      
      fbshipit-source-id: d46be16c441c50082f9a768d0b259e6c28a4b67b
  2. 08 Nov, 2019 1 commit
  3. 06 Nov, 2019 1 commit
  4. 27 Oct, 2019 1 commit
• adding layerdrop code for training, pruning, and readme (#890) · dabbef46
      Angela Fan authored
      Summary:
      TEST 1: EVALUATION TIME WORKS
      checked
      achieves correct model perplexity: 18.68
      
      TEST 2: TRAINING NEW MODEL WORKS
      checked
      
      without layerdrop:
      --decoder-layerdrop 0 OR no flag at all
      | epoch 001:     10 / 11201 loss=27.469, nll_loss=27.469, ppl=185799477.36, wps=1764, ups=0, wpb=9216.000, bsz=3.000, num_updates=7, lr=0.0004376, gnorm=25.471, clip=1.000, oom=0.000, loss_scale=8.000, wall=37, train_wall=30
      | epoch 001:     20 / 11201 loss=27.443, nll_loss=27.443, ppl=182500427.22, wps=2449, ups=0, wpb=9216.000, bsz=3.000, num_updates=17, lr=0.0010626, gnorm=25.273, clip=1.000, oom=0.000, loss_scale=8.000, wall=64, train_wall=57
      | epoch 001:     30 / 11201 loss=27.404, nll_loss=27.404, ppl=177612215.78, wps=2720, ups=0, wpb=9216.000, bsz=3.000, num_updates=27, lr=0.0016876, gnorm=25.136, clip=1.000, oom=0.000, loss_scale=8.000, wall=91, train_wall=84
      | epoch 001:     40 / 11201 loss=27.009, nll_loss=27.009, ppl=135079983.00, wps=2865, ups=0, wpb=9216.000, bsz=3.000, num_updates=37, lr=0.0023126, gnorm=24.311, clip=1.000, oom=0.000, loss_scale=8.000, wall=119, train_wall=112
      | epoch 001:     50 / 11201 loss=26.418, nll_loss=26.418, ppl=89680259.41, wps=2952, ups=0, wpb=9216.000, bsz=3.000, num_updates=47, lr=0.0029376, gnorm=22.775, clip=1.000, oom=0.000, loss_scale=8.000, wall=147, train_wall=140
      
      with layerdrop (regularization effect should be seen in PPL):
      --decoder-layerdrop 0.2
      
      | epoch 001:     10 / 11201 loss=25.186, nll_loss=25.186, ppl=38182937.27, wps=2428, ups=0, wpb=9216.000, bsz=3.000, num_updates=8, lr=0.0005001, gnorm=17.082, clip=1.000, oom=0.000, loss_scale=16.000, wall=30, train_wall=24
      | epoch 001:     20 / 11201 loss=25.270, nll_loss=25.270, ppl=40451933.50, wps=3173, ups=0, wpb=9216.000, bsz=3.000, num_updates=18, lr=0.0011251, gnorm=17.162, clip=1.000, oom=0.000, loss_scale=16.000, wall=52, train_wall=45
      | epoch 001:     30 / 11201 loss=25.349, nll_loss=25.349, ppl=42752256.68, wps=3454, ups=0, wpb=9216.000, bsz=3.000, num_updates=28, lr=0.0017501, gnorm=17.370, clip=1.000, oom=0.000, loss_scale=16.000, wall=75, train_wall=68
      | epoch 001:     40 / 11201 loss=25.115, nll_loss=25.115, ppl=36343806.30, wps=3619, ups=0, wpb=9216.000, bsz=3.000, num_updates=38, lr=0.0023751, gnorm=16.945, clip=1.000, oom=0.000, loss_scale=16.000, wall=97, train_wall=90
      | epoch 001:     50 / 11201 loss=24.804, nll_loss=24.804, ppl=29284345.78, wps=3716, ups=0, wpb=9216.000, bsz=3.000, num_updates=48, lr=0.0030001, gnorm=16.406, clip=1.000, oom=0.000, loss_scale=16.000, wall=119, train_wall=112
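For reference, a minimal sketch of the LayerDrop mechanism that `--decoder-layerdrop` controls (an illustrative PyTorch module, not the fairseq implementation): each layer is skipped with probability p during training, while evaluation keeps all layers.

```python
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    """Illustrative sketch: drop each layer with probability `layerdrop`
    at training time only (what --decoder-layerdrop 0.2 asks for)."""

    def __init__(self, num_layers=6, dim=16, layerdrop=0.2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.layerdrop = layerdrop

    def forward(self, x):
        for layer in self.layers:
            # Evaluation keeps every layer; training randomly skips some.
            if self.training and torch.rand(()).item() < self.layerdrop:
                continue
            x = layer(x)
        return x

model = LayerDropStack()
out = model(torch.randn(3, 16))  # training mode: some layers are skipped
```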
      
      TEST 3: PICKING UP TRAINING FROM EXISTING MODEL
      checked
      
      | loaded checkpoint /checkpoint/angelafan/structured_0.1_block_8_sd02/checkpoint_last.pt (epoch 272 @ 381066 updates)
      | loading train data for epoch 272
      | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
      
TEST 4: EVALUATING AN EXISTING BERT MODEL REPRODUCES RESULTS
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Accuracy:  0.9231651376146789
      achieves correct accuracy on SST2 for this model
      
      TEST 5: TRAINING NEW BERT MODEL WORKS
      checked and works
      
      TEST 6: NMT
      
      without layerdrop
--encoder-layerdrop 0 --decoder-layerdrop 0, OR any combination of the flags specified and unspecified
      
      | epoch 001:     10 / 92203 loss=15.820, nll_loss=15.830, ppl=58267.93, wps=4902, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=7.207, clip=0.000, oom=0.000, loss_scale=128.000, wall=60, train_wall=3
      | epoch 001:     20 / 92203 loss=15.523, nll_loss=15.501, ppl=46359.29, wps=5037, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.869, clip=0.000, oom=0.000, loss_scale=128.000, wall=63, train_wall=6
      | epoch 001:     30 / 92203 loss=15.185, nll_loss=15.123, ppl=35695.79, wps=5085, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.186, clip=0.000, oom=0.000, loss_scale=128.000, wall=66, train_wall=9
      | epoch 001:     40 / 92203 loss=14.940, nll_loss=14.849, ppl=29505.60, wps=5116, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=5.610, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=12
      | epoch 001:     50 / 92203 loss=14.745, nll_loss=14.630, ppl=25346.87, wps=5070, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.104, clip=0.000, oom=0.000, loss_scale=128.000, wall=71, train_wall=15
      
      with layerdrop (regularization effect should be seen in PPL)
      
      A) works with --encoder-layerdrop 0.2 --decoder-layerdrop 0.2
      B) works with different settings --encoder-layerdrop 0.3 --decoder-layerdrop 0.5
      C) works with one on and one off --encoder-layerdrop 0.2 --decoder-layerdrop 0
      
      | epoch 001:     10 / 92203 loss=15.817, nll_loss=15.828, ppl=58158.54, wps=5355, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=6.959, clip=0.000, oom=0.000, loss_scale=128.000, wall=59, train_wall=3
      | epoch 001:     20 / 92203 loss=15.650, nll_loss=15.641, ppl=51111.63, wps=5515, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.825, clip=0.000, oom=0.000, loss_scale=128.000, wall=61, train_wall=6
      | epoch 001:     30 / 92203 loss=15.440, nll_loss=15.408, ppl=43491.58, wps=5602, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.576, clip=0.000, oom=0.000, loss_scale=128.000, wall=64, train_wall=8
      | epoch 001:     40 / 92203 loss=15.247, nll_loss=15.193, ppl=37457.14, wps=5676, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=6.124, clip=0.000, oom=0.000, loss_scale=128.000, wall=67, train_wall=11
      | epoch 001:     50 / 92203 loss=15.055, nll_loss=14.977, ppl=32259.92, wps=5598, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.661, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=14
      
      TEST 7: PRUNING TESTCASES
      
A) after adding the pruning flags, the model can still be evaluated as a full model
checked, reaches the correct PPL
      num. model params: 246933504
      | Evaluated 217646 tokens in 196.3s (1108.99 tokens/s)
      | Loss: 2.9275, Perplexity: 18.68
      
B) after adding the pruning flags, the model can be pruned; this works with multiple flag settings
checked three cases:
      num. model params: 146163712
      | Evaluated 217646 tokens in 106.0s (2054.07 tokens/s)
      | Loss: 3.0932, Perplexity: 22.05
      
      num. model params: 209144832
      | Evaluated 217646 tokens in 162.8s (1336.99 tokens/s)
      | Loss: 2.9526, Perplexity: 19.16
      
C) the model can resume training if you want to fine-tune the pruned model
checked:
      | loading train data for epoch 272
      | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
      | WARNING: overflow detected, setting loss scale to: 64.0
      | WARNING: overflow detected, setting loss scale to: 32.0
      | epoch 272:   1500 / 5601 loss=5.015, nll_loss=5.015, ppl=32.33, wps=11598, ups=1, wpb=18432.000, bsz=6.000, num_updates=98, lr=0.0061251, gnorm=0.613, clip=1.000, oom=0.000, loss_scale=32.000, wall=156, train_wall=252396
      
      D) works with BERT
      checked:
      without specifying any flags, reproduces the correct standard accuracy
      with flags, produces the correct pruned accuracy
      
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Accuracy:  0.9231651376146789
      
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Pruning model to specified layer configuration - this works best if the model was trained with LayerDrop
      | Accuracy:  0.9220183486238532
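A minimal sketch of the pruning step referred to above (illustrative only, not the actual flags or fairseq code): keep a fixed subset of the trained layers, which works best when the model was trained with LayerDrop.

```python
import torch.nn as nn

def prune_layers(layers, layers_to_keep):
    # Keep only the requested layer indices; e.g. every other layer of an
    # 8-layer decoder trained with LayerDrop.
    return nn.ModuleList([layers[i] for i in layers_to_keep])

decoder_layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(8)])
pruned = prune_layers(decoder_layers, layers_to_keep=[0, 2, 4, 6])
assert len(pruned) == 4
```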
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/890
      
      Reviewed By: edunov
      
      Differential Revision: D18094657
      
      Pulled By: huihuifan
      
      fbshipit-source-id: 2bbaa2ff0039e906782694fc2038b8c17a8693e7
  5. 24 Oct, 2019 1 commit
  6. 12 Oct, 2019 1 commit
  7. 09 Oct, 2019 1 commit
• Fix data loading memory issue in pyspeech · b6e001f6
      Alex Xiao authored
      Summary:
We currently shard data when creating the batch iterator. This means we first load all indices/frame lengths/handles into memory and then do the sharding. This makes it impossible to train on large datasets with a large number of workers, because each worker needs to load the entire dataset into memory. For training on a million hours of data (i.e. semi-supervised or unsupervised approaches), this data loading makes it flat-out impossible to use 8 GPUs.
      
3 changes:

1. This diff modifies the data loading so that we do the sharding while we read the handles file, rather than later (see the sketch below). This modification is done on a task-by-task basis, since the task specifies how the data is loaded. I've tried to keep the code compatible with both sharding during handle loading and sharding during batch iteration. I've currently only implemented sharding during handle loading for the aligned_training task.

2. To support data sharding at data loading time, together with the requirement that all shards have exactly the same number of batches, I've added a method to synchronize the shards: any shard with too many batches simply truncates the extra ones, similar to what we already do.

3. In fairspeq/train.py, we are actually loading the training dataset and batch iterator twice: once in train.py and once when loading the checkpoint (which we always do, regardless of whether a checkpoint exists). This doubles the loading time, which can be painful for very large files. I've removed the extraneous loading in this diff as well.
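A minimal sketch of changes 1 and 2, assuming a line-per-utterance handles file (the function names and file format are hypothetical, not the pyspeech code): each worker keeps only every num_shards-th handle while streaming the file, and shards later truncate to a common batch count.

```python
def load_handles_sharded(path, shard_id, num_shards):
    # Read only this shard's lines while streaming the handles file, so a
    # worker never holds the full dataset in memory.
    handles = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i % num_shards == shard_id:
                handles.append(line.rstrip("\n"))
    return handles


def truncate_to_common_batch_count(batches, min_batches_across_shards):
    # All shards must yield exactly the same number of batches; shards with
    # extra batches drop them (the minimum is typically agreed on via an
    # all_reduce across workers).
    return batches[:min_batches_across_shards]
```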
      
      Reviewed By: yqwangustc
      
      Differential Revision: D17750715
      
      fbshipit-source-id: 0e6e3d363525fa5661f1c784303390ea13f46377
  8. 08 Oct, 2019 1 commit
  9. 04 Oct, 2019 1 commit
• Add periodic CUDA cache cleanup (#882) · 315c463d
      Jerry Ma authored
      Summary:
This adds a periodic call to `torch.cuda.empty_cache()` in order to
mitigate memory fragmentation in the PyTorch CUDA caching allocator,
which can cause OOMs on models approaching the GPU memory limit.
By default, this will occur every 64 updates (sketched below).
      
      Performance considerations:
      
- I've benchmarked this on a reasonably large model with a 16 GB memory
  footprint, and the overhead with the default setting is <0.2%.
  With `update-freq > 1`, the cost is mitigated even further.
- This behavior can be disabled by setting the interval to zero.
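A minimal sketch of the periodic cleanup (the helper and constant names are illustrative; only `torch.cuda.empty_cache()` and the 64-update default come from the summary):

```python
import torch

EMPTY_CACHE_FREQ = 64  # default interval from the summary; 0 disables the cleanup

def maybe_empty_cache(num_updates, freq=EMPTY_CACHE_FREQ):
    # Periodically return cached, unused CUDA blocks to the driver to reduce
    # fragmentation on models running close to the GPU memory limit.
    if freq > 0 and num_updates % freq == 0 and torch.cuda.is_available():
        torch.cuda.empty_cache()

# Inside a training loop:
# for num_updates, batch in enumerate(iterator, start=1):
#     train_step(batch)
#     maybe_empty_cache(num_updates)
```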
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/882
      
      Differential Revision: D17742386
      
      Pulled By: jma127
      
      fbshipit-source-id: 68d8f93f798d6818b5efc3d67d43b52dfb8b2865
  10. 20 Sep, 2019 1 commit
• added multilingual masked LM training (#849) · 32335404
      Naman Goyal authored
      Summary:
Multilingual RoBERTa training is working with aconneau's XLM data.
      
      Two pieces remaining:
      
1) `XLM` limits each batch to examples from the same language. I am not 100% sure about the reason for that, but it should be easy to implement: basically we can add a `batch_by_size_and_language` function instead of the default `batch_by_size`. If it's not critical, I would prefer to leave it out, as that keeps the code very clean and simple.
      
2) `sample_ratio` in `ConcatDataset` works with `int` values by tiling the datasets based on the ratio. Currently I am handling it by rounding the ratio to the first decimal place and then multiplying by `10` (see the sketch below). We can see whether such simple heuristics are good enough; there are other options (we can talk about them offline).
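A minimal sketch of the heuristic in point 2 (the function name is illustrative): float sample ratios become integer tile counts by rounding to one decimal place, i.e. scaling by 10.

```python
def float_ratios_to_tile_counts(ratios):
    # ConcatDataset tiles each dataset an integer number of times, so round
    # each float ratio to the first decimal place and scale by 10.
    return [max(1, int(round(r * 10))) for r in ratios]

# e.g. upsample the second corpus ~2.3x relative to a base ratio of 1.0
print(float_ratios_to_tile_counts([1.0, 2.34, 0.5]))  # [10, 23, 5]
```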
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/849
      
      Differential Revision: D17162460
      
      fbshipit-source-id: d967f3d872f7a1f0aa4ea418bd362b68af9e432f
  11. 16 Sep, 2019 1 commit
• added fast stats sync option (#858) · e1ba32aa
      Naman Goyal authored
      Summary:
Added the `--fast-stat-sync` option.
This avoids pickle and achieves `~7%` higher `wps` on 16 nodes.
It is less flexible, as it aggregates only basic stats and ignores the aggregate function defined by the criterion.
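A minimal sketch of the idea (illustrative helper, not the fairseq implementation): pack a fixed set of basic stats into one tensor and all_reduce it, instead of pickling each worker's full logging output.

```python
import torch
import torch.distributed as dist

def fast_stat_sync(loss, nll_loss, ntokens, nsentences):
    # Fixed-layout tensor of basic stats; criterion-specific aggregation
    # functions are deliberately ignored, which is the flexibility trade-off.
    stats = torch.tensor([loss, nll_loss, ntokens, nsentences], dtype=torch.float64)
    if dist.is_available() and dist.is_initialized():
        # One cheap collective sums the stats across workers (with the NCCL
        # backend the tensor would live on the GPU).
        dist.all_reduce(stats)
    return stats.tolist()
```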
      
Let me know what you think, myleott.
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/858
      
      Differential Revision: D17398770
      
      fbshipit-source-id: 36261a1d970e67deeda8211af8f009ef9b4f9c14
  12. 21 Aug, 2019 1 commit
• Parameterized criterions (#808) · ba5f829f
      Jeff Cai authored
      Summary:
Support criterions with parameters, such as AutoSegmentationCriterion (ASG) used in wav2letter, which has a transition matrix parameter (see the sketch below). This is needed to integrate wav2letter's ASG into PySpeech.
      
      With this diff, parameters in criterions will be:
      (1) updated by optimizers, with a configurable learning rate
      (2) saved and loaded from checkpoints, preserving backward compatibility for criterions without parameters
      (3) synchronized across nodes in distributed training.
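A minimal sketch of what a parameterized criterion looks like (illustrative, not the actual ASG code): the criterion owns an nn.Parameter, so it can be handed to the optimizer, checkpointed via state_dict, and synced like model weights.

```python
import torch
import torch.nn as nn

class TransitionCriterion(nn.Module):
    # Illustrative stand-in for a criterion with learnable parameters,
    # e.g. ASG's transition matrix over output labels.
    def __init__(self, num_labels=9):
        super().__init__()
        self.transitions = nn.Parameter(torch.zeros(num_labels, num_labels))

    def forward(self, emissions):
        # Placeholder loss that involves the parameter so it receives gradients.
        return emissions.sum() + self.transitions.abs().sum()

criterion = TransitionCriterion()
# The criterion's parameters are registered like model weights, so they can be
# given to an optimizer (optionally with their own learning rate) and saved in
# checkpoints via criterion.state_dict().
optimizer = torch.optim.SGD(criterion.parameters(), lr=0.01)
```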
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/808
      
      Reviewed By: jcai1
      
      Differential Revision: D16934097
      
      Pulled By: okhonko
      
      fbshipit-source-id: 121ec9382459385c6f9cbef3a8274bec1a434038
  13. 20 Aug, 2019 1 commit
  14. 30 Jul, 2019 1 commit
  15. 22 Jul, 2019 1 commit
  16. 11 Jul, 2019 1 commit
  17. 21 Jun, 2019 1 commit
  18. 20 Jun, 2019 2 commits
  19. 12 Jun, 2019 1 commit
• Add Model Averaging · 6982c404
      Nayan Singhal authored
      Summary:
Implemented model averaging for fairseq (sketched below).
Removed the DDP wrapper if a global optimizer is provided.
Syncing all the models based on the iteration provided in the input.
      
TODO:
1) Fix the throughput and wps meters. Need to check other meters too.
2) Replace the model averaging code with the BMUF algorithm implementation.
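A minimal sketch of periodic model averaging (illustrative helper, not the fairseq/BMUF code): instead of all-reducing gradients every step under DDP, each worker trains locally and parameters are averaged across workers every sync_interval updates.

```python
import torch
import torch.distributed as dist

def maybe_average_model(model, num_updates, sync_interval):
    # Every `sync_interval` updates, replace each parameter with its average
    # across all workers.
    if num_updates % sync_interval != 0:
        return
    if not (dist.is_available() and dist.is_initialized()):
        return
    world_size = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p.data)
            p.data.div_(world_size)
```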
      
      Reviewed By: myleott
      
      Differential Revision: D15711044
      
      fbshipit-source-id: 58a4af74db2a61d06762597b95836cbeb1ed82cc
  20. 30 May, 2019 1 commit
  21. 17 May, 2019 2 commits
  22. 09 May, 2019 1 commit
  23. 04 May, 2019 1 commit
  24. 03 May, 2019 1 commit
  25. 02 May, 2019 1 commit
  26. 01 May, 2019 1 commit
  27. 30 Apr, 2019 1 commit
  28. 29 Apr, 2019 1 commit
  29. 10 Apr, 2019 1 commit
  30. 04 Apr, 2019 1 commit
• aligned training task and CE related changes · 3658fa32
      Jay Mahadeokar authored
      Summary:
This diff adds:

1. An aligned training task, specifically for cross-entropy criterion training using prod data and prod-like models.
2. A few changes to correctly register the task and criterions (see the registration sketch below).
3. Changes to the trainer code for propagating the accuracy metrics we care about during training.
      
A couple of things are hacky right now:
- The reporting is not modular (this needs to be thought about in general for fairseq).

- The get-dummy-batch logic could be specific to the task instead of specific to the dataset.
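A minimal sketch of the registration pattern point 2 refers to (a simplified stand-in registry, not the actual fairseq decorators; the task name and class are hypothetical):

```python
# Simplified stand-in for a task registry; the real project wires tasks and
# criterions into lookup tables so they can be selected by name from the CLI.
TASK_REGISTRY = {}

def register_task(name):
    def wrapper(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return wrapper

@register_task("aligned_training")
class AlignedTrainingTask:
    """Hypothetical task: cross-entropy training on frame-aligned prod data."""
    def __init__(self, args):
        self.args = args

task_cls = TASK_REGISTRY["aligned_training"]  # looked up by name at runtime
```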
      
      Reviewed By: myleott
      
      Differential Revision: D14670482
      
      fbshipit-source-id: dc077247b2ae9d26a8e842a386ec5faa5771e836
  31. 12 Mar, 2019 1 commit
• Handle 3+ dimensional input in sequence_generator + nits · 860010e9
      Dmytro Okhonko authored
Summary: sequence_generator assumes that the model input is a 2D tensor of longs. But it can be something like a 3D tensor of floats, and we should be able to handle this as long as the first dimension is the batch size, followed by the source lengths.
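A minimal sketch of the assumption described above (illustrative helper): whether the input is a 2D LongTensor of token ids or a 3D FloatTensor of features, dimension 0 is the batch size and dimension 1 the source length.

```python
import torch

def batch_and_src_len(src_tokens):
    # Works for both (bsz, src_len) token ids and (bsz, src_len, feat_dim)
    # float features: only the first two dimensions matter here.
    return src_tokens.size(0), src_tokens.size(1)

print(batch_and_src_len(torch.zeros(4, 7, dtype=torch.long)))  # (4, 7)
print(batch_and_src_len(torch.zeros(4, 7, 80)))                # (4, 7)
```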
      
      Reviewed By: myleott
      
      Differential Revision: D14420044
      
      fbshipit-source-id: bf8b1e42ad1873f7b803c1a377b0af21648db015
  32. 26 Feb, 2019 1 commit
• Multilingual training example (#527) · 00493490
      Myle Ott authored
      Summary:
      * Add example for multilingual translation on IWSLT'17
      * Match dataset ordering for multilingual_translation and translation
      * Fix bug with LegacyDistributedDataParallel when calling forward of sub-modules
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/527
      
      Differential Revision: D14218372
      
      Pulled By: myleott
      
      fbshipit-source-id: 2e3fe24aa39476bcc5c9af68ef9a40192db34a3b
  33. 06 Feb, 2019 1 commit
  34. 25 Jan, 2019 1 commit
  35. 17 Jan, 2019 1 commit
• Fix initial learning rate (#453) · 2210fa71
      Myle Ott authored
      Summary:
There was a very subtle bug here 😢 When we recently removed this line (7633129b), it meant that the learning rate scheduler didn't get initialized until after the first update. Unfortunately, PyTorch optimizers store the learning rate in their internal state, so some learning rate schedulers use their `__init__` method to reset the learning rate to a sane initial value. This is especially problematic for LR schedulers that include a warmup, where the optimizer is likely to contain the peak learning rate at initialization, and it's only in the LR scheduler's `__init__` that the (much smaller) warmup value is set.
      
      For example, the inverse_sqrt scheduler resets the learning rate upon initialization:
      https://github.com/pytorch/fairseq/blob/7853818c2e33a63ec17a31bcfe20e4fc75d94130/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py#L48-L50
      
      **Impact:** For the last ~1.5 weeks, the first training update would use the optimizer's default learning rate instead of the initial rate set by the LR scheduler. All subsequent updates used the correct learning rates. This primarily affects LR schedulers with warmups.
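A minimal sketch of an inverse-sqrt schedule with warmup (illustrative numbers and function, not the fairseq source), showing why the scheduler's `__init__` must push the small warmup rate back into the optimizer:

```python
def inverse_sqrt_lr(num_updates, warmup_updates=4000, warmup_init_lr=1e-7, peak_lr=5e-4):
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr up to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * num_updates / warmup_updates
    # Afterwards decay proportionally to the inverse square root of the step.
    return peak_lr * (warmup_updates ** 0.5) * (num_updates ** -0.5)

print(inverse_sqrt_lr(0))      # 1e-07 -- the tiny rate __init__ must restore
print(inverse_sqrt_lr(4000))   # 0.0005 -- the peak reached after warmup
```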
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/453
      
      Differential Revision: D13704453
      
      Pulled By: myleott
      
      fbshipit-source-id: a946da30100f837c66bdc6b9b77b014ab4eb8764
  36. 09 Jan, 2019 1 commit
  37. 05 Jan, 2019 1 commit
  38. 28 Dec, 2018 1 commit