27 Oct, 2019
      adding layerdrop code for training, pruning, and readme (#890) · dabbef46
      Angela Fan authored
      Summary:
      TEST 1: EVALUATION TIME WORKS
      checked
      achieves correct model perplexity: 18.68
      
      TEST 2: TRAINING NEW MODEL WORKS
      checked
      
      without layerdrop:
      --decoder-layerdrop 0 OR no flag at all
      | epoch 001:     10 / 11201 loss=27.469, nll_loss=27.469, ppl=185799477.36, wps=1764, ups=0, wpb=9216.000, bsz=3.000, num_updates=7, lr=0.0004376, gnorm=25.471, clip=1.000, oom=0.000, loss_scale=8.000, wall=37, train_wall=30
      | epoch 001:     20 / 11201 loss=27.443, nll_loss=27.443, ppl=182500427.22, wps=2449, ups=0, wpb=9216.000, bsz=3.000, num_updates=17, lr=0.0010626, gnorm=25.273, clip=1.000, oom=0.000, loss_scale=8.000, wall=64, train_wall=57
      | epoch 001:     30 / 11201 loss=27.404, nll_loss=27.404, ppl=177612215.78, wps=2720, ups=0, wpb=9216.000, bsz=3.000, num_updates=27, lr=0.0016876, gnorm=25.136, clip=1.000, oom=0.000, loss_scale=8.000, wall=91, train_wall=84
      | epoch 001:     40 / 11201 loss=27.009, nll_loss=27.009, ppl=135079983.00, wps=2865, ups=0, wpb=9216.000, bsz=3.000, num_updates=37, lr=0.0023126, gnorm=24.311, clip=1.000, oom=0.000, loss_scale=8.000, wall=119, train_wall=112
      | epoch 001:     50 / 11201 loss=26.418, nll_loss=26.418, ppl=89680259.41, wps=2952, ups=0, wpb=9216.000, bsz=3.000, num_updates=47, lr=0.0029376, gnorm=22.775, clip=1.000, oom=0.000, loss_scale=8.000, wall=147, train_wall=140
      
      with layerdrop (regularization effect should be seen in PPL):
      --decoder-layerdrop 0.2
      
      | epoch 001:     10 / 11201 loss=25.186, nll_loss=25.186, ppl=38182937.27, wps=2428, ups=0, wpb=9216.000, bsz=3.000, num_updates=8, lr=0.0005001, gnorm=17.082, clip=1.000, oom=0.000, loss_scale=16.000, wall=30, train_wall=24
      | epoch 001:     20 / 11201 loss=25.270, nll_loss=25.270, ppl=40451933.50, wps=3173, ups=0, wpb=9216.000, bsz=3.000, num_updates=18, lr=0.0011251, gnorm=17.162, clip=1.000, oom=0.000, loss_scale=16.000, wall=52, train_wall=45
      | epoch 001:     30 / 11201 loss=25.349, nll_loss=25.349, ppl=42752256.68, wps=3454, ups=0, wpb=9216.000, bsz=3.000, num_updates=28, lr=0.0017501, gnorm=17.370, clip=1.000, oom=0.000, loss_scale=16.000, wall=75, train_wall=68
      | epoch 001:     40 / 11201 loss=25.115, nll_loss=25.115, ppl=36343806.30, wps=3619, ups=0, wpb=9216.000, bsz=3.000, num_updates=38, lr=0.0023751, gnorm=16.945, clip=1.000, oom=0.000, loss_scale=16.000, wall=97, train_wall=90
      | epoch 001:     50 / 11201 loss=24.804, nll_loss=24.804, ppl=29284345.78, wps=3716, ups=0, wpb=9216.000, bsz=3.000, num_updates=48, lr=0.0030001, gnorm=16.406, clip=1.000, oom=0.000, loss_scale=16.000, wall=119, train_wall=112
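
For reference, the layerdrop flag tested here drops each decoder layer with the given probability during a training forward pass and runs every layer at evaluation time. A minimal PyTorch sketch of that idea (class and argument names are illustrative, not the fairseq implementation):

import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    """Layer stack in which each layer is skipped stochastically during training."""

    def __init__(self, num_layers=16, d_model=512, nhead=8, layerdrop=0.2):
        super().__init__()
        self.layerdrop = layerdrop  # e.g. --decoder-layerdrop 0.2
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead) for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            # Skip this layer with probability `layerdrop`, only while training.
            if self.training and torch.rand(1).item() < self.layerdrop:
                continue
            x = layer(x)
        return x

With --decoder-layerdrop 0 (or the flag omitted) every layer always runs, which is the baseline run above.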
      
      TEST 3: PICKING UP TRAINING FROM EXISTING MODEL
      checked
      
      | loaded checkpoint /checkpoint/angelafan/structured_0.1_block_8_sd02/checkpoint_last.pt (epoch 272 @ 381066 updates)
      | loading train data for epoch 272
      | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
      
TEST 4: EVALUATING EXISTING BERT MODEL REPRODUCES RESULTS
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Accuracy:  0.9231651376146789
achieves the correct accuracy on SST-2 for this model
      
      TEST 5: TRAINING NEW BERT MODEL WORKS
      checked and works
      
      TEST 6: NMT
      
      without layerdrop
--encoder-layerdrop 0 --decoder-layerdrop 0, OR any combination of the flags specified and unspecified
      
      | epoch 001:     10 / 92203 loss=15.820, nll_loss=15.830, ppl=58267.93, wps=4902, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=7.207, clip=0.000, oom=0.000, loss_scale=128.000, wall=60, train_wall=3
      | epoch 001:     20 / 92203 loss=15.523, nll_loss=15.501, ppl=46359.29, wps=5037, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.869, clip=0.000, oom=0.000, loss_scale=128.000, wall=63, train_wall=6
      | epoch 001:     30 / 92203 loss=15.185, nll_loss=15.123, ppl=35695.79, wps=5085, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.186, clip=0.000, oom=0.000, loss_scale=128.000, wall=66, train_wall=9
      | epoch 001:     40 / 92203 loss=14.940, nll_loss=14.849, ppl=29505.60, wps=5116, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=5.610, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=12
      | epoch 001:     50 / 92203 loss=14.745, nll_loss=14.630, ppl=25346.87, wps=5070, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.104, clip=0.000, oom=0.000, loss_scale=128.000, wall=71, train_wall=15
      
      with layerdrop (regularization effect should be seen in PPL)
      
      A) works with --encoder-layerdrop 0.2 --decoder-layerdrop 0.2
      B) works with different settings --encoder-layerdrop 0.3 --decoder-layerdrop 0.5
      C) works with one on and one off --encoder-layerdrop 0.2 --decoder-layerdrop 0
      
      | epoch 001:     10 / 92203 loss=15.817, nll_loss=15.828, ppl=58158.54, wps=5355, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=6.959, clip=0.000, oom=0.000, loss_scale=128.000, wall=59, train_wall=3
      | epoch 001:     20 / 92203 loss=15.650, nll_loss=15.641, ppl=51111.63, wps=5515, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.825, clip=0.000, oom=0.000, loss_scale=128.000, wall=61, train_wall=6
      | epoch 001:     30 / 92203 loss=15.440, nll_loss=15.408, ppl=43491.58, wps=5602, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.576, clip=0.000, oom=0.000, loss_scale=128.000, wall=64, train_wall=8
      | epoch 001:     40 / 92203 loss=15.247, nll_loss=15.193, ppl=37457.14, wps=5676, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=6.124, clip=0.000, oom=0.000, loss_scale=128.000, wall=67, train_wall=11
      | epoch 001:     50 / 92203 loss=15.055, nll_loss=14.977, ppl=32259.92, wps=5598, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.661, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=14
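
Settings A-C above only differ in the per-stack drop rates; encoder and decoder LayerDrop are independent. A small sketch of setting B with illustrative names (cross-attention omitted, and these are not fairseq's modules):

import torch
import torch.nn as nn

def layerdrop_forward(layers, x, p, training=True):
    # Run a layer stack, skipping each layer with probability p during training.
    for layer in layers:
        if training and torch.rand(1).item() < p:
            continue
        x = layer(x)
    return x

def make_stack(n, d_model=512, nhead=8):
    return nn.ModuleList(nn.TransformerEncoderLayer(d_model, nhead) for _ in range(n))

enc_layers, dec_layers = make_stack(6), make_stack(6)
src = torch.randn(10, 2, 512)                        # (seq_len, batch, dim) source states
tgt = torch.randn(12, 2, 512)                        # target states; cross-attention omitted
enc_out = layerdrop_forward(enc_layers, src, p=0.3)  # --encoder-layerdrop 0.3
dec_out = layerdrop_forward(dec_layers, tgt, p=0.5)  # --decoder-layerdrop 0.5

Setting C is the same with p=0 on one side, so those layers always run.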
      
TEST 7: PRUNING TEST CASES
      
A) after adding the pruning flags, the model can still be evaluated as the full model
      checked, reaches correct PPL
      num. model params: 246933504
      | Evaluated 217646 tokens in 196.3s (1108.99 tokens/s)
      | Loss: 2.9275, Perplexity: 18.68
      
B) after adding the pruning flags, the model can be pruned; this works with multiple flag settings (a small sketch of the pruning step follows these test cases)
      checked three cases:
      num. model params: 146163712
      | Evaluated 217646 tokens in 106.0s (2054.07 tokens/s)
      | Loss: 3.0932, Perplexity: 22.05
      
      num. model params: 209144832
      | Evaluated 217646 tokens in 162.8s (1336.99 tokens/s)
      | Loss: 2.9526, Perplexity: 19.16
      
C) the model can resume training if you want to fine-tune the pruned model
      checked:
      | loading train data for epoch 272
      | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
      | WARNING: overflow detected, setting loss scale to: 64.0
      | WARNING: overflow detected, setting loss scale to: 32.0
      | epoch 272:   1500 / 5601 loss=5.015, nll_loss=5.015, ppl=32.33, wps=11598, ups=1, wpb=18432.000, bsz=6.000, num_updates=98, lr=0.0061251, gnorm=0.613, clip=1.000, oom=0.000, loss_scale=32.000, wall=156, train_wall=252396
      
      D) works with BERT
      checked:
      without specifying any flags, reproduces the correct standard accuracy
      with flags, produces the correct pruned accuracy
      
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Accuracy:  0.9231651376146789
      
      | [input] dictionary: 50265 types
      | [label] dictionary: 9 types
      | Pruning model to specified layer configuration - this works best if the model was trained with LayerDrop
      | Accuracy:  0.9220183486238532
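
For reference, the pruning tested above keeps only a chosen subset of the trained layers at inference time, which is why the parameter count drops from 246933504 (case A, full model) to 146163712 or 209144832 (case B). A minimal sketch of that step, with illustrative names rather than the actual flags this PR adds:

import torch.nn as nn

def prune_layers(layers, layers_to_keep):
    # Build a smaller stack from selected layer indices of a trained model.
    return nn.ModuleList(layers[i] for i in sorted(layers_to_keep))

full = nn.ModuleList(nn.TransformerEncoderLayer(512, 8) for _ in range(16))
pruned = prune_layers(full, layers_to_keep=range(0, 16, 2))  # keep every other layer
print(len(full), "->", len(pruned), "layers")                # 16 -> 8

As the evaluation log notes, this works best when the model was trained with LayerDrop; the pruned model can then be evaluated directly (case B) or fine-tuned further (case C).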
      Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/890
      
      Reviewed By: edunov
      
      Differential Revision: D18094657
      
      Pulled By: huihuifan
      
      fbshipit-source-id: 2bbaa2ff0039e906782694fc2038b8c17a8693e7