    adding layerdrop code for training, pruning, and readme (#890) · dabbef46
    Angela Fan authored
    Summary:
    TEST 1: EVALUATION TIME WORKS
    checked
    achieves correct model perplexity: 18.68
    
    TEST 2: TRAINING NEW MODEL WORKS
    checked
    
    without layerdrop:
    --decoder-layerdrop 0, or no flag at all
    | epoch 001:     10 / 11201 loss=27.469, nll_loss=27.469, ppl=185799477.36, wps=1764, ups=0, wpb=9216.000, bsz=3.000, num_updates=7, lr=0.0004376, gnorm=25.471, clip=1.000, oom=0.000, loss_scale=8.000, wall=37, train_wall=30
    | epoch 001:     20 / 11201 loss=27.443, nll_loss=27.443, ppl=182500427.22, wps=2449, ups=0, wpb=9216.000, bsz=3.000, num_updates=17, lr=0.0010626, gnorm=25.273, clip=1.000, oom=0.000, loss_scale=8.000, wall=64, train_wall=57
    | epoch 001:     30 / 11201 loss=27.404, nll_loss=27.404, ppl=177612215.78, wps=2720, ups=0, wpb=9216.000, bsz=3.000, num_updates=27, lr=0.0016876, gnorm=25.136, clip=1.000, oom=0.000, loss_scale=8.000, wall=91, train_wall=84
    | epoch 001:     40 / 11201 loss=27.009, nll_loss=27.009, ppl=135079983.00, wps=2865, ups=0, wpb=9216.000, bsz=3.000, num_updates=37, lr=0.0023126, gnorm=24.311, clip=1.000, oom=0.000, loss_scale=8.000, wall=119, train_wall=112
    | epoch 001:     50 / 11201 loss=26.418, nll_loss=26.418, ppl=89680259.41, wps=2952, ups=0, wpb=9216.000, bsz=3.000, num_updates=47, lr=0.0029376, gnorm=22.775, clip=1.000, oom=0.000, loss_scale=8.000, wall=147, train_wall=140
    
    with layerdrop (regularization effect should be seen in PPL); a minimal sketch of the mechanism follows the logs below:
    --decoder-layerdrop 0.2
    
    | epoch 001:     10 / 11201 loss=25.186, nll_loss=25.186, ppl=38182937.27, wps=2428, ups=0, wpb=9216.000, bsz=3.000, num_updates=8, lr=0.0005001, gnorm=17.082, clip=1.000, oom=0.000, loss_scale=16.000, wall=30, train_wall=24
    | epoch 001:     20 / 11201 loss=25.270, nll_loss=25.270, ppl=40451933.50, wps=3173, ups=0, wpb=9216.000, bsz=3.000, num_updates=18, lr=0.0011251, gnorm=17.162, clip=1.000, oom=0.000, loss_scale=16.000, wall=52, train_wall=45
    | epoch 001:     30 / 11201 loss=25.349, nll_loss=25.349, ppl=42752256.68, wps=3454, ups=0, wpb=9216.000, bsz=3.000, num_updates=28, lr=0.0017501, gnorm=17.370, clip=1.000, oom=0.000, loss_scale=16.000, wall=75, train_wall=68
    | epoch 001:     40 / 11201 loss=25.115, nll_loss=25.115, ppl=36343806.30, wps=3619, ups=0, wpb=9216.000, bsz=3.000, num_updates=38, lr=0.0023751, gnorm=16.945, clip=1.000, oom=0.000, loss_scale=16.000, wall=97, train_wall=90
    | epoch 001:     50 / 11201 loss=24.804, nll_loss=24.804, ppl=29284345.78, wps=3716, ups=0, wpb=9216.000, bsz=3.000, num_updates=48, lr=0.0030001, gnorm=16.406, clip=1.000, oom=0.000, loss_scale=16.000, wall=119, train_wall=112
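
    For reference, a minimal sketch (illustrative only, not the fairseq implementation added in this PR) of the mechanism behind --decoder-layerdrop: during training each layer is skipped with probability equal to the layerdrop rate, while evaluation always runs the full stack. The class and argument names below are made up for the sketch.

        import torch
        import torch.nn as nn

        class LayerDropStack(nn.Module):
            """Toy layer stack: each layer is randomly skipped during training."""

            def __init__(self, layers, layerdrop=0.2):
                super().__init__()
                self.layers = nn.ModuleList(layers)
                self.layerdrop = layerdrop  # probability of dropping each layer

            def forward(self, x):
                for layer in self.layers:
                    # One uniform draw per layer per forward pass; drop only in training.
                    if self.training and torch.rand(1).item() < self.layerdrop:
                        continue
                    x = layer(x)
                return x

        # e.g. stack = LayerDropStack([nn.Linear(16, 16) for _ in range(8)], layerdrop=0.2)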
    
    TEST 3: PICKING UP TRAINING FROM EXISTING MODEL
    checked
    
    | loaded checkpoint /checkpoint/angelafan/structured_0.1_block_8_sd02/checkpoint_last.pt (epoch 272 @ 381066 updates)
    | loading train data for epoch 272
    | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
    
    TEST 4: EVALUATING EXISTING BERT MODEL REPRODUCES RESULTS
    | [input] dictionary: 50265 types
    | [label] dictionary: 9 types
    | Accuracy:  0.9231651376146789
    achieves correct accuracy on SST2 for this model
    
    TEST 5: TRAINING NEW BERT MODEL WORKS
    checked
    
    TEST 6: NMT
    
    without layerdrop:
    --encoder-layerdrop 0 --decoder-layerdrop 0, or any combination of the flags specified and left unspecified
    
    | epoch 001:     10 / 92203 loss=15.820, nll_loss=15.830, ppl=58267.93, wps=4902, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=7.207, clip=0.000, oom=0.000, loss_scale=128.000, wall=60, train_wall=3
    | epoch 001:     20 / 92203 loss=15.523, nll_loss=15.501, ppl=46359.29, wps=5037, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.869, clip=0.000, oom=0.000, loss_scale=128.000, wall=63, train_wall=6
    | epoch 001:     30 / 92203 loss=15.185, nll_loss=15.123, ppl=35695.79, wps=5085, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.186, clip=0.000, oom=0.000, loss_scale=128.000, wall=66, train_wall=9
    | epoch 001:     40 / 92203 loss=14.940, nll_loss=14.849, ppl=29505.60, wps=5116, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=5.610, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=12
    | epoch 001:     50 / 92203 loss=14.745, nll_loss=14.630, ppl=25346.87, wps=5070, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.104, clip=0.000, oom=0.000, loss_scale=128.000, wall=71, train_wall=15
    
    with layerdrop (regularization effect should be seen in PPL); a sketch of the independent encoder/decoder rates follows the logs below
    
    A) works with --encoder-layerdrop 0.2 --decoder-layerdrop 0.2
    B) works with different settings --encoder-layerdrop 0.3 --decoder-layerdrop 0.5
    C) works with one on and one off --encoder-layerdrop 0.2 --decoder-layerdrop 0
    
    | epoch 001:     10 / 92203 loss=15.817, nll_loss=15.828, ppl=58158.54, wps=5355, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=6.959, clip=0.000, oom=0.000, loss_scale=128.000, wall=59, train_wall=3
    | epoch 001:     20 / 92203 loss=15.650, nll_loss=15.641, ppl=51111.63, wps=5515, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.825, clip=0.000, oom=0.000, loss_scale=128.000, wall=61, train_wall=6
    | epoch 001:     30 / 92203 loss=15.440, nll_loss=15.408, ppl=43491.58, wps=5602, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.576, clip=0.000, oom=0.000, loss_scale=128.000, wall=64, train_wall=8
    | epoch 001:     40 / 92203 loss=15.247, nll_loss=15.193, ppl=37457.14, wps=5676, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=6.124, clip=0.000, oom=0.000, loss_scale=128.000, wall=67, train_wall=11
    | epoch 001:     50 / 92203 loss=15.055, nll_loss=14.977, ppl=32259.92, wps=5598, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.661, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=14
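
    As cases A-C above exercise, the encoder and decoder rates are configured independently, and a rate of 0 disables LayerDrop for that stack. A toy sketch reusing the LayerDropStack from the earlier sketch (hypothetical wrapper class, not the fairseq transformer classes):

        import torch.nn as nn  # LayerDropStack is defined in the sketch under TEST 2

        class Seq2SeqLayerDrop(nn.Module):
            """Toy encoder-decoder wrapper with independent LayerDrop rates."""

            def __init__(self, enc_layers, dec_layers,
                         encoder_layerdrop=0.2, decoder_layerdrop=0.2):
                super().__init__()
                self.encoder = LayerDropStack(enc_layers, layerdrop=encoder_layerdrop)
                self.decoder = LayerDropStack(dec_layers, layerdrop=decoder_layerdrop)

            def forward(self, src, tgt):
                # Each stack applies its own drop rate; a real decoder would also
                # cross-attend to the encoder output, which this sketch omits.
                enc_out = self.encoder(src)
                dec_out = self.decoder(tgt)
                return enc_out, dec_out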
    
    TEST 7: PRUNING TESTCASES
    
    A) after adding the pruning flags, the model can still be evaluated as the full (unpruned) model
    checked, reaches the correct PPL
    num. model params: 246933504
    | Evaluated 217646 tokens in 196.3s (1108.99 tokens/s)
    | Loss: 2.9275, Perplexity: 18.68
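
    Quick arithmetic check on the eval output above: the reported perplexity is the exponential of the reported loss.

        import math
        print(math.exp(2.9275))  # ~18.68, matching the "Perplexity: 18.68" line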
    
    B) after adding the pruning flags, the model can be pruned; this works with multiple flag settings (a sketch of the idea follows the results below)
    checked three cases:
    num. model params: 146163712
    | Evaluated 217646 tokens in 106.0s (2054.07 tokens/s)
    | Loss: 3.0932, Perplexity: 22.05
    
    num. model params: 209144832
    | Evaluated 217646 tokens in 162.8s (1336.99 tokens/s)
    | Loss: 2.9526, Perplexity: 19.16
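
    A minimal sketch of what pruning to a layer configuration means here (illustrative only, not the flags or code added in this PR): keep a chosen subset of the trained layers, which is why the two pruned configurations above have correspondingly fewer parameters than the 246933504-parameter full model. As the log message in test case D below notes, this kind of inference-time pruning is expected to work well mainly when the model was trained with LayerDrop.

        import torch.nn as nn  # LayerDropStack is defined in the sketch under TEST 2

        def prune_layers(stack, layers_to_keep):
            """Keep only the listed layer indices of a trained LayerDropStack."""
            kept = [stack.layers[i] for i in sorted(layers_to_keep)]
            stack.layers = nn.ModuleList(kept)
            return stack

        # e.g. keep every other layer of a 16-layer stack (8 layers remain):
        # pruned = prune_layers(stack, layers_to_keep=range(0, 16, 2))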
    
    C) the model can pick up training if you want to finetune the pruned model
    checked:
    | loading train data for epoch 272
    | loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
    | WARNING: overflow detected, setting loss scale to: 64.0
    | WARNING: overflow detected, setting loss scale to: 32.0
    | epoch 272:   1500 / 5601 loss=5.015, nll_loss=5.015, ppl=32.33, wps=11598, ups=1, wpb=18432.000, bsz=6.000, num_updates=98, lr=0.0061251, gnorm=0.613, clip=1.000, oom=0.000, loss_scale=32.000, wall=156, train_wall=252396
    
    D) works with BERT
    checked:
    without specifying any flags, reproduces the correct standard accuracy
    with flags, produces the correct pruned accuracy
    
    | [input] dictionary: 50265 types
    | [label] dictionary: 9 types
    | Accuracy:  0.9231651376146789
    
    | [input] dictionary: 50265 types
    | [label] dictionary: 9 types
    | Pruning model to specified layer configuration - this works best if the model was trained with LayerDrop
    | Accuracy:  0.9220183486238532
    Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/890
    
    Reviewed By: edunov
    
    Differential Revision: D18094657
    
    Pulled By: huihuifan
    
    fbshipit-source-id: 2bbaa2ff0039e906782694fc2038b8c17a8693e7