- 21 Nov, 2019 1 commit
-
-
Alex Xiao authored
Summary: Modifying the number of shards internally to disable data sharding for batch iteration is dangerous, because the caller of these tasks is not limited to fairspeq/train. Therefore we should put the onus of sharding the data properly on the caller rather than on the task itself. Reviewed By: myleott Differential Revision: D18456424 fbshipit-source-id: d46be16c441c50082f9a768d0b259e6c28a4b67b
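A minimal sketch of what caller-side sharding looks like; `num_shards`/`shard_id` mirror the arguments of a fairseq-style `get_batch_iterator`, and the distributed-rank plumbing here is an illustrative assumption rather than the actual call site.

```python
import torch.distributed as dist

def build_epoch_iterator(task, dataset, max_tokens):
    """The caller decides how data is sharded, e.g. one shard per distributed worker."""
    if dist.is_available() and dist.is_initialized():
        num_shards, shard_id = dist.get_world_size(), dist.get_rank()
    else:
        num_shards, shard_id = 1, 0  # single process: no sharding

    # The task no longer silently overrides num_shards; the caller passes
    # the sharding configuration explicitly.
    return task.get_batch_iterator(
        dataset=dataset,
        max_tokens=max_tokens,
        num_shards=num_shards,
        shard_id=shard_id,
    )
```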
-
- 08 Nov, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/903 Reviewed By: sujitoc Differential Revision: D18327653 fbshipit-source-id: 739ddbaf54862acdf7b4f1bc3ad538bde5ae00fd
-
- 06 Nov, 2019 1 commit
-
-
Jerry Ma authored
Summary: - Adds memory summary logging to validation and optimization steps. - Clarifies in the logging that optimization OOMs are not recoverable. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/893 Differential Revision: D18110763 Pulled By: jma127 fbshipit-source-id: 49340e611169c606ab9c991265167a79f51846e6
-
- 27 Oct, 2019 1 commit
-
-
Angela Fan authored
Summary:
TEST 1: EVALUATION TIME WORKS
checked: achieves the correct model perplexity of 18.68

TEST 2: TRAINING A NEW MODEL WORKS
checked
without layerdrop (--decoder-layerdrop 0 OR no flag at all):
| epoch 001: 10 / 11201 loss=27.469, nll_loss=27.469, ppl=185799477.36, wps=1764, ups=0, wpb=9216.000, bsz=3.000, num_updates=7, lr=0.0004376, gnorm=25.471, clip=1.000, oom=0.000, loss_scale=8.000, wall=37, train_wall=30
| epoch 001: 20 / 11201 loss=27.443, nll_loss=27.443, ppl=182500427.22, wps=2449, ups=0, wpb=9216.000, bsz=3.000, num_updates=17, lr=0.0010626, gnorm=25.273, clip=1.000, oom=0.000, loss_scale=8.000, wall=64, train_wall=57
| epoch 001: 30 / 11201 loss=27.404, nll_loss=27.404, ppl=177612215.78, wps=2720, ups=0, wpb=9216.000, bsz=3.000, num_updates=27, lr=0.0016876, gnorm=25.136, clip=1.000, oom=0.000, loss_scale=8.000, wall=91, train_wall=84
| epoch 001: 40 / 11201 loss=27.009, nll_loss=27.009, ppl=135079983.00, wps=2865, ups=0, wpb=9216.000, bsz=3.000, num_updates=37, lr=0.0023126, gnorm=24.311, clip=1.000, oom=0.000, loss_scale=8.000, wall=119, train_wall=112
| epoch 001: 50 / 11201 loss=26.418, nll_loss=26.418, ppl=89680259.41, wps=2952, ups=0, wpb=9216.000, bsz=3.000, num_updates=47, lr=0.0029376, gnorm=22.775, clip=1.000, oom=0.000, loss_scale=8.000, wall=147, train_wall=140
with layerdrop (regularization effect should be seen in PPL; --decoder-layerdrop 0.2):
| epoch 001: 10 / 11201 loss=25.186, nll_loss=25.186, ppl=38182937.27, wps=2428, ups=0, wpb=9216.000, bsz=3.000, num_updates=8, lr=0.0005001, gnorm=17.082, clip=1.000, oom=0.000, loss_scale=16.000, wall=30, train_wall=24
| epoch 001: 20 / 11201 loss=25.270, nll_loss=25.270, ppl=40451933.50, wps=3173, ups=0, wpb=9216.000, bsz=3.000, num_updates=18, lr=0.0011251, gnorm=17.162, clip=1.000, oom=0.000, loss_scale=16.000, wall=52, train_wall=45
| epoch 001: 30 / 11201 loss=25.349, nll_loss=25.349, ppl=42752256.68, wps=3454, ups=0, wpb=9216.000, bsz=3.000, num_updates=28, lr=0.0017501, gnorm=17.370, clip=1.000, oom=0.000, loss_scale=16.000, wall=75, train_wall=68
| epoch 001: 40 / 11201 loss=25.115, nll_loss=25.115, ppl=36343806.30, wps=3619, ups=0, wpb=9216.000, bsz=3.000, num_updates=38, lr=0.0023751, gnorm=16.945, clip=1.000, oom=0.000, loss_scale=16.000, wall=97, train_wall=90
| epoch 001: 50 / 11201 loss=24.804, nll_loss=24.804, ppl=29284345.78, wps=3716, ups=0, wpb=9216.000, bsz=3.000, num_updates=48, lr=0.0030001, gnorm=16.406, clip=1.000, oom=0.000, loss_scale=16.000, wall=119, train_wall=112

TEST 3: PICKING UP TRAINING FROM AN EXISTING MODEL WORKS
checked
| loaded checkpoint /checkpoint/angelafan/structured_0.1_block_8_sd02/checkpoint_last.pt (epoch 272 @ 381066 updates)
| loading train data for epoch 272
| loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train

TEST 4: EVALUATING AN EXISTING BERT MODEL REPRODUCES RESULTS
achieves the correct accuracy on SST-2 for this model
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
| Accuracy: 0.9231651376146789

TEST 5: TRAINING A NEW BERT MODEL WORKS
checked and works

TEST 6: NMT
without layerdrop (--encoder-layerdrop 0 --decoder-layerdrop 0, OR combinations of flags specified and not specified):
| epoch 001: 10 / 92203 loss=15.820, nll_loss=15.830, ppl=58267.93, wps=4902, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=7.207, clip=0.000, oom=0.000, loss_scale=128.000, wall=60, train_wall=3
| epoch 001: 20 / 92203 loss=15.523, nll_loss=15.501, ppl=46359.29, wps=5037, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.869, clip=0.000, oom=0.000, loss_scale=128.000, wall=63, train_wall=6
| epoch 001: 30 / 92203 loss=15.185, nll_loss=15.123, ppl=35695.79, wps=5085, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.186, clip=0.000, oom=0.000, loss_scale=128.000, wall=66, train_wall=9
| epoch 001: 40 / 92203 loss=14.940, nll_loss=14.849, ppl=29505.60, wps=5116, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=5.610, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=12
| epoch 001: 50 / 92203 loss=14.745, nll_loss=14.630, ppl=25346.87, wps=5070, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.104, clip=0.000, oom=0.000, loss_scale=128.000, wall=71, train_wall=15
with layerdrop (regularization effect should be seen in PPL):
A) works with --encoder-layerdrop 0.2 --decoder-layerdrop 0.2
B) works with different settings: --encoder-layerdrop 0.3 --decoder-layerdrop 0.5
C) works with one on and one off: --encoder-layerdrop 0.2 --decoder-layerdrop 0
| epoch 001: 10 / 92203 loss=15.817, nll_loss=15.828, ppl=58158.54, wps=5355, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=6.959, clip=0.000, oom=0.000, loss_scale=128.000, wall=59, train_wall=3
| epoch 001: 20 / 92203 loss=15.650, nll_loss=15.641, ppl=51111.63, wps=5515, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.825, clip=0.000, oom=0.000, loss_scale=128.000, wall=61, train_wall=6
| epoch 001: 30 / 92203 loss=15.440, nll_loss=15.408, ppl=43491.58, wps=5602, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.576, clip=0.000, oom=0.000, loss_scale=128.000, wall=64, train_wall=8
| epoch 001: 40 / 92203 loss=15.247, nll_loss=15.193, ppl=37457.14, wps=5676, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=6.124, clip=0.000, oom=0.000, loss_scale=128.000, wall=67, train_wall=11
| epoch 001: 50 / 92203 loss=15.055, nll_loss=14.977, ppl=32259.92, wps=5598, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.661, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=14

TEST 7: PRUNING TEST CASES
A) After adding the pruning flags, the model can evaluate as a full model.
checked, reaches the correct PPL:
num. model params: 246933504 | Evaluated 217646 tokens in 196.3s (1108.99 tokens/s) | Loss: 2.9275, Perplexity: 18.68
B) After adding the pruning flags, the model can be pruned; this works with multiple flag settings.
checked three cases:
num. model params: 146163712 | Evaluated 217646 tokens in 106.0s (2054.07 tokens/s) | Loss: 3.0932, Perplexity: 22.05
num. model params: 209144832 | Evaluated 217646 tokens in 162.8s (1336.99 tokens/s) | Loss: 2.9526, Perplexity: 19.16
C) The model can pick up training if you want to finetune the pruned model.
checked:
| loading train data for epoch 272
| loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| epoch 272: 1500 / 5601 loss=5.015, nll_loss=5.015, ppl=32.33, wps=11598, ups=1, wpb=18432.000, bsz=6.000, num_updates=98, lr=0.0061251, gnorm=0.613, clip=1.000, oom=0.000, loss_scale=32.000, wall=156, train_wall=252396
D) Works with BERT.
checked: without specifying any flags, reproduces the correct standard accuracy; with flags, produces the correct pruned accuracy
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
| Accuracy: 0.9231651376146789
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
| Pruning model to specified layer configuration - this works best if the model was trained with LayerDrop
| Accuracy: 0.9220183486238532

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/890 Reviewed By: edunov Differential Revision: D18094657 Pulled By: huihuifan fbshipit-source-id: 2bbaa2ff0039e906782694fc2038b8c17a8693e7
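For context, a minimal sketch of the LayerDrop mechanism that the --encoder-layerdrop/--decoder-layerdrop flags control: during training each layer is skipped with probability p, while at inference every kept layer runs. The class below is illustrative and not fairseq's actual implementation.

```python
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    """Stack of layers where each layer is stochastically skipped during training."""

    def __init__(self, layers, layerdrop: float = 0.0):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.layerdrop = layerdrop  # e.g. 0.2 for --decoder-layerdrop 0.2

    def forward(self, x):
        for layer in self.layers:
            # One Bernoulli draw per layer per forward pass; only drop while training.
            if self.training and torch.rand(1).item() < self.layerdrop:
                continue  # skip the layer entirely; the residual path is the identity
            x = layer(x)
        return x
```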
-
- 24 Oct, 2019 1 commit
-
-
Jerry Ma authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/892 Differential Revision: D18109685 Pulled By: jma127 fbshipit-source-id: f96e1080a5577b8ee0748dfdd956bf72bed47474
-
- 12 Oct, 2019 1 commit
-
-
Sujit Verma authored
Summary: Added an option to save checkpoints using PathManager. Reviewed By: hudeven Differential Revision: D17392754 fbshipit-source-id: 4b8e556ef8455a1548e5a083d779ed809cd785be
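A rough sketch of checkpoint saving through a PathManager-style abstraction; the fvcore import path and the call site are assumptions, not the code in this diff.

```python
import torch
# Assumption: an fvcore-style PathManager is available; it routes I/O so that
# checkpoints can live on non-local filesystems as well as on local disk.
from fvcore.common.file_io import PathManager

def save_checkpoint(state, path):
    """Save a checkpoint through PathManager instead of the plain filesystem API."""
    with PathManager.open(path, "wb") as f:
        torch.save(state, f)
```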
-
- 09 Oct, 2019 1 commit
-
-
Alex Xiao authored
Summary: We currently shard data when creating the batch iterator. This means we first load all indices/frame lengths/handles into memory, and then do the sharding. This makes it impossible to train on large datasets with a large number of workers, because each worker needs to load the entire dataset into memory. For training on a million hours of data (i.e. semi-supervised or unsupervised approaches) this data loading makes it flat-out impossible to use 8 GPUs. 3 changes: 1. This diff modifies the data loading so that we do the sharding while we read the handles file, rather than later. This modification is done on a task-by-task basis, since the task specifies how the data is loaded. I've tried to make the code compatible with both sharding during handle loading and sharding during batch iteration. I've currently only done the sharding during handle loading for the aligned_training task. 2. To support data sharding at data loading time and the requirement that all shards must have exactly the same # of batches, I've added a method to do this synchronization, where all shards with too many batches simply truncate the extra ones, similar to what we already do. 3. In fairspeq/train.py, we are actually loading the training dataset and batch iterator twice, once in train.py and once when loading the checkpoint (which we always do, regardless of whether there is a checkpoint). This means double the loading time, which can be painful for very large files. I've removed the extraneous loading in this diff as well. Reviewed By: yqwangustc Differential Revision: D17750715 fbshipit-source-id: 0e6e3d363525fa5661f1c784303390ea13f46377
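A rough sketch of the first two changes: sharding while reading the handles file, and truncating so all shards yield the same number of batches. The file format and helper names are assumptions.

```python
import torch
import torch.distributed as dist

def load_shard_of_handles(handle_path: str, shard_id: int, num_shards: int):
    """Keep only every num_shards-th line for this shard, so no worker ever
    materializes the full index in memory."""
    handles = []
    with open(handle_path, "r") as f:
        for i, line in enumerate(f):
            if i % num_shards == shard_id:
                handles.append(line.rstrip("\n"))
    return handles

def common_num_batches(num_local_batches: int) -> int:
    """All shards must have exactly the same number of batches, so each shard
    truncates its extras down to the global minimum."""
    if not (dist.is_available() and dist.is_initialized()):
        return num_local_batches
    n = torch.tensor([num_local_batches])
    dist.all_reduce(n, op=dist.ReduceOp.MIN)
    return int(n.item())
```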
-
- 08 Oct, 2019 1 commit
-
-
Jerry Ma authored
Summary: PyTorch now has more comprehensive memory instrumentation, added in https://github.com/pytorch/pytorch/pull/27361 . This PR makes fairseq print a summary table of the memory state when an OOM occurs. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/885 Differential Revision: D17820445 Pulled By: jma127 fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0
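torch.cuda.memory_summary() is the real API added in that PyTorch PR; the wrapper below is only an illustrative sketch of printing the table when an OOM is caught, not fairseq's actual handler.

```python
import torch

def train_step_with_oom_report(train_step, *args, **kwargs):
    """Run a training step and dump the CUDA allocator state if it OOMs."""
    try:
        return train_step(*args, **kwargs)
    except RuntimeError as e:
        if "out of memory" in str(e) and torch.cuda.is_available():
            # memory_summary() returns a table of allocated/reserved memory,
            # fragmentation, and allocation counts for the current device.
            print(torch.cuda.memory_summary())
        raise
```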
-
- 04 Oct, 2019 1 commit
-
-
Jerry Ma authored
Summary: This adds a periodic call to `torch.cuda.empty_cache()` in order to mitigate memory fragmentation in the PyTorch CUDA caching allocator, which can cause OOMs on models approaching the GPU memory limit. By default, this will occur every 64 updates. Performance considerations: - I've benchmarked this on a reasonably large model with a 16 GB memory footprint, and the overhead with the default setting is <0.2%. With `update-freq > 1`, the cost is mitigated even further. - This behavior can be disabled with a value of zero. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/882 Differential Revision: D17742386 Pulled By: jma127 fbshipit-source-id: 68d8f93f798d6818b5efc3d67d43b52dfb8b2865
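A minimal sketch of the periodic flush described above; the 64-update default comes from the summary, while the surrounding names are assumed.

```python
import torch

EMPTY_CACHE_FREQ = 64  # 0 disables the periodic flush

def maybe_empty_cache(num_updates: int):
    """Release cached but unused blocks back to CUDA every N updates to
    reduce fragmentation in the caching allocator."""
    if EMPTY_CACHE_FREQ > 0 and num_updates % EMPTY_CACHE_FREQ == 0 and torch.cuda.is_available():
        torch.cuda.empty_cache()
```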
-
- 20 Sep, 2019 1 commit
-
-
Naman Goyal authored
Summary: The multilingual RoBERTa training is working with aconneau's XLM data. Two pieces remain: 1) `XLM` limits each batch to a single language. I am not 100% sure about the reason for that, but it should be easy to implement; basically we can add a `batch_by_size_and_language` function instead of the default `batch_by_size`. If it's not critical, I would prefer to leave it out, as that keeps the code very clean and simple. 2) `sample_ratio` in `ConcatDataset` works with `int` values by tiling the datasets based on the ratio. Currently I am handling it by rounding off the ratio to the `first decimal` and then multiplying by `10`. We can see if such simple heuristics are good enough; there are other options (we can talk about them offline). Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/849 Differential Revision: D17162460 fbshipit-source-id: d967f3d872f7a1f0aa4ea418bd362b68af9e432f
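A sketch of the rounding heuristic in point 2: ConcatDataset's integer tiling counts are approximated by rounding each float ratio to the first decimal place and scaling by 10. The function name is illustrative.

```python
def ratios_to_tiling_counts(sample_ratios):
    """Approximate float sampling ratios with the integer tiling counts that
    ConcatDataset expects: round to the first decimal place, then scale by 10
    (e.g. a ratio of 1.37 becomes a tiling count of 14)."""
    return [max(1, int(round(r * 10))) for r in sample_ratios]
```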
-
- 16 Sep, 2019 1 commit
-
-
Naman Goyal authored
Summary: Added the `--fast-stat-sync` option. This avoids pickle and achieves `~7%` more `wps` on 16 nodes. It is less flexible, as it aggregates only basic stats and ignores the aggregation function defined by the criterion. Let me know what you think, myleott. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/858 Differential Revision: D17398770 fbshipit-source-id: 36261a1d970e67deeda8211af8f009ef9b4f9c14
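A sketch of the general idea behind `--fast-stat-sync`: pack a fixed set of scalar stats into one tensor and all_reduce it, instead of pickling per-criterion dictionaries. The function name and stat layout are assumptions.

```python
import torch
import torch.distributed as dist

def fast_stat_sync(loss, nll_loss, ntokens, nsentences, device="cuda"):
    """All-reduce a flat tensor of basic stats: no pickling, one collective.
    NCCL requires CUDA tensors; pass device="cpu" when using gloo."""
    stats = torch.tensor(
        [loss, nll_loss, float(ntokens), float(nsentences)],
        dtype=torch.float64, device=device,
    )
    dist.all_reduce(stats)  # defaults to SUM across workers
    loss, nll_loss, ntokens, nsentences = stats.tolist()
    return loss, nll_loss, int(ntokens), int(nsentences)
```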
-
- 21 Aug, 2019 1 commit
-
-
Jeff Cai authored
Summary: Support criterions with parameters, such as the AutoSegmentationCriterion (ASG) used in wav2letter, which has a transition matrix parameter. This is needed to integrate wav2letter's ASG into PySpeech. With this diff, parameters in criterions will be: (1) updated by optimizers, with a configurable learning rate; (2) saved to and loaded from checkpoints, preserving backward compatibility for criterions without parameters; (3) synchronized across nodes in distributed training. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/808 Reviewed By: jcai1 Differential Revision: D16934097 Pulled By: okhonko fbshipit-source-id: 121ec9382459385c6f9cbef3a8274bec1a434038
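A sketch of what a criterion with its own parameters looks like and how points (1) and (2) play out; the class is illustrative and not wav2letter's actual ASG implementation.

```python
import torch
import torch.nn as nn

class TransitionCriterion(nn.Module):
    """Criterion carrying a learnable ASG-style transition matrix."""

    def __init__(self, num_labels: int):
        super().__init__()
        # Learnable label-to-label transition scores, updated by the optimizer.
        self.transitions = nn.Parameter(torch.zeros(num_labels, num_labels))

    def forward(self, emissions, targets):
        # Placeholder loss; a real ASG criterion scores paths through self.transitions.
        return emissions.sum() * 0.0 + self.transitions.abs().mean()

model = nn.Linear(80, 30)
criterion = TransitionCriterion(num_labels=30)

# (1) The optimizer sees both model and criterion parameters, the latter with its own lr.
optimizer = torch.optim.SGD(
    [{"params": model.parameters()},
     {"params": criterion.parameters(), "lr": 1e-3}],
    lr=1e-2,
)

# (2) Checkpoints store the criterion state alongside the model state.
state = {"model": model.state_dict(), "criterion": criterion.state_dict()}
```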
-
- 20 Aug, 2019 1 commit
-
-
Arya McCarthy authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1040 Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/836 Reviewed By: myleott, liezl200 Differential Revision: D16889252 fbshipit-source-id: 45a1b6c1217fb099f0350096e38e1c7d83ea0a64
-
- 30 Jul, 2019 1 commit
-
-
Myle Ott authored
Summary: The previous BSD+PATENTS license was controversial. We have been approved to relicense fairseq under the MIT license. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/786 Differential Revision: D16560654 Pulled By: myleott fbshipit-source-id: f78b1beb4f2895dd7b9bfc79f5f952a2bfb94034
-
- 22 Jul, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/756 Differential Revision: D16418302 Pulled By: myleott fbshipit-source-id: 62495a0bff41d1741e2b09807a3b43ff2c66c8fb
-
- 11 Jul, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/711 Differential Revision: D16192752 Pulled By: myleott fbshipit-source-id: 102ed337a3d31e2047be7c033e9007c04223a684
-
- 21 Jun, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/671 Differential Revision: D15925248 fbshipit-source-id: 9eeea8a257929347e2458afdfc1def8dbb925a72
-
- 20 Jun, 2019 2 commits
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/818 Differential Revision: D15916265 Pulled By: myleott fbshipit-source-id: c66c0bd988d3472c4150226952f34ee8d4c3db86
-
alexeib authored
Summary: Merging wav2vec to master. Includes renames (Cpc -> wav2vec) and some light example files. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/654 Differential Revision: D15913409 Pulled By: alexeib fbshipit-source-id: f723e6f211706cd9431c7d76dc12c4e80c9cfc80
-
- 12 Jun, 2019 1 commit
-
-
Nayan Singhal authored
Summary: Implemented model averaging for fairseq. Removed the DDP wrapper if a global optimizer is provided. Syncs all the models based on the iteration provided in the input. TODO: 1) Fix the throughput and wps meters; need to check other meters too. 2) Replace the model-averaging code with an implementation of the BMUF algorithm. Reviewed By: myleott Differential Revision: D15711044 fbshipit-source-id: 58a4af74db2a61d06762597b95836cbeb1ed82cc
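A minimal sketch of plain model averaging (the pre-BMUF version described above): each worker trains locally without a DDP wrapper and periodically averages its parameters with the other workers. Names and the sync interval are assumptions.

```python
import torch
import torch.distributed as dist

def average_model_parameters(model: torch.nn.Module):
    """Replace every parameter with its element-wise mean across workers."""
    world_size = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p.data)   # sum over workers ...
            p.data.div_(world_size)   # ... then divide to get the mean

# Inside the training loop (sync_every is an assumed hyper-parameter):
# if num_updates % sync_every == 0:
#     average_model_parameters(model)
```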
-
- 30 May, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/613 Differential Revision: D15541384 Pulled By: myleott fbshipit-source-id: ef2c0b0a51cdf37af2ccff0546f524d49f87e65d
-
- 17 May, 2019 2 commits
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/588 Differential Revision: D15389638 Pulled By: myleott fbshipit-source-id: 4632ce22d51dc2c74d250bae999630095d849701
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/586 Differential Revision: D15372949 Pulled By: myleott fbshipit-source-id: c1cf1c645e8d55fc8568f23a47c45677ac9ab1da
-
- 09 May, 2019 1 commit
-
-
Myle Ott authored
-
- 04 May, 2019 1 commit
-
-
Myle Ott authored
Summary: It was tedious defining these; let's try just taking the first batch lazily instead. Pull Request resolved: https://github.com/pytorch/fairseq/pull/699 Differential Revision: D15188266 Pulled By: myleott fbshipit-source-id: a4c9f7ee3111278faaffa8a22ba91ed5f50e143d
-
- 03 May, 2019 1 commit
-
-
Yongqiang Wang authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairspeq/pull/2 Pull Request resolved: https://github.com/pytorch/fairseq/pull/689 We found that not raising OOMs during trainer.train_step causes various issues, including NCCL hangs / gloo sync errors, because gradients are not synced properly. Until we find the root cause, let's give users an option to raise OOMs. Reviewed By: jmp84 Differential Revision: D15170357 fbshipit-source-id: 3e15e4e111a8380612157955509c39821a216ec4
-
- 02 May, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/692 Differential Revision: D15174954 fbshipit-source-id: 1a7bff9aeed3e2cc658577be9d79e8c9f72314c2
-
- 01 May, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/685 Differential Revision: D15154647 Pulled By: myleott fbshipit-source-id: 36c72359755192a4a53367e19f8dd006791d483c
-
- 30 Apr, 2019 1 commit
-
-
Myle Ott authored
Summary: - Add --add-bos-token option to LM task - Cleanup utils.py and options.py Pull Request resolved: https://github.com/pytorch/fairseq/pull/654 Differential Revision: D15041794 Pulled By: myleott fbshipit-source-id: 3ad00007769d5f48308052cfd40de39c5ffa1a6e
-
- 29 Apr, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/676 Differential Revision: D15114128 Pulled By: myleott fbshipit-source-id: b11dde77b2f2610d33649101aea03fb5a3eeb56a
-
- 10 Apr, 2019 1 commit
-
-
Peng-Jen Chen authored
Summary: - Add language token to MultilingualTranslation task - Add back translation and denoising loss to MultilingualTranslation task Pull Request resolved: https://github.com/pytorch/fairseq/pull/620 Reviewed By: liezl200 Differential Revision: D14756873 Pulled By: pipibjc fbshipit-source-id: 89d668db26848fd95f446edf5923bab2113636f7
-
- 04 Apr, 2019 1 commit
-
-
Jay Mahadeokar authored
Summary: This diff adds: 1. An aligned training task specifically for doing cross-entropy criterion training using prod data and prod-like models. 2. A few changes to correctly register the task and criterions. 3. Changes to the trainer code for propagating the accuracy metrics we care about for training. A couple of things are hacky right now: - The reporting is not modular (this needs to be thought about in general for fairseq). - The get-dummy-batch logic could be specific to the task instead of specific to the dataset. Reviewed By: myleott Differential Revision: D14670482 fbshipit-source-id: dc077247b2ae9d26a8e842a386ec5faa5771e836
-
- 12 Mar, 2019 1 commit
-
-
Dmytro Okhonko authored
Summary: sequence_generator assumes that the model input is a 2d tensor of longs, but it can be something like a 3d tensor of floats, and we should be able to handle that as long as the first dimension is the batch size, followed by the source lengths. Reviewed By: myleott Differential Revision: D14420044 fbshipit-source-id: bf8b1e42ad1873f7b803c1a377b0af21648db015
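A sketch of the shape handling described above: infer the batch size from the first dimension so the input can be a 2d LongTensor of tokens or a 3d FloatTensor of features. The helper name and pad index are illustrative.

```python
import torch

def infer_batch_and_lengths(src_tokens: torch.Tensor, pad_idx: int = 1):
    """Handle both [batch, src_len] token input and [batch, src_len, feat_dim] feature input."""
    bsz, src_len = src_tokens.size(0), src_tokens.size(1)
    if src_tokens.dim() == 2:
        # Token input: lengths come from counting non-pad symbols.
        src_lengths = src_tokens.ne(pad_idx).long().sum(dim=1)
    else:
        # Feature input: assume every frame up to src_len is valid.
        src_lengths = torch.full((bsz,), src_len, dtype=torch.long)
    return bsz, src_len, src_lengths
```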
-
- 26 Feb, 2019 1 commit
-
-
Myle Ott authored
Summary: * Add example for multilingual translation on IWSLT'17 * Match dataset ordering for multilingual_translation and translation * Fix bug with LegacyDistributedDataParallel when calling forward of sub-modules Pull Request resolved: https://github.com/pytorch/fairseq/pull/527 Differential Revision: D14218372 Pulled By: myleott fbshipit-source-id: 2e3fe24aa39476bcc5c9af68ef9a40192db34a3b
-
- 06 Feb, 2019 1 commit
-
-
Wei Ho authored
Add CheckpointManager to keep average checkpoint weights in memory, reducing disk reads when averaging, plus various checkpoint refactoring Summary: Pull Request resolved: https://github.com/pytorch/translate/pull/315 Reviewed By: akinh Differential Revision: D13510446 fbshipit-source-id: 22a6594af9253130a93e638285a47183a974e0de
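A sketch of keeping a running average of checkpoint weights in memory so that averaging does not re-read every checkpoint from disk; this illustrates the idea only and is not translate's actual CheckpointManager API.

```python
import copy
import torch

class RunningCheckpointAverage:
    """In-memory running average of checkpoint weights (assumes tensor-only state dicts)."""

    def __init__(self):
        self.avg_state = None
        self.count = 0

    def add(self, state_dict):
        """Fold one more checkpoint into the average without keeping it on disk."""
        self.count += 1
        if self.avg_state is None:
            self.avg_state = {k: v.detach().clone().float() for k, v in state_dict.items()}
            return
        for k, v in state_dict.items():
            # Incremental mean: avg += (new - avg) / count
            self.avg_state[k] += (v.float() - self.avg_state[k]) / self.count

    def averaged_state_dict(self):
        return copy.deepcopy(self.avg_state)
```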
-
- 25 Jan, 2019 1 commit
-
-
Xian Li authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/474 Reviewed By: theweiho, akinh Differential Revision: D13701447 fbshipit-source-id: 34036dce7601835b605e3b169210edc7a6715de6
-
- 17 Jan, 2019 1 commit
-
-
Myle Ott authored
Summary: There was a very subtle bug here
😢 When we recently removed this line (7633129b), it meant that the learning rate scheduler didn't get initialized until after the first update. Unfortunately pytorch optimizers store the learning rate in their internal state, so some learning rate schedulers use their `__init__` method to reset the learning rate to some sane initial value. This is especially problematic for LR schedulers that include a warmup, where the Optimizer is likely to contain the peak learning rate at initialization, and it's only in the LR scheduler's `__init__` that the (much smaller) warmup value is set. For example, the inverse_sqrt scheduler resets the learning rate upon initialization: https://github.com/pytorch/fairseq/blob/7853818c2e33a63ec17a31bcfe20e4fc75d94130/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py#L48-L50 **Impact:** For the last ~1.5 weeks, the first training update would use the optimizer's default learning rate instead of the initial rate set by the LR scheduler. All subsequent updates used the correct learning rates. This primarily affects LR schedulers with warmups. Pull Request resolved: https://github.com/pytorch/fairseq/pull/453 Differential Revision: D13704453 Pulled By: myleott fbshipit-source-id: a946da30100f837c66bdc6b9b77b014ab4eb8764
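The linked scheduler is real; below is a trimmed sketch (attribute names simplified) of the pattern the bug description relies on: the scheduler's `__init__` overwrites the optimizer's peak learning rate with the much smaller warmup value, so constructing it only after the first update leaves that first update at the wrong rate.

```python
class InverseSqrtSchedule:
    """Warm up linearly to the peak lr, then decay proportionally to 1/sqrt(update)."""

    def __init__(self, optimizer, warmup_init_lr, peak_lr, warmup_updates):
        self.optimizer = optimizer
        self.warmup_init_lr = warmup_init_lr
        self.warmup_updates = warmup_updates
        self.lr_step = (peak_lr - warmup_init_lr) / warmup_updates
        self.decay_factor = peak_lr * warmup_updates ** 0.5
        # The important part: __init__ pushes the small warmup lr into the
        # optimizer. If the scheduler is only built after the first update,
        # that update runs at the optimizer's default (peak) lr instead.
        self.lr = warmup_init_lr
        self.set_lr(self.lr)

    def set_lr(self, lr):
        for group in self.optimizer.param_groups:
            group["lr"] = lr

    def step_update(self, num_updates):
        """Called after every optimizer update."""
        if num_updates < self.warmup_updates:
            self.lr = self.warmup_init_lr + num_updates * self.lr_step
        else:
            self.lr = self.decay_factor * num_updates ** -0.5
        self.set_lr(self.lr)
        return self.lr
```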
-
- 09 Jan, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/439 Differential Revision: D13608151 Pulled By: myleott fbshipit-source-id: 198b84995a6329f8329829cc91184d88f1eab947
-
- 05 Jan, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/translate/pull/283 Pull Request resolved: https://github.com/pytorch/fairseq/pull/428 Differential Revision: D13564190 Pulled By: myleott fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5
-
- 28 Dec, 2018 1 commit
-
-
Myle Ott authored
Summary: This was broken in 03a57dec. Pull Request resolved: https://github.com/pytorch/fairseq/pull/424 Differential Revision: D13557540 Pulled By: myleott fbshipit-source-id: 62deda5353032aff20d35d046b0bb843da44d27c
-