- 05 Nov, 2019 2 commits
ngoyal2707 authored
Summary: TODO: 1) update the bibtex entry; 2) upload the models, spm_vocab, and dict.txt to a public S3 location. For the future: I will probably add instructions for finetuning on XNLI, NER, POS, etc., but there is currently no timeline for that. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/900 Reviewed By: myleott Differential Revision: D18333076 Pulled By: myleott fbshipit-source-id: 3f3d3716fcc41c78d2dd4525f60b519abbd0459c
Spencer Poff authored
Summary: https://github.com/pytorch/fairseq/pull/1097 added key padding mask history in TransformerDecoderLayer, but in an edge case where only the current or only the previous key_padding_mask exists, the resulting key_padding_mask has the wrong size. This diff adds empty columns in such cases to ensure the key_padding_mask has a usable size. Reviewed By: myleott Differential Revision: D18224313 fbshipit-source-id: c9fb7266baf0a2d79a66704e00a5ea8bd2987ff6
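The idea of the fix, as a minimal sketch (function name and shape plumbing here are illustrative, not the actual fairseq code): when only one of the two masks exists, substitute an all-False block of matching width so the concatenated mask always covers every key position.

```python
import torch

def combine_key_padding_masks(prev_mask, curr_mask, bsz, prev_len, curr_len):
    # If one side is missing, assume those positions have no padding and
    # fill in an all-False ("empty") block of the matching width.
    if prev_mask is None:
        prev_mask = torch.zeros(bsz, prev_len, dtype=torch.bool)
    if curr_mask is None:
        curr_mask = torch.zeros(bsz, curr_len, dtype=torch.bool)
    # The resulting mask covers prev_len + curr_len key positions.
    return torch.cat([prev_mask, curr_mask], dim=1)
```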
- 02 Nov, 2019 1 commit
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1340 Differential Revision: D18289455 Pulled By: myleott fbshipit-source-id: a1c8163a35273b6c646d300142701e8a317d7378
- 01 Nov, 2019 2 commits
Chau Tran authored
Summary: Fix integration test Reviewed By: xianxl Differential Revision: D18040440 fbshipit-source-id: 98c8ab7970d081f17deb54c69aa35669de12d767
Halil Akin authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/898 Pull Request resolved: https://github.com/pytorch/fairseq/pull/1333 Pull Request resolved: https://github.com/fairinternal/fairspeq/pull/11 These in_proj_weight and in_proj_bias properties are not the right way of providing backward compatibility, and they are causing other incompatibilities with the new Dynamic Quantization API. So, let's remove them and properly fix the failing tests. Reviewed By: myleott Differential Revision: D18264129 fbshipit-source-id: fc1838657a60d914ca83c4e0f6add5ed8206ac54
- 31 Oct, 2019 2 commits
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/897 Differential Revision: D18250587 Pulled By: myleott fbshipit-source-id: b9cef376bc014b68766229aab7b6e454480757d3
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/895 Reviewed By: akinh Differential Revision: D18246479 Pulled By: myleott fbshipit-source-id: a610f1e4943619d32a523601a572fb09cdc5638d
- 30 Oct, 2019 1 commit
Xian Li authored
Summary: This diff enables layer drop in the transformer decoder in the production training pipeline (ptt_transformer). It builds on top of the fairseq implementation D18094657 added by Angela Fan, and adds logic to handle dropping the corresponding layers at test time in the exported model. Reviewed By: jhcross Differential Revision: D18165586 fbshipit-source-id: 373ac00268a25fa9e412edcb483becdfe792d992
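For reference, the mechanism itself is small. A hedged sketch of LayerDrop-style stochastic depth (class and method names here are made up, not the ptt_transformer code):

```python
import torch
import torch.nn as nn

class LayerDropDecoder(nn.Module):
    """Sketch: skip each decoder layer with probability `layerdrop` during
    training; run all layers at inference (or prune a fixed subset in the
    exported model)."""

    def __init__(self, layers, layerdrop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.layerdrop = layerdrop

    def forward(self, x):
        for layer in self.layers:
            if self.training and torch.rand(()).item() < self.layerdrop:
                continue  # drop this layer for the current batch
            x = layer(x)
        return x
```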
- 28 Oct, 2019 1 commit
Ning Dong authored
Summary: Revert the interface change for iterative_refinement_generator Reviewed By: kahne Differential Revision: D18165103 fbshipit-source-id: 075c276746eb90d7c359b6ad92e1ef25e8452bcc
- 27 Oct, 2019 1 commit
Angela Fan authored
Summary:

TEST 1: EVALUATION TIME WORKS
checked, achieves correct model perplexity: 18.68

TEST 2: TRAINING NEW MODEL WORKS
checked without layerdrop (--decoder-layerdrop 0 OR no flag at all):
| epoch 001: 10 / 11201 loss=27.469, nll_loss=27.469, ppl=185799477.36, wps=1764, ups=0, wpb=9216.000, bsz=3.000, num_updates=7, lr=0.0004376, gnorm=25.471, clip=1.000, oom=0.000, loss_scale=8.000, wall=37, train_wall=30
| epoch 001: 20 / 11201 loss=27.443, nll_loss=27.443, ppl=182500427.22, wps=2449, ups=0, wpb=9216.000, bsz=3.000, num_updates=17, lr=0.0010626, gnorm=25.273, clip=1.000, oom=0.000, loss_scale=8.000, wall=64, train_wall=57
| epoch 001: 30 / 11201 loss=27.404, nll_loss=27.404, ppl=177612215.78, wps=2720, ups=0, wpb=9216.000, bsz=3.000, num_updates=27, lr=0.0016876, gnorm=25.136, clip=1.000, oom=0.000, loss_scale=8.000, wall=91, train_wall=84
| epoch 001: 40 / 11201 loss=27.009, nll_loss=27.009, ppl=135079983.00, wps=2865, ups=0, wpb=9216.000, bsz=3.000, num_updates=37, lr=0.0023126, gnorm=24.311, clip=1.000, oom=0.000, loss_scale=8.000, wall=119, train_wall=112
| epoch 001: 50 / 11201 loss=26.418, nll_loss=26.418, ppl=89680259.41, wps=2952, ups=0, wpb=9216.000, bsz=3.000, num_updates=47, lr=0.0029376, gnorm=22.775, clip=1.000, oom=0.000, loss_scale=8.000, wall=147, train_wall=140
with layerdrop (regularization effect should be seen in PPL), --decoder-layerdrop 0.2:
| epoch 001: 10 / 11201 loss=25.186, nll_loss=25.186, ppl=38182937.27, wps=2428, ups=0, wpb=9216.000, bsz=3.000, num_updates=8, lr=0.0005001, gnorm=17.082, clip=1.000, oom=0.000, loss_scale=16.000, wall=30, train_wall=24
| epoch 001: 20 / 11201 loss=25.270, nll_loss=25.270, ppl=40451933.50, wps=3173, ups=0, wpb=9216.000, bsz=3.000, num_updates=18, lr=0.0011251, gnorm=17.162, clip=1.000, oom=0.000, loss_scale=16.000, wall=52, train_wall=45
| epoch 001: 30 / 11201 loss=25.349, nll_loss=25.349, ppl=42752256.68, wps=3454, ups=0, wpb=9216.000, bsz=3.000, num_updates=28, lr=0.0017501, gnorm=17.370, clip=1.000, oom=0.000, loss_scale=16.000, wall=75, train_wall=68
| epoch 001: 40 / 11201 loss=25.115, nll_loss=25.115, ppl=36343806.30, wps=3619, ups=0, wpb=9216.000, bsz=3.000, num_updates=38, lr=0.0023751, gnorm=16.945, clip=1.000, oom=0.000, loss_scale=16.000, wall=97, train_wall=90
| epoch 001: 50 / 11201 loss=24.804, nll_loss=24.804, ppl=29284345.78, wps=3716, ups=0, wpb=9216.000, bsz=3.000, num_updates=48, lr=0.0030001, gnorm=16.406, clip=1.000, oom=0.000, loss_scale=16.000, wall=119, train_wall=112

TEST 3: PICKING UP TRAINING FROM EXISTING MODEL
checked:
| loaded checkpoint /checkpoint/angelafan/structured_0.1_block_8_sd02/checkpoint_last.pt (epoch 272 @ 381066 updates)
| loading train data for epoch 272
| loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train

TEST 4: EVALUATING EXISTING BERT MODEL REPROS RESULTS
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
| Accuracy: 0.9231651376146789
achieves correct accuracy on SST2 for this model

TEST 5: TRAINING NEW BERT MODEL WORKS
checked and works

TEST 6: NMT
without layerdrop (--encoder-layerdrop 0 --decoder-layerdrop 0 OR combinations of flag specified and not specified):
| epoch 001: 10 / 92203 loss=15.820, nll_loss=15.830, ppl=58267.93, wps=4902, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=7.207, clip=0.000, oom=0.000, loss_scale=128.000, wall=60, train_wall=3
| epoch 001: 20 / 92203 loss=15.523, nll_loss=15.501, ppl=46359.29, wps=5037, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.869, clip=0.000, oom=0.000, loss_scale=128.000, wall=63, train_wall=6
| epoch 001: 30 / 92203 loss=15.185, nll_loss=15.123, ppl=35695.79, wps=5085, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.186, clip=0.000, oom=0.000, loss_scale=128.000, wall=66, train_wall=9
| epoch 001: 40 / 92203 loss=14.940, nll_loss=14.849, ppl=29505.60, wps=5116, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=5.610, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=12
| epoch 001: 50 / 92203 loss=14.745, nll_loss=14.630, ppl=25346.87, wps=5070, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.104, clip=0.000, oom=0.000, loss_scale=128.000, wall=71, train_wall=15
with layerdrop (regularization effect should be seen in PPL):
A) works with --encoder-layerdrop 0.2 --decoder-layerdrop 0.2
B) works with different settings --encoder-layerdrop 0.3 --decoder-layerdrop 0.5
C) works with one on and one off --encoder-layerdrop 0.2 --decoder-layerdrop 0
| epoch 001: 10 / 92203 loss=15.817, nll_loss=15.828, ppl=58158.54, wps=5355, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=6.959, clip=0.000, oom=0.000, loss_scale=128.000, wall=59, train_wall=3
| epoch 001: 20 / 92203 loss=15.650, nll_loss=15.641, ppl=51111.63, wps=5515, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.825, clip=0.000, oom=0.000, loss_scale=128.000, wall=61, train_wall=6
| epoch 001: 30 / 92203 loss=15.440, nll_loss=15.408, ppl=43491.58, wps=5602, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.576, clip=0.000, oom=0.000, loss_scale=128.000, wall=64, train_wall=8
| epoch 001: 40 / 92203 loss=15.247, nll_loss=15.193, ppl=37457.14, wps=5676, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=6.124, clip=0.000, oom=0.000, loss_scale=128.000, wall=67, train_wall=11
| epoch 001: 50 / 92203 loss=15.055, nll_loss=14.977, ppl=32259.92, wps=5598, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.661, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=14

TEST 7: PRUNING TESTCASES
A) after adding the pruning flags, model can evaluate as a full model
checked, reaches correct PPL:
num. model params: 246933504 | Evaluated 217646 tokens in 196.3s (1108.99 tokens/s) | Loss: 2.9275, Perplexity: 18.68
B) after adding pruning flags, model can be pruned; this works with multiple flag settings
checked three cases:
num. model params: 146163712 | Evaluated 217646 tokens in 106.0s (2054.07 tokens/s) | Loss: 3.0932, Perplexity: 22.05
num. model params: 209144832 | Evaluated 217646 tokens in 162.8s (1336.99 tokens/s) | Loss: 2.9526, Perplexity: 19.16
C) model can pick up training if you want to finetune the pruned model
checked:
| loading train data for epoch 272
| loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| epoch 272: 1500 / 5601 loss=5.015, nll_loss=5.015, ppl=32.33, wps=11598, ups=1, wpb=18432.000, bsz=6.000, num_updates=98, lr=0.0061251, gnorm=0.613, clip=1.000, oom=0.000, loss_scale=32.000, wall=156, train_wall=252396
D) works with BERT
checked: without specifying any flags, reproduces the correct standard accuracy; with flags, produces the correct pruned accuracy
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
| Accuracy: 0.9231651376146789
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
| Pruning model to specified layer configuration - this works best if the model was trained with LayerDrop
| Accuracy: 0.9220183486238532

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/890 Reviewed By: edunov Differential Revision: D18094657 Pulled By: huihuifan fbshipit-source-id: 2bbaa2ff0039e906782694fc2038b8c17a8693e7
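The pruning in TEST 7 amounts to keeping a fixed subset of layers at load time. A rough sketch under assumed names (prune_layers and layers_to_keep are illustrative, not fairseq's actual API):

```python
import torch.nn as nn

def prune_layers(layers: nn.ModuleList, layers_to_keep) -> nn.ModuleList:
    # Keep only the requested layer indices; as the log above notes, this
    # works best if the model was trained with LayerDrop.
    keep = set(layers_to_keep)
    return nn.ModuleList(layer for i, layer in enumerate(layers) if i in keep)

# e.g. keeping every other layer of a 16-layer decoder:
# decoder.layers = prune_layers(decoder.layers, range(0, 16, 2))
```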
- 26 Oct, 2019 1 commit
Xian Li authored
Summary: Fix a type mismatch which was found after patching NAT on top of quantization. Ning suggested this fix. Still need to understand why this only appears after patching the quantization diff. Reviewed By: kahne, jhcross Differential Revision: D18147726 fbshipit-source-id: a51becc9ad58a637a0180074eaa2b46990ab9f84
- 25 Oct, 2019 2 commits
Halil Akin authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1304 Pull Request resolved: https://github.com/pytorch/translate/pull/657 Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1065 Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/889 We are converting matmuls to quantizable nn.Linear modules in this diff. First, let's check the profile after the diff to see how the low-level operations change. Reviewed By: jmp84, edunov, lly-zero-one, jhcross Differential Revision: D17964796 fbshipit-source-id: 3ddd3ff81fa1ea5864dded98e993f4fe3b71fe5e
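The motivation is that PyTorch's dynamic quantization swaps out recognized module types, so projections need to be nn.Linear modules rather than bare matmuls to benefit. A toy illustration (not the fairseq model itself):

```python
import torch
import torch.nn as nn

# A projection written as nn.Linear is visible to dynamic quantization;
# an equivalent torch.matmul against a raw weight tensor would be skipped.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # the Linear layers are replaced by dynamic quantized variants
```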
Halil Akin authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/888 We want to simplify multihead attention and get rid of the dynamic in_proj_weight logic. Sending the diff out early for feedback; there will be further changes as I fix the breaking tests. Reviewed By: edunov Differential Revision: D17912661 fbshipit-source-id: 0e6319fc694d8ec5187d1c2fefe5839d9d522186
- 24 Oct, 2019 4 commits
Ning Dong authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1299 LevT calls into the tracing-compliant transformer, which we didn't plan to OSS earlier. This is a workaround to unbreak master. Will revisit and simplify the code later. Reviewed By: pipibjc Differential Revision: D18110339 fbshipit-source-id: 3bb51c56c2c20f45db1d5786d030b374b412eab1
Jerry Ma authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/892 Differential Revision: D18109685 Pulled By: jma127 fbshipit-source-id: f96e1080a5577b8ee0748dfdd956bf72bed47474
Jerry Ma authored
Summary: Makes more sense to reset either both meters or neither of them. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/891 Differential Revision: D18109027 Pulled By: jma127 fbshipit-source-id: f63baed9a6b928a6f591a76e69ef6e9c524e4398
Ning Dong authored
Summary: NAT productionization diff: (1) Integrate NAT model training/evaluation into the LATTE base training workflow. (2) Make NAT tracing compliant. Since it calls into the Fairseq transformer, we need to refactor the code, so I created a ~copy of it named fb_tracing_transformer. (3) Decoder-side C++ code landed in an earlier diff. Reviewed By: xianxl Differential Revision: D17888324 fbshipit-source-id: ef4ef195fddd360da921502adcef82b087e46ce6
- 23 Oct, 2019 1 commit
Yilei Li authored
Summary: Enables the reduce_on_plateau schedule with an optional warmup phase, where we linearly increase the learning rate from some initial learning rate (``--warmup-init-lr``) until the configured learning rate (``--lr``). Thereafter the lr is adjusted according to the original reduce_on_plateau scheme. During warmup::

    lrs = torch.linspace(args.warmup_init_lr, args.lr, args.warmup_updates)
    lr = lrs[update_num]

Reviewed By: yqwangustc Differential Revision: D17779925 fbshipit-source-id: c3bfb3321c76850824fc42df4fac4e5dcf73fbf8
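A self-contained sketch of the schedule described above, using torch's stock ReduceLROnPlateau (the hyperparameter values and the step_schedule wrapper are placeholders, not the fairseq scheduler):

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

warmup_init_lr, lr, warmup_updates = 1e-7, 1e-3, 4000  # placeholder values

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
plateau = ReduceLROnPlateau(optimizer, mode="min")
warmup_lrs = torch.linspace(warmup_init_lr, lr, warmup_updates)

def step_schedule(update_num, val_loss=None):
    if update_num < warmup_updates:
        # linear warmup: lr = lrs[update_num]
        for group in optimizer.param_groups:
            group["lr"] = warmup_lrs[update_num].item()
    elif val_loss is not None:
        # thereafter, the original reduce_on_plateau behavior
        plateau.step(val_loss)
```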
- 22 Oct, 2019 3 commits
Changhan Wang authored
Summary: Bugfix for inconsistent scores on the same input sentences. This only affects the displayed scores in `generate.py` and does not affect the model outputs. Reviewed By: MultiPath Differential Revision: D17799343 fbshipit-source-id: 2b868ac03097a4db27db736e126a61d50958acc5
Louis MARTIN authored
Summary: Very small change. The previous message was misleading: the length of TokenBlocksDataset is the number of "blocks" or "streams", but not, strictly speaking, the number of batches, if I am not mistaken. I use the notion of batch from the roberta pretraining docs: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md. It took me some time to understand what was going on; I hope it saves some time for others. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1279 Differential Revision: D18051476 fbshipit-source-id: 71fa35f21b9dbc8d6bde28cd3a487723690aadee
Louis MARTIN authored
Summary: Fix for https://github.com/pytorch/fairseq/issues/1240 Tested with MaskedLMTask. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1281 Differential Revision: D18051472 fbshipit-source-id: 0aeff60c71489655f5e621349f780ba9cd8c027a
- 20 Oct, 2019 2 commits
Jiatao Gu authored
Summary: The diff contains two fixes: (1) enabling non-shared decoder layers for deletion/insertion; (2) adding options to perform sampling instead of argmax when learning deletion. Reviewed By: kahne Differential Revision: D18011220 fbshipit-source-id: c60815fb7bc3a0004c81249504f7a641536ae2d8
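The second fix boils down to replacing a greedy decision with a sampled one. Roughly (function name and shapes assumed for illustration):

```python
import torch

def deletion_decision(logits, sampling=False):
    # logits: (batch, length, 2) -> per-token keep/delete scores
    if sampling:
        probs = torch.softmax(logits, dim=-1)
        flat = probs.view(-1, probs.size(-1))
        return torch.multinomial(flat, 1).view(logits.shape[:-1])
    return logits.argmax(dim=-1)  # original greedy behavior
```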
Jiatao Gu authored
Summary: Fix typos in the examples Reviewed By: kahne Differential Revision: D18030097 fbshipit-source-id: 84f0cbafd85e50ffd5033738835373935e3b83d4
- 18 Oct, 2019 3 commits
Spencer Poff authored
Summary: In https://github.com/fairinternal/fairseq-py/pull/877, sequence_generator began calling `model.forward_decoder`, but not all decoder models were given an implementation of that function. Reviewed By: okhonko Differential Revision: D17863751 fbshipit-source-id: ea70b636c9dafcf87f5d5e49631d0c4b7cf14984
dikshameghwal authored
Summary: removed redundant quotes in the filename assigned for dev dataset for GLUE tasks Pull Request resolved: https://github.com/pytorch/fairseq/pull/1270 Differential Revision: D18013071 fbshipit-source-id: 35f00162e117c6584dc859f760503ca32dcb706e
Changhan Wang authored
Summary: When the `if` statements in the Levenshtein transformer decoder forward are removed, `attn` may end up with a batch size inconsistent with the output tokens. This is a fix. Reviewed By: cndn Differential Revision: D17936411 fbshipit-source-id: a1583f3806dc9f41caeb783c043429e247035803
- 15 Oct, 2019 2 commits
Nayan Singhal authored
Summary: This unit test guards the BMUF code. Change: distributed_init assumed we are always using a CUDA device, which is not the case when using the "gloo" backend on a CPU machine. Reviewed By: jay-mahadeokar Differential Revision: D17821391 fbshipit-source-id: 28e1bb39f7a4889b1dc6bd636b7c499e55bfc69a
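The shape of the fix, sketched (argument plumbing simplified; this is not the actual fairseq distributed_init):

```python
import torch
import torch.distributed as dist

def distributed_init(backend, init_method, world_size, rank):
    dist.init_process_group(backend=backend, init_method=init_method,
                            world_size=world_size, rank=rank)
    # Only touch CUDA when it is actually in use, so the "gloo" backend
    # keeps working on CPU-only machines (e.g. in unit tests).
    if backend == "nccl" and torch.cuda.is_available():
        torch.cuda.set_device(rank % torch.cuda.device_count())
```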
Changhan Wang authored
Summary: Bring back the changes in D17661768 Reviewed By: ailzhang Differential Revision: D17920299 fbshipit-source-id: be3f93a044a8710c8b475012c39e36a3e6507fad
- 12 Oct, 2019 1 commit
Sujit Verma authored
Summary: Added option to save checkpoints using Path Manager. Reviewed By: hudeven Differential Revision: D17392754 fbshipit-source-id: 4b8e556ef8455a1548e5a083d779ed809cd785be
- 11 Oct, 2019 2 commits
Jiatao Gu authored
Summary: The original implementation of the random mask differs from what the paper stated. Reviewed By: kahne Differential Revision: D17652564 fbshipit-source-id: 238a9158041b3ff2482ee50ce6151c3f77f0b2c1
Jiatao Gu authored
Summary: Implementation of the Levenshtein Transformer paper. Adds a new helper function "new_arange" to create arange tensors easily. Fixes bugs in returning attn values for NAT models. Deletes files which are unnecessary or experimental. Reviewed By: kahne Differential Revision: D17652009 fbshipit-source-id: 436bbb5d45de2f8067003232de4f2bd51e87719c
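The helper is small; something along these lines, sketched from the description:

```python
import torch

def new_arange(x, *size):
    # Return an arange tensor of the given size (default: x's size),
    # broadcast along the leading dimensions and placed on x's device.
    if len(size) == 0:
        size = x.size()
    return torch.arange(size[-1], device=x.device).expand(*size).contiguous()
```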
- 10 Oct, 2019 2 commits
Dmytro Okhonko authored
Summary: Adds CTC loss and corresponding transformer ctc based models. Tested with `CUDA_VISIBLE_DEVICES=0 python train.py $DATA_PATH --save-dir $SAVE_DIR --max-epoch 30 --task speech_recognition --arch vggtransformer_enc_1 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 10000 --log-format json --log-interval 1 --criterion ctc_loss --user-dir examples/speech_recognition/ --validate-interval=10` Pull Request resolved: https://github.com/pytorch/fairseq/pull/1233 Reviewed By: jcai1 Differential Revision: D17856824 Pulled By: okhonko fbshipit-source-id: f3eac64d3fdd0c37cf8c539dd360cfb610d8a6ef
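For reference, the core of a CTC criterion in plain PyTorch terms (shapes here are hypothetical; the ctc_loss criterion in examples/speech_recognition wraps this differently):

```python
import torch
import torch.nn as nn

# T=50 encoder frames, N=4 utterances, C=32 output labels (index 0 = CTC blank).
log_probs = torch.randn(50, 4, 32).log_softmax(dim=-1)
targets = torch.randint(1, 32, (4, 20), dtype=torch.long)
input_lengths = torch.full((4,), 50, dtype=torch.long)
target_lengths = torch.randint(10, 21, (4,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```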
Jeff Cai authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/846 Reviewed By: jcai1 Differential Revision: D17845996 Pulled By: okhonko fbshipit-source-id: 3826fd9a4418496916bf1835c319dd85c89945cc
- 09 Oct, 2019 1 commit
Alex Xiao authored
Summary: We currently shard data when creating the batch iterator. This means we first load all indices/frame lengths/handles into memory, and then do the sharding. This makes it impossible to train on large datasets with a high number of workers, because each worker needs to load the entire dataset into memory. For training on a million hours of data (i.e. semi-supervised or unsupervised approaches), this data loading makes it flat out impossible to use 8 GPUs. Three changes:
1. This diff modifies the data loading so that we do the sharding while we read the handles file, rather than later (see the sketch below). This modification is done on a task-by-task basis, since the task specifies how the data is loaded. I've tried to make the code compatible with both sharding during handle loading and sharding during batch iteration; for now, only the aligned_training task shards during handle loading.
2. To support data sharding at data-loading time and the requirement that all shards must have exactly the same number of batches, I've added a method to do this synchronization, where all shards with too many batches truncate the extra ones, similar to what we already do.
3. In fairspeq/train.py, we were loading the training dataset and batch iterator twice: once in train.py and once when loading the checkpoint (which we always do, regardless of whether there is a checkpoint). This means double the loading time, which can be painful for very large files. I've removed the extraneous loading in this diff as well.
Reviewed By: yqwangustc Differential Revision: D17750715 fbshipit-source-id: 0e6e3d363525fa5661f1c784303390ea13f46377
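The core of change 1 is a streaming shard filter. A minimal sketch (file format and function name assumed):

```python
def load_handles_sharded(path, shard_id, num_shards):
    # Keep only every num_shards-th line while reading, so no worker
    # ever materializes the full handles file in memory.
    handles = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i % num_shards == shard_id:
                handles.append(line.rstrip("\n"))
    return handles
```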
- 08 Oct, 2019 3 commits
Jerry Ma authored
Summary: PyTorch now has more comprehensive memory instrumentation, added in https://github.com/pytorch/pytorch/pull/27361 . This PR makes fairseq print a summary table of the memory state when an OOM occurs. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/885 Differential Revision: D17820445 Pulled By: jma127 fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0
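A sketch of how such a hook can look (the actual fairseq wiring differs):

```python
import torch

def train_step(model, batch):
    try:
        loss = model(batch).sum()
        loss.backward()
    except RuntimeError as e:
        if "out of memory" in str(e) and torch.cuda.is_available():
            # memory_summary() is part of the instrumentation referenced
            # above; it returns a human-readable table of allocator state.
            print(torch.cuda.memory_summary())
        raise
```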
Jungo Kasai authored
Summary: Add ensemble wrappers to the Levenshtein NAT, ensembling the final softmax over the pipeline of three steps: 1. Deletion; 2. Placeholder insertion; 3. Word selection. Each step involves scoring, averaging the scores over the ensemble, and then making hard decisions with argmax, after which the next step follows. We cannot do the three steps in parallel by design. Reviewed By: kahne Differential Revision: D17723202 fbshipit-source-id: 05f7a4fcd922a972cc4796ca397e8220f0b4d53e
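Each stage therefore looks roughly like this (a sketch, not the wrapper's real interface):

```python
import torch

def ensemble_stage(logits_per_model):
    # Average the per-model probabilities for this stage, then commit to a
    # hard argmax; the next stage runs on the result, so stages are serial.
    probs = torch.stack([l.softmax(dim=-1) for l in logits_per_model]).mean(0)
    return probs.argmax(dim=-1)
```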
Changhan Wang authored
Summary: Fix the max length calculation in Levenshtein Transformer Reviewed By: jhcross Differential Revision: D17672946 fbshipit-source-id: e5efbe7e56cf879d3e822864e4398f99f45b04d4
- 07 Oct, 2019 1 commit
Nayan Singhal authored
Summary: In all our final settings, we use global_sync = 50 and get results comparable with DDP and caffe2. This sets the default global-sync-iter to 50, so users can just pass --use-bmuf to enable BMUF for training. Reviewed By: skritika Differential Revision: D17765094 fbshipit-source-id: 369591eeff266d757f89e1fc8dda01711146fdbc
- 05 Oct, 2019 1 commit
alexeib authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/884 Differential Revision: D17774515 Pulled By: alexeib fbshipit-source-id: d1ffe8ab723fa284c69b067bbd43d699eaa2f02f
- 04 Oct, 2019 1 commit
Jerry Ma authored
Summary: This adds a periodic call to `torch.cuda.empty_cache()` in order to mitigate memory fragmentation in the PyTorch CUDA caching allocator, which can cause OOMs on models approaching the GPU memory limit. By default, this occurs every 64 updates. Performance considerations:
- I've benchmarked this on a reasonably large model with a 16 GB memory footprint, and the overhead with the default setting is <0.2%. With `update-freq > 1`, the cost is mitigated even further.
- This behavior can be disabled with a value of zero.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/882 Differential Revision: D17742386 Pulled By: jma127 fbshipit-source-id: 68d8f93f798d6818b5efc3d67d43b52dfb8b2865
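The mechanism itself is tiny. A sketch (the constant name is illustrative; the fairseq flag may be spelled differently):

```python
import torch

EMPTY_CACHE_FREQ = 64  # default from this change; 0 disables the behavior

def maybe_empty_cache(num_updates):
    # Periodically return cached blocks to the CUDA driver to mitigate
    # fragmentation in the caching allocator.
    if EMPTY_CACHE_FREQ > 0 and num_updates % EMPTY_CACHE_FREQ == 0:
        torch.cuda.empty_cache()
```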