- 25 Oct, 2019 1 commit
-
-
Halil Akin authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/888 We want to simplify multihead attention and get rid of the dynamic in_proj_weight logic. Sending the diff early for feedback, will have further changes as I try to fix breaking tests Reviewed By: edunov Differential Revision: D17912661 fbshipit-source-id: 0e6319fc694d8ec5187d1c2fefe5839d9d522186
-
- 24 Oct, 2019 4 commits
-
-
Ning Dong authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1299 LevT calls into a tracing-compliant transformer that we did not originally plan to open-source. This is a workaround to unbreak master. We will revisit and simplify the code later. Reviewed By: pipibjc Differential Revision: D18110339 fbshipit-source-id: 3bb51c56c2c20f45db1d5786d030b374b412eab1
-
Jerry Ma authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/892 Differential Revision: D18109685 Pulled By: jma127 fbshipit-source-id: f96e1080a5577b8ee0748dfdd956bf72bed47474
-
Jerry Ma authored
Summary: Makes more sense to reset either both meters or neither of them. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/891 Differential Revision: D18109027 Pulled By: jma127 fbshipit-source-id: f63baed9a6b928a6f591a76e69ef6e9c524e4398
-
Ning Dong authored
Summary: NAT productionization diff: (1) Integrate NAT model training / evaluation into the LATTE base training workflow. (2) Make NAT tracing compliant. Since it calls into the Fairseq transformer, we needed to refactor the code, and I created a near-copy of it named fb_tracing_transformer. (3) Decoder-side C++ code was landed in an earlier diff. Reviewed By: xianxl Differential Revision: D17888324 fbshipit-source-id: ef4ef195fddd360da921502adcef82b087e46ce6
-
- 23 Oct, 2019 1 commit
-
-
Yilei Li authored
Summary: Enables the reduce_on_plateau schedule with an optional warmup phase, where we linearly increase the learning rate from some initial learning rate (``--warmup-init-lr``) up to the configured learning rate (``--lr``). Thereafter the lr is adjusted according to the original reduce_on_plateau scheme. During warmup:: lrs = torch.linspace(args.warmup_init_lr, args.lr, args.warmup_updates); lr = lrs[update_num] Reviewed By: yqwangustc Differential Revision: D17779925 fbshipit-source-id: c3bfb3321c76850824fc42df4fac4e5dcf73fbf8
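A minimal sketch of the warmup-then-plateau behaviour described above, assuming the argument names from the commit message; the actual fairseq scheduler may differ:

```python
import torch

class WarmupReduceOnPlateauSketch:
    """Linear warmup for the first `warmup_updates` steps, then hand control
    to PyTorch's ReduceLROnPlateau (a sketch, not fairseq's implementation)."""

    def __init__(self, optimizer, warmup_init_lr, lr, warmup_updates, patience=0):
        self.optimizer = optimizer
        self.warmup_updates = warmup_updates
        # Pre-compute the linear warmup schedule.
        self.warmup_lrs = torch.linspace(warmup_init_lr, lr, warmup_updates)
        # After warmup, the plateau scheduler adjusts the lr on validation loss.
        self.plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, patience=patience)

    def step_update(self, update_num):
        # Called every update: apply the warmup lr while still warming up.
        if update_num < self.warmup_updates:
            lr = self.warmup_lrs[update_num].item()
            for group in self.optimizer.param_groups:
                group["lr"] = lr

    def step(self, val_loss):
        # Called every validation epoch: original reduce_on_plateau behaviour.
        self.plateau.step(val_loss)
```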
-
- 22 Oct, 2019 3 commits
-
-
Changhan Wang authored
Summary: Bugfix for inconsistent scores on the same input sentences. This only affects the displayed scores in `generate.py` and does not affect the model outputs. Reviewed By: MultiPath Differential Revision: D17799343 fbshipit-source-id: 2b868ac03097a4db27db736e126a61d50958acc5
-
Louis MARTIN authored
Summary: Very small change. The previous message was misleading: the length of TokenBlocksDataset is the number of "blocks" or "streams", but not, strictly speaking, the number of batches, if I am not mistaken. I use the notion of batch from RoBERTa: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md. It took me some time to understand what was going on; I hope it saves some time for others. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1279 Differential Revision: D18051476 fbshipit-source-id: 71fa35f21b9dbc8d6bde28cd3a487723690aadee
-
Louis MARTIN authored
Summary: Fix for https://github.com/pytorch/fairseq/issues/1240 Tested with MaskedLMTask. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1281 Differential Revision: D18051472 fbshipit-source-id: 0aeff60c71489655f5e621349f780ba9cd8c027a
-
- 20 Oct, 2019 2 commits
-
-
Jiatao Gu authored
Summary: The diff contains two fixes: (1) enabling non-shared decoder layers for deletion/insertion; (2) adding options to perform sampling instead of argmax when learning the deletion. Reviewed By: kahne Differential Revision: D18011220 fbshipit-source-id: c60815fb7bc3a0004c81249504f7a641536ae2d8
-
Jiatao Gu authored
Summary: Fix typos in the examples Reviewed By: kahne Differential Revision: D18030097 fbshipit-source-id: 84f0cbafd85e50ffd5033738835373935e3b83d4
-
- 18 Oct, 2019 3 commits
-
-
Spencer Poff authored
Summary: In https://github.com/fairinternal/fairseq-py/pull/877, sequence_generator began calling `model.forward_decoder`, but not all decoder models were given an implementation of that function. Reviewed By: okhonko Differential Revision: D17863751 fbshipit-source-id: ea70b636c9dafcf87f5d5e49631d0c4b7cf14984
-
dikshameghwal authored
Summary: removed redundant quotes in the filename assigned for dev dataset for GLUE tasks Pull Request resolved: https://github.com/pytorch/fairseq/pull/1270 Differential Revision: D18013071 fbshipit-source-id: 35f00162e117c6584dc859f760503ca32dcb706e
-
Changhan Wang authored
Summary: When the `if` statements in the Levenshtein transformer decoder forward are removed, `attn` may end up with a batch size inconsistent with the output tokens. This is a fix. Reviewed By: cndn Differential Revision: D17936411 fbshipit-source-id: a1583f3806dc9f41caeb783c043429e247035803
-
- 15 Oct, 2019 2 commits
-
-
Nayan Singhal authored
Summary: This unit test guards the BMUF code. Change: 1. distributed_init assumes we are always using a CUDA device, which is not the case if you are using the "gloo" backend on a CPU machine. Reviewed By: jay-mahadeokar Differential Revision: D17821391 fbshipit-source-id: 28e1bb39f7a4889b1dc6bd636b7c499e55bfc69a
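A hypothetical sketch of the fix described above, selecting the backend and device based on CUDA availability instead of assuming CUDA; the function and argument names are illustrative, not fairseq's actual distributed_init:

```python
import torch
import torch.distributed as dist

def distributed_init_sketch(init_method, world_size, rank, backend="nccl"):
    # Fall back to gloo on CPU-only machines instead of assuming a CUDA device.
    if not torch.cuda.is_available():
        backend = "gloo"
    dist.init_process_group(
        backend=backend, init_method=init_method,
        world_size=world_size, rank=rank)
    # Only pin a CUDA device when one actually exists.
    if torch.cuda.is_available():
        torch.cuda.set_device(rank % torch.cuda.device_count())
```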
-
Changhan Wang authored
Summary: Bring back the changes in D17661768 Reviewed By: ailzhang Differential Revision: D17920299 fbshipit-source-id: be3f93a044a8710c8b475012c39e36a3e6507fad
-
- 12 Oct, 2019 1 commit
-
-
Sujit Verma authored
Summary: Added option to save checkpoints using Path Manager. Reviewed By: hudeven Differential Revision: D17392754 fbshipit-source-id: 4b8e556ef8455a1548e5a083d779ed809cd785be
-
- 11 Oct, 2019 2 commits
-
-
Jiatao Gu authored
Summary: The original implementation of the random mask is different from what the paper stated. Reviewed By: kahne Differential Revision: D17652564 fbshipit-source-id: 238a9158041b3ff2482ee50ce6151c3f77f0b2c1
-
Jiatao Gu authored
Summary: Implementation of the Levenshtein Transformer paper. Add a new helper function "new_arange" to create arange tensors easily. Fix bugs in returning attn values for NAT models. Delete files which are unnecessary or experimental. Reviewed By: kahne Differential Revision: D17652009 fbshipit-source-id: 436bbb5d45de2f8067003232de4f2bd51e87719c
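An illustrative sketch of what a helper like "new_arange" can look like (an assumption about its shape, not necessarily the exact fairseq implementation):

```python
import torch

def new_arange(x, *size):
    """Return an arange tensor on the same device as `x`, expanded to `size`
    (or to x's own shape when no size is given)."""
    if len(size) == 0:
        size = x.size()
    return torch.arange(size[-1], device=x.device).expand(*size).contiguous()

# Example: positions for each token in a padded batch of token ids.
tokens = torch.zeros(2, 5, dtype=torch.long)
positions = new_arange(tokens)   # shape (2, 5), each row is 0..4
```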
-
- 10 Oct, 2019 2 commits
-
-
Dmytro Okhonko authored
Summary: Adds CTC loss and corresponding transformer ctc based models. Tested with `CUDA_VISIBLE_DEVICES=0 python train.py $DATA_PATH --save-dir $SAVE_DIR --max-epoch 30 --task speech_recognition --arch vggtransformer_enc_1 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 10000 --log-format json --log-interval 1 --criterion ctc_loss --user-dir examples/speech_recognition/ --validate-interval=10` Pull Request resolved: https://github.com/pytorch/fairseq/pull/1233 Reviewed By: jcai1 Differential Revision: D17856824 Pulled By: okhonko fbshipit-source-id: f3eac64d3fdd0c37cf8c539dd360cfb610d8a6ef
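A hedged sketch of what a CTC criterion built on torch.nn.functional.ctc_loss looks like; tensor shapes and the blank index are assumptions, not the exact fairseq criterion added here:

```python
import torch
import torch.nn.functional as F

def ctc_loss_sketch(encoder_out, encoder_lengths, targets, target_lengths, blank_idx=0):
    """encoder_out: (T, B, V) logits over the output vocabulary;
    targets: (B, S) label indices; lengths are 1-D tensors of size B."""
    log_probs = F.log_softmax(encoder_out, dim=-1)   # CTC expects log-probabilities
    return F.ctc_loss(
        log_probs, targets, encoder_lengths, target_lengths,
        blank=blank_idx, reduction="sum", zero_infinity=True)
```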
-
Jeff Cai authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/846 Reviewed By: jcai1 Differential Revision: D17845996 Pulled By: okhonko fbshipit-source-id: 3826fd9a4418496916bf1835c319dd85c89945cc
-
- 09 Oct, 2019 1 commit
-
-
Alex Xiao authored
Summary: We currently shard data when creating the batch iterator. This means we first load all indices/frame lengths/handles into memory, and then do the sharding. This makes it impossible to train on large datasets with a high number of workers because each worker will need to load the entire dataset into memory. For training on a million hours of data (i.e. semi-supervised or unsupervised approaches) this data loading just makes it flat out impossible to use 8 GPUs. 3 changes: 1. This diff modifies the data loading such that we do the sharding while we read the handles file, rather than later. This modification is done on a task-by-task basis, since the task specifies how the data is loaded. I've tried to make the code compatible with both sharding during handle loading and sharding during batch iteration. I've currently only done the sharding during handle loading for the aligned_training task. 2. To support data sharding at data loading time and the requirement that all shards must have exactly the same # of batches, I've added a method to do this synchronization where all shards with too many batches simply truncate the extra ones, similar to what we already do. 3. In fairspeq/train.py, we are actually loading the training dataset and batch iterator twice, once in train.py and once when loading the checkpoint (which we always do regardless of whether there is a checkpoint). This means double the loading time, which can be painful for very large files. I've removed the extraneous loading in this diff as well. Reviewed By: yqwangustc Differential Revision: D17750715 fbshipit-source-id: 0e6e3d363525fa5661f1c784303390ea13f46377
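A rough sketch of the two mechanisms described in points 1 and 2 above — sharding while reading the handles file, and truncating every shard to a common number of batches; the file format and helper names are assumptions:

```python
import torch
import torch.distributed as dist

def load_shard_sketch(handles_path, shard_id, num_shards):
    """Read only this worker's slice of the handles file instead of loading
    everything and sharding later (hypothetical format: one handle per line)."""
    shard = []
    with open(handles_path) as f:
        for i, line in enumerate(f):
            if i % num_shards == shard_id:
                shard.append(line.rstrip("\n"))
    return shard

def truncate_to_common_length(batches):
    """All shards must yield exactly the same number of batches, so shards
    with too many batches drop the extras (assumes the process group is up)."""
    n = torch.tensor([len(batches)])
    dist.all_reduce(n, op=dist.ReduceOp.MIN)
    return batches[: n.item()]
```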
-
- 08 Oct, 2019 3 commits
-
-
Jerry Ma authored
Summary: PyTorch now has more comprehensive memory instrumentation, added in https://github.com/pytorch/pytorch/pull/27361 . This PR makes fairseq print a summary table of the memory state when an OOM occurs. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/885 Differential Revision: D17820445 Pulled By: jma127 fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0
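A minimal sketch of the pattern, assuming a PyTorch recent enough to provide torch.cuda.memory_summary(); this is illustrative, not fairseq's exact OOM handler:

```python
import torch

def train_step_with_oom_report(step_fn, *args, **kwargs):
    """Run a training step; on CUDA OOM, print the allocator summary table
    for each device before re-raising."""
    try:
        return step_fn(*args, **kwargs)
    except RuntimeError as e:
        if "out of memory" in str(e) and torch.cuda.is_available():
            for device_id in range(torch.cuda.device_count()):
                print(torch.cuda.memory_summary(device=device_id, abbreviated=False))
        raise
```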
-
Jungo Kasai authored
Summary: Add ensemble wrappers to the Levenshtein NAT: a final-softmax ensemble over the pipeline of three steps: 1. Deletion, 2. Placeholder Insertion, 3. Word Selection. Each step involves scoring, averaging the scores over the ensemble, and then making hard decisions with argmax; then the next step follows. We cannot do the three steps in parallel by design. Reviewed By: kahne Differential Revision: D17723202 fbshipit-source-id: 05f7a4fcd922a972cc4796ca397e8220f0b4d53e
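A hedged sketch of the per-step ensembling pattern (score with each model, average, argmax); the method name and inputs are hypothetical, while the real wrappers apply this to the decoder's three sequential steps:

```python
import torch

def ensemble_step_sketch(models, step_name, *inputs):
    """Run one decoding step (deletion, placeholder insertion, or word
    selection) with every model, average the scores, and take the argmax.
    Each per-model scoring method is assumed to return scores of equal shape."""
    scores = torch.stack([getattr(m, step_name)(*inputs) for m in models], dim=0)
    avg_scores = scores.mean(dim=0)      # average over the ensemble
    return avg_scores.argmax(dim=-1)     # hard decision for this step
```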
-
Changhan Wang authored
Summary: Fix the max length calculation in Levenshtein Transformer Reviewed By: jhcross Differential Revision: D17672946 fbshipit-source-id: e5efbe7e56cf879d3e822864e4398f99f45b04d4
-
- 07 Oct, 2019 1 commit
-
-
Nayan Singhal authored
Summary: In all our final settings we use global_sync = 50, and we get comparable results with DDP and caffe2. This sets the default global-sync-iter to 50, so users can just pass --use-bmuf to enable it for training. Reviewed By: skritika Differential Revision: D17765094 fbshipit-source-id: 369591eeff266d757f89e1fc8dda01711146fdbc
-
- 05 Oct, 2019 1 commit
-
-
alexeib authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/884 Differential Revision: D17774515 Pulled By: alexeib fbshipit-source-id: d1ffe8ab723fa284c69b067bbd43d699eaa2f02f
-
- 04 Oct, 2019 2 commits
-
-
Jerry Ma authored
Summary: This adds a periodic call to `torch.cuda.empty_cache()` in order to mitigate memory fragmentation in the PyTorch CUDA cached allocator that can cause OOMs on models approaching GPU memory limit. By default, this will occur every 64 updates. Performance considerations: - I've benchmarked this on a reasonably large model with memory footprint 16 GB, and the overhead with the default setting is <0.2%. With `update-freq > 1`, the cost is mitigated even further. - This behavior can be disabled with a value of zero. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/882 Differential Revision: D17742386 Pulled By: jma127 fbshipit-source-id: 68d8f93f798d6818b5efc3d67d43b52dfb8b2865
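A minimal sketch of the periodic cache flush described above; the function name and the way the interval is passed in are assumptions:

```python
import torch

def maybe_empty_cache(num_updates, interval=64):
    """Flush the CUDA caching allocator every `interval` updates to mitigate
    fragmentation; an interval of 0 disables the behaviour."""
    if interval > 0 and num_updates % interval == 0 and torch.cuda.is_available():
        torch.cuda.empty_cache()
```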
-
Debojeet Chatterjee authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/879 Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1023 Pull Request resolved: https://github.com/pytorch/fairseq/pull/1211 Added a new native op that does wordpiece tokenization while additionally returning token start and end indices in the raw text as required by BertSquadQA. Includes Unit Tests for the native op and also to check its parity with the PyText Wordpiece Tokenizer. Also combined is a torchscript implementation of the Bert SQUAD QA Model. There are scripts for evaluation and testing of the torchscript code as well. Reviewed By: borguz, hikushalhere Differential Revision: D17455985 fbshipit-source-id: c2617c7ecbce0f733b31d04558da965d0b62637b
-
- 01 Oct, 2019 1 commit
-
-
Chenyang Yu authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1180 Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/874 Extract FP16OptimizerMixin to share the same logic with PyText. Reviewed By: hudeven Differential Revision: D17594102 fbshipit-source-id: 8625a4e4f3e09cbaba6ae92599c1121b86ed4e78
-
- 30 Sep, 2019 2 commits
-
-
Sarthak Garg authored
Implementation of the paper "Jointly Learning to Align and Translate with Transformer Models" (#877) Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/877 This PR implements guided alignment training described in "Jointly Learning to Align and Translate with Transformer Models (https://arxiv.org/abs/1909.02074)". In summary, it allows for training selected heads of the Transformer Model with external alignments computed by Statistical Alignment Toolkits. During inference, attention probabilities from the trained heads can be used to extract reliable alignments. In our work, we did not see any regressions in the translation performance because of guided alignment training. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1095 Differential Revision: D17170337 Pulled By: myleott fbshipit-source-id: daa418bef70324d7088dbb30aa2adf9f95774859
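A hedged sketch of a guided-alignment (supervised attention) loss of the kind the paper describes — cross-entropy between a supervised head's attention distribution and external alignment targets; tensor names and shapes are assumptions, not the PR's exact code:

```python
import torch

def alignment_loss_sketch(attn_probs, align_targets, pad_mask, eps=1e-9):
    """attn_probs: (B, tgt_len, src_len) attention from the supervised head(s);
    align_targets: (B, tgt_len) index of the aligned source position per target
    token; pad_mask: (B, tgt_len) True where the target position is padding."""
    gathered = attn_probs.gather(-1, align_targets.unsqueeze(-1)).squeeze(-1)
    nll = -torch.log(gathered + eps)          # cross-entropy vs. one-hot alignment
    nll = nll.masked_fill(pad_mask, 0.0)      # ignore padded target positions
    return nll.sum() / (~pad_mask).sum()
```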
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/878 Differential Revision: D17661768 Pulled By: myleott fbshipit-source-id: 1e4c5f09eb14c40d491ca2459fd2adb8382fb6d2
-
- 29 Sep, 2019 2 commits
-
-
Guntupalli Venkata Sai Kalyan authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1200 Differential Revision: D17659658 Pulled By: myleott fbshipit-source-id: 1863e6d60a439dbb7e71e5da68817c9d53649737
-
Stephan Peitz authored
Summary: This PR implements a new attention module which combines cross-attention (encoder-decoder attention) and the decoder self-attention. This work was accepted as an abstract at WeCNLP 2019 (https://www.wecnlp.ai/wecnlp-2019). Cross+Self-Attention reduces the number of parameters and increases inference speed without any degradation in translation quality. More details can be found in the attached [abstract](https://github.com/pytorch/fairseq/files/3561282/paper.pdf) Pull Request resolved: https://github.com/pytorch/fairseq/pull/1097 Differential Revision: D17653168 Pulled By: myleott fbshipit-source-id: deb834c2c78a229d7418ffbfea20ba3ce252991c
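An illustrative sketch of the idea, assuming the decoder queries attend over the concatenation of encoder outputs and decoder states in a single attention call; this is my illustration, not the PR's exact module:

```python
import torch
import torch.nn as nn

class CrossSelfAttentionSketch(nn.Module):
    """One attention over [encoder states ; decoder states] replaces the
    separate encoder-decoder and decoder self-attention sub-layers."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, dec_x, enc_out, key_padding_mask=None, attn_mask=None):
        # dec_x: (tgt_len, B, D); enc_out: (src_len, B, D).
        kv = torch.cat([enc_out, dec_x], dim=0)   # keys/values span both sources
        # attn_mask must cover the concatenated key length, i.e. shape
        # (tgt_len, src_len + tgt_len), and be causal over the decoder part.
        out, _ = self.attn(dec_x, kv, kv,
                           key_padding_mask=key_padding_mask,
                           attn_mask=attn_mask)
        return out
```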
-
- 28 Sep, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1197 Differential Revision: D17651374 Pulled By: myleott fbshipit-source-id: 5feb986de1e682eb83c4479f419ad51325718572
-
- 27 Sep, 2019 5 commits
-
-
Aditya Chetan authored
Summary: For batched predictions in Roberta, the README was giving an example that was pretty unclear. After a thorough discussion with ngoyal2707 in issue https://github.com/pytorch/fairseq/issues/1167, he gave a clear example of how batched predictions are supposed to be done. Since I spent a lot of time on this inconsistency, I thought that it might benefit the community if his solution were in the official README 😄! For details, see issue https://github.com/pytorch/fairseq/issues/1167 Pull Request resolved: https://github.com/pytorch/fairseq/pull/1195 Differential Revision: D17639354 Pulled By: myleott fbshipit-source-id: 3eb60c5804a6481f533b19073da7880dfd0d522d
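A hedged sketch of batched prediction along the lines discussed in the issue — encode each input, pad with collate_tokens, run one forward pass; the checkpoint path and classification head name are placeholders:

```python
import torch
from fairseq.data.data_utils import collate_tokens
from fairseq.models.roberta import RobertaModel

# Placeholder path and head name; adapt to your fine-tuned checkpoint.
roberta = RobertaModel.from_pretrained('/path/to/roberta.large.mnli',
                                       checkpoint_file='model.pt')
roberta.eval()

pairs = [('Roberta is heavily optimized.', 'Roberta is not very optimized.'),
         ('Roberta is heavily optimized.', 'Roberta is based on BERT.')]
# Encode each pair, then pad to a common length (pad index 1 for RoBERTa).
batch = collate_tokens([roberta.encode(a, b) for a, b in pairs], pad_idx=1)
with torch.no_grad():
    logits = roberta.predict('mnli', batch)
print(logits.argmax(dim=1))
```
-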
Changhan Wang authored
Summary: Code for our NeurIPS paper [Levenshtein Transformer](https://arxiv.org/abs/1905.11006) * Added Levenshtein Transformer model, task and criterion class * Added iterative NAT Transformer, insertion Transformer and CMLM Transformer model class for baselines * Add an option for prepending BOS to dictionary class and translation task class Reviewed By: myleott Differential Revision: D17297372 fbshipit-source-id: 54eca60831ae95dc721c2c34e882e1810ee575c7
-
Nayan Singhal authored
Summary: BMUF sync started happening even before warmup is done. This diff fixes the behavior and does the BMUF sync once warmup is done, or immediately if warmup is zero. TODO: write a unit test case so that these problems can be figured out faster. Reviewed By: jay-mahadeokar Differential Revision: D17356277 fbshipit-source-id: 21500e6ed1225b97794e4ee203e5d7d04a2840f8
-
Louis Martin authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1174 Differential Revision: D17627767 Pulled By: myleott fbshipit-source-id: 7b5f77146b8776a5967699e430136039c066c851
-
Zhanghao Wu authored
Summary: Hi, I think there is a minor mistake in the doc. The `--distributed-no-spawn` argument is needed for distributed training on multiple machines without `slurm`. Otherwise, the program will start 8 jobs on each GPU when `nproc_per_node=8`. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1188 Differential Revision: D17627778 Pulled By: myleott fbshipit-source-id: 35ab6b650dc1132d7cb2d150e80d2ebf0caf3e69
-