- 18 Oct, 2019 3 commits
-
-
Spencer Poff authored
Summary: In https://github.com/fairinternal/fairseq-py/pull/877, sequence_generator began calling `model.forward_decoder`, but not all decoder models were given an implementation of that function. Reviewed By: okhonko Differential Revision: D17863751 fbshipit-source-id: ea70b636c9dafcf87f5d5e49631d0c4b7cf14984
-
dikshameghwal authored
Summary: Removed redundant quotes in the filename assigned to the dev dataset for GLUE tasks. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1270 Differential Revision: D18013071 fbshipit-source-id: 35f00162e117c6584dc859f760503ca32dcb706e
-
Changhan Wang authored
Summary: When the `if` statements in the Levenshtein transformer decoder forward are removed, `attn` may end up with a batch size inconsistent with the output tokens. This is a fix. Reviewed By: cndn Differential Revision: D17936411 fbshipit-source-id: a1583f3806dc9f41caeb783c043429e247035803
-
- 15 Oct, 2019 2 commits
-
-
Nayan Singhal authored
Summary: This unit test guards the BMUF code. Change: 1. distributed_init assumed we are always using a CUDA device, which is not the case when using the "gloo" backend on a CPU machine. Reviewed By: jay-mahadeokar Differential Revision: D17821391 fbshipit-source-id: 28e1bb39f7a4889b1dc6bd636b7c499e55bfc69a
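A minimal sketch of the idea behind the fix (illustrative names, not the fairseq code): only pin a CUDA device when the backend actually needs one, so a "gloo" run on a CPU-only machine works.

```python
import torch
import torch.distributed as dist

def init_distributed(backend, init_method, world_size, rank):
    """Illustrative init that only touches CUDA for CUDA-capable backends."""
    dist.init_process_group(backend=backend, init_method=init_method,
                            world_size=world_size, rank=rank)
    # Pin a CUDA device only when one is available (e.g. with NCCL);
    # a "gloo" backend on a CPU-only machine must skip this step.
    if backend == "nccl" and torch.cuda.is_available():
        torch.cuda.set_device(rank % torch.cuda.device_count())
        return torch.device("cuda")
    return torch.device("cpu")
```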
-
Changhan Wang authored
Summary: Bring back the changes in D17661768 Reviewed By: ailzhang Differential Revision: D17920299 fbshipit-source-id: be3f93a044a8710c8b475012c39e36a3e6507fad
-
- 12 Oct, 2019 1 commit
-
-
Sujit Verma authored
Summary: Added option to save checkpoints using Path Manager. Reviewed By: hudeven Differential Revision: D17392754 fbshipit-source-id: 4b8e556ef8455a1548e5a083d779ed809cd785be
-
- 11 Oct, 2019 2 commits
-
-
Jiatao Gu authored
Summary: The original implementation of the random mask is different from what the paper stated. Reviewed By: kahne Differential Revision: D17652564 fbshipit-source-id: 238a9158041b3ff2482ee50ce6151c3f77f0b2c1
-
Jiatao Gu authored
Summary: Implementation of the Levenshtein Transformer paper. Add a new helper function "new_arange" to create arange tensors easily. Fix bugs in returning attn values for NAT models. Delete files which are unnecessary or experimental. Reviewed By: kahne Differential Revision: D17652009 fbshipit-source-id: 436bbb5d45de2f8067003232de4f2bd51e87719c
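A small sketch of what such a helper might look like (semantics assumed, not necessarily the exact fairseq version): build an arange on the reference tensor's device and broadcast it to a requested shape.

```python
import torch

def new_arange(x, *size):
    # Return an arange tensor on x's device, expanded to `size`
    # (defaulting to x's own shape). Sketch only; exact behavior assumed.
    if len(size) == 0:
        size = x.size()
    return torch.arange(size[-1], device=x.device).expand(*size).contiguous()

tokens = torch.zeros(2, 5, dtype=torch.long)
positions = new_arange(tokens)  # shape (2, 5): [[0..4], [0..4]]
```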
-
- 10 Oct, 2019 2 commits
-
-
Dmytro Okhonko authored
Summary: Adds CTC loss and corresponding Transformer CTC-based models. Tested with `CUDA_VISIBLE_DEVICES=0 python train.py $DATA_PATH --save-dir $SAVE_DIR --max-epoch 30 --task speech_recognition --arch vggtransformer_enc_1 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 10000 --log-format json --log-interval 1 --criterion ctc_loss --user-dir examples/speech_recognition/ --validate-interval=10` Pull Request resolved: https://github.com/pytorch/fairseq/pull/1233 Reviewed By: jcai1 Differential Revision: D17856824 Pulled By: okhonko fbshipit-source-id: f3eac64d3fdd0c37cf8c539dd360cfb610d8a6ef
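For reference, the core of a CTC criterion in PyTorch looks roughly like the sketch below (shapes and the blank index are assumptions; the actual criterion under examples/speech_recognition may differ).

```python
import torch
import torch.nn.functional as F

# Assumed shapes: (T, B, V) encoder frames over the vocabulary,
# (B, S) padded label ids, blank index 0.
T, B, V, S = 100, 4, 32, 20
net_output = torch.randn(T, B, V)
targets = torch.randint(1, V, (B, S))
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (B,), dtype=torch.long)

log_probs = F.log_softmax(net_output, dim=-1)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                  blank=0, reduction="sum", zero_infinity=True)
```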
-
Jeff Cai authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/846 Reviewed By: jcai1 Differential Revision: D17845996 Pulled By: okhonko fbshipit-source-id: 3826fd9a4418496916bf1835c319dd85c89945cc
-
- 09 Oct, 2019 1 commit
-
-
Alex Xiao authored
Summary: We currently shard data when creating the batch iterator. This means we first load all indices/frame lengths/handles into memory, and then do the sharding. This makes it impossible to train on large datasets with a large number of workers, because each worker needs to load the entire dataset into memory. For training on a million hours of data (i.e. semi-supervised or unsupervised approaches) this data loading makes it flat-out impossible to use 8 GPUs. 3 changes: 1. This diff modifies the data loading such that we do the sharding while we read the handles file, rather than later. This modification is done on a task-by-task basis, since the task specifies how the data is loaded. I've tried to make the code compatible with both sharding during handle loading and sharding during batch iteration. I've currently only done the sharding during handle loading for the aligned_training task. 2. To support data sharding at data loading time and the requirement that all shards must have exactly the same number of batches, I've added a method to do this synchronization, where shards with too many batches simply truncate the extra ones, similar to what we already do. 3. In fairspeq/train.py, we are actually loading the training dataset and batch iterator twice, once in train.py and once when loading the checkpoint (which we always do regardless of whether there is a checkpoint). This means double the loading time, which can be painful for very large files. I've removed the extraneous loading in this diff as well. Reviewed By: yqwangustc Differential Revision: D17750715 fbshipit-source-id: 0e6e3d363525fa5661f1c784303390ea13f46377
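A sketch of the first two changes (illustrative helper names, not the internal task code): each worker keeps only its own slice while reading the handles file, and all shards are then truncated to a common number of batches.

```python
def load_handles_sharded(path, shard_id, num_shards):
    # Read only this worker's slice of the handles file, so no single
    # worker ever materializes the full dataset index in memory.
    handles = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i % num_shards == shard_id:
                handles.append(line.rstrip("\n"))
    return handles

def common_num_batches(all_shard_batch_counts):
    # All shards must yield exactly the same number of batches; shards with
    # extra batches simply drop them down to the global minimum.
    return min(all_shard_batch_counts)
```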
-
- 08 Oct, 2019 3 commits
-
-
Jerry Ma authored
Summary: PyTorch now has more comprehensive memory instrumentation, added in https://github.com/pytorch/pytorch/pull/27361 . This PR makes fairseq print a summary table of the memory state when an OOM occurs. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/885 Differential Revision: D17820445 Pulled By: jma127 fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0
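Roughly, the behavior is along these lines (a sketch, assuming the `torch.cuda.memory_summary()` helper from the linked PyTorch PR is available):

```python
import torch

def run_with_oom_report(step_fn, device=0):
    # Run one training step; on a CUDA OOM, print the allocator summary
    # table before re-raising, so cache/fragmentation state is visible.
    try:
        return step_fn()
    except RuntimeError as e:
        if "out of memory" in str(e) and torch.cuda.is_available():
            print(torch.cuda.memory_summary(device=device, abbreviated=True))
        raise
```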
-
Jungo Kasai authored
Summary: Add ensemble wrappers to the Levenshtein NAT, with a final-softmax ensemble over the pipeline of three steps: 1. Deletion, 2. Placeholder Insertion, 3. Word Selection. Each step involves scoring, averaging the scores over the ensemble, and then making hard decisions with argmax; the next step then follows. We cannot do the three steps in parallel by design. Reviewed By: kahne Differential Revision: D17723202 fbshipit-source-id: 05f7a4fcd922a972cc4796ca397e8220f0b4d53e
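Conceptually, each step's ensembling can be sketched as follows (heavily simplified; real decoding applies the three steps in order to the evolving partial hypothesis, and the method name here is illustrative):

```python
import torch

def ensemble_step(models, score_fn_name, *inputs):
    # Score one Levenshtein step (deletion, placeholder insertion, or word
    # selection) with every model, average the log-probs over the ensemble,
    # then make the hard decision with argmax.
    log_probs = torch.stack(
        [getattr(m, score_fn_name)(*inputs) for m in models], dim=0
    ).mean(dim=0)
    return log_probs.argmax(dim=-1)
```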
-
Changhan Wang authored
Summary: Fix the max length calculation in Levenshtein Transformer Reviewed By: jhcross Differential Revision: D17672946 fbshipit-source-id: e5efbe7e56cf879d3e822864e4398f99f45b04d4
-
- 07 Oct, 2019 1 commit
-
-
Nayan Singhal authored
Summary: In all our final settings we use global_sync = 50 and get results comparable with DDP and caffe2. This sets the default global-sync-iter to 50, so users can just pass --use-bmuf to enable it for training. Reviewed By: skritika Differential Revision: D17765094 fbshipit-source-id: 369591eeff266d757f89e1fc8dda01711146fdbc
-
- 05 Oct, 2019 1 commit
-
-
alexeib authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/884 Differential Revision: D17774515 Pulled By: alexeib fbshipit-source-id: d1ffe8ab723fa284c69b067bbd43d699eaa2f02f
-
- 04 Oct, 2019 2 commits
-
-
Jerry Ma authored
Summary: This adds a periodic call to `torch.cuda.empty_cache()` in order to mitigate memory fragmentation in the PyTorch CUDA caching allocator, which can cause OOMs on models approaching the GPU memory limit. By default, this will occur every 64 updates. Performance considerations: - I've benchmarked this on a reasonably large model with a 16 GB memory footprint, and the overhead with the default setting is <0.2%. With `update-freq > 1`, the cost is mitigated even further. - This behavior can be disabled with a value of zero. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/882 Differential Revision: D17742386 Pulled By: jma127 fbshipit-source-id: 68d8f93f798d6818b5efc3d67d43b52dfb8b2865
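The mechanism itself is small; a sketch (the frequency argument name here is illustrative):

```python
import torch

def maybe_empty_cache(num_updates, empty_cache_freq=64):
    # Periodically hand cached blocks back to the CUDA driver to reduce
    # fragmentation; a frequency of 0 disables the behavior entirely.
    if (empty_cache_freq > 0
            and num_updates % empty_cache_freq == 0
            and torch.cuda.is_available()):
        torch.cuda.empty_cache()
```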
-
Debojeet Chatterjee authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/879 Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1023 Pull Request resolved: https://github.com/pytorch/fairseq/pull/1211 Added a new native op that does wordpiece tokenization while additionally returning token start and end indices in the raw text, as required by BertSquadQA. Includes unit tests for the native op and for checking its parity with the PyText Wordpiece Tokenizer. Also combined is a torchscript implementation of the Bert SQuAD QA model, along with scripts for evaluation and testing of the torchscript code. Reviewed By: borguz, hikushalhere Differential Revision: D17455985 fbshipit-source-id: c2617c7ecbce0f733b31d04558da965d0b62637b
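A hedged sketch of the core idea (greedy longest-match wordpiece that also records character offsets into the raw text; this is illustrative Python, not the native op):

```python
def wordpiece_with_offsets(word, vocab, start=0, unk="[UNK]"):
    # Greedily split `word` into wordpieces from `vocab`, returning
    # (piece, start_idx, end_idx) triples with offsets into the raw text.
    pieces, i = [], 0
    while i < len(word):
        j = len(word)
        while j > i:
            candidate = word[i:j] if i == 0 else "##" + word[i:j]
            if candidate in vocab:
                pieces.append((candidate, start + i, start + j))
                break
            j -= 1
        else:
            return [(unk, start, start + len(word))]
        i = j
    return pieces

print(wordpiece_with_offsets("unaffable", {"un", "##aff", "##able"}))
# [('un', 0, 2), ('##aff', 2, 5), ('##able', 5, 9)]
```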
-
- 01 Oct, 2019 1 commit
-
-
Chenyang Yu authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1180 Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/874 extract FP16OptimizerMixin for share the same logic in PyText Reviewed By: hudeven Differential Revision: D17594102 fbshipit-source-id: 8625a4e4f3e09cbaba6ae92599c1121b86ed4e78
-
- 30 Sep, 2019 2 commits
-
-
Sarthak Garg authored
Implementation of the paper "Jointly Learning to Align and Translate with Transformer Models" (#877) Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/877 This PR implements guided alignment training described in "Jointly Learning to Align and Translate with Transformer Models (https://arxiv.org/abs/1909.02074)". In summary, it allows for training selected heads of the Transformer Model with external alignments computed by Statistical Alignment Toolkits. During inference, attention probabilities from the trained heads can be used to extract reliable alignments. In our work, we did not see any regressions in the translation performance because of guided alignment training. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1095 Differential Revision: D17170337 Pulled By: myleott fbshipit-source-id: daa418bef70324d7088dbb30aa2adf9f95774859
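In essence, guided alignment training adds a supervision term on the attention of the selected head(s): a cross-entropy between the head's attention distribution and the external alignment. A minimal sketch, with shapes and the interpolation weight assumed:

```python
import torch

def alignment_loss(attn_probs, align_targets, eps=1e-9):
    # attn_probs: (B, T_tgt, T_src) attention of the supervised head(s).
    # align_targets: (B, T_tgt, T_src) 0/1 alignments from an external
    # statistical aligner, normalized per target token below. Sketch only.
    targets = align_targets / align_targets.sum(dim=-1, keepdim=True).clamp(min=eps)
    return -(targets * (attn_probs + eps).log()).sum(dim=-1).mean()

# total_loss = nll_loss + alignment_lambda * alignment_loss(...)  # weight assumed
```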
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/878 Differential Revision: D17661768 Pulled By: myleott fbshipit-source-id: 1e4c5f09eb14c40d491ca2459fd2adb8382fb6d2
-
- 29 Sep, 2019 2 commits
-
-
Guntupalli Venkata Sai Kalyan authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1200 Differential Revision: D17659658 Pulled By: myleott fbshipit-source-id: 1863e6d60a439dbb7e71e5da68817c9d53649737
-
Stephan Peitz authored
Summary: This PR implements a new attention module which combines cross-attention (encoder-decoder attention) and the decoder self-attention. This work was accepted as an abstract at WeCNLP 2019 (https://www.wecnlp.ai/wecnlp-2019). Cross+Self-Attention reduces the number of parameters and increases inference speed without any degradation in translation quality. More details can be found in the attached [abstract](https://github.com/pytorch/fairseq/files/3561282/paper.pdf) Pull Request resolved: https://github.com/pytorch/fairseq/pull/1097 Differential Revision: D17653168 Pulled By: myleott fbshipit-source-id: deb834c2c78a229d7418ffbfea20ba3ce252991c
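The idea can be sketched as one attention call in which the decoder query attends over the concatenation of encoder outputs and previous decoder states (a simplified sketch with assumed shapes, not the fairseq module):

```python
import torch
import torch.nn.functional as F

def cross_plus_self_attention(q, decoder_states, encoder_out):
    # q: (B, T_tgt, D) queries; decoder_states: (B, T_tgt, D);
    # encoder_out: (B, T_src, D). A single attention over the concatenated
    # keys/values replaces separate self- and encoder-decoder attention.
    kv = torch.cat([encoder_out, decoder_states], dim=1)   # (B, T_src + T_tgt, D)
    scores = torch.bmm(q, kv.transpose(1, 2)) / (q.size(-1) ** 0.5)
    # A causal mask over the decoder portion of `kv` would be applied here.
    attn = F.softmax(scores, dim=-1)
    return torch.bmm(attn, kv)
```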
-
- 28 Sep, 2019 1 commit
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1197 Differential Revision: D17651374 Pulled By: myleott fbshipit-source-id: 5feb986de1e682eb83c4479f419ad51325718572
-
- 27 Sep, 2019 5 commits
-
-
Aditya Chetan authored
Summary: For batched predictions with RoBERTa, the README gave an example that was pretty unclear. After a thorough discussion with ngoyal2707 in issue https://github.com/pytorch/fairseq/issues/1167, he gave a clear example of how batched predictions are supposed to be done. Since I spent a lot of time on this inconsistency, I thought it might benefit the community if his solution were in the official README 😄! For more details, see issue https://github.com/pytorch/fairseq/issues/1167 Pull Request resolved: https://github.com/pytorch/fairseq/pull/1195 Differential Revision: D17639354 Pulled By: myleott fbshipit-source-id: 3eb60c5804a6481f533b19073da7880dfd0d522d
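A sketch of batched prediction along those lines (details such as the pretrained model name and padding index should be checked against the README; this assumes `collate_tokens` and the 'mnli' classification head):

```python
import torch
from fairseq.models.roberta import RobertaModel
from fairseq.data.data_utils import collate_tokens

roberta = RobertaModel.from_pretrained('roberta.large.mnli', checkpoint_file='model.pt')
roberta.eval()

pairs = [
    ('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.'),
    ('Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.'),
]
# Encode each pair, then right-pad everything into a single batch tensor.
batch = collate_tokens(
    [roberta.encode(a, b) for a, b in pairs],
    pad_idx=roberta.task.source_dictionary.pad(),
)
with torch.no_grad():
    logits = roberta.predict('mnli', batch)
print(logits.argmax(dim=-1))
```
-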
Changhan Wang authored
Summary: Code for our NeurIPS paper [Levenshtein Transformer](https://arxiv.org/abs/1905.11006) * Added Levenshtein Transformer model, task and criterion class * Added iterative NAT Transformer, insertion Transformer and CMLM Transformer model class for baselines * Add an option for prepending BOS to dictionary class and translation task class Reviewed By: myleott Differential Revision: D17297372 fbshipit-source-id: 54eca60831ae95dc721c2c34e882e1810ee575c7
-
Nayan Singhal authored
Summary: BMUF sync started happening even before warmup was done. This diff fixes the behavior and does BMUF sync once warmup is done, or immediately if warmup is zero. TODO: write a unit test case so that these problems can be figured out faster. Reviewed By: jay-mahadeokar Differential Revision: D17356277 fbshipit-source-id: 21500e6ed1225b97794e4ee203e5d7d04a2840f8
-
Louis Martin authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1174 Differential Revision: D17627767 Pulled By: myleott fbshipit-source-id: 7b5f77146b8776a5967699e430136039c066c851
-
Zhanghao Wu authored
Summary: Hi, I think there is a minor mistake in the doc. The `--distributed-no-spawn` argument is needed for distributed training on multiple machines without `slurm`. Otherwise, the program will start 8 jobs on each GPU when `nproc_per_node=8`. Pull Request resolved: https://github.com/pytorch/fairseq/pull/1188 Differential Revision: D17627778 Pulled By: myleott fbshipit-source-id: 35ab6b650dc1132d7cb2d150e80d2ebf0caf3e69
-
- 26 Sep, 2019 1 commit
-
-
vineetk1 authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1185 Differential Revision: D17602249 Pulled By: lematt1991 fbshipit-source-id: bd515b7d2ebce8181a80684f45223a8db7c7e3cd
-
- 24 Sep, 2019 1 commit
-
-
Jamie Morton authored
Summary: This is to make these instructions a little more generalizable, since on some systems bash will parse the spaces within quotes. Addressing https://github.com/pytorch/fairseq/issues/1146 Pull Request resolved: https://github.com/pytorch/fairseq/pull/1165 Differential Revision: D17547810 Pulled By: myleott fbshipit-source-id: 5a026d42f678126b5ca8bc4477ba8f26ea549dcd
-
- 23 Sep, 2019 3 commits
-
-
Naman Goyal authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/869 Reviewed By: myleott Differential Revision: D17531776 Pulled By: myleott fbshipit-source-id: 349c9449a0a7db5d3bb8449561302d4220cfa60c
-
Jerry Ma authored
Summary: - More clearly document the correspondence between FairseqAdam and torch.optim.AdamW - Add ResamplingDataset to Sphinx docs Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/868 Differential Revision: D17523244 Pulled By: jma127 fbshipit-source-id: 8e7b34b24889b2c8f70b09a52a625d2af135734b
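The documented correspondence is, roughly, that fairseq's Adam applies decoupled weight decay in the same way as `torch.optim.AdamW`; a sketch of the torch-side equivalent (hyperparameters are illustrative):

```python
import torch

model = torch.nn.Linear(16, 16)
# Roughly equivalent to fairseq's --optimizer adam --adam-betas '(0.9, 0.98)'
# --adam-eps 1e-6 --weight-decay 0.01 (decoupled weight decay, as in AdamW).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01)
```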
-
Naman Goyal authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/866 Differential Revision: D17517115 fbshipit-source-id: fd6921e642c99e37fce6ad58b24c93e70a5364e5
-
- 20 Sep, 2019 3 commits
-
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/865 Differential Revision: D17510276 Pulled By: myleott fbshipit-source-id: 24119402ad5fe95a1312fadb77bafe49a9197c6b
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1155 Differential Revision: D17509762 Pulled By: myleott fbshipit-source-id: 4de535289c1f35abff0d8142d8580f3ede039f47
-
Naman Goyal authored
Summary: The multilingual-RoBERTa training is working with aconneau's XLM data. Two pieces remaining: 1) `XLM` limits a batch to be from the same language. I am not 100% sure about the reason for that, but it should be easy to implement: basically we can add `batch_by_size_and_language` instead of the default `batch_by_size` function. If it's not critical, I would want to leave it out, as that keeps the code very clean and simple. 2) `sample_ratio` in `ConcatDataset` works with `int` by tiling the datasets based on the ratio. Currently I am handling it by rounding off the ratio to the first decimal and then multiplying by 10. We can see if such simple heuristics are good enough; there are other options (we can talk about them offline). Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/849 Differential Revision: D17162460 fbshipit-source-id: d967f3d872f7a1f0aa4ea418bd362b68af9e432f
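The heuristic for item 2) can be sketched as follows (illustrative only): round each float ratio to one decimal and scale by 10 to get the integer repeat counts that `ConcatDataset` expects.

```python
def ratios_to_int_repeats(sample_ratios):
    # e.g. [1.0, 0.25, 0.5] -> round to first decimal -> [1.0, 0.2, 0.5]
    # -> multiply by 10 -> [10, 2, 5] dataset tiling counts. Sketch only.
    return [max(1, int(round(round(r, 1) * 10))) for r in sample_ratios]

print(ratios_to_int_repeats([1.0, 0.25, 0.5]))  # [10, 2, 5]
```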
-
- 19 Sep, 2019 2 commits
-
-
Jerry Ma authored
Summary: As discussed with Naman earlier today. Weighted sampling with replacement can be done on a per-epoch basis using `set_epoch()` functionality, which generates the samples as a function of random seed and epoch. Additionally, `FairseqTask` needs to set the starting epoch for the dataset at the very beginning of iterator construction. Not yet implemented is the per-epoch iterator construction, which is necessary to actually regenerate the batches for each epoch. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/861 Differential Revision: D17460687 Pulled By: jma127 fbshipit-source-id: 1c2a54f04ac96b3561c100a6fd66a9fccbe3c658
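A sketch of per-epoch weighted resampling with replacement driven by `set_epoch()` (a simplified stand-in, not ResamplingDataset itself; seed handling assumed):

```python
import numpy as np
from torch.utils.data import Dataset

class EpochResampledDataset(Dataset):
    # Wraps a dataset and redraws a weighted sample (with replacement)
    # whenever the epoch changes, deterministically from (seed, epoch).
    def __init__(self, dataset, weights, size, seed=0):
        # `weights` must sum to 1 over the wrapped dataset's items.
        self.dataset, self.weights, self.size, self.seed = dataset, weights, size, seed
        self.set_epoch(0)

    def set_epoch(self, epoch):
        rng = np.random.RandomState(self.seed + epoch)
        self.indices = rng.choice(len(self.dataset), self.size,
                                  replace=True, p=self.weights)

    def __len__(self):
        return self.size

    def __getitem__(self, i):
        return self.dataset[self.indices[i]]
```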
-
Myle Ott authored
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1147 Differential Revision: D17468447 Pulled By: myleott fbshipit-source-id: 0dbac04b92c8df74ad991d5e92cd02036d662369
-
- 18 Sep, 2019 1 commit
-
-
Jerry Ma authored
Summary: `python setup.py build_ext --inplace` generates C++ source files directly in the Python source tree. They should most likely be ignored by git. Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/860 Differential Revision: D17460597 Pulled By: jma127 fbshipit-source-id: 72a29d438ebb57627b68ec7e9a2a77c8a36f1c21
-