Commits · 7a5996fdc75dc9325646fd00e8e69e3f55cbb05e · OpenDAS / Fairseq

02 May, 2019 2 commits

Move distributed_init into DistributedFairseqModel (#687) · 34726d56

Myle Ott authored May 02, 2019

Summary:
This should make rendezvous happen as lazily as possible.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/687

Differential Revision: D15151145

Pulled By: myleott

fbshipit-source-id: d70816a85414c5d509a6b12e2b339b4736db2c88

Validate on all sets based on --save-interval-updates · fb18be00

Myle Ott authored May 02, 2019

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/693

Differential Revision: D15174831

fbshipit-source-id: 98688b1269ead5694e5116659ff64507d3c0d1c0

30 Apr, 2019 1 commit

Merge internal changes (#654) · d45db804

Myle Ott authored Apr 29, 2019

Summary:
- Add --add-bos-token option to LM task
- Cleanup utils.py and options.py
Pull Request resolved: https://github.com/pytorch/fairseq/pull/654

Differential Revision: D15041794

Pulled By: myleott

fbshipit-source-id: 3ad00007769d5f48308052cfd40de39c5ffa1a6e

24 Apr, 2019 1 commit

Don't reload best validation loss when using --reset-optimizer · 0020477a

Myle Ott authored Apr 24, 2019

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/661

Differential Revision: D15068312

Pulled By: myleott

fbshipit-source-id: 1216835fd4c7f83ea5e350bff83901c93ac57447

15 Apr, 2019 1 commit

fix checkpoint timer (#634) · de8aeab5

freewym authored Apr 15, 2019

Summary:
If args.keep_interval_updates or args.keep_last_epochs > 0, `checkpoints` would refer to a list of checkpoint files to be removed, which can be empty. The logging code was therefore moved to the right position.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/634
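
The ordering issue described in this summary can be illustrated with a minimal sketch. This is not the fairseq code: the function `save_and_prune` and its parameters are hypothetical names chosen for illustration. The point it shows is that the timing message is logged from the list of files just written, before any separate list of files to remove is consulted.

```python
import time


def save_and_prune(to_write, to_remove, write_fn, remove_fn, log=print):
    """Write checkpoints, log the write time, then remove old checkpoints.

    All names here are hypothetical; this only illustrates the ordering.
    """
    start = time.perf_counter()
    for path in to_write:
        write_fn(path)
    elapsed = time.perf_counter() - start

    # Log right after writing, using the list of files that were written.
    # The list of files to remove may be empty, so it must not be used here.
    if to_write:
        log("saved checkpoint {} (writing took {:.3f} seconds)".format(to_write[0], elapsed))

    for path in to_remove:
        remove_fn(path)
    return elapsed
```

Keeping the two lists under separate names avoids the situation where one variable is reused for "files written" and "files to delete".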

Differential Revision: D14933655

Pulled By: myleott

fbshipit-source-id: 68182ee99d9701e1536833d31e0a7c5d2eb2d679

09 Apr, 2019 1 commit

Fix save_dir creation while training on multiple nodes (#626) · 94e9d77c

Kartikay Khandelwal authored Apr 09, 2019

Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/626

While training a model on multiple GPUs, the current fairseq train workflow fails while creating the directory from which to load a checkpoint. This seems to happen because multiple nodes attempt to create the same directory, which interacts badly with the os.makedirs option "exist_ok=True". The fix is to make sure only rank 0 creates this directory.
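
A minimal sketch of the rank-0-only pattern described above, assuming each worker knows its integer rank. The helper name `ensure_save_dir` is hypothetical and is not taken from the fairseq source.

```python
import os
import tempfile


def ensure_save_dir(save_dir: str, rank: int) -> bool:
    """Create save_dir on rank 0 only; return True if this call created it.

    Hypothetical helper: other ranks skip creation entirely, so only one
    process ever calls os.makedirs on the shared path.
    """
    if rank != 0:
        return False
    os.makedirs(save_dir, exist_ok=True)
    return True


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "checkpoints")
    for rank in range(4):  # simulate four workers inside one process
        ensure_save_dir(target, rank)
    print(os.path.isdir(target))
```

In a real multi-node job the other ranks would typically need to wait until rank 0 has finished before touching the directory; that synchronization is left out of this sketch.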

Reviewed By: myleott

Differential Revision: D14841304

fbshipit-source-id: c9b73ba804de97e2cb19a616189fefce476d8c74

07 Apr, 2019 1 commit

move distributed_init after get_batch_iterator · 34028c63

Haoran Li authored Apr 07, 2019

Summary: There are constant wait-timeout issues when using multiple nodes; even setting copylocallytempdir:/ doesn't help (e.g. f105637629). It seems to work after I moved distributed_init after get_batch_iterator (e.g. f106520580).

Reviewed By: myleott

Differential Revision: D14817769

fbshipit-source-id: edbb101a28d8082241c7bdd8c5500c9dad27647c

02 Apr, 2019 2 commits

Add checkpoint write timer · eef6663c

Myle Ott authored Apr 02, 2019

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/613

Differential Revision: D14712311

Pulled By: myleott

fbshipit-source-id: 3e7646629b539c10b6af89dece2c0c564f31125f

Use --train-subset and --valid-subset properly · e88ad84b

Myle Ott authored Apr 02, 2019

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/614

Differential Revision: D14712321

Pulled By: myleott

fbshipit-source-id: 8ef973c5d30ebccf0df0f1cabdddd590248a8f8d

12 Mar, 2019 1 commit

Handle 3+ dimensional input in sequence_generator + nits · 860010e9

Dmytro Okhonko authored Mar 12, 2019

Summary: sequence_generator assumes that the model input is a 2D tensor of longs, but it can be something like a 3D tensor of floats. We should be able to handle this as long as the first dimension is the batch size, followed by the source lengths.
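
The shape convention stated in the summary can be sketched without any tensor library by working on shape tuples. The function `batch_size_and_src_len` is a hypothetical name used for illustration only; with a PyTorch tensor the same idea would read the first two entries of `src_tokens.size()`.

```python
def batch_size_and_src_len(shape):
    """Return (batch size, source length) from an input shape.

    Only the first two dimensions are read, so a 2D shape such as
    (batch, src_len) and a 3D shape such as (batch, src_len, features)
    are handled the same way.
    """
    if len(shape) < 2:
        raise ValueError("expected at least 2 dimensions, got {}".format(len(shape)))
    return shape[0], shape[1]


if __name__ == "__main__":
    print(batch_size_and_src_len((8, 20)))      # 2D input of token ids
    print(batch_size_and_src_len((8, 20, 40)))  # 3D input of float features
```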

Reviewed By: myleott

Differential Revision: D14420044

fbshipit-source-id: bf8b1e42ad1873f7b803c1a377b0af21648db015

11 Mar, 2019 1 commit

Add missing parentheses in regex expression (#567) · fef4e002

Jose Fonollosa authored Mar 11, 2019

Summary:
The regex pattern without parentheses is not correct. The checkpoints are not sorted in descending order.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/567
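
Why the parentheses matter can be shown with a small sketch. It assumes checkpoint files are named like `checkpoint12.pt` and are sorted by the number captured from the file name; the function `sort_checkpoints` and the default pattern are illustrative assumptions, not the fairseq implementation. Without the capture group `(\d+)` there would be no number to extract and sort on.

```python
import re


def sort_checkpoints(filenames, pattern=r"checkpoint(\d+)\.pt"):
    """Return matching file names sorted by the captured number, descending."""
    regex = re.compile(pattern)
    entries = []
    for name in filenames:
        match = regex.fullmatch(name)
        if match is not None:
            # group(1) exists only because the pattern has parentheses.
            entries.append((int(match.group(1)), name))
    return [name for _, name in sorted(entries, reverse=True)]


if __name__ == "__main__":
    files = ["checkpoint2.pt", "checkpoint10.pt", "checkpoint1.pt", "notes.txt"]
    print(sort_checkpoints(files))
```

Sorting on the integer value rather than on the file name also avoids the lexicographic trap where `checkpoint10.pt` would sort before `checkpoint2.pt`.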

Differential Revision: D14404380

Pulled By: myleott

fbshipit-source-id: 98cd0cfa8c92b78a03ffbb94840bc0f7a118eca1

04 Mar, 2019 1 commit

Add --curriculum (fixes #533) · 2ad1178e

Myle Ott authored Mar 04, 2019

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/554

Differential Revision: D14300596

Pulled By: myleott

fbshipit-source-id: f38c8e58daef99d5e4b97dd423e4142e4294a4f0

26 Feb, 2019 2 commits

Add Tensorboard support (#530) · 44d27e64

Myle Ott authored Feb 25, 2019

Summary:
Enable with the `--tensorboard-logdir` option.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/530

Differential Revision: D14218430

Pulled By: myleott

fbshipit-source-id: e7a54f66f928e3bb02ae03fda09b22fa4fa7d053

Misc fixes · 65c1903e

Myle Ott authored Feb 25, 2019

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/529

Differential Revision: D14218384

Pulled By: myleott

fbshipit-source-id: 5d2cbb1f56ea42e9929785aff4a5ae5f44d13724

09 Feb, 2019 1 commit

Add fairseq to PyPI (#495) · fbd4cef9

Myle Ott authored Feb 08, 2019

Summary:
- fairseq can now be installed via pip: `pip install fairseq`
- command-line tools are globally accessible: `fairseq-preprocess`, `fairseq-train`, `fairseq-generate`, etc.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/495

Differential Revision: D14017761

Pulled By: myleott

fbshipit-source-id: 10c9f6634a3056074eac2f33324b4f1f404d4235

05 Feb, 2019 1 commit

Add standalone binaries · 829bd8ce

Myle Ott authored Feb 05, 2019

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/489

Differential Revision: D13956810

Pulled By: myleott

fbshipit-source-id: 61ace179d1d3790226c38b3f3e47f5452b5ec514

30 Jan, 2019 1 commit

Do distributed init after data loading · ec6f8ef9

Myle Ott authored Jan 30, 2019

Summary:
FACEBOOK

This switches back to torch.multiprocessing.spawn, instead of directly calling fb_train.par using a subprocess.Process. This has the advantage that exceptions are propagated properly. It also moves the distributed_init part to happen after data loading, which gets around the timeout issue.

The downside of this approach is that it's not so easy to pipe stdout to multiple places, which was nice when using the sweep.py scripts. I'm still working on a fix for that.

Reviewed By: rutyrinott, ngoyal2707

Differential Revision: D13873224

fbshipit-source-id: 08d593233b8d23590c01c723363630a79804a8b0

25 Jan, 2019 1 commit

Add code for "Pay Less Attention with Lightweight and Dynamic Convolutions" (#473) · b41c74dc

Myle Ott authored Jan 25, 2019

Summary:
Changelog:
- `e330f56`: Add code for the "Pay Less Attention with Lightweight and Dynamic Convolutions" paper
- `5e3b98c`: Add scripts for computing tokenized BLEU with compound splitting and sacrebleu
- update READMEs
- misc fixes
Pull Request resolved: https://github.com/pytorch/fairseq/pull/473

Differential Revision: D13819717

Pulled By: myleott

fbshipit-source-id: f2dc12ea89a436b950cafec3593ed1b04af808e9

24 Jan, 2019 1 commit

Print model and number of trained params · d0ebcec4

Myle Ott authored Jan 24, 2019

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/469

Differential Revision: D13802945

Pulled By: myleott

fbshipit-source-id: b6976506a8336b96ee40505c4a7638541cc99c95

16 Jan, 2019 1 commit

FIX: '--user-dir' on multi-gpu (#449) · 7853818c

Davide Caroselli authored Jan 16, 2019

Summary:
In a multi-GPU training scenario, the `train.py` script spawns new processes with `torch.multiprocessing.spawn`. Unfortunately, those child processes don't inherit the modules imported with `--user-dir`.

This pull request fixes the problem: the custom module import is now explicit in every `main()` function.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/449

Differential Revision: D13676922

Pulled By: myleott

fbshipit-source-id: 520358d66155697885b878a37e7d0484bddbc1c6

09 Jan, 2019 1 commit

Misc fixes · 4b1f4788

Myle Ott authored Jan 09, 2019

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/439

Differential Revision: D13608151

Pulled By: myleott

fbshipit-source-id: 198b84995a6329f8329829cc91184d88f1eab947

05 Jan, 2019 1 commit

Merge internal changes (#283) · 7633129b

Myle Ott authored Jan 04, 2019

Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/283

Pull Request resolved: https://github.com/pytorch/fairseq/pull/428

Differential Revision: D13564190

Pulled By: myleott

fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5

28 Dec, 2018 1 commit

Make multiprocessing_train.py work with multi-node setups · 0cb87130

Myle Ott authored Dec 28, 2018

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/425

Differential Revision: D13558340

Pulled By: myleott

fbshipit-source-id: dff8c77027e821d8c80bfbd6a6ccce9ca1a44b78

07 Dec, 2018 1 commit

Take a dummy train step under OOM to keep multiprocessing in sync · 6c006a34

Halil Akin authored Dec 06, 2018

Summary: This is not a guaranteed solution (since processes may still get out of sync if OOM happens after an all_gather/all_reduce has been done) - but should still make multiprocessing training more robust in practice since it seems we usually OOM early enough.
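
The recovery pattern in this summary can be sketched as follows. It assumes that an out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory"; `train_step` and `forward_backward` are hypothetical names for this illustration, not the fairseq API.

```python
def train_step(batch, dummy_batch, forward_backward):
    """Run forward_backward on batch, falling back to dummy_batch on OOM.

    Returns (result, hit_oom). Errors other than OOM are re-raised.
    """
    try:
        return forward_backward(batch), False
    except RuntimeError as err:
        if "out of memory" not in str(err):
            raise
        # Still take a (cheap) step so this worker reaches the same
        # collective operations as the other workers.
        return forward_backward(dummy_batch), True
```

As the summary notes, this cannot help if the failure happens after a collective operation has already started; it only covers the case where the error occurs early in the step.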

Reviewed By: myleott

Differential Revision: D13086018

fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee

18 Nov, 2018 1 commit

Merge small fixes from internal · 693894b6

Naman Goyal authored Nov 18, 2018

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/374

Differential Revision: D13116074

Pulled By: myleott

fbshipit-source-id: 485724cc5a40e8360d21e4bf9c35821baa0ddc57

21 Oct, 2018 1 commit

Manually port pull request 385 · 8441cbf3

Peng-Jen Chen authored Oct 20, 2018

Summary:
Manually port fairinternal fairseq-py pull request #385 [1] to fbcode.

Resolve the merge conflict of removing fp16_trainer per offline discussion with Myle. Also updated the code to make generate.py work.

[1] https://github.com/fairinternal/fairseq-py/pull/385/commits/18fa6e154781cf0c4b1596429dba7e753a545069

Reviewed By: liezl200

Differential Revision: D10052908

fbshipit-source-id: c3c378d78dc1e9ac087c815f359e78c0048ff2f5

30 Sep, 2018 1 commit

Merge internal changes (#295) · b87c5366

Myle Ott authored Sep 30, 2018

Summary:
Changelog:
- `90f52a1`: Support loading subsets of the data on each worker with the `--fix-batches-to-gpus` flag. This should fix #217 and #266.
- `6eda0a9`: Update README for replicating the "Scaling Neural Machine Translation" paper
- `b14c7cf`: Fallback to no_c10d backend for pytorch 0.4.1 (fixes #294)
Pull Request resolved: https://github.com/pytorch/fairseq/pull/295

Differential Revision: D10121559

Pulled By: myleott

fbshipit-source-id: 41c84d0ee4cdd113544b5d3aa38ae8b23acc2c27

25 Sep, 2018 1 commit

Switch to DistributedDataParallelC10d and bump version 0.5.0 -> 0.6.0 · 1082ba35

Sergey Edunov authored Sep 06, 2018

- no more FP16Trainer, we just have an FP16Optimizer wrapper
- most of the distributed code is moved to a new wrapper class called DistributedFairseqModel, which behaves like DistributedDataParallel and a FairseqModel at the same time
- Trainer now requires an extra dummy_batch argument at initialization, which we do fwd/bwd on when there's an uneven number of batches per worker. We hide the gradients from these dummy batches by multiplying the loss by 0
- Trainer.train_step now takes a list of samples, which will allow cleaner --update-freq
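
The dummy-batch idea from the list above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Trainer code: `pad_with_dummy` and `scaled_loss` are hypothetical helpers, and a real implementation would apply the zero factor to a tensor loss before the backward pass.

```python
def pad_with_dummy(batches_per_worker, dummy_batch):
    """Pad each worker's batch list to equal length with dummy batches.

    Returns, per worker, a list of (batch, is_dummy) pairs.
    """
    longest = max(len(batches) for batches in batches_per_worker)
    padded = []
    for batches in batches_per_worker:
        real = [(batch, False) for batch in batches]
        filler = [(dummy_batch, True)] * (longest - len(batches))
        padded.append(real + filler)
    return padded


def scaled_loss(loss, is_dummy):
    """Multiply the loss by 0 for dummy batches so they contribute nothing."""
    return loss * 0.0 if is_dummy else loss


if __name__ == "__main__":
    plan = pad_with_dummy([["a", "b", "c"], ["d"]], dummy_batch="dummy")
    print([len(worker) for worker in plan])
```

Every worker then performs the same number of forward/backward passes, which keeps the workers in step even when the data does not divide evenly.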

03 Sep, 2018 8 commits
- Add documentation · 6381cc97
  Myle Ott authored Sep 03, 2018
  
- Clean up FairseqTask so that it's easier to extend/add new tasks · 2e507d3c
  Myle Ott authored Aug 30, 2018
  
- dont send dummy batch when reloading from checkpoint · 343819f9
  Alexei Baevski authored Aug 28, 2018
```
also don't crash if param does not receive grads
```
- Add training wall time meter · 9c102784
  Myle Ott authored Aug 24, 2018
  
- Warn when using FP16 on pre-Volta GPUs · 8d6665f2
  Myle Ott authored Aug 14, 2018
  
- Reset gnorm after each epoch · 97a6b139
  Sergey Edunov authored Aug 09, 2018
  
- cosine + triangular lr scheduler · 75e12a27
  Alexei Baevski authored Aug 08, 2018
  
- add flag that allows keeping optimizer config · 2dc074d8
  alexeib authored Jul 28, 2018
```
adds --reset-optimizer, --reset-lr-scheduler, and --optimizer-overrides flags
```
25 Jul, 2018 1 commit

Transformer lm · d2e2a1d4

Alexei Baevski authored Jul 18, 2018

This implements a transformer-based language model. It already obtains better perplexity on wikitext103 without any tuning. I will also train it on gbw, where I also expect to get better ppl.

Example training command:

python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 80 --save-interval 1 --arch transformer_lm --task language_modeling --optimizer nag --lr 0.008 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.6 --dropout 0.2 --criterion adaptive_loss --adaptive-softmax-cutoff 10000,50000,200000 --max-tokens 512 --tokens-per-sample 512 --seed 1 --sample-break-mode none --log-format json --log-interval 50 --save-interval-updates 2500 --keep-interval-updates 25

The small transformer got to 31.3 ppl on wikitext103 (compared to 35 with fconv), while @myleott got a big transformer lm to 27-something ppl on wikitext103.

21 Jun, 2018 3 commits
- Fix interpretation of --max-epoch · e9967cd3
  Myle Ott authored Jun 21, 2018
  
- Store full checkpoints instead of symlinking · 9dcee4c7
  Myle Ott authored Jun 18, 2018
  
- Two tiny changes to train/eval_lm. For train fix an off by one, while for... · 762956a5
  Mehdi Drissi authored Jun 21, 2018
```
Two tiny changes to train/eval_lm. For train, fix an off-by-one error; for eval_lm, make it work when the task is translation.
```