"vscode:/vscode.git/clone" did not exist on "ec13a815b13ec6be3eeb8c3eb9ccb725dc322233"
  1. 08 May, 2019 2 commits
    • Cleanup LM + Flake8 · f2563c21
      Myle Ott authored
      Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/720
      
      Differential Revision: D15259091
      
      Pulled By: myleott
      
      fbshipit-source-id: 06a35996c06ccddb49fdc9e01e348ff3c9da334e
    • bugfix data not in args · 6a7eb6ce
      Jay Mahadeokar authored
      Summary:
      D15214049 introduced a bug such that if a task's args does not contain `data`, it will give the error
      ```
      File "/data/users/jaym/fbsource/fbcode/buck-out/dev/gen/deeplearning/projects/fairspeq/train#link-tree/train.py", line 119, in reload_train
         if len(args.data.split(":")) == 1:
      AttributeError: 'Namespace' object has no attribute 'data'
      ```
      
      This diff checks whether `data` is in args to avoid the above error.
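
      A minimal sketch of the kind of guard this diff describes (the function name and namespace layout here are illustrative, not the actual fairseq code):
      ```python
      import argparse

      def data_is_sharded(args: argparse.Namespace) -> bool:
          # Guard against task args that have no `data` attribute before
          # splitting it on ":" (the colon-separated shard list).
          if not hasattr(args, "data"):
              return False
          return len(args.data.split(":")) > 1

      # A namespace without `data` no longer raises AttributeError.
      print(data_is_sharded(argparse.Namespace(task="speech")))          # False
      print(data_is_sharded(argparse.Namespace(data="shard0:shard1")))   # True
      ```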
      
      Reviewed By: myleott, jmp84
      
      Differential Revision: D15253373
      
      fbshipit-source-id: 14fb9ad878ee50f1b7583349bb17e29c03c40815
  2. 06 May, 2019 1 commit
    • allowing sharded dataset (#696) · 0add50c2
      Naman Goyal authored
      Summary:
      Co-authored-by: myleott <myleott@fb.com>
      
      Changing `data` to be a `str` with a colon-separated list for loading sharded datasets. This change is useful for loading large datasets that cannot fit into memory. The large dataset can be sharded, and each shard is then loaded in one epoch in a round-robin manner.
      
      For example, if there are `5` shards of data and `10` epochs then the shards will be iterated upon `[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]`.
      
      myleott: we need to look into `translation.py`, as it already expects a list and then concatenates the datasets.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/696
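
      A short sketch of the round-robin shard selection described above (the paths and the epoch-indexing convention are illustrative assumptions):
      ```python
      def shard_for_epoch(data: str, epoch: int) -> str:
          # `data` is a colon-separated list of shard directories,
          # e.g. "/data/shard0:/data/shard1:...:/data/shard4".
          shards = data.split(":")
          # Epochs cycle through the shards in round-robin order.
          return shards[epoch % len(shards)]

      data = ":".join(f"/data/shard{i}" for i in range(5))
      # 10 epochs over 5 shards visit shards 0,1,2,3,4,0,1,2,3,4
      print([shard_for_epoch(data, epoch) for epoch in range(10)])
      ```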
      
      Differential Revision: D15214049
      
      fbshipit-source-id: 03e43a7b69c7aefada2ca668abf1eac1969fe013
  3. 05 May, 2019 2 commits
  4. 04 May, 2019 1 commit
  5. 02 May, 2019 2 commits
  6. 30 Apr, 2019 1 commit
  7. 24 Apr, 2019 1 commit
  8. 15 Apr, 2019 1 commit
    • fix checkpoint timer (#634) · de8aeab5
      freewym authored
      Summary:
      If args.keep_interval_updates or args.keep_last_epochs > 0, `checkpoints` would refer to a list of checkpoint files to be removed, which can be empty. So the logging code was moved to the right position.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/634
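
      A simplified sketch of the idea (function and variable names are illustrative, not the actual fairseq checkpoint code): log the save time while the list of freshly written checkpoints is still in scope, before handling the (possibly empty) pruning list.
      ```python
      import time

      def save_checkpoints(paths_to_write, paths_to_prune):
          start = time.time()
          for path in paths_to_write:
              pass  # write the checkpoint file here
          # Log here, while `paths_to_write` still names the files just written;
          # the pruning list below may legitimately be empty.
          if paths_to_write:
              print(f"| saved checkpoint {paths_to_write[0]} "
                    f"({time.time() - start:.2f} seconds)")
          for path in paths_to_prune:
              pass  # prune checkpoints kept beyond the retention limits
      ```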
      
      Differential Revision: D14933655
      
      Pulled By: myleott
      
      fbshipit-source-id: 68182ee99d9701e1536833d31e0a7c5d2eb2d679
  9. 09 Apr, 2019 1 commit
    • Fix save_dir creation while training on multiple nodes (#626) · 94e9d77c
      Kartikay Khandelwal authored
      Summary:
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/626
      
      While training a model on multiple GPUs, the current fairseq train workflow fails while creating the directory from which to load a checkpoint. This seems to happen because multiple nodes attempt to create the same directory, causing some weird interaction with the os.makedirs option "exist_ok=True". Fix this by making sure only rank 0 creates this directory.
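
      A hedged sketch of the described fix (the helper name and rank plumbing are assumptions for illustration): only rank 0 calls os.makedirs, and the other ranks wait at a barrier so they never race on the shared filesystem.
      ```python
      import os

      import torch.distributed as dist

      def prepare_save_dir(save_dir: str, distributed_rank: int) -> None:
          # Only rank 0 creates the directory.
          if distributed_rank == 0:
              os.makedirs(save_dir, exist_ok=True)
          # Everyone else waits until the directory exists.
          if dist.is_available() and dist.is_initialized():
              dist.barrier()
      ```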
      
      Reviewed By: myleott
      
      Differential Revision: D14841304
      
      fbshipit-source-id: c9b73ba804de97e2cb19a616189fefce476d8c74
  10. 07 Apr, 2019 1 commit
    • move distributed_init after get_batch_iterator · 34028c63
      Haoran Li authored
      Summary: There are constant wait timeout issues when using multiple nodes, and even setting copylocallytempdir:/ doesn't help, e.g. f105637629. It seems to work after moving distributed_init after get_batch_iterator, e.g. f106520580.
      
      Reviewed By: myleott
      
      Differential Revision: D14817769
      
      fbshipit-source-id: edbb101a28d8082241c7bdd8c5500c9dad27647c
  11. 02 Apr, 2019 2 commits
  12. 12 Mar, 2019 1 commit
    • Handle 3+ dimensional input in sequence_generator + nits · 860010e9
      Dmytro Okhonko authored
      Summary: sequence_generator assumes that the model input is a 2d tensor of longs. But it can be something like a 3d tensor of floats, and we should be able to handle this as long as the first dimension is the batch size, followed by the source lengths.
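
      A sketch of the shape handling being described (tensor names are illustrative): take batch size and source length from the first two dimensions, regardless of how many trailing feature dimensions the input has.
      ```python
      import torch

      def batch_and_src_len(src_tokens: torch.Tensor):
          # Works for the usual 2d LongTensor of token ids as well as, say, a
          # 3d FloatTensor of features, as long as dim 0 is the batch and
          # dim 1 is the source length.
          bsz, src_len = src_tokens.size()[:2]
          return bsz, src_len

      print(batch_and_src_len(torch.zeros(8, 20, dtype=torch.long)))  # (8, 20)
      print(batch_and_src_len(torch.zeros(8, 20, 80)))                # (8, 20)
      ```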
      
      Reviewed By: myleott
      
      Differential Revision: D14420044
      
      fbshipit-source-id: bf8b1e42ad1873f7b803c1a377b0af21648db015
  13. 11 Mar, 2019 1 commit
  14. 04 Mar, 2019 1 commit
  15. 26 Feb, 2019 2 commits
  16. 09 Feb, 2019 1 commit
    • Add fairseq to PyPI (#495) · fbd4cef9
      Myle Ott authored
      Summary:
      - fairseq can now be installed via pip: `pip install fairseq`
      - command-line tools are globally accessible: `fairseq-preprocess`, `fairseq-train`, `fairseq-generate`, etc.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/495
      
      Differential Revision: D14017761
      
      Pulled By: myleott
      
      fbshipit-source-id: 10c9f6634a3056074eac2f33324b4f1f404d4235
  17. 05 Feb, 2019 1 commit
  18. 30 Jan, 2019 1 commit
    • Do distributed init after data loading · ec6f8ef9
      Myle Ott authored
      Summary:
      FACEBOOK
      
      This switches back to torch.multiprocessing.spawn, instead of directly calling fb_train.par using a subprocess.Process. This has the advantage that exceptions are propagated properly. It also moves the distributed_init part to happen after data loading, which gets around the timeout issue.
      
      The downside of this approach is that it's not so easy to pipe stdout to multiple places, which was nice when using the sweep.py scripts. I'm still working on a fix for that.
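
      A minimal sketch of the spawn-based launch being described (the worker body and world size are illustrative): load data first, then do the distributed init, and let torch.multiprocessing.spawn propagate worker exceptions back to the parent.
      ```python
      import torch.multiprocessing as mp

      def worker(rank: int, world_size: int) -> None:
          # Load the dataset before the distributed rendezvous so a slow reader
          # doesn't make the other ranks hit the init timeout.
          dataset = list(range(1000))  # stand-in for real data loading
          # torch.distributed.init_process_group(...) would go here.
          print(f"rank {rank}/{world_size} loaded {len(dataset)} examples")

      if __name__ == "__main__":
          world_size = 2
          # Unlike launching a separate binary via subprocess.Process, spawn()
          # re-raises exceptions from the children in the parent process.
          mp.spawn(worker, args=(world_size,), nprocs=world_size)
      ```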
      
      Reviewed By: rutyrinott, ngoyal2707
      
      Differential Revision: D13873224
      
      fbshipit-source-id: 08d593233b8d23590c01c723363630a79804a8b0
  19. 25 Jan, 2019 1 commit
  20. 24 Jan, 2019 1 commit
  21. 16 Jan, 2019 1 commit
    • FIX: '--user-dir' on multi-gpu (#449) · 7853818c
      Davide Caroselli authored
      Summary:
      On a multi-gpu training scenario, the `train.py` script spawns new processes with `torch.multiprocessing.spawn`. Unfortunately those child processes don't inherit the modules imported with `--user-dir`.
      
      This pull request fixes this problem: the custom module import is now explicit in every `main()` function.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/449
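
      A hedged sketch of the idea using plain importlib (not the exact fairseq helper): each `main()`, including those running in spawned children, re-imports the `--user-dir` package by path.
      ```python
      import importlib.util
      import os
      import sys

      def import_user_dir(user_dir: str) -> None:
          # Called at the top of every main(), because modules imported in the
          # parent are not inherited by torch.multiprocessing.spawn children.
          if not user_dir:
              return
          user_dir = os.path.abspath(user_dir)
          module_name = os.path.basename(user_dir)
          if module_name in sys.modules:
              return
          spec = importlib.util.spec_from_file_location(
              module_name, os.path.join(user_dir, "__init__.py"))
          module = importlib.util.module_from_spec(spec)
          sys.modules[module_name] = module
          spec.loader.exec_module(module)
      ```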
      
      Differential Revision: D13676922
      
      Pulled By: myleott
      
      fbshipit-source-id: 520358d66155697885b878a37e7d0484bddbc1c6
  22. 09 Jan, 2019 1 commit
  23. 05 Jan, 2019 1 commit
  24. 28 Dec, 2018 1 commit
  25. 07 Dec, 2018 1 commit
    • Take a dummy train step under OOM to keep multiprocessing in sync · 6c006a34
      Halil Akin authored
      Summary: This is not a guaranteed solution (since processes may still get out of sync if OOM happens after an all_gather/all_reduce has been done) - but should still make multiprocessing training more robust in practice since it seems we usually OOM early enough.
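
      A conceptual sketch of the workaround (names are illustrative, not the actual Trainer code): on OOM, the worker still does a forward/backward on a cached dummy batch, with the loss multiplied by zero, so every process issues the same number of gradient all-reduces.
      ```python
      import torch

      def train_step(model, criterion, optimizer, sample, dummy_batch):
          optimizer.zero_grad()
          try:
              loss = criterion(model(sample["input"]), sample["target"])
          except RuntimeError as e:
              if "out of memory" not in str(e):
                  raise
              # OOM: free what we can, then take a zero-weight dummy step so this
              # rank still participates in the collective gradient reduction.
              torch.cuda.empty_cache()
              loss = criterion(model(dummy_batch["input"]), dummy_batch["target"]) * 0.0
          loss.backward()
          optimizer.step()
      ```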
      
      Reviewed By: myleott
      
      Differential Revision: D13086018
      
      fbshipit-source-id: feb1b01c2eb8818797cfdabc0faac8056ba1b4ee
  26. 18 Nov, 2018 1 commit
  27. 21 Oct, 2018 1 commit
  28. 30 Sep, 2018 1 commit
    • Merge internal changes (#295) · b87c5366
      Myle Ott authored
      Summary:
      Changelog:
      - `90f52a1`: Support loading subsets of the data on each worker with the `--fix-batches-to-gpus` flag. This should fix #217 and #266.
      - `6eda0a9`: Update README for replicating the "Scaling Neural Machine Translation" paper
      - `b14c7cf`: Fallback to no_c10d backend for pytorch 0.4.1 (fixes #294)
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/295
      
      Differential Revision: D10121559
      
      Pulled By: myleott
      
      fbshipit-source-id: 41c84d0ee4cdd113544b5d3aa38ae8b23acc2c27
  29. 25 Sep, 2018 1 commit
    • Switch to DistributedDataParallelC10d and bump version 0.5.0 -> 0.6.0 · 1082ba35
      Sergey Edunov authored
      - no more FP16Trainer, we just have an FP16Optimizer wrapper
      - most of the distributed code is moved to a new wrapper class called DistributedFairseqModel, which behaves like DistributedDataParallel and a FairseqModel at the same time
      - Trainer now requires an extra dummy_batch argument at initialization, which we do fwd/bwd on when there's an uneven number of batches per worker. We hide the gradients from these dummy batches by multiplying the loss by 0
      - Trainer.train_step now takes a list of samples, which will allow cleaner --update-freq handling (see the sketch after this list)
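
      The sketch referenced above: a rough illustration (names are assumptions, not the actual Trainer code) of accumulating gradients over a list of samples, which is what makes --update-freq handling cleaner.
      ```python
      def train_step(model, criterion, optimizer, samples):
          # `samples` is a list of batches, one per --update-freq step; gradients
          # are accumulated across them before a single optimizer update.
          optimizer.zero_grad()
          for sample in samples:
              loss = criterion(model(sample["input"]), sample["target"])
              # Dummy batches (used to even out batch counts across workers)
              # would have their loss multiplied by 0 here.
              loss.backward()
          optimizer.step()
      ```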
  30. 03 Sep, 2018 6 commits