"cacheflow/vscode:/vscode.git/clone" did not exist on "04e5acc08ed5b878225491bf62540ea10274fb29"
  1. 30 Sep, 2018 1 commit
    •
      Merge internal changes (#295) · b87c5366
      Myle Ott authored
      Summary:
      Changelog:
      - `90f52a1`: Support loading subsets of the data on each worker with the `--fix-batches-to-gpus` flag. This should fix #217 and #266.
      - `6eda0a9`: Update README for replicating the "Scaling Neural Machine Translation" paper
      - `b14c7cf`: Fallback to no_c10d backend for pytorch 0.4.1 (fixes #294)
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/295
      
      Differential Revision: D10121559
      
      Pulled By: myleott
      
      fbshipit-source-id: 41c84d0ee4cdd113544b5d3aa38ae8b23acc2c27
  2. 25 Sep, 2018 1 commit
    •
      Switch to DistributedDataParallelC10d and bump version 0.5.0 -> 0.6.0 · 1082ba35
      Sergey Edunov authored
      - no more FP16Trainer, we just have an FP16Optimizer wrapper
      - most of the distributed code is moved to a new wrapper class called DistributedFairseqModel, which behaves like DistributedDataParallel and a FairseqModel at the same time
      - Trainer now requires an extra dummy_batch argument at initialization, which we do fwd/bwd on when there's an uneven number of batches per worker. We hide the gradients from these dummy batches by multiplying the loss by 0
      - Trainer.train_step now takes a list of samples, which will allow cleaner --update-freq
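The "multiply the loss by 0" trick above keeps every worker executing the same forward/backward code path (so distributed collectives stay in sync) while making dummy batches contribute nothing to the update. A minimal sketch of the idea with a one-parameter model and a hand-derived gradient; names like `loss_and_grad` and `sgd_step` are illustrative, not fairseq's actual API:

```python
def loss_and_grad(w, x, y, coeff=1.0):
    """Squared-error loss for a 1-parameter linear model, scaled by coeff.

    Gradients are linear in the loss scale, so coeff=0.0 yields a zero
    gradient: the dummy batch runs fwd/bwd but changes nothing.
    """
    pred = w * x
    loss = coeff * (pred - y) ** 2
    grad = coeff * 2.0 * (pred - y) * x
    return loss, grad

def sgd_step(w, grad, lr=0.1):
    """One plain SGD update."""
    return w - lr * grad

w = 1.0

# Real batch: nonzero gradient, parameter moves.
_, g_real = loss_and_grad(w, x=2.0, y=5.0, coeff=1.0)
w_real = sgd_step(w, g_real)

# Dummy batch: identical code path, but coeff=0 hides the gradient.
_, g_dummy = loss_and_grad(w, x=2.0, y=5.0, coeff=0.0)
w_dummy = sgd_step(w, g_dummy)

print(g_real, g_dummy)   # -12.0 0.0
print(w_dummy == w)      # True: dummy batch leaves the parameter unchanged
```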
  3. 03 Sep, 2018 8 commits
  4. 25 Jul, 2018 1 commit
    •
      Transformer lm · d2e2a1d4
      Alexei Baevski authored
      This implements a transformer-based language model. It already obtains better perplexity on WikiText-103 without any tuning. I will also train it on GBW, where I also expect to get better ppl
      
      Example training command:
      
      python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 80 --save-interval 1 --arch transformer_lm --task language_modeling --optimizer nag --lr 0.008 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.6 --dropout 0.2 --criterion adaptive_loss --adaptive-softmax-cutoff 10000,50000,200000 --max-tokens 512 --tokens-per-sample 512 --seed 1 --sample-break-mode none --log-format json --log-interval 50 --save-interval-updates 2500 --keep-interval-updates 25
      A small transformer got to 31.3 ppl on WikiText-103 (compared to 35 with fconv), while @myleott got a big transformer LM to roughly 27 ppl on WikiText-103
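In the training command above, `--sample-break-mode none` with `--tokens-per-sample 512` packs the corpus into fixed-length contiguous chunks, cutting samples at token counts rather than at sentence or document boundaries. A minimal sketch of that chunking, assuming a flat token stream; the function name `chunk_tokens` is illustrative, not fairseq's API, and keeping the short final remainder as its own sample is a simplification:

```python
def chunk_tokens(tokens, tokens_per_sample):
    """Split a flat token stream into contiguous fixed-length samples.

    Mirrors the spirit of --sample-break-mode none: a new sample starts
    every `tokens_per_sample` tokens, regardless of sentence/document
    boundaries. The trailing remainder is kept here for simplicity.
    """
    return [tokens[i:i + tokens_per_sample]
            for i in range(0, len(tokens), tokens_per_sample)]

stream = list(range(10))        # stand-in for a tokenized corpus
print(chunk_tokens(stream, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```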
  5. 21 Jun, 2018 3 commits
  6. 15 Jun, 2018 16 commits
  7. 27 Feb, 2018 4 commits
  8. 22 Jan, 2018 5 commits
  9. 06 Dec, 2017 1 commit