1. 07 May, 2019 1 commit
    • Davide Caroselli's avatar
      Memory-Mapped IndexedDataset implementation (#589) · a1c997bd
      Davide Caroselli authored
      Summary:
      Following discussion in https://github.com/pytorch/fairseq/issues/574:
      
      - Implemented MMapIndexedDataset and MMapIndexedDatasetBuilder, compatible with IndexedDataset/IndexedDatasetBuilder
      - Updated scripts/read_binarized.py to support the new MMapIndexedDataset
      - Replaced the '--raw-text' and '--lazy-load' options with '--dataset-impl', and moved the option definition from custom task args to the more high-level options.add_dataset_args() (more appropriate)
      - Also implemented utility functions in indexed_dataset: make_dataset() and dataset_exists()
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/589
      
      Differential Revision: D14597128
      
      Pulled By: myleott
      
      fbshipit-source-id: 4e92d99920cbaa52cfe5a0f1f5d9ae5c92d4268e
      a1c997bd
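The memory-mapped design above can be sketched in a few lines. This is a simplified illustration, not fairseq's actual MMapIndexedDataset layout (which keeps dtype and size metadata in a separate index file): token ids live back-to-back in one binary file, per-sample offsets are kept in an index, and each sample is read lazily through a memory map so the OS pages in only the bytes touched.

```python
import array
import mmap

ITEM = array.array("i").itemsize  # bytes per token id (typically 4)

def write_flat_dataset(path, sentences):
    """Write token-id sentences back-to-back into one binary file;
    return cumulative offsets (in tokens) delimiting each sample."""
    offsets = [0]
    with open(path, "wb") as f:
        for sent in sentences:
            array.array("i", sent).tofile(f)
            offsets.append(offsets[-1] + len(sent))
    return offsets

def read_sample(path, offsets, i):
    """Read only the i-th sample through a memory map, without
    loading the whole corpus into RAM."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            sample = array.array("i")
            sample.frombytes(mm[offsets[i] * ITEM:offsets[i + 1] * ITEM])
            return sample.tolist()
```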
  2. 30 Apr, 2019 1 commit
  3. 01 Mar, 2019 2 commits
  4. 28 Feb, 2019 1 commit
  5. 26 Feb, 2019 1 commit
    • Myle Ott's avatar
      Multilingual training example (#527) · 00493490
      Myle Ott authored
      Summary:
      * Add example for multilingual translation on IWSLT'17
      * Match dataset ordering for multilingual_translation and translation
      * Fix bug with LegacyDistributedDataParallel when calling forward of sub-modules
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/527
      
      Differential Revision: D14218372
      
      Pulled By: myleott
      
      fbshipit-source-id: 2e3fe24aa39476bcc5c9af68ef9a40192db34a3b
      00493490
  6. 07 Feb, 2019 1 commit
    • Ruty Rinott's avatar
      stitch preprocessing pipeline · cea0e4b9
      Ruty Rinott authored
      Summary:
      1. Add a call to binarization to complete the preprocessing pipeline
      2. Add the ability to specify a task to select the dictionary, and add a BERT task
      3. Get rid of function calls that are no longer needed after moving functions from fairseq here
      
      Reviewed By: jingfeidu
      
      Differential Revision: D13977842
      
      fbshipit-source-id: ec9bbb4e98e62e12c20ba68bb52b8bcc94aee91d
      cea0e4b9
  7. 05 Feb, 2019 1 commit
  8. 01 Feb, 2019 1 commit
    • Davide Caroselli's avatar
      Support custom Dictionary implementations in 'preprocess.py' (#448) · bbb4120b
      Davide Caroselli authored
      Summary:
      The `preprocess.py` script has been refactored in order to:
      
      1. Use the `options` module for command-line argument parsing. This gives `preprocess.py` the ability to load custom modules with the `--user-dir` flag (already implemented in all other binaries)
      2. Dictionary loading and building code has been moved to the Task implementation. This allows custom Dictionary classes to be used during the data generation step.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/448
      
      Differential Revision: D13674819
      
      Pulled By: myleott
      
      fbshipit-source-id: b40648a98ed6c08284577e5ec25876e018d8c822
      bbb4120b
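Point 2 enables the pattern sketched below. The class and method names here are illustrative, not fairseq's exact API: because dictionary construction is a method on the task, a subclassed task can substitute any Dictionary-compatible implementation during preprocessing.

```python
class Dictionary:
    """Minimal stand-in for a word-to-index dictionary."""
    def __init__(self):
        self.indices = {}

    def add_symbol(self, word):
        # Assign the next free index the first time a word is seen.
        return self.indices.setdefault(word, len(self.indices))

class TranslationTask:
    dictionary_class = Dictionary  # a custom task may override this

    @classmethod
    def build_dictionary(cls, filenames):
        """Build the vocabulary from raw text files. Because this lives
        on the task, a custom task controls the dictionary type used."""
        d = cls.dictionary_class()
        for fn in filenames:
            with open(fn, "r", encoding="utf-8") as f:
                for line in f:
                    for word in line.split():
                        d.add_symbol(word)
        return d
```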
  9. 29 Jan, 2019 1 commit
  10. 24 Jan, 2019 2 commits
    • Davide Caroselli's avatar
      Enforce UTF-8 when open() text files (#460) · 38f1dee9
      Davide Caroselli authored
      Summary:
      When opening text files without specifying the encoding (i.e. `open(path, "r")` or `open(path, "w")`), Python 3 uses the preferred locale encoding (`locale.getpreferredencoding()`), so the result is platform dependent and can change from one machine to another.
      
      I believe fairseq should enforce its own standard (UTF-8 seems like the best choice to me). This pull request explicitly specifies UTF-8 encoding when reading text files.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/460
      
      Differential Revision: D13802525
      
      Pulled By: myleott
      
      fbshipit-source-id: 672fd55707ee559ab36d74bc1c24026166ea2367
      38f1dee9
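The fix boils down to one habit, shown here in miniature: pass the encoding explicitly so file I/O behaves identically on every machine, regardless of locale.

```python
def read_lines(path):
    # Explicit encoding: never fall back to locale.getpreferredencoding().
    with open(path, "r", encoding="utf-8") as f:
        return f.read().splitlines()

def write_lines(path, lines):
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```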
    • vufg's avatar
      change f"{args}" to "{}".format(args) (#467) · 8eb49c84
      vufg authored
      Summary:
      Although both are supported by Python 3.6, I think it would be better to unify the usage of string formatting functions.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/467
      
      Differential Revision: D13802506
      
      Pulled By: myleott
      
      fbshipit-source-id: 5c4877547b1c4ca806ab54c80ae483cfbaa7827a
      8eb49c84
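For reference, the two spellings being unified are equivalent; the difference is that f-string literals require Python 3.6+, while str.format runs on any Python 3 (the message text below is illustrative):

```python
args = {"lr": 0.25, "arch": "fconv"}

msg_fstring = f"| training with args {args}"         # f-string, 3.6+ only
msg_format = "| training with args {}".format(args)  # any Python 3

assert msg_fstring == msg_format
```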
  11. 16 Jan, 2019 1 commit
    • Davide Caroselli's avatar
      FIX: '--user-dir' on multi-gpu (#449) · 7853818c
      Davide Caroselli authored
      Summary:
      In a multi-GPU training scenario, the `train.py` script spawns new processes with `torch.multiprocessing.spawn`. Unfortunately, those child processes don't inherit the modules imported with `--user-dir`.
      
      This pull request fixes the problem: the custom module import is now explicit in every `main()` function.
      Pull Request resolved: https://github.com/pytorch/fairseq/pull/449
      
      Differential Revision: D13676922
      
      Pulled By: myleott
      
      fbshipit-source-id: 520358d66155697885b878a37e7d0484bddbc1c6
      7853818c
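A hedged sketch of the fix (helper and argument names are illustrative, simplified from fairseq's real `--user-dir` handling): processes started by spawn are fresh interpreters, so the user-module import must be repeated at the top of each per-process main.

```python
import importlib
import os
import sys

def import_user_module(user_dir):
    """Mimic --user-dir: put the directory's parent on sys.path and
    import the directory itself as a package."""
    if user_dir is None:
        return None
    user_dir = os.path.abspath(user_dir)
    parent, name = os.path.split(user_dir)
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return importlib.import_module(name)

def distributed_main(rank, args):
    # Explicit re-import in the child process; without this, anything the
    # user module registers (models, tasks, ...) would be missing here.
    import_user_module(args.user_dir)
    # ... per-rank training setup would follow ...
```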
  12. 05 Jan, 2019 1 commit
  13. 06 Dec, 2018 1 commit
  14. 18 Nov, 2018 1 commit
  15. 10 Nov, 2018 1 commit
    • Ruty Rinott's avatar
      pipeline for LM training · 880e7cd4
      Ruty Rinott authored
      Summary:
      Step 2 of the pipeline for LM training. It assumes tokenized text data as input, splits it into train/validation/test, and runs binarization
      (step a_ii in https://fb.quip.com/kazzAxvZHBj9)
      
      Reviewed By: borguz
      
      Differential Revision: D10454705
      
      fbshipit-source-id: 74e8679041f5507c4e404c1b719547c2ae9ed983
      880e7cd4
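The split step described above can be sketched as follows (the fractions, seeding, and function name are illustrative choices, not the commit's actual defaults):

```python
import random

def split_corpus(lines, valid_frac=0.01, test_frac=0.01, seed=1):
    """Shuffle tokenized lines and carve out validation and test sets;
    the remainder becomes the training set."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_valid = int(len(lines) * valid_frac)
    n_test = int(len(lines) * test_frac)
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]
    return train, valid, test
```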
  16. 25 Sep, 2018 1 commit
  17. 03 Sep, 2018 1 commit
  18. 31 Jul, 2018 1 commit
  19. 21 Jun, 2018 1 commit
  20. 15 Jun, 2018 3 commits
    • alexeib's avatar
      Conv lm implementation · 4c2ef2de
      alexeib authored
      This implements the convolutional language model from https://arxiv.org/pdf/1612.08083.pdf
      
      There are 3 modes for constructing batches:
      
      - token block: fill each sample with a specified number of tokens without regard for sentence delimiters - this is what was used for training in the paper
      - complete: fill each sample with a specified number of tokens but make sure it contains only complete sentences (i.e. if the next sentence would go over the token block limit, it is moved to the next sample) - this was used for evaluation in the paper
      - eos: one sentence per sample (blank lines are skipped)
      
      some results:
      
      GCNN-13 - GBW - 37.46
      GCNN-14B - GBW - 33.88
      GCNN-8 - Wiki103 - 43.76
      GCNN-14 - Wiki103 - 35.66
      
      train:
      
      python train.py /private/home/abaevski/data/wiki103 --save-dir /tmp --fp16 --max-epoch 35 --save-interval 1 --save-interval-updates 1000 --keep-interval-updates 25 --arch fconv_lm --optimizer nag --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 --decoder-embed-dim 280 --decoder-layers '[(850, 6)] * 3 + [(850,1)] + [(850,5)] * 4 + [(850,1)] + [(850,4)] * 3 + [(1024,4)] + [(2048, 4)]' --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion cross_entropy --max-tokens 1024 --max-target-positions 1024 --seed 1 --log-format json --log-interval 500
      
      eval:
      
      python eval_lm.py ~abaevski/data/wiki103 --path '/checkpoint02/abaevski/2018-04-27/lm_wiki.fp16.mxup300000.fconv.adam.lrs=reduce_lr_on_plateau.emb280.layers(850,6)*3+(850,1)+(850,5)*4+(850,1)+(850,4)*3+(1024,1)+(2048,4).lr0.0005.clp0.1.drp0.3.wd0.0.crt=cross_entropy.mxtk2048.smptk256.seed1.ngpu8/checkpoint_last.pt'
      4c2ef2de
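The three batching modes can be sketched as follows. This is a simplified illustration of the idea only; fairseq's actual token-block dataset works over index offsets rather than materialized token lists, and the function name here is invented:

```python
def token_blocks(sentences, block_size, mode="token"):
    """Group token-id sentences into samples per the three modes above.

    - "token":    pack exactly block_size tokens per sample, ignoring
                  sentence boundaries (training mode in the paper)
    - "complete": pack whole sentences up to block_size; a sentence that
                  would overflow starts the next sample (evaluation mode)
    - "eos":      one sentence per sample, skipping empty ones
    """
    if mode == "eos":
        return [s for s in sentences if s]
    if mode == "complete":
        samples, cur = [], []
        for s in sentences:
            if cur and len(cur) + len(s) > block_size:
                samples.append(cur)
                cur = []
            cur = cur + s
        if cur:
            samples.append(cur)
        return samples
    # "token": flatten everything, then cut fixed-size blocks
    flat = [t for s in sentences for t in s]
    return [flat[i:i + block_size] for i in range(0, len(flat), block_size)]
```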
    • Myle Ott's avatar
      Fix preprocess.py · fa7c575a
      Myle Ott authored
      fa7c575a
    • Myle Ott's avatar
      745d5fbd
  21. 05 Mar, 2018 1 commit
  22. 27 Feb, 2018 2 commits
    • Myle Ott's avatar
      Fix tests and flake8 · 29c82741
      Myle Ott authored
      29c82741
    • Myle Ott's avatar
      fairseq-py goes distributed (#106) · 66415206
      Myle Ott authored
      This PR includes breaking API changes to modularize fairseq-py and adds support for distributed training across multiple nodes.
      
      Changes:
      - c7033ef: add support for distributed training! See updated README for usage.
      - e016299: modularize fairseq-py, adding support for register_model, register_criterion, register_optimizer, etc.
      - 154e440: update LSTM implementation to use PackedSequence objects in the encoder, better following best practices and improving perf
      - 90c2973 and 1da6265: improve unit test coverage
      66415206
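The registration pattern added in e016299 can be sketched generically (a simplified stand-in; fairseq's real registries also wire up argument parsers and enforce base classes):

```python
MODEL_REGISTRY = {}

def register_model(name):
    """Class decorator: file the decorated class under `name` so it can
    later be looked up from an --arch style command-line flag."""
    def wrapper(cls):
        if name in MODEL_REGISTRY:
            raise ValueError("duplicate model name: {}".format(name))
        MODEL_REGISTRY[name] = cls
        return cls
    return wrapper

@register_model("toy_lstm")
class ToyLSTMModel:
    pass
```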
  23. 13 Nov, 2017 1 commit
  24. 08 Nov, 2017 2 commits
  25. 19 Oct, 2017 1 commit
  26. 15 Sep, 2017 1 commit