Commits · 7a7e2c26063efa837d56933dd6de33369aad0c62 · chenpangpang / transformers

02 Nov, 2020 1 commit

Add line by line option to mlm/plm scripts (#8240) · e1b1b614

Sylvain Gugger authored Nov 02, 2020



* Make line by line optional in run_mlm

* Add option to disable dynamic padding

* Add option to plm too and update README

* Typos

* More typos

* Even more typos

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

e1b1b614

30 Oct, 2020 2 commits

Remove deprecated arguments from new run_clm (#8197) · 9eb3a410
Sylvain Gugger authored Oct 30, 2020

9eb3a410

Finalize lm examples (#8188) · cdc48ce9

Sylvain Gugger authored Oct 30, 2020



* Finish the cleanup of the language-modeling examples

* Update main README

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Apply suggestions from code review
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Propagate changes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

cdc48ce9

29 Oct, 2020 2 commits

Fix eval ref miss in Chinese WWM. (#8115) · 9a21b506

wlhgtc authored Oct 30, 2020



* ADD: add whole word mask proxy for both eng and chinese

* MOD: adjust format

* MOD: reformat code

* MOD: update import

* MOD: fix bug

* MOD: add import

* MOD: fix bug

* MOD: decouple code and update readme

* MOD: reformat code

* Update examples/language-modeling/README.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/README.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change wwm to whole_word_mask

* reformat code

* reformat

* format

* Code quality

* ADD: update chinese ref readme

* MOD: small changes

* MOD: small changes2

* update readme

* fix eval ref file miss bug

* format file

* MOD: move ref code to contrib

* MOD: add delimeter check

* reformat code

* refomat code

* Update examples/language-modeling/README.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

9a21b506

Add a template for examples and apply it for mlm and plm examples (#8153) · 69117628

Sylvain Gugger authored Oct 29, 2020

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Styling

69117628

28 Oct, 2020 1 commit

New run_clm script (#8105) · 47dfa65b

Sylvain Gugger authored Oct 28, 2020



* New run_clm script

* Formatting

* More comments

* Remove unused imports

* Apply suggestions from code review
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address review comments

* Change link to the hub
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

47dfa65b

26 Oct, 2020 1 commit

Update README.md (#8050) · 098ddc22

mohammadreza-Banaei73 authored Oct 26, 2020

--wwm cant be used as an argument given run_language_modeling.py and should be changed to --whole_word_mask

098ddc22

22 Oct, 2020 1 commit

# Add whole word mask support for lm fine-tune (#7925) · a16e568f

wlhgtc authored Oct 22, 2020



* ADD: add whole word mask proxy for both eng and chinese

* MOD: adjust format

* MOD: reformat code

* MOD: update import

* MOD: fix bug

* MOD: add import

* MOD: fix bug

* MOD: decouple code and update readme

* MOD: reformat code

* Update examples/language-modeling/README.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/README.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change wwm to whole_word_mask

* reformat code

* reformat

* format

* Code quality

* ADD: update chinese ref readme

* MOD: small changes

* MOD: small changes2

* update readme
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>

a16e568f

12 Oct, 2020 2 commits
- Fix code quality · d6175a42
  sgugger authored Oct 12, 2020
  
  d6175a42
- The input training data files (multiple files in glob format). (#7717) · f176e707
  Kelvin authored Oct 12, 2020
```
Very often splitting large files to smaller files can prevent tokenizer going out of memory in environment like Colab that does not have swap memory
```
  f176e707
01 Sep, 2020 1 commit

Add cache_dir to save features TextDataset (#6879) · 21d71923

Jin Young (Daniel) Sohn authored Sep 01, 2020

* Add cache_dir to save features TextDataset

This is in case the dataset is in a RO filesystem, for which is the case
in tests (GKE TPU tests).

* style

21d71923

26 Aug, 2020 1 commit
- Black 20 release · a75c64d8
  Lysandre authored Aug 26, 2020
  
  a75c64d8
29 Jul, 2020 1 commit
- XLNet PLM Readme (#6121) · 641b873c
  Lysandre Debut authored Jul 29, 2020
  
  641b873c
07 Jul, 2020 1 commit

Added data collator for permutation (XLNet) language modeling and related calls (#5522) · 3dcb748e

Shashank Gupta authored Jul 07, 2020

* Added data collator for XLNet language modeling and related calls

Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
to generate necessary inputs for language modeling training with
XLNetLMHeadModel. Also added related arguments, logic and calls in
examples/language-modeling/run_language_modeling.py.

Resolves: #4739, #2008 (partially)

* Changed name to `DataCollatorForPermutationLanguageModeling`

Changed the name of `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModelling`.
Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
CTRL uses a CLM loss just like GPT and GPT-2, so should work out of the box with this script (provided `past` is taken care of
similar to `mems` for XLNet).
Changed calls and imports appropriately.

* Added detailed comments, changed variable names

Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain working. Also cleaned up variable names and made them more informative.

* Added tests for new data collator

Added tests in `tests/test_trainer.py` for DataCollatorForPermutationLanguageModeling based on those in DataCollatorForLanguageModeling. A specific test has been added to check for odd-length sequences.

* Fixed styling issues

3dcb748e

25 May, 2020 1 commit
- add DistilBERT to supported models (#4558) · 50d1ce41
  Antonis Maronikolakis authored May 25, 2020
  
  50d1ce41
19 May, 2020 1 commit

Distributed eval: SequentialDistributedSampler + gather all results (#4243) · 5e7fe8b5

Julien Chaumond authored May 18, 2020

* Distributed eval: SequentialDistributedSampler + gather all results

* For consistency only write to disk from world_master

Close https://github.com/huggingface/transformers/issues/4272

* Working distributed eval

* Hook into scripts

* Fix #3721 again

* TPU.mesh_reduce: stay in tensor space

Thanks @jysohn23

* Just a small comment

* whitespace

* torch.hub: pip install packaging

* Add test scenarii

5e7fe8b5

18 May, 2020 1 commit
- fix(run_language_modeling): use arg overwrite_cache (#4407) · d9ece823
  Boris Dayma authored May 18, 2020
  
  d9ece823
15 May, 2020 1 commit
- [skip ci] remove local rank · 15550ce0
  Julien Chaumond authored May 15, 2020
  
  15550ce0
14 May, 2020 1 commit
- Use Filelock to ensure distributed barriers · c547f15a
  Julien Chaumond authored May 14, 2020
```
see context in https://github.com/huggingface/transformers/pull/4223
```
  c547f15a
13 May, 2020 1 commit

(v2) Improvements to the wandb integration (#4324) · 24175910

Julien Chaumond authored May 12, 2020



* Improvements to the wandb integration

* small reorg + no global necessary

* feat(trainer): log epoch and final metrics

* Simplify logging a bit

* Fixup

* Fix crash when just running eval
Co-authored-by: Chris Van Pelt <vanpelt@gmail.com>
Co-authored-by: Boris Dayma <boris.dayma@gmail.com>

24175910

08 May, 2020 1 commit

[TPU] Doc, fix xla_spawn.py, only preprocess dataset once (#4223) · 7b75aa9f

Julien Chaumond authored May 08, 2020

* [TPU] Doc, fix xla_spawn.py, only preprocess dataset once

* Update examples/README.md

* [xla_spawn] Add `_mp_fn` to other Trainer scripts

* [TPU] Fix: eval dataloader was None

7b75aa9f

07 May, 2020 2 commits
- [doc] Fix broken links + remove crazy big notebook · c99fe038
  Julien Chaumond authored May 07, 2020
  
  c99fe038
- BIG Reorganize examples (#4213) · 0ae96ff8
  Julien Chaumond authored May 07, 2020
```
* Created using Colaboratory

* [examples] reorganize files

* remove run_tpu_glue.py as superseded by TPU support in Trainer

* Bugfix: int, not tuple

* move files around
```
  0ae96ff8