Commits · a75c64d80c76c3dc71f735d9197a4a601847e0cd · chenpangpang / transformers

26 Aug, 2020 1 commit
- Black 20 release · a75c64d8
  Lysandre authored Aug 26, 2020
  
  a75c64d8
07 Jul, 2020 1 commit

Added data collator for permutation (XLNet) language modeling and related calls (#5522) · 3dcb748e

Shashank Gupta authored Jul 07, 2020

* Added data collator for XLNet language modeling and related calls

Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
to generate necessary inputs for language modeling training with
XLNetLMHeadModel. Also added related arguments, logic and calls in
examples/language-modeling/run_language_modeling.py.

Resolves: #4739, #2008 (partially)

* Changed name to `DataCollatorForPermutationLanguageModeling`

Renamed `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModeling`.
Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
CTRL uses a CLM loss just like GPT and GPT-2, so it should work out of the box with this script (provided `past` is handled
similarly to `mems` for XLNet).
Changed calls and imports appropriately.

* Added detailed comments, changed variable names

Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain how it works. Also cleaned up variable names and made them more informative.

* Added tests for new data collator

Added tests in `tests/test_trainer.py` for DataCollatorForPermutationLanguageModeling based on those in DataCollatorForLanguageModeling. A specific test has been added to check for odd-length sequences.

* Fixed styling issues

3dcb748e
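
As a rough illustration of what a span-masking step controlled by a `--plm_probability`-style ratio could look like, here is a minimal pure-Python sketch. It is not the code from `data/data_collator.py`: the function name `sample_span_mask`, the `max_span_length` parameter, and the choice to reject odd-length sequences are all assumptions made for this example (the commit only mentions that odd-length sequences get a dedicated test).

```python
import random


def sample_span_mask(seq_len, plm_probability=1 / 6, max_span_length=5, rng=None):
    """Return a list of booleans marking which positions are prediction targets.

    Hypothetical helper: one short span is masked inside each context window,
    so that roughly `plm_probability` of the tokens end up as targets.
    """
    # Assumption for this sketch: only even-length sequences are accepted.
    if seq_len % 2 != 0:
        raise ValueError("sequence length must be even")
    rng = rng or random.Random()
    masked = [False] * seq_len
    cur = 0
    while cur < seq_len:
        # Length of the span to predict in this window.
        span_length = rng.randint(1, max_span_length)
        # Size of the window that surrounds the span; never smaller than the span.
        context_length = max(int(span_length / plm_probability), span_length)
        # Place the span at a random offset inside the window.
        start = cur + rng.randint(0, context_length - span_length)
        for i in range(start, min(start + span_length, seq_len)):
            masked[i] = True
        # Move on to the next window.
        cur += context_length
    return masked


# Usage: a seeded generator makes the mask reproducible.
mask = sample_span_mask(64, rng=random.Random(0))
target_positions = [i for i, is_target in enumerate(mask) if is_target]
```

A smaller `plm_probability` widens each context window relative to its span, so fewer tokens become targets.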

19 May, 2020 1 commit

Distributed eval: SequentialDistributedSampler + gather all results (#4243) · 5e7fe8b5

Julien Chaumond authored May 18, 2020

* Distributed eval: SequentialDistributedSampler + gather all results

* For consistency only write to disk from world_master

Close https://github.com/huggingface/transformers/issues/4272

* Working distributed eval

* Hook into scripts

* Fix #3721 again

* TPU.mesh_reduce: stay in tensor space

Thanks @jysohn23

* Just a small comment

* whitespace

* torch.hub: pip install packaging

* Add test scenarii

5e7fe8b5
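
The idea named in this commit title, giving each process a contiguous slice of the evaluation set and then gathering all results, can be sketched in plain Python. This is an assumption-laden illustration rather than the Trainer's implementation: `sequential_shard` and `gather_in_order` are hypothetical names, there is no real inter-process communication, and the dataset is assumed to hold at least as many samples as there are processes.

```python
import math


def sequential_shard(dataset_len, num_replicas, rank):
    """Return the contiguous block of indices evaluated by process `rank`."""
    # Every process gets the same number of samples.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad with indices from the start so the total divides evenly
    # (assumes dataset_len >= num_replicas).
    indices += indices[: total_size - dataset_len]
    return indices[rank * num_samples : (rank + 1) * num_samples]


def gather_in_order(per_rank_results, dataset_len):
    """Concatenate per-process results in rank order and drop the padding."""
    merged = [item for shard in per_rank_results for item in shard]
    return merged[:dataset_len]


# Usage: 10 samples split across 4 hypothetical processes.
shards = [sequential_shard(10, 4, rank) for rank in range(4)]
restored = gather_in_order(shards, 10)
```

Because the shards are contiguous and ordered by rank, concatenating them and truncating to the dataset length recovers the original sample order.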

18 May, 2020 1 commit
- fix(run_language_modeling): use arg overwrite_cache (#4407) · d9ece823
  Boris Dayma authored May 18, 2020
  
  d9ece823
15 May, 2020 1 commit
- [skip ci] remove local rank · 15550ce0
  Julien Chaumond authored May 15, 2020
  
  15550ce0
14 May, 2020 1 commit
- Use Filelock to ensure distributed barriers · c547f15a
  Julien Chaumond authored May 14, 2020
```
see context in https://github.com/huggingface/transformers/pull/4223
```
  c547f15a
13 May, 2020 1 commit

(v2) Improvements to the wandb integration (#4324) · 24175910

Julien Chaumond authored May 12, 2020



* Improvements to the wandb integration

* small reorg + no global necessary

* feat(trainer): log epoch and final metrics

* Simplify logging a bit

* Fixup

* Fix crash when just running eval
Co-authored-by: Chris Van Pelt <vanpelt@gmail.com>
Co-authored-by: Boris Dayma <boris.dayma@gmail.com>

24175910

08 May, 2020 1 commit

[TPU] Doc, fix xla_spawn.py, only preprocess dataset once (#4223) · 7b75aa9f

Julien Chaumond authored May 08, 2020

* [TPU] Doc, fix xla_spawn.py, only preprocess dataset once

* Update examples/README.md

* [xla_spawn] Add `_mp_fn` to other Trainer scripts

* [TPU] Fix: eval dataloader was None

7b75aa9f

07 May, 2020 1 commit

BIG Reorganize examples (#4213) · 0ae96ff8

Julien Chaumond authored May 07, 2020

* Created using Colaboratory

* [examples] reorganize files

* remove run_tpu_glue.py as superseded by TPU support in Trainer

* Bugfix: int, not tuple

* move files around

0ae96ff8

24 Apr, 2020 1 commit
- [examples] For convenience, also save the tokenizer · c8115260
  Julien Chaumond authored Apr 24, 2020
```
Close #3921
```
  c8115260
22 Apr, 2020 1 commit

Trainer (#3800) · dd9d483d

Julien Chaumond authored Apr 21, 2020

* doc

* [tests] Add sample files for a regression task

* [HUGE] Trainer

* Feedback from @sshleifer

* Feedback from @thomwolf + logging tweak

* [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes

* [glue] Use default max_seq_length of 128 like before

* [glue] move DataTrainingArguments around

* [ner] Change interface of InputExample, and align run_{tf,pl}

* Re-align the pl scripts a little bit

* ner

* [ner] Add integration test

* Fix language_modeling with API tweak

* [ci] Tweak loss target

* Don't break console output

* amp.initialize: model must be on right device before

* [multiple-choice] update for Trainer

* Re-align to 827d6d6e

dd9d483d

20 Apr, 2020 1 commit
- Fix bug in examples: double wrap into DataParallel during eval · b1ff0b2a
  Andrey Kulagin authored Apr 17, 2020
  
  b1ff0b2a
18 Apr, 2020 1 commit

Cleanup fast tokenizers integration (#3706) · 827d6d6e

Thomas Wolf authored Apr 18, 2020



* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length in BatchEncoding

* add alignment methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 and RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately tensorflow doesn't like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>

827d6d6e

13 Apr, 2020 1 commit
- fix dataset shuffling for Distributed training (huggingface#3721) (#3766) · 5ebd8989
  elk-cloner authored Apr 13, 2020
  
  5ebd8989
02 Apr, 2020 2 commits
- Resizing embedding matrix before sending it to the optimizer. (#3532) · c50aa67b
  Nicolas authored Apr 02, 2020
```
* Resizing the embedding matrix after sending it to the optimizer prevents the newly resized matrix from being updated.

* Remove space for style matter
```
  c50aa67b
- Adding should_continue check for retraining (#3509) · 1b101599
  Mark Kockerbeck authored Apr 02, 2020
  
  1b101599
24 Mar, 2020 3 commits
- [run_language_modeling] Fix: initialize a new model from a config object · eaabaaf7
  Julien Chaumond authored Mar 24, 2020
  
  eaabaaf7
- Expose missing mappings (see #3415) · f8823bad
  Julien Chaumond authored Mar 24, 2020
  
  f8823bad
- [examples] Use AutoModels in more examples · a8e3336a
  Julien Chaumond authored Mar 23, 2020
  
  a8e3336a
02 Mar, 2020 1 commit
- fix n_gpu count when no_cuda flag is activated (#3077) · 6b1ff250
  Victor SANH authored Mar 02, 2020
```
* fix n_gpu count when no_cuda flag is activated

* someone was left behind
```
  6b1ff250
12 Feb, 2020 2 commits
- Raise error when using an mlm flag for a clm model + correct TextDataset · f54a5bd3
  Lysandre authored Feb 10, 2020
  
  f54a5bd3
- Fix a few issues regarding the language modeling script · 569897ce
  Lysandre authored Feb 10, 2020
  
  569897ce
07 Feb, 2020 1 commit
- [examples] rename run_lm_finetuning to run_language_modeling · 42f08e59
  Julien Chaumond authored Feb 06, 2020
  
  42f08e59
05 Feb, 2020 1 commit
- [run_lm_finetuning] Tweak fix for non-long tensor, close #2728 · ada24def
  Julien Chaumond authored Feb 05, 2020
```
see 1ebfeb79 and #2728
Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
```
  ada24def
04 Feb, 2020 2 commits
- Revert erroneous fix · 3bf54172
  Lysandre authored Feb 04, 2020
  
  3bf54172
- Cast to long when masking tokens · 1ebfeb79
  Lysandre authored Feb 04, 2020
  
  1ebfeb79
03 Feb, 2020 1 commit

[Follow up 213] · 239dd23f

Lysandre authored Feb 03, 2020

Masked indices should have -100 and not -1. Updating documentation + scripts that were forgotten

239dd23f

28 Jan, 2020 2 commits
- Default save steps 50 to 500 in all scripts · 335dd5e6
  Lysandre authored Jan 28, 2020
  
  335dd5e6
- [run_lm_finetuning] GPT2 tokenizer doesn't have a pad_token · 6b4c3ee2
  Julien Chaumond authored Jan 27, 2020
```
ping @lysandrejik
```
  6b4c3ee2
21 Jan, 2020 5 commits
- Line-by-line text dataset (including padding) · 1a8e87be
  Julien Chaumond authored Jan 18, 2020
  
  1a8e87be
- change order · b94cf7fa
  Julien Chaumond authored Jan 18, 2020
  
  b94cf7fa
- Easier to not support this, as it could be confusing · 2eaa8b6e
  Julien Chaumond authored Jan 18, 2020
```
cc @lysandrejik
```
  2eaa8b6e
- make style · 801aaa55
  Julien Chaumond authored Jan 17, 2020
  
  801aaa55
- [run_lm_finetuning] Train from scratch · 56d4ba8d
  Julien Chaumond authored Jan 17, 2020
  
  56d4ba8d
07 Jan, 2020 2 commits
- spelling correction (#2434) · 43114b89
  Oren Amsalem authored Jan 07, 2020
  
  43114b89
- Fix error with global step in run_lm_finetuning.py · 27c1b656
  Lysandre Debut authored Jan 07, 2020
  
  27c1b656
06 Jan, 2020 2 commits
- GPU text generation: Moved the encoded_prompt to correct device · 81d6841b
  alberduris authored Dec 31, 2019
  
  81d6841b
- Moved the encoded_prompts to correct device · dd4df80f
  alberduris authored Dec 31, 2019
  
  dd4df80f
01 Jan, 2020 1 commit
- [run_lm_finetuning] mask_tokens: document types · 629b22ad
  Julien Chaumond authored Jan 01, 2020
  
  629b22ad
22 Dec, 2019 1 commit
- Update comments mentioning Python 2. · d6eaf4e6
  Aymeric Augustin authored Dec 22, 2019
  
  d6eaf4e6