Commits · 7b75aa9fa55bee577e2c7403301ed31103125a35 · chenpangpang / transformers

08 May, 2020 1 commit

[TPU] Doc, fix xla_spawn.py, only preprocess dataset once (#4223) · 7b75aa9f

Julien Chaumond authored May 08, 2020

* [TPU] Doc, fix xla_spawn.py, only preprocess dataset once

* Update examples/README.md

* [xla_spawn] Add `_mp_fn` to other Trainer scripts

* [TPU] Fix: eval dataloader was None

07 May, 2020 1 commit

BIG Reorganize examples (#4213) · 0ae96ff8

Julien Chaumond authored May 07, 2020

* Created using Colaboratory

* [examples] reorganize files

* remove run_tpu_glue.py as superseded by TPU support in Trainer

* Bugfix: int, not tuple

* move files around

24 Apr, 2020 1 commit
- [examples] For convenience, also save the tokenizer · c8115260
  Julien Chaumond authored Apr 24, 2020
```
Close #3921
```
22 Apr, 2020 1 commit

Trainer (#3800) · dd9d483d

Julien Chaumond authored Apr 21, 2020

* doc

* [tests] Add sample files for a regression task

* [HUGE] Trainer

* Feedback from @sshleifer

* Feedback from @thomwolf + logging tweak

* [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes

* [glue] Use default max_seq_length of 128 like before

* [glue] move DataTrainingArguments around

* [ner] Change interface of InputExample, and align run_{tf,pl}

* Re-align the pl scripts a little bit

* ner

* [ner] Add integration test

* Fix language_modeling with API tweak

* [ci] Tweak loss target

* Don't break console output

* amp.initialize: model must be on right device before

* [multiple-choice] update for Trainer

* Re-align to 827d6d6e

20 Apr, 2020 1 commit
- Fix bug in examples: double wrap into DataParallel during eval · b1ff0b2a
  Andrey Kulagin authored Apr 17, 2020
  
18 Apr, 2020 1 commit

Cleanup fast tokenizers integration (#3706) · 827d6d6e

Thomas Wolf authored Apr 18, 2020



* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length in BatchEncoding

* add alignment methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 and RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately tensorflow doesn't like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>

13 Apr, 2020 1 commit
- fix dataset shuffling for Distributed training (huggingface#3721) (#3766) · 5ebd8989
  elk-cloner authored Apr 13, 2020
  
02 Apr, 2020 2 commits
- Resizing embedding matrix before sending it to the optimizer. (#3532) · c50aa67b
  Nicolas authored Apr 02, 2020
```
* Resizing embedding matrix after sending it to the optimizer prevents from updating the newly resized matrix.

* Remove space for style matter
```
- Adding should_continue check for retraining (#3509) · 1b101599
  Mark Kockerbeck authored Apr 02, 2020
  
  1b101599
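Commit c50aa67b above is about ordering: if an optimizer is built from the model's parameters and the embedding matrix is resized afterwards, the optimizer may keep updating the old matrix. The following is a minimal, pure-Python sketch of that aliasing problem, not the code from the commit. `Param`, `ToySGD`, `ToyModel` and `train_one_step` are hypothetical stand-ins invented for illustration, and the sketch assumes that resizing allocates a new parameter object rather than growing the old one in place.

```python
class Param:
    """Toy stand-in for a trainable tensor (hypothetical, not a library class)."""

    def __init__(self, values):
        self.values = list(values)
        self.grad = [0.0] * len(self.values)


class ToySGD:
    """Toy optimizer that stores references to the Param objects it was given."""

    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr

    def step(self):
        # Only the Param objects captured at construction time are updated.
        for p in self.params:
            p.values = [v - self.lr * g for v, g in zip(p.values, p.grad)]


class ToyModel:
    def __init__(self, vocab_size):
        self.embedding = Param([1.0] * vocab_size)

    def resize_embedding(self, new_size):
        # Assumption: resizing builds a NEW Param and copies the old values over.
        old = self.embedding
        extra = [1.0] * (new_size - len(old.values))
        self.embedding = Param(old.values + extra)

    def parameters(self):
        return [self.embedding]


def train_one_step(resize_first):
    """Run one update and return the model's current embedding values."""
    model = ToyModel(vocab_size=3)
    if resize_first:
        model.resize_embedding(5)
        optimizer = ToySGD(model.parameters())
    else:
        optimizer = ToySGD(model.parameters())
        model.resize_embedding(5)  # optimizer still points at the old Param
    model.embedding.grad = [1.0] * len(model.embedding.values)
    optimizer.step()
    return model.embedding.values
```

With `resize_first=False` the returned values stay at their initial `1.0`, because the optimizer only holds the pre-resize object; with `resize_first=True` all five entries are updated.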
24 Mar, 2020 3 commits
- [run_language_modeling] Fix: initialize a new model from a config object · eaabaaf7
  Julien Chaumond authored Mar 24, 2020
  
- Expose missing mappings (see #3415) · f8823bad
  Julien Chaumond authored Mar 24, 2020
  
- [examples] Use AutoModels in more examples · a8e3336a
  Julien Chaumond authored Mar 23, 2020
  
02 Mar, 2020 1 commit
- fix n_gpu count when no_cuda flag is activated (#3077) · 6b1ff250
  Victor SANH authored Mar 02, 2020
```
* fix n_gpu count when no_cuda flag is activated

* someone was left behind
```
  6b1ff250
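The commit title suggests that the scripts should report zero GPUs when the `no_cuda` flag is set, even if the machine has GPUs. A minimal sketch of that rule follows; `resolve_n_gpu` and `visible_gpus` are hypothetical names, and the actual change in #3077 may be written differently.

```python
def resolve_n_gpu(no_cuda: bool, visible_gpus: int) -> int:
    """Return the GPU count a script should use (hypothetical helper).

    `visible_gpus` stands for whatever the runtime reports as available;
    it is passed in explicitly so the sketch has no framework dependency.
    """
    if visible_gpus < 0:
        raise ValueError("visible_gpus must be non-negative")
    # With the no_cuda flag set, report zero GPUs regardless of the hardware.
    return 0 if no_cuda else visible_gpus
```

For example, `resolve_n_gpu(no_cuda=True, visible_gpus=4)` yields `0`, while the same call with `no_cuda=False` yields `4`.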
12 Feb, 2020 2 commits
- Raise error when using an mlm flag for a clm model + correct TextDataset · f54a5bd3
  Lysandre authored Feb 10, 2020
  
- Fix a few issues regarding the language modeling script · 569897ce
  Lysandre authored Feb 10, 2020
  
07 Feb, 2020 1 commit
- [examples] rename run_lm_finetuning to run_language_modeling · 42f08e59
  Julien Chaumond authored Feb 06, 2020
  
05 Feb, 2020 1 commit
- [run_lm_finetuning] Tweak fix for non-long tensor, close #2728 · ada24def
  Julien Chaumond authored Feb 05, 2020
```
see 1ebfeb79 and #2728
Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
```
04 Feb, 2020 2 commits
- Revert erroneous fix · 3bf54172
  Lysandre authored Feb 04, 2020
  
- Cast to long when masking tokens · 1ebfeb79
  Lysandre authored Feb 04, 2020
  
03 Feb, 2020 1 commit

[Follow up 213] · 239dd23f

Lysandre authored Feb 03, 2020

Masked indices should have -1 and not -100. Updating documentation + scripts that were forgotten

28 Jan, 2020 2 commits
- Default save steps 50 to 500 in all scripts · 335dd5e6
  Lysandre authored Jan 28, 2020
  
- [run_lm_finetuning] GPT2 tokenizer doesn't have a pad_token · 6b4c3ee2
  Julien Chaumond authored Jan 27, 2020
```
ping @lysandrejik
```
21 Jan, 2020 5 commits
- Line-by-line text dataset (including padding) · 1a8e87be
  Julien Chaumond authored Jan 18, 2020
  
- change order · b94cf7fa
  Julien Chaumond authored Jan 18, 2020
  
- Easier to not support this, as it could be confusing · 2eaa8b6e
  Julien Chaumond authored Jan 18, 2020
```
cc @lysandrejik
```
- make style · 801aaa55
  Julien Chaumond authored Jan 17, 2020
  
- [run_lm_finetuning] Train from scratch · 56d4ba8d
  Julien Chaumond authored Jan 17, 2020
  
07 Jan, 2020 2 commits
- spelling correction (#2434) · 43114b89
  Oren Amsalem authored Jan 07, 2020
  
- Fix error with global step in run_lm_finetuning.py · 27c1b656
  Lysandre Debut authored Jan 07, 2020
  
06 Jan, 2020 2 commits
- GPU text generation: Moved the encoded_prompt to correct device · 81d6841b
  alberduris authored Dec 31, 2019
  
- Moved the encoded_prompts to correct device · dd4df80f
  alberduris authored Dec 31, 2019
  
01 Jan, 2020 1 commit
- [run_lm_finetuning] mask_tokens: document types · 629b22ad
  Julien Chaumond authored Jan 01, 2020
  
22 Dec, 2019 5 commits
- Update comments mentioning Python 2. · d6eaf4e6
  Aymeric Augustin authored Dec 22, 2019
  
- Remove __future__ imports. · c824d15a
  Aymeric Augustin authored Dec 22, 2019
  
- Fix E266 flake8 warning (x90). · fa2ccbc0
  Aymeric Augustin authored Dec 21, 2019
  
- Fix E722 flake8 warnings (x26). · 631be270
  Aymeric Augustin authored Dec 21, 2019
  
- Sort imports with isort. · 158e82e0
  Aymeric Augustin authored Dec 21, 2019
```
This is the result of:

    $ isort --recursive examples templates transformers utils hubconf.py setup.py
```
21 Dec, 2019 1 commit

Reformat source code with black. · fa84ae26

Aymeric Augustin authored Dec 21, 2019

This is the result of:

    $ black --line-length 119 examples templates transformers utils hubconf.py setup.py

There are a lot of fairly long lines in the project. As a consequence, I'm
picking the longest widely accepted line length, 119 characters.

This is also Thomas' preference, because it allows for explicit variable
names, to make the code easier to understand.

19 Dec, 2019 2 commits

[doc] Param name consistency · a5a06a85

Julien Chaumond authored Dec 19, 2019

Minor/basic text fixes (#2229) · 1718fb9e

Aidan Kierans authored Dec 19, 2019

* Small clarification

Matches line 431 to line 435 for additional clarity and consistency.

* Fixed minor typo

The letter "s" was previously omitted from the word "docstrings".
