Commits · a573777901e662ec2e565be312ffaeedef6effec · chenpangpang / transformers

24 Aug, 2020 2 commits

Update repo to isort v5 (#6686) · a5737779
Sylvain Gugger authored Aug 24, 2020
```
* Run new isort

* More changes

* Update CI, CONTRIBUTING and benchmarks
```
a5737779

Fixed DataCollatorForLanguageModeling not accepting lists of lists (#6685) · d329c9b0

Teven authored Aug 24, 2020

* Fixed DataCollatorForLanguageModeling + PermutationLanguageModeling not accepting lists of lists

* Update data_collator.py

* black was grumpy

d329c9b0

20 Aug, 2020 1 commit

Add tests to Trainer (#6605) · 573bdb0a

Sylvain Gugger authored Aug 20, 2020

* Add tests to Trainer

* Test if removing long breaks everything

* Remove ugly hack

* Fix distributed test

* Use float for number of epochs

573bdb0a

18 Aug, 2020 1 commit
- Fixed label datatype for STS-B (#6492) · 5a81195e
  Ali Modarressi authored Aug 18, 2020
```
* fixed label datatype for sts-b

* naming update

* make style

* make style
```
  5a81195e
12 Aug, 2020 1 commit

Adding PaddingDataCollator (#6442) · d2370e1b

Sylvain Gugger authored Aug 12, 2020

* Data collator with padding

* Add type annotation

* Support tensors as well

* Add comment

* Fix for labels wrong shape

* Data collator with padding

* Add type annotation

* Support tensors as well

* Add comment

* Fix for labels wrong shape

* Remove changes rendered unnecessary

d2370e1b

11 Aug, 2020 1 commit

[Performance improvement] "Bad tokens ids" optimization (#6064) · 40478291

guillaume-be authored Aug 11, 2020

* Optimized banned token masking

* Avoid duplicate EOS masking if in bad_words_id

* Updated mask generation to handle empty banned token list

* Addition of unit tests for the updated bad_words_ids masking

* Updated timeout handling in `test_postprocess_next_token_scores_large_bad_words_list` unit test

* Updated timeout handling in `test_postprocess_next_token_scores_large_bad_words_list` unit test (timeout does not work on Windows)

* Moving Marian import to the test context to allow TF only environments to run

* Moving imports to torch_available test

* Updated operations device and test

* Updated operations device and test

* Added docstring and comment for in-place scores modification

* Moving test to own test_generation_utils, use of lighter models for testing

* removed unneded imports in test_modeling_common

* revert formatting change for ModelTesterMixin

* Updated caching, simplified eos token id test, removed unnecessary @require_torch

* formatting compliance

40478291

03 Aug, 2020 2 commits

fix labels (#6213) · 0b418673
Suraj Patil authored Aug 03, 2020

0b418673

Empty assert hunt (#6056) · 5a0dac53

Teven authored Aug 03, 2020



* Fixed empty asserts

* black-reformatted stragglers in templates

* More code quality checks

* Update src/transformers/convert_marian_to_pytorch.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/convert_marian_to_pytorch.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* removed unused line as per @sshleifer
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

5a0dac53

28 Jul, 2020 2 commits
- Fix #6092 (#6093) · 28931f81
  Sylvain Gugger authored Jul 28, 2020
```
* Fix #6092

* Format
```
  28931f81
- Make all data collators accept dict (#6065) · 0206efb4
  Sylvain Gugger authored Jul 28, 2020
```
* Make all data collators accept dict

* Style
```
  0206efb4
22 Jul, 2020 1 commit

Expose padding_strategy on squad processor to fix QA pipeline performance regression (#5932) · 89630017

Funtowicz Morgan authored Jul 22, 2020



* Attempt to fix the way squad_convert_examples_to_features pad the elements for the QA pipeline.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make the code easier to read and avoid testing multiple test the same thing.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* missing enum value on truncation_strategy.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Rethinking for the easiest fix: expose the padding strategy on squad_convert_examples_to_features.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove unused imports.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

89630017

20 Jul, 2020 1 commit
- [cleanup] squad processor (#5868) · 01c40db4
  Sam Shleifer authored Jul 20, 2020
  
  01c40db4
10 Jul, 2020 1 commit
- [squad] add version tag to squad cache (#5669) · cdf4cd70
  Tomo Lazovich authored Jul 10, 2020
  
  cdf4cd70
09 Jul, 2020 1 commit

QA pipeline BART compatible (#5496) · 3bd55199

Funtowicz Morgan authored Jul 09, 2020



* Ensure padding and question cannot have higher probs than context.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Add bart the the list of tokenizers adding two <sep> tokens for squad_convert_example_to_feature
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Format.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing @patrickvonplaten comments.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing @patrickvonplaten comments about masking non-context element when generating the answer.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing @sshleifer comments.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make sure we mask CLS after handling impossible answers
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Mask in the correct vectors ...
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

3bd55199

07 Jul, 2020 2 commits

[examples] Add trainer support for question-answering (#4829) · e49393c3

Suraj Patil authored Jul 07, 2020



* add SquadDataset

* add DataCollatorForQuestionAnswering

* update __init__

* add run_squad with  trainer

* add DataCollatorForQuestionAnswering in __init__

* pass data_collator to trainer

* doc tweak

* Update run_squad_trainer.py

* Update __init__.py

* Update __init__.py
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

e49393c3

Added data collator for permutation (XLNet) language modeling and related calls (#5522) · 3dcb748e

Shashank Gupta authored Jul 07, 2020

* Added data collator for XLNet language modeling and related calls

Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
to generate necessary inputs for language modeling training with
XLNetLMHeadModel. Also added related arguments, logic and calls in
examples/language-modeling/run_language_modeling.py.

Resolves: #4739, #2008 (partially)

* Changed name to `DataCollatorForPermutationLanguageModeling`

Changed the name of `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModelling`.
Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
CTRL uses a CLM loss just like GPT and GPT-2, so should work out of the box with this script (provided `past` is taken care of
similar to `mems` for XLNet).
Changed calls and imports appropriately.

* Added detailed comments, changed variable names

Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain working. Also cleaned up variable names and made them more informative.

* Added tests for new data collator

Added tests in `tests/test_trainer.py` for DataCollatorForPermutationLanguageModeling based on those in DataCollatorForLanguageModeling. A specific test has been added to check for odd-length sequences.

* Fixed styling issues

3dcb748e

01 Jul, 2020 1 commit
- Fix tensor label type inference in default collator (#5250) · 35befd9c
  Joe Davison authored Jul 01, 2020
```
* allow tensor label inputs to default collator

* replace try/except with type check
```
  35befd9c
30 Jun, 2020 1 commit

Fix TensorFlow dataset generator (#4881) · fcf06524

Julien Plu authored Jul 01, 2020

* fix TensorFlow generator

* Better features handling

* Apply style

* Apply style

* Fix squad as well

* Apply style

* Better factorization of TF Tensors creation

fcf06524

26 Jun, 2020 1 commit

[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308) · 601d4d69

Thomas Wolf authored Jun 26, 2020

* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples

601d4d69

25 Jun, 2020 1 commit

Refactor Code samples; Test code samples (#5036) · 364a5ae1

Lysandre Debut authored Jun 25, 2020



* Refactor code samples

* Test docstrings

* Style

* Tokenization examples

* Run rust of tests

* First step to testing source docs

* Style and BART comment

* Test the remainder of the code samples

* Style

* let to const

* Formatting fixes

* Ready for merge

* Fix fixture + Style

* Fix last tests

* Update docs/source/quicktour.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Addressing @sgugger's comments + Fix MobileBERT in TF
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

364a5ae1

24 Jun, 2020 1 commit
- Replace labels with -100 to skip loss calc (#4718) · 0a3d0e02
  Setu Shah authored Jun 24, 2020
  
  0a3d0e02
19 Jun, 2020 1 commit
- [bart-mnli] Fix class flipping bug (#5141) · f45e8739
  Sam Shleifer authored Jun 19, 2020
  
  f45e8739
18 Jun, 2020 1 commit
- Fix #5114 (#5122) · 5f721ad6
  Sylvain Gugger authored Jun 18, 2020
  
  5f721ad6
17 Jun, 2020 1 commit
- Make default_data_collator more flexible and deprecate old behavior (#5060) · 20fa8289
  Sylvain Gugger authored Jun 17, 2020
```
* Make default_data_collator more flexible

* Accept tensors for all features

* Document code

* Refactor

* Formatting
```
  20fa8289
16 Jun, 2020 1 commit
- Fix all sphynx warnings (#5068) · 011cc0be
  Sylvain Gugger authored Jun 16, 2020
  
  011cc0be
15 Jun, 2020 1 commit

Make DataCollator a callable (#5015) · 1affde2f

Sylvain Gugger authored Jun 15, 2020



* Make DataCollator a callable

* Update src/transformers/data/data_collator.py
Co-authored-by: Julien Chaumond <chaumond@gmail.com>

1affde2f

05 Jun, 2020 1 commit
- Fix argument label (#4792) · 4dd5cf22
  Sylvain Gugger authored Jun 05, 2020
```
* Fix argument label

* Fix test
```
  4dd5cf22
04 Jun, 2020 1 commit

Tensorflow improvements (#4530) · f9414f75

Julien Plu authored Jun 05, 2020



* Better None gradients handling

* Apply Style

* Apply Style

* Create a loss class per task to compute its respective loss

* Add loss classes to the ALBERT TF models

* Add loss classes to the BERT TF models

* Add question answering and multiple choice to TF Camembert

* Remove prints

* Add multiple choice model to TF DistilBERT + loss computation

* Add question answering model to TF Electra + loss computation

* Add token classification, question answering and multiple choice models to TF Flaubert

* Add multiple choice model to TF Roberta + loss computation

* Add multiple choice model to TF XLM + loss computation

* Add multiple choice and question answering models to TF XLM-Roberta

* Add multiple choice model to TF XLNet + loss computation

* Remove unused parameters

* Add task loss classes

* Reorder TF imports + add new model classes

* Add new model classes

* Bugfix in TF T5 model

* Bugfix for TF T5 tests

* Bugfix in TF T5 model

* Fix TF T5 model tests

* Fix T5 tests + some renaming

* Fix inheritance issue in the AutoX tests

* Add tests for TF Flaubert and TF XLM Roberta

* Add tests for TF Flaubert and TF XLM Roberta

* Remove unused piece of code in the TF trainer

* bugfix and remove unused code

* Bugfix for TF 2.2

* Apply Style

* Divide TFSequenceClassificationAndMultipleChoiceLoss into their two respective name

* Apply style

* Mirror the PT Trainer in the TF one: fp16, optimizers and tb_writer as class parameter and better dataset handling

* Fix TF optimizations tests and apply style

* Remove useless parameter

* Bugfix and apply style

* Fix TF Trainer prediction

* Now the TF models return the loss such as their PyTorch couterparts

* Apply Style

* Ignore some tests output

* Take into account the SQuAD cls_index, p_mask and is_impossible parameters for the QuestionAnswering task models.

* Fix names for SQuAD data

* Apply Style

* Fix conflicts with 2.11 release

* Fix conflicts with 2.11

* Fix wrongname

* Add better documentation on the new create_optimizer function

* Fix isort

* logging_dir: use same default as PyTorch
Co-authored-by: Julien Chaumond <chaumond@gmail.com>

f9414f75

02 Jun, 2020 1 commit

Add cache_dir to save features in GLUE + Differentiate match/mismatch for MNLI metrics (#4621) · b231a413

Jin Young Sohn authored Jun 02, 2020



* Glue task cleaup

* Enable writing cache to cache_dir in case dataset lives in readOnly
filesystem.
* Differentiate match vs mismatch for MNLI metrics.

* Style

* Fix pytype

* Fix type

* Use cache_dir in mnli mismatch eval dataset

* Small Tweaks
Co-authored-by: Julien Chaumond <chaumond@gmail.com>

b231a413

29 May, 2020 1 commit
- Fix two bugs: 1. Index of test data of SST-2. 2. Label index of MNLI data. (#4546) · 3a5d1ea2
  Zhangyx authored May 29, 2020
  
  3a5d1ea2
21 May, 2020 1 commit

Adds predict stage for glue tasks, and generate result files which can be... · 49296533

Zhangyx authored May 21, 2020


Adds predict stage for glue tasks, and generate result files which can be submitted to gluebenchmark.com (#4463)

* Adds predict stage for glue tasks, and generate result files which could be submitted to gluebenchmark.com website.

* Use Split enum + always output the label name
Co-authored-by: Julien Chaumond <chaumond@gmail.com>

49296533

14 May, 2020 3 commits
- p_mask in SQuAD pre-processing (#4049) · 7defc667
  Lysandre Debut authored May 14, 2020
```
* Better p_mask building

* Adressing @mfuntowicz comments
```
  7defc667
- Fix: unpin flake8 and fix cs errors (#4367) · 448c4672
  Julien Chaumond authored May 14, 2020
```
* Fix: unpin flake8 and fix cs errors

* Ok we still need to quote those
```
  448c4672
- Use Filelock to ensure distributed barriers · c547f15a
  Julien Chaumond authored May 14, 2020
```
see context in https://github.com/huggingface/transformers/pull/4223
```
  c547f15a
08 May, 2020 1 commit

[TPU] Doc, fix xla_spawn.py, only preprocess dataset once (#4223) · 7b75aa9f

Julien Chaumond authored May 08, 2020

* [TPU] Doc, fix xla_spawn.py, only preprocess dataset once

* Update examples/README.md

* [xla_spawn] Add `_mp_fn` to other Trainer scripts

* [TPU] Fix: eval dataloader was None

7b75aa9f

23 Apr, 2020 1 commit
- Remove 50k limits bug · 8e093e59
  peterandluc authored Apr 23, 2020
  
  8e093e59
22 Apr, 2020 1 commit

Trainer (#3800) · dd9d483d

Julien Chaumond authored Apr 21, 2020

* doc

* [tests] Add sample files for a regression task

* [HUGE] Trainer

* Feedback from @sshleifer

* Feedback from @thomwolf + logging tweak

* [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes

* [glue] Use default max_seq_length of 128 like before

* [glue] move DataTrainingArguments around

* [ner] Change interface of InputExample, and align run_{tf,pl}

* Re-align the pl scripts a little bit

* ner

* [ner] Add integration test

* Fix language_modeling with API tweak

* [ci] Tweak loss target

* Don't break console output

* amp.initialize: model must be on right device before

* [multiple-choice] update for Trainer

* Re-align to 827d6d6e

dd9d483d

20 Apr, 2020 2 commits

Remove tqdm logging when using pipelines. (#3833) · 2c05b8a5

Funtowicz Morgan authored Apr 20, 2020

Introduce tqdm_enabled parameter on squad_convert_examples_to_features() default to True and set to False in QA pipelines.

2c05b8a5

Add `qas_id` to SquadResult and SquadExample (#3745) · c79b550d
Jared T Nielsen authored Apr 20, 2020
```
* Add qas_id

* Fix incorrect name in squad.py

* Make output files optional for squad eval
```
c79b550d

10 Apr, 2020 1 commit

Big cleanup of `glue_convert_examples_to_features` (#3688) · f98d0ef2

Julien Chaumond authored Apr 10, 2020

* Big cleanup of `glue_convert_examples_to_features`

* Use batch_encode_plus

* Cleaner wrapping of glue_convert_examples_to_features for TF

@lysandrejik

* Cleanup syntax, thanks to @mfuntowicz

* Raise explicit error in case of user error

f98d0ef2