1. 07 Apr, 2020 7 commits
  2. 06 Apr, 2020 19 commits
    • Teven · 0a9d09b4
    • Funtowicz Morgan
      Tokenizers v3.0.0 (#3185) · 96ab75b8

      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Cherry-Pick: Partially fix space-only input without special tokens added to the output #3091
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittest for tokenizers fast.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forwards add_special_tokens to Rust.
      
      * Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure TransfoXL tests run only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior with model forward.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match Python in the majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding methods name to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Correct message for batch_encode_plus None input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Fix invalid list comprehension for offset_mapping overriding content every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏

      * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
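The BatchEncoding structure introduced in this release behaves like the plain dict that slow tokenizers return (backward compatible), while also exposing the extra per-sequence mappings retrieved from Rust on the fast path. A minimal illustrative sketch of that design; the class name, constructor shape, and `tokens()` method here are simplified hypotheticals, not the library's actual implementation:

```python
# Hypothetical, simplified sketch of a BatchEncoding-style wrapper:
# dict access keeps working for model inputs, while the fast-tokenizer
# path can additionally expose the mappings computed in Rust.

class BatchEncodingSketch(dict):
    def __init__(self, data, encodings=None):
        super().__init__(data)       # keys like "input_ids", "attention_mask"
        self._encodings = encodings  # rich per-sequence info, or None on the slow path

    def tokens(self, index=0):
        # Extra mapping only available when a fast (Rust) tokenizer produced it.
        if self._encodings is None:
            raise ValueError("tokens() is only available on the fast-tokenizer path")
        return self._encodings[index]["tokens"]

enc = BatchEncodingSketch(
    {"input_ids": [[101, 7592, 102]], "attention_mask": [[1, 1, 1]]},
    encodings=[{"tokens": ["[CLS]", "hello", "[SEP]"]}],
)
enc["input_ids"]  # dict-style access still works, as before
enc.tokens(0)     # extra mapping exposed on the fast path
```

Constructed this way, existing code that feeds the dict straight into a model forward pass keeps working unchanged, which is the backward-compatibility goal the commit messages describe.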
    • Ethan Perez
      Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (#3631) · e52d1258
      * Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py
      
      `convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it would be helpful if someone more familiar with this part of the codebase checked.
      
      * Simplifying change to match recent commits
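The fix above matters because each model family reserves a different id for its padding token, so a hardcoded `pad_token=0` silently corrupts RoBERTa and XLNet inputs. A hedged sketch of the lookup (the three ids come from the commit message above; the `pad_token_for` helper name is hypothetical, for illustration only):

```python
# Pad token ids per model family, as noted in the commit above.
PAD_TOKEN_IDS = {
    "bert": 0,     # [PAD] in BERT vocabularies
    "roberta": 1,  # <pad> in RoBERTa vocabularies
    "xlnet": 5,    # <pad> in XLNet vocabularies
}

def pad_token_for(model_type: str) -> int:
    """Hypothetical helper: look the pad id up instead of hardcoding 0."""
    try:
        return PAD_TOKEN_IDS[model_type]
    except KeyError:
        raise ValueError(f"Unknown model type: {model_type}")

pad_token_for("roberta")  # 1, not the BERT default of 0
```

In practice the safest route is to read the id from the loaded tokenizer rather than from any table, which is what querying the tokenizer's pad token achieves.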
    • ktrapeznikov
      Create README.md · 0ac33ddd
    • Manuel Romero
      Add model card · 326e6eba
    • Manuel Romero
      Add model card · 43eca3f8
    • Manuel Romero
      Create README.md · 6bec88ca
    • Manuel Romero
      Add model card (#3655) · 769b60f9
      * Add model card
      
      * Fix model name in fine-tuning script
    • Manuel Romero
      Create model card (#3654) · c4bcb019
      * Create model card
      
      * Fix model name in fine-tuning script
    • Manuel Romero
      Create README.md · 6903a987
    • MichalMalyska
      Create README.md (#3662) · 760872db
    • jjacampos
      Add model card for BERTeus (#3649) · 47e1334c
      * Add model card for BERTeus
      
      * Update README
    • Suchin
      BioMed Roberta-Base (AllenAI) (#3643) · 529534dc

      * added model card
      
      * updated README
      
      * updated README
      
      * updated README
      
      * added evals
      
      * removed pico eval
      
      * Tweaks
      Co-authored-by: Julien Chaumond <chaumond@gmail.com>
    • Lysandre Debut
      Update notebooks (#3620) · 261c4ff4
      * Update notebooks
      
      * From local to global link
      
      * from local links to *actual* global links
    • Julien Chaumond
    • LysandreJik
      Re-pin isort · ea6dba27
    • LysandreJik
      unpin isort for pypi · 11c3257a
    • LysandreJik
      Release: v2.8.0 · 36bffc81
    • Patrick von Platen
      [Generate, Test] Split generate test function into beam search, no beam search (#3601) · 2ee41056
      * split beam search and no beam search test
      
      * fix test
      
      * clean generate tests
  3. 05 Apr, 2020 2 commits
  4. 04 Apr, 2020 7 commits
  5. 03 Apr, 2020 5 commits
    • Max Ryabinin
      Speed up GELU computation with torch.jit (#2988) · c6acd246
      * Compile gelu_new with torchscript
      
      * Compile _gelu_python with torchscript
      
      * Wrap gelu_new with torch.jit for torch>=1.4
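For reference, the `gelu_new` activation compiled in this commit is the standard tanh approximation of GELU; the commit wraps the torch implementation with `torch.jit` for torch>=1.4. A plain-Python sketch of the underlying formula (assumed from the standard definition, not copied from the repo):

```python
import math

def gelu_new(x: float) -> float:
    """tanh approximation of GELU, the formula the commit compiles with torchscript:
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    """
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

gelu_new(0.0)  # 0.0: GELU passes zero through unchanged
```

Scripting this elementwise expression fuses its chain of pointwise ops into one kernel, which is where the reported speedup comes from; in the actual model code the input is a tensor and `math` is replaced by `torch` ops.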
    • Lysandre Debut
      ELECTRA (#3257) · d5d7d886
      * Electra wip
      
      * helpers
      
      * Electra wip
      
      * Electra v1
      
      * ELECTRA may be saved/loaded
      
      * Generator & Discriminator
      
      * Embedding size instead of halving the hidden size
      
      * ELECTRA Tokenizer
      
      * Revert BERT helpers
      
      * ELECTRA Conversion script
      
      * Archive maps
      
      * PyTorch tests
      
      * Start fixing tests
      
      * Tests pass
      
      * Same configuration for both models
      
      * Compatible with base + large
      
      * Simplification + weight tying
      
      * Archives
      
      * Auto + Renaming to standard names
      
      * ELECTRA is uncased
      
      * Tests
      
      * Slight API changes
      
      * Update tests
      
      * wip
      
      * ElectraForTokenClassification
      
      * temp
      
      * Simpler arch + tests
      
      Removed ElectraForPreTraining which will be in a script
      
      * Conversion script
      
      * Auto model
      
      * Update links to S3
      
      * Split ElectraForPreTraining and ElectraForTokenClassification
      
      * Actually test PreTraining model
      
      * Remove num_labels from configuration
      
      * wip
      
      * wip
      
      * From discriminator and generator to electra
      
      * Slight API changes
      
      * Better naming
      
      * TensorFlow ELECTRA tests
      
      * Accurate conversion script
      
      * Added to conversion script
      
      * Fast ELECTRA tokenizer
      
      * Style
      
      * Add ELECTRA to README
      
      * Modeling Pytorch Doc + Real style
      
      * TF Docs
      
      * Docs
      
      * Correct links
      
      * Correct model initialized
      
      * random fixes
      
      * style
      
      * Addressing Patrick's and Sam's comments
      
      * Correct links in docs
    • Yohei Tamura
      BertJapaneseTokenizer accept options for mecab (#3566) · 8594dd80
      * BertJapaneseTokenizer accept options for mecab
      
      * black
      
      * fix mecab_option to Optional[str]
    • HUSEIN ZOLKEPLI
      Added albert-base-bahasa-cased README and fixed tiny-bert-bahasa-cased README (#3613) · 216e167c
      * add bert bahasa readme
      
      * update readme
      
      * update readme
      
      * added xlnet
      
      * added tiny-bert and fix xlnet readme
      
      * added albert base
    • ahotrod
      Update README.md (#3604) · 1ac6a246
      Update AutoModel & AutoTokenizer loading.