- 10 Apr, 2020 1 commit
-
Julien Chaumond authored
* Big cleanup of `glue_convert_examples_to_features`
* Use `batch_encode_plus` (sketched below)
* Cleaner wrapping of `glue_convert_examples_to_features` for TF @lysandrejik
* Cleanup syntax, thanks to @mfuntowicz
* Raise explicit error in case of user error
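A rough illustration of the `batch_encode_plus` call this cleanup moves to. The checkpoint name, example pairs, and padding arguments below are assumptions matching this era of the library, not the exact code of the PR:

```python
from transformers import BertTokenizer

# Hypothetical inputs: GLUE-style (sentence_a, sentence_b) pairs.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
pairs = [
    ("The cat sat on the mat.", "A cat was sitting on a mat."),
    ("It is raining outside.", "The weather is sunny."),
]

# One batch_encode_plus call replaces a per-example encode loop and
# returns input_ids / token_type_ids / attention_mask for every pair.
batch = tokenizer.batch_encode_plus(
    pairs,
    max_length=128,
    pad_to_max_length=True,  # pre-3.0 padding argument (assumed here)
)
print(len(batch["input_ids"]), len(batch["input_ids"][0]))  # 2 128
```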
-
- 09 Apr, 2020 5 commits
-
Patrick von Platen authored
* initial commit to add decoder caching for T5 (see the sketch below)
* better naming for caching
* finish T5 decoder caching
* correct test
* added extensive past testing for T5
* clean files
* make tests cleaner
* improve docstring
* improve docstring
* better reorder cache
* make style
* Update src/transformers/modeling_t5.py (Co-Authored-By: Yacine Jernite <yjernite@users.noreply.github.com>)
* make set output past work for all layers
* improve docstring
* improve docstring

Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com>
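A minimal sketch of how the cache is exercised (checkpoint name assumed). The caching lives inside the decoder, so `generate()` benefits transparently: each step reuses the key/value states computed for earlier positions instead of re-running attention over the whole decoded prefix:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed checkpoint; any T5 checkpoint of this era should behave the same.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer.encode(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
)
# Decoding step t now reuses the cached decoder states from steps < t.
output_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(output_ids[0]))
```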
-
calpt authored
-
Julien Chaumond authored
-
LysandreJik authored
cc @julien-c
-
Teven authored
-
- 08 Apr, 2020 6 commits
-
Lysandre Debut authored
* Updating modeling tf files; adding tests
* Merge `encode_plus` and `batch_encode_plus`
-
LysandreJik authored
-
Julien Chaumond authored
-
Seyone Chithrananda authored
* created readme.md
* update readme with fixes from PR comments
-
Lorenzo Ampil authored
-
- 07 Apr, 2020 8 commits
-
Sam Shleifer authored
-
Sam Shleifer authored
-
Patrick von Platen authored
* fix edge case for bert tokenization
* address Lysandre's comments for improvement
* use new `is_pretokenized` flag
-
Patrick von Platen authored
* improve and add features to benchmark utils
* update benchmark style
* remove output files
-
Michael Pang authored
* Optimize causal mask using torch.where: instead of multiplying by a 1.0 float mask, use torch.where with a bool mask for increased performance (sketched below).
* Maintain compatibility with torch 1.0.0 (thanks for the PR feedback)
* Fix typo
* reformat line for CI
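A rough before/after of the masking pattern described above. The tensor shapes, names, and the -1e4 fill value are illustrative assumptions, not the exact model code:

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw attention scores

# Before: 0.0/1.0 float mask applied with a multiply plus an additive bias.
float_mask = torch.tril(torch.ones(seq_len, seq_len))
masked_old = scores * float_mask - 1e4 * (1.0 - float_mask)

# After: bool mask selected with torch.where -- no full-tensor multiply.
bool_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
masked_new = torch.where(bool_mask, scores, torch.full_like(scores, -1e4))

assert torch.allclose(masked_old, masked_new)
```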
-
Sam Shleifer authored
-
Myle Ott authored
-
Julien Chaumond authored
Closes #3639 and the spurious warning mentioned in #3227. cc @lysandrejik @thomwolf
-
- 06 Apr, 2020 19 commits
-
Teven authored
Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
-
Funtowicz Morgan authored
* Renamed num_added_tokens to num_special_tokens_to_add
* Cherry-pick: partially fix space-only input without special tokens added to the output #3091
* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
* Make fast tokenizers unittests work on Windows.
* Entirely refactored unittest for tokenizers fast.
* Remove ABC class for CommonFastTokenizerTest
* Added embeded_special_tokens tests from allenai @dirkgr
* Make embeded_special_tokens tests from allenai more generic
* Uniformize vocab_size as a property for both Fast and normal tokenizers
* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
* Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
* Fix invalid input for assert_padding when testing batch_encode_plus
* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
* Ensure tokenize() correctly forwards add_special_tokens to Rust.
* Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast; avoid stripping on None values.
* unittests ensure tokenize() also throws a ValueError if provided None
* Added add_special_tokens unittest for all supported models.
* Style
* Make sure the TransfoXL test runs only if PyTorch is provided.
* Split up tokenizers tests for each model type.
* Fix invalid unittest with new tokenizers API.
* Filter out Roberta OpenAI-detector models from unittests.
* Introduce BatchEncoding on the fast tokenizers path. This new structure exposes all the mappings retrieved from Rust; it also keeps the current behavior with model forward (see the sketch after this entry).
* Introduce BatchEncoding on the slow tokenizers path. Backward compatibility.
* Improve error message on BatchEncoding for slow path
* Make add_prefix_space True by default on Roberta fast to match Python in the majority of cases.
* Style and format.
* Added typing on all methods for PretrainedTokenizerFast
* Style and format
* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
* Style and format
* encode_plus now supports pretokenized inputs.
* Remove user warning about add_special_tokens when working on pretokenized inputs.
* Always go through the post-processor.
* Added support for pretokenized input pairs on encode_plus
* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
* Added pretokenized inputs support on batch_encode_plus
* Update BatchEncoding method names to match Encoding.
* Bump setup.py tokenizers dependency to 0.7.0rc1
* Remove unused parameters in BertTokenizerFast
* Make sure Roberta returns token_type_ids for unittests.
* Added missing typings
* Update add_tokens prototype to match the tokenizers side and allow AddedToken
* Bump tokenizers to 0.7.0rc2
* Added documentation for BatchEncoding
* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust tokenizers.
* Fix text-classification pipeline using the wrong tokenizer
* Make pipelines work with BatchEncoding
* Turn off add_special_tokens on tokenize by default.
* Remove add_prefix_space from tokenize call in unittest.
* Style and quality
* Correct message for batch_encode_plus None input exception.
* Fix invalid list comprehension for offset_mapping overriding content every iteration.
* TransfoXL uses Strip normalizer.
* Bump tokenizers dependency to 0.7.0rc3
* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
* SpecialTokensMixin can use slots for faster access to underlying attributes.
* Remove update_special_tokens from fast tokenizers.
* Ensure TransfoXL unittests are run only when torch is available.
* Style.
* Style
* Style 🙏 🙏
* Remove slots on SpecialTokensMixin; needs a deeper dive into the pickle protocol.
* Remove Roberta warning on __init__.
* Move documentation to Google style.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
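A rough sketch of the BatchEncoding and pretokenized-input behavior this refactor describes. The checkpoint name and exact keyword arguments are assumptions for this version of the library, not code from the PR:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# encode_plus now returns a BatchEncoding: dict-like for the model
# forward pass, while also exposing the mappings computed in Rust.
encoding = tokenizer.encode_plus(
    "Hello world",
    add_special_tokens=True,
    return_offsets_mapping=True,  # fast (Rust) tokenizers only
)
print(encoding["input_ids"])
print(encoding["offset_mapping"])  # (char_start, char_end) per token

# Pretokenized List[str] input goes through the new is_pretokenized flag.
pretok = tokenizer.encode_plus(["Hello", "world"], is_pretokenized=True)
print(pretok["input_ids"])
```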
-
Ethan Perez authored
* Fix RoBERTa/XLNet pad token in run_multiple_choice.py: `convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it would be helpful if someone more familiar with this part of the codebase checked (the mismatch is sketched below).
* Simplify the change to match recent commits
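A quick illustration of the mismatch (checkpoint names assumed): each tokenizer defines its own pad id, so the safe pattern is to read it from the tokenizer rather than hardcode 0:

```python
from transformers import AutoTokenizer

# Each model family uses a different pad token id, so a hardcoded
# pad_token=0 is only correct for BERT-style vocabularies.
for name in ["bert-base-uncased", "roberta-base", "xlnet-base-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.pad_token_id)  # expected: 0, 1, 5

# The fix amounts to passing the tokenizer's own id into feature
# conversion, e.g.:
# convert_examples_to_features(..., pad_token=tokenizer.pad_token_id)
```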
-
ktrapeznikov authored
-
Manuel Romero authored
-
Manuel Romero authored
-
Manuel Romero authored
-
Manuel Romero authored
* Add model card
* Fix model name in fine-tuning script
-
Manuel Romero authored
* Create model card
* Fix model name in fine-tuning script
-
Manuel Romero authored
-
MichalMalyska authored
-
jjacampos authored
* Add model card for BERTeus
* Update README
-
Suchin authored
* added model card
* updated README
* updated README
* updated README
* added evals
* removed pico eval
* Tweaks

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
-
Lysandre Debut authored
* Update notebooks
* From local to global link
* from local links to *actual* global links
-
Julien Chaumond authored
Co-Authored-By: Kevin Clark <clarkkev@users.noreply.github.com>
Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
-
LysandreJik authored
-
LysandreJik authored
-
LysandreJik authored
-
Patrick von Platen authored
* split beam search and no beam search test
* fix test
* clean generate tests
-
- 05 Apr, 2020 1 commit
-
Patrick von Platen authored
-