- 25 Feb, 2020 6 commits
-
-
Patrick von Platen authored
* Add first files
* Add XLM-RoBERTa integration tests
* Make style
* Fix flake8 issues
-
srush authored
* Change masking to direct labels
* Fix black
* Switch to ignore index
* Fix black
-
Jhuo IH authored
-
Lysandre Debut authored
* Usage: Sequence Classification & Question Answering
* Pipeline example (see the sketch below)
* Language modeling
* TensorFlow code for Sequence Classification
* Custom TF/PT toggler in docs
* QA + LM for TensorFlow
* Finish Usage for both PyTorch and TensorFlow
* Addressing Julien's comments
* More assertive
* Cleanup
* Favicon: added a favicon option in conf.py along with the favicon image
* Updated 🤗 logo: slightly smaller, and should appear more consistent across editing programs (no more tongue on the outside of the mouth)

Co-authored-by: joshchagani <joshua@joshuachagani.com>
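A minimal sketch of the kind of pipeline usage these docs demonstrate; the task name and example text are illustrative, and the default model is whatever the installed transformers version ships for sentiment analysis:

```python
# Sketch of a sequence-classification pipeline like the ones in the Usage docs.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model
result = classifier("Transformers pipelines make inference a one-liner.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```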
-
Julien Chaumond authored
-
Julien Chaumond authored
-
- 24 Feb, 2020 13 commits
-
-
Lysandre Debut authored
-
Lysandre Debut authored
-
Lysandre authored
-
Funtowicz Morgan authored
* Renamed the file generated by tokenizers when calling save_pretrained to match Python
* Added save_vocabulary tests
* Removed the Python quick-and-dirty fix in favor of the clean Rust implementation
* Bumped the tokenizers dependency to 0.5.1
* TransfoXLTokenizerFast uses a JSON vocabulary file + warning about the incompatibility between Python and Rust
* Added some save_pretrained / from_pretrained unit tests
* Updated tokenizers to 0.5.2
* Quality and format
* flake8
* Made sure there is really a bug in unittest
* Fixed the TransfoXL constructor vocab_file / pretrained_vocab_file mixin

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
-
Sandro Cavallari authored
-
Patrick von Platen authored
* Add an explanatory example to XLNet LM modeling
* Improve the docstring for XLNet
-
Patrick von Platen authored
Add a preprocessing step for Transfo-XL tokenization to avoid tokenizing words followed by punctuation as <unk> (#2987)
* Add preprocessing to insert a space before punctuation for transfo_xl (see the sketch below)
* Improve warning messages
* Make style
* Compile the regex at instantiation of the tokenizer object
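A minimal sketch of the idea, assuming a simplified pattern (the library's actual regex and warnings differ):

```python
import re

# Simplified sketch of the preprocessing: insert a space before punctuation
# so that e.g. "Hello," is seen as ["Hello", ","] rather than an
# out-of-vocabulary token mapped to <unk>. The pattern here is an
# assumption; it is compiled once at tokenizer instantiation rather than
# on every call.
PUNCT_BEFORE = re.compile(r"(\w)([!?.,;:])")

def add_space_before_punct(text: str) -> str:
    return PUNCT_BEFORE.sub(r"\1 \2", text)

print(add_space_before_punct("Hello, world."))  # -> "Hello , world ."
```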
-
Bram Vanroy authored
* Add disable_outgoing to pretrained items. Setting disable_outgoing=True disables outgoing traffic:
  - etags are not looked up
  - models are not downloaded
* Parameter name change
* Remove a forgotten print
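A hedged usage sketch: the flag was introduced as disable_outgoing and renamed in this PR; later transformers releases expose the same behavior as local_files_only (treat the exact keyword as an assumption for the version you run):

```python
# Sketch: load from the local cache only, with no etag lookups or downloads.
# `local_files_only` is assumed to be the renamed parameter's final name;
# check your installed transformers version.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", local_files_only=True)
model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)
```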
-
Manuel Romero authored
-
Lysandre Debut authored
-
Lysandre Debut authored
* Testing that encode_plus and batch_encode_plus behave the same way. Spoiler alert: they don't (see the sketch below)
* Testing the rest of the arguments in batch_encode_plus
* Test tensor return in batch_encode_plus
* Addressing Sam's comments
* flake8
* Simplified with `num_added_tokens`
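A minimal sketch of the equivalence these tests pin down, under the pre-3.0 API where encode_plus and batch_encode_plus are separate methods:

```python
# Sketch of the property under test: batch_encode_plus over a list should
# match encode_plus applied sentence by sentence (pre-3.0 tokenizer API).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentences = ["Hello world", "Tokenizers should agree with themselves."]

per_sentence = [tokenizer.encode_plus(s)["input_ids"] for s in sentences]
batched = tokenizer.batch_encode_plus(sentences)["input_ids"]
assert per_sentence == batched
```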
-
Patrick von Platen authored
* Add slow generate lm_model tests
* Fix merge conflicts
* Make style
* Delete unused variables
* Finished hard-coded tests
-
Lysandre Debut authored
Warning on `add_special_tokens` when passed to `encode`, `encode_plus` and `batch_encode_plus`
-
- 23 Feb, 2020 3 commits
-
-
Patrick von Platen authored
-
Sam Shleifer authored
-
Lysandre Debut authored
Don't know of a use case where that would be useful, but this is more consistent
-
- 22 Feb, 2020 6 commits
-
-
Sam Shleifer authored
-
Joe Davison authored
-
saippuakauppias authored
-
Malte Pietsch authored
Add image
-
Manuel Romero authored
- Added an example using the model with pipelines, showing that we set `{"use_fast": False}` in the tokenizer (see the sketch below)
- Added a Colab to play with the model and pipelines
- Added a Colab to discover Hugging Face pipelines at the end of the document
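A hedged sketch of the pipeline example described above; the model identifier is a placeholder, not the one from this model card:

```python
# Sketch: build a pipeline with the slow (Python) tokenizer by passing
# {"use_fast": False}. "some-user/some-model" is a placeholder identifier.
from transformers import pipeline

nlp = pipeline(
    "question-answering",
    model="some-user/some-model",
    tokenizer=("some-user/some-model", {"use_fast": False}),
)
```
-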
Funtowicz Morgan authored
* enable_padding should pad up to max_length if set (see the sketch below)
* Added more testing on padding

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
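A minimal sketch of the fixed behavior, using the tokenizers library; the vocab file path is a placeholder:

```python
# Sketch: with a fixed `length`, enable_padding pads every encoding up to
# that length. Replace the placeholder path with a real vocab file.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("vocab.txt")  # placeholder vocab file
tokenizer.enable_padding(pad_token="[PAD]", length=16)

encoding = tokenizer.encode("A short sentence.")
assert len(encoding.ids) == 16  # padded up to the requested length
```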
-
- 21 Feb, 2020 7 commits
-
-
Lysandre Debut authored
-
Sam Shleifer authored
* Only use F.gelu for torch >= 1.4.0
* Use F.gelu for newer torch
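A minimal sketch of such a version guard, assuming a naive string comparison (a robust guard would parse the version number):

```python
# Sketch: pick the fused F.gelu on torch >= 1.4.0, else a manual GELU.
# The naive string comparison works for the 1.x versions of the era;
# production code should use packaging.version.parse instead.
import math
import torch
import torch.nn.functional as F

def _gelu_python(x):
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

gelu = F.gelu if torch.__version__ >= "1.4.0" else _gelu_python
```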
-
Patrick von Platen authored
* Improving generation (see the usage sketch below)
* Finalized special-token behaviour for no_beam_search generation
* Solved merge conflicts in modeling_utils.py
* Add run_generation improvements from PR #2749
* Adapted language generation to not use a hardcoded -1 when no padding token is available
* Removed the -1 removal, as hardcoded -1s are no longer necessary
* Add lightweight language generation testing for randomly initialized models - just checking that no errors are thrown
* Add slow language generation tests for pretrained models using hardcoded output with a PyTorch seed
* Delete ipdb
* Check that all generated tokens are valid
* Renaming: Generation -> Generate
* Make style
* Updated so that generate_beam_search has the same token behavior as generate_no_beam_search
* Consistent return format for run_generation.py
* Deleted pretrained LM generate tests -> will be added in another PR
* Cleaning of unused if statements and renaming
* run_generate will always return an iterable
* Consistent renaming
* Improve naming, make sure the generate function always returns the same tensor, add docstring
* Add slow tests for all LM-head models
* Make style and improve example comments in modeling_utils
* Better naming and refactoring in modeling_utils
* Changed the fast random LM generation testing design to a more general one
* Deleted the old testing design in gpt2
* Corrected an old variable name
* Temporary fix for encoder_decoder LM generation tests - to be updated when T5 is fixed
* Adapted all fast random generate tests to the new design
* Better warning description in modeling_utils
* Better comments and error message

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
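A hedged usage sketch of the generate() behavior described above (transformers ~v2.5): the same call shape works with and without beam search and always returns a tensor of token ids:

```python
# Sketch: generate() returns token ids as a tensor whether or not beam
# search is enabled; decoding back to text is a separate step.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The weather today is", return_tensors="pt")
output_ids = model.generate(input_ids, max_length=20, num_beams=3)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```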
-
maximeilluin authored
* Added CamembertForQuestionAnswering
* Fixed the CamemBERT tokenizer case
-
Bram Vanroy authored
TensorFlow does not use .eval() vs .train(). Closes https://github.com/huggingface/transformers/issues/2906
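A small illustration of the framework difference behind this fix; the TensorFlow side is shown in comments (assumed tf.keras API):

```python
# PyTorch toggles train/eval state on the module itself:
import torch.nn as nn

dropout = nn.Dropout(p=0.1)
dropout.eval()  # dropout disabled until .train() is called again

# TensorFlow/Keras has no .eval(); the mode is passed per call instead:
#   layer = tf.keras.layers.Dropout(0.1)
#   y = layer(x, training=False)
```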
-
ahotrod authored
-
Martin Malmsten authored
-
- 20 Feb, 2020 5 commits
-
-
Sam Shleifer authored
* Results same as fairseq
* Wrote a ton of tests
* Struggled with API signatures
* Added some docs
-
guillaume-be authored
* Removed unused fields in DistilBert TransformerBlock
-
srush authored
-
Joe Davison authored
-
Scott Gigante authored
-