Commits · 33e43edddcab60217027dcf7f6570eead1195083 · chenpangpang / transformers

07 Jul, 2020 16 commits

[docs] fix model_doc links in model summary (#5566) · 33e43edd
Suraj Patil authored Jul 07, 2020
```
* fix model_doc links

* update model links
```
Fix tests imports dpr (#5576) · 4fedc125
Quentin Lhoest authored Jul 07, 2020
```
* fix test imports

* fix max_length

* style

* fix tests
```
[Bart] enable test_torchscript, update test_tie_weights (#5457) · d4886173
Sam Shleifer authored Jul 07, 2020
```
* Passing all but one torchscript test

* Style

* move comment

* remove unneeded assert
```

[examples] Add trainer support for question-answering (#4829) · e49393c3

Suraj Patil authored Jul 07, 2020



* add SquadDataset

* add DataCollatorForQuestionAnswering

* update __init__

* add run_squad with trainer

* add DataCollatorForQuestionAnswering in __init__

* pass data_collator to trainer

* doc tweak

* Update run_squad_trainer.py

* Update __init__.py

* Update __init__.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>


Add DPR model (#5279) · fbd87921

Quentin Lhoest authored Jul 07, 2020



* beginning of dpr modeling

* wip

* implement forward

* remove biencoder + better init weights

* export dpr model to embed model for nlp lib

* add new api

* remove old code

* make style

* fix dumb typo

* don't load bert weights

* docs

* docs

* style

* move the `k` parameter

* fix init_weights

* add pretrained configs

* minor

* update config names

* style

* better config

* style

* clean code based on PR comments

* change Dpr to DPR

* fix config

* switch encoder config to a dict

* style

* inheritance -> composition

* add messages in assert statements

* add dpr reader tokenizer

* one tokenizer per model

* fix base_model_prefix

* fix imports

* typo

* add convert script

* docs

* change tokenizers conf names

* style

* change tokenizers conf names

* minor

* minor

* fix wrong names

* minor

* remove unused convert functions

* rename convert script

* use return_tensors in tokenizers

* remove n_questions dim

* move generate logic to tokenizer

* style

* add docs

* docs

* quality

* docs

* add tests

* style

* add tokenization tests

* DPR full tests

* Stay true to the attention mask building

* update docs

* missing param in bert input docs

* docs

* style

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>


Update model card (#5491) · d2a93991
Savaş Yıldırım authored Jul 07, 2020

Update model card (#5492) · 2e653d89
Savaş Yıldırım authored Jul 07, 2020

bert-turkish-text-classification model card (#5493) · beaf60e5
Savaş Yıldırım authored Jul 07, 2020


electra-small-finetuned-squadv1 model card (#5430) · e6eba841

Manuel Romero authored Jul 07, 2020

* Create model card

Create model card for electra-small-discriminator finetuned on SQUAD v1.1

* Set right model path in code example


ukr-roberta-base model card (#5514) · 43b7ad5d
Vitalii Radchenko authored Jul 07, 2020

roberta-base-1B-1-finetuned-squadv1 model card (#5515) · 87aa857d
Manuel Romero authored Jul 07, 2020


zuBERTa model card (#5536) · c7d96b60

Moseli Motsoehli authored Jul 07, 2020



* Create README

* Update README.md

Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>


roberta-base-1B-1-finetuned-squadv2 model card (#5523) · b95dfcf1
Manuel Romero authored Jul 07, 2020


Make T5 compatible with ONNX (#5518) · 69122657

Abel authored Jul 07, 2020



* Default decoder inputs to encoder ones for T5 if neither are specified.

* Fixing typo, now all tests are passing.

* Changing einsum to operations supported by onnx

* Adding a test to ensure T5 can be exported to onnx op>9

* Modified test for onnx export to make it faster

* Styling changes.

* Styling changes.

* Changing notation for matrix multiplication

Co-authored-by: Abel Riboulot <tkai@protomail.com>


[Reformer] Adapt Reformer MaskedLM Attn mask (#5560) · 989ae326
Patrick von Platen authored Jul 07, 2020
```
* fix attention mask

* fix slow test

* refactor attn masks

* fix fp16 generate test
```

Added data collator for permutation (XLNet) language modeling and related calls (#5522) · 3dcb748e

Shashank Gupta authored Jul 07, 2020

* Added data collator for XLNet language modeling and related calls

Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
to generate necessary inputs for language modeling training with
XLNetLMHeadModel. Also added related arguments, logic and calls in
examples/language-modeling/run_language_modeling.py.

Resolves: #4739, #2008 (partially)

* Changed name to `DataCollatorForPermutationLanguageModeling`

Changed the name of `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModeling`.
Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
CTRL uses a CLM loss just like GPT and GPT-2, so it should work out of the box with this script (provided `past` is handled
in the same way as `mems` for XLNet).
Changed calls and imports appropriately.

* Added detailed comments, changed variable names

Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain how it works. Also cleaned up variable names and made them more informative.

* Added tests for new data collator

Added tests in `tests/test_trainer.py` for DataCollatorForPermutationLanguageModeling based on those in DataCollatorForLanguageModeling. A specific test has been added to check for odd-length sequences.

* Fixed styling issues
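
For orientation, the span-sampling idea behind a permutation language modeling collator can be sketched as below. This is a simplified illustration, not the code added in this commit: the function name `sample_span_mask`, its default values, and the `rng` parameter are assumptions made for the example, and the even-length check only mirrors the odd-length test mentioned above. It does not build the permutation mask or target mapping that XLNet also needs.

```python
import random


def sample_span_mask(seq_len, plm_probability=1 / 6, max_span_length=5, rng=None):
    """Return a list of booleans marking positions chosen as prediction targets.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    if seq_len % 2 != 0:
        # Assumption: reject odd lengths, as the commit's odd-length test suggests.
        raise ValueError("sequence length must be even for this sketch")
    rng = rng or random.Random()
    masked = [False] * seq_len
    cur = 0
    while cur < seq_len:
        # Pick how many consecutive tokens to mask in this window.
        span = rng.randint(1, max_span_length)
        # Reserve a context window so that roughly plm_probability of it is masked.
        context = int(span / plm_probability)
        # Place the span somewhere inside the reserved window.
        start = cur + rng.randint(0, context - span)
        for i in range(start, min(start + span, seq_len)):
            masked[i] = True
        cur += context
    return masked


if __name__ == "__main__":
    mask = sample_span_mask(16, rng=random.Random(0))
    print("".join("M" if m else "." for m in mask))
```

Each window of `context` tokens contributes at most `span` masked positions, so the masked fraction stays close to `plm_probability` on long sequences.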


06 Jul, 2020 13 commits
- Post v3.0.2 release commit · 1d233286
  Lysandre authored Jul 06, 2020
  
- Release: v3.0.2 · b0892fa0
  Lysandre authored Jul 06, 2020
  
- Fix fast tokenizers too (#5562) · f1e2e423
  Sylvain Gugger authored Jul 06, 2020
  
- Various tokenizers fixes (#5558) · 5787e4c1
  Anthony MOI authored Jul 06, 2020
```
* BertTokenizerFast - Do not specify strip_accents by default

* Bump tokenizers to new version

* Add test for AddedToken serialization
```
- Fix #5507 (#5559) · 21f28c34
  Sylvain Gugger authored Jul 06, 2020
```
* Fix #5507

* Fix formatting
```
- The `add_space_before_punct_symbol` is only for TransfoXL (#5549) · 9d9b872b
  Lysandre Debut authored Jul 06, 2020
  
- GPT2 tokenizer should not output token type IDs (#5546) · d6b0b9d4
  Lysandre Debut authored Jul 06, 2020
```
* GPT2 tokenizer should not output token type IDs

* Same for OpenAIGPT
```
- Fix #5544 (#5551) · 7833b21a
  Sylvain Gugger authored Jul 06, 2020
  
- Fix the tokenization warning noted in #5505 (#5550) · c4734840
  Thomas Wolf authored Jul 06, 2020
```
* fix warning

* style and quality
```
- Imports organization · 1bbc28be
  Lysandre authored Jul 06, 2020
  
- Update convert_pytorch_checkpoint_to_tf2.py (#5531) · 1bc13697
  Mohamed Taher Alrefaie authored Jul 06, 2020
```
fixed ImportError: cannot import name 'hf_bucket_url'
```
- Typo fix in `training` doc (#5495) · b2309cc6
  Arnav Sharma authored Jul 06, 2020
  
- Fix typo in training (#5510) · 7ecff0cc
  ELanning authored Jul 06, 2020
  
03 Jul, 2020 10 commits

[cleanup] TF T5 tests only init t5-base once. (#5410) · 58cca47c
Sam Shleifer authored Jul 03, 2020

better error message (#5497) · 99117292
Patrick von Platen authored Jul 03, 2020

unpinning specific git versions in setup.py · b58a15a3
Thomas Wolf authored Jul 03, 2020

Release: 3.0.1 · fedabcd1
Thomas Wolf authored Jul 03, 2020


Exposing prepare_for_model for both slow & fast tokenizers (#5479) · 17ade127

Lysandre Debut authored Jul 03, 2020



* Exposing prepare_for_model for both slow & fast tokenizers

* Update method signature

* The traditional style commit

* Hide the warnings behind the verbose flag

* update default truncation strategy and prepare_for_model

* fix tests and prepare_for_models methods

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>


Create model card (#5396) · 814ed7ee

Manuel Romero authored Jul 03, 2020

Create model card for electricidad-small (Spanish Electra) fine-tuned on SQUAD-esv1


grammar corrections and train data update (#5448) · 49281ac9
Moseli Motsoehli authored Jul 03, 2020
```
- fixed grammar and spelling
- added an intro
- updated Training data references
```
Update upstream (#5456) · 97355339
chrisliu authored Jul 03, 2020

Create model card (#5464) · 55b932a8
Manuel Romero authored Jul 03, 2020
```
Create model card for electra-small-discriminator fine-tuned on SQUAD v2.0
```

QA Pipelines fixes (#5429) · 21cd8c40

Funtowicz Morgan authored Jul 03, 2020



* Make the QA pipeline support models with more than 2 outputs, such as BART, assuming start/end are the first two outputs.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length.

This results in every sample going through the QA pipeline being of size 384, whatever the actual input size, which makes the overall pipeline very slow.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Mask padding & question before applying softmax. Softmax has been refactored to operate in log space for speed and stability.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Format.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use PaddingStrategy.LONGEST instead of DO_NOT_PAD
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Revert "When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length."

This reverts commit 1b00a9a2
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Trigger CI after unattended failure

* Trigger CI
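
The "mask padding & question before applying softmax" and "log space" points above can be illustrated with a generic masked softmax built on the log-sum-exp trick. This is a minimal sketch of the general technique under stated assumptions, not the pipeline's actual implementation: the function name `masked_softmax` and its list-based interface are hypothetical.

```python
import math


def masked_softmax(logits, keep_mask):
    """Softmax over positions where keep_mask is True; other positions get 0.0.

    Hypothetical helper for illustration; not taken from the pipeline code.
    """
    kept = [x for x, keep in zip(logits, keep_mask) if keep]
    if not kept:
        raise ValueError("at least one position must be kept")
    # Log-sum-exp: subtract the max before exponentiating to avoid overflow.
    m = max(kept)
    log_z = m + math.log(sum(math.exp(x - m) for x in kept))
    # Normalise in log space, then exponentiate once per kept position.
    return [math.exp(x - log_z) if keep else 0.0
            for x, keep in zip(logits, keep_mask)]


if __name__ == "__main__":
    # Positions 0 and 2 stand in for question/padding tokens and are masked out,
    # so their large logits cannot win the span selection.
    probs = masked_softmax([1000.0, 2.0, 1000.0, 5.0], [False, True, False, True])
    print(probs)
```

Masking before normalisation means question and padding tokens cannot absorb probability mass, so the scores over context tokens still sum to one.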


02 Jul, 2020 1 commit
- Fix roberta model ordering for TFAutoModel (#5414) · 8438bab3
  Pierric Cistac authored Jul 02, 2020
  