"sgl-router/src/routers/vscode:/vscode.git/clone" did not exist on "7a06ef984d262cd9bd38d4ef83382ab5c6e73aa8"
- 17 Apr, 2020 2 commits
-
-
Patrick von Platen authored
-
Pierric Cistac authored
* Add TFAlbertForQuestionAnswering
* Add TFRobertaForQuestionAnswering
* Update TFAutoModel with Roberta/Albert for QA
* Clean `super` TF Albert calls
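A minimal sketch of the new TF question-answering heads resolving through the auto classes (the checkpoint name and the tuple-style outputs of this era are assumptions, not part of the commit):

```python
# Sketch: load a TF QA head via the auto class (checkpoint name assumed).
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = TFAutoModelForQuestionAnswering.from_pretrained("roberta-base")

inputs = tokenizer.encode_plus(
    "Who wrote the note?", "The note was written by Ada.", return_tensors="tf"
)
start_logits, end_logits = model(inputs)[:2]  # span start/end scores per token
```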
-
- 16 Apr, 2020 2 commits
-
-
Patrick von Platen authored
-
Patrick von Platen authored
* correct gpt2 test inputs
* make style
* delete modeling_gpt2 change in test file
* translate from pytorch
* correct tests
* fix conflicts
* fix conflicts
* fix conflicts
* fix conflicts
* make tensorflow t5 caching work
* make style
* clean reorder cache
* remove unnecessary spaces
* fix test
-
- 14 Apr, 2020 1 commit
-
-
Patrick von Platen authored
* remove output_past from pt
* make style
* add optional input length for gpt2
* add use cache to prepare input
* save memory in gpt2
* correct gpt2 test inputs
* make past input optional for gpt2
* finish use_cache for all models
* make style
* delete modeling_gpt2 change in test file
* correct docstring
* correct `is True` statements for gpt2
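A sketch of the cached decoding this enables, using the API of this era (models return tuples and take `past`; later versions renamed the argument to `past_key_values`):

```python
# Sketch of incremental GPT-2 decoding with the cache (era API, see note above).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = torch.tensor([tokenizer.encode("Hello, my dog")])
with torch.no_grad():
    logits, past = model(input_ids)[:2]              # first pass builds the cache
    next_token = logits[:, -1].argmax(-1, keepdim=True)
    logits, past = model(next_token, past=past)[:2]  # later passes feed one token
```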
-
- 13 Apr, 2020 1 commit
-
-
Teven authored
* Shifting labels inside TransfoXLLMHead
* Changed doc to reflect change
* Updated pytorch test
* removed IDE whitespace changes
* black reformat

Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
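A sketch of the effect: callers now pass `labels` aligned one-to-one with `input_ids`, and the next-token shift happens inside the LM head (loss shape and reduction may differ by version):

```python
# Sketch: no manual label shifting by the caller after this change.
import torch
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")

input_ids = torch.tensor([tokenizer.encode("the quick brown fox jumps")])
outputs = model(input_ids, labels=input_ids)  # shift applied inside the model
```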
-
- 10 Apr, 2020 2 commits
-
-
Julien Chaumond authored
* [examples] Generate argparsers from type hints on dataclasses
* [HfArgumentParser] way simpler API
* Restore run_language_modeling.py for easier diff
* [HfArgumentParser] final tweaks from code review
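A minimal sketch of the dataclass-driven parser (the field names here are illustrative, not from the PR):

```python
# Sketch: CLI flags are generated from the dataclass's type hints and defaults.
from dataclasses import dataclass, field
from transformers import HfArgumentParser

@dataclass
class ModelArguments:
    model_name_or_path: str = field(metadata={"help": "Checkpoint to load"})
    learning_rate: float = 5e-5
    do_train: bool = False

parser = HfArgumentParser(ModelArguments)
(model_args,) = parser.parse_args_into_dataclasses()
# e.g. python train.py --model_name_or_path bert-base-uncased --do_train
```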
-
Sam Shleifer authored
- support mbart-en-ro weights
- add MBartTokenizer
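A sketch of the new tokenizer (the hub identifier for the en-ro weights is an assumption):

```python
# Sketch: load the new MBartTokenizer (checkpoint name assumed).
from transformers import MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
ids = tokenizer.encode("UN Chief Says There Is No Military Solution in Syria")
```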
-
- 09 Apr, 2020 2 commits
-
-
Patrick von Platen authored
* initial commit to add decoder caching for T5
* better naming for caching
* finish T5 decoder caching
* correct test
* added extensive past testing for T5
* clean files
* make tests cleaner
* improve docstring
* improve docstring
* better reorder cache
* make style
* Update src/transformers/modeling_t5.py (Co-Authored-By: Yacine Jernite <yjernite@users.noreply.github.com>)
* make set output past work for all layers
* improve docstring
* improve docstring

Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com>
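A sketch of what the caching buys: with decoder caching wired in, `generate()` reuses past key/values instead of recomputing the full decoder prefix at every step.

```python
# Sketch: T5 generation benefits from the decoder cache automatically.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = torch.tensor([tokenizer.encode("translate English to German: Hello!")])
output_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(output_ids[0]))
```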
-
LysandreJik authored
cc @julien-c
-
- 08 Apr, 2020 1 commit
-
-
Lysandre Debut authored
* Updating modeling tf files; adding tests
* Merge `encode_plus` and `batch_encode_plus`
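A sketch of the unified API: the single and batch variants now return the same dict-like structure.

```python
# Sketch: encode_plus and batch_encode_plus produce the same keys.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
single = tokenizer.encode_plus("first sentence", "second sentence")
batch = tokenizer.batch_encode_plus([("first sentence", "second sentence"),
                                     ("another", "pair")])
print(sorted(single.keys()) == sorted(batch.keys()))  # same structure either way
```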
-
- 07 Apr, 2020 2 commits
-
-
Sam Shleifer authored
-
Sam Shleifer authored
-
- 06 Apr, 2020 2 commits
-
-
Funtowicz Morgan authored
* Renamed num_added_tokens to num_special_tokens_to_add
* Cherry-Pick: Partially fix space-only input without special tokens added to the output #3091
* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
* Make fast tokenizers unittests work on Windows
* Entirely refactored unittest for tokenizers fast
* Remove ABC class for CommonFastTokenizerTest
* Added embeded_special_tokens tests from allenai @dirkgr
* Make embeded_special_tokens tests from allenai more generic
* Uniformize vocab_size as a property for both Fast and normal tokenizers
* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
* Ensure providing None input raises the same ValueError as the Python tokenizer + tests
* Fix invalid input for assert_padding when testing batch_encode_plus
* Move add_special_tokens from the constructor to a parameter of the tokenize/encode/[batch_]encode_plus methods
* Ensure tokenize() correctly forwards add_special_tokens to Rust
* Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast; avoid stripping on None values
* unittests ensure tokenize() also throws a ValueError if provided None
* Added add_special_tokens unittest for all supported models
* Style
* Make sure TransfoXL tests run only if PyTorch is provided
* Split up tokenizers tests for each model type
* Fix invalid unittest with new tokenizers API
* Filter out Roberta openai detector models from unittests
* Introduce BatchEncoding on the fast tokenizers path. This new structure exposes all the mappings retrieved from Rust. It also keeps the current behavior with model forward.
* Introduce BatchEncoding on the slow tokenizers path (backward compatibility)
* Improve error message on BatchEncoding for the slow path
* Make add_prefix_space True by default on Roberta fast to match Python in the majority of cases
* Style and format
* Added typing on all methods for PretrainedTokenizerFast
* Style and format
* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast
* Style and format
* encode_plus now supports pretokenized inputs
* Remove user warning about add_special_tokens when working on pretokenized inputs
* Always go through the post processor
* Added support for pretokenized input pairs on encode_plus
* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError
* Added pretokenized inputs support on batch_encode_plus
* Update BatchEncoding method names to match Encoding
* Bump setup.py tokenizers dependency to 0.7.0rc1
* Remove unused parameters in BertTokenizerFast
* Make sure Roberta returns token_type_ids for unittests
* Added missing typings
* Update add_tokens prototype to match the tokenizers side and allow AddedToken
* Bumping tokenizers to 0.7.0rc2
* Added documentation for BatchEncoding
* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods
* Added higher-level typing for tokenize / encode_plus / batch_encode_plus
* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust tokenizers
* Fix text-classification pipeline using the wrong tokenizer
* Make pipelines work with BatchEncoding
* Turn off add_special_tokens on tokenize by default
* Remove add_prefix_space from tokenize call in unittest
* Style and quality
* Correct message for batch_encode_plus None input exception
* Fix invalid list comprehension for offset_mapping overriding content every iteration
* TransfoXL uses Strip normalizer
* Bump tokenizers dependency to 0.7.0rc3
* Support AddedTokens for special_tokens and use left stripping on mask for Roberta
* SpecialTokensMixin can use slots for faster access to underlying attributes
* Remove update_special_tokens from fast tokenizers
* Ensure TransfoXL unittests are run only when torch is available
* Style
* Style
* Style 🙏 🙏
* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol
* Remove Roberta warning on __init__
* Move documentation to Google style

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
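A minimal sketch of the BatchEncoding this refactor introduces: backward-compatible dict access for the model, plus the extra mappings surfaced from the Rust tokenizers (checkpoint name assumed; offset mappings require a fast tokenizer):

```python
# Sketch of BatchEncoding on the fast tokenizers path.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode_plus("Hello world!", return_offsets_mapping=True)

print(encoding["input_ids"])       # behaves like the old dict
print(encoding.tokens())           # token view exposed from Rust
print(encoding["offset_mapping"])  # per-token character offsets
```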
-
Patrick von Platen authored
* split beam search and no beam search test
* fix test
* clean generate tests
-
- 03 Apr, 2020 2 commits
-
-
Lysandre Debut authored
* Electra wip
* helpers
* Electra wip
* Electra v1
* ELECTRA may be saved/loaded
* Generator & Discriminator
* Embedding size instead of halving the hidden size
* ELECTRA Tokenizer
* Revert BERT helpers
* ELECTRA Conversion script
* Archive maps
* PyTorch tests
* Start fixing tests
* Tests pass
* Same configuration for both models
* Compatible with base + large
* Simplification + weight tying
* Archives
* Auto + Renaming to standard names
* ELECTRA is uncased
* Tests
* Slight API changes
* Update tests
* wip
* ElectraForTokenClassification
* temp
* Simpler arch + tests. Removed ElectraForPreTraining which will be in a script
* Conversion script
* Auto model
* Update links to S3
* Split ElectraForPreTraining and ElectraForTokenClassification
* Actually test PreTraining model
* Remove num_labels from configuration
* wip
* wip
* From discriminator and generator to electra
* Slight API changes
* Better naming
* TensorFlow ELECTRA tests
* Accurate conversion script
* Added to conversion script
* Fast ELECTRA tokenizer
* Style
* Add ELECTRA to README
* Modeling Pytorch Doc + Real style
* TF Docs
* Docs
* Correct links
* Correct model initialized
* random fixes
* style
* Addressing Patrick's and Sam's comments
* Correct links in docs
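A sketch of loading the discriminator this adds (the hub checkpoint name is an assumption):

```python
# Sketch: ELECTRA discriminator scoring "replaced vs. original" per token.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

input_ids = torch.tensor([tokenizer.encode("the quick brown fox")])
logits = model(input_ids)[0]  # per-token replaced-token detection scores
```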
-
Yohei Tamura authored
* BertJapaneseTokenizer accepts options for mecab
* black
* fix mecab_option to Option[str]
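A sketch only of how a MeCab option string might be passed through; the exact plumbing via `from_pretrained`, the checkpoint name, and the dictionary path are all assumptions:

```python
# Sketch: forward a MeCab option string to the underlying tokenizer
# (parameter name from this change; path is a placeholder).
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained(
    "bert-base-japanese",                  # checkpoint name assumed
    mecab_option="-d /path/to/mecab/dic",  # placeholder dictionary path
)
```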
-
- 01 Apr, 2020 2 commits
-
-
Patrick von Platen authored
* change tf t5 argument naming for TF 2.2
* correct bug in testing
-
Patrick von Platen authored
[T5, Tests] Add extensive hard-coded integration tests and make sure PT and TF give equal results (#3550)
* add some t5 integration tests
* finish summarization and translation integration tests for T5 - results look good
* add tf test
* fix == vs is bug
* fix tf beam search error and make tf t5 tests pass
-
- 31 Mar, 2020 1 commit
-
-
Patrick von Platen authored
* add bad words list
* make style
* add bad_words_tokens
* make style
* better naming
* make style
* fix typo
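A sketch of the resulting `generate()` parameter: each entry of `bad_words_ids` is a token-id sequence the search is forbidden to emit.

```python
# Sketch: ban specific token sequences during generation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

bad_words_ids = [tokenizer.encode("awful"), tokenizer.encode("terrible")]
input_ids = torch.tensor([tokenizer.encode("The weather today is")])
out = model.generate(input_ids, max_length=15, bad_words_ids=bad_words_ids)
```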
-
- 30 Mar, 2020 2 commits
-
-
Sam Shleifer authored
-
Patrick von Platen authored
* make decoder input ids optional for t5 training
* lm_labels should not be shifted in t5
* add tests
* finish shift right functionality for PT T5
* move shift right to correct class
* cleaner code
* replace -100 values with pad token id
* add assert statement
* remove unnecessary for loop
* make style
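A sketch of training after this change: pass only `input_ids` and the labels; decoder inputs are created inside the model by shifting the labels right, with -100 positions replaced by the pad token (the label argument was `lm_labels` in this era and `labels` in later versions):

```python
# Sketch: T5 training without hand-built decoder_input_ids (era API, see note).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = torch.tensor([tokenizer.encode("translate English to German: Hello")])
labels = torch.tensor([tokenizer.encode("Hallo")])
loss = model(input_ids=input_ids, lm_labels=labels)[0]  # shift happens inside
```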
-
- 29 Mar, 2020 1 commit
-
-
Sam Shleifer authored
-
- 27 Mar, 2020 1 commit
-
-
Sam Shleifer authored
-
- 26 Mar, 2020 4 commits
-
-
Sam Shleifer authored
* delete lm_head, skips weight tying
* Fixed s3
-
sakares saengkaew authored
* Add the missing token classification for XLM
* fix styling
* Add XLMForTokenClassification to AutoModelForTokenClassification class
* Fix docstring typo for non-existing class
* Add the missing token classification for XLM
* fix styling
* fix styling
* Add XLMForTokenClassification to AutoModelForTokenClassification class
* Fix docstring typo for non-existing class
* Add missing description for AlbertForTokenClassification
* fix styling
* Add missing docstring for Albert
* Slow tests should be slow

Co-authored-by: Sakares Saengkaew <s.sakares@gmail.com>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
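A sketch of the new head resolving through the auto class (the checkpoint name is an assumption):

```python
# Sketch: XLM token classification via AutoModelForTokenClassification.
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-mlm-en-2048")
model = AutoModelForTokenClassification.from_pretrained("xlm-mlm-en-2048")
```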
-
Patrick von Platen authored
* fix merge conflicts
* add t5 summarization example
* change parameters for t5 summarization
* make style
* add first code snippet for translation
* only add prefixes
* add prefix patterns
* make style
* renaming
* fix conflicts
* remove unused patterns
* solve conflicts
* fix merge conflicts
* remove translation example
* remove summarization example
* make sure tensors are in numpy for float comparison
* re-add t5 config
* fix t5 import config typo
* make style
* remove unused numpy statements
* update docstring
* import translation pipeline
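A sketch of the prefix-driven tasks this example work relies on: T5 selects its task from a text prefix (e.g. "summarize: ", "translate English to German: "), and the pipelines prepend the prefix for you.

```python
# Sketch: task prefixes handled by the translation pipeline.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The house is wonderful.", max_length=40))
```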
-
Patrick von Platen authored
* solve conflicts
* move warnings below
* incorporate changes
* add pad_to_max_length to pipelines
* add bug fix for T5 beam search
* add prefix patterns
* make style
* fix conflicts
* adapt pipelines for task specific parameters
* improve docstring
* remove unused patterns
-
- 24 Mar, 2020 1 commit
-
-
Patrick von Platen authored
* add integration tests for camembert
* use jplu/tf-camembert for the moment
* make style
-
- 20 Mar, 2020 1 commit
-
-
Patrick von Platen authored
* make style
* fix conflicts
-
- 19 Mar, 2020 2 commits
-
-
Patrick von Platen authored
* fix conflicts
* update bart max length test
* correct spelling mistakes
* implemented model specific encode function
* fix merge conflicts
* better naming
* save intermediate state -> need to rethink structure a bit
* leave tf problem as it is for now
* current version
* add layers.pop
* remove ipdb
* make style
* clean return cut decoding
* remove ipdbs
* Fix restoring layers in the decoders that don't exist
* push good intermediate solution for now
* fix conflicts
* always good to refuse to merge conflicts when rebasing
* fix small bug
* improve function calls
* remove unused file
* add correct scope behavior for t5_generate

Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
-
Sam Shleifer authored
-
- 18 Mar, 2020 2 commits
-
-
Lysandre Debut authored
* XLM-R now passes common tests + Integration tests
* Correct mask index
* Model input names
* Style
* Remove text preprocessing
* Unnecessary import
-
Patrick von Platen authored
Adding LM Head to Transfo-XL and first step to fixing problem with Adaptive Embeddings in TransfoXL (#3286)
* first commit
* work in progress
* make language generation task pass
* update to working version for LM
* delete print
* remove dead code
* make style
-
- 17 Mar, 2020 3 commits
-
-
Sam Shleifer authored
* passing
* Undo stupid change
* docs
* undo rename
* delete cruft
* only import if you have torch
* Don't rely on dict ordering
* Fix dict ordering upstream
* docstring link
* docstring link
* remove trailing comma for 3.5 compat
* new name
* delegate kwarging
* Update kwargs
-
Patrick von Platen authored
* change do_sample back
* None is a better default than a boolean
* adapt do_sample to True in test example
* make style
-
Sam Shleifer authored
-
- 16 Mar, 2020 1 commit
-
-
Sam Shleifer authored
* Remove unused kwargs
* don't call forward in tests
-
- 13 Mar, 2020 1 commit
-
-
Sam Shleifer authored
-
- 12 Mar, 2020 1 commit
-
-
Patrick von Platen authored
-