- 09 Oct, 2020 3 commits
-
-
Stas Bekman authored
-
Funtowicz Morgan authored
* Reintroduce clean_text call which was removed by mistake in #4723 Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Added unittest for clean_text parameter on Bert tokenizer. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Better unittest name. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Adapt unittest to use untrained tokenizer. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Code quality + update test Co-authored-by:
Lysandre <lysandre.debut@reseau.eseo.fr>
-
guhur authored
The same type of errors as in https://github.com/huggingface/transformers/pull/4300
-
- 08 Oct, 2020 3 commits
-
-
Lysandre Debut authored
* Fix RobertaForCausalLM docs * Apply review suggestion Co-authored-by:
sgugger <sylvain.gugger@gmail.com> Co-authored-by:
sgugger <sylvain.gugger@gmail.com>
-
Thomas Wolf authored
Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141) * [WIP] SP tokenizers * fixing tests for T5 * WIP tokenizers * serialization * update T5 * WIP T5 tokenization * slow to fast conversion script * Refactoring to move tokenizer implementations inside transformers * Adding gpt - refactoring - quality * WIP adding several tokenizers to the fast world * WIP Roberta - moving implementations * update to dev4, switch file loading to in-memory loading * Updating and fixing * advancing on the tokenizers - updating do_lower_case * style and quality * moving forward with tokenizers conversion and tests * MBart, T5 * dumping the fast version of transformer XL * Adding to autotokenizers + style/quality * update init and space_between_special_tokens * style and quality * bump up tokenizers version * add protobuf * fix pickle Bert JP with Mecab * fix newly added tokenizers * style and quality * fix bert japanese * fix funnel * limit tokenizer warning to one occurrence * clean up file * fix new tokenizers * fast tokenizers deep tests * WIP adding all the special fast tests on the new fast tokenizers * quick fix * adding more fast tokenizers in the fast tests * all tokenizers in fast version tested * Adding BertGenerationFast * bump up setup.py for CI * remove BertGenerationFast (too early) * bump up tokenizers version * Clean old docstrings * Typo * Update following Lysandre comments Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
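For context, a minimal sketch of what these fast tokenizers enable, assuming AutoTokenizer's use_fast flag; the checkpoint name is illustrative:

```python
from transformers import AutoTokenizer

# After this change, SentencePiece-based models such as T5 get a
# Rust-backed "fast" tokenizer; use_fast=True selects it.
tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

encoding = tokenizer("Translate English to German: Hello!", return_tensors="pt")
print(type(tokenizer).__name__)   # e.g. T5TokenizerFast
print(encoding.input_ids.shape)   # torch.Size([1, sequence_length])
```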
-
Piero Molino authored
Replaced torch.load with pickle.load for loading the pretrained vocab of the TransformerXL tokenizer (#6935) * Replaced torch.load with pickle.load for loading the pretrained vocab of TransformerXL * Replaced torch.save with pickle.dump when saving the vocabulary * updating transformer-xl * uploaded on S3 - compatibility * fix tests * style * Address review comments Co-authored-by:
Thomas Wolf <thomwolf@users.noreply.github.com> Co-authored-by:
Lysandre <lysandre.debut@reseau.eseo.fr>
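The intent of the swap, sketched with a toy vocabulary (file names and the vocab structure here are assumptions, not the tokenizer's real format):

```python
import pickle

# Toy stand-in; the real TransformerXL vocab object is richer.
vocab = {"idx2sym": ["hello", "world"], "sym2idx": {"hello": 0, "world": 1}}

# Before: torch.save / torch.load tied vocabulary files to PyTorch.
# After: plain pickle keeps (de)serialization framework-agnostic.
with open("vocab.pkl", "wb") as f:
    pickle.dump(vocab, f)

with open("vocab.pkl", "rb") as f:
    loaded = pickle.load(f)

assert loaded["idx2sym"] == ["hello", "world"]
```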
-
- 07 Oct, 2020 3 commits
-
-
Sam Shleifer authored
Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
Sylvain Gugger authored
* Initial callback proposal * Finish various callbacks * Post-rebase conflicts * Fix tests * Don't use something that's not set * Documentation * Remove unwanted print. * Document all models can work * Add tests + small fixes * Update docs/source/internal/trainer_utils.rst Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Address review comments * Fix TF tests * Real fix this time * This one should work * Fix typo * Really fix typo Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
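A minimal custom callback against the API this commit introduces; the hook body is illustrative:

```python
from transformers import TrainerCallback

class PrintLossCallback(TrainerCallback):
    """Print the loss every time the Trainer logs metrics."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            print(f"step {state.global_step}: loss = {logs['loss']:.4f}")

# Usage sketch: pass it to the Trainer at construction time, e.g.
# Trainer(model=model, args=training_args, callbacks=[PrintLossCallback()])
```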
-
Lysandre Debut authored
-
- 06 Oct, 2020 8 commits
-
-
Gabriele Picco authored
* Fix UnboundLocalError when PaddingStrategy is MAX_LENGTH * Fix UnboundLocalError for TruncationStrategy
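For context, the failing code path corresponds to calls like the following (model name and lengths are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# padding="max_length" maps to PaddingStrategy.MAX_LENGTH internally;
# truncation maps to a TruncationStrategy the same way.
batch = tokenizer(
    ["a short sentence", "another one"],
    padding="max_length",
    truncation=True,
    max_length=32,
)
print(len(batch["input_ids"][0]))  # 32
```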
-
Philipp authored
Resolves: #7613
-
Lysandre authored
-
Lysandre Debut authored
* Add GPT2ForSequenceClassification based on DialogRPT * Better documentation * Code quality
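A usage sketch for the new head, assuming the base gpt2 checkpoint rather than a DialogRPT one:

```python
import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)

# GPT-2 has no pad token by default; the classification head pools the
# last non-pad token, so one must be set for padded batches.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```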
-
Sam Shleifer authored
-
George Mihaila authored
-
Siddharth Jain authored
* Fixing top_k and min_length assertions, and a typo fix * Apply suggestions from code review Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
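For context, the arguments these assertions validate (checkpoint and values are illustrative):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")

# generate() checks sampling arguments such as top_k and min_length up
# front; this commit fixed those assertions and their error messages.
output = model.generate(
    inputs.input_ids, do_sample=True, top_k=50, min_length=10, max_length=30
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```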
-
Lysandre Debut authored
* Configuration * Modeling * Tokenization * Obliterate the trailing spaces * From underlines to long underlines
-
- 05 Oct, 2020 12 commits
-
-
Sylvain Gugger authored
-
Julien Plu authored
* First try * Fix TF utils * Handle authorized unexpected keys when loading weights * Add several more authorized unexpected keys * Apply style * Fix test * Address Patrick's comments. * Update src/transformers/modeling_tf_utils.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/modeling_tf_utils.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Apply style * Make return_dict the default behavior and display a warning message * Revert * Replace wrong keyword * Revert code * Add forgot key * Fix bug in loading PT models from a TF one. * Fix sort * Add a test for custom load weights in BERT * Apply style * Remove unused import Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
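A generic sketch of the filtering idea (the attribute name follows the commit's wording; the pattern list and keys are made up for illustration):

```python
import re

# Patterns of checkpoint keys a model declares as safe to leave unused,
# so loading them does not trigger an "unexpected keys" warning.
authorized_unexpected_keys = [r"mlm___cls", r"nsp___cls"]  # illustrative

checkpoint_keys = ["encoder/layer_0/kernel", "mlm___cls/predictions/bias"]
ignored = [
    key for key in checkpoint_keys
    if any(re.search(pattern, key) for pattern in authorized_unexpected_keys)
]
print(ignored)  # ['mlm___cls/predictions/bias'] -- skipped silently
```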
-
Sylvain Gugger authored
-
Sylvain Gugger authored
-
Sylvain Gugger authored
* PoC on RAG * Format class name/obj name * Better name in message * PoC on one TF model * Add PyTorch and TF dummy objects + script * Treat scikit-learn * Bad copy pastes * Typo
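The dummy-object pattern sketched generically (the class body and message are hypothetical, not the generated file's contents):

```python
# A dummy stands in for a class whose backend isn't installed, so
# `from transformers import RagModel` succeeds but any use fails loudly.
class RagModel:  # illustrative dummy, normally auto-generated by a script
    def __init__(self, *args, **kwargs):
        raise ImportError(
            "RagModel requires PyTorch; install it with `pip install torch`."
        )
```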
-
Malte Pietsch authored
* fix squad tokenization for roberta & co * change to pure type based check * sort imports
-
Sylvain Gugger authored
-
Cola authored
* Add `power` argument for TF PolynomialDecay * Create default optimizer with power * Add argument to training args * Clean code format * Fix black warning * Fix code format
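The training-args flag forwards to the underlying Keras schedule, which already exposes power; a sketch with illustrative values:

```python
import tensorflow as tf

# power=1.0 is the previous (linear) behavior; power=2.0 decays quadratically.
schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=5e-5,
    decay_steps=10_000,
    end_learning_rate=0.0,
    power=2.0,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```
-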
Lysandre Debut authored
-
Forrest Iandola authored
* configuration_squeezebert.py; thin wrapper around bert tokenizer; fix typos; WIP sb model code; WIP modeling_squeezebert.py (next step is to get the multi-layer-output interface working); set up squeezebert to use BertModelOutput when returning results; squeezebert documentation; formatting; allow head mask that is an array of [None, ..., None]; docs; docs cont'd; path to vocab; docs and pointers to cloud files (WIP); line length and indentation; squeezebert model cards; formatting of model cards; untrack modeling_squeezebert_scratchpad.py; update aws paths to vocab and config files; get rid of stub of NSP code, and advise users to pretrain with mlm only; fix rebase issues; redo rebase of modeling_auto.py; fix issues with code formatting; more code format auto-fixes; move squeezebert before bert in tokenization_auto.py and modeling_auto.py because squeezebert inherits from bert; tests for squeezebert modeling and tokenization; fix typo; move squeezebert before bert in modeling_auto.py to fix inheritance problem; disable test_head_masking, since squeezebert doesn't yet implement head masking; fix issues exposed by test_modeling_squeezebert.py and test_tokenization_squeezebert.py; auto-generated code style improvement; fix an issue inherited from modeling_xxx.py (SqueezeBertForMaskedLM.forward() called self.cls(), which does not exist; the intent was to call self.lm_head()); update copyright; resolve failing 'test_hidden_states_output' and remove unused encoder_hidden_states and encoder_attention_mask; docs; add integration test; rename squeezebert-mnli --> squeezebert/squeezebert-mnli; autogenerated formatting tweaks; integrate feedback from patrickvonplaten and sgugger on programming style and documentation strings * tiny change to order of imports
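A quick-start sketch using the checkpoint named in the commit's integration test:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-mnli")
model = AutoModelForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-mnli"
)

inputs = tokenizer(
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(-1))  # predicted MNLI label id
```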
-
Sylvain Gugger authored
* Cleanup documentation for BART, Marian, MBART and Pegasus * Cleanup documentation for BART, Marian, MBART and Pegasus
-
Alexandr authored
* LayoutLM: add exception handling for bbox values To replicate the unhandled error: - In `test_modelling_layoutlm.py` set `range_bbox=1025`, i.e. greater than 1024 - Run `pytest tests/test_modeling_layoutlm.py` The requirement for bbox values to be within the range 0-1000 is documented, but when it is violated the error message does not make the issue clear. * Update src/transformers/modeling_layoutlm.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
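A sketch of the documented constraint the new check enforces (the helper below is illustrative, not the exact code added):

```python
import torch

def validate_bbox(bbox: torch.Tensor) -> None:
    # LayoutLM expects bounding-box coordinates normalized to [0, 1000].
    if bbox.min() < 0 or bbox.max() > 1000:
        raise ValueError(
            f"bbox values must lie in [0, 1000], got range "
            f"[{bbox.min().item()}, {bbox.max().item()}]"
        )

validate_bbox(torch.tensor([[0, 10, 500, 1000]]))    # passes
# validate_bbox(torch.tensor([[0, 10, 500, 1025]]))  # raises ValueError
```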
-
- 04 Oct, 2020 1 commit
-
-
Sylvain Gugger authored
-
- 01 Oct, 2020 9 commits
-
-
Sylvain Gugger authored
* Fix seq2seq example test * Fix bad copy-paste * Also save the state
-
Sylvain Gugger authored
* Trainer should not modify its TrainingArguments * Trainer should not modify its TrainingArguments * Trainer should not modify its TrainingArguments * Add test of resumed training * Fixes * Non multiGPU test * Clean Trainer state * Add more to the state * Documentation * One last test * Make resume training test more complete * Unwanted changes
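The principle, sketched with stand-in classes (names are hypothetical, not the Trainer's real fields):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Args:    # stand-in for TrainingArguments: configuration, immutable
    max_steps: int = -1

@dataclass
class State:   # stand-in for the new TrainerState: mutable, checkpointable
    global_step: int = 0
    max_steps: int = 0

args = Args()
# Derive the effective value into the state instead of overwriting args,
# so a resumed run sees the same arguments the user originally passed in.
state = State(max_steps=args.max_steps if args.max_steps > 0 else 1000)
assert args.max_steps == -1 and state.max_steps == 1000
```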
-
Patrick von Platen authored
-
Patrick von Platen authored
* clean T5 * fix t5 tests * fix index typo * fix tf common test * fix examples * change positional ordering for Bart and FSMT * add signature test * clean docs and add tests * add docs to encoder decoder * clean docs * correct two doc strings * remove sig test for TF Electra & Funnel * fix tf t5 slow tests * fix input_ids to inputs in tf * Update src/transformers/modeling_bart.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/modeling_bart.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * implement lysandre results * make style * fix encoder decoder typo * fix tf slow tests * fix slow tests * renaming * remove unused input Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
Kai Fricke authored
-
Kai Fricke authored
-
Lysandre Debut authored
-
Sam Shleifer authored
* Clean clamp * boom boom * Take some other changes * boom boom * boom boom * boom boom * one chg * fix test * Use finfo * style
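The finfo idiom this lands on, instead of a hard-coded large negative constant (tensors are illustrative):

```python
import torch

scores = torch.randn(2, 4)
mask = torch.tensor([[False, False, True, True],
                     [False, True, True, True]])

# torch.finfo gives the most negative value representable in the dtype;
# a literal like -1e9 is not even representable under float16.
masked = scores.masked_fill(mask, torch.finfo(scores.dtype).min)
print(masked.softmax(dim=-1))  # masked positions get ~0 probability
```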
-
Sam Shleifer authored
* reset model.config * Update src/transformers/trainer.py * use lower case tensor * Just tensor change
-
- 30 Sep, 2020 1 commit
-
-
Sylvain Gugger authored
* Small QOL improvements to TrainingArguments * With the self.
-