Commits · 901507335f6ed59cad6bbbc2b5d8d9eba8a1b4e1 · chenpangpang / transformers

16 Nov, 2020 1 commit

Switch `return_dict` to `True` by default. (#8530) · 1073a2bd

Sylvain Gugger authored Nov 16, 2020

* Use the CI to identify failing tests

* Remove from all examples and tests

* More default switch

* Fixes

* More test fixes

* More fixes

* Last fixes hopefully

* Run on the real suite

* Fix slow tests

1073a2bd

13 Nov, 2020 1 commit

Model templates encoder only (#8509) · 826f0457

Lysandre Debut authored Nov 13, 2020



* Model templates

* TensorFlow

* Remove pooler

* CI

* Tokenizer + Refactoring

* Encoder-Decoder

* Let's go testing

* Encoder-Decoder in TF

* Let's go testing in TF

* Documentation

* README

* Fixes

* Better names

* Style

* Update docs

* Choose to skip either TF or PT

* Code quality fixes

* Add to testing suite

* Update file path

* Cookiecutter path

* Update `transformers` path

* Handle rebasing

* Remove seq2seq from model templates

* Remove s2s config

* Apply Sylvain and Patrick comments

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Last fixes from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

826f0457

04 Nov, 2020 1 commit

Fix validation file loading in scripts (#8298) · cf897246

Sylvain Gugger authored Nov 04, 2020

cf897246

30 Oct, 2020 1 commit

Finalize lm examples (#8188) · cdc48ce9

Sylvain Gugger authored Oct 30, 2020



* Finish the cleanup of the language-modeling examples

* Update main README

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Apply suggestions from code review
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Propagate changes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

cdc48ce9

29 Oct, 2020 2 commits

Add a template for examples and apply it for mlm and plm examples (#8153) · 69117628

Sylvain Gugger authored Oct 29, 2020

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Styling

69117628

Fix doc errors and typos across the board (#8139) · 969859d5

Santiago Castro authored Oct 29, 2020

* Fix doc errors and typos across the board

* Fix a typo

* Fix the CI

* Fix more typos

* Fix CI

* More fixes

* Fix CI

* More fixes

* More fixes

969859d5

28 Oct, 2020 1 commit

Rename add_start_docstrings_to_callable (#8120) · 378142af

Sylvain Gugger authored Oct 28, 2020

378142af

20 Oct, 2020 1 commit

[testing] rename skip targets + docs (#7863) · 3e31e7f9

Stas Bekman authored Oct 20, 2020



* rename skip targets + docs

* fix quotes

* style

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* small improvements

* fix
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

3e31e7f9

18 Oct, 2020 1 commit

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a

Thomas Wolf authored Oct 18, 2020

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* splitting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉



* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update and fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split herbert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix herbert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

ba8c4d0a

12 Oct, 2020 1 commit

Fix typo in all model docs (#7714) · 13c18577

Sylvain Gugger authored Oct 12, 2020

13c18577

05 Oct, 2020 2 commits

Check and update model list in index.rst automatically (#7527) · b2b7fc78

Sylvain Gugger authored Oct 05, 2020

* Check and update model list in index.rst automatically

* Adapt template

b2b7fc78

SqueezeBERT architecture (#7083) · 02ef825b

Forrest Iandola authored Oct 05, 2020

* configuration_squeezebert.py

thin wrapper around bert tokenizer

fix typos

wip sb model code

wip modeling_squeezebert.py. Next step is to get the multi-layer-output interface working

set up squeezebert to use BertModelOutput when returning results.

squeezebert documentation

formatting

allow head mask that is an array of [None, ..., None]

docs

docs cont'd

path to vocab

docs and pointers to cloud files (WIP)

line length and indentation

squeezebert model cards

formatting of model cards

untrack modeling_squeezebert_scratchpad.py

update aws paths to vocab and config files

get rid of stub of NSP code, and advise users to pretrain with mlm only

fix rebase issues

redo rebase of modeling_auto.py

fix issues with code formatting

more code format auto-fixes

move squeezebert before bert in tokenization_auto.py and modeling_auto.py because squeezebert inherits from bert

tests for squeezebert modeling and tokenization

fix typo

move squeezebert before bert in modeling_auto.py to fix inheritance problem

disable test_head_masking, since squeezebert doesn't yet implement head masking

fix issues exposed by the test_modeling_squeezebert.py

fix an issue exposed by test_tokenization_squeezebert.py

fix issue exposed by test_modeling_squeezebert.py

auto generated code style improvement

issue that we inherited from modeling_xxx.py: SqueezeBertForMaskedLM.forward() calls self.cls(), but there is no self.cls, and I think the goal was actually to call self.lm_head()

update copyright

resolve failing 'test_hidden_states_output' and remove unused encoder_hidden_states and encoder_attention_mask

docs

add integration test. rename squeezebert-mnli --> squeezebert/squeezebert-mnli

autogenerated formatting tweaks

integrate feedback from patrickvonplaten and sgugger to programming style and documentation strings

* tiny change to order of imports

02ef825b

24 Sep, 2020 1 commit

Clean RAG docs and template docs (#7348) · 0ccb6f5c

Sylvain Gugger authored Sep 24, 2020

* Clean RAG docs and template docs

* Fix typo

* Better doc

0ccb6f5c

15 Sep, 2020 1 commit

[logging] remove no longer needed verbosity override (#7100) · b0cbcdb0

Stas Bekman authored Sep 15, 2020

b0cbcdb0

10 Sep, 2020 2 commits

Samell fixed in tf template (#7044) · d1691d90

Sylvain Gugger authored Sep 10, 2020

d1691d90

Fix template (#7040) · b482ad47

Lysandre Debut authored Sep 10, 2020

b482ad47

04 Sep, 2020 1 commit

[doc] remove the implied defaults to :obj:`None`, s/True/ :obj:`True/, etc. (#6956) · 48ff6d51

Stas Bekman authored Sep 04, 2020

* remove the implied defaults to :obj:`None`

* fix bug in the original

* replace to :obj:`True`, :obj:`False`

48ff6d51

03 Sep, 2020 1 commit

Template updates (#6914) · 722b5807

Sylvain Gugger authored Sep 03, 2020

722b5807

02 Sep, 2020 1 commit

[doc] typos (#6867) · 7351ef83

Stas Bekman authored Sep 02, 2020

* [doc] typos

fixed typos

* Update README.md

7351ef83

26 Aug, 2020 1 commit

Black 20 release · a75c64d8

Lysandre authored Aug 26, 2020

a75c64d8

24 Aug, 2020 1 commit

Update repo to isort v5 (#6686) · a5737779

Sylvain Gugger authored Aug 24, 2020

* Run new isort

* More changes

* Update CI, CONTRIBUTING and benchmarks

a5737779

13 Aug, 2020 1 commit

cleanup tf unittests: part 2 (#6260) · e983da0e

Stas Bekman authored Aug 13, 2020

* cleanup torch unittests: part 2

* remove trailing comma added by isort, and which breaks flake

* one more comma

* revert odd balls

* part 3: odd cases

* more ["key"] -> .key refactoring

* .numpy() is not needed

* more unnecessary .numpy() removed

* more simplification

e983da0e

05 Aug, 2020 1 commit

Tf model outputs (#6247) · c67d1a02

Sylvain Gugger authored Aug 05, 2020

* TF outputs and test on BERT

* Albert to DistilBert

* All remaining TF models except T5

* Documentation

* One file forgotten

* Add new models and fix issues

* Quality improvements

* Add T5

* A bit of cleanup

* Fix for slow tests

* Style

c67d1a02

04 Aug, 2020 1 commit

cleanup torch unittests (#6196) · 5deed37f

Stas Bekman authored Aug 03, 2020

* improve unit tests

this is a sample of one test according to the request in https://github.com/huggingface/transformers/issues/5973
before I apply it to the rest

* batch 1

* batch 2

* batch 3

* batch 4

* batch 5

* style

* non-tf template

* last deletion of check_loss_output

5deed37f

03 Aug, 2020 1 commit

Empty assert hunt (#6056) · 5a0dac53

Teven authored Aug 03, 2020



* Fixed empty asserts

* black-reformatted stragglers in templates

* More code quality checks

* Update src/transformers/convert_marian_to_pytorch.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/convert_marian_to_pytorch.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* removed unused line as per @sshleifer
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

5a0dac53

31 Jul, 2020 1 commit

Model output test (#6155) · d951c14a

Sylvain Gugger authored Jul 31, 2020

* Use return_dict=True in all tests

* Formatting

d951c14a

30 Jul, 2020 1 commit

Switch from return_tuple to return_dict (#6138) · 91cb9546

Sylvain Gugger authored Jul 30, 2020



* Switch from return_tuple to return_dict

* Fix test

* [WIP] Test TF Flaubert + Add {XLM, Flaubert}{TokenClassification, MultipleC… (#5614)

* Test TF Flaubert + Add {XLM, Flaubert}{TokenClassification, MultipleChoice} models and tests

* AutoModels


Tiny tweaks

* Style

* Final changes before merge

* Re-order for simpler review

* Final fixes

* Addressing @sgugger's comments

* Test MultipleChoice

* Rework TF trainer (#6038)

* Fully rework training/prediction loops

* fix method name

* Fix variable name

* Fix property name

* Fix scope

* Fix method name

* Fix tuple index

* Fix tuple index

* Fix indentation

* Fix variable name

* fix eval before log

* Add drop remainder for test dataset

* Fix step number + fix logging datetime

* fix eval loss value

* use global step instead of step + fix logging at step 0

* Fix logging datetime

* Fix global_step usage

* Fix breaking loop + logging datetime

* Fix step in prediction loop

* Fix step breaking

* Fix train/test loops

* Force TF at least 2.2 for the trainer

* Use assert_cardinality to facilitate the dataset size computation

* Log steps per epoch

* Make tfds compliant with TPU

* Make tfds compliant with TPU

* Use TF dataset enumerate instead of the Python one

* revert previous commit

* Fix data_dir

* Apply style

* rebase on master

* Address Sylvain's comments

* Address Sylvain's and Lysandre's comments

* Trigger CI

* Remove unused import

* Switch from return_tuple to return_dict

* Fix test

* Add recent model
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Plu <plu.julien@gmail.com>

91cb9546

24 Jul, 2020 1 commit

Update the new model template (#6019) · a884b7fa

Sylvain Gugger authored Jul 24, 2020

a884b7fa

22 Jul, 2020 1 commit

[docs] Add integration test example to copy pasta template (#5961) · feeb956a

Sam Shleifer authored Jul 22, 2020

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

feeb956a

28 Jun, 2020 1 commit

save_pretrained: mkdir(exist_ok=True) (#5258) · 45e26125

Sam Shleifer authored Jun 28, 2020

* all save_pretrained methods mkdir if not os.path.exists

45e26125

26 Jun, 2020 1 commit

[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308) · 601d4d69

Thomas Wolf authored Jun 26, 2020

* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples

601d4d69

15 Jun, 2020 1 commit

[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized... · 36434220

Anthony MOI authored Jun 15, 2020


[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)

* Use tokenizers pre-tokenized pipeline

* failing pretokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up required tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleanups - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

36434220

09 Jun, 2020 1 commit

[All models] Extend config.output_attentions with output_attentions function arguments (#4538) · 6e603cb7

Bharat Raghunathan authored Jun 10, 2020



* DOC: Replace instances of ``config.output_attentions`` with function argument ``output_attentions``

* DOC: Apply Black Formatting

* Fix errors where output_attentions was undefined

* Remove output_attentions in classes per review

* Fix regressions on tests having `output_attention`

* Fix further regressions in tests relating to `output_attentions`

Ensure proper propagation of `output_attentions` as a function parameter
to all model subclasses

* Fix more regressions in `test_output_attentions`

* Fix issues with BertEncoder

* Rename related variables to `output_attentions`

* fix pytorch tests

* fix bert and gpt2 tf

* Fix most TF tests for `test_output_attentions`

* Fix linter errors and more TF tests

* fix conflicts

* DOC: Apply Black Formatting

* Fix errors where output_attentions was undefined

* Remove output_attentions in classes per review

* Fix regressions on tests having `output_attention`

* fix conflicts

* fix conflicts

* fix conflicts

* fix conflicts

* fix pytorch tests

* fix conflicts

* fix conflicts

* Fix linter errors and more TF tests

* fix tf tests

* make style

* fix isort

* improve output_attentions

* improve tensorflow
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

6e603cb7

02 Jun, 2020 1 commit

Kill model archive maps (#4636) · d4c2cb40

Julien Chaumond authored Jun 02, 2020

* Kill model archive maps

* Fixup

* Also kill model_archive_map for MaskedBertPreTrainedModel

* Unhook config_archive_map

* Tokenizers: align with model id changes

* make style && make quality

* Fix CI

d4c2cb40

29 Apr, 2020 1 commit

CDN urls (#4030) · 455c6390

Julien Chaumond authored Apr 28, 2020

* [file_utils] use_cdn + documentation

* Move to cdn. urls for weights

* [urls] Hotfix for bert-base-japanese

455c6390

18 Apr, 2020 1 commit

Cleanup fast tokenizers integration (#3706) · 827d6d6e

Thomas Wolf authored Apr 18, 2020



* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length in BatchEncoding

* add alignment methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 and RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately tensorflow doesn't like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>

827d6d6e

16 Apr, 2020 1 commit

[cleanup] factor out get_head_mask, invert_attn_mask, get_exten… (#3806) · dbd04124

Sam Shleifer authored Apr 16, 2020

* Delete some copy pasted code

dbd04124

08 Apr, 2020 1 commit

More doc for model cards (#3698) · a594ee9c

Julien Chaumond authored Apr 08, 2020

see https://github.com/huggingface/transformers/pull/3679#pullrequestreview-389368270

a594ee9c

04 Apr, 2020 1 commit

weigths*weights · 94eb68d7

Julien Chaumond authored Apr 04, 2020

94eb68d7

24 Mar, 2020 1 commit

[examples] Use AutoModels in more examples · a8e3336a

Julien Chaumond authored Mar 23, 2020

a8e3336a