Commits · ba8c4d0ac04acfcdbdeaed954f698d6d5ec3e532 · chenpangpang / transformers

18 Oct, 2020 1 commit

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a

Thomas Wolf authored Oct 18, 2020

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* spliting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉



* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update ad fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split hebert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix hebert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

ba8c4d0a

09 Oct, 2020 1 commit
- [pegasus] Faster tokenizer tests (#7672) · b0f05e0c
  Stas Bekman authored Oct 09, 2020
  
  b0f05e0c
08 Oct, 2020 1 commit

Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove... · 9aeacb58

Thomas Wolf authored Oct 08, 2020


Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141)

* [WIP] SP tokenizers

* fixing tests for T5

* WIP tokenizers

* serialization

* update T5

* WIP T5 tokenization

* slow to fast conversion script

* Refactoring to move tokenzier implementations inside transformers

* Adding gpt - refactoring - quality

* WIP adding several tokenizers to the fast world

* WIP Roberta - moving implementations

* update to dev4 switch file loading to in-memory loading

* Updating and fixing

* advancing on the tokenizers - updating do_lower_case

* style and quality

* moving forward with tokenizers conversion and tests

* MBart, T5

* dumping the fast version of transformer XL

* Adding to autotokenizers + style/quality

* update init and space_between_special_tokens

* style and quality

* bump up tokenizers version

* add protobuf

* fix pickle Bert JP with Mecab

* fix newly added tokenizers

* style and quality

* fix bert japanese

* fix funnel

* limite tokenizer warning to one occurence

* clean up file

* fix new tokenizers

* fast tokenizers deep tests

* WIP adding all the special fast tests on the new fast tokenizers

* quick fix

* adding more fast tokenizers in the fast tests

* all tokenizers in fast version tested

* Adding BertGenerationFast

* bump up setup.py for CI

* remove BertGenerationFast (too early)

* bump up tokenizers version

* Clean old docstrings

* Typo

* Update following Lysandre comments
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>

9aeacb58

11 Sep, 2020 1 commit
- [T5Tokenizer] remove prefix_tokens (#7078) · 0a8c17d5
  Suraj Patil authored Sep 11, 2020
  
  0a8c17d5
10 Sep, 2020 1 commit

Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. (#6594) · 7fd1febf

Patrick von Platen authored Sep 10, 2020

* add conversion script

* improve conversion script

* make style

* add tryout files

* fix

* update

* add causal bert

* better names

* add tokenizer file as well

* finish causal_bert

* fix small bugs

* improve generate

* change naming

* renaming

* renaming

* renaming

* remove leftover files

* clean files

* add fix tokenizer

* finalize

* correct slow test

* update docs

* small fixes

* fix link

* adapt check repo

* apply sams and sylvains recommendations

* fix import

* implement Lysandres recommendations

* fix logger warn

7fd1febf

30 Aug, 2020 1 commit
- [tests] fix typos in inputs (#6818) · 563485bf
  Stas Bekman authored Aug 30, 2020
  
  563485bf
28 Aug, 2020 2 commits

t5 model should make decoder_attention_mask (#6800) · 3cac867f
Sam Shleifer authored Aug 28, 2020

3cac867f

prepare_seq2seq_batch makes labels/ decoder_input_ids made later. (#6654) · 9336086a

Sam Shleifer authored Aug 28, 2020

* broken test

* batch parity

* tests pass

* boom boom

* boom boom

* split out bart tokenizer tests

* fix tests

* boom boom

* Fixed dataset bug

* Fix marian

* Undo extra

* Get marian working

* Fix t5 tok tests

* Test passing

* Cleanup

* better assert msg

* require torch

* Fix mbart tests

* undo extra decoder_attn_mask change

* Fix import

* pegasus tokenizer can ignore src_lang kwargs

* unused kwarg test cov

* boom boom

* add todo for pegasus issue

* cover one word translation edge case

* Cleanup

* doc

9336086a

26 Aug, 2020 1 commit
- Black 20 release · a75c64d8
  Lysandre authored Aug 26, 2020
  
  a75c64d8
25 Aug, 2020 1 commit
- T5Tokenizer adds EOS token if not already added (#5866) · 62449570
  Sam Shleifer authored Aug 25, 2020
```
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
```
  62449570
17 Aug, 2020 1 commit
- [T5Tokenizer] add prepare_seq2seq_batch method (#6122) · 407da12e
  Suraj Patil authored Aug 17, 2020
```
* tests
```
  407da12e
19 May, 2020 1 commit
- [cleanup] test_tokenization_common.py (#4390) · 07dd7c2f
  Sam Shleifer authored May 19, 2020
  
  07dd7c2f
15 Jan, 2020 1 commit
- 💄 super · 83a41d39
  Julien Chaumond authored Jan 15, 2020
  
  83a41d39
06 Jan, 2020 2 commits
- GPU text generation: mMoved the encoded_prompt to correct device · 81d6841b
  alberduris authored Dec 31, 2019
  
  81d6841b
- Moved the encoded_prompts to correct device · dd4df80f
  alberduris authored Dec 31, 2019
  
  dd4df80f
22 Dec, 2019 7 commits
- Remove __future__ imports. · c824d15a
  Aymeric Augustin authored Dec 22, 2019
  
  c824d15a
- Replace CommonTestCases for tokenizers with a mixin. · 00204f2b
  Aymeric Augustin authored Dec 22, 2019
```
This is the same change as for (TF)CommonTestCases for modeling.
```
  00204f2b
- Rename file for consistency. · a3c5883f
  Aymeric Augustin authored Dec 22, 2019
  
  a3c5883f
- Remove unittest.main() in test modules. · 7e98e211
  Aymeric Augustin authored Dec 22, 2019
```
This construct isn't used anymore these days.

Running python tests/test_foo.py puts the tests/ directory on
PYTHONPATH, which isn't representative of how we run tests.

Use python -m unittest tests/test_foo.py instead.
```
  7e98e211
- Switch test files to the standard test_*.py scheme. · ced0a942
  Aymeric Augustin authored Dec 22, 2019
  
  ced0a942
- Move tests outside of library. · 067395d5
  Aymeric Augustin authored Dec 22, 2019
  
  067395d5
- Sort imports with isort. · 158e82e0
  Aymeric Augustin authored Dec 21, 2019
```
This is the result of:

    $ isort --recursive examples templates transformers utils hubconf.py setup.py
```
  158e82e0
21 Dec, 2019 1 commit

Reformat source code with black. · fa84ae26

Aymeric Augustin authored Dec 21, 2019

This is the result of:

    $ black --line-length 119 examples templates transformers utils hubconf.py setup.py

There's a lot of fairly long lines in the project. As a consequence, I'm
picking the longest widely accepted line length, 119 characters.

This is also Thomas' preference, because it allows for explicit variable
names, to make the code easier to understand.

fa84ae26

10 Dec, 2019 1 commit
- updating tests and TF 2.0 model · 8ae1044f
  thomwolf authored Dec 10, 2019
  
  8ae1044f
07 Nov, 2019 1 commit
- fix python 2 sentencepiece tokenization · 8fda532c
  thomwolf authored Nov 07, 2019
  
  8fda532c
06 Nov, 2019 1 commit
- adding tests and updating model · 076a2079
  thomwolf authored Nov 06, 2019
  
  076a2079
04 Nov, 2019 1 commit
- fix tests - flagged as slow all the tests downloading from AWS · b340a910
  thomwolf authored Nov 04, 2019
  
  b340a910
22 Oct, 2019 1 commit
- Remove · 7d709e55
  Lysandre authored Oct 22, 2019
  
  7d709e55
04 Oct, 2019 1 commit
- update encode_plus - add truncation strategies · 6c1d0bc0
  thomwolf authored Oct 04, 2019
  
  6c1d0bc0
26 Sep, 2019 1 commit
- [BIG] pytorch-transformers => transformers · 31c23bd5
  thomwolf authored Sep 26, 2019
  
  31c23bd5
19 Sep, 2019 1 commit
- Sentence -> Sequence. Removed output_mask from the special token addition methods. · bf503158
  LysandreJik authored Sep 19, 2019
  
  bf503158
30 Aug, 2019 1 commit
- added test and debug tokenizer configuration serialization · 69da972a
  thomwolf authored Aug 30, 2019
  
  69da972a
12 Aug, 2019 1 commit
- Added integration tests for sequence builders. · 634a3172
  LysandreJik authored Aug 12, 2019
  
  634a3172
05 Aug, 2019 1 commit
- cleaning up tokenizer tests structure (at last) - last remaining ppb refs · 328afb70
  thomwolf authored Aug 05, 2019
  
  328afb70
15 Jul, 2019 1 commit
- update tokenizer - update squad example for xlnet · 15d8b126
  thomwolf authored Jul 15, 2019
  
  15d8b126
09 Jul, 2019 2 commits
- fix python 2 tests · c079d7dd
  thomwolf authored Jul 09, 2019
  
  c079d7dd
- unified tokenizer api and serialization + tests · b1978698
  thomwolf authored Jul 09, 2019
  
  b1978698
05 Jul, 2019 3 commits
- tokenization abstract class - tests for examples · 36bca545
  thomwolf authored Jul 05, 2019
  
  36bca545
- [BIG] name change · 0bab55d5
  thomwolf authored Jul 05, 2019
  
  0bab55d5
- standardizing tokenizers API and adding tests · e75c3f70
  thomwolf authored Jul 05, 2019
  
  e75c3f70