- 31 May, 2022 1 commit
Patrick von Platen authored
[Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract) (#17457)
* [Json dump] Make json prettier
* correct more tokenizers
* more patterns
* add aggressive test
* the aggressive test was actually useful :-)
* more tests
* Apply suggestions from code review
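For context, the "prettier" output amounts to dumping configs with indentation and a stable key order. A minimal sketch of that convention, assuming `indent=2`, `sort_keys=True`, and a trailing newline (the exact kwargs and the sample keys are assumptions, not the library's schema):

```python
import json

# Illustrative tokenizer config; keys are placeholders, not the real schema.
config = {"model_max_length": 512, "padding_side": "right"}

# Assumed convention: human-readable indentation, deterministic key order,
# UTF-8 characters kept as-is, trailing newline.
with open("tokenizer_config.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(config, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
```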
- 12 May, 2022 1 commit
Sylvain Gugger authored
* Black preview
* Fixup too!
* Fix check copies
* Use the same version as the CI
* Bump black
- 30 Mar, 2022 1 commit
dctelus authored
- 25 Mar, 2022 1 commit
Sylvain Gugger authored
* Big file_utils cleanup
* This one still needs to be treated separately
- 23 Mar, 2022 1 commit
Sylvain Gugger authored
* Split file_utils in several submodules
* Fixes
* Add back more objects
* More fixes
* Who exactly decided to import that from there?
* Second suggestion to code with code review
* Revert wrong move
* Fix imports
* Adapt all imports
* Adapt all imports everywhere
* Revert this import, will fix in a separate commit
- 01 Feb, 2022 1 commit
SaulLu authored
fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319)
* add new test
* update test
* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
* add `tokenizer_file` for the fast only tokenizer
* change global variables layoutxml
* remove `"tokenizer_file"` from DPR tokenizer's Global variables
* remove `tokenizer_file` from herbert slow tokenizer init
* `"tokenizer_file"` from LED tokenizer's Global variables
* remove `tokenizer_file` from mbart slow tokenizer init
* remove `tokenizer_file` from slow tokenizer template
* adapt to versioning
* adapt the `test_tokenizer_mismatch_warning` test
* clean test
* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
* Revert "remove `tokenizer_file` from mbart slow tokenizer init" (reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1)
* Revert "`"tokenizer_file"` from LED tokenizer's Global variables" (reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2)
* Revert "remove `tokenizer_file` from herbert slow tokenizer init" (reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd)
* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables" (reverts commit da0895330bedfafc81ae3073470a9348c669f032)
* set `tokenizer_file` in super `__init__` of mbart
- 03 Jan, 2022 1 commit
Nicolas Patry authored
* Enabling `truncation_side` for Slow and Fast tokenizer.
* Disable failing tests.
* Layout xlm.
* assert -> assertEqual.
Co-authored-by: Niels Rogge <48327001+NielsRogge@users.noreply.github.com>
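The new option mirrors the existing `padding_side`. A usage sketch (the model name is only an example):

```python
from transformers import AutoTokenizer

# `truncation_side="left"` keeps the end of an over-long input
# instead of the start (the default is "right").
tok = AutoTokenizer.from_pretrained("bert-base-uncased", truncation_side="left")
enc = tok("a very long input that will not fit", truncation=True, max_length=8)
```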
- 30 Dec, 2021 1 commit
Nicolas Patry authored
* Enabling `tokenizers` upgrade.
* Moved ugly comment.
* Tokenizers==0.11.1 needs an update to keep borrow checker happy in highly contiguous calls.
* Support both 0.11.1 and 0.11.0
- 27 Dec, 2021 1 commit
Sylvain Gugger authored
* New doc styler
* Fix issue with args at the start
* Code sample fixes
* Style code examples in MDX
* Fix more patterns
* Typo
* Typo
* More patterns
* Do without black for now
* Get more info in error
* Docstring style
* Re-enable check
* Quality
* Fix add_end_docstring decorator
* Fix docstring
- 23 Dec, 2021 1 commit
Alex Hedges authored
- 21 Dec, 2021 1 commit
Sylvain Gugger authored
* Convert docstrings of all configurations and tokenizers
* Processors and fixes
* Last modeling files and fixes to models
* Pipeline modules
* Utils files
* Data submodule
* All the other files
* Style
* Missing examples
* Style again
* Fix copies
* Say bye bye to rst docstrings forever
- 08 Sep, 2021 1 commit
Koichi Yasuoka authored
But does it really work?
- 01 Sep, 2021 1 commit
SaulLu authored
* add test in trainer and test tokenizer saving with trainer
* quality
* reverse trainer changes
* replace test in test_trainer by a test for all the tokenizers
* format
* add can_save_slow_tokenizer attribute to all tokenizers
* fix Herbert
* format
* Change comment in error
* add comments and a new assert
* Update src/transformers/models/albert/tokenization_albert_fast.py
* change ValueError barthez
* change ValueError BigBird
* change ValueError Camembert
* change ValueError Mbart50
* change ValueError Pegasus
* change ValueError ReFormer
* change ValueError T5
* change ValueError RoBERTa
* XLNET fast
* Update tests/test_tokenization_common.py
* change `assert` into `self.assertIn`
* format
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
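A minimal sketch of what the new attribute guards, assuming a sentencepiece-backed fast tokenizer (the class and method bodies here are hypothetical, not the library's code):

```python
import os

class SentencePieceBackedFast:
    """Hypothetical sketch of the `can_save_slow_tokenizer` guard."""

    def __init__(self, vocab_file=None):
        self.vocab_file = vocab_file  # path to the original .model file, if any

    @property
    def can_save_slow_tokenizer(self) -> bool:
        # A slow (sentencepiece) twin can only be re-exported when the
        # original sentencepiece model file is still available on disk.
        return bool(self.vocab_file) and os.path.isfile(self.vocab_file)

    def save_vocabulary(self, save_directory: str):
        if not self.can_save_slow_tokenizer:
            raise ValueError(
                "This fast tokenizer does not have the information needed "
                "to save the vocabulary for a slow tokenizer."
            )
        # ... copy self.vocab_file into save_directory ...
```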
- 30 Aug, 2021 1 commit
arfy slowy authored
* fix: typo spelling grammar
* fix: make fixup
- 13 Aug, 2021 1 commit
Omar Sanseviero authored
- 12 Jul, 2021 1 commit
Lewis Bails authored
Co-authored-by: Lewis Bails <Lewis.Bails@infomedia.dk>
- 09 Jul, 2021 2 commits
Sylvain Gugger authored
* Base test
* More test
* Fix mistake
* Add a docstring change
* Add doc ignore
* Simplify logic for unk token in Unigram tokenizers
* Remove changes from other branch
Nicolas Patry authored
* This will reduce the "Already borrowed" error (original issue: https://github.com/huggingface/tokenizers/issues/537). The error is caused by transformers calling mutable functions on the Rust tokenizers many times. Rust must guarantee that only one agent holds a mutable reference to a piece of memory at a given time (for many reasons which don't need explaining here). Usually the Rust compiler can prove this property at compile time, but that is impossible from Python, so PyO3, the Rust-Python bridge used by `tokenizers`, trades the compile-time guarantee for a dynamic one: if multiple agents attempt multiple mutable borrows at the same time, the runtime raises "Already borrowed". The fix proposed here in transformers is simply to reduce the number of calls that actually need mutable borrows; by reducing them, we reduce the risk of running into the error. The caveat is that we now add a call to read the current configuration of the `_tokenizer`, so the worst case is 2 calls instead of 1, and the best case is 1 call plus a Python comparison of a dict (which should be negligible).
* Adding a test.
* trivial error :(.
* Update tests/test_tokenization_fast.py
* Adding reference to original issues in the tests.
* Update the tests with fast tokenizer.
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
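A sketch of the read-before-write pattern described above, using the `tokenizers` backend's `truncation` property and `enable_truncation` method (the wrapper function itself is hypothetical):

```python
def set_truncation(backend_tokenizer, max_length, stride=0, strategy="longest_first"):
    """Only take a mutable borrow on the Rust tokenizer when the target
    configuration differs from what is already set, which reduces the
    chance of PyO3's "Already borrowed" runtime error."""
    target = {"max_length": max_length, "stride": stride, "strategy": strategy}
    current = backend_tokenizer.truncation  # read-only access: a dict or None
    if current is None or any(current.get(k) != v for k, v in target.items()):
        backend_tokenizer.enable_truncation(**target)  # mutable borrow, now rarer
```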
- 29 Jun, 2021 1 commit
Sylvain Gugger authored
* [WIP] Easily train a new fast tokenizer from a given one
* Fix test
* Roll out to other tokenizers and add tests
* Fix bug with unk id and add emoji to test
* Really use something different in test
* Implement special tokens map
* Map special tokens in the Transformers tokenizers
* Fix test
* Make test more robust
* Fix test for BPE
* More robust map and test (co-authored by SaulLu)
* Test file
* Stronger tests
* Map unk token for Wordpiece and address review comment
* Fix lowercase test and address review comment
* Fix all tests
* Simplify test
* Fix tests for realsies
* Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) (#12420)
* Propose change in tests regarding lower case
* add new test for special tokens types
* put back the test part about decoding
* add feature: the AddedToken is re-build with the different mapped content
* Address review comment: simplify AddedToken building
* Update src/transformers/tokenization_utils_fast.py
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
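The headline feature surfaces as `train_new_from_iterator` on fast tokenizers. A usage sketch (model name, corpus, and vocab size are placeholders):

```python
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("gpt2")

corpus = ["some in-domain text", "more in-domain text"]  # any iterator of str
new_tok = old_tok.train_new_from_iterator(corpus, vocab_size=32000)
new_tok.save_pretrained("my-new-tokenizer")
```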
- 14 Jun, 2021 1 commit
SaulLu authored
* feature for tokenizer without slow/legacy version
* format
* modify common test
* add tests
* add PreTrainedTokenizerFast to AutoTokenizer
* format
* change tokenizer common test in order to be able to run test without a slow version
* update tokenizer fast test in order to use `rust_tokenizer_class` attribute instead of `tokenizer_class`
* add AutoTokenizer test
* replace `if self.tokenizer_class is not None` with `if self.tokenizer_class is None`
* remove obsolete change in comment
* Update src/transformers/tokenization_utils_base.py
* Update src/transformers/tokenization_utils_fast.py
* change `get_main_tokenizer` into `get_tokenizers`
* clarify `get_tokenizers` method
* homogenize with `test_slow_tokenizer` and `test_rust_tokenizer`
* add `test_rust_tokenizer = False` to tokenizers which don't define a fast version
* `test_rust_tokenizer = False` for BertJapaneseTokenizer
* `test_rust_tokenizer = False` for BertJapaneseCharacterTokenizationTest
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
- 26 Apr, 2021 1 commit
LSinev authored
- 15 Apr, 2021 1 commit
Sylvain Gugger authored
* Save fast tokenizers in both formats
* Fix for HerBERT
* Proper fix
* Properly test new behavior
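After this change, a fast tokenizer's `save_pretrained` writes both the unified `tokenizer.json` and the legacy slow-tokenizer vocabulary files. A sketch, assuming the `legacy_format` flag behaves as in later releases (True = slow files only, False = `tokenizer.json` only):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

tok.save_pretrained("saved-tok")                            # both formats
tok.save_pretrained("saved-tok-fast", legacy_format=False)  # tokenizer.json only
```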
- 05 Apr, 2021 1 commit
Lysandre Debut authored
* Documentation about loading a fast tokenizer within Transformers
* Apply suggestions from code review
* style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
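The documentation added here covers wrapping a tokenizer built with the `tokenizers` library so it can be used inside Transformers. Roughly (the file path is a placeholder):

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Wrap an in-memory `tokenizers` object directly...
raw = Tokenizer.from_file("tokenizer.json")
fast = PreTrainedTokenizerFast(tokenizer_object=raw)

# ...or load straight from the serialized file.
fast_from_file = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```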
- 31 Mar, 2021 1 commit
Sylvain Gugger authored
* First third
* Styling and fix mistake
* Quality
* All the rest
* Treat %s and %d
* typo
* Missing )
* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
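The "%s and %d" bullet refers to converting percent-style formatting to f-strings; for illustration (not the codemod itself):

```python
name, num_heads = "bert", 12

old = "Loading %s with %d heads" % (name, num_heads)  # before
new = f"Loading {name} with {num_heads} heads"        # after
assert old == new
```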
- 08 Mar, 2021 1 commit
Mehrad Moradshahi authored
* Fix Marian decoding: the tokenizer's decode and batch_decode now accept a new argument (use_source_tokenizer) which indicates whether the source spm should be used to decode ids. This is useful for Marian models, specifically when decoding source input ids.
* Adapt docstrings
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
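A usage sketch of the new flag (the model name is only an example):

```python
from transformers import MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

src_ids = tok("Hello world").input_ids
# Decode with the *source* sentencepiece model rather than the target one.
text = tok.decode(src_ids, use_source_tokenizer=True)
```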
- 25 Feb, 2021 1 commit
Patrick von Platen authored
[PretrainedFeatureExtractor] + Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer (#10324)
* push to show
* small improvement
* small improvement
* Update src/transformers/feature_extraction_utils.py
* Update src/transformers/feature_extraction_utils.py
* implement base
* add common tests
* make all tests pass for wav2vec2
* make padding work & add more tests
* finalize feature extractor utils
* add call method to feature extraction
* finalize feature processor
* finish tokenizer
* finish general processor design
* finish tests
* typo
* remove bogus file
* finish docstring
* add docs
* finish docs
* small fix
* correct docs
* save intermediate
* load changes
* apply changes
* apply changes to doc
* change tests
* apply Suraj's recommendations
* final changes
* Apply suggestions from code review
* fix typo
* fix import
* correct docstring
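The processor bundles the new feature extractor (audio to `input_values`) with the tokenizer (ids to text) behind one object. A usage sketch (the checkpoint name and the silent audio are placeholders):

```python
import numpy as np
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

audio = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# inputs.input_values is ready to feed to a Wav2Vec2 model.
```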
- 08 Feb, 2021 1 commit
Sylvain Gugger authored
- 04 Feb, 2021 1 commit
Sylvain Gugger authored
- 02 Dec, 2020 1 commit
Nicolas Patry authored
* Warning about too long input for fast tokenizers too. If truncation is not set in tokenizers, but the tokenization is too long for the model (`model_max_length`), we used to trigger a warning that the input would probably fail (which it most likely will). This PR re-enables the warning for fast tokenizers too and uses common code for the trigger to make sure it's consistent across both.
* Checking for pair of inputs too.
* Making the function private and adding its doc.
* Remove formatting ?? in odd place.
* Missed uppercase.
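A sketch of the now-shared trigger (the helper name and message are paraphrased, not the private function's actual signature):

```python
import warnings

def warn_if_too_long(ids, model_max_length, truncation_requested):
    # With no truncation requested, an input longer than the model's limit
    # will almost certainly cause indexing errors downstream, so warn early.
    if not truncation_requested and model_max_length and len(ids) > model_max_length:
        warnings.warn(
            f"Token indices sequence length ({len(ids)}) is longer than the "
            f"specified maximum sequence length for this model "
            f"({model_max_length}). Running this sequence through the model "
            "will result in indexing errors."
        )
```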
- 27 Nov, 2020 1 commit
Giovanni Compagnoni authored
* update configuration_utils.py typing to allow pathlike objects when sensible
* update modeling_utils.py typing to allow pathlike objects when sensible
* black
* update tokenization_utils_base.py typing to allow pathlike objects when sensible
* update tokenization_utils_fast.py typing to allow pathlike objects when sensible
* update configuration_auto.py typing to allow pathlike objects when sensible
* update configuration_auto.py docstring to allow pathlike objects when sensible
* update tokenization_auto.py docstring to allow pathlike objects when sensible
* black
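In practice this means `pathlib.Path` objects now work wherever a string path did. For example (paths are placeholders and must exist on disk):

```python
from pathlib import Path
from transformers import AutoConfig, AutoTokenizer

checkpoint_dir = Path("checkpoints") / "my-model"

config = AutoConfig.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
```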
- 17 Nov, 2020 1 commit
Sylvain Gugger authored
* Remove old deprecated arguments
* Remove needless imports
* Fix tests
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
- 15 Nov, 2020 1 commit
Thomas Wolf authored
[breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency (#8073)
* Fixing roberta for slow-fast tests
* WIP getting equivalence on pipelines
* slow-to-fast equivalence - working on question-answering pipeline
* optional FAISS tests
* Pipeline Q&A
* Move pipeline tests to their own test job again
* update tokenizer to add sequence id methods
* update to tokenizers 0.9.4
* set sentencepiece as optional
* clean up squad
* clean up pipelines to use sequence_ids
* style/quality
* wording
* Switch to use_fast = True by default
* update tests for use_fast at True by default
* fix rag tokenizer test
* removing protobuf from required dependencies
* fix NER test for use_fast = True by default
* fixing example tests (Q&A examples use slow tokenizers for now)
* protobuf in main deps extras["sentencepiece"] and example deps
* fix protobuf install test
* try to fix seq2seq by switching to slow tokenizers for now
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
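The most visible change is the `use_fast = True` default. Roughly (the model name is an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
assert tok.is_fast  # Rust-backed tokenizer is now the default

# Opt back into the slow, pure-Python implementation:
slow_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
```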
- 29 Oct, 2020 1 commit
Santiago Castro authored
* Fix doc errors and typos across the board
* Fix a typo
* Fix the CI
* Fix more typos
* Fix CI
* More fixes
* Fix CI
* More fixes
* More fixes
- 26 Oct, 2020 2 commits
Sylvain Gugger authored
* Important files
* Styling them all
* Revert "Styling them all" (reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e)
* Styling them for realsies
* Fix syntax error
* Fix benchmark_utils
* More fixes
* Fix modeling auto and script
* Remove new line
* Fixes
* More fixes
* Fix more files
* Style
* Add FSMT
* More fixes
* More fixes
* More fixes
* More fixes
* Fixes
* More fixes
* More fixes
* Last fixes
* Make sphinx happy
Thomas Wolf authored
* fixing #8001
* make T5 tokenizer serialization more robust - style
- 23 Oct, 2020 1 commit
Thomas Wolf authored
[tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups (#7970)
* WIP refactoring pipeline tests - switching to fast tokenizers
* fix dialog pipeline and fill-mask
* refactoring pipeline tests backbone
* make large tests slow
* fix tests (tf Bart inactive for now)
* fix doc...
* clean up for merge
* fixing tests - remove bart from summarization until there is TF
* fix quality and RAG
* Add new translation pipeline tests - fix JAX tests
* only slow for dialog
* Fixing the missing TF-BART imports in modeling_tf_auto
* spin out pipeline tests in separate CI job
* adding pipeline test to CI YAML
* add slow pipeline tests
* speed up tf and pt join test to avoid redoing all the standalone pt and tf tests
* Update src/transformers/tokenization_utils_base.py
* Update src/transformers/pipelines.py
* Update src/transformers/pipelines.py
* Update src/transformers/testing_utils.py
* add require_torch and require_tf in is_pt_tf_cross_test
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
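The "sequence id methods" bullet refers to `BatchEncoding.sequence_ids`, which the pipelines lean on. A sketch (the model name is an example; requires a fast tokenizer):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("What is a pipeline?", "A pipeline chains pre- and post-processing.")

# One entry per token: None for special tokens, 0 for the first sequence,
# 1 for the second - useful for masking the question in Q&A pipelines.
print(enc.sequence_ids())
```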
- 18 Oct, 2020 1 commit
Thomas Wolf authored
* splitting fast and slow tokenizers [WIP]
* [WIP] splitting sentencepiece and tokenizers dependencies
* update dummy objects
* add name_or_path to models and tokenizers
* prefix added to file names
* prefix
* styling + quality
* splitting all the tokenizer files - sorting sentencepiece based ones
* update tokenizer version up to 0.9.0
* remove hard dependency on sentencepiece 🎉
* and removed hard dependency on tokenizers 🎉
* update conversion script
* update missing models
* fixing tests
* move test_tokenization_fast to main tokenization tests - fix bugs
* bump up tokenizers
* fix bert_generation
* update and fix several tokenizers
* keep sentencepiece in deps for now
* fix funnel and deberta tests
* fix fsmt
* fix marian tests
* fix layoutlm
* fix squeezebert and gpt2
* fix T5 tokenization
* fix xlnet tests
* style
* fix mbart
* bump up tokenizers to 0.9.2
* fix model tests
* fix tf models
* fix seq2seq examples
* fix tests without sentencepiece
* fix slow => fast conversion without sentencepiece
* update auto and bert generation tests
* fix mbart tests
* fix auto and common test without tokenizers
* fix tests without tokenizers
* clean up tests - lighten up when tokenizers + sentencepiece are both off
* style quality and tests fixing
* add sentencepiece to doc/examples reqs
* leave sentencepiece on for now
* style quality, split herbert and fix pegasus
* WIP Herbert fast
* add sample_text_no_unicode and fix herbert tokenization
* skip FSMT example test for now
* fix style
* fix fsmt in example tests
* update following Lysandre and Sylvain's comments
* Update src/transformers/testing_utils.py
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
- 13 Oct, 2020 1 commit
Tiger authored
- 08 Oct, 2020 1 commit
Thomas Wolf authored
Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141)
* [WIP] SP tokenizers
* fixing tests for T5
* WIP tokenizers
* serialization
* update T5
* WIP T5 tokenization
* slow to fast conversion script
* Refactoring to move tokenizer implementations inside transformers
* Adding gpt - refactoring - quality
* WIP adding several tokenizers to the fast world
* WIP Roberta - moving implementations
* update to dev4 - switch file loading to in-memory loading
* Updating and fixing
* advancing on the tokenizers - updating do_lower_case
* style and quality
* moving forward with tokenizers conversion and tests
* MBart, T5
* dumping the fast version of transformer XL
* Adding to autotokenizers + style/quality
* update init and space_between_special_tokens
* style and quality
* bump up tokenizers version
* add protobuf
* fix pickle Bert JP with Mecab
* fix newly added tokenizers
* style and quality
* fix bert japanese
* fix funnel
* limit tokenizer warning to one occurrence
* clean up file
* fix new tokenizers
* fast tokenizers deep tests
* WIP adding all the special fast tests on the new fast tokenizers
* quick fix
* adding more fast tokenizers in the fast tests
* all tokenizers in fast version tested
* Adding BertGenerationFast
* bump up setup.py for CI
* remove BertGenerationFast (too early)
* bump up tokenizers version
* Clean old docstrings
* Typo
* Update following Lysandre comments
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
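The slow-to-fast conversion script mentioned above survives as `transformers.convert_slow_tokenizer`. A sketch of what the conversion does (the checkpoint name is an example):

```python
from transformers import T5Tokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = T5Tokenizer.from_pretrained("t5-small")

# Builds a `tokenizers.Tokenizer` from the sentencepiece-backed slow one.
fast_backend = convert_slow_tokenizer(slow)
```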
- 22 Sep, 2020 1 commit
Sylvain Gugger authored
* is_pretokenized -> is_split_into_words
* Fix tests
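Usage of the renamed argument (the model name is an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Input already split into words (but not into subword tokens) -
# hence the rename from `is_pretokenized`.
enc = tok(["Hello", "world"], is_split_into_words=True)
```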