Unverified Commit 36434220 authored by Anthony MOI, committed by GitHub

[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)

* Use tokenizers pre-tokenized pipeline

* failing pretokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up required tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler, better backward compat (sketched below)

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various clean-ups - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
parent ebba39e4
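For orientation, a minimal sketch of the unified `__call__` API this commit converges on, covering the new padding/truncation arguments and the pre-tokenized pipeline. It assumes a transformers checkout at this commit with tokenizers >= 0.8.0-rc1; the checkpoint name is illustrative.

```python
# Sketch of the refactored tokenizer API; checkpoint name is illustrative.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Padding and truncation are now explicit, self-describing arguments:
batch = tokenizer(
    ["a short sentence", "a much longer sentence that gets truncated"],
    padding=True,       # False, True, "longest", or "max_length"
    truncation=True,    # False, True, "longest_first", "only_first", "only_second"
    max_length=8,
)
print(batch["input_ids"])

# Pre-tokenized input goes through the dedicated pre-tokenized pipeline:
encoded = tokenizer(
    ["this", "text", "is", "already", "split", "into", "words"],
    is_pretokenized=True,
)
print(encoded["input_ids"])
```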
@@ -51,10 +51,10 @@ class MarianTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         tokenizer = MarianTokenizer.from_pretrained(self.tmpdirname)
         tokenizer.save_pretrained(self.tmpdirname)

-    def get_tokenizer(self, max_len=None, **kwargs) -> MarianTokenizer:
-        return MarianTokenizer.from_pretrained(self.tmpdirname, model_max_length=max_len, **kwargs)
+    def get_tokenizer(self, **kwargs) -> MarianTokenizer:
+        return MarianTokenizer.from_pretrained(self.tmpdirname, **kwargs)

-    def get_input_output_texts(self):
+    def get_input_output_texts(self, tokenizer):
         return (
             "This is a test",
             "This is a test",
...
@@ -64,7 +64,7 @@ class OpenAIGPTTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         with open(self.merges_file, "w") as fp:
             fp.write("\n".join(merges))

-    def get_input_output_texts(self):
+    def get_input_output_texts(self, tokenizer):
         return "lower newer", "lower newer"

     def test_full_tokenizer(self):
...
@@ -18,7 +18,7 @@ import json
 import os
 import unittest

-from transformers.tokenization_roberta import VOCAB_FILES_NAMES, RobertaTokenizer
+from transformers.tokenization_roberta import VOCAB_FILES_NAMES, RobertaTokenizer, RobertaTokenizerFast

 from .test_tokenization_common import TokenizerTesterMixin
 from .utils import slow
@@ -68,7 +68,11 @@ class RobertaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         kwargs.update(self.special_tokens_map)
         return RobertaTokenizer.from_pretrained(self.tmpdirname, **kwargs)

-    def get_input_output_texts(self):
+    def get_rust_tokenizer(self, **kwargs):
+        kwargs.update(self.special_tokens_map)
+        return RobertaTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
+
+    def get_input_output_texts(self, tokenizer):
         input_text = "lower newer"
         output_text = "lower newer"
         return input_text, output_text
...
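The new `get_rust_tokenizer` hook above lets the shared tests exercise the Rust-backed fast tokenizer alongside the Python one. A rough sketch of the kind of parity check this enables; the real assertions live in `TokenizerTesterMixin`, and the checkpoint name here is illustrative.

```python
from transformers import RobertaTokenizer, RobertaTokenizerFast

slow = RobertaTokenizer.from_pretrained("roberta-base")
fast = RobertaTokenizerFast.from_pretrained("roberta-base")

text = "lower newer"
# Both backends should produce identical token ids for the same input.
assert slow.encode(text) == fast.encode(text)
```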
@@ -56,7 +56,7 @@ class TransfoXLTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         kwargs["lower_case"] = True
         return TransfoXLTokenizer.from_pretrained(self.tmpdirname, **kwargs)

-    def get_input_output_texts(self):
+    def get_input_output_texts(self, tokenizer):
         input_text = "<unk> UNwanted , running"
         output_text = "<unk> unwanted, running"
         return input_text, output_text
...
@@ -65,7 +65,7 @@ class XLMTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
         with open(self.merges_file, "w") as fp:
             fp.write("\n".join(merges))

-    def get_input_output_texts(self):
+    def get_input_output_texts(self, tokenizer):
         input_text = "lower newer"
         output_text = "lower newer"
         return input_text, output_text
...
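The signature change repeated across these hunks, `get_input_output_texts(self, tokenizer)`, passes the tokenizer under test into the fixture hook so subclasses can tailor their sample strings per backend. A hypothetical sketch of how the shared mixin presumably consumes it; the real logic lives in `test_tokenization_common.py`.

```python
# Hypothetical consumer of the updated hook inside TokenizerTesterMixin;
# illustrative only, not the actual mixin code.
def test_encode_decode_roundtrip(self):
    tokenizer = self.get_tokenizer()
    # The hook now receives the tokenizer instance, so fixtures can
    # depend on the backend (slow vs. fast) being tested.
    input_text, output_text = self.get_input_output_texts(tokenizer)
    ids = tokenizer.encode(input_text, add_special_tokens=False)
    self.assertEqual(tokenizer.decode(ids), output_text)
```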