Unverified commit 601d4d69 authored by Thomas Wolf, committed by GitHub

[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308)

* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples
parent fd405e9a
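In short, the commit moves every call from `encode_plus` / `batch_encode_plus` to calling the tokenizer directly, and replaces `pad_to_max_length=True` with explicit `padding` and `truncation` arguments. A minimal before/after sketch of the pattern (the checkpoint and example text are illustrative, not taken from the diff):

```python
from transformers import AutoTokenizer

# any pretrained checkpoint works here; bert-base-uncased is just an example
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# old API (still available, but superseded by __call__):
# enc = tokenizer.encode_plus("a question", "some context",
#                             max_length=32, pad_to_max_length=True, return_tensors="pt")

# new API: call the tokenizer directly; padding and truncation are explicit
enc = tokenizer(
    "a question",
    "some context",
    max_length=32,
    padding="max_length",   # pad every sequence to max_length
    truncation=True,        # truncate anything longer than max_length
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # torch.Size([1, 32])

# batched input replaces batch_encode_plus: pass a list of strings
batch = tokenizer(["first sentence", "second, longer sentence"], padding=True, return_tensors="pt")
```

The same pattern repeats in the hunks below.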
@@ -287,8 +287,8 @@ pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf
 sentence_0 = "This research was consistent with his findings."
 sentence_1 = "His findings were compatible with this research."
 sentence_2 = "His findings were not compatible with this research."
-inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
-inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
+inputs_1 = tokenizer(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
+inputs_2 = tokenizer(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
 pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
 pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()
...
@@ -167,7 +167,7 @@ Here's an example showcasing everything so far:
 Indices can be obtained using :class:`transformers.AlbertTokenizer`.
 See :func:`transformers.PreTrainedTokenizer.encode` and
-:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+:func:`transformers.PreTrainedTokenizer.__call__` for details.
 `What are input IDs? <../glossary.html#input-ids>`__
 ```
...
@@ -11,7 +11,7 @@ The base classes ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` impleme
 - adding new tokens to the vocabulary in a way that is independant of the underlying structure (BPE, SentencePiece...),
 - managing special tokens like mask, beginning-of-sentence, etc tokens (adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization)
-``BatchEncoding`` holds the output of the tokenizer's encoding methods (``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behave just like a standard python dictionary and hold the various model inputs computed by these methodes (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by HuggingFace tokenizers library), this class provides in addition several advanced alignement methods which can be used to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).
+``BatchEncoding`` holds the output of the tokenizer's encoding methods (``__call__``, ``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behave just like a standard python dictionary and hold the various model inputs computed by these methodes (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by HuggingFace tokenizers library), this class provides in addition several advanced alignement methods which can be used to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).
 ``PreTrainedTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~
...
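The alignment helpers mentioned in that paragraph only exist on "fast" tokenizers. A minimal sketch of what they look like in use (the checkpoint and sample text are illustrative, not part of this commit):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer("Transformers are great")

print(encoding.tokens())          # subword tokens, including [CLS] and [SEP]
# map a character position in the original string to its token index ...
print(encoding.char_to_token(0))  # index of the token covering the character at position 0
# ... and a token index back to its character span in the original string
print(encoding.token_to_chars(1)) # CharSpan(start, end) of the first real token
```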
@@ -74,7 +74,7 @@ of each other. The process is the following:
 with the weights stored in the checkpoint.
 - Build a sequence from the two sentences, with the correct model-specific separators token type ids
 and attention masks (:func:`~transformers.PreTrainedTokenizer.encode` and
-:func:`~transformers.PreTrainedTokenizer.encode_plus` take care of this)
+:func:`~transformers.PreTrainedTokenizer.__call__` take care of this)
 - Pass this sequence through the model so that it is classified in one of the two available classes: 0
 (not a paraphrase) and 1 (is a paraphrase)
 - Compute the softmax of the result to get probabilities over the classes
@@ -95,8 +95,8 @@ of each other. The process is the following:
 >>> sequence_1 = "Apples are especially bad for your health"
 >>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
->>> paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
->>> not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")
+>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
+>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
 >>> paraphrase_classification_logits = model(**paraphrase)[0]
 >>> not_paraphrase_classification_logits = model(**not_paraphrase)[0]
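The process list above ends with a softmax step that falls outside the hunks shown here. A minimal self-contained sketch of that last step in PyTorch (the logits tensor below is a dummy placeholder standing in for `paraphrase_classification_logits`):

```python
import torch

# dummy logits with shape [batch, num_classes], as returned by the sequence classification model
paraphrase_classification_logits = torch.tensor([[0.1, 2.3]])

# softmax over the class dimension turns the logits into probabilities
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]

# class 1 = "is a paraphrase", class 0 = "not a paraphrase"
print(f"is a paraphrase: {round(paraphrase_results[1] * 100)}%")
print(f"not a paraphrase: {round(paraphrase_results[0] * 100)}%")
```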
@@ -128,8 +128,8 @@ of each other. The process is the following:
 >>> sequence_1 = "Apples are especially bad for your health"
 >>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
->>> paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="tf")
->>> not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="tf")
+>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
+>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
 >>> paraphrase_classification_logits = model(paraphrase)[0]
 >>> not_paraphrase_classification_logits = model(not_paraphrase)[0]
@@ -221,7 +221,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
 ... ]
 >>> for question in questions:
-...     inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
+...     inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
 ...     input_ids = inputs["input_ids"].tolist()[0]
 ...
 ...     text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
...
@@ -263,7 +263,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
 ... ]
 >>> for question in questions:
-...     inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="tf")
+...     inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
 ...     input_ids = inputs["input_ids"].numpy()[0]
 ...
 ...     text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
...
@@ -77,7 +77,7 @@ other than bias and layer normalization terms:
 optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
 Now we can set up a simple dummy training batch using
-:func:`~transformers.PreTrainedTokenizer.batch_encode_plus`. This returns a
+:func:`~transformers.PreTrainedTokenizer.__call__`. This returns a
 :func:`~transformers.BatchEncoding` instance which
 prepares everything we might need to pass to the model.
...
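For context, a minimal sketch of what such a dummy training batch looks like end to end with the new call syntax (the checkpoint, sentences and labels are placeholders, not taken from the surrounding doc):

```python
import torch
from transformers import AdamW, BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=1e-5)

# calling the tokenizer on a list of strings returns a BatchEncoding whose
# input_ids and attention_mask are already padded to the longest sequence
text_batch = ["I love this movie.", "I did not enjoy this movie at all."]
encoding = tokenizer(text_batch, padding=True, truncation=True, return_tensors="pt")

labels = torch.tensor([1, 0])
outputs = model(**encoding, labels=labels)
loss = outputs[0]  # with labels passed, the loss is the first element of the output tuple in transformers 3.x
loss.backward()
optimizer.step()
```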
@@ -298,12 +298,13 @@ def hans_convert_examples_to_features(
         if ex_index % 10000 == 0:
             logger.info("Writing example %d" % (ex_index))
-        inputs = tokenizer.encode_plus(
+        inputs = tokenizer(
             example.text_a,
             example.text_b,
             add_special_tokens=True,
             max_length=max_length,
-            pad_to_max_length=True,
+            padding="max_length",
+            truncation=True,
             return_overflowing_tokens=True,
         )
...
@@ -193,12 +193,12 @@ def make_qa_retriever_model(model_name="google/bert_uncased_L-8_H-512_A-8", from
 def make_qa_retriever_batch(qa_list, tokenizer, max_len=64, device="cuda:0"):
     q_ls = [q for q, a in qa_list]
     a_ls = [a for q, a in qa_list]
-    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
+    q_toks = tokenizer(q_ls, max_length=max_len, padding="max_length", truncation=True)
     q_ids, q_mask = (
         torch.LongTensor(q_toks["input_ids"]).to(device),
         torch.LongTensor(q_toks["attention_mask"]).to(device),
     )
-    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=max_len, pad_to_max_length=True)
+    a_toks = tokenizer(a_ls, max_length=max_len, padding="max_length", truncation=True)
     a_ids, a_mask = (
         torch.LongTensor(a_toks["input_ids"]).to(device),
         torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -375,12 +375,12 @@ def make_qa_s2s_model(model_name="facebook/bart-large", from_file=None, device="
 def make_qa_s2s_batch(qa_list, tokenizer, max_len=64, max_a_len=360, device="cuda:0"):
     q_ls = [q for q, a in qa_list]
     a_ls = [a for q, a in qa_list]
-    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
+    q_toks = tokenizer(q_ls, max_length=max_len, padding="max_length", truncation=True)
     q_ids, q_mask = (
         torch.LongTensor(q_toks["input_ids"]).to(device),
         torch.LongTensor(q_toks["attention_mask"]).to(device),
     )
-    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=min(max_len, max_a_len), pad_to_max_length=True)
+    a_toks = tokenizer(a_ls, max_length=min(max_len, max_a_len), padding="max_length", truncation=True)
     a_ids, a_mask = (
         torch.LongTensor(a_toks["input_ids"]).to(device),
         torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -531,7 +531,7 @@ def qa_s2s_generate(
 # ELI5-trained retrieval model usage
 ###############
 def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=128, device="cuda:0"):
-    a_toks = tokenizer.batch_encode_plus(passages, max_length=max_length, pad_to_max_length=True)
+    a_toks = tokenizer(passages, max_length=max_length, padding="max_length", truncation=True)
     a_ids, a_mask = (
         torch.LongTensor(a_toks["input_ids"]).to(device),
         torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -542,7 +542,7 @@ def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=12
 def embed_questions_for_retrieval(q_ls, tokenizer, qa_embedder, device="cuda:0"):
-    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=128, pad_to_max_length=True)
+    q_toks = tokenizer(q_ls, max_length=128, padding="max_length", truncation=True)
     q_ids, q_mask = (
         torch.LongTensor(q_toks["input_ids"]).to(device),
         torch.LongTensor(q_toks["attention_mask"]).to(device),
...
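As a side note, the manual `torch.LongTensor(...)` wrapping in these helpers can also be dropped with the new API by asking the tokenizer for tensors directly. A hedged sketch of what that could look like (this helper is not part of the commit; the checkpoint comes from the hunk above):

```python
from transformers import AutoTokenizer

def make_padded_batch(texts, tokenizer, max_len=64, device="cpu"):
    # return_tensors="pt" yields padded torch tensors directly, no LongTensor conversion needed
    toks = tokenizer(texts, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    return toks["input_ids"].to(device), toks["attention_mask"].to(device)

tokenizer = AutoTokenizer.from_pretrained("google/bert_uncased_L-8_H-512_A-8")
ids, mask = make_padded_batch(["how do planes fly?", "why is the sky blue?"], tokenizer)
```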
@@ -424,7 +424,7 @@ MASKED_BERT_INPUTS_DOCSTRING = r"""
 Indices can be obtained using :class:`transformers.BertTokenizer`.
 See :func:`transformers.PreTrainedTokenizer.encode` and
-:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+:func:`transformers.PreTrainedTokenizer.__call__` for details.
 `What are input IDs? <../glossary.html#input-ids>`__
 attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
...
@@ -510,12 +510,13 @@ def convert_examples_to_features(
             else:
                 text_b = example.question + " " + ending
-            inputs = tokenizer.encode_plus(
+            inputs = tokenizer(
                 text_a,
                 text_b,
                 add_special_tokens=True,
                 max_length=max_length,
-                pad_to_max_length=True,
+                padding="max_length",
+                truncation=True,
                 return_overflowing_tokens=True,
             )
             if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
...
@@ -45,9 +45,9 @@ def generate_summaries_or_translations(
     for batch in tqdm(list(chunks(examples, batch_size))):
         if "t5" in model_name:
             batch = [model.config.prefix + text for text in batch]
-        batch = tokenizer.batch_encode_plus(
-            batch, max_length=1024, return_tensors="pt", truncation=True, pad_to_max_length=True
-        ).to(device)
+        batch = tokenizer(batch, max_length=1024, return_tensors="pt", truncation=True, padding="max_length").to(
+            device
+        )
         summaries = model.generate(**batch, **gen_kwargs)
         dec = tokenizer.batch_decode(summaries, skip_special_tokens=True, clean_up_tokenization_spaces=False)
         for hypothesis in dec:
...
@@ -41,12 +41,12 @@ def encode_file(
     assert lns, f"found empty file at {data_path}"
     examples = []
     for text in tqdm(lns, desc=f"Tokenizing {data_path.name}"):
-        tokenized = tokenizer.batch_encode_plus(
+        tokenized = tokenizer(
             [text],
             max_length=max_length,
-            pad_to_max_length=pad_to_max_length,
-            add_prefix_space=True,
+            padding="max_length" if pad_to_max_length else None,
             truncation=True,
+            add_prefix_space=True,
             return_tensors=return_tensors,
         )
         assert tokenized.input_ids.shape[1] == max_length
...
@@ -40,7 +40,7 @@ def roberta_similarity_batches(to_predict):
     return similarity_scores
 def similarity_roberta(model, tokenizer, sent_pairs):
-    batch_token = tokenizer.batch_encode_plus(sent_pairs, pad_to_max_length=True, max_length=500)
+    batch_token = tokenizer(sent_pairs, padding='max_length', truncation=True, max_length=500)
     res = model(torch.tensor(batch_token['input_ids']).cuda(), attention_mask=torch.tensor(batch_token["attention_mask"]).cuda())
     return res
...
@@ -60,7 +60,7 @@ tokenizer = BartTokenizer.from_pretrained('a-ware/bart-squadv2')
 model = BartForQuestionAnswering.from_pretrained('a-ware/bart-squadv2')
 question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
 input_ids = encoding['input_ids']
 attention_mask = encoding['attention_mask']
...
@@ -43,7 +43,7 @@ tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
 model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')
 question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
 input_ids = encoding['input_ids']
 attention_mask = encoding['attention_mask']
...
@@ -14,7 +14,7 @@ Therefore, this model does not need a tokenizer. The following function can inst
 import torch
 # Encoding
-def encode(list_of_strings, pad_to_max_length=True, pad_token_id=0):
+def encode(list_of_strings, pad_token_id=0):
     max_length = max([len(string) for string in list_of_strings])
     # create emtpy tensors
...
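The body of this character-level `encode` helper is truncated in the hunk above. A hedged sketch of how such a function can be completed (the byte-value offset and the exact padding scheme below are assumptions for illustration, not read from the model card):

```python
import torch

def encode(list_of_strings, pad_token_id=0):
    max_length = max([len(string) for string in list_of_strings])

    # create empty tensors: attention mask of zeros, input ids filled with the pad token
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # make sure the string is in byte format
        if not isinstance(string, bytes):
            string = str.encode(string)
        # assumption: each byte value is shifted to leave room for special ids such as the pad token
        input_ids[idx, : len(string)] = torch.tensor([byte + 2 for byte in string])
        attention_masks[idx, : len(string)] = 1

    return input_ids, attention_masks
```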
@@ -43,7 +43,7 @@ questions = [
 ]
 for question in questions:
-    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
+    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
     input_ids = inputs["input_ids"].tolist()[0]
     text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
...
@@ -50,7 +50,7 @@ model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/longformer-base-4
 text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
 question = "What has Huggingface done ?"
-encoding = tokenizer.encode_plus(question, text, return_tensors="pt")
+encoding = tokenizer(question, text, return_tensors="pt")
 input_ids = encoding["input_ids"]
 # default is local attention everywhere
...
@@ -55,7 +55,7 @@ model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
 def get_answer(question, context):
     input_text = "question: %s context: %s </s>" % (question, context)
-    features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
+    features = tokenizer([input_text], return_tensors='pt')
     output = model.generate(input_ids=features['input_ids'],
                             attention_mask=features['attention_mask'])
...
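This hunk cuts off before the generated ids are turned back into text. A hedged sketch of the usual final step for such a T5 question-answering helper (the decode call is the standard pattern, not text quoted from this model card):

```python
def get_answer(question, context):
    # format the input the way the fine-tuned T5 model expects it
    input_text = "question: %s context: %s </s>" % (question, context)
    features = tokenizer([input_text], return_tensors='pt')

    output = model.generate(input_ids=features['input_ids'],
                            attention_mask=features['attention_mask'])

    # decode the generated token ids back into a plain-text answer
    return tokenizer.decode(output[0], skip_special_tokens=True)
```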
@@ -55,7 +55,7 @@ class SentimentModel():
     def predict_sentiment(self, texts: List[str])-> List[str]:
         texts = [self.clean_text(text) for text in texts]
         # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
-        input_ids = self.tokenizer.batch_encode_plus(texts,pad_to_max_length=True, add_special_tokens=True)
+        input_ids = self.tokenizer(texts, padding=True, truncation=True, add_special_tokens=True)
         input_ids = torch.tensor(input_ids["input_ids"])
         with torch.no_grad():
...
@@ -50,7 +50,7 @@ tokenizer = BartTokenizer.from_pretrained('valhalla/bart-large-finetuned-squadv1
 model = BartForQuestionAnswering.from_pretrained('valhalla/bart-large-finetuned-squadv1')
 question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
 input_ids = encoding['input_ids']
 attention_mask = encoding['attention_mask']
...
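Several of the question-answering model cards above stop right after building `input_ids` and `attention_mask`. For completeness, a hedged sketch of how the answer span is typically recovered from an extractive QA model's start/end logits (the argmax-based decoding is the common pattern, not text quoted from these cards):

```python
import torch
from transformers import BartForQuestionAnswering, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('valhalla/bart-large-finetuned-squadv1')
model = BartForQuestionAnswering.from_pretrained('valhalla/bart-large-finetuned-squadv1')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors='pt')

# the first two outputs of a *ForQuestionAnswering model are the start and end logits
start_logits, end_logits = model(**encoding)[:2]

# pick the most likely start and end positions and slice the input ids between them
start = torch.argmax(start_logits)
end = torch.argmax(end_logits) + 1
answer = tokenizer.decode(encoding['input_ids'][0][start:end], skip_special_tokens=True)
print(answer)  # expected to be something like "a nice puppet"
```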