"vscode:/vscode.git/clone" did not exist on "932f9db4785e5615ec97161262d766a3a3f7db74"
Unverified commit 83424ade authored by Sylvain Gugger, committed by GitHub

[Doctest] Setup, quicktour and task_summary (#13078)

* Fix doctests for quicktour

* Adapt causal LM example

* Remove space

* Fix until summarization

* End of task summary

* Style

* With last changes in quicktour
parent bfc88509
@@ -65,7 +65,7 @@ make them readable. For instance:
.. code-block::
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9997795224189758}]
[{'label': 'POSITIVE', 'score': 0.99978}]
That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model as a
`batch`, returning a list of dictionaries like this one:
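As a rough sketch of that usage (a hedged illustration reusing the ``classifier`` pipeline from above; the second sentence is invented here, and the exact scores depend on the model version):
.. code-block::
>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.",
...                       "We hope you don't hate it."])
>>> for result in results:
...     print(f"label: {result['label']}, score: {round(result['score'], 4)}")
label: POSITIVE, score: ...
label: NEGATIVE, score: ...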
@@ -195,7 +195,8 @@ sequence:
.. code-block::
>>> print(inputs)
{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
You can pass a list of sentences directly to your tokenizer. If your goal is to send them through your model as a
batch, you probably want to pad them all to the same length, truncate them to the maximum length the model can accept
@@ -264,8 +265,8 @@ objects are described in greater detail :doc:`here <main_classes/output>`. For n
>>> ## TENSORFLOW CODE
>>> print(tf_outputs)
TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0832963 , 4.3364143 ],
[ 0.081807 , -0.04178282]], dtype=float32)>, hidden_states=None, attentions=None)
array([[-4.0833 , 4.3364 ],
[ 0.0818, -0.0418]], dtype=float32)>, hidden_states=None, attentions=None)
Notice how the output object has a ``logits`` attribute. You can use this to access the model's final activations.
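For instance, on the PyTorch side (a tiny illustration using the ``pt_outputs`` computed above), the logits come back with one row per input sentence and one column per label:
.. code-block::
>>> print(pt_outputs.logits.shape)
torch.Size([2, 2])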
@@ -283,7 +284,7 @@ Let's apply the SoftMax activation to get predictions.
>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
>>> ## TENSORFLOW CODE
>>> import tensorflow as tf
>>> tf.nn.softmax(tf_outputs.logits, axis=-1)
>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
We can see we get the numbers from before:
@@ -292,8 +293,8 @@ We can see we get the numbers from before:
>>> ## TENSORFLOW CODE
>>> print(tf_predictions)
tf.Tensor(
[[2.2042994e-04 9.9977952e-01]
[5.3086340e-01 4.6913657e-01]], shape=(2, 2), dtype=float32)
[[2.2043e-04 9.9978e-01]
[5.3086e-01 4.6914e-01]], shape=(2, 2), dtype=float32)
>>> ## PYTORCH CODE
>>> print(pt_predictions)
tensor([[2.2043e-04, 9.9978e-01],
@@ -314,9 +315,9 @@ attribute:
>>> import tensorflow as tf
>>> tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))
>>> print(tf_outputs)
TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.2051287e-04, 6.3326043e-01], dtype=float32)>, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0832963 , 4.3364143 ],
[ 0.081807 , -0.04178282]], dtype=float32)>, hidden_states=None, attentions=None)
TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.2051e-04, 6.3326e-01], dtype=float32)>, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0833 , 4.3364 ],
[ 0.0818, -0.0418]], dtype=float32)>, hidden_states=None, attentions=None)
Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or `tf.keras.Model
<https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual training loop. 🤗
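As a rough sketch of what one step of such a training loop can look like on the PyTorch side (not part of the original example; the optimizer and learning rate below are arbitrary choices, and ``pt_model``/``pt_batch`` are the objects built earlier in the quicktour):
.. code-block::
>>> ## PYTORCH CODE
>>> import torch
>>> from torch.optim import AdamW
>>> optimizer = AdamW(pt_model.parameters(), lr=5e-5)  # arbitrary learning rate for this sketch
>>> outputs = pt_model(**pt_batch, labels=torch.tensor([1, 0]))  # same labels as in the snippet above
>>> outputs.loss.backward()  # backpropagate the classification loss
>>> optimizer.step()
>>> optimizer.zero_grad()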
@@ -107,7 +107,8 @@ each other. The process is the following:
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
>>> # The tokekenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to the sequence, as well as compute the attention masks.
>>> # The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
>>> # the sequence, as well as compute the attention masks.
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
@@ -141,12 +142,13 @@ each other. The process is the following:
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
>>> # The tokekenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to the sequence, as well as compute the attention masks.
>>> # The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
>>> # the sequence, as well as compute the attention masks.
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
>>> paraphrase_classification_logits = model(paraphrase)[0]
>>> not_paraphrase_classification_logits = model(not_paraphrase)[0]
>>> paraphrase_classification_logits = model(paraphrase).logits
>>> not_paraphrase_classification_logits = model(not_paraphrase).logits
>>> paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
>>> not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]
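To turn those probabilities into a readable verdict, one option is to look up the most likely class by name. This is only a sketch: the label order below is an assumption about the MRPC-fine-tuned checkpoint used in this example.
.. code-block::
>>> classes = ["not paraphrase", "is paraphrase"]  # assumed label order for this checkpoint
>>> print(classes[paraphrase_results.argmax()])
is paraphrase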
@@ -197,11 +199,11 @@ positions of the extracted answer in the text.
>>> result = question_answerer(question="What is extractive question answering?", context=context)
>>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'the task of extracting an answer from a text given a question.', score: 0.6226, start: 34, end: 96
Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
>>> result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
>>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'SQuAD dataset,', score: 0.5053, start: 147, end: 161
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
Here is an example of question answering using a model and a tokenizer. The process is the following:
@@ -247,10 +249,10 @@ Here is an example of question answering using a model and a tokenizer. The proc
... answer_start_scores = outputs.start_logits
... answer_end_scores = outputs.end_logits
...
... answer_start = torch.argmax(
... answer_start_scores
... ) # Get the most likely beginning of answer with the argmax of the score
... answer_end = torch.argmax(answer_end_scores) + 1 # Get the most likely end of answer with the argmax of the score
... # Get the most likely beginning of answer with the argmax of the score
... answer_start = torch.argmax(answer_start_scores)
... # Get the most likely end of answer with the argmax of the score
... answer_end = torch.argmax(answer_end_scores) + 1
...
... answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
...
@@ -261,7 +263,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2 . 0 and pytorch
Answer: tensorflow 2. 0 and pytorch
>>> ## TENSORFLOW CODE
>>> from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
>>> import tensorflow as tf
@@ -290,12 +292,11 @@ Here is an example of question answering using a model and a tokenizer. The proc
... answer_start_scores = outputs.start_logits
... answer_end_scores = outputs.end_logits
...
... answer_start = tf.argmax(
... answer_start_scores, axis=1
... ).numpy()[0] # Get the most likely beginning of answer with the argmax of the score
... answer_end = (
... tf.argmax(answer_end_scores, axis=1) + 1
... ).numpy()[0] # Get the most likely end of answer with the argmax of the score
... # Get the most likely beginning of answer with the argmax of the score
... answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
... # Get the most likely end of answer with the argmax of the score
... answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1
...
... answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
...
... print(f"Question: {question}")
@@ -305,7 +306,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2 . 0 and pytorch
Answer: tensorflow 2. 0 and pytorch
@@ -344,31 +345,31 @@ This outputs the sequences with the mask filled, the confidence score, and the t
>>> from pprint import pprint
>>> pprint(unmasker(f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."))
[{'score': 0.1792745739221573,
'sequence': '<s>HuggingFace is creating a tool that the community uses to '
'solve NLP tasks.</s>',
[{'score': 0.179275,
'sequence': 'HuggingFace is creating a tool that the community uses to solve '
'NLP tasks.',
'token': 3944,
'token_str': 'Ġtool'},
{'score': 0.11349421739578247,
'sequence': '<s>HuggingFace is creating a framework that the community uses '
'to solve NLP tasks.</s>',
'token_str': ' tool'},
{'score': 0.113494,
'sequence': 'HuggingFace is creating a framework that the community uses to '
'solve NLP tasks.',
'token': 7208,
'token_str': 'Ġframework'},
{'score': 0.05243554711341858,
'sequence': '<s>HuggingFace is creating a library that the community uses to '
'solve NLP tasks.</s>',
'token_str': ' framework'},
{'score': 0.0524355,
'sequence': 'HuggingFace is creating a library that the community uses to '
'solve NLP tasks.',
'token': 5560,
'token_str': 'Ġlibrary'},
{'score': 0.03493533283472061,
'sequence': '<s>HuggingFace is creating a database that the community uses '
'to solve NLP tasks.</s>',
'token_str': ' library'},
{'score': 0.0349353,
'sequence': 'HuggingFace is creating a database that the community uses to '
'solve NLP tasks.',
'token': 8503,
'token_str': 'Ġdatabase'},
{'score': 0.02860250137746334,
'sequence': '<s>HuggingFace is creating a prototype that the community uses '
'to solve NLP tasks.</s>',
'token_str': ' database'},
{'score': 0.0286025,
'sequence': 'HuggingFace is creating a prototype that the community uses to '
'solve NLP tasks.',
'token': 17715,
'token_str': 'Ġprototype'}]
'token_str': ' prototype'}]
Here is an example of doing masked language modeling using a model and a tokenizer. The process is the following:
@@ -385,43 +386,48 @@ Here is an example of doing masked language modeling using a model and a tokeniz
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelWithLMHead, AutoTokenizer
>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")
>>> model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
>>> sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
>>> sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
... f"versions would help {tokenizer.mask_token} our carbon footprint."
>>> input = tokenizer.encode(sequence, return_tensors="pt")
>>> mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]
>>> inputs = tokenizer(sequence, return_tensors="pt")
>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
>>> token_logits = model(input).logits
>>> token_logits = model(**inputs).logits
>>> mask_token_logits = token_logits[0, mask_token_index, :]
>>> top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
>>> for token in top_5_tokens:
... print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelWithLMHead, AutoTokenizer
>>> from transformers import TFAutoModelForMaskedLM, AutoTokenizer
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = TFAutoModelWithLMHead.from_pretrained("distilbert-base-cased")
>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
>>> sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
>>> sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
... f"versions would help {tokenizer.mask_token} our carbon footprint."
>>> input = tokenizer.encode(sequence, return_tensors="tf")
>>> mask_token_index = tf.where(input == tokenizer.mask_token_id)[0, 1]
>>> inputs = tokenizer(sequence, return_tensors="tf")
>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
>>> token_logits = model(input)[0]
>>> token_logits = model(**inputs).logits
>>> mask_token_logits = token_logits[0, mask_token_index, :]
>>> top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()
This prints five sequences, with the top 5 tokens predicted by the model:
.. code-block::
>>> for token in top_5_tokens:
... print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
......@@ -431,6 +437,9 @@ This prints five sequences, with the top 5 tokens predicted by the model:
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
This prints five sequences, with the top 5 tokens predicted by the model.
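If you also want the probability attached to each candidate, a small extension of the PyTorch snippet (a sketch reusing the ``mask_token_logits`` and ``tokenizer`` defined above) is to apply a softmax before taking the top k:
.. code-block::
>>> ## PYTORCH CODE
>>> import torch
>>> mask_token_probs = torch.nn.functional.softmax(mask_token_logits, dim=-1)
>>> top_5 = torch.topk(mask_token_probs, 5, dim=1)
>>> candidates = [(round(p, 4), tokenizer.decode([t]))  # (probability, token) pairs
...               for p, t in zip(top_5.values[0].tolist(), top_5.indices[0].tolist())]
The tokens in ``candidates`` match the five sequences printed above, now paired with how confident the model is in each of them.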
Causal Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -449,23 +458,27 @@ of tokens.
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, top_k_top_p_filtering
>>> import torch
>>> from torch import nn
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelWithLMHead.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> sequence = f"Hugging Face is based in DUMBO, New York City, and"
>>> input_ids = tokenizer.encode(sequence, return_tensors="pt")
>>> inputs = tokenizer(sequence, return_tensors="pt")
>>> input_ids = inputs["input_ids"]
>>> # get logits of last hidden state
>>> next_token_logits = model(input_ids).logits[:, -1, :]
>>> next_token_logits = model(**inputs).logits[:, -1, :]
>>> # filter
>>> filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
>>> # set seed for reproducibility
>>> set_seed(42)
>>> # sample
>>> probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
>>> next_token = torch.multinomial(probs, num_samples=1)
@@ -474,22 +487,26 @@ of tokens.
>>> resulting_string = tokenizer.decode(generated.tolist()[0])
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelWithLMHead, AutoTokenizer, tf_top_k_top_p_filtering
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer, set_seed, tf_top_k_top_p_filtering
>>> import tensorflow as tf
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = TFAutoModelWithLMHead.from_pretrained("gpt2")
>>> model = TFAutoModelForCausalLM.from_pretrained("gpt2")
>>> sequence = f"Hugging Face is based in DUMBO, New York City, and "
>>> sequence = f"Hugging Face is based in DUMBO, New York City, and"
>>> input_ids = tokenizer.encode(sequence, return_tensors="tf")
>>> inputs = tokenizer(sequence, return_tensors="tf")
>>> input_ids = inputs["input_ids"]
>>> # get logits of last hidden state
>>> next_token_logits = model(input_ids)[0][:, -1, :]
>>> next_token_logits = model(**inputs).logits[:, -1, :]
>>> # filter
>>> filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
>>> # set seed for reproducibility
>>> set_seed(42)
>>> # sample
>>> next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)
@@ -498,12 +515,13 @@ of tokens.
>>> resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
This outputs a (hopefully) coherent next token following the original sequence, which in our case is the word *has*:
This outputs a (hopefully) coherent next token following the original sequence, which in our case is the word
*features*:
.. code-block::
>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and has
Hugging Face is based in DUMBO, New York City, and features
In the next section, we show how :func:`~transformers.generation_utils.GenerationMixin.generate` can be used to
generate multiple tokens up to a specified length instead of one token at a time.
@@ -522,7 +540,8 @@ As a default all models apply *Top-K* sampling when used in pipelines, as config
>>> text_generator = pipeline("text-generation")
>>> print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
@@ -536,9 +555,9 @@ Below is an example of text generation using ``XLNet`` and its tokenizer, which
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelWithLMHead, AutoTokenizer
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
>>> model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")
>>> model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
>>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
@@ -558,12 +577,19 @@ Below is an example of text generation using ``XLNet`` and its tokenizer, which
>>> prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
>>> outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
>>> # set seed for reproducibility
>>> set_seed(42)
>>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]
>>> print(generated)
Today the weather is really nice and I am planning on anning on going to a nearby restaurant on Monday. It is very
cool with the clouds and the wind. A nice afternoon is on the way out of there, when I get my phone in the sun.
Sounds like its a good day in my house, but on that "good"" thing. There is a group of people who'd want to be out
and
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelWithLMHead, AutoTokenizer
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer
>>> model = TFAutoModelWithLMHead.from_pretrained("xlnet-base-cased")
>>> model = TFAutoModelForCausalLM.from_pretrained("xlnet-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
>>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
@@ -582,13 +608,16 @@ Below is an example of text generation using ``XLNet`` and its tokenizer, which
>>> inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="tf")
>>> prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
>>> # set seed for reproducibility
>>> set_seed(42)
>>> outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
>>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]
.. code-block::
>>> print(generated)
Today the weather is really nice and I am planning on anning on taking a nice...... of a great time!<eop>...............
Today the weather is really nice and I am planning on anning on riding a "" over to the coast. It is also an ideal
day to fly to Bali and see more of the local "".<eop> “...The weather is great for travel and traveling as far as
local "".”. When the weather is good, I will ride my "" over to the coast and see more of
Text generation is currently possible with *GPT-2*, *OpenAi-GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in
PyTorch and for most models in Tensorflow as well. As can be seen in the example above *XLNet* and *Transfo-XL* often
@@ -638,21 +667,20 @@ Here are the expected results:
.. code-block::
>>> print(ner_pipe(sequence))
[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
>>> for entity in ner_pipe(sequence):
... print(entity)
{'entity': 'I-ORG', 'score': 0.999579, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.990976, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.998223, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.999488, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.999434, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.999320, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.999379, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.986258, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.951427, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.933659, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.976165, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.991463, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
"DUMBO" and "Manhattan Bridge" have been identified as locations.
@@ -679,26 +707,13 @@ Here is an example of doing named entity recognition, using a model and a tokeni
>>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> label_list = [
... "O", # Outside of a named entity
... "B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
... "I-MISC", # Miscellaneous entity
... "B-PER", # Beginning of a person's name right after another person's name
... "I-PER", # Person's name
... "B-ORG", # Beginning of an organisation right after another organisation
... "I-ORG", # Organisation
... "B-LOC", # Beginning of a location right after another location
... "I-LOC" # Location
... ]
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
... "therefore very close to the Manhattan Bridge."
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
... "close to the Manhattan Bridge."
>>> inputs = tokenizer(sequence, return_tensors="pt")
>>> tokens = inputs.tokens()
>>> # Bit of a hack to get the tokens with the special tokens
>>> tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
>>> inputs = tokenizer.encode(sequence, return_tensors="pt")
>>> outputs = model(inputs).logits
>>> outputs = model(**inputs).logits
>>> predictions = torch.argmax(outputs, dim=2)
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelForTokenClassification, AutoTokenizer
@@ -707,14 +722,13 @@ Here is an example of doing named entity recognition, using a model and a tokeni
>>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
... "close to the Manhattan Bridge."
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
... "therefore very close to the Manhattan Bridge."
>>> # Bit of a hack to get the tokens with the special tokens
>>> tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
>>> inputs = tokenizer.encode(sequence, return_tensors="tf")
>>> inputs = tokenizer(sequence, return_tensors="tf")
>>> tokens = inputs.tokens()
>>> outputs = model(inputs)[0]
>>> outputs = model(**inputs)[0]
>>> predictions = tf.argmax(outputs, axis=2)
@@ -755,8 +769,7 @@ illustrated below:
(',', 'O')
('therefore', 'O')
('very', 'O')
('##c', 'O')
('##lose', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
@@ -764,6 +777,7 @@ illustrated below:
('.', 'O')
('[SEP]', 'O')
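The loop that produces these pairs simply zips the tokens with the predicted label ids; a reconstruction of the PyTorch variant (a sketch, assuming the label names are the ones stored in the model config) looks like this:
.. code-block::
>>> ## PYTORCH CODE
>>> for token, prediction in zip(tokens, predictions[0].numpy()):
...     print((token, model.config.id2label[int(prediction)]))  # prints the (token, label) pairs shown above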
Summarization
-----------------------------------------------------------------------------------------------------------------------
@@ -811,7 +825,9 @@ below. This outputs the following summary:
.. code-block::
>>> print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]
Here is an example of doing summarization using a model and a tokenizer. The process is the following:
@@ -833,8 +849,15 @@ CNN / Daily Mail), it yields very good results.
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
>>> inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
>>> outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
>>> inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
>>> outputs = model.generate(
... inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
... )
>>> print(tokenizer.decode(outputs[0]))
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
@@ -842,13 +865,15 @@ CNN / Daily Mail), it yields very good results.
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
>>> inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
>>> outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
.. code-block::
>>> inputs = tokenizer("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
>>> outputs = model.generate(
... inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
... )
>>> print(tokenizer.decode(outputs[0]))
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them between 1999 and 2002.</s>
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.
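If you do not want the ``<pad>`` and ``</s>`` markers in the decoded text, ``tokenizer.decode`` also accepts a ``skip_special_tokens`` flag; applied to the same ``outputs`` as above it should give the bare summary:
.. code-block::
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.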
Translation
@@ -888,25 +913,32 @@ Here is an example of doing translation using a model and a tokenizer. The proce
.. code-block::
>>> ## PYTORCH CODE
>>> from transformers import AutoModelWithLMHead, AutoTokenizer
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
>>> model = AutoModelWithLMHead.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
>>> outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
>>> inputs = tokenizer(
... "translate English to German: Hugging Face is a technology company based in New York and Paris",
... return_tensors="pt"
... )
>>> outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
>>> print(tokenizer.decode(outputs[0]))
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
>>> ## TENSORFLOW CODE
>>> from transformers import TFAutoModelWithLMHead, AutoTokenizer
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
>>> model = TFAutoModelWithLMHead.from_pretrained("t5-base")
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="tf")
>>> outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
As with the pipeline example, we get the same translation:
.. code-block::
>>> inputs = tokenizer(
... "translate English to German: Hugging Face is a technology company based in New York and Paris",
... return_tensors="tf"
... )
>>> outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
>>> print(tokenizer.decode(outputs[0]))
Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
We get the same translation as with the pipeline example.
@@ -49,3 +49,6 @@ use_parentheses = True
[flake8]
ignore = E203, E501, E741, W503, W605
max-line-length = 119
[tool:pytest]
doctest_optionflags=NUMBER NORMALIZE_WHITESPACE ELLIPSIS
\ No newline at end of file
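These three flags are what make the shortened expected values in this change pass: ``NUMBER`` lets a floating-point value in the expected output match as long as the digits actually written agree, ``NORMALIZE_WHITESPACE`` ignores differences in line wrapping, and ``ELLIPSIS`` lets ``...`` stand in for arbitrary text. A toy illustration of the ``NUMBER`` behaviour, only meaningful when the doctest is run through pytest with this configuration:
.. code-block::
>>> 1 / 3  # with NUMBER enabled, writing fewer digits than Python prints is enough to match
0.333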