Unverified Commit b5e2b183 authored by Sylvain Gugger, committed by GitHub

Doc styler examples (#14953)

* Fix bad examples

* Add black formatting to style_doc

* Use first nonempty line

* Put it at the right place

* Don't add spaces to empty lines

* Better templates

* Deal with triple quotes in docstrings

* Result of style_doc

* Enable mdx treatment and fix code examples in MDXs

* Result of doc styler on doc source files

* Last fixes

* Break copy from
parent e13f72fb
@@ -267,7 +267,7 @@ single forward pass using a dummy integer vector of input IDs as an input. Such
pseudocode):

```python
model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/")
input_ids = [0, 4, 5, 2, 3, 7, 9]  # vector of input ids
original_output = model.predict(input_ids)
```

@@ -476,6 +476,7 @@ following command should work:

```python
from transformers import BrandNewBertModel, BrandNewBertConfig

model = BrandNewBertModel(BrandNewBertConfig())
```
@@ -502,12 +503,13 @@ PyTorch, called `SimpleModel` as follows:

```python
from torch import nn


class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(10, 10)
        self.intermediate = nn.Linear(10, 10)
        self.layer_norm = nn.LayerNorm(10)
```

Now we can create an instance of this model definition which will fill all weights: `dense`, `intermediate`,
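For illustration, a minimal hedged sketch (reusing the `SimpleModel` definition above) of creating such an instance and inspecting one of its randomly initialized weights:

```python
# Assumes the SimpleModel class defined in the snippet above.
model = SimpleModel()
# Every parameter is randomly initialized at this point.
print(model.dense.weight)
```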
@@ -565,7 +567,7 @@ In the conversion script, you should fill those randomly initialized weights with
corresponding layer in the checkpoint. *E.g.*

```python
# retrieve matching layer weights, e.g. by
# recursive algorithm
layer_name = "dense"
pretrained_weight = array_of_dense_layer
```

@@ -622,7 +624,7 @@ pass of the model using the original repository. Now you should write an analogous
implementation instead of the original one. It should look as follows:

```python
model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder")
input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]
output = model(input_ids).last_hidden_states
```
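The point of this step is to check that the new script reproduces the original repository's output. A hedged sketch of such a comparison (assuming `original_output` from the earlier pseudocode and that both outputs can be converted to tensors of the same shape; the tolerance value is an assumption):

```python
import torch

# Hypothetical check: compare the 🤗 Transformers output to the original implementation's output.
assert torch.allclose(torch.tensor(original_output), output, atol=1e-3), "Outputs diverge!"
```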
@@ -668,7 +670,7 @@ fully comply with the required design. To make sure, the implementation is fully
common tests should pass. The Cookiecutter should have automatically added a test file for your model, probably under
the same `tests/test_modeling_brand_new_bert.py`. Run this test file to verify that all common tests pass:

```bash
pytest tests/test_modeling_brand_new_bert.py
```

@@ -714,7 +716,7 @@ that inputs a string and returns the `input_ids`. It could look similar to this:

```python
input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."
model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/")
input_ids = model.tokenize(input_str)
```

@@ -725,9 +727,10 @@ created. It should look similar to this:

```python
from transformers import BrandNewBertTokenizer

input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."
tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/")
input_ids = tokenizer(input_str).input_ids
```
......
@@ -26,6 +26,7 @@ Start by inheriting the base class `Pipeline` with the 4 methods needed to implement
```python
from transformers import Pipeline


class MyPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}

@@ -34,7 +35,7 @@ class MyPipeline(Pipeline):
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, maybe_arg=2):
        model_input = Tensor(inputs["input_ids"])
        return {"model_input": model_input}

    def _forward(self, model_inputs):

@@ -90,6 +91,7 @@ def postprocess(self, model_outputs, top_k=5):
        # Add logic to handle top_k
        return best_class

    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
......
@@ -37,11 +37,12 @@ The benchmark classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] expect an
>>> args = PyTorchBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> benchmark = PyTorchBenchmark(args)
===PT-TF-SPLIT===
>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments
>>> args = TensorFlowBenchmarkArguments(
...     models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]
... )
>>> benchmark = TensorFlowBenchmark(args)
```
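For context, a short usage sketch of how such a benchmark is typically executed once the arguments are configured (hedged; it reuses the `benchmark` object from the snippet above):

```python
# Runs the configured speed and memory measurements and prints a summary table.
results = benchmark.run()
print(results)
```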
@@ -174,7 +175,9 @@ configurations must be inserted with the benchmark args as follows.

```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig
>>> args = PyTorchBenchmarkArguments(
...     models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]
... )
>>> config_base = BertConfig()
>>> config_384_hid = BertConfig(hidden_size=384)
>>> config_6_lay = BertConfig(num_hidden_layers=6)

@@ -244,7 +247,9 @@ bert-6-lay 8 512 1359
===PT-TF-SPLIT===
>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments, BertConfig
>>> args = TensorFlowBenchmarkArguments(
...     models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]
... )
>>> config_base = BertConfig()
>>> config_384_hid = BertConfig(hidden_size=384)
>>> config_6_lay = BertConfig(num_hidden_layers=6)
......
@@ -54,6 +54,7 @@ The 🤗 Datasets library makes it simple to load a dataset:

```python
from datasets import load_dataset

imdb = load_dataset("imdb")
```

@@ -61,8 +62,9 @@ This loads a `DatasetDict` object which you can index into to view an example:

```python
imdb["train"][0]
{
    "label": 1,
    "text": "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as \"Teachers\". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is \"Teachers\". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!",
}
```
@@ -74,6 +76,7 @@ model was trained with to ensure appropriately tokenized words. Load the DistilBERT

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```
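Later snippets in this commit reference a `tokenized_imdb` dataset; a hedged sketch of the preprocessing step that would produce it (the function name and column handling here are assumptions, not part of the diff):

```python
def preprocess_function(examples):
    # Truncate reviews that are longer than the model's maximum input length.
    return tokenizer(examples["text"], truncation=True)


tokenized_imdb = imdb.map(preprocess_function, batched=True)
```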
@@ -99,6 +102,7 @@ batch. This is known as **dynamic padding**. You can do this with the `DataCollatorWithPadding`

```python
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

@@ -108,6 +112,7 @@ Now load your model with the [`AutoModelForSequenceClassification`] class along with the number of expected labels:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```
@@ -121,7 +126,7 @@ At this point, only three steps remain:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
@@ -150,6 +155,7 @@ Make sure you set `return_tensors="tf"` to return `tf.Tensor` outputs instead of

```python
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")
```

@@ -158,14 +164,14 @@ Next, convert your datasets to the `tf.data.Dataset` format with `to_tf_dataset`

```python
tf_train_dataset = tokenized_imdb["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_dataset = tokenized_imdb["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
@@ -182,17 +188,14 @@ batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
```

Load your model with the [`TFAutoModelForSequenceClassification`] class along with the number of expected labels:

```python
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```
@@ -200,6 +203,7 @@ Compile the model:

```python
import tensorflow as tf

model.compile(optimizer=optimizer)
```
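To round out this TensorFlow path, a hedged sketch of the fitting step the tutorial is building toward (it reuses the `tf_train_dataset`, `tf_validation_dataset`, and `num_epochs` names defined in the snippets above):

```python
# Fine-tune on the tokenized IMDb dataset while evaluating on the validation split.
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=num_epochs)
```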
@@ -234,14 +238,15 @@ or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/no
Load the WNUT 17 dataset from the 🤗 Datasets library:

```python
>>> from datasets import load_dataset

>>> wnut = load_dataset("wnut_17")
```

A quick look at the dataset shows the labels associated with each word in the sentence:

```python
>>> wnut["train"][0]
{'id': '0',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']

@@ -251,21 +256,22 @@ wnut["train"][0]
View the specific NER tags by:

```python
>>> label_list = wnut["train"].features[f"ner_tags"].feature.names
>>> label_list
[
    "O",
    "B-corporation",
    "I-corporation",
    "B-creative-work",
    "I-creative-work",
    "B-group",
    "I-group",
    "B-location",
    "I-location",
    "B-person",
    "I-person",
    "B-product",
    "I-product",
]
```

@@ -282,6 +288,7 @@ Now you need to tokenize the text. Load the DistilBERT tokenizer with an [`AutoTokenizer`]:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

@@ -289,9 +296,9 @@ Since the input has already been split into words, set `is_split_into_words=True`
subwords:

```python
>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
>>> tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
```
@@ -314,10 +321,10 @@ def tokenize_and_align_labels(examples):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
        labels.append(label_ids)

@@ -336,6 +343,7 @@ Finally, pad your text and labels, so they are a uniform length:

```python
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
```

@@ -345,6 +353,7 @@ Load your model with the [`AutoModelForTokenClassification`] class along with the number of expected labels:

```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
```

@@ -352,7 +361,7 @@ Gather your training arguments in [`TrainingArguments`]:

```python
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,

@@ -387,6 +396,7 @@ Batch your examples together and pad your text and labels, so they are a uniform length:

```python
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")
```

@@ -412,6 +422,7 @@ Load the model with the [`TFAutoModelForTokenClassification`] class along with the number of expected labels:

```python
from transformers import TFAutoModelForTokenClassification

model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
```

@@ -435,6 +446,7 @@ Compile the model:

```python
import tensorflow as tf

model.compile(optimizer=optimizer)
```
@@ -469,13 +481,14 @@ Load the SQuAD dataset from the 🤗 Datasets library:

```python
from datasets import load_dataset

squad = load_dataset("squad")
```

Take a look at an example from the dataset:

```python
>>> squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',

@@ -490,6 +503,7 @@ Load the DistilBERT tokenizer with an [`AutoTokenizer`]:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

@@ -567,6 +581,7 @@ Batch the processed examples together:

```python
from transformers import default_data_collator

data_collator = default_data_collator
```

@@ -576,6 +591,7 @@ Load your model with the [`AutoModelForQuestionAnswering`] class:

```python
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```

@@ -583,7 +599,7 @@ Gather your training arguments in [`TrainingArguments`]:

```python
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,

@@ -618,6 +634,7 @@ Batch the processed examples together with a TensorFlow default data collator:

```python
from transformers.data.data_collator import tf_default_collator

data_collator = tf_default_collator
```

@@ -650,8 +667,8 @@ batch_size = 16
num_epochs = 2
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=total_train_steps,
)
```

@@ -660,6 +677,7 @@ Load your model with the [`TFAutoModelForQuestionAnswering`] class:

```python
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```

@@ -667,6 +685,7 @@ Compile the model:

```python
import tensorflow as tf

model.compile(optimizer=optimizer)
```
......
@@ -49,6 +49,7 @@ If you're using your own training loop or another Trainer you can accomplish the

```python
from .debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model)
```

@@ -200,13 +201,16 @@ def _forward(self, hidden_states):
    hidden_states = self.wo(hidden_states)
    return hidden_states


import torch


def forward(self, hidden_states):
    if torch.is_autocast_enabled():
        with torch.cuda.amp.autocast(enabled=False):
            return self._forward(hidden_states)
    else:
        return self._forward(hidden_states)
```

Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look, you may

@@ -216,8 +220,10 @@ want to analyse the intermediary stages of any specific `forward` function as well

```python
from debug_utils import detect_overflow


class T5LayerFF(nn.Module):
    [...]

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        detect_overflow(forwarded_states, "after layer_norm")

@@ -237,6 +243,7 @@ its default, e.g.:

```python
from .debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
```

@@ -248,7 +255,7 @@ Let's say you want to watch the absolute min and max values for all the ingredients
batch, and only do that for batches 1 and 3. Then you instantiate this class as:

```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
```

And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does.

@@ -295,5 +302,5 @@ numbers started to diverge.
You can also specify the batch number after which to stop the training, with:

```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
```
@@ -58,6 +58,7 @@ tokenizer, which is a [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenizer

```python
>>> from transformers import BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> sequence = "A Titan RTX has 24GB of VRAM"

@@ -126,6 +127,7 @@ For example, consider these two sequences:

```python
>>> from transformers import BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> sequence_a = "This is a short sequence."

@@ -190,6 +192,7 @@ arguments (and not a list, like before) like this:

```python
>>> from transformers import BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
>>> sequence_a = "HuggingFace is based in NYC"
>>> sequence_b = "Where is HuggingFace based?"

@@ -212,7 +215,7 @@ the two types of sequence in the model.
The tokenizer returns this mask as the "token_type_ids" entry:

```python
>>> encoded_dict["token_type_ids"]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```
......
@@ -32,8 +32,8 @@ Here's an example:

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
......
@@ -79,12 +79,13 @@ class MyCallback(TrainerCallback):
    def on_train_begin(self, args, state, control, **kwargs):
        print("Starting training")


trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[MyCallback],  # We can either pass the callback class this way or an instance of it (MyCallback())
)
```
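As a follow-up usage note (a sketch, not part of the diff), the same callback can also be registered on an already constructed trainer; `Trainer.add_callback` accepts either the class or an instance:

```python
# Equivalent registration after the Trainer has been created.
trainer.add_callback(MyCallback())
```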
......
@@ -295,11 +295,12 @@ If you're using only 1 GPU, here is how you'd have to adjust your training code
# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook
import os

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# Now proceed as normal, plus pass the deepspeed config file
training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
@@ -316,7 +317,7 @@ at the beginning of this section.
If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated
cell with:

```python no-style
%%bash
cat <<'EOT' > ds_config_zero3.json
{

@@ -382,14 +383,14 @@ EOT
If the training script is in a normal file and not in the notebook cells, you can launch `deepspeed` normally via
shell from a cell. For example, to use `run_translation.py` you would launch it with:

```python no-style
!git clone https://github.com/huggingface/transformers
!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...
```

or with `%%bash` magic, where you can write a multi-line code for the shell program to run:

```python no-style
%%bash
git clone https://github.com/huggingface/transformers

@@ -512,7 +513,7 @@ TrainingArguments(..., deepspeed="/path/to/ds_config.json")
or:

```python
ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
TrainingArguments(..., deepspeed=ds_config_dict)
```
@@ -1430,6 +1431,7 @@ If you have saved at least one checkpoint, and you want to use the latest one, you

```python
from transformers.trainer_utils import get_last_checkpoint
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
```

@@ -1439,6 +1441,7 @@ checkpoint), then you can finish the training by first saving the final model explicitly

```python
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)

@@ -1461,7 +1464,8 @@ these yourself as is shown in the following example:

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
model = model.cpu()
model.load_state_dict(state_dict)
```
@@ -1529,9 +1533,10 @@ context manager (which is also a function decorator), like so:

```python
from transformers import T5ForConditionalGeneration, T5Config
import deepspeed

with deepspeed.zero.Init():
    config = T5Config.from_pretrained("t5-small")
    model = T5ForConditionalGeneration(config)
```

As you can see this gives you a randomly initialized model.

@@ -1544,6 +1549,7 @@ section. Thus you must create the [`TrainingArguments`] object **before** calling

```python
from transformers import AutoModel, Trainer, TrainingArguments

training_args = TrainingArguments(..., deepspeed=ds_config)
model = AutoModel.from_pretrained("t5-small")
trainer = Trainer(model=model, args=training_args, ...)

@@ -1574,7 +1580,7 @@ limitations.
Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like:

```python
tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True)
```

stress on `tensor([1.])`, or if you get an error where it says the parameter is of size `1`, instead of some much
@@ -1715,9 +1721,9 @@ For example for a pretrained model:
from transformers.deepspeed import HfDeepSpeedConfig
from transformers import AutoModel, deepspeed

ds_config = {...}  # deepspeed config object or path to the file
# must run before instantiating the model
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
model = AutoModel.from_pretrained("gpt2")
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
```

@@ -1728,9 +1734,9 @@ or for non-pretrained model:
from transformers.deepspeed import HfDeepSpeedConfig
from transformers import AutoModel, AutoConfig, deepspeed

ds_config = {...}  # deepspeed config object or path to the file
# must run before instantiating the model
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
config = AutoConfig.from_pretrained("gpt2")
model = AutoModel.from_config(config)
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
......
@@ -21,6 +21,7 @@ to the INFO level.

```python
import transformers

transformers.logging.set_verbosity_info()
```
......
@@ -22,8 +22,8 @@ Let's see how this looks on an example:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
......
@@ -101,6 +101,7 @@ from transformers import pipeline
pipe = pipeline("text-classification")


def data():
    while True:
        # This could come from a dataset, a database, a queue or HTTP request

@@ -110,6 +111,7 @@ def data():
        # does the preprocessing while the main runs the big inference
        yield "This is a test"


for out in pipe(data()):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}

@@ -125,10 +127,10 @@ All pipelines can use batching. This will work
whenever the pipeline uses its streaming ability (so when passing lists or `Dataset` or `generator`).

```python
from transformers import pipeline
from transformers.pipelines.base import KeyDataset
import datasets
import tqdm

dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
pipe = pipeline("text-classification", device=0)
@@ -149,28 +151,28 @@ Example where it's mostly a speedup:
</Tip>

```python
from transformers import pipeline
from torch.utils.data import Dataset
import tqdm

pipe = pipeline("text-classification", device=0)


class MyDataset(Dataset):
    def __len__(self):
        return 5000

    def __getitem__(self, i):
        return "This is a test"


dataset = MyDataset()

for batch_size in [1, 8, 64, 256]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm.tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
        pass
```
@@ -194,15 +196,15 @@ Streaming batch_size=256
Example where it's mostly a slowdown:

```python
class MyDataset(Dataset):
    def __len__(self):
        return 5000

    def __getitem__(self, i):
        if i % 64 == 0:
            n = 100
        else:
            n = 1
        return "This is a test" * n
```

@@ -298,10 +300,11 @@ If you want to try simply you can:

```python
class MyPipeline(TextClassificationPipeline):
    def postprocess():
        # Your code goes here
        scores = scores * 100
        # And here


my_pipeline = MyPipeline(model=model, tokenizer=tokenizer, ...)
# or if you use *pipeline* function, then:
......
@@ -122,7 +122,7 @@ examples = processor.get_dev_examples(squad_v2_data_dir)
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,

@@ -139,7 +139,7 @@ Using *tensorflow_datasets* is as easy as using a data file:
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
......
@@ -53,14 +53,16 @@ Here is an example of how to customize [`Trainer`] using a custom loss function
from torch import nn
from transformers import Trainer


class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss_fct = nn.BCEWithLogitsLoss()
        loss = loss_fct(
            logits.view(-1, self.model.config.num_labels), labels.float().view(-1, self.model.config.num_labels)
        )
        return (loss, outputs) if return_outputs else loss
```
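A hedged usage sketch follows; the `model`, `training_args`, and dataset names are assumptions standing in for a standard multi-label fine-tuning setup and are not part of the diff:

```python
trainer = MultilabelTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```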
......
@@ -209,7 +209,7 @@ Here is a `pytorch-pretrained-bert` to 🤗 Transformers conversion example for

```python
# Let's load our model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# If you used to have this line in pytorch-pretrained-bert:
loss = model(input_ids, labels=labels)

@@ -222,7 +222,7 @@ loss = outputs[0]
loss, logits = outputs[:2]

# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs
```
@@ -241,23 +241,23 @@ Here is an example:

```python
### Let's load a model and tokenizer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

### Do some stuff to our model and tokenizer
# Ex: add new tokens to the vocabulary and embeddings of our model
tokenizer.add_tokens(["[SPECIAL_TOKEN_1]", "[SPECIAL_TOKEN_2]"])
model.resize_token_embeddings(len(tokenizer))

# Train our model
train(model)

### Now let's save our model and tokenizer to a directory
model.save_pretrained("./my_saved_model_directory/")
tokenizer.save_pretrained("./my_saved_model_directory/")

### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained("./my_saved_model_directory/")
tokenizer = BertTokenizer.from_pretrained("./my_saved_model_directory/")
```

### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
...@@ -283,7 +283,13 @@ num_warmup_steps = 100 ...@@ -283,7 +283,13 @@ num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_training_steps) # 0.1 warmup_proportion = float(num_warmup_steps) / float(num_training_steps) # 0.1
### Previously BertAdam optimizer was instantiated like this: ### Previously BertAdam optimizer was instantiated like this:
optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, num_training_steps=num_training_steps) optimizer = BertAdam(
model.parameters(),
lr=lr,
schedule="warmup_linear",
warmup=warmup_proportion,
num_training_steps=num_training_steps,
)
### and used like this: ### and used like this:
for batch in train_data: for batch in train_data:
loss = model(batch) loss = model(batch)
@@ -291,13 +297,19 @@ for batch in train_data:
    optimizer.step()

### In 🤗 Transformers, optimizer and schedules are split and instantiated like this:
optimizer = AdamW(
    model.parameters(), lr=lr, correct_bias=False
)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
)  # PyTorch scheduler
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_grad_norm
    )  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    optimizer.step()
    scheduler.step()
```
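For reference, the names used in the snippet above come from the imports below, and `num_training_steps` is usually derived from the dataloader length and the number of epochs. This is only a sketch: `train_dataloader` and `num_epochs` are placeholders, not objects defined in the example.

```python
# Sketch: imports and step count for the AdamW + scheduler snippet above
from transformers import AdamW, get_linear_schedule_with_warmup

num_epochs = 3  # placeholder
num_training_steps = len(train_dataloader) * num_epochs  # train_dataloader is assumed to exist
```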
@@ -64,12 +64,15 @@ The `facebook/bart-base` and `facebook/bart-large` checkpoints can be used to fi
```python
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
tok = BartTokenizer.from_pretrained("facebook/bart-large")
example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
batch = tok(example_english_phrase, return_tensors="pt")
generated_ids = model.generate(batch["input_ids"])
assert tok.batch_decode(generated_ids, skip_special_tokens=True) == [
    "UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria"
]
```
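If you would rather inspect the result than assert on it, the generated ids can simply be decoded; `generate` also accepts the usual search parameters. A small sketch reusing `model`, `tok` and `batch` from above (the beam size and length are arbitrary):

```python
# Sketch: decode the generated ids instead of asserting on them
generated_ids = model.generate(batch["input_ids"], num_beams=4, max_length=20)
print(tok.batch_decode(generated_ids, skip_special_tokens=True))
```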
## BartConfig
...
@@ -44,6 +44,7 @@ Example of use:
>>> # With TensorFlow 2.0+:
>>> from transformers import TFAutoModel

>>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)
@@ -58,9 +59,10 @@ Tips:
```python
>>> from transformers import MBartForConditionalGeneration

>>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
>>> TXT = "Chúng tôi là <mask> nghiên cứu viên."
>>> input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"]
>>> logits = bartpho(input_ids).logits
>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
>>> probs = logits[0, masked_index].softmax(dim=0)
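>>> # One way to inspect the prediction (a sketch, not part of the original example):
>>> # decode the top-5 candidate token ids for the masked position
>>> values, predictions = probs.topk(5)
>>> top5_tokens = tokenizer.decode(predictions).split()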
...
@@ -30,7 +30,7 @@ Example of using a model with MeCab and WordPiece tokenization:
```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
@@ -40,7 +40,7 @@ Example of using a model with MeCab and WordPiece tokenization:
>>> inputs = tokenizer(line, return_tensors="pt")
>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]
>>> outputs = bertjapanese(**inputs)
@@ -57,7 +57,7 @@ Example of using a model with Character tokenization:
>>> inputs = tokenizer(line, return_tensors="pt")
>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]
>>> outputs = bertjapanese(**inputs)
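>>> # Sketch (not part of the original example): the encoder returns one hidden vector per token
>>> last_hidden_state = outputs.last_hidden_state  # shape: (batch_size, sequence_length, hidden_size)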
...
@@ -39,14 +39,18 @@ Usage:
>>> # use BERT's cls token as BOS token and sep token as EOS token
>>> encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
>>> # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
>>> decoder = BertGenerationDecoder.from_pretrained(
...     "bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
... )
>>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)

>>> # create tokenizer...
>>> tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

>>> input_ids = tokenizer(
...     "This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
... ).input_ids
>>> labels = tokenizer("This is a short summary", return_tensors="pt").input_ids

>>> # train...
>>> loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
@@ -61,7 +65,9 @@ Usage:
>>> sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse")
>>> tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")

>>> input_ids = tokenizer(
...     "This is the first sentence. This is the second sentence.", add_special_tokens=False, return_tensors="pt"
... ).input_ids
>>> outputs = sentence_fuser.generate(input_ids)
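>>> # Sketch (not in the original snippet): decode the fused sentence back to text
>>> fused_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)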
...
@@ -28,14 +28,14 @@ Example of use:
```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

>>> # For transformers v4.x+:
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

>>> # For transformers v3.x:
>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

>>> # INPUT TWEET IS ALREADY NORMALIZED!
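>>> # A possible continuation (sketch, not part of the original example): encode an
>>> # already-normalized tweet and extract its features; the tweet string is illustrative
>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
>>> input_ids = torch.tensor([tokenizer.encode(line)])
>>> with torch.no_grad():
...     features = bertweet(input_ids)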