Doc styler v2 (#14950)

* New doc styler
* Fix issue with args at the start
* Code sample fixes
* Style code examples in MDX
* Fix more patterns
* Typo
* Typo
* More patterns
* Do without black for now
* Get more info in error
* Docstring style
* Re-enable check
* Quality
* Fix add_end_docstring decorator
* Fix docstring

Commit 87e6e4fe, authored Dec 27, 2021 by Sylvain Gugger, committed by GitHub Dec 27, 2021
20 changed files
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -848,7 +848,7 @@ jobs:
            - run: isort --check-only examples tests src utils
            - run: python utils/custom_init_isort.py --check_only
            - run: flake8 examples tests src utils
-#            - run: python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
+            - run: python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
    check_repository_consistency:
        working_directory: ~/transformers

--- a/Makefile
+++ b/Makefile
@@ -48,13 +48,13 @@ quality:
 	isort --check-only $(check_dirs)
 	python utils/custom_init_isort.py --check_only
 	flake8 $(check_dirs)
-#	python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
+	python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
 # Format source code automatically and check is there are any problems left that need manual fixing
 extra_style_checks:
 	python utils/custom_init_isort.py
-#	python utils/style_doc.py src/transformers docs/source --max_len 119
+	python utils/style_doc.py src/transformers docs/source --max_len 119
 # this target runs checks on all files and potentially modifies some of them
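
Most of the hunks in this commit are the result of rewrapping docstring paragraphs to the 119-character limit passed as `--max_len`. The effect can be sketched with Python's standard `textwrap` module. This is only an illustration of the rewrapping, not the logic in `utils/style_doc.py`, and the helper name `rewrap_paragraph` is hypothetical:

```python
import textwrap

MAX_LEN = 119  # same value as the --max_len flag in the Makefile and CI config


def rewrap_paragraph(lines, indent, max_len=MAX_LEN):
    """Join the lines of one docstring paragraph and rewrap them to max_len columns.

    Hypothetical helper for illustration; the real styler handles lists, code
    blocks and argument sections, which this sketch ignores.
    """
    text = " ".join(line.strip() for line in lines)
    return textwrap.wrap(
        text,
        width=max_len,
        initial_indent=indent,
        subsequent_indent=indent,
        break_long_words=False,   # never split identifiers or paths
        break_on_hyphens=False,   # keep hyphenated words intact
    )


# Removed lines of the `from_pretrained` hunk in configuration_utils.py
before = [
    "        Instantiate a [`PretrainedConfig`] (or a derived class) from a pretrained model",
    "        configuration.",
]
after = rewrap_paragraph(before, indent=" " * 8)
print("\n".join(after))
```

With the two removed lines of the `from_pretrained` hunk as input, the sketch produces the single added line shown in that hunk, because the joined sentence fits within 119 columns.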

--- a/src/transformers/commands/lfs.py
+++ b/src/transformers/commands/lfs.py
@@ -9,12 +9,8 @@ Spec is: github.com/git-lfs/git-lfs/blob/master/docs/custom-transfers.md
 To launch debugger while developing:
 ``` [lfs "customtransfer.multipart"]
+ path = /path/to/transformers/.env/bin/python args = -m debugpy --listen 5678 --wait-for-client
-path = /path/to/transformers/.env/bin/python
+/path/to/transformers/src/transformers/commands/transformers_cli.py lfs-multipart-upload ``` """
-args = -m debugpy --listen 5678 --wait-for-client /path/to/transformers/src/transformers/commands/transformers_cli.py
-lfs-multipart-upload ```
-"""
 import json
 import os

--- a/src/transformers/commands/serving.py
+++ b/src/transformers/commands/serving.py
@@ -214,9 +214,7 @@ class ServeCommand(BaseTransformersCLICommand):
    async def forward(self, inputs=Body(None, embed=True)):
        """
-        **inputs**:
+        **inputs**: **attention_mask**: **tokens_type_ids**:
-        **attention_mask**:
-        **tokens_type_ids**:
        """
        # Check we don't have empty string

--- a/src/transformers/configuration_utils.py
+++ b/src/transformers/configuration_utils.py
@@ -178,7 +178,8 @@ class PretrainedConfig(PushToHubMixin):
        > Parameters for fine-tuning tasks
-        architectures (`List[str]`, *optional*): Model architectures that can be used with the model pretrained weights.
+        architectures (`List[str]`, *optional*):
+            Model architectures that can be used with the model pretrained weights.
        finetuning_task (`str`, *optional*):
            Name of the task used to fine-tune the model. This can be used when converting from an original (TensorFlow
            or PyTorch) checkpoint.
@@ -401,16 +402,14 @@ class PretrainedConfig(PushToHubMixin):
                <Tip warning={true}>
-                Using `push_to_hub=True` will synchronize the repository you are pushing to with
+                Using `push_to_hub=True` will synchronize the repository you are pushing to with `save_directory`,
-                `save_directory`, which requires `save_directory` to be a local clone of the repo you are
+                which requires `save_directory` to be a local clone of the repo you are pushing to if it's an existing
-                pushing to if it's an existing folder. Pass along `temp_dir=True` to use a temporary directory
+                folder. Pass along `temp_dir=True` to use a temporary directory instead.
-                instead.
                </Tip>
            kwargs:
-                Additional key word arguments passed along to the
+                Additional key word arguments passed along to the [`~file_utils.PushToHubMixin.push_to_hub`] method.
-                [`~file_utils.PushToHubMixin.push_to_hub`] method.
        """
        if os.path.isfile(save_directory):
            raise AssertionError(f"Provided path ({save_directory}) should be a directory, not a file")
@@ -433,8 +432,7 @@ class PretrainedConfig(PushToHubMixin):
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        r"""
-        Instantiate a [`PretrainedConfig`] (or a derived class) from a pretrained model
+        Instantiate a [`PretrainedConfig`] (or a derived class) from a pretrained model configuration.
-        configuration.
        Args:
            pretrained_model_name_or_path (`str` or `os.PathLike`):
@@ -445,8 +443,7 @@ class PretrainedConfig(PushToHubMixin):
                  namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
                - a path to a *directory* containing a configuration file saved using the
                  [`~PretrainedConfig.save_pretrained`] method, e.g., `./my_model_directory/`.
-                - a path or url to a saved configuration JSON *file*, e.g.,
+                - a path or url to a saved configuration JSON *file*, e.g., `./my_model_directory/configuration.json`.
-                  `./my_model_directory/configuration.json`.
            cache_dir (`str` or `os.PathLike`, *optional*):
                Path to a directory in which a downloaded pretrained model configuration should be cached if the
                standard cache should not be used.
@@ -457,10 +454,11 @@ class PretrainedConfig(PushToHubMixin):
                Whether or not to delete incompletely received file. Attempts to resume the download if such a file
                exists.
            proxies (`Dict[str, str]`, *optional*):
-                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
+                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
+                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            use_auth_token (`str` or *bool*, *optional*):
-                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token
+                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
-                generated when running `transformers-cli login` (stored in `~/.huggingface`).
+                when running `transformers-cli login` (stored in `~/.huggingface`).
            revision(`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
@@ -468,9 +466,9 @@ class PretrainedConfig(PushToHubMixin):
            return_unused_kwargs (`bool`, *optional*, defaults to `False`):
                If `False`, then this function returns just the final configuration object.
-                If `True`, then this functions returns a `Tuple(config, unused_kwargs)` where *unused_kwargs*
+                If `True`, then this functions returns a `Tuple(config, unused_kwargs)` where *unused_kwargs* is a
-                is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e.,
+                dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e., the
-                the part of `kwargs` which has not been used to update `config` and is otherwise ignored.
+                part of `kwargs` which has not been used to update `config` and is otherwise ignored.
            kwargs (`Dict[str, Any]`, *optional*):
                The values in kwargs of any keys which are configuration attributes will be used to override the loaded
                values. Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
@@ -615,8 +613,7 @@ class PretrainedConfig(PushToHubMixin):
        Args:
            config_dict (`Dict[str, Any]`):
                Dictionary that will be used to instantiate the configuration object. Such a dictionary can be
-                retrieved from a pretrained checkpoint by leveraging the
+                retrieved from a pretrained checkpoint by leveraging the [`~PretrainedConfig.get_config_dict`] method.
-                [`~PretrainedConfig.get_config_dict`] method.
            kwargs (`Dict[str, Any]`):
                Additional parameters from which to initialize the configuration object.
@@ -730,8 +727,8 @@ class PretrainedConfig(PushToHubMixin):
        Args:
            use_diff (`bool`, *optional*, defaults to `True`):
-                If set to `True`, only the difference between the config instance and the default
+                If set to `True`, only the difference between the config instance and the default `PretrainedConfig()`
-                `PretrainedConfig()` is serialized to JSON string.
+                is serialized to JSON string.
        Returns:
            `str`: String containing all the attributes that make up this configuration instance in JSON format.
@@ -750,8 +747,8 @@ class PretrainedConfig(PushToHubMixin):
            json_file_path (`str` or `os.PathLike`):
                Path to the JSON file in which this configuration instance's parameters will be saved.
            use_diff (`bool`, *optional*, defaults to `True`):
-                If set to `True`, only the difference between the config instance and the default
+                If set to `True`, only the difference between the config instance and the default `PretrainedConfig()`
-                `PretrainedConfig()` is serialized to JSON file.
+                is serialized to JSON file.
        """
        with open(json_file_path, "w", encoding="utf-8") as writer:
            writer.write(self.to_json_string(use_diff=use_diff))
@@ -807,8 +804,8 @@ class PretrainedConfig(PushToHubMixin):
    def dict_torch_dtype_to_str(self, d: Dict[str, Any]) -> None:
        """
        Checks whether the passed dictionary has a *torch_dtype* key and if it's not None, converts torch.dtype to a
-        string of just the type. For example, `torch.float32` get converted into *"float32"* string, which can
+        string of just the type. For example, `torch.float32` get converted into *"float32"* string, which can then be
-        then be stored in the json format.
+        stored in the json format.
        """
        if d.get("torch_dtype", None) is not None and not isinstance(d["torch_dtype"], str):
            d["torch_dtype"] = str(d["torch_dtype"]).split(".")[1]
@@ -831,8 +828,8 @@ def get_configuration_file(
            git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
            identifier allowed by git.
        use_auth_token (`str` or *bool*, *optional*):
-            The token to use as HTTP bearer authorization for remote files. If `True`, will use the token
+            The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
-            generated when running `transformers-cli login` (stored in `~/.huggingface`).
+            when running `transformers-cli login` (stored in `~/.huggingface`).
        local_files_only (`bool`, *optional*, defaults to `False`):
            Whether or not to only rely on local files and not to attempt to download any files.
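
The `dict_torch_dtype_to_str` docstring in the hunk above describes turning a `torch.dtype` into a plain string so the config can be written as JSON. A minimal sketch of that conversion follows; it reuses the two context lines of the hunk but replaces `torch.dtype` with a stand-in class (`FakeDtype`, an assumption made here so the example runs without PyTorch):

```python
import json


class FakeDtype:
    """Stand-in for torch.dtype (hypothetical): str() gives 'torch.<name>' like the real one."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return f"torch.{self.name}"


def dict_torch_dtype_to_str(d):
    # Same check and conversion as the context lines in the hunk above
    if d.get("torch_dtype", None) is not None and not isinstance(d["torch_dtype"], str):
        d["torch_dtype"] = str(d["torch_dtype"]).split(".")[1]


config_dict = {"torch_dtype": FakeDtype("float32"), "hidden_size": 768}
dict_torch_dtype_to_str(config_dict)

# The dictionary now contains only JSON-serializable values
serialized = json.dumps(config_dict, sort_keys=True)
print(serialized)
```

Calling the function a second time leaves the dictionary unchanged, since the value is already a string.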

--- a/src/transformers/convert_graph_to_onnx.py
+++ b/src/transformers/convert_graph_to_onnx.py
@@ -348,7 +348,8 @@ def convert(
        output: The path where the ONNX graph will be stored
        opset: The actual version of the ONNX operator set to use
        tokenizer: The name of the model to load for the pipeline, default to the model's name if not provided
-        use_external_format: Split the model definition from its parameters to allow model bigger than 2GB (PyTorch only)
+        use_external_format:
+            Split the model definition from its parameters to allow model bigger than 2GB (PyTorch only)
        pipeline_name: The kind of pipeline to instantiate (ner, question-answering, etc.)
        model_kwargs: Keyword arguments to be forwarded to the model constructor

--- a/src/transformers/convert_pytorch_checkpoint_to_tf2.py
+++ b/src/transformers/convert_pytorch_checkpoint_to_tf2.py
@@ -12,7 +12,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Convert pytorch checkpoints to TensorFlow """
+""" Convert pytorch checkpoints to TensorFlow"""
 import argparse

--- a/src/transformers/convert_slow_tokenizers_checkpoints_to_fast.py
+++ b/src/transformers/convert_slow_tokenizers_checkpoints_to_fast.py
@@ -12,7 +12,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Convert slow tokenizers checkpoints in fast (serialization format of the `tokenizers` library) """
+""" Convert slow tokenizers checkpoints in fast (serialization format of the `tokenizers` library)"""
 import argparse
 import os

--- a/src/transformers/data/data_collator.py
+++ b/src/transformers/data/data_collator.py
@@ -219,12 +219,12 @@ class DataCollatorWithPadding:
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
-            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single sequence
-              sequence if provided).
+              if provided).
-            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the
+            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
-              maximum acceptable input length for the model if that argument is not provided.
+              acceptable input length for the model if that argument is not provided.
-            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
+            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
-              different lengths).
+              lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
@@ -271,12 +271,12 @@ class DataCollatorForTokenClassification(DataCollatorMixin):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
-            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single sequence
-              sequence if provided).
+              if provided).
-            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the
+            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
-              maximum acceptable input length for the model if that argument is not provided.
+              acceptable input length for the model if that argument is not provided.
-            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
+            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
-              different lengths).
+              lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
@@ -526,12 +526,12 @@ class DataCollatorForSeq2Seq:
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
-            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single sequence
-              sequence is provided).
+              is provided).
-            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the
+            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
-              maximum acceptable input length for the model if that argument is not provided.
+              acceptable input length for the model if that argument is not provided.
-            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
+            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
-              different lengths).
+              lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
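
The three `padding` strategies documented in the hunks above can be sketched in plain Python. This is not the collators' implementation: the function name `pad_batch` is hypothetical, padding is assumed to go on the right with pad id `0`, and the fallback to the model's maximum input length is replaced by an error to keep the example self-contained:

```python
def pad_batch(sequences, padding, max_length=None, pad_id=0):
    """Pad a batch of token-id lists according to the documented strategies.

    Hypothetical helper; right-side padding and pad_id=0 are assumptions.
    """
    if padding is False or padding == "do_not_pad":
        # No padding: sequences may have different lengths
        return [list(seq) for seq in sequences]
    if padding is True or padding == "longest":
        target = max(len(seq) for seq in sequences)
    elif padding == "max_length":
        if max_length is None:
            # The real collators fall back to the model's maximum input length
            raise ValueError("max_length is required in this sketch")
        target = max_length
    else:
        raise ValueError(f"Unknown padding strategy: {padding!r}")
    # Sequences longer than target are left as-is (no truncation here)
    return [list(seq) + [pad_id] * (target - len(seq)) for seq in sequences]


batch = [[101, 7592, 102], [101, 102]]
longest = pad_batch(batch, "longest")
fixed = pad_batch(batch, "max_length", max_length=5)
unpadded = pad_batch(batch, "do_not_pad")
```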
@@ -612,9 +612,9 @@ class DataCollatorForLanguageModeling(DataCollatorMixin):
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        mlm (`bool`, *optional*, defaults to `True`):
-            Whether or not to use masked language modeling. If set to `False`, the labels are the same as the
+            Whether or not to use masked language modeling. If set to `False`, the labels are the same as the inputs
-            inputs with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for
+            with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked
-            non-masked tokens and the value to predict for the masked token.
+            tokens and the value to predict for the masked token.
        mlm_probability (`float`, *optional*, defaults to 0.15):
            The probability with which to (randomly) mask tokens in the input, when `mlm` is set to `True`.
        pad_to_multiple_of (`int`, *optional*):
@@ -625,9 +625,8 @@ class DataCollatorForLanguageModeling(DataCollatorMixin):
    <Tip>
    For best performance, this data collator should be used with a dataset having items that are dictionaries or
-    BatchEncoding, with the `"special_tokens_mask"` key, as returned by a
+    BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a
-    [`PreTrainedTokenizer`] or a [`PreTrainedTokenizerFast`] with the
+    [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`.
-    argument `return_special_tokens_mask=True`.
    </Tip>"""
@@ -852,10 +851,9 @@ class DataCollatorForWholeWordMask(DataCollatorForLanguageModeling):
    <Tip>
-    This collator relies on details of the implementation of subword tokenization by
+    This collator relies on details of the implementation of subword tokenization by [`BertTokenizer`], specifically
-    [`BertTokenizer`], specifically that subword tokens are prefixed with *##*. For tokenizers
+    that subword tokens are prefixed with *##*. For tokenizers that do not adhere to this scheme, this collator will
-    that do not adhere to this scheme, this collator will produce an output that is roughly equivalent to
+    produce an output that is roughly equivalent to [`.DataCollatorForLanguageModeling`].
-    [`.DataCollatorForLanguageModeling`].
    </Tip>"""
@@ -1234,13 +1232,13 @@ class DataCollatorForPermutationLanguageModeling(DataCollatorMixin):
        The masked tokens to be predicted for a particular sequence are determined by the following algorithm:
            0. Start from the beginning of the sequence by setting `cur_len = 0` (number of tokens processed so far).
-            1. Sample a `span_length` from the interval `[1, max_span_length]` (length of span of tokens to be
+            1. Sample a `span_length` from the interval `[1, max_span_length]` (length of span of tokens to be masked)
-               masked)
            2. Reserve a context of length `context_length = span_length / plm_probability` to surround span to be
               masked
-            3. Sample a starting point `start_index` from the interval `[cur_len, cur_len + context_length - span_length]` and mask tokens `start_index:start_index + span_length`
+            3. Sample a starting point `start_index` from the interval `[cur_len, cur_len + context_length -
-            4. Set `cur_len = cur_len + context_length`. If `cur_len < max_len` (i.e. there are tokens remaining in
+               span_length]` and mask tokens `start_index:start_index + span_length`
-               the sequence to be processed), repeat from Step 1.
+            4. Set `cur_len = cur_len + context_length`. If `cur_len < max_len` (i.e. there are tokens remaining in the
+               sequence to be processed), repeat from Step 1.
        """
        import torch
@@ -1331,13 +1329,13 @@ class DataCollatorForPermutationLanguageModeling(DataCollatorMixin):
        The masked tokens to be predicted for a particular sequence are determined by the following algorithm:
            0. Start from the beginning of the sequence by setting `cur_len = 0` (number of tokens processed so far).
-            1. Sample a `span_length` from the interval `[1, max_span_length]` (length of span of tokens to be
+            1. Sample a `span_length` from the interval `[1, max_span_length]` (length of span of tokens to be masked)
-               masked)
            2. Reserve a context of length `context_length = span_length / plm_probability` to surround span to be
               masked
-            3. Sample a starting point `start_index` from the interval `[cur_len, cur_len + context_length - span_length]` and mask tokens `start_index:start_index + span_length`
+            3. Sample a starting point `start_index` from the interval `[cur_len, cur_len + context_length -
-            4. Set `cur_len = cur_len + context_length`. If `cur_len < max_len` (i.e. there are tokens remaining in
+               span_length]` and mask tokens `start_index:start_index + span_length`
-               the sequence to be processed), repeat from Step 1.
+            4. Set `cur_len = cur_len + context_length`. If `cur_len < max_len` (i.e. there are tokens remaining in the
+               sequence to be processed), repeat from Step 1.
        """
        from random import randint
@@ -1439,13 +1437,13 @@ class DataCollatorForPermutationLanguageModeling(DataCollatorMixin):
        The masked tokens to be predicted for a particular sequence are determined by the following algorithm:
            0. Start from the beginning of the sequence by setting `cur_len = 0` (number of tokens processed so far).
-            1. Sample a `span_length` from the interval `[1, max_span_length]` (length of span of tokens to be
+            1. Sample a `span_length` from the interval `[1, max_span_length]` (length of span of tokens to be masked)
-               masked)
            2. Reserve a context of length `context_length = span_length / plm_probability` to surround span to be
               masked
-            3. Sample a starting point `start_index` from the interval `[cur_len, cur_len + context_length - span_length]` and mask tokens `start_index:start_index + span_length`
+            3. Sample a starting point `start_index` from the interval `[cur_len, cur_len + context_length -
-            4. Set `cur_len = cur_len + context_length`. If `cur_len < max_len` (i.e. there are tokens remaining in
+               span_length]` and mask tokens `start_index:start_index + span_length`
-               the sequence to be processed), repeat from Step 1.
+            4. Set `cur_len = cur_len + context_length`. If `cur_len < max_len` (i.e. there are tokens remaining in the
+               sequence to be processed), repeat from Step 1.
        """
        from random import randint
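
Steps 0 to 4 of the span-masking algorithm described in the three docstrings above can be sketched as a standalone function. The name `sample_masked_spans` and the default values for `max_span_length` and `plm_probability` are assumptions for this illustration; the collator itself operates on tensors and builds a permutation mask on top of these spans:

```python
import random


def sample_masked_spans(max_len, max_span_length=5, plm_probability=1 / 6, rng=None):
    """Return (start, end) index pairs of the spans to mask in a sequence of max_len tokens.

    Hypothetical helper; defaults are assumed values for the sketch.
    """
    rng = rng or random.Random()
    spans = []
    cur_len = 0  # Step 0: number of tokens processed so far
    while cur_len < max_len:
        # Step 1: sample the span length from [1, max_span_length]
        span_length = rng.randint(1, max_span_length)
        # Step 2: reserve a context window around the span
        context_length = int(span_length / plm_probability)
        # Step 3: sample the span start inside the context window
        start_index = cur_len + rng.randint(0, context_length - span_length)
        if start_index < max_len:
            spans.append((start_index, min(start_index + span_length, max_len)))
        # Step 4: move past the context window and repeat while tokens remain
        cur_len += context_length
    return spans


spans = sample_masked_spans(50, rng=random.Random(0))
print(spans)
```

Because each span is placed inside its own context window, the spans never overlap, and on average roughly a `plm_probability` fraction of the tokens ends up masked.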

--- a/src/transformers/data/processors/glue.py
+++ b/src/transformers/data/processors/glue.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" GLUE processors and helpers """
+""" GLUE processors and helpers"""
 import os
 import warnings
@@ -59,9 +59,9 @@ def glue_convert_examples_to_features(
        output_mode: String indicating the output mode. Either `regression` or `classification`
    Returns:
-        If the `examples` input is a `tf.data.Dataset`, will return a `tf.data.Dataset` containing the
+        If the `examples` input is a `tf.data.Dataset`, will return a `tf.data.Dataset` containing the task-specific
-        task-specific features. If the input is a list of `InputExamples`, will return a list of task-specific
+        features. If the input is a list of `InputExamples`, will return a list of task-specific `InputFeatures` which
-        `InputFeatures` which can be fed to the model.
+        can be fed to the model.
    """
    warnings.warn(DEPRECATION_WARNING.format("function"), FutureWarning)

--- a/src/transformers/data/processors/squad.py
+++ b/src/transformers/data/processors/squad.py
@@ -774,9 +774,10 @@ class SquadFeatures:
        example_index: the index of the example
        unique_id: The unique Feature identifier
        paragraph_len: The length of the context
-        token_is_max_context: List of booleans identifying which tokens have their maximum context in this feature object.
+        token_is_max_context:
-            If a token does not have their maximum context in this feature object, it means that another feature object
+            List of booleans identifying which tokens have their maximum context in this feature object. If a token
-            has more information related to that token and should be prioritized over this feature for that token.
+            does not have their maximum context in this feature object, it means that another feature object has more
+            information related to that token and should be prioritized over this feature for that token.
        tokens: list of tokens corresponding to the input ids
        token_to_orig_map: mapping between the tokens and the original text, needed in order to identify the answer.
        start_position: start of the answer token index

--- a/src/transformers/data/processors/utils.py
+++ b/src/transformers/data/processors/utils.py
@@ -248,8 +248,8 @@ class SingleSentenceClassificationProcessor(DataProcessor):
            pad_on_left: If set to `True`, the examples will be padded on the left rather than on the right (default)
            pad_token: Padding token
            mask_padding_with_zero: If set to `True`, the attention mask will be filled by `1` for actual values
-                and by `0` for padded values. If set to `False`, inverts it (`1` for padded values, `0` for
+                and by `0` for padded values. If set to `False`, inverts it (`1` for padded values, `0` for actual
-                actual values)
+                values)
        Returns:
            If the `examples` input is a `tf.data.Dataset`, will return a `tf.data.Dataset` containing the

--- a/src/transformers/data/processors/xnli.py
+++ b/src/transformers/data/processors/xnli.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" XNLI utils (dataset loading and evaluation) """
+""" XNLI utils (dataset loading and evaluation)"""
 import os

--- a/src/transformers/debug_utils.py
+++ b/src/transformers/debug_utils.py
@@ -43,14 +43,15 @@ class DebugUnderflowOverflow:
    debug_overflow = DebugUnderflowOverflow(model)
    ```
-    then run the training as normal and if `nan` or `inf` gets detected in at least one of the weight, input or
+    then run the training as normal and if `nan` or `inf` gets detected in at least one of the weight, input or output
-    output elements this module will throw an exception and will print `max_frames_to_save` frames that lead to this
+    elements this module will throw an exception and will print `max_frames_to_save` frames that lead to this event,
-    event, each frame reporting
+    each frame reporting
    1. the fully qualified module name plus the class name whose `forward` was run
    2. the absolute min and max value of all elements for each module weights, and the inputs and output
-    For example, here is the header and the last few frames in detection report for `google/mt5-small` run in fp16 mixed precision :
+    For example, here is the header and the last few frames in detection report for `google/mt5-small` run in fp16
+    mixed precision :
    ```
    Detected inf/nan during batch_number=0
@@ -77,8 +78,8 @@ class DebugUnderflowOverflow:
    0.00e+00      inf output
    ```
-    You can see here, that `T5DenseGatedGeluDense.forward` resulted in output activations, whose absolute max value
+    You can see here, that `T5DenseGatedGeluDense.forward` resulted in output activations, whose absolute max value was
-    was around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have `Dropout` which
+    around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have `Dropout` which
    renormalizes the weights, after it zeroed some of the elements, which pushes the absolute max value to more than
    64K, and we get an overlow.
@@ -93,9 +94,9 @@ class DebugUnderflowOverflow:
    debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
    ```
-        To validate that you have set up this debugging feature correctly, and you intend to use it in a training that may
+        To validate that you have set up this debugging feature correctly, and you intend to use it in a training that
-        take hours to complete, first run it with normal tracing enabled for one of a few batches as explained in the next
+        may take hours to complete, first run it with normal tracing enabled for one of a few batches as explained in
-        section.
+        the next section.
        Mode 2. Specific batch absolute min/max tracing without detection
@@ -128,8 +129,8 @@ class DebugUnderflowOverflow:
    **Performance**:
-    As this module measures absolute `min`/`max` of each weight of the model on every forward it'll slow the
+    As this module measures absolute `min`/`max` of each weight of the model on every forward it'll slow the training
-    training down. Therefore remember to turn it off once the debugging needs have been met.
+    down. Therefore remember to turn it off once the debugging needs have been met.
    Args:
        model (`nn.Module`):
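
The overflow described in the `DebugUnderflowOverflow` docstring above can be checked with a few lines of arithmetic. This is only a rough sketch, not the module's code: the dropout probability of 0.1 is an assumption (the docstring does not state it), `dropout_scaled_max` is a hypothetical helper, and `FP16_MAX` is the largest finite IEEE 754 half-precision value.

```python
# Largest finite value representable in IEEE 754 half precision (fp16).
FP16_MAX = 65504.0


def dropout_scaled_max(abs_max: float, p: float) -> float:
    """Absolute max after inverted dropout rescales surviving elements by 1 / (1 - p)."""
    return abs_max / (1.0 - p)


abs_max_before = 62.7e3  # value quoted in the docstring
p = 0.1  # assumed dropout probability (hypothetical, not from the docstring)
abs_max_after = dropout_scaled_max(abs_max_before, p)

# True when the rescaled activation no longer fits in fp16 and would become inf.
overflows = abs_max_after > FP16_MAX
print(f"{abs_max_after:.0f} > {FP16_MAX:.0f}: {overflows}")
```

Under that assumed dropout rate, the rescaling alone is enough to push a value just below the fp16 limit past it, which is consistent with the `inf` reported in the last frame.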

--- a/src/transformers/deepspeed.py
+++ b/src/transformers/deepspeed.py
@@ -42,12 +42,12 @@ class HfDeepSpeedConfig:
    This object contains a DeepSpeed configuration dictionary and can be quickly queried for things like zero stage.
    A `weakref` of this object is stored in the module's globals to be able to access the config from areas where
-    things like the Trainer object is not available (e.g. `from_pretrained` and `_get_resized_embeddings`).
+    things like the Trainer object is not available (e.g. `from_pretrained` and `_get_resized_embeddings`). Therefore
-    Therefore it's important that this object remains alive while the program is still running.
+    it's important that this object remains alive while the program is still running.
-    [`Trainer`] uses the `HfTrainerDeepSpeedConfig` subclass instead. That subclass has logic to
+    [`Trainer`] uses the `HfTrainerDeepSpeedConfig` subclass instead. That subclass has logic to sync the configuration
-    sync the configuration with values of [`TrainingArguments`] by replacing special placeholder
+    with values of [`TrainingArguments`] by replacing special placeholder values: `"auto"`. Without this special logic
-    values: `"auto"`. Without this special logic the DeepSpeed configuration is not modified in any way.
+    the DeepSpeed configuration is not modified in any way.
    Args:
        config_file_or_dict (`Union[str, Dict]`): path to DeepSpeed config file or dict.
@@ -136,8 +136,8 @@ class HfDeepSpeedConfig:
    def is_true(self, ds_key_long):
        """
-        Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very specific question of whether the value is set to `True` (and it's not set to `False` or
+        Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very
-        isn't set).
+        specific question of whether the value is set to `True` (and it's not set to `False` or isn't set).
        """
        value = self.get_value(ds_key_long)
@@ -145,8 +145,8 @@ class HfDeepSpeedConfig:
    def is_false(self, ds_key_long):
        """
-        Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very specific question of whether the value is set to `False` (and it's not set to `True` or
+        Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very
-        isn't set).
+        specific question of whether the value is set to `False` (and it's not set to `True` or isn't set).
        """
        value = self.get_value(ds_key_long)
        return False if value is None else not bool(value)
@@ -163,8 +163,8 @@ class HfDeepSpeedConfig:
 class HfTrainerDeepSpeedConfig(HfDeepSpeedConfig):
    """
-    The `HfTrainerDeepSpeedConfig` object is meant to be created during `TrainingArguments` object creation and has
+    The `HfTrainerDeepSpeedConfig` object is meant to be created during `TrainingArguments` object creation and has the
-    the same lifespan as the latter.
+    same lifespan as the latter.
    """
    def __init__(self, config_file_or_dict):
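
The three-way distinction that the `is_true` / `is_false` docstrings describe (set to `True`, set to `False`, not set) can be illustrated with a small stand-in. This is a minimal sketch under assumptions: `ToyConfig` and its dotted-key `get_value` are hypothetical, and the body of `is_true` is assumed to mirror the `is_false` body shown in the diff.

```python
from typing import Any, Optional


class ToyConfig:
    """Hypothetical stand-in for a config wrapper; not the real HfDeepSpeedConfig."""

    def __init__(self, config: dict):
        self.config = config

    def get_value(self, ds_key_long: str) -> Optional[Any]:
        # Walk a dotted key such as "fp16.enabled"; return None when any part is missing.
        node: Any = self.config
        for part in ds_key_long.split("."):
            if not isinstance(node, dict) or part not in node:
                return None
            node = node[part]
        return node

    def is_true(self, ds_key_long: str) -> bool:
        # Assumed mirror of is_false: an unset value is never "true".
        value = self.get_value(ds_key_long)
        return False if value is None else bool(value)

    def is_false(self, ds_key_long: str) -> bool:
        # Same expression as in the diff: an unset value is never "false" either.
        value = self.get_value(ds_key_long)
        return False if value is None else not bool(value)


cfg = ToyConfig({"fp16": {"enabled": True}, "zero_optimization": {"overlap_comm": False}})
print(cfg.is_true("fp16.enabled"), cfg.is_false("zero_optimization.overlap_comm"))
print(cfg.is_true("missing.key"), cfg.is_false("missing.key"))
```

The point of the design is that `is_false(key)` is not the same as `not is_true(key)`: for an unset key both methods return `False`.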

--- a/src/transformers/feature_extraction_sequence_utils.py
+++ b/src/transformers/feature_extraction_sequence_utils.py
@@ -78,35 +78,36 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
        Pad input values / input vectors or a batch of input values / input vectors up to predefined length or to the
        max sequence length in the batch.
-        Padding side (left/right) padding values are defined at the feature extractor level (with
+        Padding side (left/right) padding values are defined at the feature extractor level (with `self.padding_side`,
-        `self.padding_side`, `self.padding_value`)
+        `self.padding_value`)
        <Tip>
-        If the `processed_features` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors,
+        If the `processed_features` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
-        the result will use the same type unless you provide a different tensor type with `return_tensors`. In
+        result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
-        the case of PyTorch tensors, you will lose the specific device of your tensors however.
+        PyTorch tensors, you will lose the specific device of your tensors however.
        </Tip>
        Args:
            processed_features ([`BatchFeature`], list of [`BatchFeature`], `Dict[str, List[float]]`, `Dict[str, List[List[float]]]` or `List[Dict[str, List[float]]]`):
-                Processed inputs. Can represent one input ([`BatchFeature`] or `Dict[str, List[float]]`) or a batch of input values / vectors (list of [`BatchFeature`],
+                Processed inputs. Can represent one input ([`BatchFeature`] or `Dict[str, List[float]]`) or a batch of
-                *Dict[str, List[List[float]]]* or *List[Dict[str, List[float]]]*) so you can use this method during
+                input values / vectors (list of [`BatchFeature`], *Dict[str, List[List[float]]]* or *List[Dict[str,
-                preprocessing as well as in a PyTorch Dataloader collate function.
+                List[float]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
+                collate function.
-                Instead of `List[float]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow
+                Instead of `List[float]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors),
-                tensors), see the note above for the return type.
+                see the note above for the return type.
            padding (`bool`, `str` or [`~file_utils.PaddingStrategy`], *optional*, defaults to `True`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:
-                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a
+                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
-                  single sequence is provided).
+                  sequence is provided).
-                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the
+                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
-                  maximum acceptable input length for the model if that argument is not provided.
+                  acceptable input length for the model if that argument is not provided.
-                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
+                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
-                  different lengths).
+                  lengths).
            max_length (`int`, *optional*):
                Maximum length of the returned list and optionally padding length (see above).
            truncation (`bool`):
@@ -242,7 +243,9 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
        Pad inputs (on left/right and up to predefined length or max length in the batch)
        Args:
-            processed_features: Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
+            processed_features:
+                Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch
+                of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
            max_length: maximum length of the returned list and optionally padding length (see below)
            padding_strategy: PaddingStrategy to use for padding.
@@ -256,7 +259,8 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                >= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
-            return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics)
+            return_attention_mask:
+                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
        """
        required_input = processed_features[self.model_input_names[0]]
@@ -307,12 +311,15 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
        Truncate inputs to predefined length or max length in the batch
        Args:
-            processed_features: Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
+            processed_features:
+                Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch
+                of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
            max_length: maximum length of the returned list and optionally padding length (see below)
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                >= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
-            truncation: (optional) Activates truncation to cut input sequences longer than `max_length` to `max_length`.
+            truncation:
+                (optional) Activates truncation to cut input sequences longer than `max_length` to `max_length`.
        """
        if not truncation:
            return processed_features
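
The padding strategies documented in the `pad` docstring above (`'longest'`, `'max_length'`, `'do_not_pad'`, together with `padding_side`, `padding_value` and `pad_to_multiple_of`) can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: `pad_batch` is a hypothetical helper, it works on lists rather than tensors, and unlike the documented behavior it requires `max_length` instead of falling back to a model-specific maximum.

```python
from typing import List, Optional


def pad_batch(
    batch: List[List[float]],
    padding="longest",
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    padding_side: str = "right",
    padding_value: float = 0.0,
):
    """Toy padding helper (hypothetical); returns padded values and attention masks."""
    if padding is False or padding == "do_not_pad":
        # No padding: sequences may keep different lengths.
        return batch, [[1] * len(seq) for seq in batch]
    if padding is True or padding == "longest":
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required for padding='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    if pad_to_multiple_of is not None and target % pad_to_multiple_of != 0:
        # Round the target length up to the next multiple.
        target = ((target // pad_to_multiple_of) + 1) * pad_to_multiple_of

    padded, masks = [], []
    for seq in batch:
        missing = max(target - len(seq), 0)
        fill, mask_fill = [padding_value] * missing, [0] * missing
        if padding_side == "right":
            padded.append(list(seq) + fill)
            masks.append([1] * len(seq) + mask_fill)
        else:
            padded.append(fill + list(seq))
            masks.append(mask_fill + [1] * len(seq))
    return padded, masks


values, mask = pad_batch([[0.1, 0.2, 0.3], [0.4]], padding="longest", pad_to_multiple_of=4)
print(values)
print(mask)
```

The attention mask marks real values with 1 and padding with 0, so downstream code can ignore the padded positions regardless of the padding side.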

--- a/src/transformers/feature_extraction_utils.py
+++ b/src/transformers/feature_extraction_utils.py
@@ -54,8 +54,7 @@ PreTrainedFeatureExtractor = Union["SequenceFeatureExtractor"]  # noqa: F821
 class BatchFeature(UserDict):
    r"""
-    Holds the output of the [`~SequenceFeatureExtractor.pad`] and feature extractor specific
+    Holds the output of the [`~SequenceFeatureExtractor.pad`] and feature extractor specific `__call__` methods.
-    `__call__` methods.
    This class is derived from a python dictionary and can be used as a dictionary.
@@ -74,8 +73,8 @@ class BatchFeature(UserDict):
    def __getitem__(self, item: str) -> Union[Any]:
        """
-        If the key is a string, returns the value of the dict associated to `key` ('input_values',
+        If the key is a string, returns the value of the dict associated to `key` ('input_values', 'attention_mask',
-        'attention_mask', etc.).
+        etc.).
        """
        if isinstance(item, str):
            return self.data[item]
@@ -216,8 +215,8 @@ class FeatureExtractionMixin:
        cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs
    ) -> PreTrainedFeatureExtractor:
        r"""
-        Instantiate a type of [`~feature_extraction_utils.FeatureExtractionMixin`] from a feature
+        Instantiate a type of [`~feature_extraction_utils.FeatureExtractionMixin`] from a feature extractor, *e.g.* a
-        extractor, *e.g.* a derived class of [`SequenceFeatureExtractor`].
+        derived class of [`SequenceFeatureExtractor`].
        Args:
            pretrained_model_name_or_path (`str` or `os.PathLike`):
@@ -241,19 +240,20 @@ class FeatureExtractionMixin:
                Whether or not to delete incompletely received file. Attempts to resume the download if such a file
                exists.
            proxies (`Dict[str, str]`, *optional*):
-                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
+                A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
+                'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
            use_auth_token (`str` or *bool*, *optional*):
-                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token
+                The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
-                generated when running `transformers-cli login` (stored in `~/.huggingface`).
+                when running `transformers-cli login` (stored in `~/.huggingface`).
            revision(`str`, *optional*, defaults to `"main"`):
                The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
                git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
                identifier allowed by git.
            return_unused_kwargs (`bool`, *optional*, defaults to `False`):
-                If `False`, then this function returns just the final feature extractor object. If `True`,
+                If `False`, then this function returns just the final feature extractor object. If `True`, then this
-                then this function returns a `Tuple(feature_extractor, unused_kwargs)` where *unused_kwargs* is a
+                function returns a `Tuple(feature_extractor, unused_kwargs)` where *unused_kwargs* is a dictionary
-                dictionary consisting of the key/value pairs whose keys are not feature extractor attributes: i.e., the
+                consisting of the key/value pairs whose keys are not feature extractor attributes: i.e., the part of
-                part of `kwargs` which has not been used to update `feature_extractor` and is otherwise ignored.
+                `kwargs` which has not been used to update `feature_extractor` and is otherwise ignored.
            kwargs (`Dict[str, Any]`, *optional*):
                The values in kwargs of any keys which are feature extractor attributes will be used to override the
                loaded values. Behavior concerning key/value pairs whose keys are *not* feature extractor attributes is
@@ -311,16 +311,14 @@ class FeatureExtractionMixin:
    ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        """
        From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters, to be used for instantiating a
-        feature extractor of type [`~feature_extraction_utils.FeatureExtractionMixin`] using
+        feature extractor of type [`~feature_extraction_utils.FeatureExtractionMixin`] using `from_dict`.
-        `from_dict`.
        Parameters:
            pretrained_model_name_or_path (`str` or `os.PathLike`):
                The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.
        Returns:
-            `Tuple[Dict, Dict]`: The dictionary(ies) that will be used to instantiate the feature extractor
+            `Tuple[Dict, Dict]`: The dictionary(ies) that will be used to instantiate the feature extractor object.
-            object.
        """
        cache_dir = kwargs.pop("cache_dir", None)
        force_download = kwargs.pop("force_download", False)
@@ -398,8 +396,8 @@ class FeatureExtractionMixin:
    @classmethod
    def from_dict(cls, feature_extractor_dict: Dict[str, Any], **kwargs) -> PreTrainedFeatureExtractor:
        """
-        Instantiates a type of [`~feature_extraction_utils.FeatureExtractionMixin`] from a Python
+        Instantiates a type of [`~feature_extraction_utils.FeatureExtractionMixin`] from a Python dictionary of
-        dictionary of parameters.
+        parameters.
        Args:
            feature_extractor_dict (`Dict[str, Any]`):
@@ -410,8 +408,8 @@ class FeatureExtractionMixin:
                Additional parameters from which to initialize the feature extractor object.
        Returns:
-            [`~feature_extraction_utils.FeatureExtractionMixin`]: The feature extractor object
+            [`~feature_extraction_utils.FeatureExtractionMixin`]: The feature extractor object instantiated from those
-            instantiated from those parameters.
+            parameters.
        """
        return_unused_kwargs = kwargs.pop("return_unused_kwargs", False)
@@ -447,16 +445,16 @@ class FeatureExtractionMixin:
    @classmethod
    def from_json_file(cls, json_file: Union[str, os.PathLike]) -> PreTrainedFeatureExtractor:
        """
-        Instantiates a feature extractor of type [`~feature_extraction_utils.FeatureExtractionMixin`]
+        Instantiates a feature extractor of type [`~feature_extraction_utils.FeatureExtractionMixin`] from the path to
-        from the path to a JSON file of parameters.
+        a JSON file of parameters.
        Args:
            json_file (`str` or `os.PathLike`):
                Path to the JSON file containing the parameters.
        Returns:
-            A feature extractor of type [`~feature_extraction_utils.FeatureExtractionMixin`]: The
+            A feature extractor of type [`~feature_extraction_utils.FeatureExtractionMixin`]: The feature_extractor
-            feature_extractor object instantiated from that JSON file.
+            object instantiated from that JSON file.
        """
        with open(json_file, "r", encoding="utf-8") as reader:
            text = reader.read()
@@ -468,8 +466,7 @@ class FeatureExtractionMixin:
        Serializes this instance to a JSON string.
        Returns:
-            `str`: String containing all the attributes that make up this feature_extractor instance in JSON
+            `str`: String containing all the attributes that make up this feature_extractor instance in JSON format.
-            format.
        """
        dictionary = self.to_dict()
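
The `return_unused_kwargs` behavior described in the `from_pretrained` and `from_dict` docstrings above can be sketched with a toy class. This is a simplified sketch under assumptions: `ToyFeatureExtractor` and its two attributes are hypothetical, and the override loop is one plausible way to implement what the docstrings describe.

```python
from typing import Any, Dict


class ToyFeatureExtractor:
    """Hypothetical class for illustration; not part of transformers."""

    def __init__(self, sampling_rate: int = 16000, padding_value: float = 0.0):
        self.sampling_rate = sampling_rate
        self.padding_value = padding_value

    @classmethod
    def from_dict(cls, feature_extractor_dict: Dict[str, Any], **kwargs):
        return_unused_kwargs = kwargs.pop("return_unused_kwargs", False)
        feature_extractor = cls(**feature_extractor_dict)
        # kwargs that name an existing attribute override the loaded value.
        for key in list(kwargs):
            if hasattr(feature_extractor, key):
                setattr(feature_extractor, key, kwargs.pop(key))
        # Whatever is left was not a feature extractor attribute.
        if return_unused_kwargs:
            return feature_extractor, kwargs
        return feature_extractor


extractor, unused = ToyFeatureExtractor.from_dict(
    {"sampling_rate": 8000}, padding_value=-1.0, foo="bar", return_unused_kwargs=True
)
print(extractor.sampling_rate, extractor.padding_value, unused)
```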

--- a/src/transformers/file_utils.py
+++ b/src/transformers/file_utils.py
@@ -855,7 +855,7 @@ def add_start_docstrings_to_model_forward(*docstr):
 def add_end_docstrings(*docstr):
    def docstring_decorator(fn):
-        fn.__doc__ = fn.__doc__ + "".join(docstr)
+        fn.__doc__ = (fn.__doc__ if fn.__doc__ is not None else "") + "".join(docstr)
        return fn
    return docstring_decorator
@@ -1169,7 +1169,8 @@ PT_SPEECH_SEQ_CLASS_SAMPLE = r"""
    >>> # audio file is decoded on the fly
    >>> inputs = feature_extractor(dataset[0]["audio"]["array"], return_tensors="pt")
-    >>> logits = model(**inputs).logits >>> predicted_class_ids = torch.argmax(logits, dim=-1)
+    >>> logits = model(**inputs).logits
+    >>> predicted_class_ids = torch.argmax(logits, dim=-1)
    >>> predicted_label = model.config.id2label[predicted_class_ids]
    >>> # compute loss - target_label is e.g. "down"
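
The `add_end_docstrings` change earlier in this file's diff guards against functions that have no docstring, for which `fn.__doc__` is `None` and the old concatenation would raise a `TypeError`. The snippet below reproduces the patched decorator from the hunk; the two decorated functions are hypothetical examples added for illustration.

```python
def add_end_docstrings(*docstr):
    def docstring_decorator(fn):
        # Treat a missing docstring as empty so the concatenation cannot fail.
        fn.__doc__ = (fn.__doc__ if fn.__doc__ is not None else "") + "".join(docstr)
        return fn

    return docstring_decorator


@add_end_docstrings("Appended note.")
def undocumented():
    pass


@add_end_docstrings(" Appended note.")
def documented():
    """Original summary."""


print(repr(undocumented.__doc__))
print(repr(documented.__doc__))
```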

--- a/src/transformers/generation_beam_search.py
+++ b/src/transformers/generation_beam_search.py
@@ -29,8 +29,7 @@ PROCESS_INPUTS_DOCSTRING = r"""
            Indices of input sequence tokens in the vocabulary.
            Indices can be obtained using any class inheriting from [`PreTrainedTokenizer`]. See
-            [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for
+            [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details.
-            details.
            [What are input IDs?](../glossary#input-ids)
        next_scores (`torch.FloatTensor` of shape `(batch_size, 2 * num_beams)`):
@@ -47,10 +46,10 @@ PROCESS_INPUTS_DOCSTRING = r"""
    Return:
        `UserDict`: A dictionary composed of the fields as defined above:
-            - **next_beam_scores** (`torch.FloatTensor` of shape `(batch_size * num_beams)`) -- Updated
+            - **next_beam_scores** (`torch.FloatTensor` of shape `(batch_size * num_beams)`) -- Updated scores of all
-              scores of all non-finished beams.
+              non-finished beams.
-            - **next_beam_tokens** (`torch.FloatTensor` of shape `(batch_size * num_beams)`) -- Next tokens
+            - **next_beam_tokens** (`torch.FloatTensor` of shape `(batch_size * num_beams)`) -- Next tokens to be added
-              to be added to the non-finished beam_hypotheses.
+              to the non-finished beam_hypotheses.
            - **next_beam_indices** (`torch.FloatTensor` of shape `(batch_size * num_beams)`) -- Beam indices
              indicating to which beam the next tokens shall be added.
@@ -62,8 +61,7 @@ FINALIZE_INPUTS_DOCSTRING = r"""
            Indices of input sequence tokens in the vocabulary.
            Indices can be obtained using any class inheriting from [`PreTrainedTokenizer`]. See
-            [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for
+            [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details.
-            details.
            [What are input IDs?](../glossary#input-ids)
        final_beam_scores (`torch.FloatTensor` of shape `(batch_size * num_beams)`):
@@ -78,9 +76,9 @@ FINALIZE_INPUTS_DOCSTRING = r"""
            The id of the *end-of-sequence* token.
    Return:
-        `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`: The generated
+        `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`: The generated sequences.
-        sequences. The second dimension (sequence_length) is either equal to `max_length` or shorter if all
+        The second dimension (sequence_length) is either equal to `max_length` or shorter if all batches finished early
-        batches finished early due to the `eos_token_id`.
+        due to the `eos_token_id`.
 """
@@ -121,9 +119,11 @@ class BeamSearchScorer(BeamScorer):
    r"""
    [`BeamScorer`] implementing standard beam search decoding.
-    Adapted in part from [Facebook's XLM beam search code](https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529).
+    Adapted in part from [Facebook's XLM beam search
+    code](https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529).
-    Reference for the diverse beam search algorithm and implementation [Ashwin Kalyan's DBS implementation](https://github.com/ashwinkalyan/dbs/blob/master/dbs/beam_utils.lua)
+    Reference for the diverse beam search algorithm and implementation [Ashwin Kalyan's DBS
+    implementation](https://github.com/ashwinkalyan/dbs/blob/master/dbs/beam_utils.lua)
    Args:
        batch_size (`int`):
@@ -133,8 +133,8 @@ class BeamSearchScorer(BeamScorer):
        num_beams (`int`):
            Number of beams for beam search.
        device (`torch.device`):
-            Defines the device type (*e.g.*, `"cpu"` or `"cuda"`) on which this instance of
+            Defines the device type (*e.g.*, `"cpu"` or `"cuda"`) on which this instance of `BeamSearchScorer` will be
-            `BeamSearchScorer` will be allocated.
+            allocated.
        length_penalty (`float`, *optional*, defaults to 1.0):
            Exponential penalty to the length. 1.0 means no penalty. Set to values < 1.0 in order to encourage the
            model to generate shorter sequences, to a value > 1.0 in order to encourage the model to produce longer
@@ -145,8 +145,8 @@ class BeamSearchScorer(BeamScorer):
            The number of beam hypotheses that shall be returned upon calling
            [`~transformer.BeamSearchScorer.finalize`].
        num_beam_groups (`int`):
-            Number of groups to divide `num_beams` into in order to ensure diversity among different groups of
+            Number of groups to divide `num_beams` into in order to ensure diversity among different groups of beams.
-            beams. See [this paper](https://arxiv.org/pdf/1610.02424.pdf) for more details.
+            See [this paper](https://arxiv.org/pdf/1610.02424.pdf) for more details.
    """
    def __init__(
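
The effect of `length_penalty` described in the `BeamSearchScorer` docstring above can be illustrated numerically. This is a sketch under an assumption: it scores a hypothesis as the sum of its token log-probabilities divided by `length ** length_penalty`, which the diff itself does not show. `length_normalized_score` and the two example hypotheses are hypothetical.

```python
def length_normalized_score(sum_logprobs: float, length: int, length_penalty: float = 1.0) -> float:
    """Assumed scoring rule: sum of log-probabilities divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# (sum of log-probabilities, sequence length) for two made-up hypotheses.
short = (-4.0, 4)
long = (-7.0, 8)

for penalty in (0.0, 1.0, 2.0):
    s = length_normalized_score(*short, length_penalty=penalty)
    l = length_normalized_score(*long, length_penalty=penalty)
    winner = "short" if s > l else "long"
    print(f"length_penalty={penalty}: short={s:.3f} long={l:.3f} -> {winner}")
```

Because log-probabilities are negative, dividing by a larger power of the length moves long hypotheses closer to zero, so larger `length_penalty` values favor longer sequences under this rule.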

--- a/src/transformers/generation_flax_logits_process.py
+++ b/src/transformers/generation_flax_logits_process.py
@@ -32,9 +32,8 @@ LOGITS_PROCESSOR_INPUTS_DOCSTRING = r"""
        input_ids (`jnp.ndarray` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.
-            Indices can be obtained using [`PreTrainedTokenizer`]. See
+            Indices can be obtained using [`PreTrainedTokenizer`]. See [`PreTrainedTokenizer.encode`] and
-            [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for
+            [`PreTrainedTokenizer.__call__`] for details.
-            details.
            [What are input IDs?](../glossary#input-ids)
        scores (`jnp.ndarray` of shape `(batch_size, config.vocab_size)`):
@@ -73,10 +72,9 @@ class FlaxLogitsWarper(ABC):
 class FlaxLogitsProcessorList(list):
    """
-    This class can be used to create a list of [`FlaxLogitsProcessor`] or
+    This class can be used to create a list of [`FlaxLogitsProcessor`] or [`FlaxLogitsWarper`] to subsequently process
-    [`FlaxLogitsWarper`] to subsequently process a `scores` input tensor. This class inherits
+    a `scores` input tensor. This class inherits from list and adds a specific *__call__* method to apply each
-    from list and adds a specific *__call__* method to apply each [`FlaxLogitsProcessor`] or
+    [`FlaxLogitsProcessor`] or [`FlaxLogitsWarper`] to the inputs.
-    [`FlaxLogitsWarper`] to the inputs.
    """
    @add_start_docstrings(LOGITS_PROCESSOR_INPUTS_DOCSTRING)
@@ -117,13 +115,12 @@ class FlaxTemperatureLogitsWarper(FlaxLogitsWarper):
 class FlaxTopPLogitsWarper(FlaxLogitsWarper):
    """
-    [`LogitsWarper`] that performs top-p, i.e. restricting to top tokens summing to <=
+    [`LogitsWarper`] that performs top-p, i.e. restricting to top tokens summing to <= prob_cut_off.
-    prob_cut_off.
    Args:
        top_p (`float`):
-            If set to < 1, only the most probable tokens with probabilities that add up to `top_p` or higher are
+            If set to < 1, only the most probable tokens with probabilities that add up to `top_p` or higher are kept
-            kept for generation.
+            for generation.
        filter_value (`float`, *optional*, defaults to `-float("Inf")`):
            All filtered values will be set to this float value.
        min_tokens_to_keep (`int`, *optional*, defaults to 1):
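
The top-p filtering that the `FlaxTopPLogitsWarper` docstring above describes can be sketched without JAX. This is a plain-Python illustration, not the Flax implementation: `top_p_filter` is a hypothetical helper that keeps the most probable tokens until their cumulative probability reaches `top_p` (so the kept set adds up to `top_p` or higher) and sets every other score to `filter_value`.

```python
import math
from typing import List


def top_p_filter(
    scores: List[float],
    top_p: float,
    filter_value: float = -math.inf,
    min_tokens_to_keep: int = 1,
) -> List[float]:
    # Softmax over the raw scores (shifted by the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least probable.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for rank, idx in enumerate(order):
        # Keep a token while the mass accumulated before it is still below top_p.
        if cumulative < top_p or rank < min_tokens_to_keep:
            keep.add(idx)
        cumulative += probs[idx]

    return [s if i in keep else filter_value for i, s in enumerate(scores)]


scores = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
filtered = top_p_filter(scores, top_p=0.7)
print(filtered)
```

With `top_p=0.7`, the first two tokens (probabilities 0.5 and 0.3) are kept and the remaining two are filtered out.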
@@ -219,8 +216,7 @@ class FlaxForcedBOSTokenLogitsProcessor(FlaxLogitsProcessor):
 class FlaxForcedEOSTokenLogitsProcessor(FlaxLogitsProcessor):
    r"""
-    [`FlaxLogitsProcessor`] that enforces the specified token as the last generated token when
+    [`FlaxLogitsProcessor`] that enforces the specified token as the last generated token when `max_length` is reached.
-    `max_length` is reached.
    Args:
        max_length (`int`):