Unverified commit 969859d5 authored by Santiago Castro, committed by GitHub

Fix doc errors and typos across the board (#8139)

* Fix doc errors and typos across the board

* Fix a typo

* Fix the CI

* Fix more typos

* Fix CI

* More fixes

* Fix CI

* More fixes

* More fixes
parent 4731a00c
@@ -54,7 +54,7 @@ python code/run_squad.py \
 | SpanBERT (large) | [94.6](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv1) | **88.7** (this) | 79.6 | [70.8](https://huggingface.co/mrm8488/spanbert-large-finetuned-tacred) |
-Note: The numbers marked as * are evaluated on the development sets becaus those models were not submitted to the official SQuAD leaderboard. All the other numbers are test numbers.
+Note: The numbers marked as * are evaluated on the development sets because those models were not submitted to the official SQuAD leaderboard. All the other numbers are test numbers.
 ## Model in action
...
@@ -45,7 +45,7 @@ python code/run_tacred.py \
 | SpanBERT (large) | [94.6](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv1) | [88.7](https://huggingface.co/mrm8488/spanbert-large-finetuned-squadv2) | 79.6 | **70.8** (this one) |
-Note: The numbers marked as * are evaluated on the development sets becaus those models were not submitted to the official SQuAD leaderboard. All the other numbers are test numbers.
+Note: The numbers marked as * are evaluated on the development sets because those models were not submitted to the official SQuAD leaderboard. All the other numbers are test numbers.
 > Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
...
@@ -50,7 +50,7 @@ tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-wikiSQL-sql
 model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-wikiSQL-sql-to-en")
 def get_explanation(query):
-    input_text = "translante Sql to English: %s </s>" % query
+    input_text = "translate Sql to English: %s </s>" % query
     features = tokenizer([input_text], return_tensors='pt')
     output = model.generate(input_ids=features['input_ids'],
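The `generate` call above is cut off by the diff context. For reference, a complete, runnable version of this model-card snippet might look like the sketch below; the extra generation arguments (`attention_mask`, `max_length`) and the final decode step are assumptions filled in for illustration, not part of the diff:

```python
from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-wikiSQL-sql-to-en")
model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-wikiSQL-sql-to-en")

def get_explanation(query):
    # T5 is text-to-text: the task prefix selects the SQL-to-English mapping.
    input_text = "translate Sql to English: %s </s>" % query
    features = tokenizer([input_text], return_tensors='pt')
    # The arguments below are illustrative assumptions; the diff cuts off here.
    output = model.generate(input_ids=features['input_ids'],
                            attention_mask=features['attention_mask'],
                            max_length=64)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(get_explanation("SELECT COUNT Params FROM table WHERE Model = 't5-base'"))
```

The same pattern applies to the two `get_sql` snippets in the next hunks, with the task prefix reversed to "translate English to SQL".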
...
@@ -50,7 +50,7 @@ tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-wikiSQL")
 model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-wikiSQL")
 def get_sql(query):
-    input_text = "translante English to SQL: %s </s>" % query
+    input_text = "translate English to SQL: %s </s>" % query
     features = tokenizer([input_text], return_tensors='pt')
     output = model.generate(input_ids=features['input_ids'],
...
@@ -50,7 +50,7 @@ tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-small-finetuned-wikiSQL")
 model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-small-finetuned-wikiSQL")
 def get_sql(query):
-    input_text = "translante English to SQL: %s </s>" % query
+    input_text = "translate English to SQL: %s </s>" % query
     features = tokenizer([input_text], return_tensors='pt')
     output = model.generate(input_ids=features['input_ids'],
...
@@ -71,7 +71,7 @@ Citation:
 </details>
-As XQuAD is just an evaluation dataset, I used Data augmentation techniques (scraping, neural machine translation, etc) to obtain more samples and splited the dataset in order to have a train and test set. The test set was created in a way that contains the same number of samples for each language. Finally, I got:
+As XQuAD is just an evaluation dataset, I used data augmentation techniques (scraping, neural machine translation, etc.) to obtain more samples and split the dataset in order to have a train and test set. The test set was created so that it contains the same number of samples for each language. Finally, I got:
 | Dataset | # samples |
 | ----------- | --------- |
...
@@ -172,7 +172,7 @@ class MemorySummary(NamedTuple):
     `MemorySummary` namedtuple otherwise with the fields:
     - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace` by
-      substracting the memory after executing each line from the memory before executing said line.
+      subtracting the memory after executing each line from the memory before executing said line.
     - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line
       obtained by summing repeated memory increase for a line if it's executed several times. The list is sorted
       from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory
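These docstring fields describe the output of the `start_memory_tracing`/`stop_memory_tracing` pair defined in the same file (`benchmark_utils.py`, which this hunk patches). A minimal usage sketch, assuming a transformers version from this era:

```python
# Minimal sketch, assuming these helpers are importable from benchmark_utils
# as in transformers releases of this period.
from transformers.benchmark.benchmark_utils import start_memory_tracing, stop_memory_tracing

traces = start_memory_tracing("transformers")  # trace lines executed inside the package

# ... run the workload to profile here, e.g. a model forward pass ...

summary = stop_memory_tracing(traces)

# `sequential` holds one per-line delta per executed line; `cumulative` merges
# repeated lines and is sorted from largest to smallest memory increase.
for state in summary.cumulative[:5]:
    print(state)
```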
@@ -208,7 +208,7 @@ def measure_peak_memory_cpu(function: Callable[[], None], interval=0.5, device_i
     Returns:
-        - `max_memory`: (`int`) cosumed memory peak in Bytes
+        - `max_memory`: (`int`) consumed memory peak in Bytes
     """
     def get_cpu_memory(process_id: int) -> int:
@@ -221,7 +221,7 @@ def measure_peak_memory_cpu(function: Callable[[], None], interval=0.5, device_i
         Returns
-            - `memory`: (`int`) cosumed memory in Bytes
+            - `memory`: (`int`) consumed memory in Bytes
         """
         process = psutil.Process(process_id)
         try:
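The nested `get_cpu_memory` helper shown above reads a process's resident memory through `psutil`. A standalone sketch of the same idea, assuming only that `psutil` is installed:

```python
import os

import psutil

def get_cpu_memory(process_id: int) -> int:
    # Resident set size (RSS) of the process, in bytes.
    return psutil.Process(process_id).memory_info().rss

print(get_cpu_memory(os.getpid()))
```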
@@ -367,7 +367,7 @@ def start_memory_tracing(
         devices = list(range(nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace
         nvml.nvmlShutdown()
     except (OSError, nvml.NVMLError):
-        logger.warning("Error while initializing comunication with GPU. " "We won't perform GPU memory tracing.")
+        logger.warning("Error while initializing communication with GPU. " "We won't perform GPU memory tracing.")
         log_gpu = False
     else:
         log_gpu = is_torch_available() or is_tf_available()
@@ -472,9 +472,10 @@ def stop_memory_tracing(
     Args:
-        - `memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary
-        - `ignore_released_memory` (boolean, default: None): if True we only sum memory increase to compute total
-          memory
+        `memory_trace` (optional output of start_memory_tracing, default: None):
+            memory trace to convert in summary
+        `ignore_released_memory` (boolean, default: None):
+            if True we only sum memory increase to compute total memory
     Return:
@@ -482,7 +483,7 @@ def stop_memory_tracing(
     - `MemorySummary` namedtuple otherwise with the fields:
         - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace` by
-          substracting the memory after executing each line from the memory before executing said line.
+          subtracting the memory after executing each line from the memory before executing said line.
         - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each
           line obtained by summing repeated memory increase for a line if it's executed several times. The list is
           sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative
...
@@ -41,7 +41,7 @@ class ConvertCommand(BaseTransformersCLICommand):
             "--tf_checkpoint", type=str, required=True, help="TensorFlow checkpoint path or folder."
         )
         train_parser.add_argument(
-            "--pytorch_dump_output", type=str, required=True, help="Path to the PyTorch savd model output."
+            "--pytorch_dump_output", type=str, required=True, help="Path to the PyTorch saved model output."
         )
         train_parser.add_argument("--config", type=str, default="", help="Configuration file path or folder.")
         train_parser.add_argument(
...
@@ -61,7 +61,7 @@ class BartConfig(PretrainedConfig):
         The non-linear activation function (function or string) in the encoder and pooler. If string,
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
     dropout (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
         The dropout ratio for the attention probabilities.
     activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
...
@@ -76,7 +76,7 @@ class BertConfig(PretrainedConfig):
         The non-linear activation function (function or string) in the encoder and pooler. If string,
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
     hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
         The dropout ratio for the attention probabilities.
     max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
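Several of the hunks below fix the same `probabilitiy` typo in config docstrings. For reference, a short sketch of how these documented parameters are set in practice; `BertConfig` and `BertModel` are the real classes, the chosen values are arbitrary:

```python
from transformers import BertConfig, BertModel

# Override the documented defaults (both dropout probabilities default to 0.1).
config = BertConfig(hidden_dropout_prob=0.2, attention_probs_dropout_prob=0.2)
model = BertModel(config)
print(model.config.hidden_dropout_prob)  # 0.2
```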
...
@@ -42,7 +42,7 @@ class BertGenerationConfig(PretrainedConfig):
         The non-linear activation function (function or string) in the encoder and pooler. If string,
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
     hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
         The dropout ratio for the attention probabilities.
     max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
@@ -62,7 +62,7 @@ class BertGenerationConfig(PretrainedConfig):
     >>> # Initializing a BertGeneration config
     >>> configuration = BertGenerationConfig()
-    >>> # Initializing a modelfrom the config
+    >>> # Initializing a model from the config
     >>> model = BertGenerationEncoder(configuration)
     >>> # Accessing the model configuration
...
@@ -58,7 +58,7 @@ class BlenderbotConfig(BartConfig):
         The non-linear activation function (function or string) in the encoder and pooler. If string,
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
     dropout (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
         The dropout ratio for the attention probabilities.
     activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
...
@@ -55,7 +55,7 @@ class DebertaConfig(PretrainedConfig):
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"`, :obj:`"gelu"`, :obj:`"tanh"`, :obj:`"gelu_fast"`,
         :obj:`"mish"`, :obj:`"linear"`, :obj:`"sigmoid"` and :obj:`"gelu_new"` are supported.
     hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
         The dropout ratio for the attention probabilities.
     max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
...
@@ -61,7 +61,7 @@ class DistilBertConfig(PretrainedConfig):
     hidden_dim (:obj:`int`, `optional`, defaults to 3072):
         The size of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
     dropout (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
         The dropout ratio for the attention probabilities.
     activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
...
@@ -57,7 +57,7 @@ class DPRConfig(PretrainedConfig):
         The non-linear activation function (function or string) in the encoder and pooler. If string,
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
     hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
         The dropout ratio for the attention probabilities.
     max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
...
@@ -62,7 +62,7 @@ class ElectraConfig(PretrainedConfig):
         The non-linear activation function (function or string) in the encoder and pooler. If string,
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
     hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
         The dropout ratio for the attention probabilities.
     max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
...
@@ -59,11 +59,11 @@ class FlaubertConfig(XLMConfig):
     attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
         The dropout probability for the attention mechanism
     gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
-        Whether or not to use a `gelu` actibation instead of `relu`.
+        Whether or not to use a `gelu` activation instead of `relu`.
     sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
         Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
     causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
-        Whether or not the model shoul behave in a causal manner. Causal models use a triangular attention mask in
-        order to only attend to the left-side context instead if a bidirectional context.
+        Whether or not the model should behave in a causal manner. Causal models use a triangular attention mask in
+        order to only attend to the left-side context instead of a bidirectional context.
     asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
         Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction
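As an aside, the triangular mask that the `causal` flag refers to is easy to visualize; a minimal sketch in plain PyTorch (illustrative, not Flaubert's actual code):

```python
import torch

# Causal (triangular) attention mask: position i may attend only to positions <= i,
# i.e. the left-side context the docstring describes.
seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(mask)
```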
...
@@ -73,7 +73,7 @@ class FSMTConfig(PretrainedConfig):
         The non-linear activation function (function or string) in the encoder and pooler. If string,
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
     dropout (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
         The dropout ratio for the attention probabilities.
     activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
...
@@ -68,7 +68,7 @@ class FunnelConfig(PretrainedConfig):
         The non-linear activation function (function or string) in the encoder and pooler. If string,
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
     hidden_dropout (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
         The dropout probability for the attention probabilities.
     activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
...
@@ -54,7 +54,7 @@ class LayoutLMConfig(BertConfig):
         The non-linear activation function (function or string) in the encoder and pooler. If string,
         :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
     hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
-        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
     attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
         The dropout ratio for the attention probabilities.
     max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
...