Fix many typos (#8708)

e1f3156b · Santiago Castro · GitHub · 9c0afdaf
Unverified commit e1f3156b, authored Nov 22, 2020 by Santiago Castro, committed by GitHub Nov 21, 2020
20 changed files
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -44,7 +44,7 @@ The documentation is organized in five parts:
  and a glossary.
 - **USING 🤗 TRANSFORMERS** contains general tutorials on how to use the library.
 - **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general resarch in
+- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
  transformers model
 - The three last section contain the documentation of each public class and function, grouped in:

--- a/docs/source/model_doc/dpr.rst
+++ b/docs/source/model_doc/dpr.rst
@@ -5,7 +5,7 @@ Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
-intorduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
+introduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
 Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
 The abstract from the paper is the following:

--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -530,7 +530,7 @@ Sequence-to-sequence model with the same encoder-decoder model architecture as B
 two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
 objective, called Gap Sentence Generation (GSG).
-  * MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in
+  * MLM: encoder input tokens are randomly replaced by a mask tokens and have to be predicted by the encoder (like in
    BERT)
  * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a
    causal mask to hide the future words like a regular auto-regressive transformer decoder.

--- a/docs/source/multilingual.rst
+++ b/docs/source/multilingual.rst
@@ -109,7 +109,7 @@ XLM-RoBERTa
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
-over previously released multi-lingual models like mBERT or XLM on downstream taks like classification, sequence
+over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
 labeling and question answering.
 Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:

--- a/docs/source/perplexity.rst
+++ b/docs/source/perplexity.rst
@@ -62,7 +62,7 @@ sliding the context window so that the model has more context when making each p
 This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
 favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
 practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
-1 token a time. This allows computation to procede much faster while still giving the model a large context to make
+1 token a time. This allows computation to proceed much faster while still giving the model a large context to make
 predictions at each step.
 Example: Calculating perplexity with GPT-2 in 🤗 Transformers
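
The strided sliding window described in the perplexity.rst hunk above can be sketched in plain Python. This is a minimal illustration, not the code from the documentation: the function name `strided_windows` and its parameters are hypothetical, and it only computes window boundaries. A real evaluation would run a model forward pass on each window and score only the tokens not already scored.

```python
def strided_windows(n_tokens, max_length, stride):
    """Yield (begin, end, n_scored) for each context window.

    begin/end delimit the tokens fed to the model; n_scored is how many
    tokens at the end of the window have not been scored by an earlier
    window. All names here are illustrative assumptions.
    """
    if not 0 < stride <= max_length:
        raise ValueError("stride must be in (0, max_length]")
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        # Tokens before prev_end only serve as context in this window.
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break


for window in strided_windows(n_tokens=10, max_length=4, stride=2):
    print(window)
```

With `stride` equal to `max_length` the windows do not overlap, so the model gets the least context but needs the fewest passes. With `stride=1` every token after the first window is predicted with the full context, at the cost of roughly one forward pass per token.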

--- a/examples/benchmarking/plot_csv_file.py
+++ b/examples/benchmarking/plot_csv_file.py
@@ -25,7 +25,7 @@ class PlotArguments:
    )
    plot_along_batch: bool = field(
        default=False,
-        metadata={"help": "Whether to plot along batch size or sequence lengh. Defaults to sequence length."},
+        metadata={"help": "Whether to plot along batch size or sequence length. Defaults to sequence length."},
    )
    is_time: bool = field(
        default=False,

--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -17,7 +17,7 @@ This folder contains the original code used to train Distil* as well as examples
 ## What is Distil*
-Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
+Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
 We have applied the same method to other Transformer architectures and released the weights:
 - GPT2: on the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 16.3 compared to 21.1 for **DistilGPT2** (after fine-tuning on the train set).
@@ -57,7 +57,7 @@ Here are the results on the *test* sets for 6 of the languages available in XNLI
 This part of the library has only be tested with Python3.6+. There are few specific dependencies to install before launching a distillation, you can install them with the command `pip install -r requirements.txt`.
-**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breakings changes compared to v1.1.0).
+**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0).
 ## How to use DistilBERT
@@ -111,7 +111,7 @@ python scripts/binarized_data.py \
    --dump_file data/binarized_text
 ```
-Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smoothes the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurrences of each tokens in the data:
+Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smooths the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurrences of each tokens in the data:
 ```bash
 python scripts/token_counts.py \
@@ -173,7 +173,7 @@ python -m torch.distributed.launch \
        --token_counts data/token_counts.bert-base-uncased.pickle
 ```
-**Tips:** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use `--student_pretrained_weights` argument to use this initialization for the distilled training!
+**Tips:** Starting distilled training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use `--student_pretrained_weights` argument to use this initialization for the distilled training!
 Happy distillation!
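
The README hunk above mentions smoothing the masking probability so that rare words get more emphasis, based on per-token occurrence counts. A minimal sketch of one way to do this follows. The inverse power-law form, the `smoothing` exponent of 0.7, and the function name `masking_weights` are assumptions for illustration and are not taken from this diff.

```python
def masking_weights(token_counts, smoothing=0.7):
    """Turn raw token counts into normalized masking weights.

    Rare tokens (small counts) receive larger weights. The power-law
    form and the default exponent are illustrative assumptions.
    """
    # Clamp counts to 1 so tokens that never occur do not divide by zero.
    raw = [max(count, 1) ** -smoothing for count in token_counts]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical counts for three tokens: frequent, medium, rare.
weights = masking_weights([1000, 10, 1])
print(weights)
```

A larger exponent shifts more of the masking budget toward rare tokens, and an exponent of 0 gives every token the same weight.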

--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -188,7 +188,7 @@ class Distiller:
    def prepare_batch_mlm(self, batch):
        """
-        Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the masked label for MLM.
+        Prepare the batch: from the token_ids and the lengths, compute the attention mask and the masked label for MLM.
        Input:
        ------
@@ -200,7 +200,7 @@ class Distiller:
        -------
            token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM.
            attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
-            mlm_labels: `torch.tensor(bs, seq_length)` - The masked languge modeling labels. There is a -100 where there is nothing to predict.
+            mlm_labels: `torch.tensor(bs, seq_length)` - The masked language modeling labels. There is a -100 where there is nothing to predict.
        """
        token_ids, lengths = batch
        token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
@@ -253,7 +253,7 @@ class Distiller:
    def prepare_batch_clm(self, batch):
        """
-        Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the labels for CLM.
+        Prepare the batch: from the token_ids and the lengths, compute the attention mask and the labels for CLM.
        Input:
        ------
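
The docstrings in the distiller.py hunks above describe building an attention mask from sequence lengths and using -100 in the labels wherever there is nothing to predict. A minimal sketch with plain Python lists follows. The helper names `attention_mask` and `mlm_labels` are hypothetical, and the actual method works on `torch` tensors and also selects which positions to mask.

```python
def attention_mask(lengths, seq_length=None):
    """Return a (bs, seq_length) 0/1 mask: 1 for real tokens, 0 for padding."""
    if seq_length is None:
        seq_length = max(lengths)
    return [[1 if pos < n else 0 for pos in range(seq_length)] for n in lengths]


def mlm_labels(token_ids, predict_mask, ignore_index=-100):
    """Keep the original id where a prediction is required, else ignore_index."""
    return [
        [tok if predict else ignore_index for tok, predict in zip(row, mask_row)]
        for row, mask_row in zip(token_ids, predict_mask)
    ]


print(attention_mask([3, 1]))
print(mlm_labels([[5, 6, 7]], [[False, True, False]]))
```

The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute nothing to the loss.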

--- a/examples/distillation/scripts/extract_distilbert.py
+++ b/examples/distillation/scripts/extract_distilbert.py
@@ -86,7 +86,7 @@ if __name__ == "__main__":
            compressed_sd[f"vocab_layer_norm.{w}"] = state_dict[f"cls.predictions.transform.LayerNorm.{w}"]
    print(f"N layers selected for distillation: {std_idx}")
-    print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
+    print(f"Number of params transferred for distillation: {len(compressed_sd.keys())}")
-    print(f"Save transfered checkpoint to {args.dump_checkpoint}.")
+    print(f"Save transferred checkpoint to {args.dump_checkpoint}.")
    torch.save(compressed_sd, args.dump_checkpoint)
--- a/examples/movement-pruning/README.md
+++ b/examples/movement-pruning/README.md
@@ -21,7 +21,7 @@ You can also have a look at this fun *Explain Like I'm Five* introductory [slide
 One promise of extreme pruning is to obtain extremely small models that can be easily sent (and stored) on edge devices. By setting weights to 0., we reduce the amount of information we need to store, and thus decreasing the memory size. We are able to obtain extremely sparse fine-pruned models with movement pruning: ~95% of the dense performance with ~5% of total remaining weights in the BERT encoder.
-In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the orignal dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
+In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the original dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
 While movement pruning does not directly optimize for memory footprint (but rather the number of non-null weights), we hypothetize that further memory compression ratios can be achieved with specific quantization aware trainings (see for instance [Q8BERT](https://arxiv.org/abs/1910.06188), [And the Bit Goes Down](https://arxiv.org/abs/1907.05686) or [Quant-Noise](https://arxiv.org/abs/2004.07320)).

--- a/examples/movement-pruning/emmental/modules/binarizer.py
+++ b/examples/movement-pruning/emmental/modules/binarizer.py
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Binarizers take a (real value) matrice as input and produce a binary (values in {0,1}) mask of the same shape.
+Binarizers take a (real value) matrix as input and produce a binary (values in {0,1}) mask of the same shape.
 """
 import torch
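
The module docstring in the binarizer.py hunk above says a binarizer maps a real-valued matrix to a {0,1} mask of the same shape. Two simple rules are sketched below with nested lists instead of tensors. The function names and the exact rules are illustrative assumptions, not the module's implementation.

```python
def threshold_binarize(scores, threshold):
    """Mask entry is 1 where the score is strictly above the threshold."""
    return [[1 if s > threshold else 0 for s in row] for row in scores]


def topk_binarize(scores, keep_fraction):
    """Mask entry is 1 for the top keep_fraction of scores over the whole matrix."""
    flat = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    k = int(keep_fraction * len(flat))
    mask = [[0] * len(row) for row in scores]
    # Sort by score, highest first, and switch on the k best positions.
    for _, r, c in sorted(flat, key=lambda item: item[0], reverse=True)[:k]:
        mask[r][c] = 1
    return mask


scores = [[0.1, 0.9], [0.5, 0.3]]
print(threshold_binarize(scores, 0.4))
print(topk_binarize(scores, 0.5))
```

A threshold rule lets the sparsity level vary with the scores, while a top-k rule fixes the fraction of kept weights in advance.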

--- a/model_cards/KB/albert-base-swedish-cased-alpha/README.md
+++ b/model_cards/KB/albert-base-swedish-cased-alpha/README.md
@@ -4,7 +4,7 @@ language: sv
 # Swedish BERT Models
-The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on aproximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
+The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
 The following three models are currently available:
@@ -86,7 +86,7 @@ for token in nlp(text):
 print(l)
 ```
-Which should result in the following (though less cleanly formated):
+Which should result in the following (though less cleanly formatted):
 ```python
 [ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
@@ -104,7 +104,7 @@ Which should result in the following (though less cleanly formated):
 ### ALBERT base
-The easisest way to do this is, again, using Huggingface Transformers:
+The easiest way to do this is, again, using Huggingface Transformers:
 ```python
 from transformers import AutoModel,AutoTokenizer

--- a/model_cards/KB/bert-base-swedish-cased-ner/README.md
+++ b/model_cards/KB/bert-base-swedish-cased-ner/README.md
@@ -4,7 +4,7 @@ language: sv
 # Swedish BERT Models
-The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on aproximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
+The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
 The following three models are currently available:
@@ -86,7 +86,7 @@ for token in nlp(text):
 print(l)
 ```
-Which should result in the following (though less cleanly formated):
+Which should result in the following (though less cleanly formatted):
 ```python
 [ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
@@ -104,7 +104,7 @@ Which should result in the following (though less cleanly formated):
 ### ALBERT base
-The easisest way to do this is, again, using Huggingface Transformers:
+The easiest way to do this is, again, using Huggingface Transformers:
 ```python
 from transformers import AutoModel,AutoTokenizer

--- a/model_cards/elgeish/cs224n-squad2.0-albert-base-v2/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-albert-base-v2/README.md
@@ -4,7 +4,7 @@ tags:
 ---
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,

--- a/model_cards/elgeish/cs224n-squad2.0-albert-xxlarge-v1/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-albert-xxlarge-v1/README.md
@@ -4,7 +4,7 @@ tags:
 ---
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,

--- a/model_cards/elgeish/cs224n-squad2.0-distilbert-base-uncased/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-distilbert-base-uncased/README.md
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,

--- a/model_cards/elgeish/cs224n-squad2.0-roberta-base/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-roberta-base/README.md
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,

--- a/model_cards/mrm8488/t5-base-finetuned-e2m-intent/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-e2m-intent/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the downstream task (Intent Prediction) - Dataset 📚 
-Dataset ID: ```event2Mind``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```event2Mind``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |

--- a/model_cards/mrm8488/t5-base-finetuned-question-generation-ap/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-question-generation-ap/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
-Dataset ID: ```squad``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |

--- a/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
-Dataset ID: ```squad_v2``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad_v2``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |