This model is case-sensitive: it makes a difference between english and English.
All the training details on the pre-training, uses, limitations and potential biases are the same as for [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased).
We highly encourage you to check that model card if you want to know more.
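As a quick, hedged illustration of the casing behaviour (assuming the `distilbert-base-cased` checkpoint and the `transformers` library), the tokenizer does not lower-case its input, so the two spellings are handled differently:

```python
from transformers import AutoTokenizer

# Minimal sketch: the cased tokenizer keeps "english" and "English" distinct.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
print(tokenizer.tokenize("english"))
print(tokenizer.tokenize("English"))
# An uncased checkpoint would lower-case both strings and produce identical tokens.
```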
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
This model is a fine-tuned checkpoint of [DistilBERT-base-cased](https://huggingface.co/distilbert-base-cased), fine-tuned using (a second step of) knowledge distillation on SQuAD v1.1.
This model reaches an F1 score of 87.1 on the dev set (for comparison, the BERT bert-base-cased version reaches an F1 score of 88.7).
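For reference, a minimal question-answering sketch with this checkpoint could look like the following (the question and context strings are illustrative; we assume the `transformers` question-answering pipeline):

```python
from transformers import pipeline

# Extractive QA pipeline backed by the distilled SQuAD checkpoint.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What was DistilBERT distilled from?",
    context="DistilBERT is a smaller Transformer model distilled from BERT.",
)
print(result["answer"], result["score"])
```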
This model is a distilled version of the [BERT base multilingual model](bert-base-multilingual-cased). The code for the distillation process can be found
[here](https://github.com/huggingface/transformers/tree/master/examples/distillation). This model is case-sensitive: it makes a difference between english and English.
The model is trained on the concatenation of Wikipedia in 104 different languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).
The model has 6 layers, a hidden dimension of 768, and 12 heads, totaling 134M parameters (compared to 177M parameters for mBERT-base).
On average, DistilmBERT is twice as fast as mBERT-base.
We encourage you to check the [BERT base multilingual model](bert-base-multilingual-cased) to learn more about usage, limitations and potential biases.
| Model | English | Spanish | Chinese | German | Arabic | Urdu |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
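As a usage sketch (assuming the `distilbert-base-multilingual-cased` model id and the `transformers` fill-mask pipeline), the same checkpoint predicts masked tokens across its training languages:

```python
from transformers import pipeline

# One multilingual checkpoint covers all 104 training languages.
unmasker = pipeline("fill-mask", model="distilbert-base-multilingual-cased")

print(unmasker("Paris is the capital of [MASK]."))   # English prompt
print(unmasker("París es la capital de [MASK]."))    # Spanish prompt
```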
This model is a fine-tuned checkpoint of [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased), fine-tuned using (a second step of) knowledge distillation on SQuAD v1.1.
This model reaches an F1 score of 86.9 on the dev set (for comparison, the BERT bert-base-uncased version reaches an F1 score of 88.5).
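Below is a hedged sketch of running this checkpoint without the pipeline abstraction, decoding the span between the highest-scoring start and end logits (the question and context strings are illustrative):

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_id = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

question = "How many parameters does DistilBERT have?"
context = "DistilBERT has 66 million parameters, about 40% fewer than BERT-base."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start and end positions and decode the answer span.
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits)
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```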
DistilGPT2 is an English language model pretrained with the supervision of [GPT2](https://huggingface.co/gpt2) (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, a hidden dimension of 768, and 12 heads, totaling 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is twice as fast as GPT2.
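A minimal text-generation sketch with this checkpoint (assuming the `distilgpt2` model id; the prompt and sampling settings are illustrative):

```python
from transformers import pipeline, set_seed

# Text generation with the distilled GPT2 checkpoint.
generator = pipeline("text-generation", model="distilgpt2")
set_seed(42)  # make the sampled continuations reproducible
print(generator("Knowledge distillation is", max_length=30, do_sample=True, num_return_sequences=2))
```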
On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity of 16.3 on the test set, compared to 21.1 for DistilGPT2 (after fine-tuning on the train set).
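As a rough illustration of how a perplexity number is obtained (a minimal sketch over a single sample string, not the full WikiText-103 evaluation protocol), one can exponentiate the model's average cross-entropy loss:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print("perplexity:", math.exp(loss.item()))
```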
We encourage you to check [GPT2](https://huggingface.co/gpt2) to learn more about usage, limitations and potential biases.
This model is a distilled version of the [RoBERTa-base model](https://huggingface.co/roberta-base). It follows the same training procedure as [DistilBERT](https://huggingface.co/distilbert-base-uncased).
The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/distillation).
This model is case-sensitive: it makes a difference between english and English.
The model has 6 layers, a hidden dimension of 768, and 12 heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base).
On average, DistilRoBERTa is twice as fast as RoBERTa-base.
We encourage you to check the [RoBERTa-base model](https://huggingface.co/roberta-base) to learn more about usage, limitations and potential biases.
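A minimal fill-mask sketch with this checkpoint (assuming the `distilroberta-base` model id; note that RoBERTa-style tokenizers use `<mask>` rather than `[MASK]`):

```python
from transformers import pipeline

# Fill-mask with the distilled RoBERTa checkpoint; the mask token is "<mask>".
unmasker = pipeline("fill-mask", model="distilroberta-base")
print(unmasker("The capital of France is <mask>."))
```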
## Training data
DistilRoBERTa was pre-trained on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset (roughly 4 times less training data than was used for the teacher RoBERTa).
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
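For context on what such fine-tuning could look like, here is a hedged sketch using the `Trainer` API on an SST-2-style sentence-classification task; the dataset choice, hyperparameters, and output directory are illustrative assumptions, not the configuration used for any reported results:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# SST-2 from GLUE as an example downstream task (illustrative choice).
dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="distilroberta-sst2",  # hypothetical output path
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```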