[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)

* rm all model cards
* Update the .rst
  @sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler
* Add a rootlevel README.md with simple instructions/context
* Update docs/source/model_sharing.rst
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* make style
* rm all model cards

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Unverified Commit 3552d0e0 authored Dec 12, 2020 by Julien Chaumond Committed by GitHub Dec 11, 2020
20 changed files
--- a/model_cards/severinsimmler/literary-german-bert/README.md
+++ b/model_cards/severinsimmler/literary-german-bert/README.md
---
-language: de
-thumbnail: kfold.png
---
-# German BERT for literary texts
-This German BERT is based on `bert-base-german-dbmdz-cased`, and has been adapted to the domain of literary texts by fine-tuning the language modeling task on the [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1). Afterwards the model was fine-tuned for named entity recognition on the [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) corpus, so you can use it to recognize protagonists in German novels.
-# Stats
-## Language modeling
-The [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1) consists of 3,194 documents with 203,516,988 tokens or 1,520,855 types. The publication year of the texts ranges from the 18th to the 20th century:
-![years](prosa-jahre.png)
-### Results
-After one epoch:
-| Model            | Perplexity |
-| ---------------- | ---------- |
-| Vanilla BERT     | 6.82       |
-| Fine-tuned BERT  | 4.98       |
-## Named entity recognition
-The provided model was also fine-tuned for two epochs on 10,799 sentences for training, validated on 547 and tested on 1,845 with three labels: `B-PER`, `I-PER` and `O`.
-## Results
-| Dataset | Precision | Recall | F1   |
-| ------- | --------- | ------ | ---- |
-| Dev     | 96.4      | 87.3   | 91.6 |
-| Test    | 92.8      | 94.9   | 93.8 |
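As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values; the helper name `f1_score` is illustrative and does not come from the original card.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the NER results table.
dev_f1 = f1_score(96.4, 87.3)
test_f1 = f1_score(92.8, 94.9)

print(f"Dev F1:  {dev_f1:.1f}")
print(f"Test F1: {test_f1:.1f}")
```

Rounded to one decimal, both values agree with the F1 column of the table.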
-The model has also been evaluated using 10-fold cross validation and compared with a classic Conditional Random Field baseline described in [Jannidis et al.](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf) (2015):
-![kfold](kfold.png)
-# References
-Markus Krug, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, Fotis Jannidis, [Description of a Corpus of Character References in German Novels](http://webdoc.sub.gwdg.de/pub/mon/dariah-de/dwp-2018-27.pdf), 2018.
-Fotis Jannidis, Isabella Reger, Lukas Weimer, Markus Krug, Martin Toepfer, Frank Puppe, [Automatische Erkennung von Figuren in deutschsprachigen Romanen](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf), 2015.
--- a/model_cards/severinsimmler/literary-german-bert/kfold.png
+++ b/model_cards/severinsimmler/literary-german-bert/kfold.png
--- a/model_cards/severinsimmler/literary-german-bert/prosa-jahre.png
+++ b/model_cards/severinsimmler/literary-german-bert/prosa-jahre.png
--- a/model_cards/seyonec/ChemBERTa-zinc-base-v1/README.md
+++ b/model_cards/seyonec/ChemBERTa-zinc-base-v1/README.md
---
-tags: 
- chemistry
---
-# ChemBERTa: Training a BERT-like transformer model for masked language modelling of chemical SMILES strings.
-Deep learning for chemistry and materials science remains a novel field with lots of potential. However, the transfer learning methods that are popular in areas such as NLP and computer vision have not yet been effectively developed for machine learning in computational chemistry. Using HuggingFace's suite of models and the ByteLevel tokenizer, we are able to train on a large corpus of 100k SMILES strings from a commonly known benchmark dataset, ZINC.
-After training RoBERTa for 5 epochs, the model achieves a loss of 0.398, which would likely continue to decline if the model were trained for more epochs. The model can predict tokens within a SMILES sequence/molecule, allowing for variants of a molecule within discoverable chemical space to be predicted.
-By applying the representations of functional groups and atoms learned by the model, we can try to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT. Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.
-Additionally, previous research has found visualization of the attention mechanism to be incredibly valuable for chemical reaction classification. Open-sourcing large-scale transformer models such as RoBERTa with HuggingFace may accelerate these individual research directions.
-A link to a repository which includes the training, uploading and evaluation notebook (with sample predictions on compounds such as Remdesivir) can be found [here](https://github.com/seyonechithrananda/bert-loves-chemistry). All of the notebooks can be copied into a new Colab runtime for easy execution.
-Thanks for checking this out!
- Seyone
--- a/model_cards/shoarora/alectra-small-owt/README.md
+++ b/model_cards/shoarora/alectra-small-owt/README.md
-# ALECTRA-small-OWT
-This is an extension of
-[ELECTRA](https://openreview.net/forum?id=r1xMH1BtvB) small model, trained on the
-[OpenWebText corpus](https://skylion007.github.io/OpenWebTextCorpus/).
-The training task (discriminative LM / replaced-token-detection) can be generalized to any transformer type.  Here, we train an ALBERT model under the same scheme.
-## Pretraining task
-![electra task diagram](https://github.com/shoarora/lmtuners/raw/master/assets/electra.png)
-(figure from [Clark et al. 2020](https://openreview.net/pdf?id=r1xMH1BtvB))
-ELECTRA uses discriminative LM / replaced-token-detection for pretraining.
-This involves a generator (a Masked LM model) creating examples for a discriminator
-to classify as original or replaced for each token.
-The generator generalizes to any `*ForMaskedLM` model and the discriminator could be
-any `*ForTokenClassification` model.  Therefore, we can extend the task to ALBERT models,
-not just BERT as in the original paper.
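The replaced-token-detection setup described above can be sketched in a few lines. This is a toy illustration only, not the training code from `lmtuners`: the "generator" is replaced by uniform random sampling from a small vocabulary, and the names `make_rtd_example`, `replace_prob`, and `vocab` are hypothetical.

```python
import random


def make_rtd_example(tokens, vocab, replace_prob=0.15, seed=0):
    """Build one replaced-token-detection example.

    A random sampler stands in for the masked-LM generator (an assumption
    made for illustration). Each selected position receives a sampled
    token; the discriminator label is 1 only if the sampled token differs
    from the original, and 0 otherwise.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < replace_prob:
            sampled = rng.choice(vocab)  # stand-in for a generator prediction
            corrupted.append(sampled)
            labels.append(int(sampled != token))
        else:
            corrupted.append(token)
            labels.append(0)
    return corrupted, labels


tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
corrupted, labels = make_rtd_example(tokens, vocab, replace_prob=0.5)
print(corrupted)
print(labels)
```

Note that a position where the sampler happens to reproduce the original token is labeled as original, which is why the label is computed by comparison rather than from the selection mask.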
-## Usage
-```python
-from transformers import AlbertForSequenceClassification, BertTokenizer
-# Both models use the bert-base-uncased tokenizer and vocab.
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-alectra = AlbertForSequenceClassification.from_pretrained('shoarora/alectra-small-owt')
-```
-NOTE: this ALBERT model uses a BERT WordPiece tokenizer.
-## Code
-The pytorch module that implements this task is available [here](https://github.com/shoarora/lmtuners/blob/master/lmtuners/lightning_modules/discriminative_lm.py).
-Further implementation information [here](https://github.com/shoarora/lmtuners/tree/master/experiments/disc_lm_small),
-and [here](https://github.com/shoarora/lmtuners/blob/master/experiments/disc_lm_small/train_alectra_small.py) is the script that created this model.
-This specific model was trained with the following params:
-- `batch_size: 512`
-- `training_steps: 5e5`
-- `warmup_steps: 4e4`
-- `learning_rate: 2e-3`
-## Downstream tasks
-#### GLUE Dev results
-| Model                    | # Params | CoLA | SST | MRPC | STS  | QQP  | MNLI | QNLI | RTE |
-| ---                      | ---      | ---  | --- | ---  | ---  | ---  | ---  | ---  | --- |
-| ELECTRA-Small++          | 14M      | 57.0 | 91. | 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
-| ELECTRA-Small-OWT        | 14M      | 56.8 | 88.3| 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5|
-| ELECTRA-Small-OWT (ours) | 17M      | 56.3 | 88.4| 75.0 | 86.1 | 89.1 | 77.9 | 83.0 | 67.1|
-| ALECTRA-Small-OWT (ours) |  4M      | 50.6 | 89.1| 86.3 | 87.2 | 89.1 | 78.2 | 85.9 | 69.6|
-#### GLUE Test results
-| Model                    | # Params | CoLA | SST | MRPC | STS  | QQP  | MNLI | QNLI | RTE |
-| ---                      | ---      | ---  | --- | ---  | ---  | ---  | ---  | ---  | --- |
-| BERT-Base                | 110M     | 52.1 | 93.5| 84.8 | 85.9 | 89.2 | 84.6 | 90.5 | 66.4|
-| GPT                      | 117M     | 45.4 | 91.3| 75.7 | 80.0 | 88.5 | 82.1 | 88.1 | 56.0|
-| ELECTRA-Small++          | 14M      | 57.0 | 91.2| 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
-| ELECTRA-Small-OWT (ours) | 17M      | 57.4 | 89.3| 76.2 | 81.9 | 87.5 | 78.1 | 82.4 | 68.1|
-| ALECTRA-Small-OWT (ours) |  4M      | 43.9 | 87.9| 82.1 | 82.0 | 87.6 | 77.9 | 85.8 | 67.5|
--- a/model_cards/shoarora/electra-small-owt/README.md
+++ b/model_cards/shoarora/electra-small-owt/README.md
-# ELECTRA-small-OWT
-This is an unofficial implementation of an
-[ELECTRA](https://openreview.net/forum?id=r1xMH1BtvB) small model, trained on the
-[OpenWebText corpus](https://skylion007.github.io/OpenWebTextCorpus/).
-Differences from official ELECTRA models:
- - we use a `BertForMaskedLM` as the generator and `BertForTokenClassification` as the discriminator
- - they use an embedding projection layer, but Bert doesn't have one
-## Pretraining task
-![electra task diagram](https://github.com/shoarora/lmtuners/raw/master/assets/electra.png)
-(figure from [Clark et al. 2020](https://openreview.net/pdf?id=r1xMH1BtvB))
-ELECTRA uses discriminative LM / replaced-token-detection for pretraining.
-This involves a generator (a Masked LM model) creating examples for a discriminator
-to classify as original or replaced for each token.
-## Usage
-```python
-from transformers import BertForSequenceClassification, BertTokenizer
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-electra = BertForSequenceClassification.from_pretrained('shoarora/electra-small-owt')
-```
-## Code
-The pytorch module that implements this task is available [here](https://github.com/shoarora/lmtuners/blob/master/lmtuners/lightning_modules/discriminative_lm.py).
-Further implementation information [here](https://github.com/shoarora/lmtuners/tree/master/experiments/disc_lm_small),
-and [here](https://github.com/shoarora/lmtuners/blob/master/experiments/disc_lm_small/train_electra_small.py) is the script that created this model.
-This specific model was trained with the following params:
-- `batch_size: 512`
-- `training_steps: 5e5`
-- `warmup_steps: 4e4`
-- `learning_rate: 2e-3`
-## Downstream tasks
-#### GLUE Dev results
-| Model                    | # Params | CoLA | SST | MRPC | STS  | QQP  | MNLI | QNLI | RTE |
-| ---                      | ---      | ---  | --- | ---  | ---  | ---  | ---  | ---  | --- |
-| ELECTRA-Small++          | 14M      | 57.0 | 91. | 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
-| ELECTRA-Small-OWT        | 14M      | 56.8 | 88.3| 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5|
-| ELECTRA-Small-OWT (ours) | 17M      | 56.3 | 88.4| 75.0 | 86.1 | 89.1 | 77.9 | 83.0 | 67.1|
-| ALECTRA-Small-OWT (ours) |  4M      | 50.6 | 89.1| 86.3 | 87.2 | 89.1 | 78.2 | 85.9 | 69.6|
- Table initialized from [ELECTRA github repo](https://github.com/google-research/electra)
-#### GLUE Test results
-| Model                    | # Params | CoLA | SST | MRPC | STS  | QQP  | MNLI | QNLI | RTE |
-| ---                      | ---      | ---  | --- | ---  | ---  | ---  | ---  | ---  | --- |
-| BERT-Base                | 110M     | 52.1 | 93.5| 84.8 | 85.9 | 89.2 | 84.6 | 90.5 | 66.4|
-| GPT                      | 117M     | 45.4 | 91.3| 75.7 | 80.0 | 88.5 | 82.1 | 88.1 | 56.0|
-| ELECTRA-Small++          | 14M      | 57.0 | 91.2| 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
-| ELECTRA-Small-OWT (ours) | 17M      | 57.4 | 89.3| 76.2 | 81.9 | 87.5 | 78.1 | 82.4 | 68.1|
-| ALECTRA-Small-OWT (ours) |  4M      | 43.9 | 87.9| 82.1 | 82.0 | 87.6 | 77.9 | 85.8 | 67.5|
--- a/model_cards/shrugging-grace/tweetclassifier/README.md
+++ b/model_cards/shrugging-grace/tweetclassifier/README.md
-# shrugging-grace/tweetclassifier
-## Model description
-This model classifies tweets as either relating to the Covid-19 pandemic or not. 
-## Intended uses & limitations
-It is intended to be used on tweets commenting on UK politics, in particular those trending with the #PMQs hashtag, which refers to the weekly Prime Minister's Questions.  
-#### How to use
-``LABEL_0`` means that the tweet relates to Covid-19
-``LABEL_1`` means that the tweet does not relate to Covid-19
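To make the label mapping above easier to consume, a small helper can translate the raw labels into readable text. This is a minimal sketch: it assumes predictions arrive as dictionaries with `label` and `score` keys (the shape returned by a `transformers` text-classification pipeline), and the names `LABEL_MEANINGS` and `describe_prediction`, as well as the example score, are made up for illustration.

```python
# Mapping taken from the card: LABEL_0 = Covid-19 related, LABEL_1 = not related.
LABEL_MEANINGS = {
    "LABEL_0": "relates to Covid-19",
    "LABEL_1": "does not relate to Covid-19",
}


def describe_prediction(prediction: dict) -> str:
    """Turn a {'label': ..., 'score': ...} prediction into readable text."""
    meaning = LABEL_MEANINGS[prediction["label"]]
    return f"{meaning} (score={prediction['score']:.2f})"


# Hand-written example input, not real model output.
example = {"label": "LABEL_0", "score": 0.93}
print(describe_prediction(example))
```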
-## Training data
-The model was trained on 1000 tweets (with the "#PMQs" hashtag), which were manually labeled by the author. The tweets were collected between May and July 2020. 
-### BibTeX entry and citation info
-This was based on a pretrained version of BERT. 
-@article{devlin2018bert,
-  title={Bert: Pre-training of deep bidirectional transformers for language understanding},
-  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
-  journal={arXiv preprint arXiv:1810.04805},
-  year={2018}
-}
--- a/model_cards/smanjil/German-MedBERT/README.md
+++ b/model_cards/smanjil/German-MedBERT/README.md
---
-language: de
-tags: 
- exbert
- German
---
-<a href="https://huggingface.co/exbert/?model=smanjil/German-MedBERT">
-	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
-</a>
-# German Medical BERT
-This is a model fine-tuned on the medical domain for the German language, based on German BERT. It has only been trained to improve on the target task (masked language modeling). It can later be fine-tuned for a downstream task of your choice; I used it for the NTS-ICD-10 text classification task.
-## Overview
-**Language model:** bert-base-german-cased
-**Language:** German
-**Fine-tuning:** Medical articles (diseases, symptoms, therapies, etc.)
-**Eval data:** NTS-ICD-10 dataset (Classification)
-**Infrastructure:** Google Colab
-## Details
-- We fine-tuned using PyTorch with the Hugging Face library on a Colab GPU.
-- We used the standard fine-tuning parameter settings from the original BERT paper.
-- For classification, however, we had to train for up to 25 epochs.
-## Performance (Micro precision, recall and f1 score for multilabel code classification)
-|Models			|P	|R	|F1	|
-|:--------------	|:------|:------|:------|
-|German BERT		|86.04	|75.82	|80.60	|
-|German MedBERT-256	|87.41	|77.97	|82.42	|
-|German MedBERT-512	|87.75	|78.26	|82.73	|
-## Author
-Manjil Shrestha: `shresthamanjil21 [at] gmail.com`
-Get in touch:
-[LinkedIn](https://www.linkedin.com/in/manjil-shrestha-038527b4/)
--- a/model_cards/spentaur/yelp/README.md
+++ b/model_cards/spentaur/yelp/README.md
-# DistilBERT Yelp Review Sentiment
-This model is used for sentiment analysis on English Yelp reviews.  
-It is a DistilBERT model trained on 1 million reviews from the Yelp Open Dataset.  
-It is a regression model, with outputs in the range of ~-2 to ~2, where -2 corresponds to 1 star and 2 to 5 stars.  
-It was trained using [ktrain](https://github.com/amaiya/ktrain) because of its ease of use.
-Example use:
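-```python
-from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
-# The model reuses the distilbert-base-uncased tokenizer.
-tokenizer = AutoTokenizer.from_pretrained(
-    'distilbert-base-uncased', use_fast=True)
-model = TFAutoModelForSequenceClassification.from_pretrained(
-    "spentaur/yelp")
-review = "This place is great!"
-input_ids = tokenizer.encode(review, return_tensors='tf')
-# The first output holds the logits; the single regression value is at [0][0].
-pred = model(input_ids)[0][0][0].numpy()
-# pred should be approximately 1.9562385
-```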
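The card states only the two endpoints of the output scale (-2 for 1 star, 2 for 5 stars). As a rough sketch, and assuming the scale is linear in between (an assumption, not something the card confirms), a raw score could be converted to a star rating as follows; `score_to_stars` is a hypothetical helper name.

```python
def score_to_stars(score: float) -> float:
    """Map a regression score in roughly [-2, 2] to a 1-5 star scale.

    Assumes a linear scale (stars = score + 3) and clips values that
    fall slightly outside the nominal range.
    """
    stars = score + 3.0
    return min(5.0, max(1.0, stars))


# Example score taken from the usage snippet in the card.
print(round(score_to_stars(1.9562385)))
```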
--- a/model_cards/squeezebert/squeezebert-mnli-headless/README.md
+++ b/model_cards/squeezebert/squeezebert-mnli-headless/README.md
-language: en
-license: bsd
-datasets:
- bookcorpus
- wikipedia
---
-# SqueezeBERT pretrained model
-This model, `squeezebert-mnli-headless`, has been pretrained for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective and finetuned on the [Multi-Genre Natural Language Inference (MNLI)](https://cims.nyu.edu/~sbowman/multinli/) dataset. This is a "headless" model with the final classification layer removed, and this will allow Transformers to automatically reinitialize the final classification layer before you begin finetuning on your data.
-SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
-The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
-## Pretraining
-### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
-### Pretraining procedure
-The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
-(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
-From the SqueezeBERT paper:
-> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
-## Finetuning
-The SqueezeBERT paper presents 2 approaches to finetuning the model:
-- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
-- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
-A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
-Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
-This model, `squeezebert/squeezebert-mnli-headless`, is the "finetuned with bells and whistles" MNLI-finetuned SqueezeBERT model. In this particular model, we have removed the final classification layer -- in other words, it is "headless." We recommend using this model if you intend to finetune the model on your own data. Using this model means that your final layer will automatically be reinitialized when you start finetuning on your data.
-### How to finetune
-To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
-```
-./utils/download_glue_data.py
-python examples/text-classification/run_glue.py \
-    --model_name_or_path squeezebert/squeezebert-mnli-headless \
-    --task_name mrpc \
-    --data_dir ./glue_data/MRPC \
-    --output_dir ./models/squeezebert_mrpc \
-    --overwrite_output_dir \
-    --do_train \
-    --do_eval \
-    --num_train_epochs 10 \
-    --learning_rate 3e-05 \
-    --per_device_train_batch_size 16 \
-    --save_steps 20000
-```
-## BibTeX entry and citation info
-```
-@article{2020_SqueezeBERT,
-     author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
-     title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
-     journal = {arXiv:2006.11316},
-     year = {2020}
-}
-```
--- a/model_cards/squeezebert/squeezebert-mnli/README.md
+++ b/model_cards/squeezebert/squeezebert-mnli/README.md
-language: en
-license: bsd
-datasets:
- bookcorpus
- wikipedia
---
-# SqueezeBERT pretrained model
-This model, `squeezebert-mnli`, has been pretrained for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective and finetuned on the [Multi-Genre Natural Language Inference (MNLI)](https://cims.nyu.edu/~sbowman/multinli/) dataset.
-SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
-The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
-## Pretraining
-### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
-### Pretraining procedure
-The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
-(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
-From the SqueezeBERT paper:
-> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
-## Finetuning
-The SqueezeBERT paper presents 2 approaches to finetuning the model:
-- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
-- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
-A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
-Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
-This model, `squeezebert/squeezebert-mnli`, is the "trained with bells and whistles" MNLI-finetuned SqueezeBERT model.
-### How to finetune
-To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
-```
-./utils/download_glue_data.py
-python examples/text-classification/run_glue.py \
-    --model_name_or_path squeezebert/squeezebert-mnli-headless \
-    --task_name mrpc \
-    --data_dir ./glue_data/MRPC \
-    --output_dir ./models/squeezebert_mrpc \
-    --overwrite_output_dir \
-    --do_train \
-    --do_eval \
-    --num_train_epochs 10 \
-    --learning_rate 3e-05 \
-    --per_device_train_batch_size 16 \
-    --save_steps 20000
-```
-## BibTeX entry and citation info
-```
-@article{2020_SqueezeBERT,
-     author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
-     title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
-     journal = {arXiv:2006.11316},
-     year = {2020}
-}
-```
--- a/model_cards/squeezebert/squeezebert-uncased/README.md
+++ b/model_cards/squeezebert/squeezebert-uncased/README.md
-language: en
-license: bsd
-datasets:
- bookcorpus
- wikipedia
---
-# SqueezeBERT pretrained model
-This model, `squeezebert-uncased`, is a pretrained model for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective.
-SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
-The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
-## Pretraining
-### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
-### Pretraining procedure
-The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
-(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
-From the SqueezeBERT paper:
-> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
-## Finetuning
-The SqueezeBERT paper presents 2 approaches to finetuning the model:
-- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
-- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
-A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
-Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
-This model, `squeezebert/squeezebert-uncased`, has been pretrained but not finetuned. For most text classification tasks, we recommend using squeezebert-mnli-headless as a starting point.
-### How to finetune
-To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
-```
-./utils/download_glue_data.py
-python examples/text-classification/run_glue.py \
-    --model_name_or_path squeezebert/squeezebert-mnli-headless \
-    --task_name mrpc \
-    --data_dir ./glue_data/MRPC \
-    --output_dir ./models/squeezebert_mrpc \
-    --overwrite_output_dir \
-    --do_train \
-    --do_eval \
-    --num_train_epochs 10 \
-    --learning_rate 3e-05 \
-    --per_device_train_batch_size 16 \
-    --save_steps 20000
-```
-## BibTeX entry and citation info
-```
-@article{2020_SqueezeBERT,
-     author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
-     title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
-     journal = {arXiv:2006.11316},
-     year = {2020}
-}
-```
--- a/model_cards/stas/tiny-wmt19-en-de/README.md
+++ b/model_cards/stas/tiny-wmt19-en-de/README.md
---
-language:
- en
- de
-thumbnail:
-tags:
- wmt19
- testing
-license: apache-2.0
-datasets:
- wmt19
-metrics:
- bleu
---
-# Tiny FSMT
-This is a tiny model that is used in the `transformers` test suite. It doesn't do anything useful, other than testing that `FSMT` works.
--- a/model_cards/stevhliu/astroGPT/README.md
+++ b/model_cards/stevhliu/astroGPT/README.md
---
-language: "en"
-thumbnail: "https://raw.githubusercontent.com/stevhliu/satsuma/master/images/astroGPT-thumbnail.png"
-widget:
- text: "Jan 18, 2020"
- text: "Feb 14, 2020"
- text: "Jul 04, 2020"
---
-# astroGPT 🪐
-## Model description
-This is a GPT-2 model fine-tuned on Western zodiac signs. For more information about GPT-2, take a look at 🤗 Hugging Face's GPT-2 [model card](https://huggingface.co/gpt2). You can use astroGPT to generate a daily horoscope by entering the current date.
-## How to use
-To use this model, simply enter the current date like so `Mon DD, YEAR`:
-```python
-from transformers import AutoTokenizer, AutoModelWithLMHead
-tokenizer = AutoTokenizer.from_pretrained("stevhliu/astroGPT")
-model = AutoModelWithLMHead.from_pretrained("stevhliu/astroGPT")
-# Keep the inputs on the same device as the model (CPU by default).
-input_ids = tokenizer.encode('Sep 03, 2020', return_tensors='pt')
-sample_output = model.generate(input_ids,
-                               do_sample=True, 
-                               max_length=75,
-                               top_k=20, 
-                               top_p=0.97)
-# Decode the generated token ids back into text.
-print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
-```
-## Limitations and bias
-astroGPT inherits the same biases that affect GPT-2 as a result of training on a lot of non-neutral content on the internet. The model does not currently support zodiac sign-specific generation and only returns a general horoscope. While the generated text may occasionally mention a specific zodiac sign, this is due to how the horoscopes were originally written by their human authors.
-## Data
-The data was scraped from [Horoscope.com](https://www.horoscope.com/us/index.aspx), and the model was trained on 4.7MB of text. The text was collected from four categories (daily, love, wellness, career) and spans 09/01/19 to 08/01/2020. The archives only store horoscopes dating a year back from the current date.
-## Training and results
-The text was tokenized using the fast GPT-2 BPE [tokenizer](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizerfast). It has a vocabulary size of 50,257 and a sequence length of 1024 tokens. The model was trained on one of Google Colaboratory's GPUs for approximately 2.5 hrs with [fastai's](https://docs.fast.ai/) learning rate finder, discriminative learning rates and 1cycle policy. See the table below for a quick summary of the training procedure and results.
-| dataset size  | epochs | lr                | training time | train_loss | valid_loss | perplexity | 
-|:-------------:|:------:|:-----------------:|:-------------:|:----------:|:----------:|:----------:|
-| 5.9MB         |32      | slice(1e-7,1e-5)  | 2.5 hrs       | 2.657170   | 2.642387   | 14.046692	|
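The perplexity in the table above appears consistent with the usual definition for language models, the exponential of the cross-entropy loss. The short check below assumes the loss is measured in nats and that the reported perplexity was derived from the validation loss; neither is stated explicitly in the card.

```python
import math

# Validation loss copied from the training summary table.
valid_loss = 2.642387

# Perplexity as exp(cross-entropy loss), assuming the loss is in nats.
perplexity = math.exp(valid_loss)
print(f"{perplexity:.4f}")
```

The result agrees with the reported 14.046692 to within rounding of the loss value.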
--- a/model_cards/surajp/RoBERTa-hindi-guj-san/README.md
+++ b/model_cards/surajp/RoBERTa-hindi-guj-san/README.md
---
-language:
- hi
- sa
- gu
-tags:
- Indic
-license: mit
-datasets:
- Wikipedia (Hindi, Sanskrit, Gujarati)
-metrics:
- perplexity
---
-# RoBERTa-hindi-guj-san
-## Model description
-Multilingual RoBERTa-like model trained on Wikipedia articles in Hindi, Sanskrit and Gujarati. The tokenizer was trained on the combined text. 
-However, only Hindi text was used to pre-train the model, which was then fine-tuned on the combined Sanskrit and Gujarati text, in the hope that pre-training on Hindi 
-would help the model learn similar languages.
-### Configuration
-| Parameter | Value |
-|---|---|
-| `hidden_size` | 768 |
-| `num_attention_heads` | 12 |
-| `num_hidden_layers` | 6 |
-| `vocab_size` | 30522 |
-|`model_type`|`roberta`|
-## Intended uses & limitations
-#### How to use
-```python
-# Example usage
-from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
-tokenizer = AutoTokenizer.from_pretrained("surajp/RoBERTa-hindi-guj-san")
-model = AutoModelWithLMHead.from_pretrained("surajp/RoBERTa-hindi-guj-san")
-fill_mask = pipeline(
-    "fill-mask",
-    model=model,
-    tokenizer=tokenizer
-)
-# Sanskrit: इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।
-# Hindi:  अगर आप अब अभ्यास नहीं करते हो तो आप अपने परीक्षा में मूर्खतापूर्ण गलतियाँ करोगे।
-# Gujarati: ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો <mask> હતો.
-fill_mask("ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો <mask> હતો.")
-'''
-Output:
--------
-[
-{'score': 0.07849744707345963, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો જ હતો.</s>', 'token': 390},
-{'score': 0.06273336708545685, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો ન હતો.</s>', 'token': 478},
-{'score': 0.05160355195403099, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો થઇ હતો.</s>', 'token': 2075},
-{'score': 0.04751499369740486, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો એક હતો.</s>', 'token': 600},
-{'score': 0.03788900747895241, 'sequence': '<s> ગુજરાતમાં ૧૯મી માર્ચ સુધી કોઈ સકારાત્મક (પોઝીટીવ) રીપોર્ટ આવ્યો પણ હતો.</s>', 'token': 840}
-]
-'''
-```
-## Training data
-Cleaned Wikipedia articles in Hindi, Sanskrit and Gujarati, hosted on Kaggle. The datasets contain both training and evaluation text.
-Used in [iNLTK](https://github.com/goru001/inltk)
- [Hindi](https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k)
- [Gujarati](https://www.kaggle.com/disisbig/gujarati-wikipedia-articles)
- [Sanskrit](https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles)
-## Training procedure
- On TPU (using `xla_spawn.py`)
- For language modelling
- Iteratively increasing `--block_size` from 128 to 256 over epochs
- Tokenizer trained on combined text
- Pre-training with Hindi and fine-tuning on Sanskrit and Gujarati texts
-```
--model_type distillroberta-base \
--model_name_or_path "/content/SanHiGujBERTa" \
--mlm_probability 0.20 \
--line_by_line \
--save_total_limit 2 \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 128 \
--num_train_epochs 5 \
--block_size 256 \
--seed 108 \
--overwrite_output_dir \
-```
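To make the `--mlm_probability 0.20` flag above more concrete, here is a simplified, pure-Python sketch of masked-language-model corruption. It assumes the common BERT-style scheme (a selected token becomes the mask token 80% of the time, a random token 10% of the time, and stays unchanged otherwise). The function `mask_tokens` and its arguments are hypothetical names for illustration; the collator actually used in training works on tensors of token ids and may differ in detail.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", vocab=None, mlm_probability=0.20, seed=108):
    """Simplified masked-LM corruption (illustrative, not the library code).

    Each token is selected with probability `mlm_probability`. A selected
    token is replaced by `mask_token` 80% of the time, by a random vocabulary
    token 10% of the time, and left unchanged otherwise.
    Returns (corrupted_tokens, labels); labels hold the original token at
    selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    vocab = vocab or sorted(set(tokens))
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(mask_token)
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)  # position is ignored by the loss
            corrupted.append(tok)
    return corrupted, labels


# Dummy tokens stand in for real subword tokens.
tokens = ["tok%d" % i for i in range(1000)]
corrupted, labels = mask_tokens(tokens)
selected = sum(label is not None for label in labels)
print(f"selected {selected} of {len(tokens)} tokens for prediction")
```

With 1000 tokens, roughly 200 positions should be selected for prediction, and every unselected position is passed through unchanged.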
-## Eval results
-perplexity = 2.920005983224673
-> Created by [Suraj Parmar/@parmarsuraj99](https://twitter.com/parmarsuraj99) | [LinkedIn](https://www.linkedin.com/in/parmarsuraj99/)
-> Made with <span style="color: #e25555;">&hearts;</span> in India
--- a/model_cards/surajp/SanBERTa/README.md
+++ b/model_cards/surajp/SanBERTa/README.md
---
-language: sa
---
-# RoBERTa trained on Sanskrit (SanBERTa)
-**Model size** (after training): **340MB**
-### Dataset:
-[Wikipedia articles](https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles) (used in [iNLTK](https://github.com/goru001/nlp-for-sanskrit)).
-It contains an evaluation set.
-[Sanskrit scraps from CLTK](http://cltk.org/)
-### Configuration
-| Parameter | Value |
-|---|---|
-| `num_attention_heads` | 12 |
-| `num_hidden_layers` | 6 |
-| `hidden_size` | 768 |
-| `vocab_size` | 29407 |
-### Training :
- On TPU
- For language modelling
- Iteratively increasing `--block_size` from 128 to 256 over epochs
-### Evaluation
-|Metric| Value |
-|---|---|
-|Perplexity (`block_size=256`)|4.04|
-## Example of usage:
-### For Embeddings
-```
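-from transformers import AutoTokenizer, RobertaModel
-
-tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")
-model = RobertaModel.from_pretrained("surajp/SanBERTa")
-# Encode one sentence as a batch of size 1 (shape: [1, sequence_length])
-op = tokenizer.encode("इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।", return_tensors="pt")
-ps = model(op)
-# The first output is the last hidden state: [batch, sequence_length, hidden_size]
-ps[0].shape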
-```
-```
-'''
-Output:
--------
-torch.Size([1, 47, 768])
-```
-### For \<mask\> Prediction
-```
-from transformers import pipeline
-fill_mask = pipeline(
-    "fill-mask",
-    model="surajp/SanBERTa",
-    tokenizer="surajp/SanBERTa"
-)
-# Original sentence: इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।
-fill_mask("इयं भाषा न केवल<mask> भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।")
-```
-```
-'''
-Output:
--------
-[{'score': 0.7516744136810303,
-  'sequence': '<s> इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
-  'token': 280,
-  'token_str': 'à¤Ĥ'},
- {'score': 0.06230105459690094,
-  'sequence': '<s> इयं भाषा न केवली भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
-  'token': 289,
-  'token_str': 'à¥Ģ'},
- {'score': 0.055410224944353104,
-  'sequence': '<s> इयं भाषा न केवला भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
-  'token': 265,
-  'token_str': 'à¤¾'},
-  ...]
-```
-### It works!! 🎉 🎉 🎉
-> Created by [Suraj Parmar/@parmarsuraj99](https://twitter.com/parmarsuraj99) | [LinkedIn](https://www.linkedin.com/in/parmarsuraj99/)
-> Made with <span style="color: #e25555;">&hearts;</span> in India
--- a/model_cards/surajp/albert-base-sanskrit/README.md
+++ b/model_cards/surajp/albert-base-sanskrit/README.md
---
-language: sa
---
-# ALBERT-base-Sanskrit
-Explanation notebook (Colab): [SanskritALBERT.ipynb](https://colab.research.google.com/github/parmarsuraj99/suraj-parmar/blob/master/_notebooks/2020-05-02-SanskritALBERT.ipynb)
-Size of the model is **46MB**
-Example of usage:
-```
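-import torch
-from transformers import AutoTokenizer, AutoModel
-
-tokenizer = AutoTokenizer.from_pretrained("surajp/albert-base-sanskrit")
-model = AutoModel.from_pretrained("surajp/albert-base-sanskrit")
-enc = tokenizer.encode("ॐ सर्वे भवन्तु सुखिनः सर्वे सन्तु निरामयाः । सर्वे भद्राणि पश्यन्तु मा कश्चिद्दुःखभाग्भवेत् । ॐ शान्तिः शान्तिः शान्तिः ॥")
-print(tokenizer.decode(enc))
-# unsqueeze(1) yields shape [sequence_length, 1], so each token is fed as its own length-1 sequence;
-# use unsqueeze(0) instead to feed the sentence as a single batch of shape [1, sequence_length].
-ps = model(torch.tensor(enc).unsqueeze(1))
-print(ps[0].shape)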
-```
-```
-'''
-Output:
--------
-[CLS] ॐ सर्वे भवन्तु सुखिनः सर्वे सन्तु निरामयाः । सर्वे भद्राणि पश्यन्तु मा कश्चिद्दुःखभाग्भवेत् । ॐ शान्तिः शान्तिः शान्तिः ॥[SEP]
-torch.Size([28, 1, 768])
-```
-> Created by [Suraj Parmar/@parmarsuraj99](https://twitter.com/parmarsuraj99)
-> Made with <span style="color: #e25555;">&hearts;</span> in India
--- a/model_cards/t5-11b-README.md
+++ b/model_cards/t5-11b-README.md
---
-language: en
-datasets:
- c4
-tags:
- summarization
- translation
-license: apache-2.0
-inference: false
---
-## Disclaimer
-**Before `transformers` v3.5.0**, due to its immense size, `t5-11b` required some special treatment. 
-If you're using transformers `<= v3.4.0`, `t5-11b` should be loaded with flag `use_cdn` set to `False` as follows:
-```python
-t5 = transformers.T5ForConditionalGeneration.from_pretrained('t5-11b', use_cdn = False)
-```
-Secondly, a single GPU will most likely not have enough memory to load the model, as the weights alone amount to over 40 GB.
-Model parallelism has to be used to overcome this problem, as explained in this [PR](https://github.com/huggingface/transformers/pull/3578).
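The "over 40 GB" figure can be sanity-checked with a back-of-the-envelope estimate. This sketch assumes roughly 11 billion parameters (as the `11b` suffix suggests) stored as 32-bit floats; both numbers are assumptions for illustration, not values read from the checkpoint.

```python
# Assumptions: about 11 billion parameters, each stored as a 32-bit float.
num_parameters = 11e9
bytes_per_parameter = 4

# Size of the weights alone, ignoring activations and optimizer state.
weights_gb = num_parameters * bytes_per_parameter / 1e9
print(f"approximate weight size: {weights_gb:.0f} GB")
```

Activations, and optimizer state when fine-tuning, come on top of this, which is why splitting the model across several devices is needed.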
-## [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) 
-Pretraining Dataset: [C4](https://huggingface.co/datasets/c4)
-Other Community Checkpoints: [here](https://huggingface.co/models?search=t5)
-Paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
-Authors: *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu* 
-## Abstract
-Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
-![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
--- a/model_cards/t5-3b-README.md
+++ b/model_cards/t5-3b-README.md
---
-language: en
-datasets:
- c4
-tags:
- summarization
- translation
-license: apache-2.0
---
-[Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) 
-Pretraining Dataset: [C4](https://huggingface.co/datasets/c4)
-Other Community Checkpoints: [here](https://huggingface.co/models?search=t5)
-Paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
-Authors: *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu* 
-## Abstract
-Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
-![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
--- a/model_cards/t5-base-README.md
+++ b/model_cards/t5-base-README.md
---
-language: en
-datasets:
- c4
-tags:
- summarization
- translation
-license: apache-2.0
---
-[Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) 
-Pretraining Dataset: [C4](https://huggingface.co/datasets/c4)
-Other Community Checkpoints: [here](https://huggingface.co/models?search=t5)
-Paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
-Authors: *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu* 
-## Abstract
-Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
-![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)