[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)

* rm all model cards
* Update the .rst
@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler
* Add a rootlevel README.md with simple instructions/context
* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* make style
* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

3552d0e0 · Julien Chaumond · GitHub · 29e45979
Unverified Commit 3552d0e0 authored Dec 12, 2020 by Julien Chaumond Committed by GitHub Dec 11, 2020
20 changed files
--- a/model_cards/lvwerra/bert-imdb/README.md
+++ b/model_cards/lvwerra/bert-imdb/README.md
-# BERT-IMDB
-## What is it?
-BERT (`bert-large-cased`) trained for sentiment classification on the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).
-## Training setting
-The model was trained on 80% of the IMDB dataset for sentiment classification for three epochs with a learning rate of `1e-5` with the `simpletransformers` library. The library uses a learning rate schedule.
-## Result
-The model achieved 90% classification accuracy on the validation set.
-## Reference
-The full experiment is available in the [trl repo](https://lvwerra.github.io/trl/03-bert-imdb-training/).
--- a/model_cards/lvwerra/gpt2-imdb-ctrl/README.md
+++ b/model_cards/lvwerra/gpt2-imdb-ctrl/README.md
-# GPT2-IMDB-ctrl
-## What is it?
-A small GPT2 (`lvwerra/gpt2-imdb`) language model fine-tuned to produce controlled movie reviews based on the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The model is trained with rewards from a BERT sentiment classifier (`lvwerra/bert-imdb`) via PPO.
-## Training setting
-The model was trained for `200` optimisation steps with a batch size of `256` which corresponds to `51200` training samples. The full experiment setup can be found in the Jupyter notebook in the [trl repo](https://lvwerra.github.io/trl/05-gpt2-sentiment-ppo-training/). The strings `"[negative]"`, `"[neutral]"`, and `"[positive]"` are added at the beginning of the query to control the sentiment.
-## Examples
-A few examples of the model response to a query before and after optimisation:
-| query | response [negative] | rewards [negative] | response [neutral] | rewards [neutral] | response [positive] | rewards [positive] |
-|-------|---------------------|--------------------|--------------------|-------------------|---------------------|--------------------|
-|I watched this movie when|it was released and was awful. Little bit of ...|3.130034|it was released and it was the first movie I ...|-1.351991|I was younger it was wonderful. The new play ...|4.232218|
-|I can remember seeing this|movie in 2008, and I was so disappointed...yo...|3.428725|in support groups, which I think was not as i...|0.213288|movie, and it is one of my favorite movies ev...|4.168838|
-|This 1970 hit film has|little resonance. This movie is bad, not only...|4.241872|a bit of Rocket power.783287. It can be easil...|0.849278|the best formula for comedy and is't just jus...|4.208804|
--- a/model_cards/lvwerra/gpt2-imdb-pos/README.md
+++ b/model_cards/lvwerra/gpt2-imdb-pos/README.md
-# GPT2-IMDB-pos
-## What is it?
-A small GPT2 (`lvwerra/gpt2-imdb`) language model fine-tuned to produce positive movie reviews based on the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The model is trained with rewards from a BERT sentiment classifier (`lvwerra/bert-imdb`) via PPO.
-## Training setting
-The model was trained for `100` optimisation steps with a batch size of `256` which corresponds to `25600` training samples. The full experiment setup can be found in the Jupyter notebook in the [trl repo](https://lvwerra.github.io/trl/04-gpt2-sentiment-ppo-training/).
-## Examples
-A few examples of the model response to a query before and after optimisation:
-| query | response (before) | response (after) | rewards (before) | rewards (after) |
-|-------|-------------------|------------------|------------------|-----------------|
-|I'd never seen a |heavier, woodier example of Victorian archite... |film of this caliber, and I think it's wonder... |3.297736 |4.158653|
-|I love John's work	|but I actually have to write language as in w... |and I hereby recommend this film. I am really... |-1.904006 |4.159198 |
-|I's a big struggle |to see anyone who acts in that way. by Jim Th... |, but overall I'm happy with the changes even ... |-1.595925 |2.651260|
--- a/model_cards/lvwerra/gpt2-imdb/README.md
+++ b/model_cards/lvwerra/gpt2-imdb/README.md
-# GPT2-IMDB
-## What is it?
-A GPT2 (`gpt2`) language model fine-tuned on the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).
-## Training setting
-The GPT2 language model was fine-tuned for 1 epoch on the IMDB dataset. All comments were joined into a single text file separated by the EOS token:
-```
-import pandas as pd
-df = pd.read_csv("imdb-dataset.csv")
-imdb_str = " <|endoftext|> ".join(df['review'].tolist())
-with open ('imdb.txt', 'w') as f:
-    f.write(imdb_str)
-```
-To train the model, the `run_language_modeling.py` script from the `transformers` library was used:
-```
-python run_language_modeling.py \
-    --train_data_file imdb.txt \
-    --output_dir gpt2-imdb \
-    --model_type gpt2 \
-	--model_name_or_path gpt2
-```
--- a/model_cards/lvwerra/gpt2-medium-taboo/README.md
+++ b/model_cards/lvwerra/gpt2-medium-taboo/README.md
-# GPT-2 (medium) Taboo
-## What is it?
-A fine-tuned GPT-2 version for Taboo cards generation.
-## Training setting
-The model was trained on ~900 Taboo cards in the following format for 100 epochs:
-```
-Describe the word Glitch without using the words Problem, Unexpected, Technology, Minor, Outage.
-```
--- a/model_cards/lysandre/arxiv-nlp/README.md
+++ b/model_cards/lysandre/arxiv-nlp/README.md
-# ArXiv-NLP GPT-2 checkpoint
-This is a GPT-2 small checkpoint for PyTorch. It is the official `gpt2-small` fine-tuned on ArXiv papers in the computational linguistics field.
-## Training data
-This model was trained on a subset of ArXiv papers that were parsed from PDF to txt. The resulting data is made of 80MB of text from the computational linguistics (cs.CL) field.
\ No newline at end of file
--- a/model_cards/lysandre/arxiv/README.md
+++ b/model_cards/lysandre/arxiv/README.md
-# ArXiv GPT-2 checkpoint
-This is a GPT-2 small checkpoint for PyTorch. It is the official `gpt2-small` fine-tuned on ArXiv papers in physics fields.
-## Training data
-This model was trained on a subset of ArXiv papers that were parsed from PDF to txt. The resulting data is made of 130MB of text, mostly from quantum physics (quant-ph) and other physics sub-fields.
--- a/model_cards/m3hrdadfi/albert-fa-base-v2/README.md
+++ b/model_cards/m3hrdadfi/albert-fa-base-v2/README.md
----
-language: fa
-tags:
-- albert-persian
-- persian-lm
-license: apache-2.0
-datasets:
-- Persian Wikidumps
-- MirasText
-- BigBang Page
-- Chetor
-- Eligasht
-- DigiMag
-- Ted Talks
-- Books (Novels, ...)
----
-# ALBERT-Persian
-## ALBERT-Persian: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language
-## Introduction
-ALBERT-Persian was trained on a massive amount of public corpora ([Persian Wikidumps](https://dumps.wikimedia.org/fawiki/), [MirasText](https://github.com/miras-tech/MirasText)) and six other manually crawled text sources from various types of websites ([BigBang Page](https://bigbangpage.com/) `scientific`, [Chetor](https://www.chetor.com/) `lifestyle`, [Eligasht](https://www.eligasht.com/Blog/) `itinerary`, [Digikala](https://www.digikala.com/mag/) `digital magazine`, [Ted Talks](https://www.ted.com/talks) `general conversational`, Books `novels, storybooks, short stories from old to the contemporary era`).
-## Intended uses & limitations
-You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
-be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?search=albert-fa) to look for
-fine-tuned versions on a task that interests you.
-### How to use
-#### TensorFlow 2.0
-```python
-from transformers import AutoConfig, AutoTokenizer, TFAutoModel
-config = AutoConfig.from_pretrained("m3hrdadfi/albert-fa-base-v2")
-tokenizer = AutoTokenizer.from_pretrained("m3hrdadfi/albert-fa-base-v2")
-model = TFAutoModel.from_pretrained("m3hrdadfi/albert-fa-base-v2")
-text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
-tokenizer.tokenize(text)
->>> ['▁ما', '▁در', '▁هوش', 'واره', '▁معتقد', 'یم', '▁با', '▁انتقال', '▁صحیح', '▁دانش', '▁و', '▁اگاه', 'ی', '،', '▁همه', '▁افراد', '▁می', '▁توانند', '▁از', '▁ابزارهای', '▁هوشمند', '▁استفاده', '▁کنند', '.', '▁شعار', '▁ما', '▁هوش', '▁مصنوعی', '▁برای', '▁همه', '▁است', '.']
-```
-#### PyTorch
-```python
-from transformers import AutoConfig, AutoTokenizer, AutoModel
-config = AutoConfig.from_pretrained("m3hrdadfi/albert-fa-base-v2")
-tokenizer = AutoTokenizer.from_pretrained("m3hrdadfi/albert-fa-base-v2")
-model = AutoModel.from_pretrained("m3hrdadfi/albert-fa-base-v2")
-```
-## Training
-ALBERT-Persian is the first attempt on ALBERT for the Persian Language. The model was trained based on Google's ALBERT BASE Version 2.0 over various writing styles from numerous subjects (e.g., scientific, novels, news) with more than `3.9M` documents, `73M` sentences, and `1.3B` words, like the way we did for [ParsBERT](https://github.com/hooshvare/parsbert).
-## Goals
-Objective goals during training are as below (after 140K steps).
-``` bash
-***** Eval results *****
-global_step = 140000
-loss = 2.0080082
-masked_lm_accuracy = 0.6141017
-masked_lm_loss = 1.9963315
-sentence_order_accuracy = 0.985
-sentence_order_loss = 0.06908702
-```
-## Derivative models
-### Base Config
-#### Albert Model
-- [m3hrdadfi/albert-fa-base-v2](https://huggingface.co/m3hrdadfi/albert-fa-base-v2)
-#### Albert Sentiment Analysis
-- [m3hrdadfi/albert-fa-base-v2-sentiment-digikala](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-sentiment-digikala)
-- [m3hrdadfi/albert-fa-base-v2-sentiment-snappfood](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-sentiment-snappfood)
-- [m3hrdadfi/albert-fa-base-v2-sentiment-deepsentipers-binary](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-sentiment-deepsentipers-binary)
-- [m3hrdadfi/albert-fa-base-v2-sentiment-deepsentipers-multi](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-sentiment-deepsentipers-multi)
-- [m3hrdadfi/albert-fa-base-v2-sentiment-binary](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-sentiment-binary)
-- [m3hrdadfi/albert-fa-base-v2-sentiment-multi](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-sentiment-multi)
-#### Albert Text Classification
-- [m3hrdadfi/albert-fa-base-v2-clf-digimag](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-clf-digimag)
-- [m3hrdadfi/albert-fa-base-v2-clf-persiannews](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-clf-persiannews)
-#### Albert NER
-- [m3hrdadfi/albert-fa-base-v2-ner](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-ner)
-- [m3hrdadfi/albert-fa-base-v2-ner-arman](https://huggingface.co/m3hrdadfi/albert-fa-base-v2-ner-arman)
-## Eval results
-The following tables summarize the F1 scores obtained by ALBERT-Persian as compared to other models and architectures.
-### Sentiment Analysis (SA) Task
-|          Dataset         | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT | DeepSentiPers |
-|:------------------------:|:-----------------:|:-----------:|:-----:|:-------------:|
-|  Digikala User Comments  |       81.12       |    81.74    | 80.74 |       -       |
-|  SnappFood User Comments |       85.79       |    88.12    | 87.87 |       -       |
-|  SentiPers (Multi Class) |       66.12       |    71.11    |   -   |     69.33     |
-| SentiPers (Binary Class) |       91.09       |    92.13    |   -   |     91.98     |
-### Text Classification (TC) Task
-|      Dataset      | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT |
-|:-----------------:|:-----------------:|:-----------:|:-----:|
-| Digikala Magazine |       92.33       |    93.59    | 90.72 |
-|    Persian News   |       97.01       |    97.19    | 95.79 |
-### Named Entity Recognition (NER) Task
-| Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
-|:-------:|:-----------------:|:-----------:|:-----:|:----------:|:------------:|:--------:|:--------------:|:----------:|
-|  PEYMA  |       88.99       |    93.10    | 86.64 |      -     |     90.59    |     -    |      84.00     |      -     |
-|  ARMAN  |       97.43       |    98.79    | 95.89 |    89.9    |     84.03    |   86.55  |        -       |    77.45   |
-### BibTeX entry and citation info
-Please cite in publications as the following:
-```bibtex
-@misc{ALBERT-Persian,
-  author = {Mehrdad Farahani},
-  title = {ALBERT-Persian: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language},
-  year = {2020},
-  publisher = {GitHub},
-  journal = {GitHub repository},
-  howpublished = {\url{https://github.com/m3hrdadfi/albert-persian}},
-}
-@article{ParsBERT,
-    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
-    author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
-    journal={ArXiv},
-    year={2020},
-    volume={abs/2005.12515}
-}
-```
-## Questions?
-Post a Github issue on the [ALBERT-Persian](https://github.com/m3hrdadfi/albert-persian) repo.
--- a/model_cards/m3hrdadfi/bert2bert-fa-news-headline/README.md
+++ b/model_cards/m3hrdadfi/bert2bert-fa-news-headline/README.md
----
-language: fa
-license: apache-2.0
-tags:
-- summarization
----
-A Bert2Bert model trained on the VoA Persian Corpus (a medium-sized corpus of 7.9 million words, 2003-2008) to generate headlines. The model achieved a 25.30 ROUGE-2 score.
-For more detail, please follow the [News Headline Generation](https://github.com/m3hrdadfi/news-headline-generation) repo. 
-## Eval results
-The following table summarizes the ROUGE scores obtained by the Bert2Bert model.
-|    %    | Precision | Recall | FMeasure |
-|:-------:|:---------:|:------:|:--------:|
-| ROUGE-1 |   43.78   |  45.52 |   43.54  |
-| ROUGE-2 |   24.50   | 25.30* |   24.24  |
-| ROUGE-L |   41.20   |  42.22 |   40.76  |
-## Questions?
-Post a Github issue on the [News Headline Generation](https://github.com/hooshvare/news-headline-generation/issues) repo.
--- a/model_cards/m3hrdadfi/bert2bert-fa-wiki-summary/README.md
+++ b/model_cards/m3hrdadfi/bert2bert-fa-wiki-summary/README.md
----
-language: fa
-license: apache-2.0
-tags:
-- summarization
----
-A Bert2Bert model trained on the Wiki Summary dataset to summarize articles. The model achieved an 8.47 ROUGE-2 score.
-For more detail, please follow the [Wiki Summary](https://github.com/m3hrdadfi/wiki-summary) repo. 
-## Eval results
-The following table summarizes the ROUGE scores obtained by the Bert2Bert model.
-|    %    | Precision | Recall | FMeasure |
-|:-------:|:---------:|:------:|:--------:|
-| ROUGE-1 |   28.14   |  30.86 |   27.34  |
-| ROUGE-2 |   07.12   | 08.47* |   07.10  |
-| ROUGE-L |   28.49   |  25.87 |   25.50  |
-## Questions?
-Post a Github issue on the [Wiki Summary](https://github.com/m3hrdadfi/wiki-summary/issues) repo.
--- a/model_cards/microsoft/DeBERTa-base/README.md
+++ b/model_cards/microsoft/DeBERTa-base/README.md
----
-thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
-license: mit
----
-## DeBERTa: Decoding-enhanced BERT with Disentangled Attention
-[DeBERTa](https://arxiv.org/abs/2006.03654) improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB training data.
-Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.
-#### Fine-tuning on NLU tasks
-We present the dev results on SQuAD 1.1/2.0 and MNLI tasks.
-| Model             | SQuAD 1.1 | SQuAD 2.0 | MNLI-m |
-|-------------------|-----------|-----------|--------|
-| RoBERTa-base      | 91.5/84.6 | 83.7/80.5 | 87.6   |
-| XLNet-Large       | -/-       | -/80.2    | 86.8   |
-| **DeBERTa-base**  | 93.1/87.2 | 86.2/83.1 | 88.8   |
-### Citation
-If you find DeBERTa useful for your work, please cite the following paper:
-``` latex
-@misc{he2020deberta,
-    title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
-    author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
-    year={2020},
-    eprint={2006.03654},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-```
--- a/model_cards/microsoft/DeBERTa-large/README.md
+++ b/model_cards/microsoft/DeBERTa-large/README.md
----
-thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
-license: mit
----
-## DeBERTa: Decoding-enhanced BERT with Disentangled Attention
-[DeBERTa](https://arxiv.org/abs/2006.03654) improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB training data.
-Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.
-#### Fine-tuning on NLU tasks
-We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.
-| Model             | SQuAD 1.1 | SQuAD 2.0 | MNLI-m | SST-2 | QNLI | CoLA | RTE  | MRPC | QQP  |STS-B|
-|-------------------|-----------|-----------|--------|-------|------|------|------|------|------|-----|
-| BERT-Large        | 90.9/84.1 | 81.8/79.0 | 86.6   | 93.2  | 92.3 | 60.6 | 70.4 | 88.0 | 91.3 |90.0 |
-| RoBERTa-Large     | 94.6/88.9 | 89.4/86.5 | 90.2   | 96.4  | 93.9 | 68.0 | 86.6 | 90.9 | 92.2 |92.4 |
-| XLNet-Large       | 95.1/89.7 | 90.6/87.9 | 90.8   | 97.0  | 94.9 | 69.0 | 85.9 | 90.8 | 92.3 |92.5 |
-| **DeBERTa-Large** | 95.5/90.1 | 90.7/88.0 | 91.1   | 96.5  | 95.3 | 69.5 | 88.1 | 92.5 | 92.3 |92.5 |
-### Citation
-If you find DeBERTa useful for your work, please cite the following paper:
-``` latex
-@misc{he2020deberta,
-    title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
-    author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
-    year={2020},
-    eprint={2006.03654},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-```
--- a/model_cards/microsoft/DialoGPT-large/README.md
+++ b/model_cards/microsoft/DialoGPT-large/README.md
----
-thumbnail: https://huggingface.co/front/thumbnails/dialogpt.png
-tags:
-- conversational
-license: mit
----
-## A State-of-the-Art Large-scale Pretrained Response generation model (DialoGPT)
-DialoGPT is a SOTA large-scale pretrained dialogue response generation model for multiturn conversations. 
-The [human evaluation results](https://github.com/dreasysnail/Dialogpt_dev#human-evaluation) indicate that the response generated from DialoGPT is comparable to human response quality under a single-turn conversation Turing test.
-The model is trained on 147M multi-turn dialogues from Reddit discussion threads.
-* Multi-turn generation examples from an interactive environment:
-|Role | Response |
-|---------|--------|
-|User | Does money buy happiness? |
-| Bot |  Depends how much money you spend on it .|
-|User | What is the best way to buy happiness ? |
-| Bot | You just have to be a millionaire by your early 20s, then you can be happy . |
-|User |This is so difficult ! |
-| Bot | You have no idea how hard it is to be a millionaire and happy . There is a reason the rich have a lot of money |
-Please find the information about preprocessing, training and full details of the DialoGPT in the [original DialoGPT repository](https://github.com/microsoft/DialoGPT)
-ArXiv paper: [https://arxiv.org/abs/1911.00536](https://arxiv.org/abs/1911.00536)
-### How to use
-Now we are ready to try out how the model works as a chatting partner!
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
-model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
-# Let's chat for 5 lines
-for step in range(5):
-    # encode the new user input, add the eos_token and return a PyTorch tensor
-    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
-    # append the new user input tokens to the chat history
-    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
-    # generate a response while limiting the total chat history to 1000 tokens
-    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
-    # pretty-print the last output tokens from the bot
-    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
-```
--- a/model_cards/microsoft/DialoGPT-medium/README.md
+++ b/model_cards/microsoft/DialoGPT-medium/README.md
----
-thumbnail: https://huggingface.co/front/thumbnails/dialogpt.png
-tags:
-- conversational
-license: mit
----
-## A State-of-the-Art Large-scale Pretrained Response generation model (DialoGPT)
-DialoGPT is a SOTA large-scale pretrained dialogue response generation model for multiturn conversations. 
-The [human evaluation results](https://github.com/dreasysnail/Dialogpt_dev#human-evaluation) indicate that the response generated from DialoGPT is comparable to human response quality under a single-turn conversation Turing test.
-The model is trained on 147M multi-turn dialogues from Reddit discussion threads.
-* Multi-turn generation examples from an interactive environment:
-|Role | Response |
-|---------|--------|
-|User | Does money buy happiness? |
-| Bot |  Depends how much money you spend on it .|
-|User | What is the best way to buy happiness ? |
-| Bot | You just have to be a millionaire by your early 20s, then you can be happy . |
-|User |This is so difficult ! |
-| Bot | You have no idea how hard it is to be a millionaire and happy . There is a reason the rich have a lot of money |
-Please find the information about preprocessing, training and full details of the DialoGPT in the [original DialoGPT repository](https://github.com/microsoft/DialoGPT)
-ArXiv paper: [https://arxiv.org/abs/1911.00536](https://arxiv.org/abs/1911.00536)
-### How to use
-Now we are ready to try out how the model works as a chatting partner!
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
-model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
-# Let's chat for 5 lines
-for step in range(5):
-    # encode the new user input, add the eos_token and return a PyTorch tensor
-    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
-    # append the new user input tokens to the chat history
-    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
-    # generate a response while limiting the total chat history to 1000 tokens
-    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
-    # pretty-print the last output tokens from the bot
-    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
-```
--- a/model_cards/microsoft/DialoGPT-small/README.md
+++ b/model_cards/microsoft/DialoGPT-small/README.md
----
-thumbnail: https://huggingface.co/front/thumbnails/dialogpt.png
-tags:
-- conversational
-license: mit
----
-## A State-of-the-Art Large-scale Pretrained Response generation model (DialoGPT)
-DialoGPT is a SOTA large-scale pretrained dialogue response generation model for multiturn conversations. 
-The [human evaluation results](https://github.com/dreasysnail/Dialogpt_dev#human-evaluation) indicate that the response generated from DialoGPT is comparable to human response quality under a single-turn conversation Turing test.
-The model is trained on 147M multi-turn dialogues from Reddit discussion threads.
-* Multi-turn generation examples from an interactive environment:
-|Role | Response |
-|---------|--------|
-|User | Does money buy happiness? |
-| Bot |  Depends how much money you spend on it .|
-|User | What is the best way to buy happiness ? |
-| Bot | You just have to be a millionaire by your early 20s, then you can be happy . |
-|User |This is so difficult ! |
-| Bot | You have no idea how hard it is to be a millionaire and happy . There is a reason the rich have a lot of money |
-Please find the information about preprocessing, training and full details of the DialoGPT in the [original DialoGPT repository](https://github.com/microsoft/DialoGPT)
-ArXiv paper: [https://arxiv.org/abs/1911.00536](https://arxiv.org/abs/1911.00536)
-### How to use
-Now we are ready to try out how the model works as a chatting partner!
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
-model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
-# Let's chat for 5 lines
-for step in range(5):
-    # encode the new user input, add the eos_token and return a PyTorch tensor
-    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
-    # append the new user input tokens to the chat history
-    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
-    # generate a response while limiting the total chat history to 1000 tokens
-    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
-    # pretty-print the last output tokens from the bot
-    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
-```
--- a/model_cards/microsoft/MiniLM-L12-H384-uncased/README.md
+++ b/model_cards/microsoft/MiniLM-L12-H384-uncased/README.md
----
-thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
-tags:
-- text-classification
-license: mit
----
-## MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
-MiniLM is a distilled model from the paper "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)".
-Please find the information about preprocessing, training and full details of the MiniLM in the [original MiniLM repository](https://github.com/microsoft/unilm/blob/master/minilm/).
-Please note: This checkpoint can be used as a drop-in replacement for BERT, and it needs to be fine-tuned before use!
-### English Pre-trained Models
-We release the **uncased** **12**-layer model with **384** hidden size distilled from an in-house pre-trained [UniLM v2](/unilm) model in BERT-Base size.
-- MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base
-#### Fine-tuning on NLU tasks
-We present the dev results on SQuAD 2.0 and several GLUE benchmark tasks.
-| Model                                             | #Param | SQuAD 2.0 | MNLI-m | SST-2 | QNLI | CoLA | RTE  | MRPC | QQP  |
-|---------------------------------------------------|--------|-----------|--------|-------|------|------|------|------|------|
-| [BERT-Base](https://arxiv.org/pdf/1810.04805.pdf) | 109M   | 76.8      | 84.5   | 93.2  | 91.7 | 58.9 | 68.6 | 87.3 | 91.3 |
-| **MiniLM-L12xH384**                               | 33M    | 81.7      | 85.7   | 93.0  | 91.5 | 58.5 | 73.3 | 89.5 | 91.3 |
-### Citation
-If you find MiniLM useful in your research, please cite the following paper:
-``` latex
-@misc{wang2020minilm,
-    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
-    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
-    year={2020},
-    eprint={2002.10957},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-```
--- a/model_cards/microsoft/Multilingual-MiniLM-L12-H384/README.md
+++ b/model_cards/microsoft/Multilingual-MiniLM-L12-H384/README.md
----
-thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
-tags:
-- text-classification
-license: mit
----
-## MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
-MiniLM is a distilled model from the paper "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)".
-Please find the information about preprocessing, training and full details of the MiniLM in the [original MiniLM repository](https://github.com/microsoft/unilm/blob/master/minilm/).
-Please note: This checkpoint uses `BertModel` with `XLMRobertaTokenizer` so `AutoTokenizer` won't work with this checkpoint!
-### Multilingual Pretrained Model
-- Multilingual-MiniLMv1-L12-H384: 12-layer, 384-hidden, 12-heads, 21M Transformer parameters, 96M embedding parameters
-Multilingual MiniLM uses the same tokenizer as XLM-R. But the Transformer architecture of our model is the same as BERT. We provide the fine-tuning code on XNLI based on [huggingface/transformers](https://github.com/huggingface/transformers). Please replace `run_xnli.py` in transformers with [ours](https://github.com/microsoft/unilm/blob/master/minilm/examples/run_xnli.py) to fine-tune multilingual MiniLM.  
-We evaluate the multilingual MiniLM on the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).
-#### Cross-Lingual Natural Language Inference - [XNLI](https://arxiv.org/abs/1809.05053)
-We evaluate our model on cross-lingual transfer from English to other languages. Following [Conneau et al. (2019)](https://arxiv.org/abs/1911.02116), we select the best single model on the joint dev set of all the languages.
-| Model                                                                                       | #Layers | #Hidden | #Transformer Parameters | Average | en   | fr   | es   | de   | el   | bg   | ru   | tr   | ar   | vi   | th   | zh   | hi   | sw   | ur   |
-|---------------------------------------------------------------------------------------------|---------|---------|-------------------------|---------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
-| [mBERT](https://github.com/google-research/bert)                                            | 12      | 768     | 85M                     | 66.3    | 82.1 | 73.8 | 74.3 | 71.1 | 66.4 | 68.9 | 69.0 | 61.6 | 64.9 | 69.5 | 55.8 | 69.3 | 60.0 | 50.4 | 58.0 |
-| [XLM-100](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) | 16      | 1280    | 315M                    | 70.7    | 83.2 | 76.7 | 77.7 | 74.0 | 72.7 | 74.1 | 72.7 | 68.7 | 68.6 | 72.9 | 68.9 | 72.5 | 65.6 | 58.2 | 62.4 |
-| [XLM-R Base](https://arxiv.org/abs/1911.02116)                                              | 12      | 768     | 85M                     | 74.5    | 84.6 | 78.4 | 78.9 | 76.8 | 75.9 | 77.3 | 75.4 | 73.2 | 71.5 | 75.4 | 72.5 | 74.9 | 71.1 | 65.2 | 66.5 |
-| **mMiniLM-L12xH384**                                                                        | 12      | 384     | 21M                     | 71.1    | 81.5 | 74.8 | 75.7 | 72.9 | 73.0 | 74.5 | 71.3 | 69.7 | 68.8 | 72.1 | 67.8 | 70.0 | 66.2 | 63.3 | 64.2 |
-This example code fine-tunes the **12**-layer multilingual MiniLM on XNLI.
-```bash
-# run fine-tuning on XNLI
-DATA_DIR=/{path_of_data}/
-OUTPUT_DIR=/{path_of_fine-tuned_model}/
-MODEL_PATH=/{path_of_pre-trained_model}/
-python ./examples/run_xnli.py --model_type minilm \
- --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR} \
- --model_name_or_path microsoft/Multilingual-MiniLM-L12-H384 \
- --tokenizer_name xlm-roberta-base \
- --config_name ${MODEL_PATH}/multilingual-minilm-l12-h384-config.json \
- --do_train \
- --do_eval \
- --max_seq_length 128 \
- --per_gpu_train_batch_size 128 \
- --learning_rate 5e-5 \
- --num_train_epochs 5 \
- --per_gpu_eval_batch_size 32 \
- --weight_decay 0.001 \
- --warmup_steps 500 \
- --save_steps 1500 \
- --logging_steps 1500 \
- --eval_all_checkpoints \
- --language en \
- --fp16 \
- --fp16_opt_level O2
-```
-#### Cross-Lingual Question Answering - [MLQA](https://arxiv.org/abs/1910.07475)
-Following [Lewis et al. (2019b)](https://arxiv.org/abs/1910.07475), we adopt SQuAD 1.1 as training data and use MLQA English development data for early stopping.
-| Model F1 Score                                                                             | #Layers | #Hidden | #Transformer Parameters | Average | en   | es   | de   | ar   | hi   | vi   | zh   |
-|--------------------------------------------------------------------------------------------|---------|---------|-------------------------|---------|------|------|------|------|------|------|------|
-| [mBERT](https://github.com/google-research/bert)                                           | 12      | 768     | 85M                     | 57.7    | 77.7 | 64.3 | 57.9 | 45.7 | 43.8 | 57.1 | 57.5 |
-| [XLM-15](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) | 12      | 1024    | 151M                    | 61.6    | 74.9 | 68.0 | 62.2 | 54.8 | 48.8 | 61.4 | 61.1 |
-| [XLM-R Base](https://arxiv.org/abs/1911.02116) (Reported)                                  | 12      | 768     | 85M                     | 62.9    | 77.8 | 67.2 | 60.8 | 53.0 | 57.9 | 63.1 | 60.2 |
-| [XLM-R Base](https://arxiv.org/abs/1911.02116) (Our fine-tuned)                            | 12      | 768     | 85M                     | 64.9    | 80.3 | 67.0 | 62.7 | 55.0 | 60.4 | 66.5 | 62.3 |
-| **mMiniLM-L12xH384**                                                                       | 12      | 384     | 21M                     | 63.2    | 79.4 | 66.1 | 61.2 | 54.9 | 58.5 | 63.1 | 59.0 |
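As a quick consistency check on the MLQA table above, the "Average" column can be recomputed from the per-language F1 scores. This is a minimal sketch that assumes the column is an unweighted mean over the seven languages, rounded to one decimal; the helper name `macro_average` is hypothetical.

```python
# (reported average, per-language F1 for en, es, de, ar, hi, vi, zh), copied from the table.
mlqa_f1 = {
    "mBERT": (57.7, [77.7, 64.3, 57.9, 45.7, 43.8, 57.1, 57.5]),
    "XLM-15": (61.6, [74.9, 68.0, 62.2, 54.8, 48.8, 61.4, 61.1]),
    "XLM-R Base (Reported)": (62.9, [77.8, 67.2, 60.8, 53.0, 57.9, 63.1, 60.2]),
    "XLM-R Base (Our fine-tuned)": (64.9, [80.3, 67.0, 62.7, 55.0, 60.4, 66.5, 62.3]),
    "mMiniLM-L12xH384": (63.2, [79.4, 66.1, 61.2, 54.9, 58.5, 63.1, 59.0]),
}


def macro_average(scores):
    """Unweighted mean over languages, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


for name, (reported, per_language) in mlqa_f1.items():
    print(f"{name}: reported {reported}, recomputed {macro_average(per_language)}")
```

Under that assumption, the recomputed value matches the reported average for every row of this table.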
-### Citation
-If you find MiniLM useful in your research, please cite the following paper:
-```bibtex
-@misc{wang2020minilm,
-    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
-    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
-    year={2020},
-    eprint={2002.10957},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-```
--- a/model_cards/microsoft/codebert-base-mlm/README.md
+++ b/model_cards/microsoft/codebert-base-mlm/README.md
-## CodeBERT-base-mlm
-Pretrained weights for [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155).
-### Training Data
-The model is trained on the code corpus of [CodeSearchNet](https://github.com/github/CodeSearchNet).
-### Training Objective
-This model is initialized with RoBERTa-base and trained with a simple MLM (Masked Language Model) objective.
-### Usage
-```python
-from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
-model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm')
-tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')
-code_example = "if (x is not None) <mask> (x>1)"
-fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
-outputs = fill_mask(code_example)
-print(outputs)
-```
-Expected results:
-```
-{'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
-{'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
-{'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
-{'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
-{'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}
-```
-### Reference
-1. [Bimodal CodeBERT trained with MLM+RTD objective](https://huggingface.co/microsoft/codebert-base) (suitable for code search and document generation)
-2. 🤗 [Hugging Face's CodeBERTa](https://huggingface.co/huggingface/CodeBERTa-small-v1) (small size, 6 layers)
-### Citation
-```bibtex
-@misc{feng2020codebert,
-    title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
-    author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
-    year={2020},
-    eprint={2002.08155},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-```
--- a/model_cards/microsoft/codebert-base/README.md
+++ b/model_cards/microsoft/codebert-base/README.md
-## CodeBERT-base
-Pretrained weights for [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155).
-### Training Data
-The model is trained on the bi-modal data (documents & code) of [CodeSearchNet](https://github.com/github/CodeSearchNet).
-### Training Objective
-This model is initialized with RoBERTa-base and trained with the MLM+RTD objective (cf. the paper).
-### Usage
-Please see [the official repository](https://github.com/microsoft/CodeBERT) for scripts that support "code search" and "code-to-document generation".
-### Reference
-1. [CodeBERT trained with Masked LM objective](https://huggingface.co/microsoft/codebert-base-mlm) (suitable for code completion)
-2. 🤗 [Hugging Face's CodeBERTa](https://huggingface.co/huggingface/CodeBERTa-small-v1) (small size, 6 layers)
-### Citation
-```bibtex
-@misc{feng2020codebert,
-    title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
-    author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
-    year={2020},
-    eprint={2002.08155},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-```
--- a/model_cards/microsoft/layoutlm-base-uncased/README.md
+++ b/model_cards/microsoft/layoutlm-base-uncased/README.md
-# LayoutLM
-## Model description
-LayoutLM is a simple but effective pre-training method for text and layout, aimed at document image understanding and information extraction tasks such as form understanding and receipt understanding. LayoutLM achieves state-of-the-art results on multiple datasets. For more details, please refer to our paper:
-[LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318)
-Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, [KDD 2020](https://www.kdd.org/kdd2020/accepted-papers)
-## Training data
-We pre-train LayoutLM on the IIT-CDIP Test Collection 1.0\* dataset with two settings.
-* LayoutLM-Base, Uncased (11M documents, 2 epochs): 12-layer, 768-hidden, 12-heads, 113M parameters **(This Model)**
-* LayoutLM-Large, Uncased (11M documents, 2 epochs): 24-layer, 1024-hidden, 16-heads, 343M parameters
-## Citation
-If you find LayoutLM useful in your research, please cite the following paper:
-```bibtex
-@misc{xu2019layoutlm,
-    title={LayoutLM: Pre-training of Text and Layout for Document Image Understanding},
-    author={Yiheng Xu and Minghao Li and Lei Cui and Shaohan Huang and Furu Wei and Ming Zhou},
-    year={2019},
-    eprint={1912.13318},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-```