[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)

* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Unverified Commit 3552d0e0 authored Dec 12, 2020 by Julien Chaumond Committed by GitHub Dec 11, 2020
18 changed files
--- a/model_cards/voidful/albert_chinese_large/README.md
+++ b/model_cards/voidful/albert_chinese_large/README.md
---
-language: zh
---
-
-# albert_chinese_large
-
-This is an albert_chinese_large model from [Google's GitHub](https://github.com/google-research/ALBERT)  
-converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
-
-## Attention (注意)
-
-Since sentencepiece is not used in the albert_chinese_large model,   
-you have to use BertTokenizer instead of AlbertTokenizer.   
-We can verify this with a MaskedLM example.   
-   
-由於 albert_chinese_large 模型沒有用 sentencepiece   
-用AlbertTokenizer會載不進詞表，因此需要改用BertTokenizer !!!   
-我們可以跑MaskedLM預測來驗證這個做法是否正確   
-   
-## Justify (驗證有效性)
-[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)   
-```python
-import torch
-from torch.nn.functional import softmax
-from transformers import AlbertForMaskedLM, BertTokenizer
-
-pretrained = 'voidful/albert_chinese_large'
-tokenizer = BertTokenizer.from_pretrained(pretrained)
-model = AlbertForMaskedLM.from_pretrained(pretrained)
-
-inputtext = "今天[MASK]情很好"
-
-# Encode once and locate [MASK] by its token id (the original hard-coded 103)
-encoded = tokenizer.encode(inputtext, add_special_tokens=True)
-maskpos = encoded.index(tokenizer.mask_token_id)
-
-input_ids = torch.tensor(encoded).unsqueeze(0)  # Batch size 1
-# `labels` replaces the `masked_lm_labels` argument of older transformers releases
-outputs = model(input_ids, labels=input_ids)
-loss, prediction_scores = outputs[:2]
-logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).detach().tolist()
-predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
-predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-print(predicted_token, logit_prob[predicted_index])
-```
-Result: `心 0.9422469735145569`   
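Note on the card above: the verification step boils down to finding the `[MASK]` position, taking a softmax over the scores at that position, and reading off the most likely token. The sketch below illustrates just that logic in plain Python, without downloading a model. The toy vocabulary, the made-up scores, and the helper names `softmax` and `predict_mask` are assumptions for illustration and are not part of the original card.

```python
import math

# Hypothetical toy vocabulary; the real model uses the vocab shipped with the checkpoint.
VOCAB = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "今", "天", "心", "情", "很", "好", "感"]
MASK_ID = VOCAB.index("[MASK]")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_mask(input_ids, scores_per_position):
    """Return (token, probability) for the first [MASK] position."""
    maskpos = input_ids.index(MASK_ID)
    probs = softmax(scores_per_position[maskpos])
    best = max(range(len(probs)), key=probs.__getitem__)
    return VOCAB[best], probs[best]


# "[CLS] 今 天 [MASK] 情 很 好 [SEP]" as ids in the toy vocabulary
input_ids = [1, 4, 5, 3, 7, 8, 9, 2]
# Made-up scores: flat everywhere, with a peak for "心" at the mask position
scores = [[0.0] * len(VOCAB) for _ in input_ids]
scores[3][VOCAB.index("心")] = 5.0

token, prob = predict_mask(input_ids, scores)
print(token, round(prob, 4))
```

Looking the mask id up from the vocabulary rather than hard-coding it keeps the lookup valid if the vocabulary layout differs between checkpoints.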
--- a/model_cards/voidful/albert_chinese_small/README.md
+++ b/model_cards/voidful/albert_chinese_small/README.md
---
-language: zh
---
-
-# albert_chinese_small
-
-This is an albert_chinese_small model (albert_small_google_zh) from the [brightmart/albert_zh project](https://github.com/brightmart/albert_zh),    
-converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
-
-## Attention (注意)
-
-Since sentencepiece is not used in the albert_chinese_small model,   
-you have to use BertTokenizer instead of AlbertTokenizer.   
-We can verify this with a MaskedLM example.   
-   
-由於 albert_chinese_small 模型沒有用 sentencepiece   
-用AlbertTokenizer會載不進詞表，因此需要改用BertTokenizer !!!   
-我們可以跑MaskedLM預測來驗證這個做法是否正確   
-   
-## Justify (驗證有效性)
-[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)   
-```python
-import torch
-from torch.nn.functional import softmax
-from transformers import AlbertForMaskedLM, BertTokenizer
-
-pretrained = 'voidful/albert_chinese_small'
-tokenizer = BertTokenizer.from_pretrained(pretrained)
-model = AlbertForMaskedLM.from_pretrained(pretrained)
-
-inputtext = "今天[MASK]情很好"
-
-# Encode once and locate [MASK] by its token id (the original hard-coded 103)
-encoded = tokenizer.encode(inputtext, add_special_tokens=True)
-maskpos = encoded.index(tokenizer.mask_token_id)
-
-input_ids = torch.tensor(encoded).unsqueeze(0)  # Batch size 1
-# `labels` replaces the `masked_lm_labels` argument of older transformers releases
-outputs = model(input_ids, labels=input_ids)
-loss, prediction_scores = outputs[:2]
-logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).detach().tolist()
-predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
-predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-print(predicted_token, logit_prob[predicted_index])
-```
-Result: `感 0.6390823125839233`   
--- a/model_cards/voidful/albert_chinese_tiny/README.md
+++ b/model_cards/voidful/albert_chinese_tiny/README.md
---
-language: zh
---
-
-# albert_chinese_tiny
-
-This is an albert_chinese_tiny model (albert_tiny_google_zh) from the [brightmart/albert_zh project](https://github.com/brightmart/albert_zh),    
-converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
-
-## Attention (注意)
-
-Since sentencepiece is not used in the albert_chinese_tiny model,   
-you have to use BertTokenizer instead of AlbertTokenizer.   
-We can verify this with a MaskedLM example.   
-   
-由於 albert_chinese_tiny 模型沒有用 sentencepiece   
-用AlbertTokenizer會載不進詞表，因此需要改用BertTokenizer !!!   
-我們可以跑MaskedLM預測來驗證這個做法是否正確   
-   
-## Justify (驗證有效性)
-[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)   
-```python
-import torch
-from torch.nn.functional import softmax
-from transformers import AlbertForMaskedLM, BertTokenizer
-
-pretrained = 'voidful/albert_chinese_tiny'
-tokenizer = BertTokenizer.from_pretrained(pretrained)
-model = AlbertForMaskedLM.from_pretrained(pretrained)
-
-inputtext = "今天[MASK]情很好"
-
-# Encode once and locate [MASK] by its token id (the original hard-coded 103)
-encoded = tokenizer.encode(inputtext, add_special_tokens=True)
-maskpos = encoded.index(tokenizer.mask_token_id)
-
-input_ids = torch.tensor(encoded).unsqueeze(0)  # Batch size 1
-# `labels` replaces the `masked_lm_labels` argument of older transformers releases
-outputs = model(input_ids, labels=input_ids)
-loss, prediction_scores = outputs[:2]
-logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).detach().tolist()
-predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
-predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-print(predicted_token, logit_prob[predicted_index])
-```
-Result: `感 0.40312355756759644`   
--- a/model_cards/voidful/albert_chinese_xlarge/README.md
+++ b/model_cards/voidful/albert_chinese_xlarge/README.md
---
-language: zh
---
-
-# albert_chinese_xlarge
-
-This is an albert_chinese_xlarge model from [Google's GitHub](https://github.com/google-research/ALBERT)  
-converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
-
-## Attention (注意)
-
-Since sentencepiece is not used in the albert_chinese_xlarge model,   
-you have to use BertTokenizer instead of AlbertTokenizer.   
-We can verify this with a MaskedLM example.   
-   
-由於 albert_chinese_xlarge 模型沒有用 sentencepiece   
-用AlbertTokenizer會載不進詞表，因此需要改用BertTokenizer !!!   
-我們可以跑MaskedLM預測來驗證這個做法是否正確   
-   
-## Justify (驗證有效性)
-[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)   
-```python
-import torch
-from torch.nn.functional import softmax
-from transformers import AlbertForMaskedLM, BertTokenizer
-
-pretrained = 'voidful/albert_chinese_xlarge'
-tokenizer = BertTokenizer.from_pretrained(pretrained)
-model = AlbertForMaskedLM.from_pretrained(pretrained)
-
-inputtext = "今天[MASK]情很好"
-
-# Encode once and locate [MASK] by its token id (the original hard-coded 103)
-encoded = tokenizer.encode(inputtext, add_special_tokens=True)
-maskpos = encoded.index(tokenizer.mask_token_id)
-
-input_ids = torch.tensor(encoded).unsqueeze(0)  # Batch size 1
-# `labels` replaces the `masked_lm_labels` argument of older transformers releases
-outputs = model(input_ids, labels=input_ids)
-loss, prediction_scores = outputs[:2]
-logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).detach().tolist()
-predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
-predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-print(predicted_token, logit_prob[predicted_index])
-```
-Result: `心 0.9942440390586853`   
--- a/model_cards/voidful/albert_chinese_xxlarge/README.md
+++ b/model_cards/voidful/albert_chinese_xxlarge/README.md
---
-language: zh
---
-
-# albert_chinese_xxlarge
-
-This is an albert_chinese_xxlarge model from [Google's GitHub](https://github.com/google-research/ALBERT)  
-converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
-
-## Attention (注意)
-
-Since sentencepiece is not used in the albert_chinese_xxlarge model,   
-you have to use BertTokenizer instead of AlbertTokenizer.   
-We can verify this with a MaskedLM example.   
-   
-由於 albert_chinese_xxlarge 模型沒有用 sentencepiece   
-用AlbertTokenizer會載不進詞表，因此需要改用BertTokenizer !!!   
-我們可以跑MaskedLM預測來驗證這個做法是否正確   
-   
-## Justify (驗證有效性)
-[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)   
-```python
-import torch
-from torch.nn.functional import softmax
-from transformers import AlbertForMaskedLM, BertTokenizer
-
-pretrained = 'voidful/albert_chinese_xxlarge'
-tokenizer = BertTokenizer.from_pretrained(pretrained)
-model = AlbertForMaskedLM.from_pretrained(pretrained)
-
-inputtext = "今天[MASK]情很好"
-
-# Encode once and locate [MASK] by its token id (the original hard-coded 103)
-encoded = tokenizer.encode(inputtext, add_special_tokens=True)
-maskpos = encoded.index(tokenizer.mask_token_id)
-
-input_ids = torch.tensor(encoded).unsqueeze(0)  # Batch size 1
-# `labels` replaces the `masked_lm_labels` argument of older transformers releases
-outputs = model(input_ids, labels=input_ids)
-loss, prediction_scores = outputs[:2]
-logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).detach().tolist()
-predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
-predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-print(predicted_token, logit_prob[predicted_index])
-```
-Result: `心 0.995713472366333`   
--- a/model_cards/wietsedv/bert-base-dutch-cased/README.md
+++ b/model_cards/wietsedv/bert-base-dutch-cased/README.md
-# BERTje: A Dutch BERT model
-
-BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
-
-⚠️ **The new home of this model is the [GroNLP](https://huggingface.co/GroNLP) organization.**
-
-BERTje now lives at: [`GroNLP/bert-base-dutch-cased`](https://huggingface.co/GroNLP/bert-base-dutch-cased)
-
-The model weights of the versions at `wietsedv/` and `GroNLP/` are the same, so do not worry if you use(d) `wietsedv/bert-base-dutch-cased`.
-
-
-<img src="https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png" height="250">
--- a/model_cards/wptoux/albert-chinese-large-qa/README.md
+++ b/model_cards/wptoux/albert-chinese-large-qa/README.md
-# albert-chinese-large-qa
-Albert large QA model pretrained from baidu webqa and baidu dureader datasets.
-
-## Data source
-+ baidu webqa 1.0
-+ baidu dureader
-
-## Training Method
-We combined the two datasets together and created a new dataset in squad format, including 705139 samples for training and 69638 samples for validation.
-We finetune the model based on the albert chinese large model.
-
-## Hyperparams
-+ learning_rate 1e-5
-+ max_seq_length 512
-+ max_query_length 50
-+ max_answer_length 300
-+ doc_stride 256
-+ num_train_epochs 2
-+ warmup_steps 1000
-+ per_gpu_train_batch_size 8
-+ gradient_accumulation_steps 3
-+ n_gpu 2 (Nvidia Tesla P100)
-
-## Usage
-```
-from transformers import AutoModelForQuestionAnswering, BertTokenizer
-
-model = AutoModelForQuestionAnswering.from_pretrained('wptoux/albert-chinese-large-qa')
-tokenizer = BertTokenizer.from_pretrained('wptoux/albert-chinese-large-qa')
-```
-***Important: use BertTokenizer***
-
-## More Info
-Please visit https://github.com/wptoux/albert-chinese-large-webqa for details.
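Note on the card above: the hyperparameters imply an effective batch size and an approximate number of optimizer updates, which the card does not state. A rough calculation, assuming one training feature per sample (with `doc_stride` windows over long documents the real feature count, and so the update count, would be higher):

```python
import math

# Values taken from the Hyperparams list in the card
per_gpu_train_batch_size = 8
gradient_accumulation_steps = 3
n_gpu = 2
num_train_epochs = 2
train_samples = 705139

# Sequences consumed per optimizer update
effective_batch_size = per_gpu_train_batch_size * gradient_accumulation_steps * n_gpu

# Assumption: one feature per sample; doc_stride windowing would increase this.
updates_per_epoch = math.ceil(train_samples / effective_batch_size)
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, updates_per_epoch, total_updates)  # 48 14691 29382
```

Under that assumption, the 1000 warmup steps cover roughly 3% of training.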
--- a/model_cards/xlm-mlm-en-2048-README.md
+++ b/model_cards/xlm-mlm-en-2048-README.md
---
-tags:
- exbert
-
-license: cc-by-nc-4.0
---
-
-<a href="https://huggingface.co/exbert/?model=xlm-mlm-en-2048">
-	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
-</a>
--- a/model_cards/xlm-roberta-base-README.md
+++ b/model_cards/xlm-roberta-base-README.md
---
-tags:
- exbert
-
-license: mit
---
-
-<a href="https://huggingface.co/exbert/?model=xlm-roberta-base">
-	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
-</a>
--- a/model_cards/xlm-roberta-large-finetuned-conll03-german-README.md
+++ b/model_cards/xlm-roberta-large-finetuned-conll03-german-README.md
---
-language: de
---
-
-## xlm-roberta-large-finetuned-conll03-german
--- a/model_cards/yjernite/bart_eli5/README.md
+++ b/model_cards/yjernite/bart_eli5/README.md
---
-language: en
-license: apache-2.0
-datasets:
- eli5
---
-
-## BART ELI5
-
-Read the article at https://yjernite.github.io/lfqa.html and try the demo at https://huggingface.co/qa/
--- a/model_cards/ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli/README.md
+++ b/model_cards/ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli/README.md
---
-datasets:
- snli
- anli
- multi_nli
- multi_nli_mismatch
- fever
-license: mit
---
-This is a strong pre-trained RoBERTa-Large NLI model.  
-
-The training data is a combination of well-known NLI datasets: [`SNLI`](https://nlp.stanford.edu/projects/snli/), [`MNLI`](https://cims.nyu.edu/~sbowman/multinli/), [`FEVER-NLI`](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md), [`ANLI (R1, R2, R3)`](https://github.com/facebookresearch/anli).  
-Other pre-trained NLI models including `RoBERTa`, `ALBert`, `BART`, `ELECTRA`, `XLNet` are also available.  
-
-Trained by [Yixin Nie](https://easonnie.github.io), [original source](https://github.com/facebookresearch/anli).
-
-Try the code snippet below.
-```
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-import torch
-
-if __name__ == '__main__':
-    max_length = 256
-
-    premise = "Two women are embracing while holding to go packages."
-    hypothesis = "The men are fighting outside a deli."
-
-    hg_model_hub_name = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
-    # hg_model_hub_name = "ynie/albert-xxlarge-v2-snli_mnli_fever_anli_R1_R2_R3-nli"
-    # hg_model_hub_name = "ynie/bart-large-snli_mnli_fever_anli_R1_R2_R3-nli"
-    # hg_model_hub_name = "ynie/electra-large-discriminator-snli_mnli_fever_anli_R1_R2_R3-nli"
-    # hg_model_hub_name = "ynie/xlnet-large-cased-snli_mnli_fever_anli_R1_R2_R3-nli"
-
-    tokenizer = AutoTokenizer.from_pretrained(hg_model_hub_name)
-    model = AutoModelForSequenceClassification.from_pretrained(hg_model_hub_name)
-
-    tokenized_input_seq_pair = tokenizer.encode_plus(premise, hypothesis,
-                                                     max_length=max_length,
-                                                     return_token_type_ids=True, truncation=True)
-
-    input_ids = torch.Tensor(tokenized_input_seq_pair['input_ids']).long().unsqueeze(0)
-    # remember bart doesn't have 'token_type_ids', remove the line below if you are using bart.
-    token_type_ids = torch.Tensor(tokenized_input_seq_pair['token_type_ids']).long().unsqueeze(0)
-    attention_mask = torch.Tensor(tokenized_input_seq_pair['attention_mask']).long().unsqueeze(0)
-
-    outputs = model(input_ids,
-                    attention_mask=attention_mask,
-                    token_type_ids=token_type_ids,
-                    labels=None)
-    # Note:
-    # "id2label": {
-    #     "0": "entailment",
-    #     "1": "neutral",
-    #     "2": "contradiction"
-    # },
-
-    predicted_probability = torch.softmax(outputs[0], dim=1)[0].tolist()  # batch_size only one
-
-    print("Premise:", premise)
-    print("Hypothesis:", hypothesis)
-    print("Entailment:", predicted_probability[0])
-    print("Neutral:", predicted_probability[1])
-    print("Contradiction:", predicted_probability[2])
-```
-
-See [here](https://github.com/facebookresearch/anli/blob/master/src/hg_api/interactive_eval.py) for more.
-
-Citation:
-```
-@inproceedings{nie-etal-2020-adversarial,
-    title = "Adversarial {NLI}: A New Benchmark for Natural Language Understanding",
-    author = "Nie, Yixin  and
-      Williams, Adina  and
-      Dinan, Emily  and
-      Bansal, Mohit  and
-      Weston, Jason  and
-      Kiela, Douwe",
-    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
-    year = "2020",
-    publisher = "Association for Computational Linguistics",
-}
-```
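Note on the card above: the snippet asks the reader to delete the `token_type_ids` line by hand when switching to BART. One alternative is to pass only the fields a given architecture accepts. The sketch below shows the idea in plain Python; the `ACCEPTED_INPUTS` table, the `select_inputs` helper, and the token ids are hypothetical and for illustration only.

```python
# Hypothetical table of which inputs each architecture accepts (illustrative only).
ACCEPTED_INPUTS = {
    "roberta": {"input_ids", "attention_mask", "token_type_ids"},
    "bart": {"input_ids", "attention_mask"},
}


def select_inputs(encoded, model_type):
    """Keep only the encoded fields that the given model type accepts."""
    accepted = ACCEPTED_INPUTS[model_type]
    return {name: value for name, value in encoded.items() if name in accepted}


# Stand-in for the dict returned by tokenizer.encode_plus(...); ids are made up.
encoded = {
    "input_ids": [0, 9058, 2, 2, 133, 2],
    "token_type_ids": [0, 0, 0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1, 1, 1],
}

print(sorted(select_inputs(encoded, "bart")))
print(sorted(select_inputs(encoded, "roberta")))
```

The filtered dict could then be converted to tensors and unpacked into the model call, so the same script serves every checkpoint listed in the card.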
--- a/model_cards/youscan/ukr-roberta-base/README.md
+++ b/model_cards/youscan/ukr-roberta-base/README.md
---
-language:
- uk
---
-
-# ukr-roberta-base
-
-## Pre-training corpora
-Below is the list of corpora used along with the output of wc command (counting lines, words and characters). These corpora were concatenated and tokenized with HuggingFace Roberta Tokenizer.
-
-| Corpus        | Lines           | Words  | Characters  |
-| ------------- |--------------:| -----:| -----:|
-| [Ukrainian Wikipedia - May 2020](https://dumps.wikimedia.org/ukwiki/latest/ukwiki-latest-pages-articles.xml.bz2)      | 18 001 466| 201 207 739 | 2 647 891 947 |
-| [Ukrainian OSCAR deduplicated dataset](https://oscar-public.huma-num.fr/shuffled/uk_dedup.txt.gz) | 56 560 011      |    2 250 210 650 | 29 705 050 592 |
-| Sampled mentions from social networks | 11 245 710      |    128 461 796 | 1 632 567 763 |
-| Total | 85 807 187      |    2 579 880 185 | 33 985 510 302 |
-
-## Pre-training details
-
-* Ukrainian Roberta was trained with the code provided in the [HuggingFace tutorial](https://huggingface.co/blog/how-to-train)
-* The currently released model follows the roberta-base-cased model architecture (12-layer, 768-hidden, 12-heads, 125M parameters)
-* The model was trained on 4xV100 (85 hours)
-* The training configuration can be found in the [original repository](https://github.com/youscan/language-models)
-
-## Author
-Vitalii Radchenko - contact me on Twitter [@vitaliradchenko](https://twitter.com/vitaliradchenko)
--- a/model_cards/yuvraj/summarizer-cnndm/README.md
+++ b/model_cards/yuvraj/summarizer-cnndm/README.md
---
-language: "en"
-tags:
- summarization
---
-
-# Summarization
-
-## Model description
-
-BartForConditionalGeneration model fine tuned for summarization on 10000 samples from the cnn-dailymail dataset
-
-## How to use
-
-PyTorch model available
-
-```python
-from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
-
-tokenizer = AutoTokenizer.from_pretrained("yuvraj/summarizer-cnndm") 
-model = AutoModelWithLMHead.from_pretrained("yuvraj/summarizer-cnndm")
-
-summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)
-summarizer("<Text to be summarized>")
-```
-
-## Limitations and bias
-Trained on a small dataset
--- a/model_cards/yuvraj/xSumm/README.md
+++ b/model_cards/yuvraj/xSumm/README.md
---
-language: "en"
-tags:
- summarization
- extreme summarization
---
-
-## Model description
-
-BartForConditionalGeneration model for extreme summarization: creates a one-line abstractive summary of a given article
-
-## How to use
-
-PyTorch model available
-
-```python
-from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
-
-tokenizer = AutoTokenizer.from_pretrained("yuvraj/xSumm")
-model = AutoModelWithLMHead.from_pretrained("yuvraj/xSumm")
-
-xsumm = pipeline('summarization', model=model, tokenizer=tokenizer)
-xsumm("<text to be summarized>")
-```
-
-## Limitations and bias
-Trained on a small fraction of the xsumm training dataset
--- a/model_cards/zanelim/singbert-large-sg/README.md
+++ b/model_cards/zanelim/singbert-large-sg/README.md
---
-language: en
-tags:
- singapore
- sg
- singlish
- malaysia
- ms
- manglish
- bert-large-uncased
-license: mit
-datasets:
- reddit singapore, malaysia
- hardwarezone
-widget:
- text: "kopi c siew [MASK]"
- text: "die [MASK] must try"
---
-
-# Model name
-
-SingBert Large - Bert for Singlish (SG) and Manglish (MY).
-
-## Model description
-
-Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the large version, which was initialized from [BERT large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models), with pre-training finetuned on
-[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.
-
-## Intended uses & limitations
-
-#### How to use
-
-```python
->>> from transformers import pipeline
->>> nlp = pipeline('fill-mask', model='zanelim/singbert-large-sg')
->>> nlp("kopi c siew [MASK]")
-
-[{'sequence': '[CLS] kopi c siew dai [SEP]',
-  'score': 0.9003700017929077,
-  'token': 18765,
-  'token_str': 'dai'},
- {'sequence': '[CLS] kopi c siew mai [SEP]',
-  'score': 0.0779474675655365,
-  'token': 14736,
-  'token_str': 'mai'},
- {'sequence': '[CLS] kopi c siew. [SEP]',
-  'score': 0.0032227332703769207,
-  'token': 1012,
-  'token_str': '.'},
- {'sequence': '[CLS] kopi c siew bao [SEP]',
-  'score': 0.0017727474914863706,
-  'token': 25945,
-  'token_str': 'bao'},
- {'sequence': '[CLS] kopi c siew peng [SEP]',
-  'score': 0.0012526646023616195,
-  'token': 26473,
-  'token_str': 'peng'}]
-
->>> nlp("one teh c siew dai, and one kopi [MASK]")
-
-[{'sequence': '[CLS] one teh c siew dai, and one kopi. [SEP]',
-  'score': 0.5249741077423096,
-  'token': 1012,
-  'token_str': '.'},
- {'sequence': '[CLS] one teh c siew dai, and one kopi o [SEP]',
-  'score': 0.27349168062210083,
-  'token': 1051,
-  'token_str': 'o'},
- {'sequence': '[CLS] one teh c siew dai, and one kopi peng [SEP]',
-  'score': 0.057190295308828354,
-  'token': 26473,
-  'token_str': 'peng'},
- {'sequence': '[CLS] one teh c siew dai, and one kopi c [SEP]',
-  'score': 0.04022320732474327,
-  'token': 1039,
-  'token_str': 'c'},
- {'sequence': '[CLS] one teh c siew dai, and one kopi? [SEP]',
-  'score': 0.01191170234233141,
-  'token': 1029,
-  'token_str': '?'}]
-
->>> nlp("die [MASK] must try")
-
-[{'sequence': '[CLS] die die must try [SEP]',
-  'score': 0.9921030402183533,
-  'token': 3280,
-  'token_str': 'die'},
- {'sequence': '[CLS] die also must try [SEP]',
-  'score': 0.004993876442313194,
-  'token': 2036,
-  'token_str': 'also'},
- {'sequence': '[CLS] die liao must try [SEP]',
-  'score': 0.000317625846946612,
-  'token': 727,
-  'token_str': 'liao'},
- {'sequence': '[CLS] die still must try [SEP]',
-  'score': 0.0002260878391098231,
-  'token': 2145,
-  'token_str': 'still'},
- {'sequence': '[CLS] die i must try [SEP]',
-  'score': 0.00016935862367972732,
-  'token': 1045,
-  'token_str': 'i'}]
-
->>> nlp("dont play [MASK] leh")
-
-[{'sequence': '[CLS] dont play play leh [SEP]',
-  'score': 0.9079819321632385,
-  'token': 2377,
-  'token_str': 'play'},
- {'sequence': '[CLS] dont play punk leh [SEP]',
-  'score': 0.006846973206847906,
-  'token': 7196,
-  'token_str': 'punk'},
- {'sequence': '[CLS] dont play games leh [SEP]',
-  'score': 0.004041737411171198,
-  'token': 2399,
-  'token_str': 'games'},
- {'sequence': '[CLS] dont play politics leh [SEP]',
-  'score': 0.003728888463228941,
-  'token': 4331,
-  'token_str': 'politics'},
- {'sequence': '[CLS] dont play cheat leh [SEP]',
-  'score': 0.0032805048394948244,
-  'token': 21910,
-  'token_str': 'cheat'}]
-
->>> nlp("confirm plus [MASK]")
-
-[{'sequence': '[CLS] confirm plus chop [SEP]',
-  'score': 0.9749826192855835,
-  'token': 24494,
-  'token_str': 'chop'},
- {'sequence': '[CLS] confirm plus chopped [SEP]',
-  'score': 0.017554156482219696,
-  'token': 24881,
-  'token_str': 'chopped'},
- {'sequence': '[CLS] confirm plus minus [SEP]',
-  'score': 0.002725469646975398,
-  'token': 15718,
-  'token_str': 'minus'},
- {'sequence': '[CLS] confirm plus guarantee [SEP]',
-  'score': 0.000900257145985961,
-  'token': 11302,
-  'token_str': 'guarantee'},
- {'sequence': '[CLS] confirm plus one [SEP]',
-  'score': 0.0004384620988275856,
-  'token': 2028,
-  'token_str': 'one'}]
-
->>> nlp("catch no [MASK]")
-
-[{'sequence': '[CLS] catch no ball [SEP]',
-  'score': 0.9381157159805298,
-  'token': 3608,
-  'token_str': 'ball'},
- {'sequence': '[CLS] catch no balls [SEP]',
-  'score': 0.060842301696538925,
-  'token': 7395,
-  'token_str': 'balls'},
- {'sequence': '[CLS] catch no fish [SEP]',
-  'score': 0.00030917322146706283,
-  'token': 3869,
-  'token_str': 'fish'},
- {'sequence': '[CLS] catch no breath [SEP]',
-  'score': 7.552534952992573e-05,
-  'token': 3052,
-  'token_str': 'breath'},
- {'sequence': '[CLS] catch no tail [SEP]',
-  'score': 4.208395694149658e-05,
-  'token': 5725,
-  'token_str': 'tail'}]
-
-```
-
-Here is how to use this model to get the features of a given text in PyTorch:
-```python
-from transformers import BertTokenizer, BertModel
-tokenizer = BertTokenizer.from_pretrained('zanelim/singbert-large-sg')
-model = BertModel.from_pretrained("zanelim/singbert-large-sg")
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='pt')
-output = model(**encoded_input)
-```
-
-and in TensorFlow:
-```python
-from transformers import BertTokenizer, TFBertModel
-tokenizer = BertTokenizer.from_pretrained("zanelim/singbert-large-sg")
-model = TFBertModel.from_pretrained("zanelim/singbert-large-sg")
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='tf')
-output = model(encoded_input)
-```
-
-#### Limitations and bias
-This model was finetuned on a colloquial Singlish and Manglish corpus, so it is best applied to downstream tasks involving the main
-constituent languages: English, Mandarin, and Malay. Also, as the training data comes mainly from forums, beware of inherent bias.
-
-## Training data
-Colloquial Singlish and Manglish corpus (both are a mixture of English, Mandarin, Tamil, Malay, and other local dialects like Hokkien, Cantonese or Teochew).
-The corpus is collected from the subreddits `r/singapore` and `r/malaysia`, and from forums such as `hardwarezone`.
-
-## Training procedure
-
-Initialized with [bert large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models) vocab and checkpoints (pre-trained weights).
-Top 1000 custom vocab tokens (non-overlapped with original bert vocab) were further extracted from training data and filled into unused tokens in original bert vocab.
-
-Pre-training was further finetuned on training data with the following hyperparameters
-* train_batch_size: 512
-* max_seq_length: 128
-* num_train_steps: 300000
-* num_warmup_steps: 5000
-* learning_rate: 2e-5
-* hardware: TPU v3-8
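Note on the card above: the training procedure mentions writing custom tokens into the unused slots of the original BERT vocab. A minimal sketch of that idea follows, assuming the slots are entries named `[unusedN]` and that tokens already present are skipped. The `fill_unused_slots` helper and the toy vocabulary are hypothetical; the card does not show the author's actual script.

```python
def fill_unused_slots(vocab, custom_tokens):
    """Replace [unusedN] entries in vocab with custom tokens, in order.

    Tokens already in the vocab are skipped; tokens that do not fit are returned.
    """
    existing = set(vocab)
    pending = [t for t in custom_tokens if t not in existing]
    new_vocab = list(vocab)
    for i, entry in enumerate(new_vocab):
        if not pending:
            break
        if entry.startswith("[unused") and entry.endswith("]"):
            new_vocab[i] = pending.pop(0)
    return new_vocab, pending


# Toy vocabulary with two free slots (a real BERT vocab has far more entries)
vocab = ["[PAD]", "[unused0]", "[unused1]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "die", "play"]
custom = ["kopi", "die", "siew", "leh"]  # "die" already exists, so it is skipped

new_vocab, leftover = fill_unused_slots(vocab, custom)
print(new_vocab)
print(leftover)
```

Because slots are overwritten in place, the vocabulary size and all existing token ids stay the same, so the pre-trained embedding matrix can be reused without resizing.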
--- a/model_cards/zanelim/singbert-lite-sg/README.md
+++ b/model_cards/zanelim/singbert-lite-sg/README.md
---
-language: en
-tags:
- singapore
- sg
- singlish
- malaysia
- ms
- manglish
- albert-base-v2
-license: mit
-datasets:
- reddit singapore, malaysia
- hardwarezone
-widget:
- text: "dont play [MASK] leh"
- text: "die [MASK] must try"
---
-
-# Model name
-
-SingBert Lite - Bert for Singlish (SG) and Manglish (MY).
-
-## Model description
-
-Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the lite-version, which was initialized from [Albert base v2](https://github.com/google-research/albert#albert), with pre-training finetuned on
-[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.
-
-## Intended uses & limitations
-
-#### How to use
-
-```python
->>> from transformers import pipeline
->>> nlp = pipeline('fill-mask', model='zanelim/singbert-lite-sg')
->>> nlp("die [MASK] must try")
-
-[{'sequence': '[CLS] die die must try[SEP]',
-  'score': 0.7731555700302124,
-  'token': 1327,
-  'token_str': '▁die'},
- {'sequence': '[CLS] die also must try[SEP]',
-  'score': 0.04763784259557724,
-  'token': 67,
-  'token_str': '▁also'},
- {'sequence': '[CLS] die still must try[SEP]',
-  'score': 0.01859409362077713,
-  'token': 174,
-  'token_str': '▁still'},
- {'sequence': '[CLS] die u must try[SEP]',
-  'score': 0.015824034810066223,
-  'token': 287,
-  'token_str': '▁u'},
- {'sequence': '[CLS] die is must try[SEP]',
-  'score': 0.011271446943283081,
-  'token': 25,
-  'token_str': '▁is'}]
-
->>> nlp("dont play [MASK] leh")
-
-[{'sequence': '[CLS] dont play play leh[SEP]',
-  'score': 0.4365769624710083,
-  'token': 418,
-  'token_str': '▁play'},
- {'sequence': '[CLS] dont play punk leh[SEP]',
-  'score': 0.06880936771631241,
-  'token': 6769,
-  'token_str': '▁punk'},
- {'sequence': '[CLS] dont play game leh[SEP]',
-  'score': 0.051739856600761414,
-  'token': 250,
-  'token_str': '▁game'},
- {'sequence': '[CLS] dont play games leh[SEP]',
-  'score': 0.045703962445259094,
-  'token': 466,
-  'token_str': '▁games'},
- {'sequence': '[CLS] dont play around leh[SEP]',
-  'score': 0.013458190485835075,
-  'token': 140,
-  'token_str': '▁around'}]
-
->>> nlp("catch no [MASK]")
-
-[{'sequence': '[CLS] catch no ball[SEP]',
-  'score': 0.6197211146354675,
-  'token': 1592,
-  'token_str': '▁ball'},
- {'sequence': '[CLS] catch no balls[SEP]',
-  'score': 0.08441998809576035,
-  'token': 7152,
-  'token_str': '▁balls'},
- {'sequence': '[CLS] catch no joke[SEP]',
-  'score': 0.0676785409450531,
-  'token': 8186,
-  'token_str': '▁joke'},
- {'sequence': '[CLS] catch no?[SEP]',
-  'score': 0.040638409554958344,
-  'token': 60,
-  'token_str': '?'},
- {'sequence': '[CLS] catch no one[SEP]',
-  'score': 0.03546864539384842,
-  'token': 53,
-  'token_str': '▁one'}]
-
->>> nlp("confirm plus [MASK]")
-
-[{'sequence': '[CLS] confirm plus chop[SEP]',
-  'score': 0.9608421921730042,
-  'token': 17144,
-  'token_str': '▁chop'},
- {'sequence': '[CLS] confirm plus guarantee[SEP]',
-  'score': 0.011784233152866364,
-  'token': 9120,
-  'token_str': '▁guarantee'},
- {'sequence': '[CLS] confirm plus confirm[SEP]',
-  'score': 0.010571340098977089,
-  'token': 10265,
-  'token_str': '▁confirm'},
- {'sequence': '[CLS] confirm plus egg[SEP]',
-  'score': 0.0033525123726576567,
-  'token': 6387,
-  'token_str': '▁egg'},
- {'sequence': '[CLS] confirm plus bet[SEP]',
-  'score': 0.0008760977652855217,
-  'token': 5676,
-  'token_str': '▁bet'}]
-
-```
-
-Here is how to use this model to get the features of a given text in PyTorch:
-```python
-from transformers import AlbertTokenizer, AlbertModel
-tokenizer = AlbertTokenizer.from_pretrained('zanelim/singbert-lite-sg')
-model = AlbertModel.from_pretrained("zanelim/singbert-lite-sg")
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='pt')
-output = model(**encoded_input)
-```
-
-and in TensorFlow:
-```python
-from transformers import AlbertTokenizer, TFAlbertModel
-tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
-model = TFAlbertModel.from_pretrained("zanelim/singbert-lite-sg")
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='tf')
-output = model(encoded_input)
-```
-
-#### Limitations and bias
-This model was finetuned on a colloquial Singlish and Manglish corpus, so it is best applied to downstream tasks involving the main
-constituent languages: English, Mandarin, and Malay. Also, as the training data comes mainly from forums, beware of inherent bias.
-
-## Training data
-Colloquial Singlish and Manglish corpus (both are a mixture of English, Mandarin, Tamil, Malay, and other local dialects like Hokkien, Cantonese or Teochew).
-The corpus is collected from the subreddits `r/singapore` and `r/malaysia`, and from forums such as `hardwarezone`.
-
-## Training procedure
-
-Initialized with [albert base v2](https://github.com/google-research/albert#albert) vocab and checkpoints (pre-trained weights).
-
-The pre-trained model was further fine-tuned on the training data with the following hyperparameters:
-* train_batch_size: 4096
-* max_seq_length: 128
-* num_train_steps: 125000
-* num_warmup_steps: 5000
-* learning_rate: 0.00176
-* hardware: TPU v3-8
--- a/model_cards/zanelim/singbert/README.md
+++ b/model_cards/zanelim/singbert/README.md
---
-language: en
-tags:
- singapore
- sg
- singlish
- malaysia
- ms
- manglish
- bert-base-uncased
-license: mit
-datasets:
- reddit singapore, malaysia
- hardwarezone
-widget:
- text: "kopi c siew [MASK]"
- text: "die [MASK] must try"
---
-
-# Model name
-
-SingBert - Bert for Singlish (SG) and Manglish (MY).
-
-## Model description
-
-[BERT base uncased](https://github.com/google-research/bert#pre-trained-models), with pre-training finetuned on
-[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.
-
-## Intended uses & limitations
-
-#### How to use
-
-```python
->>> from transformers import pipeline
->>> nlp = pipeline('fill-mask', model='zanelim/singbert')
->>> nlp("kopi c siew [MASK]")
-
-[{'sequence': '[CLS] kopi c siew dai [SEP]',
-  'score': 0.5092713236808777,
-  'token': 18765,
-  'token_str': 'dai'},
- {'sequence': '[CLS] kopi c siew mai [SEP]',
-  'score': 0.3515934646129608,
-  'token': 14736,
-  'token_str': 'mai'},
- {'sequence': '[CLS] kopi c siew bao [SEP]',
-  'score': 0.05576375499367714,
-  'token': 25945,
-  'token_str': 'bao'},
- {'sequence': '[CLS] kopi c siew. [SEP]',
-  'score': 0.006019321270287037,
-  'token': 1012,
-  'token_str': '.'},
- {'sequence': '[CLS] kopi c siew sai [SEP]',
-  'score': 0.0038361591286957264,
-  'token': 18952,
-  'token_str': 'sai'}]
-
->>> nlp("one teh c siew dai, and one kopi [MASK].")
-
-[{'sequence': '[CLS] one teh c siew dai, and one kopi c [SEP]',
-  'score': 0.6176503300666809,
-  'token': 1039,
-  'token_str': 'c'},
- {'sequence': '[CLS] one teh c siew dai, and one kopi o [SEP]',
-  'score': 0.21094971895217896,
-  'token': 1051,
-  'token_str': 'o'},
- {'sequence': '[CLS] one teh c siew dai, and one kopi. [SEP]',
-  'score': 0.13027705252170563,
-  'token': 1012,
-  'token_str': '.'},
- {'sequence': '[CLS] one teh c siew dai, and one kopi! [SEP]',
-  'score': 0.004680239595472813,
-  'token': 999,
-  'token_str': '!'},
- {'sequence': '[CLS] one teh c siew dai, and one kopi w [SEP]',
-  'score': 0.002034128177911043,
-  'token': 1059,
-  'token_str': 'w'}]
-
->>> nlp("dont play [MASK] leh")
-
-[{'sequence': '[CLS] dont play play leh [SEP]',
-  'score': 0.9281464219093323,
-  'token': 2377,
-  'token_str': 'play'},
- {'sequence': '[CLS] dont play politics leh [SEP]',
-  'score': 0.010990909300744534,
-  'token': 4331,
-  'token_str': 'politics'},
- {'sequence': '[CLS] dont play punk leh [SEP]',
-  'score': 0.005583590362221003,
-  'token': 7196,
-  'token_str': 'punk'},
- {'sequence': '[CLS] dont play dirty leh [SEP]',
-  'score': 0.0025784350000321865,
-  'token': 6530,
-  'token_str': 'dirty'},
- {'sequence': '[CLS] dont play cheat leh [SEP]',
-  'score': 0.0025066907983273268,
-  'token': 21910,
-  'token_str': 'cheat'}]
-
->>> nlp("catch no [MASK]")
-
-[{'sequence': '[CLS] catch no ball [SEP]',
-  'score': 0.7922210693359375,
-  'token': 3608,
-  'token_str': 'ball'},
- {'sequence': '[CLS] catch no balls [SEP]',
-  'score': 0.20503675937652588,
-  'token': 7395,
-  'token_str': 'balls'},
- {'sequence': '[CLS] catch no tail [SEP]',
-  'score': 0.0006608376861549914,
-  'token': 5725,
-  'token_str': 'tail'},
- {'sequence': '[CLS] catch no talent [SEP]',
-  'score': 0.0002158183924620971,
-  'token': 5848,
-  'token_str': 'talent'},
- {'sequence': '[CLS] catch no prisoners [SEP]',
-  'score': 5.3481446229852736e-05,
-  'token': 5895,
-  'token_str': 'prisoners'}]
-
->>> nlp("confirm plus [MASK]")
-
-[{'sequence': '[CLS] confirm plus chop [SEP]',
-  'score': 0.992355227470398,
-  'token': 24494,
-  'token_str': 'chop'},
- {'sequence': '[CLS] confirm plus one [SEP]',
-  'score': 0.0037301010452210903,
-  'token': 2028,
-  'token_str': 'one'},
- {'sequence': '[CLS] confirm plus minus [SEP]',
-  'score': 0.0014284878270700574,
-  'token': 15718,
-  'token_str': 'minus'},
- {'sequence': '[CLS] confirm plus 1 [SEP]',
-  'score': 0.0011354683665558696,
-  'token': 1015,
-  'token_str': '1'},
- {'sequence': '[CLS] confirm plus chopped [SEP]',
-  'score': 0.0003804611915256828,
-  'token': 24881,
-  'token_str': 'chopped'}]
-
->>> nlp("die [MASK] must try")
-
-[{'sequence': '[CLS] die die must try [SEP]',
-  'score': 0.9552758932113647,
-  'token': 3280,
-  'token_str': 'die'},
- {'sequence': '[CLS] die also must try [SEP]',
-  'score': 0.03644804656505585,
-  'token': 2036,
-  'token_str': 'also'},
- {'sequence': '[CLS] die liao must try [SEP]',
-  'score': 0.003282855963334441,
-  'token': 727,
-  'token_str': 'liao'},
- {'sequence': '[CLS] die already must try [SEP]',
-  'score': 0.0004937972989864647,
-  'token': 2525,
-  'token_str': 'already'},
- {'sequence': '[CLS] die hard must try [SEP]',
-  'score': 0.0003659659414552152,
-  'token': 2524,
-  'token_str': 'hard'}]
-
-```
-
-Here is how to use this model to get the features of a given text in PyTorch:
-```python
-from transformers import BertTokenizer, BertModel
-tokenizer = BertTokenizer.from_pretrained('zanelim/singbert')
-model = BertModel.from_pretrained("zanelim/singbert")
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='pt')
-output = model(**encoded_input)
-```
-
-and in TensorFlow:
-```python
-from transformers import BertTokenizer, TFBertModel
-tokenizer = BertTokenizer.from_pretrained("zanelim/singbert")
-model = TFBertModel.from_pretrained("zanelim/singbert")
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='tf')
-output = model(encoded_input)
-```
-
-#### Limitations and bias
-This model was fine-tuned on a colloquial Singlish and Manglish corpus, so it is best applied to downstream tasks involving the main
-constituent languages: English, Mandarin, and Malay. Because the training data comes mainly from forums, be aware that the model may reflect biases present in that data.
-
-## Training data
-A corpus of colloquial Singlish and Manglish (both are mixtures of English, Mandarin, Tamil, Malay, and other local dialects such as Hokkien, Cantonese, or Teochew).
-The corpus was collected from the subreddits `r/singapore` and `r/malaysia` and from forums such as `hardwarezone`.
-
-## Training procedure
-
-Initialized with [bert base uncased](https://github.com/google-research/bert#pre-trained-models) vocab and checkpoints (pre-trained weights).
-The top 1000 custom vocab tokens (those not already in the original BERT vocab) were then extracted from the training data and placed into the unused-token slots of the original BERT vocab.
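
The vocab step described above can be illustrated with a minimal sketch. The card does not say how the tokens were counted or inserted, so everything here is an assumption: the helper name `fill_unused_slots` is hypothetical, the corpus is taken to be already tokenized into a flat list, "top" is taken to mean most frequent, and unused slots are taken to be entries starting with `[unused`.

```python
from collections import Counter


def fill_unused_slots(vocab, corpus_tokens, top_k=1000):
    """Return a copy of `vocab` with `[unusedN]` slots replaced by frequent new tokens.

    Hypothetical helper: the selection criterion (raw frequency) is an assumption.
    """
    existing = set(vocab)
    # Count only tokens that are not already present in the vocab.
    counts = Counter(tok for tok in corpus_tokens if tok not in existing)
    candidates = iter(tok for tok, _ in counts.most_common(top_k))

    filled = list(vocab)
    for index, entry in enumerate(filled):
        if entry.startswith("[unused"):
            new_token = next(candidates, None)
            if new_token is None:
                break  # fewer new tokens than unused slots
            filled[index] = new_token
    return filled


toy_vocab = ["[PAD]", "[unused0]", "[unused1]", "[CLS]", "the", "[unused2]"]
toy_corpus = ["lah", "lah", "the", "leh", "lah", "leh", "sian"]
new_vocab = fill_unused_slots(toy_vocab, toy_corpus, top_k=2)
```

Reusing the unused slots keeps the vocab size, and therefore the embedding matrix shape, unchanged, so the original checkpoint can still be loaded.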
-
-The pre-trained model was further fine-tuned on the training data with the following hyperparameters:
-* train_batch_size: 512
-* max_seq_length: 128
-* num_train_steps: 300000
-* num_warmup_steps: 5000
-* learning_rate: 2e-5
-* hardware: TPU v3-8
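
To show how `num_warmup_steps`, `num_train_steps`, and `learning_rate` interact, here is a small sketch of a learning-rate schedule. The card does not state which schedule was used; linear warmup followed by linear decay to zero is assumed here because it is a common choice for BERT-style training, and the function name `learning_rate_at` is hypothetical.

```python
def learning_rate_at(step, peak_lr=2e-5, warmup_steps=5000, total_steps=300000):
    """Learning rate at a given step under an ASSUMED linear warmup / linear decay schedule."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay linearly from the peak down to 0 at the final step.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 2500, 5000, 152500, 300000):
    print(step, learning_rate_at(step))
```

Under this assumption the rate peaks at step 5000 and reaches zero at step 300000.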