Unverified Commit 3552d0e0 authored by Julien Chaumond's avatar Julien Chaumond Committed by GitHub
Browse files

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined so let me know if any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language: zh
---
# albert_chinese_large
This a albert_chinese_large model from [Google's github](https://github.com/google-research/ALBERT)
converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
## Attention (注意)
Since sentencepiece is not used in albert_chinese_large model
you have to call BertTokenizer instead of AlbertTokenizer !!!
we can eval it using an example on MaskedLM
由於 albert_chinese_large 模型沒有用 sentencepiece
用AlbertTokenizer會載不進詞表,因此需要改用BertTokenizer !!!
我們可以跑MaskedLM預測來驗證這個做法是否正確
## Justify (驗證有效性)
[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
```python
from transformers import *
import torch
from torch.nn.functional import softmax
pretrained = 'voidful/albert_chinese_large'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)
inputtext = "今天[MASK]情很好"
maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos]).data.tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token,logit_prob[predicted_index])
```
Result: `心 0.9422469735145569`
---
language: zh
---
# albert_chinese_small
This a albert_chinese_small model from [brightmart/albert_zh project](https://github.com/brightmart/albert_zh), albert_small_google_zh model
converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
## Attention (注意)
Since sentencepiece is not used in albert_chinese_small model
you have to call BertTokenizer instead of AlbertTokenizer !!!
we can eval it using an example on MaskedLM
由於 albert_chinese_small 模型沒有用 sentencepiece
用AlbertTokenizer會載不進詞表,因此需要改用BertTokenizer !!!
我們可以跑MaskedLM預測來驗證這個做法是否正確
## Justify (驗證有效性)
[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
```python
from transformers import *
import torch
from torch.nn.functional import softmax
pretrained = 'voidful/albert_chinese_small'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)
inputtext = "今天[MASK]情很好"
maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos]).data.tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token,logit_prob[predicted_index])
```
Result: `感 0.6390823125839233`
---
language: zh
---
# albert_chinese_tiny
This a albert_chinese_tiny model from [brightmart/albert_zh project](https://github.com/brightmart/albert_zh), albert_tiny_google_zh model
converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
## Attention (注意)
Since sentencepiece is not used in albert_chinese_tiny model
you have to call BertTokenizer instead of AlbertTokenizer !!!
we can eval it using an example on MaskedLM
由於 albert_chinese_tiny 模型沒有用 sentencepiece
用AlbertTokenizer會載不進詞表,因此需要改用BertTokenizer !!!
我們可以跑MaskedLM預測來驗證這個做法是否正確
## Justify (驗證有效性)
[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
```python
from transformers import *
import torch
from torch.nn.functional import softmax
pretrained = 'voidful/albert_chinese_tiny'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)
inputtext = "今天[MASK]情很好"
maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos]).data.tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token,logit_prob[predicted_index])
```
Result: `感 0.40312355756759644`
---
language: zh
---
# albert_chinese_xlarge
This a albert_chinese_xlarge model from [Google's github](https://github.com/google-research/ALBERT)
converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
## Attention (注意)
Since sentencepiece is not used in albert_chinese_xlarge model
you have to call BertTokenizer instead of AlbertTokenizer !!!
we can eval it using an example on MaskedLM
由於 albert_chinese_xlarge 模型沒有用 sentencepiece
用AlbertTokenizer會載不進詞表,因此需要改用BertTokenizer !!!
我們可以跑MaskedLM預測來驗證這個做法是否正確
## Justify (驗證有效性)
[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
```python
from transformers import *
import torch
from torch.nn.functional import softmax
pretrained = 'voidful/albert_chinese_xlarge'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)
inputtext = "今天[MASK]情很好"
maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos]).data.tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token,logit_prob[predicted_index])
```
Result: `心 0.9942440390586853`
---
language: zh
---
# albert_chinese_xxlarge
This a albert_chinese_xxlarge model from [Google's github](https://github.com/google-research/ALBERT)
converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
## Attention (注意)
Since sentencepiece is not used in albert_chinese_xxlarge model
you have to call BertTokenizer instead of AlbertTokenizer !!!
we can eval it using an example on MaskedLM
由於 albert_chinese_xxlarge 模型沒有用 sentencepiece
用AlbertTokenizer會載不進詞表,因此需要改用BertTokenizer !!!
我們可以跑MaskedLM預測來驗證這個做法是否正確
## Justify (驗證有效性)
[colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
```python
from transformers import *
import torch
from torch.nn.functional import softmax
pretrained = 'voidful/albert_chinese_xxlarge'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)
inputtext = "今天[MASK]情很好"
maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos]).data.tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token,logit_prob[predicted_index])
```
Result: `心 0.995713472366333`
# BERTje: A Dutch BERT model
BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
⚠️ **The new home of this model is the [GroNLP](https://huggingface.co/GroNLP) organization.**
BERTje now lives at: [`GroNLP/bert-base-dutch-cased`](https://huggingface.co/GroNLP/bert-base-dutch-cased)
The model weights of the versions at `wietsedv/` and `GroNLP/` are the same, so do not worry if you use(d) `wietsedv/bert-base-dutch-cased`.
<img src="https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png" height="250">
# albert-chinese-large-qa
Albert large QA model pretrained from baidu webqa and baidu dureader datasets.
## Data source
+ baidu webqa 1.0
+ baidu dureader
## Traing Method
We combined the two datasets together and created a new dataset in squad format, including 705139 samples for training and 69638 samples for validation.
We finetune the model based on the albert chinese large model.
## Hyperparams
+ learning_rate 1e-5
+ max_seq_length 512
+ max_query_length 50
+ max_answer_length 300
+ doc_stride 256
+ num_train_epochs 2
+ warmup_steps 1000
+ per_gpu_train_batch_size 8
+ gradient_accumulation_steps 3
+ n_gpu 2 (Nvidia Tesla P100)
## Usage
```
from transformers import AutoModelForQuestionAnswering, BertTokenizer
model = AutoModelForQuestionAnswering.from_pretrained('wptoux/albert-chinese-large-qa')
tokenizer = BertTokenizer.from_pretrained('wptoux/albert-chinese-large-qa')
```
***Important: use BertTokenizer***
## MoreInfo
Please visit https://github.com/wptoux/albert-chinese-large-webqa for details.
---
tags:
- exbert
license: cc-by-nc-4.0
---
<a href="https://huggingface.co/exbert/?model=xlm-mlm-en-2048">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
---
tags:
- exbert
license: mit
---
<a href="https://huggingface.co/exbert/?model=xlm-roberta-base">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
---
language: de
---
## xlm-roberta-large-finetuned-conll03-german
---
language: en
license: apache-2.0
datasets:
- eli5
---
## BART ELI5
Read the article at https://yjernite.github.io/lfqa.html and try the demo at https://huggingface.co/qa/
---
datasets:
- snli
- anli
- multi_nli
- multi_nli_mismatch
- fever
license: mit
---
This is a strong pre-trained RoBERTa-Large NLI model.
The training data is a combination of well-known NLI datasets: [`SNLI`](https://nlp.stanford.edu/projects/snli/), [`MNLI`](https://cims.nyu.edu/~sbowman/multinli/), [`FEVER-NLI`](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md), [`ANLI (R1, R2, R3)`](https://github.com/facebookresearch/anli).
Other pre-trained NLI models including `RoBERTa`, `ALBert`, `BART`, `ELECTRA`, `XLNet` are also available.
Trained by [Yixin Nie](https://easonnie.github.io), [original source](https://github.com/facebookresearch/anli).
Try the code snippet below.
```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
if __name__ == '__main__':
max_length = 256
premise = "Two women are embracing while holding to go packages."
hypothesis = "The men are fighting outside a deli."
hg_model_hub_name = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
# hg_model_hub_name = "ynie/albert-xxlarge-v2-snli_mnli_fever_anli_R1_R2_R3-nli"
# hg_model_hub_name = "ynie/bart-large-snli_mnli_fever_anli_R1_R2_R3-nli"
# hg_model_hub_name = "ynie/electra-large-discriminator-snli_mnli_fever_anli_R1_R2_R3-nli"
# hg_model_hub_name = "ynie/xlnet-large-cased-snli_mnli_fever_anli_R1_R2_R3-nli"
tokenizer = AutoTokenizer.from_pretrained(hg_model_hub_name)
model = AutoModelForSequenceClassification.from_pretrained(hg_model_hub_name)
tokenized_input_seq_pair = tokenizer.encode_plus(premise, hypothesis,
max_length=max_length,
return_token_type_ids=True, truncation=True)
input_ids = torch.Tensor(tokenized_input_seq_pair['input_ids']).long().unsqueeze(0)
# remember bart doesn't have 'token_type_ids', remove the line below if you are using bart.
token_type_ids = torch.Tensor(tokenized_input_seq_pair['token_type_ids']).long().unsqueeze(0)
attention_mask = torch.Tensor(tokenized_input_seq_pair['attention_mask']).long().unsqueeze(0)
outputs = model(input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
labels=None)
# Note:
# "id2label": {
# "0": "entailment",
# "1": "neutral",
# "2": "contradiction"
# },
predicted_probability = torch.softmax(outputs[0], dim=1)[0].tolist() # batch_size only one
print("Premise:", premise)
print("Hypothesis:", hypothesis)
print("Entailment:", predicted_probability[0])
print("Neutral:", predicted_probability[1])
print("Contradiction:", predicted_probability[2])
```
More in [here](https://github.com/facebookresearch/anli/blob/master/src/hg_api/interactive_eval.py).
Citation:
```
@inproceedings{nie-etal-2020-adversarial,
title = "Adversarial {NLI}: A New Benchmark for Natural Language Understanding",
author = "Nie, Yixin and
Williams, Adina and
Dinan, Emily and
Bansal, Mohit and
Weston, Jason and
Kiela, Douwe",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
year = "2020",
publisher = "Association for Computational Linguistics",
}
```
---
language:
- uk
---
# ukr-roberta-base
## Pre-training corpora
Below is the list of corpora used along with the output of wc command (counting lines, words and characters). These corpora were concatenated and tokenized with HuggingFace Roberta Tokenizer.
| Tables | Lines | Words | Characters |
| ------------- |--------------:| -----:| -----:|
| [Ukrainian Wikipedia - May 2020](https://dumps.wikimedia.org/ukwiki/latest/ukwiki-latest-pages-articles.xml.bz2) | 18 001 466| 201 207 739 | 2 647 891 947 |
| [Ukrainian OSCAR deduplicated dataset](https://oscar-public.huma-num.fr/shuffled/uk_dedup.txt.gz) | 56 560 011 | 2 250 210 650 | 29 705 050 592 |
| Sampled mentions from social networks | 11 245 710 | 128 461 796 | 1 632 567 763 |
| Total | 85 807 187 | 2 579 880 185 | 33 985 510 302 |
## Pre-training details
* Ukrainian Roberta was trained with code provided in [HuggingFace tutorial](https://huggingface.co/blog/how-to-train)
* Currently released model follows roberta-base-cased model architecture (12-layer, 768-hidden, 12-heads, 125M parameters)
* The model was trained on 4xV100 (85 hours)
* Training configuration you can find in the [original repository](https://github.com/youscan/language-models)
## Author
Vitalii Radchenko - contact me on Twitter [@vitaliradchenko](https://twitter.com/vitaliradchenko)
---
language: "en"
tags:
- summarization
---
# Summarization
## Model description
BartForConditionalGeneration model fine tuned for summarization on 10000 samples from the cnn-dailymail dataset
## How to use
PyTorch model available
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
tokenizer = AutoTokenizer.from_pretrained("yuvraj/summarizer-cnndm")
AutoModelWithLMHead.from_pretrained("yuvraj/summarizer-cnndm")
summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)
summarizer("<Text to be summarized>")
## Limitations and bias
Trained on a small dataset
---
language: "en"
tags:
- summarization
- extreme summarization
---
## Model description
BartForConditionalGenerationModel for extreme summarization- creates a one line abstractive summary of a given article
## How to use
PyTorch model available
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
tokenizer = AutoTokenizer.from_pretrained("yuvraj/xSumm")
model = AutoModelWithLMHead.from_pretrained("yuvraj/xSumm")
xsumm = pipeline('summarization', model=model, tokenizer=tokenizer)
xsumm("<text to be summarized>")
## Limitations and bias
Trained on a small fraction of the xsumm training dataset
---
language: en
tags:
- singapore
- sg
- singlish
- malaysia
- ms
- manglish
- bert-large-uncased
license: mit
datasets:
- reddit singapore, malaysia
- hardwarezone
widget:
- text: "kopi c siew [MASK]"
- text: "die [MASK] must try"
---
# Model name
SingBert Large - Bert for Singlish (SG) and Manglish (MY).
## Model description
Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the large version, which was initialized from [BERT large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models), with pre-training finetuned on
[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.
## Intended uses & limitations
#### How to use
```python
>>> from transformers import pipeline
>>> nlp = pipeline('fill-mask', model='zanelim/singbert-large-sg')
>>> nlp("kopi c siew [MASK]")
[{'sequence': '[CLS] kopi c siew dai [SEP]',
'score': 0.9003700017929077,
'token': 18765,
'token_str': 'dai'},
{'sequence': '[CLS] kopi c siew mai [SEP]',
'score': 0.0779474675655365,
'token': 14736,
'token_str': 'mai'},
{'sequence': '[CLS] kopi c siew. [SEP]',
'score': 0.0032227332703769207,
'token': 1012,
'token_str': '.'},
{'sequence': '[CLS] kopi c siew bao [SEP]',
'score': 0.0017727474914863706,
'token': 25945,
'token_str': 'bao'},
{'sequence': '[CLS] kopi c siew peng [SEP]',
'score': 0.0012526646023616195,
'token': 26473,
'token_str': 'peng'}]
>>> nlp("one teh c siew dai, and one kopi [MASK]")
[{'sequence': '[CLS] one teh c siew dai, and one kopi. [SEP]',
'score': 0.5249741077423096,
'token': 1012,
'token_str': '.'},
{'sequence': '[CLS] one teh c siew dai, and one kopi o [SEP]',
'score': 0.27349168062210083,
'token': 1051,
'token_str': 'o'},
{'sequence': '[CLS] one teh c siew dai, and one kopi peng [SEP]',
'score': 0.057190295308828354,
'token': 26473,
'token_str': 'peng'},
{'sequence': '[CLS] one teh c siew dai, and one kopi c [SEP]',
'score': 0.04022320732474327,
'token': 1039,
'token_str': 'c'},
{'sequence': '[CLS] one teh c siew dai, and one kopi? [SEP]',
'score': 0.01191170234233141,
'token': 1029,
'token_str': '?'}]
>>> nlp("die [MASK] must try")
[{'sequence': '[CLS] die die must try [SEP]',
'score': 0.9921030402183533,
'token': 3280,
'token_str': 'die'},
{'sequence': '[CLS] die also must try [SEP]',
'score': 0.004993876442313194,
'token': 2036,
'token_str': 'also'},
{'sequence': '[CLS] die liao must try [SEP]',
'score': 0.000317625846946612,
'token': 727,
'token_str': 'liao'},
{'sequence': '[CLS] die still must try [SEP]',
'score': 0.0002260878391098231,
'token': 2145,
'token_str': 'still'},
{'sequence': '[CLS] die i must try [SEP]',
'score': 0.00016935862367972732,
'token': 1045,
'token_str': 'i'}]
>>> nlp("dont play [MASK] leh")
[{'sequence': '[CLS] dont play play leh [SEP]',
'score': 0.9079819321632385,
'token': 2377,
'token_str': 'play'},
{'sequence': '[CLS] dont play punk leh [SEP]',
'score': 0.006846973206847906,
'token': 7196,
'token_str': 'punk'},
{'sequence': '[CLS] dont play games leh [SEP]',
'score': 0.004041737411171198,
'token': 2399,
'token_str': 'games'},
{'sequence': '[CLS] dont play politics leh [SEP]',
'score': 0.003728888463228941,
'token': 4331,
'token_str': 'politics'},
{'sequence': '[CLS] dont play cheat leh [SEP]',
'score': 0.0032805048394948244,
'token': 21910,
'token_str': 'cheat'}]
>>> nlp("confirm plus [MASK]")
{'sequence': '[CLS] confirm plus chop [SEP]',
'score': 0.9749826192855835,
'token': 24494,
'token_str': 'chop'},
{'sequence': '[CLS] confirm plus chopped [SEP]',
'score': 0.017554156482219696,
'token': 24881,
'token_str': 'chopped'},
{'sequence': '[CLS] confirm plus minus [SEP]',
'score': 0.002725469646975398,
'token': 15718,
'token_str': 'minus'},
{'sequence': '[CLS] confirm plus guarantee [SEP]',
'score': 0.000900257145985961,
'token': 11302,
'token_str': 'guarantee'},
{'sequence': '[CLS] confirm plus one [SEP]',
'score': 0.0004384620988275856,
'token': 2028,
'token_str': 'one'}]
>>> nlp("catch no [MASK]")
[{'sequence': '[CLS] catch no ball [SEP]',
'score': 0.9381157159805298,
'token': 3608,
'token_str': 'ball'},
{'sequence': '[CLS] catch no balls [SEP]',
'score': 0.060842301696538925,
'token': 7395,
'token_str': 'balls'},
{'sequence': '[CLS] catch no fish [SEP]',
'score': 0.00030917322146706283,
'token': 3869,
'token_str': 'fish'},
{'sequence': '[CLS] catch no breath [SEP]',
'score': 7.552534952992573e-05,
'token': 3052,
'token_str': 'breath'},
{'sequence': '[CLS] catch no tail [SEP]',
'score': 4.208395694149658e-05,
'token': 5725,
'token_str': 'tail'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('zanelim/singbert-large-sg')
model = BertModel.from_pretrained("zanelim/singbert-large-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained("zanelim/singbert-large-sg")
model = TFBertModel.from_pretrained("zanelim/singbert-large-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
#### Limitations and bias
This model was finetuned on colloquial Singlish and Manglish corpus, hence it is best applied on downstream tasks involving the main
constituent languages- english, mandarin, malay. Also, as the training data is mainly from forums, beware of existing inherent bias.
## Training data
Colloquial singlish and manglish (both are a mixture of English, Mandarin, Tamil, Malay, and other local dialects like Hokkien, Cantonese or Teochew)
corpus. The corpus is collected from subreddits- `r/singapore` and `r/malaysia`, and forums such as `hardwarezone`.
## Training procedure
Initialized with [bert large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models) vocab and checkpoints (pre-trained weights).
Top 1000 custom vocab tokens (non-overlapped with original bert vocab) were further extracted from training data and filled into unused tokens in original bert vocab.
Pre-training was further finetuned on training data with the following hyperparameters
* train_batch_size: 512
* max_seq_length: 128
* num_train_steps: 300000
* num_warmup_steps: 5000
* learning_rate: 2e-5
* hardware: TPU v3-8
---
language: en
tags:
- singapore
- sg
- singlish
- malaysia
- ms
- manglish
- albert-base-v2
license: mit
datasets:
- reddit singapore, malaysia
- hardwarezone
widget:
- text: "dont play [MASK] leh"
- text: "die [MASK] must try"
---
# Model name
SingBert Lite - Bert for Singlish (SG) and Manglish (MY).
## Model description
Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the lite-version, which was initialized from [Albert base v2](https://github.com/google-research/albert#albert), with pre-training finetuned on
[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.
## Intended uses & limitations
#### How to use
```python
>>> from transformers import pipeline
>>> nlp = pipeline('fill-mask', model='zanelim/singbert-lite-sg')
>>> nlp("die [MASK] must try")
[{'sequence': '[CLS] die die must try[SEP]',
'score': 0.7731555700302124,
'token': 1327,
'token_str': '▁die'},
{'sequence': '[CLS] die also must try[SEP]',
'score': 0.04763784259557724,
'token': 67,
'token_str': '▁also'},
{'sequence': '[CLS] die still must try[SEP]',
'score': 0.01859409362077713,
'token': 174,
'token_str': '▁still'},
{'sequence': '[CLS] die u must try[SEP]',
'score': 0.015824034810066223,
'token': 287,
'token_str': '▁u'},
{'sequence': '[CLS] die is must try[SEP]',
'score': 0.011271446943283081,
'token': 25,
'token_str': '▁is'}]
>>> nlp("dont play [MASK] leh")
[{'sequence': '[CLS] dont play play leh[SEP]',
'score': 0.4365769624710083,
'token': 418,
'token_str': '▁play'},
{'sequence': '[CLS] dont play punk leh[SEP]',
'score': 0.06880936771631241,
'token': 6769,
'token_str': '▁punk'},
{'sequence': '[CLS] dont play game leh[SEP]',
'score': 0.051739856600761414,
'token': 250,
'token_str': '▁game'},
{'sequence': '[CLS] dont play games leh[SEP]',
'score': 0.045703962445259094,
'token': 466,
'token_str': '▁games'},
{'sequence': '[CLS] dont play around leh[SEP]',
'score': 0.013458190485835075,
'token': 140,
'token_str': '▁around'}]
>>> nlp("catch no [MASK]")
[{'sequence': '[CLS] catch no ball[SEP]',
'score': 0.6197211146354675,
'token': 1592,
'token_str': '▁ball'},
{'sequence': '[CLS] catch no balls[SEP]',
'score': 0.08441998809576035,
'token': 7152,
'token_str': '▁balls'},
{'sequence': '[CLS] catch no joke[SEP]',
'score': 0.0676785409450531,
'token': 8186,
'token_str': '▁joke'},
{'sequence': '[CLS] catch no?[SEP]',
'score': 0.040638409554958344,
'token': 60,
'token_str': '?'},
{'sequence': '[CLS] catch no one[SEP]',
'score': 0.03546864539384842,
'token': 53,
'token_str': '▁one'}]
>>> nlp("confirm plus [MASK]")
[{'sequence': '[CLS] confirm plus chop[SEP]',
'score': 0.9608421921730042,
'token': 17144,
'token_str': '▁chop'},
{'sequence': '[CLS] confirm plus guarantee[SEP]',
'score': 0.011784233152866364,
'token': 9120,
'token_str': '▁guarantee'},
{'sequence': '[CLS] confirm plus confirm[SEP]',
'score': 0.010571340098977089,
'token': 10265,
'token_str': '▁confirm'},
{'sequence': '[CLS] confirm plus egg[SEP]',
'score': 0.0033525123726576567,
'token': 6387,
'token_str': '▁egg'},
{'sequence': '[CLS] confirm plus bet[SEP]',
'score': 0.0008760977652855217,
'token': 5676,
'token_str': '▁bet'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('zanelim/singbert-lite-sg')
model = AlbertModel.from_pretrained("zanelim/singbert-lite-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import AlbertTokenizer, TFAlbertModel
tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
model = TFAlbertModel.from_pretrained("zanelim/singbert-lite-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
#### Limitations and bias
This model was finetuned on colloquial Singlish and Manglish corpus, hence it is best applied on downstream tasks involving the main
constituent languages- english, mandarin, malay. Also, as the training data is mainly from forums, beware of existing inherent bias.
## Training data
Colloquial singlish and manglish (both are a mixture of English, Mandarin, Tamil, Malay, and other local dialects like Hokkien, Cantonese or Teochew)
corpus. The corpus is collected from subreddits- `r/singapore` and `r/malaysia`, and forums such as `hardwarezone`.
## Training procedure
Initialized with [albert base v2](https://github.com/google-research/albert#albert) vocab and checkpoints (pre-trained weights).
Pre-training was further finetuned on training data with the following hyperparameters
* train_batch_size: 4096
* max_seq_length: 128
* num_train_steps: 125000
* num_warmup_steps: 5000
* learning_rate: 0.00176
* hardware: TPU v3-8
---
language: en
tags:
- singapore
- sg
- singlish
- malaysia
- ms
- manglish
- bert-base-uncased
license: mit
datasets:
- reddit singapore, malaysia
- hardwarezone
widget:
- text: "kopi c siew [MASK]"
- text: "die [MASK] must try"
---
# Model name
SingBert - Bert for Singlish (SG) and Manglish (MY).
## Model description
[BERT base uncased](https://github.com/google-research/bert#pre-trained-models), with pre-training finetuned on
[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.
## Intended uses & limitations
#### How to use
```python
>>> from transformers import pipeline
>>> nlp = pipeline('fill-mask', model='zanelim/singbert')
>>> nlp("kopi c siew [MASK]")
[{'sequence': '[CLS] kopi c siew dai [SEP]',
'score': 0.5092713236808777,
'token': 18765,
'token_str': 'dai'},
{'sequence': '[CLS] kopi c siew mai [SEP]',
'score': 0.3515934646129608,
'token': 14736,
'token_str': 'mai'},
{'sequence': '[CLS] kopi c siew bao [SEP]',
'score': 0.05576375499367714,
'token': 25945,
'token_str': 'bao'},
{'sequence': '[CLS] kopi c siew. [SEP]',
'score': 0.006019321270287037,
'token': 1012,
'token_str': '.'},
{'sequence': '[CLS] kopi c siew sai [SEP]',
'score': 0.0038361591286957264,
'token': 18952,
'token_str': 'sai'}]
>>> nlp("one teh c siew dai, and one kopi [MASK].")
[{'sequence': '[CLS] one teh c siew dai, and one kopi c [SEP]',
'score': 0.6176503300666809,
'token': 1039,
'token_str': 'c'},
{'sequence': '[CLS] one teh c siew dai, and one kopi o [SEP]',
'score': 0.21094971895217896,
'token': 1051,
'token_str': 'o'},
{'sequence': '[CLS] one teh c siew dai, and one kopi. [SEP]',
'score': 0.13027705252170563,
'token': 1012,
'token_str': '.'},
{'sequence': '[CLS] one teh c siew dai, and one kopi! [SEP]',
'score': 0.004680239595472813,
'token': 999,
'token_str': '!'},
{'sequence': '[CLS] one teh c siew dai, and one kopi w [SEP]',
'score': 0.002034128177911043,
'token': 1059,
'token_str': 'w'}]
>>> nlp("dont play [MASK] leh")
[{'sequence': '[CLS] dont play play leh [SEP]',
'score': 0.9281464219093323,
'token': 2377,
'token_str': 'play'},
{'sequence': '[CLS] dont play politics leh [SEP]',
'score': 0.010990909300744534,
'token': 4331,
'token_str': 'politics'},
{'sequence': '[CLS] dont play punk leh [SEP]',
'score': 0.005583590362221003,
'token': 7196,
'token_str': 'punk'},
{'sequence': '[CLS] dont play dirty leh [SEP]',
'score': 0.0025784350000321865,
'token': 6530,
'token_str': 'dirty'},
{'sequence': '[CLS] dont play cheat leh [SEP]',
'score': 0.0025066907983273268,
'token': 21910,
'token_str': 'cheat'}]
>>> nlp("catch no [MASK]")
[{'sequence': '[CLS] catch no ball [SEP]',
'score': 0.7922210693359375,
'token': 3608,
'token_str': 'ball'},
{'sequence': '[CLS] catch no balls [SEP]',
'score': 0.20503675937652588,
'token': 7395,
'token_str': 'balls'},
{'sequence': '[CLS] catch no tail [SEP]',
'score': 0.0006608376861549914,
'token': 5725,
'token_str': 'tail'},
{'sequence': '[CLS] catch no talent [SEP]',
'score': 0.0002158183924620971,
'token': 5848,
'token_str': 'talent'},
{'sequence': '[CLS] catch no prisoners [SEP]',
'score': 5.3481446229852736e-05,
'token': 5895,
'token_str': 'prisoners'}]
>>> nlp("confirm plus [MASK]")
[{'sequence': '[CLS] confirm plus chop [SEP]',
'score': 0.992355227470398,
'token': 24494,
'token_str': 'chop'},
{'sequence': '[CLS] confirm plus one [SEP]',
'score': 0.0037301010452210903,
'token': 2028,
'token_str': 'one'},
{'sequence': '[CLS] confirm plus minus [SEP]',
'score': 0.0014284878270700574,
'token': 15718,
'token_str': 'minus'},
{'sequence': '[CLS] confirm plus 1 [SEP]',
'score': 0.0011354683665558696,
'token': 1015,
'token_str': '1'},
{'sequence': '[CLS] confirm plus chopped [SEP]',
'score': 0.0003804611915256828,
'token': 24881,
'token_str': 'chopped'}]
>>> nlp("die [MASK] must try")
[{'sequence': '[CLS] die die must try [SEP]',
'score': 0.9552758932113647,
'token': 3280,
'token_str': 'die'},
{'sequence': '[CLS] die also must try [SEP]',
'score': 0.03644804656505585,
'token': 2036,
'token_str': 'also'},
{'sequence': '[CLS] die liao must try [SEP]',
'score': 0.003282855963334441,
'token': 727,
'token_str': 'liao'},
{'sequence': '[CLS] die already must try [SEP]',
'score': 0.0004937972989864647,
'token': 2525,
'token_str': 'already'},
{'sequence': '[CLS] die hard must try [SEP]',
'score': 0.0003659659414552152,
'token': 2524,
'token_str': 'hard'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('zanelim/singbert')
model = BertModel.from_pretrained("zanelim/singbert")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained("zanelim/singbert")
model = TFBertModel.from_pretrained("zanelim/singbert")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
#### Limitations and bias
This model was finetuned on colloquial Singlish and Manglish corpus, hence it is best applied on downstream tasks involving the main
constituent languages- english, mandarin, malay. Also, as the training data is mainly from forums, beware of existing inherent bias.
## Training data
Colloquial singlish and manglish (both are a mixture of English, Mandarin, Tamil, Malay, and other local dialects like Hokkien, Cantonese or Teochew)
corpus. The corpus is collected from subreddits- `r/singapore` and `r/malaysia`, and forums such as `hardwarezone`.
## Training procedure
Initialized with [bert base uncased](https://github.com/google-research/bert#pre-trained-models) vocab and checkpoints (pre-trained weights).
Top 1000 custom vocab tokens (non-overlapped with original bert vocab) were further extracted from training data and filled into unused tokens in original bert vocab.
Pre-training was further finetuned on training data with the following hyperparameters
* train_batch_size: 512
* max_seq_length: 128
* num_train_steps: 300000
* num_warmup_steps: 5000
* learning_rate: 2e-5
* hardware: TPU v3-8
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment