Unverified Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined, so let me know if you have any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language: ms
---
# Bahasa Albert Model
Pretrained Albert tiny language model for Malay and Indonesian, with 85% faster execution and a 50% smaller footprint than Albert base.
## Pretraining Corpus
The `albert-tiny-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google Albert's GitHub [repository](https://github.com/google-research/ALBERT) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/albert](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/albert).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import AlbertTokenizer, AlbertModel

model = AlbertModel.from_pretrained('huseinzol05/albert-tiny-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/albert-tiny-bahasa-cased',
    do_lower_case = False,
)
```
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/albert-tiny-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/albert-tiny-bahasa-cased',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan ayam[SEP]',
'score': 0.05121927708387375,
'token': 629},
{'sequence': '[CLS] makan ayam dengan sayur[SEP]',
'score': 0.04497420787811279,
'token': 1639},
{'sequence': '[CLS] makan ayam dengan nasi[SEP]',
'score': 0.039827536791563034,
'token': 453},
{'sequence': '[CLS] makan ayam dengan rendang[SEP]',
'score': 0.032997727394104004,
'token': 2451},
{'sequence': '[CLS] makan ayam dengan makan[SEP]',
'score': 0.031354598701000214,
'token': 129}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train Albert for Bahasa.
---
language: ms
---
# Bahasa BERT Model
Pretrained BERT base language model for Malay and Indonesian.
## Pretraining Corpus
The `bert-base-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google BERT's GitHub [repository](https://github.com/google-research/bert) on 3 Titan V100 32GB VRAM GPUs.
- All steps can be reproduced from [Malaya/pretrained-model/bert](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/bert).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import AlbertTokenizer, BertModel
model = BertModel.from_pretrained('huseinzol05/bert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/bert-base-bahasa-cased',
    unk_token = '[UNK]',
    pad_token = '[PAD]',
    do_lower_case = False,
)
```
We used [google/sentencepiece](https://github.com/google/sentencepiece) to train the tokenizer, so it needs to be loaded with `AlbertTokenizer`.
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/bert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/bert-base-bahasa-cased',
    unk_token = '[UNK]',
    pad_token = '[PAD]',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan rendang[SEP]',
'score': 0.10812027007341385,
'token': 2446},
{'sequence': '[CLS] makan ayam dengan kicap[SEP]',
'score': 0.07653367519378662,
'token': 12928},
{'sequence': '[CLS] makan ayam dengan nasi[SEP]',
'score': 0.06839974224567413,
'token': 450},
{'sequence': '[CLS] makan ayam dengan ayam[SEP]',
'score': 0.059544261544942856,
'token': 638},
{'sequence': '[CLS] makan ayam dengan sayur[SEP]',
'score': 0.05294966697692871,
'token': 1639}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train BERT for Bahasa.
---
language: ms
---
# Bahasa ELECTRA Model
Pretrained ELECTRA base language model for Malay and Indonesian.
## Pretraining Corpus
The `electra-base-discriminator-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google ELECTRA's GitHub [repository](https://github.com/google-research/electra) on a single Tesla V100 32GB VRAM GPU.
- All steps can be reproduced from [Malaya/pretrained-model/electra](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/electra).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import ElectraTokenizer, ElectraModel
model = ElectraModel.from_pretrained('huseinzol05/electra-base-discriminator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-base-discriminator-bahasa-cased',
    do_lower_case = False,
)
```
## Example using ElectraForPreTraining
```python
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

model = ElectraForPreTraining.from_pretrained('huseinzol05/electra-base-discriminator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-base-discriminator-bahasa-cased',
    do_lower_case = False,
)
sentence = 'kerajaan sangat prihatin terhadap rakyat'
fake_tokens = tokenizer.tokenize(sentence)
fake_inputs = tokenizer.encode(sentence, return_tensors = 'pt')
discriminator_outputs = model(fake_inputs)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
# take the first (and only) batch element so zip pairs tokens with scores
print(list(zip(fake_tokens, predictions[0].tolist())))
```
Output is,
```text
[('kerajaan', 0.0),
('sangat', 0.0),
('prihatin', 0.0),
('terhadap', 0.0),
('rakyat', 0.0)]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train ELECTRA for Bahasa.
---
language: ms
---
# Bahasa ELECTRA Model
Pretrained ELECTRA base language model for Malay and Indonesian.
## Pretraining Corpus
The `electra-base-generator-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google ELECTRA's GitHub [repository](https://github.com/google-research/electra) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/electra](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/electra).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import ElectraTokenizer, ElectraModel
model = ElectraModel.from_pretrained('huseinzol05/electra-base-generator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-base-generator-bahasa-cased',
    do_lower_case = False,
)
```
## Example using AutoModelWithLMHead
```python
from transformers import ElectraTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/electra-base-generator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-base-generator-bahasa-cased',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan ayam [SEP]',
'score': 0.08424834907054901,
'token': 3255},
{'sequence': '[CLS] makan ayam dengan rendang [SEP]',
'score': 0.064150370657444,
'token': 6288},
{'sequence': '[CLS] makan ayam dengan nasi [SEP]',
'score': 0.033446669578552246,
'token': 2533},
{'sequence': '[CLS] makan ayam dengan kucing [SEP]',
'score': 0.02803465723991394,
'token': 3577},
{'sequence': '[CLS] makan ayam dengan telur [SEP]',
'score': 0.026627106592059135,
'token': 6350}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train ELECTRA for Bahasa.
---
language: ms
---
# Bahasa ELECTRA Model
Pretrained ELECTRA small language model for Malay and Indonesian.
## Pretraining Corpus
The `electra-small-discriminator-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google ELECTRA's GitHub [repository](https://github.com/google-research/electra) on a single Tesla V100 32GB VRAM GPU.
- All steps can be reproduced from [Malaya/pretrained-model/electra](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/electra).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import ElectraTokenizer, ElectraModel
model = ElectraModel.from_pretrained('huseinzol05/electra-small-discriminator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-discriminator-bahasa-cased',
    do_lower_case = False,
)
```
## Example using ElectraForPreTraining
```python
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

model = ElectraForPreTraining.from_pretrained('huseinzol05/electra-small-discriminator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-discriminator-bahasa-cased',
    do_lower_case = False,
)
sentence = 'kerajaan sangat prihatin terhadap rakyat'
fake_tokens = tokenizer.tokenize(sentence)
fake_inputs = tokenizer.encode(sentence, return_tensors = 'pt')
discriminator_outputs = model(fake_inputs)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
# take the first (and only) batch element so zip pairs tokens with scores
print(list(zip(fake_tokens, predictions[0].tolist())))
```
Output is,
```text
[('kerajaan', 0.0),
('sangat', 0.0),
('prihatin', 0.0),
('terhadap', 0.0),
('rakyat', 0.0)]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train ELECTRA for Bahasa.
---
language: ms
---
# Bahasa ELECTRA Model
Pretrained ELECTRA small language model for Malay and Indonesian.
## Pretraining Corpus
The `electra-small-generator-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google ELECTRA's GitHub [repository](https://github.com/google-research/electra) on a single Tesla V100 32GB VRAM GPU.
- All steps can be reproduced from [Malaya/pretrained-model/electra](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/electra).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import ElectraTokenizer, ElectraModel
model = ElectraModel.from_pretrained('huseinzol05/electra-small-generator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-generator-bahasa-cased',
    do_lower_case = False,
)
```
## Example using AutoModelWithLMHead
```python
from transformers import ElectraTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/electra-small-generator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-generator-bahasa-cased',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan ayam [SEP]',
'score': 0.08424834907054901,
'token': 3255},
{'sequence': '[CLS] makan ayam dengan rendang [SEP]',
'score': 0.064150370657444,
'token': 6288},
{'sequence': '[CLS] makan ayam dengan nasi [SEP]',
'score': 0.033446669578552246,
'token': 2533},
{'sequence': '[CLS] makan ayam dengan kucing [SEP]',
'score': 0.02803465723991394,
'token': 3577},
{'sequence': '[CLS] makan ayam dengan telur [SEP]',
'score': 0.026627106592059135,
'token': 6350}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train ELECTRA for Bahasa.
---
language: ms
---
# Bahasa GPT2 Model
Pretrained GPT2 117M model for Malay.
## Pretraining Corpus
The `gpt2-117M-bahasa-cased` model was pretrained on ~0.9 billion words. We trained on standard language structures only, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
5. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
6. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
7. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
8. [Common-Crawl](https://github.com/huseinzol05/malaya-dataset#common-crawl).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using OpenAI GPT-2's GitHub [repository](https://github.com/openai/gpt-2) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/gpt2](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/gpt2).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import GPT2Tokenizer, GPT2Model
model = GPT2Model.from_pretrained('huseinzol05/gpt2-117M-bahasa-cased')
tokenizer = GPT2Tokenizer.from_pretrained(
    'huseinzol05/gpt2-117M-bahasa-cased',
)
```
## Example using GPT2LMHeadModel
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('huseinzol05/gpt2-117M-bahasa-cased')
model = GPT2LMHeadModel.from_pretrained(
    'huseinzol05/gpt2-117M-bahasa-cased', pad_token_id = tokenizer.eos_token_id
)
input_ids = tokenizer.encode(
    'penat bak hang, macam ni aku takmau kerja dah', return_tensors = 'pt'
)
sample_outputs = model.generate(
    input_ids,
    do_sample = True,
    max_length = 50,
    top_k = 50,
    top_p = 0.95,
    num_return_sequences = 3,
)
print('Output:\n' + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )
```
Output is,
```text
Output:
----------------------------------------------------------------------------------------------------
0: penat bak hang, macam ni aku takmau kerja dah jadi aku pernah beritahu orang.
Ini bukan aku rasa cam nak ajak teman kan ni.
Tengok ni aku dah ada adik-adik & anak yang tinggal dan kerja2 yang kat sekolah.
1: penat bak hang, macam ni aku takmau kerja dah.
Takleh takleh nak ambik air.
Tgk jugak aku kat rumah ni.
Pastu aku nak bagi aku.
So aku dah takde masalah pulak.
Balik aku pun
2: penat bak hang, macam ni aku takmau kerja dah macam tu.
Tapi semua tu aku ingat cakap, ada cara hidup ni yang kita kena bayar.. pastu kita tak mampu bayar.. kan!!
Takpelah, aku nak cakap, masa yang
```
---
language: ms
---
# Bahasa GPT2 Model
Pretrained GPT2 345M model for Malay.
## Pretraining Corpus
The `gpt2-345M-bahasa-cased` model was pretrained on ~0.9 billion words. We trained on standard language structures only, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
5. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
6. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
7. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
8. [Common-Crawl](https://github.com/huseinzol05/malaya-dataset#common-crawl).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using OpenAI GPT-2's GitHub [repository](https://github.com/openai/gpt-2) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/gpt2](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/gpt2).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import GPT2Tokenizer, GPT2Model
model = GPT2Model.from_pretrained('huseinzol05/gpt2-345M-bahasa-cased')
tokenizer = GPT2Tokenizer.from_pretrained(
    'huseinzol05/gpt2-345M-bahasa-cased',
)
```
## Example using GPT2LMHeadModel
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('huseinzol05/gpt2-345M-bahasa-cased')
model = GPT2LMHeadModel.from_pretrained(
    'huseinzol05/gpt2-345M-bahasa-cased', pad_token_id = tokenizer.eos_token_id
)
input_ids = tokenizer.encode(
    'penat bak hang, macam ni aku takmau kerja dah', return_tensors = 'pt'
)
sample_outputs = model.generate(
    input_ids,
    do_sample = True,
    max_length = 50,
    top_k = 50,
    top_p = 0.95,
    num_return_sequences = 3,
)
print('Output:\n' + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )
```
Output is,
```text
Output:
----------------------------------------------------------------------------------------------------
0: penat bak hang, macam ni aku takmau kerja dah dekat 2,3 jam.
Aku harap aku dapat berjimat banyak.
Ini pun masa kerja, bila dah kerja jadi satu.
Aku buat kerja ni la.
Aku memang kalau ada
1: penat bak hang, macam ni aku takmau kerja dah.
Tapi nak buat macam mana kan, aku tolong bentang tugas.
Dan, memang sangat-sangat tak mahu buat kerja sekarang ni.
Aku pun suka sangat kerja di luar bandar
2: penat bak hang, macam ni aku takmau kerja dah pun.
Takpa nak buat kerja-kerja sampingan, baru boleh dapat hadiah pulak.
Ni la tempat paling best bila duduk di restoran yang ada pekena kopi.
Cumanya
```
---
language: ms
---
# Bahasa T5 Model
Pretrained T5 base language model for Malay and Indonesian.
## Pretraining Corpus
The `t5-base-bahasa-cased` model was pretrained on multiple tasks. Below is a list of the tasks we trained on:
1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google T5's GitHub [repository](https://github.com/google-research/text-to-text-transfer-transformer) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import T5Tokenizer, T5Model
model = T5Model.from_pretrained('huseinzol05/t5-base-bahasa-cased')
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-cased')
input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
Output is,
```
'Mahathir Mohamad'
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.
---
language: ms
---
# Bahasa T5 Summarization Model
Finetuned T5 base summarization model for Malay and Indonesian.
## Finetuning Corpus
The `t5-base-bahasa-summarization-cased` model was finetuned on multiple summarization datasets. Below is a list of the datasets we trained on:
1. [Translated CNN News](https://github.com/huseinzol05/Malay-Dataset#cnn-news)
2. [Translated Gigawords](https://github.com/huseinzol05/Malay-Dataset#gigawords)
3. [Translated Multinews](https://github.com/huseinzol05/Malay-Dataset#multinews)
## Finetuning details
- This model was trained using Malaya T5's GitHub [repository](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5) on a v3-8 TPU using the base size.
- All steps can be reproduced from [Malaya/session/summarization](https://github.com/huseinzol05/Malaya/tree/master/session/summarization).
## Load Finetuned Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-summarization-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-summarization-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-summarization-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-summarization-cased')
# https://www.hmetro.com.my/mutakhir/2020/05/580438/peletakan-jawatan-tun-m-ditolak-bukan-lagi-isu
# original title, Peletakan jawatan Tun M ditolak, bukan lagi isu
string = 'PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu. Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin. Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir. "Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan. "Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti. "Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro. Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah. Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah. Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT. "Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya. Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional. "Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.'
# https://huggingface.co/blog/how-to-generate
# generate summary
input_ids = tokenizer.encode(f'ringkasan: {string}', return_tensors = 'pt')
outputs = model.generate(
    input_ids,
    do_sample = True,
    temperature = 0.8,
    top_k = 50,
    top_p = 0.95,
    max_length = 300,
    num_return_sequences = 3,
)
for i, sample_output in enumerate(outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )

# generate news title
input_ids = tokenizer.encode(f'tajuk: {string}', return_tensors = 'pt')
outputs = model.generate(
    input_ids,
    do_sample = True,
    temperature = 0.8,
    top_k = 50,
    top_p = 0.95,
    max_length = 300,
    num_return_sequences = 3,
)
for i, sample_output in enumerate(outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )
```
Output is,
```
0: "Ini agak berlawanan dengan keputusan yang kita sudah buat," kata Marzuki Yahya. Kenyataan media adalah keputusan rasmi. Marzuki: Kenyataan media tidak mengubah keputusan mesyuarat
1: MPT sebulat suara menolak peletakan jawatan Dr M di mesyuarat 24 Februari. Tidak ada persoalan peletakan jawatan itu sah atau tidak, tetapi ia adalah keputusan parti yang dipersetujui semua. Bekas Setiausaha Agung Bersatu mengatakan keputusan itu perlu dibuat melalui parti. Bekas setiausaha agung itu mengatakan kenyataan media tidak lagi menyokong Perikatan Nasional
2: Kenyataan media menunjukkan sokongan kepada Perikatan Nasional. Marzuki: Kedudukan Dr M sebagai Pengerusi Bersatu juga sah. Beliau berkata pengumuman itu harus diserahkan kepada setiausaha Agung
0: 'Kalah Tun M, Muhyiddin tetap sah'
1: Boleh letak jawatan PM di MPT
2: 'Ketegangan Dr M sudah tolak, tak timbul isu peletakan jawatan'
```
## Results
We found that the original TensorFlow implementation gives better results; check it out at https://malaya.readthedocs.io/en/latest/Abstractive.html#generate-ringkasan.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.
---
language: ms
---
# Bahasa T5 Model
Pretrained T5 small language model for Malay and Indonesian.
## Pretraining Corpus
The `t5-small-bahasa-cased` model was pretrained on multiple tasks. Below is a list of the tasks we trained on:
1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google T5's GitHub [repository](https://github.com/google-research/text-to-text-transfer-transformer) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import T5Tokenizer, T5Model
model = T5Model.from_pretrained('huseinzol05/t5-small-bahasa-cased')
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-small-bahasa-cased')
input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
Output is,
```
'Mahathir Mohamad'
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.
---
language: ms
---
# Bahasa T5 Summarization Model
Finetuned T5 small summarization model for Malay and Indonesian.
## Finetuning Corpus
The `t5-small-bahasa-summarization-cased` model was finetuned on multiple summarization datasets. Below is a list of the datasets we trained on:
1. [Translated CNN News](https://github.com/huseinzol05/Malay-Dataset#cnn-news)
2. [Translated Gigawords](https://github.com/huseinzol05/Malay-Dataset#gigawords)
3. [Translated Multinews](https://github.com/huseinzol05/Malay-Dataset#multinews)
## Finetuning details
- This model was trained using Malaya T5's GitHub [repository](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5) on a v3-8 TPU using the small size.
- All steps can be reproduced from [Malaya/session/summarization](https://github.com/huseinzol05/Malaya/tree/master/session/summarization).
## Load Finetuned Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-summarization-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-small-bahasa-summarization-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-summarization-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-small-bahasa-summarization-cased')
# https://www.hmetro.com.my/mutakhir/2020/05/580438/peletakan-jawatan-tun-m-ditolak-bukan-lagi-isu
# original title, Peletakan jawatan Tun M ditolak, bukan lagi isu
string = 'PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu. Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin. Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir. "Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan. "Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti. "Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro. Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah. Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah. Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT. "Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya. Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional. "Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.'
# https://huggingface.co/blog/how-to-generate
# generate summary
input_ids = tokenizer.encode(f'ringkasan: {string}', return_tensors = 'pt')
outputs = model.generate(
    input_ids,
    do_sample = True,
    temperature = 0.8,
    top_k = 50,
    top_p = 0.95,
    max_length = 300,
    num_return_sequences = 3,
)
for i, sample_output in enumerate(outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )

# generate news title
input_ids = tokenizer.encode(f'tajuk: {string}', return_tensors = 'pt')
outputs = model.generate(
    input_ids,
    do_sample = True,
    temperature = 0.8,
    top_k = 50,
    top_p = 0.95,
    max_length = 300,
    num_return_sequences = 3,
)
for i, sample_output in enumerate(outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )
```
Output is,
```
0: Pengerusi Bersatu Bersatu menafikan peletakan jawatan dalam mesyuarat khas Majlis Pimpinan Tertinggi. Tidak timbul isu peletakan jawatan itu sah atau tidak kerana ia sudah diputuskan di peringkat parti. Kenyataan media yang dibuat oleh pemimpin parti hari ini menyokong Perikatan Nasional
1: Tiada keputusan kerana ia sudah diputuskan pada peringkat parti, Marzuki berkata. Pejabat rasmi parti menolak peletakan jawatan Dr M, dengan mengatakan ia adalah keputusan. Kedudukan Muhyiddin memangku jawatan itu juga sah, katanya
2: Tiada peletakan jawatan Dr Mahathir dalam mesyuarat khas MPT pada 24 Februari. Ketua parti menolak peletakan jawatan itu. Tidak timbul isu peletakan jawatan itu sah atau tidak, katanya
0: Tiada peletakan jawatan Tun M dalam mesyuarat khas
1: ‘Tidak timbul peletakan jawatan Tun M’
2: Tidak timbul isu peletakan jawatan Tun M di mesyuarat khas
```
## Results
We found that the original TensorFlow implementation gives better results; check it out at https://malaya.readthedocs.io/en/latest/Abstractive.html#generate-ringkasan.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.
---
language: ms
---
# Bahasa Tiny-BERT Model
General Distilled Tiny BERT language model for Malay and Indonesian.
## Pretraining Corpus
The `tiny-bert-bahasa-cased` model was distilled on ~1.8 billion words. We distilled on both standard and social media language structures, and below is a list of the data we distilled on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Distilling details
- This model was distilled using huawei-noah's Tiny-BERT GitHub [repository](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT) on 3 Titan V100 32GB VRAM GPUs.
- All steps can be reproduced from [Malaya/pretrained-model/tiny-bert](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/tiny-bert).
## Load Distilled Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import AlbertTokenizer, BertModel
model = BertModel.from_pretrained('huseinzol05/tiny-bert-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/tiny-bert-bahasa-cased',
    unk_token = '[UNK]',
    pad_token = '[PAD]',
    do_lower_case = False,
)
```
We used [google/sentencepiece](https://github.com/google/sentencepiece) to train the tokenizer, so it needs to be loaded with `AlbertTokenizer`.
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/tiny-bert-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/tiny-bert-bahasa-cased',
    unk_token = '[UNK]',
    pad_token = '[PAD]',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan berbual[SEP]',
'score': 0.00015769545279908925,
'token': 17859},
{'sequence': '[CLS] makan ayam dengan kembar[SEP]',
'score': 0.0001448775001335889,
'token': 8289},
{'sequence': '[CLS] makan ayam dengan memaklumkan[SEP]',
'score': 0.00013484008377417922,
'token': 6881},
{'sequence': '[CLS] makan ayam dengan Senarai[SEP]',
'score': 0.00013061291247140616,
'token': 11698},
{'sequence': '[CLS] makan ayam dengan Tiga[SEP]',
'score': 0.00012453157978598028,
'token': 4232}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train BERT for Bahasa.
---
language: ms
---
# Bahasa XLNet Model
Pretrained XLNet base language model for Malay and Indonesian.
## Pretraining Corpus
The `xlnet-base-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using zihangdai's XLNet GitHub [repository](https://github.com/zihangdai/xlnet) on 3 Titan V100 32GB VRAM GPUs.
- All steps can be reproduced from [Malaya/pretrained-model/xlnet](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/xlnet).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import XLNetTokenizer, XLNetModel
model = XLNetModel.from_pretrained('huseinzol05/xlnet-base-bahasa-cased')
tokenizer = XLNetTokenizer.from_pretrained(
    'huseinzol05/xlnet-base-bahasa-cased', do_lower_case = False
)
```
## Example using AutoModelWithLMHead
```python
from transformers import XLNetTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/xlnet-base-bahasa-cased')
tokenizer = XLNetTokenizer.from_pretrained(
    'huseinzol05/xlnet-base-bahasa-cased', do_lower_case = False
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan <mask>'))
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train XLNet for Bahasa.
# BERT-base-cased-qa-evaluator
This model takes a question-answer pair as input and outputs a value representing its prediction of whether the input is a valid question-answer pair. The model is a pretrained [BERT-base-cased](https://huggingface.co/bert-base-cased) with a sequence classification head.
## Intended uses
The QA evaluator was originally designed to be used with the [t5-base-question-generator](https://huggingface.co/iarfmoose/t5-base-question-generator) for evaluating the quality of generated questions.
The input for the QA evaluator follows the format for `BertForSequenceClassification`, but using the question and answer as the two sequences. Inputs should take the following format:
```
[CLS] <question> [SEP] <answer> [SEP]
```
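A minimal usage sketch is shown below (not an official script). It assumes the checkpoint id `iarfmoose/bert-base-cased-qa-evaluator` referenced later in this document; the question and answer are illustrative only.
```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

model_id = 'iarfmoose/bert-base-cased-qa-evaluator'
tokenizer = BertTokenizer.from_pretrained(model_id)
model = BertForSequenceClassification.from_pretrained(model_id)

question = 'What is the capital of Bulgaria?'
answer = 'Sofia is the capital of Bulgaria.'

# Passing the pair encodes it as [CLS] question [SEP] answer [SEP].
inputs = tokenizer(question, answer, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)  # scores for the two classes (original vs. corrupted pair)
```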
## Limitations and bias
The model is trained to evaluate if a question and answer are semantically related, but cannot determine whether an answer is actually true/correct or not.
## Training data
The training data was made up of question-answer pairs from the following datasets:
- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)
- [RACE](http://www.cs.cmu.edu/~glai1/data/race/)
- [CoQA](https://stanfordnlp.github.io/coqa/)
- [MSMARCO](https://microsoft.github.io/msmarco/)
## Training procedure
The question and answer were concatenated 50% of the time. For the other 50%, a corruption operation was performed (either swapping the answer for an unrelated answer, or copying part of the question into the answer). The model was then trained to predict whether the input sequence represented one of the original QA pairs or a corrupted input.
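For illustration only, a rough sketch of this corruption scheme might look like the following; this is not the authors' training script, and the split between the two corruption types is an assumption.
```python
import random

def make_example(question, answer, all_answers):
    """Return (question, answer, label): 1 = original pair, 0 = corrupted pair."""
    if random.random() < 0.5:
        return question, answer, 1  # keep the original QA pair
    if random.random() < 0.5:
        # Corruption 1: swap in an unrelated answer from another example.
        return question, random.choice(all_answers), 0
    # Corruption 2: copy part of the question into the answer.
    words = question.split()
    return question, ' '.join(words[: max(1, len(words) // 2)]), 0
```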
---
language: bg
---
# RoBERTa-base-bulgarian-POS
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This model is a version of [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian) fine-tuned for part-of-speech tagging.
## Intended uses
The model can be used to predict part-of-speech tags in Bulgarian text. Since the tokenizer uses byte-pair encoding, each word in the text may be split into more than one token. When predicting POS-tags, the last token from each word can be used. Using the last token was found to slightly outperform predictions based on the first token.
An example of this can be found [here](https://github.com/iarfmoose/bulgarian-nlp/blob/master/models/postagger.py).
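A minimal sketch of the last-token trick is shown below. It assumes the fine-tuned checkpoint is published as `iarfmoose/roberta-base-bulgarian-pos` (an id inferred from this card's title) and that a fast tokenizer is available so `word_ids()` can be used; see the linked script for the full approach.
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = 'iarfmoose/roberta-base-bulgarian-pos'  # assumed id, based on the card title
tokenizer = AutoTokenizer.from_pretrained(model_id)  # fast tokenizer needed for word_ids()
model = AutoModelForTokenClassification.from_pretrained(model_id)

enc = tokenizer('Това е изречение на български.', return_tensors='pt')
with torch.no_grad():
    preds = model(**enc).logits.argmax(-1)[0].tolist()

# Later sub-tokens overwrite earlier ones, so each word keeps its last token's tag.
tags = {}
for idx, word_id in enumerate(enc.word_ids()):
    if word_id is not None:
        tags[word_id] = model.config.id2label[preds[idx]]
print(tags)
```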
## Limitations and bias
The pretraining data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
In addition to the pretraining data used in [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian), the model was trained on the UPOS tags from [UD_Bulgarian-BTB](https://github.com/UniversalDependencies/UD_Bulgarian-BTB).
## Training procedure
The model was trained for 5 epochs over the training set. The loss was calculated based on the POS-tag prediction for the last token of each word. The model achieves 97% on the test set.
---
language: bg
---
# RoBERTa-base-bulgarian
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This is a version of [RoBERTa-base](https://huggingface.co/roberta-base) pretrained on Bulgarian text.
## Intended uses
This model can be used for cloze tasks (masked language modeling) or finetuned on other tasks in Bulgarian.
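For example, a cloze prediction can be run with the `fill-mask` pipeline. This is a minimal sketch; the Bulgarian example sentence is illustrative only.
```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='iarfmoose/roberta-base-bulgarian')

# RoBERTa models use <mask> as the mask token.
print(fill_mask('София е <mask> на България.'))
```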
## Limitations and bias
The training data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
This model was trained on the following data:
- [bg_dedup from OSCAR](https://oscar-corpus.com/)
- [Newscrawl 1 million sentences 2017 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
- [Wikipedia 1 million sentences 2016 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
## Training procedure
The model was pretrained using a masked language-modeling objective with dynamic masking as described [here](https://huggingface.co/roberta-base#preprocessing).
It was trained for 200k steps. The batch size was limited to 8 due to GPU memory limitations.
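As a rough illustration (not the original pretraining script), dynamic masking can be reproduced with the `transformers` MLM data collator, which re-samples the masked positions every time a batch is built:
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('iarfmoose/roberta-base-bulgarian')
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Each call masks a fresh random 15% of tokens, so the same sentence gets
# different masks across epochs (dynamic masking).
batch = collator([tokenizer('София е столицата на България.')])
print(batch['input_ids'])
print(batch['labels'])
```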
---
language: bg
---
# RoBERTa-small-bulgarian-POS
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This model is a version of [RoBERTa-small-Bulgarian](https://huggingface.co/iarfmoose/roberta-small-bulgarian) fine-tuned for part-of-speech tagging.
## Intended uses
The model can be used to predict part-of-speech tags in Bulgarian text. Since the tokenizer uses byte-pair encoding, each word in the text may be split into more than one token. When predicting POS-tags, the last token from each word can be used. Using the last token was found to slightly outperform predictions based on the first token.
An example of this can be found [here](https://github.com/iarfmoose/bulgarian-nlp/blob/master/models/postagger.py).
## Limitations and bias
The pretraining data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
In addition to the pretraining data used in [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian), the model was trained on the UPOS tags from [UD_Bulgarian-BTB](https://github.com/UniversalDependencies/UD_Bulgarian-BTB).
## Training procedure
The model was trained for 5 epochs over the training set. The loss was calculated based on the POS-tag prediction for the last token of each word. The model achieves 98% on the test set.
---
language: bg
---
# RoBERTa-small-bulgarian
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This is a smaller version of [RoBERTa-base-bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian) with only 6 hidden layers, but similar performance.
## Intended uses
This model can be used for cloze tasks (masked language modeling) or finetuned on other tasks in Bulgarian.
## Limitations and bias
The training data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
This model was trained on the following data:
- [bg_dedup from OSCAR](https://oscar-corpus.com/)
- [Newscrawl 1 million sentences 2017 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
- [Wikipedia 1 million sentences 2016 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
## Training procedure
The model was pretrained using a masked language-modeling objective with dynamic masking as described [here](https://huggingface.co/roberta-base#preprocessing).
It was trained for 160k steps. The batch size was limited to 8 due to GPU memory limitations.
# t5-base-question-generator
## Model description
This model is a sequence-to-sequence question generator which takes an answer and context as an input, and generates a question as an output. It is based on a pretrained `t5-base` model.
## Intended uses & limitations
The model is trained to generate reading comprehension-style questions with answers extracted from a text. The model performs best with full sentence answers, but can also be used with single word or short phrase answers.
#### How to use
The model takes concatenated answers and context as an input sequence, and will generate a full question sentence as an output sequence. The max sequence length is 512 tokens. Inputs should be organised into the following format:
```
answer_token <answer-phrase> context_token <context-from-text>
```
The input sequence can then be encoded and passed as the `input_ids` argument in the model's `generate()` method.
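A minimal sketch of this is shown below (not an official example). It assumes the checkpoint `iarfmoose/t5-base-question-generator` mentioned earlier, and that the special tokens are literally `<answer>` and `<context>`; check the model's tokenizer for the exact token strings.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = 'iarfmoose/t5-base-question-generator'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

answer = 'Sofia is the capital of Bulgaria.'
context = ('Sofia is the capital and largest city of Bulgaria, '
           'situated in the west of the country.')

# Build the input in the format described above (token names are assumed).
text = f'<answer> {answer} <context> {context}'
input_ids = tokenizer.encode(text, max_length=512, truncation=True, return_tensors='pt')

# Generate the question from the encoded sequence.
output_ids = model.generate(input_ids, max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```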
For best results, a large number of questions can be generated, and then filtered using [iarfmoose/bert-base-cased-qa-evaluator](https://huggingface.co/iarfmoose/bert-base-cased-qa-evaluator).
For examples, please see https://github.com/iarfmoose/question_generator.
#### Limitations and bias
The model is limited to generating questions in the same style as those found in [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/), [CoQA](https://stanfordnlp.github.io/coqa/), and [MSMARCO](https://microsoft.github.io/msmarco/). The generated questions can potentially be leading or reflect biases that are present in the context. If the context is too short or completely absent, or if the context and answer do not match, the generated question is likely to be incoherent.
## Training data
The model was fine-tuned on a dataset made up of several well-known QA datasets ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/), [CoQA](https://stanfordnlp.github.io/coqa/), and [MSMARCO](https://microsoft.github.io/msmarco/)). The datasets were restructured by concatenating the answer and context fields into the previously-mentioned format. The question field from the datasets was used as the target during training. The full training set was roughly 200,000 examples.
## Training procedure
The model was trained for 20 epochs over the training set with a learning rate of 1e-3. The batch size was only 4 due to GPU memory limitations when training on Google Colab.