Unverified Commit 3552d0e0 authored by Julien Chaumond, committed by GitHub

[model_cards] Migrate cards from this repo to model repos on huggingface.co (#9013)



* rm all model cards

* Update the .rst

@sgugger it is still not super crystal clear/streamlined, so let me know if you have any ideas to make it simpler

* Add a rootlevel README.md with simple instructions/context

* Update docs/source/model_sharing.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* rm all model cards
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
parent 29e45979
---
language: ms
---
# Bahasa Albert Model
Pretrained Albert tiny language model for Malay and Indonesian, with 85% faster execution and a 50% smaller footprint than Albert base.
## Pretraining Corpus
The `albert-tiny-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google Albert's GitHub [repository](https://github.com/google-research/ALBERT) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/albert](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/albert).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import AlbertTokenizer, AlbertModel

model = AlbertModel.from_pretrained('huseinzol05/albert-tiny-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/albert-tiny-bahasa-cased',
    do_lower_case = False,
)
```
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/albert-tiny-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/albert-tiny-bahasa-cased',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan ayam[SEP]',
'score': 0.05121927708387375,
'token': 629},
{'sequence': '[CLS] makan ayam dengan sayur[SEP]',
'score': 0.04497420787811279,
'token': 1639},
{'sequence': '[CLS] makan ayam dengan nasi[SEP]',
'score': 0.039827536791563034,
'token': 453},
{'sequence': '[CLS] makan ayam dengan rendang[SEP]',
'score': 0.032997727394104004,
'token': 2451},
{'sequence': '[CLS] makan ayam dengan makan[SEP]',
'score': 0.031354598701000214,
'token': 129}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train Albert for Bahasa.
---
language: ms
---
# Bahasa BERT Model
Pretrained BERT base language model for Malay and Indonesian.
## Pretraining Corpus
The `bert-base-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google BERT's GitHub [repository](https://github.com/google-research/bert) on 3 Titan V100 32GB VRAM GPUs.
- All steps can be reproduced from [Malaya/pretrained-model/bert](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/bert).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import AlbertTokenizer, BertModel
model = BertModel.from_pretrained('huseinzol05/bert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/bert-base-bahasa-cased',
    unk_token = '[UNK]',
    pad_token = '[PAD]',
    do_lower_case = False,
)
```
We used [google/sentencepiece](https://github.com/google/sentencepiece) to train the tokenizer, so it needs to be loaded with `AlbertTokenizer`.
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/bert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/bert-base-bahasa-cased',
    unk_token = '[UNK]',
    pad_token = '[PAD]',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan rendang[SEP]',
'score': 0.10812027007341385,
'token': 2446},
{'sequence': '[CLS] makan ayam dengan kicap[SEP]',
'score': 0.07653367519378662,
'token': 12928},
{'sequence': '[CLS] makan ayam dengan nasi[SEP]',
'score': 0.06839974224567413,
'token': 450},
{'sequence': '[CLS] makan ayam dengan ayam[SEP]',
'score': 0.059544261544942856,
'token': 638},
{'sequence': '[CLS] makan ayam dengan sayur[SEP]',
'score': 0.05294966697692871,
'token': 1639}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train BERT for Bahasa.
---
language: ms
---
# Bahasa ELECTRA Model
Pretrained ELECTRA base language model for Malay and Indonesian.
## Pretraining Corpus
The `electra-base-discriminator-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google ELECTRA's GitHub [repository](https://github.com/google-research/electra) on a single Tesla V100 32GB VRAM GPU.
- All steps can be reproduced from [Malaya/pretrained-model/electra](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/electra).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import ElectraTokenizer, ElectraModel
model = ElectraModel.from_pretrained('huseinzol05/electra-base-discriminator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-base-discriminator-bahasa-cased',
    do_lower_case = False,
)
```
## Example using ElectraForPreTraining
```python
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

model = ElectraForPreTraining.from_pretrained('huseinzol05/electra-base-discriminator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-base-discriminator-bahasa-cased',
    do_lower_case = False,
)
sentence = 'kerajaan sangat prihatin terhadap rakyat'
fake_tokens = tokenizer.tokenize(sentence)
fake_inputs = tokenizer.encode(sentence, return_tensors = 'pt')
discriminator_outputs = model(fake_inputs)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
# take the first (and only) batch element so zip pairs tokens with scores
print(list(zip(fake_tokens, predictions[0].tolist())))
```
Output is,
```text
[('kerajaan', 0.0),
('sangat', 0.0),
('prihatin', 0.0),
('terhadap', 0.0),
('rakyat', 0.0)]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train ELECTRA for Bahasa.
---
language: ms
---
# Bahasa ELECTRA Model
Pretrained ELECTRA base language model for Malay and Indonesian.
## Pretraining Corpus
The `electra-base-generator-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google ELECTRA's GitHub [repository](https://github.com/google-research/electra) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/electra](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/electra).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import ElectraTokenizer, ElectraModel
model = ElectraModel.from_pretrained('huseinzol05/electra-base-generator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-base-generator-bahasa-cased',
    do_lower_case = False,
)
```
## Example using AutoModelWithLMHead
```python
from transformers import ElectraTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/electra-base-generator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-base-generator-bahasa-cased',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan ayam [SEP]',
'score': 0.08424834907054901,
'token': 3255},
{'sequence': '[CLS] makan ayam dengan rendang [SEP]',
'score': 0.064150370657444,
'token': 6288},
{'sequence': '[CLS] makan ayam dengan nasi [SEP]',
'score': 0.033446669578552246,
'token': 2533},
{'sequence': '[CLS] makan ayam dengan kucing [SEP]',
'score': 0.02803465723991394,
'token': 3577},
{'sequence': '[CLS] makan ayam dengan telur [SEP]',
'score': 0.026627106592059135,
'token': 6350}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train ELECTRA for Bahasa.
---
language: ms
---
# Bahasa ELECTRA Model
Pretrained ELECTRA small language model for Malay and Indonesian.
## Pretraining Corpus
The `electra-small-discriminator-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google ELECTRA's GitHub [repository](https://github.com/google-research/electra) on a single Tesla V100 32GB VRAM GPU.
- All steps can be reproduced from [Malaya/pretrained-model/electra](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/electra).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import ElectraTokenizer, ElectraModel
model = ElectraModel.from_pretrained('huseinzol05/electra-small-discriminator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-discriminator-bahasa-cased',
    do_lower_case = False,
)
```
## Example using ElectraForPreTraining
```python
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

model = ElectraForPreTraining.from_pretrained('huseinzol05/electra-small-discriminator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-discriminator-bahasa-cased',
    do_lower_case = False,
)
sentence = 'kerajaan sangat prihatin terhadap rakyat'
fake_tokens = tokenizer.tokenize(sentence)
fake_inputs = tokenizer.encode(sentence, return_tensors = 'pt')
discriminator_outputs = model(fake_inputs)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
# take the first (and only) batch element so zip pairs tokens with scores
print(list(zip(fake_tokens, predictions[0].tolist())))
```
Output is,
```text
[('kerajaan', 0.0),
('sangat', 0.0),
('prihatin', 0.0),
('terhadap', 0.0),
('rakyat', 0.0)]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train ELECTRA for Bahasa.
---
language: ms
---
# Bahasa ELECTRA Model
Pretrained ELECTRA small language model for Malay and Indonesian.
## Pretraining Corpus
The `electra-small-generator-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google ELECTRA's GitHub [repository](https://github.com/google-research/electra) on a single Tesla V100 32GB VRAM GPU.
- All steps can be reproduced from [Malaya/pretrained-model/electra](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/electra).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import ElectraTokenizer, ElectraModel
model = ElectraModel.from_pretrained('huseinzol05/electra-small-generator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-generator-bahasa-cased',
    do_lower_case = False,
)
```
## Example using AutoModelWithLMHead
```python
from transformers import ElectraTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/electra-small-generator-bahasa-cased')
tokenizer = ElectraTokenizer.from_pretrained(
    'huseinzol05/electra-small-generator-bahasa-cased',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan ayam [SEP]',
'score': 0.08424834907054901,
'token': 3255},
{'sequence': '[CLS] makan ayam dengan rendang [SEP]',
'score': 0.064150370657444,
'token': 6288},
{'sequence': '[CLS] makan ayam dengan nasi [SEP]',
'score': 0.033446669578552246,
'token': 2533},
{'sequence': '[CLS] makan ayam dengan kucing [SEP]',
'score': 0.02803465723991394,
'token': 3577},
{'sequence': '[CLS] makan ayam dengan telur [SEP]',
'score': 0.026627106592059135,
'token': 6350}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train ELECTRA for Bahasa.
---
language: ms
---
# Bahasa GPT2 Model
Pretrained GPT2 117M model for Malay.
## Pretraining Corpus
The `gpt2-117M-bahasa-cased` model was pretrained on ~0.9 billion words. We trained on standard language structures only, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
5. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
6. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
7. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
8. [Common-Crawl](https://github.com/huseinzol05/malaya-dataset#common-crawl).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using OpenAI GPT-2's GitHub [repository](https://github.com/openai/gpt-2) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/gpt2](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/gpt2).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import GPT2Tokenizer, GPT2Model
model = GPT2Model.from_pretrained('huseinzol05/gpt2-117M-bahasa-cased')
tokenizer = GPT2Tokenizer.from_pretrained(
    'huseinzol05/gpt2-117M-bahasa-cased',
)
```
## Example using GPT2LMHeadModel
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('huseinzol05/gpt2-117M-bahasa-cased')
model = GPT2LMHeadModel.from_pretrained(
    'huseinzol05/gpt2-117M-bahasa-cased', pad_token_id = tokenizer.eos_token_id
)
input_ids = tokenizer.encode(
    'penat bak hang, macam ni aku takmau kerja dah', return_tensors = 'pt'
)
sample_outputs = model.generate(
    input_ids,
    do_sample = True,
    max_length = 50,
    top_k = 50,
    top_p = 0.95,
    num_return_sequences = 3,
)
print('Output:\n' + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )
```
Output is,
```text
Output:
----------------------------------------------------------------------------------------------------
0: penat bak hang, macam ni aku takmau kerja dah jadi aku pernah beritahu orang.
Ini bukan aku rasa cam nak ajak teman kan ni.
Tengok ni aku dah ada adik-adik & anak yang tinggal dan kerja2 yang kat sekolah.
1: penat bak hang, macam ni aku takmau kerja dah.
Takleh takleh nak ambik air.
Tgk jugak aku kat rumah ni.
Pastu aku nak bagi aku.
So aku dah takde masalah pulak.
Balik aku pun
2: penat bak hang, macam ni aku takmau kerja dah macam tu.
Tapi semua tu aku ingat cakap, ada cara hidup ni yang kita kena bayar.. pastu kita tak mampu bayar.. kan!!
Takpelah, aku nak cakap, masa yang
```
---
language: ms
---
# Bahasa GPT2 Model
Pretrained GPT2 345M model for Malay.
## Pretraining Corpus
The `gpt2-345M-bahasa-cased` model was pretrained on ~0.9 billion words. We trained on standard language structures only, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
5. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
6. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
7. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
8. [Common-Crawl](https://github.com/huseinzol05/malaya-dataset#common-crawl).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using OpenAI GPT-2's GitHub [repository](https://github.com/openai/gpt-2) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/gpt2](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/gpt2).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import GPT2Tokenizer, GPT2Model
model = GPT2Model.from_pretrained('huseinzol05/gpt2-345M-bahasa-cased')
tokenizer = GPT2Tokenizer.from_pretrained(
    'huseinzol05/gpt2-345M-bahasa-cased',
)
```
## Example using GPT2LMHeadModel
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('huseinzol05/gpt2-345M-bahasa-cased')
model = GPT2LMHeadModel.from_pretrained(
    'huseinzol05/gpt2-345M-bahasa-cased', pad_token_id = tokenizer.eos_token_id
)
input_ids = tokenizer.encode(
    'penat bak hang, macam ni aku takmau kerja dah', return_tensors = 'pt'
)
sample_outputs = model.generate(
    input_ids,
    do_sample = True,
    max_length = 50,
    top_k = 50,
    top_p = 0.95,
    num_return_sequences = 3,
)
print('Output:\n' + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )
```
Output is,
```text
Output:
----------------------------------------------------------------------------------------------------
0: penat bak hang, macam ni aku takmau kerja dah dekat 2,3 jam.
Aku harap aku dapat berjimat banyak.
Ini pun masa kerja, bila dah kerja jadi satu.
Aku buat kerja ni la.
Aku memang kalau ada
1: penat bak hang, macam ni aku takmau kerja dah.
Tapi nak buat macam mana kan, aku tolong bentang tugas.
Dan, memang sangat-sangat tak mahu buat kerja sekarang ni.
Aku pun suka sangat kerja di luar bandar
2: penat bak hang, macam ni aku takmau kerja dah pun.
Takpa nak buat kerja-kerja sampingan, baru boleh dapat hadiah pulak.
Ni la tempat paling best bila duduk di restoran yang ada pekena kopi.
Cumanya
```
---
language: ms
---
# Bahasa T5 Model
Pretrained T5 base language model for Malay and Indonesian.
## Pretraining Corpus
The `t5-base-bahasa-cased` model was pretrained on multiple tasks. Below is a list of the tasks we trained on:
1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google T5's GitHub [repository](https://github.com/google-research/text-to-text-transfer-transformer) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import T5Tokenizer, T5Model
model = T5Model.from_pretrained('huseinzol05/t5-base-bahasa-cased')
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-cased')
input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
Output is,
```
'Mahathir Mohamad'
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.
---
language: ms
---
# Bahasa T5 Summarization Model
Finetuned T5 base summarization model for Malay and Indonesian.
## Finetuning Corpus
The `t5-base-bahasa-summarization-cased` model was finetuned on multiple summarization datasets. Below is a list of the datasets we trained on:
1. [Translated CNN News](https://github.com/huseinzol05/Malay-Dataset#cnn-news)
2. [Translated Gigawords](https://github.com/huseinzol05/Malay-Dataset#gigawords)
3. [Translated Multinews](https://github.com/huseinzol05/Malay-Dataset#multinews)
## Finetuning details
- This model was trained using Malaya T5's GitHub [repository](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5) on a v3-8 TPU using the base size.
- All steps can be reproduced from [Malaya/session/summarization](https://github.com/huseinzol05/Malaya/tree/master/session/summarization).
## Load Finetuned Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-summarization-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-summarization-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-summarization-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-summarization-cased')
# https://www.hmetro.com.my/mutakhir/2020/05/580438/peletakan-jawatan-tun-m-ditolak-bukan-lagi-isu
# original title, Peletakan jawatan Tun M ditolak, bukan lagi isu
string = 'PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu. Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin. Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir. "Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan. "Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti. "Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro. Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah. Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah. Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT. "Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya. Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional. "Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.'
# https://huggingface.co/blog/how-to-generate
# generate summary
input_ids = tokenizer.encode(f'ringkasan: {string}', return_tensors = 'pt')
outputs = model.generate(
    input_ids,
    do_sample = True,
    temperature = 0.8,
    top_k = 50,
    top_p = 0.95,
    max_length = 300,
    num_return_sequences = 3,
)
for i, sample_output in enumerate(outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )

# generate news title
input_ids = tokenizer.encode(f'tajuk: {string}', return_tensors = 'pt')
outputs = model.generate(
    input_ids,
    do_sample = True,
    temperature = 0.8,
    top_k = 50,
    top_p = 0.95,
    max_length = 300,
    num_return_sequences = 3,
)
for i, sample_output in enumerate(outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )
```
Output is,
```
0: "Ini agak berlawanan dengan keputusan yang kita sudah buat," kata Marzuki Yahya. Kenyataan media adalah keputusan rasmi. Marzuki: Kenyataan media tidak mengubah keputusan mesyuarat
1: MPT sebulat suara menolak peletakan jawatan Dr M di mesyuarat 24 Februari. Tidak ada persoalan peletakan jawatan itu sah atau tidak, tetapi ia adalah keputusan parti yang dipersetujui semua. Bekas Setiausaha Agung Bersatu mengatakan keputusan itu perlu dibuat melalui parti. Bekas setiausaha agung itu mengatakan kenyataan media tidak lagi menyokong Perikatan Nasional
2: Kenyataan media menunjukkan sokongan kepada Perikatan Nasional. Marzuki: Kedudukan Dr M sebagai Pengerusi Bersatu juga sah. Beliau berkata pengumuman itu harus diserahkan kepada setiausaha Agung
0: 'Kalah Tun M, Muhyiddin tetap sah'
1: Boleh letak jawatan PM di MPT
2: 'Ketegangan Dr M sudah tolak, tak timbul isu peletakan jawatan'
```
## Results
We found that the original TensorFlow implementation gives better results; check it out at https://malaya.readthedocs.io/en/latest/Abstractive.html#generate-ringkasan.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.
---
language: ms
---
# Bahasa T5 Model
Pretrained T5 small language model for Malay and Indonesian.
## Pretraining Corpus
The `t5-small-bahasa-cased` model was pretrained on multiple tasks. Below is a list of the tasks we trained on:
1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google T5's GitHub [repository](https://github.com/google-research/text-to-text-transfer-transformer) on a v3-8 TPU.
- All steps can be reproduced from [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import T5Tokenizer, T5Model
model = T5Model.from_pretrained('huseinzol05/t5-small-bahasa-cased')
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-small-bahasa-cased')
input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
Output is,
```
'Mahathir Mohamad'
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.
---
language: ms
---
# Bahasa T5 Summarization Model
Finetuned T5 small summarization model for Malay and Indonesian.
## Finetuning Corpus
The `t5-small-bahasa-summarization-cased` model was finetuned on multiple summarization datasets. Below is a list of the datasets we trained on:
1. [Translated CNN News](https://github.com/huseinzol05/Malay-Dataset#cnn-news)
2. [Translated Gigawords](https://github.com/huseinzol05/Malay-Dataset#gigawords)
3. [Translated Multinews](https://github.com/huseinzol05/Malay-Dataset#multinews)
## Finetuning details
- This model was trained using Malaya T5's GitHub [repository](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5) on a v3-8 TPU using the small size.
- All steps can be reproduced from [Malaya/session/summarization](https://github.com/huseinzol05/Malaya/tree/master/session/summarization).
## Load Finetuned Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-summarization-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-small-bahasa-summarization-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-summarization-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-small-bahasa-summarization-cased')
# https://www.hmetro.com.my/mutakhir/2020/05/580438/peletakan-jawatan-tun-m-ditolak-bukan-lagi-isu
# original title, Peletakan jawatan Tun M ditolak, bukan lagi isu
string = 'PELETAKAN jawatan Tun Dr Mahathir Mohamad sebagai Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu) ditolak di dalam mesyuarat khas Majlis Pimpinan Tertinggi (MPT) pada 24 Februari lalu. Justeru, tidak timbul soal peletakan jawatan itu sah atau tidak kerana ia sudah pun diputuskan pada peringkat parti yang dipersetujui semua termasuk Presiden, Tan Sri Muhyiddin Yassin. Bekas Setiausaha Agung Bersatu Datuk Marzuki Yahya berkata, pada mesyuarat itu MPT sebulat suara menolak peletakan jawatan Dr Mahathir. "Jadi ini agak berlawanan dengan keputusan yang kita sudah buat. Saya tak faham bagaimana Jabatan Pendaftar Pertubuhan Malaysia (JPPM) kata peletakan jawatan itu sah sedangkan kita sudah buat keputusan di dalam mesyuarat, bukan seorang dua yang buat keputusan. "Semua keputusan mesti dibuat melalui parti. Walau apa juga perbincangan dibuat di luar daripada keputusan mesyuarat, ini bukan keputusan parti. "Apa locus standy yang ada pada Setiausaha Kerja untuk membawa perkara ini kepada JPPM. Seharusnya ia dibawa kepada Setiausaha Agung sebagai pentadbir kepada parti," katanya kepada Harian Metro. Beliau mengulas laporan media tempatan hari ini mengenai pengesahan JPPM bahawa Dr Mahathir tidak lagi menjadi Pengerusi Bersatu berikutan peletakan jawatannya di tengah-tengah pergolakan politik pada akhir Februari adalah sah. Laporan itu juga menyatakan, kedudukan Muhyiddin Yassin memangku jawatan itu juga sah. Menurutnya, memang betul Dr Mahathir menghantar surat peletakan jawatan, tetapi ditolak oleh MPT. "Fasal yang disebut itu terpakai sekiranya berhenti atau diberhentikan, tetapi ini mesyuarat sudah menolak," katanya. Marzuki turut mempersoal kenyataan media yang dibuat beberapa pimpinan parti itu hari ini yang menyatakan sokongan kepada Perikatan Nasional. "Kenyataan media bukanlah keputusan rasmi. Walaupun kita buat 1,000 kenyataan sekali pun ia tetap tidak merubah keputusan yang sudah dibuat di dalam mesyuarat. Kita catat di dalam minit apa yang berlaku di dalam mesyuarat," katanya.'
# https://huggingface.co/blog/how-to-generate
# generate summary
input_ids = tokenizer.encode(f'ringkasan: {string}', return_tensors = 'pt')
outputs = model.generate(
    input_ids,
    do_sample = True,
    temperature = 0.8,
    top_k = 50,
    top_p = 0.95,
    max_length = 300,
    num_return_sequences = 3,
)
for i, sample_output in enumerate(outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )

# generate news title
input_ids = tokenizer.encode(f'tajuk: {string}', return_tensors = 'pt')
outputs = model.generate(
    input_ids,
    do_sample = True,
    temperature = 0.8,
    top_k = 50,
    top_p = 0.95,
    max_length = 300,
    num_return_sequences = 3,
)
for i, sample_output in enumerate(outputs):
    print(
        '{}: {}'.format(
            i, tokenizer.decode(sample_output, skip_special_tokens = True)
        )
    )
```
Output is,
```
0: Pengerusi Bersatu Bersatu menafikan peletakan jawatan dalam mesyuarat khas Majlis Pimpinan Tertinggi. Tidak timbul isu peletakan jawatan itu sah atau tidak kerana ia sudah diputuskan di peringkat parti. Kenyataan media yang dibuat oleh pemimpin parti hari ini menyokong Perikatan Nasional
1: Tiada keputusan kerana ia sudah diputuskan pada peringkat parti, Marzuki berkata. Pejabat rasmi parti menolak peletakan jawatan Dr M, dengan mengatakan ia adalah keputusan. Kedudukan Muhyiddin memangku jawatan itu juga sah, katanya
2: Tiada peletakan jawatan Dr Mahathir dalam mesyuarat khas MPT pada 24 Februari. Ketua parti menolak peletakan jawatan itu. Tidak timbul isu peletakan jawatan itu sah atau tidak, katanya
0: Tiada peletakan jawatan Tun M dalam mesyuarat khas
1: ‘Tidak timbul peletakan jawatan Tun M’
2: Tidak timbul isu peletakan jawatan Tun M di mesyuarat khas
```
## Results
We found that the original TensorFlow implementation gives better results; check it out at https://malaya.readthedocs.io/en/latest/Abstractive.html#generate-ringkasan.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.
---
language: ms
---
# Bahasa Tiny-BERT Model
General Distilled Tiny BERT language model for Malay and Indonesian.
## Pretraining Corpus
The `tiny-bert-bahasa-cased` model was distilled on ~1.8 billion words. We distilled on both standard and social media language structures, and below is a list of the data we distilled on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Distilling details
- This model was distilled using huawei-noah's Tiny-BERT GitHub [repository](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT) on 3 Titan V100 32GB VRAM GPUs.
- All steps can be reproduced from [Malaya/pretrained-model/tiny-bert](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/tiny-bert).
## Load Distilled Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import AlbertTokenizer, BertModel
model = BertModel.from_pretrained('huseinzol05/tiny-bert-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/tiny-bert-bahasa-cased',
    unk_token = '[UNK]',
    pad_token = '[PAD]',
    do_lower_case = False,
)
```
We used [google/sentencepiece](https://github.com/google/sentencepiece) to train the tokenizer, so it needs to be loaded with `AlbertTokenizer`.
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/tiny-bert-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
    'huseinzol05/tiny-bert-bahasa-cased',
    unk_token = '[UNK]',
    pad_token = '[PAD]',
    do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan berbual[SEP]',
'score': 0.00015769545279908925,
'token': 17859},
{'sequence': '[CLS] makan ayam dengan kembar[SEP]',
'score': 0.0001448775001335889,
'token': 8289},
{'sequence': '[CLS] makan ayam dengan memaklumkan[SEP]',
'score': 0.00013484008377417922,
'token': 6881},
{'sequence': '[CLS] makan ayam dengan Senarai[SEP]',
'score': 0.00013061291247140616,
'token': 11698},
{'sequence': '[CLS] makan ayam dengan Tiga[SEP]',
'score': 0.00012453157978598028,
'token': 4232}]
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train BERT for Bahasa.
---
language: ms
---
# Bahasa XLNet Model
Pretrained XLNet base language model for Malay and Indonesian.
## Pretraining Corpus
The `xlnet-base-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures, and below is a list of the data we trained on:
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using zihangdai's XLNet GitHub [repository](https://github.com/zihangdai/xlnet) on 3 Titan V100 32GB VRAM GPUs.
- All steps can be reproduced from [Malaya/pretrained-model/xlnet](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/xlnet).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. Then you can load it directly like this:
```python
from transformers import XLNetTokenizer, XLNetModel
model = XLNetModel.from_pretrained('huseinzol05/xlnet-base-bahasa-cased')
tokenizer = XLNetTokenizer.from_pretrained(
    'huseinzol05/xlnet-base-bahasa-cased', do_lower_case = False
)
```
## Example using AutoModelWithLMHead
```python
from transformers import XLNetTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/xlnet-base-bahasa-cased')
tokenizer = XLNetTokenizer.from_pretrained(
    'huseinzol05/xlnet-base-bahasa-cased', do_lower_case = False
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan <mask>'))
```
## Results
For further details on model performance, check out the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare against traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train XLNet for Bahasa.
# BERT-base-cased-qa-evaluator
This model takes a question-answer pair as input and outputs a value representing its prediction of whether the input is a valid question-answer pair. The model is a pretrained [BERT-base-cased](https://huggingface.co/bert-base-cased) with a sequence classification head.
## Intended uses
The QA evaluator was originally designed to be used with the [t5-base-question-generator](https://huggingface.co/iarfmoose/t5-base-question-generator) for evaluating the quality of generated questions.
The input for the QA evaluator follows the format for `BertForSequenceClassification`, but using the question and answer as the two sequences. Inputs should take the following format:
```
[CLS] <question> [SEP] <answer> [SEP]
```
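A minimal usage sketch is shown below (not an official script). It assumes the checkpoint id `iarfmoose/bert-base-cased-qa-evaluator` referenced later in this document; the question and answer are illustrative only.
```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

model_id = 'iarfmoose/bert-base-cased-qa-evaluator'
tokenizer = BertTokenizer.from_pretrained(model_id)
model = BertForSequenceClassification.from_pretrained(model_id)

question = 'What is the capital of Bulgaria?'
answer = 'Sofia is the capital of Bulgaria.'

# Passing the pair encodes it as [CLS] question [SEP] answer [SEP].
inputs = tokenizer(question, answer, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)  # scores for the two classes (original vs. corrupted pair)
```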
## Limitations and bias
The model is trained to evaluate if a question and answer are semantically related, but cannot determine whether an answer is actually true/correct or not.
## Training data
The training data was made up of question-answer pairs from the following datasets:
- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)
- [RACE](http://www.cs.cmu.edu/~glai1/data/race/)
- [CoQA](https://stanfordnlp.github.io/coqa/)
- [MSMARCO](https://microsoft.github.io/msmarco/)
## Training procedure
The question and answer were concatenated 50% of the time. For the other 50%, a corruption operation was performed (either swapping the answer for an unrelated answer, or copying part of the question into the answer). The model was then trained to predict whether the input sequence represented one of the original QA pairs or a corrupted input.
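For illustration only, a rough sketch of this corruption scheme might look like the following; this is not the authors' training script, and the split between the two corruption types is an assumption.
```python
import random

def make_example(question, answer, all_answers):
    """Return (question, answer, label): 1 = original pair, 0 = corrupted pair."""
    if random.random() < 0.5:
        return question, answer, 1  # keep the original QA pair
    if random.random() < 0.5:
        # Corruption 1: swap in an unrelated answer from another example.
        return question, random.choice(all_answers), 0
    # Corruption 2: copy part of the question into the answer.
    words = question.split()
    return question, ' '.join(words[: max(1, len(words) // 2)]), 0
```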
---
language: bg
---
# RoBERTa-base-bulgarian-POS
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This model is a version of [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian) fine-tuned for part-of-speech tagging.
## Intended uses
The model can be used to predict part-of-speech tags in Bulgarian text. Since the tokenizer uses byte-pair encoding, each word in the text may be split into more than one token. When predicting POS-tags, the last token from each word can be used. Using the last token was found to slightly outperform predictions based on the first token.
An example of this can be found [here](https://github.com/iarfmoose/bulgarian-nlp/blob/master/models/postagger.py).
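A minimal sketch of the last-token trick is shown below. It assumes the fine-tuned checkpoint is published as `iarfmoose/roberta-base-bulgarian-pos` (an id inferred from this card's title) and that a fast tokenizer is available so `word_ids()` can be used; see the linked script for the full approach.
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = 'iarfmoose/roberta-base-bulgarian-pos'  # assumed id, based on the card title
tokenizer = AutoTokenizer.from_pretrained(model_id)  # fast tokenizer needed for word_ids()
model = AutoModelForTokenClassification.from_pretrained(model_id)

enc = tokenizer('Това е изречение на български.', return_tensors='pt')
with torch.no_grad():
    preds = model(**enc).logits.argmax(-1)[0].tolist()

# Later sub-tokens overwrite earlier ones, so each word keeps its last token's tag.
tags = {}
for idx, word_id in enumerate(enc.word_ids()):
    if word_id is not None:
        tags[word_id] = model.config.id2label[preds[idx]]
print(tags)
```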
## Limitations and bias
The pretraining data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
In addition to the pretraining data used in [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian), the model was trained on the UPOS tags from [UD_Bulgarian-BTB](https://github.com/UniversalDependencies/UD_Bulgarian-BTB).
## Training procedure
The model was trained for 5 epochs over the training set. The loss was calculated based on the POS-tag prediction for the last token of each word. The model achieves 97% on the test set.
---
language: bg
---
# RoBERTa-base-bulgarian
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This is a version of [RoBERTa-base](https://huggingface.co/roberta-base) pretrained on Bulgarian text.
## Intended uses
This model can be used for cloze tasks (masked language modeling) or finetuned on other tasks in Bulgarian.
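For example, a cloze prediction can be run with the `fill-mask` pipeline. This is a minimal sketch; the Bulgarian example sentence is illustrative only.
```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='iarfmoose/roberta-base-bulgarian')

# RoBERTa models use <mask> as the mask token.
print(fill_mask('София е <mask> на България.'))
```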
## Limitations and bias
The training data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
This model was trained on the following data:
- [bg_dedup from OSCAR](https://oscar-corpus.com/)
- [Newscrawl 1 million sentences 2017 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
- [Wikipedia 1 million sentences 2016 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
## Training procedure
The model was pretrained using a masked language-modeling objective with dynamic masking as described [here](https://huggingface.co/roberta-base#preprocessing).
It was trained for 200k steps. The batch size was limited to 8 due to GPU memory limitations.
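As a rough illustration (not the original pretraining script), dynamic masking can be reproduced with the `transformers` MLM data collator, which re-samples the masked positions every time a batch is built:
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('iarfmoose/roberta-base-bulgarian')
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Each call masks a fresh random 15% of tokens, so the same sentence gets
# different masks across epochs (dynamic masking).
batch = collator([tokenizer('София е столицата на България.')])
print(batch['input_ids'])
print(batch['labels'])
```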
---
language: bg
---
# RoBERTa-small-bulgarian-POS
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This model is a version of [RoBERTa-small-Bulgarian](https://huggingface.co/iarfmoose/roberta-small-bulgarian) fine-tuned for part-of-speech tagging.
## Intended uses
The model can be used to predict part-of-speech tags in Bulgarian text. Since the tokenizer uses byte-pair encoding, each word in the text may be split into more than one token. When predicting POS-tags, the last token from each word can be used. Using the last token was found to slightly outperform predictions based on the first token.
An example of this can be found [here](https://github.com/iarfmoose/bulgarian-nlp/blob/master/models/postagger.py).
## Limitations and bias
The pretraining data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
In addition to the pretraining data used in [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian), the model was trained on the UPOS tags from [UD_Bulgarian-BTB](https://github.com/UniversalDependencies/UD_Bulgarian-BTB).
## Training procedure
The model was trained for 5 epochs over the training set. The loss was calculated based on the POS-tag prediction for the last token of each word. The model achieves 98% on the test set.
---
language: bg
---
# RoBERTa-small-bulgarian
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This is a smaller version of [RoBERTa-base-bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian) with only 6 hidden layers, but similar performance.
## Intended uses
This model can be used for cloze tasks (masked language modeling) or finetuned on other tasks in Bulgarian.
## Limitations and bias
The training data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
This model was trained on the following data:
- [bg_dedup from OSCAR](https://oscar-corpus.com/)
- [Newscrawl 1 million sentences 2017 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
- [Wikipedia 1 million sentences 2016 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
## Training procedure
The model was pretrained using a masked language-modeling objective with dynamic masking as described [here](https://huggingface.co/roberta-base#preprocessing).
It was trained for 160k steps. The batch size was limited to 8 due to GPU memory limitations.
# t5-base-question-generator
## Model description
This model is a sequence-to-sequence question generator which takes an answer and context as an input, and generates a question as an output. It is based on a pretrained `t5-base` model.
## Intended uses & limitations
The model is trained to generate reading comprehension-style questions with answers extracted from a text. The model performs best with full sentence answers, but can also be used with single word or short phrase answers.
#### How to use
The model takes concatenated answers and context as an input sequence, and will generate a full question sentence as an output sequence. The max sequence length is 512 tokens. Inputs should be organised into the following format:
```
answer_token <answer-phrase> context_token <context-from-text>
```
The input sequence can then be encoded and passed as the `input_ids` argument in the model's `generate()` method.
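A minimal sketch of this is shown below (not an official example). It assumes the checkpoint `iarfmoose/t5-base-question-generator` mentioned earlier, and that the special tokens are literally `<answer>` and `<context>`; check the model's tokenizer for the exact token strings.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = 'iarfmoose/t5-base-question-generator'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

answer = 'Sofia is the capital of Bulgaria.'
context = ('Sofia is the capital and largest city of Bulgaria, '
           'situated in the west of the country.')

# Build the input in the format described above (token names are assumed).
text = f'<answer> {answer} <context> {context}'
input_ids = tokenizer.encode(text, max_length=512, truncation=True, return_tensors='pt')

# Generate the question from the encoded sequence.
output_ids = model.generate(input_ids, max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```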
For best results, a large number of questions can be generated, and then filtered using [iarfmoose/bert-base-cased-qa-evaluator](https://huggingface.co/iarfmoose/bert-base-cased-qa-evaluator).
For examples, please see https://github.com/iarfmoose/question_generator.
#### Limitations and bias
The model is limited to generating questions in the same style as those found in [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/), [CoQA](https://stanfordnlp.github.io/coqa/), and [MSMARCO](https://microsoft.github.io/msmarco/). The generated questions can potentially be leading or reflect biases that are present in the context. If the context is too short or completely absent, or if the context and answer do not match, the generated question is likely to be incoherent.
## Training data
The model was fine-tuned on a dataset made up of several well-known QA datasets ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/), [CoQA](https://stanfordnlp.github.io/coqa/), and [MSMARCO](https://microsoft.github.io/msmarco/)). The datasets were restructured by concatenating the answer and context fields into the previously-mentioned format. The question field from the datasets was used as the target during training. The full training set was roughly 200,000 examples.
## Training procedure
The model was trained for 20 epochs over the training set with a learning rate of 1e-3. The batch size was only 4 due to GPU memory limitations when training on Google Colab.