[model_cards] sarahlintang/IndoBERT (#7748)

* Create README.md
* Update model_cards/sarahlintang/IndoBERT/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

Unverified Commit 3fdbeba8 authored Oct 15, 2020 by sarahlintang Committed by GitHub Oct 14, 2020

Showing with 43 additions and 0 deletions

model_cards/sarahlintang/IndoBERT/README.md +43 -0

--- a/model_cards/sarahlintang/IndoBERT/README.md
+++ b/model_cards/sarahlintang/IndoBERT/README.md
+---
+language: id
+datasets:
+- oscar
+---
+# IndoBERT (Indonesian BERT Model)
+
+## Model description
+IndoBERT is a pre-trained language model for Indonesian, based on the BERT architecture.
+
+This is the base, uncased version, which uses the bert-base configuration.
+
+## Intended uses & limitations
+
+#### How to use
+
+```python
+from transformers import AutoTokenizer, AutoModel
+
+# Load the tokenizer and the pre-trained encoder from the model hub
+tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
+model = AutoModel.from_pretrained("sarahlintang/IndoBERT")
+
+# Encode a sentence into token IDs
+tokenizer.encode("hai aku mau makan.")
+# [2, 8078, 1785, 2318, 1946, 18, 4]
+```
+
+
+## Training data
+
+This model was pre-trained on 16 GB of raw text (~2 B words) from the OSCAR corpus (https://oscar-corpus.com/).
+
+This model is equivalent in size to the bert-base model and has a vocabulary of 32,000 tokens.
+
+## Training procedure
+
+The model was trained with Google’s original TensorFlow code on an eight-core Google Cloud TPU v2.
+We used a Google Cloud Storage bucket for persistent storage of training data and models.
+
+## Eval results
+
+We evaluated this model on three Indonesian NLP downstream tasks:
+- extractive summarization
+- sentiment analysis
+- part-of-speech tagging
+
+The model outperformed multilingual BERT on all three tasks.
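
The "How to use" example in the README returns seven token IDs for a sentence with four words and a period. A minimal sketch of how that output can be read, without downloading the model, is below. It rests on two assumptions that the README does not state: that the first and last IDs are the start and end markers a BERT-style tokenizer adds by default, and that each word and the period maps to a single vocabulary entry. The `pieces` list is hypothetical and only illustrates the alignment.

```python
# Token IDs copied from the README example for "hai aku mau makan."
ids = [2, 8078, 1785, 2318, 1946, 18, 4]

# Assumption: the first and last IDs are the [CLS] and [SEP] markers that
# BERT-style tokenizers add by default when encoding a single sentence.
cls_id, *content_ids, sep_id = ids

# Assumption: each word and the final period is one vocabulary entry.
pieces = ["hai", "aku", "mau", "makan", "."]

# Pair each assumed piece with the ID in the same position.
for piece, token_id in zip(pieces, content_ids):
    print(f"{piece!r:>8} -> {token_id}")
print(f"special tokens: start={cls_id}, end={sep_id}")
```

With the tokenizer loaded as in the README, `tokenizer.convert_ids_to_tokens(ids)` would show the actual pieces and confirm or refute both assumptions.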