[IndoBERT](https://arxiv.org/pdf/2011.00677.pdf) is the Indonesian version of the BERT model. We trained the model on over 220M words, aggregated from three main sources:
* Indonesian Wikipedia (74M words)
* news articles from Kompas, Tempo (Tala et al., 2003), and Liputan6 (55M words in total)
* an Indonesian Web Corpus (Medved and Suchomel, 2017) (90M words).

We trained the model for 2.4M steps (180 epochs), reaching a final perplexity of <b>3.97</b> on the development set (similar to English BERT-base).
This <b>IndoBERT</b> was evaluated on IndoLEM, an Indonesian benchmark comprising seven tasks that span morpho-syntax, semantics, and discourse.
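
To make the reported perplexity concrete, here is a minimal sketch of how perplexity relates to per-token log-probabilities: it is the exponential of the mean negative log-likelihood. The function name `perplexity` and the sample probabilities are hypothetical and for illustration only; this is not the evaluation script used for IndoBERT, and it assumes natural-log probabilities for the predicted (masked) tokens.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigned to four masked tokens
# (made-up values, not IndoBERT outputs).
example_log_probs = [math.log(p) for p in (0.5, 0.25, 0.25, 0.125)]

# The mean negative log-likelihood here is 2 * ln(2), so the result is about 4.
print(perplexity(example_log_probs))
```

Read this way, a perplexity of 3.97 corresponds to a mean negative log-likelihood of about 1.38 nats per predicted token, which is roughly the uncertainty of choosing uniformly among four candidates.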