This is a first run of a Hindi language model trained with Google Research's [ELECTRA](https://github.com/google-research/electra). **I don't modify ELECTRA itself until we get into fine-tuning.**
Tokenization and training CoLab: https://colab.research.google.com/drive/1R8TciRSM7BONJRBc9CBZbzOmz39FTLl_
Blog post: https://medium.com/@mapmeld/teaching-hindi-to-electra-b11084baab81
- Vocabulary created with HuggingFace Tokenizers; it could be longer or shorter, so review ELECTRA's `vocab_size` parameter to match
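A minimal sketch of building a WordPiece vocabulary with HuggingFace Tokenizers. The corpus filename and `vocab_size=30522` are placeholder assumptions, not the values used for this model; whatever you choose here must match ELECTRA's `vocab_size` parameter later.

```python
from pathlib import Path
from tokenizers import BertWordPieceTokenizer

# Tiny stand-in corpus for illustration; point this at your real Hindi text file(s).
corpus = Path("hindi_corpus.txt")
corpus.write_text("नमस्ते दुनिया\nयह एक परीक्षण वाक्य है\n", encoding="utf-8")

# lowercase/strip_accents off so Devanagari characters and matras are preserved
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(
    files=[str(corpus)],
    vocab_size=30522,  # assumed value; keep in sync with ELECTRA's vocab_size
)
tokenizer.save_model(".")  # writes vocab.txt, consumed by build_pretraining_dataset.py
```

The resulting `vocab.txt` is what the pretraining-data step below reads.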
## Pretrain TF Records
[build_pretraining_dataset.py](https://github.com/google-research/electra/blob/master/build_pretraining_dataset.py) splits the corpus into training documents and writes them out as TFRecord files.
Set the ELECTRA model size and whether to split the corpus by newlines. This process can take hours on its own.
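An example invocation, following the pattern in the ELECTRA README; the `$DATA_DIR` paths are placeholders for your own corpus, vocab, and output locations. `--no-lower-case` keeps Devanagari text as-is, and `--blanks-separate-docs` controls whether blank lines split the corpus into documents.

```shell
python3 build_pretraining_dataset.py \
  --corpus-dir $DATA_DIR/corpus \
  --vocab-file $DATA_DIR/vocab.txt \
  --output-dir $DATA_DIR/pretrain_tfrecords \
  --max-seq-length 128 \
  --blanks-separate-docs False \
  --no-lower-case \
  --num-processes 5
```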