Below is the list of corpora used, along with the output of the `wc` command (line, word, and character counts) for each. The corpora were concatenated and tokenized with the HuggingFace RoBERTa tokenizer.
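As a rough illustration of how such counts are obtained, the sketch below mimics `wc` in Python. The function names `wc_counts` and `total_counts` are hypothetical and not part of the original pipeline. Plain `wc` reports bytes in its third column while `wc -m` reports characters, so the sketch returns both; which of the two the table uses is not stated here.

```python
def wc_counts(path, encoding="utf-8"):
    """Return (lines, words, chars, bytes) for one text file.

    Approximates `wc`: lines are newline characters, words are
    whitespace-separated runs, chars are Unicode code points (`wc -m`),
    and bytes are the encoded size (`wc -c`).
    """
    lines = words = chars = nbytes = 0
    # newline="" disables newline translation so counts match the file on disk.
    with open(path, "r", encoding=encoding, newline="") as fh:
        for line in fh:
            lines += line.count("\n")
            words += len(line.split())
            chars += len(line)
            nbytes += len(line.encode(encoding))
    return lines, words, chars, nbytes


def total_counts(paths, encoding="utf-8"):
    """Sum the per-file counts, as `wc` does in its final `total` row."""
    totals = [0, 0, 0, 0]
    for path in paths:
        for i, value in enumerate(wc_counts(path, encoding)):
            totals[i] += value
    return tuple(totals)
```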
| Corpus | Lines | Words | Characters |
| ------------- |--------------:| -----:| -----:|
| [Ukrainian Wikipedia - May 2020](https://dumps.wikimedia.org/ukwiki/latest/ukwiki-latest-pages-articles.xml.bz2) | 18 001 466| 201 207 739 | 2 647 891 947 |