The training dataset was tokenized into subwords with the [HerBERT tokenizer](https://huggingface.co/allegro/herbert-klej-cased-tokenizer-v1), a character-level byte-pair encoding with
a vocabulary size of 50k tokens. The tokenizer itself was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of
the [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) using the [fastBPE](https://github.com/glample/fastBPE) library.
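As an illustration of how learned BPE codes are applied, here is a minimal sketch using fastBPE's Python bindings; the `codes.bpe` and `vocab.txt` paths are hypothetical stand-ins for the files produced during tokenizer training:

```python
import fastBPE

# Hypothetical paths to the learned BPE codes and vocabulary files.
bpe = fastBPE.fastBPE("codes.bpe", "vocab.txt")

# Apply the learned merges to an illustrative Polish sentence.
print(bpe.apply(["Barack Obama odwiedził Warszawę ."]))
```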
The tokenizer uses the `XLMTokenizer` implementation; for that reason, it should be loaded as `allegro/herbert-klej-cased-tokenizer-v1`.
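For example, the tokenizer can be loaded and used with the `transformers` library as follows (a minimal sketch; the sample sentence is illustrative):

```python
from transformers import XLMTokenizer

# Load the HerBERT tokenizer through the XLMTokenizer implementation.
tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")

# Split an illustrative Polish sentence into subword tokens.
print(tokenizer.tokenize("Barack Obama odwiedził Warszawę."))

# Encode to model-ready input ids with special tokens added.
encoded = tokenizer("Barack Obama odwiedził Warszawę.", return_tensors="pt")
print(encoded["input_ids"])
```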
HerBERT was evaluated on the [KLEJ](https://klejbenchmark.com/) benchmark, a publicly available set of nine evaluation tasks for Polish language understanding.
It achieved the best average performance and obtained the best results on three of the nine tasks.
| Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR |