chenpangpang/transformers — commit 9ebb5b2a (parent 9e54efd0)
Authored May 08, 2020 by rmroczkowski, committed via GitHub
Model card for allegro/herbert-klej-cased-tokenizer-v1 (#4184)
1 changed file: model_cards/allegro/herbert-klej-cased-tokenizer-v1/README.md (new file, +44 −0)
---
language: polish
---
# HerBERT tokenizer

The **[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** tokenizer is a character-level byte-pair encoding (BPE) tokenizer with a vocabulary of 50k tokens. It was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of the [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) using the [fastBPE](https://github.com/glample/fastBPE) library.

The tokenizer uses the `XLMTokenizer` implementation from [transformers](https://github.com/huggingface/transformers).
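To make the training procedure above concrete, here is a minimal, self-contained sketch of a single BPE merge step in plain Python. It is an illustration only: the real vocabulary was built with fastBPE, and the toy corpus, function names, and frequencies below are hypothetical.

```python
# Illustrative sketch of one byte-pair-encoding (BPE) merge step.
# NOT the fastBPE implementation; the corpus and helpers are made up.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a frequency-weighted corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, mapped to their frequencies.
corpus = {("l", "a", "s"): 5, ("l", "a", "s", "u"): 2, ("l", "a", "m", "p", "a"): 3}
pair = most_frequent_pair(corpus)   # ('l', 'a') is the most frequent pair here
corpus = merge_pair(corpus, pair)   # ('l', 'a') becomes the single symbol 'la'
```

Repeating this merge step until the vocabulary reaches the target size (50k tokens in HerBERT's case) yields the final BPE vocabulary.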
## Tokenizer usage

The HerBERT tokenizer should be used together with the [HerBERT model](https://huggingface.co/allegro/herbert-klej-cased-v1):

```python
from transformers import XLMTokenizer, RobertaModel

tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")

encoded_input = tokenizer.encode(
    "Kto ma lepszą sztukę, ma lepszy rząd – to jasne.",
    return_tensors="pt",
)
outputs = model(encoded_input)
```
## License
CC BY-SA 4.0
## Citation
If you use this tokenizer, please cite the following paper:
```
@misc{rybak2020klej,
title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
year={2020},
eprint={2005.00630},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
The paper has been accepted at ACL 2020; we will update the BibTeX entry as soon as the proceedings appear.
## Authors

The tokenizer was created by the **Allegro Machine Learning Research** team.
You can contact us at:
<a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>