Tokenizer
----------------------------------------------------

The base class ``PreTrainedTokenizer`` implements the common methods for loading and saving a tokenizer, either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).
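
As a rough illustration of the save/load round trip described above, here is a minimal standard-library sketch. ``ToyVocabTokenizer``, the ``vocab.json`` file name, and the ``round_trip`` helper are hypothetical stand-ins written for this example, not the library's implementation; only the method names ``save_pretrained`` and ``from_pretrained`` are borrowed from ``PreTrainedTokenizer``. Downloading a pretrained tokenizer by name is deliberately left out.

```python
import json
import tempfile
from pathlib import Path


class ToyVocabTokenizer:
    """Toy stand-in for illustration only, NOT the library's PreTrainedTokenizer."""

    VOCAB_FILE = "vocab.json"  # hypothetical file name chosen for this sketch

    def __init__(self, vocab):
        self.vocab = dict(vocab)  # maps token -> id

    def save_pretrained(self, directory):
        """Write the vocabulary into `directory` and return the file path."""
        path = Path(directory) / self.VOCAB_FILE
        path.write_text(json.dumps(self.vocab), encoding="utf-8")
        return str(path)

    @classmethod
    def from_pretrained(cls, location):
        """Load from a local directory or directly from a vocabulary file."""
        path = Path(location)
        if path.is_dir():
            path = path / cls.VOCAB_FILE
        return cls(json.loads(path.read_text(encoding="utf-8")))


def round_trip(vocab):
    """Save a toy tokenizer, then reload it from the directory and from the file."""
    with tempfile.TemporaryDirectory() as tmp:
        saved_file = ToyVocabTokenizer(vocab).save_pretrained(tmp)
        from_dir = ToyVocabTokenizer.from_pretrained(tmp)
        from_file = ToyVocabTokenizer.from_pretrained(saved_file)
    return from_dir.vocab, from_file.vocab
```

Both loading paths are expected to give back the vocabulary that was saved, which is the property the real ``from_pretrained``/``save_pretrained`` pair is meant to provide for a full tokenizer.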

``PreTrainedTokenizer`` is also the main entry point to the tokenizers, since it implements the methods shared by all of them:

- tokenizing, converting tokens to ids and back, and encoding/decoding,
- adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece, ...),
- managing special tokens (adding them, assigning them to roles, and making sure they are not split during tokenization).
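
The three responsibilities listed above can be sketched with a toy, standard-library-only class. ``ToyTokenizer`` is a hypothetical stand-in that splits on whitespace and falls back to single characters for unknown words; it is not how the library's BPE or SentencePiece tokenizers work. The method names mirror ``PreTrainedTokenizer``, but the bodies and the ``no_split`` attribute are assumptions made for this example.

```python
class ToyTokenizer:
    """Toy stand-in for illustration only, NOT the library's PreTrainedTokenizer."""

    def __init__(self, vocab, unk_token="[UNK]"):
        # Assumes the ids in `vocab` are contiguous and start at 0,
        # so that len(vocab) is always the next free id.
        self.token_to_id = dict(vocab)
        if unk_token not in self.token_to_id:
            self.token_to_id[unk_token] = len(self.token_to_id)
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}
        self.unk_token = unk_token
        self.no_split = {unk_token}  # tokens that must never be split

    def add_tokens(self, new_tokens):
        """Add tokens to the vocabulary; return how many were actually new."""
        added = 0
        for token in new_tokens:
            if token not in self.token_to_id:
                new_id = len(self.token_to_id)
                self.token_to_id[token] = new_id
                self.id_to_token[new_id] = token
                added += 1
            self.no_split.add(token)
        return added

    def tokenize(self, text):
        """Split on whitespace; unknown words fall back to single characters."""
        tokens = []
        for word in text.split():
            if word in self.no_split or word in self.token_to_id:
                tokens.append(word)  # known or protected token: keep whole
            else:
                tokens.extend(word)  # toy "subword" step: one token per character
        return tokens

    def convert_tokens_to_ids(self, tokens):
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(t, unk_id) for t in tokens]

    def convert_ids_to_tokens(self, ids):
        return [self.id_to_token[i] for i in ids]

    def encode(self, text):
        return self.convert_tokens_to_ids(self.tokenize(text))

    def decode(self, ids):
        return " ".join(self.convert_ids_to_tokens(ids))
```

With a starting vocabulary of ``{"hello": 0, "world": 1}``, the string ``"hello [SEP] world"`` is first broken into characters around the brackets. After ``add_tokens(["[SEP]"])``, the same string tokenizes to three tokens, and ``encode`` followed by ``decode`` returns the original text. Keeping added tokens in a separate "never split" set is one simple way to make the behaviour independent of the underlying splitting scheme.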

``PreTrainedTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PreTrainedTokenizer
    :members: