- a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.
- a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
...
...
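The four accepted forms of `pretrained_model_name_or_path` listed above can be told apart roughly as follows. This is a minimal sketch, not the library's actual resolution logic: `classify_tokenizer_source` is a hypothetical helper, and treating any non-path string containing `/` as a user-uploaded identifier is an assumption made purely for illustration.

```python
import os


def classify_tokenizer_source(name_or_path):
    """Hypothetical helper: guess which documented form a string takes."""
    # A directory on disk, e.g. one written by save_pretrained().
    if os.path.isdir(name_or_path):
        return "directory"
    # A single vocabulary file, given as a local path or a URL.
    if os.path.isfile(name_or_path) or name_or_path.startswith(("http://", "https://")):
        return "single vocabulary file"
    # Assumption: namespaced names such as 'dbmdz/bert-base-german-cased'
    # are user-uploaded identifiers.
    if "/" in name_or_path:
        return "identifier name"
    # Anything else is treated as a predefined shortcut name.
    return "shortcut name"
```

Note that the sketch checks the filesystem first, so a path that does not exist yet (for example a mistyped directory) falls through to the name-based branches.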
@@ -106,18 +115,30 @@ class AutoTokenizer(object):
Examples::
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') # Download vocabulary from S3 and cache.
tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/') # E.g. tokenizer was saved using `save_pretrained('./test/bert_saved_model/')`
@@ -78,6 +78,8 @@ class PreTrainedTokenizer(object):
"pad_token","cls_token","mask_token",
"additional_special_tokens"]
    padding_side = "right"

    @property
    def bos_token(self):
        """ Beginning of sentence token (string). Log an error if used while not having been set. """
...
...
@@ -191,6 +193,11 @@ class PreTrainedTokenizer(object):
""" Id of the padding token in the vocabulary. Log an error if used while not having been set. """
        return self.convert_tokens_to_ids(self.pad_token)

    @property
    def pad_token_type_id(self):
        """ Id of the padding token type in the vocabulary."""
        return self._pad_token_type_id

    @property
    def cls_token_id(self):
        """ Id of the classification token in the vocabulary. E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. """
...
...
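The hunks above add a `padding_side` class attribute and a `pad_token_type_id` property, but the excerpt does not show where they are consumed. One plausible use is padding an encoded sequence to a fixed length, sketched below. `pad_encoding` is a hypothetical standalone function, not a method of `PreTrainedTokenizer`, and the default ids of `0` are assumptions.

```python
def pad_encoding(input_ids, token_type_ids, max_length,
                 pad_token_id=0, pad_token_type_id=0, padding_side="right"):
    """Hypothetical helper: pad ids and token type ids up to max_length."""
    difference = max_length - len(input_ids)
    if difference <= 0:
        # Already long enough: return copies unchanged (no truncation here).
        return list(input_ids), list(token_type_ids)
    if padding_side == "right":
        # Append padding after the real tokens.
        return (list(input_ids) + [pad_token_id] * difference,
                list(token_type_ids) + [pad_token_type_id] * difference)
    if padding_side == "left":
        # Prepend padding before the real tokens.
        return ([pad_token_id] * difference + list(input_ids),
                [pad_token_type_id] * difference + list(token_type_ids))
    raise ValueError("Invalid padding side: " + str(padding_side))
```

Keeping the padding side as a single attribute lets a model that expects left padding reuse the same code path as one that expects right padding.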
@@ -214,12 +221,17 @@ class PreTrainedTokenizer(object):
# inputs and kwargs for saving and re-loading (see ``from_pretrained`` and ``save_pretrained``)
...
...
@@ -244,6 +256,7 @@ class PreTrainedTokenizer(object):
pretrained_model_name_or_path: either:
- a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.
- a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
...
...
@@ -271,6 +284,9 @@ class PreTrainedTokenizer(object):