- a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.
- a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
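The four accepted forms listed above can be told apart roughly as follows. This is a minimal sketch, not the library's actual resolution logic: the helper name `classify_source` and its ordering of checks are assumptions made purely for illustration.

```python
import os


def classify_source(name_or_path):
    """Hypothetical helper: guess which documented form an argument takes.

    Assumption: local paths are checked first, then URLs, then the
    'namespace/name' shape of user-uploaded identifiers.
    """
    if os.path.isdir(name_or_path):
        return "directory"
    if os.path.isfile(name_or_path) or name_or_path.startswith(("http://", "https://")):
        return "single vocabulary file"
    if "/" in name_or_path:
        return "user-uploaded identifier"
    return "shortcut name"
```

Because the filesystem is consulted first, a directory path is only recognised as such when it already exists on disk.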
...
@@ -106,18 +115,30 @@ class AutoTokenizer(object):
Examples::
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') # Download vocabulary from S3 and cache.
# Download vocabulary from S3 and cache.
tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/') # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`
@@ -78,6 +78,8 @@ class PreTrainedTokenizer(object):
"pad_token", "cls_token", "mask_token",
"additional_special_tokens"]
padding_side = "right"
@property
def bos_token(self):
""" Beginning of sentence token (string). Log an error if used while not having been set. """
...
@@ -191,6 +193,11 @@ class PreTrainedTokenizer(object):
""" Id of the padding token in the vocabulary. Log an error if used while not having been set. """
return self.convert_tokens_to_ids(self.pad_token)
@property
def pad_token_type_id(self):
""" Id of the padding token type in the vocabulary."""
return self._pad_token_type_id
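How `padding_side` and `pad_token_type_id` might work together when padding a sequence can be sketched as below. `ToyTokenizer` and its `pad` method are hypothetical stand-ins written for this illustration; they are not the real `PreTrainedTokenizer` implementation, and the default ids of 0 are assumptions.

```python
class ToyTokenizer:
    """Hypothetical stand-in, not the real PreTrainedTokenizer."""

    padding_side = "right"  # class-level default, as in the hunk above

    def __init__(self, pad_token_id=0, pad_token_type_id=0):
        # Assumed defaults; a real tokenizer derives these from its vocabulary.
        self._pad_token_id = pad_token_id
        self._pad_token_type_id = pad_token_type_id

    @property
    def pad_token_type_id(self):
        """Id of the padding token type."""
        return self._pad_token_type_id

    def pad(self, input_ids, token_type_ids, max_length):
        """Pad both lists to max_length on the side given by padding_side."""
        n_missing = max(0, max_length - len(input_ids))
        pad_ids = [self._pad_token_id] * n_missing
        pad_types = [self.pad_token_type_id] * n_missing
        if self.padding_side == "right":
            return input_ids + pad_ids, token_type_ids + pad_types
        if self.padding_side == "left":
            return pad_ids + input_ids, pad_types + token_type_ids
        raise ValueError("padding_side must be 'left' or 'right'")
```

Keeping `padding_side` as a class attribute lets a subclass or a single instance override the default without touching the padding logic.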
@property
def cls_token_id(self):
""" Id of the classification token in the vocabulary. E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. """
...
@@ -214,12 +221,17 @@ class PreTrainedTokenizer(object):
# inputs and kwargs for saving and re-loading (see ``from_pretrained`` and ``save_pretrained``)
...
@@ -244,6 +256,7 @@ class PreTrainedTokenizer(object):
pretrained_model_name_or_path: either:
- a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.
- a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
...
@@ -271,6 +284,9 @@ class PreTrainedTokenizer(object):