Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.
Class attributes (overridden by derived classes):
- ``pretrained_config_archive_map``: a python ``dict`` with `short-cut-names` (string) as keys and `url` (string) of associated pretrained model configurations as values.
Parameters:
``finetuning_task``: string, default `None`. Name of the task used to fine-tune the model. This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.
``num_labels``: integer, default `2`. Number of classes to use when the model is a classification model (sequences/tokens).
``output_attentions``: boolean, default `False`. Should the model return attention weights.
``output_hidden_states``: boolean, default `False`. Should the model return all hidden states.
``torchscript``: boolean, default `False`. Is the model used with TorchScript.
"""
"""
pretrained_config_archive_map = {}
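For concreteness, a minimal sketch of how these common parameters are typically passed to a derived configuration class (``BertConfig`` is assumed here as the concrete subclass; the specific values are illustrative only)::

    from pytorch_transformers import BertConfig

    # The base-class parameters documented above are consumed from **kwargs
    # (see the `kwargs.pop(...)` pattern in __init__ below).
    config = BertConfig(num_labels=3,            # classification head with 3 classes (default is 2)
                        output_attentions=True,  # also return attention weights
                        output_hidden_states=True,
                        torchscript=False)
    assert config.num_labels == 3 and config.output_attentions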
...
@@ -81,8 +91,8 @@ class PretrainedConfig(object):
self.torchscript = kwargs.pop('torchscript', False)
def save_pretrained(self, save_directory):
""" Save a configuration object to a directory, so that it
""" Save a configuration object to the directory `save_directory`, so that it
can be re-loaded using the `from_pretrained(save_directory)` class method.
can be re-loaded using the :func:`~pytorch_transformers.PretrainedConfig.from_pretrained` class method.
"""
"""
assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved"
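A quick sketch of the save/re-load round trip described above (``BertConfig`` stands in for a concrete subclass; ``./my_model_directory/`` is a hypothetical path that must exist because of the assert above)::

    import os
    from pytorch_transformers import BertConfig

    save_directory = './my_model_directory/'   # hypothetical directory
    os.makedirs(save_directory, exist_ok=True)

    config = BertConfig.from_pretrained('bert-base-uncased')
    config.save_pretrained(save_directory)                  # writes config.json into the directory
    reloaded = BertConfig.from_pretrained(save_directory)   # re-load from that directory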
...
@@ -93,41 +103,42 @@ class PretrainedConfig(object):
r""" Instantiate a PretrainedConfig from a pre-trained model configuration.
r""" Instantiate a :class:`~pytorch_transformers.PretrainedConfig` (or a derived class) from a pre-trained model configuration.
Parameters:
pretrained_model_name_or_path: either:
- a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``.
- a path to a `directory` containing a configuration file saved using the :func:`~pytorch_transformers.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.
- a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``.
cache_dir: (`optional`) string:
Path to a directory in which a downloaded pre-trained model
configuration should be cached if the standard cache should not be used.
kwargs: (`optional`) dict: key/value pairs with which to update the configuration object after loading.
- The values in kwargs of any keys which are configuration attributes will be used to override the loaded values.
- Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter.
return_unused_kwargs: (`optional`) bool:
- If False, then this function returns just the final configuration object.
- If True, then this function returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes, i.e. the part of ``kwargs`` which has not been used to update `config` and is otherwise ignored.
Examples::
# We can't instantiate directly the base class `PretrainedConfig` so let's show the examples on a
# derived class: BertConfig
config = BertConfig.from_pretrained('bert-base-uncased')  # Download configuration from S3 and cache.
config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
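The interaction between ``kwargs`` and ``return_unused_kwargs`` described above can be sketched as follows (``BertConfig`` again stands in for a concrete subclass; ``foo`` is a deliberately unknown key)::

    from pytorch_transformers import BertConfig

    # 'output_attentions' is a configuration attribute, so it overrides the loaded value;
    # 'foo' is not, so it is handed back in unused_kwargs instead of being applied.
    config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased',
                                                       output_attentions=True,
                                                       foo=False,
                                                       return_unused_kwargs=True)
    assert config.output_attentions == True
    assert unused_kwargs == {'foo': False}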
...
@@ -217,14 +228,26 @@ class PretrainedConfig(object):
class PreTrainedModel(nn.Module):
""" Base class for all models. Handle loading/storing model config and
r""" Base class for all models.
a simple interface for dowloading and loading pretrained models.
:class:`~pytorch_transformers.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models
as well as a few methods commons to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.
Class attributes (overridden by derived classes):
- ``config_class``: a class derived from :class:`~pytorch_transformers.PretrainedConfig` to use as configuration class for this model architecture.
- ``pretrained_model_archive_map``: a python ``dict`` with `short-cut-names` (string) as keys and `url` (string) of associated pretrained weights as values.
- ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:
- ``model``: an instance of the relevant subclass of :class:`~pytorch_transformers.PreTrainedModel`,
- ``config``: an instance of the relevant subclass of :class:`~pytorch_transformers.PretrainedConfig`,
- ``path``: a path (string) to the TensorFlow checkpoint.
- ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.
"""
"""
config_class = None
pretrained_model_archive_map = {}
load_tf_weights = lambda model, config, path: None
base_model_prefix=""
base_model_prefix=""
input_embeddings = None
def __init__(self, config, *inputs, **kwargs):
    super(PreTrainedModel, self).__init__()
...
@@ -282,17 +305,16 @@ class PreTrainedModel(nn.Module):
""" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
""" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.
Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.
Args:
Arguments:
new_num_tokens: (`optional`) int
New number of tokens in the embedding matrix.
new_num_tokens: (`optional`) int:
Increasing the size will add newly initialized vectors at the end
New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end.
Reducing the size will remove vectors from the end
If not provided or None: does nothing and just returns a pointer to the input tokens ``torch.nn.Embeddings`` Module of the model.
If not provided or None: does nothing and just returns a pointer to the input tokens Embedding Module of the model.
Return: ``torch.nn.Embeddings``
Return: ``torch.nn.Embeddings``
Pointer to the input tokens Embedding Module of the model
Pointer to the input tokens Embeddings Module of the model
"""
"""
base_model = getattr(self, self.base_model_prefix, self)  # get the base model if needed
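A minimal sketch of the resize behaviour described in the docstring (a Bert model and tokenizer are assumed; the new size is purely illustrative)::

    from pytorch_transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    # Grow the input embedding matrix, e.g. after adding tokens to the tokenizer.
    # The return value is a pointer to the (possibly new) input token embedding module.
    new_embeddings = model.resize_token_embeddings(new_num_tokens=len(tokenizer) + 2)

    # Passing None leaves the model untouched and just returns that pointer.
    current_embeddings = model.resize_token_embeddings(None)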
...
@@ -311,15 +333,17 @@ class PreTrainedModel(nn.Module):
def prune_heads(self, heads_to_prune):
""" Prunes heads of the base model.
""" Prunes heads of the base model.
Args:
heads_to_prune: dict of {layer_num (int): list of heads to prune in this layer (list of int)}
Arguments:
heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).
"""
"""
base_model = getattr(self, self.base_model_prefix, self)  # get the base model if needed
base_model._prune_heads(heads_to_prune)
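The ``heads_to_prune`` argument is a plain mapping from layer index to the head indices to prune in that layer, as in this sketch (BertModel assumed; the chosen layers and heads are arbitrary)::

    from pytorch_transformers import BertModel

    model = BertModel.from_pretrained('bert-base-uncased')

    # Prune heads 0 and 2 in layer 0, and head 11 in layer 5 of the base model.
    model.prune_heads({0: [0, 2], 5: [11]})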
def save_pretrained(self, save_directory):
""" Save a model with its configuration file to a directory, so that it
""" Save a model and its configuration file to a directory, so that it
can be re-loaded using the `from_pretrained(save_directory)` class method.
can be re-loaded using the `:func:`~pytorch_transformers.PreTrainedModel.from_pretrained`` class method.
"""
"""
assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved"
...
@@ -338,58 +362,53 @@ class PreTrainedModel(nn.Module):
r"""Instantiate a pretrained pytorch model from a pre-trained model configuration.
r"""Instantiate a pretrained pytorch model from a pre-trained model configuration.
The model is set in evaluation mode by default using `model.eval()` (Dropout modules are desactivated)
The model is set in evaluation mode by default using ``model.eval()`` (Dropout modules are deactivated)
To train the model, you should first set it back in training mode with `model.train()`
To train the model, you should first set it back in training mode with ``model.train()``
Params:
Parameters:
**pretrained_model_name_or_path**: either:
pretrained_model_name_or_path: either:
- a string with the `shortcut name` of a pre-trained model to load from cache
or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
- a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
- a path to a `directory` containing a configuration file saved
- a path to a `directory` containing model weights saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.
using the `save_pretrained(save_directory)` method.
- a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
- a path or url to a tensorflow index checkpoint `file` (e.g. `./tf_model/model.ckpt.index`).
In this case, ``from_tf`` should be set to True and a configuration object should be
model_args: (`optional`) Sequence of positional arguments:
provided as `config` argument. This loading option is slower than converting the TensorFlow
All remaning positional arguments will be passed to the underlying model's ``__init__`` method
checkpoint in a PyTorch model using the provided conversion scripts and loading
the PyTorch model afterwards.
config: (`optional`) instance of a class derived from :class:`~pytorch_transformers.PretrainedConfig`:
**model_args**: (`optional`) Sequence:
Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:
All remaning positional arguments will be passed to the underlying model's __init__ function
**config**: an optional configuration for the model to use instead of an automatically loaded configuation.
- the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or
Configuration can be automatically loaded when:
- the model was saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.
- the model is a model provided by the library (loaded with a `shortcut name` of a pre-trained model), or
- the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.
- the model was saved using the `save_pretrained(save_directory)` (loaded by suppling the save directory).
**state_dict**: an optional state dictionnary for the model to use instead of a state dictionary loaded
state_dict: (`optional`) dict:
from saved weights file.
an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.
This option can be used if you want to create a model from a pretrained configuraton but load your own weights.
This option can be used if you want to create a model from a pretrained configuration but load your own weights.
In this case though, you should check if using `save_pretrained(dir)` and `from_pretrained(save_directory)` is not
In this case though, you should check if using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and :func:`~pytorch_transformers.PreTrainedModel.from_pretrained` is not a simpler option.
a simpler option.
**cache_dir**: (`optional`) string:
cache_dir: (`optional`) string:
Path to a directory in which a downloaded pre-trained model
Path to a directory in which a downloaded pre-trained model
configuration should be cached if the standard cache should not be used.
configuration should be cached if the standard cache should not be used.
**output_loading_info**: (`optional`) boolean:
output_loading_info: (`optional`) boolean:
Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
**kwargs**: (`optional`) dict:
Dictionary of key, values to update the configuration object after loading.
kwargs: (`optional`) Remaining dictionary of keyword arguments:
Can be used to override selected configuration parameters. E.g. ``output_attention=True``.
Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:
- If a configuration is provided with `config`, **kwargs will be directly passed
- If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)
to the underlying model's __init__ method.
- If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~pytorch_transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.
- If a configuration is not provided, **kwargs will be first passed to the pretrained
model configuration class loading function (`PretrainedConfig.from_pretrained`).
Each key of **kwargs that corresponds to a configuration attribute
will be used to override said attribute with the supplied **kwargs value.
Remaining keys that do not correspond to any configuration attribute will
be passed to the underlying model's __init__ function.
Examples::
model = BertModel.from_pretrained('bert-base-uncased')  # Download model and configuration from S3 and cache.
model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
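Following the parameter description above (``from_tf`` plus an explicit ``config``), the TF-checkpoint case can be sketched as follows (the config JSON path is hypothetical)::

    from pytorch_transformers import BertConfig, BertModel

    # A configuration object must be supplied explicitly when loading from a TF index checkpoint.
    config = BertConfig.from_pretrained('./tf_model/bert_config.json')   # hypothetical config file
    model = BertModel.from_pretrained('./tf_model/model.ckpt.index',
                                      from_tf=True, config=config)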
""" An abstract class to handle dowloading and loading pretrained tokenizers and adding tokens to the vocabulary.
""" Base class for all tokenizers.
Handle all the shared methods for tokenization and special tokens as well as methods dowloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.
Derived class can set up a few special tokens to be used in common scripts and internals:
This class also contain the added tokens in a unified way on top of all tokenizers so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece...).
We defined an added_tokens_encoder to add new tokens to the vocabulary without having to handle the
Class attributes (overridden by derived classes):
specific vocabulary augmentation methods of the various underlying dictionnary structures (BPE, sentencepiece...).
- ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file required by the model, and as associated values, the filename for saving the associated file (string).
- ``pretrained_vocab_files_map``: a python ``dict of dict``, with the high-level keys being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level keys being the `short-cut-names` (string) of the pretrained models, and, as associated values, the `url` (string) to the associated pretrained vocabulary file.
- ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained models, and as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size.
Parameters:
- ``bos_token``: (`Optional`) string: a beginning of sentence token. Will be associated to ``self.bos_token``
- ``eos_token``: (`Optional`) string: an end of sentence token. Will be associated to ``self.eos_token``
- ``unk_token``: (`Optional`) string: an unknown token. Will be associated to ``self.unk_token``
- ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence). Will be associated to ``self.sep_token``
- ``pad_token``: (`Optional`) string: a padding token. Will be associated to ``self.pad_token``
- ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model). Will be associated to ``self.cls_token``
- ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language modeling). Will be associated to ``self.mask_token``
- ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens. Adding all special tokens here ensure they won't be split by the tokenization process. Will be associated to ``self.additional_special_tokens``
"""
"""
vocab_files_names = {}
pretrained_vocab_files_map = {}
...
@@ -49,48 +69,56 @@ class PreTrainedTokenizer(object):
@property
def bos_token(self):
    """ Beginning of sentence token (string). Log an error if used while not having been set. """
    if self._bos_token is None:
        logger.error("Using bos_token, but it is not set yet.")
    return self._bos_token
@property
def eos_token(self):
    """ End of sentence token (string). Log an error if used while not having been set. """
    if self._eos_token is None:
        logger.error("Using eos_token, but it is not set yet.")
    return self._eos_token
@property
def unk_token(self):
    """ Unknown token (string). Log an error if used while not having been set. """
    if self._unk_token is None:
        logger.error("Using unk_token, but it is not set yet.")
    return self._unk_token
@property
def sep_token(self):
    """ Separation token (string). E.g. separate context and query in an input sequence. Log an error if used while not having been set. """
    if self._sep_token is None:
        logger.error("Using sep_token, but it is not set yet.")
    return self._sep_token
@property
def pad_token(self):
    """ Padding token (string). Log an error if used while not having been set. """
    if self._pad_token is None:
        logger.error("Using pad_token, but it is not set yet.")
    return self._pad_token
@property
def cls_token(self):
    """ Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. """
    if self._cls_token is None:
        logger.error("Using cls_token, but it is not set yet.")
    return self._cls_token
@property
def mask_token(self):
    """ Mask token (string). E.g. when training a model with masked-language modeling. Log an error if used while not having been set. """
    if self._mask_token is None:
        logger.error("Using mask_token, but it is not set yet.")
    return self._mask_token
@property
def additional_special_tokens(self):
    """ All the additional special tokens you may want to use (list of strings). Log an error if used while not having been set. """
    if self._additional_special_tokens is None:
        logger.error("Using additional_special_tokens, but it is not set yet.")
    return self._additional_special_tokens
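A small sketch of how these properties behave in practice (``BertTokenizer`` is assumed; the printed token strings depend on the pretrained vocabulary, and the unset-token case is an assumption about this particular tokenizer)::

    from pytorch_transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.mask_token)  # e.g. [CLS] [SEP] [MASK]

    # A special token that was never set only logs an error and returns None.
    if tokenizer.bos_token is None:
        print("no bos_token configured for this tokenizer")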
...
@@ -143,20 +171,58 @@ class PreTrainedTokenizer(object):
r""" Instantiate a :class:`~pytorch_transformers.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer.
Parameters:
pretrained_model_name_or_path: either:
- a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~pytorch_transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
cache_dir: (`optional`) string:
Path to a directory in which downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.
inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.
kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~pytorch_transformers.PreTrainedTokenizer` for details.
Examples::
# We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer
"""
s3_models = list(cls.max_model_input_sizes.keys())
vocab_files = {}
if pretrained_model_name_or_path in s3_models:
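Per the parameters above, typical calls look like the following sketch (``BertTokenizer`` matches the derived class named in the Examples comment; the directory path is hypothetical)::

    from pytorch_transformers import BertTokenizer

    # Download the vocabulary from S3 (and cache it) using a shortcut name.
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Or re-load from a directory previously populated with save_pretrained(...).
    tokenizer = BertTokenizer.from_pretrained('./my_model_directory/')   # hypothetical directory

    # Keyword arguments are forwarded to __init__, e.g. to override a special token.
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')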
...
@@ -271,8 +337,9 @@ class PreTrainedTokenizer(object):
def save_pretrained(self, save_directory):
""" Save the tokenizer vocabulary files (with added tokens) and the
""" Save the tokenizer vocabulary files (with added tokens) and the
special-tokens-to-class-attributes-mapping to a directory, so that it
special-tokens-to-class-attributes-mapping to a directory.
can be re-loaded using the `from_pretrained(save_directory)` class method.
This method make sure the full tokenizer can then be re-loaded using the :func:`~pytorch_transformers.PreTrainedTokenizer.from_pretrained` class method.
"""
"""
if not os.path.isdir(save_directory):
    logger.error("Saving directory ({}) should be a directory".format(save_directory))
...
@@ -297,38 +364,52 @@ class PreTrainedTokenizer(object):
def save_vocabulary(self, save_directory):
""" Save the tokenizer vocabulary to a directory. This method doesn't save added tokens
""" Save the tokenizer vocabulary to a directory. This method does *NOT* save added tokens
and special token mappings.
and special token mappings.
Please use `save_pretrained()` to save the full Tokenizer state so that it can be
Please use :func:`~pytorch_transformers.PreTrainedTokenizer.save_pretrained` `()` to save the full Tokenizer state if you want to reload it using the :func:`~pytorch_transformers.PreTrainedTokenizer.from_pretrained` class method.
reloaded using the `from_pretrained(save_directory)` class method.
"""
"""
raise NotImplementedError
def vocab_size(self):
    """ Size of the base vocabulary (without the added tokens) """
    raise NotImplementedError
def __len__(self):
    """ Size of the full vocabulary with the added tokens """
""" Add a list of new tokens to the tokenizer class. If the new tokens are not in the
""" Add a list of new tokens to the tokenizer class. If the new tokens are not in the
vocabulary, they are added to the added_tokens_encoder with indices starting from
vocabulary, they are added to it with indices starting from length of the current vocabulary.
the last index of the current vocabulary.
Parameters:
new_tokens: list of string. Each string is a token to add. Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
Returns:
Returns:
Number of tokens added to the vocabulary which can be used to correspondingly
Number of tokens added to the vocabulary.
increase the size of the associated model embedding matrices.
Examples::
# Let's see how to increase the vocabulary of Bert model and tokenizer
model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
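A minimal end-to-end sketch of the flow the Examples block describes (Bert classes assumed; the token strings are made up)::

    from pytorch_transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])   # made-up tokens
    print('We have added', num_added_toks, 'tokens')
    model.resize_token_embeddings(len(tokenizer))  # resize to the full new vocabulary size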
...
@@ -341,24 +422,48 @@ class PreTrainedTokenizer(object):
def add_special_tokens(self, special_tokens_dict):
""" Add a dictionnary of special tokens (eos, pad, cls...) to the encoder and link them
""" Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them
to class attributes. If the special tokens are not in the vocabulary, they are added
to class attributes. If special tokens are NOT in the vocabulary, they are added
to it and indexed starting from the last index of the current vocabulary.
to it (indexed starting from the last index of the current vocabulary).
Parameters:
special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes: [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``].
Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
Returns:
Returns:
Number of tokens added to the vocabulary which can be used to correspondingly
Number of tokens added to the vocabulary.
increase the size of the associated model embedding matrices.
Examples::
# Let's see how to add a new classification token to GPT-2
model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
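A minimal sketch of the GPT-2 flow this Examples block outlines (GPT-2 classes assumed; ``'<CLS>'`` is a made-up token string)::

    from pytorch_transformers import GPT2Model, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2Model.from_pretrained('gpt2')

    num_added_toks = tokenizer.add_special_tokens({'cls_token': '<CLS>'})   # made-up token
    print('We have added', num_added_toks, 'tokens')
    model.resize_token_embeddings(len(tokenizer))  # resize to the full new vocabulary size
    assert tokenizer.cls_token == '<CLS>'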