Commit 00132b7a authored by thomwolf

updating docs - adding few tests to tokenizers

parent 009273db
@@ -40,7 +40,6 @@ The library currently contains PyTorch implementations, pre-trained model weight
:maxdepth: 2
:caption: Package Reference
model_doc/overview
model_doc/bert
model_doc/gpt
model_doc/transformerxl
...
Configuration
----------------------------------------------------
The base class ``PretrainedConfig`` implements the common methods for loading/saving a configuration either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS S3 repository).
``PretrainedConfig``
~~~~~~~~~~~~~~~~~~~~~
...
Models
----------------------------------------------------
The base class ``PreTrainedModel`` implements the common methods for loading/saving a model either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS S3 repository).
``PreTrainedModel`` also implements a few methods which are common among all the models to:
- resize the input token embeddings when new tokens are added to the vocabulary
- prune the attention heads of the model.
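A minimal sketch of these two methods, using ``BertModel`` as a concrete class (the calls follow the descriptions in this documentation):

.. code-block:: python

    from pytorch_transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    # grow the input embedding matrix after extending the vocabulary
    tokenizer.add_tokens(['new_tok1'])
    model.resize_token_embeddings(len(tokenizer))

    # prune heads 0 and 2 of layer 0
    model.prune_heads({0: [0, 2]})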
``PreTrainedModel``
~~~~~~~~~~~~~~~~~~~~~
...
Optimizer
----------------------------------------------------
The ``.optimization`` module provides:
- an optimizer with fixed weight decay that can be used to fine-tune models, and
- several schedules in the form of schedule objects that inherit from ``_LRSchedule``:
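A minimal sketch of combining the two (a sketch only: the ``schedule`` keyword on ``AdamW`` is an assumption here; see the Schedules section for the schedule classes and their constructor arguments):

.. code-block:: python

    import torch
    from pytorch_transformers import AdamW, WarmupLinearSchedule

    model = torch.nn.Linear(10, 2)  # stand-in; normally a pretrained model
    num_training_steps = 1000

    # _LRSchedule subclasses take ``warmup`` and ``t_total`` at construction
    schedule = WarmupLinearSchedule(warmup=0.1, t_total=num_training_steps)
    # Assumption: the schedule object is passed via a ``schedule`` keyword; per the
    # docs, the optimizer's own ``warmup``/``t_total`` arguments are then ignored.
    optimizer = AdamW(model.parameters(), lr=2e-5, schedule=schedule)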
``AdamW``
~~~~~~~~~~~~~~~~
@@ -10,17 +15,41 @@ Optimizer
Schedules
----------------------------------------------------
Learning Rate Schedules
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: pytorch_transformers.ConstantLRSchedule
    :members:
.. autoclass:: pytorch_transformers.WarmupConstantSchedule
    :members:
.. image:: /imgs/warmup_constant_schedule.png
:target: /imgs/warmup_constant_schedule.png
:alt:
.. autoclass:: pytorch_transformers.WarmupCosineSchedule
    :members:
.. image:: /imgs/warmup_cosine_schedule.png
:target: /imgs/warmup_cosine_schedule.png
:alt:
.. autoclass:: pytorch_transformers.WarmupCosineWithHardRestartsSchedule
    :members:
.. image:: /imgs/warmup_cosine_hard_restarts_schedule.png
:target: /imgs/warmup_cosine_hard_restarts_schedule.png
:alt:
.. autoclass:: pytorch_transformers.WarmupLinearSchedule
    :members:
.. image:: /imgs/warmup_linear_schedule.png
:target: /imgs/warmup_linear_schedule.png
:alt:
Tokenizer
----------------------------------------------------
The base class ``PreTrainedTokenizer`` implements the common methods for loading/saving a tokenizer either from a local file or directory, or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).
``PreTrainedTokenizer`` is the main entry point into tokenizers, as it also implements the main methods for using all the tokenizers (a short sketch follows this list):
- tokenizing, converting tokens to ids and back, and encoding/decoding,
- adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...),
- managing special tokens (adding them, assigning them to roles, making sure they are not split during tokenization)
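A minimal sketch of these methods, using ``BertTokenizer`` as a concrete derived class:

.. code-block:: python

    from pytorch_transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # tokenizing, and converting tokens to ids and back
    tokens = tokenizer.tokenize("Hello world!")
    ids = tokenizer.convert_tokens_to_ids(tokens)

    # encoding/decoding (text -> ids -> text)
    text = tokenizer.decode(tokenizer.encode("Hello world!"))

    # adding new tokens, independent of the underlying BPE/WordPiece structure
    tokenizer.add_tokens(['new_tok1'])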
``PreTrainedTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~
...
Overview
================================================
Here is the detailed documentation of the classes in the package and how to use them:
.. list-table::
:header-rows: 1
* - Sub-section
- Description
* - `Loading pre-trained weights <#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump>`__
- How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance
* - `Serialization best-practices <#serialization-best-practices>`__
- How to save and reload a fine-tuned model
* - `Configurations <#configurations>`__
- API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL
TODO (filled by Lysandre): Removed Models/Tokenizers/Optimizers as no single link can be made.
Configurations
^^^^^^^^^^^^^^
Models (BERT, GPT, GPT-2 and Transformer-XL) are defined and built from configuration classes, which contain the
parameters of the models (number of layers, dimensionalities...) and a few utilities to read and write JSON
configuration files. The respective configuration classes are:
* ``BertConfig`` for ``BertModel`` and BERT class instances.
* ``OpenAIGPTConfig`` for ``OpenAIGPTModel`` and OpenAI GPT class instances.
* ``GPT2Config`` for ``GPT2Model`` and OpenAI GPT-2 class instances.
* ``TransfoXLConfig`` for ``TransfoXLModel`` and Transformer-XL class instances.
These configuration classes contain a few utilities to load and save configurations:
* ``from_dict(cls, json_object)``\ : A class method to construct a configuration from a Python dictionary of parameters. Returns an instance of the configuration class.
* ``from_json_file(cls, json_file)``\ : A class method to construct a configuration from a JSON file of parameters. Returns an instance of the configuration class.
* ``to_dict()``\ : Serializes an instance to a Python dictionary. Returns a dictionary.
* ``to_json_string()``\ : Serializes an instance to a JSON string. Returns a string.
* ``to_json_file(json_file_path)``\ : Saves an instance to a JSON file.
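The round trip can be sketched as follows (a minimal illustration using ``BertConfig``; it assumes a configuration JSON already saved at ``./models/config.json``):

.. code-block:: python

    from pytorch_pretrained_bert import BertConfig

    config = BertConfig.from_json_file('./models/config.json')  # read from a JSON file
    config_dict = config.to_dict()                              # -> Python dictionary
    config = BertConfig.from_dict(config_dict)                  # -> configuration instance
    print(config.to_json_string())                              # -> JSON string
    config.to_json_file('./models/config_copy.json')            # write to a JSON file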
Loading Google AI or OpenAI pre-trained weights or PyTorch dump
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``from_pretrained()`` method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of ``BertForPreTraining`` saved with ``torch.save()``\ ), the PyTorch model classes and the tokenizer can be instantiated using the ``from_pretrained()`` method:
.. code-block:: python
model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *inputs, **kwargs)
where
* ``BERT_CLASS`` is either a tokenizer to load the vocabulary (\ ``BertTokenizer`` or ``OpenAIGPTTokenizer`` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): ``BertModel``\ , ``BertForMaskedLM``\ , ``BertForNextSentencePrediction``\ , ``BertForPreTraining``\ , ``BertForSequenceClassification``\ , ``BertForTokenClassification``\ , ``BertForMultipleChoice``\ , ``BertForQuestionAnswering``\ , ``OpenAIGPTModel``\ , ``OpenAIGPTLMHeadModel`` or ``OpenAIGPTDoubleHeadsModel``\ , and
*
``PRE_TRAINED_MODEL_NAME_OR_PATH`` is either:
*
the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
* ``bert-base-uncased``: 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-large-uncased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
* ``bert-base-cased``: 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-large-cased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
* ``bert-base-multilingual-uncased``: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-multilingual-cased``: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`__
* ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
* ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
* ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
* ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
* ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
* ``transfo-xl-wt103``: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
*
a path or url to a pretrained model archive containing:
* ``bert_config.json`` or ``openai_gpt_config.json``\ : a configuration file for the model, and
* ``pytorch_model.bin``\ : a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ )
If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py>`__\ ) and stored in a cache folder to avoid future downloads (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).
*
``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).
* ``from_tf``\ : should we load the weights from a locally saved TensorFlow checkpoint
* ``state_dict``\ : an optional state dictionary (a ``collections.OrderedDict`` object) to use instead of Google's pre-trained weights
* ``*inputs``\ , ``**kwargs``: additional inputs for the specific Bert class (e.g. ``num_labels`` for ``BertForSequenceClassification``)
``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.
When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to ``FullTokenizer`` if you're using your own script and loading the tokenizer yourself).
Examples:
.. code-block:: python
# BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# OpenAI GPT
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTModel.from_pretrained('openai-gpt')
# Transformer-XL
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
# OpenAI GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
Cache directory
~~~~~~~~~~~~~~~
``pytorch_pretrained_bert`` saves the pretrained weights in a cache directory which is located at (in this order of priority):
* the ``cache_dir`` optional argument to the ``from_pretrained()`` method (see above),
* the shell environment variable ``PYTORCH_PRETRAINED_BERT_CACHE``\ ,
* the PyTorch cache home + ``/pytorch_pretrained_bert/``
where the PyTorch cache home is defined by (in this order):
* the shell environment variable ``ENV_TORCH_HOME``
* the shell environment variable ``ENV_XDG_CACHE_HOME`` + ``/torch/``
* the default: ``~/.cache/torch/``
Usually, if you don't set any specific environment variable, the ``pytorch_pretrained_bert`` cache will be at ``~/.cache/torch/pytorch_pretrained_bert/``.
You can always safely delete the ``pytorch_pretrained_bert`` cache, but the pretrained model weights and vocabulary files will have to be re-downloaded from our S3.
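If you need to relocate the cache, here is a sketch of the two options described above (the environment variable must be set before the library is imported, since it is read at import time):

.. code-block:: python

    import os
    # Option 1: environment variable, read by pytorch_pretrained_bert at import time
    os.environ['PYTORCH_PRETRAINED_BERT_CACHE'] = '/data/bert_cache'

    from pytorch_pretrained_bert import BertModel
    # Option 2: explicit cache_dir per call (e.g. one directory per rank in distributed training)
    model = BertModel.from_pretrained('bert-base-uncased', cache_dir='/data/bert_cache')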
Serialization best-practices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This section explains how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
There are three types of files you need to save to be able to reload a fine-tuned model:
* the model itself, which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
* the configuration file of the model which is saved as a JSON file, and
* the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
The *default filenames* of these files are as follows:
* the model weights file: ``pytorch_model.bin``\ ,
* the configuration file: ``config.json``\ ,
* the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary),
* for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``.
**If you save a model using these *default filenames*\ , you can then re-load the model and tokenizer using the ``from_pretrained()`` method.**
Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards:
.. code-block:: python
import os

import torch

from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME

output_dir = "./models/"
# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model
# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_dir)
# Step 2: Re-load the saved model and vocabulary
# Example for a Bert model
model = BertForQuestionAnswering.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case) # Add specific options if needed
# Example for a GPT model
model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
Here is another way you can save and reload the model if you want to use specific paths for each type of file:
.. code-block:: python
import torch

from pytorch_pretrained_bert import (BertConfig, BertForQuestionAnswering, BertTokenizer,
                                     OpenAIGPTConfig, OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer)

output_model_file = "./models/my_own_model_file.bin"
output_config_file = "./models/my_own_config_file.bin"
output_vocab_file = "./models/my_own_vocab_file.bin"
# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_vocab_file)
# Step 2: Re-load the saved model and vocabulary
# We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, so we cannot load using `from_pretrained`.
# Here is how to do it in this situation:
# Example for a Bert model
config = BertConfig.from_json_file(output_config_file)
model = BertForQuestionAnswering(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)
# Example for a GPT model
config = OpenAIGPTConfig.from_json_file(output_config_file)
model = OpenAIGPTDoubleHeadsModel(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = OpenAIGPTTokenizer(output_vocab_file)
Learning Rate Schedules
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``.optimization`` module also provides additional schedules in the form of schedule objects that inherit from ``_LRSchedule``.
All ``_LRSchedule`` subclasses accept ``warmup`` and ``t_total`` arguments at construction.
When an ``_LRSchedule`` object is passed into ``AdamW``\ ,
the ``warmup`` and ``t_total`` arguments on the optimizer are ignored and the ones in the ``_LRSchedule`` object are used.
An overview of the implemented schedules:
* ``ConstantLR``\ : always returns a learning rate of 1.
* ``WarmupConstantSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
Keeps learning rate equal to 1. after warmup.
.. image:: /imgs/warmup_constant_schedule.png
:target: /imgs/warmup_constant_schedule.png
:alt:
* ``WarmupLinearSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
Linearly decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps.
.. image:: /imgs/warmup_linear_schedule.png
:target: /imgs/warmup_linear_schedule.png
:alt:
* ``WarmupCosineSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
Decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps following a cosine curve. \
If ``cycles`` (default=0.5) is different from the default, the learning rate follows a cosine function after warmup.
.. image:: /imgs/warmup_cosine_schedule.png
:target: /imgs/warmup_cosine_schedule.png
:alt:
* ``WarmupCosineWithHardRestartsSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
If ``cycles`` (default=1.) is different from the default, the learning rate follows ``cycles`` cosine decays (with hard restarts) over the remaining steps.
.. image:: /imgs/warmup_cosine_hard_restarts_schedule.png
:target: /imgs/warmup_cosine_hard_restarts_schedule.png
:alt:
* ``WarmupCosineWithWarmupRestartsSchedule`` : All training progress is divided into ``cycles`` (default=1.) parts of equal length.
Every part follows a schedule with the first ``warmup`` fraction of its training steps linearly increasing from 0. to 1.,
followed by a learning rate decreasing from 1. to 0. following a cosine curve.
Note that the total number of warmup steps over all cycles together is equal to ``warmup`` * ``cycles``.
.. image:: /imgs/warmup_cosine_warm_restarts_schedule.png
:target: /imgs/warmup_cosine_warm_restarts_schedule.png
:alt:
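For intuition, the multiplier computed by ``WarmupLinearSchedule`` can be written out directly (a standalone sketch of the formula described above, not library code):

.. code-block:: python

    def warmup_linear_multiplier(progress, warmup=0.1):
        """Learning-rate multiplier at training progress `progress` (a float in [0, 1])."""
        if progress < warmup:
            return progress / max(warmup, 1e-8)  # linear warmup from 0. to 1.
        return max((1.0 - progress) / (1.0 - warmup), 0.0)  # linear decay from 1. to 0.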
@@ -8,13 +8,13 @@ The library was designed with two strong goals in mind:
- be as easy and fast to use as possible:

  - we strongly limited the number of user-facing abstractions to learn; in fact, there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
  - all of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` instantiation method, which takes care of downloading (if needed), caching and loading the related class from a pretrained instance supplied in the library or your own saved instance.
  - as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend/build upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.

- provide state-of-the-art models with performances as close as possible to the original models:

  - we provide at least one example for each architecture which reproduces a result provided by the official authors of said architecture,
  - the code is usually as close to the original code base as possible, which means some PyTorch code may be not as *pytorchic* as it could be as a result of being converted from TensorFlow code.

A few other goals:
@@ -34,15 +34,18 @@ A few other goals:
The library is built around three types of classes for each model:

- **model classes** which are PyTorch models (`torch.nn.Module`) of the 6 model architectures currently provided in the library, e.g. `BertModel`
- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`. You don't always need to instantiate these yourself; in particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings into lists of token indices to be fed to a model, e.g. `BertTokenizer`

All these classes can be instantiated from pretrained instances and saved locally using two methods:

- `from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/pytorch-transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
- `save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`. A short round-trip sketch follows this list.
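A minimal sketch of that round trip, using BERT and a hypothetical local directory `./my_saved_model/`:

```python
from pytorch_transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model.save_pretrained('./my_saved_model/')      # writes weights + config.json
tokenizer.save_pretrained('./my_saved_model/')  # writes vocabulary files

model = BertModel.from_pretrained('./my_saved_model/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model/')
```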
We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized in two parts:

- the **MAIN CLASSES** section details the common functionalities/methods/attributes of the three main types of classes (configuration, model, tokenizer) plus some optimization-related classes provided as utilities for training,
- the **PACKAGE REFERENCE** section details all the variants of each class for each model architecture and, in particular, the inputs/outputs you should expect when calling each of them.
## Quick tour: Usage
...
@@ -54,20 +54,22 @@ else:
class PretrainedConfig(object):
    r""" Base class for all configuration classes.
        Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.

        Class attributes (overridden by derived classes):

            - ``pretrained_config_archive_map``: a python ``dict`` with `short-cut-names` (string) as keys and `url` (string) of associated pretrained model configurations as values.

        Parameters:
            ``finetuning_task``: string, default ``None``. Name of the task used to fine-tune the model. This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.
            ``num_labels``: integer, default ``2``. Number of classes to use when the model is a classification model (sequences/tokens).
            ``output_attentions``: boolean, default ``False``. Should the model return attention weights.
            ``output_hidden_states``: boolean, default ``False``. Should the model return all hidden states.
            ``torchscript``: boolean, default ``False``. Is the model used with TorchScript.
    """
    pretrained_config_archive_map = {}

    def __init__(self, **kwargs):
r""" The initialization of :class:`~pytorch_transformers.PretrainedConfig` extracts
a few configuration attributes from `**kwargs` which are common to all models:
- `finetuning_task`: string, default `None`. Name of the task used to fine-tune the model (used to convert from original checkpoint)
- `num_labels`: integer, default `2`. Number of classes to use when the model is a classification model (sequences/tokens)
- `output_attentions`: boolean, default `False`. Should the model returns attentions weights.
- `output_hidden_states`: string, default `False`. Should the model returns all hidden-states.
- `torchscript`: string, default `False`. Is the model used with Torchscript.
"""
        self.finetuning_task = kwargs.pop('finetuning_task', None)
        self.num_labels = kwargs.pop('num_labels', 2)
        self.output_attentions = kwargs.pop('output_attentions', False)
@@ -76,7 +78,7 @@ class PretrainedConfig(object):
    def save_pretrained(self, save_directory):
        """ Save a configuration object to the directory `save_directory`, so that it
            can be re-loaded using the :func:`~pytorch_transformers.PretrainedConfig.from_pretrained` class method.
        """
        assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved"
@@ -87,31 +89,28 @@ class PretrainedConfig(object):
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
r""" Instantiate a PretrainedConfig from a pre-trained model configuration. r""" Instantiate a :class:`~pytorch_transformers.PretrainedConfig` (or a derived class) from a pre-trained model configuration.
Parameters: Parameters:
**pretrained_model_name_or_path**: either: pretrained_model_name_or_path: either:
- a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``. - a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``.
- a path to a `directory` containing a configuration file saved using the `save_pretrained(save_directory)` method, e.g.: ``./my_model_directory/``. - a path to a `directory` containing a configuration file saved using the :func:`~pytorch_transformers.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.
- a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``. - a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``.
**cache_dir**: (`optional`) string: cache_dir: (`optional`) string:
Path to a directory in which a downloaded pre-trained model Path to a directory in which a downloaded pre-trained model
configuration should be cached if the standard cache should not be used. configuration should be cached if the standard cache should not be used.
**return_unused_kwargs**: (`optional`) bool: kwargs: (`optional`) dict: key/value pairs with which to update the configuration object after loading.
- If False, then this function returns just the final configuration object. - The values in kwargs of any keys which are configuration attributes will be used to override the loaded values.
- If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored. - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter.
**kwargs**: (`optional`) dict: return_unused_kwargs: (`optional`) bool:
Dictionary of key/value pairs with which to update the configuration object after loading.
- The values in kwargs of any keys which are configuration attributes will be used - If False, then this function returns just the final configuration object.
to override the loaded values. - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored.
- Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
by the `return_unused_kwargs` keyword parameter.
Examples:: Examples::
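            # Hypothetical illustrations of the parameters described above
            # (BertConfig as a concrete derived class):
            config = BertConfig.from_pretrained('bert-base-uncased')      # from a shortcut name
            config = BertConfig.from_pretrained('./my_model_directory/')  # from a save directory
            config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased',
                                                               output_attentions=True,
                                                               foo=False,
                                                               return_unused_kwargs=True)
            assert config.output_attentions == True
            assert unused_kwargs == {'foo': False}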
@@ -215,14 +214,26 @@ class PretrainedConfig(object):
class PreTrainedModel(nn.Module):
""" Base class for all models. Handle loading/storing model config and r""" Base class for all models.
a simple interface for dowloading and loading pretrained models.
:class:`~pytorch_transformers.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models
as well as a few methods commons to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.
Class attributes (overridden by derived classes):
- ``config_class``: a class derived from :class:`~pytorch_transformers.PretrainedConfig` to use as configuration class for this model architecture.
- ``pretrained_model_archive_map``: a python ``dict`` of with `short-cut-names` (string) as keys and `url` (string) of associated pretrained weights as values.
- ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:
- ``model``: an instance of the relevant subclass of :class:`~pytorch_transformers.PreTrainedModel`,
- ``config``: an instance of the relevant subclass of :class:`~pytorch_transformers.PretrainedConfig`,
- ``path``: a path (string) to the TensorFlow checkpoint.
- ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.
""" """
    config_class = None
    pretrained_model_archive_map = {}
    load_tf_weights = lambda model, config, path: None
    base_model_prefix = ""
    input_embeddings = None

    def __init__(self, config, *inputs, **kwargs):
        super(PreTrainedModel, self).__init__()
@@ -280,17 +291,16 @@ class PreTrainedModel(nn.Module):
    def resize_token_embeddings(self, new_num_tokens=None):
        """ Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
            Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.

            Arguments:

                new_num_tokens: (`optional`) int:
                    New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end.
                    If not provided or None: does nothing and just returns a pointer to the input tokens ``torch.nn.Embedding`` Module of the model.

            Return: ``torch.nn.Embedding``
                Pointer to the input tokens Embedding Module of the model
        """
        base_model = getattr(self, self.base_model_prefix, self)  # get the base model if needed
        model_embeds = base_model._resize_token_embeddings(new_num_tokens)
@@ -309,15 +319,17 @@ class PreTrainedModel(nn.Module):
    def prune_heads(self, heads_to_prune):
        """ Prunes heads of the base model.

            Arguments:

                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).
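            Examples::

                # A hypothetical illustration (not from the original docstring):
                # prune heads 0 and 2 of layer 0, and head 1 of layer 2
                model.prune_heads({0: [0, 2], 2: [1]})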
""" """
base_model = getattr(self, self.base_model_prefix, self) # get the base model if needed base_model = getattr(self, self.base_model_prefix, self) # get the base model if needed
base_model._prune_heads(heads_to_prune) base_model._prune_heads(heads_to_prune)
    def save_pretrained(self, save_directory):
        """ Save a model and its configuration file to a directory, so that it
            can be re-loaded using the :func:`~pytorch_transformers.PreTrainedModel.from_pretrained` class method.
        """
        assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved"
@@ -336,50 +348,45 @@ class PreTrainedModel(nn.Module):
    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        r"""Instantiate a pretrained pytorch model from a pre-trained model configuration.

        The model is set in evaluation mode by default using ``model.eval()`` (Dropout modules are deactivated)
        To train the model, you should first set it back in training mode with ``model.train()``
        Parameters:
            pretrained_model_name_or_path: either:

                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
                - a path to a `directory` containing model weights saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.
                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.

            model_args: (`optional`) Sequence of positional arguments:
                All remaining positional arguments will be passed to the underlying model's ``__init__`` method

            config: (`optional`) instance of a class derived from :class:`~pytorch_transformers.PretrainedConfig`:
                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:

                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or
                - the model was saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.
                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.

            state_dict: (`optional`) dict:
                an optional state dictionary for the model to use instead of a state dictionary loaded from a saved weights file.
                This option can be used if you want to create a model from a pretrained configuration but load your own weights.
                In this case though, you should check if using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and :func:`~pytorch_transformers.PreTrainedModel.from_pretrained` is not a simpler option.

            cache_dir: (`optional`) string:
                Path to a directory in which a downloaded pre-trained model
                configuration should be cached if the standard cache should not be used.

            output_loading_info: (`optional`) boolean:
                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.

            kwargs: (`optional`) Remaining dictionary of keyword arguments:
                Can be used to update the configuration object (after it has been loaded) and initiate the model (e.g. ``output_attention=True``). Behaves differently depending on whether a ``config`` is provided or automatically loaded:

                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)
                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~pytorch_transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.

        Examples::
            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
...
@@ -30,14 +30,34 @@ SPECIAL_TOKENS_MAP_FILE = 'special_tokens_map.json'
ADDED_TOKENS_FILE = 'added_tokens.json'
class PreTrainedTokenizer(object):
    """ Base class for all tokenizers.
        Handles all the shared methods for tokenization and special tokens, as well as methods for downloading/caching/loading pretrained tokenizers and for adding tokens to the vocabulary.

        This class also contains the added tokens in a unified way on top of all tokenizers, so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece...).

        Class attributes (overridden by derived classes):
- ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file required by the model, and as associated values, the filename for saving the associated file (string).
- ``pretrained_vocab_files_map``: a python ``dict`` of ``dict``, with the high-level keys being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level keys being the `short-cut-names` (string) of the pretrained models, and the associated values being the `url` (string) of the associated pretrained vocabulary file.
- ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained models, and as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size.
Parameters:
- ``bos_token``: (`Optional`) string: a beginning of sentence token. Will be associated to ``self.bos_token``
- ``eos_token``: (`Optional`) string: an end of sentence token. Will be associated to ``self.eos_token``
- ``unk_token``: (`Optional`) string: an unknown token. Will be associated to ``self.unk_token``
- ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence). Will be associated to ``self.sep_token``
- ``pad_token``: (`Optional`) string: a padding token. Will be associated to ``self.pad_token``
- ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model). Will be associated to ``self.cls_token``
- ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language modeling). Will be associated to ``self.mask_token``
- ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens. Adding all special tokens here ensures they won't be split by the tokenization process. Will be associated to ``self.additional_special_tokens``
""" """
vocab_files_names = {} vocab_files_names = {}
pretrained_vocab_files_map = {} pretrained_vocab_files_map = {}
...@@ -49,82 +69,98 @@ class PreTrainedTokenizer(object): ...@@ -49,82 +69,98 @@ class PreTrainedTokenizer(object):
    @property
    def bos_token(self):
        """ Beginning of sentence token (string). Log an error if used while not having been set. """
        if self._bos_token is None:
            logger.error("Using bos_token, but it is not set yet.")
        return self._bos_token

    @property
    def eos_token(self):
        """ End of sentence token (string). Log an error if used while not having been set. """
        if self._eos_token is None:
            logger.error("Using eos_token, but it is not set yet.")
        return self._eos_token

    @property
    def unk_token(self):
        """ Unknown token (string). Log an error if used while not having been set. """
        if self._unk_token is None:
            logger.error("Using unk_token, but it is not set yet.")
        return self._unk_token

    @property
    def sep_token(self):
        """ Separation token (string). E.g. to separate context and query in an input sequence. Log an error if used while not having been set. """
        if self._sep_token is None:
            logger.error("Using sep_token, but it is not set yet.")
        return self._sep_token

    @property
    def pad_token(self):
        """ Padding token (string). Log an error if used while not having been set. """
        if self._pad_token is None:
            logger.error("Using pad_token, but it is not set yet.")
        return self._pad_token

    @property
    def cls_token(self):
        """ Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. """
        if self._cls_token is None:
            logger.error("Using cls_token, but it is not set yet.")
        return self._cls_token

    @property
    def mask_token(self):
        """ Mask token (string). E.g. when training a model with masked-language modeling. Log an error if used while not having been set. """
        if self._mask_token is None:
            logger.error("Using mask_token, but it is not set yet.")
        return self._mask_token

    @property
    def additional_special_tokens(self):
        """ All the additional special tokens you may want to use (list of strings). Log an error if used while not having been set. """
        if self._additional_special_tokens is None:
            logger.error("Using additional_special_tokens, but it is not set yet.")
        return self._additional_special_tokens

    @bos_token.setter
    def bos_token(self, value):
        self.add_tokens([value])
        self._bos_token = value

    @eos_token.setter
    def eos_token(self, value):
        self.add_tokens([value])
        self._eos_token = value

    @unk_token.setter
    def unk_token(self, value):
        self.add_tokens([value])
        self._unk_token = value

    @sep_token.setter
    def sep_token(self, value):
        self.add_tokens([value])
        self._sep_token = value

    @pad_token.setter
    def pad_token(self, value):
        self.add_tokens([value])
        self._pad_token = value

    @cls_token.setter
    def cls_token(self, value):
        self.add_tokens([value])
        self._cls_token = value

    @mask_token.setter
    def mask_token(self, value):
        self.add_tokens([value])
        self._mask_token = value

    @additional_special_tokens.setter
    def additional_special_tokens(self, value):
        self.add_tokens(value)
        self._additional_special_tokens = value
    def __init__(self, max_len=None, **kwargs):
@@ -148,15 +184,47 @@ class PreTrainedTokenizer(object):
    @classmethod
    def from_pretrained(cls, *inputs, **kwargs):
r""" Instantiate a :class:`~pytorch_transformers.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer.
Parameters:
pretrained_model_name_or_path: either:
- a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~pytorch_transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
cache_dir: (`optional`) string:
Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.
inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.
kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~pytorch_transformers.PreTrainedTokenizer` for details.
Examples::
# We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer
# Download vocabulary from S3 and cache.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)
tokenizer = BertTokenizer.from_pretrained('./test/saved_model/')
# If the tokenizer uses a single vocabulary file, you can point directly to this file
tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt')
# You can link tokens to special vocabulary when instantiating
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')
# You should be sure '<unk>' is in the vocabulary when doing that.
# Otherwise, use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead
assert tokenizer.unk_token == '<unk>'
"""
        return cls._from_pretrained(*inputs, **kwargs)

    @classmethod
    def _from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
"""
Instantiate a PreTrainedTokenizer from pre-trained vocabulary files.
Download and cache the vocabulary files if needed.
"""
        cache_dir = kwargs.pop('cache_dir', None)
        s3_models = list(cls.max_model_input_sizes.keys())
@@ -253,8 +321,9 @@ class PreTrainedTokenizer(object):
    def save_pretrained(self, save_directory):
""" Save the tokenizer vocabulary files (with added tokens) and the """ Save the tokenizer vocabulary files (with added tokens) and the
special-tokens-to-class-attributes-mapping to a directory, so that it special-tokens-to-class-attributes-mapping to a directory.
can be re-loaded using the `from_pretrained(save_directory)` class method.
This method make sure the full tokenizer can then be re-loaded using the :func:`~pytorch_transformers.PreTrainedTokenizer.from_pretrained` class method.
""" """
if not os.path.isdir(save_directory): if not os.path.isdir(save_directory):
logger.error("Saving directory ({}) should be a directory".format(save_directory)) logger.error("Saving directory ({}) should be a directory".format(save_directory))
@@ -279,37 +348,50 @@ class PreTrainedTokenizer(object):
def save_vocabulary(self, save_directory):
""" Save the tokenizer vocabulary to a directory. This method does *NOT* save added tokens
and special token mappings.
Please use :func:`~pytorch_transformers.PreTrainedTokenizer.save_pretrained` to save the full tokenizer state if you want to reload it using the :func:`~pytorch_transformers.PreTrainedTokenizer.from_pretrained` class method.
"""
raise NotImplementedError
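As a concrete illustration, a derived class with a plain token-to-id ``vocab`` dict might implement it as follows (``MyWordTokenizer`` and ``self.vocab`` are hypothetical names, not part of the library)::

    import io
    import os

    class MyWordTokenizer(PreTrainedTokenizer):
        def save_vocabulary(self, save_directory):
            # Write one token per line, ordered by index, so that line number == token id
            vocab_file = os.path.join(save_directory, 'vocab.txt')
            with io.open(vocab_file, 'w', encoding='utf-8') as writer:
                for token, index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                    writer.write(token + u'\n')
            return (vocab_file,)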
def vocab_size(self):
""" Size of the base vocabulary (without the added tokens) """
raise NotImplementedError
def __len__(self):
""" Size of the full vocabulary with the added tokens """
return self.vocab_size + len(self.added_tokens_encoder)
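For example (assuming ``vocab_size`` is exposed as a property on the derived class, as in the concrete tokenizers)::

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    base_size = tokenizer.vocab_size        # base vocabulary only
    tokenizer.add_tokens(['new_tok1'])
    assert len(tokenizer) == base_size + 1  # __len__ also counts added tokens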
def add_tokens(self, new_tokens):
""" Add a list of new tokens to the tokenizer class. If the new tokens are not in the
vocabulary, they are added to it with indices starting from the length of the current vocabulary.
Parameters:
new_tokens: list of string. Each string is a token to add. Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assigns the index of the ``unk_token`` to them).
Returns:
Number of tokens added to the vocabulary.
Examples::
# Let's see how to increase the vocabulary of the Bert model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer)) # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
""" """
if not new_tokens: if not new_tokens:
return 0 return 0
to_add_tokens = [] to_add_tokens = []
for token in new_tokens: for token in new_tokens:
assert isinstance(token, str) or (six.PY2 and isinstance(token, unicode))
if token != self.unk_token and \ if token != self.unk_token and \
self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token): self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token):
to_add_tokens.append(token) to_add_tokens.append(token)
@@ -325,23 +407,23 @@ class PreTrainedTokenizer(object):
def add_special_tokens(self, special_tokens_dict):
""" Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them
to class attributes. If special tokens are NOT in the vocabulary, they are added
to it (indexed starting from the last index of the current vocabulary).
Parameters:
special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes: [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``].
Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assigns the index of the ``unk_token`` to them).
Returns:
Number of tokens added to the vocabulary which can be used to correspondingly
increase the size of the associated model embedding matrices.
""" """
if not special_tokens_dict: if not special_tokens_dict:
return 0 return 0
added_special_tokens = self.add_tokens(special_tokens_dict.values())
for key, value in special_tokens_dict.items():
assert key in self.SPECIAL_TOKENS_ATTRIBUTES
logger.info("Assigning %s to the %s key of the tokenizer", value, key)
setattr(self, key, value)
return added_special_tokens
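A minimal usage sketch (the token string ``'<CLS>'`` is an arbitrary example)::

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    num_added = tokenizer.add_special_tokens({'cls_token': '<CLS>'})
    model.resize_token_embeddings(len(tokenizer))  # embeddings must cover the enlarged vocabulary
    assert tokenizer.cls_token == '<CLS>'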
def tokenize(self, text, **kwargs):
""" Converts a string into a sequence of tokens (string), using the tokenizer.
@@ -369,13 +451,13 @@ class PreTrainedTokenizer(object):
Split into words for word-based vocabularies or into sub-words for sub-word-based
vocabularies (BPE/SentencePiece/WordPiece).
Does NOT take care of added tokens.
"""
raise NotImplementedError
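For instance, with a WordPiece vocabulary (the exact split depends on the pretrained vocabulary)::

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    tokens = tokenizer.tokenize("Hello, tokenization!")
    # Indicative output: ['hello', ',', 'token', '##ization', '!']
    # ('##' marks a sub-word continuation in WordPiece)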
def convert_tokens_to_ids(self, tokens):
""" Converts a single token (str/unicode), or a sequence of tokens, into a single integer id
(resp. a sequence of ids), using the vocabulary.
"""
if isinstance(tokens, str) or (six.PY2 and isinstance(tokens, unicode)):
return self._convert_token_to_id_with_added_voc(tokens)
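For example (the returned ids depend on the pretrained vocabulary)::

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world"))  # a list of ints
    single_id = tokenizer.convert_tokens_to_ids('hello')                      # a lone token gives one int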
@@ -400,7 +482,8 @@ class PreTrainedTokenizer(object):
def encode(self, text):
""" Converts a string into a sequence of ids (integer), using the tokenizer and vocabulary.
Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``.
"""
return self.convert_tokens_to_ids(self.tokenize(text))
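The equivalence stated in the docstring can be checked directly::

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    text = "hello world"
    assert tokenizer.encode(text) == tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))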
@@ -440,6 +523,8 @@ class PreTrainedTokenizer(object):
def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):
""" Converts a sequence of ids (integer) into a string, using the tokenizer and vocabulary
with options to remove special tokens and clean up tokenization spaces.
Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
""" """
filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens) filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
text = self.convert_tokens_to_string(filtered_tokens) text = self.convert_tokens_to_string(filtered_tokens)
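A round-trip sketch (with an uncased model the text comes back lower-cased)::

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    ids = tokenizer.encode("Hello, world!")
    text = tokenizer.decode(ids, clean_up_tokenization_spaces=True)
    # Indicative result: "hello, world!"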
@@ -482,6 +567,8 @@ class PreTrainedTokenizer(object):
@staticmethod
def clean_up_tokenization(out_string):
""" Clean up a list of simple English tokenization artifacts like spaces before punctuations and abreviated forms.
"""
out_string = out_string.replace(' .', '.').replace(' ?', '?').replace(' !', '!').replace(' ,', ','
).replace(" ' ", "'").replace(" n't", "n't").replace(" 'm", "'m").replace(" do not", " don't"
).replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")
......