"vscode:/vscode.git/clone" did not exist on "7764669c544c2d882704c3ef2d13c4e4284789f8"
Commit 009273db authored by thomwolf

big doc update [WIP]

parent bfbe52ec
...@@ -119,6 +119,7 @@ tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
```
## Quick tour of the fine-tuning/usage scripts
The library comprises several example scripts with SOTA performances for NLU and NLG tasks:
- `run_glue.py`: an example of fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
...
Converting Tensorflow Checkpoints
================================================
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints into models that can be loaded using the ``from_pretrained`` methods of the library.
BERT
^^^^
...@@ -41,6 +41,20 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
$PYTORCH_DUMP_OUTPUT \
[OPENAI_GPT_CONFIG]
OpenAI GPT-2
^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ ):
.. code-block:: shell
export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights
pytorch_transformers gpt2 \
$OPENAI_GPT2_CHECKPOINT_PATH \
$PYTORCH_DUMP_OUTPUT \
[OPENAI_GPT2_CONFIG]
Transformer-XL
^^^^^^^^^^^^^^
...@@ -55,19 +69,6 @@ Here is an example of the conversion process for a pre-trained Transformer-XL mo
$PYTORCH_DUMP_OUTPUT \
[TRANSFO_XL_CONFIG]
GPT-2
^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model.
.. code-block:: shell
export GPT2_DIR=/path/to/gpt2/checkpoint
pytorch_transformers gpt2 \
$GPT2_DIR/model.ckpt \
$PYTORCH_DUMP_OUTPUT \
[GPT2_CONFIG]
XLNet
^^^^^
...@@ -84,3 +85,17 @@ Here is an example of the conversion process for a pre-trained XLNet model, fine
$TRANSFO_XL_CONFIG_PATH \
$PYTORCH_DUMP_OUTPUT \
STS-B \
XLM
^^^
Here is an example of the conversion process for a pre-trained XLM model:
.. code-block:: shell
export XLM_CHECKPOINT_PATH=/path/to/xlm/checkpoint
pytorch_transformers xlm \
$XLM_CHECKPOINT_PATH \
$PYTORCH_DUMP_OUTPUT \
...@@ -21,11 +21,20 @@ The library currently contains PyTorch implementations, pre-trained model weight
pretrained_models
examples
notebooks
serialization
converting_tensorflow_models
migration
bertology
torchscript
.. toctree::
:maxdepth: 2
:caption: Main classes
main_classes/configuration
main_classes/model
main_classes/tokenizer
main_classes/optimizer_schedules
.. toctree::
:maxdepth: 2
...
Installation
================================================
PyTorch-Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 1.1.0.
With pip
^^^^^^^^
PyTorch-Transformers can be installed using pip as follows:
.. code-block:: bash
...@@ -15,7 +15,7 @@ PyTorch pretrained bert can be installed with pip as follows:
From source
^^^^^^^^^^^
To install from source, clone the repository and install with:
.. code-block:: bash
...@@ -27,11 +27,11 @@ Clone the repository and instal locally:
Tests
^^^^^
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the `tests folder <https://github.com/huggingface/pytorch-transformers/tree/master/pytorch_transformers/tests>`_ and example tests in the `examples folder <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`_.
Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
Run all the tests from the root of the cloned repository with the commands:
.. code-block:: bash
...@@ -42,11 +42,11 @@ You can run the tests from the root of the cloned repository with the commands:
OpenAI GPT original tokenization workflow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (use version 4.4.3 if you are using Python 2) and ``SpaCy``:
.. code-block:: bash
pip install spacy ftfy==4.4.3
python -m spacy download en
If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer defaults to tokenizing with BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
Configuration
----------------------------------------------------
We provide a base class, ``PretrainedConfig``, which can load a pretrained instance either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from the HuggingFace AWS S3 repository).
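For instance, loading, tweaking and saving a configuration might look like the following minimal sketch (shown with the derived ``BertConfig`` class, since the base class is not meant to be instantiated directly; the local directory path is only an illustration):

.. code-block:: python

    import os

    from pytorch_transformers import BertConfig

    # Download the configuration from the S3 repository and cache it locally
    config = BertConfig.from_pretrained('bert-base-uncased')

    # Keyword arguments matching configuration attributes override the loaded values
    config = BertConfig.from_pretrained('bert-base-uncased', output_attentions=True)

    # `save_pretrained` expects an existing directory
    os.makedirs('./my_model_directory/', exist_ok=True)
    config.save_pretrained('./my_model_directory/')

    # The saved configuration can be reloaded later
    config = BertConfig.from_pretrained('./my_model_directory/')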
``PretrainedConfig``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.PretrainedConfig
:members:
Models
----------------------------------------------------
``PreTrainedModel``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.PreTrainedModel
:members:
Optimizer
----------------------------------------------------
``AdamW``
~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.AdamW
:members:
Schedules
----------------------------------------------------
.. autoclass:: pytorch_transformers.ConstantLRSchedule
:members:
.. autoclass:: pytorch_transformers.WarmupConstantSchedule
:members:
.. autoclass:: pytorch_transformers.WarmupCosineSchedule
:members:
.. autoclass:: pytorch_transformers.WarmupCosineWithHardRestartsSchedule
:members:
.. autoclass:: pytorch_transformers.WarmupLinearSchedule
:members:
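As a rough sketch of how these pieces are typically combined, the example below pairs ``AdamW`` with ``WarmupLinearSchedule``; ``model`` and ``train_dataloader`` are assumed to be defined elsewhere, and the exact constructor arguments should be checked against the class references above:

.. code-block:: python

    from pytorch_transformers import AdamW, WarmupLinearSchedule

    # Adam with decoupled weight decay
    optimizer = AdamW(model.parameters(), lr=2e-5)

    # Linear warmup over the first 100 steps, then linear decay to 0 at step 1000
    scheduler = WarmupLinearSchedule(optimizer, warmup_steps=100, t_total=1000)

    for batch in train_dataloader:
        loss = model(**batch)[0]  # assuming the batch includes labels, the first output is the loss
        loss.backward()
        optimizer.step()
        scheduler.step()  # update the learning rate
        optimizer.zero_grad()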
Tokenizer
----------------------------------------------------
``PreTrainedTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.PreTrainedTokenizer
:members:
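For illustration, here is a minimal sketch using the derived ``BertTokenizer`` class (``PreTrainedTokenizer`` itself only provides the shared machinery); the directory path is only an illustration:

.. code-block:: python

    import os

    from pytorch_transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    tokens = tokenizer.tokenize("Hello, how are you?")  # list of (sub-)token strings
    ids = tokenizer.convert_tokens_to_ids(tokens)       # list of vocabulary indices
    text = tokenizer.decode(ids)                        # back to a cleaned-up string

    # The vocabulary (and any added tokens) can be saved and reloaded like models and configurations
    os.makedirs('./my_model_directory/', exist_ok=True)
    tokenizer.save_pretrained('./my_model_directory/')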
...@@ -35,10 +35,13 @@ loss, logits, attentions = outputs
### Serialization
Breaking changes in the `from_pretrained()` method:

1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them, don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.

2. The additional `*inputs` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be passed directly to the underlying model class's `__init__()` method. They are now used to update the model configuration attribute first, which can break derived model classes built based on the previous `BertForSequenceClassification` examples. More precisely, the positional arguments `*inputs` provided to `from_pretrained()` are forwarded directly to the model's `__init__()` method, while the keyword arguments `**kwargs` (i) which match configuration class attributes are used to update those attributes and (ii) which don't match any configuration class attribute are forwarded to the model's `__init__()` method.

Also, while not a breaking change, the serialization methods have been standardized and you should probably switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
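For the second point, the practical consequence is that configuration-related keyword arguments now update the configuration instead of reaching your model's `__init__()`. A minimal sketch (class and attribute names as in the `BertForSequenceClassification` example mentioned above):

```python
from pytorch_transformers import BertForSequenceClassification

# `num_labels` matches a configuration attribute, so it updates the configuration
# (and hence the classifier head) instead of being forwarded to `__init__()`
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=10)
assert model.config.num_labels == 10
```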
Here is an example:
...
...@@ -15,12 +15,6 @@ BERT
:members:
``AdamW``
~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.AdamW
:members:
``BertModel``
~~~~~~~~~~~~~~~~~~~~
...
# Quickstart
## Philosophy
PyTorch-Transformers is an opinionated library built for NLP researchers seeking to use/study/extend large-scale transformer models.
The library was designed with two strong goals in mind:
- be as easy and fast to use as possible:
  - we strongly limited the number of abstractions to learn; in fact, there are almost no abstractions, just three standard classes for each model: configuration, model and tokenizer,
  - each pretrained model's configuration, weights and vocabulary can be downloaded, cached and loaded in the related class in a simple way using a common `from_pretrained()` instantiation method.
  - this library is NOT a modular toolbox of building blocks for neural nets; to extend or build upon the library, just use your regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionality like model loading and saving.
- provide state-of-the-art models with performances as close as possible to the original models:
  - we provide at least one example for each model which reproduces a result provided by the official authors of said model,
  - the code is usually as close to the original code base as possible, which means some PyTorch code may not be as *pytorchic* as it could be as a result of being converted from TensorFlow code.
A few other goals:
- expose the models' internals as consistently as possible:
  - we give access, using a single API, to the full hidden-states and attention weights,
  - the tokenizer and base model APIs are standardized to make it easy to switch between models.
- incorporate a subjective selection of promising tools for fine-tuning/investigating these models:
  - a simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning (see the sketch after this list),
  - simple ways to mask and prune transformer heads.
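As a sketch of that last set of goals, adding domain-specific tokens before fine-tuning typically looks like the following (method names as exposed by the library's base tokenizer and model classes; treat the exact calls as indicative):

```python
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Add new tokens to the vocabulary (returns the number of tokens actually added)...
num_added = tokenizer.add_tokens(['[DOMAIN_TOKEN_1]', '[DOMAIN_TOKEN_2]'])

# ...and resize the model's input embeddings accordingly before fine-tuning
model.resize_token_embeddings(len(tokenizer))
```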
## Main concepts
The library is built around three types of classes for each model:
- **model classes**, which are PyTorch models (`torch.nn.Module`) for the 6 model architectures currently provided in the library, e.g. `BertModel`
- **configuration classes**, which store all the parameters required to build a model, e.g. `BertConfig`
- **tokenizer classes**, which store the vocabulary for each model and provide methods for encoding strings into lists of token embedding indices to be fed to a model, e.g. `BertTokenizer`
All these classes can be instantiated from pretrained instances and saved locally using two methods (a short sketch follows this list):
- `from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided, as listed [here](https://huggingface.co/pytorch-transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
- `save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.
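A minimal save/reload round trip with the `Bert` classes might look like this (the local directory path is only an illustration):

```python
import os

from pytorch_transformers import BertModel, BertTokenizer

# Download (and cache) pretrained weights and vocabulary
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Save everything to a local directory (which must already exist)...
os.makedirs('./my_bert/', exist_ok=True)
model.save_pretrained('./my_bert/')
tokenizer.save_pretrained('./my_bert/')

# ...and reload it later, just like one of the provided pretrained models
model = BertModel.from_pretrained('./my_bert/')
tokenizer = BertTokenizer.from_pretrained('./my_bert/')
```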
Let's go through a few simple quick-start examples to see how we can instantiate and use these classes.
## Quick tour: Usage
Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
See the full API reference for examples of each model class.
### BERT example
Let's start by preparing a tokenized input (a list of token embedding indices to be fed to Bert) from a text string using `BertTokenizer`:
```python
import torch
...
Serialization
----------------------------------------------------
### Loading Google AI or OpenAI pre-trained weights or PyTorch dump
### `from_pretrained()` method
...
...@@ -5,7 +5,7 @@ from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
from .tokenization_gpt2 import GPT2Tokenizer
from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
from .tokenization_xlm import XLMTokenizer
from .tokenization_utils import (PreTrainedTokenizer)
from .modeling_bert import (BertConfig, BertModel, BertForPreTraining,
BertForMaskedLM, BertForNextSentencePrediction,
...
...@@ -55,11 +55,19 @@ else:
class PretrainedConfig(object):
""" Base class for all configuration classes.
Handle a few common attributes and methods for loading/downloading/saving configurations.
"""
pretrained_config_archive_map = {}
def __init__(self, **kwargs):
r""" The initialization of :class:`~pytorch_transformers.PretrainedConfig` extracts
a few configuration attributes from `**kwargs` which are common to all models:
- `finetuning_task`: string, default `None`. Name of the task used to fine-tune the model (used when converting from an original checkpoint).
- `num_labels`: integer, default `2`. Number of classes to use when the model is a classification model (sequences/tokens).
- `output_attentions`: boolean, default `False`. Should the model return attention weights.
- `output_hidden_states`: boolean, default `False`. Should the model return all hidden-states.
- `torchscript`: boolean, default `False`. Is the model used with TorchScript.
"""
self.finetuning_task = kwargs.pop('finetuning_task', None)
self.num_labels = kwargs.pop('num_labels', 2)
self.output_attentions = kwargs.pop('output_attentions', False)
...@@ -67,7 +75,7 @@ class PretrainedConfig(object):
self.torchscript = kwargs.pop('torchscript', False)
def save_pretrained(self, save_directory):
""" Save a configuration object to the directory `save_directory`, so that it
can be re-loaded using the `from_pretrained(save_directory)` class method.
"""
assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved"
...@@ -81,30 +89,34 @@ class PretrainedConfig(object):
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
r""" Instantiate a PretrainedConfig from a pre-trained model configuration.
Parameters:
**pretrained_model_name_or_path**: either:
- a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``.
- a path to a `directory` containing a configuration file saved using the `save_pretrained(save_directory)` method, e.g.: ``./my_model_directory/``.
- a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``.
**cache_dir**: (`optional`) string:
Path to a directory in which a downloaded pre-trained model
configuration should be cached if the standard cache should not be used.
**return_unused_kwargs**: (`optional`) bool:
- If False, then this function returns just the final configuration object.
- If True, then this function returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e. the part of kwargs which has not been used to update `config` and is otherwise ignored.
**kwargs**: (`optional`) dict:
Dictionary of key/value pairs with which to update the configuration object after loading.
- The values in kwargs of any keys which are configuration attributes will be used
to override the loaded values.
- Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
by the `return_unused_kwargs` keyword parameter.
Examples::
# We can't instantiate directly the base class `PretrainedConfig` so let's show the examples on a
# derived class: BertConfig
config = BertConfig.from_pretrained('bert-base-uncased')  # Download configuration from S3 and cache.
config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')
...
...@@ -22,7 +22,7 @@ import os
import unicodedata
from io import open
from .tokenization_utils import PreTrainedTokenizer
logger = logging.getLogger(__name__)
...
...@@ -31,7 +31,7 @@ except ImportError:
def lru_cache():
return lambda func: func
from .tokenization_utils import PreTrainedTokenizer
logger = logging.getLogger(__name__)
...
...@@ -30,7 +30,7 @@ import torch
import numpy as np
from .file_utils import cached_path
from .tokenization_utils import PreTrainedTokenizer
if sys.version_info[0] == 2:
import cPickle as pickle
...
...@@ -444,7 +444,7 @@ class PreTrainedTokenizer(object):
filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
text = self.convert_tokens_to_string(filtered_tokens)
if clean_up_tokenization_spaces:
text = self.clean_up_tokenization(text)
return text
@property
...@@ -480,10 +480,9 @@ class PreTrainedTokenizer(object):
all_ids = list(self.convert_tokens_to_ids(t) for t in all_toks)
return all_ids
@staticmethod
def clean_up_tokenization(out_string):
out_string = out_string.replace(' .', '.').replace(' ?', '?').replace(' !', '!').replace(' ,', ','
).replace(" ' ", "'").replace(" n't", "n't").replace(" 'm", "'m").replace(" do not", " don't"
).replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")
return out_string
...@@ -23,7 +23,7 @@ from shutil import copyfile
import unicodedata
import six
from .tokenization_utils import PreTrainedTokenizer
logger = logging.getLogger(__name__)
...