Merge branch 'master' into RoBERTa

Commit d2cc6b10 authored Aug 08, 2019 by LysandreJik
20 changed files
--- a/.github/ISSUE_TEMPLATE/bug-report.md
+++ b/.github/ISSUE_TEMPLATE/bug-report.md
+---
+name: "\U0001F41B Bug Report"
+about: Submit a bug report to help us improve PyTorch Transformers
+---
+## 🐛 Bug
+<!-- Important information -->
+Model I am using (Bert, XLNet....):
+Language I am using the model on (English, Chinese....):
+The problem arises when using:
+* [ ] the official example scripts: (give details)
+* [ ] my own modified scripts: (give details)
+The task I am working on is:
+* [ ] an official GLUE/SQuAD task: (give the name)
+* [ ] my own task or dataset: (give details)
+## To Reproduce
+Steps to reproduce the behavior:
+1.
+2.
+3.
+<!-- If you have a code sample, error messages, stack traces, please provide it here as well. -->
+## Expected behavior
+<!-- A clear and concise description of what you expected to happen. -->
+## Environment
+* OS:
+* Python version:
+* PyTorch version:
+* PyTorch Transformers version (or branch):
+* Using GPU?
+* Distributed or parallel setup?
+* Any other relevant information:
+## Additional context
+<!-- Add any other context about the problem here. -->
\ No newline at end of file
--- a/.github/ISSUE_TEMPLATE/feature-request.md
+++ b/.github/ISSUE_TEMPLATE/feature-request.md
+---
+name: "\U0001F680 Feature Request"
+about: Submit a proposal/request for a new PyTorch Transformers feature
+---
+## 🚀 Feature
+<!-- A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist. -->
+## Motivation
+<!-- Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too. -->
+## Additional context
+<!-- Add any other context or screenshots about the feature request here. -->
\ No newline at end of file
--- a/.github/ISSUE_TEMPLATE/migration.md
+++ b/.github/ISSUE_TEMPLATE/migration.md
+---
+name: "\U0001F4DA Migration from PyTorch-pretrained-Bert"
+about: Report a problem when migrating from PyTorch-pretrained-Bert to PyTorch-Transformers
+---
+## 📚 Migration
+<!-- Important information -->
+Model I am using (Bert, XLNet....):
+Language I am using the model on (English, Chinese....):
+The problem arises when using:
+* [ ] the official example scripts: (give details)
+* [ ] my own modified scripts: (give details)
+The task I am working on is:
+* [ ] an official GLUE/SQuAD task: (give the name)
+* [ ] my own task or dataset: (give details)
+Details of the issue:
+<!-- A clear and concise description of the migration issue. If you have code snippets, please provide it here as well. -->
+## Environment
+* OS:
+* Python version:
+* PyTorch version:
+* PyTorch Transformers version (or branch):
+* Using GPU?
+* Distributed or parallel setup?
+* Any other relevant information:
+## Checklist
+- [ ] I have read the migration guide in the readme.
+- [ ] I checked if a related official extension example runs on my machine.
+## Additional context
+<!-- Add any other context about the problem here. -->
\ No newline at end of file
--- a/.github/ISSUE_TEMPLATE/question-help.md
+++ b/.github/ISSUE_TEMPLATE/question-help.md
+---
+name: "❓Questions & Help"
+about: Start a general discussion related to PyTorch Transformers
+---
+## ❓ Questions & Help
+<!-- A clear and concise description of the question. -->
\ No newline at end of file
--- a/README.md
+++ b/README.md
@@ -26,7 +26,7 @@ These implementations have been tested on several datasets (see the example scri
 ## Installation
-This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 0.4.1 to 1.1.0
+This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.0.0+
 ### With pip
@@ -130,6 +130,7 @@ tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
 ```
 ## Quick tour of the fine-tuning/usage scripts
 The library comprises several example scripts with SOTA performances for NLU and NLG tasks:
 - `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
@@ -314,7 +315,7 @@ loss = outputs[0]
 # In pytorch-transformers you can also have access to the logits:
 loss, logits = outputs[:2]
-# And even the attention weigths if you configure the model to output them (and other outputs too, see the docstrings and documentation)
+# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
 model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
 outputs = model(input_ids, labels=labels)
 loss, logits, attentions = outputs
@@ -355,8 +356,13 @@ tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
 ### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
-The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer.
+The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer which has a few differences:
-The new optimizer `AdamW` matches PyTorch `Adam` optimizer API.
+- it only implements weight decay correction,
+- schedules are now external (see below),
+- gradient clipping is now also external (see below).
+The new optimizer `AdamW` matches the PyTorch `Adam` optimizer API and lets you use standard PyTorch or apex methods for the schedule and clipping.
 The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.
@@ -365,6 +371,7 @@ Here is a conversion examples from `BertAdam` with a linear warmup and decay sch
 ```python
 # Parameters:
 lr = 1e-3
+max_grad_norm = 1.0
 num_total_steps = 1000
 num_warmup_steps = 100
 warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1
@@ -384,8 +391,10 @@ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_tot
 for batch in train_data:
    loss = model(batch)
    loss.backward()
+    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    scheduler.step()
    optimizer.step()
+    optimizer.zero_grad()
 ```
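Since gradient clipping now happens outside the optimizer, it may help to see what clipping by global norm does to the gradients. The following is a minimal pure-Python sketch, not the PyTorch implementation: plain lists stand in for gradient tensors, and the function name `clip_by_global_norm` and the `eps` value are assumptions chosen for illustration. In a real training loop you would call `torch.nn.utils.clip_grad_norm_` as shown in the diff above.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so their combined L2 norm is at most max_norm.

    `grads` is a list of lists of floats (one inner list per parameter).
    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    # One norm over every gradient value of every parameter.
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        # Only shrink gradients; small gradients are left untouched.
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total_norm


max_grad_norm = 1.0
large_grads = [[3.0, 4.0]]   # global norm 5.0, will be scaled down
small_grads = [[0.3, 0.4]]   # global norm 0.5, left as-is

clipped_large, norm_large = clip_by_global_norm(large_grads, max_grad_norm)
clipped_small, norm_small = clip_by_global_norm(small_grads, max_grad_norm)
```

Because the scale is computed from a single norm over all parameters, the direction of the overall update is preserved and only its length changes.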
 ## Citation

--- a/docs/source/converting_tensorflow_models.rst
+++ b/docs/source/converting_tensorflow_models.rst
 Converting Tensorflow Checkpoints
 ================================================
-A command-line interface is provided to convert a TensorFlow checkpoint in a PyTorch dump of the ``BertForPreTraining`` class  (for BERT) or NumPy checkpoint in a PyTorch dump of the ``OpenAIGPTModel`` class  (for OpenAI GPT).
+A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints into models that can be loaded using the ``from_pretrained`` methods of the library.
 BERT
 ^^^^
@@ -41,6 +41,20 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
     $PYTORCH_DUMP_OUTPUT \
     [OPENAI_GPT_CONFIG]
+OpenAI GPT-2
+^^^^^^^^^^^^
+Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ )
+.. code-block:: shell
+   export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights
+   pytorch_transformers gpt2 \
+     $OPENAI_GPT2_CHECKPOINT_PATH \
+     $PYTORCH_DUMP_OUTPUT \
+     [OPENAI_GPT2_CONFIG]
 Transformer-XL
 ^^^^^^^^^^^^^^
@@ -55,19 +69,6 @@ Here is an example of the conversion process for a pre-trained Transformer-XL mo
     $PYTORCH_DUMP_OUTPUT \
     [TRANSFO_XL_CONFIG]
-GPT-2
-^^^^^
-Here is an example of the conversion process for a pre-trained OpenAI's GPT-2 model.
-.. code-block:: shell
-   export GPT2_DIR=/path/to/gpt2/checkpoint
-   pytorch_transformers gpt2 \
-     $GPT2_DIR/model.ckpt \
-     $PYTORCH_DUMP_OUTPUT \
-     [GPT2_CONFIG]
 XLNet
 ^^^^^
@@ -84,3 +85,17 @@ Here is an example of the conversion process for a pre-trained XLNet model, fine
     $TRANSFO_XL_CONFIG_PATH \
     $PYTORCH_DUMP_OUTPUT \
     STS-B \
+XLM
+^^^
+Here is an example of the conversion process for a pre-trained XLM model:
+.. code-block:: shell
+   export XLM_CHECKPOINT_PATH=/path/to/xlm/checkpoint
+   pytorch_transformers xlm \
+     $XLM_CHECKPOINT_PATH \
+     $PYTORCH_DUMP_OUTPUT \
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -21,17 +21,26 @@ The library currently contains PyTorch implementations, pre-trained model weight
    pretrained_models
    examples
    notebooks
+    serialization
    converting_tensorflow_models
    migration
    bertology
    torchscript
+.. toctree::
+    :maxdepth: 2
+    :caption: Main classes
+    main_classes/configuration
+    main_classes/model
+    main_classes/tokenizer
+    main_classes/optimizer_schedules
 .. toctree::
    :maxdepth: 2
    :caption: Package Reference
-    model_doc/overview
+    model_doc/auto
    model_doc/bert
    model_doc/gpt
    model_doc/transformerxl

--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
 Installation
 ================================================
-This repo was tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 0.4.1/1.0.0
+PyTorch-Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.1.0
 With pip
 ^^^^^^^^
-PyTorch pretrained bert can be installed with pip as follows:
+PyTorch Transformers can be installed using pip as follows:
 .. code-block:: bash
@@ -15,7 +15,7 @@ PyTorch pretrained bert can be installed with pip as follows:
 From source
 ^^^^^^^^^^^
-Clone the repository and instal locally:
+To install from source, clone the repository and install with:
 .. code-block:: bash
@@ -27,11 +27,11 @@ Clone the repository and instal locally:
 Tests
 ^^^^^
-An extensive test suite is included for the library and the example scripts. Library tests can be found in the `tests folder <https://github.com/huggingface/pytorch-transformers/tree/master/pytorch_transformers/tests>`_ and examples tests in the `examples folder <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`_.
+An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the `tests folder <https://github.com/huggingface/pytorch-transformers/tree/master/pytorch_transformers/tests>`_ and examples tests in the `examples folder <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`_.
-These tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
+Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
-You can run the tests from the root of the cloned repository with the commands:
+Run all the tests from the root of the cloned repository with the commands:
 .. code-block:: bash
@@ -42,7 +42,7 @@ You can run the tests from the root of the cloned repository with the commands:
 OpenAI GPT original tokenization workflow
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (limit to version 4.4.3 if you are using Python 2) and ``SpaCy`` :
+If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (use version 4.4.3 if you are using Python 2) and ``SpaCy`` :
 .. code-block:: bash
@@ -62,4 +62,4 @@ It contains an example of a conversion script from a Pytorch trained Transformer
 It also contains an implementation of BERT for Question answering.
 At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
 or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
\ No newline at end of file
--- a/docs/source/main_classes/configuration.rst
+++ b/docs/source/main_classes/configuration.rst
+Configuration
+----------------------------------------------------
+The base class ``PretrainedConfig`` implements the common methods for loading/saving a configuration either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS S3 repository).
+``PretrainedConfig``
+~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.PretrainedConfig
+    :members:
--- a/docs/source/main_classes/model.rst
+++ b/docs/source/main_classes/model.rst
+Models
+----------------------------------------------------
+The base class ``PreTrainedModel`` implements the common methods for loading/saving a model either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS S3 repository).
+``PreTrainedModel`` also implements a few methods which are common among all the models to:
+- resize the input token embeddings when new tokens are added to the vocabulary
+- prune the attention heads of the model.
+``PreTrainedModel``
+~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.PreTrainedModel
+    :members:
--- a/docs/source/main_classes/optimizer_schedules.rst
+++ b/docs/source/main_classes/optimizer_schedules.rst
+Optimizer
+----------------------------------------------------
+The ``.optimization`` module provides:
+- an optimizer with the weight decay fix that can be used to fine-tune models, and
+- several schedules in the form of schedule objects that inherit from ``_LRSchedule``.
+``AdamW``
+~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.AdamW
+    :members:
+Schedules
+----------------------------------------------------
+Learning Rate Schedules
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. autoclass:: pytorch_transformers.ConstantLRSchedule
+    :members:
+.. autoclass:: pytorch_transformers.WarmupConstantSchedule
+    :members:
+.. image:: /imgs/warmup_constant_schedule.png
+    :target: /imgs/warmup_constant_schedule.png
+    :alt:
+.. autoclass:: pytorch_transformers.WarmupCosineSchedule
+    :members:
+.. image:: /imgs/warmup_cosine_schedule.png
+    :target: /imgs/warmup_cosine_schedule.png
+    :alt:
+.. autoclass:: pytorch_transformers.WarmupCosineWithHardRestartsSchedule
+    :members:
+.. image:: /imgs/warmup_cosine_hard_restarts_schedule.png
+    :target: /imgs/warmup_cosine_hard_restarts_schedule.png
+    :alt:
+.. autoclass:: pytorch_transformers.WarmupLinearSchedule
+    :members:
+.. image:: /imgs/warmup_linear_schedule.png
+    :target: /imgs/warmup_linear_schedule.png
+    :alt:
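To make the shape of a warmup-then-linear-decay schedule concrete, here is a small pure-Python sketch of the learning-rate multiplier such a schedule applies at each step. It is an illustration under stated assumptions rather than the library's code: the function name `warmup_linear_multiplier` is hypothetical, and the parameter values reuse `lr`, `num_warmup_steps` and `num_total_steps` from the migration example.

```python
def warmup_linear_multiplier(step, warmup_steps, t_total):
    """Multiplier applied to the base learning rate at a given step.

    Rises linearly from 0 to 1 over `warmup_steps`, then decays
    linearly from 1 to 0 between `warmup_steps` and `t_total`.
    """
    if step < warmup_steps:
        return float(step) / float(max(1, warmup_steps))
    return max(0.0, float(t_total - step) / float(max(1, t_total - warmup_steps)))


lr = 1e-3
num_warmup_steps = 100
num_total_steps = 1000

# Effective learning rate at every step of training.
schedule = [
    lr * warmup_linear_multiplier(step, num_warmup_steps, num_total_steps)
    for step in range(num_total_steps + 1)
]
```

A function of this form can be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR` (after fixing `warmup_steps` and `t_total`), which is one way to build such a schedule with standard PyTorch tools.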
--- a/docs/source/main_classes/tokenizer.rst
+++ b/docs/source/main_classes/tokenizer.rst
+Tokenizer
+----------------------------------------------------
+The base class ``PreTrainedTokenizer`` implements the common methods for loading/saving a tokenizer either from a local file or directory, or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).
+``PreTrainedTokenizer`` is the main entry point into tokenizers as it also implements the main methods for using all the tokenizers:
+- tokenizing, converting tokens to ids and back and encoding/decoding,
+- adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...),
+- managing special tokens (adding them, assigning them to roles, making sure they are not split during tokenization)
+``PreTrainedTokenizer``
+~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.PreTrainedTokenizer
+    :members:
--- a/docs/source/migration.md
+++ b/docs/source/migration.md
@@ -35,10 +35,13 @@ loss, logits, attentions = outputs
 ### Serialization
-Breaking change: Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method.
+Breaking change in the `from_pretrained()` method:
-To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
-Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other seralization method before.
+1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them, don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
+2. The additional `*inputs` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be passed directly to the `__init__()` method of the underlying model class. They are now used to update the model configuration attribute first, which can break derived model classes built on the previous `BertForSequenceClassification` examples. More precisely, the positional arguments `*inputs` provided to `from_pretrained()` are forwarded directly to the model `__init__()` method, while the keyword arguments `**kwargs` (i) that match configuration class attributes are used to update those attributes, and (ii) that don't match any configuration class attribute are forwarded to the model `__init__()` method.
+Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
 Here is an example:
@@ -65,8 +68,13 @@ tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
 ### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
-The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer.
+The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer which has a few differences:
-The new optimizer `AdamW` matches PyTorch `Adam` optimizer API.
+- it only implements weight decay correction,
+- schedules are now external (see below),
+- gradient clipping is now also external (see below).
+The new optimizer `AdamW` matches the PyTorch `Adam` optimizer API and lets you use standard PyTorch or apex methods for the schedule and clipping.
 The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.
@@ -75,6 +83,7 @@ Here is a conversion examples from `BertAdam` with a linear warmup and decay sch
 ```python
 # Parameters:
 lr = 1e-3
+max_grad_norm = 1.0
 num_total_steps = 1000
 num_warmup_steps = 100
 warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1
@@ -94,6 +103,7 @@ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_tot
 for batch in train_data:
    loss = model(batch)
    loss.backward()
+    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    scheduler.step()
    optimizer.step()
 ```
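The keyword-argument routing described in the Serialization section of this migration guide can be sketched in a few lines of plain Python. This is a toy illustration only: `ToyConfig` and `route_kwargs` are hypothetical names, and the real `from_pretrained()` does considerably more (downloading, caching, loading weights). The sketch only shows the rule that keyword arguments matching a configuration attribute update the configuration, while the rest are kept for the model `__init__()` method.

```python
class ToyConfig(object):
    """Stand-in for a configuration class with two attributes."""

    def __init__(self, num_labels=2, output_attentions=False):
        self.num_labels = num_labels
        self.output_attentions = output_attentions


def route_kwargs(config, **kwargs):
    """Split kwargs between the configuration and the model constructor."""
    model_kwargs = {}
    for key, value in kwargs.items():
        if hasattr(config, key):
            # (i) Matches a configuration attribute: update the configuration.
            setattr(config, key, value)
        else:
            # (ii) No matching attribute: keep it for the model __init__().
            model_kwargs[key] = value
    return config, model_kwargs


config, model_kwargs = route_kwargs(
    ToyConfig(),
    output_attentions=True,   # known configuration attribute
    my_custom_flag=3,         # hypothetical argument of a derived model class
)
```

Under this rule, a derived class that expected `num_labels` as an `__init__()` argument would no longer receive it, since it now lands in the configuration instead.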
--- a/docs/source/model_doc/auto.rst
+++ b/docs/source/model_doc/auto.rst
+AutoModels
+-----------
+In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the ``from_pretrained`` method.
+AutoClasses are here to do this job for you so that you automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary:
+Instantiating one of ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer`` will directly create an instance of the class for the relevant architecture (ex: ``model = AutoModel.from_pretrained('bert-base-cased')`` will create an instance of ``BertModel``).
+``AutoConfig``
+~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.AutoConfig
+    :members:
+``AutoModel``
+~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.AutoModel
+    :members:
+``AutoTokenizer``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.AutoTokenizer
+    :members:
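The idea of guessing the architecture from the model name or path can be illustrated with a simple substring lookup. This is a hedged sketch, not the library's implementation: the pattern table, its ordering and the function name `guess_model_class_name` are assumptions made for this example, and it returns class names as strings instead of instantiating anything.

```python
# Hypothetical (substring, class name) pairs, checked in order.
ARCHITECTURE_PATTERNS = [
    ("bert", "BertModel"),
    ("openai-gpt", "OpenAIGPTModel"),
    ("gpt2", "GPT2Model"),
    ("transfo-xl", "TransfoXLModel"),
    ("xlnet", "XLNetModel"),
    ("xlm", "XLMModel"),
]


def guess_model_class_name(name_or_path):
    """Return the model class name suggested by a shortcut name or path."""
    for pattern, class_name in ARCHITECTURE_PATTERNS:
        if pattern in name_or_path:
            return class_name
    raise ValueError(
        "Unrecognized model identifier in {!r}; expected one of: {}".format(
            name_or_path, ", ".join(p for p, _ in ARCHITECTURE_PATTERNS)
        )
    )


bert_class = guess_model_class_name("bert-base-cased")
xlnet_class = guess_model_class_name("./checkpoints/xlnet-large-cased")
```

Because matching is by substring and in order, a name that contains several patterns resolves to the first one listed, so the order of the table is part of the design.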
--- a/docs/source/model_doc/bert.rst
+++ b/docs/source/model_doc/bert.rst
@@ -15,12 +15,6 @@ BERT
    :members:
-``AdamW``
-~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_transformers.AdamW
-    :members:
 ``BertModel``
 ~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/overview.rst
+++ b/docs/source/model_doc/overview.rst
-Overview
-================================================
-Here is a detailed documentation of the classes in the package and how to use them:
-.. list-table::
-   :header-rows: 1
-   * - Sub-section
-     - Description
-   * - `Loading pre-trained weights <#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump>`__
-     - How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance
-   * - `Serialization best-practices <#serialization-best-practices>`__
-     - How to save and reload a fine-tuned model
-   * - `Configurations <#configurations>`__
-     - API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL
-TODO Lysandre filled: Removed Models/Tokenizers/Optimizers as no single link can be made.
-Configurations
-^^^^^^^^^^^^^^
-Models (BERT, GPT, GPT-2 and Transformer-XL) are defined and build from configuration classes which contains the
-parameters of the models (number of layers, dimensionalities...) and a few utilities to read and write from JSON
-configuration files. The respective configuration classes are:
-* ``BertConfig`` for ``BertModel`` and BERT classes instances.
-* ``OpenAIGPTConfig`` for ``OpenAIGPTModel`` and OpenAI GPT classes instances.
-* ``GPT2Config`` for ``GPT2Model`` and OpenAI GPT-2 classes instances.
-* ``TransfoXLConfig`` for ``TransfoXLModel`` and Transformer-XL classes instances.
-These configuration classes contains a few utilities to load and save configurations:
-* ``from_dict(cls, json_object)``\ : A class method to construct a configuration from a Python dictionary of parameters. Returns an instance of the configuration class.
-* ``from_json_file(cls, json_file)``\ : A class method to construct a configuration from a json file of parameters. Returns an instance of the configuration class.
-* ``to_dict()``\ : Serializes an instance to a Python dictionary. Returns a dictionary.
-* ``to_json_string()``\ : Serializes an instance to a JSON string. Returns a string.
-* ``to_json_file(json_file_path)``\ : Save an instance to a json file.
-Loading Google AI or OpenAI pre-trained weights or PyTorch dump
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-``from_pretrained()`` method
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of ``BertForPreTraining`` saved with ``torch.save()``\ ), the PyTorch model classes and the tokenizer can be instantiated using the ``from_pretrained()`` method:
-.. code-block:: python
-   model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *input, **kwargs)
-where
-* ``BERT_CLASS`` is either a tokenizer to load the vocabulary (\ ``BertTokenizer`` or ``OpenAIGPTTokenizer`` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): ``BertModel``\ , ``BertForMaskedLM``\ , ``BertForNextSentencePrediction``\ , ``BertForPreTraining``\ , ``BertForSequenceClassification``\ , ``BertForTokenClassification``\ , ``BertForMultipleChoice``\ , ``BertForQuestionAnswering``\ , ``OpenAIGPTModel``\ , ``OpenAIGPTLMHeadModel`` or ``OpenAIGPTDoubleHeadsModel``\ , and
-*
-  ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is either:
-  *
-    the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
-    * ``bert-base-uncased``: 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``bert-large-uncased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
-    * ``bert-base-cased``: 12-layer, 768-hidden, 12-heads , 110M parameters
-    * ``bert-large-cased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
-    * ``bert-base-multilingual-uncased``: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``bert-base-multilingual-cased``: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`__
-    * ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
-    * ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
-    * ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
-    * ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
-    * ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
-    * ``transfo-xl-wt103``: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
-  *
-    a path or url to a pretrained model archive containing:
-    * ``bert_config.json`` or ``openai_gpt_config.json`` a configuration file for the model, and
-    * ``pytorch_model.bin`` a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ )
-  If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py>`__\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).
-*
-  ``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).
-* ``from_tf``\ : should we load the weights from a locally saved TensorFlow checkpoint
-* ``state_dict``\ : an optional state dictionnary (collections.OrderedDict object) to use instead of Google pre-trained models
-* ``*inputs``\ , `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)
-``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.
-When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).
-Examples:
-.. code-block:: python
-   # BERT
-   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
-   model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
-   # OpenAI GPT
-   tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-   model = OpenAIGPTModel.from_pretrained('openai-gpt')
-   # Transformer-XL
-   tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-   model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
-   # OpenAI GPT-2
-   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-   model = GPT2Model.from_pretrained('gpt2')
-Cache directory
-~~~~~~~~~~~~~~~
-``pytorch_pretrained_bert`` save the pretrained weights in a cache directory which is located at (in this order of priority):
-* ``cache_dir`` optional arguments to the ``from_pretrained()`` method (see above),
-* shell environment variable ``PYTORCH_PRETRAINED_BERT_CACHE``\ ,
-* PyTorch cache home + ``/pytorch_pretrained_bert/``
-  where PyTorch cache home is defined by (in this order):
-  * shell environment variable ``ENV_TORCH_HOME``
-  * shell environment variable ``ENV_XDG_CACHE_HOME`` + ``/torch/``\ )
-  * default: ``~/.cache/torch/``
-Usually, if you don't set any specific environment variable, ``pytorch_pretrained_bert`` cache will be at ``~/.cache/torch/pytorch_pretrained_bert/``.
-You can alsways safely delete ``pytorch_pretrained_bert`` cache but the pretrained model weights and vocabulary files wil have to be re-downloaded from our S3.
-Serialization best-practices
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-This section explain how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
-There are three types of files you need to save to be able to reload a fine-tuned model:
-* the model it-self which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
-* the configuration file of the model which is saved as a JSON file, and
-* the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
-The *default filenames* of these files are as follow:
-* the model weights file: ``pytorch_model.bin``\ ,
-* the configuration file: ``config.json``\ ,
-* the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary),
-* for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``.
-**If you save a model using these *default filenames*\ , you can then re-load the model and tokenizer using the ``from_pretrained()`` method.**
-Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards:
-.. code-block:: python
-   from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
-   output_dir = "./models/"
-   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
-   # If we have a distributed model, save only the encapsulated model
-   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
-   model_to_save = model.module if hasattr(model, 'module') else model
-   # If we save using the predefined names, we can load using `from_pretrained`
-   output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
-   output_config_file = os.path.join(output_dir, CONFIG_NAME)
-   torch.save(model_to_save.state_dict(), output_model_file)
-   model_to_save.config.to_json_file(output_config_file)
-   tokenizer.save_vocabulary(output_dir)
-   # Step 2: Re-load the saved model and vocabulary
-   # Example for a Bert model
-   model = BertForQuestionAnswering.from_pretrained(output_dir)
-   tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
-   # Example for a GPT model
-   model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
-   tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
-Here is another way you can save and reload the model if you want to use specific paths for each type of files:
-.. code-block:: python
-   output_model_file = "./models/my_own_model_file.bin"
-   output_config_file = "./models/my_own_config_file.bin"
-   output_vocab_file = "./models/my_own_vocab_file.bin"
-   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
-   # If we have a distributed model, save only the encapsulated model
-   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
-   model_to_save = model.module if hasattr(model, 'module') else model
-   torch.save(model_to_save.state_dict(), output_model_file)
-   model_to_save.config.to_json_file(output_config_file)
-   tokenizer.save_vocabulary(output_vocab_file)
-   # Step 2: Re-load the saved model and vocabulary
-   # We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
-   # Here is how to do it in this situation:
-   # Example for a Bert model
-   config = BertConfig.from_json_file(output_config_file)
-   model = BertForQuestionAnswering(config)
-   state_dict = torch.load(output_model_file)
-   model.load_state_dict(state_dict)
-   tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)
-   # Example for a GPT model
-   config = OpenAIGPTConfig.from_json_file(output_config_file)
-   model = OpenAIGPTDoubleHeadsModel(config)
-   state_dict = torch.load(output_model_file)
-   model.load_state_dict(state_dict)
-   tokenizer = OpenAIGPTTokenizer(output_vocab_file)
-Learning Rate Schedules
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The ``.optimization`` module also provides additional schedules in the form of schedule objects that inherit from ``_LRSchedule``.
-All ``_LRSchedule`` subclasses accept ``warmup`` and ``t_total`` arguments at construction.
-When an ``_LRSchedule`` object is passed into ``AdamW``\ ,
-the ``warmup`` and ``t_total`` arguments on the optimizer are ignored and the ones in the ``_LRSchedule`` object are used.
-An overview of the implemented schedules:
-* ``ConstantLR``\ : always returns learning rate 1.
-* ``WarmupConstantSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
-    Keeps learning rate equal to 1. after warmup.
-  .. image:: /imgs/warmup_constant_schedule.png
-     :target: /imgs/warmup_constant_schedule.png
-     :alt:
-* ``WarmupLinearSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
-    Linearly decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps.
-  .. image:: /imgs/warmup_linear_schedule.png
-     :target: /imgs/warmup_linear_schedule.png
-     :alt:
-* ``WarmupCosineSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
-  Decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps following a cosine curve. \
-  If ``cycles`` (default=0.5) is different from default, learning rate follows cosine function after warmup.
-  .. image:: /imgs/warmup_cosine_schedule.png
-     :target: /imgs/warmup_cosine_schedule.png
-     :alt:
-* ``WarmupCosineWithHardRestartsSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
-  If ``cycles`` (default=1.) is different from default, learning rate follows ``cycles`` times a cosine decaying learning rate (with hard restarts).
-  .. image:: /imgs/warmup_cosine_hard_restarts_schedule.png
-     :target: /imgs/warmup_cosine_hard_restarts_schedule.png
-     :alt:
-* ``WarmupCosineWithWarmupRestartsSchedule`` : All training progress is divided in ``cycles`` (default=1.) parts of equal length.
-  Every part follows a schedule with the first ``warmup`` fraction of the training steps linearly increasing from 0. to 1.,
-  followed by a learning rate decreasing from 1. to 0. following a cosine curve.
-  Note that the total number of all warmup steps over all cycles together is equal to ``warmup`` * ``cycles``.
-  .. image:: /imgs/warmup_cosine_warm_restarts_schedule.png
-     :target: /imgs/warmup_cosine_warm_restarts_schedule.png
-     :alt:
\ No newline at end of file
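The warmup schedules described in the removed section can be sketched in plain Python. This is a minimal illustration of the linear and cosine shapes as the text describes them, not the library's implementation: the function names `warmup_linear` and `warmup_cosine` and the `progress` argument (fraction of training completed, between 0 and 1) are assumptions made for this example, and `0 < warmup < 1` is assumed.

```python
import math


def warmup_linear(progress, warmup=0.1):
    """Learning rate multiplier: linear warmup from 0 to 1, then linear decay to 0."""
    if progress < warmup:
        return progress / warmup
    # Decay linearly over the remaining (1 - warmup) fraction of training.
    return max(0.0, (1.0 - progress) / (1.0 - warmup))


def warmup_cosine(progress, warmup=0.1, cycles=0.5):
    """Learning rate multiplier: linear warmup from 0 to 1, then cosine decay.

    With cycles=0.5 the cosine covers half a period, so the multiplier
    falls from 1 to 0 over the remaining (1 - warmup) fraction.
    """
    if progress < warmup:
        return progress / warmup
    decay_progress = (progress - warmup) / (1.0 - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * cycles * 2.0 * decay_progress))


# Sample both multipliers at a few points of training.
for p in (0.0, 0.05, 0.1, 0.55, 1.0):
    print(p, round(warmup_linear(p), 3), round(warmup_cosine(p), 3))
```

Both functions return a multiplier to apply to the base learning rate, which matches the "from 0 to 1" wording used in the schedule overview.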
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
--- a/docs/source/quickstart.md
+++ b/docs/source/quickstart.md
 # Quickstart
+## Philosophy
+PyTorch-Transformers is an opinionated library built for NLP researchers seeking to use/study/extend large-scale transformer models.
+The library was designed with two strong goals in mind:
+- be as easy and fast to use as possible:
+  - we strongly limited the number of user-facing abstractions to learn. In fact, there are almost no abstractions, just three standard classes required to use each model: configuration, model and tokenizer,
+  - all of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` instantiation method, which takes care of downloading (if needed), caching and loading the related class from a pretrained instance supplied in the library or from your own saved instance,
+  - as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend or build upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.
+- provide state-of-the-art models with performance as close as possible to the original models:
+  - we provide at least one example for each architecture which reproduces a result provided by the official authors of said architecture,
+  - the code is usually as close to the original code base as possible, which means some PyTorch code may not be as *pytorchic* as it could be because it was converted from TensorFlow code.
+A few other goals:
+- expose the models' internals as consistently as possible:
+  - we give access, using a single API, to the full hidden-states and attention weights,
+  - the tokenizer and base model APIs are standardized to easily switch between models.
+- incorporate a subjective selection of promising tools for fine-tuning/investigating these models:
+  - a simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning,
+  - simple ways to mask and prune transformer heads.
 ## Main concepts
+The library is built around three types of classes for each model:
+- **model classes** which are PyTorch models (`torch.nn.Module`) of the 6 model architectures currently provided in the library, e.g. `BertModel`
+- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
+- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings into lists of token embedding indices to be fed to a model, e.g. `BertTokenizer`
+All these classes can be instantiated from pretrained instances and saved locally using two methods:
+- `from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/pytorch-transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
+- `save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.
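The save/reload round trip behind these two methods can be illustrated with a toy class. This is a minimal sketch using only the standard library: `ToyConfig` and its attributes are hypothetical stand-ins, not classes from the library, and the only detail borrowed from it is the default `config.json` filename.

```python
import json
import os
import tempfile


class ToyConfig:
    """Hypothetical stand-in for a configuration class (not part of the library)."""

    FILE_NAME = "config.json"  # default configuration filename

    def __init__(self, hidden_size=768, num_layers=12):
        self.hidden_size = hidden_size
        self.num_layers = num_layers

    def save_pretrained(self, save_directory):
        # Write the parameters as JSON under the default filename.
        os.makedirs(save_directory, exist_ok=True)
        path = os.path.join(save_directory, self.FILE_NAME)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.__dict__, f, indent=2, sort_keys=True)
        return path

    @classmethod
    def from_pretrained(cls, directory):
        # Rebuild the object from the JSON file found in the directory.
        with open(os.path.join(directory, cls.FILE_NAME), encoding="utf-8") as f:
            return cls(**json.load(f))


# Save to a temporary directory, then reload from the same directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    original = ToyConfig(hidden_size=256, num_layers=4)
    original.save_pretrained(tmp_dir)
    reloaded = ToyConfig.from_pretrained(tmp_dir)

print(reloaded.hidden_size, reloaded.num_layers)
```

The point of the pattern is that a directory written by `save_pretrained()` is a valid argument to `from_pretrained()`, so a saved instance and a library-provided one are loaded the same way.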
+We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized in two parts:
+- the **MAIN CLASSES** section details the common functionalities/methods/attributes of the three main types of classes (configuration, model, tokenizer) plus some optimization-related classes provided as utilities for training,
+- the **PACKAGE REFERENCE** section details all the variants of each class for each model architecture and in particular the inputs/outputs that you should expect when calling each of them.
 ## Quick tour: Usage
-Here are two quick-start examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
+Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
-See package reference for examples for each model classe.
+See the full API reference for examples of each model class.
 ### BERT example
-First let's prepare a tokenized input from a text string using `BertTokenizer`
+Let's start by preparing a tokenized input (a list of token embedding indices to be fed to BERT) from a text string using `BertTokenizer`
 ```python
 import torch

--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
--- a/docs/source/torchscript.rst
+++ b/docs/source/torchscript.rst
@@ -105,6 +105,9 @@ according to a ``BertConfig`` class and then saved to disk under the filename ``
    # The model needs to be in evaluation mode
    model.eval()
+   # If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
+   model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
    # Creating the trace
    traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
    torch.jit.save(traced_model, "traced_bert.pt")