Merge branch 'master' into pr/1383

d9e60f4f · thomwolf · 07d055f8 · 1c507995 · d9e60f4f · d9e60f4f
Commit d9e60f4f authored Oct 09, 2019 by thomwolf
20 changed files
--- a/.github/ISSUE_TEMPLATE/--new-model-addition.md
+++ b/.github/ISSUE_TEMPLATE/--new-model-addition.md
+---
+name: "\U0001F31FNew model addition"
+about: Submit a proposal/request to implement a new Transformer-based model
+title: ''
+labels: ''
+assignees: ''
+---
+# 🌟New model addition
+## Model description
+<!-- Important information -->
+## Open Source status
+* [ ] the model implementation is available: (give details)
+* [ ] the model weights are available: (give details)
+## Additional context
+<!-- Add any other context about the problem here. -->
--- a/.github/ISSUE_TEMPLATE/bug-report.md
+++ b/.github/ISSUE_TEMPLATE/bug-report.md
 ---
 name: "\U0001F41B Bug Report"
 about: Submit a bug report to help us improve PyTorch Transformers
+title: ''
+labels: ''
+assignees: ''
 ---
 ## 🐛 Bug

--- a/.github/ISSUE_TEMPLATE/feature-request.md
+++ b/.github/ISSUE_TEMPLATE/feature-request.md
 ---
 name: "\U0001F680 Feature Request"
 about: Submit a proposal/request for a new PyTorch Transformers feature
+title: ''
+labels: ''
+assignees: ''
 ---
 ## 🚀 Feature

--- a/.github/ISSUE_TEMPLATE/migration.md
+++ b/.github/ISSUE_TEMPLATE/migration.md
 ---
 name: "\U0001F4DA Migration from PyTorch-pretrained-Bert"
 about: Report a problem when migrating from PyTorch-pretrained-Bert to Transformers
+title: ''
+labels: ''
+assignees: ''
 ---
 ## 📚 Migration

--- a/.github/ISSUE_TEMPLATE/question-help.md
+++ b/.github/ISSUE_TEMPLATE/question-help.md
 ---
 name: "❓Questions & Help"
 about: Start a general discussion related to PyTorch Transformers
+title: ''
+labels: ''
+assignees: ''
 ---
 ## ❓ Questions & Help

--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
+# How to contribute to transformers?
+Everyone is welcome to contribute, and we value everybody's contribution. Code
+is thus not the only way to help the community. Answering questions, helping
+others, reaching out and improving the documentation are immensely valuable to
+the community.
+It also helps us if you spread the word: reference the library from blog posts
+on the awesome projects it made possible, shout out on Twitter every time it has
+helped you, or simply star the repo to say "thank you".
+## You can contribute in so many ways!
+There are 4 ways you can contribute to transformers:
+* Fixing outstanding issues with the existing code;
+* Implementing new models;
+* Contributing to the examples or to the documentation;
+* Submitting issues related to bugs or desired new features.
+*All are equally valuable to the community.*
+## Submitting a new issue or feature request
+Do your best to follow these guidelines when submitting an issue or a feature
+request. It will make it easier for us to come back to you quickly and with good
+feedback.
+### Did you find a bug?
+The transformers library is robust and reliable thanks to the users who notify us of
+the problems they encounter. So thank you for reporting an issue.
+First, we would really appreciate it if you could **make sure the bug was not
+already reported** (use the search bar on Github under Issues).
+Did not find it? :( So we can act quickly on it, please follow these steps:
+* Include your **OS type and version**, the versions of **Python**, **PyTorch** and
+  **Tensorflow** when applicable;
+* A short, self-contained, code snippet that allows us to reproduce the bug in
+  less than 30s;
+* Provide the *full* traceback if an exception is raised.
+To get the OS and software versions, execute the following code and copy-paste
+the output:
+```
+import platform; print("Platform", platform.platform())
+import sys; print("Python", sys.version)
+import torch; print("PyTorch", torch.__version__)
+import tensorflow; print("Tensorflow", tensorflow.__version__)
+```
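The one-liners above raise an `ImportError` when PyTorch or TensorFlow is not installed. A minimal sketch of a more tolerant variant follows; the names `collect_versions` and `format_report` are hypothetical helpers introduced here for illustration, not part of the library.

```python
import importlib
import platform
import sys


def collect_versions(optional_packages=("torch", "tensorflow")):
    """Gather OS, Python and optional package versions for a bug report."""
    info = {
        "Platform": platform.platform(),
        "Python": sys.version,
    }
    for name in optional_packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is absent; record that instead of failing.
            info[name] = "not installed"
        else:
            info[name] = getattr(module, "__version__", "unknown")
    return info


def format_report(info):
    """Render the collected details as one 'key value' line per entry."""
    return "\n".join("{} {}".format(key, value) for key, value in info.items())
```

Running `print(format_report(collect_versions()))` should produce output that can be pasted directly into an issue.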
+### Do you want to implement a new model?
+Awesome! Please provide the following information:
+* Short description of the model and link to the paper;
+* Link to the implementation if it is open-source;
+* Link to the model weights if they are available.
+If you are willing to contribute the model yourself, let us know so we can best
+guide you.
+### Do you want a new feature (that is not a model)?
+A world-class feature request addresses the following points:
+1. Motivation first:
+  * Is it related to a problem/frustration with the library? If so, please explain
+    why. Providing a code snippet that demonstrates the problem is best.
+  * Is it related to something you would need for a project? We'd love to hear
+    about it!
+  * Is it something you worked on and think could benefit the community?
+    Awesome! Tell us what problem it solved for you.
+2. Write a *full paragraph* describing the feature;
+3. Provide a **code snippet** that demonstrates its future use;
+4. In case this is related to a paper, please attach a link;
+5. Attach any additional information (drawings, screenshots, etc.) you think may help.
+If your issue is well written we're already 80% of the way there by the time you
+post it.
+## Start contributing! (Pull Requests)
+Before writing code, we strongly advise you to search through the existing PRs or
+issues to make sure that nobody is already working on the same thing. If you are
+unsure, it is always a good idea to open an issue to get some feedback.
+You will need basic `git` proficiency to be able to contribute to
+`transformers`. `git` is not the easiest tool to use but it has the greatest
+manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
+Git](https://git-scm.com/book/en/v2) is a very good reference.
+Follow these steps to start contributing:
+1. Fork the [repository](https://github.com/huggingface/transformers) by
+   clicking on the 'Fork' button on the repository's page. This creates a copy of the code
+   under your github user account.
+2. Clone your fork to your local disk, and add the base repository as a remote:
+   ```bash
+   $ git clone git@github.com:<your Github handle>/transformers.git
+   $ cd transformers
+   $ git remote add upstream git@github.com:huggingface/transformers.git
+   ```
+3. Create a new branch to hold your development changes:
+   ```bash
+   $ git checkout -b a-descriptive-name-for-my-changes
+   ```
+   **Do not** work on the `master` branch.
+4. Set up a development environment by running the following command in a virtual environment:
+   ```bash
+   $ pip install -r requirements-dev.txt
+   ```
+5. Develop the features on your branch. Add changed files using `git add` and
+   then `git commit` to record your changes locally:
+   ```bash
+   $ git add modified_file.py
+   $ git commit
+   ```
+   Please write [good commit
+   messages](https://chris.beams.io/posts/git-commit/). It
+   is a good idea to sync your copy of the code with the original repository
+   regularly. This way you can quickly account for changes:
+   ```bash
+   $ git fetch upstream
+   $ git rebase upstream/master
+   ```
+   Push the changes to your account using:
+   ```bash
+   $ git push -u origin a-descriptive-name-for-my-changes
+   ```
+6. Once you are satisfied (**and the checklist below is happy too**), go to the
+   webpage of your fork on Github. Click on 'Pull request' to send your changes
+   to the project maintainers for review.
+7. It's ok if maintainers ask you for changes. It happens to core contributors
+   too! So everyone can see the changes in the Pull request, work in your local
+   branch and push the changes to your fork. They will automatically appear in
+   the pull request.
+### Checklist
+1. The title of your pull request should be a summary of its contribution;
+2. If your pull request addresses an issue, please mention the issue number in
+   the pull request description to make sure they are linked (and people
+   consulting the issue know you are working on it);
+3. To indicate a work in progress please prefix the title with `[WIP]`. This
+   helps avoid duplicated work and distinguishes it from PRs that are ready
+   to be merged;
+4. Make sure pre-existing tests still pass;
+5. Add high-coverage tests. No quality test, no merge;
+6. All public methods must have informative docstrings.
+### Style guide
+For documentation strings, `transformers` follows the [Google
+style](https://google.github.io/styleguide/pyguide.html).
+#### This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md)
--- a/README.md
+++ b/README.md
@@ -56,7 +56,7 @@ Choose the right framework for every part of a model's lifetime
 | [Quick tour: Usage](#quick-tour) | Tokenizers & models usage: Bert and GPT-2 |
 | [Quick tour: TF 2.0 and PyTorch ](#Quick-tour-TF-20-training-and-PyTorch-interoperability) | Train a TF 2.0 model in 10 lines of code, load it in PyTorch |
 | [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
-| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
+| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-transformers to transformers |
 | [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
 | [Documentation](https://huggingface.co/transformers/) | Full API documentation and more |
@@ -67,7 +67,7 @@ This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3
 ### With pip
 First you need to install one of, or both, TensorFlow 2.0 and PyTorch.
-Please refere to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform.
+Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform.
 When TensorFlow 2.0 and/or PyTorch has been installed, 🤗 Transformers can be installed using pip as follows:
@@ -78,9 +78,9 @@ pip install transformers
 ### From source
 Here also, you first need to install one of, or both, TensorFlow 2.0 and PyTorch.
-Please refere to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform.
+Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform.
-When TensorFlow 2.0 and/or PyTorch has been installed, you can install from source by cloning the repository and runing:
+When TensorFlow 2.0 and/or PyTorch has been installed, you can install from source by cloning the repository and running:
 ```bash
 pip install [--editable] .
@@ -88,7 +88,7 @@ pip install [--editable] .
 ### Tests
-A series of tests is included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
+A series of tests are included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
 These tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
@@ -105,7 +105,7 @@ python -m pytest -sv ./examples/
 You should check out our [`swift-coreml-transformers`](https://github.com/huggingface/swift-coreml-transformers) repo.
-It contains an example of a conversion script from a Pytorch trained Transformer model (here, `GPT-2`) to a CoreML model that runs on iOS devices.
+It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
 At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models to productizing them in CoreML, or prototype a model or an app in CoreML then research its hyperparameters or architecture from TensorFlow 2.0 and/or PyTorch. Super exciting!
@@ -120,8 +120,7 @@ At some point in the future, you'll be able to seamlessly move from pre-training
 5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
 7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5
+8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation).
-) by Victor Sanh, Lysandre Debut and Thomas Wolf.
 9. **[CTRL](https://github.com/salesforce/ctrl/)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
@@ -182,24 +181,24 @@ for model_class in BERT_MODEL_CLASSES:
    # Load pretrained model/tokenizer
    model = model_class.from_pretrained('bert-base-uncased')
-# Models can return full list of hidden-states & attentions weights at each layer
+    # Models can return full list of hidden-states & attentions weights at each layer
-model = model_class.from_pretrained(pretrained_weights,
+    model = model_class.from_pretrained(pretrained_weights,
                                        output_hidden_states=True,
                                        output_attentions=True)
-input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
+    input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
-all_hidden_states, all_attentions = model(input_ids)[-2:]
+    all_hidden_states, all_attentions = model(input_ids)[-2:]
-# Models are compatible with Torchscript
+    # Models are compatible with Torchscript
-model = model_class.from_pretrained(pretrained_weights, torchscript=True)
+    model = model_class.from_pretrained(pretrained_weights, torchscript=True)
-traced_model = torch.jit.trace(model, (input_ids,))
+    traced_model = torch.jit.trace(model, (input_ids,))
-# Simple serialization for models and tokenizers
+    # Simple serialization for models and tokenizers
-model.save_pretrained('./directory/to/save/')  # save
+    model.save_pretrained('./directory/to/save/')  # save
-model = model_class.from_pretrained('./directory/to/save/')  # re-load
+    model = model_class.from_pretrained('./directory/to/save/')  # re-load
-tokenizer.save_pretrained('./directory/to/save/')  # save
+    tokenizer.save_pretrained('./directory/to/save/')  # save
-tokenizer = tokenizer_class.from_pretrained('./directory/to/save/')  # re-load
+    tokenizer = BertTokenizer.from_pretrained('./directory/to/save/')  # re-load
-# SOTA examples for GLUE, SQUAD, text generation...
+    # SOTA examples for GLUE, SQUAD, text generation...
 ```
 ## Quick tour TF 2.0 training and PyTorch interoperability
@@ -396,7 +395,7 @@ This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-s
 ### `run_generation.py`: Text generation with GPT, GPT-2, CTRL, Transformer-XL and XLNet
 A conditional generation script is also included to generate text from a prompt.
-The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
+The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
 Here is how to run the script with the small version of OpenAI GPT-2 model:
@@ -436,9 +435,9 @@ Here is a quick summary of what you should take care of when migrating from `pyt
 ### Models always output `tuples`
-The main breaking change when migrating from `pytorch-pretrained-bert` to `transformers` is that the models forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
+The main breaking change when migrating from `pytorch-pretrained-bert` to `transformers` is that every model's forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
-The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
+The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
 In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
@@ -470,11 +469,11 @@ By enabling the configuration option `output_hidden_states`, it was possible to
 ### Serialization
-Breaking change in the `from_pretrained()`method:
+Breaking change in the `from_pretrained()` method:
-1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
+1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them, don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
-2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead which can break derived model classes build based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/transformers/pull/866) by forwarding the the model `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuration class attributes.
+2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead, which can break derived model classes built based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/transformers/pull/866) by forwarding to the model's `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuration class attributes.
 Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
@@ -546,4 +545,4 @@ for batch in train_data:
 ## Citation
-At the moment, there is no paper associated to Transformers but we are working on preparing one. In the meantime, please include a mention of the library and a link to the present repository if you use this work in a published or open-source project.
+At the moment, there is no paper associated with Transformers but we are working on preparing one. In the meantime, please include a mention of the library and a link to the present repository if you use this work in a published or open-source project.
--- a/docs/README.md
+++ b/docs/README.md
@@ -50,7 +50,7 @@ make html
 ---
 **NOTE**
-If you are adding/removing elements from the toc-tree or from any strutural item, it is recommended to clean the build
+If you are adding/removing elements from the toc-tree or from any structural item, it is recommended to clean the build
 directory before rebuilding. Run the following command to clean and build:
 ```bash

--- a/docs/source/_static/css/huggingface.css
+++ b/docs/source/_static/css/huggingface.css
-huggingface.css
 /* The literal code blocks */
 .rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
    color: #6670FF;
@@ -44,11 +42,11 @@ huggingface.css
 /* The text items on the toc tree */
 .wy-menu-vertical a {
    color: #FFFFDD;
-    font-family: Calibre-Light;
+    font-family: Calibre-Light, sans-serif;
 }
 .wy-menu-vertical header, .wy-menu-vertical p.caption{
    color: white;
-    font-family: Calibre-Light;
+    font-family: Calibre-Light, sans-serif;
 }
 /* The color inside the selected toc tree block */
@@ -85,7 +83,7 @@ a {
    border-right: solid 2px #FB8D68;
    border-left: solid 2px #FB8D68;
    color: #FB8D68;
-    font-family: Calibre-Light;
+    font-family: Calibre-Light, sans-serif;
    border-top: none;
    font-style: normal !important;
 }
@@ -136,14 +134,14 @@ a {
 /* class and method names in doc */
 .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) code.descclassname{
-    font-family: Calibre;
+    font-family: Calibre, sans-serif;
    font-size: 20px !important;
 }
 /* class name in doc*/
 .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname{
    margin-right: 10px;
-    font-family: Calibre-Medium;
+    font-family: Calibre-Medium, sans-serif;
 }
 /* Method and class parameters */
@@ -160,17 +158,17 @@ a {
 /* FONTS */
 body{
-    font-family: Calibre;
+    font-family: Calibre, sans-serif;
    font-size: 16px;
 }
 h1 {
-    font-family: Calibre-Thin;
+    font-family: Calibre-Thin, sans-serif;
    font-size: 70px;
 }
 h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
-    font-family: Calibre-Medium;
+    font-family: Calibre-Medium, sans-serif;
 }
 @font-face {
@@ -196,4 +194,3 @@ h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
    src: url(./Calibre-Thin.otf);
    font-weight:400;
 }
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -46,8 +46,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
 5. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
 7. `RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_ (from Facebook), released together with the paper a `Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf.
+8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2 <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
 .. toctree::
    :maxdepth: 2
@@ -63,6 +62,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
    migration
    bertology
    torchscript
+    multilingual
 .. toctree::
    :maxdepth: 2

--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
-Installation
+# Installation
-================================================
 Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.1.0
-With pip
+## With pip
-^^^^^^^^
 PyTorch Transformers can be installed using pip as follows:
-.. code-block:: bash
+``` bash
+pip install transformers
+```
-   pip install transformers
+## From source
-From source
-^^^^^^^^^^^
 To install from source, clone the repository and install with:
-.. code-block:: bash
+``` bash
+git clone https://github.com/huggingface/transformers.git
-    git clone https://github.com/huggingface/transformers.git
+cd transformers
-    cd transformers
+pip install [--editable] .
-    pip install [--editable] .
+```
-Tests
+## Tests
-^^^^^
-An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the `tests folder <https://github.com/huggingface/transformers/tree/master/transformers/tests>`_ and examples tests in the `examples folder <https://github.com/huggingface/transformers/tree/master/examples>`_.
+An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
 Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
 Run all the tests from the root of the cloned repository with the commands:
-.. code-block:: bash
+``` bash
+python -m pytest -sv ./transformers/tests/
-    python -m pytest -sv ./transformers/tests/
+python -m pytest -sv ./examples/
-    python -m pytest -sv ./examples/
+```
-OpenAI GPT original tokenization workflow
+## OpenAI GPT original tokenization workflow
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (use version 4.4.3 if you are using Python 2) and ``SpaCy`` :
+If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` (use version 4.4.3 if you are using Python 2) and `SpaCy`:
-.. code-block:: bash
+``` bash
+pip install spacy ftfy==4.4.3
+python -m spacy download en
+```
-   pip install spacy ftfy==4.4.3
+If you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenize using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
-   python -m spacy download en
-If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
+## Note on model downloads (Continuous Integration or large-scale deployments)
-Note on model downloads (Continuous Integration or large-scale deployments)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 If you expect to be downloading large volumes of models (more than 1,000) from our hosted bucket (for instance through your CI setup, or a large-scale production deployment), please cache the model files on your end. It will be way faster, and cheaper. Feel free to contact us privately if you need any help.
+## Do you want to run a Transformer model on a mobile device?
-Do you want to run a Transformer model on a mobile device?
+You should check out our [swift-coreml-transformers](https://github.com/huggingface/swift-coreml-transformers) repo.
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-You should check out our `swift-coreml-transformers <https://github.com/huggingface/swift-coreml-transformers>`_ repo.
-It contains an example of a conversion script from a Pytorch trained Transformer model (here, ``GPT-2``) to a CoreML model that runs on iOS devices.
-It also contains an implementation of BERT for Question answering.
+It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
 At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
 or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
--- a/docs/source/multilingual.rst
+++ b/docs/source/multilingual.rst
+Multi-lingual models
+================================================
+Most of the models available in this library are mono-lingual models (English, Chinese and German). A few
+multi-lingual models are available and have different mechanisms than mono-lingual models.
+This page details the usage of these models.
+The two models that currently support multiple languages are BERT and XLM.
+XLM
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
+be split in two categories: the checkpoints that make use of language embeddings, and those that don't.
+XLM & Language Embeddings
+------------------------------------------------
+This section concerns the following checkpoints:
+- ``xlm-mlm-ende-1024`` (Masked language modeling, English-German)
+- ``xlm-mlm-enfr-1024`` (Masked language modeling, English-French)
+- ``xlm-mlm-enro-1024`` (Masked language modeling, English-Romanian)
+- ``xlm-mlm-xnli15-1024`` (Masked language modeling, XNLI languages)
+- ``xlm-mlm-tlm-xnli15-1024`` (Masked language modeling + Translation, XNLI languages)
+- ``xlm-clm-enfr-1024`` (Causal language modeling, English-French)
+- ``xlm-clm-ende-1024`` (Causal language modeling, English-German)
+These checkpoints require language embeddings that will specify the language used at inference time. These language
+embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
+these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes
+from the tokenizer.
+Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
+.. code-block::
+    import torch
+    from transformers import XLMTokenizer, XLMWithLMHeadModel
+    tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
+    model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
+The different languages this model/tokenizer handles, as well as the ids of these languages are visible using the
+``lang2id`` attribute:
+.. code-block::
+    print(tokenizer.lang2id)  # {'en': 0, 'fr': 1}
+These ids should be used when passing a language parameter during a model pass. Let's define our inputs:
+.. code-block::
+    input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
+We should now define the language embedding by using the previously defined language id. We want to create a tensor
+filled with the appropriate language ids, of the same size as ``input_ids``. For English, the id is 0:
+.. code-block::
+    language_id = tokenizer.lang2id['en']  # 0
+    langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])
+    # We reshape it to be of size (batch_size, sequence_length)
+    langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1)
+You can then feed it all as input to your model:
+.. code-block::
+    outputs = model(input_ids, langs=langs)
+The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/run_generation.py>`__
+can generate text using the CLM checkpoints from XLM, using the language embeddings.
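The example above uses a batch of one. For a batch of several sequences, the language tensor needs one language id per token position. The sketch below shows the shape logic with plain Python lists so it runs without PyTorch; the helper `build_langs` and the token ids are hypothetical, and in practice the result would be wrapped in `torch.tensor(...)` before being passed as `langs`.

```python
from typing import List


def build_langs(input_ids: List[List[int]], language_id: int) -> List[List[int]]:
    """Return a nested list shaped like `input_ids`, filled with `language_id`."""
    return [[language_id] * len(sequence) for sequence in input_ids]


# Same mapping as the one printed from tokenizer.lang2id for the English-French checkpoint.
lang2id = {"en": 0, "fr": 1}

# Made-up token ids standing in for the output of tokenizer.encode(...).
input_ids = [[14, 2045, 30, 7], [9, 311, 5, 5]]

langs = build_langs(input_ids, lang2id["fr"])
print(langs)
```

Every row of `langs` has the same length as the matching row of `input_ids`, which is the shape requirement described above.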
+XLM without Language Embeddings
+------------------------------------------------
+This section concerns the following checkpoints:
+- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
+- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)
+These checkpoints do not require language embeddings at inference time. These models are used to obtain generic
+sentence representations, unlike the previously mentioned XLM checkpoints.
+BERT
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+BERT has two checkpoints that can be used for multi-lingual tasks:
+- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
+- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
+These checkpoints do not require language embeddings at inference time. They should identify the language
+used in the context and infer accordingly.
\ No newline at end of file
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -98,6 +98,12 @@ Here is the full list of the currently provided pretrained models together with
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-clm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
 |                   |                                                            | | XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German wikipedia           |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-17-1280``                                        | | 16-layer, 1280-hidden, 16-heads                                                                                                     |
+|                   |                                                            | | XLM model trained with MLM (Masked Language Modeling) on 17 languages.                                                              |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-100-1280``                                       | | 16-layer, 1280-hidden, 16-heads                                                                                                     |
+|                   |                                                            | | XLM model trained with MLM (Masked Language Modeling) on 100 languages.                                                             |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | RoBERTa           | ``roberta-base``                                           | | 12-layer, 768-hidden, 12-heads, 125M parameters                                                                                     |
 |                   |                                                            | | RoBERTa using the BERT-base architecture                                                                                            |
@@ -113,11 +119,15 @@ Here is the full list of the currently provided pretrained models together with
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | DistilBERT        | ``distilbert-base-uncased``                                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
 |                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint                                                   |
-|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                            |
+|                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``distilbert-base-uncased-distilled-squad``                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
 |                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer.                 |
-|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                            |
+|                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``distilgpt2``                                             | | 6-layer, 768-hidden, 12-heads, 82M parameters                                                                                       |
+|                   |                                                            | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint.                                                               |
+|                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | CTRL              | ``ctrl``                                                   | | 48-layer, 1280-hidden, 16-heads, 1.6B parameters                                                                                    |
 |                   |                                                            | | Salesforce's Large-sized CTRL English model                                                                                         |

--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
-# DistilBERT
+# Distil*
-This folder contains the original code used to train DistilBERT as well as examples showcasing how to use DistilBERT.
+This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT and DistilGPT2.
+**2019, October 3rd - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2.
 **2019, September 19th - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
-## What is DistilBERT
+## What is Distil*
+Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting large-scale trained Transformer models into production.
-DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
+We have applied the same method to GPT2 and release the weights of the compressed model. On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 15.0 compared to 18.5 for DistilGPT2 (after fine-tuning on the train set).
-For more information on DistilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
+For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108). The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performance.
-). *Please note that we will publish a formal write-up with updated and more complete results in the near future (September 19th).*
-Here's the updated results on the dev sets of GLUE:
+Here are the results on the dev sets of GLUE:
-| Model      | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B | WNLI |
+| Model      | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP  | RTE  | SST-2| STS-B| WNLI |
 | :---:      |    :---:    | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:|
 | BERT-base  |  **77.6**   | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
-| DistilBERT |  **75.2**   | 49.1 | 81.8 | 90.2 | 87.0 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
+| DistilBERT |  **76.8**   | 49.1 | 81.8 | 90.2 | 90.2 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
 ## Setup
@@ -26,10 +29,12 @@ This part of the library has only be tested with Python3.6+. There are few speci
 ## How to use DistilBERT
-Transformers includes two pre-trained DistilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DistilBERT):
+Transformers includes three pre-trained Distil* models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DistilBERT):
 - `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
 - `distilbert-base-uncased-distilled-squad`: A version of `distilbert-base-uncased` fine-tuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.9 on the dev set (for comparison, Bert `bert-base-uncased` version reaches an 88.5 F1 score).
+- `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, 768 dimension and 12 heads, totalizing 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
+- and more to come! 🤗🤗🤗
 Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to have a consistent naming between the library models.
@@ -42,9 +47,11 @@ outputs = model(input_ids)
 last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 ```
-## How to train DistilBERT
+Similarly, using DistilGPT2 simply consists in calling the GPT2 classes from a different pretrained checkpoint: `model = GPT2Model.from_pretrained('distilgpt2')`.
+## How to train Distil*
-In the following, we will explain how you can train your own compressed model.
+In the following, we will explain how you can train DistilBERT.
 ### A. Preparing the data
@@ -57,7 +64,8 @@ First, we will binarize the data, i.e. tokenize the data and convert each token
 ```bash
 python scripts/binarized_data.py \
    --file_path data/dump.txt \
-    --bert_tokenizer bert-base-uncased \
+    --tokenizer_type bert \
+    --tokenizer_name bert-base-uncased \
    --dump_file data/binarized_text
 ```
@@ -66,7 +74,8 @@ Our implementation of masked language modeling loss follows [XLM](https://github
 ```bash
 python scripts/token_counts.py \
    --data_file data/binarized_text.bert-base-uncased.pickle \
-    --token_counts_dump data/token_counts.bert-base-uncased.pickle
+    --token_counts_dump data/token_counts.bert-base-uncased.pickle \
+    --vocab_size 30522
 ```
 ### B. Training
@@ -75,6 +84,12 @@ Training with distillation is really simple once you have pre-processed the data
 ```bash
 python train.py \
+    --student_type distilbert \
+    --student_config training_configs/distilbert-base-uncased.json \
+    --teacher_type bert \
+    --teacher_name bert-base-uncased \
+    --alpha_ce 5.0 --alpha_mlm 2.0 --alpha_cos 1.0 --mlm \
+    --freeze_pos_embs \
    --dump_path serialization_dir/my_first_training \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts data/token_counts.bert-base-uncased.pickle \
@@ -83,7 +98,7 @@ python train.py \
 By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please look in `train.py` or run `python train.py --help` to list them.
-We highly encourage you to use distributed training for training DistilBert as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
+We highly encourage you to use distributed training for training DistilBERT as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
 ```bash
 export NODE_RANK=0
@@ -105,11 +120,17 @@ python -m torch.distributed.launch \
    train.py \
        --force \
        --n_gpu $WORLD_SIZE \
+        --student_type distilbert \
+        --student_config training_configs/distilbert-base-uncased.json \
+        --teacher_type bert \
+        --teacher_name bert-base-uncased \
+        --alpha_ce 0.33 --alpha_mlm 0.33 --alpha_cos 0.33 --mlm \
+        --freeze_pos_embs \
+        --dump_path serialization_dir/my_first_training \
        --data_file data/binarized_text.bert-base-uncased.pickle \
-        --token_counts data/token_counts.bert-base-uncased.pickle \
+        --token_counts data/token_counts.bert-base-uncased.pickle
-        --dump_path serialization_dir/my_first_distillation
 ```
-**Tips:** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and use `--from_pretrained_weights` and `--from_pretrained_config` arguments to use this initialization for the distilled training!
+**Tips:** Starting distilled training with a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use the `--student_pretrained_weights` argument to use this initialization for the distilled training!
 Happy distillation!
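The `--alpha_ce` and `--alpha_mlm` flags above weight a distillation term and a language modeling term. The sketch below illustrates one common way such a loss is built: a temperature-softened KL divergence between teacher and student distributions, combined with the language modeling loss as a weighted sum. It is a pure-Python illustration under the assumption of the standard knowledge distillation formulation; the function names are hypothetical, the cosine and MSE terms are omitted, and it is not claimed to match `train.py` or `distiller.py` line for line.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over `logits` after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """Temperature-softened KL between teacher and student for one position."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    # Scaling by temperature ** 2 keeps gradient magnitudes comparable
    # across temperatures in the standard formulation.
    return kl_divergence(teacher_probs, student_probs) * temperature ** 2


def total_loss(loss_ce: float, loss_lm: float, alpha_ce: float, alpha_lm: float) -> float:
    """Weighted sum of the distillation term and the language modeling term (assumed form)."""
    return alpha_ce * loss_ce + alpha_lm * loss_lm


same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(same, different > 0.0)
```

A student that reproduces the teacher's logits exactly gets a distillation term of zero, so only the language modeling term remains in the weighted sum.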
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -12,8 +12,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" The distiller to distil DistilBERT
+""" The distiller to distil the student.
-    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
+    Adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
 """
 import os
 import math
@@ -28,16 +28,19 @@ import torch
 import torch.nn as nn
 import torch.nn.functional as F
 from torch.optim import AdamW
+from torch.utils.data.distributed import DistributedSampler
+from torch.utils.data import RandomSampler, BatchSampler, DataLoader
 from transformers import WarmupLinearSchedule
 from utils import logger
-from dataset import Dataset
+from lm_seqs_dataset import LmSeqsDataset
+from grouped_batch_sampler import GroupedBatchSampler, create_lengths_groups
 class Distiller:
    def __init__(self,
                 params: dict,
-                 dataloader: Dataset,
+                 dataset: LmSeqsDataset,
                 token_probs: torch.tensor,
                 student: nn.Module,
                 teacher: nn.Module):
@@ -50,24 +53,36 @@ class Distiller:
        self.student = student
        self.teacher = teacher
-        self.dataloader = dataloader
+        self.student_config = student.config
-        if self.params.n_gpu > 1:
+        self.vocab_size = student.config.vocab_size
-            self.dataloader.split()
-        self.get_iterator(seed=params.seed)
+        if params.n_gpu <= 1:
+            sampler = RandomSampler(dataset)
+        else:
+            sampler = DistributedSampler(dataset)
+        if params.group_by_size:
+            groups = create_lengths_groups(lengths=dataset.lengths, k=params.max_model_input_size)
+            sampler = GroupedBatchSampler(sampler=sampler, group_ids=groups, batch_size=params.batch_size)
+        else:
+            sampler = BatchSampler(sampler=sampler, batch_size=params.batch_size, drop_last=False)
+        self.dataloader = DataLoader(dataset=dataset,
+                                     batch_sampler=sampler,
+                                     collate_fn=dataset.batch_sequences)
        self.temperature = params.temperature
        assert self.temperature > 0.
        self.alpha_ce = params.alpha_ce
        self.alpha_mlm = params.alpha_mlm
+        self.alpha_clm = params.alpha_clm
        self.alpha_mse = params.alpha_mse
        self.alpha_cos = params.alpha_cos
-        assert self.alpha_ce >= 0.
-        assert self.alpha_mlm >= 0.
-        assert self.alpha_mse >= 0.
-        assert self.alpha_cos >= 0.
-        assert self.alpha_ce + self.alpha_mlm + self.alpha_mse + self.alpha_cos > 0.
+        self.mlm = params.mlm
+        if self.mlm:
+            logger.info(f'Using MLM loss for LM step.')
            self.mlm_mask_prop = params.mlm_mask_prop
            assert 0.0 <= self.mlm_mask_prop <= 1.0
            assert params.word_mask + params.word_keep + params.word_rand == 1.0
@@ -77,6 +92,8 @@ class Distiller:
            if self.fp16:
                self.pred_probs = self.pred_probs.half()
                self.token_probs = self.token_probs.half()
+        else:
+            logger.info(f'Using CLM loss for LM step.')
        self.epoch = 0
        self.n_iter = 0
@@ -86,12 +103,13 @@ class Distiller:
        self.last_loss = 0
        self.last_loss_ce = 0
        self.last_loss_mlm = 0
+        self.last_loss_clm = 0
        if self.alpha_mse > 0.: self.last_loss_mse = 0
        if self.alpha_cos > 0.: self.last_loss_cos = 0
        self.last_log = 0
        self.ce_loss_fct = nn.KLDivLoss(reduction='batchmean')
-        self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
+        self.lm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
        if self.alpha_mse > 0.:
            self.mse_loss_fct = nn.MSELoss(reduction='sum')
        if self.alpha_cos > 0.:
@@ -99,7 +117,7 @@ class Distiller:
        logger.info('--- Initializing model optimizer')
        assert params.gradient_accumulation_steps >= 1
-        self.num_steps_epoch = int(len(self.dataloader) / params.batch_size) + 1
+        self.num_steps_epoch = len(self.dataloader)
        num_train_optimization_steps = int(self.num_steps_epoch / params.gradient_accumulation_steps * params.n_epoch) + 1
        no_decay = ['bias', 'LayerNorm.weight']
@@ -140,42 +158,17 @@ class Distiller:
                logger.info("Using nn.parallel.DistributedDataParallel for distributed training.")
                self.student = DistributedDataParallel(self.student,
                                                       device_ids=[params.local_rank],
-                                                       output_device=params.local_rank)
+                                                       output_device=params.local_rank,
+                                                       find_unused_parameters=True)
        self.is_master = params.is_master
        if self.is_master:
            logger.info('--- Initializing Tensorboard')
            self.tensorboard = SummaryWriter(log_dir=os.path.join(self.dump_path, 'log', 'train'))
-            self.tensorboard.add_text(tag='config', text_string=str(self.params), global_step=0)
+            self.tensorboard.add_text(tag='config/training', text_string=str(self.params), global_step=0)
+            self.tensorboard.add_text(tag='config/student', text_string=str(self.student_config), global_step=0)
-    def get_iterator(self,
-                     seed: int = None):
-        """
-        Initialize the data iterator.
-        Each process has its own data iterator (iterating on his own random portion of the dataset).
-        Input:
-        ------
-            seed: `int` - The random seed.
-        """
-        logger.info('--- Initializing Data Iterator')
-        self.data_iterator = self.dataloader.get_iterator(seed=seed)
-    def get_batch(self):
+    def prepare_batch_mlm(self,
-        """
-        Call the data iterator to output a new batch.
-        If the data iterator went through the whole dataset, create a new iterator.
-        """
-        assert hasattr(self, 'data_iterator')
-        try:
-            x = next(self.data_iterator)
-        except StopIteration:
-            logger.warning('--- Went through the whole dataset. Creating new data iterator.')
-            self.data_iterator = self.dataloader.get_iterator()
-            x = next(self.data_iterator)
-        return x
-    def prepare_batch(self,
                          batch):
        """
        Prepare the batch: from the token_ids and the lengths, compute the attention mask and the masked label for MLM.
@@ -222,7 +215,7 @@ class Distiller:
                assert pred_mask.sum().item() % 8 == 0, pred_mask.sum().item()
        _token_ids_real = token_ids[pred_mask]
-        _token_ids_rand = _token_ids_real.clone().random_(self.params.vocab_size)
+        _token_ids_rand = _token_ids_real.clone().random_(self.vocab_size)
        _token_ids_mask = _token_ids_real.clone().fill_(self.params.special_tok_ids['mask_token'])
        probs = torch.multinomial(self.pred_probs, len(_token_ids_real), replacement=True)
        _token_ids = _token_ids_mask * (probs == 0).long() + _token_ids_real * (probs == 1).long() + _token_ids_rand * (probs == 2).long()
@@ -230,8 +223,41 @@ class Distiller:
        mlm_labels[~pred_mask] = -1 # previously `mlm_labels[1-pred_mask] = -1`, cf pytorch 1.2.0 compatibility
+        # sanity checks
+        assert 0 <= token_ids.min() <= token_ids.max() < self.vocab_size
        return token_ids, attn_mask, mlm_labels
+    def prepare_batch_clm(self,
+                          batch):
+        """
+        Prepare the batch: from the token_ids and the lengths, compute the attention mask and the labels for CLM.
+        Input:
+        ------
+            batch: `Tuple`
+                token_ids: `torch.tensor(bs, seq_length)` - The token ids for each of the sequence. It is padded.
+                lengths: `torch.tensor(bs)` - The lengths of each of the sequences in the batch.
+        Output:
+        -------
+            token_ids: `torch.tensor(bs, seq_length)` - The token ids (possibly rounded and padded by `round_batch`).
+            attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
+            clm_labels: `torch.tensor(bs, seq_length)` - The causal language modeling labels. There is a -1 where there is nothing to predict.
+        """
+        token_ids, lengths = batch
+        token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
+        assert token_ids.size(0) == lengths.size(0)
+        attn_mask = (torch.arange(token_ids.size(1), dtype=torch.long, device=lengths.device) < lengths[:, None])
+        clm_labels = token_ids.new(token_ids.size()).copy_(token_ids)
+        clm_labels[~attn_mask] = -1 # previously `clm_labels[1-attn_mask] = -1`, cf pytorch 1.2.0 compatibility
+        # sanity checks
+        assert 0 <= token_ids.min() <= token_ids.max() < self.vocab_size
+        return token_ids, attn_mask, clm_labels
    def round_batch(self,
                    x: torch.tensor,
                    lengths: torch.tensor):
@@ -269,7 +295,10 @@ class Distiller:
        if ml1 % 8 != 0:
            pad = 8 - (ml1 % 8)
            ml2 = ml1 + pad
+            if self.mlm:
                pad_id = self.params.special_tok_ids['pad_token']
+            else:
+                pad_id = self.params.special_tok_ids['unk_token']
            padding_tensor = torch.zeros(bs2, pad, dtype=torch.long, device=x.device).fill_(pad_id)
            x = torch.cat([x, padding_tensor], 1)
            assert x.size() == (bs2, ml2)
@@ -292,14 +321,16 @@ class Distiller:
            if self.multi_gpu:
                torch.distributed.barrier()
-            iter_bar = trange(self.num_steps_epoch, desc="-Iter", disable=self.params.local_rank not in [-1, 0])
+            iter_bar = tqdm(self.dataloader, desc="-Iter", disable=self.params.local_rank not in [-1, 0])
-            for __ in range(self.num_steps_epoch):
+            for batch in iter_bar:
-                batch = self.get_batch()
                if self.params.n_gpu > 0:
                    batch = tuple(t.to(f'cuda:{self.params.local_rank}') for t in batch)
-                token_ids, attn_mask, mlm_labels = self.prepare_batch(batch=batch)
-                self.step(input_ids=token_ids, attention_mask=attn_mask, mlm_labels=mlm_labels)
+                if self.mlm:
+                    token_ids, attn_mask, lm_labels = self.prepare_batch_mlm(batch=batch)
+                else:
+                    token_ids, attn_mask, lm_labels = self.prepare_batch_clm(batch=batch)
+                self.step(input_ids=token_ids, attention_mask=attn_mask, lm_labels=lm_labels)
                iter_bar.update()
                iter_bar.set_postfix({'Last_loss': f'{self.last_loss:.2f}',
@@ -317,7 +348,7 @@ class Distiller:
    def step(self,
             input_ids: torch.tensor,
             attention_mask: torch.tensor,
-             mlm_labels: torch.tensor):
+             lm_labels: torch.tensor):
        """
        One optimization step: forward of student AND teacher, backward on the loss (for gradient accumulation),
        and possibly a parameter update (depending on the gradient accumulation).
@@ -326,17 +357,22 @@ class Distiller:
        ------
        input_ids: `torch.tensor(bs, seq_length)` - The token ids.
        attention_mask: `torch.tensor(bs, seq_length)` - The attention mask for self attention.
-        mlm_labels: `torch.tensor(bs, seq_length)` - The masked language modeling labels.
+        lm_labels: `torch.tensor(bs, seq_length)` - The language modeling labels (mlm labels for MLM and clm labels for CLM).
        """
+        if self.mlm:
            s_logits, s_hidden_states = self.student(input_ids=input_ids, attention_mask=attention_mask)     # (bs, seq_length, voc_size)
            with torch.no_grad():
                t_logits, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size)
+        else:
+            s_logits, _, s_hidden_states = self.student(input_ids=input_ids, attention_mask=None)            # (bs, seq_length, voc_size)
+            with torch.no_grad():
+                t_logits, _, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=None)           # (bs, seq_length, voc_size)
        assert s_logits.size() == t_logits.size()
        #https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/model/net.py#L100
        #https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
        if self.params.restrict_ce_to_mask:
-            mask = (mlm_labels>-1).unsqueeze(-1).expand_as(s_logits)   # (bs, seq_lenth, voc_size)
+            mask = (lm_labels>-1).unsqueeze(-1).expand_as(s_logits)    # (bs, seq_length, voc_size)
        else:
            mask = attention_mask.unsqueeze(-1).expand_as(s_logits)    # (bs, seq_length, voc_size)
        s_logits_slct = torch.masked_select(s_logits, mask)            # (bs * seq_length * voc_size) modulo the 1s in mask
@@ -348,13 +384,20 @@ class Distiller:
        loss_ce = self.ce_loss_fct(F.log_softmax(s_logits_slct/self.temperature, dim=-1),
                                   F.softmax(t_logits_slct/self.temperature, dim=-1)) * (self.temperature)**2
        loss = self.alpha_ce*loss_ce
        if self.alpha_mlm > 0.:
-            loss_mlm = self.mlm_loss_fct(s_logits.view(-1, s_logits.size(-1)), mlm_labels.view(-1))
+            loss_mlm = self.lm_loss_fct(s_logits.view(-1, s_logits.size(-1)), lm_labels.view(-1))
            loss += self.alpha_mlm * loss_mlm
+        if self.alpha_clm > 0.:
+            shift_logits = s_logits[..., :-1, :].contiguous()
+            shift_labels = lm_labels[..., 1:].contiguous()
+            loss_clm = self.lm_loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
+                                        shift_labels.view(-1))
+            loss += self.alpha_clm * loss_clm
        if self.alpha_mse > 0.:
            loss_mse = self.mse_loss_fct(s_logits_slct, t_logits_slct)/s_logits_slct.size(0) # Reproducing batchmean reduction
            loss += self.alpha_mse * loss_mse
        if self.alpha_cos > 0.:
            s_hidden_states = s_hidden_states[-1]                              # (bs, seq_length, dim)
            t_hidden_states = t_hidden_states[-1]                              # (bs, seq_length, dim)
@@ -376,6 +419,8 @@ class Distiller:
        self.last_loss_ce = loss_ce.item()
        if self.alpha_mlm > 0.:
            self.last_loss_mlm = loss_mlm.item()
+        if self.alpha_clm > 0.:
+            self.last_loss_clm = loss_clm.item()
        if self.alpha_mse > 0.:
            self.last_loss_mse = loss_mse.item()
        if self.alpha_cos > 0.:
@@ -452,6 +497,8 @@ class Distiller:
        self.tensorboard.add_scalar(tag="losses/loss_ce", scalar_value=self.last_loss_ce, global_step=self.n_total_iter)
        if self.alpha_mlm > 0.:
            self.tensorboard.add_scalar(tag="losses/loss_mlm", scalar_value=self.last_loss_mlm, global_step=self.n_total_iter)
+        if self.alpha_clm > 0.:
+            self.tensorboard.add_scalar(tag="losses/loss_clm", scalar_value=self.last_loss_clm, global_step=self.n_total_iter)
        if self.alpha_mse > 0.:
            self.tensorboard.add_scalar(tag="losses/loss_mse", scalar_value=self.last_loss_mse, global_step=self.n_total_iter)
        if self.alpha_cos > 0.:
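
Note on the `alpha_clm` branch above: the logits at position `t` are scored against the label at position `t + 1`, which is why the logits lose their last position and the labels lose their first. Below is a minimal pure-Python sketch of that shift for a single sequence. It assumes `-1` is the ignored label value (suggested by the `lm_labels>-1` mask, but not confirmed for `lm_loss_fct` in this diff), and `shifted_clm_loss` is a hypothetical name.

```python
import math


def shifted_clm_loss(logits, labels, ignore_index=-1):
    """Average next-token negative log-likelihood for one sequence.

    logits: per-position score lists, shape (seq_len, vocab_size)
    labels: token ids, shape (seq_len,)
    """
    shift_logits = logits[:-1]  # drop the last position: nothing follows it
    shift_labels = labels[1:]   # drop the first label: nothing predicts it
    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == ignore_index:  # assumption: -1 marks positions to skip
            continue
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / count if count else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
labels = [0, 0, 1]
loss = shifted_clm_loss(logits, labels)
```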

--- a/examples/distillation/grouped_batch_sampler.py
+++ b/examples/distillation/grouped_batch_sampler.py
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Adapted from PyTorch Vision (https://github.com/pytorch/vision/blob/master/references/detection/group_by_aspect_ratio.py)
+"""
+import bisect
+import copy
+from collections import defaultdict
+import numpy as np
+from torch.utils.data.sampler import BatchSampler, Sampler
+from utils import logger
+def _quantize(x, bins):
+    bins = copy.deepcopy(bins)
+    bins = sorted(bins)
+    quantized = list(map(lambda y: bisect.bisect_right(bins, y), x))
+    return quantized
+def create_lengths_groups(lengths, k=0):
+    bins = np.arange(start=3, stop=k, step=4).tolist() if k > 0 else [10]
+    groups = _quantize(lengths, bins)
+    # count number of elements per group
+    counts = np.unique(groups, return_counts=True)[1]
+    fbins = [0] + bins + [np.inf]
+    logger.info("Using {} as bins for aspect lengths quantization".format(fbins))
+    logger.info("Count of instances per bin: {}".format(counts))
+    return groups
+class GroupedBatchSampler(BatchSampler):
+    """
+    Wraps another sampler to yield a mini-batch of indices.
+    It enforces that the batch only contain elements from the same group.
+    It also tries to provide mini-batches which follows an ordering which is
+    as close as possible to the ordering from the original sampler.
+    Arguments:
+        sampler (Sampler): Base sampler.
+        group_ids (list[int]): If the sampler produces indices in range [0, N),
+            `group_ids` must be a list of `N` ints which contains the group id of each sample.
+            The group ids must be a continuous set of integers starting from
+            0, i.e. they must be in the range [0, num_groups).
+        batch_size (int): Size of mini-batch.
+    """
+    def __init__(self, sampler, group_ids, batch_size):
+        if not isinstance(sampler, Sampler):
+            raise ValueError(
+                "sampler should be an instance of "
+                "torch.utils.data.Sampler, but got sampler={}".format(sampler)
+            )
+        self.sampler = sampler
+        self.group_ids = group_ids
+        self.batch_size = batch_size
+    def __iter__(self):
+        buffer_per_group = defaultdict(list)
+        samples_per_group = defaultdict(list)
+        num_batches = 0
+        for idx in self.sampler:
+            group_id = self.group_ids[idx]
+            buffer_per_group[group_id].append(idx)
+            samples_per_group[group_id].append(idx)
+            if len(buffer_per_group[group_id]) == self.batch_size:
+                yield buffer_per_group[group_id] #TODO
+                num_batches += 1
+                del buffer_per_group[group_id]
+            assert len(buffer_per_group[group_id]) < self.batch_size
+        # now we have run out of elements that satisfy
+        # the group criteria, let's return the remaining
+        # elements so that the size of the sampler is
+        # deterministic
+        expected_num_batches = len(self)
+        num_remaining = expected_num_batches - num_batches
+        if num_remaining > 0:
+            # for the remaining batches, group the batches by similar lengths
+            batch_idx = []
+            for group_id, idxs in sorted(buffer_per_group.items(), key=lambda x: x[0]):
+                batch_idx.extend(idxs)
+                if len(batch_idx) >= self.batch_size:
+                    yield batch_idx[:self.batch_size]
+                    batch_idx = batch_idx[self.batch_size:]
+                    num_remaining -= 1
+            if len(batch_idx) > 0:
+                yield batch_idx
+                num_remaining -= 1
+        assert num_remaining == 0
+    def __len__(self):
+        """
+        Return the number of mini-batches rather than the number of samples.
+        """
+        return (len(self.sampler) + self.batch_size - 1) // self.batch_size
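
Note on `grouped_batch_sampler.py`: the bucketing idea can be sketched without PyTorch. The version below is simplified and is not the sampler itself. `quantize` mirrors `_quantize`, while `grouped_batches` is a hypothetical generator that handles leftovers by merging the partial buffers in group order.

```python
import bisect
from collections import defaultdict


def quantize(lengths, bins):
    """Map each length to the index of the bin it falls into."""
    bins = sorted(bins)
    return [bisect.bisect_right(bins, length) for length in lengths]


def grouped_batches(indices, group_ids, batch_size):
    """Yield batches of indices; every full batch comes from a single group."""
    buffers = defaultdict(list)
    for idx in indices:
        group = group_ids[idx]
        buffers[group].append(idx)
        if len(buffers[group]) == batch_size:
            yield buffers.pop(group)
    # Leftovers: merge the partially filled buffers in group order, so that
    # neighbouring (similar-length) groups end up in the same batch.
    leftover = [idx for group in sorted(buffers) for idx in buffers[group]]
    for start in range(0, len(leftover), batch_size):
        yield leftover[start:start + batch_size]


lengths = [2, 12, 3, 15, 4, 11, 30]
group_ids = quantize(lengths, bins=[10, 20])
batches = list(grouped_batches(range(len(lengths)), group_ids, batch_size=2))
```

With these inputs the number of batches equals `ceil(7 / 2)`, matching what `__len__` reports for the real sampler.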
--- a/examples/distillation/dataset.py
+++ b/examples/distillation/dataset.py
@@ -12,30 +12,33 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Dataloaders to train DistilBERT
+""" Dataset to train distilled models
    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
 """
-from typing import List
-import math
-from itertools import chain
-from collections import Counter
-import numpy as np
 import torch
+from torch.utils.data import Dataset
+import numpy as np
 from utils import logger
-class Dataset:
+class LmSeqsDataset(Dataset):
+    """Custom Dataset wrapping language modeling sequences.
+    Each sample will be retrieved by indexing the list of token_ids and their corresponding lengths.
+    Input:
+    ------
+        params: `NameSpace` parameters
+        data: `List[np.array[int]]`
+    """
    def __init__(self,
                 params,
                 data):
        self.params = params
-        self.tokens_per_batch = params.tokens_per_batch
-        self.batch_size = params.batch_size
-        self.shuffle = params.shuffle
-        self.group_by_size = params.group_by_size
        self.token_ids = np.array(data)
-        self.lengths = np.uint16([len(t) for t in data])
+        self.lengths = np.array([len(t) for t in data])
        self.check()
        self.remove_long_sequences()
@@ -43,6 +46,9 @@ class Dataset:
        self.check()
        self.print_statistics()
+    def __getitem__(self, index):
+        return (self.token_ids[index], self.lengths[index])
    def __len__(self):
        return len(self.lengths)
@@ -51,12 +57,14 @@ class Dataset:
        Some sanity checks
        """
        assert len(self.token_ids) == len(self.lengths)
+        assert all(self.lengths[i] == len(self.token_ids[i]) for i in range(len(self.lengths)))
    def remove_long_sequences(self):
        """
-        Sequences that are too long are splitted by chunk of max_position_embeddings.
+        Sequences that are too long are split into chunks of at most max_model_input_size tokens.
        """
-        indices = self.lengths >= self.params.max_position_embeddings
+        max_len = self.params.max_model_input_size
+        indices = self.lengths > max_len
        logger.info(f'Splitting {sum(indices)} too long sequences.')
        def divide_chunks(l, n):
@@ -64,10 +72,13 @@ class Dataset:
        new_tok_ids = []
        new_lengths = []
+        if self.params.mlm:
            cls_id, sep_id = self.params.special_tok_ids['cls_token'], self.params.special_tok_ids['sep_token']
-        max_len = self.params.max_position_embeddings
+        else:
+            cls_id, sep_id = self.params.special_tok_ids['bos_token'], self.params.special_tok_ids['eos_token']
        for seq_, len_ in zip(self.token_ids, self.lengths):
+            assert (seq_[0] == cls_id) and (seq_[-1] == sep_id), seq_
            if len_ <= max_len:
                new_tok_ids.append(seq_)
                new_lengths.append(len_)
@@ -79,6 +90,7 @@ class Dataset:
                    if sub_s[-1] != sep_id:
                        sub_s = np.insert(sub_s, len(sub_s), sep_id)
                    assert len(sub_s) <= max_len
+                    assert (sub_s[0] == cls_id) and (sub_s[-1] == sep_id), sub_s
                    sub_seqs.append(sub_s)
                new_tok_ids.extend(sub_seqs)
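
Note on `remove_long_sequences`: the body of `divide_chunks` is elided in this hunk, so the sketch below is an assumption-laden illustration rather than the file's code. It assumes chunks of `max_len - 2` tokens (leaving room for the two boundary tokens) and uses hypothetical ids `101` and `102` for the start and end tokens.

```python
def split_long_sequence(seq, max_len, cls_id, sep_id):
    """Split `seq` into chunks of at most `max_len` tokens, each wrapped in cls/sep."""
    if len(seq) <= max_len:
        return [list(seq)]
    chunks = []
    step = max_len - 2  # assumption: reserve two slots for the boundary tokens
    for start in range(0, len(seq), step):
        sub = list(seq[start:start + step])
        if sub[0] != cls_id:
            sub.insert(0, cls_id)
        if sub[-1] != sep_id:
            sub.append(sep_id)
        # Same invariants the diff asserts on every sub-sequence
        assert len(sub) <= max_len
        assert sub[0] == cls_id and sub[-1] == sep_id
        chunks.append(sub)
    return chunks


# Hypothetical ids: 101 = cls/bos, 102 = sep/eos
chunks = split_long_sequence([101, 1, 2, 3, 4, 5, 6, 102],
                             max_len=5, cls_id=101, sep_id=102)
```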
@@ -113,89 +125,27 @@ class Dataset:
        # nb_unkown = sum([(t==unk_idx).sum() for t in self.token_ids])
        # logger.info(f'{nb_unkown} unknown tokens (covering {100*nb_unkown/data_len:.2f}% of the data)')
-    def select_data(self, a: int, b: int):
-        """
-        Select a subportion of the data.
-        """
-        n_sequences = len(self)
-        assert 0 <= a < b <= n_sequences, ValueError(f'`0 <= a < b <= n_sequences` is not met with a={a} and b={b}')
-        logger.info(f'Selecting sequences from {a} to {b} (excluded).')
-        self.token_ids = self.token_ids[a:b]
-        self.lengths = self.lengths[a:b]
-        self.check()
-    def split(self):
-        """
-        Distributed training: split the data accross the processes.
-        """
-        assert self.params.n_gpu > 1
-        logger.info('Splitting the data accross the processuses.')
-        n_seq = len(self)
-        n_seq_per_procesus = n_seq // self.params.world_size
-        a = n_seq_per_procesus * self.params.global_rank
-        b = a + n_seq_per_procesus
-        self.select_data(a=a, b=b)
    def batch_sequences(self,
-                        token_ids: List[List[int]],
+                        batch):
-                        lengths: List[int]):
        """
        Do the padding and transform into torch.tensor.
        """
+        token_ids = [t[0] for t in batch]
+        lengths = [t[1] for t in batch]
        assert len(token_ids) == len(lengths)
        # Max for paddings
        max_seq_len_ = max(lengths)
        # Pad token ids
+        if self.params.mlm:
            pad_idx = self.params.special_tok_ids['pad_token']
+        else:
+            pad_idx = self.params.special_tok_ids['unk_token']
        tk_ = [list(t.astype(int)) + [pad_idx]*(max_seq_len_-len(t)) for t in token_ids]
        assert len(tk_) == len(token_ids)
        assert all(len(t) == max_seq_len_ for t in tk_)
        tk_t = torch.tensor(tk_)      # (bs, max_seq_len_)
-        lg_t = torch.tensor(lengths.astype(int))  # (bs)
+        lg_t = torch.tensor(lengths)  # (bs)
        return tk_t, lg_t
-    def get_batches_iterator(self,
-                             batches):
-        """
-        Return an iterator over batches.
-        """
-        for sequences_ids in batches:
-            token_ids, lengths = self.batch_sequences(self.token_ids[sequences_ids],
-                                                    self.lengths[sequences_ids])
-            yield (token_ids, lengths)
-    def get_iterator(self,
-                     seed: int = None):
-        """
-        Return a data iterator.
-        """
-        rng = np.random.RandomState(seed)
-        n_sequences = len(self)
-        indices = np.arange(n_sequences)
-        if self.group_by_size:
-            indices = indices[np.argsort(self.lengths[indices], kind='mergesort')]
-        if self.tokens_per_batch == -1:
-            batches = np.array_split(indices, math.ceil(len(indices) * 1. / self.batch_size))
-        else:
-            assert self.tokens_per_batch > 0
-            batch_ids = np.cumsum(self.lengths[indices]) // self.tokens_per_batch
-            _, bounds = np.unique(batch_ids, return_index=True)
-            batches = [indices[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
-            if bounds[-1] < len(indices):
-                batches.append(indices[bounds[-1]:])
-        if self.shuffle:
-            rng.shuffle(batches)
-        assert n_sequences == sum([len(x) for x in batches])
-        assert self.lengths[indices].sum() == sum([self.lengths[x].sum() for x in batches])
-        return self.get_batches_iterator(batches=batches)
--- a/examples/distillation/requirements.txt
+++ b/examples/distillation/requirements.txt
@@ -3,4 +3,4 @@ tensorboard>=1.14.0
 tensorboardX==1.8
 psutil==5.6.3
 scipy==1.3.1
-pytorch_transformers==1.2.0
+transformers==2.0.0
--- a/examples/distillation/run_squad_w_distillation.py
+++ b/examples/distillation/run_squad_w_distillation.py
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" This is the exact same script as `examples/run_squad.py` (as of 2019, October 4th) with an additional and optional step of distillation."""
+from __future__ import absolute_import, division, print_function
+import argparse
+import logging
+import os
+import random
+import glob
+import numpy as np
+import torch
+from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
+                              TensorDataset)
+from torch.utils.data.distributed import DistributedSampler
+import torch.nn.functional as F
+import torch.nn as nn
+from tqdm import tqdm, trange
+from tensorboardX import SummaryWriter
+from transformers import (WEIGHTS_NAME, BertConfig,
+                                  BertForQuestionAnswering, BertTokenizer,
+                                  XLMConfig, XLMForQuestionAnswering,
+                                  XLMTokenizer, XLNetConfig,
+                                  XLNetForQuestionAnswering,
+                                  XLNetTokenizer,
+                                  DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)
+from transformers import AdamW, WarmupLinearSchedule
+from ..utils_squad import (read_squad_examples, convert_examples_to_features,
+                         RawResult, write_predictions,
+                         RawResultExtended, write_predictions_extended)
+# The following import is the official SQuAD evaluation script (2.0).
+# You can remove it from the dependencies if you are using this script outside of the library
+# We've added it here for automated tests (see examples/test_examples.py file)
+from ..utils_squad_evaluate import EVAL_OPTS, main as evaluate_on_squad
+logger = logging.getLogger(__name__)
+ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) \
+                  for conf in (BertConfig, XLNetConfig, XLMConfig)), ())
+MODEL_CLASSES = {
+    'bert': (BertConfig, BertForQuestionAnswering, BertTokenizer),
+    'xlnet': (XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer),
+    'xlm': (XLMConfig, XLMForQuestionAnswering, XLMTokenizer),
+    'distilbert': (DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)
+}
+def set_seed(args):
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    if args.n_gpu > 0:
+        torch.cuda.manual_seed_all(args.seed)
+def to_list(tensor):
+    return tensor.detach().cpu().tolist()
+def train(args, train_dataset, model, tokenizer, teacher=None):
+    """ Train the model """
+    if args.local_rank in [-1, 0]:
+        tb_writer = SummaryWriter()
+    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
+    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
+    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
+    if args.max_steps > 0:
+        t_total = args.max_steps
+        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
+    else:
+        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
+    # Prepare optimizer and schedule (linear warmup and decay)
+    no_decay = ['bias', 'LayerNorm.weight']
+    optimizer_grouped_parameters = [
+        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
+        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
+        ]
+    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
+    scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
+    if args.fp16:
+        try:
+            from apex import amp
+        except ImportError:
+            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
+        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
+    # multi-gpu training (should be after apex fp16 initialization)
+    if args.n_gpu > 1:
+        model = torch.nn.DataParallel(model)
+    # Distributed training (should be after apex fp16 initialization)
+    if args.local_rank != -1:
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
+                                                          output_device=args.local_rank,
+                                                          find_unused_parameters=True)
+    # Train!
+    logger.info("***** Running training *****")
+    logger.info("  Num examples = %d", len(train_dataset))
+    logger.info("  Num Epochs = %d", args.num_train_epochs)
+    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
+    logger.info("  Total train batch size (w. parallel, distributed & accumulation) = %d",
+                   args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
+    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
+    logger.info("  Total optimization steps = %d", t_total)
+    global_step = 0
+    tr_loss, logging_loss = 0.0, 0.0
+    model.zero_grad()
+    train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
+    set_seed(args)  # Added here for reproducibility (even between python 2 and 3)
+    for _ in train_iterator:
+        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
+        for step, batch in enumerate(epoch_iterator):
+            model.train()
+            if teacher is not None:
+                teacher.eval()
+            batch = tuple(t.to(args.device) for t in batch)
+            inputs = {'input_ids':       batch[0],
+                      'attention_mask':  batch[1],
+                      'start_positions': batch[3],
+                      'end_positions':   batch[4]}
+            if args.model_type != 'distilbert':
+                inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]
+            if args.model_type in ['xlnet', 'xlm']:
+                inputs.update({'cls_index': batch[5],
+                               'p_mask':       batch[6]})
+            outputs = model(**inputs)
+            loss, start_logits_stu, end_logits_stu = outputs
+            # Distillation loss
+            if teacher is not None:
+                if 'token_type_ids' not in inputs:
+                    inputs['token_type_ids'] = None if args.teacher_type == 'xlm' else batch[2]
+                with torch.no_grad():
+                    start_logits_tea, end_logits_tea = teacher(input_ids=inputs['input_ids'],
+                                                               token_type_ids=inputs['token_type_ids'],
+                                                               attention_mask=inputs['attention_mask'])
+                assert start_logits_tea.size() == start_logits_stu.size()
+                assert end_logits_tea.size() == end_logits_stu.size()
+                loss_fct = nn.KLDivLoss(reduction='batchmean')
+                loss_start = loss_fct(F.log_softmax(start_logits_stu/args.temperature, dim=-1),
+                                      F.softmax(start_logits_tea/args.temperature, dim=-1)) * (args.temperature**2)
+                loss_end = loss_fct(F.log_softmax(end_logits_stu/args.temperature, dim=-1),
+                                    F.softmax(end_logits_tea/args.temperature, dim=-1)) * (args.temperature**2)
+                loss_ce = (loss_start + loss_end)/2.
+                loss = args.alpha_ce*loss_ce + args.alpha_squad*loss
+            if args.n_gpu > 1:
+                loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training
+            if args.gradient_accumulation_steps > 1:
+                loss = loss / args.gradient_accumulation_steps
+            if args.fp16:
+                with amp.scale_loss(loss, optimizer) as scaled_loss:
+                    scaled_loss.backward()
+                torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
+            else:
+                loss.backward()
+                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
+            tr_loss += loss.item()
+            if (step + 1) % args.gradient_accumulation_steps == 0:
+                optimizer.step()
+                scheduler.step()  # Update learning rate schedule
+                model.zero_grad()
+                global_step += 1
+                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
+                    # Log metrics
+                    if args.local_rank == -1 and args.evaluate_during_training:  # Only evaluate when single GPU otherwise metrics may not average well
+                        results = evaluate(args, model, tokenizer)
+                        for key, value in results.items():
+                            tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
+                    tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
+                    tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
+                    logging_loss = tr_loss
+                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
+                    # Save model checkpoint
+                    output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
+                    if not os.path.exists(output_dir):
+                        os.makedirs(output_dir)
+                    model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+                    model_to_save.save_pretrained(output_dir)
+                    torch.save(args, os.path.join(output_dir, 'training_args.bin'))
+                    logger.info("Saving model checkpoint to %s", output_dir)
+            if args.max_steps > 0 and global_step > args.max_steps:
+                epoch_iterator.close()
+                break
+        if args.max_steps > 0 and global_step > args.max_steps:
+            train_iterator.close()
+            break
+    if args.local_rank in [-1, 0]:
+        tb_writer.close()
+    return global_step, tr_loss / global_step
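
Note on the distillation term in `train`: `nn.KLDivLoss` fed with the student's log-softmax and the teacher's softmax computes KL(teacher || student) on temperature-softened distributions, then the result is multiplied by `temperature**2`. Below is a minimal pure-Python sketch for a single example. `softmax` and `distillation_kl` are hypothetical helper names, and no batch averaging is shown.

```python
import math


def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T**2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return kl * temperature ** 2


same = distillation_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], temperature=2.0)
diff = distillation_kl([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], temperature=2.0)
```

The loss is zero when student and teacher agree and positive otherwise. In the script, the start and end terms are averaged into `loss_ce` and mixed with the SQuAD loss through `args.alpha_ce` and `args.alpha_squad`.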
+def evaluate(args, model, tokenizer, prefix=""):
+    dataset, examples, features = load_and_cache_examples(args, tokenizer, evaluate=True, output_examples=True)
+    if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
+        os.makedirs(args.output_dir)
+    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
+    # Note that DistributedSampler samples randomly
+    eval_sampler = SequentialSampler(dataset) if args.local_rank == -1 else DistributedSampler(dataset)
+    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
+    # Eval!
+    logger.info("***** Running evaluation {} *****".format(prefix))
+    logger.info("  Num examples = %d", len(dataset))
+    logger.info("  Batch size = %d", args.eval_batch_size)
+    all_results = []
+    for batch in tqdm(eval_dataloader, desc="Evaluating"):
+        model.eval()
+        batch = tuple(t.to(args.device) for t in batch)
+        with torch.no_grad():
+            inputs = {'input_ids':      batch[0],
+                      'attention_mask': batch[1]
+                      }
+            if args.model_type != 'distilbert':
+                inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]  # XLM doesn't use segment_ids
+            example_indices = batch[3]
+            if args.model_type in ['xlnet', 'xlm']:
+                inputs.update({'cls_index': batch[4],
+                               'p_mask':    batch[5]})
+            outputs = model(**inputs)
+        for i, example_index in enumerate(example_indices):
+            eval_feature = features[example_index.item()]
+            unique_id = int(eval_feature.unique_id)
+            if args.model_type in ['xlnet', 'xlm']:
+                # XLNet uses a more complex post-processing procedure
+                result = RawResultExtended(unique_id            = unique_id,
+                                           start_top_log_probs  = to_list(outputs[0][i]),
+                                           start_top_index      = to_list(outputs[1][i]),
+                                           end_top_log_probs    = to_list(outputs[2][i]),
+                                           end_top_index        = to_list(outputs[3][i]),
+                                           cls_logits           = to_list(outputs[4][i]))
+            else:
+                result = RawResult(unique_id    = unique_id,
+                                   start_logits = to_list(outputs[0][i]),
+                                   end_logits   = to_list(outputs[1][i]))
+            all_results.append(result)
+    # Compute predictions
+    output_prediction_file = os.path.join(args.output_dir, "predictions_{}.json".format(prefix))
+    output_nbest_file = os.path.join(args.output_dir, "nbest_predictions_{}.json".format(prefix))
+    if args.version_2_with_negative:
+        output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
+    else:
+        output_null_log_odds_file = None
+    if args.model_type in ['xlnet', 'xlm']:
+        # XLNet uses a more complex post-processing procedure
+        write_predictions_extended(examples, features, all_results, args.n_best_size,
+                        args.max_answer_length, output_prediction_file,
+                        output_nbest_file, output_null_log_odds_file, args.predict_file,
+                        model.config.start_n_top, model.config.end_n_top,
+                        args.version_2_with_negative, tokenizer, args.verbose_logging)
+    else:
+        write_predictions(examples, features, all_results, args.n_best_size,
+                        args.max_answer_length, args.do_lower_case, output_prediction_file,
+                        output_nbest_file, output_null_log_odds_file, args.verbose_logging,
+                        args.version_2_with_negative, args.null_score_diff_threshold)
+    # Evaluate with the official SQuAD script
+    evaluate_options = EVAL_OPTS(data_file=args.predict_file,
+                                 pred_file=output_prediction_file,
+                                 na_prob_file=output_null_log_odds_file)
+    results = evaluate_on_squad(evaluate_options)
+    return results
+def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False):
+    if args.local_rank not in [-1, 0] and not evaluate:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset; the others will use the cache
+    # Load data features from cache or dataset file
+    input_file = args.predict_file if evaluate else args.train_file
+    cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format(
+        'dev' if evaluate else 'train',
+        list(filter(None, args.model_name_or_path.split('/'))).pop(),
+        str(args.max_seq_length)))
+    if os.path.exists(cached_features_file) and not args.overwrite_cache and not output_examples:
+        logger.info("Loading features from cached file %s", cached_features_file)
+        features = torch.load(cached_features_file)
+    else:
+        logger.info("Creating features from dataset file at %s", input_file)
+        examples = read_squad_examples(input_file=input_file,
+                                                is_training=not evaluate,
+                                                version_2_with_negative=args.version_2_with_negative)
+        features = convert_examples_to_features(examples=examples,
+                                                tokenizer=tokenizer,
+                                                max_seq_length=args.max_seq_length,
+                                                doc_stride=args.doc_stride,
+                                                max_query_length=args.max_query_length,
+                                                is_training=not evaluate)
+        if args.local_rank in [-1, 0]:
+            logger.info("Saving features into cached file %s", cached_features_file)
+            torch.save(features, cached_features_file)
+    if args.local_rank == 0 and not evaluate:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset; the others will use the cache
+    # Convert to Tensors and build dataset
+    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
+    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
+    all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
+    all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)
+    all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)
+    if evaluate:
+        all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
+        dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
+                                all_example_index, all_cls_index, all_p_mask)
+    else:
+        all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)
+        all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)
+        dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
+                                all_start_positions, all_end_positions,
+                                all_cls_index, all_p_mask)
+    if output_examples:
+        return dataset, examples, features
+    return dataset
+def main():
+    parser = argparse.ArgumentParser()
+    ## Required parameters
+    parser.add_argument("--train_file", default=None, type=str, required=True,
+                        help="SQuAD json for training. E.g., train-v1.1.json")
+    parser.add_argument("--predict_file", default=None, type=str, required=True,
+                        help="SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
+    parser.add_argument("--model_type", default=None, type=str, required=True,
+                        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
+    parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
+                        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
+    parser.add_argument("--output_dir", default=None, type=str, required=True,
+                        help="The output directory where the model checkpoints and predictions will be written.")
+    # Distillation parameters (optional)
+    parser.add_argument('--teacher_type', default=None, type=str,
+                        help="Teacher type. Teacher tokenizer and student (model) tokenizer must output the same tokenization. Only for distillation.")
+    parser.add_argument('--teacher_name_or_path', default=None, type=str,
+                        help="Path to the already SQuAD fine-tuned teacher model. Only for distillation.")
+    parser.add_argument('--alpha_ce', default=0.5, type=float,
+                        help="Distillation loss linear weight. Only for distillation.")
+    parser.add_argument('--alpha_squad', default=0.5, type=float,
+                        help="True SQuAD loss linear weight. Only for distillation.")
+    parser.add_argument('--temperature', default=2.0, type=float,
+                        help="Distillation temperature. Only for distillation.")
+    ## Other parameters
+    parser.add_argument("--config_name", default="", type=str,
+                        help="Pretrained config name or path if not the same as model_name")
+    parser.add_argument("--tokenizer_name", default="", type=str,
+                        help="Pretrained tokenizer name or path if not the same as model_name")
+    parser.add_argument("--cache_dir", default="", type=str,
+                        help="Where do you want to store the pre-trained models downloaded from s3")
+    parser.add_argument('--version_2_with_negative', action='store_true',
+                        help='If true, the SQuAD examples contain some that do not have an answer.')
+    parser.add_argument('--null_score_diff_threshold', type=float, default=0.0,
+                        help="If null_score - best_non_null is greater than the threshold, predict null.")
+    parser.add_argument("--max_seq_length", default=384, type=int,
+                        help="The maximum total input sequence length after WordPiece tokenization. Sequences "
+                             "longer than this will be truncated, and sequences shorter than this will be padded.")
+    parser.add_argument("--doc_stride", default=128, type=int,
+                        help="When splitting up a long document into chunks, how much stride to take between chunks.")
+    parser.add_argument("--max_query_length", default=64, type=int,
+                        help="The maximum number of tokens for the question. Questions longer than this will "
+                             "be truncated to this length.")
+    parser.add_argument("--do_train", action='store_true',
+                        help="Whether to run training.")
+    parser.add_argument("--do_eval", action='store_true',
+                        help="Whether to run eval on the dev set.")
+    parser.add_argument("--evaluate_during_training", action='store_true',
+                        help="Run evaluation during training at each logging step.")
+    parser.add_argument("--do_lower_case", action='store_true',
+                        help="Set this flag if you are using an uncased model.")
+    parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
+                        help="Batch size per GPU/CPU for training.")
+    parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
+                        help="Batch size per GPU/CPU for evaluation.")
+    parser.add_argument("--learning_rate", default=5e-5, type=float,
+                        help="The initial learning rate for Adam.")
+    parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
+                        help="Number of update steps to accumulate before performing a backward/update pass.")
+    parser.add_argument("--weight_decay", default=0.0, type=float,
+                        help="Weight decay to apply, if any.")
+    parser.add_argument("--adam_epsilon", default=1e-8, type=float,
+                        help="Epsilon for Adam optimizer.")
+    parser.add_argument("--max_grad_norm", default=1.0, type=float,
+                        help="Max gradient norm.")
+    parser.add_argument("--num_train_epochs", default=3.0, type=float,
+                        help="Total number of training epochs to perform.")
+    parser.add_argument("--max_steps", default=-1, type=int,
+                        help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
+    parser.add_argument("--warmup_steps", default=0, type=int,
+                        help="Linear warmup over warmup_steps.")
+    parser.add_argument("--n_best_size", default=20, type=int,
+                        help="The total number of n-best predictions to generate in the nbest_predictions.json output file.")
+    parser.add_argument("--max_answer_length", default=30, type=int,
+                        help="The maximum length of an answer that can be generated. This is needed because the start "
+                             "and end predictions are not conditioned on one another.")
+    parser.add_argument("--verbose_logging", action='store_true',
+                        help="If true, all of the warnings related to data processing will be printed. "
+                             "A number of warnings are expected for a normal SQuAD evaluation.")
+    parser.add_argument('--logging_steps', type=int, default=50,
+                        help="Log every X update steps.")
+    parser.add_argument('--save_steps', type=int, default=50,
+                        help="Save checkpoint every X update steps.")
+    parser.add_argument("--eval_all_checkpoints", action='store_true',
+                        help="Evaluate all checkpoints starting with the same prefix as model_name and ending with the step number")
+    parser.add_argument("--no_cuda", action='store_true',
+                        help="Do not use CUDA even when it is available")
+    parser.add_argument('--overwrite_output_dir', action='store_true',
+                        help="Overwrite the content of the output directory")
+    parser.add_argument('--overwrite_cache', action='store_true',
+                        help="Overwrite the cached training and evaluation sets")
+    parser.add_argument('--seed', type=int, default=42,
+                        help="random seed for initialization")
+    parser.add_argument("--local_rank", type=int, default=-1,
+                        help="local_rank for distributed training on gpus")
+    parser.add_argument('--fp16', action='store_true',
+                        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
+    parser.add_argument('--fp16_opt_level', type=str, default='O1',
+                        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. "
+                             "See details at https://nvidia.github.io/apex/amp.html")
+    parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
+    parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
+    args = parser.parse_args()
+    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
+        raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
+    # Setup distant debugging if needed
+    if args.server_ip and args.server_port:
+        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
+        import ptvsd
+        print("Waiting for debugger attach")
+        ptvsd.enable_attach(address=(args.server_ip, int(args.server_port)), redirect_output=True)  # --server_port is parsed as str; the address tuple needs an int port
+        ptvsd.wait_for_attach()
+    # Setup CUDA, GPU & distributed training
+    if args.local_rank == -1 or args.no_cuda:
+        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
+        args.n_gpu = torch.cuda.device_count()
+    else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
+        torch.cuda.set_device(args.local_rank)
+        device = torch.device("cuda", args.local_rank)
+        torch.distributed.init_process_group(backend='nccl')
+        args.n_gpu = 1
+    args.device = device
+    # Setup logging
+    logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
+                        datefmt = '%m/%d/%Y %H:%M:%S',
+                        level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
+    logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
+                    args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
+    # Set seed
+    set_seed(args)
+    # Load pretrained model and tokenizer
+    if args.local_rank not in [-1, 0]:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+    args.model_type = args.model_type.lower()
+    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
+    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
+    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
+    model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
+    if args.teacher_type is not None:
+        assert args.teacher_name_or_path is not None
+        assert args.alpha_ce > 0.
+        assert args.alpha_ce + args.alpha_squad > 0.
+        assert args.teacher_type != 'distilbert', "We constrain teachers not to be of type DistilBERT."
+        teacher_config_class, teacher_model_class, _ = MODEL_CLASSES[args.teacher_type]
+        teacher_config = teacher_config_class.from_pretrained(args.teacher_name_or_path)
+        teacher = teacher_model_class.from_pretrained(args.teacher_name_or_path, config=teacher_config)
+        teacher.to(args.device)
+    else:
+        teacher = None
+    if args.local_rank == 0:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+    model.to(args.device)
+    logger.info("Training/evaluation parameters %s", args)
+    # Training
+    if args.do_train:
+        train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
+        global_step, tr_loss = train(args, train_dataset, model, tokenizer, teacher=teacher)
+        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
+    # Save the trained model and the tokenizer
+    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
+        # Create output directory if needed
+        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
+            os.makedirs(args.output_dir)
+        logger.info("Saving model checkpoint to %s", args.output_dir)
+        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
+        # They can then be reloaded using `from_pretrained()`
+        model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+        model_to_save.save_pretrained(args.output_dir)
+        tokenizer.save_pretrained(args.output_dir)
+        # Good practice: save your training arguments together with the trained model
+        torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
+        # Load a trained model and vocabulary that you have fine-tuned
+        model = model_class.from_pretrained(args.output_dir)
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
+        model.to(args.device)
+    # Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory
+    results = {}
+    if args.do_eval and args.local_rank in [-1, 0]:
+        checkpoints = [args.output_dir]
+        if args.eval_all_checkpoints:
+            checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
+            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce model loading logs
+        logger.info("Evaluate the following checkpoints: %s", checkpoints)
+        for checkpoint in checkpoints:
+            # Reload the model
+            global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
+            model = model_class.from_pretrained(checkpoint)
+            model.to(args.device)
+            # Evaluate
+            result = evaluate(args, model, tokenizer, prefix=global_step)
+            result = dict((k + ('_{}'.format(global_step) if global_step else ''), v) for k, v in result.items())
+            results.update(result)
+    logger.info("Results: {}".format(results))
+    return results
+if __name__ == "__main__":
+    main()
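Note on the feature cache in `load_and_cache_examples` above: the cache file name is built from the split (`train` or `dev`), the last non-empty component of `--model_name_or_path`, and `--max_seq_length`. The sketch below isolates that naming logic so it can be checked on its own. The helper name `cached_features_path` is hypothetical and is not part of the script; the body mirrors the expression in the diff. One consequence worth keeping in mind is that `--doc_stride` and `--max_query_length` are not part of the name, so changing either one presumably calls for `--overwrite_cache`.

```python
import os


def cached_features_path(input_file, model_name_or_path, max_seq_length, evaluate=False):
    """Hypothetical helper mirroring the cache-name expression in load_and_cache_examples."""
    split = 'dev' if evaluate else 'train'
    # filter(None, ...) drops empty path components, so a trailing slash
    # in model_name_or_path does not produce an empty model tag.
    model_tag = list(filter(None, model_name_or_path.split('/'))).pop()
    # The cache file sits next to the SQuAD json it was built from.
    return os.path.join(os.path.dirname(input_file),
                        'cached_{}_{}_{}'.format(split, model_tag, str(max_seq_length)))


if __name__ == "__main__":
    print(cached_features_path('data/train-v1.1.json', 'bert-base-uncased', 384))
    print(cached_features_path('data/dev-v1.1.json', '/models/distilbert/', 384, evaluate=True))
```

Because only the last path component is used, two different checkpoints stored in directories with the same final name would share a cache file; the features depend on the tokenizer, so that is safe only when both use the same tokenization.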
--- a/examples/distillation/scripts/binarized_data.py
+++ b/examples/distillation/scripts/binarized_data.py
@@ -13,14 +13,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Preprocessing script before training DistilBERT.
+Preprocessing script before distillation.
 """
 import argparse
 import pickle
 import random
 import time
 import numpy as np
-from transformers import BertTokenizer, RobertaTokenizer
+from transformers import BertTokenizer, RobertaTokenizer, GPT2Tokenizer
 import logging
 logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
@@ -32,7 +32,7 @@ def main():
    parser = argparse.ArgumentParser(description="Preprocess the data to avoid re-doing it several times by (tokenization + token_to_ids).")
    parser.add_argument('--file_path', type=str, default='data/dump.txt',
                        help='The path to the data.')
-    parser.add_argument('--tokenizer_type', type=str, default='bert', choices=['bert', 'roberta'])
+    parser.add_argument('--tokenizer_type', type=str, default='bert', choices=['bert', 'roberta', 'gpt2'])
    parser.add_argument('--tokenizer_name', type=str, default='bert-base-uncased',
                        help="The tokenizer to use.")
    parser.add_argument('--dump_file', type=str, default='data/dump',
@@ -43,10 +43,16 @@ def main():
    logger.info(f'Loading Tokenizer ({args.tokenizer_name})')
    if args.tokenizer_type == 'bert':
        tokenizer = BertTokenizer.from_pretrained(args.tokenizer_name)
+        bos = tokenizer.special_tokens_map['cls_token'] # `[CLS]`
+        sep = tokenizer.special_tokens_map['sep_token'] # `[SEP]`
    elif args.tokenizer_type == 'roberta':
        tokenizer = RobertaTokenizer.from_pretrained(args.tokenizer_name)
-    bos = tokenizer.special_tokens_map['bos_token'] # `[CLS]` for bert, `<s>` for roberta
+        bos = tokenizer.special_tokens_map['cls_token'] # `<s>`
-    sep = tokenizer.special_tokens_map['sep_token'] # `[SEP]` for bert, `</s>` for roberta
+        sep = tokenizer.special_tokens_map['sep_token'] # `</s>`
+    elif args.tokenizer_type == 'gpt2':
+        tokenizer = GPT2Tokenizer.from_pretrained(args.tokenizer_name)
+        bos = tokenizer.special_tokens_map['bos_token'] # `<|endoftext|>`
+        sep = tokenizer.special_tokens_map['eos_token'] # `<|endoftext|>`
    logger.info(f'Loading text from {args.file_path}')
    with open(args.file_path, 'r', encoding='utf8') as fp: