Unverified Commit 146c5212 authored by Lysandre Debut, committed by GitHub

Merge branch 'master' into add_models_special_tokens_to_specific_configs

parents f5b50c6b b623ddc0
---
language: english
thumbnail:
---
# [BERT](https://huggingface.co/deepset/bert-base-cased-squad2) fine-tuned on [QNLI](https://github.com/rhythmcao/QNLI) + compression ([BERT-of-Theseus](https://github.com/JetRunner/BERT-of-Theseus))
I took a [BERT model fine-tuned on **SQuAD v2**](https://huggingface.co/deepset/bert-base-cased-squad2) and then fine-tuned it on **QNLI** using **compression** (with a constant replacing rate) as proposed in **BERT-of-Theseus**.
## Details of the downstream task (QNLI):
### Getting the dataset
```bash
wget https://raw.githubusercontent.com/rhythmcao/QNLI/master/data/QNLI/train.tsv
wget https://raw.githubusercontent.com/rhythmcao/QNLI/master/data/QNLI/test.tsv
wget https://raw.githubusercontent.com/rhythmcao/QNLI/master/data/QNLI/dev.tsv
mkdir QNLI_dataset
mv *.tsv QNLI_dataset
```
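The TSV files should follow the GLUE QNLI layout (tab-separated: index, question, sentence, label); a quick sanity check before training, as a sketch:
```python
# Print the header and the first example of the training file
with open("QNLI_dataset/train.tsv") as f:
    print(f.readline().strip())
    print(f.readline().strip())
```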
### Model training
The model was trained on a Tesla P100 GPU with 25GB of RAM, using the following command:
```bash
!python /content/BERT-of-Theseus/run_glue.py \
--model_name_or_path deepset/bert-base-cased-squad2 \
--task_name qnli \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /content/QNLI_dataset \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 2e-5 \
--save_steps 2000 \
--num_train_epochs 50 \
--output_dir /content/ouput_dir \
--evaluate_during_training \
--replacing_rate 0.7 \
--steps_for_replacing 2500
```
## Metrics:
| Model | Accuracy |
|-----------------|------|
| BERT-base | 91.2 |
| BERT-of-Theseus | 88.8 |
| [bert-uncased-finetuned-qnli](https://huggingface.co/mrm8488/bert-uncased-finetuned-qnli) | 87.2 |
| DistilBERT      | 85.3 |
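
## Model in action (sketch)

A minimal usage sketch, not part of the original card: it assumes the resulting checkpoint is published on the model hub, with a placeholder id you would replace by the actual repo name.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "<this-model-id>"  # hypothetical placeholder: the card does not state the final repo name
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

question = "Who invented the telephone?"
sentence = "Alexander Graham Bell is credited with inventing the telephone."
inputs = tokenizer.encode_plus(question, sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]
print(logits.argmax(dim=1))  # predicted QNLI label index (entailment vs. not_entailment)
```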
> [See all my models](https://huggingface.co/models?search=mrm8488)
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% md\n"
}
},
"source": [
"## Tokenization doesn't have to be slow !\n",
"\n",
"### Introduction\n",
"\n",
"Before going deep into any Machine Learning or Deep Learning Natural Language Processing models, every practitioner\n",
"should find a way to map raw input strings to a representation understandable by a trainable model.\n",
"\n",
"One very simple approach would be to split inputs over every space and assign an identifier to each word. This approach\n",
"would look similar to the code below in python\n",
"\n",
"```python\n",
"s = \"very long corpus...\"\n",
"words = s.split(\" \") # Split over space\n",
"vocabulary = dict(enumerate(set(words))) # Map storing the word to it's corresponding id\n",
"```\n",
"\n",
"This approach might work well if your vocabulary remains small as it would store every word (or **token**) present in your original\n",
"input. Moreover, word variations like \"cat\" and \"cats\" would not share the same identifiers even if their meaning is \n",
"quite close.\n",
"\n",
"![tokenization_simple](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/11/tokenization.png)\n",
"\n",
"### Subtoken Tokenization\n",
"\n",
"To overcome the issues described above, recent works have been done on tokenization, leveraging \"subtoken\" tokenization.\n",
"**Subtokens** extends the previous splitting strategy to furthermore explode a word into grammatically logicial sub-components learned\n",
"from the data.\n",
"\n",
"Taking our previous example of the words __cat__ and __cats__, a sub-tokenization of the word __cats__ would be [cat, ##s]. Where the prefix _\"##\"_ indicates a subtoken of the initial input. \n",
"Such training algorithms might extract sub-tokens such as _\"##ing\"_, _\"##ed\"_ over English corpus.\n",
"\n",
"As you might think of, this kind of sub-tokens construction leveraging compositions of _\"pieces\"_ overall reduces the size\n",
"of the vocabulary you have to carry to train a Machine Learning model. On the other side, as one token might be exploded\n",
"into multiple subtokens, the input of your model might increase and become an issue on model with non-linear complexity over the input sequence's length. \n",
" \n",
"![subtokenization](https://nlp.fast.ai/images/multifit_vocabularies.png)\n",
" \n",
"Among all the tokenization algorithms, we can highlight a few subtokens algorithms used in Transformers-based SoTA models : \n",
"\n",
"- [Byte Pair Encoding (BPE) - Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909)\n",
"- [Word Piece - Japanese and Korean voice search (Schuster, M., and Nakajima, K., 2015)](https://research.google/pubs/pub37842/)\n",
"- [Unigram Language Model - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, T., 2018)](https://arxiv.org/abs/1804.10959)\n",
"- [Sentence Piece - A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Taku Kudo and John Richardson, 2018)](https://arxiv.org/abs/1808.06226)\n",
"\n",
"Going through all of them is out of the scope of this notebook, so we will just highlight how you can use them.\n",
"\n",
"### @huggingface/tokenizers library \n",
"Along with the transformers library, we @huggingface provide a blazing fast tokenization library\n",
"able to train, tokenize and decode dozens of Gb/s of text on a common multi-core machine.\n",
"\n",
"The library is written in Rust allowing us to take full advantage of multi-core parallel computations in a native and memory-aware way, on-top of which \n",
"we provide bindings for Python and NodeJS (more bindings may be added in the future). \n",
"\n",
"We designed the library so that it provides all the required blocks to create end-to-end tokenizers in an interchangeable way. In that sense, we provide\n",
"these various components: \n",
"\n",
"- **Normalizer**: Executes all the initial transformations over the initial input string. For example when you need to\n",
"lowercase some text, maybe strip it, or even apply one of the common unicode normalization process, you will add a Normalizer. \n",
"- **PreTokenizer**: In charge of splitting the initial input string. That's the component that decides where and how to\n",
"pre-segment the origin string. The simplest example would be like we saw before, to simply split on spaces.\n",
"- **Model**: Handles all the sub-token discovery and generation, this part is trainable and really dependant\n",
" of your input data.\n",
"- **Post-Processor**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA\n",
"models. For instance, for BERT it would wrap the tokenized sentence around [CLS] and [SEP] tokens.\n",
"- **Decoder**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according\n",
"to the `PreTokenizer` we used previously.\n",
"- **Trainer**: Provides training capabilities to each model.\n",
"\n",
"For each of the components above we provide multiple implementations:\n",
"\n",
"- **Normalizer**: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...\n",
"- **PreTokenizer**: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...\n",
"- **Model**: WordLevel, BPE, WordPiece\n",
"- **Post-Processor**: BertProcessor, ...\n",
"- **Decoder**: WordLevel, BPE, WordPiece, ...\n",
"\n",
"All of these building blocks can be combined to create working tokenization pipelines. \n",
"In the next section we will go over our first pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
"\n",
"For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a quite small input for the purpose of this notebook.\n",
"We will work with [the file from Peter Norving](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
"This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [],
"source": [
"!pip install tokenizers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [],
"source": [
"BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
"\n",
"# Let's download the file and save it somewhere\n",
"from requests import get\n",
"with open('big.txt', 'wb') as big_f:\n",
" response = get(BIG_FILE_URL, )\n",
" \n",
" if response.status_code == 200:\n",
" big_f.write(response.content)\n",
" else:\n",
" print(\"Unable to get the file: {}\".format(response.reason))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% md\n"
}
},
"source": [
" \n",
"Now that we have our training data we need to create the overall pipeline for the tokenizer\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [],
"source": [
"# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
"# the overall pipeline for various well-known tokenization algorithm. \n",
"# Everything described below can be replaced by the ByteLevelBPETokenizer class. \n",
"\n",
"from tokenizers import Tokenizer\n",
"from tokenizers.decoders import ByteLevel as ByteLevelDecoder\n",
"from tokenizers.models import BPE\n",
"from tokenizers.normalizers import Lowercase, NFKC, Sequence\n",
"from tokenizers.pre_tokenizers import ByteLevel\n",
"\n",
"# First we create an empty Byte-Pair Encoding model (i.e. not trained model)\n",
"tokenizer = Tokenizer(BPE.empty())\n",
"\n",
"# Then we enable lower-casing and unicode-normalization\n",
"# The Sequence normalizer allows us to combine multiple Normalizer that will be\n",
"# executed in order.\n",
"tokenizer.normalizer = Sequence([\n",
" NFKC(),\n",
" Lowercase()\n",
"])\n",
"\n",
"# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
"tokenizer.pre_tokenizer = ByteLevel()\n",
"\n",
"# And finally, let's plug a decoder so we can recover from a tokenized input to the original one\n",
"tokenizer.decoder = ByteLevelDecoder()"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Trained vocab size: 25000\n"
]
}
],
"source": [
"from tokenizers.trainers import BpeTrainer\n",
"\n",
"# We initialize our trainer, giving him the details about the vocabulary we want to generate\n",
"trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())\n",
"tokenizer.train(trainer, [\"big.txt\"])\n",
"\n",
"print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Et voilà ! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
"covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
"on the `Trainer` class, but the overall process should be very similar.\n",
"\n",
"We can save the content of the model to reuse it later."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['./vocab.json', './merges.txt']"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# You will see the generated files in the output.\n",
"tokenizer.model.save('.')"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Now, let load the trained model and start using out newly trained tokenizer"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n",
"Decoded string: this is a simple input to be tokenized\n"
]
}
],
"source": [
"# Let's tokenizer a simple input\n",
"tokenizer.model = BPE.from_files('vocab.json', 'merges.txt')\n",
"encoding = tokenizer.encode(\"This is a simple input to be tokenized\")\n",
"\n",
"print(\"Encoded string: {}\".format(encoding.tokens))\n",
"\n",
"decoded = tokenizer.decode(encoding.ids)\n",
"print(\"Decoded string: {}\".format(decoded))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The Encoding structure exposes multiple properties which are useful when working with transformers models\n",
"\n",
"- normalized_str: The input string after normalization (lower-casing, unicode, stripping, etc.)\n",
"- original_str: The input string as it was provided\n",
"- tokens: The generated tokens with their string representation\n",
"- input_ids: The generated tokens with their integer representation\n",
"- attention_mask: If your input has been padded by the tokenizer, then this would be a vector of 1 for any non padded token and 0 for padded ones.\n",
"- special_token_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
"- type_ids: If your was made of multiple \"parts\" such as (question, context), then this would be a vector with for each token the segment it belongs to.\n",
"- overflowing: If your has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts."
]
}
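,
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# An illustrative sketch (not part of the original notebook): inspect a few of the\n",
  "# Encoding properties described above. Attribute names follow the tokenizers API at\n",
  "# the time of writing and may vary across versions.\n",
  "print(\"tokens        : {}\".format(encoding.tokens))\n",
  "print(\"ids           : {}\".format(encoding.ids))\n",
  "print(\"type_ids      : {}\".format(encoding.type_ids))\n",
  "print(\"attention_mask: {}\".format(encoding.attention_mask))\n"
 ]
}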
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## How can I leverage State-of-the-Art Natural Language Models with only one line of code ?"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Newly introduced in transformers v2.3.0, **pipelines** provides a high-level, easy to use,\n",
"API for doing inference over a variety of downstream-tasks, including: \n",
"\n",
"- Sentence Classification (Sentiment Analysis): Indicate if the overall sentence is either positive or negative. _(Binary Classification task or Logitic Regression task)_\n",
"- Token Classification (Named Entity Recognition, Part-of-Speech tagging): For each sub-entities _(**tokens**)_ in the input, assign them a label _(Classification task)_.\n",
"- Question-Answering: Provided a tuple (question, context) the model should find the span of text in **content** answering the **question**.\n",
"- Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided **context**.\n",
"- Feature Extraction: Maps the input to a higher, multi-dimensional space learned from the data.\n",
"\n",
"Pipelines encapsulate the overall process of every NLP process:\n",
" \n",
" 1. Tokenization: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).\n",
" 2. Inference: Maps every tokens into a more meaningful representation. \n",
" 3. Decoding: Use the above representation to generate and/or extract the final output for the underlying task.\n",
"\n",
"The overall API is exposed to the end-user through the `pipeline()` method with the following \n",
"structure:\n",
"\n",
"```python\n",
"from transformers import pipeline\n",
"\n",
"# Using default model and tokenizer for the task\n",
"pipeline(\"<task-name>\")\n",
"\n",
"# Using a user-specified model\n",
"pipeline(\"<task-name>\", model=\"<model_name>\")\n",
"\n",
"# Using custom model/tokenizer as str\n",
"pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"!pip install transformers"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% code\n"
}
}
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code \n"
}
},
"outputs": [],
"source": [
"from __future__ import print_function\n",
"import ipywidgets as widgets\n",
"from transformers import pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 1. Sentence Classification - Sentiment Analysis"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "c9db53f30b9446c0af03268633a966c0"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "[{'label': 'POSITIVE', 'score': 0.9997656}]"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 8
}
],
"source": [
"nlp_sentence_classif = pipeline('sentiment-analysis')\n",
"nlp_sentence_classif('Such a nice weather outside !')"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 2. Token Classification - Named Entity Recognition"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "1e300789e22644f1aed66a5ed60e75c4"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "[{'word': 'Hu', 'score': 0.9970937967300415, 'entity': 'I-ORG'},\n {'word': '##gging', 'score': 0.9345750212669373, 'entity': 'I-ORG'},\n {'word': 'Face', 'score': 0.9787060022354126, 'entity': 'I-ORG'},\n {'word': 'French', 'score': 0.9981995820999146, 'entity': 'I-MISC'},\n {'word': 'New', 'score': 0.9983047246932983, 'entity': 'I-LOC'},\n {'word': '-', 'score': 0.8913455009460449, 'entity': 'I-LOC'},\n {'word': 'York', 'score': 0.9979523420333862, 'entity': 'I-LOC'}]"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 9
}
],
"source": [
"nlp_token_class = pipeline('ner')\n",
"nlp_token_class('Hugging Face is a French company based in New-York.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Question Answering"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "82aca58f1ea24b4cb37f16402e8a5923"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"name": "stderr",
"text": [
"convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 225.51it/s]\n",
"add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 2158.67it/s]\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "{'score': 0.9632966867654424, 'start': 42, 'end': 50, 'answer': 'New-York.'}"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 10
}
],
"source": [
"nlp_qa = pipeline('question-answering')\n",
"nlp_qa(context='Hugging Face is a French company based in New-York.', question='Where is based Hugging Face ?')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Text Generation - Mask Filling"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "49df2227b4fa4eb28dcdcfc3d9261d0f"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',\n 'score': 0.23106691241264343,\n 'token': 2201},\n {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',\n 'score': 0.0819825753569603,\n 'token': 12790},\n {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',\n 'score': 0.04769463092088699,\n 'token': 11559},\n {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',\n 'score': 0.047622501850128174,\n 'token': 6497},\n {'sequence': '<s> Hugging Face is a French company based in France</s>',\n 'score': 0.04130595177412033,\n 'token': 1470}]"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 11
}
],
"source": [
"nlp_fill = pipeline('fill-mask')\n",
"nlp_fill('Hugging Face is a French company based in <mask>')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Projection - Features Extraction "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "2af4cfb19e3243dda014d0f56b48f4b2"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "(1, 12, 768)"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 12
}
],
"source": [
"import numpy as np\n",
"nlp_features = pipeline('feature-extraction')\n",
"output = nlp_features('Hugging Face is a French company based in Paris')\n",
"np.array(output).shape # (Samples, Tokens, Vector Size)\n"
]
},
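{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# An illustrative sketch (not part of the original notebook): one common way to collapse\n",
  "# the per-token features above into a single fixed-size sentence vector is mean pooling.\n",
  "sentence_vector = np.array(output)[0].mean(axis=0)\n",
  "sentence_vector.shape  # (768,) for the default model used here\n"
 ]
},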
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Alright ! Now you have a nice picture of what is possible through transformers' pipelines, and there is more\n",
"to come in future releases. \n",
"\n",
"In the meantime, you can try the different pipelines with your own inputs"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "Dropdown(description='Task:', index=1, options=('sentiment-analysis', 'ner', 'fill_mask'), value='ner')",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "10bac065d46f4e4d9a8498dcc8104ecd"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": "Text(value='', description='Your input:', placeholder='Enter something')",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "2c5f1411f7a94714bc00f01b0e3b27b2"
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"task = widgets.Dropdown(\n",
" options=['sentiment-analysis', 'ner', 'fill_mask'],\n",
" value='ner',\n",
" description='Task:',\n",
" disabled=False\n",
")\n",
"\n",
"input = widgets.Text(\n",
" value='',\n",
" placeholder='Enter something',\n",
" description='Your input:',\n",
" disabled=False\n",
")\n",
"\n",
"def forward(_):\n",
" if len(input.value) > 0: \n",
" if task.value == 'ner':\n",
" output = nlp_token_class(input.value)\n",
" elif task.value == 'sentiment-analysis':\n",
" output = nlp_sentence_classif(input.value)\n",
" else:\n",
" if input.value.find('<mask>') == -1:\n",
" output = nlp_fill(input.value + ' <mask>')\n",
" else:\n",
" output = nlp_fill(input.value) \n",
" print(output)\n",
"\n",
"input.on_submit(forward)\n",
"display(task, input)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% Question Answering\n"
}
},
"outputs": [
{
"data": {
"text/plain": "Textarea(value='Einstein is famous for the general theory of relativity', description='Context:', placeholder=…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "019fde2343634e94b6f32d04f6350ec1"
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"context = widgets.Textarea(\n",
" value='Einstein is famous for the general theory of relativity',\n",
" placeholder='Enter something',\n",
" description='Context:',\n",
" disabled=False\n",
")\n",
"\n",
"query = widgets.Text(\n",
" value='Why is Einstein famous for ?',\n",
" placeholder='Enter something',\n",
" description='Question:',\n",
" disabled=False\n",
")\n",
"\n",
"def forward(_):\n",
" if len(context.value) > 0 and len(query.value) > 0: \n",
" output = nlp_qa(question=query.value, context=context.value) \n",
" print(output)\n",
"\n",
"query.on_submit(forward)\n",
"display(context, query)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}
# Transformers Notebooks
Here you can find a list of the official notebooks provided by Hugging Face.
We would also like to list interesting content created by the community.
If you wrote a notebook leveraging transformers and would like it to be listed here, please open a
Pull Request so it can be reviewed and included.
## Hugging Face's notebooks :hugs:
| Notebook | Description | Colab |
|:----------|:-------------:|------:|
| [Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) |
| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) |
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlights all the steps needed to effectively train a Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
@@ -64,7 +64,7 @@ if stale_egg_info.exists():
 extras = {}
 extras["mecab"] = ["mecab-python3"]
-extras["sklearn"] = ["scikit-learn"]
+extras["sklearn"] = ["scikit-learn==0.22.1"]
 extras["tf"] = ["tensorflow"]
 extras["tf-cpu"] = ["tensorflow-cpu"]
 extras["torch"] = ["torch"]
@@ -136,7 +136,7 @@ if is_sklearn_available():

 # Modeling
 if is_torch_available():
-    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D
+    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, top_k_top_p_filtering
     from .modeling_auto import (
         AutoModel,
         AutoModelForPreTraining,
@@ -241,7 +241,7 @@ if is_torch_available():
         CamembertForTokenClassification,
         CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
     )
-    from .modeling_encoder_decoder import PreTrainedEncoderDecoder, Model2Model
+    from .modeling_encoder_decoder import PreTrainedEncoderDecoder
     from .modeling_t5 import (
         T5PreTrainedModel,
         T5Model,
@@ -255,6 +255,7 @@ if is_torch_available():
         AlbertForMaskedLM,
         AlbertForSequenceClassification,
         AlbertForQuestionAnswering,
+        AlbertForTokenClassification,
         load_tf_weights_in_albert,
         ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
     )
@@ -290,7 +291,13 @@ if is_torch_available():

 # TensorFlow
 if is_tf_available():
-    from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, TFSequenceSummary, shape_list
+    from .modeling_tf_utils import (
+        TFPreTrainedModel,
+        TFSharedEmbeddings,
+        TFSequenceSummary,
+        shape_list,
+        tf_top_k_top_p_filtering,
+    )
     from .modeling_tf_auto import (
         TFAutoModel,
         TFAutoModelForPreTraining,
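For context, a hedged sketch of how the newly exported `top_k_top_p_filtering` helper is typically used for sampling (illustrative, not part of this diff):

```python
import torch
from transformers import top_k_top_p_filtering

logits = torch.randn(1, 50257)  # stand-in next-token logits for illustration
filtered = top_k_top_p_filtering(logits, top_k=50, top_p=0.95)
next_token = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
```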
@@ -22,11 +22,10 @@ from .configuration_utils import PretrainedConfig

 logger = logging.getLogger(__name__)

-_bart_large_url = "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json"
-
 BART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
-    "bart-large": _bart_large_url,
-    "bart-large-mnli": _bart_large_url,  # fine as same
-    "bart-cnn": None,  # not done
+    "bart-large": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json",
+    "bart-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/config.json",
+    "bart-large-cnn": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json",
 }
@@ -59,6 +58,7 @@ class BartConfig(PretrainedConfig):
         classifier_dropout=0.0,
         output_past=False,
         num_labels=3,
+        bos_token_id=0,
         **common_kwargs
     ):
         r"""
@@ -67,12 +67,16 @@ class BartConfig(PretrainedConfig):
             config = BartConfig.from_pretrained('bart-large')
             model = BartModel(config)
         """
-        super().__init__(num_labels=num_labels, output_past=output_past, pad_token_id=pad_token_id, **common_kwargs)
+        super().__init__(
+            num_labels=num_labels,
+            output_past=output_past,
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            **common_kwargs,
+        )
         self.vocab_size = vocab_size
         self.d_model = d_model  # encoder_embed_dim and decoder_embed_dim
         self.eos_token_id = eos_token_id
         self.encoder_ffn_dim = encoder_ffn_dim
         self.encoder_layers = self.num_hidden_layers = encoder_layers
         self.encoder_attention_heads = encoder_attention_heads
@@ -109,6 +109,7 @@ class FlaubertConfig(XLMConfig):
             Argument used when doing sequence summary. Used in for the multiple choice head in
             :class:`~transformers.XLMForSequenceClassification`.
             Is one of the following options:
+
             - 'last' => take the last token hidden state (like XLNet)
             - 'first' => take the first token hidden state (like Bert)
             - 'mean' => take the mean of all tokens hidden states
@@ -73,6 +73,7 @@ class GPT2Config(PretrainedConfig):
             Argument used when doing sequence summary. Used in for the multiple choice head in
             :class:`~transformers.GPT2DoubleHeadsModel`.
             Is one of the following options:
+
             - 'last' => take the last token hidden state (like XLNet)
             - 'first' => take the first token hidden state (like Bert)
             - 'mean' => take the mean of all tokens hidden states
@@ -73,6 +73,7 @@ class OpenAIGPTConfig(PretrainedConfig):
             Argument used when doing sequence summary. Used in for the multiple choice head in
             :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
             Is one of the following options:
+
             - 'last' => take the last token hidden state (like XLNet)
             - 'first' => take the first token hidden state (like Bert)
             - 'mean' => take the mean of all tokens hidden states
@@ -98,6 +98,18 @@ class PretrainedConfig(object):
             logger.error("Can't set {} with value {} for {}".format(key, value, self))
             raise err

+    @property
+    def num_labels(self):
+        return self._num_labels
+
+    @num_labels.setter
+    def num_labels(self, num_labels):
+        self._num_labels = num_labels
+        self.id2label = {i: "LABEL_{}".format(i) for i in range(self.num_labels)}
+        self.id2label = dict((int(key), value) for key, value in self.id2label.items())
+        self.label2id = dict(zip(self.id2label.values(), self.id2label.keys()))
+        self.label2id = dict((key, int(value)) for key, value in self.label2id.items())
+
     def save_pretrained(self, save_directory):
         """
         Save a configuration object to the directory `save_directory`, so that it
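A quick illustration of what the new `num_labels` setter produces (a sketch, not part of this diff):

```python
from transformers import PretrainedConfig

config = PretrainedConfig()
config.num_labels = 3
print(config.id2label)  # {0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2'}
print(config.label2id)  # {'LABEL_0': 0, 'LABEL_1': 1, 'LABEL_2': 2}
```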
@@ -108,6 +108,7 @@ class XLMConfig(PretrainedConfig):
             Argument used when doing sequence summary. Used in for the multiple choice head in
             :class:`~transformers.XLMForSequenceClassification`.
             Is one of the following options:
+
             - 'last' => take the last token hidden state (like XLNet)
             - 'first' => take the first token hidden state (like Bert)
             - 'mean' => take the mean of all tokens hidden states
@@ -23,9 +23,11 @@ import fairseq
 import torch
 from packaging import version

-from transformers import BartConfig, BartForSequenceClassification, BartModel, BartTokenizer
+from transformers import BartConfig, BartForMaskedLM, BartForSequenceClassification, BartModel, BartTokenizer
+
+FAIRSEQ_MODELS = ["bart.large", "bart.large.mnli", "bart.large.cnn"]

 if version.parse(fairseq.__version__) < version.parse("0.9.0"):
     raise Exception("requires fairseq >= 0.9.0")
@@ -33,7 +35,7 @@ if version.parse(fairseq.__version__) < version.parse("0.9.0"):
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)

-SAMPLE_TEXT = "Hello world! cécé herlolip"
+SAMPLE_TEXT = " Hello world! cécé herlolip"

 rename_keys = [
     ("model.classification_heads.mnli.dense.weight", "classification_head.dense.weight"),
@@ -41,7 +43,7 @@ rename_keys = [
     ("model.classification_heads.mnli.out_proj.weight", "classification_head.out_proj.weight"),
     ("model.classification_heads.mnli.out_proj.bias", "classification_head.out_proj.bias"),
 ]
-IGNORE_KEYS = ["encoder.version", "decoder.version", "model.encoder.version", "model.decoder.version"]
+IGNORE_KEYS = ["encoder.version", "decoder.version", "model.encoder.version", "model.decoder.version", "_float_tensor"]


 def rename_key(dct, old, new):
@@ -53,36 +55,45 @@ def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path):
     """
     Copy/paste/tweak model's weights to our BERT structure.
     """
-    b2 = torch.hub.load("pytorch/fairseq", checkpoint_path)
-    b2.eval()  # disable dropout
-    b2.model.upgrade_state_dict(b2.model.state_dict())
-    config = BartConfig()
-    tokens = b2.encode(SAMPLE_TEXT).unsqueeze(0)
-    tokens2 = BartTokenizer.from_pretrained("bart-large").encode(SAMPLE_TEXT).unsqueeze(0)
+    bart = torch.hub.load("pytorch/fairseq", checkpoint_path)
+    bart.eval()  # disable dropout
+    bart.model.upgrade_state_dict(bart.model.state_dict())
+    hf_model_name = checkpoint_path.replace(".", "-")
+    config = BartConfig.from_pretrained(hf_model_name)
+    tokens = bart.encode(SAMPLE_TEXT).unsqueeze(0)
+    tokens2 = BartTokenizer.from_pretrained(hf_model_name).encode(SAMPLE_TEXT, return_tensors="pt").unsqueeze(0)
     assert torch.eq(tokens, tokens2).all()
     # assert their_output.size() == (1, 11, 1024)

-    if checkpoint_path == "bart.large":
-        state_dict = b2.model.state_dict()
+    if checkpoint_path in ["bart.large", "bart.large.cnn"]:
+        state_dict = bart.model.state_dict()
+        for k in IGNORE_KEYS:
+            state_dict.pop(k, None)
         state_dict["shared.weight"] = state_dict["decoder.embed_tokens.weight"]
         model = BartModel(config)
-        their_output = b2.extract_features(tokens)
+        their_output = bart.extract_features(tokens)
     else:  # MNLI Case
-        state_dict = b2.state_dict()
+        state_dict = bart.state_dict()
+        for k in IGNORE_KEYS:
+            state_dict.pop(k, None)
         state_dict["model.shared.weight"] = state_dict["model.decoder.embed_tokens.weight"]
         for src, dest in rename_keys:
             rename_key(state_dict, src, dest)
-        state_dict.pop("_float_tensor", None)
         model = BartForSequenceClassification(config)
-        their_output = b2.predict("mnli", tokens, return_logits=True)
-        for k in IGNORE_KEYS:
-            state_dict.pop(k, None)
+        their_output = bart.predict("mnli", tokens, return_logits=True)

     # Load state dict
     model.load_state_dict(state_dict)
     model.eval()
-    our_outputs = model.forward(tokens)[0]
+    if checkpoint_path == "bart.large.cnn":  # generate doesnt work yet
+        model = BartForMaskedLM(config, base_model=model)
+        assert "lm_head.weight" in model.state_dict()
+        assert model.lm_head.out_features == config.max_position_embeddings
+        model.eval()
+        our_outputs = model.model.forward(tokens)[0]
+    else:
+        our_outputs = model.forward(tokens)[0]

     # Check results
     assert their_output.shape == our_outputs.shape
     assert (their_output == our_outputs).all().item()
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
@@ -92,7 +103,8 @@ def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path):
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
     # Required parameters
-    parser.add_argument("fairseq_path", choices=["bart.large", "bart.large.mnli"], type=str, help="")
+    parser.add_argument("fairseq_path", choices=FAIRSEQ_MODELS, type=str, help="")
     parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
     args = parser.parse_args()
     convert_bart_checkpoint(
@@ -46,7 +46,9 @@ logger = logging.getLogger(__name__)

 SAMPLE_TEXT = "Hello world! cécé herlolip"


-def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_folder_path, classification_head):
+def convert_roberta_checkpoint_to_pytorch(
+    roberta_checkpoint_path: str, pytorch_dump_folder_path: str, classification_head: bool
+):
     """
     Copy/paste/tweak roberta's weights to our BERT structure.
     """
@@ -788,6 +788,103 @@ class AlbertForSequenceClassification(AlbertPreTrainedModel):
         return outputs  # (loss), logits, (hidden_states), (attentions)


+@add_start_docstrings(
+    """Albert Model with a token classification head on top (a linear layer on top of
+    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
+    ALBERT_START_DOCSTRING,
+)
+class AlbertForTokenClassification(AlbertPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.albert = AlbertModel(config)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
+
+        self.init_weights()
+
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        labels=None,
+    ):
+        r"""
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+            Labels for computing the token classification loss.
+            Indices should be in ``[0, ..., config.num_labels - 1]``.
+
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs:
+        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :
+            Classification loss.
+        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)
+            Classification scores (before SoftMax).
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+
+    Examples::
+
+        from transformers import AlbertTokenizer, AlbertForTokenClassification
+        import torch
+
+        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
+        model = AlbertForTokenClassification.from_pretrained('albert-base-v2')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, scores = outputs[:2]
+
+        """
+        outputs = self.albert(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+        )
+
+        sequence_output = outputs[0]
+
+        sequence_output = self.dropout(sequence_output)
+        logits = self.classifier(sequence_output)
+
+        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()
+            # Only keep active parts of the loss
+            if attention_mask is not None:
+                active_loss = attention_mask.view(-1) == 1
+                active_logits = logits.view(-1, self.num_labels)[active_loss]
+                active_labels = labels.view(-1)[active_loss]
+                loss = loss_fct(active_logits, active_labels)
+            else:
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            outputs = (loss,) + outputs
+
+        return outputs  # (loss), logits, (hidden_states), (attentions)
+
+
 @add_start_docstrings(
     """Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
     the hidden-states output to compute `span start logits` and `span end logits`). """,