Adding Docker images for transformers + notebooks (#3051)

* Added transformers-pytorch-cpu and gpu Docker images
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added automatic jupyter launch for Docker image.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Move image from alpine to Ubuntu to align with NVidia container images.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added TRANSFORMERS_VERSION argument to Dockerfile.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added Pytorch-GPU based Docker image
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added Tensorflow images.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Use python 3.7 as Tensorflow doesnt provide 3.8 compatible wheel.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Remove double FROM instructions on transformers-pytorch-cpu image.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added transformers-tensorflo...

Unverified Commit 71c87119 authored Mar 04, 2020 by Funtowicz Morgan Committed by GitHub Mar 04, 2020
15 changed files
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
-FROM pytorch/pytorch:latest
-
-RUN git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext
-
-RUN pip install transformers
-
-WORKDIR /workspace
\ No newline at end of file
--- a/docker/transformers-cpu/Dockerfile
+++ b/docker/transformers-cpu/Dockerfile
+FROM ubuntu:18.04
+LABEL maintainer="Hugging Face"
+LABEL repository="transformers"
+
+RUN apt update && \
+    apt install -y bash \
+                   build-essential \
+                   git \
+                   curl \
+                   ca-certificates \
+                   python3 \
+                   python3-pip && \
+    rm -rf /var/lib/apt/lists
+
+RUN python3 -m pip install --no-cache-dir --upgrade pip && \
+    python3 -m pip install --no-cache-dir \
+    jupyter \
+    tensorflow-cpu \
+    torch
+
+WORKDIR /workspace
+COPY . transformers/
+RUN cd transformers/ && \
+    python3 -m pip install --no-cache-dir .
+
+CMD ["/bin/bash"]
\ No newline at end of file
--- a/docker/transformers-gpu/Dockerfile
+++ b/docker/transformers-gpu/Dockerfile
+FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+LABEL maintainer="Hugging Face"
+LABEL repository="transformers"
+
+RUN apt update && \
+    apt install -y bash \
+                   build-essential \
+                   git \
+                   curl \
+                   ca-certificates \
+                   python3 \
+                   python3-pip && \
+    rm -rf /var/lib/apt/lists
+
+RUN python3 -m pip install --no-cache-dir --upgrade pip && \
+    python3 -m pip install --no-cache-dir \
+    jupyter \
+    tensorflow \
+    torch
+
+WORKDIR /workspace
+COPY . transformers/
+RUN cd transformers/ && \
+    python3 -m pip install --no-cache-dir .
+
+CMD ["/bin/bash"]
\ No newline at end of file
--- a/docker/transformers-pytorch-cpu/Dockerfile
+++ b/docker/transformers-pytorch-cpu/Dockerfile
+FROM ubuntu:18.04
+LABEL maintainer="Hugging Face"
+LABEL repository="transformers"
+
+RUN apt update && \
+    apt install -y bash \
+                   build-essential \
+                   git \
+                   curl \
+                   ca-certificates \
+                   python3 \
+                   python3-pip && \
+    rm -rf /var/lib/apt/lists
+
+RUN python3 -m pip install --no-cache-dir --upgrade pip && \
+    python3 -m pip install --no-cache-dir \
+    jupyter \
+    torch
+
+WORKDIR /workspace
+COPY . transformers/
+RUN cd transformers/ && \
+    python3 -m pip install --no-cache-dir .
+
+CMD ["/bin/bash"]
\ No newline at end of file
--- a/docker/transformers-pytorch-gpu/Dockerfile
+++ b/docker/transformers-pytorch-gpu/Dockerfile
+FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+LABEL maintainer="Hugging Face"
+LABEL repository="transformers"
+
+RUN apt update && \
+    apt install -y bash \
+                   build-essential \
+                   git \
+                   curl \
+                   ca-certificates \
+                   python3 \
+                   python3-pip && \
+    rm -rf /var/lib/apt/lists
+
+RUN python3 -m pip install --no-cache-dir --upgrade pip && \
+    python3 -m pip install --no-cache-dir \
+    mkl \
+    torch
+
+WORKDIR /workspace
+COPY . transformers/
+RUN cd transformers/ && \
+    python3 -m pip install --no-cache-dir .
+
+CMD ["/bin/bash"]
\ No newline at end of file
--- a/docker/transformers-tensorflow-cpu/Dockerfile
+++ b/docker/transformers-tensorflow-cpu/Dockerfile
+FROM ubuntu:18.04
+LABEL maintainer="Hugging Face"
+LABEL repository="transformers"
+
+RUN apt update && \
+    apt install -y bash \
+                   build-essential \
+                   git \
+                   curl \
+                   ca-certificates \
+                   python3 \
+                   python3-pip && \
+    rm -rf /var/lib/apt/lists
+
+RUN python3 -m pip install --no-cache-dir --upgrade pip && \
+    python3 -m pip install --no-cache-dir \
+    mkl \
+    tensorflow-cpu
+
+WORKDIR /workspace
+COPY . transformers/
+RUN cd transformers/ && \
+    python3 -m pip install --no-cache-dir .
+
+CMD ["/bin/bash"]
\ No newline at end of file
--- a/docker/transformers-tensorflow-gpu/Dockerfile
+++ b/docker/transformers-tensorflow-gpu/Dockerfile
+FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+LABEL maintainer="Hugging Face"
+LABEL repository="transformers"
+
+RUN apt update && \
+    apt install -y bash \
+                   build-essential \
+                   git \
+                   curl \
+                   ca-certificates \
+                   python3 \
+                   python3-pip && \
+    rm -rf /var/lib/apt/lists
+
+RUN python3 -m pip install --no-cache-dir --upgrade pip && \
+    python3 -m pip install --no-cache-dir \
+    mkl \
+    tensorflow
+
+WORKDIR /workspace
+COPY . transformers/
+RUN cd transformers/ && \
+    python3 -m pip install --no-cache-dir .
+
+CMD ["/bin/bash"]
\ No newline at end of file
--- a/notebooks/01-training-tokenizers.ipynb
+++ b/notebooks/01-training-tokenizers.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Tokenization doesn't have to be slow !\n",
+    "\n",
+    "### Introduction\n",
+    "\n",
+    "Before going deep into any Machine Learning or Deep Learning Natural Language Processing models, every practitioner\n",
+    "should find a way to map raw input strings to a representation understandable by a trainable model.\n",
+    "\n",
+    "One very simple approach would be to split inputs over every space and assign an identifier to each word. This approach\n",
+    "would look similar to the code below in python\n",
+    "\n",
+    "```python\n",
+    "s = \"very long corpus...\"\n",
+    "words = s.split(\" \")  # Split over space\n",
+    "vocabulary = {word: idx for idx, word in enumerate(set(words))}  # Map each word to its corresponding id\n",
+    "```\n",
+    "\n",
+    "This approach might work well if your vocabulary remains small, as it would store every word (or **token**) present in your original\n",
+    "input. Moreover, word variations like \"cat\" and \"cats\" would not share the same identifier even though their meanings are \n",
+    "quite close.\n",
+    "\n",
+    "![tokenization_simple](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/11/tokenization.png)\n",
+    "\n",
+    "### Subtoken Tokenization\n",
+    "\n",
+    "To overcome the issues described above, recent work on tokenization leverages \"subtoken\" tokenization.\n",
+    "**Subtokens** extend the previous splitting strategy by further breaking a word into grammatically logical sub-components learned\n",
+    "from the data.\n",
+    "\n",
+    "Taking our previous example of the words __cat__ and __cats__, a sub-tokenization of the word __cats__ would be [cat, ##s], where the prefix _\"##\"_ indicates a subtoken of the initial input. \n",
+    "Such training algorithms might extract sub-tokens such as _\"##ing\"_ or _\"##ed\"_ from an English corpus.\n",
+    "\n",
+    "As you might expect, building sub-tokens from compositions of _\"pieces\"_ reduces the overall size\n",
+    "of the vocabulary you have to carry to train a Machine Learning model. On the other hand, as one token might be split\n",
+    "into multiple subtokens, the input of your model might grow and become an issue for models with non-linear complexity over the input sequence's length. \n",
+    " \n",
+    "![subtokenization](https://nlp.fast.ai/images/multifit_vocabularies.png)\n",
+    " \n",
+    "Among all the tokenization algorithms, we can highlight a few subtoken algorithms used in Transformers-based SoTA models: \n",
+    "\n",
+    "- [Byte Pair Encoding (BPE) - Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909)\n",
+    "- [Word Piece - Japanese and Korean voice search (Schuster, M., and Nakajima, K., 2012)](https://research.google/pubs/pub37842/)\n",
+    "- [Unigram Language Model - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, T., 2018)](https://arxiv.org/abs/1804.10959)\n",
+    "- [Sentence Piece - A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Taku Kudo and John Richardson, 2018)](https://arxiv.org/abs/1808.06226)\n",
+    "\n",
+    "Going through all of them is out of the scope of this notebook, so we will just highlight how you can use them.\n",
+    "\n",
+    "### @huggingface/tokenizers library \n",
+    "Along with the transformers library, we @huggingface provide a blazing fast tokenization library\n",
+    "able to train, tokenize and decode dozens of Gb/s of text on a common multi-core machine.\n",
+    "\n",
+    "The library is written in Rust, allowing us to take full advantage of multi-core parallel computations in a native and memory-aware way, on top of which \n",
+    "we provide bindings for Python and NodeJS (more bindings may be added in the future). \n",
+    "\n",
+    "We designed the library so that it provides all the required blocks to create end-to-end tokenizers in an interchangeable way. In that sense, we provide\n",
+    "these various components: \n",
+    "\n",
+    "- **Normalizer**: Executes all the initial transformations over the initial input string. For example, when you need to\n",
+    "lowercase some text, maybe strip it, or even apply one of the common Unicode normalization processes, you will add a Normalizer. \n",
+    "- **PreTokenizer**: In charge of splitting the initial input string. That's the component that decides where and how to\n",
+    "pre-segment the original string. The simplest example would be, as we saw before, to simply split on spaces.\n",
+    "- **Model**: Handles all the sub-token discovery and generation. This part is trainable and really dependent\n",
+    " on your input data.\n",
+    "- **Post-Processor**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA\n",
+    "models. For instance, for BERT it would wrap the tokenized sentence with [CLS] and [SEP] tokens.\n",
+    "- **Decoder**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according\n",
+    "to the `PreTokenizer` we used previously.\n",
+    "- **Trainer**: Provides training capabilities to each model.\n",
+    "\n",
+    "For each of the components above we provide multiple implementations:\n",
+    "\n",
+    "- **Normalizer**: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...\n",
+    "- **PreTokenizer**: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...\n",
+    "- **Model**: WordLevel, BPE, WordPiece\n",
+    "- **Post-Processor**: BertProcessor, ...\n",
+    "- **Decoder**: WordLevel, BPE, WordPiece, ...\n",
+    "\n",
+    "All of these building blocks can be combined to create working tokenization pipelines. \n",
+    "In the next section we will go over our first pipeline."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n",
+     "is_executing": false
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
+    "\n",
+    "For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a fairly small input for the purpose of this notebook.\n",
+    "We will work with [the file from Peter Norvig](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
+    "This file contains around 130,000 lines of raw text that will be processed by the library to generate a working tokenizer."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "outputs": [],
+   "source": [
+    "BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
+    "\n",
+    "# Let's download the file and save it somewhere\n",
+    "from requests import get\n",
+    "response = get(BIG_FILE_URL)\n",
+    "\n",
+    "# Only create the file once the download has succeeded\n",
+    "if response.status_code == 200:\n",
+    "    with open('big.txt', 'wb') as big_f:\n",
+    "        big_f.write(response.content)\n",
+    "else:\n",
+    "    print(\"Unable to get the file: {}\".format(response.reason))\n"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% code\n",
+     "is_executing": false
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    " \n",
+    "Now that we have our training data we need to create the overall pipeline for the tokenizer\n",
+    " "
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n",
+     "is_executing": false
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "outputs": [],
+   "source": [
+    "# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
+    "# the overall pipeline for various well-known tokenization algorithms. \n",
+    "# Everything described below can be replaced by the ByteLevelBPETokenizer class. \n",
+    "\n",
+    "from tokenizers import Tokenizer\n",
+    "from tokenizers.decoders import ByteLevel as ByteLevelDecoder\n",
+    "from tokenizers.models import BPE\n",
+    "from tokenizers.normalizers import Lowercase, NFKC, Sequence\n",
+    "from tokenizers.pre_tokenizers import ByteLevel\n",
+    "\n",
+    "# First we create an empty Byte-Pair Encoding model (i.e. not trained model)\n",
+    "tokenizer = Tokenizer(BPE.empty())\n",
+    "\n",
+    "# Then we enable lower-casing and unicode-normalization\n",
+    "# The Sequence normalizer allows us to combine multiple Normalizers that will be\n",
+    "# executed in sequence.\n",
+    "tokenizer.normalizer = Sequence([\n",
+    "    NFKC(),\n",
+    "    Lowercase()\n",
+    "])\n",
+    "\n",
+    "# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
+    "tokenizer.pre_tokenizer = ByteLevel()\n",
+    "\n",
+    "# And finally, let's plug in a decoder so we can recover the original input from a tokenized one\n",
+    "tokenizer.decoder = ByteLevelDecoder()"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% code\n",
+     "is_executing": false
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "outputs": [
+    {
+     "name": "stdout",
+     "text": [
+      "Trained vocab size: 25000\n"
+     ],
+     "output_type": "stream"
+    }
+   ],
+   "source": [
+    "from tokenizers.trainers import BpeTrainer\n",
+    "\n",
+    "# We initialize our trainer, giving it the details about the vocabulary we want to generate\n",
+    "trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())\n",
+    "tokenizer.train(trainer, [\"big.txt\"])\n",
+    "\n",
+    "print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% code\n",
+     "is_executing": false
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Et voilà ! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
+    "covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
+    "on the `Trainer` class, but the overall process should be very similar.\n",
+    "\n",
+    "We can save the content of the model to reuse it later."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "outputs": [
+    {
+     "data": {
+      "text/plain": "['./vocab.json', './merges.txt']"
+     },
+     "metadata": {},
+     "output_type": "execute_result",
+     "execution_count": 12
+    }
+   ],
+   "source": [
+    "# You will see the generated files in the output.\n",
+    "tokenizer.model.save('.')"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% code\n",
+     "is_executing": false
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Now, let's load the trained model and start using our newly trained tokenizer"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "outputs": [
+    {
+     "name": "stdout",
+     "text": [
+      "Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n",
+      "Decoded string:  this is a simple input to be tokenized\n"
+     ],
+     "output_type": "stream"
+    }
+   ],
+   "source": [
+    "# Let's tokenize a simple input\n",
+    "tokenizer.model = BPE.from_files('vocab.json', 'merges.txt')\n",
+    "encoding = tokenizer.encode(\"This is a simple input to be tokenized\")\n",
+    "\n",
+    "print(\"Encoded string: {}\".format(encoding.tokens))\n",
+    "\n",
+    "decoded = tokenizer.decode(encoding.ids)\n",
+    "print(\"Decoded string: {}\".format(decoded))"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% code\n",
+     "is_executing": false
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "The Encoding structure exposes multiple properties which are useful when working with transformers models:\n",
+    "\n",
+    "- normalized_str: The input string after normalization (lower-casing, unicode, stripping, etc.)\n",
+    "- original_str: The input string as it was provided\n",
+    "- tokens: The generated tokens with their string representation\n",
+    "- ids: The generated tokens with their integer representation\n",
+    "- attention_mask: If your input has been padded by the tokenizer, then this would be a vector of 1 for any non-padded token and 0 for padded ones.\n",
+    "- special_tokens_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
+    "- type_ids: If your input was made of multiple \"parts\" such as (question, context), then this would be a vector giving, for each token, the segment it belongs to.\n",
+    "- overflowing: If your input has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.6"
+  },
+  "pycharm": {
+   "stem_cell": {
+    "cell_type": "raw",
+    "source": [],
+    "metadata": {
+     "collapsed": false
+    }
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file
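
The notebook above contrasts word-level vocabularies with "##"-prefixed subtokens. As a rough illustration of that idea only, here is a minimal pure-Python sketch. It assumes a hand-written subtoken vocabulary and a greedy longest-match split (in the spirit of WordPiece); the helper names `build_word_vocab` and `split_into_subtokens` are hypothetical and are not part of the `tokenizers` library, which learns its vocabulary from data.

```python
# Toy illustration: the subtoken vocabulary is hand-written, not learned from a corpus.

def build_word_vocab(text):
    """Map each distinct whitespace-separated word to an integer id."""
    return {word: idx for idx, word in enumerate(sorted(set(text.split())))}


def split_into_subtokens(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; '##' marks pieces that continue a word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No known piece covers this position: give up on the whole word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Word-level: "cat" and "cats" end up with unrelated identifiers.
word_vocab = build_word_vocab("the cat sat while the cats slept")

# Subtoken-level: a small set of pieces covers several word variations.
subtoken_vocab = {"cat", "sleep", "walk", "##s", "##ing", "##ed"}

print(word_vocab["cat"], word_vocab["cats"])
print(split_into_subtokens("cats", subtoken_vocab))
print(split_into_subtokens("walking", subtoken_vocab))
print(split_into_subtokens("dog", subtoken_vocab))
```

With six pieces the sketch covers "cat", "cats", "walked", "sleeping" and so on, whereas the word-level vocabulary needs one entry per surface form. The cost is longer token sequences, as the notebook points out.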
--- a/notebooks/02-transformers.ipynb
+++ b/notebooks/02-transformers.ipynb
--- a/notebooks/03-pipelines.ipynb
+++ b/notebooks/03-pipelines.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
+   "source": [
+    "## How can I leverage State-of-the-Art Natural Language Models with only one line of code ?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
+   "source": [
+    "Newly introduced in transformers v2.3.0, **pipelines** provide a high-level, easy-to-use\n",
+    "API for doing inference over a variety of downstream tasks, including: \n",
+    "\n",
+    "- Sentence Classification (Sentiment Analysis): Indicate if the overall sentence is either positive or negative. _(Binary Classification task or Logistic Regression task)_\n",
+    "- Token Classification (Named Entity Recognition, Part-of-Speech tagging): For each sub-entity _(**token**)_ in the input, assign it a label _(Classification task)_.\n",
+    "- Question-Answering: Provided a tuple (question, context), the model should find the span of text in the **context** answering the **question**.\n",
+    "- Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided **context**.\n",
+    "- Feature Extraction: Maps the input to a higher, multi-dimensional space learned from the data.\n",
+    "\n",
+    "Pipelines encapsulate the overall process of every NLP task:\n",
+    " \n",
+    " 1. Tokenization: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).\n",
+    " 2. Inference: Maps every token into a more meaningful representation. \n",
+    " 3. Decoding: Use the above representation to generate and/or extract the final output for the underlying task.\n",
+    "\n",
+    "The overall API is exposed to the end-user through the `pipeline()` function with the following \n",
+    "structure:\n",
+    "\n",
+    "```python\n",
+    "from transformers import pipeline\n",
+    "\n",
+    "# Using default model and tokenizer for the task\n",
+    "pipeline(\"<task-name>\")\n",
+    "\n",
+    "# Using a user-specified model\n",
+    "pipeline(\"<task-name>\", model=\"<model_name>\")\n",
+    "\n",
+    "# Using custom model/tokenizer as str\n",
+    "pipeline('<task-name>', model='<model_name>', tokenizer='<tokenizer_name>')\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {
+    "pycharm": {
+     "is_executing": false,
+     "name": "#%% code \n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from __future__ import print_function\n",
+    "import numpy as np\n",
+    "from ipywidgets import interact, interactive, fixed, interact_manual\n",
+    "import ipywidgets as widgets\n",
+    "from transformers import pipeline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
+   "source": [
+    "## 1. Sentence Classification - Sentiment Analysis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "pycharm": {
+     "is_executing": false,
+     "name": "#%% code\n"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "6aeccfdf51994149bdd1f3d3533e380f",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[{'label': 'POSITIVE', 'score': 0.800251},\n",
+       " {'label': 'NEGATIVE', 'score': 1.2489903}]"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "nlp_sentence_classif = pipeline('sentiment-analysis')\n",
+    "nlp_sentence_classif(['Such a nice weather outside !', 'This movie was kind of boring.'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
+   "source": [
+    "## 2. Token Classification - Named Entity Recognition"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {
+    "pycharm": {
+     "is_executing": false,
+     "name": "#%% code\n"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "b5549c53c27346a899af553c977f00bc",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[{'word': 'Hu', 'score': 0.9970937967300415, 'entity': 'I-ORG'},\n",
+       " {'word': '##gging', 'score': 0.9345750212669373, 'entity': 'I-ORG'},\n",
+       " {'word': 'Face', 'score': 0.9787060022354126, 'entity': 'I-ORG'},\n",
+       " {'word': 'French', 'score': 0.9981995820999146, 'entity': 'I-MISC'},\n",
+       " {'word': 'New', 'score': 0.9983047246932983, 'entity': 'I-LOC'},\n",
+       " {'word': '-', 'score': 0.8913455009460449, 'entity': 'I-LOC'},\n",
+       " {'word': 'York', 'score': 0.9979523420333862, 'entity': 'I-LOC'}]"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "nlp_token_class = pipeline('ner')\n",
+    "nlp_token_class('Hugging Face is a French company based in New-York.')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Question Answering"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {
+    "pycharm": {
+     "is_executing": false,
+     "name": "#%% code\n"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "6e56a8edcef44ec2ae838711ecd22d3a",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 53.05it/s]\n",
+      "add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 2673.23it/s]\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 0.9632966867654424, 'start': 42, 'end': 50, 'answer': 'New-York.'}"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "nlp_qa = pipeline('question-answering')\n",
+    "nlp_qa(context='Hugging Face is a French company based in New-York.', question='Where is based Hugging Face ?')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Text Generation - Mask Filling"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {
+    "pycharm": {
+     "is_executing": false,
+     "name": "#%% code\n"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "1930695ea2d24ca98c6d7c13842d377f",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',\n",
+       "  'score': 0.25288480520248413,\n",
+       "  'token': 2201},\n",
+       " {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',\n",
+       "  'score': 0.07639515399932861,\n",
+       "  'token': 12790},\n",
+       " {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',\n",
+       "  'score': 0.055500105023384094,\n",
+       "  'token': 6497},\n",
+       " {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',\n",
+       "  'score': 0.04264815151691437,\n",
+       "  'token': 11559},\n",
+       " {'sequence': '<s> Hugging Face is a French company based in France</s>',\n",
+       "  'score': 0.03868963569402695,\n",
+       "  'token': 1470}]"
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "nlp_fill = pipeline('fill-mask')\n",
+    "nlp_fill('Hugging Face is a French company based in <mask>')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Projection - Feature Extraction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {
+    "pycharm": {
+     "is_executing": false,
+     "name": "#%% code\n"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "92fa4d67290f49a3943dc0abd7529892",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(1, 12, 768)"
+      ]
+     },
+     "execution_count": 32,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import numpy as np\n",
+    "nlp_features = pipeline('feature-extraction')\n",
+    "output = nlp_features('Hugging Face is a French company based in Paris')\n",
+    "np.array(output).shape   # (Samples, Tokens, Vector Size)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
+   "source": [
+    "Alright! You now have a good picture of what is possible with transformers' pipelines, and there is more\n",
+    "to come in future releases.\n",
+    "\n",
+    "In the meantime, you can try the different pipelines with your own inputs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {
+    "pycharm": {
+     "is_executing": false,
+     "name": "#%% code\n"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "261ae9fa30e84d1d84a3b0d9682ac477",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Dropdown(description='Task:', index=1, options=('sentiment-analysis', 'ner', 'fill_mask'), value='ner')"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "ddc51b71c6eb40e5ab60998664e6a857",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Text(value='', description='Your input:', placeholder='Enter something')"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{'word': 'Paris', 'score': 0.9991844296455383, 'entity': 'I-LOC'}]\n",
+      "[{'sequence': '<s> I\\'m from Paris.\"</s>', 'score': 0.224044069647789, 'token': 72}, {'sequence': \"<s> I'm from Paris.)</s>\", 'score': 0.16959427297115326, 'token': 1592}, {'sequence': \"<s> I'm from Paris.]</s>\", 'score': 0.10994981974363327, 'token': 21838}, {'sequence': '<s> I\\'m from Paris!\"</s>', 'score': 0.0706234946846962, 'token': 2901}, {'sequence': \"<s> I'm from Paris.</s>\", 'score': 0.0698278620839119, 'token': 4}]\n",
+      "[{'sequence': \"<s> I'm from Paris and London</s>\", 'score': 0.12238534539937973, 'token': 928}, {'sequence': \"<s> I'm from Paris and Brussels</s>\", 'score': 0.07107886672019958, 'token': 6497}, {'sequence': \"<s> I'm from Paris and Belgium</s>\", 'score': 0.040912602096796036, 'token': 7320}, {'sequence': \"<s> I'm from Paris and Berlin</s>\", 'score': 0.039884064346551895, 'token': 5459}, {'sequence': \"<s> I'm from Paris and Melbourne</s>\", 'score': 0.038133684545755386, 'token': 5703}]\n",
+      "[{'sequence': '<s> I like go to sleep</s>', 'score': 0.08942786604166031, 'token': 3581}, {'sequence': '<s> I like go to bed</s>', 'score': 0.07789064943790436, 'token': 3267}, {'sequence': '<s> I like go to concerts</s>', 'score': 0.06356740742921829, 'token': 12858}, {'sequence': '<s> I like go to school</s>', 'score': 0.03660670667886734, 'token': 334}, {'sequence': '<s> I like go to dinner</s>', 'score': 0.032155368477106094, 'token': 3630}]\n"
+     ]
+    }
+   ],
+   "source": [
+    "task = widgets.Dropdown(\n",
+    "    options=['sentiment-analysis', 'ner', 'fill_mask'],\n",
+    "    value='ner',\n",
+    "    description='Task:',\n",
+    "    disabled=False\n",
+    ")\n",
+    "\n",
+    "# Named text_input to avoid shadowing the built-in input()\n",
+    "text_input = widgets.Text(\n",
+    "    value='',\n",
+    "    placeholder='Enter something',\n",
+    "    description='Your input:',\n",
+    "    disabled=False\n",
+    ")\n",
+    "\n",
+    "def forward(_):\n",
+    "    # Run the pipeline selected in the dropdown on the submitted text\n",
+    "    if len(text_input.value) > 0:\n",
+    "        if task.value == 'ner':\n",
+    "            output = nlp_token_class(text_input.value)\n",
+    "        elif task.value == 'sentiment-analysis':\n",
+    "            output = nlp_sentence_classif(text_input.value)\n",
+    "        else:\n",
+    "            # fill-mask needs a <mask> token: append one if the input has none\n",
+    "            if '<mask>' not in text_input.value:\n",
+    "                output = nlp_fill(text_input.value + ' <mask>')\n",
+    "            else:\n",
+    "                output = nlp_fill(text_input.value)\n",
+    "        print(output)\n",
+    "\n",
+    "text_input.on_submit(forward)\n",
+    "display(task, text_input)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "metadata": {
+    "pycharm": {
+     "is_executing": false,
+     "name": "#%% Question Answering\n"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "5ae68677bd8a41f990355aa43840d3f8",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Textarea(value='Einstein is famous for the general theory of relativity', description='Context:', placeholder=…"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "14bcfd9a2c5a47e6b1383989ab7632c8",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Text(value='Why is Einstein famous for ?', description='Question:', placeholder='Enter something')"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 168.83it/s]\n",
+      "add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 1919.59it/s]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'score': 0.40340670623875496, 'start': 27, 'end': 54, 'answer': 'general theory of relativity'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "context = widgets.Textarea(\n",
+    "    value='Einstein is famous for the general theory of relativity',\n",
+    "    placeholder='Enter something',\n",
+    "    description='Context:',\n",
+    "    disabled=False\n",
+    ")\n",
+    "\n",
+    "query = widgets.Text(\n",
+    "    value='Why is Einstein famous for ?',\n",
+    "    placeholder='Enter something',\n",
+    "    description='Question:',\n",
+    "    disabled=False\n",
+    ")\n",
+    "\n",
+    "def forward(_):\n",
+    "    if len(context.value) > 0 and len(query.value) > 0:\n",
+    "        output = nlp_qa(question=query.value, context=context.value)\n",
+    "        print(output)\n",
+    "\n",
+    "query.on_submit(forward)\n",
+    "display(context, query)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.6"
+  },
+  "pycharm": {
+   "stem_cell": {
+    "cell_type": "raw",
+    "source": [],
+    "metadata": {
+     "collapsed": false
+    }
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
\ No newline at end of file
--- a/notebooks/Comparing-PT-and-TF-models.ipynb
+++ b/notebooks/Comparing-PT-and-TF-models.ipynb
--- a/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb
+++ b/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb
--- a/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb
+++ b/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb
--- a/notebooks/Comparing-TF-and-PT-models.ipynb
+++ b/notebooks/Comparing-TF-and-PT-models.ipynb
--- a/notebooks/README.md
+++ b/notebooks/README.md
+# Transformers Notebooks
+
+This page lists the official notebooks provided by Hugging Face.
+
+We would also like to list interesting content created by the community here.
+If you wrote a notebook that uses transformers and would like it listed, please open a
+Pull Request and we will review it for inclusion.
+
+
+## Hugging Face's notebooks :hugs:
+
+| Notebook     |      Description      |   |
+|:----------|:-------------:|------:|
+| [Getting Started Tokenizers](01-training-tokenizers.ipynb)  | How to train and use your very own tokenizer  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
+| [Getting Started Transformers](02-transformers.ipynb)   | How to easily start using transformers  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/02-transformers.ipynb) |
+| [How to use Pipelines](03-pipelines.ipynb)  | A simple and efficient way to use state-of-the-art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/03-pipelines.ipynb) |
+| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb) | Highlights all the steps to effectively train a Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vochicong/blog/blob/fix-notebook-add-tokenizer-config/notebooks/01_how_to_train.ipynb) |
\ No newline at end of file