03-pipelines.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## How can I leverage State-of-the-Art Natural Language Models with only one line of code ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Newly introduced in transformers v2.3.0, **pipelines** provides a high-level, easy to use,\n",
    "API for doing inference over a variety of downstream-tasks, including: \n",
    "\n",
    "- Sentence Classification (Sentiment Analysis): Indicate if the overall sentence is either positive or negative. _(Binary Classification task or Logitic Regression task)_\n",
    "- Token Classification (Named Entity Recognition, Part-of-Speech tagging): For each sub-entities _(**tokens**)_ in the input, assign them a label _(Classification task)_.\n",
    "- Question-Answering: Provided a tuple (question, context) the model should find the span of text in **content** answering the **question**.\n",
    "- Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided **context**.\n",
    "- Feature Extraction: Maps the input to a higher, multi-dimensional space learned from the data.\n",
    "\n",
    "Pipelines encapsulate the overall process of every NLP process:\n",
    " \n",
    " 1. Tokenization: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).\n",
    " 2. Inference: Maps every tokens into a more meaningful representation. \n",
    " 3. Decoding: Use the above representation to generate and/or extract the final output for the underlying task.\n",
    "\n",
    "The overall API is exposed to the end-user through the `pipeline()` method with the following \n",
    "structure:\n",
    "\n",
    "```python\n",
    "from transformers import pipeline\n",
    "\n",
    "# Using default model and tokenizer for the task\n",
    "pipeline(\"<task-name>\")\n",
    "\n",
    "# Using a user-specified model\n",
    "pipeline(\"<task-name>\", model=\"<model_name>\")\n",
    "\n",
    "# Using custom model/tokenizer as str\n",
    "pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "!pip install transformers"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% code\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code \n"
    }
   },
   "outputs": [
    {
     "ename": "SyntaxError",
     "evalue": "from __future__ imports must occur at the beginning of the file (<ipython-input-29-c3a037bd4c55>, line 5)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;36m  File \u001b[0;32m\"<ipython-input-29-c3a037bd4c55>\"\u001b[0;36m, line \u001b[0;32m5\u001b[0m\n\u001b[0;31m    from transformers import pipeline\u001b[0m\n\u001b[0m           ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m from __future__ imports must occur at the beginning of the file\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "from __future__ import print_function\n",
    "from ipywidgets import interact, interactive, fixed, interact_manual\n",
    "import ipywidgets as widgets\n",
    "from transformers import pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 1. Sentence Classification - Sentiment Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6aeccfdf51994149bdd1f3d3533e380f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'label': 'POSITIVE', 'score': 0.800251},\n",
       " {'label': 'NEGATIVE', 'score': 1.2489903}]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp_sentence_classif = pipeline('sentiment-analysis')\n",
    "nlp_sentence_classif(['Such a nice weather outside !', 'This movie was kind of boring.'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 2. Token Classification - Named Entity Recognition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b5549c53c27346a899af553c977f00bc",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'word': 'Hu', 'score': 0.9970937967300415, 'entity': 'I-ORG'},\n",
       " {'word': '##gging', 'score': 0.9345750212669373, 'entity': 'I-ORG'},\n",
       " {'word': 'Face', 'score': 0.9787060022354126, 'entity': 'I-ORG'},\n",
       " {'word': 'French', 'score': 0.9981995820999146, 'entity': 'I-MISC'},\n",
       " {'word': 'New', 'score': 0.9983047246932983, 'entity': 'I-LOC'},\n",
       " {'word': '-', 'score': 0.8913455009460449, 'entity': 'I-LOC'},\n",
       " {'word': 'York', 'score': 0.9979523420333862, 'entity': 'I-LOC'}]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp_token_class = pipeline('ner')\n",
    "nlp_token_class('Hugging Face is a French company based in New-York.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Question Answering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6e56a8edcef44ec2ae838711ecd22d3a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 53.05it/s]\n",
      "add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 2673.23it/s]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'score': 0.9632966867654424, 'start': 42, 'end': 50, 'answer': 'New-York.'}"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp_qa = pipeline('question-answering')\n",
    "nlp_qa(context='Hugging Face is a French company based in New-York.', question='Where is based Hugging Face ?')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Text Generation - Mask Filling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1930695ea2d24ca98c6d7c13842d377f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',\n",
       "  'score': 0.25288480520248413,\n",
       "  'token': 2201},\n",
       " {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',\n",
       "  'score': 0.07639515399932861,\n",
       "  'token': 12790},\n",
       " {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',\n",
       "  'score': 0.055500105023384094,\n",
       "  'token': 6497},\n",
       " {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',\n",
       "  'score': 0.04264815151691437,\n",
       "  'token': 11559},\n",
       " {'sequence': '<s> Hugging Face is a French company based in France</s>',\n",
       "  'score': 0.03868963569402695,\n",
       "  'token': 1470}]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp_fill = pipeline('fill-mask')\n",
    "nlp_fill('Hugging Face is a French company based in <mask>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Projection - Features Extraction "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "92fa4d67290f49a3943dc0abd7529892",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(1, 12, 768)"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "nlp_features = pipeline('feature-extraction')\n",
    "output = nlp_features('Hugging Face is a French company based in Paris')\n",
    "np.array(output).shape   # (Samples, Tokens, Vector Size)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Alright ! Now you have a nice picture of what is possible through transformers' pipelines, and there is more\n",
    "to come in future releases. \n",
    "\n",
    "In the meantime, you can try the different pipelines with your own inputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "261ae9fa30e84d1d84a3b0d9682ac477",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Task:', index=1, options=('sentiment-analysis', 'ner', 'fill_mask'), value='ner')"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ddc51b71c6eb40e5ab60998664e6a857",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Text(value='', description='Your input:', placeholder='Enter something')"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'word': 'Paris', 'score': 0.9991844296455383, 'entity': 'I-LOC'}]\n",
      "[{'sequence': '<s> I\\'m from Paris.\"</s>', 'score': 0.224044069647789, 'token': 72}, {'sequence': \"<s> I'm from Paris.)</s>\", 'score': 0.16959427297115326, 'token': 1592}, {'sequence': \"<s> I'm from Paris.]</s>\", 'score': 0.10994981974363327, 'token': 21838}, {'sequence': '<s> I\\'m from Paris!\"</s>', 'score': 0.0706234946846962, 'token': 2901}, {'sequence': \"<s> I'm from Paris.</s>\", 'score': 0.0698278620839119, 'token': 4}]\n",
      "[{'sequence': \"<s> I'm from Paris and London</s>\", 'score': 0.12238534539937973, 'token': 928}, {'sequence': \"<s> I'm from Paris and Brussels</s>\", 'score': 0.07107886672019958, 'token': 6497}, {'sequence': \"<s> I'm from Paris and Belgium</s>\", 'score': 0.040912602096796036, 'token': 7320}, {'sequence': \"<s> I'm from Paris and Berlin</s>\", 'score': 0.039884064346551895, 'token': 5459}, {'sequence': \"<s> I'm from Paris and Melbourne</s>\", 'score': 0.038133684545755386, 'token': 5703}]\n",
      "[{'sequence': '<s> I like go to sleep</s>', 'score': 0.08942786604166031, 'token': 3581}, {'sequence': '<s> I like go to bed</s>', 'score': 0.07789064943790436, 'token': 3267}, {'sequence': '<s> I like go to concerts</s>', 'score': 0.06356740742921829, 'token': 12858}, {'sequence': '<s> I like go to school</s>', 'score': 0.03660670667886734, 'token': 334}, {'sequence': '<s> I like go to dinner</s>', 'score': 0.032155368477106094, 'token': 3630}]\n"
     ]
    }
   ],
   "source": [
    "task = widgets.Dropdown(\n",
    "    options=['sentiment-analysis', 'ner', 'fill_mask'],\n",
    "    value='ner',\n",
    "    description='Task:',\n",
    "    disabled=False\n",
    ")\n",
    "\n",
    "input = widgets.Text(\n",
    "    value='',\n",
    "    placeholder='Enter something',\n",
    "    description='Your input:',\n",
    "    disabled=False\n",
    ")\n",
    "\n",
    "def forward(_):\n",
    "    if len(input.value) > 0: \n",
    "        if task.value == 'ner':\n",
    "            output = nlp_token_class(input.value)\n",
    "        elif task.value == 'sentiment-analysis':\n",
    "            output = nlp_sentence_classif(input.value)\n",
    "        else:\n",
    "            if input.value.find('<mask>') == -1:\n",
    "                output = nlp_fill(input.value + ' <mask>')\n",
    "            else:\n",
    "                output = nlp_fill(input.value)                \n",
    "        print(output)\n",
    "\n",
    "input.on_submit(forward)\n",
    "display(task, input)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% Question Answering\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5ae68677bd8a41f990355aa43840d3f8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Textarea(value='Einstein is famous for the general theory of relativity', description='Context:', placeholder=…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "14bcfd9a2c5a47e6b1383989ab7632c8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Text(value='Why is Einstein famous for ?', description='Question:', placeholder='Enter something')"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 168.83it/s]\n",
      "add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 1919.59it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'score': 0.40340670623875496, 'start': 27, 'end': 54, 'answer': 'general theory of relativity'}\n"
     ]
    }
   ],
   "source": [
    "context = widgets.Textarea(\n",
    "    value='Einstein is famous for the general theory of relativity',\n",
    "    placeholder='Enter something',\n",
    "    description='Context:',\n",
    "    disabled=False\n",
    ")\n",
    "\n",
    "query = widgets.Text(\n",
    "    value='Why is Einstein famous for ?',\n",
    "    placeholder='Enter something',\n",
    "    description='Question:',\n",
    "    disabled=False\n",
    ")\n",
    "\n",
    "def forward(_):\n",
    "    if len(context.value) > 0 and len(query.value) > 0: \n",
    "        output = nlp_qa(question=query.value, context=context.value)            \n",
    "        print(output)\n",
    "\n",
    "query.on_submit(forward)\n",
    "display(context, query)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "source": [],
    "metadata": {
     "collapsed": false
    }
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}