{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## How can I leverage State-of-the-Art Natural Language Models with only one line of code ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Newly introduced in transformers v2.3.0, **pipelines** provides a high-level, easy to use,\n",
    "API for doing inference over a variety of downstream-tasks, including: \n",
    "\n",
    "- Sentence Classification (Sentiment Analysis): Indicate if the overall sentence is either positive or negative. _(Binary Classification task or Logitic Regression task)_\n",
    "- Token Classification (Named Entity Recognition, Part-of-Speech tagging): For each sub-entities _(**tokens**)_ in the input, assign them a label _(Classification task)_.\n",
    "- Question-Answering: Provided a tuple (question, context) the model should find the span of text in **content** answering the **question**.\n",
    "- Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided **context**.\n",
    "- Feature Extraction: Maps the input to a higher, multi-dimensional space learned from the data.\n",
    "\n",
    "Pipelines encapsulate the overall process of every NLP process:\n",
    " \n",
    " 1. Tokenization: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).\n",
    " 2. Inference: Maps every tokens into a more meaningful representation. \n",
    " 3. Decoding: Use the above representation to generate and/or extract the final output for the underlying task.\n",
    "\n",
    "The overall API is exposed to the end-user through the `pipeline()` method with the following \n",
    "structure:\n",
    "\n",
    "```python\n",
    "from transformers import pipeline\n",
    "\n",
    "# Using default model and tokenizer for the task\n",
    "pipeline(\"<task-name>\")\n",
    "\n",
    "# Using a user-specified model\n",
    "pipeline(\"<task-name>\", model=\"<model_name>\")\n",
    "\n",
    "# Using custom model/tokenizer as str\n",
    "pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')\n",
    "```"
   ]
  },
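  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "As a concrete sketch of the model-specific form above (the checkpoint name below is only an illustrative example; any compatible model identifier works), pinning the model explicitly makes the downloaded weights reproducible instead of relying on the task default:\n",
    "\n",
    "```python\n",
    "from transformers import pipeline\n",
    "\n",
    "# Illustrative checkpoint; swap in any sentiment-analysis compatible model identifier\n",
    "classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')\n",
    "classifier('Such a nice weather outside !')\n",
    "```"
   ]
  },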
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "!pip install transformers"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% code\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code \n"
    }
   },
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import ipywidgets as widgets\n",
    "from transformers import pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 1. Sentence Classification - Sentiment Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "c9db53f30b9446c0af03268633a966c0"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "text": [
      "\n"
     ],
     "output_type": "stream"
    },
    {
     "data": {
      "text/plain": "[{'label': 'POSITIVE', 'score': 0.9997656}]"
     },
     "metadata": {},
     "output_type": "execute_result",
     "execution_count": 8
    }
   ],
   "source": [
    "nlp_sentence_classif = pipeline('sentiment-analysis')\n",
    "nlp_sentence_classif('Such a nice weather outside !')"
   ]
  },
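  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The same pipeline object should also accept a list of sentences, returning one prediction per item. A minimal sketch (the example sentences are arbitrary):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "# Batch inference: one {'label', 'score'} dict per input sentence\n",
    "nlp_sentence_classif([\n",
    "    'Such a nice weather outside !',\n",
    "    'This movie was a complete waste of time.'\n",
    "])"
   ]
  },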
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 2. Token Classification - Named Entity Recognition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "1e300789e22644f1aed66a5ed60e75c4"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "text": [
      "\n"
     ],
     "output_type": "stream"
    },
    {
     "data": {
      "text/plain": "[{'word': 'Hu', 'score': 0.9970937967300415, 'entity': 'I-ORG'},\n {'word': '##gging', 'score': 0.9345750212669373, 'entity': 'I-ORG'},\n {'word': 'Face', 'score': 0.9787060022354126, 'entity': 'I-ORG'},\n {'word': 'French', 'score': 0.9981995820999146, 'entity': 'I-MISC'},\n {'word': 'New', 'score': 0.9983047246932983, 'entity': 'I-LOC'},\n {'word': '-', 'score': 0.8913455009460449, 'entity': 'I-LOC'},\n {'word': 'York', 'score': 0.9979523420333862, 'entity': 'I-LOC'}]"
     },
     "metadata": {},
     "output_type": "execute_result",
     "execution_count": 9
    }
   ],
   "source": [
    "nlp_token_class = pipeline('ner')\n",
    "nlp_token_class('Hugging Face is a French company based in New-York.')"
   ]
  },
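  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Each prediction above is a plain dict with `word`, `score` and `entity` keys, so the output is easy to post-process in pure Python. A small sketch keeping only high-confidence entities (the 0.99 threshold is an arbitrary choice):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "# Keep only the tokens the model is very confident about (threshold is arbitrary)\n",
    "entities = nlp_token_class('Hugging Face is a French company based in New-York.')\n",
    "[e for e in entities if e['score'] > 0.99]"
   ]
  },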
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Question Answering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "82aca58f1ea24b4cb37f16402e8a5923"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "text": [
      "\n"
     ],
     "output_type": "stream"
    },
    {
     "name": "stderr",
     "text": [
      "convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 225.51it/s]\n",
      "add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 2158.67it/s]\n"
     ],
     "output_type": "stream"
    },
    {
     "data": {
      "text/plain": "{'score': 0.9632966867654424, 'start': 42, 'end': 50, 'answer': 'New-York.'}"
     },
     "metadata": {},
     "output_type": "execute_result",
     "execution_count": 10
    }
   ],
   "source": [
    "nlp_qa = pipeline('question-answering')\n",
    "nlp_qa(context='Hugging Face is a French company based in New-York.', question='Where is based Hugging Face ?')"
   ]
  },
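  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Besides the decoded `answer`, the result above carries the `start` and `end` character offsets of the span, so the answer can also be recovered directly by slicing the original context. A minimal sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "qa_context = 'Hugging Face is a French company based in New-York.'\n",
    "result = nlp_qa(context=qa_context, question='Where is Hugging Face based ?')\n",
    "# The (start, end) offsets index into the original context string\n",
    "qa_context[result['start']:result['end']]"
   ]
  },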
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Text Generation - Mask Filling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "49df2227b4fa4eb28dcdcfc3d9261d0f"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "text": [
      "\n"
     ],
     "output_type": "stream"
    },
    {
     "data": {
      "text/plain": "[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',\n  'score': 0.23106691241264343,\n  'token': 2201},\n {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',\n  'score': 0.0819825753569603,\n  'token': 12790},\n {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',\n  'score': 0.04769463092088699,\n  'token': 11559},\n {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',\n  'score': 0.047622501850128174,\n  'token': 6497},\n {'sequence': '<s> Hugging Face is a French company based in France</s>',\n  'score': 0.04130595177412033,\n  'token': 1470}]"
     },
     "metadata": {},
     "output_type": "execute_result",
     "execution_count": 11
    }
   ],
   "source": [
    "nlp_fill = pipeline('fill-mask')\n",
    "nlp_fill('Hugging Face is a French company based in ' + nlp_fill.tokenizer.mask_token)"
   ]
  },
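  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Each suggestion above is a dict with `sequence`, `score` and `token` keys, so picking the single most likely completion is a one-liner. A minimal sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "# Take the highest-scoring completion returned by the fill-mask pipeline\n",
    "suggestions = nlp_fill('Hugging Face is a French company based in ' + nlp_fill.tokenizer.mask_token)\n",
    "best = max(suggestions, key=lambda s: s['score'])\n",
    "best['sequence'], best['score']"
   ]
  },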
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Projection - Features Extraction "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "2af4cfb19e3243dda014d0f56b48f4b2"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "text": [
      "\n"
     ],
     "output_type": "stream"
    },
    {
     "data": {
      "text/plain": "(1, 12, 768)"
     },
     "metadata": {},
     "output_type": "execute_result",
     "execution_count": 12
    }
   ],
   "source": [
    "import numpy as np\n",
    "nlp_features = pipeline('feature-extraction')\n",
    "output = nlp_features('Hugging Face is a French company based in Paris')\n",
    "np.array(output).shape   # (Samples, Tokens, Vector Size)\n"
   ]
  },
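  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The pipeline returns one 768-dimensional vector per token. A common way to turn this into a single fixed-size sentence representation is to average over the token axis; a minimal sketch (mean pooling is only one of several possible strategies):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "# Mean-pool the per-token vectors into a single sentence embedding\n",
    "sentence_embedding = np.array(output).mean(axis=1)\n",
    "sentence_embedding.shape   # (Samples, Vector Size)"
   ]
  },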
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Alright ! Now you have a nice picture of what is possible through transformers' pipelines, and there is more\n",
    "to come in future releases. \n",
    "\n",
    "In the meantime, you can try the different pipelines with your own inputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "Dropdown(description='Task:', index=1, options=('sentiment-analysis', 'ner', 'fill_mask'), value='ner')",
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "10bac065d46f4e4d9a8498dcc8104ecd"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": "Text(value='', description='Your input:', placeholder='Enter something')",
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "2c5f1411f7a94714bc00f01b0e3b27b2"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "task = widgets.Dropdown(\n",
    "    options=['sentiment-analysis', 'ner', 'fill_mask'],\n",
    "    value='ner',\n",
    "    description='Task:',\n",
    "    disabled=False\n",
    ")\n",
    "\n",
    "input = widgets.Text(\n",
    "    value='',\n",
    "    placeholder='Enter something',\n",
    "    description='Your input:',\n",
    "    disabled=False\n",
    ")\n",
    "\n",
    "def forward(_):\n",
    "    if len(input.value) > 0: \n",
    "        if task.value == 'ner':\n",
    "            output = nlp_token_class(input.value)\n",
    "        elif task.value == 'sentiment-analysis':\n",
    "            output = nlp_sentence_classif(input.value)\n",
    "        else:\n",
    "            if input.value.find('<mask>') == -1:\n",
    "                output = nlp_fill(input.value + ' <mask>')\n",
    "            else:\n",
    "                output = nlp_fill(input.value)                \n",
    "        print(output)\n",
    "\n",
    "input.on_submit(forward)\n",
    "display(task, input)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% Question Answering\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "Textarea(value='Einstein is famous for the general theory of relativity', description='Context:', placeholder=…",
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "019fde2343634e94b6f32d04f6350ec1"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "context = widgets.Textarea(\n",
    "    value='Einstein is famous for the general theory of relativity',\n",
    "    placeholder='Enter something',\n",
    "    description='Context:',\n",
    "    disabled=False\n",
    ")\n",
    "\n",
    "query = widgets.Text(\n",
    "    value='Why is Einstein famous for ?',\n",
    "    placeholder='Enter something',\n",
    "    description='Question:',\n",
    "    disabled=False\n",
    ")\n",
    "\n",
    "def forward(_):\n",
    "    if len(context.value) > 0 and len(query.value) > 0: \n",
    "        output = nlp_qa(question=query.value, context=context.value)            \n",
    "        print(output)\n",
    "\n",
    "query.on_submit(forward)\n",
    "display(context, query)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "source": [],
    "metadata": {
     "collapsed": false
    }
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}