{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Tokenization doesn't have to be slow !\n",
    "\n",
    "### Introduction\n",
    "\n",
    "Before going deep into any Machine Learning or Deep Learning Natural Language Processing models, every practitioner\n",
    "should find a way to map raw input strings to a representation understandable by a trainable model.\n",
    "\n",
    "One very simple approach would be to split inputs over every space and assign an identifier to each word. This approach\n",
    "would look similar to the code below in python\n",
    "\n",
    "```python\n",
    "s = \"very long corpus...\"\n",
    "words = s.split(\" \")  # Split over space\n",
    "vocabulary = dict(enumerate(set(words)))  # Map storing the word to it's corresponding id\n",
    "```\n",
    "\n",
    "This approach might work well if your vocabulary remains small as it would store every word (or **token**) present in your original\n",
    "input. Moreover, word variations like \"cat\" and \"cats\" would not share the same identifiers even if their meaning is \n",
    "quite close.\n",
    "\n",
    "![tokenization_simple](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/11/tokenization.png)\n",
    "\n",
    "### Subtoken Tokenization\n",
    "\n",
    "To overcome the issues described above, recent works have been done on tokenization, leveraging \"subtoken\" tokenization.\n",
    "**Subtokens** extends the previous splitting strategy to furthermore explode a word into grammatically logicial sub-components learned\n",
    "from the data.\n",
    "\n",
    "Taking our previous example of the words __cat__ and __cats__, a sub-tokenization of the word __cats__ would be [cat, ##s]. Where the prefix _\"##\"_ indicates a subtoken of the initial input. \n",
    "Such training algorithms might extract sub-tokens such as _\"##ing\"_, _\"##ed\"_ over English corpus.\n",
    "\n",
    "As you might think of, this kind of sub-tokens construction leveraging compositions of _\"pieces\"_ overall reduces the size\n",
    "of the vocabulary you have to carry to train a Machine Learning model. On the other side, as one token might be exploded\n",
    "into multiple subtokens, the input of your model might increase and become an issue on model with non-linear complexity over the input sequence's length. \n",
    " \n",
    "![subtokenization](https://nlp.fast.ai/images/multifit_vocabularies.png)\n",
    " \n",
    "Among all the tokenization algorithms, we can highlight a few subtokens algorithms used in Transformers-based SoTA models : \n",
    "\n",
    "- [Byte Pair Encoding (BPE) - Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909)\n",
    "- [Word Piece - Japanese and Korean voice search (Schuster, M., and Nakajima, K., 2015)](https://research.google/pubs/pub37842/)\n",
    "- [Unigram Language Model - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, T., 2018)](https://arxiv.org/abs/1804.10959)\n",
    "- [Sentence Piece - A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Taku Kudo and John Richardson, 2018)](https://arxiv.org/abs/1808.06226)\n",
    "\n",
    "Going through all of them is out of the scope of this notebook, so we will just highlight how you can use them.\n",
    "\n",
    "### @huggingface/tokenizers library \n",
    "Along with the transformers library, we @huggingface provide a blazing fast tokenization library\n",
    "able to train, tokenize and decode dozens of Gb/s of text on a common multi-core machine.\n",
    "\n",
    "The library is written in Rust allowing us to take full advantage of multi-core parallel computations in a native and memory-aware way, on-top of which \n",
    "we provide bindings for Python and NodeJS (more bindings may be added in the future). \n",
    "\n",
    "We designed the library so that it provides all the required blocks to create end-to-end tokenizers in an interchangeable way. In that sense, we provide\n",
    "these various components: \n",
    "\n",
    "- **Normalizer**: Executes all the initial transformations over the initial input string. For example when you need to\n",
    "lowercase some text, maybe strip it, or even apply one of the common unicode normalization process, you will add a Normalizer. \n",
    "- **PreTokenizer**: In charge of splitting the initial input string. That's the component that decides where and how to\n",
    "pre-segment the origin string. The simplest example would be like we saw before, to simply split on spaces.\n",
    "- **Model**: Handles all the sub-token discovery and generation, this part is trainable and really dependant\n",
    " of your input data.\n",
    "- **Post-Processor**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA\n",
    "models. For instance, for BERT it would wrap the tokenized sentence around [CLS] and [SEP] tokens.\n",
    "- **Decoder**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according\n",
    "to the `PreTokenizer` we used previously.\n",
    "- **Trainer**: Provides training capabilities to each model.\n",
    "\n",
    "For each of the components above we provide multiple implementations:\n",
    "\n",
    "- **Normalizer**: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...\n",
    "- **PreTokenizer**: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...\n",
    "- **Model**: WordLevel, BPE, WordPiece\n",
    "- **Post-Processor**: BertProcessor, ...\n",
    "- **Decoder**: WordLevel, BPE, WordPiece, ...\n",
    "\n",
    "All of these building blocks can be combined to create working tokenization pipelines. \n",
    "In the next section we will go over our first pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
    "\n",
    "For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a quite small input for the purpose of this notebook.\n",
    "We will work with [the file from Peter Norving](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
    "This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "!pip install tokenizers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
    "\n",
    "# Let's download the file and save it somewhere\n",
    "from requests import get\n",
    "with open('big.txt', 'wb') as big_f:\n",
    "    response = get(BIG_FILE_URL, )\n",
    "    \n",
    "    if response.status_code == 200:\n",
    "        big_f.write(response.content)\n",
    "    else:\n",
    "        print(\"Unable to get the file: {}\".format(response.reason))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    }
   },
   "source": [
    " \n",
    "Now that we have our training data we need to create the overall pipeline for the tokenizer\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
    "# the overall pipeline for various well-known tokenization algorithm. \n",
    "# Everything described below can be replaced by the ByteLevelBPETokenizer class. \n",
    "\n",
    "from tokenizers import Tokenizer\n",
    "from tokenizers.decoders import ByteLevel as ByteLevelDecoder\n",
    "from tokenizers.models import BPE\n",
    "from tokenizers.normalizers import Lowercase, NFKC, Sequence\n",
    "from tokenizers.pre_tokenizers import ByteLevel\n",
    "\n",
    "# First we create an empty Byte-Pair Encoding model (i.e. not trained model)\n",
    "tokenizer = Tokenizer(BPE.empty())\n",
    "\n",
    "# Then we enable lower-casing and unicode-normalization\n",
    "# The Sequence normalizer allows us to combine multiple Normalizer that will be\n",
    "# executed in order.\n",
    "tokenizer.normalizer = Sequence([\n",
    "    NFKC(),\n",
    "    Lowercase()\n",
    "])\n",
    "\n",
    "# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
    "tokenizer.pre_tokenizer = ByteLevel()\n",
    "\n",
    "# And finally, let's plug a decoder so we can recover from a tokenized input to the original one\n",
    "tokenizer.decoder = ByteLevelDecoder()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Trained vocab size: 25000\n"
     ]
    }
   ],
   "source": [
    "from tokenizers.trainers import BpeTrainer\n",
    "\n",
    "# We initialize our trainer, giving him the details about the vocabulary we want to generate\n",
    "trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())\n",
    "tokenizer.train(trainer, [\"big.txt\"])\n",
    "\n",
    "print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Et voil脿 ! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
    "covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
    "on the `Trainer` class, but the overall process should be very similar.\n",
    "\n",
    "We can save the content of the model to reuse it later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['./vocab.json', './merges.txt']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# You will see the generated files in the output.\n",
    "tokenizer.model.save('.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Now, let load the trained model and start using out newly trained tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Encoded string: ['臓this', '臓is', '臓a', '臓simple', '臓in', 'put', '臓to', '臓be', '臓token', 'ized']\n",
      "Decoded string:  this is a simple input to be tokenized\n"
     ]
    }
   ],
   "source": [
    "# Let's tokenizer a simple input\n",
    "tokenizer.model = BPE.from_files('vocab.json', 'merges.txt')\n",
    "encoding = tokenizer.encode(\"This is a simple input to be tokenized\")\n",
    "\n",
    "print(\"Encoded string: {}\".format(encoding.tokens))\n",
    "\n",
    "decoded = tokenizer.decode(encoding.ids)\n",
    "print(\"Decoded string: {}\".format(decoded))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The Encoding structure exposes multiple properties which are useful when working with transformers models\n",
    "\n",
    "- normalized_str: The input string after normalization (lower-casing, unicode, stripping, etc.)\n",
    "- original_str: The input string as it was provided\n",
    "- tokens: The generated tokens with their string representation\n",
    "- input_ids: The generated tokens with their integer representation\n",
    "- attention_mask: If your input has been padded by the tokenizer, then this would be a vector of 1 for any non padded token and 0 for padded ones.\n",
    "- special_token_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
    "- type_ids: If your input was made of multiple \"parts\" such as (question, context), then this would be a vector with for each token the segment it belongs to.\n",
    "- overflowing: If your input has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}