Commit 30624f70 authored by Morgan Funtowicz

Fix Colab links + install dependencies first.


Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
parent ff9e79ba
@@ -2,6 +2,12 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% md\n"
}
},
"source": [
"## Tokenization doesn't have to be slow !\n",
"\n",
@@ -81,34 +87,46 @@
"\n",
"All of these building blocks can be combined to create working tokenization pipelines. \n",
"In the next section we will go over our first pipeline."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n",
"is_executing": false
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
"\n",
"For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a quite small input for the purpose of this notebook.\n",
"We will work with [the file from peter Norving](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
"This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer."
],
"We will work with [the file from Peter Norving](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
"This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
"is_executing": false,
"name": "#%% code\n"
}
}
},
"outputs": [],
"source": [
"!pip install tokenizers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [],
"source": [
"BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
@@ -122,33 +140,31 @@
" big_f.write(response.content)\n",
" else:\n",
" print(\"Unable to get the file: {}\".format(response.reason))\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% code\n",
"is_executing": false
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% md\n"
}
},
"source": [
" \n",
"Now that we have our training data we need to create the overall pipeline for the tokenizer\n",
" "
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n",
"is_executing": false
}
}
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [],
"source": [
"# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
@@ -165,49 +181,47 @@
"tokenizer = Tokenizer(BPE.empty())\n",
"\n",
"# Then we enable lower-casing and unicode-normalization\n",
"# The Sequence normalizer allows us to combine multiple Normalizer, that will be\n",
"# executed in sequence.\n",
"# The Sequence normalizer allows us to combine multiple Normalizer that will be\n",
"# executed in order.\n",
"tokenizer.normalizer = Sequence([\n",
" NFKC(),\n",
" Lowercase()\n",
"])\n",
"\n",
"# Out tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
"# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
"tokenizer.pre_tokenizer = ByteLevel()\n",
"\n",
"# And finally, let's plug a decoder so we can recover from a tokenized input to the original one\n",
"tokenizer.decoder = ByteLevelDecoder()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% code\n",
"is_executing": false
}
}
]
},
{
"cell_type": "markdown",
"source": [
"The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
"source": [
"The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Trained vocab size: 25000\n"
],
"output_type": "stream"
]
}
],
"source": [
@@ -218,79 +232,77 @@
"tokenizer.train(trainer, [\"big.txt\"])\n",
"\n",
"print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% code\n",
"is_executing": false
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Et voilà ! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
"covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
"on the `Trainer` class, but the overall process should be very similar.\n",
"\n",
"We can save the content of the model to reuse it later."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
},
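As a side note on those parameters, here is a minimal sketch (an illustration, not the notebook's own code) of how special tokens might be registered at training time. It assumes the `BpeTrainer` class from `tokenizers.trainers` and reuses the `tokenizer.train(trainer, files)` call shape seen above; exact keyword names can vary between `tokenizers` versions:

```python
from tokenizers.trainers import BpeTrainer

# Sketch only: declare special tokens up front so the trainer reserves ids for
# them and the model never splits them. Keyword names are assumptions based on
# early `tokenizers` releases and may differ in newer versions.
trainer = BpeTrainer(
    vocab_size=25000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(trainer, ["big.txt"])
```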
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "['./vocab.json', './merges.txt']"
"text/plain": [
"['./vocab.json', './merges.txt']"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result",
"execution_count": 12
"output_type": "execute_result"
}
],
"source": [
"# You will see the generated files in the output.\n",
"tokenizer.model.save('.')"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% code\n",
"is_executing": false
}
}
]
},
{
"cell_type": "markdown",
"source": [
"Now, let load the trained model and start using out newly trained tokenizer"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
"source": [
"Now, let load the trained model and start using out newly trained tokenizer"
]
},
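The diff elides the loading lines in the next cell, so here is one way that step might look, as a hedged sketch rather than the notebook's actual code. It assumes the `BPE.from_files` constructor shipped in early `tokenizers` releases and the `vocab.json`/`merges.txt` files written by `tokenizer.model.save('.')` above:

```python
from tokenizers.models import BPE

# Sketch only: rebuild the BPE model from the files saved earlier and plug it
# back into the existing tokenizer. `from_files` is assumed from early
# `tokenizers` versions; newer releases construct it as BPE(vocab, merges).
tokenizer.model = BPE.from_files('./vocab.json', './merges.txt')
```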
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n",
"Decoded string: this is a simple input to be tokenized\n"
],
"output_type": "stream"
]
}
],
"source": [
@@ -302,17 +314,15 @@
"\n",
"decoded = tokenizer.decode(encoding.ids)\n",
"print(\"Decoded string: {}\".format(decoded))"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% code\n",
"is_executing": false
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The Encoding structure exposes multiple properties which are useful when working with transformers models\n",
"\n",
@@ -324,13 +334,7 @@
"- special_token_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
"- type_ids: If your was made of multiple \"parts\" such as (question, context), then this would be a vector with for each token the segment it belongs to.\n",
"- overflowing: If your has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
}
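As a quick illustration of the property list above, here is a minimal sketch (reusing the tokenizer built earlier in the notebook; property names are exactly those documented in that cell):

```python
# Sketch: encode a sentence and inspect a few of the Encoding's properties.
encoding = tokenizer.encode("this is a simple input to be tokenized")

print(encoding.tokens)          # the string tokens, e.g. ['Ġthis', 'Ġis', ...]
print(encoding.ids)             # the vocabulary indices fed to the model
print(encoding.attention_mask)  # 1 for real tokens, 0 for padding
print(encoding.type_ids)        # the segment ("part") each token belongs to
```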
],
"metadata": {
@@ -342,25 +346,25 @@
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}
\ No newline at end of file
"nbformat_minor": 1
}
@@ -75,6 +75,20 @@
"in PyTorch and TensorFlow in a transparent and interchangeable way. "
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"!pip install transformers"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% code\n"
}
}
},
{
"cell_type": "code",
"execution_count": 74,
......
@@ -51,6 +51,20 @@
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"!pip install transformers"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% code\n"
}
}
},
{
"cell_type": "code",
"execution_count": 29,
......
@@ -11,7 +11,7 @@ Pull Request and we'll review it so it can be included here.
| Notebook | Description | |
|:----------|:-------------:|------:|
| [Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
| [Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) |
| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) |
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlights all the steps needed to effectively train a Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|