Commit 30624f70 authored by Morgan Funtowicz

Fix Colab links + install dependencies first.


Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
parent ff9e79ba
notebooks/01-training-tokenizers.ipynb
@@ -2,6 +2,12 @@
 "cells": [
 {
 "cell_type": "markdown",
+"metadata": {
+"pycharm": {
+"is_executing": false,
+"name": "#%% md\n"
+}
+},
 "source": [
 "## Tokenization doesn't have to be slow!\n",
 "\n",
@@ -81,34 +87,46 @@
 "\n",
 "All of these building blocks can be combined to create working tokenization pipelines. \n",
 "In the next section we will go over our first pipeline."
-],
-"metadata": {
-"collapsed": false,
-"pycharm": {
-"name": "#%% md\n",
-"is_executing": false
-}
-}
+]
 },
 {
 "cell_type": "markdown",
+"metadata": {
+"pycharm": {
+"name": "#%% md\n"
+}
+},
 "source": [
 "Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
 "\n",
 "For this, we will train a Byte-Pair Encoding (BPE) tokenizer on quite a small input for the purpose of this notebook.\n",
-"We will work with [the file from peter Norving](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
-"This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer."
+"We will work with [the file from Peter Norvig](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
+"This file contains around 130,000 lines of raw text that will be processed by the library to generate a working tokenizer.\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
 "metadata": {
-"collapsed": false,
 "pycharm": {
-"name": "#%% md\n"
+"is_executing": false,
+"name": "#%% code\n"
 }
-}
+},
+"outputs": [],
+"source": [
+"!pip install tokenizers"
+]
 },
 {
 "cell_type": "code",
 "execution_count": 2,
+"metadata": {
+"pycharm": {
+"is_executing": false,
+"name": "#%% code\n"
+}
+},
 "outputs": [],
 "source": [
 "BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
@@ -122,33 +140,31 @@
 " big_f.write(response.content)\n",
 " else:\n",
 " print(\"Unable to get the file: {}\".format(response.reason))\n"
-],
-"metadata": {
-"collapsed": false,
-"pycharm": {
-"name": "#%% code\n",
-"is_executing": false
-}
-}
+]
 },
 {
 "cell_type": "markdown",
+"metadata": {
+"pycharm": {
+"is_executing": false,
+"name": "#%% md\n"
+}
+},
 "source": [
 " \n",
 "Now that we have our training data, we need to create the overall pipeline for the tokenizer.\n",
 " "
-],
-"metadata": {
-"collapsed": false,
-"pycharm": {
-"name": "#%% md\n",
-"is_executing": false
-}
-}
+]
 },
 {
 "cell_type": "code",
 "execution_count": 10,
+"metadata": {
+"pycharm": {
+"is_executing": false,
+"name": "#%% code\n"
+}
+},
 "outputs": [],
 "source": [
 "# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
@@ -165,49 +181,47 @@
 "tokenizer = Tokenizer(BPE.empty())\n",
 "\n",
 "# Then we enable lower-casing and unicode-normalization\n",
-"# The Sequence normalizer allows us to combine multiple Normalizer, that will be\n",
-"# executed in sequence.\n",
+"# The Sequence normalizer allows us to combine multiple Normalizers that will be\n",
+"# executed in order.\n",
 "tokenizer.normalizer = Sequence([\n",
 " NFKC(),\n",
 " Lowercase()\n",
 "])\n",
 "\n",
-"# Out tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
+"# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
 "tokenizer.pre_tokenizer = ByteLevel()\n",
 "\n",
 "# And finally, let's plug a decoder so we can recover from a tokenized input to the original one\n",
 "tokenizer.decoder = ByteLevelDecoder()"
-],
-"metadata": {
-"collapsed": false,
-"pycharm": {
-"name": "#%% code\n",
-"is_executing": false
-}
-}
+]
 },
 {
 "cell_type": "markdown",
-"source": [
-"The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
-],
 "metadata": {
-"collapsed": false,
 "pycharm": {
 "name": "#%% md\n"
 }
-}
+},
+"source": [
+"The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
+]
 },
 {
 "cell_type": "code",
 "execution_count": 11,
+"metadata": {
+"pycharm": {
+"is_executing": false,
+"name": "#%% code\n"
+}
+},
 "outputs": [
 {
 "name": "stdout",
+"output_type": "stream",
 "text": [
 "Trained vocab size: 25000\n"
-],
-"output_type": "stream"
+]
 }
 ],
 "source": [
@@ -218,79 +232,77 @@
 "tokenizer.train(trainer, [\"big.txt\"])\n",
 "\n",
 "print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
-],
-"metadata": {
-"collapsed": false,
-"pycharm": {
-"name": "#%% code\n",
-"is_executing": false
-}
-}
+]
 },
 {
 "cell_type": "markdown",
+"metadata": {
+"pycharm": {
+"name": "#%% md\n"
+}
+},
 "source": [
 "Et voilà! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
 "covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
 "on the `Trainer` class, but the overall process should be very similar.\n",
 "\n",
 "We can save the content of the model to reuse it later."
-],
-"metadata": {
-"collapsed": false,
-"pycharm": {
-"name": "#%% md\n"
-}
-}
+]
 },
 {
 "cell_type": "code",
 "execution_count": 12,
+"metadata": {
+"pycharm": {
+"is_executing": false,
+"name": "#%% code\n"
+}
+},
 "outputs": [
 {
 "data": {
-"text/plain": "['./vocab.json', './merges.txt']"
+"text/plain": [
+"['./vocab.json', './merges.txt']"
+]
 },
+"execution_count": 12,
 "metadata": {},
-"output_type": "execute_result",
-"execution_count": 12
+"output_type": "execute_result"
 }
 ],
 "source": [
 "# You will see the generated files in the output.\n",
 "tokenizer.model.save('.')"
-],
-"metadata": {
-"collapsed": false,
-"pycharm": {
-"name": "#%% code\n",
-"is_executing": false
-}
-}
+]
 },
 {
 "cell_type": "markdown",
-"source": [
-"Now, let load the trained model and start using out newly trained tokenizer"
-],
 "metadata": {
-"collapsed": false,
 "pycharm": {
 "name": "#%% md\n"
 }
-}
+},
+"source": [
+"Now, let's load the trained model and start using our newly trained tokenizer."
+]
 },
 {
 "cell_type": "code",
 "execution_count": 13,
+"metadata": {
+"pycharm": {
+"is_executing": false,
+"name": "#%% code\n"
+}
+},
 "outputs": [
 {
 "name": "stdout",
+"output_type": "stream",
 "text": [
 "Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n",
 "Decoded string: this is a simple input to be tokenized\n"
-],
-"output_type": "stream"
+]
 }
 ],
 "source": [
@@ -302,17 +314,15 @@
 "\n",
 "decoded = tokenizer.decode(encoding.ids)\n",
 "print(\"Decoded string: {}\".format(decoded))"
-],
-"metadata": {
-"collapsed": false,
-"pycharm": {
-"name": "#%% code\n",
-"is_executing": false
-}
-}
+]
 },
 {
 "cell_type": "markdown",
+"metadata": {
+"pycharm": {
+"name": "#%% md\n"
+}
+},
 "source": [
 "The Encoding structure exposes multiple properties which are useful when working with transformers models:\n",
 "\n",
@@ -324,13 +334,7 @@
 "- special_tokens_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
 "- type_ids: If your input was made of multiple \"parts\" such as (question, context), then this would be a vector giving, for each token, the segment it belongs to.\n",
 "- overflowing: If your input has been truncated into multiple subparts because of a length limit (for BERT, for example, the sequence length is limited to 512), this will contain all the remaining overflowing parts."
-],
-"metadata": {
-"collapsed": false,
-"pycharm": {
-"name": "#%% md\n"
-}
-}
+]
 }
 ],
 "metadata": {
@@ -342,25 +346,25 @@
 "language_info": {
 "codemirror_mode": {
 "name": "ipython",
-"version": 2
+"version": 3
 },
 "file_extension": ".py",
 "mimetype": "text/x-python",
 "name": "python",
 "nbconvert_exporter": "python",
-"pygments_lexer": "ipython2",
-"version": "2.7.6"
+"pygments_lexer": "ipython3",
+"version": "3.7.6"
 },
 "pycharm": {
 "stem_cell": {
 "cell_type": "raw",
-"source": [],
 "metadata": {
 "collapsed": false
-}
+},
+"source": []
 }
 }
 },
 "nbformat": 4,
-"nbformat_minor": 0
+"nbformat_minor": 1
 }
\ No newline at end of file
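
Taken together, the hunks above leave 01-training-tokenizers.ipynb building one complete pipeline, but the trainer construction and the download code fall in elided parts of the diff. Below is a minimal end-to-end sketch of that pipeline, assuming the `tokenizers` 0.x API that `BPE.empty()` implies; the `BpeTrainer` name and every trainer argument other than `vocab_size=25000` (confirmed by the printed output) are assumptions, not quotes from the notebook.

```python
# Minimal sketch of the pipeline assembled by 01-training-tokenizers.ipynb,
# assuming the tokenizers 0.x API implied by `BPE.empty()` in the diff.
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase, NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Start from an empty Byte-Pair Encoding model and attach the components
# configured in the cells above: normalize, pre-tokenize, decode.
tokenizer = Tokenizer(BPE.empty())
tokenizer.normalizer = Sequence([NFKC(), Lowercase()])  # applied in order
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()

# Train on the downloaded corpus. The trainer arguments are an assumption;
# only vocab_size=25000 is confirmed by the "Trained vocab size: 25000" output.
trainer = BpeTrainer(vocab_size=25000, show_progress=True)
tokenizer.train(trainer, ["big.txt"])
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

# Persist the model (writes ./vocab.json and ./merges.txt, per the cell output),
# then round-trip a sentence exactly like the notebook's final cells.
tokenizer.model.save(".")
encoding = tokenizer.encode("this is a simple input to be tokenized")
print("Encoded string: {}".format(encoding.tokens))
print("Decoded string: {}".format(tokenizer.decode(encoding.ids)))
```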
notebooks/02-transformers.ipynb
@@ -75,6 +75,20 @@
 "in PyTorch and TensorFlow in a transparent and interchangeable way. "
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"outputs": [],
+"source": [
+"!pip install transformers"
+],
+"metadata": {
+"collapsed": false,
+"pycharm": {
+"name": "#%% code\n"
+}
+}
+},
 {
 "cell_type": "code",
 "execution_count": 74,
...
notebooks/03-pipelines.ipynb
@@ -51,6 +51,20 @@
 "```"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"outputs": [],
+"source": [
+"!pip install transformers"
+],
+"metadata": {
+"collapsed": false,
+"pycharm": {
+"name": "#%% code\n"
+}
+}
+},
 {
 "cell_type": "code",
 "execution_count": 29,
...
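
Both notebooks above gain the same `!pip install transformers` cell ahead of the first import, which is what makes them run on a fresh Colab instance. As an aside, a common variant (not what this commit uses; the package list here is illustrative) routes the install through the kernel's own interpreter, so the packages land in the environment the notebook is actually running in:

```python
# Hypothetical variant of the added install cells: routing pip through
# sys.executable guarantees the install targets the same Python that
# backs this notebook kernel, even when several Pythons coexist.
import sys
!{sys.executable} -m pip install tokenizers transformers
```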
notebooks/README.md
@@ -11,7 +11,7 @@ Pull Request and we'll review it so it can be included here.
 | Notebook | Description | |
 |:----------|:-------------:|------:|
-| [Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
-| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
-| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
+| [Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
+| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
+| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
 | [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
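
Back in 01-training-tokenizers.ipynb, the closing markdown cell lists the properties exposed by the Encoding structure. Here is a short sketch of inspecting them on the tokenizer trained above; the first three attribute names fall in an elided part of the diff, so they are taken from the `tokenizers` library's Encoding object rather than quoted from the notebook:

```python
# Inspect the Encoding produced by the tokenizer trained in the first notebook.
encoding = tokenizer.encode("this is a simple input to be tokenized")

print(encoding.ids)                  # integer ids to feed a model
print(encoding.tokens)               # string form of each token
print(encoding.attention_mask)       # 1 for real tokens, 0 for padding
print(encoding.type_ids)             # segment id per token, e.g. (question, context)
print(encoding.special_tokens_mask)  # 1 where a special token ([CLS], [SEP], ...) was added
print(encoding.overflowing)          # leftover parts when a length limit truncates the input
```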