Unverified Commit 4d489735 authored by Sylvain Gugger, committed by GitHub

Update notebook table and transformers intro notebook (#9136)

parent fb650df8
@@ -55,11 +55,11 @@ Coming soon!
 |---|---|:---:|:---:|:---:|:---:|
 | [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
 | [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
-| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | -
+| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
 | [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
 | [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
 | [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
-| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
+| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
 | [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -
...
@@ -73,12 +73,14 @@
 "\n",
 "The transformers library allows you to benefit from large, pretrained language models without requiring a huge and costly computational\n",
 "infrastructure. Most of the state-of-the-art models are provided directly by their authors and made available in the library \n",
-"in PyTorch and TensorFlow in a transparent and interchangeable way. "
+"in PyTorch and TensorFlow in a transparent and interchangeable way. \n",
+"\n",
+"If you're executing this notebook in Colab, you will need to install the transformers library. You can do so with this command:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 2,
 "metadata": {
 "id": "KnT3Jn6fSXai",
 "pycharm": {
@@ -89,13 +91,12 @@
 },
 "outputs": [],
 "source": [
-"!pip install transformers\n",
-"!pip install --upgrade tensorflow"
+"# !pip install transformers"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 3,
 "metadata": {
 "colab": {
 "base_uri": "https://localhost:8080/"
@@ -111,13 +112,11 @@
 {
 "data": {
 "text/plain": [
-"<torch.autograd.grad_mode.set_grad_enabled at 0x7f9c03e5b3c8>"
+"<torch.autograd.grad_mode.set_grad_enabled at 0x7ff0cc2a2c50>"
 ]
 },
-"execution_count": 2,
-"metadata": {
-"tags": []
-},
+"execution_count": 3,
+"metadata": {},
 "output_type": "execute_result"
 }
 ],
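The source of the cell whose result appears above is collapsed in this view. Judging from the repr, it most likely calls `torch.set_grad_enabled(False)`; a minimal sketch, assuming that is indeed what the notebook does:

```python
import torch

# Disable autograd globally: the notebook only runs inference, so gradient
# tracking is unnecessary. The call returns the context-manager object whose
# repr is shown in the output above.
torch.set_grad_enabled(False)
```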
@@ -130,7 +129,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 4,
 "metadata": {
 "id": "1xMDTHQXSXai",
 "pycharm": {
@@ -159,103 +158,56 @@
 "source": [
 "With only the above two lines of code, you're ready to use a BERT pre-trained model. \n",
 "The tokenizer will allow us to map a raw textual input to a sequence of integers representing our textual input\n",
-"in a way the model can manipulate."
+"in a way the model can manipulate. Since we will be using a PyTorch model, we ask the tokenizer to return PyTorch tensors."
 ]
 },
 {
 "cell_type": "code",
-"execution_count": null,
-"metadata": {
-"colab": {
-"base_uri": "https://localhost:8080/"
-},
-"id": "XgkFg52fSXai",
-"outputId": "94b569d4-5415-4327-f39e-c9541b0a53e0",
-"pycharm": {
-"is_executing": false,
-"name": "#%% code\n"
-}
-},
+"execution_count": 6,
+"metadata": {},
"outputs": [ "outputs": [
{ {
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"Tokens: ['This', 'is', 'an', 'input', 'example']\n", "input_ids:\n",
"Tokens id: [1188, 1110, 1126, 7758, 1859]\n", "\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"Tokens PyTorch: tensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n", "token_type_ids:\n",
"Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n" "\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n"
] ]
} }
], ],
"source": [ "source": [
"# Tokens comes from a process that splits the input into sub-entities with interesting linguistic properties. \n", "tokens_pt = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"tokens = tokenizer.tokenize(\"This is an input example\")\n", "for key, value in tokens_pt.items():\n",
"print(\"Tokens: {}\".format(tokens))\n", " print(\"{}:\\n\\t{}\".format(key, value))"
"\n",
"# This is not sufficient for the model, as it requires integers as input, \n",
"# not a problem, let's convert tokens to ids.\n",
"tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"print(\"Tokens id: {}\".format(tokens_ids))\n",
"\n",
"# Add the required special tokens\n",
"tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)\n",
"\n",
"# We need to convert to a Deep Learning framework specific format, let's use PyTorch for now.\n",
"tokens_pt = torch.tensor([tokens_ids])\n",
"print(\"Tokens PyTorch: {}\".format(tokens_pt))\n",
"\n",
"# Now we're ready to go through BERT with out input\n",
"outputs = model(tokens_pt)\n",
"last_hidden_state = outputs.last_hidden_state\n",
"pooler_output = outputs.pooler_output\n",
"\n",
"print(\"Token wise output: {}, Pooled output: {}\".format(last_hidden_state.shape, pooler_output.shape))"
] ]
}, },
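Reading the ids back is a handy way to see what the tokenizer actually did. A small sketch, not part of the committed notebook, using the tokenizer's standard `convert_ids_to_tokens` and `decode` helpers:

```python
# Illustrative only: map the ids shown above back to readable tokens.
# Note the [CLS]/[SEP] special tokens the tokenizer inserted for BERT.
ids = tokens_pt["input_ids"][0].tolist()
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'This', 'is', 'an', 'input', 'example', '[SEP]']
print(tokenizer.decode(ids))
# [CLS] This is an input example [SEP]
```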
 {
 "cell_type": "markdown",
-"metadata": {
-"id": "lBbvwNKXSXaj",
-"pycharm": {
-"name": "#%% md\n"
-}
-},
+"metadata": {},
 "source": [
-"As you can see, BERT outputs two tensors:\n",
-" - One with the generated representation for every token in the input `(1, NB_TOKENS, REPRESENTATION_SIZE)`\n",
-" - One with an aggregated representation for the whole input `(1, REPRESENTATION_SIZE)`\n",
-" \n",
-"The first, token-based, representation can be leveraged if your task requires keeping the sequence representation and you\n",
-"want to operate at the token level. This is particularly useful for Named Entity Recognition and Question-Answering.\n",
-"\n",
-"The second, aggregated, representation is especially useful if you need to extract the overall context of the sequence and don't\n",
-"require a fine-grained token level. This is the case for Sentiment-Analysis of the sequence or Information Retrieval."
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {
-"id": "DCxuDWH2SXaj",
-"pycharm": {
-"name": "#%% md\n"
-}
-},
-"source": [
-"The code you saw in the previous section introduced all the steps required to do a simple model invocation.\n",
-"For more day-to-day usage, transformers provides you higher-level methods which will make your NLP journey easier.\n",
-"Let's improve our previous example."
+"The tokenizer automatically converted our input to all the inputs expected by the model. It generated some additional tensors on top of the IDs: \n",
+"\n",
+"- token_type_ids: This tensor maps every token to its corresponding segment (see below).\n",
+"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequences with different lengths (see below).\n",
+"\n",
+"You can check our [glossary](https://huggingface.co/transformers/glossary.html) for more information about each of those keys. \n",
+"\n",
+"We can just feed this directly into our model:"
 ]
 },
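Both "(see below)" pointers refer to cells further down that this view collapses. As a quick illustration of what attention_mask encodes (a sketch, not one of the committed cells), tokenize a batch of sequences of different lengths with padding enabled:

```python
# Sketch only: padding=True pads every sequence to the longest one in the batch.
batch = tokenizer(
    ["This is a short input", "This is a much, much longer input example"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # both rows share the padded length
print(batch["attention_mask"])    # 1 marks real tokens, 0 marks padding
```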
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 7,
 "metadata": {
 "colab": {
 "base_uri": "https://localhost:8080/"
 },
-"id": "sgcNCdXUSXaj",
-"outputId": "af2fb928-7c17-475b-cf81-89cfc4b1d9e5",
+"id": "XgkFg52fSXai",
+"outputId": "94b569d4-5415-4327-f39e-c9541b0a53e0",
 "pycharm": {
 "is_executing": false,
 "name": "#%% code\n"
@@ -266,52 +218,41 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"input_ids:\n", "Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n"
"\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"token_type_ids:\n",
"\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n",
"Difference with previous code: (0.0, 0.0)\n"
] ]
} }
], ],
"source": [ "source": [
"# tokens = tokenizer.tokenize(\"This is an input example\")\n", "outputs = model(**tokens_pt)\n",
"# tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n", "last_hidden_state = outputs.last_hidden_state\n",
"# tokens_pt = torch.tensor([tokens_ids])\n", "pooler_output = outputs.pooler_output\n",
"\n",
"# This code can be factored into one-line as follow\n",
"tokens_pt2 = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"\n",
"for key, value in tokens_pt2.items():\n",
" print(\"{}:\\n\\t{}\".format(key, value))\n",
"\n",
"outputs2 = model(**tokens_pt2)\n",
"last_hidden_state2 = outputs2.last_hidden_state\n",
"pooler_output2 = outputs2.pooler_output\n",
"\n", "\n",
"print(\"Difference with previous code: ({}, {})\".format((last_hidden_state2 - last_hidden_state).sum(), (pooler_output2 - pooler_output).sum()))" "print(\"Token wise output: {}, Pooled output: {}\".format(last_hidden_state.shape, pooler_output.shape))"
] ]
}, },
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "gC-7xGYPSXal"
+"id": "lBbvwNKXSXaj",
+"pycharm": {
+"name": "#%% md\n"
+}
 },
 "source": [
-"As you can see above, calling the tokenizer provides a convenient way to generate all the required parameters\n",
-"that will go through the model. \n",
-"\n",
-"Moreover, you might have noticed it generated some additional tensors: \n",
-"\n",
-"- token_type_ids: This tensor maps every token to its corresponding segment (see below).\n",
-"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequences with different lengths (see below)."
+"As you can see, BERT outputs two tensors:\n",
+" - One with the generated representation for every token in the input `(1, NB_TOKENS, REPRESENTATION_SIZE)`\n",
+" - One with an aggregated representation for the whole input `(1, REPRESENTATION_SIZE)`\n",
+" \n",
+"The first, token-based, representation can be leveraged if your task requires keeping the sequence representation and you\n",
+"want to operate at the token level. This is particularly useful for Named Entity Recognition and Question-Answering.\n",
+"\n",
+"The second, aggregated, representation is especially useful if you need to extract the overall context of the sequence and don't\n",
+"require a fine-grained token level. This is the case for Sentiment-Analysis of the sequence or Information Retrieval."
 ]
 },
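The pooled vector is not the only way to summarize a whole sequence. As an aside (a sketch assuming `tokens_pt` and `last_hidden_state` from the cells above are still in scope; not part of the committed notebook), two common sequence-level vectors derived from the token-level output:

```python
# Sketch: build a single sequence-level vector from the token-level output.
cls_vector = last_hidden_state[:, 0]  # representation of the [CLS] token
mask = tokens_pt["attention_mask"].unsqueeze(-1).float()  # (1, NB_TOKENS, 1)
mean_vector = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean
print(cls_vector.shape, mean_vector.shape)  # both torch.Size([1, 768])
```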
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 8,
 "metadata": {
 "colab": {
 "base_uri": "https://localhost:8080/"
@@ -357,7 +298,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 9,
 "metadata": {
 "colab": {
 "base_uri": "https://localhost:8080/"
@@ -414,14 +355,47 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 10,
 "metadata": {
 "id": "Kubwm-wJSXan",
 "pycharm": {
 "is_executing": false
 }
 },
-"outputs": [],
+"outputs": [
+{
+"data": {
+"application/vnd.jupyter.widget-view+json": {
+"model_id": "3b971be3639d4fedb02778fb5c6898a0",
+"version_major": 2,
+"version_minor": 0
+},
+"text/plain": [
+"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=526681800.0, style=ProgressStyle(descri…"
+]
+},
+"metadata": {},
+"output_type": "display_data"
+},
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"\n"
+]
+},
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']\n",
+"- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
+"- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
+"All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.\n",
+"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
+]
+}
+],
 "source": [
 "from transformers import TFBertModel, BertModel\n",
 "\n",
@@ -432,7 +406,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 12,
+"execution_count": 11,
 "metadata": {
 "colab": {
 "base_uri": "https://localhost:8080/"
@@ -448,8 +422,8 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"last_hidden_state differences: 1.0094e-05\n", "last_hidden_state differences: 1.2933e-05\n",
"pooler_output differences: 7.2969e-07\n" "pooler_output differences: 2.9691e-06\n"
] ]
} }
], ],
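The source of the cell that prints these numbers is collapsed here. A minimal sketch of how such a PyTorch/TensorFlow comparison can be written, assuming `model_pt` and `model_tf` were both loaded from the `bert-base-cased` checkpoint as in the previous cell:

```python
import numpy as np

# Sketch only: run the same sentence through both backends and compare.
input_pt = tokenizer("This is an input example", return_tensors="pt")
input_tf = tokenizer("This is an input example", return_tensors="tf")
output_pt = model_pt(**input_pt)
output_tf = model_tf(**input_tf)

# Differences on the order of 1e-5 are pure floating-point noise.
diff = np.abs(
    output_pt.last_hidden_state.detach().numpy()
    - output_tf.last_hidden_state.numpy()
).max()
print("last_hidden_state max difference: {:.4e}".format(diff))
```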
@@ -482,7 +456,7 @@
 "\n",
 "For example, a few months ago Google released **T5**, an Encoder/Decoder architecture based on the Transformer and available in `transformers`, with no less than 11 billion parameters. Microsoft also recently entered the game with **Turing-NLG**, using 17 billion parameters. Models of this size require tens of gigabytes to store their weights and a tremendous compute infrastructure to run, which makes them impracticable for most people!\n",
 "\n",
-"![transformers-parameters](https://lh5.googleusercontent.com/NRdXzEcgZV3ooykjIaTm9uvbr9QnSjDQHHAHb2kk_Lm9lIF0AhS-PJdXGzpcBDztax922XAp386hyNmWZYsZC1lUN2r4Ip5p9v-PHO19-jevRGg4iQFxgv5Olq4DWaqSA_8ptep7)\n",
+"![transformers-parameters](https://github.com/huggingface/notebooks/blob/master/examples/images/model_parameters.png?raw=true)\n",
 "\n",
 "With the goal of making Transformer-based NLP accessible to everyone, we @huggingface developed models that take advantage of a training process called **Distillation**, which allows us to drastically reduce the resources needed to run such models with almost zero drop in performance.\n",
 "\n",
@@ -673,7 +647,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.9"
 },
 "pycharm": {
 "stem_cell": {
...