Unverified Commit 146c5212 authored by Lysandre Debut, committed by GitHub

Merge branch 'master' into add_models_special_tokens_to_specific_configs

parents f5b50c6b b623ddc0
---
language: english
thumbnail:
---
# [BERT](https://huggingface.co/deepset/bert-base-cased-squad2) fine-tuned on [QNLI](https://github.com/rhythmcao/QNLI) + compression ([BERT-of-Theseus](https://github.com/JetRunner/BERT-of-Theseus))
I took a [BERT model fine-tuned on **SQuAD v2**](https://huggingface.co/deepset/bert-base-cased-squad2) and then fine-tuned it on **QNLI** using **compression** (with a constant replacing rate) as proposed in **BERT-of-Theseus**.
## Details of the downstream task (QNLI):
### Getting the dataset
```bash
wget https://raw.githubusercontent.com/rhythmcao/QNLI/master/data/QNLI/train.tsv
wget https://raw.githubusercontent.com/rhythmcao/QNLI/master/data/QNLI/test.tsv
wget https://raw.githubusercontent.com/rhythmcao/QNLI/master/data/QNLI/dev.tsv
mkdir QNLI_dataset
mv *.tsv QNLI_dataset
```
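The TSV files should follow the GLUE QNLI layout (tab-separated: index, question, sentence, label); a quick sanity check before training, as a sketch:
```python
# Print the header and the first example of the training file
with open("QNLI_dataset/train.tsv") as f:
    print(f.readline().strip())
    print(f.readline().strip())
```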
### Model training
The model was trained on a Tesla P100 GPU with 25GB of RAM, using the following command:
```bash
!python /content/BERT-of-Theseus/run_glue.py \
--model_name_or_path deepset/bert-base-cased-squad2 \
--task_name qnli \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /content/QNLI_dataset \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 2e-5 \
--save_steps 2000 \
--num_train_epochs 50 \
--output_dir /content/ouput_dir \
--evaluate_during_training \
--replacing_rate 0.7 \
--steps_for_replacing 2500
```
## Metrics:
| Model | Accuracy |
|-----------------|------|
| BERT-base | 91.2 |
| BERT-of-Theseus | 88.8 |
| [bert-uncased-finetuned-qnli](https://huggingface.co/mrm8488/bert-uncased-finetuned-qnli) | 87.2 |
| DistilBERT      | 85.3 |
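
## Model in action (sketch)

A minimal usage sketch, not part of the original card: it assumes the resulting checkpoint is published on the model hub, with a placeholder id you would replace by the actual repo name.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "<this-model-id>"  # hypothetical placeholder: the card does not state the final repo name
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

question = "Who invented the telephone?"
sentence = "Alexander Graham Bell is credited with inventing the telephone."
inputs = tokenizer.encode_plus(question, sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]
print(logits.argmax(dim=1))  # predicted QNLI label index (entailment vs. not_entailment)
```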
> [See all my models](https://huggingface.co/models?search=mrm8488)
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% md\n"
}
},
"source": [
"## Tokenization doesn't have to be slow !\n",
"\n",
"### Introduction\n",
"\n",
"Before going deep into any Machine Learning or Deep Learning Natural Language Processing models, every practitioner\n",
"should find a way to map raw input strings to a representation understandable by a trainable model.\n",
"\n",
"One very simple approach would be to split inputs over every space and assign an identifier to each word. This approach\n",
"would look similar to the code below in python\n",
"\n",
"```python\n",
"s = \"very long corpus...\"\n",
"words = s.split(\" \") # Split over space\n",
"vocabulary = dict(enumerate(set(words))) # Map storing the word to it's corresponding id\n",
"```\n",
"\n",
"This approach might work well if your vocabulary remains small as it would store every word (or **token**) present in your original\n",
"input. Moreover, word variations like \"cat\" and \"cats\" would not share the same identifiers even if their meaning is \n",
"quite close.\n",
"\n",
"![tokenization_simple](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/11/tokenization.png)\n",
"\n",
"### Subtoken Tokenization\n",
"\n",
"To overcome the issues described above, recent works have been done on tokenization, leveraging \"subtoken\" tokenization.\n",
"**Subtokens** extends the previous splitting strategy to furthermore explode a word into grammatically logicial sub-components learned\n",
"from the data.\n",
"\n",
"Taking our previous example of the words __cat__ and __cats__, a sub-tokenization of the word __cats__ would be [cat, ##s]. Where the prefix _\"##\"_ indicates a subtoken of the initial input. \n",
"Such training algorithms might extract sub-tokens such as _\"##ing\"_, _\"##ed\"_ over English corpus.\n",
"\n",
"As you might think of, this kind of sub-tokens construction leveraging compositions of _\"pieces\"_ overall reduces the size\n",
"of the vocabulary you have to carry to train a Machine Learning model. On the other side, as one token might be exploded\n",
"into multiple subtokens, the input of your model might increase and become an issue on model with non-linear complexity over the input sequence's length. \n",
" \n",
"![subtokenization](https://nlp.fast.ai/images/multifit_vocabularies.png)\n",
" \n",
"Among all the tokenization algorithms, we can highlight a few subtokens algorithms used in Transformers-based SoTA models : \n",
"\n",
"- [Byte Pair Encoding (BPE) - Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909)\n",
"- [Word Piece - Japanese and Korean voice search (Schuster, M., and Nakajima, K., 2015)](https://research.google/pubs/pub37842/)\n",
"- [Unigram Language Model - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, T., 2018)](https://arxiv.org/abs/1804.10959)\n",
"- [Sentence Piece - A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Taku Kudo and John Richardson, 2018)](https://arxiv.org/abs/1808.06226)\n",
"\n",
"Going through all of them is out of the scope of this notebook, so we will just highlight how you can use them.\n",
"\n",
"### @huggingface/tokenizers library \n",
"Along with the transformers library, we @huggingface provide a blazing fast tokenization library\n",
"able to train, tokenize and decode dozens of Gb/s of text on a common multi-core machine.\n",
"\n",
"The library is written in Rust allowing us to take full advantage of multi-core parallel computations in a native and memory-aware way, on-top of which \n",
"we provide bindings for Python and NodeJS (more bindings may be added in the future). \n",
"\n",
"We designed the library so that it provides all the required blocks to create end-to-end tokenizers in an interchangeable way. In that sense, we provide\n",
"these various components: \n",
"\n",
"- **Normalizer**: Executes all the initial transformations over the initial input string. For example when you need to\n",
"lowercase some text, maybe strip it, or even apply one of the common unicode normalization process, you will add a Normalizer. \n",
"- **PreTokenizer**: In charge of splitting the initial input string. That's the component that decides where and how to\n",
"pre-segment the origin string. The simplest example would be like we saw before, to simply split on spaces.\n",
"- **Model**: Handles all the sub-token discovery and generation, this part is trainable and really dependant\n",
" of your input data.\n",
"- **Post-Processor**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA\n",
"models. For instance, for BERT it would wrap the tokenized sentence around [CLS] and [SEP] tokens.\n",
"- **Decoder**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according\n",
"to the `PreTokenizer` we used previously.\n",
"- **Trainer**: Provides training capabilities to each model.\n",
"\n",
"For each of the components above we provide multiple implementations:\n",
"\n",
"- **Normalizer**: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...\n",
"- **PreTokenizer**: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...\n",
"- **Model**: WordLevel, BPE, WordPiece\n",
"- **Post-Processor**: BertProcessor, ...\n",
"- **Decoder**: WordLevel, BPE, WordPiece, ...\n",
"\n",
"All of these building blocks can be combined to create working tokenization pipelines. \n",
"In the next section we will go over our first pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
"\n",
"For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a quite small input for the purpose of this notebook.\n",
"We will work with [the file from Peter Norving](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
"This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [],
"source": [
"!pip install tokenizers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [],
"source": [
"BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
"\n",
"# Let's download the file and save it somewhere\n",
"from requests import get\n",
"with open('big.txt', 'wb') as big_f:\n",
" response = get(BIG_FILE_URL, )\n",
" \n",
" if response.status_code == 200:\n",
" big_f.write(response.content)\n",
" else:\n",
" print(\"Unable to get the file: {}\".format(response.reason))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% md\n"
}
},
"source": [
" \n",
"Now that we have our training data we need to create the overall pipeline for the tokenizer\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [],
"source": [
"# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
"# the overall pipeline for various well-known tokenization algorithm. \n",
"# Everything described below can be replaced by the ByteLevelBPETokenizer class. \n",
"\n",
"from tokenizers import Tokenizer\n",
"from tokenizers.decoders import ByteLevel as ByteLevelDecoder\n",
"from tokenizers.models import BPE\n",
"from tokenizers.normalizers import Lowercase, NFKC, Sequence\n",
"from tokenizers.pre_tokenizers import ByteLevel\n",
"\n",
"# First we create an empty Byte-Pair Encoding model (i.e. not trained model)\n",
"tokenizer = Tokenizer(BPE.empty())\n",
"\n",
"# Then we enable lower-casing and unicode-normalization\n",
"# The Sequence normalizer allows us to combine multiple Normalizer that will be\n",
"# executed in order.\n",
"tokenizer.normalizer = Sequence([\n",
" NFKC(),\n",
" Lowercase()\n",
"])\n",
"\n",
"# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
"tokenizer.pre_tokenizer = ByteLevel()\n",
"\n",
"# And finally, let's plug a decoder so we can recover from a tokenized input to the original one\n",
"tokenizer.decoder = ByteLevelDecoder()"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Trained vocab size: 25000\n"
]
}
],
"source": [
"from tokenizers.trainers import BpeTrainer\n",
"\n",
"# We initialize our trainer, giving him the details about the vocabulary we want to generate\n",
"trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())\n",
"tokenizer.train(trainer, [\"big.txt\"])\n",
"\n",
"print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Et voilà ! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
"covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
"on the `Trainer` class, but the overall process should be very similar.\n",
"\n",
"We can save the content of the model to reuse it later."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['./vocab.json', './merges.txt']"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# You will see the generated files in the output.\n",
"tokenizer.model.save('.')"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Now, let load the trained model and start using out newly trained tokenizer"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n",
"Decoded string: this is a simple input to be tokenized\n"
]
}
],
"source": [
"# Let's tokenizer a simple input\n",
"tokenizer.model = BPE.from_files('vocab.json', 'merges.txt')\n",
"encoding = tokenizer.encode(\"This is a simple input to be tokenized\")\n",
"\n",
"print(\"Encoded string: {}\".format(encoding.tokens))\n",
"\n",
"decoded = tokenizer.decode(encoding.ids)\n",
"print(\"Decoded string: {}\".format(decoded))"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The Encoding structure exposes multiple properties which are useful when working with transformers models\n",
"\n",
"- normalized_str: The input string after normalization (lower-casing, unicode, stripping, etc.)\n",
"- original_str: The input string as it was provided\n",
"- tokens: The generated tokens with their string representation\n",
"- input_ids: The generated tokens with their integer representation\n",
"- attention_mask: If your input has been padded by the tokenizer, then this would be a vector of 1 for any non padded token and 0 for padded ones.\n",
"- special_token_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
"- type_ids: If your was made of multiple \"parts\" such as (question, context), then this would be a vector with for each token the segment it belongs to.\n",
"- overflowing: If your has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts."
]
}
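,
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# An illustrative sketch (not part of the original notebook): inspect a few of the\n",
  "# Encoding properties described above. Attribute names follow the tokenizers API at\n",
  "# the time of writing and may vary across versions.\n",
  "print(\"tokens        : {}\".format(encoding.tokens))\n",
  "print(\"ids           : {}\".format(encoding.ids))\n",
  "print(\"type_ids      : {}\".format(encoding.type_ids))\n",
  "print(\"attention_mask: {}\".format(encoding.attention_mask))\n"
 ]
}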
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## How can I leverage State-of-the-Art Natural Language Models with only one line of code ?"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Newly introduced in transformers v2.3.0, **pipelines** provides a high-level, easy to use,\n",
"API for doing inference over a variety of downstream-tasks, including: \n",
"\n",
"- Sentence Classification (Sentiment Analysis): Indicate if the overall sentence is either positive or negative. _(Binary Classification task or Logitic Regression task)_\n",
"- Token Classification (Named Entity Recognition, Part-of-Speech tagging): For each sub-entities _(**tokens**)_ in the input, assign them a label _(Classification task)_.\n",
"- Question-Answering: Provided a tuple (question, context) the model should find the span of text in **content** answering the **question**.\n",
"- Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided **context**.\n",
"- Feature Extraction: Maps the input to a higher, multi-dimensional space learned from the data.\n",
"\n",
"Pipelines encapsulate the overall process of every NLP process:\n",
" \n",
" 1. Tokenization: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).\n",
" 2. Inference: Maps every tokens into a more meaningful representation. \n",
" 3. Decoding: Use the above representation to generate and/or extract the final output for the underlying task.\n",
"\n",
"The overall API is exposed to the end-user through the `pipeline()` method with the following \n",
"structure:\n",
"\n",
"```python\n",
"from transformers import pipeline\n",
"\n",
"# Using default model and tokenizer for the task\n",
"pipeline(\"<task-name>\")\n",
"\n",
"# Using a user-specified model\n",
"pipeline(\"<task-name>\", model=\"<model_name>\")\n",
"\n",
"# Using custom model/tokenizer as str\n",
"pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"!pip install transformers"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% code\n"
}
}
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code \n"
}
},
"outputs": [],
"source": [
"from __future__ import print_function\n",
"import ipywidgets as widgets\n",
"from transformers import pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 1. Sentence Classification - Sentiment Analysis"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "c9db53f30b9446c0af03268633a966c0"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "[{'label': 'POSITIVE', 'score': 0.9997656}]"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 8
}
],
"source": [
"nlp_sentence_classif = pipeline('sentiment-analysis')\n",
"nlp_sentence_classif('Such a nice weather outside !')"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 2. Token Classification - Named Entity Recognition"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "1e300789e22644f1aed66a5ed60e75c4"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "[{'word': 'Hu', 'score': 0.9970937967300415, 'entity': 'I-ORG'},\n {'word': '##gging', 'score': 0.9345750212669373, 'entity': 'I-ORG'},\n {'word': 'Face', 'score': 0.9787060022354126, 'entity': 'I-ORG'},\n {'word': 'French', 'score': 0.9981995820999146, 'entity': 'I-MISC'},\n {'word': 'New', 'score': 0.9983047246932983, 'entity': 'I-LOC'},\n {'word': '-', 'score': 0.8913455009460449, 'entity': 'I-LOC'},\n {'word': 'York', 'score': 0.9979523420333862, 'entity': 'I-LOC'}]"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 9
}
],
"source": [
"nlp_token_class = pipeline('ner')\n",
"nlp_token_class('Hugging Face is a French company based in New-York.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Question Answering"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "82aca58f1ea24b4cb37f16402e8a5923"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"name": "stderr",
"text": [
"convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 225.51it/s]\n",
"add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 2158.67it/s]\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "{'score': 0.9632966867654424, 'start': 42, 'end': 50, 'answer': 'New-York.'}"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 10
}
],
"source": [
"nlp_qa = pipeline('question-answering')\n",
"nlp_qa(context='Hugging Face is a French company based in New-York.', question='Where is based Hugging Face ?')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Text Generation - Mask Filling"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "49df2227b4fa4eb28dcdcfc3d9261d0f"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',\n 'score': 0.23106691241264343,\n 'token': 2201},\n {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',\n 'score': 0.0819825753569603,\n 'token': 12790},\n {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',\n 'score': 0.04769463092088699,\n 'token': 11559},\n {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',\n 'score': 0.047622501850128174,\n 'token': 6497},\n {'sequence': '<s> Hugging Face is a French company based in France</s>',\n 'score': 0.04130595177412033,\n 'token': 1470}]"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 11
}
],
"source": [
"nlp_fill = pipeline('fill-mask')\n",
"nlp_fill('Hugging Face is a French company based in <mask>')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Projection - Features Extraction "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "2af4cfb19e3243dda014d0f56b48f4b2"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"text": [
"\n"
],
"output_type": "stream"
},
{
"data": {
"text/plain": "(1, 12, 768)"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 12
}
],
"source": [
"import numpy as np\n",
"nlp_features = pipeline('feature-extraction')\n",
"output = nlp_features('Hugging Face is a French company based in Paris')\n",
"np.array(output).shape # (Samples, Tokens, Vector Size)\n"
]
},
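{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# An illustrative sketch (not part of the original notebook): one common way to collapse\n",
  "# the per-token features above into a single fixed-size sentence vector is mean pooling.\n",
  "sentence_vector = np.array(output)[0].mean(axis=0)\n",
  "sentence_vector.shape  # (768,) for the default model used here\n"
 ]
},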
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Alright ! Now you have a nice picture of what is possible through transformers' pipelines, and there is more\n",
"to come in future releases. \n",
"\n",
"In the meantime, you can try the different pipelines with your own inputs"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"text/plain": "Dropdown(description='Task:', index=1, options=('sentiment-analysis', 'ner', 'fill_mask'), value='ner')",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "10bac065d46f4e4d9a8498dcc8104ecd"
}
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": "Text(value='', description='Your input:', placeholder='Enter something')",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "2c5f1411f7a94714bc00f01b0e3b27b2"
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"task = widgets.Dropdown(\n",
" options=['sentiment-analysis', 'ner', 'fill_mask'],\n",
" value='ner',\n",
" description='Task:',\n",
" disabled=False\n",
")\n",
"\n",
"input = widgets.Text(\n",
" value='',\n",
" placeholder='Enter something',\n",
" description='Your input:',\n",
" disabled=False\n",
")\n",
"\n",
"def forward(_):\n",
" if len(input.value) > 0: \n",
" if task.value == 'ner':\n",
" output = nlp_token_class(input.value)\n",
" elif task.value == 'sentiment-analysis':\n",
" output = nlp_sentence_classif(input.value)\n",
" else:\n",
" if input.value.find('<mask>') == -1:\n",
" output = nlp_fill(input.value + ' <mask>')\n",
" else:\n",
" output = nlp_fill(input.value) \n",
" print(output)\n",
"\n",
"input.on_submit(forward)\n",
"display(task, input)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% Question Answering\n"
}
},
"outputs": [
{
"data": {
"text/plain": "Textarea(value='Einstein is famous for the general theory of relativity', description='Context:', placeholder=…",
"application/vnd.jupyter.widget-view+json": {
"version_major": 2,
"version_minor": 0,
"model_id": "019fde2343634e94b6f32d04f6350ec1"
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"context = widgets.Textarea(\n",
" value='Einstein is famous for the general theory of relativity',\n",
" placeholder='Enter something',\n",
" description='Context:',\n",
" disabled=False\n",
")\n",
"\n",
"query = widgets.Text(\n",
" value='Why is Einstein famous for ?',\n",
" placeholder='Enter something',\n",
" description='Question:',\n",
" disabled=False\n",
")\n",
"\n",
"def forward(_):\n",
" if len(context.value) > 0 and len(query.value) > 0: \n",
" output = nlp_qa(question=query.value, context=context.value) \n",
" print(output)\n",
"\n",
"query.on_submit(forward)\n",
"display(context, query)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}
# Transformers Notebooks
Here you can find a list of the official notebooks provided by Hugging Face.
We would also like to list interesting content created by the community.
If you wrote a notebook leveraging transformers and would like it to be listed here, please open a
Pull Request so it can be reviewed and included.
## Hugging Face's notebooks :hugs:
| Notebook | Description | Colab |
|:----------|:-------------:|------:|
| [Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) |
| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) |
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlights all the steps needed to effectively train a Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
@@ -64,7 +64,7 @@ if stale_egg_info.exists():
 extras = {}
 extras["mecab"] = ["mecab-python3"]
-extras["sklearn"] = ["scikit-learn"]
+extras["sklearn"] = ["scikit-learn==0.22.1"]
 extras["tf"] = ["tensorflow"]
 extras["tf-cpu"] = ["tensorflow-cpu"]
 extras["torch"] = ["torch"]
@@ -136,7 +136,7 @@ if is_sklearn_available():

 # Modeling
 if is_torch_available():
-    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D
+    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, top_k_top_p_filtering
     from .modeling_auto import (
         AutoModel,
         AutoModelForPreTraining,
@@ -241,7 +241,7 @@ if is_torch_available():
         CamembertForTokenClassification,
         CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
     )
-    from .modeling_encoder_decoder import PreTrainedEncoderDecoder, Model2Model
+    from .modeling_encoder_decoder import PreTrainedEncoderDecoder
     from .modeling_t5 import (
         T5PreTrainedModel,
         T5Model,
@@ -255,6 +255,7 @@ if is_torch_available():
         AlbertForMaskedLM,
         AlbertForSequenceClassification,
         AlbertForQuestionAnswering,
+        AlbertForTokenClassification,
         load_tf_weights_in_albert,
         ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
     )
@@ -290,7 +291,13 @@ if is_torch_available():

 # TensorFlow
 if is_tf_available():
-    from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, TFSequenceSummary, shape_list
+    from .modeling_tf_utils import (
+        TFPreTrainedModel,
+        TFSharedEmbeddings,
+        TFSequenceSummary,
+        shape_list,
+        tf_top_k_top_p_filtering,
+    )
     from .modeling_tf_auto import (
         TFAutoModel,
         TFAutoModelForPreTraining,
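For context, a hedged sketch of how the newly exported `top_k_top_p_filtering` helper is typically used for sampling (illustrative, not part of this diff):

```python
import torch
from transformers import top_k_top_p_filtering

logits = torch.randn(1, 50257)  # stand-in next-token logits for illustration
filtered = top_k_top_p_filtering(logits, top_k=50, top_p=0.95)
next_token = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
```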
@@ -22,11 +22,10 @@ from .configuration_utils import PretrainedConfig

 logger = logging.getLogger(__name__)

-_bart_large_url = "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json"
-
 BART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
-    "bart-large": _bart_large_url,
-    "bart-large-mnli": _bart_large_url,  # fine as same
-    "bart-cnn": None,  # not done
+    "bart-large": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json",
+    "bart-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/config.json",
+    "bart-large-cnn": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json",
 }
@@ -59,6 +58,7 @@ class BartConfig(PretrainedConfig):
         classifier_dropout=0.0,
         output_past=False,
         num_labels=3,
+        bos_token_id=0,
         **common_kwargs
     ):
         r"""
@@ -67,12 +67,16 @@ class BartConfig(PretrainedConfig):
             config = BartConfig.from_pretrained('bart-large')
             model = BartModel(config)
         """
-        super().__init__(num_labels=num_labels, output_past=output_past, pad_token_id=pad_token_id, **common_kwargs)
+        super().__init__(
+            num_labels=num_labels,
+            output_past=output_past,
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            **common_kwargs,
+        )
         self.vocab_size = vocab_size
         self.d_model = d_model  # encoder_embed_dim and decoder_embed_dim
         self.eos_token_id = eos_token_id
         self.encoder_ffn_dim = encoder_ffn_dim
         self.encoder_layers = self.num_hidden_layers = encoder_layers
         self.encoder_attention_heads = encoder_attention_heads
@@ -109,6 +109,7 @@ class FlaubertConfig(XLMConfig):
             Argument used when doing sequence summary. Used in for the multiple choice head in
             :class:`~transformers.XLMForSequenceClassification`.
             Is one of the following options:
+
             - 'last' => take the last token hidden state (like XLNet)
             - 'first' => take the first token hidden state (like Bert)
             - 'mean' => take the mean of all tokens hidden states
@@ -73,6 +73,7 @@ class GPT2Config(PretrainedConfig):
             Argument used when doing sequence summary. Used in for the multiple choice head in
             :class:`~transformers.GPT2DoubleHeadsModel`.
             Is one of the following options:
+
             - 'last' => take the last token hidden state (like XLNet)
             - 'first' => take the first token hidden state (like Bert)
             - 'mean' => take the mean of all tokens hidden states
@@ -73,6 +73,7 @@ class OpenAIGPTConfig(PretrainedConfig):
             Argument used when doing sequence summary. Used in for the multiple choice head in
             :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
             Is one of the following options:
+
             - 'last' => take the last token hidden state (like XLNet)
             - 'first' => take the first token hidden state (like Bert)
             - 'mean' => take the mean of all tokens hidden states
@@ -98,6 +98,18 @@ class PretrainedConfig(object):
             logger.error("Can't set {} with value {} for {}".format(key, value, self))
             raise err

+    @property
+    def num_labels(self):
+        return self._num_labels
+
+    @num_labels.setter
+    def num_labels(self, num_labels):
+        self._num_labels = num_labels
+        self.id2label = {i: "LABEL_{}".format(i) for i in range(self.num_labels)}
+        self.id2label = dict((int(key), value) for key, value in self.id2label.items())
+        self.label2id = dict(zip(self.id2label.values(), self.id2label.keys()))
+        self.label2id = dict((key, int(value)) for key, value in self.label2id.items())
+
     def save_pretrained(self, save_directory):
         """
         Save a configuration object to the directory `save_directory`, so that it
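A quick illustration of what the new `num_labels` setter produces (a sketch, not part of this diff):

```python
from transformers import PretrainedConfig

config = PretrainedConfig()
config.num_labels = 3
print(config.id2label)  # {0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2'}
print(config.label2id)  # {'LABEL_0': 0, 'LABEL_1': 1, 'LABEL_2': 2}
```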
@@ -108,6 +108,7 @@ class XLMConfig(PretrainedConfig):
             Argument used when doing sequence summary. Used in for the multiple choice head in
             :class:`~transformers.XLMForSequenceClassification`.
             Is one of the following options:
+
             - 'last' => take the last token hidden state (like XLNet)
             - 'first' => take the first token hidden state (like Bert)
             - 'mean' => take the mean of all tokens hidden states
@@ -23,9 +23,11 @@ import fairseq
 import torch
 from packaging import version

-from transformers import BartConfig, BartForSequenceClassification, BartModel, BartTokenizer
+from transformers import BartConfig, BartForMaskedLM, BartForSequenceClassification, BartModel, BartTokenizer
+
+FAIRSEQ_MODELS = ["bart.large", "bart.large.mnli", "bart.large.cnn"]

 if version.parse(fairseq.__version__) < version.parse("0.9.0"):
     raise Exception("requires fairseq >= 0.9.0")
@@ -33,7 +35,7 @@ if version.parse(fairseq.__version__) < version.parse("0.9.0"):
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)

-SAMPLE_TEXT = "Hello world! cécé herlolip"
+SAMPLE_TEXT = " Hello world! cécé herlolip"

 rename_keys = [
     ("model.classification_heads.mnli.dense.weight", "classification_head.dense.weight"),
@@ -41,7 +43,7 @@ rename_keys = [
     ("model.classification_heads.mnli.out_proj.weight", "classification_head.out_proj.weight"),
     ("model.classification_heads.mnli.out_proj.bias", "classification_head.out_proj.bias"),
 ]
-IGNORE_KEYS = ["encoder.version", "decoder.version", "model.encoder.version", "model.decoder.version"]
+IGNORE_KEYS = ["encoder.version", "decoder.version", "model.encoder.version", "model.decoder.version", "_float_tensor"]


 def rename_key(dct, old, new):
@@ -53,36 +55,45 @@ def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path):
     """
     Copy/paste/tweak model's weights to our BERT structure.
     """
-    b2 = torch.hub.load("pytorch/fairseq", checkpoint_path)
-    b2.eval()  # disable dropout
-    b2.model.upgrade_state_dict(b2.model.state_dict())
-    config = BartConfig()
-    tokens = b2.encode(SAMPLE_TEXT).unsqueeze(0)
-    tokens2 = BartTokenizer.from_pretrained("bart-large").encode(SAMPLE_TEXT).unsqueeze(0)
+    bart = torch.hub.load("pytorch/fairseq", checkpoint_path)
+    bart.eval()  # disable dropout
+    bart.model.upgrade_state_dict(bart.model.state_dict())
+    hf_model_name = checkpoint_path.replace(".", "-")
+    config = BartConfig.from_pretrained(hf_model_name)
+    tokens = bart.encode(SAMPLE_TEXT).unsqueeze(0)
+    tokens2 = BartTokenizer.from_pretrained(hf_model_name).encode(SAMPLE_TEXT, return_tensors="pt").unsqueeze(0)
     assert torch.eq(tokens, tokens2).all()
     # assert their_output.size() == (1, 11, 1024)

-    if checkpoint_path == "bart.large":
-        state_dict = b2.model.state_dict()
+    if checkpoint_path in ["bart.large", "bart.large.cnn"]:
+        state_dict = bart.model.state_dict()
+        for k in IGNORE_KEYS:
+            state_dict.pop(k, None)
         state_dict["shared.weight"] = state_dict["decoder.embed_tokens.weight"]
         model = BartModel(config)
-        their_output = b2.extract_features(tokens)
+        their_output = bart.extract_features(tokens)
     else:  # MNLI Case
-        state_dict = b2.state_dict()
+        state_dict = bart.state_dict()
+        for k in IGNORE_KEYS:
+            state_dict.pop(k, None)
         state_dict["model.shared.weight"] = state_dict["model.decoder.embed_tokens.weight"]
         for src, dest in rename_keys:
             rename_key(state_dict, src, dest)
-        state_dict.pop("_float_tensor", None)
         model = BartForSequenceClassification(config)
-        their_output = b2.predict("mnli", tokens, return_logits=True)
-        for k in IGNORE_KEYS:
-            state_dict.pop(k, None)
+        their_output = bart.predict("mnli", tokens, return_logits=True)

     # Load state dict
     model.load_state_dict(state_dict)
     model.eval()
-    our_outputs = model.forward(tokens)[0]
+    if checkpoint_path == "bart.large.cnn":  # generate doesnt work yet
+        model = BartForMaskedLM(config, base_model=model)
+        assert "lm_head.weight" in model.state_dict()
+        assert model.lm_head.out_features == config.max_position_embeddings
+        model.eval()
+        our_outputs = model.model.forward(tokens)[0]
+    else:
+        our_outputs = model.forward(tokens)[0]

     # Check results
     assert their_output.shape == our_outputs.shape
     assert (their_output == our_outputs).all().item()
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
@@ -92,7 +103,8 @@ def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path):
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
     # Required parameters
-    parser.add_argument("fairseq_path", choices=["bart.large", "bart.large.mnli"], type=str, help="")
+    parser.add_argument("fairseq_path", choices=FAIRSEQ_MODELS, type=str, help="")
     parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
     args = parser.parse_args()
     convert_bart_checkpoint(
@@ -46,7 +46,9 @@ logger = logging.getLogger(__name__)

 SAMPLE_TEXT = "Hello world! cécé herlolip"


-def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_folder_path, classification_head):
+def convert_roberta_checkpoint_to_pytorch(
+    roberta_checkpoint_path: str, pytorch_dump_folder_path: str, classification_head: bool
+):
     """
     Copy/paste/tweak roberta's weights to our BERT structure.
     """
@@ -788,6 +788,103 @@ class AlbertForSequenceClassification(AlbertPreTrainedModel):
         return outputs  # (loss), logits, (hidden_states), (attentions)


+@add_start_docstrings(
+    """Albert Model with a token classification head on top (a linear layer on top of
+    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
+    ALBERT_START_DOCSTRING,
+)
+class AlbertForTokenClassification(AlbertPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.albert = AlbertModel(config)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
+
+        self.init_weights()
+
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        labels=None,
+    ):
+        r"""
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+            Labels for computing the token classification loss.
+            Indices should be in ``[0, ..., config.num_labels - 1]``.
+
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs:
+        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :
+            Classification loss.
+        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)
+            Classification scores (before SoftMax).
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+
+    Examples::
+
+        from transformers import AlbertTokenizer, AlbertForTokenClassification
+        import torch
+
+        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
+        model = AlbertForTokenClassification.from_pretrained('albert-base-v2')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, scores = outputs[:2]
+
+        """
+        outputs = self.albert(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+        )
+
+        sequence_output = outputs[0]
+
+        sequence_output = self.dropout(sequence_output)
+        logits = self.classifier(sequence_output)
+
+        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()
+            # Only keep active parts of the loss
+            if attention_mask is not None:
+                active_loss = attention_mask.view(-1) == 1
+                active_logits = logits.view(-1, self.num_labels)[active_loss]
+                active_labels = labels.view(-1)[active_loss]
+                loss = loss_fct(active_logits, active_labels)
+            else:
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            outputs = (loss,) + outputs
+
+        return outputs  # (loss), logits, (hidden_states), (attentions)
+
+
 @add_start_docstrings(
     """Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
     the hidden-states output to compute `span start logits` and `span end logits`). """,