Commit 02ef825b authored by Forrest Iandola, committed by GitHub

SqueezeBERT architecture (#7083)

* configuration_squeezebert.py

thin wrapper around bert tokenizer

fix typos

wip sb model code

wip modeling_squeezebert.py. Next step is to get the multi-layer-output interface working

set up squeezebert to use BertModelOutput when returning results.

squeezebert documentation

formatting

allow head mask that is an array of [None, ..., None]

docs

docs cont'd

path to vocab

docs and pointers to cloud files (WIP)

line length and indentation

squeezebert model cards

formatting of model cards

untrack modeling_squeezebert_scratchpad.py

update aws paths to vocab and config files

get rid of stub of NSP code, and advise users to pretrain with mlm only

fix rebase issues

redo rebase of modeling_auto.py

fix issues with code formatting

more code format auto-fixes

move squeezebert before bert in tokenization_auto.py and modeling_auto.py because squeezebert inherits from bert

tests for squeezebert modeling and tokenization

fix typo

move squeezebert before bert in modeling_auto.py to fix inheritance problem

disable test_head_masking, since squeezebert doesn't yet implement head masking

fix issues exposed by the test_modeling_squeezebert.py

fix an issue exposed by test_tokenization_squeezebert.py

fix issue exposed by test_modeling_squeezebert.py

auto generated code style improvement

issue that we inherited from modeling_xxx.py: SqueezeBertForMaskedLM.forward() calls self.cls(), but there is no self.cls, and I think the goal was actually to call self.lm_head()

update copyright

resolve failing 'test_hidden_states_output' and remove unused encoder_hidden_states and encoder_attention_mask

docs

add integration test. rename squeezebert-mnli --> squeezebert/squeezebert-mnli

autogenerated formatting tweaks

integrate feedback from patrickvonplaten and sgugger to programming style and documentation strings

* tiny change to order of imports
parent e2c935f5
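
The commit message notes that squeezebert had to be moved before bert in the auto-mapping files because of inheritance. The sketch below illustrates the general pitfall with an ordered, first-match lookup. The class names and the `first_match` helper are hypothetical stand-ins, and the use of `isinstance` is an assumption about how such a lookup might work, not a copy of the library's code.

```python
from collections import OrderedDict


class BertLikeConfig:
    """Hypothetical stand-in for a parent configuration class."""


class SqueezeLikeConfig(BertLikeConfig):
    """Hypothetical stand-in for a configuration that subclasses the parent."""


def first_match(mapping, config):
    # Return the value of the first entry whose key class matches via isinstance.
    for config_class, name in mapping.items():
        if isinstance(config, config_class):
            return name
    raise ValueError("no matching entry")


# Same entries, different order.
parent_first = OrderedDict([(BertLikeConfig, "bert"), (SqueezeLikeConfig, "squeezebert")])
child_first = OrderedDict([(SqueezeLikeConfig, "squeezebert"), (BertLikeConfig, "bert")])

config = SqueezeLikeConfig()
print(first_match(parent_first, config))  # bert (the subclass entry is shadowed)
print(first_match(child_first, config))   # squeezebert
```

With a first-match lookup, the more specific entry has to come first, which is consistent with the ordering chosen in this commit.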
......@@ -25,9 +25,9 @@
<p>State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0
</h3>
🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets then share them with the community on our [model hub](https://huggingface.co/models). At the same time, each python module defining an architecture can be used as a standalone and modified to enable quick research experiments.
🤗 Transformers is backed by the two most popular deep learning libraries, [PyTorch](https://pytorch.org/) and [TensorFlow](https://www.tensorflow.org/), with a seamless integration between them, allowing you to train your models with one then load it for inference with the other.
......@@ -38,7 +38,7 @@
You can test most of our models directly on their pages from the [model hub](https://huggingface.co/models). We also offer an [inference API](https://huggingface.co/pricing) to use those models.
Here are a few examples:
- [Masked word completion with BERT](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France)
- [Named Entity Recognition with Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city)
- [Text generation with GPT-2](https://huggingface.co/gpt2?text=A+long+time+ago%2C+)
......@@ -51,7 +51,7 @@ Here are a few examples:
## Quick tour
To immediately use a model on a given text, we provide the `pipeline` API. Pipelines group together a pretrained model with the preprocessing that was used during that model's training. Here is how to quickly use a pipeline to classify positive versus negative texts:
```python
>>> from transformers import pipeline
......@@ -62,7 +62,7 @@ To immediately use a model on a given text, we provide the `pipeline` API. Pipel
[{'label': 'POSITIVE', 'score': 0.9978193640708923}]
```
The second line of code downloads and caches the pretrained model used by the pipeline, the third line evaluates it on the given text. Here the answer is "positive" with a confidence of 99.8%.
Here is another example of a pipeline, this one extracting the answer to a question from some context:
......@@ -127,7 +127,7 @@ The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/sta
1. Easily customize a model or an example to your needs:
- Examples for each architecture to reproduce the results by the official authors of said architecture.
- Expose the models' internals as consistently as possible.
- Model files can be used independently of the library for quick experiments.
## Why shouldn't I use transformers?
......@@ -179,12 +179,13 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[LXMERT](https://github.com/airsplay/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MBart](https://github.com/pytorch/fairseq/tree/master/examples/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[MMBT](https://github.com/facebookresearch/mmbt/)** (from Facebook), released together with the paper [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/pdf/1909.02950.pdf) by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
1. **[Pegasus](https://github.com/google-research/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
ultilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
1. **[SqueezeBert](https://huggingface.co/transformers/model_doc/squeezebert.html)** released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
......
......@@ -148,7 +148,10 @@ conversion utilities for the following models:
29. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `XLNet: Generalized
Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang, Zihang
Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.
30. `Other community models <https://huggingface.co/models>`_, contributed by the `community
30. SqueezeBERT (from UC Berkeley) released with the paper
`SqueezeBERT: What can computer vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`_
by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
31. `Other community models <https://huggingface.co/models>`_, contributed by the `community
<https://huggingface.co/users>`_.
.. toctree::
......@@ -241,6 +244,7 @@ conversion utilities for the following models:
model_doc/reformer
model_doc/retribert
model_doc/roberta
model_doc/squeezebert
model_doc/t5
model_doc/transformerxl
model_doc/xlm
......
SqueezeBERT
----------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~

The SqueezeBERT model was proposed in
`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__
by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer.
It's a bidirectional transformer similar to the BERT model.
The key difference between the BERT architecture and the SqueezeBERT architecture
is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
instead of fully-connected layers for the Q, K, V and FFN layers.

The abstract from the paper is the following:

*Humans read and write hundreds of billions of messages every day. Further, due to the availability of
large datasets, large computing systems, and better neural network models, natural language processing (NLP)
technology has made significant strides in understanding, proofreading, and organizing these messages.
Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users,
social networks, and businesses. In particular, we consider smartphones and other mobile devices as
crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network
models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds
to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped
convolutions have yielded significant speedups for computer vision networks, but many of these techniques
have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
self-attention layers with grouped convolutions, and we use this technique in a novel network architecture
called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive
accuracy on the GLUE test set. The SqueezeBERT code will be released.*

Tips:

- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
  the right rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective.
  It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
  text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
  `squeezebert/squeezebert-mnli-headless` checkpoint.

SqueezeBertConfig
~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertConfig
    :members:


SqueezeBertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary


SqueezeBertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertTokenizerFast
    :members:


SqueezeBertModel
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertModel
    :members:


SqueezeBertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForMaskedLM
    :members:


SqueezeBertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForSequenceClassification
    :members:


SqueezeBertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForMultipleChoice
    :members:


SqueezeBertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForTokenClassification
    :members:


SqueezeBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForQuestionAnswering
    :members:
......@@ -426,4 +426,13 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | | |
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| SqueezeBERT | ``squeezebert/squeezebert-uncased`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
| | | | SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks. |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``squeezebert/squeezebert-mnli`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
| | | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base. |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``squeezebert/squeezebert-mnli-headless`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
| | | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base. |
| | | | The final classification layer is removed, so when you finetune, the final layer will be reinitialized. |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
---
language: en
license: bsd
datasets:
- bookcorpus
- wikipedia
---
# SqueezeBERT pretrained model
This model, `squeezebert-mnli-headless`, has been pretrained for the English language using masked language modeling (MLM) and Sentence Order Prediction (SOP) objectives and finetuned on the [Multi-Genre Natural Language Inference (MNLI)](https://cims.nyu.edu/~sbowman/multinli/) dataset. This is a "headless" model: the final classification layer has been removed, so Transformers will automatically reinitialize that layer before you begin finetuning on your data.
SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
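
To make the effect of replacing a pointwise fully-connected layer with a grouped convolution concrete, here is a small weight-count calculation. The helper name `pointwise_weight_count` and the choice of 4 groups are illustrative assumptions, not values taken from this model card. A hidden size of 768 is used because the architecture is described as similar to BERT-base.

```python
def pointwise_weight_count(in_channels, out_channels, groups=1):
    """Weights (ignoring biases) of a kernel-size-1 convolution with `groups` groups."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps in_channels/groups inputs to out_channels/groups outputs.
    return groups * (in_channels // groups) * (out_channels // groups)


hidden = 768  # BERT-base hidden size
dense = pointwise_weight_count(hidden, hidden, groups=1)    # same as a fully-connected layer
grouped = pointwise_weight_count(hidden, hidden, groups=4)  # groups=4 is an assumption

print(dense, grouped, dense // grouped)  # 589824 147456 4
```

With `g` groups, both the weight count and the multiply-accumulate count of the layer shrink by a factor of `g`, because each output channel only sees the inputs of its own group.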
## Pretraining
### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
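
As a quick sanity check on the scale of this schedule, the quoted numbers can be multiplied out. This is only arithmetic on the figures quoted above. The token count is an upper bound that assumes every sequence is filled to the maximum length, which the quote does not state.

```python
GLOBAL_BATCH_SIZE = 8192
# (number of steps, maximum sequence length) for the two pretraining phases
PHASES = [(56_000, 128), (6_000, 512)]

total_steps = sum(steps for steps, _ in PHASES)
sequences_seen = GLOBAL_BATCH_SIZE * total_steps
# Upper bound: assumes every sequence is filled to the maximum length.
max_tokens_seen = sum(GLOBAL_BATCH_SIZE * steps * seq_len for steps, seq_len in PHASES)

print(total_steps)      # 62000
print(sequences_seen)   # 507904000
print(max_tokens_seen)  # 83886080000
```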
## Finetuning
The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
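
For readers unfamiliar with distillation, the following is a minimal sketch of a generic soft-target distillation loss for a single example. It is not the loss used in the SqueezeBERT paper, whose exact formulation is not given here. The function names, the blending weight `alpha`, and the `temperature` value are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    # Numerically stable softmax over a list of floats.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Blend of hard-label cross-entropy and KL(teacher || student) on softened outputs."""
    # Hard term: cross-entropy of the student against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft term: KL divergence between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_teacher, p_student) if t > 0)
    # The temperature**2 factor is a common convention to keep gradient scales comparable.
    return alpha * hard + (1 - alpha) * (temperature ** 2) * soft


print(distillation_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], label=0))
```

The soft term is zero when student and teacher logits agree, so the loss reduces to the weighted hard-label term in that case.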
This model, `squeezebert/squeezebert-mnli-headless`, is the "finetuned with bells and whistles" MNLI-finetuned SqueezeBERT model. In this particular model, we have removed the final classification layer -- in other words, it is "headless." We recommend using this model if you intend to finetune the model on your own data. Using this model means that your final layer will automatically be reinitialized when you start finetuning on your data.
### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
```
./utils/download_glue_data.py
python examples/text-classification/run_glue.py \
--model_name_or_path squeezebert/squeezebert-mnli-headless \
--task_name mrpc \
--data_dir ./glue_data/MRPC \
--output_dir ./models/squeezebert_mrpc \
--overwrite_output_dir \
--do_train \
--do_eval \
--num_train_epochs 10 \
--learning_rate 3e-05 \
--per_device_train_batch_size 16 \
--save_steps 20000
```
## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
journal = {arXiv:2006.11316},
year = {2020}
}
```
---
language: en
license: bsd
datasets:
- bookcorpus
- wikipedia
---
# SqueezeBERT pretrained model
This model, `squeezebert-mnli`, has been pretrained for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective and finetuned on the [Multi-Genre Natural Language Inference (MNLI)](https://cims.nyu.edu/~sbowman/multinli/) dataset.
SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
## Pretraining
### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
## Finetuning
The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
This model, `squeezebert/squeezebert-mnli`, is the "finetuned with bells and whistles" MNLI-finetuned SqueezeBERT model.
### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
```
./utils/download_glue_data.py
python examples/text-classification/run_glue.py \
--model_name_or_path squeezebert/squeezebert-mnli-headless \
--task_name mrpc \
--data_dir ./glue_data/MRPC \
--output_dir ./models/squeezebert_mrpc \
--overwrite_output_dir \
--do_train \
--do_eval \
--num_train_epochs 10 \
--learning_rate 3e-05 \
--per_device_train_batch_size 16 \
--save_steps 20000
```
## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
journal = {arXiv:2006.11316},
year = {2020}
}
```
---
language: en
license: bsd
datasets:
- bookcorpus
- wikipedia
---
# SqueezeBERT pretrained model
This model, `squeezebert-uncased`, is a pretrained model for the English language using a masked language modeling (MLM) and Sentence Order Prediction (SOP) objective.
SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.
## Pretraining
### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)
From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
## Finetuning
The SqueezeBERT paper presents 2 approaches to finetuning the model:
- "finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.
This model, `squeezebert/squeezebert-uncased`, has been pretrained but not finetuned. For most text classification tasks, we recommend using `squeezebert/squeezebert-mnli-headless` as a starting point.
### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following command:
```
./utils/download_glue_data.py
python examples/text-classification/run_glue.py \
--model_name_or_path squeezebert/squeezebert-mnli-headless \
--task_name mrpc \
--data_dir ./glue_data/MRPC \
--output_dir ./models/squeezebert_mrpc \
--overwrite_output_dir \
--do_train \
--do_eval \
--num_train_epochs 10 \
--learning_rate 3e-05 \
--per_device_train_batch_size 16 \
--save_steps 20000
```
## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
journal = {arXiv:2006.11316},
year = {2020}
}
```
......@@ -57,6 +57,7 @@ from .configuration_rag import RagConfig
from .configuration_reformer import REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ReformerConfig
from .configuration_retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig
from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig
from .configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig
from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
from .configuration_utils import PretrainedConfig
......@@ -184,6 +185,7 @@ from .tokenization_rag import RagTokenizer
from .tokenization_reformer import ReformerTokenizer
from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
from .tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
from .tokenization_t5 import T5Tokenizer
from .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer, TransfoXLTokenizerFast
from .tokenization_utils import PreTrainedTokenizer
......@@ -453,6 +455,17 @@ if is_torch_available():
        RobertaForTokenClassification,
        RobertaModel,
    )
    from .modeling_squeezebert import (
        SQUEEZEBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
        SqueezeBertForMaskedLM,
        SqueezeBertForMultipleChoice,
        SqueezeBertForQuestionAnswering,
        SqueezeBertForSequenceClassification,
        SqueezeBertForTokenClassification,
        SqueezeBertModel,
        SqueezeBertModule,
        SqueezeBertPreTrainedModel,
    )
    from .modeling_t5 import (
        T5_PRETRAINED_MODEL_ARCHIVE_LIST,
        T5ForConditionalGeneration,
......
......@@ -44,6 +44,7 @@ from .configuration_rag import RagConfig
from .configuration_reformer import ReformerConfig
from .configuration_retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig
from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig
from .configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig
from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
from .configuration_utils import PretrainedConfig
......@@ -80,6 +81,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
        LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
        DPR_PRETRAINED_CONFIG_ARCHIVE_MAP,
        DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
        SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
    ]
    for key, value, in pretrained_map.items()
)
......@@ -105,6 +107,7 @@ CONFIG_MAPPING = OrderedDict(
        ("deberta", DebertaConfig),
        ("flaubert", FlaubertConfig),
        ("fsmt", FSMTConfig),
        ("squeezebert", SqueezeBertConfig),
        ("bert", BertConfig),
        ("openai-gpt", OpenAIGPTConfig),
        ("gpt2", GPT2Config),
......@@ -141,6 +144,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("roberta", "RoBERTa"),
        ("flaubert", "FlauBERT"),
        ("fsmt", "FairSeq Machine-Translation"),
        ("squeezebert", "SqueezeBERT"),
        ("bert", "BERT"),
        ("openai-gpt", "OpenAI GPT"),
        ("gpt2", "OpenAI GPT-2"),
......
# coding=utf-8
# Copyright 2020 The SqueezeBert authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SqueezeBERT model configuration """
from .configuration_utils import PretrainedConfig
from .utils import logging
logger = logging.get_logger(__name__)
SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"squeezebert/squeezebert-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-uncased/config.json",
"squeezebert/squeezebert-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-mnli/config.json",
"squeezebert/squeezebert-mnli-headless": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-mnli-headless/config.json",
}
class SqueezeBertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.SqueezeBertModel`.
It is used to instantiate a SqueezeBERT model according to the specified arguments, defining the model
architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the SqueezeBERT model. Defines the different tokens that
can be represented by the :obj:`input_ids` passed to the forward method of :class:`~transformers.SqueezeBertModel`.
hidden_size (:obj:`int`, optional, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"`, :obj:`"gelu_new"`, :obj:`"gelu_fast"`,
and :obj:`"mish"` are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed into :class:`~transformers.SqueezeBertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
The epsilon used by the layer normalization layers.
pad_token_id (:obj:`int`, optional, defaults to 0):
The ID of the token in the word embedding to use as padding.
embedding_size (:obj:`int`, optional, defaults to 768):
The dimension of the word embedding vectors. Must equal :obj:`hidden_size` in the current implementation.
q_groups (:obj:`int`, optional, defaults to 4):
The number of groups in the Q layer.
k_groups (:obj:`int`, optional, defaults to 4):
The number of groups in the K layer.
v_groups (:obj:`int`, optional, defaults to 4):
The number of groups in the V layer.
post_attention_groups (:obj:`int`, optional, defaults to 1):
The number of groups in the first feed-forward network layer.
intermediate_groups (:obj:`int`, optional, defaults to 4):
The number of groups in the second feed-forward network layer.
output_groups (:obj:`int`, optional, defaults to 4):
The number of groups in the third feed-forward network layer.
Example:
>>> from transformers import SqueezeBertModel, SqueezeBertConfig
>>> # Initializing a SqueezeBERT configuration
>>> configuration = SqueezeBertConfig()
>>> # Initializing a model from the configuration above
>>> model = SqueezeBertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Attributes:
pretrained_config_archive_map (Dict[str, str]):
A dictionary containing all the available pre-trained checkpoints.
"""
pretrained_config_archive_map = SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
model_type = "squeezebert"
def __init__(
self,
vocab_size=30522,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
pad_token_id=0,
embedding_size=768,
q_groups=4,
k_groups=4,
v_groups=4,
post_attention_groups=1,
intermediate_groups=4,
output_groups=4,
**kwargs
):
super().__init__(pad_token_id=pad_token_id, **kwargs)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.embedding_size = embedding_size
self.q_groups = q_groups
self.k_groups = k_groups
self.v_groups = v_groups
self.post_attention_groups = post_attention_groups
self.intermediate_groups = intermediate_groups
self.output_groups = output_groups
......@@ -45,6 +45,7 @@ from .configuration_auto import (
ReformerConfig,
RetriBertConfig,
RobertaConfig,
SqueezeBertConfig,
T5Config,
TransfoXLConfig,
XLMConfig,
......@@ -173,6 +174,14 @@ from .modeling_roberta import (
RobertaForTokenClassification,
RobertaModel,
)
from .modeling_squeezebert import (
SqueezeBertForMaskedLM,
SqueezeBertForMultipleChoice,
SqueezeBertForQuestionAnswering,
SqueezeBertForSequenceClassification,
SqueezeBertForTokenClassification,
SqueezeBertModel,
)
from .modeling_t5 import T5ForConditionalGeneration, T5Model
from .modeling_transfo_xl import TransfoXLLMHeadModel, TransfoXLModel
from .modeling_xlm import (
......@@ -218,6 +227,7 @@ MODEL_MAPPING = OrderedDict(
(LongformerConfig, LongformerModel),
(RobertaConfig, RobertaModel),
(LayoutLMConfig, LayoutLMModel),
(SqueezeBertConfig, SqueezeBertModel),
(BertConfig, BertModel),
(OpenAIGPTConfig, OpenAIGPTModel),
(GPT2Config, GPT2Model),
......@@ -251,6 +261,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
(FSMTConfig, FSMTForConditionalGeneration),
(LongformerConfig, LongformerForMaskedLM),
(RobertaConfig, RobertaForMaskedLM),
(SqueezeBertConfig, SqueezeBertForMaskedLM),
(BertConfig, BertForPreTraining),
(OpenAIGPTConfig, OpenAIGPTLMHeadModel),
(GPT2Config, GPT2LMHeadModel),
......@@ -278,6 +289,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
(BartConfig, BartForConditionalGeneration),
(LongformerConfig, LongformerForMaskedLM),
(RobertaConfig, RobertaForMaskedLM),
(SqueezeBertConfig, SqueezeBertForMaskedLM),
(BertConfig, BertForMaskedLM),
(OpenAIGPTConfig, OpenAIGPTLMHeadModel),
(GPT2Config, GPT2LMHeadModel),
......@@ -324,6 +336,7 @@ MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
(XLMRobertaConfig, XLMRobertaForMaskedLM),
(LongformerConfig, LongformerForMaskedLM),
(RobertaConfig, RobertaForMaskedLM),
(SqueezeBertConfig, SqueezeBertForMaskedLM),
(BertConfig, BertForMaskedLM),
(MobileBertConfig, MobileBertForMaskedLM),
(FlaubertConfig, FlaubertWithLMHeadModel),
......@@ -355,6 +368,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
(BartConfig, BartForSequenceClassification),
(LongformerConfig, LongformerForSequenceClassification),
(RobertaConfig, RobertaForSequenceClassification),
(SqueezeBertConfig, SqueezeBertForSequenceClassification),
(BertConfig, BertForSequenceClassification),
(XLNetConfig, XLNetForSequenceClassification),
(MobileBertConfig, MobileBertForSequenceClassification),
......@@ -375,6 +389,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
(LongformerConfig, LongformerForQuestionAnswering),
(XLMRobertaConfig, XLMRobertaForQuestionAnswering),
(RobertaConfig, RobertaForQuestionAnswering),
(SqueezeBertConfig, SqueezeBertForQuestionAnswering),
(BertConfig, BertForQuestionAnswering),
(XLNetConfig, XLNetForQuestionAnsweringSimple),
(FlaubertConfig, FlaubertForQuestionAnsweringSimple),
......@@ -396,6 +411,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
(XLMRobertaConfig, XLMRobertaForTokenClassification),
(LongformerConfig, LongformerForTokenClassification),
(RobertaConfig, RobertaForTokenClassification),
(SqueezeBertConfig, SqueezeBertForTokenClassification),
(BertConfig, BertForTokenClassification),
(MobileBertConfig, MobileBertForTokenClassification),
(XLNetConfig, XLNetForTokenClassification),
......@@ -413,6 +429,7 @@ MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
(XLMRobertaConfig, XLMRobertaForMultipleChoice),
(LongformerConfig, LongformerForMultipleChoice),
(RobertaConfig, RobertaForMultipleChoice),
(SqueezeBertConfig, SqueezeBertForMultipleChoice),
(BertConfig, BertForMultipleChoice),
(DistilBertConfig, DistilBertForMultipleChoice),
(MobileBertConfig, MobileBertForMultipleChoice),
......
# coding=utf-8
# Copyright 2020 The SqueezeBert authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch SqueezeBert model. """
import math
import torch
from torch import nn
from torch.nn import CrossEntropyLoss, MSELoss
from .activations import ACT2FN
from .configuration_squeezebert import SqueezeBertConfig
from .file_utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_callable
from .modeling_outputs import (
BaseModelOutput,
BaseModelOutputWithPooling,
MaskedLMOutput,
MultipleChoiceModelOutput,
QuestionAnsweringModelOutput,
SequenceClassifierOutput,
TokenClassifierOutput,
)
from .modeling_utils import PreTrainedModel
from .utils import logging
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "SqueezeBertConfig"
_TOKENIZER_FOR_DOC = "SqueezeBertTokenizer"
####################################################
# This list contains shortcut names for some of
# the pretrained weights provided with the models
####################################################
SQUEEZEBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
"squeezebert/squeezebert-uncased",
"squeezebert/squeezebert-mnli",
"squeezebert/squeezebert-mnli-headless",
]
class SqueezeBertEmbeddings(nn.Module):
"""Construct the embeddings from word, position and token_type embeddings."""
def __init__(self, config):
super().__init__()
self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):
if input_ids is not None:
input_shape = input_ids.size()
else:
input_shape = inputs_embeds.size()[:-1]
seq_length = input_shape[1]
if position_ids is None:
position_ids = self.position_ids[:, :seq_length]
if token_type_ids is None:
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)
if inputs_embeds is None:
inputs_embeds = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = inputs_embeds + position_embeddings + token_type_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
class MatMulWrapper(torch.nn.Module):
"""
Wrapper for torch.matmul(). This makes flop-counting easier to implement.
Note that if you directly call torch.matmul() in your code, the flop counter will typically
ignore the flops of the matmul.
"""
def __init__(self):
super().__init__()
def forward(self, mat1, mat2):
"""
:param mat1: left-hand torch tensor
:param mat2: right-hand torch tensor
:return: matmul of these tensors
Here are the typical dimensions found in BERT (the B is optional)
mat1.shape: [B, <optional extra dims>, M, K]
mat2.shape: [B, <optional extra dims>, K, N]
output shape: [B, <optional extra dims>, M, N]
"""
return torch.matmul(mat1, mat2)
class SqueezeBertLayerNorm(nn.LayerNorm):
"""
This is a nn.LayerNorm subclass that accepts NCW data layout and performs normalization in the C dimension.
N = batch
C = channels
W = sequence length
"""
def __init__(self, hidden_size, eps=1e-12):
nn.LayerNorm.__init__(self, normalized_shape=hidden_size, eps=eps) # instantiates self.{weight, bias, eps}
def forward(self, x):
x = x.permute(0, 2, 1)
x = nn.LayerNorm.forward(self, x)
return x.permute(0, 2, 1)
class ConvDropoutLayerNorm(nn.Module):
"""
ConvDropoutLayerNorm: Conv, Dropout, LayerNorm
"""
def __init__(self, cin, cout, groups, dropout_prob):
super().__init__()
self.conv1d = nn.Conv1d(in_channels=cin, out_channels=cout, kernel_size=1, groups=groups)
self.layernorm = SqueezeBertLayerNorm(cout)
self.dropout = nn.Dropout(dropout_prob)
def forward(self, hidden_states, input_tensor):
x = self.conv1d(hidden_states)
x = self.dropout(x)
x = x + input_tensor
x = self.layernorm(x)
return x
class ConvActivation(nn.Module):
"""
ConvActivation: Conv, Activation
"""
def __init__(self, cin, cout, groups, act):
super().__init__()
self.conv1d = nn.Conv1d(in_channels=cin, out_channels=cout, kernel_size=1, groups=groups)
self.act = ACT2FN[act]
def forward(self, x):
output = self.conv1d(x)
return self.act(output)
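Note on the `groups` argument: both ConvDropoutLayerNorm and ConvActivation use a Conv1d with kernel_size=1, so each group maps cin/groups input channels to cout/groups output channels. The sketch below counts the parameters of such a layer in plain Python to show the effect of grouping. The helper name `pointwise_conv_params` is hypothetical; the sizes are the config defaults (hidden_size=768, intermediate_size=3072, intermediate_groups=4).

```python
def pointwise_conv_params(cin, cout, groups=1, bias=True):
    """Parameter count of a 1D convolution with kernel_size=1.

    The weight holds cout * (cin // groups) entries because every output
    channel only sees the input channels of its own group.
    """
    if cin % groups or cout % groups:
        raise ValueError("cin and cout must both be divisible by groups")
    weights = cout * (cin // groups)
    return weights + (cout if bias else 0)


dense = pointwise_conv_params(768, 3072, groups=1)
grouped = pointwise_conv_params(768, 3072, groups=4)
print(dense, grouped)
```

With groups=4 the weight tensor is four times smaller, while the bias term is unchanged.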
class SqueezeBertSelfAttention(nn.Module):
def __init__(self, config, cin, q_groups=1, k_groups=1, v_groups=1):
"""
config = supplies num_attention_heads and attention_probs_dropout_prob
cin = input channels = output channels
q_groups, k_groups, v_groups = number of groups to use in the query, key, and value conv1d layers
"""
super().__init__()
if cin % config.num_attention_heads != 0:
raise ValueError(
"cin (%d) is not a multiple of the number of attention "
"heads (%d)" % (cin, config.num_attention_heads)
)
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(cin / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Conv1d(in_channels=cin, out_channels=cin, kernel_size=1, groups=q_groups)
self.key = nn.Conv1d(in_channels=cin, out_channels=cin, kernel_size=1, groups=k_groups)
self.value = nn.Conv1d(in_channels=cin, out_channels=cin, kernel_size=1, groups=v_groups)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
self.softmax = nn.Softmax(dim=-1)
self.matmul_qk = MatMulWrapper()
self.matmul_qkv = MatMulWrapper()
def transpose_for_scores(self, x):
"""
input: [N, C, W]
output: [N, C1, W, C2]
where C1 is the head index, and C2 is one head's contents
"""
new_x_shape = (x.size()[0], self.num_attention_heads, self.attention_head_size, x.size()[-1]) # [N, C1, C2, W]
x = x.view(*new_x_shape)
return x.permute(0, 1, 3, 2) # [N, C1, C2, W] --> [N, C1, W, C2]
def transpose_key_for_scores(self, x):
"""
input: [N, C, W]
output: [N, C1, C2, W]
where C1 is the head index, and C2 is one head's contents
"""
new_x_shape = (x.size()[0], self.num_attention_heads, self.attention_head_size, x.size()[-1]) # [N, C1, C2, W]
x = x.view(*new_x_shape)
# no `permute` needed
return x
def transpose_output(self, x):
"""
input: [N, C1, W, C2]
output: [N, C, W]
"""
x = x.permute(0, 1, 3, 2).contiguous() # [N, C1, C2, W]
new_x_shape = (x.size()[0], self.all_head_size, x.size()[3]) # [N, C, W]
x = x.view(*new_x_shape)
return x
def forward(self, hidden_states, attention_mask, output_attentions):
"""
expects hidden_states in [N, C, W] data layout.
The attention_mask data layout is [N, W], and it does not need to be transposed.
"""
mixed_query_layer = self.query(hidden_states)
mixed_key_layer = self.key(hidden_states)
mixed_value_layer = self.value(hidden_states)
query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_key_for_scores(mixed_key_layer)
value_layer = self.transpose_for_scores(mixed_value_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_score = self.matmul_qk(query_layer, key_layer)
attention_score = attention_score / math.sqrt(self.attention_head_size)
# Apply the attention mask (precomputed for all layers in the SqueezeBertModel forward() function)
attention_score = attention_score + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = self.softmax(attention_score)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
context_layer = self.matmul_qkv(attention_probs, value_layer)
context_layer = self.transpose_output(context_layer)
result = {"context_layer": context_layer}
if output_attentions:
result["attention_score"] = attention_score
return result
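Note on the additive mask: the forward pass above adds attention_mask to the scaled scores before the softmax, so masked positions need a large negative value rather than 0. The sketch below works through one query vector in plain Python. The value -10000.0 for padding is an assumption for illustration, and `masked_attention_probs` is a hypothetical helper, not part of the model code.

```python
import math


def masked_attention_probs(query, keys, additive_mask):
    """Softmax over scaled dot-product scores for a single query vector.

    query: list of floats, one attention head's query (length = head size)
    keys: list of key vectors, one per sequence position
    additive_mask: 0.0 for visible positions, a large negative value for padding
    """
    head_size = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(head_size) + m
        for key, m in zip(keys, additive_mask)
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = masked_attention_probs(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    additive_mask=[0.0, 0.0, -10000.0],  # third position is treated as padding
)
print(probs)
```

The masked position receives effectively zero probability, and the remaining probabilities still sum to 1.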
class SqueezeBertModule(nn.Module):
def __init__(self, config):
"""
hidden_size = input chans = output chans for Q, K, V (they are all the same ... for now) = output chans for the module
intermediate_size = output chans for intermediate layer
groups = number of groups, set per layer through config.q_groups, config.k_groups, config.v_groups,
config.post_attention_groups, config.intermediate_groups, and config.output_groups
"""
super().__init__()
c0 = config.hidden_size
c1 = config.hidden_size
c2 = config.intermediate_size
c3 = config.hidden_size
self.attention = SqueezeBertSelfAttention(
config=config, cin=c0, q_groups=config.q_groups, k_groups=config.k_groups, v_groups=config.v_groups
)
self.post_attention = ConvDropoutLayerNorm(
cin=c0, cout=c1, groups=config.post_attention_groups, dropout_prob=config.hidden_dropout_prob
)
self.intermediate = ConvActivation(cin=c1, cout=c2, groups=config.intermediate_groups, act=config.hidden_act)
self.output = ConvDropoutLayerNorm(
cin=c2, cout=c3, groups=config.output_groups, dropout_prob=config.hidden_dropout_prob
)
def forward(self, hidden_states, attention_mask, output_attentions):
att = self.attention(hidden_states, attention_mask, output_attentions)
attention_output = att["context_layer"]
post_attention_output = self.post_attention(attention_output, hidden_states)
intermediate_output = self.intermediate(post_attention_output)
layer_output = self.output(intermediate_output, post_attention_output)
output_dict = {"feature_map": layer_output}
if output_attentions:
output_dict["attention_score"] = att["attention_score"]
return output_dict
class SqueezeBertEncoder(nn.Module):
def __init__(self, config):
super().__init__()
assert config.embedding_size == config.hidden_size, (
"If you want embedding_size != intermediate hidden_size, "
"please insert a Conv1d layer to adjust the number of channels "
"before the first SqueezeBertModule."
)
self.layers = nn.ModuleList(SqueezeBertModule(config) for _ in range(config.num_hidden_layers))
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
output_attentions=False,
output_hidden_states=False,
return_dict=False,
):
if head_mask is None:
head_mask_is_all_none = True
elif head_mask.count(None) == len(head_mask):
head_mask_is_all_none = True
else:
head_mask_is_all_none = False
assert head_mask_is_all_none is True, "head_mask is not yet supported in the SqueezeBert implementation."
# [batch_size, sequence_length, hidden_size] --> [batch_size, hidden_size, sequence_length]
hidden_states = hidden_states.permute(0, 2, 1)
all_hidden_states = (hidden_states,) if output_hidden_states else None
all_attentions = () if output_attentions else None
for layer in self.layers:
layer_output = layer.forward(hidden_states, attention_mask, output_attentions)
if output_attentions:
all_attentions += (layer_output["attention_score"],)
if output_hidden_states:
all_hidden_states += (layer_output["feature_map"],)
hidden_states = layer_output["feature_map"]
# Transpose hidden states to be compatible with the standard format in Transformers.
if all_hidden_states:
old_all_hidden_states = all_hidden_states
all_hidden_states = ()
for hs in old_all_hidden_states:
# [batch_size, hidden_size, sequence_length] --> [batch_size, sequence_length, hidden_size]
all_hidden_states += (hs.permute(0, 2, 1),)
# [batch_size, hidden_size, sequence_length] --> [batch_size, sequence_length, hidden_size]
hidden_states = hidden_states.permute(0, 2, 1)
if not return_dict:
return tuple(v for v in [hidden_states, all_hidden_states, all_attentions] if v is not None)
return BaseModelOutput(
last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions
)
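Note on the tuple return path: when return_dict is False, entries that are None are dropped, so the meaning of each index after the first depends on which flags were set. The sketch below reproduces that filtering with placeholder strings; `pack_outputs` is a hypothetical stand-in for the return statement above.

```python
def pack_outputs(last_hidden_state, all_hidden_states=None, all_attentions=None):
    """Mimic the non-dict return path: keep only the entries that are not None."""
    return tuple(v for v in [last_hidden_state, all_hidden_states, all_attentions] if v is not None)


only_last = pack_outputs("last")
with_attn = pack_outputs("last", all_attentions=("a0", "a1"))
with_both = pack_outputs("last", ("h0", "h1", "h2"), ("a0", "a1"))
print(len(only_last), len(with_attn), len(with_both))
```

Index 1 holds the attentions when only output_attentions is set, but holds the hidden states when both flags are set. The BaseModelOutput path avoids this ambiguity through named fields.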
class SqueezeBertPooler(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output
class SqueezeBertPreTrainedModel(PreTrainedModel):
"""An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
config_class = SqueezeBertConfig
base_model_prefix = "transformer"
authorized_missing_keys = [r"position_ids"]
def _init_weights(self, module):
""" Initialize the weights """
if isinstance(module, (nn.Linear, nn.Conv1d, nn.Embedding)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
elif isinstance(module, SqueezeBertLayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
if isinstance(module, (nn.Linear, nn.Conv1d)) and module.bias is not None:
module.bias.data.zero_()
SQUEEZEBERT_START_DOCSTRING = r""" The SqueezeBERT model was proposed in
`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior.
For best results finetuning SqueezeBERT on text classification tasks, it is recommended to use the
`squeezebert/squeezebert-mnli-headless` checkpoint as a starting point.
Parameters:
config (:class:`~transformers.SqueezeBertConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
Hierarchy::
Internal class hierarchy:
SqueezeBertModel
SqueezeBertEncoder
SqueezeBertModule
SqueezeBertSelfAttention
ConvActivation
ConvDropoutLayerNorm
Data layouts::
Input data is in [batch, sequence_length, hidden_size] format.
Data inside the encoder is in [batch, hidden_size, sequence_length] format.
But, if :obj:`output_hidden_states == True`, the data from inside the encoder is
returned in [batch, sequence_length, hidden_size] format.
The final output of the encoder is in [batch, sequence_length, hidden_size] format.
"""
SQUEEZEBERT_INPUTS_DOCSTRING = r"""
Inputs:
input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.SqueezeBertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
Indices of positions of each input sequence token in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`_
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
return_dict (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
plain tuple.
"""
@add_start_docstrings(
"The bare SqueezeBERT Model transformer outputting raw hidden-states without any specific head on top.",
SQUEEZEBERT_START_DOCSTRING,
)
class SqueezeBertModel(SqueezeBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.embeddings = SqueezeBertEmbeddings(config)
self.encoder = SqueezeBertEncoder(config)
self.pooler = SqueezeBertPooler(config)
self.init_weights()
def get_input_embeddings(self):
return self.embeddings.word_embeddings
def set_input_embeddings(self, new_embeddings):
self.embeddings.word_embeddings = new_embeddings
def _prune_heads(self, heads_to_prune):
"""Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layers[layer].attention.prune_heads(heads)
@add_start_docstrings_to_callable(SQUEEZEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="squeezebert/squeezebert-mnli-headless",
output_type=BaseModelOutputWithPooling,
config_class=_CONFIG_FOR_DOC,
)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
input_shape = input_ids.size()
elif inputs_embeds is not None:
input_shape = inputs_embeds.size()[:-1]
else:
raise ValueError("You have to specify either input_ids or inputs_embeds")
device = input_ids.device if input_ids is not None else inputs_embeds.device
if attention_mask is None:
attention_mask = torch.ones(input_shape, device=device)
if token_type_ids is None:
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, device)
# Prepare head mask if needed
# 1.0 in head_mask indicates we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
embedding_output = self.embeddings(
input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
)
encoder_outputs = self.encoder(
hidden_states=embedding_output,
attention_mask=extended_attention_mask,
head_mask=head_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output)
if not return_dict:
return (sequence_output, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPooling(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
)
@add_start_docstrings("""SqueezeBERT Model with a `language modeling` head on top. """, SQUEEZEBERT_START_DOCSTRING)
class SqueezeBertForMaskedLM(SqueezeBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.transformer = SqueezeBertModel(config)
self.lm_head = nn.Linear(config.embedding_size, config.vocab_size)
self.init_weights()
def get_output_embeddings(self):
return self.lm_head
@add_start_docstrings_to_callable(SQUEEZEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="squeezebert/squeezebert-uncased",
output_type=MaskedLMOutput,
config_class=_CONFIG_FOR_DOC,
)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
labels=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size - 1]`` (see ``input_ids`` docstring).
Tokens with indices set to ``-100`` are ignored (masked); the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size - 1]``.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.transformer(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
prediction_scores = self.lm_head(sequence_output)
masked_lm_loss = None
if labels is not None:
loss_fct = CrossEntropyLoss() # -100 index = padding token
masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
if not return_dict:
output = (prediction_scores,) + outputs[2:]
return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
return MaskedLMOutput(
loss=masked_lm_loss,
logits=prediction_scores,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings(
"""SqueezeBERT Model transformer with a sequence classification/regression head on top (a linear layer on top of
the pooled output) e.g. for GLUE tasks. """,
SQUEEZEBERT_START_DOCSTRING,
)
class SqueezeBertForSequenceClassification(SqueezeBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.transformer = SqueezeBertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
self.init_weights()
@add_start_docstrings_to_callable(SQUEEZEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="squeezebert/squeezebert-mnli-headless",
output_type=SequenceClassifierOutput,
config_class=_CONFIG_FOR_DOC,
)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
labels=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.transformer(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
loss = None
if labels is not None:
if self.num_labels == 1:
# We are doing regression
loss_fct = MSELoss()
loss = loss_fct(logits.view(-1), labels.view(-1))
else:
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return SequenceClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings(
"""SqueezeBERT Model with a multiple choice classification head on top (a linear layer on top of
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
SQUEEZEBERT_START_DOCSTRING,
)
class SqueezeBertForMultipleChoice(SqueezeBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.transformer = SqueezeBertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, 1)
self.init_weights()
@add_start_docstrings_to_callable(
SQUEEZEBERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)")
)
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="squeezebert/squeezebert-mnli-headless",
output_type=MultipleChoiceModelOutput,
config_class=_CONFIG_FOR_DOC,
)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
labels=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
inputs_embeds = (
inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
if inputs_embeds is not None
else None
)
outputs = self.transformer(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
reshaped_logits = logits.view(-1, num_choices)
loss = None
if labels is not None:
loss_fct = CrossEntropyLoss()
loss = loss_fct(reshaped_logits, labels)
if not return_dict:
output = (reshaped_logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return MultipleChoiceModelOutput(
loss=loss,
logits=reshaped_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings(
"""SqueezeBERT Model with a token classification head on top (a linear layer on top of
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
SQUEEZEBERT_START_DOCSTRING,
)
class SqueezeBertForTokenClassification(SqueezeBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.transformer = SqueezeBertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
self.init_weights()
@add_start_docstrings_to_callable(SQUEEZEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="squeezebert/squeezebert-mnli-headless",
output_type=TokenClassifierOutput,
config_class=_CONFIG_FOR_DOC,
)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
labels=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels - 1]``.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.transformer(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
sequence_output = self.dropout(sequence_output)
logits = self.classifier(sequence_output)
loss = None
if labels is not None:
loss_fct = CrossEntropyLoss()
# Only keep active parts of the loss
if attention_mask is not None:
active_loss = attention_mask.view(-1) == 1
active_logits = logits.view(-1, self.num_labels)
active_labels = torch.where(
active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)
)
loss = loss_fct(active_logits, active_labels)
else:
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return TokenClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings(
"""SqueezeBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
SQUEEZEBERT_START_DOCSTRING,
)
class SqueezeBertForQuestionAnswering(SqueezeBertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.transformer = SqueezeBertModel(config)
self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
self.init_weights()
@add_start_docstrings_to_callable(SQUEEZEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="squeezebert/squeezebert-mnli-headless",
output_type=QuestionAnsweringModelOutput,
config_class=_CONFIG_FOR_DOC,
)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
start_positions=None,
end_positions=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.transformer(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1)
end_logits = end_logits.squeeze(-1)
total_loss = None
if start_positions is not None and end_positions is not None:
# On multi-GPU, splitting the batch can add a dimension; squeeze it away
if len(start_positions.size()) > 1:
start_positions = start_positions.squeeze(-1)
if len(end_positions.size()) > 1:
end_positions = end_positions.squeeze(-1)
# sometimes the start/end positions are outside our model inputs, we ignore these terms
ignored_index = start_logits.size(1)
start_positions.clamp_(0, ignored_index)
end_positions.clamp_(0, ignored_index)
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
if not return_dict:
output = (start_logits, end_logits) + outputs[2:]
return ((total_loss,) + output) if total_loss is not None else output
return QuestionAnsweringModelOutput(
loss=total_loss,
start_logits=start_logits,
end_logits=end_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
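Note on the question-answering loss above: positions are clamped into `[0, sequence_length]`, and the value `sequence_length` itself is then passed as `ignore_index`, so answers that fall outside the input contribute nothing to the loss. The following is a minimal pure-Python sketch of that idea, not the PyTorch implementation. The helper names `clamp` and `cross_entropy` and the toy logits are made up for illustration, and returning `nan` when every target is ignored is a choice of this sketch.

```python
import math


def clamp(positions, low, high):
    """Clamp every position into the closed interval [low, high]."""
    return [min(max(p, low), high) for p in positions]


def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over rows whose target is not ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # out-of-range span: skipped entirely
        log_z = math.log(sum(math.exp(x) for x in row))
        losses.append(log_z - row[target])
    # Sketch-only convention: nan when no row had a valid target.
    return sum(losses) / len(losses) if losses else float("nan")


seq_len = 4
start_logits = [[2.0, 0.5, 0.1, 0.0], [0.0, 0.0, 3.0, 0.0]]
start_positions = [0, 9]  # the second answer lies beyond the model input

ignored_index = seq_len  # one past the last valid token index
clamped = clamp(start_positions, 0, ignored_index)
loss = cross_entropy(start_logits, clamped, ignore_index=ignored_index)

print(clamped)
print(loss)
```

Only the first example contributes to `loss` here, because the second position is clamped to `seq_len` and then ignored.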
......@@ -46,6 +46,7 @@ from .configuration_auto import (
ReformerConfig,
RetriBertConfig,
RobertaConfig,
SqueezeBertConfig,
T5Config,
TransfoXLConfig,
XLMConfig,
......@@ -83,6 +84,7 @@ from .tokenization_rag import RagTokenizer
from .tokenization_reformer import ReformerTokenizer
from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
from .tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
from .tokenization_t5 import T5Tokenizer
from .tokenization_transfo_xl import TransfoXLTokenizer, TransfoXLTokenizerFast
from .tokenization_xlm import XLMTokenizer
......@@ -117,6 +119,7 @@ TOKENIZER_MAPPING = OrderedDict(
(LxmertConfig, (LxmertTokenizer, LxmertTokenizerFast)),
(LayoutLMConfig, (LayoutLMTokenizer, LayoutLMTokenizerFast)),
(DPRConfig, (DPRQuestionEncoderTokenizer, DPRQuestionEncoderTokenizerFast)),
(SqueezeBertConfig, (SqueezeBertTokenizer, SqueezeBertTokenizerFast)),
(BertConfig, (BertTokenizer, BertTokenizerFast)),
(OpenAIGPTConfig, (OpenAIGPTTokenizer, OpenAIGPTTokenizerFast)),
(GPT2Config, (GPT2Tokenizer, GPT2TokenizerFast)),
......
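Note on the placement of `SqueezeBertConfig` before `BertConfig` in the mapping above: the commit message explains that SqueezeBERT is listed first because it inherits from BERT. The sketch below illustrates why order matters, assuming the mapping is resolved by returning the first entry whose key passes an `isinstance` check. That lookup rule, the `lookup` helper, and the `ParentConfig`/`ChildConfig` classes are assumptions for illustration, not code from this commit.

```python
from collections import OrderedDict


class ParentConfig:
    """Stand-in for a base configuration class (hypothetical)."""


class ChildConfig(ParentConfig):
    """Stand-in for a configuration that subclasses the parent (hypothetical)."""


def lookup(mapping, config):
    """Return the value of the first entry whose key class matches config."""
    for config_class, value in mapping.items():
        if isinstance(config, config_class):
            return value
    raise ValueError("unrecognized configuration class")


child_first = OrderedDict(
    [(ChildConfig, "child-tokenizer"), (ParentConfig, "parent-tokenizer")]
)
parent_first = OrderedDict(
    [(ParentConfig, "parent-tokenizer"), (ChildConfig, "child-tokenizer")]
)

# With the subclass listed first, each config resolves to its own entry.
print(lookup(child_first, ChildConfig()))
print(lookup(child_first, ParentConfig()))
# With the parent listed first, the subclass is shadowed by the parent entry.
print(lookup(parent_first, ChildConfig()))
```

Under this assumption, listing the parent first would make the child entry unreachable, which is consistent with the ordering fix described in the commit message.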
# coding=utf-8
# Copyright 2020 The SqueezeBert authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for SqueezeBERT."""
from .tokenization_bert import BertTokenizer, BertTokenizerFast
from .utils import logging
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"squeezebert/squeezebert-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-uncased/vocab.txt",
"squeezebert/squeezebert-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-mnli/vocab.txt",
"squeezebert/squeezebert-mnli-headless": "https://s3.amazonaws.com/models.huggingface.co/bert/squeezebert/squeezebert-mnli-headless/vocab.txt",
}
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
"squeezebert/squeezebert-uncased": 512,
"squeezebert/squeezebert-mnli": 512,
"squeezebert/squeezebert-mnli-headless": 512,
}
PRETRAINED_INIT_CONFIGURATION = {
"squeezebert/squeezebert-uncased": {"do_lower_case": True},
"squeezebert/squeezebert-mnli": {"do_lower_case": True},
"squeezebert/squeezebert-mnli-headless": {"do_lower_case": True},
}
class SqueezeBertTokenizer(BertTokenizer):
r"""
Constructs a SqueezeBert tokenizer.
:class:`~transformers.SqueezeBertTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
tokenization: punctuation splitting + wordpiece.
Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
parameters.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
class SqueezeBertTokenizerFast(BertTokenizerFast):
r"""
Constructs a "Fast" SqueezeBert tokenizer (backed by HuggingFace's `tokenizers` library).
:class:`~transformers.SqueezeBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
tokenization: punctuation splitting + wordpiece.
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
parameters.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
......@@ -21,7 +21,7 @@ For a quick overview of the general philosphy of the library and its organizatio
# Typical workflow for including a model
Here an overview of the general workflow:
Here an overview of the general workflow:
- [ ] Add model/configuration/tokenization classes.
- [ ] Add conversion scripts.
......@@ -69,7 +69,7 @@ Here is the workflow for documentation:
- [ ] Create a new page `xxx.rst` in the folder `docs/source/model_doc` and add this file in `docs/source/index.rst`.
Make sure to check you have no sphinx warnings when building the documentation locally and follow our
[documentaiton guide](https://github.com/huggingface/transformers/tree/master/docs#writing-documentation---specification).
[documentation guide](https://github.com/huggingface/transformers/tree/master/docs#writing-documentation---specification).
## Final steps
......
......@@ -19,7 +19,6 @@
####################################################
import logging
import os
import torch
......@@ -37,9 +36,10 @@ from .modeling_outputs import (
TokenClassifierOutput,
)
from .modeling_utils import PreTrainedModel
from .utils import logging
logger = logging.getLogger(__name__)
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "XXXConfig"
_TOKENIZER_FOR_DOC = "XXXTokenizer"
......@@ -433,7 +433,7 @@ class XxxForMaskedLM(XxxPreTrainedModel):
)
sequence_output = outputs[0]
prediction_scores = self.cls(sequence_output)
prediction_scores = self.lm_head(sequence_output)
masked_lm_loss = None
if labels is not None:
......
# coding=utf-8
# Copyright 2020 The SqueezeBert authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor, random_attention_mask
if is_torch_available():
import torch
from transformers import (
SQUEEZEBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
SqueezeBertConfig,
SqueezeBertForMaskedLM,
SqueezeBertForMultipleChoice,
SqueezeBertForQuestionAnswering,
SqueezeBertForSequenceClassification,
SqueezeBertForTokenClassification,
SqueezeBertModel,
)
class SqueezeBertModelTester(object):
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
is_training=True,
use_input_mask=True,
use_token_type_ids=False,
use_labels=True,
vocab_size=99,
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=64,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,
type_sequence_label_size=2,
initializer_range=0.02,
num_labels=3,
num_choices=4,
scope=None,
q_groups=2,
k_groups=2,
v_groups=2,
post_attention_groups=2,
intermediate_groups=4,
output_groups=1,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_token_type_ids = use_token_type_ids
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.type_sequence_label_size = type_sequence_label_size
self.initializer_range = initializer_range
self.num_labels = num_labels
self.num_choices = num_choices
self.scope = scope
self.q_groups = q_groups
self.k_groups = k_groups
self.v_groups = v_groups
self.post_attention_groups = post_attention_groups
self.intermediate_groups = intermediate_groups
self.output_groups = output_groups
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
sequence_labels = None
token_labels = None
choice_labels = None
if self.use_labels:
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
choice_labels = ids_tensor([self.batch_size], self.num_choices)
config = SqueezeBertConfig(
embedding_size=self.hidden_size,
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
hidden_act=self.hidden_act,
hidden_dropout_prob=self.hidden_dropout_prob,
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
initializer_range=self.initializer_range,
q_groups=self.q_groups,
k_groups=self.k_groups,
v_groups=self.v_groups,
post_attention_groups=self.post_attention_groups,
intermediate_groups=self.intermediate_groups,
output_groups=self.output_groups,
return_dict=True,
)
return config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
def create_and_check_squeezebert_model(
self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
):
model = SqueezeBertModel(config=config)
model.to(torch_device)
model.eval()
result = model(input_ids, input_mask)
result = model(input_ids)
self.parent.assertEqual(
result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)
)
def create_and_check_squeezebert_for_masked_lm(
self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
):
model = SqueezeBertForMaskedLM(config=config)
model.to(torch_device)
model.eval()
result = model(input_ids, attention_mask=input_mask, labels=token_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
def create_and_check_squeezebert_for_question_answering(
self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
):
model = SqueezeBertForQuestionAnswering(config=config)
model.to(torch_device)
model.eval()
result = model(
input_ids, attention_mask=input_mask, start_positions=sequence_labels, end_positions=sequence_labels
)
self.parent.assertEqual(result.start_logits.shape, (self.batch_size, self.seq_length))
self.parent.assertEqual(result.end_logits.shape, (self.batch_size, self.seq_length))
def create_and_check_squeezebert_for_sequence_classification(
self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
):
config.num_labels = self.num_labels
model = SqueezeBertForSequenceClassification(config)
model.to(torch_device)
model.eval()
result = model(input_ids, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_squeezebert_for_token_classification(
self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
):
config.num_labels = self.num_labels
model = SqueezeBertForTokenClassification(config=config)
model.to(torch_device)
model.eval()
result = model(input_ids, attention_mask=input_mask, labels=token_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.num_labels))
def create_and_check_squeezebert_for_multiple_choice(
self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
):
config.num_choices = self.num_choices
model = SqueezeBertForMultipleChoice(config=config)
model.to(torch_device)
model.eval()
multiple_choice_inputs_ids = input_ids.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
multiple_choice_input_mask = input_mask.unsqueeze(1).expand(-1, self.num_choices, -1).contiguous()
result = model(
multiple_choice_inputs_ids,
attention_mask=multiple_choice_input_mask,
labels=choice_labels,
)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_choices))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(config, input_ids, input_mask, sequence_labels, token_labels, choice_labels) = config_and_inputs
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
return config, inputs_dict
@require_torch
class SqueezeBertModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (
(
SqueezeBertModel,
SqueezeBertForMaskedLM,
SqueezeBertForMultipleChoice,
SqueezeBertForQuestionAnswering,
SqueezeBertForSequenceClassification,
SqueezeBertForTokenClassification,
)
if is_torch_available()
else None
)
test_pruning = False
test_torchscript = True
test_resize_embeddings = True
test_head_masking = False
def setUp(self):
self.model_tester = SqueezeBertModelTester(self)
self.config_tester = ConfigTester(self, config_class=SqueezeBertConfig, dim=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_squeezebert_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_squeezebert_model(*config_and_inputs)
def test_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_squeezebert_for_masked_lm(*config_and_inputs)
def test_for_question_answering(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_squeezebert_for_question_answering(*config_and_inputs)
def test_for_sequence_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_squeezebert_for_sequence_classification(*config_and_inputs)
def test_for_token_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_squeezebert_for_token_classification(*config_and_inputs)
def test_for_multiple_choice(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_squeezebert_for_multiple_choice(*config_and_inputs)
@slow
def test_model_from_pretrained(self):
for model_name in SQUEEZEBERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = SqueezeBertModel.from_pretrained(model_name)
self.assertIsNotNone(model)
class SqueezeBertModelIntegrationTest(unittest.TestCase):
@slow
def test_inference_classification_head(self):
model = SqueezeBertForSequenceClassification.from_pretrained("squeezebert/squeezebert-mnli")
input_ids = torch.tensor([[0, 29414, 232, 328, 740, 1140, 12695, 69, 13, 1588, 2]])
output = model(input_ids)[0]
expected_shape = torch.Size((1, 3))
self.assertEqual(output.shape, expected_shape)
expected_tensor = torch.tensor([[0.5075, 0.0682, -0.5881]])
self.assertTrue(torch.allclose(output, expected_tensor, atol=1e-4))
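Note on the tolerance check in the integration test above: `torch.allclose` accepts a pair of tensors when every element satisfies `|actual - expected| <= atol + rtol * |expected|`. The following pure-Python sketch mirrors that rule on plain lists so the effect of `atol=1e-4` is easy to see. The `allclose` helper and the `near`/`far` values are illustrative and are not outputs of the model.

```python
def allclose(actual, expected, rtol=1e-5, atol=1e-8):
    """Element-wise closeness check on two equal-length lists of floats."""
    return all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


expected = [0.5075, 0.0682, -0.5881]
near = [0.50755, 0.06815, -0.58812]  # every element is off by at most 5e-5
far = [0.5080, 0.0682, -0.5881]      # first element is off by 5e-4

passes = allclose(near, expected, atol=1e-4)
fails = allclose(far, expected, atol=1e-4)

print(passes)
print(fails)
```

With `atol=1e-4`, deviations in the fifth decimal place are accepted while a deviation of `5e-4` is rejected, which is roughly the precision at which the expected logits in the test are written.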
# coding=utf-8
# Copyright 2020 The SqueezeBert authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from transformers.testing_utils import slow
from transformers.tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
from .test_tokenization_bert import BertTokenizationTest
class SqueezeBertTokenizationTest(BertTokenizationTest):
tokenizer_class = SqueezeBertTokenizer
def get_rust_tokenizer(self, **kwargs):
return SqueezeBertTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)
@slow
def test_sequence_builders(self):
tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
text = tokenizer.encode("sequence builders", add_special_tokens=False)
text_2 = tokenizer.encode("multi-sequence build", add_special_tokens=False)
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
assert encoded_sentence == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id]
assert encoded_pair == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id] + text_2 + [
tokenizer.sep_token_id
]
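Note on `test_sequence_builders` above: the assertions describe the BERT-style layout `[CLS] A [SEP]` for a single sequence and `[CLS] A [SEP] B [SEP]` for a pair. Below is a minimal sketch of that layout on plain lists of ids. The function is a stand-in written for illustration, and the ids `101`, `102`, `7`, `8`, `9` are placeholders rather than values read from the SqueezeBERT vocabulary.

```python
def build_inputs_with_special_tokens(ids_a, ids_b=None, cls_id=101, sep_id=102):
    """Wrap one or two id sequences with classification and separator ids."""
    output = [cls_id] + ids_a + [sep_id]
    if ids_b is not None:
        # A second sequence is appended and closed with its own separator.
        output += ids_b + [sep_id]
    return output


text = [7, 8]   # placeholder ids for the first sequence
text_2 = [9]    # placeholder ids for the second sequence

single = build_inputs_with_special_tokens(text)
pair = build_inputs_with_special_tokens(text, text_2)

print(single)
print(pair)
```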