Unverified Commit 7a0cf0ec authored by Pengcheng He, committed by GitHub

Add DeBERTa model (#5929)



* Add DeBERTa model

* Remove dependency of deberta

* Address comments

* Patch DeBERTa
Documentation
Style

* Add final tests

* Style

* Enable tests + nitpicks

* position IDs

* BERT -> DeBERTa

* Quality

* Style

* Tokenization

* Last updates.

* @patrickvonplaten's comments

* Not everything can be a copy

* Apply most of @sgugger's review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Last reviews

* DeBERTa -> Deberta
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 44a93c98
@@ -187,8 +187,9 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
25. **[LXMERT](https://github.com/airsplay/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
26. **[Funnel Transformer](https://github.com/laiguokun/Funnel-Transformer)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
27. **[LayoutLM](https://github.com/microsoft/unilm/tree/master/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
28. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
29. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
30. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations. You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
@@ -214,6 +214,7 @@ conversion utilities for the following models:
model_doc/bertgeneration
model_doc/camembert
model_doc/ctrl
model_doc/deberta
model_doc/dialogpt
model_doc/distilbert
model_doc/dpr
DeBERTa
----------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__
by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
It is based on Google's BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
It builds on RoBERTa with disentangled attention and an enhanced mask decoder, and is trained with half of the data used for RoBERTa.
The abstract from the paper is the following:
*Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks.
In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa
models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode
its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and
relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining.
We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks. Compared to
RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements
on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained
models will be made publicly available at https://github.com/microsoft/DeBERTa.*
The original code can be found `here <https://github.com/microsoft/DeBERTa>`__.
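A minimal usage sketch (illustrative only; the checkpoint name comes from the pretrained models list, the input text is arbitrary):

.. code-block:: python

    import torch
    from transformers import DebertaTokenizer, DebertaModel

    tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
    model = DebertaModel.from_pretrained("microsoft/deberta-base")

    # Encode a sentence and run a forward pass without gradient tracking
    inputs = tokenizer("DeBERTa improves on BERT and RoBERTa.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden_state = outputs[0]  # shape: (batch_size, sequence_length, hidden_size)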
DebertaConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaConfig
:members:
DebertaTokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
DebertaModel
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaModel
:members:
DebertaPreTrainedModel
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaPreTrainedModel
:members:
DebertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaForSequenceClassification
:members:
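A short sketch of the sequence classification head in use (the input sentence and label below are placeholders; the head generally needs fine-tuning before its predictions are meaningful):

.. code-block:: python

    import torch
    from transformers import DebertaTokenizer, DebertaForSequenceClassification

    tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
    model = DebertaForSequenceClassification.from_pretrained("microsoft/deberta-base")

    inputs = tokenizer("This movie was great!", return_tensors="pt")
    labels = torch.tensor([1])  # hypothetical binary label
    outputs = model(**inputs, labels=labels)
    loss, logits = outputs[:2]  # the loss is returned first when labels are provided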
@@ -415,4 +415,15 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | ``microsoft/layoutlm-large-uncased`` | | 24 layers, 1024-hidden, 16-heads, 343M parameters |
| | | |
| | | (see `details <https://github.com/microsoft/unilm/tree/master/layoutlm>`__) |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| DeBERTa | ``microsoft/deberta-base`` | | 12-layer, 768-hidden, 12-heads, ~125M parameters |
| | | | DeBERTa using the BERT-base architecture |
| | | |
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``microsoft/deberta-large`` | | 24-layer, 1024-hidden, 16-heads, ~390M parameters |
| | | | DeBERTa using the BERT-large architecture |
| | | |
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
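A quick way to inspect the parameter counts locally (illustrative sketch; exact numbers depend on the checkpoint):

.. code-block:: python

    from transformers import DebertaModel

    for name in ("microsoft/deberta-base", "microsoft/deberta-large"):
        model = DebertaModel.from_pretrained(name)
        n_params = sum(p.numel() for p in model.parameters())
        print(f"{name}: {n_params / 1e6:.0f}M parameters")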
---
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
license: mit
---
## DeBERTa: Decoding-enhanced BERT with Disentangled Attention
[DeBERTa](https://arxiv.org/abs/2006.03654) improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data.
Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.
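A minimal loading and fine-tuning sketch (the sentence pair, label, and learning rate below are placeholders, not the settings used for the reported results):

```python
import torch
from transformers import DebertaTokenizer, DebertaForSequenceClassification

tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
model = DebertaForSequenceClassification.from_pretrained("microsoft/deberta-base")

# Toy premise/hypothesis pair with a placeholder label
batch = tokenizer("A man is playing a guitar.", "A person plays an instrument.", return_tensors="pt")
labels = torch.tensor([1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels)[0]  # the loss comes first when labels are passed
loss.backward()
optimizer.step()
```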
#### Fine-tuning on NLU tasks
We present the dev results on SQuAD 1.1/2.0 and MNLI tasks.
| Model | SQuAD 1.1 | SQuAD 2.0 | MNLI-m |
|-------------------|-----------|-----------|--------|
| RoBERTa-base | 91.5/84.6 | 83.7/80.5 | 87.6 |
| XLNet-Large | -/- | -/80.2 | 86.8 |
| **DeBERTa-base** | 93.1/87.2 | 86.2/83.1 | 88.8 |
### Citation
If you find DeBERTa useful for your work, please cite the following paper:
``` latex
@misc{he2020deberta,
title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
year={2020},
eprint={2006.03654},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
---
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
license: mit
---
## DeBERTa: Decoding-enhanced BERT with Disentangled Attention
[DeBERTa](https://arxiv.org/abs/2006.03654) improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data.
Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.
#### Fine-tuning on NLU tasks
We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.
| Model | SQuAD 1.1 | SQuAD 2.0 | MNLI-m | SST-2 | QNLI | CoLA | RTE | MRPC | QQP |STS-B|
|-------------------|-----------|-----------|--------|-------|------|------|------|------|------|-----|
| BERT-Large | 90.9/84.1 | 81.8/79.0 | 86.6 | 93.2 | 92.3 | 60.6 | 70.4 | 88.0 | 91.3 |90.0 |
| RoBERTa-Large | 94.6/88.9 | 89.4/86.5 | 90.2 | 96.4 | 93.9 | 68.0 | 86.6 | 90.9 | 92.2 |92.4 |
| XLNet-Large | 95.1/89.7 | 90.6/87.9 | 90.8 | 97.0 | 94.9 | 69.0 | 85.9 | 90.8 | 92.3 |92.5 |
| **DeBERTa-Large** | 95.5/90.1 | 90.7/88.0 | 91.1 | 96.5 | 95.3 | 69.5 | 88.1 | 92.5 | 92.3 |92.5 |
### Citation
If you find DeBERTa useful for your work, please cite the following paper:
``` latex
@misc{he2020deberta,
title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
year={2020},
eprint={2006.03654},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
@@ -35,6 +35,7 @@ from .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from .configuration_bert_generation import BertGenerationConfig
from .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
from .configuration_deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig
from .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
from .configuration_dpr import DPR_PRETRAINED_CONFIG_ARCHIVE_MAP, DPRConfig
from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig
@@ -156,6 +157,7 @@ from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenize
from .tokenization_bertweet import BertweetTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_deberta import DebertaTokenizer
from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
from .tokenization_dpr import (
DPRContextEncoderTokenizer,
@@ -310,6 +312,12 @@ if is_torch_available():
CamembertModel,
)
from .modeling_ctrl import CTRL_PRETRAINED_MODEL_ARCHIVE_LIST, CTRLLMHeadModel, CTRLModel, CTRLPreTrainedModel
from .modeling_deberta import (
DEBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,
DebertaForSequenceClassification,
DebertaModel,
DebertaPreTrainedModel,
)
from .modeling_distilbert import (
DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
DistilBertForMaskedLM,
@@ -44,6 +44,10 @@ def mish(x):
return x * torch.tanh(torch.nn.functional.softplus(x))
def linear_act(x):
return x
ACT2FN = {
"relu": F.relu,
"swish": swish,
@@ -52,6 +56,8 @@ ACT2FN = {
"gelu_new": gelu_new,
"gelu_fast": gelu_fast,
"mish": mish,
"linear": linear_act,
"sigmoid": torch.sigmoid,
}
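For illustration, the two new entries resolve like any other activation name (a sketch, assuming the patched mapping lives in `transformers.activations`):

```python
import torch
from transformers.activations import ACT2FN

x = torch.tensor([-1.0, 0.0, 2.0])
print(ACT2FN["linear"](x))   # identity function added above: tensor([-1., 0., 2.])
print(ACT2FN["sigmoid"](x))  # element-wise sigmoid via torch.sigmoid
```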
@@ -23,6 +23,7 @@ from .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from .configuration_bert_generation import BertGenerationConfig
from .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
from .configuration_deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig
from .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
from .configuration_dpr import DPR_PRETRAINED_CONFIG_ARCHIVE_MAP, DPRConfig
from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig
@@ -78,6 +79,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
DPR_PRETRAINED_CONFIG_ARCHIVE_MAP,
DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
]
for key, value, in pretrained_map.items()
)
@@ -100,6 +102,7 @@ CONFIG_MAPPING = OrderedDict(
("reformer", ReformerConfig),
("longformer", LongformerConfig),
("roberta", RobertaConfig),
("deberta", DebertaConfig),
("flaubert", FlaubertConfig),
("fsmt", FSMTConfig),
("bert", BertConfig),
@@ -149,6 +152,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("encoder-decoder", "Encoder decoder"),
("funnel", "Funnel Transformer"),
("lxmert", "LXMERT"),
("deberta", "DeBERTa"),
("layoutlm", "LayoutLM"),
("dpr", "DPR"),
("rag", "RAG"),
# coding=utf-8
# Copyright 2020, Microsoft and the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" DeBERTa model configuration """
from .configuration_utils import PretrainedConfig
from .utils import logging
logger = logging.get_logger(__name__)
DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"microsoft/deberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/deberta-base/config.json",
"microsoft/deberta-large": "https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/deberta-large/config.json",
}
class DebertaConfig(PretrainedConfig):
r"""
:class:`~transformers.DebertaConfig` is the configuration class to store the configuration of a
:class:`~transformers.DebertaModel`.
Arguments:
vocab_size (:obj:`int`, `optional`, defaults to 50265):
Vocabulary size of the DeBERTa model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.DebertaModel` or
:class:`~transformers.TFDebertaModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"`, :obj:`"tanh"`, :obj:`"gelu_fast"`,
:obj:`"mish"`, :obj:`"linear"`, :obj:`"sigmoid"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 0):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.DebertaModel` or
:class:`~transformers.TFDebertaModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-7):
The epsilon used by the layer normalization layers.
relative_attention (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use relative position encoding.
max_relative_positions (:obj:`int`, `optional`, defaults to -1):
The range of relative positions :obj:`[-max_position_embeddings, max_position_embeddings]`.
Use the same value as :obj:`max_position_embeddings`.
pad_token_id (:obj:`int`, `optional`, defaults to 0):
The value used to pad :obj:`input_ids`.
position_biased_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to add absolute position embeddings to the content embeddings.
pos_att_type (:obj:`List[str]`, `optional`):
The type of relative position attention. It can be a combination of :obj:`["p2c", "c2p", "p2p"]`,
e.g. :obj:`["p2c"]`, :obj:`["p2c", "c2p"]`, :obj:`["p2c", "c2p", "p2p"]`.
"""
model_type = "deberta"
def __init__(
self,
vocab_size=50265,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=0,
initializer_range=0.02,
layer_norm_eps=1e-7,
relative_attention=False,
max_relative_positions=-1,
pad_token_id=0,
position_biased_input=True,
pos_att_type=None,
pooler_dropout=0,
pooler_hidden_act="gelu",
**kwargs
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
self.relative_attention = relative_attention
self.max_relative_positions = max_relative_positions
self.pad_token_id = pad_token_id
self.position_biased_input = position_biased_input
# Backwards compatibility
if type(pos_att_type) == str:
pos_att_type = [x.strip() for x in pos_att_type.lower().split("|")]
self.pos_att_type = pos_att_type
self.vocab_size = vocab_size
self.layer_norm_eps = layer_norm_eps
self.pooler_hidden_size = kwargs.get("pooler_hidden_size", hidden_size)
self.pooler_dropout = pooler_dropout
self.pooler_hidden_act = pooler_hidden_act
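A toy-sized construction sketch (all values below are illustrative, not the pretrained settings); the string form of `pos_att_type` exercises the backwards-compatibility branch above:

```python
from transformers import DebertaConfig, DebertaModel

config = DebertaConfig(
    vocab_size=50265,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=256,
    relative_attention=True,
    max_relative_positions=128,
    pos_att_type="p2c|c2p",  # legacy string form, split on "|" by the constructor
)
print(config.pos_att_type)  # ['p2c', 'c2p']
model = DebertaModel(config)  # randomly initialized, ready for training
```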
@@ -26,6 +26,7 @@ from .configuration_auto import (
BertGenerationConfig,
CamembertConfig,
CTRLConfig,
DebertaConfig,
DistilBertConfig,
DPRConfig,
ElectraConfig,
@@ -90,6 +91,7 @@ from .modeling_camembert import (
CamembertModel,
)
from .modeling_ctrl import CTRLLMHeadModel, CTRLModel
from .modeling_deberta import DebertaForSequenceClassification, DebertaModel
from .modeling_distilbert import (
DistilBertForMaskedLM,
DistilBertForMultipleChoice,
@@ -231,6 +233,7 @@ MODEL_MAPPING = OrderedDict(
(FunnelConfig, FunnelModel),
(LxmertConfig, LxmertModel),
(BertGenerationConfig, BertGenerationEncoder),
(DebertaConfig, DebertaModel),
(DPRConfig, DPRQuestionEncoder),
]
)
@@ -359,6 +362,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
(XLMConfig, XLMForSequenceClassification),
(ElectraConfig, ElectraForSequenceClassification),
(FunnelConfig, FunnelForSequenceClassification),
(DebertaConfig, DebertaForSequenceClassification),
]
)
@@ -25,6 +25,7 @@ from .configuration_auto import (
BertGenerationConfig,
CamembertConfig,
CTRLConfig,
DebertaConfig,
DistilBertConfig,
DPRConfig,
ElectraConfig,
@@ -61,6 +62,7 @@ from .tokenization_bert_japanese import BertJapaneseTokenizer
from .tokenization_bertweet import BertweetTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_deberta import DebertaTokenizer
from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
from .tokenization_dpr import DPRQuestionEncoderTokenizer, DPRQuestionEncoderTokenizerFast
from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
@@ -125,6 +127,7 @@ TOKENIZER_MAPPING = OrderedDict(
(CTRLConfig, (CTRLTokenizer, None)),
(FSMTConfig, (FSMTTokenizer, None)),
(BertGenerationConfig, (BertGenerationTokenizer, None)),
(DebertaConfig, (DebertaTokenizer, None)),
(LayoutLMConfig, (LayoutLMTokenizer, None)),
(RagConfig, (RagTokenizer, None)),
]
# coding=utf-8
# Copyright 2018 Microsoft Authors and the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import unittest
import numpy as np
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
if is_torch_available():
import torch
from transformers import ( # XxxForMaskedLM,; XxxForQuestionAnswering,; XxxForTokenClassification,
DebertaConfig,
DebertaForSequenceClassification,
DebertaModel,
)
from transformers.modeling_deberta import DEBERTA_PRETRAINED_MODEL_ARCHIVE_LIST
@require_torch
class DebertaModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (
(
DebertaModel,
DebertaForSequenceClassification,
) # , DebertaForMaskedLM, DebertaForQuestionAnswering, DebertaForTokenClassification)
if is_torch_available()
else ()
)
test_torchscript = False
test_pruning = False
test_head_masking = False
is_encoder_decoder = False
class DebertaModelTester(object):
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
is_training=True,
use_input_mask=True,
use_token_type_ids=True,
use_labels=True,
vocab_size=99,
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,
type_sequence_label_size=2,
initializer_range=0.02,
relative_attention=False,
position_biased_input=True,
pos_att_type="None",
num_labels=3,
num_choices=4,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_token_type_ids = use_token_type_ids
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.type_sequence_label_size = type_sequence_label_size
self.initializer_range = initializer_range
self.num_labels = num_labels
self.num_choices = num_choices
self.relative_attention = relative_attention
self.position_biased_input = position_biased_input
self.pos_att_type = pos_att_type
self.scope = scope
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
token_type_ids = None
if self.use_token_type_ids:
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
sequence_labels = None
token_labels = None
choice_labels = None
if self.use_labels:
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
choice_labels = ids_tensor([self.batch_size], self.num_choices)
config = DebertaConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
hidden_act=self.hidden_act,
hidden_dropout_prob=self.hidden_dropout_prob,
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
type_vocab_size=self.type_vocab_size,
initializer_range=self.initializer_range,
relative_attention=self.relative_attention,
position_biased_input=self.position_biased_input,
pos_att_type=self.pos_att_type,
)
return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
def check_loss_output(self, result):
self.parent.assertListEqual(list(result["loss"].size()), [])
def create_and_check_deberta_model(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
model = DebertaModel(config=config)
model.to(torch_device)
model.eval()
sequence_output = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)[0]
sequence_output = model(input_ids, token_type_ids=token_type_ids)[0]
sequence_output = model(input_ids)[0]
result = {
"sequence_output": sequence_output,
}
self.parent.assertListEqual(
list(result["sequence_output"].size()), [self.batch_size, self.seq_length, self.hidden_size]
)
def create_and_check_deberta_for_sequence_classification(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
config.num_labels = self.num_labels
model = DebertaForSequenceClassification(config)
model.to(torch_device)
model.eval()
loss, logits = model(
input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=sequence_labels
)
result = {
"loss": loss,
"logits": logits,
}
self.parent.assertListEqual(list(result["logits"].size()), [self.batch_size, self.num_labels])
self.check_loss_output(result)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(
config,
input_ids,
token_type_ids,
input_mask,
sequence_labels,
token_labels,
choice_labels,
) = config_and_inputs
inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
return config, inputs_dict
def setUp(self):
self.model_tester = DebertaModelTest.DebertaModelTester(self)
self.config_tester = ConfigTester(self, config_class=DebertaConfig, hidden_size=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_deberta_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_deberta_model(*config_and_inputs)
def test_for_sequence_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_deberta_for_sequence_classification(*config_and_inputs)
@unittest.skip(reason="Model not available yet")
def test_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_deberta_for_masked_lm(*config_and_inputs)
@unittest.skip(reason="Model not available yet")
def test_for_question_answering(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_deberta_for_question_answering(*config_and_inputs)
@unittest.skip(reason="Model not available yet")
def test_for_token_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_deberta_for_token_classification(*config_and_inputs)
@slow
def test_model_from_pretrained(self):
for model_name in DEBERTA_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = DebertaModel.from_pretrained(model_name)
self.assertIsNotNone(model)
@require_torch
class DebertaModelIntegrationTest(unittest.TestCase):
@unittest.skip(reason="Model not available yet")
def test_inference_masked_lm(self):
pass
@slow
def test_inference_no_head(self):
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
DebertaModel.base_model_prefix = "bert"
model = DebertaModel.from_pretrained("microsoft/deberta-base")
input_ids = torch.tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])
output = model(input_ids)[0]
# compare the actual values for a slice.
expected_slice = torch.tensor(
[[[-0.0218, -0.6641, -0.3665], [-0.3907, -0.4716, -0.6640], [0.7461, 1.2570, -0.9063]]]
)
self.assertTrue(torch.allclose(output[:, :3, :3], expected_slice, atol=1e-4), f"{output[:, :3, :3]}")
@slow
def test_inference_classification_head(self):
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
model = DebertaForSequenceClassification.from_pretrained("microsoft/deberta-base")
input_ids = torch.tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])
output = model(input_ids)[0]
expected_shape = torch.Size((1, 2))
self.assertEqual(output.shape, expected_shape)
expected_tensor = torch.tensor([[0.0884, -0.1047]])
self.assertTrue(torch.allclose(output, expected_tensor, atol=1e-4), f"{output}")
# coding=utf-8
# Copyright 2018 Microsoft.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
import unittest
from typing import Tuple
from transformers.testing_utils import require_torch
from transformers.tokenization_deberta import DebertaTokenizer
from .test_tokenization_common import TokenizerTesterMixin
@require_torch
class DebertaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer_class = DebertaTokenizer
def setUp(self):
super().setUp()
def get_tokenizer(self, name="microsoft/deberta-base", **kwargs):
return DebertaTokenizer.from_pretrained(name, **kwargs)
def get_input_output_texts(self, tokenizer):
input_text = "lower newer"
output_text = "lower newer"
return input_text, output_text
def get_clean_sequence(self, tokenizer, with_prefix_space=False, max_length=20) -> Tuple[str, list]:
toks = [
(i, tokenizer.decode([i], clean_up_tokenization_spaces=False))
for i in range(5, min(len(tokenizer), 50260))
]
toks = list(filter(lambda t: re.match(r"^[ a-zA-Z]+$", t[1]), toks))
toks = list(filter(lambda t: [t[0]] == tokenizer.encode(t[1], add_special_tokens=False), toks))
if max_length is not None and len(toks) > max_length:
toks = toks[:max_length]
# toks_str = [t[1] for t in toks]
toks_ids = [t[0] for t in toks]
# Ensure consistency
output_txt = tokenizer.decode(toks_ids, clean_up_tokenization_spaces=False)
if " " not in output_txt and len(toks_ids) > 1:
output_txt = (
tokenizer.decode([toks_ids[0]], clean_up_tokenization_spaces=False)
+ " "
+ tokenizer.decode(toks_ids[1:], clean_up_tokenization_spaces=False)
)
if with_prefix_space and not output_txt.startswith(" "):
output_txt = " " + output_txt
output_ids = tokenizer.encode(output_txt, add_special_tokens=False)
return output_txt, output_ids
def test_full_tokenizer(self):
tokenizer = self.get_tokenizer("microsoft/deberta-base")
input_str = "UNwant\u00E9d,running"
tokens = tokenizer.tokenize(input_str)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
self.assertEqual(tokenizer.decode(token_ids), input_str)