"vscode:/vscode.git/clone" did not exist on "2f21497d3e6e87274e52cd6ae92b224e41371bf6"
Unverified Commit c67d1a02 authored by Sylvain Gugger, committed by GitHub

Tf model outputs (#6247)

* TF outputs and test on BERT

* Albert to DistilBert

* All remaining TF models except T5

* Documentation

* One file forgotten

* Add new models and fix issues

* Quality improvements

* Add T5

* A bit of cleanup

* Fix for slow tests

* Style
parent bd0eab35
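
This commit brings the TF side of the library in line with the earlier PyTorch work on model outputs: every TensorFlow model gains a return_dict argument and, when it is enabled, returns a typed ModelOutput subclass instead of a plain tuple. A minimal usage sketch of the new pattern (illustrative only, written against the docstrings in this diff; the checkpoint name comes from the examples below)::

    from transformers import AlbertTokenizer, TFAlbertModel

    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
    model = TFAlbertModel.from_pretrained("albert-base-v2")
    inputs = tokenizer("Hello, world!", return_tensors="tf")

    # Default (and backward-compatible) behaviour: a plain tuple.
    sequence_output, pooled_output = model(inputs)[:2]

    # New behaviour: a TFBaseModelOutputWithPooling with named fields.
    outputs = model(inputs, return_dict=True)
    sequence_output = outputs.last_hidden_state
    pooled_output = outputs.pooler_output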
@@ -50,7 +50,10 @@ AlbertTokenizer
 Albert specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~
 
-.. autoclass:: transformers.modeling_albert.AlbertForPretrainingOutput
+.. autoclass:: transformers.modeling_albert.AlbertForPreTrainingOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_albert.TFAlbertForPreTrainingOutput
     :members:
...
@@ -57,7 +57,10 @@ BertTokenizerFast
 Bert specific outputs
 ~~~~~~~~~~~~~~~~~~~~~
 
-.. autoclass:: transformers.modeling_bert.BertForPretrainingOutput
+.. autoclass:: transformers.modeling_bert.BertForPreTrainingOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_bert.TFBertForPreTrainingOutput
    :members:
...
@@ -74,7 +74,10 @@ ElectraTokenizerFast
 Electra specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. autoclass:: transformers.modeling_electra.ElectraForPretrainingOutput
+.. autoclass:: transformers.modeling_electra.ElectraForPreTrainingOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_electra.TFElectraForPreTrainingOutput
     :members:
@@ -106,6 +109,13 @@ ElectraForSequenceClassification
     :members:
 
+ElectraForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ElectraForMultipleChoice
+    :members:
+
 ElectraForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -141,6 +151,20 @@ TFElectraForMaskedLM
     :members:
 
+TFElectraForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFElectraForSequenceClassification
+    :members:
+
+TFElectraForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFElectraForMultipleChoice
+    :members:
+
 TFElectraForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
@@ -77,6 +77,9 @@ OpenAI specific outputs
 .. autoclass:: transformers.modeling_openai.OpenAIGPTDoubleHeadsModelOutput
     :members:
 
+.. autoclass:: transformers.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput
+    :members:
+
 OpenAIGPTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~
...
@@ -64,6 +64,9 @@ GPT2 specific outputs
 .. autoclass:: transformers.modeling_gpt2.GPT2DoubleHeadsModelOutput
     :members:
 
+.. autoclass:: transformers.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
+    :members:
+
 GPT2Model
 ~~~~~~~~~~~~~~~~~~~~~
...
@@ -59,7 +59,10 @@ MobileBertTokenizerFast
 MobileBert specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. autoclass:: transformers.modeling_mobilebert.MobileBertForPretrainingOutput
+.. autoclass:: transformers.modeling_mobilebert.MobileBertForPreTrainingOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_mobilebert.TFMobileBertForPreTrainingOutput
     :members:
...
@@ -63,6 +63,12 @@ TransfoXL specific outputs
 .. autoclass:: transformers.modeling_transfo_xl.TransfoXLLMHeadModelOutput
     :members:
 
+.. autoclass:: transformers.modeling_tf_transfo_xl.TFTransfoXLModelOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_transfo_xl.TFTransfoXLLMHeadModelOutput
+    :members:
+
 TransfoXLModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
...
@@ -74,6 +74,24 @@ XLNet specific outputs
 .. autoclass:: transformers.modeling_xlnet.XLNetForQuestionAnsweringOutput
     :members:
 
+.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetModelOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetLMHeadModelOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetForSequenceClassificationOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetForMultipleChoiceOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetForTokenClassificationOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_xlnet.TFXLNetForQuestionAnsweringSimpleOutput
+    :members:
+
 XLNetModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
@@ -190,7 +190,7 @@ def add_end_docstrings(*docstr):
     return docstring_decorator
 
-RETURN_INTRODUCTION = r"""
+PT_RETURN_INTRODUCTION = r"""
     Returns:
         :class:`~{full_output_type}` or :obj:`tuple(torch.FloatTensor)`:
         A :class:`~{full_output_type}` (if ``return_dict=True`` is passed or when ``config.return_dict=True``) or a
@@ -200,6 +200,16 @@ RETURN_INTRODUCTION = r"""
 """
 
+TF_RETURN_INTRODUCTION = r"""
+    Returns:
+        :class:`~{full_output_type}` or :obj:`tuple(tf.Tensor)`:
+        A :class:`~{full_output_type}` (if ``return_dict=True`` is passed or when ``config.return_dict=True``) or a
+        tuple of :obj:`tf.Tensor` comprising various elements depending on the configuration
+        (:class:`~transformers.{config_class}`) and inputs.
+"""
+
+
 def _get_indent(t):
     """Returns the indentation in the first line of t"""
     search = re.search(r"^(\s*)\S", t)
@@ -249,7 +259,8 @@ def _prepare_output_docstrings(output_type, config_class):
     # Add the return introduction
     full_output_type = f"{output_type.__module__}.{output_type.__name__}"
-    intro = RETURN_INTRODUCTION.format(full_output_type=full_output_type, config_class=config_class)
+    intro = TF_RETURN_INTRODUCTION if output_type.__name__.startswith("TF") else PT_RETURN_INTRODUCTION
+    intro = intro.format(full_output_type=full_output_type, config_class=config_class)
     return intro + docstrings
...
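
The dispatch in _prepare_output_docstrings keys purely off the class-name prefix: output classes whose names start with "TF" get the TensorFlow return introduction, all others the PyTorch one. A standalone sketch of the same selection rule (the two placeholder classes are hypothetical, not library names)::

    PT_INTRO = "Returns a {name} or a tuple(torch.FloatTensor)."
    TF_INTRO = "Returns a {name} or a tuple(tf.Tensor)."

    class MyModelOutput:       # hypothetical PyTorch-side output type
        pass

    class TFMyModelOutput:     # hypothetical TF-side output type
        pass

    def pick_intro(output_type):
        # Same rule as the diff above: the "TF" prefix decides the framework.
        intro = TF_INTRO if output_type.__name__.startswith("TF") else PT_INTRO
        return intro.format(name=output_type.__name__)

    assert "tf.Tensor" in pick_intro(TFMyModelOutput)
    assert "torch.FloatTensor" in pick_intro(MyModelOutput)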
@@ -407,9 +407,9 @@ class AlbertPreTrainedModel(PreTrainedModel):
 
 @dataclass
-class AlbertForPretrainingOutput(ModelOutput):
+class AlbertForPreTrainingOutput(ModelOutput):
     """
-    Output type of :class:`~transformers.AlbertForPretrainingModel`.
+    Output type of :class:`~transformers.AlbertForPreTrainingModel`.
 
     Args:
         loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
@@ -643,7 +643,7 @@ class AlbertForPreTraining(AlbertPreTrainedModel):
         return self.predictions.decoder
 
     @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
-    @replace_return_docstrings(output_type=AlbertForPretrainingOutput, config_class=_CONFIG_FOR_DOC)
+    @replace_return_docstrings(output_type=AlbertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
         input_ids=None,
@@ -728,7 +728,7 @@ class AlbertForPreTraining(AlbertPreTrainedModel):
             output = (prediction_scores, sop_scores) + outputs[2:]
             return ((total_loss,) + output) if total_loss is not None else output
 
-        return AlbertForPretrainingOutput(
+        return AlbertForPreTrainingOutput(
             loss=total_loss,
             prediction_logits=prediction_scores,
             sop_logits=sop_scores,
...
@@ -586,9 +586,9 @@ class BertPreTrainedModel(PreTrainedModel):
 
 @dataclass
-class BertForPretrainingOutput(ModelOutput):
+class BertForPreTrainingOutput(ModelOutput):
     """
-    Output type of :class:`~transformers.BertForPretrainingModel`.
+    Output type of :class:`~transformers.BertForPreTrainingModel`.
 
     Args:
         loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
@@ -837,7 +837,7 @@ class BertForPreTraining(BertPreTrainedModel):
         return self.cls.predictions.decoder
 
     @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
-    @replace_return_docstrings(output_type=BertForPretrainingOutput, config_class=_CONFIG_FOR_DOC)
+    @replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
         input_ids=None,
@@ -918,7 +918,7 @@ class BertForPreTraining(BertPreTrainedModel):
             output = (prediction_scores, seq_relationship_score) + outputs[2:]
             return ((total_loss,) + output) if total_loss is not None else output
 
-        return BertForPretrainingOutput(
+        return BertForPreTrainingOutput(
             loss=total_loss,
             prediction_logits=prediction_scores,
             seq_relationship_logits=seq_relationship_score,
...
@@ -188,9 +188,9 @@ class ElectraPreTrainedModel(BertPreTrainedModel):
 
 @dataclass
-class ElectraForPretrainingOutput(ModelOutput):
+class ElectraForPreTrainingOutput(ModelOutput):
     """
-    Output type of :class:`~transformers.ElectraForPretrainingModel`.
+    Output type of :class:`~transformers.ElectraForPreTrainingModel`.
 
     Args:
         loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
@@ -496,7 +496,7 @@ class ElectraForPreTraining(ElectraPreTrainedModel):
         self.init_weights()
 
     @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
-    @replace_return_docstrings(output_type=ElectraForPretrainingOutput, config_class=_CONFIG_FOR_DOC)
+    @replace_return_docstrings(output_type=ElectraForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
         input_ids=None,
@@ -562,7 +562,7 @@ class ElectraForPreTraining(ElectraPreTrainedModel):
             output = (logits,) + discriminator_hidden_states[1:]
             return ((loss,) + output) if loss is not None else output
 
-        return ElectraForPretrainingOutput(
+        return ElectraForPreTrainingOutput(
             loss=loss,
             logits=logits,
             hidden_states=discriminator_hidden_states.hidden_states,
@@ -850,7 +850,7 @@ class ElectraForQuestionAnswering(ElectraPreTrainedModel):
 
 @add_start_docstrings(
     """ELECTRA Model with a multiple choice classification head on top (a linear layer on top of
     the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
-    ELECTRA_INPUTS_DOCSTRING,
+    ELECTRA_START_DOCSTRING,
 )
 class ElectraForMultipleChoice(ElectraPreTrainedModel):
     def __init__(self, config):
...
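
The last Electra hunk is a separate docstring fix: the class-level add_start_docstrings decorator was being handed ELECTRA_INPUTS_DOCSTRING, but by the library's convention the class decorator takes the model-level "start" docstring while the inputs docstring is attached to forward via add_start_docstrings_to_callable. A simplified sketch of that convention (the decorator body is an approximation of the real helper in file_utils.py, and the docstring constants are stand-ins)::

    START_DOC = " Model-level description shared by every head."
    INPUTS_DOC = " Description of input_ids, attention_mask, and friends."

    def add_start_docstrings(*docstr):
        # Approximation: prepend the given docstrings to whatever
        # documentation the decorated object already has.
        def decorator(obj):
            obj.__doc__ = "".join(docstr) + (obj.__doc__ or "")
            return obj
        return decorator

    @add_start_docstrings("Multiple choice head.", START_DOC)   # class gets START
    class MyModelForMultipleChoice:
        @add_start_docstrings(INPUTS_DOC)                       # forward gets INPUTS
        def forward(self, input_ids=None):
            """Method-specific documentation."""

    assert "Model-level" in MyModelForMultipleChoice.__doc__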
@@ -685,9 +685,9 @@ class MobileBertPreTrainedModel(PreTrainedModel):
 
 @dataclass
-class MobileBertForPretrainingOutput(ModelOutput):
+class MobileBertForPreTrainingOutput(ModelOutput):
     """
-    Output type of :class:`~transformers.MobileBertForPretrainingModel`.
+    Output type of :class:`~transformers.MobileBertForPreTrainingModel`.
 
     Args:
         loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
@@ -948,7 +948,7 @@ class MobileBertForPreTraining(MobileBertPreTrainedModel):
         self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())
 
     @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING)
-    @replace_return_docstrings(output_type=MobileBertForPretrainingOutput, config_class=_CONFIG_FOR_DOC)
+    @replace_return_docstrings(output_type=MobileBertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
         input_ids=None,
@@ -1018,7 +1018,7 @@ class MobileBertForPreTraining(MobileBertPreTrainedModel):
             output = (prediction_scores, seq_relationship_score) + outputs[2:]
             return ((total_loss,) + output) if total_loss is not None else output
 
-        return MobileBertForPretrainingOutput(
+        return MobileBertForPreTrainingOutput(
             loss=total_loss,
             prediction_logits=prediction_scores,
             seq_relationship_logits=seq_relationship_score,
...
@@ -973,7 +973,7 @@ class T5Model(T5PreTrainedModel):
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
             )
-        elif not return_dict and not isinstance(encoder_outputs, BaseModelOutput):
+        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
             encoder_outputs = BaseModelOutput(
                 last_hidden_state=encoder_outputs[0],
                 hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
...
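
The one-word T5 fix flips an inverted condition: a user-supplied tuple of encoder outputs should be wrapped into a BaseModelOutput exactly when return_dict is True, so that the attribute accesses further down the forward pass work; the old ``not return_dict`` test wrapped in the wrong case. A sketch of the corrected wrapping logic, with a stand-in dataclass instead of the real BaseModelOutput::

    from dataclasses import dataclass
    from typing import Any, Optional, Tuple

    @dataclass
    class EncoderOutput:  # stand-in for transformers' BaseModelOutput
        last_hidden_state: Any
        hidden_states: Optional[Tuple] = None
        attentions: Optional[Tuple] = None

    def normalize_encoder_outputs(encoder_outputs, return_dict):
        # Fixed logic: wrap only when dict-style outputs were requested
        # and the caller handed us a raw tuple.
        if return_dict and not isinstance(encoder_outputs, EncoderOutput):
            encoder_outputs = EncoderOutput(
                last_hidden_state=encoder_outputs[0],
                hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
                attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
            )
        return encoder_outputs

    assert isinstance(normalize_encoder_outputs(("hidden",), True), EncoderOutput)
    assert normalize_encoder_outputs(("hidden",), False) == ("hidden",)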
@@ -17,17 +17,30 @@
 
 import logging
+from dataclasses import dataclass
+from typing import Optional, Tuple
 
 import tensorflow as tf
 
 from .configuration_albert import AlbertConfig
 from .file_utils import (
     MULTIPLE_CHOICE_DUMMY_INPUTS,
+    ModelOutput,
     add_code_sample_docstrings,
     add_start_docstrings,
     add_start_docstrings_to_callable,
+    replace_return_docstrings,
 )
 from .modeling_tf_bert import ACT2FN, TFBertSelfAttention
+from .modeling_tf_outputs import (
+    TFBaseModelOutput,
+    TFBaseModelOutputWithPooling,
+    TFMaskedLMOutput,
+    TFMultipleChoiceModelOutput,
+    TFQuestionAnsweringModelOutput,
+    TFSequenceClassifierOutput,
+    TFTokenClassifierOutput,
+)
 from .modeling_tf_utils import (
     TFMaskedLanguageModelingLoss,
     TFMultipleChoiceLoss,
@@ -44,6 +57,7 @@ from .tokenization_utils import BatchEncoding
 
 logger = logging.getLogger(__name__)
 
+_CONFIG_FOR_DOC = "AlbertConfig"
 _TOKENIZER_FOR_DOC = "AlbertTokenizer"
 
 TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
@@ -414,12 +428,19 @@ class TFAlbertTransformer(tf.keras.layers.Layer):
             for i in range(config.num_hidden_groups)
         ]
 
-    def call(self, hidden_states, attention_mask, head_mask, output_attentions, output_hidden_states, training=False):
+    def call(
+        self,
+        hidden_states,
+        attention_mask,
+        head_mask,
+        output_attentions,
+        output_hidden_states,
+        return_dict,
+        training=False,
+    ):
         hidden_states = self.embedding_hidden_mapping_in(hidden_states)
-        all_attentions = ()
-
-        if output_hidden_states:
-            all_hidden_states = (hidden_states,)
+        all_attentions = () if output_attentions else None
+        all_hidden_states = (hidden_states,) if output_hidden_states else None
 
         for i in range(self.config.num_hidden_layers):
             # Number of layers in a hidden group
@@ -444,14 +465,11 @@ class TFAlbertTransformer(tf.keras.layers.Layer):
         if output_hidden_states:
             all_hidden_states = all_hidden_states + (hidden_states,)
 
-        outputs = (hidden_states,)
-        if output_hidden_states:
-            outputs = outputs + (all_hidden_states,)
-        if output_attentions:
-            outputs = outputs + (all_attentions,)
-        # last-layer hidden state, (all hidden states), (all attentions)
-        return outputs
+        if not return_dict:
+            return tuple(v for v in [hidden_states, all_hidden_states, all_attentions] if v is not None)
+        return TFBaseModelOutput(
+            last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions
+        )
 
 
 class TFAlbertPreTrainedModel(TFPreTrainedModel):
@@ -506,6 +524,7 @@ class TFAlbertMainLayer(tf.keras.layers.Layer):
         self.num_hidden_layers = config.num_hidden_layers
         self.output_attentions = config.output_attentions
         self.output_hidden_states = config.output_hidden_states
+        self.return_dict = config.use_return_dict
 
         self.embeddings = TFAlbertEmbeddings(config, name="embeddings")
         self.encoder = TFAlbertTransformer(config, name="encoder")
@@ -543,6 +562,7 @@ class TFAlbertMainLayer(tf.keras.layers.Layer):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         training=False,
     ):
         if isinstance(inputs, (tuple, list)):
@@ -554,7 +574,8 @@ class TFAlbertMainLayer(tf.keras.layers.Layer):
             inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
             output_attentions = inputs[6] if len(inputs) > 6 else output_attentions
             output_hidden_states = inputs[7] if len(inputs) > 7 else output_hidden_states
-            assert len(inputs) <= 8, "Too many inputs."
+            return_dict = inputs[8] if len(inputs) > 8 else return_dict
+            assert len(inputs) <= 9, "Too many inputs."
         elif isinstance(inputs, (dict, BatchEncoding)):
             input_ids = inputs.get("input_ids")
             attention_mask = inputs.get("attention_mask", attention_mask)
@@ -564,12 +585,14 @@ class TFAlbertMainLayer(tf.keras.layers.Layer):
             inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
             output_attentions = inputs.get("output_attentions", output_attentions)
             output_hidden_states = inputs.get("output_hidden_states", output_hidden_states)
-            assert len(inputs) <= 8, "Too many inputs."
+            return_dict = inputs.get("return_dict", return_dict)
+            assert len(inputs) <= 9, "Too many inputs."
         else:
             input_ids = inputs
 
         output_attentions = output_attentions if output_attentions is not None else self.output_attentions
         output_hidden_states = output_hidden_states if output_hidden_states is not None else self.output_hidden_states
+        return_dict = return_dict if return_dict is not None else self.return_dict
 
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
@@ -619,16 +642,52 @@ class TFAlbertMainLayer(tf.keras.layers.Layer):
             head_mask,
             output_attentions,
             output_hidden_states,
+            return_dict,
             training=training,
         )
 
         sequence_output = encoder_outputs[0]
         pooled_output = self.pooler(sequence_output[:, 0])
 
-        # add hidden_states and attentions if they are here
-        outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]
-        # sequence_output, pooled_output, (hidden_states), (attentions)
-        return outputs
+        if not return_dict:
+            return (sequence_output, pooled_output,) + encoder_outputs[1:]
+
+        return TFBaseModelOutputWithPooling(
+            last_hidden_state=sequence_output,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+        )
+
+
+@dataclass
+class TFAlbertForPreTrainingOutput(ModelOutput):
+    """
+    Output type of :class:`~transformers.TFAlbertForPreTrainingModel`.
+
+    Args:
+        prediction_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        sop_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, 2)`):
+            Prediction scores of the next sequence prediction (classification) head (scores of True/False
+            continuation before SoftMax).
+        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
+            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
+            Tuple of :obj:`tf.Tensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+    """
+
+    prediction_logits: tf.Tensor = None
+    sop_logits: tf.Tensor = None
+    hidden_states: Optional[Tuple[tf.Tensor]] = None
+    attentions: Optional[Tuple[tf.Tensor]] = None
 
 
 ALBERT_START_DOCSTRING = r"""
...@@ -707,6 +766,11 @@ ALBERT_INPUTS_DOCSTRING = r""" ...@@ -707,6 +766,11 @@ ALBERT_INPUTS_DOCSTRING = r"""
(if set to :obj:`False`) for evaluation. (if set to :obj:`False`) for evaluation.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`): output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
return_dict (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
plain tuple.
""" """
@@ -720,32 +784,13 @@ class TFAlbertModel(TFAlbertPreTrainedModel):
         self.albert = TFAlbertMainLayer(config, name="albert")
 
     @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="albert-base-v2")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="albert-base-v2",
+        output_type=TFBaseModelOutputWithPooling,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(self, inputs, **kwargs):
-        r"""
-    Returns:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs:
-        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
-            Sequence of hidden-states at the output of the last layer of the model.
-        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):
-            Last layer hidden-state of the first token of the sequence (classification token)
-            further processed by a Linear layer and a Tanh activation function. The Linear
-            layer weights are trained from the next sentence prediction (classification)
-            objective during Albert pretraining. This output is usually *not* a good summary
-            of the semantic content of the input, you're often better with averaging or pooling
-            the sequence of hidden-states for the whole input sequence.
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
-        """
         outputs = self.albert(inputs, **kwargs)
         return outputs
@@ -768,25 +813,10 @@ class TFAlbertForPreTraining(TFAlbertPreTrainedModel):
         return self.albert.embeddings
 
     @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @replace_return_docstrings(output_type=TFAlbertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def call(self, inputs, **kwargs):
         r"""
         Return:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
-        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
-            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
-        sop_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):
-            Prediction scores of the sentence order prediction (classification) head (scores of True/False continuation before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
 
         Examples::
 
             import tensorflow as tf
@@ -797,13 +827,22 @@ class TFAlbertForPreTraining(TFAlbertPreTrainedModel):
             outputs = model(input_ids)
             prediction_scores, sop_scores = outputs[:2]
 
         """
+        return_dict = kwargs.get("return_dict")
+        return_dict = return_dict if return_dict is not None else self.albert.return_dict
         outputs = self.albert(inputs, **kwargs)
 
         sequence_output, pooled_output = outputs[:2]
         prediction_scores = self.predictions(sequence_output)
         sop_scores = self.sop_classifier(pooled_output, training=kwargs.get("training", False))
-        outputs = (prediction_scores, sop_scores) + outputs[2:]
 
-        return outputs
+        if not return_dict:
+            return (prediction_scores, sop_scores) + outputs[2:]
+
+        return TFAlbertForPreTrainingOutput(
+            prediction_logits=prediction_scores,
+            sop_logits=sop_scores,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
 
 
 class TFAlbertSOPHead(tf.keras.layers.Layer):
@@ -833,7 +872,12 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
         return self.albert.embeddings
 
     @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="albert-base-v2")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="albert-base-v2",
+        output_type=TFMaskedLMOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs=None,
@@ -844,6 +888,7 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -853,27 +898,12 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
             Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
             Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
             in ``[0, ..., config.vocab_size]``
-
-    Returns:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs:
-        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`
-            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
         """
+        return_dict = return_dict if return_dict is not None else self.albert.return_dict
         if isinstance(inputs, (tuple, list)):
-            labels = inputs[8] if len(inputs) > 8 else labels
-            if len(inputs) > 8:
-                inputs = inputs[:8]
+            labels = inputs[9] if len(inputs) > 9 else labels
+            if len(inputs) > 9:
+                inputs = inputs[:9]
         elif isinstance(inputs, (dict, BatchEncoding)):
             labels = inputs.pop("labels", labels)
@@ -886,20 +916,22 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
             inputs_embeds=inputs_embeds,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
 
         sequence_output = outputs[0]
         prediction_scores = self.predictions(sequence_output, training=training)
 
-        # Add hidden states and attention if they are here
-        outputs = (prediction_scores,) + outputs[2:]
+        loss = None if labels is None else self.compute_loss(labels, prediction_scores)
 
-        if labels is not None:
-            loss = self.compute_loss(labels, prediction_scores)
-            outputs = (loss,) + outputs
+        if not return_dict:
+            output = (prediction_scores,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
 
-        return outputs  # prediction_scores, (hidden_states), (attentions)
+        return TFMaskedLMOutput(
+            loss=loss, logits=prediction_scores, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
+        )
 
 
 @add_start_docstrings(
@@ -919,7 +951,12 @@ class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClass
     )
 
     @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="albert-base-v2")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="albert-base-v2",
+        output_type=TFSequenceClassifierOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs=None,
@@ -930,6 +967,7 @@ class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClass
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
        labels=None,
         training=False,
     ):
@@ -939,27 +977,12 @@ class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClass
             Indices should be in ``[0, ..., config.num_labels - 1]``.
             If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
             If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
-
-    Returns:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs:
-        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`)
-            Classification (or regression if config.num_labels==1) scores (before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
         """
+        return_dict = return_dict if return_dict is not None else self.albert.return_dict
         if isinstance(inputs, (tuple, list)):
-            labels = inputs[8] if len(inputs) > 8 else labels
-            if len(inputs) > 8:
-                inputs = inputs[:8]
+            labels = inputs[9] if len(inputs) > 9 else labels
+            if len(inputs) > 9:
+                inputs = inputs[:9]
         elif isinstance(inputs, (dict, BatchEncoding)):
             labels = inputs.pop("labels", labels)
@@ -972,6 +995,7 @@ class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClass
             inputs_embeds=inputs_embeds,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
@@ -980,13 +1004,15 @@ class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClass
         pooled_output = self.dropout(pooled_output, training=training)
         logits = self.classifier(pooled_output)
 
-        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
+        loss = None if labels is None else self.compute_loss(labels, logits)
 
-        if labels is not None:
-            loss = self.compute_loss(labels, logits)
-            outputs = (loss,) + outputs
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
 
-        return outputs  # (loss), logits, (hidden_states), (attentions)
+        return TFSequenceClassifierOutput(
+            loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
+        )
 
 
 @add_start_docstrings(
@@ -1006,7 +1032,12 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
     )
 
     @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="albert-base-v2")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="albert-base-v2",
+        output_type=TFTokenClassifierOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs=None,
@@ -1017,6 +1048,7 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -1024,27 +1056,12 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
             Labels for computing the token classification loss.
             Indices should be in ``[0, ..., config.num_labels - 1]``.
-
-    Return:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
-        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
-            Classification scores (before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
         """
+        return_dict = return_dict if return_dict is not None else self.albert.return_dict
         if isinstance(inputs, (tuple, list)):
-            labels = inputs[8] if len(inputs) > 8 else labels
-            if len(inputs) > 8:
-                inputs = inputs[:8]
+            labels = inputs[9] if len(inputs) > 9 else labels
+            if len(inputs) > 9:
+                inputs = inputs[:9]
         elif isinstance(inputs, (dict, BatchEncoding)):
             labels = inputs.pop("labels", labels)
@@ -1057,6 +1074,7 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
             inputs_embeds=inputs_embeds,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
@@ -1065,13 +1083,15 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
         sequence_output = self.dropout(sequence_output, training=training)
         logits = self.classifier(sequence_output)
 
-        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
+        loss = None if labels is None else self.compute_loss(labels, logits)
 
-        if labels is not None:
-            loss = self.compute_loss(labels, logits)
-            outputs = (loss,) + outputs
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
 
-        return outputs  # (loss), logits, (hidden_states), (attentions)
+        return TFTokenClassifierOutput(
+            loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
+        )
 
 
 @add_start_docstrings(
@@ -1089,7 +1109,12 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
     )
 
     @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="albert-base-v2")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="albert-base-v2",
+        output_type=TFQuestionAnsweringModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs=None,
@@ -1100,6 +1125,7 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         start_positions=None,
         end_positions=None,
         training=False,
@@ -1113,30 +1139,13 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
             Positions are clamped to the length of the sequence (`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
-
-    Return:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs:
-        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
-            Span-start scores (before SoftMax).
-        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
-            Span-end scores (before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
         """
+        return_dict = return_dict if return_dict is not None else self.albert.return_dict
         if isinstance(inputs, (tuple, list)):
-            start_positions = inputs[8] if len(inputs) > 8 else start_positions
-            end_positions = inputs[9] if len(inputs) > 9 else end_positions
-            if len(inputs) > 8:
-                inputs = inputs[:8]
+            start_positions = inputs[9] if len(inputs) > 9 else start_positions
+            end_positions = inputs[10] if len(inputs) > 10 else end_positions
+            if len(inputs) > 9:
+                inputs = inputs[:9]
         elif isinstance(inputs, (dict, BatchEncoding)):
             start_positions = inputs.pop("start_positions", start_positions)
             end_positions = inputs.pop("end_positions", start_positions)
@@ -1150,6 +1159,7 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
             inputs_embeds=inputs_embeds,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
@@ -1160,15 +1170,23 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
         start_logits = tf.squeeze(start_logits, axis=-1)
         end_logits = tf.squeeze(end_logits, axis=-1)
 
-        outputs = (start_logits, end_logits,) + outputs[2:]
+        loss = None
         if start_positions is not None and end_positions is not None:
             labels = {"start_position": start_positions}
             labels["end_position"] = end_positions
-            loss = self.compute_loss(labels, outputs[:2])
-            outputs = (loss,) + outputs
+            loss = self.compute_loss(labels, (start_logits, end_logits))
 
-        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)
+        if not return_dict:
+            output = (start_logits, end_logits) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return TFQuestionAnsweringModelOutput(
+            loss=loss,
+            start_logits=start_logits,
+            end_logits=end_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
 
 
 @add_start_docstrings(
@@ -1196,7 +1214,12 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
 
     @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="albert-base-v2")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="albert-base-v2",
+        output_type=TFMultipleChoiceModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs,
@@ -1207,6 +1230,7 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -1215,24 +1239,6 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
             Labels for computing the multiple choice classification loss.
             Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
             of the input tensors. (see `input_ids` above)
-
-    Return:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
-        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:
-            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).
-            Classification scores (before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -1243,8 +1249,9 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
             inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
             output_attentions = inputs[6] if len(inputs) > 6 else output_attentions
             output_hidden_states = inputs[7] if len(inputs) > 7 else output_hidden_states
-            labels = inputs[8] if len(inputs) > 8 else labels
-            assert len(inputs) <= 9, "Too many inputs."
+            return_dict = inputs[8] if len(inputs) > 8 else return_dict
+            labels = inputs[9] if len(inputs) > 9 else labels
+            assert len(inputs) <= 10, "Too many inputs."
         elif isinstance(inputs, (dict, BatchEncoding)):
             input_ids = inputs.get("input_ids")
             attention_mask = inputs.get("attention_mask", attention_mask)
@@ -1254,10 +1261,12 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
output_attentions = inputs.get("output_attentions", output_attentions)
output_hidden_states = inputs.get("output_hidden_states", output_hidden_states)
return_dict = inputs.get("return_dict", return_dict)
labels = inputs.get("labels", labels)
assert len(inputs) <= 10, "Too many inputs."
else:
input_ids = inputs
return_dict = return_dict if return_dict is not None else self.albert.return_dict
if input_ids is not None:
num_choices = shape_list(input_ids)[1]
@@ -1280,6 +1289,7 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
inputs_embeds,
output_attentions,
output_hidden_states,
return_dict=return_dict,
training=training,
)
@@ -1289,10 +1299,12 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
logits = self.classifier(pooled_output)
reshaped_logits = tf.reshape(logits, (-1, num_choices))
loss = None if labels is None else self.compute_loss(labels, reshaped_logits)
if not return_dict:
output = (reshaped_logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return TFMultipleChoiceModelOutput(
loss=loss, logits=reshaped_logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
)
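A quick usage sketch of the new return type (not part of the diff): assuming the `albert-base-v2` checkpoint from the docstrings above, the multiple-choice head can now be driven with `return_dict=True` and the scores read off by name::

    import tensorflow as tf
    from transformers import AlbertTokenizer, TFAlbertForMultipleChoice

    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
    model = TFAlbertForMultipleChoice.from_pretrained("albert-base-v2")

    prompt = "The sky is"
    choices = ["blue.", "made of cheese."]
    encoding = tokenizer([prompt, prompt], choices, return_tensors="tf", padding=True)
    # add the num_choices dimension: (batch_size=1, num_choices=2, sequence_length)
    inputs = {k: tf.expand_dims(v, 0) for k, v in encoding.items()}

    outputs = model(inputs, return_dict=True)
    print(outputs.logits.shape)  # (1, 2), one score per choice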
@@ -17,6 +17,8 @@
import logging
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np
import tensorflow as tf
@@ -24,9 +26,22 @@ import tensorflow as tf
from .configuration_bert import BertConfig
from .file_utils import (
MULTIPLE_CHOICE_DUMMY_INPUTS,
ModelOutput,
add_code_sample_docstrings,
add_start_docstrings,
add_start_docstrings_to_callable,
replace_return_docstrings,
)
from .modeling_tf_outputs import (
TFBaseModelOutput,
TFBaseModelOutputWithPooling,
TFCausalLMOutput,
TFMaskedLMOutput,
TFMultipleChoiceModelOutput,
TFNextSentencePredictorOutput,
TFQuestionAnsweringModelOutput,
TFSequenceClassifierOutput,
TFTokenClassifierOutput,
)
from .modeling_tf_utils import (
TFCausalLanguageModelingLoss,
@@ -45,6 +60,7 @@ from .tokenization_utils import BatchEncoding
logger = logging.getLogger(__name__)
_CONFIG_FOR_DOC = "BertConfig"
_TOKENIZER_FOR_DOC = "BertTokenizer" _TOKENIZER_FOR_DOC = "BertTokenizer"
TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [ TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
...@@ -389,9 +405,18 @@ class TFBertEncoder(tf.keras.layers.Layer): ...@@ -389,9 +405,18 @@ class TFBertEncoder(tf.keras.layers.Layer):
super().__init__(**kwargs) super().__init__(**kwargs)
self.layer = [TFBertLayer(config, name="layer_._{}".format(i)) for i in range(config.num_hidden_layers)] self.layer = [TFBertLayer(config, name="layer_._{}".format(i)) for i in range(config.num_hidden_layers)]
def call(self, hidden_states, attention_mask, head_mask, output_attentions, output_hidden_states, training=False): def call(
all_hidden_states = () self,
all_attentions = () hidden_states,
attention_mask,
head_mask,
output_attentions,
output_hidden_states,
return_dict,
training=False,
):
all_hidden_states = () if output_hidden_states else None
all_attentions = () if output_attentions else None
for i, layer_module in enumerate(self.layer):
if output_hidden_states:
@@ -409,15 +434,11 @@ class TFBertEncoder(tf.keras.layers.Layer):
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(v for v in [hidden_states, all_hidden_states, all_attentions] if v is not None)
return TFBaseModelOutput(
last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions
)
class TFBertPooler(tf.keras.layers.Layer):
@@ -517,6 +538,7 @@ class TFBertMainLayer(tf.keras.layers.Layer):
self.initializer_range = config.initializer_range
self.output_attentions = config.output_attentions
self.output_hidden_states = config.output_hidden_states
self.return_dict = config.use_return_dict
self.embeddings = TFBertEmbeddings(config, name="embeddings") self.embeddings = TFBertEmbeddings(config, name="embeddings")
self.encoder = TFBertEncoder(config, name="encoder") self.encoder = TFBertEncoder(config, name="encoder")
self.pooler = TFBertPooler(config, name="pooler") self.pooler = TFBertPooler(config, name="pooler")
...@@ -545,6 +567,7 @@ class TFBertMainLayer(tf.keras.layers.Layer): ...@@ -545,6 +567,7 @@ class TFBertMainLayer(tf.keras.layers.Layer):
inputs_embeds=None, inputs_embeds=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None,
training=False,
):
if isinstance(inputs, (tuple, list)):
@@ -556,7 +579,8 @@ class TFBertMainLayer(tf.keras.layers.Layer):
inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
output_attentions = inputs[6] if len(inputs) > 6 else output_attentions
output_hidden_states = inputs[7] if len(inputs) > 7 else output_hidden_states
return_dict = inputs[8] if len(inputs) > 8 else return_dict
assert len(inputs) <= 9, "Too many inputs."
elif isinstance(inputs, (dict, BatchEncoding)):
input_ids = inputs.get("input_ids")
attention_mask = inputs.get("attention_mask", attention_mask)
@@ -566,12 +590,14 @@ class TFBertMainLayer(tf.keras.layers.Layer):
inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
output_attentions = inputs.get("output_attentions", output_attentions)
output_hidden_states = inputs.get("output_hidden_states", output_hidden_states)
return_dict = inputs.get("return_dict", return_dict)
assert len(inputs) <= 9, "Too many inputs."
else:
input_ids = inputs
output_attentions = output_attentions if output_attentions is not None else self.output_attentions
output_hidden_states = output_hidden_states if output_hidden_states is not None else self.output_hidden_states
return_dict = return_dict if return_dict is not None else self.return_dict
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
@@ -621,16 +647,22 @@ class TFBertMainLayer(tf.keras.layers.Layer):
head_mask,
output_attentions,
output_hidden_states,
return_dict,
training=training,
)
sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output)
if not return_dict:
return (sequence_output, pooled_output,) + encoder_outputs[1:]
return TFBaseModelOutputWithPooling(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
)
class TFBertPreTrainedModel(TFPreTrainedModel):
@@ -642,6 +674,36 @@ class TFBertPreTrainedModel(TFPreTrainedModel):
base_model_prefix = "bert"
@dataclass
class TFBertForPreTrainingOutput(ModelOutput):
"""
Output type of :class:`~transformers.TFBertForPreTraining`.
Args:
prediction_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
seq_relationship_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False
continuation before SoftMax).
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
prediction_logits: tf.Tensor = None
seq_relationship_logits: tf.Tensor = None
hidden_states: Optional[Tuple[tf.Tensor]] = None
attentions: Optional[Tuple[tf.Tensor]] = None
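The new dataclass doubles as a tuple, so existing indexing code keeps working; a minimal sketch with hypothetical zero tensors (shapes only for illustration, and the tuple/dict access behavior is an assumption based on how `ModelOutput` is documented)::

    import tensorflow as tf

    out = TFBertForPreTrainingOutput(
        prediction_logits=tf.zeros((2, 5, 30522)),   # (batch, seq_len, vocab)
        seq_relationship_logits=tf.zeros((2, 2)),
    )
    assert out[0] is out.prediction_logits           # tuple-style indexing still works
    assert out["seq_relationship_logits"] is out[1]  # dict-style access too
    assert out.hidden_states is None                 # unrequested fields default to None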
BERT_START_DOCSTRING = r""" BERT_START_DOCSTRING = r"""
This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class. This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
Use it as a regular TF 2.0 Keras Model and Use it as a regular TF 2.0 Keras Model and
...@@ -712,6 +774,11 @@ BERT_INPUTS_DOCSTRING = r""" ...@@ -712,6 +774,11 @@ BERT_INPUTS_DOCSTRING = r"""
(if set to :obj:`False`) for evaluation. (if set to :obj:`False`) for evaluation.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`): output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
return_dict (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
plain tuple.
""" """
@@ -725,32 +792,13 @@ class TFBertModel(TFBertPreTrainedModel):
self.bert = TFBertMainLayer(config, name="bert")
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-cased",
output_type=TFBaseModelOutputWithPooling,
config_class=_CONFIG_FOR_DOC,
)
def call(self, inputs, **kwargs):
outputs = self.bert(inputs, **kwargs)
return outputs
@@ -772,25 +820,10 @@ class TFBertForPreTraining(TFBertPreTrainedModel):
return self.bert.embeddings
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@replace_return_docstrings(output_type=TFBertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
def call(self, inputs, **kwargs):
r"""
Return:

Examples::
@@ -804,17 +837,23 @@ class TFBertForPreTraining(TFBertPreTrainedModel):
prediction_scores, seq_relationship_scores = outputs[:2]
"""
return_dict = kwargs.get("return_dict")
return_dict = return_dict if return_dict is not None else self.bert.return_dict
outputs = self.bert(inputs, **kwargs)
sequence_output, pooled_output = outputs[:2]
prediction_scores = self.mlm(sequence_output, training=kwargs.get("training", False))
seq_relationship_score = self.nsp(pooled_output)
if not return_dict:
return (prediction_scores, seq_relationship_score) + outputs[2:]
return TFBertForPreTrainingOutput(
prediction_logits=prediction_scores,
seq_relationship_logits=seq_relationship_score,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings("""Bert Model with a `language modeling` head on top. """, BERT_START_DOCSTRING) @add_start_docstrings("""Bert Model with a `language modeling` head on top. """, BERT_START_DOCSTRING)
...@@ -832,7 +871,12 @@ class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss): ...@@ -832,7 +871,12 @@ class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
return self.bert.embeddings return self.bert.embeddings
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="bert-base-cased") @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-cased",
output_type=TFMaskedLMOutput,
config_class=_CONFIG_FOR_DOC,
)
def call(
self,
inputs=None,
@@ -843,6 +887,7 @@ class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
inputs_embeds=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
labels=None,
training=False,
):
@@ -852,27 +897,12 @@ class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
if isinstance(inputs, (tuple, list)):
labels = inputs[9] if len(inputs) > 9 else labels
if len(inputs) > 9:
inputs = inputs[:9]
elif isinstance(inputs, (dict, BatchEncoding)):
labels = inputs.pop("labels", labels)
@@ -885,19 +915,22 @@ class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
training=training,
)
sequence_output = outputs[0]
prediction_scores = self.mlm(sequence_output, training=training)
loss = None if labels is None else self.compute_loss(labels, prediction_scores)
if not return_dict:
output = (prediction_scores,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return TFMaskedLMOutput(
loss=loss, logits=prediction_scores, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
)
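A sketch of the masked-LM path with labels, assuming `bert-base-cased`; the reduction behavior is an assumption worth checking against `TFMaskedLanguageModelingLoss`, which appears to return an unreduced per-token loss::

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = TFBertForMaskedLM.from_pretrained("bert-base-cased")

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="tf")
    # illustrative labels only: score every position against the input itself
    labels = inputs["input_ids"]

    outputs = model(inputs, labels=labels, return_dict=True)
    print(tf.reduce_mean(outputs.loss), outputs.logits.shape)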
class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
@@ -911,7 +944,12 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
def get_output_embeddings(self):
return self.bert.embeddings
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-cased",
output_type=TFCausalLMOutput,
config_class=_CONFIG_FOR_DOC,
)
def call(
self,
inputs=None,
@@ -922,6 +960,7 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
inputs_embeds=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
labels=None,
training=False,
):
@@ -929,27 +968,12 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Labels for computing the cross entropy classification loss.
Indices should be in ``[0, ..., config.vocab_size - 1]``.
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
if isinstance(inputs, (tuple, list)):
labels = inputs[9] if len(inputs) > 9 else labels
if len(inputs) > 9:
inputs = inputs[:9]
elif isinstance(inputs, (dict, BatchEncoding)):
labels = inputs.pop("labels", labels)
@@ -962,21 +986,27 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
training=training,
)
sequence_output = outputs[0]
logits = self.mlm(sequence_output, training=training)
loss = None
if labels is not None:
# shift labels to the left and cut last logit token
logits = logits[:, :-1]
labels = labels[:, 1:]
loss = self.compute_loss(labels, logits)
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return TFCausalLMOutput(
loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
)
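The shift above is easiest to see on toy data; a self-contained sketch of the same alignment::

    import tensorflow as tf

    labels = tf.constant([[0, 1, 2, 3]])   # toy sequence over a 4-token vocabulary
    logits = tf.random.uniform((1, 4, 4))  # (batch, seq_len, vocab)

    # position t predicts token t+1: drop the last logit and the first label
    shifted_logits = logits[:, :-1]
    shifted_labels = labels[:, 1:]
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        shifted_labels, shifted_logits, from_logits=True
    )
    print(loss.shape)  # (1, 3): one loss term per predicted position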
@add_start_docstrings(
@@ -990,23 +1020,10 @@ class TFBertForNextSentencePrediction(TFBertPreTrainedModel):
self.nsp = TFBertNSPHead(config, name="nsp___cls")
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
@replace_return_docstrings(output_type=TFNextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)
def call(self, inputs, **kwargs):
r"""
Return:

Examples::
@@ -1023,14 +1040,19 @@ class TFBertForNextSentencePrediction(TFBertPreTrainedModel):
logits = model(encoding['input_ids'], token_type_ids=encoding['token_type_ids'])[0]
assert logits[0][0] < logits[0][1] # the next sentence was random
"""
return_dict = kwargs.get("return_dict")
return_dict = return_dict if return_dict is not None else self.bert.return_dict
outputs = self.bert(inputs, **kwargs)
pooled_output = outputs[1]
seq_relationship_score = self.nsp(pooled_output)
if not return_dict:
return (seq_relationship_score,) + outputs[2:]
return TFNextSentencePredictorOutput(
logits=seq_relationship_score, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
)
@add_start_docstrings(
@@ -1050,7 +1072,12 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
)
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-cased",
output_type=TFSequenceClassifierOutput,
config_class=_CONFIG_FOR_DOC,
)
def call(
self,
inputs=None,
@@ -1061,6 +1088,7 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
inputs_embeds=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
labels=None,
training=False,
):
@@ -1070,27 +1098,12 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
if isinstance(inputs, (tuple, list)):
labels = inputs[9] if len(inputs) > 9 else labels
if len(inputs) > 9:
inputs = inputs[:9]
elif isinstance(inputs, (dict, BatchEncoding)):
labels = inputs.pop("labels", labels)
@@ -1103,6 +1116,7 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
training=training,
)
@@ -1111,13 +1125,15 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
pooled_output = self.dropout(pooled_output, training=training)
logits = self.classifier(pooled_output)
loss = None if labels is None else self.compute_loss(labels, logits)
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return TFSequenceClassifierOutput(
loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
)
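Usage sketch for the classification head, assuming a freshly initialized two-label head on top of `bert-base-cased` (the head weights are untrained, so the scores themselves are arbitrary)::

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = TFBertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

    inputs = tokenizer("A thoroughly enjoyable read.", return_tensors="tf")
    outputs = model(inputs, labels=tf.constant([1]), return_dict=True)
    print(outputs.logits.shape)  # (1, 2)
    print(outputs.loss.shape)    # per-example loss; reduction is left to the caller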
@add_start_docstrings(
@@ -1145,7 +1161,12 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-cased",
output_type=TFMultipleChoiceModelOutput,
config_class=_CONFIG_FOR_DOC,
)
def call(
self,
inputs,
@@ -1156,6 +1177,7 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
inputs_embeds=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
labels=None,
training=False,
):
@@ -1164,24 +1186,6 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above)
"""
if isinstance(inputs, (tuple, list)):
input_ids = inputs[0]
@@ -1192,8 +1196,9 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
output_attentions = inputs[6] if len(inputs) > 6 else output_attentions
output_hidden_states = inputs[7] if len(inputs) > 7 else output_hidden_states
return_dict = inputs[8] if len(inputs) > 8 else return_dict
labels = inputs[9] if len(inputs) > 9 else labels
assert len(inputs) <= 10, "Too many inputs."
elif isinstance(inputs, (dict, BatchEncoding)):
input_ids = inputs.get("input_ids")
attention_mask = inputs.get("attention_mask", attention_mask)
@@ -1203,10 +1208,12 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
output_attentions = inputs.get("output_attentions", output_attentions)
output_hidden_states = inputs.get("output_hidden_states", output_hidden_states)
return_dict = inputs.get("return_dict", return_dict)
labels = inputs.get("labels", labels) labels = inputs.get("labels", labels)
assert len(inputs) <= 9, "Too many inputs." assert len(inputs) <= 10, "Too many inputs."
else: else:
input_ids = inputs input_ids = inputs
return_dict = return_dict if return_dict is not None else self.bert.return_dict
if input_ids is not None:
num_choices = shape_list(input_ids)[1]
@@ -1233,19 +1240,23 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
flat_inputs_embeds,
output_attentions,
output_hidden_states,
return_dict=return_dict,
training=training,
)
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output, training=training)
logits = self.classifier(pooled_output)
reshaped_logits = tf.reshape(logits, (-1, num_choices))
loss = None if labels is None else self.compute_loss(labels, reshaped_logits)
if not return_dict:
output = (reshaped_logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return TFMultipleChoiceModelOutput(
loss=loss, logits=reshaped_logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
)
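The flatten-and-reshape dance above is just shape bookkeeping; a toy sketch with made-up sizes::

    import tensorflow as tf

    batch_size, num_choices = 2, 4
    # choices are flattened to (batch_size * num_choices, ...) before BERT,
    # scored with a single-unit classifier, then folded back per example
    flat_scores = tf.random.uniform((batch_size * num_choices, 1))
    reshaped_logits = tf.reshape(flat_scores, (-1, num_choices))
    print(reshaped_logits.shape)  # (2, 4): one score per choice, per example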
@add_start_docstrings(
@@ -1265,7 +1276,12 @@ class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationL
)
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-cased",
output_type=TFTokenClassifierOutput,
config_class=_CONFIG_FOR_DOC,
)
def call(
self,
inputs=None,
@@ -1276,6 +1292,7 @@ class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationL
inputs_embeds=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
labels=None,
training=False,
):
@@ -1283,27 +1300,12 @@ class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationL
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels - 1]``.
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
if isinstance(inputs, (tuple, list)):
labels = inputs[9] if len(inputs) > 9 else labels
if len(inputs) > 9:
inputs = inputs[:9]
elif isinstance(inputs, (dict, BatchEncoding)):
labels = inputs.pop("labels", labels)
@@ -1316,6 +1318,7 @@ class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationL
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
training=training,
)
@@ -1324,13 +1327,15 @@ class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationL
sequence_output = self.dropout(sequence_output, training=training)
logits = self.classifier(sequence_output)
loss = None if labels is None else self.compute_loss(labels, logits)
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return TFTokenClassifierOutput(
loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
)
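A sketch of the token-level head, again with an untrained classifier on `bert-base-cased` (the label count is chosen arbitrarily for illustration)::

    from transformers import BertTokenizer, TFBertForTokenClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = TFBertForTokenClassification.from_pretrained("bert-base-cased", num_labels=5)

    inputs = tokenizer("HuggingFace is based in New York City", return_tensors="tf")
    outputs = model(inputs, return_dict=True)
    print(outputs.logits.shape)  # (1, sequence_length, 5): one row of scores per token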
@add_start_docstrings(
@@ -1349,7 +1354,12 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss)
)
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
@add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="bert-base-cased",
output_type=TFQuestionAnsweringModelOutput,
config_class=_CONFIG_FOR_DOC,
)
def call(
self,
inputs=None,
@@ -1360,6 +1370,7 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss)
inputs_embeds=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
start_positions=None,
end_positions=None,
training=False,
@@ -1373,30 +1384,13 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss)
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.bert.return_dict
if isinstance(inputs, (tuple, list)):
start_positions = inputs[9] if len(inputs) > 9 else start_positions
end_positions = inputs[10] if len(inputs) > 10 else end_positions
if len(inputs) > 9:
inputs = inputs[:9]
elif isinstance(inputs, (dict, BatchEncoding)):
start_positions = inputs.pop("start_positions", start_positions)
end_positions = inputs.pop("end_positions", end_positions)
@@ -1410,6 +1404,7 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss)
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
training=training,
)
@@ -1420,12 +1415,20 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss)
start_logits = tf.squeeze(start_logits, axis=-1)
end_logits = tf.squeeze(end_logits, axis=-1)
loss = None
if start_positions is not None and end_positions is not None:
labels = {"start_position": start_positions}
labels["end_position"] = end_positions
loss = self.compute_loss(labels, (start_logits, end_logits))
if not return_dict:
output = (start_logits, end_logits) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return TFQuestionAnsweringModelOutput(
loss=loss,
start_logits=start_logits,
end_logits=end_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
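Span extraction from the two logit sets, sketched against plain `bert-base-cased` (which has no fine-tuned QA head, so the decoded span is arbitrary; a SQuAD-tuned checkpoint would give a sensible answer)::

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForQuestionAnswering

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = TFBertForQuestionAnswering.from_pretrained("bert-base-cased")

    question, context = "Where is HuggingFace based?", "HuggingFace is based in New York City."
    inputs = tokenizer(question, context, return_tensors="tf")
    outputs = model(inputs, return_dict=True)

    # pick the most likely start and end positions, then decode the span
    start = int(tf.argmax(outputs.start_logits, axis=-1)[0])
    end = int(tf.argmax(outputs.end_logits, axis=-1)[0])
    print(tokenizer.decode(inputs["input_ids"][0, start : end + 1]))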
@@ -62,8 +62,6 @@ CAMEMBERT_START_DOCSTRING = r"""
config (:class:`~transformers.CamembertConfig`): Model configuration class with all the parameters of the
model. Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
"""
@@ -23,6 +23,7 @@ import tensorflow as tf
from .configuration_ctrl import CTRLConfig
from .file_utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_callable
from .modeling_tf_outputs import TFBaseModelOutputWithPast, TFCausalLMOutputWithPast
from .modeling_tf_utils import ( from .modeling_tf_utils import (
TFCausalLanguageModelingLoss, TFCausalLanguageModelingLoss,
TFPreTrainedModel, TFPreTrainedModel,
...@@ -35,7 +36,8 @@ from .tokenization_utils import BatchEncoding ...@@ -35,7 +36,8 @@ from .tokenization_utils import BatchEncoding
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
_TOKENIZER_FOR_DOC = "CtrlTokenizer" _CONFIG_FOR_DOC = "CTRLConfig"
_TOKENIZER_FOR_DOC = "CTRLTokenizer"
TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [ TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [
"ctrl" "ctrl"
...@@ -207,6 +209,7 @@ class TFCTRLMainLayer(tf.keras.layers.Layer): ...@@ -207,6 +209,7 @@ class TFCTRLMainLayer(tf.keras.layers.Layer):
self.output_hidden_states = config.output_hidden_states self.output_hidden_states = config.output_hidden_states
self.output_attentions = config.output_attentions self.output_attentions = config.output_attentions
self.use_cache = config.use_cache self.use_cache = config.use_cache
self.return_dict = config.use_return_dict
self.d_model_size = config.n_embd self.d_model_size = config.n_embd
self.num_layers = config.n_layer self.num_layers = config.n_layer
...@@ -260,6 +263,7 @@ class TFCTRLMainLayer(tf.keras.layers.Layer): ...@@ -260,6 +263,7 @@ class TFCTRLMainLayer(tf.keras.layers.Layer):
use_cache=None, use_cache=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None,
training=False, training=False,
): ):
...@@ -274,7 +278,8 @@ class TFCTRLMainLayer(tf.keras.layers.Layer): ...@@ -274,7 +278,8 @@ class TFCTRLMainLayer(tf.keras.layers.Layer):
use_cache = inputs[7] if len(inputs) > 7 else use_cache use_cache = inputs[7] if len(inputs) > 7 else use_cache
output_attentions = inputs[8] if len(inputs) > 8 else output_attentions output_attentions = inputs[8] if len(inputs) > 8 else output_attentions
output_hidden_states = inputs[9] if len(inputs) > 9 else output_hidden_states output_hidden_states = inputs[9] if len(inputs) > 9 else output_hidden_states
assert len(inputs) <= 10, "Too many inputs." return_dict = inputs[10] if len(inputs) > 10 else return_dict
assert len(inputs) <= 11, "Too many inputs."
elif isinstance(inputs, (dict, BatchEncoding)): elif isinstance(inputs, (dict, BatchEncoding)):
input_ids = inputs.get("input_ids") input_ids = inputs.get("input_ids")
past = inputs.get("past", past) past = inputs.get("past", past)
...@@ -286,13 +291,15 @@ class TFCTRLMainLayer(tf.keras.layers.Layer): ...@@ -286,13 +291,15 @@ class TFCTRLMainLayer(tf.keras.layers.Layer):
use_cache = inputs.get("use_cache", use_cache) use_cache = inputs.get("use_cache", use_cache)
output_attentions = inputs.get("output_attentions", output_attentions) output_attentions = inputs.get("output_attentions", output_attentions)
output_hidden_states = inputs.get("output_hidden_states", output_hidden_states) output_hidden_states = inputs.get("output_hidden_states", output_hidden_states)
assert len(inputs) <= 10, "Too many inputs." return_dict = inputs.get("return_dict", return_dict)
assert len(inputs) <= 11, "Too many inputs."
else: else:
input_ids = inputs input_ids = inputs
output_attentions = output_attentions if output_attentions is not None else self.output_attentions output_attentions = output_attentions if output_attentions is not None else self.output_attentions
output_hidden_states = output_hidden_states if output_hidden_states is not None else self.output_hidden_states output_hidden_states = output_hidden_states if output_hidden_states is not None else self.output_hidden_states
use_cache = use_cache if use_cache is not None else self.use_cache use_cache = use_cache if use_cache is not None else self.use_cache
return_dict = return_dict if return_dict is not None else self.return_dict
# If using past key value states, only the last tokens # If using past key value states, only the last tokens
# should be given as an input # should be given as an input
...@@ -374,9 +381,9 @@ class TFCTRLMainLayer(tf.keras.layers.Layer): ...@@ -374,9 +381,9 @@ class TFCTRLMainLayer(tf.keras.layers.Layer):
hidden_states = self.dropout(hidden_states, training=training) hidden_states = self.dropout(hidden_states, training=training)
output_shape = input_shape + [shape_list(hidden_states)[-1]] output_shape = input_shape + [shape_list(hidden_states)[-1]]
presents = () presents = () if use_cache else None
all_hidden_states = () all_hidden_states = () if output_hidden_states else None
all_attentions = [] all_attentions = () if output_attentions else None
for i, (h, layer_past) in enumerate(zip(self.h, past)): for i, (h, layer_past) in enumerate(zip(self.h, past)):
if output_hidden_states: if output_hidden_states:
all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),) all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)
...@@ -396,24 +403,27 @@ class TFCTRLMainLayer(tf.keras.layers.Layer): ...@@ -396,24 +403,27 @@ class TFCTRLMainLayer(tf.keras.layers.Layer):
presents = presents + (present,) presents = presents + (present,)
if output_attentions: if output_attentions:
all_attentions.append(outputs[2]) all_attentions = all_attentions + (outputs[2],)
hidden_states = self.layernorm(hidden_states) hidden_states = self.layernorm(hidden_states)
hidden_states = tf.reshape(hidden_states, output_shape) hidden_states = tf.reshape(hidden_states, output_shape)
if output_hidden_states: if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,) all_hidden_states = all_hidden_states + (hidden_states,)
outputs = (hidden_states,)
if use_cache:
outputs = outputs + (presents,)
if output_hidden_states:
outputs = outputs + (all_hidden_states,)
if output_attentions: if output_attentions:
# let the number of heads free (-1) so we can extract attention even after head pruning # let the number of heads free (-1) so we can extract attention even after head pruning
attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:] attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]
all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions) all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)
outputs = outputs + (all_attentions,)
return outputs if not return_dict:
return tuple(v for v in [hidden_states, presents, all_hidden_states, all_attentions] if v is not None)
return TFBaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=presents,
hidden_states=all_hidden_states,
attentions=all_attentions,
)
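The tuple code path above drops disabled outputs instead of padding their positions. A self-contained sketch of that filtering idiom (the values are placeholders, not model tensors):

    # Hedged illustration of the None-filtering idiom used in the tuple code path.
    hidden_states = "hidden"            # always present
    presents = None                     # use_cache was False
    all_hidden_states = ("h0", "h1")    # output_hidden_states was True
    all_attentions = None               # output_attentions was False

    out = tuple(v for v in [hidden_states, presents, all_hidden_states, all_attentions] if v is not None)
    assert out == ("hidden", ("h0", "h1"))  # positions shift when optional outputs are off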
 class TFCTRLPreTrainedModel(TFPreTrainedModel):
@@ -503,6 +513,11 @@ CTRL_INPUTS_DOCSTRING = r"""
             (if set to :obj:`False`) for evaluation.
         output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`):
             If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+        output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`):
+            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+        return_dict (:obj:`bool`, `optional`, defaults to :obj:`None`):
+            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
+            plain tuple.
 """
@@ -516,29 +531,13 @@ class TFCTRLModel(TFCTRLPreTrainedModel):
         self.transformer = TFCTRLMainLayer(config, name="transformer")
     @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="ctrl")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="ctrl",
+        output_type=TFBaseModelOutputWithPast,
+        config_class=_CONFIG_FOR_DOC,
+    )
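The decorator now also receives the output type and config class, presumably so the generated example section can document the return value. A simplified, hypothetical sketch of this docstring-injection pattern (this is not the actual transformers implementation):

    # Hypothetical sketch of a docstring-injecting decorator.
    def add_code_sample(tokenizer_class, checkpoint, output_type, config_class):
        def decorator(fn):
            sample = (
                "\n    Example::\n\n"
                "        >>> tokenizer = {0}.from_pretrained('{1}')\n"
                "        >>> outputs = model(**inputs)  # a {2} when return_dict=True\n"
            ).format(tokenizer_class, checkpoint, output_type.__name__)
            fn.__doc__ = (fn.__doc__ or "") + sample
            return fn
        return decorator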
     def call(self, inputs, **kwargs):
-        r"""
-        Return:
-            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.CTRLConfig`) and inputs:
-            last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
-                Sequence of hidden-states at the last layer of the model.
-            past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):
-                Contains pre-computed hidden-states (key and values in the attention blocks).
-                Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model
-                should not be passed as input ids as they have already been computed.
-            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-                of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-                Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-                tuple of :obj:`tf.Tensor` (one for each layer) of shape
-                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-                heads.
-        """
         outputs = self.transformer(inputs, **kwargs)
         return outputs
@@ -585,7 +584,12 @@ class TFCTRLLMHeadModel(TFCTRLPreTrainedModel, TFCausalLanguageModelingLoss):
         return {"inputs": inputs, "past": past, "use_cache": kwargs["use_cache"]}
     @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="ctrl")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="ctrl",
+        output_type=TFCausalLMOutputWithPast,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs,
@@ -598,6 +602,7 @@ class TFCTRLLMHeadModel(TFCTRLPreTrainedModel, TFCausalLanguageModelingLoss):
         use_cache=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -605,31 +610,12 @@ class TFCTRLLMHeadModel(TFCTRLPreTrainedModel, TFCausalLanguageModelingLoss):
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
             Labels for computing the cross entropy classification loss.
             Indices should be in ``[0, ..., config.vocab_size - 1]``.
-        Return:
-            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.CTRLConfig`) and inputs:
-            prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
-                Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
-            past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):
-                Contains pre-computed hidden-states (key and values in the attention blocks).
-                Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model
-                should not be passed as input ids as they have already been computed.
-            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-                of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-                Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-                tuple of :obj:`tf.Tensor` (one for each layer) of shape
-                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-                heads.
         """
+        return_dict = return_dict if return_dict is not None else self.transformer.return_dict
         if isinstance(inputs, (tuple, list)):
-            labels = inputs[10] if len(inputs) > 10 else labels
-            if len(inputs) > 10:
-                inputs = inputs[:10]
+            labels = inputs[11] if len(inputs) > 11 else labels
+            if len(inputs) > 11:
+                inputs = inputs[:11]
         elif isinstance(inputs, (dict, BatchEncoding)):
             labels = inputs.pop("labels", labels)
@@ -644,6 +630,7 @@ class TFCTRLLMHeadModel(TFCTRLPreTrainedModel, TFCausalLanguageModelingLoss):
             use_cache=use_cache,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
@@ -651,12 +638,21 @@ class TFCTRLLMHeadModel(TFCTRLPreTrainedModel, TFCausalLanguageModelingLoss):
         logits = self.lm_head(hidden_states)
-        outputs = (logits,) + transformer_outputs[1:]
+        loss = None
         if labels is not None:
             # shift labels to the left and cut last logit token
             logits = logits[:, :-1]
             labels = labels[:, 1:]
             loss = self.compute_loss(labels, logits)
-            outputs = (loss,) + outputs
-        return outputs  # lm_logits, presents, (all hidden_states), (attentions)
+        if not return_dict:
+            output = (logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFCausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
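A runnable sketch of the labels path above, using a deliberately tiny, hypothetical configuration so no checkpoint download is needed (all sizes are illustrative):

    # Hedged example: labels trigger the shifted cross-entropy loss computed above,
    # and the returned logits have already had the last position sliced off.
    import tensorflow as tf
    from transformers import CTRLConfig, TFCTRLLMHeadModel

    config = CTRLConfig(vocab_size=100, n_embd=64, n_layer=2, n_head=2, dff=128)
    model = TFCTRLLMHeadModel(config)
    input_ids = tf.constant([[1, 2, 3, 4]])

    outputs = model(input_ids, labels=input_ids, return_dict=True)
    print(outputs.logits.shape)  # (1, 3, 100): one logit row per predicted next token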
@@ -29,6 +29,14 @@ from .file_utils import (
     add_start_docstrings,
     add_start_docstrings_to_callable,
 )
+from .modeling_tf_outputs import (
+    TFBaseModelOutput,
+    TFMaskedLMOutput,
+    TFMultipleChoiceModelOutput,
+    TFQuestionAnsweringModelOutput,
+    TFSequenceClassifierOutput,
+    TFTokenClassifierOutput,
+)
 from .modeling_tf_utils import (
     TFMaskedLanguageModelingLoss,
     TFMultipleChoiceLoss,
@@ -46,6 +54,7 @@ from .tokenization_utils import BatchEncoding
 logger = logging.getLogger(__name__)
+_CONFIG_FOR_DOC = "DistilBertConfig"
 _TOKENIZER_FOR_DOC = "DistilBertTokenizer"
 TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
@@ -359,7 +368,7 @@ class TFTransformer(tf.keras.layers.Layer):
         self.layer = [TFTransformerBlock(config, name="layer_._{}".format(i)) for i in range(config.n_layers)]
-    def call(self, x, attn_mask, head_mask, output_attentions, output_hidden_states, training=False):
+    def call(self, x, attn_mask, head_mask, output_attentions, output_hidden_states, return_dict, training=False):
         """
         Parameters
         ----------
@@ -379,8 +388,8 @@ class TFTransformer(tf.keras.layers.Layer):
             Tuple of length n_layers with the attention weights from each layer
             Optional: only if output_attentions=True
         """
-        all_hidden_states = ()
-        all_attentions = ()
+        all_hidden_states = () if output_hidden_states else None
+        all_attentions = () if output_attentions else None
         hidden_state = x
         for i, layer_module in enumerate(self.layer):
@@ -401,12 +410,11 @@ class TFTransformer(tf.keras.layers.Layer):
         if output_hidden_states:
             all_hidden_states = all_hidden_states + (hidden_state,)
-        outputs = (hidden_state,)
-        if output_hidden_states:
-            outputs = outputs + (all_hidden_states,)
-        if output_attentions:
-            outputs = outputs + (all_attentions,)
-        return outputs  # last-layer hidden state, (all hidden states), (all attentions)
+        if not return_dict:
+            return tuple(v for v in [hidden_state, all_hidden_states, all_attentions] if v is not None)
+        return TFBaseModelOutput(
+            last_hidden_state=hidden_state, hidden_states=all_hidden_states, attentions=all_attentions
+        )
 @keras_serializable
@@ -418,6 +426,7 @@ class TFDistilBertMainLayer(tf.keras.layers.Layer):
         self.num_hidden_layers = config.num_hidden_layers
         self.output_attentions = config.output_attentions
         self.output_hidden_states = config.output_hidden_states
+        self.return_dict = config.use_return_dict
         self.embeddings = TFEmbeddings(config, name="embeddings")  # Embeddings
         self.transformer = TFTransformer(config, name="transformer")  # Encoder
@@ -440,6 +449,7 @@ class TFDistilBertMainLayer(tf.keras.layers.Layer):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         training=False,
     ):
         if isinstance(inputs, (tuple, list)):
@@ -449,7 +459,8 @@ class TFDistilBertMainLayer(tf.keras.layers.Layer):
             inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds
             output_attentions = inputs[4] if len(inputs) > 4 else output_attentions
             output_hidden_states = inputs[5] if len(inputs) > 5 else output_hidden_states
-            assert len(inputs) <= 6, "Too many inputs."
+            return_dict = inputs[6] if len(inputs) > 6 else return_dict
+            assert len(inputs) <= 7, "Too many inputs."
         elif isinstance(inputs, (dict, BatchEncoding)):
             input_ids = inputs.get("input_ids")
             attention_mask = inputs.get("attention_mask", attention_mask)
@@ -457,12 +468,14 @@ class TFDistilBertMainLayer(tf.keras.layers.Layer):
             inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
             output_attentions = inputs.get("output_attentions", output_attentions)
             output_hidden_states = inputs.get("output_hidden_states", output_hidden_states)
-            assert len(inputs) <= 6, "Too many inputs."
+            return_dict = inputs.get("return_dict", return_dict)
+            assert len(inputs) <= 7, "Too many inputs."
         else:
             input_ids = inputs
         output_attentions = output_attentions if output_attentions is not None else self.output_attentions
         output_hidden_states = output_hidden_states if output_hidden_states is not None else self.output_hidden_states
+        return_dict = return_dict if return_dict is not None else self.return_dict
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
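Threading `return_dict` into slot 6 of the positional input list is why every later index in this file shifts by one (for example, `labels` moves from 6 to 7 in the head classes). A toy sketch of the positional convention, mirroring the signature order above (tensor values are placeholders):

    # Hedged illustration: positional inputs are unpacked in call() signature order.
    ORDER = ["input_ids", "attention_mask", "head_mask", "inputs_embeds",
             "output_attentions", "output_hidden_states", "return_dict"]
    inputs = ([[1, 2]], [[1, 1]], None, None, None, None, True)
    return_dict = inputs[6] if len(inputs) > 6 else None  # before this commit, slot 6 held labels
    assert ORDER[6] == "return_dict" and return_dict is True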
@@ -491,7 +504,13 @@ class TFDistilBertMainLayer(tf.keras.layers.Layer):
         embedding_output = self.embeddings(input_ids, inputs_embeds=inputs_embeds)  # (bs, seq_length, dim)
         tfmr_output = self.transformer(
-            embedding_output, attention_mask, head_mask, output_attentions, output_hidden_states, training=training
+            embedding_output,
+            attention_mask,
+            head_mask,
+            output_attentions,
+            output_hidden_states,
+            return_dict,
+            training=training,
         )
         return tfmr_output  # last-layer hidden-state, (all hidden_states), (all attentions)
@@ -564,9 +583,13 @@ DISTILBERT_INPUTS_DOCSTRING = r"""
         training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
             Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
             (if set to :obj:`False`) for evaluation.
         output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`):
             If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+        output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`):
+            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+        return_dict (:obj:`bool`, `optional`, defaults to :obj:`None`):
+            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
+            plain tuple.
 """
@@ -580,25 +603,13 @@ class TFDistilBertModel(TFDistilBertPreTrainedModel):
         self.distilbert = TFDistilBertMainLayer(config, name="distilbert")  # Embeddings
     @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="distilbert-base-uncased")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="distilbert-base-uncased",
+        output_type=TFBaseModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(self, inputs, **kwargs):
-        r"""
-        Returns:
-            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers,DistilBertConfig`) and inputs:
-            last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
-                Sequence of hidden-states at the output of the last layer of the model.
-            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-                of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-                Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-                tuple of :obj:`tf.Tensor` (one for each layer) of shape
-                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-                heads.
-        """
         outputs = self.distilbert(inputs, **kwargs)
         return outputs
@@ -642,7 +653,12 @@ class TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel, TFMaskedLanguageModelingLoss):
         return self.vocab_projector.input_embeddings
     @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="distilbert-base-uncased")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="distilbert-base-uncased",
+        output_type=TFMaskedLMOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs=None,
@@ -651,6 +667,7 @@ class TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel, TFMaskedLanguageModelingLoss):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -660,27 +677,12 @@ class TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel, TFMaskedLanguageModelingLoss):
             Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
             Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
             in ``[0, ..., config.vocab_size]``
-        Returns:
-            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers,DistilBertConfig`) and inputs:
-            prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
-                Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
-            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-                of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-                Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-                tuple of :obj:`tf.Tensor` (one for each layer) of shape
-                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-                heads.
         """
+        return_dict = return_dict if return_dict is not None else self.distilbert.return_dict
         if isinstance(inputs, (tuple, list)):
-            labels = inputs[6] if len(inputs) > 6 else labels
-            if len(inputs) > 6:
-                inputs = inputs[:6]
+            labels = inputs[7] if len(inputs) > 7 else labels
+            if len(inputs) > 7:
+                inputs = inputs[:7]
         elif isinstance(inputs, (dict, BatchEncoding)):
             labels = inputs.pop("labels", labels)
@@ -691,6 +693,7 @@ class TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel, TFMaskedLanguageModelingLoss):
             inputs_embeds=inputs_embeds,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
@@ -700,13 +703,18 @@ class TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel, TFMaskedLanguageModelingLoss):
         prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)
         prediction_logits = self.vocab_projector(prediction_logits)
-        outputs = (prediction_logits,) + distilbert_output[1:]
-        if labels is not None:
-            loss = self.compute_loss(labels, prediction_logits)
-            outputs = (loss,) + outputs
-        return outputs  # logits, (hidden_states), (attentions)
+        loss = None if labels is None else self.compute_loss(labels, prediction_logits)
+        if not return_dict:
+            output = (prediction_logits,) + distilbert_output[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFMaskedLMOutput(
+            loss=loss,
+            logits=prediction_logits,
+            hidden_states=distilbert_output.hidden_states,
+            attentions=distilbert_output.attentions,
+        )
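A short usage sketch of the masked-LM head with the new named output; the checkpoint and sentence are illustrative:

    # Hedged example: decode the model's guess for the [MASK] token via outputs.logits.
    import tensorflow as tf
    from transformers import DistilBertTokenizer, TFDistilBertForMaskedLM

    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    model = TFDistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

    enc = tokenizer("Paris is the [MASK] of France.", return_tensors="tf")
    outputs = model(enc["input_ids"], return_dict=True)

    mask_pos = int(tf.where(enc["input_ids"][0] == tokenizer.mask_token_id)[0][0])
    predicted_id = int(tf.argmax(outputs.logits[0, mask_pos]))
    print(tokenizer.decode([predicted_id]))  # plausibly "capital"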
 @add_start_docstrings(
@@ -732,7 +740,12 @@ class TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel, TFSequenceClassificationLoss):
         self.dropout = tf.keras.layers.Dropout(config.seq_classif_dropout)
     @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="distilbert-base-uncased")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="distilbert-base-uncased",
+        output_type=TFSequenceClassifierOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs=None,
@@ -741,6 +754,7 @@ class TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel, TFSequenceClassificationLoss):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -750,27 +764,12 @@ class TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel, TFSequenceClassificationLoss):
             Indices should be in ``[0, ..., config.num_labels - 1]``.
             If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
             If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
-        Returns:
-            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers,DistilBertConfig`) and inputs:
-            logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):
-                Classification (or regression if config.num_labels==1) scores (before SoftMax).
-            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-                of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-                Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-                tuple of :obj:`tf.Tensor` (one for each layer) of shape
-                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-                heads.
         """
+        return_dict = return_dict if return_dict is not None else self.distilbert.return_dict
         if isinstance(inputs, (tuple, list)):
-            labels = inputs[6] if len(inputs) > 6 else labels
-            if len(inputs) > 6:
-                inputs = inputs[:6]
+            labels = inputs[7] if len(inputs) > 7 else labels
+            if len(inputs) > 7:
+                inputs = inputs[:7]
         elif isinstance(inputs, (dict, BatchEncoding)):
             labels = inputs.pop("labels", labels)
@@ -781,6 +780,7 @@ class TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel, TFSequenceClassificationLoss):
             inputs_embeds=inputs_embeds,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
@@ -790,13 +790,18 @@ class TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel, TFSequenceClassificationLoss):
         pooled_output = self.dropout(pooled_output, training=training)  # (bs, dim)
         logits = self.classifier(pooled_output)  # (bs, dim)
-        outputs = (logits,) + distilbert_output[1:]
-        if labels is not None:
-            loss = self.compute_loss(labels, logits)
-            outputs = (loss,) + outputs
-        return outputs  # (loss), logits, (hidden_states), (attentions)
+        loss = None if labels is None else self.compute_loss(labels, logits)
+        if not return_dict:
+            output = (logits,) + distilbert_output[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFSequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=distilbert_output.hidden_states,
+            attentions=distilbert_output.attentions,
+        )
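A shape-level check of the classification head under both return styles, using a tiny randomly initialized configuration (all sizes and the label value are hypothetical):

    # Hedged example: with no optional outputs enabled, the legacy tuple is (loss, logits).
    import tensorflow as tf
    from transformers import DistilBertConfig, TFDistilBertForSequenceClassification

    config = DistilBertConfig(vocab_size=100, dim=64, hidden_dim=128, n_layers=2, n_heads=2, num_labels=3)
    model = TFDistilBertForSequenceClassification(config)
    input_ids = tf.constant([[1, 2, 3]])
    labels = tf.constant([1])

    tuple_out = model(input_ids, labels=labels, return_dict=False)
    dict_out = model(input_ids, labels=labels, return_dict=True)
    assert len(tuple_out) == 2              # (loss, logits)
    assert dict_out.logits.shape == (1, 3)  # (batch_size, num_labels)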
 @add_start_docstrings(
@@ -816,7 +821,12 @@ class TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel, TFTokenClassificationLoss):
     )
     @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="distilbert-base-uncased")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="distilbert-base-uncased",
+        output_type=TFTokenClassifierOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs=None,
@@ -825,6 +835,7 @@ class TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel, TFTokenClassificationLoss):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -832,27 +843,12 @@ class TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel, TFTokenClassificationLoss):
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
             Labels for computing the token classification loss.
             Indices should be in ``[0, ..., config.num_labels - 1]``.
-        Returns:
-            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers,DistilBertConfig`) and inputs:
-            scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
-                Classification scores (before SoftMax).
-            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-                of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-                Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-                tuple of :obj:`tf.Tensor` (one for each layer) of shape
-                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-                heads.
         """
+        return_dict = return_dict if return_dict is not None else self.distilbert.return_dict
         if isinstance(inputs, (tuple, list)):
-            labels = inputs[6] if len(inputs) > 6 else labels
-            if len(inputs) > 6:
-                inputs = inputs[:6]
+            labels = inputs[7] if len(inputs) > 7 else labels
+            if len(inputs) > 7:
+                inputs = inputs[:7]
         elif isinstance(inputs, (dict, BatchEncoding)):
             labels = inputs.pop("labels", labels)
@@ -863,6 +859,7 @@ class TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel, TFTokenClassificationLoss):
             inputs_embeds=inputs_embeds,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
@@ -871,13 +868,15 @@ class TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel, TFTokenClassificationLoss):
         sequence_output = self.dropout(sequence_output, training=training)
         logits = self.classifier(sequence_output)
-        outputs = (logits,) + outputs[1:]  # add hidden states and attention if they are here
-        if labels is not None:
-            loss = self.compute_loss(labels, logits)
-            outputs = (loss,) + outputs
-        return outputs  # (loss), logits, (hidden_states), (attentions)
+        loss = None if labels is None else self.compute_loss(labels, logits)
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFTokenClassifierOutput(
+            loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
+        )
 @add_start_docstrings(
@@ -911,7 +910,12 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoiceLoss):
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
     @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="distilbert-base-uncased")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="distilbert-base-uncased",
+        output_type=TFMultipleChoiceModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs,
@@ -920,6 +924,7 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoiceLoss):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -928,24 +933,6 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoiceLoss):
             Labels for computing the multiple choice classification loss.
             Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
             of the input tensors. (see `input_ids` above)
-        Return:
-            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
-            classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:
-                `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).
-                Classification scores (before SoftMax).
-            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-                of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-                Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-                tuple of :obj:`tf.Tensor` (one for each layer) of shape
-                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-                heads.
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -954,8 +941,9 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoiceLoss):
             inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds
             output_attentions = inputs[4] if len(inputs) > 4 else output_attentions
             output_hidden_states = inputs[5] if len(inputs) > 5 else output_hidden_states
-            labels = inputs[6] if len(inputs) > 6 else labels
-            assert len(inputs) <= 7, "Too many inputs."
+            return_dict = inputs[6] if len(inputs) > 6 else return_dict
+            labels = inputs[7] if len(inputs) > 7 else labels
+            assert len(inputs) <= 8, "Too many inputs."
         elif isinstance(inputs, (dict, BatchEncoding)):
             input_ids = inputs.get("input_ids")
             attention_mask = inputs.get("attention_mask", attention_mask)
@@ -963,10 +951,12 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoiceLoss):
             inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
             output_attentions = inputs.get("output_attentions", output_attentions)
             output_hidden_states = inputs.get("output_hidden_states", output_hidden_states)
+            return_dict = inputs.get("return_dict", return_dict)
             labels = inputs.get("labels", labels)
-            assert len(inputs) <= 7, "Too many inputs."
+            assert len(inputs) <= 8, "Too many inputs."
         else:
             input_ids = inputs
+        return_dict = return_dict if return_dict is not None else self.distilbert.return_dict
         if input_ids is not None:
             num_choices = shape_list(input_ids)[1]
@@ -989,6 +979,7 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoiceLoss):
             flat_inputs_embeds,
             output_attentions,
             output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
         hidden_state = distilbert_output[0]  # (bs, seq_len, dim)
@@ -997,13 +988,19 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoiceLoss):
         pooled_output = self.dropout(pooled_output, training=training)  # (bs, dim)
         logits = self.classifier(pooled_output)
         reshaped_logits = tf.reshape(logits, (-1, num_choices))
-        outputs = (reshaped_logits,) + distilbert_output[1:]  # add hidden states and attention if they are here
-        if labels is not None:
-            loss = self.compute_loss(labels, reshaped_logits)
-            outputs = (loss,) + outputs
-        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)
+        loss = None if labels is None else self.compute_loss(labels, reshaped_logits)
+        if not return_dict:
+            output = (reshaped_logits,) + distilbert_output[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFMultipleChoiceModelOutput(
+            loss=loss,
+            logits=reshaped_logits,
+            hidden_states=distilbert_output.hidden_states,
+            attentions=distilbert_output.attentions,
+        )
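For context on the `reshaped_logits` computation above, a shape-only sketch of the multiple-choice flattening (toy sizes, no real model):

    # Hedged illustration: (batch, num_choices, seq_len) inputs are flattened to one row
    # per choice, scored with a single-unit classifier, then folded back to (batch, num_choices).
    import tensorflow as tf

    batch, num_choices, seq_len = 2, 4, 8
    input_ids = tf.zeros((batch, num_choices, seq_len), dtype=tf.int32)
    flat_input_ids = tf.reshape(input_ids, (-1, seq_len))  # (8, 8): one sequence per choice
    scores = tf.zeros((batch * num_choices, 1))            # stand-in for self.classifier(pooled_output)
    reshaped_logits = tf.reshape(scores, (-1, num_choices))
    print(reshaped_logits.shape)                           # (2, 4)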
 @add_start_docstrings(
@@ -1023,7 +1020,12 @@ class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel, TFQuestionAnsweringLoss):
         self.dropout = tf.keras.layers.Dropout(config.qa_dropout)
     @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="distilbert-base-uncased")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="distilbert-base-uncased",
+        output_type=TFQuestionAnsweringModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs=None,
@@ -1032,6 +1034,7 @@ class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel, TFQuestionAnsweringLoss):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         start_positions=None,
         end_positions=None,
         training=False,
@@ -1045,30 +1048,13 @@ class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel, TFQuestionAnsweringLoss):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
             Positions are clamped to the length of the sequence (`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
-        Return:
-            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers,DistilBertConfig`) and inputs:
-            start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
-                Span-start scores (before SoftMax).
-            end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
-                Span-end scores (before SoftMax).
-            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-                of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-                Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-                tuple of :obj:`tf.Tensor` (one for each layer) of shape
-                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-                heads.
         """
+        return_dict = return_dict if return_dict is not None else self.distilbert.return_dict
         if isinstance(inputs, (tuple, list)):
-            start_positions = inputs[6] if len(inputs) > 6 else start_positions
-            end_positions = inputs[7] if len(inputs) > 7 else end_positions
-            if len(inputs) > 6:
-                inputs = inputs[:6]
+            start_positions = inputs[7] if len(inputs) > 7 else start_positions
+            end_positions = inputs[8] if len(inputs) > 8 else end_positions
+            if len(inputs) > 7:
+                inputs = inputs[:7]
         elif isinstance(inputs, (dict, BatchEncoding)):
             start_positions = inputs.pop("start_positions", start_positions)
             end_positions = inputs.pop("end_positions", start_positions)
@@ -1080,6 +1066,7 @@ class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel, TFQuestionAnsweringLoss):
             inputs_embeds=inputs_embeds,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
@@ -1090,12 +1077,20 @@ class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel, TFQuestionAnsweringLoss):
         start_logits = tf.squeeze(start_logits, axis=-1)
         end_logits = tf.squeeze(end_logits, axis=-1)
-        outputs = (start_logits, end_logits,) + distilbert_output[1:]
+        loss = None
         if start_positions is not None and end_positions is not None:
             labels = {"start_position": start_positions}
             labels["end_position"] = end_positions
-            loss = self.compute_loss(labels, outputs[:2])
-            outputs = (loss,) + outputs
-        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)
+            loss = self.compute_loss(labels, (start_logits, end_logits))
+        if not return_dict:
+            output = (start_logits, end_logits) + distilbert_output[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFQuestionAnsweringModelOutput(
+            loss=loss,
+            start_logits=start_logits,
+            end_logits=end_logits,
+            hidden_states=distilbert_output.hidden_states,
+            attentions=distilbert_output.attentions,
+        )
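Closing the loop on the QA head, a sketch of turning the two logit vectors into a token span; real pipelines additionally score joint spans and enforce start <= end (the values are placeholders):

    # Hedged example: independent argmax over start and end logits.
    import tensorflow as tf

    start_logits = tf.constant([[0.1, 2.0, 0.3, 0.1]])  # stand-in for outputs.start_logits
    end_logits = tf.constant([[0.0, 0.2, 3.0, 0.1]])    # stand-in for outputs.end_logits

    start = int(tf.argmax(start_logits, axis=-1)[0])
    end = int(tf.argmax(end_logits, axis=-1)[0])
    print((start, end))  # (1, 2): the answer would be tokens[start : end + 1]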
import logging import logging
from dataclasses import dataclass
from typing import Optional, Tuple
import tensorflow as tf import tensorflow as tf
...@@ -6,11 +8,21 @@ from transformers import ElectraConfig ...@@ -6,11 +8,21 @@ from transformers import ElectraConfig
from .file_utils import ( from .file_utils import (
MULTIPLE_CHOICE_DUMMY_INPUTS, MULTIPLE_CHOICE_DUMMY_INPUTS,
ModelOutput,
add_code_sample_docstrings, add_code_sample_docstrings,
add_start_docstrings, add_start_docstrings,
add_start_docstrings_to_callable, add_start_docstrings_to_callable,
replace_return_docstrings,
) )
from .modeling_tf_bert import ACT2FN, TFBertEncoder, TFBertPreTrainedModel from .modeling_tf_bert import ACT2FN, TFBertEncoder, TFBertPreTrainedModel
from .modeling_tf_outputs import (
TFBaseModelOutput,
TFMaskedLMOutput,
TFMultipleChoiceModelOutput,
TFQuestionAnsweringModelOutput,
TFSequenceClassifierOutput,
TFTokenClassifierOutput,
)
from .modeling_tf_utils import ( from .modeling_tf_utils import (
TFMaskedLanguageModelingLoss, TFMaskedLanguageModelingLoss,
TFMultipleChoiceLoss, TFMultipleChoiceLoss,
...@@ -27,8 +39,8 @@ from .tokenization_utils import BatchEncoding ...@@ -27,8 +39,8 @@ from .tokenization_utils import BatchEncoding
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
_TOKENIZER_FOR_DOC = "ElectraTokenizer"
_CONFIG_FOR_DOC = "ElectraConfig" _CONFIG_FOR_DOC = "ElectraConfig"
_TOKENIZER_FOR_DOC = "ElectraTokenizer"
TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [ TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [
"google/electra-small-generator", "google/electra-small-generator",
@@ -254,6 +266,7 @@ class TFElectraMainLayer(TFElectraPreTrainedModel):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         training=False,
     ):
         if isinstance(inputs, (tuple, list)):
@@ -265,7 +278,8 @@ class TFElectraMainLayer(TFElectraPreTrainedModel):
             inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
             output_attentions = inputs[6] if len(inputs) > 6 else output_attentions
             output_hidden_states = inputs[7] if len(inputs) > 7 else output_hidden_states
-            assert len(inputs) <= 8, "Too many inputs."
+            return_dict = inputs[8] if len(inputs) > 8 else return_dict
+            assert len(inputs) <= 9, "Too many inputs."
         elif isinstance(inputs, (dict, BatchEncoding)):
             input_ids = inputs.get("input_ids")
             attention_mask = inputs.get("attention_mask", attention_mask)
@@ -275,7 +289,8 @@ class TFElectraMainLayer(TFElectraPreTrainedModel):
             inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
             output_attentions = inputs.get("output_attentions", output_attentions)
             output_hidden_states = inputs.get("output_hidden_states", output_hidden_states)
-            assert len(inputs) <= 8, "Too many inputs."
+            return_dict = inputs.get("return_dict", return_dict)
+            assert len(inputs) <= 9, "Too many inputs."
         else:
             input_ids = inputs
@@ -283,6 +298,7 @@ class TFElectraMainLayer(TFElectraPreTrainedModel):
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
         )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
@@ -312,12 +328,41 @@ class TFElectraMainLayer(TFElectraPreTrainedModel):
             head_mask,
             output_attentions,
             output_hidden_states,
+            return_dict,
             training=training,
         )

         return hidden_states
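Note — per the parsing above, `return_dict` can now reach the main layer three ways; a sketch (checkpoint and text illustrative):

    import tensorflow as tf
    from transformers import ElectraTokenizer, TFElectraModel

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = TFElectraModel.from_pretrained("google/electra-small-discriminator")
    input_ids = tokenizer("hello world", return_tensors="tf")["input_ids"]

    out_kw = model(input_ids, return_dict=True)                      # keyword argument
    out_dict = model({"input_ids": input_ids, "return_dict": True})  # dict / BatchEncoding key
    # positional slot 8, which is why the assert now allows up to 9 elements
    out_list = model([input_ids, None, None, None, None, None, None, None, True])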
+@dataclass
+class TFElectraForPreTrainingOutput(ModelOutput):
+    """
+    Output type of :class:`~transformers.TFElectraForPreTraining`.
+
+    Args:
+        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
+            Prediction scores of the head (scores for each token before SoftMax).
+        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
+            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
+            Tuple of :obj:`tf.Tensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+    """
+
+    logits: tf.Tensor = None
+    hidden_states: Optional[Tuple[tf.Tensor]] = None
+    attentions: Optional[Tuple[tf.Tensor]] = None
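Note — a sketch of what the new output class carries (shapes for a batch of one; `hidden_states`/`attentions` stay `None` unless requested):

    import tensorflow as tf
    from transformers import ElectraTokenizer, TFElectraForPreTraining

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = TFElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
    input_ids = tokenizer("The quick brown fox", return_tensors="tf")["input_ids"]

    outputs = model(input_ids, output_hidden_states=True, return_dict=True)
    outputs.logits.shape          # (1, sequence_length)
    len(outputs.hidden_states)    # embeddings output + one entry per layer
    outputs[0] is outputs.logits  # True: integer indexing still works on ModelOutput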
ELECTRA_START_DOCSTRING = r""" ELECTRA_START_DOCSTRING = r"""
This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class. This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
Use it as a regular TF 2.0 Keras Model and Use it as a regular TF 2.0 Keras Model and
...@@ -380,9 +425,13 @@ ELECTRA_INPUTS_DOCSTRING = r""" ...@@ -380,9 +425,13 @@ ELECTRA_INPUTS_DOCSTRING = r"""
training (:obj:`boolean`, `optional`, defaults to :obj:`False`): training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
(if set to :obj:`False`) for evaluation. (if set to :obj:`False`) for evaluation.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`): output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
return_dict (:obj:`bool`, `optional`, defaults to :obj:`None`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
plain tuple.
""" """
@@ -400,25 +449,13 @@ class TFElectraModel(TFElectraPreTrainedModel):
         self.electra = TFElectraMainLayer(config, name="electra")

     @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="google/electra-small-discriminator")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="google/electra-small-discriminator",
+        output_type=TFBaseModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(self, inputs, **kwargs):
-        r"""
-        Returns:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
-        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
-            Sequence of hidden-states at the output of the last layer of the model.
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
-        """
         outputs = self.electra(inputs, **kwargs)
         return outputs
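Note — with the hand-written docstring replaced by `add_code_sample_docstrings`, the base model is documented as returning a `TFBaseModelOutput`; minimal sketch:

    from transformers import ElectraTokenizer, TFElectraModel

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = TFElectraModel.from_pretrained("google/electra-small-discriminator")

    inputs = tokenizer("ELECTRA detects replaced tokens.", return_tensors="tf")
    outputs = model(inputs, return_dict=True)
    outputs.last_hidden_state.shape  # (1, sequence_length, hidden_size)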
@@ -439,6 +476,7 @@ class TFElectraForPreTraining(TFElectraPreTrainedModel):
         self.discriminator_predictions = TFElectraDiscriminatorPredictions(config, name="discriminator_predictions")

     @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=TFElectraForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
         input_ids,
@@ -449,24 +487,11 @@ class TFElectraForPreTraining(TFElectraPreTrainedModel):
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         training=False,
     ):
         r"""
         Returns:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
-        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
-            Prediction scores of the head (scores for each token before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.

         Examples::
@@ -479,6 +504,7 @@ class TFElectraForPreTraining(TFElectraPreTrainedModel):
             outputs = model(input_ids)
             scores = outputs[0]
         """
+        return_dict = return_dict if return_dict is not None else self.electra.config.return_dict
         discriminator_hidden_states = self.electra(
             input_ids,
@@ -489,14 +515,20 @@ class TFElectraForPreTraining(TFElectraPreTrainedModel):
             inputs_embeds,
             output_attentions,
             output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
         discriminator_sequence_output = discriminator_hidden_states[0]
         logits = self.discriminator_predictions(discriminator_sequence_output)
-        output = (logits,)
-        output += discriminator_hidden_states[1:]
-        return output  # (loss), scores, (hidden_states), (attentions)
+        if not return_dict:
+            return (logits,) + discriminator_hidden_states[1:]
+        return TFElectraForPreTrainingOutput(
+            logits=logits,
+            hidden_states=discriminator_hidden_states.hidden_states,
+            attentions=discriminator_hidden_states.attentions,
+        )
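Note — sketch of consuming the discriminator scores from either branch; reading a sigmoid over the logits as per-token "replaced" probabilities reflects the binary ELECTRA head and is our gloss, not something this diff asserts:

    import tensorflow as tf
    from transformers import ElectraTokenizer, TFElectraForPreTraining

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = TFElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
    input_ids = tokenizer("The quick brown fox ate the lazy dog", return_tensors="tf")["input_ids"]

    logits = model(input_ids)[0]                  # tuple branch (the config default)
    outputs = model(input_ids, return_dict=True)  # dict branch
    replaced_probs = tf.sigmoid(outputs.logits)   # per-token probability of "replaced"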
 class TFElectraMaskedLMHead(tf.keras.layers.Layer):
@@ -539,7 +571,12 @@ class TFElectraForMaskedLM(TFElectraPreTrainedModel, TFMaskedLanguageModelingLos
         return self.generator_lm_head

     @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="google/electra-small-generator")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="google/electra-small-generator",
+        output_type=TFMaskedLMOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         input_ids,
@@ -550,6 +587,7 @@ class TFElectraForMaskedLM(TFElectraPreTrainedModel, TFMaskedLanguageModelingLos
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -559,27 +597,12 @@ class TFElectraForMaskedLM(TFElectraPreTrainedModel, TFMaskedLanguageModelingLos
             Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
             Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
             in ``[0, ..., config.vocab_size]``
-        Returns:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
-        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
-            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
         """
+        return_dict = return_dict if return_dict is not None else self.electra.config.return_dict
         if isinstance(input_ids, (tuple, list)):
-            labels = input_ids[8] if len(input_ids) > 8 else labels
-            if len(input_ids) > 8:
-                input_ids = input_ids[:8]
+            labels = input_ids[9] if len(input_ids) > 9 else labels
+            if len(input_ids) > 9:
+                input_ids = input_ids[:9]
         elif isinstance(input_ids, (dict, BatchEncoding)):
             labels = input_ids.pop("labels", labels)
@@ -592,19 +615,25 @@ class TFElectraForMaskedLM(TFElectraPreTrainedModel, TFMaskedLanguageModelingLos
             inputs_embeds,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
         generator_sequence_output = generator_hidden_states[0]
         prediction_scores = self.generator_predictions(generator_sequence_output, training=training)
         prediction_scores = self.generator_lm_head(prediction_scores, training=training)
-        output = (prediction_scores,)
-        output += generator_hidden_states[1:]
-        if labels is not None:
-            loss = self.compute_loss(labels, prediction_scores)
-            output = (loss,) + output
-        return output  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)
+        loss = None if labels is None else self.compute_loss(labels, prediction_scores)
+        if not return_dict:
+            output = (prediction_scores,) + generator_hidden_states[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFMaskedLMOutput(
+            loss=loss,
+            logits=prediction_scores,
+            hidden_states=generator_hidden_states.hidden_states,
+            attentions=generator_hidden_states.attentions,
+        )
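Note — labels sketch for the generator head; the toy labels below score every position, whereas real MLM labels would set -100 everywhere except masked slots, and the TF loss mixins return the loss unreduced:

    import tensorflow as tf
    from transformers import ElectraTokenizer, TFElectraForMaskedLM

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-generator")
    model = TFElectraForMaskedLM.from_pretrained("google/electra-small-generator")

    inputs = tokenizer("Paris is the capital of [MASK].", return_tensors="tf")
    labels = inputs["input_ids"]  # toy: real labels mask non-[MASK] positions with -100
    outputs = model(inputs["input_ids"], labels=labels, return_dict=True)
    mean_loss = tf.reduce_mean(outputs.loss)  # average the per-token losses to a scalar
    outputs.logits.shape                      # (1, sequence_length, vocab_size)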
 class TFElectraClassificationHead(tf.keras.layers.Layer):
@@ -647,6 +676,7 @@ class TFElectraForSequenceClassification(TFElectraPreTrainedModel, TFSequenceCla
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
+        output_type=TFSequenceClassifierOutput,
         config_class=_CONFIG_FOR_DOC,
     )
     def call(
@@ -659,23 +689,25 @@ class TFElectraForSequenceClassification(TFElectraPreTrainedModel, TFSequenceCla
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
         r"""
-        Returns:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
-        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`)
-            Classification (or regression if config.num_labels==1) scores (before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+        labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
+            Labels for computing the sequence classification/regression loss.
+            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
+            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
+            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
         """
+        return_dict = return_dict if return_dict is not None else self.electra.config.return_dict
+        if isinstance(input_ids, (tuple, list)):
+            labels = input_ids[9] if len(input_ids) > 9 else labels
+            if len(input_ids) > 9:
+                input_ids = input_ids[:9]
+        elif isinstance(input_ids, (dict, BatchEncoding)):
+            labels = input_ids.pop("labels", labels)
         outputs = self.electra(
             input_ids,
             attention_mask,
@@ -685,16 +717,20 @@ class TFElectraForSequenceClassification(TFElectraPreTrainedModel, TFSequenceCla
             inputs_embeds,
             output_attentions,
             output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
         logits = self.classifier(outputs[0])
-        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
-        if labels is not None:
-            loss = self.compute_loss(labels, logits)
-            outputs = (loss,) + outputs
-        return outputs  # (loss), logits, (hidden_states), (attentions)
+        loss = None if labels is None else self.compute_loss(labels, logits)
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFSequenceClassifierOutput(
+            loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
+        )
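Note — sequence-classification sketch; `num_labels` is illustrative and the classification head starts out randomly initialized on top of the pretrained encoder:

    import tensorflow as tf
    from transformers import ElectraTokenizer, TFElectraForSequenceClassification

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = TFElectraForSequenceClassification.from_pretrained(
        "google/electra-small-discriminator", num_labels=2
    )

    inputs = tokenizer("A fine movie.", return_tensors="tf")
    outputs = model(inputs, labels=tf.constant([1]), return_dict=True)
    predicted = int(tf.argmax(outputs.logits, axis=-1)[0])
    loss = tf.reduce_mean(outputs.loss)  # per-example losses, averaged here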
 @add_start_docstrings(
@@ -724,7 +760,12 @@ class TFElectraForMultipleChoice(TFElectraPreTrainedModel, TFMultipleChoiceLoss)
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}

     @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="google/electra-small-discriminator")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="google/electra-small-discriminator",
+        output_type=TFMultipleChoiceModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs,
@@ -735,6 +776,7 @@ class TFElectraForMultipleChoice(TFElectraPreTrainedModel, TFMultipleChoiceLoss)
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -743,24 +785,6 @@ class TFElectraForMultipleChoice(TFElectraPreTrainedModel, TFMultipleChoiceLoss)
             Labels for computing the multiple choice classification loss.
             Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
             of the input tensors. (see `input_ids` above)
-        Return:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
-        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:
-            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).
-            Classification scores (before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -771,8 +795,9 @@ class TFElectraForMultipleChoice(TFElectraPreTrainedModel, TFMultipleChoiceLoss)
             inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
             output_attentions = inputs[6] if len(inputs) > 6 else output_attentions
             output_hidden_states = inputs[7] if len(inputs) > 7 else output_hidden_states
-            labels = inputs[8] if len(inputs) > 8 else labels
-            assert len(inputs) <= 9, "Too many inputs."
+            return_dict = inputs[8] if len(inputs) > 8 else return_dict
+            labels = inputs[9] if len(inputs) > 9 else labels
+            assert len(inputs) <= 10, "Too many inputs."
         elif isinstance(inputs, (dict, BatchEncoding)):
             input_ids = inputs.get("input_ids")
             attention_mask = inputs.get("attention_mask", attention_mask)
@@ -782,10 +807,12 @@ class TFElectraForMultipleChoice(TFElectraPreTrainedModel, TFMultipleChoiceLoss)
             inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
             output_attentions = inputs.get("output_attentions", output_attentions)
             output_hidden_states = inputs.get("output_hidden_states", output_hidden_states)
+            return_dict = inputs.get("return_dict", return_dict)
             labels = inputs.get("labels", labels)
-            assert len(inputs) <= 9, "Too many inputs."
+            assert len(inputs) <= 10, "Too many inputs."
         else:
             input_ids = inputs
+        return_dict = return_dict if return_dict is not None else self.electra.config.return_dict

         if input_ids is not None:
             num_choices = shape_list(input_ids)[1]
@@ -812,18 +839,22 @@ class TFElectraForMultipleChoice(TFElectraPreTrainedModel, TFMultipleChoiceLoss)
             flat_inputs_embeds,
             output_attentions,
             output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
         logits = self.sequence_summary(outputs[0])
         logits = self.classifier(logits)
         reshaped_logits = tf.reshape(logits, (-1, num_choices))
-        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here
-        if labels is not None:
-            loss = self.compute_loss(labels, reshaped_logits)
-            outputs = (loss,) + outputs
-        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)
+        loss = None if labels is None else self.compute_loss(labels, reshaped_logits)
+        if not return_dict:
+            output = (reshaped_logits,) + outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFMultipleChoiceModelOutput(
+            loss=loss, logits=reshaped_logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions,
+        )
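Note — multiple-choice sketch showing the (batch_size, num_choices, sequence_length) packing the head expects (prompt and choices illustrative):

    import tensorflow as tf
    from transformers import ElectraTokenizer, TFElectraForMultipleChoice

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = TFElectraForMultipleChoice.from_pretrained("google/electra-small-discriminator")

    prompt = "The sky is"
    choices = ["blue.", "made of cheese."]
    enc = tokenizer([prompt, prompt], choices, return_tensors="tf", padding=True)
    inputs = {k: tf.expand_dims(v, 0) for k, v in enc.items()}  # add the batch dimension
    outputs = model(inputs, return_dict=True)
    outputs.logits.shape  # (1, 2): one score per choice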
 @add_start_docstrings(
@@ -843,7 +874,12 @@ class TFElectraForTokenClassification(TFElectraPreTrainedModel, TFTokenClassific
     )
     @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="google/electra-small-discriminator")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="google/electra-small-discriminator",
+        output_type=TFTokenClassifierOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs,
@@ -854,6 +890,7 @@ class TFElectraForTokenClassification(TFElectraPreTrainedModel, TFTokenClassific
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         labels=None,
         training=False,
     ):
@@ -861,27 +898,12 @@ class TFElectraForTokenClassification(TFElectraPreTrainedModel, TFTokenClassific
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
             Labels for computing the token classification loss.
             Indices should be in ``[0, ..., config.num_labels - 1]``.
-        Returns:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
-        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
-            Classification scores (before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
         """
+        return_dict = return_dict if return_dict is not None else self.electra.config.return_dict
         if isinstance(inputs, (tuple, list)):
-            labels = inputs[8] if len(inputs) > 8 else labels
-            if len(inputs) > 8:
-                inputs = inputs[:8]
+            labels = inputs[9] if len(inputs) > 9 else labels
+            if len(inputs) > 9:
+                inputs = inputs[:9]
         elif isinstance(inputs, (dict, BatchEncoding)):
             labels = inputs.pop("labels", labels)
@@ -894,19 +916,25 @@ class TFElectraForTokenClassification(TFElectraPreTrainedModel, TFTokenClassific
             inputs_embeds,
             output_attentions,
             output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
         discriminator_sequence_output = discriminator_hidden_states[0]
         discriminator_sequence_output = self.dropout(discriminator_sequence_output)
         logits = self.classifier(discriminator_sequence_output)
-        outputs = (logits,) + discriminator_hidden_states[1:]
-        if labels is not None:
-            loss = self.compute_loss(labels, logits)
-            outputs = (loss,) + outputs
-        return outputs  # (loss), scores, (hidden_states), (attentions)
+        loss = None if labels is None else self.compute_loss(labels, logits)
+        if not return_dict:
+            output = (logits,) + discriminator_hidden_states[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFTokenClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=discriminator_hidden_states.hidden_states,
+            attentions=discriminator_hidden_states.attentions,
+        )
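Note — token-classification sketch; the head here is untrained, so the predicted label ids are only shape-meaningful:

    import tensorflow as tf
    from transformers import ElectraTokenizer, TFElectraForTokenClassification

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = TFElectraForTokenClassification.from_pretrained("google/electra-small-discriminator")

    inputs = tokenizer("HuggingFace is based in NYC", return_tensors="tf")
    outputs = model(inputs, return_dict=True)
    predictions = tf.argmax(outputs.logits, axis=-1)  # (1, sequence_length) label ids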
 @add_start_docstrings(
@@ -925,7 +953,12 @@ class TFElectraForQuestionAnswering(TFElectraPreTrainedModel, TFQuestionAnswerin
     )
     @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
-    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="google/electra-small-discriminator")
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="google/electra-small-discriminator",
+        output_type=TFQuestionAnsweringModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
     def call(
         self,
         inputs,
@@ -936,6 +969,7 @@ class TFElectraForQuestionAnswering(TFElectraPreTrainedModel, TFQuestionAnswerin
         inputs_embeds=None,
         output_attentions=None,
         output_hidden_states=None,
+        return_dict=None,
         start_positions=None,
         end_positions=None,
         training=False,
@@ -949,30 +983,13 @@ class TFElectraForQuestionAnswering(TFElectraPreTrainedModel, TFQuestionAnswerin
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
             Positions are clamped to the length of the sequence (`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
-        Return:
-        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
-        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
-            Span-start scores (before SoftMax).
-        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
-            Span-end scores (before SoftMax).
-        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
-            tuple of :obj:`tf.Tensor` (one for each layer) of shape
-            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
         """
+        return_dict = return_dict if return_dict is not None else self.electra.config.return_dict
         if isinstance(inputs, (tuple, list)):
-            start_positions = inputs[8] if len(inputs) > 8 else start_positions
-            end_positions = inputs[9] if len(inputs) > 9 else end_positions
-            if len(inputs) > 8:
-                inputs = inputs[:8]
+            start_positions = inputs[9] if len(inputs) > 9 else start_positions
+            end_positions = inputs[10] if len(inputs) > 10 else end_positions
+            if len(inputs) > 9:
+                inputs = inputs[:9]
         elif isinstance(inputs, (dict, BatchEncoding)):
             start_positions = inputs.pop("start_positions", start_positions)
             end_positions = inputs.pop("end_positions", end_positions)
@@ -986,6 +1003,7 @@ class TFElectraForQuestionAnswering(TFElectraPreTrainedModel, TFQuestionAnswerin
             inputs_embeds,
             output_attentions,
             output_hidden_states,
+            return_dict=return_dict,
             training=training,
         )
         discriminator_sequence_output = discriminator_hidden_states[0]
@@ -995,12 +1013,20 @@ class TFElectraForQuestionAnswering(TFElectraPreTrainedModel, TFQuestionAnswerin
         start_logits = tf.squeeze(start_logits, axis=-1)
         end_logits = tf.squeeze(end_logits, axis=-1)
-        outputs = (start_logits, end_logits,) + discriminator_hidden_states[1:]
+        loss = None
         if start_positions is not None and end_positions is not None:
             labels = {"start_position": start_positions}
             labels["end_position"] = end_positions
-            loss = self.compute_loss(labels, outputs[:2])
-            outputs = (loss,) + outputs
-        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)
+            loss = self.compute_loss(labels, (start_logits, end_logits))
+        if not return_dict:
+            output = (start_logits, end_logits,) + discriminator_hidden_states[1:]
+            return ((loss,) + output) if loss is not None else output
+        return TFQuestionAnsweringModelOutput(
+            loss=loss,
+            start_logits=start_logits,
+            end_logits=end_logits,
+            hidden_states=discriminator_hidden_states.hidden_states,
+            attentions=discriminator_hidden_states.attentions,
+        )
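Note — span-extraction sketch over the new named fields; a fine-tuned QA checkpoint is needed for sensible spans, the base discriminator is used here only to show the API:

    import tensorflow as tf
    from transformers import ElectraTokenizer, TFElectraForQuestionAnswering

    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = TFElectraForQuestionAnswering.from_pretrained("google/electra-small-discriminator")

    inputs = tokenizer("Where is it based?", "HuggingFace is based in NYC.", return_tensors="tf")
    outputs = model(inputs, return_dict=True)
    start = int(tf.argmax(outputs.start_logits, axis=-1)[0])
    end = int(tf.argmax(outputs.end_logits, axis=-1)[0])
    answer = tokenizer.decode(inputs["input_ids"][0].numpy()[start : end + 1])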