chenpangpang / transformers, commit 00df3d4d

ALBERT Modeling + required changes to utilities

Authored Jan 15, 2020 by Lysandre; committed by Lysandre Debut on Jan 23, 2020
Parent: f81b6c95

Showing 4 changed files with 259 additions and 166 deletions (+259, -166):

  docs/source/model_doc/albert.rst     (+38, -9)
  src/transformers/file_utils.py       (+19, -1)
  src/transformers/modeling_albert.py  (+180, -150)
  src/transformers/modeling_utils.py   (+22, -6)
docs/source/model_doc/albert.rst

 ALBERT
 ----------------------------------------------------
 
-``AlbertConfig``
+Overview
 ~~~~~~~~~~~~~~~~~~~~~
+
+The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
+by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
+two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:
+
+- Splitting the embedding matrix into two smaller matrices
+- Using repeating layers split among groups
+
+The abstract from the paper is the following:
+
+*Increasing model size when pretraining natural language representations often results in improved performance on
+downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
+longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
+techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
+that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
+self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream
+tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE,
+RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.*
+
+Tips:
+
+- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
+  the right rather than the left.
+- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
+  similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
+  number of (repeating) layers.
+
+AlbertConfig
+~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertConfig
     :members:
 
 
-``AlbertTokenizer``
+AlbertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertTokenizer
     :members:
 
 
-``AlbertModel``
+AlbertModel
 ~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertModel
     :members:
 
 
-``AlbertForMaskedLM``
+AlbertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertForMaskedLM
     :members:
 
 
-``AlbertForSequenceClassification``
+AlbertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertForSequenceClassification
     :members:
 
 
-``AlbertForQuestionAnswering``
+AlbertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertForQuestionAnswering
     :members:
 
 
-``TFAlbertModel``
+TFAlbertModel
 ~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertModel
     :members:
 
 
-``TFAlbertForMaskedLM``
+TFAlbertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertForMaskedLM
     :members:
 
 
-``TFAlbertForSequenceClassification``
+TFAlbertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertForSequenceClassification
...
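For orientation, a short usage sketch of the classes documented in this file. It is not part of the commit: the "albert-base-v2" checkpoint name and the tuple-style model outputs are assumptions based on the transformers API of that period.

    # Hedged usage sketch, not part of this diff. Assumes the "albert-base-v2"
    # checkpoint and the tuple-style outputs of transformers circa early 2020.
    import torch
    from transformers import AlbertModel, AlbertTokenizer

    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
    model = AlbertModel.from_pretrained("albert-base-v2")
    model.eval()

    # Per the tip above, ALBERT uses absolute position embeddings,
    # so any padding should go on the right rather than the left.
    input_ids = torch.tensor([tokenizer.encode(
        "ALBERT shares parameters across its repeating layers.", add_special_tokens=True
    )])

    with torch.no_grad():
        last_hidden_state = model(input_ids)[0]  # (batch_size, sequence_length, hidden_size)
    print(last_hidden_state.shape)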
src/transformers/file_utils.py

...
@@ -105,7 +105,25 @@ def is_tf_available():
 
 def add_start_docstrings(*docstr):
     def docstring_decorator(fn):
-        fn.__doc__ = "".join(docstr) + fn.__doc__
+        fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "")
+        return fn
+
+    return docstring_decorator
+
+
+def add_start_docstrings_to_callable(*docstr):
+    def docstring_decorator(fn):
+        class_name = ":class:`~transformers.{}`".format(fn.__qualname__.split(".")[0])
+        intro = " The {} forward method, overrides the :func:`__call__` special method.".format(class_name)
+        note = r"""
+
+    .. note::
+        Although the recipe for forward pass needs to be defined within
+        this function, one should call the :class:`Module` instance afterwards
+        instead of this since the former takes care of running the
+        registered hooks while the latter silently ignores them.
+        """
+        fn.__doc__ = intro + note + "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "")
         return fn
 
     return docstring_decorator
...
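The guard added above matters because fn.__doc__ is None for a function defined without a docstring, and concatenating a str with None raises a TypeError. A minimal self-contained sketch: the decorator body is copied from the hunk above, while the two decorated functions are hypothetical examples.

    # The decorator body is copied from the hunk above; the decorated functions
    # below are hypothetical and exist only to illustrate the None guard.
    def add_start_docstrings(*docstr):
        def docstring_decorator(fn):
            fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "")
            return fn

        return docstring_decorator


    @add_start_docstrings("Shared introduction. ")
    def documented():
        """Function-specific details."""


    @add_start_docstrings("Shared introduction. ")
    def undocumented():  # __doc__ is None; the pre-change code raised TypeError here
        pass


    print(documented.__doc__)    # Shared introduction. Function-specific details.
    print(undocumented.__doc__)  # Shared introduction.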
src/transformers/modeling_albert.py

(diff collapsed; not shown)
src/transformers/modeling_utils.py

...
@@ -114,7 +114,12 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
         return getattr(self, self.base_model_prefix, self)
 
     def get_input_embeddings(self):
-        """ Get model's input embeddings
+        """
+        Returns the model's input embeddings.
+
+        Returns:
+            :obj:`nn.Module`:
+                A torch module mapping vocabulary to hidden states.
         """
         base_model = getattr(self, self.base_model_prefix, self)
         if base_model is not self:
...
@@ -123,7 +128,12 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
             raise NotImplementedError
 
     def set_input_embeddings(self, value):
-        """ Set model's input embeddings
+        """
+        Set model's input embeddings
+
+        Args:
+            value (:obj:`nn.Module`):
+                A module mapping vocabulary to hidden states.
         """
         base_model = getattr(self, self.base_model_prefix, self)
         if base_model is not self:
...
@@ -132,14 +142,20 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
             raise NotImplementedError
 
     def get_output_embeddings(self):
-        """ Get model's output embeddings
-            Return None if the model doesn't have output embeddings
+        """
+        Returns the model's output embeddings.
+
+        Returns:
+            :obj:`nn.Module`:
+                A torch module mapping hidden states to vocabulary.
         """
         return None  # Overwrite for models with output embeddings
 
     def tie_weights(self):
-        """ Make sure we are sharing the input and output embeddings.
-            Export to TorchScript can't handle parameter sharing so we are cloning them instead.
+        """
+        Tie the weights between the input embeddings and the output embeddings.
+        If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning
+        the weights instead.
        """
         output_embeddings = self.get_output_embeddings()
         if output_embeddings is not None:
...
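To make the docstrings above concrete, here is a hedged sketch of the embedding accessors in use. It assumes that AlbertModel (whose diff is collapsed above) exposes its word-embedding matrix through these accessors, as the PreTrainedModel contract describes; the small config values are arbitrary, chosen only to keep the example light.

    # Illustrative sketch, not part of the commit. Config values are arbitrary,
    # and AlbertModel is assumed to implement the accessor contract shown above.
    import torch.nn as nn
    from transformers import AlbertConfig, AlbertModel

    config = AlbertConfig(
        vocab_size=30000, embedding_size=128, hidden_size=256,
        num_hidden_layers=2, num_attention_heads=4, intermediate_size=512,
    )
    model = AlbertModel(config)

    # get_input_embeddings(): the nn.Module mapping vocabulary ids to embeddings.
    # For ALBERT this matrix is (vocab_size, embedding_size) because the embedding
    # matrix is split in two, as described in the docs diff above.
    emb = model.get_input_embeddings()
    print(type(emb).__name__, tuple(emb.weight.shape))  # Embedding (30000, 128)

    # set_input_embeddings(): swap in a replacement module of the same shape.
    new_emb = nn.Embedding(config.vocab_size, config.embedding_size)
    model.set_input_embeddings(new_emb)
    assert model.get_input_embeddings() is new_emb

    # The bare encoder has no output head, so there is nothing for tie_weights to tie.
    print(model.get_output_embeddings())  # None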