Commit 8fe2c9d9 authored by LysandreJik

Refactored Docstrings of BERT, GPT2, GPT, TransfoXL, XLM and XLNet.

parent ed6c8d37
@@ -20,7 +20,7 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
     export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
-    pytorch_pretrained_bert bert \
+    pytorch_transformers bert \
       $BERT_BASE_DIR/bert_model.ckpt \
       $BERT_BASE_DIR/bert_config.json \
       $BERT_BASE_DIR/pytorch_model.bin
@@ -36,7 +36,7 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
     export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights
-    pytorch_pretrained_bert gpt \
+    pytorch_transformers gpt \
       $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
       $PYTORCH_DUMP_OUTPUT \
       [OPENAI_GPT_CONFIG]
@@ -50,7 +50,7 @@ Here is an example of the conversion process for a pre-trained Transformer-XL mo
     export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint
-    pytorch_pretrained_bert transfo_xl \
+    pytorch_transformers transfo_xl \
       $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
       $PYTORCH_DUMP_OUTPUT \
       [TRANSFO_XL_CONFIG]
@@ -64,7 +64,7 @@ Here is an example of the conversion process for a pre-trained OpenAI's GPT-2 mo
     export GPT2_DIR=/path/to/gpt2/checkpoint
-    pytorch_pretrained_bert gpt2 \
+    pytorch_transformers gpt2 \
       $GPT2_DIR/model.ckpt \
       $PYTORCH_DUMP_OUTPUT \
       [GPT2_CONFIG]
@@ -79,7 +79,7 @@ Here is an example of the conversion process for a pre-trained XLNet model, fine
     export TRANSFO_XL_CHECKPOINT_PATH=/path/to/xlnet/checkpoint
     export TRANSFO_XL_CONFIG_PATH=/path/to/xlnet/config
-    pytorch_pretrained_bert xlnet \
+    pytorch_transformers xlnet \
       $TRANSFO_XL_CHECKPOINT_PATH \
       $TRANSFO_XL_CONFIG_PATH \
       $PYTORCH_DUMP_OUTPUT \
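The same rename applies to Python imports: user code that previously imported from ``pytorch_pretrained_bert`` now imports from ``pytorch_transformers``. A minimal sketch of the updated usage (the ``bert-base-uncased`` shortcut name is illustrative; any pretrained checkpoint works the same way):

.. code-block:: python

    # Old package name (pre-rename):
    # from pytorch_pretrained_bert import BertModel, BertTokenizer

    # New package name:
    from pytorch_transformers import BertModel, BertTokenizer

    # Download (and cache) a pretrained checkpoint by its shortcut name
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')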
@@ -4,75 +4,75 @@ BERT
 ``BertConfig``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertConfig
+.. autoclass:: pytorch_transformers.BertConfig
     :members:
 ``BertTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertTokenizer
+.. autoclass:: pytorch_transformers.BertTokenizer
     :members:
 ``BertAdam``
 ~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertAdam
+.. autoclass:: pytorch_transformers.BertAdam
     :members:
 1. ``BertModel``
 ~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertModel
+.. autoclass:: pytorch_transformers.BertModel
     :members:
 2. ``BertForPreTraining``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForPreTraining
+.. autoclass:: pytorch_transformers.BertForPreTraining
     :members:
 3. ``BertForMaskedLM``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForMaskedLM
+.. autoclass:: pytorch_transformers.BertForMaskedLM
     :members:
 4. ``BertForNextSentencePrediction``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForNextSentencePrediction
+.. autoclass:: pytorch_transformers.BertForNextSentencePrediction
     :members:
 5. ``BertForSequenceClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForSequenceClassification
+.. autoclass:: pytorch_transformers.BertForSequenceClassification
     :members:
 6. ``BertForMultipleChoice``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForMultipleChoice
+.. autoclass:: pytorch_transformers.BertForMultipleChoice
     :members:
 7. ``BertForTokenClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForTokenClassification
+.. autoclass:: pytorch_transformers.BertForTokenClassification
     :members:
 8. ``BertForQuestionAnswering``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForQuestionAnswering
+.. autoclass:: pytorch_transformers.BertForQuestionAnswering
     :members:
@@ -4,40 +4,40 @@ OpenAI GPT
 ``OpenAIGPTConfig``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTConfig
+.. autoclass:: pytorch_transformers.OpenAIGPTConfig
     :members:
 ``OpenAIGPTTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTTokenizer
+.. autoclass:: pytorch_transformers.OpenAIGPTTokenizer
     :members:
 ``OpenAIAdam``
 ~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIAdam
+.. autoclass:: pytorch_transformers.OpenAIAdam
     :members:
 9. ``OpenAIGPTModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTModel
+.. autoclass:: pytorch_transformers.OpenAIGPTModel
     :members:
 10. ``OpenAIGPTLMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTLMHeadModel
+.. autoclass:: pytorch_transformers.OpenAIGPTLMHeadModel
     :members:
 11. ``OpenAIGPTDoubleHeadsModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTDoubleHeadsModel
+.. autoclass:: pytorch_transformers.OpenAIGPTDoubleHeadsModel
     :members:
@@ -4,33 +4,33 @@ OpenAI GPT2
 ``GPT2Config``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2Config
+.. autoclass:: pytorch_transformers.GPT2Config
     :members:
 ``GPT2Tokenizer``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2Tokenizer
+.. autoclass:: pytorch_transformers.GPT2Tokenizer
     :members:
 14. ``GPT2Model``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2Model
+.. autoclass:: pytorch_transformers.GPT2Model
     :members:
 15. ``GPT2LMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2LMHeadModel
+.. autoclass:: pytorch_transformers.GPT2LMHeadModel
     :members:
 16. ``GPT2DoubleHeadsModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2DoubleHeadsModel
+.. autoclass:: pytorch_transformers.GPT2DoubleHeadsModel
     :members:
@@ -5,26 +5,26 @@ Transformer XL
 ``TransfoXLConfig``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.TransfoXLConfig
+.. autoclass:: pytorch_transformers.TransfoXLConfig
     :members:
 ``TransfoXLTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.TransfoXLTokenizer
+.. autoclass:: pytorch_transformers.TransfoXLTokenizer
     :members:
 12. ``TransfoXLModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.TransfoXLModel
+.. autoclass:: pytorch_transformers.TransfoXLModel
     :members:
 13. ``TransfoXLLMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.TransfoXLLMHeadModel
+.. autoclass:: pytorch_transformers.TransfoXLLMHeadModel
     :members:
 XLM
 ----------------------------------------------------
-I don't really know what to put here, I'll leave it up to you to decide @Thom
\ No newline at end of file
+``XLMConfig``
+~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.TransfoXLConfig
+    :members:
+17. ``XLMModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.XLMModel
+    :members:
+18. ``XLMWithLMHeadModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.XLMWithLMHeadModel
+    :members:
+19. ``XLMForSequenceClassification``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.XLMForSequenceClassification
+    :members:
+20. ``XLMForQuestionAnswering``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.XLMForQuestionAnswering
+    :members:
@@ -11,7 +11,7 @@ First let's prepare a tokenized input with ``BertTokenizer``
 .. code-block:: python
     import torch
-    from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
+    from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM
     # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
     import logging
@@ -89,7 +89,7 @@ First let's prepare a tokenized input with ``OpenAIGPTTokenizer``
 .. code-block:: python
     import torch
-    from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
+    from pytorch_transformers import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
     # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
     import logging
@@ -177,7 +177,7 @@ First let's prepare a tokenized input with ``TransfoXLTokenizer``
 .. code-block:: python
     import torch
-    from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
+    from pytorch_transformers import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
     # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
     import logging
@@ -253,7 +253,7 @@ First let's prepare a tokenized input with ``GPT2Tokenizer``
 .. code-block:: python
     import torch
-    from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
+    from pytorch_transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
     # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
     import logging
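All four quickstart snippets touched above follow the same pattern once the import is updated; a condensed, hedged sketch for the BERT case (the input sentence and shortcut name are illustrative, not taken from the diff):

.. code-block:: python

    import torch
    from pytorch_transformers import BertTokenizer, BertModel

    # OPTIONAL: activate the logger to see what is happening under the hood
    import logging
    logging.basicConfig(level=logging.INFO)

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    text = "[CLS] Jim Henson was a puppeteer [SEP]"
    tokenized_text = tokenizer.tokenize(text)

    # Convert tokens to vocabulary indices and build a batch of size 1
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])

    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()
    with torch.no_grad():
        outputs = model(tokens_tensor)  # tuple; the first element is the sequence of hidden states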
@@ -177,6 +177,38 @@ def load_tf_weights_in_transfo_xl(model, config, tf_path):
 class TransfoXLConfig(PretrainedConfig):
     """Configuration class to store the configuration of a `TransfoXLModel`.
+
+    Args:
+        vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `TransfoXLModel` or a configuration json file.
+        cutoffs: cutoffs for the adaptive softmax
+        d_model: Dimensionality of the model's hidden states.
+        d_embed: Dimensionality of the embeddings
+        d_head: Dimensionality of the model's heads.
+        div_val: dividend value for adaptive input and softmax
+        pre_lnorm: apply LayerNorm to the input instead of the output
+        d_inner: Inner dimension in FF
+        n_layer: Number of hidden layers in the Transformer encoder.
+        n_head: Number of attention heads for each attention layer in
+            the Transformer encoder.
+        tgt_len: number of tokens to predict
+        ext_len: length of the extended context
+        mem_len: length of the retained previous heads
+        same_length: use the same attn length for all tokens
+        proj_share_all_but_first: True to share all but first projs, False not to share.
+        attn_type: attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.
+        clamp_len: use the same pos embeddings after clamp_len
+        sample_softmax: number of samples in sampled softmax
+        adaptive: use adaptive softmax
+        tie_weight: tie the word embedding and softmax weights
+        dropout: The dropout probability for all fully connected
+            layers in the embeddings, encoder, and pooler.
+        dropatt: The dropout ratio for the attention probabilities.
+        untie_r: untie relative position biases
+        embd_pdrop: The dropout ratio for the embeddings.
+        init: parameter initializer to use
+        init_range: parameters initialized by U(-init_range, init_range).
+        proj_init_std: parameters initialized by N(0, init_std)
+        init_std: parameters initialized by N(0, init_std)
     """
     pretrained_config_archive_map = TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP
@@ -210,38 +242,6 @@ class TransfoXLConfig(PretrainedConfig):
                  init_std=0.02,
                  **kwargs):
         """Constructs TransfoXLConfig.
-        Args:
-            vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `TransfoXLModel` or a configuration json file.
-            cutoffs: cutoffs for the adaptive softmax
-            d_model: Dimensionality of the model's hidden states.
-            d_embed: Dimensionality of the embeddings
-            d_head: Dimensionality of the model's heads.
-            div_val: divident value for adapative input and softmax
-            pre_lnorm: apply LayerNorm to the input instead of the output
-            d_inner: Inner dimension in FF
-            n_layer: Number of hidden layers in the Transformer encoder.
-            n_head: Number of attention heads for each attention layer in
-                the Transformer encoder.
-            tgt_len: number of tokens to predict
-            ext_len: length of the extended context
-            mem_len: length of the retained previous heads
-            same_length: use the same attn length for all tokens
-            proj_share_all_but_first: True to share all but first projs, False not to share.
-            attn_type: attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.
-            clamp_len: use the same pos embeddings after clamp_len
-            sample_softmax: number of samples in sampled softmax
-            adaptive: use adaptive softmax
-            tie_weight: tie the word embedding and softmax weights
-            dropout: The dropout probabilitiy for all fully connected
-                layers in the embeddings, encoder, and pooler.
-            dropatt: The dropout ratio for the attention probabilities.
-            untie_r: untie relative position biases
-            embd_pdrop: The dropout ratio for the embeddings.
-            init: parameter initializer to use
-            init_range: parameters initialized by U(-init_range, init_range).
-            proj_init_std: parameters initialized by N(0, init_std)
-            init_std: parameters initialized by N(0, init_std)
         """
         super(TransfoXLConfig, self).__init__(**kwargs)
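Since the argument documentation now lives on the class itself, here is a minimal, hedged sketch of how such a configuration is typically constructed (the overridden values are purely illustrative):

.. code-block:: python

    from pytorch_transformers import TransfoXLConfig, TransfoXLModel

    # Default configuration values
    config = TransfoXLConfig()

    # Individual documented arguments can be overridden at construction time
    custom_config = TransfoXLConfig(mem_len=800, clamp_len=1000, dropout=0.1)

    # A model built from a configuration starts with randomly initialized weights
    model = TransfoXLModel(config)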
@@ -901,42 +901,20 @@ class TransfoXLPreTrainedModel(PreTrainedModel):
 class TransfoXLModel(TransfoXLPreTrainedModel):
     """Transformer XL model ("Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context").
-    Transformer XL use a relative positioning (with sinusiodal patterns) and adaptive softmax inputs which means that:
-    - you don't need to specify positioning embeddings indices
-    - the tokens in the vocabulary have to be sorted to decreasing frequency.
-    Params:
+    Transformer XL uses relative positioning (with sinusoidal patterns) and adaptive softmax inputs which means that:
+
+        - you don't need to specify positioning embeddings indices.
+        - the tokens in the vocabulary have to be sorted in decreasing frequency.
+
+    Args:
         config: a TransfoXLConfig class instance with the configuration to build a new model
-    Inputs:
-        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
-            with the token indices selected in the range [0, self.config.n_token[
-        `mems`: optional memomry of hidden states from previous forward passes
-            as a list (num layers) of hidden states at the entry of each layer
-            each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]
-            Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
-    Outputs:
-        A tuple of (last_hidden_state, new_mems)
-        `last_hidden_state`: the encoded-hidden-states at the top of the model
-            as a torch.FloatTensor of size [batch_size, sequence_length, self.config.d_model]
-        `new_mems`: list (num layers) of updated mem states at the entry of each layer
-            each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]
-            Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
-    Example usage:
-    ```python
-    # Already been converted into BPE token ids
-    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
-    input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])
+    Example::
+
         config = TransfoXLConfig()
         model = TransfoXLModel(config)
-    last_hidden_state, new_mems = model(input_ids)
-    # Another time on input_ids_next using the memory:
-    last_hidden_state, new_mems = model(input_ids_next, new_mems)
-    ```
     """
     def __init__(self, config):
         super(TransfoXLModel, self).__init__(config)
@@ -1200,18 +1178,40 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
         return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)
     def forward(self, input_ids, mems=None, head_mask=None):
-        """ Params:
-                input_ids :: [bsz, len]
-                mems :: optional mems from previous forwar passes (or init_mems)
-                    list (num layers) of mem states at the entry of each layer
-                        shape :: [self.config.mem_len, bsz, self.config.d_model]
+        """
+        Performs a model forward pass. **Can be called by calling the class directly, once it has been instantiated.**
+
+        Args:
+            `input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
+                with the token indices selected in the range [0, self.config.n_token[
+            `mems`: optional memory of hidden states from previous forward passes
+                as a list (num layers) of hidden states at the entry of each layer
+                each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]
             Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
         Returns:
-            tuple (last_hidden, new_mems) where:
-                new_mems: list (num layers) of mem states at the entry of each layer
-                    shape :: [self.config.mem_len, bsz, self.config.d_model]
-                last_hidden: output of the last layer:
-                    shape :: [bsz, len, self.config.d_model]
+            A tuple of ``(last_hidden_state, new_mems)``.
+
+            ``last_hidden_state``: the encoded-hidden-states at the top of the model
+                as a ``torch.FloatTensor`` of size [batch_size, sequence_length, self.config.d_model]
+
+            ``new_mems``: list (num layers) of updated mem states at the entry of each layer
+                each mem state is a ``torch.FloatTensor`` of size [self.config.mem_len, batch_size, self.config.d_model]
+                Note that the first two dimensions are transposed in ``mems`` with regards to ``input_ids`` and
+                ``labels``
+
+        Example::
+
+            # Already been converted into BPE token ids
+            input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+            input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])
+            last_hidden_state, new_mems = model(input_ids)
+            # or
+            last_hidden_state, new_mems = model.forward(input_ids)
+            # Another time on input_ids_next using the memory:
+            last_hidden_state, new_mems = model(input_ids_next, new_mems)
         """
         # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library
         # so we transpose here from shape [bsz, len] to shape [len, bsz]
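The ``mems`` return value described in this docstring is what carries Transformer-XL's long-range context: the memory produced for one segment is fed back in on the next call. A hedged sketch of that loop (the segment tensors reuse the illustrative ids from the example above):

.. code-block:: python

    import torch
    from pytorch_transformers import TransfoXLConfig, TransfoXLModel

    model = TransfoXLModel(TransfoXLConfig())
    model.eval()

    # Two consecutive segments of already-tokenized ids (illustrative values)
    segments = [torch.LongTensor([[31, 51, 99], [15, 5, 0]]),
                torch.LongTensor([[53, 21, 1], [64, 23, 100]])]

    mems = None
    with torch.no_grad():
        for segment in segments:
            # Passing mems=None on the first call lets the model initialize its own memory
            last_hidden_state, mems = model(segment, mems=mems)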
@@ -1227,52 +1227,24 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
 class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
     """Transformer XL model ("Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context").
-    This model add an (adaptive) softmax head on top of the TransfoXLModel
-    Transformer XL use a relative positioning (with sinusiodal patterns) and adaptive softmax inputs which means that:
+    This model adds an (adaptive) softmax head on top of the ``TransfoXLModel``
+
+    Transformer XL uses a relative positioning (with sinusoidal patterns) and adaptive softmax inputs which means that:
+
     - you don't need to specify positioning embeddings indices
-    - the tokens in the vocabulary have to be sorted to decreasing frequency.
-    Call self.tie_weights() if you update/load the weights of the transformer to keep the weights tied.
-    Params:
-        config: a TransfoXLConfig class instance with the configuration to build a new model
-    Inputs:
-        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
-            with the token indices selected in the range [0, self.config.n_token[
-        `labels`: an optional torch.LongTensor of shape [batch_size, sequence_length]
-            with the labels token indices selected in the range [0, self.config.n_token[
-        `mems`: an optional memory of hidden states from previous forward passes
-            as a list (num layers) of hidden states at the entry of each layer
-            each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]
-            Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
-    Outputs:
-        A tuple of (last_hidden_state, new_mems)
-        `softmax_output`: output of the (adaptive) softmax:
-            if labels is None:
-                Negative log likelihood of shape [batch_size, sequence_length]
-            else:
-                log probabilities of tokens, shape [batch_size, sequence_length, n_tokens]
-        `new_mems`: list (num layers) of updated mem states at the entry of each layer
-            each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]
-            Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
-    Example usage:
-    ```python
-    # Already been converted into BPE token ids
-    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
-    input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])
+    - the tokens in the vocabulary have to be sorted in decreasing frequency.
+
+    Call ``self.tie_weights()`` if you update/load the weights of the transformer to keep the weights tied.
+
+    Args:
+        config: a ``TransfoXLConfig`` class instance with the configuration to build a new model
+
+    Example::
+
         config = TransfoXLConfig()
         model = TransfoXLModel(config)
-    last_hidden_state, new_mems = model(input_ids)
-    # Another time on input_ids_next using the memory:
-    last_hidden_state, new_mems = model(input_ids_next, mems=new_mems)
-    ```
     """
     def __init__(self, config):
         super(TransfoXLLMHeadModel, self).__init__(config)
@@ -1290,7 +1262,9 @@ class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
         self.tie_weights()
     def tie_weights(self):
-        """ Run this to be sure output and input (adaptive) softmax weights are tied """
+        """
+        Run this to be sure output and input (adaptive) softmax weights are tied
+        """
         # sampled softmax
         if self.sample_softmax > 0:
             if self.config.tie_weight:
@@ -1314,18 +1288,43 @@ class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
         return self.transformer.init_mems(data)
     def forward(self, input_ids, labels=None, mems=None, head_mask=None):
-        """ Params:
-                input_ids :: [bsz, len]
-                labels :: [bsz, len]
+        """
+        Performs a model forward pass. **Can be called by calling the class directly, once it has been instantiated.**
+
+        Args:
+            `input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
+                with the token indices selected in the range [0, self.config.n_token[
+            `labels`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length]
+                with the labels token indices selected in the range [0, self.config.n_token[
+            `mems`: an optional memory of hidden states from previous forward passes
+                as a list (num layers) of hidden states at the entry of each layer
+                each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]
+                Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
         Returns:
-            tuple(softmax_output, new_mems) where:
-                new_mems: list (num layers) of hidden states at the entry of each layer
-                    shape :: [mem_len, bsz, self.config.d_model] :: Warning: shapes are transposed here w. regards to input_ids
-                softmax_output: output of the (adaptive) softmax:
-                    if labels is None:
-                        Negative log likelihood of shape :: [bsz, len]
-                    else:
-                        log probabilities of tokens, shape :: [bsz, len, n_tokens]
+            A tuple of (last_hidden_state, new_mems)
+
+            ``last_hidden_state``: output of the (adaptive) softmax. If ``labels`` is ``None``, it is the negative
+                log likelihood of shape [batch_size, sequence_length]. Otherwise, it is the log probabilities of
+                the tokens, of shape [batch_size, sequence_length, n_tokens].
+
+            ``new_mems``: list (num layers) of updated mem states at the entry of each layer
+                each mem state is a ``torch.FloatTensor`` of size [self.config.mem_len, batch_size, self.config.d_model]
+                Note that the first two dimensions are transposed in ``mems`` with regards to ``input_ids`` and
+                ``labels``
+
+        Example::
+
+            # Already been converted into BPE token ids
+            input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+            input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])
+            last_hidden_state, new_mems = model(input_ids)
+            # or
+            last_hidden_state, new_mems = model.forward(input_ids)
+            # Another time on input_ids_next using the memory:
+            last_hidden_state, new_mems = model(input_ids_next, mems=new_mems)
         """
         bsz = input_ids.size(0)
         tgt_len = input_ids.size(1)
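For the LM head variant documented above, the same call pattern applies with an optional ``labels`` argument; a hedged sketch (the ids are the illustrative values from the example, and the exact content of ``softmax_output`` follows the Returns description above):

.. code-block:: python

    import torch
    from pytorch_transformers import TransfoXLConfig, TransfoXLLMHeadModel

    model = TransfoXLLMHeadModel(TransfoXLConfig())
    model.eval()

    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])         # illustrative BPE ids
    input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])

    with torch.no_grad():
        # Without labels: (softmax_output, new_mems), as documented in Returns above
        softmax_output, new_mems = model(input_ids)

        # With labels (here simply the inputs themselves), reusing the memory
        softmax_output, new_mems = model(input_ids_next, labels=input_ids_next, mems=new_mems)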
@@ -958,10 +958,10 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
         `encoded_layers`: controled by `output_all_encoded_layers` argument:
             - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
                 of each attention block (i.e. 12 full sequences for XLNet-base, 24 for XLNet-large), each
-                encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, d_model],
+                encoded-hidden-state is a ``torch.FloatTensor`` of size [batch_size, sequence_length, d_model],
             - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
                 to the last attention block of shape [batch_size, sequence_length, d_model],
-        `pooled_output`: a torch.FloatTensor of size [batch_size, d_model] which is the output of a
+        `pooled_output`: a ``torch.FloatTensor`` of size [batch_size, d_model] which is the output of a
             classifier pretrained on top of the hidden state associated to the first character of the
             input (`CLS`) to train on the Next-Sentence task (see XLNet's paper).
@@ -1087,7 +1087,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
             1 for tokens with losses and 0 for tokens without losses.
             Only used during pretraining for two-stream attention.
             Set to None during finetuning.
-        `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
+        `head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
             It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
@@ -1098,7 +1098,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
             else:
                 CrossEntropy loss with the targets
         `new_mems`: list (num layers) of updated mem states at the entry of each layer
-            each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]
+            each mem state is a ``torch.FloatTensor`` of size [self.config.mem_len, batch_size, self.config.d_model]
             Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
     Example usage:
@@ -1189,27 +1189,27 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
             This can be used to compute head importance metrics. Default: False
     Inputs:
-        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
+        `input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
             with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
             `run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
-        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
+        `token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see XLNet paper for more details).
        `attention_mask`: [optional] float32 Tensor, SAME FUNCTION as `input_mask`
           but with 1 for real tokens and 0 for padding.
           Added for easy compatibility with the BERT model (which uses this negative masking).
          You can only uses one among `input_mask` and `attention_mask`
-        `input_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
+        `input_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
           selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
           input sequence length in the current batch. It's the mask that we typically use for attention when
           a batch has varying length sentences.
-        `start_positions`: position of the first token for the labeled span: torch.LongTensor of shape [batch_size].
+        `start_positions`: position of the first token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
           Positions are clamped to the length of the sequence and position outside of the sequence are not taken
          into account for computing the loss.
-        `end_positions`: position of the last token for the labeled span: torch.LongTensor of shape [batch_size].
+        `end_positions`: position of the last token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
          Positions are clamped to the length of the sequence and position outside of the sequence are not taken
          into account for computing the loss.
-        `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
+        `head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
          It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
     Outputs:
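The ``head_mask`` argument that several of these hunks touch is an ordinary float tensor of the documented shape; a hedged sketch of constructing one (the layer and head counts are illustrative, and the 0/1 masking convention is the one stated in the docstring above):

.. code-block:: python

    import torch

    num_layers, num_heads = 12, 12   # illustrative sizes
    # A per-layer, per-head mask of the documented shape, with entries in [0, 1]
    head_mask = torch.ones(num_layers, num_heads)
    # Individual entries can be flipped per the masking convention described above
    head_mask[0, 0] = 0.0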