Commit 8fe2c9d9 authored by LysandreJik

Refactored Docstrings of BERT, GPT2, GPT, TransfoXL, XLM and XLNet.

parent ed6c8d37
@@ -20,7 +20,7 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
     export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
-    pytorch_pretrained_bert bert \
+    pytorch_transformers bert \
       $BERT_BASE_DIR/bert_model.ckpt \
       $BERT_BASE_DIR/bert_config.json \
       $BERT_BASE_DIR/pytorch_model.bin
@@ -36,7 +36,7 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
     export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights
-    pytorch_pretrained_bert gpt \
+    pytorch_transformers gpt \
       $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
       $PYTORCH_DUMP_OUTPUT \
       [OPENAI_GPT_CONFIG]
@@ -50,7 +50,7 @@ Here is an example of the conversion process for a pre-trained Transformer-XL mo
     export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint
-    pytorch_pretrained_bert transfo_xl \
+    pytorch_transformers transfo_xl \
       $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
       $PYTORCH_DUMP_OUTPUT \
       [TRANSFO_XL_CONFIG]
@@ -64,7 +64,7 @@ Here is an example of the conversion process for a pre-trained OpenAI's GPT-2 mo
     export GPT2_DIR=/path/to/gpt2/checkpoint
-    pytorch_pretrained_bert gpt2 \
+    pytorch_transformers gpt2 \
       $GPT2_DIR/model.ckpt \
       $PYTORCH_DUMP_OUTPUT \
       [GPT2_CONFIG]
@@ -79,7 +79,7 @@ Here is an example of the conversion process for a pre-trained XLNet model, fine
     export TRANSFO_XL_CHECKPOINT_PATH=/path/to/xlnet/checkpoint
     export TRANSFO_XL_CONFIG_PATH=/path/to/xlnet/config
-    pytorch_pretrained_bert xlnet \
+    pytorch_transformers xlnet \
       $TRANSFO_XL_CHECKPOINT_PATH \
       $TRANSFO_XL_CONFIG_PATH \
       $PYTORCH_DUMP_OUTPUT \
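The same rename applies to Python imports: user code that previously imported from ``pytorch_pretrained_bert`` now imports from ``pytorch_transformers``. A minimal sketch of the updated usage (the ``bert-base-uncased`` shortcut name is illustrative; any pretrained checkpoint works the same way):

.. code-block:: python

    # Old package name (pre-rename):
    # from pytorch_pretrained_bert import BertModel, BertTokenizer

    # New package name:
    from pytorch_transformers import BertModel, BertTokenizer

    # Download (and cache) a pretrained checkpoint by its shortcut name
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')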
@@ -4,75 +4,75 @@ BERT
 ``BertConfig``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertConfig
+.. autoclass:: pytorch_transformers.BertConfig
     :members:
 ``BertTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertTokenizer
+.. autoclass:: pytorch_transformers.BertTokenizer
     :members:
 ``BertAdam``
 ~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertAdam
+.. autoclass:: pytorch_transformers.BertAdam
     :members:
 1. ``BertModel``
 ~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertModel
+.. autoclass:: pytorch_transformers.BertModel
     :members:
 2. ``BertForPreTraining``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForPreTraining
+.. autoclass:: pytorch_transformers.BertForPreTraining
     :members:
 3. ``BertForMaskedLM``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForMaskedLM
+.. autoclass:: pytorch_transformers.BertForMaskedLM
     :members:
 4. ``BertForNextSentencePrediction``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForNextSentencePrediction
+.. autoclass:: pytorch_transformers.BertForNextSentencePrediction
     :members:
 5. ``BertForSequenceClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForSequenceClassification
+.. autoclass:: pytorch_transformers.BertForSequenceClassification
     :members:
 6. ``BertForMultipleChoice``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForMultipleChoice
+.. autoclass:: pytorch_transformers.BertForMultipleChoice
     :members:
 7. ``BertForTokenClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForTokenClassification
+.. autoclass:: pytorch_transformers.BertForTokenClassification
     :members:
 8. ``BertForQuestionAnswering``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.BertForQuestionAnswering
+.. autoclass:: pytorch_transformers.BertForQuestionAnswering
     :members:
@@ -4,40 +4,40 @@ OpenAI GPT
 ``OpenAIGPTConfig``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTConfig
+.. autoclass:: pytorch_transformers.OpenAIGPTConfig
     :members:
 ``OpenAIGPTTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTTokenizer
+.. autoclass:: pytorch_transformers.OpenAIGPTTokenizer
     :members:
 ``OpenAIAdam``
 ~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIAdam
+.. autoclass:: pytorch_transformers.OpenAIAdam
     :members:
 9. ``OpenAIGPTModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTModel
+.. autoclass:: pytorch_transformers.OpenAIGPTModel
     :members:
 10. ``OpenAIGPTLMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTLMHeadModel
+.. autoclass:: pytorch_transformers.OpenAIGPTLMHeadModel
     :members:
 11. ``OpenAIGPTDoubleHeadsModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.OpenAIGPTDoubleHeadsModel
+.. autoclass:: pytorch_transformers.OpenAIGPTDoubleHeadsModel
     :members:
@@ -4,33 +4,33 @@ OpenAI GPT2
 ``GPT2Config``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2Config
+.. autoclass:: pytorch_transformers.GPT2Config
     :members:
 ``GPT2Tokenizer``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2Tokenizer
+.. autoclass:: pytorch_transformers.GPT2Tokenizer
     :members:
 14. ``GPT2Model``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2Model
+.. autoclass:: pytorch_transformers.GPT2Model
     :members:
 15. ``GPT2LMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2LMHeadModel
+.. autoclass:: pytorch_transformers.GPT2LMHeadModel
     :members:
 16. ``GPT2DoubleHeadsModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.GPT2DoubleHeadsModel
+.. autoclass:: pytorch_transformers.GPT2DoubleHeadsModel
     :members:
@@ -5,26 +5,26 @@ Transformer XL
 ``TransfoXLConfig``
 ~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.TransfoXLConfig
+.. autoclass:: pytorch_transformers.TransfoXLConfig
     :members:
 ``TransfoXLTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.TransfoXLTokenizer
+.. autoclass:: pytorch_transformers.TransfoXLTokenizer
     :members:
 12. ``TransfoXLModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.TransfoXLModel
+.. autoclass:: pytorch_transformers.TransfoXLModel
     :members:
 13. ``TransfoXLLMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: pytorch_pretrained_bert.TransfoXLLMHeadModel
+.. autoclass:: pytorch_transformers.TransfoXLLMHeadModel
     :members:
 XLM
 ----------------------------------------------------
-I don't really know what to put here, I'll leave it up to you to decide @Thom
\ No newline at end of file
+``XLMConfig``
+~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.TransfoXLConfig
+    :members:
+17. ``XLMModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.XLMModel
+    :members:
+18. ``XLMWithLMHeadModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.XLMWithLMHeadModel
+    :members:
+19. ``XLMForSequenceClassification``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.XLMForSequenceClassification
+    :members:
+20. ``XLMForQuestionAnswering``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: pytorch_transformers.XLMForQuestionAnswering
+    :members:
@@ -11,7 +11,7 @@ First let's prepare a tokenized input with ``BertTokenizer``
 .. code-block:: python
     import torch
-    from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
+    from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM
     # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
     import logging
@@ -89,7 +89,7 @@ First let's prepare a tokenized input with ``OpenAIGPTTokenizer``
 .. code-block:: python
     import torch
-    from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
+    from pytorch_transformers import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
     # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
     import logging
@@ -177,7 +177,7 @@ First let's prepare a tokenized input with ``TransfoXLTokenizer``
 .. code-block:: python
     import torch
-    from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
+    from pytorch_transformers import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
     # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
     import logging
@@ -253,7 +253,7 @@ First let's prepare a tokenized input with ``GPT2Tokenizer``
 .. code-block:: python
     import torch
-    from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
+    from pytorch_transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
     # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
     import logging
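All four quickstart snippets touched above follow the same pattern once the import is updated; a condensed, hedged sketch for the BERT case (the input sentence and shortcut name are illustrative, not taken from the diff):

.. code-block:: python

    import torch
    from pytorch_transformers import BertTokenizer, BertModel

    # OPTIONAL: activate the logger to see what is happening under the hood
    import logging
    logging.basicConfig(level=logging.INFO)

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    text = "[CLS] Jim Henson was a puppeteer [SEP]"
    tokenized_text = tokenizer.tokenize(text)

    # Convert tokens to vocabulary indices and build a batch of size 1
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])

    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()
    with torch.no_grad():
        outputs = model(tokens_tensor)  # tuple; the first element is the sequence of hidden states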
@@ -177,6 +177,38 @@ def load_tf_weights_in_transfo_xl(model, config, tf_path):
 class TransfoXLConfig(PretrainedConfig):
     """Configuration class to store the configuration of a `TransfoXLModel`.
+
+    Args:
+        vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `TransfoXLModel` or a configuration json file.
+        cutoffs: cutoffs for the adaptive softmax
+        d_model: Dimensionality of the model's hidden states.
+        d_embed: Dimensionality of the embeddings
+        d_head: Dimensionality of the model's heads.
+        div_val: dividend value for adaptive input and softmax
+        pre_lnorm: apply LayerNorm to the input instead of the output
+        d_inner: Inner dimension in FF
+        n_layer: Number of hidden layers in the Transformer encoder.
+        n_head: Number of attention heads for each attention layer in
+            the Transformer encoder.
+        tgt_len: number of tokens to predict
+        ext_len: length of the extended context
+        mem_len: length of the retained previous heads
+        same_length: use the same attn length for all tokens
+        proj_share_all_but_first: True to share all but first projs, False not to share.
+        attn_type: attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.
+        clamp_len: use the same pos embeddings after clamp_len
+        sample_softmax: number of samples in sampled softmax
+        adaptive: use adaptive softmax
+        tie_weight: tie the word embedding and softmax weights
+        dropout: The dropout probability for all fully connected
+            layers in the embeddings, encoder, and pooler.
+        dropatt: The dropout ratio for the attention probabilities.
+        untie_r: untie relative position biases
+        embd_pdrop: The dropout ratio for the embeddings.
+        init: parameter initializer to use
+        init_range: parameters initialized by U(-init_range, init_range).
+        proj_init_std: parameters initialized by N(0, init_std)
+        init_std: parameters initialized by N(0, init_std)
     """
     pretrained_config_archive_map = TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP
@@ -210,38 +242,6 @@ class TransfoXLConfig(PretrainedConfig):
                  init_std=0.02,
                  **kwargs):
         """Constructs TransfoXLConfig.
-        Args:
-            vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `TransfoXLModel` or a configuration json file.
-            cutoffs: cutoffs for the adaptive softmax
-            d_model: Dimensionality of the model's hidden states.
-            d_embed: Dimensionality of the embeddings
-            d_head: Dimensionality of the model's heads.
-            div_val: divident value for adapative input and softmax
-            pre_lnorm: apply LayerNorm to the input instead of the output
-            d_inner: Inner dimension in FF
-            n_layer: Number of hidden layers in the Transformer encoder.
-            n_head: Number of attention heads for each attention layer in
-                the Transformer encoder.
-            tgt_len: number of tokens to predict
-            ext_len: length of the extended context
-            mem_len: length of the retained previous heads
-            same_length: use the same attn length for all tokens
-            proj_share_all_but_first: True to share all but first projs, False not to share.
-            attn_type: attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.
-            clamp_len: use the same pos embeddings after clamp_len
-            sample_softmax: number of samples in sampled softmax
-            adaptive: use adaptive softmax
-            tie_weight: tie the word embedding and softmax weights
-            dropout: The dropout probabilitiy for all fully connected
-                layers in the embeddings, encoder, and pooler.
-            dropatt: The dropout ratio for the attention probabilities.
-            untie_r: untie relative position biases
-            embd_pdrop: The dropout ratio for the embeddings.
-            init: parameter initializer to use
-            init_range: parameters initialized by U(-init_range, init_range).
-            proj_init_std: parameters initialized by N(0, init_std)
-            init_std: parameters initialized by N(0, init_std)
         """
         super(TransfoXLConfig, self).__init__(**kwargs)
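Since the argument documentation now lives on the class itself, here is a minimal, hedged sketch of how such a configuration is typically constructed (the overridden values are purely illustrative):

.. code-block:: python

    from pytorch_transformers import TransfoXLConfig, TransfoXLModel

    # Default configuration values
    config = TransfoXLConfig()

    # Individual documented arguments can be overridden at construction time
    custom_config = TransfoXLConfig(mem_len=800, clamp_len=1000, dropout=0.1)

    # A model built from a configuration starts with randomly initialized weights
    model = TransfoXLModel(config)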
@@ -901,42 +901,20 @@ class TransfoXLPreTrainedModel(PreTrainedModel):
 class TransfoXLModel(TransfoXLPreTrainedModel):
     """Transformer XL model ("Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context").
-    Transformer XL use a relative positioning (with sinusiodal patterns) and adaptive softmax inputs which means that:
-    - you don't need to specify positioning embeddings indices
-    - the tokens in the vocabulary have to be sorted to decreasing frequency.
-    Params:
+    Transformer XL uses relative positioning (with sinusoidal patterns) and adaptive softmax inputs which means that:
+
+        - you don't need to specify positioning embeddings indices.
+        - the tokens in the vocabulary have to be sorted in decreasing frequency.
+
+    Args:
         config: a TransfoXLConfig class instance with the configuration to build a new model
-    Inputs:
-        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
-            with the token indices selected in the range [0, self.config.n_token[
-        `mems`: optional memomry of hidden states from previous forward passes
-            as a list (num layers) of hidden states at the entry of each layer
-            each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]
-            Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
-    Outputs:
-        A tuple of (last_hidden_state, new_mems)
-        `last_hidden_state`: the encoded-hidden-states at the top of the model
-            as a torch.FloatTensor of size [batch_size, sequence_length, self.config.d_model]
-        `new_mems`: list (num layers) of updated mem states at the entry of each layer
-            each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]
-            Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
-    Example usage:
-    ```python
-    # Already been converted into BPE token ids
-    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
-    input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])
+    Example::
+
         config = TransfoXLConfig()
         model = TransfoXLModel(config)
-    last_hidden_state, new_mems = model(input_ids)
-    # Another time on input_ids_next using the memory:
-    last_hidden_state, new_mems = model(input_ids_next, new_mems)
-    ```
     """
     def __init__(self, config):
         super(TransfoXLModel, self).__init__(config)
@@ -1200,18 +1178,40 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
         return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)
     def forward(self, input_ids, mems=None, head_mask=None):
-        """ Params:
-                input_ids :: [bsz, len]
-                mems :: optional mems from previous forwar passes (or init_mems)
-                    list (num layers) of mem states at the entry of each layer
-                        shape :: [self.config.mem_len, bsz, self.config.d_model]
+        """
+        Performs a model forward pass. **Can be called by calling the class directly, once it has been instantiated.**
+
+        Args:
+            `input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
+                with the token indices selected in the range [0, self.config.n_token[
+            `mems`: optional memory of hidden states from previous forward passes
+                as a list (num layers) of hidden states at the entry of each layer
+                each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]
             Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
         Returns:
-            tuple (last_hidden, new_mems) where:
-                new_mems: list (num layers) of mem states at the entry of each layer
-                    shape :: [self.config.mem_len, bsz, self.config.d_model]
-                last_hidden: output of the last layer:
-                    shape :: [bsz, len, self.config.d_model]
+            A tuple of ``(last_hidden_state, new_mems)``.
+
+            ``last_hidden_state``: the encoded-hidden-states at the top of the model
+                as a ``torch.FloatTensor`` of size [batch_size, sequence_length, self.config.d_model]
+
+            ``new_mems``: list (num layers) of updated mem states at the entry of each layer
+                each mem state is a ``torch.FloatTensor`` of size [self.config.mem_len, batch_size, self.config.d_model]
+                Note that the first two dimensions are transposed in ``mems`` with regards to ``input_ids`` and
+                ``labels``
+
+        Example::
+
+            # Already been converted into BPE token ids
+            input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+            input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])
+            last_hidden_state, new_mems = model(input_ids)
+            # or
+            last_hidden_state, new_mems = model.forward(input_ids)
+            # Another time on input_ids_next using the memory:
+            last_hidden_state, new_mems = model(input_ids_next, new_mems)
         """
         # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library
         # so we transpose here from shape [bsz, len] to shape [len, bsz]
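The ``mems`` return value described in this docstring is what carries Transformer-XL's long-range context: the memory produced for one segment is fed back in on the next call. A hedged sketch of that loop (the segment tensors reuse the illustrative ids from the example above):

.. code-block:: python

    import torch
    from pytorch_transformers import TransfoXLConfig, TransfoXLModel

    model = TransfoXLModel(TransfoXLConfig())
    model.eval()

    # Two consecutive segments of already-tokenized ids (illustrative values)
    segments = [torch.LongTensor([[31, 51, 99], [15, 5, 0]]),
                torch.LongTensor([[53, 21, 1], [64, 23, 100]])]

    mems = None
    with torch.no_grad():
        for segment in segments:
            # Passing mems=None on the first call lets the model initialize its own memory
            last_hidden_state, mems = model(segment, mems=mems)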
@@ -1227,52 +1227,24 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
 class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
     """Transformer XL model ("Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context").
-    This model add an (adaptive) softmax head on top of the TransfoXLModel
-    Transformer XL use a relative positioning (with sinusiodal patterns) and adaptive softmax inputs which means that:
+    This model adds an (adaptive) softmax head on top of the ``TransfoXLModel``
+
+    Transformer XL uses a relative positioning (with sinusoidal patterns) and adaptive softmax inputs which means that:
+
     - you don't need to specify positioning embeddings indices
-    - the tokens in the vocabulary have to be sorted to decreasing frequency.
-    Call self.tie_weights() if you update/load the weights of the transformer to keep the weights tied.
-    Params:
-        config: a TransfoXLConfig class instance with the configuration to build a new model
-    Inputs:
-        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
-            with the token indices selected in the range [0, self.config.n_token[
-        `labels`: an optional torch.LongTensor of shape [batch_size, sequence_length]
-            with the labels token indices selected in the range [0, self.config.n_token[
-        `mems`: an optional memory of hidden states from previous forward passes
-            as a list (num layers) of hidden states at the entry of each layer
-            each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]
-            Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
-    Outputs:
-        A tuple of (last_hidden_state, new_mems)
-        `softmax_output`: output of the (adaptive) softmax:
-            if labels is None:
-                Negative log likelihood of shape [batch_size, sequence_length]
-            else:
-                log probabilities of tokens, shape [batch_size, sequence_length, n_tokens]
-        `new_mems`: list (num layers) of updated mem states at the entry of each layer
-            each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]
-            Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
-    Example usage:
-    ```python
-    # Already been converted into BPE token ids
-    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
-    input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])
+    - the tokens in the vocabulary have to be sorted in decreasing frequency.
+
+    Call ``self.tie_weights()`` if you update/load the weights of the transformer to keep the weights tied.
+
+    Args:
+        config: a ``TransfoXLConfig`` class instance with the configuration to build a new model
+
+    Example::
+
         config = TransfoXLConfig()
         model = TransfoXLModel(config)
-    last_hidden_state, new_mems = model(input_ids)
-    # Another time on input_ids_next using the memory:
-    last_hidden_state, new_mems = model(input_ids_next, mems=new_mems)
-    ```
     """
     def __init__(self, config):
         super(TransfoXLLMHeadModel, self).__init__(config)
@@ -1290,7 +1262,9 @@ class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
         self.tie_weights()
     def tie_weights(self):
-        """ Run this to be sure output and input (adaptive) softmax weights are tied """
+        """
+        Run this to be sure output and input (adaptive) softmax weights are tied
+        """
         # sampled softmax
         if self.sample_softmax > 0:
             if self.config.tie_weight:
@@ -1314,18 +1288,43 @@ class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
         return self.transformer.init_mems(data)
     def forward(self, input_ids, labels=None, mems=None, head_mask=None):
-        """ Params:
-                input_ids :: [bsz, len]
-                labels :: [bsz, len]
+        """
+        Performs a model forward pass. **Can be called by calling the class directly, once it has been instantiated.**
+
+        Args:
+            `input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
+                with the token indices selected in the range [0, self.config.n_token[
+            `labels`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length]
+                with the labels token indices selected in the range [0, self.config.n_token[
+            `mems`: an optional memory of hidden states from previous forward passes
+                as a list (num layers) of hidden states at the entry of each layer
+                each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]
+                Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
         Returns:
-            tuple(softmax_output, new_mems) where:
-                new_mems: list (num layers) of hidden states at the entry of each layer
-                    shape :: [mem_len, bsz, self.config.d_model] :: Warning: shapes are transposed here w. regards to input_ids
-                softmax_output: output of the (adaptive) softmax:
-                    if labels is None:
-                        Negative log likelihood of shape :: [bsz, len]
-                    else:
-                        log probabilities of tokens, shape :: [bsz, len, n_tokens]
+            A tuple of (last_hidden_state, new_mems)
+
+            ``last_hidden_state``: output of the (adaptive) softmax. If ``labels`` is ``None``, it is the negative
+                log likelihood of shape [batch_size, sequence_length]. Otherwise, it is the log probabilities of
+                the tokens, of shape [batch_size, sequence_length, n_tokens].
+
+            ``new_mems``: list (num layers) of updated mem states at the entry of each layer
+                each mem state is a ``torch.FloatTensor`` of size [self.config.mem_len, batch_size, self.config.d_model]
+                Note that the first two dimensions are transposed in ``mems`` with regards to ``input_ids`` and
+                ``labels``
+
+        Example::
+
+            # Already been converted into BPE token ids
+            input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+            input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])
+            last_hidden_state, new_mems = model(input_ids)
+            # or
+            last_hidden_state, new_mems = model.forward(input_ids)
+            # Another time on input_ids_next using the memory:
+            last_hidden_state, new_mems = model(input_ids_next, mems=new_mems)
         """
         bsz = input_ids.size(0)
         tgt_len = input_ids.size(1)
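For the LM head variant documented above, the same call pattern applies with an optional ``labels`` argument; a hedged sketch (the ids are the illustrative values from the example, and the exact content of ``softmax_output`` follows the Returns description above):

.. code-block:: python

    import torch
    from pytorch_transformers import TransfoXLConfig, TransfoXLLMHeadModel

    model = TransfoXLLMHeadModel(TransfoXLConfig())
    model.eval()

    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])         # illustrative BPE ids
    input_ids_next = torch.LongTensor([[53, 21, 1], [64, 23, 100]])

    with torch.no_grad():
        # Without labels: (softmax_output, new_mems), as documented in Returns above
        softmax_output, new_mems = model(input_ids)

        # With labels (here simply the inputs themselves), reusing the memory
        softmax_output, new_mems = model(input_ids_next, labels=input_ids_next, mems=new_mems)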
@@ -958,10 +958,10 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
         `encoded_layers`: controled by `output_all_encoded_layers` argument:
             - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
                 of each attention block (i.e. 12 full sequences for XLNet-base, 24 for XLNet-large), each
-                encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, d_model],
+                encoded-hidden-state is a ``torch.FloatTensor`` of size [batch_size, sequence_length, d_model],
             - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
                 to the last attention block of shape [batch_size, sequence_length, d_model],
-        `pooled_output`: a torch.FloatTensor of size [batch_size, d_model] which is the output of a
+        `pooled_output`: a ``torch.FloatTensor`` of size [batch_size, d_model] which is the output of a
             classifier pretrained on top of the hidden state associated to the first character of the
             input (`CLS`) to train on the Next-Sentence task (see XLNet's paper).
@@ -1087,7 +1087,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
             1 for tokens with losses and 0 for tokens without losses.
             Only used during pretraining for two-stream attention.
             Set to None during finetuning.
-        `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
+        `head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
             It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
@@ -1098,7 +1098,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
             else:
                 CrossEntropy loss with the targets
         `new_mems`: list (num layers) of updated mem states at the entry of each layer
-            each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]
+            each mem state is a ``torch.FloatTensor`` of size [self.config.mem_len, batch_size, self.config.d_model]
             Note that the first two dimensions are transposed in `mems` with regards to `input_ids` and `labels`
     Example usage:
@@ -1189,27 +1189,27 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
             This can be used to compute head importance metrics. Default: False
     Inputs:
-        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
+        `input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
             with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
             `run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
-        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
+        `token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see XLNet paper for more details).
        `attention_mask`: [optional] float32 Tensor, SAME FUNCTION as `input_mask`
           but with 1 for real tokens and 0 for padding.
           Added for easy compatibility with the BERT model (which uses this negative masking).
          You can only uses one among `input_mask` and `attention_mask`
-        `input_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
+        `input_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
           selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
           input sequence length in the current batch. It's the mask that we typically use for attention when
           a batch has varying length sentences.
-        `start_positions`: position of the first token for the labeled span: torch.LongTensor of shape [batch_size].
+        `start_positions`: position of the first token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
           Positions are clamped to the length of the sequence and position outside of the sequence are not taken
          into account for computing the loss.
-        `end_positions`: position of the last token for the labeled span: torch.LongTensor of shape [batch_size].
+        `end_positions`: position of the last token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
          Positions are clamped to the length of the sequence and position outside of the sequence are not taken
          into account for computing the loss.
-        `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
+        `head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
          It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
     Outputs:
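The ``head_mask`` argument that several of these hunks touch is an ordinary float tensor of the documented shape; a hedged sketch of constructing one (the layer and head counts are illustrative, and the 0/1 masking convention is the one stated in the docstring above):

.. code-block:: python

    import torch

    num_layers, num_heads = 12, 12   # illustrative sizes
    # A per-layer, per-head mask of the documented shape, with entries in [0, 1]
    head_mask = torch.ones(num_layers, num_heads)
    # Individual entries can be flipped per the masking convention described above
    head_mask[0, 0] = 0.0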