Commit 73306d02 authored by Lysandre, committed by Lysandre Debut

FlauBERT documentation

parent ce2f4227
...@@ -98,3 +98,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train ...@@ -98,3 +98,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
model_doc/camembert model_doc/camembert
model_doc/albert model_doc/albert
model_doc/xlmroberta model_doc/xlmroberta
model_doc/flaubert
\ No newline at end of file
FlauBERT
----------------------------------------------------
The FlauBERT model was proposed in the paper
`FlauBERT: Unsupervised Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le et al.
It's a transformer pre-trained using a masked language modeling (MLM) objective (BERT-like).
The abstract from the paper is the following:
*Language models have become a key step to achieve state-of-the art results in many different Natural Language
Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient
way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
contextualization at the sentence level. This has been widely demonstrated for English using contextualized
representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et
al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large
and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre
for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most
of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified
evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared
to the research community for further reproducible experiments in French NLP.*
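To make the masked language modeling (MLM) objective mentioned above concrete, here is a minimal, framework-free sketch of how masked inputs and their labels can be built. The helper name `mask_tokens` and the token ids are hypothetical. The mask id `5` mirrors the `mask_index` default documented for the configuration, and `-100` is the label value the loss ignores. This is an illustration, not FlauBERT's pre-training code: BERT-style pre-training picks the masked positions at random and does not always substitute the mask token.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # label value skipped by the loss (assumed convention, see lead-in)


def mask_tokens(
    token_ids: Sequence[int], mask_positions: Sequence[int], mask_id: int
) -> Tuple[List[int], List[int]]:
    """Replace the tokens at `mask_positions` with `mask_id`.

    Returns (inputs, labels): labels hold the original token at masked
    positions and IGNORE_INDEX everywhere else, so only masked positions
    contribute to the loss.
    """
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for pos in mask_positions:
        labels[pos] = inputs[pos]  # remember what the model must predict
        inputs[pos] = mask_id      # hide the token from the model
    return inputs, labels


# Hypothetical ids for a five-token sentence; position 2 is masked.
inputs, labels = mask_tokens([0, 17, 42, 99, 1], mask_positions=[2], mask_id=5)
print(inputs)  # [0, 17, 5, 99, 1]
print(labels)  # [-100, -100, 42, -100, -100]
```

Positions are chosen explicitly here so the example is deterministic; a real data collator would sample them.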
FlaubertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertConfig
:members:
FlaubertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertTokenizer
:members:
FlaubertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertModel
:members:
FlaubertWithLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertWithLMHeadModel
:members:
FlaubertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertForSequenceClassification
:members:
FlaubertForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertForQuestionAnsweringSimple
:members:
FlaubertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertForQuestionAnswering
:members:
...@@ -31,44 +31,111 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -31,44 +31,111 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FlaubertConfig(XLMConfig): class FlaubertConfig(XLMConfig):
"""Configuration class to store the configuration of a `FlaubertModel`. """
Configuration class to store the configuration of a :class:`~transformers.FlaubertModel`.
It is used to instantiate a FlauBERT model according to the specified arguments, defining the model
architecture. The defaults are inherited from :class:`~transformers.XLMConfig`, so instantiating a configuration
with the defaults will yield a configuration similar to that of
the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.
Args: Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
vocab_size: Vocabulary size of `inputs_ids` in `FlaubertModel`. to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
d_model: Size of the encoder layers and the pooler layer. for more information.
n_layer: Number of hidden layers in the Transformer encoder.
n_head: Number of attention heads for each attention layer in
the Transformer encoder.
d_inner: The size of the "intermediate" (i.e., feed-forward)
layer in the Transformer encoder.
ff_activation: The non-linear activation function (function or string) in the
encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
untie_r: untie relative position biases
attn_type: 'bi' for Flaubert, 'uni' for Transformer-XL
dropout: The dropout probability for all fully connected Args:
pre_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to apply the layer normalization before or after the feed forward layer following the
attention in each layer.
vocab_size (:obj:`int`, optional, defaults to 30145):
Vocabulary size of the XLM model. Defines the different tokens that
can be represented by the `input_ids` passed to the forward method of :class:`~transformers.XLMModel`.
emb_dim (:obj:`int`, optional, defaults to 2048):
Dimensionality of the encoder layers and the pooler layer.
n_layer (:obj:`int`, optional, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
dropout (:obj:`float`, optional, defaults to 0.1):
The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler. layers in the embeddings, encoder, and pooler.
max_position_embeddings: The maximum sequence length that this model might attention_dropout (:obj:`float`, optional, defaults to 0.1):
The dropout probability for the attention mechanism
gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):
The non-linear activation function (function or string) in the
encoder and pooler. If set to `True`, "gelu" will be used instead of "relu".
sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):
Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.
causal (:obj:`boolean`, optional, defaults to :obj:`False`):
Set this to `True` for the model to behave in a causal manner.
Causal models use a triangular attention mask in order to only attend to the left-side context instead
of a bidirectional context.
asm (:obj:`boolean`, optional, defaults to :obj:`False`):
Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction
layer.
n_langs (:obj:`int`, optional, defaults to 1):
The number of languages the model handles. Set to 1 for monolingual models.
use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`):
Whether to use language embeddings. Some models use additional language embeddings, see
`the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__
for information on how to use them.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048). (e.g., 512 or 1024 or 2048).
initializer_range: The stddev of the truncated_normal_initializer for embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):
initializing all weight matrices. The standard deviation of the truncated_normal_initializer for
layer_norm_eps: The epsilon used by LayerNorm. initializing the embedding matrices.
init_std (:obj:`float`, optional, defaults to 0.02):
dropout: float, dropout rate. The standard deviation of the truncated_normal_initializer for
init: str, the initialization scheme, either "normal" or "uniform". initializing all weight matrices except the embedding matrices.
init_range: float, initialize the parameters with a uniform distribution layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
in [-init_range, init_range]. Only effective when init="uniform". The epsilon used by the layer normalization layers.
init_std: float, initialize the parameters with a normal distribution bos_index (:obj:`int`, optional, defaults to 0):
with mean 0 and stddev init_std. Only effective when init="normal". The index of the beginning of sentence token in the vocabulary.
mem_len: int, the number of tokens to cache. eos_index (:obj:`int`, optional, defaults to 1):
reuse_len: int, the number of tokens in the current batch to be cached The index of the end of sentence token in the vocabulary.
and reused in the future. pad_index (:obj:`int`, optional, defaults to 2):
bi_data: bool, whether to use bidirectional input pipeline. The index of the padding token in the vocabulary.
Usually set to True during pretraining and False during finetuning. unk_index (:obj:`int`, optional, defaults to 3):
clamp_len: int, clamp all relative distances larger than clamp_len. The index of the unknown token in the vocabulary.
-1 means no clamping. mask_index (:obj:`int`, optional, defaults to 5):
same_length: bool, whether to use the same attention length for each token. The index of the masking token in the vocabulary.
is_encoder (:obj:`boolean`, optional, defaults to :obj:`True`):
Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
summary_type (:obj:`string`, optional, defaults to "first"):
Argument used when doing sequence summary. Used for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
Is one of the following options:
- 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):
Argument used when doing sequence summary. Used for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
'tanh' => add a tanh activation to the output, Other => no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size).
summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
Argument used when doing sequence summary. Used for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
Add a dropout before the projection and activation
start_n_top (:obj:`int`, optional, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet.
end_n_top (:obj:`int`, optional, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet.
mask_token_id (:obj:`int`, optional, defaults to 0):
Model agnostic parameter to identify masked tokens when generating text in an MLM context.
lang_id (:obj:`int`, optional, defaults to 1):
The ID of the language used by the model. This parameter is used when generating
text in a given language.
""" """
pretrained_config_archive_map = FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP pretrained_config_archive_map = FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
......
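The `summary_type` options listed in the configuration docstring above ('last', 'first', 'mean', 'cls_index') can be pictured with a small, framework-free sketch. The function `summarize` and the toy hidden states are hypothetical and only illustrate what each option selects. This is not the library's sequence-summary module, which operates on batched tensors and can also apply the projection, activation and dropout described by the other `summary_*` arguments.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def summarize(
    hidden_states: Sequence[Vector],
    summary_type: str = "first",
    cls_index: Optional[int] = None,
) -> Vector:
    """Reduce one sequence of per-token hidden states to a single vector."""
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one token")
    if summary_type == "last":
        return list(hidden_states[-1])   # last token hidden state
    if summary_type == "first":
        return list(hidden_states[0])    # first token hidden state
    if summary_type == "mean":
        n = len(hidden_states)
        dim = len(hidden_states[0])
        # average every dimension over all tokens
        return [sum(h[d] for h in hidden_states) / n for d in range(dim)]
    if summary_type == "cls_index":
        if cls_index is None:
            raise ValueError("cls_index is required when summary_type='cls_index'")
        return list(hidden_states[cls_index])  # caller-supplied token position
    raise NotImplementedError(f"summary_type={summary_type!r} is not covered by this sketch")


# Three tokens with hidden size 2 (toy values).
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(summarize(states, "first"))                   # [1.0, 2.0]
print(summarize(states, "last"))                    # [5.0, 6.0]
print(summarize(states, "mean"))                    # [3.0, 4.0]
print(summarize(states, "cls_index", cls_index=1))  # [3.0, 4.0]
```

The 'attn' option is left out on purpose, since the docstring marks it as not implemented.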
...@@ -21,7 +21,7 @@ import torch ...@@ -21,7 +21,7 @@ import torch
from torch.nn import functional as F from torch.nn import functional as F
from .configuration_flaubert import FlaubertConfig from .configuration_flaubert import FlaubertConfig
from .file_utils import add_start_docstrings from .file_utils import add_start_docstrings, add_start_docstrings_to_callable
from .modeling_xlm import ( from .modeling_xlm import (
XLMForQuestionAnswering, XLMForQuestionAnswering,
XLMForQuestionAnsweringSimple, XLMForQuestionAnsweringSimple,
...@@ -42,24 +42,11 @@ FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP = { ...@@ -42,24 +42,11 @@ FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
} }
FLAUBERT_START_DOCSTRING = r""" The Flaubert model was proposed in FLAUBERT_START_DOCSTRING = r"""
`FlauBERT: Unsupervised Language Model Pre-training for French`_
by Hang Le et al. It's a transformer pre-trained using a masked
language modeling (MLM) objective (BERT-like).
Original code can be found `here`_. This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and usage and behavior.
refer to the PyTorch documentation for all matters related to general usage and behavior.
.. _`FlauBERT: Unsupervised Language Model Pre-training for French`:
https://arxiv.org/abs/1912.05372
.. _`torch.nn.Module`:
https://pytorch.org/docs/stable/nn.html#module
.. _`here`:
https://github.com/getalp/Flaubert
Parameters: Parameters:
config (:class:`~transformers.FlaubertConfig`): Model configuration class with all the parameters of the model. config (:class:`~transformers.FlaubertConfig`): Model configuration class with all the parameters of the model.
...@@ -68,42 +55,47 @@ FLAUBERT_START_DOCSTRING = r""" The Flaubert model was proposed in ...@@ -68,42 +55,47 @@ FLAUBERT_START_DOCSTRING = r""" The Flaubert model was proposed in
""" """
FLAUBERT_INPUTS_DOCSTRING = r""" FLAUBERT_INPUTS_DOCSTRING = r"""
Inputs: Args:
**input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Flaubert is a model with absolute position embeddings so it's usually advised to pad the inputs on Indices can be obtained using :class:`transformers.FlaubertTokenizer`.
the right rather than the left.
Indices can be obtained using :class:`transformers.FlaubertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. :func:`transformers.PreTrainedTokenizer.encode_plus` for details.
**attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens. ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
**token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
A parallel sequence of tokens (can be used to indicate various portions of the inputs). `What are attention masks? <../glossary.html#attention-mask>`__
The embeddings from these tokens will be summed with the respective token embeddings. token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices). Segment token indices to indicate first and second portions of the inputs.
**position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
corresponds to a `sentence B` token
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Indices of positions of each input sequence token in the position embeddings. Indices of positions of each input sequence token in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``. Selected in the range ``[0, config.max_position_embeddings - 1]``.
**lengths**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
`What are position IDs? <../glossary.html#position-ids>`_
lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
Length of each sentence that can be used to avoid performing attention on padding token indices. Length of each sentence that can be used to avoid performing attention on padding token indices.
You can also use `attention_mask` for the same result (see above), kept here for compatibility. You can also use `attention_mask` for the same result (see above), kept here for compatibility.
Indices selected in ``[0, ..., input_ids.size(-1)]``: Indices selected in ``[0, ..., input_ids.size(-1)]``:
**cache**: cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`, defaults to :obj:`None`):
dictionary with ``torch.FloatTensor`` that contains pre-computed dictionary with ``torch.FloatTensor`` that contains pre-computed
hidden-states (key and values in the attention blocks) as computed by the model hidden-states (key and values in the attention blocks) as computed by the model
(see `cache` output below). Can be used to speed up sequential decoding. (see `cache` output below). Can be used to speed up sequential decoding.
The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states. The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.
**head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``: head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**. :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
**inputs_embeds**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, embedding_dim)``: inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing ``input_ids`` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix. than the model's internal embedding lookup matrix.
""" """
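The inputs docstring says `lengths` and `attention_mask` can both be used to avoid attending to padding, and the configuration docstring describes `causal` models as using a triangular mask. A minimal sketch of how these relate, in plain Python, is given below. The function `masks_from_lengths` is hypothetical and assumes right-padded sequences; it is meant as an illustration, not as the model's internal mask computation.

```python
from typing import List, Sequence, Tuple

Mask2D = List[List[int]]


def masks_from_lengths(
    lengths: Sequence[int], slen: int, causal: bool = False
) -> Tuple[Mask2D, List[Mask2D]]:
    """Build masks for a right-padded batch.

    padding_mask[b][k] is 1 for real tokens and 0 for padding.
    attn_mask[b][q][k] is 1 when query position q may attend to key position k.
    """
    if any(n < 0 or n > slen for n in lengths):
        raise ValueError("every length must be in [0, slen]")
    padding_mask = [[1 if pos < n else 0 for pos in range(slen)] for n in lengths]
    attn_mask = []
    for row in padding_mask:
        per_query = []
        for q in range(slen):
            # bidirectional: every non-padding key is visible;
            # causal: additionally hide keys to the right of the query
            per_query.append([row[k] if (not causal or k <= q) else 0 for k in range(slen)])
        attn_mask.append(per_query)
    return padding_mask, attn_mask


padding, attn = masks_from_lengths([3, 2], slen=3)
print(padding)  # [[1, 1, 1], [1, 1, 0]]

_, causal_attn = masks_from_lengths([3, 2], slen=3, causal=True)
print(causal_attn[1])  # [[1, 0, 0], [1, 1, 0], [1, 1, 0]]
```

In the bidirectional case the padding mask carries the same information as an `attention_mask` of ones and zeros, which is why either input can be supplied.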
...@@ -112,30 +104,8 @@ FLAUBERT_INPUTS_DOCSTRING = r""" ...@@ -112,30 +104,8 @@ FLAUBERT_INPUTS_DOCSTRING = r"""
@add_start_docstrings( @add_start_docstrings(
"The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.", "The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.",
FLAUBERT_START_DOCSTRING, FLAUBERT_START_DOCSTRING,
FLAUBERT_INPUTS_DOCSTRING,
) )
class FlaubertModel(XLMModel): class FlaubertModel(XLMModel):
r"""
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
Sequence of hidden-states at the last layer of the model.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')
model = FlaubertModel.from_pretrained('flaubert-base-cased')
input_ids = torch.tensor(tokenizer.encode("Le chat mange une pomme.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
config_class = FlaubertConfig config_class = FlaubertConfig
pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP
...@@ -146,6 +116,7 @@ class FlaubertModel(XLMModel): ...@@ -146,6 +116,7 @@ class FlaubertModel(XLMModel):
self.layerdrop = 0.0 if not hasattr(config, "layerdrop") else config.layerdrop self.layerdrop = 0.0 if not hasattr(config, "layerdrop") else config.layerdrop
self.pre_norm = False if not hasattr(config, "pre_norm") else config.pre_norm self.pre_norm = False if not hasattr(config, "pre_norm") else config.pre_norm
@add_start_docstrings_to_callable(FLAUBERT_INPUTS_DOCSTRING)
def forward( def forward(
self, self,
input_ids=None, input_ids=None,
...@@ -157,7 +128,34 @@ class FlaubertModel(XLMModel): ...@@ -157,7 +128,34 @@ class FlaubertModel(XLMModel):
cache=None, cache=None,
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
): # removed: src_enc=None, src_len=None ):
r"""
Return:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.FlaubertConfig`) and inputs:
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Examples::
import torch
from transformers import FlaubertModel, FlaubertTokenizer
tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')
model = FlaubertModel.from_pretrained('flaubert-base-cased')
input_ids = torch.tensor(tokenizer.encode("Le chat mange une pomme.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
# removed: src_enc=None, src_len=None
if input_ids is not None: if input_ids is not None:
bs, slen = input_ids.size() bs, slen = input_ids.size()
else: else:
...@@ -306,38 +304,11 @@ class FlaubertModel(XLMModel): ...@@ -306,38 +304,11 @@ class FlaubertModel(XLMModel):
"""The Flaubert Model transformer with a language modeling head on top """The Flaubert Model transformer with a language modeling head on top
(linear layer with weights tied to the input embeddings). """, (linear layer with weights tied to the input embeddings). """,
FLAUBERT_START_DOCSTRING, FLAUBERT_START_DOCSTRING,
FLAUBERT_INPUTS_DOCSTRING,
) )
class FlaubertWithLMHeadModel(XLMWithLMHeadModel): class FlaubertWithLMHeadModel(XLMWithLMHeadModel):
r""" """
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``: This class overrides :class:`~transformers.XLMWithLMHeadModel`. Please check the
Labels for language modeling. superclass for the appropriate documentation alongside usage examples.
Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``
Indices are selected in ``[-1, 0, ..., config.vocab_size]``
All labels set to ``-100`` are ignored (masked), the loss is only
computed for labels in ``[0, ..., config.vocab_size]``
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Language modeling loss.
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')
model = FlaubertWithLMHeadModel.from_pretrained('flaubert-base-cased')
input_ids = torch.tensor(tokenizer.encode("Le chat mange une pomme.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
""" """
config_class = FlaubertConfig config_class = FlaubertConfig
pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP
...@@ -352,38 +323,11 @@ class FlaubertWithLMHeadModel(XLMWithLMHeadModel): ...@@ -352,38 +323,11 @@ class FlaubertWithLMHeadModel(XLMWithLMHeadModel):
"""Flaubert Model with a sequence classification/regression head on top (a linear layer on top of """Flaubert Model with a sequence classification/regression head on top (a linear layer on top of
the pooled output) e.g. for GLUE tasks. """, the pooled output) e.g. for GLUE tasks. """,
FLAUBERT_START_DOCSTRING, FLAUBERT_START_DOCSTRING,
FLAUBERT_INPUTS_DOCSTRING,
) )
class FlaubertForSequenceClassification(XLMForSequenceClassification): class FlaubertForSequenceClassification(XLMForSequenceClassification):
r""" """
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: This class overrides :class:`~transformers.XLMForSequenceClassification`. Please check the
Labels for computing the sequence classification/regression loss. superclass for the appropriate documentation alongside usage examples.
Indices should be in ``[0, ..., config.num_labels - 1]``.
If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Classification (or regression if config.num_labels==1) loss.
**logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
Classification (or regression if config.num_labels==1) scores (before SoftMax).
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')
model = FlaubertForSequenceClassification.from_pretrained('flaubert-base-cased')
input_ids = torch.tensor(tokenizer.encode("Le chat mange une pomme.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
""" """
config_class = FlaubertConfig config_class = FlaubertConfig
pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP
...@@ -398,50 +342,11 @@ class FlaubertForSequenceClassification(XLMForSequenceClassification): ...@@ -398,50 +342,11 @@ class FlaubertForSequenceClassification(XLMForSequenceClassification):
"""Flaubert Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of """Flaubert Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
the hidden-states output to compute `span start logits` and `span end logits`). """, the hidden-states output to compute `span start logits` and `span end logits`). """,
FLAUBERT_START_DOCSTRING, FLAUBERT_START_DOCSTRING,
FLAUBERT_INPUTS_DOCSTRING,
) )
class FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple): class FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple):
r""" """
**start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: This class overrides :class:`~transformers.XLMForQuestionAnsweringSimple`. Please check the
Labels for position (index) of the start of the labelled span for computing the token classification loss. superclass for the appropriate documentation alongside usage examples.
Positions are clamped to the length of the sequence (`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
**end_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
**is_impossible**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels whether a question has an answer or no answer (SQuAD 2.0)
**cls_index**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for position (index) of the classification token to use as input for computing plausibility of the answer.
**p_mask**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...)
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
**start_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
Span-start scores (before SoftMax).
**end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
Span-end scores (before SoftMax).
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')
model = FlaubertForQuestionAnsweringSimple.from_pretrained('flaubert-base-cased')
input_ids = torch.tensor(tokenizer.encode("Le chat mange une pomme", add_special_tokens=True)).unsqueeze(0) # Batch size 1
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])
outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
loss, start_scores, end_scores = outputs[:3]
""" """
config_class = FlaubertConfig config_class = FlaubertConfig
pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP
...@@ -456,50 +361,11 @@ class FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple): ...@@ -456,50 +361,11 @@ class FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple):
"""Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of """Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
the hidden-states output to compute `span start logits` and `span end logits`). """, the hidden-states output to compute `span start logits` and `span end logits`). """,
FLAUBERT_START_DOCSTRING, FLAUBERT_START_DOCSTRING,
FLAUBERT_INPUTS_DOCSTRING,
) )
class FlaubertForQuestionAnswering(XLMForQuestionAnswering): class FlaubertForQuestionAnswering(XLMForQuestionAnswering):
r""" """
**start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``: This class overrides :class:`~transformers.XLMForQuestionAnswering`. Please check the
Labels for position (index) of the start of the labelled span for computing the token classification loss. superclass for the appropriate documentation alongside usage examples.
Positions are clamped to the length of the sequence (`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
**end_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
**is_impossible**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels whether a question has an answer or no answer (SQuAD 2.0)
**cls_index**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for position (index) of the classification token to use as input for computing plausibility of the answer.
**p_mask**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...)
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
**start_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
Span-start scores (before SoftMax).
**end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
Span-end scores (before SoftMax).
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')
model = FlaubertForQuestionAnswering.from_pretrained('flaubert-base-cased')
input_ids = torch.tensor(tokenizer.encode("Le chat mange une pomme.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])
outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
loss = outputs[0]
""" """
config_class = FlaubertConfig config_class = FlaubertConfig
pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP pretrained_model_archive_map = FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP
......