Commit 87e6e4fe authored by Sylvain Gugger, committed by GitHub

Doc styler v2 (#14950)

* New doc styler

* Fix issue with args at the start

* Code sample fixes

* Style code examples in MDX

* Fix more patterns

* Typo

* Typo

* More patterns

* Do without black for now

* Get more info in error

* Docstring style

* Re-enable check

* Quality

* Fix add_end_docstring decorator

* Fix docstring
parent c1138273
......@@ -98,13 +98,12 @@ class Wav2Vec2ProcessorWithLM:
def save_pretrained(self, save_directory):
"""
Save the Wav2Vec2 feature_extractor, a tokenizer object and a pyctcdecode decoder to the directory
`save_directory`, so that they can be re-loaded using the
[`~Wav2Vec2ProcessorWithLM.from_pretrained`] class method.
`save_directory`, so that they can be re-loaded using the [`~Wav2Vec2ProcessorWithLM.from_pretrained`] class
method.
<Tip>
This class method is simply calling
[`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained,`]
This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained,`]
[`~tokenization_utils_base.PreTrainedTokenizer.save_pretrained`] and pyctcdecode's
[`pyctcdecode.BeamSearchDecoderCTC.save_to_dir`].
......@@ -129,9 +128,9 @@ class Wav2Vec2ProcessorWithLM:
<Tip>
This class method is simply calling Wav2Vec2FeatureExtractor's
[`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`],
Wav2Vec2CTCTokenizer's [`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`],
and [`pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub`].
[`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`], Wav2Vec2CTCTokenizer's
[`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`], and
[`pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub`].
Please refer to the docstrings of the methods above for more information.
......@@ -145,8 +144,7 @@ class Wav2Vec2ProcessorWithLM:
huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
- a path to a *directory* containing a feature extractor file saved using the
[`~SequenceFeatureExtractor.save_pretrained`] method, e.g.,
`./my_model_directory/`.
[`~SequenceFeatureExtractor.save_pretrained`] method, e.g., `./my_model_directory/`.
- a path or url to a saved feature extractor JSON *file*, e.g.,
`./my_model_directory/preprocessor_config.json`.
**kwargs
......@@ -221,8 +219,8 @@ class Wav2Vec2ProcessorWithLM:
When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's
[`~Wav2Vec2FeatureExtractor.__call__`] and returns its output. If used in the context
[`~Wav2Vec2ProcessorWithLM.as_target_processor`] this method forwards all its arguments to
Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.__call__`]. Please refer to the docstring of
the above two methods for more information.
Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.__call__`]. Please refer to the docstring of the above two
methods for more information.
"""
return self.current_processor(*args, **kwargs)
......@@ -231,8 +229,8 @@ class Wav2Vec2ProcessorWithLM:
When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's
[`~Wav2Vec2FeatureExtractor.pad`] and returns its output. If used in the context
[`~Wav2Vec2ProcessorWithLM.as_target_processor`] this method forwards all its arguments to
Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.pad`]. Please refer to the docstring of the
above two methods for more information.
Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.pad`]. Please refer to the docstring of the above two methods
for more information.
"""
return self.current_processor.pad(*args, **kwargs)
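The forwarding behaviour described in the two docstrings above can be sketched with a toy class. This is only an illustration of the delegation pattern, not the library's implementation: `ToyProcessor`, `fake_feature_extractor` and `fake_tokenizer` are hypothetical names, and the real processor does considerably more.

```python
from contextlib import contextmanager


class ToyProcessor:
    """Hypothetical stand-in, not the Wav2Vec2ProcessorWithLM implementation."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        # Normal mode: calls are forwarded to the feature extractor.
        self.current_processor = feature_extractor

    def __call__(self, *args, **kwargs):
        # Forward everything to whichever processor is currently active.
        return self.current_processor(*args, **kwargs)

    @contextmanager
    def as_target_processor(self):
        # Temporarily route calls to the tokenizer, then restore normal mode.
        self.current_processor = self.tokenizer
        try:
            yield
        finally:
            self.current_processor = self.feature_extractor


def fake_feature_extractor(audio):
    # Assumed toy behaviour: wrap the raw samples in a dict.
    return {"input_values": list(audio)}


def fake_tokenizer(text):
    # Assumed toy behaviour: map each character to its code point.
    return {"input_ids": [ord(c) for c in text]}


processor = ToyProcessor(fake_feature_extractor, fake_tokenizer)
inputs = processor([0.1, 0.2])          # normal mode -> feature extractor
with processor.as_target_processor():   # target mode -> tokenizer
    labels = processor("hi")
```

After the `with` block exits, calls go back to the feature extractor, which is why the docstrings distinguish "normal mode" from use inside the `as_target_processor` context.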
......
......@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" WavLM model configuration """
""" WavLM model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
......@@ -28,20 +28,20 @@ WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class WavLMConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`WavLMModel`]. It is used to
instantiate an WavLM model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the WavLM [facebook/wavlm-base-960h](https://huggingface.co/facebook/wavlm-base-960h) architecture.
This is the configuration class to store the configuration of a [`WavLMModel`]. It is used to instantiate an WavLM
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the WavLM
[facebook/wavlm-base-960h](https://huggingface.co/facebook/wavlm-base-960h) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 32):
Vocabulary size of the WavLM model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`WavLMModel`]. Vocabulary size of the model.
Defines the different tokens that can be represented by the *inputs_ids* passed to the forward method of
[`WavLMModel`].
`inputs_ids` passed when calling [`WavLMModel`]. Vocabulary size of the model. Defines the different tokens
that can be represented by the *inputs_ids* passed to the forward method of [`WavLMModel`].
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (`int`, *optional*, defaults to 12):
......@@ -51,8 +51,8 @@ class WavLMConfig(PretrainedConfig):
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
`"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported.
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"` and `"gelu_new"` are supported.
hidden_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.1):
......@@ -92,24 +92,27 @@ class WavLMConfig(PretrainedConfig):
num_conv_pos_embedding_groups (`int`, *optional*, defaults to 16):
Number of groups of 1D convolutional positional embeddings layer.
do_stable_layer_norm (`bool`, *optional*, defaults to `False`):
Whether to apply *stable* layer norm architecture of the Transformer encoder. `do_stable_layer_norm is True` corresponds to applying layer norm before the attention layer, whereas `do_stable_layer_norm is False` corresponds to applying layer norm after the attention layer.
Whether to apply *stable* layer norm architecture of the Transformer encoder. `do_stable_layer_norm is
True` corresponds to applying layer norm before the attention layer, whereas `do_stable_layer_norm is
False` corresponds to applying layer norm after the attention layer.
apply_spec_augment (`bool`, *optional*, defaults to `True`):
Whether to apply *SpecAugment* data augmentation to the outputs of the feature extractor. For reference see
[SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779).
[SpecAugment: A Simple Data Augmentation Method for Automatic Speech
Recognition](https://arxiv.org/abs/1904.08779).
mask_time_prob (`float`, *optional*, defaults to 0.05):
Probability of each feature vector along the time axis to be chosen as the start of the vector span to be
masked. Approximately `mask_time_prob * sequence_length // mask_time_length` feature vectors will be
masked along the time axis. This is only relevant if `apply_spec_augment is True`.
masked. Approximately `mask_time_prob * sequence_length // mask_time_length` feature vectors will be masked
along the time axis. This is only relevant if `apply_spec_augment is True`.
mask_time_length (`int`, *optional*, defaults to 10):
Length of vector span along the time axis.
mask_time_min_masks (`int`, *optional*, defaults to 2),:
The minimum number of masks of length `mask_feature_length` generated along the time axis, each time
step, irrespectively of `mask_feature_prob`. Only relevant if
''mask_time_prob*len(time_axis)/mask_time_length < mask_time_min_masks''
The minimum number of masks of length `mask_feature_length` generated along the time axis, each time step,
irrespectively of `mask_feature_prob`. Only relevant if ''mask_time_prob*len(time_axis)/mask_time_length <
mask_time_min_masks''
mask_feature_prob (`float`, *optional*, defaults to 0.0):
Probability of each feature vector along the feature axis to be chosen as the start of the vector span to
be masked. Approximately `mask_time_prob * hidden_size // mask_time_length` feature vectors will be
masked along the time axis. This is only relevant if `apply_spec_augment is True`.
be masked. Approximately `mask_time_prob * hidden_size // mask_time_length` feature vectors will be masked
along the time axis. This is only relevant if `apply_spec_augment is True`.
mask_feature_length (`int`, *optional*, defaults to 10):
Length of vector span along the feature axis.
num_codevectors_per_group (`int`, *optional*, defaults to 320):
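As a rough worked example of the masking arguments documented in this hunk, the snippet below evaluates the quoted expression `mask_time_prob * sequence_length // mask_time_length` and applies a lower bound in the spirit of `mask_time_min_masks`. The helper name `approx_num_masked_spans` is hypothetical, and the actual span sampling in the model code may round differently.

```python
def approx_num_masked_spans(mask_prob, axis_length, mask_length, min_masks=0):
    """Evaluate the documented approximation, with an assumed lower bound."""
    # Expression quoted in the docstring: prob * length // span_length.
    spans = int(mask_prob * axis_length // mask_length)
    # Assumption: the minimum only matters when the estimate falls below it.
    return max(spans, min_masks)


# Long axis: the probability term dominates.
long_axis = approx_num_masked_spans(0.5, 100, 10)
# Short estimate: the minimum number of masks takes over.
short_axis = approx_num_masked_spans(0.05, 100, 10, min_masks=2)
```

With these inputs the first call is governed by the probability term and the second by the minimum, which is the case the `mask_time_min_masks` entry says it is "only relevant" for.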
......@@ -132,9 +135,9 @@ class WavLMConfig(PretrainedConfig):
Specifies the reduction to apply to the output of `torch.nn.CTCLoss`. Only relevant when training an
instance of [`WavLMForCTC`].
ctc_zero_infinity (`bool`, *optional*, defaults to `False`):
Whether to zero infinite losses and the associated gradients of `torch.nn.CTCLoss`. Infinite losses
mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an
instance of [`WavLMForCTC`].
Whether to zero infinite losses and the associated gradients of `torch.nn.CTCLoss`. Infinite losses mainly
occur when the inputs are too short to be aligned to the targets. Only relevant when training an instance
of [`WavLMForCTC`].
use_weighted_layer_sum (`bool`, *optional*, defaults to `False`):
Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an
instance of [`WavLMForSequenceClassification`].
......@@ -159,7 +162,8 @@ class WavLMConfig(PretrainedConfig):
adapter_stride (`int`, *optional*, defaults to 2):
Stride of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
num_adapter_layers (`int`, *optional*, defaults to 3):
Number of convolutional layers that should be used in the adapter network. Only relevant if `add_adapter is True`.
Number of convolutional layers that should be used in the adapter network. Only relevant if `add_adapter is
True`.
output_hidden_size (`int`, *optional*):
Dimensionality of the encoder output layer. If not defined, this defaults to *hidden-size*. Only relevant
if `add_adapter is True`.
......
......@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch WavLM model. """
""" PyTorch WavLM model."""
import math
from dataclasses import dataclass
......@@ -73,12 +73,13 @@ class WavLMBaseModelOutput(ModelOutput):
extract_features (`torch.FloatTensor` of shape `(batch_size, sequence_length, conv_dim[-1])`):
Sequence of extracted feature vectors of the last convolutional layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -103,12 +104,13 @@ class XVectorOutput(ModelOutput):
embeddings (`torch.FloatTensor` of shape `(batch_size, config.xvector_output_dim)`):
Utterance embeddings used for vector similarity-based retrieval.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -131,8 +133,8 @@ def _compute_mask_indices(
) -> np.ndarray:
"""
Computes random mask spans for a given shape. Used to implement [SpecAugment: A Simple Data Augmentation Method for
ASR](https://arxiv.org/abs/1904.08779). Note that this method is not optimized to run on TPU and should be run
on CPU as part of the preprocessing during training.
ASR](https://arxiv.org/abs/1904.08779). Note that this method is not optimized to run on TPU and should be run on
CPU as part of the preprocessing during training.
Args:
shape: The shape for which to compute masks. This should be of a tuple of size 2 where
......@@ -1080,11 +1082,12 @@ class WavLMPreTrainedModel(PreTrainedModel):
WAVLM_START_DOCSTRING = r"""
WavLM was proposed in [WavLM: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei,
WavLM was proposed in [WavLM: Unified Speech Representation Learning with Labeled and Unlabeled
Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei,
Michael Zeng, Xuedong Huang.
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic
methods the library implements for all its model (such as downloading or saving etc.).
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving etc.).
This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use
it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
......@@ -1093,8 +1096,7 @@ WAVLM_START_DOCSTRING = r"""
Parameters:
config ([`WavLMConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model
weights.
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
......@@ -1103,11 +1105,11 @@ WAVLM_INPUTS_DOCSTRING = r"""
input_values (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
Float values of input raw speech waveform. Values can be obtained by loading a *.flac* or *.wav* audio file
into an array of type *List[float]* or a *numpy.ndarray*, *e.g.* via the soundfile library (*pip install
soundfile*). To prepare the array into *input_values*, the [`WavLMProcessor`] should be
used for padding and conversion into a tensor of type *torch.FloatTensor*. See
[`WavLMProcessor.__call__`] for details.
soundfile*). To prepare the array into *input_values*, the [`WavLMProcessor`] should be used for padding
and conversion into a tensor of type *torch.FloatTensor*. See [`WavLMProcessor.__call__`] for details.
attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing convolution and attention on padding token indices. Mask values selected in `[0, 1]`:
Mask to avoid performing convolution and attention on padding token indices. Mask values selected in `[0,
1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
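To make the mask convention above concrete, here is a minimal sketch that zero-pads a batch of variable-length waveforms and builds the matching mask (1 for real samples, 0 for padding). `pad_batch` is a hypothetical helper written in plain Python; in practice the processor performs this step and returns tensors.

```python
def pad_batch(waveforms, padding_value=0.0):
    """Right-pad each waveform to the longest length and build a 0/1 mask."""
    max_len = max(len(w) for w in waveforms)
    input_values, attention_mask = [], []
    for w in waveforms:
        n_pad = max_len - len(w)
        input_values.append(list(w) + [padding_value] * n_pad)
        # 1 marks real samples (not masked), 0 marks padding (masked).
        attention_mask.append([1] * len(w) + [0] * n_pad)
    return input_values, attention_mask


values, mask = pad_batch([[0.1, 0.2, 0.3], [0.4]])
```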
......@@ -1116,12 +1118,11 @@ WAVLM_INPUTS_DOCSTRING = r"""
<Tip warning={true}>
`attention_mask` should only be passed if the corresponding processor has
`config.return_attention_mask == True`. For all models whose processor has
`config.return_attention_mask == False`, `attention_mask` should **not** be passed to avoid
degraded performance when doing batched inference. For such models `input_values` should simply be
padded with 0 and passed without `attention_mask`. Be aware that these models also yield slightly
different results depending on whether `input_values` is padded or not.
`attention_mask` should only be passed if the corresponding processor has `config.return_attention_mask ==
True`. For all models whose processor has `config.return_attention_mask == False`, `attention_mask` should
**not** be passed to avoid degraded performance when doing batched inference. For such models
`input_values` should simply be padded with 0 and passed without `attention_mask`. Be aware that these
models also yield slightly different results depending on whether `input_values` is padded or not.
</Tip>
......@@ -1268,7 +1269,7 @@ class WavLMModel(WavLMPreTrainedModel):
@add_start_docstrings(
"""WavLM Model with a `language modeling` head on top for Connectionist Temporal Classification (CTC). """,
"""WavLM Model with a `language modeling` head on top for Connectionist Temporal Classification (CTC).""",
WAVLM_START_DOCSTRING,
)
# Copied from transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForCTC with Wav2Vec2->WavLM, wav2vec2->wavlm, WAV_2_VEC_2->WAVLM
......@@ -1317,7 +1318,9 @@ class WavLMForCTC(WavLMPreTrainedModel):
r"""
labels (`torch.LongTensor` of shape `(batch_size, target_length)`, *optional*):
Labels for connectionist temporal classification. Note that `target_length` has to be smaller or equal to
the sequence length of the output logits. Indices are selected in `[-100, 0, ..., config.vocab_size - 1]`. All labels set to `-100` are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size - 1]`.
the sequence length of the output logits. Indices are selected in `[-100, 0, ..., config.vocab_size - 1]`.
All labels set to `-100` are ignored (masked), the loss is only computed for labels in `[0, ...,
config.vocab_size - 1]`.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
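The `-100` convention in the `labels` docstring above can be illustrated as follows: padded label positions are dropped before the targets reach the CTC loss, and the per-sample target lengths count only the remaining entries. This is a plain-Python sketch with a hypothetical helper name (`ctc_targets`); the model itself works on tensors.

```python
IGNORE_INDEX = -100  # value the docstring says is ignored (masked)


def ctc_targets(labels):
    """Return flattened valid targets and the number of valid labels per row."""
    target_lengths = [sum(1 for t in row if t != IGNORE_INDEX) for row in labels]
    flattened = [t for row in labels for t in row if t != IGNORE_INDEX]
    return flattened, target_lengths


flat, lengths = ctc_targets([[5, 2, -100], [7, -100, -100]])
```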
......@@ -1432,8 +1435,9 @@ class WavLMForSequenceClassification(WavLMPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
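The rule stated in the `labels` docstring above (mean-square loss when `config.num_labels == 1`, cross-entropy otherwise) can be sketched in plain Python. The function name is hypothetical and the code only mirrors the documented rule; it is not the loss code used by the models.

```python
import math


def classification_or_regression_loss(logits, labels, num_labels):
    """Pick the loss from num_labels, as the docstring describes."""
    if num_labels == 1:
        # Regression: mean squared error on the single output per example.
        return sum((row[0] - y) ** 2 for row, y in zip(logits, labels)) / len(labels)
    # Classification: mean cross-entropy, using a stable log-sum-exp.
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(labels)


mse = classification_or_regression_loss([[1.0], [3.0]], [0.0, 1.0], num_labels=1)
xent = classification_or_regression_loss([[0.0, 0.0]], [1], num_labels=2)
```

With uniform logits over two classes, the cross-entropy equals `log(2)`, which is a quick sanity check for the classification branch.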
......@@ -1534,8 +1538,9 @@ class WavLMForAudioFrameClassification(WavLMPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -1698,8 +1703,9 @@ class WavLMForXVector(WavLMPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......
......@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" XLM configuration """
""" XLM configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
......@@ -36,13 +36,13 @@ XLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class XLMConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a [`XLMModel`] or a
[`TFXLMModel`]. It is used to instantiate a XLM model according to the specified arguments,
defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration
to that of the [xlm-mlm-en-2048](https://huggingface.co/xlm-mlm-en-2048) architecture.
This is the configuration class to store the configuration of a [`XLMModel`] or a [`TFXLMModel`]. It is used to
instantiate a XLM model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the
[xlm-mlm-en-2048](https://huggingface.co/xlm-mlm-en-2048) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 30145):
......@@ -72,8 +72,8 @@ class XLMConfig(PretrainedConfig):
The number of languages the model handles. Set to 1 for monolingual models.
use_lang_emb (`bool`, *optional*, defaults to `True`)
Whether to use language embeddings. Some models use additional language embeddings, see [the multilingual
models page](http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings) for
information on how to use them.
models page](http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings) for information
on how to use them.
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
......
......@@ -563,12 +563,13 @@ class TFXLMWithLMHeadModelOutput(ModelOutput):
logits (`tf.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -581,13 +582,13 @@ class TFXLMWithLMHeadModelOutput(ModelOutput):
XLM_START_DOCSTRING = r"""
This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the
generic methods the library implements for all its model (such as downloading or saving, resizing the input
embeddings, pruning heads etc.)
This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use
it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage
and behavior.
This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
behavior.
<Tip>
......@@ -596,11 +597,11 @@ XLM_START_DOCSTRING = r"""
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using [`tf.keras.Model.fit`] method which currently requires having all
the tensors in the first argument of the model call function: `model(inputs)`.
This second option is useful when using [`tf.keras.Model.fit`] method which currently requires having all the
tensors in the first argument of the model call function: `model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in
the first positional argument :
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the
first positional argument :
- a single Tensor with `input_ids` only and nothing else: `model(inputs_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
......@@ -613,8 +614,7 @@ XLM_START_DOCSTRING = r"""
Parameters:
config ([`XLMConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model
weights.
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
XLM_INPUTS_DOCSTRING = r"""
......@@ -622,9 +622,8 @@ XLM_INPUTS_DOCSTRING = r"""
input_ids (`Numpy array` or `tf.Tensor` of shape `({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`BertTokenizer`]. See
[`PreTrainedTokenizer.__call__`] and [`PreTrainedTokenizer.encode`] for
details.
Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.__call__`] and
[`PreTrainedTokenizer.encode`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`Numpy array` or `tf.Tensor` of shape `({0})`, *optional*):
......@@ -643,14 +642,16 @@ XLM_INPUTS_DOCSTRING = r"""
See usage examples detailed in the [multilingual documentation](../multilingual).
token_type_ids (`Numpy array` or `tf.Tensor` of shape `({0})`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
position_ids (`Numpy array` or `tf.Tensor` of shape `({0})`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.max_position_embeddings - 1]`.
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
lengths (`tf.Tensor` or `Numpy array` of shape `(batch_size,)`, *optional*):
......@@ -659,8 +660,8 @@ XLM_INPUTS_DOCSTRING = r"""
`[0, ..., input_ids.size(-1)]`.
cache (`Dict[str, tf.Tensor]`, *optional*):
Dictionary string to `torch.FloatTensor` that contains precomputed hidden states (key and values in the
attention blocks) as computed by the model (see `cache` output below). Can be used to speed up
sequential decoding.
attention blocks) as computed by the model (see `cache` output below). Can be used to speed up sequential
decoding.
The dictionary object will be modified in-place during the forward pass to add newly computed
hidden-states.
......@@ -671,9 +672,9 @@ XLM_INPUTS_DOCSTRING = r"""
- 0 indicates the head is **masked**.
inputs_embeds (`tf.Tensor` of shape `({0}, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the
......@@ -683,8 +684,8 @@ XLM_INPUTS_DOCSTRING = r"""
more detail. This argument can be used only in eager mode, in graph mode the value in the config will be
used instead.
return_dict (`bool`, *optional*):
Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple. This
argument can be used in eager mode; in graph mode the value will always be set to `True`.
Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple. This argument can be used
in eager mode; in graph mode the value will always be set to `True`.
training (`bool`, *optional*, defaults to `False`):
Whether or not to use the model in training mode (some modules like dropout modules have different
behaviors between training and evaluation).
......@@ -970,8 +971,9 @@ class TFXLMForSequenceClassification(TFXLMPreTrainedModel, TFSequenceClassificat
):
r"""
labels (`tf.Tensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss).
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
inputs = input_processing(
func=self.call,
......@@ -1351,12 +1353,12 @@ class TFXLMForQuestionAnsweringSimple(TFXLMPreTrainedModel, TFQuestionAnsweringL
r"""
start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
inputs = input_processing(
func=self.call,
......
......@@ -287,12 +287,13 @@ class XLMForQuestionAnsweringOutput(ModelOutput):
cls_logits (`torch.FloatTensor` of shape `(batch_size,)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
Log probabilities for the `is_impossible` label of the answers.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -310,19 +311,18 @@ class XLMForQuestionAnsweringOutput(ModelOutput):
XLM_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`XLMConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model
weights.
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
XLM_INPUTS_DOCSTRING = r"""
......@@ -330,9 +330,8 @@ XLM_INPUTS_DOCSTRING = r"""
input_ids (`torch.LongTensor` of shape `({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`XLMTokenizer`]. See
[`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for
details.
Indices can be obtained using [`XLMTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
......@@ -351,14 +350,16 @@ XLM_INPUTS_DOCSTRING = r"""
See usage examples detailed in the [multilingual documentation](../multilingual).
token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.max_position_embeddings - 1]`.
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
lengths (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
......@@ -367,8 +368,8 @@ XLM_INPUTS_DOCSTRING = r"""
`[0, ..., input_ids.size(-1)]`.
cache (`Dict[str, torch.FloatTensor]`, *optional*):
Dictionary string to `torch.FloatTensor` that contains precomputed hidden states (key and values in the
attention blocks) as computed by the model (see `cache` output below). Can be used to speed up
sequential decoding.
attention blocks) as computed by the model (see `cache` output below). Can be used to speed up sequential
decoding.
The dictionary object will be modified in-place during the forward pass to add newly computed
hidden-states.
......@@ -379,9 +380,9 @@ XLM_INPUTS_DOCSTRING = r"""
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
......@@ -734,8 +735,8 @@ class XLMWithLMHeadModel(XLMPreTrainedModel):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
`labels = input_ids`. Indices are selected in `[-100, 0, ..., config.vocab_size]`. All labels set to
`-100` are ignored (masked); the loss is only computed for labels in `[0, ..., config.vocab_size]`.
`labels = input_ids`. Indices are selected in `[-100, 0, ..., config.vocab_size]`. All labels set to `-100`
are ignored (masked); the loss is only computed for labels in `[0, ..., config.vocab_size]`.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
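The `-100` convention described in the `labels` docstring above can be illustrated with a small plain-Python sketch. This is not the model's implementation; the function name `masked_cross_entropy` and the list-based inputs are assumptions made for illustration only.

```python
import math

IGNORE_INDEX = -100  # label value that is skipped when averaging the loss


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX.

    logits: list of per-position score lists (one list of vocab scores per position)
    labels: list of ints, one per position; -100 marks a position to ignore
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[label]  # negative log-softmax of the target
        count += 1
    return total / count if count else 0.0
```

With uniform scores over two classes and the second position masked, only the first position contributes to the average.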
......@@ -812,8 +813,9 @@ class XLMForSequenceClassification(XLMPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss).
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
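The rule stated in the `labels` docstring above (regression when `config.num_labels == 1`, classification otherwise) can be sketched in plain Python. The helper name `sequence_classification_loss` and the list-based inputs are hypothetical and stand in for the tensor operations the model would actually use.

```python
import math


def sequence_classification_loss(logits, labels, num_labels):
    """Pick the loss the docstring describes, based on num_labels.

    logits: list (batch) of lists, each holding num_labels scores
    labels: list (batch) of targets; floats for regression, class indices otherwise
    """
    if num_labels == 1:
        # Regression: mean squared error between the single score and the target.
        errors = [(row[0] - float(y)) ** 2 for row, y in zip(logits, labels)]
    else:
        # Classification: cross-entropy of the target class under a softmax.
        errors = [
            math.log(sum(math.exp(s) for s in row)) - row[int(y)]
            for row, y in zip(logits, labels)
        ]
    return sum(errors) / len(errors)
```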
......@@ -914,12 +916,12 @@ class XLMForQuestionAnsweringSimple(XLMPreTrainedModel):
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
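The clamping described for `start_positions` and `end_positions` above can be sketched as follows. This assumes that the index `sequence_length` itself acts as the "ignore" sentinel, which the docstring wording suggests but this hunk does not show; the function name is hypothetical.

```python
def clamp_positions(positions, sequence_length):
    """Clamp span positions into [0, sequence_length] and flag the usable ones.

    Returns (clamped, keep): clamped positions, and a parallel list of booleans
    that is False where the position landed on the sentinel index and should
    be left out of the loss.
    """
    clamped = [min(max(p, 0), sequence_length) for p in positions]
    keep = [p != sequence_length for p in clamped]
    return clamped, keep
```

A position past the end of the sequence lands on the sentinel and is dropped, while a negative position is clamped to `0` and still counts.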
......@@ -1017,12 +1019,12 @@ class XLMForQuestionAnswering(XLMPreTrainedModel):
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
is_impossible (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels whether a question has an answer or no answer (SQuAD 2.0)
cls_index (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
......@@ -1231,7 +1233,8 @@ class XLMForMultipleChoice(XLMPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
`input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......
......@@ -534,14 +534,14 @@ class XLMTokenizer(PreTrainedTokenizer):
- Moses preprocessing and tokenization for most supported languages.
- Language specific tokenization for Chinese (Jieba), Japanese (KyTea) and Thai (PyThaiNLP).
- Optionally lowercases and normalizes all input text.
- The argument `special_tokens` and the function `set_special_tokens` can be used to add additional symbols
(like "__classify__") to a vocabulary.
- The `lang2id` attribute maps the languages supported by the model to their IDs if provided (automatically
set for pretrained vocabularies).
- The argument `special_tokens` and the function `set_special_tokens` can be used to add additional symbols (like
"__classify__") to a vocabulary.
- The `lang2id` attribute maps the languages supported by the model to their IDs if provided (automatically set
for pretrained vocabularies).
- The `id2lang` attribute provides the reverse mapping if provided (automatically set for pretrained vocabularies).
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
......@@ -767,11 +767,8 @@ class XLMTokenizer(PreTrainedTokenizer):
::
git clone git@github.com:neubig/kytea.git && cd kytea
autoreconf -i
./configure --prefix=$HOME/local
make && make install
pip install kytea
git clone git@github.com:neubig/kytea.git && cd kytea autoreconf -i ./configure --prefix=$HOME/local
make && make install pip install kytea
- [jieba](https://github.com/fxsjy/jieba): Chinese tokenizer (*)
- Install with `pip install jieba`
......@@ -938,8 +935,7 @@ class XLMTokenizer(PreTrainedTokenizer):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given
sequence(s).
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
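The hunk above ends before the method body, so the following is only a sketch of how token type IDs could be built from `cls` and `sep`. It assumes the layouts `<cls> A <sep>` and `<cls> A <sep> B <sep>`, with `0` for the first portion and `1` for the second; the function name and the default `cls_id`/`sep_id` values are hypothetical.

```python
def token_type_ids_sketch(token_ids_0, token_ids_1=None, cls_id=0, sep_id=1):
    """Return a 0/1 mask marking the first and second sequence of a pair.

    cls_id and sep_id are placeholder special-token ids chosen for illustration.
    """
    cls, sep = [cls_id], [sep_id]
    if token_ids_1 is None:
        # Single sequence: every position belongs to the first portion.
        return [0] * len(cls + token_ids_0 + sep)
    # Pair: first portion (with cls and sep) gets 0, second portion (with sep) gets 1.
    return [0] * len(cls + token_ids_0 + sep) + [1] * len(token_ids_1 + sep)
```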
......
......@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" XLM-ProphetNet model configuration """
""" XLM-ProphetNet model configuration"""
from ...utils import logging
......@@ -28,8 +28,8 @@ XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class XLMProphetNetConfig(ProphetNetConfig):
"""
This class overrides [`ProphetNetConfig`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`ProphetNetConfig`]. Please check the superclass for the appropriate documentation alongside
usage examples.
"""
model_type = "xlm-prophetnet"
......@@ -37,8 +37,8 @@ XLM_PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST = [
class XLMProphetNetEncoder(ProphetNetEncoder):
r"""
This class overrides [`ProphetNetEncoder`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`ProphetNetEncoder`]. Please check the superclass for the appropriate documentation alongside
usage examples.
Example:
......@@ -60,8 +60,8 @@ class XLMProphetNetEncoder(ProphetNetEncoder):
class XLMProphetNetDecoder(ProphetNetDecoder):
r"""
This class overrides [`ProphetNetDecoder`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`ProphetNetDecoder`]. Please check the superclass for the appropriate documentation alongside
usage examples.
Example:
......@@ -83,8 +83,8 @@ class XLMProphetNetDecoder(ProphetNetDecoder):
class XLMProphetNetModel(ProphetNetModel):
r"""
This class overrides [`ProphetNetModel`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`ProphetNetModel`]. Please check the superclass for the appropriate documentation alongside
usage examples.
Example:
......@@ -107,8 +107,8 @@ class XLMProphetNetModel(ProphetNetModel):
class XLMProphetNetForConditionalGeneration(ProphetNetForConditionalGeneration):
r"""
This class overrides [`ProphetNetForConditionalGeneration`]. Please check the superclass for the
appropriate documentation alongside usage examples.
This class overrides [`ProphetNetForConditionalGeneration`]. Please check the superclass for the appropriate
documentation alongside usage examples.
Example:
......@@ -131,8 +131,8 @@ class XLMProphetNetForConditionalGeneration(ProphetNetForConditionalGeneration):
class XLMProphetNetForCausalLM(ProphetNetForCausalLM):
r"""
This class overrides [`ProphetNetForCausalLM`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`ProphetNetForCausalLM`]. Please check the superclass for the appropriate documentation
alongside usage examples.
Example:
......
......@@ -59,8 +59,8 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on
[SentencePiece](https://github.com/google/sentencepiece).
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
......@@ -80,8 +80,8 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the `sep_token`.
</Tip>
......@@ -103,7 +103,9 @@ class XLMProphetNetTokenizer(PreTrainedTokenizer):
additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
Additional special tokens used by the tokenizer.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set:
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
to set:
- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
......
......@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" XLM-RoBERTa configuration """
""" XLM-RoBERTa configuration"""
from collections import OrderedDict
from typing import Mapping
......@@ -36,8 +36,8 @@ XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class XLMRobertaConfig(RobertaConfig):
"""
This class overrides [`RobertaConfig`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`RobertaConfig`]. Please check the superclass for the appropriate documentation alongside
usage examples.
"""
model_type = "xlm-roberta"
......
......@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" TF 2.0 XLM-RoBERTa model. """
""" TF 2.0 XLM-RoBERTa model."""
from ...file_utils import add_start_docstrings
from ...utils import logging
......@@ -38,13 +38,13 @@ TF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [
XLM_ROBERTA_START_DOCSTRING = r"""
This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving, resizing the input
embeddings, pruning heads etc.)
This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use
it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage
and behavior.
This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and
behavior.
<Tip>
......@@ -53,11 +53,11 @@ XLM_ROBERTA_START_DOCSTRING = r"""
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using [`tf.keras.Model.fit`] method which currently requires having all
the tensors in the first argument of the model call function: `model(inputs)`.
This second option is useful when using [`tf.keras.Model.fit`] method which currently requires having all the
tensors in the first argument of the model call function: `model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in
the first positional argument:
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the
first positional argument:
- a single Tensor with `input_ids` only and nothing else: `model(inputs_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
......@@ -70,8 +70,7 @@ XLM_ROBERTA_START_DOCSTRING = r"""
Parameters:
config ([`XLMRobertaConfig`]): Model configuration class with all the parameters of the
model. Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model
weights.
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
......@@ -81,8 +80,8 @@ XLM_ROBERTA_START_DOCSTRING = r"""
)
class TFXLMRobertaModel(TFRobertaModel):
"""
This class overrides [`TFRobertaModel`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`TFRobertaModel`]. Please check the superclass for the appropriate documentation alongside
usage examples.
"""
config_class = XLMRobertaConfig
......@@ -94,21 +93,21 @@ class TFXLMRobertaModel(TFRobertaModel):
)
class XLMRobertaForCausalLM(TFRobertaForCausalLM):
"""
This class overrides [`TFRobertaForCausalLM`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`TFRobertaForCausalLM`]. Please check the superclass for the appropriate documentation
alongside usage examples.
"""
config_class = XLMRobertaConfig
@add_start_docstrings(
"""XLM-RoBERTa Model with a `language modeling` head on top. """,
"""XLM-RoBERTa Model with a `language modeling` head on top.""",
XLM_ROBERTA_START_DOCSTRING,
)
class TFXLMRobertaForMaskedLM(TFRobertaForMaskedLM):
"""
This class overrides [`TFRobertaForMaskedLM`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`TFRobertaForMaskedLM`]. Please check the superclass for the appropriate documentation
alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -123,8 +122,8 @@ class TFXLMRobertaForMaskedLM(TFRobertaForMaskedLM):
)
class TFXLMRobertaForSequenceClassification(TFRobertaForSequenceClassification):
"""
This class overrides [`TFRobertaForSequenceClassification`]. Please check the superclass for the
appropriate documentation alongside usage examples.
This class overrides [`TFRobertaForSequenceClassification`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -139,8 +138,8 @@ class TFXLMRobertaForSequenceClassification(TFRobertaForSequenceClassification):
)
class TFXLMRobertaForTokenClassification(TFRobertaForTokenClassification):
"""
This class overrides [`TFRobertaForTokenClassification`]. Please check the superclass for the
appropriate documentation alongside usage examples.
This class overrides [`TFRobertaForTokenClassification`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -155,8 +154,8 @@ layers on top of the hidden-states output to compute `span start logits` and `sp
)
class TFXLMRobertaForQuestionAnswering(TFRobertaForQuestionAnswering):
"""
This class overrides [`TFRobertaForQuestionAnswering`]. Please check the superclass for
the appropriate documentation alongside usage examples.
This class overrides [`TFRobertaForQuestionAnswering`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -171,8 +170,8 @@ class TFXLMRobertaForQuestionAnswering(TFRobertaForQuestionAnswering):
)
class TFXLMRobertaForMultipleChoice(TFRobertaForMultipleChoice):
"""
This class overrides [`TFRobertaForMultipleChoice`]. Please check the superclass for the
appropriate documentation alongside usage examples.
This class overrides [`TFRobertaForMultipleChoice`]. Please check the superclass for the appropriate documentation
alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch XLM-RoBERTa model. """
"""PyTorch XLM-RoBERTa model."""
from ...file_utils import add_start_docstrings
from ...utils import logging
......@@ -44,19 +44,18 @@ XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [
XLM_ROBERTA_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`XLMRobertaConfig`]): Model configuration class with all the parameters of the
model. Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model
weights.
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
......@@ -66,8 +65,8 @@ XLM_ROBERTA_START_DOCSTRING = r"""
)
class XLMRobertaModel(RobertaModel):
"""
This class overrides [`RobertaModel`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`RobertaModel`]. Please check the superclass for the appropriate documentation alongside
usage examples.
"""
config_class = XLMRobertaConfig
......@@ -79,21 +78,21 @@ class XLMRobertaModel(RobertaModel):
)
class XLMRobertaForCausalLM(RobertaForCausalLM):
"""
This class overrides [`RobertaForCausalLM`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`RobertaForCausalLM`]. Please check the superclass for the appropriate documentation
alongside usage examples.
"""
config_class = XLMRobertaConfig
@add_start_docstrings(
"""XLM-RoBERTa Model with a `language modeling` head on top. """,
"""XLM-RoBERTa Model with a `language modeling` head on top.""",
XLM_ROBERTA_START_DOCSTRING,
)
class XLMRobertaForMaskedLM(RobertaForMaskedLM):
"""
This class overrides [`RobertaForMaskedLM`]. Please check the superclass for the appropriate
documentation alongside usage examples.
This class overrides [`RobertaForMaskedLM`]. Please check the superclass for the appropriate documentation
alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -108,8 +107,8 @@ class XLMRobertaForMaskedLM(RobertaForMaskedLM):
)
class XLMRobertaForSequenceClassification(RobertaForSequenceClassification):
"""
This class overrides [`RobertaForSequenceClassification`]. Please check the superclass for the
appropriate documentation alongside usage examples.
This class overrides [`RobertaForSequenceClassification`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -124,8 +123,8 @@ class XLMRobertaForSequenceClassification(RobertaForSequenceClassification):
)
class XLMRobertaForMultipleChoice(RobertaForMultipleChoice):
"""
This class overrides [`RobertaForMultipleChoice`]. Please check the superclass for the
appropriate documentation alongside usage examples.
This class overrides [`RobertaForMultipleChoice`]. Please check the superclass for the appropriate documentation
alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -140,8 +139,8 @@ class XLMRobertaForMultipleChoice(RobertaForMultipleChoice):
)
class XLMRobertaForTokenClassification(RobertaForTokenClassification):
"""
This class overrides [`RobertaForTokenClassification`]. Please check the superclass for the
appropriate documentation alongside usage examples.
This class overrides [`RobertaForTokenClassification`]. Please check the superclass for the appropriate
documentation alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -156,8 +155,8 @@ class XLMRobertaForTokenClassification(RobertaForTokenClassification):
)
class XLMRobertaForQuestionAnswering(RobertaForQuestionAnswering):
"""
This class overrides [`RobertaForQuestionAnswering`]. Please check the superclass for the
appropriate documentation alongside usage examples.
This class overrides [`RobertaForQuestionAnswering`]. Please check the superclass for the appropriate documentation
alongside usage examples.
"""
config_class = XLMRobertaConfig
......@@ -57,8 +57,8 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on
[SentencePiece](https://github.com/google/sentencepiece).
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
......@@ -78,8 +78,8 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the `sep_token`.
</Tip>
......@@ -101,7 +101,9 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
Additional special tokens used by the tokenizer.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set:
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
to set:
- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
......
......@@ -67,10 +67,11 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
class XLMRobertaTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" XLM-RoBERTa tokenizer (backed by HuggingFace's *tokenizers* library). Adapted from
[`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).
[`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on
[BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
......@@ -90,8 +91,8 @@ class XLMRobertaTokenizerFast(PreTrainedTokenizerFast):
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the `sep_token`.
</Tip>
......
......@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" XLNet configuration """
""" XLNet configuration"""
import warnings
......@@ -31,19 +31,18 @@ XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class XLNetConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of an [`XLNetModel`] or a
[`TFXLNetModel`]. It is used to instantiate an XLNet model according to the specified arguments,
defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration
to that of the [xlnet-large-cased](https://huggingface.co/xlnet-large-cased) architecture.
This is the configuration class to store the configuration of an [`XLNetModel`] or a [`TFXLNetModel`]. It is used to
instantiate an XLNet model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the
[xlnet-large-cased](https://huggingface.co/xlnet-large-cased) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 32000):
Vocabulary size of the XLNet model. Defines the number of different tokens that can be represented by the
`input_ids` passed when calling [`XLNetModel`] or
[`TFXLNetModel`].
`input_ids` passed when calling [`XLNetModel`] or [`TFXLNetModel`].
d_model (`int`, *optional*, defaults to 1024):
Dimensionality of the encoder layers and the pooler layer.
n_layer (`int`, *optional*, defaults to 24):
......@@ -53,8 +52,8 @@ class XLNetConfig(PretrainedConfig):
d_inner (`int`, *optional*, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
ff_activation (`str` or `Callable`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string). If string, `"gelu"`, `"relu"`,
`"silu"` and `"gelu_new"` are supported.
The non-linear activation function (function or string). If string, `"gelu"`, `"relu"`, `"silu"` and
`"gelu_new"` are supported.
untie_r (`bool`, *optional*, defaults to `True`):
Whether or not to untie relative position biases
attn_type (`str`, *optional*, defaults to `"bi"`):
......@@ -67,12 +66,13 @@ class XLNetConfig(PretrainedConfig):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
mem_len (`int` or `None`, *optional*):
The number of tokens to cache. The key/value pairs that have already been pre-computed in a previous
forward pass won't be re-computed. See the [quickstart](https://huggingface.co/transformers/quickstart.html#using-the-past) for more information.
forward pass won't be re-computed. See the
[quickstart](https://huggingface.co/transformers/quickstart.html#using-the-past) for more information.
reuse_len (`int`, *optional*):
The number of tokens in the current batch to be cached and reused in the future.
bi_data (`bool`, *optional*, defaults to `False`):
Whether or not to use bidirectional input pipeline. Usually set to `True` during pretraining and
`False` during finetuning.
Whether or not to use bidirectional input pipeline. Usually set to `True` during pretraining and `False`
during finetuning.
clamp_len (`int`, *optional*, defaults to -1):
Clamp all relative distances larger than clamp_len. Setting this attribute to -1 means no clamping.
same_length (`bool`, *optional*, defaults to `False`):
......@@ -114,10 +114,12 @@ class XLNetConfig(PretrainedConfig):
<Tip>
For pretraining, it is recommended to set `use_mems_train` to `True`. For fine-tuning, it is
recommended to set `use_mems_train` to `False` as discussed [here](https://github.com/zihangdai/xlnet/issues/41#issuecomment-505102587). If `use_mems_train` is set
to `True`, one has to make sure that the train batches are correctly pre-processed, *e.g.*
`batch_1 = [[This line is], [This is the]]` and `batch_2 = [[ the first line], [ second line]]` and that all batches are of equal size.
For pretraining, it is recommended to set `use_mems_train` to `True`. For fine-tuning, it is recommended to
set `use_mems_train` to `False` as discussed
[here](https://github.com/zihangdai/xlnet/issues/41#issuecomment-505102587). If `use_mems_train` is set to
`True`, one has to make sure that the train batches are correctly pre-processed, *e.g.* `batch_1 = [[This
line is], [This is the]]` and `batch_2 = [[ the first line], [ second line]]` and that all batches are of
equal size.
</Tip>
......
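The `use_mems_train` tip above describes batches whose rows continue the same text from one batch to the next. A minimal sketch of that pre-processing in plain Python follows; the helper name `make_mem_batches`, the word-level "tokens", and the trailing `"."` added to make both documents the same length are assumptions for illustration, not part of the library.

```python
def make_mem_batches(documents, segment_len):
    """Split tokenized documents into consecutive, equally sized batches.

    Row i of every batch continues the text of row i in the previous batch,
    which is the layout the tip asks for when `use_mems_train=True`.
    Trailing tokens that do not fill a whole segment are dropped so that
    all batches have the same size.
    """
    # Only keep as many segments as the shortest document can provide.
    num_segments = min(len(doc) // segment_len for doc in documents)
    batches = []
    for step in range(num_segments):
        start = step * segment_len
        batches.append([doc[start:start + segment_len] for doc in documents])
    return batches


# Hypothetical, already "tokenized" documents (one word per token).
docs = [
    ["This", "line", "is", "the", "first", "line"],
    ["This", "is", "the", "second", "line", "."],
]
batches = make_mem_batches(docs, segment_len=3)
batch_1, batch_2 = batches
```

Here `batch_1` holds the first three tokens of each document and `batch_2` the next three, so the memory cached for row 0 of `batch_1` belongs to the same text as row 0 of `batch_2`.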
......@@ -832,19 +832,20 @@ class TFXLNetModelOutput(ModelOutput):
last_hidden_state (`tf.Tensor` of shape `(batch_size, num_predict, hidden_size)`):
Sequence of hidden-states at the last layer of the model.
`num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then
`num_predict` corresponds to `sequence_length`.
`num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict`
corresponds to `sequence_length`.
mems (`List[tf.Tensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -867,19 +868,20 @@ class TFXLNetLMHeadModelOutput(ModelOutput):
logits (`tf.Tensor` of shape `(batch_size, num_predict, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
`num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then
`num_predict` corresponds to `sequence_length`.
`num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict`
corresponds to `sequence_length`.
mems (`List[tf.Tensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -903,16 +905,17 @@ class TFXLNetForSequenceClassificationOutput(ModelOutput):
logits (`tf.Tensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
mems (`List[tf.Tensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -936,16 +939,17 @@ class TFXLNetForTokenClassificationOutput(ModelOutput):
logits (`tf.Tensor` of shape `(batch_size, sequence_length, config.num_labels)`):
Classification scores (before SoftMax).
mems (`List[tf.Tensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -971,16 +975,17 @@ class TFXLNetForMultipleChoiceOutput(ModelOutput):
Classification scores (before SoftMax).
mems (`List[tf.Tensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -1006,16 +1011,17 @@ class TFXLNetForQuestionAnsweringSimpleOutput(ModelOutput):
end_logits (`tf.Tensor` of shape `(batch_size, sequence_length,)`):
Span-end scores (before SoftMax).
mems (`List[tf.Tensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -1031,13 +1037,13 @@ class TFXLNetForQuestionAnsweringSimpleOutput(ModelOutput):
XLNET_START_DOCSTRING = r"""
This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the
generic methods the library implements for all its models (such as downloading or saving, resizing the input
embeddings, pruning heads etc.)
This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use
it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage
and behavior.
This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and
behavior.
<Tip>
......@@ -1046,11 +1052,11 @@ XLNET_START_DOCSTRING = r"""
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using the [`tf.keras.Model.fit`] method which currently requires having all
the tensors in the first argument of the model call function: `model(inputs)`.
This second option is useful when using the [`tf.keras.Model.fit`] method which currently requires having all the
tensors in the first argument of the model call function: `model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in
the first positional argument:
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the
first positional argument:
- a single Tensor with `input_ids` only and nothing else: `model(input_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
......@@ -1063,8 +1069,7 @@ XLNET_START_DOCSTRING = r"""
Parameters:
config ([`XLNetConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model
weights.
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
XLNET_INPUTS_DOCSTRING = r"""
......@@ -1072,9 +1077,8 @@ XLNET_INPUTS_DOCSTRING = r"""
input_ids (`torch.LongTensor` of shape `({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`XLNetTokenizer`]. See
[`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for
details.
Indices can be obtained using [`XLNetTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
......@@ -1086,8 +1090,8 @@ XLNET_INPUTS_DOCSTRING = r"""
[What are attention masks?](../glossary#attention-mask)
mems (`List[torch.FloatTensor]` of length `config.n_layers`):
Contains pre-computed hidden-states (see `mems` output below). Can be used to speed up sequential
decoding. The token ids which have their past given to this model should not be passed as `input_ids`
as they have already been computed.
decoding. The token ids which have their past given to this model should not be passed as `input_ids` as
they have already been computed.
`use_mems` has to be set to `True` to make use of `mems`.
perm_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length, sequence_length)`, *optional*):
......@@ -1099,19 +1103,20 @@ XLNET_INPUTS_DOCSTRING = r"""
If not set, each token attends to all the others (full bidirectional attention). Only used during
pretraining (to define factorization order) or for sequential decoding (generation).
target_mapping (`torch.FloatTensor` of shape `(batch_size, num_predict, sequence_length)`, *optional*):
Mask to indicate the output tokens to use. If `target_mapping[k, i, j] = 1`, the i-th predict in batch k
is on the j-th token. Only used during pretraining for partial prediction or for sequential decoding
Mask to indicate the output tokens to use. If `target_mapping[k, i, j] = 1`, the i-th predict in batch k is
on the j-th token. Only used during pretraining for partial prediction or for sequential decoding
(generation).
token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
input_mask (`torch.FloatTensor` of shape `{0}`, *optional*):
Mask to avoid performing attention on padding token indices. Negative of `attention_mask`, i.e. with 0
for real tokens and 1 for padding which is kept for compatibility with the original code base.
Mask to avoid performing attention on padding token indices. Negative of `attention_mask`, i.e. with 0 for
real tokens and 1 for padding which is kept for compatibility with the original code base.
Mask values selected in `[0, 1]`:
......@@ -1126,9 +1131,9 @@ XLNET_INPUTS_DOCSTRING = r"""
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
......@@ -1304,7 +1309,8 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
):
r"""
labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the cross entropy classification loss. Indices should be in `[0, ..., config.vocab_size - 1]`.
Labels for computing the cross entropy classification loss. Indices should be in `[0, ...,
config.vocab_size - 1]`.
Return:
......@@ -1445,8 +1451,9 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
):
r"""
labels (`tf.Tensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss).
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
inputs = input_processing(
func=self.call,
......@@ -1570,8 +1577,8 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
):
r"""
labels (`tf.Tensor` of shape `(batch_size,)`, *optional*):
Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices - 1]` where `num_choices` is the size of the second dimension of the input tensors. (See
`input_ids` above)
Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices - 1]`
where `num_choices` is the size of the second dimension of the input tensors. (See `input_ids` above)
"""
inputs = input_processing(
......@@ -1826,12 +1833,12 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
r"""
start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
inputs = input_processing(
func=self.call,
......
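Two of the inputs documented in the hunks above, `input_mask` and `target_mapping`, are easy to get backwards. The following is a rough sketch using nested Python lists in place of tensors; the helper names `to_input_mask` and `last_token_target_mapping` are hypothetical and only mirror the shapes and conventions stated in the docstring.

```python
def to_input_mask(attention_mask):
    """Return the negative of `attention_mask`.

    `attention_mask` uses 1 for real tokens and 0 for padding;
    `input_mask` uses 0 for real tokens and 1 for padding.
    """
    return [[1.0 - float(value) for value in row] for row in attention_mask]


def last_token_target_mapping(batch_size, sequence_length):
    """Build a `target_mapping` of shape (batch_size, 1, sequence_length).

    `target_mapping[k][i][j] == 1` means the i-th prediction in batch k is
    on the j-th token. Here there is a single prediction (num_predict == 1)
    placed on the last token of every sequence.
    """
    mapping = [[[0.0] * sequence_length] for _ in range(batch_size)]
    for k in range(batch_size):
        mapping[k][0][sequence_length - 1] = 1.0
    return mapping


attention_mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
input_mask = to_input_mask(attention_mask)
target_mapping = last_token_target_mapping(batch_size=2, sequence_length=4)
```

With a real model these lists would be converted to float tensors of the same shapes before being passed in.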
......@@ -591,19 +591,20 @@ class XLNetModelOutput(ModelOutput):
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_predict, hidden_size)`):
Sequence of hidden-states at the last layer of the model.
`num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then
`num_predict` corresponds to `sequence_length`.
`num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict`
corresponds to `sequence_length`.
mems (`List[torch.FloatTensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -626,19 +627,20 @@ class XLNetLMHeadModelOutput(ModelOutput):
logits (`torch.FloatTensor` of shape `(batch_size, num_predict, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
`num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then
`num_predict` corresponds to `sequence_length`.
`num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict`
corresponds to `sequence_length`.
mems (`List[torch.FloatTensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -662,16 +664,17 @@ class XLNetForSequenceClassificationOutput(ModelOutput):
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
mems (`List[torch.FloatTensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -695,16 +698,17 @@ class XLNetForTokenClassificationOutput(ModelOutput):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`):
Classification scores (before SoftMax).
mems (`List[torch.FloatTensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -730,16 +734,17 @@ class XLNetForMultipleChoiceOutput(ModelOutput):
Classification scores (before SoftMax).
mems (`List[torch.FloatTensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -765,16 +770,17 @@ class XLNetForQuestionAnsweringSimpleOutput(ModelOutput):
end_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length,)`):
Span-end scores (before SoftMax).
mems (`List[torch.FloatTensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -809,16 +815,17 @@ class XLNetForQuestionAnsweringOutput(ModelOutput):
cls_logits (`torch.FloatTensor` of shape `(batch_size,)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
Log probabilities for the `is_impossible` label of the answers.
mems (`List[torch.FloatTensor]` of length `config.n_layers`):
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding.
The token ids which have their past given to this model should not be passed as `input_ids` as they
have already been computed.
Contains pre-computed hidden-states. Can be used (see `mems` input) to speed up sequential decoding. The
token ids which have their past given to this model should not be passed as `input_ids` as they have
already been computed.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
......@@ -837,19 +844,18 @@ class XLNetForQuestionAnsweringOutput(ModelOutput):
XLNET_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
general usage and behavior.
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`XLNetConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model
weights.
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
XLNET_INPUTS_DOCSTRING = r"""
......@@ -857,9 +863,8 @@ XLNET_INPUTS_DOCSTRING = r"""
input_ids (`torch.LongTensor` of shape `({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`XLNetTokenizer`]. See
[`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for
details.
Indices can be obtained using [`XLNetTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
......@@ -871,8 +876,8 @@ XLNET_INPUTS_DOCSTRING = r"""
[What are attention masks?](../glossary#attention-mask)
mems (`List[torch.FloatTensor]` of length `config.n_layers`):
Contains pre-computed hidden-states (see `mems` output below). Can be used to speed up sequential
decoding. The token ids which have their past given to this model should not be passed as `input_ids`
as they have already been computed.
decoding. The token ids which have their past given to this model should not be passed as `input_ids` as
they have already been computed.
`use_mems` has to be set to `True` to make use of `mems`.
perm_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length, sequence_length)`, *optional*):
......@@ -884,19 +889,20 @@ XLNET_INPUTS_DOCSTRING = r"""
If not set, each token attends to all the others (full bidirectional attention). Only used during
pretraining (to define factorization order) or for sequential decoding (generation).
target_mapping (`torch.FloatTensor` of shape `(batch_size, num_predict, sequence_length)`, *optional*):
Mask to indicate the output tokens to use. If `target_mapping[k, i, j] = 1`, the i-th predict in batch k
is on the j-th token. Only used during pretraining for partial prediction or for sequential decoding
Mask to indicate the output tokens to use. If `target_mapping[k, i, j] = 1`, the i-th predict in batch k is
on the j-th token. Only used during pretraining for partial prediction or for sequential decoding
(generation).
token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
input_mask (`torch.FloatTensor` of shape `{0}`, *optional*):
Mask to avoid performing attention on padding token indices. Negative of `attention_mask`, i.e. with 0
for real tokens and 1 for padding which is kept for compatibility with the original code base.
Mask to avoid performing attention on padding token indices. Negative of `attention_mask`, i.e. with 0 for
real tokens and 1 for padding which is kept for compatibility with the original code base.
Mask values selected in `[0, 1]`:
......@@ -911,9 +917,9 @@ XLNET_INPUTS_DOCSTRING = r"""
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
......@@ -969,8 +975,7 @@ class XLNetModel(XLNetPreTrainedModel):
::
same_length=False: same_length=True:
<mlen > < qlen > <mlen > < qlen >
same_length=False: same_length=True: <mlen > < qlen > <mlen > < qlen >
^ [0 0 0 0 0 1 1 1 1] [0 0 0 0 0 1 1 1 1]
[0 0 0 0 0 0 1 1 1] [1 0 0 0 0 0 1 1 1]
qlen [0 0 0 0 0 0 0 1 1] [1 1 0 0 0 0 0 1 1]
......@@ -1381,12 +1386,11 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
`target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.
The labels should correspond to the masked input words that should be predicted and depends on
`target_mapping`. Note in order to perform standard auto-regressive language modeling a *<mask>* token
has to be added to the `input_ids` (see the `prepare_inputs_for_generation` function and examples
below)
`target_mapping`. Note in order to perform standard auto-regressive language modeling a *<mask>* token has
to be added to the `input_ids` (see the `prepare_inputs_for_generation` function and examples below)
Indices are selected in `[-100, 0, ..., config.vocab_size]`. All labels set to `-100` are ignored, the
loss is only computed for labels in `[0, ..., config.vocab_size]`
Indices are selected in `[-100, 0, ..., config.vocab_size]`. All labels set to `-100` are ignored, the loss
is only computed for labels in `[0, ..., config.vocab_size]`
Return:
......@@ -1465,8 +1469,8 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
def _reorder_cache(mems: List[torch.Tensor], beam_idx: torch.Tensor) -> List[torch.Tensor]:
"""
This function is used to re-order the `mems` cache if [`~PreTrainedModel.beam_search`] or
[`~PreTrainedModel.beam_sample`] is called. This is required to match `mems` with the
correct beam_idx at every generation step.
[`~PreTrainedModel.beam_sample`] is called. This is required to match `mems` with the correct beam_idx at every
generation step.
"""
return [layer_past.index_select(1, beam_idx.to(layer_past.device)) for layer_past in mems]
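As a rough illustration of what the `index_select(1, beam_idx)` call above does, the sketch below mimics it with nested Python lists instead of tensors. It assumes each `mems` entry is laid out as `(mem_len, batch_size * num_beams, hidden_size)`, so that dimension 1 indexes beams. The helper name `reorder_mems` and the toy values are hypothetical and not part of the library.

```python
from typing import List

# Stand-in type: one layer is a nested list shaped [mem_len][num_beams][hidden_size].
Layer = List[List[List[float]]]


def reorder_mems(mems: List[Layer], beam_idx: List[int]) -> List[Layer]:
    """Pick beams along dimension 1 of every layer, like `index_select(1, beam_idx)`."""
    return [[[step[b] for b in beam_idx] for step in layer] for layer in mems]


# One layer, two memory steps, three beams, hidden size 1.
mems = [[[[0.0], [1.0], [2.0]], [[10.0], [11.0], [12.0]]]]
# After this generation step, beam 2 was kept twice and beam 0 once.
reordered = reorder_mems(mems, [2, 2, 0])
print(reordered)
```

Each memory step keeps its position along dimension 0; only the beam axis is permuted, so the cached states stay aligned with the hypotheses that survived.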
......@@ -1518,8 +1522,9 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss).
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
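The rule in the `labels` docstring above can be sketched without PyTorch. This is only an illustration under stated assumptions: `sequence_classification_loss` is a hypothetical helper, logits and labels are plain Python lists, and the library itself presumably relies on `torch.nn` loss modules rather than this hand-written arithmetic.

```python
import math
from typing import List, Sequence


def sequence_classification_loss(
    logits: List[Sequence[float]], labels: Sequence[float], num_labels: int
) -> float:
    if num_labels == 1:
        # Regression: mean squared error between the single score and the target.
        return sum((row[0] - y) ** 2 for row, y in zip(logits, labels)) / len(labels)
    # Classification: mean cross-entropy, computed as log-sum-exp minus the true-class logit.
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[int(y)]
    return total / len(labels)


mse = sequence_classification_loss([[0.5], [2.0]], [1.0, 2.0], num_labels=1)
ce = sequence_classification_loss([[0.0, 0.0]], [1], num_labels=2)
```

With a single label the targets are real-valued scores; with more than one label they are class indices in `[0, ..., num_labels - 1]`.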
......@@ -1625,8 +1630,8 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices]` where *num_choices* is the size of the second dimension of the input tensors. (see
*input_ids* above)
Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices]`
where *num_choices* is the size of the second dimension of the input tensors. (see *input_ids* above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -1722,7 +1727,8 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
`input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -1827,12 +1833,12 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......@@ -1939,12 +1945,12 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the
sequence are not taken into account for computing the loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
is_impossible (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels whether a question has an answer or no answer (SQuAD 2.0)
cls_index (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
......
......@@ -55,8 +55,8 @@ class XLNetTokenizer(PreTrainedTokenizer):
"""
Construct an XLNet tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
......@@ -83,8 +83,8 @@ class XLNetTokenizer(PreTrainedTokenizer):
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the `sep_token`.
</Tip>
......@@ -106,7 +106,9 @@ class XLNetTokenizer(PreTrainedTokenizer):
additional_special_tokens (`List[str]`, *optional*, defaults to `["<eop>", "<eod>"]`):
Additional special tokens used by the tokenizer.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set:
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
to set:
- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
......@@ -323,8 +325,7 @@ class XLNetTokenizer(PreTrainedTokenizer):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given
sequence(s).
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
"""
sep = [self.sep_token_id]
cls_segment_id = [2]
......
......@@ -63,10 +63,11 @@ SEG_ID_PAD = 4
class XLNetTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" XLNet tokenizer (backed by HuggingFace's *tokenizers* library). Based on [Unigram](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=unigram#models).
Construct a "fast" XLNet tokenizer (backed by HuggingFace's *tokenizers* library). Based on
[Unigram](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=unigram#models).
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
......@@ -93,8 +94,8 @@ class XLNetTokenizerFast(PreTrainedTokenizerFast):
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the `sep_token`.
</Tip>
......@@ -217,8 +218,7 @@ class XLNetTokenizerFast(PreTrainedTokenizerFast):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given
sequence(s).
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
"""
sep = [self.sep_token_id]
cls_segment_id = [2]
......
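The token type ID construction that the last hunk cuts off can be sketched as follows. The function body is elided in the diff, so this is an assumption-laden illustration: `xlnet_token_type_ids` is a hypothetical standalone helper, and the layout (sequence A, separator, optional sequence B, separator, then the classification token last) is assumed from how XLNet formats its inputs. Only the segment value `2` comes from `cls_segment_id = [2]` above.

```python
from typing import List, Optional

CLS_SEGMENT_ID = 2  # mirrors `cls_segment_id = [2]` in the diff


def xlnet_token_type_ids(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    # Assumed layout: A <sep> <cls> for one sequence, A <sep> B <sep> <cls> for a pair.
    ids = [0] * (len(token_ids_0) + 1)  # sequence A plus its separator
    if token_ids_1 is not None:
        ids += [1] * (len(token_ids_1) + 1)  # sequence B plus its separator
    return ids + [CLS_SEGMENT_ID]  # the classification token gets its own segment id


single = xlnet_token_type_ids([11, 12, 13])
pair = xlnet_token_type_ids([11, 12, 13], [21, 22])
```

Under these assumptions a three-token sequence yields five IDs, and adding a two-token second sequence yields eight.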