Unverified Commit 87e6e4fe authored by Sylvain Gugger, committed by GitHub

Doc styler v2 (#14950)

* New doc styler

* Fix issue with args at the start

* Code sample fixes

* Style code examples in MDX

* Fix more patterns

* Typo

* Typo

* More patterns

* Do without black for now

* Get more info in error

* Docstring style

* Re-enable check

* Quality

* Fix add_end_docstring decorator

* Fix docstring
parent c1138273
@@ -38,27 +38,25 @@ class BeitFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
    r"""
    Constructs a BEiT feature extractor.

    This feature extractor inherits from [`~feature_extraction_utils.FeatureExtractionMixin`] which contains most of
    the main methods. Users should refer to this superclass for more information regarding those methods.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the input to a certain `size`.
        size (`int` or `Tuple(int)`, *optional*, defaults to 256):
            Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an
            integer is provided, then the input will be resized to (size, size). Only has an effect if `do_resize`
            is set to `True`.
        resample (`int`, *optional*, defaults to `PIL.Image.BICUBIC`):
            An optional resampling filter. This can be one of `PIL.Image.NEAREST`, `PIL.Image.BOX`,
            `PIL.Image.BILINEAR`, `PIL.Image.HAMMING`, `PIL.Image.BICUBIC` or `PIL.Image.LANCZOS`. Only has an
            effect if `do_resize` is set to `True`.
        do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to crop the input at the center. If the input size is smaller than `crop_size` along any edge,
            the image is padded with 0's and then center cropped.
        crop_size (`int`, *optional*, defaults to 224):
            Desired output size when applying center-cropping. Only has an effect if `do_center_crop` is set to
            `True`.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether or not to normalize the input with `image_mean` and `image_std`.
        image_mean (`List[int]`, defaults to `[0.5, 0.5, 0.5]`):
...
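For orientation, here is a minimal usage sketch of the feature extractor whose arguments are documented above. The argument values mirror the documented defaults; the sample image URL is illustrative and not part of this diff.

```python
from PIL import Image
import requests

from transformers import BeitFeatureExtractor

# Spell out the documented defaults explicitly.
feature_extractor = BeitFeatureExtractor(
    do_resize=True,
    size=256,
    do_center_crop=True,
    crop_size=224,
    do_normalize=True,
)

# Any PIL image works; this COCO sample URL is only an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 224, 224])
```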
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch BEiT model."""

import collections.abc

@@ -56,12 +56,13 @@ class BeitModelOutputWithPooling(BaseModelOutputWithPooling):
        *config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token
        will be returned.
    hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
        Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
        shape `(batch_size, sequence_length, hidden_size)`.

        Hidden-states of the model at the output of each layer plus the initial embedding outputs.
    attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
        Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`.

        Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
        heads.

@@ -547,15 +548,14 @@ class BeitPreTrainedModel(PreTrainedModel):
BEIT_START_DOCSTRING = r"""
    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
    as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
    behavior.

    Parameters:
        config ([`BeitConfig`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""

BEIT_INPUTS_DOCSTRING = r"""
@@ -737,8 +737,9 @@ class BeitForMaskedImageModeling(BeitPreTrainedModel):
            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

        Returns:

@@ -824,8 +825,9 @@ class BeitForImageClassification(BeitPreTrainedModel):
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

        Returns:

@@ -1158,8 +1160,8 @@ class BeitForSemanticSegmentation(BeitPreTrainedModel):
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
            Ground truth semantic segmentation maps for computing the loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels > 1`, a classification loss is computed (Cross-Entropy).

        Returns:
...
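As a companion to the BEiT docstrings above, a hedged sketch of the image-classification call they describe. The checkpoint name is an assumption; any BEiT image-classification checkpoint from the Hub would do.

```python
import torch
import requests
from PIL import Image

from transformers import BeitFeatureExtractor, BeitForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Illustrative checkpoint; substitute any BEiT classification checkpoint.
feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# logits has shape (batch_size, config.num_labels); labels, if provided, must lie
# in [0, ..., config.num_labels - 1] as the docstring above states.
predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```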
@@ -54,23 +54,24 @@ class FlaxBeitModelOutputWithPooling(FlaxBaseModelOutputWithPooling):
        *config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token
        will be returned.
    hidden_states (`tuple(jnp.ndarray)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
        Tuple of `jnp.ndarray` (one for the output of the embeddings + one for the output of each layer) of shape
        `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus
        the initial embedding outputs.
    attentions (`tuple(jnp.ndarray)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
        Tuple of `jnp.ndarray` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
        the self-attention heads.
"""

BEIT_START_DOCSTRING = r"""
    This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its model (such as downloading, saving and converting weights from PyTorch models)

    This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module)
    subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to
    general usage and behavior.

    Finally, this model supports inherent JAX features such as:

@@ -82,11 +83,10 @@ BEIT_START_DOCSTRING = r"""
    Parameters:
        config ([`BeitConfig`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~FlaxPreTrainedModel.from_pretrained`] method to load the model weights.
        dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`):
            The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on GPUs) and
            `jax.numpy.bfloat16` (on TPUs).

            This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If
            specified all the computation will be performed with the given `dtype`.

@@ -94,8 +94,8 @@ BEIT_START_DOCSTRING = r"""
            **Note that this only specifies the dtype of the computation and does not influence the dtype of model
            parameters.**

            If you wish to change the dtype of the model parameters, see [`~FlaxPreTrainedModel.to_fp16`] and
            [`~FlaxPreTrainedModel.to_bf16`].
"""

BEIT_INPUTS_DOCSTRING = r"""
...
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BERT model configuration"""

from collections import OrderedDict
from typing import Mapping

@@ -53,20 +53,19 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BertModel`] or a [`TFBertModel`]. It is used to
    instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the BERT
    [bert-base-uncased](https://huggingface.co/bert-base-uncased) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 30522):
            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`BertModel`] or [`TFBertModel`].
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
@@ -76,8 +75,8 @@ class BertConfig(PretrainedConfig):
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):

@@ -86,17 +85,17 @@ class BertConfig(PretrainedConfig):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (`int`, *optional*, defaults to 2):
            The vocabulary size of the `token_type_ids` passed when calling [`BertModel`] or [`TFBertModel`].
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
...
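To make the configuration arguments above concrete, a small sketch of building a `BertConfig` and a randomly initialized model from it. The values spelled out are the documented defaults.

```python
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    max_position_embeddings=512,
    type_vocab_size=2,
    layer_norm_eps=1e-12,
    position_embedding_type="absolute",
)

# Instantiating from a config creates random weights; use from_pretrained to load
# trained weights instead, as the docstring notes.
model = BertModel(config)
print(model.config.hidden_size)  # 768
```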
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch BERT model."""

import math

@@ -1130,7 +1130,7 @@ class BertForPreTraining(BertPreTrainedModel):
@add_start_docstrings(
    """Bert Model with a `language modeling` head on top for CLM fine-tuning.""", BERT_START_DOCSTRING
)
class BertLMHeadModel(BertPreTrainedModel):

@@ -1282,7 +1282,7 @@ class BertLMHeadModel(BertPreTrainedModel):
        return reordered_past


@add_start_docstrings("""Bert Model with a `language modeling` head on top.""", BERT_START_DOCSTRING)
class BertForMaskedLM(BertPreTrainedModel):

    _keys_to_ignore_on_load_unexpected = [r"pooler"]

@@ -1391,7 +1391,7 @@ class BertForMaskedLM(BertPreTrainedModel):
@add_start_docstrings(
    """Bert Model with a `next sentence prediction (classification)` head on top.""",
    BERT_START_DOCSTRING,
)
class BertForNextSentencePrediction(BertPreTrainedModel):
...
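Several of the hunks above only trim a trailing space from the strings passed to `add_start_docstrings`. As a rough illustration of why those strings matter, here is a simplified stand-in in the same spirit (not the library's exact implementation): the decorator concatenates the fragments in front of the decorated object's own docstring.

```python
def add_start_docstrings(*docstr):
    # Simplified stand-in: prepend shared docstring fragments to the decorated object.
    def decorator(obj):
        obj.__doc__ = "".join(docstr) + (obj.__doc__ or "")
        return obj
    return decorator


BERT_START_DOCSTRING = "\nShared BERT documentation goes here.\n"


@add_start_docstrings("Bert Model with a `language modeling` head on top.", BERT_START_DOCSTRING)
class DemoModel:
    """Class-specific documentation."""


# Prints the head description, then the shared block, then the class docstring.
print(DemoModel.__doc__)
```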
@@ -66,12 +66,13 @@ class FlaxBertForPreTrainingOutput(ModelOutput):
        Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
        before SoftMax).
    hidden_states (`tuple(jnp.ndarray)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
        Tuple of `jnp.ndarray` (one for the output of the embeddings + one for the output of each layer) of shape
        `(batch_size, sequence_length, hidden_size)`.

        Hidden-states of the model at the output of each layer plus the initial embedding outputs.
    attentions (`tuple(jnp.ndarray)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
        Tuple of `jnp.ndarray` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`.

        Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
        heads.

@@ -85,12 +86,12 @@ class FlaxBertForPreTrainingOutput(ModelOutput):
BERT_START_DOCSTRING = r"""
    This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its model (such as downloading, saving and converting weights from PyTorch models)

    This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module)
    subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to
    general usage and behavior.

    Finally, this model supports inherent JAX features such as:
@@ -102,11 +103,10 @@ BERT_START_DOCSTRING = r"""
    Parameters:
        config ([`BertConfig`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~FlaxPreTrainedModel.from_pretrained`] method to load the model weights.
        dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`):
            The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on GPUs) and
            `jax.numpy.bfloat16` (on TPUs).

            This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If
            specified all the computation will be performed with the given `dtype`.

@@ -114,11 +114,11 @@ BERT_START_DOCSTRING = r"""
            **Note that this only specifies the dtype of the computation and does not influence the dtype of model
            parameters.**

            If you wish to change the dtype of the model parameters, see [`~FlaxPreTrainedModel.to_fp16`] and
            [`~FlaxPreTrainedModel.to_bf16`].
        dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`):
            The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on GPUs) and
            `jax.numpy.bfloat16` (on TPUs).

            This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If
            specified all the computation will be performed with the given `dtype`.

@@ -126,8 +126,8 @@ BERT_START_DOCSTRING = r"""
            **Note that this only specifies the dtype of the computation and does not influence the dtype of model
            parameters.**

            If you wish to change the dtype of the model parameters, see [`~FlaxPreTrainedModel.to_fp16`] and
            [`~FlaxPreTrainedModel.to_bf16`].
"""

@@ -136,9 +136,8 @@ BERT_INPUTS_DOCSTRING = r"""
    input_ids (`numpy.ndarray` of shape `({0})`):
        Indices of input sequence tokens in the vocabulary.

        Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.encode`] and
        [`PreTrainedTokenizer.__call__`] for details.

        [What are input IDs?](../glossary#input-ids)
    attention_mask (`numpy.ndarray` of shape `({0})`, *optional*):
@@ -149,15 +148,18 @@ BERT_INPUTS_DOCSTRING = r"""
        [What are attention masks?](../glossary#attention-mask)
    token_type_ids (`numpy.ndarray` of shape `({0})`, *optional*):
        Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
        1]`:

        - 0 corresponds to a *sentence A* token,
        - 1 corresponds to a *sentence B* token.

        [What are token type IDs?](../glossary#token-type-ids)
    position_ids (`numpy.ndarray` of shape `({0})`, *optional*):
        Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
        config.max_position_embeddings - 1]`.
    head_mask (`numpy.ndarray` of shape `({0})`, `optional):
        Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`:

        - 1 indicates the head is **not masked**,
        - 0 indicates the head is **masked**.

@@ -909,7 +911,7 @@ class FlaxBertForMaskedLMModule(nn.Module):
        )


@add_start_docstrings("""Bert Model with a `language modeling` head on top.""", BERT_START_DOCSTRING)
class FlaxBertForMaskedLM(FlaxBertPreTrainedModel):
    module_class = FlaxBertForMaskedLMModule

@@ -968,7 +970,7 @@ class FlaxBertForNextSentencePredictionModule(nn.Module):
@add_start_docstrings(
    """Bert Model with a `next sentence prediction (classification)` head on top.""",
    BERT_START_DOCSTRING,
)
class FlaxBertForNextSentencePrediction(FlaxBertPreTrainedModel):
...
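A hedged sketch of the `dtype` behaviour the Flax docstring above describes: the argument only sets the computation dtype, while parameter dtypes are changed separately via `to_fp16`/`to_bf16`. The checkpoint name and the masked sentence are illustrative.

```python
import jax.numpy as jnp
from transformers import BertTokenizerFast, FlaxBertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# dtype affects only the computation; parameters keep their own dtype unless
# converted explicitly with model.to_fp16(...) or model.to_bf16(...).
model = FlaxBertForMaskedLM.from_pretrained("bert-base-uncased", dtype=jnp.float32)

inputs = tokenizer("The capital of France is [MASK].", return_tensors="np")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```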
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" TF 2.0 BERT model."""

import math
import warnings

@@ -938,12 +938,13 @@ class TFBertForPreTrainingOutput(ModelOutput):
        Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
        before SoftMax).
    hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
        Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
        `(batch_size, sequence_length, hidden_size)`.

        Hidden-states of the model at the output of each layer plus the initial embedding outputs.
    attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
        Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`.

        Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
        heads.

@@ -958,13 +959,13 @@ class TFBertForPreTrainingOutput(ModelOutput):
BERT_START_DOCSTRING = r"""
    This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
    etc.)

    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
    as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
    behavior.

    <Tip>
@@ -973,11 +974,11 @@ BERT_START_DOCSTRING = r"""
    - having all inputs as keyword arguments (like PyTorch models), or
    - having all inputs as a list, tuple or dict in the first positional arguments.

    This second option is useful when using [`tf.keras.Model.fit`] method which currently requires having all the
    tensors in the first argument of the model call function: `model(inputs)`.

    If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the
    first positional argument :

    - a single Tensor with `input_ids` only and nothing else: `model(inputs_ids)`
    - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:

@@ -990,8 +991,7 @@ BERT_START_DOCSTRING = r"""
    Args:
        config ([`BertConfig`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~TFPreTrainedModel.from_pretrained`] method to load the model weights.
"""

BERT_INPUTS_DOCSTRING = r"""
@@ -999,9 +999,8 @@ BERT_INPUTS_DOCSTRING = r"""
    input_ids (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `({0})`):
        Indices of input sequence tokens in the vocabulary.

        Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.__call__`] and
        [`PreTrainedTokenizer.encode`] for details.

        [What are input IDs?](../glossary#input-ids)
    attention_mask (`np.ndarray` or `tf.Tensor` of shape `({0})`, *optional*):

@@ -1012,14 +1011,16 @@ BERT_INPUTS_DOCSTRING = r"""
        [What are attention masks?](../glossary#attention-mask)
    token_type_ids (`np.ndarray` or `tf.Tensor` of shape `({0})`, *optional*):
        Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
        1]`:

        - 0 corresponds to a *sentence A* token,
        - 1 corresponds to a *sentence B* token.

        [What are token type IDs?](../glossary#token-type-ids)
    position_ids (`np.ndarray` or `tf.Tensor` of shape `({0})`, *optional*):
        Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
        config.max_position_embeddings - 1]`.

        [What are position IDs?](../glossary#position-ids)
    head_mask (`np.ndarray` or `tf.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):

@@ -1029,9 +1030,9 @@ BERT_INPUTS_DOCSTRING = r"""
        - 0 indicates the head is **masked**.
    inputs_embeds (`np.ndarray` or `tf.Tensor` of shape `({0}, hidden_size)`, *optional*):
        Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
        is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
        model's internal embedding lookup matrix.
    output_attentions (`bool`, *optional*):
        Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
        tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the

@@ -1041,8 +1042,8 @@ BERT_INPUTS_DOCSTRING = r"""
        more detail. This argument can be used only in eager mode, in graph mode the value in the config will be
        used instead.
    return_dict (`bool`, *optional*):
        Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple. This argument can be used
        in eager mode, in graph mode the value will always be set to True.
    training (`bool`, *optional*, defaults to `False``):
        Whether or not to use the model in training mode (some modules like dropout modules have different
        behaviors between training and evaluation).
@@ -1097,12 +1098,12 @@ class TFBertModel(TFBertPreTrainedModel):
        past_key_values (`Tuple[Tuple[tf.Tensor]]` of length `config.n_layers`)
            contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        use_cache (`bool`, *optional*, defaults to `True`):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`). Set to `False` during training, `True` during generation
        """
        inputs = input_processing(
            func=self.call,

@@ -1212,8 +1213,9 @@ class TFBertForPreTraining(TFBertPreTrainedModel, TFBertPreTrainingLoss):
    ) -> Union[TFBertForPreTrainingOutput, Tuple[tf.Tensor]]:
        r"""
        labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
            config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
            loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
        next_sentence_label (`tf.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair
            (see `input_ids` docstring) Indices should be in `[0, 1]`:

@@ -1300,7 +1302,7 @@ class TFBertForPreTraining(TFBertPreTrainedModel, TFBertPreTrainingLoss):
)


@add_start_docstrings("""Bert Model with a `language modeling` head on top.""", BERT_START_DOCSTRING)
class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
    # names with a '.' represents the authorized unexpected/missing layers when a TF model is loaded from a PT model
    _keys_to_ignore_on_load_unexpected = [

@@ -1353,8 +1355,9 @@ class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
    ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]:
        r"""
        labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
            config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
            loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
        """
        inputs = input_processing(
            func=self.call,

@@ -1483,14 +1486,15 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
        past_key_values (`Tuple[Tuple[tf.Tensor]]` of length `config.n_layers`)
            contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        use_cache (`bool`, *optional*, defaults to `True`):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`). Set to `False` during training, `True` during generation
        labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the cross entropy classification loss. Indices should be in `[0, ...,
            config.vocab_size - 1]`.
        """
        inputs = input_processing(
            func=self.call,
...@@ -1566,7 +1570,7 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss): ...@@ -1566,7 +1570,7 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
@add_start_docstrings( @add_start_docstrings(
"""Bert Model with a `next sentence prediction (classification)` head on top. """, """Bert Model with a `next sentence prediction (classification)` head on top.""",
BERT_START_DOCSTRING, BERT_START_DOCSTRING,
) )
class TFBertForNextSentencePrediction(TFBertPreTrainedModel, TFNextSentencePredictionLoss): class TFBertForNextSentencePrediction(TFBertPreTrainedModel, TFNextSentencePredictionLoss):
...@@ -1721,8 +1725,9 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific ...@@ -1721,8 +1725,9 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]:
r""" r"""
labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy). config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
""" """
inputs = input_processing( inputs = input_processing(
func=self.call, func=self.call,
...@@ -1830,8 +1835,8 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss): ...@@ -1830,8 +1835,8 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]:
r""" r"""
labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices]` where `num_choices` is the size of the second dimension of the input tensors. (See Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices]`
`input_ids` above) where `num_choices` is the size of the second dimension of the input tensors. (See `input_ids` above)
""" """
inputs = input_processing( inputs = input_processing(
func=self.call, func=self.call,
@@ -2096,12 +2101,12 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss):
r"""
start_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
inputs = input_processing(
func=self.call,
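A sketch of passing the span labels described above, assuming `bert-base-uncased`; the start/end values are arbitrary token indices, not gold annotations.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Who maintains Transformers?"
context = "The Transformers library is maintained by Hugging Face."
inputs = tokenizer(question, context, return_tensors="tf")

start_positions = tf.constant([10])  # illustrative token index of the answer start
end_positions = tf.constant([12])    # illustrative token index of the answer end

outputs = model(inputs, start_positions=start_positions, end_positions=end_positions)
print(outputs.loss, outputs.start_logits.shape, outputs.end_logits.shape)
```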
...
@@ -118,8 +118,8 @@ class BertTokenizer(PreTrainedTokenizer):
r"""
Construct a BERT tokenizer. Based on WordPiece.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
@@ -149,7 +149,8 @@ class BertTokenizer(PreTrainedTokenizer):
tokenize_chinese_chars (`bool`, *optional*, defaults to `True`):
Whether or not to tokenize Chinese characters.
This should likely be deactivated for Japanese (see this
[issue](https://github.com/huggingface/transformers/issues/328)).
strip_accents: (`bool`, *optional*):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for `lowercase` (as in the original BERT).
@@ -318,8 +319,7 @@ class BertTokenizer(PreTrainedTokenizer):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
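A sketch of the token type ID mask returned here, assuming the `bert-base-uncased` vocabulary; the same mask comes back from `__call__` when a sentence pair is encoded directly.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

ids_a = tokenizer.encode("How are you?", add_special_tokens=False)
ids_b = tokenizer.encode("Fine, thanks.", add_special_tokens=False)

# 0s cover [CLS] + sequence A + [SEP]; 1s cover sequence B + its trailing [SEP].
print(tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b))

# Encoding the pair directly produces the same token_type_ids.
print(tokenizer("How are you?", "Fine, thanks.")["token_type_ids"])
```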
@@ -361,7 +361,8 @@ class BasicTokenizer(object):
tokenize_chinese_chars (`bool`, *optional*, defaults to `True`):
Whether or not to tokenize Chinese characters.
This should likely be deactivated for Japanese (see this
[issue](https://github.com/huggingface/transformers/issues/328)).
strip_accents: (`bool`, *optional*):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for `lowercase` (as in the original BERT).
...
@@ -118,8 +118,8 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
r"""
Construct a "fast" BERT tokenizer (backed by HuggingFace's *tokenizers* library). Based on WordPiece.
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
@@ -245,8 +245,7 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
...
@@ -12,19 +12,18 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BertGeneration model configuration"""
from ...configuration_utils import PretrainedConfig
class BertGenerationConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`BertGenerationPreTrainedModel`]. It is used to
instantiate a BertGeneration model according to the specified arguments, defining the model architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 50358):
@@ -39,8 +38,8 @@ class BertGenerationConfig(PretrainedConfig):
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
@@ -53,10 +52,11 @@ class BertGenerationConfig(PretrainedConfig):
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
[Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
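A sketch of instantiating the configuration with a few of the arguments documented above; the values (and the deliberately small model sizes) are illustrative only.

```python
from transformers import BertGenerationConfig, BertGenerationEncoder

config = BertGenerationConfig(
    hidden_size=128,                          # small sizes just to keep the sketch light
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=256,
    hidden_act="gelu",                        # or "relu", "silu", "gelu_new"
    position_embedding_type="relative_key",   # instead of the default "absolute"
)
model = BertGenerationEncoder(config)          # randomly initialized from the config
print(model.config.position_embedding_type)
```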
...
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch BERT model specific for generation."""
import torch
@@ -195,19 +195,18 @@ class BertGenerationPreTrainedModel(PreTrainedModel):
BERT_GENERATION_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`BertGenerationConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
BERT_GENERATION_INPUTS_DOCSTRING = r"""
@@ -215,9 +214,8 @@ BERT_GENERATION_INPUTS_DOCSTRING = r"""
input_ids (`torch.LongTensor` of shape `({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`BertGenerationTokenizer`]. See [`PreTrainedTokenizer.__call__`] and
[`PreTrainedTokenizer.encode`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
@@ -228,7 +226,8 @@ BERT_GENERATION_INPUTS_DOCSTRING = r"""
[What are attention masks?](../glossary#attention-mask)
position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
@@ -238,9 +237,9 @@ BERT_GENERATION_INPUTS_DOCSTRING = r"""
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
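A sketch of producing the `input_ids` and `attention_mask` tensors described above and feeding them to the encoder; the checkpoint name is an assumption borrowed from the usual BertGeneration examples.

```python
import torch
from transformers import BertGenerationEncoder, BertGenerationTokenizer

checkpoint = "google/bert_for_seq_generation_L-24_bbc_encoder"  # assumed checkpoint
tokenizer = BertGenerationTokenizer.from_pretrained(checkpoint)
model = BertGenerationEncoder.from_pretrained(checkpoint)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")  # input_ids + attention_mask
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```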
@@ -264,14 +263,13 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel):
all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
This model should be used when leveraging Bert or Roberta checkpoints for the [`EncoderDecoderModel`] class as
described in [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461)
by Sascha Rothe, Shashi Narayan, and Aliaksei Severyn.
To behave as a decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
to `True`. To be used in a Seq2Seq model, the model needs to be initialized with both the `is_decoder` argument and
`add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
"""
def __init__(self, config):
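A sketch of the usage described above: plain BERT weights (assumed `bert-large-uncased`, as in the usual examples) are loaded into a BertGeneration encoder and decoder and paired in an [`EncoderDecoderModel`]; `is_decoder` and `add_cross_attention` follow the note above.

```python
from transformers import BertGenerationDecoder, BertGenerationEncoder, EncoderDecoderModel

# Leverage a plain BERT checkpoint for both halves of the seq2seq model.
encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
decoder = BertGenerationDecoder.from_pretrained(
    "bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
)
model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
```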
@@ -331,12 +329,12 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel):
past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
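A sketch of the caching behaviour described above (`use_cache`/`past_key_values`), using a small randomly initialized decoder configuration so no checkpoint is needed; the token IDs are arbitrary.

```python
import torch
from transformers import BertGenerationConfig, BertGenerationDecoder

config = BertGenerationConfig(
    is_decoder=True, hidden_size=64, num_hidden_layers=2, num_attention_heads=2, intermediate_size=128
)
model = BertGenerationDecoder(config).eval()

input_ids = torch.tensor([[101, 7592, 2088]])  # arbitrary token IDs
with torch.no_grad():
    # First pass over the full prefix, requesting the key/value cache.
    outputs = model(input_ids, use_cache=True)
    past = outputs.past_key_values

    # Next step: feed only the newest token together with the cache.
    next_token = torch.tensor([[2003]])
    outputs = model(next_token, past_key_values=past, use_cache=True)

print(outputs.logits.shape)  # (1, 1, vocab_size): logits only for the new position
```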
@@ -443,7 +441,7 @@ class BertGenerationOnlyLMHead(nn.Module):
@add_start_docstrings(
"""BertGeneration Model with a `language modeling` head on top for CLM fine-tuning.""",
BERT_GENERATION_START_DOCSTRING,
)
class BertGenerationDecoder(BertGenerationPreTrainedModel):
@@ -500,12 +498,12 @@ class BertGenerationDecoder(BertGenerationPreTrainedModel):
past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
Returns:
...
@@ -42,8 +42,8 @@ class BertGenerationTokenizer(PreTrainedTokenizer):
"""
Construct a BertGeneration tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
@@ -59,7 +59,9 @@ class BertGenerationTokenizer(PreTrainedTokenizer):
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
to set:
- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
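A sketch of forwarding SentencePiece options through `sp_model_kwargs`; the checkpoint name and the exact option values (`enable_sampling`, `nbest_size`, `alpha`) are assumptions based on the SentencePiece Python wrapper, not part of this diff.

```python
from transformers import BertGenerationTokenizer

tokenizer = BertGenerationTokenizer.from_pretrained(
    "google/bert_for_seq_generation_L-24_bbc_encoder",  # assumed checkpoint
    sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
)

# With subword regularization enabled, repeated tokenizations of the same text may differ.
print(tokenizer.tokenize("sequence generation with sentencepiece"))
print(tokenizer.tokenize("sequence generation with sentencepiece"))
```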
...
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization classes for BERTweet"""
import html
@@ -69,8 +69,8 @@ class BertweetTokenizer(PreTrainedTokenizer):
"""
Constructs a BERTweet tokenizer, using Byte-Pair-Encoding.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
@@ -94,8 +94,8 @@ class BertweetTokenizer(PreTrainedTokenizer):
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the `sep_token`.
</Tip>
...
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BigBird model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
@@ -30,13 +30,13 @@ BIG_BIRD_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BigBirdConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`BigBirdModel`]. It is used to instantiate a
BigBird model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the BigBird
[google/bigbird-roberta-base](https://huggingface.co/google/bigbird-roberta-base) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
@@ -52,8 +52,8 @@ class BigBirdConfig(PretrainedConfig):
intermediate_size (`int`, *optional*, defaults to 3072):
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu_new"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
@@ -80,7 +80,8 @@ class BigBirdConfig(PretrainedConfig):
block_size (`int`, *optional*, defaults to 64):
Size of each block. Useful only when `attention_type == "block_sparse"`.
num_random_blocks (`int`, *optional*, defaults to 3):
Each query is going to attend to this many random blocks. Useful only when `attention_type ==
"block_sparse"`.
classifier_dropout (`float`, *optional*):
The dropout ratio for the classification head.
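A sketch of the sparse-attention arguments documented above; the values simply restate the documented defaults, and `attention_type` is assumed to accept `"block_sparse"` or `"original_full"`.

```python
from transformers import BigBirdConfig

config = BigBirdConfig(
    attention_type="block_sparse",  # assumed alternative: "original_full" (dense attention)
    block_size=64,                  # tokens per block
    num_random_blocks=3,            # random blocks each query block attends to
)
print(config.attention_type, config.block_size, config.num_random_blocks)
```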
@@ -92,14 +93,13 @@ class BigBirdConfig(PretrainedConfig):
>>> from transformers import BigBirdModel, BigBirdConfig
>>> # Initializing a BigBird google/bigbird-roberta-base style configuration
>>> configuration = BigBirdConfig()
>>> # Initializing a model from the google/bigbird-roberta-base style configuration
>>> model = BigBirdModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "big_bird"
...
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch BigBird model."""
import math
@@ -1788,8 +1788,7 @@ BIG_BIRD_START_DOCSTRING = r"""
Parameters:
config ([`BigBirdConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
BIG_BIRD_INPUTS_DOCSTRING = r"""
@@ -1797,9 +1796,8 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
input_ids (`torch.LongTensor` of shape `({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`BigBirdTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
@@ -1810,14 +1808,16 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
[What are attention masks?](../glossary#attention-mask)
token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
@@ -1827,9 +1827,9 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert *input_ids* indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
@@ -1856,12 +1856,13 @@ class BigBirdForPreTrainingOutput(ModelOutput):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -1889,12 +1890,13 @@ class BigBirdForQuestionAnsweringModelOutput(ModelOutput):
pooler_output (`torch.FloatTensor` of shape `(batch_size, 1)`):
Pooler output from BigBirdModel.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -1920,10 +1922,9 @@ class BigBirdModel(BigBirdPreTrainedModel):
all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
To behave as a decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
to `True`. To be used in a Seq2Seq model, the model needs to be initialized with both the `is_decoder` argument and
`add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
"""
def __init__(self, config, add_pooling_layer=True):
@@ -2004,12 +2005,12 @@ class BigBirdModel(BigBirdPreTrainedModel):
- 0 for tokens that are **masked**.
past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
@@ -2286,12 +2287,13 @@ class BigBirdForPreTraining(BigBirdPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the
loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
next_sentence_label (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the next sequence prediction (classification) loss. If specified, nsp loss will be
added to masked_lm loss. Input should be a sequence pair (see `input_ids` docstring). Indices should be in
`[0, 1]`:
- 0 indicates sequence B is a continuation of sequence A,
- 1 indicates sequence B is a random sequence.
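A sketch of building the two label tensors described above, assuming the `google/bigbird-roberta-base` checkpoint; which positions are ignored with `-100` is arbitrary here.

```python
import torch
from transformers import BigBirdForPreTraining, BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForPreTraining.from_pretrained("google/bigbird-roberta-base")

inputs = tokenizer("Paris is the capital of France.", return_tensors="pt")

labels = inputs["input_ids"].clone()
labels[:, :-3] = -100                     # -100 positions are ignored by the MLM loss
next_sentence_label = torch.tensor([0])   # 0: sequence B continues sequence A

outputs = model(**inputs, labels=labels, next_sentence_label=next_sentence_label)
print(outputs.loss)
```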
@@ -2354,7 +2356,7 @@ class BigBirdForPreTraining(BigBirdPreTrainedModel):
)
@add_start_docstrings("""BigBird Model with a `language modeling` head on top.""", BIG_BIRD_START_DOCSTRING)
class BigBirdForMaskedLM(BigBirdPreTrainedModel):
def __init__(self, config):
super().__init__(config)
@@ -2401,8 +2403,9 @@ class BigBirdForMaskedLM(BigBirdPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the
loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
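A sketch of the `-100` labelling convention above for the masked-LM head, assuming `google/bigbird-roberta-base`; the masked position is chosen arbitrarily.

```python
import torch
from transformers import BigBirdForMaskedLM, BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForMaskedLM.from_pretrained("google/bigbird-roberta-base")

inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
input_ids = inputs["input_ids"]

masked_index = 6                               # illustrative position to hide
labels = torch.full_like(input_ids, -100)      # -100 everywhere -> ignored by the loss
labels[0, masked_index] = input_ids[0, masked_index]
input_ids[0, masked_index] = tokenizer.mask_token_id

outputs = model(**inputs, labels=labels)
print(outputs.loss)
```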
@@ -2455,7 +2458,7 @@ class BigBirdForMaskedLM(BigBirdPreTrainedModel):
@add_start_docstrings(
"""BigBird Model with a `language modeling` head on top for CLM fine-tuning.""", BIG_BIRD_START_DOCSTRING
)
class BigBirdForCausalLM(BigBirdPreTrainedModel):
@@ -2510,16 +2513,16 @@ class BigBirdForCausalLM(BigBirdPreTrainedModel):
- 0 for tokens that are **masked**.
past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
`[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are
ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
Returns:
@@ -2667,8 +2670,9 @@ class BigBirdForSequenceClassification(BigBirdPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), if
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -2764,7 +2768,8 @@ class BigBirdForMultipleChoice(BigBirdPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
`input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -2970,12 +2975,12 @@ class BigBirdForQuestionAnswering(BigBirdPreTrainedModel):
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
...
@@ -64,12 +64,13 @@ class FlaxBigBirdForPreTrainingOutput(ModelOutput):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (`tuple(jnp.ndarray)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `jnp.ndarray` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(jnp.ndarray)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `jnp.ndarray` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -94,12 +95,13 @@ class FlaxBigBirdForQuestionAnsweringModelOutput(ModelOutput):
pooled_output (`jnp.ndarray` of shape `(batch_size, hidden_size)`):
pooled_output returned by FlaxBigBirdModel.
hidden_states (`tuple(jnp.ndarray)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `jnp.ndarray` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(jnp.ndarray)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `jnp.ndarray` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -114,12 +116,12 @@ class FlaxBigBirdForQuestionAnsweringModelOutput(ModelOutput):
BIG_BIRD_START_DOCSTRING = r"""
This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading, saving and converting weights from PyTorch models)
This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module)
subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matters related to
general usage and behavior.
Finally, this model supports inherent JAX features such as:
...@@ -131,11 +133,10 @@ BIG_BIRD_START_DOCSTRING = r""" ...@@ -131,11 +133,10 @@ BIG_BIRD_START_DOCSTRING = r"""
Parameters: Parameters:
config ([`BigBirdConfig`]): Model configuration class with all the parameters of the model. config ([`BigBirdConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~FlaxPreTrainedModel.from_pretrained`] method to load the configuration. Check out the [`~FlaxPreTrainedModel.from_pretrained`] method to load the model weights.
model weights.
dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`): dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`):
The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on GPUs) and
GPUs) and `jax.numpy.bfloat16` (on TPUs). `jax.numpy.bfloat16` (on TPUs).
This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If
specified all the computation will be performed with the given `dtype`. specified all the computation will be performed with the given `dtype`.
...@@ -143,8 +144,8 @@ BIG_BIRD_START_DOCSTRING = r""" ...@@ -143,8 +144,8 @@ BIG_BIRD_START_DOCSTRING = r"""
**Note that this only specifies the dtype of the computation and does not influence the dtype of model **Note that this only specifies the dtype of the computation and does not influence the dtype of model
parameters.** parameters.**
If you wish to change the dtype of the model parameters, see If you wish to change the dtype of the model parameters, see [`~FlaxPreTrainedModel.to_fp16`] and
[`~FlaxPreTrainedModel.to_fp16`] and [`~FlaxPreTrainedModel.to_bf16`]. [`~FlaxPreTrainedModel.to_bf16`].
""" """
BIG_BIRD_INPUTS_DOCSTRING = r""" BIG_BIRD_INPUTS_DOCSTRING = r"""
...@@ -152,9 +153,8 @@ BIG_BIRD_INPUTS_DOCSTRING = r""" ...@@ -152,9 +153,8 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
input_ids (`numpy.ndarray` of shape `({0})`): input_ids (`numpy.ndarray` of shape `({0})`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`BigBirdTokenizer`]. See Indices can be obtained using [`BigBirdTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for [`PreTrainedTokenizer.__call__`] for details.
details.
[What are input IDs?](../glossary#input-ids) [What are input IDs?](../glossary#input-ids)
attention_mask (`numpy.ndarray` of shape `({0})`, *optional*): attention_mask (`numpy.ndarray` of shape `({0})`, *optional*):
...@@ -165,15 +165,18 @@ BIG_BIRD_INPUTS_DOCSTRING = r""" ...@@ -165,15 +165,18 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
[What are attention masks?](../glossary#attention-mask) [What are attention masks?](../glossary#attention-mask)
token_type_ids (`numpy.ndarray` of shape `({0})`, *optional*): token_type_ids (`numpy.ndarray` of shape `({0})`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`: Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token, - 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token. - 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids) [What are token type IDs?](../glossary#token-type-ids)
position_ids (`numpy.ndarray` of shape `({0})`, *optional*): position_ids (`numpy.ndarray` of shape `({0})`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.max_position_embeddings - 1]`. Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
head_mask (`numpy.ndarray` of shape `({0})`, `optional): Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`: config.max_position_embeddings - 1]`.
head_mask (`numpy.ndarray` of shape `({0})`, `optional):
Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`:
- 1 indicates the head is **not masked**, - 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**. - 0 indicates the head is **masked**.
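For reference (a short sketch, not part of this diff; the checkpoint name is only an assumption), these arguments are usually produced directly by the tokenizer and passed to the model:

>>> from transformers import BigBirdTokenizer, FlaxBigBirdModel

>>> tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
>>> model = FlaxBigBirdModel.from_pretrained("google/bigbird-roberta-base")

>>> # return_tensors="np" yields the numpy arrays the Flax model expects
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
>>> outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
>>> last_hidden_state = outputs.last_hidden_state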
...@@ -787,7 +790,8 @@ class FlaxBigBirdBlockSparseAttention(nn.Module):
        Args:
            from_blocked_mask: 3D Tensor of shape [batch_size, from_seq_length//from_block_size, from_block_size].
            to_blocked_mask: int32 Tensor of shape [batch_size, to_seq_length//to_block_size, to_block_size].
            broadcasted_rand_attn:
                [batch_size, num_attention_heads, from_seq_length//from_block_size-2, num_rand_blocks]
            num_attention_heads: int. Number of attention heads.
            num_random_blocks: int. Number of random chunks per row.
            batch_size: int. Batch size for computation.
...@@ -1713,7 +1717,7 @@ class FlaxBigBirdForMaskedLMModule(nn.Module):
)


@add_start_docstrings("""BigBird Model with a `language modeling` head on top.""", BIG_BIRD_START_DOCSTRING)
# Copied from transformers.models.bert.modeling_flax_bert.FlaxBertForMaskedLM with Bert->BigBird
class FlaxBigBirdForMaskedLM(FlaxBigBirdPreTrainedModel):
    module_class = FlaxBigBirdForMaskedLMModule
...
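A usage sketch for the class defined above (not part of this diff; the checkpoint and example sentence are assumptions):

>>> from transformers import BigBirdTokenizer, FlaxBigBirdForMaskedLM

>>> tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
>>> model = FlaxBigBirdForMaskedLM.from_pretrained("google/bigbird-roberta-base")

>>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="np")
>>> logits = model(**inputs).logits

>>> # position of the [MASK] token and the highest-scoring vocabulary id at that position
>>> mask_index = int((inputs["input_ids"][0] == tokenizer.mask_token_id).argmax())
>>> predicted_id = int(logits[0, mask_index].argmax(-1))
>>> tokenizer.decode([predicted_id])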
...@@ -48,8 +48,8 @@ class BigBirdTokenizer(PreTrainedTokenizer):
    """
    Construct a BigBird tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
...@@ -75,7 +75,9 @@ class BigBirdTokenizer(PreTrainedTokenizer):
            The token used for masking values. This is the token used when training this model with masked language
            modeling. This is the token which the model will try to predict.
        sp_model_kwargs (`dict`, *optional*):
            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
            to set:

            - `enable_sampling`: Enable subword regularization.
            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
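For example (a sketch, not part of this diff; the sampling values shown are only illustrative assumptions), subword regularization can be switched on when the tokenizer is instantiated:

>>> from transformers import BigBirdTokenizer

>>> tokenizer = BigBirdTokenizer.from_pretrained(
...     "google/bigbird-roberta-base",
...     sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
... )

>>> # with sampling enabled, repeated calls may segment the same text differently
>>> tokenizer.tokenize("unbelievable")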
...@@ -259,8 +261,7 @@ class BigBirdTokenizer(PreTrainedTokenizer):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
...
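To illustrate the method excerpted above (a sketch, not part of this diff; the exact values depend on the tokenizer's special-token handling):

>>> from transformers import BigBirdTokenizer

>>> tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")

>>> ids_a = tokenizer.encode("How old are you?", add_special_tokens=False)
>>> ids_b = tokenizer.encode("I'm 6 years old", add_special_tokens=False)

>>> # typically zeros for the first segment (with its special tokens) and ones for the second
>>> token_type_ids = tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)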
...@@ -58,9 +58,10 @@ SPIECE_UNDERLINE = "▁"


class BigBirdTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a "fast" BigBird tokenizer (backed by HuggingFace's *tokenizers* library). Based on
    [Unigram](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=unigram#models). This
    tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
...@@ -219,8 +220,7 @@ class BigBirdTokenizerFast(PreTrainedTokenizerFast):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
...
...@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BigBirdPegasus model configuration"""

from ...configuration_utils import PretrainedConfig
from ...utils import logging
...@@ -30,13 +30,13 @@ BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BigBirdPegasusConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BigBirdPegasusModel`]. It is used to instantiate
    a BigBirdPegasus model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the BigBirdPegasus
    [google/bigbird-pegasus-large-arxiv](https://huggingface.co/google/bigbird-pegasus-large-arxiv) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
...@@ -58,8 +58,8 @@ class BigBirdPegasusConfig(PretrainedConfig):
        encoder_ffn_dim (`int`, *optional*, defaults to 4096):
            Dimension of the "intermediate" (often named feed-forward) layer in the encoder.
        activation_function (`str` or `function`, *optional*, defaults to `"gelu_new"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
...@@ -74,23 +74,23 @@ class BigBirdPegasusConfig(PretrainedConfig):
        init_std (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        encoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        decoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the decoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        attention_type (`str`, *optional*, defaults to `"block_sparse"`):
            Whether to use block sparse attention (with n complexity) as introduced in the paper or the original
            attention layer (with n^2 complexity) in the encoder. Possible values are `"original_full"` and
            `"block_sparse"`.
        use_bias (`bool`, *optional*, defaults to `False`):
            Whether to use bias in the query, key and value projections.
        block_size (`int`, *optional*, defaults to 64):
            Size of each block. Useful only when `attention_type == "block_sparse"`.
        num_random_blocks (`int`, *optional*, defaults to 3):
            Each query is going to attend to this many random blocks. Useful only when `attention_type ==
            "block_sparse"`.
        scale_embeddings (`bool`, *optional*, defaults to `True`):
            Whether to rescale embeddings with (hidden_size ** 0.5).
...@@ -102,14 +102,13 @@ class BigBirdPegasusConfig(PretrainedConfig):
    >>> from transformers import BigBirdPegasusModel, BigBirdPegasusConfig

    >>> # Initializing a BigBirdPegasus bigbird-pegasus-base style configuration
    >>> configuration = BigBirdPegasusConfig()

    >>> # Initializing a model from the bigbird-pegasus-base style configuration
    >>> model = BigBirdPegasusModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    """

    model_type = "bigbird_pegasus"
    keys_to_ignore_at_inference = ["past_key_values"]
...
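Building on the example above (a sketch, not part of this diff), the sparse-attention parameters described earlier can also be set explicitly when the configuration is created:

>>> from transformers import BigBirdPegasusConfig, BigBirdPegasusForConditionalGeneration

>>> # block sparse attention in the encoder with 64-token blocks and 3 random blocks per query
>>> configuration = BigBirdPegasusConfig(attention_type="block_sparse", block_size=64, num_random_blocks=3)
>>> model = BigBirdPegasusForConditionalGeneration(configuration)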
...@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch BigBirdPegasus model."""

import copy
...@@ -1474,7 +1474,8 @@ class BigBirdPegasusDecoderLayer(nn.Module):
            hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)*
            attention_mask (`torch.FloatTensor`): attention mask of size
                *(batch, 1, tgt_len, src_len)* where padding elements are indicated by very large negative values.
            encoder_hidden_states (`torch.FloatTensor`):
                cross attention input to the layer of shape *(seq_len, batch, embed_dim)*
            encoder_attention_mask (`torch.FloatTensor`): encoder attention mask of size
                *(batch, 1, tgt_len, src_len)* where padding elements are indicated by very large negative values.
            layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size
...@@ -1603,13 +1604,12 @@ class BigBirdPegasusPreTrainedModel(PreTrainedModel):

BIGBIRD_PEGASUS_START_DOCSTRING = r"""
    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its models (such as downloading or saving, resizing the input embeddings, etc.)

    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
    and behavior.

    Parameters:
        config ([`BigBirdPegasusConfig`]):
...@@ -1623,15 +1623,15 @@ BIGBIRD_PEGASUS_GENERATION_EXAMPLE = r"""
    >>> from transformers import PegasusTokenizer, BigBirdPegasusForConditionalGeneration, BigBirdPegasusConfig

    >>> model = BigBirdPegasusForConditionalGeneration.from_pretrained('google/bigbird-pegasus-large-arxiv')
    >>> tokenizer = PegasusTokenizer.from_pretrained('google/bigbird-pegasus-large-arxiv')

    >>> ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
    >>> inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=4096, return_tensors='pt', truncation=True)

    >>> # Generate Summary
    >>> summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)
    >>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
"""

BIGBIRD_PEGASUS_INPUTS_DOCSTRING = r"""
...@@ -1640,9 +1640,8 @@ BIGBIRD_PEGASUS_INPUTS_DOCSTRING = r"""
            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
            it.

            Indices can be obtained using [`PegasusTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -1656,8 +1655,8 @@ BIGBIRD_PEGASUS_INPUTS_DOCSTRING = r"""
            Provide for translation and summarization training. By default, the model will create this tensor by
            shifting the `input_ids` to the right, following the paper.
        decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
            Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also
            be used by default.

            If you want to change padding behavior, you should read
            [`modeling_bigbird_pegasus._prepare_decoder_inputs`] and modify to your needs. See diagram 1 in [the
...@@ -1670,33 +1669,35 @@ BIGBIRD_PEGASUS_INPUTS_DOCSTRING = r"""
            - 0 indicates the head is **masked**.

        encoder_outputs (`tuple(tuple(torch.FloatTensor))`, *optional*):
            Tuple consists of (`last_hidden_state`, *optional*: `hidden_states`, *optional*: `attentions`).
            `last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)`, *optional*, is a sequence of
            hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of shape
            `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

            Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
            blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
            than the model's internal embedding lookup matrix.
        decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded
            representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be
            input (see `past_key_values`). This is useful if you want more control over how to convert
            `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix.

            If `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value
            of `inputs_embeds`.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`).
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
            tensors for more detail.
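To make the `past_key_values` / `use_cache` mechanics above concrete (a sketch, not part of this diff; the checkpoint and variable names are assumptions), the cache from one forward pass can be fed back so only the newest decoder token is re-processed:

>>> import torch
>>> from transformers import PegasusTokenizer, BigBirdPegasusForConditionalGeneration

>>> tokenizer = PegasusTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
>>> model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv")

>>> enc = tokenizer("My friends are cool but they eat too many carbs.", return_tensors="pt")
>>> decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

>>> out = model(**enc, decoder_input_ids=decoder_input_ids, use_cache=True)
>>> next_token = out.logits[:, -1:].argmax(-1)

>>> # pass only the new token together with the cache instead of the full decoder sequence
>>> out = model(**enc, decoder_input_ids=next_token, past_key_values=out.past_key_values, use_cache=True)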
...@@ -1713,9 +1714,8 @@ BIGBIRD_PEGASUS_STANDALONE_INPUTS_DOCSTRING = r"""
            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
            it.

            Indices can be obtained using [`ProphetNetTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -1792,9 +1792,8 @@ class BigBirdPegasusEncoder(BigBirdPegasusPreTrainedModel):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`PegasusTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -1806,9 +1805,9 @@ class BigBirdPegasusEncoder(BigBirdPegasusPreTrainedModel):
                [What are attention masks?](../glossary#attention-mask)
            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated vectors
                than the model's internal embedding lookup matrix.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
...@@ -2036,8 +2035,7 @@ class BigBirdPegasusEncoder(BigBirdPegasusPreTrainedModel):
class BigBirdPegasusDecoder(BigBirdPegasusPreTrainedModel):
    """
    Transformer decoder consisting of *config.decoder_layers* layers. Each layer is a [`BigBirdPegasusDecoderLayer`].

    Args:
        config: BigBirdPegasusConfig
...@@ -2114,9 +2112,8 @@ class BigBirdPegasusDecoder(BigBirdPegasusPreTrainedModel):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`PegasusTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -2151,19 +2148,20 @@ class BigBirdPegasusDecoder(BigBirdPegasusPreTrainedModel):
                - 0 indicates the head is **masked**.

            past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
                Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
                shape `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of
                shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
                cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those
                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of
                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.
            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated
                vectors than the model's internal embedding lookup matrix.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
...@@ -2504,7 +2502,8 @@ class BigBirdPegasusForConditionalGeneration(BigBirdPegasusPreTrainedModel):
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Returns:
...@@ -2647,7 +2646,8 @@ class BigBirdPegasusForSequenceClassification(BigBirdPegasusPreTrainedModel):
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        if labels is not None:
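As an illustration of the `labels` argument described in this hunk (a sketch, not part of this diff; the checkpoint, `num_labels` and label value are assumptions):

>>> import torch
>>> from transformers import PegasusTokenizer, BigBirdPegasusForSequenceClassification

>>> tokenizer = PegasusTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
>>> model = BigBirdPegasusForSequenceClassification.from_pretrained("google/bigbird-pegasus-large-arxiv", num_labels=2)

>>> inputs = tokenizer("My friends are cool but they eat too many carbs.", return_tensors="pt")
>>> # with num_labels > 1 a cross-entropy classification loss is computed
>>> outputs = model(**inputs, labels=torch.tensor([1]))
>>> loss, logits = outputs.loss, outputs.logits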
...@@ -2916,9 +2916,8 @@ class BigBirdPegasusForCausalLM(BigBirdPegasusPreTrainedModel):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`PegasusTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -2947,25 +2946,24 @@ class BigBirdPegasusForCausalLM(BigBirdPegasusPreTrainedModel):
                - 0 indicates the head is **masked**.

            past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
                Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
                shape `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of
                shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. The two additional
                tensors are only required when the model is used as a decoder in a Sequence to Sequence model.

                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
                cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those
                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of
                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.
...
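To round off the causal-LM signature excerpted above (a sketch, not part of this diff; the checkpoint and the `add_cross_attention=False` flag are assumptions):

>>> from transformers import PegasusTokenizer, BigBirdPegasusForCausalLM

>>> tokenizer = PegasusTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
>>> model = BigBirdPegasusForCausalLM.from_pretrained("google/bigbird-pegasus-large-arxiv", add_cross_attention=False)

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

>>> # reusing the input ids as labels gives a standard next-token (causal) language modeling loss
>>> outputs = model(**inputs, labels=inputs["input_ids"], use_cache=True)
>>> loss, logits, past = outputs.loss, outputs.logits, outputs.past_key_values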