Unverified commit e68ec18c, authored by Joao Gante, committed by GitHub

Docs: formatting nits (#32247)



* doc formatting nits

* ignore non-autodocs

* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/esm/modeling_esm.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/esm/modeling_esm.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* make fixup

---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
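The recurring nit fixed throughout the hunks below is the same one: in Hugging Face docstrings, optional arguments are marked with italics (`*optional*`), not with backticks or bare text, and default values in the `defaults to ...` clause use code formatting for literals. A minimal sketch of the convention (the function and its arguments are hypothetical, not from this commit):

```python
def resize(image, size_divisor=32):
    """
    Resize an image.

    Args:
        image (`np.ndarray`):
            Image to resize.
        size_divisor (`int`, *optional*, defaults to 32):
            The image is resized to a size that is a multiple of this value.
    """
    return image


# A docstring checker can then look for the italic marker:
assert "*optional*" in resize.__doc__
```

Because the marker is plain text inside the docstring, style checks of this kind reduce to string matching on `__doc__`, which is what makes the per-argument format worth standardizing.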
parent 2fbbcf50
@@ -1075,7 +1075,7 @@ class RobertaPreLayerNormForMaskedLM(RobertaPreLayerNormPreTrainedModel):
     Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
     config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
     loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
-    kwargs (`Dict[str, any]`, optional, defaults to *{}*):
+    kwargs (`Dict[str, any]`, *optional*, defaults to `{}`):
     Used to hide legacy arguments that have been deprecated.
     """
     return_dict = return_dict if return_dict is not None else self.config.use_return_dict
...
@@ -1151,7 +1151,7 @@ class RoCBertForPreTraining(RoCBertPreTrainedModel):
     ignored (masked), the loss is only computed for the tokens with labels in `[0, ...,
     config.vocab_size]`
-    kwargs (`Dict[str, any]`, optional, defaults to *{}*):
+    kwargs (`Dict[str, any]`, *optional*, defaults to *{}*):
     Used to hide legacy arguments that have been deprecated.
     Returns:
...
@@ -59,7 +59,7 @@ class SegGptEncoderOutput(ModelOutput):
     attentions (`Tuple[torch.FloatTensor]`, `optional`, returned when `config.output_attentions=True`):
     Tuple of *torch.FloatTensor* (one for each layer) of shape
     `(batch_size, num_heads, seq_len, seq_len)`.
-    intermediate_hidden_states (`Tuple[torch.FloatTensor]`, `optional`, returned when `config.intermediate_hidden_state_indices` is set):
+    intermediate_hidden_states (`Tuple[torch.FloatTensor]`, *optional*, returned when `config.intermediate_hidden_state_indices` is set):
     Tuple of `torch.FloatTensor` of shape `(batch_size, patch_height, patch_width, hidden_size)`.
     Each element in the Tuple corresponds to the output of the layer specified in `config.intermediate_hidden_state_indices`.
     Additionaly, each feature passes through a LayerNorm.
@@ -77,7 +77,7 @@ class SegGptImageSegmentationOutput(ModelOutput):
     Output type of [`SegGptImageSegmentationOutput`].
     Args:
-    loss (`torch.FloatTensor`, `optional`, returned when `labels` is provided):
+    loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided):
     The loss value.
     pred_masks (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
     The predicted masks.
...
@@ -745,10 +745,10 @@ class DisentangledSelfAttention(nn.Module):
     sequence length in which element [i,j] = *1* means the *i* th token in the input can attend to the *j*
     th token.
-    output_attentions (`bool`, optional):
+    output_attentions (`bool`, *optional*):
     Whether return the attention matrix.
-    query_states (`torch.FloatTensor`, optional):
+    query_states (`torch.FloatTensor`, *optional*):
     The *Q* state in *Attention(Q,K,V)*.
     relative_pos (`torch.LongTensor`):
...
@@ -220,7 +220,7 @@ class Speech2TextFeatureExtractor(SequenceFeatureExtractor):
     sampling_rate (`int`, *optional*):
     The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
     `sampling_rate` at the forward call to prevent silent errors.
-    padding_value (`float`, defaults to 0.0):
+    padding_value (`float`, *optional*, defaults to 0.0):
     The value that is used to fill the padding values / vectors.
     """
...
@@ -181,7 +181,7 @@ PARALLELIZE_DOCSTRING = r"""
     it will evenly distribute blocks across all devices.
     Args:
-    device_map (`Dict[int, list]`, optional, defaults to None):
+    device_map (`Dict[int, list]`, *optional*):
     A dictionary that maps attention modules to devices. Note that the embedding module and LMHead are always
     automatically mapped to the first device (for esoteric reasons). That means that the first device should
     have fewer attention modules mapped to it than other devices. For reference, the t5 models have the
...
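The `device_map` described in the hunk above maps device indices to lists of attention-block indices, with the first device deliberately underloaded because embeddings and the LM head land there too. A sketch of that shape (the block split here is illustrative, not from any real model config):

```python
# Hypothetical split of 11 attention blocks across 3 devices.
# Device 0 gets fewer blocks since the embedding module and LMHead
# are always mapped there as well.
device_map = {
    0: [0, 1, 2],
    1: [3, 4, 5, 6],
    2: [7, 8, 9, 10],
}

# Sanity checks a parallelize() implementation would typically perform:
all_blocks = sorted(block for blocks in device_map.values() for block in blocks)
assert all_blocks == list(range(11))          # every block assigned exactly once
assert len(device_map[0]) < len(device_map[1])  # first device carries fewer blocks
```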
@@ -1249,7 +1249,7 @@ class TapasTokenizer(PreTrainedTokenizer):
     Total number of table columns
     max_length (`int`):
     Total maximum length.
-    truncation_strategy (`str` or [`TapasTruncationStrategy`]):
+    truncation_strategy (`str` or [`TapasTruncationStrategy]`):
     Truncation strategy to use. Seeing as this method should only be called when truncating, the only
     available strategy is the `"drop_rows_to_fit"` strategy.
...
@@ -833,7 +833,7 @@ class UdopTokenizer(PreTrainedTokenizer):
     </Tip>
     Args:
-    text (`str`, `List[str]` or `List[int]` (the latter only for not-fast tokenizers)):
+    text (`str`, `List[str]` or (for non-fast tokenizers) `List[int]`):
     The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
     `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
     method).
...
@@ -814,7 +814,7 @@ class UdopTokenizerFast(PreTrainedTokenizerFast):
     </Tip>
     Args:
-    text (`str`, `List[str]` or `List[int]` (the latter only for not-fast tokenizers)):
+    text (`str`, `List[str]` or (for non-fast tokenizers) `List[int]`):
     The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
     `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
     method).
...
@@ -243,7 +243,7 @@ class ViltImageProcessor(BaseImageProcessor):
     Image to resize.
     size (`Dict[str, int]`):
     Controls the size of the output image. Should be of the form `{"shortest_edge": int}`.
-    size_divisor (`int`, defaults to 32):
+    size_divisor (`int`, *optional*, defaults to 32):
     The image is resized to a size that is a multiple of this value.
     resample (`PILImageResampling` filter, *optional*, defaults to `PILImageResampling.BICUBIC`):
     Resampling filter to use when resiizing the image.
...
@@ -182,7 +182,7 @@ def add_decomposed_relative_positions(attn, queries, rel_pos_h, rel_pos_w, q_siz
     Relative position embeddings (Lw, num_channels) for width axis.
     q_size (`Tuple[int]`):
     Spatial sequence size of query q with (queries_height, queries_width).
-    k_size (`Tuple[int]`]):
+    k_size (`Tuple[int]`):
     Spatial sequence size of key k with (keys_height, keys_width).
     Returns:
...
@@ -36,11 +36,11 @@ class Wav2Vec2FeatureExtractor(SequenceFeatureExtractor):
     most of the main methods. Users should refer to this superclass for more information regarding those methods.
     Args:
-    feature_size (`int`, defaults to 1):
+    feature_size (`int`, *optional*, defaults to 1):
     The feature dimension of the extracted features.
-    sampling_rate (`int`, defaults to 16000):
+    sampling_rate (`int`, *optional*, defaults to 16000):
     The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
-    padding_value (`float`, defaults to 0.0):
+    padding_value (`float`, *optional*, defaults to 0.0):
     The value that is used to fill the padding values.
     do_normalize (`bool`, *optional*, defaults to `True`):
     Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly
@@ -166,7 +166,7 @@ class Wav2Vec2FeatureExtractor(SequenceFeatureExtractor):
     sampling_rate (`int`, *optional*):
     The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
     `sampling_rate` at the forward call to prevent silent errors.
-    padding_value (`float`, defaults to 0.0):
+    padding_value (`float`, *optional*, defaults to 0.0):
     """
     if sampling_rate is not None:
...
@@ -184,9 +184,9 @@ class Wav2Vec2ConformerConfig(PretrainedConfig):
     If `"rotary"` position embeddings are used, defines the size of the embedding base.
     max_source_positions (`int`, *optional*, defaults to 5000):
     if `"relative"` position embeddings are used, defines the maximum source input positions.
-    conv_depthwise_kernel_size (`int`, defaults to 31):
+    conv_depthwise_kernel_size (`int`, *optional*, defaults to 31):
     Kernel size of convolutional depthwise 1D layer in Conformer blocks.
-    conformer_conv_dropout (`float`, defaults to 0.1):
+    conformer_conv_dropout (`float`, *optional*, defaults to 0.1):
     The dropout probability for all convolutional layers in Conformer blocks.
     Example:
...
@@ -44,16 +44,16 @@ class WhisperFeatureExtractor(SequenceFeatureExtractor):
     Fourier Transform` which should match pytorch's `torch.stft` equivalent.
     Args:
-    feature_size (`int`, defaults to 80):
+    feature_size (`int`, *optional*, defaults to 80):
     The feature dimension of the extracted features.
-    sampling_rate (`int`, defaults to 16000):
+    sampling_rate (`int`, *optional*, defaults to 16000):
     The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
-    hop_length (`int`, defaults to 160):
+    hop_length (`int`, *optional*, defaults to 160):
     Length of the overlaping windows for the STFT used to obtain the Mel Frequency coefficients.
-    chunk_length (`int`, defaults to 30):
+    chunk_length (`int`, *optional*, defaults to 30):
     The maximum number of chuncks of `sampling_rate` samples used to trim and pad longer or shorter audio
     sequences.
-    n_fft (`int`, defaults to 400):
+    n_fft (`int`, *optional*, defaults to 400):
     Size of the Fourier transform.
     padding_value (`float`, *optional*, defaults to 0.0):
     Padding value used to pad the audio. Should correspond to silences.
@@ -231,7 +231,7 @@ class WhisperFeatureExtractor(SequenceFeatureExtractor):
     The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
     `sampling_rate` at the forward call to prevent silent errors and allow automatic speech recognition
     pipeline.
-    padding_value (`float`, defaults to 0.0):
+    padding_value (`float`, *optional*, defaults to 0.0):
     The value that is used to fill the padding values / vectors.
     do_normalize (`bool`, *optional*, defaults to `False`):
     Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly
...
@@ -1368,7 +1368,7 @@ class WhisperGenerationMixin:
     priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model
     configuration. Please note that unspecified parameters will inherit [`~generation.GenerationConfig`]'s
     default values, whose documentation should be checked to parameterize generation.
-    num_segment_frames (`int`, defaults to 3000):
+    num_segment_frames (`int`, *optional*, defaults to 3000):
     The number of log-mel frames the model expects
     Return:
...
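The 3000 default for `num_segment_frames` follows from the `WhisperFeatureExtractor` defaults in the hunks above: 30-second chunks sampled at 16 kHz, with one log-mel frame every 160 samples. The arithmetic is:

```python
chunk_length = 30      # seconds per segment (WhisperFeatureExtractor default)
sampling_rate = 16000  # Hz (WhisperFeatureExtractor default)
hop_length = 160       # samples between successive STFT windows

# One frame per hop: 30 s * 16000 samples/s / 160 samples/frame
num_segment_frames = chunk_length * sampling_rate // hop_length
print(num_segment_frames)  # 3000
```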
@@ -565,7 +565,7 @@ class WhisperTokenizer(PreTrainedTokenizer):
     Args:
     token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
     List of tokenized input ids. Can be obtained using the `__call__` method.
-    time_precision (`float`, `optional`, defaults to 0.02):
+    time_precision (`float`, *optional*, defaults to 0.02):
     The time ratio to convert from token to time.
     """
     offsets = []
@@ -615,7 +615,7 @@ class WhisperTokenizer(PreTrainedTokenizer):
     Compute the timestamp token ids for a given precision and save to least-recently used (LRU) cache.
     Args:
-    time_precision (`float`, `optional`, defaults to 0.02):
+    time_precision (`float`, *optional*, defaults to 0.02):
     The time ratio to convert from token to time.
     """
     return self.convert_tokens_to_ids([("<|%.2f|>" % (i * time_precision)) for i in range(1500 + 1)])
@@ -671,7 +671,7 @@ class WhisperTokenizer(PreTrainedTokenizer):
     output_offsets (`bool`, *optional*, defaults to `False`):
     Whether or not to output the offsets of the tokens. This should only be set if the model predicted
     timestamps.
-    time_precision (`float`, `optional`, defaults to 0.02):
+    time_precision (`float`, *optional*, defaults to 0.02):
     The time ratio to convert from token to time.
     decode_with_timestamps (`bool`, *optional*, defaults to `False`):
     Whether or not to decode with timestamps included in the raw text.
...
@@ -207,7 +207,7 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
     Args:
     token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
     List of tokenized input ids. Can be obtained using the `__call__` method.
-    time_precision (`float`, `optional`, defaults to 0.02):
+    time_precision (`float`, *optional*, defaults to 0.02):
     The time ratio to convert from token to time.
     """
     offsets = []
@@ -258,7 +258,7 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
     Compute the timestamp token ids for a given precision and save to least-recently used (LRU) cache.
     Args:
-    time_precision (`float`, `optional`, defaults to 0.02):
+    time_precision (`float`, *optional*, defaults to 0.02):
     The time ratio to convert from token to time.
     """
     return self.convert_tokens_to_ids([("<|%.2f|>" % (i * time_precision)) for i in range(1500 + 1)])
@@ -317,7 +317,7 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
     output_offsets (`bool`, *optional*, defaults to `False`):
     Whether or not to output the offsets of the tokens. This should only be set if the model predicted
     timestamps.
-    time_precision (`float`, `optional`, defaults to 0.02):
+    time_precision (`float`, *optional*, defaults to 0.02):
     The time ratio to convert from token to time.
     decode_with_timestamps (`bool`, *optional*, defaults to `False`):
     Whether or not to decode with timestamps included in the raw text.
...
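The Whisper tokenizer hunks above show the method body that builds the timestamp token strings: one token per `time_precision` step. The string side of that computation can be checked standalone (the vocabulary lookup via `convert_tokens_to_ids` is omitted here, since it needs a loaded tokenizer):

```python
time_precision = 0.02  # seconds per timestamp token, the documented default

# Same comprehension as in the tokenizer source, minus the vocabulary lookup:
# 1501 tokens covering 0.00 s through 30.00 s in 0.02 s steps.
timestamp_tokens = [("<|%.2f|>" % (i * time_precision)) for i in range(1500 + 1)]

print(timestamp_tokens[0])   # <|0.00|>
print(timestamp_tokens[1])   # <|0.02|>
print(timestamp_tokens[-1])  # <|30.00|>
```

The 30-second ceiling (1500 steps of 0.02 s) matches the segment length the feature extractor produces, which is why the range is hard-coded rather than configurable.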
@@ -1081,7 +1081,7 @@ class XLMRobertaForMaskedLM(XLMRobertaPreTrainedModel):
     Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
     config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
     loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
-    kwargs (`Dict[str, any]`, optional, defaults to *{}*):
+    kwargs (`Dict[str, any]`, *optional*, defaults to `{}`):
     Used to hide legacy arguments that have been deprecated.
     """
     return_dict = return_dict if return_dict is not None else self.config.use_return_dict
...
@@ -1039,7 +1039,7 @@ class XLMRobertaXLForMaskedLM(XLMRobertaXLPreTrainedModel):
     Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
     config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
     loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
-    kwargs (`Dict[str, any]`, optional, defaults to *{}*):
+    kwargs (`Dict[str, any]`, *optional*, defaults to `{}`):
     Used to hide legacy arguments that have been deprecated.
     """
     return_dict = return_dict if return_dict is not None else self.config.use_return_dict
...
@@ -1173,7 +1173,7 @@ class XmodForMaskedLM(XmodPreTrainedModel):
     Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
     config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
     loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
-    kwargs (`Dict[str, any]`, optional, defaults to *{}*):
+    kwargs (`Dict[str, any]`, *optional*, defaults to *{}*):
     Used to hide legacy arguments that have been deprecated.
     """
     return_dict = return_dict if return_dict is not None else self.config.use_return_dict
...