Unverified Commit 27b3031d authored by Sylvain Gugger, committed by GitHub

Mass conversion of documentation from rst to Markdown (#14866)

* Convert docstrings of all configurations and tokenizers

* Processors and fixes

* Last modeling files and fixes to models

* Pipeline modules

* Utils files

* Data submodule

* All the other files

* Style

* Missing examples

* Style again

* Fix copies

* Say bye bye to rst docstrings forever
parent 18587639
@@ -523,10 +523,11 @@ MARIAN_START_DOCSTRING = r"""
MARIAN_GENERATION_EXAMPLE = r"""
Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
Available models are listed [here](https://huggingface.co/models?search=Helsinki-NLP).
Examples:
```python
>>> from transformers import MarianTokenizer, MarianMTModel
>>> from typing import List
>>> src = 'fr' # source language
@@ -540,6 +541,7 @@ MARIAN_GENERATION_EXAMPLE = r"""
>>> gen = model.generate(**batch)
>>> tokenizer.batch_decode(gen, skip_special_tokens=True)
"Where is the bus stop ?"
```
"""
MARIAN_INPUTS_DOCSTRING = r"""
@@ -1124,8 +1126,9 @@ class MarianModel(MarianPreTrainedModel):
r"""
Returns:
Example:
```python
>>> from transformers import MarianTokenizer, MarianModel
>>> tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
@@ -1137,7 +1140,7 @@ class MarianModel(MarianPreTrainedModel):
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
>>> last_hidden_states = outputs.last_hidden_state
"""
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
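For a runnable reference, here is a minimal French-to-English sketch of the generation flow documented above; the `Helsinki-NLP/opus-mt-fr-en` checkpoint is assumed for illustration, and any OPUS-MT pair works the same way:

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"  # assumed checkpoint for illustration
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the source sentences, generate, then decode the translations.
batch = tokenizer(["Où est l'arrêt de bus ?"], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```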
@@ -555,10 +555,11 @@ MARIAN_START_DOCSTRING = r"""
MARIAN_GENERATION_EXAMPLE = r"""
TF version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints. Available
models are listed [here](https://huggingface.co/models?search=Helsinki-NLP).
Examples:
```python
>>> from transformers import MarianTokenizer, TFMarianMTModel
>>> from typing import List
>>> src = 'fr' # source language
@@ -572,6 +573,7 @@ MARIAN_GENERATION_EXAMPLE = r"""
>>> gen = model.generate(**batch)
>>> tokenizer.batch_decode(gen, skip_special_tokens=True)
"Where is the bus stop ?"
```
"""
MARIAN_INPUTS_DOCSTRING = r"""
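The TF docstring mirrors the PyTorch one; a minimal sketch with TensorFlow tensors, again assuming the `Helsinki-NLP/opus-mt-fr-en` checkpoint:

```python
from transformers import MarianTokenizer, TFMarianMTModel

model_name = "Helsinki-NLP/opus-mt-fr-en"  # assumed checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
# add from_pt=True below if the checkpoint only ships PyTorch weights
model = TFMarianMTModel.from_pretrained(model_name)

batch = tokenizer(["Où est l'arrêt de bus ?"], return_tensors="tf", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```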
@@ -55,50 +55,50 @@ PRETRAINED_INIT_CONFIGURATION = {}
class MarianTokenizer(PreTrainedTokenizer):
r"""
Construct a Marian tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
source_spm (`str`):
[SentencePiece](https://github.com/google/sentencepiece) file (generally has a .spm extension) that
contains the vocabulary for the source language.
target_spm (`str`):
[SentencePiece](https://github.com/google/sentencepiece) file (generally has a .spm extension) that
contains the vocabulary for the target language.
source_lang (`str`, *optional*):
A string representing the source language.
target_lang (`str`, *optional*):
A string representing the target language.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
model_max_length (`int`, *optional*, defaults to 512):
The maximum sentence length the model accepts.
additional_special_tokens (`List[str]`, *optional*, defaults to `["<eop>", "<eod>"]`):
Additional special tokens used by the tokenizer.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set:
- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
- `nbest_size = {0,1}`: No sampling is performed.
- `nbest_size > 1`: samples from the nbest_size results.
- `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
using forward-filtering-and-backward-sampling algorithm.
- `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
BPE-dropout.
Examples:
```python
>>> from transformers import MarianTokenizer
>>> tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
>>> src_texts = [ "I am a small frog.", "Tom asked his teacher for advice."]
@@ -109,7 +109,7 @@ class MarianTokenizer(PreTrainedTokenizer):
>>> inputs["labels"] = labels["input_ids"]
# keys [input_ids, attention_mask, labels].
>>> outputs = model(**inputs)  # should work
"""
```"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
@@ -202,20 +202,20 @@ class MarianTokenizer(PreTrainedTokenizer):
Convert a list of lists of token ids into a list of strings by calling decode.
Args:
sequences (`Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
use_source_tokenizer (`bool`, *optional*, defaults to `False`):
Whether or not to use the source tokenizer to decode sequences (only applicable in sequence-to-sequence
problems).
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.
Returns:
`List[str]`: The list of decoded sentences.
"""
return super().batch_decode(sequences, **kwargs)
@@ -224,23 +224,23 @@ class MarianTokenizer(PreTrainedTokenizer):
Converts a sequence of ids in a string, using the tokenizer and vocabulary with options to remove special
tokens and clean up tokenization spaces.
Similar to doing `self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))`.
Args:
token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
use_source_tokenizer (`bool`, *optional*, defaults to `False`):
Whether or not to use the source tokenizer to decode sequences (only applicable in sequence-to-sequence
problems).
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.
Returns:
`str`: The decoded sentence.
"""
return super().decode(token_ids, **kwargs)
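A minimal sketch of the source/target encoding flow these docstrings describe, using the `Helsinki-NLP/opus-mt-en-de` checkpoint from the example above:

```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
src_texts = ["I am a small frog.", "Tom asked his teacher for advice."]
tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."]

# Sources use the source SentencePiece model; targets must be encoded
# under as_target_tokenizer() so the target vocabulary is used.
inputs = tokenizer(src_texts, return_tensors="pt", padding=True)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, return_tensors="pt", padding=True)
inputs["labels"] = labels["input_ids"]

# inputs now holds input_ids, attention_mask and labels for MarianMTModel(**inputs).
print(tokenizer.batch_decode(inputs["labels"], skip_special_tokens=True))
```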
@@ -32,66 +32,66 @@ MBART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MBartConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`MBartModel`]. It is used to
instantiate an MBART model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the MBART [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 50265):
Vocabulary size of the MBART model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`MBartModel`] or
[`TFMBartModel`].
d_model (`int`, *optional*, defaults to 1024):
Dimensionality of the layers and the pooler layer.
encoder_layers (`int`, *optional*, defaults to 12):
Number of encoder layers.
decoder_layers (`int`, *optional*, defaults to 12):
Number of decoder layers.
encoder_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (`int`, *optional*, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (`int`, *optional*, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in encoder.
activation_function (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
`"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
classifier_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the classifier.
max_position_embeddings (`int`, *optional*, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
init_std (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
encoder_layerdrop: (`float`, *optional*, defaults to 0.0):
The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more details.
decoder_layerdrop: (`float`, *optional*, defaults to 0.0):
The LayerDrop probability for the decoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more details.
scale_embedding (`bool`, *optional*, defaults to `False`):
Scale embeddings by dividing by sqrt(d_model).
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
forced_eos_token_id (`int`, *optional*, defaults to 2):
The id of the token to force as the last generated token when `max_length` is reached. Usually set to
`eos_token_id`.
Example:
```python
>>> from transformers import MBartModel, MBartConfig
>>> # Initializing a MBART facebook/mbart-large-cc25 style configuration
@@ -102,7 +102,7 @@ class MBartConfig(PretrainedConfig):
>>> # Accessing the model configuration
>>> configuration = model.config
"""
```"""
model_type = "mbart"
keys_to_ignore_at_inference = ["past_key_values"]
attribute_map = {"num_attention_heads": "encoder_attention_heads", "hidden_size": "d_model"}
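Beyond the default construction in the example above, the configuration arguments can be overridden at instantiation; a small sketch, noting that the `attribute_map` on the class makes `num_attention_heads` an alias of `encoder_attention_heads`:

```python
from transformers import MBartConfig, MBartModel

# Override a few architecture hyper-parameters, then build a randomly
# initialized model from the configuration.
configuration = MBartConfig(encoder_layers=6, decoder_layers=6, d_model=512)
model = MBartModel(configuration)
print(model.config.d_model)              # 512
print(model.config.num_attention_heads)  # alias of encoder_attention_heads (16 by default)
```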
@@ -1041,8 +1041,9 @@ class FlaxMBartPreTrainedModel(FlaxPreTrainedModel):
r"""
Returns:
Example:
```python
>>> from transformers import MBartTokenizer, FlaxMBartForConditionalGeneration
>>> model = FlaxMBartForConditionalGeneration.from_pretrained('facebook/mbart-large-cc25')
@@ -1051,7 +1052,7 @@ class FlaxMBartPreTrainedModel(FlaxPreTrainedModel):
>>> text = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer(text, max_length=1024, return_tensors='jax')
>>> encoder_outputs = model.encode(**inputs)
"""
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
@@ -1106,8 +1107,9 @@ class FlaxMBartPreTrainedModel(FlaxPreTrainedModel):
r"""
Returns:
Example:
```python
>>> from transformers import MBartTokenizer, FlaxMBartForConditionalGeneration
>>> model = FlaxMBartForConditionalGeneration.from_pretrained('facebook/mbart-large-cc25')
@@ -1122,7 +1124,7 @@ class FlaxMBartPreTrainedModel(FlaxPreTrainedModel):
>>> outputs = model.decode(decoder_input_ids, encoder_outputs)
>>> last_decoder_hidden_states = outputs.last_hidden_state
"""
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
@@ -1372,8 +1374,9 @@ class FlaxMBartForConditionalGeneration(FlaxMBartPreTrainedModel):
r"""
Returns:
Example:
```python
>>> from transformers import MBartTokenizer, FlaxMBartForConditionalGeneration
>>> model = FlaxMBartForConditionalGeneration.from_pretrained('facebook/mbart-large-cc25')
@@ -1388,7 +1391,7 @@ class FlaxMBartForConditionalGeneration(FlaxMBartPreTrainedModel):
>>> outputs = model.decode(decoder_input_ids, encoder_outputs)
>>> logits = outputs.logits
"""
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
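The encode/decode split shown in these docstrings composes end to end as follows; a minimal sketch following the documented `facebook/mbart-large-cc25` example:

```python
import jax.numpy as jnp
from transformers import MBartTokenizer, FlaxMBartForConditionalGeneration

model = FlaxMBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")

text = "My friends are cool but they eat too many carbs."
inputs = tokenizer(text, max_length=1024, return_tensors="jax")

# Run the encoder once, then drive the decoder from its outputs.
encoder_outputs = model.encode(**inputs)
decoder_input_ids = jnp.ones((inputs["input_ids"].shape[0], 1), dtype="i4") * model.config.decoder_start_token_id
outputs = model.decode(decoder_input_ids, encoder_outputs)
print(outputs.logits.shape)
```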
@@ -71,15 +71,16 @@ class MBartTokenizer(XLMRobertaTokenizer):
"""
Construct an MBART tokenizer.
[`MBartTokenizer`] is a subclass of [`XLMRobertaTokenizer`]. Refer to
superclass [`XLMRobertaTokenizer`] for usage examples and documentation concerning the
initialization parameters and other methods.
The tokenization method is `<tokens> <eos> <language code>` for source language documents, and `<language code>
<tokens> <eos>` for target language documents.
Examples:
```python
>>> from transformers import MBartTokenizer
>>> tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro', src_lang="en_XX", tgt_lang="ro_RO")
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
@@ -88,7 +89,7 @@ class MBartTokenizer(XLMRobertaTokenizer):
>>> with tokenizer.as_target_tokenizer():
... labels = tokenizer(expected_translation_romanian, return_tensors="pt")
>>> inputs["labels"] = labels["input_ids"]
"""
```"""
vocab_files_names = VOCAB_FILES_NAMES
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
@@ -149,18 +150,18 @@ class MBartTokenizer(XLMRobertaTokenizer):
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
@@ -179,22 +180,22 @@ class MBartTokenizer(XLMRobertaTokenizer):
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. An MBART sequence has the following format, where `X` represents the sequence:
- `input_ids` (for encoder) `X [eos, src_lang_code]`
- `decoder_input_ids`: (for decoder) `X [eos, tgt_lang_code]`
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a
separator.
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
if token_ids_1 is None:
return self.prefix_tokens + token_ids_0 + self.suffix_tokens
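The `X [eos, src_lang_code]` layout is easy to verify directly; a small sketch with the `facebook/mbart-large-en-ro` checkpoint from the example above:

```python
from transformers import MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX")

# The language code is appended after </s>: the encoder format is `X [eos, src_lang_code]`.
ids = tokenizer("UN Chief Says There Is No Military Solution in Syria")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids)[-2:])  # expected: ['</s>', 'en_XX']
```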
@@ -82,18 +82,18 @@ FAIRSEQ_LANGUAGE_CODES = [
class MBartTokenizerFast(XLMRobertaTokenizerFast):
"""
Construct a "fast" MBART tokenizer (backed by HuggingFace's `tokenizers` library). Based on `BPE
<https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models>`__.
Construct a "fast" MBART tokenizer (backed by HuggingFace's *tokenizers* library). Based on [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).
[`MBartTokenizerFast`] is a subclass of [`XLMRobertaTokenizerFast`]. Refer to
superclass [`XLMRobertaTokenizerFast`] for usage examples and documentation concerning the
initialization parameters and other methods.
The tokenization method is `<tokens> <eos> <language code>` for source language documents, and `<language code>
<tokens> <eos>` for target language documents.
Examples:
```python
>>> from transformers import MBartTokenizerFast
>>> tokenizer = MBartTokenizerFast.from_pretrained('facebook/mbart-large-en-ro', src_lang="en_XX", tgt_lang="ro_RO")
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
@@ -102,7 +102,7 @@ class MBartTokenizerFast(XLMRobertaTokenizerFast):
>>> with tokenizer.as_target_tokenizer():
... labels = tokenizer(expected_translation_romanian, return_tensors="pt")
>>> inputs["labels"] = labels["input_ids"]
"""
```"""
vocab_files_names = VOCAB_FILES_NAMES
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
@@ -164,22 +164,22 @@ class MBartTokenizerFast(XLMRobertaTokenizerFast):
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. The special tokens depend on calling set_lang.
An MBART sequence has the following format, where `X` represents the sequence:
- `input_ids` (for encoder) `X [eos, src_lang_code]`
- `decoder_input_ids`: (for decoder) `X [eos, tgt_lang_code]`
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a
separator.
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
if token_ids_1 is None:
return self.prefix_tokens + token_ids_0 + self.suffix_tokens
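Since the special tokens depend on the current language, switching `src_lang` on the fast tokenizer rebuilds them; a sketch, assuming the `src_lang` setter behaves as on the slow tokenizer:

```python
from transformers import MBartTokenizerFast

tokenizer = MBartTokenizerFast.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")

en_ids = tokenizer("Hello world")["input_ids"]
tokenizer.src_lang = "ro_RO"  # rebuilds the prefix/suffix special tokens
ro_ids = tokenizer("Salut lume")["input_ids"]
print(en_ids[-1], ro_ids[-1])  # different language-code suffix ids
```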
@@ -47,52 +47,52 @@ FAIRSEQ_LANGUAGE_CODES = ["ar_AR", "cs_CZ", "de_DE", "en_XX", "es_XX", "et_EE",
class MBart50Tokenizer(PreTrainedTokenizer):
"""
Construct a MBart50 tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
Path to the vocabulary file.
src_lang (`str`, *optional*):
A string representing the source language.
tgt_lang (`str`, *optional*):
A string representing the target language.
eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.
sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
mask_token (`str`, *optional*, defaults to `"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set:
- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
- `nbest_size = {0,1}`: No sampling is performed.
- `nbest_size > 1`: samples from the nbest_size results.
- `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
using forward-filtering-and-backward-sampling algorithm.
- `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
BPE-dropout.
Examples:
```python
>>> from transformers import MBart50Tokenizer
>>> tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
>>> src_text = " UN Chief Says There Is No Military Solution in Syria"
@@ -101,7 +101,7 @@ class MBart50Tokenizer(PreTrainedTokenizer):
>>> with tokenizer.as_target_tokenizer():
... labels = tokenizer(tgt_text, return_tensors="pt").input_ids
>>> # model(**model_inputs, labels=labels) should work
"""
```"""
vocab_files_names = VOCAB_FILES_NAMES
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
@@ -252,18 +252,18 @@ class MBart50Tokenizer(PreTrainedTokenizer):
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
@@ -282,22 +282,22 @@ class MBart50Tokenizer(PreTrainedTokenizer):
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. An MBART-50 sequence has the following format, where `X` represents the sequence:
- `input_ids` (for encoder) `[src_lang_code] X [eos]`
- `labels`: (for decoder) `[tgt_lang_code] X [eos]`
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a
separator.
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
if token_ids_1 is None:
return self.prefix_tokens + token_ids_0 + self.suffix_tokens
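Note the contrast with MBART above: mBART-50 places the language code first. A small sketch to check the `[src_lang_code] X [eos]` layout:

```python
from transformers import MBart50Tokenizer

tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50", src_lang="en_XX")

ids = tokenizer("UN Chief Says There Is No Military Solution in Syria")["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens[0], tokens[-1])  # expected: en_XX </s>
```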
@@ -56,39 +56,39 @@ FAIRSEQ_LANGUAGE_CODES = ["ar_AR", "cs_CZ", "de_DE", "en_XX", "es_XX", "et_EE",
class MBart50TokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" MBART tokenizer for mBART-50 (backed by HuggingFace's `tokenizers` library). Based on `BPE
<https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models>`__.
Construct a "fast" MBART tokenizer for mBART-50 (backed by HuggingFace's *tokenizers* library). Based on [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
Path to the vocabulary file.
src_lang (`str`, *optional*):
A string representing the source language.
tgt_lang (`str`, *optional*):
A string representing the target language.
eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.
sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
mask_token (`str`, *optional*, defaults to `"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
Examples:
```python
>>> from transformers import MBart50TokenizerFast
>>> tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
>>> src_text = " UN Chief Says There Is No Military Solution in Syria"
@@ -97,7 +97,7 @@ class MBart50TokenizerFast(PreTrainedTokenizerFast):
>>> with tokenizer.as_target_tokenizer():
... labels = tokenizer(tgt_text, return_tensors="pt").input_ids
>>> # model(**model_inputs, labels=labels) should work
"""
```"""
vocab_files_names = VOCAB_FILES_NAMES
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
@@ -172,22 +172,22 @@ class MBart50TokenizerFast(PreTrainedTokenizerFast):
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. The special tokens depend on calling set_lang.
An MBART-50 sequence has the following format, where `X` represents the sequence:
- `input_ids` (for encoder) `[src_lang_code] X [eos]`
- `labels`: (for decoder) `[tgt_lang_code] X [eos]`
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a
separator.
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
if token_ids_1 is None:
return self.prefix_tokens + token_ids_0 + self.suffix_tokens
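For translation, the target language code is forced as the first generated token; a minimal sketch, with the many-to-many checkpoint name assumed for illustration:

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"  # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# mBART-50 decodes with the target language code as the first token,
# so translation forces it via forced_bos_token_id.
encoded = tokenizer("UN Chief Says There Is No Military Solution in Syria", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.lang_code_to_id["ro_RO"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```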
@@ -27,57 +27,56 @@ MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MegatronBertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`MegatronBertModel`]. It is
used to instantiate a MEGATRON_BERT model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the MEGATRON_BERT
[megatron-bert-uncased-345m](https://huggingface.co/nvidia/megatron-bert-uncased-345m) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 29056):
Vocabulary size of the MEGATRON_BERT model. Defines the number of different tokens that can be represented
by the `inputs_ids` passed when calling [`MegatronBertModel`].
hidden_size (`int`, *optional*, defaults to 1024):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (`int`, *optional*, defaults to 24):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (`int`, *optional*, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
`"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the `token_type_ids` passed when calling
[`MegatronBertModel`].
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`,
`"relative_key_query"`. For positional embeddings use `"absolute"`. For more information on
`"relative_key"`, please refer to [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155). For more information on `"relative_key_query"`, please refer to
*Method 4* in [Improve Transformer Models with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
Examples:
```python
>>> from transformers import MegatronBertModel, MegatronBertConfig
>>> # Initializing a MEGATRON_BERT bert-base-uncased style configuration
@@ -88,7 +87,7 @@ class MegatronBertConfig(PretrainedConfig):
>>> # Accessing the model configuration
>>> configuration = model.config
"""
```"""
model_type = "megatron-bert"
def __init__(
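A short sketch of the configuration flow from the example above, including the `position_embedding_type` switch described in the Args:

```python
from transformers import MegatronBertConfig, MegatronBertModel

# Defaults match the nvidia/megatron-bert-uncased-345m architecture.
configuration = MegatronBertConfig()
model = MegatronBertModel(configuration)
print(model.config.hidden_size)  # 1024

# Relative position embeddings are selected through position_embedding_type.
relative_config = MegatronBertConfig(position_embedding_type="relative_key")
```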
@@ -72,159 +72,164 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
}
ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING = r"""
return_token_type_ids (`bool`, *optional*):
Whether to return token type IDs. If left to the default, will return the token type IDs according to
the specific tokenizer's default, defined by the `return_outputs` attribute.
[What are token type IDs?](../glossary#token-type-ids)
return_attention_mask (`bool`, *optional*):
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer's default, defined by the `return_outputs` attribute.
[What are attention masks?](../glossary#attention-mask)
return_overflowing_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
of pairs) is provided with `truncation_strategy = longest_first` or `True`, an error is
raised instead of returning overflowing tokens.
return_special_tokens_mask (`bool`, *optional*, defaults to `False`):
Whether or not to return special tokens mask information.
return_offsets_mapping (`bool`, *optional*, defaults to `False`):
Whether or not to return `(char_start, char_end)` for each token.
This is only available on fast tokenizers inheriting from
[`PreTrainedTokenizerFast`], if using Python's tokenizer, this method will raise
`NotImplementedError`.
return_length (`bool`, *optional*, defaults to `False`):
Whether or not to return the lengths of the encoded inputs.
verbose (`bool`, *optional*, defaults to `True`):
Whether or not to print more information and warnings.
**kwargs: passed to the `self.tokenize()` method
Return:
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model.
[What are input IDs?](../glossary#input-ids)
- **token_type_ids** -- List of token type ids to be fed to a model (when `return_token_type_ids=True`
or if *"token_type_ids"* is in `self.model_input_names`).
[What are token type IDs?](../glossary#token-type-ids)
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names`).
[What are attention masks?](../glossary#attention-mask)
- **entity_ids** -- List of entity ids to be fed to a model.
[What are input IDs?](../glossary#input-ids)
- **entity_position_ids** -- List of entity positions in the input sequence to be fed to a model.
- **entity_token_type_ids** -- List of entity token type ids to be fed to a model (when
`return_token_type_ids=True` or if *"entity_token_type_ids"* is in `self.model_input_names`).
[What are token type IDs?](../glossary#token-type-ids)
- **entity_attention_mask** -- List of indices specifying which entities should be attended to by the model
(when `return_attention_mask=True` or if *"entity_attention_mask"* is in
`self.model_input_names`).
[What are attention masks?](../glossary#attention-mask)
- **entity_start_positions** -- List of the start positions of entities in the word token sequence (when
`task="entity_span_classification"`).
- **entity_end_positions** -- List of the end positions of entities in the word token sequence (when
`task="entity_span_classification"`).
- **overflowing_tokens** -- List of overflowing tokens sequences (when a `max_length` is specified and
`return_overflowing_tokens=True`).
- **num_truncated_tokens** -- Number of tokens truncated (when a `max_length` is specified and
`return_overflowing_tokens=True`).
- **special_tokens_mask** -- List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
regular sequence tokens (when `add_special_tokens=True` and `return_special_tokens_mask=True`).
- **length** -- The length of the inputs (when `return_length=True`)
"""
class MLukeTokenizer(PreTrainedTokenizer):
"""
Adapted from [`XLMRobertaTokenizer`] and [`LukeTokenizer`]. Based on
[SentencePiece](https://github.com/google/sentencepiece).
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
Path to the vocabulary file.
entity_vocab_file (`str`):
Path to the entity vocabulary file.
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
<Tip>
When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the `cls_token`.
</Tip>
eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.
</Tip>
sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
mask_token (`str`, *optional*, defaults to `"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
task (`str`, *optional*):
Task for which you want to prepare sequences. One of `"entity_classification"`,
`"entity_pair_classification"`, or `"entity_span_classification"`. If you specify this argument,
the entity sequence is automatically created based on the given entity span(s).
max_entity_length (`int`, *optional*, defaults to 32):
The maximum length of `entity_ids`.
max_mention_length (`int`, *optional*, defaults to 30):
The maximum number of tokens inside an entity span.
entity_token_1 (`str`, *optional*, defaults to `<ent>`):
The special token used to represent an entity span in a word token sequence. This token is only used when
`task` is set to `"entity_classification"` or `"entity_pair_classification"`.
entity_token_2 (`str`, *optional*, defaults to `<ent2>`):
The special token used to represent an entity span in a word token sequence. This token is only used when
`task` is set to `"entity_pair_classification"`.
additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
Additional special tokens used by the tokenizer.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set:
- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
- `nbest_size = {0,1}`: No sampling is performed.
- `nbest_size > 1`: samples from the nbest_size results.
- `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
using forward-filtering-and-backward-sampling algorithm.
- `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
BPE-dropout.
Attributes:
sp_model (`SentencePieceProcessor`):
The *SentencePiece* processor that is used for every conversion (string, tokens and IDs).
"""
vocab_files_names = VOCAB_FILES_NAMES
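The entity-span inputs documented here are easiest to see end to end; a minimal sketch, assuming the `studio-ousia/mluke-base` checkpoint for illustration:

```python
from transformers import MLukeTokenizer

tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")  # assumed checkpoint

text = "Beyoncé lives in Los Angeles."
# Character-based (start, end) spans for "Beyoncé" and "Los Angeles".
entity_spans = [(0, 7), (17, 28)]
encoding = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
# With no `entities` given, the entity sequence is filled with the [MASK] entity.
print(encoding["entity_ids"].shape)
```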
@@ -373,39 +378,39 @@ class MLukeTokenizer(PreTrainedTokenizer):
sequences, depending on the task you want to prepare them for.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence must be a string. Note that this
tokenizer does not support tokenization based on pretokenized strings.
text_pair (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence must be a string. Note that this
tokenizer does not support tokenization based on pretokenized strings.
entity_spans (`List[Tuple[int, int]]`, `List[List[Tuple[int, int]]]`, *optional*):
The sequence or batch of sequences of entity spans to be encoded. Each sequence consists of tuples each
with two integers denoting character-based start and end positions of entities. If you specify
`"entity_classification"` or `"entity_pair_classification"` as the `task` argument in the
constructor, the length of each sequence must be 1 or 2, respectively. If you specify `entities`, the
length of each sequence must be equal to the length of each sequence of `entities`.
entity_spans_pair (`List[Tuple[int, int]]`, `List[List[Tuple[int, int]]]`, *optional*):
The sequence or batch of sequences of entity spans to be encoded. Each sequence consists of tuples each
with two integers denoting character-based start and end positions of entities. If you specify the
`task` argument in the constructor, this argument is ignored. If you specify `entities_pair`, the
length of each sequence must be equal to the length of each sequence of `entities_pair`.
entities (`List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences of entities to be encoded. Each sequence consists of strings
representing entities, i.e., special entities (e.g., [MASK]) or entity titles of Wikipedia (e.g., Los
Angeles). This argument is ignored if you specify the `task` argument in the constructor. The length
of each sequence must be equal to the length of each sequence of `entity_spans`. If you specify
`entity_spans` without specifying this argument, the entity sequence or the batch of entity sequences
is automatically constructed by filling it with the [MASK] entity.
entities_pair (`List[str]`, `List[List[str]]`, *optional*):
The sequence or batch of sequences of entities to be encoded. Each sequence consists of strings
representing entities, i.e., special entities (e.g., [MASK]) or entity titles of Wikipedia (e.g., Los
Angeles). This argument is ignored if you specify the `task` argument in the constructor. The length
of each sequence must be equal to the length of each sequence of `entity_spans_pair`. If you specify
`entity_spans_pair` without specifying this argument, the entity sequence or the batch of entity
sequences is automatically constructed by filling it with the [MASK] entity.
max_entity_length (:obj:`int`, `optional`):
The maximum length of :obj:`entity_ids`.
max_entity_length (`int`, *optional*):
The maximum length of `entity_ids`.
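Example: a minimal sketch of the call described above, assuming the `studio-ousia/mluke-base` checkpoint.

```python
>>> from transformers import MLukeTokenizer

>>> tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
>>> text = "Beyoncé lives in Los Angeles."
>>> # Character-based (start, end) spans for "Beyoncé" and "Los Angeles".
>>> entity_spans = [(0, 7), (17, 28)]
>>> # `entities` is omitted, so both spans are encoded as the [MASK] entity.
>>> encoding = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
```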
"""
# Input type checking for clearer error
is_valid_single_text = isinstance(text, str)
......@@ -969,24 +974,24 @@ class MLukeTokenizer(PreTrainedTokenizer):
Prepares a sequence of input ids, entity ids and entity spans, or a pair of sequences of input ids, entity ids,
entity spans so that it can be used by the model. It adds special tokens, truncates sequences if overflowing
while taking into account the special tokens and manages a moving window (with user defined stride) for
overflowing tokens. Please Note, for `pair_ids` different than `None` and `truncation_strategy = longest_first`
or `True`, it is not possible to return overflowing tokens. Such a combination of arguments will raise an
overflowing tokens. Note that for *pair_ids* different from *None* and *truncation_strategy = longest_first*
or *True*, it is not possible to return overflowing tokens; such a combination of arguments will raise an
error.
Args:
ids (:obj:`List[int]`):
ids (`List[int]`):
Tokenized input ids of the first sequence.
pair_ids (:obj:`List[int]`, `optional`):
pair_ids (`List[int]`, *optional*):
Tokenized input ids of the second sequence.
entity_ids (:obj:`List[int]`, `optional`):
entity_ids (`List[int]`, *optional*):
Entity ids of the first sequence.
pair_entity_ids (:obj:`List[int]`, `optional`):
pair_entity_ids (`List[int]`, *optional*):
Entity ids of the second sequence.
entity_token_spans (:obj:`List[Tuple[int, int]]`, `optional`):
entity_token_spans (`List[Tuple[int, int]]`, *optional*):
Entity spans of the first sequence.
pair_entity_token_spans (:obj:`List[Tuple[int, int]]`, `optional`):
pair_entity_token_spans (`List[Tuple[int, int]]`, *optional*):
Entity spans of the second sequence.
max_entity_length (:obj:`int`, `optional`):
max_entity_length (`int`, *optional*):
The maximum length of the entity sequence.
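Example: a hedged sketch, assuming a tokenizer loaded from `studio-ousia/mluke-base`.

```python
>>> from transformers import MLukeTokenizer

>>> tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
>>> ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Beyoncé lives in Los Angeles."))
>>> encoded = tokenizer.prepare_for_model(ids)  # adds special tokens and builds the dict
>>> # Entity inputs go in as parallel lists, e.g.:
>>> # tokenizer.prepare_for_model(ids, entity_ids=[...], entity_token_spans=[...])
```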
"""
......@@ -1188,46 +1193,45 @@ class MLukeTokenizer(PreTrainedTokenizer):
"""
Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
in the batch. The padding side (left/right) and padding token ids are defined at the tokenizer level (with
``self.padding_side``, ``self.pad_token_id`` and ``self.pad_token_type_id``) .. note:: If the
``encoded_inputs`` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the result
will use the same type unless you provide a different tensor type with ``return_tensors``. In the case of
`self.padding_side`, `self.pad_token_id` and `self.pad_token_type_id`).

<Tip>

If the `encoded_inputs` passed are dictionaries of numpy arrays, PyTorch tensors or TensorFlow tensors, the
result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
PyTorch tensors, however, you will lose the specific device of your tensors.

</Tip>
Args:
encoded_inputs (:class:`~transformers.BatchEncoding`, list of :class:`~transformers.BatchEncoding`, :obj:`Dict[str, List[int]]`, :obj:`Dict[str, List[List[int]]` or :obj:`List[Dict[str, List[int]]]`):
Tokenized inputs. Can represent one input (:class:`~transformers.BatchEncoding` or :obj:`Dict[str,
List[int]]`) or a batch of tokenized inputs (list of :class:`~transformers.BatchEncoding`, `Dict[str,
List[List[int]]]` or `List[Dict[str, List[int]]]`) so you can use this method during preprocessing as
well as in a PyTorch Dataloader collate function. Instead of :obj:`List[int]` you can have tensors
encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `Dict[str, List[int]]`, `Dict[str, List[List[int]]]` or `List[Dict[str, List[int]]]`):
Tokenized inputs. Can represent one input ([`BatchEncoding`] or `Dict[str, List[int]]`) or a batch of
tokenized inputs (list of [`BatchEncoding`], `Dict[str, List[List[int]]]` or `List[Dict[str, List[int]]]`)
so you can use this method during preprocessing as well as in a PyTorch DataLoader collate function.
Instead of `List[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors); see the
note above for the return type.
padding (:obj:`bool`, :obj:`str` or :class:`~transformers.file_utils.PaddingStrategy`, `optional`, defaults to :obj:`True`):
padding (`bool`, `str` or [`~file_utils.PaddingStrategy`], *optional*, defaults to `True`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding
index) among:
* :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a
single sequence is provided).
* :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the
maximum acceptable input length for the model if that argument is not provided.
* :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
different lengths).
max_length (:obj:`int`, `optional`):
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see above).
max_entity_length (:obj:`int`, `optional`):
max_entity_length (`int`, *optional*):
The maximum length of the entity sequence.
pad_to_multiple_of (:obj:`int`, `optional`):
pad_to_multiple_of (`int`, *optional*):
If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable
the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
return_attention_mask (:obj:`bool`, `optional`):
return_attention_mask (`bool`, *optional*):
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute. `What are
attention masks? <../glossary.html#attention-mask>`__
return_tensors (:obj:`str` or :class:`~transformers.file_utils.TensorType`, `optional`):
to the specific tokenizer's default, defined by the `return_outputs` attribute. [What are
attention masks?](../glossary#attention-mask)
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
* :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
* :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
* :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
verbose (:obj:`bool`, `optional`, defaults to :obj:`True`):
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
verbose (`bool`, *optional*, defaults to `True`):
Whether or not to print more information and warnings.
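Example: a minimal sketch, assuming the `studio-ousia/mluke-base` checkpoint and PyTorch installed.

```python
>>> from transformers import MLukeTokenizer

>>> tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
>>> batch = [tokenizer("Short text."), tokenizer("A noticeably longer piece of text.")]
>>> padded = tokenizer.pad(batch, padding="longest", return_tensors="pt")
>>> padded["input_ids"].shape  # (2, length of the longest sequence)
```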
"""
# If we have a list of dicts, let's convert it in a dict of lists
......@@ -1495,17 +1499,17 @@ class MLukeTokenizer(PreTrainedTokenizer):
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. An XLM-RoBERTa sequence has the following format:
- single sequence: ``<s> X </s>``
- pair of sequences: ``<s> A </s></s> B </s>``
- single sequence: `<s> X </s>`
- pair of sequences: `<s> A </s></s> B </s>`
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
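Example: a sketch; the ids shown assume XLM-RoBERTa's default special tokens (`<s>` = 0, `</s>` = 2).

```python
>>> from transformers import MLukeTokenizer

>>> tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
>>> tokenizer.build_inputs_with_special_tokens([100, 200])
[0, 100, 200, 2]
>>> tokenizer.build_inputs_with_special_tokens([100, 200], [300])
[0, 100, 200, 2, 2, 300, 2]
```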
"""
if token_ids_1 is None:
......@@ -1520,18 +1524,18 @@ class MLukeTokenizer(PreTrainedTokenizer):
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
special tokens using the tokenizer `prepare_for_model` method.
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
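Example: a sketch; the special-token ids assume XLM-RoBERTa's defaults (`<s>` = 0, `</s>` = 2).

```python
>>> from transformers import MLukeTokenizer

>>> tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
>>> tokenizer.get_special_tokens_mask([0, 100, 200, 2], already_has_special_tokens=True)
[1, 0, 0, 1]
>>> tokenizer.get_special_tokens_mask([100, 200], [300])
[1, 0, 0, 1, 1, 0, 1]
```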
"""
if already_has_special_tokens:
......@@ -1552,13 +1556,13 @@ class MLukeTokenizer(PreTrainedTokenizer):
not make use of token type ids, therefore a list of zeros is returned.
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of zeros.
`List[int]`: List of zeros.
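Example: a short sketch of the all-zeros output, assuming the `studio-ousia/mluke-base` checkpoint.

```python
>>> from transformers import MLukeTokenizer

>>> tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
>>> # <s> A </s></s> B </s> -> seven positions, all mapped to token type 0.
>>> tokenizer.create_token_type_ids_from_sequences([100, 200], [300])
[0, 0, 0, 0, 0, 0, 0]
```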
"""
......
......@@ -23,15 +23,15 @@ logger = logging.get_logger(__name__)
class MMBTConfig(object):
"""
This is the configuration class to store the configuration of a :class:`~transformers.MMBTModel`. It is used to
This is the configuration class to store the configuration of an [`MMBTModel`]. It is used to
instantiate an MMBT model according to the specified arguments, defining the model architecture.
Args:
config (:class:`~transformers.PreTrainedConfig`):
config ([`PreTrainedConfig`]):
Config of the underlying Transformer models. Its values are copied over to use a single config.
num_labels (:obj:`int`, `optional`):
num_labels (`int`, *optional*):
Size of final Linear layer for classification.
modal_hidden_size (:obj:`int`, `optional`, defaults to 2048):
modal_hidden_size (`int`, *optional*, defaults to 2048):
Embedding dimension of the non-text modality encoder.
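Example: a minimal sketch; MMBT wraps an existing transformer configuration, here a `BertConfig`.

```python
>>> from transformers import BertConfig, MMBTConfig

>>> transformer_config = BertConfig()
>>> # The transformer config's values are copied over; multimodal extras are added on top.
>>> config = MMBTConfig(transformer_config, num_labels=2, modal_hidden_size=2048)
```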
"""
......
......@@ -208,13 +208,14 @@ class MMBTModel(nn.Module, ModuleUtilsMixin):
r"""
Returns:
Examples::
Examples:
```python
# For example purposes. Not runnable as-is: `ImageEncoder`, `args` and `config`
# are user-defined placeholders rather than transformers objects.
from transformers import BertModel, MMBTModel

transformer = BertModel.from_pretrained('bert-base-uncased')
encoder = ImageEncoder(args)  # any module mapping images to embedding vectors
mmbt = MMBTModel(config, transformer, encoder)
"""
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
......
......@@ -27,68 +27,69 @@ MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MobileBertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.MobileBertModel` or a
:class:`~transformers.TFMobileBertModel`. It is used to instantiate a MobileBERT model according to the specified
This is the configuration class to store the configuration of a [`MobileBertModel`] or a
[`TFMobileBertModel`]. It is used to instantiate a MobileBERT model according to the specified
arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the MobileBERT model. Defines the number of different tokens that can be represented by
the :obj:`inputs_ids` passed when calling :class:`~transformers.MobileBertModel` or
:class:`~transformers.TFMobileBertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 512):
the `inputs_ids` passed when calling [`MobileBertModel`] or
[`TFMobileBertModel`].
hidden_size (`int`, *optional*, defaults to 512):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
num_hidden_layers (`int`, *optional*, defaults to 24):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 4):
num_attention_heads (`int`, *optional*, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 512):
intermediate_size (`int`, *optional*, defaults to 512):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"relu"`):
hidden_act (`str` or `function`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.0):
`"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.MobileBertModel`
or :class:`~transformers.TFMobileBertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the `token_type_ids` passed when calling [`MobileBertModel`]
or [`TFMobileBertModel`].
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
pad_token_id (:obj:`int`, `optional`, defaults to 0):
pad_token_id (`int`, *optional*, defaults to 0):
The ID of the token in the word embedding to use as padding.
embedding_size (:obj:`int`, `optional`, defaults to 128):
embedding_size (`int`, *optional*, defaults to 128):
The dimension of the word embedding vectors.
trigram_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
trigram_input (`bool`, *optional*, defaults to `True`):
Use a convolution of trigram as input.
use_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
use_bottleneck (`bool`, *optional*, defaults to `True`):
Whether to use bottleneck in BERT.
intra_bottleneck_size (:obj:`int`, `optional`, defaults to 128):
intra_bottleneck_size (`int`, *optional*, defaults to 128):
Size of bottleneck layer output.
use_bottleneck_attention (:obj:`bool`, `optional`, defaults to :obj:`False`):
use_bottleneck_attention (`bool`, *optional*, defaults to `False`):
Whether to use attention inputs from the bottleneck transformation.
key_query_shared_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
key_query_shared_bottleneck (`bool`, *optional*, defaults to `True`):
Whether to use the same linear transformation for query&key in the bottleneck.
num_feedforward_networks (:obj:`int`, `optional`, defaults to 4):
num_feedforward_networks (`int`, *optional*, defaults to 4):
Number of FFNs in a block.
normalization_type (:obj:`str`, `optional`, defaults to :obj:`"no_norm"`):
normalization_type (`str`, *optional*, defaults to `"no_norm"`):
The normalization type in MobileBERT.
classifier_dropout (:obj:`float`, `optional`):
classifier_dropout (`float`, *optional*):
The dropout ratio for the classification head.
Examples::
Examples:
```python
>>> from transformers import MobileBertModel, MobileBertConfig
>>> # Initializing a MobileBERT configuration
......@@ -99,6 +100,7 @@ class MobileBertConfig(PretrainedConfig):
>>> # Accessing the model configuration
>>> configuration = model.config
```
Attributes:
pretrained_config_archive_map (`Dict[str, str]`): A dictionary containing all the available pre-trained
checkpoints.
......
......@@ -1031,8 +1031,9 @@ class TFMobileBertForPreTraining(TFMobileBertPreTrainedModel):
r"""
Return:
Examples::
Examples:
```python
>>> import tensorflow as tf
>>> from transformers import MobileBertTokenizer, TFMobileBertForPreTraining
......@@ -1041,8 +1042,7 @@ class TFMobileBertForPreTraining(TFMobileBertPreTrainedModel):
>>> input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
>>> outputs = model(input_ids)
>>> prediction_scores, seq_relationship_scores = outputs[:2]
"""
```"""
inputs = input_processing(
func=self.call,
config=self.config,
......@@ -1242,8 +1242,9 @@ class TFMobileBertForNextSentencePrediction(TFMobileBertPreTrainedModel, TFNextS
r"""
Return:
Examples::
Examples:
```python
>>> import tensorflow as tf
>>> from transformers import MobileBertTokenizer, TFMobileBertForNextSentencePrediction
......@@ -1255,7 +1256,7 @@ class TFMobileBertForNextSentencePrediction(TFMobileBertPreTrainedModel, TFNextS
>>> encoding = tokenizer(prompt, next_sentence, return_tensors='tf')
>>> logits = model(encoding['input_ids'], token_type_ids=encoding['token_type_ids'])[0]
"""
```"""
inputs = input_processing(
func=self.call,
config=self.config,
......
......@@ -37,10 +37,10 @@ class MobileBertTokenizer(BertTokenizer):
r"""
Construct a MobileBERT tokenizer.
:class:`~transformers.MobileBertTokenizer is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
[`MobileBertTokenizer`] is identical to [`BertTokenizer`] and runs end-to-end
tokenization: punctuation splitting and wordpiece.
Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
Refer to superclass [`BertTokenizer`] for usage examples and documentation concerning
parameters.
"""
......
......@@ -39,12 +39,12 @@ PRETRAINED_INIT_CONFIGURATION = {}
class MobileBertTokenizerFast(BertTokenizerFast):
r"""
Construct a "fast" MobileBERT tokenizer (backed by HuggingFace's `tokenizers` library).
Construct a "fast" MobileBERT tokenizer (backed by HuggingFace's *tokenizers* library).
:class:`~transformers.MobileBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs
[`MobileBertTokenizerFast`] is identical to [`BertTokenizerFast`] and runs
end-to-end tokenization: punctuation splitting and wordpiece.
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
Refer to superclass [`BertTokenizerFast`] for usage examples and documentation concerning
parameters.
"""
......
......@@ -28,46 +28,47 @@ MPNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MPNetConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.MPNetModel` or a
:class:`~transformers.TFMPNetModel`. It is used to instantiate a MPNet model according to the specified arguments,
This is the configuration class to store the configuration of an [`MPNetModel`] or a
[`TFMPNetModel`]. It is used to instantiate an MPNet model according to the specified arguments,
defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration
to that of the MPNet `mpnet-base <https://huggingface.co/mpnet-base>`__ architecture.
to that of the MPNet [mpnet-base](https://huggingface.co/mpnet-base) architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 30527):
vocab_size (`int`, *optional*, defaults to 30527):
Vocabulary size of the MPNet model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.MPNetModel` or
:class:`~transformers.TFMPNetModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
`inputs_ids` passed when calling [`MPNetModel`] or
[`TFMPNetModel`].
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
`"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
relative_attention_num_buckets (:obj:`int`, `optional`, defaults to 32):
relative_attention_num_buckets (`int`, *optional*, defaults to 32):
The number of buckets to use for each attention layer.
Examples::
Examples:
```python
>>> from transformers import MPNetModel, MPNetConfig
>>> # Initializing a MPNet mpnet-base style configuration
......@@ -78,7 +79,7 @@ class MPNetConfig(PretrainedConfig):
>>> # Accessing the model configuration
>>> configuration = model.config
"""
```"""
model_type = "mpnet"
def __init__(
......
......@@ -66,56 +66,61 @@ def whitespace_tokenize(text):
class MPNetTokenizer(PreTrainedTokenizer):
"""
This tokenizer inherits from :class:`~transformers.BertTokenizer` which contains most of the methods. Users should
This tokenizer inherits from [`BertTokenizer`] which contains most of the methods. Users should
refer to the superclass for more information regarding methods.
Args:
vocab_file (:obj:`str`):
vocab_file (`str`):
Path to the vocabulary file.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
do_lower_case (`bool`, *optional*, defaults to `True`):
Whether or not to lowercase the input when tokenizing.
do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
do_basic_tokenize (`bool`, *optional*, defaults to `True`):
Whether or not to do basic tokenization before WordPiece.
never_split (:obj:`Iterable`, `optional`):
never_split (`Iterable`, *optional*):
Collection of tokens which will never be split during tokenization. Only has an effect when
:obj:`do_basic_tokenize=True`
bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
`do_basic_tokenize=True`
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pre-training. Can be used as a sequence classifier token.
.. note::
<Tip>
When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the :obj:`cls_token`.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
sequence. The token used is the `cls_token`.
</Tip>
eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.
.. note::
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the :obj:`sep_token`.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
sequence. The token used is the `sep_token`.
</Tip>
sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
unk_token (`str`, *optional*, defaults to `"[UNK]"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
mask_token (`str`, *optional*, defaults to `"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
tokenize_chinese_chars (`bool`, *optional*, defaults to `True`):
Whether or not to tokenize Chinese characters.
This should likely be deactivated for Japanese (see this `issue
<https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
This should likely be deactivated for Japanese (see this [issue](https://github.com/huggingface/transformers/issues/328)).
strip_accents: (`bool`, *optional*):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
value for `lowercase` (as in the original BERT).
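Example: a short sketch, assuming the `microsoft/mpnet-base` checkpoint is available.

```python
>>> from transformers import MPNetTokenizer

>>> tokenizer = MPNetTokenizer.from_pretrained("microsoft/mpnet-base")
>>> encoding = tokenizer("Hello world!", "How are you?")
>>> # The pair is wrapped as `<s> A </s></s> B </s>` (see the methods below).
```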
"""
vocab_files_names = VOCAB_FILES_NAMES
......@@ -229,17 +234,17 @@ class MPNetTokenizer(PreTrainedTokenizer):
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A MPNet sequence has the following format:
- single sequence: ``<s> X </s>``
- pair of sequences: ``<s> A </s></s> B </s>``
- single sequence: `<s> X </s>`
- pair of sequences: `<s> A </s></s> B </s>`
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
`List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
......@@ -252,18 +257,18 @@ class MPNetTokenizer(PreTrainedTokenizer):
) -> List[int]:
"""
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` methods.
special tokens using the tokenizer `prepare_for_model` methods.
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of ids.
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Set to True if the token list is already formatted with special tokens for the model
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
......@@ -282,13 +287,13 @@ class MPNetTokenizer(PreTrainedTokenizer):
make use of token type ids, therefore a list of zeros is returned.
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of ids.
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of zeros.
`List[int]`: List of zeros.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
......@@ -324,19 +329,18 @@ class BasicTokenizer(object):
Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).
Args:
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
do_lower_case (`bool`, *optional*, defaults to `True`):
Whether or not to lowercase the input when tokenizing.
never_split (:obj:`Iterable`, `optional`):
never_split (`Iterable`, *optional*):
Collection of tokens which will never be split during tokenization. Only has an effect when
:obj:`do_basic_tokenize=True`
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
`do_basic_tokenize=True`
tokenize_chinese_chars (`bool`, *optional*, defaults to `True`):
Whether or not to tokenize Chinese characters.
This should likely be deactivated for Japanese (see this `issue
<https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
This should likely be deactivated for Japanese (see this [issue](https://github.com/huggingface/transformers/issues/328)).
strip_accents: (`bool`, *optional*):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
value for `lowercase` (as in the original BERT).
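Example: a short sketch of standalone use with the defaults described above.

```python
>>> tokenizer = BasicTokenizer(do_lower_case=True)
>>> tokenizer.tokenize("Hello, World!")
['hello', ',', 'world', '!']
```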
"""
def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
......@@ -353,9 +357,9 @@ class BasicTokenizer(object):
WordPieceTokenizer.
Args:
**never_split**: (`optional`) list of str
never_split (`List[str]`, *optional*):
Kept for backward compatibility purposes. Now implemented directly at the base class level (see
:func:`PreTrainedTokenizer.tokenize`) List of token not to split.
[`PreTrainedTokenizer.tokenize`]). List of tokens not to split.
"""
# union() returns a new set by concatenating the two sets.
never_split = self.never_split.union(set(never_split)) if never_split else self.never_split
......@@ -482,11 +486,11 @@ class WordpieceTokenizer(object):
Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
tokenization using the given vocabulary.
For example, :obj:`input = "unaffable"` wil return as output :obj:`["un", "##aff", "##able"]`.
For example, `input = "unaffable"` will return as output `["un", "##aff", "##able"]`.
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.
already been passed through *BasicTokenizer*.
Returns:
A list of wordpiece tokens.
......
......@@ -50,51 +50,57 @@ PRETRAINED_INIT_CONFIGURATION = {
class MPNetTokenizerFast(PreTrainedTokenizerFast):
r"""
Construct a "fast" MPNet tokenizer (backed by HuggingFace's `tokenizers` library). Based on WordPiece.
Construct a "fast" MPNet tokenizer (backed by HuggingFace's *tokenizers* library). Based on WordPiece.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (:obj:`str`):
vocab_file (`str`):
File containing the vocabulary.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
do_lower_case (`bool`, *optional*, defaults to `True`):
Whether or not to lowercase the input when tokenizing.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
.. note::
<Tip>
When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the :obj:`cls_token`.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
sequence. The token used is the `cls_token`.
</Tip>
eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.
.. note::
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the :obj:`sep_token`.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
sequence. The token used is the `sep_token`.
</Tip>
sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
unk_token (`str`, *optional*, defaults to `"[UNK]"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
mask_token (`str`, *optional*, defaults to `"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see `this
issue <https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
tokenize_chinese_chars (`bool`, *optional*, defaults to `True`):
Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see [this
issue](https://github.com/huggingface/transformers/issues/328)).
strip_accents: (`bool`, *optional*):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
value for `lowercase` (as in the original BERT).
"""
vocab_files_names = VOCAB_FILES_NAMES
......@@ -151,11 +157,11 @@ class MPNetTokenizerFast(PreTrainedTokenizerFast):
@property
def mask_token(self) -> str:
"""
:obj:`str`: Mask token, to use when training a model with masked-language modeling. Log an error if used while
`str`: Mask token, to use when training a model with masked-language modeling. Log an error if used while
not having been set.
MPNet tokenizer has a special mask token to be usable in the fill-mask pipeline. The mask token will greedily
comprise the space before the `<mask>`.
comprise the space before the *<mask>*.
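Example: a sketch of the greedy space handling, assuming the `microsoft/mpnet-base` checkpoint and the
default mask token created with `lstrip=True`.

```python
>>> from transformers import MPNetTokenizerFast

>>> tokenizer = MPNetTokenizerFast.from_pretrained("microsoft/mpnet-base")
>>> with_space = tokenizer("Paris is the <mask>.")["input_ids"]
>>> without_space = tokenizer("Paris is the<mask>.")["input_ids"]
>>> with_space == without_space
True
```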
"""
if self._mask_token is None and self.verbose:
logger.error("Using mask_token, but it is not set yet.")
......@@ -189,13 +195,13 @@ class MPNetTokenizerFast(PreTrainedTokenizerFast):
make use of token type ids, therefore a list of zeros is returned
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of ids.
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs
Returns:
:obj:`List[int]`: List of zeros.
`List[int]`: List of zeros.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
......