chenpangpang / transformers · Commits

Unverified commit 0ed4d0df, authored Jul 20, 2022 by Li-Huai (Allan) Lin, committed via GitHub on Jul 20, 2022.

Fix `LayoutXLM` docstrings (#17038)

* Fix docstrings
* Fix legacy issue
* up
* apply suggestions
* up
* quality

Parent: 4b1ed797
Changes: 3 changed files, with 163 additions and 54 deletions.

src/transformers/models/layoutlmv2/tokenization_layoutlmv2.py     +54  -46
src/transformers/models/layoutxlm/tokenization_layoutxlm.py       +107 -5
src/transformers/models/layoutxlm/tokenization_layoutxlm_fast.py  +2   -3
src/transformers/models/layoutlmv2/tokenization_layoutlmv2.py

@@ -109,53 +109,61 @@ LAYOUTLMV2_ENCODE_KWARGS_DOCSTRING = r"""
      - `'np'`: Return Numpy `np.ndarray` objects.
  """

  LAYOUTLMV2_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING = r"""
- add_special_tokens (`bool`, *optional*, defaults to `True`):
-     Whether or not to encode the sequences with the special tokens relative to their model.
- padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
-     Activates and controls padding. Accepts the following values:
-     - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
-       sequence if provided).
-     - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
-       acceptable input length for the model if that argument is not provided.
-     - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
-       lengths).
- truncation (`bool`, `str` or [`~tokenization_utils_base.TruncationStrategy`], *optional*, defaults to `False`):
-     Activates and controls truncation. Accepts the following values:
-     - `True` or `'longest_first'`: Truncate to a maximum length specified with the argument `max_length` or
-       to the maximum acceptable input length for the model if that argument is not provided. This will
-       truncate token by token, removing a token from the longest sequence in the pair if a pair of
-       sequences (or a batch of pairs) is provided.
-     - `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
-       maximum acceptable input length for the model if that argument is not provided. This will only
-       truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
-     - `'only_second'`: Truncate to a maximum length specified with the argument `max_length` or to the
-       maximum acceptable input length for the model if that argument is not provided. This will only
-       truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
-     - `False` or `'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence lengths
-       greater than the model maximum admissible input size).
- max_length (`int`, *optional*):
-     Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to
-     `None`, this will use the predefined model maximum length if a maximum length is required by one of the
-     truncation/padding parameters. If the model has no specific maximum input length (like XLNet)
-     truncation/padding to a maximum length will be deactivated.
- stride (`int`, *optional*, defaults to 0):
-     If set to a number along with `max_length`, the overflowing tokens returned when
-     `return_overflowing_tokens=True` will contain some tokens from the end of the truncated sequence
-     returned to provide some overlap between truncated and overflowing sequences. The value of this
-     argument defines the number of overlapping tokens.
- pad_to_multiple_of (`int`, *optional*):
-     If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
-     the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
- return_tensors (`str` or [`~utils.TensorType`], *optional*):
-     If set, will return tensors instead of list of python integers. Acceptable values are:
-     - `'tf'`: Return TensorFlow `tf.constant` objects.
-     - `'pt'`: Return PyTorch `torch.Tensor` objects.
-     - `'np'`: Return Numpy `np.ndarray` objects.
+ return_token_type_ids (`bool`, *optional*):
+     Whether to return token type IDs. If left to the default, will return the token type IDs according to
+     the specific tokenizer's default, defined by the `return_outputs` attribute.
+     [What are token type IDs?](../glossary#token-type-ids)
+ return_attention_mask (`bool`, *optional*):
+     Whether to return the attention mask. If left to the default, will return the attention mask according
+     to the specific tokenizer's default, defined by the `return_outputs` attribute.
+     [What are attention masks?](../glossary#attention-mask)
+ return_overflowing_tokens (`bool`, *optional*, defaults to `False`):
+     Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
+     of pairs) is provided with `truncation_strategy = longest_first` or `True`, an error is raised instead
+     of returning overflowing tokens.
+ return_special_tokens_mask (`bool`, *optional*, defaults to `False`):
+     Whether or not to return special tokens mask information.
+ return_offsets_mapping (`bool`, *optional*, defaults to `False`):
+     Whether or not to return `(char_start, char_end)` for each token.
+     This is only available on fast tokenizers inheriting from [`PreTrainedTokenizerFast`], if using
+     Python's tokenizer, this method will raise `NotImplementedError`.
+ return_length (`bool`, *optional*, defaults to `False`):
+     Whether or not to return the lengths of the encoded inputs.
+ verbose (`bool`, *optional*, defaults to `True`):
+     Whether or not to print more information and warnings.
+ **kwargs: passed to the `self.tokenize()` method
+
+ Return:
+     [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
+     - **input_ids** -- List of token ids to be fed to a model.
+       [What are input IDs?](../glossary#input-ids)
+     - **bbox** -- List of bounding boxes to be fed to a model.
+     - **token_type_ids** -- List of token type ids to be fed to a model (when `return_token_type_ids=True` or
+       if *"token_type_ids"* is in `self.model_input_names`).
+       [What are token type IDs?](../glossary#token-type-ids)
+     - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
+       `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names`).
+       [What are attention masks?](../glossary#attention-mask)
+     - **labels** -- List of labels to be fed to a model. (when `word_labels` is specified).
+     - **overflowing_tokens** -- List of overflowing tokens sequences (when a `max_length` is specified and
+       `return_overflowing_tokens=True`).
+     - **num_truncated_tokens** -- Number of tokens truncated (when a `max_length` is specified and
+       `return_overflowing_tokens=True`).
+     - **special_tokens_mask** -- List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
+       regular sequence tokens (when `add_special_tokens=True` and `return_special_tokens_mask=True`).
+     - **length** -- The length of the inputs (when `return_length=True`).
  """
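For reference, a minimal, illustrative sketch (not part of this commit) of how the `return_*` flags documented above surface in the returned [`BatchEncoding`]; the checkpoint name, words, and boxes are placeholder values:

```python
from transformers import LayoutLMv2Tokenizer

# Placeholder checkpoint and inputs, used only to illustrate the documented kwargs.
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # one (x0, y0, x1, y1) box per word

encoding = tokenizer(
    words,
    boxes=boxes,
    return_token_type_ids=True,
    return_special_tokens_mask=True,
    return_length=True,
)
# Expected keys include: input_ids, bbox, token_type_ids, attention_mask,
# special_tokens_mask, length
print(list(encoding.keys()))
```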
src/transformers/models/layoutxlm/tokenization_layoutxlm.py

@@ -20,11 +20,9 @@ from shutil import copyfile
  from typing import Any, Dict, List, Optional, Tuple, Union

  import sentencepiece as spm

- from transformers.models.layoutlmv2.tokenization_layoutlmv2 import LAYOUTLMV2_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING
  from ...tokenization_utils import AddedToken, PreTrainedTokenizer
  from ...tokenization_utils_base import (
-     ENCODE_KWARGS_DOCSTRING,
      BatchEncoding,
      EncodedInput,
      PreTokenizedInput,

@@ -44,6 +42,110 @@ from ..xlm_roberta.tokenization_xlm_roberta import (
  logger = logging.get_logger(__name__)

The entire `LAYOUTXLM_ENCODE_KWARGS_DOCSTRING` below, down to its closing triple quote, is added by this commit:

+ LAYOUTXLM_ENCODE_KWARGS_DOCSTRING = r"""
add_special_tokens (`bool`, *optional*, defaults to `True`):
Whether or not to encode the sequences with the special tokens relative to their model.
padding (`bool`, `str` or [`~file_utils.PaddingStrategy`], *optional*, defaults to `False`):
Activates and controls padding. Accepts the following values:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence if provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (`bool`, `str` or [`~tokenization_utils_base.TruncationStrategy`], *optional*, defaults to `False`):
Activates and controls truncation. Accepts the following values:
- `True` or `'longest_first'`: Truncate to a maximum length specified with the argument `max_length` or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
- `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- `'only_second'`: Truncate to a maximum length specified with the argument `max_length` or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- `False` or `'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
max_length (`int`, *optional*):
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to `None`, this will use the predefined model maximum length if a maximum length
is required by one of the truncation/padding parameters. If the model has no specific maximum input
length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (`int`, *optional*, defaults to 0):
If set to a number along with `max_length`, the overflowing tokens returned when
`return_overflowing_tokens=True` will contain some tokens from the end of the truncated sequence
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
pad_to_multiple_of (`int`, *optional*):
If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
return_token_type_ids (`bool`, *optional*):
Whether to return token type IDs. If left to the default, will return the token type IDs according to
the specific tokenizer's default, defined by the `return_outputs` attribute.
[What are token type IDs?](../glossary#token-type-ids)
return_attention_mask (`bool`, *optional*):
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer's default, defined by the `return_outputs` attribute.
[What are attention masks?](../glossary#attention-mask)
return_overflowing_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
of pairs) is provided with `truncation_strategy = longest_first` or `True`, an error is raised instead
of returning overflowing tokens.
return_special_tokens_mask (`bool`, *optional*, defaults to `False`):
Whether or not to return special tokens mask information.
return_offsets_mapping (`bool`, *optional*, defaults to `False`):
Whether or not to return `(char_start, char_end)` for each token.
This is only available on fast tokenizers inheriting from [`PreTrainedTokenizerFast`], if using
Python's tokenizer, this method will raise `NotImplementedError`.
return_length (`bool`, *optional*, defaults to `False`):
Whether or not to return the lengths of the encoded inputs.
verbose (`bool`, *optional*, defaults to `True`):
Whether or not to print more information and warnings.
**kwargs: passed to the `self.tokenize()` method
Return:
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model.
[What are input IDs?](../glossary#input-ids)
- **bbox** -- List of bounding boxes to be fed to a model.
- **token_type_ids** -- List of token type ids to be fed to a model (when `return_token_type_ids=True` or
if *"token_type_ids"* is in `self.model_input_names`).
[What are token type IDs?](../glossary#token-type-ids)
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names`).
[What are attention masks?](../glossary#attention-mask)
- **labels** -- List of labels to be fed to a model. (when `word_labels` is specified).
- **overflowing_tokens** -- List of overflowing tokens sequences (when a `max_length` is specified and
`return_overflowing_tokens=True`).
- **num_truncated_tokens** -- Number of tokens truncated (when a `max_length` is specified and
`return_overflowing_tokens=True`).
- **special_tokens_mask** -- List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
regular sequence tokens (when `add_special_tokens=True` and `return_special_tokens_mask=True`).
- **length** -- The length of the inputs (when `return_length=True`).
"""
  class LayoutXLMTokenizer(PreTrainedTokenizer):
      """
      Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on ...
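As a usage-level illustration of the keyword arguments and `BatchEncoding` fields documented in the new `LAYOUTXLM_ENCODE_KWARGS_DOCSTRING`, here is a hedged sketch (not part of the diff; checkpoint name and inputs are placeholders):

```python
from transformers import LayoutXLMTokenizer

# Placeholder checkpoint and inputs for illustration only.
tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")
words = ["invoice", "total", "42.00"]
boxes = [[12, 5, 60, 20], [65, 5, 100, 20], [105, 5, 140, 20]]

enc = tokenizer(
    words,
    boxes=boxes,
    padding="max_length",
    truncation=True,
    max_length=16,
    return_tensors="pt",
)
# input_ids is padded/truncated to length 16; bbox carries one 4-value box per token.
print(enc["input_ids"].shape, enc["bbox"].shape)  # e.g. torch.Size([1, 16]), torch.Size([1, 16, 4])
```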
@@ -339,7 +441,7 @@ class LayoutXLMTokenizer(PreTrainedTokenizer):
          return (out_vocab_file,)

-     @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, LAYOUTLMV2_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
+     @add_end_docstrings(LAYOUTXLM_ENCODE_KWARGS_DOCSTRING)
      def __call__(
          self,
          text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
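The decorator swap above (and the identical swaps below) works because `add_end_docstrings` appends the given docstring fragments to the decorated method's `__doc__`. A minimal re-implementation sketch of that mechanism, written from memory rather than copied from the library, looks roughly like this:

```python
# Illustrative re-implementation of an add_end_docstrings-style decorator; the real
# helper lives in the transformers utilities and may differ in detail.
def add_end_docstrings(*docstr):
    def docstring_decorator(fn):
        fn.__doc__ = (fn.__doc__ or "") + "".join(docstr)
        return fn
    return docstring_decorator


SHARED_KWARGS_DOCSTRING = r"""
    add_special_tokens (`bool`, *optional*, defaults to `True`):
        Whether or not to add special tokens when encoding.
"""


class DemoTokenizer:
    @add_end_docstrings(SHARED_KWARGS_DOCSTRING)
    def __call__(self, text, boxes=None, **kwargs):
        """Tokenize `text` together with its bounding `boxes`.

        Args:
        """
        return text


# The shared kwargs documentation now appears at the end of the rendered docstring.
print(DemoTokenizer.__call__.__doc__)
```

Passing a single, self-contained `LAYOUTXLM_ENCODE_KWARGS_DOCSTRING` to the decorator is what lets the LayoutXLM tokenizers drop their dependency on the LayoutLMv2 docstring constants.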
@@ -543,7 +645,7 @@ class LayoutXLMTokenizer(PreTrainedTokenizer):
          return BatchEncoding(batch_outputs)

-     @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, LAYOUTLMV2_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
+     @add_end_docstrings(LAYOUTXLM_ENCODE_KWARGS_DOCSTRING)
      def _batch_prepare_for_model(
          self,
          batch_text_or_text_pairs,
@@ -666,7 +768,7 @@ class LayoutXLMTokenizer(PreTrainedTokenizer):
              verbose=verbose,
          )

-     @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, LAYOUTLMV2_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
+     @add_end_docstrings(LAYOUTXLM_ENCODE_KWARGS_DOCSTRING)
      def prepare_for_model(
          self,
          text: Union[TextInput, PreTokenizedInput],
src/transformers/models/layoutxlm/tokenization_layoutxlm_fast.py

@@ -19,11 +19,10 @@ import os
  from shutil import copyfile
  from typing import Dict, List, Optional, Tuple, Union

- from transformers.models.layoutlmv2.tokenization_layoutlmv2 import LAYOUTLMV2_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING
+ from transformers.models.layoutxlm.tokenization_layoutxlm import LAYOUTXLM_ENCODE_KWARGS_DOCSTRING
  from ...tokenization_utils import AddedToken
  from ...tokenization_utils_base import (
-     ENCODE_KWARGS_DOCSTRING,
      BatchEncoding,
      EncodedInput,
      PreTokenizedInput,
@@ -166,7 +165,7 @@ class LayoutXLMTokenizerFast(PreTrainedTokenizerFast):
          self.pad_token_label = pad_token_label
          self.only_label_first_subword = only_label_first_subword

-     @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, LAYOUTLMV2_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
+     @add_end_docstrings(LAYOUTXLM_ENCODE_KWARGS_DOCSTRING)
      def __call__(
          self,
          text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
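Since `return_offsets_mapping` is documented as a fast-tokenizer-only feature, the fast class is a natural place to show it. A hedged usage sketch (not from the diff; the checkpoint name and inputs are placeholders):

```python
from transformers import LayoutXLMTokenizerFast

# Placeholder checkpoint and inputs for illustration only.
tokenizer = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")
words = ["weirdly", "long", "receipt"]
boxes = [[10, 10, 60, 20], [70, 10, 100, 20], [110, 10, 180, 20]]

enc = tokenizer(words, boxes=boxes, return_offsets_mapping=True)
# One (char_start, char_end) pair per token; the slow (SentencePiece) tokenizer
# raises NotImplementedError for this flag instead.
print(enc["offset_mapping"])
```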