chenpangpang / transformers / Commits / 3312e96b
"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "c70dacde947cf83c4f3822a10f3f44bdb1e6cf7f"
Unverified commit 3312e96b, authored Apr 13, 2021 by Sylvain Gugger, committed by GitHub on Apr 13, 2021
Doc check: a bit of clean up (#11224)
parent edca520d

Changes: 2 changed files with 22 additions and 39 deletions (+22 -39)

docs/source/main_classes/data_collator.rst  +18 -30
utils/check_repo.py                         +4 -9
docs/source/main_classes/data_collator.rst @ 3312e96b
-<!--- Copyright 2020 The HuggingFace Team. All rights reserved.
+..
+    Copyright 2020 The HuggingFace Team. All rights reserved.

 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
--->

-DataCollator
+Data Collator
 -----------------------------------------------------------------------------------------------------------------------

-Data Collators are objects that will form a batch by using a list of elements as input. These lists of elements are of
+Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of
 the same type as the elements of :obj:`train_dataset` or :obj:`eval_dataset`.

 A data collator will default to :func:`transformers.data.data_collator.default_data_collator` if no `tokenizer` has
 been provided. This is a function that takes a list of samples from a Dataset as input and collates them into a batch
 of a dict-like object. The default collator performs special handling of potential keys:

 - ``label``: handles a single value (int or float) per object
 - ``label_ids``: handles a list of values per object

 To be able to build batches, data collators may apply some processing (like padding). Some of them (like
 :class:`~transformers.DataCollatorForLanguageModeling`) also apply some random data augmentation (like random masking)
 on the formed batch.

-This function does not perform any preprocessing. An example of use can be found in glue and ner.
+Examples of use can be found in the :doc:`example scripts <../examples>` or :doc:`example notebooks <../notebooks>`.
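The `label`/`label_ids` handling the new paragraph describes can be illustrated with a small pure-Python sketch. This is a toy, not the actual `default_data_collator` (which returns framework tensors); the function name `toy_default_collate` and the list-based batch are assumptions for clarity:

```python
# Toy sketch (NOT the real transformers implementation): collate a list of
# dict features into one batch dict, mirroring the special handling of the
# "label" and "label_ids" keys described above. Real collators return
# PyTorch/TensorFlow tensors; plain lists are used here for readability.
def toy_default_collate(features):
    first = features[0]
    batch = {}
    if "label" in first:
        # one int/float per object, gathered under the "labels" key
        batch["labels"] = [f["label"] for f in features]
    elif "label_ids" in first:
        # a list of values per object
        batch["labels"] = [f["label_ids"] for f in features]
    for key in first:
        if key not in ("label", "label_ids"):
            batch[key] = [f[key] for f in features]
    return batch

features = [{"input_ids": [101, 7592], "label": 0},
            {"input_ids": [101, 2088], "label": 1}]
print(toy_default_collate(features))
# {'labels': [0, 1], 'input_ids': [[101, 7592], [101, 2088]]}
```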
Default data collator
...
@@ -37,47 +33,39 @@ DataCollatorWithPadding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.data.data_collator.DataCollatorWithPadding
    :special-members: __call__
    :members:
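`DataCollatorWithPadding` pads each batch dynamically to the length of its longest element. A hedged pure-Python sketch of that idea, with a hypothetical `pad_token_id` of 0 instead of a real tokenizer (the actual class delegates to `tokenizer.pad` and returns tensors):

```python
# Illustrative sketch of dynamic padding (not the actual DataCollatorWithPadding):
# every "input_ids" list is padded to the longest length in *this* batch, and an
# attention mask marks real tokens (1) versus padding (0).
def toy_pad_collate(features, pad_token_id=0):
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch

batch = toy_pad_collate([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Padding per batch (rather than to a global maximum) is what makes this collator cheaper than padding the whole dataset up front.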
DataCollatorForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.data.data_collator.DataCollatorForTokenClassification
    :special-members: __call__
    :members:

DataCollatorForSeq2Seq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.data.data_collator.DataCollatorForSeq2Seq
    :special-members: __call__
    :members:
DataCollatorForLanguageModeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.data.data_collator.DataCollatorForLanguageModeling
    :special-members: __call__
    :members: mask_tokens
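The random-masking augmentation this collator applies can be sketched in a toy form. This is an assumption-laden simplification: the real `mask_tokens` works on tensors and also replaces some positions with random tokens or leaves them unchanged, which is omitted here:

```python
import random

# Toy version of random masking (NOT the real DataCollatorForLanguageModeling):
# each token is replaced by a hypothetical [MASK] id with probability
# `mlm_probability`; the label keeps the original token at masked positions and
# is -100 (the index ignored by the loss) everywhere else.
def toy_mask_tokens(input_ids, mask_token_id=103, mlm_probability=0.15, rng=None):
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in input_ids:
        if rng.random() < mlm_probability:
            masked.append(mask_token_id)
            labels.append(tok)    # the model must predict the original token
        else:
            masked.append(tok)
            labels.append(-100)   # position excluded from the loss
    return masked, labels

masked, labels = toy_mask_tokens([10, 11, 12, 13], rng=random.Random(0))
assert len(masked) == len(labels) == 4
```

Because the masking is sampled per call, the same example produces different masks on every epoch, which is the data-augmentation effect the paragraph above refers to.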
DataCollatorForWholeWordMask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.data.data_collator.DataCollatorForWholeWordMask
    :special-members: __call__
    :members: mask_tokens

DataCollatorForSOP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.data.data_collator.DataCollatorForSOP
    :special-members: __call__
    :members: mask_tokens

DataCollatorForPermutationLanguageModeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.data.data_collator.DataCollatorForPermutationLanguageModeling
    :special-members: __call__
    :members: mask_tokens
utils/check_repo.py @ 3312e96b
...
@@ -348,6 +348,8 @@ def find_all_documented_objects():
DEPRECATED_OBJECTS = [
    "AutoModelWithLMHead",
    "BartPretrainedModel",
    "DataCollator",
    "DataCollatorForSOP",
    "GlueDataset",
    "GlueDataTrainingArguments",
    "LineByLineTextDataset",
...
@@ -385,7 +387,9 @@ DEPRECATED_OBJECTS = [
UNDOCUMENTED_OBJECTS = [
    "AddedToken",  # This is a tokenizers class.
    "BasicTokenizer",  # Internal, should never have been in the main init.
    "CharacterTokenizer",  # Internal, should never have been in the main init.
    "DPRPretrainedReader",  # Like an Encoder.
    "MecabTokenizer",  # Internal, should never have been in the main init.
    "ModelCard",  # Internal type.
    "SqueezeBertModule",  # Internal building block (should have been called SqueezeBertLayer)
    "TFDPRPretrainedReader",  # Like an Encoder.
...
@@ -403,10 +407,6 @@ UNDOCUMENTED_OBJECTS = [
# This list should be empty. Objects in it should get their own doc page.
SHOULD_HAVE_THEIR_OWN_PAGE = [
    # bert-japanese
    "BertJapaneseTokenizer",
    "CharacterTokenizer",
    "MecabTokenizer",
    # Benchmarks
    "PyTorchBenchmark",
    "PyTorchBenchmarkArguments",
...
@@ -448,11 +448,6 @@ def ignore_undocumented(name):
    # MMBT model does not really work.
    if name.startswith("MMBT"):
        return True
    # NOT DOCUMENTED BUT NOT ON PURPOSE, SHOULD BE FIXED!
    # All data collators should be documented
    if name.startswith("DataCollator") or name.endswith("data_collator"):
        return True
    if name in SHOULD_HAVE_THEIR_OWN_PAGE:
        return True
    return False
...
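The whitelists above feed a consistency check that flags public objects missing from the documentation. A heavily reduced sketch of how such a check can work; the helper `check_all_objects_documented` and the tiny whitelist contents are hypothetical, not the actual check_repo.py code:

```python
# Simplified sketch of the kind of repo consistency check utils/check_repo.py
# performs: an object is flagged when it is public, not documented anywhere,
# and not on a known-exception whitelist.
UNDOCUMENTED_OBJECTS = {"AddedToken", "BasicTokenizer"}   # undocumented on purpose
SHOULD_HAVE_THEIR_OWN_PAGE = {"PyTorchBenchmark"}         # known debt, tolerated for now

def ignore_undocumented(name):
    """Return True if `name` is allowed to be undocumented."""
    if name.startswith("_"):                 # private helper, skipped
        return True
    if name in UNDOCUMENTED_OBJECTS:
        return True
    if name in SHOULD_HAVE_THEIR_OWN_PAGE:
        return True
    return False

def check_all_objects_documented(public_objects, documented_objects):
    """Return the objects that should be documented but are not."""
    return [name for name in public_objects
            if name not in documented_objects and not ignore_undocumented(name)]

failures = check_all_objects_documented(
    public_objects=["DataCollatorForSOP", "AddedToken", "_helper"],
    documented_objects=set(),
)
print(failures)  # ['DataCollatorForSOP']
```

Removing the blanket `DataCollator` exception, as this commit does, means data collators now flow through this check like any other public object.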