Commit 3312e96b authored by Sylvain Gugger, committed by GitHub

Doc check: a bit of clean up (#11224)

parent edca520d
<!--- Copyright 2020 The HuggingFace Team. All rights reserved.
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed
on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for
the specific language governing permissions and limitations under the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
DataCollator
Data Collator
-----------------------------------------------------------------------------------------------------------------------
DataCollators are objects that will form a batch by using a list of elements as input. These lists of elements are of
Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of
the same type as the elements of :obj:`train_dataset` or :obj:`eval_dataset`.
A data collator will default to :func:`transformers.data.data_collator.default_data_collator` if no `tokenizer` has
been provided. This is a function that takes a list of samples from a Dataset as input and collates them into a batch
of a dict-like object. The default collator performs special handling of potential keys:
- ``label``: handles a single value (int or float) per object
- ``label_ids``: handles a list of values per object
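The handling of the two special keys can be illustrated with a small pure-Python sketch. This is not the library implementation: ``sketch_default_collate`` is a hypothetical helper, it returns plain lists where the real function builds tensors, and gathering both keys under a single ``labels`` entry is an assumption about the library's behavior.

```python
from typing import Any, Dict, List


def sketch_default_collate(features: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Collate per-example dicts into one dict of lists (hypothetical helper)."""
    first = features[0]
    batch: Dict[str, List[Any]] = {}

    # Special keys: both end up under a single "labels" entry (assumed naming).
    if first.get("label") is not None:
        # One int or float per example.
        batch["labels"] = [f["label"] for f in features]
    elif first.get("label_ids") is not None:
        # One list of values per example.
        batch["labels"] = [list(f["label_ids"]) for f in features]

    # Every other key is gathered as-is: no padding, no other preprocessing.
    for key, value in first.items():
        if key not in ("label", "label_ids") and value is not None:
            batch[key] = [f[key] for f in features]
    return batch


examples = [
    {"input_ids": [101, 7592, 102], "label": 1},
    {"input_ids": [101, 2088, 102], "label": 0},
]
batch = sketch_default_collate(examples)
```

Because nothing is padded here, the sketch only yields a rectangular batch when every example already has the same length.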
To be able to build batches, data collators may apply some processing (like padding). Some of them (like
:class:`~transformers.DataCollatorForLanguageModeling`) also apply some random data augmentation (like random masking)
on the formed batch.
This function does not perform any preprocessing. An example of use can be found in glue and ner.
Examples of use can be found in the :doc:`example scripts <../examples>` or :doc:`example notebooks <../notebooks>`.
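The two kinds of processing described above (padding and random masking) can be sketched in plain Python. This is a simplified illustration and not the code of any collator in the library: ``pad_batch`` and ``randomly_mask`` are hypothetical helpers, the ``PAD_ID`` and ``MASK_ID`` values are assumed, and every selected token is replaced by the mask id, which may be simpler than what the real masking collators do. The value ``-100`` marks positions that the loss should ignore.

```python
import random
from typing import Dict, List, Optional

PAD_ID = 0     # assumed padding token id
MASK_ID = 103  # assumed mask token id


def pad_batch(features: List[Dict[str, List[int]]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask = [], []
    for f in features:
        ids = list(f["input_ids"])
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def randomly_mask(batch: Dict[str, List[List[int]]], mask_prob: float = 0.15,
                  rng: Optional[random.Random] = None) -> Dict[str, List[List[int]]]:
    """Replace a random subset of real (non-padding) tokens with MASK_ID."""
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        row, row_labels = [], []
        for tok, is_real in zip(ids, mask):
            if is_real and rng.random() < mask_prob:
                row.append(MASK_ID)
                row_labels.append(tok)   # the model has to predict the original token
            else:
                row.append(tok)
                row_labels.append(-100)  # position ignored by the loss
        masked.append(row)
        labels.append(row_labels)
    return {"input_ids": masked, "attention_mask": batch["attention_mask"], "labels": labels}


padded = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
augmented = randomly_mask(padded, mask_prob=0.15, rng=random.Random(0))
```

Since the masking is drawn again each time a batch is formed, the same example can be masked differently from one epoch to the next, which is why this step lives in the collator rather than in dataset preprocessing.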
Default data collator
......@@ -37,47 +33,39 @@ DataCollatorWithPadding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.data.data_collator.DataCollatorWithPadding
    :special-members: __call__
    :members:
DataCollatorForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.data.data_collator.DataCollatorForTokenClassification
    :special-members: __call__
    :members:
DataCollatorForSeq2Seq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.data.data_collator.DataCollatorForSeq2Seq
    :special-members: __call__
    :members:
DataCollatorForLanguageModeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.data.data_collator.DataCollatorForLanguageModeling
    :special-members: __call__
    :members: mask_tokens
DataCollatorForWholeWordMask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.data.data_collator.DataCollatorForWholeWordMask
    :special-members: __call__
    :members: mask_tokens
DataCollatorForSOP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.data.data_collator.DataCollatorForSOP
    :special-members: __call__
    :members: mask_tokens
DataCollatorForPermutationLanguageModeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.data.data_collator.DataCollatorForPermutationLanguageModeling
    :special-members: __call__
    :members: mask_tokens
......@@ -348,6 +348,8 @@ def find_all_documented_objects():
DEPRECATED_OBJECTS = [
    "AutoModelWithLMHead",
    "BartPretrainedModel",
    "DataCollator",
    "DataCollatorForSOP",
    "GlueDataset",
    "GlueDataTrainingArguments",
    "LineByLineTextDataset",
......@@ -385,7 +387,9 @@ DEPRECATED_OBJECTS = [
UNDOCUMENTED_OBJECTS = [
    "AddedToken",  # This is a tokenizers class.
    "BasicTokenizer",  # Internal, should never have been in the main init.
    "CharacterTokenizer",  # Internal, should never have been in the main init.
    "DPRPretrainedReader",  # Like an Encoder.
    "MecabTokenizer",  # Internal, should never have been in the main init.
    "ModelCard",  # Internal type.
    "SqueezeBertModule",  # Internal building block (should have been called SqueezeBertLayer)
    "TFDPRPretrainedReader",  # Like an Encoder.
......@@ -403,10 +407,6 @@ UNDOCUMENTED_OBJECTS = [
# This list should be empty. Objects in it should get their own doc page.
SHOULD_HAVE_THEIR_OWN_PAGE = [
    # bert-japanese
    "BertJapaneseTokenizer",
    "CharacterTokenizer",
    "MecabTokenizer",
    # Benchmarks
    "PyTorchBenchmark",
    "PyTorchBenchmarkArguments",
......@@ -448,11 +448,6 @@ def ignore_undocumented(name):
    # MMBT model does not really work.
    if name.startswith("MMBT"):
        return True
    # NOT DOCUMENTED BUT NOT ON PURPOSE, SHOULD BE FIXED!
    # All data collators should be documented
    if name.startswith("DataCollator") or name.endswith("data_collator"):
        return True
    if name in SHOULD_HAVE_THEIR_OWN_PAGE:
        return True
    return False
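The effect of this last hunk can be sketched as a before/after pair. This is an illustration under stated assumptions: the added/removed markers are missing from this view, so the reading that the data-collator exemption is the removed part is inferred from the hunk header (11 lines down to 6) and from the new documentation entries above. Both functions are hypothetical stand-ins that model only the tail of the check shown here, and the list is trimmed to the benchmark entries.

```python
# Trimmed stand-in for the real list; only entries visible in the hunk above.
SHOULD_HAVE_THEIR_OWN_PAGE = ["PyTorchBenchmark", "PyTorchBenchmarkArguments"]


def ignore_tail_before(name: str) -> bool:
    """Tail of the check with the data-collator exemption still in place."""
    if name.startswith("MMBT"):
        return True
    if name.startswith("DataCollator") or name.endswith("data_collator"):
        return True
    if name in SHOULD_HAVE_THEIR_OWN_PAGE:
        return True
    return False


def ignore_tail_after(name: str) -> bool:
    """Tail of the check once data collators are expected to be documented."""
    if name.startswith("MMBT"):
        return True
    if name in SHOULD_HAVE_THEIR_OWN_PAGE:
        return True
    return False


for candidate in ("DataCollatorWithPadding", "default_data_collator", "MMBTModel"):
    print(candidate, ignore_tail_before(candidate), ignore_tail_after(candidate))
```

Under this reading, a data collator missing from the docs is no longer silently skipped: it has to be documented or listed explicitly, for instance in ``DEPRECATED_OBJECTS``.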
......