.. 
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Processors
-----------------------------------------------------------------------------------------------------------------------

This library includes processors for several traditional tasks. These processors can be used to process a dataset into
examples that can be fed to a model.

Processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All processors follow the same architecture, which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. A processor returns a list of
:class:`~transformers.data.processors.utils.InputExample` objects, which can then be converted to
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.

.. autoclass:: transformers.data.processors.utils.DataProcessor
    :members:


.. autoclass:: transformers.data.processors.utils.InputExample
    :members:


.. autoclass:: transformers.data.processors.utils.InputFeatures
    :members:
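
The flow from :class:`~transformers.data.processors.utils.InputExample` to
:class:`~transformers.data.processors.utils.InputFeatures` can be sketched as follows. This is only a minimal
illustration under stated assumptions, not the way the library's own conversion functions work: the whitespace split
and the tiny ``vocab`` dictionary are hypothetical stand-ins for a real tokenizer, and ``to_features`` is a helper
introduced for this example.

.. code-block:: python

    from transformers.data.processors.utils import InputExample, InputFeatures

    # Hypothetical toy vocabulary and label map standing in for a real tokenizer and task.
    vocab = {"[PAD]": 0, "[UNK]": 1, "a": 2, "great": 3, "movie": 4}
    label_map = {"negative": 0, "positive": 1}


    def to_features(example, max_length=6):
        """Convert one InputExample into an InputFeatures (illustrative helper)."""
        tokens = example.text_a.lower().split()[:max_length]
        input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
        attention_mask = [1] * len(input_ids)
        # Pad up to max_length so that every feature has the same size.
        padding = max_length - len(input_ids)
        input_ids += [vocab["[PAD]"]] * padding
        attention_mask += [0] * padding
        return InputFeatures(
            input_ids=input_ids,
            attention_mask=attention_mask,
            label=label_map[example.label],
        )


    example = InputExample(guid="train-1", text_a="A great movie", label="positive")
    features = to_features(example)
    print(features.input_ids)       # [2, 3, 4, 0, 0, 0]
    print(features.attention_mask)  # [1, 1, 1, 0, 0, 0]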


GLUE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates the
performance of models across a diverse set of existing NLU tasks. It was released together with the paper `GLUE: A
multi-task benchmark and analysis platform for natural language understanding
<https://openreview.net/pdf?id=rJ4km2R5t7>`__.

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
QQP, QNLI, RTE and WNLI.

Those processors are:

    - :class:`~transformers.data.processors.glue.MrpcProcessor`
    - :class:`~transformers.data.processors.glue.MnliProcessor`
    - :class:`~transformers.data.processors.glue.MnliMismatchedProcessor`
    - :class:`~transformers.data.processors.glue.ColaProcessor`
    - :class:`~transformers.data.processors.glue.Sst2Processor`
    - :class:`~transformers.data.processors.glue.StsbProcessor`
    - :class:`~transformers.data.processors.glue.QqpProcessor`
    - :class:`~transformers.data.processors.glue.QnliProcessor`
    - :class:`~transformers.data.processors.glue.RteProcessor`
    - :class:`~transformers.data.processors.glue.WnliProcessor`
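
As a rough sketch of how one of these processors is used, the snippet below writes a two-row ``dev.tsv`` file in the
tab-separated layout read by the MRPC processor (label in the first column, the two sentences in the fourth and fifth
columns) and loads it back as examples. The temporary directory, the ids and the sentences are made up for
illustration; in practice ``data_dir`` would point at a downloaded copy of the task data.

.. code-block:: python

    import os
    import tempfile

    from transformers.data.processors.glue import MrpcProcessor

    # Made-up rows in the MRPC layout: Quality, #1 ID, #2 ID, #1 String, #2 String.
    rows = [
        ["Quality", "#1 ID", "#2 ID", "#1 String", "#2 String"],
        ["1", "101", "102", "The cat sat on the mat.", "A cat was sitting on the mat."],
        ["0", "103", "104", "He bought a car.", "She sold her house."],
    ]

    with tempfile.TemporaryDirectory() as data_dir:
        with open(os.path.join(data_dir, "dev.tsv"), "w", encoding="utf-8") as f:
            for row in rows:
                f.write("\t".join(row) + "\n")

        # The header row is skipped; every other row becomes one InputExample.
        processor = MrpcProcessor()
        examples = processor.get_dev_examples(data_dir)

    print(processor.get_labels())  # ['0', '1']
    for example in examples:
        print(example.guid, example.label, example.text_a, "|", example.text_b)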

Additionally, the following method can be used to convert a list of
:class:`~transformers.data.processors.utils.InputExample` into a list of
:class:`~transformers.data.processors.utils.InputFeatures` that can be fed to a model.

.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features

Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.


XNLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates the
quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on `MultiNLI
<http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment annotations for 15
different languages (including both high-resource languages such as English and low-resource languages such as
Swahili).

It was released together with the paper `XNLI: Evaluating Cross-lingual Sentence Representations
<https://arxiv.org/abs/1809.05053>`__.

This library hosts the processor to load the XNLI data:

    - :class:`~transformers.data.processors.xnli.XnliProcessor`

Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

An example using this processor is given in the `run_xnli.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.


SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that
evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
(v1.1) was released together with the paper `SQuAD: 100,000+ Questions for Machine Comprehension of Text
<https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside the paper `Know What You Don't
Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.

This library hosts a processor for each of the two versions.

Processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Those processors are:

    - :class:`~transformers.data.processors.squad.SquadV1Processor`
    - :class:`~transformers.data.processors.squad.SquadV2Processor`

They both inherit from the abstract class :class:`~transformers.data.processors.squad.SquadProcessor`.

.. autoclass:: transformers.data.processors.squad.SquadProcessor
    :members:

Additionally, the following method can be used to convert SQuAD examples into
:class:`~transformers.data.processors.squad.SquadFeatures` that can be used as model inputs.

.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features

These processors, as well as the aforementioned method, can be used with files containing the data as well as with the
``tensorflow_datasets`` package. Examples are given below.


Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is an example using the processors as well as the conversion method using data files:

.. code-block:: python

    from transformers.data.processors.squad import (
        SquadV1Processor,
        SquadV2Processor,
        squad_convert_examples_to_features,
    )

    # The data directories, ``tokenizer``, the length settings and ``evaluate`` are assumed to be defined by the caller.

    # Loading a V2 processor
    processor = SquadV2Processor()
    examples = processor.get_dev_examples(squad_v2_data_dir)

    # Loading a V1 processor (this replaces the V2 examples loaded above)
    processor = SquadV1Processor()
    examples = processor.get_dev_examples(squad_v1_data_dir)

    # Convert the examples into features that can be used as model inputs
    features = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        doc_stride=doc_stride,
        max_query_length=max_query_length,
        is_training=not evaluate,
    )

Using ``tensorflow_datasets`` is as easy as using a data file:

.. code-block:: python

    import tensorflow_datasets as tfds
    from transformers.data.processors.squad import SquadV1Processor, squad_convert_examples_to_features

    # ``tokenizer``, the length settings and ``evaluate`` are assumed to be defined by the caller.

    # tensorflow_datasets only handles SQuAD v1.
    tfds_examples = tfds.load("squad")
    examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

    # Convert the examples into features that can be used as model inputs
    features = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        doc_stride=doc_stride,
        max_query_length=max_query_length,
        is_training=not evaluate,
    )
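
Both snippets pass ``doc_stride``, which governs how a context that is too long for ``max_seq_length`` is split across
several overlapping features. The following is a simplified, library-independent sketch of that sliding-window idea,
not the code that ``squad_convert_examples_to_features`` actually runs: ``split_into_spans`` is a hypothetical helper,
it works on plain token lists, and it ignores the room taken by the question and the special tokens.

.. code-block:: python

    def split_into_spans(tokens, max_span_length, doc_stride):
        """Return (start, length) windows covering ``tokens`` (illustrative helper)."""
        if max_span_length <= 0 or doc_stride <= 0:
            raise ValueError("max_span_length and doc_stride must be positive")
        spans = []
        start = 0
        while start < len(tokens):
            length = min(max_span_length, len(tokens) - start)
            spans.append((start, length))
            if start + length == len(tokens):
                break  # The last window reaches the end of the context.
            # Move forward by at most ``doc_stride`` tokens so that consecutive windows overlap.
            start += min(length, doc_stride)
        return spans


    context_tokens = "the quick brown fox jumps over the lazy dog again".split()
    for start, length in split_into_spans(context_tokens, max_span_length=4, doc_stride=2):
        print(start, context_tokens[start:start + length])

With windows of four tokens and a stride of two, each window shares two tokens with the previous one, so an answer cut
by one window boundary is more likely to appear whole in a neighbouring window.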


Another example using these processors is given in the `run_squad.py
<https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.