Processors
-----------------------------------------------------------------------------------------------------------------------

This library includes processors for several traditional tasks. These processors can be used to process a dataset into
examples that can be fed to a model.

Processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All processors follow the same architecture, that of the
:class:`~transformers.data.processors.utils.DataProcessor`. A processor returns a list of
:class:`~transformers.data.processors.utils.InputExample` objects, which can then be converted to
:class:`~transformers.data.processors.utils.InputFeatures` objects in order to be fed to the model.
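
As a rough illustration of this pattern, the sketch below reads tab-separated rows into example objects.
``SketchExample`` and ``SketchPairProcessor`` are hypothetical stand-ins written for this illustration, not the
library's classes, and the column layout of the data is an assumption.

.. code-block:: python

    import csv
    import io
    from dataclasses import dataclass
    from typing import List, Optional


    @dataclass
    class SketchExample:
        """Hypothetical stand-in for InputExample: raw text plus an optional label."""

        guid: str
        text_a: str
        text_b: Optional[str] = None
        label: Optional[str] = None


    class SketchPairProcessor:
        """Hypothetical stand-in for a DataProcessor subclass on a sentence-pair task."""

        def get_labels(self) -> List[str]:
            return ["0", "1"]

        def get_examples(self, tsv_text: str, set_type: str) -> List[SketchExample]:
            rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
            examples = []
            for i, row in enumerate(rows[1:]):  # skip the header row
                label, text_a, text_b = row  # assumed column order
                examples.append(SketchExample(f"{set_type}-{i}", text_a, text_b, label))
            return examples


    tsv = "label\tsentence1\tsentence2\n1\tA cat sleeps.\tA cat is asleep.\n0\tIt rains.\tThe sun shines.\n"
    examples = SketchPairProcessor().get_examples(tsv, "dev")
    print(examples[0])

The examples still hold raw strings at this point; turning them into model inputs is a separate step.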

.. autoclass:: transformers.data.processors.utils.DataProcessor
    :members:


.. autoclass:: transformers.data.processors.utils.InputExample
    :members:


.. autoclass:: transformers.data.processors.utils.InputFeatures
    :members:


GLUE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates the
performance of models across a diverse set of existing NLU tasks. It was released together with the paper `GLUE: A
multi-task benchmark and analysis platform for natural language understanding
<https://openreview.net/pdf?id=rJ4km2R5t7>`__.

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
QQP, QNLI, RTE and WNLI.

Those processors are:

    - :class:`~transformers.data.processors.glue.MrpcProcessor`
    - :class:`~transformers.data.processors.glue.MnliProcessor`
    - :class:`~transformers.data.processors.glue.MnliMismatchedProcessor`
    - :class:`~transformers.data.processors.glue.ColaProcessor`
    - :class:`~transformers.data.processors.glue.Sst2Processor`
    - :class:`~transformers.data.processors.glue.StsbProcessor`
    - :class:`~transformers.data.processors.glue.QqpProcessor`
    - :class:`~transformers.data.processors.glue.QnliProcessor`
    - :class:`~transformers.data.processors.glue.RteProcessor`
    - :class:`~transformers.data.processors.glue.WnliProcessor`

Additionally, the following method can be used to convert a list of
:class:`~transformers.data.processors.utils.InputExample` into a list of
:class:`~transformers.data.processors.utils.InputFeatures` that can be fed to a model.

.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features

Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
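
The conversion from examples to features can also be pictured with a small standalone sketch. It assumes a toy
whitespace tokenizer and a hand-written vocabulary; ``SketchFeatures`` and ``sketch_convert`` are hypothetical names
that only mirror the idea behind :class:`~transformers.data.processors.utils.InputFeatures`, whereas the real method
relies on a library tokenizer.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Dict, List


    @dataclass
    class SketchFeatures:
        """Hypothetical stand-in for InputFeatures: fixed-length inputs plus a label id."""

        input_ids: List[int]
        attention_mask: List[int]
        label: int


    def sketch_convert(
        text: str, label: str, vocab: Dict[str, int], label_list: List[str], max_length: int
    ) -> SketchFeatures:
        # Toy whitespace tokenizer: id 1 stands for unknown words, id 0 for padding.
        ids = [vocab.get(word, 1) for word in text.lower().split()][:max_length]
        padding = max_length - len(ids)
        return SketchFeatures(
            input_ids=ids + [0] * padding,
            # The mask marks real tokens with 1 and padding with 0.
            attention_mask=[1] * len(ids) + [0] * padding,
            # String labels are mapped to their index in the label list.
            label=label_list.index(label),
        )


    vocab = {"the": 2, "movie": 3, "was": 4, "great": 5}
    features = sketch_convert("The movie was great", "1", vocab, ["0", "1"], max_length=6)
    print(features)

Padding every example to the same length is what allows the resulting features to be batched together.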


XNLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates the
quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on `MultiNLI
<http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment annotations for 15
different languages (including both high-resource languages such as English and low-resource languages such as
Swahili).

It was released together with the paper `XNLI: Evaluating Cross-lingual Sentence Representations
<https://arxiv.org/abs/1809.05053>`__.

This library hosts the processor to load the XNLI data:

    - :class:`~transformers.data.processors.xnli.XnliProcessor`

Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

An example using this processor is given in the `run_xnli.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.


SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that
evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
(v1.1) was released together with the paper `SQuAD: 100,000+ Questions for Machine Comprehension of Text
<https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside the paper `Know What You Don't
Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.

This library hosts a processor for each of the two versions.

Processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Those processors are:

    - :class:`~transformers.data.processors.squad.SquadV1Processor`
    - :class:`~transformers.data.processors.squad.SquadV2Processor`

They both inherit from the abstract class :class:`~transformers.data.processors.squad.SquadProcessor`.

.. autoclass:: transformers.data.processors.squad.SquadProcessor
    :members:

Additionally, the following method can be used to convert SQuAD examples into
:class:`~transformers.data.processors.squad.SquadFeatures` that can be used as model inputs.

.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features

These processors, as well as the aforementioned method, can be used with files containing the data as well as with the
``tensorflow_datasets`` package. Examples are given below.


Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is an example using the processors as well as the conversion method using data files:

.. code-block::

    from transformers import SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features

    # The data directories, tokenizer, max_seq_length, doc_stride, max_query_length and
    # evaluate are assumed to be defined beforehand.

    # Loading a V2 processor
    processor = SquadV2Processor()
    examples = processor.get_dev_examples(squad_v2_data_dir)

    # Loading a V1 processor (this replaces the V2 examples loaded above)
    processor = SquadV1Processor()
    examples = processor.get_dev_examples(squad_v1_data_dir)

    features = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        doc_stride=doc_stride,
        max_query_length=max_query_length,
        is_training=not evaluate,
    )

Using ``tensorflow_datasets`` is as easy as using a data file:

.. code-block::

    import tensorflow_datasets as tfds
    from transformers import SquadV1Processor, squad_convert_examples_to_features

    # tokenizer, max_seq_length, doc_stride, max_query_length and evaluate are assumed
    # to be defined beforehand.

    # tensorflow_datasets only handles SQuAD V1.
    tfds_examples = tfds.load("squad")
    examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

    features = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        doc_stride=doc_stride,
        max_query_length=max_query_length,
        is_training=not evaluate,
    )


Another example using these processors is given in the `run_squad.py
<https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
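
The ``doc_stride`` argument in the snippets above matters when a context is longer than the model can read at once:
the context is cut into overlapping windows, and each window becomes its own set of features. The sketch below shows
one common windowing scheme, in which each window starts ``doc_stride`` tokens after the previous one.
``sketch_doc_spans`` is a hypothetical helper written for this illustration, and the library's own conversion may
differ in detail.

.. code-block:: python

    from typing import List, Tuple


    def sketch_doc_spans(num_doc_tokens: int, max_tokens_for_doc: int, doc_stride: int) -> List[Tuple[int, int]]:
        """Return (start, length) windows covering a context of num_doc_tokens tokens."""
        if max_tokens_for_doc <= 0 or doc_stride <= 0:
            raise ValueError("max_tokens_for_doc and doc_stride must be positive")
        spans = []
        start = 0
        while start < num_doc_tokens:
            length = min(num_doc_tokens - start, max_tokens_for_doc)
            spans.append((start, length))
            if start + length == num_doc_tokens:
                break  # the last window reached the end of the context
            start += min(length, doc_stride)
        return spans


    spans = sketch_doc_spans(num_doc_tokens=10, max_tokens_for_doc=4, doc_stride=2)
    print(spans)

In this scheme, a smaller ``doc_stride`` produces more windows with more overlap, which makes it less likely that an
answer is split across a window boundary, at the cost of more features per example.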