Processors
----------------------------------------------------

This library includes processors for several traditional tasks. These processors can be used to process a dataset into
examples that can be fed to a model.

Processors
~~~~~~~~~~~~~~~~~~~~~

All processors follow the same architecture, which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list
of :class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.InputExample` can be converted to
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
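
As a rough illustration of this pattern, the sketch below uses standard-library stand-ins rather than the library classes.
``ToyInputExample`` and ``ToyPairProcessor`` are hypothetical names, and the tab-separated column layout is an assumption;
only the field names (``guid``, ``text_a``, ``text_b``, ``label``) mirror those of
:class:`~transformers.data.processors.utils.InputExample`.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyInputExample:
    """Stand-in that reuses the field names of the library's InputExample."""
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


class ToyPairProcessor:
    """Hypothetical processor for a tab-separated sentence-pair dataset."""

    def get_labels(self) -> List[str]:
        # The label set of this made-up binary task.
        return ["0", "1"]

    def get_examples(self, tsv_text: str, set_type: str) -> List[ToyInputExample]:
        # Assumed column order: label, sentence1, sentence2 (with a header row).
        rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
        examples = []
        for index, (label, text_a, text_b) in enumerate(rows[1:]):
            examples.append(
                ToyInputExample(
                    guid=f"{set_type}-{index}",
                    text_a=text_a,
                    text_b=text_b,
                    label=label,
                )
            )
        return examples


sample = (
    "label\tsentence1\tsentence2\n"
    "1\tA cat sat.\tA cat was sitting.\n"
    "0\tIt rained.\tThe sun shone.\n"
)
examples = ToyPairProcessor().get_examples(sample, "dev")
print(examples[0])
```

The library's processors read from a data directory rather than from a string; the string input here only keeps the sketch self-contained.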

.. autoclass:: transformers.data.processors.utils.DataProcessor
    :members:


.. autoclass:: transformers.data.processors.utils.InputExample
    :members:


.. autoclass:: transformers.data.processors.utils.InputFeatures
    :members:


GLUE
~~~~~~~~~~~~~~~~~~~~~

`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates
the performance of models across a diverse set of existing NLU tasks. It was released together with the paper
`GLUE: A multi-task benchmark and analysis platform for natural language understanding <https://openreview.net/pdf?id=rJ4km2R5t7>`__.

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched),
CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.

Those processors are:
    - :class:`~transformers.data.processors.glue.MrpcProcessor`
    - :class:`~transformers.data.processors.glue.MnliProcessor`
    - :class:`~transformers.data.processors.glue.MnliMismatchedProcessor`
    - :class:`~transformers.data.processors.glue.ColaProcessor`
    - :class:`~transformers.data.processors.glue.Sst2Processor`
    - :class:`~transformers.data.processors.glue.StsbProcessor`
    - :class:`~transformers.data.processors.glue.QqpProcessor`
    - :class:`~transformers.data.processors.glue.QnliProcessor`
    - :class:`~transformers.data.processors.glue.RteProcessor`
    - :class:`~transformers.data.processors.glue.WnliProcessor`

Additionally, the following function can be used to convert a list of
:class:`~transformers.data.processors.utils.InputExample` into a list of
:class:`~transformers.data.processors.utils.InputFeatures`.

.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features
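
To give a feel for what such a conversion produces, here is a simplified sketch that does not call the library.
``ToyInputFeatures`` and ``toy_convert`` are hypothetical names, and the whitespace tokenization, on-the-fly vocabulary
and naive truncation are assumptions made purely for illustration; the library function relies on the tokenizer it is given.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyInputFeatures:
    """Stand-in holding the kind of fields a model consumes."""
    input_ids: List[int]
    attention_mask: List[int]
    token_type_ids: List[int]
    label: int


def toy_convert(text_a: str, text_b: str, label: str,
                vocab: Dict[str, int], label_map: Dict[str, int],
                max_length: int) -> ToyInputFeatures:
    # Whitespace tokenization stands in for a real subword tokenizer.
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + text_a + [SEP]; segment 1 covers text_b + [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # Naive truncation from the right (a simplification).
    tokens = tokens[:max_length]
    token_type_ids = token_type_ids[:max_length]

    # Unknown tokens get the next free id in this toy vocabulary.
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    attention_mask = [1] * len(input_ids)

    # Pad every sequence on the right up to max_length.
    padding = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding
    token_type_ids += [0] * padding

    return ToyInputFeatures(input_ids, attention_mask, token_type_ids, label_map[label])


vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
features = toy_convert("It rained", "The sun shone", "0", vocab, {"0": 0, "1": 1}, max_length=10)
print(features)
```

Every list in the result has length ``max_length``, which is what allows the features to be batched into tensors.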

Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^

An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py>`__ script.


XNLI
~~~~~~~~~~~~~~~~~~~~~

`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates
the quality of cross-lingual text representations.
XNLI is a crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment
annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

It was released together with the paper
`XNLI: Evaluating Cross-lingual Sentence Representations <https://arxiv.org/abs/1809.05053>`__.

This library hosts the processor to load the XNLI data:
    - :class:`~transformers.data.processors.xnli.XnliProcessor`

Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

An example using this processor is given in the
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_xnli.py>`__ script.


SQuAD
~~~~~~~~~~~~~~~~~~~~~

`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates
the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper
`SQuAD: 100,000+ Questions for Machine Comprehension of Text <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside 
the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.

This library hosts a processor for each of the two versions.

Processors
^^^^^^^^^^^^^^^^^^^^^^^^^

Those processors are:
    - :class:`~transformers.data.processors.squad.SquadV1Processor`
    - :class:`~transformers.data.processors.squad.SquadV2Processor`

They both inherit from the abstract class :class:`~transformers.data.processors.squad.SquadProcessor`.

.. autoclass:: transformers.data.processors.squad.SquadProcessor
    :members:

Additionally, the following function can be used to convert SQuAD examples into :class:`~transformers.data.processors.squad.SquadFeatures`
that can be used as model inputs.

.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features

These processors, as well as the aforementioned function, can be used with files containing the data as well as with the `tensorflow_datasets` package.
Examples are given below.

Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example that uses the processors and the conversion function with data files:

Example::

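    from transformers.data.processors.squad import (
        SquadV1Processor,
        SquadV2Processor,
        squad_convert_examples_to_features,
    )

    # Assumes tokenizer, max_seq_length, doc_stride, max_query_length, evaluate
    # and the two data directories are defined by the caller.

    # Loading a V2 processor
    processor = SquadV2Processor()
    examples = processor.get_dev_examples(squad_v2_data_dir)

    # Loading a V1 processor
    processor = SquadV1Processor()
    examples = processor.get_dev_examples(squad_v1_data_dir)

    # Convert the examples loaded last (here, the V1 ones) into features
    features = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        doc_stride=doc_stride,
        max_query_length=max_query_length,
        is_training=not evaluate,
    )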

Using `tensorflow_datasets` is as easy as using a data file:

Example::

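    import tensorflow_datasets as tfds
    from transformers.data.processors.squad import (
        SquadV1Processor,
        squad_convert_examples_to_features,
    )

    # tensorflow_datasets only handles SQuAD V1.
    tfds_examples = tfds.load("squad")
    examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

    # tokenizer and the length settings are assumed to be defined by the caller.
    features = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        doc_stride=doc_stride,
        max_query_length=max_query_length,
        is_training=not evaluate,
    )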
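
The ``doc_stride`` argument used in the calls above controls how a context that is longer than the model's window is split
into overlapping chunks. A minimal sketch of that sliding-window idea follows. ``sliding_windows`` is a hypothetical helper,
not the library's implementation, and it ignores the room taken by the question and the special tokens.

```python
from typing import List, Tuple


def sliding_windows(num_tokens: int, max_window: int, doc_stride: int) -> List[Tuple[int, int]]:
    """Return (start, length) spans that cover num_tokens with overlapping windows."""
    spans = []
    start = 0
    while start < num_tokens:
        # A window holds at most max_window tokens.
        length = min(num_tokens - start, max_window)
        spans.append((start, length))
        if start + length == num_tokens:
            break  # the last window reached the end of the context
        # Advance by doc_stride so that consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


# A 10-token context, windows of 4 tokens, advancing 2 tokens at a time.
print(sliding_windows(num_tokens=10, max_window=4, doc_stride=2))
```

With a stride smaller than the window, an answer that straddles the boundary of one chunk still appears whole in a neighbouring chunk.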


Another example using these processors is given in the
`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/run_squad.py>`__ script.