.. 
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Reformer
-----------------------------------------------------------------------------------------------------------------------

**DISCLAIMER:** This model is still a work in progress, if you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Reformer model was proposed in the paper `Reformer: The Efficient Transformer
<https://arxiv.org/abs/2001.04451.pdf>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.

The abstract from the paper is the following:

*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can
be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of
Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its
complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual
layers instead of the standard residuals, which allows storing activations only once in the training process instead of
N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models
while being much more memory-efficient and much faster on long sequences.*

This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The authors' code can be
found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`__.

**Note**:

- Reformer does **not** work with `torch.nn.DataParallel` due to a bug in PyTorch, see `issue #36035
  <https://github.com/pytorch/pytorch/issues/36035>`__

Axial Positional Encodings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Axial Positional Encodings were first implemented in Google's `trax library
<https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29>`__
and developed by the authors of this model's paper. In models that process very long input sequences, the
conventional position id encodings store an embedding vector of size :math:`d`, the :obj:`config.hidden_size`, for
every position :math:`1, \ldots, n_s`, with :math:`n_s` being :obj:`config.max_embedding_size`. This means that having
a sequence length of :math:`n_s = 2^{19} \approx 0.5M` and a ``config.hidden_size`` of :math:`d = 2^{10} \approx 1000`
would result in a position encoding matrix:

.. math::
    X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right] 

which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:

.. math::
    X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right] 

and

.. math::
    X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right] 

with:

.. math::
    d = d^1 + d^2 \text{ and } n_s = n_s^1 \times n_s^2 .

Therefore the following holds:

.. math::
    X_{i,j} = \begin{cases}
                X^{1}_{i, k}, & \text{if }\ i < d^1 \text{ with } k = j \mod n_s^1 \\
                X^{2}_{i - d^1, l}, & \text{if } i \ge d^1 \text{ with } l = \lfloor\frac{j}{n_s^1}\rfloor
              \end{cases}

Intuitively, this means that a position embedding vector :math:`x_j \in \mathbb{R}^{d}` is now the concatenation of two
factorized embedding vectors: :math:`x^1_{k}` and :math:`x^2_{l}`, where the :obj:`config.max_embedding_size` dimension
:math:`j` is factorized into :math:`k \text{ and } l`. This design ensures that each position embedding vector
:math:`x_j` is unique.
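
To make the factorization concrete, a position index :math:`j` is split into :math:`k = j \mod n_s^1` and
:math:`l = \lfloor j / n_s^1 \rfloor`, and the embedding for :math:`j` is assembled from the two factorized matrices.
The following is a minimal, self-contained sketch of this lookup with made-up sizes; it only illustrates the formula
above and is not the implementation used in the library:

.. code-block::

    import torch

    d1, d2 = 32, 32   # d^1 and d^2, with d = d1 + d2
    n1, n2 = 8, 16    # n_s^1 and n_s^2, with n_s = n1 * n2

    # the two factorized embedding matrices X^1 and X^2 (random, for illustration only)
    X1 = torch.randn(n1, d1)  # one row per index k
    X2 = torch.randn(n2, d2)  # one row per index l

    def axial_position_embedding(j):
        # factorize the (zero-based) position index j into (k, l)
        k = j % n1
        l = j // n1
        # the embedding of position j is the concatenation of x^1_k and x^2_l
        return torch.cat([X1[k], X2[l]])

    x = axial_position_embedding(100)  # vector of size d = 64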

Using the above example again, axial position encoding with :math:`d^1 = 2^9, d^2 = 2^9, n_s^1 = 2^9, n_s^2 = 2^{10}`
can drastically reduce the number of parameters from roughly 500M to :math:`2^{18} + 2^{19} \approx 780000`.

In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` whose sum has to be
equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` whose
product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the `sequence
length` of the :obj:`input_ids`.
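
These two constraints can be checked against each other. The sketch below builds a hypothetical configuration for a
training sequence length of :math:`2^{16} = 65536` tokens; the concrete values are only an example, and the maximum
sequence length is exposed as :obj:`max_position_embeddings` on :class:`~transformers.ReformerConfig` (this is the
attribute referred to as :obj:`config.max_embedding_size` in the text above):

.. code-block::

    from transformers import ReformerConfig

    sequence_length = 65536  # 2**16, the sequence length used during training

    config = ReformerConfig(
        hidden_size=256,
        axial_pos_embds=True,
        axial_pos_embds_dim=(64, 192),  # (d^1, d^2): must sum to hidden_size
        axial_pos_shape=(256, 256),     # (n_s^1, n_s^2): product must equal the training sequence length
        max_position_embeddings=65536,
    )

    assert sum(config.axial_pos_embds_dim) == config.hidden_size
    assert config.axial_pos_shape[0] * config.axial_pos_shape[1] == sequence_length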


LSH Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In locality sensitive hashing (LSH) self attention, the key and query projection weights are tied. Therefore, the key
query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
`Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`__ to assign each of the tied key
query embedding vectors to one of :obj:`config.num_buckets` possible buckets. The premise is that the more "similar"
the key query embedding vectors are to each other (in terms of *cosine similarity*), the more likely they are assigned to
the same bucket.

The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument
:obj:`num_hashes` of the forward function so that the output of the LSH self attention better approximates the output
of the "normal" full self attention. The buckets are then sorted and chunked into query key embedding vector chunks,
each of length :obj:`config.lsh_chunk_length`. For each chunk, the query embedding vectors attend to their key vectors
(which are tied to themselves) and to the key embedding vectors of the :obj:`config.lsh_num_chunks_before` previous
neighboring chunks and the :obj:`config.lsh_num_chunks_after` following neighboring chunks.
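
The chunking parameters and the number of hashing rounds map directly onto the configuration and the forward call. The
sketch below uses illustrative hyperparameter values; the attribute names are the ones exposed by
:class:`~transformers.ReformerConfig`, where the chunk length is called :obj:`lsh_attn_chunk_length`, and axial
position encodings are disabled only to keep the example small:

.. code-block::

    import torch
    from transformers import ReformerConfig, ReformerModel

    config = ReformerConfig(
        attn_layers=["lsh", "lsh", "lsh", "lsh"],  # use only LSH self attention layers
        lsh_attn_chunk_length=64,                  # the chunk length discussed above
        lsh_num_chunks_before=1,
        lsh_num_chunks_after=0,
        num_hashes=1,
        axial_pos_embds=False,
    )
    model = ReformerModel(config)

    input_ids = torch.randint(0, config.vocab_size, (1, 512))  # 512 is a multiple of the chunk length
    # num_hashes can be increased at call time to better approximate full self attention
    outputs = model(input_ids, num_hashes=4)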

For more information, see the `original paper <https://arxiv.org/abs/2001.04451>`__ or this great `blog post
<https://www.pragmatic.ml/reformer-deep-dive/>`__.

Note that :obj:`config.num_buckets` can also be factorized into a list :math:`(n_{\text{buckets}}^1,
n_{\text{buckets}}^2)`. This way, instead of assigning the query key embedding vectors to one of :math:`(1,\ldots,
n_{\text{buckets}})`, they are assigned to one of the bucket index pairs :math:`(1\text{-}1,\ldots,
n_{\text{buckets}}^1\text{-}1, \ldots, 1\text{-}n_{\text{buckets}}^2, \ldots,
n_{\text{buckets}}^1\text{-}n_{\text{buckets}}^2)`. This is crucial for very long sequences to save memory.

When training a model from scratch, it is recommended to leave :obj:`config.num_buckets=None`, so that depending on the
sequence length a good value for :obj:`num_buckets` is calculated on the fly. This value will then automatically be
saved in the config and should be reused for inference.
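
For example, a quick way to inspect the value that was derived on the fly (a sketch with arbitrary model and input
sizes, relying on the behavior described above):

.. code-block::

    import torch
    from transformers import ReformerConfig, ReformerModel

    config = ReformerConfig(attn_layers=["lsh", "lsh"], num_buckets=None, axial_pos_embds=False)
    model = ReformerModel(config)

    _ = model(torch.randint(0, config.vocab_size, (1, 512)))
    # a suitable number of buckets has been derived from the sequence length and
    # stored in the config, so that it can be reused for inference
    print(config.num_buckets)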

Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from
:math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory
and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.


Local Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Local self attention is essentially a "normal" self attention layer with key, query and value projections, but is
chunked so that in each chunk of length :obj:`config.local_chunk_length` the query embedding vectors only attend to
the key embedding vectors in their chunk and to the key embedding vectors of the :obj:`config.local_num_chunks_before`
previous neighboring chunks and the :obj:`config.local_num_chunks_after` following neighboring chunks.
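
Analogously to LSH self attention, the chunk size and the number of neighboring chunks are configuration attributes,
exposed as :obj:`local_attn_chunk_length`, :obj:`local_num_chunks_before` and :obj:`local_num_chunks_after` on
:class:`~transformers.ReformerConfig`. A minimal sketch with illustrative values (axial position encodings again
disabled for brevity):

.. code-block::

    import torch
    from transformers import ReformerConfig, ReformerModel

    config = ReformerConfig(
        attn_layers=["local", "local", "local", "local"],  # use only local self attention layers
        local_attn_chunk_length=64,
        local_num_chunks_before=1,
        local_num_chunks_after=0,
        axial_pos_embds=False,
    )
    model = ReformerModel(config)
    hidden_states = model(torch.randint(0, config.vocab_size, (1, 512)))[0]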

Using Local self attention, the memory and time complexity of the query-key matmul operation, which usually represents
the memory and time bottleneck in a transformer model, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to
:math:`\mathcal{O}(n_s \times c)`, with :math:`n_s` being the sequence length and :math:`c` the fixed local chunk
length, i.e. it grows only linearly with the sequence length.


Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During training, we must ensure that the sequence length is set to a value that is divisible by the least common
multiple of :obj:`config.lsh_chunk_length` and :obj:`config.local_chunk_length`, and that the parameters of the Axial
Positional Encodings are correctly set as described above. Reformer is very memory efficient, so the model can
easily be trained on sequences as long as 64000 tokens.
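
These constraints can be restated as a small helper along the following lines. The helper is not part of the library;
it simply re-expresses the rules above, using the configuration attribute names :obj:`lsh_attn_chunk_length` and
:obj:`local_attn_chunk_length` (the attributes referred to as :obj:`config.lsh_chunk_length` and
:obj:`config.local_chunk_length` in the text):

.. code-block::

    from math import gcd

    def check_sequence_length(config, sequence_length):
        # the sequence length must be divisible by the least common multiple of the two chunk lengths
        lcm = config.lsh_attn_chunk_length * config.local_attn_chunk_length // gcd(
            config.lsh_attn_chunk_length, config.local_attn_chunk_length
        )
        assert sequence_length % lcm == 0, f"sequence length must be a multiple of {lcm}"

        # with axial position encodings, the training sequence length must equal
        # the product of config.axial_pos_shape
        if config.axial_pos_embds:
            n1, n2 = config.axial_pos_shape
            assert sequence_length == n1 * n2, "axial_pos_shape does not match the sequence length"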

For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows:

.. code-block::

    from transformers import ReformerModelWithLMHead, ReformerTokenizer

    tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')
    model = ReformerModelWithLMHead.from_pretrained('google/reformer-crime-and-punishment')
    input_ids = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
    loss = model(input_ids, labels=input_ids)[0]


ReformerConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerConfig
    :members:


ReformerTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerTokenizer
    :members: save_vocabulary


ReformerTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerTokenizerFast
    :members:


ReformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerModel
    :members: forward


ReformerModelWithLMHead
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerModelWithLMHead
    :members: forward


ReformerForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerForMaskedLM
    :members: forward


ReformerForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerForSequenceClassification
    :members: forward


ReformerForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ReformerForQuestionAnswering
    :members: forward