chenpangpang / transformers · Commits

Commit 49d8076f (unverified), authored Aug 17, 2020 by Stas Bekman, committed by GitHub on Aug 17, 2020. Parent: 72911c89

[doc] Summary of the models fixes (#6511)

* [doc] Summary of the models fixes
* correction

Showing 1 changed file with 131 additions and 137 deletions:

docs/source/model_summary.rst (+131, -137)
@@ -29,7 +29,7 @@ sentence classification or token classification. A typical example of such model

Note that the only difference between autoregressive models and autoencoding models is in the way the model is pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first introduced.

Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
@@ -37,7 +37,7 @@ tasks or by transforming other tasks to sequence-to-sequence problems. They can

most natural applications are translation, summarization and question answering. The original transformer model is an example of such a model (only for translation), T5 is an example that can be fine-tuned on other tasks.

Multimodal models mix text inputs with other kinds (e.g. images) and are more specific to a given task.

.. _autoregressive-models:
@@ -45,7 +45,7 @@ Autoregressive models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so that, at each position, the attention heads can only look at the tokens that came before.
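A minimal sketch of that causal mask (an illustration of the idea, not the library's internal implementation): the mask is a lower-triangular matrix that removes attention to future positions before the softmax.

.. code-block:: python

    import torch

    def causal_attention_probs(q, k):
        """Toy example: each token can only attend to itself and the tokens before it."""
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5        # (seq_len, seq_len) attention logits
        seq_len = scores.size(-1)
        mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # True on/below the diagonal
        scores = scores.masked_fill(~mask, float("-inf"))           # future positions become -inf -> 0 after softmax
        return torch.softmax(scores, dim=-1)

    q = k = torch.randn(5, 64)            # 5 tokens, dimension 64
    probs = causal_attention_probs(q, k)
    print(probs[0])                       # the first token can only attend to itself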
Original GPT
----------------------------------------------
@@ -159,10 +159,10 @@ An autoregressive transformer model with lots of tricks to reduce memory footprint

include:

* Use :ref:`Axial position encoding <axial-pos-encoding>` (see below for more details). It's a mechanism to avoid having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller matrices.
* Replace traditional attention by :ref:`LSH (locality-sensitive hashing) attention <lsh-attention>` (see below for more details). It's a technique to avoid computing the full query-key product in the attention layers.
* Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them for results inside a given layer (less efficient than storing them but saves memory); see the sketch just after this list.
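The reversible-layer trick in the last bullet can be sketched in a few lines (a simplified illustration of the idea, not Reformer's actual implementation): with two residual functions F and G, the inputs of a block can be recomputed exactly from its outputs, so they don't need to be stored for the backward pass.

.. code-block:: python

    import torch

    # Stand-ins for the attention / feed-forward sub-layers of a reversible block.
    F = torch.nn.Linear(16, 16)
    G = torch.nn.Linear(16, 16)

    def forward(x1, x2):
        y1 = x1 + F(x2)
        y2 = x2 + G(y1)
        return y1, y2

    def recover_inputs(y1, y2):
        x2 = y2 - G(y1)    # subtract the residual to get x2 back
        x1 = y1 - F(x2)    # then subtract again to get x1 back
        return x1, x2

    x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
    y1, y2 = forward(x1, x2)
    r1, r2 = recover_inputs(y1, y2)
    print(torch.allclose(x1, r1), torch.allclose(x2, r2))  # True True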
@@ -206,8 +206,7 @@ Autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their corrupted versions.

BERT
----------------------------------------------
@@ -225,7 +224,7 @@ BERT

Jacob Devlin et al.

Corrupts the inputs by using random masking: more precisely, during pretraining, a given percentage of tokens (usually 15%) is masked by:

* a special mask token with probability 0.8
* a random token different from the one masked with probability 0.1
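A minimal sketch of that corruption step (the standard BERT recipe, in which the remaining selected tokens are left unchanged with probability 0.1; this is an illustration, not the library's data collator):

.. code-block:: python

    import random

    def bert_style_mask(tokens, vocab, mask_prob=0.15):
        """Pick ~15% of positions, then: 80% [MASK], 10% random token, 10% unchanged."""
        corrupted, labels = list(tokens), [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if random.random() >= mask_prob:
                continue
            labels[i] = tok                          # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"              # special mask token with probability 0.8
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # random token (ideally different from the original) with probability 0.1
            # else: keep the original token with probability 0.1
        return corrupted, labels

    vocab = ["my", "dog", "is", "very", "cute", "the", "a", "cat"]
    print(bert_style_mask(["my", "dog", "is", "very", "cute"], vocab))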
@@ -256,11 +255,11 @@ Zhenzhong Lan et al.

Same as BERT but with a few tweaks:

* Embedding size E is different from hidden size H, justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens), so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V being the vocab size). If E < H, it has fewer parameters.
* Layers are split in groups that share parameters (to save memory).
* Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have been swapped or not.
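To make the parameter-count argument in the first bullet concrete, here is a quick back-of-the-envelope comparison (illustrative values roughly in the range of BERT/ALBERT vocabularies and hidden sizes):

.. code-block:: python

    # Embedding parameters with and without ALBERT's factorization.
    V, H, E = 30_000, 4_096, 128      # vocab size, hidden size, embedding size (illustrative values)

    direct = V * H                    # BERT-style: one V x H embedding matrix
    factorized = V * E + E * H        # ALBERT-style: V x E embeddings projected to H by an E x H matrix

    print(f"V x H        : {direct:,} parameters")       # 122,880,000
    print(f"V x E + E x H: {factorized:,} parameters")    # 4,364,288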
@@ -284,9 +283,9 @@ Yinhan Liu et al.

Same as BERT with better pretraining tricks:

* dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
* no NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of contiguous texts together to reach 512 tokens (so the sentences are in an order that may span several documents)
* train with larger batches
* use BPE with bytes as a subunit and not characters (because of unicode characters)
@@ -337,18 +336,17 @@ library provides checkpoints for all of them:

* Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the previous section as well). One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens that may span over several documents in one of those languages.
* Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens that may span over several documents in one of those languages, with dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two different languages, with random masking. To predict one of the masked tokens, the model can use both the surrounding context in language 1 and the context given by language 2.

Checkpoints refer to which method was used for pretraining by having `clm`, `mlm` or `mlm-tlm` in their names. On top of positional embeddings, the model has language embeddings. When training using MLM/CLM, this gives the model an indication of the language used, and when training using MLM+TLM, an indication of the language used for each part.

The library provides a version of the model for language modeling, token classification, sentence classification and question answering.
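For illustration, feeding the language embedding to an XLM checkpoint looks roughly like this (a sketch using the `xlm-clm-enfr-1024` checkpoint; the `langs` tensor is only needed for the checkpoints that were trained with language embeddings):

.. code-block:: python

    import torch
    from transformers import XLMTokenizer, XLMWithLMHeadModel

    tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
    model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

    input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])
    # One language id per token, taken from the tokenizer's language-to-id mapping.
    langs = torch.full_like(input_ids, tokenizer.lang2id["en"])

    outputs = model(input_ids, langs=langs)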
@@ -368,7 +366,7 @@ XLM-RoBERTa

`Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`_, Alexis Conneau et al.

Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses masked language modeling on sentences coming from one language. However, the model is trained on many more languages (100) and doesn't use the language embeddings, so it's capable of detecting the input language by itself.
@@ -469,14 +467,14 @@ BART

<https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.

Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is fed the original tokens (but has a mask to hide the future words, like a regular transformers decoder). For the encoder, on the pretraining tasks, a composition of the following transformations is applied:

* mask random tokens (like in BERT)
* delete random tokens
* mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
* permute sentences
* rotate the document to make it start at a specific token

The library provides a version of this model for conditional generation and sequence classification.
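As a usage sketch of the conditional-generation version (assuming the `facebook/bart-large-cnn` summarization checkpoint):

.. code-block:: python

    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

    article = "The tower is 324 metres tall, about the same height as an 81-storey building."
    inputs = tokenizer([article], return_tensors="pt", max_length=1024, truncation=True)

    # The decoder generates the summary token by token, attending to the encoder's output.
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))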
@@ -513,20 +511,19 @@ T5

`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_, Colin Raffel et al.

Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific prefixes: “summarize: ”, “question: ”, “translate English to German: ” and so forth.

The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).

Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original sentence, and the target is then the dropped out tokens delimited by their sentinel tokens.

For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: “dog”, “is” and “cute”, the encoder input becomes “My <x> very <y> .” and the target input becomes “<x> dog is <y> cute .<z>”

The library provides a version of this model for conditional generation.
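A short sketch of the prefix mechanism with the conditional-generation version (using the `t5-small` checkpoint as an example):

.. code-block:: python

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The task is selected purely by the text prefix; the same model handles all of them.
    text = "translate English to German: The house is wonderful."
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    output_ids = model.generate(input_ids, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "Das Haus ist wunderbar."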
@@ -545,12 +542,12 @@ MMBT

et al.

A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer model takes as inputs the embeddings of the tokenized text and the final activations of a resnet pretrained on images (after the pooling layer), which go through a linear layer (to go from the number of features at the end of the resnet to the hidden state dimension of the transformer).

The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the model know which part of the input vector corresponds to the text and which to the image.

The pretrained model only works for classification.
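The wiring described above can be sketched in plain PyTorch (a conceptual illustration of the idea, not the MMBT classes shipped in the library; the resnet, vocabulary size and dimensions are stand-ins):

.. code-block:: python

    import torch
    from torch import nn
    from torchvision.models import resnet50

    hidden_size, resnet_features = 768, 2048

    # Image side: a pretrained resnet up to (and including) the pooling layer...
    backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-1])
    # ...followed by a linear layer mapping the resnet features to the transformer's hidden size.
    image_proj = nn.Linear(resnet_features, hidden_size)

    # Text side: token embeddings (stand-in for the transformer's own embedding layer).
    token_embeddings = nn.Embedding(30_000, hidden_size)

    images = torch.randn(2, 3, 224, 224)
    token_ids = torch.randint(0, 30_000, (2, 12))

    image_tokens = image_proj(backbone(images).flatten(1)).unsqueeze(1)   # (batch, 1, hidden)
    text_tokens = token_embeddings(token_ids)                             # (batch, 12, hidden)

    # Concatenate both modalities; segment ids (0 = image, 1 = text) feed the segment embedding.
    sequence = torch.cat([image_tokens, text_tokens], dim=1)              # what the transformer receives
    segments = torch.cat([torch.zeros(2, 1, dtype=torch.long), torch.ones(2, 12, dtype=torch.long)], dim=1)
    print(sequence.shape, segments.shape)                                 # torch.Size([2, 13, 768]) torch.Size([2, 13])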
@@ -573,19 +570,19 @@ use a sparse version of the attention matrix to speed up training.

**LSH attention**

:ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is modified to mask the current token (except at the first position), because it will give a query and a key that are equal (so very similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by an n_rounds parameter) and then averaged together.
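A toy version of the bucketing idea behind LSH attention (random-projection hashing; a simplification of Reformer's actual scheme): queries and keys that hash to the same bucket are treated as "close", and attention is only computed inside each bucket.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    def lsh_buckets(vectors, n_hyperplanes=4):
        """Hash vectors with random hyperplanes: the sign pattern of the projections gives a bucket id."""
        hyperplanes = rng.standard_normal((vectors.shape[-1], n_hyperplanes))
        signs = (vectors @ hyperplanes) > 0                        # (n_vectors, n_hyperplanes) booleans
        return (signs * (2 ** np.arange(n_hyperplanes))).sum(-1)   # pack the sign pattern into an integer

    queries = rng.standard_normal((8, 64))
    buckets = lsh_buckets(queries)    # Reformer shares queries and keys, so one hash covers both
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        print(f"bucket {b}: tokens {idx.tolist()}")  # attention is restricted to tokens in the same bucket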
.. _local-attention:

**Local attention**

:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a representation of the whole sentence.

Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
@@ -608,11 +605,8 @@ Other tricks

:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E into two smaller matrices E1 and E2, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l` and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for time step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and :math:`j // l1` in E2.
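In code, the factorization can be sketched like this (a toy illustration of the concatenation rule above, not the library's implementation):

.. code-block:: python

    import torch

    l1, l2, d1, d2 = 16, 8, 32, 32     # so l = l1 * l2 = 128 and d = d1 + d2 = 64
    E1 = torch.randn(l1, d1)           # l1 x d1
    E2 = torch.randn(l2, d2)           # l2 x d2

    def axial_position_embedding(j):
        """Embedding of time step j: concatenate row j % l1 of E1 with row j // l1 of E2."""
        return torch.cat([E1[j % l1], E2[j // l1]])

    E = torch.stack([axial_position_embedding(j) for j in range(l1 * l2)])
    print(E.shape)  # torch.Size([128, 64]) -- but only 16*32 + 8*32 = 768 values are stored instead of 128*64 = 8192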