Commit 9bab9b83 authored Jan 14, 2020 by Lysandre, committed by Lysandre Debut on Jan 23, 2020

Glossary

parent 64abd3e0
Showing 2 changed files with 135 additions and 0 deletions:

- docs/source/glossary.rst (+134, -0)
- docs/source/index.rst (+1, -0)
docs/source/glossary.rst (new file, mode 100644)
Glossary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Every model is different yet bears similarities with the others. Therefore most models use the same inputs, which are
detailed here alongside usage examples.
Input IDs
--------------------------
The input IDs are often the only required parameters to be passed to the model as input. *They are token indices,
numerical representations of tokens building the sequences that will be used as input by the model*.

Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ tokenizer:
::
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    sequence = "A Titan RTX has 24GB of VRAM"
The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.
::
    tokenized_sequence = tokenizer.tokenize(sequence)
    assert tokenized_sequence == ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
These tokens can then be converted into IDs which are understandable by the model. Several methods are available for
this; the recommended ones are `encode` and `encode_plus`, which leverage the Rust implementation of
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance.
::
    encoded_sequence = tokenizer.encode(sequence)
    assert encoded_sequence == [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
The `encode` and `encode_plus` methods automatically add "special tokens", which are special IDs the model uses.
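A quick way to see these special tokens is to decode the IDs back into a string. This is a minimal sketch reusing the
``encoded_sequence`` and the ``bert-base-cased`` tokenizer loaded above; the exact output depends on the tokenizer:

::

    decoded_sequence = tokenizer.decode(encoded_sequence)
    # For this BERT tokenizer the decoded string is expected to be wrapped in the
    # [CLS] and [SEP] special tokens, e.g. "[CLS] A Titan RTX has 24GB of VRAM [SEP]".
    print(decoded_sequence)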
Attention mask
--------------------------
The attention mask is an optional argument used when batching sequences together. This argument indicates to the
model which tokens should be attended to, and which should not.
For example, consider these two sequences:
::
sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
encoded_sequence_a = tokenizer.encode(sequence_a)
assert len(encoded_sequence_a) == 8
encoded_sequence_b = tokenizer.encode(sequence_b)
assert len(encoded_sequence_b) == 19
These two sequences have different lengths and therefore can't be put together in the same tensor as-is. The first
sequence needs to be padded up to the length of the second one, or the second one needs to be truncated down to the
length of the first one.

In the first case, the list of IDs will be extended by the padding indices:
::
    padded_sequence_a = tokenizer.encode(sequence_a, max_length=19, pad_to_max_length=True)

    assert padded_sequence_a == [101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    assert encoded_sequence_b == [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]
These can then be converted into a tensor in PyTorch or TensorFlow.
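For instance, in PyTorch the two lists of IDs can be stacked into a single batch tensor. This is a minimal sketch
assuming PyTorch is installed; ``tensorflow.constant`` would play the same role on the TensorFlow side:

::

    import torch

    # Stack the padded sequence and the longer sequence into one (2, 19) batch tensor.
    input_ids = torch.tensor([padded_sequence_a, encoded_sequence_b])
    assert input_ids.shape == (2, 19)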
The attention mask is a binary tensor indicating the position of the padded indices so that the model does not attend
to them. For the :class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to while
:obj:`0` indicates a padded value.

The method :func:`~transformers.PreTrainedTokenizer.encode_plus` may be used to obtain the attention mask directly:
::
    sequence_a_dict = tokenizer.encode_plus(sequence_a, max_length=19, pad_to_max_length=True)

    assert sequence_a_dict['input_ids'] == [101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    assert sequence_a_dict['attention_mask'] == [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
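The same mask can also be built by hand from the padded IDs, which makes the convention explicit. This is a sketch
assuming the padding ID is ``tokenizer.pad_token_id`` (``0`` for this BERT tokenizer):

::

    # 1 for real tokens, 0 for padding tokens.
    manual_mask = [0 if token_id == tokenizer.pad_token_id else 1 for token_id in sequence_a_dict['input_ids']]
    assert manual_mask == sequence_a_dict['attention_mask']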
Token Type IDs
--------------------------
Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be encoded in the same input IDs. They are usually separated by special tokens, such as the classifier and separator
tokens. For example, the BERT model builds its two-sequence input as follows:
::
    # [CLS] SEQ_A [SEP] SEQ_B [SEP]
    sequence_a = "HuggingFace is based in NYC"
    sequence_b = "Where is HuggingFace based?"

    encoded_sequence = tokenizer.encode(sequence_a, sequence_b)
    assert tokenizer.decode(encoded_sequence) == "[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]"
This is enough for some models to understand where one sequence ends and where another begins. However, other models,
such as BERT, have an additional mechanism: the segment IDs. The token type IDs are a binary mask identifying the
different sequences in the model.
We can leverage :func:`~transformers.PreTrainedTokenizer.encode_plus` to output the Token Type IDs for us:
::
    encoded_dict = tokenizer.encode_plus(sequence_a, sequence_b)

    assert encoded_dict['input_ids'] == [101, 20164, 10932, 2271, 7954, 1110, 1359, 1107, 17520, 102, 2777, 1110, 20164, 10932, 2271, 7954, 1359, 136, 102]
    assert encoded_dict['token_type_ids'] == [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
The first sequence, the "context" used for the question, has all its tokens represented by :obj:`0`, whereas the
question has all its tokens represented by :obj:`1`. Some models, like :class:`~transformers.XLNetModel` use an
additional token represented by a :obj:`2`.
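As a quick check, the token type IDs line up with the two segments of the decoded string. This is a sketch reusing the
``encoded_dict`` from above; the decoded text shown in the comment is only indicative:

::

    pair_ids = encoded_dict['input_ids']
    pair_type_ids = encoded_dict['token_type_ids']

    # Tokens of the first segment, up to and including the first [SEP], are marked with 0.
    first_segment = [token_id for token_id, type_id in zip(pair_ids, pair_type_ids) if type_id == 0]
    print(tokenizer.decode(first_segment))
    # Expected to read "[CLS] HuggingFace is based in NYC [SEP]"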
Position IDs
--------------------------
The position IDs are used by the model to identify which token is at which position. Unlike RNNs, which have the
position of each token embedded within them, transformers are unaware of the position of each token. The position
IDs are created for this purpose.
They are an optional parameter. If no position IDs are passed to the model, they are automatically created as absolute
positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
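If you do want to pass them explicitly, absolute position IDs are simply the token positions, one per input ID. This
is a minimal sketch in PyTorch; ``position_ids`` is the argument name used by the BERT-style models, and the values
here assume the 19-token padded sequence from the attention mask section:

::

    import torch

    # One position index per token, for a batch of one sequence.
    position_ids = torch.arange(len(padded_sequence_a)).unsqueeze(0)
    assert position_ids.shape == (1, 19)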
docs/source/index.rst

@@ -58,6 +58,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
   installation
   quickstart
+  glossary
   pretrained_models
   model_sharing
   examples