chenpangpang / transformers · Commit e08c01aa

Authored Aug 26, 2019 by LysandreJik

fix #1102

parent df9d6eff
Showing 2 changed files with 5 additions and 5 deletions:

- pytorch_transformers/modeling_roberta.py (+3, -3)
- pytorch_transformers/tokenization_roberta.py (+2, -2)
pytorch_transformers/modeling_roberta.py
...
@@ -98,15 +98,15 @@ ROBERTA_INPUTS_DOCSTRING = r"""
     Inputs:
         **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of input sequence tokens in the vocabulary.
-            To match pre-training, RoBERTa input sequence should be formatted with [CLS] and [SEP] tokens as follows:
+            To match pre-training, RoBERTa input sequence should be formatted with <s> and </s> tokens as follows:
             (a) For sequence pairs:
-                ``tokens:  [CLS] is this jack ##son ##ville ? [SEP][SEP] no it is not . [SEP]``
+                ``tokens:  <s> Is this Jacksonville ? </s> </s> No it is not . </s>``
             (b) For single sequences:
-                ``tokens:  [CLS] the dog is hairy . [SEP]``
+                ``tokens:  <s> the dog is hairy . </s>``
             Fully encoded sequences or sequence pairs can be obtained using the RobertaTokenizer.encode function with
             the ``add_special_tokens`` parameter set to ``True``.
...
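The corrected docstring formats can be sketched at the token-string level. This is an illustration of the documented layout only, not code from the commit; the helper names are hypothetical.

```python
# Illustrative sketch (not from the commit): build the two RoBERTa input
# layouts described in the docstring above, on pre-tokenized strings.

def format_single(tokens):
    # Single sequence: <s> X </s>
    return ["<s>"] + tokens + ["</s>"]

def format_pair(tokens_a, tokens_b):
    # Sequence pair: <s> A </s> </s> B </s>
    return ["<s>"] + tokens_a + ["</s>", "</s>"] + tokens_b + ["</s>"]

print(" ".join(format_pair(["Is", "this", "Jacksonville", "?"],
                           ["No", "it", "is", "not", "."])))
```

Running the sketch on the docstring's own example pair reproduces the line shown in the diff: `<s> Is this Jacksonville ? </s> </s> No it is not . </s>`.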
pytorch_transformers/tokenization_roberta.py
...
@@ -163,14 +163,14 @@ class RobertaTokenizer(PreTrainedTokenizer):
     def add_special_tokens_single_sentence(self, token_ids):
         """
         Adds special tokens to a sequence for sequence classification tasks.
-        A RoBERTa sequence has the following format: [CLS] X [SEP]
+        A RoBERTa sequence has the following format: <s> X </s>
         """
         return [self._convert_token_to_id(self.cls_token)] + token_ids + [self._convert_token_to_id(self.sep_token)]

     def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
         """
         Adds special tokens to a sequence pair for sequence classification tasks.
-        A RoBERTa sequence pair has the following format: [CLS] A [SEP][SEP] B [SEP]
+        A RoBERTa sequence pair has the following format: <s> A </s></s> B </s>
         """
         sep = [self._convert_token_to_id(self.sep_token)]
         cls = [self._convert_token_to_id(self.cls_token)]
...
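The two tokenizer methods touched by this hunk only change their docstrings; their behavior can be sketched standalone at the token-id level. This is a hedged sketch, not the library code: it assumes id 0 for `<s>` and id 2 for `</s>` (the values in the standard RoBERTa vocabulary) instead of calling `_convert_token_to_id`.

```python
# Standalone sketch of the wrapping behavior documented above.
# Assumption: <s> has id 0 and </s> has id 2, as in the standard
# RoBERTa vocabulary; the real methods look these ids up via
# self._convert_token_to_id.
CLS_ID = 0  # <s>, used as the classifier token
SEP_ID = 2  # </s>, used as the separator token

def add_special_tokens_single_sentence(token_ids):
    # <s> X </s>
    return [CLS_ID] + token_ids + [SEP_ID]

def add_special_tokens_sentences_pair(token_ids_0, token_ids_1):
    # <s> A </s></s> B </s>  -- note the doubled separator between A and B
    sep = [SEP_ID]
    cls = [CLS_ID]
    return cls + token_ids_0 + sep + sep + token_ids_1 + sep

print(add_special_tokens_single_sentence([10, 11]))
print(add_special_tokens_sentences_pair([10, 11], [20, 21]))
```

The doubled `</s></s>` between the two segments mirrors how RoBERTa was pre-trained, which is exactly why the docstrings were corrected away from BERT's `[CLS]`/`[SEP]` notation.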