chenpangpang / transformers · Commit 9ce36e3e
Authored Aug 14, 2019 by samvelyan
Parent: aaedfc35

    Re-implemented tokenize() iteratively in PreTrainedTokenizer.

1 changed file with 36 additions and 6 deletions (+36 -6)
pytorch_transformers/tokenization_utils.py (+36 -6)

```diff
@@ -472,15 +472,45 @@ class PreTrainedTokenizer(object):
             Take care of added tokens.
         """
+        def split_on_token(tok, text):
+            result = []
+            split_text = text.split(tok)
+            for i, sub_text in enumerate(split_text):
+                sub_text = sub_text.strip()
+                if i == 0 and not sub_text:
+                    result += [tok]
+                elif i == len(split_text) - 1:
+                    if sub_text:
+                        result += [sub_text]
+                    else:
+                        pass
+                else:
+                    if sub_text:
+                        result += [sub_text]
+                    result += [tok]
+            return result
+
         def split_on_tokens(tok_list, text):
             if not text:
                 return []
             if not tok_list:
                 return self._tokenize(text, **kwargs)
-            tok = tok_list[0]
-            split_text = text.split(tok)
-            return sum((split_on_tokens(tok_list[1:], sub_text.strip()) + [tok] \
-                        for sub_text in split_text), [])[:-1]
+
+            tokenized_text = []
+            text_list = [text]
+            for tok in tok_list:
+                tokenized_text = []
+                for sub_text in text_list:
+                    if sub_text not in self.added_tokens_encoder \
+                            and sub_text not in self.all_special_tokens:
+                        tokenized_text += split_on_token(tok, sub_text)
+                    else:
+                        tokenized_text += [sub_text]
+                text_list = tokenized_text
+
+            return sum((self._tokenize(token, **kwargs) if token not \
+                in self.added_tokens_encoder and token not in self.all_special_tokens \
+                else [token] for token in tokenized_text), [])

         added_tokens = list(self.added_tokens_encoder.keys()) + self.all_special_tokens
         tokenized_text = split_on_tokens(added_tokens, text)
```
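To illustrate what the new iterative code path does, here is a minimal standalone sketch of the two helpers outside the class. It is not the commit's code verbatim: `basic_tokenize` is a whitespace stand-in for `PreTrainedTokenizer._tokenize`, the membership checks against `self.added_tokens_encoder` and `self.all_special_tokens` are collapsed into a single `tok_list` lookup, and the `[CLS]` / `[SEP]` tokens and input string are hypothetical.

```python
def basic_tokenize(text):
    # Stand-in for the model-specific _tokenize(): plain whitespace split.
    return text.split()

def split_on_token(tok, text):
    # Split `text` on one added/special token, keeping the token itself
    # as a separate element and dropping empty fragments.
    result = []
    split_text = text.split(tok)
    for i, sub_text in enumerate(split_text):
        sub_text = sub_text.strip()
        if i == 0 and not sub_text:
            result += [tok]
        elif i == len(split_text) - 1:
            if sub_text:
                result += [sub_text]
        else:
            if sub_text:
                result += [sub_text]
            result += [tok]
    return result

def split_on_tokens(tok_list, text):
    # Iteratively split on every added/special token, then run the
    # underlying tokenizer only on the remaining plain-text fragments.
    if not text:
        return []
    if not tok_list:
        return basic_tokenize(text)

    text_list = [text]
    tokenized_text = []
    for tok in tok_list:
        tokenized_text = []
        for sub_text in text_list:
            # In the sketch, tok_list plays the role of
            # added_tokens_encoder + all_special_tokens.
            if sub_text not in tok_list:
                tokenized_text += split_on_token(tok, sub_text)
            else:
                tokenized_text += [sub_text]
        text_list = tokenized_text

    return sum((basic_tokenize(token) if token not in tok_list else [token]
                for token in tokenized_text), [])

special_tokens = ["[CLS]", "[SEP]"]  # hypothetical special tokens
print(split_on_tokens(special_tokens, "[CLS] Hello world [SEP]"))
# ['[CLS]', 'Hello', 'world', '[SEP]']
```

As in the commit, each special or added token is split off as an atomic element and never passed to the sub-word tokenizer, while the plain-text fragments in between are tokenized normally; the loop over `tok_list` replaces the recursion over `tok_list[1:]` in the previous version.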