OpenDAS / Megatron-LM
Commit 09d220cf
Authored Feb 01, 2021 by Jared Casper

Handle empty documents in preprocess_data.

Parent: 1b8e2891
Showing 1 changed file with 3 additions and 1 deletion.

tools/preprocess_data.py (+3 -1)
tools/preprocess_data.py @ 09d220cf

@@ -85,7 +85,7 @@ class Encoder(object):
                 sentence_ids = Encoder.tokenizer.tokenize(sentence)
                 if len(sentence_ids) > 0:
                     doc_ids.append(sentence_ids)
-            if self.args.append_eod:
+            if len(doc_ids) > 0 and self.args.append_eod:
                 doc_ids[-1].append(Encoder.tokenizer.eod)
             ids[key] = doc_ids
         return ids, len(json_line)
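
The effect of this guard can be sketched outside the repository. The snippet below is a minimal illustration, not the project's code: `encode_doc`, `toy_tokenize`, and the `EOD` value are hypothetical stand-ins for `Encoder.encode`, `Encoder.tokenizer.tokenize`, and `Encoder.tokenizer.eod`. It shows that when a document yields no non-empty sentences, `doc_ids` stays empty, so indexing `doc_ids[-1]` without the length check would raise `IndexError`.

```python
EOD = 0  # hypothetical end-of-document token id (assumption, not the real tokenizer value)


def toy_tokenize(sentence):
    # Stand-in tokenizer: one integer id (the word length) per whitespace-separated word.
    return [len(word) for word in sentence.split()]


def encode_doc(sentences, tokenize, append_eod=True):
    """Tokenize sentences, drop empty ones, and append EOD to the last sentence."""
    doc_ids = []
    for sentence in sentences:
        sentence_ids = tokenize(sentence)
        if len(sentence_ids) > 0:
            doc_ids.append(sentence_ids)
    # Guard mirrored from the commit: skip the EOD append when the document is empty,
    # since doc_ids[-1] on an empty list raises IndexError.
    if len(doc_ids) > 0 and append_eod:
        doc_ids[-1].append(EOD)
    return doc_ids


print(encode_doc(["hello world", "bye"], toy_tokenize))  # [[5, 5], [3, 0]]
print(encode_doc([], toy_tokenize))                      # []
print(encode_doc(["   "], toy_tokenize))                 # []
```

A whitespace-only document behaves like an empty one here because every sentence tokenizes to an empty list and is dropped before the EOD check.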
@@ -182,6 +182,8 @@ def main():
     for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
         total_bytes_processed += bytes_processed
         for key, sentences in doc.items():
+            if len(sentences) == 0:
+                continue
             for sentence in sentences:
                 builders[key].add_item(torch.IntTensor(sentence))
             builders[key].end_document()
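
The skip in `main()` can be illustrated the same way. This is a sketch under stated assumptions: `ToyBuilder` and `write_docs` are hypothetical, and the way `ToyBuilder` records document boundaries is an assumption made for illustration, not a description of the real builder. With the check in place, a document with no sentences adds no items and records no boundary.

```python
class ToyBuilder:
    """Hypothetical stand-in for a dataset builder; records where each document ends."""

    def __init__(self):
        self.items = []
        self.doc_boundaries = [0]

    def add_item(self, sentence):
        self.items.append(list(sentence))

    def end_document(self):
        self.doc_boundaries.append(len(self.items))


def write_docs(encoded_docs, builders):
    for doc in encoded_docs:
        for key, sentences in doc.items():
            if len(sentences) == 0:
                continue  # skip empty documents, mirroring the commit
            for sentence in sentences:
                builders[key].add_item(sentence)
            builders[key].end_document()


builders = {"text": ToyBuilder()}
write_docs([{"text": [[5, 5], [3, 0]]}, {"text": []}, {"text": [[7, 0]]}], builders)
print(builders["text"].doc_boundaries)  # [0, 2, 3]
```

Without the check, the toy builder would record a zero-length document between the two real ones.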