OpenDAS / Megatron-LM
Commit 840759b8
Authored Apr 03, 2020 by Neel Kant
Lint megatron/data/dataset_utils.py
Parent: 63262827
Changes: 1
Showing 1 changed file with 172 additions and 171 deletions

megatron/data/dataset_utils.py (+172, -171)

@@ -132,6 +132,7 @@ def truncate_segments(tokens_a, tokens_b, len_a, len_b, max_num_tokens, np_rng):
            tokens.pop()
    return True


def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id):
    """Merge segments A and B, add [CLS] and [SEP] and build tokentypes."""
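
The body of `create_tokens_and_tokentypes` falls outside this hunk. As a minimal sketch of what its docstring describes, assuming the usual BERT layout `[CLS] A [SEP] B [SEP]` with token type 0 for segment A and 1 for segment B, the merge could look like the following. The function name `merge_segments_sketch` and the example ids are hypothetical, not taken from the commit.

```python
def merge_segments_sketch(tokens_a, tokens_b, cls_id, sep_id):
    """Hypothetical sketch: build [CLS] A [SEP] B [SEP] and matching token types."""
    # [CLS], segment A and its closing [SEP] get token type 0.
    tokens = [cls_id] + list(tokens_a) + [sep_id]
    tokentypes = [0] * len(tokens)
    # Segment B and its closing [SEP] get token type 1.
    tokens += list(tokens_b) + [sep_id]
    tokentypes += [1] * (len(tokens_b) + 1)
    return tokens, tokentypes


# Example ids are made up for illustration.
tokens, tokentypes = merge_segments_sketch([11, 12], [21], cls_id=101, sep_id=102)
print(tokens)      # [101, 11, 12, 102, 21, 102]
print(tokentypes)  # [0, 0, 0, 0, 1, 1]
```

Both lists always have the same length, which is what the later padding step relies on.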

@@ -233,7 +234,7 @@ def create_masked_lm_predictions(tokens,
    for idx in range(len(cand_indexes)):
        ngram_index = []
        for n in ngrams:
            ngram_index.append(cand_indexes[idx:idx + n])
        ngram_indexes.append(ngram_index)

    np_rng.shuffle(ngram_indexes)
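
To make the loop in this hunk concrete, here is a small sketch that builds the same nested structure for toy inputs. The helper name `build_ngram_indexes` and the sample `cand_indexes` and `ngrams` values are assumptions for illustration; only the loop body and the `np_rng.shuffle` call mirror the diff.

```python
import numpy as np


def build_ngram_indexes(cand_indexes, ngrams):
    """For each start position, collect the candidate spans of every n-gram size."""
    ngram_indexes = []
    for idx in range(len(cand_indexes)):
        ngram_index = []
        for n in ngrams:
            # Slicing past the end of the list simply yields a shorter span.
            ngram_index.append(cand_indexes[idx:idx + n])
        ngram_indexes.append(ngram_index)
    return ngram_indexes


# Toy input: three candidates, the second covering two token positions.
cand_indexes = [[1], [2, 3], [4]]
ngram_indexes = build_ngram_indexes(cand_indexes, ngrams=[1, 2])
print(ngram_indexes[0])  # [[[1]], [[1], [2, 3]]]

# Shuffle the start positions in place with a seeded generator,
# as the hunk does with np_rng.
np_rng = np.random.RandomState(seed=1234)
np_rng.shuffle(ngram_indexes)
```

Each entry of `ngram_indexes` holds one span per n-gram size for a given start position, so shuffling the outer list randomizes which start positions are considered first.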

@@ -367,12 +368,12 @@ def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions,
    assert len(masked_positions) == len(masked_labels)

    # Tokens and token types.
    filler = [pad_id] * padding_length
    tokens_np = np.array(tokens + filler, dtype=np.int64)
    tokentypes_np = np.array(tokentypes + filler, dtype=np.int64)

    # Padding mask.
    padding_mask_np = np.array([1] * num_tokens + [0] * padding_length,
                               dtype=np.int64)

    # Lables and loss mask.
...
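
The padding logic in this hunk can be exercised in isolation. The sketch below assumes `num_tokens = len(tokens)` and `padding_length = max_seq_length - num_tokens`, which are defined outside the lines shown; the function name `pad_tokens_sketch` and the sample values are hypothetical.

```python
import numpy as np


def pad_tokens_sketch(tokens, tokentypes, pad_id, max_seq_length):
    """Hypothetical sketch: pad to max_seq_length and build a 0/1 padding mask."""
    # Assumed definitions; the diff only shows these names being used.
    num_tokens = len(tokens)
    padding_length = max_seq_length - num_tokens
    assert padding_length >= 0
    assert len(tokentypes) == num_tokens

    # Tokens and token types share the same filler, as in the hunk.
    filler = [pad_id] * padding_length
    tokens_np = np.array(tokens + filler, dtype=np.int64)
    tokentypes_np = np.array(tokentypes + filler, dtype=np.int64)

    # 1 marks a real token, 0 marks padding.
    padding_mask_np = np.array([1] * num_tokens + [0] * padding_length,
                               dtype=np.int64)
    return tokens_np, tokentypes_np, padding_mask_np


tokens_np, tokentypes_np, padding_mask_np = pad_tokens_sketch(
    tokens=[101, 11, 102], tokentypes=[0, 0, 0], pad_id=0, max_seq_length=5)
print(tokens_np.tolist())        # [101, 11, 102, 0, 0]
print(padding_mask_np.tolist())  # [1, 1, 1, 0, 0]
```

Because the token types are extended with the same `filler`, their padded positions carry `pad_id` rather than a dedicated type id; with a nonzero pad id they would not be 0.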