OpenDAS / Megatron-LM

Commit 72105ef0
Authored Feb 02, 2021 by Jared Casper

Fix bug in merge_mp_partitions for handling recent checkpoints.
Parent: c601d751
Showing 1 changed file with 5 additions and 0 deletions.

tools/merge_mp_partitions.py (+5, -0)
@@ -240,6 +240,11 @@ def main():
     tokenizer = rebuild_tokenizer(args)
     mpu.initialize.set_tensor_model_parallel_world_size(args.tensor_model_parallel_size)
     for rank in range(args.tensor_model_parallel_size):
+        # Reset these since load_checkpoint asserts they are 0, but we are loading
+        # multiple checkpoints in the same process and they get set each time
+        args.consumed_train_samples = 0
+        args.consumed_valid_samples = 0
+
         mpu.initialize.set_tensor_model_parallel_rank(rank)
         checkpoint_name, iteration = get_parallel_checkpoint_name(args.load)
         model_ = get_model(model_type)
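For context on why the reset is needed: merge_mp_partitions loads one checkpoint per tensor-parallel rank inside a single process, and the checkpoint loader asserts that the consumed-sample counters are zero before overwriting them from the saved state, so the second iteration of the loop would trip the assertion. Below is a minimal, self-contained sketch of that failure mode; the Args class, the simplified load_checkpoint, and the counter values are illustrative stand-ins, not Megatron-LM's real API.

    # Minimal sketch of the bug this commit fixes (stand-in code, not
    # Megatron-LM's actual checkpoint loader).

    class Args:
        def __init__(self):
            self.consumed_train_samples = 0
            self.consumed_valid_samples = 0

    def load_checkpoint(args, rank):
        # Mirrors the assertion in the real loader: the counters must be
        # zero before the checkpoint's saved values are applied.
        assert args.consumed_train_samples == 0
        assert args.consumed_valid_samples == 0
        # Loading sets the counters from the checkpoint (values illustrative).
        args.consumed_train_samples = 1_000_000
        args.consumed_valid_samples = 10_000

    args = Args()
    for rank in range(2):  # e.g. tensor_model_parallel_size == 2
        # Without these two resets, the first load leaves the counters
        # nonzero and the second rank's load fails the assertion.
        args.consumed_train_samples = 0
        args.consumed_valid_samples = 0
        load_checkpoint(args, rank)
    print("loaded all partitions without tripping the assertion")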