OpenDAS / Megatron-LM · Commits

Commit 7fccd6a1, authored Jun 01, 2022 by Lawrence McAfee

testing fix w/ clip_grads.

parent de593298
Showing 1 changed file with 16 additions and 4 deletions

megatron/optimizer/clip_grads.py (+16, -4)
@@ -124,7 +124,10 @@ def count_zeros_fp32(parameters, model_parallel_group):
     # - grad should not be none
     # - parameter should not be shared
     # - should not be a replica due to tensor model parallelism
-    total_num_zeros = 0.0
+    # >>>
+    # total_num_zeros = 0.0
+    total_num_zeros = torch.cuda.FloatTensor([0.0])
+    # <<<
     for param in parameters:
         grad_not_none = param.grad is not None
         is_not_shared = param_is_not_shared(param)
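This first hunk carries the actual fix, bracketed by the author's `# >>>`/`# <<<` markers for a temporary edit. `torch.distributed.all_reduce` reduces a tensor in place, and `num_zeros` is itself a tensor, so the accumulation loop normally promotes `total_num_zeros` to a tensor anyway; but on a rank where no parameter passes the filters, the loop body never runs, the old float `0.0` reaches the collective, and the call fails, which appears to be the failure this commit is probing. Seeding the count as a one-element CUDA tensor removes that path. Below is a minimal sketch of the tensor requirement; it is mine, not part of the commit, and uses a single-process gloo group with a CPU tensor so it runs without GPUs, where the commit uses torch.cuda.FloatTensor.

    # Minimal sketch (not from the commit): all_reduce mutates a tensor
    # in place, so the running count must already be a tensor when the
    # collective runs.
    import os
    import torch
    import torch.distributed as dist

    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    total_num_zeros = torch.FloatTensor([0.0])   # tensor from the start
    dist.all_reduce(total_num_zeros, op=dist.ReduceOp.SUM)
    print(total_num_zeros.item())                # 0.0 on a lone rank

    # A plain Python float is rejected by the collective:
    # dist.all_reduce(0.0)   # raises: all_reduce expects a Tensor

    dist.destroy_process_group()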
@@ -135,9 +138,18 @@ def count_zeros_fp32(parameters, model_parallel_group):
             total_num_zeros = num_zeros + total_num_zeros
 
     # Sum across all model-parallel GPUs.
-    torch.distributed.all_reduce(total_num_zeros,
-                                 op=torch.distributed.ReduceOp.SUM,
-                                 group=model_parallel_group)
+    # >>>
+    try:
+        torch.distributed.all_reduce(total_num_zeros,
+                                     op=torch.distributed.ReduceOp.SUM,
+                                     group=model_parallel_group)
+    except:
+        from lutil import pax
+        pax({
+            "total_num_zeros" : total_num_zeros,
+            "parameters" : parameters,
+        })
+    # <<<
     total_num_zeros = total_num_zeros.item()
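The second hunk is debugging scaffolding rather than part of the fix: the bare `except` intercepts a failing collective and hands the operands to `pax` from `lutil`, which looks like the author's external pretty-print-and-halt helper and is not part of Megatron-LM. A sketch of the same catch-and-dump pattern without that dependency follows; the function name and the printed fields are illustrative, not Megatron-LM API.

    # Illustrative sketch, not Megatron-LM code: reproduce the commit's
    # catch-and-dump pattern without the external lutil.pax helper.
    import torch
    import torch.distributed as dist

    def all_reduce_with_dump(total_num_zeros, parameters, group):
        """Sum total_num_zeros across the group; dump state on failure."""
        try:
            dist.all_reduce(total_num_zeros,
                            op=dist.ReduceOp.SUM,
                            group=group)
        except Exception:
            # Print the operands that made the collective fail, then
            # re-raise so the failure is still visible to the trainer.
            print({
                "total_num_zeros": total_num_zeros,
                "parameters": [tuple(p.shape) for p in parameters],
            })
            raise

Catching `Exception` and re-raising keeps the failure visible once the state has been inspected; outside a throwaway debugging commit like this one, that is usually preferable to the diff's bare `except:`.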