transformers / commit cf0755aa
Unverified commit, authored Jul 21, 2021 by Stas Bekman and committed via GitHub on Jul 21, 2021.
[debug] DebugUnderflowOverflow doesn't work with DP (#12816)
Parent: ac3cb660
Showing 3 changed files with 15 additions and 4 deletions (+15 / -4):

  docs/source/debugging.rst          +5 / -1
  src/transformers/trainer.py        +8 / -1
  src/transformers/trainer_utils.py  +2 / -2
docs/source/debugging.rst
@@ -24,7 +24,11 @@ Underflow and Overflow Detection
 .. note::
 
-    This feature can be used with any ``nn.Module``-based model
+    For multi-GPU training it requires DDP (``torch.distributed.launch``).
+
+.. note::
+
+    This feature can be used with any ``nn.Module``-based model.
 
 If you start getting ``loss=NaN`` or the model inhibits some other abnormal behavior due to ``inf`` or ``nan`` in
 activations or weights one needs to discover where the first underflow or overflow happens and what led to it. Luckily
...
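
To make the updated note concrete, below is a hedged sketch of the standalone usage it refers to: attaching the detector to a plain nn.Module outside the Trainer. The toy model and input shapes are invented for illustration and are not part of this commit; only the import and the one-line registration follow the documented API.

# A minimal sketch, assuming a toy model; DebugUnderflowOverflow can wrap any
# nn.Module-based model, as the note says.
import torch
from torch import nn
from transformers.debug_utils import DebugUnderflowOverflow

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))

# Registers forward hooks on every submodule; if an inf/nan shows up in
# activations or weights, the run is aborted with a report of the module
# calls (frames) that led up to it.
debug_overflow = DebugUnderflowOverflow(model)

out = model(torch.randn(4, 16))  # well-behaved inputs: no report, run continues

Under DDP each process owns its own single replica, so hooks registered this way stay attached to the modules that actually run; under DP the per-forward replication described in the trainer.py change below breaks that assumption.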
src/transformers/trainer.py
@@ -1114,6 +1114,13 @@ class Trainer:
             num_train_samples = args.max_steps * total_train_batch_size
 
         if DebugOption.UNDERFLOW_OVERFLOW in self.args.debug:
-            debug_overflow = DebugUnderflowOverflow(self.model)  # noqa
+            if self.args.n_gpu > 1:
+                # nn.DataParallel(model) replicates the model, creating new variables and module
+                # references registered here no longer work on other gpus, breaking the module
+                raise ValueError(
+                    "Currently --debug underflow_overflow is not supported under DP. Please use DDP (torch.distributed.launch)."
+                )
+            else:
+                debug_overflow = DebugUnderflowOverflow(self.model)  # noqa
 
         delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
...
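
Below is a hedged sketch of how a user would hit the new guard from the Trainer API; the toy model, dataset, and argument values are placeholders invented for illustration and are not part of the commit.

import torch
from torch import nn
from transformers import Trainer, TrainingArguments

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x=None, labels=None):
        logits = self.linear(x)
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        return {"loss": loss, "logits": logits}

# eight tiny samples; the default collator stacks the tensors into batches
train_dataset = [{"x": torch.randn(4), "labels": torch.tensor(0)} for _ in range(8)]

args = TrainingArguments(
    output_dir="out",
    debug="underflow_overflow",  # parsed into DebugOption.UNDERFLOW_OVERFLOW
    num_train_epochs=1,
)
trainer = Trainer(model=ToyModel(), args=args, train_dataset=train_dataset)

# On a single-process multi-GPU machine (args.n_gpu > 1, i.e. nn.DataParallel),
# train() now fails fast with the ValueError above instead of silently missing
# activations on the extra GPUs. On CPU, a single GPU, or one-GPU-per-process
# DDP it behaves as before.
trainer.train()

The error message points at DDP because under torch.distributed.launch each process sees exactly one GPU, so self.args.n_gpu is 1 and the detector's hooks stay on the module that actually executes.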
src/transformers/trainer_utils.py
@@ -420,7 +420,7 @@ class TrainerMemoryTracker:
         self.cur_stage = None
 
     def update_metrics(self, stage, metrics):
-        """stop tracking for the passed stage"""
+        """updates the metrics"""
         if self.skip_memory_metrics:
             return
...
@@ -442,7 +442,7 @@ class TrainerMemoryTracker:
             metrics[f"{stage}_mem_gpu_{t}_delta"] = self.gpu[stage][t]
 
     def stop_and_update_metrics(self, metrics=None):
-        """combine stop + update in one call for simpler code"""
+        """combine stop and metrics update in one call for simpler code"""
         if self.skip_memory_metrics:
             return
...
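
The corrected docstrings describe the start / stop_and_update_metrics call pattern; below is a hedged sketch of that pattern, with an invented TinyLoop class standing in for Trainer and relying on the tracker deriving the stage name from the calling method (train/evaluate/predict), as it does inside the real Trainer.

from transformers.trainer_utils import TrainerMemoryTracker

class TinyLoop:
    """Invented stand-in for Trainer, only to show the tracker call pattern."""

    def __init__(self):
        # the tracker disables itself if memory metrics are skipped or psutil is missing
        self._memory_tracker = TrainerMemoryTracker()

    def train(self):
        self._memory_tracker.start()          # stage "train", derived from this method's name
        _ = [i * i for i in range(100_000)]   # stand-in for the actual training work
        metrics = {"train_loss": 0.0}
        # one call that stops tracking AND folds the memory deltas into `metrics`,
        # instead of separate stop() + update_metrics(stage, metrics) calls
        self._memory_tracker.stop_and_update_metrics(metrics)
        return metrics

print(TinyLoop().train())  # when tracking is active, train_mem_*_delta keys are added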