Commit e68c3756 (unverified), authored by Sylvain Gugger, committed by GitHub

Allow training to resume even if RNG states are not properly loaded (#14994)

* Allow training to resume even if RNG states are not properly loaded

* Proper f-string
parent 08cb5718
@@ -1553,7 +1553,13 @@ class Trainer:
             if self.args.local_rank != -1:
                 torch.cuda.random.set_rng_state(checkpoint_rng_state["cuda"])
             else:
-                torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
+                try:
+                    torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
+                except Exception as e:
+                    logger.info(
+                        f"Didn't manage to set back the RNG states of the GPU because of the following error:\n {e}"
+                        "\nThis won't yield the same results as if the training had not been interrupted."
+                    )
         if is_torch_tpu_available():
             xm.set_rng_state(checkpoint_rng_state["xla"])
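
The pattern in this change (attempt to restore the RNG state, and on failure log the error and carry on instead of aborting the resume) can be sketched outside of `Trainer`. This is only an illustrative sketch, not the library's code: `try_restore_rng` is a hypothetical helper, and Python's standard `random` module stands in for the CUDA RNG so the example runs without a GPU.

```python
import logging
import random

logger = logging.getLogger(__name__)


def try_restore_rng(set_state, state, name="RNG"):
    """Hypothetical helper: call set_state(state); on failure, log and return False."""
    try:
        set_state(state)
    except Exception as e:
        # Same idea as the diff: report the problem, but do not stop the resume.
        logger.info(
            f"Didn't manage to set back the {name} state because of the following error:\n {e}"
            "\nThis won't yield the same results as if the training had not been interrupted."
        )
        return False
    return True


# A valid saved state is restored, so the next draw repeats the earlier one.
saved = random.getstate()
first = random.random()
restored = try_restore_rng(random.setstate, saved, name="Python RNG")
replayed = random.random()

# An invalid state is rejected by random.setstate; the helper logs and continues.
failed = try_restore_rng(random.setstate, ("not-a-valid-state",), name="Python RNG")
```

Returning a boolean is a choice made for this sketch so that a caller can tell whether reproducibility was preserved; the change above only logs the failure.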