Commit 7eb083c2, authored by Stas Bekman, committed by GitHub

document the requirement to call for all ranks (#801)

parent 490e6f7c
......@@ -1471,6 +1471,11 @@ class DeepSpeedEngine(Module):
used if not provided. Tag name must be the same across all ranks.
client_state: Optional. State dictionary used for saving required training states in the client code.
save_latest: Optional. Save a file 'latest' pointing to the latest saved checkpoint.
Important: all processes must call this method, not just the process with rank 0, because
each process needs to save its own master weights and scheduler+optimizer states. If the
method is called only on the process with rank 0, it will hang while waiting to synchronize
with the other processes.
"""
# This is to make sure the checkpoint names are created without collision
......
......@@ -127,6 +127,9 @@ accepts a client state dictionary `client_sd` for saving. These items can be
retrieved from `load_checkpoint` as a return argument. In the example above,
the `step` value is stored as part of the `client_sd`.
Important: all processes must call this method, not just the process with rank 0, because
each process needs to save its own master weights and scheduler+optimizer states. If the method is
called only on the process with rank 0, it will hang while waiting to synchronize with the other processes.
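
The hang described above can be illustrated with a small standard-library simulation. This is only a sketch and not DeepSpeed code: `WORLD_SIZE`, `fake_save_checkpoint`, and `run` are hypothetical names, threads stand in for ranks, and a `threading.Barrier` with a short timeout stands in for the cross-process synchronization, so a missing rank shows up as a failed wait instead of an indefinite hang.

```python
import threading

WORLD_SIZE = 4  # hypothetical number of ranks in this simulation


def fake_save_checkpoint(rank, barrier, saved, timeout=0.5):
    """Stand-in for a collective save: record this rank's shard, then synchronize.

    Returns True if all ranks reached the barrier, False if the wait timed out
    (a real collective call would block here instead of timing out).
    """
    saved.append(rank)  # each rank saves its own weights/optimizer shard
    try:
        barrier.wait(timeout=timeout)  # every rank must arrive here
        return True
    except threading.BrokenBarrierError:
        return False


def run(calling_ranks):
    """Run the fake save on the given ranks only and report what happened."""
    barrier = threading.Barrier(WORLD_SIZE)
    saved = []
    results = {}

    def worker(rank):
        results[rank] = fake_save_checkpoint(rank, barrier, saved)

    threads = [threading.Thread(target=worker, args=(r,)) for r in calling_ranks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, sorted(saved)


if __name__ == "__main__":
    # All ranks call the method: synchronization succeeds, every shard is saved.
    print(run(range(WORLD_SIZE)))
    # Only rank 0 calls it: the wait never completes and only one shard exists.
    print(run([0]))
```

In a real training script the practical consequence is that the checkpoint-saving call should not be placed behind a guard such as `if rank == 0:`; rank-0-only guards are appropriate for logging, not for this method.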
## DeepSpeed Configuration
DeepSpeed features can be enabled, disabled, or configured using a config JSON
......