Commits · 5c23bee8c2190c6b99aba88bb7bc8814d3e51710 · OpenDAS / d2go

23 Jun, 2023 1 commit

remove AC prefix from EMA to make it compatible with loading · 5c23bee8

Anthony Chen authored Jun 22, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/578

# Problem:
d2go EMA uses `named_parameters()` to traverse model state and save EMA checkpoints, but uses `state_dict()` to save model checkpoints. This is brittle because `named_parameters()` and `state_dict()` are separate Python APIs and can return different results.
With Activation Checkpointing (AC), the AC wrapper should not affect checkpoint names, so PyTorch overrides `state_dict()` to remove the "_checkpoint_wrapped_module" prefix from each fully qualified name (FQN). `named_parameters()` has no such support, so the prefix remains. If we change the AC wrapping strategy (very common for optimization), we can no longer load the previous EMA state back into the model. The same problem also occurs with FSDP.

# Short-term hack:
This diff adds a short-term hack that manually removes the AC prefix in EMA. `IGNORED_FQN_PREFIX` can be expanded to support more use cases.
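
A minimal sketch of the prefix-removal idea, not the actual d2go code. Only the name `IGNORED_FQN_PREFIX` and the "_checkpoint_wrapped_module" string come from this summary; the tuple shape, the trailing dot, the helper `normalize_fqn`, and the example parameter names are assumptions made for illustration.

```python
# Hypothetical sketch; the real d2go implementation may differ.
# Assumed shape: a tuple of wrapper segments, each including its trailing dot.
IGNORED_FQN_PREFIX = ("_checkpoint_wrapped_module.",)


def normalize_fqn(fqn: str) -> str:
    """Drop every ignored wrapper segment from a fully qualified name."""
    for prefix in IGNORED_FQN_PREFIX:
        fqn = fqn.replace(prefix, "")
    return fqn


# Illustrative names, as named_parameters() might report them under two
# different AC wrapping strategies (whole block wrapped vs. single layer wrapped).
wrapped_per_block = ["backbone._checkpoint_wrapped_module.conv1.weight", "head.fc.bias"]
wrapped_per_layer = ["backbone.conv1._checkpoint_wrapped_module.weight", "head.fc.bias"]

# After normalization both strategies produce the same EMA keys,
# so EMA state saved under one can be matched against the other.
keys_per_block = [normalize_fqn(name) for name in wrapped_per_block]
keys_per_layer = [normalize_fqn(name) for name in wrapped_per_layer]
```

Under these assumptions, supporting another wrapper would only mean appending its segment to the tuple.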

Reviewed By: wat3rBro

Differential Revision: D46815031

fbshipit-source-id: 29b6ea444ed2ef90b8741fccdcb2b62625933e7f


14 Jun, 2023 1 commit

Enable activation checkpointing · 0389f4ee

Anthony Chen authored Jun 14, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/573

Enable Activation Checkpointing from PyTorch Distributed in d2go.

Reviewed By: rohan-varma

Differential Revision: D45681009

fbshipit-source-id: c03f27af61e0374b9e5991d82070edbe41edde6d
