Add features to distributed Adam for Megatron support (#1414)
* Add features to distributed Adam for Megatron support

  Support gradient clipping, gradient scaling, FP32 gradient accumulation, and multiple dtypes and devices.

* Restore closure arg to distributed Adam

  Review suggestion from @crcrpar
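A minimal sketch (not part of this commit) of how the listed features are typically exercised: gradient scaling, gradient clipping, and a closure-style step. It uses a stock `torch.optim.Adam` as a stand-in, since the exact constructor of apex's distributed Adam is not shown here, and assumes a CUDA device for autocast/GradScaler.

```python
import torch

device = "cuda"
model = torch.nn.Linear(16, 16).to(device)          # placeholder model
data = torch.randn(8, 16, device=device)
target = torch.randn(8, 16, device=device)
loss_fn = torch.nn.MSELoss()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # stand-in optimizer
scaler = torch.cuda.amp.GradScaler()                # gradient scaling

# Scaled backward pass with gradient clipping applied to the unscaled grads.
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = loss_fn(model(data), target)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)                          # clip true, unscaled grads
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()

# Closure-style step, the form restored per the review suggestion: the
# optimizer re-evaluates the loss via the closure before applying the update.
def closure():
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    return loss

optimizer.step(closure)
```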