[feat] optimizer state scaling (#44)
Implement scaling of the optimizer state when training in pure fp16, so that small moment values do not underflow. Update the benchmark to use pure fp16. Modify the state_dict methods to store and load the optimizer state scale.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
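
The gist, as a minimal sketch rather than the actual code from this commit: keep the optimizer's moment buffers pre-multiplied by a fixed scale so their tiny values stay representable in fp16, divide the scale back out when applying the parameter update, and carry the scale through state_dict / load_state_dict. The ScaledStateSGD class, the state_scale argument, and the 2**15 default below are illustrative assumptions, not names from this change.

```python
import torch


class ScaledStateSGD(torch.optim.Optimizer):
    """SGD with momentum whose momentum buffer is stored pre-multiplied by
    ``state_scale`` so that small values remain representable in fp16."""

    def __init__(self, params, lr=0.01, momentum=0.9, state_scale=2.0 ** 15):
        super().__init__(params, dict(lr=lr, momentum=momentum))
        self.state_scale = state_scale

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        scale = self.state_scale
        for group in self.param_groups:
            lr, momentum = group["lr"], group["momentum"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum_buffer" not in state:
                    # The buffer is kept scaled from its first write.
                    state["momentum_buffer"] = p.grad.mul(scale)
                else:
                    # scaled_buf = momentum * scaled_buf + scale * grad
                    state["momentum_buffer"].mul_(momentum).add_(p.grad, alpha=scale)
                # Divide the scale back out when updating the parameters.
                p.add_(state["momentum_buffer"], alpha=-lr / scale)
        return loss

    def state_dict(self):
        # Persist the scale alongside the (scaled) state so a checkpoint
        # reloads consistently.
        sd = super().state_dict()
        sd["state_scale"] = self.state_scale
        return sd

    def load_state_dict(self, state_dict):
        self.state_scale = state_dict.get("state_scale", self.state_scale)
        super().load_state_dict(
            {k: v for k, v in state_dict.items() if k != "state_scale"}
        )
```

With a model converted via model.half(), momentum entries can fall below fp16's smallest normal value (about 6.1e-5) and flush toward zero; keeping the buffer scaled up avoids that, while storing the scale in the checkpoint lets a reload continue with the same scaled representation.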