Improvements in distributed Adam optimizer for Megatron (#1432)
* Improvements in distributed Adam optimizer for Megatron
  - Add option to allocate gradient buckets out of one large buffer.
  - Add option to initialize params in user-provided order.
  - Perform communication when saving optimizer state.
  - Support param sync with any dtype.
* Style fixes in distributed Adam helper classes (review suggestions from @crcrpar)
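The first item amounts to carving per-bucket gradient views out of a single contiguous allocation instead of allocating each bucket separately, so buckets stay adjacent in memory and parameters' gradients alias slices of the same buffer. A minimal sketch of that idea is below; the function name, the flat-packing scheme, and the `bucket_size` handling are illustrative assumptions, not the optimizer's actual API.

```python
import torch


def build_grad_buckets(params, bucket_size, dtype=torch.float32, device="cpu"):
    """Carve per-bucket gradient views out of one contiguous buffer.

    Hypothetical helper for illustration only; the real bucketing logic
    in the distributed optimizer is more involved.
    """
    numels = [p.numel() for p in params]
    total = sum(numels)

    # Round the buffer up to a whole number of equally sized buckets so
    # every bucket view has the same length (convenient for collectives).
    num_buckets = -(-total // bucket_size)  # ceiling division
    grad_buffer = torch.zeros(num_buckets * bucket_size, dtype=dtype, device=device)

    # Each bucket is just a view into the big buffer, so communication can
    # walk the buckets without any extra copies or separate allocations.
    buckets = [
        grad_buffer[i * bucket_size:(i + 1) * bucket_size]
        for i in range(num_buckets)
    ]

    # Point each param's gradient slot at its slice of the same buffer.
    grads = []
    offset = 0
    for p, n in zip(params, numels):
        grads.append(grad_buffer[offset:offset + n].view_as(p))
        offset += n

    return grad_buffer, buckets, grads
```

In this sketch, accumulating into `grads[i]` during the backward pass fills the shared buffer in place, and each bucket view can be handed to a collective (e.g. reduce-scatter) as soon as its region is ready.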