    Improvements in distributed Adam optimizer for Megatron (#1432) · 2e025ab5
    Tim Moon authored
    * Improvements in distributed Adam optimizer for Megatron
    
    - Add option to allocate gradient buckets out of one large buffer (see the sketch below).
    - Add option to initialize params in user-provided order.
    - Perform communication when saving optimizer state.
    - Support param sync with any dtype.
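    As a rough illustration of the single-buffer bucketing and the dtype-agnostic param sync, here is a minimal sketch. The names allocate_grad_buckets, bucket_cap, and sync_param are hypothetical, chosen for illustration only, and are not Apex's actual API.

        import torch

        # Hypothetical helpers for illustration; not the Apex implementation.

        def allocate_grad_buckets(param_sizes, bucket_cap, dtype=torch.float32):
            """Carve per-param gradient views out of one contiguous buffer.

            One large allocation avoids many small ones and keeps each
            bucket's memory contiguous for collectives such as
            reduce-scatter.
            """
            buffer = torch.zeros(sum(param_sizes), dtype=dtype)  # one big buffer
            buckets, current, filled, offset = [], [], 0, 0
            for size in param_sizes:
                current.append(buffer[offset:offset + size])  # view, no copy
                offset += size
                filled += size
                if filled >= bucket_cap:  # close the bucket once it is full
                    buckets.append(current)
                    current, filled = [], 0
            if current:
                buckets.append(current)
            return buffer, buckets

        def sync_param(model_param, main_param):
            """Copy an optimizer main param into a model param of any dtype.

            Tensor.copy_ casts between dtypes, so the same sync path
            serves fp16, bf16, and fp32 model params alike.
            """
            with torch.no_grad():
                model_param.copy_(main_param)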
    
    * Style fixes in distributed Adam helper classes
    
    Apply review suggestions from @crcrpar.