    [fix] add and use get_process_group_cached (#678) · bde4bac5
    Min Xu authored
    * [fix] add and use get_process_group_cached
    
    - By default, FSDP now avoids creating too many process groups
    - Each extra process group costs GPU memory and init time
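
A minimal sketch of the caching idea, not the code from this commit: groups are keyed by their normalized rank set, so repeated requests for the same ranks return the same object instead of creating a new group. `ProcessGroupCache` and `factory` are hypothetical names used only for illustration; in a real setup the factory might wrap `torch.distributed.new_group`.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

RankKey = Optional[Tuple[int, ...]]


class ProcessGroupCache:
    """Hypothetical cache that creates at most one group per distinct rank set."""

    def __init__(self, factory: Callable[[RankKey], Any]) -> None:
        # `factory` is an assumption: any callable that builds a group
        # from a tuple of ranks (or None for the default, all-ranks group).
        self._factory = factory
        self._cache: Dict[RankKey, Any] = {}

    def get(self, ranks: Optional[Iterable[int]] = None) -> Any:
        # Normalize so that [1, 0] and [0, 1, 1] map to the same key.
        key: RankKey = None if ranks is None else tuple(sorted(set(ranks)))
        if key not in self._cache:
            # Only the first request for a rank set pays the creation cost.
            self._cache[key] = self._factory(key)
        return self._cache[key]
```

With a cache like this, many wrapped modules that ask for the same ranks would share one group, which is where the memory and init-time savings described above would come from.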
    
    * add changelog
    
    * lint
    
    * note on speed
    
    * add better assert output
    
    test seems to be flaky:
    https://app.circleci.com/pipelines/github/facebookresearch/fairscale/2957/workflows/383c9f9f-f1a5-461c-8c41-e2e28ece037b/jobs/26783/steps
    
    
    
    * update test reference memory values
    
    - With cached process groups, the memory reported by PyTorch also drops
    (due to bucket buffer memory for the reduction buffer)
    - The larger effect is on the memory shown by nvidia-smi, which PyTorch
    does not report and this test does not check
    
    * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
    
    * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
    
    * Update CHANGELOG.md
    
    * Update fairscale/utils/parallel.py
    
    * Update fairscale/utils/parallel.py
    
    * Update fairscale/utils/parallel.py
    
    * Update fairscale/utils/parallel.py
    
    * improved changelog
    
    * better handling of underscores in the md file
    Co-authored-by: Min Xu <min.xu@acm.org>
test_moe_layer.py 6.9 KB