- 08 Aug, 2022 1 commit
-
-
Crutcher Dunnavant authored
-
- 12 Jun, 2022 1 commit
-
-
Crutcher Dunnavant authored
-
- 26 Jun, 2021 1 commit
-
-
Pavel Belevich authored
-
- 13 May, 2021 1 commit
-
-
Min Xu authored
* [fix] add and use get_process_group_cached - This commit makes FSDP avoid making too many process groups by default - Extra process group is bad for GPU memory and init time * add changelog * lint * note on speed * add better assert output test seems to be flaky: https://app.circleci.com/pipelines/github/facebookresearch/fairscale/2957/workflows/383c9f9f-f1a5-461c-8c41-e2e28ece037b/jobs/26783/steps * update test reference memory values - With cached process groups, the memory is reduced as reported by pytorch as well (due to bucket buffer memory for the reduction buffer) - The effect on memory is actually more on the SMI memory, which is not reported by pytorch and checked by this test. * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py * Update CHANGELOG.md * Update fairscale/utils/parallel.py * Update fairscale/utils/parallel.py * Update fairscale/utils/parallel.py * Update fairscale/utils/parallel.py * improved changelog * better handling of underscores in the md file Co-authored-by:
Min Xu <min.xu@acm.org>
-
- 05 May, 2021 1 commit
-
-
Min Xu authored
* [fix] add clear_autocast_cache flag - when training in AMP model with weight dtype32, FSDP may need to optionally clear the autocast cache to avoid GPU OOM - this flag is default false, automatically doing it is a future TODO - also added a verbose flag to make print(fsdp_model) a bit shorter - updated the memory test to cover those new code - added a couple of useful functions in parallel.py and testing.py * minor * address comments * format * improve the test Co-authored-by:Min Xu <min.xu@acm.org>
-
- 28 Apr, 2021 1 commit
-
-
Min Xu authored
* [feat] save memory by using bucket buffer only in backward - this fixes bug #627 - added documentation to clarify the buffer's cost and speed/memory tradeoff - added setup/teardown calls so that the buffer is only allocated during the backward pass, saving more memory for forward and stepping so that they can be used for things like activations. - added a unit test that assert the memory is in range. Comparing with DDP: 1. buffer size scales with # of FSDP not model size 2. buffer is only allocated during backward 3. buffer is used for small tensors only to reduce overhead 4. overlapping of compute-reduction is very different * add PR number to changelog * filled in with memory number on 1.9 * addressed comments * update comments * fix for 1.6 * add a todo Co-authored-by:Min Xu <min.xu@acm.org>
-