- 27 Oct, 2021 1 commit
  anj-s authored

- 12 Aug, 2021 1 commit
  Min Xu authored
  * minor: changelog and pre-commit
  * addressed comment
  * update the release doc
  Co-authored-by: Min Xu <min.xu.public@gmail.com>

- 28 Jun, 2021 1 commit
  anj-s authored

- 14 May, 2021 1 commit
  Shruti Bhosale authored
  * fix saving and loading checkpoints with use_sharded_state=True
  * mypy fix
  * better fix of the infinite recursion:
    - we need to specifically call FSDP.state_dict from its local state_dict
    - added a unit test that fails without the fix and works with the fix
    - fixed mypy for the overloaded functions
  * make cpu-only fsdp work for state_dict at least
  Co-authored-by: Min Xu <min.xu@acm.org>
  Co-authored-by: Min Xu <min.xu.public@gmail.com>
  Co-authored-by: Min Xu <m1n@fb.com>

- 11 May, 2021 1 commit
  Min Xu authored
  * [fix] FSDP forward pass overlap between compute and all-gather
    - many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
    - a new unit test is added to check for this and to ensure we detect issues with overlapping and CPU/GPU blocking wait calls
  * fix
  * fix
  * fix
  * better assertion outputs
  * fix format and tune all_gather mb for CI
  * more tuning with non_flatten
  * undo an accidental change
  * tuning all gather mb and del model
  * Update + fix overlapping test to use patched all_gather w/ delay (#672)
  * fixing get_cycles_per_ms
  * add get_smi_memory
  * update the docstring
  Co-authored-by: Min Xu <min.xu@acm.org>
  Co-authored-by: Myle Ott <myleott@fb.com>

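The overlap this commit restores can be illustrated with a toy prefetching loop: issue layer i+1's parameter all-gather asynchronously while layer i's compute runs, so communication hides behind compute. This is a sketch of the idea only; threads stand in for CUDA streams, the sleeps stand in for kernels, and none of the names below are the fairscale implementation.

```python
# Toy illustration of compute/all-gather overlap (not FSDP internals).
from concurrent.futures import ThreadPoolExecutor
import time

def fake_all_gather(layer):
    time.sleep(0.05)            # pretend network latency
    return f"params[{layer}]"

def fake_compute(layer, params):
    time.sleep(0.05)            # pretend GPU kernel time
    return f"act[{layer}] via {params}"

layers = [0, 1, 2, 3]
with ThreadPoolExecutor(max_workers=1) as comm:
    start = time.perf_counter()
    pending = comm.submit(fake_all_gather, layers[0])  # prefetch first layer
    outs = []
    for i, layer in enumerate(layers):
        params = pending.result()                      # wait for this layer's params
        if i + 1 < len(layers):
            pending = comm.submit(fake_all_gather, layers[i + 1])  # prefetch next
        outs.append(fake_compute(layer, params))       # overlaps with the prefetch
    overlapped = time.perf_counter() - start

# Serialized this would take ~0.4s (8 x 0.05s); with overlap the gathers hide
# behind compute and the wall time approaches compute-only (~0.25s).
print(len(outs), outs[0])
```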
- 02 Apr, 2021 1 commit
  msbaines authored
  NCCL all_to_all is now supported in PyTorch (since v1.8.0). Fixes: #548

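For reference, the collective mentioned above moves data like a transpose across ranks: the chunk that rank i sends to rank j becomes chunk i of rank j's output. A minimal pure-Python sketch of those semantics (the data movement only, not the NCCL kernel or the `torch.distributed` API):

```python
# all_to_all semantics in miniature: transpose the matrix of chunks across ranks.

def all_to_all(inputs_per_rank):
    """inputs_per_rank[i][j] is the chunk rank i sends to rank j."""
    world = len(inputs_per_rank)
    return [[inputs_per_rank[src][dst] for src in range(world)]
            for dst in range(world)]

sends = [["a0", "a1"], ["b0", "b1"]]   # rank 0 sends a0 to rank 0, a1 to rank 1
print(all_to_all(sends))               # [['a0', 'b0'], ['a1', 'b1']]
```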
- 23 Feb, 2021 1 commit
  Myle Ott authored
  Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper. Compared to PyTorch DDP:
  * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
  * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
  * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
    - all-gather parameters at the start of the forward pass and again at the start of the backward pass
    - reduce-scatter grads at the end of the backward pass
  Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
  Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

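The sharding scheme described above can be shown with a toy single-process simulation: each of N "workers" owns only a 1/N shard of the flat parameters, all-gathers the full set before compute, and reduce-scatters gradients so each worker updates only its own shard (and hence only its shard of optimizer state). This is a sketch of the ZeRO-style data movement under illustrative names, not the fairscale implementation.

```python
# Toy simulation of ZeRO-style parameter sharding across 2 "workers".

def shard(flat, world_size):
    """Split a flat parameter list into world_size equal shards."""
    n = len(flat) // world_size
    return [flat[i * n:(i + 1) * n] for i in range(world_size)]

def all_gather(shards):
    """Reconstruct the full flat parameter list from all shards."""
    return [p for s in shards for p in s]

def reduce_scatter(grads_per_worker, world_size):
    """Sum gradients across workers, then keep only each worker's shard."""
    summed = [sum(g) for g in zip(*grads_per_worker)]
    return shard(summed, world_size)

world_size = 2
params = [1.0, 2.0, 3.0, 4.0]
shards = shard(params, world_size)       # each worker stores only 2 params

full = all_gather(shards)                # before forward: materialize full params
assert full == params

# each worker computes local grads against the full parameter set
local_grads = [[0.25] * 4, [0.25] * 4]
grad_shards = reduce_scatter(local_grads, world_size)

# SGD step with lr = 1.0: each worker updates only its own shard
new_shards = [[p - g for p, g in zip(s, gs)]
              for s, gs in zip(shards, grad_shards)]
print(all_gather(new_shards))            # [0.5, 1.5, 2.5, 3.5]
```

With `reshard_after_forward=True` the full parameters would be freed after the forward pass and all-gathered a second time for backward, trading that extra communication for memory.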
- 29 Jan, 2021 1 commit
  Min Xu authored
  * [test]: test with py39 + torch 1.8 nightly
  * version fix
  * more fix
  * fix version function for nightly version
  * fix torch_pg build
  * invalidate cache
  * separate benchmark requirements
  * comment
  * fixed mypy
  * fixed a test

- 25 Jan, 2021 1 commit
  Min Xu authored
  * [test] cover python 3.7 to 3.9 on CPU
    - covering common python versions on CPU tests
    - added doc build test
  * add doc build test
  * skipping failing tests on py39
  * catching doc build warnings
  * add doc build to py38 and py39
  * minor fix
  * fix doc build for adascale
  * removed dead code
  * fix the skipping
  * skip unit test for py39
  * add failing example
  * no more py39 skipping the tests

- 21 Jan, 2021 1 commit
  Benjamin Lefaudeux authored

- 05 Jan, 2021 1 commit
  Benjamin Lefaudeux authored
  * adding the pytest timeout plugin to properly root out hanging tests
  * removing redundant code, slightly more reasonable timeout, works on single cuda
  * finding the root bug for some of the cpu hangs, rpc init
  * propagating all the rpc init test changes to the pipe and model parallel tests

- 30 Oct, 2020 1 commit
  msbaines authored

- 29 Oct, 2020 1 commit
  msbaines authored

- 28 Oct, 2020 1 commit
  msbaines authored

- 14 Oct, 2020 1 commit
  msbaines authored

- 21 Aug, 2020 1 commit
  Benjamin Lefaudeux authored
  * initial commit, dummy training loop, pure pytorch but not DDP
  * probably slightly broken, but rough DDP benchmark run
  * adding the torchvision requirement for testing
  * brainfart
  * reduce the loss, do something slightly distributed
  * Some cleanup, distributing the training on two GPUs
  * some cleanup + adding a vanilla run, still not good to go
  * less silly defaults, gtg for a start I think
  * smaller batch to fit the smaller gpus used in the circleci rigs
  * Adding some options for the benchmark, and regression testing
  * [test] set torch seed for Adam tests (#49): set the torch seed for tests; xfail mixed precision and memory-efficient mixed-precision state_dict tests due to their states being cast to FP16 and back to FP32 during load_state_dict
  * linting, I really need to automate this isort insanity
  Co-authored-by: Jun Ru Anderson <33384298+andersonic@users.noreply.github.com>
  Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

- 13 Aug, 2020 1 commit
  msbaines authored

- 06 Aug, 2020 1 commit
  Min Xu authored
  Co-authored-by: Min Xu <m1n@fb.com>

- 31 Jul, 2020 1 commit
  msbaines authored

- 08 Jul, 2020 1 commit
  Mandeep Singh Baines authored