- 19 Dec, 2020 1 commit
Benjamin Lefaudeux authored
[OSS] Getting rid of the "should bucket" hash table, just use a list + non-trainable params fix (#259)
* Getting rid of the "should bucket" hash table, just use a list. Properly handle all params, with or without requires_grad
* Make sure that this case is unit tested
-
- 17 Dec, 2020 2 commits
Joshua Meier authored
-
Benjamin Lefaudeux authored
* Typo, sorry about that
* Small perf fix
-
- 16 Dec, 2020 1 commit
Benjamin Lefaudeux authored
* Better handling of the callback queue, try to consume it as we go
* Dumping buckets for the reduce part, always the same unused-params issue
-
- 10 Dec, 2020 1 commit
Benjamin Lefaudeux authored
* Unit test checking DDP and sharded_ddp equivalence, reproducing the issue that Sean spotted
* Fixing the issue: requests in flight were not counted properly
* Adding a multiple-optimizers case
-
- 04 Dec, 2020 1 commit
Benjamin Lefaudeux authored
* Proper unit testing, but no solution other than disabling bucketing for now; a couple of tested options do not work
-
- 21 Nov, 2020 1 commit
Benjamin Lefaudeux authored
* Rewrite using autograd and the Variable execution queue to make the reduce automatic
* Share buckets with OSS to remove duplication
* Some speed is likely still on the table, since the speedup vs. bucketing does not match expectations; could be a follow-up
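The "make the reduce automatic" idea can be sketched with plain parameter hooks; a minimal illustration of the concept (not the actual fairscale code, which drives this through autograd's execution queue), assuming `torch.distributed` is already initialized:

```python
import torch
import torch.distributed as dist


def attach_auto_reduce_hooks(model: torch.nn.Module) -> None:
    # Average each gradient across ranks as soon as autograd produces it,
    # instead of reducing everything explicitly after backward().
    world_size = dist.get_world_size()

    def reduce_hook(grad: torch.Tensor) -> torch.Tensor:
        grad = grad / world_size
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        return grad

    for param in model.parameters():
        if param.requires_grad:
            param.register_hook(reduce_hook)
```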
-
- 16 Nov, 2020 1 commit
Benjamin Lefaudeux authored
Add a clip-gradients util, equivalent to torch's but aware of the sharded state. Add a corresponding unit test.
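The sharded-aware part boils down to combining gradient norms across ranks before scaling; a minimal sketch of the idea (not the actual fairscale utility), assuming an initialized process group and gradients on devices the backend can reduce:

```python
import torch
import torch.distributed as dist


def clip_grad_norm_sharded(local_params, max_norm: float) -> torch.Tensor:
    # Each rank only holds the gradients of its own shard, so the squared
    # norms must be summed across ranks before anything gets scaled.
    grads = [p.grad for p in local_params if p.grad is not None]
    total_sq = torch.stack([g.norm(2) ** 2 for g in grads]).sum().reshape(1)
    dist.all_reduce(total_sq, op=dist.ReduceOp.SUM)
    total_norm = total_sq.sqrt()

    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for g in grads:
            # Every rank scales by the same global coefficient.
            g.mul_(clip_coef)
    return total_norm
```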
-
- 10 Nov, 2020 1 commit
Tom Birch authored
Adds support for:
* Reused layers (e.g. for weight sharing)
* Lazily-constructed layers
* Single-process control via PipeRPCWrapper
* PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive
Also added examples for multi-process and PipeRPCWrapper.
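The per-worker event loop can be pictured as a single queue consumer; a loose sketch of the concept only, with hypothetical `run_forward`/`run_backward` stand-ins:

```python
import queue
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    ACTIVATION = auto()
    GRADIENT = auto()


@dataclass
class Message:
    kind: Kind
    payload: object


def run_forward(payload) -> None: ...   # hypothetical: forward pass for this stage
def run_backward(payload) -> None: ...  # hypothetical: backward pass for this stage


def event_loop(inbox: "queue.Queue[Message]") -> None:
    # One loop per rank/worker: activations and gradients are handled in
    # whichever order they arrive, rather than on a fixed schedule.
    while True:
        msg = inbox.get()
        if msg.kind is Kind.ACTIVATION:
            run_forward(msg.payload)
        elif msg.kind is Kind.GRADIENT:
            run_backward(msg.payload)
```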
-
- 04 Nov, 2020 1 commit
msbaines authored
-
- 23 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* small refactor, getting rid of the while loop
-
- 20 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* Small refactor, code cleanup
* Broadcast the tensor .data attribute directly
-
- 14 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* Fixing the issue with Apex; validated with Latte, Classy would need another pass
-
- 08 Oct, 2020 2 commits
Benjamin Lefaudeux authored
* new unit test to catch rank issues in OSS
-
ngoyal2707 authored
Authored-by: Naman Goyal <namangoyal@learnfair0755.h2.fair>
-
- 06 Oct, 2020 1 commit
Benjamin Lefaudeux authored
Same bucketing strategy for OSS and SDP: sort everything ahead of time, per rank and per size, smaller tensors first. Bucket the smallest elements in a fixed buffer, send it asynchronously, then send all the others asynchronously, and get back to the bucket. Once done, scatter the contents if needed.
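A minimal sketch of that strategy (the `bucket_bytes` threshold is illustrative, and the per-rank dimension is omitted for brevity):

```python
import torch
import torch.distributed as dist


def broadcast_with_bucket(tensors, src_rank: int, bucket_bytes: int = 2 ** 20):
    # Sort ahead of time, smallest tensors first, so the tiny ones can be
    # packed into a single flat buffer instead of many small broadcasts.
    tensors = sorted(tensors, key=lambda t: t.numel())

    bucketed, direct, used = [], [], 0
    for t in tensors:
        nbytes = t.numel() * t.element_size()
        if used + nbytes <= bucket_bytes:
            bucketed.append(t)
            used += nbytes
        else:
            direct.append(t)

    requests = []
    if bucketed:
        # Send the packed bucket first, asynchronously.
        flat = torch.cat([t.flatten() for t in bucketed])
        requests.append(dist.broadcast(flat, src=src_rank, async_op=True))

    # Then send all the larger tensors, also asynchronously.
    for t in direct:
        requests.append(dist.broadcast(t, src=src_rank, async_op=True))

    for req in requests:
        req.wait()

    # Get back to the bucket: scatter its contents to the original tensors.
    if bucketed:
        offset = 0
        for t in bucketed:
            t.copy_(flat[offset : offset + t.numel()].view_as(t))
            offset += t.numel()
```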
-
- 01 Oct, 2020 3 commits
msbaines authored
-
Joshua Meier authored
Support optimizer state sharding for Megatron
-
Benjamin Lefaudeux authored
* Minor, but gives some memory back
* Adjust CI and regression checks to 4 GPUs
-
- 22 Sep, 2020 3 commits
Benjamin Lefaudeux authored
* Various fixes: no more issues with `make html`, and more API fields should be populated
-
Benjamin Lefaudeux authored
* Broadcasting grad-enabled tensors is forbidden in Gloo, because broadcast is not differentiable. Workaround added.
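The message does not spell out the workaround, but the 20 Oct entry above ("broadcast the tensor .data attribute directly") hints at its shape; a minimal sketch:

```python
import torch.distributed as dist


def broadcast_params(params, src_rank: int) -> None:
    # Gloo rejects broadcasting tensors that require grad (broadcast is not
    # differentiable); going through .data sidesteps the autograd check.
    for p in params:
        dist.broadcast(p.data, src=src_rank)
```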
-
Benjamin Lefaudeux authored
* Doc extensions to some APIs
* Fix the benchmark and tutorial
-
- 17 Sep, 2020 2 commits
Benjamin Lefaudeux authored
- Rename oss_ddp to ShardedDataParallel
- Some refactoring
- ShardedDataParallel owns the sharded optimizer, exposed if need be
- Some small perf bumps
-
Benjamin Lefaudeux authored
Add a small tutorial, similar to the OSS README
-
- 15 Sep, 2020 2 commits
Benjamin Lefaudeux authored
Return either the local or the global state when queried, depending on whether a prior consolidation happened
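In usage terms, roughly (a hypothetical sketch against the later fairscale OSS API, which may differ from the code at this commit; assumes an initialized process group):

```python
import torch
import torch.distributed as dist
from fairscale.optim import OSS

model = torch.nn.Linear(4, 4)
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)

# Gather every rank's shard onto rank 0 first...
optimizer.consolidate_state_dict(recipient_rank=0)

# ...so that rank 0 can query the global state; other ranks would get
# their local shard (per this commit's behaviour).
if dist.get_rank() == 0:
    full_state = optimizer.state_dict()
```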
-
Benjamin Lefaudeux authored
Make OSS compatible with optimizers that do not support the closure argument
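One common way to achieve this is to inspect the wrapped optimizer's signature; a hedged sketch of the idea, not necessarily what the commit does:

```python
import inspect


def step(self, closure=None):
    # Not every wrapped optimizer takes a closure (e.g. some Apex ones),
    # so only forward it when the signature actually accepts it.
    accepts_closure = "closure" in inspect.signature(self.optim.step).parameters
    if closure is not None and accepts_closure:
        return self.optim.step(closure=closure)
    return self.optim.step()
```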
-
- 10 Sep, 2020 1 commit
Benjamin Lefaudeux authored
Changes the broadcast calls in the OSS step() function to make them asynchronous
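The usual pattern: launch every broadcast with `async_op=True` and wait once at the end. A minimal sketch (the helper name and per-rank layout are assumptions):

```python
import torch.distributed as dist


def _broadcast_shards(per_rank_params) -> None:
    # Fire off every broadcast without blocking, then wait once at the end,
    # so transfers of different ranks' shards overlap instead of serializing.
    handles = [
        dist.broadcast(p.data, src=rank, async_op=True)
        for rank, params in enumerate(per_rank_params)
        for p in params
    ]
    for handle in handles:
        handle.wait()
```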
-
- 09 Sep, 2020 1 commit
Benjamin Lefaudeux authored
Changes the structure of the returned state dict with respect to the param_groups to make it closer to what a vanilla optimizer would return (un-shard them). Shard again when loading
-
- 08 Sep, 2020 1 commit
Benjamin Lefaudeux authored
Make sure that all attributes (not just the LR) are kept in sync between OSS.param_groups and the actual wrapped optimizer. Some frameworks make it possible to alter any attribute on a schedule, which proves useful depending on the optimizer, so the keys need to be supported generically (not just "lr"). Not syncing these attributes is a worst-case scenario, since the adjustments are silently not propagated; this fixes that.
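A minimal sketch of such a sync, with the method name assumed:

```python
def _sync_param_groups(self) -> None:
    # Push every attribute (lr, momentum, betas, weight_decay, ...) from the
    # exposed param_groups down to the wrapped optimizer, not just the lr.
    for exposed, local in zip(self.param_groups, self.optim.param_groups):
        for key, value in exposed.items():
            if key != "params":
                local[key] = value
```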
-
- 03 Sep, 2020 1 commit
Benjamin Lefaudeux authored
* Aligning the optimizer state dict with what PyTorch expects
* Adding a check on the dict keys, ensuring that `state` and `param_groups` are there
* After installing the specific isort, black and all, a one-liner to please the linter
-
- 28 Aug, 2020 1 commit
msbaines authored
* [fix] optim/oss: work correctly with LRScheduler. Sync the lr before every step and before consolidate.
-
- 27 Aug, 2020 4 commits
msbaines authored
Workaround PyTorch bug that casts state (pytorch/pytorch#43706). Copied from https://github.com/pytorch/fairseq/blob/v0.9.0/fairseq/optim/fp16_optimizer.py#L251-L268
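In the spirit of the referenced fairseq code, paraphrased rather than copied:

```python
def load_state_dict_no_cast(optimizer, state_dict) -> None:
    # Optimizer.load_state_dict casts the saved state to the params' dtype
    # (pytorch/pytorch#43706); re-assign the saved tensors afterwards so
    # fp32 state stays fp32 even when the params are fp16.
    optimizer.load_state_dict(state_dict)
    id_map = {
        old_id: param
        for old_group, group in zip(state_dict["param_groups"], optimizer.param_groups)
        for old_id, param in zip(old_group["params"], group["params"])
    }
    for old_id, saved_state in state_dict["state"].items():
        optimizer.state[id_map[old_id]] = saved_state
```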
-
msbaines authored
-
msbaines authored
-
msbaines authored
* [fix] optim/oss: support optimizers with additional step kwargs. Some of the optimizers in Apex support additional kwargs to step, such as scale.
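A minimal sketch of the pass-through, method placement assumed:

```python
def step(self, closure=None, **kwargs):
    # Pass any extra keyword arguments (e.g. Apex's `scale`) straight
    # through to the wrapped optimizer's step().
    if closure is not None:
        return self.optim.step(closure=closure, **kwargs)
    return self.optim.step(**kwargs)
```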
-
- 21 Aug, 2020 1 commit
Benjamin Lefaudeux authored
* Initial commit, dummy training loop, pure PyTorch but not DDP
* Probably slightly broken, but a rough DDP benchmark run
* Adding the torchvision requirement for testing
* Brainfart
* Reduce the loss, do something slightly distributed
* Some cleanup, distributing the training on two GPUs
* Some cleanup + adding a vanilla run, still not good to go
* Less silly defaults, good to go for a start I think
* Smaller batch to fit the smaller GPUs used in the CircleCI rigs
* Adding some options for the benchmark, and regression testing
* [test] set torch seed for Adam tests (#49): set the torch seed for tests; xfail mixed-precision and memory-efficient mixed-precision state_dict tests due to their states being cast to FP16 and back to FP32 during load_state_dict
* Linting, I really need to automate this isort insanity
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
Co-authored-by: Jun Ru Anderson <33384298+andersonic@users.noreply.github.com>
-
- 20 Aug, 2020 1 commit
Benjamin Lefaudeux authored
* Move the restored param groups to the original device
* Adding a corresponding test
-
- 14 Aug, 2020 2 commits
Benjamin Lefaudeux authored
* Hotfix a half-cooked optimizer state restoration; the global shared state also needs to be restored
* [cleanup] get 100% coverage on oss.py (#38), authored by Mandeep Singh Baines <msb@fb.com>
* Better unit testing: check that the .param_groups attribute is properly in sync with the loaded state
Co-authored-by: msbaines <35972327+msbaines@users.noreply.github.com>
-
msbaines authored
Authored-by: Mandeep Singh Baines <msb@fb.com>
-
- 13 Aug, 2020 1 commit
Benjamin Lefaudeux authored
Aligning OSS state dict with `https://pytorch.org/docs/stable/_modules/torch/optim/optimizer.html#Optimizer` (#31)
-