- 23 Jan, 2021 1 commit
Siddharth Goyal authored
* Add AMPnet implementation (clean version)
* Move ampnet to experimental
* Move code around in the pipeline
* Address review comments and fix pre-commit errors
* Refactor and modify delegate functionality
* Modify header in pipe.py
- 21 Jan, 2021 2 commits
Benjamin Lefaudeux authored
anj-s authored
* [refactor] Remove unused variables and refactor common configurations
* move helper function to call site
* fixed lint errors
* fix lint errors
* fix lint errors
* fix lint errors
* fix import order
* format files
* remove unused imports
* fix lint errors
* fix lint errors
* refactor common utilities
* address PR comments
* sorted imports
* add space
* modify comment
* added doc strings and addressed PR comments
* addressed PR comments
* added another comment to clarify
* fixing lint errors
* addressed PR comments
* addressed PR comments
* fixed typos
* initialize var
* rename seq_pred to lm
* fix lint errors
* move datasets and models into separate folders
* add the folders created
* fix lint errors
* create golden config to stats mapping
* add common batching for both synthetic and real data
* fixed lint errors
* enable real pipe benchmarks with new golden data
* reduce seq len to avoid OOM
* updated golden data
* add logging
* add golden data
* add golden data
* fix lint errors
* add doc string
* remove unused class
* add seq len and batch size to the config
* remove commented out line
* address comments
* rename imports
* refactor common logic in dataloaders
* add golden configs
* lint changes
* merge latest changes
* lint errors
* address PR comments

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
- 19 Jan, 2021 1 commit
anj-s authored
* [refactor] Remove unused variables and refactor common configurations
* move helper function to call site
* fixed lint errors
* fix lint errors
* fix lint errors
* fix lint errors
* fix import order
* format files
* remove unused imports
* fix lint errors
* fix lint errors
* refactor common utilities
* address PR comments
* sorted imports
* add space
* modify comment
* added doc strings and addressed PR comments
* addressed PR comments
* added another comment to clarify
* fixing lint errors
* addressed PR comments
* addressed PR comments
* fixed typos
* initialize var
* rename seq_pred to lm
* fix lint errors
* move datasets and models into separate folders
* add the folders created
* fix lint errors
* create golden config to stats mapping
* add common batching for both synthetic and real data
* fixed lint errors
* enable real pipe benchmarks with new golden data
* reduce seq len to avoid OOM
* updated golden data
* add logging
* add golden data
* add golden data
* fix lint errors
* add doc string
* remove commented out line
* address comments
* rename imports
* refactor common logic in dataloaders
* add golden configs
* lint changes

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
- 04 Jan, 2021 1 commit
anj-s authored
* [refactor] Remove unused variables and refactor common configurations
* move helper function to call site
* fixed lint errors
* fix lint errors
* fix lint errors
* fix lint errors
* fix import order
* format files
* remove unused imports
* fix lint errors
* fix lint errors
* refactor common utilities
* address PR comments
* sorted imports
* add space
* modify comment
* added doc strings and addressed PR comments
* addressed PR comments
* added another comment to clarify
* fixing lint errors
* addressed PR comments
* addressed PR comments
* fixed typos
* initialize var
* rename seq_pred to lm
* fix lint errors

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
- 30 Dec, 2020 2 commits
anj-s authored
[refactor] Remove unused variables, add configuration objects and basic cleanup for pipe benchmarks. (#252)
* [refactor] Remove unused variables and refactor common configurations
* move helper function to call site
* fixed lint errors
* fix lint errors
* fix lint errors
* fix lint errors
* fix import order
* format files
* remove unused imports
* fix lint errors
* address PR comments
* sorted imports
* add space
* modify comment
* added doc strings and addressed PR comments
* addressed PR comments
* added another comment to clarify
* fixing lint errors
* rename variable

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
Benjamin Lefaudeux authored
* removing a call made dead by ShardedDDP; small speedup
* unrelated, but filling in the changelog
* another nit
- 16 Dec, 2020 1 commit
jessijzhao authored
* [feat] add CPU support to tutorials in examples: now works on a machine without CUDA
* fixes some minor typos
* [cleanup] factorize tutorials in examples: collects duplicate code across tutorials in helpers.py
* [fix] getData in tutorials now returns an iterable
- 01 Dec, 2020 1 commit
Benjamin Lefaudeux authored
- 22 Nov, 2020 1 commit
Benjamin Lefaudeux authored
* testing median and MAD
* synchronize on kernels to make sure that we're measuring the actual completion time
* adjusting the CircleCI threshold, not because the speed has regressed but because we now measure proper CUDA execution time
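A minimal sketch of what this measurement amounts to (illustrative names, not the benchmark's actual code; assumes a CUDA machine): synchronize around the timed region so queued kernels are counted, then report median and MAD, which are robust to outlier iterations.

```python
import time

import torch


def time_step(step_fn, warmup: int = 5, iters: int = 50):
    """Time a callable with proper CUDA synchronization; report median and MAD."""
    for _ in range(warmup):
        step_fn()
    samples = []
    for _ in range(iters):
        torch.cuda.synchronize()  # wait for previously queued kernels
        start = time.monotonic()
        step_fn()
        torch.cuda.synchronize()  # measure actual completion, not launch time
        samples.append(time.monotonic() - start)
    t = torch.tensor(samples)
    median = t.median()
    mad = (t - median).abs().median()  # median absolute deviation
    return median.item(), mad.item()
```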
- 21 Nov, 2020 1 commit
Benjamin Lefaudeux authored
* rewrite using autograd and the Variable execution queue to make the reduce automatic
* share buckets with OSS to remove duplication
* some speedup is likely still on the table, since the speed vs. bucketing does not match expectations; could be a follow-up
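A rough sketch of the auto-reduce idea (illustrative only; the real implementation goes through the Variable execution queue and shares bucket buffers with OSS, and this assumes torch.distributed is already initialized):

```python
import torch
import torch.distributed as dist


def attach_auto_reduce_hooks(model: torch.nn.Module) -> None:
    """Trigger an all-reduce as soon as autograd produces each gradient."""
    world_size = dist.get_world_size()

    def make_hook():
        def hook(grad: torch.Tensor) -> torch.Tensor:
            dist.all_reduce(grad)     # sum the gradient across ranks
            return grad / world_size  # hand the averaged gradient back to autograd
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())
```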
- 19 Nov, 2020 2 commits
Benjamin Lefaudeux authored
* reverting a change which slipped in #188
Yuanyuan (Ana) Shen authored
* Add CPU support for pipe.py benchmarks, CUDA-free
- 18 Nov, 2020 1 commit
Benjamin Lefaudeux authored
* adding a shard-aware GradScaler wrap; credits to Sean Naren for the idea
* adding stubs & explanations in the documentation
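A hedged usage sketch, assuming the wrap is exposed as ShardedGradScaler under fairscale.optim.grad_scaler and that a process group is already initialized; it is meant as a drop-in for torch.cuda.amp.GradScaler that knows the optimizer state is sharded across ranks:

```python
import torch
from fairscale.optim import OSS
from fairscale.optim.grad_scaler import ShardedGradScaler  # assumed path

# assumes torch.distributed is already initialized
model = torch.nn.Linear(128, 10).cuda()
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)
scaler = ShardedGradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(32, 128, device="cuda")).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)  # unscaling is aware of the sharded optimizer state
scaler.update()
```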
- 16 Nov, 2020 1 commit
Benjamin Lefaudeux authored
Add a clip-gradients util, equivalent to torch's but aware of the sharded states. Add a corresponding unit test.
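Why a dedicated util is needed: with OSS, torch.nn.utils.clip_grad_norm_ on a single rank would only see a local norm, so the sharded-aware version has to reduce the norm across ranks first. A usage sketch, assuming the util surfaced as a clip_grad_norm method on OSS and an initialized process group:

```python
import torch
from fairscale.optim import OSS

# assumes torch.distributed is already initialized
model = torch.nn.Linear(64, 64).cuda()
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)

loss = model(torch.randn(8, 64, device="cuda")).sum()
loss.backward()
optimizer.clip_grad_norm(max_norm=1.0)  # norm is computed across all shards
optimizer.step()
```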
- 12 Nov, 2020 1 commit
Yuanyuan (Ana) Shen authored
* now works on a machine without CUDA; easier to debug and to run quick tests
- 10 Nov, 2020 1 commit
Tom Birch authored
Adds support for:
* Reused layers (e.g. for weight sharing)
* Lazily-constructed layers
* Single-process control via PipeRPCWrapper
* PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive

Also added examples for multi-process and PipeRPCWrapper usage.
- 06 Nov, 2020 1 commit
Benjamin Lefaudeux authored
* oss benchmark: add an --amp option
* add a CircleCI test
- 28 Oct, 2020 1 commit
msbaines authored
- 23 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* some ease-of-use improvements in the benchmark tool; add a debug option
- 21 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* switching to MNIST
* updating the reference values, should be good to go
* download dataset once for all processes
- 20 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* minor quality-of-life change: easier to debug, and makes it possible to test a host of models with the same code
- 17 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* adding a cpu option
* adjust the reference loss
- 14 Oct, 2020 1 commit
Benjamin Lefaudeux authored
- 10 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* bugfix
* adjust the default non-regression loss, which is no longer all_reduced
- 09 Oct, 2020 1 commit
Benjamin Lefaudeux authored
More realistic benchmarks, comparing apples to apples: DDP vs. OSS+DDP vs. OSS+SDP.
- 06 Oct, 2020 1 commit
Benjamin Lefaudeux authored
Same bucketing strategy for OSS and SDP: sort everything ahead of time, per rank and per size, smallest tensors first. Pack the smallest elements into a fixed buffer and send it asynchronously, then send all the other tensors asynchronously and come back to the bucket. Once everything is done, scatter the bucket contents back if needed.
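An illustrative reduction of that strategy to code (simplified: real buffer sizing, dtypes and per-rank bookkeeping are more involved, and the helper name is made up):

```python
import torch
import torch.distributed as dist


def bucket_and_broadcast(tensors, src_rank: int, bucket_cap: int = 2 ** 20):
    """Pack the smallest tensors into one buffer, send everything async."""
    tensors = sorted(tensors, key=lambda t: t.numel())  # smallest first
    bucket, rest, used = [], [], 0
    for t in tensors:
        if used + t.numel() <= bucket_cap:
            bucket.append(t)
            used += t.numel()
        else:
            rest.append(t)
    handles = []
    buffer = None
    if bucket:  # assumes a single dtype for simplicity
        buffer = torch.cat([t.flatten() for t in bucket])
        handles.append(dist.broadcast(buffer, src=src_rank, async_op=True))
    for t in rest:
        handles.append(dist.broadcast(t, src=src_rank, async_op=True))
    for h in handles:
        h.wait()
    if buffer is not None:  # scatter the bucket contents back
        offset = 0
        for t in bucket:
            t.copy_(buffer[offset : offset + t.numel()].view_as(t))
            offset += t.numel()
```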
- 29 Sep, 2020 1 commit
Benjamin Lefaudeux authored
- adding the buffer broadcast option
- minor cleanup in ShardedDDP
- 24 Sep, 2020 1 commit
Benjamin Lefaudeux authored
- small benchmark refactor: only one for all backends and DDP
- deterministic; enforce alignment with PyTorch DDP
- 22 Sep, 2020 2 commits
Benjamin Lefaudeux authored
* Broadcasting grad-enabled tensors is forbidden in Gloo, because the operation is not differentiable; work around it
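Roughly, the shape of the workaround (an illustrative sketch, not the exact fairscale code): broadcast a detached view of the same storage, so Gloo never sees a tensor that requires grad.

```python
import torch
import torch.distributed as dist


def broadcast_param(tensor: torch.Tensor, src: int = 0) -> None:
    # Gloo rejects tensors that require grad, since the collective is not
    # differentiable; .data shares storage but is detached from autograd,
    # so the broadcast values still land in the parameter.
    if tensor.requires_grad:
        dist.broadcast(tensor.data, src=src)
    else:
        dist.broadcast(tensor, src=src)
```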
Benjamin Lefaudeux authored
* Doc extensions to some APIs
* Fix the benchmark and tutorial
- 17 Sep, 2020 2 commits
Tom Birch authored
Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines):
* Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess but also supporting PipelineStyle.MultiProcess
* Added support for lazy construction of modules (see lazy_construction for an example)
* Added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
* Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess
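For context, basic single-process construction looks roughly like this (a sketch assuming two visible GPUs; the new style argument and the multi-process setup build on top of it):

```python
import torch
import torch.nn as nn
from fairscale.nn import Pipe

# a model expressed as nn.Sequential so it can be split into stages
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
# first two layers on one device, the last on the next; chunks = micro-batches
pipe = Pipe(model, balance=[2, 1], chunks=4)
out = pipe(torch.randn(32, 512, device="cuda:0"))  # input on the first device
```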
Benjamin Lefaudeux authored
- rename oss_ddp to ShardedDataParallel
- some refactoring
- ShardedDataParallel owns the sharded optimizer, exposed if need be
- some small perf bumps
- 16 Sep, 2020 1 commit
msbaines authored
- 09 Sep, 2020 1 commit
Benjamin Lefaudeux authored
Changes the structure of the returned state dict with respect to param_groups, to make it closer to what a vanilla optimizer would return (un-shard them). The state is sharded again when loading.
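A hedged sketch of the resulting checkpoint flow with OSS (method names as fairscale exposes them around this time; assumes an initialized process group and a shared filesystem):

```python
import torch
import torch.distributed as dist
from fairscale.optim import OSS

model = torch.nn.Linear(16, 16)
optimizer = OSS(model.parameters(), optim=torch.optim.Adam, lr=1e-3)

# ... train, then gather all shards onto one rank before saving
optimizer.consolidate_state_dict(recipient_rank=0)
if dist.get_rank() == 0:
    # shaped like a vanilla optimizer's state dict (un-sharded param_groups)
    torch.save(optimizer.state_dict(), "oss_optimizer.pt")
dist.barrier()
# on load, every rank re-shards the full state dict locally
optimizer.load_state_dict(torch.load("oss_optimizer.pt"))
```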
- 03 Sep, 2020 3 commits
Benjamin Lefaudeux authored
* Aligning the optimizer state dict with what PyTorch expects
* Adding a check on the dict keys, ensuring that `state` and `param_groups` are there
* after installing the specific isort, black and all, a one-liner to please the linter
* Adding some measurement of the memory consumption while training + checkpointing
* mandatory lint-fix commit
* brainfart: reset the memory-use counter at the beginning of the training, in case two runs happen in a row
* move the reset-stats call, hotfix
* move the optimizer to RMSprop, which is more stateful and still used in CV
* trying to figure out a SIGSEGV in CircleCI
Jun Ru Anderson authored
Add GradScaler to Fairscale, subclassing PyTorch's GradScaler. Use GradScaler in the pipe benchmark; though it is not needed in this case, it is a good example of how to use gradient scaling for larger models that do require it in order to converge.

Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
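The pattern the benchmark demonstrates is the standard torch.cuda.amp loop, with the fairscale subclass slotting in as a drop-in replacement (an illustrative sketch, assuming a CUDA machine):

```python
import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # the fairscale subclass is a drop-in here

for _ in range(10):
    inputs = torch.randn(32, 128, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(inputs).sum()
    scaler.scale(loss).backward()  # scale up so fp16 gradients do not underflow
    scaler.step(optimizer)         # unscales first; skips the step on inf/nan
    scaler.update()                # adjusts the scale factor for the next step
```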
Benjamin Lefaudeux authored
* Aligning the optimizer state dict with what PyTorch expects
* Adding a check on the dict keys, ensuring that `state` and `param_groups` are there
* after installing the specific isort, black and all, a one-liner to please the linter
- 28 Aug, 2020 1 commit
Jun Ru Anderson authored
* specify chunks for pipe/transformer benchmark: set chunks equal to len(balance). Will update words-per-second and memory-usage checks in the next commit (must test on CircleCI to find appropriate values)
* change benchmark words-per-second and memory-usage values: did six runs for words-per-second, with results 9144.40, 9163.91, 9993.01, 9082.82, 9155.09, 9000.67. Peak allocated bytes per device (which do not change between runs) were 193206272, 645632, 562688, 92688384 for devices 0, 1, 2 and 3, respectively
* increase batch size: the batch size was small enough that the GPU's computing power was not the bottleneck, which slowed training and specifically made more chunks slower. Increasing the batch size has therefore increased training speed
* update benchmark numbers: ran six times, with wps 36917.44, 36797.65, 37006.03, 36872.84, 37129.31, 37003.31 and peak allocated bytes 4061909504, 4050944, 10427392, 2031824896 for devices 0, 1, 2 and 3 respectively

Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
- 22 Aug, 2020 1 commit
Jun Ru Anderson authored
Implement scaling of the optimizer state when using pure-fp16 training, to avoid underflow. Update the benchmark to use pure fp16. Modify the state_dict methods to store and load the optimizer state scale.

Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
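A minimal sketch of the underflow problem being solved, using static loss scaling for illustration (the commit itself scales the optimizer state and persists that scale via state_dict):

```python
import torch

model = torch.nn.Linear(128, 10).cuda().half()  # pure fp16: weights and grads
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scale = 2.0 ** 13  # example static scale

inputs = torch.randn(32, 128, device="cuda", dtype=torch.float16)
loss = model(inputs).sum()
(loss * scale).backward()  # scaled so small gradients survive in fp16
for p in model.parameters():
    if p.grad is not None:
        p.grad.div_(scale)  # unscale before the parameter update
optimizer.step()
```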