- 23 Oct, 2018 1 commit
-
-
jjsjann123 authored
* [syncBN]
  added syncBN in native pure python apex
  added fused cuda kernels used for sync BN. Using welford for mean/var
  optional installation using 'python setup.py install --cuda_ext'
  added unit test with side to side comparison between apex sync BN with PyTorch BN.
  Notice that for pytorch BN implementation, because of numerical issue for mean/var, the output will be slightly off.
* [syncBN PR]
  added fp16 support
  addressing review comments on:
  1. updating last pow 2
  2. look for import error when importing syncBN kernel
* [syncBN PR]
  added convert function to insert SyncBatchNorm
  refactored some kernel code
* fixing type issue (fp16/fp32/fp64)
  added Kahan summation
  editing unit test to use pytorch primitive ops with double, passing reasonable tests now
* updating tensor creation calls
* fixing the all_reduce contiguous tensor
* transposed all reduce results
* [syncBN]
  support fp16 input & fp32 layer for apex fp16
  partially fixing launch configs
  enabling imagenet example to run with --sync_bn
* [syncBN PR]
  Documentation added
* adjusting README
* adjusting again
* added some doc to imagenet example
* [syncBN]
  warp-level reduction
  bug fix: warp reduction logic updated. check for dummy element to avoid nan.
  improved launch config for better reduction kernels. Further improvements would be to increase grid size.
* [syncBN]
  fixing undefined behavior in __shfl_down_sync from divergent threads in warp reduction.
  changing at::native::empty to at::empty (upstream comments)
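The commit message above mentions using Welford for mean/var. As a rough illustration only, here is a pure-Python sketch of Welford's single-pass update plus the standard pairwise merge of two partial results, which is one way per-device statistics could be combined. The function names `welford` and `merge` are hypothetical, and this is not the fused CUDA kernel from the commit.

```python
import math


def welford(values):
    """Single-pass Welford update: returns (count, mean, m2).

    m2 is the running sum of squared deviations from the mean;
    the biased variance is m2 / count.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the already-updated mean
    return count, mean, m2


def merge(a, b):
    """Combine two (count, mean, m2) partial results into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# Hypothetical data: two "devices" each hold half of the batch.
full = welford([1.0, 2.0, 3.0, 4.0])
merged = merge(welford([1.0, 2.0]), welford([3.0, 4.0]))
biased_var = merged[2] / merged[0]
print(full, merged, biased_var)
```

The merged statistics match those computed over the full batch, so each device only needs to exchange its count, mean, and `m2` rather than its raw activations.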
-
- 10 Oct, 2018 2 commits
-
-
Michael Carilli authored
-
Michael Carilli authored
-
- 08 Oct, 2018 2 commits
-
-
mcarilli authored
-
Michael Carilli authored
Moving gradient division back to after the allreduce. Empirically, it appears underflow is more of a danger than overflow.
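To make the underflow concern concrete, the following sketch simulates fp16 rounding in pure Python via `struct`'s half-precision format and compares dividing by the world size before versus after a summing allreduce. The helper names, the world size of 16, and the gradient value are all assumptions chosen for illustration; this is not apex's actual allreduce code.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def allreduce_sum_fp16(values):
    """Simulated allreduce: sum with every partial result rounded to fp16."""
    total = 0.0
    for v in values:
        total = to_fp16(total + v)
    return total


WORLD_SIZE = 16                    # hypothetical number of ranks
local_grad = to_fp16(2.0 ** -22)   # small gradient, still representable in fp16
grads = [local_grad] * WORLD_SIZE

# Divide first: each per-rank quotient (2**-26) is too small for fp16
# and rounds to zero, so the sum is zero as well.
divide_before = allreduce_sum_fp16([to_fp16(g / WORLD_SIZE) for g in grads])

# Divide last: the sum stays representable, and so does the final average.
divide_after = to_fp16(allreduce_sum_fp16(grads) / WORLD_SIZE)

print(divide_before, divide_after)
```

In this toy setting, dividing first loses the gradient entirely, while dividing after the allreduce recovers the original value. The opposite trade-off (overflow when summing large values before dividing) is not shown here.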
-
- 07 Oct, 2018 2 commits
-
-
-
Michael Carilli authored
-
- 05 Oct, 2018 1 commit
-
-
Michael Carilli authored
-
- 03 Oct, 2018 1 commit
-
-
mcarilli authored
This is consistent with upstream, and safer against overflow.
-
- 29 Sep, 2018 4 commits
-
-
mcarilli authored
-
mcarilli authored
-
Michael Carilli authored
-
mcarilli authored
* beautiful
* IT'S WORKING
* Hopefully fix race condition for fallback hook
* Updating test
* shared_param -> delayed_allreduce
* Adding a safety check
* One more check
* syntax...
-
- 19 Sep, 2018 1 commit
-
-
mcarilli authored
* Fix appears to work in Tomasz's example.
* Somehow shared_param got de-enabled again?
-
- 18 Sep, 2018 1 commit
-
-
Michael Carilli authored
-
- 17 Sep, 2018 1 commit
-
-
Christian Sarofeen authored
* Remove some fp16 examples that don't converge
  Default static loss scale of 1.0 (default value) for resnet50 doesn't converge. Either remove example or put static loss scale 128 on it, which is known to converge well.
* Update README.md
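One possible reason a static loss scale of 128 helps where 1.0 does not is gradient underflow in fp16. The sketch below simulates this in pure Python with `struct`'s half-precision format. The gradient value is made up, and multiplying the gradient directly stands in for scaling the loss before the backward pass; it does not reproduce the resnet50 example or apex's loss-scaling code.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


true_grad = 2.0 ** -27   # hypothetical gradient, below the fp16 subnormal range

# Loss scale 1.0: the gradient is stored in fp16 as-is and flushes to zero.
grad_scale_1 = to_fp16(true_grad * 1.0) / 1.0

# Loss scale 128: the scaled gradient (2**-20) survives in fp16, and the
# unscale step runs in higher precision (a Python float stands in for fp32).
LOSS_SCALE = 128.0
grad_scale_128 = to_fp16(true_grad * LOSS_SCALE) / LOSS_SCALE

print(grad_scale_1, grad_scale_128)
```

With scale 1.0 the update is lost, while with scale 128 the original gradient is recovered after unscaling.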
-
- 14 Sep, 2018 2 commits
-
-
Michael Carilli authored
-
Michael Carilli authored
-
- 13 Sep, 2018 1 commit
-
-
Michael Carilli authored
-
- 11 Sep, 2018 1 commit
-
-
Carl Case authored
* WIP: update to support new RNN backend code
* small refactor
* add test for rnn w/ packed sequences
-
- 10 Sep, 2018 4 commits
- 07 Sep, 2018 1 commit
-
-
Michael Carilli authored
-
- 06 Sep, 2018 2 commits
-
-
Michael Carilli authored
-
Michael Carilli authored
-
- 05 Sep, 2018 2 commits
-
-
Michael Carilli authored
-
Michael Carilli authored
-
- 30 Aug, 2018 2 commits
- 28 Aug, 2018 7 commits
-
-
Michael Carilli authored
-
Michael Carilli authored
-
Michael Carilli authored
-
Christian Sarofeen authored
* Add reducer class in parallel/distributed.
* Separate DDP and Reducer examples.
* Don't confuse DDP and reducer in example.
-
Christian Sarofeen authored
-
Michael Carilli authored
-
Carl Case authored
-
- 27 Aug, 2018 2 commits