- 02 Jan, 2019 1 commit
Jie authored
Replacing new_group with torch.distributed.group.WORLD avoids creating a new group in every iteration. This should resolve issue #105 (Training gets stuck when using SyncBN).
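A minimal sketch of the pattern this commit describes, assuming a standard torch.distributed setup; sync_tensor and its argument are illustrative names, not Apex's actual code:

```python
import torch.distributed as dist

# Before (a new group was created on every call):
#   group = dist.new_group()
#   dist.all_reduce(tensor, group=group)

def sync_tensor(tensor):
    # group.WORLD is the default group created by init_process_group(),
    # so no per-iteration group construction is needed.
    dist.all_reduce(tensor, group=dist.group.WORLD)
    tensor /= dist.get_world_size()
    return tensor
```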
- 22 Dec, 2018 1 commit
rxy1212 authored
torch.distributed.new_group and torch.distributed.reduce_op are deprecated in PyTorch 1.0.0; this fix avoids some errors for now.
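A sketch of the corresponding API migration, assuming the SUM reduction case; allreduce_sum is an illustrative helper:

```python
import torch.distributed as dist

def allreduce_sum(tensor):
    # The reduce_op namespace was deprecated in favor of the ReduceOp enum;
    # the call is otherwise unchanged.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # formerly: op=dist.reduce_op.SUM
    return tensor
```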
- 17 Dec, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 14 Dec, 2018 1 commit
Michael Carilli authored
- 10 Dec, 2018 1 commit
Jie authored
- 03 Dec, 2018 1 commit
jjsjann123 authored
Supporting a user-specified process group.
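A usage sketch, assuming apex.parallel.SyncBatchNorm accepts a process_group argument as this commit suggests; the two-rank subgroup is hypothetical:

```python
import torch.distributed as dist
from apex.parallel import SyncBatchNorm

# Hypothetical subgroup: restrict BN statistics to ranks 0 and 1.
bn_group = dist.new_group(ranks=[0, 1])

# process_group presumably defaults to the world group; passing one scopes
# the mean/var all-reduce to that subset of ranks.
sync_bn = SyncBatchNorm(num_features=64, process_group=bn_group).cuda()
```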
- 14 Nov, 2018 1 commit
mcarilli authored
- 01 Nov, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 30 Oct, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 29 Oct, 2018 1 commit
mcarilli authored
* test passes
* notes
* Using C++-side flatten and unflatten functions
* Adding csrc
* Persistent synchronization event so it doesn't need to be created and destroyed each time
* Interop with parameter flattening in SSD
* Added deterministic option to imagenet main.py
* Adding options to split gradient averaging and allreduce in pure fp32
* Fixing allreduce_maybe_retain call
* Fixing allreduce_fallback
* Also sync active_i_buckets from rank 0
* Making retain_allreduce_buffers compatible with/orthogonal to delay_allreduce=True|False
* Correcting syntax error, now all seems to work with SSD
* Optional cpp extension build
* Add mixed precision adam optimizer (#59)
* Add FusedAdam optimizer to Apex that places all the math into a CUDA kernel
* Added fixes to fused_adam to get it to work with network
* WIP work on python interface for adam with options
* Fix dispatch for halfs, add python options to handle optional half gradients and params
* Cleanup, get rid of grid-stride loop
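For the FusedAdam piece, a usage sketch assuming the constructor mirrors torch.optim.Adam, since the kernel fusion described here should not change the Python-side interface; exact argument names may differ by Apex version:

```python
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()

# FusedAdam performs the entire Adam update in one CUDA kernel instead of
# a chain of small elementwise ops over each parameter.
optimizer = FusedAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```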
- 23 Oct, 2018 1 commit
jjsjann123 authored
* [syncBN] added syncBN in native pure python apex; added fused CUDA kernels used for sync BN. Using Welford for mean/var. Optional installation using 'python setup.py install --cuda_ext'. Added unit test with side-by-side comparison between apex sync BN and PyTorch BN. Notice that for the PyTorch BN implementation, because of numerical issues in mean/var, the output will be slightly off.
* [syncBN PR] added fp16 support, addressing review comments on: 1. updating last pow 2; 2. look for import error when importing the syncBN kernel
* [syncBN PR] added convert function to insert SyncBatchNorm; refactored some kernel code
* fixing type issue (fp16/fp32/fp64); added Kahan summation; editing unit test to use pytorch primitive ops with double, passing reasonable tests now
* updating tensor creation calls
* fixing the all_reduce contiguous tensor
* transposed all reduce results
* [syncBN] support fp16 input & fp32 layer for apex fp16; partially fixing launch configs; enabling imagenet example to run with --sync_bn
* [syncBN PR] documentation added
* adjusting README
* adjusting again
* added some doc to imagenet example
* [syncBN] warp-level reduction bug fix: warp reduction logic updated; check for dummy element to avoid NaN; improved launch config for better reduction kernels. Further improvements would be to increase grid size.
* [syncBN] fixing undefined behavior in __shfl_down_sync from divergent threads in warp reduction; changing at::native::empty to at::empty (upstream comments)
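A sketch of the convert function mentioned above, assuming the apex.parallel.convert_syncbn_model name; the toy model is illustrative:

```python
import torch
from apex.parallel import convert_syncbn_model

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3),
    torch.nn.BatchNorm2d(64),
    torch.nn.ReLU(),
).cuda()

# Walks the module tree and replaces every torch.nn.BatchNorm*d with
# apex.parallel.SyncBatchNorm, which all-reduces mean/var across ranks.
model = convert_syncbn_model(model)
```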
- 10 Oct, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 08 Oct, 2018 1 commit
Michael Carilli authored
Moving gradient division back to after the allreduce. Empirically, it appears underflow is more of a danger than overflow.
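A sketch of the two orderings this commit and the 03 Oct commit below trade off; average_gradients is an illustrative helper, not Apex's internal code:

```python
import torch.distributed as dist

def average_gradients(params, predivide=False):
    world_size = dist.get_world_size()
    for p in params:
        if p.grad is None:
            continue
        if predivide:
            # Divide first: the summed result stays small, guarding
            # against fp16 overflow (the 03 Oct change).
            p.grad.div_(world_size)
            dist.all_reduce(p.grad)
        else:
            # Sum first, divide after: small per-rank gradients are not
            # shrunk before the reduce, guarding against fp16 underflow
            # (this commit's change).
            dist.all_reduce(p.grad)
            p.grad.div_(world_size)
```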
- 03 Oct, 2018 1 commit
mcarilli authored
This is consistent with upstream, and safer against overflow.
- 29 Sep, 2018 3 commits
- 19 Sep, 2018 1 commit
mcarilli authored
* Fix appears to work in Tomasz's example.
* Somehow shared_param got disabled again?
- 18 Sep, 2018 1 commit
Michael Carilli authored
- 05 Sep, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 30 Aug, 2018 1 commit
mcarilli authored
- 28 Aug, 2018 3 commits
Michael Carilli authored
Michael Carilli authored
Christian Sarofeen authored
- 14 Aug, 2018 1 commit
Michael Carilli authored
- 18 Jul, 2018 2 commits
ngimel authored
Christian Sarofeen authored
- 04 Jul, 2018 1 commit
brett koonce authored
- 03 Jul, 2018 1 commit
Raul Puri authored
* Proper implementation of LARC clipping
* Documentation of the LARC class
* Modification of FP16_Optimizer to absorb the optimizer instance being wrapped, instead of creating a new optimizer instance of the same class
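A usage sketch, assuming the apex LARC wrapper interface of the time (from apex.parallel.LARC import LARC); argument names may differ by version:

```python
import torch
from apex.parallel.LARC import LARC

model = torch.nn.Linear(512, 512).cuda()
base_opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# LARC wraps an existing optimizer and rescales each layer's LR by the
# trust ratio ||w|| / ||grad||; with clip=True the adaptive LR is capped
# at the schedule's LR rather than multiplied into it.
optimizer = LARC(base_opt, trust_coefficient=0.02, clip=True)
```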
- 26 Jun, 2018 2 commits
- 22 Jun, 2018 1 commit
Michael Carilli authored
- 20 Jun, 2018 1 commit
Michael Carilli authored
- 16 Jun, 2018 1 commit
Michael Carilli authored
- 15 Jun, 2018 1 commit
Michael Carilli authored