Commits · 613997ea871c582b6e2e8fded1d8e89abf598b18 · OpenDAS / apex

26 Feb, 2019 1 commit
- No need for casts during optimizer step · 613997ea
  Michael Carilli authored Feb 26, 2019
  
24 Feb, 2019 1 commit
- Stashing work · d137b800
  Michael Carilli authored Feb 24, 2019
  
22 Feb, 2019 1 commit
- Allow multi-tensor unscale to handle FP16 output, so it can also be used for... · 80a3f3ca
  Michael Carilli authored Feb 21, 2019
```
Allow multi-tensor unscale to handle FP16 output, so it can also be used for copy-scatter. Rename some options.
```
19 Feb, 2019 1 commit
- Reworked multi tensor apply, added tests · 6763a8be
  Michael Carilli authored Feb 18, 2019
  
13 Feb, 2019 1 commit
- New API tentatively works on resnet50, ready for stress testing. · 889d1712
  Michael Carilli authored Feb 12, 2019
  
06 Feb, 2019 1 commit
- Tests for the fused downscale kernel · 340e71a4
  Michael Carilli authored Feb 05, 2019
  
05 Feb, 2019 1 commit

- Better FP16 support in pytorch fp16 utils. · 713e0fb8
  Jerry Ma authored Feb 01, 2019

This commit adds an FP16Model class as a successor to network_to_half.

The benefits of this class are:

- Preservation of single-precision for BatchNorm layers. The models
  generated by network_to_half() convert BatchNorm moment tensors to
  half-precision, then back to single-precision, which hurts the
  accuracy of the moment estimators and occasionally results in NaNs.
- Support for multi-argument nn.Modules (self-explanatory from code).
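
The accuracy loss described in the first point can be illustrated without PyTorch. The sketch below is not code from this commit: it uses the half-precision format (`"e"`) of Python's `struct` module to mimic storing a BatchNorm running mean in FP16, Python's double-precision floats stand in for single precision, and the helper names `to_half` and `update_running_mean` as well as the sample values are hypothetical. With a momentum of 0.1, an update smaller than the FP16 spacing around the stored value is rounded away, so the half-precision estimate stalls while the wider-precision one keeps converging.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def update_running_mean(running, batch_mean, momentum=0.1, half=False):
    """One exponential-moving-average step, optionally stored in half precision."""
    new = (1.0 - momentum) * running + momentum * batch_mean
    return to_half(new) if half else new


# Hypothetical scenario: every batch mean is 1001.0, the estimate starts at 1000.0.
# Near 1000 the spacing between adjacent FP16 values is 0.5, so a step of
# about 0.1 cannot be represented and is lost on every round trip.
running_full = 1000.0
running_half = to_half(1000.0)
for _ in range(100):
    running_full = update_running_mean(running_full, 1001.0)
    running_half = update_running_mean(running_half, 1001.0, half=True)

print("wide precision:", running_full)
print("half precision:", running_half)
```

Keeping the BatchNorm statistics in single precision throughout, as the commit message describes, avoids this round trip entirely.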

03 Feb, 2019 1 commit
- Lazy imports to reduce error spam · 48299b0d
  Michael Carilli authored Feb 02, 2019
  
01 Feb, 2019 1 commit
- async->non_blocking, module-specific logging · cc85a2e5
  Michael Carilli authored Feb 01, 2019
  
29 Jan, 2019 3 commits
- Update two_gpu_unit_test.py · 8b9ce244
  mcarilli authored Jan 28, 2019
  
- Update two_gpu_unit_test.py · d0624f4f
  mcarilli authored Jan 28, 2019
  
- adding comment to explain single process gradient averaging · c8d7c9f1
  jiej authored Jan 28, 2019
  
28 Jan, 2019 1 commit

- [syncBN] · 63e47d29
  jiej authored Jan 28, 2019

Test update to resolve
  https://github.com/NVIDIA/apex/issues/134#issue-403525480

Use an identical learning rate for both DDP with sync BN and single-process BN.
The previous configuration gave the impression that sync BN requires adjusting
the learning rate in the script, which is not true.

25 Jan, 2019 1 commit
- Adding tests, also, don't drop cache during eval. · dfd40f9a
  Michael Carilli authored Jan 24, 2019
  
15 Jan, 2019 1 commit
- [sync BN nhwc] · 443fa76e
  Jie authored Jan 14, 2019
```
Added kernel to support sync BN for channel last tensor
```
15 Dec, 2018 1 commit
- add unit tests for optimizers/fp16_optimizer · afc8d1b2
  Deyu Fu authored Dec 14, 2018
  
01 Nov, 2018 1 commit
- Adding switch to control averaging of gradients. · efc561ba
  Michael Carilli authored Nov 01, 2018
  
30 Oct, 2018 1 commit

- Adam tests (#67) · d594826c
  ngimel authored Oct 30, 2018

* Add unittest for FusedAdam.

* Fix some bugs.

* set seed for adam test

29 Oct, 2018 1 commit

- Merging in fused adam optimizer, additional DDP features tested in 18.10 (#60) · e0bc5d62
  mcarilli authored Oct 29, 2018

* test passes

* notes

* Using C++-side flatten and unflatten functions

* Adding csrc

* Persistent synchronization event so it doesn't need to be created and destroyed each time

* Interop with parameter flattening in SSD

* Added deterministic option to imagenet main.py

* Adding options to split gradient averaging and allreduce in pure fp32

* Fixing allreduce_maybe_retain call

* Fixing allreduce_fallback

* Also sync active_i_buckets from rank 0

* Making retain_allreduce_buffers compatible with/orthogonal to delay_allreduce=True|False

* Correcting syntax error, now all seems to work with SSD

* Optional cpp extension build

* Add mixed precision adam optimizer (#59)

* Add FusedAdam Optimizer to Apex that places all the math into a cuda kernel.

* Added fixes to fused_adam to get it to work with network.

* wip work on python interface for adam with options

* fix dispatch for halfs, add python options to handle optional half gradients and params

* cleanup, get rid of grid-stride loop

23 Oct, 2018 1 commit

- [syncBN] (#48) · 81eef1ef
  jjsjann123 authored Oct 23, 2018

* [syncBN]
  added syncBN in native pure python apex
  added fused CUDA kernels used for sync BN, using Welford for mean/var
    optional installation using 'python setup.py install --cuda_ext'
  added unit test with side-by-side comparison between apex sync BN and
    PyTorch BN. Note that the PyTorch BN implementation has a numerical
    issue in its mean/var computation, so its output will be slightly off.

* [syncBN PR]
  added fp16 support
  addressing review comments on:
    1. updating last pow 2
    2. look for import error when importing syncBN kernel

* [syncBN PR]
  added convert function to insert SyncBatchNorm
  refactored some kernel code

* fixing type issue (fp16/fp32/fp64)
  added Kahan summation
  editing unit test to use pytorch primitive ops with double, passing reasonable tests now

* updating tensor creation calls

* fixing the all_reduce contiguous tensor

* transposed all reduce results

* [syncBN]
  support fp16 input & fp32 layer for apex fp16
  partially fixing launch configs
  enabling imagenet example to run with --sync_bn

* [syncBN PR]
  Documentation added

* adjusting README

* adjusting again

* added some doc to imagenet example

* [syncBN]
  warp-level reduction
  bug fix: warp reduction logic updated. check for dummy element to avoid nan.
  improved launch config for better reduction kernels. Further improvements
    would be to increase grid size.

* [syncBN]
  fixing undefined behavior in __shfl_down_sync from divergent threads in warp
    reduction.
  changing at::native::empty to at::empty (upstream comments)
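
Two numerical techniques named in this commit message, Welford's algorithm for mean/var and Kahan (compensated) summation, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the CUDA kernels added by the commit. The function names `welford`, `merge_stats`, `naive_sum`, and `kahan_sum` are hypothetical. The biased variance (division by the count) is assumed here. `merge_stats` uses the standard pairwise-combination formula as one plausible way to combine partial statistics from several workers.

```python
import math


def welford(values):
    """Single-pass mean and biased variance (Welford's algorithm)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # second factor uses the updated mean
    return count, mean, (m2 / count if count else 0.0)


def merge_stats(a, b):
    """Combine two (count, mean, biased variance) triples into one."""
    (na, mean_a, var_a), (nb, mean_b, var_b) = a, b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = var_a * na + var_b * nb + delta * delta * na * nb / n
    return n, mean, m2 / n


def naive_sum(values):
    """Plain left-to-right accumulation, for comparison."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Compensated summation: carries each add's rounding error forward."""
    total, comp = 0.0, 0.0
    for x in values:
        y = x - comp            # apply the correction from the previous step
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was lost
        total = t
    return total


# Data with a large offset, the case where E[x^2] - E[x]^2 loses digits.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
whole = welford(data)
merged = merge_stats(welford(data[:2]), welford(data[2:]))

# Many tiny terms added to 1.0: each one alone is below the rounding threshold.
tiny = [1.0] + [1e-16] * 1000
reference = math.fsum(tiny)  # correctly rounded sum from the standard library
```

Here `whole` and `merged` agree, which is the property that lets per-worker statistics be combined after the fact, and `kahan_sum(tiny)` stays close to `reference` while `naive_sum(tiny)` never moves away from 1.0.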

29 Sep, 2018 2 commits
- Clean up race condition test, need to figure out a clean way to create distributed unit tests · 9d731777
  Michael Carilli authored Sep 29, 2018
  
- Efficient bucketing (#49) · fa183ee8
  mcarilli authored Sep 28, 2018
```
* beautiful

* IT'S WORKING

* Hopefully fix race condition for fallback hook

* Updating test

* shared_param -> delayed_allreduce

* Adding a safety check

* One more check

* syntax...
```
13 Sep, 2018 1 commit
- Skeleton for modular tests · b7025fc9
  Michael Carilli authored Sep 13, 2018
  
23 Jul, 2018 1 commit
- Switch to simple Python-only install, in preparation for upstreaming C++ backend. · d695b68b
  Michael Carilli authored Jul 23, 2018
  
06 Jun, 2018 1 commit
- Macros based on torch.__version__ to compile with 0.4 and 0.5 · d506eff2
  Michael Carilli authored Jun 06, 2018
  
26 May, 2018 1 commit
- Fleshed out Cuda version checking and compiling for multiple arches · fb7d4e1d
  Michael Carilli authored May 25, 2018
  
14 May, 2018 1 commit
- Multi-op sequence for ddp_race_condition_test.py · cc8f03c8
  Michael Carilli authored May 14, 2018
  
07 May, 2018 1 commit
- Fix race condition in DDP. · 7c2ae41e
  Christian Sarofeen authored May 01, 2018
  
25 Apr, 2018 2 commits
- Cleaned comments in fp16_utils and csrc. Keeping comments that are non-docstring but informative. · a3e2776a
  Michael Carilli authored Apr 25, 2018
  
- Initial release · 2fa4dbaf
  Christian Sarofeen authored Apr 25, 2018
  