- 29 Oct, 2018 1 commit
mcarilli authored
* test passes
* notes
* Using C++-side flatten and unflatten functions
* Adding csrc
* Persistent synchronization event so it doesn't need to be created and destroyed each time
* Interop with parameter flattening in SSD
* Added deterministic option to imagenet main.py
* Adding options to split gradient averaging and allreduce in pure fp32
* Fixing allreduce_maybe_retain call
* Fixing allreduce_fallback
* Also sync active_i_buckets from rank 0
* Making retain_allreduce_buffers compatible with/orthogonal to delay_allreduce=True|False
* Correcting syntax error; now all seems to work with SSD
* Optional cpp extension build
* Add mixed precision adam optimizer (#59)
* Add FusedAdam optimizer to Apex that places all the math into a CUDA kernel
* Added fixes to fused_adam to get it to work with a network
* WIP work on the Python interface for Adam with options
* Fix dispatch for halves; add Python options to handle optional half gradients and params
* Cleanup; get rid of grid-stride loop
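The FusedAdam work merged here (#59) moves the entire Adam update into one CUDA kernel. A minimal usage sketch, assuming apex was built with the CUDA extensions ('python setup.py install --cuda_ext') and that the optimizer is exposed as apex.optimizers.FusedAdam as in later apex releases; constructor arguments at this exact commit may differ:

```python
# Rough sketch of using the fused Adam optimizer from #59.
# Assumes apex.optimizers.FusedAdam, as in later apex releases.
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()

# FusedAdam performs the whole Adam update (moments, bias correction,
# parameter step) in a single CUDA kernel instead of many small ops.
optimizer = FusedAdam(model.parameters(), lr=1e-3)

x = torch.randn(32, 1024, device="cuda")
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```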
- 23 Oct, 2018 1 commit
jjsjann123 authored
* [syncBN] Added syncBN in native pure-Python apex, plus fused CUDA kernels used for sync BN. Uses Welford's algorithm for mean/var. Optional installation via 'python setup.py install --cuda_ext'. Added a unit test with a side-by-side comparison between apex sync BN and PyTorch BN. Note that the PyTorch BN output will be slightly off because of numerical issues in its mean/var computation.
* [syncBN PR] Added fp16 support, addressing review comments on: 1. updating last pow 2; 2. looking for the import error when importing the syncBN kernel
* [syncBN PR] Added a convert function to insert SyncBatchNorm; refactored some kernel code
* Fixed type issues (fp16/fp32/fp64); added Kahan summation; edited the unit test to use PyTorch primitive ops with double; passing reasonable tests now
* Updated tensor creation calls
* Fixed the all_reduce contiguous tensor
* Transposed all_reduce results
* [syncBN] Support fp16 input & fp32 layer for apex fp16; partially fixed launch configs; enabled the imagenet example to run with --sync_bn
* [syncBN PR] Documentation added
* Adjusted README
* Adjusted again
* Added some docs to the imagenet example
* [syncBN] Warp-level reduction bug fix: updated the warp reduction logic; check for the dummy element to avoid NaN; improved launch configs for better reduction kernels. A further improvement would be to increase the grid size.
* [syncBN] Fixed undefined behavior in __shfl_down_sync from divergent threads in warp reduction; changed at::native::empty to at::empty (upstream comments)
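A rough sketch of the convert function this PR adds, assuming it is exposed as apex.parallel.convert_syncbn_model as in later apex releases:

```python
# Sketch: swapping a model's BatchNorm layers for apex SyncBatchNorm.
# Assumes the helper is named convert_syncbn_model, per later releases.
import torch
from apex.parallel import convert_syncbn_model

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3),
    torch.nn.BatchNorm2d(64),  # becomes SyncBatchNorm after conversion
    torch.nn.ReLU(),
)

# Walks the module tree and replaces each BatchNorm*d with SyncBatchNorm,
# which reduces mean/var statistics across all distributed processes
# (the fused CUDA kernels compute them with Welford's algorithm).
model = convert_syncbn_model(model)
```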
- 07 Oct, 2018 1 commit
Michael Carilli authored
- 29 Sep, 2018 1 commit
mcarilli authored
* beautiful
* IT'S WORKING
* Hopefully fix race condition for fallback hook
* Updating test
* shared_param -> delayed_allreduce
* Adding a safety check
* One more check
* syntax...
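This series renames the shared_param option to delayed_allreduce. A minimal sketch of how the flag is used, assuming it lands as the delay_allreduce keyword on apex.parallel.DistributedDataParallel, as in the later 29 Oct commit:

```python
# Sketch: forcing apex DDP to allreduce all gradients after the full
# backward pass instead of overlapping bucketed allreduces with it.
# Assumes the flag is named delay_allreduce (per the later commits) and
# the usual torch.distributed launch environment (MASTER_ADDR, etc.).
import torch
from apex.parallel import DistributedDataParallel as DDP

torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(torch.distributed.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(10, 10).cuda()
# Delaying the allreduce is safer for models that reuse (share)
# parameters across the graph, at the cost of losing overlap between
# communication and the backward computation.
model = DDP(model, delay_allreduce=True)
```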
- 19 Sep, 2018 1 commit
mcarilli authored
* Fix appears to work in Tomasz's example.
* Somehow shared_param got de-enabled again?
- 17 Sep, 2018 1 commit
Christian Sarofeen authored
* Remove some fp16 examples that don't converge. The default static loss scale of 1.0 doesn't converge for resnet50; either remove the example or set a static loss scale of 128 on it, which is known to converge well.
* Update README.md
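For context, static loss scaling just multiplies the loss before backward and divides the gradients back out before the optimizer step. A minimal sketch using the 128 value the commit cites; the model, optimizer, and data here are placeholders, not apex API:

```python
# Sketch: static loss scaling at 128, the value known to converge for
# resnet50 in fp16 (the 1.0 default does not).
import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
loss_scale = 128.0

inputs = torch.randn(8, 16)
targets = torch.randint(0, 4, (8,))

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
# Scale the loss up so small fp16 gradients don't underflow to zero.
(loss * loss_scale).backward()
# Unscale the gradients before the optimizer step.
for p in model.parameters():
    if p.grad is not None:
        p.grad.div_(loss_scale)
optimizer.step()
```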
- 14 Sep, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 06 Sep, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 30 Aug, 2018 1 commit
mcarilli authored
- 28 Aug, 2018 5 commits
Michael Carilli authored
Michael Carilli authored
Christian Sarofeen authored
* Add reducer class in parallel/distributed.
* Separate DDP and Reducer examples.
* Don't confuse DDP and reducer in example.
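Unlike DDP, which hooks gradient allreduce into the backward pass automatically, the Reducer leaves the timing of gradient averaging to the user. A rough sketch, assuming the apex.parallel.Reducer interface of later releases (wrap the module, call reduce() after backward); the exact interface at this commit may differ:

```python
# Sketch: manual gradient averaging with apex.parallel.Reducer.
# Assumes the usual torch.distributed launch environment.
import torch
from apex.parallel import Reducer

torch.distributed.init_process_group(backend="nccl")
model = torch.nn.Linear(10, 10).cuda()
reducer = Reducer(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 10, device="cuda")
loss = model(x).sum()
loss.backward()
reducer.reduce()   # explicitly allreduce (average) gradients now
optimizer.step()
```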
Christian Sarofeen authored
Michael Carilli authored
- 20 Aug, 2018 1 commit
Michael Carilli authored
- 19 Aug, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 16 Aug, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 07 Aug, 2018 1 commit
Christian Sarofeen authored
- 23 Jul, 2018 1 commit
Michael Carilli authored
- 09 Jul, 2018 1 commit
mcarilli authored
- 05 Jul, 2018 1 commit
mcarilli authored
- 29 Jun, 2018 3 commits
Michael Carilli authored
Michael Carilli authored
Josh Romero authored
Fixes to validation in the imagenet example scripts. Precision and loss reporting modified to be consistent with training.
- 28 Jun, 2018 1 commit
Josh Romero authored
- 24 Jun, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 22 Jun, 2018 1 commit
Michael Carilli authored
- 20 Jun, 2018 1 commit
Michael Carilli authored
- 18 Jun, 2018 2 commits
Michael Carilli authored
Michael Carilli authored
- 16 Jun, 2018 1 commit
Michael Carilli authored
- 15 Jun, 2018 3 commits
Michael Carilli authored
Michael Carilli authored
Michael Carilli authored
- 14 Jun, 2018 1 commit
mcarilli authored