1. 29 Oct, 2018 1 commit
    • Merging in fused adam optimizer, additional DDP features tested in 18.10 (#60) · e0bc5d62
      mcarilli authored
      * test passes
      
      * notes
      
      * Using C++-side flatten and unflatten functions
      
      * Adding csrc
      
      * Persistent synchronization event so it doesn't need to be created and destroyed each time
      
      * Interop with parameter flattening in SSD
      
      * Added deterministic option to imagenet main.py
      
      * Adding options to split gradient averaging and allreduce in pure fp32
      
      * Fixing allreduce_maybe_retain call
      
      * Fixing allreduce_fallback
      
      * Also sync active_i_buckets from rank 0
      
      * Making retain_allreduce_buffers compatible with/orthogonal to delay_allreduce=True|False
      
      * Correcting syntax error, now all seems to work with SSD
      
      * Optional cpp extension build
      
      * Add mixed precision adam optimizer (#59)
      
      * Add FusedAdam Optimizer to Apex that places all the math into a CUDA kernel (a usage sketch follows this commit entry).
      
      * Added fixes to fused_adam to get it to work with the network.
      
      * WIP on the Python interface for Adam with options
      
      * fix dispatch for halves, add Python options to handle optional half gradients and params
      
      * cleanup, get rid of grid-stride loop
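      The FusedAdam optimizer added in this commit performs the Adam update math in a single CUDA kernel instead of a chain of elementwise ops. A minimal usage sketch follows; it assumes Apex was installed with the optional CUDA extension and that the optimizer is exposed as apex.optimizers.FusedAdam with the usual Adam hyperparameters (the exact constructor signature is an assumption, not confirmed by this log).

      ```python
      # Sketch: swapping torch.optim.Adam for the fused optimizer.
      # Assumes apex was built with `python setup.py install --cuda_ext`
      # so the fused CUDA kernel is available.
      import torch
      import apex

      model = torch.nn.Linear(1024, 1024).cuda()
      criterion = torch.nn.MSELoss()

      # Hypothetical but conventional arguments; the fused kernel applies the
      # moment updates, bias correction, and weight update in one pass.
      optimizer = apex.optimizers.FusedAdam(model.parameters(), lr=1e-3,
                                            betas=(0.9, 0.999), eps=1e-8)

      x = torch.randn(32, 1024, device='cuda')
      y = torch.randn(32, 1024, device='cuda')

      loss = criterion(model(x), y)
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()
      ```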
  2. 23 Oct, 2018 1 commit
    • [syncBN] (#48) · 81eef1ef
      jjsjann123 authored
      * [syncBN]
        added syncBN in native pure Python Apex
        added fused CUDA kernels for sync BN, using Welford's algorithm for mean/var
          (see the Welford sketch after this entry); optional installation via
          'python setup.py install --cuda_ext'
        added a unit test with a side-by-side comparison between Apex sync BN and
          PyTorch BN. Note that the PyTorch BN output will be slightly off because
          of numerical issues in its mean/var computation.
      
      * [syncBN PR]
        added fp16 support
        addressed review comments on:
          1. updating last pow 2
          2. checking for import errors when importing the syncBN kernel
      
      * [syncBN PR]
        added a convert function to insert SyncBatchNorm (usage sketch after this entry)
        refactored some kernel code
      
      * fixing type issues (fp16/fp32/fp64)
        added Kahan summation
        edited the unit test to use PyTorch primitive ops in double precision; tests now pass with reasonable tolerances
      
      * updating tensor creation calls
      
      * fixing the all_reduce contiguous tensor
      
      * transposed all reduce results
      
      * [syncBN]
        support fp16 input & fp32 layers for Apex fp16
        partially fixed launch configs
        enabled the imagenet example to run with --sync_bn
      
      * [syncBN PR]
      Documentation added
      
      * adjusting README
      
      * adjusting again
      
      * added some doc to imagenet example
      
      * [syncBN]
        warp-level reduction
        bug fix: warp reduction logic updated; check for the dummy element to avoid NaN
        improved launch configs for better reduction kernels; a further improvement
          would be to increase the grid size
      
      * [syncBN]
        fixed undefined behavior in __shfl_down_sync caused by divergent threads in the
          warp reduction
        changed at::native::empty to at::empty (per upstream review comments)
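      The fused sync BN kernels above compute per-channel mean and variance with Welford's online algorithm, which avoids the numerical issues of the naive sum / sum-of-squares approach noted in the unit-test comparison. A minimal single-threaded Python sketch of the update rule (the CUDA kernels parallelize this and merge partial results across threads and ranks; that merging step is not shown here):

      ```python
      def welford_mean_var(values):
          """Running mean/variance via Welford's algorithm (numerically stable)."""
          count = 0
          mean = 0.0
          m2 = 0.0  # running sum of squared deviations from the current mean
          for x in values:
              count += 1
              delta = x - mean
              mean += delta / count
              m2 += delta * (x - mean)  # uses the updated mean
          # Population (biased) variance, as batch norm uses for normalization.
          var = m2 / count if count else float('nan')
          return mean, var

      # Example: statistics for one channel of a small batch.
      mean, var = welford_mean_var([0.1, 0.4, -0.2, 0.3])
      ```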
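      The convert function mentioned above rewrites a model so its BatchNorm layers use the synchronized implementation, which is how the imagenet example runs with --sync_bn. A usage sketch, assuming the helper is exposed as apex.parallel.convert_syncbn_model and the model is then wrapped in Apex's DistributedDataParallel (the helper name is an assumption, not confirmed by this log):

      ```python
      # Sketch: enabling synchronized batch norm in a distributed training script.
      # Assumes torch.distributed.init_process_group(...) has already been called
      # and apex was built with --cuda_ext so the fused kernels are available.
      import torch
      import apex
      import torchvision.models as models

      model = models.resnet50().cuda()

      # Hypothetical helper name: walks the module tree and swaps
      # nn.BatchNorm*d instances for the synchronized implementation.
      model = apex.parallel.convert_syncbn_model(model)

      # Apex's DDP wrapper then handles the gradient all-reduce across ranks.
      model = apex.parallel.DistributedDataParallel(model)
      ```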
  3. 29 Sep, 2018 1 commit
    • Efficient bucketing (#49) · fa183ee8
      mcarilli authored
      * beautiful
      
      * IT'S WORKING
      
      * Hopefully fix race condition for fallback hook
      
      * Updating test
      
      * shared_param -> delayed_allreduce
      
      * Adding a safety check
      
      * One more check
      
      * syntax...
  4. 19 Sep, 2018 1 commit
    • Fix param freezing (#47) · 53e1b61a
      mcarilli authored
      * Fix appears to work in Tomasz's example.
      
      * Somehow shared_param got de-enabled again?
  5. 14 Sep, 2018 2 commits
  6. 06 Sep, 2018 1 commit
  7. 28 Aug, 2018 3 commits
  8. 20 Aug, 2018 1 commit
  9. 19 Aug, 2018 1 commit
  10. 16 Aug, 2018 1 commit
  11. 07 Aug, 2018 1 commit
  12. 23 Jul, 2018 1 commit
  13. 29 Jun, 2018 2 commits
  14. 28 Jun, 2018 1 commit
  15. 14 Jun, 2018 1 commit
  16. 11 Jun, 2018 2 commits
  17. 01 Jun, 2018 1 commit
  18. 27 May, 2018 1 commit
  19. 25 May, 2018 1 commit
  20. 22 May, 2018 1 commit
  21. 21 May, 2018 1 commit
  22. 14 May, 2018 1 commit
  23. 06 May, 2018 1 commit
  24. 03 May, 2018 1 commit