Commits · 4e7a2a8e44577c96dc1f0446c78520244c49738d · OpenDAS / apex

"...git@developer.sourcefind.cn:modelzoo/alphafold2_jax.git" did not exist on "eb93322ba9e65542721fec157cfce6e2b74e0936"

28 Nov, 2023 1 commit
- fix up for torch2.1 · 4e7a2a8e
  flyingdown authored Nov 28, 2023
  
  4e7a2a8e
01 Dec, 2021 1 commit
- Update run_rocm_distributed.sh · 3f3da214
  Hubert Lu authored Dec 01, 2021
  
  3f3da214
19 Nov, 2021 1 commit
- Add unit tests for Apex extensions and distributed Apex · 15498555
  Hubert Lu authored Nov 19, 2021
  
  15498555
25 Jan, 2021 1 commit

fix bugs in syncbn (#46) · 3f49dbf0

Jeff Daily authored Jan 25, 2021

- incorrect use of __shfl_down
- fix warp size assumptions
- update unit tests to exit on failure

3f49dbf0

31 Jul, 2020 1 commit
- skipping bfloat16 mgpu tests (#32) · 8dd19e3b
  Chaitanya Sri Krishna Lolla authored Jul 31, 2020
  
  8dd19e3b
10 Jul, 2020 1 commit

Enable sync batchnorm extension. (#27) · 9c80f6d3

Chaitanya Sri Krishna Lolla authored Jul 10, 2020

* Enable sync batchnorm

* enable syncbn properly

* update the unit tests

* update tests

* update conditions for welford_merge_element

* updated conditions based on comments.

9c80f6d3

06 Jul, 2020 1 commit

[sync BN] (#792) · 1ff54b8f

jjsjann123 authored Jul 06, 2020

* [sync BN]

support non-uniform batch size across process group.

TODO: test should be added once cleaned up.

* updating unit tests

* new unit tests for different inputs

* cleaning

1ff54b8f

03 Jun, 2020 1 commit

bfloat16 support for mgpu (#19) · b0c7d09f

rohithkrn authored Jun 03, 2020

* bfloat16 support for apex DDP

* enable mgpu tests for fp16 and bf16

* update Dockerfile

b0c7d09f

06 Nov, 2019 1 commit
- fixing batchnorm 1d input (#590) · 37cdaf4a
  jjsjann123 authored Nov 06, 2019
  
  37cdaf4a
26 Jul, 2019 1 commit

[sbn update] (#384) · 896ecdd6

jjsjann123 authored Jul 12, 2019

fixing empty return from python implementation
  adding proper test to verify functional correctness for python implementation

896ecdd6

12 Jul, 2019 1 commit

[sbn update] (#384) · 574fe244

jjsjann123 authored Jul 12, 2019

fixing empty return from python implementation
  adding proper test to verify functional correctness for python implementation

574fe244

01 May, 2019 1 commit
- allreduce_different_streams is now hidden · 72bce160
  Michael Carilli authored May 01, 2019
  
  72bce160
04 Apr, 2019 1 commit

WIP: Handle arbitrary combinations of optimizers/models/losses (#232) · 3f87614f

mcarilli authored Apr 03, 2019

* Refactor to allow more flexible treatment of multiple optimizers/models/losses

* Adding _process_optimizers.py

* Created L0 tests (now passing).

* fix: minor print typo (#234)

* make L1 results easier to read

* L0 multiple model/optimizer/loss test fleshed out

* Adding test that master params remain synced across distributed processes

* Docstring updates

* Docstring updates

3f87614f

12 Mar, 2019 1 commit
- Moving test_groups.py · d27a321a
  Michael Carilli authored Mar 12, 2019
  
  d27a321a
26 Feb, 2019 1 commit
- No need for casts during optimizer step · 613997ea
  Michael Carilli authored Feb 26, 2019
  
  613997ea
03 Feb, 2019 1 commit
- Lazy imports to reduce error spam · 48299b0d
  Michael Carilli authored Feb 02, 2019
  
  48299b0d
01 Nov, 2018 1 commit
- Adding switch to control averaging of gradients. · efc561ba
  Michael Carilli authored Nov 01, 2018
  
  efc561ba
29 Oct, 2018 1 commit

Merging in fused adam optimizer, additional DDP features tested in 18.10 (#60) · e0bc5d62

mcarilli authored Oct 29, 2018

* test passes

* notes

* Using C++-side flatten and unflatten functions

* Adding csrc

* Persistent synchronization event so it doesn't need to be created and destroyed each time

* Interop with parameter flattening in SSD

* Added deterministic option to imagenet main.py

* Adding options to split gradient averaging and allreduce in pure fp32

* Fixing allreduce_maybe_retain call

* Fixing allreduce_fallback

* Also sync active_i_buckets from rank 0

* Making retain_allreduce_buffers compatible with/orthogonal to delay_allreduce=True|False

* Correcting syntax error, now all seems to work with SSD

* Optional cpp extension build

* Add mixed precision adam optimizer (#59)

* Add FusedAdam Optimizer to Apex that places all the math into a cuda kernel.

* Added fixes to fused_adam to get it to work with network.

* wip work on python interface for adam with options

* fix dispatch for halfs, add python options to handle optional half gradients and params

* cleanup, get rid of grid-stride loop

e0bc5d62

29 Sep, 2018 2 commits
- Clean up race condition test, need to figure out a clean way to create distributed unit tests · 9d731777
  Michael Carilli authored Sep 29, 2018
  
  9d731777
- Efficient bucketing (#49) · fa183ee8
  mcarilli authored Sep 28, 2018
```
* beautiful

* IT'S WORKING

* Hopefully fix race condition for fallback hook

* Updating test

* shared_param -> delayed_allreduce

* Adding a safety check

* One more check

* syntax...
```
  fa183ee8
14 May, 2018 1 commit
- Multi-op sequence for ddp_race_condition_test.py · cc8f03c8
  Michael Carilli authored May 14, 2018
  
  cc8f03c8
07 May, 2018 1 commit
- Fix race condition in DDP. · 7c2ae41e
  Christian Sarofeen authored May 01, 2018
  
  7c2ae41e