Moving gradient division back to after the allreduce. Empirically, it appears underflow is more of a danger than overflow.

Commit fd9b02c0 authored Oct 08, 2018 by Michael Carilli

Showing with 3 additions and 3 deletions

apex/parallel/distributed.py +3 -3

--- a/apex/parallel/distributed.py
+++ b/apex/parallel/distributed.py
@@ -11,13 +11,13 @@ import copy
 def apply_flat_dist_call(bucket, call, extra_args=None):
     coalesced = _flatten_dense_tensors(bucket)
-    if call is dist.all_reduce:
-        coalesced /= dist.get_world_size()
     if extra_args is not None:
         call(coalesced, *extra_args)
     else:
         call(coalesced)
+    if call is dist.all_reduce:
+        coalesced /= dist.get_world_size()
     for buf, synced in zip(bucket, _unflatten_dense_tensors(coalesced, bucket)):
         buf.copy_(synced)
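
The trade-off named in the commit message can be illustrated without any GPU or distributed setup. The sketch below is not part of the commit: it assumes half-precision gradients, emulates fp16 rounding with the standard-library `struct` module, and models the allreduce as a sequential sum that rounds to fp16 after every addition (a real backend may accumulate differently). The helper names `to_fp16`, `fp16_sum`, `average_divide_first`, and `average_divide_last` are hypothetical.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; model that as inf.
        return math.copysign(math.inf, x)


def fp16_sum(values):
    """Stand-in for an fp16 allreduce: sum with fp16 rounding at each step."""
    total = 0.0
    for v in values:
        total = to_fp16(total + v)
    return total


def average_divide_first(grads):
    """Divide each rank's gradient by world size, then sum (old ordering)."""
    world_size = len(grads)
    return fp16_sum(to_fp16(g / world_size) for g in grads)


def average_divide_last(grads):
    """Sum across ranks, then divide by world size (ordering in this commit)."""
    world_size = len(grads)
    return to_fp16(fp16_sum(to_fp16(g) for g in grads) / world_size)


tiny = [2.0 ** -24] * 8   # smallest positive fp16 subnormal on each of 8 ranks
large = [2.0 ** 15] * 8   # 32768 on each rank; the fp16 maximum is 65504

tiny_first = average_divide_first(tiny)    # each 2**-27 term rounds to zero
tiny_last = average_divide_last(tiny)      # sum is 2**-21, so the average survives
large_first = average_divide_first(large)  # terms are 4096, sum stays in range
large_last = average_divide_last(large)    # running sum passes 65504 and becomes inf
```

Under these assumptions, dividing first loses the small gradient entirely (underflow), while dividing last loses the large one (overflow). The commit chooses the second ordering on the empirical grounds that underflow is the more common problem.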