Unverified Commit 8fac3a72 authored by msbaines, committed by GitHub

Fix contrib fused_adam to work correctly with multi-GPU (#752)



The CUDA kernel used by fused-adam was using the default stream on the
default device. The kernel needs to use the same device as the
parameter tensor.

Fixed by using a context manager to set the correct default device. For
the use_mt case, an error is raised if tensors span multiple devices.
Alternatively, the use_mt case could launch one kernel per CUDA device.

The non-contrib version will also need to be fixed.
Co-authored-by: Mandeep Singh Baines <msb@fb.com>
parent 80b90b9d
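
Below is a minimal sketch of the device-guard pattern the fix relies on;
update_kernel is a hypothetical stand-in for a raw CUDA extension call
such as fused_adam_cuda.adam, not part of the actual change:

import torch

def fused_update(p, update_kernel):
    # A raw CUDA extension call launches on whatever device is
    # current (cuda:0 by default), even when p lives on another GPU.
    # The context manager makes p's device current for the duration
    # of the call, so the kernel and the default stream it runs on
    # belong to the same GPU as the parameter tensor.
    with torch.cuda.device(p.device):
        update_kernel(p.data)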
@@ -130,6 +130,7 @@ class FusedAdam(torch.optim.Optimizer):
                     tensorlists = [[],[],[],[],[]]
                 else:
                     tensorlists = [[],[],[],[]]
+                tensordevice = None
             for p, grad, output_param in zip(group['params'], grads_this_group, output_params_this_group):
                 #note: p.grad should not ever be set for correct operation of mixed precision optimizer that sometimes sends None gradients
@@ -163,7 +164,14 @@ class FusedAdam(torch.optim.Optimizer):
                     for tl, t in zip(tensorlists, pl):
                         tl.append(t)
+                    if tensordevice is None:
+                        tensordevice = p.device
+                    elif tensordevice != p.device:
+                        raise RuntimeError('FusedAdam does not support use_mt with tensors on multiple device')
                 else:
+                    with torch.cuda.device(p.device):
                         fused_adam_cuda.adam(p.data,
                                              out_p,
                                              exp_avg,
@@ -180,6 +188,7 @@
                                              group['weight_decay'])
         if self._use_multi_tensor:
+            with torch.cuda.device(tensordevice):
                 multi_tensor_applier(
                     fused_adam_cuda.adam_mt,
                     self._overflow_buf,
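
For reference, a usage sketch of what the fix enables; the import path
is an assumption and requires fairscale to be built with the
fused_adam_cuda extension:

import torch
from fairscale.optim import FusedAdam  # assumed import path

# Parameters live entirely on a non-default GPU; with this change the
# per-parameter torch.cuda.device guard launches the Adam kernel on
# cuda:1 instead of the default device.
model = torch.nn.Linear(1024, 1024).to("cuda:1")
opt = FusedAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 1024, device="cuda:1")).sum()
loss.backward()
opt.step()

With use_mt enabled, parameters spread across multiple devices now raise
the RuntimeError added above rather than silently launching on the wrong
device.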