Unverified Commit 21464e05 authored by Benjamin Lefaudeux, committed by GitHub

[feat] Gossip/SlowMo (#378)



Add SlowMo Distributed Data Parallel for clusters with slow interconnects
Co-authored-by: Vinayak Tantia <tantia.vinayak1@gmail.com>
parent 8347c1a2
...@@ -23,6 +23,7 @@ test-results/
# Environments
.env
.venv
.vscode
env/
venv/
ENV/
...
...@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
and gradient memory to be sharded despite being needed from different layers due to
weight sharing. [#836]
- [MEVO]: a custom layer to help big vocab trainings. Experimental. Docs is still TBD. [#840]
- SlowMoDistributedDataParallel [feature] [experimental]: a distributed training wrapper for clusters with slow network interconnects (e.g., Ethernet), where it improves performance compared to Distributed Data Parallel. [#378]
## [0.4.1] - 2021-09-17
### Fixed
...
SlowMo Distributed Data Parallel
================================
.. autoclass:: fairscale.experimental.nn.data_parallel.SlowMoDistributedDataParallel
:members:
:undoc-members:
:exclude-members: eval, forward, load_state_dict, state_dict, train, training
...@@ -12,3 +12,4 @@ API Reference
nn/fsdp
nn/checkpoint/checkpoint_activations
experimental/nn/offload_model
experimental/nn/slowmo_ddp
...@@ -92,6 +92,19 @@ master_doc = "index"
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = True
# List of custom sections allowed. This is especially useful when the argument
# list of a constructor or function is very long, since it lets us split the
# arguments into separate sections and makes them easier to understand.
napoleon_custom_sections = [
("SlowMo Parameters", "params_style"),
("LocalSGD Parameters", "params_style"),
("SGP Parameters", "params_style"),
("Debugging Parameters", "params_style"),
("Parameters for Advanced Users", "params_style"),
]
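# Example (hypothetical docstring, for illustration only): with the custom sections
# registered above, a constructor docstring can group its arguments as
#
#     SlowMo Parameters:
#         slowmo_momentum (float): momentum used for the slow momentum step
#         slowmo_frequency (int): number of iterations between slow momentum steps
#
# and napoleon will render "SlowMo Parameters" with the same styling as a regular
# "Parameters" section.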
# -- Options for HTML output -------------------------------------------------
...
SlowMo Distributed Data Parallel
================================
Training neural networks in a distributed data-parallel manner results in non-linear scaling (slowdown) due to the time spent on communication
between the different nodes (and, to a lesser extent, on synchronization between them). As a result, a distributed training run
with 8 nodes is not 8x faster than a run with 1 node, as we would ideally expect it to be.
SlowMo Distributed Data Parallel aims to solve this by replacing the typical exact allreduce of gradients with an approximate
averaging of parameters. This approximate averaging reduces both the time spent on communication and the synchronization overhead between
nodes. It uses one of the following two algorithms (configurable) as a base algorithm for this purpose:
* Local SGD (papers `#1 <https://arxiv.org/abs/1602.05629>`_ and `#2 <https://arxiv.org/abs/1705.09056>`_). This algorithm does an allreduce of the parameters every few iterations.
* `Stochastic Gradient Push <https://arxiv.org/abs/1811.10792>`_ (SGP). This algorithm involves one-to-one communications between nodes.
When used by themselves, these base algorithms (LocalSGD and SGP) result in reduced model quality (measured as accuracy in a classification
setting). The `SlowMo <https://arxiv.org/abs/1910.00643>`_ algorithm alleviates this issue by performing a slow momentum step, typically every 48 iterations.
The training process with SlowMo looks as follows:
1. Compute the forward pass.
2. Compute the backward pass.
3. During the backward pass, a backward hook on each node synchronizes the gradients across the different GPUs on
that node using allreduce.
4. Perform the ``optimizer.step()`` to update parameters on each node with the gradients of that node.
5. Approximately average the parameters using a base algorithm - one of LocalSGD or SGP (both are described above).
6. Perform the slow momentum update step once every ``slowmo_frequency`` (typically 48) iterations. In this step, the parameters on different
nodes are (exactly) averaged, followed by a ``slowmo_optimizer.step()``. Note that this ``slowmo_optimizer`` is different from the original optimizer,
and its update is performed in a `Zero-1 <./oss_sdp_fsdp.html>`_ like manner to save memory. A schematic sketch of the resulting training step follows.
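Putting these steps together, a single training iteration looks roughly as follows. This is a minimal sketch that reuses the API calls shown
in the tutorial (``perform_slowmo()`` and ``zero_grad(set_to_none=True)``); the model is assumed to be already wrapped in
``SlowMoDistributedDataParallel``, and the data loading, loss, and optimizer setup are omitted.

.. code-block:: python

    for data, target in dataloader:
        loss = loss_fn(model(data), target)  # step 1: forward pass
        loss.backward()                      # steps 2-3: backward pass; a backward hook
                                             # allreduces gradients across the GPUs of this node
        optimizer.step()                     # step 4: local parameter update
        model.zero_grad(set_to_none=True)    # free gradient memory before the SlowMo step
        model.perform_slowmo(optimizer)      # steps 5-6: approximate averaging (LocalSGD/SGP)
                                             # plus the slow momentum step every
                                             # slowmo_frequency iterations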
Best practices for using ``SlowMoDistributedDataParallel``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. SlowMo will be useful in deep learning workloads that run on more than 2 nodes in clusters with a slow interconnect, e.g., Ethernet.
2. SlowMo should be useful in your workload if the following condition holds:
:math:`\textrm{time_taken_for_all_reduce_of_gradients} \times (1 - \frac{1}{\textrm{localsgd_frequency}} ) > \textrm{time_taken_for_backward_pass}`
Notes:
* If you are using SGP as the base algorithm, plug in 2 as the value of ``localsgd_frequency``.
* The formula above is a simplified version of:
:math:`\textrm{time_taken_for_all_reduce_of_gradients} > \textrm{time_taken_for_backward_pass} + \frac{\textrm{time_taken_for_all_reduce_of_gradients}}{\textrm{localsgd_frequency}}`
The left and right hand sides denote the total backward duration (combining the computation of gradients in the backward pass and the
communication cost) for DDP and SlowMo DDP, respectively. Since DDP overlaps the computation of gradients with their communication, it is
bottlenecked by the latter. In contrast, there is an extra ``time_taken_for_backward_pass`` on the right hand side because we do not
overlap the backward pass with communication in the current implementation of SlowMo.
* In clusters with a slower interconnect, ``time_taken_for_all_reduce_of_gradients`` goes up, making SlowMo more useful. ``localsgd_frequency``
is also an important factor here. More details on varying it to affect performance are in tip 2 of
`Performance tips for SlowMoDistributedDataParallel`_. A small helper that evaluates this condition is sketched after this list.
3. ``slowmo_momentum`` will need to be tuned to obtain good model quality. A grid search across {0.0, 0.1, 0.2, 0.4, 0.6} should be good enough
for tuning. This ``slowmo_momentum`` value stays consistent across multiple runs with similar settings. When the number of nodes used is increased,
however, a higher value of ``slowmo_momentum`` is typically needed. More details about this can be found in the
`documentation <../api/experimental/nn/slowmo_ddp.html>`_.
4. Adding SlowMo to existing Distributed Data Parallel code involves two steps, which can be found in the `tutorial <../tutorials/slowmo_ddp.html>`_.
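To make the condition in tip 2 concrete, it can be written out as a small helper. This is an illustrative sketch only; the helper and the
timing values below are hypothetical and not part of the fairscale API.

.. code-block:: python

    def slowmo_likely_to_help(
        allreduce_time_s: float,      # measured time for an allreduce of all gradients
        backward_time_s: float,       # measured time for the backward pass
        localsgd_frequency: int = 3,  # plug in 2 when SGP is the base algorithm
    ) -> bool:
        # DDP's per-step cost is bottlenecked by gradient communication, while SlowMo
        # pays for the (non-overlapped) backward pass plus a reduced communication cost.
        ddp_cost = allreduce_time_s
        slowmo_cost = backward_time_s + allreduce_time_s / localsgd_frequency
        return ddp_cost > slowmo_cost

    # Example: a 900 ms allreduce and a 300 ms backward pass with the default frequency of 3
    # gives 0.9 > 0.3 + 0.3, so SlowMo is expected to help.
    assert slowmo_likely_to_help(0.9, 0.3)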
Performance tips for ``SlowMoDistributedDataParallel``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. ``nprocs_per_node`` should be set to the number of GPUs on a node (this number should be the same on each node). This allows the API
to exploit the fast interconnect between different GPUs on a node.
2. Increasing ``localsgd_frequency`` results in an increase in speed, but at the cost of some model quality.
We recommend keeping ``localsgd_frequency`` at 3 (see the configuration sketch after this list).
3. ``slowmo_memory_efficient`` should typically be used (this is the default behavior). It reduces memory usage by sharding the additional
slow momentum optimizer's parameters in a `Zero-1`_ like manner.
4. A call to ``model.zero_grad(set_to_none=True)`` should be made after ``optimizer.step()`` in order to save memory for the
``model.perform_slowmo()`` step. More details about this can be found in the
`documentation for perform_slowmo() <../api/experimental/nn/slowmo_ddp.html#:~:text=net.perform_slowmo(optimizer)-,perform_slowmo,-(optimizer%3A%20torch.optim>`_.
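Taken together, the tips above translate into a wrapper configuration along the following lines. This is a sketch with illustrative argument
values; please refer to the `documentation <../api/experimental/nn/slowmo_ddp.html>`_ for the authoritative list of parameters and defaults.

.. code-block:: python

    import torch
    from fairscale.experimental.nn.data_parallel import SlowMoDistributedDataParallel

    model = SlowMoDistributedDataParallel(
        torch.nn.Linear(32, 32).cuda(),             # any module; a small layer as a stand-in
        nprocs_per_node=torch.cuda.device_count(),  # tip 1: number of GPUs on this node
        localsgd_frequency=3,                       # tip 2: the recommended default
        slowmo_momentum=0.5,                        # tune per workload (see best practices above)
        # tip 3: slowmo_memory_efficient defaults to True, so it is not passed explicitly
    )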
...@@ -42,6 +42,7 @@ modules and easy to use APIs.
deep_dive/adascale
deep_dive/pipeline_parallelism
deep_dive/activation_checkpointing
deep_dive/slowmo_ddp
|
|
...@@ -56,6 +57,7 @@ modules and easy to use APIs.
tutorials/adascale
tutorials/pipe
tutorials/layer_memory_tracking
tutorials/slowmo_ddp
|
|
...
Efficient Data Parallel Training with SlowMo Distributed Data Parallel
======================================================================
SlowMo Distributed Data Parallel reduces the communication between different
nodes while performing data-parallel training. It is mainly useful on
clusters with low interconnect speeds between nodes. When using
SlowMo, the models on the different nodes are no longer kept in sync after each
iteration, which affects the optimization dynamics. The end
result is close to that of Distributed Data Parallel, but not exactly
the same.
If you have code that is set up to use Distributed Data Parallel, switching to SlowMo Distributed Data Parallel
simply requires replacing the DDP call with a call to
``fairscale.experimental.nn.data_parallel.SlowMoDistributedDataParallel``, and adding a
``model.perform_slowmo(optimizer)`` call after ``optimizer.step()``, preceded by
``model.zero_grad(set_to_none=True)`` in order to reduce peak memory usage.
The different points at which ``use_slowmo`` is used below help demonstrate these changes:
.. code-block:: python

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    from fairscale.experimental.nn.data_parallel import SlowMoDistributedDataParallel as SlowMoDDP

    def train(
        rank: int,
        world_size: int,
        epochs: int,
        use_slowmo: bool):

        # process group init
        dist_init(rank, world_size)

        # Problem statement
        model = MyAwesomeModel().to(rank)
        if use_slowmo:
            # Wrap the model into SlowMoDDP
            model = SlowMoDDP(model, slowmo_momentum=0.5, nprocs_per_node=8)
        else:
            model = DDP(model, device_ids=[rank])

        dataloader = MySuperFastDataloader()
        loss_fn = MyVeryRelevantLoss()
        optimizer = MyAmazingOptimizer()

        # Any relevant training loop, with a line at the very end specific to SlowMoDDP, e.g.:
        model.train()
        for e in range(epochs):
            for data, target in dataloader:
                data, target = data.to(rank), target.to(rank)
                # Train
                outputs = model(data)
                loss = loss_fn(outputs, target)
                loss.backward()
                optimizer.step()
                model.zero_grad(set_to_none=use_slowmo)  # free memory for the perform_slowmo() call below
                if use_slowmo:
                    model.perform_slowmo(optimizer)
In the example above, when using SlowMoDDP, we reduce the total communication between
nodes by a factor of 3, since the default ``localsgd_frequency`` is 3.
SlowMoDDP takes in ``slowmo_momentum`` as a parameter. This parameter may need to be tuned
depending on your use case. It also takes in ``nprocs_per_node``, which should typically be set
to the number of GPUs on a node. Please look at the
`documentation <../api/experimental/nn/slowmo_ddp.html>`_
for more details on these parameters as well as other advanced settings of the SlowMo algorithm.
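For completeness, the ``train`` function above can be launched with ``torch.multiprocessing`` along the following lines. This is a sketch
only: it assumes a single node, and the process-group setup is still handled by the ``dist_init`` helper used in the example above.

.. code-block:: python

    import torch
    import torch.multiprocessing as mp

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()  # one process per GPU on this node
        mp.spawn(
            train,                        # each process calls train(rank, *args)
            args=(world_size, 10, True),  # epochs=10, use_slowmo=True
            nprocs=world_size,
            join=True,
        )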
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
from .gossip import SlowMoBaseAlgorithm, SlowMoDistributedDataParallel # noqa
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
from .distributed import SlowMoBaseAlgorithm, SlowMoDistributedDataParallel
from .gossiper import PushPull, PushSum
from .graph_manager import (
DynamicBipartiteExponentialGraph,
DynamicBipartiteLinearGraph,
DynamicDirectedExponentialGraph,
DynamicDirectedLinearGraph,
GraphManager,
NPeerDynamicDirectedExponentialGraph,
RingGraph,
)
from .mixing_manager import MixingManager, UniformMixing
from .utils import communicate
from .utils.cuda_metering import CudaEventRecorder
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
"""
Gossipers
:description: Gossipers are designed for multi-peer communication (i.e., send
and recv from multiple peers at each iteration)
"""
from enum import Enum
import logging
from typing import Iterator, List, Optional, Tuple, cast
import torch
import torch.distributed as dist
from .graph_manager import GraphManager
from .mixing_manager import MixingManager, UniformMixing
class dist_backend(str, Enum):
UNDEFINED = "undefined"
TCP = "tcp"
MPI = "mpi"
GLOO = "gloo"
NCCL = "nccl"
class Gossiper(object):
""" Generic gossip averaging object for multi-peer communication
Args:
msg (torch.Tensor): message used to initialize recv buffer
graph (GraphManager): Subclass of GraphManager
device (torch.device): device on which to initialize recv buffer
mixing (MixingManager): Subclass of MixingManager
logger (logging.Logger): Module used to log results
rank (int): Rank of the current process
world_size (int): World size of the current process
"""
def __init__(
self,
msg: torch.Tensor,
graph: GraphManager,
device: Optional[torch.device] = None,
mixing: MixingManager = None,
logger: logging.Logger = None,
rank: Optional[int] = None,
world_size: Optional[int] = None,
) -> None:
"""
Initialize generic averaging class designed for multi-peer comms
"""
self.logger = logger
if rank is None or world_size is None:
assert dist.is_initialized()
# for now p2p communication only supported with tcp and mpi
assert dist.get_backend() != dist_backend.GLOO
assert dist.get_backend() != dist_backend.NCCL
rank = dist.get_rank()
world_size = dist.get_world_size()
# graph topology properties
self.rank = rank
self.world_size = world_size
assert isinstance(graph, GraphManager)
self._graph_manager = graph
self.peers_per_itr_device = torch.tensor([self._graph_manager.peers_per_itr], device=device, dtype=msg.dtype)
# This might need to be made float16 later on
self.passive = self._graph_manager.is_passive()
self.refresh_peers_(rotate=False) # sets in- and out-peers attributes
# mixing matrix
if mixing is None:
mixing = UniformMixing(self._graph_manager, device)
assert isinstance(mixing, MixingManager)
self._mixing_manager = mixing
self.refresh_mixing_weights_() # sets mixing-weights attribute
# regular ==> we don't need to keep track of ps-weight explicitly
self.regular = self._mixing_manager.is_regular()
# msg buffers used during send/recv
self.device = device if device is not None else msg.device
self.out_msg_buffer: List[Tuple[dist.Work, torch.Tensor]] = []
self.in_msg_buffer = msg.clone().detach_().to(self.device)
self._ps_weight: torch.Tensor = torch.ones(1, dtype=msg.dtype).detach_().to(self.device)
# not using regular comms ==> need to communicate ps-weight
if not self.regular:
self.in_msg_buffer = torch.cat([self.in_msg_buffer, self.ps_weight])
if self.device.type == "cpu":
try:
self.in_msg_buffer = self.in_msg_buffer.pin_memory()
except Exception as e:
if self.logger is not None:
self.logger.error(e)
else:
raise
self.placeholder = self.in_msg_buffer.clone()
@property
def ps_weight(self) -> torch.Tensor:
return self._ps_weight
@ps_weight.setter
def ps_weight(self, v: torch.Tensor) -> None:
self._ps_weight.data[0] = v
@property
def peers_per_itr(self) -> int:
return self._graph_manager.peers_per_itr
@peers_per_itr.setter
def peers_per_itr(self, v: int) -> None:
self._graph_manager.peers_per_itr = v
def refresh_peers_(self, rotate: Optional[bool] = None) -> None:
""" Update in- and out-peers """
if rotate is None:
rotate = self._graph_manager.is_dynamic_graph()
# cannot cycle peers in a static graph
assert not (rotate and not self._graph_manager.is_dynamic_graph())
self.out_edges, self.in_edges = self._graph_manager.get_edges(rotate)
def refresh_mixing_weights_(self, residual_adjusted: bool = False) -> None:
""" Update mixing-matrix weights """
self.mixing_weights = self._mixing_manager.get_mixing_weights(residual_adjusted)
def mix_out_msg_(self, out_msg: torch.Tensor, ps_weight: torch.Tensor) -> Iterator[torch.Tensor]:
""" Returns a generator mixing messages on the fly """
self.refresh_mixing_weights_(residual_adjusted=True)
self.ps_weight = ps_weight
# check whether or not we need to communicate ps_weight
if not self.regular:
out_msg = torch.cat([out_msg, cast(torch.Tensor, self.ps_weight.type(out_msg.dtype))])
# check whether or not we need to create a buffer for each out-msg
if self._mixing_manager.is_uniform():
weight = self.mixing_weights["uniform"]
out_msg *= weight.type(out_msg.dtype)
for _ in self.out_edges:
yield out_msg
else:
for out_edge in self.out_edges:
weight = self.mixing_weights[out_edge.dest]
yield out_msg.mul(weight.type(out_msg.dtype)) # type: ignore
def clean_msg_buffers_(self) -> None:
""" Clean outgoing message buffer """
while len(self.out_msg_buffer) > 0:
req, msg = self.out_msg_buffer.pop()
req.wait()
msg.set_()
def parse_in_msg_buffer(self) -> Tuple[torch.Tensor, torch.Tensor]:
""" Parse in-msg buffer and return msg and ps-weight separately """
msg = self.in_msg_buffer
if not self.regular:
return msg.narrow(0, 0, len(msg) - 1), msg[-1]
else:
return msg, self.ps_weight * self.peers_per_itr_device
def mix(self, out_msg: torch.Tensor, ps_weight: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
""" Single gossip step """
raise NotImplementedError
class PushSum(Gossiper):
""" 1-peer Push-Sum consensus averaging module """
def mix(self, out_msg: torch.Tensor, ps_weight: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
""" Consensus averaging step """
# out_msg must be on the correct device
assert out_msg.device.type == self.device.type
if self.logger is not None:
self.logger.debug("in/out -peers {}/{}".format(self.in_edges, self.out_edges))
# prepare messages for gossip
mixed_out_msgs = self.mix_out_msg_(out_msg, ps_weight)
# non-blocking send
for out_edge in self.out_edges:
msg = next(mixed_out_msgs)
assert self.rank == out_edge.src
req = dist.broadcast(tensor=msg, src=out_edge.src, group=out_edge.process_group, async_op=True,)
self.out_msg_buffer.append((req, msg))
# blocking recv w/ some code optimization to avoid buffer prep overhead
if len(self.in_edges) == 1:
in_edge = self.in_edges[0]
dist.broadcast(tensor=self.in_msg_buffer, src=in_edge.src, group=in_edge.process_group)
# regular non-blocking recv
else:
# prepare in-msg buffer
self.in_msg_buffer.zero_()
for in_edge in self.in_edges:
dist.broadcast(
tensor=self.placeholder, src=in_edge.src, group=in_edge.process_group,
)
self.in_msg_buffer.add_(self.placeholder) # type: ignore
self.refresh_peers_()
self.clean_msg_buffers_()
return self.parse_in_msg_buffer()
class PushPull(Gossiper):
""" Doubly-stochastic consensus averaging module """
def mix(self, out_msg: torch.Tensor, ps_weight: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
# out_msg must be on the correct device
assert out_msg.device.type == self.device.type
if self.logger is not None:
self.logger.debug("in/out -peers {}/{}".format(self.in_edges, self.out_edges))
# prepare messages for gossip
mixed_out_msgs = self.mix_out_msg_(out_msg, ps_weight)
# send-recv w/ some code optimization to avoid buffer prep overhead
if len(self.in_edges) == 1 and len(self.out_edges) == 1:
out_edge, in_edge = self.out_edges[0], self.in_edges[0]
msg = next(mixed_out_msgs)
if not self.passive:
dist.broadcast(tensor=msg, src=out_edge.src, group=out_edge.process_group)
dist.broadcast(
tensor=self.in_msg_buffer, src=in_edge.src, group=in_edge.process_group,
)
else:
dist.broadcast(
tensor=self.in_msg_buffer, src=in_edge.src, group=in_edge.process_group,
)
dist.broadcast(tensor=msg, src=out_edge.src, group=out_edge.process_group)
# regular send-recv
else:
# prepare in-msg buffer
self.in_msg_buffer.zero_()
# send-recv
for out_edge, in_edge in zip(self.out_edges, self.in_edges):
msg = next(mixed_out_msgs)
if not self.passive:
dist.broadcast(tensor=msg, src=out_edge.src, group=out_edge.process_group)
dist.broadcast(
tensor=self.placeholder, src=in_edge.src, group=in_edge.process_group,
)
else:
dist.broadcast(
tensor=self.placeholder, src=in_edge.src, group=in_edge.process_group,
)
dist.broadcast(tensor=msg, src=out_edge.src, group=out_edge.process_group)
self.in_msg_buffer.add_(self.placeholder) # type: ignore
self.refresh_peers_()
self.clean_msg_buffers_()
return self.parse_in_msg_buffer()
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
"""
Graph Manager Class
:description: Class provides an API for loading different peer-to-peer
communication topologies, and cycling through peers.
"""
from abc import ABC, abstractmethod
from math import log as mlog
from typing import List, Optional, Tuple
import torch
import torch.distributed as dist
class Edge(object):
def __init__(self, local_master_rank: int, dest: int, src: int, local_rank: int) -> None:
self.src = src
self.dest = dest
self.process_group = dist.new_group([src, dest])
if local_master_rank in [self.src, self.dest] and local_rank == 0:
initializer_tensor = torch.Tensor([1]).cuda()
dist.all_reduce(initializer_tensor, group=self.process_group)
initializer_tensor = torch.Tensor([1]).cuda().half()
dist.all_reduce(initializer_tensor, group=self.process_group)
class GraphManager(ABC):
def __init__(
self, rank: int, world_size: int, nprocs_per_node: int = 1, local_rank: int = 0, peers_per_itr: int = 1
) -> None:
assert int(peers_per_itr) >= 1
self.rank = rank
self.world_size = world_size
self.phone_book: List[List[Edge]] = [[] for _ in range(self.world_size)]
self._peers_per_itr = peers_per_itr
self._group_indices = list(range(peers_per_itr))
self.nprocs_per_node = nprocs_per_node
self.local_rank = local_rank
self._make_graph()
@property
def peers_per_itr(self) -> int:
return self._peers_per_itr
@peers_per_itr.setter
def peers_per_itr(self, v: int) -> None:
self._peers_per_itr = v
# set group-indices attr. --- point to out-peers in phone-book
self._group_indices = list(range(v))
@abstractmethod
def _make_graph(self) -> None:
"""
Returns a nested list of peers; the outer-list is indexed by rank,
the inner list denotes the set of peers that 'rank' can send
messages to at any point in time
"""
raise NotImplementedError
def _add_peers(self, rank: int, peers: List[int]) -> None:
for peer in peers:
if peer not in self.phone_book[rank]:
self.phone_book[rank].append(
Edge(
local_master_rank=(self.rank * self.nprocs_per_node),
dest=(peer * self.nprocs_per_node),
src=(rank * self.nprocs_per_node),
local_rank=self.local_rank,
)
)
@abstractmethod
def is_regular_graph(self) -> bool:
""" Whether each node has the same number of in-peers as out-peers """
raise NotImplementedError
@abstractmethod
def is_bipartite_graph(self) -> bool:
""" Whether graph is bipartite or not """
raise NotImplementedError
@abstractmethod
def is_passive(self, rank: Optional[int] = None) -> bool:
""" Whether 'rank' is a passive node or not """
raise NotImplementedError
@abstractmethod
def is_dynamic_graph(self) -> bool:
""" Whether the graph-type is dynamic (as opposed to static) """
raise NotImplementedError
def get_peers(self, rotate: bool = False) -> Tuple[List[int], List[int]]:
""" Returns the out and in-peers corresponding to 'self.rank' """
# cycle through in- and out-peers by updating group-index
if rotate:
self._rotate_group_indices()
# get out- and in-peers using new group-indices
out_peers, in_peers = [], []
for group_index in self._group_indices:
out_peers.append(self.phone_book[self.rank][group_index].dest)
for rank, peers in enumerate(self.phone_book):
if rank == self.rank:
continue
if self.rank * self.nprocs_per_node == peers[group_index].dest:
in_peers.append(rank)
return out_peers, in_peers
def get_edges(self, rotate: bool = False) -> Tuple[List[Edge], List[Edge]]:
""" Returns the pairwise process groups between rank and the out and
in-peers corresponding to 'self.rank' """
# cycle through in- and out-peers by updating group-index
if rotate:
self._rotate_group_indices()
# get out- and in-peers using new group-indices
out_edges, in_edges = [], []
for group_index in self._group_indices:
out_edges.append(self.phone_book[self.rank][group_index])
for rank, edges in enumerate(self.phone_book):
if rank == self.rank:
continue
if self.rank * self.nprocs_per_node == edges[group_index].dest:
in_edges.append(self.phone_book[rank][group_index])
return out_edges, in_edges
def _rotate_group_indices(self) -> None:
""" Incerement group indices to point to the next out-peer """
increment = self.peers_per_itr
for i, group_index in enumerate(self._group_indices):
self._group_indices[i] = int((group_index + increment) % len(self.phone_book[self.rank]))
def _rotate_forward(self, r: int, p: int) -> int:
""" Helper function returns peer that is p hops ahead of r """
return (r + p) % self.world_size
def _rotate_backward(self, r: int, p: int) -> int:
""" Helper function returns peer that is p hops behind r """
return (r - p) % self.world_size
class DynamicDirectedExponentialGraph(GraphManager):
def _make_graph(self) -> None:
for rank in range(self.world_size):
for i in range(0, int(mlog(self.world_size - 1, 2)) + 1):
f_peer = self._rotate_forward(rank, 2 ** i)
b_peer = self._rotate_backward(rank, 2 ** i)
self._add_peers(rank, [f_peer, b_peer])
def is_regular_graph(self) -> bool:
return True
def is_bipartite_graph(self) -> bool:
return False
def is_passive(self, rank: Optional[int] = None) -> bool:
return False
def is_dynamic_graph(self) -> bool:
return True
class NPeerDynamicDirectedExponentialGraph(GraphManager):
def _make_graph(self) -> None:
for rank in range(self.world_size):
for i in range(0, int(mlog(self.world_size - 1, self._peers_per_itr + 1)) + 1):
for j in range(1, self._peers_per_itr + 1):
distance_to_neighbor = j * ((self._peers_per_itr + 1) ** i)
f_peer = self._rotate_forward(rank, distance_to_neighbor)
self._add_peers(rank, [f_peer])
def is_regular_graph(self) -> bool:
return True
def is_bipartite_graph(self) -> bool:
return False
def is_passive(self, rank: Optional[int] = None) -> bool:
return False
def is_dynamic_graph(self) -> bool:
return True
class DynamicBipartiteExponentialGraph(GraphManager):
def _make_graph(self) -> None:
for rank in range(self.world_size):
for i in range(0, int(mlog(self.world_size - 1, 2)) + 1):
if i == 0:
f_peer = self._rotate_forward(rank, 1)
b_peer = self._rotate_backward(rank, 1)
else:
f_peer = self._rotate_forward(rank, 1 + 2 ** i)
b_peer = self._rotate_backward(rank, 1 + 2 ** i)
# create directory for non-passive peers
if not self.is_passive(rank) and (self.is_passive(f_peer) and self.is_passive(b_peer)):
self._add_peers(rank, [f_peer, b_peer])
# create directory for passive peers
elif self.is_passive(rank) and (not (self.is_passive(f_peer) or self.is_passive(b_peer))):
self._add_peers(rank, [f_peer, b_peer])
def is_regular_graph(self) -> bool:
return True
def is_bipartite_graph(self) -> bool:
return True
def is_passive(self, rank: Optional[int] = None) -> bool:
rank = self.rank if rank is None else rank
return (rank % 2) == 0
def is_dynamic_graph(self) -> bool:
return True
class DynamicDirectedLinearGraph(GraphManager):
def _make_graph(self) -> None:
for rank in range(self.world_size):
for i in range(1, self.world_size):
if i % 2 == 0:
continue
f_peer = self._rotate_forward(rank, i)
b_peer = self._rotate_backward(rank, i)
self._add_peers(rank, [f_peer, b_peer])
def is_regular_graph(self) -> bool:
return True
def is_bipartite_graph(self) -> bool:
return False
def is_passive(self, rank: Optional[int] = None) -> bool:
return False
def is_dynamic_graph(self) -> bool:
return True
class DynamicBipartiteLinearGraph(GraphManager):
def _make_graph(self) -> None:
for rank in range(self.world_size):
for i in range(1, self.world_size):
f_peer = self._rotate_forward(rank, i)
b_peer = self._rotate_backward(rank, i)
# create directory for non-passive peers
if not self.is_passive(rank) and (self.is_passive(f_peer) and self.is_passive(b_peer)):
self._add_peers(rank, [f_peer, b_peer])
# create directory for passive peers
elif self.is_passive(rank) and (not (self.is_passive(f_peer) or self.is_passive(b_peer))):
self._add_peers(rank, [f_peer, b_peer])
def is_regular_graph(self) -> bool:
return True
def is_bipartite_graph(self) -> bool:
return True
def is_passive(self, rank: Optional[int] = None) -> bool:
rank = self.rank if rank is None else rank
return (rank % 2) == 0
def is_dynamic_graph(self) -> bool:
return True
class RingGraph(GraphManager):
def _make_graph(self) -> None:
for rank in range(self.world_size):
f_peer = self._rotate_forward(rank, 1)
b_peer = self._rotate_backward(rank, 1)
self._add_peers(rank, [f_peer, b_peer])
def is_regular_graph(self) -> bool:
return True
def is_bipartite_graph(self) -> bool:
return False
def is_passive(self, rank: Optional[int] = None) -> bool:
return False
def is_dynamic_graph(self) -> bool:
return False
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
"""
Mixing Manager Class
:description: Class provides an API for dynamically selecting mixing weights
for gossip
"""
from abc import ABC, abstractmethod
from typing import Dict, Optional, Union
import torch
from .graph_manager import GraphManager
class MixingManager(ABC):
def __init__(self, graph: GraphManager, device: Optional[torch.device]) -> None:
self.graph_manager = graph
self.device = device
def is_regular(self) -> bool:
"""
Whether there is bias accumulated in local entry of stationary
distribution of mixing matrix
"""
return self.graph_manager.is_regular_graph() and self.is_uniform()
@abstractmethod
def is_uniform(self) -> bool:
""" Whether mixing weights are distributed uniformly over peers """
raise NotImplementedError
@abstractmethod
def get_mixing_weights(self, residual_adjusted: bool = True) -> Dict[Union[str, int], torch.Tensor]:
""" Create mixing weight dictionary using uniform allocation """
raise NotImplementedError
class UniformMixing(MixingManager):
def get_mixing_weights(self, residual_adjusted: bool = True) -> Dict[Union[str, int], torch.Tensor]:
""" Create mixing weight dictionary using uniform allocation """
mixing_weights: Dict[Union[str, int], torch.Tensor] = {}
out_peers, _ = self.graph_manager.get_peers()
w = torch.tensor([1.0 / (len(out_peers) + 1.0)], device=self.device)
mixing_weights["lo"] = w.clone()
w_op = w if not residual_adjusted else w / mixing_weights["lo"]
mixing_weights["uniform"] = w_op.clone()
for op in out_peers:
mixing_weights[op] = w_op.clone()
return mixing_weights
def is_uniform(self) -> bool:
return True
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
from .helpers import (
MultiProcessAdapter,
communicate,
create_process_group,
flatten_tensors,
group_by_dtype,
make_logger,
unflatten_tensors,
)
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
"""
Benchmarking utils for timing cuda executions
"""
from collections import defaultdict, deque
from functools import partial
import statistics
from typing import ClassVar, Deque, Dict, Optional
import torch
MAX_LEN_DEQUEUE = 10 ** 4
deque_with_max_len_fixed = partial(deque, maxlen=MAX_LEN_DEQUEUE)
def create_and_record_event() -> torch.cuda.Event:
event = torch.cuda.Event(enable_timing=True)
event.record()
return event
class EventRecorder(object):
def stop(self) -> None:
pass
def create_event_recorder(event_name: str, dummy: bool = False) -> EventRecorder:
if not dummy:
return CudaEventRecorder(event_name)
return DummyCudaEventRecorder()
class CudaEventRecorder(EventRecorder):
""" Allows profiling in an easy-to-use manner. CudaEventRecorder can be used
in a loop. When it is used in a loop (or when an event recorder is created
multiple times with the same name), get_timings returns the statistics of the
timings since the last reset. Note: in case the number of timings is greater than
10,000, only the last 10,000 timings are used to calculate the statistics.
Usage:
>>> event_recorder1 = CudaEventRecorder('1')
>>> # Sequence of events whose time is to be measured
>>> event_recorder1.stop()
>>> event_recorder2 = CudaEventRecorder('2')
>>> # Sequence of events whose time is to be measured
>>> event_recorder2.stop()
>>> print(CudaEventRecorder.get_timings())
Args:
event_name (str): The name by which the cuda event is to be referred later on
"""
event_recorders: ClassVar[Dict[str, Deque["CudaEventRecorder"]]] = defaultdict(deque_with_max_len_fixed) # type: ignore
all_event_recorders: ClassVar[Dict[str, Deque["CudaEventRecorder"]]] = defaultdict(deque_with_max_len_fixed) # type: ignore
def __init__(self, event_name: str) -> None:
self.event_name = event_name
self.start_event = create_and_record_event()
self.end_event: Optional[torch.cuda.Event] = None
# Adding it to global tracker
CudaEventRecorder.event_recorders[event_name].append(self)
CudaEventRecorder.all_event_recorders[event_name].append(self)
def stop(self) -> None:
self.end_event = create_and_record_event()
def find_time_elapsed(self) -> float:
if self.end_event is None:
raise Exception(f"stopEvent was not called for event with name {self.event_name}")
self.end_event.synchronize()
return self.start_event.elapsed_time(self.end_event)
@classmethod
def reset(cls) -> None:
cls.event_recorders = defaultdict(deque_with_max_len_fixed) # type: ignore
@classmethod
def get_common_timings(cls, event_recorders: Dict[str, Deque["CudaEventRecorder"]], description: str) -> str:
all_timings_str = f"{description}:\n"
# Iterating over different types of events, eg., forward, backward
for event_name, event_recorder_list in event_recorders.items():
# Iterating over different occurrences of an event type
time_taken_list = [event_recorder.find_time_elapsed() for event_recorder in event_recorder_list]
all_timings_str += ("{}: Time taken: avg: {}, std: {}, count: " "{}\n").format(
event_name, statistics.mean(time_taken_list), statistics.pstdev(time_taken_list), len(time_taken_list),
)
return all_timings_str
@classmethod
def get_timings(cls) -> str:
""" Returns the timings since last reset was called """
return cls.get_common_timings(cls.event_recorders, "Timings since last reset")
@classmethod
def get_all_timings(cls) -> str:
""" Returns the statistics of all the timings """
return cls.get_common_timings(cls.all_event_recorders, "All timings")
class DummyCudaEventRecorder(EventRecorder):
pass
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
"""
Collection of commonly used utility functions
"""
import collections
import logging
import sys
from typing import Any, Dict, List, MutableMapping, Set, Tuple
import torch
import torch.distributed as dist
def flatten_tensors(tensors: List[torch.Tensor]) -> torch.Tensor:
"""
Flatten dense tensors into a contiguous 1D buffer. Assume tensors are of
same dense type.
Since inputs are dense, the resulting tensor will be a concatenated 1D
buffer. Element-wise operation on this buffer will be equivalent to
operating individually
Args:
tensors (Iterable[Tensor]): dense tensors to flatten
Returns:
A 1D buffer containing input tensors
"""
if len(tensors) == 1:
return tensors[0].view(-1).clone()
flat = torch.cat([t.view(-1) for t in tensors], dim=0)
return flat
def unflatten_tensors(flat: torch.Tensor, tensors: List[torch.Tensor]) -> List[torch.Tensor]:
"""
View a flat buffer using the sizes of tensors. Assume that tensors are of
same dense type, and that flat is given by flatten_dense_tensors.
Args:
flat (Tensor): flattened dense tensors to unflatten
tensors (Iterable[Tensor]): dense tensors whose sizes will be used to
unflatten flat
Returns:
Unflattened dense tensors with sizes same as tensors and values from
flat
"""
outputs = []
offset = 0
for tensor in tensors:
numel = tensor.numel()
outputs.append(flat.narrow(0, offset, numel).view_as(tensor))
offset += numel
return outputs
def group_by_dtype(tensors: List[torch.Tensor]) -> Dict[torch.dtype, List[torch.Tensor]]:
"""
Returns a dict mapping from the tensor dtype to a list containing all
tensors of that dtype.
Arg:
tensors (Iterable[Tensor]): list of tensors
"""
tensors_by_dtype = collections.defaultdict(list)
for tensor in tensors:
tensors_by_dtype[tensor.dtype].append(tensor)
return tensors_by_dtype
def communicate(tensors: List[torch.Tensor], communication_op: Any, logger: logging.Logger = None) -> None:
"""
Communicate a list of tensors
Args:
tensors (Iterable[Tensor]): list of tensors
communication_op: a method or partial object which takes a tensor as
input and communicates it. It can be a partial object around
something like torch.distributed.all_reduce
"""
tensors_by_dtype = group_by_dtype(tensors)
for tensors_with_same_dtype in tensors_by_dtype.values():
flat_tensor = flatten_tensors(tensors_with_same_dtype)
if logger is not None:
logger.debug("Flatten completed")
communication_op(tensor=flat_tensor)
if logger is not None:
logger.debug("Commmunication completed")
with torch.no_grad():
for f, t in zip(unflatten_tensors(flat_tensor, tensors_with_same_dtype), tensors_with_same_dtype,):
t.copy_(f)
if logger is not None:
logger.debug("Unflatten completed")
HANDLER_AND_LEVEL_SET: Set[logging.Logger] = set()
# TODO: deprecate this function
def make_logger(rank: int, verbose: bool = True) -> logging.Logger:
"""
Return a logger for writing to stdout
Args:
rank (int): rank of node making logger
verbose (bool): whether to set log-level to DEBUG; o.w. INFO
Returns:
Python logger
"""
logger = logging.getLogger(__name__)
if logger not in HANDLER_AND_LEVEL_SET:
# if not getattr(logger, "handler_and_level_set", None):
console = logging.StreamHandler(stream=sys.stdout)
format_str = "{}".format(rank)
format_str += ": %(levelname)s -- %(threadName)s -- %(message)s"
console.setFormatter(logging.Formatter(format_str))
logger.addHandler(console) # prints to console
if verbose:
logger.setLevel(logging.DEBUG)
else:
logger.setLevel(logging.INFO)
HANDLER_AND_LEVEL_SET.add(logger)
# logger.handler_and_level_set = True
return logger
def create_process_group(ranks: List[int]) -> torch.distributed.ProcessGroup:
"""
Creates and initializes a new process group. Assumes init_process_group
has already been called
Arguments:
ranks (list<int>): ranks corresponding to the processes which should
belong to the created process group
Returns:
New process group
"""
new_group = dist.new_group(ranks=ranks)
init_tensor_fp32, init_tensor_fp16 = torch.zeros(1), torch.zeros(1).half()
for init_tensor in [init_tensor_fp32, init_tensor_fp16]:
if torch.cuda.is_available():
init_tensor = init_tensor.cuda()
if dist.get_rank() in ranks:
dist.all_reduce(init_tensor, group=new_group)
torch.cuda.synchronize()
return new_group
class MultiProcessAdapter(logging.LoggerAdapter):
"""
Creates an adapter to make logging for multiple processes cleaner
"""
def process(self, msg: str, kwargs: Any) -> Tuple[str, MutableMapping[str, Any]]:
# use process_num from kwargs or the default given on instantiation
process_num = kwargs.pop("process_num", self.extra["process_num"])
return f"process: {process_num} {msg}", kwargs
...@@ -208,8 +208,21 @@ def get_world_sizes() -> List[int]:
return [x for x in [1, 2, 4, 8] if x <= limit]
def test_runner(
rank: int, test_func: Callable, deterministic: bool = False, *args: List[Any], **kwargs: Dict[str, Any]
) -> None:
# At this point we're in a new process, torch options need to be set again
if deterministic:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.manual_seed(1357)
test_func(rank, *args, **kwargs)
def spawn_for_all_world_sizes(
test_func: Callable, world_sizes: List[int] = get_world_sizes(), args: Any = [], deterministic: bool = False
) -> None:
for world_size in world_sizes:
_, filename = tempfile.mkstemp()
_, filename_rpc = tempfile.mkstemp()
...@@ -217,7 +230,12 @@ def spawn_for_all_world_sizes(test_func: Callable, world_sizes: List[int] = get_
try:
# (lefaudeux) Let mp handle the process joining, join=False and handling context has
# been unstable in the past.
mp.spawn(
test_runner,
args=(test_func, deterministic, world_size, filename, filename_rpc, *args),
nprocs=world_size,
join=True,
)
finally:
rmf(filename)
rmf(filename_rpc)
...@@ -239,7 +257,19 @@ def worker_process(
initialize_model_parallel(1, world_size, **kwargs)
# Make sure that CUDA operations are repeatable
context = (
torch.backends.cudnn.flags(benchmark=False, deterministic=True) # type: ignore
if torch.cuda.is_available() and hasattr(torch.backends.cudnn, "flags")
else contextlib.suppress()
)
if torch.cuda.is_available() and not hasattr(torch.backends.cudnn, "flags"):
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
try:
with context:
func(*args)
teardown()
except BaseException as e:
...
...@@ -27,4 +27,4 @@ use_parentheses = true
skip_glob = ["build/*", "stubs/*"]
# Don't split "import" and "from".
force_sort_within_sections = true
known_third_party = ["benchmark_dataset", "datasets", "golden_configs", "helpers", "models", "numpy", "parameterized", "pytest", "recommonmark", "setuptools", "torch", "torchtext", "torchvision"]
...@@ -5,3 +5,4 @@ def version() -> int: ...
#END
deterministic : bool
benchmark: bool