Efficient Data Parallel Training with SlowMo Distributed Data Parallel
======================================================================

SlowMo Distributed Data Parallel reduces the communication between different nodes while
performing data parallel training. It is mainly useful on clusters with low interconnect
speeds between nodes. When using SlowMo, the models on the different nodes are no longer
kept in sync after each iteration, which affects the optimization dynamics. The end result
is close to that of Distributed Data Parallel, but is not exactly the same.

If you have code that is set up to use Distributed Data Parallel, switching to SlowMo
Distributed Data Parallel is simply a matter of replacing the DDP call with a call to
``fairscale.experimental.nn.data_parallel.SlowMoDistributedDataParallel``, and adding a
``model.perform_slowmo(optimizer)`` call after ``optimizer.step()``, preceded by
``model.zero_grad(set_to_none=True)`` in order to reduce peak memory usage. The different
points at which ``use_slowmo`` is used below help demonstrate these changes:

.. code-block:: python

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP
    from fairscale.experimental.nn.data_parallel import SlowMoDistributedDataParallel as SlowMoDDP

    def train(
        rank: int,
        world_size: int,
        epochs: int,
        use_slowmo: bool):

        # process group init
        dist_init(rank, world_size)

        # Problem statement
        model = MyAwesomeModel().to(rank)
        if use_slowmo:
            # Wrap the model into SlowMoDDP
            model = SlowMoDDP(model, slowmo_momentum=0.5, nprocs_per_node=8)
        else:
            model = DDP(model, device_ids=[rank])

        dataloader = MySuperFastDataloader()
        loss_fn = MyVeryRelevantLoss()
        optimizer = MyAmazingOptimizer()

        # Any relevant training loop, with a line at the very end specific to SlowMoDDP, e.g.:
        model.train()
        for e in range(epochs):
            for (data, target) in dataloader:
                data, target = data.to(rank), target.to(rank)
                # Train
                outputs = model(data)
                loss = loss_fn(outputs, target)
                loss.backward()
                optimizer.step()
                model.zero_grad(set_to_none=use_slowmo)  # free gradient memory for the perform_slowmo() call below
                if use_slowmo:
                    model.perform_slowmo(optimizer)

In the example above, when using SlowMoDDP, we reduce the total communication between nodes
by a factor of 3, since the default ``localsgd_frequency`` is set to 3. SlowMoDDP takes in
``slowmo_momentum`` as a parameter. This parameter may need to be tuned depending on your
use case. It also takes in ``nprocs_per_node``, which should typically be set to the number
of GPUs on a node. Please look at the `documentation <../api/experimental/nn/slowmo_ddp.html>`_
for more details on these parameters as well as other advanced settings of the SlowMo
algorithm.
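
As a concrete illustration, the sketch below shows how these knobs might be passed when
wrapping the model. The specific values are illustrative assumptions, not recommendations,
and the full set of accepted arguments and their defaults is described in the linked
documentation:

.. code-block:: python

    from fairscale.experimental.nn.data_parallel import SlowMoDistributedDataParallel as SlowMoDDP

    # Hypothetical configuration: synchronize across nodes less often than the
    # default, use a slightly stronger slow momentum, and declare 4 GPUs per node.
    model = SlowMoDDP(
        model,
        nprocs_per_node=4,      # typically the number of GPUs on a node
        slowmo_momentum=0.6,    # may need tuning for your use case (0.5 is used above)
        localsgd_frequency=6,   # default is 3; larger values mean less communication
    )

Raising ``localsgd_frequency`` reduces communication further, but because the models drift
apart for longer between synchronizations, it may also move the final result further away
from what plain DDP would produce.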