Pipeline Parallelism
=====================

Training large models can lead to out-of-memory errors when the model is too large to fit on a single GPU.
To train such a model, its layers can be pipelined across different GPU devices as described in GPipe.
`fairscale.nn.Pipe` is an implementation of GPipe that has been adapted from torchgpipe. This API
has also been upstreamed to PyTorch in the 1.8 release with the experimental tag.

.. image:: ../_static/img/pipe.png

GPipe first shards the model across different devices, with each device hosting a shard of the model.
A shard can be a single layer or a series of layers. GPipe then splits a mini-batch of data into
micro-batches and feeds them to the device hosting the first shard. The layers on each device process
a micro-batch and send the output to the following shard/device; in the meantime, the device is free
to process the next micro-batch arriving from the previous shard/device. By pipelining the input in
this way, GPipe is able to reduce the idle time of the devices.
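
The snippet below is a minimal sketch of wrapping a model with `fairscale.nn.Pipe`. It assumes two
visible CUDA devices; the layer sizes, the `balance` split and the number of micro-batches (`chunks`)
are illustrative choices rather than recommended values, and the exact keyword arguments should be
checked against the installed fairscale version.

.. code-block:: python

    import torch
    import torch.nn as nn
    import fairscale

    # A toy model expressed as nn.Sequential so that it can be split into shards.
    model = nn.Sequential(
        nn.Linear(10, 10),
        nn.ReLU(),
        nn.Linear(10, 10),
        nn.ReLU(),
    )

    # balance=[2, 2] places the first two layers on the first device and the
    # remaining two on the second; chunks=4 splits each mini-batch into four
    # micro-batches that are pipelined through the shards.
    model = fairscale.nn.Pipe(model, balance=[2, 2], chunks=4)

    # The input is fed to the device hosting the first shard; the output comes
    # back on the device hosting the last shard.
    x = torch.rand(16, 10).to(torch.device("cuda", 0))
    y = model(x)
    y.sum().backward()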

Best practices for using `fairscale.nn.Pipe`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. The choice of micro-batch size can affect GPU utilization. A smaller micro-batch reduces the latency of a shard waiting for the outputs of the previous shard, while a larger micro-batch makes better use of the GPUs (see the sketch after this list).

2. How the model is sharded can also impact GPU utilization: a shard containing layers with heavier computation can become a bottleneck and slow down the shards downstream.
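
One rough way to explore these trade-offs is to time a forward and backward pass for several
micro-batch settings with a fixed split, as sketched below. The sketch again assumes two visible
CUDA devices and a fairscale version exposing the `balance` and `chunks` arguments on
`fairscale.nn.Pipe`; the layer sizes and the ``[4, 2]`` split are arbitrary choices for illustration.

.. code-block:: python

    import time

    import torch
    import torch.nn as nn
    import fairscale

    def build_model():
        # Six modules in total, so a balance such as [4, 2] must sum to 6.
        return nn.Sequential(
            nn.Linear(1024, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, 1024), nn.ReLU(),
        )

    # The input lives on the device hosting the first shard (cuda:0 by default).
    x = torch.rand(256, 1024).to(torch.device("cuda", 0))

    for chunks in (1, 2, 4, 8):
        # Re-wrap a fresh copy of the model for each micro-batch setting.
        model = fairscale.nn.Pipe(build_model(), balance=[4, 2], chunks=chunks)
        for d in range(torch.cuda.device_count()):
            torch.cuda.synchronize(d)
        start = time.time()
        model(x).sum().backward()
        for d in range(torch.cuda.device_count()):
            torch.cuda.synchronize(d)
        print(f"chunks={chunks}: {time.time() - start:.3f}s")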