OffloadModel
=============

Heavily inspired by the `Layer-to-Layer <https://arxiv.org/abs/2002.05645>`_ algorithm and
`ZeRO-Offload <https://arxiv.org/abs/2101.06840>`_, OffloadModel uses the CPU to store
the entire model, optimizer state and gradients. During the forward and backward passes,
OffloadModel then brings one layer (or a group of layers) at a time onto the GPU for training. The
intermediate activations at the layer boundaries are also stored on the CPU and copied to the GPU
as needed for the backward pass. Once the backward pass is completed, all the parameters are
updated with the gradients present on the CPU.

.. image:: ../_static/img/offload.png
    :height: 500px
    :width: 500px
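
Wrapping a model is a one-line change. The sketch below follows the fairscale tutorial; since the
API is experimental, the exact parameter set may differ across versions:

.. code-block:: python

    import torch
    import torch.nn as nn

    from fairscale.experimental.nn import OffloadModel

    # OffloadModel expects an nn.Sequential model.
    model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 8))

    offload_model = OffloadModel(
        model=model,                         # nn.Sequential model to offload
        device=torch.device("cuda"),         # device on which the shards are computed
        offload_device=torch.device("cpu"),  # device on which the model is stored
        num_slices=3,                        # number of model shards
    )

    # The optimizer is built over the CPU copies of the parameters (see technique 3 below).
    optimizer = torch.optim.SGD(offload_model.parameters(), lr=0.001)

    inputs = torch.randn(8, 32).cuda()
    loss = offload_model(inputs).sum()
    loss.backward()
    optimizer.step()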

OffloadModel uses the following techniques to enable large model training:

1. The model is assumed to be nn.Sequential and is sharded (almost) equally, based on the number of
parameters, into a list of nn.Modules. Each nn.Module now contains a fraction of the whole model,
which we shall refer to as a model shard (see the sharding sketch after this list).

2. At each iteration, each model shard is copied from the CPU -> GPU, the FW pass is computed
using the minibatch of data, and the model shard is copied back from GPU -> CPU. In the BW pass,
the same process is repeated (see the FW-pass sketch after this list).

3. The optimizer remains on the CPU, and gradients and parameters are all moved onto the CPU before
running optimizer.step. This ensures that the CPU is responsible for updating the parameters and
holding onto the optimizer state (see the optimizer sketch after this list).

4. If activation checkpointing is enabled, we use torch.autograd.Function to disable graph
construction in the FW pass and copy intermediate activations from GPU -> CPU after the FW pass of
a given shard is complete. The reverse copy is carried out in the BW pass (see the checkpointing
sketch after this list).

5. Microbatches are used to increase throughput and offset the cost of moving model parameters
and activations from CPU <-> GPU. Microbatches allow you to specify large minibatches which are
broken down into microbatches and fed to the model shards at each iteration. In short, it is a way
to allow more computation at a given time on a model shard, offsetting the cost of copying from
CPU <-> GPU (see the microbatching sketch after this list).
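
To illustrate technique 1, here is a hypothetical sharding sketch (not fairscale's actual
implementation): consecutive layers are greedily packed into shards of roughly equal parameter
count.

.. code-block:: python

    import torch.nn as nn

    def shard_by_param_count(model: nn.Sequential, num_shards: int):
        """Greedily pack consecutive layers into (almost) equally sized shards."""
        total = sum(p.numel() for p in model.parameters())
        target = total / num_shards
        shards, current, current_size = [], [], 0
        for layer in model:
            current.append(layer)
            current_size += sum(p.numel() for p in layer.parameters())
            # Close the shard once it reaches the target size, keeping the last one open.
            if current_size >= target and len(shards) < num_shards - 1:
                shards.append(nn.Sequential(*current))
                current, current_size = [], 0
        if current:
            shards.append(nn.Sequential(*current))
        return shards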
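
Technique 2 reduces to a loop like the FW-pass sketch below. This shows the copy pattern only; the
autograd bookkeeping that makes the BW pass work the same way in reverse is omitted.

.. code-block:: python

    import torch

    def offloaded_forward(shards, minibatch):
        """FW pass that keeps at most one model shard on the GPU at a time."""
        activation = minibatch.cuda()
        for shard in shards:
            shard.to("cuda")                # copy the shard CPU -> GPU
            activation = shard(activation)  # compute the FW pass for this shard
            shard.to("cpu")                 # copy the shard back GPU -> CPU
        return activation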
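
Technique 3 amounts to constructing the optimizer over the CPU-resident parameters, as in the
optimizer sketch below; its state then never leaves the CPU.

.. code-block:: python

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 8))  # stored on the CPU
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # optimizer state lives on the CPU

    loss = model(torch.randn(8, 32)).sum()
    loss.backward()   # with OffloadModel, gradients likewise end up on the CPU
    optimizer.step()  # the parameter update is computed entirely by the CPU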
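
Technique 4 can be sketched with a custom torch.autograd.Function along the following lines
(hypothetical; fairscale's version also handles the parameter shuttling and non-tensor inputs):

.. code-block:: python

    import torch

    class CheckpointAndOffload(torch.autograd.Function):
        """Run a shard without building a graph and park its input on the CPU."""

        @staticmethod
        def forward(ctx, activation, shard):
            ctx.shard = shard
            ctx.save_for_backward(activation.cpu())  # offload the activation GPU -> CPU
            with torch.no_grad():                    # no graph is constructed in the FW pass
                return shard(activation)

        @staticmethod
        def backward(ctx, grad_output):
            (activation,) = ctx.saved_tensors
            activation = activation.cuda().requires_grad_(True)  # copy back CPU -> GPU
            with torch.enable_grad():
                output = ctx.shard(activation)       # recompute, this time building a graph
            torch.autograd.backward(output, grad_output)
            return activation.grad, None             # no gradient for the shard argument

    # usage: out = CheckpointAndOffload.apply(x, shard)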
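
For technique 5, in the microbatching sketch below each shard is copied to the GPU once per
iteration and processes every microbatch before being copied back, so a single pair of transfers is
amortized over more computation.

.. code-block:: python

    import torch

    def forward_with_microbatches(shards, minibatch, num_microbatches=4):
        """Amortize each shard's CPU <-> GPU copies over several microbatches."""
        microbatches = [mb.cuda() for mb in torch.chunk(minibatch, num_microbatches)]
        for shard in shards:
            shard.to("cuda")  # one pair of copies per shard per iteration...
            microbatches = [shard(mb) for mb in microbatches]  # ...amortized over all microbatches
            shard.to("cpu")
        return torch.cat(microbatches)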

Best practices for using `fairscale.experimental.nn.OffloadModel`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Using OffloadModel to train large models can result in loss of throughput, which can be overcome by using activation checkpointing and microbatches (see the sketch after this list).

2. OffloadModel currently only works for `nn.Sequential` models.
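
For best practice 1, both mitigations are enabled through the constructor. A sketch, assuming the
same parameters as in the example at the top of this page:

.. code-block:: python

    import torch
    import torch.nn as nn

    from fairscale.experimental.nn import OffloadModel

    model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 8))

    offload_model = OffloadModel(
        model=model,
        device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        num_slices=3,
        checkpoint_activation=True,  # offload activations to recover throughput
        num_microbatches=4,          # amortize CPU <-> GPU copies over more computation
    )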