Enhanced Activation Checkpointing
=================================

Activation checkpointing is a technique used to reduce GPU memory usage during training. It avoids
storing the intermediate activation tensors produced during the forward pass; instead, only the
original inputs of each checkpointed segment are kept, and the forward pass is recomputed from them
during the backward pass. This adds roughly 33% to the computation cost, but not having to store the
large activation tensors allows us to increase the batch size and thereby the net throughput of
the model.
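
As a minimal illustration of the trade-off (the module and tensor shapes below are arbitrary), a
block can be checkpointed with PyTorch's built-in `torch.utils.checkpoint.checkpoint`, discussed
below: only the block's input is kept alive, and the block is re-executed during the backward pass.

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    # Arbitrary block whose intermediate activations would otherwise be stored.
    block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

    # The input must require grad so that the backward pass is triggered
    # (see the best practices below).
    x = torch.randn(32, 1024, requires_grad=True)

    # Only `x` is saved; the block's activations are recomputed during backward.
    y = checkpoint(block, x)
    y.sum().backward()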


Activation checkpointing is implemented by overriding `torch.autograd.Function`. In the `forward`
function, which handles the forward pass of the module, the computation is run under `no_grad`. This
prevents the creation of the autograd graph and the materialization of intermediate activation
tensors until they are actually needed (i.e. during the backward pass). During the backward pass,
the forward pass is executed again, followed by the backward pass. The inputs to the forward pass
are saved using a context object that is then accessed in the backward pass to retrieve the original
inputs. We also save the Random Number Generator (RNG) state for the forward and backward passes,
as required for Dropout layers.
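
A much-simplified sketch of this mechanism is shown below; it ignores keyword arguments, non-tensor
inputs and outputs, and the CUDA RNG state, all of which the real implementations handle.

.. code-block:: python

    import torch

    class SimpleCheckpointFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, run_function, *inputs):
            ctx.run_function = run_function
            ctx.save_for_backward(*inputs)          # keep only the inputs, not the activations
            ctx.rng_state = torch.get_rng_state()   # so that e.g. Dropout replays identically
            with torch.no_grad():
                # No autograd graph is built, so no intermediate activations are kept.
                outputs = run_function(*inputs)
            return outputs

        @staticmethod
        def backward(ctx, *grad_outputs):
            inputs = ctx.saved_tensors
            torch.set_rng_state(ctx.rng_state)      # replay the same RNG state
            detached = [t.detach().requires_grad_(t.requires_grad) for t in inputs]
            with torch.enable_grad():
                # Re-run the forward pass, this time recording the graph.
                outputs = ctx.run_function(*detached)
            if isinstance(outputs, torch.Tensor):
                outputs = (outputs,)
            torch.autograd.backward(outputs, grad_outputs)
            # One gradient per argument of forward(); run_function itself gets None.
            return (None,) + tuple(t.grad for t in detached)

    # Usage: output = SimpleCheckpointFunction.apply(module_or_function, input_tensor)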

The above functionality is already provided by the `torch.utils.checkpoint.checkpoint` API in
PyTorch, which can be used to wrap different modules in the forward pass. The wrapper in FairScale
offers functionality beyond the PyTorch API: specifically, you can use
`fairscale.nn.checkpoint.checkpoint_wrapper` to wrap an `nn.Module`, handle kwargs in the forward
pass, offload intermediate activations to the CPU, and handle non-tensor outputs returned from the
forward function.
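
For example, a sketch of how the wrapper can be applied (the modules, shapes, and the choice of
`offload_to_cpu` below are purely illustrative):

.. code-block:: python

    import torch
    import torch.nn as nn
    from fairscale.nn.checkpoint import checkpoint_wrapper

    model = nn.Sequential(
        checkpoint_wrapper(nn.Sequential(nn.Linear(1024, 4096), nn.ReLU())),
        # offload_to_cpu=True additionally moves the activations saved for this
        # segment to CPU memory between the forward and backward passes.
        checkpoint_wrapper(nn.Linear(4096, 1024), offload_to_cpu=True),
    )

    x = torch.randn(8, 1024, requires_grad=True)
    loss = model(x).sum()
    loss.backward()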

Best practices for `fairscale.nn.checkpoint.checkpoint_wrapper`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Memory savings depend entirely on the model and on how the checkpoint wrapping segments it.
   Each backward pass then consists of several mini forward and backward passes, and the gain is
   determined by the memory footprint of the activations of the wrapped layers.

2. When using BatchNormalization you may need to freeze the computation of the running statistics,
   since the forward pass is run twice and the statistics would otherwise be updated twice.

3. Ensure that the input tensor's `requires_grad` field is set to True. The `backward` function is
   only triggered if the output has this field set; setting it on the input tensor ensures that it
   is propagated to the output and that `backward` is called. Points 2 and 3 are illustrated in the
   sketch after this list.
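
Below is a short sketch of points 2 and 3 using a hypothetical checkpointed block that contains
BatchNorm. One simple way to freeze the running statistics is to put the BatchNorm layers in eval
mode, which also makes them normalize with their stored statistics.

.. code-block:: python

    import torch
    import torch.nn as nn
    from fairscale.nn.checkpoint import checkpoint_wrapper

    block = checkpoint_wrapper(
        nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
    )

    # 2. Freeze the BatchNorm running statistics; otherwise they would be
    #    updated twice, once per forward execution of the checkpointed segment.
    for m in block.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()

    # 3. Setting requires_grad on the input propagates to the output and
    #    ensures the backward pass (and hence the recomputation) is triggered.
    x = torch.randn(4, 3, 32, 32, requires_grad=True)
    block(x).sum().backward()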