Implementing Synchronized Multi-GPU Batch Normalization
=======================================================

In this tutorial, we discuss the implementation details of Multi-GPU Batch Normalization (BN) (classic implementation: :class:`encoding.nn.BatchNorm2d`). We will provide a training example in a later version.
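
In the meantime, here is a minimal usage sketch. It is only an illustration and assumes that :class:`encoding.nn.BatchNorm2d` follows the constructor convention of :class:`torch.nn.BatchNorm2d`, taking the number of feature channels::

    import torch
    import encoding

    # assumed drop-in replacement for torch.nn.BatchNorm2d on a conv feature map
    norm_layer = encoding.nn.BatchNorm2d(64)   # 64 = number of channels (assumed signature)
    x = torch.randn(8, 64, 32, 32)             # (N, C, H, W) mini-batch
    y = norm_layer(x)                          # normalized per channel over N, H, W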

How Does BN Work?
-----------------

The BN layer was introduced in the paper `Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift <https://arxiv.org/abs/1502.03167>`_. It dramatically speeds up training (by enabling larger learning rates) and makes the network less sensitive to weight initialization.

.. image:: http://hangzh.com/blog/images/bn1.png
    :align: center

- Forward Pass: 
    For the input data :math:`X={x_1, ...x_N}`, the data are normalized to be zero-mean and unit-variance, then scaled and shifted (a short NumPy sketch of both passes follows this list):

    .. math::
        y_i = \gamma\cdot\frac{x_i-\mu}{\sigma} + \beta ,

    where :math:`\mu=\frac{\sum_i^N x_i}{N} , \sigma = \sqrt{\frac{\sum_i^N (x_i-\mu)^2}{N}+\epsilon}` and :math:`\gamma, \beta` are the learnable parameters.
        
- Backward Pass:
    For calculating the gradient :math:`\frac{d_\ell}{d_{x_i}}`, we need to consider the partial gradient from :math:`\frac{d_\ell}{d_y}` as well as the gradients from :math:`\frac{d_\ell}{d_\mu}` and :math:`\frac{d_\ell}{d_\sigma}`, since :math:`\mu \text{ and } \sigma` are functions of the input :math:`x_i`. We use partial derivatives in the notation:

    .. math::

        \frac{d_\ell}{d_{x_i}} = \frac{d_\ell}{d_{y_i}}\cdot\frac{\partial_{y_i}}{\partial_{x_i}} + \frac{d_\ell}{d_\mu}\cdot\frac{d_\mu}{d_{x_i}} + \frac{d_\ell}{d_\sigma}\cdot\frac{d_\sigma}{d_{x_i}}

    where :math:`\frac{\partial_{y_i}}{\partial_{x_i}}=\frac{\gamma}{\sigma}, \frac{d_\ell}{d_\mu}=-\frac{\gamma}{\sigma}\sum_i^N\frac{d_\ell}{d_{y_i}} \text{ and } \frac{d_\sigma}{d_{x_i}}=\frac{1}{\sigma}\cdot\frac{x_i-\mu}{N}`.
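
To make the two passes concrete, here is a small NumPy sketch for a single feature channel (a 1-D batch of :math:`N` values, with scalar :math:`\gamma` and :math:`\beta`). It only illustrates the equations above and is not the implementation used by :class:`encoding.nn.BatchNorm2d`::

    import numpy as np

    def bn_forward(x, gamma, beta, eps=1e-5):
        # x: 1-D array of N values for one feature channel
        mu = x.mean()
        sigma = np.sqrt(((x - mu) ** 2).mean() + eps)
        x_hat = (x - mu) / sigma
        y = gamma * x_hat + beta
        return y, (x_hat, sigma)

    def bn_backward(dy, gamma, cache):
        # Combines the three partial-derivative terms from the equations above:
        #   dl/dx_i = (gamma/sigma) * (dy_i - mean(dy) - x_hat_i * mean(dy * x_hat))
        x_hat, sigma = cache
        dl_dxhat = dy * gamma
        dx = (dl_dxhat - dl_dxhat.mean() - x_hat * (dl_dxhat * x_hat).mean()) / sigma
        dgamma = (dy * x_hat).sum()
        dbeta = dy.sum()
        return dx, dgamma, dbeta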

Why Synchronize BN?
-------------------

- Standard implementations of BN in public frameworks (such as Caffe, MXNet, Torch, TF, PyTorch) are unsynchronized, which means that the data are normalized within each GPU. Therefore the `working batch-size` of the BN layer is `BatchSize/nGPU` (the batch-size on each GPU).

.. image:: http://hangzh.com/blog/images/bn2.png
    :align: center

- Since the `working batch-size` is typically large enough for standard vision tasks, such as classification and detection, there is no need to synchronize the BN layer during training; the synchronization would only slow training down.

- However, for the semantic segmentation task, state-of-the-art approaches typically adopt dilated convolution, which is very memory consuming. The `working batch-size` can be too small for BN layers (2 or 4 on each GPU) when using larger/deeper pre-trained networks, such as :class:`encoding.dilated.ResNet` or :class:`encoding.dilated.DenseNet`.

How to Synchronize?
-------------------

Suppose we have :math:`K` GPUs; let :math:`sum(x)_k` and :math:`sum(x^2)_k` denote the sum of elements and the sum of squared elements on the :math:`k^{th}` GPU.

- Forward Pass:
    We first calculate the sum of elements :math:`sum(x)=\sum x_i \text{ and sum of squares } sum(x^2)=\sum x_i^2` on each GPU, then apply the :class:`encoding.parallel.allreduce` operation to sum them across GPUs. Finally we calculate the global mean :math:`\mu=\frac{sum(x)}{N} \text{ and global standard deviation } \sigma=\sqrt{\frac{sum(x^2)}{N}-\mu^2+\epsilon}` (a sketch of this statistics gathering follows the figure below).

- Backward Pass:
    * :math:`\frac{d_\ell}{d_{x_i}}=\frac{d_\ell}{d_{y_i}}\frac{\gamma}{\sigma}` can be calculated locally in each GPU.
    * Calculate the gradients with respect to :math:`sum(x)` and :math:`sum(x^2)` individually on each GPU, i.e. :math:`\frac{d_\ell}{d_{sum(x)_k}}` and :math:`\frac{d_\ell}{d_{sum(x^2)_k}}`.

    * Then sync the gradients (automatically handled by :class:`encoding.parallel.AllReduce`) and continue the backward pass.

.. image:: http://hangzh.com/blog/images/bn3.png
    :align: center
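
As a concrete illustration of the forward-pass statistics gathering, the following is a sketch that computes the global per-channel mean and standard deviation. It uses :func:`torch.distributed.all_reduce` in place of :class:`encoding.parallel.allreduce` purely to keep the example self-contained, and assumes a process group (one process per GPU) has already been initialized; the backward synchronization of :math:`\frac{d_\ell}{d_{sum(x)}}` and :math:`\frac{d_\ell}{d_{sum(x^2)}}` follows the same all-reduce pattern::

    import torch
    import torch.distributed as dist

    def sync_bn_statistics(x, eps=1e-5):
        # x: local (N_k, C, H, W) mini-batch on this GPU / process
        count = torch.tensor([x.numel() / x.size(1)], device=x.device)  # elements per channel
        xsum = x.sum(dim=(0, 2, 3))                                     # sum(x)_k,   per channel
        xsqsum = (x * x).sum(dim=(0, 2, 3))                             # sum(x^2)_k, per channel

        # all-reduce the per-GPU partial sums so every process sees the global sums
        for t in (count, xsum, xsqsum):
            dist.all_reduce(t, op=dist.ReduceOp.SUM)

        N = count.item()
        mean = xsum / N
        std = torch.sqrt(xsqsum / N - mean * mean + eps)
        return mean, std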

Citation
--------

.. note::
    This code is provided together with the paper; please cite our work.

        * Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, Amit Agrawal. "Context Encoding for Semantic Segmentation"  *The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018*::

            @InProceedings{Zhang_2018_CVPR,
            author = {Zhang, Hang and Dana, Kristin and Shi, Jianping and Zhang, Zhongyue and Wang, Xiaogang and Tyagi, Ambrish and Agrawal, Amit},
            title = {Context Encoding for Semantic Segmentation},
            booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
            month = {June},
            year = {2018}
            }