[Doc] NN doc refactor Conv Layers (#1672)

* go through gcn, relgcn
* fix tagconv formula
* fix doc in sageconv
* fix sgconv doc
* replace hat with tilde
* more comments on gmmconv
* fix agnnconv chebconv doc
* modify nnconv doc
* remove &
* add nn conv examples
* Rebase master
* More merge conflicts
* check homo
* add back self loop for some convs, check homo in tranform
* add example for denseconv
* add example and doc for dotgat and cfconv
* check in-degree for graphconv
* add language fix
* gconv address all comments
* another round of change based on api template
* change agnn
* go through agnn, appnp, atomic, cf, cheb, dense, gat, sage modules
* finish pytorch part of nn conv
* mxnet graphconv done
* tensorflow graphconv works
* add new modules into doc
* add comments to not split code
* refine doc
* resr
* more comments
* more fix
* finish conv and dense conv part api
* pylint fix
* fix pylink
* fix pylint
* more fix
* fix
* fix test fail because zere in degree
* fix test fail
* sage is not update for mxnet tf

Co-authored-by: Ubuntu <ubuntu@ip-172-31-0-81.us-east-2.compute.internal>

Commit 69f5869f authored Aug 12, 2020 by Tianjun Xiao, committed by GitHub Aug 12, 2020
20 changed files
--- a/docs/source/api/python/nn.pytorch.rst
+++ b/docs/source/api/python/nn.pytorch.rst
@@ -122,6 +122,21 @@ AtomicConv
    :members: forward
    :show-inheritance:

+CFConv
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: dgl.nn.pytorch.conv.CFConv
+    :members: forward
+    :show-inheritance:
+
+DotGatConv
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: dgl.nn.pytorch.conv.DotGatConv
+    :members: forward
+    :show-inheritance:
+
+
 Dense Conv Layers
 ----------------------------------------


--- a/python/dgl/nn/mxnet/conv/graphconv.py
+++ b/python/dgl/nn/mxnet/conv/graphconv.py
@@ -10,40 +10,27 @@ from ....base import DGLError
 from ....utils import expand_as_pair

 class GraphConv(gluon.Block):
-    r"""Apply graph convolution over an input signal.
+    r"""

-    Graph convolution is introduced in `GCN <https://arxiv.org/abs/1609.02907>`__
-    and can be described as below:
+    Description
+    -----------
+    Graph convolution was introduced in `GCN <https://arxiv.org/abs/1609.02907>`__
+    and is mathematically defined as follows:

    .. math::
      h_i^{(l+1)} = \sigma(b^{(l)} + \sum_{j\in\mathcal{N}(i)}\frac{1}{c_{ij}}h_j^{(l)}W^{(l)})

-    where :math:`\mathcal{N}(i)` is the neighbor set of node :math:`i`. :math:`c_{ij}` is equal
-    to the product of the square root of node degrees:
-    :math:`\sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}`. :math:`\sigma` is an activation
-    function.
-
-    The model parameters are initialized as in the
-    `original implementation <https://github.com/tkipf/gcn/blob/master/gcn/layers.py>`__ where
-    the weight :math:`W^{(l)}` is initialized using Glorot uniform initialization
-    and the bias is initialized to be zero.
-
-    Notes
-    -----
-    Zero in degree nodes could lead to invalid normalizer. A common practice
-    to avoid this is to add a self-loop for each node in the graph, which
-    can be achieved by:
-
-    >>> g = ... # some DGLGraph
-    >>> g.add_edges(g.nodes(), g.nodes())
-
+    where :math:`\mathcal{N}(i)` is the set of neighbors of node :math:`i`,
+    :math:`c_{ij}` is the product of the square root of node degrees
+    (i.e., :math:`c_{ij} = \sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}`),
+    and :math:`\sigma` is an activation function.

    Parameters
    ----------
    in_feats : int
-        Number of input features.
+        Input feature size; i.e., the number of dimensions of :math:`h_j^{(l)}`.
    out_feats : int
-        Number of output features.
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
    norm : str, optional
        How to apply the normalizer. If is `'right'`, divide the aggregated messages
        by each node's in-degrees, which is equivalent to averaging the received messages.
@@ -54,16 +41,91 @@ class GraphConv(gluon.Block):
        without a weight matrix.
    bias : bool, optional
        If True, adds a learnable bias to the output. Default: ``True``.
-    activation: callable activation function/layer or None, optional
+    activation : callable activation function/layer or None, optional
        If not None, applies an activation function to the updated node features.
        Default: ``None``.
+    allow_zero_in_degree : bool, optional
+        If there are 0-in-degree nodes in the graph, the output for those nodes will be invalid
+        since no message will be passed to them. This is harmful for some applications,
+        causing silent performance regression. This module raises a DGLError if it detects
+        0-in-degree nodes in the input graph. Setting this to ``True`` suppresses the check
+        and leaves handling of such nodes to the user. Default: ``False``.

    Attributes
    ----------
-    weight : mxnet.gluon.parameter.Parameter
+    weight : mxnet.gluon.parameter.Parameter
        The learnable weight tensor.
-    bias : mxnet.gluon.parameter.Parameter
+    bias : mxnet.gluon.parameter.Parameter
        The learnable bias tensor.
+
+    Notes
+    -----
+    Zero in-degree nodes will lead to invalid output values, because no message is
+    passed to those nodes and the aggregation function is applied to empty input.
+    A common practice to avoid this is to add a self-loop for each node in the graph if
+    it is homogeneous, which can be achieved by:
+
+    >>> g = ... # a DGLGraph
+    >>> g = dgl.add_self_loop(g)
+
+    Calling ``add_self_loop`` will not work for some graphs, for example, heterogeneous graphs,
+    since the edge type cannot be decided for self-loop edges. Set ``allow_zero_in_degree``
+    to ``True`` in those cases to unblock the code and handle zero-in-degree nodes manually.
+    A common practice is to filter out the zero-in-degree nodes from the output
+    after the convolution.
+
+    Examples
+    --------
+    >>> import dgl
+    >>> import mxnet as mx
+    >>> from mxnet import gluon
+    >>> import numpy as np
+    >>> from dgl.nn import GraphConv
+
+    >>> # Case 1: Homogeneous graph
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> g = dgl.add_self_loop(g)
+    >>> feat = mx.nd.ones((6, 10))
+    >>> conv = GraphConv(10, 2, norm='both', weight=True, bias=True)
+    >>> conv.initialize(ctx=mx.cpu(0))
+    >>> res = conv(g, feat)
+    >>> print(res)
+    [[1.0209361  0.22472616]
+    [1.1240715  0.24742813]
+    [1.0209361  0.22472616]
+    [1.2924911  0.28450024]
+    [1.3568745  0.29867214]
+    [0.7948386  0.17495811]]
+    <NDArray 6x2 @cpu(0)>
+
+    >>> # allow_zero_in_degree example
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> conv = GraphConv(10, 2, norm='both', weight=True, bias=True, allow_zero_in_degree=True)
+    >>> res = conv(g, feat)
+    >>> print(res)
+    [[1.0209361  0.22472616]
+    [1.1240715  0.24742813]
+    [1.0209361  0.22472616]
+    [1.2924911  0.28450024]
+    [1.3568745  0.29867214]
+    [0.  0.]]
+    <NDArray 6x2 @cpu(0)>
+
+    >>> # Case 2: Unidirectional bipartite graph
+    >>> u = [0, 1, 0, 0, 1]
+    >>> v = [0, 1, 2, 3, 2]
+    >>> g = dgl.bipartite((u, v))
+    >>> u_fea = mx.nd.random.randn(2, 5)
+    >>> v_fea = mx.nd.random.randn(4, 5)
+    >>> conv = GraphConv(5, 2, norm='both', weight=True, bias=True)
+    >>> conv.initialize(ctx=mx.cpu(0))
+    >>> res = conv(g, (u_fea, v_fea))
+    >>> res
+    [[ 0.26967263  0.308129  ]
+    [ 0.05143356 -0.11355402]
+    [ 0.22705637  0.1375853 ]
+    [ 0.26967263  0.308129  ]]
+    <NDArray 4x2 @cpu(0)>
    """
    def __init__(self,
                 in_feats,
@@ -71,7 +133,8 @@ class GraphConv(gluon.Block):
                 norm='both',
                 weight=True,
                 bias=True,
-                 activation=None):
+                 activation=None,
+                 allow_zero_in_degree=False):
        super(GraphConv, self).__init__()
        if norm not in ('none', 'both', 'right'):
            raise DGLError('Invalid norm value. Must be either "none", "both" or "right".'
@@ -79,6 +142,7 @@ class GraphConv(gluon.Block):
        self._in_feats = in_feats
        self._out_feats = out_feats
        self._norm = norm
+        self._allow_zero_in_degree = allow_zero_in_degree

        with self.name_scope():
            if weight:
@@ -96,15 +160,11 @@ class GraphConv(gluon.Block):
        self._activation = activation

    def forward(self, graph, feat, weight=None):
-        r"""Compute graph convolution.
+        r"""

-        Notes
-        -----
-        * Input shape: :math:`(N, *, \text{in_feats})` where * means any number of additional
-          dimensions, :math:`N` is the number of nodes.
-        * Output shape: :math:`(N, *, \text{out_feats})` where all but the last dimension are
-          the same shape as the input.
-        * Weight shape: :math:`(\text{in_feats}, \text{out_feats})`.
+        Description
+        -----------
+        Compute graph convolution.

        Parameters
        ----------
@@ -126,8 +186,35 @@ class GraphConv(gluon.Block):
        -------
        mxnet.NDArray
            The output feature
+
+        Raises
+        ------
+        DGLError
+            If there are 0-in-degree nodes in the input graph, it will raise DGLError
+            since no message will be passed to those nodes. This will cause invalid output.
+            The error can be ignored by setting ``allow_zero_in_degree`` parameter to ``True``.
+
+        Notes
+        -----
+        * Input shape: :math:`(N, *, \text{in_feats})` where * means any number of additional
+          dimensions, :math:`N` is the number of nodes.
+        * Output shape: :math:`(N, *, \text{out_feats})` where all but the last dimension are
+          the same shape as the input.
+        * Weight shape: :math:`(\text{in_feats}, \text{out_feats})`.
        """
        with graph.local_scope():
+            if not self._allow_zero_in_degree:
+                if (graph.in_degrees() == 0).asnumpy().any():
+                    raise DGLError('There are 0-in-degree nodes in the graph, '
+                                   'output for those nodes will be invalid. '
+                                   'This is harmful for some applications, '
+                                   'causing silent performance regression. '
+                                   'Adding self-loop on the input graph by '
+                                   'calling `g = dgl.add_self_loop(g)` will resolve '
+                                   'the issue. Setting ``allow_zero_in_degree`` '
+                                   'to be `True` when constructing this module will '
+                                   'suppress the check and let the code run.')
+
            feat_src, feat_dst = expand_as_pair(feat, graph)

            if self._norm == 'both':

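Note on the `GraphConv` docstring above: the normalizer and the zero in-degree check can be illustrated without any framework. The following is a minimal pure-Python sketch, not the DGL implementation. It assumes scalar node features, omits the weight, bias and activation, and assumes that :math:`c_{ij}` uses the in-degree of the destination node and the out-degree of the source node. The helper names `add_self_loops` and `normalized_sum` are hypothetical, and `ValueError` stands in for `DGLError`.

```python
import math

def add_self_loops(num_nodes, src, dst):
    """Append one edge i -> i for every node (homogeneous graphs only)."""
    loops = list(range(num_nodes))
    return src + loops, dst + loops

def normalized_sum(num_nodes, src, dst, feat, allow_zero_in_degree=False):
    """Sum neighbor features scaled by 1 / c_ij (no weight, bias or activation)."""
    in_deg = [0] * num_nodes
    out_deg = [0] * num_nodes
    for u, v in zip(src, dst):
        out_deg[u] += 1
        in_deg[v] += 1
    # Mirror of the check added in this commit; ValueError stands in for DGLError.
    if not allow_zero_in_degree and 0 in in_deg:
        raise ValueError("graph has 0-in-degree nodes; add self-loops first")
    out = [0.0] * num_nodes
    for u, v in zip(src, dst):
        # Assumption: c_ij = sqrt(in-degree of i) * sqrt(out-degree of j).
        c = math.sqrt(in_deg[v]) * math.sqrt(out_deg[u])
        out[v] += feat[u] / c
    return out

# Same edge list as the docstring example; node 5 has no incoming edge.
src = [0, 1, 2, 3, 2, 5]
dst = [1, 2, 3, 4, 0, 3]
feat = [1.0] * 6

try:
    normalized_sum(6, src, dst, feat)
    raised = False
except ValueError:
    raised = True

# With the check suppressed, node 5 silently receives nothing.
unchecked = normalized_sum(6, src, dst, feat, allow_zero_in_degree=True)

# With self-loops, every node receives at least its own feature.
loop_src, loop_dst = add_self_loops(6, src, dst)
with_loops = normalized_sum(6, loop_src, loop_dst, feat)
```

With the check suppressed, node 5 ends up with an all-zero aggregate, which is the silent failure mode the new `allow_zero_in_degree` flag is meant to make explicit.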
--- a/python/dgl/nn/pytorch/conv/__init__.py
+++ b/python/dgl/nn/pytorch/conv/__init__.py
@@ -20,8 +20,9 @@ from .densegraphconv import DenseGraphConv
 from .densesageconv import DenseSAGEConv
 from .atomicconv import AtomicConv
 from .cfconv import CFConv
+from .dotgatconv import DotGatConv

 __all__ = ['GraphConv', 'GATConv', 'TAGConv', 'RelGraphConv', 'SAGEConv',
           'SGConv', 'APPNPConv', 'GINConv', 'GatedGraphConv', 'GMMConv',
           'ChebConv', 'AGNNConv', 'NNConv', 'DenseGraphConv', 'DenseSAGEConv',
-           'DenseChebConv', 'EdgeConv', 'AtomicConv', 'CFConv']
+           'DenseChebConv', 'EdgeConv', 'AtomicConv', 'CFConv', 'DotGatConv']
--- a/python/dgl/nn/pytorch/conv/agnnconv.py
+++ b/python/dgl/nn/pytorch/conv/agnnconv.py
@@ -6,11 +6,16 @@ from torch.nn import functional as F

 from .... import function as fn
 from ....ops import edge_softmax
+from ....base import DGLError
 from ....utils import expand_as_pair


 class AGNNConv(nn.Module):
-    r"""Attention-based Graph Neural Network layer from paper `Attention-based
+    r"""
+
+    Description
+    -----------
+    Attention-based Graph Neural Network layer from paper `Attention-based
    Graph Neural Network for Semi-Supervised Learning
    <https://arxiv.org/abs/1803.03735>`__.

@@ -22,24 +27,75 @@ class AGNNConv(nn.Module):
    .. math::
        P_{ij} = \mathrm{softmax}_i ( \beta \cdot \cos(h_i^l, h_j^l))

+    where :math:`\beta` is a single scalar parameter.
+
    Parameters
    ----------
    init_beta : float, optional
-        The :math:`\beta` in the formula.
+        The :math:`\beta` in the formula, a single scalar parameter.
    learn_beta : bool, optional
        If True, :math:`\beta` will be learnable parameter.
+    allow_zero_in_degree : bool, optional
+        If there are 0-in-degree nodes in the graph, the output for those nodes will be invalid
+        since no message will be passed to them. This is harmful for some applications,
+        causing silent performance regression. This module raises a DGLError if it detects
+        0-in-degree nodes in the input graph. Setting this to ``True`` suppresses the check
+        and leaves handling of such nodes to the user. Default: ``False``.
+
+    Notes
+    -----
+    Zero in-degree nodes will lead to invalid output values, because no message is
+    passed to those nodes and the aggregation function is applied to empty input.
+    A common practice to avoid this is to add a self-loop for each node in the graph if
+    it is homogeneous, which can be achieved by:
+
+    >>> g = ... # a DGLGraph
+    >>> g = dgl.add_self_loop(g)
+
+    Calling ``add_self_loop`` will not work for some graphs, for example, heterogeneous graphs,
+    since the edge type cannot be decided for self-loop edges. Set ``allow_zero_in_degree``
+    to ``True`` in those cases to unblock the code and handle zero-in-degree nodes manually.
+    A common practice is to filter out the zero-in-degree nodes from the output
+    after the convolution.
+
+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import AGNNConv
+    >>>
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> g = dgl.add_self_loop(g)
+    >>> feat = th.ones(6, 10)
+    >>> conv = AGNNConv()
+    >>> res = conv(g, feat)
+    >>> res
+    tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+            [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+            [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+            [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+            [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+            [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
+        grad_fn=<BinaryReduceBackward>)
    """
    def __init__(self,
                 init_beta=1.,
-                 learn_beta=True):
+                 learn_beta=True,
+                 allow_zero_in_degree=False):
        super(AGNNConv, self).__init__()
+        self._allow_zero_in_degree = allow_zero_in_degree
        if learn_beta:
            self.beta = nn.Parameter(th.Tensor([init_beta]))
        else:
            self.register_buffer('beta', th.Tensor([init_beta]))

    def forward(self, graph, feat):
-        r"""Compute AGNN layer.
+        r"""
+
+        Description
+        -----------
+        Compute AGNN layer.

        Parameters
        ----------
@@ -49,7 +105,7 @@ class AGNNConv(nn.Module):
            The input feature of shape :math:`(N, *)` :math:`N` is the
            number of nodes, and :math:`*` could be of any shape.
            If a pair of torch.Tensor is given, the pair must contain two tensors of shape
-            :math:`(N_{in}, *)` and :math:`(N_{out}, *})`, the the :math:`*` in the later
+            :math:`(N_{in}, *)` and :math:`(N_{out}, *)`, the :math:`*` in the latter
            tensor must equal the previous one.

        Returns
@@ -57,9 +113,29 @@ class AGNNConv(nn.Module):
        torch.Tensor
            The output feature of shape :math:`(N, *)` where :math:`*`
            should be the same as input shape.
+
+        Raises
+        ------
+        DGLError
+            If there are 0-in-degree nodes in the input graph, it will raise DGLError
+            since no message will be passed to those nodes. This will cause invalid output.
+            The error can be ignored by setting ``allow_zero_in_degree`` parameter to ``True``.
        """
        with graph.local_scope():
+            if not self._allow_zero_in_degree:
+                if (graph.in_degrees() == 0).any():
+                    raise DGLError('There are 0-in-degree nodes in the graph, '
+                                   'output for those nodes will be invalid. '
+                                   'This is harmful for some applications, '
+                                   'causing silent performance regression. '
+                                   'Adding self-loop on the input graph by '
+                                   'calling `g = dgl.add_self_loop(g)` will resolve '
+                                   'the issue. Setting ``allow_zero_in_degree`` '
+                                   'to be `True` when constructing this module will '
+                                   'suppress the check and let the code run.')
+
            feat_src, feat_dst = expand_as_pair(feat, graph)
+
            graph.srcdata['h'] = feat_src
            graph.srcdata['norm_h'] = F.normalize(feat_src, p=2, dim=-1)
            if isinstance(feat, tuple) or graph.is_block:

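Note on the `AGNNConv` docstring above: the attention weights :math:`P_{ij}` are a softmax, over the incoming neighbors of node :math:`i`, of :math:`\beta` times the cosine similarity of the two feature vectors. Below is a minimal pure-Python sketch under the assumption that the layer output is the :math:`P`-weighted sum of neighbor features. The function names `cosine` and `agnn_layer` are hypothetical and this is not the DGL code path.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def agnn_layer(num_nodes, src, dst, feat, beta=1.0):
    """Return sum_j P_ij * h_j, with P_ij a softmax over incoming neighbors j of i."""
    incoming = [[] for _ in range(num_nodes)]
    for u, v in zip(src, dst):
        incoming[v].append(u)
    out = []
    for i in range(num_nodes):
        # An empty neighbor list would make the softmax undefined,
        # which is why the layer rejects 0-in-degree nodes by default.
        scores = [beta * cosine(feat[i], feat[j]) for j in incoming[i]]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [0.0] * len(feat[i])
        for j, e in zip(incoming[i], exps):
            for d in range(len(row)):
                row[d] += (e / total) * feat[j][d]
        out.append(row)
    return out

# Docstring graph with one self-loop per node and all-ones features.
src = [0, 1, 2, 3, 2, 5] + list(range(6))
dst = [1, 2, 3, 4, 0, 3] + list(range(6))
feat = [[1.0] * 10 for _ in range(6)]
res = agnn_layer(6, src, dst, feat)
```

Because all feature vectors are identical, every cosine is 1, the softmax is uniform, and each output row stays all ones, which is consistent with the tensor printed in the docstring example.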
--- a/python/dgl/nn/pytorch/conv/appnpconv.py
+++ b/python/dgl/nn/pytorch/conv/appnpconv.py
@@ -5,27 +5,57 @@ from torch import nn

 from .... import function as fn

-
 class APPNPConv(nn.Module):
-    r"""Approximate Personalized Propagation of Neural Predictions
+    r"""
+
+    Description
+    -----------
+    Approximate Personalized Propagation of Neural Predictions
    layer from paper `Predict then Propagate: Graph Neural Networks
    meet Personalized PageRank <https://arxiv.org/pdf/1810.05997.pdf>`__.

    .. math::
-        H^{0} & = X
+        H^{0} &= X
+
+        H^{l+1} &= (1-\alpha)\left(\tilde{D}^{-1/2}
+        \tilde{A} \tilde{D}^{-1/2} H^{l}\right) + \alpha H^{0}

-        H^{t+1} & = (1-\alpha)\left(\hat{D}^{-1/2}
-        \hat{A} \hat{D}^{-1/2} H^{t}\right) + \alpha H^{0}
+    where :math:`\tilde{A}` is :math:`A` + :math:`I`.

    Parameters
    ----------
    k : int
-        Number of iterations :math:`K`.
+        The number of iterations :math:`K`.
    alpha : float
        The teleport probability :math:`\alpha`.
    edge_drop : float, optional
-        Dropout rate on edges that controls the
+        The dropout rate on edges that controls the
        messages received by each node. Default: ``0``.
+
+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import APPNPConv
+    >>>
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> feat = th.ones(6, 10)
+    >>> conv = APPNPConv(k=3, alpha=0.5)
+    >>> res = conv(g, feat)
+    >>> res
+    tensor([[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
+            1.0000],
+            [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
+            1.0000],
+            [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
+            1.0000],
+            [1.0303, 1.0303, 1.0303, 1.0303, 1.0303, 1.0303, 1.0303, 1.0303, 1.0303,
+            1.0303],
+            [0.8643, 0.8643, 0.8643, 0.8643, 0.8643, 0.8643, 0.8643, 0.8643, 0.8643,
+            0.8643],
+            [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000,
+            0.5000]])
    """
    def __init__(self,
                 k,
@@ -37,14 +67,18 @@ class APPNPConv(nn.Module):
        self.edge_drop = nn.Dropout(edge_drop)

    def forward(self, graph, feat):
-        r"""Compute APPNP layer.
+        r"""
+
+        Description
+        -----------
+        Compute APPNP layer.

        Parameters
        ----------
        graph : DGLGraph
            The graph.
        feat : torch.Tensor
-            The input feature of shape :math:`(N, *)` :math:`N` is the
+            The input feature of shape :math:`(N, *)`. :math:`N` is the
            number of nodes, and :math:`*` could be of any shape.

        Returns

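Note on the `APPNPConv` docstring above: the propagation rule can be traced by hand on the docstring graph. The sketch below is pure Python with scalar features and is not the DGL implementation. It assumes that both sides of the aggregation are scaled by the in-degree to the power -1/2 with zero degrees clamped to 1, that no self-loops are added, and that edge dropout is disabled. The function name `appnp` is hypothetical.

```python
def appnp(num_nodes, src, dst, feat, k, alpha):
    """Run k steps of H <- (1 - alpha) * norm * (A @ (norm * H)) + alpha * H0."""
    in_deg = [0] * num_nodes
    for v in dst:
        in_deg[v] += 1
    # Assumption: symmetric scaling by in-degree^(-1/2), degree 0 clamped to 1.
    norm = [max(d, 1) ** -0.5 for d in in_deg]
    h0 = list(feat)
    h = list(feat)
    for _ in range(k):
        agg = [0.0] * num_nodes
        for u, v in zip(src, dst):
            agg[v] += h[u] * norm[u]
        h = [(1 - alpha) * agg[i] * norm[i] + alpha * h0[i]
             for i in range(num_nodes)]
    return h

src = [0, 1, 2, 3, 2, 5]
dst = [1, 2, 3, 4, 0, 3]
res = appnp(6, src, dst, [1.0] * 6, k=3, alpha=0.5)
rounded = [round(x, 4) for x in res]
```

Under these assumptions, `rounded` is `[1.0, 1.0, 1.0, 1.0303, 0.8643, 0.5]`, which agrees with each column of the docstring output. Node 5 has no incoming edge, so it keeps only the teleport term :math:`\alpha H^{0}`.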
--- a/python/dgl/nn/pytorch/conv/atomicconv.py
+++ b/python/dgl/nn/pytorch/conv/atomicconv.py
@@ -5,7 +5,11 @@ import torch as th
 import torch.nn as nn

 class RadialPooling(nn.Module):
-    r"""Radial pooling from paper `Atomic Convolutional Networks for
+    r"""
+
+    Description
+    -----------
+    Radial pooling from paper `Atomic Convolutional Networks for
    Predicting Protein-Ligand Binding Affinity <https://arxiv.org/abs/1703.10603>`__.

    We denote the distance between atom :math:`i` and :math:`j` by :math:`r_{ij}`.
@@ -53,7 +57,11 @@ class RadialPooling(nn.Module):
            rbf_kernel_scaling.reshape(-1, 1, 1), requires_grad=True)

    def forward(self, distances):
-        """Apply the layer to transform edge distances.
+        """
+
+        Description
+        -----------
+        Apply the layer to transform edge distances.

        Parameters
        ----------
@@ -81,7 +89,11 @@ class RadialPooling(nn.Module):
        return rbf_kernel_results * cutoff_values

 def msg_func(edges):
-    """Send messages along edges.
+    """
+
+    Description
+    -----------
+    Send messages along edges.

    Parameters
    ----------
@@ -99,7 +111,11 @@ def msg_func(edges):
        'ij,ik->ijk', edges.src['hv'], edges.data['he']).view(len(edges), -1)}

 def reduce_func(nodes):
-    """Collect messages and update node representations.
+    """
+
+    Description
+    -----------
+    Collect messages and update node representations.

    Parameters
    ----------
@@ -116,10 +132,14 @@ def reduce_func(nodes):
    return {'hv_new': nodes.mailbox['m'].sum(1)}

 class AtomicConv(nn.Module):
-    r"""Atomic Convolution Layer from paper `Atomic Convolutional Networks for
+    r"""
+
+    Description
+    -----------
+    Atomic Convolution Layer from paper `Atomic Convolutional Networks for
    Predicting Protein-Ligand Binding Affinity <https://arxiv.org/abs/1703.10603>`__.

-    We denote the type of atom :math:`i` by :math:`z_i` and the distance between atom
+    Denote the type of atom :math:`i` by :math:`z_i` and the distance between atom
    :math:`i` and :math:`j` by :math:`r_{ij}`.

    **Distance Transformation**
@@ -155,20 +175,7 @@ class AtomicConv(nn.Module):
    .. math::
        p_{i, t}^{k} = \sum_{j\in N(i)} e_{ij}^{k} * 1(z_j == t)

-    We concatenate the results for all RBF kernels and atom types.
-
-    Notes
-    -----
-
-    * This convolution operation is designed for molecular graphs in Chemistry, but it might
-      be possible to extend it to more general graphs.
-
-    * There seems to be an inconsistency about the definition of :math:`e_{ij}^{k}` in the
-      paper and the author's implementation. We follow the author's implementation. In the
-      paper, :math:`e_{ij}^{k}` was defined as
-      :math:`\exp(-\gamma_{k}|r_{ij}-r_{k}|^2 * f_{ij}^{k})`.
-
-    * :math:`\gamma_{k}`, :math:`r_k` and :math:`c_k` are all learnable.
+    Then concatenate the results for all RBF kernels and atom types.

    Parameters
    ----------
@@ -183,6 +190,42 @@ class AtomicConv(nn.Module):
    features_to_use : None or float tensor of shape (T)
        In the original paper, these are atomic numbers to consider, representing the types
        of atoms. T for the number of types of atomic numbers. Default to None.
+
+    Notes
+    -----
+
+    * This convolution operation is designed for molecular graphs in Chemistry, but it might
+      be possible to extend it to more general graphs.
+
+    * There seems to be an inconsistency about the definition of :math:`e_{ij}^{k}` in the
+      paper and the author's implementation. We follow the author's implementation. In the
+      paper, :math:`e_{ij}^{k}` was defined as
+      :math:`\exp(-\gamma_{k}|r_{ij}-r_{k}|^2 * f_{ij}^{k})`.
+
+    * :math:`\gamma_{k}`, :math:`r_k` and :math:`c_k` are all learnable.
+
+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import AtomicConv
+
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> feat = th.ones(6, 1)
+    >>> edist = th.ones(6, 1)
+    >>> interaction_cutoffs = th.ones(3).float() * 2
+    >>> rbf_kernel_means = th.ones(3).float()
+    >>> rbf_kernel_scaling = th.ones(3).float()
+    >>> conv = AtomicConv(interaction_cutoffs, rbf_kernel_means, rbf_kernel_scaling)
+    >>> res = conv(g, feat, edist)
+    >>> res
+    tensor([[0.5000, 0.5000, 0.5000],
+            [0.5000, 0.5000, 0.5000],
+            [0.5000, 0.5000, 0.5000],
+            [1.0000, 1.0000, 1.0000],
+            [0.5000, 0.5000, 0.5000],
+            [0.0000, 0.0000, 0.0000]], grad_fn=<ViewBackward>)
    """
    def __init__(self, interaction_cutoffs, rbf_kernel_means,
                 rbf_kernel_scaling, features_to_use=None):
@@ -199,23 +242,27 @@ class AtomicConv(nn.Module):
            self.features_to_use = nn.Parameter(features_to_use, requires_grad=False)

    def forward(self, graph, feat, distances):
-        """Apply the atomic convolution layer.
+        """
+
+        Description
+        -----------
+        Apply the atomic convolution layer.

        Parameters
        ----------
        graph : DGLGraph
            Topology based on which message passing is performed.
-        feat : Float32 tensor of shape (V, 1)
+        feat : Float32 tensor of shape :math:`(V, 1)`
            Initial node features, which are atomic numbers in the paper.
-            V for the number of nodes.
-        distances : Float32 tensor of shape (E, 1)
+            :math:`V` for the number of nodes.
+        distances : Float32 tensor of shape :math:`(E, 1)`
            Distance between end nodes of edges. E for the number of edges.

        Returns
        -------
-        Float32 tensor of shape (V, K * T)
-            Updated node representations. V for the number of nodes, K for the
-            number of radial filters, and T for the number of types of atomic numbers.
+        Float32 tensor of shape :math:`(V, K * T)`
+            Updated node representations. :math:`V` for the number of nodes, :math:`K` for the
+            number of radial filters, and :math:`T` for the number of types of atomic numbers.
        """
        with graph.local_scope():
            radial_pooled_values = self.radial_pooling(distances)                # (K, E, 1)

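Note on the `AtomicConv` example above: the printed values can be reasoned about with a small pure-Python sketch. The formulas used here are assumptions, since the diff does not show them in full: a Gaussian radial basis term :math:`\exp(-\gamma_k (r_{ij} - r_k)^2)` multiplied by a cosine cutoff :math:`0.5(\cos(\pi r_{ij} / c_k) + 1)` for :math:`r_{ij} \le c_k` and 0 otherwise. The sketch also ignores `features_to_use` and treats node features as scalars. The names `radial_pool` and `atomic_conv` are hypothetical.

```python
import math

def radial_pool(r, cutoff, mean, scaling):
    """Assumed form of one radial filter: Gaussian RBF times a cosine cutoff."""
    rbf = math.exp(-scaling * (r - mean) ** 2)
    if r <= cutoff:
        damp = 0.5 * (math.cos(math.pi * r / cutoff) + 1)
    else:
        damp = 0.0
    return rbf * damp

def atomic_conv(num_nodes, src, dst, feat, dist, cutoffs, means, scalings):
    """Sum, per destination node and per radial filter, source feature times filter value."""
    num_filters = len(cutoffs)
    out = [[0.0] * num_filters for _ in range(num_nodes)]
    for e, (u, v) in enumerate(zip(src, dst)):
        for k in range(num_filters):
            out[v][k] += feat[u] * radial_pool(dist[e], cutoffs[k], means[k], scalings[k])
    return out

src = [0, 1, 2, 3, 2, 5]
dst = [1, 2, 3, 4, 0, 3]
feat = [1.0] * 6   # one scalar feature per node
dist = [1.0] * 6   # one distance per edge
res = atomic_conv(6, src, dst, feat, dist,
                  cutoffs=[2.0] * 3, means=[1.0] * 3, scalings=[1.0] * 3)
```

With distance 1, mean 1 and cutoff 2, each edge contributes 0.5 per filter, so a node's value is 0.5 times its number of incoming edges: 1.0 for node 3, 0.0 for node 5, and 0.5 elsewhere.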
--- a/python/dgl/nn/pytorch/conv/cfconv.py
+++ b/python/dgl/nn/pytorch/conv/cfconv.py
@@ -6,12 +6,16 @@ import torch.nn as nn
 from .... import function as fn

 class ShiftedSoftplus(nn.Module):
-    r"""Applies the element-wise function:
+    r"""
+
+    Description
+    -----------
+    Applies the element-wise function:

    .. math::
        \text{SSP}(x) = \frac{1}{\beta} * \log(1 + \exp(\beta * x)) - \log(\text{shift})

-    Parameters
+    Attributes
    ----------
    beta : int
        :math:`\beta` value for the mathematical formulation. Default to 1.
@@ -25,7 +29,11 @@ class ShiftedSoftplus(nn.Module):
        self.softplus = nn.Softplus(beta=beta, threshold=threshold)

    def forward(self, inputs):
-        """Applies the activation function.
+        """
+
+        Description
+        -----------
+        Applies the activation function.

        Parameters
        ----------
@@ -40,23 +48,54 @@ class ShiftedSoftplus(nn.Module):
        return self.softplus(inputs) - np.log(float(self.shift))

 class CFConv(nn.Module):
-    r"""CFConv in SchNet.
+    r"""
+
+    Description
+    -----------
+    CFConv in SchNet.

    SchNet is introduced in `SchNet: A continuous-filter convolutional neural network for
    modeling quantum interactions <https://arxiv.org/abs/1706.08566>`__.

    It combines node and edge features in message passing and updates node representations.

+    .. math::
+        h_i^{(l+1)} = \sum_{j\in \mathcal{N}(i)} h_j^{(l)} \circ W^{(l)}e_{ij}
+
+    where :math:`\circ` represents element-wise multiplication and :math:`\text{SSP}` is defined as:
+
+    .. math::
+        \text{SSP}(x) = \frac{1}{\beta} * \log(1 + \exp(\beta * x)) - \log(\text{shift})
+
    Parameters
    ----------
    node_in_feats : int
-        Size for the input node features.
+        Size for the input node features :math:`h_j^{(l)}`.
    edge_in_feats : int
-        Size for the input edge features.
+        Size for the input edge features :math:`e_{ij}`.
    hidden_feats : int
        Size for the hidden representations.
    out_feats : int
-        Size for the output representations.
+        Size for the output representations :math:`h_i^{(l+1)}`.
+
+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import CFConv
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> nfeat = th.ones(6, 10)
+    >>> efeat = th.ones(6, 5)
+    >>> conv = CFConv(10, 5, 3, 2)
+    >>> res = conv(g, nfeat, efeat)
+    >>> res
+    tensor([[-0.1209, -0.2289],
+            [-0.1209, -0.2289],
+            [-0.1209, -0.2289],
+            [-0.1135, -0.2338],
+            [-0.1209, -0.2289],
+            [-0.1283, -0.2240]], grad_fn=<SubBackward0>)
    """
    def __init__(self, node_in_feats, edge_in_feats, hidden_feats, out_feats):
        super(CFConv, self).__init__()
@@ -74,7 +113,11 @@ class CFConv(nn.Module):
        )

    def forward(self, g, node_feats, edge_feats):
-        """Performs message passing and updates node representations.
+        """
+
+        Description
+        -----------
+        Performs message passing and updates node representations.

        Parameters
        ----------

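Note on the `CFConv` docstring above: both formulas are short enough to sketch in pure Python. This is an illustration only, not the DGL module. The `edge_filter` argument stands for the already projected edge term :math:`W^{(l)} e_{ij}`, and the learnable projections of the real layer are left out. The function names `shifted_softplus` and `cf_aggregate` are hypothetical.

```python
import math

def shifted_softplus(x, beta=1, shift=2):
    """SSP(x) = log(1 + exp(beta * x)) / beta - log(shift)."""
    return math.log(1 + math.exp(beta * x)) / beta - math.log(shift)

def cf_aggregate(num_nodes, src, dst, node_h, edge_filter):
    """h_i = sum over incoming edges (j -> i) of h_j * filter_ij, element-wise."""
    dim = len(node_h[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for e, (u, v) in enumerate(zip(src, dst)):
        for d in range(dim):
            out[v][d] += node_h[u][d] * edge_filter[e][d]
    return out

# With shift = 2 and beta = 1 the activation passes through the origin.
ssp_zero = shifted_softplus(0.0)

# Two edges into node 2, each with its own element-wise filter.
node_h = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
edge_filter = [[0.5, 0.5], [1.0, 2.0]]
res = cf_aggregate(3, [0, 1], [2, 2], node_h, edge_filter)
```

Here node 2 receives `[1*0.5 + 3*1, 2*0.5 + 4*2]`, while nodes 0 and 1 have no incoming edges and stay at zero.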
--- a/python/dgl/nn/pytorch/conv/chebconv.py
+++ b/python/dgl/nn/pytorch/conv/chebconv.py
@@ -9,7 +9,11 @@ from .... import laplacian_lambda_max, broadcast_nodes, function as fn


 class ChebConv(nn.Module):
-    r"""Chebyshev Spectral Graph Convolution layer from paper `Convolutional
+    r"""
+
+    Description
+    -----------
+    Chebyshev Spectral Graph Convolution layer from paper `Convolutional
    Neural Networks on Graphs with Fast Localized Spectral Filtering
    <https://arxiv.org/pdf/1606.09375.pdf>`__.

@@ -18,24 +22,50 @@ class ChebConv(nn.Module):

        Z^{0, l} &= H^{l}

-        Z^{1, l} &= \hat{L} \cdot H^{l}
+        Z^{1, l} &= \tilde{L} \cdot H^{l}
+
+        Z^{k, l} &= 2 \cdot \tilde{L} \cdot Z^{k-1, l} - Z^{k-2, l}

-        Z^{k, l} &= 2 \cdot \hat{L} \cdot Z^{k-1, l} - Z^{k-2, l}
+        \tilde{L} &= 2\left(I - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}\right)/\lambda_{max} - I
+
+    where :math:`\tilde{A}` is :math:`A` + :math:`I`, and :math:`W` is a learnable weight.

-        \hat{L} &= 2\left(I - \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}\right)/\lambda_{max} - I

    Parameters
    ----------
    in_feats: int
-        Number of input features.
+        Dimension of input features; i.e., the number of dimensions of :math:`h_i^{(l)}`.
    out_feats: int
-        Number of output features.
+        Dimension of output features :math:`h_i^{(l+1)}`.
    k : int
-        Chebyshev filter size.
+        Chebyshev filter size :math:`K`.
    activation : function, optional
-        Activation function, default is ReLu.
+        Activation function. Default: ``ReLU``.
    bias : bool, optional
        If True, adds a learnable bias to the output. Default: ``True``.
+
+    Note
+    ----
+    ChebConv only supports DGLGraph as input for now. Passing a heterograph raises an error. To be fixed.
+
+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import ChebConv
+    >>>
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> feat = th.ones(6, 10)
+    >>> conv = ChebConv(10, 2, 2)
+    >>> res = conv(g, feat)
+    >>> res
+    tensor([[ 0.6163, -0.1809],
+            [ 0.6163, -0.1809],
+            [ 0.6163, -0.1809],
+            [ 0.9698, -1.5053],
+            [ 0.3664,  0.7556],
+            [-0.2370,  3.0164]], grad_fn=<AddBackward0>)
    """

    def __init__(self,
@@ -52,7 +82,11 @@ class ChebConv(nn.Module):
        self.linear = nn.Linear(k * in_feats, out_feats, bias)

    def forward(self, graph, feat, lambda_max=None):
-        r"""Compute ChebNet layer.
+        r"""
+
+        Description
+        -----------
+        Compute ChebNet layer.

        Parameters
        ----------

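Reviewer sketch, not part of the patch: the recurrence documented in the ChebConv docstring above can be illustrated with a minimal pure-Python example. The helper names `matmul` and `chebyshev_basis` are hypothetical and are not DGL APIs. The sketch assumes `l_tilde` is an already rescaled Laplacian given as nested lists, and that `k >= 1` terms are wanted, as the `nn.Linear(k * in_feats, out_feats, bias)` line in the patch suggests.

```python
def matmul(a, b):
    """Multiply two dense matrices given as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def chebyshev_basis(l_tilde, h, k):
    """Return [Z^0, ..., Z^(k-1)] with Z^0 = H, Z^1 = L~ H and
    Z^n = 2 * L~ * Z^(n-1) - Z^(n-2). Assumes k >= 1."""
    zs = [h]
    if k > 1:
        zs.append(matmul(l_tilde, h))
    for _ in range(2, k):
        lz = matmul(l_tilde, zs[-1])   # L~ * Z^(n-1)
        prev = zs[-2]                  # Z^(n-2)
        zs.append([[2 * lz[i][j] - prev[i][j] for j in range(len(h[0]))]
                   for i in range(len(h))])
    return zs


# 1x1 toy case: the recurrence reproduces the Chebyshev polynomials T_n(0.5).
basis = chebyshev_basis([[0.5]], [[1.0]], 4)
print(basis)
```

In the layer itself the terms would presumably be concatenated along the feature axis before the linear projection; that step is omitted here.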
--- a/python/dgl/nn/pytorch/conv/densechebconv.py
+++ b/python/dgl/nn/pytorch/conv/densechebconv.py
@@ -6,7 +6,11 @@ from torch.nn import init


 class DenseChebConv(nn.Module):
-    r"""Chebyshev Spectral Graph Convolution layer from paper `Convolutional
+    r"""
+
+    Description
+    -----------
+    Chebyshev Spectral Graph Convolution layer from paper `Convolutional
    Neural Networks on Graphs with Fast Localized Spectral Filtering
    <https://arxiv.org/pdf/1606.09375.pdf>`__.

@@ -15,17 +19,43 @@ class DenseChebConv(nn.Module):
    Parameters
    ----------
    in_feats: int
-        Number of input features.
+        Dimension of input features :math:`h_i^{(l)}`.
    out_feats: int
-        Number of output features.
+        Dimension of output features :math:`h_i^{(l+1)}`.
    k : int
        Chebyshev filter size.
+    activation : function, optional
+        Activation function, default is ReLU.
    bias : bool, optional
        If True, adds a learnable bias to the output. Default: ``True``.

+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import DenseChebConv
+    >>>
+    >>> feat = th.ones(6, 10)
+    >>> adj = th.tensor([[0., 0., 1., 0., 0., 0.],
+    ...         [1., 0., 0., 0., 0., 0.],
+    ...         [0., 1., 0., 0., 0., 0.],
+    ...         [0., 0., 1., 0., 0., 1.],
+    ...         [0., 0., 0., 1., 0., 0.],
+    ...         [0., 0., 0., 0., 0., 0.]])
+    >>> conv = DenseChebConv(10, 2, 2)
+    >>> res = conv(adj, feat)
+    >>> res
+    tensor([[-3.3516, -2.4797],
+            [-3.3516, -2.4797],
+            [-3.3516, -2.4797],
+            [-4.5192, -3.0835],
+            [-2.5259, -2.0527],
+            [-0.5327, -1.0219]], grad_fn=<AddBackward0>)
+
    See also
    --------
-    ChebConv
+    `ChebConv <https://docs.dgl.ai/api/python/nn.pytorch.html#chebconv>`__
    """
    def __init__(self,
                 in_feats,
@@ -51,7 +81,11 @@ class DenseChebConv(nn.Module):
            init.xavier_normal_(self.W[i], init.calculate_gain('relu'))

    def forward(self, adj, feat, lambda_max=None):
-        r"""Compute (Dense) Chebyshev Spectral Graph Convolution layer.
+        r"""
+
+        Description
+        -----------
+        Compute (Dense) Chebyshev Spectral Graph Convolution layer.

        Parameters
        ----------

--- a/python/dgl/nn/pytorch/conv/densegraphconv.py
+++ b/python/dgl/nn/pytorch/conv/densegraphconv.py
@@ -6,7 +6,11 @@ from torch.nn import init


 class DenseGraphConv(nn.Module):
-    """Graph Convolutional Network layer where the graph structure
+    """
+
+    Description
+    -----------
+    Graph Convolutional Network layer where the graph structure
    is given by an adjacency matrix.
    We recommend user to use this module when applying graph convolution on
    dense graphs.
@@ -14,23 +18,53 @@ class DenseGraphConv(nn.Module):
    Parameters
    ----------
    in_feats : int
-        Input feature size.
+        Input feature size; i.e., the number of dimensions of :math:`h_j^{(l)}`.
    out_feats : int
-        Output feature size.
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
    norm : str, optional
        How to apply the normalizer. If is `'right'`, divide the aggregated messages
        by each node's in-degrees, which is equivalent to averaging the received messages.
        If is `'none'`, no normalization is applied. Default is `'both'`,
        where the :math:`c_{ij}` in the paper is applied.
-    bias : bool
+    bias : bool, optional
        If True, adds a learnable bias to the output. Default: ``True``.
    activation : callable activation function/layer or None, optional
        If not None, applies an activation function to the updated node features.
        Default: ``None``.

+    Notes
+    -----
+    Zero in-degree nodes will lead to all-zero output. A common practice
+    to avoid this is to add a self-loop for each node in the graph,
+    which can be achieved by setting the diagonal of the adjacency matrix to be 1.
+
+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import DenseGraphConv
+    >>>
+    >>> feat = th.ones(6, 10)
+    >>> adj = th.tensor([[0., 0., 1., 0., 0., 0.],
+    ...         [1., 0., 0., 0., 0., 0.],
+    ...         [0., 1., 0., 0., 0., 0.],
+    ...         [0., 0., 1., 0., 0., 1.],
+    ...         [0., 0., 0., 1., 0., 0.],
+    ...         [0., 0., 0., 0., 0., 0.]])
+    >>> conv = DenseGraphConv(10, 2)
+    >>> res = conv(adj, feat)
+    >>> res
+    tensor([[0.2159, 1.9027],
+            [0.3053, 2.6908],
+            [0.3053, 2.6908],
+            [0.3685, 3.2481],
+            [0.3053, 2.6908],
+            [0.0000, 0.0000]], grad_fn=<AddBackward0>)
+
    See also
    --------
-    GraphConv
+    `GraphConv <https://docs.dgl.ai/api/python/nn.pytorch.html#graphconv>`__
    """
    def __init__(self,
                 in_feats,
@@ -58,7 +92,11 @@ class DenseGraphConv(nn.Module):
            init.zeros_(self.bias)

    def forward(self, adj, feat):
-        r"""Compute (Dense) Graph Convolution layer.
+        r"""
+
+        Description
+        -----------
+        Compute (Dense) Graph Convolution layer.

        Parameters
        ----------

--- a/python/dgl/nn/pytorch/conv/densesageconv.py
+++ b/python/dgl/nn/pytorch/conv/densesageconv.py
@@ -5,7 +5,11 @@ from ....utils import check_eq_shape


 class DenseSAGEConv(nn.Module):
-    """GraphSAGE layer where the graph structure is given by an
+    """
+
+    Description
+    -----------
+    GraphSAGE layer where the graph structure is given by an
    adjacency matrix.
    We recommend to use this module when appying GraphSAGE on dense graphs.

@@ -14,9 +18,9 @@ class DenseSAGEConv(nn.Module):
    Parameters
    ----------
    in_feats : int
-        Input feature size.
+        Input feature size; i.e., the number of dimensions of :math:`h_i^{(l)}`.
    out_feats : int
-        Output feature size.
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
    feat_drop : float, optional
        Dropout rate on features. Default: 0.
    bias : bool
@@ -27,9 +31,33 @@ class DenseSAGEConv(nn.Module):
        If not None, applies an activation function to the updated node features.
        Default: ``None``.

+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import DenseSAGEConv
+    >>>
+    >>> feat = th.ones(6, 10)
+    >>> adj = th.tensor([[0., 0., 1., 0., 0., 0.],
+    ...         [1., 0., 0., 0., 0., 0.],
+    ...         [0., 1., 0., 0., 0., 0.],
+    ...         [0., 0., 1., 0., 0., 1.],
+    ...         [0., 0., 0., 1., 0., 0.],
+    ...         [0., 0., 0., 0., 0., 0.]])
+    >>> conv = DenseSAGEConv(10, 2)
+    >>> res = conv(adj, feat)
+    >>> res
+    tensor([[1.0401, 2.1008],
+            [1.0401, 2.1008],
+            [1.0401, 2.1008],
+            [1.0401, 2.1008],
+            [1.0401, 2.1008],
+            [1.0401, 2.1008]], grad_fn=<AddmmBackward>)
+
    See also
    --------
-    SAGEConv
+    `SAGEConv <https://docs.dgl.ai/api/python/nn.pytorch.html#sageconv>`__
    """
    def __init__(self,
                 in_feats,
@@ -48,12 +76,25 @@ class DenseSAGEConv(nn.Module):
        self.reset_parameters()

    def reset_parameters(self):
-        """Reinitialize learnable parameters."""
+        r"""
+
+        Description
+        -----------
+        Reinitialize learnable parameters.
+
+        Notes
+        -----
+        The linear weights :math:`W^{(l)}` are initialized using Glorot uniform initialization.
+        """
        gain = nn.init.calculate_gain('relu')
        nn.init.xavier_uniform_(self.fc.weight, gain=gain)

    def forward(self, adj, feat):
-        r"""Compute (Dense) Graph SAGE layer.
+        r"""
+
+        Description
+        -----------
+        Compute (Dense) Graph SAGE layer.

        Parameters
        ----------

--- a/python/dgl/nn/pytorch/conv/dotgatconv.py
+++ b/python/dgl/nn/pytorch/conv/dotgatconv.py
@@ -4,11 +4,16 @@ from torch import nn

 from .... import function as fn
 from ....ops import edge_softmax
+from ....base import DGLError
 from ....utils import expand_as_pair


 class DotGatConv(nn.Module):
-    r"""Apply dot product version of self attention in GCN.
+    r"""
+
+    Description
+    -----------
+    Apply dot product version of self attention in GCN.

        .. math::
            h_i^{(l+1)} = \sum_{j\in \mathcal{N}(i)} \alpha_{i, j} h_j^{(l)}
@@ -16,22 +21,92 @@ class DotGatConv(nn.Module):
        where :math:`\alpha_{ij}` is the attention score bewteen node :math:`i` and node :math:`j`:

        .. math::
-            \alpha_{i, j} = \mathrm{softmax_i}(e_{ij}^{l})
+            \alpha_{i, j} &= \mathrm{softmax_i}(e_{ij}^{l})

-            e_{ij}^{l} = ({W_i^{(l)} h_i^{(l)}})^T \cdot {W_j^{(l)} h_j^{(l)}}
+            e_{ij}^{l} &= ({W_i^{(l)} h_i^{(l)}})^T \cdot {W_j^{(l)} h_j^{(l)}}

        where :math:`W_i` and :math:`W_j` transform node :math:`i`'s and node :math:`j`'s
        features into the same dimension, so that when compute note features' similarity,
-        we can use dot-product.
-    """
+        it can use dot-product.

+    Parameters
+    ----------
+    in_feats : int, or pair of ints
+        Input feature size; i.e., the number of dimensions of :math:`h_i^{(l)}`.
+        DotGatConv can be applied to a homogeneous graph or a unidirectional
+        `bipartite graph <https://docs.dgl.ai/generated/dgl.bipartite.html?highlight=bipartite>`__.
+        If the layer is to be applied to a unidirectional bipartite graph, ``in_feats``
+        specifies the input feature size on both the source and destination nodes.  If
+        a scalar is given, the source and destination node feature sizes take the
+        same value.
+    out_feats : int
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
+    allow_zero_in_degree : bool, optional
+        If there are 0-in-degree nodes in the graph, the output for those nodes will be
+        invalid since no message will be passed to them. This is harmful for some
+        applications, causing silent performance regression. This module raises a DGLError
+        if it detects 0-in-degree nodes in the input graph. Setting this to ``True``
+        suppresses the check and lets users handle such nodes themselves. Default: ``False``.
+
+    Notes
+    -----
+    Zero in-degree nodes will lead to invalid output values. This is because no message
+    will be passed to those nodes, so the aggregation function will be applied on empty input.
+    A common practice to avoid this is to add a self-loop for each node in the graph if
+    it is homogeneous, which can be achieved by:
+
+    >>> g = ... # a DGLGraph
+    >>> g = dgl.add_self_loop(g)
+
+    Calling ``add_self_loop`` will not work for some graphs, for example, heterogeneous graphs,
+    since the edge type cannot be decided for self-loop edges. Set ``allow_zero_in_degree``
+    to ``True`` for those cases to unblock the code and handle zero-in-degree nodes manually.
+    A common practice to handle this is to filter out the nodes with zero in-degree when using
+    the output of the conv layer.
+
+    Examples
+    --------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import DotGatConv
+
+    >>> # Case 1: Homogeneous graph
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> g = dgl.add_self_loop(g)
+    >>> feat = th.ones(6, 10)
+    >>> gatconv = DotGatConv(10, 2)
+    >>> res = gatconv(g, feat)
+    >>> res
+    tensor([[-0.6958, -0.8752],
+            [-0.6958, -0.8752],
+            [-0.6958, -0.8752],
+            [-0.6958, -0.8752],
+            [-0.6958, -0.8752],
+            [-0.6958, -0.8752]], grad_fn=<CopyReduceBackward>)
+
+    >>> # Case 2: Unidirectional bipartite graph
+    >>> u = [0, 1, 0, 0, 1]
+    >>> v = [0, 1, 2, 3, 2]
+    >>> g = dgl.bipartite((u, v))
+    >>> u_feat = th.tensor(np.random.rand(2, 5).astype(np.float32))
+    >>> v_feat = th.tensor(np.random.rand(4, 10).astype(np.float32))
+    >>> gatconv = DotGatConv((5,10), 2)
+    >>> res = gatconv(g, (u_feat, v_feat))
+    >>> res
+    tensor([[ 0.4718,  0.0864],
+            [ 0.7099, -0.0335],
+            [ 0.5869,  0.0284],
+            [ 0.4718,  0.0864]], grad_fn=<CopyReduceBackward>)
+    """
    def __init__(self,
                 in_feats,
-                 out_feats
-                 ):
+                 out_feats,
+                 allow_zero_in_degree=False):
        super(DotGatConv, self).__init__()
        self._in_src_feats, self._in_dst_feats = expand_as_pair(in_feats)
        self._out_feats = out_feats
+        self._allow_zero_in_degree = allow_zero_in_degree

        if isinstance(in_feats, tuple):
            self.fc_src = nn.Linear(self._in_src_feats, self._out_feats, bias=False)
@@ -40,7 +115,11 @@ class DotGatConv(nn.Module):
            self.fc = nn.Linear(self._in_src_feats, self._out_feats, bias=False)

    def forward(self, graph, feat):
-        r"""Apply dot product version of self attention in GCN.
+        r"""
+
+        Description
+        -----------
+        Apply dot product version of self attention in GCN.

        Parameters
        ----------
@@ -57,10 +136,29 @@ class DotGatConv(nn.Module):
        torch.Tensor
            The output feature of shape :math:`(N, D_{out})` where :math:`D_{out}` is size
            of output feature.
+
+        Raises
+        ------
+        DGLError
+            If there are 0-in-degree nodes in the input graph, it will raise DGLError
+            since no message will be passed to those nodes. This will cause invalid output.
+            The error can be ignored by setting ``allow_zero_in_degree`` parameter to ``True``.
        """

        graph = graph.local_var()

+        if not self._allow_zero_in_degree:
+            if (graph.in_degrees() == 0).any():
+                raise DGLError('There are 0-in-degree nodes in the graph, '
+                               'output for those nodes will be invalid. '
+                               'This is harmful for some applications, '
+                               'causing silent performance regression. '
+                               'Adding self-loop on the input graph by '
+                               'calling `g = dgl.add_self_loop(g)` will resolve '
+                               'the issue. Setting ``allow_zero_in_degree`` '
+                               'to be `True` when constructing this module will '
+                               'suppress the check and let the code run.')
+
        # check if feat is a tuple
        if isinstance(feat, tuple):
            h_src = feat[0]

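Reviewer sketch, not part of the patch: the zero in-degree condition that the new `allow_zero_in_degree` check guards against can be illustrated in plain Python on the edge list used throughout the docstring examples. The helper `zero_in_degree_nodes` is hypothetical and is not a DGL API; the patch itself uses `graph.in_degrees()` and `dgl.add_self_loop`. The self-loop step below is only an assumed stand-in for what `dgl.add_self_loop` does to the edge list.

```python
def zero_in_degree_nodes(num_nodes, dst):
    """Return the IDs of nodes that receive no incoming edge."""
    in_deg = [0] * num_nodes
    for v in dst:
        in_deg[v] += 1
    return [n for n, d in enumerate(in_deg) if d == 0]


# Edge list from the docstring examples: src[i] -> dst[i], 6 nodes.
src = [0, 1, 2, 3, 2, 5]
dst = [1, 2, 3, 4, 0, 3]
isolated = zero_in_degree_nodes(6, dst)

# Adding one self-loop per node gives every node at least one incoming edge.
dst_with_loops = dst + list(range(6))
isolated_after = zero_in_degree_nodes(6, dst_with_loops)
print(isolated, isolated_after)
```

Node 5 only appears as a source in this edge list, which is why the docstring examples add self-loops before calling the layer.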
--- a/python/dgl/nn/pytorch/conv/edgeconv.py
+++ b/python/dgl/nn/pytorch/conv/edgeconv.py
@@ -2,37 +2,102 @@
 # pylint: disable= no-member, arguments-differ, invalid-name
 from torch import nn

+from ....base import DGLError
 from .... import function as fn
 from ....utils import expand_as_pair


 class EdgeConv(nn.Module):
-    r"""EdgeConv layer.
+    r"""
+
+    Description
+    -----------
+    EdgeConv layer.

    Introduced in "`Dynamic Graph CNN for Learning on Point Clouds
    <https://arxiv.org/pdf/1801.07829>`__".  Can be described as follows:

    .. math::
-       x_i^{(l+1)} = \max_{j \in \mathcal{N}(i)} \mathrm{ReLU}(
-       \Theta \cdot (x_j^{(l)} - x_i^{(l)}) + \Phi \cdot x_i^{(l)})
+       h_i^{(l+1)} = \max_{j \in \mathcal{N}(i)} \mathrm{ReLU}(
+       \Theta \cdot (h_j^{(l)} - h_i^{(l)}) + \Phi \cdot h_i^{(l)})

    where :math:`\mathcal{N}(i)` is the neighbor of :math:`i`.
+    :math:`\Theta` and :math:`\Phi` are linear layers.

    Parameters
    ----------
    in_feat : int
-        Input feature size.
+        Input feature size; i.e., the number of dimensions of :math:`h_j^{(l)}`.
    out_feat : int
-        Output feature size.
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
    batch_norm : bool
-        Whether to include batch normalization on messages.
+        Whether to include batch normalization on messages. Default: ``False``.
+    allow_zero_in_degree : bool, optional
+        If there are 0-in-degree nodes in the graph, the output for those nodes will be
+        invalid since no message will be passed to them. This is harmful for some
+        applications, causing silent performance regression. This module raises a DGLError
+        if it detects 0-in-degree nodes in the input graph. Setting this to ``True``
+        suppresses the check and lets users handle such nodes themselves. Default: ``False``.
+
+    Notes
+    -----
+    Zero in-degree nodes will lead to invalid output values. This is because no message
+    will be passed to those nodes, so the aggregation function will be applied on empty input.
+    A common practice to avoid this is to add a self-loop for each node in the graph if
+    it is homogeneous, which can be achieved by:
+
+    >>> g = ... # a DGLGraph
+    >>> g = dgl.add_self_loop(g)
+
+    Calling ``add_self_loop`` will not work for some graphs, for example, heterogeneous graphs,
+    since the edge type cannot be decided for self-loop edges. Set ``allow_zero_in_degree``
+    to ``True`` for those cases to unblock the code and handle zero-in-degree nodes manually.
+    A common practice to handle this is to filter out the nodes with zero in-degree when using
+    the output of the conv layer.
+
+    Examples
+    --------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import EdgeConv
+
+    >>> # Case 1: Homogeneous graph
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> g = dgl.add_self_loop(g)
+    >>> feat = th.ones(6, 10)
+    >>> conv = EdgeConv(10, 2)
+    >>> res = conv(g, feat)
+    >>> res
+    tensor([[-0.2347,  0.5849],
+            [-0.2347,  0.5849],
+            [-0.2347,  0.5849],
+            [-0.2347,  0.5849],
+            [-0.2347,  0.5849],
+            [-0.2347,  0.5849]], grad_fn=<CopyReduceBackward>)
+
+    >>> # Case 2: Unidirectional bipartite graph
+    >>> u = [0, 1, 0, 0, 1]
+    >>> v = [0, 1, 2, 3, 2]
+    >>> g = dgl.bipartite((u, v))
+    >>> u_fea = th.rand(2, 5)
+    >>> v_fea = th.rand(4, 5)
+    >>> conv = EdgeConv(5, 2, batch_norm=True)
+    >>> res = conv(g, (u_fea, v_fea))
+    >>> res
+    tensor([[ 1.6375,  0.2085],
+            [-1.1925, -1.2852],
+            [ 0.2101,  1.3466],
+            [ 0.2342, -0.9868]], grad_fn=<CopyReduceBackward>)
    """
    def __init__(self,
                 in_feat,
                 out_feat,
-                 batch_norm=False):
+                 batch_norm=False,
+                 allow_zero_in_degree=False):
        super(EdgeConv, self).__init__()
        self.batch_norm = batch_norm
+        self._allow_zero_in_degree = allow_zero_in_degree

        self.theta = nn.Linear(in_feat, out_feat)
        self.phi = nn.Linear(in_feat, out_feat)
@@ -47,27 +112,51 @@ class EdgeConv(nn.Module):
        phi_x = self.phi(edges.src['x'])
        return {'e': theta_x + phi_x}

-    def forward(self, g, h):
-        """Forward computation
+    def forward(self, g, feat):
+        """
+
+        Description
+        -----------
+        Forward computation

        Parameters
        ----------
        g : DGLGraph
            The graph.
-        h : Tensor or pair of tensors
+        feat : Tensor or pair of tensors
            :math:`(N, D)` where :math:`N` is the number of nodes and
            :math:`D` is the number of feature dimensions.

            If a pair of tensors is given, the graph must be a uni-bipartite graph
            with only one edge type, and the two tensors must have the same
            dimensionality on all except the first axis.
+
        Returns
        -------
        torch.Tensor
            New node features.
+
+        Raises
+        ------
+        DGLError
+            If there are 0-in-degree nodes in the input graph, it will raise DGLError
+            since no message will be passed to those nodes. This will cause invalid output.
+            The error can be ignored by setting ``allow_zero_in_degree`` parameter to ``True``.
        """
        with g.local_scope():
-            h_src, h_dst = expand_as_pair(h, g)
+            if not self._allow_zero_in_degree:
+                if (g.in_degrees() == 0).any():
+                    raise DGLError('There are 0-in-degree nodes in the graph, '
+                                   'output for those nodes will be invalid. '
+                                   'This is harmful for some applications, '
+                                   'causing silent performance regression. '
+                                   'Adding self-loop on the input graph by '
+                                   'calling `g = dgl.add_self_loop(g)` will resolve '
+                                   'the issue. Setting ``allow_zero_in_degree`` '
+                                   'to be `True` when constructing this module will '
+                                   'suppress the check and let the code run.')
+
+            h_src, h_dst = expand_as_pair(feat, g)
            g.srcdata['x'] = h_src
            g.dstdata['x'] = h_dst
            if not self.batch_norm:

--- a/python/dgl/nn/pytorch/conv/gatconv.py
+++ b/python/dgl/nn/pytorch/conv/gatconv.py
@@ -5,12 +5,17 @@ from torch import nn

 from .... import function as fn
 from ....ops import edge_softmax
+from ....base import DGLError
 from ..utils import Identity
 from ....utils import expand_as_pair

 # pylint: enable=W0235
 class GATConv(nn.Module):
-    r"""Apply `Graph Attention Network <https://arxiv.org/pdf/1710.10903.pdf>`__
+    r"""
+
+    Description
+    -----------
+    Apply `Graph Attention Network <https://arxiv.org/pdf/1710.10903.pdf>`__
    over an input signal.

    .. math::
@@ -20,29 +25,112 @@ class GATConv(nn.Module):
    node :math:`j`:

    .. math::
-        \alpha_{ij}^{l} & = \mathrm{softmax_i} (e_{ij}^{l})
+        \alpha_{ij}^{l} &= \mathrm{softmax_i} (e_{ij}^{l})

-        e_{ij}^{l} & = \mathrm{LeakyReLU}\left(\vec{a}^T [W h_{i} \| W h_{j}]\right)
+        e_{ij}^{l} &= \mathrm{LeakyReLU}\left(\vec{a}^T [W h_{i} \| W h_{j}]\right)

    Parameters
    ----------
-    in_feats : int
-        Input feature size.
+    in_feats : int, or pair of ints
+        Input feature size; i.e., the number of dimensions of :math:`h_i^{(l)}`.
+        GATConv can be applied to a homogeneous graph or a unidirectional
+        `bipartite graph <https://docs.dgl.ai/generated/dgl.bipartite.html?highlight=bipartite>`__.
+        If the layer is to be applied to a unidirectional bipartite graph, ``in_feats``
+        specifies the input feature size on both the source and destination nodes.  If
+        a scalar is given, the source and destination node feature sizes take the
+        same value.
    out_feats : int
-        Output feature size.
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
    num_heads : int
        Number of heads in Multi-Head Attention.
    feat_drop : float, optional
-        Dropout rate on feature, defaults: ``0``.
+        Dropout rate on feature. Default: ``0``.
    attn_drop : float, optional
-        Dropout rate on attention weight, defaults: ``0``.
+        Dropout rate on attention weight. Default: ``0``.
    negative_slope : float, optional
-        LeakyReLU angle of negative slope.
+        LeakyReLU angle of negative slope. Default: ``0.2``.
    residual : bool, optional
-        If True, use residual connection.
+        If True, use residual connection. Default: ``False``.
    activation : callable activation function/layer or None, optional.
        If not None, applies an activation function to the updated node features.
        Default: ``None``.
+    allow_zero_in_degree : bool, optional
+        If there are 0-in-degree nodes in the graph, the output for those nodes will be
+        invalid since no message will be passed to them. This is harmful for some
+        applications, causing silent performance regression. This module raises a DGLError
+        if it detects 0-in-degree nodes in the input graph. Setting this to ``True``
+        suppresses the check and lets users handle such nodes themselves. Default: ``False``.
+
+    Notes
+    -----
+    Zero in-degree nodes will lead to invalid output values. This is because no message
+    will be passed to those nodes, so the aggregation function will be applied on empty input.
+    A common practice to avoid this is to add a self-loop for each node in the graph if
+    it is homogeneous, which can be achieved by:
+
+    >>> g = ... # a DGLGraph
+    >>> g = dgl.add_self_loop(g)
+
+    Calling ``add_self_loop`` will not work for some graphs, for example, heterogeneous graphs,
+    since the edge type cannot be decided for self-loop edges. Set ``allow_zero_in_degree``
+    to ``True`` for those cases to unblock the code and handle zero-in-degree nodes manually.
+    A common practice to handle this is to filter out the nodes with zero in-degree when using
+    the output of the conv layer.
+
+    Examples
+    --------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import GATConv
+
+    >>> # Case 1: Homogeneous graph
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> g = dgl.add_self_loop(g)
+    >>> feat = th.ones(6, 10)
+    >>> gatconv = GATConv(10, 2, num_heads=3)
+    >>> res = gatconv(g, feat)
+    >>> res
+    tensor([[[ 3.4570,  1.8634],
+            [ 1.3805, -0.0762],
+            [ 1.0390, -1.1479]],
+            [[ 3.4570,  1.8634],
+            [ 1.3805, -0.0762],
+            [ 1.0390, -1.1479]],
+            [[ 3.4570,  1.8634],
+            [ 1.3805, -0.0762],
+            [ 1.0390, -1.1479]],
+            [[ 3.4570,  1.8634],
+            [ 1.3805, -0.0762],
+            [ 1.0390, -1.1479]],
+            [[ 3.4570,  1.8634],
+            [ 1.3805, -0.0762],
+            [ 1.0390, -1.1479]],
+            [[ 3.4570,  1.8634],
+            [ 1.3805, -0.0762],
+            [ 1.0390, -1.1479]]], grad_fn=<BinaryReduceBackward>)
+
+    >>> # Case 2: Unidirectional bipartite graph
+    >>> u = [0, 1, 0, 0, 1]
+    >>> v = [0, 1, 2, 3, 2]
+    >>> g = dgl.bipartite((u, v))
+    >>> u_feat = th.tensor(np.random.rand(2, 5).astype(np.float32))
+    >>> v_feat = th.tensor(np.random.rand(4, 10).astype(np.float32))
+    >>> gatconv = GATConv((5,10), 2, 3)
+    >>> res = gatconv(g, (u_feat, v_feat))
+    >>> res
+    tensor([[[-0.6066,  1.0268],
+            [-0.5945, -0.4801],
+            [ 0.1594,  0.3825]],
+            [[ 0.0268,  1.0783],
+            [ 0.5041, -1.3025],
+            [ 0.6568,  0.7048]],
+            [[-0.2688,  1.0543],
+            [-0.0315, -0.9016],
+            [ 0.3943,  0.5347]],
+            [[-0.6066,  1.0268],
+            [-0.5945, -0.4801],
+            [ 0.1594,  0.3825]]], grad_fn=<BinaryReduceBackward>)
    """
    def __init__(self,
                 in_feats,
@@ -52,11 +140,19 @@ class GATConv(nn.Module):
                 attn_drop=0.,
                 negative_slope=0.2,
                 residual=False,
-                 activation=None):
+                 activation=None,
+                 allow_zero_in_degree=False):
        super(GATConv, self).__init__()
        self._num_heads = num_heads
        self._in_src_feats, self._in_dst_feats = expand_as_pair(in_feats)
        self._out_feats = out_feats
+        self._allow_zero_in_degree = allow_zero_in_degree
+        if isinstance(in_feats, tuple):
+            self.fc_src = nn.Linear(
+                self._in_src_feats, out_feats * num_heads, bias=False)
+            self.fc_dst = nn.Linear(
+                self._in_dst_feats, out_feats * num_heads, bias=False)
+        else:
            self.fc = nn.Linear(
                self._in_src_feats, out_feats * num_heads, bias=False)
        self.attn_l = nn.Parameter(th.FloatTensor(size=(1, num_heads, out_feats)))
@@ -76,7 +172,17 @@ class GATConv(nn.Module):
        self.activation = activation

    def reset_parameters(self):
-        """Reinitialize learnable parameters."""
+        """
+
+        Description
+        -----------
+        Reinitialize learnable parameters.
+
+        Notes
+        -----
+        The fc weights :math:`W^{(l)}` and the attention weights are initialized using
+        Glorot (Xavier) normal initialization.
+        """
        gain = nn.init.calculate_gain('relu')
        nn.init.xavier_normal_(self.fc.weight, gain=gain)
        nn.init.xavier_normal_(self.attn_l, gain=gain)
@@ -85,7 +191,11 @@ class GATConv(nn.Module):
            nn.init.xavier_normal_(self.res_fc.weight, gain=gain)

    def forward(self, graph, feat):
-        r"""Compute graph attention network layer.
+        r"""
+
+        Description
+        -----------
+        Compute graph attention network layer.

        Parameters
        ----------
@@ -102,8 +212,27 @@ class GATConv(nn.Module):
        torch.Tensor
            The output feature of shape :math:`(N, H, D_{out})` where :math:`H`
            is the number of heads, and :math:`D_{out}` is size of output feature.
+
+        Raises
+        ------
+        DGLError
+            If there are 0-in-degree nodes in the input graph, it will raise DGLError
+            since no message will be passed to those nodes. This will cause invalid output.
+            The error can be ignored by setting ``allow_zero_in_degree`` parameter to ``True``.
        """
        with graph.local_scope():
+            if not self._allow_zero_in_degree:
+                if (graph.in_degrees() == 0).any():
+                    raise DGLError('There are 0-in-degree nodes in the graph, '
+                                   'output for those nodes will be invalid. '
+                                   'This is harmful for some applications, '
+                                   'causing silent performance regression. '
+                                   'Adding self-loop on the input graph by '
+                                   'calling `g = dgl.add_self_loop(g)` will resolve '
+                                   'the issue. Setting ``allow_zero_in_degree`` '
+                                   'to be `True` when constructing this module will '
+                                   'suppress the check and let the code run.')
+
            if isinstance(feat, tuple):
                h_src = self.feat_drop(feat[0])
                h_dst = self.feat_drop(feat[1])
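
Reviewer note: the zero-in-degree check added above can be illustrated without DGL. The following is a minimal pure-Python sketch, not the library's implementation; the helper names `in_degrees`, `zero_in_degree_nodes` and `add_self_loops` are hypothetical and only mimic what `graph.in_degrees()` and `dgl.add_self_loop` are described as doing in this patch. It uses the same edge list as the docstring examples.

```python
def in_degrees(dst, num_nodes):
    """Count incoming edges per node from a list of destination node IDs."""
    deg = [0] * num_nodes
    for v in dst:
        deg[v] += 1
    return deg


def zero_in_degree_nodes(dst, num_nodes):
    """Return the IDs of nodes that receive no message at all."""
    return [n for n, d in enumerate(in_degrees(dst, num_nodes)) if d == 0]


def add_self_loops(src, dst, num_nodes):
    """Append one edge (n, n) per node (hypothetical stand-in for dgl.add_self_loop)."""
    nodes = list(range(num_nodes))
    return src + nodes, dst + nodes


# Same edge list as the docstring examples: 6 nodes, 6 edges.
src = [0, 1, 2, 3, 2, 5]
dst = [1, 2, 3, 4, 0, 3]

isolated = zero_in_degree_nodes(dst, 6)       # nodes that would trigger the DGLError
src_sl, dst_sl = add_self_loops(src, dst, 6)  # 6 original edges + 6 self-loops
fixed = zero_in_degree_nodes(dst_sl, 6)       # no node is left without input
```

After adding self-loops the graph has 12 edges, which is why the homogeneous examples in this patch build edge-level inputs with 12 rows.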

--- a/python/dgl/nn/pytorch/conv/gatedgraphconv.py
+++ b/python/dgl/nn/pytorch/conv/gatedgraphconv.py
@@ -8,28 +8,58 @@ from .... import function as fn


 class GatedGraphConv(nn.Module):
-    r"""Gated Graph Convolution layer from paper `Gated Graph Sequence
+    r"""
+
+    Description
+    -----------
+    Gated Graph Convolution layer from paper `Gated Graph Sequence
    Neural Networks <https://arxiv.org/pdf/1511.05493.pdf>`__.

    .. math::
-        h_{i}^{0} & = [ x_i \| \mathbf{0} ]
+        h_{i}^{0} &= [ x_i \| \mathbf{0} ]

-        a_{i}^{t} & = \sum_{j\in\mathcal{N}(i)} W_{e_{ij}} h_{j}^{t}
+        a_{i}^{t} &= \sum_{j\in\mathcal{N}(i)} W_{e_{ij}} h_{j}^{t}

-        h_{i}^{t+1} & = \mathrm{GRU}(a_{i}^{t}, h_{i}^{t})
+        h_{i}^{t+1} &= \mathrm{GRU}(a_{i}^{t}, h_{i}^{t})

    Parameters
    ----------
    in_feats : int
-        Input feature size.
+        Input feature size; i.e., the number of dimensions of :math:`x_i`.
    out_feats : int
-        Output feature size.
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(t+1)}`.
    n_steps : int
-        Number of recurrent steps.
+        Number of recurrent steps; i.e., the :math:`t` in the above formula.
    n_etypes : int
        Number of edge types.
    bias : bool
        If True, adds a learnable bias to the output. Default: ``True``.
+
+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import GatedGraphConv
+    >>>
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> feat = th.ones(6, 10)
+    >>> conv = GatedGraphConv(10, 10, 2, 3)
+    >>> etype = th.tensor([0,1,2,0,1,2])
+    >>> res = conv(g, feat, etype)
+    >>> res
+    tensor([[ 0.4652,  0.4458,  0.5169,  0.4126,  0.4847,  0.2303,  0.2757,  0.7721,
+            0.0523,  0.0857],
+            [ 0.0832,  0.1388, -0.5643,  0.7053, -0.2524, -0.3847,  0.7587,  0.8245,
+            0.9315,  0.4063],
+            [ 0.6340,  0.4096,  0.7692,  0.2125,  0.2106,  0.4542, -0.0580,  0.3364,
+            -0.1376,  0.4948],
+            [ 0.5551,  0.7946,  0.6220,  0.8058,  0.5711,  0.3063, -0.5454,  0.2272,
+            -0.6931, -0.1607],
+            [ 0.2644,  0.2469, -0.6143,  0.6008, -0.1516, -0.3781,  0.5878,  0.7993,
+            0.9241,  0.1835],
+            [ 0.6393,  0.3447,  0.3893,  0.4279,  0.3342,  0.3809,  0.0406,  0.5030,
+            0.1342,  0.0425]], grad_fn=<AddBackward0>)
    """
    def __init__(self,
                 in_feats,
@@ -49,7 +79,17 @@ class GatedGraphConv(nn.Module):
        self.reset_parameters()

    def reset_parameters(self):
-        """Reinitialize learnable parameters."""
+        r"""
+
+        Description
+        -----------
+        Reinitialize learnable parameters.
+
+        Notes
+        -----
+        The model parameters are initialized using Glorot uniform initialization
+        and the bias is initialized to be zero.
+        """
        gain = init.calculate_gain('relu')
        self.gru.reset_parameters()
        for linear in self.linears:
@@ -57,7 +97,11 @@ class GatedGraphConv(nn.Module):
            init.zeros_(linear.bias)

    def forward(self, graph, feat, etypes):
-        """Compute Gated Graph Convolution layer.
+        """
+
+        Description
+        -----------
+        Compute Gated Graph Convolution layer.

        Parameters
        ----------

--- a/python/dgl/nn/pytorch/conv/ginconv.py
+++ b/python/dgl/nn/pytorch/conv/ginconv.py
@@ -4,11 +4,16 @@ import torch as th
 from torch import nn

 from .... import function as fn
+from ....base import DGLError
 from ....utils import expand_as_pair


 class GINConv(nn.Module):
-    r"""Graph Isomorphism Network layer from paper `How Powerful are Graph
+    r"""
+
+    Description
+    -----------
+    Graph Isomorphism Network layer from paper `How Powerful are Graph
    Neural Networks? <https://arxiv.org/pdf/1810.00826.pdf>`__.

    .. math::
@@ -26,15 +31,66 @@ class GINConv(nn.Module):
    init_eps : float, optional
        Initial :math:`\epsilon` value, default: ``0``.
    learn_eps : bool, optional
-        If True, :math:`\epsilon` will be a learnable parameter.
+        If True, :math:`\epsilon` will be a learnable parameter. Default: ``False``.
+    allow_zero_in_degree : bool, optional
+        If there are 0-in-degree nodes in the graph, output for those nodes will be invalid
+        since no message will be passed to those nodes. This is harmful for some applications
+        causing silent performance regression. This module will raise a DGLError if it detects
+        0-in-degree nodes in input graph. By setting ``True``, it will suppress the check
+        and let the users handle it by themselves. Default: ``False``.
+
+    Notes
+    -----
+    Zero in-degree nodes will lead to invalid output values. Because no message
+    is passed to those nodes, the aggregation function is applied to empty input.
+    A common practice to avoid this is to add a self-loop for each node in the graph if
+    it is homogeneous, which can be achieved by:
+
+    >>> g = ... # a DGLGraph
+    >>> g = dgl.add_self_loop(g)
+
+    Calling ``add_self_loop`` will not work for some graphs, for example, heterogeneous graphs,
+    since the edge type cannot be decided for self-loop edges. Set ``allow_zero_in_degree``
+    to ``True`` for those cases to unblock the code and handle zero-in-degree nodes manually.
+    A common practice is to filter out the zero-in-degree nodes from the output
+    after the convolution.
+
+    Example
+    -------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import GINConv
+    >>>
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> g = dgl.add_self_loop(g)
+    >>> feat = th.ones(6, 10)
+    >>> lin = th.nn.Linear(10, 10)
+    >>> conv = GINConv(lin, 'max')
+    >>> res = conv(g, feat)
+    >>> res
+    tensor([[ 1.2330, -0.1572,  0.0622, -3.1567, -2.2414, -0.7275,  0.6311,  1.0396,
+            1.7008, -1.2468],
+            [ 1.2330, -0.1572,  0.0622, -3.1567, -2.2414, -0.7275,  0.6311,  1.0396,
+            1.7008, -1.2468],
+            [ 1.2330, -0.1572,  0.0622, -3.1567, -2.2414, -0.7275,  0.6311,  1.0396,
+            1.7008, -1.2468],
+            [ 1.2330, -0.1572,  0.0622, -3.1567, -2.2414, -0.7275,  0.6311,  1.0396,
+            1.7008, -1.2468],
+            [ 1.2330, -0.1572,  0.0622, -3.1567, -2.2414, -0.7275,  0.6311,  1.0396,
+            1.7008, -1.2468],
+            [ 1.2330, -0.1572,  0.0622, -3.1567, -2.2414, -0.7275,  0.6311,  1.0396,
+            1.7008, -1.2468]], grad_fn=<AddmmBackward>)
    """
    def __init__(self,
                 apply_func,
                 aggregator_type,
                 init_eps=0,
-                 learn_eps=False):
+                 learn_eps=False,
+                 allow_zero_in_degree=False):
        super(GINConv, self).__init__()
        self.apply_func = apply_func
+        self._aggregator_type = aggregator_type
        if aggregator_type == 'sum':
            self._reducer = fn.sum
        elif aggregator_type == 'max':
@@ -48,9 +104,14 @@ class GINConv(nn.Module):
            self.eps = th.nn.Parameter(th.FloatTensor([init_eps]))
        else:
            self.register_buffer('eps', th.FloatTensor([init_eps]))
+        self._allow_zero_in_degree = allow_zero_in_degree

    def forward(self, graph, feat):
-        r"""Compute Graph Isomorphism Network layer.
+        r"""
+
+        Description
+        -----------
+        Compute Graph Isomorphism Network layer.

        Parameters
        ----------
@@ -71,8 +132,28 @@ class GINConv(nn.Module):
            :math:`D_{out}` is the output dimensionality of ``apply_func``.
            If ``apply_func`` is None, :math:`D_{out}` should be the same
            as input dimensionality.
+
+        Raises
+        ------
+        DGLError
+            If there are 0-in-degree nodes in the input graph and the aggregator type is
+            neither ``sum`` nor ``mean``, since no message will be passed to those nodes.
+            The error can be suppressed by setting ``allow_zero_in_degree`` to ``True``.
        """
        with graph.local_scope():
+            if not self._allow_zero_in_degree:
+                if (graph.in_degrees() == 0).any() and \
+                (self._aggregator_type not in ['sum', 'mean']):
+                    raise DGLError('There are 0-in-degree nodes in the graph, '
+                                   'output for those nodes will be invalid. '
+                                   'This is harmful for some applications, '
+                                   'causing silent performance regression. '
+                                   'Adding self-loop on the input graph by '
+                                   'calling `g = dgl.add_self_loop(g)` will resolve '
+                                   'the issue. Setting ``allow_zero_in_degree`` '
+                                   'to be `True` when constructing this module will '
+                                   'suppress the check and let the code run.')
+
            feat_src, feat_dst = expand_as_pair(feat, graph)
            graph.srcdata['h'] = feat_src
            graph.update_all(fn.copy_u('h', 'm'), self._reducer('m', 'neigh'))
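
Reviewer note: the check above only fires for aggregators other than ``sum`` and ``mean``. A minimal pure-Python sketch of why, assuming the usual GIN update :math:`h_i' = f((1 + \epsilon) h_i + \mathrm{aggregate}(\{h_j\}))` with scalar features and ``apply_func`` left out. The function `gin_update` is hypothetical, and treating an empty ``sum``/``mean`` as zero is an assumption of this sketch rather than a statement about DGL internals.

```python
def gin_update(src, dst, feat, aggregator="sum", eps=0.0):
    """Scalar-feature sketch of a GIN-style update without apply_func."""
    num_nodes = len(feat)
    inbox = [[] for _ in range(num_nodes)]
    for u, v in zip(src, dst):
        inbox[v].append(feat[u])  # message from u to v is u's feature

    out = []
    for i, msgs in enumerate(inbox):
        if aggregator == "sum":
            neigh = sum(msgs)  # empty sum is 0 (assumption of this sketch)
        elif aggregator == "mean":
            neigh = sum(msgs) / len(msgs) if msgs else 0.0
        elif aggregator == "max":
            if not msgs:
                # max over an empty set has no meaningful value
                raise ValueError("node %d has zero in-degree" % i)
            neigh = max(msgs)
        else:
            raise KeyError("unknown aggregator: %s" % aggregator)
        out.append((1 + eps) * feat[i] + neigh)
    return out


src = [0, 1, 2, 3, 2, 5]
dst = [1, 2, 3, 4, 0, 3]
feat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

summed = gin_update(src, dst, feat, "sum")  # node 5 simply keeps its own feature
try:
    gin_update(src, dst, feat, "max")
    max_failed = False
except ValueError:
    max_failed = True                        # node 5 has nothing to take a max over
```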

--- a/python/dgl/nn/pytorch/conv/gmmconv.py
+++ b/python/dgl/nn/pytorch/conv/gmmconv.py
@@ -5,37 +5,107 @@ from torch import nn
 from torch.nn import init

 from .... import function as fn
+from ....base import DGLError
 from ..utils import Identity
 from ....utils import expand_as_pair


 class GMMConv(nn.Module):
-    r"""The Gaussian Mixture Model Convolution layer from `Geometric Deep
+    r"""
+
+    Description
+    -----------
+    The Gaussian Mixture Model Convolution layer from `Geometric Deep
    Learning on Graphs and Manifolds using Mixture Model CNNs
    <http://openaccess.thecvf.com/content_cvpr_2017/papers/Monti_Geometric_Deep_Learning_CVPR_2017_paper.pdf>`__.

    .. math::
-        h_i^{l+1} & = \mathrm{aggregate}\left(\left\{\frac{1}{K}
+        u_{ij} &= f(x_i, x_j), x_j \in \mathcal{N}(i)
+
+        w_k(u) &= \exp\left(-\frac{1}{2}(u-\mu_k)^T \Sigma_k^{-1} (u - \mu_k)\right)
+
+        h_i^{l+1} &= \mathrm{aggregate}\left(\left\{\frac{1}{K}
         \sum_{k}^{K} w_k(u_{ij}), \forall j\in \mathcal{N}(i)\right\}\right)

-        w_k(u) & = \exp\left(-\frac{1}{2}(u-\mu_k)^T \Sigma_k^{-1} (u - \mu_k)\right)
+    where :math:`u` denotes the pseudo-coordinates between a vertex and one of its neighbors,
+    computed using function :math:`f`, and :math:`\Sigma_k^{-1}` and :math:`\mu_k` are learnable
+    parameters representing the inverse covariance matrix and mean vector of a Gaussian kernel.

    Parameters
    ----------
    in_feats : int
-        Number of input features.
+        Number of input features; i.e., the number of dimensions of :math:`x_i`.
    out_feats : int
-        Number of output features.
+        Number of output features; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
    dim : int
-        Dimensionality of pseudo-coordinte.
+        Dimensionality of pseudo-coordinate; i.e., the number of dimensions of :math:`u_{ij}`.
    n_kernels : int
        Number of kernels :math:`K`.
    aggregator_type : str
-        Aggregator type (``sum``, ``mean``, ``max``).
+        Aggregator type (``sum``, ``mean``, ``max``). Default: ``sum``.
    residual : bool
        If True, use residual connection inside this layer. Default: ``False``.
    bias : bool
        If True, adds a learnable bias to the output. Default: ``True``.
+    allow_zero_in_degree : bool, optional
+        If there are 0-in-degree nodes in the graph, output for those nodes will be invalid
+        since no message will be passed to those nodes. This is harmful for some applications
+        causing silent performance regression. This module will raise a DGLError if it detects
+        0-in-degree nodes in input graph. By setting ``True``, it will suppress the check
+        and let the users handle it by themselves. Default: ``False``.
+
+    Notes
+    -----
+    Zero in-degree nodes will lead to invalid output values. Because no message
+    is passed to those nodes, the aggregation function is applied to empty input.
+    A common practice to avoid this is to add a self-loop for each node in the graph if
+    it is homogeneous, which can be achieved by:
+
+    >>> g = ... # a DGLGraph
+    >>> g = dgl.add_self_loop(g)
+
+    Calling ``add_self_loop`` will not work for some graphs, for example, heterogeneous graphs,
+    since the edge type cannot be decided for self-loop edges. Set ``allow_zero_in_degree``
+    to ``True`` for those cases to unblock the code and handle zero-in-degree nodes manually.
+    A common practice is to filter out the zero-in-degree nodes from the output
+    after the convolution.
+
+    Examples
+    --------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import GMMConv
+
+    >>> # Case 1: Homogeneous graph
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> g = dgl.add_self_loop(g)
+    >>> feat = th.ones(6, 10)
+    >>> conv = GMMConv(10, 2, 3, 2, 'mean')
+    >>> pseudo = th.ones(12, 3)
+    >>> res = conv(g, feat, pseudo)
+    >>> res
+    tensor([[-0.3462, -0.2654],
+            [-0.3462, -0.2654],
+            [-0.3462, -0.2654],
+            [-0.3462, -0.2654],
+            [-0.3462, -0.2654],
+            [-0.3462, -0.2654]], grad_fn=<AddBackward0>)
+
+    >>> # Case 2: Unidirectional bipartite graph
+    >>> u = [0, 1, 0, 0, 1]
+    >>> v = [0, 1, 2, 3, 2]
+    >>> g = dgl.bipartite((u, v))
+    >>> u_fea = th.rand(2, 5)
+    >>> v_fea = th.rand(4, 10)
+    >>> pseudo = th.ones(5, 3)
+    >>> conv = GMMConv((5, 10), 2, 3, 2, 'mean')
+    >>> res = conv(g, (u_fea, v_fea), pseudo)
+    >>> res
+    tensor([[-0.1107, -0.1559],
+            [-0.1646, -0.2326],
+            [-0.1377, -0.1943],
+            [-0.1107, -0.1559]], grad_fn=<AddBackward0>)
    """
    def __init__(self,
                 in_feats,
@@ -44,12 +114,14 @@ class GMMConv(nn.Module):
                 n_kernels,
                 aggregator_type='sum',
                 residual=False,
-                 bias=True):
+                 bias=True,
+                 allow_zero_in_degree=False):
        super(GMMConv, self).__init__()
        self._in_src_feats, self._in_dst_feats = expand_as_pair(in_feats)
        self._out_feats = out_feats
        self._dim = dim
        self._n_kernels = n_kernels
+        self._allow_zero_in_degree = allow_zero_in_degree
        if aggregator_type == 'sum':
            self._reducer = fn.sum
        elif aggregator_type == 'mean':
@@ -77,7 +149,19 @@ class GMMConv(nn.Module):
        self.reset_parameters()

    def reset_parameters(self):
-        """Reinitialize learnable parameters."""
+        r"""
+
+        Description
+        -----------
+        Reinitialize learnable parameters.
+
+        Notes
+        -----
+        The fc parameters are initialized using Xavier (Glorot) normal initialization
+        and the bias is initialized to be zero.
+        The mu weight is initialized using normal distribution and
+        inv_sigma is initialized with constant value 1.0.
+        """
        gain = init.calculate_gain('relu')
        init.xavier_normal_(self.fc.weight, gain=gain)
        if isinstance(self.res_fc, nn.Linear):
@@ -88,7 +172,11 @@ class GMMConv(nn.Module):
            init.zeros_(self.bias.data)

    def forward(self, graph, feat, pseudo):
-        """Compute Gaussian Mixture Model Convolution layer.
+        """
+
+        Description
+        -----------
+        Compute Gaussian Mixture Model Convolution layer.

        Parameters
        ----------
@@ -109,8 +197,27 @@ class GMMConv(nn.Module):
        torch.Tensor
            The output feature of shape :math:`(N, D_{out})` where :math:`D_{out}`
            is the output feature size.
+
+        Raises
+        ------
+        DGLError
+            If there are 0-in-degree nodes in the input graph, it will raise DGLError
+            since no message will be passed to those nodes. This will cause invalid output.
+            The error can be ignored by setting ``allow_zero_in_degree`` parameter to ``True``.
        """
        with graph.local_scope():
+            if not self._allow_zero_in_degree:
+                if (graph.in_degrees() == 0).any():
+                    raise DGLError('There are 0-in-degree nodes in the graph, '
+                                   'output for those nodes will be invalid. '
+                                   'This is harmful for some applications, '
+                                   'causing silent performance regression. '
+                                   'Adding self-loop on the input graph by '
+                                   'calling `g = dgl.add_self_loop(g)` will resolve '
+                                   'the issue. Setting ``allow_zero_in_degree`` '
+                                   'to be `True` when constructing this module will '
+                                   'suppress the check and let the code run.')
+
            feat_src, feat_dst = expand_as_pair(feat, graph)
            graph.srcdata['h'] = self.fc(feat_src).view(-1, self._n_kernels, self._out_feats)
            E = graph.number_of_edges()
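
Reviewer note: the kernel formula in the GMMConv docstring can be checked numerically with a few lines of plain Python. This is a minimal sketch assuming a diagonal :math:`\Sigma_k^{-1}`; the function names and the diagonal parameterization are illustrative choices, not the layer's actual code.

```python
import math


def gaussian_kernel_weight(u, mu, inv_sigma_diag):
    """w_k(u) = exp(-1/2 (u - mu)^T Sigma^{-1} (u - mu)) for a diagonal Sigma^{-1}."""
    quad = sum(s * (a - m) ** 2 for a, m, s in zip(u, mu, inv_sigma_diag))
    return math.exp(-0.5 * quad)


def mixture_weight(u, mus, inv_sigma_diags):
    """(1/K) * sum_k w_k(u): the per-edge term inside the aggregate."""
    weights = [gaussian_kernel_weight(u, m, s) for m, s in zip(mus, inv_sigma_diags)]
    return sum(weights) / len(weights)


# A pseudo-coordinate sitting exactly on a kernel mean gets weight 1.
w_center = gaussian_kernel_weight([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])

# One unit away along a single axis gives exp(-1/2).
w_offset = gaussian_kernel_weight([1.0, 0.0], [0.0, 0.0], [1.0, 1.0])

# Averaging K = 2 kernels, one centered on u and one a unit away.
w_mix = mixture_weight([1.0, 0.0], [[1.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])
```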

--- a/python/dgl/nn/pytorch/conv/graphconv.py
+++ b/python/dgl/nn/pytorch/conv/graphconv.py
@@ -10,40 +10,27 @@ from ....utils import expand_as_pair

 # pylint: disable=W0235
 class GraphConv(nn.Module):
-    r"""Apply graph convolution over an input signal.
+    r"""

-    Graph convolution is introduced in `GCN <https://arxiv.org/abs/1609.02907>`__
-    and can be described as below:
+    Description
+    -----------
+    Graph convolution was introduced in `GCN <https://arxiv.org/abs/1609.02907>`__
+    and is mathematically defined as follows:

    .. math::
      h_i^{(l+1)} = \sigma(b^{(l)} + \sum_{j\in\mathcal{N}(i)}\frac{1}{c_{ij}}h_j^{(l)}W^{(l)})

-    where :math:`\mathcal{N}(i)` is the neighbor set of node :math:`i`. :math:`c_{ij}` is equal
-    to the product of the square root of node degrees:
-    :math:`\sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}`. :math:`\sigma` is an activation
-    function.
-
-    The model parameters are initialized as in the
-    `original implementation <https://github.com/tkipf/gcn/blob/master/gcn/layers.py>`__ where
-    the weight :math:`W^{(l)}` is initialized using Glorot uniform initialization
-    and the bias is initialized to be zero.
-
-    Notes
-    -----
-    Zero in degree nodes could lead to invalid normalizer. A common practice
-    to avoid this is to add a self-loop for each node in the graph, which
-    can be achieved by:
-
-    >>> g = ... # some DGLGraph
-    >>> g.add_edges(g.nodes(), g.nodes())
-
+    where :math:`\mathcal{N}(i)` is the set of neighbors of node :math:`i`,
+    :math:`c_{ij}` is the product of the square root of node degrees
+    (i.e., :math:`c_{ij} = \sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}`),
+    and :math:`\sigma` is an activation function.

    Parameters
    ----------
    in_feats : int
-        Input feature size.
+        Input feature size; i.e., the number of dimensions of :math:`h_j^{(l)}`.
    out_feats : int
-        Output feature size.
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
    norm : str, optional
        How to apply the normalizer. If is `'right'`, divide the aggregated messages
        by each node's in-degrees, which is equivalent to averaging the received messages.
@@ -54,9 +41,15 @@ class GraphConv(nn.Module):
        without a weight matrix.
    bias : bool, optional
        If True, adds a learnable bias to the output. Default: ``True``.
-    activation: callable activation function/layer or None, optional
+    activation : callable activation function/layer or None, optional
        If not None, applies an activation function to the updated node features.
        Default: ``None``.
+    allow_zero_in_degree : bool, optional
+        If there are 0-in-degree nodes in the graph, output for those nodes will be invalid
+        since no message will be passed to those nodes. This is harmful for some applications
+        causing silent performance regression. This module will raise a DGLError if it detects
+        0-in-degree nodes in input graph. By setting ``True``, it will suppress the check
+        and let the users handle it by themselves. Default: ``False``.

    Attributes
    ----------
@@ -64,6 +57,68 @@ class GraphConv(nn.Module):
        The learnable weight tensor.
    bias : torch.Tensor
        The learnable bias tensor.
+
+    Notes
+    -----
+    Zero in-degree nodes will lead to invalid output values. Because no message
+    is passed to those nodes, the aggregation function is applied to empty input.
+    A common practice to avoid this is to add a self-loop for each node in the graph if
+    it is homogeneous, which can be achieved by:
+
+    >>> g = ... # a DGLGraph
+    >>> g = dgl.add_self_loop(g)
+
+    Calling ``add_self_loop`` will not work for some graphs, for example, heterogeneous graphs,
+    since the edge type cannot be decided for self-loop edges. Set ``allow_zero_in_degree``
+    to ``True`` for those cases to unblock the code and handle zero-in-degree nodes manually.
+    A common practice is to filter out the zero-in-degree nodes from the output
+    after the convolution.
+
+    Examples
+    --------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import GraphConv
+
+    >>> # Case 1: Homogeneous graph
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> g = dgl.add_self_loop(g)
+    >>> feat = th.ones(6, 10)
+    >>> conv = GraphConv(10, 2, norm='both', weight=True, bias=True)
+    >>> res = conv(g, feat)
+    >>> print(res)
+    tensor([[ 1.3326, -0.2797],
+            [ 1.4673, -0.3080],
+            [ 1.3326, -0.2797],
+            [ 1.6871, -0.3541],
+            [ 1.7711, -0.3717],
+            [ 1.0375, -0.2178]], grad_fn=<AddBackward0>)
+    >>> # allow_zero_in_degree example
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> conv = GraphConv(10, 2, norm='both', weight=True, bias=True, allow_zero_in_degree=True)
+    >>> res = conv(g, feat)
+    >>> print(res)
+    tensor([[-0.2473, -0.4631],
+            [-0.3497, -0.6549],
+            [-0.3497, -0.6549],
+            [-0.4221, -0.7905],
+            [-0.3497, -0.6549],
+            [ 0.0000,  0.0000]], grad_fn=<AddBackward0>)
+
+    >>> # Case 2: Unidirectional bipartite graph
+    >>> u = [0, 1, 0, 0, 1]
+    >>> v = [0, 1, 2, 3, 2]
+    >>> g = dgl.bipartite((u, v))
+    >>> u_fea = th.rand(2, 5)
+    >>> v_fea = th.rand(4, 5)
+    >>> conv = GraphConv(5, 2, norm='both', weight=True, bias=True)
+    >>> res = conv(g, (u_fea, v_fea))
+    >>> res
+    tensor([[-0.2994,  0.6106],
+            [-0.4482,  0.5540],
+            [-0.5287,  0.8235],
+            [-0.2994,  0.6106]], grad_fn=<AddBackward0>)
    """
    def __init__(self,
                 in_feats,
@@ -71,7 +126,8 @@ class GraphConv(nn.Module):
                 norm='both',
                 weight=True,
                 bias=True,
-                 activation=None):
+                 activation=None,
+                 allow_zero_in_degree=False):
        super(GraphConv, self).__init__()
        if norm not in ('none', 'both', 'right'):
            raise DGLError('Invalid norm value. Must be either "none", "both" or "right".'
@@ -79,6 +135,7 @@ class GraphConv(nn.Module):
        self._in_feats = in_feats
        self._out_feats = out_feats
        self._norm = norm
+        self._allow_zero_in_degree = allow_zero_in_degree

        if weight:
            self.weight = nn.Parameter(th.Tensor(in_feats, out_feats))
@@ -95,22 +152,31 @@ class GraphConv(nn.Module):
        self._activation = activation

    def reset_parameters(self):
-        """Reinitialize learnable parameters."""
+        r"""
+
+        Description
+        -----------
+        Reinitialize learnable parameters.
+
+        Notes
+        -----
+        The model parameters are initialized as in the
+        `original implementation <https://github.com/tkipf/gcn/blob/master/gcn/layers.py>`__
+        where the weight :math:`W^{(l)}` is initialized using Glorot uniform initialization
+        and the bias is initialized to be zero.
+
+        """
        if self.weight is not None:
            init.xavier_uniform_(self.weight)
        if self.bias is not None:
            init.zeros_(self.bias)

    def forward(self, graph, feat, weight=None):
-        r"""Compute graph convolution.
+        r"""

-        Notes
-        -----
-        * Input shape: :math:`(N, *, \text{in_feats})` where * means any number of additional
-          dimensions, :math:`N` is the number of nodes.
-        * Output shape: :math:`(N, *, \text{out_feats})` where all but the last dimension are
-          the same shape as the input.
-        * Weight shape: :math:`(\text{in_feats}, \text{out_feats})`.
+        Description
+        -----------
+        Compute graph convolution.

        Parameters
        ----------
@@ -120,11 +186,9 @@ class GraphConv(nn.Module):
            If a torch.Tensor is given, it represents the input feature of shape
            :math:`(N, D_{in})`
            where :math:`D_{in}` is size of input feature, :math:`N` is the number of nodes.
-            If a pair of torch.Tensor is given, the pair must contain two tensors of shape
-            :math:`(N_{in}, D_{in_{src}})` and :math:`(N_{out}, D_{in_{dst}})`.
-
-            Note that in the special case of graph convolutional networks, if a pair of
-            tensors is given, the latter element will not participate in computation.
+            If a pair of torch.Tensor is given, which is the case for a bipartite graph, the
+            pair must contain two tensors of shape :math:`(N_{in}, D_{in_{src}})` and
+            :math:`(N_{out}, D_{in_{dst}})`.
        weight : torch.Tensor, optional
            Optional external weight tensor.

@@ -132,11 +196,42 @@ class GraphConv(nn.Module):
        -------
        torch.Tensor
            The output feature
+
+        Raises
+        ------
+        DGLError
+            Case 1:
+            If there are 0-in-degree nodes in the input graph, it will raise DGLError
+            since no message will be passed to those nodes. This will cause invalid output.
+            The error can be ignored by setting ``allow_zero_in_degree`` parameter to ``True``.
+
+            Case 2:
+            External weight is provided while at the same time the module
+            has defined its own weight parameter.
+
+        Notes
+        -----
+        * Input shape: :math:`(N, *, \text{in_feats})` where * means any number of additional
+          dimensions, :math:`N` is the number of nodes.
+        * Output shape: :math:`(N, *, \text{out_feats})` where all but the last dimension are
+          the same shape as the input.
+        * Weight shape: :math:`(\text{in_feats}, \text{out_feats})`.
        """
        with graph.local_scope():
+            if not self._allow_zero_in_degree:
+                if (graph.in_degrees() == 0).any():
+                    raise DGLError('There are 0-in-degree nodes in the graph, '
+                                   'output for those nodes will be invalid. '
+                                   'This is harmful for some applications, '
+                                   'causing silent performance regression. '
+                                   'Adding self-loop on the input graph by '
+                                   'calling `g = dgl.add_self_loop(g)` will resolve '
+                                   'the issue. Setting ``allow_zero_in_degree`` '
+                                   'to be `True` when constructing this module will '
+                                   'suppress the check and let the code run.')
+
            # (BarclayII) For RGCN on heterogeneous graphs we need to support GCN on bipartite.
            feat_src, feat_dst = expand_as_pair(feat, graph)
-
            if self._norm == 'both':
                degs = graph.out_degrees().float().clamp(min=1)
                norm = th.pow(degs, -0.5)
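
Reviewer note: the ``norm='both'`` scaling can be reproduced by hand on the docstring's example graph. Below is a minimal pure-Python sketch with scalar features, no weight and no bias. It assumes the destination side is scaled by the in-degree in the same clamped way the source side is scaled by the out-degree above; `gcn_both_norm` is a hypothetical name.

```python
def gcn_both_norm(src, dst, feat, num_nodes):
    """Aggregate scalar features with 1 / sqrt(out_deg(j) * in_deg(i)) scaling."""
    out_deg = [0] * num_nodes
    in_deg = [0] * num_nodes
    for u, v in zip(src, dst):
        out_deg[u] += 1
        in_deg[v] += 1

    # Clamp degrees to at least 1 so the power never divides by zero.
    out_norm = [max(d, 1) ** -0.5 for d in out_deg]
    in_norm = [max(d, 1) ** -0.5 for d in in_deg]

    agg = [0.0] * num_nodes
    for u, v in zip(src, dst):
        agg[v] += feat[u] * out_norm[u]  # scale on the source side, then sum
    return [a * n for a, n in zip(agg, in_norm)]  # scale on the destination side


src = [0, 1, 2, 3, 2, 5]
dst = [1, 2, 3, 4, 0, 3]
h = gcn_both_norm(src, dst, [1.0] * 6, 6)
```

Node 5 has no incoming edge, so its aggregate stays at zero. This matches the all-zero last row in the ``allow_zero_in_degree=True`` example, and the ratios between the other entries of ``h`` match the ratios between the remaining rows of that output.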

--- a/python/dgl/nn/pytorch/conv/nnconv.py
+++ b/python/dgl/nn/pytorch/conv/nnconv.py
@@ -4,30 +4,39 @@ import torch as th
 from torch import nn
 from torch.nn import init

+from ....base import DGLError
 from .... import function as fn
 from ..utils import Identity
 from ....utils import expand_as_pair


 class NNConv(nn.Module):
-    r"""Graph Convolution layer introduced in `Neural Message Passing
+    r"""
+
+    Description
+    -----------
+    Graph Convolution layer introduced in `Neural Message Passing
    for Quantum Chemistry <https://arxiv.org/pdf/1704.01212.pdf>`__.

    .. math::
        h_{i}^{l+1} = h_{i}^{l} + \mathrm{aggregate}\left(\left\{
        f_\Theta (e_{ij}) \cdot h_j^{l}, j\in \mathcal{N}(i) \right\}\right)

+    where :math:`e_{ij}` is the edge feature, :math:`f_\Theta` is a function
+    with learnable parameters.
+
    Parameters
    ----------
    in_feats : int
-        Input feature size.
-
+        Input feature size; i.e., the number of dimensions of :math:`h_j^{(l)}`.
+        NNConv can be applied on homogeneous graphs and unidirectional
+        `bipartite graphs <https://docs.dgl.ai/generated/dgl.bipartite.html?highlight=bipartite>`__.
        If the layer is to be applied on a unidirectional bipartite graph, ``in_feats``
        specifies the input feature size on both the source and destination nodes.  If
        a scalar is given, the source and destination node feature size would take the
        same value.
    out_feats : int
-        Output feature size.
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
    edge_func : callable activation function/layer
        Maps each edge feature to a vector of shape
        ``(in_feats * out_feats)`` as weight to compute
@@ -39,18 +48,81 @@ class NNConv(nn.Module):
        If True, use residual connection. Default: ``False``.
    bias : bool, optional
        If True, adds a learnable bias to the output. Default: ``True``.
+    allow_zero_in_degree : bool, optional
+        If there are 0-in-degree nodes in the graph, output for those nodes will be invalid
+        since no message will be passed to those nodes. This is harmful for some applications
+        causing silent performance regression. This module will raise a DGLError if it detects
+        0-in-degree nodes in input graph. By setting ``True``, it will suppress the check
+        and let the users handle it by themselves. Default: ``False``.
+
+    Notes
+    -----
+    Zero in-degree nodes will lead to invalid output values. Because no message
+    is passed to those nodes, the aggregation function is applied to empty input.
+    A common practice to avoid this is to add a self-loop for each node in the graph if
+    it is homogeneous, which can be achieved by:
+
+    >>> g = ... # a DGLGraph
+    >>> g = dgl.add_self_loop(g)
+
+    Calling ``add_self_loop`` will not work for some graphs, for example, heterogeneous graphs,
+    since the edge type cannot be decided for self-loop edges. Set ``allow_zero_in_degree``
+    to ``True`` for those cases to unblock the code and handle zero-in-degree nodes manually.
+    A common practice is to filter out the zero-in-degree nodes from the output
+    after the convolution.
+
+    Examples
+    --------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import NNConv
+
+    >>> # Case 1: Homogeneous graph
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> g = dgl.add_self_loop(g)
+    >>> feat = th.ones(6, 10)
+    >>> lin = th.nn.Linear(5, 20)
+    >>> def edge_func(efeat):
+    ...     return lin(efeat)
+    >>> efeat = th.ones(6+6, 5)
+    >>> conv = NNConv(10, 2, edge_func, 'mean')
+    >>> res = conv(g, feat, efeat)
+    >>> res
+    tensor([[-1.5243, -0.2719],
+            [-1.5243, -0.2719],
+            [-1.5243, -0.2719],
+            [-1.5243, -0.2719],
+            [-1.5243, -0.2719],
+            [-1.5243, -0.2719]], grad_fn=<AddBackward0>)
+
+    >>> # Case 2: Unidirectional bipartite graph
+    >>> u = [0, 1, 0, 0, 1]
+    >>> v = [0, 1, 2, 3, 2]
+    >>> g = dgl.bipartite((u, v))
+    >>> u_feat = th.tensor(np.random.rand(2, 10).astype(np.float32))
+    >>> v_feat = th.tensor(np.random.rand(4, 10).astype(np.float32))
+    >>> conv = NNConv(10, 2, edge_func, 'mean')
+    >>> efeat = th.ones(5, 5)
+    >>> res = conv(g, (u_feat, v_feat), efeat)
+    >>> res
+    tensor([[-0.6568,  0.5042],
+            [ 0.9089, -0.5352],
+            [ 0.1261, -0.0155],
+            [-0.6568,  0.5042]], grad_fn=<AddBackward0>)
    """
    def __init__(self,
                 in_feats,
                 out_feats,
                 edge_func,
-                 aggregator_type,
+                 aggregator_type='mean',
                 residual=False,
-                 bias=True):
+                 bias=True,
+                 allow_zero_in_degree=False):
        super(NNConv, self).__init__()
        self._in_src_feats, self._in_dst_feats = expand_as_pair(in_feats)
        self._out_feats = out_feats
-        self.edge_nn = edge_func
+        self.edge_func = edge_func
        if aggregator_type == 'sum':
            self.reducer = fn.sum
        elif aggregator_type == 'mean':
@@ -71,10 +143,21 @@ class NNConv(nn.Module):
            self.bias = nn.Parameter(th.Tensor(out_feats))
        else:
            self.register_buffer('bias', None)
+        self._allow_zero_in_degree = allow_zero_in_degree
        self.reset_parameters()

    def reset_parameters(self):
-        """Reinitialize learnable parameters."""
+        r"""
+
+        Description
+        -----------
+        Reinitialize learnable parameters.
+
+        Notes
+        -----
+        The model parameters are initialized using Glorot uniform initialization
+        and the bias is initialized to be zero.
+        """
        gain = init.calculate_gain('relu')
        if self.bias is not None:
            nn.init.zeros_(self.bias)
@@ -94,7 +177,7 @@ class NNConv(nn.Module):
            input feature size.
        efeat : torch.Tensor
            The edge feature of shape :math:`(N, *)`, should fit the input
-            shape requirement of ``edge_nn``.
+            shape requirement of ``edge_func``.

        Returns
        -------
@@ -103,12 +186,24 @@ class NNConv(nn.Module):
            is the output feature size.
        """
        with graph.local_scope():
+            if not self._allow_zero_in_degree:
+                if (graph.in_degrees() == 0).any():
+                    raise DGLError('There are 0-in-degree nodes in the graph, '
+                                   'output for those nodes will be invalid. '
+                                   'This is harmful for some applications, '
+                                   'causing silent performance regression. '
+                                   'Adding self-loop on the input graph by '
+                                   'calling `g = dgl.add_self_loop(g)` will resolve '
+                                   'the issue. Setting `allow_zero_in_degree` '
+                                   'to `True` when constructing this module will '
+                                   'suppress the check and let the code run.')
+
            feat_src, feat_dst = expand_as_pair(feat, graph)

            # (n, d_in, 1)
            graph.srcdata['h'] = feat_src.unsqueeze(-1)
            # (n, d_in, d_out)
-            graph.edata['w'] = self.edge_nn(efeat).view(-1, self._in_src_feats, self._out_feats)
+            graph.edata['w'] = self.edge_func(efeat).view(-1, self._in_src_feats, self._out_feats)
            # (n, d_in, d_out)
            graph.update_all(fn.u_mul_e('h', 'w', 'm'), self.reducer('m', 'neigh'))
            rst = graph.dstdata['neigh'].sum(dim=1) # (n, d_out)
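
Note on the hunk above: the sketch below is not the DGL implementation. It restates the message passing in plain NumPy to show why a 0-in-degree node needs special handling. The function name `nnconv_mean_sketch` and the all-ones inputs are hypothetical; the edge list is the one from the docstring example, without the added self-loops.

```python
import numpy as np

def nnconv_mean_sketch(src, dst, feat, edge_weight, num_nodes):
    """Edge-conditioned convolution with a mean reducer (illustrative only).

    src, dst    : integer arrays of shape (E,), the edge endpoints
    feat        : (N, d_in) node features
    edge_weight : (E, d_in, d_out), standing in for the reshaped edge_func output
    """
    # Per-edge message: source feature (d_in,) times that edge's weight (d_in, d_out).
    msg = np.einsum('ei,eio->eo', feat[src], edge_weight)
    out = np.zeros((num_nodes, edge_weight.shape[2]))
    np.add.at(out, dst, msg)                      # sum messages per destination node
    in_deg = np.bincount(dst, minlength=num_nodes)
    has_in = in_deg > 0
    out[has_in] /= in_deg[has_in][:, None]        # mean over incoming edges only
    return out, in_deg

src = np.array([0, 1, 2, 3, 2, 5])
dst = np.array([1, 2, 3, 4, 0, 3])
feat = np.ones((6, 10))
edge_weight = np.ones((6, 10, 2))                 # assumed weights, for illustration
out, in_deg = nnconv_mean_sketch(src, dst, feat, edge_weight, num_nodes=6)
# Node 5 never appears in dst, so it receives no message and its row is
# whatever the reducer produces for empty input (zeros in this sketch).
```

In this sketch the row for node 5 is a placeholder, not a learned representation, which is the situation the new `allow_zero_in_degree` check is meant to surface.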

--- a/python/dgl/nn/pytorch/conv/relgraphconv.py
+++ b/python/dgl/nn/pytorch/conv/relgraphconv.py
@@ -8,7 +8,11 @@ from .. import utils


 class RelGraphConv(nn.Module):
-    r"""Relational graph convolution layer.
+    r"""
+
+    Description
+    -----------
+    Relational graph convolution layer.

    Relational graph convolution is introduced in "`Modeling Relational Data with Graph
    Convolutional Networks <https://arxiv.org/abs/1703.06103>`__"
@@ -30,37 +34,82 @@ class RelGraphConv(nn.Module):

       W_r^{(l)} = \sum_{b=1}^B a_{rb}^{(l)}V_b^{(l)}

-    where :math:`B` is the number of bases.
+    where :math:`B` is the number of bases, :math:`V_b^{(l)}` are linearly combined
+    with coefficients :math:`a_{rb}^{(l)}`.

    The block-diagonal-decomposition regularization decomposes :math:`W_r` into :math:`B`
    number of block diagonal matrices. We refer :math:`B` as the number of bases.

+    The block regularization decomposes :math:`W_r` by:
+
+    .. math::
+
+       W_r^{(l)} = \oplus_{b=1}^B Q_{rb}^{(l)}
+
+    where :math:`B` is the number of bases, :math:`Q_{rb}^{(l)}` are block
+    bases with shape :math:`\mathbb{R}^{(d^{(l+1)}/B) \times (d^{(l)}/B)}`.
+
    Parameters
    ----------
    in_feat : int
-        Input feature size.
+        Input feature size; i.e., the number of dimensions of :math:`h_j^{(l)}`.
    out_feat : int
-        Output feature size.
+        Output feature size; i.e., the number of dimensions of :math:`h_i^{(l+1)}`.
    num_rels : int
-        Number of relations.
+        Number of relations.
    regularizer : str
-        Which weight regularizer to use "basis" or "bdd"
+        Which weight regularizer to use: "basis" or "bdd".
+        "basis" is short for basis-decomposition.
+        "bdd" is short for block-diagonal-decomposition.
    num_bases : int, optional
-        Number of bases. If is none, use number of relations. Default: None.
+        Number of bases. If ``None``, use the number of relations. Default: ``None``.
    bias : bool, optional
-        True if bias is added. Default: True
+        True if bias is added. Default: ``True``.
    activation : callable, optional
-        Activation function. Default: None
+        Activation function. Default: ``None``.
    self_loop : bool, optional
-        True to include self loop message. Default: False
+        True to include self loop message. Default: ``True``.
    low_mem : bool, optional
-        True to use low memory implementation of relation message passing function. Default: False
-        This option trade speed with memory consumption, and will slowdown the forward/backward.
-        Turn it on when you encounter OOM problem during training or evaluation.
+        True to use a low-memory implementation of the relation message passing function.
+        This option trades speed for memory consumption and slows down forward/backward passes.
+        Turn it on if you hit OOM problems during training or evaluation. Default: ``False``.
    dropout : float, optional
-        Dropout rate. Default: 0.0
+        Dropout rate. Default: ``0.0``.
    layer_norm: float, optional
-        Add layer norm. Default: False
+        Add layer norm. Default: ``False``.
+
+    Examples
+    --------
+    >>> import dgl
+    >>> import numpy as np
+    >>> import torch as th
+    >>> from dgl.nn import RelGraphConv
+    >>>
+    >>> g = dgl.graph(([0,1,2,3,2,5], [1,2,3,4,0,3]))
+    >>> feat = th.ones(6, 10)
+    >>> conv = RelGraphConv(10, 2, 3, regularizer='basis', num_bases=2)
+    >>> conv.weight.shape
+    torch.Size([2, 10, 2])
+    >>> etype = th.tensor(np.array([0,1,2,0,1,2]).astype(np.int64))
+    >>> res = conv(g, feat, etype)
+    >>> res
+    tensor([[ 0.3996, -2.3303],
+            [-0.4323, -0.1440],
+            [ 0.3996, -2.3303],
+            [ 2.1046, -2.8654],
+            [-0.4323, -0.1440],
+            [-0.1309, -1.0000]], grad_fn=<AddBackward0>)
+
+    >>> # One-hot input
+    >>> one_hot_feat = th.tensor(np.array([0,1,2,3,4,5]).astype(np.int64))
+    >>> res = conv(g, one_hot_feat, etype)
+    >>> res
+    tensor([[ 0.5925,  0.0985],
+            [-0.3953,  0.8408],
+            [-0.9819,  0.5284],
+            [-1.0085, -0.1721],
+            [ 0.5962,  1.2002],
+            [ 0.0365, -0.3532]], grad_fn=<AddBackward0>)
    """
    def __init__(self,
                 in_feat,
@@ -70,7 +119,7 @@ class RelGraphConv(nn.Module):
                 num_bases=None,
                 bias=True,
                 activation=None,
-                 self_loop=False,
+                 self_loop=True,
                 low_mem=False,
                 dropout=0.0,
                 layer_norm=False):
@@ -193,22 +242,28 @@ class RelGraphConv(nn.Module):
            msg = msg * edges.data['norm']
        return {'msg': msg}

-    def forward(self, g, x, etypes, norm=None):
-        """ Forward computation
+    def forward(self, g, feat, etypes, norm=None):
+        """
+
+        Description
+        -----------
+
+        Forward computation

        Parameters
        ----------
        g : DGLGraph
            The graph.
-        x : torch.Tensor
+        feat : torch.Tensor
            Input node features. Could be either
+
                * :math:`(|V|, D)` dense tensor
                * :math:`(|V|,)` int64 vector, representing the categorical values of each
-                  node. We then treat the input feature as an one-hot encoding feature.
+                  node. The input feature is then treated as a one-hot encoding.
        etypes : torch.Tensor
            Edge type tensor. Shape: :math:`(|E|,)`
        norm : torch.Tensor
-            Optional edge normalizer tensor. Shape: :math:`(|E|, 1)`
+            Optional edge normalizer tensor. Shape: :math:`(|E|, 1)`.

        Returns
        -------
@@ -216,12 +271,12 @@ class RelGraphConv(nn.Module):
            New node features.
        """
        with g.local_scope():
-            g.srcdata['h'] = x
+            g.srcdata['h'] = feat
            g.edata['type'] = etypes
            if norm is not None:
                g.edata['norm'] = norm
            if self.self_loop:
-                loop_message = utils.matmul_maybe_select(x[:g.number_of_dst_nodes()],
+                loop_message = utils.matmul_maybe_select(feat[:g.number_of_dst_nodes()],
                                                         self.loop_weight)
            # message passing
            g.update_all(self.message_func, fn.sum(msg='msg', out='h'))
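
Note on the RelGraphConv docstring: the two regularizers it describes can be sketched in plain NumPy as below. This is an illustration under stated assumptions, not the module's code. The names `V`, `a`, `Q` mirror the symbols in the formulas, `block_diag` is a hypothetical helper, and weights are stored as (in_feat, out_feat) to match the `torch.Size([2, 10, 2])` shape shown in the example, which is the transpose of the orientation used in the block-shape formula.

```python
import numpy as np

rng = np.random.default_rng(0)
num_rels, num_bases, in_feat, out_feat = 3, 2, 10, 2

# "basis": every relation weight is a linear combination of shared bases,
# W_r = sum_b a_rb * V_b.
V = rng.standard_normal((num_bases, in_feat, out_feat))   # shared bases V_b
a = rng.standard_normal((num_rels, num_bases))            # coefficients a_rb
W_basis = np.einsum('rb,bio->rio', a, V)                  # (num_rels, in_feat, out_feat)

# "bdd": a relation weight is the direct sum of B small blocks Q_rb,
# i.e. a block-diagonal matrix.
def block_diag(blocks):
    """Place 2-D blocks along the diagonal of a zero matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    W = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        W[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return W

# Blocks for a single relation; in_feat and out_feat must be divisible by num_bases.
Q = rng.standard_normal((num_bases, in_feat // num_bases, out_feat // num_bases))
W_bdd = block_diag(list(Q))                               # (in_feat, out_feat)
```

With the basis regularizer the parameter count grows with `num_bases` rather than `num_rels`; with the block-diagonal one, each relation keeps its own blocks but the off-diagonal entries are fixed at zero.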