Unverified Commit 2fa7f71a authored by zhjwy9343, committed by GitHub

[Doc] new nn api doc (#2019)



* Add dotproduct attention

* [Feature] Add dotproduct attention

* [Feature] Add dotproduct attention

* [Feature] Add dotproduct attention

* [New] Update landing page

* [New] Update landing page

* [New] Update landing page

* [Doc] Update landing page

* [Doc] Update landing page

* [Doc] Update landing page

* [Doc] Update landing page

* [Doc] Update landing page

* [Doc] Update landing page

* [Doc] Update landing page

* [Doc] Update landing page

* [Improvement] use dgl build-in in dotgatconv

* [Doc] review API doc string bottom up

* [Doc] Add doc of input and output features

* [Doc] Update doc string for pooling and transformer.

* [Doc] Reformat doc string and change some wordings.

* [Doc] Doc string refactoring.

* [Doc] Doc string refactoring.

* [Doc] Doc string refactoring.

* [Doc] Doc string refactoring.

* [Doc] Doc string refactoring.
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
parent f4f78803
@@ -181,12 +181,8 @@ class DotGatConv(nn.Module):
# Step 2. edge softmax to compute attention scores
graph.edata['sa'] = edge_softmax(graph, graph.edata['a'])
# Step 3. Broadcast softmax value to each edge, and then attention is done
graph.apply_edges(lambda edges: {'attn': edges.src['ft'] * \
edges.data['sa'].unsqueeze(dim=0).T})
# Step 4. Aggregate attention to dst,user nodes, so formula 7 is done
graph.update_all(fn.copy_e('attn', 'm'), fn.sum('m', 'agg_u'))
# Step 3. Broadcast softmax value to each edge, and aggregate dst node
graph.update_all(fn.u_mul_e('ft', 'sa', 'attn'), fn.sum('attn', 'agg_u'))
# output results to the destination nodes
rst = graph.dstdata['agg_u']
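For context, a minimal self-contained sketch of the fused built-in pattern this hunk switches to; the toy graph, feature size, and random scores below are illustrative assumptions, not part of the original code:

>>> import dgl
>>> import dgl.function as fn
>>> import torch as th
>>> from dgl.ops import edge_softmax
>>>
>>> g = dgl.graph(([0, 1, 2], [1, 2, 0]))        # toy 3-node cycle
>>> g.ndata['ft'] = th.randn(3, 4)               # node features
>>> scores = th.randn(g.number_of_edges(), 1)    # unnormalized attention logits
>>> g.edata['sa'] = edge_softmax(g, scores)      # normalize per destination node
>>> # u_mul_e multiplies each source feature by its edge weight (broadcast over
>>> # the feature dimension); fn.sum aggregates the messages at each dst node
>>> g.update_all(fn.u_mul_e('ft', 'sa', 'attn'), fn.sum('attn', 'agg_u'))
>>> g.dstdata['agg_u'].shape
torch.Size([3, 4])

A single fused message-passing call replaces the explicit ``apply_edges`` lambda and its transpose.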
@@ -12,11 +12,23 @@ def pairwise_squared_distance(x):
class KNNGraph(nn.Module):
r"""Layer that transforms one point set into a graph, or a batch of
r"""
Description
-----------
Layer that transforms one point set into a graph, or a batch of
point sets with the same number of points into a union of those graphs.
If a batch of point set is provided, then the point :math:`j` in point
set :math:`i` is mapped to graph node ID :math:`i \times M + j`, where
The KNNGraph is implemented in the following steps, sketched in code below:

1. Compute an :math:`N \times N` matrix of pairwise distances for all points.
2. Pick the k points with the smallest distances to each point as its k-nearest neighbors.
3. Construct a graph where each point is a node, with incoming edges from its k-nearest neighbors.

The overall computational complexity is :math:`O(N^2(\log N + D))`.
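A plain-PyTorch sketch of the three steps (an illustration of the algorithm only, assuming Euclidean distance; DGL's actual kernel differs):

>>> import torch
>>> def knn_edges(x, k):
...     # Step 1: N x N matrix of pairwise squared distances.
...     dist = torch.cdist(x.float(), x.float()) ** 2
...     # Step 2: the k smallest distances per point (the point itself included).
...     nbrs = dist.topk(k, largest=False).indices
...     # Step 3: edges run from the k-nearest neighbors to each point.
...     src = nbrs.reshape(-1)
...     dst = torch.arange(x.shape[0]).repeat_interleave(k)
...     return src, dst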
If a batch of point sets is provided, the point :math:`j` in point
set :math:`i` is mapped to graph node ID: :math:`i \times M + j`, where
:math:`M` is the number of nodes in each point set.
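For example, with point sets of size :math:`M = 3`, point :math:`j = 2` in point set :math:`i = 1` maps to graph node ID :math:`1 \times 3 + 2 = 5`.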
The predecessors of each node are the k-nearest neighbors of the
@@ -25,7 +37,30 @@ class KNNGraph(nn.Module):
Parameters
----------
k : int
The number of neighbors
The number of neighbors.
Notes
-----
The nearest neighbors found for a node include the node itself.
Examples
--------
The following example uses PyTorch backend.
>>> import torch
>>> from dgl.nn.pytorch.factory import KNNGraph
>>>
>>> kg = KNNGraph(2)
>>> x = torch.tensor([[0,1],
... [1,2],
... [1,3],
... [100, 101],
... [101, 102],
... [50, 50]])
>>> g = kg(x)
>>> print(g.edges())
(tensor([0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5]),
tensor([0, 0, 1, 2, 1, 2, 5, 3, 4, 3, 4, 5]))
"""
def __init__(self, k):
super(KNNGraph, self).__init__()
@@ -33,7 +68,9 @@ class KNNGraph(nn.Module):
#pylint: disable=invalid-name
def forward(self, x):
"""Forward computation.
"""
Forward computation.
Parameters
----------
@@ -45,19 +82,23 @@ class KNNGraph(nn.Module):
Returns
-------
DGLGraph
A DGLGraph with no features.
A DGLGraph without features.
"""
return knn_graph(x, self.k)
class SegmentedKNNGraph(nn.Module):
r"""Layer that transforms one point set into a graph, or a batch of
r"""
Description
-----------
Layer that transforms one point set into a graph, or a batch of
point sets with different numbers of points into a union of those graphs.
If a batch of point set is provided, then the point :math:`j` in point
set :math:`i` is mapped to graph node ID
If a batch of point sets is provided, then the point :math:`j` in the point
set :math:`i` is mapped to graph node ID:
:math:`\sum_{p<i} |V_p| + j`, where :math:`|V_p|` means the number of
points in point set :math:`p`.
points in the point set :math:`p`.
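For example, with three point sets of sizes ``[3, 3, 2]`` (as in the example below), point :math:`j = 1` in point set :math:`i = 2` maps to graph node ID :math:`3 + 3 + 1 = 7`.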
The predecessors of each node are the k-nearest neighbors of the
corresponding point.
@@ -65,7 +106,34 @@ class SegmentedKNNGraph(nn.Module):
Parameters
----------
k : int
The number of neighbors
The number of neighbors.
Notes
-----
The nearest neighbors found for a node include the node itself.
Examples
--------
The following example uses PyTorch backend.
>>> import torch
>>> from dgl.nn.pytorch.factory import SegmentedKNNGraph
>>>
>>> kg = SegmentedKNNGraph(2)
>>> x = torch.tensor([[0,1],
... [1,2],
... [1,3],
... [100, 101],
... [101, 102],
... [50, 50],
... [24,25],
... [25,24]])
>>> g = kg(x, [3,3,2])
>>> print(g.edges())
(tensor([0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 7, 7]),
tensor([0, 0, 1, 2, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 6, 7]))
>>>
"""
def __init__(self, k):
super(SegmentedKNNGraph, self).__init__()
@@ -73,20 +141,22 @@ class SegmentedKNNGraph(nn.Module):
#pylint: disable=invalid-name
def forward(self, x, segs):
"""Forward computation.
r"""Forward computation.
Parameters
----------
x : Tensor
:math:`(M, D)` where :math:`M` means the total number of points
in all point sets.
in all point sets, and :math:`D` means the size of features.
segs : iterable of int
:math:`(N)` integers where :math:`N` means the number of point
sets. The elements must sum up to :math:`M`.
sets. The elements must sum up to :math:`M`, and each element
must be no less than :math:`k`.
Returns
-------
DGLGraph
A DGLGraph with no features.
A DGLGraph without features.
"""
return segmented_knn_graph(x, self.k, segs)
@@ -14,31 +14,76 @@ __all__ = ['SumPooling', 'AvgPooling', 'MaxPooling', 'SortPooling',
'SetTransformerEncoder', 'SetTransformerDecoder', 'WeightAndSum']
class SumPooling(nn.Module):
r"""Apply sum pooling over the nodes in the graph.
r"""
Description
-----------
Apply sum pooling over the nodes in a graph.
.. math::
r^{(i)} = \sum_{k=1}^{N_i} x^{(i)}_k
Notes
-----
Input can be a single graph or a batch of graphs. If a batch of graphs
is given, make sure nodes in all graphs have the same feature size, and
concatenate the node features as the input.
Examples
--------
The following example uses PyTorch backend.
>>> import dgl
>>> import torch as th
>>> from dgl.nn.pytorch.glob import SumPooling
>>>
>>> g1 = dgl.DGLGraph()
>>> g1.add_nodes(2)
>>> g1_node_feats = th.ones(2,5)
>>>
>>> g2 = dgl.DGLGraph()
>>> g2.add_nodes(3)
>>> g2_node_feats = th.ones(3,5)
>>>
>>> sumpool = SumPooling()
Case 1: Input a single graph
>>> sumpool(g1, g1_node_feats)
tensor([[2., 2., 2., 2., 2.]])
Case 2: Input a batch of graphs
Build a batch of DGL graphs and concatenate all graphs' node features into one tensor.
>>> batch_g = dgl.batch([g1, g2])
>>> batch_f = th.cat([g1_node_feats, g2_node_feats])
>>>
>>> sumpool(batch_g, batch_f)
tensor([[2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3.]])
"""
def __init__(self):
super(SumPooling, self).__init__()
def forward(self, graph, feat):
r"""Compute sum pooling.
r"""
Compute sum pooling.
Parameters
----------
graph : DGLGraph
The graph.
A DGLGraph or a batch of DGLGraphs.
feat : torch.Tensor
The input feature with shape :math:`(N, *)` where
:math:`N` is the number of nodes in the graph.
The input feature with shape :math:`(N, D)`, where :math:`N` is the number
of nodes in the graph, and :math:`D` means the size of features.
Returns
-------
torch.Tensor
The output feature with shape :math:`(B, *)`, where
:math:`B` refers to the batch size.
The output feature with shape :math:`(B, D)`, where :math:`B` refers to the
batch size of input graphs.
"""
with graph.local_scope():
graph.ndata['h'] = feat
@@ -47,30 +92,76 @@ class SumPooling(nn.Module):
class AvgPooling(nn.Module):
r"""Apply average pooling over the nodes in the graph.
r"""
Description
-----------
Apply average pooling over the nodes in a graph.
.. math::
r^{(i)} = \frac{1}{N_i}\sum_{k=1}^{N_i} x^{(i)}_k
Notes
-----
Input can be a single graph or a batch of graphs. If a batch of graphs
is given, make sure nodes in all graphs have the same feature size, and
concatenate the node features as the input.
Examples
--------
The following example uses PyTorch backend.
>>> import dgl
>>> import torch as th
>>> from dgl.nn.pytorch.glob import AvgPooling
>>>
>>> g1 = dgl.DGLGraph()
>>> g1.add_nodes(2)
>>> g1_node_feats = th.ones(2,5)
>>>
>>> g2 = dgl.DGLGraph()
>>> g2.add_nodes(3)
>>> g2_node_feats = th.ones(3,5)
>>>
>>> avgpool = AvgPooling()
Case 1: Input a single graph
>>> avgpool(g1, g1_node_feats)
tensor([[1., 1., 1., 1., 1.]])
Case 2: Input a batch of graphs
Build a batch of DGL graphs and concatenate all graphs' node features into one tensor.
>>> batch_g = dgl.batch([g1, g2])
>>> batch_f = th.cat([g1_node_feats, g2_node_feats])
>>>
>>> avgpool(batch_g, batch_f)
tensor([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
"""
def __init__(self):
super(AvgPooling, self).__init__()
def forward(self, graph, feat):
r"""Compute average pooling.
r"""
Compute average pooling.
Parameters
----------
graph : DGLGraph
The graph.
A DGLGraph or a batch of DGLGraphs.
feat : torch.Tensor
The input feature with shape :math:`(N, *)` where
:math:`N` is the number of nodes in the graph.
The input feature with shape :math:`(N, D)`, where :math:`N` is the number
of nodes in the graph, and :math:`D` means the size of features.
Returns
-------
torch.Tensor
The output feature with shape :math:`(B, *)`, where
:math:`B` refers to the batch size.
The output feature with shape :math:`(B, D)`, where
:math:`B` refers to the batch size of input graphs.
"""
with graph.local_scope():
graph.ndata['h'] = feat
@@ -79,10 +170,54 @@ class AvgPooling(nn.Module):
class MaxPooling(nn.Module):
r"""Apply max pooling over the nodes in the graph.
r"""
Description
-----------
Apply max pooling over the nodes in a graph.
.. math::
r^{(i)} = \max_{k=1}^{N_i}\left( x^{(i)}_k \right)
Notes
-----
Input can be a single graph or a batch of graphs. If a batch of graphs
is given, make sure nodes in all graphs have the same feature size, and
concatenate the node features as the input.
Examples
--------
The following example uses PyTorch backend.
>>> import dgl
>>> import torch as th
>>> from dgl.nn.pytorch.glob import MaxPooling
>>>
>>> g1 = dgl.DGLGraph()
>>> g1.add_nodes(2)
>>> g1_node_feats = th.ones(2,5)
>>>
>>> g2 = dgl.DGLGraph()
>>> g2.add_nodes(3)
>>> g2_node_feats = th.ones(3,5)
>>>
>>> maxpool = MaxPooling()
Case 1: Input a single graph
>>> maxpool(g1, g1_node_feats)
tensor([[1., 1., 1., 1., 1.]])
Case 2: Input a batch of graphs
Build a batch of DGL graphs and concatenate all graphs' node features into one tensor.
>>> batch_g = dgl.batch([g1, g2])
>>> batch_f = th.cat([g1_node_feats, g2_node_feats])
>>>
>>> maxpool(batch_g, batch_f)
tensor([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
"""
def __init__(self):
super(MaxPooling, self).__init__()
@@ -93,9 +228,9 @@ class MaxPooling(nn.Module):
Parameters
----------
graph : DGLGraph
The graph.
A DGLGraph or a batch of DGLGraphs.
feat : torch.Tensor
The input feature with shape :math:`(N, *)` where
The input feature with shape :math:`(N, *)`, where
:math:`N` is the number of nodes in the graph.
Returns
@@ -111,34 +246,79 @@ class MaxPooling(nn.Module):
class SortPooling(nn.Module):
r"""Apply Sort Pooling (`An End-to-End Deep Learning Architecture for Graph Classification
<https://www.cse.wustl.edu/~ychen/public/DGCNN.pdf>`__) over the nodes in the graph.
r"""
Description
-----------
Apply Sort Pooling (`An End-to-End Deep Learning Architecture for Graph Classification
<https://www.cse.wustl.edu/~ychen/public/DGCNN.pdf>`__) over the nodes in a graph.
Parameters
----------
k : int
The number of nodes to hold for each graph.
Notes
-----
Input can be a single graph or a batch of graphs. If a batch of graphs
is given, make sure nodes in all graphs have the same feature size, and
concatenate the node features as the input.
Examples
--------
>>> import dgl
>>> import torch as th
>>> from dgl.nn.pytorch.glob import SortPooling
>>>
>>> g1 = dgl.DGLGraph()
>>> g1.add_nodes(2)
>>> g1_node_feats = th.ones(2,5)
>>>
>>> g2 = dgl.DGLGraph()
>>> g2.add_nodes(3)
>>> g2_node_feats = th.ones(3,5)
>>>
>>> sortpool = SortPooling(k=2)
Case 1: Input a single graph
>>> sortpool(g1, g1_node_feats)
tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Case 2: Input a batch of graphs
Build a batch of DGL graphs and concatenate all graphs' node features into one tensor.
>>> batch_g = dgl.batch([g1, g2])
>>> batch_f = th.cat([g1_node_feats, g2_node_feats])
>>>
>>> sortpool(batch_g, batch_f)
tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
"""
def __init__(self, k):
super(SortPooling, self).__init__()
self.k = k
def forward(self, graph, feat):
r"""Compute sort pooling.
r"""
Compute sort pooling.
Parameters
----------
graph : DGLGraph
The graph.
A DGLGraph or a batch of DGLGraphs.
feat : torch.Tensor
The input feature with shape :math:`(N, D)` where
:math:`N` is the number of nodes in the graph.
The input feature with shape :math:`(N, D)`, where :math:`N` is the
number of nodes in the graph, and :math:`D` means the size of features.
Returns
-------
torch.Tensor
The output feature with shape :math:`(B, k * D)`, where
:math:`B` refers to the batch size.
The output feature with shape :math:`(B, k * D)`, where :math:`B` refers
to the batch size of input graphs.
"""
with graph.local_scope():
# Sort the feature of each node in ascending order.
@@ -151,8 +331,12 @@ class SortPooling(nn.Module):
class GlobalAttentionPooling(nn.Module):
r"""Apply Global Attention Pooling (`Gated Graph Sequence Neural Networks
<https://arxiv.org/abs/1511.05493.pdf>`__) over the nodes in the graph.
r"""
Description
-----------
Apply Global Attention Pooling (`Gated Graph Sequence Neural Networks
<https://arxiv.org/abs/1511.05493.pdf>`__) over the nodes in a graph.
.. math::
r^{(i)} = \sum_{k=1}^{N_i}\mathrm{softmax}\left(f_{gate}
@@ -163,8 +347,8 @@ class GlobalAttentionPooling(nn.Module):
gate_nn : torch.nn.Module
A neural network that computes attention scores for each feature.
feat_nn : torch.nn.Module, optional
A neural network applied to each feature before combining them
with attention scores.
A neural network applied to each feature before combining them with attention
scores.
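Examples
--------
A minimal usage sketch with the PyTorch backend. The pooled output is not
shown because it depends on the random initialization of ``gate_nn``; the
linear gate below is an illustrative choice, not prescribed by the API.

>>> import dgl
>>> import torch as th
>>> from dgl.nn.pytorch.glob import GlobalAttentionPooling
>>>
>>> g1 = dgl.DGLGraph()
>>> g1.add_nodes(3)
>>> g1_node_feats = th.rand(3, 5)
>>>
>>> gate_nn = th.nn.Linear(5, 1)  # maps each node feature to a scalar score
>>> gap = GlobalAttentionPooling(gate_nn)
>>> gap(g1, g1_node_feats).shape
torch.Size([1, 5])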
"""
def __init__(self, gate_nn, feat_nn=None):
super(GlobalAttentionPooling, self).__init__()
@@ -172,21 +356,23 @@ class GlobalAttentionPooling(nn.Module):
self.feat_nn = feat_nn
def forward(self, graph, feat):
r"""Compute global attention pooling.
r"""
Compute global attention pooling.
Parameters
----------
graph : DGLGraph
The graph.
A DGLGraph or a batch of DGLGraphs.
feat : torch.Tensor
The input feature with shape :math:`(N, D)` where
:math:`N` is the number of nodes in the graph.
The input feature with shape :math:`(N, D)` where :math:`N` is the
number of nodes in the graph, and :math:`D` means the size of features.
Returns
-------
torch.Tensor
The output feature with shape :math:`(B, D)`, where
:math:`B` refers to the batch size.
The output feature with shape :math:`(B, D)`, where :math:`B` refers
to the batch size.
"""
with graph.local_scope():
gate = self.gate_nn(feat)
@@ -205,9 +391,10 @@ class GlobalAttentionPooling(nn.Module):
class Set2Set(nn.Module):
r"""Apply Set2Set (`Order Matters: Sequence to sequence for sets
<https://arxiv.org/pdf/1511.06391.pdf>`__) over the nodes in the graph.
r"""
Description
-----------
For each individual graph in the batch, set2set computes
.. math::
@@ -224,11 +411,11 @@ class Set2Set(nn.Module):
Parameters
----------
input_dim : int
Size of each input sample
The size of each input sample.
n_iters : int
Number of iterations.
The number of iterations.
n_layers : int
Number of recurrent layers.
The number of recurrent layers.
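Examples
--------
A minimal usage sketch with the PyTorch backend. Exact values are omitted
since they depend on the random initialization of the internal LSTM; note
that the output size is twice ``input_dim`` because set2set concatenates
the query and the attention readout.

>>> import dgl
>>> import torch as th
>>> from dgl.nn.pytorch.glob import Set2Set
>>>
>>> g1 = dgl.DGLGraph()
>>> g1.add_nodes(3)
>>> g1_node_feats = th.rand(3, 5)
>>>
>>> s2s = Set2Set(input_dim=5, n_iters=3, n_layers=1)
>>> s2s(g1, g1_node_feats).shape
torch.Size([1, 10])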
"""
def __init__(self, input_dim, n_iters, n_layers):
super(Set2Set, self).__init__()
@@ -244,21 +431,22 @@ class Set2Set(nn.Module):
self.lstm.reset_parameters()
def forward(self, graph, feat):
r"""Compute set2set pooling.
r"""
Compute set2set pooling.
Parameters
----------
graph : DGLGraph
The graph.
The input graph.
feat : torch.Tensor
The input feature with shape :math:`(N, D)` where
:math:`N` is the number of nodes in the graph.
The input feature with shape :math:`(N, D)` where :math:`N` is the
number of nodes in the graph, and :math:`D` means the size of features.
Returns
-------
torch.Tensor
The output feature with shape :math:`(B, D)`, where
:math:`B` refers to the batch size.
The output feature with shape :math:`(B, D)`, where :math:`B` refers to
the batch size, and :math:`D` means the size of features.
"""
with graph.local_scope():
batch_size = graph.batch_size
@@ -497,31 +685,35 @@ class PMALayer(nn.Module):
class SetTransformerEncoder(nn.Module):
r"""The Encoder module in `Set Transformer: A Framework for Attention-based
r"""
Description
-----------
The Encoder module in `Set Transformer: A Framework for Attention-based
Permutation-Invariant Neural Networks <https://arxiv.org/pdf/1810.00825.pdf>`__.
Parameters
----------
d_model : int
Hidden size of the model.
The hidden size of the model.
n_heads : int
Number of heads.
The number of heads.
d_head : int
Hidden size of each head.
The hidden size of each head.
d_ff : int
Kernel size in FFN (Positionwise Feed-Forward Network) layer.
The kernel size in FFN (Positionwise Feed-Forward Network) layer.
n_layers : int
Number of layers.
The number of layers.
block_type : str
Building block type: 'sab' (Set Attention Block) or 'isab' (Induced
Set Attention Block).
m : int or None
Number of induced vectors in ISAB Block, set to None if block type
The number of induced vectors in ISAB Block. Set to None if block type
is 'sab'.
dropouth : float
Dropout rate of each sublayer.
The dropout rate of each sublayer.
dropouta : float
Dropout rate of attention heads.
The dropout rate of attention heads.
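Examples
--------
A minimal usage sketch with the PyTorch backend. The encoder maps node
features of size ``d_model`` back to node features, so only the output
shape is shown; the hyperparameters below are illustrative.

>>> import dgl
>>> import torch as th
>>> from dgl.nn.pytorch.glob import SetTransformerEncoder
>>>
>>> g1 = dgl.DGLGraph()
>>> g1.add_nodes(3)
>>> g1_node_feats = th.rand(3, 8)
>>>
>>> enc = SetTransformerEncoder(d_model=8, n_heads=2, d_head=4, d_ff=16)
>>> enc(g1, g1_node_feats).shape
torch.Size([3, 8])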
"""
def __init__(self, d_model, n_heads, d_head, d_ff,
n_layers=1, block_type='sab', m=None, dropouth=0., dropouta=0.):
@@ -554,10 +746,10 @@ class SetTransformerEncoder(nn.Module):
Parameters
----------
graph : DGLGraph
The graph.
The input graph.
feat : torch.Tensor
The input feature with shape :math:`(N, D)` where
:math:`N` is the number of nodes in the graph.
The input feature with shape :math:`(N, D)`, where :math:`N` is the
number of nodes in the graph.
Returns
-------
@@ -571,7 +763,11 @@ class SetTransformerEncoder(nn.Module):
class SetTransformerDecoder(nn.Module):
r"""The Decoder module in `Set Transformer: A Framework for Attention-based
r"""
Description
-----------
The Decoder module in `Set Transformer: A Framework for Attention-based
Permutation-Invariant Neural Networks <https://arxiv.org/pdf/1810.00825.pdf>`__.
Parameters
@@ -579,15 +775,15 @@ class SetTransformerDecoder(nn.Module):
d_model : int
Hidden size of the model.
num_heads : int
Number of heads.
The number of heads.
d_head : int
Hidden size of each head.
d_ff : int
Kernel size in FFN (Positionwise Feed-Forward Network) layer.
n_layers : int
Number of layers.
The number of layers.
k : int
Number of seed vectors in PMA (Pooling by Multihead Attention) layer.
The number of seed vectors in PMA (Pooling by Multihead Attention) layer.
dropouth : float
Dropout rate of each sublayer.
dropouta : float
@@ -615,16 +811,16 @@ class SetTransformerDecoder(nn.Module):
Parameters
----------
graph : DGLGraph
The graph.
The input graph.
feat : torch.Tensor
The input feature with shape :math:`(N, D)` where
:math:`N` is the number of nodes in the graph.
The input feature with shape :math:`(N, D)`, where :math:`N` is the
number of nodes in the graph, and :math:`D` means the size of features.
Returns
-------
torch.Tensor
The output feature with shape :math:`(B, D)`, where
:math:`B` refers to the batch size.
The output feature with shape :math:`(B, D)`, where :math:`B` refers to
the batch size.
"""
len_pma = graph.batch_num_nodes()
len_sab = [self.k] * graph.batch_size
@@ -104,20 +104,25 @@ class Identity(nn.Module):
return x
class Sequential(nn.Sequential):
r"""A squential container for stacking graph neural network modules.
r"""
We support two modes: sequentially apply GNN modules on the same graph or
a list of given graphs. In the second case, the number of graphs equals the
Description
-----------
A sequential container for stacking graph neural network modules.

DGL supports two modes: sequentially applying GNN modules on 1) the same
graph, or 2) a list of given graphs. In the second case, the number of
graphs equals the number of modules inside this container.
Parameters
----------
*args :
Sub-modules of type torch.nn.Module, will be added to the container in
the order they are passed in the constructor.
Sub-modules of torch.nn.Module that will be added to the container in
the order by which they are passed in the constructor.
Examples
--------
The following example uses PyTorch backend.
Mode 1: sequentially apply GNN modules on the same graph
@@ -147,7 +152,8 @@ class Sequential(nn.Sequential):
>>> net(g, n_feat, e_feat)
(tensor([[39.8597, 45.4542, 25.1877, 30.8086],
[40.7095, 45.3985, 25.4590, 30.0134],
[40.7894, 45.2556, 25.5221, 30.4220]]), tensor([[80.3772, 89.7752, 50.7762, 60.5520],
[40.7894, 45.2556, 25.5221, 30.4220]]),
tensor([[80.3772, 89.7752, 50.7762, 60.5520],
[80.5671, 89.3736, 50.6558, 60.6418],
[80.4620, 89.5142, 50.3643, 60.3126],
[80.4817, 89.8549, 50.9430, 59.9108],
@@ -186,11 +192,14 @@ class Sequential(nn.Sequential):
[220.4007, 239.7365, 213.8648, 234.9637],
[196.4630, 207.6319, 184.2927, 208.7465]])
"""
def __init__(self, *args):
super(Sequential, self).__init__(*args)
def forward(self, graph, *feats):
r"""Sequentially apply modules to the input.
r"""
Sequentially apply modules to the input.
Parameters
----------
@@ -199,8 +208,8 @@ class Sequential(nn.Sequential):
*feats :
Input features.
The output of :math:`i`-th block should match that of the input
of :math:`(i+1)`-th block.
The output of the :math:`i`-th module should match the input
of the :math:`(i+1)`-th module in the sequence.
"""
if isinstance(graph, list):
for graph_i, module in zip(graph, self):
@@ -6,9 +6,11 @@ __all__ = ['edge_softmax']
def edge_softmax(graph, logits, eids=ALL, norm_by='dst'):
r"""Compute edge softmax.
r"""
For a node :math:`i`, edge softmax is an operation of computing
Description
-----------
Compute edge softmax. For a node :math:`i`, edge softmax is an operation that computes
.. math::
a_{ij} = \frac{\exp(z_{ij})}{\sum_{j\in\mathcal{N}(i)}\exp(z_{ij})}
@@ -22,15 +24,18 @@ def edge_softmax(graph, logits, eids=ALL, norm_by='dst'):
softmax normalized by source nodes (i.e., :math:`ij` are outgoing edges of
`i` in the formula). The former case corresponds to softmax in GAT and
Transformer, and the latter case corresponds to softmax in the Capsule network.
An example of using edge softmax is in
`Graph Attention Network <https://arxiv.org/pdf/1710.10903.pdf>`__ where
the attention weights are computed with such an edge softmax operation.
Parameters
----------
gidx : HeteroGraphIndex
The graph to perfor edge softmax on.
graph : DGLGraph
The graph to perform edge softmax on.
logits : torch.Tensor
The input edge feature
The input edge feature.
eids : torch.Tensor or ALL, optional
Edges on which to apply edge softmax. If ALL, apply edge
A tensor of edge index on which to apply edge softmax. If ALL, apply edge
softmax on all edges in the graph. Default: ALL.
norm_by : str, could be `src` or `dst`
Normalized by source nodes or destination nodes. Default: `dst`.
@@ -38,22 +43,25 @@ def edge_softmax(graph, logits, eids=ALL, norm_by='dst'):
Returns
-------
Tensor
Softmax value
Softmax value.
Notes
-----
* Input shape: :math:`(E, *, 1)` where * means any number of
additional dimensions, and :math:`E` equals the length of `eids`.
If `eids` is ALL, :math:`E` equals the number of edges in
the graph.
* Return shape: :math:`(E, *, 1)`
Examples
--------
The following example uses PyTorch backend.
>>> from dgl.ops import edge_softmax
>>> import dgl
>>> import torch as th
Create a :code:`DGLGraph` object and initialize its edge features.
Create a :code:`DGLGraph` object g and initialize its edge features.
>>> g = dgl.DGLGraph()
>>> g.add_nodes(3)