Unverified Commit 5cb57593 authored by Minjie Wang, committed by GitHub

[Doc] split docs to multiple rst files (#2037)

parent a8b27b47
.. _guide-data-pipeline:
Graph data input pipeline in DGL
==================================
Chapter 4: Graph Data Pipeline
====================================================
DGL implements many commonly used graph datasets in :ref:`apidata`. They
follow a standard pipeline defined in class :class:`dgl.data.DGLDataset`. We highly
......
Distributed Training
============================
Chapter 7: Distributed Training
=====================================
Graph
=====
.. _guide-graph:
Chapter 1: Graph
======================
Graph chapter
......@@ -3,6 +3,7 @@ User Guide
.. toctree::
:maxdepth: 2
:titlesonly:
preface
graph
......
.. _guide-message-passing:
Message Passing
===============
Chapter 2: Message Passing
================================
Message Passing Paradigm
------------------------
......@@ -122,11 +122,8 @@ cleaned. The math formula for the above function is:
message reduction and node update in a single call, which leaves room
for optimizations, as explained below.
Notes
-----
Performance Optimization Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Writing Efficient Message Passing Code
----------------------------------------------
DGL optimizes memory consumption and computing speed for message
passing. The optimizations include:
......@@ -210,7 +207,7 @@ be optimized with DGL’s built-in function ``u_add_v``, which further
speeds up computation and saves memory footprint.
Apply Message Passing On Part Of The Graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----------------------------------------------
If we only want to update part of the nodes in the graph, the practice
is to create a subgraph by providing the ids for the nodes we want to
......@@ -228,7 +225,7 @@ training <https://docs.dgl.ai/generated/guide/minibatch.html>`__ user guide for
usages.
Apply Edge Weight In Message Passing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----------------------------------------
A common practice in GNN modeling is to apply edge weights to the
messages before message aggregation, for example, in
......
.. _guide-minibatch-customizing-neighborhood-sampler:
6.4 Customizing Neighborhood Sampler
----------------------------------------------
Although DGL provides some neighborhood sampling strategies, sometimes
users would want to write their own sampling strategy. This section
explains how to write your own strategy and plug it into your stochastic
GNN training framework.
Recall that in `How Powerful are Graph Neural
Networks <https://arxiv.org/pdf/1810.00826.pdf>`__, the definition of message
passing is:
.. math::
\begin{gathered}
\boldsymbol{a}_v^{(l)} = \rho^{(l)} \left(
\left\lbrace
\boldsymbol{h}_u^{(l-1)} : u \in \mathcal{N} \left( v \right)
\right\rbrace
\right)
\\
\boldsymbol{h}_v^{(l)} = \phi^{(l)} \left(
\boldsymbol{h}_v^{(l-1)}, \boldsymbol{a}_v^{(l)}
\right)
\end{gathered}
where :math:`\rho^{(l)}` and :math:`\phi^{(l)}` are parameterized
functions, and :math:`\mathcal{N}(v)` is defined as the set of
predecessors (or *neighbors* if the graph is undirected) of :math:`v` on graph
:math:`\mathcal{G}`.
For instance, to perform a message passing for updating the red node in
the following graph:
.. figure:: https://i.imgur.com/xYPtaoy.png
:alt: Imgur
Imgur
One needs to aggregate the node features of its neighbors, shown as
green nodes:
.. figure:: https://i.imgur.com/OuvExp1.png
:alt: Imgur
Imgur
Neighborhood sampling with pencil and paper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We then consider how multi-layer message passing works for computing the
output of a single node. In the following text we refer to the nodes
whose GNN outputs are to be computed as *seed nodes*.
.. code:: python
import torch
import dgl
src = torch.LongTensor(
[0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10,
1, 2, 3, 3, 3, 4, 5, 5, 6, 5, 8, 6, 8, 9, 8, 11, 11, 10, 11])
dst = torch.LongTensor(
[1, 2, 3, 3, 3, 4, 5, 5, 6, 5, 8, 6, 8, 9, 8, 11, 11, 10, 11,
0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10])
g = dgl.graph((src, dst))
g.ndata['x'] = torch.randn(12, 5)
g.ndata['y'] = torch.randn(12, 1)
Finding the message passing dependency
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Consider computing with a 2-layer GNN the output of the seed node 8,
colored red, in the following graph:
.. figure:: https://i.imgur.com/xYPtaoy.png
:alt: Imgur
Imgur
By the formulation:
.. math::
\begin{gathered}
\boldsymbol{a}_8^{(2)} = \rho^{(2)} \left(
\left\lbrace
\boldsymbol{h}_u^{(1)} : u \in \mathcal{N} \left( 8 \right)
\right\rbrace
\right) = \rho^{(2)} \left(
\left\lbrace
\boldsymbol{h}_4^{(1)}, \boldsymbol{h}_5^{(1)},
\boldsymbol{h}_7^{(1)}, \boldsymbol{h}_{11}^{(1)}
\right\rbrace
\right)
\\
\boldsymbol{h}_8^{(2)} = \phi^{(2)} \left(
\boldsymbol{h}_8^{(1)}, \boldsymbol{a}_8^{(2)}
\right)
\end{gathered}
We can tell from the formulation that to compute
:math:`\boldsymbol{h}_8^{(2)}` we need messages from nodes 4, 5, 7 and 11
(colored green) along the edges visualized below.
.. figure:: https://i.imgur.com/Gwjz05H.png
:alt: Imgur
Imgur
This graph contains all the nodes in the original graph but only the
edges necessary for message passing to the given output nodes. We call
that the *frontier* of the second GNN layer for the red node 8.
Several functions can be used for generating frontiers. For instance,
:func:`dgl.in_subgraph()` induces a
subgraph that includes all the nodes in the original graph, but only
the incoming edges of the given nodes. You can use that as a frontier
for message passing along all the incoming edges.
.. code:: python
frontier = dgl.in_subgraph(g, [8])
print(frontier.all_edges())
For a concrete list, please refer to :ref:`api-subgraph-extraction` and
:ref:`api-sampling`.
Technically, any graph that has the same set of nodes as the original
graph can serve as a frontier. This serves as the basis for
:ref:`guide-minibatch-customizing-neighborhood-sampler-impl`.
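For example, :func:`dgl.sampling.sample_neighbors` generates a frontier by
picking a bounded number of inbound edges for each given node. The snippet
below is a small sketch using the graph ``g`` defined earlier; the fanout of
2 is an arbitrary choice for illustration.

.. code:: python

    # Sample at most 2 inbound edges for node 8; the result is also a graph
    # over the same node set and can therefore be used as a frontier.
    sampled_frontier = dgl.sampling.sample_neighbors(g, [8], 2)
    print(sampled_frontier.all_edges())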
The Bipartite Structure for Multi-layer Minibatch Message Passing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
However, to compute :math:`\boldsymbol{h}_8^{(2)}` from
:math:`\boldsymbol{h}_\cdot^{(1)}`, we cannot simply perform message
passing on the frontier directly, because it still contains all the
nodes from the original graph. Namely, we only need nodes 4, 5, 7, 8,
and 11 (green and red nodes) as input, as well as node 8 (red node) as output.
Since the number of nodes
for input and output is different, we need to perform message passing on
a small, bipartite-structured graph instead. We call such a
bipartite-structured graph that only contains the necessary input nodes
and output nodes a *block*. The following figure shows the block of the
second GNN layer for node 8.
.. figure:: https://i.imgur.com/stB2UlR.png
:alt: Imgur
Imgur
Note that the output nodes also appear in the input nodes. The reason is
that representations of output nodes from the previous layer are needed
for feature combination after message passing (i.e. :math:`\phi^{(2)}`).
DGL provides :func:`dgl.to_block` to convert any frontier
to a block where the first argument specifies the frontier and the
second argument specifies the output nodes. For instance, the frontier
above can be converted to a block with output node 8 with the code as
follows.
.. code:: python
output_nodes = torch.LongTensor([8])
block = dgl.to_block(frontier, output_nodes)
To find the number of input nodes and output nodes of a given node type,
one can use :meth:`dgl.DGLHeteroGraph.number_of_src_nodes` and
:meth:`dgl.DGLHeteroGraph.number_of_dst_nodes` methods.
.. code:: python
num_input_nodes, num_output_nodes = block.number_of_src_nodes(), block.number_of_dst_nodes()
print(num_input_nodes, num_output_nodes)
The block's input node features can be accessed via the members
:attr:`dgl.DGLHeteroGraph.srcdata` and :attr:`dgl.DGLHeteroGraph.srcnodes`, and
its output node features can be accessed via the members
:attr:`dgl.DGLHeteroGraph.dstdata` and :attr:`dgl.DGLHeteroGraph.dstnodes`. The
syntax of ``srcdata``/``dstdata`` and ``srcnodes``/``dstnodes`` is
identical to that of :attr:`dgl.DGLHeteroGraph.ndata` and
:attr:`dgl.DGLHeteroGraph.nodes` on normal graphs.
.. code:: python
block.srcdata['h'] = torch.randn(num_input_nodes, 5)
block.dstdata['h'] = torch.randn(num_output_nodes, 5)
If a block is converted from a frontier, which is in turn derived from
a graph, one can directly read the features of the block's input and
output nodes via
.. code:: python
print(block.srcdata['x'])
print(block.dstdata['y'])
.. raw:: html
<div class="alert alert-info">
::
<b>ID Mappings</b>
   The original node IDs of the input nodes and output nodes in the block
   can be found as the feature ``dgl.NID``, and the mapping from the
   block's edge IDs to the input frontier's edge IDs can be found as the
   feature ``dgl.EID``.
.. raw:: html
</div>
**Output Nodes**

DGL ensures that the output nodes of a block always appear among the
input nodes, and that they are always indexed first among the input
nodes.
.. code:: python
input_nodes = block.srcdata[dgl.NID]
output_nodes = block.dstdata[dgl.NID]
assert torch.equal(input_nodes[:len(output_nodes)], output_nodes)
In addition, the output nodes must cover all nodes that are the
destination of an edge in the frontier.
For example, consider the following frontier
.. figure:: https://i.imgur.com/g5Ptbj7.png
:alt: Imgur
Imgur
where the red and green nodes (i.e. nodes 4, 5, 7, 8, and 11) are all
nodes that are destinations of an edge. The following code will then
raise an error because the output nodes do not cover all of those nodes.
.. code:: python
dgl.to_block(frontier2, torch.LongTensor([4, 5])) # ERROR
However, the output nodes can include more nodes than these. In this case,
there will be isolated nodes that do not have any edges connected to them.
The isolated nodes are included in both the input nodes and the output
nodes.
.. code:: python

    # Node 3 is an isolated node that does not have any edge pointing to it.
    block3 = dgl.to_block(frontier2, torch.LongTensor([4, 5, 7, 8, 11, 3]))
    print(block3.srcdata[dgl.NID])
    print(block3.dstdata[dgl.NID])
Heterogeneous Graphs
^^^^^^^^^^^^^^^^^^^^
Blocks also work on heterogeneous graphs. Let's say that we have the
following frontier:
.. code:: python
hetero_frontier = dgl.heterograph({
('user', 'follow', 'user'): ([1, 3, 7], [3, 6, 8]),
('user', 'play', 'game'): ([5, 5, 4], [6, 6, 2]),
('game', 'played-by', 'user'): ([2], [6])
}, num_nodes_dict={'user': 10, 'game': 10})
One can then create a block with output nodes User #3, #6, and #8, as
well as Game #2 and #6.
.. code:: python

    hetero_block = dgl.to_block(hetero_frontier, {'user': [3, 6, 8], 'game': [2, 6]})
One can also get the input nodes and output nodes by type:
.. code:: python
# input users and games
print(hetero_block.srcnodes['user'].data[dgl.NID], hetero_block.srcnodes['game'].data[dgl.NID])
# output users and games
print(hetero_block.dstnodes['user'].data[dgl.NID], hetero_block.dstnodes['game'].data[dgl.NID])
.. _guide-minibatch-customizing-neighborhood-sampler-impl:
Implementing a Custom Neighbor Sampler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recall that the following code performs neighbor sampling for node
classification.
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
To implement your own neighborhood sampling strategy, you basically
replace the ``sampler`` object with your own. To do that, let's first
look at :class:`~dgl.dataloading.dataloader.BlockSampler`, the parent class of
:class:`~dgl.dataloading.neighbor.MultiLayerFullNeighborSampler`.
:class:`~dgl.dataloading.dataloader.BlockSampler` is responsible for
generating the list of blocks starting from the last layer, with method
:meth:`~dgl.dataloading.dataloader.BlockSampler.sample_blocks`. The default implementation of
``sample_blocks`` is to iterate backwards, generating the frontiers and
converting them to blocks.
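Conceptually, the default ``sample_blocks`` loop looks roughly like the
following sketch. It is a simplification, not the actual DGL implementation,
and the ``num_layers`` attribute used here is an assumption about where the
layer count is stored.

.. code:: python

    # A simplified sketch of what BlockSampler.sample_blocks does.
    def sample_blocks_sketch(sampler, g, seed_nodes):
        blocks = []
        for block_id in reversed(range(sampler.num_layers)):  # assumed attribute
            # Generate a frontier whose destination nodes are the current seeds.
            frontier = sampler.sample_frontier(block_id, g, seed_nodes)
            # Convert the frontier into a block with the seeds as output nodes.
            block = dgl.to_block(frontier, seed_nodes)
            # The block's input nodes become the seeds of the previous layer.
            seed_nodes = block.srcdata[dgl.NID]
            blocks.insert(0, block)
        return blocks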
Therefore, for neighborhood sampling, **you only need to implement
the**\ :meth:`~dgl.dataloading.dataloader.BlockSampler.sample_frontier`\ **method**. Given the
layer the sampler is generating the frontier for, the original
graph, and the nodes whose representations are to be computed, this method is
responsible for generating a frontier for them.

You also need to pass the number of GNN layers you have to the
parent class.
For example, the implementation of
:class:`~dgl.dataloading.neighbor.MultiLayerFullNeighborSampler` can
go as follows.
.. code:: python
class MultiLayerFullNeighborSampler(dgl.dataloading.BlockSampler):
def __init__(self, n_layers):
super().__init__(n_layers)
def sample_frontier(self, block_id, g, seed_nodes):
frontier = dgl.in_subgraph(g, seed_nodes)
return frontier
:class:`dgl.dataloading.neighbor.MultiLayerNeighborSampler`, a more
complicated neighbor sampler class that allows each node to gather messages
from a small number of sampled neighbors, goes as follows.
.. code:: python
class MultiLayerNeighborSampler(dgl.dataloading.BlockSampler):
def __init__(self, fanouts):
super().__init__(len(fanouts))
self.fanouts = fanouts
def sample_frontier(self, block_id, g, seed_nodes):
fanout = self.fanouts[block_id]
if fanout is None:
frontier = dgl.in_subgraph(g, seed_nodes)
else:
frontier = dgl.sampling.sample_neighbors(g, seed_nodes, fanout)
return frontier
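For instance, instantiating the sampler above with fanouts of 10 and 15
samples at most 10 neighbors per node for the first layer and 15 for the
second; passing ``None`` for a layer falls back to taking all neighbors.

.. code:: python

    # Sample at most 10 neighbors per node for the first layer and at most
    # 15 for the second layer.
    sampler = MultiLayerNeighborSampler([10, 15])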
Although the functions above can generate a frontier, recall that any graph
that has the same nodes as the original graph can serve as a frontier.
For example, if one wants to randomly drop inbound edges to the seed
nodes with a given probability, one can simply define the sampler as follows:
.. code:: python

    class MultiLayerDropoutSampler(dgl.dataloading.BlockSampler):
        def __init__(self, p, n_layers):
            super().__init__(n_layers)
            self.p = p

        def sample_frontier(self, block_id, g, seed_nodes, *args, **kwargs):
            # Get all inbound edges to `seed_nodes`
            src, dst = dgl.in_subgraph(g, seed_nodes).all_edges()
            # Randomly keep each edge with probability p
            mask = torch.zeros(src.shape[0]).bernoulli_(self.p).bool()
            src = src[mask]
            dst = dst[mask]
            # Return a new graph with the same nodes as the original graph as a
            # frontier
            frontier = dgl.graph((src, dst), num_nodes=g.number_of_nodes())
            return frontier
After implementing your sampler, you can create a data loader that takes
in your sampler and it will keep generating lists of blocks while
iterating over the seed nodes as usual.
.. code:: python
sampler = MultiLayerDropoutSampler(0.5, 2)
dataloader = dgl.dataloading.NodeDataLoader(
g, train_nids, sampler,
batch_size=1024,
shuffle=True,
drop_last=False,
num_workers=4)
model = StochasticTwoLayerRGCN(in_features, hidden_features, out_features)
model = model.cuda()
opt = torch.optim.Adam(model.parameters())
for input_nodes, blocks in dataloader:
blocks = [b.to(torch.device('cuda')) for b in blocks]
input_features = blocks[0].srcdata # returns a dict
output_labels = blocks[-1].dstdata # returns a dict
output_predictions = model(blocks, input_features)
loss = compute_loss(output_labels, output_predictions)
opt.zero_grad()
loss.backward()
opt.step()
Heterogeneous Graphs
^^^^^^^^^^^^^^^^^^^^
Generating a frontier for a heterogeneous graph is no different
from generating one for a homogeneous graph. Just make the returned graph have the
same nodes as the original graph, and it should work fine. For example,
we can rewrite the ``MultiLayerDropoutSampler`` above to iterate over
all edge types, so that it can work on heterogeneous graphs as well.
.. code:: python

    class MultiLayerDropoutSampler(dgl.dataloading.BlockSampler):
        def __init__(self, p, n_layers):
            super().__init__(n_layers)
            self.p = p

        def sample_frontier(self, block_id, g, seed_nodes, *args, **kwargs):
            # Get all inbound edges to `seed_nodes`
            sg = dgl.in_subgraph(g, seed_nodes)

            new_edges_masks = {}
            # Iterate over all edge types
            for etype in sg.canonical_etypes:
                edge_mask = torch.zeros(sg.number_of_edges(etype))
                edge_mask.bernoulli_(self.p)
                new_edges_masks[etype] = edge_mask.bool()

            # Return a new graph with the same nodes as the original graph as a
            # frontier
            frontier = dgl.edge_subgraph(sg, new_edges_masks, preserve_nodes=True)
            return frontier
.. _guide-minibatch-edge-classification-sampler:
6.2 Training GNN for Edge Classification with Neighborhood Sampling
----------------------------------------------------------------------
Training for edge classification/regression is somewhat similar to that
of node classification/regression with several notable differences.
Define a neighborhood sampler and data loader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can use the
:ref:`same neighborhood samplers as node classification <guide-minibatch-node-classification-sampler>`.
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
To use the neighborhood sampler provided by DGL for edge classification,
one needs to instead combine it with
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`, which iterates
over a set of edges in minibatches, yielding the subgraph induced by the
edge minibatch as well as the ``blocks`` to be consumed by the module above.
For example, the following code creates a PyTorch DataLoader that
iterates over the training edge ID array ``train_eids`` in batches,
putting the list of generated blocks onto GPU.
.. code:: python
dataloader = dgl.dataloading.EdgeDataLoader(
g, train_eid_dict, sampler,
batch_size=1024,
shuffle=True,
drop_last=False,
num_workers=4)
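To see what the data loader yields, one can take a single minibatch out of
it; the printed numbers depend on the graph and are only illustrative.

.. code:: python

    # Take one minibatch to inspect its contents.
    input_nodes, edge_subgraph, blocks = next(iter(dataloader))
    print('edges in the minibatch:', edge_subgraph.number_of_edges())
    print('input nodes required:', len(input_nodes))
    print('number of blocks:', len(blocks))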
For a complete list of supported builtin samplers, please refer to the
:ref:`neighborhood sampler API reference <api-dataloading-neighbor-sampling>`.
If you wish to develop your own neighborhood sampler or you want a more
detailed explanation of the concept of blocks, please refer to
:ref:`guide-minibatch-customizing-neighborhood-sampler`.
Removing edges in the minibatch from the original graph for neighbor sampling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When training edge classification models, sometimes you wish to remove
the edges appearing in the training data from the computation dependency
as if they never existed. Otherwise, the model will know that
an edge exists between the two nodes, and potentially exploit that fact
to its advantage.
Therefore, in edge classification you sometimes would like to exclude the
edges sampled in the minibatch from the original graph for neighborhood
sampling, as well as the reverse edges of the sampled edges on an
undirected graph. You can specify ``exclude='reverse'`` when instantiating
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`, together with the mapping from edge
IDs to their reverse edge IDs. Doing so usually leads to a much slower
sampling process because it has to locate the reverse edges of the edges
in the minibatch and remove them.
.. code:: python
n_edges = g.number_of_edges()
dataloader = dgl.dataloading.EdgeDataLoader(
g, train_eid_dict, sampler,
# The following two arguments are specifically for excluding the minibatch
# edges and their reverse edges from the original graph for neighborhood
# sampling.
exclude='reverse',
reverse_eids=torch.cat([
torch.arange(n_edges // 2, n_edges), torch.arange(0, n_edges // 2)]),
batch_size=1024,
shuffle=True,
drop_last=False,
num_workers=4)
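The ``reverse_eids`` mapping above assumes that edge ``i`` and edge
``i + n_edges // 2`` are reverses of each other. Such a layout arises, for
instance, when an undirected graph is built by concatenating the forward and
backward edge lists; ``src`` and ``dst`` below are hypothetical edge endpoint
tensors.

.. code:: python

    # Hypothetical construction that yields the edge layout assumed above:
    # the first half of the edges are (src -> dst) and the second half are
    # their reverses (dst -> src).
    g = dgl.graph((torch.cat([src, dst]), torch.cat([dst, src])))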
Adapt your model for minibatch training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The edge classification model usually consists of two parts:
- One part that obtains the representation of incident nodes.
- The other part that computes the edge score from the incident node
representations.
The former part is exactly the same as
:ref:`that from node classification <guide-minibatch-node-classification-model>`
and we can simply reuse it. The input is still the list of
blocks generated from a data loader provided by DGL, as well as the
input features.
.. code:: python
class StochasticTwoLayerGCN(nn.Module):
def __init__(self, in_features, hidden_features, out_features):
super().__init__()
self.conv1 = dglnn.GraphConv(in_features, hidden_features)
self.conv2 = dglnn.GraphConv(hidden_features, out_features)
def forward(self, blocks, x):
x = F.relu(self.conv1(blocks[0], x))
x = F.relu(self.conv2(blocks[1], x))
return x
The input to the latter part is usually the output from the
former part, as well as the subgraph of the original graph induced by the
edges in the minibatch. The subgraph is yielded from the same data
loader. One can call :meth:`dgl.DGLHeteroGraph.apply_edges` to compute the
scores on the edges with the edge subgraph.
The following code shows an example of predicting scores on the edges by
concatenating the incident node features and projecting it with a dense
layer.
.. code:: python

    class ScorePredictor(nn.Module):
        def __init__(self, num_classes, in_features):
            super().__init__()
            self.W = nn.Linear(2 * in_features, num_classes)

        def apply_edges(self, edges):
            # Concatenate the source and destination node features of each edge.
            data = torch.cat([edges.src['x'], edges.dst['x']], 1)
            return {'score': self.W(data)}

        def forward(self, edge_subgraph, x):
            with edge_subgraph.local_scope():
                edge_subgraph.ndata['x'] = x
                edge_subgraph.apply_edges(self.apply_edges)
                return edge_subgraph.edata['score']
The entire model will take the list of blocks and the edge subgraph
generated by the data loader, as well as the input node features as
follows:
.. code:: python
class Model(nn.Module):
def __init__(self, in_features, hidden_features, out_features, num_classes):
super().__init__()
self.gcn = StochasticTwoLayerGCN(
in_features, hidden_features, out_features)
self.predictor = ScorePredictor(num_classes, out_features)
def forward(self, edge_subgraph, blocks, x):
x = self.gcn(blocks, x)
return self.predictor(edge_subgraph, x)
DGL ensures that the nodes in the edge subgraph are the same as the
output nodes of the last block in the generated list of blocks.
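One can check this correspondence on a minibatch by comparing the original
node IDs stored on both objects; this is only a sanity-check sketch, reusing
the ``dataloader`` defined above.

.. code:: python

    # The compacted edge subgraph and the last block should refer to the same
    # nodes in the same order, so their original node IDs should match.
    input_nodes, edge_subgraph, blocks = next(iter(dataloader))
    assert torch.equal(edge_subgraph.ndata[dgl.NID], blocks[-1].dstdata[dgl.NID])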
Training Loop
~~~~~~~~~~~~~
The training loop is very similar to node classification. You can
iterate over the dataloader and get a subgraph induced by the edges in
the minibatch, as well as the list of blocks necessary for computing
their incident node representations.
.. code:: python
model = Model(in_features, hidden_features, out_features, num_classes)
model = model.cuda()
opt = torch.optim.Adam(model.parameters())
for input_nodes, edge_subgraph, blocks in dataloader:
blocks = [b.to(torch.device('cuda')) for b in blocks]
edge_subgraph = edge_subgraph.to(torch.device('cuda'))
input_features = blocks[0].srcdata['features']
edge_labels = edge_subgraph.edata['labels']
edge_predictions = model(edge_subgraph, blocks, input_features)
loss = compute_loss(edge_labels, edge_predictions)
opt.zero_grad()
loss.backward()
opt.step()
For heterogeneous graphs
~~~~~~~~~~~~~~~~~~~~~~~~
The models computing the node representations on heterogeneous graphs
can also be used for computing incident node representations for edge
classification/regression.
.. code:: python
class StochasticTwoLayerRGCN(nn.Module):
def __init__(self, in_feat, hidden_feat, out_feat):
super().__init__()
self.conv1 = dglnn.HeteroGraphConv({
rel : dglnn.GraphConv(in_feat, hidden_feat, norm='right')
for rel in rel_names
})
self.conv2 = dglnn.HeteroGraphConv({
rel : dglnn.GraphConv(hidden_feat, out_feat, norm='right')
for rel in rel_names
})
def forward(self, blocks, x):
x = self.conv1(blocks[0], x)
x = self.conv2(blocks[1], x)
return x
For score prediction, the only implementation difference between the
homogeneous graph and the heterogeneous graph is that we are looping
over the edge types for :meth:`~dgl.DGLHeteroGraph.apply_edges`.
.. code:: python

    class ScorePredictor(nn.Module):
        def __init__(self, num_classes, in_features):
            super().__init__()
            self.W = nn.Linear(2 * in_features, num_classes)

        def apply_edges(self, edges):
            # Concatenate the source and destination node features of each edge.
            data = torch.cat([edges.src['x'], edges.dst['x']], 1)
            return {'score': self.W(data)}

        def forward(self, edge_subgraph, x):
            with edge_subgraph.local_scope():
                edge_subgraph.ndata['x'] = x
                for etype in edge_subgraph.canonical_etypes:
                    edge_subgraph.apply_edges(self.apply_edges, etype=etype)
                return edge_subgraph.edata['score']
Data loader definition is also very similar to that of node
classification. The only difference is that you need
:class:`~dgl.dataloading.pytorch.EdgeDataLoader` instead of
:class:`~dgl.dataloading.pytorch.NodeDataLoader`, and you will be supplying a
dictionary of edge types and edge ID tensors instead of a dictionary of
node types and node ID tensors.
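Here ``train_eid_dict`` is assumed to be such a dictionary; for instance, one
could take all edges of every edge type (in practice you would restrict it to
the training split).

.. code:: python

    # A hypothetical train_eid_dict using all edges of each edge type.
    train_eid_dict = {
        etype: g.edges(form='eid', etype=etype)
        for etype in g.canonical_etypes}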
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader = dgl.dataloading.EdgeDataLoader(
g, train_eid_dict, sampler,
batch_size=1024,
shuffle=True,
drop_last=False,
num_workers=4)
Things become a little different if you wish to exclude the reverse
edges on heterogeneous graphs. On heterogeneous graphs, reverse edges
usually have a different edge type from the edges themselves, in order
to differentiate the forward and backward relationships (e.g.
``follow`` and ``followed by`` are reverse relations of each other,
``purchase`` and ``purchased by`` are reverse relations of each other,
etc.).
If each edge in a type has a reverse edge with the same ID in another
type, you can specify the mapping between edge types and their reverse
types. The way to exclude the edges in the minibatch as well as their
reverse edges then goes as follows.
.. code:: python

    dataloader = dgl.dataloading.EdgeDataLoader(
        g, train_eid_dict, sampler,

        # The following two arguments are specifically for excluding the minibatch
        # edges and their reverse edges from the original graph for neighborhood
        # sampling.
        exclude='reverse_types',
        reverse_etypes={'follow': 'followed by', 'followed by': 'follow',
                        'purchase': 'purchased by', 'purchased by': 'purchase'},

        batch_size=1024,
        shuffle=True,
        drop_last=False,
        num_workers=4)
The training loop is again almost the same as that on homogeneous graphs,
except for the implementation of ``compute_loss``, which here takes in two
dictionaries mapping edge types to labels and predictions.
.. code:: python
model = Model(in_features, hidden_features, out_features, num_classes)
model = model.cuda()
opt = torch.optim.Adam(model.parameters())
for input_nodes, edge_subgraph, blocks in dataloader:
blocks = [b.to(torch.device('cuda')) for b in blocks]
edge_subgraph = edge_subgraph.to(torch.device('cuda'))
input_features = blocks[0].srcdata['features']
edge_labels = edge_subgraph.edata['labels']
edge_predictions = model(edge_subgraph, blocks, input_features)
loss = compute_loss(edge_labels, edge_predictions)
opt.zero_grad()
loss.backward()
opt.step()
`GCMC <https://github.com/dmlc/dgl/tree/master/examples/pytorch/gcmc>`__
is an example of edge classification on a bipartite graph.
.. _guide-minibatch-inference:
6.6 Exact Offline Inference on Large Graphs
------------------------------------------------------
Both subgraph sampling and neighborhood sampling reduce the
memory and time consumption of training GNNs with GPUs. When performing
inference, however, it is usually better to truly aggregate over all neighbors
to get rid of the randomness introduced by sampling. Since
full-graph forward propagation is usually infeasible on GPU due to
limited memory, and slow on CPU due to slow computation, this section
introduces a methodology for full-graph forward propagation with
limited GPU memory via minibatching and neighborhood sampling.
The inference algorithm is different from the training algorithm, as the
representations of all nodes should be computed layer by layer, starting
from the first layer. Specifically, for a particular layer, we need to
compute the output representations of all nodes from this GNN layer in
minibatches. The consequence is that the inference algorithm will have
an outer loop iterating over the layers, and an inner loop iterating
over the minibatches of nodes. In contrast, the training algorithm has
an outer loop iterating over the minibatches of nodes, and an inner loop
iterating over the layers for both neighborhood sampling and message
passing.
The following animation shows what the computation looks like (note
that for every layer only the first three minibatches are drawn).
.. figure:: https://i.imgur.com/rr1FG7S.gif
:alt: Imgur
Imgur
Implementing Offline Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Consider the two-layer GCN we have mentioned in Section 6.5.1. The way
to implement offline inference still involves using
:class:`~dgl.dataloading.neighbor.MultiLayerFullNeighborSampler`, but sampling for
only one layer at a time. Note that offline inference is implemented as
a method of the GNN module because the computation on one layer depends
on how messages are aggregated and combined as well.
.. code:: python

    class StochasticTwoLayerGCN(nn.Module):
        def __init__(self, in_features, hidden_features, out_features):
            super().__init__()
            self.hidden_features = hidden_features
            self.out_features = out_features
            self.conv1 = dgl.nn.GraphConv(in_features, hidden_features)
            self.conv2 = dgl.nn.GraphConv(hidden_features, out_features)
            self.n_layers = 2

        def forward(self, blocks, x):
            x_dst = x[:blocks[0].number_of_dst_nodes()]
            x = F.relu(self.conv1(blocks[0], (x, x_dst)))
            x_dst = x[:blocks[1].number_of_dst_nodes()]
            x = F.relu(self.conv2(blocks[1], (x, x_dst)))
            return x

        def inference(self, g, x, batch_size, device):
            """
            Offline inference with this module
            """
            # Compute representations layer by layer
            for l, layer in enumerate([self.conv1, self.conv2]):
                y = torch.zeros(g.number_of_nodes(),
                                self.hidden_features
                                if l != self.n_layers - 1
                                else self.out_features)
                sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1)
                dataloader = dgl.dataloading.NodeDataLoader(
                    g, torch.arange(g.number_of_nodes()), sampler,
                    batch_size=batch_size,
                    shuffle=True,
                    drop_last=False)

                # Within a layer, iterate over nodes in batches
                for input_nodes, output_nodes, blocks in dataloader:
                    block = blocks[0]

                    # Copy the features of the necessary input nodes to the device
                    h = x[input_nodes].to(device)
                    # Compute the output; the computation is the same as in the
                    # forward pass, but applied to only a single layer at a time.
                    h_dst = h[:block.number_of_dst_nodes()]
                    h = F.relu(layer(block, (h, h_dst)))
                    # Copy the output back to CPU.
                    y[output_nodes] = h.cpu()

                # The outputs of this layer become the inputs of the next one.
                x = y

            return y
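For instance, one could use this method to evaluate on a held-out node set
after training; ``val_nids``, ``labels``, and the feature name ``'features'``
are assumptions for illustration.

.. code:: python

    model.eval()
    with torch.no_grad():
        # Compute the final-layer representations of all nodes in minibatches.
        all_reprs = model.inference(g, g.ndata['features'], batch_size=1024,
                                    device=torch.device('cuda'))
        # Hypothetical accuracy computation on a validation node set.
        val_acc = (all_reprs[val_nids].argmax(1) == labels[val_nids]).float().mean()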
Note that for the purpose of computing evaluation metrics on the
validation set for model selection, we usually don’t have to compute
exact offline inference. The reason is that exact inference computes the
representation of every single node on every single layer, which is
usually very costly, especially in the semi-supervised regime with a lot
of unlabeled data. Neighborhood sampling will work fine for model
selection and validation.
One can see
`GraphSAGE <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling.py>`__
and
`RGCN <https://github.com/dmlc/dgl/blob/master/examples/pytorch/rgcn-hetero/entity_classify_mb.py>`__
for examples of offline inference.
.. _guide-minibatch-link-classification-sampler:
6.3 Training GNN for Link Prediction with Neighborhood Sampling
--------------------------------------------------------------------
Define a neighborhood sampler and data loader with negative sampling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can still use the same neighborhood sampler as the one in node/edge
classification.
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
:class:`~dgl.dataloading.pytorch.EdgeDataLoader` in DGL also
supports generating negative samples for link prediction. To do so, you
need to provide the negative sampling function.
:class:`~dgl.dataloading.negative_sampler.Uniform` is a
negative sampler that performs uniform sampling. For each source node of an edge, it
samples ``k`` negative destination nodes.
The following data loader will pick 5 negative destination nodes
uniformly for each source node of an edge.
.. code:: python
dataloader = dgl.dataloading.EdgeDataLoader(
g, train_seeds, sampler,
negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
batch_size=args.batch_size,
shuffle=True,
drop_last=False,
pin_memory=True,
num_workers=args.num_workers)
For the builtin negative samplers please see :ref:`api-dataloading-negative-sampling`.
You can also give your own negative sampler function, as long as it
takes in the original graph ``g`` and the minibatch edge ID array
``eid``, and returns a pair of source ID arrays and destination ID
arrays.
The following gives an example of a custom negative sampler that samples
negative destination nodes according to a probability distribution
proportional to a power of the node degrees.
.. code:: python
class NegativeSampler(object):
def __init__(self, g, k):
# caches the probability distribution
self.weights = g.in_degrees().float() ** 0.75
self.k = k
def __call__(self, g, eids):
src, _ = g.find_edges(eids)
src = src.repeat_interleave(self.k)
dst = self.weights.multinomial(len(src), replacement=True)
return src, dst
dataloader = dgl.dataloading.EdgeDataLoader(
g, train_seeds, sampler,
negative_sampler=NegativeSampler(g, 5),
batch_size=args.batch_size,
shuffle=True,
drop_last=False,
pin_memory=True,
num_workers=args.num_workers)
Adapt your model for minibatch training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As explained in :ref:`guide-training-link-prediction`, link prediction is trained
via comparing the score of an edge (positive example) against a
non-existent edge (negative example). To compute the scores of edges you
can reuse the node representation computation model you have seen in
edge classification/regression.
.. code:: python
class StochasticTwoLayerGCN(nn.Module):
def __init__(self, in_features, hidden_features, out_features):
super().__init__()
self.conv1 = dgl.nn.GraphConv(in_features, hidden_features)
self.conv2 = dgl.nn.GraphConv(hidden_features, out_features)
def forward(self, blocks, x):
x = F.relu(self.conv1(blocks[0], x))
x = F.relu(self.conv2(blocks[1], x))
return x
For score prediction, since you only need to predict a scalar score for
each edge instead of a probability distribution, this example shows how
to compute a score with a dot product of incident node representations.
.. code:: python
class ScorePredictor(nn.Module):
def forward(self, edge_subgraph, x):
with edge_subgraph.local_scope():
edge_subgraph.ndata['x'] = x
edge_subgraph.apply_edges(dgl.function.u_dot_v('x', 'x', 'score'))
return edge_subgraph.edata['score']
When a negative sampler is provided, DGL's data loader will generate
three items per minibatch:
- A positive graph containing all the edges sampled in the minibatch.
- A negative graph containing all the non-existent edges generated by
the negative sampler.
- A list of blocks generated by the neighborhood sampler.
One can then define the link prediction model as follows, taking in the
three items as well as the input features.
.. code:: python

    class Model(nn.Module):
        def __init__(self, in_features, hidden_features, out_features):
            super().__init__()
            self.gcn = StochasticTwoLayerGCN(
                in_features, hidden_features, out_features)
            self.predictor = ScorePredictor()

        def forward(self, positive_graph, negative_graph, blocks, x):
            x = self.gcn(blocks, x)
            pos_score = self.predictor(positive_graph, x)
            neg_score = self.predictor(negative_graph, x)
            return pos_score, neg_score
Training loop
~~~~~~~~~~~~~
The training loop simply involves iterating over the data loader and
feeding in the graphs as well as the input features to the model defined
above.
.. code:: python

    model = Model(in_features, hidden_features, out_features)
    model = model.cuda()
    opt = torch.optim.Adam(model.parameters())

    for input_nodes, positive_graph, negative_graph, blocks in dataloader:
        blocks = [b.to(torch.device('cuda')) for b in blocks]
        positive_graph = positive_graph.to(torch.device('cuda'))
        negative_graph = negative_graph.to(torch.device('cuda'))
        input_features = blocks[0].srcdata['features']
        pos_score, neg_score = model(positive_graph, negative_graph, blocks, input_features)
        loss = compute_loss(pos_score, neg_score)
        opt.zero_grad()
        loss.backward()
        opt.step()
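``compute_loss`` is not defined in this guide; a minimal margin-based ranking
loss could look like the following sketch, assuming ``k`` negative samples per
positive edge and scores of shape ``(#edges, 1)``.

.. code:: python

    def compute_loss(pos_score, neg_score):
        # A margin-based loss that encourages positive edges to score higher
        # than their negative counterparts by at least 1.
        n = pos_score.shape[0]
        return (neg_score.view(n, -1) - pos_score.view(n, -1) + 1).clamp(min=0).mean()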
DGL provides an
`unsupervised learning GraphSAGE example <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling_unsupervised.py>`__
that shows link prediction on homogeneous graphs.
For heterogeneous graphs
~~~~~~~~~~~~~~~~~~~~~~~~
The models computing the node representations on heterogeneous graphs
can also be used for computing incident node representations for edge
classification/regression.
.. code:: python
class StochasticTwoLayerRGCN(nn.Module):
def __init__(self, in_feat, hidden_feat, out_feat):
super().__init__()
self.conv1 = dglnn.HeteroGraphConv({
rel : dglnn.GraphConv(in_feat, hidden_feat, norm='right')
for rel in rel_names
})
self.conv2 = dglnn.HeteroGraphConv({
rel : dglnn.GraphConv(hidden_feat, out_feat, norm='right')
for rel in rel_names
})
def forward(self, blocks, x):
x = self.conv1(blocks[0], x)
x = self.conv2(blocks[1], x)
return x
For score prediction, the only implementation difference between the
homogeneous graph and the heterogeneous graph is that we are looping
over the edge types for :meth:`dgl.DGLHeteroGraph.apply_edges`.
.. code:: python
class ScorePredictor(nn.Module):
def forward(self, edge_subgraph, x):
with edge_subgraph.local_scope():
edge_subgraph.ndata['x'] = x
for etype in edge_subgraph.canonical_etypes:
edge_subgraph.apply_edges(
dgl.function.u_dot_v('x', 'x', 'score'), etype=etype)
return edge_subgraph.edata['score']
Data loader definition is also very similar to that of edge
classification/regression. The only difference is that you need to give
the negative sampler and you will be supplying a dictionary of edge
types and edge ID tensors instead of a dictionary of node types and node
ID tensors.
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader = dgl.dataloading.EdgeDataLoader(
g, train_eid_dict, sampler,
negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
batch_size=1024,
shuffle=True,
drop_last=False,
num_workers=4)
If you want to give your own negative sampling function, the function
should take in the original graph and the dictionary of edge types and
edge ID tensors. It should return a dictionary of edge types and
source-destination array pairs. An example is given as follows:
.. code:: python

    class NegativeSampler(object):
        def __init__(self, g, k):
            # caches the probability distribution per edge type
            self.weights = {
                etype: g.in_degrees(etype=etype).float() ** 0.75
                for etype in g.canonical_etypes}
            self.k = k

        def __call__(self, g, eids_dict):
            result_dict = {}
            for etype, eids in eids_dict.items():
                src, _ = g.find_edges(eids, etype=etype)
                src = src.repeat_interleave(self.k)
                dst = self.weights[etype].multinomial(len(src), replacement=True)
                result_dict[etype] = (src, dst)
            return result_dict

    dataloader = dgl.dataloading.EdgeDataLoader(
        g, train_eid_dict, sampler,
        negative_sampler=NegativeSampler(g, 5),
        batch_size=1024,
        shuffle=True,
        drop_last=False,
        num_workers=4)
The training loop is again almost the same as that on homogeneous graphs,
except for the implementation of ``compute_loss``, which here takes in two
dictionaries mapping edge types to positive and negative scores.
.. code:: python

    model = Model(in_features, hidden_features, out_features)
    model = model.cuda()
    opt = torch.optim.Adam(model.parameters())

    for input_nodes, positive_graph, negative_graph, blocks in dataloader:
        blocks = [b.to(torch.device('cuda')) for b in blocks]
        positive_graph = positive_graph.to(torch.device('cuda'))
        negative_graph = negative_graph.to(torch.device('cuda'))
        input_features = blocks[0].srcdata['features']
        pos_score, neg_score = model(positive_graph, negative_graph, blocks, input_features)
        loss = compute_loss(pos_score, neg_score)
        opt.zero_grad()
        loss.backward()
        opt.step()
.. _guide-minibatch-custom-gnn-module:
6.5 Implementing Custom GNN Module for Mini-batch Training
-------------------------------------------------------------
If you are familiar with how to write a custom GNN module for updating
the entire graph, whether homogeneous or heterogeneous (see
:ref:`guide-nn`), the code for computing on
blocks is similar, with the exception that the nodes are divided into
input nodes and output nodes.

For example, consider the following custom graph convolution module
code. Note that it is not necessarily the most efficient implementation;
it only serves as an example of what a custom GNN module could look
like.
.. code:: python
class CustomGraphConv(nn.Module):
def __init__(self, in_feats, out_feats):
super().__init__()
self.W = nn.Linear(in_feats * 2, out_feats)
def forward(self, g, h):
with g.local_scope():
g.ndata['h'] = h
g.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h_neigh'))
return self.W(torch.cat([g.ndata['h'], g.ndata['h_neigh']], 1))
If you have a custom message passing NN module for the full graph, and
you would like to make it work for blocks, you only need to rewrite the
forward function as follows. Note that the corresponding statements from
the full-graph implementation are commented; you can compare the
original statements with the new statements.
.. code:: python
class CustomGraphConv(nn.Module):
def __init__(self, in_feats, out_feats):
super().__init__()
self.W = nn.Linear(in_feats * 2, out_feats)
# h is now a pair of feature tensors for input and output nodes, instead of
# a single feature tensor.
# def forward(self, g, h):
def forward(self, block, h):
# with g.local_scope():
with block.local_scope():
# g.ndata['h'] = h
h_src = h
h_dst = h[:block.number_of_dst_nodes()]
block.srcdata['h'] = h_src
block.dstdata['h'] = h_dst
# g.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h_neigh'))
block.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h_neigh'))
# return self.W(torch.cat([g.ndata['h'], g.ndata['h_neigh']], 1))
return self.W(torch.cat(
[block.dstdata['h'], block.dstdata['h_neigh']], 1))
In general, you need to do the following to make your NN module work for
blocks.
- Obtain the features for output nodes from the input features by
slicing the first few rows. The number of rows can be obtained by
:meth:`block.number_of_dst_nodes <dgl.DGLHeteroGraph.number_of_dst_nodes>`.
- Replace
:attr:`g.ndata <dgl.DGLHeteroGraph.ndata>` with either
:attr:`block.srcdata <dgl.DGLHeteroGraph.srcdata>` for features on input nodes or
:attr:`block.dstdata <dgl.DGLHeteroGraph.dstdata>` for features on output nodes, if
the original graph has only one node type.
- Replace
:attr:`g.nodes <dgl.DGLHeteroGraph.nodes>` with either
:attr:`block.srcnodes <dgl.DGLHeteroGraph.srcnodes>` for features on input nodes or
:attr:`block.dstnodes <dgl.DGLHeteroGraph.dstnodes>` for features on output nodes,
if the original graph has multiple node types.
- Replace
:meth:`g.number_of_nodes <dgl.DGLHeteroGraph.number_of_nodes>` with either
:meth:`block.number_of_src_nodes <dgl.DGLHeteroGraph.number_of_src_nodes>` or
:meth:`block.number_of_dst_nodes <dgl.DGLHeteroGraph.number_of_dst_nodes>` for the number of
input nodes or output nodes respectively.
Heterogeneous graphs
~~~~~~~~~~~~~~~~~~~~
For heterogeneous graphs, the way of writing custom GNN modules is
similar. For instance, consider the following module that works on the full
graph.
.. code:: python

    class CustomHeteroGraphConv(nn.Module):
        def __init__(self, g, in_feats, out_feats):
            super().__init__()
            self.Ws = nn.ModuleDict()
            for etype in g.canonical_etypes:
                utype, _, vtype = etype
                # nn.ModuleDict requires string keys, hence str(etype).
                self.Ws[str(etype)] = nn.Linear(in_feats[utype], out_feats[vtype])
            self.Vs = nn.ModuleDict()
            for ntype in g.ntypes:
                self.Vs[ntype] = nn.Linear(in_feats[ntype], out_feats[ntype])

        def forward(self, g, h):
            with g.local_scope():
                for ntype in g.ntypes:
                    g.nodes[ntype].data['h_dst'] = self.Vs[ntype](h[ntype])
                    g.nodes[ntype].data['h_src'] = h[ntype]
                for etype in g.canonical_etypes:
                    utype, _, vtype = etype
                    g.update_all(
                        fn.copy_u('h_src', 'm'), fn.mean('m', 'h_neigh'),
                        etype=etype)
                    g.nodes[vtype].data['h_dst'] = g.nodes[vtype].data['h_dst'] + \
                        self.Ws[str(etype)](g.nodes[vtype].data['h_neigh'])
                return {ntype: g.nodes[ntype].data['h_dst'] for ntype in g.ntypes}
For ``CustomHeteroGraphConv``, the principle is to replace ``g.nodes``
with ``g.srcnodes`` or ``g.dstnodes`` depending on whether the features
serve as input or output.
.. code:: python

    class CustomHeteroGraphConv(nn.Module):
        def __init__(self, g, in_feats, out_feats):
            super().__init__()
            self.Ws = nn.ModuleDict()
            for etype in g.canonical_etypes:
                utype, _, vtype = etype
                # nn.ModuleDict requires string keys, hence str(etype).
                self.Ws[str(etype)] = nn.Linear(in_feats[utype], out_feats[vtype])
            self.Vs = nn.ModuleDict()
            for ntype in g.ntypes:
                self.Vs[ntype] = nn.Linear(in_feats[ntype], out_feats[ntype])

        def forward(self, g, h):
            with g.local_scope():
                for ntype in g.ntypes:
                    h_src, h_dst = h[ntype]
                    g.dstnodes[ntype].data['h_dst'] = self.Vs[ntype](h_dst)
                    g.srcnodes[ntype].data['h_src'] = h_src
                for etype in g.canonical_etypes:
                    utype, _, vtype = etype
                    g.update_all(
                        fn.copy_u('h_src', 'm'), fn.mean('m', 'h_neigh'),
                        etype=etype)
                    g.dstnodes[vtype].data['h_dst'] = \
                        g.dstnodes[vtype].data['h_dst'] + \
                        self.Ws[str(etype)](g.dstnodes[vtype].data['h_neigh'])
                return {ntype: g.dstnodes[ntype].data['h_dst']
                        for ntype in g.ntypes}
Writing modules that work on homogeneous graphs, bipartite graphs, and blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All message passing modules in DGL work on homogeneous graphs,
unidirectional bipartite graphs (that have two node types and one edge
type), and a block with one edge type. Essentially, the input graph and
feature of a builtin DGL neural network module must satisfy one of
the following cases.
- If the input feature is a pair of tensors, then the input graph must
be unidirectional bipartite.
- If the input feature is a single tensor and the input graph is a
block, DGL will automatically set the feature on the output nodes as
the first few rows of the input node features.
- If the input feature is a single tensor and the input graph is
  not a block, then the input graph must be homogeneous.
For example, the following is simplified from the PyTorch implementation
of :class:`dgl.nn.pytorch.SAGEConv` (also available in MXNet and Tensorflow)
(removing normalization and dealing with only mean aggregation etc.).
.. code:: python

    import dgl.function as fn

    class SAGEConv(nn.Module):
        def __init__(self, in_feats, out_feats):
            super().__init__()
            self.W = nn.Linear(in_feats * 2, out_feats)

        def forward(self, g, h):
            if isinstance(h, tuple):
                h_src, h_dst = h
            elif g.is_block:
                h_src = h
                h_dst = h[:g.number_of_dst_nodes()]
            else:
                h_src = h_dst = h

            g.srcdata['h'] = h_src
            g.dstdata['h'] = h_dst
            g.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h_neigh'))
            return F.relu(
                self.W(torch.cat([g.dstdata['h'], g.dstdata['h_neigh']], 1)))
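The same module can then be called with a homogeneous graph and a single
feature tensor, or with a block and a single tensor. The sketch below uses
hypothetical graph and feature objects (``g`` with a 5-dimensional node
feature ``'x'``, and ``blocks``/``input_features`` from a data loader).

.. code:: python

    conv = SAGEConv(5, 5)
    # Full homogeneous graph with one feature tensor per node.
    h_full = conv(g, g.ndata['x'])
    # A block with a single tensor; the module slices out the output-node rows.
    h_block = conv(blocks[0], input_features)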
:ref:`guide-nn` also provides a walkthrough on :class:`dgl.nn.pytorch.SAGEConv`,
which works on unidirectional bipartite graphs, homogeneous graphs, and blocks.
.. _guide-minibatch-node-classification-sampler:
6.1 Training GNN for Node Classification with Neighborhood Sampling
-----------------------------------------------------------------------
To train your model stochastically, you need to do the following:
- Define a neighborhood sampler.
- Adapt your model for minibatch training.
- Modify your training loop.
The following sub-subsections address these steps one by one.
Define a neighborhood sampler and data loader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DGL provides several neighborhood sampler classes that generate the
computation dependencies needed for each layer given the nodes we wish
to compute on.

The simplest neighborhood sampler is
:class:`~dgl.dataloading.neighbor.MultiLayerFullNeighborSampler`,
which makes each node gather messages from all of its neighbors.

To use a sampler provided by DGL, one also needs to combine it with
:class:`~dgl.dataloading.pytorch.NodeDataLoader`, which iterates
over a set of nodes in minibatches.
For example, the following code creates a PyTorch DataLoader that
iterates over the training node ID array ``train_nids`` in batches,
putting the list of generated blocks onto GPU.
.. code:: python
import dgl
import dgl.nn as dglnn
import torch
import torch.nn as nn
import torch.nn.functional as F
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader = dgl.dataloading.NodeDataLoader(
g, train_nids, sampler,
batch_size=1024,
shuffle=True,
drop_last=False,
num_workers=4)
Iterating over the DataLoader will yield a list of specially created
graphs representing the computation dependencies on each layer. They are
called *blocks* in DGL.
.. code:: python
input_nodes, output_nodes, blocks = next(iter(dataloader))
print(blocks)
The iterator generates three items at a time. ``input_nodes`` describes
the nodes needed to compute the representations of ``output_nodes``.
``blocks`` describes, for each GNN layer, which node representations are to
be computed as output, which node representations are needed as input,
and how the representations of the input nodes propagate to the output
nodes.
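For example, one can inspect how many input and output nodes each block
involves; the exact numbers depend on the sampled minibatch.

.. code:: python

    for i, block in enumerate(blocks):
        print('Layer {}: {} input nodes -> {} output nodes'.format(
            i, block.number_of_src_nodes(), block.number_of_dst_nodes()))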
For a complete list of supported builtin samplers, please refer to the
:ref:`neighborhood sampler API reference <api-dataloading-neighbor-sampling>`.
If you wish to develop your own neighborhood sampler or you want a more
detailed explanation of the concept of blocks, please refer to
:ref:`guide-minibatch-customizing-neighborhood-sampler`.
.. _guide-minibatch-node-classification-model:
Adapt your model for minibatch training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If your message passing modules are all provided by DGL, the changes
required to adapt your model to minibatch training are minimal. Take a
multi-layer GCN as an example. If your model on the full graph is
implemented as follows:
.. code:: python
class TwoLayerGCN(nn.Module):
def __init__(self, in_features, hidden_features, out_features):
super().__init__()
self.conv1 = dglnn.GraphConv(in_features, hidden_features)
self.conv2 = dglnn.GraphConv(hidden_features, out_features)
def forward(self, g, x):
x = F.relu(self.conv1(g, x))
x = F.relu(self.conv2(g, x))
return x
Then all you need to do is replace ``g`` with the ``blocks`` generated above.
.. code:: python
class StochasticTwoLayerGCN(nn.Module):
def __init__(self, in_features, hidden_features, out_features):
super().__init__()
self.conv1 = dgl.nn.GraphConv(in_features, hidden_features)
self.conv2 = dgl.nn.GraphConv(hidden_features, out_features)
def forward(self, blocks, x):
x = F.relu(self.conv1(blocks[0], x))
x = F.relu(self.conv2(blocks[1], x))
return x
The DGL ``GraphConv`` modules above accept an element of the ``blocks`` list
generated by the data loader as an argument.

:ref:`The API reference of each NN module <apinn>` will tell you
whether it supports accepting a block as an argument.
If you wish to use your own message passing module, please refer to
:ref:`guide-minibatch-custom-gnn-module`.
Training Loop
~~~~~~~~~~~~~
The training loop simply consists of iterating over the dataset with the
customized batching iterator. During each iteration that yields a list
of blocks, we:
1. Load the node features corresponding to the input nodes onto the GPU. The
   node features can be stored in either memory or external storage.
   Note that we only need to load the input nodes' features, as opposed
   to loading the features of all nodes as in full graph training.

   If the features are stored in ``g.ndata``, then they can be loaded
   by accessing ``blocks[0].srcdata``, the features of the input nodes of
   the first block, which are all the nodes needed for computing the
   final representations.

2. Feed the list of blocks and the input node features to the multilayer
   GNN and get the outputs.

3. Load the node labels corresponding to the output nodes onto the GPU.
   Similarly, the node labels can be stored in either memory or external
   storage. Again, note that we only need to load the output nodes'
   labels, as opposed to loading the labels of all nodes as in full graph
   training.

   If the labels are stored in ``g.ndata``, then they can be loaded
   by accessing ``blocks[-1].dstdata``, the labels of the output nodes of
   the last block, which are exactly the nodes whose final representations
   we wish to compute.

4. Compute the loss and backpropagate.
4. Compute the loss and backpropagate.
.. code:: python
model = StochasticTwoLayerGCN(in_features, hidden_features, out_features)
model = model.cuda()
opt = torch.optim.Adam(model.parameters())
for input_nodes, output_nodes, blocks in dataloader:
blocks = [b.to(torch.device('cuda')) for b in blocks]
input_features = blocks[0].srcdata['features']
output_labels = blocks[-1].dstdata['label']
output_predictions = model(blocks, input_features)
loss = compute_loss(output_labels, output_predictions)
opt.zero_grad()
loss.backward()
opt.step()
DGL provides an end-to-end stochastic training example `GraphSAGE
implementation <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling.py>`__.
For heterogeneous graphs
~~~~~~~~~~~~~~~~~~~~~~~~
Training a graph neural network for node classification on a heterogeneous
graph is similar.
For instance, we have previously seen
:ref:`how to train a 2-layer RGCN on full graph <guide-training-rgcn-node-classification>`.
The code for RGCN implementation on minibatch training looks very
similar to that (with self-loops, non-linearity and basis decomposition
removed for simplicity):
.. code:: python
class StochasticTwoLayerRGCN(nn.Module):
def __init__(self, in_feat, hidden_feat, out_feat):
super().__init__()
self.conv1 = dglnn.HeteroGraphConv({
rel : dglnn.GraphConv(in_feat, hidden_feat, norm='right')
for rel in rel_names
})
self.conv2 = dglnn.HeteroGraphConv({
rel : dglnn.GraphConv(hidden_feat, out_feat, norm='right')
for rel in rel_names
})
def forward(self, blocks, x):
x = self.conv1(blocks[0], x)
x = self.conv2(blocks[1], x)
return x
Some of the samplers provided by DGL also support heterogeneous graphs.
For example, one can still use the provided
:class:`~dgl.dataloading.neighbor.MultiLayerFullNeighborSampler` class and
:class:`~dgl.dataloading.pytorch.NodeDataLoader` class for
stochastic training. For full-neighbor sampling, the only difference
would be that you would specify a dictionary of node
types and node IDs for the training set.
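``train_nid_dict`` is assumed to be such a dictionary; for example, it could
be built from a hypothetical boolean ``train_mask`` stored on each node type.

.. code:: python

    # A hypothetical train_nid_dict built from per-node-type training masks.
    train_nid_dict = {
        ntype: torch.nonzero(g.nodes[ntype].data['train_mask'], as_tuple=True)[0]
        for ntype in g.ntypes}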
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader = dgl.dataloading.NodeDataLoader(
g, train_nid_dict, sampler,
batch_size=1024,
shuffle=True,
drop_last=False,
num_workers=4)
The training loop is almost the same as that of homogeneous graphs,
except for the implementation of ``compute_loss`` that will take in two
dictionaries of node types and predictions here.
.. code:: python
model = StochasticTwoLayerRGCN(in_features, hidden_features, out_features)
model = model.cuda()
opt = torch.optim.Adam(model.parameters())
for input_nodes, output_nodes, blocks in dataloader:
blocks = [b.to(torch.device('cuda')) for b in blocks]
input_features = blocks[0].srcdata # returns a dict
output_labels = blocks[-1].dstdata # returns a dict
output_predictions = model(blocks, input_features)
loss = compute_loss(output_labels, output_predictions)
opt.zero_grad()
loss.backward()
opt.step()
DGL provides an end-to-end stochastic training example `RGCN
implementation <https://github.com/dmlc/dgl/blob/master/examples/pytorch/rgcn-hetero/entity_classify_mb.py>`__.
.. _guide-nn:
Building DGL NN Module
======================
Chapter 3: Building GNN Modules
=====================================
DGL NN module is the building block for your GNN model. It inherits
from `Pytorch’s NN Module <https://pytorch.org/docs/1.2.0/_modules/torch/nn/modules/module.html>`__, `MXNet Gluon’s NN Block <http://mxnet.incubator.apache.org/versions/1.6/api/python/docs/api/gluon/nn/index.html>`__ and `TensorFlow’s Keras
Layer <https://www.tensorflow.org/api_docs/python/tf/keras/layers>`__, depending on the DNN framework backend in use. In DGL NN
module, the parameter registration in the construction function and the tensor
operations in the forward function are the same as in the backend framework.
In this way, DGL code can be seamlessly integrated into the backend
......
.. _guide-training-edge-classification:
5.2 Edge Classification/Regression
---------------------------------------------
Sometimes you wish to predict the attributes on the edges of the graph,
or even whether an edge exists or not between two given nodes. In that
case, you would like to have an *edge classification/regression* model.
Here we generate a random graph for edge prediction as a demonstration.
.. code:: ipython3
src = np.random.randint(0, 100, 500)
dst = np.random.randint(0, 100, 500)
# make it symmetric
edge_pred_graph = dgl.graph((np.concatenate([src, dst]), np.concatenate([dst, src])))
# synthetic node and edge features, as well as edge labels
edge_pred_graph.ndata['feature'] = torch.randn(100, 10)
edge_pred_graph.edata['feature'] = torch.randn(1000, 10)
edge_pred_graph.edata['label'] = torch.randn(1000)
    # synthetic train-validation-test splits (only a training mask is generated here)
edge_pred_graph.edata['train_mask'] = torch.zeros(1000, dtype=torch.bool).bernoulli(0.6)
Overview
~~~~~~~~
From the previous section you have learned how to do node classification
with a multilayer GNN. The same technique can be applied for computing a
hidden representation of any node. The prediction on edges can then be
derived from the representation of their incident nodes.
The most common case of computing the prediction on an edge is to
express it as a parameterized function of the representation of its
incident nodes, and optionally the features on the edge itself.
Model Implementation Difference from Node Classification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Assuming that you compute the node representation with the model from
the previous section, you only need to write another component that
computes the edge prediction with the
:meth:`~dgl.DGLHeteroGraph.apply_edges` method.
For instance, if you would like to compute a score for each edge for
edge regression, the following code computes the dot product of incident
node representations on each edge.
.. code:: python
import dgl.function as fn
class DotProductPredictor(nn.Module):
def forward(self, graph, h):
# h contains the node representations computed from the GNN above.
with graph.local_scope():
graph.ndata['h'] = h
graph.apply_edges(fn.u_dot_v('h', 'h', 'score'))
return graph.edata['score']
One can also write a prediction function that predicts a vector for each
edge with an MLP. Such a vector can be used in further downstream tasks,
e.g. as logits of a categorical distribution.
.. code:: python
class MLPPredictor(nn.Module):
def __init__(self, in_features, out_classes):
super().__init__()
self.W = nn.Linear(in_features * 2, out_classes)
def apply_edges(self, edges):
h_u = edges.src['h']
h_v = edges.dst['h']
score = self.W(torch.cat([h_u, h_v], 1))
return {'score': score}
def forward(self, graph, h):
# h contains the node representations computed from the GNN above.
with graph.local_scope():
graph.ndata['h'] = h
graph.apply_edges(self.apply_edges)
return graph.edata['score']
Training loop
~~~~~~~~~~~~~
Given the node representation computation model and an edge predictor
model, we can easily write a full-graph training loop where we compute
the prediction on all edges.
The following example takes ``SAGE`` in the previous section as the node
representation computation model and ``DotProductPredictor`` as the edge
predictor model.
.. code:: python
class Model(nn.Module):
def __init__(self, in_features, hidden_features, out_features):
super().__init__()
self.sage = SAGE(in_features, hidden_features, out_features)
self.pred = DotProductPredictor()
def forward(self, g, x):
h = self.sage(g, x)
return self.pred(g, h)
In this example, we also assume that the training/validation/test edge
sets are identified by boolean masks on edges. This example also does
not include early stopping and model saving.
.. code:: python
node_features = edge_pred_graph.ndata['feature']
edge_label = edge_pred_graph.edata['label']
train_mask = edge_pred_graph.edata['train_mask']
model = Model(10, 20, 5)
opt = torch.optim.Adam(model.parameters())
for epoch in range(10):
pred = model(edge_pred_graph, node_features)
loss = ((pred[train_mask] - edge_label[train_mask]) ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
Heterogeneous graph
~~~~~~~~~~~~~~~~~~~
Edge classification on heterogeneous graphs is not very different from
that on homogeneous graphs. If you wish to perform edge classification
on one edge type, you only need to compute the node representations for
all node types, and predict on that edge type with the
:meth:`~dgl.DGLHeteroGraph.apply_edges` method.
For example, to make ``DotProductPredictor`` work on one edge type of a
heterogeneous graph, you only need to specify the edge type in the
``apply_edges`` call.
.. code:: python
class HeteroDotProductPredictor(nn.Module):
def forward(self, graph, h, etype):
# h contains the node representations for each edge type computed from
# the GNN above.
with graph.local_scope():
graph.ndata['h'] = h # assigns 'h' of all node types in one shot
graph.apply_edges(fn.u_dot_v('h', 'h', 'score'), etype=etype)
return graph.edges[etype].data['score']
Similarly, ``MLPPredictor`` can be adapted to work on one edge type by passing ``etype`` to ``apply_edges``:
.. code:: python
class MLPPredictor(nn.Module):
def __init__(self, in_features, out_classes):
super().__init__()
self.W = nn.Linear(in_features * 2, out_classes)
def apply_edges(self, edges):
h_u = edges.src['h']
h_v = edges.dst['h']
score = self.W(torch.cat([h_u, h_v], 1))
return {'score': score}
def forward(self, graph, h, etype):
# h contains the node representations computed from the GNN above.
with graph.local_scope():
graph.ndata['h'] = h # assigns 'h' of all node types in one shot
graph.apply_edges(self.apply_edges, etype=etype)
return graph.edges[etype].data['score']
The end-to-end model that predicts a score for each edge on a single
edge type will look like this:
.. code:: python
class Model(nn.Module):
def __init__(self, in_features, hidden_features, out_features, rel_names):
super().__init__()
self.sage = RGCN(in_features, hidden_features, out_features, rel_names)
self.pred = HeteroDotProductPredictor()
def forward(self, g, x, etype):
h = self.sage(g, x)
return self.pred(g, h, etype)
Using the model simply involves feeding the model a dictionary of node
types and features.
.. code:: python
model = Model(10, 20, 5, hetero_graph.etypes)
user_feats = hetero_graph.nodes['user'].data['feature']
item_feats = hetero_graph.nodes['item'].data['feature']
label = hetero_graph.edges['click'].data['label']
train_mask = hetero_graph.edges['click'].data['train_mask']
node_features = {'user': user_feats, 'item': item_feats}
Then the training loop looks almost the same as that for homogeneous
graphs. For instance, if you wish to predict the edge labels on edge type
``click``, then you can simply do
.. code:: python
opt = torch.optim.Adam(model.parameters())
for epoch in range(10):
pred = model(hetero_graph, node_features, 'click')
loss = ((pred[train_mask] - label[train_mask]) ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
Predicting Edge Type of an Existing Edge on a Heterogeneous Graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sometimes you may want to predict which type an existing edge belongs
to.
For instance, given the heterogeneous graph above, your task is, given
an edge connecting a user and an item, to predict whether the user would
``click`` or ``dislike`` the item.
This is a simplified version of rating prediction, which is common in
recommendation literature.
You can use a heterogeneous graph convolution network to obtain the node
representations. For instance, you can still use the RGCN above for this
purpose.
To predict the type of an edge, you can simply repurpose the
``HeteroDotProductPredictor`` above so that it takes in another graph
with only one edge type that merges all the edge types to be
predicted, and emits the score of each type for every edge.
In the example here, you will need a graph that has two node types
``user`` and ``item``, and a single edge type that merges all the
edge types from ``user`` to ``item``, i.e. ``click`` and ``dislike``.
This can be conveniently created using
:meth:`relation slicing <dgl.DGLHeteroGraph.__getitem__>`.
.. code:: python
dec_graph = hetero_graph['user', :, 'item']
Since the statement above also returns the original edge types as a
feature named ``dgl.ETYPE``, we can use that as labels.
.. code:: python
edge_label = dec_graph.edata[dgl.ETYPE]
Given the graph above as input to the edge type predictor module, you
can write your predictor module as follows.
.. code:: python
class HeteroMLPPredictor(nn.Module):
def __init__(self, in_dims, n_classes):
super().__init__()
self.W = nn.Linear(in_dims * 2, n_classes)
def apply_edges(self, edges):
x = torch.cat([edges.src['h'], edges.dst['h']], 1)
y = self.W(x)
return {'score': y}
def forward(self, graph, h):
# h contains the node representations for each edge type computed from
# the GNN above.
with graph.local_scope():
graph.ndata['h'] = h # assigns 'h' of all node types in one shot
graph.apply_edges(self.apply_edges)
return graph.edata['score']
The model that combines the node representation module and the edge type
predictor module is the following:
.. code:: python
class Model(nn.Module):
def __init__(self, in_features, hidden_features, out_features, rel_names):
super().__init__()
self.sage = RGCN(in_features, hidden_features, out_features, rel_names)
self.pred = HeteroMLPPredictor(out_features, len(rel_names))
def forward(self, g, x, dec_graph):
h = self.sage(g, x)
return self.pred(dec_graph, h)
The training loop is then simply the following:
.. code:: python
model = Model(10, 20, 5, hetero_graph.etypes)
user_feats = hetero_graph.nodes['user'].data['feature']
item_feats = hetero_graph.nodes['item'].data['feature']
node_features = {'user': user_feats, 'item': item_feats}
opt = torch.optim.Adam(model.parameters())
for epoch in range(10):
logits = model(hetero_graph, node_features, dec_graph)
loss = F.cross_entropy(logits, edge_label)
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
DGL provides `Graph Convolutional Matrix
Completion <https://github.com/dmlc/dgl/tree/master/examples/pytorch/gcmc>`__
as an example of rating prediction, which is formulated by predicting
the type of an existing edge on a heterogeneous graph. The node
representation module in the `model implementation
file <https://github.com/dmlc/dgl/tree/master/examples/pytorch/gcmc>`__
is called ``GCMCLayer``. The edge type predictor module is called
``BiDecoder``. Both of them are more complicated than the setting
described here.
.. _guide-training-graph-classification:
5.4 Graph Classification
----------------------------------
Instead of a single big graph, sometimes we might have the data in the
form of multiple graphs, for example a list of different types of
communities of people. By characterizing the friendships among people in
the same community by a graph, we get a list of graphs to classify. In
this scenario, a graph classification model could help identify the type
of the community, i.e. to classify each graph based on the structure and
overall information.
Overview
~~~~~~~~
The major difference between graph classification and node
classification or link prediction is that the prediction result
characterizes the property of the entire input graph. We perform
message passing over nodes/edges just like in the previous tasks, but we
also need to retrieve a graph-level representation.
The graph classification proceeds as follows:
.. figure:: https://data.dgl.ai/tutorial/batch/graph_classifier.png
:alt: Graph Classification Process
Graph Classification Process
From left to right, the common practice is:
- Batch the graphs into a batched graph
- Perform message passing on the batched graph to update node/edge features
- Aggregate node/edge features into graph-level representations
- Classify graphs with a classification head
Batch of Graphs
^^^^^^^^^^^^^^^
Usually a graph classification task trains on a lot of graphs, and it
will be very inefficient if we use only one graph at a time when
training the model. Borrowing the idea of mini-batch training from
common deep learning practice, we can build a batch of multiple graphs
and send them together for one training iteration.
In DGL, we can build a single batched graph from a list of graphs. This
batched graph can simply be used as a single large graph, whose separate
connected components correspond to the original small graphs.
.. figure:: https://data.dgl.ai/tutorial/batch/batch.png
:alt: Batched Graph
Batched Graph
Graph Readout
^^^^^^^^^^^^^
Every graph in the data may have its unique structure, as well as its
node and edge features. In order to make a single prediction, we usually
aggregate and summarize over the possibly abundant information. This
type of operation is named *Readout*. Common aggregations include
summation, average, maximum or minimum over all node or edge features.
Given a graph :math:`g`, we can define the average readout aggregation
as
.. math:: h_g = \frac{1}{|\mathcal{V}|}\sum_{v\in \mathcal{V}}h_v
In DGL the corresponding function call is :func:`dgl.readout_nodes`.
Once :math:`h_g` is available, we can pass it through an MLP layer for
classification output.
Writing neural network model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The input to the model is the batched graph with node and edge features.
One thing to note is that the node and edge features in the batched
graph have no batch dimension. The model therefore needs a little
special care:
Computation on a batched graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Next, we discuss the computational properties of a batched graph.
First, different graphs in a batch are entirely separated, i.e. there
are no edges connecting any two graphs. With this nice property, all
message passing functions still produce the same results.
Second, the readout function on a batched graph is conducted over each
graph separately. Assuming the batch size is :math:`B` and the feature
to be aggregated has dimension :math:`D`, the readout result will have
shape :math:`(B, D)`.
.. code:: python
g1 = dgl.graph(([0, 1], [1, 0]))
g1.ndata['h'] = torch.tensor([1., 2.])
g2 = dgl.graph(([0, 1], [1, 2]))
g2.ndata['h'] = torch.tensor([1., 2., 3.])
dgl.readout_nodes(g1, 'h')
# tensor([3.]) # 1 + 2
bg = dgl.batch([g1, g2])
dgl.readout_nodes(bg, 'h')
# tensor([3., 6.]) # [1 + 2, 1 + 2 + 3]
Finally, each node/edge feature tensor on a batched graph is obtained
by concatenating the corresponding feature tensors from all graphs.
.. code:: python
bg.ndata['h']
# tensor([1., 2., 1., 2., 3.])
Model definition
^^^^^^^^^^^^^^^^
Being aware of the above computation rules, we can define a very simple
model.
.. code:: python
class Classifier(nn.Module):
def __init__(self, in_dim, hidden_dim, n_classes):
super(Classifier, self).__init__()
self.conv1 = dglnn.GraphConv(in_dim, hidden_dim)
self.conv2 = dglnn.GraphConv(hidden_dim, hidden_dim)
self.classify = nn.Linear(hidden_dim, n_classes)
        def forward(self, g, feat):
            # Apply graph convolution and activation.
            h = F.relu(self.conv1(g, feat))
            h = F.relu(self.conv2(g, h))
            with g.local_scope():
                g.ndata['h'] = h
                # Calculate graph representation by average readout.
                hg = dgl.mean_nodes(g, 'h')
                return self.classify(hg)
Training loop
~~~~~~~~~~~~~
Data Loading
^^^^^^^^^^^^
Once the model is defined, we can start training. Since graph
classification deals with lots of relatively small graphs instead of a
single big one, we can usually train efficiently on stochastic
mini-batches of graphs, without needing to design sophisticated graph
sampling algorithms.
Assume that we have a graph classification dataset as introduced in
:ref:`guide-data-pipeline`.
.. code:: python
import dgl.data
dataset = dgl.data.GINDataset('MUTAG', False)
Each item in the graph classification dataset is a pair of a graph and
its label. We can speed up the data loading process by taking advantage
of the ``DataLoader``, customizing the collate function to batch the
graphs:
.. code:: python
def collate(samples):
graphs, labels = map(list, zip(*samples))
batched_graph = dgl.batch(graphs)
batched_labels = torch.tensor(labels)
return batched_graph, batched_labels
Then one can create a DataLoader that iterates over the dataset of
graphs in minibatches.
.. code:: python
from torch.utils.data import DataLoader
dataloader = DataLoader(
dataset,
batch_size=1024,
collate_fn=collate,
drop_last=False,
shuffle=True)
Loop
^^^^
The training loop then simply involves iterating over the dataloader
and updating the model.
.. code:: python
model = Classifier(10, 20, 5)
opt = torch.optim.Adam(model.parameters())
for epoch in range(20):
for batched_graph, labels in dataloader:
feats = batched_graph.ndata['feats']
logits = model(batched_graph, feats)
loss = F.cross_entropy(logits, labels)
opt.zero_grad()
loss.backward()
opt.step()
DGL implements
`GIN <https://github.com/dmlc/dgl/tree/master/examples/pytorch/gin>`__
as an example of graph classification. The training loop is inside the
function ``train`` in
`main.py <https://github.com/dmlc/dgl/blob/master/examples/pytorch/gin/main.py>`__.
The model implementation is inside
`gin.py <https://github.com/dmlc/dgl/blob/master/examples/pytorch/gin/gin.py>`__
with more components such as using
:class:`dgl.nn.pytorch.GINConv` (also available in MXNet and Tensorflow)
as the graph convolution layer, batch normalization, etc.
Heterogeneous graph
~~~~~~~~~~~~~~~~~~~
Graph classification with heterogeneous graphs is a little different
from that with homogeneous graphs. In addition to heterogeneous graph
convolution modules, you also need to aggregate over the nodes of
different types in the readout function.
The following shows an example of summing up the average of node
representations for each node type.
.. code:: python
class RGCN(nn.Module):
def __init__(self, in_feats, hid_feats, out_feats, rel_names):
super().__init__()
self.conv1 = dglnn.HeteroGraphConv({
rel: dglnn.GraphConv(in_feats, hid_feats)
for rel in rel_names}, aggregate='sum')
self.conv2 = dglnn.HeteroGraphConv({
rel: dglnn.GraphConv(hid_feats, out_feats)
for rel in rel_names}, aggregate='sum')
def forward(self, graph, inputs):
# inputs are features of nodes
h = self.conv1(graph, inputs)
h = {k: F.relu(v) for k, v in h.items()}
h = self.conv2(graph, h)
return h
    class HeteroClassifier(nn.Module):
        def __init__(self, in_dim, hidden_dim, n_classes, rel_names):
            super().__init__()
            self.rgcn = RGCN(in_dim, hidden_dim, hidden_dim, rel_names)
            self.classify = nn.Linear(hidden_dim, n_classes)
        def forward(self, g):
            h = g.ndata['feat']
            # Apply the heterogeneous graph convolutions defined above.
            h = self.rgcn(g, h)
            with g.local_scope():
                g.ndata['h'] = h
                # Calculate graph representation by averaging the readouts
                # over every node type.
                hg = 0
                for ntype in g.ntypes:
                    hg = hg + dgl.mean_nodes(g, 'h', ntype=ntype)
                return self.classify(hg)
The rest of the code is not different from that for homogeneous graphs.
.. code:: python
    # ``etypes`` is the list of edge types of the graphs in the dataset.
    model = HeteroClassifier(10, 20, 5, etypes)
opt = torch.optim.Adam(model.parameters())
for epoch in range(20):
for batched_graph, labels in dataloader:
logits = model(batched_graph)
loss = F.cross_entropy(logits, labels)
opt.zero_grad()
loss.backward()
opt.step()
.. _guide-training-link-prediction:
5.3 Link Prediction
---------------------------
In some other settings you may want to predict whether an edge exists
between two given nodes or not. Such a model is called a *link prediction*
model.
Overview
~~~~~~~~
A GNN-based link prediction model represents the likelihood of
connectivity between two nodes :math:`u` and :math:`v` as a function of
:math:`\boldsymbol{h}_u^{(L)}` and :math:`\boldsymbol{h}_v^{(L)}`, their
node representations computed from the multi-layer GNN.
.. math::
y_{u,v} = \phi(\boldsymbol{h}_u^{(L)}, \boldsymbol{h}_v^{(L)})
In this section we refer to :math:`y_{u,v}` as the *score* between node
:math:`u` and node :math:`v`.
Training a link prediction model involves comparing the scores between
nodes connected by an edge against the scores between an arbitrary pair
of nodes. For example, given an edge connecting :math:`u` and :math:`v`,
we encourage the score between node :math:`u` and :math:`v` to be higher
than the score between node :math:`u` and a sampled node :math:`v'` from
an arbitrary *noise* distribution :math:`v' \sim P_n(v)`. Such
methodology is called *negative sampling*.
There are many loss functions that can achieve the behavior above when
minimized. A non-exhaustive list includes:
- Cross-entropy loss:
:math:`\mathcal{L} = - \log \sigma (y_{u,v}) - \sum_{v_i \sim P_n(v), i=1,\dots,k}\log \left[ 1 - \sigma (y_{u,v_i})\right]`
- BPR loss:
:math:`\mathcal{L} = \sum_{v_i \sim P_n(v), i=1,\dots,k} - \log \sigma (y_{u,v} - y_{u,v_i})`
- Margin loss:
:math:`\mathcal{L} = \sum_{v_i \sim P_n(v), i=1,\dots,k} \max(0, M - y_{u, v} + y_{u, v_i})`,
where :math:`M` is a constant hyperparameter.
You may find this idea familiar if you know what `implicit
feedback <https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf>`__ or
`noise-contrastive
estimation <http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf>`__
is.
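As an illustration, the cross-entropy loss above can be implemented with binary cross-entropy over the positive and sampled negative scores. The following is a minimal sketch, assuming ``pos_score`` and ``neg_score`` are the score tensors of positive pairs and sampled negative pairs:

.. code:: python

    import torch
    import torch.nn.functional as F

    def binary_cross_entropy_loss(pos_score, neg_score):
        # Label positive pairs as 1 and sampled negative pairs as 0, then
        # apply binary cross-entropy on the raw scores treated as logits.
        scores = torch.cat([pos_score, neg_score])
        labels = torch.cat([torch.ones_like(pos_score), torch.zeros_like(neg_score)])
        return F.binary_cross_entropy_with_logits(scores, labels)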
Model Implementation Difference from Edge Classification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The neural network model to compute the score between :math:`u` and
:math:`v` is identical to the edge regression model described above.
Here is an example of using dot product to compute the scores on edges.
.. code:: python
class DotProductPredictor(nn.Module):
def forward(self, graph, h):
# h contains the node representations computed from the GNN above.
with graph.local_scope():
graph.ndata['h'] = h
graph.apply_edges(fn.u_dot_v('h', 'h', 'score'))
return graph.edata['score']
Training loop
~~~~~~~~~~~~~
Because our score prediction model operates on graphs, we need to
express the negative examples as another graph. The graph will contain
all negative node pairs as edges.
The following shows an example of expressing negative examples as a
graph. Each edge :math:`(u,v)` gets :math:`k` negative examples
:math:`(u,v_i)` where :math:`v_i` is sampled from a uniform
distribution.
.. code:: python
def construct_negative_graph(graph, k):
src, dst = graph.edges()
neg_src = src.repeat_interleave(k)
neg_dst = torch.randint(0, graph.number_of_nodes(), (len(src) * k,))
return dgl.graph((neg_src, neg_dst), num_nodes=graph.number_of_nodes())
The model that predicts edge scores is the same as that of edge
classification/regression.
.. code:: python
class Model(nn.Module):
def __init__(self, in_features, hidden_features, out_features):
super().__init__()
self.sage = SAGE(in_features, hidden_features, out_features)
self.pred = DotProductPredictor()
def forward(self, g, neg_g, x):
h = self.sage(g, x)
return self.pred(g, h), self.pred(neg_g, h)
The training loop then repeatedly constructs the negative graph and
computes loss.
.. code:: python
def compute_loss(pos_score, neg_score):
# Margin loss
n_edges = pos_score.shape[0]
        return (1 - pos_score.view(n_edges, 1) + neg_score.view(n_edges, -1)).clamp(min=0).mean()
node_features = graph.ndata['feat']
n_features = node_features.shape[1]
k = 5
model = Model(n_features, 100, 100)
opt = torch.optim.Adam(model.parameters())
for epoch in range(10):
negative_graph = construct_negative_graph(graph, k)
pos_score, neg_score = model(graph, negative_graph, node_features)
loss = compute_loss(pos_score, neg_score)
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
After training, the node representation can be obtained via
.. code:: python
node_embeddings = model.sage(graph, node_features)
There are multiple ways of using the node embeddings. Examples include
training downstream classifiers, or doing nearest neighbor search or
maximum inner product search for relevant entity recommendation.
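For example, a maximum inner product lookup over the learned embeddings could be sketched as follows (a hypothetical helper, not part of the example code):

.. code:: python

    # Hypothetical top-k retrieval by maximum inner product over the node
    # embeddings computed above.
    def recommend(node_embeddings, query_node_id, k=10):
        scores = (node_embeddings @ node_embeddings[query_node_id]).detach()
        scores[query_node_id] = float('-inf')  # exclude the query node itself
        return torch.topk(scores, k).indices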
Heterogeneous graphs
~~~~~~~~~~~~~~~~~~~~
Link prediction on heterogeneous graphs is not very different from that
on homogeneous graphs. The following assumes that we are predicting on
one edge type, and it is easy to extend it to multiple edge types.
For example, you can reuse the ``HeteroDotProductPredictor`` above for
computing the scores of the edges of an edge type for link prediction.
.. code:: python
class HeteroDotProductPredictor(nn.Module):
def forward(self, graph, h, etype):
# h contains the node representations for each edge type computed from
# the GNN above.
with graph.local_scope():
graph.ndata['h'] = h
graph.apply_edges(fn.u_dot_v('h', 'h', 'score'), etype=etype)
return graph.edges[etype].data['score']
To perform negative sampling, one can construct a negative graph for
the edge type on which link prediction is performed.
.. code:: python
def construct_negative_graph(graph, k, etype):
utype, _, vtype = etype
src, dst = graph.edges(etype=etype)
neg_src = src.repeat_interleave(k)
neg_dst = torch.randint(0, graph.number_of_nodes(vtype), (len(src) * k,))
return dgl.heterograph(
{etype: (neg_src, neg_dst)},
num_nodes_dict={ntype: graph.number_of_nodes(ntype) for ntype in graph.ntypes})
The model is a bit different from that in edge classification on
heterogeneous graphs since you need to specify the edge type on which
you perform link prediction.
.. code:: python
class Model(nn.Module):
def __init__(self, in_features, hidden_features, out_features, rel_names):
super().__init__()
self.sage = RGCN(in_features, hidden_features, out_features, rel_names)
self.pred = HeteroDotProductPredictor()
def forward(self, g, neg_g, x, etype):
h = self.sage(g, x)
return self.pred(g, h, etype), self.pred(neg_g, h, etype)
The training loop is similar to that of homogeneous graphs.
.. code:: python
def compute_loss(pos_score, neg_score):
# Margin loss
n_edges = pos_score.shape[0]
        return (1 - pos_score.view(n_edges, 1) + neg_score.view(n_edges, -1)).clamp(min=0).mean()
k = 5
model = Model(10, 20, 5, hetero_graph.etypes)
user_feats = hetero_graph.nodes['user'].data['feature']
item_feats = hetero_graph.nodes['item'].data['feature']
node_features = {'user': user_feats, 'item': item_feats}
opt = torch.optim.Adam(model.parameters())
for epoch in range(10):
negative_graph = construct_negative_graph(hetero_graph, k, ('user', 'click', 'item'))
pos_score, neg_score = model(hetero_graph, negative_graph, node_features, ('user', 'click', 'item'))
loss = compute_loss(pos_score, neg_score)
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
.. _guide-training-node-classification:
5.1 Node Classification/Regression
--------------------------------------------------
One of the most popular and widely adopted tasks for graph neural
networks is node classification, where each node in the
training/validation/test set is assigned a ground truth category from a
set of predefined categories. Node regression is similar, where each
node in the training/validation/test set is assigned a ground truth
number.
Overview
~~~~~~~~
To classify nodes, a graph neural network performs the message passing
discussed in :ref:`guide-message-passing` to utilize not only the node's
own features, but also its neighboring node and edge features. Message
passing can be repeated for multiple rounds to incorporate information
from a larger neighborhood.
Writing neural network model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DGL provides a few built-in graph convolution modules that can perform
one round of message passing. In this guide, we choose
:class:`dgl.nn.pytorch.SAGEConv` (also available in MXNet and Tensorflow),
the graph convolution module for GraphSAGE.
Usually for deep learning models on graphs we need a multi-layer graph
neural network, where we do multiple rounds of message passing. This can
be achieved by stacking graph convolution modules as follows.
.. code:: python
    # Construct a two-layer GNN model
import dgl.nn as dglnn
import torch.nn as nn
import torch.nn.functional as F
class SAGE(nn.Module):
def __init__(self, in_feats, hid_feats, out_feats):
super().__init__()
self.conv1 = dglnn.SAGEConv(
in_feats=in_feats, out_feats=hid_feats, aggregator_type='mean')
self.conv2 = dglnn.SAGEConv(
in_feats=hid_feats, out_feats=out_feats, aggregator_type='mean')
def forward(self, graph, inputs):
# inputs are features of nodes
h = self.conv1(graph, inputs)
h = F.relu(h)
h = self.conv2(graph, h)
return h
Note that you can use the model above for not only node classification,
but also obtaining hidden node representations for other downstream
tasks such as
:ref:`guide-training-edge-classification`,
:ref:`guide-training-link-prediction`, or
:ref:`guide-training-graph-classification`.
For a complete list of built-in graph convolution modules, please refer
to :ref:`apinn`.
For more details on how DGL neural network modules work and how to
write a custom neural network module with message passing, please refer
to the example in :ref:`guide-nn`.
Training loop
~~~~~~~~~~~~~
Training on the full graph simply involves a forward propagation of the
model defined above, and computing the loss by comparing the prediction
against ground truth labels on the training nodes.
This section uses a DGL built-in dataset
:class:`dgl.data.CiteseerGraphDataset` to
show a training loop. The node features
and labels are stored on its graph instance, and the
training-validation-test split is also stored on the graph as boolean
masks. This is similar to what you have seen in :ref:`guide-data-pipeline`.
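For example, the dataset can be loaded and its single graph retrieved as follows:

.. code:: python

    import dgl.data

    # CiteseerGraphDataset contains a single graph with features, labels
    # and split masks stored as node data.
    dataset = dgl.data.CiteseerGraphDataset()
    graph = dataset[0]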
.. code:: python
node_features = graph.ndata['feat']
node_labels = graph.ndata['label']
train_mask = graph.ndata['train_mask']
valid_mask = graph.ndata['val_mask']
test_mask = graph.ndata['test_mask']
n_features = node_features.shape[1]
n_labels = int(node_labels.max().item() + 1)
The following is an example of evaluating your model by accuracy.
.. code:: python
def evaluate(model, graph, features, labels, mask):
model.eval()
with torch.no_grad():
logits = model(graph, features)
logits = logits[mask]
labels = labels[mask]
_, indices = torch.max(logits, dim=1)
correct = torch.sum(indices == labels)
return correct.item() * 1.0 / len(labels)
You can then write your training loop as follows.
.. code:: python
model = SAGE(in_feats=n_features, hid_feats=100, out_feats=n_labels)
opt = torch.optim.Adam(model.parameters())
for epoch in range(10):
model.train()
# forward propagation by using all nodes
logits = model(graph, node_features)
# compute loss
loss = F.cross_entropy(logits[train_mask], node_labels[train_mask])
# compute validation accuracy
acc = evaluate(model, graph, node_features, node_labels, valid_mask)
# backward propagation
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
# Save model if necessary. Omitted in this example.
`GraphSAGE <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_full.py>`__
provides an end-to-end homogeneous graph node classification example.
You can see the corresponding model implementation in the
``GraphSAGE`` class in that example, with an adjustable number of
layers, dropout probabilities, and customizable aggregation functions
and nonlinearities.
.. _guide-training-rgcn-node-classification:
Heterogeneous graph
~~~~~~~~~~~~~~~~~~~
If your graph is heterogeneous, you may want to gather messages from
neighbors along all edge types. You can use the module
:class:`dgl.nn.pytorch.HeteroGraphConv` (also available in MXNet and Tensorflow),
which performs message passing on all edge types by combining a
separate graph convolution module for each edge type.
The following code will define a heterogeneous graph convolution module
that first performs a separate graph convolution on each edge type, then
sums the message aggregations on each edge type as the final result for
all node types.
.. code:: python
# Define a Heterograph Conv model
import dgl.nn as dglnn
class RGCN(nn.Module):
def __init__(self, in_feats, hid_feats, out_feats, rel_names):
super().__init__()
self.conv1 = dglnn.HeteroGraphConv({
rel: dglnn.GraphConv(in_feats, hid_feats)
for rel in rel_names}, aggregate='sum')
self.conv2 = dglnn.HeteroGraphConv({
rel: dglnn.GraphConv(hid_feats, out_feats)
for rel in rel_names}, aggregate='sum')
def forward(self, graph, inputs):
# inputs are features of nodes
h = self.conv1(graph, inputs)
h = {k: F.relu(v) for k, v in h.items()}
h = self.conv2(graph, h)
return h
``dgl.nn.HeteroGraphConv`` takes in a dictionary of node types and node
feature tensors as input, and returns another dictionary of node types
and node features.
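If you want to run the following snippets standalone, a synthetic heterogeneous graph with the schema assumed in this section can be constructed roughly as follows (a sketch with made-up sizes; the relation and feature names follow the surrounding examples):

.. code:: python

    import torch
    import dgl

    # Hypothetical synthetic graph: 'user' and 'item' nodes with
    # 'follow', 'click' and 'dislike' relations.
    n_users, n_items = 1000, 500
    n_follows, n_clicks, n_dislikes = 3000, 5000, 500
    n_hetero_features, n_user_classes = 10, 5

    hetero_graph = dgl.heterograph({
        ('user', 'follow', 'user'): (torch.randint(0, n_users, (n_follows,)),
                                     torch.randint(0, n_users, (n_follows,))),
        ('user', 'click', 'item'): (torch.randint(0, n_users, (n_clicks,)),
                                    torch.randint(0, n_items, (n_clicks,))),
        ('user', 'dislike', 'item'): (torch.randint(0, n_users, (n_dislikes,)),
                                      torch.randint(0, n_items, (n_dislikes,)))})
    hetero_graph.nodes['user'].data['feature'] = torch.randn(n_users, n_hetero_features)
    hetero_graph.nodes['item'].data['feature'] = torch.randn(n_items, n_hetero_features)
    hetero_graph.nodes['user'].data['label'] = torch.randint(0, n_user_classes, (n_users,))
    hetero_graph.nodes['user'].data['train_mask'] = torch.zeros(n_users, dtype=torch.bool).bernoulli(0.6)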
Given the user and item features from the example above, one can create the model and gather the inputs, labels, and training mask as follows:
.. code:: python
model = RGCN(n_hetero_features, 20, n_user_classes, hetero_graph.etypes)
user_feats = hetero_graph.nodes['user'].data['feature']
item_feats = hetero_graph.nodes['item'].data['feature']
labels = hetero_graph.nodes['user'].data['label']
train_mask = hetero_graph.nodes['user'].data['train_mask']
One can simply perform a forward propagation as follows:
.. code:: python
node_features = {'user': user_feats, 'item': item_feats}
h_dict = model(hetero_graph, {'user': user_feats, 'item': item_feats})
h_user = h_dict['user']
h_item = h_dict['item']
The training loop is the same as the one for homogeneous graphs, except that
now you have a dictionary of node representations from which you compute
the predictions. For instance, if you are only predicting the ``user``
nodes, you can just extract the ``user`` node embeddings from the
returned dictionary:
.. code:: python
opt = torch.optim.Adam(model.parameters())
for epoch in range(5):
model.train()
# forward propagation by using all nodes and extracting the user embeddings
logits = model(hetero_graph, node_features)['user']
# compute loss
loss = F.cross_entropy(logits[train_mask], labels[train_mask])
# Compute validation accuracy. Omitted in this example.
# backward propagation
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
# Save model if necessary. Omitted in the example.
DGL provides an end-to-end example of
`RGCN <https://github.com/dmlc/dgl/blob/master/examples/pytorch/rgcn-hetero/entity_classify.py>`__
for node classification. You can see the definition of heterogeneous
graph convolution in ``RelGraphConvLayer`` in the `model implementation
file <https://github.com/dmlc/dgl/blob/master/examples/pytorch/rgcn-hetero/model.py>`__.
......@@ -84,6 +84,7 @@ Getting Started
:maxdepth: 2
:caption: User Guide
:hidden:
:titlesonly:
:glob:
guide/preface
......