Unverified Commit f13b9b62 authored by Minjie Wang, committed by GitHub

[Doc] Scan the API docs and make many changes (#2080)



* WIP: api

* dgl.sampling, dgl.data

* dgl.sampling; dgl.dataloading

* sampling packages

* convert

* subgraph

* deprecate

* subgraph APIs

* All docstrings for convert/subgraph/transform

* almost all funcs under dgl namespace

* WIP: DGLGraph

* done graph query

* message passing functions

* lint

* fix merge error

* fix test

* lint

* fix
Co-authored-by: Quan Gan <coin2028@hotmail.com>
parent 35e25914
"""
.. _model-sampling:
NodeFlow and Sampling
=======================================
**Author**: Ziyue Huang, Da Zheng, Quan Gan, Jinjing Zhou, Zheng Zhang
"""
################################################################################################
#
# Graph convolutional network
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# In an :math:`L`-layer graph convolution network (GCN), given a graph
# :math:`G=(V, E)`, represented as an adjacency matrix :math:`A`, with
# node features :math:`H^{(0)} = X \in \mathbb{R}^{|V| \times d}`, the
# hidden feature of a node :math:`v` in :math:`(l+1)`-th layer
# :math:`h_v^{(l+1)}` depends on the features of all its neighbors in the
# previous layer :math:`h_u^{(l)}`:
#
# .. math::
#
#
# z_v^{(l+1)} = \sum_{u \in \mathcal{N}(v)} \tilde{A}_{uv} h_u^{(l)} \qquad h_v^{(l+1)} = \sigma ( z_v^{(l+1)} W^{(l)})
#
# where :math:`\mathcal{N}(v)` is the neighborhood of :math:`v`,
# :math:`\tilde{A}` could be any normalized version of :math:`A` such as
# :math:`D^{-1} A` in Kipf et al., :math:`\sigma(\cdot)` is an activation
# function, and :math:`W^{(l)}` is a trainable parameter of the
# :math:`l`-th layer.
#
# In a node classification task, you minimize the following loss:
#
# .. math::
#
#
# \frac{1}{\vert \mathcal{V}_\mathcal{L} \vert} \sum_{v \in \mathcal{V}_\mathcal{L}} f(y_v, z_v^{(L)})
#
# where :math:`y_v` is the label of :math:`v`, and :math:`f(\cdot, \cdot)`
# is a loss function, e.g., cross entropy loss.
#
# While training GCN on the full graph, each node aggregates the hidden
# features of its neighbors to compute its hidden feature in the next
# layer.
#
# In this tutorial, you run GCN on the Reddit dataset constructed by `Hamilton et
# al. <https://arxiv.org/abs/1706.02216>`__, in which the nodes are posts and
# two posts are connected by an edge if the same user commented on both. The
# task is to predict the category a post belongs to. The graph has
# 233,000 nodes, 114.6 million edges, and 41 categories. First, load the Reddit graph.
#
import numpy as np
import dgl
import dgl.function as fn
from dgl import DGLGraph
from dgl.data import RedditDataset
import mxnet as mx
from mxnet import gluon
# Load MXNet as backend
dgl.load_backend('mxnet')
# load dataset
data = RedditDataset(self_loop=True)
train_nid = mx.nd.array(np.nonzero(data.train_mask)[0]).astype(np.int64)
features = mx.nd.array(data.features)
in_feats = features.shape[1]
labels = mx.nd.array(data.labels)
n_classes = data.num_labels
# construct DGLGraph and prepare related data
g = DGLGraph(data.graph, readonly=True)
g.ndata['features'] = features
################################################################################################
# Here you define the node UDF, which has a fully-connected layer:
#
class NodeUpdate(gluon.Block):
    def __init__(self, in_feats, out_feats, activation=None):
        super(NodeUpdate, self).__init__()
        self.dense = gluon.nn.Dense(out_feats, in_units=in_feats)
        self.activation = activation

    def forward(self, node):
        h = node.data['h']
        h = self.dense(h)
        if self.activation:
            h = self.activation(h)
        return {'activation': h}
################################################################################################
# In DGL, you implement GCN on the full graph with ``update_all`` in ``DGLGraph``.
# The following code performs two-layer GCN on the Reddit graph.
#
# number of GCN layers
L = 2
# number of hidden units of a fully connected layer
n_hidden = 64

layers = [NodeUpdate(g.ndata['features'].shape[1], n_hidden, mx.nd.relu),
          NodeUpdate(n_hidden, n_hidden, mx.nd.relu)]
for layer in layers:
    layer.initialize()

h = g.ndata['features']
for i in range(L):
    g.ndata['h'] = h
    # One round of message passing on the full graph: every node gathers 'h'
    # from its neighbors, sums them, and applies the update function of layer i.
    g.update_all(message_func=fn.copy_src(src='h', out='m'),
                 reduce_func=fn.sum(msg='m', out='h'),
                 apply_node_func=lambda node: {'h': layers[i](node)['activation']})
    h = g.ndata.pop('h')
##############################################################################
# NodeFlow
# ~~~~~~~~~~~~~~~~~
#
# As the graph scales up to billions of nodes or edges, training on the
# full graph would no longer be efficient or even feasible.
#
# Mini-batch training allows you to control the computation and memory
# usage within some budget. The training loss for each iteration is
#
# .. math::
#
# \frac{1}{\vert \tilde{\mathcal{V}}_\mathcal{L} \vert} \sum_{v \in \tilde{\mathcal{V}}_\mathcal{L}} f(y_v, z_v^{(L)})
#
# where :math:`\tilde{\mathcal{V}}_\mathcal{L}` is a subset sampled from
# the total labeled nodes :math:`\mathcal{V}_\mathcal{L}` uniformly at
# random.
#
# Stemming from the labeled nodes :math:`\tilde{\mathcal{V}}_\mathcal{L}`
# in a mini-batch and tracing back to the input forms a computational
# dependency graph (a directed acyclic graph [DAG]), which
# captures the computation flow of :math:`Z^{(L)}`.
#
# In the example below, a mini-batch to compute the hidden features of
# node D in layer 2 requires hidden features of A, B, E, G in layer 1,
# which in turn requires hidden features of C, D, F in layer 0.
#
# |image0|
#
# For that purpose, you define ``NodeFlow`` to represent this computation
# flow.
#
# ``NodeFlow`` is a type of layered graph, where nodes are organized in
# :math:`L + 1` sequential *layers*, and edges only exist between adjacent
# layers, forming *blocks*. You construct ``NodeFlow`` backwards, starting
# from the last layer with all the nodes whose hidden features are
# requested. The set of nodes the next layer depends on forms the previous
# layer. An edge connects a node in the previous layer to another in the
# next layer if the latter depends on the former. This process repeats
# until all :math:`L + 1` layers are constructed. The features of nodes in
# each layer, and those of edges in each block, are stored as separate
# tensors.
#
# ``NodeFlow`` provides ``block_compute`` for per-block computation, which
# triggers computation and data propagation from the lower layer to the
# upper layer.
#
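# To make the layered structure concrete, the following minimal sketch builds a
# single ``NodeFlow`` with the ``NeighborSampler`` described in the next
# sections and prints how many nodes each layer holds; it only relies on the
# ``g`` and ``train_nid`` objects defined above.
#
demo_nf = next(iter(dgl.contrib.sampling.NeighborSampler(
    g, batch_size=1000, expand_factor=4, neighbor_type='in',
    num_hops=2, seed_nodes=train_nid)))
# layer_parent_nid(i) maps the nodes of layer i back to node IDs in the parent graph
for lid in range(demo_nf.num_blocks + 1):
    print('layer {} has {} nodes'.format(lid, len(demo_nf.layer_parent_nid(lid))))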
##############################################################################
# Neighbor sampling
# ~~~~~~~~~~~~~~~~~
#
# Real-world graphs often have nodes with large degrees, so a moderately
# deep (e.g., three-layer) GCN can easily depend on the input features of
# the entire graph, even when the computation only needs the outputs of a
# few nodes. This makes full-graph computation cost-ineffective.
#
# Sampling methods mitigate this computational problem by reducing the
# receptive field effectively. Fig-c above shows one such example.
#
# Instead of using all the :math:`L`-hop neighbors of a node :math:`v`,
# `Hamilton et al. <https://arxiv.org/abs/1706.02216>`__ propose *neighbor
# sampling*, which randomly samples a few neighbors
# :math:`\hat{\mathcal{N}}^{(l)}(v)` to estimate the aggregation
# :math:`z_v^{(l+1)}` over its full neighborhood :math:`\mathcal{N}(v)` in the
# :math:`l`-th GCN layer with an unbiased estimator
# :math:`\hat{z}_v^{(l+1)}`
#
# .. math::
#
#
# \hat{z}_v^{(l+1)} = \frac{\vert \mathcal{N}(v) \vert }{\vert \hat{\mathcal{N}}^{(l)}(v) \vert} \sum_{u \in \hat{\mathcal{N}}^{(l)}(v)} \tilde{A}_{uv} \hat{h}_u^{(l)} \qquad
# \hat{h}_v^{(l+1)} = \sigma ( \hat{z}_v^{(l+1)} W^{(l)} )
#
# Let :math:`D^{(l)}` be the number of neighbors to be sampled for each
# node at the :math:`l`-th layer, then the receptive field size of each
# node can be controlled under :math:`\prod_{l=0}^{L-1} D^{(l)}` by
# *neighbor sampling*.
#
##############################################################################
# You then implement *neighbor sampling* by ``NodeFlow``:
#
class GCNSampling(gluon.Block):
    def __init__(self,
                 in_feats,
                 n_hidden,
                 n_classes,
                 n_layers,
                 activation,
                 dropout,
                 **kwargs):
        super(GCNSampling, self).__init__(**kwargs)
        self.dropout = dropout
        self.n_layers = n_layers
        with self.name_scope():
            self.layers = gluon.nn.Sequential()
            # input layer
            self.layers.add(NodeUpdate(in_feats, n_hidden, activation))
            # hidden layers
            for i in range(1, n_layers-1):
                self.layers.add(NodeUpdate(n_hidden, n_hidden, activation))
            # output layer
            self.layers.add(NodeUpdate(n_hidden, n_classes))

    def forward(self, nf):
        nf.layers[0].data['activation'] = nf.layers[0].data['features']
        for i, layer in enumerate(self.layers):
            h = nf.layers[i].data.pop('activation')
            if self.dropout:
                h = mx.nd.Dropout(h, p=self.dropout)
            nf.layers[i].data['h'] = h
            # block_compute() computes the features of layer i+1 given layer i,
            # with the given message, reduce, and apply functions.
            # Here, you essentially average the neighbor node features from
            # the previous layer and update them with the `layer` function.
            nf.block_compute(i,
                             fn.copy_src(src='h', out='m'),
                             lambda node: {'h': node.mailbox['m'].mean(axis=1)},
                             layer)
        h = nf.layers[-1].data.pop('activation')
        return h
##############################################################################
# DGL provides ``NeighborSampler`` to construct the ``NodeFlow`` for a
# mini-batch according to the computation logic of neighbor sampling.
# ``NeighborSampler``
# returns an iterator that generates a ``NodeFlow`` each time. This function
# has many options that let users customize the behavior of the neighbor
# sampler, such as the number of neighbors or the number of hops to sample.
# Please see `its API
# document <https://doc.dgl.ai/api/python/sampler.html>`__ for more
# details.
#
# dropout probability
dropout = 0.2
# batch size
batch_size = 1000
# number of neighbors to sample
num_neighbors = 4
# number of epochs
num_epochs = 1

# initialize the model and cross entropy loss
model = GCNSampling(in_feats, n_hidden, n_classes, L,
                    mx.nd.relu, dropout, prefix='GCN')
model.initialize()
loss_fcn = gluon.loss.SoftmaxCELoss()

# use adam optimizer
trainer = gluon.Trainer(model.collect_params(), 'adam',
                        {'learning_rate': 0.03, 'wd': 0})

for epoch in range(num_epochs):
    i = 0
    for nf in dgl.contrib.sampling.NeighborSampler(g, batch_size,
                                                   num_neighbors,
                                                   neighbor_type='in',
                                                   shuffle=True,
                                                   num_hops=L,
                                                   seed_nodes=train_nid):
        # When `NodeFlow` is generated from `NeighborSampler`, it only contains
        # the topology structure, on which there is no data attached.
        # Users need to call `copy_from_parent` to copy specific data,
        # such as input node features, from the original graph.
        nf.copy_from_parent()
        with mx.autograd.record():
            # forward
            pred = model(nf)
            batch_nids = nf.layer_parent_nid(-1).astype('int64')
            batch_labels = labels[batch_nids]
            # cross entropy loss
            loss = loss_fcn(pred, batch_labels)
            loss = loss.sum() / len(batch_nids)
        # backward
        loss.backward()
        # optimization
        trainer.step(batch_size=1)
        print("Epoch[{}]: loss {}".format(epoch, loss.asscalar()))
        i += 1
        # You only train the model with 32 mini-batches just for demonstration.
        if i >= 32:
            break
##############################################################################
# Control variate
# ~~~~~~~~~~~~~~~
#
# The unbiased estimator :math:`\hat{Z}^{(\cdot)}` used in *neighbor
# sampling* might suffer from high variance, so it still requires a
# relatively large number of neighbors, e.g., :math:`D^{(0)}=25` and
# :math:`D^{(1)}=10` in `Hamilton et
# al. <https://arxiv.org/abs/1706.02216>`__. With *control variate*, a
# standard variance reduction technique widely used in Monte Carlo
# methods, sampling as few as two neighbors per node is often sufficient.
#
# The *control variate* method works as follows. Given a random variable
# :math:`X` whose expectation :math:`\mathbb{E} [X] = \theta` you wish to
# estimate, find another random variable :math:`Y` that is highly correlated
# with :math:`X` and whose expectation :math:`\mathbb{E} [Y]` can be computed
# easily. The *control variate* estimator :math:`\tilde{X}` is
#
# .. math::
#
# \tilde{X} = X - Y + \mathbb{E} [Y] \qquad \mathbb{VAR} [\tilde{X}] = \mathbb{VAR} [X] + \mathbb{VAR} [Y] - 2 \cdot \mathbb{COV} [X, Y]
#
# If :math:`\mathbb{VAR} [Y] - 2\mathbb{COV} [X, Y] < 0`, then
# :math:`\mathbb{VAR} [\tilde{X}] < \mathbb{VAR} [X]`.
#
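# As a quick numerical illustration, the minimal sketch below estimates
# :math:`\mathbb{E}[U^2] = 1/3` for :math:`U \sim \text{Uniform}(0, 1)`, using
# :math:`Y = U` with known :math:`\mathbb{E}[Y] = 1/2` as the control variate,
# and compares the per-sample variance of the plain and control-variate
# estimators.
#
rng = np.random.RandomState(0)
u = rng.uniform(0.0, 1.0, size=100000)
x = u ** 2                   # X = U^2, the quantity whose mean we estimate
x_cv = x - u + 0.5           # control variate estimator: X - Y + E[Y]
print('plain estimate {:.4f} (variance {:.4f})'.format(x.mean(), x.var()))
print('control variate estimate {:.4f} (variance {:.4f})'.format(x_cv.mean(), x_cv.var()))
##############################################################################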
# `Chen et al. <https://arxiv.org/abs/1710.10568>`__ proposed a *control
# variate*-based estimator for GCN training. By using the history
# :math:`\bar{H}^{(l)}` of the nodes that are not sampled, the modified
# estimator :math:`\hat{z}_v^{(l+1)}` becomes
#
# .. math::
#
#
# \hat{z}_v^{(l+1)} = \frac{\vert \mathcal{N}(v) \vert }{\vert \hat{\mathcal{N}}^{(l)}(v) \vert} \sum_{u \in \hat{\mathcal{N}}^{(l)}(v)} \tilde{A}_{uv} ( \hat{h}_u^{(l)} - \bar{h}_u^{(l)} ) + \sum_{u \in \mathcal{N}(v)} \tilde{A}_{uv} \bar{h}_u^{(l)} \\
# \hat{h}_v^{(l+1)} = \sigma ( \hat{z}_v^{(l+1)} W^{(l)} )
#
# This method can also be *conceptually* implemented in DGL as shown
# here.
#
have_large_memory = False
# The control-variate sampling code below needs to run on a large-memory
# machine for the Reddit graph.
if have_large_memory:
    g.ndata['h_0'] = features
    for i in range(L):
        g.ndata['h_{}'.format(i+1)] = mx.nd.zeros((features.shape[0], n_hidden))

# With control-variate sampling, you only need to sample two neighbors to train GCN.
if have_large_memory:
    for nf in dgl.contrib.sampling.NeighborSampler(g, batch_size, expand_factor=2,
                                                   neighbor_type='in', num_hops=L,
                                                   seed_nodes=train_nid):
        for i in range(nf.num_blocks):
            # aggregate history on the original graph
            g.pull(nf.layer_parent_nid(i+1),
                   fn.copy_src(src='h_{}'.format(i), out='m'),
                   lambda node: {'agg_h_{}'.format(i): node.mailbox['m'].mean(axis=1)})
        nf.copy_from_parent()
        h = nf.layers[0].data['features']
        for i in range(nf.num_blocks):
            prev_h = nf.layers[i].data['h_{}'.format(i)]
            # compute delta_h, the difference of the current activation and the history
            nf.layers[i].data['delta_h'] = h - prev_h
            # refresh the old history
            nf.layers[i].data['h_{}'.format(i)] = h.detach()
            # aggregate the delta_h
            nf.block_compute(i,
                             fn.copy_src(src='delta_h', out='m'),
                             lambda node: {'delta_h': node.mailbox['m'].mean(axis=1)})
            delta_h = nf.layers[i + 1].data['delta_h']
            agg_h = nf.layers[i + 1].data['agg_h_{}'.format(i)]
            # control variate estimator
            nf.layers[i + 1].data['h'] = delta_h + agg_h
            # `layer` here stands for the dense update function of the current layer
            nf.apply_layer(i + 1, lambda node: {'h': layer(node.data['h'])})
            h = nf.layers[i + 1].data['h']
        # update history
        nf.copy_to_parent()
##############################################################################
# You can find the full examples here: `MXNet
# code <https://github.com/dmlc/dgl/blob/master/examples/mxnet/sampling/>`__
# and `PyTorch
# code <https://github.com/dmlc/dgl/tree/master/examples/pytorch/sampling>`__.
#
# The figure below shows the performance of the graph convolutional network
# and GraphSAGE with neighbor sampling and control-variate sampling on the
# Reddit dataset. GraphSAGE with control-variate sampling achieves over 96
# percent test accuracy even when sampling only one neighbor. |image1|
#
# More APIs
# ~~~~~~~~~
#
# ``block_compute`` is one of the APIs provided by ``NodeFlow`` that give you
# the flexibility to explore new ideas. The computation flow underlying the
# DAG can also be executed in one sweep by calling ``prop_flow``.
#
# ``prop_flow`` accepts a list of UDFs. The code below defines node update UDFs
# for each layer and computes a simplified version of GCN with neighbor sampling.
#
#
apply_node_funcs = [
    lambda node: {'h': layers[0](node)['activation']},
    lambda node: {'h': layers[1](node)['activation']},
]
for nf in dgl.contrib.sampling.NeighborSampler(g, batch_size, num_neighbors,
                                               neighbor_type='in', num_hops=L,
                                               seed_nodes=train_nid):
    nf.copy_from_parent()
    nf.layers[0].data['h'] = nf.layers[0].data['features']
    nf.prop_flow(fn.copy_src(src='h', out='m'),
                 fn.sum(msg='m', out='h'), apply_node_funcs)
##############################################################################
# Internally, ``prop_flow`` triggers the computation by fusing together
# all the block computations, from the input to the top. The main
# advantages of this API are 1) simplicity, 2) allowing more system-level
# optimization in the future.
#
# .. |image0| image:: https://data.dgl.ai/tutorial/sampling/NodeFlow.png
# .. |image1| image:: https://data.dgl.ai/tutorial/sampling/sampling_result.png
#
"""
.. _model-graph-store:
Large-Scale Training of Graph Neural Networks
=============================================
**Author**: Da Zheng, Chao Ma, Zheng Zhang
"""
################################################################################################
#
# In real-world tasks, many graphs are very large. For example, a recent
# snapshot of the friendship network of Facebook contains 800 million
# nodes and over 100 billion links. Graphs at this scale pose challenges
# for training graph neural networks.
#
# To accelerate training on a giant graph, DGL provides two additional
# components: sampler and graph store.
#
# - A sampler constructs small subgraphs (``NodeFlow``) from a given
# (giant) graph. The sampler can run on a local machine as well as on
# remote machines. Also, DGL can launch multiple parallel samplers
# across a set of machines.
#
# - The graph store contains graph embeddings of a giant graph, as well
# as the graph structure. So far, we provide a shared-memory graph
# store to support multi-processing training, which is important for
# training on multiple GPUs and on non-uniform memory access (NUMA)
# machines. The shared-memory graph store has a similar interface to
# ``DGLGraph`` for programming. DGL will also support a distributed
# graph store that can store graph embeddings across machines in the
# future release.
#
# The figure below shows the interaction of the trainer with the samplers
# and the graph store. The trainer takes subgraphs (``NodeFlow``) from the
# sampler and fetches graph embeddings from the graph store before
# training. The trainer can push new graph embeddings to the graph store
# afterward.
#
# |image0|
#
# In this tutorial, we use control-variate sampling to demonstrate how to
# use these three DGL components, extending `the original code of
# control-variate
# sampling <https://doc.dgl.ai/tutorials/models/5_giant_graph/1_sampling_mx.html#sphx-glr-tutorials-models-5-giant-graph-1-sampling-mx-py>`__.
# Because the graph store has a similar API to ``DGLGraph``, the code is
# similar. The tutorial will mainly focus on the difference.
#
# Graph Store
# -----------
#
# The graph store has two parts: the server and the client. We need to run
# the graph store server as a daemon before training. We provide a script
# ``run_store_server.py`` `(link) <https://github.com/dmlc/dgl/blob/master/examples/mxnet/sampling/run_store_server.py>`__
# that runs the graph store server and loads graph data. For example, the
# following command runs a graph store server that loads the Reddit
# dataset and is configured to run with four trainers.
#
# ::
#
# python3 run_store_server.py --dataset reddit --num-workers 4
#
# The trainer uses the graph store client to access data in the graph
# store from the trainer process. A user only needs to write code in the
# trainer. We first create the graph store client that connects with the
# server. We specify ``store_type`` as ``"shared_mem"`` to connect with the
# shared-memory graph store server.
#
# .. code:: python
#
# g = dgl.contrib.graph_store.create_graph_from_store("reddit", store_type="shared_mem")
#
# The `sampling
# tutorial <https://doc.dgl.ai/tutorials/models/5_giant_graph/1_sampling_mx.html#sphx-glr-tutorials-models-5-giant-graph-1-sampling-mx-py>`__
# shows the detail of sampling methods and how they are used to train
# graph neural networks such as graph convolution network. As a recap, the
# graph convolution model performs the following computation in each
# layer.
#
# .. math::
#
#
# z_v^{(l+1)} = \sum_{u \in \mathcal{N}^{(l)}(v)} \tilde{A}_{uv} h_u^{(l)} \qquad
# h_v^{(l+1)} = \sigma ( z_v^{(l+1)} W^{(l)} )
#
# `Control variate sampling <https://arxiv.org/abs/1710.10568>`__
# approximates :math:`z_v^{(l+1)}` as follows:
#
# .. math::
#
#
# \hat{z}_v^{(l+1)} = \frac{\vert \mathcal{N}(v) \vert }{\vert \hat{\mathcal{N}}^{(l)}(v) \vert} \sum_{u \in \hat{\mathcal{N}}^{(l)}(v)} \tilde{A}_{uv} ( \hat{h}_u^{(l)} - \bar{h}_u^{(l)} ) + \sum_{u \in \mathcal{N}(v)} \tilde{A}_{uv} \bar{h}_u^{(l)} \\
# \hat{h}_v^{(l+1)} = \sigma ( \hat{z}_v^{(l+1)} W^{(l)} )
#
# In addition to the approximation, `Chen et
# al. <https://arxiv.org/abs/1710.10568>`__ apply a preprocessing trick
# that reduces the number of hops to sample by one. This trick
# works for models such as graph convolutional networks and GraphSAGE. It
# preprocesses the input layer: instead of taking :math:`X` as the input of
# the model, it computes :math:`U^{(0)}=\tilde{A}X` and uses :math:`U^{(0)}`
# as the input of the first layer. In this way, the vertices in the first
# layer do not need to aggregate over their neighborhoods, which reduces the
# number of layers to sample by one.
#
# For a giant graph, both :math:`\tilde{A}` and :math:`X` can be very
# large. We need to perform this operation in a distributed fashion. That
# is, each trainer takes part of the computation and the computation is
# distributed among all trainers. We can use ``update_all`` in the graph
# store to perform this computation.
#
# .. code:: python
#
#    g.update_all(fn.copy_src(src='features', out='m'),
#                 fn.sum(msg='m', out='preprocess'),
#                 lambda node : {'preprocess': node.data['preprocess'] * node.data['norm']})
#
# ``update_all`` in the graph store runs in a distributed fashion. That
# is, all trainers need to invoke this function and each takes part in the
# computation. When a trainer completes its portion, it waits for the other
# trainers to complete before proceeding with the rest of its computation.
#
# The node/edge data now live in the graph store and the access to the
# node/edge data is now a little different. The graph store no longer
# supports data access with ``g.ndata``/``g.edata``, which reads the
# entire node/edge data tensor. Instead, users have to use
# ``g.nodes[node_ids].data[embed_name]`` to access data on some nodes.
# (Note: this method is also allowed in ``DGLGraph`` and ``g.ndata`` is
# simply a short syntax for ``g.nodes[:].data``). In addition, the graph
# store supports ``get_n_repr``/``set_n_repr`` for node data and
# ``get_e_repr``/``set_e_repr`` for edge data.
#
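# For instance, a trainer could read the input features of a mini-batch of
# nodes roughly as follows (``nids`` is a hypothetical tensor of node IDs):
#
# .. code:: python
#
#    batch_feats = g.nodes[nids].data['features']
#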
# To initialize the node/edge tensors more efficiently, we provide two new
# methods in the graph store client to initialize node data and edge data
# (i.e., ``init_ndata`` for node data or ``init_edata`` for edge data).
# Under the hood, these two methods send initialization commands to the
# server, and the graph store server initializes the node/edge tensors on
# behalf of the trainers.
#
# Here we show how we should initialize node data for control-variate
# sampling. ``h_i`` stores the history of nodes in layer ``i``;
# ``agg_h_i`` stores the aggregation of the history of neighbor nodes in
# layer ``i``.
#
# .. code:: python
#
#    for i in range(n_layers):
#        g.init_ndata('h_{}'.format(i), (features.shape[0], args.n_hidden), 'float32')
#        g.init_ndata('agg_h_{}'.format(i), (features.shape[0], args.n_hidden), 'float32')
#
# After we initialize node data, we train GCN with control-variate
# sampling as below. The training code takes advantage of preprocessed
# input data in the first layer and works identically to the
# single-process training procedure.
#
# .. code:: python
#
#    for nf in NeighborSampler(g, batch_size, num_neighbors,
#                              neighbor_type='in', num_hops=L-1,
#                              seed_nodes=labeled_nodes):
#        for i in range(nf.num_blocks):
#            # aggregate history on the original graph
#            g.pull(nf.layer_parent_nid(i+1),
#                   fn.copy_src(src='h_{}'.format(i), out='m'),
#                   lambda node: {'agg_h_{}'.format(i): node.mailbox['m'].mean(axis=1)})
#        # We need to copy data in the NodeFlow to the right context.
#        nf.copy_from_parent(ctx=right_context)
#        nf.apply_layer(0, lambda node : {'h' : layer(node.data['preprocess'])})
#        h = nf.layers[0].data['h']
#
#        for i in range(nf.num_blocks):
#            prev_h = nf.layers[i].data['h_{}'.format(i)]
#            # compute delta_h, the difference of the current activation and the history
#            nf.layers[i].data['delta_h'] = h - prev_h
#            # refresh the old history
#            nf.layers[i].data['h_{}'.format(i)] = h.detach()
#            # aggregate the delta_h
#            nf.block_compute(i,
#                             fn.copy_src(src='delta_h', out='m'),
#                             lambda node: {'delta_h': node.mailbox['m'].mean(axis=1)})
#            delta_h = nf.layers[i + 1].data['delta_h']
#            agg_h = nf.layers[i + 1].data['agg_h_{}'.format(i)]
#            # control variate estimator
#            nf.layers[i + 1].data['h'] = delta_h + agg_h
#            nf.apply_layer(i + 1, lambda node : {'h' : layer(node.data['h'])})
#            h = nf.layers[i + 1].data['h']
#        # update history
#        nf.copy_to_parent()
#
# The complete example code can be found
# `here <https://github.com/dmlc/dgl/tree/master/examples/mxnet/sampling>`__.
#
# After showing how the shared-memory graph store is used with
# control-variate sampling, let’s see how to use it for multi-GPU training
# and how to optimize the training on a non-uniform memory access (NUMA)
# machine. A NUMA machine here means a machine with multiple processors
# and large memory. This approach works with any backend framework that
# supports multi-processing training. If we use MXNet as the
# backend, we can use the distributed MXNet kvstore to aggregate gradients
# among processes and use the MXNet launch tool to launch multiple workers
# that run the training script. The command below launches our example
# code for multi-processing GCN training with control variate sampling and
# it runs 4 trainers.
#
# ::
#
#    python3 ../incubator-mxnet/tools/launch.py -n 4 -s 1 --launcher local \
#        python3 examples/mxnet/sampling/multi_process_train.py \
#        --graph-name reddit \
#        --model gcn_cv --num-neighbors 1 \
#        --batch-size 2500 --test-batch-size 5000 \
#        --n-hidden 64
#
# ..
#
# It is fairly easy to enable multi-GPU training. All we need to do is to
# copy data to the right GPU context and invoke NodeFlow computation in that
# GPU context. As shown above, we specify a context ``right_context`` in
# ``copy_from_parent``.
#
# To optimize the computation on a NUMA machine, we need to configure each
# process properly. For example, we should use the same number of
# processes as the number of NUMA nodes (usually equivalent to the number
# of processors) and bind the processes to NUMA nodes. In addition, we
# should reduce the number of OpenMP threads to the number of CPU cores in
# a processor and reduce the number of threads of the MXNet kvstore to a
# small number such as 4.
#
# .. code:: python
#
#    import numa
#    import os
#    if 'DMLC_TASK_ID' in os.environ and int(os.environ['DMLC_TASK_ID']) < 4:
#        # bind the process to a NUMA node.
#        numa.bind([int(os.environ['DMLC_TASK_ID'])])
#        # Reduce the number of OpenMP threads to match the number of
#        # CPU cores of a processor.
#        os.environ['OMP_NUM_THREADS'] = '16'
#    else:
#        # Reduce the number of OpenMP threads in the MXNet KVstore server to 4.
#        os.environ['OMP_NUM_THREADS'] = '4'
#
# Given the configuration above, NUMA-aware multi-processing training can
# accelerate training by almost a factor of 4, as shown in the figure below,
# on an X1.32xlarge instance with 4 processors, each of which has 16 physical
# CPU cores. NUMA-unaware training cannot take advantage of the machine's
# computation power and is even slightly slower than using just one of the
# processors. NUMA-aware training, on the other hand, takes only about 20
# seconds to converge to an accuracy of 96% within 20 iterations.
#
# |image1|
#
# Distributed Sampler
# -------------------
#
# For many tasks, we found that sampling takes a significant fraction of the
# training time on a giant graph. DGL therefore supports distributed samplers
# to speed up sampling on giant graphs. DGL allows users to launch multiple
# samplers on different machines concurrently, and each sampler continuously
# sends its sampled subgraphs (``NodeFlow``) to the trainer machines.
#
# To use the distributed sampler on DGL, users start both trainer and
# sampler processes on different machines. Users can find the complete
# demo code and launch scripts `in this
# link <https://github.com/dmlc/dgl/tree/master/examples/mxnet/sampling/dis_sampling>`__
# and this tutorial will focus on the main difference between
# single-machine code and distributed code.
#
# For the trainer, developers can migrate existing single-machine sampling
# code to the distributed setting by changing just a few lines of code.
# First, users need to create a
# distributed ``SamplerReceiver`` object before training:
#
# .. code:: python
#
# sampler = dgl.contrib.sampling.SamplerReceiver(graph, ip_addr, num_sampler)
#
# The ``SamplerReceiver`` class is used for receiving remote subgraphs from
# other machines. This API has three arguments: ``parent_graph``,
# ``ip_address``, and ``number_of_samplers``.
#
# After that, developers can change just one line of existing
# single-machine training code like this:
#
# .. code:: python
#
#    for nf in sampler:
#        for i in range(nf.num_blocks):
#            # aggregate history on the original graph
#            g.pull(nf.layer_parent_nid(i+1),
#                   fn.copy_src(src='h_{}'.format(i), out='m'),
#                   lambda node: {'agg_h_{}'.format(i): node.mailbox['m'].mean(axis=1)})
#
#        ...
#
# Here, we use the code ``for nf in sampler`` to replace the original
# single-machine sampling code:
#
# .. code:: python
#
#    for nf in NeighborSampler(g, batch_size, num_neighbors,
#                              neighbor_type='in', num_hops=L-1,
#                              seed_nodes=labeled_nodes):
#
# All the other parts of the original single-machine code remain unchanged.
#
# In addition, developers need to write sampling logic on the sampler
# machine. For the neighbor sampler, developers can just copy their existing
# single-machine code to sampler machines like this:
#
# .. code:: python
#
#    sender = dgl.contrib.sampling.SamplerSender(trainer_address)
#
#    ...
#
#    for n in range(num_epoch):
#        for nf in dgl.contrib.sampling.NeighborSampler(graph, batch_size, num_neighbors,
#                                                       neighbor_type='in',
#                                                       shuffle=shuffle,
#                                                       num_workers=num_workers,
#                                                       num_hops=num_hops,
#                                                       add_self_loop=add_self_loop,
#                                                       seed_nodes=seed_nodes):
#            sender.send(nf, trainer_id)
#        # tell trainer I have finished current epoch
#        sender.signal(trainer_id)
#
# The figure below shows the overall performance improvement of training
# GCN and GraphSAGE on the Reddit dataset after deploying the
# optimizations in this tutorial. Our NUMA optimization speeds up the
# training by a factor of 4. The distributed sampler achieves an additional
# 20%-40% speedup, depending on the task.
#
# |image2|
#
# Scale to giant graphs
# ---------------------
#
# Finally, we would like to demonstrate the scalability of DGL with giant
# synthetic graphs. We create three large power-law graphs with
# `RMAT <http://www.cs.cmu.edu/~christos/PUBLICATIONS/siam04.pdf>`__. Each
# node is associated with 100 features and we compute node embeddings with
# 64 dimensions. The table below shows the training speed and memory consumption of
# GCN with neighbor sampling.
#
# ====== ====== ================== ===========
# #Nodes #Edges Time per epoch (s) Memory (GB)
# ====== ====== ================== ===========
# 5M 250M 4.7 8
# 50M 2.5B 46 75
# 500M 25B 505 740
# ====== ====== ================== ===========
#
# We can see that DGL can scale to graphs with up to 500M nodes and 25B
# edges.
#
# .. |image0| image:: https://data.dgl.ai/tutorial/sampling/arch.png
# .. |image1| image:: https://data.dgl.ai/tutorial/sampling/NUMA_speedup.png
# .. |image2| image:: https://data.dgl.ai/tutorial/sampling/whole_speedup.png
#
.. _tutorials5-index:
Training on giant graphs
=============================
* **Sampling** `[paper] <https://arxiv.org/abs/1710.10568>`__ `[tutorial]
<5_giant_graph/1_sampling_mx.html>`__ `[MXNet code]
<https://github.com/dmlc/dgl/tree/master/examples/mxnet/sampling>`__ `[Pytorch code]
<https://github.com/dmlc/dgl/tree/master/examples/pytorch/sampling>`__:
You can perform neighbor sampling and control-variate sampling to train a
graph convolution network and its variants on a giant graph.
* **Scale to giant graphs** `[tutorial] <5_giant_graph/2_giant.html>`__
`[MXNet code] <https://github.com/dmlc/dgl/tree/master/examples/mxnet/sampling>`__
`[Pytorch code]
<https://github.com/dmlc/dgl/tree/master/examples/pytorch/sampling>`__:
You can find two components (graph store and distributed sampler) to scale to
graphs with hundreds of millions of nodes.
...@@ -6,9 +6,24 @@ dgl.DGLGraph ...@@ -6,9 +6,24 @@ dgl.DGLGraph
.. currentmodule:: dgl .. currentmodule:: dgl
.. class:: DGLGraph .. class:: DGLGraph
Class for storing graph structure and node/edge feature data.
There are a few ways to create a DGLGraph:
* To create a homogeneous graph from Tensor data, use :func:`dgl.graph`.
* To create a heterogeneous graph from Tensor data, use :func:`dgl.heterograph`.
* To create a graph from other data sources, use ``dgl.*`` create ops. See
:ref:`api-graph-create-ops`.
Read the user guide chapter :ref:`guide-graph` for an in-depth explanation about its
usage.
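For example, both a homogeneous and a heterogeneous graph can be created directly
from edge tensors (an illustrative snippet; any tensor framework supported by DGL
works the same way):

.. code:: python

   >>> import dgl
   >>> import torch
   >>> # a homogeneous graph with edges 0->1, 1->2, 2->3
   >>> g = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([1, 2, 3])))
   >>> # a heterogeneous graph with one 'follows' relation between 'user' nodes
   >>> hg = dgl.heterograph({
   ...     ('user', 'follows', 'user'): (torch.tensor([0, 1]), torch.tensor([1, 2]))})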
Querying metagraph structure Querying metagraph structure
---------------------------- ----------------------------
Methods for getting information about the node and edge types. They are typically useful
when the graph is heterogeneous.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -19,12 +34,13 @@ Querying metagraph structure ...@@ -19,12 +34,13 @@ Querying metagraph structure
DGLGraph.canonical_etypes DGLGraph.canonical_etypes
DGLGraph.metagraph DGLGraph.metagraph
DGLGraph.to_canonical_etype DGLGraph.to_canonical_etype
DGLGraph.get_ntype_id
DGLGraph.get_etype_id
Querying graph structure Querying graph structure
------------------------ ------------------------
Methods for getting information about the graph structure such as capacity, connectivity,
neighborhood, etc.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -53,15 +69,20 @@ Querying graph structure ...@@ -53,15 +69,20 @@ Querying graph structure
Querying and manipulating sparse format Querying and manipulating sparse format
--------------------------------------- ---------------------------------------
Methods for getting or manipulating the internal storage formats of a ``DGLGraph``.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
DGLGraph.formats DGLGraph.formats
DGLGraph.create_format_ DGLGraph.create_format_
Querying and manipulating index data type Querying and manipulating node/edge ID type
----------------------------------------- -----------------------------------------
Methods for getting or manipulating the data type for storing structure-related
data such as node and edge IDs.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -72,6 +93,9 @@ Querying and manipulating index data type ...@@ -72,6 +93,9 @@ Querying and manipulating index data type
Using Node/edge features Using Node/edge features
------------------------ ------------------------
Methods for getting or setting the features of the nodes and edges.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -85,12 +109,14 @@ Using Node/edge features ...@@ -85,12 +109,14 @@ Using Node/edge features
DGLGraph.dstnodes DGLGraph.dstnodes
DGLGraph.srcdata DGLGraph.srcdata
DGLGraph.dstdata DGLGraph.dstdata
DGLGraph.local_var
DGLGraph.local_scope
Transforming graph Transforming graph
------------------ ------------------
Methods for generating a new graph by transforming the current one. Most of them
are aliases of the :ref:`api-subgraph-extraction` and :ref:`api-transform` APIs
under the ``dgl`` namespace.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -99,9 +125,16 @@ Transforming graph ...@@ -99,9 +125,16 @@ Transforming graph
DGLGraph.node_type_subgraph DGLGraph.node_type_subgraph
DGLGraph.edge_type_subgraph DGLGraph.edge_type_subgraph
DGLGraph.__getitem__ DGLGraph.__getitem__
DGLGraph.line_graph
DGLGraph.reverse
DGLGraph.add_self_loop
DGLGraph.remove_self_loop
DGLGraph.to_simple
Adjacency and incidence matrix
---------------------------------
Converting to other formats Methods for getting the adjacency and the incidence matrix of the graph.
---------------------------
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -114,6 +147,8 @@ Converting to other formats ...@@ -114,6 +147,8 @@ Converting to other formats
Computing with DGLGraph Computing with DGLGraph
----------------------------- -----------------------------
Methods for performing message passing, applying functions on node/edge features, etc.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -130,7 +165,11 @@ Computing with DGLGraph ...@@ -130,7 +165,11 @@ Computing with DGLGraph
DGLGraph.filter_edges DGLGraph.filter_edges
Querying batch summary Querying batch summary
---------------------- ---------------------------------
Methods for getting the batching information if the current graph is a batched
graph generated from :func:`dgl.batch`. They are also widely used in the
:ref:`api-batch`.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -142,6 +181,8 @@ Querying batch summary ...@@ -142,6 +181,8 @@ Querying batch summary
Mutating topology Mutating topology
----------------- -----------------
Methods for mutating the graph structure *in-place*.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -153,8 +194,20 @@ Mutating topology ...@@ -153,8 +194,20 @@ Mutating topology
Device Control Device Control
-------------- --------------
Methods for getting or changing the device on which the graph is hosted.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
DGLGraph.to DGLGraph.to
DGLGraph.device DGLGraph.device
Misc
----
Other utility methods.
.. autosummary::
:toctree: ../../generated/
DGLGraph.local_scope
...@@ -4,45 +4,42 @@ dgl.data ...@@ -4,45 +4,42 @@ dgl.data
========= =========
.. currentmodule:: dgl.data .. currentmodule:: dgl.data
.. automodule:: dgl.data
Dataset Classes Quick links:
---------------
DGL dataset * `Node Prediction Datasets`_
``````````` * `Edge Prediction Datasets`_
* `Graph Prediction Datasets`_
Base Dataset Class
---------------------------
.. autoclass:: DGLDataset .. autoclass:: DGLDataset
:members: download, save, load, process, has_cache, __getitem__, __len__ :members: download, save, load, process, has_cache, __getitem__, __len__
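A custom dataset typically subclasses ``DGLDataset`` and overrides a few of
these methods; a minimal sketch could look like this:

.. code:: python

   from dgl.data import DGLDataset

   class MyDataset(DGLDataset):
       def __init__(self):
           super(MyDataset, self).__init__(name='my_dataset')

       def process(self):
           # build the list of DGLGraph objects (and labels) here
           self.graphs = []

       def __getitem__(self, idx):
           return self.graphs[idx]

       def __len__(self):
           return len(self.graphs)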
DGL builtin dataset .. _sstdata:
```````````````````
.. autoclass:: DGLBuiltinDataset Node Prediction Datasets
:members: download ---------------------------------------
.. _sstdata: DGL hosted datasets for node classification/regression tasks.
Stanford sentiment treebank dataset Stanford sentiment treebank dataset
``````````````````````````````````` ```````````````````````````````````
For more information about the dataset, see `Sentiment Analysis <https://nlp.stanford.edu/sentiment/index.html>`__.
.. autoclass:: SSTDataset .. autoclass:: SSTDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _karateclubdata:
.. _karateclubdata:
Karate club dataset Karate club dataset
``````````````````````````````````` ```````````````````````````````````
.. autoclass:: KarateClubDataset .. autoclass:: KarateClubDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _citationdata: .. _citationdata:
Citation network dataset Citation network dataset
``````````````````````````````````` ```````````````````````````````````
.. autoclass:: CoraGraphDataset .. autoclass:: CoraGraphDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
...@@ -52,22 +49,13 @@ Citation network dataset ...@@ -52,22 +49,13 @@ Citation network dataset
.. autoclass:: PubmedGraphDataset .. autoclass:: PubmedGraphDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _kgdata: .. _corafulldata:
CoraFull dataset
Knowledge graph dataset
``````````````````````````````````` ```````````````````````````````````
.. autoclass:: CoraFullDataset
.. autoclass:: FB15k237Dataset
:members: __getitem__, __len__
.. autoclass:: FB15kDataset
:members: __getitem__, __len__
.. autoclass:: WN18Dataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _rdfdata: .. _rdfdata:
RDF datasets RDF datasets
``````````````````````````````````` ```````````````````````````````````
...@@ -83,19 +71,9 @@ RDF datasets ...@@ -83,19 +71,9 @@ RDF datasets
.. autoclass:: AMDataset .. autoclass:: AMDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _corafulldata:
CoraFull dataset
```````````````````````````````````
.. autoclass:: CoraFullDataset
:members: __getitem__, __len__
.. _amazoncobuydata: .. _amazoncobuydata:
Amazon Co-Purchase dataset Amazon Co-Purchase dataset
``````````````````````````````````` ```````````````````````````````````
.. autoclass:: AmazonCoBuyComputerDataset .. autoclass:: AmazonCoBuyComputerDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
...@@ -103,60 +81,90 @@ Amazon Co-Purchase dataset ...@@ -103,60 +81,90 @@ Amazon Co-Purchase dataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _coauthordata: .. _coauthordata:
Coauthor dataset Coauthor dataset
``````````````````````````````````` ```````````````````````````````````
.. autoclass:: CoauthorCSDataset .. autoclass:: CoauthorCSDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. autoclass:: CoauthorPhysicsDataset .. autoclass:: CoauthorPhysicsDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _bitcoinotcdata: .. _ppidata:
Protein-Protein Interaction dataset
BitcoinOTC dataset
``````````````````````````````````` ```````````````````````````````````
.. autoclass:: PPIDataset
:members: __getitem__, __len__
.. autoclass:: BitcoinOTCDataset .. _redditdata:
Reddit dataset
``````````````
.. autoclass:: RedditDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _sbmdata:
Symmetric Stochastic Block Model Mixture dataset
````````````````````````````````````````````````
.. autoclass:: SBMMixtureDataset
:members: __getitem__, __len__, collate_fn
ICEWS18 dataset
```````````````````````````````````
.. autoclass:: ICEWS18Dataset Edge Prediction Datasets
:members: __getitem__, __len__ ---------------------------------------
.. _qm7bdata: DGL hosted datasets for edge classification/regression and link prediction tasks.
QM7b dataset .. _kgdata:
Knowledge graph dataset
``````````````````````````````````` ```````````````````````````````````
.. autoclass:: QM7bDataset .. autoclass:: FB15k237Dataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. autoclass:: FB15kDataset
:members: __getitem__, __len__
.. autoclass:: WN18Dataset
:members: __getitem__, __len__
GDELT dataset .. _bitcoinotcdata:
BitcoinOTC dataset
```````````````````````````````````
.. autoclass:: BitcoinOTCDataset
:members: __getitem__, __len__
ICEWS18 dataset
``````````````````````````````````` ```````````````````````````````````
.. autoclass:: ICEWS18Dataset
:members: __getitem__, __len__
GDELT dataset
```````````````````````````````````
.. autoclass:: GDELTDataset .. autoclass:: GDELTDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _minigcdataset:
Graph Prediction Datasets
---------------------------------------
DGL hosted datasets for graph classification/regression tasks.
.. _qm7bdata:
QM7b dataset
```````````````````````````````````
.. autoclass:: QM7bDataset
:members: __getitem__, __len__
.. _minigcdataset:
Mini graph classification dataset Mini graph classification dataset
````````````````````````````````` `````````````````````````````````
.. autoclass:: MiniGCDataset .. autoclass:: MiniGCDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _tudata: .. _tudata:
TU dataset TU dataset
`````````` ``````````
.. autoclass:: TUDataset .. autoclass:: TUDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
...@@ -164,41 +172,14 @@ TU dataset ...@@ -164,41 +172,14 @@ TU dataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _gindataset: .. _gindataset:
Graph isomorphism network dataset Graph isomorphism network dataset
``````````````````````````````````` ```````````````````````````````````
A compact subset of graph kernel dataset A compact subset of graph kernel dataset
.. autoclass:: GINDataset .. autoclass:: GINDataset
:members: __getitem__, __len__ :members: __getitem__, __len__
.. _ppidata: Utilities
-----------------
Protein-Protein Interaction dataset
```````````````````````````````````
.. autoclass:: PPIDataset
:members: __getitem__, __len__
.. _redditdata:
Reddit dataset
``````````````
.. autoclass:: RedditDataset
:members: __getitem__, __len__
.. _sbmdata:
Symmetric Stochastic Block Model Mixture dataset
````````````````````````````````````````````````
.. autoclass:: SBMMixtureDataset
:members: __getitem__, __len__, collate_fn
Utils
-----
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -214,4 +195,3 @@ Utils ...@@ -214,4 +195,3 @@ Utils
.. autoclass:: dgl.data.utils.Subset .. autoclass:: dgl.data.utils.Subset
:members: __getitem__, __len__ :members: __getitem__, __len__
...@@ -7,47 +7,29 @@ dgl.dataloading ...@@ -7,47 +7,29 @@ dgl.dataloading
DataLoaders DataLoaders
----------- -----------
PyTorch node/edge DataLoaders
`````````````````````````````
.. currentmodule:: dgl.dataloading.pytorch .. currentmodule:: dgl.dataloading.pytorch
DGL's DataLoaders for mini-batch training work similarly to PyTorch's DataLoader.
They have a generator interface that returns mini-batches sampled from the given graphs.
DGL provides two DataLoaders: a ``NodeDataLoader`` for node classification tasks
and an ``EdgeDataLoader`` for edge/link prediction tasks.
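For example, a typical node classification training loop with ``NodeDataLoader``
looks roughly like the following sketch, where ``g``, ``train_nids``, ``node_features``
and ``model`` are placeholders for the user's graph, training node IDs, feature tensor
and model:

.. code:: python

   sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10])
   dataloader = dgl.dataloading.NodeDataLoader(
       g, train_nids, sampler, batch_size=1024, shuffle=True, drop_last=False)
   for input_nodes, output_nodes, blocks in dataloader:
       # `blocks` holds one bipartite graph per GNN layer; the model computes
       # the representations of `output_nodes` from the features of `input_nodes`
       batch_pred = model(blocks, node_features[input_nodes])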
.. autoclass:: NodeDataLoader .. autoclass:: NodeDataLoader
.. autoclass:: EdgeDataLoader .. autoclass:: EdgeDataLoader
General collating functions
```````````````````````````
.. currentmodule:: dgl.dataloading.dataloader
.. autoclass:: Collator
:members: dataset, collate
.. autoclass:: NodeCollator
:members: dataset, collate
:show-inheritance:
.. autoclass:: EdgeCollator
:members: dataset, collate
:show-inheritance:
.. _api-dataloading-neighbor-sampling: .. _api-dataloading-neighbor-sampling:
Neighbor Sampler
Neighborhood Sampling Classes
----------------------------- -----------------------------
.. currentmodule:: dgl.dataloading.neighbor
Base Multi-layer Neighborhood Sampling Class Neighbor samplers are classes that control the behavior of ``DataLoader`` s
```````````````````````````````````````````` to sample neighbors. All of them inherit the base :class:`BlockSampler` class, but implement
different neighbor sampling strategies by overriding the ``sample_frontier`` or
the ``sample_blocks`` methods.
.. autoclass:: BlockSampler .. autoclass:: BlockSampler
:members: sample_frontier, sample_blocks :members: sample_frontier, sample_blocks
Uniform Node-wise Neighbor Sampling (GraphSAGE style)
`````````````````````````````````````````````````````
.. currentmodule:: dgl.dataloading.neighbor
.. autoclass:: MultiLayerNeighborSampler .. autoclass:: MultiLayerNeighborSampler
:members: sample_frontier :members: sample_frontier
:show-inheritance: :show-inheritance:
...@@ -59,8 +41,10 @@ Uniform Node-wise Neighbor Sampling (GraphSAGE style) ...@@ -59,8 +41,10 @@ Uniform Node-wise Neighbor Sampling (GraphSAGE style)
Negative Samplers for Link Prediction Negative Samplers for Link Prediction
------------------------------------- -------------------------------------
.. currentmodule:: dgl.dataloading.negative_sampler .. currentmodule:: dgl.dataloading.negative_sampler
Negative samplers are classes that control the behavior of the ``EdgeDataLoader``
to generate negative edges.
.. autoclass:: Uniform .. autoclass:: Uniform
:members: __call__ :members: __call__
...@@ -74,6 +74,11 @@ can be used in any `autograd` system. Also, built-in functions can be used not o ...@@ -74,6 +74,11 @@ can be used in any `autograd` system. Also, built-in functions can be used not o
or ``apply_edges`` as shown in the example, but wherever message and reduce functions are or ``apply_edges`` as shown in the example, but wherever message and reduce functions are
required (e.g. ``pull``, ``push``, ``send_and_recv``). required (e.g. ``pull``, ``push``, ``send_and_recv``).
.. _api-built-in:
DGL Built-in Function
-------------------------
Here is a cheatsheet of all the DGL built-in functions. Here is a cheatsheet of all the DGL built-in functions.
+-------------------------+-----------------------------------------------------------------+-----------------------+ +-------------------------+-----------------------------------------------------------------+-----------------------+
......
...@@ -4,10 +4,15 @@ dgl ...@@ -4,10 +4,15 @@ dgl
============================= =============================
.. currentmodule:: dgl .. currentmodule:: dgl
.. automodule:: dgl
.. _api-graph-create-ops:
Graph Create Ops Graph Create Ops
------------------------- -------------------------
Operators for constructing :class:`DGLGraph` from raw data formats.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -24,9 +29,11 @@ Graph Create Ops ...@@ -24,9 +29,11 @@ Graph Create Ops
.. _api-subgraph-extraction: .. _api-subgraph-extraction:
Subgraph Extraction Routines Subgraph Extraction Ops
------------------------------------- -------------------------------------
Operators for extracting and returning subgraphs.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -37,8 +44,12 @@ Subgraph Extraction Routines ...@@ -37,8 +44,12 @@ Subgraph Extraction Routines
in_subgraph in_subgraph
out_subgraph out_subgraph
Graph Mutation Routines .. _api-transform:
---------------------------------
Graph Transform Ops
----------------------------------
Operators for generating new graphs by manipulating the structure of the existing ones.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -50,13 +61,6 @@ Graph Mutation Routines ...@@ -50,13 +61,6 @@ Graph Mutation Routines
add_self_loop add_self_loop
remove_self_loop remove_self_loop
add_reverse_edges add_reverse_edges
Graph Transform Routines
----------------------------------
.. autosummary::
:toctree: ../../generated/
reverse reverse
to_bidirected to_bidirected
to_simple to_simple
...@@ -69,9 +73,14 @@ Graph Transform Routines ...@@ -69,9 +73,14 @@ Graph Transform Routines
khop_graph khop_graph
metapath_reachable_graph metapath_reachable_graph
Batching and Reading Out .. _api-batch:
Batching and Reading Out Ops
------------------------------- -------------------------------
Operators for batching multiple graphs into one for batch processing, and
operators for computing graph-level representations for both single and batched graphs.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -92,18 +101,22 @@ Batching and Reading Out ...@@ -92,18 +101,22 @@ Batching and Reading Out
topk_nodes topk_nodes
topk_edges topk_edges
Adjacency Related Routines Adjacency Related Utilities
------------------------------- -------------------------------
Utilities for computing the adjacency and Laplacian matrices.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
khop_adj khop_adj
laplacian_lambda_max laplacian_lambda_max
Propagate Messages by Traversals Traversals
------------------------------------------ ------------------------------------------
Utilities for traversing graphs.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
...@@ -115,6 +128,9 @@ Propagate Messages by Traversals ...@@ -115,6 +128,9 @@ Propagate Messages by Traversals
Utilities Utilities
----------------------------------------------- -----------------------------------------------
Other utilities for controlling randomness, saving and loading graphs, etc.
.. autosummary:: .. autosummary::
:toctree: ../../generated/ :toctree: ../../generated/
......
...@@ -5,7 +5,7 @@ dgl.sampling ...@@ -5,7 +5,7 @@ dgl.sampling
.. automodule:: dgl.sampling .. automodule:: dgl.sampling
Random walk sampling functions Random walk
------------------------------ ------------------------------
.. autosummary:: .. autosummary::
...@@ -14,7 +14,7 @@ Random walk sampling functions ...@@ -14,7 +14,7 @@ Random walk sampling functions
random_walk random_walk
pack_traces pack_traces
Neighbor sampling functions Neighbor sampling
--------------------------- ---------------------------
.. autosummary:: .. autosummary::
...@@ -22,8 +22,4 @@ Neighbor sampling functions ...@@ -22,8 +22,4 @@ Neighbor sampling functions
sample_neighbors sample_neighbors
select_topk select_topk
PinSAGESampler
Builtin sampler classes for more complicated sampling algorithms
----------------------------------------------------------------
.. autoclass:: RandomWalkNeighborSampler
.. autoclass:: PinSAGESampler
...@@ -5,11 +5,12 @@ API Reference ...@@ -5,11 +5,12 @@ API Reference
:maxdepth: 2 :maxdepth: 2
dgl dgl
dgl.DGLGraph
dgl.data dgl.data
nn
dgl.ops
dgl.function
sampling
dgl.dataloading dgl.dataloading
dgl.DGLGraph
dgl.distributed dgl.distributed
dgl.function
nn
dgl.ops
dgl.sampling
udf
...@@ -3,15 +3,6 @@ ...@@ -3,15 +3,6 @@
NN Modules (MXNet) NN Modules (MXNet)
=================== ===================
.. contents:: Contents
:local:
We welcome your contribution! If you want a model to be implemented in DGL as a NN module,
please `create an issue <https://github.com/dmlc/dgl/issues>`_ started with "[Feature Request] NN Module XXXModel".
If you want to contribute a NN module, please `create a pull request <https://github.com/dmlc/dgl/pulls>`_ started
with "[NN] XXXModel in MXNet NN Modules" and our team member would review this PR.
Conv Layers Conv Layers
---------------------------------------- ----------------------------------------
......
...@@ -3,15 +3,6 @@ ...@@ -3,15 +3,6 @@
NN Modules (PyTorch) NN Modules (PyTorch)
==================== ====================
.. contents:: Contents
:local:
We welcome your contribution! If you want a model to be implemented in DGL as a NN module,
please `create an issue <https://github.com/dmlc/dgl/issues>`_ started with "[Feature Request] NN Module XXXModel".
If you want to contribute a NN module, please `create a pull request <https://github.com/dmlc/dgl/pulls>`_ started
with "[NN] XXXModel in PyTorch NN Modules" and our team member would review this PR.
.. _apinn-pytorch-conv: .. _apinn-pytorch-conv:
Conv Layers Conv Layers
......
...@@ -3,15 +3,6 @@ ...@@ -3,15 +3,6 @@
NN Modules (Tensorflow) NN Modules (Tensorflow)
==================== ====================
.. contents:: Contents
:local:
We welcome your contribution! If you want a model to be implemented in DGL as a NN module,
please `create an issue <https://github.com/dmlc/dgl/issues>`_ started with "[Feature Request] NN Module XXXModel".
If you want to contribute a NN module, please `create a pull request <https://github.com/dmlc/dgl/pulls>`_ started
with "[NN] XXXModel in tensorflow NN Modules" and our team member would review this PR.
Conv Layers Conv Layers
---------------------------------------- ----------------------------------------
......
.. _apiudf: .. _apiudf:
dgl.udf User-defined Function
================================================== ==================================================
.. currentmodule:: dgl.udf .. currentmodule:: dgl.udf
......
...@@ -103,14 +103,15 @@ Getting Started ...@@ -103,14 +103,15 @@ Getting Started
:glob: :glob:
api/python/dgl api/python/dgl
api/python/dgl.DGLGraph
api/python/dgl.data api/python/dgl.data
api/python/nn
api/python/dgl.ops
api/python/dgl.function
api/python/sampling
api/python/dgl.dataloading api/python/dgl.dataloading
api/python/dgl.DGLGraph
api/python/dgl.distributed api/python/dgl.distributed
api/python/dgl.function
api/python/nn
api/python/dgl.ops
api/python/dgl.sampling
api/python/udf
.. toctree:: .. toctree::
:maxdepth: 3 :maxdepth: 3
......
"""DGL root package.""" """
The ``dgl`` package contains data structures for storing structural and feature data
(i.e., the :class:`DGLGraph` class) and also utilities for generating, manipulating
and transforming graphs.
"""
# Windows compatibility
# This initializes Winsock and performs cleanup at termination as required
import socket
@@ -10,7 +10,7 @@ from . import utils

__all__ = ['batch', 'unbatch', 'batch_hetero', 'unbatch_hetero']

def batch(graphs, ndata=ALL, edata=ALL, *, node_attrs=None, edge_attrs=None):
r"""Batch a collection of :class:`DGLGraph` s into one graph for more efficient
graph computation.

Each input graph becomes one disjoint component of the batched graph. The nodes

@@ -35,8 +35,8 @@ def batch(graphs, ndata=ALL, edata=ALL, *, node_attrs=None, edge_attrs=None):

The numbers of nodes and edges of the input graphs are accessible via the
:func:`DGLGraph.batch_num_nodes` and :func:`DGLGraph.batch_num_edges` attributes
of the resulting graph. For homogeneous graphs, they are 1D integer tensors,
with each element being the number of nodes/edges of the corresponding input graph. For
heterographs, they are dictionaries of 1D integer tensors, with node
type or edge type as the keys.

@@ -46,7 +46,7 @@ def batch(graphs, ndata=ALL, edata=ALL, *, node_attrs=None, edge_attrs=None):

By default, node/edge features are batched by concatenating the feature tensors
of all input graphs. This thus requires features of the same name to have
the same data type and feature size. One can pass ``None`` to the ``ndata``
or ``edata`` argument to prevent feature batching, or pass a list of strings
to specify which features to batch.

To unbatch the graph back to a list, use the :func:`dgl.unbatch` function.
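For instance, a minimal round trip might look like the following (a sketch for illustration only, assuming the PyTorch backend):

>>> import dgl
>>> import torch as th
>>> g1 = dgl.graph((th.tensor([0, 1]), th.tensor([1, 2])))
>>> g1.ndata['h'] = th.zeros(3, 4)
>>> g2 = dgl.graph((th.tensor([0]), th.tensor([1])))
>>> g2.ndata['h'] = th.ones(2, 4)
>>> bg = dgl.batch([g1, g2])          # features 'h' are concatenated along dimension 0
>>> bg.batch_num_nodes()
tensor([3, 2])
>>> len(dgl.unbatch(bg))              # recover the two original graphs
2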
@@ -68,7 +68,7 @@ def batch(graphs, ndata=ALL, edata=ALL, *, node_attrs=None, edge_attrs=None):

Examples
--------

Batch homogeneous graphs

>>> import dgl
>>> import torch as th
@@ -251,13 +251,13 @@ def unbatch(g, node_split=None, edge_split=None):

"""Revert the batch operation by splitting the given graph into a list of small ones.

This is the reverse operation of :func:`dgl.batch`. If the ``node_split``
or the ``edge_split`` is not given, it calls :func:`DGLGraph.batch_num_nodes`
and :func:`DGLGraph.batch_num_edges` of the input graph to get the information.

If the ``node_split`` or the ``edge_split`` arguments are given,
it will partition the graph according to the given segments. One must ensure
that the partition is valid -- edges of the i-th graph only connect nodes
belonging to the i-th graph. Otherwise, DGL will throw an error.

The function supports heterograph input, in which case the two split
section arguments shall be of dictionary type -- similar to the
@@ -35,7 +35,7 @@ def graph(data,

idtype=None,
device=None,
**deprecated_kwargs):
"""Create a graph and return.

Parameters
----------
@@ -199,7 +199,7 @@ def heterograph(data_dict,

num_nodes_dict=None,
idtype=None,
device=None):
"""Create a heterogeneous graph and return.

Parameters
----------
@@ -354,33 +354,34 @@ def heterograph(data_dict,

def to_heterogeneous(G, ntypes, etypes, ntype_field=NTYPE,
                     etype_field=ETYPE, metagraph=None):
"""Convert a homogeneous graph to a heterogeneous graph and return.

The input graph should have only one type of nodes and edges. Each node and edge
stores an integer feature as its type ID (specified by :attr:`ntype_field` and
:attr:`etype_field`). DGL uses it to retrieve the type names stored in the given
:attr:`ntypes` and :attr:`etypes` arguments.

The function will automatically distinguish edge types that have the same given
type IDs but different src and dst type IDs. For example, it allows both edges A and B
to have the same type ID 0, but one has (0, 1) and the other has (2, 3) as the
(src, dst) type IDs. In this case, the function will "split" edge type 0 into two types:
(0, ty_A, 1) and (2, ty_B, 3). In other words, these two edges share the same edge
type name, but can be distinguished by an edge type triplet.

The function stores the node and edge IDs of the input graph using the ``dgl.NID``
and ``dgl.EID`` names in the ``ndata`` and ``edata`` of the resulting graph.
It also copies any node/edge features from :attr:`G` to the returned heterogeneous
graph, except for reserved fields for storing type IDs (``dgl.NTYPE`` and ``dgl.ETYPE``)
and node/edge IDs (``dgl.NID`` and ``dgl.EID``).
Parameters
----------
G : DGLGraph
    The homogeneous graph.
ntypes : list[str]
    The node type names.
etypes : list[str]
    The edge type names.
ntype_field : str, optional
    The feature field used to store node type. (Default: ``dgl.NTYPE``)

@@ -389,24 +390,18 @@ def to_heterogeneous(G, ntypes, etypes, ntype_field=NTYPE,

metagraph : networkx MultiDiGraph, optional
    Metagraph of the returned heterograph.
    If provided, DGL assumes that G can indeed be described with the given metagraph.
    If None, DGL will infer the metagraph from the given inputs, which could be
    costly for large graphs.

Returns
-------
DGLGraph
    A heterogeneous graph.
Notes
-----
The returned node and edge types may not necessarily be in the same order as
``ntypes`` and ``etypes``.

The node IDs of a single type in the returned heterogeneous graph are ordered
the same as the nodes with the same ``ntype_field`` feature. Edge IDs of
a single type are similar.

Examples
--------
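A minimal round-trip sketch (for illustration only, assuming the PyTorch backend):

>>> import dgl
>>> import torch as th
>>> hg = dgl.heterograph({
...     ('user', 'develops', 'game'): (th.tensor([0, 1]), th.tensor([0, 1]))})
>>> g = dgl.to_homogeneous(hg)        # type IDs kept in dgl.NTYPE / dgl.ETYPE
>>> hg2 = dgl.to_heterogeneous(g, hg.ntypes, hg.etypes)
>>> hg2.canonical_etypes
[('user', 'develops', 'game')]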
@@ -568,15 +563,15 @@ def to_hetero(G, ntypes, etypes, ntype_field=NTYPE, etype_field=ETYPE,

                     etype_field=etype_field, metagraph=metagraph)
def to_homogeneous(G, ndata=None, edata=None):
"""Convert a heterogeneous graph to a homogeneous graph and return.

The returned graph has only one type of nodes and edges.

Node and edge types of the input graph are stored as the ``dgl.NTYPE``
and ``dgl.ETYPE`` features in the returned graph.
Each feature is an integer representing the type id, determined by the
:meth:`DGLGraph.get_ntype_id` and :meth:`DGLGraph.get_etype_id` methods.

The function also stores the original node/edge IDs as the ``dgl.NID``
and ``dgl.EID`` features in the returned graph.
Parameters
----------

@@ -596,8 +591,7 @@ def to_homogeneous(G, ndata=None, edata=None):

Returns
-------
DGLGraph
    A homogeneous graph.

Examples
--------
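A small sketch of the conversion (illustrative only, assuming the PyTorch backend):

>>> import dgl
>>> import torch as th
>>> hg = dgl.heterograph({
...     ('user', 'follows', 'user'): (th.tensor([0, 1]), th.tensor([1, 2])),
...     ('user', 'plays', 'game'): (th.tensor([0, 2]), th.tensor([0, 1]))})
>>> g = dgl.to_homogeneous(hg)
>>> g.num_nodes()           # 3 user nodes + 2 game nodes
5
>>> # g.ndata[dgl.NTYPE] holds the per-node type IDs and
>>> # g.ndata[dgl.NID] the original per-type node IDs.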
@@ -695,7 +689,7 @@ def from_scipy(sp_mat,

eweight_name=None,
idtype=None,
device=None):
"""Create a graph from a SciPy sparse matrix and return.
Parameters
----------
@@ -785,7 +779,7 @@ def bipartite_from_scipy(sp_mat,

eweight_name=None,
idtype=None,
device=None):
"""Create a uni-directional bipartite graph from a SciPy sparse matrix and return.

The created graph will have two types of nodes ``utype`` and ``vtype`` as well as one
edge type ``etype`` whose edges are from ``utype`` to ``vtype``.
@@ -881,8 +875,9 @@ def from_networkx(nx_graph,

edge_id_attr_name=None,
idtype=None,
device=None):
"""Create a graph from a NetworkX graph and return.

.. note::

    Creating a DGLGraph from a NetworkX graph is not fast, especially for large graphs.
    It is recommended to first convert a NetworkX graph into a tuple of node-tensors
    and then construct a DGLGraph with :func:`dgl.graph`.
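As a quick sketch (illustrative only, assuming the PyTorch backend is active):

>>> import dgl
>>> import networkx as nx
>>> nx_g = nx.path_graph(3)            # undirected 0 -- 1 -- 2
>>> g = dgl.from_networkx(nx_g)        # each undirected edge yields two directed edges
>>> g.num_edges()
4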
@@ -903,7 +898,7 @@ def from_networkx(nx_graph,

The names of the edge attributes to retrieve from the NetworkX graph. If given, DGL
stores the retrieved edge attributes in ``edata`` of the returned graph using their
original names. The attribute data must be convertible to Tensor type (e.g., scalar,
``numpy.ndarray``, list, etc.). It must be None if :attr:`nx_graph` is undirected.
edge_id_attr_name : str, optional
    The name of the edge attribute that stores the edge IDs. If given, DGL will assign edge
    IDs accordingly when creating the graph, so the attribute must be valid IDs, i.e.
@@ -1046,11 +1041,12 @@ def bipartite_from_networkx(nx_graph,

edge_id_attr_name=None,
idtype=None,
device=None):
"""Create a unidirectional bipartite graph from a NetworkX graph and return.

The created graph will have two types of nodes ``utype`` and ``vtype`` as well as one
edge type ``etype`` whose edges are from ``utype`` to ``vtype``.

.. note::

    Creating a DGLGraph from a NetworkX graph is not fast, especially for large graphs.
    It is recommended to first convert a NetworkX graph into a tuple of node-tensors
    and then construct a DGLGraph with :func:`dgl.heterograph`.
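A small sketch of the expected input format (illustrative only; the type names ``'user'``, ``'clicks'`` and ``'item'`` are made up, and the node sets are marked with the standard NetworkX ``bipartite`` attribute):

>>> import dgl
>>> import networkx as nx
>>> nx_g = nx.DiGraph()
>>> nx_g.add_nodes_from([0, 1], bipartite=0)       # source-side (utype) nodes
>>> nx_g.add_nodes_from([2, 3, 4], bipartite=1)    # destination-side (vtype) nodes
>>> nx_g.add_edges_from([(0, 2), (1, 4)])
>>> g = dgl.bipartite_from_networkx(nx_g, utype='user', etype='clicks', vtype='item')
>>> g.num_nodes('user'), g.num_nodes('item')
(2, 3)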
@@ -1074,7 +1070,7 @@ def bipartite_from_networkx(nx_graph,

The names of the node attributes for node type :attr:`utype` to retrieve from the
NetworkX graph. If given, DGL stores the retrieved node attributes in
``nodes[utype].data`` of the returned graph using their original names. The attribute
data must be convertible to Tensor type (e.g., scalar, ``numpy.ndarray``, list, etc.).
e_attrs : list[str], optional
    The names of the edge attributes to retrieve from the NetworkX graph. If given, DGL
    stores the retrieved edge attributes in ``edata`` of the returned graph using their
@@ -1242,14 +1238,16 @@ def bipartite_from_networkx(nx_graph,

    return g.to(device)

def to_networkx(g, node_attrs=None, edge_attrs=None):
"""Convert a homogeneous graph to a NetworkX graph and return.

The resulting NetworkX graph also contains the node/edge features of the input graph.
Additionally, DGL saves the edge IDs as the ``'id'`` edge attribute in the
returned NetworkX graph.

Parameters
----------
g : DGLGraph
    A homogeneous graph.
node_attrs : iterable of str, optional
    The node attributes to copy from ``g.ndata``. (Default: None)
edge_attrs : iterable of str, optional

@@ -1260,6 +1258,10 @@ def to_networkx(g, node_attrs=None, edge_attrs=None):

networkx.DiGraph
    The converted NetworkX graph.

Notes
-----
The function only supports CPU graph input.
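A brief sketch (illustrative only, assuming the PyTorch backend and a graph on CPU):

>>> import dgl
>>> import torch as th
>>> g = dgl.graph((th.tensor([0, 1, 2]), th.tensor([1, 2, 0])))
>>> g.ndata['h'] = th.zeros(3, 2)
>>> nx_g = dgl.to_networkx(g, node_attrs=['h'])   # copies 'h' onto the NetworkX nodes
>>> nx_g.number_of_nodes(), nx_g.number_of_edges()
(3, 3)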
Examples
--------
The following example uses PyTorch backend.
"""Data related package.""" """The ``dgl.data`` package contains datasets hosted by DGL and also utilities
for downloading, processing, saving and loading data from external resources.
"""
from __future__ import absolute_import from __future__ import absolute_import
from . import citation_graph as citegrh from . import citation_graph as citegrh