"docs/vscode:/vscode.git/clone" did not exist on "dc072dca7de0cc8c5465de79fd0e4a437155d24d"
Unverified Commit d41d07d0 authored by Quan (Andy) Gan, committed by GitHub

[Doc and bugfix] Add docs and user guide and update tutorial for sampling pipeline (#3774)



* huuuuge update

* remove

* lint

* lint

* fix

* what happened to nccl

* update multi-gpu unsupervised graphsage example

* replace most of the dgl.mp.process with torch.mp.spawn

* update if condition for use_uva case

* update user guide

* address comments

* incorporating suggestions from @jermainewang

* oops

* fix tutorial to pass CI

* oops

* fix again
Co-authored-by: Xin Yao <xiny@nvidia.com>
parent 3bd5a9b6
......@@ -4,78 +4,48 @@ dgl.dataloading
=================================
.. automodule:: dgl.dataloading
.. currentmodule:: dgl.dataloading
DataLoaders
-----------
.. currentmodule:: dgl.dataloading.pytorch
DGL's DataLoader for mini-batch training works similarly to PyTorch's DataLoader.
It has a generator interface that returns mini-batches sampled from the given graphs.
DGL provides two DataLoaders: a ``NodeDataLoader`` for node classification tasks
and an ``EdgeDataLoader`` for edge/link prediction tasks.
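For example, a minimal sketch of node-wise mini-batch iteration could look like the
following (``g`` and ``train_nids`` are assumed to be a prepared graph and a 1-D tensor
of training node IDs):

.. code:: python

    sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
    dataloader = dgl.dataloading.NodeDataLoader(
        g, train_nids, sampler, batch_size=1024, shuffle=True)
    for input_nodes, output_nodes, blocks in dataloader:
        ...  # train on the sampled blocks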
.. autoclass:: NodeDataLoader
.. autoclass:: EdgeDataLoader
.. autoclass:: GraphDataLoader
.. autoclass:: DistNodeDataLoader
.. autoclass:: DistEdgeDataLoader
.. _api-dataloading-neighbor-sampling:
Neighbor Sampler
----------------
.. currentmodule:: dgl.dataloading.neighbor
Neighbor samplers are classes that control the behavior of ``DataLoader`` s
to sample neighbors. All of them inherit the base :class:`BlockSampler` class, but implement
different neighbor sampling strategies by overriding the ``sample_frontier`` or
the ``sample_blocks`` methods.
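For instance, a fixed-fanout sampler and a full-neighbor sampler can be created as
follows (a brief illustration of the classes documented below):

.. code:: python

    # sample at most 15, 10 and 5 neighbors for the three GNN layers
    sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
    # take all neighbors in every layer of a 2-layer GNN
    full_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)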
.. autoclass:: BlockSampler
:members: sample_frontier, sample_blocks, sample
.. autoclass:: MultiLayerNeighborSampler
:members: sample_frontier
:show-inheritance:
.. autoclass:: MultiLayerFullNeighborSampler
:show-inheritance:
.. autosummary::
:toctree: ../../generated/
Subgraph Iterators
------------------
Subgraph iterators iterate over the original graph as a sequence of subgraphs. One should use
subgraph iterators with ``GraphDataLoader`` as follows:
DataLoader
NodeDataLoader
EdgeDataLoader
GraphDataLoader
DistNodeDataLoader
DistEdgeDataLoader
.. code:: python
sgiter = dgl.dataloading.ClusterGCNSubgraphIterator(
g, num_partitions=100, cache_directory='.', refresh=True)
dataloader = dgl.dataloading.GraphDataLoader(sgiter, batch_size=4, num_workers=0)
for subgraph_batch in dataloader:
train_on(subgraph_batch)
.. autoclass:: dgl.dataloading.dataloader.SubgraphIterator
.. autoclass:: dgl.dataloading.cluster_gcn.ClusterGCNSubgraphIterator
.. _api-dataloading-neighbor-sampling:
ShaDow-GNN Subgraph Sampler
---------------------------
.. currentmodule:: dgl.dataloading.shadow
Samplers
--------
.. autoclass:: ShaDowKHopSampler
.. autosummary::
:toctree: ../../generated/
.. _api-dataloading-collators:
Sampler
BlockSampler
NeighborSampler
MultiLayerFullNeighborSampler
ClusterGCNSampler
ShaDowKHopSampler
Collators
---------
.. currentmodule:: dgl.dataloading
Sampler Transformations
-----------------------
Collators are platform-agnostic classes that generate the mini-batches
given the graphs and the indices to sample from.
.. autosummary::
:toctree: ../../generated/
.. autoclass:: NodeCollator
.. autoclass:: EdgeCollator
.. autoclass:: GraphCollator
as_edge_prediction_sampler
.. _api-dataloading-negative-sampling:
......@@ -83,30 +53,24 @@ Negative Samplers for Link Prediction
-------------------------------------
.. currentmodule:: dgl.dataloading.negative_sampler
Negative samplers are classes that control the behavior of the ``EdgeDataLoader``
to generate negative edges.
.. autoclass:: Uniform
:members: __call__
Negative samplers are classes that control the behavior of the edge prediction samplers
to generate negative edges.
.. autoclass:: GlobalUniform
:members: __call__
Async Copying to/from GPUs
--------------------------
.. currentmodule:: dgl.dataloading
.. autosummary::
:toctree: ../../generated/
Data can be copied from the CPU to the GPU
while the GPU is being used for
computation, using the :class:`AsyncTransferer`.
For the transfer to be fully asynchronous, the context the
:class:`AsyncTransferer`
is created with must be a GPU context, and the input tensor must be in
pinned memory.
Uniform
PerSourceUniform
GlobalUniform
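As a brief illustration, a negative sampler is usually passed to
:func:`~dgl.dataloading.as_edge_prediction_sampler` together with a neighbor sampler:

.. code:: python

    sampler = dgl.dataloading.NeighborSampler([15, 10, 5])
    sampler = dgl.dataloading.as_edge_prediction_sampler(
        sampler, negative_sampler=dgl.dataloading.negative_sampler.Uniform(5))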
Utility Class and Functions for Feature Prefetching
---------------------------------------------------
.. currentmodule:: dgl.dataloading.base
.. autoclass:: AsyncTransferer
:members: __init__, async_copy
.. autosummary::
:toctree: ../../generated/
.. autoclass:: async_transferer.Transfer
:members: wait
LazyFeature
set_node_lazy_features
set_edge_lazy_features
set_src_lazy_features
set_dst_lazy_features
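As an illustration, a custom sampler can mark features for prefetching inside its
``sample`` method (see the user guide chapter on feature prefetching for the full pattern):

.. code:: python

    # inside a custom sampler's ``sample`` method, where ``subgs`` is the list of
    # message flow graphs about to be returned
    dgl.set_src_lazy_features(subgs[0], ['feat'])
    dgl.set_dst_lazy_features(subgs[-1], ['label'])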
......@@ -173,7 +173,8 @@ set at each iteration. ``prop_edges_YYY`` applies traversal algorithm ``YYY`` an
Utilities
-----------------------------------------------
Other utilities for controlling randomness, saving and loading graphs, etc.
Other utilities for controlling randomness, saving and loading graphs, functions that apply
the same function to every element in a container, etc.
.. autosummary::
:toctree: ../../generated/
......@@ -181,3 +182,4 @@ Other utilities for controlling randomness, saving and loading graphs, etc.
seed
save_graphs
load_graphs
apply_each
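For example, :func:`~dgl.apply_each` applies the same function to every tensor in a
(possibly type-keyed) container; a small made-up illustration:

.. code:: python

    import torch
    import dgl

    feats = {'user': torch.randn(10, 4), 'game': torch.randn(5, 4)}
    # Double every tensor in the dictionary.
    out = dgl.apply_each(feats, lambda x: x * 2)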
......@@ -202,8 +202,8 @@ DGL provides two levels of APIs for sampling nodes and edges to generate mini-ba
(see the section of mini-batch training). The low-level APIs require users to write code
to explicitly define how a layer of nodes are sampled (e.g., using :func:`dgl.sampling.sample_neighbors` ).
The high-level sampling APIs implement a few popular sampling algorithms for node classification
and link prediction tasks (e.g., :class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader` ).
and link prediction tasks (e.g., :class:`~dgl.dataloading.NodeDataLoader` and
:class:`~dgl.dataloading.EdgeDataLoader` ).
The distributed sampling module follows the same design and provides two levels of sampling APIs.
For the lower-level sampling API, it provides :func:`~dgl.distributed.sample_neighbors` for
......@@ -240,10 +240,10 @@ difference is that users need to use :func:`dgl.distributed.sample_neighbors` an
for batch in dataloader:
...
The high-level sampling APIs (:class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader` ) has distributed counterparts
(:class:`~dgl.dataloading.pytorch.DistNodeDataLoader` and
:class:`~dgl.dataloading.pytorch.DistEdgeDataLoader`). The code is exactly the
The high-level sampling APIs (:class:`~dgl.dataloading.NodeDataLoader` and
:class:`~dgl.dataloading.EdgeDataLoader`) have distributed counterparts
(:class:`~dgl.dataloading.DistNodeDataLoader` and
:class:`~dgl.dataloading.DistEdgeDataLoader`). The code is exactly the
same as single-process sampling otherwise.
.. code:: python
......
.. _guide-minibatch-customizing-neighborhood-sampler:
6.4 Customizing Neighborhood Sampler
6.4 Implementing custom graph samplers
----------------------------------------------
:ref:`(中文版) <guide_cn-minibatch-customizing-neighborhood-sampler>`
Although DGL provides some neighborhood sampling strategies, sometimes
users would want to write their own sampling strategy. This section
explains how to write your own strategy and plug it into your stochastic
GNN training framework.
Recall that in `How Powerful are Graph Neural
Networks <https://arxiv.org/pdf/1810.00826.pdf>`__, the definition of message
passing is:
.. math::
\begin{gathered}
\boldsymbol{a}_v^{(l)} = \rho^{(l)} \left(
\left\lbrace
\boldsymbol{h}_u^{(l-1)} : u \in \mathcal{N} \left( v \right)
\right\rbrace
\right)
\\
\boldsymbol{h}_v^{(l)} = \phi^{(l)} \left(
\boldsymbol{h}_v^{(l-1)}, \boldsymbol{a}_v^{(l)}
\right)
\end{gathered}
where :math:`\rho^{(l)}` and :math:`\phi^{(l)}` are parameterized
functions, and :math:`\mathcal{N}(v)` is defined as the set of
predecessors (or *neighbors* if the graph is undirected) of :math:`v` on graph
:math:`\mathcal{G}`.
For instance, to perform a message passing for updating the red node in
the following graph:
.. figure:: https://data.dgl.ai/asset/image/guide_6_4_0.png
:alt: Imgur
One needs to aggregate the node features of its neighbors, shown as
green nodes:
.. figure:: https://data.dgl.ai/asset/image/guide_6_4_1.png
:alt: Imgur
Neighborhood sampling with pencil and paper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's first define a DGL graph according to the above image.
.. code:: python
import torch
import dgl
src = torch.LongTensor(
[0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10,
1, 2, 3, 3, 3, 4, 5, 5, 6, 5, 8, 6, 8, 9, 8, 11, 11, 10, 11])
dst = torch.LongTensor(
[1, 2, 3, 3, 3, 4, 5, 5, 6, 5, 8, 6, 8, 9, 8, 11, 11, 10, 11,
0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10])
g = dgl.graph((src, dst))
We then consider how multi-layer message passing works for computing the
output of a single node. In the following text we refer to the nodes
whose GNN outputs are to be computed as *seed nodes*.
Finding the message passing dependency
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Consider computing with a 2-layer GNN the output of the seed node 8,
colored red, in the following graph:
.. figure:: https://data.dgl.ai/asset/image/guide_6_4_2.png
:alt: Imgur
By the formulation:
.. math::
\begin{gathered}
\boldsymbol{a}_8^{(2)} = \rho^{(2)} \left(
\left\lbrace
\boldsymbol{h}_u^{(1)} : u \in \mathcal{N} \left( 8 \right)
\right\rbrace
\right) = \rho^{(2)} \left(
\left\lbrace
\boldsymbol{h}_4^{(1)}, \boldsymbol{h}_5^{(1)},
\boldsymbol{h}_7^{(1)}, \boldsymbol{h}_{11}^{(1)}
\right\rbrace
\right)
\\
\boldsymbol{h}_8^{(2)} = \phi^{(2)} \left(
\boldsymbol{h}_8^{(1)}, \boldsymbol{a}_8^{(2)}
\right)
\end{gathered}
We can tell from the formulation that to compute
:math:`\boldsymbol{h}_8^{(2)}` we need messages from node 4, 5, 7 and 11
(colored green) along the edges visualized below.
.. figure:: https://data.dgl.ai/asset/image/guide_6_4_3.png
:alt: Imgur
This graph contains all the nodes in the original graph but only the
edges necessary for message passing to the given output nodes. We call
that the *frontier* of the second GNN layer for the red node 8.
Several functions can be used for generating frontiers. For instance,
:func:`dgl.in_subgraph()` induces a
subgraph that includes all the nodes in the original graph, but only
the incoming edges of the given nodes. You can use that as a frontier
for message passing along all the incoming edges.
Implementing custom samplers involves subclassing the :class:`dgl.dataloading.Sampler`
base class and implementing its abstract :attr:`sample` method. The :attr:`sample`
method should take in two arguments:
.. code:: python
frontier = dgl.in_subgraph(g, [8])
print(frontier.all_edges())
For a concrete list, please refer to :ref:`api-subgraph-extraction` and
:ref:`api-sampling`.
Technically, any graph that has the same set of nodes as the original
graph can serve as a frontier. This serves as the basis for
:ref:`guide-minibatch-customizing-neighborhood-sampler-impl`.
The Bipartite Structure for Multi-layer Minibatch Message Passing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
def sample(self, g, indices):
pass
However, to compute :math:`\boldsymbol{h}_8^{(2)}` from
:math:`\boldsymbol{h}_\cdot^{(1)}`, we cannot simply perform message
passing on the frontier directly, because it still contains all the
nodes from the original graph. Namely, we only need nodes 4, 5, 7, 8,
and 11 (green and red nodes) as input, as well as node 8 (red node) as output.
Since the number of nodes
for input and output is different, we need to perform message passing on
a small, bipartite-structured graph instead. We call such a
bipartite-structured graph, which only contains the necessary input nodes
(referred to as *source* nodes) and output nodes (referred to as *destination* nodes),
a *message flow graph* (MFG).
The following figure shows the MFG of the second GNN layer for node 8.
.. figure:: https://data.dgl.ai/asset/image/guide_6_4_4.png
:alt: Imgur
The first argument :attr:`g` is the original graph to sample from, while
the second argument :attr:`indices` contains the indices of the current mini-batch
-- it could generally be anything depending on what indices are given to the
accompanying :class:`~dgl.dataloading.DataLoader`, but is typically a batch of seed node
or seed edge IDs. The function returns the mini-batch of samples for
the current iteration.
.. note::
See the :doc:`Stochastic Training Tutorial
<tutorials/large/L0_neighbor_sampling_overview>` for the concept of
message flow graph.
Note that the destination nodes also appear in the source nodes. The reason is
that representations of destination nodes from the previous layer are needed
for feature combination after message passing (i.e. :math:`\phi^{(2)}`).
DGL provides :func:`dgl.to_block` to convert any frontier
to an MFG, where the first argument specifies the frontier and the
second argument specifies the destination nodes. For instance, the frontier
above can be converted to an MFG with destination node 8 with the following
code.
.. code:: python
dst_nodes = torch.LongTensor([8])
block = dgl.to_block(frontier, dst_nodes)
To find the number of source nodes and destination nodes of a given node type,
one can use :meth:`dgl.DGLHeteroGraph.number_of_src_nodes` and
:meth:`dgl.DGLHeteroGraph.number_of_dst_nodes` methods.
.. code:: python
num_src_nodes, num_dst_nodes = block.number_of_src_nodes(), block.number_of_dst_nodes()
print(num_src_nodes, num_dst_nodes)
The MFG’s source node features can be accessed via the members
:attr:`dgl.DGLHeteroGraph.srcdata` and :attr:`dgl.DGLHeteroGraph.srcnodes`, and
its destination node features can be accessed via the members
:attr:`dgl.DGLHeteroGraph.dstdata` and :attr:`dgl.DGLHeteroGraph.dstnodes`. The
syntax of ``srcdata``/``dstdata`` and ``srcnodes``/``dstnodes`` is
identical to that of :attr:`dgl.DGLHeteroGraph.ndata` and
:attr:`dgl.DGLHeteroGraph.nodes` in normal graphs.
.. code:: python
block.srcdata['h'] = torch.randn(num_src_nodes, 5)
block.dstdata['h'] = torch.randn(num_dst_nodes, 5)
If an MFG is converted from a frontier, which is in turn converted from
a graph, one can directly read the features of the MFG’s source and
destination nodes via
.. code:: python
print(block.srcdata['x'])
print(block.dstdata['y'])
.. note::
The original node IDs of the source nodes and destination nodes in the MFG
can be found as the feature ``dgl.NID``, and the mapping from the
MFG’s edge IDs to the input frontier’s edge IDs can be found as the
feature ``dgl.EID``.
DGL ensures that the destination nodes of an MFG will always appear in its
source nodes, and that the destination nodes always come first in the source
node list.
.. code:: python
src_nodes = block.srcdata[dgl.NID]
dst_nodes = block.dstdata[dgl.NID]
assert torch.equal(src_nodes[:len(dst_nodes)], dst_nodes)
As a result, the destination nodes must cover all nodes that are the
destination of an edge in the frontier.
For example, consider the following frontier
.. figure:: https://data.dgl.ai/asset/image/guide_6_4_5.png
:alt: Imgur
where the red and green nodes (i.e. nodes 4, 5, 7, 8, and 11) are all
nodes that are the destination of an edge. Then the following code will
raise an error because the destination nodes do not cover all those nodes.
.. code:: python
dgl.to_block(frontier2, torch.LongTensor([4, 5])) # ERROR
However, the destination nodes can include more nodes than above. In this case,
there will be isolated nodes that do not have any edge connecting to them.
The isolated nodes will be included in both the source nodes and the destination
nodes.
.. code:: python
# Node 3 is an isolated node that does not have any edge pointing to it.
block3 = dgl.to_block(frontier2, torch.LongTensor([4, 5, 7, 8, 11, 3]))
print(block3.srcdata[dgl.NID])
print(block3.dstdata[dgl.NID])
Heterogeneous Graphs
^^^^^^^^^^^^^^^^^^^^
MFGs also work on heterogeneous graphs. Let’s say that we have the
following frontier:
.. code:: python
hetero_frontier = dgl.heterograph({
('user', 'follow', 'user'): ([1, 3, 7], [3, 6, 8]),
('user', 'play', 'game'): ([5, 5, 4], [6, 6, 2]),
('game', 'played-by', 'user'): ([2], [6])
}, num_nodes_dict={'user': 10, 'game': 10})
One can also create an MFG with destination nodes User #3, #6, and #8, as
well as Game #2 and #6.
.. code:: python
hetero_block = dgl.to_block(hetero_frontier, {'user': [3, 6, 8], 'game': [2, 6]})
One can also get the source nodes and destination nodes by type:
.. code:: python
# source users and games
print(hetero_block.srcnodes['user'].data[dgl.NID], hetero_block.srcnodes['game'].data[dgl.NID])
# destination users and games
print(hetero_block.dstnodes['user'].data[dgl.NID], hetero_block.dstnodes['game'].data[dgl.NID])
.. _guide-minibatch-customizing-neighborhood-sampler-impl:
Implementing a Custom Neighbor Sampler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recall that the following code performs neighbor sampling for node
classification.
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
To implement your own neighborhood sampling strategy, you basically
replace the ``sampler`` object with your own. To do that, let’s first
see what :class:`~dgl.dataloading.dataloader.BlockSampler`, the parent class of
:class:`~dgl.dataloading.neighbor.MultiLayerFullNeighborSampler`, is.
:class:`~dgl.dataloading.dataloader.BlockSampler` is responsible for
generating the list of MFGs starting from the last layer, with method
:meth:`~dgl.dataloading.dataloader.BlockSampler.sample_blocks`. The default implementation of
``sample_blocks`` is to iterate backwards, generating the frontiers and
converting them to MFGs.
Therefore, for neighborhood sampling, **you only need to implement
the**\ :meth:`~dgl.dataloading.dataloader.BlockSampler.sample_frontier`\ **method**. Given which
layer the sampler is generating the frontier for, as well as the original
graph and the nodes whose representations are to be computed, this method is
responsible for generating a frontier for them.
Meanwhile, you also need to pass how many GNN layers you have to the
parent class.
For example, the implementation of
:class:`~dgl.dataloading.neighbor.MultiLayerFullNeighborSampler` can
go as follows.
.. code:: python
class MultiLayerFullNeighborSampler(dgl.dataloading.BlockSampler):
def __init__(self, n_layers):
super().__init__(n_layers)
def sample_frontier(self, block_id, g, seed_nodes):
frontier = dgl.in_subgraph(g, seed_nodes)
return frontier
:class:`dgl.dataloading.neighbor.MultiLayerNeighborSampler`, a more
complicated neighbor sampler class that allows you to sample a small
number of neighbors to gather messages for each node, goes as follows.
.. code:: python
class MultiLayerNeighborSampler(dgl.dataloading.BlockSampler):
def __init__(self, fanouts):
super().__init__(len(fanouts))
self.fanouts = fanouts
def sample_frontier(self, block_id, g, seed_nodes):
fanout = self.fanouts[block_id]
if fanout is None:
frontier = dgl.in_subgraph(g, seed_nodes)
else:
frontier = dgl.sampling.sample_neighbors(g, seed_nodes, fanout)
return frontier
Although the functions above can generate a frontier, any graph that has
the same nodes as the original graph can serve as a frontier.
For example, if one wants to randomly drop the inbound edges of the seed
nodes with a given probability, one can simply define the sampler as follows:
.. code:: python
class MultiLayerDropoutSampler(dgl.dataloading.BlockSampler):
def __init__(self, p, num_layers):
super().__init__(num_layers)
self.p = p
def sample_frontier(self, block_id, g, seed_nodes, *args, **kwargs):
# Get all inbound edges to `seed_nodes`
src, dst = dgl.in_subgraph(g, seed_nodes).all_edges()
# Randomly select edges with a probability of p
mask = torch.zeros_like(src).bernoulli_(self.p).bool()
src = src[mask]
dst = dst[mask]
# Return a new graph with the same nodes as the original graph as a
# frontier
frontier = dgl.graph((src, dst), num_nodes=g.number_of_nodes())
return frontier
def __len__(self):
return self.num_layers
After implementing your sampler, you can create a data loader that takes
in your sampler and it will keep generating lists of MFGs while
iterating over the seed nodes as usual.
.. code:: python
sampler = MultiLayerDropoutSampler(0.5, 2)
dataloader = dgl.dataloading.NodeDataLoader(
g, train_nids, sampler,
batch_size=1024,
shuffle=True,
drop_last=False,
num_workers=4)
model = StochasticTwoLayerRGCN(in_features, hidden_features, out_features)
model = model.cuda()
opt = torch.optim.Adam(model.parameters())
for input_nodes, blocks in dataloader:
blocks = [b.to(torch.device('cuda')) for b in blocks]
input_features = blocks[0].srcdata # returns a dict
output_labels = blocks[-1].dstdata # returns a dict
output_predictions = model(blocks, input_features)
loss = compute_loss(output_labels, output_predictions)
opt.zero_grad()
loss.backward()
opt.step()
Heterogeneous Graphs
^^^^^^^^^^^^^^^^^^^^
Generating a frontier for a heterogeneous graph is no different
from generating one for a homogeneous graph. Just make the returned graph have the
same nodes as the original graph, and it should work fine. For example,
we can rewrite the ``MultiLayerDropoutSampler`` above to iterate over
all edge types, so that it can work on heterogeneous graphs as well.
.. code:: python
class MultiLayerDropoutSampler(dgl.dataloading.BlockSampler):
def __init__(self, p, num_layers):
super().__init__(num_layers)
self.p = p
def sample_frontier(self, block_id, g, seed_nodes, *args, **kwargs):
# Get all inbound edges to `seed_nodes`
sg = dgl.in_subgraph(g, seed_nodes)
new_edges_masks = {}
# Iterate over all edge types
for etype in sg.canonical_etypes:
edge_mask = torch.zeros(sg.number_of_edges(etype))
edge_mask.bernoulli_(self.p)
new_edges_masks[etype] = edge_mask.bool()
# Return a new graph with the same nodes as the original graph as a
# frontier
frontier = dgl.edge_subgraph(sg, new_edges_masks, relabel_nodes=False)
return frontier
def __len__(self):
return self.num_layers
The design here is similar to PyTorch's ``torch.utils.data.DataLoader``,
which is an iterator over a dataset. Users can customize how to batch samples
using its ``collate_fn`` argument. Here in DGL, ``dgl.dataloading.DataLoader``
is an iterator over ``indices`` (e.g., training node IDs) while the ``Sampler``
converts a batch of indices into a batch of graph- or tensor-type samples.
The code below implements a classical neighbor sampler:
.. code:: python
class NeighborSampler(dgl.dataloading.Sampler):
def __init__(self, fanouts : list[int]):
super().__init__()
self.fanouts = fanouts
def sample(self, g, seed_nodes):
output_nodes = seed_nodes
subgs = []
for fanout in reversed(self.fanouts):
# Sample a fixed number of neighbors of the current seed nodes.
sg = g.sample_neighbors(seed_nodes, fanout)
# Convert this subgraph to a message flow graph.
sg = dgl.to_block(sg, seed_nodes)
seed_nodes = sg.srcdata[dgl.NID]
subgs.insert(0, sg)
input_nodes = seed_nodes
return input_nodes, output_nodes, subgs
To use this sampler with ``DataLoader``:
.. code:: python
graph = ... # the graph to be sampled from
train_nids = ... # a 1-D tensor of training node IDs
sampler = NeighborSampler([10, 15]) # create a sampler
dataloader = dgl.dataloading.DataLoader(
graph,
train_nids,
sampler,
batch_size=32, # batch_size decides how many IDs are passed to sampler at once
... # other arguments
)
for i, mini_batch in enumerate(dataloader):
# unpack the mini batch
input_nodes, output_nodes, subgs = mini_batch
train(input_nodes, output_nodes, subgs)
Sampler for Heterogeneous Graphs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To write a sampler for heterogeneous graphs, one needs to be aware that
the argument ``g`` will be a heterogeneous graph while ``indices`` could be a
dictionary of ID tensors. Most of DGL's graph sampling operators (e.g.,
the ``sample_neighbors`` and ``to_block`` functions in the above example) can
work on heterogeneous graphs natively, so many samplers are automatically
ready for heterogeneous graphs. For example, the above ``NeighborSampler``
can be used on heterogeneous graphs:
.. code:: python
hg = dgl.heterograph({
('user', 'like', 'movie') : ...,
('user', 'follow', 'user') : ...,
('movie', 'liked-by', 'user') : ...,
})
train_nids = {'user' : ..., 'movie' : ...} # training IDs of 'user' and 'movie' nodes
sampler = NeighborSampler([10, 15]) # create a sampler
dataloader = dgl.dataloading.DataLoader(
hg,
train_nids,
sampler,
batch_size=32, # batch_size decides how many IDs are passed to sampler at once
... # other arguments
)
for i, mini_batch in enumerate(dataloader):
# unpack the mini batch
# input_nodes and output_nodes are dictionaries while subgs is a list of
# heterogeneous graphs
input_nodes, output_nodes, subgs = mini_batch
train(input_nodes, output_nodes, subgs)
Exclude Edges During Sampling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The examples above are all *node-wise samplers* because the ``indices`` argument
to the ``sample`` method represents a batch of seed node IDs. Another common type of
sampler is the *edge-wise sampler* which, as the name suggests, takes in a batch of seed
edge IDs to construct the mini-batch data. DGL provides a utility
:func:`dgl.dataloading.as_edge_prediction_sampler` to turn a node-wise sampler into
an edge-wise sampler. To prevent information leakage, it requires the node-wise sampler
to accept an additional third argument ``exclude_eids``. The code below modifies
the ``NeighborSampler`` we just defined to properly exclude edges from the sampled
subgraph:
.. code:: python
class NeighborSampler(dgl.dataloading.Sampler):
def __init__(self, fanouts):
super().__init__()
self.fanouts = fanouts
# NOTE: There is an additional third argument. For homogeneous graphs,
# it is a 1-D tensor of integer IDs. For heterogeneous graphs, it
# is a dictionary of ID tensors. We usually set its default value to None.
def sample(self, g, seed_nodes, exclude_eids=None):
output_nodes = seed_nodes
subgs = []
for fanout in reversed(self.fanouts):
# Sample a fixed number of neighbors of the current seed nodes.
sg = g.sample_neighbors(seed_nodes, fanout, exclude_edges=exclude_eids)
# Convert this subgraph to a message flow graph.
sg = dgl.to_block(sg, seed_nodes)
seed_nodes = sg.srcdata[dgl.NID]
subgs.insert(0, sg)
input_nodes = seed_nodes
return input_nodes, output_nodes, subgs
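With the ``exclude_eids`` argument in place, the sampler can be wrapped and driven by a
``DataLoader`` over seed edge IDs. The sketch below assumes a graph ``g`` and a 1-D tensor
of training edge IDs ``train_eids``; ``exclude='self'`` drops the sampled edges themselves,
and without a negative sampler each mini-batch contains the input nodes, the subgraph of
sampled edges and the MFGs:

.. code:: python

    sampler = NeighborSampler([10, 15])
    sampler = dgl.dataloading.as_edge_prediction_sampler(sampler, exclude='self')
    dataloader = dgl.dataloading.DataLoader(
        g, train_eids, sampler,
        batch_size=1024,
        shuffle=True)
    for input_nodes, pair_graph, blocks in dataloader:
        train_on(input_nodes, pair_graph, blocks)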
Further Readings
~~~~~~~~~~~~~~~~~~
See :ref:`guide-minibatch-prefetching` for how to write a custom graph sampler
with feature prefetching.
\ No newline at end of file
......@@ -20,7 +20,7 @@ You can use the
To use the neighborhood sampler provided by DGL for edge classification,
one needs to instead combine it with
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`, which iterates
:func:`~dgl.dataloading.as_edge_prediction_sampler`, which iterates
over a set of edges in minibatches, yielding the subgraph induced by the
edge minibatch and *message flow graphs* (MFGs) to be consumed by the module below.
......@@ -30,7 +30,8 @@ putting the list of generated MFGs onto GPU.
.. code:: python
dataloader = dgl.dataloading.EdgeDataLoader(
sampler = dgl.dataloading.as_edge_prediction_sampler(sampler)
dataloader = dgl.dataloading.DataLoader(
g, train_eid_dict, sampler,
batch_size=1024,
shuffle=True,
......@@ -50,6 +51,8 @@ putting the list of generated MFGs onto GPU.
detailed explanation of the concept of MFGs, please refer to
:ref:`guide-minibatch-customizing-neighborhood-sampler`.
.. _guide-minibatch-edge-classification-sampler-exclude:
Removing edges in the minibatch from the original graph for neighbor sampling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
......@@ -62,8 +65,8 @@ advantage.
Therefore in edge classification you sometimes would like to exclude the
edges sampled in the minibatch from the original graph for neighborhood
sampling, as well as the reverse edges of the sampled edges on an
undirected graph. You can specify ``exclude='reverse_id'`` in instantiation
of :class:`~dgl.dataloading.pytorch.EdgeDataLoader`, with the mapping of the edge
undirected graph. You can specify ``exclude='reverse_id'`` in calling
:func:`~dgl.dataloading.as_edge_prediction_sampler`, with the mapping of the edge
IDs to their reverse edge IDs. Doing so usually leads to a much slower
sampling process due to locating and removing the reverse edges involved in the
minibatch.
......@@ -71,16 +74,11 @@ and removing them.
.. code:: python
n_edges = g.number_of_edges()
dataloader = dgl.dataloading.EdgeDataLoader(
sampler = dgl.dataloading.as_edge_prediction_sampler(
sampler, exclude='reverse_id', reverse_eids=torch.cat([
torch.arange(n_edges // 2, n_edges), torch.arange(0, n_edges // 2)]))
dataloader = dgl.dataloading.DataLoader(
g, train_eid_dict, sampler,
# The following two arguments are specifically for excluding the minibatch
# edges and their reverse edges from the original graph for neighborhood
# sampling.
exclude='reverse_id',
reverse_eids=torch.cat([
torch.arange(n_edges // 2, n_edges), torch.arange(0, n_edges // 2)]),
batch_size=1024,
shuffle=True,
drop_last=False,
......@@ -248,15 +246,16 @@ over the edge types for :meth:`~dgl.DGLHeteroGraph.apply_edges`.
Data loader definition is also very similar to that of node
classification. The only difference is that you need
:class:`~dgl.dataloading.pytorch.EdgeDataLoader` instead of
:class:`~dgl.dataloading.pytorch.NodeDataLoader`, and you will be supplying a
:func:`~dgl.dataloading.as_edge_prediction_sampler`,
and you will be supplying a
dictionary of edge types and edge ID tensors instead of a dictionary of
node types and node ID tensors.
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader = dgl.dataloading.EdgeDataLoader(
sampler = dgl.dataloading.as_edge_prediction_sampler(sampler)
dataloader = dgl.dataloading.DataLoader(
g, train_eid_dict, sampler,
batch_size=1024,
shuffle=True,
......@@ -278,16 +277,12 @@ reverse edges then goes as follows.
.. code:: python
dataloader = dgl.dataloading.EdgeDataLoader(
g, train_eid_dict, sampler,
# The following two arguments are specifically for excluding the minibatch
# edges and their reverse edges from the original graph for neighborhood
# sampling.
exclude='reverse_types',
sampler = dgl.dataloading.as_edge_prediction_sampler(
sampler, exclude='reverse_types',
reverse_etypes={'follow': 'followed by', 'followed by': 'follow',
'purchase': 'purchased by', 'purchased by': 'purchase'}
'purchase': 'purchased by', 'purchased by': 'purchase'})
dataloader = dgl.dataloading.DataLoader(
g, train_eid_dict, sampler,
batch_size=1024,
shuffle=True,
drop_last=False,
......
......@@ -48,13 +48,13 @@ One can use GPU-based neighborhood sampling with DGL data loaders via:
* Set ``num_workers`` argument to 0, because CUDA does not allow multiple processes
accessing the same context.
All the other arguments for the :class:`~dgl.dataloading.pytorch.NodeDataLoader` can be
All the other arguments for the :class:`~dgl.dataloading.DataLoader` can be
the same as the other user guides and tutorials.
.. code:: python
g = g.to('cuda:0')
dataloader = dgl.dataloading.NodeDataLoader(
dataloader = dgl.dataloading.DataLoader(
g, # The graph must be on GPU.
train_nid,
sampler,
......@@ -64,8 +64,6 @@ the same as the other user guides and tutorials.
drop_last=False,
shuffle=True)
GPU-based neighbor sampling also works for :class:`~dgl.dataloading.pytorch.EdgeDataLoader` since DGL 0.8.
.. note::
GPU-based neighbor sampling also works for custom neighborhood samplers as long as
......@@ -91,14 +89,13 @@ You can enable UVA-based neighborhood sampling in DGL data loaders via:
* Set ``num_workers`` argument to 0, because CUDA does not allow multiple processes
accessing the same context.
All the other arguments for the :class:`~dgl.dataloading.pytorch.NodeDataLoader` can be
All the other arguments for the :class:`~dgl.dataloading.DataLoader` can be
the same as the other user guides and tutorials.
UVA-based neighbor sampling also works for :class:`~dgl.dataloading.pytorch.EdgeDataLoader`.
.. code:: python
g = g.pin_memory_()
dataloader = dgl.dataloading.NodeDataLoader(
dataloader = dgl.dataloading.DataLoader(
g, # The graph must be pinned.
train_nid,
sampler,
......@@ -116,7 +113,7 @@ especially for multi-GPU training.
To use UVA-based sampling in multi-GPU training, you should first materialize all the
necessary sparse formats of the graph and copy them to the shared memory explicitly
before spawning training processes. Then you should pin the shared graph in each training
process respectively. Refer to our `GraphSAGE example <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling_multi_gpu.py>`_ for more details.
process respectively. Refer to our `GraphSAGE example <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/multi_gpu_node_classification.py>`_ for more details.
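A minimal sketch of this setup is shown below. It assumes a prepared ``DGLGraph`` ``g``,
the number of GPUs ``num_gpus``, and a ``run(rank, world_size, g, train_nids)`` function
that creates the UVA-enabled ``DataLoader`` on device ``rank``; the exact shared-memory
call may differ depending on your DGL version:

.. code:: python

    import torch
    import torch.multiprocessing as mp

    def run(rank, world_size, g, train_nids):
        torch.cuda.set_device(rank)
        g = g.pin_memory_()   # pin the shared graph in this training process for UVA
        ...                   # build DataLoader(..., device=rank, use_uva=True) and train

    def main(g, train_nids, num_gpus):
        g.create_formats_()                  # materialize all needed sparse formats once
        g = g.shared_memory('train_graph')   # copy the graph structure to shared memory
        mp.spawn(run, args=(num_gpus, g, train_nids), nprocs=num_gpus)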
Using GPU-based neighbor sampling with DGL functions
......
......@@ -15,7 +15,7 @@ classification.
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
:class:`~dgl.dataloading.pytorch.EdgeDataLoader` in DGL also
:func:`~dgl.dataloading.as_edge_prediction_sampler` in DGL also
supports generating negative samples for link prediction. To do so, you
need to provide the negative sampling function.
:class:`~dgl.dataloading.negative_sampler.Uniform` is a
......@@ -27,9 +27,10 @@ uniformly for each source node of an edge.
.. code:: python
dataloader = dgl.dataloading.EdgeDataLoader(
sampler = dgl.dataloading.as_edge_prediction_sampler(
sampler, negative_sampler=dgl.dataloading.negative_sampler.Uniform(5))
dataloader = dgl.dataloading.DataLoader(
g, train_seeds, sampler,
negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
batch_size=args.batch_size,
shuffle=True,
drop_last=False,
......@@ -60,10 +61,10 @@ proportional to a power of degrees.
src = src.repeat_interleave(self.k)
dst = self.weights.multinomial(len(src), replacement=True)
return src, dst
dataloader = dgl.dataloading.EdgeDataLoader(
sampler = dgl.dataloading.as_edge_prediction_sampler(
sampler, negative_sampler=NegativeSampler(g, 5))
dataloader = dgl.dataloading.DataLoader(
g, train_seeds, sampler,
negative_sampler=NegativeSampler(g, 5),
batch_size=args.batch_size,
shuffle=True,
drop_last=False,
......@@ -229,9 +230,10 @@ ID tensors.
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader = dgl.dataloading.EdgeDataLoader(
sampler = dgl.dataloading.as_edge_prediction_sampler(
sampler, negative_sampler=dgl.dataloading.negative_sampler.Uniform(5))
dataloader = dgl.dataloading.DataLoader(
g, train_eid_dict, sampler,
negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
batch_size=1024,
shuffle=True,
drop_last=False,
......@@ -269,10 +271,10 @@ sampler. For instance, the following iterates over all edges of the heterogeneo
train_eid_dict = {
etype: g.edges(etype=etype, form='eid')
for etype in g.canonical_etypes}
dataloader = dgl.dataloading.EdgeDataLoader(
sampler = dgl.dataloading.as_edge_prediction_sampler(
sampler, negative_sampler=NegativeSampler(g, 5))
dataloader = dgl.dataloading.DataLoader(
g, train_eid_dict, sampler,
negative_sampler=NegativeSampler(g, 5),
batch_size=1024,
shuffle=True,
drop_last=False,
......
......@@ -26,8 +26,8 @@ The simplest neighborhood sampler is
which makes the node gather messages from all of its neighbors.
To use a sampler provided by DGL, one also needs to combine it with
:class:`~dgl.dataloading.pytorch.NodeDataLoader`, which iterates
over a set of nodes in minibatches.
:class:`~dgl.dataloading.DataLoader`, which iterates
over a set of indices (nodes in this case) in minibatches.
For example, the following code creates a PyTorch DataLoader that
iterates over the training node ID array ``train_nids`` in batches,
......@@ -42,7 +42,7 @@ putting the list of generated MFGs onto GPU.
import torch.nn.functional as F
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader = dgl.dataloading.NodeDataLoader(
dataloader = dgl.dataloading.DataLoader(
g, train_nids, sampler,
batch_size=1024,
shuffle=True,
......@@ -212,7 +212,7 @@ removed for simplicity):
Some of the samplers provided by DGL also support heterogeneous graphs.
For example, one can still use the provided
:class:`~dgl.dataloading.neighbor.MultiLayerFullNeighborSampler` class and
:class:`~dgl.dataloading.pytorch.NodeDataLoader` class for
:class:`~dgl.dataloading.DataLoader` class for
stochastic training. For full-neighbor sampling, the only difference
would be that you would specify a dictionary of node
types and node IDs for the training set.
......@@ -220,7 +220,7 @@ types and node IDs for the training set.
.. code:: python
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader = dgl.dataloading.NodeDataLoader(
dataloader = dgl.dataloading.DataLoader(
g, train_nid_dict, sampler,
batch_size=1024,
shuffle=True,
......
.. _guide-minibatch-prefetching:
6.8 Feature Prefetching
-----------------------
In minibatch training of GNNs, especially with neighbor sampling approaches, we often see
that a large amount of node feature data needs to be copied to the device for computing GNNs.
To mitigate this data movement bottleneck, DGL supports *feature prefetching*
so that the model computation and data movement can happen in parallel.
Enabling Prefetching with DGL's Builtin Samplers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All the DGL samplers in :ref:`api-dataloading` allow users to specify which
node and edge data to prefetch via arguments like :attr:`prefetch_node_feats`.
For example, the following code asks :class:`dgl.dataloading.NeighborSampler` to prefetch
the node data named ``feat`` and save it to the ``srcdata`` of the first message flow
graph. It also asks the sampler to prefetch and save the node data named ``label``
to the ``dstdata`` of the last message flow graph:
.. code:: python
graph = ... # the graph to sample from
graph.ndata['feat'] = ... # node feature
graph.ndata['label'] = ... # node label
train_nids = ... # a 1-D integer tensor of training node IDs
# create a sampler and specify what data to prefetch
sampler = dgl.dataloading.NeighborSampler(
[15, 10, 5], prefetch_node_feats=['feat'], prefetch_labels=['label'])
# create a dataloader
dataloader = dgl.dataloading.DataLoader(
graph, train_nids, sampler,
batch_size=32,
... # other arguments
)
for mini_batch in dataloader:
# unpack mini batch
input_nodes, output_nodes, subgs = mini_batch
# the following data has been pre-fetched
feat = subgs[0].srcdata['feat']
label = subgs[-1].dstdata['label']
train(subgs, feat, label)
.. note::
Even without specifying the prefetch arguments, users can still access
``subgs[0].srcdata['feat']`` and ``subgs[-1].dstdata['label']`` because DGL
internally keeps a reference to the node/edge data of the original graph when
a subgraph is created. Accessing subgraph features in this way fetches the data
from the original graph immediately, while prefetching ensures that the data is
available before the mini-batch is returned from the data loader.
Enabling Prefetching in Custom Samplers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Users can implement their own rules of prefetching when writing custom samplers.
Here is the code of ``NeighborSampler`` with prefetching:
.. code:: python
class NeighborSampler(dgl.dataloading.Sampler):
def __init__(self,
fanouts : list[int],
prefetch_node_feats: list[str] = None,
prefetch_edge_feats: list[str] = None,
prefetch_labels: list[str] = None):
super().__init__()
self.fanouts = fanouts
self.prefetch_node_feats = prefetch_node_feats
self.prefetch_edge_feats = prefetch_edge_feats
self.prefetch_labels = prefetch_labels
def sample(self, g, seed_nodes):
output_nodes = seed_nodes
subgs = []
for fanout in reversed(self.fanouts):
# Sample a fixed number of neighbors of the current seed nodes.
sg = g.sample_neighbors(seed_nodes, fanout)
# Convert this subgraph to a message flow graph.
sg = dgl.to_block(sg, seed_nodes)
seed_nodes = sg.srcdata[dgl.NID]
subgs.insert(0, sg)
input_nodes = seed_nodes
# handle prefetching
dgl.set_src_lazy_features(subgs[0], self.prefetch_node_feats)
dgl.set_dst_lazy_features(subgs[-1], self.prefetch_labels)
for subg in subgs:
dgl.set_edge_lazy_features(subg, self.prefetch_edge_feats)
return input_nodes, output_nodes, subgs
Using the :func:`~dgl.set_src_lazy_features`, :func:`~dgl.set_dst_lazy_features`
and :func:`~dgl.set_edge_lazy_features`, users can tell ``DataLoader`` which
features to prefetch and where to save them (``srcdata``, ``dstdata`` or ``edata``).
See :ref:`guide-minibatch-customizing-neighborhood-sampler` for more explanations
on how to write a custom graph sampler.
\ No newline at end of file
......@@ -75,3 +75,4 @@ sampling:
minibatch-nn
minibatch-inference
minibatch-gpu-sampling
minibatch-prefetching
......@@ -15,7 +15,7 @@ from numpy import random
from torch.nn.parameter import Parameter
import dgl
import dgl.function as fn
import dgl.multiprocessing as mp
import torch.multiprocessing as mp
from utils import *
......@@ -491,13 +491,7 @@ def train_model(network_data):
if n_gpus == 1:
run(0, n_gpus, args, devices, data)
else:
procs = []
for proc_id in range(n_gpus):
p = mp.Process(target=run, args=(proc_id, n_gpus, args, devices, data))
p.start()
procs.append(p)
for p in procs:
p.join()
mp.spawn(run, args=(n_gpus, args, devices, data), nprocs=n_gpus)
if __name__ == "__main__":
......
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.nn as dglnn
import time
import numpy as np
from ogb.nodeproppred import DglNodePropPredDataset
class SAGE(nn.Module):
def __init__(self, in_feats, n_hidden, n_classes):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))
self.dropout = nn.Dropout(0.5)
def forward(self, sg, x):
h = x
for l, layer in enumerate(self.layers):
h = layer(sg, h)
if l != len(self.layers) - 1:
h = F.relu(h)
h = self.dropout(h)
return h
dataset = DglNodePropPredDataset('ogbn-products')
graph, labels = dataset[0]
graph.ndata['label'] = labels
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph.ndata['train_mask'] = torch.zeros(graph.num_nodes(), dtype=torch.bool).index_fill_(0, train_idx, True)
graph.ndata['valid_mask'] = torch.zeros(graph.num_nodes(), dtype=torch.bool).index_fill_(0, valid_idx, True)
graph.ndata['test_mask'] = torch.zeros(graph.num_nodes(), dtype=torch.bool).index_fill_(0, test_idx, True)
model = SAGE(graph.ndata['feat'].shape[1], 256, dataset.num_classes).cuda()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
num_partitions = 1000
sampler = dgl.dataloading.ClusterGCNSampler(
graph, num_partitions,
prefetch_node_feats=['feat', 'label', 'train_mask', 'valid_mask', 'test_mask'])
# DataLoader for generic dataloading with a graph, a set of indices (any indices, like
# partition IDs here), and a graph sampler.
# NodeDataLoader and EdgeDataLoader are simply special cases of DataLoader where the
# indices are guaranteed to be node and edge IDs.
dataloader = dgl.dataloading.DataLoader(
graph,
torch.arange(num_partitions),
sampler,
device='cuda',
batch_size=100,
shuffle=True,
drop_last=False,
num_workers=0,
use_uva=True)
durations = []
for _ in range(10):
t0 = time.time()
model.train()
for it, sg in enumerate(dataloader):
x = sg.ndata['feat']
y = sg.ndata['label'][:, 0]
m = sg.ndata['train_mask']
y_hat = model(sg, x)
loss = F.cross_entropy(y_hat[m], y[m])
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.accuracy(y_hat[m], y[m])
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
print(tt - t0)
durations.append(tt - t0)
model.eval()
with torch.no_grad():
val_preds, test_preds = [], []
val_labels, test_labels = [], []
for it, sg in enumerate(dataloader):
x = sg.ndata['feat']
y = sg.ndata['label'][:, 0]
m_val = sg.ndata['valid_mask']
m_test = sg.ndata['test_mask']
y_hat = model(sg, x)
val_preds.append(y_hat[m_val])
val_labels.append(y[m_val])
test_preds.append(y_hat[m_test])
test_labels.append(y[m_test])
val_preds = torch.cat(val_preds, 0)
val_labels = torch.cat(val_labels, 0)
test_preds = torch.cat(test_preds, 0)
test_labels = torch.cat(test_labels, 0)
val_acc = MF.accuracy(val_preds, val_labels)
test_acc = MF.accuracy(test_preds, test_labels)
print('Validation acc:', val_acc.item(), 'Test acc:', test_acc.item())
print(np.mean(durations[4:]), np.std(durations[4:]))
from . import graph
from . import storages
from .graph import *
from .other_feature import *
from .wrapper import *
class GraphStorage(object):
def get_node_storage(self, key, ntype=None):
pass
def get_edge_storage(self, key, etype=None):
pass
# Required for checking whether a single dict is allowed for ndata and edata.
@property
def ntypes(self):
pass
@property
def canonical_etypes(self):
pass
def etypes(self):
return [etype[1] for etype in self.canonical_etypes]
def sample_neighbors(self, seed_nodes, fanout, edge_dir='in', prob=None,
exclude_edges=None, replace=False, output_device=None):
"""Return a DGLGraph which is a subgraph induced by sampling neighboring edges of
the given nodes.
See ``dgl.sampling.sample_neighbors`` for detailed semantics.
Parameters
----------
seed_nodes : Tensor or dict[str, Tensor]
Node IDs to sample neighbors from.
This argument can take a single ID tensor or a dictionary of node types and ID tensors.
If a single tensor is given, the graph must only have one type of nodes.
fanout : int or dict[etype, int]
The number of edges to be sampled for each node on each edge type.
This argument can take a single int or a dictionary of edge types and ints.
If a single int is given, DGL will sample this number of edges for each node for
every edge type.
If -1 is given for a single edge type, all the neighboring edges with that edge
type will be selected.
prob : str, optional
Feature name used as the (unnormalized) probabilities associated with each
neighboring edge of a node. The feature must have only one element for each
edge.
The features must be non-negative floats, and the sum of the features of
inbound/outbound edges for every node must be positive (though they don't have
to sum up to one). Otherwise, the result will be undefined.
If :attr:`prob` is not None, GPU sampling is not supported.
exclude_edges: tensor or dict
Edge IDs to exclude during sampling neighbors for the seed nodes.
This argument can take a single ID tensor or a dictionary of edge types and ID tensors.
If a single tensor is given, the graph must only have one type of edges.
replace : bool, optional
If True, sample with replacement.
output_device : Framework-specific device context object, optional
The output device. Default is the same as the input graph.
Returns
-------
DGLGraph
A sampled subgraph with the same nodes as the original graph, but only the sampled neighboring
edges. The induced edge IDs will be in ``edata[dgl.EID]``.
"""
pass
# Required in Cluster-GCN
def subgraph(self, nodes, relabel_nodes=False, output_device=None):
"""Return a subgraph induced on given nodes.
This has the same semantics as ``dgl.node_subgraph``.
Parameters
----------
nodes : nodes or dict[str, nodes]
The nodes to form the subgraph. The allowed nodes formats are:
* Int Tensor: Each element is a node ID. The tensor must have the same device type
and ID data type as the graph's.
* iterable[int]: Each element is a node ID.
* Bool Tensor: Each :math:`i^{th}` element is a bool flag indicating whether
node :math:`i` is in the subgraph.
If the graph is homogeneous, one can directly pass the above formats.
Otherwise, the argument must be a dictionary with keys being node types
and values being the node IDs in the above formats.
relabel_nodes : bool, optional
If True, the extracted subgraph will only have the nodes in the specified node set
and it will relabel the nodes in order.
output_device : Framework-specific device context object, optional
The output device. Default is the same as the input graph.
Returns
-------
DGLGraph
The subgraph.
"""
pass
# Required in Link Prediction
def edge_subgraph(self, edges, relabel_nodes=False, output_device=None):
"""Return a subgraph induced on given edges.
This has the same semantics as ``dgl.edge_subgraph``.
Parameters
----------
edges : edges or dict[(str, str, str), edges]
The edges to form the subgraph. The allowed edges formats are:
* Int Tensor: Each element is an edge ID. The tensor must have the same device type
and ID data type as the graph's.
* iterable[int]: Each element is an edge ID.
* Bool Tensor: Each :math:`i^{th}` element is a bool flag indicating whether
edge :math:`i` is in the subgraph.
If the graph is homogeneous, one can directly pass the above formats.
Otherwise, the argument must be a dictionary with keys being edge types
and values being the edge IDs in the above formats.
relabel_nodes : bool, optional
If True, the extracted subgraph will only have the nodes incident to the given edges
and it will relabel those nodes in order.
output_device : Framework-specific device context object, optional
The output device. Default is the same as the input graph.
Returns
-------
DGLGraph
The subgraph.
"""
pass
# Required in Link Prediction negative sampler
def find_edges(self, edges, etype=None, output_device=None):
"""Return the source and destination node IDs given the edge IDs within the given edge type.
"""
pass
# Required in Link Prediction negative sampler
def num_nodes(self, ntype):
"""Return the number of nodes for the given node type."""
pass
def global_uniform_negative_sampling(self, num_samples, exclude_self_loops=True,
replace=False, etype=None):
"""Per source negative sampling as in ``dgl.dataloading.GlobalUniform``"""
from collections.abc import Mapping
from dgl.storages import wrap_storage
from dgl.utils import recursive_apply
# A GraphStorage class where ndata and edata can be any FeatureStorage but
# otherwise the same as the wrapped DGLGraph.
class OtherFeatureGraphStorage(object):
def __init__(self, g, ndata=None, edata=None):
self.g = g
self._ndata = recursive_apply(ndata, wrap_storage) if ndata is not None else {}
self._edata = recursive_apply(edata, wrap_storage) if edata is not None else {}
for k, v in self._ndata.items():
if not isinstance(v, Mapping):
assert len(self.g.ntypes) == 1
self._ndata[k] = {self.g.ntypes[0]: v}
for k, v in self._edata.items():
if not isinstance(v, Mapping):
assert len(self.g.canonical_etypes) == 1
self._edata[k] = {self.g.canonical_etypes[0]: v}
def get_node_storage(self, key, ntype=None):
if ntype is None:
ntype = self.g.ntypes[0]
return self._ndata[key][ntype]
def get_edge_storage(self, key, etype=None):
if etype is None:
etype = self.g.canonical_etypes[0]
return self._edata[key][etype]
def __getattr__(self, key):
# I wrote it in this way because I'm too lazy to write "def sample_neighbors"
# or stuff like that.
if key in ['ntypes', 'etypes', 'canonical_etypes', 'sample_neighbors',
'subgraph', 'edge_subgraph', 'find_edges', 'num_nodes']:
# Delegate to the wrapped DGLGraph instance.
return getattr(self.g, key)
else:
raise AttributeError(key)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.nn as dglnn
import time
import numpy as np
# OGB must follow DGL if both DGL and PyG are installed. Otherwise DataLoader will hang.
# (This is a long-standing issue)
from ogb.nodeproppred import DglNodePropPredDataset
import dglnew
class SAGE(nn.Module):
def __init__(self, in_feats, n_hidden, n_classes):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))
self.dropout = nn.Dropout(0.5)
def forward(self, blocks, x):
h = x
for l, (layer, block) in enumerate(zip(self.layers, blocks)):
h = layer(block, h)
if l != len(self.layers) - 1:
h = F.relu(h)
h = self.dropout(h)
return h
dataset = DglNodePropPredDataset('ogbn-products')
graph, labels = dataset[0]
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
# This is an example of using feature storage other than tensors
feat_np = graph.ndata['feat'].numpy()
feat = np.memmap('feat.npy', mode='w+', shape=feat_np.shape, dtype='float32')
print(feat.shape)
feat[:] = feat_np
model = SAGE(feat.shape[1], 256, dataset.num_classes).cuda()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
graph.create_formats_()
# Because NumpyStorage is registered with memmap, one can directly add numpy memmaps
graph = dglnew.graph.OtherFeatureGraphStorage(graph, ndata={'feat': feat, 'label': labels})
#graph = dglnew.graph.OtherFeatureGraphStorage(graph,
# ndata={'feat': dgl.storages.NumpyStorage(feat), 'label': labels})
sampler = dgl.dataloading.NeighborSampler(
[5, 5, 5], output_device='cpu', prefetch_node_feats=['feat'],
prefetch_labels=['label'])
dataloader = dgl.dataloading.NodeDataLoader(
graph,
train_idx,
sampler,
device='cuda',
batch_size=1000,
shuffle=True,
drop_last=False,
pin_memory=True,
num_workers=4,
use_prefetch_thread=True) # TBD: could probably remove this argument
durations = []
for _ in range(10):
t0 = time.time()
for it, (input_nodes, output_nodes, blocks) in enumerate(dataloader):
x = blocks[0].srcdata['feat']
y = blocks[-1].dstdata['label'][:, 0]
y_hat = model(blocks, x)
loss = F.cross_entropy(y_hat, y)
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.accuracy(y_hat, y)
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
print(tt - t0)
durations.append(tt - t0)
print(np.mean(durations[4:]), np.std(durations[4:]))
......@@ -8,22 +8,12 @@ This repo reproduce the reported speed and performance maximally on Reddit and P
Dependencies
------------
- Python 3.7+ (for string formatting features)
- PyTorch 1.5.0+
- PyTorch 1.9.0+
- sklearn
## Run Experiments.
* For reddit data, you may run the following scripts
## Run Experiments
```bash
python cluster_gcn.py
```
./run_reddit.sh
```
You should be able to see the final test F1 is around `Test F1-mic0.9612, Test F1-mac0.9399`.
Note that the first run of the provided script is considerably slower than reported in the paper, presumably due to the dataloader used. After caching the partition allocation, the overall speed returns to a normal scale. On a 1080Ti and Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz machine I am able to train it within 45s. After the first epoch the F1-mic on the validation dataset should be around `0.93`.
* For PPI data, you may run the following scripts
```
./run_ppi.sh
```
You should be able to see the final test F1 is around `Test F1-mic0.9924, Test F1-mac0.9917`. The training finished in 10 mins.
import argparse
import os
import time
import random
import numpy as np
import networkx as nx
import sklearn.preprocessing
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.function as fn
from dgl.data import register_data_args
from modules import GraphSAGE
from sampler import ClusterIter
from utils import Logger, evaluate, save_log_dir, load_data
def main(args):
torch.manual_seed(args.rnd_seed)
np.random.seed(args.rnd_seed)
random.seed(args.rnd_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
multitask_data = set(['ppi'])
multitask = args.dataset in multitask_data
# load and preprocess dataset
data = load_data(args)
g = data.g
train_mask = g.ndata['train_mask']
val_mask = g.ndata['val_mask']
test_mask = g.ndata['test_mask']
labels = g.ndata['label']
train_nid = np.nonzero(train_mask.data.numpy())[0].astype(np.int64)
# Normalize features
if args.normalize:
feats = g.ndata['feat']
train_feats = feats[train_mask]
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(train_feats.data.numpy())
features = scaler.transform(feats.data.numpy())
g.ndata['feat'] = torch.FloatTensor(features)
in_feats = g.ndata['feat'].shape[1]
n_classes = data.num_classes
n_edges = g.number_of_edges()
n_train_samples = train_mask.int().sum().item()
n_val_samples = val_mask.int().sum().item()
n_test_samples = test_mask.int().sum().item()
print("""----Data statistics------'
#Edges %d
#Classes %d
#Train samples %d
#Val samples %d
#Test samples %d""" %
(n_edges, n_classes,
n_train_samples,
n_val_samples,
n_test_samples))
# create GCN model
if args.self_loop and not args.dataset.startswith('reddit'):
g = dgl.remove_self_loop(g)
g = dgl.add_self_loop(g)
print("adding self-loop edges")
# metis only support int64 graph
g = g.long()
if args.use_pp:
# Precompute the first-hop neighbor aggregation and concatenate it with the raw
# features, so the first GraphSAGE layer can skip message passing ("pp" = precomputation).
g.update_all(fn.copy_u('feat', 'm'), fn.sum('m', 'feat_agg'))
g.ndata['feat'] = torch.cat([g.ndata['feat'], g.ndata['feat_agg']], 1)
del g.ndata['feat_agg']
cluster_iterator = dgl.dataloading.GraphDataLoader(
dgl.dataloading.ClusterGCNSubgraphIterator(
dgl.node_subgraph(g, train_nid), args.psize, './cache'),
batch_size=args.batch_size, num_workers=4)
#cluster_iterator = ClusterIter(
# args.dataset, g, args.psize, args.batch_size, train_nid, use_pp=args.use_pp)
# set device for dataset tensors
if args.gpu < 0:
cuda = False
else:
cuda = True
torch.cuda.set_device(args.gpu)
val_mask = val_mask.cuda()
test_mask = test_mask.cuda()
g = g.int().to(args.gpu)
print('labels shape:', g.ndata['label'].shape)
print("features shape, ", g.ndata['feat'].shape)
model = GraphSAGE(in_feats,
args.n_hidden,
n_classes,
args.n_layers,
F.relu,
args.dropout,
args.use_pp)
if cuda:
model.cuda()
# logger and so on
log_dir = save_log_dir(args)
logger = Logger(os.path.join(log_dir, 'loggings'))
logger.write(args)
# Loss function
if multitask:
print('Using multi-label loss')
loss_f = nn.BCEWithLogitsLoss()
else:
print('Using multi-class loss')
loss_f = nn.CrossEntropyLoss()
# use optimizer
optimizer = torch.optim.Adam(model.parameters(),
lr=args.lr,
weight_decay=args.weight_decay)
# set train_nids to cuda tensor
if cuda:
train_nid = torch.from_numpy(train_nid).cuda()
print("current memory after model before training",
torch.cuda.memory_allocated(device=train_nid.device) / 1024 / 1024)
start_time = time.time()
best_f1 = -1
for epoch in range(args.n_epochs):
for j, cluster in enumerate(cluster_iterator):
# sync with upper level training graph
if cuda:
cluster = cluster.to(torch.cuda.current_device())
model.train()
# forward
batch_labels = cluster.ndata['label']
batch_train_mask = cluster.ndata['train_mask']
if batch_train_mask.sum().item() == 0:
continue
pred = model(cluster)
loss = loss_f(pred[batch_train_mask],
batch_labels[batch_train_mask])
optimizer.zero_grad()
loss.backward()
optimizer.step()
# In the PPI case, `log_every` is chosen to log once per epoch.
# Tune the log frequency if you want more information within one epoch.
if j % args.log_every == 0:
print(f"epoch:{epoch}/{args.n_epochs}, Iteration {j}/"
f"{len(cluster_iterator)}:training loss", loss.item())
print("current memory:",
torch.cuda.memory_allocated(device=pred.device) / 1024 / 1024)
# evaluate
if epoch % args.val_every == 0:
val_f1_mic, val_f1_mac = evaluate(
model, g, labels, val_mask, multitask)
print(
"Val F1-mic{:.4f}, Val F1-mac{:.4f}". format(val_f1_mic, val_f1_mac))
if val_f1_mic > best_f1:
best_f1 = val_f1_mic
print('new best val f1:', best_f1)
torch.save(model.state_dict(), os.path.join(
log_dir, 'best_model.pkl'))
end_time = time.time()
print(f'training using time {end_time - start_time}')
# test
if args.use_val:
model.load_state_dict(torch.load(os.path.join(
log_dir, 'best_model.pkl')))
test_f1_mic, test_f1_mac = evaluate(
model, g, labels, test_mask, multitask)
print("Test F1-mic{:.4f}, Test F1-mac{:.4f}". format(test_f1_mic, test_f1_mac))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GCN')
register_data_args(parser)
parser.add_argument("--dropout", type=float, default=0.5,
help="dropout probability")
parser.add_argument("--gpu", type=int, default=-1,
help="gpu")
parser.add_argument("--lr", type=float, default=3e-2,
help="learning rate")
parser.add_argument("--n-epochs", type=int, default=200,
help="number of training epochs")
parser.add_argument("--log-every", type=int, default=100,
help="the frequency to save model")
parser.add_argument("--batch-size", type=int, default=20,
help="batch size")
parser.add_argument("--psize", type=int, default=1500,
help="partition number")
parser.add_argument("--test-batch-size", type=int, default=1000,
help="test batch size")
parser.add_argument("--n-hidden", type=int, default=16,
help="number of hidden gcn units")
parser.add_argument("--n-layers", type=int, default=1,
help="number of hidden gcn layers")
parser.add_argument("--val-every", type=int, default=1,
help="number of epoch of doing inference on validation")
parser.add_argument("--rnd-seed", type=int, default=3,
help="number of epoch of doing inference on validation")
parser.add_argument("--self-loop", action='store_true',
help="graph self-loop (default=False)")
parser.add_argument("--use-pp", action='store_true',
help="whether to use precomputation")
parser.add_argument("--normalize", action='store_true',
help="whether to use normalized feature")
parser.add_argument("--use-val", action='store_true',
help="whether to use validated best model to test")
parser.add_argument("--weight-decay", type=float, default=5e-4,
help="Weight for L2 loss")
parser.add_argument("--note", type=str, default='none',
help="note for log dir")
args = parser.parse_args()
print(args)
main(args)
import time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.nn as dglnn
from ogb.nodeproppred import DglNodePropPredDataset
class SAGE(nn.Module):
def __init__(self, in_feats, n_hidden, n_classes):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))
self.dropout = nn.Dropout(0.5)
def forward(self, sg, x):
h = x
for l, layer in enumerate(self.layers):
h = layer(sg, h)
if l != len(self.layers) - 1:
h = F.relu(h)
h = self.dropout(h)
return h
dataset = dgl.data.AsNodePredDataset(DglNodePropPredDataset('ogbn-products'))
graph = dataset[0] # already prepares ndata['label'/'train_mask'/'val_mask'/'test_mask']
model = SAGE(graph.ndata['feat'].shape[1], 256, dataset.num_classes).cuda()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
num_partitions = 1000
sampler = dgl.dataloading.ClusterGCNSampler(
graph, num_partitions,
prefetch_ndata=['feat', 'label', 'train_mask', 'val_mask', 'test_mask'])
# DataLoader for generic dataloading with a graph, a set of indices (any indices, like
# partition IDs here), and a graph sampler.
# NodeDataLoader and EdgeDataLoader are simply special cases of DataLoader where the
# indices are guaranteed to be node and edge IDs.
dataloader = dgl.dataloading.DataLoader(
graph,
torch.arange(num_partitions).to('cuda'),
sampler,
device='cuda',
batch_size=100,
shuffle=True,
drop_last=False,
num_workers=0,
use_uva=True)
durations = []
for _ in range(10):
t0 = time.time()
model.train()
for it, sg in enumerate(dataloader):
x = sg.ndata['feat']
y = sg.ndata['label']
m = sg.ndata['train_mask'].bool()
y_hat = model(sg, x)
loss = F.cross_entropy(y_hat[m], y[m])
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.accuracy(y_hat[m], y[m])
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
print(tt - t0)
durations.append(tt - t0)
model.eval()
with torch.no_grad():
val_preds, test_preds = [], []
val_labels, test_labels = [], []
for it, sg in enumerate(dataloader):
x = sg.ndata['feat']
y = sg.ndata['label']
m_val = sg.ndata['val_mask'].bool()
m_test = sg.ndata['test_mask'].bool()
y_hat = model(sg, x)
val_preds.append(y_hat[m_val])
val_labels.append(y[m_val])
test_preds.append(y_hat[m_test])
test_labels.append(y[m_test])
val_preds = torch.cat(val_preds, 0)
val_labels = torch.cat(val_labels, 0)
test_preds = torch.cat(test_preds, 0)
test_labels = torch.cat(test_labels, 0)
val_acc = MF.accuracy(val_preds, val_labels)
test_acc = MF.accuracy(test_preds, test_labels)
print('Validation acc:', val_acc.item(), 'Test acc:', test_acc.item())
print(np.mean(durations[4:]), np.std(durations[4:]))
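The comments in the example above point out that `NodeDataLoader` and `EdgeDataLoader` are special cases of the generic `DataLoader`. As a minimal sketch of the node-ID case (not part of this diff; the fanouts, batch size, and variable names are illustrative and reuse the `graph` object loaded above), the same class can also drive ordinary neighbor-sampled node classification:

```python
# Sketch only: the same generic DataLoader, but with node IDs as indices and a
# NeighborSampler as the graph sampler (illustrative values, not from this PR).
sampler = dgl.dataloading.NeighborSampler([10, 10, 10])   # fanout per GNN layer
train_idx = torch.nonzero(graph.ndata['train_mask'], as_tuple=True)[0]
node_dataloader = dgl.dataloading.DataLoader(
    graph, train_idx, sampler,
    batch_size=1024, shuffle=True, drop_last=False, num_workers=0)
for input_nodes, output_nodes, blocks in node_dataloader:
    # `blocks` is a list of message flow graphs, one per layer; input features for
    # the receptive field can be gathered with `input_nodes`, e.g.
    # graph.ndata['feat'][input_nodes].
    pass
```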
import math
import dgl.function as fn
import torch
import torch.nn as nn
class GraphSAGELayer(nn.Module):
def __init__(self,
in_feats,
out_feats,
activation,
dropout,
bias=True,
use_pp=False,
use_lynorm=True):
super(GraphSAGELayer, self).__init__()
# The input feature size is doubled because we concatenate the original
# features with the aggregated neighbor features.
self.linear = nn.Linear(2 * in_feats, out_feats, bias=bias)
self.activation = activation
self.use_pp = use_pp
if dropout:
self.dropout = nn.Dropout(p=dropout)
else:
self.dropout = 0.
if use_lynorm:
self.lynorm = nn.LayerNorm(out_feats, elementwise_affine=True)
else:
self.lynorm = lambda x: x
self.reset_parameters()
def reset_parameters(self):
stdv = 1. / math.sqrt(self.linear.weight.size(1))
self.linear.weight.data.uniform_(-stdv, stdv)
if self.linear.bias is not None:
self.linear.bias.data.uniform_(-stdv, stdv)
def forward(self, g, h):
g = g.local_var()
if not self.use_pp:
norm = self.get_norm(g)
g.ndata['h'] = h
g.update_all(fn.copy_src(src='h', out='m'),
fn.sum(msg='m', out='h'))
ah = g.ndata.pop('h')
h = self.concat(h, ah, norm)
if self.dropout:
h = self.dropout(h)
h = self.linear(h)
h = self.lynorm(h)
if self.activation:
h = self.activation(h)
return h
def concat(self, h, ah, norm):
ah = ah * norm
h = torch.cat((h, ah), dim=1)
return h
def get_norm(self, g):
norm = 1. / g.in_degrees().float().unsqueeze(1)
norm[torch.isinf(norm)] = 0
norm = norm.to(self.linear.weight.device)
return norm
class GraphSAGE(nn.Module):
def __init__(self,
in_feats,
n_hidden,
n_classes,
n_layers,
activation,
dropout,
use_pp):
super(GraphSAGE, self).__init__()
self.layers = nn.ModuleList()
# input layer
self.layers.append(GraphSAGELayer(in_feats, n_hidden, activation=activation,
dropout=dropout, use_pp=use_pp, use_lynorm=True))
# hidden layers
for i in range(n_layers - 1):
self.layers.append(
GraphSAGELayer(n_hidden, n_hidden, activation=activation, dropout=dropout,
use_pp=False, use_lynorm=True))
# output layer
self.layers.append(GraphSAGELayer(n_hidden, n_classes, activation=None,
dropout=dropout, use_pp=False, use_lynorm=False))
def forward(self, g):
h = g.ndata['feat']
for layer in self.layers:
h = layer(g, h)
return h
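The comment in `GraphSAGELayer.__init__` notes that the linear layer takes `2 * in_feats` inputs because the raw features are concatenated with the aggregated neighbor features. Below is a minimal standalone usage sketch (not part of this diff; the toy graph and feature sizes are made up, and it assumes the module above is saved as `modules.py`, as `cluster_gcn.py` imports it):

```python
# Usage sketch for GraphSAGELayer with made-up sizes (assumes this file is modules.py).
import torch
import torch.nn.functional as F
import dgl
from modules import GraphSAGELayer

g = dgl.graph(([0, 1, 2, 3], [1, 2, 3, 0]))   # toy 4-node directed cycle
h = torch.randn(g.num_nodes(), 8)              # 8-dimensional input features
layer = GraphSAGELayer(8, 16, activation=F.relu, dropout=0.0)
# Inside the layer: ah = degree-normalized neighbor sum of h, then
# cat([h, ah]) of width 2 * 8 = 16 goes through Linear(16, 16) and LayerNorm.
out = layer(g, h)
print(out.shape)                               # torch.Size([4, 16])
```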