Unverified commit 701b4fcc authored by Quan (Andy) Gan, committed by GitHub

[Sampling] New sampling pipeline plus asynchronous prefetching (#3665)

* initial update

* more

* more

* multi-gpu example

* cluster gcn, finalize homogeneous

* more explanation

* fix

* bunch of fixes

* fix

* RGAT example and more fixes

* shadow-gnn sampler and some changes in unit test

* fix

* wth

* more fixes

* remove shadow+node/edge dataloader tests for possible ux changes

* lints

* add legacy dataloading import just in case

* fix

* update pylint for f-strings

* fix

* lint

* lint

* lint again

* cherry-picking commit fa9f494

* oops

* fix

* add sample_neighbors in dist_graph

* fix

* lint

* fix

* fix

* fix

* fix tutorial

* fix

* fix

* fix

* fix warning

* remove debug

* add get_foo_storage apis

* lint
parent 5152a879
......@@ -17,6 +17,8 @@ and an ``EdgeDataLoader`` for edge/link prediction task.
.. autoclass:: NodeDataLoader
.. autoclass:: EdgeDataLoader
.. autoclass:: GraphDataLoader
.. autoclass:: DistNodeDataLoader
.. autoclass:: DistEdgeDataLoader
.. _api-dataloading-neighbor-sampling:
......
......@@ -202,20 +202,20 @@ DGL provides two levels of APIs for sampling nodes and edges to generate mini-ba
(see the section of mini-batch training). The low-level APIs require users to write code
to explicitly define how a layer of nodes is sampled (e.g., using :func:`dgl.sampling.sample_neighbors`).
The high-level sampling APIs implement a few popular sampling algorithms for node classification
and link prediction tasks (e.g., :class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`).
The distributed sampling module follows the same design and provides two levels of sampling APIs.
For the lower-level sampling API, it provides :func:`~dgl.distributed.sample_neighbors` for
distributed neighborhood sampling on :class:`~dgl.distributed.DistGraph`. In addition, DGL provides
a distributed DataLoader (:class:`~dgl.distributed.DistDataLoader`) for distributed sampling.
The distributed DataLoader has the same interface as the PyTorch DataLoader except that users cannot
specify the number of worker processes when creating a dataloader. The worker processes are created
in :func:`dgl.distributed.initialize`.
**Note**: When running :func:`dgl.distributed.sample_neighbors` on :class:`~dgl.distributed.DistGraph`,
the sampler cannot run in a PyTorch DataLoader with multiple worker processes. The main reason is that
the PyTorch DataLoader creates new sampling worker processes in every epoch, which leads to creating and
destroying :class:`~dgl.distributed.DistGraph` objects many times.
When using the low-level API, the sampling code is similar to single-process sampling. The only
......@@ -240,16 +240,16 @@ difference is that users need to use :func:`dgl.distributed.sample_neighbors` an
for batch in dataloader:
...
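For reference, a minimal sketch of this low-level pattern (assuming a :class:`~dgl.distributed.DistGraph`
named ``g``, a seed-node tensor ``train_nid``, and a fanout of 10; the ``sample_blocks`` helper and the
argument values are illustrative, not part of the elided example) could look like the following.

.. code:: python

    def sample_blocks(seeds):
        seeds = torch.LongTensor(seeds)
        # Low-level distributed neighbor sampling on the DistGraph.
        frontier = dgl.distributed.sample_neighbors(g, seeds, 10)
        return dgl.to_block(frontier, seeds)

    dataloader = dgl.distributed.DistDataLoader(
        dataset=train_nid, batch_size=1024,
        collate_fn=sample_blocks, shuffle=True, drop_last=False)

    for blocks in dataloader:
        ...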
The high-level sampling APIs (:class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`) have distributed counterparts
(:class:`~dgl.dataloading.pytorch.DistNodeDataLoader` and
:class:`~dgl.dataloading.pytorch.DistEdgeDataLoader`). Otherwise the code is exactly
the same as single-process sampling.
.. code:: python
sampler = dgl.sampling.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.sampling.DistNodeDataLoader(g, train_nid, sampler,
batch_size=batch_size, shuffle=True)
for batch in dataloader:
...
......
......@@ -177,9 +177,9 @@ DGL provides a sparse Adagrad optimizer, :class:`~dgl.distributed.SparseAdagr
DGL provides two levels of APIs for sampling nodes and edges to generate mini-batch training data (see the chapter on mini-batch training).
The low-level APIs require users to write code that explicitly defines how a layer of nodes is sampled (for example, using :func:`dgl.sampling.sample_neighbors`).
The high-level sampling APIs implement several popular sampling algorithms for node classification and link prediction tasks (for example,
:class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`).
The distributed sampling module follows the same design and also provides two levels of sampling APIs. For the low-level sampling API, it provides
distributed neighbor sampling on :class:`~dgl.distributed.DistGraph` via
......@@ -188,7 +188,7 @@ DGL provides two levels of APIs for sampling nodes and edges to generate mini-
The distributed DataLoader has the same interface as the PyTorch DataLoader. Its worker processes are created in :func:`dgl.distributed.initialize`.
**Note**: When running :func:`dgl.distributed.sample_neighbors` on a :class:`~dgl.distributed.DistGraph`,
the sampler cannot run in a PyTorch DataLoader with multiple worker processes. The main reason is that the PyTorch DataLoader creates new sampling worker processes in every epoch,
which leads to :class:`~dgl.distributed.DistGraph` objects being created and destroyed many times.
When using the low-level API, the sampling code is similar to single-process sampling. The only difference is that users need to use
......@@ -214,18 +214,18 @@ DGL provides two levels of APIs for sampling nodes and edges to generate mini-
for batch in dataloader:
...
The high-level sampling APIs (:class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`) have distributed counterparts,
:class:`~dgl.dataloading.pytorch.DistNodeDataLoader` and
:class:`~dgl.dataloading.pytorch.DistEdgeDataLoader`. When they are used,
the distributed sampling code is almost identical to single-process sampling.
.. code:: python
sampler = dgl.sampling.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.sampling.DistNodeDataLoader(g, train_nid, sampler,
batch_size=batch_size, shuffle=True)
for batch in dataloader:
...
......
......@@ -132,8 +132,8 @@ For transductive models that require node embeddings, DGL
Distributed Sampling
~~~~~~~~~~~~~~~~~~~~
DGL provides two levels of APIs for sampling nodes and edges to generate mini-batches (see the mini-batch training section). With the low-level API, users must write code that explicitly defines how a layer of nodes is sampled (for example, using :func:`dgl.sampling.sample_neighbors`). The high-level APIs implement several popular sampling algorithms used for node classification and link prediction (e.g., :class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`).
The distributed sampling module follows the same design and provides two levels of sampling APIs. For the low-level sampling API, :func:`~dgl.distributed.sample_neighbors` performs distributed neighbor sampling on a :class:`~dgl.distributed.DistGraph`. DGL also provides a distributed data loader, :class:`~dgl.distributed.DistDataLoader`, for distributed sampling. The distributed DataLoader has the same interface as the PyTorch DataLoader, except that users cannot specify the number of worker processes when creating the loader; worker processes are created in :func:`dgl.distributed.initialize`.
......@@ -159,12 +159,12 @@ When using the low-level API, the sampling code is similar to single-process samp
for batch in dataloader:
...
The same high-level sampling APIs (:class:`~dgl.dataloading.pytorch.NodeDataLoader` and :class:`~dgl.dataloading.pytorch.EdgeDataLoader`) work for both :class:`~dgl.DGLGraph` and :class:`~dgl.distributed.DistGraph`. When using :class:`~dgl.dataloading.pytorch.NodeDataLoader` and :class:`~dgl.dataloading.pytorch.EdgeDataLoader`, the distributed sampling code is exactly the same as single-process sampling code.
.. code:: python
sampler = dgl.sampling.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.sampling.DistNodeDataLoader(g, train_nid, sampler,
batch_size=batch_size, shuffle=True)
for batch in dataloader:
...
......
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.nn as dglnn
import time
import numpy as np
from ogb.nodeproppred import DglNodePropPredDataset
USE_WRAPPER = True
class SAGE(nn.Module):
def __init__(self, in_feats, n_hidden, n_classes):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))
self.dropout = nn.Dropout(0.5)
def forward(self, sg, x):
h = x
for l, layer in enumerate(self.layers):
h = layer(sg, h)
if l != len(self.layers) - 1:
h = F.relu(h)
h = self.dropout(h)
return h
dataset = DglNodePropPredDataset('ogbn-products')
graph, labels = dataset[0]
graph.ndata['label'] = labels
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph.ndata['train_mask'] = torch.zeros(graph.num_nodes(), dtype=torch.bool).index_fill_(0, train_idx, True)
graph.ndata['valid_mask'] = torch.zeros(graph.num_nodes(), dtype=torch.bool).index_fill_(0, valid_idx, True)
graph.ndata['test_mask'] = torch.zeros(graph.num_nodes(), dtype=torch.bool).index_fill_(0, test_idx, True)
model = SAGE(graph.ndata['feat'].shape[1], 256, dataset.num_classes).cuda()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
if USE_WRAPPER:
import dglnew
graph.create_formats_()
graph = dglnew.graph.wrapper.DGLGraphStorage(graph)
num_partitions = 1000
sampler = dgl.dataloading.ClusterGCNSampler(
graph, num_partitions,
prefetch_node_feats=['feat', 'label', 'train_mask', 'valid_mask', 'test_mask'])
# DataLoader for generic dataloading with a graph, a set of indices (any indices, like
# partition IDs here), and a graph sampler.
# NodeDataLoader and EdgeDataLoader are simply special cases of DataLoader where the
# indices are guaranteed to be node and edge IDs.
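# Each iteration over this DataLoader yields one induced subgraph ``sg`` covering the
# sampled partitions; the features listed in prefetch_node_feats above are already
# attached to sg.ndata by the time the loop body runs.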
dataloader = dgl.dataloading.DataLoader(
graph,
torch.arange(num_partitions),
sampler,
device='cuda',
batch_size=100,
shuffle=True,
drop_last=False,
pin_memory=True,
num_workers=8,
persistent_workers=True,
use_prefetch_thread=True) # TBD: could probably remove this argument
durations = []
for _ in range(10):
t0 = time.time()
for it, sg in enumerate(dataloader):
x = sg.ndata['feat']
y = sg.ndata['label'][:, 0]
m = sg.ndata['train_mask']
y_hat = model(sg, x)
loss = F.cross_entropy(y_hat[m], y[m])
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.accuracy(y_hat[m], y[m])
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
print(tt - t0)
durations.append(tt - t0)
print(np.mean(durations[4:]), np.std(durations[4:]))
from . import graph
from . import storages
from .graph import *
from .other_feature import *
from .wrapper import *
class GraphStorage(object):
def get_node_storage(self, key, ntype=None):
pass
def get_edge_storage(self, key, etype=None):
pass
# Required for checking whether a single dict is allowed for ndata and edata.
@property
def ntypes(self):
pass
@property
def canonical_etypes(self):
pass
@property
def etypes(self):
# Edge type names derived from the canonical (src type, edge type, dst type) triplets.
return [etype[1] for etype in self.canonical_etypes]
def sample_neighbors(self, seed_nodes, fanout, edge_dir='in', prob=None,
exclude_edges=None, replace=False, output_device=None):
"""Return a DGLGraph which is a subgraph induced by sampling neighboring edges of
the given nodes.
See ``dgl.sampling.sample_neighbors`` for detailed semantics.
Parameters
----------
seed_nodes : Tensor or dict[str, Tensor]
Node IDs to sample neighbors from.
This argument can take a single ID tensor or a dictionary of node types and ID tensors.
If a single tensor is given, the graph must only have one type of nodes.
fanout : int or dict[etype, int]
The number of edges to be sampled for each node on each edge type.
This argument can take a single int or a dictionary of edge types and ints.
If a single int is given, DGL will sample this number of edges for each node for
every edge type.
If -1 is given for a single edge type, all the neighboring edges with that edge
type will be selected.
prob : str, optional
Feature name used as the (unnormalized) probabilities associated with each
neighboring edge of a node. The feature must have only one element for each
edge.
The features must be non-negative floats, and the sum of the features of
inbound/outbound edges for every node must be positive (though they don't have
to sum up to one). Otherwise, the result will be undefined.
If :attr:`prob` is not None, GPU sampling is not supported.
exclude_edges: tensor or dict
Edge IDs to exclude during sampling neighbors for the seed nodes.
This argument can take a single ID tensor or a dictionary of edge types and ID tensors.
If a single tensor is given, the graph must only have one type of edges.
replace : bool, optional
If True, sample with replacement.
output_device : Framework-specific device context object, optional
The output device. Default is the same as the input graph.
Returns
-------
DGLGraph
A sampled subgraph with the same nodes as the original graph, but only the sampled neighboring
edges. The induced edge IDs will be in ``edata[dgl.EID]``.
"""
pass
# Required in Cluster-GCN
def subgraph(self, nodes, relabel_nodes=False, output_device=None):
"""Return a subgraph induced on given nodes.
This has the same semantics as ``dgl.node_subgraph``.
Parameters
----------
nodes : nodes or dict[str, nodes]
The nodes to form the subgraph. The allowed nodes formats are:
* Int Tensor: Each element is a node ID. The tensor must have the same device type
and ID data type as the graph's.
* iterable[int]: Each element is a node ID.
* Bool Tensor: Each :math:`i^{th}` element is a bool flag indicating whether
node :math:`i` is in the subgraph.
If the graph is homogeneous, one can directly pass the above formats.
Otherwise, the argument must be a dictionary with keys being node types
and values being the node IDs in the above formats.
relabel_nodes : bool, optional
If True, the extracted subgraph will only have the nodes in the specified node set
and it will relabel the nodes in order.
output_device : Framework-specific device context object, optional
The output device. Default is the same as the input graph.
Returns
-------
DGLGraph
The subgraph.
"""
pass
# Required in Link Prediction
def edge_subgraph(self, edges, relabel_nodes=False, output_device=None):
"""Return a subgraph induced on given edges.
This has the same semantics as ``dgl.edge_subgraph``.
Parameters
----------
edges : edges or dict[(str, str, str), edges]
The edges to form the subgraph. The allowed edges formats are:
* Int Tensor: Each element is an edge ID. The tensor must have the same device type
and ID data type as the graph's.
* iterable[int]: Each element is an edge ID.
* Bool Tensor: Each :math:`i^{th}` element is a bool flag indicating whether
edge :math:`i` is in the subgraph.
If the graph is homogeneous, one can directly pass the above formats.
Otherwise, the argument must be a dictionary with keys being edge types
and values being the edge IDs in the above formats.
relabel_nodes : bool, optional
If True, the extracted subgraph will only have the nodes in the specified node set
and it will relabel the nodes in order.
output_device : Framework-specific device context object, optional
The output device. Default is the same as the input graph.
Returns
-------
DGLGraph
The subgraph.
"""
pass
# Required in Link Prediction negative sampler
def find_edges(self, edges, etype=None, output_device=None):
"""Return the source and destination node IDs given the edge IDs within the given edge type.
"""
pass
# Required in Link Prediction negative sampler
def num_nodes(self, ntype):
"""Return the number of nodes for the given node type."""
pass
def global_uniform_negative_sampling(self, num_samples, exclude_self_loops=True,
replace=False, etype=None):
"""Per source negative sampling as in ``dgl.dataloading.GlobalUniform``"""
from collections.abc import Mapping
from dgl.storages import wrap_storage
from dgl.utils import recursive_apply
# A GraphStorage class where ndata and edata can be any FeatureStorage but
# otherwise the same as the wrapped DGLGraph.
class OtherFeatureGraphStorage(object):
def __init__(self, g, ndata=None, edata=None):
self.g = g
self._ndata = recursive_apply(ndata, wrap_storage) if ndata is not None else {}
self._edata = recursive_apply(edata, wrap_storage) if edata is not None else {}
for k, v in self._ndata.items():
if not isinstance(v, Mapping):
assert len(self.g.ntypes) == 1
self._ndata[k] = {self.g.ntypes[0]: v}
for k, v in self._edata.items():
if not isinstance(v, Mapping):
assert len(self.g.canonical_etypes) == 1
self._edata[k] = {self.g.canonical_etypes[0]: v}
def get_node_storage(self, key, ntype=None):
if ntype is None:
ntype = self.g.ntypes[0]
return self._ndata[key][ntype]
def get_edge_storage(self, key, etype=None):
if etype is None:
etype = self.g.canonical_etypes[0]
return self._edata[key][etype]
def __getattr__(self, key):
# Forward graph-structure queries to the wrapped DGLGraph rather than
# reimplementing each method on this wrapper.
if key in ['ntypes', 'etypes', 'canonical_etypes', 'sample_neighbors',
'subgraph', 'edge_subgraph', 'find_edges', 'num_nodes']:
# Delegate to the wrapped DGLGraph instance.
return getattr(self.g, key)
else:
# ``object`` defines no ``__getattr__``, so raise AttributeError directly for unknown keys.
raise AttributeError(key)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
import torch.distributed.optim
import torchmetrics.functional as MF
import dgl
import dgl.nn as dglnn
import time
import numpy as np
from ogb.nodeproppred import DglNodePropPredDataset
USE_WRAPPER = False
class SAGE(nn.Module):
def __init__(self, in_feats, n_hidden, n_classes):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))
self.dropout = nn.Dropout(0.5)
def forward(self, blocks, x):
h = x
for l, (layer, block) in enumerate(zip(self.layers, blocks)):
h = layer(block, h)
if l != len(self.layers) - 1:
h = F.relu(h)
h = self.dropout(h)
return h
def train(rank, world_size, graph, num_classes, split_idx):
torch.cuda.set_device(rank)
dist.init_process_group('nccl', 'tcp://127.0.0.1:12347', world_size=world_size, rank=rank)
model = SAGE(graph.ndata['feat'].shape[1], 256, num_classes).cuda()
model = nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank)
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
if USE_WRAPPER:
import dglnew
graph = dglnew.graph.wrapper.DGLGraphStorage(graph)
sampler = dgl.dataloading.NeighborSampler(
[5, 5, 5], output_device='cpu', prefetch_node_feats=['feat'],
prefetch_labels=['label'])
dataloader = dgl.dataloading.NodeDataLoader(
graph,
train_idx,
sampler,
device='cuda',
batch_size=1000,
shuffle=True,
drop_last=False,
pin_memory=True,
num_workers=4,
persistent_workers=True,
use_ddp=True,
use_prefetch_thread=True) # TBD: could probably remove this argument
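# With use_ddp=True the loader splits train_idx across the DDP ranks, so each
# process trains on a distinct shard of seed nodes every epoch.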
durations = []
for _ in range(10):
t0 = time.time()
for it, (input_nodes, output_nodes, blocks) in enumerate(dataloader):
x = blocks[0].srcdata['feat']
y = blocks[-1].dstdata['label'][:, 0]
y_hat = model(blocks, x)
loss = F.cross_entropy(y_hat, y)
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.accuracy(y_hat, y)
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
if rank == 0:
print(tt - t0)
durations.append(tt - t0)
if rank == 0:
print(np.mean(durations[4:]), np.std(durations[4:]))
if __name__ == '__main__':
dataset = DglNodePropPredDataset('ogbn-products')
graph, labels = dataset[0]
graph.ndata['label'] = labels
graph.create_formats_()
split_idx = dataset.get_idx_split()
num_classes = dataset.num_classes
n_procs = 4
# Tested with mp.spawn and fork. Both worked and got 4s per epoch with 4 GPUs
# and 3.86s per epoch with 8 GPUs on p2.8x, compared to 5.2s from official examples.
#import torch.multiprocessing as mp
#mp.spawn(train, args=(n_procs, graph, num_classes, split_idx), nprocs=n_procs)
import dgl.multiprocessing as mp
procs = []
for i in range(n_procs):
p = mp.Process(target=train, args=(i, n_procs, graph, num_classes, split_idx))
p.start()
procs.append(p)
for p in procs:
p.join()
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.nn as dglnn
import time
import numpy as np
# OGB must be imported after DGL if both DGL and PyG are installed; otherwise the DataLoader will hang.
# (This is a long-standing issue.)
from ogb.nodeproppred import DglNodePropPredDataset
import dglnew
class SAGE(nn.Module):
def __init__(self, in_feats, n_hidden, n_classes):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))
self.dropout = nn.Dropout(0.5)
def forward(self, blocks, x):
h = x
for l, (layer, block) in enumerate(zip(self.layers, blocks)):
h = layer(block, h)
if l != len(self.layers) - 1:
h = F.relu(h)
h = self.dropout(h)
return h
dataset = DglNodePropPredDataset('ogbn-products')
graph, labels = dataset[0]
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
# This is an example of using feature storage other than tensors
feat_np = graph.ndata['feat'].numpy()
feat = np.memmap('feat.npy', mode='w+', shape=feat_np.shape, dtype='float32')
print(feat.shape)
feat[:] = feat_np
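# The feature matrix now lives on disk as a float32 memmap; dataloader workers
# read only the rows they need instead of keeping the full matrix in RAM.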
model = SAGE(feat.shape[1], 256, dataset.num_classes).cuda()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
graph.create_formats_()
# Because NumpyStorage is registered with memmap, one can directly add numpy memmaps
graph = dglnew.graph.OtherFeatureGraphStorage(graph, ndata={'feat': feat, 'label': labels})
#graph = dglnew.graph.OtherFeatureGraphStorage(graph,
# ndata={'feat': dgl.storages.NumpyStorage(feat), 'label': labels})
sampler = dgl.dataloading.NeighborSampler(
[5, 5, 5], output_device='cpu', prefetch_node_feats=['feat'],
prefetch_labels=['label'])
dataloader = dgl.dataloading.NodeDataLoader(
graph,
train_idx,
sampler,
device='cuda',
batch_size=1000,
shuffle=True,
drop_last=False,
pin_memory=True,
num_workers=4,
use_prefetch_thread=True) # TBD: could probably remove this argument
durations = []
for _ in range(10):
t0 = time.time()
for it, (input_nodes, output_nodes, blocks) in enumerate(dataloader):
x = blocks[0].srcdata['feat']
y = blocks[-1].dstdata['label'][:, 0]
y_hat = model(blocks, x)
loss = F.cross_entropy(y_hat, y)
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.accuracy(y_hat, y)
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
print(tt - t0)
durations.append(tt - t0)
print(np.mean(durations[4:]), np.std(durations[4:]))
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.nn as dglnn
import time
import numpy as np
# OGB must be imported after DGL if both DGL and PyG are installed; otherwise the DataLoader will hang.
# (This is a long-standing issue.)
from ogb.nodeproppred import DglNodePropPredDataset
USE_WRAPPER = True
class SAGE(nn.Module):
def __init__(self, in_feats, n_hidden, n_classes):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))
self.dropout = nn.Dropout(0.5)
def forward(self, pair_graph, neg_pair_graph, blocks, x):
h = x
for l, (layer, block) in enumerate(zip(self.layers, blocks)):
h = layer(block, h)
if l != len(self.layers) - 1:
h = F.relu(h)
h = self.dropout(h)
with pair_graph.local_scope(), neg_pair_graph.local_scope():
pair_graph.ndata['h'] = neg_pair_graph.ndata['h'] = h
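# Score every positive and negative edge by the dot product of its two
# endpoint representations (u_dot_v stores one score per edge in edata['s']).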
pair_graph.apply_edges(dgl.function.u_dot_v('h', 'h', 's'))
neg_pair_graph.apply_edges(dgl.function.u_dot_v('h', 'h', 's'))
return pair_graph.edata['s'], neg_pair_graph.edata['s']
dataset = DglNodePropPredDataset('ogbn-products')
graph, labels = dataset[0]
graph.ndata['label'] = labels
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
model = SAGE(graph.ndata['feat'].shape[1], 256, dataset.num_classes).cuda()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
num_edges = graph.num_edges()
train_eids = torch.arange(num_edges)
if USE_WRAPPER:
import dglnew
graph.create_formats_()
graph = dglnew.graph.wrapper.DGLGraphStorage(graph)
sampler = dgl.dataloading.NeighborSampler(
[5, 5, 5], output_device='cpu', prefetch_node_feats=['feat'],
prefetch_labels=['label'])
dataloader = dgl.dataloading.EdgeDataLoader(
graph,
train_eids,
sampler,
device='cuda',
batch_size=1000,
shuffle=True,
drop_last=False,
pin_memory=True,
num_workers=8,
persistent_workers=True,
use_prefetch_thread=True, # TBD: could probably remove this argument
exclude='reverse_id',
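# reverse_eids maps each edge ID to its reverse; XOR with 1 encodes the assumption
# that edges are stored as consecutive (forward, reverse) pairs, i.e. 2k and 2k+1.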
reverse_eids=torch.arange(num_edges) ^ 1,
negative_sampler=dgl.dataloading.negative_sampler.Uniform(5))
durations = []
for _ in range(10):
t0 = time.time()
for it, (input_nodes, pair_graph, neg_pair_graph, blocks) in enumerate(dataloader):
x = blocks[0].srcdata['feat']
pos_score, neg_score = model(pair_graph, neg_pair_graph, blocks, x)
pos_label = torch.ones_like(pos_score)
neg_label = torch.zeros_like(neg_score)
score = torch.cat([pos_score, neg_score])
labels = torch.cat([pos_label, neg_label])
loss = F.binary_cross_entropy_with_logits(score, labels)
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.auroc(score, labels.long())
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
print(tt - t0)
durations.append(tt - t0)
print(np.mean(durations[4:]), np.std(durations[4:]))
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.nn as dglnn
import time
import numpy as np
from ogb.nodeproppred import DglNodePropPredDataset
USE_WRAPPER = True
class SAGE(nn.Module):
def __init__(self, in_feats, n_hidden, n_classes):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))
self.dropout = nn.Dropout(0.5)
def forward(self, blocks, x):
h = x
for l, (layer, block) in enumerate(zip(self.layers, blocks)):
h = layer(block, h)
if l != len(self.layers) - 1:
h = F.relu(h)
h = self.dropout(h)
return h
dataset = DglNodePropPredDataset('ogbn-products')
graph, labels = dataset[0]
graph.ndata['label'] = labels
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
model = SAGE(graph.ndata['feat'].shape[1], 256, dataset.num_classes).cuda()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
if USE_WRAPPER:
import dglnew
graph.create_formats_()
graph = dglnew.graph.wrapper.DGLGraphStorage(graph)
sampler = dgl.dataloading.NeighborSampler(
[5, 5, 5], output_device='cpu', prefetch_node_feats=['feat'],
prefetch_labels=['label'])
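# prefetch_node_feats / prefetch_labels tell the sampler which features to slice for
# each mini-batch during sampling, so they arrive already attached to
# blocks[0].srcdata and blocks[-1].dstdata (optionally via a separate prefetching
# thread) by the time the training loop sees them.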
dataloader = dgl.dataloading.NodeDataLoader(
graph,
train_idx,
sampler,
device='cuda',
batch_size=1000,
shuffle=True,
drop_last=False,
pin_memory=True,
num_workers=16,
persistent_workers=True,
use_prefetch_thread=True) # TBD: could probably remove this argument
durations = []
for _ in range(10):
t0 = time.time()
for it, (input_nodes, output_nodes, blocks) in enumerate(dataloader):
x = blocks[0].srcdata['feat']
y = blocks[-1].dstdata['label'][:, 0]
y_hat = model(blocks, x)
loss = F.cross_entropy(y_hat, y)
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.accuracy(y_hat, y)
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
print(tt - t0)
durations.append(tt - t0)
print(np.mean(durations[4:]), np.std(durations[4:]))
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.function as fn
import dgl.nn as dglnn
from dgl.utils import recursive_apply
import time
import numpy as np
from ogb.nodeproppred import DglNodePropPredDataset
import tqdm
USE_WRAPPER = True
class HeteroGAT(nn.Module):
def __init__(self, etypes, in_feats, n_hidden, n_classes, n_heads=4):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(dglnn.HeteroGraphConv({
etype: dglnn.GATConv(in_feats, n_hidden // n_heads, n_heads)
for etype in etypes}))
self.layers.append(dglnn.HeteroGraphConv({
etype: dglnn.GATConv(n_hidden, n_hidden // n_heads, n_heads)
for etype in etypes}))
self.layers.append(dglnn.HeteroGraphConv({
etype: dglnn.GATConv(n_hidden, n_hidden // n_heads, n_heads)
for etype in etypes}))
self.dropout = nn.Dropout(0.5)
self.linear = nn.Linear(n_hidden, n_classes) # Should be HeteroLinear
def forward(self, blocks, x):
h = x
for l, (layer, block) in enumerate(zip(self.layers, blocks)):
h = layer(block, h)
# h may contain tensors with zero rows when a node type has no destination nodes in the
# block; x.view(x.shape[0], -1) would fail there, hence the explicit reshape below.
h = recursive_apply(h, lambda x: x.view(x.shape[0], x.shape[1] * x.shape[2]))
if l != len(self.layers) - 1:
h = recursive_apply(h, F.relu)
h = recursive_apply(h, self.dropout)
return self.linear(h['paper'])
dataset = DglNodePropPredDataset('ogbn-mag')
graph, labels = dataset[0]
graph.ndata['label'] = labels
# Preprocess: add reverse edges in "cites" relation, and add reverse edge types for the
# rest.
graph = dgl.AddReverse()(graph)
# Preprocess: precompute the author, topic, and institution features
graph.update_all(fn.copy_u('feat', 'm'), fn.mean('m', 'feat'), etype='rev_writes')
graph.update_all(fn.copy_u('feat', 'm'), fn.mean('m', 'feat'), etype='has_topic')
graph.update_all(fn.copy_u('feat', 'm'), fn.mean('m', 'feat'), etype='affiliated_with')
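# The three update_all calls average paper features onto authors (rev_writes) and
# fields of study (has_topic), then author features onto institutions (affiliated_with),
# so every node type ends up with an input feature.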
graph.edges['cites'].data['weight'] = torch.ones(graph.num_edges('cites')) # dummy edge weights
model = HeteroGAT(graph.etypes, graph.ndata['feat']['paper'].shape[1], 256, dataset.num_classes).cuda()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
if USE_WRAPPER:
import dglnew
graph.create_formats_()
graph = dglnew.graph.wrapper.DGLGraphStorage(graph)
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
sampler = dgl.dataloading.NeighborSampler(
[5, 5, 5], output_device='cpu',
prefetch_node_feats={k: ['feat'] for k in graph.ntypes},
prefetch_labels={'paper': ['label']},
prefetch_edge_feats={'cites': ['weight']})
dataloader = dgl.dataloading.NodeDataLoader(
graph,
train_idx,
sampler,
device='cuda',
batch_size=1000,
shuffle=True,
drop_last=False,
pin_memory=True,
num_workers=8,
persistent_workers=True,
use_prefetch_thread=True) # TBD: could probably remove this argument
durations = []
for _ in range(10):
t0 = time.time()
for it, (input_nodes, output_nodes, blocks) in enumerate(dataloader):
x = blocks[0].srcdata['feat']
y = blocks[-1].dstdata['label']['paper'][:, 0]
assert y.min() >= 0 and y.max() < dataset.num_classes
y_hat = model(blocks, x)
loss = F.cross_entropy(y_hat, y)
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.accuracy(y_hat, y)
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
print(tt - t0)
durations.append(tt - t0)
print(np.mean(durations[4:]), np.std(durations[4:]))
......@@ -69,7 +69,7 @@ class SAGE(nn.Module):
y = th.zeros(g.number_of_nodes(), self.n_hidden if l != len(self.layers) - 1 else self.n_classes)
sampler = dgl.dataloading.MultiLayerNeighborSampler([None])
dataloader = dgl.dataloading.DistNodeDataLoader(
g,
th.arange(g.number_of_nodes()),
sampler,
......
......@@ -366,7 +366,7 @@ def run(args, device, data):
val_fanouts = [int(fanout) for fanout in args.validation_fanout.split(',')]
sampler = dgl.dataloading.MultiLayerNeighborSampler(fanouts)
dataloader = dgl.dataloading.DistNodeDataLoader(
g,
{'paper': train_nid},
sampler,
......@@ -375,7 +375,7 @@ def run(args, device, data):
drop_last=False)
valid_sampler = dgl.dataloading.MultiLayerNeighborSampler(val_fanouts)
valid_dataloader = dgl.dataloading.DistNodeDataLoader(
g,
{'paper': val_nid},
valid_sampler,
......@@ -384,7 +384,7 @@ def run(args, device, data):
drop_last=False)
test_sampler = dgl.dataloading.MultiLayerNeighborSampler(val_fanouts)
test_dataloader = dgl.dataloading.DistNodeDataLoader(
g,
{'paper': test_nid},
test_sampler,
......
......@@ -287,6 +287,9 @@ class TemporalEdgeDataLoader(dgl.dataloading.EdgeDataLoader):
if dataloader_kwargs.get('num_workers', 0) > 0:
g.create_formats_()
def __iter__(self):
return iter(self.dataloader)
# ====== Fast Mode ======
# Part of code in reservoir sampling comes from PyG library
......
......@@ -292,7 +292,7 @@ if __name__ == "__main__":
if i < args.epochs-1 and args.fast_mode:
sampler.reset()
print(log_content[0], log_content[1], log_content[2])
except KeyboardInterrupt:
traceback.print_exc()
error_content = "Training Interrupted!"
f.writelines(error_content)
......
......@@ -21,9 +21,11 @@ from . import container
from . import distributed
from . import random
from . import sampling
from . import storages
from . import dataloading
from . import ops
from . import cuda
from . import _dataloading # legacy dataloading modules
from ._ffi.runtime_ctypes import TypeCode
from ._ffi.function import register_func, get_global_func, list_global_func_names, extract_ext_funcs
......
"""The ``dgl.dataloading`` package contains:
* Data loader classes for iterating over a set of nodes or edges in a graph and generating
computation dependencies via neighborhood sampling methods.
* Various sampler classes that perform neighborhood sampling for multi-layer GNNs.
* Negative samplers for link prediction.
For a holistic explanation of how the different components work together,
read the user guide :ref:`guide-minibatch`.
.. note::
This package is experimental and the interfaces may be subject
to changes in future releases. It currently only has implementations in PyTorch.
"""
from .neighbor import *
from .dataloader import *
from .cluster_gcn import *
from .shadow import *
from . import negative_sampler
from .async_transferer import AsyncTransferer
from .. import backend as F
if F.get_preferred_backend() == 'pytorch':
from .pytorch import *
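# Illustrative usage sketch (an assumption, not part of this module): a typical
# node-classification pipeline built from this package, given an in-memory
# DGLGraph ``g`` and a tensor of training node IDs ``train_nid``:
#
#     sampler = NeighborSampler([10, 25], prefetch_node_feats=['feat'])
#     dataloader = NodeDataLoader(g, train_nid, sampler,
#                                 batch_size=1024, shuffle=True, drop_last=False)
#     for input_nodes, output_nodes, blocks in dataloader:
#         ...  # run the model on the blocks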