Unverified Commit 975eb8fc authored by xiang song(charlie.song), committed by GitHub

[Distributed] Distributed node embedding and sparse optimizer (#2733)



* Draft for sparse emb

* add some notes

* Fix

* Add sparse optim for dist pytorch

* Update test

* Fix

* upd

* upd

* Fix

* Fix

* Fix bug

* add transductive example

* Fix example

* Some fix

* Upd

* Fix lint

* lint

* lint

* lint

* upd

* Fix lint

* lint

* upd

* remove dead import

* update

* lint

* update unittest

* update example

* Add adam optimizer

* Add unittest and update data

* upd

* upd

* upd

* Fix docstring and fix some bug in example code

* Update rgcn readme
Co-authored-by: Ubuntu <ubuntu@ip-172-31-57-25.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-24-210.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-2-66.ec2.internal>
parent 2d372e35
@@ -25,14 +25,23 @@ Distributed Tensor

.. autoclass:: DistTensor
    :members: part_policy, shape, dtype, name

-Distributed Embedding
----------------------
+Distributed Node Embedding
+--------------------------
+
+.. currentmodule:: dgl.distributed.nn.pytorch

-.. autoclass:: DistEmbedding
+.. autoclass:: NodeEmbedding
+
+Distributed embedding optimizer
+-------------------------------
+
+.. currentmodule:: dgl.distributed.optim.pytorch

.. autoclass:: SparseAdagrad
    :members: step

+.. autoclass:: SparseAdam
+    :members: step

Distributed workload split
--------------------------
......
@@ -9,7 +9,7 @@ This section covers the distributed APIs used in the training script. DGL provid
data structures and various APIs for initialization, distributed sampling and workload split.
For distributed training/inference, DGL provides three distributed data structures:
:class:`~dgl.distributed.DistGraph` for distributed graphs, :class:`~dgl.distributed.DistTensor` for
-distributed tensors and :class:`~dgl.distributed.DistEmbedding` for distributed learnable embeddings.
+distributed tensors and :class:`~dgl.distributed.nn.NodeEmbedding` for distributed learnable embeddings.

Initialization of the DGL distributed module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -27,7 +27,7 @@ Typically, the initialization APIs should be invoked in the following order:
    th.distributed.init_process_group(backend='gloo')

**Note**: If the training script contains user-defined functions (UDFs) that have to be invoked on
-the servers (see the section of DistTensor and DistEmbedding for more details), these UDFs have to
+the servers (see the sections on DistTensor and NodeEmbedding for more details), these UDFs have to
be declared before :func:`~dgl.distributed.initialize`.
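For example, an initializer UDF used by :class:`~dgl.distributed.DistTensor` or
:class:`~dgl.distributed.nn.NodeEmbedding` runs on the server processes. A minimal sketch of the
required ordering (the ``initializer`` below is an illustrative UDF, mirroring the one used later
in this chapter):

.. code:: python

    import dgl
    import torch as th

    # A UDF that the servers may execute; it must be declared before initialize().
    def initializer(shape, dtype):
        arr = th.zeros(shape, dtype=dtype)
        arr.uniform_(-1, 1)
        return arr

    dgl.distributed.initialize('ip_config.txt')
    th.distributed.init_process_group(backend='gloo')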
Distributed graph

@@ -125,7 +125,7 @@ in the cluster even if the :class:`~dgl.distributed.DistTensor` object disappear
    tensor = dgl.distributed.DistTensor((g.number_of_nodes(), 10), th.float32, name='test')

**Note**: :class:`~dgl.distributed.DistTensor` creation is a synchronized operation. All trainers
have to invoke the creation and the creation succeeds only when all trainers call it.

A user can add a :class:`~dgl.distributed.DistTensor` to a :class:`~dgl.distributed.DistGraph`
object as one of the node data or edge data.
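A minimal sketch of this, assuming the ``tensor`` created above and a hypothetical data name
``'test'``:

.. code:: python

    # Attach the distributed tensor as node data of the DistGraph.
    g.ndata['test'] = tensor
    # It can then be read back through the usual node-data interface.
    print(g.ndata['test'][th.arange(3)])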
@@ -153,10 +153,10 @@ computation operators, such as sum and mean.
when a machine runs multiple servers. This may result in data corruption. One way to avoid concurrent
writes to the same row of data is to run one server process on a machine.

-Distributed Embedding
-~~~~~~~~~~~~~~~~~~~~~
+Distributed NodeEmbedding
+~~~~~~~~~~~~~~~~~~~~~~~~~

-DGL provides :class:`~dgl.distributed.DistEmbedding` to support transductive models that require
+DGL provides :class:`~dgl.distributed.nn.NodeEmbedding` to support transductive models that require
node embeddings. Creating distributed embeddings is very similar to creating distributed tensors.

.. code:: python

@@ -165,7 +165,7 @@ node embeddings. Creating distributed embeddings is very similar to creating dis
        arr = th.zeros(shape, dtype=dtype)
        arr.uniform_(-1, 1)
        return arr
-    emb = dgl.distributed.DistEmbedding(g.number_of_nodes(), 10, init_func=initializer)
+    emb = dgl.distributed.nn.NodeEmbedding(g.number_of_nodes(), 10, init_func=initializer)

Internally, distributed embeddings are built on top of distributed tensors and thus have
very similar behaviors to distributed tensors. For example, when embeddings are created, they

@@ -192,7 +192,7 @@ the other for dense model parameters, as shown in the code below:
    optimizer.step()
    sparse_optimizer.step()

-**Note**: :class:`~dgl.distributed.DistEmbedding` is not a PyTorch nn module, so we cannot
+**Note**: :class:`~dgl.distributed.nn.NodeEmbedding` is not a PyTorch nn module, so we cannot
get access to it from the parameters of a PyTorch nn module.
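Continuing the snippet above, a minimal sketch of the two-optimizer pattern with the APIs
introduced in this change (``model``, ``dataloader`` and ``compute_loss`` are placeholders for the
user's own objects):

.. code:: python

    # Sparse optimizer for the distributed node embeddings created above ...
    sparse_optimizer = dgl.distributed.optim.SparseAdam([emb], lr=0.06)
    # ... and a regular PyTorch optimizer for the dense model parameters.
    optimizer = th.optim.Adam(model.parameters(), lr=0.003)

    for blocks in dataloader:
        feats = emb(blocks[0].srcdata[dgl.NID])   # pull embeddings for the input nodes
        loss = compute_loss(model(blocks, feats))
        sparse_optimizer.zero_grad()
        optimizer.zero_grad()
        loss.backward()
        sparse_optimizer.step()
        optimizer.step()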
Distributed sampling

@@ -252,7 +252,7 @@ the same as single-process sampling.
    dataloader = dgl.sampling.NodeDataLoader(g, train_nid, sampler,
                                             batch_size=batch_size, shuffle=True)
    for batch in dataloader:
        ...

Split workloads
......
@@ -16,7 +16,7 @@ For the training script, DGL provides distributed APIs that are similar to the o
mini-batch training. This makes distributed training require only small code modifications
from mini-batch training on a single machine. Below shows an example of training GraphSage
in a distributed fashion. The only code modifications are located on line 4-7:
1) initialize DGL's distributed module, 2) create a distributed graph object, and
3) split the training set and calculate the nodes for the local process.
The rest of the code, including sampler creation, model definition, training loops
are the same as :ref:`mini-batch training <guide-minibatch>`.

@@ -35,7 +35,7 @@ are the same as :ref:`mini-batch training <guide-minibatch>`.
    # Create sampler
    sampler = NeighborSampler(g, [10,25],
                              dgl.distributed.sample_neighbors,
                              device)

    dataloader = DistDataLoader(

@@ -85,7 +85,7 @@ Specifically, DGL's distributed training has three types of interacting processe
  generate mini-batches for training.

* Trainers contain multiple classes to interact with servers. It has
  :class:`~dgl.distributed.DistGraph` to get access to partitioned graph data and has
-  :class:`~dgl.distributed.DistEmbedding` and :class:`~dgl.distributed.DistTensor` to access
+  :class:`~dgl.distributed.nn.NodeEmbedding` and :class:`~dgl.distributed.DistTensor` to access
  the node/edge features/embeddings. It has
  :class:`~dgl.distributed.dist_dataloader.DistDataLoader` to
  interact with samplers to get mini-batches.
......
@@ -8,7 +8,7 @@
This section describes the distributed APIs used in the training script. DGL provides three distributed data structures and various APIs for initialization, distributed sampling and workload split.
For distributed training/inference, DGL provides three distributed data structures: :class:`~dgl.distributed.DistGraph` for distributed graphs,
:class:`~dgl.distributed.DistTensor` for distributed tensors and, for distributed learnable embeddings,
-:class:`~dgl.distributed.DistEmbedding`.
+:class:`~dgl.distributed.nn.NodeEmbedding`.

Initialization of the DGL distributed module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -24,7 +24,7 @@ Initialization of the DGL distributed module
    dgl.distributed.initialize('ip_config.txt')
    th.distributed.init_process_group(backend='gloo')

-**Note**: If the training script contains user-defined functions (UDFs) that need to be invoked on the servers (see the DistTensor and DistEmbedding sections below for details),
+**Note**: If the training script contains user-defined functions (UDFs) that need to be invoked on the servers (see the DistTensor and NodeEmbedding sections below for details),
these UDFs must be declared before :func:`~dgl.distributed.initialize`.

Distributed graph

@@ -138,7 +138,7 @@ DGL provides an interface for distributed tensors similar to that of regular single-machine tensors for accessing ...
Distributed embeddings
~~~~~~~~~~~~~~~~~~~~~~

-DGL provides :class:`~dgl.distributed.DistEmbedding` to support transductive models that require node embeddings.
+DGL provides :class:`~dgl.distributed.nn.NodeEmbedding` to support transductive models that require node embeddings.
Creating distributed embeddings is very similar to creating distributed tensors.

.. code:: python

@@ -147,7 +147,7 @@ DGL provides :class:`~dgl.distributed.nn.NodeEmbedding` to support transductive models that require
        arr = th.zeros(shape, dtype=dtype)
        arr.uniform_(-1, 1)
        return arr
-    emb = dgl.distributed.DistEmbedding(g.number_of_nodes(), 10, init_func=initializer)
+    emb = dgl.distributed.nn.NodeEmbedding(g.number_of_nodes(), 10, init_func=initializer)

Internally, distributed embeddings are built on top of distributed tensors, so they behave very similarly to distributed tensors.
For example, when embeddings are created, DGL shards them and stores them on all machines in the cluster. They can be uniquely identified by name.

@@ -169,7 +169,7 @@ DGL provides a sparse Adagrad optimizer :class:`~dgl.distributed.SparseAdagrad` ...
    optimizer.step()
    sparse_optimizer.step()

-**Note**: :class:`~dgl.distributed.DistEmbedding` is not a PyTorch nn module, so users cannot access it through the parameters of an nn module.
+**Note**: :class:`~dgl.distributed.nn.NodeEmbedding` is not a PyTorch nn module, so users cannot access it through the parameters of an nn module.

Distributed sampling
~~~~~~~~~~~~~~~~~~~~

@@ -228,7 +228,7 @@ DGL provides two levels of APIs for sampling nodes and edges to generate mini-batches ...
    dataloader = dgl.sampling.NodeDataLoader(g, train_nid, sampler,
                                             batch_size=batch_size, shuffle=True)
    for batch in dataloader:
        ...

Splitting the dataset
......
@@ -28,7 +28,7 @@ DGL adopts a fully distributed approach that distributes both data and computation across a collection of ...
    # Create sampler
    sampler = NeighborSampler(g, [10,25],
                              dgl.distributed.sample_neighbors,
                              device)

    dataloader = DistDataLoader(

@@ -74,7 +74,7 @@ DGL implements several distributed components to support distributed training. The figure below shows ...
  These servers work together to serve the graph data to the trainers. Note that one machine may run multiple server processes at the same time to parallelize computation and network communication.
* *Sampler processes* interact with the servers and sample nodes and edges to generate the mini-batches for training.
* *Trainer processes* contain multiple classes for interacting with the servers. They use :class:`~dgl.distributed.DistGraph` to access the partitioned graph data, use
-  :class:`~dgl.distributed.DistEmbedding`
+  :class:`~dgl.distributed.nn.NodeEmbedding`
  and :class:`~dgl.distributed.DistTensor` to access node/edge features/embeddings, and use
  :class:`~dgl.distributed.dist_dataloader.DistDataLoader` to interact with the samplers to get mini-batches.
......
@@ -118,7 +118,7 @@ The command below launches one training process on each machine and each trainin
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
---num_samplers 4 \
+--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
@@ -131,7 +131,7 @@ To run unsupervised training:
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
---num_samplers 4 \
+--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
@@ -144,13 +144,59 @@ By default, this code will run on CPU. If you have GPU support, you can just add
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
---num_samplers 4 \
+--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 30 --batch_size 1000 --num_gpus 4"
```
To run supervised training in the transductive setting (node features are initialized from a learnable node embedding):
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
   --num_trainers 4 \
   --num_samplers 0 \
   --num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_transductive.py --graph_name ogb-product --ip_config ip_config.txt --batch_size 1000 --num_gpu 4 --eval_every 5"
```
To run supervised training in the transductive setting using DGL's distributed NodeEmbedding:
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
   --num_trainers 4 \
   --num_samplers 0 \
   --num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_transductive.py --graph_name ogb-product --ip_config ip_config.txt --batch_size 1000 --num_gpu 4 --eval_every 5 --dgl_sparse"
```
To run unsupervised training in the transductive setting (node features are initialized from a learnable node embedding):
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_unsupervised_transductive.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 4"
```
To run unsupervised training in the transductive setting using DGL's distributed NodeEmbedding:
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_unsupervised_transductive.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 4 --dgl_sparse"
```
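**Note:** in the commands above, `--dgl_sparse` makes the scripts use DGL's distributed `NodeEmbedding` trained with `dgl.distributed.optim.SparseAdam`; without the flag they fall back to PyTorch's `torch.nn.Embedding(sparse=True)` trained with `torch.optim.SparseAdam`.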
**Note:** if you are using conda or other virtual environments on the remote machines, you need to replace `python3` in the command string (i.e. the last argument) with the path to the Python interpreter in that environment.

## Distributed code runs in the standalone mode
......
-172.31.19.1
-172.31.23.205
+172.31.2.66
+172.31.1.191
+172.31.29.175
+172.31.16.98
\ No newline at end of file
@@ -21,20 +21,21 @@ import torch.optim as optim
import torch.multiprocessing as mp
from torch.utils.data import DataLoader

-def load_subtensor(g, seeds, input_nodes, device):
+def load_subtensor(g, seeds, input_nodes, device, load_feat=True):
    """
    Copies features and labels of a set of nodes onto GPU.
    """
-    batch_inputs = g.ndata['features'][input_nodes].to(device)
+    batch_inputs = g.ndata['features'][input_nodes].to(device) if load_feat else None
    batch_labels = g.ndata['labels'][seeds].to(device)
    return batch_inputs, batch_labels

class NeighborSampler(object):
-    def __init__(self, g, fanouts, sample_neighbors, device):
+    def __init__(self, g, fanouts, sample_neighbors, device, load_feat=True):
        self.g = g
        self.fanouts = fanouts
        self.sample_neighbors = sample_neighbors
        self.device = device
+        self.load_feat = load_feat

    def sample_blocks(self, seeds):
        seeds = th.LongTensor(np.asarray(seeds))
@@ -51,8 +52,9 @@ class NeighborSampler(object):
        input_nodes = blocks[0].srcdata[dgl.NID]
        seeds = blocks[-1].dstdata[dgl.NID]

-        batch_inputs, batch_labels = load_subtensor(self.g, seeds, input_nodes, "cpu")
-        blocks[0].srcdata['features'] = batch_inputs
+        batch_inputs, batch_labels = load_subtensor(self.g, seeds, input_nodes, "cpu", self.load_feat)
+        if self.load_feat:
+            blocks[0].srcdata['features'] = batch_inputs
        blocks[-1].dstdata['labels'] = batch_labels
        return blocks

@@ -289,7 +291,7 @@ if __name__ == '__main__':
    parser.add_argument('--part_config', type=str, help='The path to the partition config file')
    parser.add_argument('--num_clients', type=int, help='The number of clients')
    parser.add_argument('--n_classes', type=int, help='the number of classes')
    parser.add_argument('--num_gpus', type=int, default=-1,
                        help="the number of GPU device. Use -1 for CPU training")
    parser.add_argument('--num_epochs', type=int, default=20)
    parser.add_argument('--num_hidden', type=int, default=16)
......
import os
os.environ['DGLBACKEND']='pytorch'
from multiprocessing import Process
import argparse, time, math
import numpy as np
from functools import wraps
import tqdm
import dgl
from dgl import DGLGraph
from dgl.data import register_data_args, load_data
from dgl.data.utils import load_graphs
import dgl.function as fn
import dgl.nn.pytorch as dglnn
from dgl.distributed import DistDataLoader
from dgl.distributed.nn import NodeEmbedding
import torch as th
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.multiprocessing as mp
from torch.utils.data import DataLoader
from train_dist import DistSAGE, NeighborSampler, compute_acc
class TransDistSAGE(DistSAGE):
def __init__(self, in_feats, n_hidden, n_classes, n_layers,
activation, dropout):
super(TransDistSAGE, self).__init__(in_feats, n_hidden, n_classes, n_layers, activation, dropout)
def inference(self, standalone, g, x, batch_size, device):
"""
Inference with the GraphSAGE model on full neighbors (i.e. without neighbor sampling).
g : the entire graph.
x : the input of entire node set.
The inference code is written in a fashion that it could handle any number of nodes and
layers.
"""
# During inference with sampling, multi-layer blocks are very inefficient because
# lots of computations in the first few layers are repeated.
# Therefore, we compute the representation of all nodes layer by layer. The nodes
# on each layer are of course splitted in batches.
# TODO: can we standardize this?
nodes = dgl.distributed.node_split(np.arange(g.number_of_nodes()),
g.get_partition_book(), force_even=True)
y = dgl.distributed.DistTensor((g.number_of_nodes(), self.n_hidden), th.float32, 'h',
persistent=True)
for l, layer in enumerate(self.layers):
if l == len(self.layers) - 1:
y = dgl.distributed.DistTensor((g.number_of_nodes(), self.n_classes),
th.float32, 'h_last', persistent=True)
sampler = NeighborSampler(g, [-1], dgl.distributed.sample_neighbors, device, load_feat=False)
print('|V|={}, eval batch size: {}'.format(g.number_of_nodes(), batch_size))
# Create PyTorch DataLoader for constructing blocks
dataloader = DistDataLoader(
dataset=nodes,
batch_size=batch_size,
collate_fn=sampler.sample_blocks,
shuffle=False,
drop_last=False)
for blocks in tqdm.tqdm(dataloader):
block = blocks[0].to(device)
input_nodes = block.srcdata[dgl.NID]
output_nodes = block.dstdata[dgl.NID]
h = x[input_nodes].to(device)
h_dst = h[:block.number_of_dst_nodes()]
h = layer(block, (h, h_dst))
if l != len(self.layers) - 1:
h = self.activation(h)
h = self.dropout(h)
y[output_nodes] = h.cpu()
x = y
g.barrier()
return y
def initializer(shape, dtype):
arr = th.zeros(shape, dtype=dtype)
arr.uniform_(-1, 1)
return arr
class DistEmb(nn.Module):
def __init__(self, num_nodes, emb_size, dgl_sparse_emb=False, dev_id='cpu'):
super().__init__()
self.dev_id = dev_id
self.emb_size = emb_size
self.dgl_sparse_emb = dgl_sparse_emb
if dgl_sparse_emb:
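# DGL's distributed NodeEmbedding shards the embedding table across the machines in the
# cluster; it is trained with one of the dgl.distributed.optim sparse optimizers below.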
self.sparse_emb = NodeEmbedding(num_nodes, emb_size, name='sage', init_func=initializer)
else:
self.sparse_emb = th.nn.Embedding(num_nodes, emb_size, sparse=True)
nn.init.uniform_(self.sparse_emb.weight, -1.0, 1.0)
def forward(self, idx):
# embeddings are stored in cpu
idx = idx.cpu()
if self.dgl_sparse_emb:
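# A NodeEmbedding lookup pulls the requested rows from the distributed store onto dev_id.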
return self.sparse_emb(idx, device=self.dev_id)
else:
return self.sparse_emb(idx).to(self.dev_id)
def load_embs(standalone, emb_layer, g):
nodes = dgl.distributed.node_split(np.arange(g.number_of_nodes()),
g.get_partition_book(), force_even=True)
x = dgl.distributed.DistTensor(
(g.number_of_nodes(),
emb_layer.module.emb_size \
if isinstance(emb_layer, th.nn.parallel.DistributedDataParallel) \
else emb_layer.emb_size),
th.float32, 'eval_embs',
persistent=True)
num_nodes = nodes.shape[0]
for i in range((num_nodes + 1023) // 1024):
idx = nodes[i * 1024: (i+1) * 1024 \
if (i+1) * 1024 < num_nodes \
else num_nodes]
embeds = emb_layer(idx).cpu()
x[idx] = embeds
if not standalone:
g.barrier()
return x
def evaluate(standalone, model, emb_layer, g, labels, val_nid, test_nid, batch_size, device):
"""
Evaluate the model on the validation set specified by ``val_nid``.
g : The entire graph.
inputs : The features of all the nodes.
labels : The labels of all the nodes.
val_nid : the node Ids for validation.
batch_size : Number of nodes to compute at the same time.
device : The GPU device to evaluate on.
"""
model.eval()
emb_layer.eval()
with th.no_grad():
inputs = load_embs(standalone, emb_layer, g)
pred = model.inference(standalone, g, inputs, batch_size, device)
model.train()
emb_layer.train()
return compute_acc(pred[val_nid], labels[val_nid]), compute_acc(pred[test_nid], labels[test_nid])
def run(args, device, data):
# Unpack data
train_nid, val_nid, test_nid, n_classes, g = data
# Create sampler
sampler = NeighborSampler(g, [int(fanout) for fanout in args.fan_out.split(',')],
dgl.distributed.sample_neighbors, device, load_feat=False)
# Create DataLoader for constructing blocks
dataloader = DistDataLoader(
dataset=train_nid.numpy(),
batch_size=args.batch_size,
collate_fn=sampler.sample_blocks,
shuffle=True,
drop_last=False)
# Define model and optimizer
emb_layer = DistEmb(g.num_nodes(), args.num_hidden, dgl_sparse_emb=args.dgl_sparse, dev_id=device)
model = TransDistSAGE(args.num_hidden, args.num_hidden, n_classes, args.num_layers, F.relu, args.dropout)
model = model.to(device)
if not args.standalone:
if args.num_gpus == -1:
model = th.nn.parallel.DistributedDataParallel(model)
else:
dev_id = g.rank() % args.num_gpus
model = th.nn.parallel.DistributedDataParallel(model, device_ids=[dev_id], output_device=dev_id)
if not args.dgl_sparse:
emb_layer = th.nn.parallel.DistributedDataParallel(emb_layer)
loss_fcn = nn.CrossEntropyLoss()
loss_fcn = loss_fcn.to(device)
optimizer = optim.Adam(model.parameters(), lr=args.lr)
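# The embedding table gets its own sparse optimizer: DGL's distributed SparseAdam when
# --dgl_sparse is set, otherwise PyTorch's SparseAdam over the local torch.nn.Embedding.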
if args.dgl_sparse:
emb_optimizer = dgl.distributed.optim.SparseAdam([emb_layer.sparse_emb], lr=args.sparse_lr)
print('optimize DGL sparse embedding:', emb_layer.sparse_emb)
elif args.standalone:
emb_optimizer = th.optim.SparseAdam(list(emb_layer.sparse_emb.parameters()), lr=args.sparse_lr)
print('optimize Pytorch sparse embedding:', emb_layer.sparse_emb)
else:
emb_optimizer = th.optim.SparseAdam(list(emb_layer.module.sparse_emb.parameters()), lr=args.sparse_lr)
print('optimize Pytorch sparse embedding:', emb_layer.module.sparse_emb)
train_size = th.sum(g.ndata['train_mask'][0:g.number_of_nodes()])
# Training loop
iter_tput = []
epoch = 0
for epoch in range(args.num_epochs):
tic = time.time()
sample_time = 0
forward_time = 0
backward_time = 0
update_time = 0
num_seeds = 0
num_inputs = 0
start = time.time()
# Loop over the dataloader to sample the computation dependency graph as a list of
# blocks.
step_time = []
for step, blocks in enumerate(dataloader):
tic_step = time.time()
sample_time += tic_step - start
# The nodes for input lies at the LHS side of the first block.
# The nodes for output lies at the RHS side of the last block.
batch_inputs = blocks[0].srcdata[dgl.NID]
batch_labels = blocks[-1].dstdata['labels']
batch_labels = batch_labels.long()
num_seeds += len(blocks[-1].dstdata[dgl.NID])
num_inputs += len(blocks[0].srcdata[dgl.NID])
blocks = [block.to(device) for block in blocks]
batch_labels = batch_labels.to(device)
# Compute loss and prediction
start = time.time()
batch_inputs = emb_layer(batch_inputs)
batch_pred = model(blocks, batch_inputs)
loss = loss_fcn(batch_pred, batch_labels)
forward_end = time.time()
emb_optimizer.zero_grad()
optimizer.zero_grad()
loss.backward()
compute_end = time.time()
forward_time += forward_end - start
backward_time += compute_end - forward_end
emb_optimizer.step()
optimizer.step()
update_time += time.time() - compute_end
step_t = time.time() - tic_step
step_time.append(step_t)
iter_tput.append(len(blocks[-1].dstdata[dgl.NID]) / step_t)
if step % args.log_every == 0:
acc = compute_acc(batch_pred, batch_labels)
gpu_mem_alloc = th.cuda.max_memory_allocated() / 1000000 if th.cuda.is_available() else 0
print('Part {} | Epoch {:05d} | Step {:05d} | Loss {:.4f} | Train Acc {:.4f} | Speed (samples/sec) {:.4f} | GPU {:.1f} MB | time {:.3f} s'.format(
g.rank(), epoch, step, loss.item(), acc.item(), np.mean(iter_tput[3:]), gpu_mem_alloc, np.sum(step_time[-args.log_every:])))
start = time.time()
toc = time.time()
print('Part {}, Epoch Time(s): {:.4f}, sample+data_copy: {:.4f}, forward: {:.4f}, backward: {:.4f}, update: {:.4f}, #seeds: {}, #inputs: {}'.format(
g.rank(), toc - tic, sample_time, forward_time, backward_time, update_time, num_seeds, num_inputs))
epoch += 1
if epoch % args.eval_every == 0 and epoch != 0:
start = time.time()
val_acc, test_acc = evaluate(args.standalone, model.module, emb_layer, g,
g.ndata['labels'], val_nid, test_nid, args.batch_size_eval, device)
print('Part {}, Val Acc {:.4f}, Test Acc {:.4f}, time: {:.4f}'.format(g.rank(), val_acc, test_acc, time.time()-start))
def main(args):
dgl.distributed.initialize(args.ip_config)
if not args.standalone:
th.distributed.init_process_group(backend='gloo')
g = dgl.distributed.DistGraph(args.graph_name, part_config=args.part_config)
print('rank:', g.rank())
pb = g.get_partition_book()
train_nid = dgl.distributed.node_split(g.ndata['train_mask'], pb, force_even=True)
val_nid = dgl.distributed.node_split(g.ndata['val_mask'], pb, force_even=True)
test_nid = dgl.distributed.node_split(g.ndata['test_mask'], pb, force_even=True)
local_nid = pb.partid2nids(pb.partid).detach().numpy()
print('part {}, train: {} (local: {}), val: {} (local: {}), test: {} (local: {})'.format(
g.rank(), len(train_nid), len(np.intersect1d(train_nid.numpy(), local_nid)),
len(val_nid), len(np.intersect1d(val_nid.numpy(), local_nid)),
len(test_nid), len(np.intersect1d(test_nid.numpy(), local_nid))))
if args.num_gpus == -1:
device = th.device('cpu')
else:
device = th.device('cuda:'+str(g.rank() % args.num_gpus))
labels = g.ndata['labels'][np.arange(g.number_of_nodes())]
n_classes = len(th.unique(labels[th.logical_not(th.isnan(labels))]))
print('#labels:', n_classes)
# Pack data
data = train_nid, val_nid, test_nid, n_classes, g
run(args, device, data)
print("parent ends")
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GCN')
register_data_args(parser)
parser.add_argument('--graph_name', type=str, help='graph name')
parser.add_argument('--id', type=int, help='the partition id')
parser.add_argument('--ip_config', type=str, help='The file for IP configuration')
parser.add_argument('--part_config', type=str, help='The path to the partition config file')
parser.add_argument('--num_clients', type=int, help='The number of clients')
parser.add_argument('--n_classes', type=int, help='the number of classes')
parser.add_argument('--num_gpus', type=int, default=-1,
help="the number of GPU device. Use -1 for CPU training")
parser.add_argument('--num_epochs', type=int, default=20)
parser.add_argument('--num_hidden', type=int, default=16)
parser.add_argument('--num_layers', type=int, default=2)
parser.add_argument('--fan_out', type=str, default='10,25')
parser.add_argument('--batch_size', type=int, default=1000)
parser.add_argument('--batch_size_eval', type=int, default=100000)
parser.add_argument('--log_every', type=int, default=20)
parser.add_argument('--eval_every', type=int, default=5)
parser.add_argument('--lr', type=float, default=0.003)
parser.add_argument('--dropout', type=float, default=0.5)
parser.add_argument('--local_rank', type=int, help='get rank of the process')
parser.add_argument('--standalone', action='store_true', help='run in the standalone mode')
parser.add_argument("--dgl_sparse", action='store_true',
help='Whether to use DGL sparse embedding')
parser.add_argument("--sparse_lr", type=float, default=1e-2,
help="sparse lr rate")
args = parser.parse_args()
print(args)
main(args)
@@ -68,19 +68,18 @@ class SAGE(nn.Module):
        for l, layer in enumerate(self.layers):
            y = th.zeros(g.number_of_nodes(), self.n_hidden if l != len(self.layers) - 1 else self.n_classes)

-            sampler = dgl.sampling.MultiLayerNeighborSampler([None])
-            dataloader = dgl.sampling.NodeDataLoader(
+            sampler = dgl.dataloading.MultiLayerNeighborSampler([None])
+            dataloader = dgl.dataloading.NodeDataLoader(
                g,
                th.arange(g.number_of_nodes()),
                sampler,
-                batch_size=args.batch_size,
+                batch_size=batch_size,
                shuffle=True,
                drop_last=False,
-                num_workers=args.num_workers)
+                num_workers=0)

            for input_nodes, output_nodes, blocks in tqdm.tqdm(dataloader):
                block = blocks[0]
                block = block.int().to(device)
                h = x[input_nodes].to(device)
                h = layer(block, h)
@@ -93,7 +92,6 @@ class SAGE(nn.Module):
            x = y
        return y

class NegativeSampler(object):
    def __init__(self, g, neg_nseeds):
        self.neg_nseeds = neg_nseeds
@@ -270,7 +268,7 @@ def generate_emb(model, g, inputs, batch_size, device):
def compute_acc(emb, labels, train_nids, val_nids, test_nids):
    """
    Compute the accuracy of prediction given the labels.

    We will first train a LogisticRegression model using the trained embeddings;
    the training set, validation set and test set are provided as the arguments.
@@ -459,7 +457,7 @@ if __name__ == '__main__':
    parser.add_argument('--ip_config', type=str, help='The file for IP configuration')
    parser.add_argument('--part_config', type=str, help='The path to the partition config file')
    parser.add_argument('--n_classes', type=int, help='the number of classes')
    parser.add_argument('--num_gpus', type=int, default=-1,
                        help="the number of GPU device. Use -1 for CPU training")
    parser.add_argument('--num_epochs', type=int, default=20)
    parser.add_argument('--num_hidden', type=int, default=16)
@@ -479,6 +477,5 @@ if __name__ == '__main__':
    parser.add_argument('--remove_edge', default=False, action='store_true',
                        help="whether to remove edges during sampling")
    args = parser.parse_args()
    print(args)
    main(args)
import os
os.environ['DGLBACKEND']='pytorch'
from multiprocessing import Process
import argparse, time, math
import numpy as np
from functools import wraps
import tqdm
import sklearn.linear_model as lm
import sklearn.metrics as skm
import dgl
from dgl import DGLGraph
from dgl.data import register_data_args, load_data
from dgl.data.utils import load_graphs
import dgl.function as fn
import dgl.nn.pytorch as dglnn
import torch as th
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.multiprocessing as mp
from dgl.distributed import DistDataLoader
from dgl.distributed.optim import SparseAdagrad
from train_dist_unsupervised import SAGE, NeighborSampler, PosNeighborSampler, CrossEntropyLoss, compute_acc
from train_dist_transductive import DistEmb, load_embs
def generate_emb(standalone, model, emb_layer, g, batch_size, device):
"""
Generate embeddings for each node
emb_layer : Embedding layer
g : The entire graph.
inputs : The features of all the nodes.
batch_size : Number of nodes to compute at the same time.
device : The GPU device to evaluate on.
"""
model.eval()
emb_layer.eval()
with th.no_grad():
inputs = load_embs(standalone, emb_layer, g)
pred = model.inference(g, inputs, batch_size, device)
g.barrier()
return pred
def run(args, device, data):
# Unpack data
train_eids, train_nids, g, global_train_nid, global_valid_nid, global_test_nid, labels = data
# Create sampler
sampler = NeighborSampler(g, [int(fanout) for fanout in args.fan_out.split(',')], train_nids,
dgl.distributed.sample_neighbors, args.num_negs, args.remove_edge)
# Create PyTorch DataLoader for constructing blocks
dataloader = dgl.distributed.DistDataLoader(
dataset=train_eids.numpy(),
batch_size=args.batch_size,
collate_fn=sampler.sample_blocks,
shuffle=True,
drop_last=False)
# Define model and optimizer
emb_layer = DistEmb(g.num_nodes(), args.num_hidden, dgl_sparse_emb=args.dgl_sparse, dev_id=device)
model = SAGE(args.num_hidden, args.num_hidden, args.num_hidden, args.num_layers, F.relu, args.dropout)
model = model.to(device)
if not args.standalone:
if args.num_gpus == -1:
model = th.nn.parallel.DistributedDataParallel(model)
else:
dev_id = g.rank() % args.num_gpus
model = th.nn.parallel.DistributedDataParallel(model, device_ids=[dev_id], output_device=dev_id)
if not args.dgl_sparse:
emb_layer = th.nn.parallel.DistributedDataParallel(emb_layer)
loss_fcn = CrossEntropyLoss()
loss_fcn = loss_fcn.to(device)
optimizer = optim.Adam(model.parameters(), lr=args.lr)
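# The embedding table gets its own sparse optimizer: DGL's distributed SparseAdam when
# --dgl_sparse is set, otherwise PyTorch's SparseAdam over the local torch.nn.Embedding.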
if args.dgl_sparse:
emb_optimizer = dgl.distributed.optim.SparseAdam([emb_layer.sparse_emb], lr=args.sparse_lr)
print('optimize DGL sparse embedding:', emb_layer.sparse_emb)
elif args.standalone:
emb_optimizer = th.optim.SparseAdam(list(emb_layer.sparse_emb.parameters()), lr=args.sparse_lr)
print('optimize Pytorch sparse embedding:', emb_layer.sparse_emb)
else:
emb_optimizer = th.optim.SparseAdam(list(emb_layer.module.sparse_emb.parameters()), lr=args.sparse_lr)
print('optimize Pytorch sparse embedding:', emb_layer.module.sparse_emb)
# Training loop
epoch = 0
for epoch in range(args.num_epochs):
sample_time = 0
copy_time = 0
forward_time = 0
backward_time = 0
update_time = 0
num_seeds = 0
num_inputs = 0
step_time = []
iter_t = []
sample_t = []
feat_copy_t = []
forward_t = []
backward_t = []
update_t = []
iter_tput = []
start = time.time()
# Loop over the dataloader to sample the computation dependency graph as a list of
# blocks.
for step, (pos_graph, neg_graph, blocks) in enumerate(dataloader):
tic_step = time.time()
sample_t.append(tic_step - start)
pos_graph = pos_graph.to(device)
neg_graph = neg_graph.to(device)
blocks = [block.to(device) for block in blocks]
# The nodes for input lies at the LHS side of the first block.
# The nodes for output lies at the RHS side of the last block.
# Load the input features as well as output labels
batch_inputs = blocks[0].srcdata[dgl.NID]
copy_time = time.time()
feat_copy_t.append(copy_time - tic_step)
# Compute loss and prediction
batch_inputs = emb_layer(batch_inputs)
batch_pred = model(blocks, batch_inputs)
loss = loss_fcn(batch_pred, pos_graph, neg_graph)
forward_end = time.time()
emb_optimizer.zero_grad()
optimizer.zero_grad()
loss.backward()
compute_end = time.time()
forward_t.append(forward_end - copy_time)
backward_t.append(compute_end - forward_end)
# Aggregate gradients in multiple nodes.
emb_optimizer.step()
optimizer.step()
update_t.append(time.time() - compute_end)
pos_edges = pos_graph.number_of_edges()
neg_edges = neg_graph.number_of_edges()
step_t = time.time() - start
step_time.append(step_t)
iter_tput.append(pos_edges / step_t)
num_seeds += pos_edges
if step % args.log_every == 0:
print('[{}] Epoch {:05d} | Step {:05d} | Loss {:.4f} | Speed (samples/sec) {:.4f} | time {:.3f} s' \
'| sample {:.3f} | copy {:.3f} | forward {:.3f} | backward {:.3f} | update {:.3f}'.format(
g.rank(), epoch, step, loss.item(), np.mean(iter_tput[3:]), np.sum(step_time[-args.log_every:]),
np.sum(sample_t[-args.log_every:]), np.sum(feat_copy_t[-args.log_every:]), np.sum(forward_t[-args.log_every:]),
np.sum(backward_t[-args.log_every:]), np.sum(update_t[-args.log_every:])))
start = time.time()
print('[{}]Epoch Time(s): {:.4f}, sample: {:.4f}, data copy: {:.4f}, forward: {:.4f}, backward: {:.4f}, update: {:.4f}, #seeds: {}, #inputs: {}'.format(
g.rank(), np.sum(step_time), np.sum(sample_t), np.sum(feat_copy_t), np.sum(forward_t), np.sum(backward_t), np.sum(update_t), num_seeds, num_inputs))
epoch += 1
# evaluate the embedding using LogisticRegression
if args.standalone:
pred = generate_emb(True, model, emb_layer, g, args.batch_size_eval, device)
else:
pred = generate_emb(False, model.module, emb_layer, g, args.batch_size_eval, device)
if g.rank() == 0:
eval_acc, test_acc = compute_acc(pred, labels, global_train_nid, global_valid_nid, global_test_nid)
print('eval acc {:.4f}; test acc {:.4f}'.format(eval_acc, test_acc))
# sync for eval and test
if not args.standalone:
th.distributed.barrier()
if not args.standalone:
g._client.barrier()
# save features into file
if g.rank() == 0:
th.save(pred, 'emb.pt')
else:
feat = g.ndata['features']
th.save(pred, 'emb.pt')
def main(args):
dgl.distributed.initialize(args.ip_config)
if not args.standalone:
th.distributed.init_process_group(backend='gloo')
g = dgl.distributed.DistGraph(args.graph_name, part_config=args.part_config)
print('rank:', g.rank())
print('number of edges', g.number_of_edges())
train_eids = dgl.distributed.edge_split(th.ones((g.number_of_edges(),), dtype=th.bool), g.get_partition_book(), force_even=True)
train_nids = dgl.distributed.node_split(th.ones((g.number_of_nodes(),), dtype=th.bool), g.get_partition_book())
global_train_nid = th.LongTensor(np.nonzero(g.ndata['train_mask'][np.arange(g.number_of_nodes())]))
global_valid_nid = th.LongTensor(np.nonzero(g.ndata['val_mask'][np.arange(g.number_of_nodes())]))
global_test_nid = th.LongTensor(np.nonzero(g.ndata['test_mask'][np.arange(g.number_of_nodes())]))
labels = g.ndata['labels'][np.arange(g.number_of_nodes())]
if args.num_gpus == -1:
device = th.device('cpu')
else:
device = th.device('cuda:'+str(g.rank() % args.num_gpus))
# Pack data
global_train_nid = global_train_nid.squeeze()
global_valid_nid = global_valid_nid.squeeze()
global_test_nid = global_test_nid.squeeze()
print("number of train {}".format(global_train_nid.shape[0]))
print("number of valid {}".format(global_valid_nid.shape[0]))
print("number of test {}".format(global_test_nid.shape[0]))
data = train_eids, train_nids, g, global_train_nid, global_valid_nid, global_test_nid, labels
run(args, device, data)
print("parent ends")
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GCN')
register_data_args(parser)
parser.add_argument('--graph_name', type=str, help='graph name')
parser.add_argument('--id', type=int, help='the partition id')
parser.add_argument('--ip_config', type=str, help='The file for IP configuration')
parser.add_argument('--part_config', type=str, help='The path to the partition config file')
parser.add_argument('--n_classes', type=int, help='the number of classes')
parser.add_argument('--num_gpus', type=int, default=-1,
help="the number of GPU device. Use -1 for CPU training")
parser.add_argument('--num_epochs', type=int, default=5)
parser.add_argument('--num_hidden', type=int, default=16)
parser.add_argument('--num-layers', type=int, default=2)
parser.add_argument('--fan_out', type=str, default='10,25')
parser.add_argument('--batch_size', type=int, default=1000)
parser.add_argument('--batch_size_eval', type=int, default=100000)
parser.add_argument('--log_every', type=int, default=20)
parser.add_argument('--eval_every', type=int, default=5)
parser.add_argument('--lr', type=float, default=0.003)
parser.add_argument('--dropout', type=float, default=0.5)
parser.add_argument('--local_rank', type=int, help='get rank of the process')
parser.add_argument('--standalone', action='store_true', help='run in the standalone mode')
parser.add_argument('--num_negs', type=int, default=1)
parser.add_argument('--neg_share', default=False, action='store_true',
help="sharing neg nodes for positive nodes")
parser.add_argument('--remove_edge', default=False, action='store_true',
help="whether to remove edges during sampling")
parser.add_argument("--dgl_sparse", action='store_true',
help='Whether to use DGL sparse embedding')
parser.add_argument("--sparse_lr", type=float, default=1e-2,
help="sparse lr rate")
args = parser.parse_args()
print(args)
main(args)
@@ -10,35 +10,35 @@ pip3 install ogb pyarrow
Training RGCN takes four steps:

### Step 0: Setup a Distributed File System
* You may skip this step if your cluster already has folder(s) synchronized across machines.

To perform distributed training, files and codes need to be accessed across multiple machines. A distributed file system would perfectly handle the job (i.e., NFS, Ceph).

#### Server side setup
Here is an example of how to set up NFS. First, install essential libs on the storage server

```bash
sudo apt-get install nfs-kernel-server
```

Below we assume the user account is `ubuntu` and we create a directory of `workspace` in the home directory.

```bash
mkdir -p /home/ubuntu/workspace
```

We assume that all the servers are under a subnet with ip range `192.168.0.0` to `192.168.255.255`. The exports configuration needs to be modified to

```bash
sudo vim /etc/exports
# add the following line
/home/ubuntu/workspace 192.168.0.0/16(rw,sync,no_subtree_check)
```

The server's internal ip can be checked via `ifconfig` or `ip`. If the ip does not begin with `192.168`, then you may use

```bash
# for ip range 10.0.0.0 – 10.255.255.255
/home/ubuntu/workspace 10.0.0.0/8(rw,sync,no_subtree_check)
# for ip range 172.16.0.0 – 172.31.255.255
/home/ubuntu/workspace 172.16.0.0/12(rw,sync,no_subtree_check)
```

@@ -51,22 +51,22 @@ sudo systemctl restart nfs-kernel-server
For configuration details, please refer to [NFS ArchWiki](https://wiki.archlinux.org/index.php/NFS).

#### Client side setup
To use NFS, clients also need to install the essential packages

```
sudo apt-get install nfs-common
```

You can either mount the NFS manually

```
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```

or edit the fstab so the folder will be mounted automatically

```
# vim /etc/fstab
@@ -74,7 +74,7 @@ or edit the fstab so the folder will be mounted automatically
<nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace nfs defaults 0 0
```

Then run `mount -a`.

Now go to `/home/ubuntu/workspace` and clone the DGL Github repository.

@@ -126,6 +126,23 @@ We can get the performance score at the second epoch:
Val Acc 0.4323, Test Acc 0.4255, time: 128.0379
```
The command below launches the same distributed training job using DGL's distributed NodeEmbedding
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/rgcn/experimental/ \
--num_trainers 1 \
--num_servers 1 \
--num_samplers 4 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 1024 --n-hidden 64 --lr 0.01 --eval-batch-size 1024 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --sparse-embedding --sparse-lr 0.06 --num_gpus 1"
```
We can get the performance score at the second epoch:
```
Val Acc 0.4410, Test Acc 0.4282, time: 32.5274
```
**Note:** if you are using conda or other virtual environments on the remote machines, you need to replace `python3` in the command string (i.e. the last argument) with the path to the Python interpreter in that environment.

## Partition a graph with ParMETIS

@@ -186,7 +203,7 @@ python3 get_mag_data.py
### Step 5: Verify the partition result (Optional)
```bash
python3 verify_mag_partitions.py
```

## Distributed code runs in the standalone mode
......
@@ -162,7 +162,7 @@ class DistEmbedLayer(nn.Module):
            # We only create embeddings for nodes without node features.
            if feat_name not in g.nodes[ntype].data:
                part_policy = g.get_node_partition_policy(ntype)
-                self.node_embeds[ntype] = dgl.distributed.DistEmbedding(g.number_of_nodes(ntype),
+                self.node_embeds[ntype] = dgl.distributed.nn.NodeEmbedding(g.number_of_nodes(ntype),
                                                                          self.embed_size,
                                                                          embed_name + '_' + ntype,
                                                                          init_emb,
@@ -389,10 +389,10 @@ def run(args, device, data):
    if args.sparse_embedding:
        if args.dgl_sparse and args.standalone:
-            emb_optimizer = dgl.distributed.SparseAdagrad(list(embed_layer.node_embeds.values()), lr=args.sparse_lr)
+            emb_optimizer = dgl.distributed.optim.SparseAdam(list(embed_layer.node_embeds.values()), lr=args.sparse_lr)
            print('optimize DGL sparse embedding:', embed_layer.node_embeds.keys())
        elif args.dgl_sparse:
-            emb_optimizer = dgl.distributed.SparseAdagrad(list(embed_layer.module.node_embeds.values()), lr=args.sparse_lr)
+            emb_optimizer = dgl.distributed.optim.SparseAdam(list(embed_layer.module.node_embeds.values()), lr=args.sparse_lr)
            print('optimize DGL sparse embedding:', embed_layer.module.node_embeds.keys())
        elif args.standalone:
            emb_optimizer = th.optim.SparseAdam(list(embed_layer.node_embeds.parameters()), lr=args.sparse_lr)
@@ -534,7 +534,7 @@ if __name__ == '__main__':
    parser.add_argument('--conf-path', type=str, help='The path to the partition config file')

    # rgcn related
    parser.add_argument('--num_gpus', type=int, default=-1,
                        help="the number of GPU device. Use -1 for CPU training")
    parser.add_argument("--dropout", type=float, default=0,
                        help="dropout probability")
......
...@@ -33,7 +33,7 @@ def read_ip_config(filename): ...@@ -33,7 +33,7 @@ def read_ip_config(filename):
172.31.47.147 30050 2 172.31.47.147 30050 2
172.31.30.180 30050 2 172.31.30.180 30050 2
Note that, DGL KVStore supports multiple servers that can share data with each other Note that, DGL KVStore supports multiple servers that can share data with each other
on the same machine via shared-tensor. So the server_count should be >= 1. on the same machine via shared-tensor. So the server_count should be >= 1.
Parameters Parameters
...@@ -103,11 +103,11 @@ def get_type_str(dtype): ...@@ -103,11 +103,11 @@ def get_type_str(dtype):
class KVServer(object): class KVServer(object):
"""KVServer is a lightweight key-value store service for DGL distributed training. """KVServer is a lightweight key-value store service for DGL distributed training.
In practice, developers can use KVServer to hold large-scale graph features or In practice, developers can use KVServer to hold large-scale graph features or
graph embeddings across machines in a distributed setting. Also, users can rewrite the _push_handler() graph embeddings across machines in a distributed setting. Also, users can rewrite the _push_handler()
and _pull_handler() APIs to support flexible algorithms. and _pull_handler() APIs to support flexible algorithms.
DGL kvstore supports multiple servers on a single machine. That means we can launch many servers on the same machine and all of DGL kvstore supports multiple servers on a single machine. That means we can launch many servers on the same machine and all of
these servers will share the same shared-memory tensor for load balancing. these servers will share the same shared-memory tensor for load balancing.
Note that you should NOT use KVServer from multiple threads in Python because this behavior is undefined. Note that you should NOT use KVServer from multiple threads in Python because this behavior is undefined.
...@@ -119,7 +119,7 @@ class KVServer(object): ...@@ -119,7 +119,7 @@ class KVServer(object):
server_id : int server_id : int
KVServer's ID (start from 0). KVServer's ID (start from 0).
server_namebook: dict server_namebook: dict
IP address namebook of KVServer, where key is the KVServer's ID IP address namebook of KVServer, where key is the KVServer's ID
(start from 0) and value is the server's machine_id, IP address and port, e.g., (start from 0) and value is the server's machine_id, IP address and port, e.g.,
{0:'[0, 172.31.40.143, 30050], {0:'[0, 172.31.40.143, 30050],
...@@ -196,7 +196,7 @@ class KVServer(object): ...@@ -196,7 +196,7 @@ class KVServer(object):
name : str name : str
data name data name
global2local : list or tensor (mx.ndarray or torch.tensor) global2local : list or tensor (mx.ndarray or torch.tensor)
A data mapping of global ID to local ID. KVStore will use global ID by default A data mapping of global ID to local ID. KVStore will use global ID by default
if global2local has not been set. if global2local has not been set.
Note that if global2local is None, KVServer will read the shared-tensor. Note that if global2local is None, KVServer will read the shared-tensor.
...@@ -260,7 +260,7 @@ class KVServer(object): ...@@ -260,7 +260,7 @@ class KVServer(object):
time.sleep(2) # wait for writing to finish time.sleep(2) # wait for writing to finish
break break
else: else:
time.sleep(2) # wait until the file has been created time.sleep(2) # wait until the file has been created
data_shape, data_type = self._read_data_shape_type(name+'-part-shape-'+str(self._machine_id)) data_shape, data_type = self._read_data_shape_type(name+'-part-shape-'+str(self._machine_id))
assert data_type == 'int64' assert data_type == 'int64'
shared_data = empty_shared_mem(name+'-part-', False, data_shape, 'int64') shared_data = empty_shared_mem(name+'-part-', False, data_shape, 'int64')
...@@ -526,8 +526,8 @@ class KVServer(object): ...@@ -526,8 +526,8 @@ class KVServer(object):
c_ptr=None) c_ptr=None)
for client_id in range(self._client_count): for client_id in range(self._client_count):
_send_kv_msg(self._sender, back_msg, client_id) _send_kv_msg(self._sender, back_msg, client_id)
self._barrier_count = 0 self._barrier_count = 0
# Final message # Final message
elif msg.type == KVMsgType.FINAL: elif msg.type == KVMsgType.FINAL:
print("Exit KVStore service %d, solved message count: %d" % (self.get_id(), self.get_message_count())) print("Exit KVStore service %d, solved message count: %d" % (self.get_id(), self.get_message_count()))
break # exit loop break # exit loop
...@@ -639,7 +639,7 @@ class KVServer(object): ...@@ -639,7 +639,7 @@ class KVServer(object):
def _default_push_handler(self, name, ID, data, target): def _default_push_handler(self, name, ID, data, target):
"""Default handler for PUSH message. """Default handler for PUSH message.
By default, _push_handler performs an update operation on the tensor. By default, _push_handler performs an update operation on the tensor.
...@@ -680,7 +680,7 @@ class KVServer(object): ...@@ -680,7 +680,7 @@ class KVServer(object):
class KVClient(object): class KVClient(object):
"""KVClient is used to push/pull tensors to/from KVServer. If the server node and client node are on the """KVClient is used to push/pull tensors to/from KVServer. If the server node and client node are on the
same machine, they can communicate with each other using a local shared-memory tensor instead of TCP/IP connections. same machine, they can communicate with each other using a local shared-memory tensor instead of TCP/IP connections.
Note that you should NOT use KVClient from multiple threads in Python because this behavior is undefined. Note that you should NOT use KVClient from multiple threads in Python because this behavior is undefined.
...@@ -690,7 +690,7 @@ class KVClient(object): ...@@ -690,7 +690,7 @@ class KVClient(object):
Parameters Parameters
---------- ----------
server_namebook: dict server_namebook: dict
IP address namebook of KVServer, where key is the KVServer's ID IP address namebook of KVServer, where key is the KVServer's ID
(start from 0) and value is the server's machine_id, IP address and port, and group_count, e.g., (start from 0) and value is the server's machine_id, IP address and port, and group_count, e.g.,
{0:'[0, 172.31.40.143, 30050, 2], {0:'[0, 172.31.40.143, 30050, 2],
...@@ -807,7 +807,7 @@ class KVClient(object): ...@@ -807,7 +807,7 @@ class KVClient(object):
if (os.path.exists(tensor_name+'shape-'+str(self._machine_id))): if (os.path.exists(tensor_name+'shape-'+str(self._machine_id))):
break break
else: else:
time.sleep(1) # wait until the file has been created time.sleep(1) # wait until the file has been created
shape, data_type = self._read_data_shape_type(tensor_name+'shape-'+str(self._machine_id)) shape, data_type = self._read_data_shape_type(tensor_name+'shape-'+str(self._machine_id))
assert data_type == dtype assert data_type == dtype
shared_data = empty_shared_mem(tensor_name, False, shape, dtype) shared_data = empty_shared_mem(tensor_name, False, shape, dtype)
...@@ -825,7 +825,7 @@ class KVClient(object): ...@@ -825,7 +825,7 @@ class KVClient(object):
type=KVMsgType.GET_SHAPE, type=KVMsgType.GET_SHAPE,
rank=self._client_id, rank=self._client_id,
name=name, name=name,
id=None, id=None,
data=None, data=None,
shape=None, shape=None,
c_ptr=None) c_ptr=None)
...@@ -844,12 +844,12 @@ class KVClient(object): ...@@ -844,12 +844,12 @@ class KVClient(object):
def init_data(self, name, shape, dtype, target_name): def init_data(self, name, shape, dtype, target_name):
"""Send message to kvserver to initialize new data and """Send message to kvserver to initialize new data and
get corresponded shared-tensor (e.g., partition_book, g2l) on kvclient. get corresponded shared-tensor (e.g., partition_book, g2l) on kvclient.
The new data will be initialized to zeros. The new data will be initialized to zeros.
Note that, this API must be invoked after the conenct() API. Note that, this API must be invoked after the conenct() API.
Parameters Parameters
---------- ----------
...@@ -1034,10 +1034,10 @@ class KVClient(object): ...@@ -1034,10 +1034,10 @@ class KVClient(object):
local_data = partial_data local_data = partial_data
else: # push data to remote server else: # push data to remote server
msg = KVStoreMsg( msg = KVStoreMsg(
type=KVMsgType.PUSH, type=KVMsgType.PUSH,
rank=self._client_id, rank=self._client_id,
name=name, name=name,
id=partial_id, id=partial_id,
data=partial_data, data=partial_data,
shape=None, shape=None,
c_ptr=None) c_ptr=None)
...@@ -1052,7 +1052,7 @@ class KVClient(object): ...@@ -1052,7 +1052,7 @@ class KVClient(object):
self._udf_push_handler(name+'-data-', local_id, local_data, self._data_store, self._udf_push_param) self._udf_push_handler(name+'-data-', local_id, local_data, self._data_store, self._udf_push_param)
else: else:
self._default_push_handler(name+'-data-', local_id, local_data, self._data_store) self._default_push_handler(name+'-data-', local_id, local_data, self._data_store)
def pull(self, name, id_tensor): def pull(self, name, id_tensor):
"""Pull message from KVServer. """Pull message from KVServer.
...@@ -1081,8 +1081,8 @@ class KVClient(object): ...@@ -1081,8 +1081,8 @@ class KVClient(object):
self._group_count, self._group_count,
self._machine_id, self._machine_id,
self._client_id, self._client_id,
self._data_store[name+'-part-'], self._data_store[name+'-part-'],
g2l, g2l,
self._data_store[name+'-data-'], self._data_store[name+'-data-'],
self._sender, self._sender,
self._receiver) self._receiver)
...@@ -1116,9 +1116,9 @@ class KVClient(object): ...@@ -1116,9 +1116,9 @@ class KVClient(object):
local_id = partial_id local_id = partial_id
else: # pull data from remote server else: # pull data from remote server
msg = KVStoreMsg( msg = KVStoreMsg(
type=KVMsgType.PULL, type=KVMsgType.PULL,
rank=self._client_id, rank=self._client_id,
name=name, name=name,
id=partial_id, id=partial_id,
data=None, data=None,
shape=None, shape=None,
...@@ -1128,16 +1128,16 @@ class KVClient(object): ...@@ -1128,16 +1128,16 @@ class KVClient(object):
_send_kv_msg(self._sender, msg, s_id) _send_kv_msg(self._sender, msg, s_id)
pull_count += 1 pull_count += 1
start += count[idx] start += count[idx]
msg_list = [] msg_list = []
if local_id is not None: # local pull if local_id is not None: # local pull
local_data = self._udf_pull_handler(name+'-data-', local_id, self._data_store) local_data = self._udf_pull_handler(name+'-data-', local_id, self._data_store)
s_id = random.randint(self._machine_id*self._group_count, (self._machine_id+1)*self._group_count-1) s_id = random.randint(self._machine_id*self._group_count, (self._machine_id+1)*self._group_count-1)
local_msg = KVStoreMsg( local_msg = KVStoreMsg(
type=KVMsgType.PULL_BACK, type=KVMsgType.PULL_BACK,
rank=s_id, rank=s_id,
name=name, name=name,
id=None, id=None,
data=local_data, data=local_data,
shape=None, shape=None,
...@@ -1157,13 +1157,13 @@ class KVClient(object): ...@@ -1157,13 +1157,13 @@ class KVClient(object):
return data_tensor[back_sorted_id] # return data with original index order return data_tensor[back_sorted_id] # return data with original index order
def barrier(self): def barrier(self):
"""Barrier for all client nodes """Barrier for all client nodes
This API blocks until all the clients have called it. This API blocks until all the clients have called it.
""" """
msg = KVStoreMsg( msg = KVStoreMsg(
type=KVMsgType.BARRIER, type=KVMsgType.BARRIER,
rank=self._client_id, rank=self._client_id,
name=None, name=None,
...@@ -1215,7 +1215,7 @@ class KVClient(object): ...@@ -1215,7 +1215,7 @@ class KVClient(object):
IP = '127.0.0.1' IP = '127.0.0.1'
finally: finally:
s.close() s.close()
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("",0)) s.bind(("",0))
s.listen(1) s.listen(1)
...@@ -1365,7 +1365,7 @@ class KVClient(object): ...@@ -1365,7 +1365,7 @@ class KVClient(object):
def _default_push_handler(self, name, ID, data, target): def _default_push_handler(self, name, ID, data, target):
"""Default handler for PUSH message. """Default handler for PUSH message.
By default, _push_handler performs an update operation on the tensor. By default, _push_handler performs an update operation on the tensor.
...@@ -1381,4 +1381,4 @@ class KVClient(object): ...@@ -1381,4 +1381,4 @@ class KVClient(object):
self._data_store self._data_store
""" """
target[name][ID] = data target[name][ID] = data
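The KVServer docstring above notes that `_push_handler()` and `_pull_handler()` can be rewritten to support flexible algorithms. Based only on the handler signature shown in this file (`name, ID, data, target`) and the default behaviour `target[name][ID] = data`, a hypothetical custom push handler that accumulates instead of overwriting could look like the sketch below; how such a handler is registered with the KVServer/KVClient is not shown in this diff.

```python
def accumulate_push_handler(name, ID, data, target):
    """Hypothetical push handler: sum pushed rows into the stored tensor.

    name   : str, data name (the '-data-' suffixed key used internally)
    ID     : tensor of local row indices to update
    data   : tensor of rows pushed by a client
    target : dict mapping data names to shared-memory tensors
    """
    # The default handler overwrites rows (target[name][ID] = data); summing is the
    # typical choice when the pushed values are gradients or partial aggregates.
    target[name][ID] += data
```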
...@@ -19,6 +19,8 @@ from .dist_tensor import DistTensor ...@@ -19,6 +19,8 @@ from .dist_tensor import DistTensor
from .partition import partition_graph, load_partition, load_partition_book from .partition import partition_graph, load_partition, load_partition_book
from .graph_partition_book import GraphPartitionBook, PartitionPolicy from .graph_partition_book import GraphPartitionBook, PartitionPolicy
from .sparse_emb import SparseAdagrad, DistEmbedding from .sparse_emb import SparseAdagrad, DistEmbedding
from . import nn
from . import optim
from .rpc import * from .rpc import *
from .rpc_server import start_server from .rpc_server import start_server
......
...@@ -78,6 +78,8 @@ class DistTensor: ...@@ -78,6 +78,8 @@ class DistTensor:
The system determines the right partition policy automatically. The system determines the right partition policy automatically.
persistent : bool persistent : bool
Whether the created tensor lives after the ``DistTensor`` object is destroyed. Whether the created tensor lives after the ``DistTensor`` object is destroyed.
is_gdata : bool
Whether the created tensor is graph ndata/edata or not.
Examples Examples
-------- --------
...@@ -100,7 +102,7 @@ class DistTensor: ...@@ -100,7 +102,7 @@ class DistTensor:
do the same. do the same.
''' '''
def __init__(self, shape, dtype, name=None, init_func=None, part_policy=None, def __init__(self, shape, dtype, name=None, init_func=None, part_policy=None,
persistent=False): persistent=False, is_gdata=True):
self.kvstore = get_kvstore() self.kvstore = get_kvstore()
assert self.kvstore is not None, \ assert self.kvstore is not None, \
'Distributed module is not initialized. Please call dgl.distributed.initialize.' 'Distributed module is not initialized. Please call dgl.distributed.initialize.'
...@@ -126,6 +128,7 @@ class DistTensor: ...@@ -126,6 +128,7 @@ class DistTensor:
+ 'its first dimension does not match the number of nodes or edges ' \ + 'its first dimension does not match the number of nodes or edges ' \
+ 'of a distributed graph or there does not exist a distributed graph.' + 'of a distributed graph or there does not exist a distributed graph.'
self._tensor_name = name
self._part_policy = part_policy self._part_policy = part_policy
assert part_policy.get_size() == shape[0], \ assert part_policy.get_size() == shape[0], \
'The partition policy does not match the input shape.' 'The partition policy does not match the input shape.'
...@@ -147,7 +150,7 @@ class DistTensor: ...@@ -147,7 +150,7 @@ class DistTensor:
self._name = str(data_name) self._name = str(data_name)
self._persistent = persistent self._persistent = persistent
if self._name not in exist_names: if self._name not in exist_names:
self.kvstore.init_data(self._name, shape, dtype, part_policy, init_func) self.kvstore.init_data(self._name, shape, dtype, part_policy, init_func, is_gdata)
self._owner = True self._owner = True
else: else:
self._owner = False self._owner = False
...@@ -218,3 +221,14 @@ class DistTensor: ...@@ -218,3 +221,14 @@ class DistTensor:
The name of the tensor. The name of the tensor.
''' '''
return self._name return self._name
@property
def tensor_name(self):
'''Return the tensor name
Returns
-------
str
The name of the tensor.
'''
return self._tensor_name
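The new `is_gdata` flag lets callers create a distributed tensor that is not registered as graph ndata/edata, which is what the sparse optimizers need for their per-row state. A minimal sketch, assuming `dgl.distributed.initialize()` has been called, a `DistGraph` `g` is loaded (so the first dimension matches the number of nodes), and using a hypothetical tensor name:

```python
import torch as th
from dgl.distributed import DistTensor

def zero_init(shape, dtype):
    """Initializer UDF: optimizer state starts at zero."""
    return th.zeros(shape, dtype=dtype)

num_nodes = g.number_of_nodes()   # `g` is an assumed, already-created DistGraph
embed_size = 128

# is_gdata=False keeps this tensor out of the graph-data name list, so it will not
# appear alongside real node/edge features (e.g., in KVClient.data_name_list()).
adam_m = DistTensor((num_nodes, embed_size), th.float32,
                    name='emb_adam_m',       # hypothetical name for Adam's first moment
                    init_func=zero_init,
                    is_gdata=False)
```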
...@@ -825,6 +825,8 @@ class KVClient(object): ...@@ -825,6 +825,8 @@ class KVClient(object):
self._full_data_shape = {} self._full_data_shape = {}
# Store all the data name # Store all the data name
self._data_name_list = set() self._data_name_list = set()
# Store all graph data name
self._gdata_name_list = set()
# Basic information # Basic information
self._server_namebook = rpc.read_ip_config(ip_config, num_servers) self._server_namebook = rpc.read_ip_config(ip_config, num_servers)
self._server_count = len(self._server_namebook) self._server_count = len(self._server_namebook)
...@@ -940,7 +942,7 @@ class KVClient(object): ...@@ -940,7 +942,7 @@ class KVClient(object):
self._pull_handlers[name] = func self._pull_handlers[name] = func
self.barrier() self.barrier()
def init_data(self, name, shape, dtype, part_policy, init_func): def init_data(self, name, shape, dtype, part_policy, init_func, is_gdata=True):
"""Send message to kvserver to initialize new data tensor and mapping this """Send message to kvserver to initialize new data tensor and mapping this
data from server side to client side. data from server side to client side.
...@@ -956,6 +958,8 @@ class KVClient(object): ...@@ -956,6 +958,8 @@ class KVClient(object):
partition policy. partition policy.
init_func : func init_func : func
UDF init function UDF init function
is_gdata : bool
Whether the created tensor is graph ndata/edata or not.
""" """
assert len(name) > 0, 'name cannot be empty.' assert len(name) > 0, 'name cannot be empty.'
assert len(shape) > 0, 'shape cannot be empty' assert len(shape) > 0, 'shape cannot be empty'
...@@ -997,6 +1001,8 @@ class KVClient(object): ...@@ -997,6 +1001,8 @@ class KVClient(object):
dlpack = shared_data.to_dlpack() dlpack = shared_data.to_dlpack()
self._data_store[name] = F.zerocopy_from_dlpack(dlpack) self._data_store[name] = F.zerocopy_from_dlpack(dlpack)
self._data_name_list.add(name) self._data_name_list.add(name)
if is_gdata:
self._gdata_name_list.add(name)
self._full_data_shape[name] = tuple(shape) self._full_data_shape[name] = tuple(shape)
self._pull_handlers[name] = default_pull_handler self._pull_handlers[name] = default_pull_handler
self._push_handlers[name] = default_push_handler self._push_handlers[name] = default_push_handler
...@@ -1040,6 +1046,8 @@ class KVClient(object): ...@@ -1040,6 +1046,8 @@ class KVClient(object):
self.barrier() self.barrier()
self._data_name_list.remove(name) self._data_name_list.remove(name)
if name in self._gdata_name_list:
self._gdata_name_list.remove(name)
# TODO(chao) : remove the delete log print # TODO(chao) : remove the delete log print
del self._data_store[name] del self._data_store[name]
del self._full_data_shape[name] del self._full_data_shape[name]
...@@ -1110,11 +1118,14 @@ class KVClient(object): ...@@ -1110,11 +1118,14 @@ class KVClient(object):
response = rpc.recv_response() response = rpc.recv_response()
assert response.msg == SEND_META_TO_BACKUP_MSG assert response.msg == SEND_META_TO_BACKUP_MSG
self._data_name_list.add(name) self._data_name_list.add(name)
# map_shared_data happens only at DistGraph initialization
# TODO(xiangsx): We assume there is no non-graph data initialized at this time
self._gdata_name_list.add(name)
self.barrier() self.barrier()
def data_name_list(self): def data_name_list(self):
"""Get all the data name""" """Get all the data name"""
return list(self._data_name_list) return list(self._gdata_name_list)
def get_data_meta(self, name): def get_data_meta(self, name):
"""Get meta data (data_type, data_shape, partition_policy) """Get meta data (data_type, data_shape, partition_policy)
...@@ -1125,6 +1136,25 @@ class KVClient(object): ...@@ -1125,6 +1136,25 @@ class KVClient(object):
part_policy = self._part_policy[name] part_policy = self._part_policy[name]
return (data_type, data_shape, part_policy) return (data_type, data_shape, part_policy)
def get_partid(self, name, id_tensor):
"""
Parameters
----------
name : str
data name
id_tensor : tensor
a vector storing the global data ID
"""
assert len(name) > 0, 'name cannot be empty.'
id_tensor = utils.toindex(id_tensor)
id_tensor = id_tensor.tousertensor()
assert F.ndim(id_tensor) == 1, 'ID must be a vector.'
# partition data
machine_id = self._part_policy[name].to_partid(id_tensor)
return machine_id
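`get_partid()` maps each global data ID to the partition (machine) that owns it via the data's partition policy. A small hedged sketch of using it to see how a batch of IDs is spread across partitions, assuming `kv` is an already-connected KVClient and `'node_feat'` is an existing data name:

```python
import torch as th

node_ids = th.tensor([0, 15, 1024, 99999])
part_ids = kv.get_partid('node_feat', node_ids)   # one partition ID per input ID

# Group IDs by owning partition, e.g. to inspect load balance before a push/pull.
for pid in th.unique(part_ids):
    print('partition', int(pid), 'owns', node_ids[part_ids == pid].tolist())
```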
def push(self, name, id_tensor, data_tensor): def push(self, name, id_tensor, data_tensor):
"""Push data to KVServer. """Push data to KVServer.
......
"""dgl distributed.optims."""
import importlib
import sys
import os
from ...backend import backend_name
from ...utils import expand_as_pair
def _load_backend(mod_name):
mod = importlib.import_module('.%s' % mod_name, __name__)
thismod = sys.modules[__name__]
for api, obj in mod.__dict__.items():
setattr(thismod, api, obj)
_load_backend(backend_name)
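The `_load_backend` helper above re-exports every symbol of the backend-specific submodule (e.g. `pytorch`) at the package level, so users can write `dgl.distributed.optim.SparseAdam` without naming the backend. The toy snippet below (not DGL code) illustrates the same re-export pattern with stand-in modules:

```python
import types

# Stand-in for dgl/distributed/optim/pytorch: a module defining the real classes.
pytorch_impl = types.ModuleType("pytorch_impl")
pytorch_impl.SparseAdam = type("SparseAdam", (), {})

# Stand-in for dgl/distributed/optim/__init__: copy the backend module's symbols
# onto the package module, which is exactly what the loop in _load_backend does.
package_mod = types.ModuleType("optim")
for api, obj in pytorch_impl.__dict__.items():
    setattr(package_mod, api, obj)

assert package_mod.SparseAdam is pytorch_impl.SparseAdam   # symbol is now re-exported
```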
"""dgl distributed sparse optimizer for pytorch."""
from .sparse_emb import NodeEmbedding
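Putting the pieces of this commit together, a rough training-step sketch of the intended usage: embeddings come from `dgl.distributed.nn.NodeEmbedding` (wrapped in an embedding layer such as `DistEmbedLayer` above) and are updated by `dgl.distributed.optim.SparseAdam`, while dense model parameters keep a regular PyTorch optimizer. The data loader, labels, learning rates, and the exact forward call of the embedding layer are assumptions, and the `step()` signature should be checked against the API reference.

```python
import torch as th
import torch.nn.functional as F
import dgl

# Assumed to exist: `embed_layer` holding NodeEmbedding objects in `node_embeds`,
# a GNN `model`, a `labels` tensor, and a distributed sampling `dataloader`.
emb_optimizer = dgl.distributed.optim.SparseAdam(
    list(embed_layer.node_embeds.values()), lr=0.06)
optimizer = th.optim.Adam(model.parameters(), lr=0.01)

for input_nodes, seeds, blocks in dataloader:
    feats = embed_layer(input_nodes)          # look up learnable input features
    logits = model(blocks, feats)
    loss = F.cross_entropy(logits, labels[seeds])

    optimizer.zero_grad()
    loss.backward()
    emb_optimizer.step()   # sparse update of only the embedding rows touched above
    optimizer.step()
```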