Unverified Commit 25ac3344 authored by Da Zheng, committed by GitHub

[Distributed] Heterogeneous graph support (#2457)



* Distributed heterograph (#3)

* heterogeneous graph partition.

* fix graph partition book for heterograph.

* load heterograph partitions.

* update DistGraphServer to support heterograph.

* make DistGraph runnable for heterograph.

* partition a graph and store parts with homogeneous graph structure.

* update DistGraph server&client to use homogeneous graph.

* shuffle node Ids based on node types.

* load mag in heterograph.

* fix per-node-type mapping.

* balance node types.

* fix for homogeneous graph

* store etype for now.

* fix data name.

* fix a bug in example.

* add profiler in rgcn.

* heterogeneous RGCN.

* map homogeneous node ids to hetero node ids.

* fix graph partition book.

* fix DistGraph.

* shuffle eids.

* verify eids and their mappings when loading a partition.

* Id map from homogeneous Ids to per-type Ids.

* verify partitioned results.

* add test for distributed sampler.

* add mapping from per-type Ids to homogeneous Ids.

* update example.

* fix DistGraph.

* Revert "add profiler in rgcn."

This reverts commit 36daaed8b660933dac8f61a39faec3da2467d676.

* add tests for homogeneous graphs.

* fix a bug.

* fix test.

* fix for one partition.

* fix for standalone training and evaluation.

* small fix.

* fix two bugs.

* initialize projection matrix.

* small fix on RGCN.

* Fix rgcn performance (#17)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-171.ec2.internal>

* fix lint.

* fix lint.

* fix lint.

* fix lint.

* fix lint.

* fix lint.

* fix.

* fix test.

* fix lint.

* test partitions.

* remove redundant test for partitioning.

* remove commented code.

* fix partition.

* fix tests.

* fix RGCN.

* fix test.

* fix test.

* fix test.

* fix.

* fix a bug.

* update dmlc-core.

* fix.

* fix rgcn.

* update readme.

* add comments.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-2-202.us-west-1.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-132.us-west-1.compute.internal>
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-171.ec2.internal>

* fix.

* fix.

* add div_int.

* fix.

* fix.

* fix lint.

* fix.

* fix.

* fix.

* adjust.

* move code.

* handle heterograph.

* return pytorch tensor in GPB.

* remove some tests in example.

* add to_block for distributed training.

* use distributed to_block.

* remove unnecessary function in DistGraph.

* remove distributed to_block.

* use pytorch tensor.

* fix a bug in ntypes and etypes.

* enable norm.

* make the data loader compatible with the old format.

* fix.

* add comments.

* fix a bug.

* add test for heterograph.

* support partition without reshuffle.

* add test.

* support partition without reshuffle.

* fix.

* add test.

* fix bugs.

* fix lint.

* fix dataset.

* fix for mxnet.

* update docstring.

* rename to floor_div

* avoid exposing NodePartitionPolicy and EdgePartitionPolicy.

* fix docstring.

* fix error.

* fixes.

* fix comments.

* rename.

* rename.

* explain IdMap.

* fix docstring.

* fix docstring.

* update docstring.

* remove the code of returning heterograph.

* remove argument.

* fix example.

* make GraphPartitionBook an abstract class.

* fix.

* fix.

* fix a bug.

* fix a bug in example

* fix a bug

* reverse heterograph sampling.

* temp fix.

* fix lint.

* Revert "temp fix."

This reverts commit c450717b9f578b8c48769c675f2a19d6c1e64381.

* compute norm.

* Revert "reverse heterograph sampling."

This reverts commit bd6deb7f52998de76508f800441ff518e2fadcb9.

* fix.

* move id_map.py

* remove check

* add more comments.

* update docstring.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-2-202.us-west-1.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-132.us-west-1.compute.internal>
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-171.ec2.internal>
parent aa884d43
......@@ -39,6 +39,8 @@ the number of nodes, the number of edges and the number of labelled nodes.
python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train --balance_edges
```
This script generates partitioned graphs and stores them in the directory called `data`.
### Step 2: copy the partitioned data and files to the cluster
DGL provides a script for copying partitioned data and files to the cluster. Before that, copy the training script to a local folder:
......
## Distributed training
This is an example of training RGCN node classification in a distributed fashion. Currently, the example only support training RGCN graphs with no input features. The current implementation follows ../rgcn/entity_claasify_mp.py.
This is an example of training RGCN for node classification in a distributed fashion. Currently, the example trains RGCN on graphs with input node features. The current implementation follows ../rgcn/entity_classify_mp.py.
Before training, please install the required Python packages with pip:
......@@ -36,6 +36,8 @@ the number of nodes, the number of edges and the number of labelled nodes.
python3 partition_graph.py --dataset ogbn-mag --num_parts 4 --balance_train --balance_edges
```
This script generates partitioned graphs and stores them in the directory called `data`.
### Step 2: copy the partitioned data to the cluster
DGL provides a script for copying partitioned data to the cluster. Before that, copy the training script to a local folder:
......@@ -78,7 +80,7 @@ python3 ~/dgl/tools/launch.py \
--num_samplers 4 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 dgl_code/entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 16 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num-workers 4 --num-servers 1 --sparse-embedding --sparse-lr 0.06"
"python3 dgl_code/entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 16 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num-workers 4 --num-servers 1 --sparse-embedding --sparse-lr 0.06 --node-feats"
```
We can get the performance score at the second epoch:
......@@ -98,5 +100,5 @@ python3 partition_graph.py --dataset ogbn-mag --num_parts 1
### Step 2: run the training script
```bash
python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 256 --n-hidden 64 --lr 0.01 --eval-batch-size 8 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --conf-path 'data/ogbn-mag.json' --standalone
python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 128 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --conf-path 'data/ogbn-mag.json' --standalone --sparse-embedding --sparse-lr 0.06 --node-feats
```
......@@ -106,7 +106,7 @@ class EntityClassify(nn.Module):
h = feats
for layer, block in zip(self.layers, blocks):
block = block.to(self.device)
h = layer(block, h, block.edata['etype'], block.edata['norm'])
h = layer(block, h, block.edata[dgl.ETYPE], block.edata['norm'])
return h
def init_emb(shape, dtype):
......@@ -122,8 +122,6 @@ class DistEmbedLayer(nn.Module):
Device to run the layer.
g : DistGraph
training graph
num_of_ntype : int
Number of node types
embed_size : int
Output embed size
sparse_emb: bool
......@@ -138,55 +136,74 @@ class DistEmbedLayer(nn.Module):
def __init__(self,
dev_id,
g,
num_of_ntype,
embed_size,
sparse_emb=False,
dgl_sparse_emb=False,
feat_name='feat',
embed_name='node_emb'):
super(DistEmbedLayer, self).__init__()
self.dev_id = dev_id
self.num_of_ntype = num_of_ntype
self.embed_size = embed_size
self.embed_name = embed_name
self.feat_name = feat_name
self.sparse_emb = sparse_emb
self.g = g
self.ntype_id_map = {g.get_ntype_id(ntype):ntype for ntype in g.ntypes}
self.node_projs = nn.ModuleDict()
for ntype in g.ntypes:
if feat_name in g.nodes[ntype].data:
self.node_projs[ntype] = nn.Linear(g.nodes[ntype].data[feat_name].shape[1], embed_size)
nn.init.xavier_uniform_(self.node_projs[ntype].weight)
print('node {} has data {}'.format(ntype, feat_name))
if sparse_emb:
if dgl_sparse_emb:
self.node_embeds = dgl.distributed.DistEmbedding(g.number_of_nodes(),
self.node_embeds = {}
for ntype in g.ntypes:
# We only create embeddings for nodes without node features.
if feat_name not in g.nodes[ntype].data:
part_policy = g.get_node_partition_policy(ntype)
self.node_embeds[ntype] = dgl.distributed.DistEmbedding(g.number_of_nodes(ntype),
self.embed_size,
embed_name,
init_emb)
embed_name + '_' + ntype,
init_emb,
part_policy)
else:
self.node_embeds = th.nn.Embedding(g.number_of_nodes(), self.embed_size, sparse=self.sparse_emb)
nn.init.uniform_(self.node_embeds.weight, -1.0, 1.0)
self.node_embeds = nn.ModuleDict()
for ntype in g.ntypes:
# We only create embeddings for nodes without node features.
if feat_name not in g.nodes[ntype].data:
self.node_embeds[ntype] = th.nn.Embedding(g.number_of_nodes(ntype), self.embed_size, sparse=self.sparse_emb)
nn.init.uniform_(self.node_embeds[ntype].weight, -1.0, 1.0)
else:
self.node_embeds = th.nn.Embedding(g.number_of_nodes(), self.embed_size)
nn.init.uniform_(self.node_embeds.weight, -1.0, 1.0)
def forward(self, node_ids, node_tids, features):
self.node_embeds = nn.ModuleDict()
for ntype in g.ntypes:
# We only create embeddings for nodes without node features.
if feat_name not in g.nodes[ntype].data:
self.node_embeds[ntype] = th.nn.Embedding(g.number_of_nodes(ntype), self.embed_size)
nn.init.uniform_(self.node_embeds[ntype].weight, -1.0, 1.0)
def forward(self, node_ids, ntype_ids):
"""Forward computation
Parameters
----------
node_ids : tensor
node_ids : Tensor
node ids to generate embedding for.
node_ids : tensor
ntype_ids : Tensor
node type ids
features : list of features
list of initial features for nodes belong to different node type.
If None, the corresponding features is an one-hot encoding feature,
else use the features directly as input feature and matmul a
projection matrix.
Returns
-------
tensor
embeddings as the input of the next layer
"""
embeds = th.empty(node_ids.shape[0], self.embed_size)
for ntype in range(self.num_of_ntype):
assert features[ntype] is None, 'Currently Dist RGCN only support non input feature'
loc = node_tids == ntype
embeds[loc] = self.node_embeds(node_ids[loc])
embeds = th.empty(node_ids.shape[0], self.embed_size, device=self.dev_id)
for ntype_id in th.unique(ntype_ids).tolist():
ntype = self.ntype_id_map[int(ntype_id)]
loc = ntype_ids == ntype_id
if self.feat_name in self.g.nodes[ntype].data:
embeds[loc] = self.node_projs[ntype](self.g.nodes[ntype].data[self.feat_name][node_ids[ntype_ids == ntype_id]].to(self.dev_id))
else:
embeds[loc] = self.node_embeds[ntype](node_ids[ntype_ids == ntype_id]).to(self.dev_id)
return embeds
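# A hedged illustration of the dispatch above (the values are made up, not from this diff):
# with ntype_ids = [0, 0, 1] and per-type node_ids = [5, 7, 2], rows 0 and 1 are resolved
# through the type-0 path (feature projection if that node type has 'feat', otherwise its
# embedding table) and row 2 through the type-1 path, so the result has shape (3, embed_size).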
def compute_acc(results, labels):
......@@ -196,7 +213,15 @@ def compute_acc(results, labels):
labels = labels.long()
return (results == labels).float().sum() / len(results)
def evaluate(g, model, embed_layer, labels, eval_loader, test_loader, node_feats, global_val_nid, global_test_nid):
def gen_norm(g):
_, v, eid = g.all_edges(form='all')
_, inverse_index, count = th.unique(v, return_inverse=True, return_counts=True)
degrees = count[inverse_index]
norm = th.ones(eid.shape[0], device=eid.device) / degrees
norm = norm.unsqueeze(1)
g.edata['norm'] = norm
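# A minimal sketch of what gen_norm computes (the toy graph is an assumption, not from this diff):
#   toy = dgl.graph((th.tensor([0, 1, 2]), th.tensor([2, 2, 0])))
#   gen_norm(toy)
#   toy.edata['norm']   # expected [[0.5], [0.5], [1.0]]: 1 / in-degree of each edge's destination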
def evaluate(g, model, embed_layer, labels, eval_loader, test_loader, all_val_nid, all_test_nid):
model.eval()
embed_layer.eval()
eval_logits = []
......@@ -207,11 +232,12 @@ def evaluate(g, model, embed_layer, labels, eval_loader, test_loader, node_feats
with th.no_grad():
for sample_data in tqdm.tqdm(eval_loader):
seeds, blocks = sample_data
feats = embed_layer(blocks[0].srcdata[dgl.NID],
blocks[0].srcdata[dgl.NTYPE],
node_feats)
for block in blocks:
gen_norm(block)
feats = embed_layer(blocks[0].srcdata[dgl.NID], blocks[0].srcdata[dgl.NTYPE])
logits = model(blocks, feats)
eval_logits.append(logits.cpu().detach())
assert np.all(seeds.numpy() < g.number_of_nodes('paper'))
eval_seeds.append(seeds.cpu().detach())
eval_logits = th.cat(eval_logits)
eval_seeds = th.cat(eval_seeds)
......@@ -222,11 +248,12 @@ def evaluate(g, model, embed_layer, labels, eval_loader, test_loader, node_feats
with th.no_grad():
for sample_data in tqdm.tqdm(test_loader):
seeds, blocks = sample_data
feats = embed_layer(blocks[0].srcdata[dgl.NID],
blocks[0].srcdata[dgl.NTYPE],
node_feats)
for block in blocks:
gen_norm(block)
feats = embed_layer(blocks[0].srcdata[dgl.NID], blocks[0].srcdata[dgl.NTYPE])
logits = model(blocks, feats)
test_logits.append(logits.cpu().detach())
assert np.all(seeds.numpy() < g.number_of_nodes('paper'))
test_seeds.append(seeds.cpu().detach())
test_logits = th.cat(test_logits)
test_seeds = th.cat(test_seeds)
......@@ -234,8 +261,8 @@ def evaluate(g, model, embed_layer, labels, eval_loader, test_loader, node_feats
g.barrier()
if g.rank() == 0:
return compute_acc(global_results[global_val_nid], labels[global_val_nid]), \
compute_acc(global_results[global_test_nid], labels[global_test_nid])
return compute_acc(global_results[all_val_nid], labels[all_val_nid]), \
compute_acc(global_results[all_test_nid], labels[all_test_nid])
else:
return -1, -1
......@@ -274,29 +301,35 @@ class NeighborSampler:
norms = []
ntypes = []
seeds = th.LongTensor(np.asarray(seeds))
cur = seeds
gpb = self.g.get_partition_book()
# We need to map the per-type node IDs to homogeneous IDs.
cur = gpb.map_to_homo_nid(seeds, 'paper')
for fanout in self.fanouts:
frontier = self.sample_neighbors(self.g, cur, fanout, replace=True)
etypes = self.g.edata[dgl.ETYPE][frontier.edata[dgl.EID]]
norm = self.g.edata['norm'][frontier.edata[dgl.EID]]
# For a heterogeneous input graph, the returned frontier is stored in
# the homogeneous graph format.
frontier = self.sample_neighbors(self.g, cur, fanout, replace=False)
block = dgl.to_block(frontier, cur)
block.srcdata[dgl.NTYPE] = self.g.ndata[dgl.NTYPE][block.srcdata[dgl.NID]]
block.edata['etype'] = etypes
block.edata['norm'] = norm
cur = block.srcdata[dgl.NID]
block.edata[dgl.EID] = frontier.edata[dgl.EID]
# Map the homogeneous edge Ids to their edge type.
block.edata[dgl.ETYPE], block.edata[dgl.EID] = gpb.map_to_per_etype(block.edata[dgl.EID])
# Map the homogeneous node Ids to their node types and per-type Ids.
block.srcdata[dgl.NTYPE], block.srcdata[dgl.NID] = gpb.map_to_per_ntype(block.srcdata[dgl.NID])
block.dstdata[dgl.NTYPE], block.dstdata[dgl.NID] = gpb.map_to_per_ntype(block.dstdata[dgl.NID])
blocks.insert(0, block)
return seeds, blocks
def run(args, device, data):
g, node_feats, num_of_ntype, num_classes, num_rels, \
train_nid, val_nid, test_nid, labels, global_val_nid, global_test_nid = data
g, num_classes, train_nid, val_nid, test_nid, labels, all_val_nid, all_test_nid = data
num_rels = len(g.etypes)
fanouts = [int(fanout) for fanout in args.fanout.split(',')]
val_fanouts = [int(fanout) for fanout in args.validation_fanout.split(',')]
sampler = NeighborSampler(g, fanouts, dgl.distributed.sample_neighbors)
# Create DataLoader for constructing blocks
dataloader = DistDataLoader(
dataset=train_nid.numpy(),
dataset=train_nid,
batch_size=args.batch_size,
collate_fn=sampler.sample_blocks,
shuffle=True,
......@@ -305,7 +338,7 @@ def run(args, device, data):
valid_sampler = NeighborSampler(g, val_fanouts, dgl.distributed.sample_neighbors)
# Create DataLoader for constructing blocks
valid_dataloader = DistDataLoader(
dataset=val_nid.numpy(),
dataset=val_nid,
batch_size=args.batch_size,
collate_fn=valid_sampler.sample_blocks,
shuffle=False,
......@@ -314,7 +347,7 @@ def run(args, device, data):
test_sampler = NeighborSampler(g, [-1] * args.n_layers, dgl.distributed.sample_neighbors)
# Create DataLoader for constructing blocks
test_dataloader = DistDataLoader(
dataset=test_nid.numpy(),
dataset=test_nid,
batch_size=args.batch_size,
collate_fn=test_sampler.sample_blocks,
shuffle=False,
......@@ -322,10 +355,10 @@ def run(args, device, data):
embed_layer = DistEmbedLayer(device,
g,
num_of_ntype,
args.n_hidden,
sparse_emb=args.sparse_embedding,
dgl_sparse_emb=args.dgl_sparse)
dgl_sparse_emb=args.dgl_sparse,
feat_name='feat')
model = EntityClassify(device,
args.n_hidden,
......@@ -340,15 +373,33 @@ def run(args, device, data):
model = model.to(device)
if not args.standalone:
model = th.nn.parallel.DistributedDataParallel(model)
if args.sparse_embedding and not args.dgl_sparse:
# If there are dense parameters in the embedding layer
# or we use Pytorch sparse embeddings.
if len(embed_layer.node_projs) > 0 or not args.dgl_sparse:
embed_layer = DistributedDataParallel(embed_layer, device_ids=None, output_device=None)
if args.sparse_embedding:
if args.dgl_sparse:
emb_optimizer = dgl.distributed.SparseAdagrad([embed_layer.node_embeds], lr=args.sparse_lr)
if args.dgl_sparse and args.standalone:
emb_optimizer = dgl.distributed.SparseAdagrad(list(embed_layer.node_embeds.values()), lr=args.sparse_lr)
print('optimize DGL sparse embedding:', embed_layer.node_embeds.keys())
elif args.dgl_sparse:
emb_optimizer = dgl.distributed.SparseAdagrad(list(embed_layer.module.node_embeds.values()), lr=args.sparse_lr)
print('optimize DGL sparse embedding:', embed_layer.module.node_embeds.keys())
elif args.standalone:
emb_optimizer = th.optim.SparseAdam(embed_layer.node_embeds.parameters(), lr=args.sparse_lr)
print('optimize Pytorch sparse embedding:', embed_layer.node_embeds)
else:
emb_optimizer = th.optim.SparseAdam(embed_layer.module.node_embeds.parameters(), lr=args.sparse_lr)
optimizer = th.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.l2norm)
print('optimize Pytorch sparse embedding:', embed_layer.module.node_embeds)
dense_params = list(model.parameters())
if args.node_feats:
if args.standalone:
dense_params += list(embed_layer.node_projs.parameters())
print('optimize dense projection:', embed_layer.node_projs)
else:
dense_params += list(embed_layer.module.node_projs.parameters())
print('optimize dense projection:', embed_layer.module.node_projs)
optimizer = th.optim.Adam(dense_params, lr=args.lr, weight_decay=args.l2norm)
else:
all_params = list(model.parameters()) + list(embed_layer.parameters())
optimizer = th.optim.Adam(all_params, lr=args.lr, weight_decay=args.l2norm)
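# A hedged sketch of the per-iteration update order implied by the optimizers above
# (the concrete calls live in the elided part of the training loop; shown only for orientation):
#   optimizer.zero_grad()
#   loss.backward()
#   optimizer.step()                # dense model (and projection) parameters
#   if args.sparse_embedding:
#       emb_optimizer.step()        # sparse per-ntype embedding tables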
......@@ -385,9 +436,9 @@ def run(args, device, data):
sample_time += tic_step - start
sample_t.append(tic_step - start)
feats = embed_layer(blocks[0].srcdata[dgl.NID],
blocks[0].srcdata[dgl.NTYPE],
node_feats)
for block in blocks:
gen_norm(block)
feats = embed_layer(blocks[0].srcdata[dgl.NID], blocks[0].srcdata[dgl.NTYPE])
label = labels[seeds]
copy_time = time.time()
feat_copy_t.append(copy_time - tic_step)
......@@ -410,15 +461,16 @@ def run(args, device, data):
backward_t.append(compute_end - forward_end)
# Aggregate gradients in multiple nodes.
optimizer.step()
update_t.append(time.time() - compute_end)
step_t = time.time() - start
step_time.append(step_t)
train_acc = th.sum(logits.argmax(dim=1) == label).item() / len(seeds)
if step % args.log_every == 0:
print('[{}] Epoch {:05d} | Step {:05d} | Loss {:.4f} | time {:.3f} s' \
print('[{}] Epoch {:05d} | Step {:05d} | Train acc {:.4f} | Loss {:.4f} | time {:.3f} s' \
'| sample {:.3f} | copy {:.3f} | forward {:.3f} | backward {:.3f} | update {:.3f}'.format(
g.rank(), epoch, step, loss.item(), np.sum(step_time[-args.log_every:]),
g.rank(), epoch, step, train_acc, loss.item(), np.sum(step_time[-args.log_every:]),
np.sum(sample_t[-args.log_every:]), np.sum(feat_copy_t[-args.log_every:]), np.sum(forward_t[-args.log_every:]),
np.sum(backward_t[-args.log_every:]), np.sum(update_t[-args.log_every:])))
start = time.time()
......@@ -430,7 +482,7 @@ def run(args, device, data):
start = time.time()
g.barrier()
val_acc, test_acc = evaluate(g, model, embed_layer, labels,
valid_dataloader, test_dataloader, node_feats, global_val_nid, global_test_nid)
valid_dataloader, test_dataloader, all_val_nid, all_test_nid)
if val_acc >= 0:
print('Val Acc {:.4f}, Test Acc {:.4f}, time: {:.4f}'.format(val_acc, test_acc,
time.time() - start))
......@@ -442,34 +494,24 @@ def main(args):
g = dgl.distributed.DistGraph(args.graph_name, part_config=args.conf_path)
print('rank:', g.rank())
print('number of edges', g.number_of_edges())
pb = g.get_partition_book()
train_nid = dgl.distributed.node_split(g.ndata['train_mask'], pb, force_even=True)
val_nid = dgl.distributed.node_split(g.ndata['val_mask'], pb, force_even=True)
test_nid = dgl.distributed.node_split(g.ndata['test_mask'], pb, force_even=True)
local_nid = pb.partid2nids(pb.partid).detach().numpy()
train_nid = dgl.distributed.node_split(g.nodes['paper'].data['train_mask'], pb, ntype='paper', force_even=True)
val_nid = dgl.distributed.node_split(g.nodes['paper'].data['val_mask'], pb, ntype='paper', force_even=True)
test_nid = dgl.distributed.node_split(g.nodes['paper'].data['test_mask'], pb, ntype='paper', force_even=True)
local_nid = pb.partid2nids(pb.partid, 'paper').detach().numpy()
print('part {}, train: {} (local: {}), val: {} (local: {}), test: {} (local: {})'.format(
g.rank(), len(train_nid), len(np.intersect1d(train_nid.numpy(), local_nid)),
len(val_nid), len(np.intersect1d(val_nid.numpy(), local_nid)),
len(test_nid), len(np.intersect1d(test_nid.numpy(), local_nid))))
device = th.device('cpu')
labels = g.ndata['labels'][np.arange(g.number_of_nodes())]
global_val_nid = th.LongTensor(np.nonzero(g.ndata['val_mask'][np.arange(g.number_of_nodes())])).squeeze()
global_test_nid = th.LongTensor(np.nonzero(g.ndata['test_mask'][np.arange(g.number_of_nodes())])).squeeze()
labels = g.nodes['paper'].data['labels'][np.arange(g.number_of_nodes('paper'))]
all_val_nid = th.LongTensor(np.nonzero(g.nodes['paper'].data['val_mask'][np.arange(g.number_of_nodes('paper'))])).squeeze()
all_test_nid = th.LongTensor(np.nonzero(g.nodes['paper'].data['test_mask'][np.arange(g.number_of_nodes('paper'))])).squeeze()
n_classes = len(th.unique(labels[labels >= 0]))
print(labels.shape)
print('#classes:', n_classes)
# these two infor should have a better place to store and retrive
num_of_ntype = len(th.unique(g.ndata[dgl.NTYPE][np.arange(g.number_of_nodes())]))
num_rels = len(th.unique(g.edata[dgl.ETYPE][np.arange(g.number_of_edges())]))
# no initial node features
node_feats = [None] * num_of_ntype
run(args, device, (g, node_feats, num_of_ntype, n_classes, num_rels,
train_nid, val_nid, test_nid, labels, global_val_nid, global_test_nid))
run(args, device, (g, n_classes, train_nid, val_nid, test_nid, labels, all_val_nid, all_test_nid))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN')
......@@ -527,8 +569,6 @@ if __name__ == '__main__':
help='Whether to use DGL sparse embedding')
parser.add_argument('--node-feats', default=False, action='store_true',
help='Whether use node features')
parser.add_argument('--global-norm', default=False, action='store_true',
help='User global norm instead of per node type norm')
parser.add_argument('--layer-norm', default=False, action='store_true',
help='Use layer norm')
parser.add_argument('--local_rank', type=int, help='get rank of the process')
......
......@@ -6,7 +6,7 @@ import time
from ogb.nodeproppred import DglNodePropPredDataset
def load_ogb(dataset, global_norm):
def load_ogb(dataset):
if dataset == 'ogbn-mag':
dataset = DglNodePropPredDataset(name=dataset)
split_idx = dataset.get_idx_split()
......@@ -33,54 +33,24 @@ def load_ogb(dataset, global_norm):
print('Number of valid: {}'.format(len(val_idx)))
print('Number of test: {}'.format(len(test_idx)))
# currently we do not support node feature in mag dataset.
# calculate norm for each edge type and store in edge
if global_norm is False:
for canonical_etype in hg.canonical_etypes:
u, v, eid = hg.all_edges(form='all', etype=canonical_etype)
_, inverse_index, count = th.unique(v, return_inverse=True, return_counts=True)
degrees = count[inverse_index]
norm = th.ones(eid.shape[0]) / degrees
norm = norm.unsqueeze(1)
hg.edges[canonical_etype].data['norm'] = norm
# get target category id
category_id = len(hg.ntypes)
for i, ntype in enumerate(hg.ntypes):
if ntype == category:
category_id = i
g = dgl.to_homogeneous(hg, edata=['norm'])
if global_norm:
u, v, eid = g.all_edges(form='all')
_, inverse_index, count = th.unique(v, return_inverse=True, return_counts=True)
degrees = count[inverse_index]
norm = th.ones(eid.shape[0]) / degrees
norm = norm.unsqueeze(1)
g.edata['norm'] = norm
node_ids = th.arange(g.number_of_nodes())
# find out the target node ids
node_tids = g.ndata[dgl.NTYPE]
loc = (node_tids == category_id)
target_idx = node_ids[loc]
train_idx = target_idx[train_idx]
val_idx = target_idx[val_idx]
test_idx = target_idx[test_idx]
train_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
train_mask = th.zeros((hg.number_of_nodes('paper'),), dtype=th.bool)
train_mask[train_idx] = True
val_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
val_mask = th.zeros((hg.number_of_nodes('paper'),), dtype=th.bool)
val_mask[val_idx] = True
test_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
test_mask = th.zeros((hg.number_of_nodes('paper'),), dtype=th.bool)
test_mask[test_idx] = True
g.ndata['train_mask'] = train_mask
g.ndata['val_mask'] = val_mask
g.ndata['test_mask'] = test_mask
hg.nodes['paper'].data['train_mask'] = train_mask
hg.nodes['paper'].data['val_mask'] = val_mask
hg.nodes['paper'].data['test_mask'] = test_mask
labels = th.full((g.number_of_nodes(),), -1, dtype=paper_labels.dtype)
labels[target_idx] = paper_labels
g.ndata['labels'] = labels
return g
hg.nodes['paper'].data['labels'] = paper_labels
return hg
else:
raise("Do not support other ogbn datasets.")
......@@ -98,21 +68,19 @@ if __name__ == '__main__':
help='turn the graph into an undirected graph.')
argparser.add_argument('--balance_edges', action='store_true',
help='balance the number of edges in each partition.')
argparser.add_argument('--global-norm', default=False, action='store_true',
help='User global norm instead of per node type norm')
args = argparser.parse_args()
start = time.time()
g = load_ogb(args.dataset, args.global_norm)
g = load_ogb(args.dataset)
print('load {} takes {:.3f} seconds'.format(args.dataset, time.time() - start))
print('|V|={}, |E|={}'.format(g.number_of_nodes(), g.number_of_edges()))
print('train: {}, valid: {}, test: {}'.format(th.sum(g.ndata['train_mask']),
th.sum(g.ndata['val_mask']),
th.sum(g.ndata['test_mask'])))
print('train: {}, valid: {}, test: {}'.format(th.sum(g.nodes['paper'].data['train_mask']),
th.sum(g.nodes['paper'].data['val_mask']),
th.sum(g.nodes['paper'].data['test_mask'])))
if args.balance_train:
balance_ntypes = g.ndata['train_mask']
balance_ntypes = {'paper': g.nodes['paper'].data['train_mask']}
else:
balance_ntypes = None
......
......@@ -355,6 +355,22 @@ def sum(input, dim, keepdims=False):
"""
pass
def floor_div(in1, in2):
"""Element-wise integer division and rounds each quotient towards zero.
Parameters
----------
in1 : Tensor
The input tensor
in2 : Tensor or integer
The input
Returns
-------
Tensor
A framework-specific tensor.
"""
def reduce_sum(input):
"""Returns the sum of all elements in the input tensor.
......
......@@ -149,6 +149,9 @@ def sum(input, dim, keepdims=False):
return nd.array([0.], dtype=input.dtype, ctx=input.context)
return nd.sum(input, axis=dim, keepdims=keepdims)
def floor_div(in1, in2):
return in1 / in2
def reduce_sum(input):
return input.sum()
......
......@@ -117,6 +117,9 @@ def copy_to(input, ctx, **kwargs):
def sum(input, dim, keepdims=False):
return th.sum(input, dim=dim, keepdim=keepdims)
def floor_div(in1, in2):
return in1 // in2
def reduce_sum(input):
return input.sum()
......
......@@ -168,6 +168,8 @@ def sum(input, dim, keepdims=False):
input = tf.cast(input, tf.int32)
return tf.reduce_sum(input, axis=dim, keepdims=keepdims)
def floor_div(in1, in2):
return astype(in1 / in2, dtype(in1))
def reduce_sum(input):
if input.dtype == tf.bool:
......
......@@ -184,9 +184,9 @@ class CitationGraphDataset(DGLBuiltinDataset):
self._graph = nx.DiGraph(graph)
self._num_classes = info['num_classes']
self._g.ndata['train_mask'] = generate_mask_tensor(self._g.ndata['train_mask'].numpy())
self._g.ndata['val_mask'] = generate_mask_tensor(self._g.ndata['val_mask'].numpy())
self._g.ndata['test_mask'] = generate_mask_tensor(self._g.ndata['test_mask'].numpy())
self._g.ndata['train_mask'] = generate_mask_tensor(F.asnumpy(self._g.ndata['train_mask']))
self._g.ndata['val_mask'] = generate_mask_tensor(F.asnumpy(self._g.ndata['val_mask']))
self._g.ndata['test_mask'] = generate_mask_tensor(F.asnumpy(self._g.ndata['test_mask']))
# hack for mxnet compatibility
if self.verbose:
......
......@@ -133,7 +133,7 @@ class DistDataLoader:
if not self.drop_last and len(dataset) % self.batch_size != 0:
self.expected_idxs += 1
# We need to have a unique Id for each data loader to identify itself
# We need to have a unique ID for each data loader to identify itself
# in the sampler processes.
global DATALOADER_ID
self.name = "dataloader-" + str(DATALOADER_ID)
......
......@@ -8,17 +8,10 @@ from .role import get_role
from .. import utils
from .. import backend as F
def _get_data_name(name, part_policy):
''' This is to get the name of data in the kvstore.
KVStore doesn't understand node data or edge data. We'll use a prefix to distinguish them.
'''
return part_policy + ':' + name
def _default_init_data(shape, dtype):
return F.zeros(shape, dtype, F.cpu())
# These Ids can identify the anonymous distributed tensors.
# These IDs can identify the anonymous distributed tensors.
DIST_TENSOR_ID = 0
class DistTensor:
......@@ -144,10 +137,12 @@ class DistTensor:
assert not persistent, 'We cannot generate anonymous persistent distributed tensors'
global DIST_TENSOR_ID
# All processes of the same role should create DistTensor synchronously.
# Thus, all of them should have the same Ids.
# Thus, all of them should have the same IDs.
name = 'anonymous-' + get_role() + '-' + str(DIST_TENSOR_ID)
DIST_TENSOR_ID += 1
self._name = _get_data_name(name, part_policy.policy_str)
assert isinstance(name, str), 'name {} is type {}'.format(name, type(name))
data_name = part_policy.get_data_name(name)
self._name = str(data_name)
self._persistent = persistent
if self._name not in exist_names:
self.kvstore.init_data(self._name, shape, dtype, part_policy, init_func)
......
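A hedged sketch of creating a distributed tensor under a per-node-type policy (it assumes `g` is an initialized DistGraph, `th` is PyTorch, and that the constructor accepts `part_policy` as a keyword; the names are illustrative):
policy = g.get_node_partition_policy('paper')
paper_state = dgl.distributed.DistTensor((g.number_of_nodes('paper'), 16), th.float32,
                                         part_policy=policy)
# Without an explicit name, the tensor is registered under an auto-generated
# 'anonymous-<role>-<id>' name, as in the constructor above.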
......@@ -47,10 +47,10 @@ class FindEdgeResponse(Response):
def _sample_neighbors(local_g, partition_book, seed_nodes, fan_out, edge_dir, prob, replace):
""" Sample from local partition.
The input nodes use global Ids. We need to map the global node Ids to local node Ids,
perform sampling and map the sampled results to the global Ids space again.
The input nodes use global IDs. We need to map the global node IDs to local node IDs,
perform sampling and map the sampled results to the global ID space again.
The sampled results are stored in three vectors that store source nodes, destination nodes
and edge Ids.
and edge IDs.
"""
local_ids = partition_book.nid2localnid(seed_nodes, partition_book.partid)
local_ids = F.astype(local_ids, local_g.idtype)
......@@ -59,7 +59,8 @@ def _sample_neighbors(local_g, partition_book, seed_nodes, fan_out, edge_dir, pr
local_g, local_ids, fan_out, edge_dir, prob, replace, _dist_training=True)
global_nid_mapping = local_g.ndata[NID]
src, dst = sampled_graph.edges()
global_src, global_dst = global_nid_mapping[src], global_nid_mapping[dst]
global_src, global_dst = F.gather_row(global_nid_mapping, src), \
F.gather_row(global_nid_mapping, dst)
global_eids = F.gather_row(local_g.edata[EID], sampled_graph.edata[EID])
return global_src, global_dst, global_eids
......@@ -78,10 +79,10 @@ def _find_edges(local_g, partition_book, seed_edges):
def _in_subgraph(local_g, partition_book, seed_nodes):
""" Get in subgraph from local partition.
The input nodes use global Ids. We need to map the global node Ids to local node Ids,
get in-subgraph and map the sampled results to the global Ids space again.
The input nodes use global IDs. We need to map the global node IDs to local node IDs,
get the in-subgraph and map the results to the global ID space again.
The results are stored in three vectors that store source nodes, destination nodes
and edge Ids.
and edge IDs.
"""
local_ids = partition_book.nid2localnid(seed_nodes, partition_book.partid)
local_ids = F.astype(local_ids, local_g.idtype)
......@@ -254,7 +255,19 @@ def sample_neighbors(g, nodes, fanout, edge_dir='in', prob=None, replace=False):
Node/edge features are not preserved. The original IDs of
the sampled edges are stored as the `dgl.EID` feature in the returned graph.
For now, we only support the input graph with one node type and one edge type.
This version provides experimental support for heterogeneous graphs.
When the input graph is heterogeneous, the sampled subgraph is still stored in
the homogeneous graph format. That is, all nodes and edges are assigned
unique IDs (in contrast, we typically use a type name and a node/edge ID to
identify a node or an edge in ``DGLGraph``). We refer to this type of ID
as a *homogeneous ID*.
Users can use :func:`dgl.distributed.GraphPartitionBook.map_to_per_ntype`
and :func:`dgl.distributed.GraphPartitionBook.map_to_per_etype`
to recover the node/edge types and the per-type node/edge IDs.
For heterogeneous graphs, ``nodes`` can be a dictionary whose keys are node types
and whose values are type-specific node IDs; ``nodes`` can also be a tensor of
*homogeneous IDs*.
Parameters
----------
......@@ -292,9 +305,17 @@ def sample_neighbors(g, nodes, fanout, edge_dir='in', prob=None, replace=False):
DGLGraph
A sampled subgraph containing only the sampled neighboring edges. It is on CPU.
"""
gpb = g.get_partition_book()
if isinstance(nodes, dict):
assert len(nodes) == 1, 'The distributed sampler only supports one node type for now.'
nodes = list(nodes.values())[0]
homo_nids = []
for ntype in nodes:
assert ntype in g.ntypes, 'The sampled node type does not exist in the input graph'
if F.is_tensor(nodes[ntype]):
typed_nodes = nodes[ntype]
else:
typed_nodes = toindex(nodes[ntype]).tousertensor()
homo_nids.append(gpb.map_to_homo_nid(typed_nodes, ntype))
nodes = F.cat(homo_nids, 0)
def issue_remote_req(node_ids):
return SamplingRequest(node_ids, fanout, edge_dir=edge_dir,
prob=prob, replace=replace)
......
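A hedged sketch of calling the sampler with per-type seeds, following the docstring above (it assumes `g` is a DistGraph built from the ogbn-mag partitions and `th` is PyTorch; the names are illustrative):
seeds = {'paper': th.tensor([0, 1, 2])}
frontier = dgl.distributed.sample_neighbors(g, seeds, 10)
# The returned frontier is in the homogeneous format; recover edge types and
# per-type edge IDs through the partition book.
gpb = g.get_partition_book()
etype_ids, per_type_eids = gpb.map_to_per_etype(frontier.edata[dgl.EID])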
"""Module for mapping between node/edge IDs and node/edge types."""
import numpy as np
from .._ffi.function import _init_api
from .. import backend as F
from .. import utils
class IdMap:
'''A map for converting node/edge IDs to their type IDs and type-wise IDs.
For a heterogeneous graph, DGL assigns an integer ID to each node/edge type;
nodes and edges of different types have independent IDs starting from zero.
Therefore, a node/edge can be uniquely identified by an ID pair,
``(type_id, type_wise_id)``. To make it convenient for distributed processing,
DGL further encodes the ID pair into one integer ID, which we refer to
as *homogeneous ID*.
DGL arranges nodes and edges so that all nodes of the same type have contiguous
homogeneous IDs. If the graph is partitioned, the nodes/edges of the same type
within a partition have contiguous homogeneous IDs.
Below is an example adjacency matrix of an unpartitioned heterogeneous graph
stored using the above ID assignment. Here, the graph has two types of nodes
(``T0`` and ``T1``), and four types of edges (``R0``, ``R1``, ``R2``, ``R3``).
There are a total of 400 nodes in the graph and each type has 200 nodes. Nodes
of type 0 have IDs in [0,200), while nodes of type 1 have IDs in [200, 400).
```
0 <- T0 -> 200 <- T1 -> 400
0 +-----------+------------+
| | |
^ | R0 | R1 |
T0 | | |
v | | |
200 +-----------+------------+
| | |
^ | R2 | R3 |
T1 | | |
v | | |
400 +-----------+------------+
```
Below is the adjacency matrix after the graph is partitioned into two parts.
Note that each partition still has two node types and four edge types,
and nodes/edges of the same type have contiguous IDs.
```
partition 0 partition 1
0 <- T0 -> 100 <- T1 -> 200 <- T0 -> 300 <- T1 -> 400
0 +-----------+------------+-----------+------------+
| | | |
^ | R0 | R1 | |
T0 | | | |
v | | | |
100 +-----------+------------+ |
| | | |
^ | R2 | R3 | |
T1 | | | |
v | | | |
200 +-----------+------------+-----------+------------+
| | | |
^ | | R0 | R1 |
T0 | | | |
v | | | |
300 | +-----------+------------+
| | | |
^ | | R2 | R3 |
T1 | | | |
v | | | |
400 +-----------+------------+-----------+------------+
```
The following table is an alternative way to represent the above ID assignments.
It is easy to see that the homogeneous ID range [0, 100) is used for nodes of type 0
in partition 0, [100, 200) is used for nodes of type 1 in partition 0, and so on.
```
+-----------+------+-----------+
| range     | type | partition |
+-----------+------+-----------+
| [0, 100)  | 0    | 0         |
| [100,200) | 1    | 0         |
| [200,300) | 0    | 1         |
| [300,400) | 1    | 1         |
+-----------+------+-----------+
```
The goal of this class is to convert a node's homogeneous ID into the
ID pair ``(type_id, type_wise_id)``. For example, homogeneous node ID 90 is mapped
to (0, 90); homogeneous node ID 201 is mapped to (0, 101).
Parameters
----------
id_ranges : dict[str, Tensor].
Node ID ranges within partitions for each node type. The key is the node type
name in string. The value is a tensor of shape :math:`(K, 2)`, where :math:`K` is
the number of partitions. Each row has two integers: the starting and the ending IDs
for a particular node type in a partition. For example, all nodes of type ``"T"`` in
partition ``i`` has ID range ``id_ranges["T"][i][0]`` to ``id_ranges["T"][i][1]``.
It is the same as the `node_map` argument in `RangePartitionBook`.
'''
def __init__(self, id_ranges):
self.num_parts = list(id_ranges.values())[0].shape[0]
self.num_types = len(id_ranges)
ranges = np.zeros((self.num_parts * self.num_types, 2), dtype=np.int64)
typed_map = []
id_ranges = list(id_ranges.values())
id_ranges.sort(key=lambda a: a[0, 0])
for i, id_range in enumerate(id_ranges):
ranges[i::self.num_types] = id_range
map1 = np.cumsum(id_range[:, 1] - id_range[:, 0])
typed_map.append(map1)
assert np.all(np.diff(ranges[:, 0]) >= 0)
assert np.all(np.diff(ranges[:, 1]) >= 0)
self.range_start = utils.toindex(np.ascontiguousarray(ranges[:, 0]))
self.range_end = utils.toindex(np.ascontiguousarray(ranges[:, 1]) - 1)
self.typed_map = utils.toindex(np.concatenate(typed_map))
def __call__(self, ids):
'''Convert the homogeneous IDs to (type_id, type_wise_id).
Parameters
----------
ids : 1D tensor
The homogeneous ID.
Returns
-------
type_ids : Tensor
Type IDs
per_type_ids : Tensor
Type-wise IDs
'''
if self.num_types == 0:
return F.zeros((len(ids),), F.dtype(ids), F.cpu()), ids
if len(ids) == 0:
return ids, ids
ids = utils.toindex(ids)
ret = _CAPI_DGLHeteroMapIds(ids.todgltensor(),
self.range_start.todgltensor(),
self.range_end.todgltensor(),
self.typed_map.todgltensor(),
self.num_parts, self.num_types)
ret = utils.toindex(ret).tousertensor()
return ret[:len(ids)], ret[len(ids):]
_init_api("dgl.distributed.id_map")
......@@ -886,9 +886,9 @@ class KVClient(object):
def push_handler(data_store, name, local_offset, data)
```
`data_store` is a dict that contains all tensors in the kvstore. `name` is the name
of the tensor where new data is pushed to. `local_offset` is the offset where new
data should be written in the tensor in the local partition. `data` is the new data
``data_store`` is a dict that contains all tensors in the kvstore. ``name`` is the name
of the tensor where new data is pushed to. ``local_offset`` is the offset where new
data should be written in the tensor in the local partition. ``data`` is the new data
to be written.
Parameters
......@@ -919,8 +919,8 @@ class KVClient(object):
def pull_handler(data_store, name, local_offset)
```
`data_store` is a dict that contains all tensors in the kvstore. `name` is the name
of the tensor where new data is pushed to. `local_offset` is the offset where new
``data_store`` is a dict that contains all tensors in the kvstore. ``name`` is the name
of the tensor where new data is pushed to. ``local_offset`` is the offset where new
data should be written in the tensor in the local partition.
Parameters
......
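A hedged sketch of a user-defined handler matching the push_handler signature documented above (the accumulate-instead-of-overwrite behaviour is purely illustrative):
def add_push(data_store, name, local_offset, data):
    # accumulate pushed values into the local shard instead of overwriting them
    data_store[name][local_offset] += data
# registered with something like kvclient.register_push_handler(name, add_push)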
......@@ -4,7 +4,6 @@ This kvstore is used when running in the standalone mode
"""
from .. import backend as F
from .graph_partition_book import PartitionPolicy, NODE_PART_POLICY, EDGE_PART_POLICY
class KVClient(object):
''' The fake KVStore client.
......@@ -34,9 +33,11 @@ class KVClient(object):
'''register pull handler'''
self._pull_handlers[name] = func
def add_data(self, name, tensor):
def add_data(self, name, tensor, part_policy):
'''add data to the client'''
self._data[name] = tensor
if part_policy.policy_str not in self._all_possible_part_policy:
self._all_possible_part_policy[part_policy.policy_str] = part_policy
def init_data(self, name, shape, dtype, part_policy, init_func):
'''add new data to the client'''
......@@ -72,7 +73,3 @@ class KVClient(object):
def map_shared_data(self, partition_book):
'''Mapping shared-memory tensor from server to client.'''
self._all_possible_part_policy[NODE_PART_POLICY] = PartitionPolicy(NODE_PART_POLICY,
partition_book)
self._all_possible_part_policy[EDGE_PART_POLICY] = PartitionPolicy(EDGE_PART_POLICY,
partition_book)
......@@ -6,16 +6,16 @@ from ._ffi.function import _init_api
from .heterograph import DGLHeteroGraph
from . import backend as F
from . import utils
from .base import EID, NID
from .base import EID, NID, NTYPE, ETYPE
__all__ = ["metis_partition", "metis_partition_assignment",
"partition_graph_with_halo"]
def reorder_nodes(g, new_node_ids):
""" Generate a new graph with new node Ids.
""" Generate a new graph with new node IDs.
We assign each node in the input graph with a new node Id. This results in
We assign each node in the input graph a new node ID. This results in
a new graph.
Parameters
......@@ -23,11 +23,11 @@ def reorder_nodes(g, new_node_ids):
g : DGLGraph
The input graph
new_node_ids : a tensor
The new node Ids
The new node IDs
Returns
-------
DGLGraph
The graph with new node Ids.
The graph with new node IDs.
"""
assert len(new_node_ids) == g.number_of_nodes(), \
"The number of new node ids must match #nodes in the graph."
......@@ -35,7 +35,7 @@ def reorder_nodes(g, new_node_ids):
sorted_ids, idx = F.sort_1d(new_node_ids.tousertensor())
assert F.asnumpy(sorted_ids[0]) == 0 \
and F.asnumpy(sorted_ids[-1]) == g.number_of_nodes() - 1, \
"The new node Ids are incorrect."
"The new node IDs are incorrect."
new_gidx = _CAPI_DGLReorderGraph_Hetero(
g._graph, new_node_ids.todgltensor())
new_g = DGLHeteroGraph(gidx=new_gidx, ntypes=['_N'], etypes=['_E'])
......@@ -46,6 +46,74 @@ def reorder_nodes(g, new_node_ids):
def _get_halo_heterosubgraph_inner_node(halo_subg):
return _CAPI_GetHaloSubgraphInnerNodes_Hetero(halo_subg)
def reshuffle_graph(g, node_part=None):
'''Reshuffle node IDs and edge IDs of a graph.
This function reshuffles nodes and edges in a graph so that all nodes/edges of the same type
have contiguous IDs. If a graph is partitioned and nodes are assigned to different partitions,
all nodes/edges in a partition get contiguous IDs; within a partition, all nodes/edges of the
same type have contiguous IDs.
Parameters
----------
g : DGLGraph
The input graph.
node_part : Tensor
This is a vector whose length is the same as the number of nodes in the input graph.
Each element indicates the partition ID the corresponding node is assigned to.
Returns
-------
(DGLGraph, Tensor)
The graph whose nodes and edges are reshuffled.
The 1D tensor that indicates the partition IDs of the nodes in the reshuffled graph.
'''
# In this case, we don't need to reshuffle node IDs and edge IDs.
if node_part is None:
g.ndata['orig_id'] = F.arange(0, g.number_of_nodes())
g.edata['orig_id'] = F.arange(0, g.number_of_edges())
return g, None
start = time.time()
if node_part is not None:
node_part = utils.toindex(node_part)
node_part = node_part.tousertensor()
if NTYPE in g.ndata:
is_hetero = len(F.unique(g.ndata[NTYPE])) > 1
else:
is_hetero = False
if is_hetero:
num_node_types = F.max(g.ndata[NTYPE], 0) + 1
if node_part is not None:
sorted_part, new2old_map = F.sort_1d(node_part * num_node_types + g.ndata[NTYPE])
else:
sorted_part, new2old_map = F.sort_1d(g.ndata[NTYPE])
sorted_part = F.floor_div(sorted_part, num_node_types)
elif node_part is not None:
sorted_part, new2old_map = F.sort_1d(node_part)
else:
g.ndata['orig_id'] = g.ndata[NID]
g.edata['orig_id'] = g.edata[EID]
return g, None
new_node_ids = np.zeros((g.number_of_nodes(),), dtype=np.int64)
new_node_ids[F.asnumpy(new2old_map)] = np.arange(0, g.number_of_nodes())
# If the input graph is homogeneous, we only need to create an empty array, so that
# _CAPI_DGLReassignEdges_Hetero knows how to handle it.
etype = g.edata[ETYPE] if ETYPE in g.edata else F.zeros((0), F.dtype(sorted_part), F.cpu())
g = reorder_nodes(g, new_node_ids)
node_part = utils.toindex(sorted_part)
# We reassign edges in the in-CSR format. In this way, after partitioning, we can ensure
# that all edges in a partition fall in a contiguous ID space.
etype_idx = utils.toindex(etype)
orig_eids = _CAPI_DGLReassignEdges_Hetero(g._graph, etype_idx.todgltensor(),
node_part.todgltensor(), True)
orig_eids = utils.toindex(orig_eids)
orig_eids = orig_eids.tousertensor()
g.edata['orig_id'] = orig_eids
print('Reshuffle nodes and edges: {:.3f} seconds'.format(time.time() - start))
return g, node_part.tousertensor()
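# A hedged usage sketch (not part of this diff): reshuffle a homogenized heterograph after a
# METIS assignment. `hg` and the number of partitions are illustrative; dgl.to_homogeneous
# attaches the NTYPE/ETYPE data that reshuffle_graph looks for.
#   g = dgl.to_homogeneous(hg)
#   node_part = metis_partition_assignment(g, 4)   # defined later in this file
#   g, node_part = reshuffle_graph(g, node_part)
#   # g.ndata['orig_id'] / g.edata['orig_id'] map the reshuffled IDs back to the input graph.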
def partition_graph_with_halo(g, node_part, extra_cached_hops, reshuffle=False):
'''Partition a graph.
......@@ -55,10 +123,10 @@ def partition_graph_with_halo(g, node_part, extra_cached_hops, reshuffle=False):
not belong to the partition of a subgraph but are connected to the nodes
in the partition within a fixed number of hops.
If `reshuffle` is turned on, the function reshuffles node Ids and edge Ids
If `reshuffle` is turned on, the function reshuffles node IDs and edge IDs
of the input graph before partitioning. After reshuffling, all nodes and edges
in a partition fall in a contiguous Id range in the input graph.
The partitioend subgraphs have node data 'orig_id', which stores the node Ids
in a partition fall in a contiguous ID range in the input graph.
The partitioned subgraphs have node data 'orig_id', which stores the node IDs
in the original input graph.
Parameters
......@@ -68,37 +136,24 @@ def partition_graph_with_halo(g, node_part, extra_cached_hops, reshuffle=False):
node_part: 1D tensor
Specify which partition a node is assigned to. The length of this tensor
needs to be the same as the number of nodes of the graph. Each element
indicates the partition Id of a node.
indicates the partition ID of a node.
extra_cached_hops: int
The number of hops a HALO node can be accessed.
reshuffle : bool
Resuffle nodes so that nodes in the same partition are in the same Id range.
Reshuffle nodes so that nodes in the same partition are in the same ID range.
Returns
--------
a dict of DGLGraphs
The key is the partition Id and the value is the DGLGraph of the partition.
The key is the partition ID and the value is the DGLGraph of the partition.
'''
assert len(node_part) == g.number_of_nodes()
node_part = utils.toindex(node_part)
if reshuffle:
start = time.time()
node_part = node_part.tousertensor()
sorted_part, new2old_map = F.sort_1d(node_part)
new_node_ids = np.zeros((g.number_of_nodes(),), dtype=np.int64)
new_node_ids[F.asnumpy(new2old_map)] = np.arange(
0, g.number_of_nodes())
g = reorder_nodes(g, new_node_ids)
node_part = utils.toindex(sorted_part)
# We reassign edges in in-CSR. In this way, after partitioning, we can ensure
# that all edges in a partition are in the contiguous Id space.
orig_eids = _CAPI_DGLReassignEdges_Hetero(g._graph, True)
orig_eids = utils.toindex(orig_eids)
orig_eids = orig_eids.tousertensor()
g, node_part = reshuffle_graph(g, node_part)
orig_nids = g.ndata['orig_id']
print('Reshuffle nodes and edges: {:.3f} seconds'.format(
time.time() - start))
orig_eids = g.edata['orig_id']
node_part = utils.toindex(node_part)
start = time.time()
subgs = _CAPI_DGLPartitionWithHalo_Hetero(
g._graph, node_part.todgltensor(), extra_cached_hops)
......@@ -171,7 +226,7 @@ def metis_partition_assignment(g, k, balance_ntypes=None, balance_edges=False):
Returns
-------
a 1-D tensor
A vector with each element that indicates the partition Id of a vertex.
A vector with each element that indicates the partition ID of a vertex.
'''
# METIS works only on symmetric graphs.
# The METIS runs on the symmetric graph to generate the node assignment to partitions.
......@@ -252,10 +307,10 @@ def metis_partition(g, k, extra_cached_hops=0, reshuffle=False,
To balance the node types, a user needs to pass a vector of N elements to indicate
the type of each node. N is the number of nodes in the input graph.
If `reshuffle` is turned on, the function reshuffles node Ids and edge Ids
If `reshuffle` is turned on, the function reshuffles node IDs and edge IDs
of the input graph before partitioning. After reshuffling, all nodes and edges
in a partition fall in a contiguous Id range in the input graph.
The partitioend subgraphs have node data 'orig_id', which stores the node Ids
in a partition fall in a contiguous ID range in the input graph.
The partitioned subgraphs have node data 'orig_id', which stores the node IDs
in the original input graph.
The partitioned subgraph is stored in DGLGraph. The DGLGraph has the `part_id`
......@@ -271,7 +326,7 @@ def metis_partition(g, k, extra_cached_hops=0, reshuffle=False,
extra_cached_hops: int
The number of hops a HALO node can be accessed.
reshuffle : bool
Resuffle nodes so that nodes in the same partition are in the same Id range.
Reshuffle nodes so that nodes in the same partition are in the same ID range.
balance_ntypes : tensor
Node type of each node
balance_edges : bool
......@@ -280,7 +335,7 @@ def metis_partition(g, k, extra_cached_hops=0, reshuffle=False,
Returns
--------
a dict of DGLGraphs
The key is the partition Id and the value is the DGLGraph of the partition.
The key is the partition ID and the value is the DGLGraph of the partition.
'''
node_part = metis_partition_assignment(g, k, balance_ntypes, balance_edges)
if node_part is None:
......@@ -289,5 +344,4 @@ def metis_partition(g, k, extra_cached_hops=0, reshuffle=False,
# Then we split the original graph into parts based on the METIS partitioning results.
return partition_graph_with_halo(g, node_part, extra_cached_hops, reshuffle)
_init_api("dgl.partition")
......@@ -719,4 +719,61 @@ DGL_REGISTER_GLOBAL("graph_index._CAPI_DGLMapSubgraphNID")
*rv = GraphOp::MapParentIdToSubgraphId(parent_vids, query);
});
template<class IdType>
IdArray MapIds(IdArray ids, IdArray range_starts, IdArray range_ends, IdArray typed_map,
int num_parts, int num_types) {
int64_t num_ids = ids->shape[0];
int64_t num_ranges = range_starts->shape[0];
IdArray ret = IdArray::Empty({num_ids * 2}, ids->dtype, ids->ctx);
const IdType *range_start_data = static_cast<IdType *>(range_starts->data);
const IdType *range_end_data = static_cast<IdType *>(range_ends->data);
const IdType *ids_data = static_cast<IdType *>(ids->data);
const IdType *typed_map_data = static_cast<IdType *>(typed_map->data);
IdType *types_data = static_cast<IdType *>(ret->data);
IdType *per_type_ids_data = static_cast<IdType *>(ret->data) + num_ids;
#pragma omp parallel for
for (int64_t i = 0; i < ids->shape[0]; i++) {
IdType id = ids_data[i];
auto it = std::lower_bound(range_end_data, range_end_data + num_ranges, id);
// The range must exist.
BUG_ON(it != range_end_data + num_ranges);
size_t range_id = it - range_end_data;
int type_id = range_id % num_types;
types_data[i] = type_id;
int part_id = range_id / num_types;
BUG_ON(part_id < num_parts);
if (part_id == 0) {
per_type_ids_data[i] = id - range_start_data[range_id];
} else {
per_type_ids_data[i] = id - range_start_data[range_id]
+ typed_map_data[num_parts * type_id + part_id - 1];
}
}
return ret;
}
DGL_REGISTER_GLOBAL("distributed.id_map._CAPI_DGLHeteroMapIds")
.set_body([] (DGLArgs args, DGLRetValue* rv) {
const IdArray ids = args[0];
const IdArray range_starts = args[1];
const IdArray range_ends = args[2];
const IdArray typed_map = args[3];
int num_parts = args[4];
int num_types = args[5];
int num_ranges = range_starts->shape[0];
CHECK_EQ(range_starts->dtype.bits, ids->dtype.bits);
CHECK_EQ(range_ends->dtype.bits, ids->dtype.bits);
CHECK_EQ(typed_map->dtype.bits, ids->dtype.bits);
CHECK_EQ(num_ranges, num_parts * num_types);
CHECK_EQ(num_ranges, range_ends->shape[0]);
IdArray ret;
ATEN_ID_TYPE_SWITCH(ids->dtype, IdType, {
ret = MapIds<IdType>(ids, range_starts, range_ends, typed_map, num_parts, num_types);
});
*rv = ret;
});
} // namespace dgl