Unverified Commit dad3606a authored by kylasa, committed by GitHub

Support new format for multi-file support in distributed partitioning. (#4217)

* Code changes for the following

1. Generating node data at each process
2. Reading csv files using pyarrow
3. Feature-complete code.

* Fixed some typos that were causing unit test failures

1. Use the correct file name when loading edges from a file.
2. When storing node features after shuffling, use the correct key to store the global-nids of node features received after transmission.

* Code changes to address CI comments by reviewers

1. Removed some redundant code and added text to the docstrings to describe the functionality of some functions.
2. Function signatures and invocations now match with respect to their argument lists.
3. Added a detailed description of the metadata json structure so that users understand the type of information present in this file and how it is used throughout the code.

* Addressing code review comments

1. Addressed all the CI comments; among other changes, this simplifies the code related to list concatenation and enhances the docstrings of the functions changed in the process.

* Update the docstrings of two functions in response to code review comments

Removed "todo" from the docstring of the gen_nodedata function.
Added "todo" to the gen_dist_partitions function when node-id to partition-id's are read for the first time.

Removed 'num-node-weights' from the docstring of the get_dataset function and added a schema_map entry to the docstring's argument list.
parent 9948ef4d
...@@ -18,6 +18,48 @@ def create_dgl_object(graph_name, num_parts, \ ...@@ -18,6 +18,48 @@ def create_dgl_object(graph_name, num_parts, \
This function creates dgl objects for a given graph partition, as in function This function creates dgl objects for a given graph partition, as in function
arguments. arguments.
The "schema" argument is a dictionary, which contains the metadata related to node ids
and edge ids. It contains two keys: "nid" and "eid", whose value is also a dictionary
with the following structure.
1. The key-value pairs in the "nid" dictionary have the following format.
"ntype-name" is the user-assigned name of this node type, "format" describes the
format of the file contents, and "data" is a list of lists, where each list has
3 elements: file-name, start_id and end_id. The file name can be either an absolute
or a relative path, and the starting and ending ids are the type ids of the nodes
contained in that file. These type ids are later used to compute the global ids of
these nodes, which are used throughout this pipeline.
"ntype-name" : {
"format" : "csv",
"data" : [
[ <path-to-file>/ntype0-name-0.csv, start_id0, end_id0],
[ <path-to-file>/ntype0-name-1.csv, start_id1, end_id1],
...
[ <path-to-file>/ntype0-name-<p-1>.csv, start_id<p-1>, end_id<p-1>],
]
}
2. The key-value pairs in the "eid" dictionary have the following format.
The "eid" dictionary is structured like the "nid" dictionary described above,
except that its entries describe edges.
"etype-name" : {
"format" : "csv",
"data" : [
[ <path-to-file>/etype0-name-0, start_id0, end_id0],
[ <path-to-file>/etype0-name-1, start_id1, end_id1],
...
[ <path-to-file>/etype0-name-<p-1>, start_id<p-1>, end_id<p-1>]
]
}
In "nid" dictionary, the type_nids are specified that
should be assigned to nodes which are read from the corresponding nodes file.
Along the same lines dictionary for the key "eid" is used for edges in the
input graph.
These type ids, for nodes and edges, are used to compute global ids for nodes
and edges which are stored in the graph object.
Parameters: Parameters:
----------- -----------
graph_name : string graph_name : string
...@@ -39,8 +81,6 @@ def create_dgl_object(graph_name, num_parts, \ ...@@ -39,8 +81,6 @@ def create_dgl_object(graph_name, num_parts, \
edgeid_offset : int edgeid_offset : int
offset to be used when assigning edge global ids in the current partition offset to be used when assigning edge global ids in the current partition
return compact_g2, node_map_val, edge_map_val, ntypes_map, etypes_map
Returns: Returns:
-------- --------
dgl object dgl object
...@@ -54,14 +94,21 @@ def create_dgl_object(graph_name, num_parts, \ ...@@ -54,14 +94,21 @@ def create_dgl_object(graph_name, num_parts, \
dictionary dictionary
map between edge type(string) and edge_type_id(int) map between edge type(string) and edge_type_id(int)
""" """
#create auxiliary data structures from the schema object #create auxiliary data structures from the schema object
global_nid_ranges = schema['nid'] node_info = schema["nid"]
global_eid_ranges = schema['eid'] offset = 0
global_nid_ranges = {key: np.array(global_nid_ranges[key]).reshape( global_nid_ranges = {}
1, 2) for key in global_nid_ranges} for k, v in node_info.items():
global_eid_ranges = {key: np.array(global_eid_ranges[key]).reshape( global_nid_ranges[k] = np.array([offset + int(v["data"][0][1]), offset + int(v["data"][-1][2])]).reshape(1,2)
1, 2) for key in global_eid_ranges} offset += int(v["data"][-1][2])
edge_info = schema["eid"]
offset = 0
global_eid_ranges = {}
for k, v in edge_info.items():
global_eid_ranges[k] = np.array([offset + int(v["data"][0][1]), offset + int(v["data"][-1][2])]).reshape(1,2)
offset += int(v["data"][-1][2])
id_map = dgl.distributed.id_map.IdMap(global_nid_ranges) id_map = dgl.distributed.id_map.IdMap(global_nid_ranges)
ntypes = [(key, global_nid_ranges[key][0, 0]) for key in global_nid_ranges] ntypes = [(key, global_nid_ranges[key][0, 0]) for key in global_nid_ranges]
......
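As an aside for readers of this diff: the range computation introduced above can be sketched in isolation. This is a minimal, hedged illustration (not the pipeline's actual code) of how per-type global nid ranges might be derived from an "nid"-style dictionary; the node type names, file names and ids below are hypothetical. The "eid" ranges are computed the same way.

import numpy as np

# Hypothetical "nid" entries: two files per node type, each as [file-name, start_tid, end_tid].
node_info = {
    "author": {"format": "csv", "data": [["author-0.csv", 0, 50], ["author-1.csv", 50, 100]]},
    "paper":  {"format": "csv", "data": [["paper-0.csv", 0, 30], ["paper-1.csv", 30, 60]]},
}

offset = 0
global_nid_ranges = {}
for ntype, v in node_info.items():
    start = offset + int(v["data"][0][1])    # first file's starting type id
    end = offset + int(v["data"][-1][2])     # last file's ending type id
    global_nid_ranges[ntype] = np.array([start, end]).reshape(1, 2)
    offset += int(v["data"][-1][2])          # advance by this type's node count

# global_nid_ranges -> {"author": array([[0, 100]]), "paper": array([[100, 160]])}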
...@@ -15,19 +15,9 @@ def log_params(params): ...@@ -15,19 +15,9 @@ def log_params(params):
print('Graph Name: ', params.graph_name) print('Graph Name: ', params.graph_name)
print('Schema File: ', params.schema) print('Schema File: ', params.schema)
print('No. partitions: ', params.num_parts) print('No. partitions: ', params.num_parts)
print('No. node weights: ', params.num_node_weights)
print('Workspace dir: ', params.workspace)
print('Node Attr Type: ', params.node_attr_dtype)
print('Edge Attr Dtype: ', params.edge_attr_dtype)
print('Output Dir: ', params.output) print('Output Dir: ', params.output)
print('Removed Edges File: ', params.removed_edges)
print('WorldSize: ', params.world_size) print('WorldSize: ', params.world_size)
print('Nodes File: ', params.nodes_file)
print('Edges File: ', params.edges_file)
print('Node feats: ', params.node_feats_file)
print('Edge feats: ', params.edge_feats_file)
print('Metis partitions: ', params.partitions_file) print('Metis partitions: ', params.partitions_file)
print('Exec Type: ', params.exec_type)
if __name__ == "__main__": if __name__ == "__main__":
""" """
...@@ -44,38 +34,15 @@ if __name__ == "__main__": ...@@ -44,38 +34,15 @@ if __name__ == "__main__":
help='The schema of the graph') help='The schema of the graph')
parser.add_argument('--num-parts', required=True, type=int, parser.add_argument('--num-parts', required=True, type=int,
help='The number of partitions') help='The number of partitions')
parser.add_argument('--num-node-weights', required=True, type=int,
help='The number of node weights used by METIS.')
parser.add_argument('--workspace', type=str, default='/tmp',
help='The directory to store the intermediate results')
parser.add_argument('--node-attr-dtype', type=str, default=None,
help='The data type of the node attributes')
parser.add_argument('--edge-attr-dtype', type=str, default=None,
help='The data type of the edge attributes')
parser.add_argument('--output', required=True, type=str, parser.add_argument('--output', required=True, type=str,
help='The output directory of the partitioned results') help='The output directory of the partitioned results')
parser.add_argument('--removed-edges', help='a file that contains the removed self-loops and duplicated edges',
default=None, type=str)
parser.add_argument('--exec-type', type=int, default=0,
help='Use 0 for single machine run and 1 for distributed execution')
#arguments needed for the distributed implementation #arguments needed for the distributed implementation
parser.add_argument('--world-size', help='no. of processes to spawn', parser.add_argument('--world-size', help='no. of processes to spawn',
default=1, type=int, required=True) default=1, type=int, required=True)
parser.add_argument('--nodes-file', help='filename of the nodes metadata',
default=None, type=str, required=True)
parser.add_argument('--edges-file', help='filename of the nodes metadata',
default=None, type=str, required=True)
parser.add_argument('--node-feats-file', help='filename of the nodes features',
default=None, type=str, required=True)
parser.add_argument('--edge-feats-file', help='filename of the nodes metadata',
default=None, type=str )
parser.add_argument('--partitions-file', help='filename of the output of dgl_part2 (metis partitions)', parser.add_argument('--partitions-file', help='filename of the output of dgl_part2 (metis partitions)',
default=None, type=str) default=None, type=str)
params = parser.parse_args() params = parser.parse_args()
#invoke the starting function here. #invoke the pipeline function
if(params.exec_type == 0): multi_machine_run(params)
single_machine_run(params)
else:
multi_machine_run(params)
...@@ -3,7 +3,10 @@ import numpy as np ...@@ -3,7 +3,10 @@ import numpy as np
import constants import constants
import torch import torch
def get_dataset(input_dir, graph_name, rank, num_node_weights): import pyarrow
from pyarrow import csv
def get_dataset(input_dir, graph_name, rank, world_size, schema_map):
""" """
Function to read the multiple file formatted dataset. Function to read the multiple file formatted dataset.
...@@ -15,64 +18,91 @@ def get_dataset(input_dir, graph_name, rank, num_node_weights): ...@@ -15,64 +18,91 @@ def get_dataset(input_dir, graph_name, rank, num_node_weights):
graph name string graph name string
rank : int rank : int
rank of the current process rank of the current process
num_node_weights : int world_size : int
integer indicating the no. of weights each node is attributed with total number of processes in the current execution
schema_map : dictionary
this is the dictionary created by reading the graph metadata json file
for the input graph dataset
Return: Return:
------- -------
dictionary dictionary
Data read from nodes.txt file and used to build a dictionary with keys as column names where keys are node-type names and values are tuples. Each tuple represents the
and values as columns in the csv file. range of type ids read from a file by the current process. Please note that node
data for each node type is split into "p" files, and each of these "p" files is
read by a process in the distributed graph partitioning pipeline
dictionary dictionary
Data read from numpy files for all the node features in this dataset. Dictionary built Data read from numpy files for all the node features in this dataset. Dictionary built
using this data has keys as node feature names and values as tensor data representing using this data has keys as node feature names and values as tensor data representing
node features node features
dictionary
in which keys are node-type names and values are triplets. Each triplet has a node-feature name
and the range of tids for the node-feature data read from files by the current process. Each
node-type may have multiple features and associated tensor data.
dictionary dictionary
Data read from edges.txt file and used to build a dictionary with keys as column names Data read from edges.txt file and used to build a dictionary with keys as column names
and values as columns in the csv file. and values as columns in the csv file.
dictionary
in which keys are edge-type names and values are triplets. Each triplet has an edge-feature name
and the range of tids for the edge-feature data read from files by the current process. Each
edge-type may have several edge features and associated tensor data.
""" """
#node features dictionary #node features dictionary
node_features = {} node_features = {}
node_feature_tids = {}
#iterate over the sub-dirs and extract the nodetypes #iterate over the "node_data" dictionary in the schema_map
#in each nodetype folder read all the features assigned to #read the node features if exists
#current rank #also keep track of the type_nids for which the node_features are read.
siblings = os.listdir(input_dir) dataset_features = schema_map["node_data"]
for s in siblings: for ntype_name, ntype_feature_data in dataset_features.items():
if s.startswith("nodes-"): #ntype_feature_data is a dictionary
tokens = s.split("-") #where key: feature_name, value: list of lists
ntype = tokens[1] node_feature_tids[ntype_name] = []
num_feats = tokens[2] for feat_name, feat_data in ntype_feature_data.items():
for idx in range(int(num_feats)): assert len(feat_data) == world_size
feat_file = s +'/node-feat-'+'{:02d}'.format(idx) +'/'+ str(rank)+'.npy' my_feat_data = feat_data[rank]
if (os.path.exists(input_dir+'/'+feat_file)): if (os.path.isabs(my_feat_data[0])):
features = np.load(input_dir+'/'+feat_file) node_features[ntype_name+'/'+feat_name] = torch.from_numpy(np.load(my_feat_data[0]))
node_features[ntype+'/feat'] = torch.tensor(features) else:
node_features[ntype_name+'/'+feat_name] = torch.from_numpy(np.load(input_dir+my_feat_data[0]))
#done build node_features locally. node_feature_tids[ntype_name].append([feat_name, my_feat_data[1], my_feat_data[2]])
if len(node_features) <= 0:
print('[Rank: ', rank, '] This dataset does not have any node features')
else:
for k, v in node_features.items():
print('[Rank: ', rank, '] node feature name: ', k, ', feature data shape: ', v.size())
#read (split) xxx_nodes.txt file #read my nodes for each node type
node_file = input_dir+'/'+graph_name+'_nodes'+'_{:02d}.txt'.format(rank) node_tids = {}
node_data = np.loadtxt(node_file, delimiter=' ', dtype='int64') node_data = schema_map["nid"]
nodes_datadict = {} for ntype_name, ntype_info in node_data.items():
nodes_datadict[constants.NTYPE_ID] = node_data[:,0] v = []
type_idx = 0 + num_node_weights + 1 node_file_info = ntype_info["data"]
nodes_datadict[constants.GLOBAL_TYPE_NID] = node_data[:,type_idx] for idx in range(len(node_file_info)):
print('[Rank: ', rank, '] Done reading node_data: ', len(nodes_datadict), nodes_datadict[constants.NTYPE_ID].shape) v.append((node_file_info[idx][1], node_file_info[idx][2]))
node_tids[ntype_name] = v
#read (split) xxx_edges.txt file #read my edges for each edge type
edge_tids = {}
edge_datadict = {} edge_datadict = {}
edge_file = input_dir+'/'+graph_name+'_edges'+'_{:02d}.txt'.format(rank) edge_data = schema_map["eid"]
edge_data = np.loadtxt(edge_file, delimiter=' ', dtype='int64') for etype_name, etype_info in edge_data.items():
edge_datadict[constants.GLOBAL_SRC_ID] = edge_data[:,0] assert etype_info["format"] == "csv"
edge_datadict[constants.GLOBAL_DST_ID] = edge_data[:,1]
edge_datadict[constants.GLOBAL_TYPE_EID] = edge_data[:,2] edge_info = etype_info["data"]
edge_datadict[constants.ETYPE_ID] = edge_data[:,3] assert len(edge_info) == world_size
data_df = csv.read_csv(edge_info[rank][0], read_options=pyarrow.csv.ReadOptions(autogenerate_column_names=True),
parse_options=pyarrow.csv.ParseOptions(delimiter=' '))
edge_datadict[constants.GLOBAL_SRC_ID] = data_df['f0'].to_numpy()
edge_datadict[constants.GLOBAL_DST_ID] = data_df['f1'].to_numpy()
edge_datadict[constants.GLOBAL_TYPE_EID] = data_df['f2'].to_numpy()
edge_datadict[constants.ETYPE_ID] = data_df['f3'].to_numpy()
v = []
edge_file_info = etype_info["data"]
for idx in range(len(edge_file_info)):
v.append((edge_file_info[idx][1], edge_file_info[idx][2]))
edge_tids[etype_name] = v
print('[Rank: ', rank, '] Done reading edge_file: ', len(edge_datadict), edge_datadict[constants.GLOBAL_SRC_ID].shape) print('[Rank: ', rank, '] Done reading edge_file: ', len(edge_datadict), edge_datadict[constants.GLOBAL_SRC_ID].shape)
return nodes_datadict, node_features, edge_datadict return node_tids, node_features, node_feature_tids, edge_datadict, edge_tids
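For context, get_dataset now parses edge files with pyarrow instead of np.loadtxt. Below is a minimal sketch of that read path, assuming a headerless, space-delimited file whose four integer columns are source id, destination id, per-type edge id and edge type id; the file name is hypothetical.

import pyarrow
from pyarrow import csv

# Columns are auto-named f0, f1, ... because the file has no header row.
table = csv.read_csv(
    "edges-rank0.csv",  # hypothetical path
    read_options=pyarrow.csv.ReadOptions(autogenerate_column_names=True),
    parse_options=pyarrow.csv.ParseOptions(delimiter=' '))

src_ids = table['f0'].to_numpy()    # global source node ids
dst_ids = table['f1'].to_numpy()    # global destination node ids
type_eids = table['f2'].to_numpy()  # per-type edge ids
etype_ids = table['f3'].to_numpy()  # edge type ids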
...@@ -45,7 +45,7 @@ def get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data): ...@@ -45,7 +45,7 @@ def get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data):
#allocate buffers to receive node-ids #allocate buffers to receive node-ids
recv_nodes = [] recv_nodes = []
for i in recv_counts: for i in recv_counts:
recv_nodes.append(torch.zeros([i.item()], dtype=torch.int64)) recv_nodes.append(torch.zeros(i.tolist(), dtype=torch.int64))
#form the outgoing message #form the outgoing message
send_nodes = [] send_nodes = []
...@@ -67,17 +67,18 @@ def get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data): ...@@ -67,17 +67,18 @@ def get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data):
global_nids = proc_i_nodes.numpy() global_nids = proc_i_nodes.numpy()
if (len(global_nids) != 0): if (len(global_nids) != 0):
common, ind1, ind2 = np.intersect1d(node_data[constants.GLOBAL_NID], global_nids, return_indices=True) common, ind1, ind2 = np.intersect1d(node_data[constants.GLOBAL_NID], global_nids, return_indices=True)
values = node_data[constants.SHUFFLE_GLOBAL_NID][ind1] shuffle_global_nids = node_data[constants.SHUFFLE_GLOBAL_NID][ind1]
send_nodes.append(torch.Tensor(values).type(dtype=torch.int64)) send_nodes.append(torch.from_numpy(shuffle_global_nids).type(dtype=torch.int64))
else: else:
send_nodes.append(torch.empty((0,), dtype=torch.int64)) send_nodes.append(torch.empty((0), dtype=torch.int64))
#send receive global-ids #send receive global-ids
alltoallv_cpu(rank, world_size, recv_shuffle_global_nids, send_nodes) alltoallv_cpu(rank, world_size, recv_shuffle_global_nids, send_nodes)
shuffle_global_nids = [x.numpy() for x in recv_shuffle_global_nids] shuffle_global_nids = np.concatenate([x.numpy() for x in recv_shuffle_global_nids])
global_nids = [x for x in global_nids_ranks] global_nids = np.concatenate([x for x in global_nids_ranks])
return np.column_stack((np.concatenate(global_nids), np.concatenate(shuffle_global_nids))) ret_val = np.column_stack([global_nids, shuffle_global_nids])
return ret_val
def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, node_data): def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, node_data):
...@@ -122,7 +123,7 @@ def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, no ...@@ -122,7 +123,7 @@ def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, no
global_nids_ranks.append(not_owned_nodes) global_nids_ranks.append(not_owned_nodes)
#Retrieve Global-ids for respective node owners #Retrieve Global-ids for respective node owners
resolved_global_nids = get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data) non_local_nids = get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data)
#Add global_nid <-> shuffle_global_nid mappings to the received data #Add global_nid <-> shuffle_global_nid mappings to the received data
for i in range(world_size): for i in range(world_size):
...@@ -132,7 +133,7 @@ def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, no ...@@ -132,7 +133,7 @@ def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, no
common, ind1, ind2 = np.intersect1d(node_data[constants.GLOBAL_NID], own_global_nids, return_indices=True) common, ind1, ind2 = np.intersect1d(node_data[constants.GLOBAL_NID], own_global_nids, return_indices=True)
my_shuffle_global_nids = node_data[constants.SHUFFLE_GLOBAL_NID][ind1] my_shuffle_global_nids = node_data[constants.SHUFFLE_GLOBAL_NID][ind1]
local_mappings = np.column_stack((own_global_nids, my_shuffle_global_nids)) local_mappings = np.column_stack((own_global_nids, my_shuffle_global_nids))
resolved_global_nids = np.concatenate((resolved_global_nids, local_mappings)) resolved_global_nids = np.concatenate((non_local_nids, local_mappings))
#form a dictionary of mappings between orig-node-ids and global-ids #form a dictionary of mappings between orig-node-ids and global-ids
resolved_mappings = dict(zip(resolved_global_nids[:,0], resolved_global_nids[:,1])) resolved_mappings = dict(zip(resolved_global_nids[:,0], resolved_global_nids[:,1]))
......
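The shuffle-global-nid lookup in this hunk relies on np.intersect1d with return_indices=True. A small standalone sketch with made-up ids (note that intersect1d returns the matches in sorted order of the common values):

import numpy as np

# Nodes owned locally: global nids and their shuffled (post-partition) ids.
local_global_nids = np.array([10, 11, 12, 13])
local_shuffle_nids = np.array([200, 201, 202, 203])

# Global nids requested by another rank.
requested = np.array([12, 10])

common, ind1, ind2 = np.intersect1d(local_global_nids, requested, return_indices=True)
reply = local_shuffle_nids[ind1]
# common -> [10, 12]; ind1 -> [0, 2]; reply -> [200, 202]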
...@@ -5,6 +5,9 @@ import json ...@@ -5,6 +5,9 @@ import json
import dgl import dgl
import constants import constants
import pyarrow
from pyarrow import csv
def read_partitions_file(part_file): def read_partitions_file(part_file):
""" """
Utility method to read metis partitions, which is the output of Utility method to read metis partitions, which is the output of
...@@ -47,6 +50,34 @@ def read_json(json_file): ...@@ -47,6 +50,34 @@ def read_json(json_file):
return val return val
def get_ntype_featnames(ntype_name, schema):
"""
Retrieves node feature names for a given node_type
Parameters:
-----------
ntype_name : string
a string specifying a node_type name
schema : dictionary
metadata json object as a dictionary, which is read from the input
metadata file from the input dataset
Returns:
--------
list :
a list of feature names for a given node_type
"""
ntype_dict = schema["node_data"]
if (ntype_name in ntype_dict):
featnames = []
ntype_info = ntype_dict[ntype_name]
for k, v in ntype_info.items():
featnames.append(k)
return featnames
else:
return []
def get_node_types(schema): def get_node_types(schema):
""" """
Utility method to extract node_typename -> node_type mappings Utility method to extract node_typename -> node_type mappings
...@@ -60,72 +91,47 @@ def get_node_types(schema): ...@@ -60,72 +91,47 @@ def get_node_types(schema):
Returns: Returns:
-------- --------
dictionary, list dictionary
dictionary with ntype <-> type_nid mappings with keys as node type names and values as ids (integers)
list of ntype strings list
""" list of ntype name strings
#Get the node_id ranges from the schema dictionary
global_nid_ranges = schema['nid'] with keys as ntype ids (integers) and values as node type names
global_nid_ranges = {key: np.array(global_nid_ranges[key]).reshape(1,2)
for key in global_nid_ranges}
#Create an array with the starting id for each node_type and sort
ntypes = [(key, global_nid_ranges[key][0,0]) for key in global_nid_ranges]
ntypes.sort(key=lambda e: e[1])
#Create node_typename -> node_type dictionary
ntypes = [e[0] for e in ntypes]
ntypes_map = {e: i for i, e in enumerate(ntypes)}
return ntypes_map, ntypes
def get_edge_types(schema):
"""
Utility function to form edges dictionary between edge_type names and ids
Parameters
----------
schema : dictionary
Input schema from which the edge_typename -> edge_type
dictionary is defined
Returns
-------
dictionary:
a map between edgetype_names and ids
list:
list of edgetype_names
""" """
ntype_info = schema["nid"]
global_eid_ranges = schema['eid'] ntypes = []
global_eid_ranges = {key: np.array(global_eid_ranges[key]).reshape(1,2) for k in ntype_info.keys():
for key in global_eid_ranges} ntypes.append(k)
etypes = [(key, global_eid_ranges[key][0, 0]) for key in global_eid_ranges] ntype_ntypeid_map = {e: i for i, e in enumerate(ntypes)}
etypes.sort(key=lambda e: e[1]) ntypeid_ntype_map = {str(i): e for i, e in enumerate(ntypes)}
return ntype_ntypeid_map, ntypes, ntypeid_ntype_map
etypes = [e[0] for e in etypes]
etypes_map = {e: i for i, e in enumerate(etypes)} def get_gnid_range_map(node_tids):
return etypes_map, etypes
def get_ntypes_map(schema):
""" """
Utility function to return nodes global id range from the input schema Retrieves auxiliary dictionaries from the metadata json object
as well as node count per each node type
Parameters: Parameters:
----------- -----------
schema : dictionary node_tids: dictionary
Input schema where the requested dictionaries are defined This dictionary contains the information about nodes for each node_type.
Typically this information contains p entries, where each entry has a file name and the
starting and ending type_node_ids for the nodes in that file. Keys in this dictionary
are node_types and values are lists of lists; each individual entry in such a list has
three items: file-name, starting type_nid and ending type_nid.
Returns: Returns:
-------- --------
dictionary : dictionary :
map between the node_types and global_id ranges for each node_type a dictionary where keys are node_type names and values are the global_nid range, given as a [start, end] pair.
dictionary :
map between the node_type and total node count for that type
""" """
return schema["nid"], schema["node_type_id_count"] ntypes_gid_range = {}
offset = 0
for k, v in node_tids.items():
ntypes_gid_range[k] = [offset + int(v[0][0]), offset + int(v[-1][1])]
offset += int(v[-1][1])
return ntypes_gid_range
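A usage sketch for get_gnid_range_map, with a hypothetical node_tids dictionary holding one (start_tid, end_tid) pair per partition file for each node type:

# Two node types, each split across two files.
node_tids = {
    "author": [(0, 50), (50, 100)],
    "paper": [(0, 30), (30, 60)],
}

gid_ranges = get_gnid_range_map(node_tids)
# gid_ranges -> {"author": [0, 100], "paper": [100, 160]}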
def write_metadata_json(metadata_list, output_dir, graph_name): def write_metadata_json(metadata_list, output_dir, graph_name):
""" """
...@@ -178,7 +184,7 @@ def write_metadata_json(metadata_list, output_dir, graph_name): ...@@ -178,7 +184,7 @@ def write_metadata_json(metadata_list, output_dir, graph_name):
with open('{}/{}.json'.format(output_dir, graph_name), 'w') as outfile: with open('{}/{}.json'.format(output_dir, graph_name), 'w') as outfile:
json.dump(graph_metadata, outfile, sort_keys=True, indent=4) json.dump(graph_metadata, outfile, sort_keys=True, indent=4)
def augment_edge_data(edge_data, part_ids, id_offset): def augment_edge_data(edge_data, part_ids, edge_tids, rank, world_size):
""" """
Add partition-id (rank which owns an edge) column to the edge_data. Add partition-id (rank which owns an edge) column to the edge_data.
...@@ -190,61 +196,27 @@ def augment_edge_data(edge_data, part_ids, id_offset): ...@@ -190,61 +196,27 @@ def augment_edge_data(edge_data, part_ids, id_offset):
array of part_ids indexed by global_nid array of part_ids indexed by global_nid
""" """
#add global_eids to the edge_data #add global_eids to the edge_data
global_eids = np.arange(id_offset, id_offset + len(edge_data[constants.GLOBAL_TYPE_EID]), dtype=np.int64) etype_offset = {}
offset = 0
for etype_name, tid_range in edge_tids.items():
assert int(tid_range[0][0]) == 0
assert len(tid_range) == world_size
etype_offset[etype_name] = offset + int(tid_range[0][0])
offset += int(tid_range[-1][1])
global_eids = []
for etype_name, tid_range in edge_tids.items():
global_eid_start = etype_offset[etype_name]
begin = global_eid_start + int(tid_range[rank][0])
end = global_eid_start + int(tid_range[rank][1])
global_eids.append(np.arange(begin, end, dtype=np.int64))
global_eids = np.concatenate(global_eids)
assert global_eids.shape[0] == edge_data[constants.ETYPE_ID].shape[0]
edge_data[constants.GLOBAL_EID] = global_eids edge_data[constants.GLOBAL_EID] = global_eids
#assign the owner process/rank for each edge
edge_data[constants.OWNER_PROCESS] = part_ids[edge_data[constants.GLOBAL_DST_ID]] edge_data[constants.OWNER_PROCESS] = part_ids[edge_data[constants.GLOBAL_DST_ID]]
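A worked sketch of the global edge id assignment above, assuming two edge types split across two ranks; the edge type names and tid ranges are hypothetical.

import numpy as np

world_size, rank = 2, 1
# Per edge type: one (start_tid, end_tid) range per rank's file.
edge_tids = {"cites": [(0, 100), (100, 250)], "writes": [(0, 40), (40, 90)]}

# Offset of each edge type in the global edge id space.
etype_offset, offset = {}, 0
for etype_name, tid_range in edge_tids.items():
    etype_offset[etype_name] = offset
    offset += int(tid_range[-1][1])
# etype_offset -> {"cites": 0, "writes": 250}

# Global eids owned by this rank: its tid range of each type, shifted by that type's offset.
global_eids = np.concatenate([
    np.arange(etype_offset[e] + int(r[rank][0]),
              etype_offset[e] + int(r[rank][1]), dtype=np.int64)
    for e, r in edge_tids.items()])
# rank 1 owns eids 100..249 ("cites") and 290..339 ("writes")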
def augment_node_data(node_data, part_ids, offset):
"""
Utility function to add auxilary columns to the node_data numpy ndarray.
Parameters:
-----------
node_data : dictionary
Node information as read from xxx_nodes.txt file and a dictionary is built using this data
using keys as column names and values as column data from the csv txt file.
part_ids : numpy array
array of part_ids indexed by global_nid
"""
#add global_nids to the node_data
global_nids = np.arange(offset, offset + len(node_data[constants.GLOBAL_TYPE_NID]), dtype=np.int64)
node_data[constants.GLOBAL_NID] = global_nids
#add owner proc_ids to the node_data
proc_ids = part_ids[node_data[constants.GLOBAL_NID]]
node_data[constants.OWNER_PROCESS] = proc_ids
def read_nodes_file(nodes_file):
"""
Utility function to read xxx_nodes.txt file
Parameters:
-----------
nodesfile : string
Graph file for nodes in the input graph
Returns:
--------
dictionary
Nodes data stored in dictionary where keys are column names
and values are the columns from the numpy ndarray as read from the
xxx_nodes.txt file
"""
if nodes_file == "" or nodes_file == None:
return None
# Read the file from here.
# Assuming the nodes file is a numpy file
# nodes.txt file is of the following format
# <node_type> <weight1> <weight2> <weight3> <weight4> <global_type_nid> <attributes>
# For the ogb-mag dataset, nodes.txt is of the above format.
nodes_data = np.loadtxt(nodes_file, delimiter=' ', dtype='int64')
nodes_datadict = {}
nodes_datadict[constants.NTYPE_ID] = nodes_data[:,0]
nodes_datadict[constants.GLOBAL_TYPE_NID] = nodes_data[:,5]
return nodes_datadict
def read_edges_file(edge_file, edge_data_dict): def read_edges_file(edge_file, edge_data_dict):
""" """
Utility function to read xxx_edges.txt file Utility function to read xxx_edges.txt file
...@@ -268,23 +240,13 @@ def read_edges_file(edge_file, edge_data_dict): ...@@ -268,23 +240,13 @@ def read_edges_file(edge_file, edge_data_dict):
# global_src_id -- global idx for the source node ... line # in the graph_nodes.txt # global_src_id -- global idx for the source node ... line # in the graph_nodes.txt
# global_dst_id -- global idx for the destination id node ... line # in the graph_nodes.txt # global_dst_id -- global idx for the destination id node ... line # in the graph_nodes.txt
edge_data = np.loadtxt(edge_file , delimiter=' ', dtype = 'int64') edge_data_df = csv.read_csv(edge_file, read_options=pyarrow.csv.ReadOptions(autogenerate_column_names=True),
parse_options=pyarrow.csv.ParseOptions(delimiter=' '))
if (edge_data_dict == None): edge_data_dict = {}
edge_data_dict = {} edge_data_dict[constants.GLOBAL_SRC_ID] = edge_data_df['f0'].to_numpy()
edge_data_dict[constants.GLOBAL_SRC_ID] = edge_data[:,0] edge_data_dict[constants.GLOBAL_DST_ID] = edge_data_df['f1'].to_numpy()
edge_data_dict[constants.GLOBAL_DST_ID] = edge_data[:,1] edge_data_dict[constants.GLOBAL_TYPE_EID] = edge_data_df['f2'].to_numpy()
edge_data_dict[constants.GLOBAL_TYPE_EID] = edge_data[:,2] edge_data_dict[constants.ETYPE_ID] = edge_data_df['f3'].to_numpy()
edge_data_dict[constants.ETYPE_ID] = edge_data[:,3]
else:
edge_data_dict[constants.GLOBAL_SRC_ID] = \
np.concatenate((edge_data_dict[constants.GLOBAL_SRC_ID], edge_data[:,0]))
edge_data_dict[constants.GLOBAL_DST_ID] = \
np.concatenate((edge_data_dict[constants.GLOBAL_DST_ID], edge_data[:,1]))
edge_data_dict[constants.GLOBAL_TYPE_EID] = \
np.concatenate((edge_data_dict[constants.GLOBAL_TYPE_EID], edge_data[:,2]))
edge_data_dict[constants.ETYPE_ID] = \
np.concatenate((edge_data_dict[constants.ETYPE_ID], edge_data[:,3]))
return edge_data_dict return edge_data_dict
def read_node_features_file(nodes_features_file): def read_node_features_file(nodes_features_file):
......