Unverified Commit dad3606a authored by kylasa, committed by GitHub

Support new format for multi-file support in distributed partitioning. (#4217)

* Code changes for the following

1. Generating node data at each process
2. Reading csv files using pyarrow
3. Feature-complete code.

* Fixed some typos that were causing unit test failures

1. Use the correct file name when loading edges from a file.
2. When storing node features after shuffling, use the correct key to store the global-nids of node features received after transmission.

* Code changes to address CI comments by reviewers

1. Removed some redundant code and added text to the docstrings to describe the functionality of some functions.
2. Function signatures and invocations now match with respect to their argument lists.
3. Added a detailed description of the metadata json structure so that users understand the type of information present in this file and how it is used throughout the code.

* Addressing code review comments

1. Addressed all the CI comments; among other changes, this simplifies the code related to list concatenation and enhances the docstrings of the functions changed in the process.

* Update the docstrings of two functions in response to code review comments

Removed "todo" from the docstring of the gen_nodedata function.
Added "todo" to the gen_dist_partitions function when node-id to partition-id's are read for the first time.

Removed 'num-node-weights' from the docstring of the get_dataset function and added a schema_map entry to the docstring's argument list.
parent 9948ef4d
...@@ -18,6 +18,48 @@ def create_dgl_object(graph_name, num_parts, \ ...@@ -18,6 +18,48 @@ def create_dgl_object(graph_name, num_parts, \
This function creates dgl objects for a given graph partition, as in function This function creates dgl objects for a given graph partition, as in function
arguments. arguments.
The "schema" argument is a dictionary, which contains the metadata related to node ids
and edge ids. It contains two keys: "nid" and "eid", whose value is also a dictionary
with the following structure.
1. The key-value pairs in the "nid" dictionary have the following format.
"ntype-name" is the user-assigned name of this node type, "format" describes the
format of the file contents, and "data" is a list of lists, where each list has
3 elements: file-name, start_id and end_id. The file name can be either an absolute
or a relative path, and the starting and ending ids are the type ids of the nodes
contained in that file. These type ids are later used to compute the global ids of
these nodes, which are used throughout this pipeline.
"ntype-name" : {
"format" : "csv",
"data" : [
[ <path-to-file>/ntype0-name-0.csv, start_id0, end_id0],
[ <path-to-file>/ntype0-name-1.csv, start_id1, end_id1],
...
[ <path-to-file>/ntype0-name-<p-1>.csv, start_id<p-1>, end_id<p-1>],
]
}
2. The key-value pairs in the "eid" dictionary have the following format.
The "eid" dictionary is structured like the "nid" dictionary described above,
except that its entries describe edges.
"etype-name" : {
"format" : "csv",
"data" : [
[ <path-to-file>/etype0-name-0, start_id0, end_id0],
[ <path-to-file>/etype0-name-1, start_id1, end_id1],
...
[ <path-to-file>/etype0-name-<p-1>, start_id<p-1>, end_id<p-1>]
]
}
In "nid" dictionary, the type_nids are specified that
should be assigned to nodes which are read from the corresponding nodes file.
Along the same lines dictionary for the key "eid" is used for edges in the
input graph.
These type ids, for nodes and edges, are used to compute global ids for nodes
and edges which are stored in the graph object.
Parameters: Parameters:
----------- -----------
graph_name : string graph_name : string
...@@ -39,8 +81,6 @@ def create_dgl_object(graph_name, num_parts, \ ...@@ -39,8 +81,6 @@ def create_dgl_object(graph_name, num_parts, \
edgeid_offset : int edgeid_offset : int
offset to be used when assigning edge global ids in the current partition offset to be used when assigning edge global ids in the current partition
return compact_g2, node_map_val, edge_map_val, ntypes_map, etypes_map
Returns: Returns:
-------- --------
dgl object dgl object
...@@ -54,14 +94,21 @@ def create_dgl_object(graph_name, num_parts, \ ...@@ -54,14 +94,21 @@ def create_dgl_object(graph_name, num_parts, \
dictionary dictionary
map between edge type(string) and edge_type_id(int) map between edge type(string) and edge_type_id(int)
""" """
#create auxiliary data structures from the schema object #create auxiliary data structures from the schema object
global_nid_ranges = schema['nid'] node_info = schema["nid"]
global_eid_ranges = schema['eid'] offset = 0
global_nid_ranges = {key: np.array(global_nid_ranges[key]).reshape( global_nid_ranges = {}
1, 2) for key in global_nid_ranges} for k, v in node_info.items():
global_eid_ranges = {key: np.array(global_eid_ranges[key]).reshape( global_nid_ranges[k] = np.array([offset + int(v["data"][0][1]), offset + int(v["data"][-1][2])]).reshape(1,2)
1, 2) for key in global_eid_ranges} offset += int(v["data"][-1][2])
edge_info = schema["eid"]
offset = 0
global_eid_ranges = {}
for k, v in edge_info.items():
global_eid_ranges[k] = np.array([offset + int(v["data"][0][1]), offset + int(v["data"][-1][2])]).reshape(1,2)
offset += int(v["data"][-1][2])
id_map = dgl.distributed.id_map.IdMap(global_nid_ranges) id_map = dgl.distributed.id_map.IdMap(global_nid_ranges)
ntypes = [(key, global_nid_ranges[key][0, 0]) for key in global_nid_ranges] ntypes = [(key, global_nid_ranges[key][0, 0]) for key in global_nid_ranges]
......
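As an aside for readers of this diff: the range computation introduced above can be sketched in isolation. This is a minimal, hedged illustration (not the pipeline's actual code) of how per-type global nid ranges might be derived from an "nid"-style dictionary; the node type names, file names and ids below are hypothetical. The "eid" ranges are computed the same way.

import numpy as np

# Hypothetical "nid" entries: two files per node type, each as [file-name, start_tid, end_tid].
node_info = {
    "author": {"format": "csv", "data": [["author-0.csv", 0, 50], ["author-1.csv", 50, 100]]},
    "paper":  {"format": "csv", "data": [["paper-0.csv", 0, 30], ["paper-1.csv", 30, 60]]},
}

offset = 0
global_nid_ranges = {}
for ntype, v in node_info.items():
    start = offset + int(v["data"][0][1])    # first file's starting type id
    end = offset + int(v["data"][-1][2])     # last file's ending type id
    global_nid_ranges[ntype] = np.array([start, end]).reshape(1, 2)
    offset += int(v["data"][-1][2])          # advance by this type's node count

# global_nid_ranges -> {"author": array([[0, 100]]), "paper": array([[100, 160]])}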
...@@ -15,19 +15,9 @@ def log_params(params): ...@@ -15,19 +15,9 @@ def log_params(params):
print('Graph Name: ', params.graph_name) print('Graph Name: ', params.graph_name)
print('Schema File: ', params.schema) print('Schema File: ', params.schema)
print('No. partitions: ', params.num_parts) print('No. partitions: ', params.num_parts)
print('No. node weights: ', params.num_node_weights)
print('Workspace dir: ', params.workspace)
print('Node Attr Type: ', params.node_attr_dtype)
print('Edge Attr Dtype: ', params.edge_attr_dtype)
print('Output Dir: ', params.output) print('Output Dir: ', params.output)
print('Removed Edges File: ', params.removed_edges)
print('WorldSize: ', params.world_size) print('WorldSize: ', params.world_size)
print('Nodes File: ', params.nodes_file)
print('Edges File: ', params.edges_file)
print('Node feats: ', params.node_feats_file)
print('Edge feats: ', params.edge_feats_file)
print('Metis partitions: ', params.partitions_file) print('Metis partitions: ', params.partitions_file)
print('Exec Type: ', params.exec_type)
if __name__ == "__main__": if __name__ == "__main__":
""" """
...@@ -44,38 +34,15 @@ if __name__ == "__main__": ...@@ -44,38 +34,15 @@ if __name__ == "__main__":
help='The schema of the graph') help='The schema of the graph')
parser.add_argument('--num-parts', required=True, type=int, parser.add_argument('--num-parts', required=True, type=int,
help='The number of partitions') help='The number of partitions')
parser.add_argument('--num-node-weights', required=True, type=int,
help='The number of node weights used by METIS.')
parser.add_argument('--workspace', type=str, default='/tmp',
help='The directory to store the intermediate results')
parser.add_argument('--node-attr-dtype', type=str, default=None,
help='The data type of the node attributes')
parser.add_argument('--edge-attr-dtype', type=str, default=None,
help='The data type of the edge attributes')
parser.add_argument('--output', required=True, type=str, parser.add_argument('--output', required=True, type=str,
help='The output directory of the partitioned results') help='The output directory of the partitioned results')
parser.add_argument('--removed-edges', help='a file that contains the removed self-loops and duplicated edges',
default=None, type=str)
parser.add_argument('--exec-type', type=int, default=0,
help='Use 0 for single machine run and 1 for distributed execution')
#arguments needed for the distributed implementation #arguments needed for the distributed implementation
parser.add_argument('--world-size', help='no. of processes to spawn', parser.add_argument('--world-size', help='no. of processes to spawn',
default=1, type=int, required=True) default=1, type=int, required=True)
parser.add_argument('--nodes-file', help='filename of the nodes metadata',
default=None, type=str, required=True)
parser.add_argument('--edges-file', help='filename of the nodes metadata',
default=None, type=str, required=True)
parser.add_argument('--node-feats-file', help='filename of the nodes features',
default=None, type=str, required=True)
parser.add_argument('--edge-feats-file', help='filename of the nodes metadata',
default=None, type=str )
parser.add_argument('--partitions-file', help='filename of the output of dgl_part2 (metis partitions)', parser.add_argument('--partitions-file', help='filename of the output of dgl_part2 (metis partitions)',
default=None, type=str) default=None, type=str)
params = parser.parse_args() params = parser.parse_args()
#invoke the starting function here. #invoke the pipeline function
if(params.exec_type == 0): multi_machine_run(params)
single_machine_run(params)
else:
multi_machine_run(params)
...@@ -3,7 +3,10 @@ import numpy as np ...@@ -3,7 +3,10 @@ import numpy as np
import constants import constants
import torch import torch
def get_dataset(input_dir, graph_name, rank, num_node_weights): import pyarrow
from pyarrow import csv
def get_dataset(input_dir, graph_name, rank, world_size, schema_map):
""" """
Function to read the multiple file formatted dataset. Function to read the multiple file formatted dataset.
...@@ -15,64 +18,91 @@ def get_dataset(input_dir, graph_name, rank, num_node_weights): ...@@ -15,64 +18,91 @@ def get_dataset(input_dir, graph_name, rank, num_node_weights):
graph name string graph name string
rank : int rank : int
rank of the current process rank of the current process
num_node_weights : int world_size : int
integer indicating the no. of weights each node is attributed with total number of processes in the current execution
schema_map : dictionary
this is the dictionary created by reading the graph metadata json file
for the input graph dataset
Return: Return:
------- -------
dictionary dictionary
Data read from nodes.txt file and used to build a dictionary with keys as column names where keys are node-type names and values are tuples. Each tuple represents the
and values as columns in the csv file. range of type ids read from a file by the current process. Please note that node
data for each node type is split into "p" files, and each of these "p" files is
read by a process in the distributed graph partitioning pipeline
dictionary dictionary
Data read from numpy files for all the node features in this dataset. Dictionary built Data read from numpy files for all the node features in this dataset. Dictionary built
using this data has keys as node feature names and values as tensor data representing using this data has keys as node feature names and values as tensor data representing
node features node features
dictionary
in which keys are node-type names and values are triplets. Each triplet has a node-feature name
and the range of tids for the node-feature data read from files by the current process. Each
node-type may have multiple features and associated tensor data.
dictionary dictionary
Data read from edges.txt file and used to build a dictionary with keys as column names Data read from edges.txt file and used to build a dictionary with keys as column names
and values as columns in the csv file. and values as columns in the csv file.
dictionary
in which keys are edge-type names and values are triplets. Each triplet has an edge-feature name
and the range of tids for the edge-feature data read from files by the current process. Each
edge-type may have several edge features and associated tensor data.
""" """
#node features dictionary #node features dictionary
node_features = {} node_features = {}
node_feature_tids = {}
#iterate over the sub-dirs and extract the nodetypes #iterate over the "node_data" dictionary in the schema_map
#in each nodetype folder read all the features assigned to #read the node features if exists
#current rank #also keep track of the type_nids for which the node_features are read.
siblings = os.listdir(input_dir) dataset_features = schema_map["node_data"]
for s in siblings: for ntype_name, ntype_feature_data in dataset_features.items():
if s.startswith("nodes-"): #ntype_feature_data is a dictionary
tokens = s.split("-") #where key: feature_name, value: list of lists
ntype = tokens[1] node_feature_tids[ntype_name] = []
num_feats = tokens[2] for feat_name, feat_data in ntype_feature_data.items():
for idx in range(int(num_feats)): assert len(feat_data) == world_size
feat_file = s +'/node-feat-'+'{:02d}'.format(idx) +'/'+ str(rank)+'.npy' my_feat_data = feat_data[rank]
if (os.path.exists(input_dir+'/'+feat_file)): if (os.path.isabs(my_feat_data[0])):
features = np.load(input_dir+'/'+feat_file) node_features[ntype_name+'/'+feat_name] = torch.from_numpy(np.load(my_feat_data[0]))
node_features[ntype+'/feat'] = torch.tensor(features) else:
node_features[ntype_name+'/'+feat_name] = torch.from_numpy(np.load(input_dir+my_feat_data[0]))
#done build node_features locally. node_feature_tids[ntype_name].append([feat_name, my_feat_data[1], my_feat_data[2]])
if len(node_features) <= 0:
print('[Rank: ', rank, '] This dataset does not have any node features')
else:
for k, v in node_features.items():
print('[Rank: ', rank, '] node feature name: ', k, ', feature data shape: ', v.size())
#read (split) xxx_nodes.txt file #read my nodes for each node type
node_file = input_dir+'/'+graph_name+'_nodes'+'_{:02d}.txt'.format(rank) node_tids = {}
node_data = np.loadtxt(node_file, delimiter=' ', dtype='int64') node_data = schema_map["nid"]
nodes_datadict = {} for ntype_name, ntype_info in node_data.items():
nodes_datadict[constants.NTYPE_ID] = node_data[:,0] v = []
type_idx = 0 + num_node_weights + 1 node_file_info = ntype_info["data"]
nodes_datadict[constants.GLOBAL_TYPE_NID] = node_data[:,type_idx] for idx in range(len(node_file_info)):
print('[Rank: ', rank, '] Done reading node_data: ', len(nodes_datadict), nodes_datadict[constants.NTYPE_ID].shape) v.append((node_file_info[idx][1], node_file_info[idx][2]))
node_tids[ntype_name] = v
#read (split) xxx_edges.txt file #read my edges for each edge type
edge_tids = {}
edge_datadict = {} edge_datadict = {}
edge_file = input_dir+'/'+graph_name+'_edges'+'_{:02d}.txt'.format(rank) edge_data = schema_map["eid"]
edge_data = np.loadtxt(edge_file, delimiter=' ', dtype='int64') for etype_name, etype_info in edge_data.items():
edge_datadict[constants.GLOBAL_SRC_ID] = edge_data[:,0] assert etype_info["format"] == "csv"
edge_datadict[constants.GLOBAL_DST_ID] = edge_data[:,1]
edge_datadict[constants.GLOBAL_TYPE_EID] = edge_data[:,2] edge_info = etype_info["data"]
edge_datadict[constants.ETYPE_ID] = edge_data[:,3] assert len(edge_info) == world_size
data_df = csv.read_csv(edge_info[rank][0], read_options=pyarrow.csv.ReadOptions(autogenerate_column_names=True),
parse_options=pyarrow.csv.ParseOptions(delimiter=' '))
edge_datadict[constants.GLOBAL_SRC_ID] = data_df['f0'].to_numpy()
edge_datadict[constants.GLOBAL_DST_ID] = data_df['f1'].to_numpy()
edge_datadict[constants.GLOBAL_TYPE_EID] = data_df['f2'].to_numpy()
edge_datadict[constants.ETYPE_ID] = data_df['f3'].to_numpy()
v = []
edge_file_info = etype_info["data"]
for idx in range(len(edge_file_info)):
v.append((edge_file_info[idx][1], edge_file_info[idx][2]))
edge_tids[etype_name] = v
print('[Rank: ', rank, '] Done reading edge_file: ', len(edge_datadict), edge_datadict[constants.GLOBAL_SRC_ID].shape) print('[Rank: ', rank, '] Done reading edge_file: ', len(edge_datadict), edge_datadict[constants.GLOBAL_SRC_ID].shape)
return nodes_datadict, node_features, edge_datadict return node_tids, node_features, node_feature_tids, edge_datadict, edge_tids
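For context, get_dataset now parses edge files with pyarrow instead of np.loadtxt. Below is a minimal sketch of that read path, assuming a headerless, space-delimited file whose four integer columns are source id, destination id, per-type edge id and edge type id; the file name is hypothetical.

import pyarrow
from pyarrow import csv

# Columns are auto-named f0, f1, ... because the file has no header row.
table = csv.read_csv(
    "edges-rank0.csv",  # hypothetical path
    read_options=pyarrow.csv.ReadOptions(autogenerate_column_names=True),
    parse_options=pyarrow.csv.ParseOptions(delimiter=' '))

src_ids = table['f0'].to_numpy()    # global source node ids
dst_ids = table['f1'].to_numpy()    # global destination node ids
type_eids = table['f2'].to_numpy()  # per-type edge ids
etype_ids = table['f3'].to_numpy()  # edge type ids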
...@@ -45,7 +45,7 @@ def get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data): ...@@ -45,7 +45,7 @@ def get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data):
#allocate buffers to receive node-ids #allocate buffers to receive node-ids
recv_nodes = [] recv_nodes = []
for i in recv_counts: for i in recv_counts:
recv_nodes.append(torch.zeros([i.item()], dtype=torch.int64)) recv_nodes.append(torch.zeros(i.tolist(), dtype=torch.int64))
#form the outgoing message #form the outgoing message
send_nodes = [] send_nodes = []
...@@ -67,17 +67,18 @@ def get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data): ...@@ -67,17 +67,18 @@ def get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data):
global_nids = proc_i_nodes.numpy() global_nids = proc_i_nodes.numpy()
if (len(global_nids) != 0): if (len(global_nids) != 0):
common, ind1, ind2 = np.intersect1d(node_data[constants.GLOBAL_NID], global_nids, return_indices=True) common, ind1, ind2 = np.intersect1d(node_data[constants.GLOBAL_NID], global_nids, return_indices=True)
values = node_data[constants.SHUFFLE_GLOBAL_NID][ind1] shuffle_global_nids = node_data[constants.SHUFFLE_GLOBAL_NID][ind1]
send_nodes.append(torch.Tensor(values).type(dtype=torch.int64)) send_nodes.append(torch.from_numpy(shuffle_global_nids).type(dtype=torch.int64))
else: else:
send_nodes.append(torch.empty((0,), dtype=torch.int64)) send_nodes.append(torch.empty((0), dtype=torch.int64))
#send receive global-ids #send receive global-ids
alltoallv_cpu(rank, world_size, recv_shuffle_global_nids, send_nodes) alltoallv_cpu(rank, world_size, recv_shuffle_global_nids, send_nodes)
shuffle_global_nids = [x.numpy() for x in recv_shuffle_global_nids] shuffle_global_nids = np.concatenate([x.numpy() for x in recv_shuffle_global_nids])
global_nids = [x for x in global_nids_ranks] global_nids = np.concatenate([x for x in global_nids_ranks])
return np.column_stack((np.concatenate(global_nids), np.concatenate(shuffle_global_nids))) ret_val = np.column_stack([global_nids, shuffle_global_nids])
return ret_val
def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, node_data): def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, node_data):
...@@ -122,7 +123,7 @@ def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, no ...@@ -122,7 +123,7 @@ def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, no
global_nids_ranks.append(not_owned_nodes) global_nids_ranks.append(not_owned_nodes)
#Retrieve Global-ids for respective node owners #Retrieve Global-ids for respective node owners
resolved_global_nids = get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data) non_local_nids = get_shuffle_global_nids(rank, world_size, global_nids_ranks, node_data)
#Add global_nid <-> shuffle_global_nid mappings to the received data #Add global_nid <-> shuffle_global_nid mappings to the received data
for i in range(world_size): for i in range(world_size):
...@@ -132,7 +133,7 @@ def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, no ...@@ -132,7 +133,7 @@ def get_shuffle_global_nids_edges(rank, world_size, edge_data, node_part_ids, no
common, ind1, ind2 = np.intersect1d(node_data[constants.GLOBAL_NID], own_global_nids, return_indices=True) common, ind1, ind2 = np.intersect1d(node_data[constants.GLOBAL_NID], own_global_nids, return_indices=True)
my_shuffle_global_nids = node_data[constants.SHUFFLE_GLOBAL_NID][ind1] my_shuffle_global_nids = node_data[constants.SHUFFLE_GLOBAL_NID][ind1]
local_mappings = np.column_stack((own_global_nids, my_shuffle_global_nids)) local_mappings = np.column_stack((own_global_nids, my_shuffle_global_nids))
resolved_global_nids = np.concatenate((resolved_global_nids, local_mappings)) resolved_global_nids = np.concatenate((non_local_nids, local_mappings))
#form a dictionary of mappings between orig-node-ids and global-ids #form a dictionary of mappings between orig-node-ids and global-ids
resolved_mappings = dict(zip(resolved_global_nids[:,0], resolved_global_nids[:,1])) resolved_mappings = dict(zip(resolved_global_nids[:,0], resolved_global_nids[:,1]))
......
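The shuffle-global-nid lookup in this hunk relies on np.intersect1d with return_indices=True. A small standalone sketch with made-up ids (note that intersect1d returns the matches in sorted order of the common values):

import numpy as np

# Nodes owned locally: global nids and their shuffled (post-partition) ids.
local_global_nids = np.array([10, 11, 12, 13])
local_shuffle_nids = np.array([200, 201, 202, 203])

# Global nids requested by another rank.
requested = np.array([12, 10])

common, ind1, ind2 = np.intersect1d(local_global_nids, requested, return_indices=True)
reply = local_shuffle_nids[ind1]
# common -> [10, 12]; ind1 -> [0, 2]; reply -> [200, 202]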
...@@ -5,6 +5,9 @@ import json ...@@ -5,6 +5,9 @@ import json
import dgl import dgl
import constants import constants
import pyarrow
from pyarrow import csv
def read_partitions_file(part_file): def read_partitions_file(part_file):
""" """
Utility method to read metis partitions, which is the output of Utility method to read metis partitions, which is the output of
...@@ -47,6 +50,34 @@ def read_json(json_file): ...@@ -47,6 +50,34 @@ def read_json(json_file):
return val return val
def get_ntype_featnames(ntype_name, schema):
"""
Retrieves node feature names for a given node_type
Parameters:
-----------
ntype_name : string
a string specifying a node_type name
schema : dictionary
metadata json object as a dictionary, which is read from the input
metadata file from the input dataset
Returns:
--------
list :
a list of feature names for a given node_type
"""
ntype_dict = schema["node_data"]
if (ntype_name in ntype_dict):
featnames = []
ntype_info = ntype_dict[ntype_name]
for k, v in ntype_info.items():
featnames.append(k)
return featnames
else:
return []
def get_node_types(schema): def get_node_types(schema):
""" """
Utility method to extract node_typename -> node_type mappings Utility method to extract node_typename -> node_type mappings
...@@ -60,72 +91,47 @@ def get_node_types(schema): ...@@ -60,72 +91,47 @@ def get_node_types(schema):
Returns: Returns:
-------- --------
dictionary, list dictionary
dictionary with ntype <-> type_nid mappings with keys as node type names and values as ids (integers)
list of ntype strings list
""" list of ntype name strings
#Get the node_id ranges from the schema dictionary
global_nid_ranges = schema['nid'] with keys as ntype ids (integers) and values as node type names
global_nid_ranges = {key: np.array(global_nid_ranges[key]).reshape(1,2)
for key in global_nid_ranges}
#Create an array with the starting id for each node_type and sort
ntypes = [(key, global_nid_ranges[key][0,0]) for key in global_nid_ranges]
ntypes.sort(key=lambda e: e[1])
#Create node_typename -> node_type dictionary
ntypes = [e[0] for e in ntypes]
ntypes_map = {e: i for i, e in enumerate(ntypes)}
return ntypes_map, ntypes
def get_edge_types(schema):
"""
Utility function to form edges dictionary between edge_type names and ids
Parameters
----------
schema : dictionary
Input schema from which the edge_typename -> edge_type
dictionary is defined
Returns
-------
dictionary:
a map between edgetype_names and ids
list:
list of edgetype_names
""" """
ntype_info = schema["nid"]
global_eid_ranges = schema['eid'] ntypes = []
global_eid_ranges = {key: np.array(global_eid_ranges[key]).reshape(1,2) for k in ntype_info.keys():
for key in global_eid_ranges} ntypes.append(k)
etypes = [(key, global_eid_ranges[key][0, 0]) for key in global_eid_ranges] ntype_ntypeid_map = {e: i for i, e in enumerate(ntypes)}
etypes.sort(key=lambda e: e[1]) ntypeid_ntype_map = {str(i): e for i, e in enumerate(ntypes)}
return ntype_ntypeid_map, ntypes, ntypeid_ntype_map
etypes = [e[0] for e in etypes]
etypes_map = {e: i for i, e in enumerate(etypes)} def get_gnid_range_map(node_tids):
return etypes_map, etypes
def get_ntypes_map(schema):
""" """
Utility function to return nodes global id range from the input schema Retrieves auxiliary dictionaries from the metadata json object
as well as node count per each node type
Parameters: Parameters:
----------- -----------
schema : dictionary node_tids: dictionary
Input schema where the requested dictionaries are defined This dictionary contains the information about nodes for each node_type.
Typically this information contains p entries, where each entry has a file name and the
starting and ending type_node_ids for the nodes in that file. Keys in this dictionary
are node_types and values are lists of lists; each individual entry in such a list has
three items: file-name, starting type_nid and ending type_nid.
Returns: Returns:
-------- --------
dictionary : dictionary :
map between the node_types and global_id ranges for each node_type a dictionary where keys are node_type names and values are the global_nid range, given as a [start, end] pair.
dictionary :
map between the node_type and total node count for that type
""" """
return schema["nid"], schema["node_type_id_count"] ntypes_gid_range = {}
offset = 0
for k, v in node_tids.items():
ntypes_gid_range[k] = [offset + int(v[0][0]), offset + int(v[-1][1])]
offset += int(v[-1][1])
return ntypes_gid_range
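A usage sketch for get_gnid_range_map, with a hypothetical node_tids dictionary holding one (start_tid, end_tid) pair per partition file for each node type:

# Two node types, each split across two files.
node_tids = {
    "author": [(0, 50), (50, 100)],
    "paper": [(0, 30), (30, 60)],
}

gid_ranges = get_gnid_range_map(node_tids)
# gid_ranges -> {"author": [0, 100], "paper": [100, 160]}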
def write_metadata_json(metadata_list, output_dir, graph_name): def write_metadata_json(metadata_list, output_dir, graph_name):
""" """
...@@ -178,7 +184,7 @@ def write_metadata_json(metadata_list, output_dir, graph_name): ...@@ -178,7 +184,7 @@ def write_metadata_json(metadata_list, output_dir, graph_name):
with open('{}/{}.json'.format(output_dir, graph_name), 'w') as outfile: with open('{}/{}.json'.format(output_dir, graph_name), 'w') as outfile:
json.dump(graph_metadata, outfile, sort_keys=True, indent=4) json.dump(graph_metadata, outfile, sort_keys=True, indent=4)
def augment_edge_data(edge_data, part_ids, id_offset): def augment_edge_data(edge_data, part_ids, edge_tids, rank, world_size):
""" """
Add partition-id (rank which owns an edge) column to the edge_data. Add partition-id (rank which owns an edge) column to the edge_data.
...@@ -190,61 +196,27 @@ def augment_edge_data(edge_data, part_ids, id_offset): ...@@ -190,61 +196,27 @@ def augment_edge_data(edge_data, part_ids, id_offset):
array of part_ids indexed by global_nid array of part_ids indexed by global_nid
""" """
#add global_eids to the edge_data #add global_eids to the edge_data
global_eids = np.arange(id_offset, id_offset + len(edge_data[constants.GLOBAL_TYPE_EID]), dtype=np.int64) etype_offset = {}
offset = 0
for etype_name, tid_range in edge_tids.items():
assert int(tid_range[0][0]) == 0
assert len(tid_range) == world_size
etype_offset[etype_name] = offset + int(tid_range[0][0])
offset += int(tid_range[-1][1])
global_eids = []
for etype_name, tid_range in edge_tids.items():
global_eid_start = etype_offset[etype_name]
begin = global_eid_start + int(tid_range[rank][0])
end = global_eid_start + int(tid_range[rank][1])
global_eids.append(np.arange(begin, end, dtype=np.int64))
global_eids = np.concatenate(global_eids)
assert global_eids.shape[0] == edge_data[constants.ETYPE_ID].shape[0]
edge_data[constants.GLOBAL_EID] = global_eids edge_data[constants.GLOBAL_EID] = global_eids
#assign the owner process/rank for each edge
edge_data[constants.OWNER_PROCESS] = part_ids[edge_data[constants.GLOBAL_DST_ID]] edge_data[constants.OWNER_PROCESS] = part_ids[edge_data[constants.GLOBAL_DST_ID]]
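A worked sketch of the global edge id assignment above, assuming two edge types split across two ranks; the edge type names and tid ranges are hypothetical.

import numpy as np

world_size, rank = 2, 1
# Per edge type: one (start_tid, end_tid) range per rank's file.
edge_tids = {"cites": [(0, 100), (100, 250)], "writes": [(0, 40), (40, 90)]}

# Offset of each edge type in the global edge id space.
etype_offset, offset = {}, 0
for etype_name, tid_range in edge_tids.items():
    etype_offset[etype_name] = offset
    offset += int(tid_range[-1][1])
# etype_offset -> {"cites": 0, "writes": 250}

# Global eids owned by this rank: its tid range of each type, shifted by that type's offset.
global_eids = np.concatenate([
    np.arange(etype_offset[e] + int(r[rank][0]),
              etype_offset[e] + int(r[rank][1]), dtype=np.int64)
    for e, r in edge_tids.items()])
# rank 1 owns eids 100..249 ("cites") and 290..339 ("writes")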
def augment_node_data(node_data, part_ids, offset):
"""
Utility function to add auxilary columns to the node_data numpy ndarray.
Parameters:
-----------
node_data : dictionary
Node information as read from xxx_nodes.txt file and a dictionary is built using this data
using keys as column names and values as column data from the csv txt file.
part_ids : numpy array
array of part_ids indexed by global_nid
"""
#add global_nids to the node_data
global_nids = np.arange(offset, offset + len(node_data[constants.GLOBAL_TYPE_NID]), dtype=np.int64)
node_data[constants.GLOBAL_NID] = global_nids
#add owner proc_ids to the node_data
proc_ids = part_ids[node_data[constants.GLOBAL_NID]]
node_data[constants.OWNER_PROCESS] = proc_ids
def read_nodes_file(nodes_file):
"""
Utility function to read xxx_nodes.txt file
Parameters:
-----------
nodesfile : string
Graph file for nodes in the input graph
Returns:
--------
dictionary
Nodes data stored in dictionary where keys are column names
and values are the columns from the numpy ndarray as read from the
xxx_nodes.txt file
"""
if nodes_file == "" or nodes_file == None:
return None
# Read the file from here.
# Assuming the nodes file is a numpy file
# nodes.txt file is of the following format
# <node_type> <weight1> <weight2> <weight3> <weight4> <global_type_nid> <attributes>
# For the ogb-mag dataset, nodes.txt is of the above format.
nodes_data = np.loadtxt(nodes_file, delimiter=' ', dtype='int64')
nodes_datadict = {}
nodes_datadict[constants.NTYPE_ID] = nodes_data[:,0]
nodes_datadict[constants.GLOBAL_TYPE_NID] = nodes_data[:,5]
return nodes_datadict
def read_edges_file(edge_file, edge_data_dict): def read_edges_file(edge_file, edge_data_dict):
""" """
Utility function to read xxx_edges.txt file Utility function to read xxx_edges.txt file
...@@ -268,23 +240,13 @@ def read_edges_file(edge_file, edge_data_dict): ...@@ -268,23 +240,13 @@ def read_edges_file(edge_file, edge_data_dict):
# global_src_id -- global idx for the source node ... line # in the graph_nodes.txt # global_src_id -- global idx for the source node ... line # in the graph_nodes.txt
# global_dst_id -- global idx for the destination id node ... line # in the graph_nodes.txt # global_dst_id -- global idx for the destination id node ... line # in the graph_nodes.txt
edge_data = np.loadtxt(edge_file , delimiter=' ', dtype = 'int64') edge_data_df = csv.read_csv(edge_file, read_options=pyarrow.csv.ReadOptions(autogenerate_column_names=True),
parse_options=pyarrow.csv.ParseOptions(delimiter=' '))
if (edge_data_dict == None): edge_data_dict = {}
edge_data_dict = {} edge_data_dict[constants.GLOBAL_SRC_ID] = edge_data_df['f0'].to_numpy()
edge_data_dict[constants.GLOBAL_SRC_ID] = edge_data[:,0] edge_data_dict[constants.GLOBAL_DST_ID] = edge_data_df['f1'].to_numpy()
edge_data_dict[constants.GLOBAL_DST_ID] = edge_data[:,1] edge_data_dict[constants.GLOBAL_TYPE_EID] = edge_data_df['f2'].to_numpy()
edge_data_dict[constants.GLOBAL_TYPE_EID] = edge_data[:,2] edge_data_dict[constants.ETYPE_ID] = edge_data_df['f3'].to_numpy()
edge_data_dict[constants.ETYPE_ID] = edge_data[:,3]
else:
edge_data_dict[constants.GLOBAL_SRC_ID] = \
np.concatenate((edge_data_dict[constants.GLOBAL_SRC_ID], edge_data[:,0]))
edge_data_dict[constants.GLOBAL_DST_ID] = \
np.concatenate((edge_data_dict[constants.GLOBAL_DST_ID], edge_data[:,1]))
edge_data_dict[constants.GLOBAL_TYPE_EID] = \
np.concatenate((edge_data_dict[constants.GLOBAL_TYPE_EID], edge_data[:,2]))
edge_data_dict[constants.ETYPE_ID] = \
np.concatenate((edge_data_dict[constants.ETYPE_ID], edge_data[:,3]))
return edge_data_dict return edge_data_dict
def read_node_features_file(nodes_features_file): def read_node_features_file(nodes_features_file):
......