Commits · f51b31b2456902bd59ca072675f5fa0ddd37521f · OpenDAS / dgl

17 Aug, 2022 1 commit

Distributed Lookup service implementation to retrieve node-level mappings (#4387) · f51b31b2

kylasa authored Aug 17, 2022

* Distributed lookup service for retrieving global_nids to shuffle-global-nids/partition-id mappings

1. Implemented a class to provide distributed lookup service
2. This class can be used to retrieve global-nids mappings
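
As a rough illustration of the idea behind such a lookup service, the sketch below simulates it in a single process with NumPy. It assumes (the commit message does not confirm this) that ownership of the mappings is split into contiguous ranges of global nids across ranks, and that each owner stores the partition-id for its range. The class name `ToyLookupService` and all of its methods are hypothetical and are not taken from the DGL code.

```python
import numpy as np

class ToyLookupService:
    """Single-process stand-in for a distributed global-nid lookup (hypothetical)."""

    def __init__(self, partition_ids, world_size):
        # partition_ids[g] is the partition that owns global nid g.
        partition_ids = np.asarray(partition_ids, dtype=np.int64)
        self.world_size = world_size
        # Assumption: rank r stores the mappings for one contiguous range of nids.
        self.range_size = -(-len(partition_ids) // world_size)  # ceil division
        self.shards = [
            partition_ids[r * self.range_size:(r + 1) * self.range_size]
            for r in range(world_size)
        ]

    def owner_rank(self, global_nids):
        # Rank that stores the mapping for each queried global nid.
        return np.asarray(global_nids, dtype=np.int64) // self.range_size

    def get_partition_ids(self, global_nids):
        global_nids = np.asarray(global_nids, dtype=np.int64)
        owners = self.owner_rank(global_nids)
        result = np.empty(len(global_nids), dtype=np.int64)
        # In a real distributed run, each iteration would be a message exchange.
        for rank in range(self.world_size):
            mask = owners == rank
            local_index = global_nids[mask] - rank * self.range_size
            result[mask] = self.shards[rank][local_index]
        return result

service = ToyLookupService(partition_ids=[0, 1, 1, 0, 2, 2, 1], world_size=3)
looked_up = service.get_partition_ids([6, 0, 4, 2])
```

The same routing pattern would apply to shuffle-global-nids: the owner of a range answers queries for the nids in that range, so no single rank has to hold the full mapping.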

* Code changes to address CI comments.

1. Removed some unneeded type_casts to numpy.int64
2. Added additional comments when iterating over the partition-ids list.
3. Added docstring to the class and adjusted comments where relevant.

* Updated code comments and variable names...

1. Changed the variable names to appropriately represent the values stored in these variables.
2. Updated the docstring correctly.

* Corrected docstring as per the suggestion... and removed all the capital letters for Global nids and Shuffle Global nids...

* Addressing CI review comments.

06 Aug, 2022 1 commit

[Distributed] use alltoall fix to bypass gloo - alltoallv bug in distributed partitioning (#4311) · c1e01b1d

kylasa authored Aug 05, 2022

* Alltoall Fix to bypass gloo - alltoallv bug which is preventing further testing

1. Replaced alltoallv gloo wrapper call with alltoall message.
2. All messages are padded to the same length.
3. Receiving side unpads the messages and continues processing.
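
The padding scheme described above can be sketched in plain NumPy, without any communication backend. This is a single-process simulation under stated assumptions: the helper names `pad_messages` and `unpad_messages` are hypothetical, the true message lengths are assumed to be known to the receiver (for example, exchanged separately), and zero is used as the pad value.

```python
import numpy as np

def pad_messages(messages, padded_len):
    """Pad each variable-length 1-D message to padded_len (hypothetical helper)."""
    out = np.zeros((len(messages), padded_len), dtype=np.int64)
    for i, msg in enumerate(messages):
        out[i, :len(msg)] = msg
    return out

def unpad_messages(padded, lengths):
    """Drop the padding using the true lengths known to the receiver."""
    return [row[:n] for row, n in zip(padded, lengths)]

# send[src][dst] is the message that rank src sends to rank dst.
send = [
    [np.array([1, 2, 3]), np.array([4])],
    [np.array([], dtype=np.int64), np.array([5, 6])],
]
world_size = len(send)
lengths = np.array([[len(m) for m in row] for row in send])
max_len = int(lengths.max())  # in practice this would be agreed on collectively

padded_send = np.stack([pad_messages(row, max_len) for row in send])
# With equal-sized messages, an alltoall amounts to a transpose over (src, dst).
padded_recv = padded_send.transpose(1, 0, 2)
recv = [
    unpad_messages(padded_recv[dst], lengths[:, dst])
    for dst in range(world_size)
]
```

The trade-off is extra traffic: every message grows to the size of the largest one, in exchange for only needing a fixed-size alltoall.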

* Code changes to address CI comments

1. Removed unused functions from gloo_wrapper.py
2. Changed the function signature of alltoallv_cpu_data as suggested.
3. Added docstring to include more description of the functionality inside alltoallv_cpu_data. Included more asserts to validate the assumptions.

* Changed the function name appropriately

Changed the function name from "alltoallv_cpu_data" to "alltoallv_cpu", which I believe is appropriate because the underlying functionality provides alltoallv, which is basically alltoall_cpu + padding.

* Added code and text to address the review comments.

1. Changed the function name to indicate the local use of this function.
2. Changed docstring to indicate the assumptions made by alltoallv_cpu function.

* Removed unused function from import statement

Removed a function that no longer exists from the import statement.

13 Jul, 2022 1 commit

Support new format for multi-file support in distributed partitioning. (#4217) · dad3606a

kylasa authored Jul 12, 2022

* Code changes for the following

1. Generating node data at each process
2. Reading csv files using pyarrow
3. Feature-complete code.

* Fixed some typos that were causing unit tests to fail

1. Changed the file name to the correct one when loading edges from file.
2. When storing node features after shuffling, use the correct key to store the global-nids of the node features received after transmission.

* Code changes to address CI comments by reviewers

1. Removed some redundant code and added text in the doc-strings to describe the functionality of some functions.
2. Function signatures and invocations now match w.r.t. the argument list.
3. Added a detailed description of the metadata json structure so that users understand the type of information present in this file and how it is used throughout the code.

* Addressing code review comments

1. Addressed all the CI comments and some of the changes include simplifying the code related to the concatenation of lists and enhancing the docstrings of functions which are changed in this process.

* Updated the docstrings of two functions in response to code review comments

Removed "todo" from the docstring of the gen_nodedata function.
Added "todo" to the gen_dist_partitions function where the node-id to partition-id mappings are read for the first time.

Removed 'num-node-weights' from the docstring for the get_dataset function and added schema_map docstring to the argument list.

05 Jul, 2022 1 commit

Revert "Revert "[Distributed Training Pipeline] Initial implementation of... · a324440f

Da Zheng authored Jul 04, 2022

Revert "Revert "[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)" (#4037)"

This reverts commit 7c598aac.

29 Jun, 2022 1 commit

code changes for bug fixes identified during mag_lsc dataset (#4187) · 3ccd973c

kylasa authored Jun 29, 2022

* code changes for bug fixes identified during mag_lsc dataset

1. Changed the call from torch.Tensor() to torch.from_numpy() to address memory corruption issues when creating large tensors. The tricky part is that torch.Tensor() works correctly for small tensors.
2. Changed the dgl.graph() function call to include the "num_nodes" argument to explicitly specify all the nodes in a graph partition.
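
The motivation for the second change can be illustrated without DGL. When a node count is inferred from the edge list alone (largest node id plus one), nodes that have no incident edges and sit above the largest endpoint are missed. The sketch below uses plain NumPy and made-up data; it does not call `dgl.graph()` itself, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical partition with 6 nodes (ids 0..5); nodes 4 and 5 have no edges.
num_nodes_in_partition = 6
src = np.array([0, 1, 2])
dst = np.array([1, 2, 3])

# Inferring the node count from the largest endpoint undercounts the partition.
inferred_num_nodes = int(max(src.max(), dst.max())) + 1
missing = num_nodes_in_partition - inferred_num_nodes
```

Passing the node count explicitly avoids this mismatch between the graph and any per-node arrays sized for the full partition.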

* Update convert_partition.py

Moving the changes to the "create_metadata_json" function to the "multiple-file-format" support, where this change is more appropriate, since multi-machine testing was done with those code changes.

* Addressing review comments.

Removed the space at the end of the line, as suggested.

23 May, 2022 1 commit

Revert "[Distributed Training Pipeline] Initial implementation of Distributed... · 7c598aac

Da Zheng authored May 22, 2022

Revert "[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)" (#4037)

This reverts commit 4b87e47f.

19 May, 2022 1 commit

[Distributed Training Pipeline] Initial implementation of Distributed data... · 4b87e47f

kylasa authored May 18, 2022

[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)

* Initial implementation of Distributed data processing step in the Distributed Training pipeline

Implemented the following:
1) Read the output of parmetis (node-id to partition-id mappings)
2) Read the original graph files
3) Shuffle the node/edge metadata and features
4) Output the partition-specific files in DGL format using convert_partition.py functionality
5) Graph metadata is serialized in json format on the rank-0 machine.
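
Step 3 can be pictured as bucketing nodes by their destination partition. The sketch below is a single-process illustration with made-up data. The variable names are hypothetical, and it assumes, for illustration only, that shuffle-global-nids are assigned as contiguous ranges per partition, in partition order.

```python
import numpy as np

# node-id to partition-id mapping, e.g. as produced by a partitioner.
partition_ids = np.array([1, 0, 1, 2, 0, 2, 1])
global_nids = np.arange(len(partition_ids))
num_parts = 3

# A stable sort groups nodes by partition while preserving global-nid order.
order = np.argsort(partition_ids, kind="stable")
counts = np.bincount(partition_ids, minlength=num_parts)
per_partition = np.split(global_nids[order], np.cumsum(counts)[:-1])

# Assumed numbering: new ids are contiguous within each partition.
shuffle_global_nids = np.empty_like(global_nids)
shuffle_global_nids[order] = np.arange(len(global_nids))
```

Here `per_partition[p]` is the list of global nids that would be sent to partition `p`, and `shuffle_global_nids[g]` is the id that global nid `g` would receive after the shuffle under the assumed numbering.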

* Bug Fixes identified during verification of the dataset

1. When sending out global-id lookups for non-local nodes in msg_alltoall.py, a conditional filter was used to identify the indices in node_data, which is incorrect. Replaced the conditional filter with intersect1d to find the common node ids and their indices, which are later used to identify the information that needs to be communicated.

2. When writing the graph-level json file in distributed processing, the edge_offset on non-rank-0 machines was starting from 0 instead of the appropriate offset. Added code so that the edges start from the correct offset instead of always starting from 0.
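
Both fixes can be sketched with NumPy on made-up data. `numpy.intersect1d` with `return_indices=True` returns the common values together with their positions in both inputs, and a per-rank starting offset can be computed as an exclusive prefix sum over the per-rank edge counts. The variable names are hypothetical and are not taken from msg_alltoall.py, and the prefix-sum approach is an assumption about how the offsets could be derived, not a description of the actual code.

```python
import numpy as np

# Fix 1: match requested global ids against the ids held locally.
local_global_ids = np.array([10, 42, 7, 23, 99])
requested_ids = np.array([23, 5, 10, 77])

common, local_idx, request_idx = np.intersect1d(
    local_global_ids, requested_ids, return_indices=True
)
# common is sorted; local_idx / request_idx locate each value in the inputs.

# Fix 2: each rank's first edge id is the number of edges on lower ranks.
edges_per_rank = np.array([4, 2, 5])
edge_offsets = np.cumsum(edges_per_rank) - edges_per_rank  # exclusive prefix sum
```

With these inputs, rank 0 starts at offset 0 and every later rank starts where the previous one ends, so edge id ranges do not overlap across ranks.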

* Restructuring and consolidation of code

1) Fixed an issue when running verify_mag_dataset.py. Now we read xxx_removed_edges.txt and add these edges to `edge_data`. This ensures that the self-loops and duplicate edges are handled appropriately when compared to the original dataset.

2) Consolidated the code into fewer files and changed the code to follow the Python naming convention.

* Code changes addressing code review comments

Following changes are made in this commit.
1) A naming convention is defined and the code is changed accordingly. The various global_ids are defined and how to read them is explained.
2) All the code review comments are addressed.
3) Files are moved to a new directory within the dgl/tools directory, as suggested.
4) A README.md file is included; it contains detailed information about the naming convention adopted by the code, a high-level overview of the algorithm used in data shuffling, and an example command line to use on a single machine.

* addressing github review comments

Made code changes addressing all the review comments from GitHub.

* Addressing latest code review comments

Addressed all the latest code review comments. One of the major changes is treating the node and edge metadata as dictionary objects and replacing all the Python lists with numpy arrays.

* Update README.md

Text rendering corrections

* Addressed code review comments

Addressed code review comments for the latest code review.

Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
