    kylasa authored
    
[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Distributed Training pipeline (#3926)
    
    * Initial implementation of Distributed data processing step in the Distributed Training pipeline
    
Implemented the following:
1) Read the output of parmetis (node-id to partition-id mappings); a sketch of this step appears after this list.
2) Read the original graph files.
3) Shuffle the node/edge metadata and features.
4) Output the partition-specific files in DGL format using the convert_partition.py functionality.
5) Serialize the graph metadata in JSON format on the rank-0 machine.
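
A minimal sketch of step 1, assuming (hypothetically) that the parmetis output lists one `<global-node-id> <partition-id>` pair per line; the file name and layout here are illustrative, not necessarily the pipeline's actual format:

```python
import numpy as np

# Hypothetical layout: one "global-node-id partition-id" pair per line.
mapping = np.loadtxt("parmetis_output.txt", dtype=np.int64, ndmin=2)
node_ids, part_ids = mapping[:, 0], mapping[:, 1]

# Each rank can then select the nodes assigned to its own partition.
nodes_on_part0 = node_ids[part_ids == 0]
```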
    
    * Bug Fixes identified during verification of the dataset
    
1. When sending out global-id lookups for non-local nodes in msg_alltoall.py, a conditional filter was used to identify the indices in node_data, which was incorrect. Replaced the conditional filter with intersect1d to find the common node ids and the corresponding indices, which are then used to gather the information to communicate; see the sketch below.
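
A minimal sketch of the fix with made-up ids; `requested` and `node_data` are illustrative stand-ins for the arrays exchanged in msg_alltoall.py:

```python
import numpy as np

# Global ids another rank asked about, and the global ids this rank owns.
# A conditional filter over node_data picks the wrong rows when the two
# arrays are not aligned; intersect1d returns both the common ids and
# the indices into each array.
requested = np.array([11, 42, 7, 99])
node_data = np.array([3, 7, 11, 25, 42])

common, req_idx, local_idx = np.intersect1d(
    requested, node_data, return_indices=True
)
# local_idx indexes the rows of node_data whose information must be
# sent back for the requested ids in `common`.
```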
    
2. When writing the graph-level json file in distributed processing, the edge offset on non-rank-0 machines started from 0 instead of the appropriate value. Added code so that each rank numbers its edges from the correct starting offset instead of always from 0; a sketch follows.
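
A minimal sketch of the offset computation, assuming each rank learns every rank's local edge count (e.g., via an allgather); the counts below are made up:

```python
import numpy as np

# Hypothetical per-rank local edge counts, gathered from all ranks.
edge_counts = np.array([1000, 1200, 950, 1100])

# Exclusive prefix sum: rank r starts numbering its edges at the sum of
# the edge counts on ranks 0..r-1, not at 0 (the bug fixed here).
edge_offsets = np.concatenate(([0], np.cumsum(edge_counts)[:-1]))

for rank, off in enumerate(edge_offsets):
    print(f"rank {rank}: first global edge id = {off}")
```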
    
    * Restructuring and consolidation of code
    
1) Fixed an issue when running verify_mag_dataset.py: we now read xxx_removed_edges.txt and add these edges to `edge_data`. This ensures that self-loops and duplicate edges are handled appropriately when compared to the original dataset (see the sketch after this list).
    
2) Consolidated the code into fewer files and changed it to follow the Python naming convention.
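
A minimal sketch of the removed-edges handling from item 1; the file layout and the `edge_data` keys are assumptions for illustration only:

```python
import numpy as np

# Hypothetical layout: each line of the removed-edges file holds a
# "src-id dst-id" pair for a self-loop or duplicate edge dropped earlier.
removed = np.loadtxt("xxx_removed_edges.txt", dtype=np.int64, ndmin=2)

# Assumed dictionary-of-arrays shape for edge_data.
edge_data = {
    "src_id": np.array([0, 1, 2], dtype=np.int64),
    "dst_id": np.array([1, 2, 0], dtype=np.int64),
}

# Append the removed edges so the partitioned graph matches the original.
edge_data["src_id"] = np.concatenate([edge_data["src_id"], removed[:, 0]])
edge_data["dst_id"] = np.concatenate([edge_data["dst_id"], removed[:, 1]])
```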
    
    * Code changes addressing code review comments
    
The following changes are made in this commit:
1) A naming convention is defined and the code is changed accordingly; the various global_ids are defined along with how to read them.
2) All the code review comments are addressed.
3) Files are moved to a new directory under the dgl/tools directory, as suggested.
4) A README.md file is included; it contains detailed information about the naming convention adopted by the code, a high-level overview of the data-shuffling algorithm, and an example command line for running on a single machine.
    
* Addressing GitHub review comments
    
    Made code changes addressing all the review comments from GitHub.
    
    * Addressing latest code review comments
    
Addressed all the latest code review comments. One of the major changes is treating the node and edge metadata as dictionary objects and replacing all the Python lists with numpy arrays; a sketch follows.
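
A minimal sketch of the refactored shape, with hypothetical keys; only the dictionary-of-numpy-arrays structure is the point here:

```python
import numpy as np

# Node metadata as a dictionary of numpy arrays rather than python lists.
node_data = {
    "global_id": np.arange(6, dtype=np.int64),
    "ntype": np.array([0, 0, 1, 1, 1, 0], dtype=np.int64),
    "part_id": np.array([0, 1, 0, 1, 0, 1], dtype=np.int64),
}

# Numpy arrays allow vectorized selection, e.g. the nodes on partition 0.
local_ids = node_data["global_id"][node_data["part_id"] == 0]
```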
    
    * Update README.md
    
    Text rendering corrections
    
    * Addressed code review comments
    
Addressed comments from the latest code review.

Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>