1. 13 Jul, 2022 1 commit
    • Support new format for multi-file support in distributed partitioning. (#4217) · dad3606a
      kylasa authored
      * Code changes for the following
      
      1. Generating node data at each process
      2. Reading CSV files using pyarrow
      3. Feature-complete code.
      
      * Removed some typos that were causing unit tests to fail
      
      1. Use the correct file name when loading edges from a file
      2. When storing node features after shuffling, use the correct key to store the global-nids of the node features received after transmission.
      
      * Code changes to address CI comments by reviewers
      
      1. Removed some redundant code and added text in the docstrings to describe the functionality of some functions.
      2. Function signatures and invocations now match w.r.t. the argument list.
      3. Added a detailed description of the metadata JSON structure so that users understand the type of information present in this file and how it is used throughout the code.
      
      * Addressing code review comments
      
      1. Addressed all the CI comments; some of the changes include simplifying the code related to concatenating lists and enhancing the docstrings of the functions changed in this process.
      
      * Updated the docstrings of two functions in response to code review comments
      
      Removed "todo" from the docstring of the gen_nodedata function.
      Added a "todo" to the gen_dist_partitions function where node-id to partition-id mappings are read for the first time.
      
      Removed 'num-node-weights' from the docstring of the get_dataset function and added a schema_map entry to the argument list in the docstring.
  2. 05 Jul, 2022 2 commits
    • Added code to support multiple-file-support feature and removed singl… (#4188) · 9948ef4d
      kylasa authored
      * Added code to support multiple-file-support feature and removed single-file-support code
      
      1. Added code to read dataset in multiple-file-format
      2. Removed code for single-file format
      
      * added files missing in the previous commit
      
      This commit includes dataset_utils.py, which reads the dataset in the multiple-file format, gloo_wrapper function calls to support exchanging dictionaries as objects, and helper functions in utils.py.
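      One way to exchange dictionaries through tensor-only gloo collectives is to pickle them into byte tensors. The sketch below is an illustration of that general technique, not the actual gloo_wrapper code; the function names are hypothetical.

```python
import pickle
import torch

def obj_to_tensor(obj):
    # Serialize an arbitrary picklable object (e.g. a metadata dict) into a
    # uint8 tensor so it can travel through tensor-only collectives.
    return torch.tensor(list(pickle.dumps(obj)), dtype=torch.uint8)

def tensor_to_obj(t):
    # Inverse: rebuild the object from the received byte tensor.
    return pickle.loads(bytes(t.tolist()))
```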
      
      * Update convert_partition.py
      
      Updated the "create_metadata_json" function call to include partition_id so that each rank creates only its own metadata object; these per-rank objects are later accumulated on rank-0 to create the graph-level metadata JSON file.
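      The accumulate-on-rank-0 pattern can be sketched as below. This is a simplified illustration under assumed field names (`part_id`, `num_nodes`, `num_edges`); the real metadata JSON has more fields than shown here.

```python
def create_partition_metadata(partition_id, num_nodes, num_edges):
    # Hypothetical per-rank metadata object built by each process for its
    # own partition only.
    return {"part_id": partition_id, "num_nodes": num_nodes, "num_edges": num_edges}

def merge_on_rank0(per_rank_metadata):
    # Rank-0 accumulates the gathered per-partition objects into a single
    # graph-level metadata dict, which it then serializes to JSON.
    graph_metadata = {"num_parts": len(per_rank_metadata)}
    for md in per_rank_metadata:
        graph_metadata[f"part-{md['part_id']}"] = md
    return graph_metadata
```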
      
      * Addressing code review comments received during the CI process
      
      Code changes resulting from the code review comments received during the CI process.
      
      * Code reorganization
      
      Addressed CI comments and reorganized the code for easier understanding.
      
      * Removed a commented-out line
    • Revert "Revert "[Distributed Training Pipeline] Initial implementation of... · a324440f
      Da Zheng authored
      Revert "Revert "[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)" (#4037)"
      
      This reverts commit 7c598aac.
  3. 29 Jun, 2022 1 commit
    • code changes for bug fixes identified during mag_lsc dataset (#4187) · 3ccd973c
      kylasa authored
      * code changes for bug fixes identified during mag_lsc dataset
      
      1. Changed the call from torch.Tensor() to torch.from_numpy() to address memory-corruption issues when creating large tensors. The tricky part is that torch.Tensor() works correctly for small tensors.
      2. Changed the dgl.graph() call to include the 'num_nodes' argument, explicitly specifying all the nodes in a graph partition.
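      The first fix can be illustrated as follows. This is a sketch, not the PR's code: torch.Tensor() copies a numpy array into the default float type (changing the dtype), while torch.from_numpy() shares the underlying buffer and preserves it; the num_nodes part is shown only as a comment since it needs a dgl graph.

```python
import numpy as np
import torch

# Global node ids are int64 arrays in this sketch.
node_ids = np.arange(10, dtype=np.int64)

# torch.from_numpy() shares the numpy buffer and keeps the int64 dtype,
# avoiding the copy/conversion that torch.Tensor() would perform.
t = torch.from_numpy(node_ids)

# Second fix (sketch only): pass num_nodes explicitly when building the
# partition graph so nodes without incident edges are still counted:
#   g = dgl.graph((src, dst), num_nodes=num_nodes_in_partition)
```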
      
      * Update convert_partition.py
      
      Moved the changes to the "create_metadata_json" function into the "multiple-file-format" support, where they are more appropriate, since multi-machine testing was done with those code changes.
      
      * Addressing review comments.
      
      Removed the trailing space at the end of the line, as suggested.
  4. 14 Jun, 2022 1 commit
  5. 09 Jun, 2022 1 commit
  6. 23 May, 2022 1 commit
  7. 19 May, 2022 1 commit
    • [Distributed Training Pipeline] Initial implementation of Distributed data... · 4b87e47f
      kylasa authored
      
      [Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)
      
      * Initial implementation of Distributed data processing step in the Distributed Training pipeline
      
      Implemented the following:
      1) Read the output of parmetis (node-id to partition-id mappings).
      2) Read the original graph files.
      3) Shuffle the node/edge metadata and features.
      4) Output the partition-specific files in DGL format using the convert_partition.py functionality.
      5) Serialize the graph metadata in JSON format on the rank-0 machine.
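      Step 1 can be distilled into a small sketch. It assumes (for illustration only) that the parmetis output is an array whose index is the node id and whose value is the partition id; the function name is hypothetical.

```python
import numpy as np

def group_nodes_by_partition(node_part_ids, num_parts):
    # Given the node-id -> partition-id mapping, collect the node ids
    # owned by each partition so each rank knows what to shuffle where.
    return [np.nonzero(node_part_ids == p)[0] for p in range(num_parts)]
```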
      
      * Bug Fixes identified during verification of the dataset
      
      1. When sending out global-id lookups for non-local nodes in msg_alltoall.py, a conditional filter was used to identify the indices in node_data, which was incorrect. Replaced the conditional filter with intersect1d to find the common node ids and the corresponding indices, which are then used to identify the information that needs to be communicated.
      
      2. When writing the graph-level JSON file in distributed processing, the edge_offset on non-rank-0 machines started from 0 instead of the appropriate offset. Added code so that edges start from the correct offset instead of always starting at 0.
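      The intersect1d fix in item 1 can be sketched as below. Variable names are illustrative, not from the actual msg_alltoall.py code; the point is that np.intersect1d with return_indices=True yields both the common node ids and the positions where they live in the local data.

```python
import numpy as np

# Ids of nodes held on this rank, and ids another rank is asking about.
local_ids = np.array([10, 3, 7, 42, 5])
requested_ids = np.array([7, 5, 99])

# common holds the requested ids present on this rank (sorted);
# local_idx tells us which rows of the local node_data to send back.
common, local_idx, _ = np.intersect1d(
    local_ids, requested_ids, return_indices=True
)
```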
      
      * Restructuring and consolidation of code
      
      1) Fixed an issue when running verify_mag_dataset.py: we now read xxx_removed_edges.txt and add these edges to `edge_data`. This ensures that self-loops and duplicate edges are handled appropriately when comparing against the original dataset.
      
      2) Consolidated the code into fewer files and changed it to follow the Python naming convention.
      
      * Code changes addressing code review comments
      
      The following changes are made in this commit:
      1) A naming convention is defined and the code is changed accordingly. The various global ids are defined, along with how to read them.
      2) All the code review comments are addressed.
      3) Files are moved to a new directory under the dgl/tools directory, as suggested.
      4) A README.md file is included; it contains detailed information about the naming convention adopted by the code, a high-level overview of the algorithm used in data shuffling, and an example command line for use on a single machine.
      
      * Addressing GitHub review comments
      
      Made code changes addressing all the review comments from GitHub.
      
      * Addressing latest code review comments
      
      Addressed all the latest code review comments. One of the major changes is treating the node and edge metadata as dictionary objects and replacing all the Python lists with NumPy arrays.
      
      * Update README.md
      
      Text rendering corrections
      
      * Addressed code review comments
      
      Addressed code review comments from the latest review.
      Co-authored-by: xiang song (charlie.song) <classicxsong@gmail.com>
  8. 09 Feb, 2022 1 commit
  9. 08 Nov, 2021 1 commit
  10. 23 Sep, 2021 1 commit
  11. 14 Sep, 2021 1 commit
  12. 17 Aug, 2021 1 commit
  13. 02 Aug, 2021 2 commits
  14. 30 Jul, 2021 1 commit
  15. 02 Jul, 2021 1 commit
  16. 26 May, 2021 1 commit
  17. 01 May, 2021 1 commit
  18. 08 Apr, 2021 1 commit
  19. 04 Apr, 2021 1 commit
  20. 30 Mar, 2021 1 commit
  21. 22 Mar, 2021 1 commit
  22. 25 Feb, 2021 1 commit
  23. 09 Feb, 2021 1 commit
    • [Distributed] Distributed METIS partition (#2576) · e4ff4844
      Da Zheng authored
      
      
      * add convert.
      
      * fix.
      
      * add write_mag.
      
      * fix convert_partition.py
      
      * write data.
      
      * use pyarrow to read.
      
      * update write_mag.py
      
      * fix convert_partition.py.
      
      * load node/edge features when necessary.
      
      * reshuffle nodes.
      
      * write mag correctly.
      
      * fix a bug: inner nodes in a partition might be empty.
      
      * fix bugs.
      
      * add verify code.
      
      * insert reverse edges.
      
      * fix a bug.
      
      * add get node/edge data.
      
      * add instructions.
      
      * remove unnecessary argument.
      
      * update distributed preprocessing.
      
      * fix readme.
      
      * fix.
      
      * fix readme.
      
      * fix doc.
      
      * fix.
      
      * update readme
      
      * update doc.
      
      * update readme.
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-132.us-west-1.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-2-202.us-west-1.compute.internal>
  24. 15 Sep, 2020 1 commit
  25. 27 Aug, 2020 1 commit
  26. 13 Aug, 2020 1 commit
  27. 12 Aug, 2020 2 commits
  28. 11 Aug, 2020 2 commits
  29. 10 Aug, 2020 1 commit
  30. 09 Aug, 2020 1 commit
  31. 08 Aug, 2020 1 commit
  32. 31 Jul, 2020 1 commit
  33. 27 Jul, 2020 1 commit
  34. 17 Jul, 2020 1 commit
  35. 16 Jul, 2020 1 commit
    • [Distributed] Distributed launching script (#1772) · ca9d3216
      Chao Ma authored
      
      
      * update
      
      * fix launch script.
      Co-authored-by: Da Zheng <zhengda1936@gmail.com>
  36. 03 May, 2020 1 commit
    • [Feature] Distributed graph store (#1383) · 2190c39d
      Da Zheng authored
      * initial version from distributed training.
      
      This is copied from multiprocessing training.
      
      * modify for distributed training.
      
      * it's runnable now.
      
      * measure time in neighbor sampling.
      
      * simplify neighbor sampling.
      
      * fix a bug in distributed neighbor sampling.
      
      * allow single-machine training.
      
      * fix a bug.
      
      * fix a bug.
      
      * fix openmp.
      
      * make some improvement.
      
      * fix.
      
      * add prepare in the sampler.
      
      * prepare nodeflow async.
      
      * fix a bug.
      
      * get id.
      
      * simplify the code.
      
      * improve.
      
      * fix partition.py
      
      * fix the example.
      
      * add more features.
      
      * fix the example.
      
      * allow one partition
      
      * use distributed kvstore.
      
      * do g2l map manually.
      
      * fix commandline.
      
      * a temp script to save reddit.
      
      * fix pull_handler.
      
      * add pytorch version.
      
      * estimate the time for copying data.
      
      * delete unused code.
      
      * fix a bug.
      
      * print id.
      
      * fix a bug
      
      * fix a bug
      
      * fix a bug.
      
      * remove redundant code.
      
      * revert modify in sampler.
      
      * fix temp script.
      
      * remove pytorch version.
      
      * fix.
      
      * distributed training with pytorch.
      
      * add distributed graph store.
      
      * fix.
      
      * add metis_partition_assignment.
      
      * fix a few bugs in distributed graph store.
      
      * fix test.
      
      * fix bugs in distributed graph store.
      
      * fix tests.
      
      * remove code of defining DistGraphStore.
      
      * fix partition.
      
      * fix example.
      
      * update run.sh.
      
      * only read necessary node data.
      
      * batching data fetch of multiple NodeFlows.
      
      * simplify gcn.
      
      * remove unnecessary code.
      
      * use the new copy_from_kvstore.
      
      * update training script.
      
      * print time in graphsage.
      
      * make distributed training runnable.
      
      * use val_nid.
      
      * fix train_sampling.
      
      * add distributed training.
      
      * add run.sh
      
      * add more timing.
      
      * fix a bug.
      
      * save graph metadata when partition.
      
      * create ndata and edata in distributed graph store.
      
      * add timing in minibatch training of GraphSage.
      
      * use pytorch distributed.
      
      * add checks.
      
      * fix a bug in global vs. local ids.
      
      * remove fast pull
      
      * fix a compile error.
      
      * update and add new APIs.
      
      * implement more methods in DistGraphStore.
      
      * update more APIs.
      
      * rename it to DistGraph.
      
      * rename to DistTensor
      
      * remove some unnecessary API.
      
      * remove unnecessary files.
      
      * revert changes in sampler.
      
      * Revert "simplify gcn."
      
      This reverts commit 0ed3a34ca714203a5b45240af71555d4227ce452.
      
      * Revert "simplify neighbor sampling."
      
      This reverts commit 551c72d20f05a029360ba97f312c7a7a578aacec.
      
      * Revert "measure time in neighbor sampling."
      
      This reverts commit 63ae80c7b402bb626e24acbbc8fdfe9fffd0bc64.
      
      * Revert "add timing in minibatch training of GraphSage."
      
      This reverts commit e59dc8957a414c7df5c316f51d78bce822bdef5e.
      
      * Revert "fix train_sampling."
      
      This reverts commit ea6aea9a4aabb8ba0ff63070aa51e7ca81536ad9.
      
      * fix lint.
      
      * add comments and small update.
      
      * add more comments.
      
      * add more unit tests and fix bugs.
      
      * check the existence of shared-mem graph index.
      
      * use new partitioned graph storage.
      
      * fix bugs.
      
      * print error in fast pull.
      
      * fix lint
      
      * fix a compile error.
      
      * save absolute path after partitioning.
      
      * small fixes in the example
      
      * Revert "[kvstore] support any data type for init_data() (#1465)"
      
      This reverts commit 87b6997b.
      
      * fix a bug.
      
      * disable evaluation.
      
      * Revert "Revert "[kvstore] support any data type for init_data() (#1465)""
      
      This reverts commit f5b8039c6326eb73bad8287db3d30d93175e5bee.
      
      * support set and init data.
      
      * fix bugs.
      
      * fix unit test.
      
      * move to dgl.distributed.
      
      * fix lint.
      
      * fix lint.
      
      * remove local_nids.
      
      * fix lint.
      
      * fix test.
      
      * remove train_dist.
      
      * revert train_sampling.
      
      * rename funcs.
      
      * address comments.
      
      * address comments.
      
      Use NodeDataView/EdgeDataView to keep track of data.
      
      * address comments.
      
      * address comments.
      
      * revert.
      
      * save data with DGL serializer.
      
      * use the right way of getting shape.
      
      * fix lint.
      
      * address comments.
      
      * address comments.
      
      * fix an error in mxnet.
      
      * address comments.
      
      * add edge_map.
      
      * add more test and fix bugs.
      Co-authored-by: Zheng <dzzhen@186590dc80ff.ant.amazon.com>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-6-131.us-east-2.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-26-167.us-east-2.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-150.us-west-2.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-250.us-west-2.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-135.us-west-2.compute.internal>