  1. 18 Nov, 2022 1 commit
    • [Dist] Flexible pipeline - Initial commit (#4733) · c8ea9fa4
      kylasa authored
      * Flexible pipeline - Initial commit
      
      1. Implementation of the flexible pipeline feature.
      2. With this implementation, the pipeline now supports multiple partitions per process, and assumes that num_partitions is always a multiple of num_processes.
      
      * Update test_dist_part.py
      
      * Code changes to address review comments
      
      * Code refactoring of exchange_features function into two functions for better readability
      
      * Updating test_dist_part to fix merge issues with the master branch
      
      * corrected variable names...
      
      * Fixed code refactoring issues.
      
      * Provide missing function arguments to exchange_feature function
      
      * Providing the missing function argument to fix error.
      
      * Provide missing function argument to 'get_shuffle_nids' function.
      
      * Repositioned a variable within its scope.
      
      * Removed a tab which was causing the indentation problem
      
      * Fix an issue with the CI test framework that was the root cause of the CI test failures.
      
      1. Now we read files specific to the partition-id and store this data separately in the local process, identified by the local_part_id.
      2. Similarly, we also differentiate the node and edge feature type_ids with the same keys as above.
      3. These two changes help us fetch the appropriate feature data during the feature exchange and send it to the destination process correctly.
      
      * Correct the parametrization for the CI unit test cases.
      
      * Addressing Rui's code review comments.
      
      * Addressing code review comments.
      c8ea9fa4
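The partition-to-process mapping described above can be sketched in a few lines. This is an illustrative helper, not the actual DGL code; the function name and `local_part_id` bookkeeping are assumptions based on the commit message:

```python
def assign_partitions(num_partitions, num_processes, rank):
    """Return the global partition ids handled by one process.

    Assumes num_partitions is a multiple of num_processes, as the
    flexible pipeline does; each partition is identified within the
    owning process by its local_part_id.
    """
    assert num_partitions % num_processes == 0
    parts_per_proc = num_partitions // num_processes
    return [rank * parts_per_proc + local_part_id
            for local_part_id in range(parts_per_proc)]
```

For example, with 8 partitions over 4 processes, rank 1 would handle partitions [2, 3].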
  2. 17 Nov, 2022 1 commit
  3. 09 Nov, 2022 1 commit
  4. 08 Nov, 2022 1 commit
    • [DIST] Message size to retrieve SHUFFLE_GLOBAL_NIDs is resulting in very large... · 4cd0a685
      kylasa authored
      [DIST] Message size to retrieve SHUFFLE_GLOBAL_NIDs results in very large messages and a killed process (#4790)
      
      * Send out the message to the distributed lookup service in batches.
      
      * Update function signature for allgather_sizes function call.
      
      * Removed the unnecessary if statement.
      
      * Removed logging.info message, which is not needed.
      4cd0a685
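The batching fix described above can be sketched framework-agnostically; the function name and batch size below are hypothetical, and in the real code each batch is a tensor sent to the distributed lookup service:

```python
def batch_requests(global_nids, batch_size):
    """Split one oversized lookup request into fixed-size batches so no
    single message to the lookup service exceeds batch_size ids."""
    return [global_nids[i:i + batch_size]
            for i in range(0, len(global_nids), batch_size)]
```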
  5. 07 Nov, 2022 3 commits
  6. 04 Nov, 2022 2 commits
  7. 31 Oct, 2022 1 commit
  8. 27 Oct, 2022 1 commit
  9. 26 Oct, 2022 1 commit
  10. 19 Oct, 2022 2 commits
  11. 17 Oct, 2022 1 commit
    • [Dist] Reduce peak memory in DistDGL (#4687) · b1309217
      Rhett Ying authored
      * [Dist] Reduce peak memory in DistDGL: avoid validation, release memory once loaded
      
      * remove orig_id from ndata/edata for partition_graph()
      
      * delete orig_id from ndata/edata in dist part pipeline
      
      * reduce dtype size and format before saving graphs
      
      * fix lint
      
      * ETYPE is required to be int32/64 for CSRSortByTag
      
      * fix test failure
      
      * refine
      b1309217
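The "reduce dtype size ... before saving graphs" step amounts to picking the narrowest integer type that still holds the data. A torch-free sketch of the idea; the helper name is made up:

```python
def smallest_signed_int(max_value):
    """Pick the narrowest signed integer width that can represent
    values in [0, max_value], mirroring the idea of shrinking dtypes
    before partitioned graphs are written to disk."""
    for bits in (8, 16, 32, 64):
        if max_value < 2 ** (bits - 1):
            return "int%d" % bits
    raise ValueError("value does not fit in int64")
```

Note that, per the commit, ETYPE must stay int32/int64 for CSRSortByTag, so this downcasting cannot be applied to every field.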
  12. 12 Oct, 2022 1 commit
  13. 11 Oct, 2022 1 commit
  14. 03 Oct, 2022 2 commits
    • ParMETIS wrapper script to enable ParMETIS to process chunked dataset format (#4605) · eae6ce2a
      kylasa authored
      * Creating a ParMETIS wrapper script so that, from the user's perspective, ParMETIS runs via one script
      
      * Addressed all the CI comments from PR https://github.com/dmlc/dgl/pull/4529
      
      * Addressing CI comments.
      
      * Isort, and black changes.
      
      * Replaced python with python3
      
      * Replaced single quote with double quotes per suggestion.
      
      * Removed print statement
      
      * Addressing CI Comments.
      
      * Addressing CI review comments.
      
      * Addressing CI comments as per chime discussion with Rui
      
      * CI Comments, Black and isort changes
      
      * Align with code refactoring, black, isort and code review comments.
      
      * Addressing CI review comments, and fixing merge issues with the master branch
      
      * Updated with proper unit test skip decorator
      eae6ce2a
    • Edge Feature support for input graph datasets for dist. graph partitioning pipeline (#4623) · 1f471396
      kylasa authored
      * Added support for edge features.
      
      * Added comments and removed unnecessary print statements.
      
      * Updated data_shuffle.py to remove a compile error.
      
      * Replaced python3 with python to match the CI test framework.
      
      * Removed unrelated files from the pull request.
      
      * Isort changes.
      
      * black changes on this file.
      
      * Addressing CI review comments.
      
      * Addressing CI comments.
      
      * Removed duplicated and resolved merge conflict code.
      
      * Addressing CI Comments from Rui.
      
      * Addressing CI comments, and fixing merge issues.
      
      * Addressing CI comments, code refactoring, isort and black
      1f471396
  15. 28 Sep, 2022 2 commits
  16. 23 Sep, 2022 1 commit
    • Garbage Collection and memory snapshot code for debugging partitioning pipeline (target as master branch) (#4598) · ace76327
      kylasa authored
      
      * Squashed commit of the following:
      
      commit e605a550b3783dd5f24eb39b6873a2e0e79be9c7
      Author: kylasa <kylasa@gmail.com>
      Date:   Thu Sep 15 14:45:39 2022 -0700
      
          Delete pyproject.toml
      
      commit f2db9e700d817212b67b5227f6472d218f0c74f2
      Author: kylasa <kylasa@gmail.com>
      Date:   Thu Sep 15 14:44:40 2022 -0700
      
          Changes suggested by isort program to sort imports.
      
      commit 5a6078beac6218a4f1fb378c169f04dda7396425
      Author: kylasa <kylasa@gmail.com>
      Date:   Thu Sep 15 14:39:50 2022 -0700
      
          addressing code review comments from the CI process.
      
      commit c8e92decb7aebeb32c7467108e16f058491443ab
      Author: kylasa <kylasa@gmail.com>
      Date:   Wed Sep 14 18:23:59 2022 -0700
      
          Corrected a typo in the import statement
      
      commit 14ddb0e9b553d5be3ed2c50d82dee671e84ad8c9
      Author: kylasa <kylasa@gmail.com>
      Date:   Tue Sep 13 18:47:34 2022 -0700
      
          Memory snapshot code for debugging memory footprint of the graph partitioning pipeline
      
      Squashed commit done
      
      * Addressing code review comments.
      
      * Update utils.py
      
      * dummy change to trigger CI tests
      Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>
      ace76327
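A minimal sketch of such a memory-snapshot debugging aid using only the standard library (gc + tracemalloc); the function name and return shape are illustrative, not the actual code added to utils.py:

```python
import gc
import tracemalloc

def memory_snapshot(tag, top_n=3):
    """Force a garbage collection, then report the top allocation sites.

    Returns (tag, collected_count, [stat strings]) so the partitioning
    pipeline can log memory hot spots between stages.
    """
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    collected = gc.collect()  # release unreachable objects before measuring
    stats = tracemalloc.take_snapshot().statistics("lineno")[:top_n]
    return tag, collected, [str(s) for s in stats]
```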
  17. 20 Sep, 2022 1 commit
  18. 15 Sep, 2022 1 commit
  19. 22 Aug, 2022 1 commit
  20. 21 Aug, 2022 1 commit
  21. 19 Aug, 2022 1 commit
  22. 17 Aug, 2022 1 commit
    • Distributed Lookup service implementation to retrieve node-level mappings (#4387) · f51b31b2
      kylasa authored
      * Distributed lookup service for retrieving global-nid to shuffle-global-nid/partition-id mappings
      
      1. Implemented a class that provides the distributed lookup service.
      2. This class can be used to retrieve global-nid mappings.
      
      * Code changes to address CI comments.
      
      1. Removed some unneeded type casts to numpy.int64.
      2. Added additional comments when iterating over the partition-ids list.
      3. Added a docstring to the class and adjusted comments where relevant.
      
      * Updated code comments and variable names...
      
      1. Changed the variable names to appropriately represent the values stored in these variables.
      2. Updated the docstring correctly.
      
      * Corrected docstring as per the suggestion... and removed all the capital letters for Global nids and Shuffle Global nids...
      
      * Addressing CI review comments.
      f51b31b2
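The ownership scheme of such a lookup service can be sketched as follows, assuming each rank serves a contiguous slice of global-nids; the class and attribute names are hypothetical, not the DGL implementation:

```python
class LookupServiceSketch:
    """Toy model of the distributed lookup service: rank r owns a
    contiguous slice of global-nids and serves their
    (shuffle-global-nid, partition-id) mappings for that slice."""

    def __init__(self, num_ranks, num_nodes):
        self.num_ranks = num_ranks
        # ceil division so every global-nid has exactly one owning rank
        self.slice_size = -(-num_nodes // num_ranks)

    def owner_rank(self, global_nid):
        """Rank to which a lookup request for global_nid must be sent."""
        return global_nid // self.slice_size
```

In use, requests would be grouped by `owner_rank` and exchanged between ranks with an all-to-all.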
  23. 12 Aug, 2022 1 commit
  24. 11 Aug, 2022 3 commits
    • [Dist] New distributed data preparation pipeline (#4386) · 71ce1749
      Minjie Wang authored
      * code changes for bug fixes identified during mag_lsc dataset (#4187)
      
      * code changes for bug fixes identified during mag_lsc dataset
      
      1. Changed calls from torch.Tensor() to torch.from_numpy() to address memory corruption issues when creating large tensors. The tricky thing is that the former works correctly for small tensors.
      2. Changed the dgl.graph() function call to include the 'num_nodes' argument to explicitly specify all the nodes in a graph partition.
      
      * Update convert_partition.py
      
      Moved the changes to the "create_metadata_json" function into the "multiple-file-format" support, where they are more appropriate, since multi-machine testing was done with these code changes.
      
      * Addressing review comments.
      
      Removed space as suggested at the end of the line
      
      * Revert "Revert "[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)" (#4037)"
      
      This reverts commit 7c598aac.
      
      * Added code to support multiple-file-support feature and removed singl… (#4188)
      
      * Added code to support multiple-file-support feature and removed single-file-support code
      
      1. Added code to read dataset in multiple-file-format
      2. Removed code for single-file format
      
      * added files missing in the previous commit
      
      This commit includes dataset_utils.py, which reads the dataset in the multiple-file format; gloo_wrapper function calls to support exchanging dictionaries as objects; and helper functions in utils.py.
      
      * Update convert_partition.py
      
      Updated the "create_metadata_json" function call to include partition_id so that each rank creates only its own metadata object; these are later accumulated on rank 0 to create the graph-level metadata JSON file.
      
      * addressing code review comments during the CI process
      
      code changes resulting from the code review comments received during the CI process.
      
      * Code reorganization
      
      Addressing CI comments and code reorganization for easier understanding.
      
      * Removed commented out line
      
      removed commented out line.
      
      * Support new format for multi-file support in distributed partitioning. (#4217)
      
      * Code changes for the following
      
      1. Generating node data at each process
      2. Reading csv files using pyarrow
      3. feature complete code.
      
      * Removed some typos that were causing unit test failures
      
      1. Changed the file name to the correct one when loading edges from file.
      2. When storing node features after shuffling, use the correct key to store the global-nids of node features received after transmission.
      
      * Code changes to address CI comments by reviewers
      
      1. Removed some redundant code and added text in the docstrings to describe the functionality of some functions.
      2. Function signatures and invocations now match w.r.t. the argument list.
      3. Added a detailed description of the metadata JSON structure so that users understand the type of information present in this file and how it is used throughout the code.
      
      * Addressing code review comments
      
      1. Addressed all the CI comments and some of the changes include simplifying the code related to the concatenation of lists and enhancing the docstrings of functions which are changed in this process.
      
      * Updated the docstrings of two functions appropriately in response to code review comments
      
      Removed "todo" from the docstring of the gen_nodedata function.
      Added "todo" to the gen_dist_partitions function where node-id to partition-id mappings are read for the first time.
      
      Removed 'num-node-weights' from the docstring for the get_dataset function and added schema_map docstring to the argument list.
      
      * [Distributed] Change for the new input format for distributed partitioning (#4273)
      
      * Code changes to address the updated file format support for massively large graphs.
      
      1. Updated the docstring of the entry function "gen_dist_partitions" to describe the newly proposed file format for the input dataset.
      2. Code that depended on the structure of the old metadata JSON object has been updated to read from the newly proposed metadata file.
      3. Fixed some errors where functions were invoked and the caller expected return values from the invoked function.
      4. This modified code has been tested on the "mag" dataset using 4-way partitions and the results were verified.
      
      * Code changes to address the CI review comments
      
      1. Improved docstrings for some functions.
      2. Added a new function in the utils.py to compute the id ranges and this is used in multiple places.
      
      * Added TODO to indicate the redundant data structure.
      
      Because of the new file format changes, one of the dictionaries (node_feature_tids, node_tids) will be redundant. Added TODO text so that this will be removed in the next iteration of code changes.
      
      * [Distributed] use alltoall fix to bypass gloo - alltoallv bug in distributed partitioning (#4311)
      
      * Alltoall Fix to bypass gloo - alltoallv bug which is preventing further testing
      
      1. Replaced the alltoallv gloo wrapper call with an alltoall message.
      2. All the messages are padded to be of the same length.
      3. The receiving side unpads the messages and continues processing.
      
      * Code changes to address CI comments
      
      1. Removed unused functions from gloo_wrapper.py
      2. Changed the function signature of alltoallv_cpu_data as suggested.
      3. Added docstring to include more description of the functionality inside alltoallv_cpu_data. Included more asserts to validate the assumptions.
      
      * Changed the function name appropriately
      
      Changed the function name from "alltoallv_cpu_data" to "alltoallv_cpu", which I believe is appropriate because the underlying functionality provides alltoallv, which is basically alltoall_cpu plus padding.
      
      * Added code and text to address the review comments.
      
      1. Changed the function name to indicate the local use of this function.
      2. Changed docstring to indicate the assumptions made by alltoallv_cpu function.
      
      * Removed unused function from import statement
      
      Removed unused/removed function from import statement.
      
      * [Distributed] reduce memory consumption in distributed graph partitioning. (#4338)
      
      * Fix for the node_subgraph function, which seems to generate a segmentation fault for very large partitions
      
      1. Removed three intermediate DGL graph objects; the final DGL object is now created directly while maintaining the following constraints:
      a) Nodes are reordered so that local nodes are placed at the beginning of the node list, ahead of non-local nodes.
      b) Edge order is maintained as passed into this function.
      c) src/dst end points are mapped to target values based on the reshuffled node order.
      
      * Code changes addressing CI comments for this PR
      
      1. Used Da's suggested map to map nodes from the old to the new order.
      This is much simpler and more memory-efficient.
      
      * Addressing CI Comments.
      
      1. Reduced the amount of documentation to reflect the actual implementation.
      2. Named the mapping object appropriately.
      
      * [Distributed] Graph chunking UX (#4365)
      
      * first commit
      
      * update
      
      * huh
      
      * fix
      
      * update
      
      * revert core
      
      * fix
      
      * update
      
      * rewrite
      
      * oops
      
      * address comments
      
      * add graph name
      
      * address comments
      
      * remove sample metadata file
      
      * address comments
      
      * fix
      
      * remove
      
      * add docs
      
      * Adding launch script and wrapper script to trigger distributed graph … (#4276)
      
      * Adding launch script and wrapper script to trigger distributed graph partitioning pipeline as defined in the UX document
      
      1. dispatch_data.py is a wrapper script which builds the command and triggers the distributed partitioning pipeline.
      2. distgraphlaunch.py is the main Python script which triggers the pipeline; to simplify its usage, dispatch_data.py is included as a wrapper script around it.
      
      * Added code to auto-detect python version and retrieve some parameters from the input metadata json file
      
      1. Auto detect python version
      2. Read the metadata json file and extract some parameters to pass to the user defined command which is used to trigger the pipeline.
      
      * Updated the json file name to metadata.json file per UX documentation
      
      1. Renamed json file name per UX documentation.
      
      * address comments
      
      * fix
      
      * fix doc
      
      * use unbuffered logging to cure anxiety
      
      * cure more anxiety
      
      * Update tools/dispatch_data.py
      Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>
      
      * oops
      Co-authored-by: Quan Gan <coin2028@hotmail.com>
      Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>
      Co-authored-by: kylasa <kylasa@gmail.com>
      Co-authored-by: Da Zheng <zhengda1936@gmail.com>
      Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
      71ce1749
    • Adding launch script and wrapper script to trigger distributed graph … (#4276) · 8086d1ed
      kylasa authored
      
      
      * Adding launch script and wrapper script to trigger distributed graph partitioning pipeline as defined in the UX document
      
      1. dispatch_data.py is a wrapper script which builds the command and triggers the distributed partitioning pipeline.
      2. distgraphlaunch.py is the main Python script which triggers the pipeline; to simplify its usage, dispatch_data.py is included as a wrapper script around it.
      
      * Added code to auto-detect python version and retrieve some parameters from the input metadata json file
      
      1. Auto detect python version
      2. Read the metadata json file and extract some parameters to pass to the user defined command which is used to trigger the pipeline.
      
      * Updated the json file name to metadata.json file per UX documentation
      
      1. Renamed json file name per UX documentation.
      
      * address comments
      
      * fix
      
      * fix doc
      
      * use unbuffered logging to cure anxiety
      
      * cure more anxiety
      
      * Update tools/dispatch_data.py
      Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>
      
      * oops
      Co-authored-by: Quan Gan <coin2028@hotmail.com>
      Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>
      8086d1ed
    • [Distributed] Graph chunking UX (#4365) · 067cd744
      Quan (Andy) Gan authored
      * first commit
      
      * update
      
      * huh
      
      * fix
      
      * update
      
      * revert core
      
      * fix
      
      * update
      
      * rewrite
      
      * oops
      
      * address comments
      
      * add graph name
      
      * address comments
      
      * remove sample metadata file
      
      * address comments
      
      * fix
      
      * remove
      
      * add docs
      067cd744
  25. 07 Aug, 2022 1 commit
    • [Distributed] reduce memory consumption in distributed graph partitioning. (#4338) · 60bc0b76
      kylasa authored
      * Fix for the node_subgraph function, which seems to generate a segmentation fault for very large partitions
      
      1. Removed three intermediate DGL graph objects; the final DGL object is now created directly while maintaining the following constraints:
      a) Nodes are reordered so that local nodes are placed at the beginning of the node list, ahead of non-local nodes.
      b) Edge order is maintained as passed into this function.
      c) src/dst end points are mapped to target values based on the reshuffled node order.
      
      * Code changes addressing CI comments for this PR
      
      1. Used Da's suggested map to map nodes from the old to the new order.
      This is much simpler and more memory-efficient.
      
      * Addressing CI Comments.
      
      1. Reduced the amount of documentation to reflect the actual implementation.
      2. Named the mapping object appropriately.
      60bc0b76
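The constraints listed above can be sketched as a pure-Python reordering plus a single old-to-new map (Da's suggestion); the function name and signature are illustrative, and the real code operates on tensors:

```python
def reorder_local_first(num_nodes, is_local, src, dst):
    """Place local nodes before non-local ones and remap edge end
    points through one old->new map, keeping edge order intact."""
    order = [n for n in range(num_nodes) if is_local[n]] + \
            [n for n in range(num_nodes) if not is_local[n]]
    old_to_new = {old: new for new, old in enumerate(order)}
    new_src = [old_to_new[s] for s in src]
    new_dst = [old_to_new[d] for d in dst]
    return order, new_src, new_dst
```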
  26. 06 Aug, 2022 1 commit
    • [Distributed] use alltoall fix to bypass gloo - alltoallv bug in distributed partitioning (#4311) · c1e01b1d
      kylasa authored
      * Alltoall Fix to bypass gloo - alltoallv bug which is preventing further testing
      
      1. Replaced the alltoallv gloo wrapper call with an alltoall message.
      2. All the messages are padded to be of the same length.
      3. The receiving side unpads the messages and continues processing.
      
      * Code changes to address CI comments
      
      1. Removed unused functions from gloo_wrapper.py
      2. Changed the function signature of alltoallv_cpu_data as suggested.
      3. Added docstring to include more description of the functionality inside alltoallv_cpu_data. Included more asserts to validate the assumptions.
      
      * Changed the function name appropriately
      
      Changed the function name from "alltoallv_cpu_data" to "alltoallv_cpu", which I believe is appropriate because the underlying functionality provides alltoallv, which is basically alltoall_cpu plus padding.
      
      * Added code and text to address the review comments.
      
      1. Changed the function name to indicate the local use of this function.
      2. Changed docstring to indicate the assumptions made by alltoallv_cpu function.
      
      * Removed unused function from import statement
      
      Removed unused/removed function from import statement.
      c1e01b1d
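The pad/unpad workaround described above can be sketched without any framework; in the real code the padded buffers travel through gloo's fixed-size alltoall, and the sentinel pad value here is an assumption:

```python
PAD = -1  # sentinel assumed absent from real payloads

def pad_to_equal_length(msgs):
    """Pad each per-destination message to the longest length so a
    fixed-size all-to-all can carry them."""
    max_len = max((len(m) for m in msgs), default=0)
    return [m + [PAD] * (max_len - len(m)) for m in msgs]

def unpad(msg):
    """Receiver side: strip the padding and continue processing."""
    return [x for x in msg if x != PAD]
```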
  27. 23 Jul, 2022 1 commit
    • [Distributed] Change for the new input format for distributed partitioning (#4273) · 7f8e1cf2
      kylasa authored
      * Code changes to address the updated file format support for massively large graphs.
      
      1. Updated the docstring of the entry function "gen_dist_partitions" to describe the newly proposed file format for the input dataset.
      2. Code that depended on the structure of the old metadata JSON object has been updated to read from the newly proposed metadata file.
      3. Fixed some errors where functions were invoked and the caller expected return values from the invoked function.
      4. This modified code has been tested on the "mag" dataset using 4-way partitions and the results were verified.
      
      * Code changes to address the CI review comments
      
      1. Improved docstrings for some functions.
      2. Added a new function in the utils.py to compute the id ranges and this is used in multiple places.
      
      * Added TODO to indicate the redundant data structure.
      
      Because of the new file format changes, one of the dictionaries (node_feature_tids, node_tids) will be redundant. Added TODO text so that this will be removed in the next iteration of code changes.
      7f8e1cf2
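The id-range helper added to utils.py is described above only by name; a plausible sketch, assuming it derives contiguous [start, end) global-id ranges from per-type counts (the name and signature are hypothetical):

```python
def compute_id_ranges(type_counts):
    """Map each node/edge type name to its contiguous [start, end)
    global-id range, in the order the types are listed."""
    ranges, start = {}, 0
    for type_name, count in type_counts.items():
        ranges[type_name] = (start, start + count)
        start += count
    return ranges
```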
  28. 13 Jul, 2022 1 commit
    • Support new format for multi-file support in distributed partitioning. (#4217) · dad3606a
      kylasa authored
      * Code changes for the following
      
      1. Generating node data at each process
      2. Reading csv files using pyarrow
      3. feature complete code.
      
      * Removed some typos that were causing unit test failures
      
      1. Changed the file name to the correct one when loading edges from file.
      2. When storing node features after shuffling, use the correct key to store the global-nids of node features received after transmission.
      
      * Code changes to address CI comments by reviewers
      
      1. Removed some redundant code and added text in the docstrings to describe the functionality of some functions.
      2. Function signatures and invocations now match w.r.t. the argument list.
      3. Added a detailed description of the metadata JSON structure so that users understand the type of information present in this file and how it is used throughout the code.
      
      * Addressing code review comments
      
      1. Addressed all the CI comments and some of the changes include simplifying the code related to the concatenation of lists and enhancing the docstrings of functions which are changed in this process.
      
      * Updated the docstrings of two functions appropriately in response to code review comments
      
      Removed "todo" from the docstring of the gen_nodedata function.
      Added "todo" to the gen_dist_partitions function where node-id to partition-id mappings are read for the first time.
      
      Removed 'num-node-weights' from the docstring for the get_dataset function and added schema_map docstring to the argument list.
      dad3606a
  29. 05 Jul, 2022 2 commits
    • Added code to support multiple-file-support feature and removed singl… (#4188) · 9948ef4d
      kylasa authored
      * Added code to support multiple-file-support feature and removed single-file-support code
      
      1. Added code to read dataset in multiple-file-format
      2. Removed code for single-file format
      
      * added files missing in the previous commit
      
      This commit includes dataset_utils.py, which reads the dataset in the multiple-file format; gloo_wrapper function calls to support exchanging dictionaries as objects; and helper functions in utils.py.
      
      * Update convert_partition.py
      
      Updated the "create_metadata_json" function call to include partition_id so that each rank creates only its own metadata object; these are later accumulated on rank 0 to create the graph-level metadata JSON file.
      
      * addressing code review comments during the CI process
      
      code changes resulting from the code review comments received during the CI process.
      
      * Code reorganization
      
      Addressing CI comments and code reorganization for easier understanding.
      
      * Removed commented out line
      
      removed commented out line.
      9948ef4d
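The rank-0 accumulation of per-partition metadata described above can be sketched as a plain merge; the key names are hypothetical, and the real code gathers the per-rank dicts over gloo before merging:

```python
def merge_partition_metadata(per_rank_metadata):
    """Combine the metadata object each rank produced for its own
    partition into one graph-level dict, as done on rank 0."""
    merged = {"num_parts": len(per_rank_metadata)}
    for meta in per_rank_metadata:
        merged["part-%d" % meta["part_id"]] = meta
    return merged
```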
    • Revert "Revert "[Distributed Training Pipeline] Initial implementation of... · a324440f
      Da Zheng authored
      Revert "Revert "[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)" (#4037)"
      
      This reverts commit 7c598aac.
      a324440f
  30. 29 Jun, 2022 1 commit
    • code changes for bug fixes identified during mag_lsc dataset (#4187) · 3ccd973c
      kylasa authored
      * code changes for bug fixes identified during mag_lsc dataset
      
      1. Changed calls from torch.Tensor() to torch.from_numpy() to address memory corruption issues when creating large tensors. The tricky thing is that the former works correctly for small tensors.
      2. Changed the dgl.graph() function call to include the 'num_nodes' argument to explicitly specify all the nodes in a graph partition.
      
      * Update convert_partition.py
      
      Moved the changes to the "create_metadata_json" function into the "multiple-file-format" support, where they are more appropriate, since multi-machine testing was done with these code changes.
      
      * Addressing review comments.
      
      Removed space as suggested at the end of the line
      3ccd973c
  31. 14 Jun, 2022 1 commit