- 21 Aug, 2022 1 commit
-
-
xiang song(charlie.song) authored
* Collect error reports * update * fix Co-authored-by: root <root@ip-10-0-80-128.ec2.internal>
-
- 19 Aug, 2022 1 commit
-
-
Mufei Li authored
* chunked graph data format * Update * Update * Update task_distributed_test.sh * Update * Update * Revert "Update" This reverts commit 03c461870f19375fb03125b061fc853ab555577f. * Update * Update * ssh-keygen * CI * install openssh * openssh * Update * CI * Update * Update Co-authored-by:
Ubuntu <ubuntu@ip-172-31-53-142.us-west-2.compute.internal> Co-authored-by:
Ubuntu <ubuntu@ip-172-31-16-87.us-west-2.compute.internal> Co-authored-by:
Ubuntu <ubuntu@ip-172-31-20-21.us-west-2.compute.internal> Co-authored-by:
Ubuntu <ubuntu@ip-172-31-9-26.ap-northeast-1.compute.internal>
-
- 17 Aug, 2022 1 commit
-
-
kylasa authored
* Distributed lookup service for retrieving global-nid to shuffle-global-nid/partition-id mappings 1. Implemented a class that provides the distributed lookup service 2. This class can be used to retrieve global-nid mappings * Code changes to address CI comments. 1. Removed some unneeded type casts to numpy.int64 2. Added additional comments when iterating over the partition-ids list. 3. Added a docstring to the class and adjusted comments where relevant. * Updated code comments and variable names... 1. Changed the variable names to appropriately represent the values stored in them. 2. Updated the docstring accordingly. * Corrected the docstring as per the suggestion... and removed the capital letters from "global nids" and "shuffle global nids"... * Addressing CI review comments.
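For context, a much-simplified, non-distributed sketch of what such a lookup could look like; the class name, constructor, and the range-based ownership table are illustrative assumptions, not the code in this PR:

```python
import numpy as np

class DistLookupService:
    """Simplified sketch of a partition-lookup service: given global node ids,
    return the partition id that owns each node. Ownership is assumed to be a
    contiguous id range per partition (an assumption for this example)."""

    def __init__(self, partition_ranges):
        # partition_ranges[i] = (start_gnid, end_gnid) owned by partition i.
        self.partition_ranges = np.asarray(partition_ranges, dtype=np.int64)

    def get_partition_ids(self, global_nids):
        # searchsorted on the range starts gives the owning partition index.
        starts = self.partition_ranges[:, 0]
        gnids = np.asarray(global_nids, dtype=np.int64)
        return np.searchsorted(starts, gnids, side="right") - 1

# Example: 2 partitions, nodes 0-4 on partition 0 and 5-9 on partition 1.
svc = DistLookupService([(0, 5), (5, 10)])
print(svc.get_partition_ids([1, 6, 9]))  # -> [0 1 1]
```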
-
- 12 Aug, 2022 1 commit
-
-
Minjie Wang authored
This reverts commit 71ce1749.
-
- 11 Aug, 2022 3 commits
-
-
Minjie Wang authored
* code changes for bug fixes identified during mag_lsc dataset (#4187) * code changes for bug fixes identified during mag_lsc dataset 1. Changed calls from torch.Tensor() to torch.from_numpy() to address memory corruption issues when creating large tensors. The tricky thing is that torch.Tensor() works correctly for small tensors. 2. Changed the dgl.graph() function call to include the 'num_nodes' argument to explicitly specify all the nodes in a graph partition. * Update convert_partition.py Moving the changes to the "create_metadata_json" function to the "multiple-file-format" support, where this change is more appropriate. Multi-machine testing was done with these code changes. * Addressing review comments. Removed space at the end of the line as suggested * Revert "Revert "[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)" (#4037)" This reverts commit 7c598aac. * Added code to support multiple-file-support feature and removed singl… (#4188) * Added code to support the multiple-file-support feature and removed single-file-support code 1. Added code to read the dataset in multiple-file format 2. Removed code for the single-file format * added files missing in the previous commit This commit includes dataset_utils.py, which reads the dataset in multiple-file format, gloo_wrapper function calls to support exchanging dictionaries as objects, and helper functions in utils.py * Update convert_partition.py Updated the "create_metadata_json" function call to include partition_id so that each rank only creates its own metadata object; these are later accumulated on rank-0 to create the graph-level metadata json file. * addressing code review comments during the CI process Code changes resulting from the code review comments received during the CI process. * Code reorganization Addressing CI comments and reorganizing code for easier understanding. * Removed commented out line Removed a commented-out line. * Support new format for multi-file support in distributed partitioning. (#4217) * Code changes for the following 1. Generating node data at each process 2. Reading csv files using pyarrow 3. Feature-complete code. * Removed some typos that caused unit-test failures 1. Changed the file name to the correct one when loading edges from a file 2. When storing node features after shuffling, use the correct key to store the global-nids of node features which are received after being transmitted. * Code changes to address CI comments by reviewers 1. Removed some redundant code and added text in the doc-strings to describe the functionality of some functions. 2. Function signatures and invocations now match w.r.t. argument lists 3. Added a detailed description of the metadata json structure so that users understand the type of information present in this file and how it is used throughout the code. * Addressing code review comments 1. Addressed all the CI comments; some of the changes include simplifying the code related to the concatenation of lists and enhancing the docstrings of the functions changed in this process. * Updated docstrings of two functions appropriately in response to code review comments Removed "todo" from the docstring of the gen_nodedata function. Added "todo" to the gen_dist_partitions function where node-id to partition-id mappings are read for the first time. Removed 'num-node-weights' from the docstring for the get_dataset function and added a schema_map docstring to the argument list.
* [Distributed] Change for the new input format for distributed partitioning (#4273) * Code changes to address the updated file format support for massively large graphs. 1. Updated the docstring for the entry-point function 'gen_dist_partitions' to describe the newly proposed file format for the input dataset. 2. Code which was dependent on the structure of the old metadata json object has been updated to read from the newly proposed metadata file. 3. Fixed some errors where functions were invoked and the calling function expects return values from the invoked function. 4. This modified code has been tested on the "mag" dataset using 4-way partitions and the results were verified * Code changes to address the CI review comments 1. Improved docstrings for some functions. 2. Added a new function in utils.py to compute the id ranges; this is used in multiple places. * Added TODO to indicate the redundant data structure. Because of the new file format changes, one of the dictionaries (node_feature_tids, node_tids) will be redundant. Added TODO text so that this will be removed in the next iteration of code changes. * [Distributed] use alltoall fix to bypass gloo alltoallv bug in distributed partitioning (#4311) * Alltoall fix to bypass the gloo alltoallv bug which was preventing further testing 1. Replaced the alltoallv gloo wrapper call with alltoall messages. 2. All the messages are padded to be of the same length 3. The receiving side unpads the messages and continues processing. * Code changes to address CI comments 1. Removed unused functions from gloo_wrapper.py 2. Changed the function signature of alltoallv_cpu_data as suggested. 3. Added a docstring to describe the functionality inside alltoallv_cpu_data in more detail. Included more asserts to validate the assumptions. * Changed the function name appropriately Changed the function name from "alltoallv_cpu_data" to alltoallv_cpu, which I believe is appropriate because the underlying functionality provides alltoallv, which is basically alltoall_cpu + padding * Added code and text to address the review comments. 1. Changed the function name to indicate the local use of this function. 2. Changed the docstring to indicate the assumptions made by the alltoallv_cpu function. * Removed unused function from import statement Removed the unused/removed function from the import statement. * [Distributed] reduce memory consumption in distributed graph partitioning. (#4338) * Fix for the node_subgraph function, which seems to generate a segmentation fault for very large partitions 1. Removed three intermediate dgl graph objects; we now create the final dgl object directly by maintaining the following constraints a) Nodes are reordered so that local nodes are placed at the beginning of the node list, before non-local nodes. b) Edge order is maintained as passed into this function. c) src/dst endpoints are mapped to target values based on the reshuffled node order. * Code changes addressing CI comments for this PR 1. Used Da's suggested map to map nodes from the old to the new order. This is much simpler and more memory efficient. * Addressing CI Comments. 1. Reduced the amount of documentation to reflect the actual implementation. 2. Named the mapping object appropriately.
* [Distributed] Graph chunking UX (#4365) * first commit * update * huh * fix * update * revert core * fix * update * rewrite * oops * address comments * add graph name * address comments * remove sample metadata file * address comments * fix * remove * add docs * Adding launch script and wrapper script to trigger distributed graph … (#4276) * Adding launch script and wrapper script to trigger the distributed graph partitioning pipeline as defined in the UX document 1. dispatch_data.py is a wrapper script which builds the command and triggers the distributed partitioning pipeline 2. distgraphlaunch.py is the main Python script which triggers the pipeline; dispatch_data.py is included as a wrapper script around it to simplify its usage. * Added code to auto-detect the Python version and retrieve some parameters from the input metadata json file 1. Auto-detect the Python version 2. Read the metadata json file and extract some parameters to pass to the user-defined command which is used to trigger the pipeline. * Updated the json file name to metadata.json per the UX documentation 1. Renamed the json file name per the UX documentation. * address comments * fix * fix doc * use unbuffered logging to cure anxiety * cure more anxiety * Update tools/dispatch_data.py Co-authored-by:
Minjie Wang <minjie.wang@nyu.edu> * oops Co-authored-by:
Quan Gan <coin2028@hotmail.com> Co-authored-by:
Minjie Wang <minjie.wang@nyu.edu> Co-authored-by:
kylasa <kylasa@gmail.com> Co-authored-by:
Da Zheng <zhengda1936@gmail.com> Co-authored-by:
Quan (Andy) Gan <coin2028@hotmail.com>
-
kylasa authored
* Adding launch script and wrapper script to trigger the distributed graph partitioning pipeline as defined in the UX document 1. dispatch_data.py is a wrapper script which builds the command and triggers the distributed partitioning pipeline 2. distgraphlaunch.py is the main Python script which triggers the pipeline; dispatch_data.py is included as a wrapper script around it to simplify its usage. * Added code to auto-detect the Python version and retrieve some parameters from the input metadata json file 1. Auto-detect the Python version 2. Read the metadata json file and extract some parameters to pass to the user-defined command which is used to trigger the pipeline. * Updated the json file name to metadata.json per the UX documentation 1. Renamed the json file name per the UX documentation. * address comments * fix * fix doc * use unbuffered logging to cure anxiety * cure more anxiety * Update tools/dispatch_data.py Co-authored-by:
Minjie Wang <minjie.wang@nyu.edu> * oops Co-authored-by:
Quan Gan <coin2028@hotmail.com> Co-authored-by:
Minjie Wang <minjie.wang@nyu.edu>
-
Quan (Andy) Gan authored
* first commit * update * huh * fix * update * revert core * fix * update * rewrite * oops * address comments * add graph name * address comments * remove sample metadata file * address comments * fix * remove * add docs
-
- 07 Aug, 2022 1 commit
-
-
kylasa authored
* Fix for the node_subgraph function, which seems to generate a segmentation fault for very large partitions 1. Removed three intermediate dgl graph objects; we now create the final dgl object directly by maintaining the following constraints a) Nodes are reordered so that local nodes are placed at the beginning of the node list, before non-local nodes. b) Edge order is maintained as passed into this function. c) src/dst endpoints are mapped to target values based on the reshuffled node order. * Code changes addressing CI comments for this PR 1. Used Da's suggested map to map nodes from the old to the new order. This is much simpler and more memory efficient. * Addressing CI Comments. 1. Reduced the amount of documentation to reflect the actual implementation. 2. Named the mapping object appropriately.
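A toy illustration of the reorder-and-remap idea described above; the variable names and the plain dict map are illustrative (the PR uses a more memory-efficient mapping), not the code in the fix:

```python
import numpy as np

# Hypothetical inputs: unique node ids appearing in a partition, a mask
# telling which of them are local ("inner") nodes, and edges in old ids.
nodes = np.array([10, 3, 7, 42])
is_local = np.array([True, False, True, False])
src = np.array([10, 7, 3])
dst = np.array([3, 42, 7])

# Reorder so local nodes come first, non-local nodes after; edge order is kept.
order = np.concatenate([np.nonzero(is_local)[0], np.nonzero(~is_local)[0]])
reordered = nodes[order]

# Old-id -> new-id map, then remap the src/dst endpoints.
old_to_new = {old: new for new, old in enumerate(reordered)}
new_src = np.array([old_to_new[s] for s in src])
new_dst = np.array([old_to_new[d] for d in dst])
print(reordered, new_src, new_dst)  # [10  7  3 42] [0 1 2] [2 3 1]
```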
-
- 06 Aug, 2022 1 commit
-
-
kylasa authored
* Alltoall fix to bypass the gloo alltoallv bug which was preventing further testing 1. Replaced the alltoallv gloo wrapper call with alltoall messages. 2. All the messages are padded to be of the same length 3. The receiving side unpads the messages and continues processing. * Code changes to address CI comments 1. Removed unused functions from gloo_wrapper.py 2. Changed the function signature of alltoallv_cpu_data as suggested. 3. Added a docstring to describe the functionality inside alltoallv_cpu_data in more detail. Included more asserts to validate the assumptions. * Changed the function name appropriately Changed the function name from "alltoallv_cpu_data" to alltoallv_cpu, which I believe is appropriate because the underlying functionality provides alltoallv, which is basically alltoall_cpu + padding * Added code and text to address the review comments. 1. Changed the function name to indicate the local use of this function. 2. Changed the docstring to indicate the assumptions made by the alltoallv_cpu function. * Removed unused function from import statement Removed the unused/removed function from the import statement.
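A small sketch of the padding/unpadding idea behind "alltoall + padding"; the helper names are hypothetical, and a real exchange would also have to communicate the per-message lengths (e.g. through a fixed-size alltoall) before unpadding:

```python
import torch

def pad_to_max(tensors):
    """Pad a list of 1-D tensors to a common length so a fixed-size alltoall
    can carry them; also return the original lengths for later unpadding."""
    max_len = max(t.numel() for t in tensors)
    lengths = torch.tensor([t.numel() for t in tensors])
    padded = [torch.cat([t, t.new_zeros(max_len - t.numel())]) for t in tensors]
    return padded, lengths

def unpad(padded, lengths):
    """Recover the variable-length payloads on the receiving side."""
    return [t[:n] for t, n in zip(padded, lengths)]

msgs = [torch.arange(3), torch.arange(5), torch.arange(1)]
padded, lengths = pad_to_max(msgs)
recovered = unpad(padded, lengths)
assert all(torch.equal(a, b) for a, b in zip(msgs, recovered))
```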
-
- 23 Jul, 2022 1 commit
-
-
kylasa authored
* Code changes to address the updated file format support for massively large graphs. 1. Updated the docstring for the entry-point function 'gen_dist_partitions' to describe the newly proposed file format for the input dataset. 2. Code which was dependent on the structure of the old metadata json object has been updated to read from the newly proposed metadata file. 3. Fixed some errors where functions were invoked and the calling function expects return values from the invoked function. 4. This modified code has been tested on the "mag" dataset using 4-way partitions and the results were verified * Code changes to address the CI review comments 1. Improved docstrings for some functions. 2. Added a new function in utils.py to compute the id ranges; this is used in multiple places. * Added TODO to indicate the redundant data structure. Because of the new file format changes, one of the dictionaries (node_feature_tids, node_tids) will be redundant. Added TODO text so that this will be removed in the next iteration of code changes.
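A possible shape for the id-range helper mentioned above, assuming contiguous per-partition id ranges; the function name and signature are illustrative, not necessarily what was added to utils.py:

```python
import numpy as np

def compute_id_ranges(counts):
    """Given per-partition node (or edge) counts, return contiguous
    [start, end) global-id ranges, one row per partition."""
    ends = np.cumsum(np.asarray(counts, dtype=np.int64))
    starts = np.concatenate([[0], ends[:-1]])
    return np.stack([starts, ends], axis=1)

print(compute_id_ranges([4, 2, 5]))
# [[ 0  4]
#  [ 4  6]
#  [ 6 11]]
```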
-
- 13 Jul, 2022 1 commit
-
-
kylasa authored
* Code changes for the following 1. Generating node data at each process 2. Reading csv files using pyarrow 3. Feature-complete code. * Removed some typos that caused unit-test failures 1. Changed the file name to the correct one when loading edges from a file 2. When storing node features after shuffling, use the correct key to store the global-nids of node features which are received after being transmitted. * Code changes to address CI comments by reviewers 1. Removed some redundant code and added text in the doc-strings to describe the functionality of some functions. 2. Function signatures and invocations now match w.r.t. argument lists 3. Added a detailed description of the metadata json structure so that users understand the type of information present in this file and how it is used throughout the code. * Addressing code review comments 1. Addressed all the CI comments; some of the changes include simplifying the code related to the concatenation of lists and enhancing the docstrings of the functions changed in this process. * Updated docstrings of two functions appropriately in response to code review comments Removed "todo" from the docstring of the gen_nodedata function. Added "todo" to the gen_dist_partitions function where node-id to partition-id mappings are read for the first time. Removed 'num-node-weights' from the docstring for the get_dataset function and added a schema_map docstring to the argument list.
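Illustrative sketch of reading an edge file with pyarrow, as mentioned in point 2 above; the file name, delimiter, and column names are assumptions for the example, not the dataset_utils.py code:

```python
import pyarrow.csv as pa_csv

# Hypothetical edge file: whitespace-delimited src/dst columns, no header.
edge_file = "part0_edges.txt"
table = pa_csv.read_csv(
    edge_file,
    read_options=pa_csv.ReadOptions(column_names=["src", "dst"]),
    parse_options=pa_csv.ParseOptions(delimiter=" "),
)
# Convert the columns to numpy arrays for downstream shuffling.
src = table.column("src").to_numpy()
dst = table.column("dst").to_numpy()
print(src.dtype, len(src))
```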
-
- 05 Jul, 2022 2 commits
-
-
kylasa authored
* Added code to support the multiple-file-support feature and removed single-file-support code 1. Added code to read the dataset in multiple-file format 2. Removed code for the single-file format * added files missing in the previous commit This commit includes dataset_utils.py, which reads the dataset in multiple-file format, gloo_wrapper function calls to support exchanging dictionaries as objects, and helper functions in utils.py * Update convert_partition.py Updated the "create_metadata_json" function call to include partition_id so that each rank only creates its own metadata object; these are later accumulated on rank-0 to create the graph-level metadata json file. * addressing code review comments during the CI process Code changes resulting from the code review comments received during the CI process. * Code reorganization Addressing CI comments and reorganizing code for easier understanding. * Removed commented out line Removed a commented-out line.
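A sketch of the rank-0 accumulation idea, using torch.distributed object collectives to exchange dictionaries; this is not the PR's code, it assumes an already-initialized process group (e.g. gloo), and the JSON layout shown is made up:

```python
import json
import torch.distributed as dist

def write_graph_metadata(local_metadata, rank, world_size, out_path="metadata.json"):
    """Every rank contributes its partition-level metadata dict; rank 0
    gathers them and writes one graph-level JSON file."""
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(local_metadata, gathered, dst=0)
    if rank == 0:
        graph_meta = {"num_parts": world_size, "partitions": gathered}
        with open(out_path, "w") as f:
            json.dump(graph_meta, f, indent=2)
```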
-
- 29 Jun, 2022 1 commit
-
-
kylasa authored
* code changes for bug fixes identified during mag_lsc dataset 1. Changed calls from torch.Tensor() to torch.from_numpy() to address memory corruption issues when creating large tensors. The tricky thing is that torch.Tensor() works correctly for small tensors. 2. Changed the dgl.graph() function call to include the 'num_nodes' argument to explicitly specify all the nodes in a graph partition. * Update convert_partition.py Moving the changes to the "create_metadata_json" function to the "multiple-file-format" support, where this change is more appropriate. Multi-machine testing was done with these code changes. * Addressing review comments. Removed space at the end of the line as suggested
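A minimal illustration of the two fixes on toy data: building tensors with torch.from_numpy() instead of torch.Tensor(), and passing num_nodes explicitly to dgl.graph() so isolated nodes in the partition are kept:

```python
import numpy as np
import torch
import dgl

src = np.array([0, 1, 2], dtype=np.int64)
dst = np.array([1, 2, 0], dtype=np.int64)

# torch.from_numpy shares the numpy buffer instead of reinterpreting the data,
# avoiding the corruption seen with torch.Tensor() on large arrays.
src_t = torch.from_numpy(src)
dst_t = torch.from_numpy(dst)

# Pass num_nodes explicitly so nodes that never appear as an edge endpoint
# are still part of the partition graph.
g = dgl.graph((src_t, dst_t), num_nodes=10)
print(g.num_nodes(), g.num_edges())  # 10 3
```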
-
- 14 Jun, 2022 1 commit
-
-
Rhett Ying authored
* [Dist] master port should be fixed for all trainers * add tests for tools/launch.py
-
- 09 Jun, 2022 1 commit
-
-
Rhett Ying authored
-
- 23 May, 2022 1 commit
-
- 19 May, 2022 1 commit
-
-
kylasa authored
[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926) * Initial implementation of the distributed data processing step in the distributed training pipeline Implemented the following: 1) Read the output of ParMETIS (node-id to partition-id mappings) 2) Read the original graph files 3) Shuffle the node/edge metadata and features 4) Output the partition-specific files in DGL format using convert_partition.py functionality 5) Graph metadata is serialized in json format on the rank-0 machine. * Bug fixes identified during verification of the dataset 1. When sending out global-id lookups for non-local nodes in msg_alltoall.py, a conditional filter was used to identify the indices in node_data, which is incorrect. Replaced the conditional filter with intersect1d to find the common node ids and the appropriate indices, which are later used to identify the information to communicate. 2. When writing the graph-level json file in distributed processing, the edge_offset on non-rank-0 machines was starting from 0 instead of the appropriate offset. Added code so that edges start from the correct offset instead of always 0. * Restructuring and consolidation of code 1) Fixed an issue when running verify_mag_dataset.py; now we read xxx_removed_edges.txt and add these edges to `edge_data`. This ensures that self-loops and duplicate edges are handled appropriately when compared to the original dataset. 2) Consolidated code into fewer files and changed code to follow the Python naming convention. * Code changes addressing code review comments The following changes are made in this commit. 1) A naming convention is defined and the code is changed accordingly. Definitions of the various global ids are given, along with how to read them. 2) All the code review comments are addressed 3) Files are moved to a new directory under dgl/tools as per the suggestion 4) A README.md file is included; it contains detailed information about the naming convention adopted by the code, a high-level overview of the algorithm used in data shuffling, and an example command line for a single machine. * addressing github review comments Made code changes addressing all the review comments from GitHub. * Addressing latest code review comments Addressed all the latest code review comments. One of the major changes is treating the node and edge metadata as dictionary objects and replacing all the Python lists with numpy arrays. * Update README.md Text rendering corrections * Addressed code review comments Addressed code review comments for the latest code review Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
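An illustration of the intersect1d-based lookup mentioned in bug fix 1 (toy arrays; not the code in msg_alltoall.py):

```python
import numpy as np

# Hypothetical data: global node ids this rank owns, and the ids another
# rank asked about. intersect1d returns both the common ids and the indices
# needed to pull the matching rows out of the local node_data.
local_gnids = np.array([3, 8, 15, 42, 77])
requested = np.array([8, 42, 100])

common, local_idx, req_idx = np.intersect1d(
    local_gnids, requested, return_indices=True)
print(common)     # [ 8 42]
print(local_idx)  # indices into local_gnids -> [1 3]
print(req_idx)    # indices into requested   -> [0 1]
```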
-
- 09 Feb, 2022 1 commit
-
-
Rhett Ying authored
* enable launching multiple client groups sequentially * launching simultaneously is also enabled * refine docstring * revert unnecessary change * [DOC] add doc for long-lived server * refine * refine doc * refine doc
-
- 08 Nov, 2021 1 commit
-
-
Rhett Ying authored
Remove self-loops and duplicate edges before ParMETIS and restore when converting to DGLGraph (#3472) * save self-loops and duplicated edges separately. * [BugFix] sort graph by dgl.ETYPE * fix bugs in verify script * fix verify logic * refine README Co-authored-by: Da Zheng <zhengda1936@gmail.com>
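A toy numpy sketch of splitting off self-loops and duplicate edges so they can be saved separately and restored later; illustrative only, not the script's implementation:

```python
import numpy as np

# Hypothetical edge list with one self-loop (2,2) and one duplicate (0,1).
src = np.array([0, 1, 2, 0, 3])
dst = np.array([1, 2, 2, 1, 0])

# Split off self-loops.
self_loop_mask = src == dst
kept_src, kept_dst = src[~self_loop_mask], dst[~self_loop_mask]
removed = np.stack([src[self_loop_mask], dst[self_loop_mask]], axis=1)

# Drop duplicate edges, remembering which rows were dropped so they can be
# re-added after partitioning (ParMETIS expects a simple graph).
edges = np.stack([kept_src, kept_dst], axis=1)
_, first_idx = np.unique(edges, axis=0, return_index=True)
dup_mask = np.ones(len(edges), dtype=bool)
dup_mask[first_idx] = False
removed = np.concatenate([removed, edges[dup_mask]], axis=0)
edges = edges[np.sort(first_idx)]
print(edges)    # unique, self-loop-free edges fed to ParMETIS
print(removed)  # saved separately and restored when building the DGLGraph
```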
-
- 23 Sep, 2021 1 commit
-
-
xiang song(charlie.song) authored
[Distributed] Allow user to pass in extra env parameters when launching a distributed training task. (#3375) * Allow user to pass in extra env parameters when launching a distributed training task. * Update * upd Co-authored-by: xiangsx <xiangsx@ip-10-3-59-214.eu-west-1.compute.internal>
-
- 14 Sep, 2021 1 commit
-
-
xiang song(charlie.song) authored
* put PYTHONPATH in server launch * remove prints Co-authored-by: xiangsx <xiangsx@ip-10-3-59-214.eu-west-1.compute.internal>
-
- 17 Aug, 2021 1 commit
-
-
Eric Kim authored
[Tools] In `tools/launch.py`, correctly pass all DGL client/server env vars if udf is a multi-command (#3245) * Correctly pass all DGL client/server env vars if udf is a multi-command * Refactor to use wrap_cmd_with_local_envvars() helper fn
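A sketch of the wrapping idea: run the (possibly multi-command) UDF inside a subshell where the variables are exported, so every command in the chain sees them. The helper shown is hypothetical and the env-var names are for illustration; it is not the exact wrap_cmd_with_local_envvars() in tools/launch.py:

```python
def wrap_cmd_with_envvars(cmd, env_vars):
    """Wrap a shell command so the given space-separated VAR=value pairs are
    exported for every command in a compound UDF like 'cd app && python x.py'."""
    return f"(export {env_vars}; {cmd})"

udf = "cd my_app && python3 train_dist.py --epochs 3"
print(wrap_cmd_with_envvars(udf, "DGL_ROLE=client DGL_SERVER_ID=0"))
# (export DGL_ROLE=client DGL_SERVER_ID=0; cd my_app && python3 train_dist.py --epochs 3)
```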
-
- 02 Aug, 2021 2 commits
-
-
Ankit Garg authored
* Added code rectifying (TypeError: unhashable type: 'slice') when copying a file * 1) added distributed preprocessing code to create ParMETIS input from CSV files 2) added code to run pm_dglpart on multiple machines 3) added support for recreating the heterogeneous graph from the homogeneous graph based on dropped edges, as ParMETIS currently only supports homogeneous graphs * move to pandas * Added comments and removed drop_duplicates as it was redundant * Addressed PR comments * Rename variable * Added comment * Added comment * updated README Co-authored-by:
Ankit Garg <gaank@amazon.com> Co-authored-by:
Da Zheng <zhengda1936@gmail.com>
-
Eric Kim authored
* Refactors torch dist launcher udf-wrap code to handle more python versions * minor changes
-
- 30 Jul, 2021 1 commit
-
-
Eric Kim authored
-
- 02 Jul, 2021 1 commit
-
-
ankit-garg authored
Co-authored-by:
Ankit Garg <gaank@amazon.com> Co-authored-by:
Da Zheng <zhengda1936@gmail.com>
-
- 26 May, 2021 1 commit
-
-
Da Zheng authored
* explicitly set the graph format. * fix. * fix. * fix launch script. * fix readme. Co-authored-by:
Zheng <dzzhen@3c22fba32af5.ant.amazon.com> Co-authored-by:
xiang song(charlie.song) <classicxsong@gmail.com> Co-authored-by:
Ubuntu <ubuntu@ip-172-31-71-112.ec2.internal>
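For reference, restricting a DGLGraph to one sparse format looks like this on a toy graph; whether the launch/partition scripts use exactly this call is not stated in the commit message above:

```python
import torch
import dgl

g = dgl.graph((torch.tensor([0, 1]), torch.tensor([1, 2])))
# Restrict the graph to a single sparse format so downstream code does not
# lazily materialize additional formats.
g = g.formats('csc')
print(g.formats())
```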
-
- 01 May, 2021 1 commit
-
-
Da Zheng authored
* kill training jobs. * update. * fix. Co-authored-by:
Zheng <dzzhen@3c22fba32af5.ant.amazon.com> Co-authored-by:
Ubuntu <ubuntu@ip-172-31-71-112.ec2.internal> Co-authored-by:
Ubuntu <ubuntu@ip-172-31-73-81.ec2.internal> Co-authored-by:
xiang song(charlie.song) <classicxsong@gmail.com>
-
- 08 Apr, 2021 1 commit
-
-
Da Zheng authored
Co-authored-by:
Ubuntu <ubuntu@ip-172-31-73-81.ec2.internal> Co-authored-by:
Jinjing Zhou <VoVAllen@users.noreply.github.com>
-
- 04 Apr, 2021 1 commit
-
-
Da Zheng authored
* set omp thread. * add comment. * fix.
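A sketch of the idea of setting the OpenMP thread count per trainer; the divisor and variable names are illustrative, not the formula used by the launcher:

```python
import multiprocessing

# Divide the cores among the trainers on a machine so OpenMP ops in each
# trainer do not oversubscribe the CPU (numbers are illustrative).
num_trainers_per_machine = 4
omp_threads = max(multiprocessing.cpu_count() // (2 * num_trainers_per_machine), 1)
env = f"OMP_NUM_THREADS={omp_threads}"
print(env)  # would be prepended to the training command by a launcher
```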
-
- 30 Mar, 2021 1 commit
-
-
Da Zheng authored
* remove num_workers. * remove num_workers. * remove num_workers. * remove num-servers. * update error message. * update docstring. * fix docs. * fix tests. * fix test. * fix. * print messages in test. * fix. * fix test. * fix. Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-132.us-west-1.compute.internal>
-
- 22 Mar, 2021 1 commit
-
-
Da Zheng authored
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
-
- 25 Feb, 2021 1 commit
-
-
Da Zheng authored
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-132.us-west-1.compute.internal>
-
- 09 Feb, 2021 1 commit
-
-
Da Zheng authored
* add convert. * fix. * add write_mag. * fix convert_partition.py * write data. * use pyarrow to read. * update write_mag.py * fix convert_partition.py. * load node/edge features when necessary. * reshuffle nodes. * write mag correctly. * fix a bug: inner nodes in a partition might be empty. * fix bugs. * add verify code. * insert reverse edges. * fix a bug. * add get node/edge data. * add instructions. * remove unnecessary argument. * update distributed preprocessing. * fix readme. * fix. * fix. * fix. * fix readme. * fix doc. * fix. * update readme * update doc. * update readme. Co-authored-by:
Ubuntu <ubuntu@ip-172-31-9-132.us-west-1.compute.internal> Co-authored-by:
Ubuntu <ubuntu@ip-172-31-2-202.us-west-1.compute.internal>
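A toy sketch of the "insert reverse edges" step from the commit above (illustrative only):

```python
import numpy as np

# Hypothetical directed edge list; each edge also gets its reverse so the
# graph can be treated as undirected during partitioning.
src = np.array([0, 1, 2])
dst = np.array([1, 2, 0])
src_all = np.concatenate([src, dst])
dst_all = np.concatenate([dst, src])
print(np.stack([src_all, dst_all], axis=1))
```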
-
- 15 Sep, 2020 1 commit
-
-
Chao Ma authored
* update * update
-
- 27 Aug, 2020 1 commit
-
-
Chao Ma authored
* check num_workers * update * update * update * update * update * update
-
- 13 Aug, 2020 1 commit
-
-
Chao Ma authored
* update * update * update * update * update * update * update * update * update * update * update * update * update * update
-
- 12 Aug, 2020 2 commits