Commits · 1947d87dd77eabe5893e277a52ecf0f9eb2f1063 · OpenDAS / dgl

23 Aug, 2022 1 commit
- fix unpinning when tensoradaptor is not available (#4450) · 1947d87d
  Xin Yao authored Aug 23, 2022
  
  1947d87d
22 Aug, 2022 3 commits
- [Doc] Change random.py to random_partition.py in guide on distributed partition pipeline (#4438) · 7a41c126
  Mufei Li authored Aug 22, 2022
```
* Update distributed-preprocessing.rst

* Update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-26.ap-northeast-1.compute.internal>
```
  7a41c126
- [Distributed][Feature] New distributed partitioning pipeline (#4439) · ad7be8be
  Minjie Wang authored Aug 22, 2022
  
  ad7be8be
- Merge branch 'master' into dist_part · 7e2ed9f8
  Minjie Wang authored Aug 22, 2022
  
  7e2ed9f8
21 Aug, 2022 1 commit
- Let distributed training launch script report error when any trainer or kvserver fails. (#4437) · ee672c0b
  xiang song(charlie.song) authored Aug 21, 2022
```
* Collect error reports

* update

* fix
Co-authored-by: root <root@ip-10-0-80-128.ec2.internal>
```
  ee672c0b
20 Aug, 2022 1 commit
- Merge branch 'master' into dist_part · 2cf4bd0a
  Minjie Wang authored Aug 20, 2022
  
  2cf4bd0a
19 Aug, 2022 2 commits

[Dist][CI] Unit test for the new distributed partitioning pipeline (#4394) · 2e8ae9f9

Mufei Li authored Aug 19, 2022



* chunked graph data format

* Update

* Update

* Update task_distributed_test.sh

* Update

* Update

* Revert "Update"

This reverts commit 03c461870f19375fb03125b061fc853ab555577f.

* Update

* Update

* ssh-keygen

* CI

* install openssh

* openssh

* Update

* CI

* Update

* Update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-53-142.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-87.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-21.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-26.ap-northeast-1.compute.internal>

2e8ae9f9

[EXAMPLE] Add multi gpu graph prediction GIN+virtualnode example (#4385) · d077d371
peizhou001 authored Aug 19, 2022
```
* add multigpu folder for related examples
```
d077d371

18 Aug, 2022 5 commits
- [Feature] Rework Dataloader cpu affinitization as helper method (#4126) · 47993776
  Daniil Sizov authored Aug 18, 2022
```
* Add helper method for temporary affinitization of compute threads

* Rework DL affinitization as single helper

* Add example usage in benchmarks

* Fix python linter warnings

* Fix affinity helper params

* Use NUMA node 0 cores only by default

* Fix benchmarks

* Fix lint errors
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
```
  47993776
- [Doc] Misc Fix for User Guide 7.1 Data Preprocessing (#4433) · bc2fef9c
  Mufei Li authored Aug 18, 2022
```
* Update

* rollback for partition_algo/random.py
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-21.us-west-2.compute.internal>
```
  bc2fef9c
- [CI] enable ssh in docker image for dist test (#4432) · b039ea99
  Rhett Ying authored Aug 18, 2022
  
  b039ea99
- [CI] enable ssh in docker image for dist test (#4432) · d248e768
  Rhett Ying authored Aug 18, 2022
  
  d248e768
- [Example][Bug] fix device index for dist train on GPUs (#4403) · e14860d9
  Rhett Ying authored Aug 18, 2022
  
  e14860d9
17 Aug, 2022 4 commits

Distributed Lookup service implementation to retrieve node-level mappings (#4387) · f51b31b2

kylasa authored Aug 17, 2022

* Distributed Lookup service which is for retrieving global_nids to shuffle-global-nids/partition-id mappings

1. Implemented a class to provide distributed lookup service
2. This class can be used to retrieve global-nids mappings

* Code changes to address CI comments.

1. Removed some unneeded type_casts to numpy.int64
2. Added additional comments when iterating over the partition-ids list.
3. Added docstring to the class and adjusted comments where relevant.

* Updated code comments and variable names...

1. Changed the variable names to appropriately represent the values stored in these variables.
2. Updated the docstring correctly.

* Corrected docstring as per the suggestion... and removed all the capital letters for Global nids and Shuffle Global nids...

* Addressing CI review comments.

f51b31b2
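
The commit message above does not describe the lookup class's interface, so the following is only a rough single-process sketch of the general idea of a sharded lookup from global node IDs to partition IDs. The class name `ToyLookupService`, its methods, and the contiguous-range ownership scheme are all assumptions made for illustration, not the implementation in this commit.

```python
class ToyLookupService:
    """Hypothetical sketch: each rank owns a contiguous range of global nids
    and stores the partition id for the nids it owns."""

    def __init__(self, partition_ids, world_size):
        # partition_ids[g] is the partition id of global nid g.
        self.world_size = world_size
        # Ceil division so every nid falls into exactly one shard.
        self.chunk = max(1, -(-len(partition_ids) // world_size))
        self.shards = [
            partition_ids[r * self.chunk:(r + 1) * self.chunk]
            for r in range(world_size)
        ]

    def owner(self, global_nid):
        """Rank that holds the mapping for this global nid."""
        return global_nid // self.chunk

    def get_partition_ids(self, global_nids):
        """Resolve each query by asking the owning shard.
        In a real distributed setting this would be a message exchange."""
        result = []
        for g in global_nids:
            r = self.owner(g)
            result.append(self.shards[r][g - r * self.chunk])
        return result


service = ToyLookupService([0, 1, 1, 0, 2, 2, 1], world_size=3)
looked_up = service.get_partition_ids([6, 0, 4])
```

Here each query is routed to the shard that owns the node ID, which keeps any single process from holding the full mapping.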

[CI] upgrade python version to 3.7.0 (#4406) · 8d3c5820

Rhett Ying authored Aug 17, 2022

* [CI] upgrade python version to 3.7.0

* do not upgrade for mxnet cpu due to seg fault

* fix test failure for mxnet

8d3c5820

[CI] upgrade python version to 3.7.0 (#4406) · cf4727a9

Rhett Ying authored Aug 17, 2022

* [CI] upgrade python version to 3.7.0

* do not upgrade for mxnet cpu due to seg fault

* fix test failure for mxnet

cf4727a9

[Doc] Update distributed chapter according to new pipeline (#4275) · 3bcb268a

Minjie Wang authored Aug 17, 2022

* dist index chapter

* preproc chapter

* rst

* tools page

* partition chapter

* rst

* hetero chapter

* 7.1 step1

* add parmetis back

* changed based on feedback

* address comments

3bcb268a

16 Aug, 2022 1 commit

[Feature] enable graph partition book support canonical etypes (#4343) · 39987bc5

Rhett Ying authored Aug 16, 2022

* [Feature] enable graph partition book support canonical etypes

* fix lint

* fix lint

* add todo

* refine according to review comments

* fix lint

* refine naming

* revert PartitionPolicy __init__

* refine docstring

* fix doc string

39987bc5

15 Aug, 2022 1 commit
- [Bugfix] Fix pinning empty tensors and graphs (#4393) · 3685000a
  Xin Yao authored Aug 15, 2022
  
  3685000a
13 Aug, 2022 1 commit

[Example] NGNN for ogbl (#4328) · 49c81795

Ereboas authored Aug 13, 2022



* NGNN for ogbl

* modify doc organization.

* merge similar parts

* 1st approving review.

* minor changes

* Remove the "Usage" section.
Co-authored-by: Mufei Li <mufeili1996@gmail.com>

49c81795

12 Aug, 2022 2 commits
- [Performance] Improve the performance of SpMMCsr by reconfiguration (#4363) · 2523bc7a
  Xin Yao authored Aug 12, 2022
```
* Change CUDA_MAX_NUM_THREADS to 256

* change the configuration of grid
```
  2523bc7a
- Revert "[Dist] New distributed data preparation pipeline (#4386)" (#4391) · 18d89b5d
  Minjie Wang authored Aug 12, 2022
```
This reverts commit 71ce1749.
```
  18d89b5d
11 Aug, 2022 4 commits

[Dist] New distributed data preparation pipeline (#4386) · 71ce1749

Minjie Wang authored Aug 11, 2022

* code changes for bug fixes identified during mag_lsc dataset (#4187)

* code changes for bug fixes identified during mag_lsc dataset

1. Changed from calling torch.Tensor() to torch.from_numpy() to address memory corruption issues when creating large tensors. The tricky part is that the old call works correctly for small tensors.
2. Changed the dgl.graph() function call to include the "num_nodes" argument to explicitly specify all the nodes in a graph partition.

* Update convert_partition.py

Moving the changes to the "create_metadata_json" function to the "multiple-file-format" support, where this change is more appropriate, since multiple-machine testing was done with these code changes.

* Addressing review comments.

Removed space as suggested at the end of the line

* Revert "Revert "[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)" (#4037)"

This reverts commit 7c598aac.

* Added code to support multiple-file-support feature and removed singl… (#4188)

* Added code to support multiple-file-support feature and removed single-file-support code

1. Added code to read dataset in multiple-file-format
2. Removed code for single-file format

* added files missing in the previous commit

This commit includes dataset_utils.py, which reads the dataset in multiple-file-format, gloo_wrapper function calls to support exchanging dictionaries as objects and helper functions in utils.py

* Update convert_partition.py

Updated the "create_metadata_json" function call to include partition_id so that each rank only creates its own metadata object; these are later accumulated on rank-0 to create the graph-level metadata json file.

* addressing code review comments during the CI process

code changes resulting from the code review comments received during the CI process.

* Code reorganization

Addressing CI comments and code reorganization for easier understanding.

* Removed commented out line

removed commented out line.

* Support new format for multi-file support in distributed partitioning. (#4217)

* Code changes for the following

1. Generating node data at each process
2. Reading csv files using pyarrow
3. feature complete code.

* Removed some typos that were causing unit tests to fail

1. Change the file name to correct file name when loading edges from file
2. When storing node-features after shuffling, use the correct key to store the global-nids of node features which are received after transmission.

* Code changes to address CI comments by reviewers

1. Removed some redundant code and added text in the doc-strings to describe the functionality of some functions.
2. Function signatures and invocations now match w.r.t. argument list.
3. Added detailed description of the metadata json structure so that users understand the type of information present in this file and how it is used throughout the code.

* Addressing code review comments

1. Addressed all the CI comments and some of the changes include simplifying the code related to the concatenation of lists and enhancing the docstrings of functions which are changed in this process.

* Update docstrings of two functions appropriately in response to code review comments

Removed "todo" from the docstring of the gen_nodedata function.
Added "todo" to the gen_dist_partitions function when node-id to partition-id's are read for the first time.

Removed 'num-node-weights' from the docstring for the get_dataset function and added schema_map docstring to the argument list.

* [Distributed] Change for the new input format for distributed partitioning (#4273)

* Code changes to address the updated file format support for massively large graphs.

1. Updated the docstring for the starting function "gen_dist_partitions" to describe the newly proposed file format for input dataset.
2. Code which was dependent on the structure of the old-metadata json object has been updated to read from the newly proposed metadata file.
3. Fixed some errors when appropriate functions were invoked and the calling function expects return values from the invoked function.
4. This modified code has been tested on "mag" dataset using 4-way partitions and verified the results

* Code changes to address the CI review comments

1. Improved docstrings for some functions.
2. Added a new function in the utils.py to compute the id ranges and this is used in multiple places.

* Added TODO to indicate the redundant data structure.

Because of the new file format changes, one of the dictionaries (node_feature_tids, node_tids) will be redundant. Added TODO text so that this will be removed in the next iteration of code changes.

* [Distributed] use alltoall fix to bypass gloo - alltoallv bug in distributed partitioning (#4311)

* Alltoall Fix to bypass gloo - alltoallv bug which is preventing further testing

1. Replaced alltoallv gloo wrapper call with alltoall message.
2. All the messages are padded to be of the same length.
3. Receiving side unpads the messages and continues processing.

* Code changes to address CI comments

1. Removed unused functions from gloo_wrapper.py
2. Changed the function signature of alltoallv_cpu_data as suggested.
3. Added docstring to include more description of the functionality inside alltoallv_cpu_data. Included more asserts to validate the assumptions.

* Changed the function name appropriately

Changed the function name from "alltoallv_cpu_data" to "alltoallv_cpu", which I believe is appropriate because the underlying functionality provides alltoallv, which is basically alltoall_cpu + padding.

* Added code and text to address the review comments.

1. Changed the function name to indicate the local use of this function.
2. Changed docstring to indicate the assumptions made by alltoallv_cpu function.

* Removed unused function from import statement

Removed unused/removed function from import statement.

* [Distributed] reduce memory consumption in distributed graph partitioning. (#4338)

* Fix for node_subgraph function, which seems to generate segmentation fault for very large partitions

1. Removed three graph dgl objects and we create the final dgl object directly by maintaining the following constraints
a) nodes are reordered so that local nodes are placed in the beginning of the nodes list compared to non-local nodes.
b) Edge order is maintained as passed into this function.
c) src/dst end points are mapped to target values based on the reshuffled node order.

* Code changes addressing CI comments for this PR

1. Used Da's suggested map to map nodes from old to new order.
This is much simpler and mem. efficient.

* Addressing CI Comments.

1. Reduced the amount of documentation to reflect the actual implementation.
2. Named the mapping object appropriately.

* [Distributed] Graph chunking UX (#4365)

* first commit

* update

* huh

* fix

* update

* revert core

* fix

* update

* rewrite

* oops

* address comments

* add graph name

* address comments

* remove sample metadata file

* address comments

* fix

* remove

* add docs

* Adding launch script and wrapper script to trigger distributed graph … (#4276)

* Adding launch script and wrapper script to trigger distributed graph partitioning pipeline as defined in the UX document

1. dispatch_data.py is a wrapper script which builds the command and triggers the distributed partitioning pipeline
2. distgraphlaunch.py is the main python script which triggers the pipeline and to simplify its usage dispatch_data.py is included as a wrapper script around it.

* Added code to auto-detect python version and retrieve some parameters from the input metadata json file

1. Auto detect python version
2. Read the metadata json file and extract some parameters to pass to the user defined command which is used to trigger the pipeline.

* Updated the json file name to metadata.json file per UX documentation

1. Renamed json file name per UX documentation.

* address comments

* fix

* fix doc

* use unbuffered logging to cure anxiety

* cure more anxiety

* Update tools/dispatch_data.py
Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>

* oops
Co-authored-by: Quan Gan <coin2028@hotmail.com>
Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>
Co-authored-by: kylasa <kylasa@gmail.com>
Co-authored-by: Da Zheng <zhengda1936@gmail.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>

71ce1749

Merge branch 'master' into dist_part · e9b624fe
Minjie Wang authored Aug 11, 2022

e9b624fe

Adding launch script and wrapper script to trigger distributed graph … (#4276) · 8086d1ed

kylasa authored Aug 11, 2022



* Adding launch script and wrapper script to trigger distributed graph partitioning pipeline as defined in the UX document

1. dispatch_data.py is a wrapper script which builds the command and triggers the distributed partitioning pipeline
2. distgraphlaunch.py is the main python script which triggers the pipeline and to simplify its usage dispatch_data.py is included as a wrapper script around it.

* Added code to auto-detect python version and retrieve some parameters from the input metadata json file

1. Auto detect python version
2. Read the metadata json file and extract some parameters to pass to the user defined command which is used to trigger the pipeline.

* Updated the json file name to metadata.json file per UX documentation

1. Renamed json file name per UX documentation.

* address comments

* fix

* fix doc

* use unbuffered logging to cure anxiety

* cure more anxiety

* Update tools/dispatch_data.py
Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>

* oops
Co-authored-by: Quan Gan <coin2028@hotmail.com>
Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>

8086d1ed

[Distributed] Graph chunking UX (#4365) · 067cd744

Quan (Andy) Gan authored Aug 11, 2022

* first commit

* update

* huh

* fix

* update

* revert core

* fix

* update

* rewrite

* oops

* address comments

* add graph name

* address comments

* remove sample metadata file

* address comments

* fix

* remove

* add docs

067cd744

10 Aug, 2022 4 commits

[Example]rgcn-ogbn-mag (#4331) · a88e7f7e

YJ-Zhao authored Aug 10, 2022



* rgcn-ogbn-mag

* Add link in README.md

* correct code format, add the reset_parameters function to the HeteroEmbedding module

* add the annotation in hetero.py

* add a unit test

* modify format

* Update
Co-authored-by: Mufei Li <mufeili1996@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-50-143.us-west-2.compute.internal>

a88e7f7e

[Example][Bugfix] Fix infograph example (#4298) · 919b7838

Chang Liu authored Aug 10, 2022



* Fix infograph example

* Update

* Revert the changes and update Doc

* Update

* Split lines to pass CI-lint

* Update

* Update
Co-authored-by: Mufei Li <mufeili1996@gmail.com>

919b7838

Update issue templates · 14a77c86
Minjie Wang authored Aug 10, 2022

14a77c86
Update issue templates · 91f4eee0
Minjie Wang authored Aug 10, 2022

91f4eee0

09 Aug, 2022 2 commits
- [Bug] Fix broken static_assert (#4342) · 182e1ad5
  Xin Yao authored Aug 09, 2022
  
  182e1ad5
- [Bug] A bunch of fixes in edge_softmax_hetero (#4336) · 62c827c8
  Quan (Andy) Gan authored Aug 09, 2022
```
* bunch of fixes

* Update test_edge_softmax_hetero.py

* Update test_edge_softmax_hetero.py
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
```
  62c827c8
07 Aug, 2022 2 commits

[Distributed] reduce memory consumption in distributed graph partitioning. (#4338) · 60bc0b76

kylasa authored Aug 07, 2022

* Fix for node_subgraph function, which seems to generate segmentation fault for very large partitions

1. Removed three graph dgl objects and we create the final dgl object directly by maintaining the following constraints
a) nodes are reordered so that local nodes are placed in the beginning of the nodes list compared to non-local nodes.
b) Edge order is maintained as passed into this function.
c) src/dst end points are mapped to target values based on the reshuffled node order.

* Code changes addressing CI comments for this PR

1. Used Da's suggested map to map nodes from old to new order.
This is much simpler and mem. efficient.

* Addressing CI Comments.

1. Reduced the amount of documentation to reflect the actual implementation.
2. Named the mapping object appropriately.

60bc0b76
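
The constraints listed in this commit (local nodes first, edge order preserved, endpoints remapped through an old-to-new map) can be illustrated with a small pure-Python sketch. The function name `reorder_local_first` and its arguments are hypothetical and chosen for illustration; the actual change works on DGL graph partitions and tensors rather than Python lists.

```python
def reorder_local_first(node_ids, is_local, src, dst):
    """Hypothetical sketch: place local nodes before non-local nodes and
    remap edge endpoints to positions in the new node order.

    node_ids : list of node ids appearing in the partition
    is_local : list of bools, True if the node is owned by this partition
    src, dst : edge endpoint lists expressed in the original node ids
    """
    local = [n for n, flag in zip(node_ids, is_local) if flag]
    non_local = [n for n, flag in zip(node_ids, is_local) if not flag]
    new_order = local + non_local

    # A single old-id -> new-position map is all that is needed,
    # so no intermediate graph objects have to be built.
    old_to_new = {old: new for new, old in enumerate(new_order)}

    # Edge order is untouched; only the endpoint values change.
    new_src = [old_to_new[s] for s in src]
    new_dst = [old_to_new[d] for d in dst]
    return new_order, new_src, new_dst


order, new_src, new_dst = reorder_local_first(
    node_ids=[10, 11, 12, 13],
    is_local=[False, True, False, True],
    src=[10, 11, 13],
    dst=[11, 12, 10],
)
```

Building the final endpoint lists directly from one map avoids materializing intermediate graphs, which is the memory saving the commit describes.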

[Bugfix] Fix the default value of `num_bases` in RelGraphConv module (#4321) · 5ba5106a
Chang Liu authored Aug 07, 2022
```
* Fix doc and default settings for RelGraphConv

* Add unit test

* Split msg in two lines to pass CI-lint
```
5ba5106a

06 Aug, 2022 1 commit

[Distributed] use alltoall fix to bypass gloo - alltoallv bug in distributed partitioning (#4311) · c1e01b1d

kylasa authored Aug 05, 2022

* Alltoall Fix to bypass gloo - alltoallv bug which is preventing further testing

1. Replaced alltoallv gloo wrapper call with alltoall message.
2. All the messages are padded to be of the same length.
3. Receiving side unpads the messages and continues processing.

* Code changes to address CI comments

1. Removed unused functions from gloo_wrapper.py
2. Changed the function signature of alltoallv_cpu_data as suggested.
3. Added docstring to include more description of the functionality inside alltoallv_cpu_data. Included more asserts to validate the assumptions.

* Changed the function name appropriately

Changed the function name from "alltoallv_cpu_data" to "alltoallv_cpu", which I believe is appropriate because the underlying functionality provides alltoallv, which is basically alltoall_cpu + padding.

* Added code and text to address the review comments.

1. Changed the function name to indicate the local use of this function.
2. Changed docstring to indicate the assumptions made by alltoallv_cpu function.

* Removed unused function from import statement

Removed unused/removed function from import statement.

c1e01b1d
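
The pad, exchange, unpad steps described in this commit can be sketched in a single process as follows. This is a simulation under stated assumptions, not the code in gloo_wrapper.py: `simulated_alltoall` and `alltoallv_via_padding` are hypothetical names, and the "ranks" are just list indices. The extra exchange of true message lengths is also an assumption about how the receiver knows where the padding starts.

```python
def simulated_alltoall(send_buffers):
    """Stand-in for a collective alltoall: send_buffers[src][dst] ends up
    at position [dst][src] of the result."""
    world = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(world)]
            for dst in range(world)]


def alltoallv_via_padding(outgoing, pad_value=0):
    """Emulate a variable-length alltoallv using only fixed-length alltoall.

    outgoing[src][dst] is the (variable-length) list rank src sends to dst.
    """
    world = len(outgoing)

    # Exchange the true lengths so receivers know how much to keep.
    sizes = [[len(msg) for msg in row] for row in outgoing]
    recv_sizes = simulated_alltoall(sizes)

    # All ranks agree on one common length (a max-reduction in practice).
    max_len = max(max(row) for row in sizes)

    # Step 1: pad every message to the common length.
    padded = [[list(msg) + [pad_value] * (max_len - len(msg)) for msg in row]
              for row in outgoing]

    # Step 2: fixed-size exchange.
    received = simulated_alltoall(padded)

    # Step 3: the receiving side strips the padding.
    return [[received[dst][src][:recv_sizes[dst][src]] for src in range(world)]
            for dst in range(world)]


outgoing = [
    [[1], [2, 3]],       # rank 0 sends [1] to rank 0 and [2, 3] to rank 1
    [[4, 5, 6], []],     # rank 1 sends [4, 5, 6] to rank 0 and nothing to rank 1
]
incoming = alltoallv_via_padding(outgoing)
```

The trade-off is extra bandwidth for the padding in exchange for relying only on the fixed-size collective.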

03 Aug, 2022 1 commit
- [BugFix] fix etype check in DistGraph.edge_subgraph (#4322) · 43ba94ee
  Rhett Ying authored Aug 03, 2022
  
  43ba94ee
02 Aug, 2022 1 commit
- [Unittest] Improve test_dataloader (#4301) · 463650a7
  Xin Yao authored Aug 02, 2022
```
* test ddp dataloader

* add pure_gpu for edgedataloader

* resolve ddp issue
```
  463650a7
01 Aug, 2022 3 commits
- [BugFix] enable DistGraph.find_edge() works with str or tuple of str (#4319) · 4dd16f5d
  Rhett Ying authored Aug 01, 2022
  
  4dd16f5d
- [Feature] Enable UVA for Weighted Samplers (#4314) · 44b68641
  Xin Yao authored Aug 01, 2022
```
* enable use for weighted neighbor sampler and biased random walk

* add unit tests

* fix for mxnet/tf

* fix typo
```
  44b68641
- [Example][Refactor] Refactor GIN example (#4280) · 9a16a5e0
  Chang Liu authored Jul 31, 2022
```
* Refactor GIN example

* Update

* Update README

* Minor update

* README update
Co-authored-by: Mufei Li <mufeili1996@gmail.com>
```
  9a16a5e0