1. 19 May, 2022 1 commit
    • [Distributed Training Pipeline] Initial implementation of the distributed data processing step in the Distributed Training pipeline (#3926) · 4b87e47f
      Authored by kylasa
      
      * Initial implementation of the distributed data processing step in the Distributed Training pipeline
      
      Implemented the following (a sketch of the shuffling step follows this list):
      1) Read the output of parmetis (node-id to partition-id mappings).
      2) Read the original graph files.
      3) Shuffle the node/edge metadata and features.
      4) Output the partition-specific files in DGL format using the convert_partition.py functionality.
      5) Serialize the graph metadata in JSON format on the rank-0 machine.
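
      For illustration, a minimal, hypothetical sketch of the shuffling step (3): the function name and the edge-ownership rule (an edge goes to the partition that owns its destination node) are assumptions for this sketch, not the actual module API.

      ```python
      import json
      import numpy as np

      def shuffle_for_partitions(node_part, edge_src, edge_dst, node_feats, num_parts):
          """Group nodes, edges, and features by their owning partition (step 3)."""
          parts = []
          for p in range(num_parts):
              owned = np.nonzero(node_part == p)[0]    # global node ids owned by p
              emask = node_part[edge_dst] == p         # edges follow their dst node
              parts.append({
                  "node_ids": owned,
                  "edges": (edge_src[emask], edge_dst[emask]),
                  "feats": node_feats[owned],
              })
          return parts

      # Toy stand-ins for steps 1-2 (parmetis output and the original graph files).
      node_part = np.array([0, 0, 1, 1])               # node-id -> partition-id
      src = np.array([0, 1, 2, 3])
      dst = np.array([1, 2, 3, 0])
      feats = np.arange(8.0).reshape(4, 2)

      parts = shuffle_for_partitions(node_part, src, dst, feats, num_parts=2)

      # Step 5: only rank 0 would serialize the graph-level metadata as JSON.
      print(json.dumps({"num_parts": 2, "num_nodes": 4, "num_edges": 4}))
      ```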
      
      * Bug fixes identified during verification of the dataset
      
      1. When sending out global-id lookups for non-local nodes in msg_alltoall.py, a conditional filter was used to identify the indices in node_data, which was incorrect. Replaced the conditional filter with intersect1d to find the common node ids and the corresponding indices, which are later used to identify the information that needs to be communicated.
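
      For reference, np.intersect1d with return_indices=True returns both the common ids and the indices into each input array (toy ids below, not the actual node_data layout):

      ```python
      import numpy as np

      # Global node ids held locally vs. ids requested by a remote rank.
      local_ids = np.array([10, 11, 12, 13, 14])
      requested = np.array([12, 14, 99])

      # intersect1d yields the common ids plus indices into each input array,
      # so the matching rows of node_data can be gathered directly.
      common, local_idx, req_idx = np.intersect1d(local_ids, requested,
                                                  return_indices=True)
      print(common)     # [12 14]
      print(local_idx)  # [2 4] -> rows of local node_data to send back
      ```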
      
      2. When writing the graph-level json file in distributed processing, the edge_offset on non-rank-0 machines started from 0 instead of the appropriate offset. Added code so that edge numbering starts from the correct offset rather than always from 0.
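
      The correct starting offset for each rank is an exclusive prefix sum over the per-rank edge counts; a minimal sketch (toy counts, assuming the counts have already been gathered across ranks):

      ```python
      import numpy as np

      # Hypothetical per-rank edge counts (e.g. collected with an allgather).
      edge_counts = np.array([1000, 1200, 900, 1100])

      # Exclusive prefix sum: each rank numbers its edges starting at the sum
      # of the counts on all lower ranks, instead of starting at 0.
      edge_offsets = np.concatenate(([0], np.cumsum(edge_counts)[:-1]))
      print(edge_offsets)  # [   0 1000 2200 3100]
      ```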
      
      * Restructuring and consolidation of code
      
      1) Fixed an issue when running verify_mag_dataset.py: we now read xxx_removed_edges.txt and add these edges to `edge_data`. This ensures that self-loops and duplicate edges are handled appropriately when compared with the original dataset.
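
      A minimal sketch of the fix, assuming each line of the removed-edges file holds a "src dst" pair (toy arrays below; the xxx_ file-name prefix above is a placeholder):

      ```python
      import numpy as np

      src = np.array([0, 1, 2])          # existing edge_data endpoints
      dst = np.array([1, 2, 0])
      removed = np.array([[0, 0],        # a self-loop dropped earlier
                          [1, 2]])       # a duplicate edge dropped earlier

      # Append the removed edges back so edge counts match the original dataset.
      src = np.concatenate([src, removed[:, 0]])
      dst = np.concatenate([dst, removed[:, 1]])
      print(len(src))  # 5
      ```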
      
      2) Consolidated the code into fewer files and changed it to follow the Python naming convention.
      
      * Code changes addressing code review comments
      
      The following changes are made in this commit:
      1) A naming convention is defined and the code is changed accordingly. The various global_ids are defined, along with how to read them.
      2) All code review comments are addressed.
      3) Files are moved to a new directory under dgl/tools, as suggested.
      4) A README.md file is included; it contains detailed information about the naming convention adopted by the code, a high-level overview of the data-shuffling algorithm, and an example command line for use on a single machine.
      
      * Addressing GitHub review comments
      
      Made code changes addressing all the review comments from GitHub.
      
      * Addressing latest code review comments
      
      Addressed all of the latest code review comments. One of the major changes is treating the node and edge metadata as dictionary objects and replacing all Python lists with numpy arrays.
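
      A minimal sketch of the convention, with illustrative keys (not the actual field names): node/edge metadata as dictionaries of numpy arrays, so lookups are vectorized instead of per-element list scans.

      ```python
      import numpy as np

      node_data = {
          "global_id": np.array([0, 1, 2, 3], dtype=np.int64),
          "ntype":     np.array([0, 0, 1, 1], dtype=np.int64),
          "part_id":   np.array([0, 1, 0, 1], dtype=np.int64),
      }

      # Vectorized selection replaces a Python-level loop over lists.
      mask = node_data["part_id"] == 0
      print(node_data["global_id"][mask])  # [0 2]
      ```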
      
      * Update README.md
      
      Text rendering corrections
      
      * Addressed code review comments
      
      Addressed comments from the latest code review.
      Co-authored-by: xiang song (charlie.song) <classicxsong@gmail.com>
  2. 09 Feb, 2022 1 commit
  3. 08 Nov, 2021 1 commit
  4. 23 Sep, 2021 1 commit
  5. 14 Sep, 2021 1 commit
  6. 17 Aug, 2021 1 commit
  7. 02 Aug, 2021 2 commits
  8. 30 Jul, 2021 1 commit
  9. 02 Jul, 2021 1 commit
  10. 26 May, 2021 1 commit
  11. 01 May, 2021 1 commit
  12. 08 Apr, 2021 1 commit
  13. 04 Apr, 2021 1 commit
  14. 30 Mar, 2021 1 commit
  15. 22 Mar, 2021 1 commit
  16. 25 Feb, 2021 1 commit
  17. 09 Feb, 2021 1 commit
    • [Distributed] Distributed METIS partition (#2576) · e4ff4844
      Authored by Da Zheng

      * add convert.
      
      * fix.
      
      * add write_mag.
      
      * fix convert_partition.py
      
      * write data.
      
      * use pyarrow to read (see the sketch after this entry).
      
      * update write_mag.py
      
      * fix convert_partition.py.
      
      * load node/edge features when necessary.
      
      * reshuffle nodes.
      
      * write mag correctly.
      
      * fix a bug: inner nodes in a partition might be empty.
      
      * fix bugs.
      
      * add verify code.
      
      * insert reverse edges.
      
      * fix a bug.
      
      * add get node/edge data.
      
      * add instructions.
      
      * remove unnecessary argument.
      
      * update distributed preprocessing.
      
      * fix readme.
      
      * fix.
      
      * fix.
      
      * fix.
      
      * fix readme.
      
      * fix doc.
      
      * fix.
      
      * update readme
      
      * update doc.
      
      * update readme.
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-132.us-west-1.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-2-202.us-west-1.compute.internal>
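
      On the "use pyarrow to read" step above: a minimal sketch of reading a large delimited edge file with pyarrow's CSV reader (the file name and column layout are assumptions, not the actual input format):

      ```python
      from pyarrow import csv

      # Read a whitespace-delimited edge list; pyarrow parses large text
      # files far faster than plain Python line-by-line IO.
      opts = csv.ReadOptions(column_names=["src", "dst"])
      parse = csv.ParseOptions(delimiter=" ")
      table = csv.read_csv("edges.txt", read_options=opts, parse_options=parse)

      src = table.column("src").to_numpy()
      dst = table.column("dst").to_numpy()
      ```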
  18. 15 Sep, 2020 1 commit
  19. 27 Aug, 2020 1 commit
  20. 13 Aug, 2020 1 commit
  21. 12 Aug, 2020 2 commits
  22. 11 Aug, 2020 2 commits
  23. 10 Aug, 2020 1 commit
  24. 09 Aug, 2020 1 commit
  25. 08 Aug, 2020 1 commit
  26. 31 Jul, 2020 1 commit
  27. 27 Jul, 2020 1 commit
  28. 17 Jul, 2020 1 commit
  29. 16 Jul, 2020 1 commit
    • [Distributed] Distributed launching script (#1772) · ca9d3216
      Authored by Chao Ma
      
      
      * update (×32)
      
      * fix launch script.
      Co-authored-by: Da Zheng <zhengda1936@gmail.com>
  30. 03 May, 2020 1 commit
    • [Feature] Distributed graph store (#1383) · 2190c39d
      Authored by Da Zheng

      * initial version from distributed training.
      
      This is copied from multiprocessing training.
      
      * modify for distributed training.
      
      * it's runnable now.
      
      * measure time in neighbor sampling.
      
      * simplify neighbor sampling.
      
      * fix a bug in distributed neighbor sampling.
      
      * allow single-machine training.
      
      * fix a bug.
      
      * fix a bug.
      
      * fix openmp.
      
      * make some improvement.
      
      * fix.
      
      * add prepare in the sampler.
      
      * prepare nodeflow async.
      
      * fix a bug.
      
      * get id.
      
      * simplify the code.
      
      * improve.
      
      * fix partition.py
      
      * fix the example.
      
      * add more features.
      
      * fix the example.
      
      * allow one partition
      
      * use distributed kvstore.
      
      * do g2l map manually (see the sketch after this entry).
      
      * fix commandline.
      
      * a temp script to save reddit.
      
      * fix pull_handler.
      
      * add pytorch version.
      
      * estimate the time for copying data.
      
      * delete unused code.
      
      * fix a bug.
      
      * print id.
      
      * fix a bug
      
      * fix a bug
      
      * fix a bug.
      
      * remove redundant code.
      
      * revert modify in sampler.
      
      * fix temp script.
      
      * remove pytorch version.
      
      * fix.
      
      * distributed training with pytorch.
      
      * add distributed graph store.
      
      * fix.
      
      * add metis_partition_assignment.
      
      * fix a few bugs in distributed graph store.
      
      * fix test.
      
      * fix bugs in distributed graph store.
      
      * fix tests.
      
      * remove code of defining DistGraphStore.
      
      * fix partition.
      
      * fix example.
      
      * update run.sh.
      
      * only read necessary node data.
      
      * batching data fetch of multiple NodeFlows.
      
      * simplify gcn.
      
      * remove unnecessary code.
      
      * use the new copy_from_kvstore.
      
      * update training script.
      
      * print time in graphsage.
      
      * make distributed training runnable.
      
      * use val_nid.
      
      * fix train_sampling.
      
      * add distributed training.
      
      * add run.sh
      
      * add more timing.
      
      * fix a bug.
      
      * save graph metadata when partition.
      
      * create ndata and edata in distributed graph store.
      
      * add timing in minibatch training of GraphSage.
      
      * use pytorch distributed.
      
      * add checks.
      
      * fix a bug in global vs. local ids.
      
      * remove fast pull
      
      * fix a compile error.
      
      * update and add new APIs.
      
      * implement more methods in DistGraphStore.
      
      * update more APIs.
      
      * rename it to DistGraph.
      
      * rename to DistTensor
      
      * remove some unnecessary API.
      
      * remove unnecessary files.
      
      * revert changes in sampler.
      
      * Revert "simplify gcn."
      
      This reverts commit 0ed3a34ca714203a5b45240af71555d4227ce452.
      
      * Revert "simplify neighbor sampling."
      
      This reverts commit 551c72d20f05a029360ba97f312c7a7a578aacec.
      
      * Revert "measure time in neighbor sampling."
      
      This reverts commit 63ae80c7b402bb626e24acbbc8fdfe9fffd0bc64.
      
      * Revert "add timing in minibatch training of GraphSage."
      
      This reverts commit e59dc8957a414c7df5c316f51d78bce822bdef5e.
      
      * Revert "fix train_sampling."
      
      This reverts commit ea6aea9a4aabb8ba0ff63070aa51e7ca81536ad9.
      
      * fix lint.
      
      * add comments and small update.
      
      * add more comments.
      
      * add more unit tests and fix bugs.
      
      * check the existence of shared-mem graph index.
      
      * use new partitioned graph storage.
      
      * fix bugs.
      
      * print error in fast pull.
      
      * fix lint
      
      * fix a compile error.
      
      * save absolute path after partitioning.
      
      * small fixes in the example
      
      * Revert "[kvstore] support any data type for init_data() (#1465)"
      
      This reverts commit 87b6997b.
      
      * fix a bug.
      
      * disable evaluation.
      
      * Revert "Revert "[kvstore] support any data type for init_data() (#1465)""
      
      This reverts commit f5b8039c6326eb73bad8287db3d30d93175e5bee.
      
      * support set and init data.
      
      * support set and init data.
      
      * Revert "Revert "[kvstore] support any data type for init_data() (#1465)""
      
      This reverts commit f5b8039c6326eb73bad8287db3d30d93175e5bee.
      
      * fix bugs.
      
      * fix unit test.
      
      * move to dgl.distributed.
      
      * fix lint.
      
      * fix lint.
      
      * remove local_nids.
      
      * fix lint.
      
      * fix test.
      
      * remove train_dist.
      
      * revert train_sampling.
      
      * rename funcs.
      
      * address comments.
      
      * address comments.
      
      Use NodeDataView/EdgeDataView to keep track of data.
      
      * address comments.
      
      * address comments.
      
      * revert.
      
      * save data with DGL serializer.
      
      * use the right way of getting shape.
      
      * fix lint.
      
      * address comments.
      
      * address comments.
      
      * fix an error in mxnet.
      
      * address comments.
      
      * add edge_map.
      
      * add more test and fix bugs.
      Co-authored-by: Zheng <dzzhen@186590dc80ff.ant.amazon.com>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-6-131.us-east-2.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-26-167.us-east-2.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-150.us-west-2.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-250.us-west-2.compute.internal>
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-30-135.us-west-2.compute.internal>
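
      On the "do g2l map manually" step above: a minimal sketch of a manual global-to-local (g2l) id map as a dense lookup array (names are illustrative, not the DistGraph API):

      ```python
      import numpy as np

      num_global = 10                            # size of the global id space
      local_global_ids = np.array([2, 5, 7, 9])  # global ids owned by this partition

      # Dense g2l lookup: -1 marks nodes that are not local.
      g2l = np.full(num_global, -1, dtype=np.int64)
      g2l[local_global_ids] = np.arange(len(local_global_ids))

      print(g2l[np.array([5, 9])])  # [1 3] -> local row indices
      print(g2l[0])                 # -1 -> not a local node
      ```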
  31. 08 Mar, 2020 1 commit
    • [Feature] add metis partitioning to DGL (#1308) · 0e153c4b
      Authored by Da Zheng
      
      
      * add metis.
      
      * add test.
      
      * construct partition id.
      
      * link to METIS github repo.
      
      * update metis.
      
      * add a tool for partitioning a graph.
      
      * update metis.
      
      * update.
      
      * update.
      
      * fix metis.
      
      * fix lint
      
      * fix indent.
      
      * another way of building metis.
      
      * disable metis in windows.
      
      * test windows
      
      * fix.
      
      * disable metis for windows properly.
      
      * fix for tensorflow.
      
      * skip test for gpu.
      
      * make graph symmetric (see the sketch after this entry).
      
      * address comments.
      
      * more comments.
      
      * fix compile
      
      * fix a bug.
      
      * add test.
      
      * change the default #hops of HALO nodes.
      Co-authored-by: Ubuntu <ubuntu@ip-172-31-26-167.us-east-2.compute.internal>
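
      On the "make graph symmetric" step above: METIS partitions undirected graphs, so a directed adjacency matrix is symmetrized before partitioning. A minimal sketch with scipy (not the actual DGL code path):

      ```python
      import numpy as np
      from scipy.sparse import coo_matrix

      # A small directed graph: 0->1, 1->2, 2->0.
      src = np.array([0, 1, 2])
      dst = np.array([1, 2, 0])
      adj = coo_matrix((np.ones(len(src)), (src, dst)), shape=(3, 3))

      sym = adj + adj.T    # add the reverse of every edge
      sym.data[:] = 1      # clamp weights where both directions already existed
      print(sym.toarray())
      ```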