Commits · e25f47dea22d9bdae4c4b0ff31aa5d21276f6bf8 · OpenDAS / dgl

16 Feb, 2023 1 commit

[DistDGL][Mem_Optimizations]Edge Ownership processes are computed on the fly when required. (#5225) · e25f47de

kylasa authored Feb 16, 2023

* Edge Ownership processes are computed on the fly when required.

Previously, the edge ownership process IDs were stored as soon as the dataset was read from disk. For very large datasets, each node may handle up to 5 billion edges, so storing the owner process IDs (8 bytes each) consumes 5 billion * 8 bytes = 40 GB. This memory stays allocated until the edges are exchanged.

To reduce the memory footprint of the pipeline, we no longer store the ownership process-ids in the 'edge_data' dictionary after reading the dataset from the disk. Instead, we compute them on the fly at the time of exchanging edges.

Another optimization is to avoid sending and receiving all edges in one single large message. Instead, we now split the total number of edges into chunks, limited to 8 GB per node, and iterate until all the chunks are exchanged.

Once all the edges are exchanged, as a sanity check, we compute the total number of edges in the system and compare it with the original value before edge shuffling, in a final assert statement before returning the result to the caller.

* Applying lintrunner patch.

e25f47de
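
The approach described in #5225 can be sketched as follows. This is a simplified, single-process simulation rather than the pipeline's actual code: the function names, the range-based ownership rule (an edge belongs to the rank that owns its destination node), and a chunk size counted in edges instead of bytes are all assumptions made for illustration.

```python
from bisect import bisect_right

OWNER_ID_BYTES = 8  # assumed size of one stored owner process ID (int64)


def stored_owner_bytes(num_edges):
    """Memory needed if one owner ID per edge were kept in memory."""
    return num_edges * OWNER_ID_BYTES


def owner_of(node_id, range_starts):
    """Hypothetical rule: rank r owns the node IDs starting at range_starts[r]."""
    return bisect_right(range_starts, node_id) - 1


def exchange_edges(edges_per_rank, range_starts, chunk_size):
    """Simulate a chunked exchange; owners are computed per chunk, never stored."""
    received = [[] for _ in edges_per_rank]
    longest = max(len(edges) for edges in edges_per_rank)
    for start in range(0, longest, chunk_size):
        # One iteration corresponds to one round of chunk-sized messages.
        for edges in edges_per_rank:
            for src, dst in edges[start:start + chunk_size]:
                received[owner_of(dst, range_starts)].append((src, dst))
    # Sanity check: shuffling must neither create nor lose edges.
    total_before = sum(len(edges) for edges in edges_per_rank)
    total_after = sum(len(edges) for edges in received)
    assert total_before == total_after, "edge count changed during exchange"
    return received
```

With these definitions, `stored_owner_bytes(5_000_000_000)` evaluates to 40,000,000,000 bytes, matching the 40 GB figure in the commit message.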

13 Feb, 2023 1 commit

Code changes to fix order sensitivity of the pipeline (#5288) · 432c71ef

kylasa authored Feb 13, 2023

The following changes are made in this PR:
1. In dataset_utils.py, edges are now read from disk in the order defined by the STR_EDGE_TYPE key in the metadata.json file. The same order is implicitly used to assign edge IDs to edge types.
2. The unit test framework now also randomizes the order in which edges are read from disk.
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>

432c71ef
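
A minimal sketch of the ordering rule from #5288: edge types are visited in the order listed in the metadata, and that order also determines their numeric IDs, so the order in which files happen to be discovered does not matter. The key name "edge_type" (standing in for the value of STR_EDGE_TYPE), the helper name, and the file names are assumptions, not taken from dataset_utils.py.

```python
import random


def ordered_edge_files(metadata, files_by_etype, etype_key="edge_type"):
    """Return (etype_id, etype, path) tuples in the order given by the metadata."""
    return [
        (etype_id, etype, files_by_etype[etype])
        for etype_id, etype in enumerate(metadata[etype_key])
    ]


metadata = {"edge_type": ["user:follows:user", "user:likes:item"]}
files = {"user:likes:item": "likes.csv", "user:follows:user": "follows.csv"}

# Shuffle the discovery order to mimic a randomized read order.
shuffled = list(files.items())
random.shuffle(shuffled)
result = ordered_edge_files(metadata, dict(shuffled))
```

Regardless of the shuffle, `result` lists "user:follows:user" first with ID 0, because the metadata order is the only ordering consulted.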

10 Feb, 2023 2 commits
- [Doc] update `--part_config` · d49a3019
  Rhett Ying authored Feb 10, 2023
  
  d49a3019
- [Doc] update --part_config · ed3888dd
  Rhett Ying authored Feb 10, 2023
  
  ed3888dd
03 Feb, 2023 1 commit
- [DistDGL][Lintrunner]Lintrunner for tools directory (#5261) · aa42aaeb
  kylasa authored Feb 03, 2023
```
* lintrunner patch for gloo_wrapper.py

* lintrunner changes to the tools directory.
```
  aa42aaeb
02 Feb, 2023 1 commit

[Dist] add input_dir for parmetis preprocess (#5232) · eff16b61

Rhett Ying authored Feb 02, 2023

* [Dist] add input_dir for parmetis preprocess

* add support for parquet

* update parmetis_wrapper accordingly

eff16b61

05 Jan, 2023 1 commit

[Dist] Allow reading and writing single-column vector Parquet files. (#5098) · 9890201d

Theodore Vasiloudis authored Jan 05, 2023

* Allow reading and writing single-column vector Parquet files.

These files are commonly produced by Spark ML's feature processing code.

* [Dist] Only write single-column vector files for Parquet in tests.

9890201d

03 Jan, 2023 1 commit

[Dist] Add support for Parquet-formatted edges files, remove some assumptions... · 774709d3

Theodore Vasiloudis authored Jan 03, 2023


[Dist] Add support for Parquet-formatted edges files, remove some assumptions on edge file number. (#5051)

* [Dist] Add support for Parquet-formatted edges files, remove some assumptions on edge file number.

* [Dist] Add parquet edges option to unit tests.
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>

774709d3

15 Dec, 2022 1 commit

[Dist] enable to chunk node/edge data into arbitrary number of chunks (#4930) · 9731e023

Rhett Ying authored Dec 15, 2022



* [Dist] enable to chunk node/edge data into arbitrary number of chunks

* [Dist] enable to split node/edge data into arbitrary parts

* refine code

* Forcibly convert boolean to uint8 to avoid dist.scatter failure

* convert boolean to int8 before scatter and revert it after scatter

* refine code

* fix test

* refine code

* move test utilities into utils.py

* update comment

* fix empty data

* update

* update

* fix empty data issue

* release unnecessary mem

* release unnecessary mem

* release unnecessary mem

* release unnecessary mem

* release unnecessary mem

* remove unnecessary shuffle data

* separate array_split into standalone utility

* add example
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>

9731e023
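
The idea of splitting data into an arbitrary number of parts can be illustrated with a small pure-Python helper. It follows the same convention as numpy.array_split (the first `len(data) % num_parts` parts get one extra element) and tolerates empty inputs; whether the pipeline's own array_split utility behaves identically is an assumption.

```python
def array_split(data, num_parts):
    """Split a sequence into num_parts contiguous, nearly equal parts."""
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    base, extra = divmod(len(data), num_parts)
    parts, start = [], 0
    for index in range(num_parts):
        # The first `extra` parts absorb the remainder, one element each.
        size = base + (1 if index < extra else 0)
        parts.append(data[start:start + size])
        start += size
    return parts
```

When there are fewer items than parts, the trailing parts are simply empty, which is the case the "fix empty data" commits above have to handle.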

14 Dec, 2022 1 commit
- [Dist] generate partition meta for ParMETIS pipeline (#5020) · 32dc1af6
  Rhett Ying authored Dec 14, 2022
```
* [Dist] generate partition meta for ParMETIS
```
  32dc1af6
07 Dec, 2022 1 commit
- Fix bug when feature_tid is empty (#4985) · 394794b1
  xiang song(charlie.song) authored Dec 07, 2022
```
Co-authored-by: Xiang Song <xiangsx@amazon.com>
```
  394794b1
30 Nov, 2022 1 commit
- [Bugfix] Make preprocess compatible with openmpi (#4971) · adb07d18
  xiang song(charlie.song) authored Nov 30, 2022
```
* Make preprocess compatible with openmpi

* update docstr
Co-authored-by: Xiang Song <xiangsx@amazon.com>
```
  adb07d18
28 Nov, 2022 2 commits
- [Feature] Add parquet support for node/edge features in chunked data (#4933) · 08fd6cf8
  peizhou001 authored Nov 28, 2022
  
  08fd6cf8
- [Dist] fix argument consistent with help message (#4957) · 566d231a
  Rhett Ying authored Nov 28, 2022
  
  566d231a
18 Nov, 2022 1 commit

[Dist] Flexible pipeline - Initial commit (#4733) · c8ea9fa4

kylasa authored Nov 18, 2022

* Flexible pipeline - Initial commit

1. Implementation of flexible pipeline feature.
2. With this implementation, the pipeline now supports multiple partitions per process. It also assumes that num_partitions is always a multiple of num_processes.

* Update test_dist_part.py

* Code changes to address review comments

* Code refactoring of exchange_features function into two functions for better readability

* Updating test_dist_part to fix merge issues with the master branch

* corrected variable names...

* Fixed code refactoring issues.

* Provide missing function arguments to exchange_feature function

* Providing the missing function argument to fix error.

* Provide missing function argument to 'get_shuffle_nids' function.

* Repositioned a variable within its scope.

* Removed tab space which is causing the indentation problem

* Fix issue with the CI test framework, which is the root cause for the failure of the CI tests.

1. We now read the files specific to each partition-id and store this data separately in the local process, identified by the local_part_id.
2. Similarly, we differentiate the node and edge feature type_ids using the same keys.
3. These two changes help us get the appropriate feature data during the feature exchange and send it to the correct destination process.

* Correct the parametrization for the CI unit test cases.

* Addressing Rui's code review comments.

* Addressing code review comments.

c8ea9fa4
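
The constraint from the initial flexible-pipeline commit, that num_partitions must be a multiple of num_processes, can be illustrated with a small helper. The contiguous-block assignment and the function name below are assumptions; the pipeline may map partitions to processes differently.

```python
def assign_partitions(num_partitions, num_processes):
    """Map each process rank to the list of partition IDs it handles."""
    if num_partitions % num_processes != 0:
        raise ValueError("num_partitions must be a multiple of num_processes")
    per_process = num_partitions // num_processes
    return {
        rank: list(range(rank * per_process, (rank + 1) * per_process))
        for rank in range(num_processes)
    }
```

In this sketch, the position of a partition within its rank's list plays the role of the local_part_id mentioned in the commit message.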

17 Nov, 2022 1 commit

[Dist] Fix bug in Dist partitioning (#4910) · 799245a7

Serge Panev authored Nov 17, 2022


Signed-off-by: Serge Panev <spanev@nvidia.com>

799245a7

09 Nov, 2022 1 commit
- [Dist] Fix typo of metis preprocess in dist partition pipeline · 344be1ef
  xiang song(charlie.song) authored Nov 08, 2022
  
  344be1ef
08 Nov, 2022 1 commit

[DIST] Message size to retrieve SHUFFLE_GLOBAL_NIDs is resulting in very large... · 4cd0a685

kylasa authored Nov 07, 2022

[DIST] Message size to retrieve SHUFFLE_GLOBAL_NIDs is resulting in very large messages and resulting in killed process (#4790)

* Send out the message to the distributed lookup service in batches.

* Update function signature for allgather_sizes function call.

* Removed the unnecessary if statement.

* Removed logging.info message, which is not needed.

4cd0a685
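
Sending lookup requests in batches, as #4790 describes, bounds the size of any single message. The sketch below uses a hypothetical `lookup` callable and batch size; it is not the interface of the distributed lookup service.

```python
def lookup_in_batches(global_nids, lookup, batch_size):
    """Query `lookup` with at most batch_size IDs per call and join the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results = []
    for start in range(0, len(global_nids), batch_size):
        batch = global_nids[start:start + batch_size]
        results.extend(lookup(batch))
    return results
```

The results come back in the same order as the input IDs, so callers do not need to know that batching took place.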

07 Nov, 2022 3 commits
- alltoall returns tensor list with None values, which is failing torch.cat(). (#4788) · e3bf1825
  kylasa authored Nov 07, 2022
  
  e3bf1825
- [Dist] Create <graph_name>_stats.txt file if it does not exist before ParMETIS execution (#4791) · 98b9e0fa
  kylasa authored Nov 07, 2022
```
* check if stats file exists, if not create one before parmetis run

* correct the typo error and correctly use constants.GRAPH_NAME
```
  98b9e0fa
- Reading files in chunks to reduce the memory footprint of pyarrow (#4795) · 53117c51
  kylasa authored Nov 07, 2022
```
All tasks completed.
```
  53117c51
04 Nov, 2022 2 commits

[Dist] deprecate etype and always use canonical etype for partition and load (#4777) · ed8e9c44

Rhett Ying authored Nov 04, 2022

* [Dist] deprecate etype and always use canonical etype for partition and load

* enable canonical etypes in dist part pipeline

* resolve rebase conflicts

* fix lint

* fix test failure

* throw exception if outdated part config is loaded

* refine

* refine

* revert unnecessary change

* fix typo

ed8e9c44

[Dist] remove dependency of load_partition_book in change tool (#4802) · dccf1f16

peizhou001 authored Nov 04, 2022



* remove dependency of load_partition_book in change tool

* fix issue

* fix issue
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-19.ap-northeast-1.compute.internal>

dccf1f16

31 Oct, 2022 1 commit
- Updated the key to retrieve correct rank of a process (#4756) · 9a72b78b
  kylasa authored Oct 31, 2022
```
Merging this PR to the master branch
```
  9a72b78b
27 Oct, 2022 1 commit
- [Dist] fix etype issue in dist part pipeline (#4754) · ea4d9e83
  Rhett Ying authored Oct 27, 2022
```
* [Dist] fix etype issue in dist part pipeline

* add comments
```
  ea4d9e83
26 Oct, 2022 1 commit

[Dist] Reduce startup overhead: sort etypes and save in specified formats (#4735) · 1990e797

Rhett Ying authored Oct 26, 2022

* [Dist] reduce startup overhead: enable to save in specified formats

* [Dist] reduce startup overhead: sort partitions when generating

* sort csc/csr only when multiple etypes

* refine

1990e797

19 Oct, 2022 2 commits
- add standalone tools for generating canonical etypes (#4626) · 743516f3
  peizhou001 authored Oct 19, 2022
```
* add a standalone tool for change etypes to canonical etypes in part config
```
  743516f3
- [Dist] decouple num_chunks and num_parts for graphs with edge feature (#4729) · 6a460725
  Rhett Ying authored Oct 19, 2022
```
* [Dist] decouple num_chunks and num_parts for graphs with edge feature

* fix test failure
```
  6a460725
17 Oct, 2022 1 commit

[Dist] Reduce peak memory in DistDGL (#4687) · b1309217

Rhett Ying authored Oct 17, 2022

* [Dist] Reduce peak memory in DistDGL: avoid validation, release memory once loaded

* remove orig_id from ndata/edata for partition_graph()

* delete orig_id from ndata/edata in dist part pipeline

* reduce dtype size and format before saving graphs

* fix lint

* ETYPE requires to be int32/64 for CSRSortByTag

* fix test failure

* refine

b1309217

12 Oct, 2022 1 commit
- [Misc] Black auto fix. (#4705) · 2b983869
  Hongzhi (Steve), Chen authored Oct 12, 2022
```
Co-authored-by: Steve <ubuntu@ip-172-31-34-29.ap-northeast-1.compute.internal>
```
  2b983869
11 Oct, 2022 1 commit
- [Misc] Black auto fix. (#4697) · ea48ce7a
  Hongzhi (Steve), Chen authored Oct 11, 2022
```
Co-authored-by: Steve <ubuntu@ip-172-31-34-29.ap-northeast-1.compute.internal>
```
  ea48ce7a
03 Oct, 2022 2 commits

ParMETIS wrapper script to enable ParMETIS to process chunked dataset format (#4605) · eae6ce2a

kylasa authored Oct 03, 2022

* Creating ParMETIS wrapper script to run parmetis using one script from user perspective

* Addressed all the CI comments from PR https://github.com/dmlc/dgl/pull/4529

* Addressing CI comments.

* Isort, and black changes.

* Replaced python with python3

* Replaced single quote with double quotes per suggestion.

* Removed print statement

* Addressing CI Comments.

* Addressing CI review comments.

* Addressing CI comments as per chime discussion with Rui

* CI Comments, Black and isort changes

* Align with code refactoring, black, isort and code review comments.

* Addressing CI review comments, and fixing merge issues with the master branch

* Updated with proper unit test skip decorator

eae6ce2a

Edge Feature support for input graph datasets for dist. graph partitioning pipeline (#4623) · 1f471396

kylasa authored Oct 03, 2022

* Added support for edge features.

* Added comments and removing unnecessary print statements.

* updated data_shuffle.py to remove compile error.

* Replaced python3 with python to match CI test framework.

* Removed unrelated files from the pull request.

* Isort changes.

* black changes on this file.

* Addressing CI review comments.

* Addressing CI comments.

* Removed duplicated and resolved merge conflict code.

* Addressing CI Comments from Rui.

* Addressing CI comments, and fixing merge issues.

* Addressing CI comments, code refactoring, isort and black

1f471396

28 Sep, 2022 2 commits

[Dist] enable to partition many chunks into less partitions via pipeline (#4620) · cf19254a

Rhett Ying authored Sep 28, 2022

* [Dist] enable to partition many chunks into less partitions via pipeline

* refine

* add meta file for num_parts, add more tests, refine docstring

* remove args.num_parts

* create pydantic class for partition metadata

* refine

* rename json file

cf19254a

[Dist] save original node/edge IDs into separate files (#4649) · 6c1500d4
Rhett Ying authored Sep 28, 2022
```
* [Dist] save original node/edge IDs into separate files

* separate nids and eids
```
6c1500d4

23 Sep, 2022 1 commit

Garbage Collection and memory snapshot code for debugging partitioning... · ace76327

kylasa authored Sep 23, 2022


Garbage Collection and memory snapshot code for debugging partitioning pipeline (target as master branch) (#4598)

* Squashed commit of the following:

commit e605a550b3783dd5f24eb39b6873a2e0e79be9c7
Author: kylasa <kylasa@gmail.com>
Date:   Thu Sep 15 14:45:39 2022 -0700

    Delete pyproject.toml

commit f2db9e700d817212b67b5227f6472d218f0c74f2
Author: kylasa <kylasa@gmail.com>
Date:   Thu Sep 15 14:44:40 2022 -0700

    Changes suggested by isort program to sort imports.

commit 5a6078beac6218a4f1fb378c169f04dda7396425
Author: kylasa <kylasa@gmail.com>
Date:   Thu Sep 15 14:39:50 2022 -0700

    addressing code review comments from the CI process.

commit c8e92decb7aebeb32c7467108e16f058491443ab
Author: kylasa <kylasa@gmail.com>
Date:   Wed Sep 14 18:23:59 2022 -0700

    Corrected a typo in the import statement

commit 14ddb0e9b553d5be3ed2c50d82dee671e84ad8c9
Author: kylasa <kylasa@gmail.com>
Date:   Tue Sep 13 18:47:34 2022 -0700

    Memory snapshot code for debugging memory footprint of the graph partitioning pipeline

Squashed commit done

* Addressing code review comments.

* Update utils.py

* dummy change to trigger CI tests
Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>

ace76327
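
For readers who want to reproduce this kind of debugging, a memory snapshot combined with an explicit garbage-collection pass can be taken with the standard library alone. This sketch uses tracemalloc and gc; whether the commit itself relies on these modules is an assumption, and the helper name is hypothetical.

```python
import gc
import tracemalloc


def memory_snapshot(limit=5):
    """Run the garbage collector, then return the largest allocation sites."""
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    gc.collect()
    snapshot = tracemalloc.take_snapshot()
    # Statistics are grouped by source line and sorted largest first.
    return snapshot.statistics("lineno")[:limit]


memory_snapshot()  # the first call starts tracing
buffers = [bytearray(1024) for _ in range(1000)]
top = memory_snapshot(limit=3)
```

Each entry in `top` exposes `size` and `count` attributes, along with a traceback pointing at the allocating line, which makes it easy to log the heaviest allocation sites between pipeline stages.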

20 Sep, 2022 1 commit
- add ssh port config for dispatchdata (#4557) · 166b273b
  peizhou001 authored Sep 20, 2022
  
  166b273b
15 Sep, 2022 1 commit

[DistPart] expose timeout config for process group (#4532) · 099b173f

Rhett Ying authored Sep 15, 2022



* [DistPart] expose timeout config for process group

* refine code

* Update tools/distpartitioning/data_proc_pipeline.py
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>

099b173f

22 Aug, 2022 1 commit
- [Doc] Change random.py to random_partition.py in guide on distributed partition pipeline (#4438) · 7a41c126
  Mufei Li authored Aug 22, 2022
```
* Update distributed-preprocessing.rst

* Update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-26.ap-northeast-1.compute.internal>
```
  7a41c126
21 Aug, 2022 1 commit
- Let distributed training launch script report error when any trainer or kvserver fails. (#4437) · ee672c0b
  xiang song(charlie.song) authored Aug 21, 2022
```
* Collect error reports

* update

* fix
Co-authored-by: root <root@ip-10-0-80-128.ec2.internal>
```
  ee672c0b