Commits · 1ab0170a106ada9a7dbfa555563affb47c377305 · OpenDAS / dgl

27 Apr, 2023 1 commit
- [Distributed] Ensure round-robin edge file downloads, reduce logging, other improvements. (#5578) · 1ab0170a
  Theodore Vasiloudis authored Apr 27, 2023
```
Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>
```
  1ab0170a
13 Apr, 2023 1 commit
- Use correct delimiter when reading edge files during parmetis processing step (#5481) · 4085ec8a
  kylasa authored Apr 13, 2023
```
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
```
  4085ec8a
29 Mar, 2023 1 commit
- [Misc] Rename number_of_edges and number_of_nodes to num_edges and num_nodes. (#5490) · 3c8ac093
  Hongzhi (Steve), Chen authored Mar 29, 2023
```
* Other

* revert

---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-63.ap-northeast-1.compute.internal>
```
  3c8ac093
17 Mar, 2023 1 commit
- [Dist] Add argument in dispatch_data.py to allow user-defined metadata JSON filename. (#5445) · 5cab4230
  Theodore Vasiloudis authored Mar 16, 2023
  
  5cab4230
10 Mar, 2023 2 commits

[DistDGL][Robustness]Replacing numpy's unique with custom implementation (#5391) · 92e22995

kylasa authored Mar 10, 2023



* Replacing numpy's unique with custom implementation

* Added docstring to the new function.

* Adding unit tests

* Numpy's version issues with the 'kind' argument.

* Addressing CI Test Failure.

* Addressing CI review comments.

* revised implementation, optimized for time.

* added missing arguments for fallback case.

* Addressing CI test failures.

* Resolving issues with PYTHONPATH

* Fix CI Test Failure issues.

* fix CI test failures.

---------
Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>

92e22995

[DistDGL][TestCoverage]Added testcase for testing distributed lookup service. (#5365) · e74b3d3d

kylasa authored Mar 09, 2023



* Added testcase for testing distributed lookup service.

* Applying lintrunner patch.

* Fixing CI Test environment failures.

* lintrunner patch.

* lintrunner patch

* Fix CI Failure.

* Fixing CI Test failure cases.

* lintrunner patch.

* lintrunner patch and CI test failure.

* Restore no. of test cases.

* Resolving pythonpath issues.

* lintrunner patch.

* updating PYTHONPATH to resolve lib path

* Resolve merge conflicts

* Resolving issues with PYTHONPATH env variable.

* fix module path

* rename utils script under test to avoid ambiguity

* remove unnecessary pythonpath

* fix lint error

* fix lint error

---------
Co-authored-by: RhettYing <rhett_ying@qq.com>

e74b3d3d

06 Mar, 2023 2 commits

[DistDGL][UserEx]Sync parmetis_wrapper with changes in metadata.json (#5385) · 7b766393

kylasa authored Mar 06, 2023

* Sync parmetis_wrapper with changes in metadata.json

1. In preprocess.py, make sure that num_partitions is defined as an input argument. Also, align 'input_dir' with the input dataset: schema_file and the graph_stats.txt file are both assumed to be located inside input_dir.

2. Use DGL_HOME environment variable so that parmetis_wrapper command can be run anywhere.
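
A minimal sketch of the path conventions described in the two points above, not the actual parmetis_wrapper code. The helper `resolve_inputs`, the default schema file name, and the `tools` subdirectory under DGL_HOME are assumptions for illustration; only `input_dir`, `schema_file`, `graph_stats.txt` and `DGL_HOME` come from the commit message.

```python
import os


def resolve_inputs(input_dir, schema_file="metadata.json", environ=None):
    """Resolve the files the wrapper is assumed to need (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    dgl_home = environ.get("DGL_HOME")
    if dgl_home is None:
        raise EnvironmentError("DGL_HOME must be set so the tools can be located")
    return {
        # Both files are assumed to live directly inside input_dir.
        "schema": os.path.join(input_dir, schema_file),
        "graph_stats": os.path.join(input_dir, "graph_stats.txt"),
        # Assumed layout: helper scripts live under $DGL_HOME/tools.
        "tools_dir": os.path.join(dgl_home, "tools"),
    }


paths = resolve_inputs("/data/my_graph", environ={"DGL_HOME": "/opt/dgl"})
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.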

* Fix CI test failure cases.

* Addressing CI review comments.

* Addressing CI test failures.

* Applying lintrunner patch

7b766393

Support for no. of chunks smaller than no. of partitions. (#5390) · 894ad1e3

kylasa authored Mar 06, 2023

* Support for no. of chunks smaller than no. of partitions and Adding appropriate test cases.

Following changes are made with this PR.
1. Code changes for handling no. of chunks smaller than no. of partitions
2. Adding new test cases, which were previously deleted, for no. of chunks smaller than no. of partitions.
3. Also adding test cases, where multiple partitions are handled by a single process.

* Committing the missing files in this commit.

* lintrunner patch.

* lintrunner check

* lintrunner patch here.

* CI review comments.

894ad1e3

28 Feb, 2023 1 commit

Distributed Lookup Service Robustness (#5387) · cf752077

kylasa authored Feb 28, 2023

Handle corner cases in the distributed lookup service, such as the get_partition_ids function being invoked with an empty request. This is needed because get_partition_ids uses the alltoall function.
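
Why an empty request still has to be handled can be sketched with a single-process simulation. This is not the DGL implementation: `build_requests` and `simulated_alltoall` are hypothetical helpers, and ownership by `nid % world_size` is an assumption made purely for illustration. The point is that alltoall is a collective, so a rank with nothing to look up must still contribute an (empty) send list for every peer.

```python
def build_requests(global_nids, world_size):
    """Bucket node ids by the rank assumed to own them (nid % world_size)."""
    buckets = [[] for _ in range(world_size)]
    for nid in global_nids:
        buckets[nid % world_size].append(nid)
    return buckets


def simulated_alltoall(send_lists):
    """Turn send_lists[src][dst] into recv_lists[dst][src], like a collective alltoall."""
    world_size = len(send_lists)
    return [
        [send_lists[src][dst] for src in range(world_size)]
        for dst in range(world_size)
    ]


world_size = 3
# Rank 1 has an empty request, but it still takes part in the exchange.
requests_per_rank = [[0, 4, 5], [], [7, 8]]
send_lists = [build_requests(r, world_size) for r in requests_per_rank]
recv_lists = simulated_alltoall(send_lists)
```

Here rank 1 sends three empty lists and still receives the ids that ranks 0 and 2 address to it.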

cf752077

25 Feb, 2023 1 commit

[DistDGL][Feature_Request]Changes in the metadata.json file for input graph dataset. (#5310) · a14f69c9

kylasa authored Feb 24, 2023

* Implemented the following changes.

* Remove NUM_NODES_PER_CHUNK
* Remove NUM_EDGES_PER_CHUNK
* Remove the dependency between no. of edge files per edge type and no. of partitions
* Remove the dependency between no. of edge feature files per edge type and no. of partitions
* Remove the dependency between no. of edge feature files and no. of edge files per edge type.
* Remove the dependency between no. of node feature files and no. of partitions
* Add “node_type_counts”. This will be a list of integers. Each integer will represent total count of a node-type. The index in this list and the index in the “node_type” will be the same for a given node-type.
* Add “edge_type_counts”. This will be a list of integers. Each integer will represent total count of an edge-type. The index in this list and the index in the “edge_type” list will be the same for a given edge-type.
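
The index alignment between the type lists and the new count lists can be sketched as follows. The metadata fragment is hypothetical (type names and counts are invented); only the four key names are taken from the commit message, and `counts_by_type` is an illustrative helper rather than part of the pipeline.

```python
import json

# Hypothetical metadata fragment; only the key names come from the commit message.
metadata = json.loads("""
{
    "node_type": ["author", "paper"],
    "node_type_counts": [3, 5],
    "edge_type": ["author:writes:paper", "paper:cites:paper"],
    "edge_type_counts": [7, 11]
}
""")


def counts_by_type(meta, type_key, count_key):
    """Pair each type name with the count stored at the same index."""
    names, counts = meta[type_key], meta[count_key]
    if len(names) != len(counts):
        raise ValueError(f"{type_key} and {count_key} must have the same length")
    return dict(zip(names, counts))


node_counts = counts_by_type(metadata, "node_type", "node_type_counts")
edge_counts = counts_by_type(metadata, "edge_type", "edge_type_counts")
```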

* Applying lintrunner patch.

* Adding missing keys to the metadata in the unit test framework.

* lintrunner patch.

* Resolving CI test failures due to merge conflicts.

* Applying lintrunner patch

* applying lintrunner patch

* Replacing tabspace with spaces - to satisfy lintrunner

* Fixing the CI Test Failure cases.

* Applying lintrunner patch

* lintrunner complaining about a blank line.

* Resolving issues with print statement for NoneType

* Removed the tests for arbitrary chunks, since this functionality is not supported anymore.

* Addressing CI review comments.

* addressing CI review comments

* lintrunner patch

* lintrunner patch.

* Addressing CI review comments.

* lintrunner patch.

a14f69c9

23 Feb, 2023 3 commits

New script for customers to validate partitioned graph objects (#5340) · c42fa8a5
kylasa authored Feb 23, 2023
```
* A new script to validate graph partitioning pipeline

* Addressing CI review comments.

* lintrunner patch.
```
c42fa8a5

[DistDGL][Robustness]Uneven distribution of input graph files for nodes/edges and features. (#5227) · bbc538d9

kylasa authored Feb 23, 2023

* Uneven distribution of nodes/edges/features

To handle unevenly sized files for nodes/edges and for node and edge features, we have to synchronize before sending a large amount of data (either one large message or a burst of messages).

* Applying lintrunner patch.

* Removing tabspaces for lintrunner.

* lintrunner patch.

* removed issues introduced by the merge conflicts. Lots of code was repeated

bbc538d9

[DistDGL][Mem_Optimizations]get_partition_ids, service provided by the... · 61b6edab

kylasa authored Feb 23, 2023

[DistDGL][Mem_Optimizations]get_partition_ids, service provided by the distributed lookup service has high memory footprint (#5226)

* get_partition_ids, service provided by the distributed lookup service has high memory footprint

The 'get_partitionid' function, which is used to retrieve the owner processes of a given list of global node ids, has a high memory footprint. Currently this is of the order of 8x the size of the input list.

For massively large datasets, these memory needs are unrealistic and may result in OOM. In the case of CoreGraph, when retrieving the owners of an edge list of 6 billion edges, the memory needs can be as high as 8*8*8 = 256 GB.

To limit the amount of memory used by this function, we split the message sent to the distributed lookup service so that each message carries at most 200 million global node ids. This reduces the memory footprint of the entire function to no more than 0.2 * 8 * 8 ≈ 13 GB, which is within reasonable limits.

Because we now send multiple small messages instead of one large message to the distributed lookup service, this may consume more wall-clock time than the earlier implementation.

* lintrunner patch.

* using np.ceil() per suggestion.

* converting the output of np.ceil() as ints.
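
The batching idea in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the PR: `split_into_batches` and `estimate_peak_gb` are hypothetical helpers, and `math.ceil` stands in for the `np.ceil()` plus int conversion mentioned in the bullets so that the sketch needs only the standard library. The 8x overhead and 8 bytes per id are the figures quoted in the message.

```python
import math


def split_into_batches(global_nids, max_ids_per_message):
    """Split a list of node ids into messages of at most max_ids_per_message ids."""
    num_batches = math.ceil(len(global_nids) / max_ids_per_message)
    return [
        global_nids[i * max_ids_per_message:(i + 1) * max_ids_per_message]
        for i in range(num_batches)
    ]


def estimate_peak_gb(num_ids_in_billions, overhead_factor=8, bytes_per_id=8):
    """Rough peak memory in GB: ids (in billions) * overhead * bytes per id."""
    return num_ids_in_billions * overhead_factor * bytes_per_id


# Tiny stand-in for the real limit of 200 million ids per message.
batches = split_into_batches(list(range(10)), max_ids_per_message=4)
peak_gb_per_message = estimate_peak_gb(0.2)
```

An empty input produces zero batches, so a caller that must still take part in a collective exchange has to handle that case separately.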

61b6edab

22 Feb, 2023 1 commit

[DistDGL] Memory optimization to reduce memory footprint of the Dist Graph... · 5ea04713

kylasa authored Feb 22, 2023

[DistDGL] Memory optimization to reduce memory footprint of the Dist Graph partitioning pipeline. (#5130)

* Wrap np.argsort() in a function. This

Use a Python wrapper for the np.argsort() function to make better use of system memory.

* lintrunner patch.

* lintrunner patch.

* Changes to address code review comments.

5ea04713

19 Feb, 2023 1 commit
- [Misc] auto-format tools. (#5321) · 6bc82161
  Hongzhi (Steve), Chen authored Feb 19, 2023
```
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-63.ap-northeast-1.compute.internal>
```
  6bc82161
16 Feb, 2023 2 commits

[DistDGL][Optimizations]Rehash code to optimize for loop (#5224) · 9ce800d2

kylasa authored Feb 16, 2023

* Rehash code to optimize for loop

Reduced the number of instructions in the for loop that exchanges edge features. This reduces the number of times numpy's intersect1d is invoked (saving the runtime and memory overhead of numpy).

* Applying lintrunner patch to data_shuffle.py

9ce800d2

[DistDGL][Mem_Optimizations]Edge Ownership processes are computed on the fly when required. (#5225) · e25f47de

kylasa authored Feb 16, 2023

* Edge Ownership processes are computed on the fly when required.

Previously, edge ownership process-ids were stored after the dataset was read from disk. For massively large datasets, each node can handle up to 5 billion edges, so storing the owner process-ids consumes 5 * 8 = 40 GB. This memory stays allocated until the edges are exchanged.

To reduce the memory footprint of the pipeline, we no longer store the ownership process-ids in the 'edge_data' dictionary after reading the dataset from disk. Instead, we compute them on the fly when exchanging edges.

Another optimization is to avoid sending/receiving all the edges in one single large message. Instead, we now split the total number of edges into chunks, limited to 8 GB per node, and iterate until all the chunks are exchanged.

Once all the edges are exchanged, as a sanity check, we compute the total number of edges in the system and compare it with the original value before edge shuffling, in a final assert statement before returning the result to the caller.

* Applying lintrunner patch.
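
A single-process sketch of the three ideas in this commit: owners computed on the fly, chunked exchange under a byte budget, and a final edge-count check. It is not the pipeline code. `owner_of`, `exchange_edges_in_chunks`, the modulo ownership rule, and the 16 bytes per edge are all assumptions chosen to keep the example small.

```python
def owner_of(dst_nid, world_size):
    """Hypothetical ownership rule: destination node id modulo world_size."""
    return dst_nid % world_size


def exchange_edges_in_chunks(edges, world_size, max_bytes_per_chunk, bytes_per_edge=16):
    """Route edges to their owners chunk by chunk instead of in one large message."""
    edges_per_chunk = max(1, max_bytes_per_chunk // bytes_per_edge)
    received = [[] for _ in range(world_size)]
    for start in range(0, len(edges), edges_per_chunk):
        chunk = edges[start:start + edges_per_chunk]
        for src, dst in chunk:
            # The owner is computed here, on the fly, and never stored with the edges.
            received[owner_of(dst, world_size)].append((src, dst))
    # Sanity check: no edge was lost or duplicated during the exchange.
    assert sum(len(r) for r in received) == len(edges)
    return received


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
received = exchange_edges_in_chunks(edges, world_size=2, max_bytes_per_chunk=32)
```

With a 32-byte budget and 16 bytes per edge, the five edges are processed in three chunks of at most two edges each.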

e25f47de

13 Feb, 2023 1 commit

Code changes to fix order sensitivity of the pipeline (#5288) · 432c71ef

kylasa authored Feb 13, 2023

Following changes are made in this PR.
1. In dataset_utils.py, edges are now read from disk in the order defined by the STR_EDGE_TYPE key in the metadata.json file. The same order is implicitly used to assign edge ids to edge types.
2. The unit test framework now also randomizes the order in which edges are read from disk.
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
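
Why the metadata order matters can be sketched like this. The helper `assign_edge_id_ranges` and the edge type names are hypothetical; the only assumption carried over from the message is that edge ids are handed out to edge types in the order the metadata lists them, independent of the order in which files are encountered on disk.

```python
def assign_edge_id_ranges(edge_types_in_metadata_order, edge_counts):
    """Give each edge type a contiguous [start, end) id range, in metadata order."""
    ranges = {}
    start = 0
    for etype in edge_types_in_metadata_order:
        end = start + edge_counts[etype]
        ranges[etype] = (start, end)
        start = end
    return ranges


metadata_order = ["a:r1:b", "b:r2:a"]
# Counts may be discovered in any order; the ranges follow the metadata order only.
counts_seen_on_disk = {"b:r2:a": 4, "a:r1:b": 3}
edge_id_ranges = assign_edge_id_ranges(metadata_order, counts_seen_on_disk)
```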

432c71ef

10 Feb, 2023 2 commits
- [Doc] update `--part_config` · d49a3019
  Rhett Ying authored Feb 10, 2023
  
  d49a3019
- [Doc] update --part_config · ed3888dd
  Rhett Ying authored Feb 10, 2023
  
  ed3888dd
03 Feb, 2023 1 commit
- [DistDGL][Lintrunner]Lintrunner for tools directory (#5261) · aa42aaeb
  kylasa authored Feb 03, 2023
```
* lintrunner patch for gloo_wrapper.py

* lintrunner changes to the tools directory.
```
  aa42aaeb
02 Feb, 2023 1 commit

[Dist] add input_dir for parmetis preprocess (#5232) · eff16b61

Rhett Ying authored Feb 02, 2023

* [Dist] add input_dir for parmetis preprocess

* add support for parquet

* update parmetis_wrapper accordingly

eff16b61

05 Jan, 2023 1 commit

[Dist] Allow reading and writing single-column vector Parquet files. (#5098) · 9890201d

Theodore Vasiloudis authored Jan 05, 2023

* Allow reading and writing single-column vector Parquet files.

These files are commonly produced by Spark ML's feature processing code.

* [Dist] Only write single-column vector files for Parquet in tests.

9890201d

03 Jan, 2023 1 commit

[Dist] Add support for Parquet-formatted edges files, remove some assumptions... · 774709d3

Theodore Vasiloudis authored Jan 03, 2023


[Dist] Add support for Parquet-formatted edges files, remove some assumptions on edge file number. (#5051)

* [Dist] Add support for Parquet-formatted edges files, remove some assumptions on edge file number.

* [Dist] Add parquet edges option to unit tests.
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>

774709d3

15 Dec, 2022 1 commit

[Dist] enable to chunk node/edge data into arbitrary number of chunks (#4930) · 9731e023

Rhett Ying authored Dec 15, 2022



* [Dist] enable to chunk node/edge data into arbitrary number of chunks

* [Dist] enable to split node/edge data into arbitrary parts

* refine code

* Forcibly cast boolean to uint8 to avoid dist.scatter failure

* convert boolean to int8 before scatter and revert it after scatter

* refine code

* fix test

* refine code

* move test utilities into utils.py

* update comment

* fix empty data

* update

* update

* fix empty data issue

* release unnecessary mem

* release unnecessary mem

* release unnecessary mem

* release unnecessary mem

* release unnecessary mem

* remove unnecessary shuffle data

* separate array_split into standalone utility

* add example
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>

9731e023

14 Dec, 2022 1 commit
- [Dist] generate partition meta for ParMETIS pipeline (#5020) · 32dc1af6
  Rhett Ying authored Dec 14, 2022
```
* [Dist] generate partition meta for ParMETIS
```
  32dc1af6
07 Dec, 2022 1 commit
- Fix bug when feature_tid is empty (#4985) · 394794b1
  xiang song(charlie.song) authored Dec 07, 2022
```
Co-authored-by: Xiang Song <xiangsx@amazon.com>
```
  394794b1
30 Nov, 2022 1 commit
- [Bugfix] Make preprocess compatible with openmpi (#4971) · adb07d18
  xiang song(charlie.song) authored Nov 30, 2022
```
* Make preprocess compatible with openmpi

* update docstr
Co-authored-by: Xiang Song <xiangsx@amazon.com>
```
  adb07d18
28 Nov, 2022 2 commits
- [Feature] Add parquet support for node/edge features in chunked data (#4933) · 08fd6cf8
  peizhou001 authored Nov 28, 2022
  
  08fd6cf8
- [Dist] fix argument consistent with help message (#4957) · 566d231a
  Rhett Ying authored Nov 28, 2022
  
  566d231a
18 Nov, 2022 1 commit

[Dist] Flexible pipeline - Initial commit (#4733) · c8ea9fa4

kylasa authored Nov 18, 2022

* Flexible pipeline - Initial commit

1. Implementation of flexible pipeline feature.
2. With this implementation, the pipeline supports multiple partitions per process. It also assumes that num_partitions is always a multiple of num_processes.
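
One possible way to map several partitions onto each process, under the stated constraint that num_partitions is a multiple of num_processes. The strided assignment and the helper `partitions_for_rank` are assumptions for illustration; the commit message does not say which assignment the pipeline actually uses.

```python
def partitions_for_rank(rank, num_partitions, num_processes):
    """Return the partition ids handled by one process (strided assignment, assumed)."""
    if num_partitions % num_processes != 0:
        raise ValueError("num_partitions must be a multiple of num_processes")
    parts_per_process = num_partitions // num_processes
    return [rank + i * num_processes for i in range(parts_per_process)]


# 8 partitions over 4 processes: each process handles 2 partitions.
assignment = {rank: partitions_for_rank(rank, 8, 4) for rank in range(4)}
```

The position of a partition within a rank's list plays the role of a local partition index in this sketch.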

* Update test_dist_part.py

* Code changes to address review comments

* Code refactoring of exchange_features function into two functions for better readability

* Updating test_dist_part to fix merge issues with the master branch

* corrected variable names...

* Fixed code refactoring issues.

* Provide missing function arguments to exchange_feature function

* Providing the missing function argument to fix error.

* Provide missing function argument to 'get_shuffle_nids' function.

* Repositioned a variable within its scope.

* Removed tab space which is causing the indentation problem

* Fix issue with the CI test framework, which is the root cause for the failure of the CI tests.

1. Now we read files specific to the partition-id and store this data separately, identified by the local_part_id, in the local process.
2. Similarly as above, we also differentiate the node and edge features type_ids with the same keys as above.
3. These two changes help us retrieve the appropriate feature data during the feature exchange and send it to the correct destination process.

* Correct the parametrization for the CI unit test cases.

* Addressing Rui's code review comments.

* Addressing code review comments.

c8ea9fa4

17 Nov, 2022 1 commit

[Dist] Fix bug in Dist partitioning (#4910) · 799245a7

Serge Panev authored Nov 17, 2022


Signed-off-by: Serge Panev <spanev@nvidia.com>

799245a7

09 Nov, 2022 1 commit
- [Dist] Fix typo of metis preprocess in dist partition pipeline · 344be1ef
  xiang song(charlie.song) authored Nov 08, 2022
  
  344be1ef
08 Nov, 2022 1 commit

[DIST] Message size to retrieve SHUFFLE_GLOBAL_NIDs is resulting in very large... · 4cd0a685

kylasa authored Nov 07, 2022

[DIST] Message size to retrieve SHUFFLE_GLOBAL_NIDs is resulting in very large messages and resulting in killed process (#4790)

* Send out the message to the distributed lookup service in batches.

* Update function signature for allgather_sizes function call.

* Removed the unnecessary if statement .

* Removed logging.info message, which is not needed.

4cd0a685

07 Nov, 2022 3 commits
- alltoall returns tensor list with None values, which is failing torch.cat(). (#4788) · e3bf1825
  kylasa authored Nov 07, 2022
  
  e3bf1825
- [Dist] Create <graph_name>_stats.txt file if it does not exist before ParMETIS execution (#4791) · 98b9e0fa
  kylasa authored Nov 07, 2022
```
* check if stats file exists, if not create one before parmetis run

* correct the typo error and correctly use constants.GRAPH_NAME
```
  98b9e0fa
- Reading files in chunks to reduce the memory footprint of pyarrow (#4795) · 53117c51
  kylasa authored Nov 07, 2022
```
All tasks completed.
```
  53117c51
04 Nov, 2022 2 commits

[Dist] deprecate etype and always use canonical etype for partition and load (#4777) · ed8e9c44

Rhett Ying authored Nov 04, 2022

* [Dist] deprecate etype and always use canonical etype for partition and load

* enable canonical etypes in dist part pipeline

* resolve rebase conflicts

* fix lint

* fix test failure

* throw exception if outdated part config is loaded

* refine

* refine

* revert unnecessary change

* fix typo

ed8e9c44

[Dist] remove dependency of load_partition_book in change tool (#4802) · dccf1f16

peizhou001 authored Nov 04, 2022



* remove dependency of load_partition_book in change tool

* fix issue

* fix issue
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-19.ap-northeast-1.compute.internal>

dccf1f16

31 Oct, 2022 1 commit
- Updated the key to retrieve correct rank of a process (#4756) · 9a72b78b
  kylasa authored Oct 31, 2022
```
Merging this PR to the master branch
```
  9a72b78b