Commits · c1e01b1d805c639272c5de8e4d11f6d936cda893 · OpenDAS / dgl

06 Aug, 2022 1 commit

[Distributed] use alltoall fix to bypass gloo - alltoallv bug in distributed partitioning (#4311) · c1e01b1d

kylasa authored Aug 05, 2022

* Alltoall fix to bypass the gloo alltoallv bug that is preventing further testing

1. Replaced the alltoallv gloo wrapper call with alltoall messages.
2. All messages are padded to the same length.
3. The receiving side unpads the messages and continues processing.
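
The three steps above can be sketched without any communication library. This is a minimal single-process simulation, not the code from the commit: `simulated_alltoall` stands in for a fixed-size alltoall call, and all function names here are hypothetical. It assumes the true message lengths are exchanged first so that the receiver knows how much padding to strip.

```python
from typing import List

PAD = 0  # filler value; never read back because true lengths are tracked separately


def pad_messages(messages: List[List[int]], width: int) -> List[List[int]]:
    """Right-pad every outgoing message to `width` elements."""
    return [msg + [PAD] * (width - len(msg)) for msg in messages]


def simulated_alltoall(send: List[List[list]]) -> List[List[list]]:
    """Stand-in for a fixed-size alltoall: rank `dst` receives send[src][dst] from every `src`."""
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


def alltoallv_via_padding(send: List[List[List[int]]]) -> List[List[List[int]]]:
    """Emulate alltoallv (variable-length messages) using equal-length exchanges only."""
    world = len(send)
    # Exchange the true lengths first (each length is itself a fixed-size message).
    lengths = simulated_alltoall([[[len(m)] for m in per_rank] for per_rank in send])
    # Pad every message to the global maximum length. In a real distributed run this
    # maximum would have to be agreed on by all ranks; here it is computed directly.
    width = max(len(m) for per_rank in send for m in per_rank)
    padded = [pad_messages(per_rank, width) for per_rank in send]
    # Fixed-size exchange of the padded messages.
    received = simulated_alltoall(padded)
    # Receiving side strips the padding using the lengths it was told.
    return [
        [received[dst][src][: lengths[dst][src][0]] for src in range(world)]
        for dst in range(world)
    ]
```

With two ranks, where `send[s][d]` is the message from rank `s` to rank `d`, the call `alltoallv_via_padding([[[1], [2, 3]], [[4, 5, 6], []]])` gives rank 0 the messages `[1]` and `[4, 5, 6]`, and rank 1 the messages `[2, 3]` and `[]`. The cost of this workaround is extra traffic: every message travels at the size of the largest one.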

* Code changes to address CI comments

1. Removed unused functions from gloo_wrapper.py
2. Changed the function signature of alltoallv_cpu_data as suggested.
3. Added docstring to include more description of the functionality inside alltoallv_cpu_data. Included more asserts to validate the assumptions.

* Changed the function name appropriately

Changed the function name from "alltoallv_cpu_data" to "alltoallv_cpu", which I believe is appropriate because the underlying functionality provides alltoallv, which is basically alltoall_cpu plus padding.

* Added code and text to address the review comments.

1. Changed the function name to indicate the local use of this function.
2. Changed docstring to indicate the assumptions made by alltoallv_cpu function.

* Removed unused function from import statement

Removed unused/removed function from import statement.

c1e01b1d

23 Jul, 2022 1 commit

[Distributed] Change for the new input format for distributed partitioning (#4273) · 7f8e1cf2

kylasa authored Jul 23, 2022

* Code changes to address the updated file format support for massively large graphs.

1. Updated the docstring for the starting function "gen_dist_partitions" to describe the newly proposed file format for the input dataset.
2. Code that depended on the structure of the old metadata json object has been updated to read from the newly proposed metadata file.
3. Fixed some errors where functions were invoked and the calling function expects return values from the invoked function.
4. This modified code has been tested on the "mag" dataset using 4-way partitions, and the results were verified.

* Code changes to address the CI review comments

1. Improved docstrings for some functions.
2. Added a new function in the utils.py to compute the id ranges and this is used in multiple places.
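
The helper itself is not shown in this log. As a rough sketch of what computing id ranges can look like, assuming each type occupies a contiguous half-open range of global ids and types are laid out back to back in a fixed order (the name `compute_id_ranges` and its signature are hypothetical, not the function added to utils.py):

```python
from itertools import accumulate
from typing import Dict, Tuple


def compute_id_ranges(type_counts: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Map each type name to a half-open [start, end) range of global ids.

    Hypothetical helper: types are placed back to back in insertion order,
    so each range starts where the previous one ended.
    """
    ends = list(accumulate(type_counts.values()))  # running totals = range ends
    starts = [0] + ends[:-1]                       # each start is the previous end
    return {name: (s, e) for name, s, e in zip(type_counts, starts, ends)}
```

For example, counts of 3 authors, 5 papers and 2 venues yield the ranges `(0, 3)`, `(3, 8)` and `(8, 10)`. Keeping this in one place means every caller agrees on where a type's ids begin and end.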

* Added TODO to indicate the redundant data structure.

Because of the new file format changes, one of the dictionaries (node_feature_tids, node_tids) will be redundant. Added TODO text so that this will be removed in the next iteration of code changes.

7f8e1cf2

13 Jul, 2022 1 commit

Support new format for multi-file support in distributed partitioning. (#4217) · dad3606a

kylasa authored Jul 12, 2022

* Code changes for the following

1. Generating node data at each process
2. Reading csv files using pyarrow
3. Feature-complete code.

* Removed some typos because of which unit tests were failing

1. Changed the file name to the correct file name when loading edges from file.
2. When storing node features after shuffling, use the correct key to store the global-nids of the node features received after transmission.

* Code changes to address CI comments by reviewers

1. Removed some redundant code and added text in the doc-strings to describe the functionality of some functions.
2. Function signatures and invocations now match w.r.t. the argument list.
3. Added a detailed description of the metadata json structure so that users understand the type of information present in this file and how it is used throughout the code.

* Addressing code review comments

1. Addressed all the CI comments and some of the changes include simplifying the code related to the concatenation of lists and enhancing the docstrings of functions which are changed in this process.

* Update docstrings of two functions appropriately in response to code review comments

Removed "todo" from the docstring of the gen_nodedata function.
Added "todo" to the gen_dist_partitions function when node-id to partition-id's are read for the first time.

Removed 'num-node-weights' from the docstring for the get_dataset function and added schema_map docstring to the argument list.

dad3606a

05 Jul, 2022 3 commits

Added code to support multiple-file-support feature and removed singl… (#4188) · 9948ef4d

kylasa authored Jul 04, 2022

* Added code to support multiple-file-support feature and removed single-file-support code

1. Added code to read dataset in multiple-file-format
2. Removed code for single-file format

* added files missing in the previous commit

This commit includes dataset_utils.py, which reads the dataset in multiple-file-format, gloo_wrapper function calls to support exchanging dictionaries as objects and helper functions in utils.py

* Update convert_partition.py

Updated the call to the "create_metadata_json" function to include partition_id so that each rank creates only its own metadata object; these are later accumulated on rank-0 to create the graph-level metadata json file.
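
The per-rank-then-merge pattern described here can be illustrated as follows. This is only a sketch under assumptions: the field names (`num_nodes`, `num_edges`, `part-<id>`), both function names, and the merged layout are hypothetical and are not taken from convert_partition.py.

```python
import json
from typing import Dict, List


def create_partition_metadata(part_id: int, num_nodes: int, num_edges: int) -> Dict:
    """Metadata one rank builds for its own partition (hypothetical fields)."""
    return {f"part-{part_id}": {"num_nodes": num_nodes, "num_edges": num_edges}}


def merge_metadata(graph_name: str, per_rank: List[Dict]) -> Dict:
    """What rank 0 might do after gathering every rank's metadata object."""
    merged = {
        "graph_name": graph_name,
        "num_parts": len(per_rank),
        "num_nodes": 0,
        "num_edges": 0,
    }
    for meta in per_rank:
        for key, part in meta.items():
            merged[key] = part                       # keep the per-partition entry
            merged["num_nodes"] += part["num_nodes"]  # accumulate graph-level totals
            merged["num_edges"] += part["num_edges"]
    return merged


# Simulate two ranks, then the merge that would happen on rank 0.
gathered = [create_partition_metadata(0, 10, 40), create_partition_metadata(1, 12, 35)]
graph_metadata = merge_metadata("toy", gathered)
graph_metadata_json = json.dumps(graph_metadata, indent=2)
```

Passing the partition id in keeps each rank's object small and non-overlapping, so the merge on rank 0 is a plain union of keys plus a sum of counts.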

* addressing code review comments during the CI process

code changes resulting from the code review comments received during the CI process.

* Code reorganization

Addressing CI comments and code reorganization for easier understanding.

* Removed commented out line

removed commented out line.

9948ef4d

Merge branch 'dist_part' of github.com:dmlc/dgl into dist_part · b7187dd3
Da Zheng authored Jul 04, 2022

b7187dd3

Revert "Revert "[Distributed Training Pipeline] Initial implementation of... · a324440f

Da Zheng authored Jul 04, 2022

Revert "Revert "[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)" (#4037)"

This reverts commit 7c598aac.

a324440f

02 Jul, 2022 1 commit
- [Example][Bugfix] Bugfix for dgi example (#4201) · 0f0e7c7f
  Chang Liu authored Jul 01, 2022
  
  0f0e7c7f
01 Jul, 2022 3 commits

[BugFix] check whether etype sorted when sampling (#4198) · dcf16992
Rhett Ying authored Jul 01, 2022

dcf16992

[Example][Refactor] Minor update on the golden example (#4197) · a9768cb3

Chang Liu authored Jun 30, 2022



* minor update on golden example

* update

* update

* Update README
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>

a9768cb3

[Feature] extend sort_csr/csc_by_tag to edge (#4164) · 6a6597a0

Rhett Ying authored Jul 01, 2022



* [Feature] extend sort_csr/csc_by_tag to edge

* fix test failure in tensorflow

* refine sorting by edges

* fix docstring

* remove unnecessary mem
Co-authored-by: Xin Yao <xiny@nvidia.com>

6a6597a0

30 Jun, 2022 5 commits

[Example][Refactor] Regolden graphsage example for future guide (#4186) · b76d0ed1

Chang Liu authored Jun 30, 2022



* Regolden graphsage example to guide others

* update golden

* update

* Update example and propagate to original folder

* Update to remove ^M (windows DOS) character

* update

* Merge file changes and update README

* Minor comment update
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
Co-authored-by: Mufei Li <mufeili1996@gmail.com>

b76d0ed1

Fix example crashes due to DGL API update (#4194) · a6bd96aa
Chang Liu authored Jun 30, 2022
```
Co-authored-by: Xin Yao <xiny@nvidia.com>
```
a6bd96aa

[CI] Reduce CI workload (#4196) · f7dae453

Minjie Wang authored Jun 30, 2022

* try optimize CI

* fix go test; adjust timing report

* disable certain tests for mx/tf backends

* fix ut

* add pydantic

f7dae453

[Doc] fix typo (#4193) · 7735473b
Quan (Andy) Gan authored Jun 30, 2022
```
Co-authored-by: Xin Yao <xiny@nvidia.com>
```
7735473b

[bugfix] Implement `__setstate__` for Column (fixes #4107) (#4174) · d2a22984

nv-dlasalle authored Jun 29, 2022



* * Workaround for graph data saving/loading compatibility problem in Column class.  There may be more places in DGL with the same issue, due to using Python serialization, instead of a more cohesive, comprehensive strategy.  This is just a local fix.

* Add checking for non-empty states

* Add unit test

* Handle the case of columns without storage
Co-authored-by: ndickson <ndickson@nvidia.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
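
For readers unfamiliar with the mechanism: `__setstate__` is the hook Python's pickle calls on a freshly created instance to restore its saved attributes, which makes it a natural place to tolerate state saved by an older version of a class. The following is a toy illustration, not DGL's `Column` class; the attribute names and the `restore_from_state` helper are assumptions made for the example.

```python
class Column:
    """Toy stand-in for a feature column; not DGL's actual class."""

    def __init__(self, storage=None, device=None):
        self.storage = storage
        self.device = device  # assumed to be an attribute added in a newer version

    def __setstate__(self, state):
        # Set defaults first so attributes missing from old saved state still exist.
        self.storage = None
        self.device = None
        if state:  # an empty or None state leaves the defaults in place
            self.__dict__.update(state)


def restore_from_state(state):
    """Mimic what unpickling does: create the instance without __init__, then restore."""
    obj = Column.__new__(Column)
    obj.__setstate__(state)
    return obj


# State as an older class version might have saved it: no "device" key.
old_column = restore_from_state({"storage": [1, 2, 3]})
```

Here `old_column.storage` is `[1, 2, 3]` and `old_column.device` falls back to `None` instead of raising `AttributeError` on first access.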

d2a22984

29 Jun, 2022 6 commits

code changes for bug fixes identified during mag_lsc dataset (#4187) · 3ccd973c

kylasa authored Jun 29, 2022

* code changes for bug fixes identified during mag_lsc dataset

1. Changed the call from torch.Tensor() to torch.from_numpy() to address memory corruption issues when creating large tensors. The tricky part is that the original call works correctly for small tensors.
2. Changed the dgl.graph() function call to include the "num_nodes" argument to explicitly specify all the nodes in a graph partition.

* Update convert_partition.py

Moved the changes to the "create_metadata_json" function into the "multiple-file-format" support change, where they are more appropriate, since multiple-machine testing was done with these code changes.

* Addressing review comments.

Removed space as suggested at the end of the line

3ccd973c

Update test_transform.py (#4190) · 8b19c287
Mufei Li authored Jun 29, 2022

8b19c287
[Doc] Unify the minimal versions required for PyTorch/TensorFlow/MXNet (#4180) · 32f12ee1
Xin Yao authored Jun 29, 2022

32f12ee1

[Performance] Optimize the use of alternative streams in dataloader (#4177) · 5bef48df

Xin Yao authored Jun 29, 2022

* fix using alternative streams

* use an alternative stream for subgraph transferring

* fix StreamContext when stream is None

5bef48df

[CI] Upgrade software version of CI docker image (#4189) · 5640b129
Rhett Ying authored Jun 29, 2022
```
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
```
5640b129

[bugfix] Allow communicators of size one when NCCL is missing (#3713) · 1dddaad4

nv-dlasalle authored Jun 28, 2022



* Update nccl communicator for when NCCL is missing

* Use static_cast

* Add doc string

* Fix whitespace

* Restrict unit test to GPU runs
Co-authored-by: Xin Yao <xiny@nvidia.com>

1dddaad4

28 Jun, 2022 4 commits
- [Bug Fix] Fix A Bug Related to GroupRevRes (#4181) · a25a14f2
  Mufei Li authored Jun 28, 2022
```
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
```
  a25a14f2
- [BugFix] fix build issue on mac OS (#4175) · 15188611
  Rhett Ying authored Jun 28, 2022
```
* [BugFix] fix build issue on mac OS

* refine
```
  15188611
- Update (#4178) · cb39dbfa
  Mufei Li authored Jun 28, 2022
  
  cb39dbfa
- [DGL-Go] Inference for Graph Prediction Pipeline (#4157) · 150e9273
  Mufei Li authored Jun 28, 2022
```
* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update
```
  150e9273
27 Jun, 2022 4 commits

[Bug][Feature] Added more missing FP16 specializations (#4140) · a5d8460c

ndickson-nvidia authored Jun 27, 2022

* * Added missing specializations for `__half` of `DLDataTypeTraits`, `IndexSelect`, `Full`, `Scatter_`, `CSRGetData`, `CSRMM`, `CSRSum`, `IndexSelectCPUFromGPU`
* Fixed casting issue in `_LinearSearchKernel` that was preventing it from supporting `__half`
* Added `#if`'d out specializations of `CSRGEMM`, `CSRGEAM`, and `Xgeam`, which would require functions that aren't currently provided by cublas

* * Added more specific error messages for unimplemented FP16 specializations of Xgeam, CSRGEMM, and CSRGEAM

* * Added missing instantiation of DLDataTypeTraits<__half>::dtype

* * Fixed linter error
* Added clearer comment explaining why the cast to long long is necessary

* * Worked around a compile error in some particular setup, where __half can't be constructed on the host side

* * Fixed linter formatting errors

* * Changes to comments as recommended

* * Made recommended changes to logging errors in FP16 specializations
* Also changed the existing Xgeam function for unsupported data types from LOG(INFO) to LOG(FATAL)

a5d8460c

[Bugfix] Fix that pin_prefetcher is not actually enabled (#4169) · b8f905f1
Xin Yao authored Jun 27, 2022

b8f905f1
[BugFix] fix rpc-related build issue on mac OS (#4168) · 10db5d0b
Rhett Ying authored Jun 27, 2022
```
* [BugFix] fix rpc-related build issue on mac OS

* add warning message

* add warning message
```
10db5d0b

[Dist] enable USE_EPOLL in default (#4167) · 9d425315

Rhett Ying authored Jun 27, 2022

* [Dist] enable USE_EPOLL in default

* fix build issue on windows

* fix build issue on windows

* fix build issue on windows

* fix build issue on windows

* fix build issue on windows

* fix build issue

9d425315

24 Jun, 2022 2 commits

[Doc] fix a bug in guide_cn (#4149) · d1f6f3a8
PotatoChipsNinja authored Jun 24, 2022
```
Co-authored-by: Xin Yao <xiny@nvidia.com>
```
d1f6f3a8

[Performance][Optimizer] Enable using UVA and FP16 with SparseAdam Optimizer (#3885) · 020f0249

nv-dlasalle authored Jun 23, 2022



* Add uva by default to embedding

* More updates

* Update optimizer

* Add new uva functions

* Expose new pinned memory function

* Add unit tests

* Update formatting

* Fix unit test

* Handle auto UVA case when training is on CPU

* Allow per-embedding decisions for whether to use UVA

* Address sparse_optim.py comments

* Remove unused templates

* Update unit test

* Use dgl allocate memory for pinning

* allow automatically unpin

* workaround for d2h copy with a different dtype

* fix linting

* update error message

* update copyright
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>

020f0249

23 Jun, 2022 5 commits

[BugFix] Fix Correct&Smooth (#4102) (#4158) · 548c85ff

Lucas Prieto authored Jun 23, 2022


Co-authored-by: Mufei Li <mufeili1996@gmail.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>

548c85ff

[Example][Bugfix] Remove all torchtext legacy-related APIs for pytorch/pinsage example (#4130) · 598d746e

Chang Liu authored Jun 23, 2022



* Remove all torchtext legacy-related APIs

* Remove unused BagOfWordsPretrained class, and fix some typos
Co-authored-by: Mufei Li <mufeili1996@gmail.com>

598d746e

[Bugfix][Rework] Automatically unpin tensors pinned by DGL (rework #3997) (#4135) · 077e002f

Xin Yao authored Jun 23, 2022



* Explicitly unpin tensoradapter allocated arrays

* Undo unrelated change

* Add unit test

* update unit test

* add pinned_by_dgl flag to NDArray::Container

* use dgl.ndarray for holding the pinning status

* update multi-gpu uva inference

* reinterpret cast NDArray::Container* to DLTensor* in MoveAsDLTensor

* update unpin column and examples

* add unit test for unpin column
Co-authored-by: Dominique LaSalle <dlasalle@nvidia.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>

077e002f

[Fix] Fix compiler warnings - part 1 (#4051) · 1ad65879

Triston authored Jun 22, 2022



* Fix a cub compile error for CUDA 11.5

* Fix comparison of integer expressions of different signedness in coo_sort.cu file

* Fix comparison of integer expressions of different signedness in cuda_compact_graph.cu file

* Remove never referenced variable in spmm.cu

* Fix comparison of integer expressions of different signedness in rowwise_pick.h file

* Fix comparison of integer expressions of different signedness in choice.cc file

* Remove never referenced variable col_data in spat_op_impl_coo.cc

* Remove never referenced variable allowed in global_uniform.cc

* Fix comparison of integer expressions of different signedness in graph.cc

* Fix comparison of integer expressions of different signedness in graph_apis.cc

* Fix the un-used ctx variable in ndarray_partition.cc file for cpu only build

* Fix comparison of integer expressions of different signedness in libra_partition.cc

* Fix comparison of integer expressions of different signedness in graph_op.cc
Co-authored-by: Triston Cao <tristonc@nvidia.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>

1ad65879

[Dist] etype is not guaranteed to be sorted (#4156) · ab1b2811
Rhett Ying authored Jun 23, 2022

ab1b2811

22 Jun, 2022 3 commits
- [Bug Fix] Fix the case when reverse_edge is False for citation graphs (#3840) · 4d3c01d6
  Mufei Li authored Jun 22, 2022
```
* Update citation_graph.py

* Update

* Update

* Update
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
```
  4d3c01d6
- [Bug] Fix problem with ShaDowKHopSampler working with reverse edge type exclusion (#4145) · 71157b05
  Quan (Andy) Gan authored Jun 22, 2022
```
* fix

* fix

* Update utils.py
```
  71157b05
- [BugFix] fix unstable sort when using dataloader with HeteroGraph (#4147) · 794ec4a4
  maqy authored Jun 22, 2022
```
* fix unstable sort

* add torch version check

* reformat

* split too long comments

* Update dataloader.py
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
```
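
The diff for this fix is not shown here. As a general illustration of why sort stability matters when grouping items by type (not the code from dataloader.py; both function names are hypothetical), a stable sort keeps items of the same type in their original relative order, whereas an unstable one may shuffle them from run to run:

```python
from typing import List, Sequence


def stable_argsort(keys: Sequence[int]) -> List[int]:
    """Indices that sort `keys`, with ties kept in their original order.

    Python's built-in sort is guaranteed to be stable, so equal keys never swap.
    """
    return sorted(range(len(keys)), key=lambda i: keys[i])


def group_by_type(items: Sequence[str], types: Sequence[int]) -> List[str]:
    """Reorder items so equal types are contiguous, preserving arrival order within a type."""
    return [items[i] for i in stable_argsort(types)]
```

For types `[1, 0, 1, 0, 0]` the stable order of indices is `[1, 3, 4, 0, 2]`: the three type-0 items come first in their original order, followed by the two type-1 items in theirs.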
  794ec4a4
21 Jun, 2022 1 commit

[DGL-Go] Inference for Node Prediction Pipeline (full & ns) (#4095) · 31e4a89b

Mufei Li authored Jun 21, 2022

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

31e4a89b