Commits · 17f1432ab2c74bed54df863be48e23b4113cbb37 · OpenDAS / dgl

27 Jul, 2022 1 commit
- [Log] fix confusing error log in TCPSocket::Bind() (#4299) · 069068aa
  Rhett Ying authored Jul 27, 2022
```
* [Log] fix confusing error log in TCPSocket::Bind()

* fix lint
```
26 Jul, 2022 1 commit

[Feature] Add CUDA Weighted Randomwalk Sampling (#4243) · 7e6a6b4a

Dewvin authored Jul 26, 2022



* [Feature] Add CUDA Weighted Randomwalk Sampling

* [Feature] Add CUDA Weighted Randomwalk Sampling

* [Feature] Add CUDA Weighted Randomwalk Sampling

* [Feature] Add CUDA Weighted Randomwalk Sampling

* fix empty prob array && enable non-uniform for restart && enable unit tests

* update doc and guide for randomwalk and pinsage

* update comments
Co-authored-by: zhenliangqiu <ubuntu@ip-172-31-24-245.ap-southeast-1.compute.internal>
Co-authored-by: xiny <xiny@nvidia.com>


15 Jul, 2022 1 commit
- decompose (#4259) · 9a7ad16e
  Quan (Andy) Gan authored Jul 15, 2022
  
09 Jul, 2022 1 commit
- [Bugfix] Add CUDA context availability check before setting curand seed (#4223) · 1feec870
  Xin Yao authored Jul 09, 2022
  
07 Jul, 2022 1 commit
- [Performance] Redirect `AllocWorkspace` to PyTorch's allocator if available (#4199) · 9ee7ced5
  Xin Yao authored Jul 07, 2022
  
01 Jul, 2022 2 commits
- [BugFix] check whether etype sorted when sampling (#4198) · dcf16992
  Rhett Ying authored Jul 01, 2022
  
- [Feature] extend sort_csr/csc_by_tag to edge (#4164) · 6a6597a0
  Rhett Ying authored Jul 01, 2022
```
* [Feature] extend sort_csr/csc_by_tag to edge

* fix test failure in tensorflow

* refine sorting by edges

* fix docstring

* remove unnecessary mem
Co-authored-by: Xin Yao <xiny@nvidia.com>
```
29 Jun, 2022 1 commit

[bugfix] Allow communicators of size one when NCCL is missing (#3713) · 1dddaad4

nv-dlasalle authored Jun 28, 2022



* Update nccl communicator for when NCCL is missing

* Use static_cast

* Add doc string

* Fix whitespace

* Restrict unit test to GPU runs
Co-authored-by: Xin Yao <xiny@nvidia.com>


27 Jun, 2022 2 commits

[Bug][Feature] Added more missing FP16 specializations (#4140) · a5d8460c

ndickson-nvidia authored Jun 27, 2022

* * Added missing specializations for `__half` of `DLDataTypeTraits`, `IndexSelect`, `Full`, `Scatter_`, `CSRGetData`, `CSRMM`, `CSRSum`, `IndexSelectCPUFromGPU`
* Fixed casting issue in `_LinearSearchKernel` that was preventing it from supporting `__half`
* Added `#if`'d out specializations of `CSRGEMM`, `CSRGEAM`, and `Xgeam`, which would require functions that aren't currently provided by cublas

* * Added more specific error messages for unimplemented FP16 specializations of Xgeam, CSRGEMM, and CSRGEAM

* * Added missing instantiation of DLDataTypeTraits<__half>::dtype

* * Fixed linter error
* Added clearer comment explaining why the cast to long long is necessary

* * Worked around a compile error in some particular setup, where __half can't be constructed on the host side

* * Fixed linter formatting errors

* * Changes to comments as recommended

* * Made recommended changes to logging errors in FP16 specializations
* Also changed the existing Xgeam function for unsupported data types from LOG(INFO) to LOG(FATAL)


[BugFix] fix rpc-related build issue on mac OS (#4168) · 10db5d0b
Rhett Ying authored Jun 27, 2022
```
* [BugFix] fix rpc-related build issue on mac OS

* add warning message

* add warning message
```

24 Jun, 2022 1 commit

[Performance][Optimizer] Enable using UVA and FP16 with SparseAdam Optimizer (#3885) · 020f0249

nv-dlasalle authored Jun 23, 2022



* Add uva by default to embedding

* More updates

* Update optimizer

* Add new uva functions

* Expose new pinned memory function

* Add unit tests

* Update formatting

* Fix unit test

* Handle auto UVA case when training is on CPU

* Allow per-embedding decisions for whether to use UVA

* Address sparse_optim.py comments

* Remove unused templates

* Update unit test

* Use dgl allocate memory for pinning

* allow automatically unpin

* workaround for d2h copy with a different dtype

* fix linting

* update error message

* update copyright
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>


23 Jun, 2022 2 commits

[Bugfix][Rework] Automatically unpin tensors pinned by DGL (rework #3997) (#4135) · 077e002f

Xin Yao authored Jun 23, 2022



* Explicitly unpin tensoradapter allocated arrays

* Undo unrelated change

* Add unit test

* update unit test

* add pinned_by_dgl flag to NDArray::Container

* use dgl.ndarray for holding the pinning status

* update multi-gpu uva inference

* reinterpret cast NDArray::Container* to DLTensor* in MoveAsDLTensor

* update unpin column and examples

* add unit test for unpin column
Co-authored-by: Dominique LaSalle <dlasalle@nvidia.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>


[Fix] Fix compiler warnings - part 1 (#4051) · 1ad65879

Triston authored Jun 22, 2022



* Fix a cub compile error for CUDA 11.5

* Fix comparison of integer expressions of different signedness in coo_sort.cu file

* Fix comparison of integer expressions of different signedness in cuda_compact_graph.cu file

* Remove never referenced variable in spmm.cu

* Fix comparison of integer expressions of different signedness in rowwise_pick.h file

* Fix comparison of integer expressions of different signedness in choice.cc file

* Remove never referenced variable col_data in spat_op_impl_coo.cc

* Remove never referenced variable allowed in global_uniform.cc

* Fix comparison of integer expressions of different signedness in graph.cc

* Fix comparison of integer expressions of different signedness in graph_apis.cc

* Fix the un-used ctx variable in ndarray_partition.cc file for cpu only build

* Fix comparison of integer expressions of different signedness in libra_partition.cc

* Fix comparison of integer expressions of different signedness in graph_op.cc
Co-authored-by: Triston Cao <tristonc@nvidia.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>


20 Jun, 2022 1 commit
- [Dist] re-try to receive rpc ndarray msg (#4142) · 3ffe0c09
  Rhett Ying authored Jun 20, 2022
  
14 Jun, 2022 1 commit
- [Bugfix] Disable non-atomic atomic operations (#4117) · 473bf15f
  nv-dlasalle authored Jun 14, 2022
```
* Disable non-atomic atomic operations

* Improve error message

* Make error message more friendly
```
11 Jun, 2022 1 commit

[Fix] Wrap all CUDA runtime API/CUB calls with macro (#4083) · 60b1c992

Xin Yao authored Jun 11, 2022



* Wrap all CUDA runtime API/CUB calls with macro

* remove the usage of explicit cudaMalloc in favor of AllocWorkspace

* fix typo
Co-authored-by: Israt Nisa <neesha295@gmail.com>


08 Jun, 2022 1 commit

[Dist] enable time out when fetching msg (#4043) · cac3720b

Rhett Ying authored Jun 08, 2022

* [Dist] enable time out when fetching msg

* fix lint error

* minor refinements

* improve minor log

* fix dist test

* fix timeout issue in tensorpipe


07 Jun, 2022 1 commit

[Bug][Feature] Added cublasGemm<__half> specialization (#3988) (#4029) · eabcc58e

ndickson-nvidia authored Jun 07, 2022

* * Added specialization of cublasGemm function for `__half` type, to try to address https://github.com/dmlc/dgl/issues/3988



* * Added USE_FP16 guard

* * Added test cases to test_segment_mm, to test newly-added FP16 specialization of cublasGemm

* * Replaced for loop in test_segment_mm with pytest.mark.parametrize, as recommended
Co-authored-by: Xin Yao <xiny@nvidia.com>


06 Jun, 2022 3 commits

[Bug] Added common operations for FP16 on older GPUs (#4079) · ea44da50

ndickson-nvidia authored Jun 06, 2022

* * Added support for common operations on FP16 (`half` or `__half`) for older GPU architectures
* Fixed an issue with previous check for FP16 support

* * Removing FP16 type checks, since they should no longer be needed

* * Fixed AtomicAdd to be atomic for `float` and `double` for old GPU architectures. Unfortunately, atomicCAS for unsigned short seems to be unavailable until architecture 70, so half will have to stay non-atomic on old GPUs.

* * Fixed non-atomic version of `AtomicAdd<half>` for older GPUs to return the old value instead of the new value
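
The two bullets above rely on a compare-and-swap (CAS) retry loop and on the convention that an atomic add returns the value held *before* the addition. The following is only a CPU-side Python model of that pattern, not the CUDA code from the commit; `Cell`, `compare_and_swap`, and `atomic_add` are hypothetical names, and a lock stands in for the hardware primitive.

```python
import threading


class Cell:
    """A memory word with a compare-and-swap primitive (CPU-side model only)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        # Store `new` only if the word still equals `expected`;
        # always return the value that was found.
        with self._lock:
            found = self._value
            if found == expected:
                self._value = new
            return found


def atomic_add(cell, delta):
    """Add `delta` to `cell` and return the value held before the add."""
    old = cell.load()
    while True:
        found = cell.compare_and_swap(old, old + delta)
        if found == old:
            return old  # report the old value, not old + delta
        old = found  # another thread changed the word; retry with its value
```

For instance, `atomic_add(Cell(5), 3)` returns `5` and leaves `8` in the cell. On a GPU the comparison is made on the raw bit pattern of the word, which this model glosses over by comparing Python values.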


parallelize csr2coo (#4081) · 31a81438
Quan (Andy) Gan authored Jun 06, 2022
```
Co-authored-by: Xin Yao <xiny@nvidia.com>
```

wrap all cuda kernel calls with macro (#4066) · 6014623d

Xin Yao authored Jun 06, 2022


Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>
Co-authored-by: Israt Nisa <neesha295@gmail.com>


28 May, 2022 3 commits
- Change warning message for tensoradapter when not found (#4055) · 9922f41f
  Quan (Andy) Gan authored May 29, 2022
```
* change warning message

* Update tensordispatch.cc
```
- Revert "[bugfix] Explicitly unpin tensoradapter allocated arrays (#3997)" (#4061) · 00c09b9f
  Quan (Andy) Gan authored May 28, 2022
```
This reverts commit fdd1fe19.
```
- add sanity check (#4050) · c577dc9f
  Quan (Andy) Gan authored May 28, 2022
  
26 May, 2022 1 commit

[Build][Tests] Enable FP16 for GPU builds in CI (#4030) · 7a065a9c

nv-dlasalle authored May 26, 2022

* Enable FP16 for GPU builds in CI

* Limit default GPU archs to pascal and above

* Disable FP16 dispatching for cuda architectures less than 60

* Fix linting

* Fix typos


25 May, 2022 1 commit
- [Bugfix] Cython CAPI holding GIL causes deadlock when Python callback is asynchronous (#4036) · 3c129ad7
  Minjie Wang authored May 25, 2022
```
* cython nogil

* move APIs to internal and add unit test

* fix lint

* disable callback array test
```
17 May, 2022 1 commit

change the curandState and launch dimension of CSRRowwiseSample kernel (#3990) · bacf2ab4

paoxiaode authored May 17, 2022



* Change the curand_init parameter

* Change the curand_init parameter

* commit

* commit

* change the curandState and launch dim of CSRRowwiseSample kernel

* commit

* keep _CSRRowWiseSampleReplaceKernel in sync
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>


16 May, 2022 2 commits
- [bugfix] Explicitly unpin tensoradapter allocated arrays (#3997) · fdd1fe19
  nv-dlasalle authored May 16, 2022
```
* Explicitly unpin tensoradapter allocated arrays

* Undo unrelated change

* Add unit test

* update unit test
```
- [Performance] Remove unnecessary induced vertices in EdgeSubgraph (#3978) · 03024f95
  Xin Yao authored May 16, 2022
```
* remove unnecessary induced vertices in EdgeSubgraph

* add unit test
```
12 May, 2022 1 commit
- Fix launch parameters index select kernel in sparse push (#3524) · 4177f729
  nv-dlasalle authored May 12, 2022
  
11 May, 2022 1 commit

[Dist] Enable maximum try times for socket backend via DGL_DIST_MAX_T… (#3977) · 22e218d3

Rhett Ying authored May 11, 2022

* [Dist] Enable maximum try times for socket backend via DGL_DIST_MAX_TRY_TIMES

* reset env before/after test

* print log for info when trying to connect

* fix

* print log in python instead of cpp
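
A bounded connection retry driven by `DGL_DIST_MAX_TRY_TIMES` might look roughly like the sketch below. Only the environment variable name comes from the commit; the helper `connect_with_retries`, its default of 1024 tries, and the exception type are assumptions made for illustration and are not DGL's implementation.

```python
import os
import time


def connect_with_retries(connect, default_max_tries=1024, delay_s=0.0):
    """Call `connect()` until it succeeds or the try budget is used up.

    The budget is read from DGL_DIST_MAX_TRY_TIMES; the default value and
    this helper itself are illustrative assumptions.
    """
    max_tries = int(os.environ.get("DGL_DIST_MAX_TRY_TIMES", default_max_tries))
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            # Log from the Python side so the user sees each attempt.
            print(f"connect attempt {attempt}/{max_tries} failed: {err}")
            time.sleep(delay_s)
    raise ConnectionError(f"gave up after {max_tries} tries") from last_error
```

Reading the variable at call time rather than at import time lets a test set it before the call and reset it afterwards, in the spirit of the "reset env before/after test" bullet.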


27 Apr, 2022 1 commit

[Feature] enable socket net_type for rpc (#3951) · 37be02a4

Rhett Ying authored Apr 28, 2022

* [Feature] enable socket net_type for rpc

* fix lint

* fix lint

* fix build issue on windows

* fix test failure on windows

* fix test failure

* fix cpp unit test failure

* net_type blocking max_try_times

* fix other comments

* fix lint

* fix comment

* fix lint

* fix cpp


26 Apr, 2022 1 commit

[Performance][GPU] Improving Disjoint Union kernel for Graph Dataloaders (#3895) · 6e46bbf5

ayasar70 authored Apr 26, 2022



* Based on issue #3436. Improving _SegmentCopyKernel's GPU utilization by switching to nonzero based thread assignment

* fixing lint issues

* Update cub for cuda 11.5 compatibility (#3468)

* fixing type mismatch

* tx guaranteed to be smaller than nnz. Hence removing last check

* minor: updating comment

* adding three unit tests for csr slice method to cover some corner cases

* timing repeatkernel

* clean

* clean

* clean

* updating _SegmentMaskColKernel

* Working on requests: removing sorted array check and adding comments to utility functions

* fixing lint issue

* Optimizing disjoint union kernel

* Trying to resolve compilation issue on CI

* [EMPTY] Relevant commit message here

* applying revision requests on cpu/disjoint_union.cc

* removing unnecessary casts

* remove extra space
Co-authored-by: Abdurrahman Yasar <ayasar@nvidia.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>


12 Apr, 2022 1 commit
- [Example] Cleaned GraphSAGE node classification example with PyTorch Lightning (#3863) · 0d878ff8
  Quan (Andy) Gan authored Apr 12, 2022
```
* cleaned pl node classification example

* conform to PL's method of updating the dataloader

* update

* lint

* fix test

* fix
```
11 Apr, 2022 1 commit

[Feature] Enable UVA for GPU PinSAGE and RandomWalk (#3857) · 5fcd7f29

Xin Yao authored Apr 11, 2022



* enable uva for pinsage sampler

* unit test

* modify some checks on the python side

* remove legacy random walk code

* update unit test

* update unit test

* fix unit test

* adjust checks

* move some checks to c++

* move max_nodes check to cuda kernel

* fix ci for tf
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>


09 Apr, 2022 1 commit

[BugFix] record/restore pin status when pickle/unpickle (#3914) · adb3a7c1

Rhett Ying authored Apr 09, 2022

* [BugFix] record/restore pin status when pickle/unpickle

* disable test on TF

* set version as expected

* unpin memory in test
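
Recording and restoring a pin flag across pickling can be sketched with `__getstate__` and `__setstate__`. The class `PinnedBuffer`, its `pin_` method, and the `was_pinned` key are hypothetical stand-ins, not DGL's API; a real implementation would page-lock host memory where the sketch only flips a flag.

```python
import pickle


class PinnedBuffer:
    """Hypothetical holder of host data with a pinned-memory flag."""

    def __init__(self, data):
        self.data = list(data)
        self.pinned = False

    def pin_(self):
        # A real implementation would page-lock the host memory here.
        self.pinned = True

    def __getstate__(self):
        # Record the pin status next to the payload.
        return {"data": self.data, "was_pinned": self.pinned}

    def __setstate__(self, state):
        self.data = state["data"]
        self.pinned = False
        # Pinning belongs to the process that owns the memory, so it is
        # redone after loading instead of being copied as a plain attribute.
        if state.get("was_pinned", False):
            self.pin_()


def roundtrip(obj):
    """Pickle and unpickle `obj` in memory."""
    return pickle.loads(pickle.dumps(obj))
```

With this in place, a buffer that was pinned before `roundtrip` comes back pinned, and one that was not comes back unpinned.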


05 Apr, 2022 1 commit

[Examples] Update graphsage multi-gpu example to use multiple GPUs for validation and testing. (#3827) · 27a6eb56

nv-dlasalle authored Apr 05, 2022

* Update graphsage multi-gpu example to use multiple GPUs for validation and testing.

* Remove argmax

* Fix rebase error

* Add more documentation to example and simplify

* Switch to name shared memory

* Add comment about how training is distributed

* Restore iteration count

* fix munmap error reporting for better error messages
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>


31 Mar, 2022 1 commit
- [Bugfix] Fix UVA sampling with partially specified node types (#3897) · 35e66f42
  Quan (Andy) Gan authored Mar 31, 2022
```
* fix uva with partial node types

* lint

* skip tensorflow unit test
```
27 Mar, 2022 1 commit

[Feature] METIS Partition with Communication Volume Minimization (#3821) · fbbca994

Cheng Wan authored Mar 27, 2022

* upd

* upd

* upd

* upd

* upd

* fix OpenMP compatibility issues

* typo

* partition

* misc

* fix typo

* num_parts=1

* import torch

* long

* print info

* print info

* print info

* upd

* remove debug code

* revert partition.py

* fix cut count

* fix cut count

* Revert "fix cut count"

This reverts commit 10926b4fd48f45c8f1ddb58be7db6c22e653effd.

* Revert "fix cut count"

This reverts commit 76465283bef093a2b4209ad70dd15d2437b2ec8a.

* type of deprecate

* typo in deprecate info

* fix typo

* use cv for partitioning

* CE

* no message

* revert

* typo

* add objtype

* no message

* fix bug

* fix bug

* fix bug

* ?

* semicolon

* drop tensors

* no message

* backward

* backward

* max op

* store X.shape

* th

* test

* Revert "test"

This reverts commit 92b3b2f64a3a1128590098fa03ce429c5466e6ce.

* test

* tolist

* debug

* to cuda

* tuple

* fix bug

* remove X

* no message

* fix bug

* workload balance

* Revert "workload balance"

This reverts commit d7f8e4a16ba2a7eabb4a9bb945523bfe6623e723.

* reverse

* Revert "reverse"

This reverts commit 8a71cf25685aa7d889b9b8881b46f7a16b7d6e6d.

* Revert "Revert "reverse""

This reverts commit 196b143932d5cf9813576ece7c990b63d322d063.

* Revert "Revert "Revert "reverse"""

This reverts commit cf9e89a07013582056e7cde235e51331aca7fa9c.

* no message

* Merge commit '5498cf05'

# Conflicts:
#	python/dgl/distributed/partition.py

* Revert "Merge commit '5498cf05'"

This reverts commit f79be2ad777897c7025b28308454cad81ad6bb27.

* fix bug

* third party

* no message

* try to avoid memory leak

* try to avoid memory leak

* avoid memory leak with no hope

* Revert "avoid memory leak with no hope"

This reverts commit c77befe9479f46758e744642f66dd209b50eef7d.

* no message

* Revert "no message"

This reverts commit 478cb28fe25fb1002b2f1dc202bb9bdaad8b2a56.

* del

* Revert "del"

This reverts commit 1b468e45ce646b400ff3ffa61a0b2da058b3bdfd.

* no message

* no message

* Revert "no message"

This reverts commit 92e4f5561ed42da0606618b2fff9f1ad5ed439d9.

* third party

* document

* Update metis_partition.cc

* Update metis_partition_hetero.cc

* Update metis_partition_hetero.cc

* Update partition.py

* Update partition.py

* Update partition.py
Co-authored-by: yzh119 <expye@outlook.com>
Co-authored-by: chwan-rice <54331508+chwan-rice@users.noreply.github.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
Co-authored-by: Da Zheng <zhengda1936@gmail.com>
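
For context on the commit title, the two partitioning objectives can be compared on a toy graph. The sketch below only computes the metrics and does not call METIS or DGL; the function names and the dictionary-based partition are illustrative assumptions. Edge cut counts edges that cross parts, while communication volume counts, for each vertex, the number of distinct other parts that hold at least one of its neighbors.

```python
def edge_cut(edges, part):
    """Number of undirected edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def communication_volume(edges, part):
    """Sum over vertices of the number of other parts holding a neighbor."""
    remote_parts = {v: set() for v in part}
    for u, v in edges:
        if part[u] != part[v]:
            remote_parts[u].add(part[v])
            remote_parts[v].add(part[u])
    return sum(len(parts) for parts in remote_parts.values())


# Star graph: vertex 0 is the center; leaves 1..3 sit in another part.
star_edges = [(0, 1), (0, 2), (0, 3)]
star_part = {0: 0, 1: 1, 2: 1, 3: 1}
print(edge_cut(star_edges, star_part))              # 3
print(communication_volume(star_edges, star_part))  # 4
```

In the star example the center touches three cut edges but contributes only 1 to the volume, because its data has to reach just one remote part; each leaf also contributes 1.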


24 Mar, 2022 1 commit

[Bugfix] Fix multiple bugs and code refactor (#3841) · 223a3da5

Quan (Andy) Gan authored Mar 24, 2022



* fix

* remove setcxx methods

* move pin flag to CSR and COO matrix
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
