Commits · 020f02498c090b80b690a22bde5cae6bf46c3375 · OpenDAS / dgl

24 Jun, 2022 1 commit

[Performance][Optimizer] Enable using UVA and FP16 with SparseAdam Optimizer (#3885) · 020f0249

nv-dlasalle authored Jun 23, 2022



* Add uva by default to embedding

* More updates

* Update optimizer

* Add new uva functions

* Expose new pinned memory function

* Add unit tests

* Update formatting

* Fix unit test

* Handle auto UVA case when training is on CPU

* Allow per-embedding decisions for whether to use UVA

* Address sparse_optim.py comments

* Remove unused templates

* Update unit test

* Use dgl allocate memory for pinning

* allow automatically unpin

* workaround for d2h copy with a different dtype

* fix linting

* update error message

* update copyright
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>

020f0249

23 Jun, 2022 2 commits

[Bugfix][Rework] Automatically unpin tensors pinned by DGL (rework #3997) (#4135) · 077e002f

Xin Yao authored Jun 23, 2022



* Explicitly unpin tensoradapter allocated arrays

* Undo unrelated change

* Add unit test

* update unit test

* add pinned_by_dgl flag to NDArray::Container

* use dgl.ndarray for holding the pinning status

* update multi-gpu uva inference

* reinterpret cast NDArray::Container* to DLTensor* in MoveAsDLTensor

* update unpin column and examples

* add unit test for unpin column
Co-authored-by: Dominique LaSalle <dlasalle@nvidia.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>

077e002f

[Fix] Fix compiler warnings - part 1 (#4051) · 1ad65879

Triston authored Jun 22, 2022



* Fix a cub compile error for CUDA 11.5

* Fix comparison of integer expressions of different signedness in coo_sort.cu file

* Fix comparison of integer expressions of different signedness in cuda_compact_graph.cu file

* Remove never referenced variable in spmm.cu

* Fix comparison of integer expressions of different signedness in rowwise_pick.h file

* Fix comparison of integer expressions of different signedness in choice.cc file

* Remove never referenced variable col_data in spat_op_impl_coo.cc

* Remove never referenced variable allowed in global_uniform.cc

* Fix comparison of integer expressions of different signedness in graph.cc

* Fix comparison of integer expressions of different signedness in graph_apis.cc

* Fix the un-used ctx variable in ndarray_partition.cc file for cpu only build

* Fix comparison of integer expressions of different signedness in libra_partition.cc

* Fix comparison of integer expressions of different signedness in graph_op.cc
Co-authored-by: Triston Cao <tristonc@nvidia.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>

1ad65879

20 Jun, 2022 1 commit
- [Dist] re-try to receive rpc ndarray msg (#4142) · 3ffe0c09
  Rhett Ying authored Jun 20, 2022
  
  3ffe0c09
14 Jun, 2022 1 commit
- [Bugfix] Disable non-atomic atomic operations (#4117) · 473bf15f
  nv-dlasalle authored Jun 14, 2022
```
* Disable non-atomic atomic operations

* Improve error message

* Make error message more friendly
```
  473bf15f
11 Jun, 2022 1 commit

[Fix] Wrap all CUDA runtime API/CUB calls with macro (#4083) · 60b1c992

Xin Yao authored Jun 11, 2022



* Wrap all CUDA runtime API/CUB calls with macro

* remove the usage of explicit cudaMalloc in favor of AllocWorkspace

* fix typo
Co-authored-by: Israt Nisa <neesha295@gmail.com>

60b1c992

08 Jun, 2022 1 commit

[Dist] enable time out when fetching msg (#4043) · cac3720b

Rhett Ying authored Jun 08, 2022

* [Dist] enable time out when fetching msg

* fix lint error

* minor refinements

* improve minor log

* fix dist test

* fix timeout issue in tensorpipe

cac3720b
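
A minimal sketch of the bounded-wait idea named in this commit, using Python's standard `queue` module rather than DGL's RPC layer. The names `fetch_msg` and `timeout_s` are hypothetical and chosen only for illustration; they are not DGL APIs.

```python
import queue


def fetch_msg(msg_queue, timeout_s=0.1):
    """Return the next message, or None if nothing arrives within timeout_s.

    Hypothetical helper: it illustrates a bounded wait, not DGL's RPC code.
    """
    try:
        # Block for at most timeout_s seconds instead of waiting forever.
        return msg_queue.get(timeout=timeout_s)
    except queue.Empty:
        # Nothing arrived in time; let the caller decide whether to retry.
        return None


inbox = queue.Queue()
inbox.put("hello")
first = fetch_msg(inbox)   # a message is available, so it is returned
second = fetch_msg(inbox)  # the queue is now empty, so the call times out
```

Returning `None` on timeout leaves the retry policy to the caller, which is one way a receiver could be retried later.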

07 Jun, 2022 1 commit

[Bug][Feature] Added cublasGemm<__half> specialization (#3988) (#4029) · eabcc58e

ndickson-nvidia authored Jun 07, 2022

* Added specialization of cublasGemm function for `__half` type, to try to address https://github.com/dmlc/dgl/issues/3988

* Added USE_FP16 guard

* Added test cases to test_segment_mm, to test newly-added FP16 specialization of cublasGemm

* Replaced for loop in test_segment_mm with pytest.mark.parametrize, as recommended
Co-authored-by: Xin Yao <xiny@nvidia.com>

eabcc58e

06 Jun, 2022 3 commits

[Bug] Added common operations for FP16 on older GPUs (#4079) · ea44da50

ndickson-nvidia authored Jun 06, 2022

* Added support for common operations on FP16 (`half` or `__half`) for older GPU architectures
* Fixed an issue with previous check for FP16 support

* Removing FP16 type checks, since they should no longer be needed

* Fixed AtomicAdd to be atomic for `float` and `double` for old GPU architectures. Unfortunately, atomicCAS for unsigned short seems to be unavailable until architecture 70, so half will have to stay non-atomic on old GPUs.

* Fixed non-atomic version of `AtomicAdd<half>` for older GPUs to return the old value instead of the new value

ea44da50
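
The commit above mentions building an atomic add on top of compare-and-swap (CAS) and returning the old value. The following is a Python emulation of that general pattern, not the CUDA code from the commit: `Cell`, `compare_and_swap`, and `atomic_add` are hypothetical names, and a lock stands in for the hardware guarantee.

```python
import threading


class Cell:
    """A single value with an emulated compare-and-swap (CAS)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`.

        Always returns the value that was seen before any store.
        """
        with self._lock:
            seen = self._value
            if seen == expected:
                self._value = new
            return seen


def atomic_add(cell, delta):
    """Add `delta` with a CAS retry loop and return the OLD value."""
    old = cell.load()
    while True:
        seen = cell.compare_and_swap(old, old + delta)
        if seen == old:
            return old  # the swap succeeded; report the value before the add
        old = seen      # another thread changed the value; retry with it


cell = Cell(1.0)
previous = atomic_add(cell, 2.0)

counter = Cell(0)


def work():
    for _ in range(1000):
        atomic_add(counter, 1)


threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Returning the value from before the addition matches the convention the last bullet of the commit describes for `AtomicAdd<half>`.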

parallelize csr2coo (#4081) · 31a81438
Quan (Andy) Gan authored Jun 06, 2022
```
Co-authored-by: Xin Yao <xiny@nvidia.com>
```
31a81438

wrap all cuda kernel calls with macro (#4066) · 6014623d

Xin Yao authored Jun 06, 2022


Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>
Co-authored-by: Israt Nisa <neesha295@gmail.com>

6014623d

28 May, 2022 3 commits
- Change warning message for tensoradapter when not found (#4055) · 9922f41f
  Quan (Andy) Gan authored May 29, 2022
```
* change warning message

* Update tensordispatch.cc
```
  9922f41f
- Revert "[bugfix] Explicitly unpin tensoradapter allocated arrays (#3997)" (#4061) · 00c09b9f
  Quan (Andy) Gan authored May 28, 2022
```
This reverts commit fdd1fe19.
```
  00c09b9f
- add sanity check (#4050) · c577dc9f
  Quan (Andy) Gan authored May 28, 2022
  
  c577dc9f
26 May, 2022 1 commit

[Build][Tests] Enable FP16 for GPU builds in CI (#4030) · 7a065a9c

nv-dlasalle authored May 26, 2022

* Enable FP16 for GPU builds in CI

* Limit default GPU archs to pascal and above

* Disable FP16 dispatching for cuda architectures less than 60

* Fix linting

* Fix typos

7a065a9c

25 May, 2022 1 commit
- [Bugfix] Cython CAPI holding GIL causes deadlock when Python callback is asynchronous (#4036) · 3c129ad7
  Minjie Wang authored May 25, 2022
```
* cython nogil

* move APIs to internal and add unit test

* fix lint

* disable callback array test
```
  3c129ad7
17 May, 2022 1 commit

change the curandState and launch dimension of CSRRowwiseSample kernel (#3990) · bacf2ab4

paoxiaode authored May 17, 2022



* Change the curand_init parameter

* Change the curand_init parameter

* commit

* commit

* change the curandState and launch dim of CSRRowwiseSample kernel

* commit

* keep  _CSRRowWiseSampleReplaceKernel in sync
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>

bacf2ab4

16 May, 2022 2 commits
- [bugfix] Explicitly unpin tensoradapter allocated arrays (#3997) · fdd1fe19
  nv-dlasalle authored May 16, 2022
```
* Explicitly unpin tensoradapter allocated arrays

* Undo unrelated change

* Add unit test

* update unit test
```
  fdd1fe19
- [Performance] Remove unnecessary induced vertices in EdgeSubgraph (#3978) · 03024f95
  Xin Yao authored May 16, 2022
```
* remove unnecessary induced vertices in EdgeSubgraph

* add unit test
```
  03024f95
12 May, 2022 1 commit
- Fix launch parameters index select kernel in sparse push (#3524) · 4177f729
  nv-dlasalle authored May 12, 2022
  
  4177f729
11 May, 2022 1 commit

[Dist] Enable maximum try times for socket backend via DGL_DIST_MAX_T… (#3977) · 22e218d3

Rhett Ying authored May 11, 2022

* [Dist] Enable maximum try times for socket backend via DGL_DIST_MAX_TRY_TIMES

* reset env before/after test

* print log for info when trying to connect

* fix

* print log in python instead of cpp

22e218d3
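
A rough sketch of how a retry limit read from `DGL_DIST_MAX_TRY_TIMES` could bound connection attempts. Only the environment variable name comes from the commit above; `max_try_times`, `connect_with_retries`, and the fallback value of 5 are assumptions made for illustration and do not reflect DGL's actual implementation or default.

```python
import os


def max_try_times(default=5):
    """Read the retry limit from the environment (the default is a placeholder)."""
    return int(os.environ.get("DGL_DIST_MAX_TRY_TIMES", default))


def connect_with_retries(try_connect):
    """Call try_connect() until it succeeds or the retry limit is reached.

    Returns the number of attempts used. `try_connect` is a hypothetical
    callable that returns True on success.
    """
    limit = max_try_times()
    for attempt in range(1, limit + 1):
        if try_connect():
            return attempt
        # Log from Python so the user can see progress while connecting.
        print(f"connect attempt {attempt}/{limit} failed")
    raise ConnectionError(f"gave up after {limit} attempts")


# Set the variable for the demo, then reset it afterwards.
os.environ["DGL_DIST_MAX_TRY_TIMES"] = "3"
outcomes = iter([False, False, True])  # fail twice, then succeed
attempts_used = connect_with_retries(lambda: next(outcomes))
del os.environ["DGL_DIST_MAX_TRY_TIMES"]
```

Resetting the variable after use mirrors the "reset env before/after test" step listed in the commit.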

27 Apr, 2022 1 commit

[Feature] enable socket net_type for rpc (#3951) · 37be02a4

Rhett Ying authored Apr 28, 2022

* [Feature] enable socket net_type for rpc

* fix lint

* fix lint

* fix build issue on windows

* fix test failure on windows

* fix test failure

* fix cpp unit test failure

* net_type blocking max_try_times

* fix other comments

* fix lint

* fix comment

* fix lint

* fix cpp

37be02a4

26 Apr, 2022 1 commit

[Performance][GPU] Improving Disjoint Union kernel for Graph Dataloaders (#3895) · 6e46bbf5

ayasar70 authored Apr 26, 2022



* Based on issue #3436. Improving _SegmentCopyKernel's GPU utilization by switching to nonzero based thread assignment

* fixing lint issues

* Update cub for cuda 11.5 compatibility (#3468)

* fixing type mismatch

* tx guaranteed to be smaller than nnz. Hence removing last check

* minor: updating comment

* adding three unit tests for csr slice method to cover some corner cases

* timing repeatkernel

* clean

* clean

* clean

* updating _SegmentMaskColKernel

* Working on requests: removing sorted array check and adding comments to utility functions

* fixing lint issue

* Optimizing disjoint union kernel

* Trying to resolve compilation issue on CI

* [EMPTY] Relevant commit message here

* applying revision requests on cpu/disjoint_union.cc

* removing unnecessary casts

* remove extra space
Co-authored-by: Abdurrahman Yasar <ayasar@nvidia.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>

6e46bbf5

12 Apr, 2022 1 commit
- [Example] Cleaned GraphSAGE node classification example with PyTorch Lightning (#3863) · 0d878ff8
  Quan (Andy) Gan authored Apr 12, 2022
```
* cleaned pl node classification example

* conform to PL's method of updating the dataloader

* update

* lint

* fix test

* fix
```
  0d878ff8
11 Apr, 2022 1 commit

[Feature] Enable UVA for GPU PinSAGE and RandomWalk (#3857) · 5fcd7f29

Xin Yao authored Apr 11, 2022



* enable uva for pinsage sampler

* unit test

* modify some checks on the python side

* remove legacy random walk code

* update unit test

* update unit test

* fix unit test

* adjust checks

* move some checks to c++

* move max_nodes check to cuda kernel

* fix ci for tf
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>

5fcd7f29

09 Apr, 2022 1 commit

[BugFix] record/restore pin status when pickle/unpickle (#3914) · adb3a7c1

Rhett Ying authored Apr 09, 2022

* [BugFix] record/restore pin status when pickle/unpickle

* disable test on TF

* set version as expected

* unpin memory in test

adb3a7c1
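
One way to record and restore a pin flag across pickling is through `__getstate__` and `__setstate__`. The sketch below uses a hypothetical `PinnableBuffer` class, not a DGL type; in real code pinning is a side effect on memory, so it has to be re-applied after unpickling rather than copied.

```python
import pickle


class PinnableBuffer:
    """Hypothetical container with a pin flag (not a DGL class)."""

    def __init__(self, data):
        self.data = list(data)
        self.pinned = False

    def pin_(self):
        # Real code would page-lock the memory here.
        self.pinned = True

    def unpin_(self):
        self.pinned = False

    def __getstate__(self):
        # Record the pin status alongside the payload.
        return {"data": self.data, "pinned": self.pinned}

    def __setstate__(self, state):
        self.data = state["data"]
        self.pinned = False
        if state["pinned"]:
            # Re-apply pinning on the freshly created object.
            self.pin_()


buf = PinnableBuffer([1, 2, 3])
buf.pin_()
restored = pickle.loads(pickle.dumps(buf))
restored.unpin_()  # unpin once the copy is no longer needed
```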

05 Apr, 2022 1 commit

[Examples] Update graphsage multi-gpu example to use multiple GPUs for... · 27a6eb56

nv-dlasalle authored Apr 05, 2022


[Examples] Update graphsage multi-gpu example to use multiple GPUs for validation and testing. (#3827)

* Update graphsage multi-gpu example to use multiple GPUs for validation and testing.

* Remove argmax

* Fix rebase error

* Add more documentation to example and simplify

* Switch to name shared memory

* Add comment about how training is distributed

* Restore iteration count

* fix munmap error reporting for better error messages
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>

27a6eb56

31 Mar, 2022 1 commit
- [Bugfix] Fix UVA sampling with partially specified node types (#3897) · 35e66f42
  Quan (Andy) Gan authored Mar 31, 2022
```
* fix uva with partial node types

* lint

* skip tensorflow unit test
```
  35e66f42
27 Mar, 2022 1 commit

[Feature] METIS Partition with Communication Volume Minimization (#3821) · fbbca994

Cheng Wan authored Mar 27, 2022

* upd

* upd

* upd

* upd

* upd

* fix OpenMP compatibility issues

* typo

* partition

* misc

* fix typo

* num_parts=1

* import torch

* long

* print info

* print info

* print info

* upd

* remove debug code

* revert partition.py

* fix cut count

* fix cut count

* Revert "fix cut count"

This reverts commit 10926b4fd48f45c8f1ddb58be7db6c22e653effd.

* Revert "fix cut count"

This reverts commit 76465283bef093a2b4209ad70dd15d2437b2ec8a.

* type of deprecate

* typo in deprecate info

* fix typo

* use cv for partitioning

* CE

* no message

* revert

* typo

* add objtype

* no message

* fix bug

* fix bug

* fix bug

* ?

* semicolon

* drop tensors

* no message

* backward

* backward

* max op

* store X.shape

* th

* test

* Revert "test"

This reverts commit 92b3b2f64a3a1128590098fa03ce429c5466e6ce.

* test

* tolist

* debug

* to cuda

* tuple

* fix bug

* remove X

* no message

* fix bug

* workload balance

* Revert "workload balance"

This reverts commit d7f8e4a16ba2a7eabb4a9bb945523bfe6623e723.

* reverse

* Revert "reverse"

This reverts commit 8a71cf25685aa7d889b9b8881b46f7a16b7d6e6d.

* Revert "Revert "reverse""

This reverts commit 196b143932d5cf9813576ece7c990b63d322d063.

* Revert "Revert "Revert "reverse"""

This reverts commit cf9e89a07013582056e7cde235e51331aca7fa9c.

* no message

* Merge commit '5498cf05'

# Conflicts:
#	python/dgl/distributed/partition.py

* Revert "Merge commit '5498cf05'"

This reverts commit f79be2ad777897c7025b28308454cad81ad6bb27.

* fix bug

* third party

* no message

* try to avoid memory leak

* try to avoid memory leak

* avoid memory leak with no hope

* Revert "avoid memory leak with no hope"

This reverts commit c77befe9479f46758e744642f66dd209b50eef7d.

* no message

* Revert "no message"

This reverts commit 478cb28fe25fb1002b2f1dc202bb9bdaad8b2a56.

* del

* Revert "del"

This reverts commit 1b468e45ce646b400ff3ffa61a0b2da058b3bdfd.

* no message

* no message

* Revert "no message"

This reverts commit 92e4f5561ed42da0606618b2fff9f1ad5ed439d9.

* third party

* document

* Update metis_partition.cc

* Update metis_partition_hetero.cc

* Update metis_partition_hetero.cc

* Update partition.py

* Update partition.py

* Update partition.py
Co-authored-by: yzh119 <expye@outlook.com>
Co-authored-by: chwan-rice <54331508+chwan-rice@users.noreply.github.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
Co-authored-by: Da Zheng <zhengda1936@gmail.com>

fbbca994

24 Mar, 2022 2 commits
- [Bugfix] Fix multiple bugs and code refactor (#3841) · 223a3da5
  Quan (Andy) Gan authored Mar 24, 2022
```
* fix

* remove setcxx methods

* move pin flag to CSR and COO matrix
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
```
  223a3da5
- [BugFix] send rpc messages blockingly in case of congestion (#3867) · e9fd65e9
  Rhett Ying authored Mar 24, 2022
```
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
```
  e9fd65e9
10 Mar, 2022 1 commit

Change the parameter of curand_init (#3794) · eec219ab

paoxiaode authored Mar 10, 2022



* Change the curand_init parameter

* Change the curand_init parameter

* commit

* commit
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>

eec219ab

01 Mar, 2022 1 commit
- [Build] Working around broken name mangling in MSVC 16.5.5 + CUDA 11.3 (#3790) · 396d7180
  Quan (Andy) Gan authored Mar 01, 2022
```
* fix

* explain

* oops
```
  396d7180
28 Feb, 2022 2 commits
- [Build] Split spmm.cu and sddmm.cu for building on Windows (#3789) · 3521fbe9
  Quan (Andy) Gan authored Mar 01, 2022
```
* split files

* fix
```
  3521fbe9
- [Build or bug?] Fix VS2019 compilation error in randomwalk GPU kernel (#3788) · 6e1c6990
  Quan (Andy) Gan authored Feb 28, 2022
```
* Update randomwalk_gpu.cu

* Update randomwalk_gpu.cu
```
  6e1c6990
27 Feb, 2022 1 commit

[Doc and bugfix] Add docs and user guide and update tutorial for sampling pipeline (#3774) · d41d07d0

Quan (Andy) Gan authored Feb 28, 2022



* huuuuge update

* remove

* lint

* lint

* fix

* what happened to nccl

* update multi-gpu unsupervised graphsage example

* replace most of the dgl.mp.process with torch.mp.spawn

* update if condition for use_uva case

* update user guide

* address comments

* incorporating suggestions from @jermainewang

* oops

* fix tutorial to pass CI

* oops

* fix again
Co-authored-by: Xin Yao <xiny@nvidia.com>

d41d07d0

23 Feb, 2022 2 commits

Fixes the bug when total_nnz is > integer limit (#3766) · e7ad4c9c
sanchit-misra authored Feb 24, 2022

e7ad4c9c
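
The commit title suggests a total nonzero count that exceeds a 32-bit integer. As an illustration only (the actual fix is not shown in this log), the sketch below emulates 32-bit and 64-bit accumulation with `ctypes`; the function names and the sample sizes are hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1


def total_nnz_int32(nnz_per_part):
    """Accumulate counts in an emulated 32-bit signed integer (wraps around)."""
    total = 0
    for nnz in nnz_per_part:
        # c_int32 performs no overflow check, so the value wraps like a
        # typical two's-complement 32-bit integer.
        total = ctypes.c_int32(total + nnz).value
    return total


def total_nnz_int64(nnz_per_part):
    """Accumulate counts in an emulated 64-bit signed integer."""
    total = 0
    for nnz in nnz_per_part:
        total = ctypes.c_int64(total + nnz).value
    return total


parts = [1_500_000_000, 1_500_000_000]  # each part fits in 32 bits
wrapped = total_nnz_int32(parts)        # the sum does not fit and wraps
exact = total_nnz_int64(parts)          # the 64-bit sum is exact
```

Each part is below `INT32_MAX`, yet the 32-bit total comes out negative, which is why a wider accumulator is the usual remedy for this class of bug.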

[NN] Rework RelGraphConv and HGTConv (#3742) · 0227ddfb

Minjie Wang authored Feb 23, 2022

* WIP: TypedLinear and new RelGraphConv

* wip

* further simplify RGCN

* a bunch of tweaks for performance; add basic cpu support

* update on segmm

* wip: segment.cu

* new backward kernel works

* fix a bunch of bugs in kernel; leave idx_a for future

* add nn test for typed_linear

* rgcn nn test

* bugfix in corner case; update RGCN README

* doc

* fix cpp lint

* fix lint

* fix ut

* wip: hgtconv; presorted flag for rgcn

* hgt code and ut; WIP: some fix on reorder graph

* better typed linear init

* fix ut

* fix lint; add docstring

0227ddfb

21 Feb, 2022 1 commit

[Bugfix] Bug fixes in new dataloader (#3727) · 3f138eba

Quan (Andy) Gan authored Feb 22, 2022



* fixes

* fix

* more fixes

* update

* oops

* lint?

* temporarily revert - will fix in another PR

* more fixes

* skipping mxnet test

* address comments

* fix DDP

* fix edge dataloader exclusion problems

* stupid bug

* fix

* use_uvm option

* fix

* fixes

* fixes

* fixes

* fixes

* add evaluation for cluster gcn and ddp

* stupid bug again

* fixes

* move sanity checks to only support DGLGraphs

* pytorch lightning compatibility fixes

* remove

* poke

* more fixes

* fix

* fix

* disable test

* docstrings

* why is it getting a memory leak?

* fix

* update

* updates and temporarily disable forkingpickler

* update

* fix?

* fix?

* oops

* oops

* fix

* lint

* huh

* uh

* update

* fix

* made it memory efficient

* refine exclude interface

* fix tutorial

* fix tutorial

* fix graph duplication in CPU dataloader workers

* lint

* lint

* Revert "lint"

This reverts commit 805484dd553695111b5fb37f2125214a6b7276e9.

* Revert "lint"

This reverts commit 0bce411b2b415c2ab770343949404498436dc8b2.

* Revert "fix graph duplication in CPU dataloader workers"

This reverts commit 9e3a8cf34c175d3093c773f6bb023b155f2bd27f.
Co-authored-by: xiny <xiny@nvidia.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>

3f138eba

18 Feb, 2022 1 commit

[Performance][GPU] Improving _SegmentMaskColKernel (#3745) · 7b9afbfa

ayasar70 authored Feb 18, 2022



* Based on issue #3436. Improving _SegmentCopyKernel's GPU utilization by switching to nonzero based thread assignment

* fixing lint issues

* Update cub for cuda 11.5 compatibility (#3468)

* fixing type mismatch

* tx guaranteed to be smaller than nnz. Hence removing last check

* minor: updating comment

* adding three unit tests for csr slice method to cover some corner cases

* timing repeatkernel

* clean

* clean

* clean

* updating _SegmentMaskColKernel

* Working on requests: removing sorted array check and adding comments to utility functions

* fixing lint issue
Co-authored-by: Abdurrahman Yasar <ayasar@nvidia.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>

7b9afbfa
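
One plausible reading of "nonzero based thread assignment" is that each thread handles one nonzero entry and looks up its row in the CSR row-pointer array, instead of each thread handling a whole row. The sketch below shows that lookup in plain Python under that assumption; `row_of_nonzero` and `rows_for_all_nonzeros` are hypothetical names, and the loop stands in for parallel threads.

```python
from bisect import bisect_right


def row_of_nonzero(indptr, tx):
    """Return the row that owns nonzero index tx in a CSR matrix.

    Row r owns the nonzeros in the half-open range [indptr[r], indptr[r + 1]).
    Assumes 0 <= tx < nnz, where nnz == indptr[-1].
    """
    # Binary search for the last row whose start offset is <= tx.
    return bisect_right(indptr, tx) - 1


def rows_for_all_nonzeros(indptr):
    """Emulate one 'thread' per nonzero; each one finds its own row."""
    nnz = indptr[-1]
    return [row_of_nonzero(indptr, tx) for tx in range(nnz)]


# Three rows with 2, 0, and 3 nonzeros respectively.
indptr = [0, 2, 2, 5]
rows = rows_for_all_nonzeros(indptr)
```

With one unit of work per nonzero, rows with many entries no longer serialize on a single thread and empty rows cost nothing, which is the usual motivation for this kind of assignment.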