- 24 Jun, 2022 1 commit
nv-dlasalle authored
* Add UVA by default to embedding
* More updates
* Update optimizer
* Add new UVA functions
* Expose new pinned memory function
* Add unit tests
* Update formatting
* Fix unit test
* Handle auto UVA case when training is on CPU
* Allow per-embedding decisions for whether to use UVA
* Address sparse_optim.py comments
* Remove unused templates
* Update unit test
* Use DGL-allocated memory for pinning
* Allow automatic unpinning
* Workaround for D2H copy with a different dtype
* Fix linting
* Update error message
* Update copyright

Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
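The UVA work above keeps large embedding tables in page-locked (pinned) host memory so the GPU can fetch rows on demand instead of holding the whole table on the device. Below is a minimal PyTorch sketch of the pinned-memory gather pattern only; it stages the copy through the CPU, whereas DGL's UVA kernels read the pinned table from the GPU directly (zero-copy), and the helper name `gather_rows` is illustrative, not part of DGL's API.

```python
import torch

# Embedding table kept in pinned (page-locked) host memory. Pinned
# buffers are addressable from the GPU under unified virtual
# addressing (UVA) and support asynchronous host-to-device copies.
num_nodes, dim = 100_000, 128
table = torch.randn(num_nodes, dim).pin_memory()

def gather_rows(idx: torch.Tensor) -> torch.Tensor:
    """Gather a minibatch of embedding rows onto the GPU.

    Illustrative only: row selection runs on the CPU here, while a
    true UVA gather indexes the pinned table from a GPU kernel.
    """
    rows = table.index_select(0, idx.cpu())
    return rows.to("cuda", non_blocking=True)

if torch.cuda.is_available():
    batch = torch.randint(0, num_nodes, (1024,), device="cuda")
    emb = gather_rows(batch)  # (1024, 128), resident on the GPU
```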
- 23 Mar, 2022 1 commit
Jinjing Zhou authored
* try fix
* try fix
* try fix
* try fix
* Revert "try fix" (reverts commit a3fa0b1e9c0ab892cc3a22acf3770903db8b14a7)
* try fix shared memory
* try fix shared memory
* try fix image version
* fix
- 24 Jun, 2021 1 commit
xiang song(charlie.song) authored
[Bug fix] Use shared memory for grad sync when NCCL is not available as the PyTorch distributed backend. (#3034)

* Use shared memory for grad sync when NCCL is not available as the PyTorch distributed backend; fix small bugs and update unit tests
* Fix bug
* Update test
* Update test
* Fix unit test
* Fix unit test
* Fix test
* Fix
* Simple update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-24-212.ec2.internal>
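The fallback this commit adds can be pictured with plain `torch.multiprocessing`: when NCCL (or another distributed backend) is unavailable, worker processes can still synchronize gradients by accumulating into a tensor placed in shared memory. A minimal sketch of the idea, not DGL's implementation; the lock-and-barrier choreography is an assumption made for the example.

```python
import torch
import torch.multiprocessing as mp

def worker(rank, world_size, shared_grad, lock, barrier):
    # Stand-in for a locally computed gradient on this rank.
    local_grad = torch.full_like(shared_grad, float(rank + 1))
    # Accumulate under a lock: += on a shared tensor is not atomic
    # across processes.
    with lock:
        shared_grad += local_grad
    barrier.wait()  # wait until every rank has accumulated
    # All ranks now see the same sum and apply the averaged update.
    avg = shared_grad / world_size
    print(f"rank {rank}: mean grad = {avg[0].item():.2f}")

if __name__ == "__main__":
    world_size = 4
    ctx = mp.get_context("spawn")
    grad = torch.zeros(8)
    grad.share_memory_()  # move the buffer into shared memory
    lock, barrier = ctx.Lock(), ctx.Barrier(world_size)
    procs = [ctx.Process(target=worker,
                         args=(r, world_size, grad, lock, barrier))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```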
- 11 Jun, 2021 1 commit
nv-dlasalle authored
* Split from NCCL PR
* Fix type in comment
* Expand documentation for sparse_all_to_all_push
* Restore previous behavior in example
* Rework optimizer to use NCCL based on gradient location
* Allow running with the embedding on CPU while using NCCL for gradient exchange
* Optimize the single-partition case
* Fix pylint errors
* Add missing include
* Fix gradient indexing
* Fix line continuation
* Migrate 'first_step'
* Skip tests without enough GPUs to run NCCL
* Improve empty tensor handling for PyTorch 1.5
* Fix indentation
* Allow multiple NCCL communicators to coexist
* Improve handling of empty messages
* Update python/dgl/nn/pytorch/sparse_emb.py (co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>)
* Update python/dgl/nn/pytorch/sparse_emb.py (co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>)
* Keep empty tensors dimensionless
* th.empty -> th.tensor
* Preserve shape for empty tensors with non-zero dimensions
* Use shared state when the embedding is shared
* Add support for gathering an embedding
* Fix typo
* Fix more typos
* Fix backend call
* Use NodeDataLoader to take advantage of DDP
* Update training script to share memory
* Only squeeze the last dimension
* Better handle empty messages
* Keep the embedding on the target GPU device if dgl_sparse is false in the RGCN example
* Fix typo in comment
* Add asserts
* Improve documentation in example

Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
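For orientation, `sparse_all_to_all_push` routes each (index, gradient-row) pair of a sparse embedding gradient to the rank that owns that row, so every GPU applies updates only to its own partition. The sketch below expresses the same idea with stock `torch.distributed` collectives rather than DGL's NCCL wrapper; the contiguous range partition of node IDs is an assumption.

```python
import torch
import torch.distributed as dist

def sparse_all_to_all_push(indices, values, num_nodes):
    """Route each (index, gradient-row) pair to its owner rank and
    return the pairs this rank owns. Assumes node IDs are
    range-partitioned across ranks; `values` is 2-D (rows x dim).
    """
    world_size = dist.get_world_size()
    part = (num_nodes + world_size - 1) // world_size
    owner = indices // part  # destination rank of each row

    # Bucket rows by destination. Buckets may be empty, which (as the
    # commits above note) needs careful handling.
    send_idx = [indices[owner == r] for r in range(world_size)]
    send_val = [values[owner == r] for r in range(world_size)]

    # Exchange bucket sizes first so receive buffers can be allocated.
    sizes = torch.tensor([t.numel() for t in send_idx],
                         device=indices.device)
    recv_sizes = torch.empty_like(sizes)
    dist.all_to_all_single(recv_sizes, sizes)

    recv_idx = [torch.empty(int(n), dtype=indices.dtype,
                            device=indices.device) for n in recv_sizes]
    recv_val = [torch.empty(int(n), values.shape[1], dtype=values.dtype,
                            device=values.device) for n in recv_sizes]
    dist.all_to_all(recv_idx, send_idx)
    dist.all_to_all(recv_val, send_val)
    return torch.cat(recv_idx), torch.cat(recv_val)
```

With the NCCL backend every exchanged tensor, including the size counts, must live on the GPU, which is why the buffers above inherit the device of the inputs.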
- 25 Apr, 2021 1 commit
xiang song(charlie.song) authored
* Fix #2856
* upd
* Fix unit test
* upd
* upd
* upd
* Fix

Co-authored-by: Ubuntu <ubuntu@ip-172-31-57-25.ec2.internal>
- 27 Jan, 2021 1 commit
xiang song(charlie.song) authored
* Add sparse embedding for DGL and update the RGCN example
* upd
* Fix
* Revert "Fix" (reverts commit 4da87cdfb8b8c3506b7fc7376cd2385ba8045c2a)
* Fix
* upd
* upd
* Fix
* Add unit test and update implementation
* Fix
* Clean up RGCN example code
* upd
* upd
* Update
* Fix
* Update score
* Sparse for SAGE
* Remove model sparse
* upd
* upd
* Remove global norm
* Revert deletion of model_sparse.py
* Update according to comments
* Fix doc
* upd
* Fix test
* upd
* lint
* lint
* lint
* upd
* upd
* Clean up

Co-authored-by: Ubuntu <ubuntu@ip-172-31-56-220.ec2.internal>
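The core pattern behind this change can be shown with plain PyTorch: with `sparse=True`, backward produces a gradient only for the embedding rows a minibatch touches, so a sparse optimizer updates those rows instead of scanning the full table. A sketch of the general technique, not of DGL's own embedding and optimizer API:

```python
import torch
import torch.nn as nn

# Learnable per-node embeddings for a large graph.
num_nodes, dim = 1_000_000, 64
emb = nn.Embedding(num_nodes, dim, sparse=True)
opt = torch.optim.SparseAdam(emb.parameters(), lr=0.01)

batch = torch.randint(0, num_nodes, (1024,))
loss = emb(batch).pow(2).mean()  # stand-in for a real GNN loss
opt.zero_grad()
loss.backward()   # emb.weight.grad is a sparse COO tensor
opt.step()        # updates only the rows indexed by `batch`
```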