Commit 653428bd authored by Lingfan Yu, committed by Minjie Wang

[Feature][Kernel] DGL kernel support (#596)

* [Kernel] Minigun integration and fused kernel support (#519)

* kernel interface

* add minigun

* Add cuda build

* functors

* working on binary elewise

* binary reduce

* change kernel interface

* WIP

* wip

* fix minigun

* compile

* binary reduce kernels

* compile

* simple test passed

* more reducers

* fix thrust problem

* fix cmake

* fix cmake; add proper guard for atomic

* WIP: bcast

* WIP

* bcast kernels

* update to new minigun pass-by-value practice

* broadcasting dim

* add copy src and copy edge

* fix linking

* fix none array problem

* fix copy edge

* add device_type and device_id to backend operator

* cache csr adj, remove cache for adjmat and incmat

* custom ops in backend and pytorch impl

* change dgl-mg kernel python interface

* add id_mapping var

* clean up plus v2e spmv schedule

* spmv schedule & clean up fall back

* symbolic message and reduce func, remove bundle func

* new executors

* new backend interface for dgl kernels and pytorch impl

* minor fix

* fix

* fix docstring, comments, func names

* nodeflow

* fix message id mapping and bugs...

* pytorch test case & fix

* backward binary reduce

* fix bug

* WIP: cusparse

* change to int32 csr for cusparse workaround

* disable cusparse

* change back to int64

* broadcasting backward

* cusparse; WIP: add rev_csr

* unit test for kernels

* pytorch backward with dgl kernel

* edge softmax

* fix backward

* improve softmax

* cache edge on device

* cache mappings on device

* fix partial forward code

* cusparse done

* copy_src_sum with cusparse

* rm id getter

* reduce grad for broadcast

* copy edge reduce backward

* kernel unit test for broadcasting

* full kernel unit test

* add cpu kernels

* edge softmax unit test

* missing ref

* fix compile and small bugs

* fix bug in bcast

* Add backward both

* fix torch utests

* expose infershape

* create out tensor in python

* fix c++ lint

* [Kernel] Add GPU utest and kernel utest (#524)

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* [Kernel] Update kernel branch (#550)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* Fixing typo in JTNN after interface change (#536)

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix linter

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* [Kernel] Update kernel branch (#576)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* Fixing typo in JTNN after interface change (#536)

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix linter

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* all demos use python-3 (#555)

* [DEMO] Reproduce numbers of distributed training in AMLC giant graph paper (#556)

* update

* update

* update

* update num_hops

* fix bug

* update

* report numbers of distributed training in AMLC giant graph paper

* [DEMO] Remove duplicate code for sampling (#557)

* update

* update

* re-use single-machine code

* update

* use relative path

* update

* update

* update

* add __init__.py

* add __init__.py

* import sys, os

* fix typo

* update

* [Perf] Improve performance of graph store. (#554)

* fix.

* use inplace.

* move to shared memory graph store.

* fix.

* add more unit tests.

* fix.

* fix test.

* fix test.

* disable test.

* fix.

* [BUGFIX] fix a bug in edge_ids (#560)

* add test.

* fix compute.

* fix test.

* turn on test.

* fix a bug.

* add test.

* fix.

* disable test.

* [DEMO] Add Pytorch demo for distributed sampler (#562)

* update

* update

* update

* add sender

* update

* remove duplicate code

* [Test] Add gtest to project (#547)

* add gtest module

* add gtest

* fix

* Update CMakeLists.txt

* Update README.md

* [Perf] lazily create msg_index. (#563)

* lazily create msg_index.

* update test.

* [BUGFIX] fix bugs for running GCN on giant graphs. (#561)

* load mxnet csr.

* enable load large csr.

* fix

* fix.

* fix int overflow.

* fix test.

* [BugFix] Fix error when bfs_level = 0 in Entity Classification with RGCN (#559)

* [DEMO] Update demo of distributed sampler (#564)

* update

* update

* update demo

* add network cpp test (#565)

* Add unittest for C++ RPC (#566)

* [CI] Fix CI for cpp test (#570)

* fix CI for cpp test

* update port number

* [Docker] update docker image (#575)

* update docker image

* specify lint version

* rm torch import from unified tests

* [Kernel][Scheduler][MXNet] Scheduler for DGL kernels and MXNet backend support (#541)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* edge softmax module

* WIP

* Fixing typo in JTNN after interface change (#536)

* mxnet backend support

* improve reduce grad

* add max to unittest backend

* fix kernel unittest

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* lint

* lint

* win build

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix linter

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* try

* fix

* fix

* fix

* fix

* fix

* try

* test

* test

* test

* try

* try

* try

* test

* fix

* try gen_target

* fix gen_target

* fix msvc var_args expand issue

* fix

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* WIP

* WIP

* all demos use python-3 (#555)

* ToImmutable and CopyTo

* [DEMO] Reproduce numbers of distributed training in AMLC giant graph paper (#556)

* update

* update

* update

* update num_hops

* fix bug

* update

* report numbers of distributed training in AMLC giant graph paper

* [DEMO] Remove duplicate code for sampling (#557)

* update

* update

* re-use single-machine code

* update

* use relative path

* update

* update

* update

* add __init__.py

* add __init__.py

* import sys, os

* fix typo

* update

* [Perf] Improve performance of graph store. (#554)

* fix.

* use inplace.

* move to shared memory graph store.

* fix.

* add more unit tests.

* fix.

* fix test.

* fix test.

* disable test.

* fix.

* [BUGFIX] fix a bug in edge_ids (#560)

* add test.

* fix compute.

* fix test.

* turn on test.

* fix a bug.

* add test.

* fix.

* disable test.

* DGLRetValue DGLContext conversion

* [DEMO] Add Pytorch demo for distributed sampler (#562)

* update

* update

* update

* add sender

* update

* remove duplicate code

* [Test] Add gtest to project (#547)

* add gtest module

* add gtest

* fix

* Update CMakeLists.txt

* Update README.md

* Add support to convert immutable graph to 32 bits

* [Perf] lazily create msg_index. (#563)

* lazily create msg_index.

* update test.

* fix binary reduce following new minigun template

* enable both int64 and int32 kernels

* [BUGFIX] fix bugs for running GCN on giant graphs. (#561)

* load mxnet csr.

* enable load large csr.

* fix

* fix.

* fix int overflow.

* fix test.

* new kernel interface done for CPU

* docstring

* rename & docstring

* copy reduce and backward

* [BugFix] Fix error when bfs_level = 0 in Entity Classification with RGCN (#559)

* [DEMO] Update demo of distributed sampler (#564)

* update

* update

* update demo

* adapt cuda kernels to the new interface

* add network cpp test (#565)

* fix bug

* Add unittest for C++ RPC (#566)

* [CI] Fix CI for cpp test (#570)

* fix CI for cpp test

* update port number

* [Docker] update docker image (#575)

* update docker image

* specify lint version

* rm torch import from unified tests

* remove pytorch-specific test_function

* fix unittest

* fix

* fix unittest backend bug in converting tensor to numpy array

* fix

* mxnet version

* [BUGFIX] fix for MXNet 1.5. (#552)

* remove clone.

* turn on numpy compatible.

* Revert "remove clone."

This reverts commit 17bbf76ed72ff178df6b3f35addc428048672457.

* revert format changes

* fix mxnet api name

* revert mistakes in previous revert

* roll back CI to 20190523 build

* fix unittest

* disable test_shared_mem_store.py for now

* remove mxnet/test_specialization.py

* sync win64 test script

* fix lowercase

* missing backend in gpu unit test

* transpose to get forward graph

* pass update all

* add sanity check

* passing test_specialization.py

* fix and pass test_function

* fix check

* fix pytorch softmax

* mxnet kernels

* c++ lint

* pylint

* try

* win build

* fix

* win

* ci enable gpu build

* init submodule recursively

* backend docstring

* try

* test win dev

* doc string

* disable pytorch test_nn

* try to fix windows issue

* bug fixed, revert changes

* [Test] fix CI. (#586)

* disable unit test in mxnet tutorial.

* retry socket connection.

* roll back to set_np_compat

* try to fix multi-processing test hangs when it fails.

* fix test.

* fix.

* doc string

* doc string and clean up

* missing field in ctypes

* fix node flow schedule and unit test

* rename

* pylint

* copy from parent default context

* fix unit test script

* fix

* demo bug in nodeflow gpu test

* [Kernel][Bugfix] fix nodeflow bug (#604)

* fix nodeflow bug

* remove debug code

* add build gtest option

* fix cmake; fix graph index bug in spmv.py

* remove clone

* fix div rhs grad bug

* [Kernel] Support full builtin method, edge softmax and unit tests (#605)

* add full builtin support

* unit test

* unit test backend

* edge softmax

* apply edge with builtin

* fix kernel unit test

* disable mxnet test_shared_mem_store

* gen builtin reduce

* enable mxnet gpu unittest

* revert some changes

* docstring

* add note for the hack

* [Kernel][Unittest][CI] Fix MXNet GPU CI (#607)

* update docker image for MXNet GPU CI

* force all dgl graph input and output on CPU

* fix gpu unittest

* speedup compilation

* add some comments

* lint

* add more comments

* fix as requested

* add some comments

* comment

* lint

* lint

* update pylint

* fix as requested

* lint

* lint

* lint

* docstrings of python DGL kernel entries

* disable lint warnings on arguments in kernel.py

* fix docstring in scheduler

* fix some bug in unittest; try again

* Revert "Merge branch 'kernel' of github.com:zzhang-cn/dgl into kernel"

This reverts commit 1d2299e68b004182ea6130b088de1f1122b18a49, reversing
changes made to ddc97fbf1bec2b7815c0da7c74f7ecb2f428889b.

* Revert "fix some bug in unittest; try again"

This reverts commit ddc97fbf1bec2b7815c0da7c74f7ecb2f428889b.

* more comprehensive kernel test

* remove shape check in test_specialization
parent da0c92a2
@@ -7,3 +7,6 @@
 [submodule "third_party/googletest"]
   path = third_party/googletest
   url = https://github.com/google/googletest.git
+[submodule "third_party/minigun"]
+  path = third_party/minigun
+  url = https://github.com/jermainewang/minigun.git
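The new minigun checkout vendors its own moderngpu dependency (see the include paths added to CMakeLists.txt below), so a flat `git submodule init` + `git submodule update` no longer fetches everything; the Jenkinsfile change later in this diff switches CI to a recursive update for the same reason. A minimal sketch for a local checkout:

```bash
# From an existing DGL checkout: fetch all submodules, including the
# moderngpu dependency nested inside third_party/minigun.
git submodule update --init --recursive
```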
+cmake_minimum_required(VERSION 3.9)
 ########################################
 # Borrowed and adapted from TVM project
 ########################################
-cmake_minimum_required(VERSION 2.8)
 project(dgl C CXX)
+message(STATUS "Start configuring project ${PROJECT_NAME}")
+
+# cmake utils
+include(cmake/util/Util.cmake)
+include(cmake/util/MshadowUtil.cmake)
+include(cmake/util/FindCUDA.cmake)
+
 if(EXISTS ${CMAKE_CURRENT_BINARY_DIR}/config.cmake)
   include(${CMAKE_CURRENT_BINARY_DIR}/config.cmake)
@@ -12,10 +18,25 @@ else()
   endif()
 endif()
+
+# NOTE: do not modify this file to change option values.
+# You can create a config.cmake in the build folder
+# and add set(OPTION VALUE) to override these build options.
+# Alternatively, use cmake -DOPTION=VALUE through the command line.
+dgl_option(USE_CUDA "Build with CUDA" OFF)
+dgl_option(BUILD_CPP_TEST "Build cpp unittest executables" OFF)
+
+if(USE_CUDA)
+  message(STATUS "Build with CUDA support")
+  project(dgl C CXX CUDA)
+  include(cmake/modules/CUDA.cmake)
+endif(USE_CUDA)
+
 # include directories
 include_directories("include")
 include_directories("third_party/dlpack/include")
 include_directories("third_party/dmlc-core/include")
+include_directories("third_party/minigun/minigun")
+include_directories("third_party/minigun/third_party/moderngpu/src")

 # initial variables
 set(DGL_LINKER_LIBS "")
@@ -58,12 +79,30 @@ else(MSVC)
 endif(MSVC)

 # Source file lists
-file(GLOB CORE_SRCS src/graph/*.cc src/graph/network/* src/*.cc src/scheduler/*.cc)
-file(GLOB RUNTIME_SRCS src/runtime/*.cc)
+file(GLOB DGL_SRC
+  src/*.cc
+  src/kernel/*.cc
+  src/kernel/cpu/*.cc
+  src/runtime/*.cc
+)
+file(GLOB_RECURSE DGL_SRC_1
+  src/graph/*.cc
+  src/scheduler/*.cc
+)
+list(APPEND DGL_SRC ${DGL_SRC_1})
+
+if(USE_CUDA)
+  dgl_config_cuda(DGL_CUDA_SRC)
+  list(APPEND DGL_SRC ${DGL_CUDA_SRC})
+endif(USE_CUDA)

-add_library(dgl SHARED ${CORE_SRCS} ${RUNTIME_SRCS})
+if(USE_CUDA)
+  cuda_add_library(dgl SHARED ${DGL_SRC})
+else(USE_CUDA)
+  add_library(dgl SHARED ${DGL_SRC})
+endif(USE_CUDA)

 target_link_libraries(dgl ${DGL_LINKER_LIBS} ${DGL_RUNTIME_LINKER_LIBS})
@@ -72,7 +111,7 @@ install(TARGETS dgl DESTINATION lib${LIB_SUFFIX})
 # Testing
 if(BUILD_CPP_TEST)
-  message("Build with unittest")
+  message(STATUS "Build with unittest")
   add_subdirectory(./third_party/googletest)
   enable_testing()
   include_directories(${gtest_SOURCE_DIR}/include ${gtest_SOURCE_DIR})
......
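With the USE_CUDA and BUILD_CPP_TEST options wired in above, a typical out-of-source configure and build might look like the sketch below (assuming a local CUDA toolkit; equivalently, copy cmake/config.cmake into the build folder and edit it, as that file's header later in this diff describes):

```bash
# Configure and build with the new options (sketch; requires a CUDA toolkit
# when USE_CUDA=ON).
mkdir build && cd build
cmake -DUSE_CUDA=ON -DBUILD_CPP_TEST=ON ..
make -j8
```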
@@ -7,14 +7,12 @@ dgl_win64_libs = "build\\dgl.dll, build\\runUnitTests.exe"
 def init_git() {
   sh "rm -rf *"
   checkout scm
-  sh "git submodule init"
-  sh "git submodule update"
+  sh "git submodule update --recursive --init"
 }

 def init_git_win64() {
   checkout scm
-  bat "git submodule init"
-  bat "git submodule update"
+  bat "git submodule update --recursive --init"
 }

 // pack libraries for later use
@@ -31,7 +29,7 @@ def unpack_lib(name, libs) {
 def build_dgl_linux(dev) {
   init_git()
-  sh "bash tests/scripts/build_dgl.sh"
+  sh "bash tests/scripts/build_dgl.sh ${dev}"
   pack_lib("dgl-${dev}-linux", dgl_linux_libs)
 }
@@ -59,7 +57,7 @@ def unit_test_linux(backend, dev) {
   init_git()
   unpack_lib("dgl-${dev}-linux", dgl_linux_libs)
   timeout(time: 2, unit: 'MINUTES') {
-    sh "bash tests/scripts/task_unit_test.sh ${backend}"
+    sh "bash tests/scripts/task_unit_test.sh ${backend} ${dev}"
   }
 }
@@ -195,8 +193,8 @@ pipeline {
     stages {
       stage("Unit test") {
         steps {
-          //unit_test_linux("pytorch", "gpu")
           sh "nvidia-smi"
+          unit_test_linux("pytorch", "gpu")
         }
       }
       stage("Example test") {
@@ -226,6 +224,32 @@ pipeline {
         //}
       }
     }
+    stage("MXNet GPU") {
+      agent {
+        docker {
+          image "dgllib/dgl-ci-gpu"
+          args "--runtime nvidia"
+        }
+      }
+      stages {
+        stage("Unit test") {
+          steps {
+            sh "nvidia-smi"
+            unit_test_linux("mxnet", "gpu")
+          }
+        }
+        //stage("Example test") {
+        //  steps {
+        //    unit_test_linux("pytorch", "cpu")
+        //  }
+        //}
+        //stage("Tutorial test") {
+        //  steps {
+        //    tutorial_test_linux("mxnet")
+        //  }
+        //}
+      }
+    }
   }
 }
 }
......
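The build and unit-test scripts now receive the target device as an extra argument (per the Jenkinsfile changes above); run by hand, the updated CI calls would look roughly like:

```bash
# Local equivalents of the updated CI invocations (sketch; argument order
# follows the Jenkinsfile changes above).
bash tests/scripts/build_dgl.sh gpu
bash tests/scripts/task_unit_test.sh pytorch gpu
bash tests/scripts/task_unit_test.sh mxnet gpu
```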
#--------------------------------------------------------------------
# Template custom cmake configuration for compiling
#
# This file is used to override the build options in build.
# If you want to change the configuration, please use the following
# steps. Assume you are in the root directory. First copy this
# file so that any local changes will be ignored by git
#
# $ mkdir build
# $ cp cmake/config.cmake build
#
# Next, modify the relevant entries, and then compile by
#
# $ cd build
# $ cmake ..
#
# Then build in parallel with 8 threads
#
# $ make -j8
#--------------------------------------------------------------------
#---------------------------------------------
# Backend runtimes.
#---------------------------------------------
# Whether to enable CUDA during compilation,
#
# Possible values:
# - ON: enable CUDA with cmake's auto search
# - OFF: disable CUDA
# - /path/to/cuda: use specific path to cuda toolkit
set(USE_CUDA OFF)
#---------------------------------------------
# Misc.
#---------------------------------------------
# Whether to build cpp unittest executables
set(BUILD_CPP_TEST OFF)
# CUDA Module
if(USE_CUDA)
find_cuda(${USE_CUDA} REQUIRED)
else(USE_CUDA)
return()
endif()
###### Borrowed from MSHADOW project
include(CheckCXXCompilerFlag)
check_cxx_compiler_flag("-std=c++11" SUPPORT_CXX11)
################################################################################################
# A function for automatic detection of GPUs installed (if autodetection is enabled)
# Usage:
# dgl_detect_installed_gpus(out_variable)
function(dgl_detect_installed_gpus out_variable)
set(CUDA_gpu_detect_output "")
if(NOT CUDA_gpu_detect_output)
message(STATUS "Running GPU architecture autodetection")
set(__cufile ${PROJECT_BINARY_DIR}/detect_cuda_archs.cu)
file(WRITE ${__cufile} ""
"#include <cstdio>\n"
"#include <iostream>\n"
"using namespace std;\n"
"int main()\n"
"{\n"
" int count = 0;\n"
" if (cudaSuccess != cudaGetDeviceCount(&count)) { return -1; }\n"
" if (count == 0) { cerr << \"No cuda devices detected\" << endl; return -1; }\n"
" for (int device = 0; device < count; ++device)\n"
" {\n"
" cudaDeviceProp prop;\n"
" if (cudaSuccess == cudaGetDeviceProperties(&prop, device))\n"
" std::printf(\"%d.%d \", prop.major, prop.minor);\n"
" }\n"
" return 0;\n"
"}\n")
if(MSVC)
#find vcvarsall.bat and run it building msvc environment
get_filename_component(MY_COMPILER_DIR ${CMAKE_CXX_COMPILER} DIRECTORY)
find_file(MY_VCVARSALL_BAT vcvarsall.bat "${MY_COMPILER_DIR}/.." "${MY_COMPILER_DIR}/../..")
execute_process(COMMAND ${MY_VCVARSALL_BAT} && ${CUDA_NVCC_EXECUTABLE} -arch sm_30 --run ${__cufile}
WORKING_DIRECTORY "${PROJECT_BINARY_DIR}/CMakeFiles/"
RESULT_VARIABLE __nvcc_res OUTPUT_VARIABLE __nvcc_out
OUTPUT_STRIP_TRAILING_WHITESPACE)
else()
if(CUDA_LIBRARY_PATH)
set(CUDA_LINK_LIBRARY_PATH "-L${CUDA_LIBRARY_PATH}")
endif()
execute_process(COMMAND ${CUDA_NVCC_EXECUTABLE} -arch sm_30 --run ${__cufile} ${CUDA_LINK_LIBRARY_PATH}
WORKING_DIRECTORY "${PROJECT_BINARY_DIR}/CMakeFiles/"
RESULT_VARIABLE __nvcc_res OUTPUT_VARIABLE __nvcc_out
OUTPUT_STRIP_TRAILING_WHITESPACE)
endif()
if(__nvcc_res EQUAL 0)
# nvcc outputs text containing line breaks when building with MSVC.
# The line below prevents CMake from inserting a variable with line
# breaks in the cache
message(STATUS "Found CUDA arch ${__nvcc_out}")
string(REGEX MATCH "([1-9].[0-9])" __nvcc_out "${__nvcc_out}")
string(REPLACE "2.1" "2.1(2.0)" __nvcc_out "${__nvcc_out}")
set(CUDA_gpu_detect_output ${__nvcc_out} CACHE INTERNAL "Returned GPU architectures from mshadow_detect_gpus tool" FORCE)
else()
message(WARNING "Running GPU detection script with nvcc failed: ${__nvcc_out}")
endif()
endif()
if(NOT CUDA_gpu_detect_output)
message(WARNING "Automatic GPU detection failed. Building for all known architectures (${mshadow_known_gpu_archs}).")
set(${out_variable} ${mshadow_known_gpu_archs} PARENT_SCOPE)
else()
set(${out_variable} ${CUDA_gpu_detect_output} PARENT_SCOPE)
endif()
endfunction()
################################################################################################
# Function for selecting GPU arch flags for nvcc based on CUDA_ARCH_NAME
# Usage:
# dgl_select_nvcc_arch_flags(out_variable)
function(dgl_select_nvcc_arch_flags out_variable)
# List of arch names
set(__archs_names "Fermi" "Kepler" "Maxwell" "Pascal" "Volta" "All" "Manual")
set(__archs_name_default "All")
if(NOT CMAKE_CROSSCOMPILING)
list(APPEND __archs_names "Auto")
set(__archs_name_default "Auto")
endif()
# set CUDA_ARCH_NAME strings (so it will be seen as a drop-down in CMake-GUI)
set(CUDA_ARCH_NAME ${__archs_name_default} CACHE STRING "Select target NVIDIA GPU architecture.")
set_property( CACHE CUDA_ARCH_NAME PROPERTY STRINGS "" ${__archs_names} )
mark_as_advanced(CUDA_ARCH_NAME)
# verify CUDA_ARCH_NAME value
if(NOT ";${__archs_names};" MATCHES ";${CUDA_ARCH_NAME};")
string(REPLACE ";" ", " __archs_names "${__archs_names}")
message(FATAL_ERROR "Only ${__archs_names} architeture names are supported.")
endif()
if(${CUDA_ARCH_NAME} STREQUAL "Manual")
set(CUDA_ARCH_BIN ${mshadow_known_gpu_archs} CACHE STRING "Specify 'real' GPU architectures to build binaries for, BIN(PTX) format is supported")
set(CUDA_ARCH_PTX "50" CACHE STRING "Specify 'virtual' PTX architectures to build PTX intermediate code for")
mark_as_advanced(CUDA_ARCH_BIN CUDA_ARCH_PTX)
else()
unset(CUDA_ARCH_BIN CACHE)
unset(CUDA_ARCH_PTX CACHE)
endif()
if(${CUDA_ARCH_NAME} STREQUAL "Fermi")
set(__cuda_arch_bin "20 21(20)")
elseif(${CUDA_ARCH_NAME} STREQUAL "Kepler")
set(__cuda_arch_bin "30 35")
elseif(${CUDA_ARCH_NAME} STREQUAL "Maxwell")
set(__cuda_arch_bin "50")
elseif(${CUDA_ARCH_NAME} STREQUAL "Pascal")
set(__cuda_arch_bin "60 61")
elseif(${CUDA_ARCH_NAME} STREQUAL "Volta")
set(__cuda_arch_bin "70")
elseif(${CUDA_ARCH_NAME} STREQUAL "All")
set(__cuda_arch_bin ${mshadow_known_gpu_archs})
elseif(${CUDA_ARCH_NAME} STREQUAL "Auto")
dgl_detect_installed_gpus(__cuda_arch_bin)
else() # (${CUDA_ARCH_NAME} STREQUAL "Manual")
set(__cuda_arch_bin ${CUDA_ARCH_BIN})
endif()
# remove dots and convert to lists
string(REGEX REPLACE "\\." "" __cuda_arch_bin "${__cuda_arch_bin}")
string(REGEX REPLACE "\\." "" __cuda_arch_ptx "${CUDA_ARCH_PTX}")
string(REGEX MATCHALL "[0-9()]+" __cuda_arch_bin "${__cuda_arch_bin}")
string(REGEX MATCHALL "[0-9]+" __cuda_arch_ptx "${__cuda_arch_ptx}")
mshadow_list_unique(__cuda_arch_bin __cuda_arch_ptx)
set(__nvcc_flags "")
set(__nvcc_archs_readable "")
# Tell NVCC to add binaries for the specified GPUs
foreach(__arch ${__cuda_arch_bin})
if(__arch MATCHES "([0-9]+)\\(([0-9]+)\\)")
# User explicitly specified PTX for the concrete BIN
list(APPEND __nvcc_flags -gencode arch=compute_${CMAKE_MATCH_2},code=sm_${CMAKE_MATCH_1})
list(APPEND __nvcc_archs_readable sm_${CMAKE_MATCH_1})
else()
# User didn't explicitly specify PTX for the concrete BIN, we assume PTX=BIN
list(APPEND __nvcc_flags -gencode arch=compute_${__arch},code=sm_${__arch})
list(APPEND __nvcc_archs_readable sm_${__arch})
endif()
endforeach()
# Tell NVCC to add PTX intermediate code for the specified architectures
foreach(__arch ${__cuda_arch_ptx})
list(APPEND __nvcc_flags -gencode arch=compute_${__arch},code=compute_${__arch})
list(APPEND __nvcc_archs_readable compute_${__arch})
endforeach()
string(REPLACE ";" " " __nvcc_archs_readable "${__nvcc_archs_readable}")
set(${out_variable} ${__nvcc_flags} PARENT_SCOPE)
set(${out_variable}_readable ${__nvcc_archs_readable} PARENT_SCOPE)
endfunction()
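Since dgl_select_nvcc_arch_flags exposes CUDA_ARCH_NAME as a cached CMake option (defaulting to "Auto" when not cross-compiling), the target architecture can be pinned at configure time instead of autodetected; a hedged sketch:

```bash
# Pin the GPU architecture rather than relying on "Auto" detection (sketch).
cmake -DUSE_CUDA=ON -DCUDA_ARCH_NAME=Volta ..
# Or hand-pick 'real' architectures via the "Manual" mode:
cmake -DUSE_CUDA=ON -DCUDA_ARCH_NAME=Manual -DCUDA_ARCH_BIN="60 70" ..
```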
################################################################################################
# Short command for cuda compilation
# Usage:
# dgl_cuda_compile(<objlist_variable> <cuda_files>)
macro(dgl_cuda_compile objlist_variable)
foreach(var CMAKE_CXX_FLAGS CMAKE_CXX_FLAGS_RELEASE CMAKE_CXX_FLAGS_DEBUG)
set(${var}_backup_in_cuda_compile_ "${${var}}")
# we remove /EHa as it generates warnings under windows
string(REPLACE "/EHa" "" ${var} "${${var}}")
endforeach()
if(UNIX OR APPLE)
list(APPEND CUDA_NVCC_FLAGS -Xcompiler -fPIC)
endif()
if(APPLE)
list(APPEND CUDA_NVCC_FLAGS -Xcompiler -Wno-unused-function)
endif()
set(CUDA_NVCC_FLAGS_DEBUG "${CUDA_NVCC_FLAGS_DEBUG} -G")
if(MSVC)
# disable noisy warnings:
# 4819: The file contains a character that cannot be represented in the current code page (number).
list(APPEND CUDA_NVCC_FLAGS -Xcompiler "/wd4819")
foreach(flag_var
CMAKE_CXX_FLAGS CMAKE_CXX_FLAGS_DEBUG CMAKE_CXX_FLAGS_RELEASE
CMAKE_CXX_FLAGS_MINSIZEREL CMAKE_CXX_FLAGS_RELWITHDEBINFO)
if(${flag_var} MATCHES "/MD")
string(REGEX REPLACE "/MD" "/MT" ${flag_var} "${${flag_var}}")
endif(${flag_var} MATCHES "/MD")
endforeach(flag_var)
endif()
# If the build system is a container, make sure the nvcc intermediate files
# go into the build output area rather than in /tmp, which may run out of space
if(IS_CONTAINER_BUILD)
set(CUDA_NVCC_INTERMEDIATE_DIR "${CMAKE_CURRENT_BINARY_DIR}")
message(STATUS "Container build enabled, so nvcc intermediate files in: ${CUDA_NVCC_INTERMEDIATE_DIR}")
list(APPEND CUDA_NVCC_FLAGS "--keep --keep-dir ${CUDA_NVCC_INTERMEDIATE_DIR}")
endif()
cuda_compile(cuda_objcs ${ARGN})
foreach(var CMAKE_CXX_FLAGS CMAKE_CXX_FLAGS_RELEASE CMAKE_CXX_FLAGS_DEBUG)
set(${var} "${${var}_backup_in_cuda_compile_}")
unset(${var}_backup_in_cuda_compile_)
endforeach()
set(${objlist_variable} ${cuda_objcs})
endmacro()
################################################################################################
# Config cuda compilation.
# Usage:
# dgl_config_cuda(<dgl_cuda_src>)
macro(dgl_config_cuda out_variable)
if(NOT CUDA_FOUND)
message(FATAL_ERROR "Cannot find CUDA.")
endif()
# always set the includedir when cuda is available
# avoid global retrigger of cmake
include_directories(${CUDA_INCLUDE_DIRS})
add_definitions(-DDGL_USE_CUDA)
file(GLOB_RECURSE DGL_CUDA_SRC
src/kernel/cuda/*.cc
src/kernel/cuda/*.cu
src/runtime/cuda/*.cc
)
dgl_select_nvcc_arch_flags(NVCC_FLAGS_ARCH)
string(REPLACE ";" " " NVCC_FLAGS_ARCH "${NVCC_FLAGS_ARCH}")
set(NVCC_FLAGS_EXTRA ${NVCC_FLAGS_ARCH})
# for lambda support in moderngpu
set(NVCC_FLAGS_EXTRA "${NVCC_FLAGS_EXTRA} --expt-extended-lambda")
# suppress deprecated warning in moderngpu
set(NVCC_FLAGS_EXTRA "${NVCC_FLAGS_EXTRA} -Wno-deprecated-declarations")
message(STATUS "NVCC extra flags: ${NVCC_FLAGS_EXTRA}")
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} ${NVCC_FLAGS_EXTRA}")
list(APPEND CMAKE_CUDA_FLAGS "${NVCC_FLAGS_EXTRA}")
list(APPEND DGL_LINKER_LIBS
${CUDA_CUDA_LIBRARY} ${CUDA_CUDART_LIBRARY}
${CUDA_CUBLAS_LIBRARIES} ${CUDA_cusparse_LIBRARY})
set(${out_variable} ${DGL_CUDA_SRC})
endmacro()
#######################################################
# Enhanced version of find CUDA.
#
# Usage:
# find_cuda(${USE_CUDA})
#
# - When USE_CUDA=ON, use auto search
# - When USE_CUDA=/path/to/cuda-path, use the cuda path
#
# Provide variables:
#
# - CUDA_FOUND
# - CUDA_INCLUDE_DIRS
# - CUDA_TOOLKIT_ROOT_DIR
# - CUDA_CUDA_LIBRARY
# - CUDA_CUDART_LIBRARY
# - CUDA_NVRTC_LIBRARY
# - CUDA_CUDNN_LIBRARY
# - CUDA_CUBLAS_LIBRARY
#
macro(find_cuda use_cuda)
set(__use_cuda ${use_cuda})
if(__use_cuda STREQUAL "ON")
find_package(CUDA QUIET)
elseif(IS_DIRECTORY ${__use_cuda})
set(CUDA_TOOLKIT_ROOT_DIR ${__use_cuda})
message(STATUS "Custom CUDA_PATH=" ${CUDA_TOOLKIT_ROOT_DIR})
set(CUDA_INCLUDE_DIRS ${CUDA_TOOLKIT_ROOT_DIR}/include)
set(CUDA_FOUND TRUE)
if(MSVC)
find_library(CUDA_CUDART_LIBRARY cudart
${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
else(MSVC)
find_library(CUDA_CUDART_LIBRARY cudart
${CUDA_TOOLKIT_ROOT_DIR}/lib64
${CUDA_TOOLKIT_ROOT_DIR}/lib)
endif(MSVC)
endif()
# additional libraries
if(CUDA_FOUND)
if(MSVC)
find_library(CUDA_CUDA_LIBRARY cuda
${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
find_library(CUDA_NVRTC_LIBRARY nvrtc
${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
find_library(CUDA_CUDNN_LIBRARY cudnn
${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
find_library(CUDA_CUBLAS_LIBRARY cublas
${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
else(MSVC)
find_library(_CUDA_CUDA_LIBRARY cuda
PATHS ${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES lib lib64 targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs lib64/stubs
NO_DEFAULT_PATH)
if(_CUDA_CUDA_LIBRARY)
set(CUDA_CUDA_LIBRARY ${_CUDA_CUDA_LIBRARY})
endif()
find_library(CUDA_NVRTC_LIBRARY nvrtc
PATHS ${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES lib lib64 targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs lib64/stubs lib/x86_64-linux-gnu
NO_DEFAULT_PATH)
find_library(CUDA_CUDNN_LIBRARY cudnn
${CUDA_TOOLKIT_ROOT_DIR}/lib64
${CUDA_TOOLKIT_ROOT_DIR}/lib)
find_library(CUDA_CUBLAS_LIBRARY cublas
${CUDA_TOOLKIT_ROOT_DIR}/lib64
${CUDA_TOOLKIT_ROOT_DIR}/lib)
endif(MSVC)
message(STATUS "Found CUDA_TOOLKIT_ROOT_DIR=" ${CUDA_TOOLKIT_ROOT_DIR})
message(STATUS "Found CUDA_CUDA_LIBRARY=" ${CUDA_CUDA_LIBRARY})
message(STATUS "Found CUDA_CUDART_LIBRARY=" ${CUDA_CUDART_LIBRARY})
message(STATUS "Found CUDA_NVRTC_LIBRARY=" ${CUDA_NVRTC_LIBRARY})
message(STATUS "Found CUDA_CUDNN_LIBRARY=" ${CUDA_CUDNN_LIBRARY})
message(STATUS "Found CUDA_CUBLAS_LIBRARY=" ${CUDA_CUBLAS_LIBRARY})
endif(CUDA_FOUND)
endmacro(find_cuda)
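Because find_cuda accepts either ON (auto search) or a toolkit path, a non-default CUDA installation can be selected directly at configure time; a sketch with an illustrative path:

```bash
# Point the build at a specific CUDA toolkit instead of auto-search
# (sketch; the path is illustrative).
cmake -DUSE_CUDA=/usr/local/cuda-9.0 ..
```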
################################################################################################
# Command alias for debugging messages
# Usage:
# dmsg(<message>)
function(dmsg)
message(STATUS ${ARGN})
endfunction()
################################################################################################
# Removes duplicates from list(s)
# Usage:
# mshadow_list_unique(<list_variable> [<list_variable>] [...])
macro(mshadow_list_unique)
foreach(__lst ${ARGN})
if(${__lst})
list(REMOVE_DUPLICATES ${__lst})
endif()
endforeach()
endmacro()
################################################################################################
# Clears variables from list
# Usage:
# mshadow_clear_vars(<variables_list>)
macro(mshadow_clear_vars)
foreach(_var ${ARGN})
unset(${_var})
endforeach()
endmacro()
################################################################################################
# Removes duplicates from string
# Usage:
# mshadow_string_unique(<string_variable>)
function(mshadow_string_unique __string)
if(${__string})
set(__list ${${__string}})
separate_arguments(__list)
list(REMOVE_DUPLICATES __list)
foreach(__e ${__list})
set(__str "${__str} ${__e}")
endforeach()
set(${__string} ${__str} PARENT_SCOPE)
endif()
endfunction()
################################################################################################
# Prints list element per line
# Usage:
# mshadow_print_list(<list>)
function(mshadow_print_list)
foreach(e ${ARGN})
message(STATUS ${e})
endforeach()
endfunction()
################################################################################################
# Function merging lists of compiler flags to single string.
# Usage:
# mshadow_merge_flag_lists(out_variable <list1> [<list2>] [<list3>] ...)
function(mshadow_merge_flag_lists out_var)
set(__result "")
foreach(__list ${ARGN})
foreach(__flag ${${__list}})
string(STRIP ${__flag} __flag)
set(__result "${__result} ${__flag}")
endforeach()
endforeach()
string(STRIP ${__result} __result)
set(${out_var} ${__result} PARENT_SCOPE)
endfunction()
################################################################################################
# Converts all paths in list to absolute
# Usage:
# mshadow_convert_absolute_paths(<list_variable>)
function(mshadow_convert_absolute_paths variable)
set(__dlist "")
foreach(__s ${${variable}})
get_filename_component(__abspath ${__s} ABSOLUTE)
list(APPEND __list ${__abspath})
endforeach()
set(${variable} ${__list} PARENT_SCOPE)
endfunction()
################################################################################################
# Reads set of version defines from the header file
# Usage:
# mshadow_parse_header(<file> <define1> <define2> <define3> ..)
macro(mshadow_parse_header FILENAME FILE_VAR)
set(vars_regex "")
set(__parnet_scope OFF)
set(__add_cache OFF)
foreach(name ${ARGN})
if("${name}" STREQUAL "PARENT_SCOPE")
set(__parnet_scope ON)
elseif("${name}" STREQUAL "CACHE")
set(__add_cache ON)
elseif(vars_regex)
set(vars_regex "${vars_regex}|${name}")
else()
set(vars_regex "${name}")
endif()
endforeach()
if(EXISTS "${FILENAME}")
file(STRINGS "${FILENAME}" ${FILE_VAR} REGEX "#define[ \t]+(${vars_regex})[ \t]+[0-9]+" )
else()
unset(${FILE_VAR})
endif()
foreach(name ${ARGN})
if(NOT "${name}" STREQUAL "PARENT_SCOPE" AND NOT "${name}" STREQUAL "CACHE")
if(${FILE_VAR})
if(${FILE_VAR} MATCHES ".+[ \t]${name}[ \t]+([0-9]+).*")
string(REGEX REPLACE ".+[ \t]${name}[ \t]+([0-9]+).*" "\\1" ${name} "${${FILE_VAR}}")
else()
set(${name} "")
endif()
if(__add_cache)
set(${name} ${${name}} CACHE INTERNAL "${name} parsed from ${FILENAME}" FORCE)
elseif(__parnet_scope)
set(${name} "${${name}}" PARENT_SCOPE)
endif()
else()
unset(${name} CACHE)
endif()
endif()
endforeach()
endmacro()
################################################################################################
# Reads single version define from the header file and parses it
# Usage:
# mshadow_parse_header_single_define(<library_name> <file> <define_name>)
function(mshadow_parse_header_single_define LIBNAME HDR_PATH VARNAME)
set(${LIBNAME}_H "")
if(EXISTS "${HDR_PATH}")
file(STRINGS "${HDR_PATH}" ${LIBNAME}_H REGEX "^#define[ \t]+${VARNAME}[ \t]+\"[^\"]*\".*$" LIMIT_COUNT 1)
endif()
if(${LIBNAME}_H)
string(REGEX REPLACE "^.*[ \t]${VARNAME}[ \t]+\"([0-9]+).*$" "\\1" ${LIBNAME}_VERSION_MAJOR "${${LIBNAME}_H}")
string(REGEX REPLACE "^.*[ \t]${VARNAME}[ \t]+\"[0-9]+\\.([0-9]+).*$" "\\1" ${LIBNAME}_VERSION_MINOR "${${LIBNAME}_H}")
string(REGEX REPLACE "^.*[ \t]${VARNAME}[ \t]+\"[0-9]+\\.[0-9]+\\.([0-9]+).*$" "\\1" ${LIBNAME}_VERSION_PATCH "${${LIBNAME}_H}")
set(${LIBNAME}_VERSION_MAJOR ${${LIBNAME}_VERSION_MAJOR} ${ARGN} PARENT_SCOPE)
set(${LIBNAME}_VERSION_MINOR ${${LIBNAME}_VERSION_MINOR} ${ARGN} PARENT_SCOPE)
set(${LIBNAME}_VERSION_PATCH ${${LIBNAME}_VERSION_PATCH} ${ARGN} PARENT_SCOPE)
set(${LIBNAME}_VERSION_STRING "${${LIBNAME}_VERSION_MAJOR}.${${LIBNAME}_VERSION_MINOR}.${${LIBNAME}_VERSION_PATCH}" PARENT_SCOPE)
# append a TWEAK version if it exists:
set(${LIBNAME}_VERSION_TWEAK "")
if("${${LIBNAME}_H}" MATCHES "^.*[ \t]${VARNAME}[ \t]+\"[0-9]+\\.[0-9]+\\.[0-9]+\\.([0-9]+).*$")
set(${LIBNAME}_VERSION_TWEAK "${CMAKE_MATCH_1}" ${ARGN} PARENT_SCOPE)
endif()
if(${LIBNAME}_VERSION_TWEAK)
set(${LIBNAME}_VERSION_STRING "${${LIBNAME}_VERSION_STRING}.${${LIBNAME}_VERSION_TWEAK}" ${ARGN} PARENT_SCOPE)
else()
set(${LIBNAME}_VERSION_STRING "${${LIBNAME}_VERSION_STRING}" ${ARGN} PARENT_SCOPE)
endif()
endif()
endfunction()
########################################################################################################
# An option that the user can select. Can accept condition to control when option is available for user.
# Usage:
# mshadow_option(<option_variable> "doc string" <initial value or boolean expression> [IF <condition>])
function(mshadow_option variable description value)
set(__value ${value})
set(__condition "")
set(__varname "__value")
foreach(arg ${ARGN})
if(arg STREQUAL "IF" OR arg STREQUAL "if")
set(__varname "__condition")
else()
list(APPEND ${__varname} ${arg})
endif()
endforeach()
unset(__varname)
if("${__condition}" STREQUAL "")
set(__condition 2 GREATER 1)
endif()
if(${__condition})
if("${__value}" MATCHES ";")
if(${__value})
option(${variable} "${description}" ON)
else()
option(${variable} "${description}" OFF)
endif()
elseif(DEFINED ${__value})
if(${__value})
option(${variable} "${description}" ON)
else()
option(${variable} "${description}" OFF)
endif()
else()
option(${variable} "${description}" ${__value})
endif()
else()
unset(${variable} CACHE)
endif()
endfunction()
################################################################################################
# Utility macro for comparing two lists. Used for CMake debugging purposes
# Usage:
# mshadow_compare_lists(<list_variable> <list2_variable> [description])
function(mshadow_compare_lists list1 list2 desc)
set(__list1 ${${list1}})
set(__list2 ${${list2}})
list(SORT __list1)
list(SORT __list2)
list(LENGTH __list1 __len1)
list(LENGTH __list2 __len2)
if(NOT ${__len1} EQUAL ${__len2})
message(FATAL_ERROR "Lists are not equal. ${__len1} != ${__len2}. ${desc}")
endif()
foreach(__i RANGE 1 ${__len1})
math(EXPR __index "${__i}- 1")
list(GET __list1 ${__index} __item1)
list(GET __list2 ${__index} __item2)
if(NOT ${__item1} STREQUAL ${__item2})
message(FATAL_ERROR "Lists are not equal. Differ at element ${__index}. ${desc}")
endif()
endforeach()
endfunction()
################################################################################################
# Command for disabling warnings for different platforms (see below for gcc and VisualStudio)
# Usage:
# mshadow_warnings_disable(<CMAKE_[C|CXX]_FLAGS[_CONFIGURATION]> -Wshadow /wd4996 ..,)
macro(mshadow_warnings_disable)
set(_flag_vars "")
set(_msvc_warnings "")
set(_gxx_warnings "")
foreach(arg ${ARGN})
if(arg MATCHES "^CMAKE_")
list(APPEND _flag_vars ${arg})
elseif(arg MATCHES "^/wd")
list(APPEND _msvc_warnings ${arg})
elseif(arg MATCHES "^-W")
list(APPEND _gxx_warnings ${arg})
endif()
endforeach()
if(NOT _flag_vars)
set(_flag_vars CMAKE_C_FLAGS CMAKE_CXX_FLAGS)
endif()
if(MSVC AND _msvc_warnings)
foreach(var ${_flag_vars})
foreach(warning ${_msvc_warnings})
set(${var} "${${var}} ${warning}")
endforeach()
endforeach()
elseif((CMAKE_COMPILER_IS_GNUCXX OR CMAKE_COMPILER_IS_CLANGXX) AND _gxx_warnings)
foreach(var ${_flag_vars})
foreach(warning ${_gxx_warnings})
if(NOT warning MATCHES "^-Wno-")
string(REPLACE "${warning}" "" ${var} "${${var}}")
string(REPLACE "-W" "-Wno-" warning "${warning}")
endif()
set(${var} "${${var}} ${warning}")
endforeach()
endforeach()
endif()
mshadow_clear_vars(_flag_vars _msvc_warnings _gxx_warnings)
endmacro()
################################################################################################
# Helper function get current definitions
# Usage:
# mshadow_get_current_definitions(<definitions_variable>)
function(mshadow_get_current_definitions definitions_var)
get_property(current_definitions DIRECTORY PROPERTY COMPILE_DEFINITIONS)
set(result "")
foreach(d ${current_definitions})
list(APPEND result -D${d})
endforeach()
mshadow_list_unique(result)
set(${definitions_var} ${result} PARENT_SCOPE)
endfunction()
################################################################################################
# Helper function get current includes/definitions
# Usage:
# mshadow_get_current_cflags(<cflagslist_variable>)
function(mshadow_get_current_cflags cflags_var)
get_property(current_includes DIRECTORY PROPERTY INCLUDE_DIRECTORIES)
mshadow_convert_absolute_paths(current_includes)
mshadow_get_current_definitions(cflags)
foreach(i ${current_includes})
list(APPEND cflags "-I${i}")
endforeach()
mshadow_list_unique(cflags)
set(${cflags_var} ${cflags} PARENT_SCOPE)
endfunction()
################################################################################################
# Helper function to parse current linker libs into link directories, libflags and osx frameworks
# Usage:
# mshadow_parse_linker_libs(<mshadow_LINKER_LIBS_var> <directories_var> <libflags_var> <frameworks_var>)
function(mshadow_parse_linker_libs mshadow_LINKER_LIBS_variable folders_var flags_var frameworks_var)
set(__unspec "")
set(__debug "")
set(__optimized "")
set(__framework "")
set(__varname "__unspec")
# split libs into debug, optimized, unspecified and frameworks
foreach(list_elem ${${mshadow_LINKER_LIBS_variable}})
if(list_elem STREQUAL "debug")
set(__varname "__debug")
elseif(list_elem STREQUAL "optimized")
set(__varname "__optimized")
elseif(list_elem MATCHES "^-framework[ \t]+([^ \t].*)")
list(APPEND __framework -framework ${CMAKE_MATCH_1})
else()
list(APPEND ${__varname} ${list_elem})
set(__varname "__unspec")
endif()
endforeach()
# attach debug or optimized libs to unspecified according to current configuration
if(CMAKE_BUILD_TYPE MATCHES "Debug")
set(__libs ${__unspec} ${__debug})
else()
set(__libs ${__unspec} ${__optimized})
endif()
set(libflags "")
set(folders "")
# convert linker libraries list to link flags
foreach(lib ${__libs})
if(TARGET ${lib})
list(APPEND folders $<TARGET_LINKER_FILE_DIR:${lib}>)
list(APPEND libflags -l${lib})
elseif(lib MATCHES "^-l.*")
list(APPEND libflags ${lib})
elseif(IS_ABSOLUTE ${lib})
get_filename_component(name_we ${lib} NAME_WE)
get_filename_component(folder ${lib} PATH)
string(REGEX MATCH "^lib(.*)" __match ${name_we})
list(APPEND libflags -l${CMAKE_MATCH_1})
list(APPEND folders ${folder})
else()
message(FATAL_ERROR "Logic error. Need to update cmake script")
endif()
endforeach()
mshadow_list_unique(libflags folders)
set(${folders_var} ${folders} PARENT_SCOPE)
set(${flags_var} ${libflags} PARENT_SCOPE)
set(${frameworks_var} ${__framework} PARENT_SCOPE)
endfunction()
################################################################################################
# Helper function to detect Darwin version, i.e. 10.8, 10.9, 10.10, ....
# Usage:
# mshadow_detect_darwin_version(<version_variable>)
function(mshadow_detect_darwin_version output_var)
if(APPLE)
execute_process(COMMAND /usr/bin/sw_vers -productVersion
RESULT_VARIABLE __sw_vers OUTPUT_VARIABLE __sw_vers_out
ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)
set(${output_var} ${__sw_vers_out} PARENT_SCOPE)
else()
set(${output_var} "" PARENT_SCOPE)
endif()
endfunction()
################################################################################################
# Convenient command to setup source group for IDEs that support this feature (VS, XCode)
# Usage:
# mshadow_source_group(<group> GLOB[_RECURSE] <globbing_expression>)
function(mshadow_source_group group)
cmake_parse_arguments(CAFFE_SOURCE_GROUP "" "" "GLOB;GLOB_RECURSE" ${ARGN})
if(CAFFE_SOURCE_GROUP_GLOB)
file(GLOB srcs1 ${CAFFE_SOURCE_GROUP_GLOB})
source_group(${group} FILES ${srcs1})
endif()
if(CAFFE_SOURCE_GROUP_GLOB_RECURSE)
file(GLOB_RECURSE srcs2 ${CAFFE_SOURCE_GROUP_GLOB_RECURSE})
source_group(${group} FILES ${srcs2})
endif()
endfunction()
\ No newline at end of file
macro(__dgl_option variable description value)
if(NOT DEFINED ${variable})
set(${variable} ${value} CACHE STRING ${description})
endif()
endmacro()
#######################################################
# An option that the user can select. Can accept condition to control when option is available for user.
# Usage:
# dgl_option(<option_variable> "doc string" <initial value or boolean expression> [IF <condition>])
macro(dgl_option variable description value)
set(__value ${value})
set(__condition "")
set(__varname "__value")
foreach(arg ${ARGN})
if(arg STREQUAL "IF" OR arg STREQUAL "if")
set(__varname "__condition")
else()
list(APPEND ${__varname} ${arg})
endif()
endforeach()
unset(__varname)
if("${__condition}" STREQUAL "")
set(__condition 2 GREATER 1)
endif()
if(${__condition})
if("${__value}" MATCHES ";")
if(${__value})
__dgl_option(${variable} "${description}" ON)
else()
__dgl_option(${variable} "${description}" OFF)
endif()
elseif(DEFINED ${__value})
if(${__value})
__dgl_option(${variable} "${description}" ON)
else()
__dgl_option(${variable} "${description}" OFF)
endif()
else()
__dgl_option(${variable} "${description}" "${__value}")
endif()
else()
unset(${variable} CACHE)
endif()
endmacro()
function(assign_source_group group)
foreach(_source IN ITEMS ${ARGN})
if (IS_ABSOLUTE "${_source}")
file(RELATIVE_PATH _source_rel "${CMAKE_CURRENT_SOURCE_DIR}" "${_source}")
else()
set(_source_rel "${_source}")
endif()
get_filename_component(_source_path "${_source_rel}" PATH)
string(REPLACE "/" "\\" _source_path_msvc "${_source_path}")
source_group("${group}\\${_source_path_msvc}" FILES "${_source}")
endforeach()
endfunction(assign_source_group)
# CI docker CPU env
FROM ubuntu:16.04
# Base scripts
RUN apt-get update --fix-missing
COPY install/ubuntu_install_core.sh /install/ubuntu_install_core.sh
RUN bash /install/ubuntu_install_core.sh
COPY install/ubuntu_install_build.sh /install/ubuntu_install_build.sh
RUN bash /install/ubuntu_install_build.sh
COPY install/ubuntu_install_python.sh /install/ubuntu_install_python.sh
RUN bash /install/ubuntu_install_python.sh
COPY install/ubuntu_install_python_package.sh /install/ubuntu_install_python_package.sh
RUN bash /install/ubuntu_install_python_package.sh
COPY install/ubuntu_install_mxnet_cpu.sh /install/ubuntu_install_mxnet_cpu.sh
RUN bash /install/ubuntu_install_mxnet_cpu.sh
@@ -27,6 +27,9 @@ RUN bash /install/ubuntu_install_python_package.sh
 COPY install/ubuntu_install_torch.sh /install/ubuntu_install_torch.sh
 RUN bash /install/ubuntu_install_torch.sh
+COPY install/ubuntu_install_mxnet_gpu.sh /install/ubuntu_install_mxnet_gpu.sh
+RUN bash /install/ubuntu_install_mxnet_gpu.sh
+
 # Environment variables
 ENV PATH=/usr/local/nvidia/bin:${PATH}
 ENV PATH=/usr/local/cuda/bin:${PATH}
......
# CI docker GPU env
FROM nvidia/cuda:9.0-cudnn7-devel
# Base scripts
RUN apt-get update --fix-missing
COPY install/ubuntu_install_core.sh /install/ubuntu_install_core.sh
RUN bash /install/ubuntu_install_core.sh
COPY install/ubuntu_install_build.sh /install/ubuntu_install_build.sh
RUN bash /install/ubuntu_install_build.sh
COPY install/ubuntu_install_python.sh /install/ubuntu_install_python.sh
RUN bash /install/ubuntu_install_python.sh
COPY install/ubuntu_install_python_package.sh /install/ubuntu_install_python_package.sh
RUN bash /install/ubuntu_install_python_package.sh
COPY install/ubuntu_install_mxnet_gpu.sh /install/ubuntu_install_mxnet_gpu.sh
RUN bash /install/ubuntu_install_mxnet_gpu.sh
# Environment variables
ENV PATH=/usr/local/nvidia/bin:${PATH}
ENV PATH=/usr/local/cuda/bin:${PATH}
ENV CPLUS_INCLUDE_PATH=/usr/local/cuda/include:${CPLUS_INCLUDE_PATH}
ENV C_INCLUDE_PATH=/usr/local/cuda/include:${C_INCLUDE_PATH}
ENV LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nvidia/lib64:${LIBRARY_PATH}
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
@@ -11,4 +11,4 @@ RUN bash /install/ubuntu_install_python.sh
 RUN apt-get install -y doxygen graphviz
-RUN pip3 install cpplint==1.3.0 pylint==2.2.2 mypy
+RUN pip3 install cpplint==1.3.0 pylint==2.3.0 mypy
@@ -14,13 +14,3 @@ docker build -t dgl-gpu -f Dockerfile.ci_gpu .
 ```bash
 docker build -t dgl-lint -f Dockerfile.ci_lint .
 ```
-
-### CPU MXNet image
-```bash
-docker build -t dgl-mxnet-cpu -f Dockerfile.ci_cpu_mxnet .
-```
-
-### GPU MXNet image
-```bash
-docker build -t dgl-mxnet-gpu -f Dockerfile.ci_gpu_mxnet .
-```
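With MXNet now installed into the main CPU and GPU CI images (see the Dockerfile changes above), the standalone MXNet image recipes are dropped from this README; rebuilding the remaining consolidated images follows the same pattern (a sketch using the tag style shown above):

```bash
# Rebuild the consolidated CI images, which now bundle MXNet (sketch).
docker build -t dgl-cpu -f Dockerfile.ci_cpu .
docker build -t dgl-gpu -f Dockerfile.ci_gpu .
```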
@@ -30,6 +30,9 @@ IdArray VecToIdArray(const std::vector<dgl_id_t>& vec);
 /*! \brief Create a copy of the given array */
 IdArray Clone(IdArray arr);
+/*! \brief Convert the idarray to the given bit width (on CPU) */
+IdArray AsNumBits(IdArray arr, uint8_t bits);
+
 /*! \brief Arithmetic functions */
 IdArray Add(IdArray lhs, IdArray rhs);
 IdArray Sub(IdArray lhs, IdArray rhs);
......
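The `AsNumBits` helper above exists because id arrays default to 64-bit storage, while some backend libraries (for example, legacy cuSPARSE routines) expect 32-bit indices. A minimal sketch of the conversion semantics in NumPy terms — illustrative only, not DGL code:

```python
import numpy as np

# AsNumBits(arr, 32) narrows 64-bit ids to 32-bit storage; this is only safe
# when every id fits in the smaller type, so a real implementation should check.
ids64 = np.array([0, 5, 7, 2**20], dtype=np.int64)
assert ids64.max() <= np.iinfo(np.int32).max
ids32 = ids64.astype(np.int32)
```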
@@ -73,26 +73,26 @@ class Graph: public GraphInterface {
* vertices to be added needs to be specified.
* \param num_vertices The number of vertices to be added.
*/
-void AddVertices(uint64_t num_vertices);
+void AddVertices(uint64_t num_vertices) override;
/*!
* \brief Add one edge to the graph.
* \param src The source vertex.
* \param dst The destination vertex.
*/
-void AddEdge(dgl_id_t src, dgl_id_t dst);
+void AddEdge(dgl_id_t src, dgl_id_t dst) override;
/*!
* \brief Add edges to the graph.
* \param src_ids The source vertex id array.
* \param dst_ids The destination vertex id array.
*/
-void AddEdges(IdArray src_ids, IdArray dst_ids);
+void AddEdges(IdArray src_ids, IdArray dst_ids) override;
/*!
* \brief Clear the graph. Remove all vertices/edges.
*/
-void Clear() {
+void Clear() override {
adjlist_.clear();
reverse_adjlist_.clear();
all_edges_src_.clear();
@@ -101,44 +101,52 @@ class Graph: public GraphInterface {
num_edges_ = 0;
}
DLContext Context() const override {
return DLContext{kDLCPU, 0};
}
uint8_t NumBits() const override {
return 64;
}
/*!
* \note not const since we have caches
* \return whether the graph is a multigraph
*/
-bool IsMultigraph() const {
+bool IsMultigraph() const override {
return is_multigraph_;
}
/*!
* \return whether the graph is read-only
*/
-virtual bool IsReadonly() const {
+bool IsReadonly() const override {
return false;
}
/*! \return the number of vertices in the graph.*/
-uint64_t NumVertices() const {
+uint64_t NumVertices() const override {
return adjlist_.size();
}
/*! \return the number of edges in the graph.*/
-uint64_t NumEdges() const {
+uint64_t NumEdges() const override {
return num_edges_;
}
/*! \return true if the given vertex is in the graph.*/
-bool HasVertex(dgl_id_t vid) const {
+bool HasVertex(dgl_id_t vid) const override {
return vid < NumVertices();
}
/*! \return a 0-1 array indicating whether the given vertices are in the graph.*/
-BoolArray HasVertices(IdArray vids) const;
+BoolArray HasVertices(IdArray vids) const override;
/*! \return true if the given edge is in the graph.*/
-bool HasEdgeBetween(dgl_id_t src, dgl_id_t dst) const;
+bool HasEdgeBetween(dgl_id_t src, dgl_id_t dst) const override;
/*! \return a 0-1 array indicating whether the given edges are in the graph.*/
-BoolArray HasEdgesBetween(IdArray src_ids, IdArray dst_ids) const;
+BoolArray HasEdgesBetween(IdArray src_ids, IdArray dst_ids) const override;
/*!
* \brief Find the predecessors of a vertex.
@@ -146,7 +154,7 @@ class Graph: public GraphInterface {
* \param radius The radius of the neighborhood. Default is immediate neighbor (radius=1).
* \return the predecessor id array.
*/
-IdArray Predecessors(dgl_id_t vid, uint64_t radius = 1) const;
+IdArray Predecessors(dgl_id_t vid, uint64_t radius = 1) const override;
/*!
* \brief Find the successors of a vertex.
@@ -154,7 +162,7 @@ class Graph: public GraphInterface {
* \param radius The radius of the neighborhood. Default is immediate neighbor (radius=1).
* \return the successor id array.
*/
-IdArray Successors(dgl_id_t vid, uint64_t radius = 1) const;
+IdArray Successors(dgl_id_t vid, uint64_t radius = 1) const override;
/*!
* \brief Get all edge ids between the two given endpoints
@@ -164,7 +172,7 @@ class Graph: public GraphInterface {
* \param dst The destination vertex.
* \return the edge id array.
*/
-IdArray EdgeId(dgl_id_t src, dgl_id_t dst) const;
+IdArray EdgeId(dgl_id_t src, dgl_id_t dst) const override;
/*!
* \brief Get all edge ids between the given endpoint pairs.
@@ -175,14 +183,14 @@ class Graph: public GraphInterface {
* first, and ties are broken by the order of edge ID.
* \return EdgeArray containing all edges between all pairs.
*/
-EdgeArray EdgeIds(IdArray src, IdArray dst) const;
+EdgeArray EdgeIds(IdArray src, IdArray dst) const override;
/*!
* \brief Find the edge ID and return the pair of endpoints
* \param eid The edge ID
* \return a pair whose first element is the source and the second the destination.
*/
-std::pair<dgl_id_t, dgl_id_t> FindEdge(dgl_id_t eid) const {
+std::pair<dgl_id_t, dgl_id_t> FindEdge(dgl_id_t eid) const override {
return std::make_pair(all_edges_src_[eid], all_edges_dst_[eid]);
}
@@ -191,7 +199,7 @@ class Graph: public GraphInterface {
* \param eids The edge ID array.
* \return EdgeArray containing all edges with id in eid. The order is preserved.
*/
-EdgeArray FindEdges(IdArray eids) const;
+EdgeArray FindEdges(IdArray eids) const override;
/*!
* \brief Get the in edges of the vertex.
@@ -199,14 +207,14 @@ class Graph: public GraphInterface {
* \param vid The vertex id.
* \return the edges
*/
-EdgeArray InEdges(dgl_id_t vid) const;
+EdgeArray InEdges(dgl_id_t vid) const override;
/*!
* \brief Get the in edges of the vertices.
* \param vids The vertex id array.
* \return the id arrays of the two endpoints of the edges.
*/
-EdgeArray InEdges(IdArray vids) const;
+EdgeArray InEdges(IdArray vids) const override;
/*!
* \brief Get the out edges of the vertex.
@@ -214,14 +222,14 @@ class Graph: public GraphInterface {
* \param vid The vertex id.
* \return the id arrays of the two endpoints of the edges.
*/
-EdgeArray OutEdges(dgl_id_t vid) const;
+EdgeArray OutEdges(dgl_id_t vid) const override;
/*!
* \brief Get the out edges of the vertices.
* \param vids The vertex id array.
* \return the id arrays of the two endpoints of the edges.
*/
-EdgeArray OutEdges(IdArray vids) const;
+EdgeArray OutEdges(IdArray vids) const override;
/*!
* \brief Get all the edges in the graph.
@@ -230,14 +238,14 @@ class Graph: public GraphInterface {
* \param sorted Whether the returned edge list is sorted by their src and dst ids
* \return the id arrays of the two endpoints of the edges.
*/
-EdgeArray Edges(const std::string &order = "") const;
+EdgeArray Edges(const std::string &order = "") const override;
/*!
* \brief Get the in degree of the given vertex.
* \param vid The vertex id.
* \return the in degree
*/
-uint64_t InDegree(dgl_id_t vid) const {
+uint64_t InDegree(dgl_id_t vid) const override {
CHECK(HasVertex(vid)) << "invalid vertex: " << vid;
return reverse_adjlist_[vid].succ.size();
}
@@ -247,14 +255,14 @@ class Graph: public GraphInterface {
* \param vid The vertex id array.
* \return the in degree array
*/
-DegreeArray InDegrees(IdArray vids) const;
+DegreeArray InDegrees(IdArray vids) const override;
/*!
* \brief Get the out degree of the given vertex.
* \param vid The vertex id.
* \return the out degree
*/
-uint64_t OutDegree(dgl_id_t vid) const {
+uint64_t OutDegree(dgl_id_t vid) const override {
CHECK(HasVertex(vid)) << "invalid vertex: " << vid;
return adjlist_[vid].succ.size();
}
@@ -264,7 +272,7 @@ class Graph: public GraphInterface {
* \param vid The vertex id array.
* \return the out degree array
*/
-DegreeArray OutDegrees(IdArray vids) const;
+DegreeArray OutDegrees(IdArray vids) const override;
/*!
* \brief Construct the induced subgraph of the given vertices.
@@ -282,7 +290,7 @@ class Graph: public GraphInterface {
* \param vids The vertices in the subgraph.
* \return the induced subgraph
*/
-Subgraph VertexSubgraph(IdArray vids) const;
+Subgraph VertexSubgraph(IdArray vids) const override;
/*!
* \brief Construct the induced edge subgraph of the given edges.
@@ -300,7 +308,7 @@ class Graph: public GraphInterface {
* \param eids The edges in the subgraph.
* \return the induced edge subgraph
*/
-Subgraph EdgeSubgraph(IdArray eids) const;
+Subgraph EdgeSubgraph(IdArray eids) const override;
/*!
* \brief Return a new graph with all the edges reversed.
@@ -309,14 +317,14 @@ class Graph: public GraphInterface {
*
* \return the reversed graph
*/
-GraphPtr Reverse() const;
+GraphPtr Reverse() const override;
/*!
* \brief Return the successor vector
* \param vid The vertex id.
* \return the successor vector
*/
-DGLIdIters SuccVec(dgl_id_t vid) const {
+DGLIdIters SuccVec(dgl_id_t vid) const override {
auto data = adjlist_[vid].succ.data();
auto size = adjlist_[vid].succ.size();
return DGLIdIters(data, data + size);
@@ -327,7 +335,7 @@ class Graph: public GraphInterface {
* \param vid The vertex id.
* \return the out edge id vector
*/
-DGLIdIters OutEdgeVec(dgl_id_t vid) const {
+DGLIdIters OutEdgeVec(dgl_id_t vid) const override {
auto data = adjlist_[vid].edge_id.data();
auto size = adjlist_[vid].edge_id.size();
return DGLIdIters(data, data + size);
@@ -338,7 +346,7 @@ class Graph: public GraphInterface {
* \param vid The vertex id.
* \return the predecessor vector
*/
-DGLIdIters PredVec(dgl_id_t vid) const {
+DGLIdIters PredVec(dgl_id_t vid) const override {
auto data = reverse_adjlist_[vid].succ.data();
auto size = reverse_adjlist_[vid].succ.size();
return DGLIdIters(data, data + size);
@@ -349,7 +357,7 @@ class Graph: public GraphInterface {
* \param vid The vertex id.
* \return the in edge id vector
*/
-DGLIdIters InEdgeVec(dgl_id_t vid) const {
+DGLIdIters InEdgeVec(dgl_id_t vid) const override {
auto data = reverse_adjlist_[vid].edge_id.data();
auto size = reverse_adjlist_[vid].edge_id.size();
return DGLIdIters(data, data + size);
@@ -359,7 +367,7 @@ class Graph: public GraphInterface {
* \brief Reset the data in the graph and move its data to the returned graph object.
* \return a raw pointer to the graph object.
*/
-virtual GraphInterface *Reset() {
+GraphInterface *Reset() override {
Graph* gptr = new Graph();
*gptr = std::move(*this);
return gptr;
@@ -374,7 +382,7 @@ class Graph: public GraphInterface {
* \param fmt the format of the returned adjacency matrix.
* \return a vector of three IdArray.
*/
-virtual std::vector<IdArray> GetAdj(bool transpose, const std::string &fmt) const;
+std::vector<IdArray> GetAdj(bool transpose, const std::string &fmt) const override;
protected:
friend class GraphOp;
...
@@ -96,6 +96,16 @@ class GraphInterface {
*/
virtual void Clear() = 0;
/*!
* \brief Get the device context of this graph.
*/
virtual DLContext Context() const = 0;
/*!
* \brief Get the number of integer bits used to store node/edge ids (32 or 64).
*/
virtual uint8_t NumBits() const = 0;
/*!
* \note not const since we have caches
* \return whether the graph is a multigraph
...
@@ -68,6 +68,14 @@ class CSR : public GraphInterface {
LOG(FATAL) << "CSR graph does not allow mutation.";
}
DLContext Context() const override {
return indptr_->ctx;
}
uint8_t NumBits() const override {
return indices_->dtype.bits;
}
bool IsMultigraph() const override;
bool IsReadonly() const override {
@@ -215,6 +223,20 @@ class CSR : public GraphInterface {
return CSRMatrix{indptr_, indices_, edge_ids_};
}
/*!
* \brief Copy the data to another context.
* \param ctx The target context.
* \return The graph under another context.
*/
CSR CopyTo(const DLContext& ctx) const;
/*!
* \brief Convert the graph to use the given number of bits for storage.
* \param bits The new number of integer bits (32 or 64).
* \return The graph with new bit size storage.
*/
CSR AsNumBits(uint8_t bits) const;
// member getters
IdArray indptr() const { return indptr_; }
@@ -266,6 +288,14 @@ class COO : public GraphInterface {
LOG(FATAL) << "COO graph does not allow mutation.";
}
DLContext Context() const override {
return src_->ctx;
}
uint8_t NumBits() const override {
return src_->dtype.bits;
}
bool IsMultigraph() const override;
bool IsReadonly() const override {
@@ -441,6 +471,20 @@ class COO : public GraphInterface {
return COOMatrix{src_, dst_, {}};
}
/*!
* \brief Copy the data to another context.
* \param ctx The target context.
* \return The graph under another context.
*/
COO CopyTo(const DLContext& ctx) const;
/*!
* \brief Convert the graph to use the given number of bits for storage.
* \param bits The new number of integer bits (32 or 64).
* \return The graph with new bit size storage.
*/
COO AsNumBits(uint8_t bits) const;
// member getters
IdArray src() const { return src_; }
@@ -528,6 +572,14 @@ class ImmutableGraph: public GraphInterface {
LOG(FATAL) << "Clear isn't supported in ImmutableGraph";
}
DLContext Context() const override {
return AnyGraph()->Context();
}
uint8_t NumBits() const override {
return AnyGraph()->NumBits();
}
/*!
* \note not const since we have caches
* \return whether the graph is a multigraph
@@ -871,6 +923,31 @@ class ImmutableGraph: public GraphInterface {
return coo_;
}
/*!
* \brief Convert the given graph to an immutable graph.
*
* If the graph is already an immutable graph, the result graph will share
* the storage with the given one.
*
* \param graph The input graph.
* \return an immutable graph object.
*/
static ImmutableGraph ToImmutable(const GraphInterface* graph);
/*!
* \brief Copy the data to another context.
* \param ctx The target context.
* \return The graph under another context.
*/
ImmutableGraph CopyTo(const DLContext& ctx) const;
/*!
* \brief Convert the graph to use the given number of bits for storage.
* \param bits The new number of integer bits (32 or 64).
* \return The graph with new bit size storage.
*/
ImmutableGraph AsNumBits(uint8_t bits) const;
protected:
/*! \brief internal default constructor */
ImmutableGraph() {}
...
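Taken together, `ToImmutable`, `AsNumBits`, and `CopyTo` let the kernel layer normalize any input graph before a launch. A hedged Python-side sketch of that flow; the method names `asbits` and `copyto` are illustrative stand-ins for whatever bindings end up wrapping the C++ calls above:

```python
def prepare_graph_index(gidx, ctx, bits=64):
    """Normalize a graph index before a fused kernel launch (a sketch only).

    Mirrors the C++ flow: ImmutableGraph::ToImmutable -> AsNumBits -> CopyTo.
    """
    gidx = gidx.asbits(bits)   # hypothetical binding for AsNumBits
    return gidx.copyto(ctx)    # hypothetical binding for CopyTo
```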
@@ -642,6 +642,11 @@ class DGLRetValue : public DGLPODValue_ {
value_.v_type = t;
return *this;
}
DGLRetValue& operator=(DGLContext ctx) {
this->SwitchToPOD(kDGLContext);
value_.v_ctx = ctx;
return *this;
}
DGLRetValue& operator=(bool value) {
this->SwitchToPOD(kDLInt);
value_.v_int64 = value;
...
@@ -4,14 +4,16 @@ from __future__ import absolute_import as _abs
import ctypes
from ..base import py_str, check_call, _LIB
-from ..runtime_ctypes import DGLByteArray, TypeCode
+from ..runtime_ctypes import DGLByteArray, TypeCode, DGLType, DGLContext

class DGLValue(ctypes.Union):
    """DGLValue in C API"""
    _fields_ = [("v_int64", ctypes.c_int64),
                ("v_float64", ctypes.c_double),
                ("v_handle", ctypes.c_void_p),
-                ("v_str", ctypes.c_char_p)]
+                ("v_str", ctypes.c_char_p),
                ("v_type", DGLType),
                ("v_ctx", DGLContext)]

DGLPackedCFunc = ctypes.CFUNCTYPE(
...
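With `v_type` and `v_ctx` added to the union, a device context can round-trip through the FFI. A self-contained sketch of how the new field behaves, assuming `DGLContext` is the usual `(device_type, device_id)` ctypes structure from `runtime_ctypes` (the structure below is a stand-in, not the real import):

```python
import ctypes

# Stand-in for the DGLContext structure from runtime_ctypes (an assumption).
class DGLContext(ctypes.Structure):
    _fields_ = [("device_type", ctypes.c_int),
                ("device_id", ctypes.c_int)]

class DGLValue(ctypes.Union):
    _fields_ = [("v_int64", ctypes.c_int64),
                ("v_float64", ctypes.c_double),
                ("v_handle", ctypes.c_void_p),
                ("v_str", ctypes.c_char_p),
                ("v_ctx", DGLContext)]

v = DGLValue()
v.v_ctx = DGLContext(2, 0)   # kDLGPU = 2, device 0 in DLPack's enum
print(v.v_ctx.device_type, v.v_ctx.device_id)
```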
@@ -211,6 +211,14 @@ def context(input):
    """
    pass

def device_type(ctx):
    """Return a string representing the device type."""
    pass

def device_id(ctx):
    """Return the device index."""
    pass
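A plausible PyTorch implementation of these two hooks (a sketch; the actual backend code may differ):

```python
import torch

def device_type(ctx):
    """Return a string for the device type of a torch.device, e.g. 'cpu' or 'cuda'."""
    return ctx.type

def device_id(ctx):
    """Return the device index (CPU tensors carry no index, so default to 0)."""
    return 0 if ctx.index is None else ctx.index
```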
def astype(input, ty):
    """Convert the input tensor to the given data type.
@@ -555,23 +563,6 @@ def ones(shape, dtype, ctx):
    """
    pass

def spmm(x, y):
    """Multiply a sparse matrix with a dense matrix.

    Parameters
    ----------
    x : SparseTensor
        The sparse matrix.
    y : Tensor
        The dense matrix.

    Returns
    -------
    Tensor
        The result dense matrix.
    """
    pass

def unsorted_1d_segment_sum(input, seg_id, n_segs, dim):
    """Computes the sum along segments of a tensor.
@@ -842,6 +833,105 @@ def zerocopy_from_numpy(np_array):
    """
    pass
def zerocopy_to_dgl_ndarray(input):
    """Zerocopy a framework-specific Tensor to a dgl.ndarray.NDArray.

    Parameters
    ----------
    input : Tensor

    Returns
    -------
    dgl.ndarray.NDArray
    """
    pass

def zerocopy_from_dgl_ndarray(input):
    """Zerocopy a dgl.ndarray.NDArray to a framework-specific Tensor.

    Parameters
    ----------
    input : dgl.ndarray.NDArray

    Returns
    -------
    Tensor
    """
    pass
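For PyTorch, both directions can be implemented with DLPack, which shares the underlying buffer instead of copying it. A sketch, assuming `dgl.ndarray` exposes TVM-style `from_dlpack`/`to_dlpack` helpers:

```python
import torch.utils.dlpack as dlpack
from dgl import ndarray as nd  # assumed to provide from_dlpack/to_dlpack

def zerocopy_to_dgl_ndarray(input):
    # to_dlpack shares memory; the tensor must be contiguous for a valid view.
    return nd.from_dlpack(dlpack.to_dlpack(input.contiguous()))

def zerocopy_from_dgl_ndarray(input):
    return dlpack.from_dlpack(input.to_dlpack())
```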
###############################################################################
# Custom Operators for graph level computations.
# Note: These operators are supposed to be implemented with DGL-provided
# kernels (see kernel.py) and plugged into the tensor framework via custom
# op extensions.
def binary_reduce(reducer, binary_op, graph, lhs, rhs, lhs_data, rhs_data,
                  out_size, lhs_map, rhs_map, out_map):
    """Perform a binary operation between the given data and reduce by the
    graph structure.

    Parameters
    ----------
    reducer : str
        Type of reduction: 'sum', 'max', 'min', 'mean', 'prod', 'none' (no
        reduction)
    binary_op : str
        Binary operation to perform, can be 'add', 'mul', 'sub', 'div'
    graph : GraphIndex
        The graph
    lhs : int
        The lhs target (src, dst, edge)
    rhs : int
        The rhs target (src, dst, edge)
    lhs_data : Tensor
        The lhs data
    rhs_data : Tensor
        The rhs data
    out_size : int
        Size of the first dimension of the output data
    lhs_map : tuple
        Two lhs id mapping arrays, one for the forward pass, the other for backward
    rhs_map : tuple
        Two rhs id mapping arrays, one for the forward pass, the other for backward
    out_map : tuple
        Two out id mapping arrays, one for the forward pass, the other for backward

    Returns
    -------
    Tensor
        The result.
    """
    pass
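As an illustration, a message-passing step like "multiply source features by edge weights, then sum into destinations" maps onto a single fused call. The target codes and map values below are placeholders for whatever enum the kernel module actually defines:

```python
def weighted_copy_src_sum(graph_index, src_feat, edge_weight, num_dst):
    """out[v] = sum over in-edges (u, e) of src_feat[u] * edge_weight[e] (a sketch)."""
    TARGET_SRC, TARGET_EDGE = 0, 2   # stand-ins for the real target enum
    no_map = (None, None)            # "no id remapping" in this sketch
    return binary_reduce('sum', 'mul', graph_index, TARGET_SRC, TARGET_EDGE,
                         src_feat, edge_weight, num_dst,
                         no_map, no_map, no_map)
```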
def copy_reduce(reducer, graph, target, in_data, out_size, in_map, out_map):
    """Copy the target data and reduce by the graph structure.

    Parameters
    ----------
    reducer : str
        Type of reduction: 'sum', 'max', 'min', 'mean', 'prod', 'none' (no
        reduction)
    graph : GraphIndex
        The graph
    target : int
        The input target (src, dst, edge)
    in_data : Tensor
        The input data
    out_size : int
        Size of the first dimension of the output data
    in_map : tuple
        Two input id mapping arrays, one for forward, the other for backward
    out_map : tuple
        Two output id mapping arrays, one for forward, the other for backward

    Returns
    -------
    Tensor
        The result.
    """
    pass
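`copy_reduce` is the degenerate case with no second operand; the classic copy-src-and-sum aggregation would look roughly like this (target code and map values again placeholders):

```python
def copy_src_sum(graph_index, src_feat, num_dst):
    """out[v] = sum over in-edges (u, e) of src_feat[u] (a sketch)."""
    TARGET_SRC = 0          # stand-in for the real target enum
    no_map = (None, None)   # "no id remapping" in this sketch
    return copy_reduce('sum', graph_index, TARGET_SRC, src_feat,
                       num_dst, no_map, no_map)
```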
###############################################################################
# Other interfaces
# ----------------
...