Commit 653428bd authored by Lingfan Yu, committed by Minjie Wang

[Feature][Kernel] DGL kernel support (#596)

* [Kernel] Minigun integration and fused kernel support (#519)

* kernel interface

* add minigun

* Add cuda build

* functors

* working on binary elewise

* binary reduce

* change kernel interface

* WIP

* wip

* fix minigun

* compile

* binary reduce kernels

* compile

* simple test passed

* more reducers

* fix thrust problem

* fix cmake

* fix cmake; add proper guard for atomic

* WIP: bcast

* WIP

* bcast kernels

* update to new minigun pass-by-value practice

* broadcasting dim

* add copy src and copy edge

* fix linking

* fix none array problem

* fix copy edge

* add device_type and device_id to backend operator

* cache csr adj, remove cache for adjmat and incmat

* custom ops in backend and pytorch impl

* change dgl-mg kernel python interface

* add id_mapping var

* clean up plus v2e spmv schedule

* spmv schedule & clean up fall back

* symbolic message and reduce func, remove bundle func

* new executors

* new backend interface for dgl kernels and pytorch impl

* minor fix

* fix

* fix docstring, comments, func names

* nodeflow

* fix message id mapping and bugs...

* pytorch test case & fix

* backward binary reduce

* fix bug

* WIP: cusparse

* change to int32 csr for cusparse workaround

* disable cusparse

* change back to int64

* broadcasting backward

* cusparse; WIP: add rev_csr

* unit test for kernels

* pytorch backward with dgl kernel

* edge softmax

* fix backward

* improve softmax

* cache edge on device

* cache mappings on device

* fix partial forward code

* cusparse done

* copy_src_sum with cusparse

* rm id getter

* reduce grad for broadcast

* copy edge reduce backward

* kernel unit test for broadcasting

* full kernel unit test

* add cpu kernels

* edge softmax unit test

* missing ref

* fix compile and small bugs

* fix bug in bcast

* Add backward both

* fix torch utests

* expose infershape

* create out tensor in python

* fix c++ lint

* [Kernel] Add GPU utest and kernel utest (#524)

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* [Kernel] Update kernel branch (#550)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* Fixing typo in JTNN after interface change (#536)

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix lint

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* [Kernel] Update kernel branch (#576)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* Fixing typo in JTNN after interface change (#536)

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix lint

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* all demos use python-3 (#555)

* [DEMO] Reproduce numbers of distributed training in AMLC giant graph paper (#556)

* update

* update

* update

* update num_hops

* fix bug

* update

* report numbers of distributed training in AMLC giant graph paper

* [DEMO] Remove duplicate code for sampling (#557)

* update

* update

* re-use single-machine code

* update

* use relative path

* update

* update

* update

* add __init__.py

* add __init__.py

* import sys, os

* fix typo

* update

* [Perf] Improve performance of graph store. (#554)

* fix.

* use inplace.

* move to shared memory graph store.

* fix.

* add more unit tests.

* fix.

* fix test.

* fix test.

* disable test.

* fix.

* [BUGFIX] fix a bug in edge_ids (#560)

* add test.

* fix compute.

* fix test.

* turn on test.

* fix a bug.

* add test.

* fix.

* disable test.

* [DEMO] Add Pytorch demo for distributed sampler (#562)

* update

* update

* update

* add sender

* update

* remove duplicate code

* [Test] Add gtest to project (#547)

* add gtest module

* add gtest

* fix

* Update CMakeLists.txt

* Update README.md

* [Perf] lazily create msg_index. (#563)

* lazily create msg_index.

* update test.

* [BUGFIX] fix bugs for running GCN on giant graphs. (#561)

* load mxnet csr.

* enable load large csr.

* fix

* fix.

* fix int overflow.

* fix test.

* [BugFix] Fix error when bfs_level = 0 in Entity Classification with RGCN (#559)

* [DEMO] Update demo of distributed sampler (#564)

* update

* update

* update demo

* add network cpp test (#565)

* Add unittest for C++ RPC (#566)

* [CI] Fix CI for cpp test (#570)

* fix CI for cpp test

* update port number

* [Docker] update docker image (#575)

* update docker image

* specify lint version

* rm torch import from unified tests

* [Kernel][Scheduler][MXNet] Scheduler for DGL kernels and MXNet backend support (#541)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* edge softmax module

* WIP

* Fixing typo in JTNN after interface change (#536)

* mxnet backend support

* improve reduce grad

* add max to unittest backend

* fix kernel unittest

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* lint

* lint

* win build

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix lint

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* try

* fix

* fix

* fix

* fix

* fix

* try

* test

* test

* test

* try

* try

* try

* test

* fix

* try gen_target

* fix gen_target

* fix msvc var_args expand issue

* fix

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* WIP

* WIP

* all demos use python-3 (#555)

* ToImmutable and CopyTo

* [DEMO] Reproduce numbers of distributed training in AMLC giant graph paper (#556)

* update

* update

* update

* update num_hops

* fix bug

* update

* report numbers of distributed training in AMLC giant graph paper

* [DEMO] Remove duplicate code for sampling (#557)

* update

* update

* re-use single-machine code

* update

* use relative path

* update

* update

* update

* add __init__.py

* add __init__.py

* import sys, os

* fix typo

* update

* [Perf] Improve performance of graph store. (#554)

* fix.

* use inplace.

* move to shared memory graph store.

* fix.

* add more unit tests.

* fix.

* fix test.

* fix test.

* disable test.

* fix.

* [BUGFIX] fix a bug in edge_ids (#560)

* add test.

* fix compute.

* fix test.

* turn on test.

* fix a bug.

* add test.

* fix.

* disable test.

* DGLRetValue DGLContext conversion

* [DEMO] Add Pytorch demo for distributed sampler (#562)

* update

* update

* update

* add sender

* update

* remove duplicate code

* [Test] Add gtest to project (#547)

* add gtest module

* add gtest

* fix

* Update CMakeLists.txt

* Update README.md

* Add support to convert immutable graph to 32 bits

* [Perf] lazily create msg_index. (#563)

* lazily create msg_index.

* update test.

* fix binary reduce following new minigun template

* enable both int64 and int32 kernels

* [BUGFIX] fix bugs for running GCN on giant graphs. (#561)

* load mxnet csr.

* enable load large csr.

* fix

* fix.

* fix int overflow.

* fix test.

* new kernel interface done for CPU

* docstring

* rename & docstring

* copy reduce and backward

* [BugFix] Fix error when bfs_level = 0 in Entity Classification with RGCN (#559)

* [DEMO] Update demo of distributed sampler (#564)

* update

* update

* update demo

* adapt cuda kernels to the new interface

* add network cpp test (#565)

* fix bug

* Add unittest for C++ RPC (#566)

* [CI] Fix CI for cpp test (#570)

* fix CI for cpp test

* update port number

* [Docker] update docker image (#575)

* update docker image

* specify lint version

* rm torch import from unified tests

* remove pytorch-specific test_function

* fix unittest

* fix

* fix unittest backend bug in converting tensor to numpy array

* fix

* mxnet version

* [BUGFIX] fix for MXNet 1.5. (#552)

* remove clone.

* turn on numpy compatible.

* Revert "remove clone."

This reverts commit 17bbf76ed72ff178df6b3f35addc428048672457.

* revert format changes

* fix mxnet api name

* revert mistakes in previous revert

* roll back CI to 20190523 build

* fix unittest

* disable test_shared_mem_store.py for now

* remove mxnet/test_specialization.py

* sync win64 test script

* fix lowercase

* missing backend in gpu unit test

* transpose to get forward graph

* pass update all

* add sanity check

* passing test_specialization.py

* fix and pass test_function

* fix check

* fix pytorch softmax

* mxnet kernels

* c++ lint

* pylint

* try

* win build

* fix

* win

* ci enable gpu build

* init submodule recursively

* backend docstring

* try

* test win dev

* doc string

* disable pytorch test_nn

* try to fix windows issue

* bug fixed, revert changes

* [Test] fix CI. (#586)

* disable unit test in mxnet tutorial.

* retry socket connection.

* roll back to set_np_compat

* try to fix multi-processing test hangs when it fails.

* fix test.

* fix.

* doc string

* doc string and clean up

* missing field in ctypes

* fix node flow schedule and unit test

* rename

* pylint

* copy from parent default context

* fix unit test script

* fix

* demo bug in nodeflow gpu test

* [Kernel][Bugfix] fix nodeflow bug (#604)

* fix nodeflow bug

* remove debug code

* add build gtest option

* fix cmake; fix graph index bug in spmv.py

* remove clone

* fix div rhs grad bug

* [Kernel] Support full builtin method, edge softmax and unit tests (#605)

* add full builtin support

* unit test

* unit test backend

* edge softmax

* apply edge with builtin

* fix kernel unit test

* disable mxnet test_shared_mem_store

* gen builtin reduce

* enable mxnet gpu unittest

* revert some changes

* docstring

* add note for the hack

* [Kernel][Unittest][CI] Fix MXNet GPU CI (#607)

* update docker image for MXNet GPU CI

* force all dgl graph input and output on CPU

* fix gpu unittest

* speedup compilation

* add some comments

* lint

* add more comments

* fix as requested

* add some comments

* comment

* lint

* lint

* update pylint

* fix as requested

* lint

* lint

* lint

* docstrings of python DGL kernel entries

* disable lint warnings on arguments in kernel.py

* fix docstring in scheduler

* fix some bug in unittest; try again

* Revert "Merge branch 'kernel' of github.com:zzhang-cn/dgl into kernel"

This reverts commit 1d2299e68b004182ea6130b088de1f1122b18a49, reversing
changes made to ddc97fbf1bec2b7815c0da7c74f7ecb2f428889b.

* Revert "fix some bug in unittest; try again"

This reverts commit ddc97fbf1bec2b7815c0da7c74f7ecb2f428889b.

* more comprehensive kernel test

* remove shape check in test_specialization
parent da0c92a2
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/functor.h
* \brief Functors for template on CPU
*/
#ifndef DGL_KERNEL_CPU_FUNCTOR_H_
#define DGL_KERNEL_CPU_FUNCTOR_H_
#include <dmlc/omp.h>
#include <algorithm>
#include "../binary_reduce_common.h"
namespace dgl {
namespace kernel {
// Reducer functor specialization
template <typename DType>
struct ReduceSum<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
#pragma omp atomic
*addr += val;
}
static DType BackwardCall(DType val, DType accum) {
return 1;
}
};
template <typename DType>
struct ReduceMax<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
#pragma omp critical
*addr = std::max(*addr, val);
}
static DType BackwardCall(DType val, DType accum) {
return static_cast<DType>(val == accum);
}
};
template <typename DType>
struct ReduceMin<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
#pragma omp critical
*addr = std::min(*addr, val);
}
static DType BackwardCall(DType val, DType accum) {
return static_cast<DType>(val == accum);
}
};
template <typename DType>
struct ReduceProd<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
#pragma omp atomic
*addr *= val;
}
static DType BackwardCall(DType val, DType accum) {
return accum / val;
}
};
template <typename DType>
struct ReduceNone<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
*addr = val;
}
static DType BackwardCall(DType val, DType accum) {
return 1;
}
};
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CPU_FUNCTOR_H_
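For intuition, here is a minimal, self-contained sketch (not part of the commit; SimpleReduceSum/SimpleReduceMax and ReduceMessages are illustrative stand-ins, single-threaded and without the OpenMP pragmas) of how reducer functors in this style are driven: the caller walks the edges and lets Reducer::Call fold each per-edge value into the destination node's accumulator, while BackwardCall supplies the corresponding gradient factor.

#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified stand-in mirroring ReduceSum<kDLCPU, DType> above.
struct SimpleReduceSum {
  static void Call(float* addr, float val) { *addr += val; }
  static float BackwardCall(float /*val*/, float /*accum*/) { return 1.f; }
};

// Simplified stand-in mirroring ReduceMax<kDLCPU, DType> above.
struct SimpleReduceMax {
  static void Call(float* addr, float val) { *addr = std::max(*addr, val); }
  static float BackwardCall(float val, float accum) { return static_cast<float>(val == accum); }
};

// Fold one message per edge into its destination accumulator, the way the kernels do:
// the reducer only defines how a single value combines with the accumulator.
template <typename Reducer>
void ReduceMessages(const std::vector<int>& dst, const std::vector<float>& msg,
                    std::vector<float>* out) {
  for (size_t e = 0; e < dst.size(); ++e) {
    Reducer::Call(&(*out)[dst[e]], msg[e]);
  }
}

int main() {
  const std::vector<int> dst = {0, 0, 1};          // destination node of each edge
  const std::vector<float> msg = {1.f, 2.f, 5.f};  // per-edge messages
  std::vector<float> out(2, 0.f);                  // sum accumulators, zero-initialized
  ReduceMessages<SimpleReduceSum>(dst, msg, &out);
  std::printf("%g %g\n", out[0], out[1]);          // prints: 3 5
  return 0;
}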
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/utils.cc
* \brief Utility function implementations on CPU
*/
#include "../utils.h"
namespace dgl {
namespace kernel {
namespace utils {
template <int XPU, typename DType>
void Fill(const DLContext& ctx, DType* ptr, size_t length, DType val) {
for (size_t i = 0; i < length; ++i) {
*(ptr + i) = val;
}
}
template void Fill<kDLCPU, float>(const DLContext& ctx, float* ptr, size_t length, float val);
template void Fill<kDLCPU, double>(const DLContext& ctx, double* ptr, size_t length, double val);
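// Usage note (illustration): the CUDA path later in this commit fills an all-ones array
// via utils::Fill<kDLGPU>(rtcfg.ctx, valptr, nnz, static_cast<DType>(1.)); the CPU
// specializations above are invoked the same way, e.g. (hypothetical ctx/buf/n):
//   utils::Fill<kDLCPU, float>(ctx, buf, n, 1.0f);  // sets buf[0..n-1] to 1.0f
// The DLContext argument is unused by this CPU implementation.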
} // namespace utils
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/atomic.cuh
* \brief Atomic functions
*/
#ifndef DGL_KERNEL_CUDA_ATOMIC_H_
#define DGL_KERNEL_CUDA_ATOMIC_H_
#include <cuda_runtime.h>
#if __CUDA_ARCH__ >= 600
#include <cuda_fp16.h>
#endif
namespace dgl {
namespace kernel {
namespace cuda {
// Type trait for selecting code type
template <int Bytes> struct Code { };
template <> struct Code<4> {
typedef unsigned int Type;
};
template <> struct Code<8> {
typedef unsigned long long int Type;
};
// Helper class for converting to/from atomicCAS compatible types.
template <typename T> struct Cast {
typedef typename Code<sizeof(T)>::Type Type;
static __device__ __forceinline__ Type Encode(T val) {
return static_cast<Type>(val);
}
static __device__ __forceinline__ T Decode(Type code) {
return static_cast<T>(code);
}
};
template <> struct Cast<float> {
typedef Code<sizeof(float)>::Type Type;
static __device__ __forceinline__ Type Encode(float val) {
return __float_as_uint(val);
}
static __device__ __forceinline__ float Decode(Type code) {
return __uint_as_float(code);
}
};
template <> struct Cast<double> {
typedef Code<sizeof(double)>::Type Type;
static __device__ __forceinline__ Type Encode(double val) {
return __double_as_longlong(val);
}
static __device__ __forceinline__ double Decode(Type code) {
return __longlong_as_double(code);
}
};
#define DEFINE_ATOMIC(NAME) \
template <typename T> \
__device__ __forceinline__ T Atomic##NAME(T* addr, T val) { \
typedef typename Cast<T>::Type CT; \
CT* addr_as_ui = reinterpret_cast<CT*>(addr); \
CT old = *addr_as_ui; \
CT assumed = old; \
do { \
assumed = old; \
old = atomicCAS(addr_as_ui, assumed, \
Cast<T>::Encode(OP(val, Cast<T>::Decode(old)))); \
} while (assumed != old); \
return Cast<T>::Decode(old); \
}
#define OP(a, b) max(a, b)
DEFINE_ATOMIC(Max)
#undef OP
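// Illustration only (not an extra definition): for T = float, DEFINE_ATOMIC(Max) above
// expands to roughly the following, emulating an atomic float max via atomicCAS on the
// 32-bit bit pattern:
//
//   __device__ __forceinline__ float AtomicMax(float* addr, float val) {
//     unsigned int* addr_as_ui = reinterpret_cast<unsigned int*>(addr);
//     unsigned int old = *addr_as_ui;
//     unsigned int assumed = old;
//     do {
//       assumed = old;
//       // decode the current value, combine with val, re-encode, and try to swap it in
//       old = atomicCAS(addr_as_ui, assumed,
//                       __float_as_uint(max(val, __uint_as_float(old))));
//     } while (assumed != old);  // retry if another thread updated *addr in between
//     return __uint_as_float(old);
//   }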
#define OP(a, b) min(a, b)
DEFINE_ATOMIC(Min)
#undef OP
#define OP(a, b) a + b
DEFINE_ATOMIC(Add)
#undef OP
#if __CUDA_ARCH__ >= 200
template <>
__device__ __forceinline__ float AtomicAdd<float>(float* addr, float val) {
return atomicAdd(addr, val);
}
#endif // __CUDA_ARCH__
#if __CUDA_ARCH__ >= 600
template <>
__device__ __forceinline__ double AtomicAdd<double>(double* addr, double val) {
return atomicAdd(addr, val);
}
#endif
#if defined(CUDART_VERSION) && CUDART_VERSION >= 10000
#if __CUDA_ARCH__ >= 600
template <>
__device__ __forceinline__ __half2 AtomicAdd<__half2>(__half2* addr, __half2 val) {
return atomicAdd(addr, val);
}
#endif // __CUDA_ARCH__
#if __CUDA_ARCH__ >= 700
template <>
__device__ __forceinline__ __half AtomicAdd<__half>(__half* addr, __half val) {
return atomicAdd(addr, val);
}
#endif // __CUDA_ARCH__
#endif
#define OP(a, b) a * b
DEFINE_ATOMIC(Mul)
#undef OP
} // namespace cuda
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CUDA_ATOMIC_H_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/backward_binary_reduce_impl.cuh
* \brief Minigun CUDA UDFs for backward binary reduce
*/
#ifndef DGL_KERNEL_CUDA_BACKWARD_BINARY_REDUCE_IMPL_CUH_
#define DGL_KERNEL_CUDA_BACKWARD_BINARY_REDUCE_IMPL_CUH_
#include <minigun/minigun.h>
#include <dgl/immutable_graph.h>
#include "../binary_reduce_impl_decl.h"
#include "../utils.h"
#include "./functor.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
// Minigun UDF to compute backward binary reduce.
template <int Mode, typename Idx, typename DType, typename Functors>
struct BackwardBinaryReduce {
static __device__ __forceinline__ bool CondEdge(
Idx src, Idx dst, Idx eid, BackwardGData<Idx, DType>* gdata) {
return true;
}
static __device__ __forceinline__ void ApplyEdge(
Idx src, Idx dst, Idx eid, BackwardGData<Idx, DType>* gdata) {
const int64_t D = gdata->x_length;
int64_t tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * D;
DType* rhsoff = gdata->rhs_data + rid * D;
DType* outoff = gdata->out_data + oid * D;
DType* gradlhsoff = gdata->grad_lhs_data + lid * D;
DType* gradrhsoff = gdata->grad_rhs_data + rid * D;
DType* gradoutoff = gdata->grad_out_data + oid * D;
while (tx < D) {
DType lhs = Functors::Read(lhsoff + tx);
DType rhs = Functors::Read(rhsoff + tx);
DType out = Functors::Read(outoff + tx);
DType grad_out = Functors::Read(gradoutoff + tx);
DType e = Functors::Op(lhs, rhs);
DType grad_e = grad_out * Functors::BackwardWrite(e, out);
if (Mode == binary_op::kGradLhs || Mode == binary_op::kGradBoth) {
DType grad_lhs = grad_e * Functors::BackwardOpLhs(lhs, rhs, e);
AtomicAdd(gradlhsoff + tx, grad_lhs);
}
if (Mode == binary_op::kGradRhs || Mode == binary_op::kGradBoth) {
DType grad_rhs = grad_e * Functors::BackwardOpRhs(lhs, rhs, e);
AtomicAdd(gradrhsoff + tx, grad_rhs);
}
tx += stride_x;
}
}
};
// Minigun UDF to compute backward binary reduce with broadcasting.
template <int Mode, int NDim, typename Idx, typename DType, typename Functors>
struct BackwardBinaryReduceBcast {
static __device__ __forceinline__ bool CondEdge(
Idx src, Idx dst, Idx eid, BackwardBcastGData<NDim, Idx, DType>* gdata) {
return true;
}
static __device__ __forceinline__ void ApplyEdge(
Idx src, Idx dst, Idx eid, BackwardBcastGData<NDim, Idx, DType>* gdata) {
int64_t tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * gdata->lhs_len;
DType* rhsoff = gdata->rhs_data + rid * gdata->rhs_len;
DType* outoff = gdata->out_data + oid * gdata->out_len;
DType* gradlhsoff = gdata->grad_lhs_data + lid * gdata->out_len;
DType* gradrhsoff = gdata->grad_rhs_data + rid * gdata->out_len;
DType* gradoutoff = gdata->grad_out_data + oid * gdata->out_len;
int64_t tmp[NDim]; // store unraveled idx.
while (tx < gdata->out_len) {
Unravel(tx, gdata->ndim, gdata->out_shape, gdata->out_stride, tmp);
DType lhs = Functors::Read(lhsoff +
Ravel(tmp, gdata->ndim, gdata->lhs_shape, gdata->lhs_stride));
DType rhs = Functors::Read(rhsoff +
Ravel(tmp, gdata->ndim, gdata->rhs_shape, gdata->rhs_stride));
DType out = Functors::Read(outoff + tx);
DType grad_out = Functors::Read(gradoutoff + tx);
DType e = Functors::Op(lhs, rhs);
DType grad_e = grad_out * Functors::BackwardWrite(e, out);
if (Mode == binary_op::kGradLhs || Mode == binary_op::kGradBoth) {
DType grad_lhs = grad_e * Functors::BackwardOpLhs(lhs, rhs, e);
AtomicAdd(gradlhsoff + tx, grad_lhs);
}
if (Mode == binary_op::kGradRhs || Mode == binary_op::kGradBoth) {
DType grad_rhs = grad_e * Functors::BackwardOpRhs(lhs, rhs, e);
AtomicAdd(gradrhsoff + tx, grad_rhs);
}
tx += stride_x;
}
}
};
// Auxiliary template used in UDF.
template <typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
struct BackwardFunctorsTempl {
static __device__ __forceinline__ Idx SelectOut(
Idx src, Idx edge, Idx dst) {
typedef typename OutSelector<Reducer>::Type OutTarget;
return SwitchSrcDst<OutTarget>::Type::Call(src, edge, dst);
}
static __device__ __forceinline__ Idx SelectLeft(
Idx src, Idx edge, Idx dst) {
return LeftSelector::Call(src, edge, dst);
}
static __device__ __forceinline__ Idx SelectRight(
Idx src, Idx edge, Idx dst) {
return RightSelector::Call(src, edge, dst);
}
static __device__ __forceinline__ DType Op(DType lhs, DType rhs) {
return BinaryOp::Call(lhs, rhs);
}
static __device__ __forceinline__ DType Read(DType* addr) {
return LDGReader<DType>::Call(addr);
}
static __device__ __forceinline__ void Write(DType* addr, DType val) {
Reducer::Call(addr, val);
}
static __device__ __forceinline__ Idx GetId(Idx id, Idx* id_map) {
return LDGReader<Idx>::Call(id_map + id);
}
static __device__ __forceinline__ DType BackwardWrite(DType val, DType accum) {
return Reducer::BackwardCall(val, accum);
}
static __device__ __forceinline__ DType BackwardOpLhs(DType lhs, DType rhs, DType out) {
return BinaryOp::BackwardLhs(lhs, rhs, out);
}
static __device__ __forceinline__ DType BackwardOpRhs(DType lhs, DType rhs, DType out) {
return BinaryOp::BackwardRhs(lhs, rhs, out);
}
};
typedef minigun::advance::Config<true, minigun::advance::kV2N> AdvanceConfig;
} // namespace cuda
// Template implementation of BackwardBinaryReduce operator.
template <int XPU, int Mode, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBackwardBinaryReduce(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<Idx, DType>* gdata) {
// For backward computation, we use reverse csr and switch dst and src.
// This benefits the most common src_op_edge or copy_src case, because the
// gradients of src are now aggregated into destination buffer to reduce
// competition of atomic add.
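// E.g. for copy_src with a sum reducer, the forward pass computes
//   out[v] = sum over edges (u -> v) of lhs[u],
// so the backward pass needs
//   grad_lhs[u] = sum over edges (u -> v) of grad_out[v],
// i.e. an aggregation along the reverse direction, which is exactly what the in-edge
// CSR with src and dst switched provides.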
auto incsr = graph->GetInCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(incsr->indptr(), incsr->indices());
typedef cuda::BackwardFunctorsTempl<Idx, DType,
typename SwitchSrcDst<LeftSelector>::Type,
typename SwitchSrcDst<RightSelector>::Type,
BinaryOp, Reducer> Functors;
typedef cuda::BackwardBinaryReduce<Mode, Idx, DType, Functors> UDF;
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge
&& gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge
&& gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig, BackwardGData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// Following macro is used to generate explicit-specialization of the template
// operator.
#define GEN_BACKWARD_DEFINE(mode, dtype, lhs_tgt, rhs_tgt, op) \
template void CallBackwardBinaryReduce<XPU, \
mode, IDX, dtype, \
lhs_tgt, rhs_tgt, \
op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
BackwardGData<IDX, dtype>* gdata);
// Template implementation of BackwardBinaryReduce with broadcasting operator.
template <int XPU, int Mode, int NDim, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBackwardBinaryReduceBcast(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardBcastGData<NDim, Idx, DType>* gdata) {
// For backward computation, we use reverse csr and switch dst and src.
// This benefits the most common src_op_edge or copy_src case, because the
// gradients of src are now aggregated into destination buffer to reduce
// competition of atomic add.
auto incsr = graph->GetInCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(incsr->indptr(), incsr->indices());
typedef cuda::BackwardFunctorsTempl<Idx, DType,
typename SwitchSrcDst<LeftSelector>::Type,
typename SwitchSrcDst<RightSelector>::Type,
BinaryOp, Reducer> Functors;
typedef cuda::BackwardBinaryReduceBcast<Mode, NDim, Idx, DType, Functors> UDF;
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge
&& gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge
&& gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig,
BackwardBcastGData<NDim, Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// Following macro is used to generate explicit-specialization of the template
// operator.
#define GEN_BACKWARD_BCAST_DEFINE(mode, ndim, dtype, lhs_tgt, rhs_tgt, op) \
template void CallBackwardBinaryReduceBcast<XPU, \
mode, ndim, IDX, dtype, \
lhs_tgt, rhs_tgt, \
op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
BackwardBcastGData<ndim, IDX, dtype>* gdata);
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CUDA_BACKWARD_BINARY_REDUCE_IMPL_CUH_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_max.cu
* \brief CUDA kernels for broadcasting binary reduce max
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceMax
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
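// Note: EVAL(F, ...) simply expands to F(...) (see the EVAL/MSVC_EXPAND macro in
// binary_reduce_impl.cuh), so the two statements above chain the GEN_* generator macros
// to emit explicit instantiations of CallBinaryReduceBcast and CallBackwardBinaryReduceBcast
// for each supported (ndim, dtype, operand target, op) combination, with REDUCER = ReduceMax,
// XPU = kDLGPU and IDX = int32_t as #defined above.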
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_min.cu
* \brief CUDA kernels for broadcasting binary reduce min
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceMin
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_none.cu
* \brief CUDA kernels for broadcasting binary reduce none
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceNone
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_prod.cu
* \brief CUDA kernels for broadcasting binary reduce prod
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceProd
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_sum.cu
* \brief CUDA kernels for broadcasting binary reduce sum
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceSum
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_impl.cu
* \brief Binary reduce implementation on cuda.
*/
#include "../binary_reduce_impl.h"
using dgl::runtime::NDArray;
namespace dgl {
namespace kernel {
template void BinaryReduceImpl<kDLGPU>(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping);
template void BinaryReduceBcastImpl<kDLGPU>(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping);
template void BackwardBinaryReduceImpl<kDLGPU>(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
NDArray lhs_mapping, NDArray rhs_mapping, NDArray out_mapping,
NDArray lhs_data, NDArray rhs_data, NDArray out_data,
NDArray grad_out_data,
NDArray grad_lhs_data, NDArray grad_rhs_data);
template void BackwardBinaryReduceBcastImpl<kDLGPU>(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs_tgt, binary_op::Target rhs_tgt,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping,
runtime::NDArray lhs, runtime::NDArray rhs, runtime::NDArray out, runtime::NDArray grad_out,
runtime::NDArray grad_lhs, runtime::NDArray grad_rhs);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_impl.cuh
* \brief Minigun CUDA UDFs for binary reduce
*/
#ifndef DGL_KERNEL_CUDA_BINARY_REDUCE_IMPL_CUH_
#define DGL_KERNEL_CUDA_BINARY_REDUCE_IMPL_CUH_
#include <minigun/minigun.h>
#include <dgl/immutable_graph.h>
#include "../binary_reduce_impl_decl.h"
#include "../utils.h"
#include "./functor.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
// Minigun UDF to compute binary reduce.
template <typename Idx, typename DType, typename Functors>
struct BinaryReduce {
static __device__ __forceinline__ bool CondEdge(
Idx src, Idx dst, Idx eid, GData<Idx, DType>* gdata) {
return true;
}
static __device__ __forceinline__ void ApplyEdge(
Idx src, Idx dst, Idx eid, GData<Idx, DType>* gdata) {
const int64_t D = gdata->x_length;
int64_t tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * D;
DType* rhsoff = gdata->rhs_data + rid * D;
DType* outoff = gdata->out_data + oid * D;
while (tx < D) {
DType lhs = Functors::Read(lhsoff + tx);
DType rhs = Functors::Read(rhsoff + tx);
DType out = Functors::Op(lhs, rhs);
Functors::Write(outoff + tx, out);
tx += stride_x;
}
}
};
// Convert flattened index to multi-dimension index (assume row-major).
__device__ __forceinline__ void Unravel(
int64_t idx, int ndim, const int64_t* shape, const int64_t* stride, int64_t* out) {
for (int d = 0; d < ndim; ++d) {
out[d] = (idx / stride[d]) % shape[d];
}
}
// Convert multi-dimension index to flattened index (assume row-major).
__device__ __forceinline__ int64_t Ravel(
const int64_t* idx, int ndim, const int64_t* shape, const int64_t* stride) {
int64_t out = 0;
for (int d = 0; d < ndim; ++d) {
out += min(idx[d], shape[d] - 1) * stride[d];
}
return out;
}
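// Worked example (illustration): suppose out_shape = {2, 3}, out_stride = {3, 1},
// lhs_shape = {1, 3}, lhs_stride = {3, 1}.  For flattened output index 5, Unravel
// yields {1, 2}; Ravel then clamps the size-1 lhs dimension via min(idx[d], shape[d] - 1)
// and returns 0 * 3 + 2 * 1 = 2, so output element 5 reads lhs element 2 -- the single
// lhs row is broadcast across all output rows.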
// Minigun UDF to compute binary reduce with broadcasting.
template <int NDim, typename Idx, typename DType, typename Functors>
struct BinaryReduceBcast {
static __device__ __forceinline__ bool CondEdge(
Idx src, Idx dst, Idx eid, BcastGData<NDim, Idx, DType>* gdata) {
return true;
}
static __device__ __forceinline__ void ApplyEdge(
Idx src, Idx dst, Idx eid, BcastGData<NDim, Idx, DType>* gdata) {
int64_t tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * gdata->lhs_len;
DType* rhsoff = gdata->rhs_data + rid * gdata->rhs_len;
DType* outoff = gdata->out_data + oid * gdata->out_len;
int64_t tmp[NDim]; // store unraveled idx.
while (tx < gdata->out_len) {
Unravel(tx, gdata->ndim, gdata->out_shape, gdata->out_stride, tmp);
DType lhs = Functors::Read(lhsoff +
Ravel(tmp, gdata->ndim, gdata->lhs_shape, gdata->lhs_stride));
DType rhs = Functors::Read(rhsoff +
Ravel(tmp, gdata->ndim, gdata->rhs_shape, gdata->rhs_stride));
DType out = Functors::Op(lhs, rhs);
Functors::Write(outoff + tx, out);
tx += stride_x;
}
}
};
// Auxiliary template used in UDF.
template <typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
struct FunctorsTempl {
static __device__ __forceinline__ Idx SelectOut(
Idx src, Idx edge, Idx dst) {
return OutSelector<Reducer>::Type::Call(src, edge, dst);
}
static __device__ __forceinline__ Idx SelectLeft(
Idx src, Idx edge, Idx dst) {
return LeftSelector::Call(src, edge, dst);
}
static __device__ __forceinline__ Idx SelectRight(
Idx src, Idx edge, Idx dst) {
return RightSelector::Call(src, edge, dst);
}
static __device__ __forceinline__ DType Op(DType lhs, DType rhs) {
return BinaryOp::Call(lhs, rhs);
}
static __device__ __forceinline__ DType Read(DType* addr) {
return LDGReader<DType>::Call(addr);
}
static __device__ __forceinline__ void Write(DType* addr, DType val) {
Reducer::Call(addr, val);
}
static __device__ __forceinline__ Idx GetId(Idx id, Idx* id_map) {
return LDGReader<Idx>::Call(id_map + id);
}
};
typedef minigun::advance::Config<true, minigun::advance::kV2N> AdvanceConfig;
} // namespace cuda
// Template implementation of BinaryReduce operator.
template <int XPU, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBinaryReduce(const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<Idx, DType>* gdata) {
typedef cuda::FunctorsTempl<Idx, DType, LeftSelector,
RightSelector, BinaryOp, Reducer>
Functors;
typedef cuda::BinaryReduce<Idx, DType, Functors> UDF;
// csr
auto outcsr = graph->GetOutCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(outcsr->indptr(), outcsr->indices());
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge && gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge && gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig, GData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// Template implementation of BinaryReduce broadcasting operator.
template <int XPU, int NDim, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBinaryReduceBcast(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BcastGData<NDim, Idx, DType>* gdata) {
typedef cuda::FunctorsTempl<Idx, DType, LeftSelector,
RightSelector, BinaryOp, Reducer>
Functors;
typedef cuda::BinaryReduceBcast<NDim, Idx, DType, Functors> UDF;
// csr
auto outcsr = graph->GetOutCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(outcsr->indptr(), outcsr->indices());
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge && gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge && gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig,
BcastGData<NDim, Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// Following macro is used to generate explicit-specialization of the template
// operator.
#define GEN_DEFINE(dtype, lhs_tgt, rhs_tgt, op) \
template void CallBinaryReduce<XPU, IDX, \
dtype, lhs_tgt, rhs_tgt, op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
GData<IDX, dtype>* gdata);
#define GEN_BCAST_DEFINE(ndim, dtype, lhs_tgt, rhs_tgt, op) \
template void CallBinaryReduceBcast<XPU, ndim, IDX, dtype, \
lhs_tgt, rhs_tgt, \
op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
BcastGData<ndim, IDX, dtype>* gdata);
#define EVAL(F, ...) MSVC_EXPAND(F(__VA_ARGS__))
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CUDA_BINARY_REDUCE_IMPL_CUH_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_max.cu
* \brief CUDA kernels for binary reduce max
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
#define REDUCER ReduceMax
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE)
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_min.cu
* \brief CUDA kernels for binary reduce min
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
#define REDUCER ReduceMin
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE)
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_none.cu
* \brief CUDA kernels for binary reduce none
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
#define REDUCER ReduceNone
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE)
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_prod.cu
* \brief CUDA kernels for binary reduce prod
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
#define REDUCER ReduceProd
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE)
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_sum.cu
* \brief CUDA kernels for binary reduce sum
*/
#include <dgl/runtime/device_api.h>
#include "../../runtime/cuda/cuda_common.h"
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
#include "../utils.h"
using minigun::advance::RuntimeConfig;
using Csr = minigun::Csr<int32_t>;
namespace dgl {
namespace kernel {
namespace cuda {
// specialization for cusparse
template <typename DType>
cusparseStatus_t Xcsrmm2(cusparseHandle_t handle, cusparseOperation_t transA,
cusparseOperation_t transB, int m, int n, int k, int nnz,
const DType* alpha, const cusparseMatDescr_t descrA,
const DType* csrValA, const int* csrRowPtrA, const int* csrColIndA,
const DType* B, int ldb, const DType* beta, DType* C, int ldc) {
LOG(INFO) << "Not supported dtype";
return CUSPARSE_STATUS_EXECUTION_FAILED;
}
template <>
cusparseStatus_t Xcsrmm2<float>(cusparseHandle_t handle, cusparseOperation_t transA,
cusparseOperation_t transB, int m, int n, int k, int nnz,
const float* alpha, const cusparseMatDescr_t descrA,
const float* csrValA, const int* csrRowPtrA, const int* csrColIndA,
const float* B, int ldb, const float* beta, float* C, int ldc) {
return cusparseScsrmm2(handle, transA, transB, m, n, k, nnz,
alpha, descrA, csrValA, csrRowPtrA, csrColIndA,
B, ldb, beta, C, ldc);
}
template <>
cusparseStatus_t Xcsrmm2<double>(cusparseHandle_t handle, cusparseOperation_t transA,
cusparseOperation_t transB, int m, int n, int k, int nnz,
const double* alpha, const cusparseMatDescr_t descrA,
const double* csrValA, const int* csrRowPtrA, const int* csrColIndA,
const double* B, int ldb, const double* beta, double* C, int ldc) {
return cusparseDcsrmm2(handle, transA, transB, m, n, k, nnz,
alpha, descrA, csrValA, csrRowPtrA, csrColIndA,
B, ldb, beta, C, ldc);
}
template <typename DType>
cublasStatus_t Xgeam(cublasHandle_t handle, cublasOperation_t transa,
cublasOperation_t transb, int m, int n,
const DType* alpha, const DType* A, int lda,
const DType* beta, const DType* B, int ldb,
DType* C, int ldc) {
LOG(INFO) << "Not supported dtype";
return CUBLAS_STATUS_EXECUTION_FAILED;
}
template <>
cublasStatus_t Xgeam<float>(cublasHandle_t handle, cublasOperation_t transa,
cublasOperation_t transb, int m, int n,
const float* alpha, const float* A, int lda,
const float* beta, const float* B, int ldb,
float* C, int ldc) {
return cublasSgeam(handle, transa, transb, m, n, alpha, A, lda,
beta, B, ldb, C, ldc);
}
template <>
cublasStatus_t Xgeam<double>(cublasHandle_t handle, cublasOperation_t transa,
cublasOperation_t transb, int m, int n,
const double* alpha, const double* A, int lda,
const double* beta, const double* B, int ldb,
double* C, int ldc) {
return cublasDgeam(handle, transa, transb, m, n, alpha, A, lda,
beta, B, ldb, C, ldc);
}
template <typename DType>
void CusparseCsrmm2(
const RuntimeConfig& rtcfg,
const Csr& csr,
const DType* B_data, DType* C_data,
int out_size, int x_length) {
// We use csrmm2 to perform the following operation:
// C = A x B, where A is a sparse matrix in csr format, B is the dense matrix for node
// feature tensor. However, since cusparse only supports column-major, while our tensor
// is stored in row-major, the actual computation is:
// C = trans(A x trans(B)).
// Currently, we use cublasXgeam to implement transposition and allocate intermediate
// workspace memory for this.
// TODO(minjie): The given CSR could potentially represent a bipartite graph (e.g. in the
// case of nodeflow). Currently, we don't have bipartite graph support. Here is a small
// hack. In the python side, we create a CSR that includes both the source and destination
// nodes in the bipartite graph (so it is still square matrix). Here, when multiplying
// this sparse matrix, we specify the number of rows (the `m` here) to be equal to the
// number of rows of the output tensor (i.e, the `out_size`).
// In the future, we should make sure the number of rows of the given csr is equal
// to out_size (i.e., the given csr would be a rectangular matrix).
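// Dimension walk-through (illustration): A is the m x k sparse matrix (m = out_size,
// k = number of csr rows), and B_data holds the k x n feature matrix (n = x_length) in
// row-major order, which cusparse reads as a column-major n x k matrix -- hence
// transB = CUSPARSE_OPERATION_TRANSPOSE below.  csrmm2 writes the m x n product into
// trans_out in column-major order, and the cublasXgeam call afterwards transposes it
// into the row-major m x n buffer C_data.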
const int m = out_size;
const int k = csr.row_offsets.length - 1;
const int n = x_length;
const int nnz = csr.column_indices.length;
const DType alpha = 1.0;
const DType beta = 0.0;
// device
auto device = runtime::DeviceAPI::Get(rtcfg.ctx);
auto* thr_entry = runtime::CUDAThreadEntry::ThreadLocal();
// allocate cusparse handle if needed
if (!thr_entry->cusparse_handle) {
CUSPARSE_CALL(cusparseCreate(&(thr_entry->cusparse_handle)));
}
CUSPARSE_CALL(cusparseSetStream(thr_entry->cusparse_handle, rtcfg.stream));
// allocate matrix for temporary transposed output
DType* trans_out = static_cast<DType*>(device->AllocWorkspace(rtcfg.ctx, m * n * sizeof(DType)));
// all-ones data array
DType* valptr = static_cast<DType*>(device->AllocWorkspace(rtcfg.ctx, nnz * sizeof(DType)));
utils::Fill<kDLGPU>(rtcfg.ctx, valptr, nnz, static_cast<DType>(1.));
cusparseMatDescr_t descr;
CUSPARSE_CALL(cusparseCreateMatDescr(&descr));
CUSPARSE_CALL(cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL));
CUSPARSE_CALL(cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO));
CUSPARSE_CALL(Xcsrmm2<DType>(
thr_entry->cusparse_handle,
CUSPARSE_OPERATION_NON_TRANSPOSE,
CUSPARSE_OPERATION_TRANSPOSE,
m, n, k, nnz, &alpha,
descr, valptr, csr.row_offsets.data, csr.column_indices.data,
B_data, n, &beta, trans_out, m));
device->FreeWorkspace(rtcfg.ctx, valptr);
// transpose the output matrix
if (!thr_entry->cublas_handle) {
CUBLAS_CALL(cublasCreate(&(thr_entry->cublas_handle)));
}
CUBLAS_CALL(cublasSetStream(thr_entry->cublas_handle, rtcfg.stream));
CUBLAS_CALL(Xgeam<DType>(
thr_entry->cublas_handle,
CUBLAS_OP_T,
CUBLAS_OP_N,
n, m,
&alpha, trans_out, m,
&beta, nullptr, n,
C_data, n));
device->FreeWorkspace(rtcfg.ctx, trans_out);
}
// forward
template <typename DType>
void FallbackCallBinaryReduce(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<int32_t, DType>* gdata) {
constexpr int XPU = kDLGPU;
typedef int32_t Idx;
typedef SelectSrc LeftSelector;
typedef SelectNone RightSelector;
typedef BinaryUseLhs<DType> BinaryOp;
typedef ReduceSum<kDLGPU, DType> Reducer;
typedef cuda::FunctorsTempl<Idx, DType, LeftSelector,
RightSelector, BinaryOp, Reducer>
Functors;
typedef cuda::BinaryReduce<Idx, DType, Functors> UDF;
// csr
auto outcsr = graph->GetOutCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(outcsr->indptr(), outcsr->indices());
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge && gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge && gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig, GData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
template <typename DType>
void FallbackCallBackwardBinaryReduce(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<int32_t, DType>* gdata) {
constexpr int XPU = kDLGPU;
constexpr int Mode = binary_op::kGradLhs;
typedef int32_t Idx;
typedef SelectSrc LeftSelector;
typedef SelectNone RightSelector;
typedef BinaryUseLhs<DType> BinaryOp;
typedef ReduceSum<kDLGPU, DType> Reducer;
// For the backward computation, we use the reverse CSR and swap dst and src.
// This benefits the most common src_op_edge and copy_src cases, because the
// gradients of src are then aggregated into the destination buffer, which
// reduces contention among atomic adds.
auto incsr = graph->GetInCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(incsr->indptr(), incsr->indices());
typedef cuda::BackwardFunctorsTempl<Idx, DType,
typename SwitchSrcDst<LeftSelector>::Type,
typename SwitchSrcDst<RightSelector>::Type,
BinaryOp, Reducer> Functors;
typedef cuda::BackwardBinaryReduce<Mode, Idx, DType, Functors> UDF;
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge
&& gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge
&& gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig, BackwardGData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
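// A matching CPU sketch of the backward pass for copy_src + sum, again with a
// hypothetical helper name (illustrative only): the gradient of a source node
// is the sum of the output gradients of every destination it points to, i.e.
// an SpMM over the out-CSR, the reverse of the forward in-CSR.
template <typename DType>
void ReferenceCopySrcSumBackward(const int32_t* indptr, const int32_t* indices,
                                 int64_t num_src, int64_t x_length,
                                 const DType* grad_out, DType* grad_src) {
  for (int64_t j = 0; j < num_src; ++j) {  // each source node
    for (int64_t d = 0; d < x_length; ++d)
      grad_src[j * x_length + d] = 0;
    for (int32_t e = indptr[j]; e < indptr[j + 1]; ++e) {
      const int32_t i = indices[e];  // destination neighbor of node j
      for (int64_t d = 0; d < x_length; ++d)
        grad_src[j * x_length + d] += grad_out[i * x_length + d];
    }
  }
}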
} // namespace cuda
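// The specializations below intercept the copy_src + sum pattern (SelectSrc,
// BinaryUseLhs, ReduceSum): when no custom id mappings are supplied, the
// computation is dispatched to the cuSPARSE csrmm2 path; otherwise it falls
// back to the generic minigun kernels.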
template <>
void CallBinaryReduce<kDLGPU, int32_t, float, SelectSrc, SelectNone,
BinaryUseLhs<float>, ReduceSum<kDLGPU, float>>(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<int32_t, float>* gdata) {
if (gdata->lhs_mapping || gdata->rhs_mapping || gdata->out_mapping) {
cuda::FallbackCallBinaryReduce<float>(rtcfg, graph, gdata);
} else {
// cuSPARSE uses the reverse CSR (in-CSR) for csrmm
auto incsr = graph->GetInCSR();
Csr csr = utils::CreateCsr<int32_t>(incsr->indptr(), incsr->indices());
cuda::CusparseCsrmm2(rtcfg, csr, gdata->lhs_data, gdata->out_data,
gdata->out_size, gdata->x_length);
}
}
template <>
void CallBinaryReduce<kDLGPU, int32_t, double, SelectSrc, SelectNone,
BinaryUseLhs<double>, ReduceSum<kDLGPU, double>>(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<int32_t, double>* gdata) {
if (gdata->lhs_mapping || gdata->rhs_mapping || gdata->out_mapping) {
cuda::FallbackCallBinaryReduce<double>(rtcfg, graph, gdata);
} else {
// cuSPARSE uses the reverse CSR (in-CSR) for csrmm
auto incsr = graph->GetInCSR();
Csr csr = utils::CreateCsr<int32_t>(incsr->indptr(), incsr->indices());
cuda::CusparseCsrmm2(rtcfg, csr, gdata->lhs_data, gdata->out_data,
gdata->out_size, gdata->x_length);
}
}
// backward
template <>
void CallBackwardBinaryReduce<kDLGPU, binary_op::kGradLhs, int32_t, float,
SelectSrc, SelectNone,
BinaryUseLhs<float>, ReduceSum<kDLGPU, float>>(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<int32_t, float>* gdata) {
if (gdata->lhs_mapping || gdata->rhs_mapping || gdata->out_mapping) {
cuda::FallbackCallBackwardBinaryReduce<float>(rtcfg, graph, gdata);
} else {
auto outcsr = graph->GetOutCSR();
Csr csr = utils::CreateCsr<int32_t>(outcsr->indptr(), outcsr->indices());
cuda::CusparseCsrmm2(rtcfg, csr, gdata->grad_out_data, gdata->grad_lhs_data,
gdata->out_size, gdata->x_length);
}
}
template <>
void CallBackwardBinaryReduce<kDLGPU, binary_op::kGradLhs, int32_t, double,
SelectSrc, SelectNone,
BinaryUseLhs<double>, ReduceSum<kDLGPU, double>>(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<int32_t, double>* gdata) {
if (gdata->lhs_mapping || gdata->rhs_mapping || gdata->out_mapping) {
cuda::FallbackCallBackwardBinaryReduce<double>(rtcfg, graph, gdata);
} else {
auto outcsr = graph->GetOutCSR();
Csr csr = utils::CreateCsr<int32_t>(outcsr->indptr(), outcsr->indices());
cuda::CusparseCsrmm2(rtcfg, csr, gdata->grad_out_data, gdata->grad_lhs_data,
gdata->out_size, gdata->x_length);
}
}
// generate definitions
#define REDUCER ReduceSum
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/functor.cuh
* \brief Functors for template on CUDA
*/
#ifndef DGL_KERNEL_CUDA_FUNCTOR_CUH_
#define DGL_KERNEL_CUDA_FUNCTOR_CUH_
#include "../binary_reduce_common.h"
#include "./atomic.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
// Cache load from global memory
template <typename DType>
struct LDGReader {
static __device__ __forceinline__ DType Call(DType* addr) {
#if __CUDA_ARCH__ >= 350
return __ldg(addr);
#else
return *addr;
#endif
}
};
} // namespace cuda
// Reducer functor specialization
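// Call() folds `val` into the accumulation buffer at `addr` (atomically where
// needed); BackwardCall() returns d(accum)/d(val), the factor that scales the
// output gradient for this input: 1 for sum, the indicator (val == accum) for
// max/min, accum / val for prod, and 1 for the pass-through ReduceNone.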
template <typename DType>
struct ReduceSum<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
cuda::AtomicAdd(addr, val);
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return 1;
}
};
template <typename DType>
struct ReduceMax<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
cuda::AtomicMax(addr, val);
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return static_cast<DType>(val == accum);
}
};
template <typename DType>
struct ReduceMin<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
cuda::AtomicMin(addr, val);
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return static_cast<DType>(val == accum);
}
};
template <typename DType>
struct ReduceProd<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
cuda::AtomicMul(addr, val);
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return accum / val;
}
};
template <typename DType>
struct ReduceNone<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
*addr = val;
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return 1;
}
};
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CUDA_FUNCTOR_CUH_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/utils.cu
* \brief Utility function implementations on CUDA
*/
#include "../../runtime/cuda/cuda_common.h"
#include "../utils.h"
namespace dgl {
namespace kernel {
namespace utils {
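// _FillKernel uses a grid-stride loop: each thread writes elements
// tx, tx + stride_x, tx + 2 * stride_x, ..., so a fixed-size grid covers a
// buffer of any length.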
template <typename DType>
__global__ void _FillKernel(DType* ptr, size_t length, DType val) {
int tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = gridDim.x * blockDim.x;
while (tx < length) {
ptr[tx] = val;
tx += stride_x;
}
}
template <int XPU, typename DType>
void Fill(const DLContext& ctx, DType* ptr, size_t length, DType val) {
auto* thr_entry = runtime::CUDAThreadEntry::ThreadLocal();
int nt = utils::FindNumThreads(length, 1024);
int nb = (length + nt - 1) / nt;
_FillKernel<<<nb, nt, 0, thr_entry->stream>>>(ptr, length, val);
}
template void Fill<kDLGPU, float>(const DLContext& ctx, float* ptr, size_t length, float val);
template void Fill<kDLGPU, double>(const DLContext& ctx, double* ptr, size_t length, double val);
} // namespace utils
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2018 by Contributors
* \file kernel/utils.cc
* \brief Kernel utilities
*/
#include <vector>
#include <string>
#include "./utils.h"
#include "./binary_reduce_common.h"
namespace dgl {
namespace kernel {
namespace utils {
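// FindNumThreads halves max_nthrs until it no longer exceeds dim. Assuming
// max_nthrs is a power of two (callers pass 1024), e.g. FindNumThreads(300,
// 1024) goes 1024 -> 512 -> 256 and returns 256.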
int FindNumThreads(int dim, int max_nthrs) {
int ret = max_nthrs;
while (ret > dim) {
ret = ret >> 1;
}
return ret;
}
int64_t ComputeXLength(runtime::NDArray feat_array) {
int64_t ret = 1;
for (int i = 1; i < feat_array->ndim; ++i) {
ret *= feat_array->shape[i];
}
return ret;
}
int64_t NElements(const runtime::NDArray& array) {
if (IsNoneArray(array)) {
return 0;
} else {
int64_t ret = 1;
for (int i = 0; i < array->ndim; ++i) {
ret *= array->shape[i];
}
return ret;
}
}
int64_t Prod(const std::vector<int64_t>& vec) {
int64_t ret = 1;
for (int64_t v : vec) {
ret *= v;
}
return ret;
}
} // namespace utils
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2018 by Contributors
* \file kernel/utils.h
* \brief Kernel utilities
*/
#ifndef DGL_KERNEL_UTILS_H_
#define DGL_KERNEL_UTILS_H_
#include <minigun/csr.h>
#include <dlpack/dlpack.h>
#include <dgl/runtime/ndarray.h>
#include <cstdlib>
#include <vector>
namespace dgl {
namespace kernel {
namespace utils {
/* !\brief Return an NDArray that represents a none value. */
inline runtime::NDArray NoneArray() {
return runtime::NDArray::Empty({}, DLDataType{kDLInt, 32, 1}, DLContext{kDLCPU, 0});
}
/* !\brief Return true if the NDArray is none. */
inline bool IsNoneArray(runtime::NDArray array) {
return array->ndim == 0;
}
/*
* !\brief Find a thread count that is no larger than both dim and max_nthrs
* and is also a power of two (assuming max_nthrs is).
*/
int FindNumThreads(int dim, int max_nthrs);
/*
* !\brief Compute the number of feature elements per row (the product of all
* dimensions except the first).
*/
int64_t ComputeXLength(runtime::NDArray feat_array);
/*
* !\brief Compute the total number of elements in the array.
*/
int64_t NElements(const runtime::NDArray& array);
/*
* !\brief Compute the product of the given vector.
*/
int64_t Prod(const std::vector<int64_t>& vec);
/*
* !\brief Fill the array with a constant value.
*/
template <int XPU, typename DType>
void Fill(const DLContext& ctx, DType* ptr, size_t length, DType val);
/*
* !\brief Create minigun CSR from two ndarrays.
*/
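// Note: the returned Csr aliases the NDArray buffers (no data is copied), so
// indptr and indices must remain alive for as long as the Csr is in use.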
template <typename Idx>
minigun::Csr<Idx> CreateCsr(runtime::NDArray indptr, runtime::NDArray indices) {
minigun::Csr<Idx> csr;
csr.row_offsets.data = static_cast<Idx*>(indptr->data);
csr.row_offsets.length = indptr->shape[0];
csr.column_indices.data = static_cast<Idx*>(indices->data);
csr.column_indices.length = indices->shape[0];
return csr;
}
} // namespace utils
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_UTILS_H_