Commit 653428bd authored by Lingfan Yu, committed by Minjie Wang

[Feature][Kernel] DGL kernel support (#596)

* [Kernel] Minigun integration and fused kernel support (#519)

* kernel interface

* add minigun

* Add cuda build

* functors

* working on binary elewise

* binary reduce

* change kernel interface

* WIP

* wip

* fix minigun

* compile

* binary reduce kernels

* compile

* simple test passed

* more reducers

* fix thrust problem

* fix cmake

* fix cmake; add proper guard for atomic

* WIP: bcast

* WIP

* bcast kernels

* update to new minigun pass-by-value practice

* broadcasting dim

* add copy src and copy edge

* fix linking

* fix none array problem

* fix copy edge

* add device_type and device_id to backend operator

* cache csr adj, remove cache for adjmat and incmat

* custom ops in backend and pytorch impl

* change dgl-mg kernel python interface

* add id_mapping var

* clean up plus v2e spmv schedule

* spmv schedule & clean up fall back

* symbolic message and reduce func, remove bundle func

* new executors

* new backend interface for dgl kernels and pytorch impl

* minor fix

* fix

* fix docstring, comments, func names

* nodeflow

* fix message id mapping and bugs...

* pytorch test case & fix

* backward binary reduce

* fix bug

* WIP: cusparse

* change to int32 csr for cusparse workaround

* disable cusparse

* change back to int64

* broadcasting backward

* cusparse; WIP: add rev_csr

* unit test for kernels

* pytorch backward with dgl kernel

* edge softmax

* fix backward

* improve softmax

* cache edge on device

* cache mappings on device

* fix partial forward code

* cusparse done

* copy_src_sum with cusparse

* rm id getter

* reduce grad for broadcast

* copy edge reduce backward

* kernel unit test for broadcasting

* full kernel unit test

* add cpu kernels

* edge softmax unit test

* missing ref

* fix compile and small bugs

* fix bug in bcast

* Add backward both

* fix torch utests

* expose infershape

* create out tensor in python

* fix c++ lint

* [Kernel] Add GPU utest and kernel utest (#524)

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* [Kernel] Update kernel branch (#550)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* Fixing typo in JTNN after interface change (#536)

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix lintr

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* [Kernel] Update kernel branch (#576)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* Fixing typo in JTNN after interface change (#536)

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix lintr

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* all demos use python-3 (#555)

* [DEMO] Reproduce numbers of distributed training in AMLC giant graph paper (#556)

* update

* update

* update

* update num_hops

* fix bug

* update

* report numbers of distributed training in AMLC giant graph paper

* [DEMO] Remove duplicate code for sampling (#557)

* update

* update

* re-use single-machine code

* update

* use relative path

* update

* update

* update

* add __init__.py

* add __init__.py

* import sys, os

* fix typo

* update

* [Perf] Improve performance of graph store. (#554)

* fix.

* use inplace.

* move to shared memory graph store.

* fix.

* add more unit tests.

* fix.

* fix test.

* fix test.

* disable test.

* fix.

* [BUGFIX] fix a bug in edge_ids (#560)

* add test.

* fix compute.

* fix test.

* turn on test.

* fix a bug.

* add test.

* fix.

* disable test.

* [DEMO] Add Pytorch demo for distributed sampler (#562)

* update

* update

* update

* add sender

* update

* remove duplicate code

* [Test] Add gtest to project (#547)

* add gtest module

* add gtest

* fix

* Update CMakeLists.txt

* Update README.md

* [Perf] lazily create msg_index. (#563)

* lazily create msg_index.

* update test.

* [BUGFIX] fix bugs for running GCN on giant graphs. (#561)

* load mxnet csr.

* enable load large csr.

* fix

* fix.

* fix int overflow.

* fix test.

* [BugFix] Fix error when bfs_level = 0 in Entity Classification with RGCN (#559)

* [DEMO] Update demo of distributed sampler (#564)

* update

* update

* update demo

* add network cpp test (#565)

* Add unittest for C++ RPC (#566)

* [CI] Fix CI for cpp test (#570)

* fix CI for cpp test

* update port number

* [Docker] update docker image (#575)

* update docker image

* specify lint version

* rm torch import from unified tests

* [Kernel][Scheduler][MXNet] Scheduler for DGL kernels and MXNet backend support (#541)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* edge softmax module

* WIP

* Fixing typo in JTNN after interface change (#536)

* mxnet backend support

* improve reduce grad

* add max to unittest backend

* fix kernel unittest

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* lint

* lint

* win build

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix lintr

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* try

* fix

* fix

* fix

* fix

* fix

* try

* test

* test

* test

* try

* try

* try

* test

* fix

* try gen_target

* fix gen_target

* fix msvc var_args expand issue

* fix

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* WIP

* WIP

* all demos use python-3 (#555)

* ToImmutable and CopyTo

* [DEMO] Reproduce numbers of distributed training in AMLC giant graph paper (#556)

* update

* update

* update

* update num_hops

* fix bug

* update

* report numbers of distributed training in AMLC giant graph paper

* [DEMO] Remove duplicate code for sampling (#557)

* update

* update

* re-use single-machine code

* update

* use relative path

* update

* update

* update

* add __init__.py

* add __init__.py

* import sys, os

* fix typo

* update

* [Perf] Improve performance of graph store. (#554)

* fix.

* use inplace.

* move to shared memory graph store.

* fix.

* add more unit tests.

* fix.

* fix test.

* fix test.

* disable test.

* fix.

* [BUGFIX] fix a bug in edge_ids (#560)

* add test.

* fix compute.

* fix test.

* turn on test.

* fix a bug.

* add test.

* fix.

* disable test.

* DGLRetValue DGLContext conversion

* [DEMO] Add Pytorch demo for distributed sampler (#562)

* update

* update

* update

* add sender

* update

* remove duplicate code

* [Test] Add gtest to project (#547)

* add gtest module

* add gtest

* fix

* Update CMakeLists.txt

* Update README.md

* Add support to convert immutable graph to 32 bits

* [Perf] lazily create msg_index. (#563)

* lazily create msg_index.

* update test.

* fix binary reduce following new minigun template

* enable both int64 and int32 kernels

* [BUGFIX] fix bugs for running GCN on giant graphs. (#561)

* load mxnet csr.

* enable load large csr.

* fix

* fix.

* fix int overflow.

* fix test.

* new kernel interface done for CPU

* docstring

* rename & docstring

* copy reduce and backward

* [BugFix] Fix error when bfs_level = 0 in Entity Classification with RGCN (#559)

* [DEMO] Update demo of distributed sampler (#564)

* update

* update

* update demo

* adapt cuda kernels to the new interface

* add network cpp test (#565)

* fix bug

* Add unittest for C++ RPC (#566)

* [CI] Fix CI for cpp test (#570)

* fix CI for cpp test

* update port number

* [Docker] update docker image (#575)

* update docker image

* specify lint version

* rm torch import from unified tests

* remove pytorch-specific test_function

* fix unittest

* fix

* fix unittest backend bug in converting tensor to numpy array

* fix

* mxnet version

* [BUGFIX] fix for MXNet 1.5. (#552)

* remove clone.

* turn on numpy compatible.

* Revert "remove clone."

This reverts commit 17bbf76ed72ff178df6b3f35addc428048672457.

* revert format changes

* fix mxnet api name

* revert mistakes in previous revert

* roll back CI to 20190523 build

* fix unittest

* disable test_shared_mem_store.py for now

* remove mxnet/test_specialization.py

* sync win64 test script

* fix lowercase

* missing backend in gpu unit test

* transpose to get forward graph

* pass update all

* add sanity check

* passing test_specialization.py

* fix and pass test_function

* fix check

* fix pytorch softmax

* mxnet kernels

* c++ lint

* pylint

* try

* win build

* fix

* win

* ci enable gpu build

* init submodule recursively

* backend docstring

* try

* test win dev

* doc string

* disable pytorch test_nn

* try to fix windows issue

* bug fixed, revert changes

* [Test] fix CI. (#586)

* disable unit test in mxnet tutorial.

* retry socket connection.

* roll back to set_np_compat

* try to fix multi-processing test hangs when it fails.

* fix test.

* fix.

* doc string

* doc string and clean up

* missing field in ctypes

* fix node flow schedule and unit test

* rename

* pylint

* copy from parent default context

* fix unit test script

* fix

* demo bug in nodeflow gpu test

* [Kernel][Bugfix] fix nodeflow bug (#604)

* fix nodeflow bug

* remove debug code

* add build gtest option

* fix cmake; fix graph index bug in spmv.py

* remove clone

* fix div rhs grad bug

* [Kernel] Support full builtin method, edge softmax and unit tests (#605)

* add full builtin support

* unit test

* unit test backend

* edge softmax

* apply edge with builtin

* fix kernel unit test

* disable mxnet test_shared_mem_store

* gen builtin reduce

* enable mxnet gpu unittest

* revert some changes

* docstring

* add note for the hack

* [Kernel][Unittest][CI] Fix MXNet GPU CI (#607)

* update docker image for MXNet GPU CI

* force all dgl graph input and output on CPU

* fix gpu unittest

* speedup compilation

* add some comments

* lint

* add more comments

* fix as requested

* add some comments

* comment

* lint

* lint

* update pylint

* fix as requested

* lint

* lint

* lint

* docstrings of python DGL kernel entries

* disable lint warnings on arguments in kernel.py

* fix docstring in scheduler

* fix some bug in unittest; try again

* Revert "Merge branch 'kernel' of github.com:zzhang-cn/dgl into kernel"

This reverts commit 1d2299e68b004182ea6130b088de1f1122b18a49, reversing
changes made to ddc97fbf1bec2b7815c0da7c74f7ecb2f428889b.

* Revert "fix some bug in unittest; try again"

This reverts commit ddc97fbf1bec2b7815c0da7c74f7ecb2f428889b.

* more comprehensive kernel test

* remove shape check in test_specialization
parent da0c92a2
@@ -83,9 +83,10 @@ DGL_REGISTER_GLOBAL("nodeflow._CAPI_NodeFlowGetBlockAdj")
     int64_t layer0_size = args[2];
     int64_t start = args[3];
     int64_t end = args[4];
+    const bool remap = args[5];
     const GraphInterface *ptr = static_cast<const GraphInterface *>(ghandle);
     const ImmutableGraph* gptr = dynamic_cast<const ImmutableGraph*>(ptr);
-    auto res = GetNodeFlowSlice(*gptr, format, layer0_size, start, end, true);
+    auto res = GetNodeFlowSlice(*gptr, format, layer0_size, start, end, remap);
     *rv = ConvertNDArrayVectorToPackedFunc(res);
   });
...
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/binary_reduce.cc
* \brief Binary reduce C APIs and definitions.
*/
#include "./binary_reduce.h"
#include "./common.h"
#include "./binary_reduce_impl_decl.h"
#include "./utils.h"
#include "../c_api_common.h"
using dgl::runtime::DGLArgs;
using dgl::runtime::DGLArgValue;
using dgl::runtime::DGLRetValue;
using dgl::runtime::PackedFunc;
using dgl::runtime::NDArray;
namespace dgl {
namespace kernel {
namespace {
// convert ndarray shape to string
std::string ShapeString(NDArray nd) {
std::ostringstream oss;
oss << "(";
for (int i = 1; i < nd->ndim; ++i) {
oss << nd->shape[i];
if (i != nd->ndim - 1) {
oss << ",";
}
}
oss << ")";
return oss.str();
}
// compute stride vector given shape; assume row-major storage
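// e.g. shape (4, 1, 9) -> stride (9, 9, 1); shape (3, 4) -> stride (4, 1)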
std::vector<int64_t> ComputeStride(const std::vector<int64_t>& shape) {
std::vector<int64_t> ret(shape.size(), 1);
for (int i = shape.size() - 2; i >= 0; --i) {
ret[i] = ret[i+1] * shape[i+1];
}
return ret;
}
// Return true if the feature shapes of the two ndarrays can be
// computed element-wisely *without* broadcasting.
// Examples:
//
// valid:
// lhs.shape = (N, D1, D2)
// rhs.shape = (M, D1, D2) # the first dimension could be different
//
// invalid:
// lhs.shape = (N, D1, D2)
// rhs.shape = (M, D1)
bool IsValidBinaryOpShape(NDArray lhs, NDArray rhs) {
if (lhs->ndim != rhs->ndim) {
return false;
}
for (int i = 1; i < lhs->ndim; ++i) {
if (lhs->shape[i] != rhs->shape[i]) {
return false;
}
}
return true;
}
// Return true if broadcasting might be required to compute the element-wise
// operation between the features of the two ndarrays.
// The broadcasting semantic strictly follows numpy.
// Note that the function could return true for invalid element-wise shapes
// (e.g. lhs.shape = (N, 3), rhs.shape = (N, 5)). This is fine since
// ``CalcBcastInfo`` will handle that.
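// e.g. lhs.shape = (N, 3) vs rhs.shape = (M, 1, 3) -> true (ndim differs);
// lhs.shape = (N, 3) vs rhs.shape = (M, 3) -> false.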
bool HasBcast(NDArray lhs, NDArray rhs) {
if (lhs->ndim != rhs->ndim) {
return true;
}
for (int i = 1; i < lhs->ndim; ++i) {
if (lhs->shape[i] != rhs->shape[i]) {
return true;
}
}
return false;
}
// Compute auxiliary information of broadcasting dimensions.
// The function preprocesses the feature shapes so that:
// - The first dimension (for graph) is removed.
// - Feature dimensions are aligned.
// e.g. (4,) and (3, 4) become (1, 4) and (3, 4)
// - Contiguous non-broadcasting dimensions are flattened to reduce the number of
// integers used to represent the feature shape.
// e.g. (4, 1, 3, 3) and (4, 5, 3, 3) become (4, 1, 9) and (4, 5, 9)
//
// See also: BcastInfo (kernel/binary_reduce.h)
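// Worked example: lhs of shape (N, 4, 1, 3, 3) and rhs of shape (M, 4, 5, 3, 3)
// yield
//   lhs_shape = (4, 1, 9), lhs_stride = (9, 9, 1)
//   rhs_shape = (4, 5, 9), rhs_stride = (45, 9, 1)
//   out_shape = (4, 5, 9), out_stride = (45, 9, 1)
//   real_out_shape = (4, 5, 3, 3)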
BcastInfo CalcBcastInfo(NDArray lhs, NDArray rhs) {
BcastInfo ret;
const int max_ndim = std::max(lhs->ndim, rhs->ndim) - 1;
int64_t accum = 0;
for (int j = 0; j < max_ndim; ++j) {
const int dl = (lhs->ndim - 1 - j < 1)? 1 : lhs->shape[lhs->ndim - 1 - j];
const int dr = (rhs->ndim - 1 - j < 1)? 1 : rhs->shape[rhs->ndim - 1 - j];
if (dl != dr) {
if (dl != 1 && dr != 1) {
LOG(FATAL) << "Invalid broadcasting between feature shapes "
<< ShapeString(lhs) << " and " << ShapeString(rhs);
}
if (accum != 0) {
ret.lhs_shape.push_back(accum);
ret.rhs_shape.push_back(accum);
ret.out_shape.push_back(accum);
accum = 0;
}
ret.lhs_shape.push_back(dl);
ret.rhs_shape.push_back(dr);
ret.out_shape.push_back(std::max(dl, dr));
} else {
if (accum == 0) {
accum = dl;
} else {
accum *= dl;
}
}
ret.real_out_shape.push_back(std::max(dl, dr));
}
if (accum != 0) {
ret.lhs_shape.push_back(accum);
ret.rhs_shape.push_back(accum);
ret.out_shape.push_back(accum);
accum = 0;
}
std::reverse(ret.real_out_shape.begin(), ret.real_out_shape.end());
std::reverse(ret.lhs_shape.begin(), ret.lhs_shape.end());
std::reverse(ret.rhs_shape.begin(), ret.rhs_shape.end());
std::reverse(ret.out_shape.begin(), ret.out_shape.end());
// stride
ret.lhs_stride = ComputeStride(ret.lhs_shape);
ret.rhs_stride = ComputeStride(ret.rhs_shape);
ret.out_stride = ComputeStride(ret.out_shape);
return ret;
}
// Function to convert an idarray to string
std::string IdArrayToStr(IdArray arr) {
arr = arr.CopyTo(DLContext{kDLCPU, 0});
int64_t len = arr->shape[0];
std::ostringstream oss;
oss << "(" << len << ")[";
if (arr->dtype.bits == 32) {
int32_t* data = static_cast<int32_t*>(arr->data);
for (int64_t i = 0; i < len; ++i) {
oss << data[i] << " ";
}
} else {
int64_t* data = static_cast<int64_t*>(arr->data);
for (int64_t i = 0; i < len; ++i) {
oss << data[i] << " ";
}
}
oss << "]";
return oss.str();
}
// Check whether the given arguments have the same context.
inline void CheckCtx(
const DLContext& ctx,
const std::vector<NDArray>& arrays,
const std::vector<std::string>& names) {
for (size_t i = 0; i < arrays.size(); ++i) {
if (utils::IsNoneArray(arrays[i]))
continue;
CHECK_EQ(ctx, arrays[i]->ctx)
<< "Expected device context " << ctx << ". But got "
<< arrays[i]->ctx << " for " << names[i] << ".";
}
}
// Check whether the given arguments use the same number of bits.
inline void CheckIdArray(
const uint8_t bits,
const std::vector<NDArray>& arrays,
const std::vector<std::string>& names) {
for (size_t i = 0; i < arrays.size(); ++i) {
if (utils::IsNoneArray(arrays[i]))
continue;
CHECK(arrays[i]->dtype.code == kDLInt);
CHECK_EQ(arrays[i]->ndim, 1);
CHECK_EQ(bits, arrays[i]->dtype.bits)
<< "Expected " << bits << " integer array. But got "
<< arrays[i]->dtype.bits << " for " << names[i] << ".";
}
}
// Return true if the operator is commutative and lhs and rhs need
// to be switched. For example, Add(kDst, kSrc) needs to be changed
// to Add(kSrc, kDst).
// This is because we only generate kernels for
// Add(kSrc, kDst), Add(kSrc, kEdge), Add(kDst, kEdge)
// to save compilation time.
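// e.g. NeedSwitchOrder("add", kDst, kSrc) is true (kDst > kSrc), so the caller
// re-invokes itself with lhs/rhs data and mappings swapped.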
inline bool NeedSwitchOrder(const std::string& op,
binary_op::Target lhs, binary_op::Target rhs) {
CHECK_NE(lhs, rhs);
return (op == binary_op::kAdd || op == binary_op::kMul)
&& lhs > rhs;
}
} // namespace
std::vector<int64_t> InferBinaryFeatureShape(
NDArray lhs,
NDArray rhs) {
return CalcBcastInfo(lhs, rhs).real_out_shape;
}
DGL_REGISTER_GLOBAL("kernel._CAPI_DGLKernelInferBinaryFeatureShape")
.set_body([] (DGLArgs args, DGLRetValue* rv) {
NDArray lhs = args[0];
NDArray rhs = args[1];
const auto& shape = InferBinaryFeatureShape(lhs, rhs);
const int64_t len = shape.size();
NDArray ret = NDArray::Empty(
{len}, DLDataType{kDLInt, 64, 1}, DLContext{kDLCPU, 0});
int64_t* ret_data = static_cast<int64_t*>(ret->data);
std::copy(shape.begin(), shape.end(), ret_data);
*rv = ret;
});
void BinaryOpReduce(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
NDArray lhs_data, NDArray rhs_data,
NDArray out_data,
NDArray lhs_mapping, NDArray rhs_mapping,
NDArray out_mapping) {
const auto& ctx = graph->Context();
// sanity check
CheckCtx(ctx,
{lhs_data, rhs_data, out_data, lhs_mapping, rhs_mapping, out_mapping},
{"lhs_data", "rhs_data", "out_data", "lhs_mapping", "rhs_mapping", "out_mapping"});
CheckIdArray(graph->NumBits(),
{lhs_mapping, rhs_mapping, out_mapping},
{"lhs_mapping", "rhs_mapping", "out_mapping"});
// Switch order for commutative operation
if (NeedSwitchOrder(op, lhs, rhs)) {
BinaryOpReduce(reducer, op, graph,
rhs, lhs, rhs_data, lhs_data, out_data,
rhs_mapping, lhs_mapping, out_mapping);
} else {
if (HasBcast(lhs_data, rhs_data)) {
BcastInfo info = CalcBcastInfo(lhs_data, rhs_data);
DGL_XPU_SWITCH(ctx.device_type, BinaryReduceBcastImpl,
info, reducer, op, graph,
lhs, rhs,
lhs_data, rhs_data, out_data,
lhs_mapping, rhs_mapping, out_mapping);
} else {
CHECK(IsValidBinaryOpShape(lhs_data, rhs_data))
<< "Cannot compute binary operation between feature shapes "
<< ShapeString(lhs_data) << " and " << ShapeString(rhs_data);
DGL_XPU_SWITCH(ctx.device_type, BinaryReduceImpl,
reducer, op, graph,
lhs, rhs,
lhs_data, rhs_data, out_data,
lhs_mapping, rhs_mapping, out_mapping);
}
}
}
DGL_REGISTER_GLOBAL("kernel._CAPI_DGLKernelBinaryOpReduce")
.set_body([] (DGLArgs args, DGLRetValue* rv) {
std::string reducer = args[0];
std::string op = args[1];
GraphHandle ghdl = args[2];
int lhs = args[3];
int rhs = args[4];
NDArray lhs_data = args[5];
NDArray rhs_data = args[6];
NDArray out_data = args[7];
NDArray lhs_mapping = args[8];
NDArray rhs_mapping = args[9];
NDArray out_mapping = args[10];
GraphInterface* gptr = static_cast<GraphInterface*>(ghdl);
const ImmutableGraph* igptr = dynamic_cast<ImmutableGraph*>(gptr);
CHECK(igptr) << "Invalid graph object argument. Must be an immutable graph.";
BinaryOpReduce(reducer, op, igptr,
static_cast<binary_op::Target>(lhs), static_cast<binary_op::Target>(rhs),
lhs_data, rhs_data, out_data,
lhs_mapping, rhs_mapping, out_mapping);
});
void BackwardLhsBinaryOpReduce(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
NDArray lhs_mapping,
NDArray rhs_mapping,
NDArray out_mapping,
NDArray lhs_data,
NDArray rhs_data,
NDArray out_data,
NDArray grad_out_data,
NDArray grad_lhs_data) {
const auto& ctx = graph->Context();
// sanity check
CheckCtx(ctx,
{lhs_data, rhs_data, out_data, grad_out_data, grad_lhs_data,
lhs_mapping, rhs_mapping, out_mapping},
{"lhs_data", "rhs_data", "out_data", "grad_out_data", "grad_lhs_data",
"lhs_mapping", "rhs_mapping", "out_mapping"});
CheckIdArray(graph->NumBits(),
{lhs_mapping, rhs_mapping, out_mapping},
{"lhs_mapping", "rhs_mapping", "out_mapping"});
// Switch order for commutative operation
if (NeedSwitchOrder(op, lhs, rhs)) {
BackwardRhsBinaryOpReduce(reducer, op, graph,
rhs, lhs,
rhs_mapping, lhs_mapping, out_mapping,
rhs_data, lhs_data, out_data,
grad_out_data, grad_lhs_data);
} else {
if (HasBcast(lhs_data, rhs_data)) {
BcastInfo info = CalcBcastInfo(lhs_data, rhs_data);
DGL_XPU_SWITCH(ctx.device_type, BackwardBinaryReduceBcastImpl,
info, reducer, op, graph,
lhs, rhs,
lhs_mapping, rhs_mapping, out_mapping,
lhs_data, rhs_data, out_data, grad_out_data,
grad_lhs_data, utils::NoneArray());
} else {
DGL_XPU_SWITCH(ctx.device_type, BackwardBinaryReduceImpl,
reducer, op, graph,
lhs, rhs,
lhs_mapping, rhs_mapping, out_mapping,
lhs_data, rhs_data, out_data, grad_out_data,
grad_lhs_data, utils::NoneArray());
}
}
}
DGL_REGISTER_GLOBAL("kernel._CAPI_DGLKernelBackwardLhsBinaryOpReduce")
.set_body([] (DGLArgs args, DGLRetValue* rv) {
std::string reducer = args[0];
std::string op = args[1];
GraphHandle ghdl = args[2];
int lhs = args[3];
int rhs = args[4];
NDArray lhs_mapping = args[5];
NDArray rhs_mapping = args[6];
NDArray out_mapping = args[7];
NDArray lhs_data = args[8];
NDArray rhs_data = args[9];
NDArray out_data = args[10];
NDArray grad_out_data = args[11];
NDArray grad_lhs_data = args[12];
GraphInterface* gptr = static_cast<GraphInterface*>(ghdl);
const ImmutableGraph* igptr = dynamic_cast<ImmutableGraph*>(gptr);
CHECK(igptr) << "Invalid graph object argument. Must be an immutable graph.";
BackwardLhsBinaryOpReduce(
reducer, op, igptr,
static_cast<binary_op::Target>(lhs), static_cast<binary_op::Target>(rhs),
lhs_mapping, rhs_mapping, out_mapping,
lhs_data, rhs_data, out_data, grad_out_data,
grad_lhs_data);
});
void BackwardRhsBinaryOpReduce(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
NDArray lhs_mapping,
NDArray rhs_mapping,
NDArray out_mapping,
NDArray lhs_data,
NDArray rhs_data,
NDArray out_data,
NDArray grad_out_data,
NDArray grad_rhs_data) {
const auto& ctx = graph->Context();
// sanity check
CheckCtx(ctx,
{lhs_data, rhs_data, out_data, grad_out_data, grad_rhs_data,
lhs_mapping, rhs_mapping, out_mapping},
{"lhs_data", "rhs_data", "out_data", "grad_out_data", "grad_rhs_data",
"lhs_mapping", "rhs_mapping", "out_mapping"});
CheckIdArray(graph->NumBits(),
{lhs_mapping, rhs_mapping, out_mapping},
{"lhs_mapping", "rhs_mapping", "out_mapping"});
if (NeedSwitchOrder(op, lhs, rhs)) {
BackwardLhsBinaryOpReduce(reducer, op, graph,
rhs, lhs,
rhs_mapping, lhs_mapping, out_mapping,
rhs_data, lhs_data, out_data,
grad_out_data, grad_rhs_data);
} else {
if (HasBcast(lhs_data, rhs_data)) {
BcastInfo info = CalcBcastInfo(lhs_data, rhs_data);
DGL_XPU_SWITCH(ctx.device_type, BackwardBinaryReduceBcastImpl,
info, reducer, op, graph,
lhs, rhs,
lhs_mapping, rhs_mapping, out_mapping,
lhs_data, rhs_data, out_data, grad_out_data,
utils::NoneArray(), grad_rhs_data);
} else {
DGL_XPU_SWITCH(ctx.device_type, BackwardBinaryReduceImpl,
reducer, op, graph,
lhs, rhs,
lhs_mapping, rhs_mapping, out_mapping,
lhs_data, rhs_data, out_data, grad_out_data,
utils::NoneArray(), grad_rhs_data);
}
}
}
DGL_REGISTER_GLOBAL("kernel._CAPI_DGLKernelBackwardRhsBinaryOpReduce")
.set_body([] (DGLArgs args, DGLRetValue* rv) {
std::string reducer = args[0];
std::string op = args[1];
GraphHandle ghdl = args[2];
int lhs = args[3];
int rhs = args[4];
NDArray lhs_mapping = args[5];
NDArray rhs_mapping = args[6];
NDArray out_mapping = args[7];
NDArray lhs_data = args[8];
NDArray rhs_data = args[9];
NDArray out_data = args[10];
NDArray grad_out_data = args[11];
NDArray grad_rhs_data = args[12];
GraphInterface* gptr = static_cast<GraphInterface*>(ghdl);
const ImmutableGraph* igptr = dynamic_cast<ImmutableGraph*>(gptr);
CHECK(igptr) << "Invalid graph object argument. Must be an immutable graph.";
BackwardRhsBinaryOpReduce(
reducer, op, igptr,
static_cast<binary_op::Target>(lhs), static_cast<binary_op::Target>(rhs),
lhs_mapping, rhs_mapping, out_mapping,
lhs_data, rhs_data, out_data, grad_out_data,
grad_rhs_data);
});
void CopyReduce(
const std::string& reducer,
const ImmutableGraph* graph,
binary_op::Target target,
NDArray in_data, NDArray out_data,
NDArray in_mapping, NDArray out_mapping) {
const auto& ctx = graph->Context();
// sanity check
CheckCtx(ctx,
{in_data, out_data, in_mapping, out_mapping},
{"in_data", "out_data", "in_mapping", "out_mapping"});
CheckIdArray(graph->NumBits(),
{in_mapping, out_mapping},
{"in_mapping", "out_mapping"});
DGL_XPU_SWITCH(ctx.device_type, BinaryReduceImpl,
reducer, binary_op::kUseLhs, graph,
target, binary_op::kNone,
in_data, utils::NoneArray(), out_data,
in_mapping, utils::NoneArray(), out_mapping);
}
DGL_REGISTER_GLOBAL("kernel._CAPI_DGLKernelCopyReduce")
.set_body([] (DGLArgs args, DGLRetValue* rv) {
std::string reducer = args[0];
GraphHandle ghdl = args[1];
int target = args[2];
NDArray in_data = args[3];
NDArray out_data = args[4];
NDArray in_mapping = args[5];
NDArray out_mapping = args[6];
GraphInterface* gptr = static_cast<GraphInterface*>(ghdl);
const ImmutableGraph* igptr = dynamic_cast<ImmutableGraph*>(gptr);
CHECK(igptr) << "Invalid graph object argument. Must be an immutable graph.";
CopyReduce(reducer, igptr,
static_cast<binary_op::Target>(target),
in_data, out_data,
in_mapping, out_mapping);
});
void BackwardCopyReduce(
const std::string& reducer,
const ImmutableGraph* graph,
binary_op::Target target,
NDArray in_mapping,
NDArray out_mapping,
NDArray in_data,
NDArray out_data,
NDArray grad_out_data,
NDArray grad_in_data) {
const auto& ctx = graph->Context();
// sanity check
CheckCtx(ctx,
{in_data, out_data, grad_out_data, grad_in_data, in_mapping, out_mapping},
{"in_data", "out_data", "grad_out_data", "grad_in_data", "in_mapping", "out_mapping"});
CheckIdArray(graph->NumBits(),
{in_mapping, out_mapping},
{"in_mapping", "out_mapping"});
if (!utils::IsNoneArray(out_mapping)) {
CHECK_EQ(ctx, out_mapping->ctx) << "Expected device context " << ctx
<< ". But got " << out_mapping->ctx << " for rhs_data.";
}
DGL_XPU_SWITCH(ctx.device_type, BackwardBinaryReduceImpl,
reducer, binary_op::kUseLhs, graph,
target, binary_op::kNone,
in_mapping, utils::NoneArray(), out_mapping,
in_data, utils::NoneArray(), out_data, grad_out_data,
grad_in_data, utils::NoneArray());
}
DGL_REGISTER_GLOBAL("kernel._CAPI_DGLKernelBackwardCopyReduce")
.set_body([] (DGLArgs args, DGLRetValue* rv) {
std::string reducer = args[0];
GraphHandle ghdl = args[1];
int target = args[2];
NDArray in_data = args[3];
NDArray out_data = args[4];
NDArray grad_out_data = args[5];
NDArray grad_in_data = args[6];
NDArray in_mapping = args[7];
NDArray out_mapping = args[8];
GraphInterface* gptr = static_cast<GraphInterface*>(ghdl);
const ImmutableGraph* igptr = dynamic_cast<ImmutableGraph*>(gptr);
CHECK(igptr) << "Invalid graph object argument. Must be an immutable graph.";
BackwardCopyReduce(
reducer, igptr, static_cast<binary_op::Target>(target),
in_mapping, out_mapping,
in_data, out_data, grad_out_data,
grad_in_data);
});
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/binary_reduce.h
* \brief Binary reduce function C++ header.
*/
#ifndef DGL_KERNEL_BINARY_REDUCE_H_
#define DGL_KERNEL_BINARY_REDUCE_H_
#include <dgl/runtime/ndarray.h>
#include <dgl/immutable_graph.h>
#include <vector>
#include <string>
#include "./binary_reduce_common.h"
namespace dgl {
namespace kernel {
// Structure for broadcasting shapes
struct BcastInfo {
// inferred output shape
std::vector<int64_t> real_out_shape;
// Following shapes here have been preprocessed, so that:
// - The first dimension (for graph) is removed. Shapes here are only for features.
// - They have the same number of dimensions.
// e.g. (4,) and (3, 4) become (1, 4) and (3, 4)
// - Contiguous non-broadcasting dimensions are flattened.
// e.g. (4, 1, 3, 3) and (4, 5, 3, 3) become (4, 1, 9) and (4, 5, 9)
std::vector<int64_t> lhs_shape, lhs_stride;
std::vector<int64_t> rhs_shape, rhs_stride;
std::vector<int64_t> out_shape, out_stride;
};
/*!
* \brief Compute the feature shape after binary reduce computation.
*/
std::vector<int64_t> InferBinaryFeatureShape(
runtime::NDArray lhs,
runtime::NDArray rhs);
/*!
* \brief Perform binary operation between the given data and reduce by the graph.
*
* If the reducer is one of "sum", "max", "min", "prod", the operator computes,
* for each node i,
*
* out[i] = Sigma_{j\in Neighbor(i)} ( A[s1(i, j, e)] op B[s2(i, j, e)] )
*
* , where A and B are the two input feature tensors and op can be element-wise add/sub/mul/div.
* Depending on the lhs and rhs target, s1 and s2 will select the src/dst/edge
* ids of each neighbor.
*
* If the reducer is "none", the operator computes, for each edge e,
*
* out[e] = A[s1(i, j, e)] op B[s2(i, j, e)]
*
* Here, each node/edge feature (e.g., A[i], B[e]) can be a dense tensor. In such
* case, broadcasting is supported on the feature dimensions.
*
* Examples:
*
* A.shape = (N, D1, D2) # N is the number of nodes
* B.shape = (M, D1, 1) # M is the number of edges
* C = BinaryOpReduce("sum", "add", graph, A, B, ...)
* C.shape = (N, D1, D2)
*
* \param reducer The type of the reducer ("sum", "max", "prod", "min", "none").
* If the reducer is "none", the output is an edge feature tensor.
* Otherwise, a node feature tensor is returned.
* \param op The type of the binary operator ("mul", "add").
* \param graph The graph object.
* \param lhs The lhs target (src, dst, edge)
* \param rhs The rhs target (src, dst, edge)
* \param lhs_data The lhs feature tensor.
* \param rhs_data The rhs feature tensor.
* \param out_data The output tensor. Could be either node or edge feature
* tensor depending on the reducer.
* \param lhs_mapping An optional int64 id mapping array.
* \param rhs_mapping An optional int64 id mapping array.
* \param out_mapping An optional int64 id mapping array.
*/
void BinaryOpReduce(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping);
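/*
* A minimal call sketch (illustrative only; g, u_feat, e_feat, out, D and
* num_dst_nodes are assumed names): compute, for every destination node, the
* sum over its incoming edges of src_feature * edge_feature. All tensors are
* expected on the graph's device; utils::NoneArray() stands for "no mapping".
*
*   NDArray out = NDArray::Empty({num_dst_nodes, D}, u_feat->dtype, g->Context());
*   BinaryOpReduce("sum", "mul", g,
*                  binary_op::kSrc, binary_op::kEdge,
*                  u_feat, e_feat, out,
*                  utils::NoneArray(), utils::NoneArray(), utils::NoneArray());
*/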
/*!
* \brief Compute the lhs gradient of BinaryOpReduce
*
* Broadcasting along feature dimensions is supported. However, the gradient
* of the broadcast dimensions will *not* be reduced. Therefore, the
* gradient tensor has the same shape as the out tensor.
*
* Examples:
* A.shape = (N, D1, 1) # N is the number of nodes
* B.shape = (M, D1, D2) # M is the number of edges
* C = BinaryOpReduce("sum", "add", graph, A, B, ...)
* C.shape = (N, D1, D2)
* dC.shape = (N, D1, D2)
* dA = BackwardLhsBinaryOpReduce("sum", "add", graph, A, B, C, dC, ...)
* dA.shape = (N, D1, D2) # extra reduction should be handled afterwards
*
* \param reducer The type of the reducer ("sum", "max", "prod", "min", "none").
* If the reducer is "none", the output is an edge feature tensor.
* Otherwise, a node feature tensor is returned.
* \param op The type of the binary operator ("mul", "add").
* \param graph The graph object.
* \param lhs The lhs target (src, dst, edge)
* \param rhs The rhs target (src, dst, edge)
* \param lhs_mapping An optional int64 id mapping array.
* \param rhs_mapping An optional int64 id mapping array.
* \param out_mapping An optional int64 id mapping array.
* \param lhs_data The lhs feature tensor.
* \param rhs_data The rhs feature tensor.
* \param out_data The output tensor. Could be either node or edge feature
* tensor depending on the reducer.
* \param grad_out_data The gradient output tensor.
* \param grad_lhs_data The gradient lhs tensor.
*/
void BackwardLhsBinaryOpReduce(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_mapping,
runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping,
runtime::NDArray lhs_data,
runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray grad_out_data,
runtime::NDArray grad_lhs_data);
/*!
* \brief Compute the rhs gradient of BinaryOpReduce
*
* Broadcasting along feature dimensions is supported. However, the gradient
* of the broadcast dimensions will *not* be reduced. Therefore, the
* gradient tensor has the same shape as the out tensor.
*
* Examples:
* A.shape = (N, D1, D2) # N is the number of nodes
* B.shape = (M, D1, 1) # M is the number of edges
* C = BinaryOpReduce("sum", "add", graph, A, B, ...)
* C.shape = (N, D1, D2)
* dC.shape = (N, D1, D2)
* dB = BackwardRhsBinaryOpReduce("sum", "add", graph, A, B, C, dC, ...)
* dB.shape = (N, D1, D2) # extra reduction should be handled afterwards
*
* \param reducer The type of the reducer ("sum", "max", "prod", "min", "none").
* If the reducer is "none", the output is an edge feature tensor.
* Otherwise, a node feature tensor is returned.
* \param op The type of the binary operator ("mul", "add").
* \param graph The graph object.
* \param lhs The lhs target (src, dst, edge)
* \param rhs The rhs target (src, dst, edge)
* \param lhs_mapping An optional int64 id mapping array.
* \param rhs_mapping An optional int64 id mapping array.
* \param out_mapping An optional int64 id mapping array.
* \param lhs_data The lhs feature tensor.
* \param rhs_data The rhs feature tensor.
* \param out_data The output tensor. Could be either node or edge feature
* tensor depending on the reducer.
* \param grad_out_data The gradient output tensor.
* \param grad_rhs_data The gradient rhs tensor.
*/
void BackwardRhsBinaryOpReduce(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_mapping,
runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping,
runtime::NDArray lhs_data,
runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray grad_out_data,
runtime::NDArray grad_rhs_data);
/*!
* \brief Copy the target data and reduce by graph structure.
*
* If the reducer is one of "sum", "max", "min", "prod", the operator computes,
* for each node i,
*
* out[i] = Sigma_{j\in Neighbor(i)} A[s1(i, j, e)]
*
* , where A is the input feature tensor. Depending on the target, s1 selects
* the src/edge id of each neighbor.
*
* If the reducer is "none", the operator computes, for each edge e,
*
* out[e] = A[s1(i, j, e)]
*
* \param reducer The type of the reducer ("sum", "max", "prod", "min", "none").
* If the reducer is "none", the output is an edge feature tensor.
* Otherwise, a node feature tensor is returned.
* \param graph The graph object.
* \param target The input target (src, edge)
* \param in_data The input feature tensor.
* \param out_data The output tensor. Could be either node or edge feature
* tensor depending on the reducer.
* \param in_mapping An optional int64 id mapping array.
* \param out_mapping An optional int64 id mapping array.
*/
void CopyReduce(
const std::string& reducer,
const ImmutableGraph* graph,
binary_op::Target target,
runtime::NDArray in_data, runtime::NDArray out_data,
runtime::NDArray in_mapping, runtime::NDArray out_mapping);
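/*
* A minimal call sketch (illustrative only; g, src_feat and out are assumed
* names): sum the source-node features over each node's incoming edges,
* i.e. the copy_src + sum (SpMV-style) pattern.
*
*   CopyReduce("sum", g, binary_op::kSrc, src_feat, out,
*              utils::NoneArray(), utils::NoneArray());
*/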
/*!
* \brief Compute backward of the CopyReduce
*
* \param reducer The type of the reducer ("sum", "max", "prod", "min", "none").
* If the reducer is "none", the output is an edge feature tensor.
* Otherwise, a node feature tensor is returned.
* \param graph The graph object.
* \param target The input target (src, edge)
* \param in_mapping An optional int64 id mapping array.
* \param out_mapping An optional int64 id mapping array.
* \param in_data The input feature tensor.
* \param out_data The output tensor. Could be either node or edge feature
* tensor depending on the reducer.
* \param grad_out_data The gradient output tensor.
* \param grad_in_data The gradient input tensor.
*/
void BackwardCopyReduce(
const std::string& reducer,
const ImmutableGraph* graph,
binary_op::Target target,
runtime::NDArray in_mapping,
runtime::NDArray out_mapping,
runtime::NDArray in_data,
runtime::NDArray out_data,
runtime::NDArray grad_out_data,
runtime::NDArray grad_in_data);
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_BINARY_REDUCE_H_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/binary_reduce_common.h
* \brief Common utilities for binary reduce operation.
*/
#ifndef DGL_KERNEL_BINARY_REDUCE_COMMON_H_
#define DGL_KERNEL_BINARY_REDUCE_COMMON_H_
#include <dgl/runtime/ndarray.h>
#include <limits>
#include <string>
#include "./common.h"
namespace dgl {
namespace kernel {
namespace binary_op {
/*! \brief Reducer names. */
static const char kReduceSum[] = "sum";
static const char kReduceMax[] = "max";
static const char kReduceMin[] = "min";
static const char kReduceMean[] = "mean";
static const char kReduceProd[] = "prod";
static const char kReduceNone[] = "none";
/*! \brief Binary op names. */
static const char kAdd[] = "add";
static const char kSub[] = "sub";
static const char kMul[] = "mul";
static const char kDiv[] = "div";
static const char kUseLhs[] = "use_lhs";
/*!
* \brief Enum code for operand targets.
* \seealso BinaryOpReduce in binary_reduce.h
*/
enum Target {
kSrc = 0, // select src node
kDst, // select dst node
kEdge, // select edge
kNone, // select none
};
/*! \brief Enum code for backward operator mode. */
enum BackwardMode {
kGradLhs = 0, // compute lhs gradient
kGradRhs, // compute rhs gradient
kGradBoth, // compute both gradients
};
} // namespace binary_op
//////////////////////////////////////////////////////////////////////////
// Defines operand target category. Each category is a structure with
// two static members:
// - target: The enum code of this category.
// - Call: The call functor that returns the selected target.
//////////////////////////////////////////////////////////////////////////
/*! \brief Select src category. */
struct SelectSrc {
// Target value
static constexpr binary_op::Target target = binary_op::kSrc;
// Call functor.
template <typename T>
static DGLDEVICE DGLINLINE T Call(T src, T edge, T dst) {
return src;
}
};
/*! \brief Select dst category. */
struct SelectDst {
// Target value
static constexpr binary_op::Target target = binary_op::kDst;
// Call functor.
template <typename T>
static DGLDEVICE DGLINLINE T Call(T src, T edge, T dst) {
return dst;
}
};
/*! \brief Select edge category. */
struct SelectEdge {
// Target value
static constexpr binary_op::Target target = binary_op::kEdge;
// Call functor.
template <typename T>
static DGLDEVICE DGLINLINE T Call(T src, T edge, T dst) {
return edge;
}
};
/*! \brief Select none category. */
struct SelectNone {
// Target value
static constexpr binary_op::Target target = binary_op::kNone;
// Call functor.
template <typename T>
static DGLDEVICE DGLINLINE T Call(T src, T edge, T dst) {
return 0;
}
};
/*! \brief Type functor to switch SelectSrc and SelectDst category.
* SelectEdge and SelectNone will remain the same. */
template <typename Selector>
struct SwitchSrcDst {
typedef Selector Type;
};
template <>
struct SwitchSrcDst<SelectSrc> {
typedef SelectDst Type;
};
template <>
struct SwitchSrcDst<SelectDst> {
typedef SelectSrc Type;
};
//////////////////////////////////////////////////////////////////////////
// Defines binary op category. Each category is a structure with
// three static members:
// - Call: The forward computation given two operands.
// - BackwardLhs: Compute lhs gradient.
// - BackwardRhs: Compute rhs gradient.
//////////////////////////////////////////////////////////////////////////
// common binary functors
template <typename DType>
struct BinaryAdd {
static DGLDEVICE DGLINLINE DType Call(DType lhs, DType rhs) {
return lhs + rhs;
}
static DGLDEVICE DGLINLINE DType BackwardLhs(DType lhs, DType rhs, DType out) {
return 1;
}
static DGLDEVICE DGLINLINE DType BackwardRhs(DType lhs, DType rhs, DType out) {
return 1;
}
};
template <typename DType>
struct BinaryMul {
static DGLDEVICE DGLINLINE DType Call(DType lhs, DType rhs) {
return lhs * rhs;
}
static DGLDEVICE DGLINLINE DType BackwardLhs(DType lhs, DType rhs, DType out) {
return rhs;
}
static DGLDEVICE DGLINLINE DType BackwardRhs(DType lhs, DType rhs, DType out) {
return lhs;
}
};
template <typename DType>
struct BinarySub {
static DGLDEVICE DGLINLINE DType Call(DType lhs, DType rhs) {
return lhs - rhs;
}
static DGLDEVICE DGLINLINE DType BackwardLhs(DType lhs, DType rhs, DType out) {
return 1;
}
static DGLDEVICE DGLINLINE DType BackwardRhs(DType lhs, DType rhs, DType out) {
return -1;
}
};
template <typename DType>
struct BinaryDiv {
static DGLDEVICE DGLINLINE DType Call(DType lhs, DType rhs) {
return lhs / rhs;
}
static DGLDEVICE DGLINLINE DType BackwardLhs(DType lhs, DType rhs, DType out) {
return static_cast<DType>(1) / rhs;
}
static DGLDEVICE DGLINLINE DType BackwardRhs(DType lhs, DType rhs, DType out) {
return -lhs / (rhs * rhs);
}
};
template <typename DType>
struct BinaryUseLhs {
static DGLDEVICE DGLINLINE DType Call(DType lhs, DType rhs) {
return lhs;
}
static DGLDEVICE DGLINLINE DType BackwardLhs(DType lhs, DType rhs, DType out) {
return 1;
}
static DGLDEVICE DGLINLINE DType BackwardRhs(DType lhs, DType rhs, DType out) {
return 0;
}
};
// Macro for dispatching op enum code and target code into template arguments.
// The macro dispatches following combinations:
// - Add(Src, Dst), Add(Src, Edge), Add(Dst, Edge)
// - Mul(Src, Dst), Mul(Src, Edge), Mul(Dst, Edge)
// - Sub(Src, Dst), Sub(Src, Edge), Sub(Dst, Edge)
// Sub(Dst, Src), Sub(Edge, Src), Sub(Edge, Dst)
// - Div(Src, Dst), Div(Src, Edge), Div(Dst, Edge)
// Div(Dst, Src), Div(Edge, Src), Div(Edge, Dst)
// - UseLhs(Src, None), UseLhs(Edge, None)
// Note that for commutative operators (e.g. Add and Mul), we only generate
// kernels for lhs code smaller than rhs code.
#define OP_TARGET_SWITCH(op, lhs, rhs, DType, OpType, LeftType, RightType, ...) \
{ \
using namespace binary_op; \
if (op == kAdd && lhs == kSrc && rhs == kDst) { \
typedef BinaryAdd<DType> OpType; \
typedef SelectSrc LeftType; \
typedef SelectDst RightType; \
{__VA_ARGS__} \
} else if (op == kAdd && lhs == kSrc && rhs == kEdge) { \
typedef BinaryAdd<DType> OpType; \
typedef SelectSrc LeftType; \
typedef SelectEdge RightType; \
{__VA_ARGS__} \
} else if (op == kAdd && lhs == kDst && rhs == kEdge) { \
typedef BinaryAdd<DType> OpType; \
typedef SelectDst LeftType; \
typedef SelectEdge RightType; \
{__VA_ARGS__} \
} else if (op == kMul && lhs == kSrc && rhs == kDst) { \
typedef BinaryMul<DType> OpType; \
typedef SelectSrc LeftType; \
typedef SelectDst RightType; \
{__VA_ARGS__} \
} else if (op == kMul && lhs == kSrc && rhs == kEdge) { \
typedef BinaryMul<DType> OpType; \
typedef SelectSrc LeftType; \
typedef SelectEdge RightType; \
{__VA_ARGS__} \
} else if (op == kMul && lhs == kDst && rhs == kEdge) { \
typedef BinaryMul<DType> OpType; \
typedef SelectDst LeftType; \
typedef SelectEdge RightType; \
{__VA_ARGS__} \
} else if (op == kSub && lhs == kSrc && rhs == kDst) { \
typedef BinarySub<DType> OpType; \
typedef SelectSrc LeftType; \
typedef SelectDst RightType; \
{__VA_ARGS__} \
} else if (op == kSub && lhs == kDst && rhs == kSrc) { \
typedef BinarySub<DType> OpType; \
typedef SelectDst LeftType; \
typedef SelectSrc RightType; \
{__VA_ARGS__} \
} else if (op == kSub && lhs == kSrc && rhs == kEdge) { \
typedef BinarySub<DType> OpType; \
typedef SelectSrc LeftType; \
typedef SelectEdge RightType; \
{__VA_ARGS__} \
} else if (op == kSub && lhs == kEdge && rhs == kSrc) { \
typedef BinarySub<DType> OpType; \
typedef SelectEdge LeftType; \
typedef SelectSrc RightType; \
{__VA_ARGS__} \
} else if (op == kSub && lhs == kDst && rhs == kEdge) { \
typedef BinarySub<DType> OpType; \
typedef SelectDst LeftType; \
typedef SelectEdge RightType; \
{__VA_ARGS__} \
} else if (op == kSub && lhs == kEdge && rhs == kDst) { \
typedef BinarySub<DType> OpType; \
typedef SelectEdge LeftType; \
typedef SelectDst RightType; \
{__VA_ARGS__} \
} else if (op == kDiv && lhs == kSrc && rhs == kDst) { \
typedef BinaryDiv<DType> OpType; \
typedef SelectSrc LeftType; \
typedef SelectDst RightType; \
{__VA_ARGS__} \
} else if (op == kDiv && lhs == kDst && rhs == kSrc) { \
typedef BinaryDiv<DType> OpType; \
typedef SelectDst LeftType; \
typedef SelectSrc RightType; \
{__VA_ARGS__} \
} else if (op == kDiv && lhs == kSrc && rhs == kEdge) { \
typedef BinaryDiv<DType> OpType; \
typedef SelectSrc LeftType; \
typedef SelectEdge RightType; \
{__VA_ARGS__} \
} else if (op == kDiv && lhs == kEdge && rhs == kSrc) { \
typedef BinaryDiv<DType> OpType; \
typedef SelectEdge LeftType; \
typedef SelectSrc RightType; \
{__VA_ARGS__} \
} else if (op == kDiv && lhs == kDst && rhs == kEdge) { \
typedef BinaryDiv<DType> OpType; \
typedef SelectDst LeftType; \
typedef SelectEdge RightType; \
{__VA_ARGS__} \
} else if (op == kDiv && lhs == kEdge && rhs == kDst) { \
typedef BinaryDiv<DType> OpType; \
typedef SelectEdge LeftType; \
typedef SelectDst RightType; \
{__VA_ARGS__} \
} else if (op == kUseLhs && lhs == kSrc) { \
typedef BinaryUseLhs<DType> OpType; \
typedef SelectSrc LeftType; \
typedef SelectNone RightType; \
{__VA_ARGS__} \
} else if (op == kUseLhs && lhs == kEdge) { \
typedef BinaryUseLhs<DType> OpType; \
typedef SelectEdge LeftType; \
typedef SelectNone RightType; \
{__VA_ARGS__} \
} else { \
LOG(FATAL) << "Unsupported operation: op=" << op \
<< " lhs=" << lhs << " rhs=" << rhs; \
} \
}
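// Usage sketch (illustrative; LaunchKernel is a hypothetical launcher):
//
//   OP_TARGET_SWITCH(op, lhs, rhs, float, OpType, LeftType, RightType, {
//     LaunchKernel<float, OpType, LeftType, RightType>(/*...*/);
//   });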
// Macro for unrolling with various template argument combinations
#define GEN_OP_TARGET(GEN, ...) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectSrc, SelectDst, BinaryAdd)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectSrc, SelectEdge, BinaryAdd)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectDst, SelectEdge, BinaryAdd)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectSrc, SelectDst, BinaryMul)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectSrc, SelectEdge, BinaryMul)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectDst, SelectEdge, BinaryMul)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectSrc, SelectDst, BinarySub)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectDst, SelectSrc, BinarySub)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectSrc, SelectEdge, BinarySub)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectEdge, SelectSrc, BinarySub)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectDst, SelectEdge, BinarySub)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectEdge, SelectDst, BinarySub)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectSrc, SelectDst, BinaryDiv)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectDst, SelectSrc, BinaryDiv)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectSrc, SelectEdge, BinaryDiv)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectEdge, SelectSrc, BinaryDiv)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectDst, SelectEdge, BinaryDiv)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectEdge, SelectDst, BinaryDiv)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectSrc, SelectNone, BinaryUseLhs)) \
MSVC_EXPAND(GEN(__VA_ARGS__, SelectEdge, SelectNone, BinaryUseLhs))
//////////////////////////////////////////////////////////////////////////
// Defines reducer category. Each category is an empty structure.
// The call functor is device dependent, so should be specialized
// in each device's implementation.
// See Also:
// - kernel/cpu/functor.h
// - kernel/cuda/functor.cuh
//////////////////////////////////////////////////////////////////////////
// functors for reducers
template <int XPU, typename DType>
struct ReduceSum { };
template <int XPU, typename DType>
struct ReduceMax { };
template <int XPU, typename DType>
struct ReduceMin { };
template <int XPU, typename DType>
struct ReduceProd { };
template <int XPU, typename DType>
struct ReduceNone { };
// Macro for dispatching reducer names to Reducer op structure
#define REDUCER_SWITCH(val, XPU, DType, RedType, ...) \
if (val == binary_op::kReduceSum \
|| val == binary_op::kReduceMean) { \
typedef ReduceSum<XPU, DType> RedType; \
{__VA_ARGS__} \
} else if (val == binary_op::kReduceMax) { \
typedef ReduceMax<XPU, DType> RedType; \
{__VA_ARGS__} \
} else if (val == binary_op::kReduceMin) { \
typedef ReduceMin<XPU, DType> RedType; \
{__VA_ARGS__} \
} else if (val == binary_op::kReduceProd) { \
typedef ReduceProd<XPU, DType> RedType; \
{__VA_ARGS__} \
} else if (val == binary_op::kReduceNone) { \
typedef ReduceNone<XPU, DType> RedType; \
{__VA_ARGS__} \
} else { \
LOG(FATAL) << "Unsupported reducer: " << val; \
}
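// For example, when reducer == binary_op::kReduceSum (or kReduceMean), the body
// of REDUCER_SWITCH(reducer, kDLCPU, float, Reducer, {...}) runs with Reducer
// typedef'ed to ReduceSum<kDLCPU, float>.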
// Type trait for getting the zero (identity) value of the given reducer type.
template <typename Reducer>
struct Zero { };
template <int XPU, typename DType>
struct Zero<ReduceSum<XPU, DType>> {
static constexpr DType value = 0;
};
template <int XPU, typename DType>
struct Zero<ReduceMax<XPU, DType>> {
static constexpr DType value = std::numeric_limits<DType>::lowest();
};
template <int XPU, typename DType>
struct Zero<ReduceMin<XPU, DType>> {
static constexpr DType value = std::numeric_limits<DType>::max();
};
template <int XPU, typename DType>
struct Zero<ReduceProd<XPU, DType>> {
static constexpr DType value = 1;
};
template <int XPU, typename DType>
struct Zero<ReduceNone<XPU, DType>> {
static constexpr DType value = 0;
};
template <int XPU, typename DType>
constexpr DType Zero<ReduceSum<XPU, DType>>::value;
template <int XPU, typename DType>
constexpr DType Zero<ReduceMax<XPU, DType>>::value;
template <int XPU, typename DType>
constexpr DType Zero<ReduceMin<XPU, DType>>::value;
template <int XPU, typename DType>
constexpr DType Zero<ReduceProd<XPU, DType>>::value;
template <int XPU, typename DType>
constexpr DType Zero<ReduceNone<XPU, DType>>::value;
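// Zero<Reducer>::value is the identity element used to initialize the output
// buffer before reduction, e.g.
//   Zero<ReduceMax<kDLCPU, float>>::value == std::numeric_limits<float>::lowest()
//   Zero<ReduceProd<kDLCPU, float>>::value == 1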
// Type functor for selecting output target based on reducer type.
/*! \brief For all the reducer types except ReduceNone, select dst as the output target. */
template <typename Reducer>
struct OutSelector {
typedef SelectDst Type;
};
/*! \brief For ReduceNone, select edge as the output target. */
template <int XPU, typename DType>
struct OutSelector<ReduceNone<XPU, DType>> {
typedef SelectEdge Type;
};
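// For example, OutSelector<ReduceSum<kDLCPU, float>>::Type is SelectDst (the
// reduced result is a node feature), while OutSelector<ReduceNone<kDLCPU,
// float>>::Type is SelectEdge (the result is written per edge).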
// macro for dispatching number of broadcasting dimensions to template argument
#define BCAST_NDIM_SWITCH(ndim, NDim, ...) \
if (ndim <= 2) { \
constexpr int NDim = 2; \
{__VA_ARGS__} \
} else if (ndim <= 4) { \
constexpr int NDim = 4; \
{__VA_ARGS__} \
} else if (ndim <= 8) { \
constexpr int NDim = 8; \
{__VA_ARGS__} \
} else { \
LOG(FATAL) << "Too many broadcasting dimensions."; \
}
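// For example, ndim == 3 dispatches to the NDim = 4 branch; the kernels only
// loop over the first gdata.ndim entries of the NDim-sized shape/stride arrays.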
// macro for unrolling different broadcasting dimensions
#define GEN_NDIM(GEN, ...) \
MSVC_EXPAND(GEN(__VA_ARGS__, 2)) \
MSVC_EXPAND(GEN(__VA_ARGS__, 4)) \
MSVC_EXPAND(GEN(__VA_ARGS__, 8))
// macro for dispatching backward mode enum to template argument
#define BACKWARD_MODE_SWITCH(req_lhs, req_rhs, Mode, ...) \
CHECK(!(req_lhs && req_rhs)); \
if (req_lhs) { \
constexpr int Mode = binary_op::kGradLhs; \
{__VA_ARGS__} \
} else { \
constexpr int Mode = binary_op::kGradRhs; \
{__VA_ARGS__} \
}
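// For example, when only the lhs gradient is requested (req_lhs == true,
// req_rhs == false), the body runs with Mode == binary_op::kGradLhs.
// Requesting both at once fails the CHECK; kGradBoth instantiations are
// generated separately via GEN_BACKWARD_MODE below.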
// macro for unrolling different backward mode
#define GEN_BACKWARD_MODE(GEN, ...) \
MSVC_EXPAND(GEN(__VA_ARGS__, binary_op::kGradLhs)) \
MSVC_EXPAND(GEN(__VA_ARGS__, binary_op::kGradRhs)) \
MSVC_EXPAND(GEN(__VA_ARGS__, binary_op::kGradBoth))
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_BINARY_REDUCE_COMMON_H_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/binary_reduce_impl.h
* \brief Implementations of binary reduce operations.
*/
#ifndef DGL_KERNEL_BINARY_REDUCE_IMPL_H_
#define DGL_KERNEL_BINARY_REDUCE_IMPL_H_
#include <minigun/minigun.h>
#include <dgl/runtime/device_api.h>
#include <dgl/immutable_graph.h>
#include <algorithm>
#include <string>
#ifdef __CUDACC__
#include "../runtime/cuda/cuda_common.h"
#endif
#include "./binary_reduce.h"
#include "./binary_reduce_impl_decl.h"
#include "./utils.h"
namespace dgl {
namespace kernel {
///////////////////////////////////////////////////////////////////////////////
// BinaryReduce device-agnostic implementation
///////////////////////////////////////////////////////////////////////////////
template <int XPU, typename Idx, typename DType, typename Reducer>
GData<Idx, DType> AllocGData(
const DLContext& ctx, int64_t x_len,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_mapping, runtime::NDArray out_data) {
// GData
GData<Idx, DType> gdata;
gdata.x_length = x_len;
gdata.out_size = out_data->shape[0];
gdata.lhs_data = static_cast<DType*>(lhs_data->data);
gdata.rhs_data = static_cast<DType*>(rhs_data->data);
gdata.out_data = static_cast<DType*>(out_data->data);
if (!utils::IsNoneArray(lhs_mapping)) {
gdata.lhs_mapping = static_cast<Idx*>(lhs_mapping->data);
}
if (!utils::IsNoneArray(rhs_mapping)) {
gdata.rhs_mapping = static_cast<Idx*>(rhs_mapping->data);
}
if (!utils::IsNoneArray(out_mapping)) {
gdata.out_mapping = static_cast<Idx*>(out_mapping->data);
}
// fill out data with zero values
utils::Fill<XPU>(ctx, gdata.out_data, utils::NElements(out_data), Zero<Reducer>::value);
return gdata;
}
template <int XPU>
void BinaryReduceImpl(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping) {
using runtime::NDArray;
using minigun::Csr;
// device
#ifdef __CUDACC__
auto* thr_entry = runtime::CUDAThreadEntry::ThreadLocal();
#endif
const int64_t x_len = utils::ComputeXLength(out_data);
// advance config
minigun::advance::RuntimeConfig rtcfg;
rtcfg.ctx = out_data->ctx;
#ifdef __CUDACC__
rtcfg.stream = thr_entry->stream;
const int nt = utils::FindNumThreads(x_len, 64);
rtcfg.data_num_threads = nt;
// XXX(minjie): hard-code to let each thread compute two elements to increase
// instruction level parallelism
rtcfg.data_num_blocks = (x_len + (nt * 2) - 1) / (nt * 2);
#endif
if (reducer == binary_op::kReduceMean) {
// TODO(minjie): divide
LOG(FATAL) << "reduce mean is not supported.";
}
const DLDataType& dtype = out_data->dtype;
const auto bits = graph->NumBits();
DGL_DTYPE_SWITCH(dtype, DType, {
DGL_IDX_TYPE_SWITCH(bits, Idx, {
REDUCER_SWITCH(reducer, XPU, DType, Reducer, {
auto gdata = AllocGData<XPU, Idx, DType, Reducer>(
rtcfg.ctx, x_len, lhs_mapping, rhs_mapping,
lhs_data, rhs_data, out_mapping, out_data);
OP_TARGET_SWITCH(op, lhs, rhs, DType, BinaryOp, LeftTarget, RightTarget, {
CallBinaryReduce<XPU, Idx, DType, LeftTarget,
RightTarget, BinaryOp, Reducer>(rtcfg, graph, &gdata);
});
});
});
});
}
///////////////////////////////////////////////////////////////////////////////
// BackwardBinaryReduce device-agnostic implementation
///////////////////////////////////////////////////////////////////////////////
template <int XPU, typename Idx, typename DType>
BackwardGData<Idx, DType> AllocBackwardGData(
const DLContext& ctx, int64_t x_len,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping,
runtime::NDArray lhs_data, runtime::NDArray rhs_data, runtime::NDArray out_data,
runtime::NDArray grad_out_data,
runtime::NDArray grad_lhs_data, runtime::NDArray grad_rhs_data) {
// GData
BackwardGData<Idx, DType> gdata;
gdata.x_length = x_len;
gdata.out_size = out_data->shape[0];
gdata.lhs_data = static_cast<DType*>(lhs_data->data);
gdata.rhs_data = static_cast<DType*>(rhs_data->data);
gdata.out_data = static_cast<DType*>(out_data->data);
gdata.grad_out_data = static_cast<DType*>(grad_out_data->data);
if (!utils::IsNoneArray(grad_lhs_data)) {
gdata.grad_lhs_data = static_cast<DType*>(grad_lhs_data->data);
// fill out data with zero values
utils::Fill<XPU>(ctx, gdata.grad_lhs_data, utils::NElements(grad_lhs_data),
static_cast<DType>(0));
}
if (!utils::IsNoneArray(grad_rhs_data)) {
gdata.grad_rhs_data = static_cast<DType*>(grad_rhs_data->data);
// fill out data with zero values
utils::Fill<XPU>(ctx, gdata.grad_rhs_data, utils::NElements(grad_rhs_data),
static_cast<DType>(0));
}
if (!utils::IsNoneArray(lhs_mapping)) {
gdata.lhs_mapping = static_cast<Idx*>(lhs_mapping->data);
}
if (!utils::IsNoneArray(rhs_mapping)) {
gdata.rhs_mapping = static_cast<Idx*>(rhs_mapping->data);
}
if (!utils::IsNoneArray(out_mapping)) {
gdata.out_mapping = static_cast<Idx*>(out_mapping->data);
}
return gdata;
}
template <int XPU>
void BackwardBinaryReduceImpl(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping,
runtime::NDArray lhs_data, runtime::NDArray rhs_data, runtime::NDArray out_data,
runtime::NDArray grad_out_data,
runtime::NDArray grad_lhs_data, runtime::NDArray grad_rhs_data) {
using runtime::NDArray;
using minigun::Csr;
#ifdef __CUDACC__
// device
auto* thr_entry = runtime::CUDAThreadEntry::ThreadLocal();
#endif
// Graph
const int64_t x_len = utils::ComputeXLength(out_data);
// advance config
minigun::advance::RuntimeConfig rtcfg;
rtcfg.ctx = out_data->ctx;
#ifdef __CUDACC__
rtcfg.stream = thr_entry->stream;
const int nt = utils::FindNumThreads(x_len, 64);
rtcfg.data_num_threads = nt;
// XXX(minjie): hard-code to let each thread compute two elements to increase
// instruction level parallelism
rtcfg.data_num_blocks = (x_len + (nt * 2) - 1) / (nt * 2);
#endif
const DLDataType& dtype = out_data->dtype;
const bool req_lhs = !utils::IsNoneArray(grad_lhs_data);
const bool req_rhs = !utils::IsNoneArray(grad_rhs_data);
const auto bits = graph->NumBits();
if (reducer == binary_op::kReduceMean) {
// TODO(minjie): divide
LOG(FATAL) << "reduce mean is not supported.";
}
DGL_DTYPE_SWITCH(dtype, DType, {
DGL_IDX_TYPE_SWITCH(bits, Idx, {
auto gdata = AllocBackwardGData<XPU, Idx, DType>(
rtcfg.ctx, x_len, lhs_mapping, rhs_mapping, out_mapping,
lhs_data, rhs_data, out_data, grad_out_data,
grad_lhs_data, grad_rhs_data);
BACKWARD_MODE_SWITCH(req_lhs, req_rhs, Mode, {
REDUCER_SWITCH(reducer, XPU, DType, Reducer, {
OP_TARGET_SWITCH(op, lhs, rhs, DType, BinaryOp, LeftTarget, RightTarget, {
CallBackwardBinaryReduce<XPU, Mode, Idx, DType, LeftTarget,
RightTarget, BinaryOp, Reducer>(rtcfg, graph, &gdata);
});
});
});
});
});
}
///////////////////////////////////////////////////////////////////////////////
// BinaryReduceBcast device-agnostic implementation
///////////////////////////////////////////////////////////////////////////////
template <int XPU, int NDim, typename Idx, typename DType, typename Reducer>
BcastGData<NDim, Idx, DType> AllocBcastGData(
const DLContext& ctx, const BcastInfo& info,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_mapping, runtime::NDArray out_data) {
// GData
BcastGData<NDim, Idx, DType> gdata;
// dim, shape and stride
gdata.ndim = info.lhs_shape.size();
std::copy(info.lhs_shape.begin(), info.lhs_shape.end(), gdata.lhs_shape);
std::copy(info.lhs_stride.begin(), info.lhs_stride.end(), gdata.lhs_stride);
std::copy(info.rhs_shape.begin(), info.rhs_shape.end(), gdata.rhs_shape);
std::copy(info.rhs_stride.begin(), info.rhs_stride.end(), gdata.rhs_stride);
std::copy(info.out_shape.begin(), info.out_shape.end(), gdata.out_shape);
std::copy(info.out_stride.begin(), info.out_stride.end(), gdata.out_stride);
gdata.lhs_len = utils::Prod(info.lhs_shape);
gdata.rhs_len = utils::Prod(info.rhs_shape);
gdata.out_len = utils::Prod(info.out_shape);
// data
gdata.lhs_data = static_cast<DType*>(lhs_data->data);
gdata.rhs_data = static_cast<DType*>(rhs_data->data);
gdata.out_data = static_cast<DType*>(out_data->data);
if (!utils::IsNoneArray(lhs_mapping)) {
gdata.lhs_mapping = static_cast<Idx*>(lhs_mapping->data);
}
if (!utils::IsNoneArray(rhs_mapping)) {
gdata.rhs_mapping = static_cast<Idx*>(rhs_mapping->data);
}
if (!utils::IsNoneArray(out_mapping)) {
gdata.out_mapping = static_cast<Idx*>(out_mapping->data);
}
// fill out data with zero values
utils::Fill<XPU>(ctx, gdata.out_data, utils::NElements(out_data), Zero<Reducer>::value);
return gdata;
}
template <int XPU>
void BinaryReduceBcastImpl(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs,
binary_op::Target rhs,
runtime::NDArray lhs_data,
runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping,
runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping) {
using runtime::NDArray;
using minigun::Csr;
#ifdef __CUDACC__
auto* thr_entry = runtime::CUDAThreadEntry::ThreadLocal();
#endif
// advance config
minigun::advance::RuntimeConfig rtcfg;
rtcfg.ctx = out_data->ctx;
#ifdef __CUDACC__
rtcfg.stream = thr_entry->stream;
const int64_t x_len = utils::ComputeXLength(out_data);
const int nt = utils::FindNumThreads(x_len, 64);
rtcfg.data_num_threads = nt;
// XXX(minjie): hard-code to let each thread compute two elements to increase
// instruction level parallelism
rtcfg.data_num_blocks = (x_len + (nt * 2) - 1) / (nt * 2);
#endif
const DLDataType& dtype = out_data->dtype;
const int bcast_ndim = info.out_shape.size();
const auto bits = graph->NumBits();
if (reducer == binary_op::kReduceMean) {
// TODO(minjie): divide
LOG(FATAL) << "reduce mean is not supported.";
}
DGL_DTYPE_SWITCH(dtype, DType, {
DGL_IDX_TYPE_SWITCH(bits, Idx, {
REDUCER_SWITCH(reducer, XPU, DType, Reducer, {
BCAST_NDIM_SWITCH(bcast_ndim, NDim, {
auto gdata = AllocBcastGData<XPU, NDim, Idx, DType, Reducer>(
rtcfg.ctx, info, lhs_mapping, rhs_mapping,
lhs_data, rhs_data, out_mapping, out_data);
OP_TARGET_SWITCH(op, lhs, rhs, DType, BinaryOp, LeftTarget, RightTarget, {
CallBinaryReduceBcast<XPU, NDim, Idx, DType, LeftTarget,
RightTarget, BinaryOp, Reducer>(rtcfg, graph, &gdata);
});
});
});
});
});
}
///////////////////////////////////////////////////////////////////////////////
// BackwardBinaryReduceBcast device-agnostic implementation
///////////////////////////////////////////////////////////////////////////////
template <int XPU, int NDim, typename Idx, typename DType>
BackwardBcastGData<NDim, Idx, DType> AllocBackwardBcastGData(
const DLContext& ctx, const BcastInfo& info,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping,
runtime::NDArray lhs, runtime::NDArray rhs, runtime::NDArray out, runtime::NDArray grad_out,
runtime::NDArray grad_lhs, runtime::NDArray grad_rhs) {
// GData
BackwardBcastGData<NDim, Idx, DType> gdata;
// dim, shape and stride
gdata.ndim = info.lhs_shape.size();
gdata.lhs_len = utils::Prod(info.lhs_shape);
gdata.rhs_len = utils::Prod(info.rhs_shape);
gdata.out_len = utils::Prod(info.out_shape);
std::copy(info.lhs_shape.begin(), info.lhs_shape.end(), gdata.lhs_shape);
std::copy(info.lhs_stride.begin(), info.lhs_stride.end(), gdata.lhs_stride);
std::copy(info.rhs_shape.begin(), info.rhs_shape.end(), gdata.rhs_shape);
std::copy(info.rhs_stride.begin(), info.rhs_stride.end(), gdata.rhs_stride);
std::copy(info.out_shape.begin(), info.out_shape.end(), gdata.out_shape);
std::copy(info.out_stride.begin(), info.out_stride.end(), gdata.out_stride);
// mappings
if (!utils::IsNoneArray(lhs_mapping)) {
gdata.lhs_mapping = static_cast<Idx*>(lhs_mapping->data);
}
if (!utils::IsNoneArray(rhs_mapping)) {
gdata.rhs_mapping = static_cast<Idx*>(rhs_mapping->data);
}
if (!utils::IsNoneArray(out_mapping)) {
gdata.out_mapping = static_cast<Idx*>(out_mapping->data);
}
// data
gdata.lhs_data = static_cast<DType*>(lhs->data);
gdata.rhs_data = static_cast<DType*>(rhs->data);
gdata.out_data = static_cast<DType*>(out->data);
gdata.grad_out_data = static_cast<DType*>(grad_out->data);
if (!utils::IsNoneArray(grad_lhs)) {
gdata.grad_lhs_data = static_cast<DType*>(grad_lhs->data);
// fill out data with zero values
utils::Fill<XPU>(ctx, gdata.grad_lhs_data, utils::NElements(grad_lhs),
static_cast<DType>(0));
}
if (!utils::IsNoneArray(grad_rhs)) {
gdata.grad_rhs_data = static_cast<DType*>(grad_rhs->data);
// fill out data with zero values
utils::Fill<XPU>(ctx, gdata.grad_rhs_data, utils::NElements(grad_rhs),
static_cast<DType>(0));
}
return gdata;
}
template <int XPU>
void BackwardBinaryReduceBcastImpl(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs_tgt, binary_op::Target rhs_tgt,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping,
runtime::NDArray lhs, runtime::NDArray rhs, runtime::NDArray out, runtime::NDArray grad_out,
runtime::NDArray grad_lhs, runtime::NDArray grad_rhs) {
using runtime::NDArray;
using minigun::Csr;
#ifdef __CUDACC__
auto* thr_entry = runtime::CUDAThreadEntry::ThreadLocal();
#endif
// advance config
minigun::advance::RuntimeConfig rtcfg;
rtcfg.ctx = out->ctx;
#ifdef __CUDACC__
rtcfg.stream = thr_entry->stream;
const int64_t x_len = utils::ComputeXLength(out);
const int nt = utils::FindNumThreads(x_len, 64);
rtcfg.data_num_threads = nt;
// XXX(minjie): hard-code to let each thread compute two elements to increase
// instruction level parallelism
rtcfg.data_num_blocks = (x_len + (nt * 2) - 1) / (nt * 2);
#endif
const DLDataType& dtype = out->dtype;
const int bcast_ndim = info.out_shape.size();
const bool req_lhs = !utils::IsNoneArray(grad_lhs);
const bool req_rhs = !utils::IsNoneArray(grad_rhs);
const auto bits = graph->NumBits();
if (reducer == binary_op::kReduceMean) {
// TODO(minjie): divide
LOG(FATAL) << "reduce mean is not supported.";
}
DGL_DTYPE_SWITCH(dtype, DType, {
DGL_IDX_TYPE_SWITCH(bits, Idx, {
BCAST_NDIM_SWITCH(bcast_ndim, NDim, {
auto gdata = AllocBackwardBcastGData<XPU, NDim, Idx, DType>(
rtcfg.ctx, info,
lhs_mapping, rhs_mapping, out_mapping,
lhs, rhs, out, grad_out,
grad_lhs, grad_rhs);
BACKWARD_MODE_SWITCH(req_lhs, req_rhs, Mode, {
REDUCER_SWITCH(reducer, XPU, DType, Reducer, {
OP_TARGET_SWITCH(op, lhs_tgt, rhs_tgt, DType, BinaryOp, LeftTarget, RightTarget, {
CallBackwardBinaryReduceBcast<XPU, Mode, NDim, Idx, DType,
LeftTarget, RightTarget, BinaryOp, Reducer>(rtcfg, graph, &gdata);
});
});
});
});
});
});
}
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_BINARY_REDUCE_IMPL_H_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/binary_reduce_impl_decl.h
* \brief Data structure and function declarations for implementations.
*/
#ifndef DGL_KERNEL_BINARY_REDUCE_IMPL_DECL_H_
#define DGL_KERNEL_BINARY_REDUCE_IMPL_DECL_H_
#include <dgl/runtime/ndarray.h>
#include <string>
#include "./binary_reduce_common.h"
namespace minigun {
namespace advance {
// forward declaration
struct RuntimeConfig;
} // namespace advance
} // namespace minigun
namespace dgl {
// forward declaration
class ImmutableGraph;
namespace kernel {
// forward declaration
struct BcastInfo;
///////////////////////////////////////////////////////////////////////////////
// BinaryReduce declarations
///////////////////////////////////////////////////////////////////////////////
/*!\brief Data structure used for computing BinaryReduce in Minigun. */
template <typename Idx, typename DType>
struct GData {
// length along x(feature) dimension
int64_t x_length{0};
// number of rows of the output tensor
int64_t out_size{0};
// input data
DType *lhs_data{nullptr}, *rhs_data{nullptr};
// output data
DType *out_data{nullptr};
// input id mappings
Idx *lhs_mapping{nullptr}, *rhs_mapping{nullptr};
// output id mapping
Idx *out_mapping{nullptr};
};
/*!
* \brief Template declaration for BinaryReduce operator.
*
* LeftSelector and RightSelector must be one of the four operand target
* categories.
*
* BinaryOp must be one of the binary operator types.
*
* Reducer must be one of the reducer types.
*
* The implementation of this template is device-dependent
* (see kernel/xpu/binary_reduce_impl.(cu)h).
*
* See definitions in binary_reduce_common.h
*
* \tparam XPU the device flag
* \tparam Idx type of node/edge index (e.g. int32_t, int64_t)
* \tparam DType type of the feature data (e.g. float32)
 * \tparam LeftSelector lhs category type
 * \tparam RightSelector rhs category type
 * \tparam BinaryOp Binary operator type
 * \tparam Reducer Reducer type
 * \param rtcfg Runtime configuration used by minigun
* \param graph The graph object.
* \param gdata The feature and mapping data used by the computation.
*/
template <int XPU, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBinaryReduce(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<Idx, DType>* gdata);
/*!
 * \brief Template declaration for the common logic shared by different devices.
 *
 * \tparam XPU the device flag
 * \param reducer The type of the reducer ("sum", "max", "min", "prod", "mean", "none").
 *                If the reducer is "none", the output is an edge feature tensor.
 *                Otherwise, a node feature tensor is returned.
 * \param op The type of the binary operator ("add", "mul", "sub", "div", etc.).
* \param graph The graph object.
* \param lhs The lhs target (src, dst, edge)
* \param rhs The rhs target (src, dst, edge)
* \param lhs_data The lhs feature tensor.
* \param rhs_data The rhs feature tensor.
* \param out_data The output tensor. Could be either node or edge feature
* tensor depending on the reducer.
* \param lhs_mapping An optional int64 id mapping array.
* \param rhs_mapping An optional int64 id mapping array.
* \param out_mapping An optional int64 id mapping array.
*/
template <int XPU>
void BinaryReduceImpl(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data, runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping);
///////////////////////////////////////////////////////////////////////////////
// BackwardBinaryReduce declarations
///////////////////////////////////////////////////////////////////////////////
/*!\brief Data structure used for computing BackwardBinaryReduce in Minigun. */
template <typename Idx, typename DType>
struct BackwardGData {
// length along x(feature) dimension
int64_t x_length{0};
// number of rows of the output tensor
int64_t out_size{0};
// input data
DType *lhs_data{nullptr}, *rhs_data{nullptr}, *out_data{nullptr};
DType *grad_out_data{nullptr};
// output data
DType *grad_lhs_data{nullptr}, *grad_rhs_data{nullptr};
// input id mappings
Idx *lhs_mapping{nullptr}, *rhs_mapping{nullptr};
// output id mapping
Idx *out_mapping{nullptr};
};
/*!
* \brief Template declaration for BackwardBinaryReduce operator.
*
* Mode must be one of the enum code in binary_op::BackwardMode.
*
* LeftSelector and RightSelector must be one of the four operand target
* categories.
*
* BinaryOp must be one of the binary operator types.
*
* Reducer must be one of the reducer types.
*
* The implementation of this template is device-dependent
* (see kernel/xpu/backward_binary_reduce_impl.(cu)h).
*
* See definitions in binary_reduce_common.h
*
* \tparam XPU the device flag
* \tparam Mode the backward mode code
* \tparam Idx type of node/edge index (e.g. int32_t, int64_t)
* \tparam DType type of the feature data (e.g. float32)
 * \tparam LeftSelector lhs category type
 * \tparam RightSelector rhs category type
 * \tparam BinaryOp Binary operator type
 * \tparam Reducer Reducer type
 * \param rtcfg Runtime configuration used by minigun
* \param graph The graph object.
* \param gdata The feature and mapping data used by the computation.
*/
template <int XPU, int Mode, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBackwardBinaryReduce(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<Idx, DType>* gdata);
/*!
 * \brief Template declaration for the common logic shared by different devices.
 *
 * \tparam XPU the device flag
 * \param reducer The type of the reducer ("sum", "max", "min", "prod", "mean", "none").
 *                If the reducer is "none", the output is an edge feature tensor.
 *                Otherwise, a node feature tensor is returned.
 * \param op The type of the binary operator ("add", "mul", "sub", "div", etc.).
* \param graph The graph object.
* \param lhs The lhs target (src, dst, edge)
* \param rhs The rhs target (src, dst, edge)
* \param lhs_mapping An optional int64 id mapping array.
* \param rhs_mapping An optional int64 id mapping array.
* \param out_mapping An optional int64 id mapping array.
* \param lhs_data The lhs feature tensor.
* \param rhs_data The rhs feature tensor.
* \param out_data The output tensor. Could be either node or edge feature
* tensor depending on the reducer.
* \param grad_out_data The gradient output tensor.
 * \param grad_lhs_data The gradient lhs tensor.
 * \param grad_rhs_data The gradient rhs tensor.
*/
template <int XPU>
void BackwardBinaryReduceImpl(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping,
runtime::NDArray lhs_data, runtime::NDArray rhs_data, runtime::NDArray out_data,
runtime::NDArray grad_out_data,
runtime::NDArray grad_lhs_data, runtime::NDArray grad_rhs_data);
///////////////////////////////////////////////////////////////////////////////
// BinaryReduce with broadcasting declarations
///////////////////////////////////////////////////////////////////////////////
/*!
 * \brief Data structure used for computing BinaryReduce with broadcasting in Minigun.
*
* Note that all the shapes and strides are for the feature dimensions.
*
* \tparam NDim maximum number of feature dimensions
* \tparam Idx id index type
* \tparam DType feature data type
*/
template <int NDim, typename Idx, typename DType>
struct BcastGData {
// actual number of feature dimensions
int ndim{0};
// input feature shape and stride
int64_t lhs_len{0}, rhs_len{0};
int64_t lhs_shape[NDim]{0}, lhs_stride[NDim]{0};
int64_t rhs_shape[NDim]{0}, rhs_stride[NDim]{0};
// input data
DType *lhs_data{nullptr}, *rhs_data{nullptr};
// input id mappings
Idx *lhs_mapping{nullptr}, *rhs_mapping{nullptr};
// output feature shape and stride
  int64_t out_len{0};  // output total feature length (equal to prod(out_shape))
int64_t out_shape[NDim]{0}, out_stride[NDim]{0};
// output data
DType *out_data{nullptr};
// output id mapping
Idx *out_mapping{nullptr};
};
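// Illustrative example: broadcasting an lhs feature of shape (4, 1) against an
// rhs feature of shape (4, 8) gives ndim = 2, lhs_shape = {4, 1},
// lhs_stride = {1, 1}, rhs_shape = {4, 8}, rhs_stride = {8, 1},
// out_shape = {4, 8}, out_stride = {8, 1}, lhs_len = 4, rhs_len = 32,
// out_len = 32.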
/*!
* \brief Template declaration for BinaryReduce with broadcasting operator.
*
* LeftSelector and RightSelector must be one of the four operand target
* categories.
*
* BinaryOp must be one of the binary operator types.
*
* Reducer must be one of the reducer types.
*
* The implementation of this template is device-dependent
* (see kernel/xpu/binary_reduce_impl.(cu)h).
*
* See definitions in binary_reduce_common.h
*
* \tparam XPU the device flag
* \tparam NDim maximum number of feature dimensions
* \tparam Idx type of node/edge index (e.g. int32_t, int64_t)
* \tparam DType type of the feature data (e.g. float32)
 * \tparam LeftSelector lhs category type
 * \tparam RightSelector rhs category type
 * \tparam BinaryOp Binary operator type
 * \tparam Reducer Reducer type
 * \param rtcfg Runtime configuration used by minigun
* \param graph The graph object.
* \param gdata The feature and mapping data used by the computation.
*/
template <int XPU, int NDim, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBinaryReduceBcast(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BcastGData<NDim, Idx, DType>* gdata);
/*!
 * \brief Template declaration for the common logic shared by different devices.
 *
 * \tparam XPU the device flag
 * \param info The broadcasting information of the feature shapes.
 * \param reducer The type of the reducer ("sum", "max", "min", "prod", "mean", "none").
 *                If the reducer is "none", the output is an edge feature tensor.
 *                Otherwise, a node feature tensor is returned.
 * \param op The type of the binary operator ("add", "mul", "sub", "div", etc.).
* \param graph The graph object.
* \param lhs The lhs target (src, dst, edge)
* \param rhs The rhs target (src, dst, edge)
* \param lhs_data The lhs feature tensor.
* \param rhs_data The rhs feature tensor.
* \param out_data The output tensor. Could be either node or edge feature
* tensor depending on the reducer.
* \param lhs_mapping An optional int64 id mapping array.
* \param rhs_mapping An optional int64 id mapping array.
* \param out_mapping An optional int64 id mapping array.
*/
template <int XPU>
void BinaryReduceBcastImpl(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping);
///////////////////////////////////////////////////////////////////////////////
// BackwardBinaryReduce with broadcasting declarations
///////////////////////////////////////////////////////////////////////////////
/*!
* \brief Data and auxiliary information for backward binary broadcasting op.
*
* Note that all the shapes and strides are for the feature dimensions.
*
 * The gradients of the broadcasting dimensions are not reduced. As a result,
 * the grad_lhs and grad_rhs tensors have the same shape as grad_out.
*
* \tparam NDim maximum number of feature dimensions
* \tparam Idx id index type
* \tparam DType feature data type
*/
template <int NDim, typename Idx, typename DType>
struct BackwardBcastGData {
// actual number of feature dimensions
int ndim{0};
// input shape and stride
int64_t lhs_len{0}, rhs_len{0}, out_len{0};
int64_t lhs_shape[NDim]{0}, lhs_stride[NDim]{0};
int64_t rhs_shape[NDim]{0}, rhs_stride[NDim]{0};
int64_t out_shape[NDim]{0}, out_stride[NDim]{0};
// input id mappings
Idx *lhs_mapping{nullptr}, *rhs_mapping{nullptr}, *out_mapping{nullptr};
// input data
DType *lhs_data{nullptr}, *rhs_data{nullptr}, *out_data{nullptr};
DType *grad_out_data{nullptr};
// output data
DType *grad_lhs_data{nullptr}, *grad_rhs_data{nullptr};
};
/*!
* \brief Template declaration for BackwardBinaryReduce with broadcasting operator.
*
* LeftSelector and RightSelector must be one of the four operand target
* categories.
*
* BinaryOp must be one of the binary operator types.
*
* Reducer must be one of the reducer types.
*
* The implementation of this template is device-dependent
* (see kernel/xpu/binary_reduce_impl.(cu)h).
*
* See definitions in binary_reduce_common.h
*
* \tparam XPU the device flag
* \tparam Mode the backward mode code
* \tparam NDim maximum number of feature dimensions
* \tparam Idx type of node/edge index (e.g. int32_t, int64_t)
* \tparam DType type of the feature data (e.g. float32)
 * \tparam LeftSelector lhs category type
 * \tparam RightSelector rhs category type
 * \tparam BinaryOp Binary operator type
 * \tparam Reducer Reducer type
 * \param rtcfg Runtime configuration used by minigun
* \param graph The graph object.
* \param gdata The feature and mapping data used by the computation.
*/
template <int XPU, int Mode, int NDim, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBackwardBinaryReduceBcast(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardBcastGData<NDim, Idx, DType>* gdata);
/*!
 * \brief Template declaration for the common logic shared by different devices.
 *
 * \tparam XPU the device flag
 * \param info The broadcasting information of the feature shapes.
 * \param reducer The type of the reducer ("sum", "max", "min", "prod", "mean", "none").
 *                If the reducer is "none", the output is an edge feature tensor.
 *                Otherwise, a node feature tensor is returned.
 * \param op The type of the binary operator ("add", "mul", "sub", "div", etc.).
* \param graph The graph object.
* \param lhs The lhs target (src, dst, edge)
* \param rhs The rhs target (src, dst, edge)
* \param lhs_mapping An optional int64 id mapping array.
* \param rhs_mapping An optional int64 id mapping array.
* \param out_mapping An optional int64 id mapping array.
* \param lhs_data The lhs feature tensor.
* \param rhs_data The rhs feature tensor.
* \param out_data The output tensor. Could be either node or edge feature
* tensor depending on the reducer.
* \param grad_out_data The gradient output tensor.
 * \param grad_lhs_data The gradient lhs tensor.
 * \param grad_rhs_data The gradient rhs tensor.
*/
template <int XPU>
void BackwardBinaryReduceBcastImpl(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping,
runtime::NDArray lhs_data, runtime::NDArray rhs_data, runtime::NDArray out_data,
runtime::NDArray grad_out_data,
runtime::NDArray grad_lhs_data, runtime::NDArray grad_rhs_data);
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_BINARY_REDUCE_IMPL_DECL_H_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/common.h
* \brief Kernel common utilities
*/
#ifndef DGL_KERNEL_COMMON_H_
#define DGL_KERNEL_COMMON_H_
#include <dgl/runtime/ndarray.h>
#include <cstdint>
namespace dgl {
namespace kernel {
#ifdef __CUDACC__
#define DGLDEVICE __device__
#define DGLINLINE __forceinline__
#else
#define DGLDEVICE
#define DGLINLINE inline
#endif // __CUDACC__
// Macro for dispatching the device flag to template function calls
#ifdef DGL_USE_CUDA
#define DGL_XPU_SWITCH(val, Method, ...) \
if (val == kDLCPU) { \
Method<kDLCPU>(__VA_ARGS__); \
} else if (val == kDLGPU) { \
Method<kDLGPU>(__VA_ARGS__); \
} else { \
LOG(FATAL) << "Unsupported device type: " << val; \
}
#else // DGL_USE_CUDA
#define DGL_XPU_SWITCH(val, Method, ...) \
if (val == kDLCPU) { \
Method<kDLCPU>(__VA_ARGS__); \
} else { \
LOG(FATAL) << "Unsupported device type: " << val; \
}
#endif // DGL_USE_CUDA
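// Illustrative usage (argument names are placeholders):
//   DGL_XPU_SWITCH(out_data->ctx.device_type, BinaryReduceImpl,
//                  reducer, op, graph, ...);
// expands to a runtime if/else chain that calls BinaryReduceImpl<kDLCPU>(...)
// or, when built with DGL_USE_CUDA, BinaryReduceImpl<kDLGPU>(...).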
// MSVC does not expand __VA_ARGS__ correctly, and needs this expand hack
#define MSVC_EXPAND(x) x
// Macro for dispatching the dtype flag to a template argument. Currently only
// float32 is supported.
#define DGL_DTYPE_SWITCH(val, DType, ...) \
if (val.code == kDLFloat && val.bits == 32) { \
typedef float DType; \
{__VA_ARGS__} \
} else { \
LOG(FATAL) << "Unsupported dtype: " << val.code << "_" \
<< val.bits; \
}
// Macro for unrolling with data type arguments.
#define GEN_DTYPE(GEN, ...) \
MSVC_EXPAND(GEN(__VA_ARGS__, float))
// Macro for dispatching the index bit width (nbits) to a template argument.
#ifdef __CUDACC__
#define DGL_IDX_TYPE_SWITCH(bits, Idx, ...) \
if (bits == 32) { \
typedef int32_t Idx; \
{__VA_ARGS__} \
} else { \
LOG(FATAL) << "Unsupported idx bits: " << bits; \
}
#else
#define DGL_IDX_TYPE_SWITCH(bits, Idx, ...) \
if (bits == 32) { \
typedef int32_t Idx; \
{__VA_ARGS__} \
} else if (bits == 64) { \
typedef int64_t Idx; \
{__VA_ARGS__} \
} else { \
LOG(FATAL) << "Unsupported idx bits: " << bits; \
}
#endif
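// The type switches are typically nested, e.g. (illustrative):
//   DGL_DTYPE_SWITCH(out_data->dtype, DType, {
//     DGL_IDX_TYPE_SWITCH(graph->NumBits(), Idx, {
//       // DType and Idx are concrete types inside this scope.
//     });
//   });
// Note that the CUDA path currently only dispatches 32-bit indices.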
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_COMMON_H_
/*!
* Copyright (c) 2019 by Contributors
 * \file kernel/cpu/backward_binary_reduce_impl.h
 * \brief Minigun CPU UDFs for backward binary reduce
*/
#ifndef DGL_KERNEL_CPU_BACKWARD_BINARY_REDUCE_IMPL_H_
#define DGL_KERNEL_CPU_BACKWARD_BINARY_REDUCE_IMPL_H_
#include <minigun/minigun.h>
#include <dgl/immutable_graph.h>
#include "../binary_reduce_impl_decl.h"
#include "../utils.h"
#include "./functor.h"
namespace dgl {
namespace kernel {
namespace cpu {
// Minigun UDF to compute backward binary reduce.
template <int Mode, typename Idx, typename DType, typename Functors>
struct BackwardBinaryReduce {
static inline bool CondEdge(
Idx src, Idx dst, Idx eid, BackwardGData<Idx, DType>* gdata) {
return true;
}
static inline void ApplyEdge(
Idx src, Idx dst, Idx eid, BackwardGData<Idx, DType>* gdata) {
const int64_t D = gdata->x_length;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * D;
DType* rhsoff = gdata->rhs_data + rid * D;
DType* outoff = gdata->out_data + oid * D;
DType* gradlhsoff = gdata->grad_lhs_data + lid * D;
DType* gradrhsoff = gdata->grad_rhs_data + rid * D;
DType* gradoutoff = gdata->grad_out_data + oid * D;
for (int64_t tx = 0; tx < D; ++tx) {
DType lhs = Functors::Read(lhsoff + tx);
DType rhs = Functors::Read(rhsoff + tx);
DType out = Functors::Read(outoff + tx);
DType grad_out = Functors::Read(gradoutoff + tx);
DType e = Functors::Op(lhs, rhs);
DType grad_e = grad_out * Functors::BackwardWrite(e, out);
if (Mode == binary_op::kGradLhs || Mode == binary_op::kGradBoth) {
DType grad_lhs = grad_e * Functors::BackwardOpLhs(lhs, rhs, e);
#pragma omp atomic
gradlhsoff[tx] += grad_lhs;
}
if (Mode == binary_op::kGradRhs || Mode == binary_op::kGradBoth) {
DType grad_rhs = grad_e * Functors::BackwardOpRhs(lhs, rhs, e);
#pragma omp atomic
gradrhsoff[tx] += grad_rhs;
}
}
}
};
// Minigun UDF to compute backward binary reduce with broadcasting.
template <int Mode, int NDim,
typename Idx, typename DType, typename Functors>
struct BackwardBinaryReduceBcast {
static inline bool CondEdge(
Idx src, Idx dst, Idx eid, BackwardBcastGData<NDim, Idx, DType>* gdata) {
return true;
}
static inline void ApplyEdge(
Idx src, Idx dst, Idx eid, BackwardBcastGData<NDim, Idx, DType>* gdata) {
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * gdata->lhs_len;
DType* rhsoff = gdata->rhs_data + rid * gdata->rhs_len;
DType* outoff = gdata->out_data + oid * gdata->out_len;
DType* gradlhsoff = gdata->grad_lhs_data + lid * gdata->out_len;
DType* gradrhsoff = gdata->grad_rhs_data + rid * gdata->out_len;
DType* gradoutoff = gdata->grad_out_data + oid * gdata->out_len;
int64_t tmp[NDim]; // store unraveled idx.
for (int64_t tx = 0; tx < gdata->out_len; ++tx) {
Unravel(tx, gdata->ndim, gdata->out_shape, gdata->out_stride, tmp);
DType lhs = Functors::Read(lhsoff +
Ravel(tmp, gdata->ndim, gdata->lhs_shape, gdata->lhs_stride));
DType rhs = Functors::Read(rhsoff +
Ravel(tmp, gdata->ndim, gdata->rhs_shape, gdata->rhs_stride));
DType out = Functors::Read(outoff + tx);
DType grad_out = Functors::Read(gradoutoff + tx);
DType e = Functors::Op(lhs, rhs);
DType grad_e = grad_out * Functors::BackwardWrite(e, out);
if (Mode == binary_op::kGradLhs || Mode == binary_op::kGradBoth) {
DType grad_lhs = grad_e * Functors::BackwardOpLhs(lhs, rhs, e);
#pragma omp atomic
gradlhsoff[tx] += grad_lhs;
}
if (Mode == binary_op::kGradRhs || Mode == binary_op::kGradBoth) {
DType grad_rhs = grad_e * Functors::BackwardOpRhs(lhs, rhs, e);
#pragma omp atomic
gradrhsoff[tx] += grad_rhs;
}
}
}
};
// Auxiliary template used in UDF.
template <typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
struct BackwardFunctorsTempl {
static inline Idx SelectOut(
Idx src, Idx edge, Idx dst) {
typedef typename OutSelector<Reducer>::Type OutTarget;
return SwitchSrcDst<OutTarget>::Type::Call(src, edge, dst);
}
static inline Idx SelectLeft(
Idx src, Idx edge, Idx dst) {
return LeftSelector::Call(src, edge, dst);
}
static inline Idx SelectRight(
Idx src, Idx edge, Idx dst) {
return RightSelector::Call(src, edge, dst);
}
static inline DType Op(DType lhs, DType rhs) {
return BinaryOp::Call(lhs, rhs);
}
static inline DType Read(DType* addr) {
return *addr;
}
static inline void Write(DType* addr, DType val) {
Reducer::Call(addr, val);
}
static inline Idx GetId(Idx id, Idx* id_map) {
return *(id_map + id);
}
static inline DType BackwardWrite(DType val, DType accum) {
return Reducer::BackwardCall(val, accum);
}
static inline DType BackwardOpLhs(DType lhs, DType rhs, DType out) {
return BinaryOp::BackwardLhs(lhs, rhs, out);
}
static inline DType BackwardOpRhs(DType lhs, DType rhs, DType out) {
return BinaryOp::BackwardRhs(lhs, rhs, out);
}
};
typedef minigun::advance::Config<true, minigun::advance::kV2N> AdvanceConfig;
} // namespace cpu
// Template implementation of BackwardBinaryReduce operator.
template <int XPU, int Mode, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBackwardBinaryReduce(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<Idx, DType>* gdata) {
  // For backward computation, we use the reverse CSR and switch dst and src.
  // This benefits the most common src_op_edge or copy_src case, because the
  // gradients of src are then aggregated into the destination buffer, which
  // reduces contention on the atomic adds.
auto incsr = graph->GetInCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(incsr->indptr(), incsr->indices());
typedef cpu::BackwardFunctorsTempl<Idx, DType,
typename SwitchSrcDst<LeftSelector>::Type,
typename SwitchSrcDst<RightSelector>::Type,
BinaryOp, Reducer> Functors;
typedef cpu::BackwardBinaryReduce<Mode, Idx, DType, Functors> UDF;
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge
&& gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge
&& gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cpu::AdvanceConfig, BackwardGData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// The following macro is used to generate explicit instantiations of the
// template operator.
#define GEN_BACKWARD_DEFINE(mode, dtype, lhs_tgt, rhs_tgt, op) \
template void CallBackwardBinaryReduce<XPU, \
mode, IDX, dtype, \
lhs_tgt, rhs_tgt, \
op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
BackwardGData<IDX, dtype>* gdata);
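// For example, EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET,
// GEN_BACKWARD_DEFINE) in the per-reducer .cc files instantiates
// CallBackwardBinaryReduce for every combination of backward mode
// (kGradLhs/kGradRhs/kGradBoth), data type (float) and (lhs, rhs, op) target.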
// Template implementation of BackwardBinaryReduce with broadcasting operator.
template <int XPU, int Mode, int NDim, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBackwardBinaryReduceBcast(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardBcastGData<NDim, Idx, DType>* gdata) {
  // For backward computation, we use the reverse CSR and switch dst and src.
  // This benefits the most common src_op_edge or copy_src case, because the
  // gradients of src are then aggregated into the destination buffer, which
  // reduces contention on the atomic adds.
auto incsr = graph->GetInCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(incsr->indptr(), incsr->indices());
typedef cpu::BackwardFunctorsTempl<Idx, DType,
typename SwitchSrcDst<LeftSelector>::Type,
typename SwitchSrcDst<RightSelector>::Type,
BinaryOp, Reducer> Functors;
typedef cpu::BackwardBinaryReduceBcast<Mode, NDim, Idx, DType, Functors> UDF;
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge
&& gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge
&& gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cpu::AdvanceConfig,
BackwardBcastGData<NDim, Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// The following macro is used to generate explicit instantiations of the
// template operator.
#define GEN_BACKWARD_BCAST_DEFINE(mode, ndim, dtype, lhs_tgt, rhs_tgt, op) \
template void CallBackwardBinaryReduceBcast<XPU, \
mode, ndim, IDX, dtype, \
lhs_tgt, rhs_tgt, \
op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
BackwardBcastGData<ndim, IDX, dtype>* gdata);
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CPU_BACKWARD_BINARY_REDUCE_IMPL_H_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_bcast_reduce_max.cc
 * \brief CPU kernels for broadcasting binary reduce max
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
#define REDUCER ReduceMax
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_bcast_reduce_min.cc
 * \brief CPU kernels for broadcasting binary reduce min
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
#define REDUCER ReduceMin
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_bcast_reduce_none.cc
 * \brief CPU kernels for broadcasting binary reduce none
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
#define REDUCER ReduceNone
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_bcast_reduce_prod.cc
 * \brief CPU kernels for broadcasting binary reduce prod
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
#define REDUCER ReduceProd
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_bcast_reduce_sum.cc
 * \brief CPU kernels for broadcasting binary reduce sum
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
#define REDUCER ReduceSum
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_reduce_impl.cc
* \brief Binary reduce implementation on CPU.
*/
#include "../binary_reduce_impl.h"
using dgl::runtime::NDArray;
namespace dgl {
namespace kernel {
template void BinaryReduceImpl<kDLCPU>(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping);
template void BinaryReduceBcastImpl<kDLCPU>(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping);
template void BackwardBinaryReduceImpl<kDLCPU>(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
NDArray lhs_mapping, NDArray rhs_mapping, NDArray out_mapping,
NDArray lhs_data, NDArray rhs_data, NDArray out_data,
NDArray grad_out_data,
NDArray grad_lhs_data, NDArray grad_rhs_data);
template void BackwardBinaryReduceBcastImpl<kDLCPU>(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs_tgt, binary_op::Target rhs_tgt,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping,
runtime::NDArray lhs, runtime::NDArray rhs, runtime::NDArray out, runtime::NDArray grad_out,
runtime::NDArray grad_lhs, runtime::NDArray grad_rhs);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_reduce_impl.h
* \brief Minigun CPU UDFs for binary reduce
*/
#ifndef DGL_KERNEL_CPU_BINARY_REDUCE_IMPL_H_
#define DGL_KERNEL_CPU_BINARY_REDUCE_IMPL_H_
#include <minigun/minigun.h>
#include <dgl/immutable_graph.h>
#include <algorithm>
#include "../binary_reduce_impl_decl.h"
#include "../utils.h"
#include "./functor.h"
namespace dgl {
namespace kernel {
namespace cpu {
// Minigun UDF to compute binary reduce.
template <typename Idx, typename DType, typename Functors>
struct BinaryReduce {
static inline bool CondEdge(
Idx src, Idx dst, Idx eid, GData<Idx, DType>* gdata) {
return true;
}
static inline void ApplyEdge(
Idx src, Idx dst, Idx eid, GData<Idx, DType>* gdata) {
const int64_t D = gdata->x_length;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * D;
DType* rhsoff = gdata->rhs_data + rid * D;
DType* outoff = gdata->out_data + oid * D;
for (int64_t tx = 0; tx < D; ++tx) {
DType lhs = Functors::Read(lhsoff + tx);
DType rhs = Functors::Read(rhsoff + tx);
DType out = Functors::Op(lhs, rhs);
Functors::Write(outoff + tx, out);
}
}
};
// Convert a flattened index to a multi-dimensional index (assumes row-major layout).
inline void Unravel(int64_t idx, int ndim,
const int64_t* shape, const int64_t* stride, int64_t* out) {
for (int d = 0; d < ndim; ++d) {
out[d] = (idx / stride[d]) % shape[d];
}
}
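// For example, with ndim = 2, shape = {2, 3} and stride = {3, 1},
// Unravel(5, ...) yields out = {1, 2}, i.e. row 1, column 2 of a 2x3
// row-major array.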
// Convert a multi-dimensional index to a flattened index (assumes row-major layout).
inline int64_t Ravel(const int64_t* idx, int ndim,
const int64_t* shape, const int64_t* stride) {
int64_t out = 0;
for (int d = 0; d < ndim; ++d) {
out += std::min(idx[d], shape[d] - 1) * stride[d];
}
return out;
}
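// For example, with ndim = 2, shape = {2, 3} and stride = {3, 1},
// Ravel({1, 2}, ...) returns 1 * 3 + 2 * 1 = 5. Broadcasting comes from the
// std::min clamp: with shape = {1, 3} the first index is clamped to 0, so a
// size-1 dimension always reads position 0.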
// Minigun UDF to compute binary reduce with broadcasting.
template <int NDim, typename Idx, typename DType, typename Functors>
struct BinaryReduceBcast {
static inline bool CondEdge(
Idx src, Idx dst, Idx eid, BcastGData<NDim, Idx, DType>* gdata) {
return true;
}
static inline void ApplyEdge(
Idx src, Idx dst, Idx eid, BcastGData<NDim, Idx, DType>* gdata) {
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * gdata->lhs_len;
DType* rhsoff = gdata->rhs_data + rid * gdata->rhs_len;
DType* outoff = gdata->out_data + oid * gdata->out_len;
int64_t tmp[NDim]; // store unraveled idx.
for (int64_t tx = 0; tx < gdata->out_len; ++tx) {
Unravel(tx, gdata->ndim, gdata->out_shape, gdata->out_stride, tmp);
DType lhs = Functors::Read(lhsoff +
Ravel(tmp, gdata->ndim, gdata->lhs_shape, gdata->lhs_stride));
DType rhs = Functors::Read(rhsoff +
Ravel(tmp, gdata->ndim, gdata->rhs_shape, gdata->rhs_stride));
DType out = Functors::Op(lhs, rhs);
Functors::Write(outoff + tx, out);
}
}
};
// Auxiliary template used in UDF.
template <typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
struct FunctorsTempl {
static inline Idx SelectOut(
Idx src, Idx edge, Idx dst) {
return OutSelector<Reducer>::Type::Call(src, edge, dst);
}
static inline Idx SelectLeft(
Idx src, Idx edge, Idx dst) {
return LeftSelector::Call(src, edge, dst);
}
static inline Idx SelectRight(
Idx src, Idx edge, Idx dst) {
return RightSelector::Call(src, edge, dst);
}
static inline DType Op(DType lhs, DType rhs) {
return BinaryOp::Call(lhs, rhs);
}
static inline DType Read(DType* addr) {
return *addr;
}
static inline void Write(DType* addr, DType val) {
Reducer::Call(addr, val);
}
static inline Idx GetId(Idx id, Idx* id_map) {
return *(id_map + id);
}
};
typedef minigun::advance::Config<true, minigun::advance::kV2N> AdvanceConfig;
} // namespace cpu
// Template implementation of BinaryReduce operator.
template <int XPU, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBinaryReduce(const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<Idx, DType>* gdata) {
typedef cpu::FunctorsTempl<Idx, DType, LeftSelector,
RightSelector, BinaryOp, Reducer>
Functors;
typedef cpu::BinaryReduce<Idx, DType, Functors> UDF;
// csr
auto outcsr = graph->GetOutCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(outcsr->indptr(), outcsr->indices());
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge && gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge && gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cpu::AdvanceConfig, GData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// Template implementation of BinaryReduce broadcasting operator.
template <int XPU, int NDim, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBinaryReduceBcast(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BcastGData<NDim, Idx, DType>* gdata) {
typedef cpu::FunctorsTempl<Idx, DType, LeftSelector,
RightSelector, BinaryOp, Reducer>
Functors;
typedef cpu::BinaryReduceBcast<NDim, Idx, DType, Functors> UDF;
// csr
auto outcsr = graph->GetOutCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(outcsr->indptr(), outcsr->indices());
// If the user-provided mapping is null and the target is edge data, replace
// the mapping with the edge ids stored in the CSR graph so that the edge
// data is read and written correctly.
if (LeftSelector::target == binary_op::kEdge && gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge && gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cpu::AdvanceConfig,
BcastGData<NDim, Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// The following macros generate explicit specializations of the template
// operators above.
#define GEN_DEFINE(dtype, lhs_tgt, rhs_tgt, op) \
template void CallBinaryReduce<XPU, IDX, \
dtype, lhs_tgt, rhs_tgt, op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
GData<IDX, dtype>* gdata);
#define GEN_BCAST_DEFINE(ndim, dtype, lhs_tgt, rhs_tgt, op) \
template void CallBinaryReduceBcast<XPU, ndim, IDX, dtype, \
lhs_tgt, rhs_tgt, \
op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
BcastGData<ndim, IDX, dtype>* gdata);
#define EVAL(F, ...) MSVC_EXPAND(F(__VA_ARGS__))
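// For illustration, with REDUCER defined as ReduceSum, XPU as kDLCPU and IDX as
// int32_t (as in the .cc files that include this header), a hypothetical
// GEN_DEFINE(float, SelectSrc, SelectDst, BinaryAdd) would expand to roughly
//   template void CallBinaryReduce<kDLCPU, int32_t, float, SelectSrc, SelectDst,
//       BinaryAdd<float>, ReduceSum<kDLCPU, float>>(
//       const minigun::advance::RuntimeConfig& rtcfg,
//       const ImmutableGraph* graph,
//       GData<int32_t, float>* gdata);
// EVAL applies a generator macro to its arguments; the MSVC_EXPAND wrapper
// forces an extra expansion pass to work around MSVC's __VA_ARGS__ handling.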
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CPU_BINARY_REDUCE_IMPL_H_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_reduce_max.cc
* \brief CPU kernels for binary reduce max
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
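// Explicitly instantiate the forward and backward binary-reduce-max kernels
// for every supported data type and operand-target combination (the GEN_DTYPE,
// GEN_OP_TARGET and GEN_BACKWARD_* generator macros are provided by the
// included implementation headers), once for 32-bit and once for 64-bit
// indices.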
#define REDUCER ReduceMax
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_reduce_min.cc
* \brief CPU kernels for binary reduce min
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
#define REDUCER ReduceMin
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_reduce_none.cc
* \brief CPU kernels for binary reduce none
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
#define REDUCER ReduceNone
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_reduce_prod.cc
* \brief CPU kernels for binary reduce prod
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
#define REDUCER ReduceProd
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/binary_reduce_sum.cc
* \brief CPU kernels for binary reduce sum
*/
#include "./binary_reduce_impl.h"
#include "./backward_binary_reduce_impl.h"
namespace dgl {
namespace kernel {
#define REDUCER ReduceSum
#define XPU kDLCPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
#define IDX int64_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
#undef IDX
} // namespace kernel
} // namespace dgl