Commit 653428bd authored by Lingfan Yu, committed by Minjie Wang

[Feature][Kernel] DGL kernel support (#596)

* [Kernel] Minigun integration and fused kernel support (#519)

* kernel interface

* add minigun

* Add cuda build

* functors

* working on binary elewise

* binary reduce

* change kernel interface

* WIP

* wip

* fix minigun

* compile

* binary reduce kernels

* compile

* simple test passed

* more reducers

* fix thrust problem

* fix cmake

* fix cmake; add proper guard for atomic

* WIP: bcast

* WIP

* bcast kernels

* update to new minigun pass-by-value practice

* broadcasting dim

* add copy src and copy edge

* fix linking

* fix none array problem

* fix copy edge

* add device_type and device_id to backend operator

* cache csr adj, remove cache for adjmat and incmat

* custom ops in backend and pytorch impl

* change dgl-mg kernel python interface

* add id_mapping var

* clean up plus v2e spmv schedule

* spmv schedule & clean up fall back

* symbolic message and reduce func, remove bundle func

* new executors

* new backend interface for dgl kernels and pytorch impl

* minor fix

* fix

* fix docstring, comments, func names

* nodeflow

* fix message id mapping and bugs...

* pytorch test case & fix

* backward binary reduce

* fix bug

* WIP: cusparse

* change to int32 csr for cusparse workaround

* disable cusparse

* change back to int64

* broadcasting backward

* cusparse; WIP: add rev_csr

* unit test for kernels

* pytorch backward with dgl kernel

* edge softmax

* fix backward

* improve softmax

* cache edge on device

* cache mappings on device

* fix partial forward code

* cusparse done

* copy_src_sum with cusparse

* rm id getter

* reduce grad for broadcast

* copy edge reduce backward

* kernel unit test for broadcasting

* full kernel unit test

* add cpu kernels

* edge softmax unit test

* missing ref

* fix compile and small bugs

* fix bug in bcast

* Add backward both

* fix torch utests

* expose infershape

* create out tensor in python

* fix c++ lint

* [Kernel] Add GPU utest and kernel utest (#524)

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* [Kernel] Update kernel branch (#550)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* Fixing typo in JTNN after interface change (#536)

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix lint

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* [Kernel] Update kernel branch (#576)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* Fixing typo in JTNN after interface change (#536)

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix lint

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* all demos use python-3 (#555)

* [DEMO] Reproduce numbers of distributed training in AMLC giant graph paper (#556)

* update

* update

* update

* update num_hops

* fix bug

* update

* report numbers of distributed training in AMLC giant graph paper

* [DEMO] Remove duplicate code for sampling (#557)

* update

* update

* re-use single-machine code

* update

* use relative path

* update

* update

* update

* add __init__.py

* add __init__.py

* import sys, os

* fix typo

* update

* [Perf] Improve performance of graph store. (#554)

* fix.

* use inplace.

* move to shared memory graph store.

* fix.

* add more unit tests.

* fix.

* fix test.

* fix test.

* disable test.

* fix.

* [BUGFIX] fix a bug in edge_ids (#560)

* add test.

* fix compute.

* fix test.

* turn on test.

* fix a bug.

* add test.

* fix.

* disable test.

* [DEMO] Add Pytorch demo for distributed sampler (#562)

* update

* update

* update

* add sender

* update

* remove duplicate code

* [Test] Add gtest to project (#547)

* add gtest module

* add gtest

* fix

* Update CMakeLists.txt

* Update README.md

* [Perf] lazily create msg_index. (#563)

* lazily create msg_index.

* update test.

* [BUGFIX] fix bugs for running GCN on giant graphs. (#561)

* load mxnet csr.

* enable load large csr.

* fix

* fix.

* fix int overflow.

* fix test.

* [BugFix] Fix error when bfs_level = 0 in Entity Classification with RGCN (#559)

* [DEMO] Update demo of distributed sampler (#564)

* update

* update

* update demo

* add network cpp test (#565)

* Add unittest for C++ RPC (#566)

* [CI] Fix CI for cpp test (#570)

* fix CI for cpp test

* update port number

* [Docker] update docker image (#575)

* update docker image

* specify lint version

* rm torch import from unified tests

* [Kernel][Scheduler][MXNet] Scheduler for DGL kernels and MXNet backend support (#541)

* [Model] add multiprocessing training with sampling. (#484)

* reorganize sampling code.

* add multi-process training.

* speed up gcn_cv

* fix graphsage_cv.

* add new API in graph store.

* update barrier impl.

* support both local and distributed training.

* fix multiprocess train.

* fix.

* fix barrier.

* add script for loading data.

* multiprocessing sampling.

* accel training.

* replace pull with spmv for speedup.

* nodeflow copy from parent with context.

* enable GPU.

* fix a bug in graph store.

* enable multi-GPU training.

* fix lint.

* add comments.

* rename to run_store_server.py

* fix gcn_cv.

* fix a minor bug in sampler.

* handle error better in graph store.

* improve graphsage_cv for distributed mode.

* update README.

* fix.

* update.

* [Tutorial] add sampling tutorial. (#522)

* add sampling tutorial.

* add readme

* update author list.

* fix indent in the code.

* rename the file.

* update tutorial.

* fix the last API.

* update image.

* [BUGFIX] fix the problems in the sampling tutorial. (#523)

* add index.

* update.

* update tutorial.

* fix gpu utest

* cuda utest runnable

* temp disable test nodeflow; unified test for kernel

* cuda test kernel done

* edge softmax module

* WIP

* Fixing typo in JTNN after interface change (#536)

* mxnet backend support

* improve reduce grad

* add max to unittest backend

* fix kernel unittest

* [BugFix] Fix getting src and dst id of ALL edges in NodeFlow.apply_block (#515)

* lint

* lint

* win build

* [Bug Fix] Fix inplace op at backend (#546)

* Fix inplace operation

* fix line separator

* [Feature] Add batch and unbatch for immutable graph (#539)

* Add batch and unbatch for immutable graph

* fix line separator

* fix lint

* remove unnecessary include

* fix code review

* [BUGFix] Improve multi-processing training (#526)

* fix.

* add comment.

* remove.

* temp fix.

* initialize for shared memory.

* fix graphsage.

* fix gcn.

* add more unit tests.

* add more tests.

* avoid creating shared-memory exclusively.

* redefine remote initializer.

* improve initializer.

* fix unit test.

* fix lint.

* fix lint.

* initialize data in the graph store server properly.

* fix test.

* fix test.

* fix test.

* small fix.

* add comments.

* cleanup server.

* test graph store with a random port.

* print.

* print to stderr.

* test1

* test2

* remove comment.

* adjust the initializer signature.

* try

* fix

* fix

* fix

* fix

* fix

* try

* test

* test

* test

* try

* try

* try

* test

* fix

* try gen_target

* fix gen_target

* fix msvc var_args expand issue

* fix

* [API] update graph store API. (#549)

* add init_ndata and init_edata in DGLGraph.

* adjust SharedMemoryGraph API.

* print warning.

* fix comment.

* update example

* fix.

* fix examples.

* add unit tests.

* add comments.

* [Refactor] Immutable graph index (#543)

* WIP

* header

* WIP .cc

* WIP

* transpose

* wip

* immutable graph .h and .cc

* WIP: nodeflow.cc

* compile

* remove all tmp dl managed ctx; they caused refcount issue

* one simple test

* WIP: testing

* test_graph

* fix graph index

* fix bug in sampler; pass pytorch utest

* WIP on mxnet

* fix lint

* fix mxnet unittest w/ unfortunate workaround

* fix msvc

* fix lint

* SliceRows and test_nodeflow

* resolve reviews

* resolve reviews

* try fix win ci

* try fix win ci

* poke win ci again

* poke

* lazy multigraph flag; stackoverflow error

* revert node subgraph test

* lazy object

* try fix win build

* try fix win build

* poke ci

* fix build script

* fix compile

* add a todo

* fix reviews

* fix compile

* WIP

* WIP

* all demos use python-3 (#555)

* ToImmutable and CopyTo

* [DEMO] Reproduce numbers of distributed training in AMLC giant graph paper (#556)

* update

* update

* update

* update num_hops

* fix bug

* update

* report numbers of distributed training in AMLC giant graph paper

* [DEMO] Remove duplicate code for sampling (#557)

* update

* update

* re-use single-machine code

* update

* use relative path

* update

* update

* update

* add __init__.py

* add __init__.py

* import sys, os

* fix typo

* update

* [Perf] Improve performance of graph store. (#554)

* fix.

* use inplace.

* move to shared memory graph store.

* fix.

* add more unit tests.

* fix.

* fix test.

* fix test.

* disable test.

* fix.

* [BUGFIX] fix a bug in edge_ids (#560)

* add test.

* fix compute.

* fix test.

* turn on test.

* fix a bug.

* add test.

* fix.

* disable test.

* DGLRetValue DGLContext conversion

* [DEMO] Add Pytorch demo for distributed sampler (#562)

* update

* update

* update

* add sender

* update

* remove duplicate code

* [Test] Add gtest to project (#547)

* add gtest module

* add gtest

* fix

* Update CMakeLists.txt

* Update README.md

* Add support to convert immutable graph to 32 bits

* [Perf] lazily create msg_index. (#563)

* lazily create msg_index.

* update test.

* fix binary reduce following new minigun template

* enable both int64 and int32 kernels

* [BUGFIX] fix bugs for running GCN on giant graphs. (#561)

* load mxnet csr.

* enable load large csr.

* fix

* fix.

* fix int overflow.

* fix test.

* new kernel interface done for CPU

* docstring

* rename & docstring

* copy reduce and backward

* [BugFix] Fix error when bfs_level = 0 in Entity Classification with RGCN (#559)

* [DEMO] Update demo of distributed sampler (#564)

* update

* update

* update demo

* adapt cuda kernels to the new interface

* add network cpp test (#565)

* fix bug

* Add unittest for C++ RPC (#566)

* [CI] Fix CI for cpp test (#570)

* fix CI for cpp test

* update port number

* [Docker] update docker image (#575)

* update docker image

* specify lint version

* rm torch import from unified tests

* remove pytorch-specific test_function

* fix unittest

* fix

* fix unittest backend bug in converting tensor to numpy array

* fix

* mxnet version

* [BUGFIX] fix for MXNet 1.5. (#552)

* remove clone.

* turn on numpy compatible.

* Revert "remove clone."

This reverts commit 17bbf76ed72ff178df6b3f35addc428048672457.

* revert format changes

* fix mxnet api name

* revert mistakes in previous revert

* roll back CI to 20190523 build

* fix unittest

* disable test_shared_mem_store.py for now

* remove mxnet/test_specialization.py

* sync win64 test script

* fix lowercase

* missing backend in gpu unit test

* transpose to get forward graph

* pass update all

* add sanity check

* passing test_specialization.py

* fix and pass test_function

* fix check

* fix pytorch softmax

* mxnet kernels

* c++ lint

* pylint

* try

* win build

* fix

* win

* ci enable gpu build

* init submodule recursively

* backend docstring

* try

* test win dev

* doc string

* disable pytorch test_nn

* try to fix windows issue

* bug fixed, revert changes

* [Test] fix CI. (#586)

* disable unit test in mxnet tutorial.

* retry socket connection.

* roll back to set_np_compat

* try to fix multi-processing test hangs when it fails.

* fix test.

* fix.

* doc string

* doc string and clean up

* missing field in ctypes

* fix node flow schedule and unit test

* rename

* pylint

* copy from parent default context

* fix unit test script

* fix

* demo bug in nodeflow gpu test

* [Kernel][Bugfix] fix nodeflow bug (#604)

* fix nodeflow bug

* remove debug code

* add build gtest option

* fix cmake; fix graph index bug in spmv.py

* remove clone

* fix div rhs grad bug

* [Kernel] Support full builtin method, edge softmax and unit tests (#605)

* add full builtin support

* unit test

* unit test backend

* edge softmax

* apply edge with builtin

* fix kernel unit test

* disable mxnet test_shared_mem_store

* gen builtin reduce

* enable mxnet gpu unittest

* revert some changes

* docstring

* add note for the hack

* [Kernel][Unittest][CI] Fix MXNet GPU CI (#607)

* update docker image for MXNet GPU CI

* force all dgl graph input and output on CPU

* fix gpu unittest

* speedup compilation

* add some comments

* lint

* add more comments

* fix as requested

* add some comments

* comment

* lint

* lint

* update pylint

* fix as requested

* lint

* lint

* lint

* docstrings of python DGL kernel entries

* disable lint warnings on arguments in kernel.py

* fix docstring in scheduler

* fix some bug in unittest; try again

* Revert "Merge branch 'kernel' of github.com:zzhang-cn/dgl into kernel"

This reverts commit 1d2299e68b004182ea6130b088de1f1122b18a49, reversing
changes made to ddc97fbf1bec2b7815c0da7c74f7ecb2f428889b.

* Revert "fix some bug in unittest; try again"

This reverts commit ddc97fbf1bec2b7815c0da7c74f7ecb2f428889b.

* more comprehensive kernel test

* remove shape check in test_specialization
parent da0c92a2
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/functor.h
* \brief Functors for template on CPU
*/
#ifndef DGL_KERNEL_CPU_FUNCTOR_H_
#define DGL_KERNEL_CPU_FUNCTOR_H_
#include <dmlc/omp.h>
#include <algorithm>
#include "../binary_reduce_common.h"
namespace dgl {
namespace kernel {
// Reducer functor specialization
template <typename DType>
struct ReduceSum<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
#pragma omp atomic
*addr += val;
}
static DType BackwardCall(DType val, DType accum) {
return 1;
}
};
template <typename DType>
struct ReduceMax<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
#pragma omp critical
*addr = std::max(*addr, val);
}
static DType BackwardCall(DType val, DType accum) {
return static_cast<DType>(val == accum);
}
};
template <typename DType>
struct ReduceMin<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
#pragma omp critical
*addr = std::min(*addr, val);
}
static DType BackwardCall(DType val, DType accum) {
return static_cast<DType>(val == accum);
}
};
template <typename DType>
struct ReduceProd<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
#pragma omp atomic
*addr *= val;
}
static DType BackwardCall(DType val, DType accum) {
return accum / val;
}
};
template <typename DType>
struct ReduceNone<kDLCPU, DType> {
static void Call(DType* addr, DType val) {
*addr = val;
}
static DType BackwardCall(DType val, DType accum) {
return 1;
}
};
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CPU_FUNCTOR_H_
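For intuition, here is a minimal, self-contained sketch (not part of the commit; SimpleReduceSum/SimpleReduceMax and ReduceMessages are illustrative stand-ins, single-threaded and without the OpenMP pragmas) of how reducer functors in this style are driven: the caller walks the edges and lets Reducer::Call fold each per-edge value into the destination node's accumulator, while BackwardCall supplies the corresponding gradient factor.

#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified stand-in mirroring ReduceSum<kDLCPU, DType> above.
struct SimpleReduceSum {
  static void Call(float* addr, float val) { *addr += val; }
  static float BackwardCall(float /*val*/, float /*accum*/) { return 1.f; }
};

// Simplified stand-in mirroring ReduceMax<kDLCPU, DType> above.
struct SimpleReduceMax {
  static void Call(float* addr, float val) { *addr = std::max(*addr, val); }
  static float BackwardCall(float val, float accum) { return static_cast<float>(val == accum); }
};

// Fold one message per edge into its destination accumulator, the way the kernels do:
// the reducer only defines how a single value combines with the accumulator.
template <typename Reducer>
void ReduceMessages(const std::vector<int>& dst, const std::vector<float>& msg,
                    std::vector<float>* out) {
  for (size_t e = 0; e < dst.size(); ++e) {
    Reducer::Call(&(*out)[dst[e]], msg[e]);
  }
}

int main() {
  const std::vector<int> dst = {0, 0, 1};          // destination node of each edge
  const std::vector<float> msg = {1.f, 2.f, 5.f};  // per-edge messages
  std::vector<float> out(2, 0.f);                  // sum accumulators, zero-initialized
  ReduceMessages<SimpleReduceSum>(dst, msg, &out);
  std::printf("%g %g\n", out[0], out[1]);          // prints: 3 5
  return 0;
}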
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cpu/utils.cc
* \brief Utility function implementations on CPU
*/
#include "../utils.h"
namespace dgl {
namespace kernel {
namespace utils {
template <int XPU, typename DType>
void Fill(const DLContext& ctx, DType* ptr, size_t length, DType val) {
for (size_t i = 0; i < length; ++i) {
*(ptr + i) = val;
}
}
template void Fill<kDLCPU, float>(const DLContext& ctx, float* ptr, size_t length, float val);
template void Fill<kDLCPU, double>(const DLContext& ctx, double* ptr, size_t length, double val);
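// Usage note (illustration): the CUDA path later in this commit fills an all-ones array
// via utils::Fill<kDLGPU>(rtcfg.ctx, valptr, nnz, static_cast<DType>(1.)); the CPU
// specializations above are invoked the same way, e.g. (hypothetical ctx/buf/n):
//   utils::Fill<kDLCPU, float>(ctx, buf, n, 1.0f);  // sets buf[0..n-1] to 1.0f
// The DLContext argument is unused by this CPU implementation.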
} // namespace utils
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/atomic.cuh
* \brief Atomic functions
*/
#ifndef DGL_KERNEL_CUDA_ATOMIC_H_
#define DGL_KERNEL_CUDA_ATOMIC_H_
#include <cuda_runtime.h>
#if __CUDA_ARCH__ >= 600
#include <cuda_fp16.h>
#endif
namespace dgl {
namespace kernel {
namespace cuda {
// Type trait for selecting code type
template <int Bytes> struct Code { };
template <> struct Code<4> {
typedef unsigned int Type;
};
template <> struct Code<8> {
typedef unsigned long long int Type;
};
// Helper class for converting to/from atomicCAS compatible types.
template <typename T> struct Cast {
typedef typename Code<sizeof(T)>::Type Type;
static __device__ __forceinline__ Type Encode(T val) {
return static_cast<Type>(val);
}
static __device__ __forceinline__ T Decode(Type code) {
return static_cast<T>(code);
}
};
template <> struct Cast<float> {
typedef Code<sizeof(float)>::Type Type;
static __device__ __forceinline__ Type Encode(float val) {
return __float_as_uint(val);
}
static __device__ __forceinline__ float Decode(Type code) {
return __uint_as_float(code);
}
};
template <> struct Cast<double> {
typedef Code<sizeof(double)>::Type Type;
static __device__ __forceinline__ Type Encode(double val) {
return __double_as_longlong(val);
}
static __device__ __forceinline__ double Decode(Type code) {
return __longlong_as_double(code);
}
};
#define DEFINE_ATOMIC(NAME) \
template <typename T> \
__device__ __forceinline__ T Atomic##NAME(T* addr, T val) { \
typedef typename Cast<T>::Type CT; \
CT* addr_as_ui = reinterpret_cast<CT*>(addr); \
CT old = *addr_as_ui; \
CT assumed = old; \
do { \
assumed = old; \
old = atomicCAS(addr_as_ui, assumed, \
Cast<T>::Encode(OP(val, Cast<T>::Decode(old)))); \
} while (assumed != old); \
return Cast<T>::Decode(old); \
}
#define OP(a, b) max(a, b)
DEFINE_ATOMIC(Max)
#undef OP
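// Illustration only (not an extra definition): for T = float, DEFINE_ATOMIC(Max) above
// expands to roughly the following, emulating an atomic float max via atomicCAS on the
// 32-bit bit pattern:
//
//   __device__ __forceinline__ float AtomicMax(float* addr, float val) {
//     unsigned int* addr_as_ui = reinterpret_cast<unsigned int*>(addr);
//     unsigned int old = *addr_as_ui;
//     unsigned int assumed = old;
//     do {
//       assumed = old;
//       // decode the current value, combine with val, re-encode, and try to swap it in
//       old = atomicCAS(addr_as_ui, assumed,
//                       __float_as_uint(max(val, __uint_as_float(old))));
//     } while (assumed != old);  // retry if another thread updated *addr in between
//     return __uint_as_float(old);
//   }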
#define OP(a, b) min(a, b)
DEFINE_ATOMIC(Min)
#undef OP
#define OP(a, b) a + b
DEFINE_ATOMIC(Add)
#undef OP
#if __CUDA_ARCH__ >= 200
template <>
__device__ __forceinline__ float AtomicAdd<float>(float* addr, float val) {
return atomicAdd(addr, val);
}
#endif // __CUDA_ARCH__
#if __CUDA_ARCH__ >= 600
template <>
__device__ __forceinline__ double AtomicAdd<double>(double* addr, double val) {
return atomicAdd(addr, val);
}
#endif
#if defined(CUDART_VERSION) && CUDART_VERSION >= 10000
#if __CUDA_ARCH__ >= 600
template <>
__device__ __forceinline__ __half2 AtomicAdd<__half2>(__half2* addr, __half2 val) {
return atomicAdd(addr, val);
}
#endif // __CUDA_ARCH__
#if __CUDA_ARCH__ >= 700
template <>
__device__ __forceinline__ __half AtomicAdd<__half>(__half* addr, __half val) {
return atomicAdd(addr, val);
}
#endif // __CUDA_ARCH__
#endif
#define OP(a, b) a * b
DEFINE_ATOMIC(Mul)
#undef OP
} // namespace cuda
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CUDA_ATOMIC_H_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/backward_binary_reduce_impl.cuh
* \brief Minigun CUDA UDFs for backward binary reduce
*/
#ifndef DGL_KERNEL_CUDA_BACKWARD_BINARY_REDUCE_IMPL_CUH_
#define DGL_KERNEL_CUDA_BACKWARD_BINARY_REDUCE_IMPL_CUH_
#include <minigun/minigun.h>
#include <dgl/immutable_graph.h>
#include "../binary_reduce_impl_decl.h"
#include "../utils.h"
#include "./functor.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
// Minigun UDF to compute backward binary reduce.
template <int Mode, typename Idx, typename DType, typename Functors>
struct BackwardBinaryReduce {
static __device__ __forceinline__ bool CondEdge(
Idx src, Idx dst, Idx eid, BackwardGData<Idx, DType>* gdata) {
return true;
}
static __device__ __forceinline__ void ApplyEdge(
Idx src, Idx dst, Idx eid, BackwardGData<Idx, DType>* gdata) {
const int64_t D = gdata->x_length;
int64_t tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * D;
DType* rhsoff = gdata->rhs_data + rid * D;
DType* outoff = gdata->out_data + oid * D;
DType* gradlhsoff = gdata->grad_lhs_data + lid * D;
DType* gradrhsoff = gdata->grad_rhs_data + rid * D;
DType* gradoutoff = gdata->grad_out_data + oid * D;
while (tx < D) {
DType lhs = Functors::Read(lhsoff + tx);
DType rhs = Functors::Read(rhsoff + tx);
DType out = Functors::Read(outoff + tx);
DType grad_out = Functors::Read(gradoutoff + tx);
DType e = Functors::Op(lhs, rhs);
DType grad_e = grad_out * Functors::BackwardWrite(e, out);
if (Mode == binary_op::kGradLhs || Mode == binary_op::kGradBoth) {
DType grad_lhs = grad_e * Functors::BackwardOpLhs(lhs, rhs, e);
AtomicAdd(gradlhsoff + tx, grad_lhs);
}
if (Mode == binary_op::kGradRhs || Mode == binary_op::kGradBoth) {
DType grad_rhs = grad_e * Functors::BackwardOpRhs(lhs, rhs, e);
AtomicAdd(gradrhsoff + tx, grad_rhs);
}
tx += stride_x;
}
}
};
// Minigun UDF to compute backward binary reduce with broadcasting.
template <int Mode, int NDim, typename Idx, typename DType, typename Functors>
struct BackwardBinaryReduceBcast {
static __device__ __forceinline__ bool CondEdge(
Idx src, Idx dst, Idx eid, BackwardBcastGData<NDim, Idx, DType>* gdata) {
return true;
}
static __device__ __forceinline__ void ApplyEdge(
Idx src, Idx dst, Idx eid, BackwardBcastGData<NDim, Idx, DType>* gdata) {
int64_t tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * gdata->lhs_len;
DType* rhsoff = gdata->rhs_data + rid * gdata->rhs_len;
DType* outoff = gdata->out_data + oid * gdata->out_len;
DType* gradlhsoff = gdata->grad_lhs_data + lid * gdata->out_len;
DType* gradrhsoff = gdata->grad_rhs_data + rid * gdata->out_len;
DType* gradoutoff = gdata->grad_out_data + oid * gdata->out_len;
int64_t tmp[NDim]; // store unraveled idx.
while (tx < gdata->out_len) {
Unravel(tx, gdata->ndim, gdata->out_shape, gdata->out_stride, tmp);
DType lhs = Functors::Read(lhsoff +
Ravel(tmp, gdata->ndim, gdata->lhs_shape, gdata->lhs_stride));
DType rhs = Functors::Read(rhsoff +
Ravel(tmp, gdata->ndim, gdata->rhs_shape, gdata->rhs_stride));
DType out = Functors::Read(outoff + tx);
DType grad_out = Functors::Read(gradoutoff + tx);
DType e = Functors::Op(lhs, rhs);
DType grad_e = grad_out * Functors::BackwardWrite(e, out);
if (Mode == binary_op::kGradLhs || Mode == binary_op::kGradBoth) {
DType grad_lhs = grad_e * Functors::BackwardOpLhs(lhs, rhs, e);
AtomicAdd(gradlhsoff + tx, grad_lhs);
}
if (Mode == binary_op::kGradRhs || Mode == binary_op::kGradBoth) {
DType grad_rhs = grad_e * Functors::BackwardOpRhs(lhs, rhs, e);
AtomicAdd(gradrhsoff + tx, grad_rhs);
}
tx += stride_x;
}
}
};
// Auxiliary template used in UDF.
template <typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
struct BackwardFunctorsTempl {
static __device__ __forceinline__ Idx SelectOut(
Idx src, Idx edge, Idx dst) {
typedef typename OutSelector<Reducer>::Type OutTarget;
return SwitchSrcDst<OutTarget>::Type::Call(src, edge, dst);
}
static __device__ __forceinline__ Idx SelectLeft(
Idx src, Idx edge, Idx dst) {
return LeftSelector::Call(src, edge, dst);
}
static __device__ __forceinline__ Idx SelectRight(
Idx src, Idx edge, Idx dst) {
return RightSelector::Call(src, edge, dst);
}
static __device__ __forceinline__ DType Op(DType lhs, DType rhs) {
return BinaryOp::Call(lhs, rhs);
}
static __device__ __forceinline__ DType Read(DType* addr) {
return LDGReader<DType>::Call(addr);
}
static __device__ __forceinline__ void Write(DType* addr, DType val) {
Reducer::Call(addr, val);
}
static __device__ __forceinline__ Idx GetId(Idx id, Idx* id_map) {
return LDGReader<Idx>::Call(id_map + id);
}
static __device__ __forceinline__ DType BackwardWrite(DType val, DType accum) {
return Reducer::BackwardCall(val, accum);
}
static __device__ __forceinline__ DType BackwardOpLhs(DType lhs, DType rhs, DType out) {
return BinaryOp::BackwardLhs(lhs, rhs, out);
}
static __device__ __forceinline__ DType BackwardOpRhs(DType lhs, DType rhs, DType out) {
return BinaryOp::BackwardRhs(lhs, rhs, out);
}
};
typedef minigun::advance::Config<true, minigun::advance::kV2N> AdvanceConfig;
} // namespace cuda
// Template implementation of BackwardBinaryReduce operator.
template <int XPU, int Mode, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBackwardBinaryReduce(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<Idx, DType>* gdata) {
// For backward computation, we use reverse csr and switch dst and src.
// This benefits the most common src_op_edge or copy_src case, because the
// gradients of src are now aggregated into destination buffer to reduce
// competition of atomic add.
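// E.g. for copy_src with a sum reducer, the forward pass computes
//   out[v] = sum over edges (u -> v) of lhs[u],
// so the backward pass needs
//   grad_lhs[u] = sum over edges (u -> v) of grad_out[v],
// i.e. an aggregation along the reverse direction, which is exactly what the in-edge
// CSR with src and dst switched provides.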
auto incsr = graph->GetInCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(incsr->indptr(), incsr->indices());
typedef cuda::BackwardFunctorsTempl<Idx, DType,
typename SwitchSrcDst<LeftSelector>::Type,
typename SwitchSrcDst<RightSelector>::Type,
BinaryOp, Reducer> Functors;
typedef cuda::BackwardBinaryReduce<Mode, Idx, DType, Functors> UDF;
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge
&& gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge
&& gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig, BackwardGData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// Following macro is used to generate explicit-specialization of the template
// operator.
#define GEN_BACKWARD_DEFINE(mode, dtype, lhs_tgt, rhs_tgt, op) \
template void CallBackwardBinaryReduce<XPU, \
mode, IDX, dtype, \
lhs_tgt, rhs_tgt, \
op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
BackwardGData<IDX, dtype>* gdata);
// Template implementation of BackwardBinaryReduce with broadcasting operator.
template <int XPU, int Mode, int NDim, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBackwardBinaryReduceBcast(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardBcastGData<NDim, Idx, DType>* gdata) {
// For backward computation, we use reverse csr and switch dst and src.
// This benefits the most common src_op_edge or copy_src case, because the
// gradients of src are now aggregated into destination buffer to reduce
// competition of atomic add.
auto incsr = graph->GetInCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(incsr->indptr(), incsr->indices());
typedef cuda::BackwardFunctorsTempl<Idx, DType,
typename SwitchSrcDst<LeftSelector>::Type,
typename SwitchSrcDst<RightSelector>::Type,
BinaryOp, Reducer> Functors;
typedef cuda::BackwardBinaryReduceBcast<Mode, NDim, Idx, DType, Functors> UDF;
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge
&& gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge
&& gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig,
BackwardBcastGData<NDim, Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// Following macro is used to generate explicit-specialization of the template
// operator.
#define GEN_BACKWARD_BCAST_DEFINE(mode, ndim, dtype, lhs_tgt, rhs_tgt, op) \
template void CallBackwardBinaryReduceBcast<XPU, \
mode, ndim, IDX, dtype, \
lhs_tgt, rhs_tgt, \
op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
BackwardBcastGData<ndim, IDX, dtype>* gdata);
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CUDA_BACKWARD_BINARY_REDUCE_IMPL_CUH_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_max.cu
* \brief CUDA kernels for broadcasting binary reduce max
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceMax
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
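// Note: EVAL(F, ...) simply expands to F(...) (see the EVAL/MSVC_EXPAND macro in
// binary_reduce_impl.cuh), so the two statements above chain the GEN_* generator macros
// to emit explicit instantiations of CallBinaryReduceBcast and CallBackwardBinaryReduceBcast
// for each supported (ndim, dtype, operand target, op) combination, with REDUCER = ReduceMax,
// XPU = kDLGPU and IDX = int32_t as #defined above.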
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_min.cu
* \brief CUDA kernels for broadcasting binary reduce min
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceMin
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_none.cu
* \brief CUDA kernels for broadcasting binary reduce none
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceNone
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_prod.cu
* \brief CUDA kernels for broadcasting binary reduce prod
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceProd
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_bcast_reduce_sum.cu
* \brief CUDA kernels for broadcasting binary reduce sum
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
} // namespace cuda
#define REDUCER ReduceSum
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET, GEN_BCAST_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_NDIM, GEN_DTYPE, GEN_OP_TARGET,
GEN_BACKWARD_BCAST_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_impl.cu
* \brief Binary reduce implementation on cuda.
*/
#include "../binary_reduce_impl.h"
using dgl::runtime::NDArray;
namespace dgl {
namespace kernel {
template void BinaryReduceImpl<kDLGPU>(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping);
template void BinaryReduceBcastImpl<kDLGPU>(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
runtime::NDArray lhs_data, runtime::NDArray rhs_data,
runtime::NDArray out_data,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping,
runtime::NDArray out_mapping);
template void BackwardBinaryReduceImpl<kDLGPU>(
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs, binary_op::Target rhs,
NDArray lhs_mapping, NDArray rhs_mapping, NDArray out_mapping,
NDArray lhs_data, NDArray rhs_data, NDArray out_data,
NDArray grad_out_data,
NDArray grad_lhs_data, NDArray grad_rhs_data);
template void BackwardBinaryReduceBcastImpl<kDLGPU>(
const BcastInfo& info,
const std::string& reducer,
const std::string& op,
const ImmutableGraph* graph,
binary_op::Target lhs_tgt, binary_op::Target rhs_tgt,
runtime::NDArray lhs_mapping, runtime::NDArray rhs_mapping, runtime::NDArray out_mapping,
runtime::NDArray lhs, runtime::NDArray rhs, runtime::NDArray out, runtime::NDArray grad_out,
runtime::NDArray grad_lhs, runtime::NDArray grad_rhs);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_impl.cuh
* \brief Minigun CUDA UDFs for binary reduce
*/
#ifndef DGL_KERNEL_CUDA_BINARY_REDUCE_IMPL_CUH_
#define DGL_KERNEL_CUDA_BINARY_REDUCE_IMPL_CUH_
#include <minigun/minigun.h>
#include <dgl/immutable_graph.h>
#include "../binary_reduce_impl_decl.h"
#include "../utils.h"
#include "./functor.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
// Minigun UDF to compute binary reduce.
template <typename Idx, typename DType, typename Functors>
struct BinaryReduce {
static __device__ __forceinline__ bool CondEdge(
Idx src, Idx dst, Idx eid, GData<Idx, DType>* gdata) {
return true;
}
static __device__ __forceinline__ void ApplyEdge(
Idx src, Idx dst, Idx eid, GData<Idx, DType>* gdata) {
const int64_t D = gdata->x_length;
int64_t tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * D;
DType* rhsoff = gdata->rhs_data + rid * D;
DType* outoff = gdata->out_data + oid * D;
while (tx < D) {
DType lhs = Functors::Read(lhsoff + tx);
DType rhs = Functors::Read(rhsoff + tx);
DType out = Functors::Op(lhs, rhs);
Functors::Write(outoff + tx, out);
tx += stride_x;
}
}
};
// Convert flattened index to multi-dimension index (assume row-major).
__device__ __forceinline__ void Unravel(
int64_t idx, int ndim, const int64_t* shape, const int64_t* stride, int64_t* out) {
for (int d = 0; d < ndim; ++d) {
out[d] = (idx / stride[d]) % shape[d];
}
}
// Convert multi-dimension index to flattened index (assume row-major).
__device__ __forceinline__ int64_t Ravel(
const int64_t* idx, int ndim, const int64_t* shape, const int64_t* stride) {
int64_t out = 0;
for (int d = 0; d < ndim; ++d) {
out += min(idx[d], shape[d] - 1) * stride[d];
}
return out;
}
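// Worked example (illustration): suppose out_shape = {2, 3}, out_stride = {3, 1},
// lhs_shape = {1, 3}, lhs_stride = {3, 1}.  For flattened output index 5, Unravel
// yields {1, 2}; Ravel then clamps the size-1 lhs dimension via min(idx[d], shape[d] - 1)
// and returns 0 * 3 + 2 * 1 = 2, so output element 5 reads lhs element 2 -- the single
// lhs row is broadcast across all output rows.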
// Minigun UDF to compute binary reduce with broadcasting.
template <int NDim, typename Idx, typename DType, typename Functors>
struct BinaryReduceBcast {
static __device__ __forceinline__ bool CondEdge(
Idx src, Idx dst, Idx eid, BcastGData<NDim, Idx, DType>* gdata) {
return true;
}
static __device__ __forceinline__ void ApplyEdge(
Idx src, Idx dst, Idx eid, BcastGData<NDim, Idx, DType>* gdata) {
int64_t tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
Idx lid = Functors::SelectLeft(src, eid, dst);
Idx rid = Functors::SelectRight(src, eid, dst);
Idx oid = Functors::SelectOut(src, eid, dst);
if (gdata->lhs_mapping) {
lid = Functors::GetId(lid, gdata->lhs_mapping);
}
if (gdata->rhs_mapping) {
rid = Functors::GetId(rid, gdata->rhs_mapping);
}
if (gdata->out_mapping) {
oid = Functors::GetId(oid, gdata->out_mapping);
}
DType* lhsoff = gdata->lhs_data + lid * gdata->lhs_len;
DType* rhsoff = gdata->rhs_data + rid * gdata->rhs_len;
DType* outoff = gdata->out_data + oid * gdata->out_len;
int64_t tmp[NDim]; // store unraveled idx.
while (tx < gdata->out_len) {
Unravel(tx, gdata->ndim, gdata->out_shape, gdata->out_stride, tmp);
DType lhs = Functors::Read(lhsoff +
Ravel(tmp, gdata->ndim, gdata->lhs_shape, gdata->lhs_stride));
DType rhs = Functors::Read(rhsoff +
Ravel(tmp, gdata->ndim, gdata->rhs_shape, gdata->rhs_stride));
DType out = Functors::Op(lhs, rhs);
Functors::Write(outoff + tx, out);
tx += stride_x;
}
}
};
// Auxiliary template used in UDF.
template <typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
struct FunctorsTempl {
static __device__ __forceinline__ Idx SelectOut(
Idx src, Idx edge, Idx dst) {
return OutSelector<Reducer>::Type::Call(src, edge, dst);
}
static __device__ __forceinline__ Idx SelectLeft(
Idx src, Idx edge, Idx dst) {
return LeftSelector::Call(src, edge, dst);
}
static __device__ __forceinline__ Idx SelectRight(
Idx src, Idx edge, Idx dst) {
return RightSelector::Call(src, edge, dst);
}
static __device__ __forceinline__ DType Op(DType lhs, DType rhs) {
return BinaryOp::Call(lhs, rhs);
}
static __device__ __forceinline__ DType Read(DType* addr) {
return LDGReader<DType>::Call(addr);
}
static __device__ __forceinline__ void Write(DType* addr, DType val) {
Reducer::Call(addr, val);
}
static __device__ __forceinline__ Idx GetId(Idx id, Idx* id_map) {
return LDGReader<Idx>::Call(id_map + id);
}
};
typedef minigun::advance::Config<true, minigun::advance::kV2N> AdvanceConfig;
} // namespace cuda
// Template implementation of BinaryReduce operator.
template <int XPU, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBinaryReduce(const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<Idx, DType>* gdata) {
typedef cuda::FunctorsTempl<Idx, DType, LeftSelector,
RightSelector, BinaryOp, Reducer>
Functors;
typedef cuda::BinaryReduce<Idx, DType, Functors> UDF;
// csr
auto outcsr = graph->GetOutCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(outcsr->indptr(), outcsr->indices());
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge && gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge && gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig, GData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// Template implementation of BinaryReduce broadcasting operator.
template <int XPU, int NDim, typename Idx, typename DType,
typename LeftSelector, typename RightSelector,
typename BinaryOp, typename Reducer>
void CallBinaryReduceBcast(
const minigun::advance::RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BcastGData<NDim, Idx, DType>* gdata) {
typedef cuda::FunctorsTempl<Idx, DType, LeftSelector,
RightSelector, BinaryOp, Reducer>
Functors;
typedef cuda::BinaryReduceBcast<NDim, Idx, DType, Functors> UDF;
// csr
auto outcsr = graph->GetOutCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(outcsr->indptr(), outcsr->indices());
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge && gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge && gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig,
BcastGData<NDim, Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
// Following macro is used to generate explicit-specialization of the template
// operator.
#define GEN_DEFINE(dtype, lhs_tgt, rhs_tgt, op) \
template void CallBinaryReduce<XPU, IDX, \
dtype, lhs_tgt, rhs_tgt, op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
GData<IDX, dtype>* gdata);
#define GEN_BCAST_DEFINE(ndim, dtype, lhs_tgt, rhs_tgt, op) \
template void CallBinaryReduceBcast<XPU, ndim, IDX, dtype, \
lhs_tgt, rhs_tgt, \
op<dtype>, REDUCER<XPU, dtype>>( \
const minigun::advance::RuntimeConfig& rtcfg, \
const ImmutableGraph* graph, \
BcastGData<ndim, IDX, dtype>* gdata);
#define EVAL(F, ...) MSVC_EXPAND(F(__VA_ARGS__))
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CUDA_BINARY_REDUCE_IMPL_CUH_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_max.cu
* \brief CUDA kernels for binary reduce max
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
#define REDUCER ReduceMax
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE)
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_min.cu
* \brief CUDA kernels for binary reduce min
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
#define REDUCER ReduceMin
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE)
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_none.cu
* \brief CUDA kernels for binary reduce none
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
#define REDUCER ReduceNone
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE)
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_prod.cu
* \brief CUDA kernels for binary reduce prod
*/
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
namespace dgl {
namespace kernel {
#define REDUCER ReduceProd
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE)
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE)
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/binary_reduce_sum.cu
* \brief CUDA kernels for binary reduce sum
*/
#include <dgl/runtime/device_api.h>
#include "../../runtime/cuda/cuda_common.h"
#include "./binary_reduce_impl.cuh"
#include "./backward_binary_reduce_impl.cuh"
#include "../utils.h"
using minigun::advance::RuntimeConfig;
using Csr = minigun::Csr<int32_t>;
namespace dgl {
namespace kernel {
namespace cuda {
// specialization for cusparse
template <typename DType>
cusparseStatus_t Xcsrmm2(cusparseHandle_t handle, cusparseOperation_t transA,
cusparseOperation_t transB, int m, int n, int k, int nnz,
const DType* alpha, const cusparseMatDescr_t descrA,
const DType* csrValA, const int* csrRowPtrA, const int* csrColIndA,
const DType* B, int ldb, const DType* beta, DType* C, int ldc) {
LOG(INFO) << "Not supported dtype";
return CUSPARSE_STATUS_EXECUTION_FAILED;
}
template <>
cusparseStatus_t Xcsrmm2<float>(cusparseHandle_t handle, cusparseOperation_t transA,
cusparseOperation_t transB, int m, int n, int k, int nnz,
const float* alpha, const cusparseMatDescr_t descrA,
const float* csrValA, const int* csrRowPtrA, const int* csrColIndA,
const float* B, int ldb, const float* beta, float* C, int ldc) {
return cusparseScsrmm2(handle, transA, transB, m, n, k, nnz,
alpha, descrA, csrValA, csrRowPtrA, csrColIndA,
B, ldb, beta, C, ldc);
}
template <>
cusparseStatus_t Xcsrmm2<double>(cusparseHandle_t handle, cusparseOperation_t transA,
cusparseOperation_t transB, int m, int n, int k, int nnz,
const double* alpha, const cusparseMatDescr_t descrA,
const double* csrValA, const int* csrRowPtrA, const int* csrColIndA,
const double* B, int ldb, const double* beta, double* C, int ldc) {
return cusparseDcsrmm2(handle, transA, transB, m, n, k, nnz,
alpha, descrA, csrValA, csrRowPtrA, csrColIndA,
B, ldb, beta, C, ldc);
}
template <typename DType>
cublasStatus_t Xgeam(cublasHandle_t handle, cublasOperation_t transa,
cublasOperation_t transb, int m, int n,
const DType* alpha, const DType* A, int lda,
const DType* beta, const DType* B, int ldb,
DType* C, int ldc) {
LOG(INFO) << "Not supported dtype";
return CUBLAS_STATUS_EXECUTION_FAILED;
}
template <>
cublasStatus_t Xgeam<float>(cublasHandle_t handle, cublasOperation_t transa,
cublasOperation_t transb, int m, int n,
const float* alpha, const float* A, int lda,
const float* beta, const float* B, int ldb,
float* C, int ldc) {
return cublasSgeam(handle, transa, transb, m, n, alpha, A, lda,
beta, B, ldb, C, ldc);
}
template <>
cublasStatus_t Xgeam<double>(cublasHandle_t handle, cublasOperation_t transa,
cublasOperation_t transb, int m, int n,
const double* alpha, const double* A, int lda,
const double* beta, const double* B, int ldb,
double* C, int ldc) {
return cublasDgeam(handle, transa, transb, m, n, alpha, A, lda,
beta, B, ldb, C, ldc);
}
template <typename DType>
void CusparseCsrmm2(
const RuntimeConfig& rtcfg,
const Csr& csr,
const DType* B_data, DType* C_data,
int out_size, int x_length) {
// We use csrmm2 to perform the following operation:
// C = A x B, where A is a sparse matrix in csr format, B is the dense matrix for node
// feature tensor. However, since cusparse only supports column-major, while our tensor
// is stored in row-major, the actual computation is:
// C = trans(A x trans(B)).
// Currently, we use cublasXgeam to implement transposition and allocate intermediate
// workspace memory for this.
// TODO(minjie): The given CSR could potentially represent a bipartite graph (e.g. in the
// case of nodeflow). Currently, we don't have bipartite graph support. Here is a small
// hack. In the python side, we create a CSR that includes both the source and destination
// nodes in the bipartite graph (so it is still square matrix). Here, when multiplying
// this sparse matrix, we specify the number of rows (the `m` here) to be equal to the
// number of rows of the output tensor (i.e, the `out_size`).
// In the future, we should make sure the number of rows of the given csr is equal
// to out_size (i.e., the given csr would be a rectangular matrix).
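// Dimension walk-through (illustration): A is the m x k sparse matrix (m = out_size,
// k = number of csr rows), and B_data holds the k x n feature matrix (n = x_length) in
// row-major order, which cusparse reads as a column-major n x k matrix -- hence
// transB = CUSPARSE_OPERATION_TRANSPOSE below.  csrmm2 writes the m x n product into
// trans_out in column-major order, and the cublasXgeam call afterwards transposes it
// into the row-major m x n buffer C_data.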
const int m = out_size;
const int k = csr.row_offsets.length - 1;
const int n = x_length;
const int nnz = csr.column_indices.length;
const DType alpha = 1.0;
const DType beta = 0.0;
// device
auto device = runtime::DeviceAPI::Get(rtcfg.ctx);
auto* thr_entry = runtime::CUDAThreadEntry::ThreadLocal();
// allocate cusparse handle if needed
if (!thr_entry->cusparse_handle) {
CUSPARSE_CALL(cusparseCreate(&(thr_entry->cusparse_handle)));
}
CUSPARSE_CALL(cusparseSetStream(thr_entry->cusparse_handle, rtcfg.stream));
// allocate matrix for temporary transposed output
DType* trans_out = static_cast<DType*>(device->AllocWorkspace(rtcfg.ctx, m * n * sizeof(DType)));
// all-ones data array
DType* valptr = static_cast<DType*>(device->AllocWorkspace(rtcfg.ctx, nnz * sizeof(DType)));
utils::Fill<kDLGPU>(rtcfg.ctx, valptr, nnz, static_cast<DType>(1.));
cusparseMatDescr_t descr;
CUSPARSE_CALL(cusparseCreateMatDescr(&descr));
CUSPARSE_CALL(cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL));
CUSPARSE_CALL(cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO));
CUSPARSE_CALL(Xcsrmm2<DType>(
thr_entry->cusparse_handle,
CUSPARSE_OPERATION_NON_TRANSPOSE,
CUSPARSE_OPERATION_TRANSPOSE,
m, n, k, nnz, &alpha,
descr, valptr, csr.row_offsets.data, csr.column_indices.data,
B_data, n, &beta, trans_out, m));
device->FreeWorkspace(rtcfg.ctx, valptr);
// transpose the output matrix
if (!thr_entry->cublas_handle) {
CUBLAS_CALL(cublasCreate(&(thr_entry->cublas_handle)));
}
CUBLAS_CALL(cublasSetStream(thr_entry->cublas_handle, rtcfg.stream));
CUBLAS_CALL(Xgeam<DType>(
thr_entry->cublas_handle,
CUBLAS_OP_T,
CUBLAS_OP_N,
n, m,
&alpha, trans_out, m,
&beta, nullptr, n,
C_data, n));
device->FreeWorkspace(rtcfg.ctx, trans_out);
}
// forward
template <typename DType>
void FallbackCallBinaryReduce(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<int32_t, DType>* gdata) {
constexpr int XPU = kDLGPU;
typedef int32_t Idx;
typedef SelectSrc LeftSelector;
typedef SelectNone RightSelector;
typedef BinaryUseLhs<DType> BinaryOp;
typedef ReduceSum<kDLGPU, DType> Reducer;
typedef cuda::FunctorsTempl<Idx, DType, LeftSelector,
RightSelector, BinaryOp, Reducer>
Functors;
typedef cuda::BinaryReduce<Idx, DType, Functors> UDF;
// csr
auto outcsr = graph->GetOutCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(outcsr->indptr(), outcsr->indices());
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge && gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge && gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(outcsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig, GData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
template <typename DType>
void FallbackCallBackwardBinaryReduce(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<int32_t, DType>* gdata) {
constexpr int XPU = kDLGPU;
constexpr int Mode = binary_op::kGradLhs;
typedef int32_t Idx;
typedef SelectSrc LeftSelector;
typedef SelectNone RightSelector;
typedef BinaryUseLhs<DType> BinaryOp;
typedef ReduceSum<kDLGPU, DType> Reducer;
// For the backward computation, we use the reverse CSR and swap dst and src.
// This benefits the most common src_op_edge and copy_src cases, because the
// gradients of src are then aggregated into the destination buffer, which
// reduces contention among atomic adds.
auto incsr = graph->GetInCSR();
minigun::Csr<Idx> csr = utils::CreateCsr<Idx>(incsr->indptr(), incsr->indices());
typedef cuda::BackwardFunctorsTempl<Idx, DType,
typename SwitchSrcDst<LeftSelector>::Type,
typename SwitchSrcDst<RightSelector>::Type,
BinaryOp, Reducer> Functors;
typedef cuda::BackwardBinaryReduce<Mode, Idx, DType, Functors> UDF;
// If the user-given mapping is none and the target is edge data, we need to
// replace the mapping by the edge ids in the csr graph so that the edge
// data is correctly read/written.
if (LeftSelector::target == binary_op::kEdge
&& gdata->lhs_mapping == nullptr) {
gdata->lhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (RightSelector::target == binary_op::kEdge
&& gdata->rhs_mapping == nullptr) {
gdata->rhs_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
if (OutSelector<Reducer>::Type::target == binary_op::kEdge
&& gdata->out_mapping == nullptr) {
gdata->out_mapping = static_cast<Idx*>(incsr->edge_ids()->data);
}
// TODO(minjie): allocator
minigun::advance::Advance<XPU, Idx, cuda::AdvanceConfig, BackwardGData<Idx, DType>, UDF>(
rtcfg, csr, gdata, minigun::IntArray1D<Idx>());
}
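// A matching CPU sketch of the backward pass for copy_src + sum, again with a
// hypothetical helper name (illustrative only): the gradient of a source node
// is the sum of the output gradients of every destination it points to, i.e.
// an SpMM over the out-CSR, the reverse of the forward in-CSR.
template <typename DType>
void ReferenceCopySrcSumBackward(const int32_t* indptr, const int32_t* indices,
                                 int64_t num_src, int64_t x_length,
                                 const DType* grad_out, DType* grad_src) {
  for (int64_t j = 0; j < num_src; ++j) {  // each source node
    for (int64_t d = 0; d < x_length; ++d)
      grad_src[j * x_length + d] = 0;
    for (int32_t e = indptr[j]; e < indptr[j + 1]; ++e) {
      const int32_t i = indices[e];  // destination neighbor of node j
      for (int64_t d = 0; d < x_length; ++d)
        grad_src[j * x_length + d] += grad_out[i * x_length + d];
    }
  }
}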
} // namespace cuda
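// The specializations below intercept the copy_src + sum pattern (SelectSrc,
// BinaryUseLhs, ReduceSum): when no custom id mappings are supplied, the
// computation is dispatched to the cuSPARSE csrmm2 path; otherwise it falls
// back to the generic minigun kernels.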
template <>
void CallBinaryReduce<kDLGPU, int32_t, float, SelectSrc, SelectNone,
BinaryUseLhs<float>, ReduceSum<kDLGPU, float>>(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<int32_t, float>* gdata) {
if (gdata->lhs_mapping || gdata->rhs_mapping || gdata->out_mapping) {
cuda::FallbackCallBinaryReduce<float>(rtcfg, graph, gdata);
} else {
// cuSPARSE uses the reverse CSR (in-CSR) for csrmm
auto incsr = graph->GetInCSR();
Csr csr = utils::CreateCsr<int32_t>(incsr->indptr(), incsr->indices());
cuda::CusparseCsrmm2(rtcfg, csr, gdata->lhs_data, gdata->out_data,
gdata->out_size, gdata->x_length);
}
}
template <>
void CallBinaryReduce<kDLGPU, int32_t, double, SelectSrc, SelectNone,
BinaryUseLhs<double>, ReduceSum<kDLGPU, double>>(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
GData<int32_t, double>* gdata) {
if (gdata->lhs_mapping || gdata->rhs_mapping || gdata->out_mapping) {
cuda::FallbackCallBinaryReduce<double>(rtcfg, graph, gdata);
} else {
// cuSPARSE uses the reverse CSR (in-CSR) for csrmm
auto incsr = graph->GetInCSR();
Csr csr = utils::CreateCsr<int32_t>(incsr->indptr(), incsr->indices());
cuda::CusparseCsrmm2(rtcfg, csr, gdata->lhs_data, gdata->out_data,
gdata->out_size, gdata->x_length);
}
}
// backward
template <>
void CallBackwardBinaryReduce<kDLGPU, binary_op::kGradLhs, int32_t, float,
SelectSrc, SelectNone,
BinaryUseLhs<float>, ReduceSum<kDLGPU, float>>(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<int32_t, float>* gdata) {
if (gdata->lhs_mapping || gdata->rhs_mapping || gdata->out_mapping) {
cuda::FallbackCallBackwardBinaryReduce<float>(rtcfg, graph, gdata);
} else {
auto outcsr = graph->GetOutCSR();
Csr csr = utils::CreateCsr<int32_t>(outcsr->indptr(), outcsr->indices());
cuda::CusparseCsrmm2(rtcfg, csr, gdata->grad_out_data, gdata->grad_lhs_data,
gdata->out_size, gdata->x_length);
}
}
template <>
void CallBackwardBinaryReduce<kDLGPU, binary_op::kGradLhs, int32_t, double,
SelectSrc, SelectNone,
BinaryUseLhs<double>, ReduceSum<kDLGPU, double>>(
const RuntimeConfig& rtcfg,
const ImmutableGraph* graph,
BackwardGData<int32_t, double>* gdata) {
if (gdata->lhs_mapping || gdata->rhs_mapping || gdata->out_mapping) {
cuda::FallbackCallBackwardBinaryReduce<double>(rtcfg, graph, gdata);
} else {
auto outcsr = graph->GetOutCSR();
Csr csr = utils::CreateCsr<int32_t>(outcsr->indptr(), outcsr->indices());
cuda::CusparseCsrmm2(rtcfg, csr, gdata->grad_out_data, gdata->grad_lhs_data,
gdata->out_size, gdata->x_length);
}
}
// generate definitions
#define REDUCER ReduceSum
#define XPU kDLGPU
#define IDX int32_t
EVAL(GEN_DTYPE, GEN_OP_TARGET, GEN_DEFINE);
EVAL(GEN_BACKWARD_MODE, GEN_DTYPE, GEN_OP_TARGET, GEN_BACKWARD_DEFINE);
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/functor.cuh
* \brief Functors for template on CUDA
*/
#ifndef DGL_KERNEL_CUDA_FUNCTOR_CUH_
#define DGL_KERNEL_CUDA_FUNCTOR_CUH_
#include "../binary_reduce_common.h"
#include "./atomic.cuh"
namespace dgl {
namespace kernel {
namespace cuda {
// Cache load from global memory
template <typename DType>
struct LDGReader {
static __device__ __forceinline__ DType Call(DType* addr) {
#if __CUDA_ARCH__ >= 350
return __ldg(addr);
#else
return *addr;
#endif
}
};
} // namespace cuda
// Reducer functor specialization
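// Call() folds `val` into the accumulation buffer at `addr` (atomically where
// needed); BackwardCall() returns d(accum)/d(val), the factor that scales the
// output gradient for this input: 1 for sum, the indicator (val == accum) for
// max/min, accum / val for prod, and 1 for the pass-through ReduceNone.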
template <typename DType>
struct ReduceSum<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
cuda::AtomicAdd(addr, val);
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return 1;
}
};
template <typename DType>
struct ReduceMax<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
cuda::AtomicMax(addr, val);
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return static_cast<DType>(val == accum);
}
};
template <typename DType>
struct ReduceMin<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
cuda::AtomicMin(addr, val);
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return static_cast<DType>(val == accum);
}
};
template <typename DType>
struct ReduceProd<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
cuda::AtomicMul(addr, val);
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return accum / val;
}
};
template <typename DType>
struct ReduceNone<kDLGPU, DType> {
static __device__ __forceinline__ void Call(DType* addr, DType val) {
*addr = val;
}
static __device__ __forceinline__ DType BackwardCall(DType val, DType accum) {
return 1;
}
};
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_CUDA_FUNCTOR_CUH_
/*!
* Copyright (c) 2019 by Contributors
* \file kernel/cuda/utils.cu
* \brief Utility function implementations on CUDA
*/
#include "../../runtime/cuda/cuda_common.h"
#include "../utils.h"
namespace dgl {
namespace kernel {
namespace utils {
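// _FillKernel uses a grid-stride loop: each thread writes elements
// tx, tx + stride_x, tx + 2 * stride_x, ..., so a fixed-size grid covers a
// buffer of any length.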
template <typename DType>
__global__ void _FillKernel(DType* ptr, size_t length, DType val) {
int tx = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = gridDim.x * blockDim.x;
while (tx < length) {
ptr[tx] = val;
tx += stride_x;
}
}
template <int XPU, typename DType>
void Fill(const DLContext& ctx, DType* ptr, size_t length, DType val) {
auto* thr_entry = runtime::CUDAThreadEntry::ThreadLocal();
int nt = utils::FindNumThreads(length, 1024);
int nb = (length + nt - 1) / nt;
_FillKernel<<<nb, nt, 0, thr_entry->stream>>>(ptr, length, val);
}
template void Fill<kDLGPU, float>(const DLContext& ctx, float* ptr, size_t length, float val);
template void Fill<kDLGPU, double>(const DLContext& ctx, double* ptr, size_t length, double val);
} // namespace utils
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2018 by Contributors
* \file kernel/utils.cc
* \brief Kernel utilities
*/
#include <vector>
#include <string>
#include "./utils.h"
#include "./binary_reduce_common.h"
namespace dgl {
namespace kernel {
namespace utils {
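// FindNumThreads halves max_nthrs until it no longer exceeds dim. Assuming
// max_nthrs is a power of two (callers pass 1024), e.g. FindNumThreads(300,
// 1024) goes 1024 -> 512 -> 256 and returns 256.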
int FindNumThreads(int dim, int max_nthrs) {
int ret = max_nthrs;
while (ret > dim) {
ret = ret >> 1;
}
return ret;
}
int64_t ComputeXLength(runtime::NDArray feat_array) {
int64_t ret = 1;
for (int i = 1; i < feat_array->ndim; ++i) {
ret *= feat_array->shape[i];
}
return ret;
}
int64_t NElements(const runtime::NDArray& array) {
if (IsNoneArray(array)) {
return 0;
} else {
int64_t ret = 1;
for (int i = 0; i < array->ndim; ++i) {
ret *= array->shape[i];
}
return ret;
}
}
int64_t Prod(const std::vector<int64_t>& vec) {
int64_t ret = 1;
for (int64_t v : vec) {
ret *= v;
}
return ret;
}
} // namespace utils
} // namespace kernel
} // namespace dgl
/*!
* Copyright (c) 2018 by Contributors
* \file kernel/utils.h
* \brief Kernel utilities
*/
#ifndef DGL_KERNEL_UTILS_H_
#define DGL_KERNEL_UTILS_H_
#include <minigun/csr.h>
#include <dlpack/dlpack.h>
#include <dgl/runtime/ndarray.h>
#include <cstdlib>
#include <vector>
namespace dgl {
namespace kernel {
namespace utils {
/* !\brief Return an NDArray that represents a none value. */
inline runtime::NDArray NoneArray() {
return runtime::NDArray::Empty({}, DLDataType{kDLInt, 32, 1}, DLContext{kDLCPU, 0});
}
/* !\brief Return true if the NDArray is none. */
inline bool IsNoneArray(runtime::NDArray array) {
return array->ndim == 0;
}
/*
* !\brief Find a thread count that is no larger than both dim and max_nthrs
* and is also a power of two (assuming max_nthrs is).
*/
int FindNumThreads(int dim, int max_nthrs);
/*
* !\brief Compute the number of feature elements per row (the product of all
* dimensions except the first).
*/
int64_t ComputeXLength(runtime::NDArray feat_array);
/*
* !\brief Compute the total number of elements in the array.
*/
int64_t NElements(const runtime::NDArray& array);
/*
* !\brief Compute the product of the given vector.
*/
int64_t Prod(const std::vector<int64_t>& vec);
/*
* !\brief Fill the array with a constant value.
*/
template <int XPU, typename DType>
void Fill(const DLContext& ctx, DType* ptr, size_t length, DType val);
/*
* !\brief Create minigun CSR from two ndarrays.
*/
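// Note: the returned Csr aliases the NDArray buffers (no data is copied), so
// indptr and indices must remain alive for as long as the Csr is in use.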
template <typename Idx>
minigun::Csr<Idx> CreateCsr(runtime::NDArray indptr, runtime::NDArray indices) {
minigun::Csr<Idx> csr;
csr.row_offsets.data = static_cast<Idx*>(indptr->data);
csr.row_offsets.length = indptr->shape[0];
csr.column_indices.data = static_cast<Idx*>(indices->data);
csr.column_indices.length = indices->shape[0];
return csr;
}
} // namespace utils
} // namespace kernel
} // namespace dgl
#endif // DGL_KERNEL_UTILS_H_