Unverified commit 6b56a90c, authored by shiyu1994 and committed by GitHub

[CUDA] New CUDA version Part 1 (#4630)



* new cuda framework

* add histogram construction kernel

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to GPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add destructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: Yu Shi <shiyu1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
parent b857ee10
......@@ -22,6 +22,9 @@
#include <utility>
#include <vector>
#include <LightGBM/cuda/cuda_column_data.hpp>
#include <LightGBM/cuda/cuda_metadata.hpp>
namespace LightGBM {
/*! \brief forward declaration */
......@@ -211,6 +214,14 @@ class Metadata {
/*! \brief Disable copy */
Metadata(const Metadata&) = delete;
#ifdef USE_CUDA_EXP
CUDAMetadata* cuda_metadata() const { return cuda_metadata_.get(); }
void CreateCUDAMetadata(const int gpu_device_id);
#endif // USE_CUDA_EXP
private:
/*! \brief Load initial scores from file */
void LoadInitialScore();
......@@ -247,6 +258,9 @@ class Metadata {
bool weight_load_from_file_;
bool query_load_from_file_;
bool init_score_load_from_file_;
#ifdef USE_CUDA_EXP
std::unique_ptr<CUDAMetadata> cuda_metadata_;
#endif // USE_CUDA_EXP
};
......@@ -623,6 +637,21 @@ class Dataset {
return feature_groups_[group]->FeatureGroupData();
}
const void* GetColWiseData(
const int feature_group_index,
const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const;
const void* GetColWiseData(
const int feature_group_index,
const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const;
inline double RealThreshold(int i, uint32_t threshold) const {
const int group = feature2group_[i];
const int sub_feature = feature2subfeature_[i];
......@@ -636,6 +665,12 @@ class Dataset {
return feature_groups_[group]->bin_mappers_[sub_feature]->ValueToBin(threshold_double);
}
inline int MaxRealCatValue(int i) const {
const int group = feature2group_[i];
const int sub_feature = feature2subfeature_[i];
return feature_groups_[group]->bin_mappers_[sub_feature]->MaxCatValue();
}
/*!
* \brief Get meta data pointer
* \return Pointer of meta data
......@@ -739,7 +774,29 @@ class Dataset {
return raw_data_[numeric_feature_map_[feat_ind]].data();
}
inline uint32_t feature_max_bin(const int inner_feature_index) const {
const int feature_group_index = Feature2Group(inner_feature_index);
const int sub_feature_index = feature2subfeature_[inner_feature_index];
return feature_groups_[feature_group_index]->feature_max_bin(sub_feature_index);
}
inline uint32_t feature_min_bin(const int inner_feature_index) const {
const int feature_group_index = Feature2Group(inner_feature_index);
const int sub_feature_index = feature2subfeature_[inner_feature_index];
return feature_groups_[feature_group_index]->feature_min_bin(sub_feature_index);
}
#ifdef USE_CUDA_EXP
const CUDAColumnData* cuda_column_data() const {
return cuda_column_data_.get();
}
#endif // USE_CUDA_EXP
private:
void CreateCUDAColumnData();
std::string data_filename_;
/*! \brief Store used features */
std::vector<std::unique_ptr<FeatureGroup>> feature_groups_;
......@@ -780,6 +837,13 @@ class Dataset {
/*! map feature (inner index) to its index in the list of numeric (non-categorical) features */
std::vector<int> numeric_feature_map_;
int num_numeric_features_;
std::string device_type_;
int gpu_device_id_;
#ifdef USE_CUDA_EXP
std::unique_ptr<CUDAColumnData> cuda_column_data_;
#endif // USE_CUDA_EXP
std::string parser_config_str_;
};
......
......@@ -478,6 +478,50 @@ class FeatureGroup {
}
}
const void* GetColWiseData(const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
if (sub_feature_index >= 0) {
CHECK(is_multi_val_);
return multi_bin_data_[sub_feature_index]->GetColWiseData(bit_type, is_sparse, bin_iterator, num_threads);
} else {
CHECK(!is_multi_val_);
return bin_data_->GetColWiseData(bit_type, is_sparse, bin_iterator, num_threads);
}
}
const void* GetColWiseData(const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
if (sub_feature_index >= 0) {
CHECK(is_multi_val_);
return multi_bin_data_[sub_feature_index]->GetColWiseData(bit_type, is_sparse, bin_iterator);
} else {
CHECK(!is_multi_val_);
return bin_data_->GetColWiseData(bit_type, is_sparse, bin_iterator);
}
}
uint32_t feature_max_bin(const int sub_feature_index) {
if (!is_multi_val_) {
return bin_offsets_[sub_feature_index + 1] - 1;
} else {
int addi = bin_mappers_[sub_feature_index]->GetMostFreqBin() == 0 ? 0 : 1;
return bin_mappers_[sub_feature_index]->num_bin() - 1 + addi;
}
}
uint32_t feature_min_bin(const int sub_feature_index) {
if (!is_multi_val_) {
return bin_offsets_[sub_feature_index];
} else {
return 1;
}
}
private:
void CreateBinData(int num_data, bool is_multi_val, bool force_dense, bool force_sparse) {
if (is_multi_val) {
......
......@@ -49,6 +49,8 @@ typedef float label_t;
const score_t kMinScore = -std::numeric_limits<score_t>::infinity();
const score_t kMaxScore = std::numeric_limits<score_t>::infinity();
const score_t kEpsilon = 1e-15f;
const double kZeroThreshold = 1e-35f;
......
......@@ -125,6 +125,25 @@ class MultiValBinWrapper {
is_subrow_copied_ = is_subrow_copied;
}
#ifdef USE_CUDA_EXP
const void* GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
if (multi_val_bin_ == nullptr) {
*bit_type = 0;
*total_size = 0;
*is_sparse = false;
return nullptr;
} else {
return multi_val_bin_->GetRowWiseData(bit_type, total_size, is_sparse, out_data_ptr, data_ptr_bit_type);
}
}
#endif // USE_CUDA_EXP
private:
bool is_use_subcol_ = false;
bool is_use_subrow_ = false;
......@@ -162,7 +181,11 @@ struct TrainingShareStates {
int num_hist_total_bin() { return num_hist_total_bin_; }
const std::vector<uint32_t>& feature_hist_offsets() { return feature_hist_offsets_; }
const std::vector<uint32_t>& feature_hist_offsets() const { return feature_hist_offsets_; }
#ifdef USE_CUDA_EXP
const std::vector<uint32_t>& column_hist_offsets() const { return column_hist_offsets_; }
#endif // USE_CUDA_EXP
bool IsSparseRowwise() {
return (multi_val_bin_wrapper_ != nullptr && multi_val_bin_wrapper_->IsSparse());
......@@ -211,8 +234,29 @@ struct TrainingShareStates {
}
}
#ifdef USE_CUDA_EXP
const void* GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) {
if (multi_val_bin_wrapper_ != nullptr) {
return multi_val_bin_wrapper_->GetRowWiseData(bit_type, total_size, is_sparse, out_data_ptr, data_ptr_bit_type);
} else {
*bit_type = 0;
*total_size = 0;
*is_sparse = false;
return nullptr;
}
}
#endif // USE_CUDA_EXP
private:
std::vector<uint32_t> feature_hist_offsets_;
#ifdef USE_CUDA_EXP
std::vector<uint32_t> column_hist_offsets_;
#endif // USE_CUDA_EXP
int num_hist_total_bin_ = 0;
std::unique_ptr<MultiValBinWrapper> multi_val_bin_wrapper_;
std::vector<hist_t, Common::AlignmentAllocator<hist_t, kAlignedSize>> hist_buf_;
......
......@@ -39,7 +39,7 @@ class Tree {
*/
Tree(const char* str, size_t* used_len);
~Tree() noexcept = default;
virtual ~Tree() noexcept = default;
/*!
* \brief Performing a split on tree leaves.
......@@ -100,7 +100,7 @@ class Tree {
* \param num_data Number of total data
* \param score Will add prediction to score
*/
void AddPredictionToScore(const Dataset* data,
virtual void AddPredictionToScore(const Dataset* data,
data_size_t num_data,
double* score) const;
......@@ -111,7 +111,7 @@ class Tree {
* \param num_data Number of total data
* \param score Will add prediction to score
*/
void AddPredictionToScore(const Dataset* data,
virtual void AddPredictionToScore(const Dataset* data,
const data_size_t* used_data_indices,
data_size_t num_data, double* score) const;
......@@ -184,7 +184,7 @@ class Tree {
* shrinkage rate (a.k.a learning rate) is used to tune the training process
* \param rate The factor of shrinkage
*/
inline void Shrinkage(double rate) {
virtual inline void Shrinkage(double rate) {
#pragma omp parallel for schedule(static, 1024) if (num_leaves_ >= 2048)
for (int i = 0; i < num_leaves_ - 1; ++i) {
leaf_value_[i] = MaybeRoundToZero(leaf_value_[i] * rate);
......@@ -209,7 +209,7 @@ class Tree {
inline double shrinkage() const { return shrinkage_; }
inline void AddBias(double val) {
virtual inline void AddBias(double val) {
#pragma omp parallel for schedule(static, 1024) if (num_leaves_ >= 2048)
for (int i = 0; i < num_leaves_ - 1; ++i) {
leaf_value_[i] = MaybeRoundToZero(leaf_value_[i] + val);
......@@ -319,11 +319,15 @@ class Tree {
inline bool is_linear() const { return is_linear_; }
#ifdef USE_CUDA_EXP
inline bool is_cuda_tree() const { return is_cuda_tree_; }
#endif // USE_CUDA_EXP
inline void SetIsLinear(bool is_linear) {
is_linear_ = is_linear;
}
private:
protected:
std::string NumericalDecisionIfElse(int node) const;
std::string CategoricalDecisionIfElse(int node) const;
......@@ -528,6 +532,10 @@ class Tree {
std::vector<std::vector<int>> leaf_features_;
/* \brief features used in leaf linear models; indexing is relative to used_features_ */
std::vector<std::vector<int>> leaf_features_inner_;
#ifdef USE_CUDA_EXP
/*! \brief Marks whether this tree is a CUDATree */
bool is_cuda_tree_;
#endif // USE_CUDA_EXP
};
inline void Tree::Split(int leaf, int feature, int real_feature,
......
......@@ -123,6 +123,8 @@ All requirements from `Build from Sources section <#build-from-sources>`__ apply
**CUDA** library (version 9.0 or higher) is needed: details for installation can be found in `Installation Guide <https://github.com/microsoft/LightGBM/blob/master/docs/Installation-Guide.rst#build-cuda-version-experimental>`__.
Recently, a new CUDA version with better efficiency has been implemented as an experimental feature. To build the new CUDA version, replace ``--cuda`` with ``--cuda-exp`` in the above commands. Please note that the new version requires **CUDA** 10.0 or later.
Build HDFS Version
~~~~~~~~~~~~~~~~~~
......@@ -198,6 +200,8 @@ Run ``python setup.py install --gpu`` to enable GPU support. All requirements fr
Run ``python setup.py install --cuda`` to enable CUDA support. All requirements from `Build CUDA Version section <#build-cuda-version>`__ apply for this installation option as well.
Run ``python setup.py install --cuda-exp`` to enable the new experimental version of CUDA support. All requirements from `Build CUDA Version section <#build-cuda-version>`__ apply for this installation option as well.
Run ``python setup.py install --hdfs`` to enable HDFS support. All requirements from `Build HDFS Version section <#build-hdfs-version>`__ apply for this installation option as well.
Run ``python setup.py install --bit32``, if you want to use 32-bit version. All requirements from `Build 32-bit Version with 32-bit Python section <#build-32-bit-version-with-32-bit-python>`__ apply for this installation option as well.
......
......@@ -21,6 +21,7 @@ LIGHTGBM_OPTIONS = [
('integrated-opencl', None, 'Compile integrated OpenCL version'),
('gpu', 'g', 'Compile GPU version'),
('cuda', None, 'Compile CUDA version'),
('cuda-exp', None, 'Compile CUDA Experimental version'),
('mpi', None, 'Compile MPI version'),
('nomp', None, 'Compile version without OpenMP support'),
('hdfs', 'h', 'Compile HDFS version'),
......@@ -104,6 +105,7 @@ def compile_cpp(
use_mingw: bool = False,
use_gpu: bool = False,
use_cuda: bool = False,
use_cuda_exp: bool = False,
use_mpi: bool = False,
use_hdfs: bool = False,
boost_root: Optional[str] = None,
......@@ -144,6 +146,8 @@ def compile_cpp(
cmake_cmd.append(f"-DOpenCL_LIBRARY={opencl_library}")
elif use_cuda:
cmake_cmd.append("-DUSE_CUDA=ON")
elif use_cuda_exp:
cmake_cmd.append("-DUSE_CUDA_EXP=ON")
if use_mpi:
cmake_cmd.append("-DUSE_MPI=ON")
if nomp:
......@@ -163,7 +167,7 @@ def compile_cpp(
else:
status = 1
lib_path = CURRENT_DIR / "compile" / "windows" / "x64" / "DLL" / "lib_lightgbm.dll"
if not any((use_gpu, use_cuda, use_mpi, use_hdfs, nomp, bit32, integrated_opencl)):
if not any((use_gpu, use_cuda, use_cuda_exp, use_mpi, use_hdfs, nomp, bit32, integrated_opencl)):
logger.info("Starting to compile with MSBuild from existing solution file.")
platform_toolsets = ("v143", "v142", "v141", "v140")
for pt in platform_toolsets:
......@@ -227,6 +231,7 @@ class CustomInstall(install):
self.integrated_opencl = False
self.gpu = False
self.cuda = False
self.cuda_exp = False
self.boost_root = None
self.boost_dir = None
self.boost_include_dir = None
......@@ -250,7 +255,7 @@ class CustomInstall(install):
LOG_PATH.touch()
if not self.precompile:
copy_files(integrated_opencl=self.integrated_opencl, use_gpu=self.gpu)
compile_cpp(use_mingw=self.mingw, use_gpu=self.gpu, use_cuda=self.cuda, use_mpi=self.mpi,
compile_cpp(use_mingw=self.mingw, use_gpu=self.gpu, use_cuda=self.cuda, use_cuda_exp=self.cuda_exp, use_mpi=self.mpi,
use_hdfs=self.hdfs, boost_root=self.boost_root, boost_dir=self.boost_dir,
boost_include_dir=self.boost_include_dir, boost_librarydir=self.boost_librarydir,
opencl_include_dir=self.opencl_include_dir, opencl_library=self.opencl_library,
......@@ -270,6 +275,7 @@ class CustomBdistWheel(bdist_wheel):
self.integrated_opencl = False
self.gpu = False
self.cuda = False
self.cuda_exp = False
self.boost_root = None
self.boost_dir = None
self.boost_include_dir = None
......@@ -291,6 +297,7 @@ class CustomBdistWheel(bdist_wheel):
install.integrated_opencl = self.integrated_opencl
install.gpu = self.gpu
install.cuda = self.cuda
install.cuda_exp = self.cuda_exp
install.boost_root = self.boost_root
install.boost_dir = self.boost_dir
install.boost_include_dir = self.boost_include_dir
......
......@@ -36,7 +36,7 @@ Application::Application(int argc, char** argv) {
Log::Fatal("No training/prediction data, application quit");
}
if (config_.device_type == std::string("cuda")) {
if (config_.device_type == std::string("cuda") || config_.device_type == std::string("cuda_exp")) {
LGBM_config_::current_device = lgbm_device_cuda;
}
}
......
......@@ -65,7 +65,7 @@ void GBDT::Init(const Config* config, const Dataset* train_data, const Objective
es_first_metric_only_ = config_->first_metric_only;
shrinkage_rate_ = config_->learning_rate;
if (config_->device_type == std::string("cuda")) {
if (config_->device_type == std::string("cuda") || config_->device_type == std::string("cuda_exp")) {
LGBM_config_::current_learner = use_cuda_learner;
}
......@@ -391,7 +391,7 @@ bool GBDT::TrainOneIter(const score_t* gradients, const score_t* hessians) {
auto grad = gradients + offset;
auto hess = hessians + offset;
// need to copy gradients for bagging subset.
if (is_use_subset_ && bag_data_cnt_ < num_data_) {
if (is_use_subset_ && bag_data_cnt_ < num_data_ && config_->device_type != std::string("cuda_exp")) {
for (int i = 0; i < bag_data_cnt_; ++i) {
gradients_[offset + i] = grad[bag_data_indices_[i]];
hessians_[offset + i] = hess[bag_data_indices_[i]];
......@@ -805,15 +805,17 @@ void GBDT::ResetBaggingConfig(const Config* config, bool is_change_dataset) {
double average_bag_rate =
(static_cast<double>(bag_data_cnt_) / num_data_) / config->bagging_freq;
is_use_subset_ = false;
const int group_threshold_usesubset = 100;
if (average_bag_rate <= 0.5
&& (train_data_->num_feature_groups() < group_threshold_usesubset)) {
if (tmp_subset_ == nullptr || is_change_dataset) {
tmp_subset_.reset(new Dataset(bag_data_cnt_));
tmp_subset_->CopyFeatureMapperFrom(train_data_);
if (config_->device_type != std::string("cuda_exp")) {
const int group_threshold_usesubset = 100;
if (average_bag_rate <= 0.5
&& (train_data_->num_feature_groups() < group_threshold_usesubset)) {
if (tmp_subset_ == nullptr || is_change_dataset) {
tmp_subset_.reset(new Dataset(bag_data_cnt_));
tmp_subset_->CopyFeatureMapperFrom(train_data_);
}
is_use_subset_ = true;
Log::Debug("Use subset for bagging");
}
is_use_subset_ = true;
Log::Debug("Use subset for bagging");
}
need_re_bagging_ = true;
......
......@@ -488,7 +488,7 @@ class GBDT : public GBDTBase {
/*! \brief Parser config file content */
std::string parser_config_str_ = "";
#ifdef USE_CUDA
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
/*! \brief First order derivative of training data */
std::vector<score_t, CHAllocator<score_t>> gradients_;
/*! \brief Second order derivative of training data */
......
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_algorithms.hpp>
namespace LightGBM {
template <typename T>
__global__ void ShufflePrefixSumGlobalKernel(T* values, size_t len, T* block_prefix_sum_buffer) {
__shared__ T shared_mem_buffer[32];
const size_t index = static_cast<size_t>(threadIdx.x + blockIdx.x * blockDim.x);
T value = 0;
if (index < len) {
value = values[index];
}
const T prefix_sum_value = ShufflePrefixSum<T>(value, shared_mem_buffer);
values[index] = prefix_sum_value;
if (threadIdx.x == blockDim.x - 1) {
block_prefix_sum_buffer[blockIdx.x] = prefix_sum_value;
}
}
template <typename T>
__global__ void ShufflePrefixSumGlobalReduceBlockKernel(T* block_prefix_sum_buffer, int num_blocks) {
__shared__ T shared_mem_buffer[32];
const int num_blocks_per_thread = (num_blocks + GLOBAL_PREFIX_SUM_BLOCK_SIZE - 2) / (GLOBAL_PREFIX_SUM_BLOCK_SIZE - 1);
int thread_block_start = threadIdx.x == 0 ? 0 : (threadIdx.x - 1) * num_blocks_per_thread;
int thread_block_end = threadIdx.x == 0 ? 0 : min(thread_block_start + num_blocks_per_thread, num_blocks);
T base = 0;
for (int block_index = thread_block_start; block_index < thread_block_end; ++block_index) {
base += block_prefix_sum_buffer[block_index];
}
base = ShufflePrefixSum<T>(base, shared_mem_buffer);
thread_block_start = threadIdx.x == blockDim.x - 1 ? 0 : threadIdx.x * num_blocks_per_thread;
thread_block_end = threadIdx.x == blockDim.x - 1 ? 0 : min(thread_block_start + num_blocks_per_thread, num_blocks);
for (int block_index = thread_block_start + 1; block_index < thread_block_end; ++block_index) {
block_prefix_sum_buffer[block_index] += block_prefix_sum_buffer[block_index - 1];
}
for (int block_index = thread_block_start; block_index < thread_block_end; ++block_index) {
block_prefix_sum_buffer[block_index] += base;
}
}
template <typename T>
__global__ void ShufflePrefixSumGlobalAddBase(size_t len, const T* block_prefix_sum_buffer, T* values) {
const T base = blockIdx.x == 0 ? 0 : block_prefix_sum_buffer[blockIdx.x - 1];
const size_t index = static_cast<size_t>(threadIdx.x + blockIdx.x * blockDim.x);
if (index < len) {
values[index] += base;
}
}
template <typename T>
void ShufflePrefixSumGlobalInner(T* values, size_t len, T* block_prefix_sum_buffer) {
const int num_blocks = (static_cast<int>(len) + GLOBAL_PREFIX_SUM_BLOCK_SIZE - 1) / GLOBAL_PREFIX_SUM_BLOCK_SIZE;
ShufflePrefixSumGlobalKernel<<<num_blocks, GLOBAL_PREFIX_SUM_BLOCK_SIZE>>>(values, len, block_prefix_sum_buffer);
ShufflePrefixSumGlobalReduceBlockKernel<<<1, GLOBAL_PREFIX_SUM_BLOCK_SIZE>>>(block_prefix_sum_buffer, num_blocks);
ShufflePrefixSumGlobalAddBase<<<num_blocks, GLOBAL_PREFIX_SUM_BLOCK_SIZE>>>(len, block_prefix_sum_buffer, values);
}
template <>
void ShufflePrefixSumGlobal(uint16_t* values, size_t len, uint16_t* block_prefix_sum_buffer) {
ShufflePrefixSumGlobalInner<uint16_t>(values, len, block_prefix_sum_buffer);
}
template <>
void ShufflePrefixSumGlobal(uint32_t* values, size_t len, uint32_t* block_prefix_sum_buffer) {
ShufflePrefixSumGlobalInner<uint32_t>(values, len, block_prefix_sum_buffer);
}
template <>
void ShufflePrefixSumGlobal(uint64_t* values, size_t len, uint64_t* block_prefix_sum_buffer) {
ShufflePrefixSumGlobalInner<uint64_t>(values, len, block_prefix_sum_buffer);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_utils.h>
namespace LightGBM {
void SynchronizeCUDADevice(const char* file, const int line) {
gpuAssert(cudaDeviceSynchronize(), file, line);
}
void PrintLastCUDAError() {
const char* error_name = cudaGetErrorName(cudaGetLastError());
Log::Fatal(error_name);
}
void SetCUDADevice(int gpu_device_id, const char* file, int line) {
int cur_gpu_device_id = 0;
CUDASUCCESS_OR_FATAL_OUTER(cudaGetDevice(&cur_gpu_device_id));
if (cur_gpu_device_id != gpu_device_id) {
CUDASUCCESS_OR_FATAL_OUTER(cudaSetDevice(gpu_device_id));
}
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
......@@ -128,6 +128,8 @@ void GetDeviceType(const std::unordered_map<std::string, std::string>& params, s
*device_type = "gpu";
} else if (value == std::string("cuda")) {
*device_type = "cuda";
} else if (value == std::string("cuda_exp")) {
*device_type = "cuda_exp";
} else {
Log::Fatal("Unknown device type %s", value.c_str());
}
......@@ -208,7 +210,7 @@ void Config::Set(const std::unordered_map<std::string, std::string>& params) {
GetObjectiveType(params, &objective);
GetMetricType(params, objective, &metric);
GetDeviceType(params, &device_type);
if (device_type == std::string("cuda")) {
if (device_type == std::string("cuda") || device_type == std::string("cuda_exp")) {
LGBM_config_::current_device = lgbm_device_cuda;
}
GetTreeLearnerType(params, &tree_learner);
......@@ -331,13 +333,20 @@ void Config::CheckParamConflict() {
num_leaves = static_cast<int>(full_num_leaves);
}
}
// force col-wise for gpu & CUDA
if (device_type == std::string("gpu") || device_type == std::string("cuda")) {
// force col-wise for gpu, and cuda version
force_col_wise = true;
force_row_wise = false;
if (deterministic) {
Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
}
} else if (device_type == std::string("cuda_exp")) {
// force row-wise for cuda_exp version
force_col_wise = false;
force_row_wise = true;
if (deterministic) {
Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
}
}
// force gpu_use_dp for CUDA
if (device_type == std::string("cuda") && !gpu_use_dp) {
......
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_column_data.hpp>
namespace LightGBM {
CUDAColumnData::CUDAColumnData(const data_size_t num_data, const int gpu_device_id) {
num_threads_ = OMP_NUM_THREADS();
num_data_ = num_data;
if (gpu_device_id >= 0) {
SetCUDADevice(gpu_device_id, __FILE__, __LINE__);
} else {
SetCUDADevice(0, __FILE__, __LINE__);
}
cuda_used_indices_ = nullptr;
cuda_data_by_column_ = nullptr;
cuda_column_bit_type_ = nullptr;
cuda_feature_min_bin_ = nullptr;
cuda_feature_max_bin_ = nullptr;
cuda_feature_offset_ = nullptr;
cuda_feature_most_freq_bin_ = nullptr;
cuda_feature_default_bin_ = nullptr;
cuda_feature_missing_is_zero_ = nullptr;
cuda_feature_missing_is_na_ = nullptr;
cuda_feature_mfb_is_zero_ = nullptr;
cuda_feature_mfb_is_na_ = nullptr;
cuda_feature_to_column_ = nullptr;
data_by_column_.clear();
}
CUDAColumnData::~CUDAColumnData() {
DeallocateCUDAMemory<data_size_t>(&cuda_used_indices_, __FILE__, __LINE__);
DeallocateCUDAMemory<void*>(&cuda_data_by_column_, __FILE__, __LINE__);
for (size_t i = 0; i < data_by_column_.size(); ++i) {
DeallocateCUDAMemory<void>(&data_by_column_[i], __FILE__, __LINE__);
}
DeallocateCUDAMemory<uint8_t>(&cuda_column_bit_type_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_min_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_max_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_offset_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_most_freq_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_default_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint8_t>(&cuda_feature_missing_is_zero_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint8_t>(&cuda_feature_missing_is_na_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint8_t>(&cuda_feature_mfb_is_zero_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint8_t>(&cuda_feature_mfb_is_na_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_feature_to_column_, __FILE__, __LINE__);
DeallocateCUDAMemory<data_size_t>(&cuda_used_indices_, __FILE__, __LINE__);
}
template <bool IS_SPARSE, bool IS_4BIT, typename BIN_TYPE>
void CUDAColumnData::InitOneColumnData(const void* in_column_data, BinIterator* bin_iterator, void** out_column_data_pointer) {
BIN_TYPE* cuda_column_data = nullptr;
if (!IS_SPARSE) {
if (IS_4BIT) {
std::vector<BIN_TYPE> expanded_column_data(num_data_, 0);
const BIN_TYPE* in_column_data_reintrepreted = reinterpret_cast<const BIN_TYPE*>(in_column_data);
for (data_size_t i = 0; i < num_data_; ++i) {
expanded_column_data[i] = static_cast<BIN_TYPE>((in_column_data_reintrepreted[i >> 1] >> ((i & 1) << 2)) & 0xf);
}
InitCUDAMemoryFromHostMemory<BIN_TYPE>(&cuda_column_data,
expanded_column_data.data(),
static_cast<size_t>(num_data_),
__FILE__,
__LINE__);
} else {
InitCUDAMemoryFromHostMemory<BIN_TYPE>(&cuda_column_data,
reinterpret_cast<const BIN_TYPE*>(in_column_data),
static_cast<size_t>(num_data_),
__FILE__,
__LINE__);
}
} else {
// need to iterate bin iterator
std::vector<BIN_TYPE> expanded_column_data(num_data_, 0);
for (data_size_t i = 0; i < num_data_; ++i) {
expanded_column_data[i] = static_cast<BIN_TYPE>(bin_iterator->RawGet(i));
}
InitCUDAMemoryFromHostMemory<BIN_TYPE>(&cuda_column_data,
expanded_column_data.data(),
static_cast<size_t>(num_data_),
__FILE__,
__LINE__);
}
*out_column_data_pointer = reinterpret_cast<void*>(cuda_column_data);
}
void CUDAColumnData::Init(const int num_columns,
const std::vector<const void*>& column_data,
const std::vector<BinIterator*>& column_bin_iterator,
const std::vector<uint8_t>& column_bit_type,
const std::vector<uint32_t>& feature_max_bin,
const std::vector<uint32_t>& feature_min_bin,
const std::vector<uint32_t>& feature_offset,
const std::vector<uint32_t>& feature_most_freq_bin,
const std::vector<uint32_t>& feature_default_bin,
const std::vector<uint8_t>& feature_missing_is_zero,
const std::vector<uint8_t>& feature_missing_is_na,
const std::vector<uint8_t>& feature_mfb_is_zero,
const std::vector<uint8_t>& feature_mfb_is_na,
const std::vector<int>& feature_to_column) {
num_columns_ = num_columns;
column_bit_type_ = column_bit_type;
feature_max_bin_ = feature_max_bin;
feature_min_bin_ = feature_min_bin;
feature_offset_ = feature_offset;
feature_most_freq_bin_ = feature_most_freq_bin;
feature_default_bin_ = feature_default_bin;
feature_missing_is_zero_ = feature_missing_is_zero;
feature_missing_is_na_ = feature_missing_is_na;
feature_mfb_is_zero_ = feature_mfb_is_zero;
feature_mfb_is_na_ = feature_mfb_is_na;
data_by_column_.resize(num_columns_, nullptr);
OMP_INIT_EX();
#pragma omp parallel for schedule(static) num_threads(num_threads_)
for (int column_index = 0; column_index < num_columns_; ++column_index) {
OMP_LOOP_EX_BEGIN();
const int8_t bit_type = column_bit_type[column_index];
if (column_data[column_index] != nullptr) {
// is dense column
if (bit_type == 4) {
column_bit_type_[column_index] = 8;
InitOneColumnData<false, true, uint8_t>(column_data[column_index], nullptr, &data_by_column_[column_index]);
} else if (bit_type == 8) {
InitOneColumnData<false, false, uint8_t>(column_data[column_index], nullptr, &data_by_column_[column_index]);
} else if (bit_type == 16) {
InitOneColumnData<false, false, uint16_t>(column_data[column_index], nullptr, &data_by_column_[column_index]);
} else if (bit_type == 32) {
InitOneColumnData<false, false, uint32_t>(column_data[column_index], nullptr, &data_by_column_[column_index]);
} else {
Log::Fatal("Unknow column bit type %d", bit_type);
}
} else {
// is sparse column
if (bit_type == 8) {
InitOneColumnData<true, false, uint8_t>(nullptr, column_bin_iterator[column_index], &data_by_column_[column_index]);
} else if (bit_type == 16) {
InitOneColumnData<true, false, uint16_t>(nullptr, column_bin_iterator[column_index], &data_by_column_[column_index]);
} else if (bit_type == 32) {
InitOneColumnData<true, false, uint32_t>(nullptr, column_bin_iterator[column_index], &data_by_column_[column_index]);
} else {
Log::Fatal("Unknow column bit type %d", bit_type);
}
}
OMP_LOOP_EX_END();
}
OMP_THROW_EX();
feature_to_column_ = feature_to_column;
InitCUDAMemoryFromHostMemory<void*>(&cuda_data_by_column_,
data_by_column_.data(),
data_by_column_.size(),
__FILE__,
__LINE__);
InitColumnMetaInfo();
}
void CUDAColumnData::CopySubrow(
const CUDAColumnData* full_set,
const data_size_t* used_indices,
const data_size_t num_used_indices) {
num_threads_ = full_set->num_threads_;
num_columns_ = full_set->num_columns_;
column_bit_type_ = full_set->column_bit_type_;
feature_min_bin_ = full_set->feature_min_bin_;
feature_max_bin_ = full_set->feature_max_bin_;
feature_offset_ = full_set->feature_offset_;
feature_most_freq_bin_ = full_set->feature_most_freq_bin_;
feature_default_bin_ = full_set->feature_default_bin_;
feature_missing_is_zero_ = full_set->feature_missing_is_zero_;
feature_missing_is_na_ = full_set->feature_missing_is_na_;
feature_mfb_is_zero_ = full_set->feature_mfb_is_zero_;
feature_mfb_is_na_ = full_set->feature_mfb_is_na_;
feature_to_column_ = full_set->feature_to_column_;
if (cuda_used_indices_ == nullptr) {
// initialize the subset cuda column data
const size_t num_used_indices_size = static_cast<size_t>(num_used_indices);
AllocateCUDAMemory<data_size_t>(&cuda_used_indices_, num_used_indices_size, __FILE__, __LINE__);
data_by_column_.resize(num_columns_, nullptr);
OMP_INIT_EX();
#pragma omp parallel for schedule(static) num_threads(num_threads_)
for (int column_index = 0; column_index < num_columns_; ++column_index) {
OMP_LOOP_EX_BEGIN();
const uint8_t bit_type = column_bit_type_[column_index];
if (bit_type == 8) {
uint8_t* column_data = nullptr;
AllocateCUDAMemory<uint8_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
} else if (bit_type == 16) {
uint16_t* column_data = nullptr;
AllocateCUDAMemory<uint16_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
} else if (bit_type == 32) {
uint32_t* column_data = nullptr;
AllocateCUDAMemory<uint32_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
}
OMP_LOOP_EX_END();
}
OMP_THROW_EX();
InitCUDAMemoryFromHostMemory<void*>(&cuda_data_by_column_, data_by_column_.data(), data_by_column_.size(), __FILE__, __LINE__);
InitColumnMetaInfo();
cur_subset_buffer_size_ = num_used_indices;
} else {
if (num_used_indices > cur_subset_buffer_size_) {
ResizeWhenCopySubrow(num_used_indices);
cur_subset_buffer_size_ = num_used_indices;
}
}
CopyFromHostToCUDADevice<data_size_t>(cuda_used_indices_, used_indices, static_cast<size_t>(num_used_indices), __FILE__, __LINE__);
num_used_indices_ = num_used_indices;
LaunchCopySubrowKernel(full_set->cuda_data_by_column());
}
void CUDAColumnData::ResizeWhenCopySubrow(const data_size_t num_used_indices) {
const size_t num_used_indices_size = static_cast<size_t>(num_used_indices);
DeallocateCUDAMemory<data_size_t>(&cuda_used_indices_, __FILE__, __LINE__);
AllocateCUDAMemory<data_size_t>(&cuda_used_indices_, num_used_indices_size, __FILE__, __LINE__);
OMP_INIT_EX();
#pragma omp parallel for schedule(static) num_threads(num_threads_)
for (int column_index = 0; column_index < num_columns_; ++column_index) {
OMP_LOOP_EX_BEGIN();
const uint8_t bit_type = column_bit_type_[column_index];
if (bit_type == 8) {
uint8_t* column_data = reinterpret_cast<uint8_t*>(data_by_column_[column_index]);
DeallocateCUDAMemory<uint8_t>(&column_data, __FILE__, __LINE__);
AllocateCUDAMemory<uint8_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
} else if (bit_type == 16) {
uint16_t* column_data = reinterpret_cast<uint16_t*>(data_by_column_[column_index]);
DeallocateCUDAMemory<uint16_t>(&column_data, __FILE__, __LINE__);
AllocateCUDAMemory<uint16_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
} else if (bit_type == 32) {
uint32_t* column_data = reinterpret_cast<uint32_t*>(data_by_column_[column_index]);
DeallocateCUDAMemory<uint32_t>(&column_data, __FILE__, __LINE__);
AllocateCUDAMemory<uint32_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
}
OMP_LOOP_EX_END();
}
OMP_THROW_EX();
DeallocateCUDAMemory<void*>(&cuda_data_by_column_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<void*>(&cuda_data_by_column_, data_by_column_.data(), data_by_column_.size(), __FILE__, __LINE__);
}
void CUDAColumnData::InitColumnMetaInfo() {
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_column_bit_type_,
column_bit_type_.data(),
column_bit_type_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_max_bin_,
feature_max_bin_.data(),
feature_max_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_min_bin_,
feature_min_bin_.data(),
feature_min_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_offset_,
feature_offset_.data(),
feature_offset_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_most_freq_bin_,
feature_most_freq_bin_.data(),
feature_most_freq_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_default_bin_,
feature_default_bin_.data(),
feature_default_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_feature_missing_is_zero_,
feature_missing_is_zero_.data(),
feature_missing_is_zero_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_feature_missing_is_na_,
feature_missing_is_na_.data(),
feature_missing_is_na_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_feature_mfb_is_zero_,
feature_mfb_is_zero_.data(),
feature_mfb_is_zero_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_feature_mfb_is_na_,
feature_mfb_is_na_.data(),
feature_mfb_is_na_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_feature_to_column_,
feature_to_column_.data(),
feature_to_column_.size(),
__FILE__,
__LINE__);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_column_data.hpp>
#define COPY_SUBROW_BLOCK_SIZE_COLUMN_DATA (1024)
namespace LightGBM {
__global__ void CopySubrowKernel_ColumnData(
void* const* in_cuda_data_by_column,
const uint8_t* cuda_column_bit_type,
const data_size_t* cuda_used_indices,
const data_size_t num_used_indices,
const int num_column,
void** out_cuda_data_by_column) {
const data_size_t local_data_index = static_cast<data_size_t>(threadIdx.x + blockIdx.x * blockDim.x);
if (local_data_index < num_used_indices) {
for (int column_index = 0; column_index < num_column; ++column_index) {
const void* in_column_data = in_cuda_data_by_column[column_index];
void* out_column_data = out_cuda_data_by_column[column_index];
const uint8_t bit_type = cuda_column_bit_type[column_index];
if (bit_type == 8) {
const uint8_t* true_in_column_data = reinterpret_cast<const uint8_t*>(in_column_data);
uint8_t* true_out_column_data = reinterpret_cast<uint8_t*>(out_column_data);
const data_size_t global_data_index = cuda_used_indices[local_data_index];
true_out_column_data[local_data_index] = true_in_column_data[global_data_index];
} else if (bit_type == 16) {
const uint16_t* true_in_column_data = reinterpret_cast<const uint16_t*>(in_column_data);
uint16_t* true_out_column_data = reinterpret_cast<uint16_t*>(out_column_data);
const data_size_t global_data_index = cuda_used_indices[local_data_index];
true_out_column_data[local_data_index] = true_in_column_data[global_data_index];
} else if (bit_type == 32) {
const uint32_t* true_in_column_data = reinterpret_cast<const uint32_t*>(in_column_data);
uint32_t* true_out_column_data = reinterpret_cast<uint32_t*>(out_column_data);
const data_size_t global_data_index = cuda_used_indices[local_data_index];
true_out_column_data[local_data_index] = true_in_column_data[global_data_index];
}
}
}
}
void CUDAColumnData::LaunchCopySubrowKernel(void* const* in_cuda_data_by_column) {
const int num_blocks = (num_used_indices_ + COPY_SUBROW_BLOCK_SIZE_COLUMN_DATA - 1) / COPY_SUBROW_BLOCK_SIZE_COLUMN_DATA;
CopySubrowKernel_ColumnData<<<num_blocks, COPY_SUBROW_BLOCK_SIZE_COLUMN_DATA>>>(
in_cuda_data_by_column,
cuda_column_bit_type_,
cuda_used_indices_,
num_used_indices_,
num_columns_,
cuda_data_by_column_);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_metadata.hpp>
namespace LightGBM {
CUDAMetadata::CUDAMetadata(const int gpu_device_id) {
if (gpu_device_id >= 0) {
SetCUDADevice(gpu_device_id, __FILE__, __LINE__);
} else {
SetCUDADevice(0, __FILE__, __LINE__);
}
cuda_label_ = nullptr;
cuda_weights_ = nullptr;
cuda_query_boundaries_ = nullptr;
cuda_query_weights_ = nullptr;
cuda_init_score_ = nullptr;
}
CUDAMetadata::~CUDAMetadata() {
DeallocateCUDAMemory<label_t>(&cuda_label_, __FILE__, __LINE__);
DeallocateCUDAMemory<label_t>(&cuda_weights_, __FILE__, __LINE__);
DeallocateCUDAMemory<data_size_t>(&cuda_query_boundaries_, __FILE__, __LINE__);
DeallocateCUDAMemory<label_t>(&cuda_query_weights_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_init_score_, __FILE__, __LINE__);
}
void CUDAMetadata::Init(const std::vector<label_t>& label,
const std::vector<label_t>& weight,
const std::vector<data_size_t>& query_boundaries,
const std::vector<label_t>& query_weights,
const std::vector<double>& init_score) {
if (label.size() == 0) {
cuda_label_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<label_t>(&cuda_label_, label.data(), label.size(), __FILE__, __LINE__);
}
if (weight.size() == 0) {
cuda_weights_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<label_t>(&cuda_weights_, weight.data(), weight.size(), __FILE__, __LINE__);
}
if (query_boundaries.size() == 0) {
cuda_query_boundaries_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<data_size_t>(&cuda_query_boundaries_, query_boundaries.data(), query_boundaries.size(), __FILE__, __LINE__);
}
if (query_weights.size() == 0) {
cuda_query_weights_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<label_t>(&cuda_query_weights_, query_weights.data(), query_weights.size(), __FILE__, __LINE__);
}
if (init_score.size() == 0) {
cuda_init_score_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<double>(&cuda_init_score_, init_score.data(), init_score.size(), __FILE__, __LINE__);
}
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDAMetadata::SetLabel(const label_t* label, data_size_t len) {
DeallocateCUDAMemory<label_t>(&cuda_label_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<label_t>(&cuda_label_, label, static_cast<size_t>(len), __FILE__, __LINE__);
}
void CUDAMetadata::SetWeights(const label_t* weights, data_size_t len) {
DeallocateCUDAMemory<label_t>(&cuda_weights_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<label_t>(&cuda_weights_, weights, static_cast<size_t>(len), __FILE__, __LINE__);
}
void CUDAMetadata::SetQuery(const data_size_t* query_boundaries, const label_t* query_weights, data_size_t num_queries) {
DeallocateCUDAMemory<data_size_t>(&cuda_query_boundaries_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<data_size_t>(&cuda_query_boundaries_, query_boundaries, static_cast<size_t>(num_queries) + 1, __FILE__, __LINE__);
if (query_weights != nullptr) {
DeallocateCUDAMemory<label_t>(&cuda_query_weights_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<label_t>(&cuda_query_weights_, query_weights, static_cast<size_t>(num_queries), __FILE__, __LINE__);
}
}
void CUDAMetadata::SetInitScore(const double* init_score, data_size_t len) {
DeallocateCUDAMemory<double>(&cuda_init_score_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_init_score_, init_score, static_cast<size_t>(len), __FILE__, __LINE__);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_row_data.hpp>
namespace LightGBM {
CUDARowData::CUDARowData(const Dataset* train_data,
const TrainingShareStates* train_share_state,
const int gpu_device_id,
const bool gpu_use_dp):
gpu_device_id_(gpu_device_id),
gpu_use_dp_(gpu_use_dp) {
num_threads_ = OMP_NUM_THREADS();
num_data_ = train_data->num_data();
const auto& feature_hist_offsets = train_share_state->feature_hist_offsets();
if (gpu_use_dp_) {
shared_hist_size_ = DP_SHARED_HIST_SIZE;
} else {
shared_hist_size_ = SP_SHARED_HIST_SIZE;
}
if (feature_hist_offsets.empty()) {
num_total_bin_ = 0;
} else {
num_total_bin_ = static_cast<int>(feature_hist_offsets.back());
}
num_feature_group_ = train_data->num_feature_groups();
num_feature_ = train_data->num_features();
if (gpu_device_id >= 0) {
SetCUDADevice(gpu_device_id, __FILE__, __LINE__);
} else {
SetCUDADevice(0, __FILE__, __LINE__);
}
cuda_data_uint8_t_ = nullptr;
cuda_data_uint16_t_ = nullptr;
cuda_data_uint32_t_ = nullptr;
cuda_row_ptr_uint16_t_ = nullptr;
cuda_row_ptr_uint32_t_ = nullptr;
cuda_row_ptr_uint64_t_ = nullptr;
cuda_partition_ptr_uint16_t_ = nullptr;
cuda_partition_ptr_uint32_t_ = nullptr;
cuda_partition_ptr_uint64_t_ = nullptr;
cuda_feature_partition_column_index_offsets_ = nullptr;
cuda_column_hist_offsets_ = nullptr;
cuda_partition_hist_offsets_ = nullptr;
cuda_block_buffer_uint16_t_ = nullptr;
cuda_block_buffer_uint32_t_ = nullptr;
cuda_block_buffer_uint64_t_ = nullptr;
}
CUDARowData::~CUDARowData() {
DeallocateCUDAMemory<uint8_t>(&cuda_data_uint8_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint16_t>(&cuda_data_uint16_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_data_uint32_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint16_t>(&cuda_row_ptr_uint16_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_row_ptr_uint32_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint64_t>(&cuda_row_ptr_uint64_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_feature_partition_column_index_offsets_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_column_hist_offsets_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_partition_hist_offsets_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint16_t>(&cuda_block_buffer_uint16_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_block_buffer_uint32_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint64_t>(&cuda_block_buffer_uint64_t_, __FILE__, __LINE__);
}
void CUDARowData::Init(const Dataset* train_data, TrainingShareStates* train_share_state) {
if (num_feature_ == 0) {
return;
}
DivideCUDAFeatureGroups(train_data, train_share_state);
bit_type_ = 0;
size_t total_size = 0;
const void* host_row_ptr = nullptr;
row_ptr_bit_type_ = 0;
const void* host_data = train_share_state->GetRowWiseData(&bit_type_, &total_size, &is_sparse_, &host_row_ptr, &row_ptr_bit_type_);
if (bit_type_ == 8) {
if (!is_sparse_) {
std::vector<uint8_t> partitioned_data;
GetDenseDataPartitioned<uint8_t>(reinterpret_cast<const uint8_t*>(host_data), &partitioned_data);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_data_uint8_t_, partitioned_data.data(), total_size, __FILE__, __LINE__);
} else {
if (row_ptr_bit_type_ == 16) {
InitSparseData<uint8_t, uint16_t>(
reinterpret_cast<const uint8_t*>(host_data),
reinterpret_cast<const uint16_t*>(host_row_ptr),
&cuda_data_uint8_t_,
&cuda_row_ptr_uint16_t_,
&cuda_partition_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
InitSparseData<uint8_t, uint32_t>(
reinterpret_cast<const uint8_t*>(host_data),
reinterpret_cast<const uint32_t*>(host_row_ptr),
&cuda_data_uint8_t_,
&cuda_row_ptr_uint32_t_,
&cuda_partition_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
InitSparseData<uint8_t, uint64_t>(
reinterpret_cast<const uint8_t*>(host_data),
reinterpret_cast<const uint64_t*>(host_row_ptr),
&cuda_data_uint8_t_,
&cuda_row_ptr_uint64_t_,
&cuda_partition_ptr_uint64_t_);
} else {
Log::Fatal("Unknow data ptr bit type %d", row_ptr_bit_type_);
}
}
} else if (bit_type_ == 16) {
if (!is_sparse_) {
std::vector<uint16_t> partitioned_data;
GetDenseDataPartitioned<uint16_t>(reinterpret_cast<const uint16_t*>(host_data), &partitioned_data);
InitCUDAMemoryFromHostMemory<uint16_t>(&cuda_data_uint16_t_, partitioned_data.data(), total_size, __FILE__, __LINE__);
} else {
if (row_ptr_bit_type_ == 16) {
InitSparseData<uint16_t, uint16_t>(
reinterpret_cast<const uint16_t*>(host_data),
reinterpret_cast<const uint16_t*>(host_row_ptr),
&cuda_data_uint16_t_,
&cuda_row_ptr_uint16_t_,
&cuda_partition_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
InitSparseData<uint16_t, uint32_t>(
reinterpret_cast<const uint16_t*>(host_data),
reinterpret_cast<const uint32_t*>(host_row_ptr),
&cuda_data_uint16_t_,
&cuda_row_ptr_uint32_t_,
&cuda_partition_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
InitSparseData<uint16_t, uint64_t>(
reinterpret_cast<const uint16_t*>(host_data),
reinterpret_cast<const uint64_t*>(host_row_ptr),
&cuda_data_uint16_t_,
&cuda_row_ptr_uint64_t_,
&cuda_partition_ptr_uint64_t_);
} else {
Log::Fatal("Unknow data ptr bit type %d", row_ptr_bit_type_);
}
}
} else if (bit_type_ == 32) {
if (!is_sparse_) {
std::vector<uint32_t> partitioned_data;
GetDenseDataPartitioned<uint32_t>(reinterpret_cast<const uint32_t*>(host_data), &partitioned_data);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_data_uint32_t_, partitioned_data.data(), total_size, __FILE__, __LINE__);
} else {
if (row_ptr_bit_type_ == 16) {
InitSparseData<uint32_t, uint16_t>(
reinterpret_cast<const uint32_t*>(host_data),
reinterpret_cast<const uint16_t*>(host_row_ptr),
&cuda_data_uint32_t_,
&cuda_row_ptr_uint16_t_,
&cuda_partition_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
InitSparseData<uint32_t, uint32_t>(
reinterpret_cast<const uint32_t*>(host_data),
reinterpret_cast<const uint32_t*>(host_row_ptr),
&cuda_data_uint32_t_,
&cuda_row_ptr_uint32_t_,
&cuda_partition_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
InitSparseData<uint32_t, uint64_t>(
reinterpret_cast<const uint32_t*>(host_data),
reinterpret_cast<const uint64_t*>(host_row_ptr),
&cuda_data_uint32_t_,
&cuda_row_ptr_uint64_t_,
&cuda_partition_ptr_uint64_t_);
} else {
Log::Fatal("Unknow data ptr bit type %d", row_ptr_bit_type_);
}
}
} else {
Log::Fatal("Unknow bit type = %d", bit_type_);
}
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDARowData::DivideCUDAFeatureGroups(const Dataset* train_data, TrainingShareStates* share_state) {
const uint32_t max_num_bin_per_partition = shared_hist_size_ / 2;
const std::vector<uint32_t>& column_hist_offsets = share_state->column_hist_offsets();
std::vector<int> feature_group_num_feature_offsets;
int offsets = 0;
int prev_group_index = -1;
for (int feature_index = 0; feature_index < num_feature_; ++feature_index) {
const int feature_group_index = train_data->Feature2Group(feature_index);
if (prev_group_index == -1 || feature_group_index != prev_group_index) {
feature_group_num_feature_offsets.emplace_back(offsets);
prev_group_index = feature_group_index;
}
++offsets;
}
CHECK_EQ(offsets, num_feature_);
feature_group_num_feature_offsets.emplace_back(offsets);
uint32_t start_hist_offset = 0;
feature_partition_column_index_offsets_.clear();
column_hist_offsets_.clear();
partition_hist_offsets_.clear();
feature_partition_column_index_offsets_.emplace_back(0);
partition_hist_offsets_.emplace_back(0);
const int num_feature_groups = train_data->num_feature_groups();
int column_index = 0;
num_feature_partitions_ = 0;
large_bin_partitions_.clear();
small_bin_partitions_.clear();
for (int feature_group_index = 0; feature_group_index < num_feature_groups; ++feature_group_index) {
if (!train_data->IsMultiGroup(feature_group_index)) {
const uint32_t column_feature_hist_start = column_hist_offsets[column_index];
const uint32_t column_feature_hist_end = column_hist_offsets[column_index + 1];
const uint32_t num_bin_in_dense_group = column_feature_hist_end - column_feature_hist_start;
// if one column has too many bins, use a separate partition for that column
if (num_bin_in_dense_group > max_num_bin_per_partition) {
feature_partition_column_index_offsets_.emplace_back(column_index + 1);
start_hist_offset = column_feature_hist_end;
partition_hist_offsets_.emplace_back(start_hist_offset);
large_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
column_hist_offsets_.emplace_back(0);
++column_index;
continue;
}
// check whether adding this column would exceed the maximum number of bins per partition
const uint32_t cur_hist_num_bin = column_feature_hist_end - start_hist_offset;
if (cur_hist_num_bin > max_num_bin_per_partition) {
feature_partition_column_index_offsets_.emplace_back(column_index);
start_hist_offset = column_feature_hist_start;
partition_hist_offsets_.emplace_back(start_hist_offset);
small_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
}
column_hist_offsets_.emplace_back(column_hist_offsets[column_index] - start_hist_offset);
if (feature_group_index == num_feature_groups - 1) {
feature_partition_column_index_offsets_.emplace_back(column_index + 1);
partition_hist_offsets_.emplace_back(column_hist_offsets.back());
small_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
}
++column_index;
} else {
const int group_feature_index_start = feature_group_num_feature_offsets[feature_group_index];
const int num_feature_in_group = feature_group_num_feature_offsets[feature_group_index + 1] - group_feature_index_start;
for (int sub_feature_index = 0; sub_feature_index < num_feature_in_group; ++sub_feature_index) {
const int feature_index = group_feature_index_start + sub_feature_index;
const uint32_t column_feature_hist_start = column_hist_offsets[column_index];
const uint32_t column_feature_hist_end = column_hist_offsets[column_index + 1];
const uint32_t num_bin_in_dense_group = column_feature_hist_end - column_feature_hist_start;
// if one column has too many bins, use a separate partition for that column
if (num_bin_in_dense_group > max_num_bin_per_partition) {
feature_partition_column_index_offsets_.emplace_back(column_index + 1);
start_hist_offset = column_feature_hist_end;
partition_hist_offsets_.emplace_back(start_hist_offset);
large_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
column_hist_offsets_.emplace_back(0);
++column_index;
continue;
}
        // check whether adding this column would exceed the maximum number of bins per partition
const uint32_t cur_hist_num_bin = column_feature_hist_end - start_hist_offset;
if (cur_hist_num_bin > max_num_bin_per_partition) {
feature_partition_column_index_offsets_.emplace_back(column_index);
start_hist_offset = column_feature_hist_start;
partition_hist_offsets_.emplace_back(start_hist_offset);
small_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
}
column_hist_offsets_.emplace_back(column_hist_offsets[column_index] - start_hist_offset);
if (feature_group_index == num_feature_groups - 1 && sub_feature_index == num_feature_in_group - 1) {
CHECK_EQ(feature_index, num_feature_ - 1);
feature_partition_column_index_offsets_.emplace_back(column_index + 1);
partition_hist_offsets_.emplace_back(column_hist_offsets.back());
small_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
}
++column_index;
}
}
}
column_hist_offsets_.emplace_back(column_hist_offsets.back() - start_hist_offset);
max_num_column_per_partition_ = 0;
for (size_t i = 0; i < feature_partition_column_index_offsets_.size() - 1; ++i) {
const int num_column = feature_partition_column_index_offsets_[i + 1] - feature_partition_column_index_offsets_[i];
if (num_column > max_num_column_per_partition_) {
max_num_column_per_partition_ = num_column;
}
}
InitCUDAMemoryFromHostMemory<int>(&cuda_feature_partition_column_index_offsets_,
feature_partition_column_index_offsets_.data(),
feature_partition_column_index_offsets_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_column_hist_offsets_,
column_hist_offsets_.data(),
column_hist_offsets_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_partition_hist_offsets_,
partition_hist_offsets_.data(),
partition_hist_offsets_.size(),
__FILE__,
__LINE__);
}
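// Note (editorial, illustrative): GetDenseDataPartitioned below regroups the row-major
// bin matrix so that each feature partition occupies one contiguous block. With two
// rows and partitions {c0, c1} and {c2}, the input
//   [r0c0, r0c1, r0c2, r1c0, r1c1, r1c2]
// becomes
//   [r0c0, r0c1, r1c0, r1c1 | r0c2, r1c2]
// so that a histogram-construction block presumably only reads its own partition's block.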
template <typename BIN_TYPE>
void CUDARowData::GetDenseDataPartitioned(const BIN_TYPE* row_wise_data, std::vector<BIN_TYPE>* partitioned_data) {
const int num_total_columns = feature_partition_column_index_offsets_.back();
partitioned_data->resize(static_cast<size_t>(num_total_columns) * static_cast<size_t>(num_data_), 0);
BIN_TYPE* out_data = partitioned_data->data();
Threading::For<data_size_t>(0, num_data_, 512,
[this, num_total_columns, row_wise_data, out_data] (int /*thread_index*/, data_size_t start, data_size_t end) {
for (size_t i = 0; i < feature_partition_column_index_offsets_.size() - 1; ++i) {
const int num_prev_columns = static_cast<int>(feature_partition_column_index_offsets_[i]);
const data_size_t offset = num_data_ * num_prev_columns;
const int partition_column_start = feature_partition_column_index_offsets_[i];
const int partition_column_end = feature_partition_column_index_offsets_[i + 1];
const int num_columns_in_cur_partition = partition_column_end - partition_column_start;
for (data_size_t data_index = start; data_index < end; ++data_index) {
const data_size_t data_offset = offset + data_index * num_columns_in_cur_partition;
const data_size_t read_data_offset = data_index * num_total_columns;
for (int column_index = 0; column_index < num_columns_in_cur_partition; ++column_index) {
const int true_column_index = read_data_offset + column_index + partition_column_start;
const BIN_TYPE bin = row_wise_data[true_column_index];
out_data[data_offset + column_index] = bin;
}
}
}
});
}
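// Note (editorial): GetSparseDataPartitioned below slices each CSR row into per-partition
// CSR buffers. It appears to rely on bin values within a row being stored in ascending
// order and on each partition owning a contiguous bin range
// [partition_hist_start, partition_hist_end), so that std::lower_bound can locate the
// sub-range of a row that belongs to the partition; bins are then re-based to the
// partition-local offset (bin - partition_hist_start).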
template <typename BIN_TYPE, typename DATA_PTR_TYPE>
void CUDARowData::GetSparseDataPartitioned(
const BIN_TYPE* row_wise_data,
const DATA_PTR_TYPE* row_ptr,
std::vector<std::vector<BIN_TYPE>>* partitioned_data,
std::vector<std::vector<DATA_PTR_TYPE>>* partitioned_row_ptr,
std::vector<DATA_PTR_TYPE>* partition_ptr) {
const int num_partitions = static_cast<int>(feature_partition_column_index_offsets_.size()) - 1;
partitioned_data->resize(num_partitions);
partitioned_row_ptr->resize(num_partitions);
std::vector<int> thread_max_elements_per_row(num_threads_, 0);
Threading::For<int>(0, num_partitions, 1,
[partitioned_data, partitioned_row_ptr, row_ptr, row_wise_data, &thread_max_elements_per_row, this] (int thread_index, int start, int end) {
for (int partition_index = start; partition_index < end; ++partition_index) {
std::vector<BIN_TYPE>& data_for_this_partition = partitioned_data->at(partition_index);
std::vector<DATA_PTR_TYPE>& row_ptr_for_this_partition = partitioned_row_ptr->at(partition_index);
const int partition_hist_start = partition_hist_offsets_[partition_index];
const int partition_hist_end = partition_hist_offsets_[partition_index + 1];
DATA_PTR_TYPE offset = 0;
row_ptr_for_this_partition.clear();
data_for_this_partition.clear();
row_ptr_for_this_partition.emplace_back(offset);
for (data_size_t data_index = 0; data_index < num_data_; ++data_index) {
const DATA_PTR_TYPE row_start = row_ptr[data_index];
const DATA_PTR_TYPE row_end = row_ptr[data_index + 1];
const BIN_TYPE* row_data_start = row_wise_data + row_start;
const BIN_TYPE* row_data_end = row_wise_data + row_end;
const size_t partition_start_in_row = std::lower_bound(row_data_start, row_data_end, partition_hist_start) - row_data_start;
const size_t partition_end_in_row = std::lower_bound(row_data_start, row_data_end, partition_hist_end) - row_data_start;
for (size_t pos = partition_start_in_row; pos < partition_end_in_row; ++pos) {
const BIN_TYPE bin = row_data_start[pos];
CHECK_GE(bin, static_cast<BIN_TYPE>(partition_hist_start));
data_for_this_partition.emplace_back(bin - partition_hist_start);
}
CHECK_GE(partition_end_in_row, partition_start_in_row);
const data_size_t num_elements_in_row = partition_end_in_row - partition_start_in_row;
offset += static_cast<DATA_PTR_TYPE>(num_elements_in_row);
row_ptr_for_this_partition.emplace_back(offset);
if (num_elements_in_row > thread_max_elements_per_row[thread_index]) {
thread_max_elements_per_row[thread_index] = num_elements_in_row;
}
}
}
});
partition_ptr->clear();
DATA_PTR_TYPE offset = 0;
partition_ptr->emplace_back(offset);
for (size_t i = 0; i < partitioned_row_ptr->size(); ++i) {
offset += partitioned_row_ptr->at(i).back();
partition_ptr->emplace_back(offset);
}
max_num_column_per_partition_ = 0;
for (int thread_index = 0; thread_index < num_threads_; ++thread_index) {
if (thread_max_elements_per_row[thread_index] > max_num_column_per_partition_) {
max_num_column_per_partition_ = thread_max_elements_per_row[thread_index];
}
}
}
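// Note (editorial): InitSparseData below flattens the per-partition CSR buffers into
// three device allocations:
//   cuda_data          - bin values of all partitions, concatenated
//   cuda_row_ptr       - num_partitions blocks of (num_data_ + 1) row offsets each
//   cuda_partition_ptr - prefix sums giving each partition's start inside cuda_data
// For hypothetical partition sizes {10, 4}, partition_ptr would be {0, 10, 14} and
// partition 1's row offsets would start at cuda_row_ptr + 1 * (num_data_ + 1).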
template <typename BIN_TYPE, typename ROW_PTR_TYPE>
void CUDARowData::InitSparseData(const BIN_TYPE* host_data,
const ROW_PTR_TYPE* host_row_ptr,
BIN_TYPE** cuda_data,
ROW_PTR_TYPE** cuda_row_ptr,
ROW_PTR_TYPE** cuda_partition_ptr) {
std::vector<std::vector<BIN_TYPE>> partitioned_data;
std::vector<std::vector<ROW_PTR_TYPE>> partitioned_data_ptr;
std::vector<ROW_PTR_TYPE> partition_ptr;
GetSparseDataPartitioned<BIN_TYPE, ROW_PTR_TYPE>(host_data, host_row_ptr, &partitioned_data, &partitioned_data_ptr, &partition_ptr);
InitCUDAMemoryFromHostMemory<ROW_PTR_TYPE>(cuda_partition_ptr, partition_ptr.data(), partition_ptr.size(), __FILE__, __LINE__);
AllocateCUDAMemory<BIN_TYPE>(cuda_data, partition_ptr.back(), __FILE__, __LINE__);
AllocateCUDAMemory<ROW_PTR_TYPE>(cuda_row_ptr, (num_data_ + 1) * partitioned_data_ptr.size(), __FILE__, __LINE__);
for (size_t i = 0; i < partitioned_data.size(); ++i) {
const std::vector<ROW_PTR_TYPE>& data_ptr_for_this_partition = partitioned_data_ptr[i];
const std::vector<BIN_TYPE>& data_for_this_partition = partitioned_data[i];
CopyFromHostToCUDADevice<BIN_TYPE>((*cuda_data) + partition_ptr[i], data_for_this_partition.data(), data_for_this_partition.size(), __FILE__, __LINE__);
CopyFromHostToCUDADevice<ROW_PTR_TYPE>((*cuda_row_ptr) + i * (num_data_ + 1), data_ptr_for_this_partition.data(), data_ptr_for_this_partition.size(), __FILE__, __LINE__);
}
}
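// Note (editorial): the accessors below dispatch on the element widths chosen at build
// time; callers must pass the matching template argument. A hedged usage sketch with a
// hypothetical CUDARowData instance row_data, assuming bit_type_ == 16 and
// row_ptr_bit_type_ == 32:
//   const uint16_t* bins    = row_data.GetBin<uint16_t>();
//   const uint32_t* row_ptr = row_data.GetRowPtr<uint32_t>();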
template <typename BIN_TYPE>
const BIN_TYPE* CUDARowData::GetBin() const {
if (bit_type_ == 8) {
return reinterpret_cast<const BIN_TYPE*>(cuda_data_uint8_t_);
} else if (bit_type_ == 16) {
return reinterpret_cast<const BIN_TYPE*>(cuda_data_uint16_t_);
} else if (bit_type_ == 32) {
return reinterpret_cast<const BIN_TYPE*>(cuda_data_uint32_t_);
} else {
Log::Fatal("Unknown bit_type %d for GetBin.", bit_type_);
}
}
template const uint8_t* CUDARowData::GetBin<uint8_t>() const;
template const uint16_t* CUDARowData::GetBin<uint16_t>() const;
template const uint32_t* CUDARowData::GetBin<uint32_t>() const;
template <typename PTR_TYPE>
const PTR_TYPE* CUDARowData::GetRowPtr() const {
if (row_ptr_bit_type_ == 16) {
return reinterpret_cast<const PTR_TYPE*>(cuda_row_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
return reinterpret_cast<const PTR_TYPE*>(cuda_row_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
return reinterpret_cast<const PTR_TYPE*>(cuda_row_ptr_uint64_t_);
} else {
Log::Fatal("Unknown row_ptr_bit_type = %d for GetRowPtr.", row_ptr_bit_type_);
}
}
template const uint16_t* CUDARowData::GetRowPtr<uint16_t>() const;
template const uint32_t* CUDARowData::GetRowPtr<uint32_t>() const;
template const uint64_t* CUDARowData::GetRowPtr<uint64_t>() const;
template <typename PTR_TYPE>
const PTR_TYPE* CUDARowData::GetPartitionPtr() const {
if (row_ptr_bit_type_ == 16) {
return reinterpret_cast<const PTR_TYPE*>(cuda_partition_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
return reinterpret_cast<const PTR_TYPE*>(cuda_partition_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
return reinterpret_cast<const PTR_TYPE*>(cuda_partition_ptr_uint64_t_);
} else {
Log::Fatal("Unknown row_ptr_bit_type = %d for GetPartitionPtr.", row_ptr_bit_type_);
}
}
template const uint16_t* CUDARowData::GetPartitionPtr<uint16_t>() const;
template const uint32_t* CUDARowData::GetPartitionPtr<uint32_t>() const;
template const uint64_t* CUDARowData::GetPartitionPtr<uint64_t>() const;
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_tree.hpp>
namespace LightGBM {
CUDATree::CUDATree(int max_leaves, bool track_branch_features, bool is_linear,
const int gpu_device_id, const bool has_categorical_feature):
Tree(max_leaves, track_branch_features, is_linear),
num_threads_per_block_add_prediction_to_score_(1024) {
is_cuda_tree_ = true;
if (gpu_device_id >= 0) {
SetCUDADevice(gpu_device_id, __FILE__, __LINE__);
} else {
SetCUDADevice(0, __FILE__, __LINE__);
}
if (has_categorical_feature) {
cuda_cat_boundaries_.Resize(max_leaves);
cuda_cat_boundaries_inner_.Resize(max_leaves);
}
InitCUDAMemory();
}
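// Note (editorial): this second constructor copies an already-built host Tree onto the
// device (InitCUDA below uploads the host arrays), whereas the constructor above only
// allocates empty buffers for a tree that is grown directly on the device.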
CUDATree::CUDATree(const Tree* host_tree):
Tree(*host_tree),
num_threads_per_block_add_prediction_to_score_(1024) {
is_cuda_tree_ = true;
InitCUDA();
}
CUDATree::~CUDATree() {
DeallocateCUDAMemory<int>(&cuda_left_child_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_right_child_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_split_feature_inner_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_split_feature_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_leaf_depth_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_leaf_parent_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_threshold_in_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_threshold_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_internal_weight_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_internal_value_, __FILE__, __LINE__);
DeallocateCUDAMemory<int8_t>(&cuda_decision_type_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_leaf_value_, __FILE__, __LINE__);
DeallocateCUDAMemory<data_size_t>(&cuda_leaf_count_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_leaf_weight_, __FILE__, __LINE__);
DeallocateCUDAMemory<data_size_t>(&cuda_internal_count_, __FILE__, __LINE__);
DeallocateCUDAMemory<float>(&cuda_split_gain_, __FILE__, __LINE__);
gpuAssert(cudaStreamDestroy(cuda_stream_), __FILE__, __LINE__);
}
void CUDATree::InitCUDAMemory() {
AllocateCUDAMemory<int>(&cuda_left_child_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_right_child_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_split_feature_inner_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_split_feature_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_leaf_depth_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_leaf_parent_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<uint32_t>(&cuda_threshold_in_bin_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_threshold_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int8_t>(&cuda_decision_type_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_leaf_value_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_internal_weight_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_internal_value_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_leaf_weight_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<data_size_t>(&cuda_leaf_count_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<data_size_t>(&cuda_internal_count_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<float>(&cuda_split_gain_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
SetCUDAMemory<double>(cuda_leaf_value_, 0.0f, 1, __FILE__, __LINE__);
SetCUDAMemory<double>(cuda_leaf_weight_, 0.0f, 1, __FILE__, __LINE__);
SetCUDAMemory<int>(cuda_leaf_parent_, -1, 1, __FILE__, __LINE__);
CUDASUCCESS_OR_FATAL(cudaStreamCreate(&cuda_stream_));
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDATree::InitCUDA() {
InitCUDAMemoryFromHostMemory<int>(&cuda_left_child_,
left_child_.data(),
left_child_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_right_child_,
right_child_.data(),
right_child_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_split_feature_inner_,
split_feature_inner_.data(),
split_feature_inner_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_split_feature_,
split_feature_.data(),
split_feature_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_threshold_in_bin_,
threshold_in_bin_.data(),
threshold_in_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_threshold_,
threshold_.data(),
threshold_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_leaf_depth_,
leaf_depth_.data(),
leaf_depth_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int8_t>(&cuda_decision_type_,
decision_type_.data(),
decision_type_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_internal_weight_,
internal_weight_.data(),
internal_weight_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_internal_value_,
internal_value_.data(),
internal_value_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<data_size_t>(&cuda_internal_count_,
internal_count_.data(),
internal_count_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<data_size_t>(&cuda_leaf_count_,
leaf_count_.data(),
leaf_count_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<float>(&cuda_split_gain_,
split_gain_.data(),
split_gain_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_leaf_value_,
leaf_value_.data(),
leaf_value_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_leaf_weight_,
leaf_weight_.data(),
leaf_weight_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_leaf_parent_,
leaf_parent_.data(),
leaf_parent_.size(),
__FILE__,
__LINE__);
CUDASUCCESS_OR_FATAL(cudaStreamCreate(&cuda_stream_));
SynchronizeCUDADevice(__FILE__, __LINE__);
}
int CUDATree::Split(const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info) {
LaunchSplitKernel(leaf_index, real_feature_index, real_threshold, missing_type, cuda_split_info);
++num_leaves_;
return num_leaves_ - 1;
}
int CUDATree::SplitCategorical(const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
uint32_t* cuda_bitset,
size_t cuda_bitset_len,
uint32_t* cuda_bitset_inner,
size_t cuda_bitset_inner_len) {
LaunchSplitCategoricalKernel(leaf_index, real_feature_index,
missing_type, cuda_split_info,
cuda_bitset_len, cuda_bitset_inner_len);
cuda_bitset_.PushBack(cuda_bitset, cuda_bitset_len);
cuda_bitset_inner_.PushBack(cuda_bitset_inner, cuda_bitset_inner_len);
++num_leaves_;
++num_cat_;
return num_leaves_ - 1;
}
inline void CUDATree::Shrinkage(double rate) {
Tree::Shrinkage(rate);
LaunchShrinkageKernel(rate);
}
inline void CUDATree::AddBias(double val) {
Tree::AddBias(val);
LaunchAddBiasKernel(val);
}
void CUDATree::ToHost() {
left_child_.resize(max_leaves_ - 1);
right_child_.resize(max_leaves_ - 1);
split_feature_inner_.resize(max_leaves_ - 1);
split_feature_.resize(max_leaves_ - 1);
threshold_in_bin_.resize(max_leaves_ - 1);
threshold_.resize(max_leaves_ - 1);
decision_type_.resize(max_leaves_ - 1, 0);
split_gain_.resize(max_leaves_ - 1);
leaf_parent_.resize(max_leaves_);
leaf_value_.resize(max_leaves_);
leaf_weight_.resize(max_leaves_);
leaf_count_.resize(max_leaves_);
internal_value_.resize(max_leaves_ - 1);
internal_weight_.resize(max_leaves_ - 1);
internal_count_.resize(max_leaves_ - 1);
leaf_depth_.resize(max_leaves_);
const size_t num_leaves_size = static_cast<size_t>(num_leaves_);
CopyFromCUDADeviceToHost<int>(left_child_.data(), cuda_left_child_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(right_child_.data(), cuda_right_child_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(split_feature_inner_.data(), cuda_split_feature_inner_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(split_feature_.data(), cuda_split_feature_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<uint32_t>(threshold_in_bin_.data(), cuda_threshold_in_bin_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(threshold_.data(), cuda_threshold_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int8_t>(decision_type_.data(), cuda_decision_type_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<float>(split_gain_.data(), cuda_split_gain_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(leaf_parent_.data(), cuda_leaf_parent_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(leaf_value_.data(), cuda_leaf_value_, num_leaves_size, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(leaf_weight_.data(), cuda_leaf_weight_, num_leaves_size, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<data_size_t>(leaf_count_.data(), cuda_leaf_count_, num_leaves_size, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(internal_value_.data(), cuda_internal_value_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(internal_weight_.data(), cuda_internal_weight_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<data_size_t>(internal_count_.data(), cuda_internal_count_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(leaf_depth_.data(), cuda_leaf_depth_, num_leaves_size, __FILE__, __LINE__);
if (num_cat_ > 0) {
cuda_cat_boundaries_inner_.Resize(num_cat_ + 1);
cuda_cat_boundaries_.Resize(num_cat_ + 1);
cat_boundaries_ = cuda_cat_boundaries_.ToHost();
cat_boundaries_inner_ = cuda_cat_boundaries_inner_.ToHost();
cat_threshold_ = cuda_bitset_.ToHost();
cat_threshold_inner_ = cuda_bitset_inner_.ToHost();
}
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDATree::SyncLeafOutputFromHostToCUDA() {
CopyFromHostToCUDADevice<double>(cuda_leaf_value_, leaf_value_.data(), leaf_value_.size(), __FILE__, __LINE__);
}
void CUDATree::SyncLeafOutputFromCUDAToHost() {
CopyFromCUDADeviceToHost<double>(leaf_value_.data(), cuda_leaf_value_, leaf_value_.size(), __FILE__, __LINE__);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_tree.hpp>
namespace LightGBM {
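// Note (editorial): the helpers below mirror the host-side decision_type encoding: as
// implied by the masks used in this file, bit 0 flags a categorical split, bit 1 flags
// default-left for missing values, and bits 2-3 store the MissingType. For the
// single-bit masks used here, (127 - mask) simply clears the masked bit.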
__device__ void SetDecisionTypeCUDA(int8_t* decision_type, bool input, int8_t mask) {
if (input) {
(*decision_type) |= mask;
} else {
(*decision_type) &= (127 - mask);
}
}
__device__ void SetMissingTypeCUDA(int8_t* decision_type, int8_t input) {
(*decision_type) &= 3;
(*decision_type) |= (input << 2);
}
__device__ bool GetDecisionTypeCUDA(int8_t decision_type, int8_t mask) {
return (decision_type & mask) > 0;
}
__device__ int8_t GetMissingTypeCUDA(int8_t decision_type) {
return (decision_type >> 2) & 3;
}
__device__ bool IsZeroCUDA(double fval) {
return (fval >= -kZeroThreshold && fval <= kZeroThreshold);
}
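// Note (editorial): in the split kernels below, leaves are stored as bitwise complements
// (~leaf_index) inside left_child/right_child, matching the host Tree convention, and
// each independent piece of split bookkeeping is assigned to its own thread index
// instead of being looped over by a single thread.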
__global__ void SplitKernel( // split information
const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
// tree structure
const int num_leaves,
int* leaf_parent,
int* leaf_depth,
int* left_child,
int* right_child,
int* split_feature_inner,
int* split_feature,
float* split_gain,
double* internal_weight,
double* internal_value,
data_size_t* internal_count,
double* leaf_weight,
double* leaf_value,
data_size_t* leaf_count,
int8_t* decision_type,
uint32_t* threshold_in_bin,
double* threshold) {
const int new_node_index = num_leaves - 1;
const int thread_index = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
const int parent_index = leaf_parent[leaf_index];
if (thread_index == 0) {
if (parent_index >= 0) {
      // if the current node is the left child of its parent
if (left_child[parent_index] == ~leaf_index) {
left_child[parent_index] = new_node_index;
} else {
right_child[parent_index] = new_node_index;
}
}
left_child[new_node_index] = ~leaf_index;
right_child[new_node_index] = ~num_leaves;
leaf_parent[leaf_index] = new_node_index;
leaf_parent[num_leaves] = new_node_index;
} else if (thread_index == 1) {
// add new node
split_feature_inner[new_node_index] = cuda_split_info->inner_feature_index;
} else if (thread_index == 2) {
split_feature[new_node_index] = real_feature_index;
} else if (thread_index == 3) {
split_gain[new_node_index] = static_cast<float>(cuda_split_info->gain);
} else if (thread_index == 4) {
// save current leaf value to internal node before change
internal_weight[new_node_index] = leaf_weight[leaf_index];
leaf_weight[leaf_index] = cuda_split_info->left_sum_hessians;
} else if (thread_index == 5) {
internal_value[new_node_index] = leaf_value[leaf_index];
leaf_value[leaf_index] = isnan(cuda_split_info->left_value) ? 0.0f : cuda_split_info->left_value;
} else if (thread_index == 6) {
internal_count[new_node_index] = cuda_split_info->left_count + cuda_split_info->right_count;
} else if (thread_index == 7) {
leaf_count[leaf_index] = cuda_split_info->left_count;
} else if (thread_index == 8) {
leaf_value[num_leaves] = isnan(cuda_split_info->right_value) ? 0.0f : cuda_split_info->right_value;
} else if (thread_index == 9) {
leaf_weight[num_leaves] = cuda_split_info->right_sum_hessians;
} else if (thread_index == 10) {
leaf_count[num_leaves] = cuda_split_info->right_count;
} else if (thread_index == 11) {
// update leaf depth
leaf_depth[num_leaves] = leaf_depth[leaf_index] + 1;
leaf_depth[leaf_index]++;
} else if (thread_index == 12) {
decision_type[new_node_index] = 0;
SetDecisionTypeCUDA(&decision_type[new_node_index], false, kCategoricalMask);
SetDecisionTypeCUDA(&decision_type[new_node_index], cuda_split_info->default_left, kDefaultLeftMask);
SetMissingTypeCUDA(&decision_type[new_node_index], static_cast<int8_t>(missing_type));
} else if (thread_index == 13) {
threshold_in_bin[new_node_index] = cuda_split_info->threshold;
} else if (thread_index == 14) {
threshold[new_node_index] = real_threshold;
}
}
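// Note (editorial): the <<<3, 5>>> launch below provides 3 * 5 = 15 threads, one per
// branch (thread_index 0..14) of SplitKernel above; the work per split is tiny, so the
// whole tree-structure update runs as a single short kernel on cuda_stream_.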
void CUDATree::LaunchSplitKernel(const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info) {
SplitKernel<<<3, 5, 0, cuda_stream_>>>(
// split information
leaf_index,
real_feature_index,
real_threshold,
missing_type,
cuda_split_info,
// tree structure
num_leaves_,
cuda_leaf_parent_,
cuda_leaf_depth_,
cuda_left_child_,
cuda_right_child_,
cuda_split_feature_inner_,
cuda_split_feature_,
cuda_split_gain_,
cuda_internal_weight_,
cuda_internal_value_,
cuda_internal_count_,
cuda_leaf_weight_,
cuda_leaf_value_,
cuda_leaf_count_,
cuda_decision_type_,
cuda_threshold_in_bin_,
cuda_threshold_);
}
__global__ void SplitCategoricalKernel( // split information
const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
// tree structure
const int num_leaves,
int* leaf_parent,
int* leaf_depth,
int* left_child,
int* right_child,
int* split_feature_inner,
int* split_feature,
float* split_gain,
double* internal_weight,
double* internal_value,
data_size_t* internal_count,
double* leaf_weight,
double* leaf_value,
data_size_t* leaf_count,
int8_t* decision_type,
uint32_t* threshold_in_bin,
double* threshold,
size_t cuda_bitset_len,
size_t cuda_bitset_inner_len,
int num_cat,
int* cuda_cat_boundaries,
int* cuda_cat_boundaries_inner) {
const int new_node_index = num_leaves - 1;
const int thread_index = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
const int parent_index = leaf_parent[leaf_index];
if (thread_index == 0) {
if (parent_index >= 0) {
      // if the current node is the left child of its parent
if (left_child[parent_index] == ~leaf_index) {
left_child[parent_index] = new_node_index;
} else {
right_child[parent_index] = new_node_index;
}
}
left_child[new_node_index] = ~leaf_index;
right_child[new_node_index] = ~num_leaves;
leaf_parent[leaf_index] = new_node_index;
leaf_parent[num_leaves] = new_node_index;
} else if (thread_index == 1) {
// add new node
split_feature_inner[new_node_index] = cuda_split_info->inner_feature_index;
} else if (thread_index == 2) {
split_feature[new_node_index] = real_feature_index;
} else if (thread_index == 3) {
split_gain[new_node_index] = static_cast<float>(cuda_split_info->gain);
} else if (thread_index == 4) {
// save current leaf value to internal node before change
internal_weight[new_node_index] = leaf_weight[leaf_index];
leaf_weight[leaf_index] = cuda_split_info->left_sum_hessians;
} else if (thread_index == 5) {
internal_value[new_node_index] = leaf_value[leaf_index];
leaf_value[leaf_index] = isnan(cuda_split_info->left_value) ? 0.0f : cuda_split_info->left_value;
} else if (thread_index == 6) {
internal_count[new_node_index] = cuda_split_info->left_count + cuda_split_info->right_count;
} else if (thread_index == 7) {
leaf_count[leaf_index] = cuda_split_info->left_count;
} else if (thread_index == 8) {
leaf_value[num_leaves] = isnan(cuda_split_info->right_value) ? 0.0f : cuda_split_info->right_value;
} else if (thread_index == 9) {
leaf_weight[num_leaves] = cuda_split_info->right_sum_hessians;
} else if (thread_index == 10) {
leaf_count[num_leaves] = cuda_split_info->right_count;
} else if (thread_index == 11) {
// update leaf depth
leaf_depth[num_leaves] = leaf_depth[leaf_index] + 1;
leaf_depth[leaf_index]++;
} else if (thread_index == 12) {
decision_type[new_node_index] = 0;
SetDecisionTypeCUDA(&decision_type[new_node_index], true, kCategoricalMask);
SetMissingTypeCUDA(&decision_type[new_node_index], static_cast<int8_t>(missing_type));
} else if (thread_index == 13) {
threshold_in_bin[new_node_index] = num_cat;
} else if (thread_index == 14) {
threshold[new_node_index] = num_cat;
} else if (thread_index == 15) {
if (num_cat == 0) {
cuda_cat_boundaries[num_cat] = 0;
}
cuda_cat_boundaries[num_cat + 1] = cuda_cat_boundaries[num_cat] + cuda_bitset_len;
} else if (thread_index == 16) {
if (num_cat == 0) {
cuda_cat_boundaries_inner[num_cat] = 0;
}
cuda_cat_boundaries_inner[num_cat + 1] = cuda_cat_boundaries_inner[num_cat] + cuda_bitset_inner_len;
}
}
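// Note (editorial): SplitCategoricalKernel has two extra branches (thread_index 15 and
// 16) that maintain the categorical bitset boundaries, so the launch below uses
// 3 * 6 = 18 threads to cover indices 0..16.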
void CUDATree::LaunchSplitCategoricalKernel(const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
size_t cuda_bitset_len,
size_t cuda_bitset_inner_len) {
SplitCategoricalKernel<<<3, 6, 0, cuda_stream_>>>(
// split information
leaf_index,
real_feature_index,
missing_type,
cuda_split_info,
// tree structure
num_leaves_,
cuda_leaf_parent_,
cuda_leaf_depth_,
cuda_left_child_,
cuda_right_child_,
cuda_split_feature_inner_,
cuda_split_feature_,
cuda_split_gain_,
cuda_internal_weight_,
cuda_internal_value_,
cuda_internal_count_,
cuda_leaf_weight_,
cuda_leaf_value_,
cuda_leaf_count_,
cuda_decision_type_,
cuda_threshold_in_bin_,
cuda_threshold_,
cuda_bitset_len,
cuda_bitset_inner_len,
num_cat_,
cuda_cat_boundaries_.RawData(),
cuda_cat_boundaries_inner_.RawData());
}
__global__ void ShrinkageKernel(const double rate, double* cuda_leaf_value, const int num_leaves) {
const int leaf_index = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
if (leaf_index < num_leaves) {
cuda_leaf_value[leaf_index] *= rate;
}
}
void CUDATree::LaunchShrinkageKernel(const double rate) {
const int num_threads_per_block = 1024;
const int num_blocks = (num_leaves_ + num_threads_per_block - 1) / num_threads_per_block;
ShrinkageKernel<<<num_blocks, num_threads_per_block>>>(rate, cuda_leaf_value_, num_leaves_);
}
__global__ void AddBiasKernel(const double val, double* cuda_leaf_value, const int num_leaves) {
const int leaf_index = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
if (leaf_index < num_leaves) {
cuda_leaf_value[leaf_index] += val;
}
}
void CUDATree::LaunchAddBiasKernel(const double val) {
const int num_threads_per_block = 1024;
const int num_blocks = (num_leaves_ + num_threads_per_block - 1) / num_threads_per_block;
AddBiasKernel<<<num_blocks, num_threads_per_block>>>(val, cuda_leaf_value_, num_leaves_);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
@@ -340,17 +340,18 @@ void Dataset::Construct(std::vector<std::unique_ptr<BinMapper>>* bin_mappers,
auto features_in_group = NoGroup(used_features);
auto is_sparse = io_config.is_enable_sparse;
if (io_config.device_type == std::string("cuda")) {
if (io_config.device_type == std::string("cuda") || io_config.device_type == std::string("cuda_exp")) {
LGBM_config_::current_device = lgbm_device_cuda;
if (is_sparse) {
if (io_config.device_type == std::string("cuda") && is_sparse) {
Log::Warning("Using sparse features with CUDA is currently not supported.");
is_sparse = false;
}
is_sparse = false;
}
std::vector<int8_t> group_is_multi_val(used_features.size(), 0);
if (io_config.enable_bundle && !used_features.empty()) {
bool lgbm_is_gpu_used = io_config.device_type == std::string("gpu") || io_config.device_type == std::string("cuda");
bool lgbm_is_gpu_used = io_config.device_type == std::string("gpu") || io_config.device_type == std::string("cuda")
|| io_config.device_type == std::string("cuda_exp");
features_in_group = FastFeatureBundling(
*bin_mappers, sample_non_zero_indices, sample_values, num_per_col,
num_sample_col, static_cast<data_size_t>(total_sample_cnt),
@@ -426,6 +427,8 @@ void Dataset::Construct(std::vector<std::unique_ptr<BinMapper>>* bin_mappers,
++num_numeric_features_;
}
}
device_type_ = io_config.device_type;
gpu_device_id_ = io_config.gpu_device_id;
}
void Dataset::FinishLoad() {
@@ -437,6 +440,14 @@ void Dataset::FinishLoad() {
feature_groups_[i]->FinishLoad();
}
}
#ifdef USE_CUDA_EXP
if (device_type_ == std::string("cuda_exp")) {
CreateCUDAColumnData();
metadata_.CreateCUDAMetadata(gpu_device_id_);
} else {
cuda_column_data_.reset(nullptr);
}
#endif // USE_CUDA_EXP
is_finish_load_ = true;
}
@@ -768,6 +779,8 @@ void Dataset::CreateValid(const Dataset* dataset) {
label_idx_ = dataset->label_idx_;
real_feature_idx_ = dataset->real_feature_idx_;
forced_bin_bounds_ = dataset->forced_bin_bounds_;
device_type_ = dataset->device_type_;
gpu_device_id_ = dataset->gpu_device_id_;
}
void Dataset::ReSize(data_size_t num_data) {
@@ -833,6 +846,19 @@ void Dataset::CopySubrow(const Dataset* fullset,
}
}
}
// update CUDA storage for column data and metadata
device_type_ = fullset->device_type_;
gpu_device_id_ = fullset->gpu_device_id_;
#ifdef USE_CUDA_EXP
if (device_type_ == std::string("cuda_exp")) {
if (cuda_column_data_ == nullptr) {
cuda_column_data_.reset(new CUDAColumnData(fullset->num_data(), gpu_device_id_));
metadata_.CreateCUDAMetadata(gpu_device_id_);
}
cuda_column_data_->CopySubrow(fullset->cuda_column_data(), used_indices, num_used_indices);
}
#endif // USE_CUDA_EXP
}
bool Dataset::SetFloatField(const char* field_name, const float* field_data,
@@ -1470,6 +1496,169 @@ void Dataset::AddFeaturesFrom(Dataset* other) {
raw_data_.push_back(other->raw_data_[i]);
}
}
#ifdef USE_CUDA_EXP
if (device_type_ == std::string("cuda_exp")) {
CreateCUDAColumnData();
} else {
cuda_column_data_ = nullptr;
}
#endif // USE_CUDA_EXP
}
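// Note (editorial): the overloads below expose a feature group's raw column storage
// (pointer, element bit width, sparsity flag) so that CreateCUDAColumnData further down
// can hand the bins to CUDAColumnData without re-binning; the BinIterator output appears
// to be needed for columns whose raw buffer alone cannot be iterated row by row.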
const void* Dataset::GetColWiseData(
const int feature_group_index,
const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
return feature_groups_[feature_group_index]->GetColWiseData(sub_feature_index, bit_type, is_sparse, bin_iterator, num_threads);
}
const void* Dataset::GetColWiseData(
const int feature_group_index,
const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
return feature_groups_[feature_group_index]->GetColWiseData(sub_feature_index, bit_type, is_sparse, bin_iterator);
}
#ifdef USE_CUDA_EXP
void Dataset::CreateCUDAColumnData() {
cuda_column_data_.reset(new CUDAColumnData(num_data_, gpu_device_id_));
int num_columns = 0;
std::vector<const void*> column_data;
std::vector<BinIterator*> column_bin_iterator;
std::vector<uint8_t> column_bit_type;
int feature_index = 0;
std::vector<int> feature_to_column(num_features_, -1);
std::vector<uint32_t> feature_max_bins(num_features_, 0);
std::vector<uint32_t> feature_min_bins(num_features_, 0);
std::vector<uint32_t> feature_offsets(num_features_, 0);
std::vector<uint32_t> feature_most_freq_bins(num_features_, 0);
std::vector<uint32_t> feature_default_bin(num_features_, 0);
std::vector<uint8_t> feature_missing_is_zero(num_features_, 0);
std::vector<uint8_t> feature_missing_is_na(num_features_, 0);
std::vector<uint8_t> feature_mfb_is_zero(num_features_, 0);
std::vector<uint8_t> feature_mfb_is_na(num_features_, 0);
for (int feature_group_index = 0; feature_group_index < num_groups_; ++feature_group_index) {
if (feature_groups_[feature_group_index]->is_multi_val_) {
for (int sub_feature_index = 0; sub_feature_index < feature_groups_[feature_group_index]->num_feature_; ++sub_feature_index) {
uint8_t bit_type = 0;
bool is_sparse = false;
BinIterator* bin_iterator = nullptr;
const void* one_column_data = GetColWiseData(feature_group_index,
sub_feature_index,
&bit_type,
&is_sparse,
&bin_iterator);
column_data.emplace_back(one_column_data);
column_bin_iterator.emplace_back(bin_iterator);
column_bit_type.emplace_back(bit_type);
feature_to_column[feature_index] = num_columns;
++num_columns;
const BinMapper* feature_bin_mapper = FeatureBinMapper(feature_index);
feature_max_bins[feature_index] = feature_max_bin(feature_index);
feature_min_bins[feature_index] = feature_min_bin(feature_index);
const uint32_t most_freq_bin = feature_bin_mapper->GetMostFreqBin();
feature_offsets[feature_index] = static_cast<uint32_t>(most_freq_bin == 0);
feature_most_freq_bins[feature_index] = most_freq_bin;
feature_default_bin[feature_index] = feature_bin_mapper->GetDefaultBin();
if (feature_bin_mapper->missing_type() == MissingType::Zero) {
feature_missing_is_zero[feature_index] = 1;
feature_missing_is_na[feature_index] = 0;
if (feature_default_bin[feature_index] == feature_most_freq_bins[feature_index]) {
feature_mfb_is_zero[feature_index] = 1;
} else {
feature_mfb_is_zero[feature_index] = 0;
}
feature_mfb_is_na[feature_index] = 0;
} else if (feature_bin_mapper->missing_type() == MissingType::NaN) {
feature_missing_is_zero[feature_index] = 0;
feature_missing_is_na[feature_index] = 1;
feature_mfb_is_zero[feature_index] = 0;
if (feature_most_freq_bins[feature_index] + feature_min_bins[feature_index] == feature_max_bins[feature_index] &&
feature_most_freq_bins[feature_index] > 0) {
feature_mfb_is_na[feature_index] = 1;
} else {
feature_mfb_is_na[feature_index] = 0;
}
} else {
feature_missing_is_zero[feature_index] = 0;
feature_missing_is_na[feature_index] = 0;
feature_mfb_is_zero[feature_index] = 0;
feature_mfb_is_na[feature_index] = 0;
}
++feature_index;
}
} else {
uint8_t bit_type = 0;
bool is_sparse = false;
BinIterator* bin_iterator = nullptr;
const void* one_column_data = GetColWiseData(feature_group_index,
-1,
&bit_type,
&is_sparse,
&bin_iterator);
column_data.emplace_back(one_column_data);
column_bin_iterator.emplace_back(bin_iterator);
column_bit_type.emplace_back(bit_type);
for (int sub_feature_index = 0; sub_feature_index < feature_groups_[feature_group_index]->num_feature_; ++sub_feature_index) {
feature_to_column[feature_index] = num_columns;
const BinMapper* feature_bin_mapper = FeatureBinMapper(feature_index);
feature_max_bins[feature_index] = feature_max_bin(feature_index);
feature_min_bins[feature_index] = feature_min_bin(feature_index);
const uint32_t most_freq_bin = feature_bin_mapper->GetMostFreqBin();
feature_offsets[feature_index] = static_cast<uint32_t>(most_freq_bin == 0);
feature_most_freq_bins[feature_index] = most_freq_bin;
feature_default_bin[feature_index] = feature_bin_mapper->GetDefaultBin();
if (feature_bin_mapper->missing_type() == MissingType::Zero) {
feature_missing_is_zero[feature_index] = 1;
feature_missing_is_na[feature_index] = 0;
if (feature_default_bin[feature_index] == feature_most_freq_bins[feature_index]) {
feature_mfb_is_zero[feature_index] = 1;
} else {
feature_mfb_is_zero[feature_index] = 0;
}
feature_mfb_is_na[feature_index] = 0;
} else if (feature_bin_mapper->missing_type() == MissingType::NaN) {
feature_missing_is_zero[feature_index] = 0;
feature_missing_is_na[feature_index] = 1;
feature_mfb_is_zero[feature_index] = 0;
if (feature_most_freq_bins[feature_index] + feature_min_bins[feature_index] == feature_max_bins[feature_index] &&
feature_most_freq_bins[feature_index] > 0) {
feature_mfb_is_na[feature_index] = 1;
} else {
feature_mfb_is_na[feature_index] = 0;
}
} else {
feature_missing_is_zero[feature_index] = 0;
feature_missing_is_na[feature_index] = 0;
feature_mfb_is_zero[feature_index] = 0;
feature_mfb_is_na[feature_index] = 0;
}
++feature_index;
}
++num_columns;
}
}
cuda_column_data_->Init(num_columns,
column_data,
column_bin_iterator,
column_bit_type,
feature_max_bins,
feature_min_bins,
feature_offsets,
feature_most_freq_bins,
feature_default_bin,
feature_missing_is_zero,
feature_missing_is_na,
feature_mfb_is_zero,
feature_mfb_is_na,
feature_to_column);
}
#endif // USE_CUDA_EXP
} // namespace LightGBM