Unverified commit 6b56a90c, authored by shiyu1994, committed by GitHub

[CUDA] New CUDA version Part 1 (#4630)



* new cuda framework

* add histogram construction kernel

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to GPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add destructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: Yu Shi <shiyu1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
parent b857ee10
@@ -22,6 +22,9 @@
#include <utility>
#include <vector>
#include <LightGBM/cuda/cuda_column_data.hpp>
#include <LightGBM/cuda/cuda_metadata.hpp>
namespace LightGBM {
/*! \brief forward declaration */
@@ -211,6 +214,14 @@ class Metadata {
/*! \brief Disable copy */
Metadata(const Metadata&) = delete;
#ifdef USE_CUDA_EXP
CUDAMetadata* cuda_metadata() const { return cuda_metadata_.get(); }
void CreateCUDAMetadata(const int gpu_device_id);
#endif // USE_CUDA_EXP
private:
/*! \brief Load initial scores from file */
void LoadInitialScore();
@@ -247,6 +258,9 @@ class Metadata {
bool weight_load_from_file_;
bool query_load_from_file_;
bool init_score_load_from_file_;
#ifdef USE_CUDA_EXP
std::unique_ptr<CUDAMetadata> cuda_metadata_;
#endif // USE_CUDA_EXP
};
@@ -623,6 +637,21 @@ class Dataset {
return feature_groups_[group]->FeatureGroupData();
}
const void* GetColWiseData(
const int feature_group_index,
const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const;
const void* GetColWiseData(
const int feature_group_index,
const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const;
inline double RealThreshold(int i, uint32_t threshold) const {
const int group = feature2group_[i];
const int sub_feature = feature2subfeature_[i];
@@ -636,6 +665,12 @@ class Dataset {
return feature_groups_[group]->bin_mappers_[sub_feature]->ValueToBin(threshold_double);
}
inline int MaxRealCatValue(int i) const {
const int group = feature2group_[i];
const int sub_feature = feature2subfeature_[i];
return feature_groups_[group]->bin_mappers_[sub_feature]->MaxCatValue();
}
/*!
* \brief Get meta data pointer
* \return Pointer of meta data
@@ -739,7 +774,29 @@ class Dataset {
return raw_data_[numeric_feature_map_[feat_ind]].data();
}
inline uint32_t feature_max_bin(const int inner_feature_index) const {
const int feature_group_index = Feature2Group(inner_feature_index);
const int sub_feature_index = feature2subfeature_[inner_feature_index];
return feature_groups_[feature_group_index]->feature_max_bin(sub_feature_index);
}
inline uint32_t feature_min_bin(const int inner_feature_index) const {
const int feature_group_index = Feature2Group(inner_feature_index);
const int sub_feature_index = feature2subfeature_[inner_feature_index];
return feature_groups_[feature_group_index]->feature_min_bin(sub_feature_index);
}
#ifdef USE_CUDA_EXP
const CUDAColumnData* cuda_column_data() const {
return cuda_column_data_.get();
}
#endif // USE_CUDA_EXP
private:
void CreateCUDAColumnData();
std::string data_filename_;
/*! \brief Store used features */
std::vector<std::unique_ptr<FeatureGroup>> feature_groups_;
@@ -780,6 +837,13 @@ class Dataset {
/*! map feature (inner index) to its index in the list of numeric (non-categorical) features */
std::vector<int> numeric_feature_map_;
int num_numeric_features_;
std::string device_type_;
int gpu_device_id_;
#ifdef USE_CUDA_EXP
std::unique_ptr<CUDAColumnData> cuda_column_data_;
#endif // USE_CUDA_EXP
std::string parser_config_str_;
};
......
@@ -478,6 +478,50 @@ class FeatureGroup {
}
}
const void* GetColWiseData(const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
if (sub_feature_index >= 0) {
CHECK(is_multi_val_);
return multi_bin_data_[sub_feature_index]->GetColWiseData(bit_type, is_sparse, bin_iterator, num_threads);
} else {
CHECK(!is_multi_val_);
return bin_data_->GetColWiseData(bit_type, is_sparse, bin_iterator, num_threads);
}
}
const void* GetColWiseData(const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
if (sub_feature_index >= 0) {
CHECK(is_multi_val_);
return multi_bin_data_[sub_feature_index]->GetColWiseData(bit_type, is_sparse, bin_iterator);
} else {
CHECK(!is_multi_val_);
return bin_data_->GetColWiseData(bit_type, is_sparse, bin_iterator);
}
}
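// Global bin bounds of a sub-feature within this group: multi-value groups
// start at bin 1 and reserve one extra slot when the most frequent bin is non-zero.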
uint32_t feature_max_bin(const int sub_feature_index) {
if (!is_multi_val_) {
return bin_offsets_[sub_feature_index + 1] - 1;
} else {
int addi = bin_mappers_[sub_feature_index]->GetMostFreqBin() == 0 ? 0 : 1;
return bin_mappers_[sub_feature_index]->num_bin() - 1 + addi;
}
}
uint32_t feature_min_bin(const int sub_feature_index) {
if (!is_multi_val_) {
return bin_offsets_[sub_feature_index];
} else {
return 1;
}
}
private:
void CreateBinData(int num_data, bool is_multi_val, bool force_dense, bool force_sparse) {
if (is_multi_val) {
......
@@ -49,6 +49,8 @@ typedef float label_t;
const score_t kMinScore = -std::numeric_limits<score_t>::infinity();
const score_t kMaxScore = std::numeric_limits<score_t>::infinity();
const score_t kEpsilon = 1e-15f;
const double kZeroThreshold = 1e-35f;
......
@@ -125,6 +125,25 @@ class MultiValBinWrapper {
is_subrow_copied_ = is_subrow_copied;
}
#ifdef USE_CUDA_EXP
const void* GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
if (multi_val_bin_ == nullptr) {
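// no multi-value bin was constructed for this dataset: report empty row-wise data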
*bit_type = 0;
*total_size = 0;
*is_sparse = false;
return nullptr;
} else {
return multi_val_bin_->GetRowWiseData(bit_type, total_size, is_sparse, out_data_ptr, data_ptr_bit_type);
}
}
#endif // USE_CUDA_EXP
private:
bool is_use_subcol_ = false;
bool is_use_subrow_ = false;
@@ -162,7 +181,11 @@ struct TrainingShareStates {
int num_hist_total_bin() { return num_hist_total_bin_; }
const std::vector<uint32_t>& feature_hist_offsets() const { return feature_hist_offsets_; }
#ifdef USE_CUDA_EXP
const std::vector<uint32_t>& column_hist_offsets() const { return column_hist_offsets_; }
#endif // USE_CUDA_EXP
bool IsSparseRowwise() {
return (multi_val_bin_wrapper_ != nullptr && multi_val_bin_wrapper_->IsSparse());
@@ -211,8 +234,29 @@ struct TrainingShareStates {
}
}
#ifdef USE_CUDA_EXP
const void* GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) {
if (multi_val_bin_wrapper_ != nullptr) {
return multi_val_bin_wrapper_->GetRowWiseData(bit_type, total_size, is_sparse, out_data_ptr, data_ptr_bit_type);
} else {
*bit_type = 0;
*total_size = 0;
*is_sparse = false;
return nullptr;
}
}
#endif // USE_CUDA_EXP
private:
std::vector<uint32_t> feature_hist_offsets_;
#ifdef USE_CUDA_EXP
std::vector<uint32_t> column_hist_offsets_;
#endif // USE_CUDA_EXP
int num_hist_total_bin_ = 0;
std::unique_ptr<MultiValBinWrapper> multi_val_bin_wrapper_;
std::vector<hist_t, Common::AlignmentAllocator<hist_t, kAlignedSize>> hist_buf_;
......
@@ -39,7 +39,7 @@ class Tree {
*/
Tree(const char* str, size_t* used_len);
virtual ~Tree() noexcept = default;
/*!
* \brief Performing a split on tree leaves.
@@ -100,7 +100,7 @@ class Tree {
* \param num_data Number of total data
* \param score Will add prediction to score
*/
virtual void AddPredictionToScore(const Dataset* data,
data_size_t num_data,
double* score) const;
@@ -111,7 +111,7 @@ class Tree {
* \param num_data Number of total data
* \param score Will add prediction to score
*/
virtual void AddPredictionToScore(const Dataset* data,
const data_size_t* used_data_indices,
data_size_t num_data, double* score) const;
@@ -184,7 +184,7 @@ class Tree {
* shrinkage rate (a.k.a learning rate) is used to tune the training process
* \param rate The factor of shrinkage
*/
virtual inline void Shrinkage(double rate) {
#pragma omp parallel for schedule(static, 1024) if (num_leaves_ >= 2048)
for (int i = 0; i < num_leaves_ - 1; ++i) {
leaf_value_[i] = MaybeRoundToZero(leaf_value_[i] * rate);
@@ -209,7 +209,7 @@ class Tree {
inline double shrinkage() const { return shrinkage_; }
virtual inline void AddBias(double val) {
#pragma omp parallel for schedule(static, 1024) if (num_leaves_ >= 2048)
for (int i = 0; i < num_leaves_ - 1; ++i) {
leaf_value_[i] = MaybeRoundToZero(leaf_value_[i] + val);
@@ -319,11 +319,15 @@ class Tree {
inline bool is_linear() const { return is_linear_; }
#ifdef USE_CUDA_EXP
inline bool is_cuda_tree() const { return is_cuda_tree_; }
#endif // USE_CUDA_EXP
inline void SetIsLinear(bool is_linear) {
is_linear_ = is_linear;
}
protected:
std::string NumericalDecisionIfElse(int node) const;
std::string CategoricalDecisionIfElse(int node) const;
@@ -528,6 +532,10 @@ class Tree {
std::vector<std::vector<int>> leaf_features_;
/* \brief features used in leaf linear models; indexing is relative to used_features_ */
std::vector<std::vector<int>> leaf_features_inner_;
#ifdef USE_CUDA_EXP
/*! \brief Marks whether this tree is a CUDATree */
bool is_cuda_tree_;
#endif // USE_CUDA_EXP
};
inline void Tree::Split(int leaf, int feature, int real_feature,
......
@@ -123,6 +123,8 @@ All requirements from `Build from Sources section <#build-from-sources>`__ apply
**CUDA** library (version 9.0 or higher) is needed: details for installation can be found in `Installation Guide <https://github.com/microsoft/LightGBM/blob/master/docs/Installation-Guide.rst#build-cuda-version-experimental>`__.
Recently, a new CUDA version with better efficiency has been implemented as an experimental feature. To build the new CUDA version, replace ``--cuda`` with ``--cuda-exp`` in the above commands. Please note that the new version requires **CUDA** 10.0 or later.
Build HDFS Version
~~~~~~~~~~~~~~~~~~
@@ -198,6 +200,8 @@ Run ``python setup.py install --gpu`` to enable GPU support. All requirements fr
Run ``python setup.py install --cuda`` to enable CUDA support. All requirements from `Build CUDA Version section <#build-cuda-version>`__ apply for this installation option as well.
Run ``python setup.py install --cuda-exp`` to enable the new experimental version of CUDA support. All requirements from `Build CUDA Version section <#build-cuda-version>`__ apply for this installation option as well.
Run ``python setup.py install --hdfs`` to enable HDFS support. All requirements from `Build HDFS Version section <#build-hdfs-version>`__ apply for this installation option as well.
Run ``python setup.py install --bit32``, if you want to use 32-bit version. All requirements from `Build 32-bit Version with 32-bit Python section <#build-32-bit-version-with-32-bit-python>`__ apply for this installation option as well.
......
@@ -21,6 +21,7 @@ LIGHTGBM_OPTIONS = [
('integrated-opencl', None, 'Compile integrated OpenCL version'),
('gpu', 'g', 'Compile GPU version'),
('cuda', None, 'Compile CUDA version'),
('cuda-exp', None, 'Compile CUDA Experimental version'),
('mpi', None, 'Compile MPI version'),
('nomp', None, 'Compile version without OpenMP support'),
('hdfs', 'h', 'Compile HDFS version'),
@@ -104,6 +105,7 @@ def compile_cpp(
use_mingw: bool = False,
use_gpu: bool = False,
use_cuda: bool = False,
use_cuda_exp: bool = False,
use_mpi: bool = False,
use_hdfs: bool = False,
boost_root: Optional[str] = None,
@@ -144,6 +146,8 @@ def compile_cpp(
cmake_cmd.append(f"-DOpenCL_LIBRARY={opencl_library}")
elif use_cuda:
cmake_cmd.append("-DUSE_CUDA=ON")
elif use_cuda_exp:
cmake_cmd.append("-DUSE_CUDA_EXP=ON")
if use_mpi:
cmake_cmd.append("-DUSE_MPI=ON")
if nomp:
@@ -163,7 +167,7 @@ def compile_cpp(
else:
status = 1
lib_path = CURRENT_DIR / "compile" / "windows" / "x64" / "DLL" / "lib_lightgbm.dll"
if not any((use_gpu, use_cuda, use_cuda_exp, use_mpi, use_hdfs, nomp, bit32, integrated_opencl)):
logger.info("Starting to compile with MSBuild from existing solution file.")
platform_toolsets = ("v143", "v142", "v141", "v140")
for pt in platform_toolsets:
@@ -227,6 +231,7 @@ class CustomInstall(install):
self.integrated_opencl = False
self.gpu = False
self.cuda = False
self.cuda_exp = False
self.boost_root = None
self.boost_dir = None
self.boost_include_dir = None
@@ -250,7 +255,7 @@ class CustomInstall(install):
LOG_PATH.touch()
if not self.precompile:
copy_files(integrated_opencl=self.integrated_opencl, use_gpu=self.gpu)
compile_cpp(use_mingw=self.mingw, use_gpu=self.gpu, use_cuda=self.cuda, use_cuda_exp=self.cuda_exp, use_mpi=self.mpi,
use_hdfs=self.hdfs, boost_root=self.boost_root, boost_dir=self.boost_dir,
boost_include_dir=self.boost_include_dir, boost_librarydir=self.boost_librarydir,
opencl_include_dir=self.opencl_include_dir, opencl_library=self.opencl_library,
@@ -270,6 +275,7 @@ class CustomBdistWheel(bdist_wheel):
self.integrated_opencl = False
self.gpu = False
self.cuda = False
self.cuda_exp = False
self.boost_root = None
self.boost_dir = None
self.boost_include_dir = None
@@ -291,6 +297,7 @@ class CustomBdistWheel(bdist_wheel):
install.integrated_opencl = self.integrated_opencl
install.gpu = self.gpu
install.cuda = self.cuda
install.cuda_exp = self.cuda_exp
install.boost_root = self.boost_root
install.boost_dir = self.boost_dir
install.boost_include_dir = self.boost_include_dir
......
@@ -36,7 +36,7 @@ Application::Application(int argc, char** argv) {
Log::Fatal("No training/prediction data, application quit");
}
if (config_.device_type == std::string("cuda") || config_.device_type == std::string("cuda_exp")) {
LGBM_config_::current_device = lgbm_device_cuda;
}
}
......
@@ -65,7 +65,7 @@ void GBDT::Init(const Config* config, const Dataset* train_data, const Objective
es_first_metric_only_ = config_->first_metric_only;
shrinkage_rate_ = config_->learning_rate;
if (config_->device_type == std::string("cuda") || config_->device_type == std::string("cuda_exp")) {
LGBM_config_::current_learner = use_cuda_learner;
}
@@ -391,7 +391,7 @@ bool GBDT::TrainOneIter(const score_t* gradients, const score_t* hessians) {
auto grad = gradients + offset;
auto hess = hessians + offset;
// need to copy gradients for bagging subset.
if (is_use_subset_ && bag_data_cnt_ < num_data_ && config_->device_type != std::string("cuda_exp")) {
for (int i = 0; i < bag_data_cnt_; ++i) {
gradients_[offset + i] = grad[bag_data_indices_[i]];
hessians_[offset + i] = hess[bag_data_indices_[i]];
@@ -805,15 +805,17 @@ void GBDT::ResetBaggingConfig(const Config* config, bool is_change_dataset) {
double average_bag_rate =
(static_cast<double>(bag_data_cnt_) / num_data_) / config->bagging_freq;
is_use_subset_ = false;
if (config_->device_type != std::string("cuda_exp")) {
const int group_threshold_usesubset = 100;
if (average_bag_rate <= 0.5
&& (train_data_->num_feature_groups() < group_threshold_usesubset)) {
if (tmp_subset_ == nullptr || is_change_dataset) {
tmp_subset_.reset(new Dataset(bag_data_cnt_));
tmp_subset_->CopyFeatureMapperFrom(train_data_);
}
is_use_subset_ = true;
Log::Debug("Use subset for bagging");
}
}
need_re_bagging_ = true; need_re_bagging_ = true;
......
@@ -488,7 +488,7 @@ class GBDT : public GBDTBase {
/*! \brief Parser config file content */
std::string parser_config_str_ = "";
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
/*! \brief First order derivative of training data */
std::vector<score_t, CHAllocator<score_t>> gradients_;
/*! \brief Second order derivative of training data */
......
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_algorithms.hpp>
namespace LightGBM {
template <typename T>
__global__ void ShufflePrefixSumGlobalKernel(T* values, size_t len, T* block_prefix_sum_buffer) {
__shared__ T shared_mem_buffer[32];
const size_t index = static_cast<size_t>(threadIdx.x + blockIdx.x * blockDim.x);
T value = 0;
if (index < len) {
value = values[index];
}
const T prefix_sum_value = ShufflePrefixSum<T>(value, shared_mem_buffer);
if (index < len) {
values[index] = prefix_sum_value;
}
if (threadIdx.x == blockDim.x - 1) {
block_prefix_sum_buffer[blockIdx.x] = prefix_sum_value;
}
}
template <typename T>
__global__ void ShufflePrefixSumGlobalReduceBlockKernel(T* block_prefix_sum_buffer, int num_blocks) {
__shared__ T shared_mem_buffer[32];
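// Launched with a single thread block: thread t first sums the per-block totals of
// chunk t - 1, so the prefix sum across threads yields each chunk's starting offset
// ("base"), which is then added to an in-place scan of the chunk owned by thread t.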
const int num_blocks_per_thread = (num_blocks + GLOBAL_PREFIX_SUM_BLOCK_SIZE - 2) / (GLOBAL_PREFIX_SUM_BLOCK_SIZE - 1);
int thread_block_start = threadIdx.x == 0 ? 0 : (threadIdx.x - 1) * num_blocks_per_thread;
int thread_block_end = threadIdx.x == 0 ? 0 : min(thread_block_start + num_blocks_per_thread, num_blocks);
T base = 0;
for (int block_index = thread_block_start; block_index < thread_block_end; ++block_index) {
base += block_prefix_sum_buffer[block_index];
}
base = ShufflePrefixSum<T>(base, shared_mem_buffer);
thread_block_start = threadIdx.x == blockDim.x - 1 ? 0 : threadIdx.x * num_blocks_per_thread;
thread_block_end = threadIdx.x == blockDim.x - 1 ? 0 : min(thread_block_start + num_blocks_per_thread, num_blocks);
for (int block_index = thread_block_start + 1; block_index < thread_block_end; ++block_index) {
block_prefix_sum_buffer[block_index] += block_prefix_sum_buffer[block_index - 1];
}
for (int block_index = thread_block_start; block_index < thread_block_end; ++block_index) {
block_prefix_sum_buffer[block_index] += base;
}
}
template <typename T>
__global__ void ShufflePrefixSumGlobalAddBase(size_t len, const T* block_prefix_sum_buffer, T* values) {
const T base = blockIdx.x == 0 ? 0 : block_prefix_sum_buffer[blockIdx.x - 1];
const size_t index = static_cast<size_t>(threadIdx.x + blockIdx.x * blockDim.x);
if (index < len) {
values[index] += base;
}
}
template <typename T>
void ShufflePrefixSumGlobalInner(T* values, size_t len, T* block_prefix_sum_buffer) {
const int num_blocks = (static_cast<int>(len) + GLOBAL_PREFIX_SUM_BLOCK_SIZE - 1) / GLOBAL_PREFIX_SUM_BLOCK_SIZE;
ShufflePrefixSumGlobalKernel<<<num_blocks, GLOBAL_PREFIX_SUM_BLOCK_SIZE>>>(values, len, block_prefix_sum_buffer);
ShufflePrefixSumGlobalReduceBlockKernel<<<1, GLOBAL_PREFIX_SUM_BLOCK_SIZE>>>(block_prefix_sum_buffer, num_blocks);
ShufflePrefixSumGlobalAddBase<<<num_blocks, GLOBAL_PREFIX_SUM_BLOCK_SIZE>>>(len, block_prefix_sum_buffer, values);
}
template <>
void ShufflePrefixSumGlobal(uint16_t* values, size_t len, uint16_t* block_prefix_sum_buffer) {
ShufflePrefixSumGlobalInner<uint16_t>(values, len, block_prefix_sum_buffer);
}
template <>
void ShufflePrefixSumGlobal(uint32_t* values, size_t len, uint32_t* block_prefix_sum_buffer) {
ShufflePrefixSumGlobalInner<uint32_t>(values, len, block_prefix_sum_buffer);
}
template <>
void ShufflePrefixSumGlobal(uint64_t* values, size_t len, uint64_t* block_prefix_sum_buffer) {
ShufflePrefixSumGlobalInner<uint64_t>(values, len, block_prefix_sum_buffer);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
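For context, a minimal host-side sketch of driving the scan above (a sketch, not part of this commit: it assumes the AllocateCUDAMemory/DeallocateCUDAMemory helpers used elsewhere in this commit, and the wrapper name ExampleGlobalPrefixSum is illustrative):
// Hypothetical driver: prefix-sums `cuda_values` (device memory) in place.
void ExampleGlobalPrefixSum(uint32_t* cuda_values, size_t len) {
  const int num_blocks = (static_cast<int>(len) + GLOBAL_PREFIX_SUM_BLOCK_SIZE - 1) / GLOBAL_PREFIX_SUM_BLOCK_SIZE;
  uint32_t* block_buffer = nullptr;
  // one partial sum per block, combined by the single-block reduce kernel above
  AllocateCUDAMemory<uint32_t>(&block_buffer, static_cast<size_t>(num_blocks), __FILE__, __LINE__);
  ShufflePrefixSumGlobal(cuda_values, len, block_buffer);
  SynchronizeCUDADevice(__FILE__, __LINE__);
  DeallocateCUDAMemory<uint32_t>(&block_buffer, __FILE__, __LINE__);
}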
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_utils.h>
namespace LightGBM {
void SynchronizeCUDADevice(const char* file, const int line) {
gpuAssert(cudaDeviceSynchronize(), file, line);
}
void PrintLastCUDAError() {
const char* error_name = cudaGetErrorName(cudaGetLastError());
Log::Fatal(error_name);
}
void SetCUDADevice(int gpu_device_id, const char* file, int line) {
int cur_gpu_device_id = 0;
CUDASUCCESS_OR_FATAL_OUTER(cudaGetDevice(&cur_gpu_device_id));
if (cur_gpu_device_id != gpu_device_id) {
CUDASUCCESS_OR_FATAL_OUTER(cudaSetDevice(gpu_device_id));
}
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
@@ -128,6 +128,8 @@ void GetDeviceType(const std::unordered_map<std::string, std::string>& params, s
*device_type = "gpu";
} else if (value == std::string("cuda")) {
*device_type = "cuda";
} else if (value == std::string("cuda_exp")) {
*device_type = "cuda_exp";
} else {
Log::Fatal("Unknown device type %s", value.c_str());
}
@@ -208,7 +210,7 @@ void Config::Set(const std::unordered_map<std::string, std::string>& params) {
GetObjectiveType(params, &objective);
GetMetricType(params, objective, &metric);
GetDeviceType(params, &device_type);
if (device_type == std::string("cuda") || device_type == std::string("cuda_exp")) {
LGBM_config_::current_device = lgbm_device_cuda;
}
GetTreeLearnerType(params, &tree_learner);
@@ -331,13 +333,20 @@ void Config::CheckParamConflict() {
num_leaves = static_cast<int>(full_num_leaves);
}
}
if (device_type == std::string("gpu") || device_type == std::string("cuda")) {
// force col-wise for gpu, and cuda version
force_col_wise = true;
force_row_wise = false;
if (deterministic) {
Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
}
} else if (device_type == std::string("cuda_exp")) {
// force row-wise for cuda_exp version
force_col_wise = false;
force_row_wise = true;
if (deterministic) {
Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
}
} }
// force gpu_use_dp for CUDA
if (device_type == std::string("cuda") && !gpu_use_dp) {
......
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_column_data.hpp>
namespace LightGBM {
CUDAColumnData::CUDAColumnData(const data_size_t num_data, const int gpu_device_id) {
num_threads_ = OMP_NUM_THREADS();
num_data_ = num_data;
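// a negative gpu_device_id means "unspecified": fall back to device 0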
if (gpu_device_id >= 0) {
SetCUDADevice(gpu_device_id, __FILE__, __LINE__);
} else {
SetCUDADevice(0, __FILE__, __LINE__);
}
cuda_used_indices_ = nullptr;
cuda_data_by_column_ = nullptr;
cuda_column_bit_type_ = nullptr;
cuda_feature_min_bin_ = nullptr;
cuda_feature_max_bin_ = nullptr;
cuda_feature_offset_ = nullptr;
cuda_feature_most_freq_bin_ = nullptr;
cuda_feature_default_bin_ = nullptr;
cuda_feature_missing_is_zero_ = nullptr;
cuda_feature_missing_is_na_ = nullptr;
cuda_feature_mfb_is_zero_ = nullptr;
cuda_feature_mfb_is_na_ = nullptr;
cuda_feature_to_column_ = nullptr;
data_by_column_.clear();
}
CUDAColumnData::~CUDAColumnData() {
DeallocateCUDAMemory<data_size_t>(&cuda_used_indices_, __FILE__, __LINE__);
DeallocateCUDAMemory<void*>(&cuda_data_by_column_, __FILE__, __LINE__);
for (size_t i = 0; i < data_by_column_.size(); ++i) {
DeallocateCUDAMemory<void>(&data_by_column_[i], __FILE__, __LINE__);
}
DeallocateCUDAMemory<uint8_t>(&cuda_column_bit_type_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_min_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_max_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_offset_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_most_freq_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_feature_default_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint8_t>(&cuda_feature_missing_is_zero_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint8_t>(&cuda_feature_missing_is_na_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint8_t>(&cuda_feature_mfb_is_zero_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint8_t>(&cuda_feature_mfb_is_na_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_feature_to_column_, __FILE__, __LINE__);
}
template <bool IS_SPARSE, bool IS_4BIT, typename BIN_TYPE>
void CUDAColumnData::InitOneColumnData(const void* in_column_data, BinIterator* bin_iterator, void** out_column_data_pointer) {
BIN_TYPE* cuda_column_data = nullptr;
if (!IS_SPARSE) {
if (IS_4BIT) {
std::vector<BIN_TYPE> expanded_column_data(num_data_, 0);
const BIN_TYPE* in_column_data_reinterpreted = reinterpret_cast<const BIN_TYPE*>(in_column_data);
for (data_size_t i = 0; i < num_data_; ++i) {
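// unpack two 4-bit bins per byte: even rows take the low nibble, odd rows the high nibble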
expanded_column_data[i] = static_cast<BIN_TYPE>((in_column_data_reinterpreted[i >> 1] >> ((i & 1) << 2)) & 0xf);
}
InitCUDAMemoryFromHostMemory<BIN_TYPE>(&cuda_column_data,
expanded_column_data.data(),
static_cast<size_t>(num_data_),
__FILE__,
__LINE__);
} else {
InitCUDAMemoryFromHostMemory<BIN_TYPE>(&cuda_column_data,
reinterpret_cast<const BIN_TYPE*>(in_column_data),
static_cast<size_t>(num_data_),
__FILE__,
__LINE__);
}
} else {
// need to iterate bin iterator
std::vector<BIN_TYPE> expanded_column_data(num_data_, 0);
for (data_size_t i = 0; i < num_data_; ++i) {
expanded_column_data[i] = static_cast<BIN_TYPE>(bin_iterator->RawGet(i));
}
InitCUDAMemoryFromHostMemory<BIN_TYPE>(&cuda_column_data,
expanded_column_data.data(),
static_cast<size_t>(num_data_),
__FILE__,
__LINE__);
}
*out_column_data_pointer = reinterpret_cast<void*>(cuda_column_data);
}
void CUDAColumnData::Init(const int num_columns,
const std::vector<const void*>& column_data,
const std::vector<BinIterator*>& column_bin_iterator,
const std::vector<uint8_t>& column_bit_type,
const std::vector<uint32_t>& feature_max_bin,
const std::vector<uint32_t>& feature_min_bin,
const std::vector<uint32_t>& feature_offset,
const std::vector<uint32_t>& feature_most_freq_bin,
const std::vector<uint32_t>& feature_default_bin,
const std::vector<uint8_t>& feature_missing_is_zero,
const std::vector<uint8_t>& feature_missing_is_na,
const std::vector<uint8_t>& feature_mfb_is_zero,
const std::vector<uint8_t>& feature_mfb_is_na,
const std::vector<int>& feature_to_column) {
num_columns_ = num_columns;
column_bit_type_ = column_bit_type;
feature_max_bin_ = feature_max_bin;
feature_min_bin_ = feature_min_bin;
feature_offset_ = feature_offset;
feature_most_freq_bin_ = feature_most_freq_bin;
feature_default_bin_ = feature_default_bin;
feature_missing_is_zero_ = feature_missing_is_zero;
feature_missing_is_na_ = feature_missing_is_na;
feature_mfb_is_zero_ = feature_mfb_is_zero;
feature_mfb_is_na_ = feature_mfb_is_na;
data_by_column_.resize(num_columns_, nullptr);
OMP_INIT_EX();
#pragma omp parallel for schedule(static) num_threads(num_threads_)
for (int column_index = 0; column_index < num_columns_; ++column_index) {
OMP_LOOP_EX_BEGIN();
const int8_t bit_type = column_bit_type[column_index];
if (column_data[column_index] != nullptr) {
// is dense column
if (bit_type == 4) {
column_bit_type_[column_index] = 8;
InitOneColumnData<false, true, uint8_t>(column_data[column_index], nullptr, &data_by_column_[column_index]);
} else if (bit_type == 8) {
InitOneColumnData<false, false, uint8_t>(column_data[column_index], nullptr, &data_by_column_[column_index]);
} else if (bit_type == 16) {
InitOneColumnData<false, false, uint16_t>(column_data[column_index], nullptr, &data_by_column_[column_index]);
} else if (bit_type == 32) {
InitOneColumnData<false, false, uint32_t>(column_data[column_index], nullptr, &data_by_column_[column_index]);
} else {
Log::Fatal("Unknow column bit type %d", bit_type);
}
} else {
// is sparse column
if (bit_type == 8) {
InitOneColumnData<true, false, uint8_t>(nullptr, column_bin_iterator[column_index], &data_by_column_[column_index]);
} else if (bit_type == 16) {
InitOneColumnData<true, false, uint16_t>(nullptr, column_bin_iterator[column_index], &data_by_column_[column_index]);
} else if (bit_type == 32) {
InitOneColumnData<true, false, uint32_t>(nullptr, column_bin_iterator[column_index], &data_by_column_[column_index]);
} else {
Log::Fatal("Unknow column bit type %d", bit_type);
}
}
OMP_LOOP_EX_END();
}
OMP_THROW_EX();
feature_to_column_ = feature_to_column;
InitCUDAMemoryFromHostMemory<void*>(&cuda_data_by_column_,
data_by_column_.data(),
data_by_column_.size(),
__FILE__,
__LINE__);
InitColumnMetaInfo();
}
void CUDAColumnData::CopySubrow(
const CUDAColumnData* full_set,
const data_size_t* used_indices,
const data_size_t num_used_indices) {
num_threads_ = full_set->num_threads_;
num_columns_ = full_set->num_columns_;
column_bit_type_ = full_set->column_bit_type_;
feature_min_bin_ = full_set->feature_min_bin_;
feature_max_bin_ = full_set->feature_max_bin_;
feature_offset_ = full_set->feature_offset_;
feature_most_freq_bin_ = full_set->feature_most_freq_bin_;
feature_default_bin_ = full_set->feature_default_bin_;
feature_missing_is_zero_ = full_set->feature_missing_is_zero_;
feature_missing_is_na_ = full_set->feature_missing_is_na_;
feature_mfb_is_zero_ = full_set->feature_mfb_is_zero_;
feature_mfb_is_na_ = full_set->feature_mfb_is_na_;
feature_to_column_ = full_set->feature_to_column_;
if (cuda_used_indices_ == nullptr) {
// initialize the subset cuda column data
const size_t num_used_indices_size = static_cast<size_t>(num_used_indices);
AllocateCUDAMemory<data_size_t>(&cuda_used_indices_, num_used_indices_size, __FILE__, __LINE__);
data_by_column_.resize(num_columns_, nullptr);
OMP_INIT_EX();
#pragma omp parallel for schedule(static) num_threads(num_threads_)
for (int column_index = 0; column_index < num_columns_; ++column_index) {
OMP_LOOP_EX_BEGIN();
const uint8_t bit_type = column_bit_type_[column_index];
if (bit_type == 8) {
uint8_t* column_data = nullptr;
AllocateCUDAMemory<uint8_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
} else if (bit_type == 16) {
uint16_t* column_data = nullptr;
AllocateCUDAMemory<uint16_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
} else if (bit_type == 32) {
uint32_t* column_data = nullptr;
AllocateCUDAMemory<uint32_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
}
OMP_LOOP_EX_END();
}
OMP_THROW_EX();
InitCUDAMemoryFromHostMemory<void*>(&cuda_data_by_column_, data_by_column_.data(), data_by_column_.size(), __FILE__, __LINE__);
InitColumnMetaInfo();
cur_subset_buffer_size_ = num_used_indices;
} else {
if (num_used_indices > cur_subset_buffer_size_) {
ResizeWhenCopySubrow(num_used_indices);
cur_subset_buffer_size_ = num_used_indices;
}
}
CopyFromHostToCUDADevice<data_size_t>(cuda_used_indices_, used_indices, static_cast<size_t>(num_used_indices), __FILE__, __LINE__);
num_used_indices_ = num_used_indices;
LaunchCopySubrowKernel(full_set->cuda_data_by_column());
}
void CUDAColumnData::ResizeWhenCopySubrow(const data_size_t num_used_indices) {
const size_t num_used_indices_size = static_cast<size_t>(num_used_indices);
DeallocateCUDAMemory<data_size_t>(&cuda_used_indices_, __FILE__, __LINE__);
AllocateCUDAMemory<data_size_t>(&cuda_used_indices_, num_used_indices_size, __FILE__, __LINE__);
OMP_INIT_EX();
#pragma omp parallel for schedule(static) num_threads(num_threads_)
for (int column_index = 0; column_index < num_columns_; ++column_index) {
OMP_LOOP_EX_BEGIN();
const uint8_t bit_type = column_bit_type_[column_index];
if (bit_type == 8) {
uint8_t* column_data = reinterpret_cast<uint8_t*>(data_by_column_[column_index]);
DeallocateCUDAMemory<uint8_t>(&column_data, __FILE__, __LINE__);
AllocateCUDAMemory<uint8_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
} else if (bit_type == 16) {
uint16_t* column_data = reinterpret_cast<uint16_t*>(data_by_column_[column_index]);
DeallocateCUDAMemory<uint16_t>(&column_data, __FILE__, __LINE__);
AllocateCUDAMemory<uint16_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
} else if (bit_type == 32) {
uint32_t* column_data = reinterpret_cast<uint32_t*>(data_by_column_[column_index]);
DeallocateCUDAMemory<uint32_t>(&column_data, __FILE__, __LINE__);
AllocateCUDAMemory<uint32_t>(&column_data, num_used_indices_size, __FILE__, __LINE__);
data_by_column_[column_index] = reinterpret_cast<void*>(column_data);
}
OMP_LOOP_EX_END();
}
OMP_THROW_EX();
DeallocateCUDAMemory<void*>(&cuda_data_by_column_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<void*>(&cuda_data_by_column_, data_by_column_.data(), data_by_column_.size(), __FILE__, __LINE__);
}
void CUDAColumnData::InitColumnMetaInfo() {
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_column_bit_type_,
column_bit_type_.data(),
column_bit_type_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_max_bin_,
feature_max_bin_.data(),
feature_max_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_min_bin_,
feature_min_bin_.data(),
feature_min_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_offset_,
feature_offset_.data(),
feature_offset_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_most_freq_bin_,
feature_most_freq_bin_.data(),
feature_most_freq_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_feature_default_bin_,
feature_default_bin_.data(),
feature_default_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_feature_missing_is_zero_,
feature_missing_is_zero_.data(),
feature_missing_is_zero_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_feature_missing_is_na_,
feature_missing_is_na_.data(),
feature_missing_is_na_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_feature_mfb_is_zero_,
feature_mfb_is_zero_.data(),
feature_mfb_is_zero_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_feature_mfb_is_na_,
feature_mfb_is_na_.data(),
feature_mfb_is_na_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_feature_to_column_,
feature_to_column_.data(),
feature_to_column_.size(),
__FILE__,
__LINE__);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_column_data.hpp>
#define COPY_SUBROW_BLOCK_SIZE_COLUMN_DATA (1024)
namespace LightGBM {
__global__ void CopySubrowKernel_ColumnData(
void* const* in_cuda_data_by_column,
const uint8_t* cuda_column_bit_type,
const data_size_t* cuda_used_indices,
const data_size_t num_used_indices,
const int num_column,
void** out_cuda_data_by_column) {
const data_size_t local_data_index = static_cast<data_size_t>(threadIdx.x + blockIdx.x * blockDim.x);
if (local_data_index < num_used_indices) {
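// each thread gathers one sampled row, copying its bin value from every column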
for (int column_index = 0; column_index < num_column; ++column_index) {
const void* in_column_data = in_cuda_data_by_column[column_index];
void* out_column_data = out_cuda_data_by_column[column_index];
const uint8_t bit_type = cuda_column_bit_type[column_index];
if (bit_type == 8) {
const uint8_t* true_in_column_data = reinterpret_cast<const uint8_t*>(in_column_data);
uint8_t* true_out_column_data = reinterpret_cast<uint8_t*>(out_column_data);
const data_size_t global_data_index = cuda_used_indices[local_data_index];
true_out_column_data[local_data_index] = true_in_column_data[global_data_index];
} else if (bit_type == 16) {
const uint16_t* true_in_column_data = reinterpret_cast<const uint16_t*>(in_column_data);
uint16_t* true_out_column_data = reinterpret_cast<uint16_t*>(out_column_data);
const data_size_t global_data_index = cuda_used_indices[local_data_index];
true_out_column_data[local_data_index] = true_in_column_data[global_data_index];
} else if (bit_type == 32) {
const uint32_t* true_in_column_data = reinterpret_cast<const uint32_t*>(in_column_data);
uint32_t* true_out_column_data = reinterpret_cast<uint32_t*>(out_column_data);
const data_size_t global_data_index = cuda_used_indices[local_data_index];
true_out_column_data[local_data_index] = true_in_column_data[global_data_index];
}
}
}
}
void CUDAColumnData::LaunchCopySubrowKernel(void* const* in_cuda_data_by_column) {
const int num_blocks = (num_used_indices_ + COPY_SUBROW_BLOCK_SIZE_COLUMN_DATA - 1) / COPY_SUBROW_BLOCK_SIZE_COLUMN_DATA;
CopySubrowKernel_ColumnData<<<num_blocks, COPY_SUBROW_BLOCK_SIZE_COLUMN_DATA>>>(
in_cuda_data_by_column,
cuda_column_bit_type_,
cuda_used_indices_,
num_used_indices_,
num_columns_,
cuda_data_by_column_);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
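To place this kernel in context, a sketch of the bagging call site it serves (the variable names here are illustrative; the constructor and CopySubrow signatures are the ones defined above):
// Hypothetical bagging round: gather the sampled rows into a subset copy.
// full_column_data: const CUDAColumnData* for the full training set;
// bag_data_indices: host-side row indices of the sample; bag_data_cnt: sample size.
CUDAColumnData subset_column_data(bag_data_cnt, gpu_device_id);
subset_column_data.CopySubrow(full_column_data, bag_data_indices, bag_data_cnt);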
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_metadata.hpp>
namespace LightGBM {
CUDAMetadata::CUDAMetadata(const int gpu_device_id) {
if (gpu_device_id >= 0) {
SetCUDADevice(gpu_device_id, __FILE__, __LINE__);
} else {
SetCUDADevice(0, __FILE__, __LINE__);
}
cuda_label_ = nullptr;
cuda_weights_ = nullptr;
cuda_query_boundaries_ = nullptr;
cuda_query_weights_ = nullptr;
cuda_init_score_ = nullptr;
}
CUDAMetadata::~CUDAMetadata() {
DeallocateCUDAMemory<label_t>(&cuda_label_, __FILE__, __LINE__);
DeallocateCUDAMemory<label_t>(&cuda_weights_, __FILE__, __LINE__);
DeallocateCUDAMemory<data_size_t>(&cuda_query_boundaries_, __FILE__, __LINE__);
DeallocateCUDAMemory<label_t>(&cuda_query_weights_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_init_score_, __FILE__, __LINE__);
}
void CUDAMetadata::Init(const std::vector<label_t>& label,
const std::vector<label_t>& weight,
const std::vector<data_size_t>& query_boundaries,
const std::vector<label_t>& query_weights,
const std::vector<double>& init_score) {
if (label.size() == 0) {
cuda_label_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<label_t>(&cuda_label_, label.data(), label.size(), __FILE__, __LINE__);
}
if (weight.size() == 0) {
cuda_weights_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<label_t>(&cuda_weights_, weight.data(), weight.size(), __FILE__, __LINE__);
}
if (query_boundaries.size() == 0) {
cuda_query_boundaries_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<data_size_t>(&cuda_query_boundaries_, query_boundaries.data(), query_boundaries.size(), __FILE__, __LINE__);
}
if (query_weights.size() == 0) {
cuda_query_weights_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<label_t>(&cuda_query_weights_, query_weights.data(), query_weights.size(), __FILE__, __LINE__);
}
if (init_score.size() == 0) {
cuda_init_score_ = nullptr;
} else {
InitCUDAMemoryFromHostMemory<double>(&cuda_init_score_, init_score.data(), init_score.size(), __FILE__, __LINE__);
}
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDAMetadata::SetLabel(const label_t* label, data_size_t len) {
DeallocateCUDAMemory<label_t>(&cuda_label_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<label_t>(&cuda_label_, label, static_cast<size_t>(len), __FILE__, __LINE__);
}
void CUDAMetadata::SetWeights(const label_t* weights, data_size_t len) {
DeallocateCUDAMemory<label_t>(&cuda_weights_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<label_t>(&cuda_weights_, weights, static_cast<size_t>(len), __FILE__, __LINE__);
}
void CUDAMetadata::SetQuery(const data_size_t* query_boundaries, const label_t* query_weights, data_size_t num_queries) {
DeallocateCUDAMemory<data_size_t>(&cuda_query_boundaries_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<data_size_t>(&cuda_query_boundaries_, query_boundaries, static_cast<size_t>(num_queries) + 1, __FILE__, __LINE__);
if (query_weights != nullptr) {
DeallocateCUDAMemory<label_t>(&cuda_query_weights_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<label_t>(&cuda_query_weights_, query_weights, static_cast<size_t>(num_queries), __FILE__, __LINE__);
}
}
void CUDAMetadata::SetInitScore(const double* init_score, data_size_t len) {
DeallocateCUDAMemory<double>(&cuda_init_score_, __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_init_score_, init_score, static_cast<size_t>(len), __FILE__, __LINE__);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
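// Editor's note (not part of the original commit): a minimal, hypothetical
// host-side driver showing the intended call pattern for CUDAMetadata. Empty
// vectors leave the corresponding device buffer as nullptr, mirroring the
// branches in Init() above; ExampleInitCUDAMetadata is an illustrative name.
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_metadata.hpp>
#include <vector>
namespace LightGBM {
void ExampleInitCUDAMetadata() {
  CUDAMetadata cuda_metadata(/*gpu_device_id=*/0);
  std::vector<label_t> label = {0.0f, 1.0f, 1.0f, 0.0f};
  std::vector<label_t> weight;                // empty: cuda_weights_ stays nullptr
  std::vector<data_size_t> query_boundaries;  // empty: not a ranking task
  std::vector<label_t> query_weights;
  std::vector<double> init_score;
  cuda_metadata.Init(label, weight, query_boundaries, query_weights, init_score);
}
}  // namespace LightGBM
#endif  // USE_CUDA_EXP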
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_row_data.hpp>
namespace LightGBM {
CUDARowData::CUDARowData(const Dataset* train_data,
const TrainingShareStates* train_share_state,
const int gpu_device_id,
const bool gpu_use_dp):
gpu_device_id_(gpu_device_id),
gpu_use_dp_(gpu_use_dp) {
num_threads_ = OMP_NUM_THREADS();
num_data_ = train_data->num_data();
const auto& feature_hist_offsets = train_share_state->feature_hist_offsets();
if (gpu_use_dp_) {
shared_hist_size_ = DP_SHARED_HIST_SIZE;
} else {
shared_hist_size_ = SP_SHARED_HIST_SIZE;
}
if (feature_hist_offsets.empty()) {
num_total_bin_ = 0;
} else {
num_total_bin_ = static_cast<int>(feature_hist_offsets.back());
}
num_feature_group_ = train_data->num_feature_groups();
num_feature_ = train_data->num_features();
if (gpu_device_id >= 0) {
SetCUDADevice(gpu_device_id, __FILE__, __LINE__);
} else {
SetCUDADevice(0, __FILE__, __LINE__);
}
cuda_data_uint8_t_ = nullptr;
cuda_data_uint16_t_ = nullptr;
cuda_data_uint32_t_ = nullptr;
cuda_row_ptr_uint16_t_ = nullptr;
cuda_row_ptr_uint32_t_ = nullptr;
cuda_row_ptr_uint64_t_ = nullptr;
cuda_partition_ptr_uint16_t_ = nullptr;
cuda_partition_ptr_uint32_t_ = nullptr;
cuda_partition_ptr_uint64_t_ = nullptr;
cuda_feature_partition_column_index_offsets_ = nullptr;
cuda_column_hist_offsets_ = nullptr;
cuda_partition_hist_offsets_ = nullptr;
cuda_block_buffer_uint16_t_ = nullptr;
cuda_block_buffer_uint32_t_ = nullptr;
cuda_block_buffer_uint64_t_ = nullptr;
}
CUDARowData::~CUDARowData() {
DeallocateCUDAMemory<uint8_t>(&cuda_data_uint8_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint16_t>(&cuda_data_uint16_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_data_uint32_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint16_t>(&cuda_row_ptr_uint16_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_row_ptr_uint32_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint64_t>(&cuda_row_ptr_uint64_t_, __FILE__, __LINE__);
// also release the partition pointer buffers allocated for sparse data
DeallocateCUDAMemory<uint16_t>(&cuda_partition_ptr_uint16_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_partition_ptr_uint32_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint64_t>(&cuda_partition_ptr_uint64_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_feature_partition_column_index_offsets_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_column_hist_offsets_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_partition_hist_offsets_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint16_t>(&cuda_block_buffer_uint16_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_block_buffer_uint32_t_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint64_t>(&cuda_block_buffer_uint64_t_, __FILE__, __LINE__);
}
void CUDARowData::Init(const Dataset* train_data, TrainingShareStates* train_share_state) {
if (num_feature_ == 0) {
return;
}
DivideCUDAFeatureGroups(train_data, train_share_state);
bit_type_ = 0;
size_t total_size = 0;
const void* host_row_ptr = nullptr;
row_ptr_bit_type_ = 0;
const void* host_data = train_share_state->GetRowWiseData(&bit_type_, &total_size, &is_sparse_, &host_row_ptr, &row_ptr_bit_type_);
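// bit_type_ is the integer width of the bin values (8/16/32 bits);
// row_ptr_bit_type_ is the width of the CSR-style row offsets (16/32/64 bits)
// used for sparse data, so the branches below dispatch on both widths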
if (bit_type_ == 8) {
if (!is_sparse_) {
std::vector<uint8_t> partitioned_data;
GetDenseDataPartitioned<uint8_t>(reinterpret_cast<const uint8_t*>(host_data), &partitioned_data);
InitCUDAMemoryFromHostMemory<uint8_t>(&cuda_data_uint8_t_, partitioned_data.data(), total_size, __FILE__, __LINE__);
} else {
if (row_ptr_bit_type_ == 16) {
InitSparseData<uint8_t, uint16_t>(
reinterpret_cast<const uint8_t*>(host_data),
reinterpret_cast<const uint16_t*>(host_row_ptr),
&cuda_data_uint8_t_,
&cuda_row_ptr_uint16_t_,
&cuda_partition_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
InitSparseData<uint8_t, uint32_t>(
reinterpret_cast<const uint8_t*>(host_data),
reinterpret_cast<const uint32_t*>(host_row_ptr),
&cuda_data_uint8_t_,
&cuda_row_ptr_uint32_t_,
&cuda_partition_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
InitSparseData<uint8_t, uint64_t>(
reinterpret_cast<const uint8_t*>(host_data),
reinterpret_cast<const uint64_t*>(host_row_ptr),
&cuda_data_uint8_t_,
&cuda_row_ptr_uint64_t_,
&cuda_partition_ptr_uint64_t_);
} else {
Log::Fatal("Unknow data ptr bit type %d", row_ptr_bit_type_);
}
}
} else if (bit_type_ == 16) {
if (!is_sparse_) {
std::vector<uint16_t> partitioned_data;
GetDenseDataPartitioned<uint16_t>(reinterpret_cast<const uint16_t*>(host_data), &partitioned_data);
InitCUDAMemoryFromHostMemory<uint16_t>(&cuda_data_uint16_t_, partitioned_data.data(), total_size, __FILE__, __LINE__);
} else {
if (row_ptr_bit_type_ == 16) {
InitSparseData<uint16_t, uint16_t>(
reinterpret_cast<const uint16_t*>(host_data),
reinterpret_cast<const uint16_t*>(host_row_ptr),
&cuda_data_uint16_t_,
&cuda_row_ptr_uint16_t_,
&cuda_partition_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
InitSparseData<uint16_t, uint32_t>(
reinterpret_cast<const uint16_t*>(host_data),
reinterpret_cast<const uint32_t*>(host_row_ptr),
&cuda_data_uint16_t_,
&cuda_row_ptr_uint32_t_,
&cuda_partition_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
InitSparseData<uint16_t, uint64_t>(
reinterpret_cast<const uint16_t*>(host_data),
reinterpret_cast<const uint64_t*>(host_row_ptr),
&cuda_data_uint16_t_,
&cuda_row_ptr_uint64_t_,
&cuda_partition_ptr_uint64_t_);
} else {
Log::Fatal("Unknow data ptr bit type %d", row_ptr_bit_type_);
}
}
} else if (bit_type_ == 32) {
if (!is_sparse_) {
std::vector<uint32_t> partitioned_data;
GetDenseDataPartitioned<uint32_t>(reinterpret_cast<const uint32_t*>(host_data), &partitioned_data);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_data_uint32_t_, partitioned_data.data(), total_size, __FILE__, __LINE__);
} else {
if (row_ptr_bit_type_ == 16) {
InitSparseData<uint32_t, uint16_t>(
reinterpret_cast<const uint32_t*>(host_data),
reinterpret_cast<const uint16_t*>(host_row_ptr),
&cuda_data_uint32_t_,
&cuda_row_ptr_uint16_t_,
&cuda_partition_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
InitSparseData<uint32_t, uint32_t>(
reinterpret_cast<const uint32_t*>(host_data),
reinterpret_cast<const uint32_t*>(host_row_ptr),
&cuda_data_uint32_t_,
&cuda_row_ptr_uint32_t_,
&cuda_partition_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
InitSparseData<uint32_t, uint64_t>(
reinterpret_cast<const uint32_t*>(host_data),
reinterpret_cast<const uint64_t*>(host_row_ptr),
&cuda_data_uint32_t_,
&cuda_row_ptr_uint64_t_,
&cuda_partition_ptr_uint64_t_);
} else {
Log::Fatal("Unknow data ptr bit type %d", row_ptr_bit_type_);
}
}
} else {
Log::Fatal("Unknow bit type = %d", bit_type_);
}
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDARowData::DivideCUDAFeatureGroups(const Dataset* train_data, TrainingShareStates* share_state) {
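// each histogram bin stores a gradient entry and a hessian entry, so (assuming
// shared_hist_size_ counts scalar entries) at most half that many bins fit
// into one shared-memory partition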
const uint32_t max_num_bin_per_partition = shared_hist_size_ / 2;
const std::vector<uint32_t>& column_hist_offsets = share_state->column_hist_offsets();
std::vector<int> feature_group_num_feature_offsets;
int offsets = 0;
int prev_group_index = -1;
for (int feature_index = 0; feature_index < num_feature_; ++feature_index) {
const int feature_group_index = train_data->Feature2Group(feature_index);
if (prev_group_index == -1 || feature_group_index != prev_group_index) {
feature_group_num_feature_offsets.emplace_back(offsets);
prev_group_index = feature_group_index;
}
++offsets;
}
CHECK_EQ(offsets, num_feature_);
feature_group_num_feature_offsets.emplace_back(offsets);
uint32_t start_hist_offset = 0;
feature_partition_column_index_offsets_.clear();
column_hist_offsets_.clear();
partition_hist_offsets_.clear();
feature_partition_column_index_offsets_.emplace_back(0);
partition_hist_offsets_.emplace_back(0);
const int num_feature_groups = train_data->num_feature_groups();
int column_index = 0;
num_feature_partitions_ = 0;
large_bin_partitions_.clear();
small_bin_partitions_.clear();
for (int feature_group_index = 0; feature_group_index < num_feature_groups; ++feature_group_index) {
if (!train_data->IsMultiGroup(feature_group_index)) {
const uint32_t column_feature_hist_start = column_hist_offsets[column_index];
const uint32_t column_feature_hist_end = column_hist_offsets[column_index + 1];
const uint32_t num_bin_in_dense_group = column_feature_hist_end - column_feature_hist_start;
// if one column has too many bins, use a separate partition for that column
if (num_bin_in_dense_group > max_num_bin_per_partition) {
feature_partition_column_index_offsets_.emplace_back(column_index + 1);
start_hist_offset = column_feature_hist_end;
partition_hist_offsets_.emplace_back(start_hist_offset);
large_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
column_hist_offsets_.emplace_back(0);
++column_index;
continue;
}
// check whether adding this column would exceed the maximum number of bins per partition
const uint32_t cur_hist_num_bin = column_feature_hist_end - start_hist_offset;
if (cur_hist_num_bin > max_num_bin_per_partition) {
feature_partition_column_index_offsets_.emplace_back(column_index);
start_hist_offset = column_feature_hist_start;
partition_hist_offsets_.emplace_back(start_hist_offset);
small_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
}
column_hist_offsets_.emplace_back(column_hist_offsets[column_index] - start_hist_offset);
if (feature_group_index == num_feature_groups - 1) {
feature_partition_column_index_offsets_.emplace_back(column_index + 1);
partition_hist_offsets_.emplace_back(column_hist_offsets.back());
small_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
}
++column_index;
} else {
const int group_feature_index_start = feature_group_num_feature_offsets[feature_group_index];
const int num_feature_in_group = feature_group_num_feature_offsets[feature_group_index + 1] - group_feature_index_start;
for (int sub_feature_index = 0; sub_feature_index < num_feature_in_group; ++sub_feature_index) {
const int feature_index = group_feature_index_start + sub_feature_index;
const uint32_t column_feature_hist_start = column_hist_offsets[column_index];
const uint32_t column_feature_hist_end = column_hist_offsets[column_index + 1];
const uint32_t num_bin_in_dense_group = column_feature_hist_end - column_feature_hist_start;
// if one column has too many bins, use a separate partition for that column
if (num_bin_in_dense_group > max_num_bin_per_partition) {
feature_partition_column_index_offsets_.emplace_back(column_index + 1);
start_hist_offset = column_feature_hist_end;
partition_hist_offsets_.emplace_back(start_hist_offset);
large_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
column_hist_offsets_.emplace_back(0);
++column_index;
continue;
}
// check whether adding this column would exceed the maximum number of bins per partition
const uint32_t cur_hist_num_bin = column_feature_hist_end - start_hist_offset;
if (cur_hist_num_bin > max_num_bin_per_partition) {
feature_partition_column_index_offsets_.emplace_back(column_index);
start_hist_offset = column_feature_hist_start;
partition_hist_offsets_.emplace_back(start_hist_offset);
small_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
}
column_hist_offsets_.emplace_back(column_hist_offsets[column_index] - start_hist_offset);
if (feature_group_index == num_feature_groups - 1 && sub_feature_index == num_feature_in_group - 1) {
CHECK_EQ(feature_index, num_feature_ - 1);
feature_partition_column_index_offsets_.emplace_back(column_index + 1);
partition_hist_offsets_.emplace_back(column_hist_offsets.back());
small_bin_partitions_.emplace_back(num_feature_partitions_);
++num_feature_partitions_;
}
++column_index;
}
}
}
column_hist_offsets_.emplace_back(column_hist_offsets.back() - start_hist_offset);
max_num_column_per_partition_ = 0;
for (size_t i = 0; i < feature_partition_column_index_offsets_.size() - 1; ++i) {
const int num_column = feature_partition_column_index_offsets_[i + 1] - feature_partition_column_index_offsets_[i];
if (num_column > max_num_column_per_partition_) {
max_num_column_per_partition_ = num_column;
}
}
InitCUDAMemoryFromHostMemory<int>(&cuda_feature_partition_column_index_offsets_,
feature_partition_column_index_offsets_.data(),
feature_partition_column_index_offsets_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_column_hist_offsets_,
column_hist_offsets_.data(),
column_hist_offsets_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_partition_hist_offsets_,
partition_hist_offsets_.data(),
partition_hist_offsets_.size(),
__FILE__,
__LINE__);
}
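// Editor's worked example (not part of the original commit), tracing the loop
// above with hypothetical column histogram offsets {0, 14, 17, 20, 24} and an
// assumed budget max_num_bin_per_partition = 8:
//   column 0 has 14 bins > 8  -> large-bin partition 0, hist range [0, 14)
//   columns 1-2 (3 + 3 bins)  -> small-bin partition 1, hist range [14, 20)
//   column 3 (4 bins; adding it to partition 1 would give 10 > 8)
//                             -> small-bin partition 2, hist range [20, 24)
// so each small-bin partition's histogram fits in shared memory, while
// large-bin partitions are handled separately.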
template <typename BIN_TYPE>
void CUDARowData::GetDenseDataPartitioned(const BIN_TYPE* row_wise_data, std::vector<BIN_TYPE>* partitioned_data) {
const int num_total_columns = feature_partition_column_index_offsets_.back();
partitioned_data->resize(static_cast<size_t>(num_total_columns) * static_cast<size_t>(num_data_), 0);
BIN_TYPE* out_data = partitioned_data->data();
Threading::For<data_size_t>(0, num_data_, 512,
[this, num_total_columns, row_wise_data, out_data] (int /*thread_index*/, data_size_t start, data_size_t end) {
for (size_t i = 0; i < feature_partition_column_index_offsets_.size() - 1; ++i) {
const int num_prev_columns = static_cast<int>(feature_partition_column_index_offsets_[i]);
const data_size_t offset = num_data_ * num_prev_columns;
const int partition_column_start = feature_partition_column_index_offsets_[i];
const int partition_column_end = feature_partition_column_index_offsets_[i + 1];
const int num_columns_in_cur_partition = partition_column_end - partition_column_start;
for (data_size_t data_index = start; data_index < end; ++data_index) {
const data_size_t data_offset = offset + data_index * num_columns_in_cur_partition;
const data_size_t read_data_offset = data_index * num_total_columns;
for (int column_index = 0; column_index < num_columns_in_cur_partition; ++column_index) {
const int true_column_index = read_data_offset + column_index + partition_column_start;
const BIN_TYPE bin = row_wise_data[true_column_index];
out_data[data_offset + column_index] = bin;
}
}
}
});
}
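// Editor's worked example (not part of the original commit): with 2 rows, 4
// columns and hypothetical partitions {columns 0-1, columns 2-3}, the
// row-major input
//   row 0: {10, 11, 12, 13}
//   row 1: {20, 21, 22, 23}
// is regrouped so that each partition holds its own columns contiguously over
// all rows:
//   partition 0: 10 11 20 21   (starts at num_data_ * 0)
//   partition 1: 12 13 22 23   (starts at num_data_ * 2)
// which lets each histogram kernel block read its partition as one dense slab.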
template <typename BIN_TYPE, typename DATA_PTR_TYPE>
void CUDARowData::GetSparseDataPartitioned(
const BIN_TYPE* row_wise_data,
const DATA_PTR_TYPE* row_ptr,
std::vector<std::vector<BIN_TYPE>>* partitioned_data,
std::vector<std::vector<DATA_PTR_TYPE>>* partitioned_row_ptr,
std::vector<DATA_PTR_TYPE>* partition_ptr) {
const int num_partitions = static_cast<int>(feature_partition_column_index_offsets_.size()) - 1;
partitioned_data->resize(num_partitions);
partitioned_row_ptr->resize(num_partitions);
std::vector<int> thread_max_elements_per_row(num_threads_, 0);
Threading::For<int>(0, num_partitions, 1,
[partitioned_data, partitioned_row_ptr, row_ptr, row_wise_data, &thread_max_elements_per_row, this] (int thread_index, int start, int end) {
for (int partition_index = start; partition_index < end; ++partition_index) {
std::vector<BIN_TYPE>& data_for_this_partition = partitioned_data->at(partition_index);
std::vector<DATA_PTR_TYPE>& row_ptr_for_this_partition = partitioned_row_ptr->at(partition_index);
const int partition_hist_start = partition_hist_offsets_[partition_index];
const int partition_hist_end = partition_hist_offsets_[partition_index + 1];
DATA_PTR_TYPE offset = 0;
row_ptr_for_this_partition.clear();
data_for_this_partition.clear();
row_ptr_for_this_partition.emplace_back(offset);
for (data_size_t data_index = 0; data_index < num_data_; ++data_index) {
const DATA_PTR_TYPE row_start = row_ptr[data_index];
const DATA_PTR_TYPE row_end = row_ptr[data_index + 1];
const BIN_TYPE* row_data_start = row_wise_data + row_start;
const BIN_TYPE* row_data_end = row_wise_data + row_end;
const size_t partition_start_in_row = std::lower_bound(row_data_start, row_data_end, partition_hist_start) - row_data_start;
const size_t partition_end_in_row = std::lower_bound(row_data_start, row_data_end, partition_hist_end) - row_data_start;
for (size_t pos = partition_start_in_row; pos < partition_end_in_row; ++pos) {
const BIN_TYPE bin = row_data_start[pos];
CHECK_GE(bin, static_cast<BIN_TYPE>(partition_hist_start));
data_for_this_partition.emplace_back(bin - partition_hist_start);
}
CHECK_GE(partition_end_in_row, partition_start_in_row);
const data_size_t num_elements_in_row = partition_end_in_row - partition_start_in_row;
offset += static_cast<DATA_PTR_TYPE>(num_elements_in_row);
row_ptr_for_this_partition.emplace_back(offset);
if (num_elements_in_row > thread_max_elements_per_row[thread_index]) {
thread_max_elements_per_row[thread_index] = num_elements_in_row;
}
}
}
});
partition_ptr->clear();
DATA_PTR_TYPE offset = 0;
partition_ptr->emplace_back(offset);
for (size_t i = 0; i < partitioned_row_ptr->size(); ++i) {
offset += partitioned_row_ptr->at(i).back();
partition_ptr->emplace_back(offset);
}
max_num_column_per_partition_ = 0;
for (int thread_index = 0; thread_index < num_threads_; ++thread_index) {
if (thread_max_elements_per_row[thread_index] > max_num_column_per_partition_) {
max_num_column_per_partition_ = thread_max_elements_per_row[thread_index];
}
}
}
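// Editor's worked example (not part of the original commit): for one row with
// sorted bin values {3, 7, 12, 40, 41} and a partition covering hist range
// [10, 42), the two std::lower_bound calls above select the slice {12, 40, 41},
// which is stored re-based to the partition as {2, 30, 31}; bins 3 and 7
// belong to an earlier partition.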
template <typename BIN_TYPE, typename ROW_PTR_TYPE>
void CUDARowData::InitSparseData(const BIN_TYPE* host_data,
const ROW_PTR_TYPE* host_row_ptr,
BIN_TYPE** cuda_data,
ROW_PTR_TYPE** cuda_row_ptr,
ROW_PTR_TYPE** cuda_partition_ptr) {
std::vector<std::vector<BIN_TYPE>> partitioned_data;
std::vector<std::vector<ROW_PTR_TYPE>> partitioned_data_ptr;
std::vector<ROW_PTR_TYPE> partition_ptr;
GetSparseDataPartitioned<BIN_TYPE, ROW_PTR_TYPE>(host_data, host_row_ptr, &partitioned_data, &partitioned_data_ptr, &partition_ptr);
InitCUDAMemoryFromHostMemory<ROW_PTR_TYPE>(cuda_partition_ptr, partition_ptr.data(), partition_ptr.size(), __FILE__, __LINE__);
AllocateCUDAMemory<BIN_TYPE>(cuda_data, partition_ptr.back(), __FILE__, __LINE__);
AllocateCUDAMemory<ROW_PTR_TYPE>(cuda_row_ptr, (num_data_ + 1) * partitioned_data_ptr.size(), __FILE__, __LINE__);
for (size_t i = 0; i < partitioned_data.size(); ++i) {
const std::vector<ROW_PTR_TYPE>& data_ptr_for_this_partition = partitioned_data_ptr[i];
const std::vector<BIN_TYPE>& data_for_this_partition = partitioned_data[i];
CopyFromHostToCUDADevice<BIN_TYPE>((*cuda_data) + partition_ptr[i], data_for_this_partition.data(), data_for_this_partition.size(), __FILE__, __LINE__);
CopyFromHostToCUDADevice<ROW_PTR_TYPE>((*cuda_row_ptr) + i * (num_data_ + 1), data_ptr_for_this_partition.data(), data_ptr_for_this_partition.size(), __FILE__, __LINE__);
}
}
template <typename BIN_TYPE>
const BIN_TYPE* CUDARowData::GetBin() const {
if (bit_type_ == 8) {
return reinterpret_cast<const BIN_TYPE*>(cuda_data_uint8_t_);
} else if (bit_type_ == 16) {
return reinterpret_cast<const BIN_TYPE*>(cuda_data_uint16_t_);
} else if (bit_type_ == 32) {
return reinterpret_cast<const BIN_TYPE*>(cuda_data_uint32_t_);
} else {
Log::Fatal("Unknown bit_type %d for GetBin.", bit_type_);
}
}
template const uint8_t* CUDARowData::GetBin<uint8_t>() const;
template const uint16_t* CUDARowData::GetBin<uint16_t>() const;
template const uint32_t* CUDARowData::GetBin<uint32_t>() const;
template <typename PTR_TYPE>
const PTR_TYPE* CUDARowData::GetRowPtr() const {
if (row_ptr_bit_type_ == 16) {
return reinterpret_cast<const PTR_TYPE*>(cuda_row_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
return reinterpret_cast<const PTR_TYPE*>(cuda_row_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
return reinterpret_cast<const PTR_TYPE*>(cuda_row_ptr_uint64_t_);
} else {
Log::Fatal("Unknown row_ptr_bit_type = %d for GetRowPtr.", row_ptr_bit_type_);
}
}
template const uint16_t* CUDARowData::GetRowPtr<uint16_t>() const;
template const uint32_t* CUDARowData::GetRowPtr<uint32_t>() const;
template const uint64_t* CUDARowData::GetRowPtr<uint64_t>() const;
template <typename PTR_TYPE>
const PTR_TYPE* CUDARowData::GetPartitionPtr() const {
if (row_ptr_bit_type_ == 16) {
return reinterpret_cast<const PTR_TYPE*>(cuda_partition_ptr_uint16_t_);
} else if (row_ptr_bit_type_ == 32) {
return reinterpret_cast<const PTR_TYPE*>(cuda_partition_ptr_uint32_t_);
} else if (row_ptr_bit_type_ == 64) {
return reinterpret_cast<const PTR_TYPE*>(cuda_partition_ptr_uint64_t_);
} else {
Log::Fatal("Unknown row_ptr_bit_type = %d for GetPartitionPtr.", row_ptr_bit_type_);
}
}
template const uint16_t* CUDARowData::GetPartitionPtr<uint16_t>() const;
template const uint32_t* CUDARowData::GetPartitionPtr<uint32_t>() const;
template const uint64_t* CUDARowData::GetPartitionPtr<uint64_t>() const;
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_tree.hpp>
namespace LightGBM {
CUDATree::CUDATree(int max_leaves, bool track_branch_features, bool is_linear,
const int gpu_device_id, const bool has_categorical_feature):
Tree(max_leaves, track_branch_features, is_linear),
num_threads_per_block_add_prediction_to_score_(1024) {
is_cuda_tree_ = true;
if (gpu_device_id >= 0) {
SetCUDADevice(gpu_device_id, __FILE__, __LINE__);
} else {
SetCUDADevice(0, __FILE__, __LINE__);
}
if (has_categorical_feature) {
cuda_cat_boundaries_.Resize(max_leaves);
cuda_cat_boundaries_inner_.Resize(max_leaves);
}
InitCUDAMemory();
}
CUDATree::CUDATree(const Tree* host_tree):
Tree(*host_tree),
num_threads_per_block_add_prediction_to_score_(1024) {
is_cuda_tree_ = true;
InitCUDA();
}
CUDATree::~CUDATree() {
DeallocateCUDAMemory<int>(&cuda_left_child_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_right_child_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_split_feature_inner_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_split_feature_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_leaf_depth_, __FILE__, __LINE__);
DeallocateCUDAMemory<int>(&cuda_leaf_parent_, __FILE__, __LINE__);
DeallocateCUDAMemory<uint32_t>(&cuda_threshold_in_bin_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_threshold_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_internal_weight_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_internal_value_, __FILE__, __LINE__);
DeallocateCUDAMemory<int8_t>(&cuda_decision_type_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_leaf_value_, __FILE__, __LINE__);
DeallocateCUDAMemory<data_size_t>(&cuda_leaf_count_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_leaf_weight_, __FILE__, __LINE__);
DeallocateCUDAMemory<data_size_t>(&cuda_internal_count_, __FILE__, __LINE__);
DeallocateCUDAMemory<float>(&cuda_split_gain_, __FILE__, __LINE__);
gpuAssert(cudaStreamDestroy(cuda_stream_), __FILE__, __LINE__);
}
void CUDATree::InitCUDAMemory() {
AllocateCUDAMemory<int>(&cuda_left_child_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_right_child_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_split_feature_inner_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_split_feature_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_leaf_depth_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int>(&cuda_leaf_parent_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<uint32_t>(&cuda_threshold_in_bin_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_threshold_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<int8_t>(&cuda_decision_type_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_leaf_value_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_internal_weight_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_internal_value_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<double>(&cuda_leaf_weight_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<data_size_t>(&cuda_leaf_count_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<data_size_t>(&cuda_internal_count_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
AllocateCUDAMemory<float>(&cuda_split_gain_,
static_cast<size_t>(max_leaves_),
__FILE__,
__LINE__);
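// only the root's entries need values before the first split; the split
// kernels fill in the remaining slots as leaves are added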
SetCUDAMemory<double>(cuda_leaf_value_, 0.0f, 1, __FILE__, __LINE__);
SetCUDAMemory<double>(cuda_leaf_weight_, 0.0f, 1, __FILE__, __LINE__);
SetCUDAMemory<int>(cuda_leaf_parent_, -1, 1, __FILE__, __LINE__);
CUDASUCCESS_OR_FATAL(cudaStreamCreate(&cuda_stream_));
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDATree::InitCUDA() {
InitCUDAMemoryFromHostMemory<int>(&cuda_left_child_,
left_child_.data(),
left_child_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_right_child_,
right_child_.data(),
right_child_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_split_feature_inner_,
split_feature_inner_.data(),
split_feature_inner_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_split_feature_,
split_feature_.data(),
split_feature_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<uint32_t>(&cuda_threshold_in_bin_,
threshold_in_bin_.data(),
threshold_in_bin_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_threshold_,
threshold_.data(),
threshold_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_leaf_depth_,
leaf_depth_.data(),
leaf_depth_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int8_t>(&cuda_decision_type_,
decision_type_.data(),
decision_type_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_internal_weight_,
internal_weight_.data(),
internal_weight_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_internal_value_,
internal_value_.data(),
internal_value_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<data_size_t>(&cuda_internal_count_,
internal_count_.data(),
internal_count_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<data_size_t>(&cuda_leaf_count_,
leaf_count_.data(),
leaf_count_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<float>(&cuda_split_gain_,
split_gain_.data(),
split_gain_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_leaf_value_,
leaf_value_.data(),
leaf_value_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<double>(&cuda_leaf_weight_,
leaf_weight_.data(),
leaf_weight_.size(),
__FILE__,
__LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_leaf_parent_,
leaf_parent_.data(),
leaf_parent_.size(),
__FILE__,
__LINE__);
CUDASUCCESS_OR_FATAL(cudaStreamCreate(&cuda_stream_));
SynchronizeCUDADevice(__FILE__, __LINE__);
}
int CUDATree::Split(const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info) {
LaunchSplitKernel(leaf_index, real_feature_index, real_threshold, missing_type, cuda_split_info);
++num_leaves_;
return num_leaves_ - 1;
}
int CUDATree::SplitCategorical(const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
uint32_t* cuda_bitset,
size_t cuda_bitset_len,
uint32_t* cuda_bitset_inner,
size_t cuda_bitset_inner_len) {
LaunchSplitCategoricalKernel(leaf_index, real_feature_index,
missing_type, cuda_split_info,
cuda_bitset_len, cuda_bitset_inner_len);
cuda_bitset_.PushBack(cuda_bitset, cuda_bitset_len);
cuda_bitset_inner_.PushBack(cuda_bitset_inner, cuda_bitset_inner_len);
++num_leaves_;
++num_cat_;
return num_leaves_ - 1;
}
inline void CUDATree::Shrinkage(double rate) {
Tree::Shrinkage(rate);
LaunchShrinkageKernel(rate);
}
inline void CUDATree::AddBias(double val) {
Tree::AddBias(val);
LaunchAddBiasKernel(val);
}
void CUDATree::ToHost() {
left_child_.resize(max_leaves_ - 1);
right_child_.resize(max_leaves_ - 1);
split_feature_inner_.resize(max_leaves_ - 1);
split_feature_.resize(max_leaves_ - 1);
threshold_in_bin_.resize(max_leaves_ - 1);
threshold_.resize(max_leaves_ - 1);
decision_type_.resize(max_leaves_ - 1, 0);
split_gain_.resize(max_leaves_ - 1);
leaf_parent_.resize(max_leaves_);
leaf_value_.resize(max_leaves_);
leaf_weight_.resize(max_leaves_);
leaf_count_.resize(max_leaves_);
internal_value_.resize(max_leaves_ - 1);
internal_weight_.resize(max_leaves_ - 1);
internal_count_.resize(max_leaves_ - 1);
leaf_depth_.resize(max_leaves_);
const size_t num_leaves_size = static_cast<size_t>(num_leaves_);
CopyFromCUDADeviceToHost<int>(left_child_.data(), cuda_left_child_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(right_child_.data(), cuda_right_child_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(split_feature_inner_.data(), cuda_split_feature_inner_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(split_feature_.data(), cuda_split_feature_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<uint32_t>(threshold_in_bin_.data(), cuda_threshold_in_bin_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(threshold_.data(), cuda_threshold_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int8_t>(decision_type_.data(), cuda_decision_type_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<float>(split_gain_.data(), cuda_split_gain_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(leaf_parent_.data(), cuda_leaf_parent_, num_leaves_size, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(leaf_value_.data(), cuda_leaf_value_, num_leaves_size, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(leaf_weight_.data(), cuda_leaf_weight_, num_leaves_size, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<data_size_t>(leaf_count_.data(), cuda_leaf_count_, num_leaves_size, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(internal_value_.data(), cuda_internal_value_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<double>(internal_weight_.data(), cuda_internal_weight_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<data_size_t>(internal_count_.data(), cuda_internal_count_, num_leaves_size - 1, __FILE__, __LINE__);
CopyFromCUDADeviceToHost<int>(leaf_depth_.data(), cuda_leaf_depth_, num_leaves_size, __FILE__, __LINE__);
if (num_cat_ > 0) {
cuda_cat_boundaries_inner_.Resize(num_cat_ + 1);
cuda_cat_boundaries_.Resize(num_cat_ + 1);
cat_boundaries_ = cuda_cat_boundaries_.ToHost();
cat_boundaries_inner_ = cuda_cat_boundaries_inner_.ToHost();
cat_threshold_ = cuda_bitset_.ToHost();
cat_threshold_inner_ = cuda_bitset_inner_.ToHost();
}
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDATree::SyncLeafOutputFromHostToCUDA() {
CopyFromHostToCUDADevice<double>(cuda_leaf_value_, leaf_value_.data(), leaf_value_.size(), __FILE__, __LINE__);
}
void CUDATree::SyncLeafOutputFromCUDAToHost() {
CopyFromCUDADeviceToHost<double>(leaf_value_.data(), cuda_leaf_value_, leaf_value_.size(), __FILE__, __LINE__);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
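// Editor's note (not part of the original commit): a hypothetical sketch of
// the intended CUDATree lifecycle. The tree is grown on-device by repeated
// Split() calls issued by the CUDA tree learner, then copied back in one batch:
//
//   CUDATree tree(/*max_leaves=*/31, /*track_branch_features=*/false,
//                 /*is_linear=*/false, /*gpu_device_id=*/0,
//                 /*has_categorical_feature=*/false);
//   // for each split chosen by the learner:
//   //   tree.Split(leaf_index, real_feature_index, real_threshold,
//   //              missing_type, cuda_split_info);
//   tree.ToHost();  // materialize the finished structure on the host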
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_tree.hpp>
namespace LightGBM {
__device__ void SetDecisionTypeCUDA(int8_t* decision_type, bool input, int8_t mask) {
if (input) {
(*decision_type) |= mask;
} else {
(*decision_type) &= (127 - mask);
}
}
__device__ void SetMissingTypeCUDA(int8_t* decision_type, int8_t input) {
(*decision_type) &= 3;
(*decision_type) |= (input << 2);
}
__device__ bool GetDecisionTypeCUDA(int8_t decision_type, int8_t mask) {
return (decision_type & mask) > 0;
}
__device__ int8_t GetMissingTypeCUDA(int8_t decision_type) {
return (decision_type >> 2) & 3;
}
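// Editor's sketch (not part of the original commit): the decision_type bit
// packing used by the helpers above. Bit 0 is the categorical flag, bit 1 the
// default-left flag, and bits 2-3 the missing type (on the host side,
// kCategoricalMask = 1 and kDefaultLeftMask = 2). For example, starting from 0:
//   decision_type |= kDefaultLeftMask;  // default_left = true  -> 0b0010
//   decision_type &= 3;                 // clear missing-type bits
//   decision_type |= (2 << 2);          // missing type NaN (2)  -> 0b1010
//   GetDecisionTypeCUDA(decision_type, kDefaultLeftMask)  -> true
//   GetMissingTypeCUDA(decision_type)                     -> 2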
__device__ bool IsZeroCUDA(double fval) {
return (fval >= -kZeroThreshold && fval <= kZeroThreshold);
}
__global__ void SplitKernel( // split information
const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
// tree structure
const int num_leaves,
int* leaf_parent,
int* leaf_depth,
int* left_child,
int* right_child,
int* split_feature_inner,
int* split_feature,
float* split_gain,
double* internal_weight,
double* internal_value,
data_size_t* internal_count,
double* leaf_weight,
double* leaf_value,
data_size_t* leaf_count,
int8_t* decision_type,
uint32_t* threshold_in_bin,
double* threshold) {
const int new_node_index = num_leaves - 1;
const int thread_index = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
const int parent_index = leaf_parent[leaf_index];
if (thread_index == 0) {
if (parent_index >= 0) {
// if cur node is left child
if (left_child[parent_index] == ~leaf_index) {
left_child[parent_index] = new_node_index;
} else {
right_child[parent_index] = new_node_index;
}
}
left_child[new_node_index] = ~leaf_index;
right_child[new_node_index] = ~num_leaves;
leaf_parent[leaf_index] = new_node_index;
leaf_parent[num_leaves] = new_node_index;
} else if (thread_index == 1) {
// add new node
split_feature_inner[new_node_index] = cuda_split_info->inner_feature_index;
} else if (thread_index == 2) {
split_feature[new_node_index] = real_feature_index;
} else if (thread_index == 3) {
split_gain[new_node_index] = static_cast<float>(cuda_split_info->gain);
} else if (thread_index == 4) {
// save current leaf value to internal node before change
internal_weight[new_node_index] = leaf_weight[leaf_index];
leaf_weight[leaf_index] = cuda_split_info->left_sum_hessians;
} else if (thread_index == 5) {
internal_value[new_node_index] = leaf_value[leaf_index];
leaf_value[leaf_index] = isnan(cuda_split_info->left_value) ? 0.0f : cuda_split_info->left_value;
} else if (thread_index == 6) {
internal_count[new_node_index] = cuda_split_info->left_count + cuda_split_info->right_count;
} else if (thread_index == 7) {
leaf_count[leaf_index] = cuda_split_info->left_count;
} else if (thread_index == 8) {
leaf_value[num_leaves] = isnan(cuda_split_info->right_value) ? 0.0f : cuda_split_info->right_value;
} else if (thread_index == 9) {
leaf_weight[num_leaves] = cuda_split_info->right_sum_hessians;
} else if (thread_index == 10) {
leaf_count[num_leaves] = cuda_split_info->right_count;
} else if (thread_index == 11) {
// update leaf depth
leaf_depth[num_leaves] = leaf_depth[leaf_index] + 1;
leaf_depth[leaf_index]++;
} else if (thread_index == 12) {
decision_type[new_node_index] = 0;
SetDecisionTypeCUDA(&decision_type[new_node_index], false, kCategoricalMask);
SetDecisionTypeCUDA(&decision_type[new_node_index], cuda_split_info->default_left, kDefaultLeftMask);
SetMissingTypeCUDA(&decision_type[new_node_index], static_cast<int8_t>(missing_type));
} else if (thread_index == 13) {
threshold_in_bin[new_node_index] = cuda_split_info->threshold;
} else if (thread_index == 14) {
threshold[new_node_index] = real_threshold;
}
}
void CUDATree::LaunchSplitKernel(const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info) {
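// 15 tree fields must be written: launch 3 blocks x 5 threads so that
// thread_index covers 0-14, one independent field per thread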
SplitKernel<<<3, 5, 0, cuda_stream_>>>(
// split information
leaf_index,
real_feature_index,
real_threshold,
missing_type,
cuda_split_info,
// tree structure
num_leaves_,
cuda_leaf_parent_,
cuda_leaf_depth_,
cuda_left_child_,
cuda_right_child_,
cuda_split_feature_inner_,
cuda_split_feature_,
cuda_split_gain_,
cuda_internal_weight_,
cuda_internal_value_,
cuda_internal_count_,
cuda_leaf_weight_,
cuda_leaf_value_,
cuda_leaf_count_,
cuda_decision_type_,
cuda_threshold_in_bin_,
cuda_threshold_);
}
__global__ void SplitCategoricalKernel( // split information
const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
// tree structure
const int num_leaves,
int* leaf_parent,
int* leaf_depth,
int* left_child,
int* right_child,
int* split_feature_inner,
int* split_feature,
float* split_gain,
double* internal_weight,
double* internal_value,
data_size_t* internal_count,
double* leaf_weight,
double* leaf_value,
data_size_t* leaf_count,
int8_t* decision_type,
uint32_t* threshold_in_bin,
double* threshold,
size_t cuda_bitset_len,
size_t cuda_bitset_inner_len,
int num_cat,
int* cuda_cat_boundaries,
int* cuda_cat_boundaries_inner) {
const int new_node_index = num_leaves - 1;
const int thread_index = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
const int parent_index = leaf_parent[leaf_index];
if (thread_index == 0) {
if (parent_index >= 0) {
// if cur node is left child
if (left_child[parent_index] == ~leaf_index) {
left_child[parent_index] = new_node_index;
} else {
right_child[parent_index] = new_node_index;
}
}
left_child[new_node_index] = ~leaf_index;
right_child[new_node_index] = ~num_leaves;
leaf_parent[leaf_index] = new_node_index;
leaf_parent[num_leaves] = new_node_index;
} else if (thread_index == 1) {
// add new node
split_feature_inner[new_node_index] = cuda_split_info->inner_feature_index;
} else if (thread_index == 2) {
split_feature[new_node_index] = real_feature_index;
} else if (thread_index == 3) {
split_gain[new_node_index] = static_cast<float>(cuda_split_info->gain);
} else if (thread_index == 4) {
// save current leaf value to internal node before change
internal_weight[new_node_index] = leaf_weight[leaf_index];
leaf_weight[leaf_index] = cuda_split_info->left_sum_hessians;
} else if (thread_index == 5) {
internal_value[new_node_index] = leaf_value[leaf_index];
leaf_value[leaf_index] = isnan(cuda_split_info->left_value) ? 0.0f : cuda_split_info->left_value;
} else if (thread_index == 6) {
internal_count[new_node_index] = cuda_split_info->left_count + cuda_split_info->right_count;
} else if (thread_index == 7) {
leaf_count[leaf_index] = cuda_split_info->left_count;
} else if (thread_index == 8) {
leaf_value[num_leaves] = isnan(cuda_split_info->right_value) ? 0.0f : cuda_split_info->right_value;
} else if (thread_index == 9) {
leaf_weight[num_leaves] = cuda_split_info->right_sum_hessians;
} else if (thread_index == 10) {
leaf_count[num_leaves] = cuda_split_info->right_count;
} else if (thread_index == 11) {
// update leaf depth
leaf_depth[num_leaves] = leaf_depth[leaf_index] + 1;
leaf_depth[leaf_index]++;
} else if (thread_index == 12) {
decision_type[new_node_index] = 0;
SetDecisionTypeCUDA(&decision_type[new_node_index], true, kCategoricalMask);
SetMissingTypeCUDA(&decision_type[new_node_index], static_cast<int8_t>(missing_type));
} else if (thread_index == 13) {
threshold_in_bin[new_node_index] = num_cat;
} else if (thread_index == 14) {
threshold[new_node_index] = num_cat;
} else if (thread_index == 15) {
if (num_cat == 0) {
cuda_cat_boundaries[num_cat] = 0;
}
cuda_cat_boundaries[num_cat + 1] = cuda_cat_boundaries[num_cat] + cuda_bitset_len;
} else if (thread_index == 16) {
if (num_cat == 0) {
cuda_cat_boundaries_inner[num_cat] = 0;
}
cuda_cat_boundaries_inner[num_cat + 1] = cuda_cat_boundaries_inner[num_cat] + cuda_bitset_inner_len;
}
}
void CUDATree::LaunchSplitCategoricalKernel(const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
size_t cuda_bitset_len,
size_t cuda_bitset_inner_len) {
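// 17 tree fields must be written: launch 3 blocks x 6 threads (18 threads) so
// that thread_index covers 0-16, one field per thread, with one thread idle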
SplitCategoricalKernel<<<3, 6, 0, cuda_stream_>>>(
// split information
leaf_index,
real_feature_index,
missing_type,
cuda_split_info,
// tree structure
num_leaves_,
cuda_leaf_parent_,
cuda_leaf_depth_,
cuda_left_child_,
cuda_right_child_,
cuda_split_feature_inner_,
cuda_split_feature_,
cuda_split_gain_,
cuda_internal_weight_,
cuda_internal_value_,
cuda_internal_count_,
cuda_leaf_weight_,
cuda_leaf_value_,
cuda_leaf_count_,
cuda_decision_type_,
cuda_threshold_in_bin_,
cuda_threshold_,
cuda_bitset_len,
cuda_bitset_inner_len,
num_cat_,
cuda_cat_boundaries_.RawData(),
cuda_cat_boundaries_inner_.RawData());
}
__global__ void ShrinkageKernel(const double rate, double* cuda_leaf_value, const int num_leaves) {
const int leaf_index = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
if (leaf_index < num_leaves) {
cuda_leaf_value[leaf_index] *= rate;
}
}
void CUDATree::LaunchShrinkageKernel(const double rate) {
const int num_threads_per_block = 1024;
const int num_blocks = (num_leaves_ + num_threads_per_block - 1) / num_threads_per_block;
ShrinkageKernel<<<num_blocks, num_threads_per_block>>>(rate, cuda_leaf_value_, num_leaves_);
}
__global__ void AddBiasKernel(const double val, double* cuda_leaf_value, const int num_leaves) {
const int leaf_index = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
if (leaf_index < num_leaves) {
cuda_leaf_value[leaf_index] += val;
}
}
void CUDATree::LaunchAddBiasKernel(const double val) {
const int num_threads_per_block = 1024;
const int num_blocks = (num_leaves_ + num_threads_per_block - 1) / num_threads_per_block;
AddBiasKernel<<<num_blocks, num_threads_per_block>>>(val, cuda_leaf_value_, num_leaves_);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
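// Editor's note (not part of the original commit): LightGBM encodes a child
// reference either as a non-negative internal-node index or as the bitwise NOT
// of a leaf index, which is why the split kernels above store ~leaf_index and
// ~num_leaves. A minimal host-side decode sketch:
#include <cstdio>
int main() {
  const int child = ~3;  // encoded reference to leaf 3
  if (child < 0) {
    std::printf("leaf %d\n", ~child);  // prints: leaf 3
  } else {
    std::printf("internal node %d\n", child);
  }
}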
@@ -340,17 +340,18 @@ void Dataset::Construct(std::vector<std::unique_ptr<BinMapper>>* bin_mappers,
auto features_in_group = NoGroup(used_features);
auto is_sparse = io_config.is_enable_sparse;
if (io_config.device_type == std::string("cuda") || io_config.device_type == std::string("cuda_exp")) {
LGBM_config_::current_device = lgbm_device_cuda;
if (io_config.device_type == std::string("cuda") && is_sparse) {
Log::Warning("Using sparse features with CUDA is currently not supported.");
}
is_sparse = false;
}
std::vector<int8_t> group_is_multi_val(used_features.size(), 0);
if (io_config.enable_bundle && !used_features.empty()) {
bool lgbm_is_gpu_used = io_config.device_type == std::string("gpu") || io_config.device_type == std::string("cuda")
|| io_config.device_type == std::string("cuda_exp");
features_in_group = FastFeatureBundling(
*bin_mappers, sample_non_zero_indices, sample_values, num_per_col,
num_sample_col, static_cast<data_size_t>(total_sample_cnt),
@@ -426,6 +427,8 @@ void Dataset::Construct(std::vector<std::unique_ptr<BinMapper>>* bin_mappers,
++num_numeric_features_;
}
}
device_type_ = io_config.device_type;
gpu_device_id_ = io_config.gpu_device_id;
}
void Dataset::FinishLoad() {
@@ -437,6 +440,14 @@ void Dataset::FinishLoad() {
feature_groups_[i]->FinishLoad();
}
}
#ifdef USE_CUDA_EXP
if (device_type_ == std::string("cuda_exp")) {
CreateCUDAColumnData();
metadata_.CreateCUDAMetadata(gpu_device_id_);
} else {
cuda_column_data_.reset(nullptr);
}
#endif // USE_CUDA_EXP
is_finish_load_ = true;
}
@@ -768,6 +779,8 @@ void Dataset::CreateValid(const Dataset* dataset) {
label_idx_ = dataset->label_idx_;
real_feature_idx_ = dataset->real_feature_idx_;
forced_bin_bounds_ = dataset->forced_bin_bounds_;
device_type_ = dataset->device_type_;
gpu_device_id_ = dataset->gpu_device_id_;
}
void Dataset::ReSize(data_size_t num_data) {
@@ -833,6 +846,19 @@ void Dataset::CopySubrow(const Dataset* fullset,
}
}
}
// update CUDA storage for column data and metadata
device_type_ = fullset->device_type_;
gpu_device_id_ = fullset->gpu_device_id_;
#ifdef USE_CUDA_EXP
if (device_type_ == std::string("cuda_exp")) {
if (cuda_column_data_ == nullptr) {
cuda_column_data_.reset(new CUDAColumnData(fullset->num_data(), gpu_device_id_));
metadata_.CreateCUDAMetadata(gpu_device_id_);
}
cuda_column_data_->CopySubrow(fullset->cuda_column_data(), used_indices, num_used_indices);
}
#endif // USE_CUDA_EXP
}
bool Dataset::SetFloatField(const char* field_name, const float* field_data,
@@ -1470,6 +1496,169 @@ void Dataset::AddFeaturesFrom(Dataset* other) {
raw_data_.push_back(other->raw_data_[i]);
}
}
#ifdef USE_CUDA_EXP
if (device_type_ == std::string("cuda_exp")) {
CreateCUDAColumnData();
} else {
cuda_column_data_ = nullptr;
}
#endif // USE_CUDA_EXP
}
const void* Dataset::GetColWiseData(
const int feature_group_index,
const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
return feature_groups_[feature_group_index]->GetColWiseData(sub_feature_index, bit_type, is_sparse, bin_iterator, num_threads);
}
const void* Dataset::GetColWiseData(
const int feature_group_index,
const int sub_feature_index,
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
return feature_groups_[feature_group_index]->GetColWiseData(sub_feature_index, bit_type, is_sparse, bin_iterator);
}
#ifdef USE_CUDA_EXP
void Dataset::CreateCUDAColumnData() {
cuda_column_data_.reset(new CUDAColumnData(num_data_, gpu_device_id_));
int num_columns = 0;
std::vector<const void*> column_data;
std::vector<BinIterator*> column_bin_iterator;
std::vector<uint8_t> column_bit_type;
int feature_index = 0;
std::vector<int> feature_to_column(num_features_, -1);
std::vector<uint32_t> feature_max_bins(num_features_, 0);
std::vector<uint32_t> feature_min_bins(num_features_, 0);
std::vector<uint32_t> feature_offsets(num_features_, 0);
std::vector<uint32_t> feature_most_freq_bins(num_features_, 0);
std::vector<uint32_t> feature_default_bin(num_features_, 0);
std::vector<uint8_t> feature_missing_is_zero(num_features_, 0);
std::vector<uint8_t> feature_missing_is_na(num_features_, 0);
std::vector<uint8_t> feature_mfb_is_zero(num_features_, 0);
std::vector<uint8_t> feature_mfb_is_na(num_features_, 0);
for (int feature_group_index = 0; feature_group_index < num_groups_; ++feature_group_index) {
if (feature_groups_[feature_group_index]->is_multi_val_) {
for (int sub_feature_index = 0; sub_feature_index < feature_groups_[feature_group_index]->num_feature_; ++sub_feature_index) {
uint8_t bit_type = 0;
bool is_sparse = false;
BinIterator* bin_iterator = nullptr;
const void* one_column_data = GetColWiseData(feature_group_index,
sub_feature_index,
&bit_type,
&is_sparse,
&bin_iterator);
column_data.emplace_back(one_column_data);
column_bin_iterator.emplace_back(bin_iterator);
column_bit_type.emplace_back(bit_type);
feature_to_column[feature_index] = num_columns;
++num_columns;
const BinMapper* feature_bin_mapper = FeatureBinMapper(feature_index);
feature_max_bins[feature_index] = feature_max_bin(feature_index);
feature_min_bins[feature_index] = feature_min_bin(feature_index);
const uint32_t most_freq_bin = feature_bin_mapper->GetMostFreqBin();
feature_offsets[feature_index] = static_cast<uint32_t>(most_freq_bin == 0);
feature_most_freq_bins[feature_index] = most_freq_bin;
feature_default_bin[feature_index] = feature_bin_mapper->GetDefaultBin();
if (feature_bin_mapper->missing_type() == MissingType::Zero) {
feature_missing_is_zero[feature_index] = 1;
feature_missing_is_na[feature_index] = 0;
if (feature_default_bin[feature_index] == feature_most_freq_bins[feature_index]) {
feature_mfb_is_zero[feature_index] = 1;
} else {
feature_mfb_is_zero[feature_index] = 0;
}
feature_mfb_is_na[feature_index] = 0;
} else if (feature_bin_mapper->missing_type() == MissingType::NaN) {
feature_missing_is_zero[feature_index] = 0;
feature_missing_is_na[feature_index] = 1;
feature_mfb_is_zero[feature_index] = 0;
if (feature_most_freq_bins[feature_index] + feature_min_bins[feature_index] == feature_max_bins[feature_index] &&
feature_most_freq_bins[feature_index] > 0) {
feature_mfb_is_na[feature_index] = 1;
} else {
feature_mfb_is_na[feature_index] = 0;
}
} else {
feature_missing_is_zero[feature_index] = 0;
feature_missing_is_na[feature_index] = 0;
feature_mfb_is_zero[feature_index] = 0;
feature_mfb_is_na[feature_index] = 0;
}
++feature_index;
}
} else {
uint8_t bit_type = 0;
bool is_sparse = false;
BinIterator* bin_iterator = nullptr;
const void* one_column_data = GetColWiseData(feature_group_index,
-1,
&bit_type,
&is_sparse,
&bin_iterator);
column_data.emplace_back(one_column_data);
column_bin_iterator.emplace_back(bin_iterator);
column_bit_type.emplace_back(bit_type);
for (int sub_feature_index = 0; sub_feature_index < feature_groups_[feature_group_index]->num_feature_; ++sub_feature_index) {
feature_to_column[feature_index] = num_columns;
const BinMapper* feature_bin_mapper = FeatureBinMapper(feature_index);
feature_max_bins[feature_index] = feature_max_bin(feature_index);
feature_min_bins[feature_index] = feature_min_bin(feature_index);
const uint32_t most_freq_bin = feature_bin_mapper->GetMostFreqBin();
feature_offsets[feature_index] = static_cast<uint32_t>(most_freq_bin == 0);
feature_most_freq_bins[feature_index] = most_freq_bin;
feature_default_bin[feature_index] = feature_bin_mapper->GetDefaultBin();
if (feature_bin_mapper->missing_type() == MissingType::Zero) {
feature_missing_is_zero[feature_index] = 1;
feature_missing_is_na[feature_index] = 0;
if (feature_default_bin[feature_index] == feature_most_freq_bins[feature_index]) {
feature_mfb_is_zero[feature_index] = 1;
} else {
feature_mfb_is_zero[feature_index] = 0;
}
feature_mfb_is_na[feature_index] = 0;
} else if (feature_bin_mapper->missing_type() == MissingType::NaN) {
feature_missing_is_zero[feature_index] = 0;
feature_missing_is_na[feature_index] = 1;
feature_mfb_is_zero[feature_index] = 0;
if (feature_most_freq_bins[feature_index] + feature_min_bins[feature_index] == feature_max_bins[feature_index] &&
feature_most_freq_bins[feature_index] > 0) {
feature_mfb_is_na[feature_index] = 1;
} else {
feature_mfb_is_na[feature_index] = 0;
}
} else {
feature_missing_is_zero[feature_index] = 0;
feature_missing_is_na[feature_index] = 0;
feature_mfb_is_zero[feature_index] = 0;
feature_mfb_is_na[feature_index] = 0;
}
++feature_index;
}
++num_columns;
}
}
cuda_column_data_->Init(num_columns,
column_data,
column_bin_iterator,
column_bit_type,
feature_max_bins,
feature_min_bins,
feature_offsets,
feature_most_freq_bins,
feature_default_bin,
feature_missing_is_zero,
feature_missing_is_na,
feature_mfb_is_zero,
feature_mfb_is_na,
feature_to_column);
}
#endif // USE_CUDA_EXP
} // namespace LightGBM