Unverified commit 6b56a90c authored by shiyu1994, committed by GitHub

[CUDA] New CUDA version Part 1 (#4630)



* new cuda framework

* add histogram construction kernel

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to GPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add destructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: Yu Shi <shiyu1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
parent b857ee10
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifndef LIGHTGBM_TREELEARNER_CUDA_CUDA_HISTOGRAM_CONSTRUCTOR_HPP_
#define LIGHTGBM_TREELEARNER_CUDA_CUDA_HISTOGRAM_CONSTRUCTOR_HPP_
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_row_data.hpp>
#include <LightGBM/feature_group.h>
#include <LightGBM/tree.h>
#include <memory>
#include <vector>
#include "cuda_leaf_splits.hpp"
#define NUM_DATA_PER_THREAD (400)
#define NUM_THRADS_PER_BLOCK (504)
#define NUM_FEATURE_PER_THREAD_GROUP (28)
#define SUBTRACT_BLOCK_SIZE (1024)
#define FIX_HISTOGRAM_SHARED_MEM_SIZE (1024)
#define FIX_HISTOGRAM_BLOCK_SIZE (512)
#define USED_HISTOGRAM_BUFFER_NUM (8)
namespace LightGBM {
class CUDAHistogramConstructor {
public:
CUDAHistogramConstructor(
const Dataset* train_data,
const int num_leaves,
const int num_threads,
const std::vector<uint32_t>& feature_hist_offsets,
const int min_data_in_leaf,
const double min_sum_hessian_in_leaf,
const int gpu_device_id,
const bool gpu_use_dp);
~CUDAHistogramConstructor();
void Init(const Dataset* train_data, TrainingShareStates* share_state);
void ConstructHistogramForLeaf(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const CUDALeafSplitsStruct* cuda_larger_leaf_splits,
const data_size_t num_data_in_smaller_leaf,
const data_size_t num_data_in_larger_leaf,
const double sum_hessians_in_smaller_leaf,
const double sum_hessians_in_larger_leaf);
void ResetTrainingData(const Dataset* train_data, TrainingShareStates* share_states);
void ResetConfig(const Config* config);
void BeforeTrain(const score_t* gradients, const score_t* hessians);
const hist_t* cuda_hist() const { return cuda_hist_; }
hist_t* cuda_hist_pointer() { return cuda_hist_; }
private:
void InitFeatureMetaInfo(const Dataset* train_data, const std::vector<uint32_t>& feature_hist_offsets);
void CalcConstructHistogramKernelDim(
int* grid_dim_x,
int* grid_dim_y,
int* block_dim_x,
int* block_dim_y,
const data_size_t num_data_in_smaller_leaf);
template <typename HIST_TYPE, size_t SHARED_HIST_SIZE>
void LaunchConstructHistogramKernelInner(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
template <typename HIST_TYPE, size_t SHARED_HIST_SIZE, typename BIN_TYPE>
void LaunchConstructHistogramKernelInner0(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
template <typename HIST_TYPE, size_t SHARED_HIST_SIZE, typename BIN_TYPE, typename PTR_TYPE>
void LaunchConstructHistogramKernelInner1(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
template <typename HIST_TYPE, size_t SHARED_HIST_SIZE, typename BIN_TYPE, typename PTR_TYPE, bool USE_GLOBAL_MEM_BUFFER>
void LaunchConstructHistogramKernelInner2(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
void LaunchConstructHistogramKernel(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
void LaunchSubtractHistogramKernel(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const CUDALeafSplitsStruct* cuda_larger_leaf_splits);
// Host memory
/*! \brief size of training data */
data_size_t num_data_;
/*! \brief number of features in training data */
int num_features_;
/*! \brief maximum number of leaves */
int num_leaves_;
/*! \brief number of threads */
int num_threads_;
/*! \brief total number of bins in histogram */
int num_total_bin_;
/*! \brief number of bins per feature */
std::vector<uint32_t> feature_num_bins_;
/*! \brief offsets in histogram of all features */
std::vector<uint32_t> feature_hist_offsets_;
/*! \brief most frequent bins in each feature */
std::vector<uint32_t> feature_most_freq_bins_;
/*! \brief minimum number of data allowed per leaf */
int min_data_in_leaf_;
/*! \brief minimum sum value of hessians allowed per leaf */
double min_sum_hessian_in_leaf_;
/*! \brief cuda stream for histogram construction */
cudaStream_t cuda_stream_;
/*! \brief indices of features whose histograms need to be fixed */
std::vector<int> need_fix_histogram_features_;
/*! \brief aligned number of bins of the features whose histograms need to be fixed */
std::vector<uint32_t> need_fix_histogram_features_num_bin_aligend_;
/*! \brief minimum number of blocks allowed in the y dimension */
const int min_grid_dim_y_ = 160;
// CUDA memory, held by this object
/*! \brief CUDA row wise data */
std::unique_ptr<CUDARowData> cuda_row_data_;
/*! \brief number of bins per feature */
uint32_t* cuda_feature_num_bins_;
/*! \brief offsets in histogram of all features */
uint32_t* cuda_feature_hist_offsets_;
/*! \brief most frequent bins in each feature */
uint32_t* cuda_feature_most_freq_bins_;
/*! \brief CUDA histograms */
hist_t* cuda_hist_;
/*! \brief CUDA histograms buffer for each block */
float* cuda_hist_buffer_;
/*! \brief indices of features whose histograms need to be fixed */
int* cuda_need_fix_histogram_features_;
/*! \brief aligned number of bins of the features whose histograms need to be fixed */
uint32_t* cuda_need_fix_histogram_features_num_bin_aligned_;
// CUDA memory, held by other object
/*! \brief gradients on CUDA */
const score_t* cuda_gradients_;
/*! \brief hessians on CUDA */
const score_t* cuda_hessians_;
/*! \brief GPU device index */
const int gpu_device_id_;
/*! \brief use double precision histogram per block */
const bool gpu_use_dp_;
};
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_TREELEARNER_CUDA_CUDA_HISTOGRAM_CONSTRUCTOR_HPP_
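The header above (its include guard points to src/treelearner/cuda/cuda_histogram_constructor.hpp) declares a ladder of LaunchConstructHistogramKernelInner templates that fix the histogram entry type (HIST_TYPE) and the shared-memory footprint (SHARED_HIST_SIZE) at compile time. The kernel bodies live in a .cu file outside this excerpt; the following is only a minimal sketch of the per-block shared-memory histogram pattern those parameters imply, with hypothetical names and a single gradient channel instead of the real gradient/hessian pairs.

// Minimal sketch, not the kernel from this PR: each block accumulates a
// private histogram in shared memory, then merges it into the global buffer.
// HIST_TYPE would be float or double depending on gpu_use_dp; atomicAdd on
// double requires compute capability >= 6.0.
#include <cstdint>

template <typename HIST_TYPE, size_t SHARED_HIST_SIZE>
__global__ void SharedHistSketch(const uint8_t* bins, const float* gradients,
                                 const int num_data, const int num_bins,
                                 HIST_TYPE* global_hist) {
  __shared__ HIST_TYPE shared_hist[SHARED_HIST_SIZE];
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
    shared_hist[i] = 0;
  }
  __syncthreads();
  // grid-stride loop, so consecutive threads touch consecutive rows
  // (the "interleaving global memory access" named in the commit message)
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_data;
       i += gridDim.x * blockDim.x) {
    atomicAdd(shared_hist + bins[i], static_cast<HIST_TYPE>(gradients[i]));
  }
  __syncthreads();
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
    atomicAdd(global_hist + i, shared_hist[i]);  // merge block-local histogram
  }
}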
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#include "cuda_leaf_splits.hpp"
namespace LightGBM {
CUDALeafSplits::CUDALeafSplits(const data_size_t num_data):
num_data_(num_data) {
cuda_struct_ = nullptr;
cuda_sum_of_gradients_buffer_ = nullptr;
cuda_sum_of_hessians_buffer_ = nullptr;
}
CUDALeafSplits::~CUDALeafSplits() {
DeallocateCUDAMemory<CUDALeafSplitsStruct>(&cuda_struct_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, __FILE__, __LINE__);
}
void CUDALeafSplits::Init() {
num_blocks_init_from_gradients_ = (num_data_ + NUM_THRADS_PER_BLOCK_LEAF_SPLITS - 1) / NUM_THRADS_PER_BLOCK_LEAF_SPLITS;
// allocate more memory for sum reduction in CUDA
// only the first element records the final sum
AllocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
AllocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
AllocateCUDAMemory<CUDALeafSplitsStruct>(&cuda_struct_, 1, __FILE__, __LINE__);
}
void CUDALeafSplits::InitValues() {
LaunchInitValuesEmptyKernel();
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDALeafSplits::InitValues(
const double lambda_l1, const double lambda_l2,
const score_t* cuda_gradients, const score_t* cuda_hessians,
const data_size_t* cuda_bagging_data_indices, const data_size_t* cuda_data_indices_in_leaf,
const data_size_t num_used_indices, hist_t* cuda_hist_in_leaf, double* root_sum_hessians) {
cuda_gradients_ = cuda_gradients;
cuda_hessians_ = cuda_hessians;
SetCUDAMemory<double>(cuda_sum_of_gradients_buffer_, 0, num_blocks_init_from_gradients_, __FILE__, __LINE__);
SetCUDAMemory<double>(cuda_sum_of_hessians_buffer_, 0, num_blocks_init_from_gradients_, __FILE__, __LINE__);
LaunchInitValuesKernel(lambda_l1, lambda_l2, cuda_bagging_data_indices, cuda_data_indices_in_leaf, num_used_indices, cuda_hist_in_leaf);
CopyFromCUDADeviceToHost<double>(root_sum_hessians, cuda_sum_of_hessians_buffer_, 1, __FILE__, __LINE__);
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDALeafSplits::Resize(const data_size_t num_data) {
if (num_data > num_data_) {
DeallocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, __FILE__, __LINE__);
num_blocks_init_from_gradients_ = (num_data + NUM_THRADS_PER_BLOCK_LEAF_SPLITS - 1) / NUM_THRADS_PER_BLOCK_LEAF_SPLITS;
AllocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
AllocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
} else {
num_blocks_init_from_gradients_ = (num_data + NUM_THRADS_PER_BLOCK_LEAF_SPLITS - 1) / NUM_THRADS_PER_BLOCK_LEAF_SPLITS;
}
num_data_ = num_data;
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
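Init() and Resize() above size the per-block reduction buffers with the usual ceil division. A worked example of that arithmetic as a tiny standalone program (the numbers are hypothetical):

#include <cstdio>

// (num_data + threads_per_block - 1) / threads_per_block, as in Init()/Resize()
static int CeilDiv(int num_data, int threads_per_block) {
  return (num_data + threads_per_block - 1) / threads_per_block;
}

int main() {
  // e.g. 100000 rows with 1024-thread blocks -> 98 blocks, so Init() allocates
  // 98 doubles per buffer; CUDAInitValuesKernel2 later folds those partial
  // sums into element 0, the only element the host copies back.
  std::printf("%d blocks\n", CeilDiv(100000, 1024));
  return 0;
}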
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#include "cuda_leaf_splits.hpp"
#include <LightGBM/cuda/cuda_algorithms.hpp>
namespace LightGBM {
template <bool USE_INDICES>
__global__ void CUDAInitValuesKernel1(const score_t* cuda_gradients, const score_t* cuda_hessians,
const data_size_t num_data, const data_size_t* cuda_bagging_data_indices,
double* cuda_sum_of_gradients, double* cuda_sum_of_hessians) {
__shared__ double shared_mem_buffer[32];
const data_size_t data_index = static_cast<data_size_t>(threadIdx.x + blockIdx.x * blockDim.x);
double gradient = 0.0f;
double hessian = 0.0f;
if (data_index < num_data) {
gradient = USE_INDICES ? cuda_gradients[cuda_bagging_data_indices[data_index]] : cuda_gradients[data_index];
hessian = USE_INDICES ? cuda_hessians[cuda_bagging_data_indices[data_index]] : cuda_hessians[data_index];
}
const double block_sum_gradient = ShuffleReduceSum<double>(gradient, shared_mem_buffer, blockDim.x);
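// shared_mem_buffer is reused by the hessian reduction below; the
// __syncthreads() keeps the two reductions from overlapping in it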
__syncthreads();
const double block_sum_hessian = ShuffleReduceSum<double>(hessian, shared_mem_buffer, blockDim.x);
if (threadIdx.x == 0) {
cuda_sum_of_gradients[blockIdx.x] += block_sum_gradient;
cuda_sum_of_hessians[blockIdx.x] += block_sum_hessian;
}
}
__global__ void CUDAInitValuesKernel2(
const double lambda_l1,
const double lambda_l2,
const int num_blocks_to_reduce,
double* cuda_sum_of_gradients,
double* cuda_sum_of_hessians,
const data_size_t num_data,
const data_size_t* cuda_data_indices_in_leaf,
hist_t* cuda_hist_in_leaf,
CUDALeafSplitsStruct* cuda_struct) {
__shared__ double shared_mem_buffer[32];
double thread_sum_of_gradients = 0.0f;
double thread_sum_of_hessians = 0.0f;
for (int block_index = static_cast<int>(threadIdx.x); block_index < num_blocks_to_reduce; block_index += static_cast<int>(blockDim.x)) {
thread_sum_of_gradients += cuda_sum_of_gradients[block_index];
thread_sum_of_hessians += cuda_sum_of_hessians[block_index];
}
const double sum_of_gradients = ShuffleReduceSum<double>(thread_sum_of_gradients, shared_mem_buffer, blockDim.x);
__syncthreads();
const double sum_of_hessians = ShuffleReduceSum<double>(thread_sum_of_hessians, shared_mem_buffer, blockDim.x);
if (threadIdx.x == 0) {
cuda_sum_of_hessians[0] = sum_of_hessians;
cuda_struct->leaf_index = 0;
cuda_struct->sum_of_gradients = sum_of_gradients;
cuda_struct->sum_of_hessians = sum_of_hessians;
cuda_struct->num_data_in_leaf = num_data;
const bool use_l1 = lambda_l1 > 0.0f;
if (!use_l1) {
// no smoothing on root node
cuda_struct->gain = CUDALeafSplits::GetLeafGain<false, false>(sum_of_gradients, sum_of_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
} else {
// no smoothing on root node
cuda_struct->gain = CUDALeafSplits::GetLeafGain<true, false>(sum_of_gradients, sum_of_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
}
if (!use_l1) {
// no smoothing on root node
cuda_struct->leaf_value =
CUDALeafSplits::CalculateSplittedLeafOutput<false, false>(sum_of_gradients, sum_of_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
} else {
// no smoothing on root node
cuda_struct->leaf_value =
CUDALeafSplits::CalculateSplittedLeafOutput<true, false>(sum_of_gradients, sum_of_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
}
cuda_struct->data_indices_in_leaf = cuda_data_indices_in_leaf;
cuda_struct->hist_in_leaf = cuda_hist_in_leaf;
}
}
__global__ void InitValuesEmptyKernel(CUDALeafSplitsStruct* cuda_struct) {
cuda_struct->leaf_index = -1;
cuda_struct->sum_of_gradients = 0.0f;
cuda_struct->sum_of_hessians = 0.0f;
cuda_struct->num_data_in_leaf = 0;
cuda_struct->gain = 0.0f;
cuda_struct->leaf_value = 0.0f;
cuda_struct->data_indices_in_leaf = nullptr;
cuda_struct->hist_in_leaf = nullptr;
}
void CUDALeafSplits::LaunchInitValuesEmptyKernel() {
InitValuesEmptyKernel<<<1, 1>>>(cuda_struct_);
}
void CUDALeafSplits::LaunchInitValuesKernel(
const double lambda_l1, const double lambda_l2,
const data_size_t* cuda_bagging_data_indices,
const data_size_t* cuda_data_indices_in_leaf,
const data_size_t num_used_indices,
hist_t* cuda_hist_in_leaf) {
if (cuda_bagging_data_indices == nullptr) {
CUDAInitValuesKernel1<false><<<num_blocks_init_from_gradients_, NUM_THRADS_PER_BLOCK_LEAF_SPLITS>>>(
cuda_gradients_, cuda_hessians_, num_used_indices, nullptr, cuda_sum_of_gradients_buffer_,
cuda_sum_of_hessians_buffer_);
} else {
CUDAInitValuesKernel1<true><<<num_blocks_init_from_gradients_, NUM_THRADS_PER_BLOCK_LEAF_SPLITS>>>(
cuda_gradients_, cuda_hessians_, num_used_indices, cuda_bagging_data_indices, cuda_sum_of_gradients_buffer_,
cuda_sum_of_hessians_buffer_);
}
SynchronizeCUDADevice(__FILE__, __LINE__);
CUDAInitValuesKernel2<<<1, NUM_THRADS_PER_BLOCK_LEAF_SPLITS>>>(
lambda_l1, lambda_l2,
num_blocks_init_from_gradients_,
cuda_sum_of_gradients_buffer_,
cuda_sum_of_hessians_buffer_,
num_used_indices,
cuda_data_indices_in_leaf,
cuda_hist_in_leaf,
cuda_struct_);
SynchronizeCUDADevice(__FILE__, __LINE__);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
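ShuffleReduceSum comes from <LightGBM/cuda/cuda_algorithms.hpp>, which is not part of this excerpt. The 32-element shared buffers at the call sites match the canonical two-stage warp-shuffle reduction, sketched below; this is the standard pattern, not necessarily the exact implementation shipped in the PR.

// Sketch of a block-wide sum via warp shuffles: reduce within each warp,
// stage one partial per warp in shared memory, then reduce the partials in
// warp 0. The result is valid in thread 0, the only thread that uses it above.
// Assumes blockDim.x is a multiple of warpSize (true for the 1024-thread
// launches in this file).
__device__ double ShuffleReduceSumSketch(double value, double* shared_buffer,
                                         const unsigned int block_size) {
  const unsigned int lane = threadIdx.x % warpSize;
  const unsigned int warp_id = threadIdx.x / warpSize;
  for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
    value += __shfl_down_sync(0xffffffffu, value, offset);
  }
  if (lane == 0) {
    shared_buffer[warp_id] = value;  // at most 32 warps per block
  }
  __syncthreads();
  const unsigned int num_warps = (block_size + warpSize - 1) / warpSize;
  value = (threadIdx.x < num_warps) ? shared_buffer[lane] : 0.0;
  if (warp_id == 0) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
      value += __shfl_down_sync(0xffffffffu, value, offset);
    }
  }
  return value;
}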
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifndef LIGHTGBM_TREELEARNER_CUDA_CUDA_LEAF_SPLITS_HPP_
#define LIGHTGBM_TREELEARNER_CUDA_CUDA_LEAF_SPLITS_HPP_
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/bin.h>
#include <LightGBM/utils/log.h>
#include <LightGBM/meta.h>
#define NUM_THRADS_PER_BLOCK_LEAF_SPLITS (1024)
#define NUM_DATA_THREAD_ADD_LEAF_SPLITS (6)
namespace LightGBM {
struct CUDALeafSplitsStruct {
public:
int leaf_index;
double sum_of_gradients;
double sum_of_hessians;
data_size_t num_data_in_leaf;
double gain;
double leaf_value;
const data_size_t* data_indices_in_leaf;
hist_t* hist_in_leaf;
};
class CUDALeafSplits {
public:
explicit CUDALeafSplits(const data_size_t num_data);
~CUDALeafSplits();
void Init();
void InitValues(
const double lambda_l1, const double lambda_l2,
const score_t* cuda_gradients, const score_t* cuda_hessians,
const data_size_t* cuda_bagging_data_indices,
const data_size_t* cuda_data_indices_in_leaf, const data_size_t num_used_indices,
hist_t* cuda_hist_in_leaf, double* root_sum_hessians);
void InitValues();
const CUDALeafSplitsStruct* GetCUDAStruct() const { return cuda_struct_; }
CUDALeafSplitsStruct* GetCUDAStructRef() { return cuda_struct_; }
void Resize(const data_size_t num_data);
__device__ static double ThresholdL1(double s, double l1) {
const double reg_s = fmax(0.0, fabs(s) - l1);
if (s >= 0.0f) {
return reg_s;
} else {
return -reg_s;
}
}
template <bool USE_L1, bool USE_SMOOTHING>
__device__ static double CalculateSplittedLeafOutput(double sum_gradients,
double sum_hessians, double l1, double l2,
double path_smooth, data_size_t num_data,
double parent_output) {
double ret;
if (USE_L1) {
ret = -ThresholdL1(sum_gradients, l1) / (sum_hessians + l2);
} else {
ret = -sum_gradients / (sum_hessians + l2);
}
if (USE_SMOOTHING) {
ret = ret * (num_data / path_smooth) / (num_data / path_smooth + 1) \
+ parent_output / (num_data / path_smooth + 1);
}
return ret;
}
template <bool USE_L1>
__device__ static double GetLeafGainGivenOutput(double sum_gradients,
double sum_hessians, double l1,
double l2, double output) {
if (USE_L1) {
const double sg_l1 = ThresholdL1(sum_gradients, l1);
return -(2.0 * sg_l1 * output + (sum_hessians + l2) * output * output);
} else {
return -(2.0 * sum_gradients * output +
(sum_hessians + l2) * output * output);
}
}
template <bool USE_L1, bool USE_SMOOTHING>
__device__ static double GetLeafGain(double sum_gradients, double sum_hessians,
double l1, double l2,
double path_smooth, data_size_t num_data,
double parent_output) {
if (!USE_SMOOTHING) {
if (USE_L1) {
const double sg_l1 = ThresholdL1(sum_gradients, l1);
return (sg_l1 * sg_l1) / (sum_hessians + l2);
} else {
return (sum_gradients * sum_gradients) / (sum_hessians + l2);
}
} else {
const double output = CalculateSplittedLeafOutput<USE_L1, USE_SMOOTHING>(
sum_gradients, sum_hessians, l1, l2, path_smooth, num_data, parent_output);
return GetLeafGainGivenOutput<USE_L1>(sum_gradients, sum_hessians, l1, l2, output);
}
}
template <bool USE_L1, bool USE_SMOOTHING>
__device__ static double GetSplitGains(double sum_left_gradients,
double sum_left_hessians,
double sum_right_gradients,
double sum_right_hessians,
double l1, double l2,
double path_smooth,
data_size_t left_count,
data_size_t right_count,
double parent_output) {
return GetLeafGain<USE_L1, USE_SMOOTHING>(sum_left_gradients,
sum_left_hessians,
l1, l2, path_smooth, left_count, parent_output) +
GetLeafGain<USE_L1, USE_SMOOTHING>(sum_right_gradients,
sum_right_hessians,
l1, l2, path_smooth, right_count, parent_output);
}
private:
void LaunchInitValuesEmptyKernel();
void LaunchInitValuesKernel(
const double lambda_l1, const double lambda_l2,
const data_size_t* cuda_bagging_data_indices,
const data_size_t* cuda_data_indices_in_leaf,
const data_size_t num_used_indices,
hist_t* cuda_hist_in_leaf);
// Host memory
data_size_t num_data_;
int num_blocks_init_from_gradients_;
// CUDA memory, held by this object
CUDALeafSplitsStruct* cuda_struct_;
double* cuda_sum_of_gradients_buffer_;
double* cuda_sum_of_hessians_buffer_;
// CUDA memory, held by other object
const score_t* cuda_gradients_;
const score_t* cuda_hessians_;
};
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_TREELEARNER_CUDA_CUDA_LEAF_SPLITS_HPP_
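For reference, the device functions above implement the standard second-order leaf formulas. With leaf gradient sum $G$ and hessian sum $H$ and neither L1 nor path smoothing, CalculateSplittedLeafOutput and GetLeafGain reduce to

$$w^{*} = -\frac{G}{H + \lambda_2}, \qquad \mathrm{gain} = \frac{G^{2}}{H + \lambda_2}.$$

With L1 enabled, $G$ is first soft-thresholded by ThresholdL1, $T_{\lambda_1}(G) = \operatorname{sign}(G)\,\max(0, |G| - \lambda_1)$. With path smoothing $\alpha$ (path_smooth), $n$ data points in the leaf, and parent output $w_p$, the output is shrunk toward the parent,

$$w = w^{*}\cdot\frac{n/\alpha}{n/\alpha + 1} + w_p\cdot\frac{1}{n/\alpha + 1},$$

and the gain is then evaluated at that $w$ through GetLeafGainGivenOutput, i.e. $-\left(2\,T_{\lambda_1}(G)\,w + (H + \lambda_2)\,w^{2}\right)$.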
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#include "cuda_single_gpu_tree_learner.hpp"
#include <LightGBM/cuda/cuda_tree.hpp>
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/feature_group.h>
#include <LightGBM/network.h>
#include <LightGBM/objective_function.h>
#include <algorithm>
#include <memory>
namespace LightGBM {
CUDASingleGPUTreeLearner::CUDASingleGPUTreeLearner(const Config* config): SerialTreeLearner(config) {
cuda_gradients_ = nullptr;
cuda_hessians_ = nullptr;
}
CUDASingleGPUTreeLearner::~CUDASingleGPUTreeLearner() {
DeallocateCUDAMemory<score_t>(&cuda_gradients_, __FILE__, __LINE__);
DeallocateCUDAMemory<score_t>(&cuda_hessians_, __FILE__, __LINE__);
}
void CUDASingleGPUTreeLearner::Init(const Dataset* train_data, bool is_constant_hessian) {
SerialTreeLearner::Init(train_data, is_constant_hessian);
num_threads_ = OMP_NUM_THREADS();
// use the first gpu by default
gpu_device_id_ = config_->gpu_device_id >= 0 ? config_->gpu_device_id : 0;
SetCUDADevice(gpu_device_id_, __FILE__, __LINE__);
cuda_smaller_leaf_splits_.reset(new CUDALeafSplits(num_data_));
cuda_smaller_leaf_splits_->Init();
cuda_larger_leaf_splits_.reset(new CUDALeafSplits(num_data_));
cuda_larger_leaf_splits_->Init();
cuda_histogram_constructor_.reset(new CUDAHistogramConstructor(train_data_, config_->num_leaves, num_threads_,
share_state_->feature_hist_offsets(),
config_->min_data_in_leaf, config_->min_sum_hessian_in_leaf, gpu_device_id_, config_->gpu_use_dp));
cuda_histogram_constructor_->Init(train_data_, share_state_.get());
const auto& feature_hist_offsets = share_state_->feature_hist_offsets();
const int num_total_bin = feature_hist_offsets.empty() ? 0 : static_cast<int>(feature_hist_offsets.back());
cuda_data_partition_.reset(new CUDADataPartition(
train_data_, num_total_bin, config_->num_leaves, num_threads_,
cuda_histogram_constructor_->cuda_hist_pointer()));
cuda_data_partition_->Init();
cuda_best_split_finder_.reset(new CUDABestSplitFinder(cuda_histogram_constructor_->cuda_hist(),
train_data_, this->share_state_->feature_hist_offsets(), config_));
cuda_best_split_finder_->Init();
leaf_best_split_feature_.resize(config_->num_leaves, -1);
leaf_best_split_threshold_.resize(config_->num_leaves, 0);
leaf_best_split_default_left_.resize(config_->num_leaves, 0);
leaf_num_data_.resize(config_->num_leaves, 0);
leaf_data_start_.resize(config_->num_leaves, 0);
leaf_sum_hessians_.resize(config_->num_leaves, 0.0f);
AllocateCUDAMemory<score_t>(&cuda_gradients_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
AllocateCUDAMemory<score_t>(&cuda_hessians_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
AllocateBitset();
cuda_leaf_gradient_stat_buffer_ = nullptr;
cuda_leaf_hessian_stat_buffer_ = nullptr;
leaf_stat_buffer_size_ = 0;
num_cat_threshold_ = 0;
}
void CUDASingleGPUTreeLearner::BeforeTrain() {
const data_size_t root_num_data = cuda_data_partition_->root_num_data();
CopyFromHostToCUDADevice<score_t>(cuda_gradients_, gradients_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
CopyFromHostToCUDADevice<score_t>(cuda_hessians_, hessians_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
const data_size_t* leaf_splits_init_indices =
cuda_data_partition_->use_bagging() ? cuda_data_partition_->cuda_data_indices() : nullptr;
cuda_data_partition_->BeforeTrain();
cuda_smaller_leaf_splits_->InitValues(
config_->lambda_l1,
config_->lambda_l2,
cuda_gradients_,
cuda_hessians_,
leaf_splits_init_indices,
cuda_data_partition_->cuda_data_indices(),
root_num_data,
cuda_histogram_constructor_->cuda_hist_pointer(),
&leaf_sum_hessians_[0]);
leaf_num_data_[0] = root_num_data;
cuda_larger_leaf_splits_->InitValues();
cuda_histogram_constructor_->BeforeTrain(cuda_gradients_, cuda_hessians_);
col_sampler_.ResetByTree();
cuda_best_split_finder_->BeforeTrain(col_sampler_.is_feature_used_bytree());
leaf_data_start_[0] = 0;
smaller_leaf_index_ = 0;
larger_leaf_index_ = -1;
}
void CUDASingleGPUTreeLearner::AddPredictionToScore(const Tree* tree, double* out_score) const {
cuda_data_partition_->UpdateTrainScore(tree, out_score);
}
Tree* CUDASingleGPUTreeLearner::Train(const score_t* gradients,
const score_t* hessians, bool /*is_first_tree*/) {
gradients_ = gradients;
hessians_ = hessians;
global_timer.Start("CUDASingleGPUTreeLearner::BeforeTrain");
BeforeTrain();
global_timer.Stop("CUDASingleGPUTreeLearner::BeforeTrain");
const bool track_branch_features = !(config_->interaction_constraints_vector.empty());
std::unique_ptr<CUDATree> tree(new CUDATree(config_->num_leaves, track_branch_features,
config_->linear_tree, config_->gpu_device_id, has_categorical_feature_));
for (int i = 0; i < config_->num_leaves - 1; ++i) {
global_timer.Start("CUDASingleGPUTreeLearner::ConstructHistogramForLeaf");
const data_size_t num_data_in_smaller_leaf = leaf_num_data_[smaller_leaf_index_];
const data_size_t num_data_in_larger_leaf = larger_leaf_index_ < 0 ? 0 : leaf_num_data_[larger_leaf_index_];
const double sum_hessians_in_smaller_leaf = leaf_sum_hessians_[smaller_leaf_index_];
const double sum_hessians_in_larger_leaf = larger_leaf_index_ < 0 ? 0 : leaf_sum_hessians_[larger_leaf_index_];
cuda_histogram_constructor_->ConstructHistogramForLeaf(
cuda_smaller_leaf_splits_->GetCUDAStruct(),
cuda_larger_leaf_splits_->GetCUDAStruct(),
num_data_in_smaller_leaf,
num_data_in_larger_leaf,
sum_hessians_in_smaller_leaf,
sum_hessians_in_larger_leaf);
global_timer.Stop("CUDASingleGPUTreeLearner::ConstructHistogramForLeaf");
global_timer.Start("CUDASingleGPUTreeLearner::FindBestSplitsForLeaf");
cuda_best_split_finder_->FindBestSplitsForLeaf(
cuda_smaller_leaf_splits_->GetCUDAStruct(),
cuda_larger_leaf_splits_->GetCUDAStruct(),
smaller_leaf_index_, larger_leaf_index_,
num_data_in_smaller_leaf, num_data_in_larger_leaf,
sum_hessians_in_smaller_leaf, sum_hessians_in_larger_leaf);
global_timer.Stop("CUDASingleGPUTreeLearner::FindBestSplitsForLeaf");
global_timer.Start("CUDASingleGPUTreeLearner::FindBestFromAllSplits");
const CUDASplitInfo* best_split_info = nullptr;
if (larger_leaf_index_ >= 0) {
best_split_info = cuda_best_split_finder_->FindBestFromAllSplits(
tree->num_leaves(),
smaller_leaf_index_,
larger_leaf_index_,
&leaf_best_split_feature_[smaller_leaf_index_],
&leaf_best_split_threshold_[smaller_leaf_index_],
&leaf_best_split_default_left_[smaller_leaf_index_],
&leaf_best_split_feature_[larger_leaf_index_],
&leaf_best_split_threshold_[larger_leaf_index_],
&leaf_best_split_default_left_[larger_leaf_index_],
&best_leaf_index_,
&num_cat_threshold_);
} else {
best_split_info = cuda_best_split_finder_->FindBestFromAllSplits(
tree->num_leaves(),
smaller_leaf_index_,
larger_leaf_index_,
&leaf_best_split_feature_[smaller_leaf_index_],
&leaf_best_split_threshold_[smaller_leaf_index_],
&leaf_best_split_default_left_[smaller_leaf_index_],
nullptr,
nullptr,
nullptr,
&best_leaf_index_,
&num_cat_threshold_);
}
global_timer.Stop("CUDASingleGPUTreeLearner::FindBestFromAllSplits");
if (best_leaf_index_ == -1) {
Log::Warning("No further splits with positive gain, training stopped with %d leaves.", (i + 1));
break;
}
global_timer.Start("CUDASingleGPUTreeLearner::Split");
if (num_cat_threshold_ > 0) {
ConstructBitsetForCategoricalSplit(best_split_info);
}
int right_leaf_index = 0;
if (train_data_->FeatureBinMapper(leaf_best_split_feature_[best_leaf_index_])->bin_type() == BinType::CategoricalBin) {
right_leaf_index = tree->SplitCategorical(best_leaf_index_,
train_data_->RealFeatureIndex(leaf_best_split_feature_[best_leaf_index_]),
train_data_->FeatureBinMapper(leaf_best_split_feature_[best_leaf_index_])->missing_type(),
best_split_info,
cuda_bitset_,
cuda_bitset_len_,
cuda_bitset_inner_,
cuda_bitset_inner_len_);
} else {
right_leaf_index = tree->Split(best_leaf_index_,
train_data_->RealFeatureIndex(leaf_best_split_feature_[best_leaf_index_]),
train_data_->RealThreshold(leaf_best_split_feature_[best_leaf_index_],
leaf_best_split_threshold_[best_leaf_index_]),
train_data_->FeatureBinMapper(leaf_best_split_feature_[best_leaf_index_])->missing_type(),
best_split_info);
}
double sum_left_gradients = 0.0f;
double sum_right_gradients = 0.0f;
cuda_data_partition_->Split(best_split_info,
best_leaf_index_,
right_leaf_index,
leaf_best_split_feature_[best_leaf_index_],
leaf_best_split_threshold_[best_leaf_index_],
cuda_bitset_inner_,
static_cast<int>(cuda_bitset_inner_len_),
leaf_best_split_default_left_[best_leaf_index_],
leaf_num_data_[best_leaf_index_],
leaf_data_start_[best_leaf_index_],
cuda_smaller_leaf_splits_->GetCUDAStructRef(),
cuda_larger_leaf_splits_->GetCUDAStructRef(),
&leaf_num_data_[best_leaf_index_],
&leaf_num_data_[right_leaf_index],
&leaf_data_start_[best_leaf_index_],
&leaf_data_start_[right_leaf_index],
&leaf_sum_hessians_[best_leaf_index_],
&leaf_sum_hessians_[right_leaf_index],
&sum_left_gradients,
&sum_right_gradients);
#ifdef DEBUG
CheckSplitValid(best_leaf_index_, right_leaf_index, sum_left_gradients, sum_right_gradients);
#endif // DEBUG
smaller_leaf_index_ = (leaf_num_data_[best_leaf_index_] < leaf_num_data_[right_leaf_index] ? best_leaf_index_ : right_leaf_index);
larger_leaf_index_ = (smaller_leaf_index_ == best_leaf_index_ ? right_leaf_index : best_leaf_index_);
global_timer.Stop("CUDASingleGPUTreeLearner::Split");
}
SynchronizeCUDADevice(__FILE__, __LINE__);
tree->ToHost();
return tree.release();
}
void CUDASingleGPUTreeLearner::ResetTrainingData(
const Dataset* train_data,
bool is_constant_hessian) {
SerialTreeLearner::ResetTrainingData(train_data, is_constant_hessian);
CHECK_EQ(num_features_, train_data_->num_features());
cuda_histogram_constructor_->ResetTrainingData(train_data, share_state_.get());
cuda_data_partition_->ResetTrainingData(train_data,
static_cast<int>(share_state_->feature_hist_offsets().back()),
cuda_histogram_constructor_->cuda_hist_pointer());
cuda_best_split_finder_->ResetTrainingData(
cuda_histogram_constructor_->cuda_hist(),
train_data,
share_state_->feature_hist_offsets());
cuda_smaller_leaf_splits_->Resize(num_data_);
cuda_larger_leaf_splits_->Resize(num_data_);
CHECK_EQ(is_constant_hessian, share_state_->is_constant_hessian);
DeallocateCUDAMemory<score_t>(&cuda_gradients_, __FILE__, __LINE__);
DeallocateCUDAMemory<score_t>(&cuda_hessians_, __FILE__, __LINE__);
AllocateCUDAMemory<score_t>(&cuda_gradients_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
AllocateCUDAMemory<score_t>(&cuda_hessians_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
}
void CUDASingleGPUTreeLearner::ResetConfig(const Config* config) {
const int old_num_leaves = config_->num_leaves;
SerialTreeLearner::ResetConfig(config);
if (config_->gpu_device_id >= 0 && config_->gpu_device_id != gpu_device_id_) {
Log::Fatal("Changing gpu device ID by resetting configuration parameter is not allowed for CUDA tree learner.");
}
num_threads_ = OMP_NUM_THREADS();
if (config_->num_leaves != old_num_leaves) {
leaf_best_split_feature_.resize(config_->num_leaves, -1);
leaf_best_split_threshold_.resize(config_->num_leaves, 0);
leaf_best_split_default_left_.resize(config_->num_leaves, 0);
leaf_num_data_.resize(config_->num_leaves, 0);
leaf_data_start_.resize(config_->num_leaves, 0);
leaf_sum_hessians_.resize(config_->num_leaves, 0.0f);
}
cuda_histogram_constructor_->ResetConfig(config);
cuda_best_split_finder_->ResetConfig(config, cuda_histogram_constructor_->cuda_hist());
cuda_data_partition_->ResetConfig(config, cuda_histogram_constructor_->cuda_hist_pointer());
}
void CUDASingleGPUTreeLearner::SetBaggingData(const Dataset* /*subset*/,
const data_size_t* used_indices, data_size_t num_data) {
cuda_data_partition_->SetUsedDataIndices(used_indices, num_data);
}
void CUDASingleGPUTreeLearner::RenewTreeOutput(Tree* tree, const ObjectiveFunction* obj, std::function<double(const label_t*, int)> residual_getter,
data_size_t total_num_data, const data_size_t* bag_indices, data_size_t bag_cnt) const {
CHECK(tree->is_cuda_tree());
CUDATree* cuda_tree = reinterpret_cast<CUDATree*>(tree);
if (obj != nullptr && obj->IsRenewTreeOutput()) {
CHECK_LE(cuda_tree->num_leaves(), data_partition_->num_leaves());
const data_size_t* bag_mapper = nullptr;
if (total_num_data != num_data_) {
CHECK_EQ(bag_cnt, num_data_);
bag_mapper = bag_indices;
}
std::vector<int> n_nozeroworker_perleaf(tree->num_leaves(), 1);
int num_machines = Network::num_machines();
#pragma omp parallel for schedule(static)
for (int i = 0; i < tree->num_leaves(); ++i) {
const double output = static_cast<double>(tree->LeafOutput(i));
data_size_t cnt_leaf_data = leaf_num_data_[i];
std::vector<data_size_t> index_mapper(cnt_leaf_data, -1);
CopyFromCUDADeviceToHost<data_size_t>(index_mapper.data(),
cuda_data_partition_->cuda_data_indices() + leaf_data_start_[i],
static_cast<size_t>(cnt_leaf_data), __FILE__, __LINE__);
if (cnt_leaf_data > 0) {
const double new_output = obj->RenewTreeOutput(output, residual_getter, index_mapper.data(), bag_mapper, cnt_leaf_data);
tree->SetLeafOutput(i, new_output);
} else {
CHECK_GT(num_machines, 1);
tree->SetLeafOutput(i, 0.0);
n_nozeroworker_perleaf[i] = 0;
}
}
if (num_machines > 1) {
std::vector<double> outputs(tree->num_leaves());
for (int i = 0; i < tree->num_leaves(); ++i) {
outputs[i] = static_cast<double>(tree->LeafOutput(i));
}
outputs = Network::GlobalSum(&outputs);
n_nozeroworker_perleaf = Network::GlobalSum(&n_nozeroworker_perleaf);
for (int i = 0; i < tree->num_leaves(); ++i) {
tree->SetLeafOutput(i, outputs[i] / n_nozeroworker_perleaf[i]);
}
}
}
cuda_tree->SyncLeafOutputFromHostToCUDA();
}
Tree* CUDASingleGPUTreeLearner::FitByExistingTree(const Tree* old_tree, const score_t* gradients, const score_t* hessians) const {
std::unique_ptr<CUDATree> cuda_tree(new CUDATree(old_tree));
SetCUDAMemory<double>(cuda_leaf_gradient_stat_buffer_, 0, static_cast<size_t>(old_tree->num_leaves()), __FILE__, __LINE__);
SetCUDAMemory<double>(cuda_leaf_hessian_stat_buffer_, 0, static_cast<size_t>(old_tree->num_leaves()), __FILE__, __LINE__);
ReduceLeafStat(cuda_tree.get(), gradients, hessians, cuda_data_partition_->cuda_data_indices());
cuda_tree->SyncLeafOutputFromCUDAToHost();
return cuda_tree.release();
}
Tree* CUDASingleGPUTreeLearner::FitByExistingTree(const Tree* old_tree, const std::vector<int>& leaf_pred,
const score_t* gradients, const score_t* hessians) const {
cuda_data_partition_->ResetByLeafPred(leaf_pred, old_tree->num_leaves());
refit_num_data_ = static_cast<data_size_t>(leaf_pred.size());
data_size_t buffer_size = static_cast<data_size_t>(old_tree->num_leaves());
if (old_tree->num_leaves() > 2048) {
const int num_block = (refit_num_data_ + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
buffer_size *= static_cast<data_size_t>(num_block + 1);
}
if (buffer_size != leaf_stat_buffer_size_) {
if (leaf_stat_buffer_size_ != 0) {
DeallocateCUDAMemory<double>(&cuda_leaf_gradient_stat_buffer_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_leaf_hessian_stat_buffer_, __FILE__, __LINE__);
}
AllocateCUDAMemory<double>(&cuda_leaf_gradient_stat_buffer_, static_cast<size_t>(buffer_size), __FILE__, __LINE__);
AllocateCUDAMemory<double>(&cuda_leaf_hessian_stat_buffer_, static_cast<size_t>(buffer_size), __FILE__, __LINE__);
}
return FitByExistingTree(old_tree, gradients, hessians);
}
void CUDASingleGPUTreeLearner::ReduceLeafStat(
CUDATree* old_tree, const score_t* gradients, const score_t* hessians, const data_size_t* num_data_in_leaf) const {
LaunchReduceLeafStatKernel(gradients, hessians, num_data_in_leaf, old_tree->cuda_leaf_parent(),
old_tree->cuda_left_child(), old_tree->cuda_right_child(),
old_tree->num_leaves(), refit_num_data_, old_tree->cuda_leaf_value_ref(), old_tree->shrinkage());
}
void CUDASingleGPUTreeLearner::ConstructBitsetForCategoricalSplit(
const CUDASplitInfo* best_split_info) {
LaunchConstructBitsetForCategoricalSplitKernel(best_split_info);
}
void CUDASingleGPUTreeLearner::AllocateBitset() {
has_categorical_feature_ = false;
categorical_bin_offsets_.clear();
categorical_bin_offsets_.push_back(0);
categorical_bin_to_value_.clear();
for (int i = 0; i < train_data_->num_features(); ++i) {
const BinMapper* bin_mapper = train_data_->FeatureBinMapper(i);
if (bin_mapper->bin_type() == BinType::CategoricalBin) {
has_categorical_feature_ = true;
break;
}
}
if (has_categorical_feature_) {
int max_cat_value = 0;
int max_cat_num_bin = 0;
for (int i = 0; i < train_data_->num_features(); ++i) {
const BinMapper* bin_mapper = train_data_->FeatureBinMapper(i);
if (bin_mapper->bin_type() == BinType::CategoricalBin) {
max_cat_value = std::max(bin_mapper->MaxCatValue(), max_cat_value);
max_cat_num_bin = std::max(bin_mapper->num_bin(), max_cat_num_bin);
}
}
// std::max(..., 1UL) to avoid error in the case when there are NaN's in the categorical values
const size_t cuda_bitset_max_size = std::max(static_cast<size_t>((max_cat_value + 31) / 32), 1UL);
const size_t cuda_bitset_inner_max_size = std::max(static_cast<size_t>((max_cat_num_bin + 31) / 32), 1UL);
AllocateCUDAMemory<uint32_t>(&cuda_bitset_, cuda_bitset_max_size, __FILE__, __LINE__);
AllocateCUDAMemory<uint32_t>(&cuda_bitset_inner_, cuda_bitset_inner_max_size, __FILE__, __LINE__);
const int max_cat_in_split = std::min(config_->max_cat_threshold, max_cat_num_bin / 2);
const int num_blocks = (max_cat_in_split + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
AllocateCUDAMemory<size_t>(&cuda_block_bitset_len_buffer_, num_blocks, __FILE__, __LINE__);
for (int i = 0; i < train_data_->num_features(); ++i) {
const BinMapper* bin_mapper = train_data_->FeatureBinMapper(i);
if (bin_mapper->bin_type() == BinType::CategoricalBin) {
categorical_bin_offsets_.push_back(bin_mapper->num_bin());
} else {
categorical_bin_offsets_.push_back(0);
}
}
for (size_t i = 1; i < categorical_bin_offsets_.size(); ++i) {
categorical_bin_offsets_[i] += categorical_bin_offsets_[i - 1];
}
categorical_bin_to_value_.resize(categorical_bin_offsets_.back(), 0);
for (int i = 0; i < train_data_->num_features(); ++i) {
const BinMapper* bin_mapper = train_data_->FeatureBinMapper(i);
if (bin_mapper->bin_type() == BinType::CategoricalBin) {
const int offset = categorical_bin_offsets_[i];
for (int bin = 0; bin < bin_mapper->num_bin(); ++bin) {
categorical_bin_to_value_[offset + bin] = static_cast<int>(bin_mapper->BinToValue(bin));
}
}
}
InitCUDAMemoryFromHostMemory<int>(&cuda_categorical_bin_offsets_, categorical_bin_offsets_.data(), categorical_bin_offsets_.size(), __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_categorical_bin_to_value_, categorical_bin_to_value_.data(), categorical_bin_to_value_.size(), __FILE__, __LINE__);
} else {
cuda_bitset_ = nullptr;
cuda_bitset_inner_ = nullptr;
}
cuda_bitset_len_ = 0;
cuda_bitset_inner_len_ = 0;
}
#ifdef DEBUG
void CUDASingleGPUTreeLearner::CheckSplitValid(
const int left_leaf,
const int right_leaf,
const double split_sum_left_gradients,
const double split_sum_right_gradients) {
std::vector<data_size_t> left_data_indices(leaf_num_data_[left_leaf]);
std::vector<data_size_t> right_data_indices(leaf_num_data_[right_leaf]);
CopyFromCUDADeviceToHost<data_size_t>(left_data_indices.data(),
cuda_data_partition_->cuda_data_indices() + leaf_data_start_[left_leaf],
leaf_num_data_[left_leaf], __FILE__, __LINE__);
CopyFromCUDADeviceToHost<data_size_t>(right_data_indices.data(),
cuda_data_partition_->cuda_data_indices() + leaf_data_start_[right_leaf],
leaf_num_data_[right_leaf], __FILE__, __LINE__);
double sum_left_gradients = 0.0f, sum_left_hessians = 0.0f;
double sum_right_gradients = 0.0f, sum_right_hessians = 0.0f;
for (size_t i = 0; i < left_data_indices.size(); ++i) {
const data_size_t index = left_data_indices[i];
sum_left_gradients += gradients_[index];
sum_left_hessians += hessians_[index];
}
for (size_t i = 0; i < right_data_indices.size(); ++i) {
const data_size_t index = right_data_indices[i];
sum_right_gradients += gradients_[index];
sum_right_hessians += hessians_[index];
}
CHECK_LE(std::fabs(sum_left_gradients - split_sum_left_gradients), 1e-6f);
CHECK_LE(std::fabs(sum_left_hessians - leaf_sum_hessians_[left_leaf]), 1e-6f);
CHECK_LE(std::fabs(sum_right_gradients - split_sum_right_gradients), 1e-6f);
CHECK_LE(std::fabs(sum_right_hessians - leaf_sum_hessians_[right_leaf]), 1e-6f);
}
#endif // DEBUG
} // namespace LightGBM
#endif // USE_CUDA_EXP
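categorical_bin_offsets_, built in AllocateBitset above, is a prefix sum over per-feature categorical bin counts (numerical features contribute 0), so feature i's slice of the flat categorical_bin_to_value_ table starts at categorical_bin_offsets_[i]. A small standalone illustration with made-up bin counts:

#include <cstddef>
#include <cstdio>
#include <initializer_list>
#include <vector>

int main() {
  // Hypothetical layout: feature 0 is numerical (0 categorical bins),
  // features 1 and 2 have 4 and 3 categorical bins respectively.
  std::vector<int> offsets{0};
  for (int num_bin : {0, 4, 3}) {
    offsets.push_back(num_bin);
  }
  // same in-place prefix sum as AllocateBitset
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    offsets[i] += offsets[i - 1];
  }
  // offsets == {0, 0, 4, 7}: feature 1's bin-to-value slice is [0, 4),
  // feature 2's is [4, 7), and offsets.back() sizes the whole table.
  for (int v : offsets) {
    std::printf("%d ", v);
  }
  std::printf("\n");
  return 0;
}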
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_algorithms.hpp>
#include "cuda_single_gpu_tree_learner.hpp"
#include <algorithm>
namespace LightGBM {
__global__ void ReduceLeafStatKernel_SharedMemory(
const score_t* gradients,
const score_t* hessians,
const int num_leaves,
const data_size_t num_data,
const int* data_index_to_leaf_index,
double* leaf_grad_stat_buffer,
double* leaf_hess_stat_buffer) {
extern __shared__ double shared_mem[];
double* shared_grad_sum = shared_mem;
double* shared_hess_sum = shared_mem + num_leaves;
const data_size_t data_index = static_cast<data_size_t>(threadIdx.x + blockIdx.x * blockDim.x);
for (int leaf_index = static_cast<int>(threadIdx.x); leaf_index < num_leaves; leaf_index += static_cast<int>(blockDim.x)) {
shared_grad_sum[leaf_index] = 0.0f;
shared_hess_sum[leaf_index] = 0.0f;
}
__syncthreads();
if (data_index < num_data) {
const int leaf_index = data_index_to_leaf_index[data_index];
atomicAdd_block(shared_grad_sum + leaf_index, gradients[data_index]);
atomicAdd_block(shared_hess_sum + leaf_index, hessians[data_index]);
}
__syncthreads();
for (int leaf_index = static_cast<int>(threadIdx.x); leaf_index < num_leaves; leaf_index += static_cast<int>(blockDim.x)) {
atomicAdd_system(leaf_grad_stat_buffer + leaf_index, shared_grad_sum[leaf_index]);
atomicAdd_system(leaf_hess_stat_buffer + leaf_index, shared_hess_sum[leaf_index]);
}
}
__global__ void ReduceLeafStatKernel_GlobalMemory(
const score_t* gradients,
const score_t* hessians,
const int num_leaves,
const data_size_t num_data,
const int* data_index_to_leaf_index,
double* leaf_grad_stat_buffer,
double* leaf_hess_stat_buffer) {
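// each block accumulates into its own num_leaves-sized slice of the buffer
// (slices 1..gridDim.x); slice 0 is reserved for the final sums produced by
// the atomicAdd_system calls at the end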
const size_t offset = static_cast<size_t>(num_leaves) * (blockIdx.x + 1);
double* grad_sum = leaf_grad_stat_buffer + offset;
double* hess_sum = leaf_hess_stat_buffer + offset;
const data_size_t data_index = static_cast<data_size_t>(threadIdx.x + blockIdx.x * blockDim.x);
for (int leaf_index = static_cast<int>(threadIdx.x); leaf_index < num_leaves; leaf_index += static_cast<int>(blockDim.x)) {
grad_sum[leaf_index] = 0.0f;
hess_sum[leaf_index] = 0.0f;
}
__syncthreads();
if (data_index < num_data) {
const int leaf_index = data_index_to_leaf_index[data_index];
atomicAdd_block(grad_sum + leaf_index, gradients[data_index]);
atomicAdd_block(hess_sum + leaf_index, hessians[data_index]);
}
__syncthreads();
for (int leaf_index = static_cast<int>(threadIdx.x); leaf_index < num_leaves; leaf_index += static_cast<int>(blockDim.x)) {
atomicAdd_system(leaf_grad_stat_buffer + leaf_index, grad_sum[leaf_index]);
atomicAdd_system(leaf_hess_stat_buffer + leaf_index, hess_sum[leaf_index]);
}
}
template <bool USE_L1, bool USE_SMOOTHING>
__global__ void CalcRefitLeafOutputKernel(
const int num_leaves,
const double* leaf_grad_stat_buffer,
const double* leaf_hess_stat_buffer,
const data_size_t* num_data_in_leaf,
const int* leaf_parent,
const int* left_child,
const int* right_child,
const double lambda_l1,
const double lambda_l2,
const double path_smooth,
const double shrinkage_rate,
const double refit_decay_rate,
double* leaf_value) {
const int leaf_index = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
if (leaf_index < num_leaves) {
const double sum_gradients = leaf_grad_stat_buffer[leaf_index];
const double sum_hessians = leaf_hess_stat_buffer[leaf_index];
const data_size_t num_data = num_data_in_leaf[leaf_index];
const double old_leaf_value = leaf_value[leaf_index];
double new_leaf_value = 0.0f;
if (!USE_SMOOTHING) {
new_leaf_value = CUDALeafSplits::CalculateSplittedLeafOutput<USE_L1, false>(sum_gradients, sum_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
} else {
const int parent = leaf_parent[leaf_index];
if (parent >= 0) {
const int sibling = left_child[parent] == leaf_index ? right_child[parent] : left_child[parent];
const double sum_gradients_of_parent = sum_gradients + leaf_grad_stat_buffer[sibling];
const double sum_hessians_of_parent = sum_hessians + leaf_hess_stat_buffer[sibling];
const data_size_t num_data_in_parent = num_data + num_data_in_leaf[sibling];
// parent output is computed without smoothing, as for the root node above
const double parent_output =
CUDALeafSplits::CalculateSplittedLeafOutput<USE_L1, false>(
sum_gradients_of_parent, sum_hessians_of_parent, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
new_leaf_value = CUDALeafSplits::CalculateSplittedLeafOutput<USE_L1, true>(
sum_gradients, sum_hessians, lambda_l1, lambda_l2, path_smooth, num_data_in_parent, parent_output);
} else {
new_leaf_value = CUDALeafSplits::CalculateSplittedLeafOutput<USE_L1, false>(sum_gradients, sum_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
}
}
if (isnan(new_leaf_value)) {
new_leaf_value = 0.0f;
} else {
new_leaf_value *= shrinkage_rate;
}
leaf_value[leaf_index] = refit_decay_rate * old_leaf_value + (1.0f - refit_decay_rate) * new_leaf_value;
}
}
void CUDASingleGPUTreeLearner::LaunchReduceLeafStatKernel(
const score_t* gradients, const score_t* hessians, const data_size_t* num_data_in_leaf,
const int* leaf_parent, const int* left_child, const int* right_child, const int num_leaves,
const data_size_t num_data, double* cuda_leaf_value, const double shrinkage_rate) const {
int num_block = (num_data + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
if (num_leaves <= 2048) {
ReduceLeafStatKernel_SharedMemory<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE, 2 * num_leaves * sizeof(double)>>>(
gradients, hessians, num_leaves, num_data, cuda_data_partition_->cuda_data_index_to_leaf_index(),
cuda_leaf_gradient_stat_buffer_, cuda_leaf_hessian_stat_buffer_);
} else {
ReduceLeafStatKernel_GlobalMemory<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(
gradients, hessians, num_leaves, num_data, cuda_data_partition_->cuda_data_index_to_leaf_index(),
cuda_leaf_gradient_stat_buffer_, cuda_leaf_hessian_stat_buffer_);
}
const bool use_l1 = config_->lambda_l1 > 0.0f;
const bool use_smoothing = config_->path_smooth > 0.0f;
num_block = (num_leaves + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
#define CalcRefitLeafOutputKernel_ARGS \
num_leaves, cuda_leaf_gradient_stat_buffer_, cuda_leaf_hessian_stat_buffer_, num_data_in_leaf, \
leaf_parent, left_child, right_child, \
config_->lambda_l1, config_->lambda_l2, config_->path_smooth, \
shrinkage_rate, config_->refit_decay_rate, cuda_leaf_value
if (!use_l1) {
if (!use_smoothing) {
CalcRefitLeafOutputKernel<false, false>
<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(CalcRefitLeafOutputKernel_ARGS);
} else {
CalcRefitLeafOutputKernel<false, true>
<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(CalcRefitLeafOutputKernel_ARGS);
}
} else {
if (!use_smoothing) {
CalcRefitLeafOutputKernel<true, false>
<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(CalcRefitLeafOutputKernel_ARGS);
} else {
CalcRefitLeafOutputKernel<true, true>
<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(CalcRefitLeafOutputKernel_ARGS);
}
}
}
template <typename T, bool IS_INNER>
__global__ void CalcBitsetLenKernel(const CUDASplitInfo* best_split_info, size_t* out_len_buffer) {
__shared__ size_t shared_mem_buffer[32];
const T* vals = nullptr;
if (IS_INNER) {
vals = reinterpret_cast<const T*>(best_split_info->cat_threshold);
} else {
vals = reinterpret_cast<const T*>(best_split_info->cat_threshold_real);
}
const int i = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
size_t len = 0;
if (i < best_split_info->num_cat_threshold) {
const T val = vals[i];
len = (val / 32) + 1;
}
const size_t block_max_len = ShuffleReduceMax<size_t>(len, shared_mem_buffer, blockDim.x);
if (threadIdx.x == 0) {
out_len_buffer[blockIdx.x] = block_max_len;
}
}
__global__ void ReduceBlockMaxLen(size_t* out_len_buffer, const int num_blocks) {
__shared__ size_t shared_mem_buffer[32];
size_t max_len = 0;
for (int i = static_cast<int>(threadIdx.x); i < num_blocks; i += static_cast<int>(blockDim.x)) {
max_len = max(out_len_buffer[i], max_len);
}
const size_t all_max_len = ShuffleReduceMax<size_t>(max_len, shared_mem_buffer, blockDim.x);
if (threadIdx.x == 0) {
out_len_buffer[0] = all_max_len;
}
}
template <typename T, bool IS_INNER>
__global__ void CUDAConstructBitsetKernel(const CUDASplitInfo* best_split_info, uint32_t* out, size_t cuda_bitset_len) {
const T* vals = nullptr;
if (IS_INNER) {
vals = reinterpret_cast<const T*>(best_split_info->cat_threshold);
} else {
vals = reinterpret_cast<const T*>(best_split_info->cat_threshold_real);
}
const int i = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
if (i < best_split_info->num_cat_threshold) {
const T val = vals[i];
// can use add instead of or here, because each bit will only be added once
atomicAdd_system(out + (val / 32), (0x1 << (val % 32)));
}
}
__global__ void SetRealThresholdKernel(
const CUDASplitInfo* best_split_info,
const int* categorical_bin_to_value,
const int* categorical_bin_offsets) {
const int num_cat_threshold = best_split_info->num_cat_threshold;
const int* categorical_bin_to_value_ptr = categorical_bin_to_value + categorical_bin_offsets[best_split_info->inner_feature_index];
int* cat_threshold_real = best_split_info->cat_threshold_real;
const uint32_t* cat_threshold = best_split_info->cat_threshold;
const int index = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
if (index < num_cat_threshold) {
cat_threshold_real[index] = categorical_bin_to_value_ptr[cat_threshold[index]];
}
}
template <typename T, bool IS_INNER>
void CUDAConstructBitset(const CUDASplitInfo* best_split_info, const int num_cat_threshold, uint32_t* out, size_t bitset_len) {
const int num_blocks = (num_cat_threshold + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
// clear the bitset vector first
SetCUDAMemory<uint32_t>(out, 0, bitset_len, __FILE__, __LINE__);
CUDAConstructBitsetKernel<T, IS_INNER><<<num_blocks, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(best_split_info, out, bitset_len);
}
template <typename T, bool IS_INNER>
size_t CUDABitsetLen(const CUDASplitInfo* best_split_info, const int num_cat_threshold, size_t* out_len_buffer) {
const int num_blocks = (num_cat_threshold + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
CalcBitsetLenKernel<T, IS_INNER><<<num_blocks, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(best_split_info, out_len_buffer);
ReduceBlockMaxLen<<<1, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(out_len_buffer, num_blocks);
size_t host_max_len = 0;
CopyFromCUDADeviceToHost<size_t>(&host_max_len, out_len_buffer, 1, __FILE__, __LINE__);
return host_max_len;
}
void CUDASingleGPUTreeLearner::LaunchConstructBitsetForCategoricalSplitKernel(
const CUDASplitInfo* best_split_info) {
const int num_blocks = (num_cat_threshold_ + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
SetRealThresholdKernel<<<num_blocks, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>
(best_split_info, cuda_categorical_bin_to_value_, cuda_categorical_bin_offsets_);
cuda_bitset_inner_len_ = CUDABitsetLen<uint32_t, true>(best_split_info, num_cat_threshold_, cuda_block_bitset_len_buffer_);
CUDAConstructBitset<uint32_t, true>(best_split_info, num_cat_threshold_, cuda_bitset_inner_, cuda_bitset_inner_len_);
cuda_bitset_len_ = CUDABitsetLen<int, false>(best_split_info, num_cat_threshold_, cuda_block_bitset_len_buffer_);
CUDAConstructBitset<int, false>(best_split_info, num_cat_threshold_, cuda_bitset_, cuda_bitset_len_);
}
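// The launcher builds two bitsets from the same split: cuda_bitset_inner_ is
// keyed by bin indices (training rows are stored as bins), while cuda_bitset_
// is keyed by the raw category values written by SetRealThresholdKernel (used
// when scoring unbinned data). A sketch of the routing test both bitsets serve
// (hypothetical helper; the real traversal lives in the CUDA tree code):
__device__ inline bool CategoryGoesLeft(const uint32_t* bitset, size_t bitset_len, uint32_t category) {
  const uint32_t word = category / 32;
  // categories listed in the bitset are routed to the left child
  return word < bitset_len && ((bitset[word] >> (category % 32)) & 1u);
}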
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifndef LIGHTGBM_TREELEARNER_CUDA_CUDA_SINGLE_GPU_TREE_LEARNER_HPP_
#define LIGHTGBM_TREELEARNER_CUDA_CUDA_SINGLE_GPU_TREE_LEARNER_HPP_
#include <memory>
#include <vector>
#ifdef USE_CUDA_EXP
#include "cuda_leaf_splits.hpp"
#include "cuda_histogram_constructor.hpp"
#include "cuda_data_partition.hpp"
#include "cuda_best_split_finder.hpp"
#include "../serial_tree_learner.h"
namespace LightGBM {
#define CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE (1024)
class CUDASingleGPUTreeLearner: public SerialTreeLearner {
public:
explicit CUDASingleGPUTreeLearner(const Config* config);
~CUDASingleGPUTreeLearner();
void Init(const Dataset* train_data, bool is_constant_hessian) override;
void ResetTrainingData(const Dataset* train_data,
bool is_constant_hessian) override;
Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) override;
void SetBaggingData(const Dataset* subset, const data_size_t* used_indices, data_size_t num_data) override;
void AddPredictionToScore(const Tree* tree, double* out_score) const override;
void RenewTreeOutput(Tree* tree, const ObjectiveFunction* obj, std::function<double(const label_t*, int)> residual_getter,
data_size_t total_num_data, const data_size_t* bag_indices, data_size_t bag_cnt) const override;
void ResetConfig(const Config* config) override;
Tree* FitByExistingTree(const Tree* old_tree, const score_t* gradients, const score_t* hessians) const override;
Tree* FitByExistingTree(const Tree* old_tree, const std::vector<int>& leaf_pred,
const score_t* gradients, const score_t* hessians) const override;
protected:
void BeforeTrain() override;
void ReduceLeafStat(CUDATree* old_tree, const score_t* gradients, const score_t* hessians, const data_size_t* num_data_in_leaf) const;
void LaunchReduceLeafStatKernel(const score_t* gradients, const score_t* hessians, const data_size_t* num_data_in_leaf,
const int* leaf_parent, const int* left_child, const int* right_child,
const int num_leaves, const data_size_t num_data, double* cuda_leaf_value, const double shrinkage_rate) const;
void ConstructBitsetForCategoricalSplit(const CUDASplitInfo* best_split_info);
void LaunchConstructBitsetForCategoricalSplitKernel(const CUDASplitInfo* best_split_info);
void AllocateBitset();
#ifdef DEBUG
void CheckSplitValid(
const int left_leaf, const int right_leaf,
const double sum_left_gradients, const double sum_right_gradients);
#endif // DEBUG
// GPU device ID
int gpu_device_id_;
// number of threads on CPU
int num_threads_;
// CUDA components for tree training
// leaf splits information for smaller and larger leaves
std::unique_ptr<CUDALeafSplits> cuda_smaller_leaf_splits_;
std::unique_ptr<CUDALeafSplits> cuda_larger_leaf_splits_;
// data partition that partitions data indices into different leaves
std::unique_ptr<CUDADataPartition> cuda_data_partition_;
// for histogram construction
std::unique_ptr<CUDAHistogramConstructor> cuda_histogram_constructor_;
// for best split information finding, given the histograms
std::unique_ptr<CUDABestSplitFinder> cuda_best_split_finder_;
std::vector<int> leaf_best_split_feature_;
std::vector<uint32_t> leaf_best_split_threshold_;
std::vector<uint8_t> leaf_best_split_default_left_;
std::vector<data_size_t> leaf_num_data_;
std::vector<data_size_t> leaf_data_start_;
std::vector<double> leaf_sum_hessians_;
int smaller_leaf_index_;
int larger_leaf_index_;
int best_leaf_index_;
int num_cat_threshold_;
bool has_categorical_feature_;
std::vector<int> categorical_bin_to_value_;
std::vector<int> categorical_bin_offsets_;
mutable double* cuda_leaf_gradient_stat_buffer_;
mutable double* cuda_leaf_hessian_stat_buffer_;
mutable data_size_t leaf_stat_buffer_size_;
mutable data_size_t refit_num_data_;
uint32_t* cuda_bitset_;
size_t cuda_bitset_len_;
uint32_t* cuda_bitset_inner_;
size_t cuda_bitset_inner_len_;
size_t* cuda_block_bitset_len_buffer_;
int* cuda_categorical_bin_to_value_;
int* cuda_categorical_bin_offsets_;
/*! \brief gradients on CUDA */
score_t* cuda_gradients_;
/*! \brief hessians on CUDA */
score_t* cuda_hessians_;
};
} // namespace LightGBM
#else // USE_CUDA_EXP
// When GPU support is not compiled in, quit with an error message
namespace LightGBM {
class CUDASingleGPUTreeLearner: public SerialTreeLearner {
public:
#pragma warning(disable : 4702)
explicit CUDASingleGPUTreeLearner(const Config* tree_config) : SerialTreeLearner(tree_config) {
Log::Fatal("CUDA Tree Learner experimental version was not enabled in this build.\n"
"Please recompile with CMake option -DUSE_CUDA_EXP=1");
}
};
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_TREELEARNER_CUDA_CUDA_SINGLE_GPU_TREE_LEARNER_HPP_
......@@ -206,12 +206,12 @@ class SerialTreeLearner: public TreeLearner {
std::unique_ptr<LeafSplits> smaller_leaf_splits_;
/*! \brief stores best thresholds for all feature for larger leaf */
std::unique_ptr<LeafSplits> larger_leaf_splits_;
-#ifdef USE_GPU
+#if defined(USE_GPU)
/*! \brief gradients of current iteration, ordered for cache optimized, aligned to 4K page */
std::vector<score_t, boost::alignment::aligned_allocator<score_t, 4096>> ordered_gradients_;
/*! \brief hessians of current iteration, ordered for cache optimized, aligned to 4K page */
std::vector<score_t, boost::alignment::aligned_allocator<score_t, 4096>> ordered_hessians_;
-#elif USE_CUDA
+#elif defined(USE_CUDA) || defined(USE_CUDA_EXP)
/*! \brief gradients of current iteration, ordered for cache optimized */
std::vector<score_t, CHAllocator<score_t>> ordered_gradients_;
/*! \brief hessians of current iteration, ordered for cache optimized */
......
......@@ -9,6 +9,7 @@
#include "linear_tree_learner.h"
#include "parallel_tree_learner.h"
#include "serial_tree_learner.h"
#include "cuda/cuda_single_gpu_tree_learner.hpp"
namespace LightGBM {
......@@ -48,6 +49,16 @@ TreeLearner* TreeLearner::CreateTreeLearner(const std::string& learner_type, con
} else if (learner_type == std::string("voting")) {
return new VotingParallelTreeLearner<CUDATreeLearner>(config);
}
} else if (device_type == std::string("cuda_exp")) {
if (learner_type == std::string("serial")) {
if (config->num_gpu == 1) {
return new CUDASingleGPUTreeLearner(config);
} else {
Log::Fatal("cuda_exp only supports training on a single GPU.");
}
} else {
Log::Fatal("cuda_exp only supports training on a single machine.");
}
}
return nullptr;
}
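// A minimal usage sketch for the new branch (hedged: signature and field names
// follow the surrounding code; "cuda_exp" requires the serial learner type and
// num_gpu == 1, otherwise CreateTreeLearner aborts via Log::Fatal):
TreeLearner* CreateCUDAExpLearnerForTest(const Config* config) {
  // dispatches to CUDASingleGPUTreeLearner when device_type is "cuda_exp"
  return TreeLearner::CreateTreeLearner(std::string("serial"), std::string("cuda_exp"), config);
}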
......
......@@ -2,6 +2,7 @@
import filecmp
import numbers
import re
from os import getenv
from pathlib import Path
import numpy as np
......@@ -47,8 +48,9 @@ def test_basic(tmp_path):
assert bst.current_iteration() == 20
assert bst.num_trees() == 20
assert bst.num_model_per_iteration() == 1
-    assert bst.lower_bound() == pytest.approx(-2.9040190126976606)
-    assert bst.upper_bound() == pytest.approx(3.3182142872462883)
+    if getenv('TASK', '') != 'cuda_exp':
+        assert bst.lower_bound() == pytest.approx(-2.9040190126976606)
+        assert bst.upper_bound() == pytest.approx(3.3182142872462883)
tname = tmp_path / "svm_light.dat"
model_file = tmp_path / "model.txt"
......
......@@ -56,7 +56,8 @@ task_to_local_factory = {
pytestmark = [
pytest.mark.skipif(getenv('TASK', '') == 'mpi', reason='Fails to run with MPI interface'),
-    pytest.mark.skipif(getenv('TASK', '') == 'gpu', reason='Fails to run with GPU interface')
+    pytest.mark.skipif(getenv('TASK', '') == 'gpu', reason='Fails to run with GPU interface'),
+    pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Fails to run with CUDA Experimental interface')
]
......
......@@ -6,6 +6,7 @@ import math
import pickle
import platform
import random
from os import getenv
from pathlib import Path
import numpy as np
......@@ -570,6 +571,7 @@ def test_multi_class_error():
assert results['training']['multi_error@2'][-1] == pytest.approx(0)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_auc_mu():
# should give same result as binary auc for 2 classes
X, y = load_digits(n_class=10, return_X_y=True)
......@@ -1501,6 +1503,7 @@ def generate_trainset_for_monotone_constraints_tests(x3_to_category=True):
return trainset
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Monotone constraints are not yet supported by CUDA Experimental version')
@pytest.mark.parametrize("test_with_categorical_variable", [True, False])
def test_monotone_constraints(test_with_categorical_variable):
def is_increasing(y):
......@@ -1590,6 +1593,7 @@ def test_monotone_constraints(test_with_categorical_variable):
assert are_interactions_enforced(constrained_model, feature_sets)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Monotone constraints are not yet supported by CUDA Experimental version')
def test_monotone_penalty():
def are_first_splits_non_monotone(tree, n, monotone_constraints):
if n <= 0:
......@@ -1629,6 +1633,7 @@ def test_monotone_penalty():
# test if a penalty as high as the depth indeed prohibits all monotone splits
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Monotone constraints are not yet supported by CUDA Experimental version')
def test_monotone_penalty_max():
max_depth = 5
monotone_constraints = [1, -1, 0]
......@@ -2393,6 +2398,7 @@ def test_model_size():
pytest.skipTest('not enough RAM')
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_get_split_value_histogram():
X, y = load_boston(return_X_y=True)
lgb_train = lgb.Dataset(X, y, categorical_feature=[2])
......@@ -2472,6 +2478,7 @@ def test_get_split_value_histogram():
gbm.get_split_value_histogram(2)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_early_stopping_for_only_first_metric():
def metrics_combination_train_regression(valid_sets, metric_list, assumed_iteration,
......@@ -2878,6 +2885,7 @@ def test_trees_to_dataframe():
assert tree_df.loc[0, col] is None
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Interaction constraints are not yet supported by CUDA Experimental version')
def test_interaction_constraints():
X, y = load_boston(return_X_y=True)
num_features = X.shape[1]
......@@ -3272,6 +3280,7 @@ def test_dump_model_hook():
assert "LV" in dumped_model_str
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Forced splits are not yet supported by CUDA Experimental version')
def test_force_split_with_feature_fraction(tmp_path):
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
......
# coding: utf-8
import itertools
import math
from os import getenv
from pathlib import Path
import joblib
......@@ -99,6 +100,7 @@ def test_regression():
assert gbm.evals_result_['valid_0']['l2'][gbm.best_iteration_ - 1] == pytest.approx(ret)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_multiclass():
X, y = load_digits(n_class=10, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
......@@ -111,6 +113,7 @@ def test_multiclass():
assert gbm.evals_result_['valid_0']['multi_logloss'][gbm.best_iteration_ - 1] == pytest.approx(ret)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_lambdarank():
rank_example_dir = Path(__file__).absolute().parents[2] / 'examples' / 'lambdarank'
X_train, y_train = load_svmlight_file(str(rank_example_dir / 'rank.train'))
......@@ -1068,6 +1071,7 @@ def test_nan_handle():
np.testing.assert_allclose(gbm.evals_result_['training']['l2'], np.nan)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_first_metric_only():
def fit_and_check(eval_set_names, metric_names, assumed_iteration, first_metric_only):
......
......@@ -317,10 +317,14 @@
<ClCompile Include="..\src\io\config_auto.cpp" />
<ClCompile Include="..\src\io\dataset.cpp" />
<ClCompile Include="..\src\io\dataset_loader.cpp" />
<ClCompile Include="..\src\io\dense_bin.cpp" />
<ClCompile Include="..\src\io\file_io.cpp" />
<ClCompile Include="..\src\io\json11.cpp" />
<ClCompile Include="..\src\io\metadata.cpp" />
<ClCompile Include="..\src\io\multi_val_dense_bin.cpp" />
<ClCompile Include="..\src\io\multi_val_sparse_bin.cpp" />
<ClCompile Include="..\src\io\parser.cpp" />
<ClCompile Include="..\src\io\sparse_bin.cpp" />
<ClCompile Include="..\src\io\train_share_states.cpp" />
<ClCompile Include="..\src\io\tree.cpp" />
<ClCompile Include="..\src\metric\dcg_calculator.cpp" />
......
......@@ -326,5 +326,17 @@
<ClCompile Include="..\src\treelearner\linear_tree_learner.cpp">
<Filter>src\treelearner</Filter>
</ClCompile>
<ClCompile Include="..\src\io\multi_val_dense_bin.cpp">
<Filter>src\io</Filter>
</ClCompile>
<ClCompile Include="..\src\io\multi_val_sparse_bin.cpp">
<Filter>src\io</Filter>
</ClCompile>
<ClCompile Include="..\src\io\dense_bin.cpp">
<Filter>src\io</Filter>
</ClCompile>
<ClCompile Include="..\src\io\sparse_bin.cpp">
<Filter>src\io</Filter>
</ClCompile>
</ItemGroup>
</Project>
\ No newline at end of file