Unverified Commit 6b56a90c authored by shiyu1994's avatar shiyu1994 Committed by GitHub
Browse files

[CUDA] New CUDA version Part 1 (#4630)



* new cuda framework

* add histogram construction kernel

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to CPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add deconstructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and spllit finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: default avatarYu Shi <shiyu1994@qq.com>
Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
parent b857ee10
This diff is collapsed.
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#include "cuda_leaf_splits.hpp"
namespace LightGBM {
CUDALeafSplits::CUDALeafSplits(const data_size_t num_data):
num_data_(num_data) {
cuda_struct_ = nullptr;
cuda_sum_of_gradients_buffer_ = nullptr;
cuda_sum_of_hessians_buffer_ = nullptr;
}
CUDALeafSplits::~CUDALeafSplits() {
DeallocateCUDAMemory<CUDALeafSplitsStruct>(&cuda_struct_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, __FILE__, __LINE__);
}
void CUDALeafSplits::Init() {
num_blocks_init_from_gradients_ = (num_data_ + NUM_THRADS_PER_BLOCK_LEAF_SPLITS - 1) / NUM_THRADS_PER_BLOCK_LEAF_SPLITS;
// allocate more memory for sum reduction in CUDA
// only the first element records the final sum
AllocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
AllocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
AllocateCUDAMemory<CUDALeafSplitsStruct>(&cuda_struct_, 1, __FILE__, __LINE__);
}
void CUDALeafSplits::InitValues() {
LaunchInitValuesEmptyKernel();
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDALeafSplits::InitValues(
const double lambda_l1, const double lambda_l2,
const score_t* cuda_gradients, const score_t* cuda_hessians,
const data_size_t* cuda_bagging_data_indices, const data_size_t* cuda_data_indices_in_leaf,
const data_size_t num_used_indices, hist_t* cuda_hist_in_leaf, double* root_sum_hessians) {
cuda_gradients_ = cuda_gradients;
cuda_hessians_ = cuda_hessians;
SetCUDAMemory<double>(cuda_sum_of_gradients_buffer_, 0, num_blocks_init_from_gradients_, __FILE__, __LINE__);
SetCUDAMemory<double>(cuda_sum_of_hessians_buffer_, 0, num_blocks_init_from_gradients_, __FILE__, __LINE__);
LaunchInitValuesKernal(lambda_l1, lambda_l2, cuda_bagging_data_indices, cuda_data_indices_in_leaf, num_used_indices, cuda_hist_in_leaf);
CopyFromCUDADeviceToHost<double>(root_sum_hessians, cuda_sum_of_hessians_buffer_, 1, __FILE__, __LINE__);
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDALeafSplits::Resize(const data_size_t num_data) {
if (num_data > num_data_) {
DeallocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, __FILE__, __LINE__);
num_blocks_init_from_gradients_ = (num_data + NUM_THRADS_PER_BLOCK_LEAF_SPLITS - 1) / NUM_THRADS_PER_BLOCK_LEAF_SPLITS;
AllocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
AllocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
} else {
num_blocks_init_from_gradients_ = (num_data + NUM_THRADS_PER_BLOCK_LEAF_SPLITS - 1) / NUM_THRADS_PER_BLOCK_LEAF_SPLITS;
}
num_data_ = num_data;
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
......@@ -206,12 +206,12 @@ class SerialTreeLearner: public TreeLearner {
std::unique_ptr<LeafSplits> smaller_leaf_splits_;
/*! \brief stores best thresholds for all feature for larger leaf */
std::unique_ptr<LeafSplits> larger_leaf_splits_;
#ifdef USE_GPU
#if defined(USE_GPU)
/*! \brief gradients of current iteration, ordered for cache optimized, aligned to 4K page */
std::vector<score_t, boost::alignment::aligned_allocator<score_t, 4096>> ordered_gradients_;
/*! \brief hessians of current iteration, ordered for cache optimized, aligned to 4K page */
std::vector<score_t, boost::alignment::aligned_allocator<score_t, 4096>> ordered_hessians_;
#elif USE_CUDA
#elif defined(USE_CUDA) || defined(USE_CUDA_EXP)
/*! \brief gradients of current iteration, ordered for cache optimized */
std::vector<score_t, CHAllocator<score_t>> ordered_gradients_;
/*! \brief hessians of current iteration, ordered for cache optimized */
......
......@@ -9,6 +9,7 @@
#include "linear_tree_learner.h"
#include "parallel_tree_learner.h"
#include "serial_tree_learner.h"
#include "cuda/cuda_single_gpu_tree_learner.hpp"
namespace LightGBM {
......@@ -48,6 +49,16 @@ TreeLearner* TreeLearner::CreateTreeLearner(const std::string& learner_type, con
} else if (learner_type == std::string("voting")) {
return new VotingParallelTreeLearner<CUDATreeLearner>(config);
}
} else if (device_type == std::string("cuda_exp")) {
if (learner_type == std::string("serial")) {
if (config->num_gpu == 1) {
return new CUDASingleGPUTreeLearner(config);
} else {
Log::Fatal("cuda_exp only supports training on a single GPU.");
}
} else {
Log::Fatal("cuda_exp only supports training on a single machine.");
}
}
return nullptr;
}
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment