Unverified commit 6b56a90c, authored by shiyu1994 and committed by GitHub

[CUDA] New CUDA version Part 1 (#4630)



* new cuda framework

* add histogram construction kernel (an illustrative kernel sketch follows this commit list)

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to GPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add deconstructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: default avatarYu Shi <shiyu1994@qq.com>
Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
parent b857ee10
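The commit list above revolves around the new histogram construction and split-finding kernels ("add histogram construction kernel", "speedup histogram construction by interleaving global memory access"). The actual kernels live in the collapsed treelearner diffs below; the sketch here only illustrates the general technique of accumulating a per-block shared-memory histogram with a grid-stride (interleaved, coalesced) read and then merging it into the global histogram. It is not the PR's kernel, and all names in it are illustrative.

#include <cuda_runtime.h>
#include <cstdint>

// Illustrative sketch only; not the kernel added by this PR.
// Each block accumulates (gradient, hessian) pairs into a shared-memory histogram,
// then atomically merges the block's histogram into the global one.
__global__ void SketchHistogramKernel(const uint8_t* bins, const float* gradients,
                                      const float* hessians, int num_data,
                                      int num_bins, float* global_hist) {
  extern __shared__ float shared_hist[];  // 2 * num_bins floats
  for (int i = threadIdx.x; i < 2 * num_bins; i += blockDim.x) {
    shared_hist[i] = 0.0f;
  }
  __syncthreads();
  // grid-stride loop: consecutive threads read consecutive rows, so global
  // memory accesses are coalesced (the "interleaved" access pattern)
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_data;
       i += gridDim.x * blockDim.x) {
    const int bin = static_cast<int>(bins[i]);
    atomicAdd(shared_hist + (bin << 1), gradients[i]);
    atomicAdd(shared_hist + (bin << 1) + 1, hessians[i]);
  }
  __syncthreads();
  for (int i = threadIdx.x; i < 2 * num_bins; i += blockDim.x) {
    atomicAdd(global_hist + i, shared_hist[i]);
  }
}

// Launch pattern: SketchHistogramKernel<<<num_blocks, 256, 2 * num_bins * sizeof(float)>>>(...),
// i.e. the shared histogram is sized via dynamic shared memory.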
......@@ -272,6 +272,16 @@ Dataset* DatasetLoader::LoadFromFile(const char* filename, int rank, int num_mac
is_load_from_binary = true;
Log::Info("Load from binary file %s", bin_filename.c_str());
dataset.reset(LoadFromBinFile(filename, bin_filename.c_str(), rank, num_machines, &num_global_data, &used_data_indices));
dataset->device_type_ = config_.device_type;
dataset->gpu_device_id_ = config_.gpu_device_id;
#ifdef USE_CUDA_EXP
if (config_.device_type == std::string("cuda_exp")) {
dataset->CreateCUDAColumnData();
dataset->metadata_.CreateCUDAMetadata(dataset->gpu_device_id_);
} else {
dataset->cuda_column_data_ = nullptr;
}
#endif // USE_CUDA_EXP
}
// check meta data
dataset->metadata_.CheckOrPartition(num_global_data, used_data_indices);
......
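The hunk above creates CUDA column data and CUDA metadata only when the experimental device is selected after a binary-file load. A minimal sketch of how that device would be requested through LightGBM's Config; only the device_type=cuda_exp value and the gpu_device_id field are confirmed by the diff, the Config::Set call pattern is an assumption.

#include <LightGBM/config.h>
#include <string>
#include <unordered_map>

// Sketch: select the experimental CUDA device so the cuda_exp branch above runs.
void UseExperimentalCUDADevice(LightGBM::Config* config) {
  std::unordered_map<std::string, std::string> params;
  params["device_type"] = "cuda_exp";   // value checked in the diff above
  params["gpu_device_id"] = "0";        // copied into Dataset::gpu_device_id_
  config->Set(params);
}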
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#include "dense_bin.hpp"
namespace LightGBM {
template <>
const void* DenseBin<uint8_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int /*num_threads*/) const {
*is_sparse = false;
*bit_type = 8;
bin_iterator->clear();
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint16_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int /*num_threads*/) const {
*is_sparse = false;
*bit_type = 16;
bin_iterator->clear();
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint32_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int /*num_threads*/) const {
*is_sparse = false;
*bit_type = 32;
bin_iterator->clear();
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint8_t, true>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int /*num_threads*/) const {
*is_sparse = false;
*bit_type = 4;
bin_iterator->clear();
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint8_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = false;
*bit_type = 8;
*bin_iterator = nullptr;
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint16_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = false;
*bit_type = 16;
*bin_iterator = nullptr;
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint32_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = false;
*bit_type = 32;
*bin_iterator = nullptr;
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint8_t, true>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = false;
*bit_type = 4;
*bin_iterator = nullptr;
return reinterpret_cast<const void*>(data_.data());
}
} // namespace LightGBM
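GetColWiseData() above returns the raw column buffer plus a bit_type of 4, 8, 16, or 32 describing how the bins are stored, with the 4-bit case packing two bins per byte. A hedged host-side sketch of how a consumer could widen such a column before copying it to the device; the nibble layout for bit_type == 4 is an assumption about DenseBin<uint8_t, true>, everything else follows the specializations above.

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: widen a dense column returned by GetColWiseData() to 32-bit bins.
std::vector<uint32_t> WidenDenseColumn(const void* col_data, uint8_t bit_type,
                                       size_t num_data) {
  std::vector<uint32_t> out(num_data);
  if (bit_type == 4) {
    // assumed packing: two bins per byte, low nibble holds the even row index
    const uint8_t* packed = static_cast<const uint8_t*>(col_data);
    for (size_t i = 0; i < num_data; ++i) {
      out[i] = (packed[i >> 1] >> ((i & 1) << 2)) & 0x0fu;
    }
  } else if (bit_type == 8) {
    const uint8_t* vals = static_cast<const uint8_t*>(col_data);
    for (size_t i = 0; i < num_data; ++i) out[i] = vals[i];
  } else if (bit_type == 16) {
    const uint16_t* vals = static_cast<const uint16_t*>(col_data);
    for (size_t i = 0; i < num_data; ++i) out[i] = vals[i];
  } else {  // bit_type == 32
    const uint32_t* vals = static_cast<const uint32_t*>(col_data);
    for (size_t i = 0; i < num_data; ++i) out[i] = vals[i];
  }
  return out;
}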
......@@ -461,9 +461,13 @@ class DenseBin : public Bin {
DenseBin<VAL_T, IS_4BIT>* Clone() override;
const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, std::vector<BinIterator*>* bin_iterator, const int num_threads) const override;
const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, BinIterator** bin_iterator) const override;
private:
data_size_t num_data_;
#ifdef USE_CUDA
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
std::vector<VAL_T, CHAllocator<VAL_T>> data_;
#else
std::vector<VAL_T, Common::AlignmentAllocator<VAL_T, kAlignedSize>> data_;
......
......@@ -18,6 +18,9 @@ Metadata::Metadata() {
weight_load_from_file_ = false;
query_load_from_file_ = false;
init_score_load_from_file_ = false;
#ifdef USE_CUDA_EXP
cuda_metadata_ = nullptr;
#endif // USE_CUDA_EXP
}
void Metadata::Init(const char* data_filename) {
......@@ -302,6 +305,11 @@ void Metadata::SetInitScore(const double* init_score, data_size_t len) {
init_score_[i] = Common::AvoidInf(init_score[i]);
}
init_score_load_from_file_ = false;
#ifdef USE_CUDA_EXP
if (cuda_metadata_ != nullptr) {
cuda_metadata_->SetInitScore(init_score_.data(), len);
}
#endif // USE_CUDA_EXP
}
void Metadata::SetLabel(const label_t* label, data_size_t len) {
......@@ -318,6 +326,11 @@ void Metadata::SetLabel(const label_t* label, data_size_t len) {
for (data_size_t i = 0; i < num_data_; ++i) {
label_[i] = Common::AvoidInf(label[i]);
}
#ifdef USE_CUDA_EXP
if (cuda_metadata_ != nullptr) {
cuda_metadata_->SetLabel(label_.data(), len);
}
#endif // USE_CUDA_EXP
}
void Metadata::SetWeights(const label_t* weights, data_size_t len) {
......@@ -340,6 +353,11 @@ void Metadata::SetWeights(const label_t* weights, data_size_t len) {
}
LoadQueryWeights();
weight_load_from_file_ = false;
#ifdef USE_CUDA_EXP
if (cuda_metadata_ != nullptr) {
cuda_metadata_->SetWeights(weights_.data(), len);
}
#endif // USE_CUDA_EXP
}
void Metadata::SetQuery(const data_size_t* query, data_size_t len) {
......@@ -366,6 +384,16 @@ void Metadata::SetQuery(const data_size_t* query, data_size_t len) {
}
LoadQueryWeights();
query_load_from_file_ = false;
#ifdef USE_CUDA_EXP
if (cuda_metadata_ != nullptr) {
if (query_weights_.size() > 0) {
CHECK_EQ(query_weights_.size(), static_cast<size_t>(num_queries_));
cuda_metadata_->SetQuery(query_boundaries_.data(), query_weights_.data(), num_queries_);
} else {
cuda_metadata_->SetQuery(query_boundaries_.data(), nullptr, num_queries_);
}
}
#endif // USE_CUDA_EXP
}
void Metadata::LoadWeights() {
......@@ -472,6 +500,13 @@ void Metadata::LoadQueryWeights() {
}
}
#ifdef USE_CUDA_EXP
void Metadata::CreateCUDAMetadata(const int gpu_device_id) {
cuda_metadata_.reset(new CUDAMetadata(gpu_device_id));
cuda_metadata_->Init(label_, weights_, query_boundaries_, query_weights_, init_score_);
}
#endif // USE_CUDA_EXP
void Metadata::LoadFromMemory(const void* memory) {
const char* mem_ptr = reinterpret_cast<const char*>(memory);
......
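Each setter above mirrors the freshly written host arrays into cuda_metadata_ when it exists, so CUDA objectives and metrics keep seeing current labels, weights, init scores, and query information. The CUDAMetadata setters themselves are not part of this diff; a minimal sketch of what such a device-side update could look like (the function and parameter names below are assumptions, not the PR's code).

#include <cuda_runtime.h>
#include <cstddef>

// Sketch only: overwrite a device-resident label array with the new host values.
// label_t is float in LightGBM's default build; SetWeights/SetInitScore would
// follow the same copy pattern with their own buffers.
void MirrorLabelsToDevice(float* cuda_label, const float* host_label, int len) {
  cudaMemcpy(cuda_label, host_label, sizeof(float) * static_cast<size_t>(len),
             cudaMemcpyHostToDevice);
}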
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#include "multi_val_dense_bin.hpp"
namespace LightGBM {
#ifdef USE_CUDA_EXP
template <>
const void* MultiValDenseBin<uint8_t>::GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = data_.data();
*bit_type = 8;
*total_size = static_cast<size_t>(num_data_) * static_cast<size_t>(num_feature_);
CHECK_EQ(*total_size, data_.size());
*is_sparse = false;
*out_data_ptr = nullptr;
*data_ptr_bit_type = 0;
return to_return;
}
template <>
const void* MultiValDenseBin<uint16_t>::GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint16_t* data_ptr = data_.data();
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_ptr);
*bit_type = 16;
*total_size = static_cast<size_t>(num_data_) * static_cast<size_t>(num_feature_);
CHECK_EQ(*total_size, data_.size());
*is_sparse = false;
*out_data_ptr = nullptr;
*data_ptr_bit_type = 0;
return to_return;
}
template <>
const void* MultiValDenseBin<uint32_t>::GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint32_t* data_ptr = data_.data();
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_ptr);
*bit_type = 32;
*total_size = static_cast<size_t>(num_data_) * static_cast<size_t>(num_feature_);
CHECK_EQ(*total_size, data_.size());
*is_sparse = false;
*out_data_ptr = nullptr;
*data_ptr_bit_type = 0;
return to_return;
}
#endif // USE_CUDA_EXP
} // namespace LightGBM
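For the dense multi-value bin, GetRowWiseData() reports bit_type (the element width) and total_size = num_data * num_feature, with is_sparse = false and no separate row-pointer array. A hedged sketch of how a consumer could size and perform the host-to-device copy from those outputs; the CUDA runtime calls are an assumed consumer, not the PR's CUDARowData code.

#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

// Sketch: copy the dense row-wise bin buffer to the GPU using the metadata
// reported by GetRowWiseData().
void* CopyDenseRowDataToDevice(const void* host_data, uint8_t bit_type,
                               size_t total_size) {
  const size_t num_bytes = total_size * (bit_type / 8);  // 8/16/32-bit elements
  void* device_data = nullptr;
  cudaMalloc(&device_data, num_bytes);
  cudaMemcpy(device_data, host_data, num_bytes, cudaMemcpyHostToDevice);
  return device_data;
}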
......@@ -7,6 +7,7 @@
#include <LightGBM/bin.h>
#include <LightGBM/utils/openmp_wrapper.h>
#include <LightGBM/utils/threading.h>
#include <algorithm>
#include <cstdint>
......@@ -210,6 +211,14 @@ class MultiValDenseBin : public MultiValBin {
MultiValDenseBin<VAL_T>* Clone() override;
#ifdef USE_CUDA_EXP
const void* GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const override;
#endif // USE_CUDA_EXP
private:
data_size_t num_data_;
int num_bin_;
......@@ -229,4 +238,5 @@ MultiValDenseBin<VAL_T>* MultiValDenseBin<VAL_T>::Clone() {
}
} // namespace LightGBM
#endif // LIGHTGBM_IO_MULTI_VAL_DENSE_BIN_HPP_
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#include "multi_val_sparse_bin.hpp"
namespace LightGBM {
#ifdef USE_CUDA_EXP
template <>
const void* MultiValSparseBin<uint16_t, uint8_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = data_.data();
*bit_type = 8;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 16;
return to_return;
}
template <>
const void* MultiValSparseBin<uint16_t, uint16_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 16;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 16;
return to_return;
}
template <>
const void* MultiValSparseBin<uint16_t, uint32_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 32;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 16;
return to_return;
}
template <>
const void* MultiValSparseBin<uint32_t, uint8_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = data_.data();
*bit_type = 8;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 32;
return to_return;
}
template <>
const void* MultiValSparseBin<uint32_t, uint16_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 16;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 32;
return to_return;
}
template <>
const void* MultiValSparseBin<uint32_t, uint32_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 32;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 32;
return to_return;
}
template <>
const void* MultiValSparseBin<uint64_t, uint8_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = data_.data();
*bit_type = 8;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 64;
return to_return;
}
template <>
const void* MultiValSparseBin<uint64_t, uint16_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 16;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 64;
return to_return;
}
template <>
const void* MultiValSparseBin<uint64_t, uint32_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 32;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 64;
return to_return;
}
#endif // USE_CUDA_EXP
} // namespace LightGBM
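The sparse specializations return the value buffer data_ and expose row_ptr_ through out_data_ptr, with bit_type giving the width of the bin values and data_ptr_bit_type (16/32/64) the width of the row offsets: a CSR-style layout in which row i's bins live in data_[row_ptr_[i] .. row_ptr_[i+1]). A hedged host-side sketch for one combination, 16-bit values with 32-bit row pointers; the loop is illustrative, only the layout follows the code above.

#include <cstddef>
#include <cstdint>

// Sketch: visit the non-zero bins of each row for the <uint32_t, uint16_t> case
// (bit_type == 16, data_ptr_bit_type == 32).
void VisitSparseRows(const void* values, const void* row_ptr_raw,
                     size_t num_data) {
  const uint16_t* bins = static_cast<const uint16_t*>(values);
  const uint32_t* row_ptr = static_cast<const uint32_t*>(row_ptr_raw);
  for (size_t row = 0; row < num_data; ++row) {
    for (uint32_t j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
      const uint16_t bin = bins[j];  // bin value within the row's feature groups
      (void)bin;                     // a real consumer would accumulate into a histogram here
    }
  }
}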
......@@ -7,6 +7,7 @@
#include <LightGBM/bin.h>
#include <LightGBM/utils/openmp_wrapper.h>
#include <LightGBM/utils/threading.h>
#include <algorithm>
#include <cstdint>
......@@ -290,6 +291,15 @@ class MultiValSparseBin : public MultiValBin {
MultiValSparseBin<INDEX_T, VAL_T>* Clone() override;
#ifdef USE_CUDA_EXP
const void* GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const override;
#endif // USE_CUDA_EXP
private:
data_size_t num_data_;
int num_bin_;
......@@ -317,4 +327,5 @@ MultiValSparseBin<INDEX_T, VAL_T>* MultiValSparseBin<INDEX_T, VAL_T>::Clone() {
}
} // namespace LightGBM
#endif // LIGHTGBM_IO_MULTI_VAL_SPARSE_BIN_HPP_
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#include "sparse_bin.hpp"
namespace LightGBM {
template <>
const void* SparseBin<uint8_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
*is_sparse = true;
*bit_type = 8;
for (int thread_index = 0; thread_index < num_threads; ++thread_index) {
bin_iterator->emplace_back(new SparseBinIterator<uint8_t>(this, 0));
}
return nullptr;
}
template <>
const void* SparseBin<uint16_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
*is_sparse = true;
*bit_type = 16;
for (int thread_index = 0; thread_index < num_threads; ++thread_index) {
bin_iterator->emplace_back(new SparseBinIterator<uint16_t>(this, 0));
}
return nullptr;
}
template <>
const void* SparseBin<uint32_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
*is_sparse = true;
*bit_type = 32;
for (int thread_index = 0; thread_index < num_threads; ++thread_index) {
bin_iterator->emplace_back(new SparseBinIterator<uint32_t>(this, 0));
}
return nullptr;
}
template <>
const void* SparseBin<uint8_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = true;
*bit_type = 8;
*bin_iterator = new SparseBinIterator<uint8_t>(this, 0);
return nullptr;
}
template <>
const void* SparseBin<uint16_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = true;
*bit_type = 16;
*bin_iterator = new SparseBinIterator<uint16_t>(this, 0);
return nullptr;
}
template <>
const void* SparseBin<uint32_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = true;
*bit_type = 32;
*bin_iterator = new SparseBinIterator<uint32_t>(this, 0);
return nullptr;
}
} // namespace LightGBM
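Unlike the dense bins, the sparse column path returns nullptr for the data pointer and instead hands back one SparseBinIterator per thread (or a single iterator in the second overload), so consumers read bins through iterators rather than from a raw buffer. A hedged sketch of per-thread use, assuming the usual BinIterator::Get(row) accessor; the chunking scheme is illustrative.

#include <LightGBM/bin.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: read a sparse column through the per-thread iterators created by
// GetColWiseData(); Get() is assumed to return the bin of the given row.
void ReadSparseColumn(const std::vector<LightGBM::BinIterator*>& iterators,
                      int num_threads, int num_data) {
  const int chunk = (num_data + num_threads - 1) / num_threads;
  for (int t = 0; t < num_threads; ++t) {
    const int start = t * chunk;
    const int end = std::min(num_data, start + chunk);
    for (int row = start; row < end; ++row) {
      const uint32_t bin = iterators[t]->Get(row);
      (void)bin;  // a real consumer would pack this into the CUDA column buffer
    }
  }
}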
......@@ -620,6 +620,10 @@ class SparseBin : public Bin {
}
}
const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, std::vector<BinIterator*>* bin_iterator, const int num_threads) const override;
const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, BinIterator** bin_iterator) const override;
private:
data_size_t num_data_;
std::vector<uint8_t, Common::AlignmentAllocator<uint8_t, kAlignedSize>>
......@@ -665,4 +669,5 @@ BinIterator* SparseBin<VAL_T>::GetIterator(uint32_t min_bin, uint32_t max_bin,
}
} // namespace LightGBM
#endif // LightGBM_IO_SPARSE_BIN_HPP_
......@@ -382,6 +382,9 @@ void TrainingShareStates::CalcBinOffsets(const std::vector<std::unique_ptr<Featu
}
num_hist_total_bin_ = static_cast<int>(feature_hist_offsets_.back());
}
#ifdef USE_CUDA_EXP
column_hist_offsets_ = *offsets;
#endif // USE_CUDA_EXP
}
void TrainingShareStates::SetMultiValBin(MultiValBin* bin, data_size_t num_data,
......
......@@ -53,6 +53,9 @@ Tree::Tree(int max_leaves, bool track_branch_features, bool is_linear)
leaf_features_.resize(max_leaves_);
leaf_features_inner_.resize(max_leaves_);
}
#ifdef USE_CUDA_EXP
is_cuda_tree_ = false;
#endif // USE_CUDA_EXP
}
int Tree::Split(int leaf, int feature, int real_feature, uint32_t threshold_bin,
......@@ -734,6 +737,10 @@ Tree::Tree(const char* str, size_t* used_len) {
is_linear_ = false;
}
#ifdef USE_CUDA_EXP
is_cuda_tree_ = false;
#endif // USE_CUDA_EXP
if ((num_leaves_ <= 1) && !is_linear_) {
return;
}
......
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifndef LIGHTGBM_TREELEARNER_CUDA_CUDA_BEST_SPLIT_FINDER_HPP_
#define LIGHTGBM_TREELEARNER_CUDA_CUDA_BEST_SPLIT_FINDER_HPP_
#ifdef USE_CUDA_EXP
#include <LightGBM/bin.h>
#include <LightGBM/dataset.h>
#include <vector>
#include <LightGBM/cuda/cuda_random.hpp>
#include <LightGBM/cuda/cuda_split_info.hpp>
#include "cuda_leaf_splits.hpp"
#define NUM_THREADS_PER_BLOCK_BEST_SPLIT_FINDER (256)
#define NUM_THREADS_FIND_BEST_LEAF (256)
#define NUM_TASKS_PER_SYNC_BLOCK (1024)
namespace LightGBM {
struct SplitFindTask {
int inner_feature_index;
bool reverse;
bool skip_default_bin;
bool na_as_missing;
bool assume_out_default_left;
bool is_categorical;
bool is_one_hot;
uint32_t hist_offset;
uint8_t mfb_offset;
uint32_t num_bin;
uint32_t default_bin;
int rand_threshold;
};
class CUDABestSplitFinder {
public:
CUDABestSplitFinder(
const hist_t* cuda_hist,
const Dataset* train_data,
const std::vector<uint32_t>& feature_hist_offsets,
const Config* config);
~CUDABestSplitFinder();
void InitFeatureMetaInfo(const Dataset* train_data);
void Init();
void InitCUDAFeatureMetaInfo();
void BeforeTrain(const std::vector<int8_t>& is_feature_used_bytree);
void FindBestSplitsForLeaf(
const CUDALeafSplitsStruct* smaller_leaf_splits,
const CUDALeafSplitsStruct* larger_leaf_splits,
const int smaller_leaf_index,
const int larger_leaf_index,
const data_size_t num_data_in_smaller_leaf,
const data_size_t num_data_in_larger_leaf,
const double sum_hessians_in_smaller_leaf,
const double sum_hessians_in_larger_leaf);
const CUDASplitInfo* FindBestFromAllSplits(
const int cur_num_leaves,
const int smaller_leaf_index,
const int larger_leaf_index,
int* smaller_leaf_best_split_feature,
uint32_t* smaller_leaf_best_split_threshold,
uint8_t* smaller_leaf_best_split_default_left,
int* larger_leaf_best_split_feature,
uint32_t* larger_leaf_best_split_threshold,
uint8_t* larger_leaf_best_split_default_left,
int* best_leaf_index,
int* num_cat_threshold);
void ResetTrainingData(
const hist_t* cuda_hist,
const Dataset* train_data,
const std::vector<uint32_t>& feature_hist_offsets);
void ResetConfig(const Config* config, const hist_t* cuda_hist);
private:
#define LaunchFindBestSplitsForLeafKernel_PARAMS \
const CUDALeafSplitsStruct* smaller_leaf_splits, \
const CUDALeafSplitsStruct* larger_leaf_splits, \
const int smaller_leaf_index, \
const int larger_leaf_index, \
const bool is_smaller_leaf_valid, \
const bool is_larger_leaf_valid
void LaunchFindBestSplitsForLeafKernel(LaunchFindBestSplitsForLeafKernel_PARAMS);
template <bool USE_RAND>
void LaunchFindBestSplitsForLeafKernelInner0(LaunchFindBestSplitsForLeafKernel_PARAMS);
template <bool USE_RAND, bool USE_L1>
void LaunchFindBestSplitsForLeafKernelInner1(LaunchFindBestSplitsForLeafKernel_PARAMS);
template <bool USE_RAND, bool USE_L1, bool USE_SMOOTHING>
void LaunchFindBestSplitsForLeafKernelInner2(LaunchFindBestSplitsForLeafKernel_PARAMS);
#undef LaunchFindBestSplitsForLeafKernel_PARAMS
void LaunchSyncBestSplitForLeafKernel(
const int host_smaller_leaf_index,
const int host_larger_leaf_index,
const bool is_smaller_leaf_valid,
const bool is_larger_leaf_valid);
void LaunchFindBestFromAllSplitsKernel(
const int cur_num_leaves,
const int smaller_leaf_index,
const int larger_leaf_index,
int* smaller_leaf_best_split_feature,
uint32_t* smaller_leaf_best_split_threshold,
uint8_t* smaller_leaf_best_split_default_left,
int* larger_leaf_best_split_feature,
uint32_t* larger_leaf_best_split_threshold,
uint8_t* larger_leaf_best_split_default_left,
int* best_leaf_index,
data_size_t* num_cat_threshold);
void AllocateCatVectors(CUDASplitInfo* cuda_split_infos, uint32_t* cat_threshold_vec, int* cat_threshold_real_vec, size_t len);
void LaunchAllocateCatVectorsKernel(CUDASplitInfo* cuda_split_infos, uint32_t* cat_threshold_vec, int* cat_threshold_real_vec, size_t len);
void LaunchInitCUDARandomKernel();
// Host memory
int num_features_;
int num_leaves_;
int max_num_bin_in_feature_;
std::vector<uint32_t> feature_hist_offsets_;
std::vector<uint8_t> feature_mfb_offsets_;
std::vector<uint32_t> feature_default_bins_;
std::vector<uint32_t> feature_num_bins_;
std::vector<MissingType> feature_missing_type_;
double lambda_l1_;
double lambda_l2_;
data_size_t min_data_in_leaf_;
double min_sum_hessian_in_leaf_;
double min_gain_to_split_;
double cat_smooth_;
double cat_l2_;
int max_cat_threshold_;
int min_data_per_group_;
int max_cat_to_onehot_;
bool extra_trees_;
int extra_seed_;
bool use_smoothing_;
double path_smooth_;
std::vector<cudaStream_t> cuda_streams_;
// for best split find tasks
std::vector<SplitFindTask> split_find_tasks_;
int num_tasks_;
// use global memory
bool use_global_memory_;
// number of total bins in the dataset
const int num_total_bin_;
// has categorical feature
bool has_categorical_feature_;
// maximum number of bins of categorical features
int max_num_categorical_bin_;
// marks whether a feature is categorical
std::vector<int8_t> is_categorical_;
// CUDA memory, held by this object
// for per leaf best split information
CUDASplitInfo* cuda_leaf_best_split_info_;
// for best split information when finding best split
CUDASplitInfo* cuda_best_split_info_;
// best split information buffer, to be copied to host
int* cuda_best_split_info_buffer_;
// find best split task information
CUDAVector<SplitFindTask> cuda_split_find_tasks_;
int8_t* cuda_is_feature_used_bytree_;
// used when finding best split with global memory
hist_t* cuda_feature_hist_grad_buffer_;
hist_t* cuda_feature_hist_hess_buffer_;
hist_t* cuda_feature_hist_stat_buffer_;
data_size_t* cuda_feature_hist_index_buffer_;
uint32_t* cuda_cat_threshold_leaf_;
int* cuda_cat_threshold_real_leaf_;
uint32_t* cuda_cat_threshold_feature_;
int* cuda_cat_threshold_real_feature_;
int max_num_categories_in_split_;
// used for extremely randomized trees
CUDAVector<CUDARandom> cuda_randoms_;
// CUDA memory, held by other object
const hist_t* cuda_hist_;
};
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_TREELEARNER_CUDA_CUDA_BEST_SPLIT_FINDER_HPP_
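The LaunchFindBestSplitsForLeafKernelInner0/1/2 declarations above implement the "multi-level template" pattern mentioned in the commit list: each level converts one runtime flag (extra_trees_, lambda_l1_ > 0, use_smoothing_) into a compile-time template argument so the innermost kernel launch is fully specialized. A hedged standalone sketch of that dispatch chain; the function names and bodies are illustrative, not the PR's code.

// Sketch of the boolean-template dispatch chain.
template <bool USE_RAND, bool USE_L1, bool USE_SMOOTHING>
void LaunchInner2() {
  // the real code would launch a kernel templated on all three flags here
}

template <bool USE_RAND, bool USE_L1>
void LaunchInner1(bool use_smoothing) {
  if (use_smoothing) LaunchInner2<USE_RAND, USE_L1, true>();
  else               LaunchInner2<USE_RAND, USE_L1, false>();
}

template <bool USE_RAND>
void LaunchInner0(bool use_l1, bool use_smoothing) {
  if (use_l1) LaunchInner1<USE_RAND, true>(use_smoothing);
  else        LaunchInner1<USE_RAND, false>(use_smoothing);
}

void Launch(bool use_rand, bool use_l1, bool use_smoothing) {
  if (use_rand) LaunchInner0<true>(use_l1, use_smoothing);
  else          LaunchInner0<false>(use_l1, use_smoothing);
}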