Unverified commit 6b56a90c, authored by shiyu1994 and committed by GitHub

[CUDA] New CUDA version Part 1 (#4630)



* new cuda framework

* add histogram construction kernel (an illustrative kernel sketch follows this commit list)

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to GPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add deconstructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: default avatarYu Shi <shiyu1994@qq.com>
Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
parent b857ee10
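The commit list above revolves around the new histogram construction and split-finding kernels ("add histogram construction kernel", "speedup histogram construction by interleaving global memory access"). The actual kernels live in the collapsed treelearner diffs below; the sketch here only illustrates the general technique of accumulating a per-block shared-memory histogram with a grid-stride (interleaved, coalesced) read and then merging it into the global histogram. It is not the PR's kernel, and all names in it are illustrative.

#include <cuda_runtime.h>
#include <cstdint>

// Illustrative sketch only; not the kernel added by this PR.
// Each block accumulates (gradient, hessian) pairs into a shared-memory histogram,
// then atomically merges the block's histogram into the global one.
__global__ void SketchHistogramKernel(const uint8_t* bins, const float* gradients,
                                      const float* hessians, int num_data,
                                      int num_bins, float* global_hist) {
  extern __shared__ float shared_hist[];  // 2 * num_bins floats
  for (int i = threadIdx.x; i < 2 * num_bins; i += blockDim.x) {
    shared_hist[i] = 0.0f;
  }
  __syncthreads();
  // grid-stride loop: consecutive threads read consecutive rows, so global
  // memory accesses are coalesced (the "interleaved" access pattern)
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_data;
       i += gridDim.x * blockDim.x) {
    const int bin = static_cast<int>(bins[i]);
    atomicAdd(shared_hist + (bin << 1), gradients[i]);
    atomicAdd(shared_hist + (bin << 1) + 1, hessians[i]);
  }
  __syncthreads();
  for (int i = threadIdx.x; i < 2 * num_bins; i += blockDim.x) {
    atomicAdd(global_hist + i, shared_hist[i]);
  }
}

// Launch pattern: SketchHistogramKernel<<<num_blocks, 256, 2 * num_bins * sizeof(float)>>>(...),
// i.e. the shared histogram is sized via dynamic shared memory.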
......@@ -272,6 +272,16 @@ Dataset* DatasetLoader::LoadFromFile(const char* filename, int rank, int num_mac
is_load_from_binary = true;
Log::Info("Load from binary file %s", bin_filename.c_str());
dataset.reset(LoadFromBinFile(filename, bin_filename.c_str(), rank, num_machines, &num_global_data, &used_data_indices));
dataset->device_type_ = config_.device_type;
dataset->gpu_device_id_ = config_.gpu_device_id;
#ifdef USE_CUDA_EXP
if (config_.device_type == std::string("cuda_exp")) {
dataset->CreateCUDAColumnData();
dataset->metadata_.CreateCUDAMetadata(dataset->gpu_device_id_);
} else {
dataset->cuda_column_data_ = nullptr;
}
#endif // USE_CUDA_EXP
}
// check meta data
dataset->metadata_.CheckOrPartition(num_global_data, used_data_indices);
......
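The hunk above creates CUDA column data and CUDA metadata only when the experimental device is selected after a binary-file load. A minimal sketch of how that device would be requested through LightGBM's Config; only the device_type=cuda_exp value and the gpu_device_id field are confirmed by the diff, the Config::Set call pattern is an assumption.

#include <LightGBM/config.h>
#include <string>
#include <unordered_map>

// Sketch: select the experimental CUDA device so the cuda_exp branch above runs.
void UseExperimentalCUDADevice(LightGBM::Config* config) {
  std::unordered_map<std::string, std::string> params;
  params["device_type"] = "cuda_exp";   // value checked in the diff above
  params["gpu_device_id"] = "0";        // copied into Dataset::gpu_device_id_
  config->Set(params);
}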
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#include "dense_bin.hpp"
namespace LightGBM {
template <>
const void* DenseBin<uint8_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int /*num_threads*/) const {
*is_sparse = false;
*bit_type = 8;
bin_iterator->clear();
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint16_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int /*num_threads*/) const {
*is_sparse = false;
*bit_type = 16;
bin_iterator->clear();
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint32_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int /*num_threads*/) const {
*is_sparse = false;
*bit_type = 32;
bin_iterator->clear();
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint8_t, true>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int /*num_threads*/) const {
*is_sparse = false;
*bit_type = 4;
bin_iterator->clear();
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint8_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = false;
*bit_type = 8;
*bin_iterator = nullptr;
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint16_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = false;
*bit_type = 16;
*bin_iterator = nullptr;
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint32_t, false>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = false;
*bit_type = 32;
*bin_iterator = nullptr;
return reinterpret_cast<const void*>(data_.data());
}
template <>
const void* DenseBin<uint8_t, true>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = false;
*bit_type = 4;
*bin_iterator = nullptr;
return reinterpret_cast<const void*>(data_.data());
}
} // namespace LightGBM
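GetColWiseData() above returns the raw column buffer plus a bit_type of 4, 8, 16, or 32 describing how the bins are stored, with the 4-bit case packing two bins per byte. A hedged host-side sketch of how a consumer could widen such a column before copying it to the device; the nibble layout for bit_type == 4 is an assumption about DenseBin<uint8_t, true>, everything else follows the specializations above.

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: widen a dense column returned by GetColWiseData() to 32-bit bins.
std::vector<uint32_t> WidenDenseColumn(const void* col_data, uint8_t bit_type,
                                       size_t num_data) {
  std::vector<uint32_t> out(num_data);
  if (bit_type == 4) {
    // assumed packing: two bins per byte, low nibble holds the even row index
    const uint8_t* packed = static_cast<const uint8_t*>(col_data);
    for (size_t i = 0; i < num_data; ++i) {
      out[i] = (packed[i >> 1] >> ((i & 1) << 2)) & 0x0fu;
    }
  } else if (bit_type == 8) {
    const uint8_t* vals = static_cast<const uint8_t*>(col_data);
    for (size_t i = 0; i < num_data; ++i) out[i] = vals[i];
  } else if (bit_type == 16) {
    const uint16_t* vals = static_cast<const uint16_t*>(col_data);
    for (size_t i = 0; i < num_data; ++i) out[i] = vals[i];
  } else {  // bit_type == 32
    const uint32_t* vals = static_cast<const uint32_t*>(col_data);
    for (size_t i = 0; i < num_data; ++i) out[i] = vals[i];
  }
  return out;
}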
......@@ -461,9 +461,13 @@ class DenseBin : public Bin {
DenseBin<VAL_T, IS_4BIT>* Clone() override;
const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, std::vector<BinIterator*>* bin_iterator, const int num_threads) const override;
const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, BinIterator** bin_iterator) const override;
private:
data_size_t num_data_;
#ifdef USE_CUDA
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
std::vector<VAL_T, CHAllocator<VAL_T>> data_;
#else
std::vector<VAL_T, Common::AlignmentAllocator<VAL_T, kAlignedSize>> data_;
......
......@@ -18,6 +18,9 @@ Metadata::Metadata() {
weight_load_from_file_ = false;
query_load_from_file_ = false;
init_score_load_from_file_ = false;
#ifdef USE_CUDA_EXP
cuda_metadata_ = nullptr;
#endif // USE_CUDA_EXP
}
void Metadata::Init(const char* data_filename) {
......@@ -302,6 +305,11 @@ void Metadata::SetInitScore(const double* init_score, data_size_t len) {
init_score_[i] = Common::AvoidInf(init_score[i]);
}
init_score_load_from_file_ = false;
#ifdef USE_CUDA_EXP
if (cuda_metadata_ != nullptr) {
cuda_metadata_->SetInitScore(init_score_.data(), len);
}
#endif // USE_CUDA_EXP
}
void Metadata::SetLabel(const label_t* label, data_size_t len) {
......@@ -318,6 +326,11 @@ void Metadata::SetLabel(const label_t* label, data_size_t len) {
for (data_size_t i = 0; i < num_data_; ++i) {
label_[i] = Common::AvoidInf(label[i]);
}
#ifdef USE_CUDA_EXP
if (cuda_metadata_ != nullptr) {
cuda_metadata_->SetLabel(label_.data(), len);
}
#endif // USE_CUDA_EXP
}
void Metadata::SetWeights(const label_t* weights, data_size_t len) {
......@@ -340,6 +353,11 @@ void Metadata::SetWeights(const label_t* weights, data_size_t len) {
}
LoadQueryWeights();
weight_load_from_file_ = false;
#ifdef USE_CUDA_EXP
if (cuda_metadata_ != nullptr) {
cuda_metadata_->SetWeights(weights_.data(), len);
}
#endif // USE_CUDA_EXP
}
void Metadata::SetQuery(const data_size_t* query, data_size_t len) {
......@@ -366,6 +384,16 @@ void Metadata::SetQuery(const data_size_t* query, data_size_t len) {
}
LoadQueryWeights();
query_load_from_file_ = false;
#ifdef USE_CUDA_EXP
if (cuda_metadata_ != nullptr) {
if (query_weights_.size() > 0) {
CHECK_EQ(query_weights_.size(), static_cast<size_t>(num_queries_));
cuda_metadata_->SetQuery(query_boundaries_.data(), query_weights_.data(), num_queries_);
} else {
cuda_metadata_->SetQuery(query_boundaries_.data(), nullptr, num_queries_);
}
}
#endif // USE_CUDA_EXP
}
void Metadata::LoadWeights() {
......@@ -472,6 +500,13 @@ void Metadata::LoadQueryWeights() {
}
}
#ifdef USE_CUDA_EXP
void Metadata::CreateCUDAMetadata(const int gpu_device_id) {
cuda_metadata_.reset(new CUDAMetadata(gpu_device_id));
cuda_metadata_->Init(label_, weights_, query_boundaries_, query_weights_, init_score_);
}
#endif // USE_CUDA_EXP
void Metadata::LoadFromMemory(const void* memory) {
const char* mem_ptr = reinterpret_cast<const char*>(memory);
......
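Each setter above mirrors the freshly written host arrays into cuda_metadata_ when it exists, so CUDA objectives and metrics keep seeing current labels, weights, init scores, and query information. The CUDAMetadata setters themselves are not part of this diff; a minimal sketch of what such a device-side update could look like (the function and parameter names below are assumptions, not the PR's code).

#include <cuda_runtime.h>
#include <cstddef>

// Sketch only: overwrite a device-resident label array with the new host values.
// label_t is float in LightGBM's default build; SetWeights/SetInitScore would
// follow the same copy pattern with their own buffers.
void MirrorLabelsToDevice(float* cuda_label, const float* host_label, int len) {
  cudaMemcpy(cuda_label, host_label, sizeof(float) * static_cast<size_t>(len),
             cudaMemcpyHostToDevice);
}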
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#include "multi_val_dense_bin.hpp"
namespace LightGBM {
#ifdef USE_CUDA_EXP
template <>
const void* MultiValDenseBin<uint8_t>::GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = data_.data();
*bit_type = 8;
*total_size = static_cast<size_t>(num_data_) * static_cast<size_t>(num_feature_);
CHECK_EQ(*total_size, data_.size());
*is_sparse = false;
*out_data_ptr = nullptr;
*data_ptr_bit_type = 0;
return to_return;
}
template <>
const void* MultiValDenseBin<uint16_t>::GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint16_t* data_ptr = data_.data();
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_ptr);
*bit_type = 16;
*total_size = static_cast<size_t>(num_data_) * static_cast<size_t>(num_feature_);
CHECK_EQ(*total_size, data_.size());
*is_sparse = false;
*out_data_ptr = nullptr;
*data_ptr_bit_type = 0;
return to_return;
}
template <>
const void* MultiValDenseBin<uint32_t>::GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint32_t* data_ptr = data_.data();
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_ptr);
*bit_type = 32;
*total_size = static_cast<size_t>(num_data_) * static_cast<size_t>(num_feature_);
CHECK_EQ(*total_size, data_.size());
*is_sparse = false;
*out_data_ptr = nullptr;
*data_ptr_bit_type = 0;
return to_return;
}
#endif // USE_CUDA_EXP
} // namespace LightGBM
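For the dense multi-value bin, GetRowWiseData() reports bit_type (the element width) and total_size = num_data * num_feature, with is_sparse = false and no separate row-pointer array. A hedged sketch of how a consumer could size and perform the host-to-device copy from those outputs; the CUDA runtime calls are an assumed consumer, not the PR's CUDARowData code.

#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

// Sketch: copy the dense row-wise bin buffer to the GPU using the metadata
// reported by GetRowWiseData().
void* CopyDenseRowDataToDevice(const void* host_data, uint8_t bit_type,
                               size_t total_size) {
  const size_t num_bytes = total_size * (bit_type / 8);  // 8/16/32-bit elements
  void* device_data = nullptr;
  cudaMalloc(&device_data, num_bytes);
  cudaMemcpy(device_data, host_data, num_bytes, cudaMemcpyHostToDevice);
  return device_data;
}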
......@@ -7,6 +7,7 @@
#include <LightGBM/bin.h>
#include <LightGBM/utils/openmp_wrapper.h>
#include <LightGBM/utils/threading.h>
#include <algorithm>
#include <cstdint>
......@@ -210,6 +211,14 @@ class MultiValDenseBin : public MultiValBin {
MultiValDenseBin<VAL_T>* Clone() override;
#ifdef USE_CUDA_EXP
const void* GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const override;
#endif // USE_CUDA_EXP
private:
data_size_t num_data_;
int num_bin_;
......@@ -229,4 +238,5 @@ MultiValDenseBin<VAL_T>* MultiValDenseBin<VAL_T>::Clone() {
}
} // namespace LightGBM
#endif // LIGHTGBM_IO_MULTI_VAL_DENSE_BIN_HPP_
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#include "multi_val_sparse_bin.hpp"
namespace LightGBM {
#ifdef USE_CUDA_EXP
template <>
const void* MultiValSparseBin<uint16_t, uint8_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = data_.data();
*bit_type = 8;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 16;
return to_return;
}
template <>
const void* MultiValSparseBin<uint16_t, uint16_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 16;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 16;
return to_return;
}
template <>
const void* MultiValSparseBin<uint16_t, uint32_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 32;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 16;
return to_return;
}
template <>
const void* MultiValSparseBin<uint32_t, uint8_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = data_.data();
*bit_type = 8;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 32;
return to_return;
}
template <>
const void* MultiValSparseBin<uint32_t, uint16_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 16;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 32;
return to_return;
}
template <>
const void* MultiValSparseBin<uint32_t, uint32_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 32;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 32;
return to_return;
}
template <>
const void* MultiValSparseBin<uint64_t, uint8_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = data_.data();
*bit_type = 8;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 64;
return to_return;
}
template <>
const void* MultiValSparseBin<uint64_t, uint16_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 16;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 64;
return to_return;
}
template <>
const void* MultiValSparseBin<uint64_t, uint32_t>::GetRowWiseData(
uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const {
const uint8_t* to_return = reinterpret_cast<const uint8_t*>(data_.data());
*bit_type = 32;
*total_size = data_.size();
*is_sparse = true;
*out_data_ptr = reinterpret_cast<const uint8_t*>(row_ptr_.data());
*data_ptr_bit_type = 64;
return to_return;
}
#endif // USE_CUDA_EXP
} // namespace LightGBM
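The sparse specializations return the value buffer data_ and expose row_ptr_ through out_data_ptr, with bit_type giving the width of the bin values and data_ptr_bit_type (16/32/64) the width of the row offsets: a CSR-style layout in which row i's bins live in data_[row_ptr_[i] .. row_ptr_[i+1]). A hedged host-side sketch for one combination, 16-bit values with 32-bit row pointers; the loop is illustrative, only the layout follows the code above.

#include <cstddef>
#include <cstdint>

// Sketch: visit the non-zero bins of each row for the <uint32_t, uint16_t> case
// (bit_type == 16, data_ptr_bit_type == 32).
void VisitSparseRows(const void* values, const void* row_ptr_raw,
                     size_t num_data) {
  const uint16_t* bins = static_cast<const uint16_t*>(values);
  const uint32_t* row_ptr = static_cast<const uint32_t*>(row_ptr_raw);
  for (size_t row = 0; row < num_data; ++row) {
    for (uint32_t j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
      const uint16_t bin = bins[j];  // bin value within the row's feature groups
      (void)bin;                     // a real consumer would accumulate into a histogram here
    }
  }
}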
......@@ -7,6 +7,7 @@
#include <LightGBM/bin.h>
#include <LightGBM/utils/openmp_wrapper.h>
#include <LightGBM/utils/threading.h>
#include <algorithm>
#include <cstdint>
......@@ -290,6 +291,15 @@ class MultiValSparseBin : public MultiValBin {
MultiValSparseBin<INDEX_T, VAL_T>* Clone() override;
#ifdef USE_CUDA_EXP
const void* GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const override;
#endif // USE_CUDA_EXP
private:
data_size_t num_data_;
int num_bin_;
......@@ -317,4 +327,5 @@ MultiValSparseBin<INDEX_T, VAL_T>* MultiValSparseBin<INDEX_T, VAL_T>::Clone() {
}
} // namespace LightGBM
#endif // LIGHTGBM_IO_MULTI_VAL_SPARSE_BIN_HPP_
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#include "sparse_bin.hpp"
namespace LightGBM {
template <>
const void* SparseBin<uint8_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
*is_sparse = true;
*bit_type = 8;
for (int thread_index = 0; thread_index < num_threads; ++thread_index) {
bin_iterator->emplace_back(new SparseBinIterator<uint8_t>(this, 0));
}
return nullptr;
}
template <>
const void* SparseBin<uint16_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
*is_sparse = true;
*bit_type = 16;
for (int thread_index = 0; thread_index < num_threads; ++thread_index) {
bin_iterator->emplace_back(new SparseBinIterator<uint16_t>(this, 0));
}
return nullptr;
}
template <>
const void* SparseBin<uint32_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
std::vector<BinIterator*>* bin_iterator,
const int num_threads) const {
*is_sparse = true;
*bit_type = 32;
for (int thread_index = 0; thread_index < num_threads; ++thread_index) {
bin_iterator->emplace_back(new SparseBinIterator<uint32_t>(this, 0));
}
return nullptr;
}
template <>
const void* SparseBin<uint8_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = true;
*bit_type = 8;
*bin_iterator = new SparseBinIterator<uint8_t>(this, 0);
return nullptr;
}
template <>
const void* SparseBin<uint16_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = true;
*bit_type = 16;
*bin_iterator = new SparseBinIterator<uint16_t>(this, 0);
return nullptr;
}
template <>
const void* SparseBin<uint32_t>::GetColWiseData(
uint8_t* bit_type,
bool* is_sparse,
BinIterator** bin_iterator) const {
*is_sparse = true;
*bit_type = 32;
*bin_iterator = new SparseBinIterator<uint32_t>(this, 0);
return nullptr;
}
} // namespace LightGBM
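Unlike the dense bins, the sparse column path returns nullptr for the data pointer and instead hands back one SparseBinIterator per thread (or a single iterator in the second overload), so consumers read bins through iterators rather than from a raw buffer. A hedged sketch of per-thread use, assuming the usual BinIterator::Get(row) accessor; the chunking scheme is illustrative.

#include <LightGBM/bin.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: read a sparse column through the per-thread iterators created by
// GetColWiseData(); Get() is assumed to return the bin of the given row.
void ReadSparseColumn(const std::vector<LightGBM::BinIterator*>& iterators,
                      int num_threads, int num_data) {
  const int chunk = (num_data + num_threads - 1) / num_threads;
  for (int t = 0; t < num_threads; ++t) {
    const int start = t * chunk;
    const int end = std::min(num_data, start + chunk);
    for (int row = start; row < end; ++row) {
      const uint32_t bin = iterators[t]->Get(row);
      (void)bin;  // a real consumer would pack this into the CUDA column buffer
    }
  }
}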
......@@ -620,6 +620,10 @@ class SparseBin : public Bin {
}
}
const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, std::vector<BinIterator*>* bin_iterator, const int num_threads) const override;
const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, BinIterator** bin_iterator) const override;
private:
data_size_t num_data_;
std::vector<uint8_t, Common::AlignmentAllocator<uint8_t, kAlignedSize>>
......@@ -665,4 +669,5 @@ BinIterator* SparseBin<VAL_T>::GetIterator(uint32_t min_bin, uint32_t max_bin,
}
} // namespace LightGBM
#endif // LightGBM_IO_SPARSE_BIN_HPP_
......@@ -382,6 +382,9 @@ void TrainingShareStates::CalcBinOffsets(const std::vector<std::unique_ptr<Featu
}
num_hist_total_bin_ = static_cast<int>(feature_hist_offsets_.back());
}
#ifdef USE_CUDA_EXP
column_hist_offsets_ = *offsets;
#endif // USE_CUDA_EXP
}
void TrainingShareStates::SetMultiValBin(MultiValBin* bin, data_size_t num_data,
......
......@@ -53,6 +53,9 @@ Tree::Tree(int max_leaves, bool track_branch_features, bool is_linear)
leaf_features_.resize(max_leaves_);
leaf_features_inner_.resize(max_leaves_);
}
#ifdef USE_CUDA_EXP
is_cuda_tree_ = false;
#endif // USE_CUDA_EXP
}
int Tree::Split(int leaf, int feature, int real_feature, uint32_t threshold_bin,
......@@ -734,6 +737,10 @@ Tree::Tree(const char* str, size_t* used_len) {
is_linear_ = false;
}
#ifdef USE_CUDA_EXP
is_cuda_tree_ = false;
#endif // USE_CUDA_EXP
if ((num_leaves_ <= 1) && !is_linear_) {
return;
}
......
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifndef LIGHTGBM_TREELEARNER_CUDA_CUDA_BEST_SPLIT_FINDER_HPP_
#define LIGHTGBM_TREELEARNER_CUDA_CUDA_BEST_SPLIT_FINDER_HPP_
#ifdef USE_CUDA_EXP
#include <LightGBM/bin.h>
#include <LightGBM/dataset.h>
#include <vector>
#include <LightGBM/cuda/cuda_random.hpp>
#include <LightGBM/cuda/cuda_split_info.hpp>
#include "cuda_leaf_splits.hpp"
#define NUM_THREADS_PER_BLOCK_BEST_SPLIT_FINDER (256)
#define NUM_THREADS_FIND_BEST_LEAF (256)
#define NUM_TASKS_PER_SYNC_BLOCK (1024)
namespace LightGBM {
struct SplitFindTask {
int inner_feature_index;
bool reverse;
bool skip_default_bin;
bool na_as_missing;
bool assume_out_default_left;
bool is_categorical;
bool is_one_hot;
uint32_t hist_offset;
uint8_t mfb_offset;
uint32_t num_bin;
uint32_t default_bin;
int rand_threshold;
};
class CUDABestSplitFinder {
public:
CUDABestSplitFinder(
const hist_t* cuda_hist,
const Dataset* train_data,
const std::vector<uint32_t>& feature_hist_offsets,
const Config* config);
~CUDABestSplitFinder();
void InitFeatureMetaInfo(const Dataset* train_data);
void Init();
void InitCUDAFeatureMetaInfo();
void BeforeTrain(const std::vector<int8_t>& is_feature_used_bytree);
void FindBestSplitsForLeaf(
const CUDALeafSplitsStruct* smaller_leaf_splits,
const CUDALeafSplitsStruct* larger_leaf_splits,
const int smaller_leaf_index,
const int larger_leaf_index,
const data_size_t num_data_in_smaller_leaf,
const data_size_t num_data_in_larger_leaf,
const double sum_hessians_in_smaller_leaf,
const double sum_hessians_in_larger_leaf);
const CUDASplitInfo* FindBestFromAllSplits(
const int cur_num_leaves,
const int smaller_leaf_index,
const int larger_leaf_index,
int* smaller_leaf_best_split_feature,
uint32_t* smaller_leaf_best_split_threshold,
uint8_t* smaller_leaf_best_split_default_left,
int* larger_leaf_best_split_feature,
uint32_t* larger_leaf_best_split_threshold,
uint8_t* larger_leaf_best_split_default_left,
int* best_leaf_index,
int* num_cat_threshold);
void ResetTrainingData(
const hist_t* cuda_hist,
const Dataset* train_data,
const std::vector<uint32_t>& feature_hist_offsets);
void ResetConfig(const Config* config, const hist_t* cuda_hist);
private:
#define LaunchFindBestSplitsForLeafKernel_PARAMS \
const CUDALeafSplitsStruct* smaller_leaf_splits, \
const CUDALeafSplitsStruct* larger_leaf_splits, \
const int smaller_leaf_index, \
const int larger_leaf_index, \
const bool is_smaller_leaf_valid, \
const bool is_larger_leaf_valid
void LaunchFindBestSplitsForLeafKernel(LaunchFindBestSplitsForLeafKernel_PARAMS);
template <bool USE_RAND>
void LaunchFindBestSplitsForLeafKernelInner0(LaunchFindBestSplitsForLeafKernel_PARAMS);
template <bool USE_RAND, bool USE_L1>
void LaunchFindBestSplitsForLeafKernelInner1(LaunchFindBestSplitsForLeafKernel_PARAMS);
template <bool USE_RAND, bool USE_L1, bool USE_SMOOTHING>
void LaunchFindBestSplitsForLeafKernelInner2(LaunchFindBestSplitsForLeafKernel_PARAMS);
#undef LaunchFindBestSplitsForLeafKernel_PARAMS
void LaunchSyncBestSplitForLeafKernel(
const int host_smaller_leaf_index,
const int host_larger_leaf_index,
const bool is_smaller_leaf_valid,
const bool is_larger_leaf_valid);
void LaunchFindBestFromAllSplitsKernel(
const int cur_num_leaves,
const int smaller_leaf_index,
const int larger_leaf_index,
int* smaller_leaf_best_split_feature,
uint32_t* smaller_leaf_best_split_threshold,
uint8_t* smaller_leaf_best_split_default_left,
int* larger_leaf_best_split_feature,
uint32_t* larger_leaf_best_split_threshold,
uint8_t* larger_leaf_best_split_default_left,
int* best_leaf_index,
data_size_t* num_cat_threshold);
void AllocateCatVectors(CUDASplitInfo* cuda_split_infos, uint32_t* cat_threshold_vec, int* cat_threshold_real_vec, size_t len);
void LaunchAllocateCatVectorsKernel(CUDASplitInfo* cuda_split_infos, uint32_t* cat_threshold_vec, int* cat_threshold_real_vec, size_t len);
void LaunchInitCUDARandomKernel();
// Host memory
int num_features_;
int num_leaves_;
int max_num_bin_in_feature_;
std::vector<uint32_t> feature_hist_offsets_;
std::vector<uint8_t> feature_mfb_offsets_;
std::vector<uint32_t> feature_default_bins_;
std::vector<uint32_t> feature_num_bins_;
std::vector<MissingType> feature_missing_type_;
double lambda_l1_;
double lambda_l2_;
data_size_t min_data_in_leaf_;
double min_sum_hessian_in_leaf_;
double min_gain_to_split_;
double cat_smooth_;
double cat_l2_;
int max_cat_threshold_;
int min_data_per_group_;
int max_cat_to_onehot_;
bool extra_trees_;
int extra_seed_;
bool use_smoothing_;
double path_smooth_;
std::vector<cudaStream_t> cuda_streams_;
// for best split find tasks
std::vector<SplitFindTask> split_find_tasks_;
int num_tasks_;
// use global memory
bool use_global_memory_;
// number of total bins in the dataset
const int num_total_bin_;
// has categorical feature
bool has_categorical_feature_;
// maximum number of bins of categorical features
int max_num_categorical_bin_;
// marks whether a feature is categorical
std::vector<int8_t> is_categorical_;
// CUDA memory, held by this object
// for per leaf best split information
CUDASplitInfo* cuda_leaf_best_split_info_;
// for best split information when finding best split
CUDASplitInfo* cuda_best_split_info_;
// best split information buffer, to be copied to host
int* cuda_best_split_info_buffer_;
// find best split task information
CUDAVector<SplitFindTask> cuda_split_find_tasks_;
int8_t* cuda_is_feature_used_bytree_;
// used when finding best split with global memory
hist_t* cuda_feature_hist_grad_buffer_;
hist_t* cuda_feature_hist_hess_buffer_;
hist_t* cuda_feature_hist_stat_buffer_;
data_size_t* cuda_feature_hist_index_buffer_;
uint32_t* cuda_cat_threshold_leaf_;
int* cuda_cat_threshold_real_leaf_;
uint32_t* cuda_cat_threshold_feature_;
int* cuda_cat_threshold_real_feature_;
int max_num_categories_in_split_;
// used for extremely randomized trees
CUDAVector<CUDARandom> cuda_randoms_;
// CUDA memory, held by other object
const hist_t* cuda_hist_;
};
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_TREELEARNER_CUDA_CUDA_BEST_SPLIT_FINDER_HPP_
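The LaunchFindBestSplitsForLeafKernelInner0/1/2 declarations above implement the "multi-level template" pattern mentioned in the commit list: each level converts one runtime flag (extra_trees_, lambda_l1_ > 0, use_smoothing_) into a compile-time template argument so the innermost kernel launch is fully specialized. A hedged standalone sketch of that dispatch chain; the function names and bodies are illustrative, not the PR's code.

// Sketch of the boolean-template dispatch chain.
template <bool USE_RAND, bool USE_L1, bool USE_SMOOTHING>
void LaunchInner2() {
  // the real code would launch a kernel templated on all three flags here
}

template <bool USE_RAND, bool USE_L1>
void LaunchInner1(bool use_smoothing) {
  if (use_smoothing) LaunchInner2<USE_RAND, USE_L1, true>();
  else               LaunchInner2<USE_RAND, USE_L1, false>();
}

template <bool USE_RAND>
void LaunchInner0(bool use_l1, bool use_smoothing) {
  if (use_l1) LaunchInner1<USE_RAND, true>(use_smoothing);
  else        LaunchInner1<USE_RAND, false>(use_smoothing);
}

void Launch(bool use_rand, bool use_l1, bool use_smoothing) {
  if (use_rand) LaunchInner0<true>(use_l1, use_smoothing);
  else          LaunchInner0<false>(use_l1, use_smoothing);
}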