Unverified commit 6b56a90c authored by shiyu1994, committed by GitHub

[CUDA] New CUDA version Part 1 (#4630)



* new cuda framework

* add histogram construction kernel

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to GPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add destructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: Yu Shi <shiyu1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
parent b857ee10
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifndef LIGHTGBM_TREELEARNER_CUDA_CUDA_HISTOGRAM_CONSTRUCTOR_HPP_
#define LIGHTGBM_TREELEARNER_CUDA_CUDA_HISTOGRAM_CONSTRUCTOR_HPP_
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_row_data.hpp>
#include <LightGBM/feature_group.h>
#include <LightGBM/tree.h>
#include <memory>
#include <vector>
#include "cuda_leaf_splits.hpp"
#define NUM_DATA_PER_THREAD (400)
#define NUM_THRADS_PER_BLOCK (504)
#define NUM_FEATURE_PER_THREAD_GROUP (28)
#define SUBTRACT_BLOCK_SIZE (1024)
#define FIX_HISTOGRAM_SHARED_MEM_SIZE (1024)
#define FIX_HISTOGRAM_BLOCK_SIZE (512)
#define USED_HISTOGRAM_BUFFER_NUM (8)
namespace LightGBM {
class CUDAHistogramConstructor {
public:
CUDAHistogramConstructor(
const Dataset* train_data,
const int num_leaves,
const int num_threads,
const std::vector<uint32_t>& feature_hist_offsets,
const int min_data_in_leaf,
const double min_sum_hessian_in_leaf,
const int gpu_device_id,
const bool gpu_use_dp);
~CUDAHistogramConstructor();
void Init(const Dataset* train_data, TrainingShareStates* share_state);
void ConstructHistogramForLeaf(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const CUDALeafSplitsStruct* cuda_larger_leaf_splits,
const data_size_t num_data_in_smaller_leaf,
const data_size_t num_data_in_larger_leaf,
const double sum_hessians_in_smaller_leaf,
const double sum_hessians_in_larger_leaf);
void ResetTrainingData(const Dataset* train_data, TrainingShareStates* share_states);
void ResetConfig(const Config* config);
void BeforeTrain(const score_t* gradients, const score_t* hessians);
const hist_t* cuda_hist() const { return cuda_hist_; }
hist_t* cuda_hist_pointer() { return cuda_hist_; }
private:
void InitFeatureMetaInfo(const Dataset* train_data, const std::vector<uint32_t>& feature_hist_offsets);
void CalcConstructHistogramKernelDim(
int* grid_dim_x,
int* grid_dim_y,
int* block_dim_x,
int* block_dim_y,
const data_size_t num_data_in_smaller_leaf);
template <typename HIST_TYPE, size_t SHARED_HIST_SIZE>
void LaunchConstructHistogramKernelInner(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
template <typename HIST_TYPE, size_t SHARED_HIST_SIZE, typename BIN_TYPE>
void LaunchConstructHistogramKernelInner0(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
template <typename HIST_TYPE, size_t SHARED_HIST_SIZE, typename BIN_TYPE, typename PTR_TYPE>
void LaunchConstructHistogramKernelInner1(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
template <typename HIST_TYPE, size_t SHARED_HIST_SIZE, typename BIN_TYPE, typename PTR_TYPE, bool USE_GLOBAL_MEM_BUFFER>
void LaunchConstructHistogramKernelInner2(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
void LaunchConstructHistogramKernel(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const data_size_t num_data_in_smaller_leaf);
void LaunchSubtractHistogramKernel(
const CUDALeafSplitsStruct* cuda_smaller_leaf_splits,
const CUDALeafSplitsStruct* cuda_larger_leaf_splits);
// Host memory
/*! \brief size of training data */
data_size_t num_data_;
/*! \brief number of features in training data */
int num_features_;
/*! \brief maximum number of leaves */
int num_leaves_;
/*! \brief number of threads */
int num_threads_;
/*! \brief total number of bins in histogram */
int num_total_bin_;
/*! \brief number of bins per feature */
std::vector<uint32_t> feature_num_bins_;
/*! \brief offsets in histogram of all features */
std::vector<uint32_t> feature_hist_offsets_;
/*! \brief most frequent bins in each feature */
std::vector<uint32_t> feature_most_freq_bins_;
/*! \brief minimum number of data allowed per leaf */
int min_data_in_leaf_;
/*! \brief minimum sum value of hessians allowed per leaf */
double min_sum_hessian_in_leaf_;
/*! \brief cuda stream for histogram construction */
cudaStream_t cuda_stream_;
/*! \brief indices of features whose histograms need to be fixed */
std::vector<int> need_fix_histogram_features_;
/*! \brief aligned number of bins of the features whose histograms need to be fixed */
std::vector<uint32_t> need_fix_histogram_features_num_bin_aligend_;
/*! \brief minimum number of blocks allowed in the y dimension */
const int min_grid_dim_y_ = 160;
// CUDA memory, held by this object
/*! \brief CUDA row wise data */
std::unique_ptr<CUDARowData> cuda_row_data_;
/*! \brief number of bins per feature */
uint32_t* cuda_feature_num_bins_;
/*! \brief offsets in histogram of all features */
uint32_t* cuda_feature_hist_offsets_;
/*! \brief most frequent bins in each feature */
uint32_t* cuda_feature_most_freq_bins_;
/*! \brief CUDA histograms */
hist_t* cuda_hist_;
/*! \brief CUDA histograms buffer for each block */
float* cuda_hist_buffer_;
/*! \brief indices of features whose histograms need to be fixed */
int* cuda_need_fix_histogram_features_;
/*! \brief aligned number of bins of the features whose histograms need to be fixed */
uint32_t* cuda_need_fix_histogram_features_num_bin_aligned_;
// CUDA memory, held by other object
/*! \brief gradients on CUDA */
const score_t* cuda_gradients_;
/*! \brief hessians on CUDA */
const score_t* cuda_hessians_;
/*! \brief GPU device index */
const int gpu_device_id_;
/*! \brief use double precision histogram per block */
const bool gpu_use_dp_;
};
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_TREELEARNER_CUDA_CUDA_HISTOGRAM_CONSTRUCTOR_HPP_
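The header above (its include guard points to src/treelearner/cuda/cuda_histogram_constructor.hpp) declares a ladder of LaunchConstructHistogramKernelInner templates that fix the histogram entry type (HIST_TYPE) and the shared-memory footprint (SHARED_HIST_SIZE) at compile time. The kernel bodies live in a .cu file outside this excerpt; the following is only a minimal sketch of the per-block shared-memory histogram pattern those parameters imply, with hypothetical names and a single gradient channel instead of the real gradient/hessian pairs.

// Minimal sketch, not the kernel from this PR: each block accumulates a
// private histogram in shared memory, then merges it into the global buffer.
// HIST_TYPE would be float or double depending on gpu_use_dp; atomicAdd on
// double requires compute capability >= 6.0.
#include <cstdint>

template <typename HIST_TYPE, size_t SHARED_HIST_SIZE>
__global__ void SharedHistSketch(const uint8_t* bins, const float* gradients,
                                 const int num_data, const int num_bins,
                                 HIST_TYPE* global_hist) {
  __shared__ HIST_TYPE shared_hist[SHARED_HIST_SIZE];
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
    shared_hist[i] = 0;
  }
  __syncthreads();
  // grid-stride loop, so consecutive threads touch consecutive rows
  // (the "interleaving global memory access" named in the commit message)
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_data;
       i += gridDim.x * blockDim.x) {
    atomicAdd(shared_hist + bins[i], static_cast<HIST_TYPE>(gradients[i]));
  }
  __syncthreads();
  for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
    atomicAdd(global_hist + i, shared_hist[i]);  // merge block-local histogram
  }
}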
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#include "cuda_leaf_splits.hpp"
namespace LightGBM {
CUDALeafSplits::CUDALeafSplits(const data_size_t num_data):
num_data_(num_data) {
cuda_struct_ = nullptr;
cuda_sum_of_gradients_buffer_ = nullptr;
cuda_sum_of_hessians_buffer_ = nullptr;
}
CUDALeafSplits::~CUDALeafSplits() {
DeallocateCUDAMemory<CUDALeafSplitsStruct>(&cuda_struct_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, __FILE__, __LINE__);
}
void CUDALeafSplits::Init() {
num_blocks_init_from_gradients_ = (num_data_ + NUM_THRADS_PER_BLOCK_LEAF_SPLITS - 1) / NUM_THRADS_PER_BLOCK_LEAF_SPLITS;
// allocate more memory for sum reduction in CUDA
// only the first element records the final sum
AllocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
AllocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
AllocateCUDAMemory<CUDALeafSplitsStruct>(&cuda_struct_, 1, __FILE__, __LINE__);
}
void CUDALeafSplits::InitValues() {
LaunchInitValuesEmptyKernel();
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDALeafSplits::InitValues(
const double lambda_l1, const double lambda_l2,
const score_t* cuda_gradients, const score_t* cuda_hessians,
const data_size_t* cuda_bagging_data_indices, const data_size_t* cuda_data_indices_in_leaf,
const data_size_t num_used_indices, hist_t* cuda_hist_in_leaf, double* root_sum_hessians) {
cuda_gradients_ = cuda_gradients;
cuda_hessians_ = cuda_hessians;
SetCUDAMemory<double>(cuda_sum_of_gradients_buffer_, 0, num_blocks_init_from_gradients_, __FILE__, __LINE__);
SetCUDAMemory<double>(cuda_sum_of_hessians_buffer_, 0, num_blocks_init_from_gradients_, __FILE__, __LINE__);
LaunchInitValuesKernel(lambda_l1, lambda_l2, cuda_bagging_data_indices, cuda_data_indices_in_leaf, num_used_indices, cuda_hist_in_leaf);
CopyFromCUDADeviceToHost<double>(root_sum_hessians, cuda_sum_of_hessians_buffer_, 1, __FILE__, __LINE__);
SynchronizeCUDADevice(__FILE__, __LINE__);
}
void CUDALeafSplits::Resize(const data_size_t num_data) {
if (num_data > num_data_) {
DeallocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, __FILE__, __LINE__);
num_blocks_init_from_gradients_ = (num_data + NUM_THRADS_PER_BLOCK_LEAF_SPLITS - 1) / NUM_THRADS_PER_BLOCK_LEAF_SPLITS;
AllocateCUDAMemory<double>(&cuda_sum_of_gradients_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
AllocateCUDAMemory<double>(&cuda_sum_of_hessians_buffer_, num_blocks_init_from_gradients_, __FILE__, __LINE__);
} else {
num_blocks_init_from_gradients_ = (num_data + NUM_THRADS_PER_BLOCK_LEAF_SPLITS - 1) / NUM_THRADS_PER_BLOCK_LEAF_SPLITS;
}
num_data_ = num_data;
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
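Init() and Resize() above size the per-block reduction buffers with the usual ceil division. A worked example of that arithmetic as a tiny standalone program (the numbers are hypothetical):

#include <cstdio>

// (num_data + threads_per_block - 1) / threads_per_block, as in Init()/Resize()
static int CeilDiv(int num_data, int threads_per_block) {
  return (num_data + threads_per_block - 1) / threads_per_block;
}

int main() {
  // e.g. 100000 rows with 1024-thread blocks -> 98 blocks, so Init() allocates
  // 98 doubles per buffer; CUDAInitValuesKernel2 later folds those partial
  // sums into element 0, the only element the host copies back.
  std::printf("%d blocks\n", CeilDiv(100000, 1024));
  return 0;
}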
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#include "cuda_leaf_splits.hpp"
#include <LightGBM/cuda/cuda_algorithms.hpp>
namespace LightGBM {
template <bool USE_INDICES>
__global__ void CUDAInitValuesKernel1(const score_t* cuda_gradients, const score_t* cuda_hessians,
const data_size_t num_data, const data_size_t* cuda_bagging_data_indices,
double* cuda_sum_of_gradients, double* cuda_sum_of_hessians) {
__shared__ double shared_mem_buffer[32];
const data_size_t data_index = static_cast<data_size_t>(threadIdx.x + blockIdx.x * blockDim.x);
double gradient = 0.0f;
double hessian = 0.0f;
if (data_index < num_data) {
gradient = USE_INDICES ? cuda_gradients[cuda_bagging_data_indices[data_index]] : cuda_gradients[data_index];
hessian = USE_INDICES ? cuda_hessians[cuda_bagging_data_indices[data_index]] : cuda_hessians[data_index];
}
const double block_sum_gradient = ShuffleReduceSum<double>(gradient, shared_mem_buffer, blockDim.x);
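// shared_mem_buffer is reused by the hessian reduction below; the
// __syncthreads() keeps the two reductions from overlapping in it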
__syncthreads();
const double block_sum_hessian = ShuffleReduceSum<double>(hessian, shared_mem_buffer, blockDim.x);
if (threadIdx.x == 0) {
cuda_sum_of_gradients[blockIdx.x] += block_sum_gradient;
cuda_sum_of_hessians[blockIdx.x] += block_sum_hessian;
}
}
__global__ void CUDAInitValuesKernel2(
const double lambda_l1,
const double lambda_l2,
const int num_blocks_to_reduce,
double* cuda_sum_of_gradients,
double* cuda_sum_of_hessians,
const data_size_t num_data,
const data_size_t* cuda_data_indices_in_leaf,
hist_t* cuda_hist_in_leaf,
CUDALeafSplitsStruct* cuda_struct) {
__shared__ double shared_mem_buffer[32];
double thread_sum_of_gradients = 0.0f;
double thread_sum_of_hessians = 0.0f;
for (int block_index = static_cast<int>(threadIdx.x); block_index < num_blocks_to_reduce; block_index += static_cast<int>(blockDim.x)) {
thread_sum_of_gradients += cuda_sum_of_gradients[block_index];
thread_sum_of_hessians += cuda_sum_of_hessians[block_index];
}
const double sum_of_gradients = ShuffleReduceSum<double>(thread_sum_of_gradients, shared_mem_buffer, blockDim.x);
__syncthreads();
const double sum_of_hessians = ShuffleReduceSum<double>(thread_sum_of_hessians, shared_mem_buffer, blockDim.x);
if (threadIdx.x == 0) {
cuda_sum_of_hessians[0] = sum_of_hessians;
cuda_struct->leaf_index = 0;
cuda_struct->sum_of_gradients = sum_of_gradients;
cuda_struct->sum_of_hessians = sum_of_hessians;
cuda_struct->num_data_in_leaf = num_data;
const bool use_l1 = lambda_l1 > 0.0f;
if (!use_l1) {
// no smoothing on root node
cuda_struct->gain = CUDALeafSplits::GetLeafGain<false, false>(sum_of_gradients, sum_of_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
} else {
// no smoothing on root node
cuda_struct->gain = CUDALeafSplits::GetLeafGain<true, false>(sum_of_gradients, sum_of_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
}
if (!use_l1) {
// no smoothing on root node
cuda_struct->leaf_value =
CUDALeafSplits::CalculateSplittedLeafOutput<false, false>(sum_of_gradients, sum_of_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
} else {
// no smoothing on root node
cuda_struct->leaf_value =
CUDALeafSplits::CalculateSplittedLeafOutput<true, false>(sum_of_gradients, sum_of_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
}
cuda_struct->data_indices_in_leaf = cuda_data_indices_in_leaf;
cuda_struct->hist_in_leaf = cuda_hist_in_leaf;
}
}
__global__ void InitValuesEmptyKernel(CUDALeafSplitsStruct* cuda_struct) {
cuda_struct->leaf_index = -1;
cuda_struct->sum_of_gradients = 0.0f;
cuda_struct->sum_of_hessians = 0.0f;
cuda_struct->num_data_in_leaf = 0;
cuda_struct->gain = 0.0f;
cuda_struct->leaf_value = 0.0f;
cuda_struct->data_indices_in_leaf = nullptr;
cuda_struct->hist_in_leaf = nullptr;
}
void CUDALeafSplits::LaunchInitValuesEmptyKernel() {
InitValuesEmptyKernel<<<1, 1>>>(cuda_struct_);
}
void CUDALeafSplits::LaunchInitValuesKernel(
const double lambda_l1, const double lambda_l2,
const data_size_t* cuda_bagging_data_indices,
const data_size_t* cuda_data_indices_in_leaf,
const data_size_t num_used_indices,
hist_t* cuda_hist_in_leaf) {
if (cuda_bagging_data_indices == nullptr) {
CUDAInitValuesKernel1<false><<<num_blocks_init_from_gradients_, NUM_THRADS_PER_BLOCK_LEAF_SPLITS>>>(
cuda_gradients_, cuda_hessians_, num_used_indices, nullptr, cuda_sum_of_gradients_buffer_,
cuda_sum_of_hessians_buffer_);
} else {
CUDAInitValuesKernel1<true><<<num_blocks_init_from_gradients_, NUM_THRADS_PER_BLOCK_LEAF_SPLITS>>>(
cuda_gradients_, cuda_hessians_, num_used_indices, cuda_bagging_data_indices, cuda_sum_of_gradients_buffer_,
cuda_sum_of_hessians_buffer_);
}
SynchronizeCUDADevice(__FILE__, __LINE__);
CUDAInitValuesKernel2<<<1, NUM_THRADS_PER_BLOCK_LEAF_SPLITS>>>(
lambda_l1, lambda_l2,
num_blocks_init_from_gradients_,
cuda_sum_of_gradients_buffer_,
cuda_sum_of_hessians_buffer_,
num_used_indices,
cuda_data_indices_in_leaf,
cuda_hist_in_leaf,
cuda_struct_);
SynchronizeCUDADevice(__FILE__, __LINE__);
}
} // namespace LightGBM
#endif // USE_CUDA_EXP
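ShuffleReduceSum comes from <LightGBM/cuda/cuda_algorithms.hpp>, which is not part of this excerpt. The 32-element shared buffers at the call sites match the canonical two-stage warp-shuffle reduction, sketched below; this is the standard pattern, not necessarily the exact implementation shipped in the PR.

// Sketch of a block-wide sum via warp shuffles: reduce within each warp,
// stage one partial per warp in shared memory, then reduce the partials in
// warp 0. The result is valid in thread 0, the only thread that uses it above.
// Assumes blockDim.x is a multiple of warpSize (true for the 1024-thread
// launches in this file).
__device__ double ShuffleReduceSumSketch(double value, double* shared_buffer,
                                         const unsigned int block_size) {
  const unsigned int lane = threadIdx.x % warpSize;
  const unsigned int warp_id = threadIdx.x / warpSize;
  for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
    value += __shfl_down_sync(0xffffffffu, value, offset);
  }
  if (lane == 0) {
    shared_buffer[warp_id] = value;  // at most 32 warps per block
  }
  __syncthreads();
  const unsigned int num_warps = (block_size + warpSize - 1) / warpSize;
  value = (threadIdx.x < num_warps) ? shared_buffer[lane] : 0.0;
  if (warp_id == 0) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
      value += __shfl_down_sync(0xffffffffu, value, offset);
    }
  }
  return value;
}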
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifndef LIGHTGBM_TREELEARNER_CUDA_CUDA_LEAF_SPLITS_HPP_
#define LIGHTGBM_TREELEARNER_CUDA_CUDA_LEAF_SPLITS_HPP_
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/bin.h>
#include <LightGBM/utils/log.h>
#include <LightGBM/meta.h>
#define NUM_THRADS_PER_BLOCK_LEAF_SPLITS (1024)
#define NUM_DATA_THREAD_ADD_LEAF_SPLITS (6)
namespace LightGBM {
struct CUDALeafSplitsStruct {
public:
int leaf_index;
double sum_of_gradients;
double sum_of_hessians;
data_size_t num_data_in_leaf;
double gain;
double leaf_value;
const data_size_t* data_indices_in_leaf;
hist_t* hist_in_leaf;
};
class CUDALeafSplits {
public:
explicit CUDALeafSplits(const data_size_t num_data);
~CUDALeafSplits();
void Init();
void InitValues(
const double lambda_l1, const double lambda_l2,
const score_t* cuda_gradients, const score_t* cuda_hessians,
const data_size_t* cuda_bagging_data_indices,
const data_size_t* cuda_data_indices_in_leaf, const data_size_t num_used_indices,
hist_t* cuda_hist_in_leaf, double* root_sum_hessians);
void InitValues();
const CUDALeafSplitsStruct* GetCUDAStruct() const { return cuda_struct_; }
CUDALeafSplitsStruct* GetCUDAStructRef() { return cuda_struct_; }
void Resize(const data_size_t num_data);
__device__ static double ThresholdL1(double s, double l1) {
const double reg_s = fmax(0.0, fabs(s) - l1);
if (s >= 0.0f) {
return reg_s;
} else {
return -reg_s;
}
}
template <bool USE_L1, bool USE_SMOOTHING>
__device__ static double CalculateSplittedLeafOutput(double sum_gradients,
double sum_hessians, double l1, double l2,
double path_smooth, data_size_t num_data,
double parent_output) {
double ret;
if (USE_L1) {
ret = -ThresholdL1(sum_gradients, l1) / (sum_hessians + l2);
} else {
ret = -sum_gradients / (sum_hessians + l2);
}
if (USE_SMOOTHING) {
ret = ret * (num_data / path_smooth) / (num_data / path_smooth + 1) \
+ parent_output / (num_data / path_smooth + 1);
}
return ret;
}
template <bool USE_L1>
__device__ static double GetLeafGainGivenOutput(double sum_gradients,
double sum_hessians, double l1,
double l2, double output) {
if (USE_L1) {
const double sg_l1 = ThresholdL1(sum_gradients, l1);
return -(2.0 * sg_l1 * output + (sum_hessians + l2) * output * output);
} else {
return -(2.0 * sum_gradients * output +
(sum_hessians + l2) * output * output);
}
}
template <bool USE_L1, bool USE_SMOOTHING>
__device__ static double GetLeafGain(double sum_gradients, double sum_hessians,
double l1, double l2,
double path_smooth, data_size_t num_data,
double parent_output) {
if (!USE_SMOOTHING) {
if (USE_L1) {
const double sg_l1 = ThresholdL1(sum_gradients, l1);
return (sg_l1 * sg_l1) / (sum_hessians + l2);
} else {
return (sum_gradients * sum_gradients) / (sum_hessians + l2);
}
} else {
const double output = CalculateSplittedLeafOutput<USE_L1, USE_SMOOTHING>(
sum_gradients, sum_hessians, l1, l2, path_smooth, num_data, parent_output);
return GetLeafGainGivenOutput<USE_L1>(sum_gradients, sum_hessians, l1, l2, output);
}
}
template <bool USE_L1, bool USE_SMOOTHING>
__device__ static double GetSplitGains(double sum_left_gradients,
double sum_left_hessians,
double sum_right_gradients,
double sum_right_hessians,
double l1, double l2,
double path_smooth,
data_size_t left_count,
data_size_t right_count,
double parent_output) {
return GetLeafGain<USE_L1, USE_SMOOTHING>(sum_left_gradients,
sum_left_hessians,
l1, l2, path_smooth, left_count, parent_output) +
GetLeafGain<USE_L1, USE_SMOOTHING>(sum_right_gradients,
sum_right_hessians,
l1, l2, path_smooth, right_count, parent_output);
}
private:
void LaunchInitValuesEmptyKernel();
void LaunchInitValuesKernel(
const double lambda_l1, const double lambda_l2,
const data_size_t* cuda_bagging_data_indices,
const data_size_t* cuda_data_indices_in_leaf,
const data_size_t num_used_indices,
hist_t* cuda_hist_in_leaf);
// Host memory
data_size_t num_data_;
int num_blocks_init_from_gradients_;
// CUDA memory, held by this object
CUDALeafSplitsStruct* cuda_struct_;
double* cuda_sum_of_gradients_buffer_;
double* cuda_sum_of_hessians_buffer_;
// CUDA memory, held by other object
const score_t* cuda_gradients_;
const score_t* cuda_hessians_;
};
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_TREELEARNER_CUDA_CUDA_LEAF_SPLITS_HPP_
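For reference, the device functions above implement the standard second-order leaf formulas. With leaf gradient sum $G$ and hessian sum $H$ and neither L1 nor path smoothing, CalculateSplittedLeafOutput and GetLeafGain reduce to

$$w^{*} = -\frac{G}{H + \lambda_2}, \qquad \mathrm{gain} = \frac{G^{2}}{H + \lambda_2}.$$

With L1 enabled, $G$ is first soft-thresholded by ThresholdL1, $T_{\lambda_1}(G) = \operatorname{sign}(G)\,\max(0, |G| - \lambda_1)$. With path smoothing $\alpha$ (path_smooth), $n$ data points in the leaf, and parent output $w_p$, the output is shrunk toward the parent,

$$w = w^{*}\cdot\frac{n/\alpha}{n/\alpha + 1} + w_p\cdot\frac{1}{n/\alpha + 1},$$

and the gain is then evaluated at that $w$ through GetLeafGainGivenOutput, i.e. $-\left(2\,T_{\lambda_1}(G)\,w + (H + \lambda_2)\,w^{2}\right)$.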
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#include "cuda_single_gpu_tree_learner.hpp"
#include <LightGBM/cuda/cuda_tree.hpp>
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/feature_group.h>
#include <LightGBM/network.h>
#include <LightGBM/objective_function.h>
#include <algorithm>
#include <memory>
namespace LightGBM {
CUDASingleGPUTreeLearner::CUDASingleGPUTreeLearner(const Config* config): SerialTreeLearner(config) {
cuda_gradients_ = nullptr;
cuda_hessians_ = nullptr;
}
CUDASingleGPUTreeLearner::~CUDASingleGPUTreeLearner() {
DeallocateCUDAMemory<score_t>(&cuda_gradients_, __FILE__, __LINE__);
DeallocateCUDAMemory<score_t>(&cuda_hessians_, __FILE__, __LINE__);
}
void CUDASingleGPUTreeLearner::Init(const Dataset* train_data, bool is_constant_hessian) {
SerialTreeLearner::Init(train_data, is_constant_hessian);
num_threads_ = OMP_NUM_THREADS();
// use the first gpu by default
gpu_device_id_ = config_->gpu_device_id >= 0 ? config_->gpu_device_id : 0;
SetCUDADevice(gpu_device_id_, __FILE__, __LINE__);
cuda_smaller_leaf_splits_.reset(new CUDALeafSplits(num_data_));
cuda_smaller_leaf_splits_->Init();
cuda_larger_leaf_splits_.reset(new CUDALeafSplits(num_data_));
cuda_larger_leaf_splits_->Init();
cuda_histogram_constructor_.reset(new CUDAHistogramConstructor(train_data_, config_->num_leaves, num_threads_,
share_state_->feature_hist_offsets(),
config_->min_data_in_leaf, config_->min_sum_hessian_in_leaf, gpu_device_id_, config_->gpu_use_dp));
cuda_histogram_constructor_->Init(train_data_, share_state_.get());
const auto& feature_hist_offsets = share_state_->feature_hist_offsets();
const int num_total_bin = feature_hist_offsets.empty() ? 0 : static_cast<int>(feature_hist_offsets.back());
cuda_data_partition_.reset(new CUDADataPartition(
train_data_, num_total_bin, config_->num_leaves, num_threads_,
cuda_histogram_constructor_->cuda_hist_pointer()));
cuda_data_partition_->Init();
cuda_best_split_finder_.reset(new CUDABestSplitFinder(cuda_histogram_constructor_->cuda_hist(),
train_data_, this->share_state_->feature_hist_offsets(), config_));
cuda_best_split_finder_->Init();
leaf_best_split_feature_.resize(config_->num_leaves, -1);
leaf_best_split_threshold_.resize(config_->num_leaves, 0);
leaf_best_split_default_left_.resize(config_->num_leaves, 0);
leaf_num_data_.resize(config_->num_leaves, 0);
leaf_data_start_.resize(config_->num_leaves, 0);
leaf_sum_hessians_.resize(config_->num_leaves, 0.0f);
AllocateCUDAMemory<score_t>(&cuda_gradients_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
AllocateCUDAMemory<score_t>(&cuda_hessians_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
AllocateBitset();
cuda_leaf_gradient_stat_buffer_ = nullptr;
cuda_leaf_hessian_stat_buffer_ = nullptr;
leaf_stat_buffer_size_ = 0;
num_cat_threshold_ = 0;
}
void CUDASingleGPUTreeLearner::BeforeTrain() {
const data_size_t root_num_data = cuda_data_partition_->root_num_data();
CopyFromHostToCUDADevice<score_t>(cuda_gradients_, gradients_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
CopyFromHostToCUDADevice<score_t>(cuda_hessians_, hessians_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
const data_size_t* leaf_splits_init_indices =
cuda_data_partition_->use_bagging() ? cuda_data_partition_->cuda_data_indices() : nullptr;
cuda_data_partition_->BeforeTrain();
cuda_smaller_leaf_splits_->InitValues(
config_->lambda_l1,
config_->lambda_l2,
cuda_gradients_,
cuda_hessians_,
leaf_splits_init_indices,
cuda_data_partition_->cuda_data_indices(),
root_num_data,
cuda_histogram_constructor_->cuda_hist_pointer(),
&leaf_sum_hessians_[0]);
leaf_num_data_[0] = root_num_data;
cuda_larger_leaf_splits_->InitValues();
cuda_histogram_constructor_->BeforeTrain(cuda_gradients_, cuda_hessians_);
col_sampler_.ResetByTree();
cuda_best_split_finder_->BeforeTrain(col_sampler_.is_feature_used_bytree());
leaf_data_start_[0] = 0;
smaller_leaf_index_ = 0;
larger_leaf_index_ = -1;
}
void CUDASingleGPUTreeLearner::AddPredictionToScore(const Tree* tree, double* out_score) const {
cuda_data_partition_->UpdateTrainScore(tree, out_score);
}
Tree* CUDASingleGPUTreeLearner::Train(const score_t* gradients,
const score_t* hessians, bool /*is_first_tree*/) {
gradients_ = gradients;
hessians_ = hessians;
global_timer.Start("CUDASingleGPUTreeLearner::BeforeTrain");
BeforeTrain();
global_timer.Stop("CUDASingleGPUTreeLearner::BeforeTrain");
const bool track_branch_features = !(config_->interaction_constraints_vector.empty());
std::unique_ptr<CUDATree> tree(new CUDATree(config_->num_leaves, track_branch_features,
config_->linear_tree, config_->gpu_device_id, has_categorical_feature_));
for (int i = 0; i < config_->num_leaves - 1; ++i) {
global_timer.Start("CUDASingleGPUTreeLearner::ConstructHistogramForLeaf");
const data_size_t num_data_in_smaller_leaf = leaf_num_data_[smaller_leaf_index_];
const data_size_t num_data_in_larger_leaf = larger_leaf_index_ < 0 ? 0 : leaf_num_data_[larger_leaf_index_];
const double sum_hessians_in_smaller_leaf = leaf_sum_hessians_[smaller_leaf_index_];
const double sum_hessians_in_larger_leaf = larger_leaf_index_ < 0 ? 0 : leaf_sum_hessians_[larger_leaf_index_];
cuda_histogram_constructor_->ConstructHistogramForLeaf(
cuda_smaller_leaf_splits_->GetCUDAStruct(),
cuda_larger_leaf_splits_->GetCUDAStruct(),
num_data_in_smaller_leaf,
num_data_in_larger_leaf,
sum_hessians_in_smaller_leaf,
sum_hessians_in_larger_leaf);
global_timer.Stop("CUDASingleGPUTreeLearner::ConstructHistogramForLeaf");
global_timer.Start("CUDASingleGPUTreeLearner::FindBestSplitsForLeaf");
cuda_best_split_finder_->FindBestSplitsForLeaf(
cuda_smaller_leaf_splits_->GetCUDAStruct(),
cuda_larger_leaf_splits_->GetCUDAStruct(),
smaller_leaf_index_, larger_leaf_index_,
num_data_in_smaller_leaf, num_data_in_larger_leaf,
sum_hessians_in_smaller_leaf, sum_hessians_in_larger_leaf);
global_timer.Stop("CUDASingleGPUTreeLearner::FindBestSplitsForLeaf");
global_timer.Start("CUDASingleGPUTreeLearner::FindBestFromAllSplits");
const CUDASplitInfo* best_split_info = nullptr;
if (larger_leaf_index_ >= 0) {
best_split_info = cuda_best_split_finder_->FindBestFromAllSplits(
tree->num_leaves(),
smaller_leaf_index_,
larger_leaf_index_,
&leaf_best_split_feature_[smaller_leaf_index_],
&leaf_best_split_threshold_[smaller_leaf_index_],
&leaf_best_split_default_left_[smaller_leaf_index_],
&leaf_best_split_feature_[larger_leaf_index_],
&leaf_best_split_threshold_[larger_leaf_index_],
&leaf_best_split_default_left_[larger_leaf_index_],
&best_leaf_index_,
&num_cat_threshold_);
} else {
best_split_info = cuda_best_split_finder_->FindBestFromAllSplits(
tree->num_leaves(),
smaller_leaf_index_,
larger_leaf_index_,
&leaf_best_split_feature_[smaller_leaf_index_],
&leaf_best_split_threshold_[smaller_leaf_index_],
&leaf_best_split_default_left_[smaller_leaf_index_],
nullptr,
nullptr,
nullptr,
&best_leaf_index_,
&num_cat_threshold_);
}
global_timer.Stop("CUDASingleGPUTreeLearner::FindBestFromAllSplits");
if (best_leaf_index_ == -1) {
Log::Warning("No further splits with positive gain, training stopped with %d leaves.", (i + 1));
break;
}
global_timer.Start("CUDASingleGPUTreeLearner::Split");
if (num_cat_threshold_ > 0) {
ConstructBitsetForCategoricalSplit(best_split_info);
}
int right_leaf_index = 0;
if (train_data_->FeatureBinMapper(leaf_best_split_feature_[best_leaf_index_])->bin_type() == BinType::CategoricalBin) {
right_leaf_index = tree->SplitCategorical(best_leaf_index_,
train_data_->RealFeatureIndex(leaf_best_split_feature_[best_leaf_index_]),
train_data_->FeatureBinMapper(leaf_best_split_feature_[best_leaf_index_])->missing_type(),
best_split_info,
cuda_bitset_,
cuda_bitset_len_,
cuda_bitset_inner_,
cuda_bitset_inner_len_);
} else {
right_leaf_index = tree->Split(best_leaf_index_,
train_data_->RealFeatureIndex(leaf_best_split_feature_[best_leaf_index_]),
train_data_->RealThreshold(leaf_best_split_feature_[best_leaf_index_],
leaf_best_split_threshold_[best_leaf_index_]),
train_data_->FeatureBinMapper(leaf_best_split_feature_[best_leaf_index_])->missing_type(),
best_split_info);
}
double sum_left_gradients = 0.0f;
double sum_right_gradients = 0.0f;
cuda_data_partition_->Split(best_split_info,
best_leaf_index_,
right_leaf_index,
leaf_best_split_feature_[best_leaf_index_],
leaf_best_split_threshold_[best_leaf_index_],
cuda_bitset_inner_,
static_cast<int>(cuda_bitset_inner_len_),
leaf_best_split_default_left_[best_leaf_index_],
leaf_num_data_[best_leaf_index_],
leaf_data_start_[best_leaf_index_],
cuda_smaller_leaf_splits_->GetCUDAStructRef(),
cuda_larger_leaf_splits_->GetCUDAStructRef(),
&leaf_num_data_[best_leaf_index_],
&leaf_num_data_[right_leaf_index],
&leaf_data_start_[best_leaf_index_],
&leaf_data_start_[right_leaf_index],
&leaf_sum_hessians_[best_leaf_index_],
&leaf_sum_hessians_[right_leaf_index],
&sum_left_gradients,
&sum_right_gradients);
#ifdef DEBUG
CheckSplitValid(best_leaf_index_, right_leaf_index, sum_left_gradients, sum_right_gradients);
#endif // DEBUG
smaller_leaf_index_ = (leaf_num_data_[best_leaf_index_] < leaf_num_data_[right_leaf_index] ? best_leaf_index_ : right_leaf_index);
larger_leaf_index_ = (smaller_leaf_index_ == best_leaf_index_ ? right_leaf_index : best_leaf_index_);
global_timer.Stop("CUDASingleGPUTreeLearner::Split");
}
SynchronizeCUDADevice(__FILE__, __LINE__);
tree->ToHost();
return tree.release();
}
void CUDASingleGPUTreeLearner::ResetTrainingData(
const Dataset* train_data,
bool is_constant_hessian) {
SerialTreeLearner::ResetTrainingData(train_data, is_constant_hessian);
CHECK_EQ(num_features_, train_data_->num_features());
cuda_histogram_constructor_->ResetTrainingData(train_data, share_state_.get());
cuda_data_partition_->ResetTrainingData(train_data,
static_cast<int>(share_state_->feature_hist_offsets().back()),
cuda_histogram_constructor_->cuda_hist_pointer());
cuda_best_split_finder_->ResetTrainingData(
cuda_histogram_constructor_->cuda_hist(),
train_data,
share_state_->feature_hist_offsets());
cuda_smaller_leaf_splits_->Resize(num_data_);
cuda_larger_leaf_splits_->Resize(num_data_);
CHECK_EQ(is_constant_hessian, share_state_->is_constant_hessian);
DeallocateCUDAMemory<score_t>(&cuda_gradients_, __FILE__, __LINE__);
DeallocateCUDAMemory<score_t>(&cuda_hessians_, __FILE__, __LINE__);
AllocateCUDAMemory<score_t>(&cuda_gradients_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
AllocateCUDAMemory<score_t>(&cuda_hessians_, static_cast<size_t>(num_data_), __FILE__, __LINE__);
}
void CUDASingleGPUTreeLearner::ResetConfig(const Config* config) {
const int old_num_leaves = config_->num_leaves;
SerialTreeLearner::ResetConfig(config);
if (config_->gpu_device_id >= 0 && config_->gpu_device_id != gpu_device_id_) {
Log::Fatal("Changing gpu device ID by resetting configuration parameter is not allowed for CUDA tree learner.");
}
num_threads_ = OMP_NUM_THREADS();
if (config_->num_leaves != old_num_leaves) {
leaf_best_split_feature_.resize(config_->num_leaves, -1);
leaf_best_split_threshold_.resize(config_->num_leaves, 0);
leaf_best_split_default_left_.resize(config_->num_leaves, 0);
leaf_num_data_.resize(config_->num_leaves, 0);
leaf_data_start_.resize(config_->num_leaves, 0);
leaf_sum_hessians_.resize(config_->num_leaves, 0.0f);
}
cuda_histogram_constructor_->ResetConfig(config);
cuda_best_split_finder_->ResetConfig(config, cuda_histogram_constructor_->cuda_hist());
cuda_data_partition_->ResetConfig(config, cuda_histogram_constructor_->cuda_hist_pointer());
}
void CUDASingleGPUTreeLearner::SetBaggingData(const Dataset* /*subset*/,
const data_size_t* used_indices, data_size_t num_data) {
cuda_data_partition_->SetUsedDataIndices(used_indices, num_data);
}
void CUDASingleGPUTreeLearner::RenewTreeOutput(Tree* tree, const ObjectiveFunction* obj, std::function<double(const label_t*, int)> residual_getter,
data_size_t total_num_data, const data_size_t* bag_indices, data_size_t bag_cnt) const {
CHECK(tree->is_cuda_tree());
CUDATree* cuda_tree = reinterpret_cast<CUDATree*>(tree);
if (obj != nullptr && obj->IsRenewTreeOutput()) {
CHECK_LE(cuda_tree->num_leaves(), data_partition_->num_leaves());
const data_size_t* bag_mapper = nullptr;
if (total_num_data != num_data_) {
CHECK_EQ(bag_cnt, num_data_);
bag_mapper = bag_indices;
}
std::vector<int> n_nozeroworker_perleaf(tree->num_leaves(), 1);
int num_machines = Network::num_machines();
#pragma omp parallel for schedule(static)
for (int i = 0; i < tree->num_leaves(); ++i) {
const double output = static_cast<double>(tree->LeafOutput(i));
data_size_t cnt_leaf_data = leaf_num_data_[i];
std::vector<data_size_t> index_mapper(cnt_leaf_data, -1);
CopyFromCUDADeviceToHost<data_size_t>(index_mapper.data(),
cuda_data_partition_->cuda_data_indices() + leaf_data_start_[i],
static_cast<size_t>(cnt_leaf_data), __FILE__, __LINE__);
if (cnt_leaf_data > 0) {
const double new_output = obj->RenewTreeOutput(output, residual_getter, index_mapper.data(), bag_mapper, cnt_leaf_data);
tree->SetLeafOutput(i, new_output);
} else {
CHECK_GT(num_machines, 1);
tree->SetLeafOutput(i, 0.0);
n_nozeroworker_perleaf[i] = 0;
}
}
if (num_machines > 1) {
std::vector<double> outputs(tree->num_leaves());
for (int i = 0; i < tree->num_leaves(); ++i) {
outputs[i] = static_cast<double>(tree->LeafOutput(i));
}
outputs = Network::GlobalSum(&outputs);
n_nozeroworker_perleaf = Network::GlobalSum(&n_nozeroworker_perleaf);
for (int i = 0; i < tree->num_leaves(); ++i) {
tree->SetLeafOutput(i, outputs[i] / n_nozeroworker_perleaf[i]);
}
}
}
cuda_tree->SyncLeafOutputFromHostToCUDA();
}
Tree* CUDASingleGPUTreeLearner::FitByExistingTree(const Tree* old_tree, const score_t* gradients, const score_t* hessians) const {
std::unique_ptr<CUDATree> cuda_tree(new CUDATree(old_tree));
SetCUDAMemory<double>(cuda_leaf_gradient_stat_buffer_, 0, static_cast<size_t>(old_tree->num_leaves()), __FILE__, __LINE__);
SetCUDAMemory<double>(cuda_leaf_hessian_stat_buffer_, 0, static_cast<size_t>(old_tree->num_leaves()), __FILE__, __LINE__);
ReduceLeafStat(cuda_tree.get(), gradients, hessians, cuda_data_partition_->cuda_data_indices());
cuda_tree->SyncLeafOutputFromCUDAToHost();
return cuda_tree.release();
}
Tree* CUDASingleGPUTreeLearner::FitByExistingTree(const Tree* old_tree, const std::vector<int>& leaf_pred,
const score_t* gradients, const score_t* hessians) const {
cuda_data_partition_->ResetByLeafPred(leaf_pred, old_tree->num_leaves());
refit_num_data_ = static_cast<data_size_t>(leaf_pred.size());
data_size_t buffer_size = static_cast<data_size_t>(old_tree->num_leaves());
if (old_tree->num_leaves() > 2048) {
const int num_block = (refit_num_data_ + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
buffer_size *= static_cast<data_size_t>(num_block + 1);
}
if (buffer_size != leaf_stat_buffer_size_) {
if (leaf_stat_buffer_size_ != 0) {
DeallocateCUDAMemory<double>(&cuda_leaf_gradient_stat_buffer_, __FILE__, __LINE__);
DeallocateCUDAMemory<double>(&cuda_leaf_hessian_stat_buffer_, __FILE__, __LINE__);
}
AllocateCUDAMemory<double>(&cuda_leaf_gradient_stat_buffer_, static_cast<size_t>(buffer_size), __FILE__, __LINE__);
AllocateCUDAMemory<double>(&cuda_leaf_hessian_stat_buffer_, static_cast<size_t>(buffer_size), __FILE__, __LINE__);
}
return FitByExistingTree(old_tree, gradients, hessians);
}
void CUDASingleGPUTreeLearner::ReduceLeafStat(
CUDATree* old_tree, const score_t* gradients, const score_t* hessians, const data_size_t* num_data_in_leaf) const {
LaunchReduceLeafStatKernel(gradients, hessians, num_data_in_leaf, old_tree->cuda_leaf_parent(),
old_tree->cuda_left_child(), old_tree->cuda_right_child(),
old_tree->num_leaves(), refit_num_data_, old_tree->cuda_leaf_value_ref(), old_tree->shrinkage());
}
void CUDASingleGPUTreeLearner::ConstructBitsetForCategoricalSplit(
const CUDASplitInfo* best_split_info) {
LaunchConstructBitsetForCategoricalSplitKernel(best_split_info);
}
void CUDASingleGPUTreeLearner::AllocateBitset() {
has_categorical_feature_ = false;
categorical_bin_offsets_.clear();
categorical_bin_offsets_.push_back(0);
categorical_bin_to_value_.clear();
for (int i = 0; i < train_data_->num_features(); ++i) {
const BinMapper* bin_mapper = train_data_->FeatureBinMapper(i);
if (bin_mapper->bin_type() == BinType::CategoricalBin) {
has_categorical_feature_ = true;
break;
}
}
if (has_categorical_feature_) {
int max_cat_value = 0;
int max_cat_num_bin = 0;
for (int i = 0; i < train_data_->num_features(); ++i) {
const BinMapper* bin_mapper = train_data_->FeatureBinMapper(i);
if (bin_mapper->bin_type() == BinType::CategoricalBin) {
max_cat_value = std::max(bin_mapper->MaxCatValue(), max_cat_value);
max_cat_num_bin = std::max(bin_mapper->num_bin(), max_cat_num_bin);
}
}
// std::max(..., 1UL) to avoid error in the case when there are NaN's in the categorical values
const size_t cuda_bitset_max_size = std::max(static_cast<size_t>((max_cat_value + 31) / 32), 1UL);
const size_t cuda_bitset_inner_max_size = std::max(static_cast<size_t>((max_cat_num_bin + 31) / 32), 1UL);
AllocateCUDAMemory<uint32_t>(&cuda_bitset_, cuda_bitset_max_size, __FILE__, __LINE__);
AllocateCUDAMemory<uint32_t>(&cuda_bitset_inner_, cuda_bitset_inner_max_size, __FILE__, __LINE__);
const int max_cat_in_split = std::min(config_->max_cat_threshold, max_cat_num_bin / 2);
const int num_blocks = (max_cat_in_split + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
AllocateCUDAMemory<size_t>(&cuda_block_bitset_len_buffer_, num_blocks, __FILE__, __LINE__);
for (int i = 0; i < train_data_->num_features(); ++i) {
const BinMapper* bin_mapper = train_data_->FeatureBinMapper(i);
if (bin_mapper->bin_type() == BinType::CategoricalBin) {
categorical_bin_offsets_.push_back(bin_mapper->num_bin());
} else {
categorical_bin_offsets_.push_back(0);
}
}
for (size_t i = 1; i < categorical_bin_offsets_.size(); ++i) {
categorical_bin_offsets_[i] += categorical_bin_offsets_[i - 1];
}
categorical_bin_to_value_.resize(categorical_bin_offsets_.back(), 0);
for (int i = 0; i < train_data_->num_features(); ++i) {
const BinMapper* bin_mapper = train_data_->FeatureBinMapper(i);
if (bin_mapper->bin_type() == BinType::CategoricalBin) {
const int offset = categorical_bin_offsets_[i];
for (int bin = 0; bin < bin_mapper->num_bin(); ++bin) {
categorical_bin_to_value_[offset + bin] = static_cast<int>(bin_mapper->BinToValue(bin));
}
}
}
InitCUDAMemoryFromHostMemory<int>(&cuda_categorical_bin_offsets_, categorical_bin_offsets_.data(), categorical_bin_offsets_.size(), __FILE__, __LINE__);
InitCUDAMemoryFromHostMemory<int>(&cuda_categorical_bin_to_value_, categorical_bin_to_value_.data(), categorical_bin_to_value_.size(), __FILE__, __LINE__);
} else {
cuda_bitset_ = nullptr;
cuda_bitset_inner_ = nullptr;
}
cuda_bitset_len_ = 0;
cuda_bitset_inner_len_ = 0;
}
#ifdef DEBUG
void CUDASingleGPUTreeLearner::CheckSplitValid(
const int left_leaf,
const int right_leaf,
const double split_sum_left_gradients,
const double split_sum_right_gradients) {
std::vector<data_size_t> left_data_indices(leaf_num_data_[left_leaf]);
std::vector<data_size_t> right_data_indices(leaf_num_data_[right_leaf]);
CopyFromCUDADeviceToHost<data_size_t>(left_data_indices.data(),
cuda_data_partition_->cuda_data_indices() + leaf_data_start_[left_leaf],
leaf_num_data_[left_leaf], __FILE__, __LINE__);
CopyFromCUDADeviceToHost<data_size_t>(right_data_indices.data(),
cuda_data_partition_->cuda_data_indices() + leaf_data_start_[right_leaf],
leaf_num_data_[right_leaf], __FILE__, __LINE__);
double sum_left_gradients = 0.0f, sum_left_hessians = 0.0f;
double sum_right_gradients = 0.0f, sum_right_hessians = 0.0f;
for (size_t i = 0; i < left_data_indices.size(); ++i) {
const data_size_t index = left_data_indices[i];
sum_left_gradients += gradients_[index];
sum_left_hessians += hessians_[index];
}
for (size_t i = 0; i < right_data_indices.size(); ++i) {
const data_size_t index = right_data_indices[i];
sum_right_gradients += gradients_[index];
sum_right_hessians += hessians_[index];
}
CHECK_LE(std::fabs(sum_left_gradients - split_sum_left_gradients), 1e-6f);
CHECK_LE(std::fabs(sum_left_hessians - leaf_sum_hessians_[left_leaf]), 1e-6f);
CHECK_LE(std::fabs(sum_right_gradients - split_sum_right_gradients), 1e-6f);
CHECK_LE(std::fabs(sum_right_hessians - leaf_sum_hessians_[right_leaf]), 1e-6f);
}
#endif // DEBUG
} // namespace LightGBM
#endif // USE_CUDA_EXP
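categorical_bin_offsets_, built in AllocateBitset above, is a prefix sum over per-feature categorical bin counts (numerical features contribute 0), so feature i's slice of the flat categorical_bin_to_value_ table starts at categorical_bin_offsets_[i]. A small standalone illustration with made-up bin counts:

#include <cstddef>
#include <cstdio>
#include <initializer_list>
#include <vector>

int main() {
  // Hypothetical layout: feature 0 is numerical (0 categorical bins),
  // features 1 and 2 have 4 and 3 categorical bins respectively.
  std::vector<int> offsets{0};
  for (int num_bin : {0, 4, 3}) {
    offsets.push_back(num_bin);
  }
  // same in-place prefix sum as AllocateBitset
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    offsets[i] += offsets[i - 1];
  }
  // offsets == {0, 0, 4, 7}: feature 1's bin-to-value slice is [0, 4),
  // feature 2's is [4, 7), and offsets.back() sizes the whole table.
  for (int v : offsets) {
    std::printf("%d ", v);
  }
  std::printf("\n");
  return 0;
}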
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#include <LightGBM/cuda/cuda_algorithms.hpp>
#include "cuda_single_gpu_tree_learner.hpp"
#include <algorithm>
namespace LightGBM {
__global__ void ReduceLeafStatKernel_SharedMemory(
const score_t* gradients,
const score_t* hessians,
const int num_leaves,
const data_size_t num_data,
const int* data_index_to_leaf_index,
double* leaf_grad_stat_buffer,
double* leaf_hess_stat_buffer) {
extern __shared__ double shared_mem[];
double* shared_grad_sum = shared_mem;
double* shared_hess_sum = shared_mem + num_leaves;
const data_size_t data_index = static_cast<data_size_t>(threadIdx.x + blockIdx.x * blockDim.x);
for (int leaf_index = static_cast<int>(threadIdx.x); leaf_index < num_leaves; leaf_index += static_cast<int>(blockDim.x)) {
shared_grad_sum[leaf_index] = 0.0f;
shared_hess_sum[leaf_index] = 0.0f;
}
__syncthreads();
if (data_index < num_data) {
const int leaf_index = data_index_to_leaf_index[data_index];
atomicAdd_block(shared_grad_sum + leaf_index, gradients[data_index]);
atomicAdd_block(shared_hess_sum + leaf_index, hessians[data_index]);
}
__syncthreads();
for (int leaf_index = static_cast<int>(threadIdx.x); leaf_index < num_leaves; leaf_index += static_cast<int>(blockDim.x)) {
atomicAdd_system(leaf_grad_stat_buffer + leaf_index, shared_grad_sum[leaf_index]);
atomicAdd_system(leaf_hess_stat_buffer + leaf_index, shared_hess_sum[leaf_index]);
}
}
__global__ void ReduceLeafStatKernel_GlobalMemory(
const score_t* gradients,
const score_t* hessians,
const int num_leaves,
const data_size_t num_data,
const int* data_index_to_leaf_index,
double* leaf_grad_stat_buffer,
double* leaf_hess_stat_buffer) {
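// each block accumulates into its own num_leaves-sized slice of the buffer
// (slices 1..gridDim.x); slice 0 is reserved for the final sums produced by
// the atomicAdd_system calls at the end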
const size_t offset = static_cast<size_t>(num_leaves) * (blockIdx.x + 1);
double* grad_sum = leaf_grad_stat_buffer + offset;
double* hess_sum = leaf_hess_stat_buffer + offset;
const data_size_t data_index = static_cast<data_size_t>(threadIdx.x + blockIdx.x * blockDim.x);
for (int leaf_index = static_cast<int>(threadIdx.x); leaf_index < num_leaves; leaf_index += static_cast<int>(blockDim.x)) {
grad_sum[leaf_index] = 0.0f;
hess_sum[leaf_index] = 0.0f;
}
__syncthreads();
if (data_index < num_data) {
const int leaf_index = data_index_to_leaf_index[data_index];
atomicAdd_block(grad_sum + leaf_index, gradients[data_index]);
atomicAdd_block(hess_sum + leaf_index, hessians[data_index]);
}
__syncthreads();
for (int leaf_index = static_cast<int>(threadIdx.x); leaf_index < num_leaves; leaf_index += static_cast<int>(blockDim.x)) {
atomicAdd_system(leaf_grad_stat_buffer + leaf_index, grad_sum[leaf_index]);
atomicAdd_system(leaf_hess_stat_buffer + leaf_index, hess_sum[leaf_index]);
}
}
template <bool USE_L1, bool USE_SMOOTHING>
__global__ void CalcRefitLeafOutputKernel(
const int num_leaves,
const double* leaf_grad_stat_buffer,
const double* leaf_hess_stat_buffer,
const data_size_t* num_data_in_leaf,
const int* leaf_parent,
const int* left_child,
const int* right_child,
const double lambda_l1,
const double lambda_l2,
const double path_smooth,
const double shrinkage_rate,
const double refit_decay_rate,
double* leaf_value) {
const int leaf_index = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
if (leaf_index < num_leaves) {
const double sum_gradients = leaf_grad_stat_buffer[leaf_index];
const double sum_hessians = leaf_hess_stat_buffer[leaf_index];
const data_size_t num_data = num_data_in_leaf[leaf_index];
const double old_leaf_value = leaf_value[leaf_index];
double new_leaf_value = 0.0f;
if (!USE_SMOOTHING) {
new_leaf_value = CUDALeafSplits::CalculateSplittedLeafOutput<USE_L1, false>(sum_gradients, sum_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
} else {
const int parent = leaf_parent[leaf_index];
if (parent >= 0) {
const int sibling = left_child[parent] == leaf_index ? right_child[parent] : left_child[parent];
const double sum_gradients_of_parent = sum_gradients + leaf_grad_stat_buffer[sibling];
const double sum_hessians_of_parent = sum_hessians + leaf_hess_stat_buffer[sibling];
const data_size_t num_data_in_parent = num_data + num_data_in_leaf[sibling];
// parent output is computed without smoothing, as for the root node above
const double parent_output =
CUDALeafSplits::CalculateSplittedLeafOutput<USE_L1, false>(
sum_gradients_of_parent, sum_hessians_of_parent, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
new_leaf_value = CUDALeafSplits::CalculateSplittedLeafOutput<USE_L1, true>(
sum_gradients, sum_hessians, lambda_l1, lambda_l2, path_smooth, num_data_in_parent, parent_output);
} else {
new_leaf_value = CUDALeafSplits::CalculateSplittedLeafOutput<USE_L1, false>(sum_gradients, sum_hessians, lambda_l1, lambda_l2, 0.0f, 0, 0.0f);
}
}
if (isnan(new_leaf_value)) {
new_leaf_value = 0.0f;
} else {
new_leaf_value *= shrinkage_rate;
}
leaf_value[leaf_index] = refit_decay_rate * old_leaf_value + (1.0f - refit_decay_rate) * new_leaf_value;
}
}
void CUDASingleGPUTreeLearner::LaunchReduceLeafStatKernel(
const score_t* gradients, const score_t* hessians, const data_size_t* num_data_in_leaf,
const int* leaf_parent, const int* left_child, const int* right_child, const int num_leaves,
const data_size_t num_data, double* cuda_leaf_value, const double shrinkage_rate) const {
int num_block = (num_data + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
if (num_leaves <= 2048) {
ReduceLeafStatKernel_SharedMemory<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE, 2 * num_leaves * sizeof(double)>>>(
gradients, hessians, num_leaves, num_data, cuda_data_partition_->cuda_data_index_to_leaf_index(),
cuda_leaf_gradient_stat_buffer_, cuda_leaf_hessian_stat_buffer_);
} else {
ReduceLeafStatKernel_GlobalMemory<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(
gradients, hessians, num_leaves, num_data, cuda_data_partition_->cuda_data_index_to_leaf_index(),
cuda_leaf_gradient_stat_buffer_, cuda_leaf_hessian_stat_buffer_);
}
const bool use_l1 = config_->lambda_l1 > 0.0f;
const bool use_smoothing = config_->path_smooth > 0.0f;
num_block = (num_leaves + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
#define CalcRefitLeafOutputKernel_ARGS \
num_leaves, cuda_leaf_gradient_stat_buffer_, cuda_leaf_hessian_stat_buffer_, num_data_in_leaf, \
leaf_parent, left_child, right_child, \
config_->lambda_l1, config_->lambda_l2, config_->path_smooth, \
shrinkage_rate, config_->refit_decay_rate, cuda_leaf_value
if (!use_l1) {
if (!use_smoothing) {
CalcRefitLeafOutputKernel<false, false>
<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(CalcRefitLeafOutputKernel_ARGS);
} else {
CalcRefitLeafOutputKernel<false, true>
<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(CalcRefitLeafOutputKernel_ARGS);
}
} else {
if (!use_smoothing) {
CalcRefitLeafOutputKernel<true, false>
<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(CalcRefitLeafOutputKernel_ARGS);
} else {
CalcRefitLeafOutputKernel<true, true>
<<<num_block, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(CalcRefitLeafOutputKernel_ARGS);
}
}
}
template <typename T, bool IS_INNER>
__global__ void CalcBitsetLenKernel(const CUDASplitInfo* best_split_info, size_t* out_len_buffer) {
__shared__ size_t shared_mem_buffer[32];
const T* vals = nullptr;
if (IS_INNER) {
vals = reinterpret_cast<const T*>(best_split_info->cat_threshold);
} else {
vals = reinterpret_cast<const T*>(best_split_info->cat_threshold_real);
}
const int i = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
size_t len = 0;
if (i < best_split_info->num_cat_threshold) {
const T val = vals[i];
len = (val / 32) + 1;
}
const size_t block_max_len = ShuffleReduceMax<size_t>(len, shared_mem_buffer, blockDim.x);
if (threadIdx.x == 0) {
out_len_buffer[blockIdx.x] = block_max_len;
}
}
__global__ void ReduceBlockMaxLen(size_t* out_len_buffer, const int num_blocks) {
__shared__ size_t shared_mem_buffer[32];
size_t max_len = 0;
for (int i = static_cast<int>(threadIdx.x); i < num_blocks; i += static_cast<int>(blockDim.x)) {
max_len = max(out_len_buffer[i], max_len);
}
const size_t all_max_len = ShuffleReduceMax<size_t>(max_len, shared_mem_buffer, blockDim.x);
if (threadIdx.x == 0) {
out_len_buffer[0] = all_max_len;
}
}
template <typename T, bool IS_INNER>
__global__ void CUDAConstructBitsetKernel(const CUDASplitInfo* best_split_info, uint32_t* out, size_t cuda_bitset_len) {
const T* vals = nullptr;
if (IS_INNER) {
vals = reinterpret_cast<const T*>(best_split_info->cat_threshold);
} else {
vals = reinterpret_cast<const T*>(best_split_info->cat_threshold_real);
}
const int i = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
if (i < best_split_info->num_cat_threshold) {
const T val = vals[i];
// can use add instead of or here, because each bit will only be added once
atomicAdd_system(out + (val / 32), (0x1 << (val % 32)));
}
}
__global__ void SetRealThresholdKernel(
const CUDASplitInfo* best_split_info,
const int* categorical_bin_to_value,
const int* categorical_bin_offsets) {
const int num_cat_threshold = best_split_info->num_cat_threshold;
const int* categorical_bin_to_value_ptr = categorical_bin_to_value + categorical_bin_offsets[best_split_info->inner_feature_index];
int* cat_threshold_real = best_split_info->cat_threshold_real;
const uint32_t* cat_threshold = best_split_info->cat_threshold;
const int index = static_cast<int>(threadIdx.x + blockIdx.x * blockDim.x);
if (index < num_cat_threshold) {
cat_threshold_real[index] = categorical_bin_to_value_ptr[cat_threshold[index]];
}
}
template <typename T, bool IS_INNER>
void CUDAConstructBitset(const CUDASplitInfo* best_split_info, const int num_cat_threshold, uint32_t* out, size_t bitset_len) {
const int num_blocks = (num_cat_threshold + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
// clear the bitset vector first
SetCUDAMemory<uint32_t>(out, 0, bitset_len, __FILE__, __LINE__);
CUDAConstructBitsetKernel<T, IS_INNER><<<num_blocks, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(best_split_info, out, bitset_len);
}
template <typename T, bool IS_INNER>
size_t CUDABitsetLen(const CUDASplitInfo* best_split_info, const int num_cat_threshold, size_t* out_len_buffer) {
const int num_blocks = (num_cat_threshold + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
CalcBitsetLenKernel<T, IS_INNER><<<num_blocks, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(best_split_info, out_len_buffer);
ReduceBlockMaxLen<<<1, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>(out_len_buffer, num_blocks);
size_t host_max_len = 0;
CopyFromCUDADeviceToHost<size_t>(&host_max_len, out_len_buffer, 1, __FILE__, __LINE__);
return host_max_len;
}
void CUDASingleGPUTreeLearner::LaunchConstructBitsetForCategoricalSplitKernel(
const CUDASplitInfo* best_split_info) {
const int num_blocks = (num_cat_threshold_ + CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE - 1) / CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE;
SetRealThresholdKernel<<<num_blocks, CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE>>>
(best_split_info, cuda_categorical_bin_to_value_, cuda_categorical_bin_offsets_);
cuda_bitset_inner_len_ = CUDABitsetLen<uint32_t, true>(best_split_info, num_cat_threshold_, cuda_block_bitset_len_buffer_);
CUDAConstructBitset<uint32_t, true>(best_split_info, num_cat_threshold_, cuda_bitset_inner_, cuda_bitset_inner_len_);
cuda_bitset_len_ = CUDABitsetLen<int, false>(best_split_info, num_cat_threshold_, cuda_block_bitset_len_buffer_);
CUDAConstructBitset<int, false>(best_split_info, num_cat_threshold_, cuda_bitset_, cuda_bitset_len_);
}
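// The launcher builds two bitsets from the same split: cuda_bitset_inner_ is
// keyed by bin indices (training rows are stored as bins), while cuda_bitset_
// is keyed by the raw category values written by SetRealThresholdKernel (used
// when scoring unbinned data). A sketch of the routing test both bitsets serve
// (hypothetical helper; the real traversal lives in the CUDA tree code):
__device__ inline bool CategoryGoesLeft(const uint32_t* bitset, size_t bitset_len, uint32_t category) {
  const uint32_t word = category / 32;
  // categories listed in the bitset are routed to the left child
  return word < bitset_len && ((bitset[word] >> (category % 32)) & 1u);
}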
} // namespace LightGBM
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifndef LIGHTGBM_TREELEARNER_CUDA_CUDA_SINGLE_GPU_TREE_LEARNER_HPP_
#define LIGHTGBM_TREELEARNER_CUDA_CUDA_SINGLE_GPU_TREE_LEARNER_HPP_
#include <memory>
#include <vector>
#ifdef USE_CUDA_EXP
#include "cuda_leaf_splits.hpp"
#include "cuda_histogram_constructor.hpp"
#include "cuda_data_partition.hpp"
#include "cuda_best_split_finder.hpp"
#include "../serial_tree_learner.h"
namespace LightGBM {
#define CUDA_SINGLE_GPU_TREE_LEARNER_BLOCK_SIZE (1024)
class CUDASingleGPUTreeLearner: public SerialTreeLearner {
public:
explicit CUDASingleGPUTreeLearner(const Config* config);
~CUDASingleGPUTreeLearner();
void Init(const Dataset* train_data, bool is_constant_hessian) override;
void ResetTrainingData(const Dataset* train_data,
bool is_constant_hessian) override;
Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) override;
void SetBaggingData(const Dataset* subset, const data_size_t* used_indices, data_size_t num_data) override;
void AddPredictionToScore(const Tree* tree, double* out_score) const override;
void RenewTreeOutput(Tree* tree, const ObjectiveFunction* obj, std::function<double(const label_t*, int)> residual_getter,
data_size_t total_num_data, const data_size_t* bag_indices, data_size_t bag_cnt) const override;
void ResetConfig(const Config* config) override;
Tree* FitByExistingTree(const Tree* old_tree, const score_t* gradients, const score_t* hessians) const override;
Tree* FitByExistingTree(const Tree* old_tree, const std::vector<int>& leaf_pred,
const score_t* gradients, const score_t* hessians) const override;
protected:
void BeforeTrain() override;
void ReduceLeafStat(CUDATree* old_tree, const score_t* gradients, const score_t* hessians, const data_size_t* num_data_in_leaf) const;
void LaunchReduceLeafStatKernel(const score_t* gradients, const score_t* hessians, const data_size_t* num_data_in_leaf,
const int* leaf_parent, const int* left_child, const int* right_child,
const int num_leaves, const data_size_t num_data, double* cuda_leaf_value, const double shrinkage_rate) const;
void ConstructBitsetForCategoricalSplit(const CUDASplitInfo* best_split_info);
void LaunchConstructBitsetForCategoricalSplitKernel(const CUDASplitInfo* best_split_info);
void AllocateBitset();
#ifdef DEBUG
void CheckSplitValid(
const int left_leaf, const int right_leaf,
const double sum_left_gradients, const double sum_right_gradients);
#endif // DEBUG
// GPU device ID
int gpu_device_id_;
// number of threads on CPU
int num_threads_;
// CUDA components for tree training
// leaf splits information for smaller and larger leaves
std::unique_ptr<CUDALeafSplits> cuda_smaller_leaf_splits_;
std::unique_ptr<CUDALeafSplits> cuda_larger_leaf_splits_;
// data partition that partitions data indices into different leaves
std::unique_ptr<CUDADataPartition> cuda_data_partition_;
// for histogram construction
std::unique_ptr<CUDAHistogramConstructor> cuda_histogram_constructor_;
// for best split information finding, given the histograms
std::unique_ptr<CUDABestSplitFinder> cuda_best_split_finder_;
std::vector<int> leaf_best_split_feature_;
std::vector<uint32_t> leaf_best_split_threshold_;
std::vector<uint8_t> leaf_best_split_default_left_;
std::vector<data_size_t> leaf_num_data_;
std::vector<data_size_t> leaf_data_start_;
std::vector<double> leaf_sum_hessians_;
int smaller_leaf_index_;
int larger_leaf_index_;
int best_leaf_index_;
int num_cat_threshold_;
bool has_categorical_feature_;
std::vector<int> categorical_bin_to_value_;
std::vector<int> categorical_bin_offsets_;
mutable double* cuda_leaf_gradient_stat_buffer_;
mutable double* cuda_leaf_hessian_stat_buffer_;
mutable data_size_t leaf_stat_buffer_size_;
mutable data_size_t refit_num_data_;
uint32_t* cuda_bitset_;
size_t cuda_bitset_len_;
uint32_t* cuda_bitset_inner_;
size_t cuda_bitset_inner_len_;
size_t* cuda_block_bitset_len_buffer_;
int* cuda_categorical_bin_to_value_;
int* cuda_categorical_bin_offsets_;
/*! \brief gradients on CUDA */
score_t* cuda_gradients_;
/*! \brief hessians on CUDA */
score_t* cuda_hessians_;
};
} // namespace LightGBM
#else // USE_CUDA_EXP
// When GPU support is not compiled in, quit with an error message
namespace LightGBM {
class CUDASingleGPUTreeLearner: public SerialTreeLearner {
public:
#pragma warning(disable : 4702)
explicit CUDASingleGPUTreeLearner(const Config* tree_config) : SerialTreeLearner(tree_config) {
Log::Fatal("CUDA Tree Learner experimental version was not enabled in this build.\n"
"Please recompile with CMake option -DUSE_CUDA_EXP=1");
}
};
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_TREELEARNER_CUDA_CUDA_SINGLE_GPU_TREE_LEARNER_HPP_
......@@ -206,12 +206,12 @@ class SerialTreeLearner: public TreeLearner {
std::unique_ptr<LeafSplits> smaller_leaf_splits_;
/*! \brief stores best thresholds for all feature for larger leaf */
std::unique_ptr<LeafSplits> larger_leaf_splits_;
-#ifdef USE_GPU
+#if defined(USE_GPU)
/*! \brief gradients of current iteration, ordered for cache optimized, aligned to 4K page */
std::vector<score_t, boost::alignment::aligned_allocator<score_t, 4096>> ordered_gradients_;
/*! \brief hessians of current iteration, ordered for cache optimized, aligned to 4K page */
std::vector<score_t, boost::alignment::aligned_allocator<score_t, 4096>> ordered_hessians_;
-#elif USE_CUDA
+#elif defined(USE_CUDA) || defined(USE_CUDA_EXP)
/*! \brief gradients of current iteration, ordered for cache optimized */
std::vector<score_t, CHAllocator<score_t>> ordered_gradients_;
/*! \brief hessians of current iteration, ordered for cache optimized */
......
......@@ -9,6 +9,7 @@
#include "linear_tree_learner.h"
#include "parallel_tree_learner.h"
#include "serial_tree_learner.h"
#include "cuda/cuda_single_gpu_tree_learner.hpp"
namespace LightGBM {
......@@ -48,6 +49,16 @@ TreeLearner* TreeLearner::CreateTreeLearner(const std::string& learner_type, con
} else if (learner_type == std::string("voting")) {
return new VotingParallelTreeLearner<CUDATreeLearner>(config);
}
} else if (device_type == std::string("cuda_exp")) {
if (learner_type == std::string("serial")) {
if (config->num_gpu == 1) {
return new CUDASingleGPUTreeLearner(config);
} else {
Log::Fatal("cuda_exp only supports training on a single GPU.");
}
} else {
Log::Fatal("cuda_exp only supports training on a single machine.");
}
}
return nullptr;
}
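// A minimal usage sketch for the new branch (hedged: signature and field names
// follow the surrounding code; "cuda_exp" requires the serial learner type and
// num_gpu == 1, otherwise CreateTreeLearner aborts via Log::Fatal):
TreeLearner* CreateCUDAExpLearnerForTest(const Config* config) {
  // dispatches to CUDASingleGPUTreeLearner when device_type is "cuda_exp"
  return TreeLearner::CreateTreeLearner(std::string("serial"), std::string("cuda_exp"), config);
}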
......
......@@ -2,6 +2,7 @@
import filecmp
import numbers
import re
from os import getenv
from pathlib import Path
import numpy as np
......@@ -47,8 +48,9 @@ def test_basic(tmp_path):
assert bst.current_iteration() == 20
assert bst.num_trees() == 20
assert bst.num_model_per_iteration() == 1
-    assert bst.lower_bound() == pytest.approx(-2.9040190126976606)
-    assert bst.upper_bound() == pytest.approx(3.3182142872462883)
+    if getenv('TASK', '') != 'cuda_exp':
+        assert bst.lower_bound() == pytest.approx(-2.9040190126976606)
+        assert bst.upper_bound() == pytest.approx(3.3182142872462883)
tname = tmp_path / "svm_light.dat"
model_file = tmp_path / "model.txt"
......
......@@ -56,7 +56,8 @@ task_to_local_factory = {
pytestmark = [
pytest.mark.skipif(getenv('TASK', '') == 'mpi', reason='Fails to run with MPI interface'),
-    pytest.mark.skipif(getenv('TASK', '') == 'gpu', reason='Fails to run with GPU interface')
+    pytest.mark.skipif(getenv('TASK', '') == 'gpu', reason='Fails to run with GPU interface'),
+    pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Fails to run with CUDA Experimental interface')
]
......
......@@ -6,6 +6,7 @@ import math
import pickle
import platform
import random
from os import getenv
from pathlib import Path
import numpy as np
......@@ -570,6 +571,7 @@ def test_multi_class_error():
assert results['training']['multi_error@2'][-1] == pytest.approx(0)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_auc_mu():
# should give same result as binary auc for 2 classes
X, y = load_digits(n_class=10, return_X_y=True)
......@@ -1501,6 +1503,7 @@ def generate_trainset_for_monotone_constraints_tests(x3_to_category=True):
return trainset
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Monotone constraints are not yet supported by CUDA Experimental version')
@pytest.mark.parametrize("test_with_categorical_variable", [True, False])
def test_monotone_constraints(test_with_categorical_variable):
def is_increasing(y):
......@@ -1590,6 +1593,7 @@ def test_monotone_constraints(test_with_categorical_variable):
assert are_interactions_enforced(constrained_model, feature_sets)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Monotone constraints are not yet supported by CUDA Experimental version')
def test_monotone_penalty():
def are_first_splits_non_monotone(tree, n, monotone_constraints):
if n <= 0:
......@@ -1629,6 +1633,7 @@ def test_monotone_penalty():
# test if a penalty as high as the depth indeed prohibits all monotone splits
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Monotone constraints are not yet supported by CUDA Experimental version')
def test_monotone_penalty_max():
max_depth = 5
monotone_constraints = [1, -1, 0]
......@@ -2393,6 +2398,7 @@ def test_model_size():
pytest.skipTest('not enough RAM')
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_get_split_value_histogram():
X, y = load_boston(return_X_y=True)
lgb_train = lgb.Dataset(X, y, categorical_feature=[2])
......@@ -2472,6 +2478,7 @@ def test_get_split_value_histogram():
gbm.get_split_value_histogram(2)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_early_stopping_for_only_first_metric():
def metrics_combination_train_regression(valid_sets, metric_list, assumed_iteration,
......@@ -2878,6 +2885,7 @@ def test_trees_to_dataframe():
assert tree_df.loc[0, col] is None
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Interaction constraints are not yet supported by CUDA Experimental version')
def test_interaction_constraints():
X, y = load_boston(return_X_y=True)
num_features = X.shape[1]
......@@ -3272,6 +3280,7 @@ def test_dump_model_hook():
assert "LV" in dumped_model_str
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Forced splits are not yet supported by CUDA Experimental version')
def test_force_split_with_feature_fraction(tmp_path):
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
......
# coding: utf-8
import itertools
import math
from os import getenv
from pathlib import Path
import joblib
......@@ -99,6 +100,7 @@ def test_regression():
assert gbm.evals_result_['valid_0']['l2'][gbm.best_iteration_ - 1] == pytest.approx(ret)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_multiclass():
X, y = load_digits(n_class=10, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
......@@ -111,6 +113,7 @@ def test_multiclass():
assert gbm.evals_result_['valid_0']['multi_logloss'][gbm.best_iteration_ - 1] == pytest.approx(ret)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_lambdarank():
rank_example_dir = Path(__file__).absolute().parents[2] / 'examples' / 'lambdarank'
X_train, y_train = load_svmlight_file(str(rank_example_dir / 'rank.train'))
......@@ -1068,6 +1071,7 @@ def test_nan_handle():
np.testing.assert_allclose(gbm.evals_result_['training']['l2'], np.nan)
@pytest.mark.skipif(getenv('TASK', '') == 'cuda_exp', reason='Skip due to differences in implementation details of CUDA Experimental version')
def test_first_metric_only():
def fit_and_check(eval_set_names, metric_names, assumed_iteration, first_metric_only):
......
......@@ -317,10 +317,14 @@
<ClCompile Include="..\src\io\config_auto.cpp" />
<ClCompile Include="..\src\io\dataset.cpp" />
<ClCompile Include="..\src\io\dataset_loader.cpp" />
<ClCompile Include="..\src\io\dense_bin.cpp" />
<ClCompile Include="..\src\io\file_io.cpp" />
<ClCompile Include="..\src\io\json11.cpp" />
<ClCompile Include="..\src\io\metadata.cpp" />
<ClCompile Include="..\src\io\multi_val_dense_bin.cpp" />
<ClCompile Include="..\src\io\multi_val_sparse_bin.cpp" />
<ClCompile Include="..\src\io\parser.cpp" />
<ClCompile Include="..\src\io\sparse_bin.cpp" />
<ClCompile Include="..\src\io\train_share_states.cpp" />
<ClCompile Include="..\src\io\tree.cpp" />
<ClCompile Include="..\src\metric\dcg_calculator.cpp" />
......
......@@ -326,5 +326,17 @@
<ClCompile Include="..\src\treelearner\linear_tree_learner.cpp">
<Filter>src\treelearner</Filter>
</ClCompile>
<ClCompile Include="..\src\io\multi_val_dense_bin.cpp">
<Filter>src\io</Filter>
</ClCompile>
<ClCompile Include="..\src\io\multi_val_sparse_bin.cpp">
<Filter>src\io</Filter>
</ClCompile>
<ClCompile Include="..\src\io\dense_bin.cpp">
<Filter>src\io</Filter>
</ClCompile>
<ClCompile Include="..\src\io\sparse_bin.cpp">
<Filter>src\io</Filter>
</ClCompile>
</ItemGroup>
</Project>
\ No newline at end of file