Unverified Commit 6b56a90c authored by shiyu1994, committed by GitHub

[CUDA] New CUDA version Part 1 (#4630)



* new cuda framework

* add histogram construction kernel

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to GPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add deconstructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: Yu Shi <shiyu1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
parent b857ee10
@@ -80,7 +80,7 @@ else # Linux
       mv $AMDAPPSDK_PATH/lib/x86_64/sdk/* $AMDAPPSDK_PATH/lib/x86_64/
       echo libamdocl64.so > $OPENCL_VENDOR_PATH/amdocl64.icd
     fi
-    if [[ $TASK == "cuda" ]]; then
+    if [[ $TASK == "cuda" || $TASK == "cuda_exp" ]]; then
         echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
         apt-get update
         apt-get install --no-install-recommends -y \
......
@@ -190,21 +190,41 @@ if [[ $TASK == "gpu" ]]; then
     elif [[ $METHOD == "source" ]]; then
         cmake -DUSE_GPU=ON -DOpenCL_INCLUDE_DIR=$AMDAPPSDK_PATH/include/ ..
     fi
-elif [[ $TASK == "cuda" ]]; then
-    sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' $BUILD_DIRECTORY/include/LightGBM/config.h
-    grep -q 'std::string device_type = "cuda"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
+elif [[ $TASK == "cuda" || $TASK == "cuda_exp" ]]; then
+    if [[ $TASK == "cuda" ]]; then
+        sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' $BUILD_DIRECTORY/include/LightGBM/config.h
+        grep -q 'std::string device_type = "cuda"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
+    else
+        sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda_exp";/' $BUILD_DIRECTORY/include/LightGBM/config.h
+        grep -q 'std::string device_type = "cuda_exp"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
+        # by default ``gpu_use_dp=false`` for efficiency. change to ``true`` here for exact results in ci tests
+        sed -i'.bak' 's/gpu_use_dp = false;/gpu_use_dp = true;/' $BUILD_DIRECTORY/include/LightGBM/config.h
+        grep -q 'gpu_use_dp = true' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
+    fi
     if [[ $METHOD == "pip" ]]; then
         cd $BUILD_DIRECTORY/python-package && python setup.py sdist || exit -1
-        pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda || exit -1
+        if [[ $TASK == "cuda" ]]; then
+            pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda || exit -1
+        else
+            pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda-exp || exit -1
+        fi
         pytest $BUILD_DIRECTORY/tests/python_package_test || exit -1
         exit 0
     elif [[ $METHOD == "wheel" ]]; then
-        cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda || exit -1
+        if [[ $TASK == "cuda" ]]; then
+            cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda || exit -1
+        else
+            cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda-exp || exit -1
+        fi
         pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER*.whl -v || exit -1
         pytest $BUILD_DIRECTORY/tests || exit -1
         exit 0
     elif [[ $METHOD == "source" ]]; then
-        cmake -DUSE_CUDA=ON ..
+        if [[ $TASK == "cuda" ]]; then
+            cmake -DUSE_CUDA=ON ..
+        else
+            cmake -DUSE_CUDA_EXP=ON ..
+        fi
     fi
 elif [[ $TASK == "mpi" ]]; then
     if [[ $METHOD == "pip" ]]; then
......
@@ -16,7 +16,7 @@ env:
 jobs:
   test:
-    name: cuda ${{ matrix.cuda_version }} ${{ matrix.method }} (linux, ${{ matrix.compiler }}, Python ${{ matrix.python_version }})
+    name: ${{ matrix.tree_learner }} ${{ matrix.cuda_version }} ${{ matrix.method }} (linux, ${{ matrix.compiler }}, Python ${{ matrix.python_version }})
     runs-on: [self-hosted, linux]
     timeout-minutes: 60
     strategy:
@@ -27,14 +27,27 @@ jobs:
           compiler: gcc
           python_version: "3.8"
           cuda_version: "11.5.1"
+          tree_learner: cuda
         - method: pip
           compiler: clang
           python_version: "3.9"
           cuda_version: "10.0"
+          tree_learner: cuda
         - method: wheel
           compiler: gcc
           python_version: "3.10"
           cuda_version: "9.0"
+          tree_learner: cuda
+        - method: source
+          compiler: gcc
+          python_version: "3.8"
+          cuda_version: "11.5.1"
+          tree_learner: cuda_exp
+        - method: pip
+          compiler: clang
+          python_version: "3.9"
+          cuda_version: "10.0"
+          tree_learner: cuda_exp
     steps:
       - name: Setup or update software on host machine
         run: |
......
@@ -5,6 +5,7 @@ option(USE_SWIG "Enable SWIG to generate Java API" OFF)
 option(USE_HDFS "Enable HDFS support (EXPERIMENTAL)" OFF)
 option(USE_TIMETAG "Set to ON to output time costs" OFF)
 option(USE_CUDA "Enable CUDA-accelerated training (EXPERIMENTAL)" OFF)
+option(USE_CUDA_EXP "Enable CUDA-accelerated training with more acceleration (EXPERIMENTAL)" OFF)
 option(USE_DEBUG "Set to ON for Debug mode" OFF)
 option(USE_SANITIZER "Use santizer flags" OFF)
 set(
@@ -28,7 +29,7 @@ if(__INTEGRATE_OPENCL)
   cmake_minimum_required(VERSION 3.11)
 elseif(USE_GPU OR APPLE)
   cmake_minimum_required(VERSION 3.2)
-elseif(USE_CUDA)
+elseif(USE_CUDA OR USE_CUDA_EXP)
   cmake_minimum_required(VERSION 3.16)
 else()
   cmake_minimum_required(VERSION 3.0)
@@ -133,7 +134,7 @@ else()
   add_definitions(-DUSE_SOCKET)
 endif()
-if(USE_CUDA)
+if(USE_CUDA OR USE_CUDA_EXP)
   set(CMAKE_CUDA_HOST_COMPILER "${CMAKE_CXX_COMPILER}")
   enable_language(CUDA)
   set(USE_OPENMP ON CACHE BOOL "CUDA requires OpenMP" FORCE)
@@ -171,8 +172,12 @@ if(__INTEGRATE_OPENCL)
   endif()
 endif()
-if(USE_CUDA)
-  find_package(CUDA 9.0 REQUIRED)
+if(USE_CUDA OR USE_CUDA_EXP)
+  if(USE_CUDA)
+    find_package(CUDA 9.0 REQUIRED)
+  else()
+    find_package(CUDA 10.0 REQUIRED)
+  endif()
   include_directories(${CUDA_INCLUDE_DIRS})
   set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler=${OpenMP_CXX_FLAGS} -Xcompiler=-fPIC -Xcompiler=-Wall")
@@ -199,7 +204,12 @@ if(USE_CUDA)
   endif()
   message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
-  add_definitions(-DUSE_CUDA)
+  if(USE_CUDA)
+    add_definitions(-DUSE_CUDA)
+  elseif(USE_CUDA_EXP)
+    add_definitions(-DUSE_CUDA_EXP)
+  endif()
   if(NOT DEFINED CMAKE_CUDA_STANDARD)
     set(CMAKE_CUDA_STANDARD 11)
     set(CMAKE_CUDA_STANDARD_REQUIRED ON)
@@ -369,9 +379,17 @@ file(
     src/objective/*.cpp
     src/network/*.cpp
     src/treelearner/*.cpp
-    if(USE_CUDA)
+    if(USE_CUDA OR USE_CUDA_EXP)
       src/treelearner/*.cu
     endif()
+    if(USE_CUDA_EXP)
+      src/treelearner/cuda/*.cpp
+      src/treelearner/cuda/*.cu
+      src/io/cuda/*.cu
+      src/io/cuda/*.cpp
+      src/cuda/*.cpp
+      src/cuda/*.cu
+    endif()
 )
 add_library(lightgbm_objs OBJECT ${SOURCES})
@@ -493,7 +511,7 @@ if(__INTEGRATE_OPENCL)
   target_link_libraries(lightgbm_objs PUBLIC ${INTEGRATED_OPENCL_LIBRARIES})
 endif()
-if(USE_CUDA)
+if(USE_CUDA OR USE_CUDA_EXP)
   # Disable cmake warning about policy CMP0104. Refer to issue #3754 and PR #4268.
   # Custom target properties does not propagate, thus we need to specify for
   # each target that contains or depends on cuda source.
@@ -501,6 +519,8 @@ if(USE_CUDA)
   set_target_properties(_lightgbm PROPERTIES CUDA_ARCHITECTURES OFF)
   set_target_properties(lightgbm PROPERTIES CUDA_ARCHITECTURES OFF)
+  set_target_properties(lightgbm_objs PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
   # Device linking is not supported for object libraries.
   # Thus we have to specify them on final targets.
   set_target_properties(lightgbm PROPERTIES CUDA_RESOLVE_DEVICE_SYMBOLS ON)
......
@@ -37,6 +37,10 @@ OBJECTS = \
     io/parser.o \
     io/train_share_states.o \
     io/tree.o \
+    io/dense_bin.o \
+    io/sparse_bin.o \
+    io/multi_val_dense_bin.o \
+    io/multi_val_sparse_bin.o \
     metric/dcg_calculator.o \
     metric/metric.o \
     objective/objective_function.o \
......
@@ -38,6 +38,10 @@ OBJECTS = \
     io/parser.o \
     io/train_share_states.o \
     io/tree.o \
+    io/dense_bin.o \
+    io/sparse_bin.o \
+    io/multi_val_dense_bin.o \
+    io/multi_val_sparse_bin.o \
     metric/dcg_calculator.o \
     metric/metric.o \
     objective/objective_function.o \
......
@@ -636,6 +636,8 @@ To build LightGBM CUDA version, run the following commands:
     cmake -DUSE_CUDA=1 ..
     make -j4
 
+Recently, a new CUDA version with better efficiency was implemented as an experimental feature. To build the new CUDA version, replace ``-DUSE_CUDA`` with ``-DUSE_CUDA_EXP`` in the above commands. Please note that the new version requires **CUDA** 10.0 or later libraries.
+
 **Note**: glibc >= 2.14 is required.
 
 **Note**: In some rare cases you may need to install OpenMP runtime library separately (use your package manager and search for ``lib[g|i]omp`` for doing this).
......
@@ -199,7 +199,7 @@ Core Parameters
 - **Note**: please **don't** change this during training, especially when running multiple jobs simultaneously by external packages, otherwise it may cause undesirable errors
 
-- ``device_type`` :raw-html:`<a id="device_type" title="Permalink to this parameter" href="#device_type">&#x1F517;&#xFE0E;</a>`, default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, ``cuda``, aliases: ``device``
+- ``device_type`` :raw-html:`<a id="device_type" title="Permalink to this parameter" href="#device_type">&#x1F517;&#xFE0E;</a>`, default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, ``cuda``, ``cuda_exp``, aliases: ``device``
 
 - device for the tree learning, you can use GPU to achieve the faster learning
@@ -209,6 +209,10 @@ Core Parameters
 - **Note**: refer to `Installation Guide <./Installation-Guide.rst#build-gpu-version>`__ to build LightGBM with GPU support
 
+- **Note**: ``cuda_exp`` is an experimental CUDA version, the installation guide for ``cuda_exp`` is identical with ``cuda``
+
+- **Note**: ``cuda_exp`` is faster than ``cuda`` and will replace ``cuda`` in the future
+
 - ``seed`` :raw-html:`<a id="seed" title="Permalink to this parameter" href="#seed">&#x1F517;&#xFE0E;</a>`, default = ``None``, type = int, aliases: ``random_seed``, ``random_state``
 
 - this seed is used to generate other seeds, e.g. ``data_random_seed``, ``feature_fraction_seed``, etc.
......
@@ -34,6 +34,8 @@ def get_parameter_infos(
     member_infos: List[List[Dict[str, List]]] = []
     with open(config_hpp) as config_hpp_file:
         for line in config_hpp_file:
+            if line.strip() in {"#ifndef __NVCC__", "#endif // __NVCC__"}:
+                continue
             if "#pragma region Parameters" in line:
                 is_inparameter = True
             elif "#pragma region" in line and "Parameters" in line:
......
@@ -119,6 +119,23 @@ class BinMapper {
     }
   }
+  /*!
+   * \brief Maximum categorical value
+   * \return Maximum categorical value for categorical features, 0 for numerical features
+   */
+  inline int MaxCatValue() const {
+    if (bin_2_categorical_.size() == 0) {
+      return 0;
+    }
+    int max_cat_value = bin_2_categorical_[0];
+    for (size_t i = 1; i < bin_2_categorical_.size(); ++i) {
+      if (bin_2_categorical_[i] > max_cat_value) {
+        max_cat_value = bin_2_categorical_[i];
+      }
+    }
+    return max_cat_value;
+  }
+
   /*!
    * \brief Get sizes in byte of this object
    */
@@ -379,6 +396,10 @@ class Bin {
    * \brief Deep copy the bin
    */
   virtual Bin* Clone() = 0;
+
+  virtual const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, std::vector<BinIterator*>* bin_iterator, const int num_threads) const = 0;
+
+  virtual const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, BinIterator** bin_iterator) const = 0;
 };
@@ -452,6 +473,14 @@ class MultiValBin {
   static constexpr double multi_val_bin_sparse_threshold = 0.25f;
   virtual MultiValBin* Clone() = 0;
+
+  #ifdef USE_CUDA_EXP
+  virtual const void* GetRowWiseData(uint8_t* bit_type,
+    size_t* total_size,
+    bool* is_sparse,
+    const void** out_data_ptr,
+    uint8_t* data_ptr_bit_type) const = 0;
+  #endif // USE_CUDA_EXP
 };
 inline uint32_t BinMapper::ValueToBin(double value) const {
......
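A note on the new ``BinMapper::MaxCatValue()`` above: the CUDA categorical-split code added in this PR represents a split as a bitvector over category values (see the "add bitvectors for categorical split" commit), so the maximum category value determines the bitvector size. The following is an editor's sketch of that sizing logic, not code from the PR; the helper name is hypothetical.

    #include <cstdint>
    #include <vector>

    // Illustrative only: allocate a word-aligned bitset with one bit per
    // possible category value, as a categorical-split bitvector requires.
    std::vector<uint32_t> AllocateCatBitset(int max_cat_value) {
      const int num_bits = max_cat_value + 1;      // category values start at 0
      const int num_words = (num_bits + 31) / 32;  // 32 category bits per word
      return std::vector<uint32_t>(num_words, 0U);
    }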
@@ -81,9 +81,11 @@ struct Config {
   static void KV2Map(std::unordered_map<std::string, std::string>* params, const char* kv);
   static std::unordered_map<std::string, std::string> Str2Map(const char* parameters);
+#ifndef __NVCC__
 #pragma region Parameters
 #pragma region Core Parameters
+#endif // __NVCC__
   // [no-save]
   // [doc-only]
@@ -204,12 +206,14 @@ struct Config {
   // [doc-only]
   // type = enum
-  // options = cpu, gpu, cuda
+  // options = cpu, gpu, cuda, cuda_exp
   // alias = device
   // desc = device for the tree learning, you can use GPU to achieve the faster learning
   // desc = **Note**: it is recommended to use the smaller ``max_bin`` (e.g. 63) to get the better speed up
   // desc = **Note**: for the faster speed, GPU uses 32-bit float point to sum up by default, so this may affect the accuracy for some tasks. You can set ``gpu_use_dp=true`` to enable 64-bit float point, but it will slow down the training
   // desc = **Note**: refer to `Installation Guide <./Installation-Guide.rst#build-gpu-version>`__ to build LightGBM with GPU support
+  // desc = **Note**: ``cuda_exp`` is an experimental CUDA version, the installation guide for ``cuda_exp`` is identical with ``cuda``
+  // desc = **Note**: ``cuda_exp`` is faster than ``cuda`` and will replace ``cuda`` in the future
   std::string device_type = "cpu";
   // [doc-only]
@@ -228,9 +232,11 @@ struct Config {
   // desc = **Note**: to avoid potential instability due to numerical issues, please set ``force_col_wise=true`` or ``force_row_wise=true`` when setting ``deterministic=true``
   bool deterministic = false;
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Learning Control Parameters
+#endif // __NVCC__
   // desc = used only with ``cpu`` device type
   // desc = set this to ``true`` to force col-wise histogram building
@@ -568,11 +574,13 @@ struct Config {
   // desc = **Note**: can be used only in CLI version
   int snapshot_freq = -1;
+#ifndef __NVCC__
 #pragma endregion
 #pragma region IO Parameters
 #pragma region Dataset Parameters
+#endif // __NVCC__
   // alias = linear_trees
   // desc = fit piecewise linear gradient boosting tree
@@ -728,9 +736,11 @@ struct Config {
   // desc = **Note**: ``lightgbm-transform`` is not maintained by LightGBM's maintainers. Bug reports or feature requests should go to `issues page <https://github.com/microsoft/lightgbm-transform/issues>`__
   std::string parser_config_file = "";
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Predict Parameters
+#endif // __NVCC__
   // [no-save]
   // desc = used only in ``prediction`` task
@@ -800,9 +810,11 @@ struct Config {
   // desc = **Note**: can be used only in CLI version
   std::string output_result = "LightGBM_predict_result.txt";
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Convert Parameters
+#endif // __NVCC__
   // [no-save]
   // desc = used only in ``convert_model`` task
@@ -818,11 +830,13 @@ struct Config {
   // desc = **Note**: can be used only in CLI version
   std::string convert_model = "gbdt_prediction.cpp";
+#ifndef __NVCC__
 #pragma endregion
 #pragma endregion
 #pragma region Objective Parameters
+#endif // __NVCC__
   // desc = used only in ``rank_xendcg`` objective
   // desc = random seed for objectives, if random process is needed
@@ -902,9 +916,11 @@ struct Config {
   // desc = separate by ``,``
   std::vector<double> label_gain;
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Metric Parameters
+#endif // __NVCC__
   // [doc-only]
   // alias = metrics, metric_types
@@ -976,9 +992,11 @@ struct Config {
   // desc = if not specified, will use equal weights for all classes
   std::vector<double> auc_mu_weights;
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Network Parameters
+#endif // __NVCC__
   // check = >0
   // alias = num_machine
@@ -1007,9 +1025,11 @@ struct Config {
   // desc = list of machines in the following format: ``ip1:port1,ip2:port2``
   std::string machines = "";
+#ifndef __NVCC__
 #pragma endregion
 #pragma region GPU Parameters
+#endif // __NVCC__
   // desc = OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform
   // desc = ``-1`` means the system-wide default platform
@@ -1030,9 +1050,11 @@ struct Config {
   // desc = **Note**: can be used only in CUDA implementation
   int num_gpu = 1;
+#ifndef __NVCC__
 #pragma endregion
 #pragma endregion
+#endif // __NVCC__
   size_t file_load_progress_interval_bytes = size_t(10) * 1024 * 1024 * 1024;
......
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifndef LIGHTGBM_CUDA_CUDA_ALGORITHMS_HPP_
#define LIGHTGBM_CUDA_CUDA_ALGORITHMS_HPP_
#ifdef USE_CUDA_EXP
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <LightGBM/bin.h>
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/utils/log.h>
#include <algorithm>
#define NUM_BANKS_DATA_PARTITION (16)
#define LOG_NUM_BANKS_DATA_PARTITION (4)
#define GLOBAL_PREFIX_SUM_BLOCK_SIZE (1024)
#define BITONIC_SORT_NUM_ELEMENTS (1024)
#define BITONIC_SORT_DEPTH (11)
#define BITONIC_SORT_QUERY_ITEM_BLOCK_SIZE (10)
#define CONFLICT_FREE_INDEX(n) \
((n) + ((n) >> LOG_NUM_BANKS_DATA_PARTITION))
namespace LightGBM {
template <typename T>
__device__ __forceinline__ T ShufflePrefixSum(T value, T* shared_mem_buffer) {
const uint32_t mask = 0xffffffff;
const uint32_t warpLane = threadIdx.x % warpSize;
const uint32_t warpID = threadIdx.x / warpSize;
const uint32_t num_warp = blockDim.x / warpSize;
for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
const T other_value = __shfl_up_sync(mask, value, offset);
if (warpLane >= offset) {
value += other_value;
}
}
if (warpLane == warpSize - 1) {
shared_mem_buffer[warpID] = value;
}
__syncthreads();
if (warpID == 0) {
T warp_sum = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
const T other_warp_sum = __shfl_up_sync(mask, warp_sum, offset);
if (warpLane >= offset) {
warp_sum += other_warp_sum;
}
}
shared_mem_buffer[warpLane] = warp_sum;
}
__syncthreads();
const T warp_base = warpID == 0 ? 0 : shared_mem_buffer[warpID - 1];
return warp_base + value;
}
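// Editor's illustrative sketch (not part of this commit): a minimal kernel
// showing how ShufflePrefixSum is intended to be called. Every thread of the
// block must reach the call (it synchronizes internally), and one shared
// slot per warp (at most 32 for a 1024-thread block) suffices.
//
//   template <typename T>
//   __global__ void InclusivePrefixSumKernel(T* values, data_size_t len) {
//     __shared__ T shared_mem_buffer[32];
//     T value = static_cast<data_size_t>(threadIdx.x) < len ? values[threadIdx.x] : 0;
//     value = ShufflePrefixSum<T>(value, shared_mem_buffer);
//     if (static_cast<data_size_t>(threadIdx.x) < len) values[threadIdx.x] = value;
//   }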
template <typename T>
__device__ __forceinline__ T ShufflePrefixSumExclusive(T value, T* shared_mem_buffer) {
const uint32_t mask = 0xffffffff;
const uint32_t warpLane = threadIdx.x % warpSize;
const uint32_t warpID = threadIdx.x / warpSize;
const uint32_t num_warp = blockDim.x / warpSize;
for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
const T other_value = __shfl_up_sync(mask, value, offset);
if (warpLane >= offset) {
value += other_value;
}
}
if (warpLane == warpSize - 1) {
shared_mem_buffer[warpID] = value;
}
__syncthreads();
if (warpID == 0) {
T warp_sum = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
const T other_warp_sum = __shfl_up_sync(mask, warp_sum, offset);
if (warpLane >= offset) {
warp_sum += other_warp_sum;
}
}
shared_mem_buffer[warpLane] = warp_sum;
}
__syncthreads();
const T warp_base = warpID == 0 ? 0 : shared_mem_buffer[warpID - 1];
const T inclusive_result = warp_base + value;
if (threadIdx.x % warpSize == warpSize - 1) {
shared_mem_buffer[warpLane] = inclusive_result;
}
__syncthreads();
T exclusive_result = __shfl_up_sync(mask, inclusive_result, 1);
if (threadIdx.x == 0) {
exclusive_result = 0;
} else if (threadIdx.x % warpSize == 0) {
exclusive_result = shared_mem_buffer[warpLane - 1];
}
return exclusive_result;
}
template <typename T>
void ShufflePrefixSumGlobal(T* values, size_t len, T* block_prefix_sum_buffer);
template <typename T>
__device__ __forceinline__ T ShuffleReduceSumWarp(T value, const data_size_t len) {
if (len > 0) {
const uint32_t mask = 0xffffffff;
for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
value += __shfl_down_sync(mask, value, offset);
}
}
return value;
}
// reduce values from a 1-dimensional block (block size must be no greater than 1024)
template <typename T>
__device__ __forceinline__ T ShuffleReduceSum(T value, T* shared_mem_buffer, const size_t len) {
const uint32_t warpLane = threadIdx.x % warpSize;
const uint32_t warpID = threadIdx.x / warpSize;
const data_size_t warp_len = min(static_cast<data_size_t>(warpSize), static_cast<data_size_t>(len) - static_cast<data_size_t>(warpID * warpSize));
value = ShuffleReduceSumWarp<T>(value, warp_len);
if (warpLane == 0) {
shared_mem_buffer[warpID] = value;
}
__syncthreads();
const data_size_t num_warp = static_cast<data_size_t>((len + warpSize - 1) / warpSize);
if (warpID == 0) {
value = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
value = ShuffleReduceSumWarp<T>(value, num_warp);
}
return value;
}
template <typename T>
__device__ __forceinline__ T ShuffleReduceMaxWarp(T value, const data_size_t len) {
if (len > 0) {
const uint32_t mask = 0xffffffff;
for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
value = max(value, __shfl_down_sync(mask, value, offset));
}
}
return value;
}
// reduce values from a 1-dimensional block (block size must be no greater than 1024)
template <typename T>
__device__ __forceinline__ T ShuffleReduceMax(T value, T* shared_mem_buffer, const size_t len) {
const uint32_t warpLane = threadIdx.x % warpSize;
const uint32_t warpID = threadIdx.x / warpSize;
const data_size_t warp_len = min(static_cast<data_size_t>(warpSize), static_cast<data_size_t>(len) - static_cast<data_size_t>(warpID * warpSize));
value = ShuffleReduceMaxWarp<T>(value, warp_len);
if (warpLane == 0) {
shared_mem_buffer[warpID] = value;
}
__syncthreads();
const data_size_t num_warp = static_cast<data_size_t>((len + warpSize - 1) / warpSize);
if (warpID == 0) {
value = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
value = ShuffleReduceMaxWarp<T>(value, num_warp);
}
return value;
}
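// Editor's illustrative sketch (not part of this commit): a block-level sum
// reduction on top of ShuffleReduceSum. The `len` argument is the number of
// valid values contributed by this block; thread 0 writes one partial sum.
//
//   template <typename T>
//   __global__ void BlockSumKernel(const T* values, data_size_t len, T* block_sums) {
//     __shared__ T shared_mem_buffer[32];
//     const data_size_t index = static_cast<data_size_t>(blockIdx.x * blockDim.x + threadIdx.x);
//     const T value = index < len ? values[index] : 0;
//     const data_size_t block_start = static_cast<data_size_t>(blockIdx.x * blockDim.x);
//     const size_t valid = static_cast<size_t>(min(static_cast<data_size_t>(blockDim.x), len - block_start));
//     const T sum = ShuffleReduceSum<T>(value, shared_mem_buffer, valid);
//     if (threadIdx.x == 0) block_sums[blockIdx.x] = sum;
//   }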
// calculate prefix sum values within a 1-dimensional block in global memory, exclusively
template <typename T>
__device__ __forceinline__ void GlobalMemoryPrefixSum(T* array, const size_t len) {
const size_t num_values_per_thread = (len + blockDim.x - 1) / blockDim.x;
const size_t start = threadIdx.x * num_values_per_thread;
const size_t end = min(start + num_values_per_thread, len);
T thread_sum = 0;
for (size_t index = start; index < end; ++index) {
thread_sum += array[index];
}
__shared__ T shared_mem[32];
const T thread_base = ShufflePrefixSumExclusive<T>(thread_sum, shared_mem);
if (start < end) {
array[start] += thread_base;
}
for (size_t index = start + 1; index < end; ++index) {
array[index] += array[index - 1];
}
}
template <typename VAL_T, typename INDEX_T, bool ASCENDING>
__device__ __forceinline__ void BitonicArgSort_1024(const VAL_T* scores, INDEX_T* indices, const INDEX_T num_items) {
INDEX_T depth = 1;
INDEX_T num_items_aligned = 1;
INDEX_T num_items_ref = num_items - 1;
while (num_items_ref > 0) {
num_items_ref >>= 1;
num_items_aligned <<= 1;
++depth;
}
for (INDEX_T outer_depth = depth - 1; outer_depth >= 1; --outer_depth) {
const INDEX_T outer_segment_length = 1 << (depth - outer_depth);
const INDEX_T outer_segment_index = threadIdx.x / outer_segment_length;
const bool ascending = ASCENDING ? (outer_segment_index % 2 == 0) : (outer_segment_index % 2 > 0);
for (INDEX_T inner_depth = outer_depth; inner_depth < depth; ++inner_depth) {
const INDEX_T segment_length = 1 << (depth - inner_depth);
const INDEX_T half_segment_length = segment_length >> 1;
const INDEX_T half_segment_index = threadIdx.x / half_segment_length;
if (threadIdx.x < num_items_aligned) {
if (half_segment_index % 2 == 0) {
const INDEX_T index_to_compare = threadIdx.x + half_segment_length;
if ((scores[indices[threadIdx.x]] > scores[indices[index_to_compare]]) == ascending) {
const INDEX_T index = indices[threadIdx.x];
indices[threadIdx.x] = indices[index_to_compare];
indices[index_to_compare] = index;
}
}
}
__syncthreads();
}
}
}
template <typename VAL_T, typename INDEX_T, bool ASCENDING, uint32_t BLOCK_DIM, uint32_t MAX_DEPTH>
__device__ void BitonicArgSortDevice(const VAL_T* values, INDEX_T* indices, const int len) {
__shared__ VAL_T shared_values[BLOCK_DIM];
__shared__ INDEX_T shared_indices[BLOCK_DIM];
int len_to_shift = len - 1;
int max_depth = 1;
while (len_to_shift > 0) {
len_to_shift >>= 1;
++max_depth;
}
const int num_blocks = (len + static_cast<int>(BLOCK_DIM) - 1) / static_cast<int>(BLOCK_DIM);
for (int block_index = 0; block_index < num_blocks; ++block_index) {
const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
if (this_index < len) {
shared_values[threadIdx.x] = values[this_index];
shared_indices[threadIdx.x] = this_index;
} else {
shared_indices[threadIdx.x] = len;
}
__syncthreads();
for (int depth = max_depth - 1; depth > max_depth - static_cast<int>(MAX_DEPTH); --depth) {
const int segment_length = (1 << (max_depth - depth));
const int segment_index = this_index / segment_length;
const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
{
const int half_segment_length = (segment_length >> 1);
const int half_segment_index = this_index / half_segment_length;
const int num_total_segment = (len + segment_length - 1) / segment_length;
const int offset = (segment_index == num_total_segment - 1 && ascending == ASCENDING) ?
(num_total_segment * segment_length - len) : 0;
if (half_segment_index % 2 == 0) {
const int segment_start = segment_index * segment_length;
if (this_index >= offset + segment_start) {
const int other_index = static_cast<int>(threadIdx.x) + half_segment_length - offset;
const INDEX_T this_data_index = shared_indices[threadIdx.x];
const INDEX_T other_data_index = shared_indices[other_index];
const VAL_T this_value = shared_values[threadIdx.x];
const VAL_T other_value = shared_values[other_index];
if (other_data_index < len && (this_value > other_value) == ascending) {
shared_indices[threadIdx.x] = other_data_index;
shared_indices[other_index] = this_data_index;
shared_values[threadIdx.x] = other_value;
shared_values[other_index] = this_value;
}
}
}
__syncthreads();
}
for (int inner_depth = depth + 1; inner_depth < max_depth; ++inner_depth) {
const int half_segment_length = (1 << (max_depth - inner_depth - 1));
const int half_segment_index = this_index / half_segment_length;
if (half_segment_index % 2 == 0) {
const int other_index = static_cast<int>(threadIdx.x) + half_segment_length;
const INDEX_T this_data_index = shared_indices[threadIdx.x];
const INDEX_T other_data_index = shared_indices[other_index];
const VAL_T this_value = shared_values[threadIdx.x];
const VAL_T other_value = shared_values[other_index];
if (other_data_index < len && (this_value > other_value) == ascending) {
shared_indices[threadIdx.x] = other_data_index;
shared_indices[other_index] = this_data_index;
shared_values[threadIdx.x] = other_value;
shared_values[other_index] = this_value;
}
}
__syncthreads();
}
}
if (this_index < len) {
indices[this_index] = shared_indices[threadIdx.x];
}
__syncthreads();
}
for (int depth = max_depth - static_cast<int>(MAX_DEPTH); depth >= 1; --depth) {
const int segment_length = (1 << (max_depth - depth));
{
const int num_total_segment = (len + segment_length - 1) / segment_length;
const int half_segment_length = (segment_length >> 1);
for (int block_index = 0; block_index < num_blocks; ++block_index) {
const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
const int segment_index = this_index / segment_length;
const int half_segment_index = this_index / half_segment_length;
const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
const int offset = (segment_index == num_total_segment - 1 && ascending == ASCENDING) ?
(num_total_segment * segment_length - len) : 0;
if (half_segment_index % 2 == 0) {
const int segment_start = segment_index * segment_length;
if (this_index >= offset + segment_start) {
const int other_index = this_index + half_segment_length - offset;
if (other_index < len) {
const INDEX_T this_data_index = indices[this_index];
const INDEX_T other_data_index = indices[other_index];
const VAL_T this_value = values[this_data_index];
const VAL_T other_value = values[other_data_index];
if ((this_value > other_value) == ascending) {
indices[this_index] = other_data_index;
indices[other_index] = this_data_index;
}
}
}
}
}
__syncthreads();
}
for (int inner_depth = depth + 1; inner_depth <= max_depth - static_cast<int>(MAX_DEPTH); ++inner_depth) {
const int half_segment_length = (1 << (max_depth - inner_depth - 1));
for (int block_index = 0; block_index < num_blocks; ++block_index) {
const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
const int segment_index = this_index / segment_length;
const int half_segment_index = this_index / half_segment_length;
const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
if (half_segment_index % 2 == 0) {
const int other_index = this_index + half_segment_length;
if (other_index < len) {
const INDEX_T this_data_index = indices[this_index];
const INDEX_T other_data_index = indices[other_index];
const VAL_T this_value = values[this_data_index];
const VAL_T other_value = values[other_data_index];
if ((this_value > other_value) == ascending) {
indices[this_index] = other_data_index;
indices[other_index] = this_data_index;
}
}
}
__syncthreads();
}
}
for (int block_index = 0; block_index < num_blocks; ++block_index) {
const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
const int segment_index = this_index / segment_length;
const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
if (this_index < len) {
const INDEX_T index = indices[this_index];
shared_values[threadIdx.x] = values[index];
shared_indices[threadIdx.x] = index;
} else {
shared_indices[threadIdx.x] = len;
}
__syncthreads();
for (int inner_depth = max_depth - static_cast<int>(MAX_DEPTH) + 1; inner_depth < max_depth; ++inner_depth) {
const int half_segment_length = (1 << (max_depth - inner_depth - 1));
const int half_segment_index = this_index / half_segment_length;
if (half_segment_index % 2 == 0) {
const int other_index = static_cast<int>(threadIdx.x) + half_segment_length;
const INDEX_T this_data_index = shared_indices[threadIdx.x];
const INDEX_T other_data_index = shared_indices[other_index];
const VAL_T this_value = shared_values[threadIdx.x];
const VAL_T other_value = shared_values[other_index];
if (other_data_index < len && (this_value > other_value) == ascending) {
shared_indices[threadIdx.x] = other_data_index;
shared_indices[other_index] = this_data_index;
shared_values[threadIdx.x] = other_value;
shared_values[other_index] = this_value;
}
}
__syncthreads();
}
if (this_index < len) {
indices[this_index] = shared_indices[threadIdx.x];
}
__syncthreads();
}
}
}
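// Editor's illustrative sketch (not part of this commit): BitonicArgSortDevice
// is specialized per use case through its template parameters. For example, a
// descending argsort of double scores, launched as a single block of
// BITONIC_SORT_NUM_ELEMENTS threads, could look like:
//
//   __global__ void ArgSortKernel(const double* scores, data_size_t* indices, const int len) {
//     BitonicArgSortDevice<double, data_size_t, false, BITONIC_SORT_NUM_ELEMENTS, BITONIC_SORT_DEPTH>(scores, indices, len);
//   }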
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_CUDA_CUDA_ALGORITHMS_HPP_
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_COLUMN_DATA_HPP_
#define LIGHTGBM_CUDA_COLUMN_DATA_HPP_
#include <LightGBM/config.h>
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/bin.h>
#include <LightGBM/utils/openmp_wrapper.h>
#include <vector>
namespace LightGBM {
class CUDAColumnData {
public:
CUDAColumnData(const data_size_t num_data, const int gpu_device_id);
~CUDAColumnData();
void Init(const int num_columns,
const std::vector<const void*>& column_data,
const std::vector<BinIterator*>& column_bin_iterator,
const std::vector<uint8_t>& column_bit_type,
const std::vector<uint32_t>& feature_max_bin,
const std::vector<uint32_t>& feature_min_bin,
const std::vector<uint32_t>& feature_offset,
const std::vector<uint32_t>& feature_most_freq_bin,
const std::vector<uint32_t>& feature_default_bin,
const std::vector<uint8_t>& feature_missing_is_zero,
const std::vector<uint8_t>& feature_missing_is_na,
const std::vector<uint8_t>& feature_mfb_is_zero,
const std::vector<uint8_t>& feature_mfb_is_na,
const std::vector<int>& feature_to_column);
const void* GetColumnData(const int column_index) const { return data_by_column_[column_index]; }
void CopySubrow(const CUDAColumnData* full_set, const data_size_t* used_indices, const data_size_t num_used_indices);
void* const* cuda_data_by_column() const { return cuda_data_by_column_; }
uint32_t feature_min_bin(const int feature_index) const { return feature_min_bin_[feature_index]; }
uint32_t feature_max_bin(const int feature_index) const { return feature_max_bin_[feature_index]; }
uint32_t feature_offset(const int feature_index) const { return feature_offset_[feature_index]; }
uint32_t feature_most_freq_bin(const int feature_index) const { return feature_most_freq_bin_[feature_index]; }
uint32_t feature_default_bin(const int feature_index) const { return feature_default_bin_[feature_index]; }
uint8_t feature_missing_is_zero(const int feature_index) const { return feature_missing_is_zero_[feature_index]; }
uint8_t feature_missing_is_na(const int feature_index) const { return feature_missing_is_na_[feature_index]; }
uint8_t feature_mfb_is_zero(const int feature_index) const { return feature_mfb_is_zero_[feature_index]; }
uint8_t feature_mfb_is_na(const int feature_index) const { return feature_mfb_is_na_[feature_index]; }
const uint32_t* cuda_feature_min_bin() const { return cuda_feature_min_bin_; }
const uint32_t* cuda_feature_max_bin() const { return cuda_feature_max_bin_; }
const uint32_t* cuda_feature_offset() const { return cuda_feature_offset_; }
const uint32_t* cuda_feature_most_freq_bin() const { return cuda_feature_most_freq_bin_; }
const uint32_t* cuda_feature_default_bin() const { return cuda_feature_default_bin_; }
const uint8_t* cuda_feature_missing_is_zero() const { return cuda_feature_missing_is_zero_; }
const uint8_t* cuda_feature_missing_is_na() const { return cuda_feature_missing_is_na_; }
const uint8_t* cuda_feature_mfb_is_zero() const { return cuda_feature_mfb_is_zero_; }
const uint8_t* cuda_feature_mfb_is_na() const { return cuda_feature_mfb_is_na_; }
const int* cuda_feature_to_column() const { return cuda_feature_to_column_; }
const uint8_t* cuda_column_bit_type() const { return cuda_column_bit_type_; }
int feature_to_column(const int feature_index) const { return feature_to_column_[feature_index]; }
uint8_t column_bit_type(const int column_index) const { return column_bit_type_[column_index]; }
private:
template <bool IS_SPARSE, bool IS_4BIT, typename BIN_TYPE>
void InitOneColumnData(const void* in_column_data, BinIterator* bin_iterator, void** out_column_data_pointer);
void LaunchCopySubrowKernel(void* const* in_cuda_data_by_column);
void InitColumnMetaInfo();
void ResizeWhenCopySubrow(const data_size_t num_used_indices);
int num_threads_;
data_size_t num_data_;
int num_columns_;
std::vector<uint8_t> column_bit_type_;
std::vector<uint32_t> feature_min_bin_;
std::vector<uint32_t> feature_max_bin_;
std::vector<uint32_t> feature_offset_;
std::vector<uint32_t> feature_most_freq_bin_;
std::vector<uint32_t> feature_default_bin_;
std::vector<uint8_t> feature_missing_is_zero_;
std::vector<uint8_t> feature_missing_is_na_;
std::vector<uint8_t> feature_mfb_is_zero_;
std::vector<uint8_t> feature_mfb_is_na_;
void** cuda_data_by_column_;
std::vector<int> feature_to_column_;
std::vector<void*> data_by_column_;
uint8_t* cuda_column_bit_type_;
uint32_t* cuda_feature_min_bin_;
uint32_t* cuda_feature_max_bin_;
uint32_t* cuda_feature_offset_;
uint32_t* cuda_feature_most_freq_bin_;
uint32_t* cuda_feature_default_bin_;
uint8_t* cuda_feature_missing_is_zero_;
uint8_t* cuda_feature_missing_is_na_;
uint8_t* cuda_feature_mfb_is_zero_;
uint8_t* cuda_feature_mfb_is_na_;
int* cuda_feature_to_column_;
// used when bagging with subset
data_size_t* cuda_used_indices_;
data_size_t num_used_indices_;
data_size_t cur_subset_buffer_size_;
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_COLUMN_DATA_HPP_
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_META_DATA_HPP_
#define LIGHTGBM_CUDA_META_DATA_HPP_
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/meta.h>
#include <vector>
namespace LightGBM {
class CUDAMetadata {
public:
explicit CUDAMetadata(const int gpu_device_id);
~CUDAMetadata();
void Init(const std::vector<label_t>& label,
const std::vector<label_t>& weight,
const std::vector<data_size_t>& query_boundaries,
const std::vector<label_t>& query_weights,
const std::vector<double>& init_score);
void SetLabel(const label_t* label, data_size_t len);
void SetWeights(const label_t* weights, data_size_t len);
void SetQuery(const data_size_t* query, const label_t* query_weights, data_size_t num_queries);
void SetInitScore(const double* init_score, data_size_t len);
const label_t* cuda_label() const { return cuda_label_; }
const label_t* cuda_weights() const { return cuda_weights_; }
const data_size_t* cuda_query_boundaries() const { return cuda_query_boundaries_; }
const label_t* cuda_query_weights() const { return cuda_query_weights_; }
private:
label_t* cuda_label_;
label_t* cuda_weights_;
data_size_t* cuda_query_boundaries_;
label_t* cuda_query_weights_;
double* cuda_init_score_;
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_META_DATA_HPP_
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifndef LIGHTGBM_CUDA_CUDA_RANDOM_HPP_
#define LIGHTGBM_CUDA_CUDA_RANDOM_HPP_
#ifdef USE_CUDA_EXP
#include <cuda.h>
#include <cuda_runtime.h>
namespace LightGBM {
/*!
* \brief A wrapper for random generator
*/
class CUDARandom {
public:
/*!
* \brief Set specific seed
*/
__device__ void SetSeed(int seed) {
x = seed;
}
/*!
* \brief Generate random integer, int16 range, i.e. [0, 32768)
* \param lower_bound lower bound
* \param upper_bound upper bound
* \return The random integer between [lower_bound, upper_bound)
*/
__device__ inline int NextShort(int lower_bound, int upper_bound) {
return (RandInt16()) % (upper_bound - lower_bound) + lower_bound;
}
/*!
* \brief Generate random integer, int32 range
* \param lower_bound lower bound
* \param upper_bound upper bound
* \return The random integer between [lower_bound, upper_bound)
*/
__device__ inline int NextInt(int lower_bound, int upper_bound) {
return (RandInt32()) % (upper_bound - lower_bound) + lower_bound;
}
/*!
* \brief Generate random float data
* \return The random float between [0.0, 1.0)
*/
__device__ inline float NextFloat() {
// get random float in [0,1)
return static_cast<float>(RandInt16()) / (32768.0f);
}
private:
__device__ inline int RandInt16() {
x = (214013 * x + 2531011);
return static_cast<int>((x >> 16) & 0x7FFF);
}
__device__ inline int RandInt32() {
x = (214013 * x + 2531011);
return static_cast<int>(x & 0x7FFFFFFF);
}
unsigned int x = 123456789;
};
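// Editor's illustrative sketch (not part of this commit): per-thread usage,
// seeding each thread differently so the sequences are decorrelated.
//
//   __global__ void BernoulliKernel(const int base_seed, const float pos_rate, const int len, int8_t* out) {
//     const int index = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
//     if (index < len) {
//       CUDARandom rand_generator;
//       rand_generator.SetSeed(base_seed + index);
//       out[index] = rand_generator.NextFloat() < pos_rate ? 1 : 0;
//     }
//   }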
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_CUDA_CUDA_RANDOM_HPP_
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_ROW_DATA_HPP_
#define LIGHTGBM_CUDA_ROW_DATA_HPP_
#include <LightGBM/bin.h>
#include <LightGBM/config.h>
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/dataset.h>
#include <LightGBM/train_share_states.h>
#include <LightGBM/utils/openmp_wrapper.h>
#include <vector>
#define COPY_SUBROW_BLOCK_SIZE_ROW_DATA (1024)
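// Editor's note: the following values are the number of shared-memory
// histogram entries available per thread block; the double-precision (DP)
// budget is smaller when built against CUDA 10.0, and single precision (SP)
// fits twice as many entries in the same space.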
#if CUDART_VERSION == 10000
#define DP_SHARED_HIST_SIZE (5560)
#else
#define DP_SHARED_HIST_SIZE (6144)
#endif
#define SP_SHARED_HIST_SIZE (DP_SHARED_HIST_SIZE * 2)
namespace LightGBM {
class CUDARowData {
public:
CUDARowData(const Dataset* train_data,
const TrainingShareStates* train_share_state,
const int gpu_device_id,
const bool gpu_use_dp);
~CUDARowData();
void Init(const Dataset* train_data,
TrainingShareStates* train_share_state);
void CopySubrow(const CUDARowData* full_set, const data_size_t* used_indices, const data_size_t num_used_indices);
void CopySubcol(const CUDARowData* full_set, const std::vector<int8_t>& is_feature_used, const Dataset* train_data);
void CopySubrowAndSubcol(const CUDARowData* full_set, const data_size_t* used_indices,
const data_size_t num_used_indices, const std::vector<bool>& is_feature_used, const Dataset* train_data);
template <typename BIN_TYPE>
const BIN_TYPE* GetBin() const;
template <typename PTR_TYPE>
const PTR_TYPE* GetPartitionPtr() const;
template <typename PTR_TYPE>
const PTR_TYPE* GetRowPtr() const;
int NumLargeBinPartition() const { return static_cast<int>(large_bin_partitions_.size()); }
int num_feature_partitions() const { return num_feature_partitions_; }
int max_num_column_per_partition() const { return max_num_column_per_partition_; }
bool is_sparse() const { return is_sparse_; }
uint8_t bit_type() const { return bit_type_; }
uint8_t row_ptr_bit_type() const { return row_ptr_bit_type_; }
const int* cuda_feature_partition_column_index_offsets() const { return cuda_feature_partition_column_index_offsets_; }
const uint32_t* cuda_column_hist_offsets() const { return cuda_column_hist_offsets_; }
const uint32_t* cuda_partition_hist_offsets() const { return cuda_partition_hist_offsets_; }
int shared_hist_size() const { return shared_hist_size_; }
private:
void DivideCUDAFeatureGroups(const Dataset* train_data, TrainingShareStates* share_state);
template <typename BIN_TYPE>
void GetDenseDataPartitioned(const BIN_TYPE* row_wise_data, std::vector<BIN_TYPE>* partitioned_data);
template <typename BIN_TYPE, typename ROW_PTR_TYPE>
void GetSparseDataPartitioned(const BIN_TYPE* row_wise_data,
const ROW_PTR_TYPE* row_ptr,
std::vector<std::vector<BIN_TYPE>>* partitioned_data,
std::vector<std::vector<ROW_PTR_TYPE>>* partitioned_row_ptr,
std::vector<ROW_PTR_TYPE>* partition_ptr);
template <typename BIN_TYPE, typename ROW_PTR_TYPE>
void InitSparseData(const BIN_TYPE* host_data,
const ROW_PTR_TYPE* host_row_ptr,
BIN_TYPE** cuda_data,
ROW_PTR_TYPE** cuda_row_ptr,
ROW_PTR_TYPE** cuda_partition_ptr);
/*! \brief number of threads to use */
int num_threads_;
/*! \brief number of training data */
data_size_t num_data_;
/*! \brief number of bins of all features */
int num_total_bin_;
/*! \brief number of feature groups in dataset */
int num_feature_group_;
/*! \brief number of features in dataset */
int num_feature_;
/*! \brief number of bits used to store each bin value */
uint8_t bit_type_;
/*! \brief number of bits used to store each row pointer value */
uint8_t row_ptr_bit_type_;
/*! \brief is sparse row wise data */
bool is_sparse_;
/*! \brief start column index of each feature partition */
std::vector<int> feature_partition_column_index_offsets_;
/*! \brief histogram offset of each column */
std::vector<uint32_t> column_hist_offsets_;
/*! \brief histogram offset of each partition */
std::vector<uint32_t> partition_hist_offsets_;
/*! \brief maximum number of columns among all feature partitions */
int max_num_column_per_partition_;
/*! \brief number of partitions */
int num_feature_partitions_;
/*! \brief used when bagging with subset, number of used indices */
data_size_t num_used_indices_;
/*! \brief used when bagging with subset, number of total elements */
uint64_t num_total_elements_;
/*! \brief used when bagging with column subset, buffer size for the maximum number of feature partitions */
int cur_num_feature_partition_buffer_size_;
/*! \brief CUDA device ID */
int gpu_device_id_;
/*! \brief indices of partitions whose bins are too large for their histograms to fit into shared memory; each large bin partition contains a single column */
std::vector<int> large_bin_partitions_;
/*! \brief indices of partitions with small bins */
std::vector<int> small_bin_partitions_;
/*! \brief shared memory size used by histogram */
int shared_hist_size_;
/*! \brief whether to use double precision in histograms per block */
bool gpu_use_dp_;
// CUDA memory
/*! \brief row-wise data stored in CUDA, 8 bits */
uint8_t* cuda_data_uint8_t_;
/*! \brief row-wise data stored in CUDA, 16 bits */
uint16_t* cuda_data_uint16_t_;
/*! \brief row-wise data stored in CUDA, 32 bits */
uint32_t* cuda_data_uint32_t_;
/*! \brief row pointer stored in CUDA, 16 bits */
uint16_t* cuda_row_ptr_uint16_t_;
/*! \brief row pointer stored in CUDA, 32 bits */
uint32_t* cuda_row_ptr_uint32_t_;
/*! \brief row pointer stored in CUDA, 64 bits */
uint64_t* cuda_row_ptr_uint64_t_;
/*! \brief partition bin offsets, 16 bits */
uint16_t* cuda_partition_ptr_uint16_t_;
/*! \brief partition bin offsets, 32 bits */
uint32_t* cuda_partition_ptr_uint32_t_;
/*! \brief partition bin offsets, 64 bits */
uint64_t* cuda_partition_ptr_uint64_t_;
/*! \brief start column index of each feature partition */
int* cuda_feature_partition_column_index_offsets_;
/*! \brief histogram offset of each column */
uint32_t* cuda_column_hist_offsets_;
/*! \brief histogram offset of each partition */
uint32_t* cuda_partition_hist_offsets_;
/*! \brief block buffer when calculating prefix sum */
uint16_t* cuda_block_buffer_uint16_t_;
/*! \brief block buffer when calculating prefix sum */
uint32_t* cuda_block_buffer_uint32_t_;
/*! \brief block buffer when calculating prefix sum */
uint64_t* cuda_block_buffer_uint64_t_;
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_ROW_DATA_HPP_
#endif // USE_CUDA_EXP
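// Illustrative caller-side sketch, not from the header above: row data is
// stored at the narrowest width that holds all bin values, and a caller
// dispatches on bit_type() to pick the matching GetBin<BIN_TYPE>()
// instantiation. LaunchHistogramKernel8/16/32 are hypothetical names; the
// launches are left as comments.
inline void DispatchOnBinWidth(const LightGBM::CUDARowData* row_data) {
  if (row_data->bit_type() == 8) {
    const uint8_t* bins = row_data->GetBin<uint8_t>();
    (void)bins;  // LaunchHistogramKernel8(bins, ...);
  } else if (row_data->bit_type() == 16) {
    const uint16_t* bins = row_data->GetBin<uint16_t>();
    (void)bins;  // LaunchHistogramKernel16(bins, ...);
  } else {
    const uint32_t* bins = row_data->GetBin<uint32_t>();
    (void)bins;  // LaunchHistogramKernel32(bins, ...);
  }
}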
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_CUDA_SPLIT_INFO_HPP_
#define LIGHTGBM_CUDA_CUDA_SPLIT_INFO_HPP_
#include <LightGBM/meta.h>
namespace LightGBM {
class CUDASplitInfo {
public:
bool is_valid;
int leaf_index;
double gain;
int inner_feature_index;
uint32_t threshold;
bool default_left;
double left_sum_gradients;
double left_sum_hessians;
data_size_t left_count;
double left_gain;
double left_value;
double right_sum_gradients;
double right_sum_hessians;
data_size_t right_count;
double right_gain;
double right_value;
int num_cat_threshold = 0;
uint32_t* cat_threshold = nullptr;
int* cat_threshold_real = nullptr;
__device__ CUDASplitInfo() {
num_cat_threshold = 0;
cat_threshold = nullptr;
cat_threshold_real = nullptr;
}
__device__ ~CUDASplitInfo() {
if (num_cat_threshold > 0) {
// operator= allocates these arrays with device-side new[], so they must
// be released with delete[] rather than cudaFree
if (cat_threshold != nullptr) {
delete[] cat_threshold;
}
if (cat_threshold_real != nullptr) {
delete[] cat_threshold_real;
}
}
}
__device__ CUDASplitInfo& operator=(const CUDASplitInfo& other) {
is_valid = other.is_valid;
leaf_index = other.leaf_index;
gain = other.gain;
inner_feature_index = other.inner_feature_index;
threshold = other.threshold;
default_left = other.default_left;
left_sum_gradients = other.left_sum_gradients;
left_sum_hessians = other.left_sum_hessians;
left_count = other.left_count;
left_gain = other.left_gain;
left_value = other.left_value;
right_sum_gradients = other.right_sum_gradients;
right_sum_hessians = other.right_sum_hessians;
right_count = other.right_count;
right_gain = other.right_gain;
right_value = other.right_value;
num_cat_threshold = other.num_cat_threshold;
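// note: the buffers below are allocated only on first use, which assumes
// num_cat_threshold never grows across assignments to the same object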
if (num_cat_threshold > 0 && cat_threshold == nullptr) {
cat_threshold = new uint32_t[num_cat_threshold];
}
if (num_cat_threshold > 0 && cat_threshold_real == nullptr) {
cat_threshold_real = new int[num_cat_threshold];
}
if (num_cat_threshold > 0) {
if (other.cat_threshold != nullptr) {
for (int i = 0; i < num_cat_threshold; ++i) {
cat_threshold[i] = other.cat_threshold[i];
}
}
if (other.cat_threshold_real != nullptr) {
for (int i = 0; i < num_cat_threshold; ++i) {
cat_threshold_real[i] = other.cat_threshold_real[i];
}
}
}
return *this;
}
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_CUDA_SPLIT_INFO_HPP_
#endif // USE_CUDA_EXP
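// Illustrative device helper, not part of the header above: the comparison a
// best-split reduction needs when merging per-block candidates into a single
// winner. ReplaceIfBetter is a hypothetical name.
__device__ inline void ReplaceIfBetter(LightGBM::CUDASplitInfo* best,
                                       const LightGBM::CUDASplitInfo& candidate) {
  if (candidate.is_valid && (!best->is_valid || candidate.gain > best->gain)) {
    *best = candidate;  // deep-copies categorical thresholds via operator=
  }
}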
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_CUDA_TREE_HPP_
#define LIGHTGBM_CUDA_CUDA_TREE_HPP_
#include <LightGBM/cuda/cuda_column_data.hpp>
#include <LightGBM/cuda/cuda_split_info.hpp>
#include <LightGBM/tree.h>
#include <LightGBM/bin.h>
namespace LightGBM {
__device__ void SetDecisionTypeCUDA(int8_t* decision_type, bool input, int8_t mask);
__device__ void SetMissingTypeCUDA(int8_t* decision_type, int8_t input);
__device__ bool GetDecisionTypeCUDA(int8_t decision_type, int8_t mask);
__device__ int8_t GetMissingTypeCUDA(int8_t decision_type);
__device__ bool IsZeroCUDA(double fval);
class CUDATree : public Tree {
public:
/*!
* \brief Constructor
* \param max_leaves The maximum number of leaves
* \param track_branch_features Whether to keep track of ancestors of leaf nodes
* \param is_linear Whether the tree has linear models at each leaf
* \param gpu_device_id The CUDA device ID on which the tree is stored
* \param has_categorical_feature Whether the dataset contains categorical features
*/
explicit CUDATree(int max_leaves, bool track_branch_features, bool is_linear,
const int gpu_device_id, const bool has_categorical_feature);
explicit CUDATree(const Tree* host_tree);
~CUDATree() noexcept;
int Split(const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info);
int SplitCategorical(
const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
uint32_t* cuda_bitset,
size_t cuda_bitset_len,
uint32_t* cuda_bitset_inner,
size_t cuda_bitset_inner_len);
const int* cuda_leaf_parent() const { return cuda_leaf_parent_; }
const int* cuda_left_child() const { return cuda_left_child_; }
const int* cuda_right_child() const { return cuda_right_child_; }
const int* cuda_split_feature_inner() const { return cuda_split_feature_inner_; }
const int* cuda_split_feature() const { return cuda_split_feature_; }
const uint32_t* cuda_threshold_in_bin() const { return cuda_threshold_in_bin_; }
const double* cuda_threshold() const { return cuda_threshold_; }
const int8_t* cuda_decision_type() const { return cuda_decision_type_; }
const double* cuda_leaf_value() const { return cuda_leaf_value_; }
double* cuda_leaf_value_ref() { return cuda_leaf_value_; }
inline void Shrinkage(double rate) override;
inline void AddBias(double val) override;
void ToHost();
void SyncLeafOutputFromHostToCUDA();
void SyncLeafOutputFromCUDAToHost();
private:
void InitCUDAMemory();
void InitCUDA();
void LaunchSplitKernel(const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info);
void LaunchSplitCategoricalKernel(
const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
size_t cuda_bitset_len,
size_t cuda_bitset_inner_len);
void LaunchShrinkageKernel(const double rate);
void LaunchAddBiasKernel(const double val);
int* cuda_left_child_;
int* cuda_right_child_;
int* cuda_split_feature_inner_;
int* cuda_split_feature_;
int* cuda_leaf_depth_;
int* cuda_leaf_parent_;
uint32_t* cuda_threshold_in_bin_;
double* cuda_threshold_;
double* cuda_internal_weight_;
double* cuda_internal_value_;
int8_t* cuda_decision_type_;
double* cuda_leaf_value_;
data_size_t* cuda_leaf_count_;
double* cuda_leaf_weight_;
data_size_t* cuda_internal_count_;
float* cuda_split_gain_;
CUDAVector<uint32_t> cuda_bitset_;
CUDAVector<uint32_t> cuda_bitset_inner_;
CUDAVector<int> cuda_cat_boundaries_;
CUDAVector<int> cuda_cat_boundaries_inner_;
cudaStream_t cuda_stream_;
const int num_threads_per_block_add_prediction_to_score_;
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_CUDA_TREE_HPP_
#endif // USE_CUDA_EXP
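// Hedged sketch of the split flow a tree learner might drive with this
// class, not from the sources above: apply the device-resident best split to
// the tree; leaf outputs can later be synced back to the host via
// SyncLeafOutputFromCUDAToHost() or ToHost(). Variable names are hypothetical.
inline int ApplyBestSplit(LightGBM::CUDATree* tree,
                          int best_leaf,
                          int real_feature_index,
                          double real_threshold,
                          LightGBM::MissingType missing_type,
                          const LightGBM::CUDASplitInfo* device_best_split) {
  // Split presumably returns the index of the new leaf, mirroring Tree::Split.
  return tree->Split(best_leaf, real_feature_index, real_threshold,
                     missing_type, device_best_split);
}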
/*!
 * Copyright (c) 2020-2021 IBM Corporation, Microsoft Corporation. All rights reserved.
 * Licensed under the MIT License. See LICENSE file in the project root for license information.
 */
#ifndef LIGHTGBM_CUDA_CUDA_UTILS_H_
#define LIGHTGBM_CUDA_CUDA_UTILS_H_
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#endif // USE_CUDA || USE_CUDA_EXP
#ifdef USE_CUDA_EXP
#include <LightGBM/utils/log.h>
#include <vector>
#endif // USE_CUDA_EXP
namespace LightGBM {
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
#define CUDASUCCESS_OR_FATAL(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
if (code != cudaSuccess) {
...
if (abort) exit(code);
}
}
#endif // USE_CUDA || USE_CUDA_EXP
#ifdef USE_CUDA_EXP
#define CUDASUCCESS_OR_FATAL_OUTER(ans) { gpuAssert((ans), file, line); }
void SetCUDADevice(int gpu_device_id, const char* file, int line);
template <typename T>
void AllocateCUDAMemory(T** out_ptr, size_t size, const char* file, const int line) {
void* tmp_ptr = nullptr;
CUDASUCCESS_OR_FATAL_OUTER(cudaMalloc(&tmp_ptr, size * sizeof(T)));
*out_ptr = reinterpret_cast<T*>(tmp_ptr);
}
template <typename T>
void CopyFromHostToCUDADevice(T* dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpy(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyHostToDevice));
}
template <typename T>
void InitCUDAMemoryFromHostMemory(T** dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
AllocateCUDAMemory<T>(dst_ptr, size, file, line);
CopyFromHostToCUDADevice<T>(*dst_ptr, src_ptr, size, file, line);
}
template <typename T>
void CopyFromCUDADeviceToHost(T* dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpy(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyDeviceToHost));
}
template <typename T>
void CopyFromCUDADeviceToHostAsync(T* dst_ptr, const T* src_ptr, size_t size, cudaStream_t stream, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpyAsync(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyDeviceToHost, stream));
}
template <typename T>
void CopyFromCUDADeviceToCUDADevice(T* dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpy(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyDeviceToDevice));
}
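// note: unlike CopyFromCUDADeviceToHostAsync, the overload below takes no
// stream argument, so the copy is issued on the default stream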
template <typename T>
void CopyFromCUDADeviceToCUDADeviceAsync(T* dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpyAsync(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyDeviceToDevice));
}
void SynchronizeCUDADevice(const char* file, const int line);
template <typename T>
void SetCUDAMemory(T* dst_ptr, int value, size_t size, const char* file, const int line) {
CUDASUCCESS_OR_FATAL_OUTER(cudaMemset(reinterpret_cast<void*>(dst_ptr), value, size * sizeof(T)));
SynchronizeCUDADevice(file, line);
}
template <typename T>
void DeallocateCUDAMemory(T** ptr, const char* file, const int line) {
if (*ptr != nullptr) {
CUDASUCCESS_OR_FATAL_OUTER(cudaFree(reinterpret_cast<void*>(*ptr)));
*ptr = nullptr;
}
}
void PrintLastCUDAError();
template <typename T>
class CUDAVector {
public:
CUDAVector() {
size_ = 0;
data_ = nullptr;
}
explicit CUDAVector(size_t size) {
size_ = size;
AllocateCUDAMemory<T>(&data_, size_, __FILE__, __LINE__);
}
void Resize(size_t size) {
if (size == 0) {
Clear();
return;
}
T* new_data = nullptr;
AllocateCUDAMemory<T>(&new_data, size, __FILE__, __LINE__);
if (size_ > 0 && data_ != nullptr) {
// copy only the elements present in both the old and new buffers
const size_t copy_size = size < size_ ? size : size_;
CopyFromCUDADeviceToCUDADevice<T>(new_data, data_, copy_size, __FILE__, __LINE__);
}
DeallocateCUDAMemory<T>(&data_, __FILE__, __LINE__);
data_ = new_data;
size_ = size;
}
void Clear() {
if (size_ > 0 && data_ != nullptr) {
DeallocateCUDAMemory<T>(&data_, __FILE__, __LINE__);
}
size_ = 0;
}
void PushBack(const T* values, size_t len) {
T* new_data = nullptr;
AllocateCUDAMemory<T>(&new_data, size_ + len, __FILE__, __LINE__);
if (size_ > 0 && data_ != nullptr) {
CopyFromCUDADeviceToCUDADevice<T>(new_data, data_, size_, __FILE__, __LINE__);
}
CopyFromCUDADeviceToCUDADevice<T>(new_data + size_, values, len, __FILE__, __LINE__);
DeallocateCUDAMemory<T>(&data_, __FILE__, __LINE__);
size_ += len;
data_ = new_data;
}
size_t Size() {
return size_;
}
~CUDAVector() {
DeallocateCUDAMemory<T>(&data_, __FILE__, __LINE__);
}
std::vector<T> ToHost() {
std::vector<T> host_vector(size_);
if (size_ > 0 && data_ != nullptr) {
CopyFromCUDADeviceToHost(host_vector.data(), data_, size_, __FILE__, __LINE__);
}
return host_vector;
}
T* RawData() {
return data_;
}
private:
T* data_;
size_t size_;
};
#endif // USE_CUDA_EXP
}  // namespace LightGBM
#endif  // LIGHTGBM_CUDA_CUDA_UTILS_H_
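// Minimal usage sketch for the USE_CUDA_EXP helpers above, not from the
// sources: allocate-and-copy in one call, read results back, then free.
// Passing __FILE__/__LINE__ lets CUDASUCCESS_OR_FATAL_OUTER report the
// caller's location when a CUDA call fails.
inline void CopyRoundTripExample() {
  std::vector<float> host = {1.0f, 2.0f, 3.0f};
  float* device_ptr = nullptr;
  LightGBM::InitCUDAMemoryFromHostMemory<float>(&device_ptr, host.data(), host.size(), __FILE__, __LINE__);
  // ... launch kernels that read or write device_ptr ...
  LightGBM::CopyFromCUDADeviceToHost<float>(host.data(), device_ptr, host.size(), __FILE__, __LINE__);
  LightGBM::DeallocateCUDAMemory<float>(&device_ptr, __FILE__, __LINE__);
}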
/*!
 * Copyright (c) 2020 IBM Corporation, Microsoft Corporation. All rights reserved.
 * Licensed under the MIT License. See LICENSE file in the project root for license information.
 */
#ifndef LIGHTGBM_CUDA_VECTOR_CUDAHOST_H_
...
#include <LightGBM/utils/common.h>
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
#include <cuda.h>
#include <cuda_runtime.h>
#endif
...
T* allocate(std::size_t n) {
T* ptr;
if (n == 0) return NULL;
n = SIZE_ALIGNED(n);
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
if (LGBM_config_::current_device == lgbm_device_cuda) {
cudaError_t ret = cudaHostAlloc(&ptr, n*sizeof(T), cudaHostAllocPortable);
if (ret != cudaSuccess) {
...
void deallocate(T* p, std::size_t n) {
(void)n;  // UNUSED
if (p == NULL) return;
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
if (LGBM_config_::current_device == lgbm_device_cuda) {
cudaPointerAttributes attributes;
cudaPointerGetAttributes(&attributes, p);
...