Commit 0bb4a825 authored by Huan Zhang, committed by Guolin Ke

Initial GPU acceleration support for LightGBM (#368)

* add dummy gpu solver code

* initial GPU code

* fix crash bug

* first working version

* use asynchronous copy

* use a better kernel for root

* parallel read histogram

* sparse features now work, but no acceleration; compute on CPU

* compute sparse feature on CPU simultaneously

* fix big bug; add gpu selection; add kernel selection

* better debugging

* clean up

* add feature scatter

* Add sparse_threshold control

* fix a bug in feature scatter

* clean up debug

* temporarily add OpenCL kernels for k=64,256

* fix up CMakeList and definition USE_GPU

* add OpenCL kernels as string literals

* Add boost.compute as a submodule

* add boost dependency into CMakeList

* fix opencl pragma

* use pinned memory for histogram

* use pinned buffer for gradients and hessians

* better debugging message

* add double precision support on GPU

* fix boost version in CMakeList

* Add a README

* reconstruct GPU initialization code for ResetTrainingData

* move data to GPU in parallel

* fix a bug during feature copy

* update gpu kernels

* update gpu code

* initial port to LightGBM v2

* speedup GPU data loading process

* Add 4-bit bin support to GPU

* re-add sparse_threshold parameter

* remove kMaxNumWorkgroups and allows an unlimited number of features

* add feature mask support for skipping unused features

* enable kernel cache

* use GPU kernels without feature masks when all features are used

* README.

* README.

* update README

* fix typos (#349)

* change compile to gcc on Apple as default

* clean vscode related file

* refine api of constructing from sampling data.

* fix bug in the last commit.

* more efficient algorithm to sample k from n.

* fix bug in filter bin

* change to boost_from_average output.

* fix tests.

* only stop training when all classes are finished in multi-class.

* limit the max tree output. change hessian in multi-class objective.

* robust tree model loading.

* fix test.

* convert the probabilities to raw score in boost_from_average of classification.

* fix the average label for binary classification.

* Add boost_from_average to docs (#354)

* don't use "ConvertToRawScore" for self-defined objective function.

* boost_from_average doesn't seem to work well in binary classification; remove it.

* For a better jump link (#355)

* Update Python-API.md

* for a better jump in page

A space is needed between `#` and the header's content according to GitHub's Markdown format [guideline](https://guides.github.com/features/mastering-markdown/).

After adding the spaces, we can jump to the exact position in the page by clicking the link.

* fixed something mentioned by @wxchan

* Update Python-API.md

* add FitByExistingTree.

* adapt GPU tree learner for FitByExistingTree

* avoid NaN output.

* update boost.compute

* fix typos (#361)

* fix broken links (#359)

* update README

* disable GPU acceleration by default

* fix image url

* cleanup debug macro

* remove old README

* do not save sparse_threshold_ in FeatureGroup

* add details for new GPU settings

* ignore submodule when doing pep8 check

* allocate workspace for at least one thread during building Feature4

* move sparse_threshold to class Dataset

* remove duplicated code in GPUTreeLearner::Split

* Remove duplicated code in FindBestThresholds and BeforeFindBestSplit

* do not rebuild ordered gradients and hessians for sparse features

* support feature groups in GPUTreeLearner

* Initial parallel learners with GPU support

* add option device, cleanup code

* clean up FindBestThresholds; add some omp parallel

* constant hessian optimization for GPU

* Fix GPUTreeLearner crash when there are zero features

* use np.testing.assert_almost_equal() to compare lists of floats in tests

* travis for GPU
parent db3d1f89
[submodule "include/boost/compute"]
path = compute
url = https://github.com/boostorg/compute
@@ -11,24 +11,48 @@ before_install:
 - export PATH="$HOME/miniconda/bin:$PATH"
 - conda config --set always_yes yes --set changeps1 no
 - conda update -q conda
+- sudo add-apt-repository ppa:george-edison55/cmake-3.x -y
+- sudo apt-get update -q
+- bash .travis/amd_sdk.sh;
+- tar -xjf AMD-SDK.tar.bz2;
+- AMDAPPSDK=${HOME}/AMDAPPSDK;
+- export OPENCL_VENDOR_PATH=${AMDAPPSDK}/etc/OpenCL/vendors;
+- mkdir -p ${OPENCL_VENDOR_PATH};
+- sh AMD-APP-SDK*.sh --tar -xf -C ${AMDAPPSDK};
+- echo libamdocl64.so > ${OPENCL_VENDOR_PATH}/amdocl64.icd;
+- export LD_LIBRARY_PATH=${AMDAPPSDK}/lib/x86_64:${LD_LIBRARY_PATH};
+- chmod +x ${AMDAPPSDK}/bin/x86_64/clinfo;
+- ${AMDAPPSDK}/bin/x86_64/clinfo;
+- export LIBRARY_PATH="$HOME/miniconda/lib:$LIBRARY_PATH"
+- export LD_RUN_PATH="$HOME/miniconda/lib:$LD_RUN_PATH"
+- export CPLUS_INCLUDE_PATH="$HOME/miniconda/include:$AMDAPPSDK/include/:$CPLUS_INCLUDE_PATH"
 install:
 - sudo apt-get install -y libopenmpi-dev openmpi-bin build-essential
+- sudo apt-get install -y cmake
 - conda install --yes atlas numpy scipy scikit-learn pandas matplotlib
+- conda install --yes -c conda-forge boost=1.63.0
 - pip install pep8
 script:
 - cd $TRAVIS_BUILD_DIR
 - mkdir build && cd build && cmake .. && make -j
 - cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py
 - cd $TRAVIS_BUILD_DIR/python-package && python setup.py install
 - cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py
-- cd $TRAVIS_BUILD_DIR && pep8 --ignore=E501 .
+- cd $TRAVIS_BUILD_DIR && pep8 --ignore=E501 --exclude=./compute .
 - rm -rf build && mkdir build && cd build && cmake -DUSE_MPI=ON ..&& make -j
 - cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py
 - cd $TRAVIS_BUILD_DIR/python-package && python setup.py install
 - cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py
+- cd $TRAVIS_BUILD_DIR
+- rm -rf build && mkdir build && cd build && cmake -DUSE_GPU=ON -DBOOST_ROOT="$HOME/miniconda/" -DOpenCL_INCLUDE_DIR=$AMDAPPSDK/include/ ..
+- sed -i 's/std::string device_type = "cpu";/std::string device_type = "gpu";/' ../include/LightGBM/config.h
+- make -j$(nproc)
+- sed -i 's/std::string device_type = "gpu";/std::string device_type = "cpu";/' ../include/LightGBM/config.h
+- cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py
+- cd $TRAVIS_BUILD_DIR/python-package && python setup.py install
+- cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py
 notifications:
 email: false
......
#!/bin/bash
# Original script from https://github.com/gregvw/amd_sdk/
# Location to get the nonce and file name from
URL="http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/amd-accelerated-parallel-processing-app-sdk/"
URLDOWN="http://developer.amd.com/amd-license-agreement-appsdk/"
NONCE1_STRING='name="amd_developer_central_downloads_page_nonce"'
FILE_STRING='name="f"'
POSTID_STRING='name="post_id"'
NONCE2_STRING='name="amd_developer_central_nonce"'
#For newest FORM=`wget -qO - $URL | sed -n '/download-2/,/64-bit/p'`
FORM=`wget -qO - $URL | sed -n '/download-5/,/64-bit/p'`
# Get nonce from form
NONCE1=`echo $FORM | awk -F ${NONCE1_STRING} '{print $2}'`
NONCE1=`echo $NONCE1 | awk -F'"' '{print $2}'`
echo $NONCE1
# get the postid
POSTID=`echo $FORM | awk -F ${POSTID_STRING} '{print $2}'`
POSTID=`echo $POSTID | awk -F'"' '{print $2}'`
echo $POSTID
# get file name
FILE=`echo $FORM | awk -F ${FILE_STRING} '{print $2}'`
FILE=`echo $FILE | awk -F'"' '{print $2}'`
echo $FILE
FORM=`wget -qO - $URLDOWN --post-data "amd_developer_central_downloads_page_nonce=${NONCE1}&f=${FILE}&post_id=${POSTID}"`
NONCE2=`echo $FORM | awk -F ${NONCE2_STRING} '{print $2}'`
NONCE2=`echo $NONCE2 | awk -F'"' '{print $2}'`
echo $NONCE2
wget --content-disposition --trust-server-names $URLDOWN --post-data "amd_developer_central_nonce=${NONCE2}&f=${FILE}" -O AMD-SDK.tar.bz2;
@@ -9,6 +9,7 @@ PROJECT(lightgbm)
 OPTION(USE_MPI "MPI based parallel learning" OFF)
 OPTION(USE_OPENMP "Enable OpenMP" ON)
+OPTION(USE_GPU "Enable GPU-accelerated training (EXPERIMENTAL)" OFF)
 if(APPLE)
 OPTION(APPLE_OUTPUT_DYLIB "Output dylib shared library" OFF)
@@ -34,8 +35,17 @@ else()
 endif()
 endif(USE_OPENMP)
+if(USE_GPU)
+find_package(OpenCL REQUIRED)
+include_directories(${OpenCL_INCLUDE_DIRS})
+MESSAGE(STATUS "OpenCL include directory:" ${OpenCL_INCLUDE_DIRS})
+find_package(Boost 1.56.0 COMPONENTS filesystem system REQUIRED)
+include_directories(${Boost_INCLUDE_DIRS})
+ADD_DEFINITIONS(-DUSE_GPU)
+endif(USE_GPU)
 if(UNIX OR MINGW OR CYGWIN)
-SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread -O3 -Wall -std=c++11")
+SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread -O3 -Wall -std=c++11 -Wno-ignored-attributes")
 endif()
 if(MSVC)
@@ -65,11 +75,13 @@ endif()
 SET(LightGBM_HEADER_DIR ${PROJECT_SOURCE_DIR}/include)
+SET(BOOST_COMPUTE_HEADER_DIR ${PROJECT_SOURCE_DIR}/compute/include)
 SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_SOURCE_DIR})
 SET(LIBRARY_OUTPUT_PATH ${PROJECT_SOURCE_DIR})
 include_directories (${LightGBM_HEADER_DIR})
+include_directories (${BOOST_COMPUTE_HEADER_DIR})
 if(APPLE)
 if (APPLE_OUTPUT_DYLIB)
@@ -105,6 +117,11 @@ if(USE_MPI)
 TARGET_LINK_LIBRARIES(_lightgbm ${MPI_CXX_LIBRARIES})
 endif(USE_MPI)
+if(USE_GPU)
+TARGET_LINK_LIBRARIES(lightgbm ${OpenCL_LIBRARY} ${Boost_LIBRARIES})
+TARGET_LINK_LIBRARIES(_lightgbm ${OpenCL_LIBRARY} ${Boost_LIBRARIES})
+endif(USE_GPU)
 if(WIN32 AND (MINGW OR CYGWIN))
 TARGET_LINK_LIBRARIES(lightgbm Ws2_32)
 TARGET_LINK_LIBRARIES(_lightgbm Ws2_32)
......
Subproject commit 1380a04582080bbe2364352b336270bc4bfa3025
@@ -59,7 +59,6 @@ public:
 explicit BinMapper(const void* memory);
 ~BinMapper();
-static double kSparseThreshold;
 bool CheckAlign(const BinMapper& other) const {
 if (num_bin_ != other.num_bin_) {
 return false;
@@ -258,6 +257,7 @@ public:
 * \return Bin data
 */
 virtual uint32_t Get(data_size_t idx) = 0;
+virtual uint32_t RawGet(data_size_t idx) = 0;
 virtual void Reset(data_size_t idx) = 0;
 virtual ~BinIterator() = default;
 };
@@ -383,12 +383,13 @@ public:
 * \param num_bin Number of bin
 * \param sparse_rate Sparse rate of this bins( num_bin0/num_data )
 * \param is_enable_sparse True if enable sparse feature
+* \param sparse_threshold Threshold for treating a feature as a sparse feature
 * \param is_sparse Will set to true if this bin is sparse
 * \param default_bin Default bin for zeros value
 * \return The bin data object
 */
 static Bin* CreateBin(data_size_t num_data, int num_bin,
-double sparse_rate, bool is_enable_sparse, bool* is_sparse);
+double sparse_rate, bool is_enable_sparse, double sparse_threshold, bool* is_sparse);
 /*!
 * \brief Create object for bin data of one feature, used for dense feature
......
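The `sparse_threshold` argument added to `Bin::CreateBin` above replaces the removed compile-time constant `BinMapper::kSparseThreshold`. A minimal sketch of the selection rule it controls, with illustrative helper names that are not part of LightGBM:

```cpp
#include <cassert>

// Fraction of zero elements, as computed in FeatureGroup's constructor.
double SparseRate(int cnt_non_zero, int num_data) {
  return 1.0 - static_cast<double>(cnt_non_zero) / num_data;
}

// A sparse bin representation is chosen only when sparse storage is enabled
// and the zero fraction reaches the configured threshold, mirroring the
// updated check in Bin::CreateBin.
bool UseSparseBin(double sparse_rate, bool is_enable_sparse,
                  double sparse_threshold) {
  return is_enable_sparse && sparse_rate >= sparse_threshold;
}
```

With the default threshold of 0.8, a feature with 90% zeros is stored sparsely, while one with 50% zeros stays dense.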
@@ -97,6 +97,11 @@ public:
 int num_iteration_predict = -1;
 bool is_pre_partition = false;
 bool is_enable_sparse = true;
+/*! \brief The threshold of the zero-element percentage for treating a feature as a sparse feature.
+ * Default is 0.8, where a feature is treated as sparse when over 80% of its values are zero.
+ * When set to 1.0, all features are processed as dense features.
+ */
+double sparse_threshold = 0.8;
 bool use_two_round_loading = false;
 bool is_save_binary_file = false;
 bool enable_load_from_binary_file = true;
@@ -188,6 +193,16 @@ public:
 // max_depth < 0 means no limit
 int max_depth = -1;
 int top_k = 20;
+/*! \brief OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform.
+ * Default value is -1, using the system-wide default platform
+ */
+int gpu_platform_id = -1;
+/*! \brief OpenCL device ID in the specified platform. Each GPU in the selected platform has a
+ * unique device ID. Default value is -1, using the default device in the selected platform
+ */
+int gpu_device_id = -1;
+/*! \brief Set to true to use double precision math on GPU (default using single precision) */
+bool gpu_use_dp = false;
 LIGHTGBM_EXPORT void Set(const std::unordered_map<std::string, std::string>& params) override;
 };
@@ -216,11 +231,14 @@ public:
 // only used for the regression. Will boost from the average labels.
 bool boost_from_average = true;
 std::string tree_learner_type = "serial";
+std::string device_type = "cpu";
 TreeConfig tree_config;
 LIGHTGBM_EXPORT void Set(const std::unordered_map<std::string, std::string>& params) override;
 private:
 void GetTreeLearnerType(const std::unordered_map<std::string,
 std::string>& params);
+void GetDeviceType(const std::unordered_map<std::string,
+std::string>& params);
 };
 /*! \brief Config for Network */
......
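The new `gpu_platform_id` and `gpu_device_id` options both default to -1, meaning "use the default platform/device". A minimal sketch of that fallback convention, using a hypothetical helper rather than the PR's actual Boost.Compute selection code:

```cpp
#include <cassert>

// Hypothetical resolution of a user-supplied OpenCL platform or device id:
// a negative id (the -1 default) maps to the first enumerated entry, and an
// out-of-range id is reported as invalid (-1). The real learner performs
// this selection via Boost.Compute against the actual OpenCL runtime.
int ResolveOpenCLId(int requested_id, int available_count) {
  if (available_count <= 0) return -1;             // nothing to select
  if (requested_id < 0) return 0;                  // default: first entry
  if (requested_id >= available_count) return -1;  // invalid selection
  return requested_id;
}
```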
@@ -355,6 +355,9 @@ public:
 inline int Feture2SubFeature(int feature_idx) const {
 return feature2subfeature_[feature_idx];
 }
+inline uint64_t GroupBinBoundary(int group_idx) const {
+return group_bin_boundaries_[group_idx];
+}
 inline uint64_t NumTotalBin() const {
 return group_bin_boundaries_.back();
 }
@@ -421,6 +424,10 @@ public:
 const int sub_feature = feature2subfeature_[i];
 return feature_groups_[group]->bin_mappers_[sub_feature]->num_bin();
 }
+inline int FeatureGroupNumBin(int group) const {
+return feature_groups_[group]->num_total_bin_;
+}
 inline const BinMapper* FeatureBinMapper(int i) const {
 const int group = feature2group_[i];
@@ -428,12 +435,25 @@ public:
 return feature_groups_[group]->bin_mappers_[sub_feature].get();
 }
+inline const Bin* FeatureBin(int i) const {
+const int group = feature2group_[i];
+return feature_groups_[group]->bin_data_.get();
+}
+inline const Bin* FeatureGroupBin(int group) const {
+return feature_groups_[group]->bin_data_.get();
+}
 inline BinIterator* FeatureIterator(int i) const {
 const int group = feature2group_[i];
 const int sub_feature = feature2subfeature_[i];
 return feature_groups_[group]->SubFeatureIterator(sub_feature);
 }
+inline BinIterator* FeatureGroupIterator(int group) const {
+return feature_groups_[group]->FeatureGroupIterator();
+}
 inline double RealThreshold(int i, uint32_t threshold) const {
 const int group = feature2group_[i];
 const int sub_feature = feature2subfeature_[i];
@@ -461,6 +481,9 @@ public:
 /*! \brief Get Number of used features */
 inline int num_features() const { return num_features_; }
+/*! \brief Get Number of feature groups */
+inline int num_feature_groups() const { return num_groups_; }
 /*! \brief Get Number of total features */
 inline int num_total_features() const { return num_total_features_; }
@@ -516,6 +539,8 @@ private:
 Metadata metadata_;
 /*! \brief index of label column */
 int label_idx_ = 0;
+/*! \brief Threshold for treating a feature as a sparse feature */
+double sparse_threshold_;
 /*! \brief store feature names */
 std::vector<std::string> feature_names_;
 /*! \brief store feature names */
......
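The new `GroupBinBoundary` accessor and the existing `NumTotalBin` both read from `group_bin_boundaries_`, which these accessors imply is a cumulative-offset table: entry `g` is the global bin index where feature group `g` starts, and the final entry is the total bin count. A small sketch of that layout, with an illustrative builder function:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the cumulative bin-boundary table implied by GroupBinBoundary()
// and NumTotalBin(): a prefix sum over per-group bin counts, so that
// boundaries[g] is where group g's bins start and boundaries.back() is the
// total number of bins across all groups.
std::vector<uint64_t> BuildGroupBinBoundaries(
    const std::vector<int>& bins_per_group) {
  std::vector<uint64_t> boundaries;
  uint64_t offset = 0;
  boundaries.push_back(offset);
  for (int n : bins_per_group) {
    offset += n;  // accumulate this group's bin count
    boundaries.push_back(offset);
  }
  return boundaries;
}
```

For groups with 4, 8, and 2 bins, the table is {0, 4, 12, 14}, and `NumTotalBin()` would return 14.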
@@ -25,10 +25,11 @@ public:
 * \param bin_mappers Bin mapper for features
 * \param num_data Total number of data
 * \param is_enable_sparse True if enable sparse feature
+* \param sparse_threshold Threshold for treating a feature as a sparse feature
 */
 FeatureGroup(int num_feature,
 std::vector<std::unique_ptr<BinMapper>>& bin_mappers,
-data_size_t num_data, bool is_enable_sparse) : num_feature_(num_feature) {
+data_size_t num_data, double sparse_threshold, bool is_enable_sparse) : num_feature_(num_feature) {
 CHECK(static_cast<int>(bin_mappers.size()) == num_feature);
 // use bin at zero to store default_bin
 num_total_bin_ = 1;
@@ -46,7 +47,7 @@ public:
 }
 double sparse_rate = 1.0f - static_cast<double>(cnt_non_zero) / (num_data);
 bin_data_.reset(Bin::CreateBin(num_data, num_total_bin_,
-sparse_rate, is_enable_sparse, &is_sparse_));
+sparse_rate, is_enable_sparse, sparse_threshold, &is_sparse_));
 }
 /*!
 * \brief Constructor from memory
@@ -120,6 +121,18 @@ public:
 uint32_t default_bin = bin_mappers_[sub_feature]->GetDefaultBin();
 return bin_data_->GetIterator(min_bin, max_bin, default_bin);
 }
+/*!
+ * \brief Returns a BinIterator that can access the entire feature group's raw data.
+ * The RawGet() function of the iterator should be called for best efficiency.
+ * \return A pointer to the BinIterator object
+ */
+inline BinIterator* FeatureGroupIterator() {
+uint32_t min_bin = bin_offsets_[0];
+uint32_t max_bin = bin_offsets_.back() - 1;
+uint32_t default_bin = 0;
+return bin_data_->GetIterator(min_bin, max_bin, default_bin);
+}
 inline data_size_t Split(
 int sub_feature,
......
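The new `FeatureGroupIterator` pairs with the `RawGet()` method added to `BinIterator`: `RawGet` returns the stored group-level bin value directly, while `Get` remaps values into a single subfeature's bin window. A simplified stand-in illustrating the distinction (the real `DenseBinIterator` additionally handles a bias term; all names below are mock types, not LightGBM's):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mock of a dense bin iterator: RawGet returns the stored bin value
// unchanged (what the GPU data-loading path wants for a whole feature
// group), while Get maps values inside [min_bin, max_bin] into a
// subfeature-local range and returns default_bin for values that belong
// to other subfeatures in the group.
struct MockDenseIterator {
  const std::vector<uint32_t>* data;
  uint32_t min_bin, max_bin, default_bin;

  uint32_t RawGet(int idx) const { return (*data)[idx]; }

  uint32_t Get(int idx) const {
    uint32_t v = (*data)[idx];
    if (v >= min_bin && v <= max_bin) return v - min_bin + 1;
    return default_bin;  // value belongs to another subfeature
  }
};
```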
@@ -24,8 +24,9 @@ public:
 /*!
 * \brief Initialize tree learner with training dataset
 * \param train_data The used training data
+* \param is_constant_hessian True if all hessians share the same value
 */
-virtual void Init(const Dataset* train_data) = 0;
+virtual void Init(const Dataset* train_data, bool is_constant_hessian) = 0;
 virtual void ResetTrainingData(const Dataset* train_data) = 0;
@@ -71,10 +72,12 @@ public:
 /*!
 * \brief Create object of tree learner
-* \param type Type of tree learner
+* \param learner_type Type of tree learner
+* \param device_type Type of device
 * \param tree_config config of tree
 */
-static TreeLearner* CreateTreeLearner(const std::string& type,
+static TreeLearner* CreateTreeLearner(const std::string& learner_type,
+const std::string& device_type,
 const TreeConfig* tree_config);
 };
......
@@ -92,10 +92,10 @@ void GBDT::ResetTrainingData(const BoostingConfig* config, const Dataset* train_
 if (train_data_ != train_data && train_data != nullptr) {
 if (tree_learner_ == nullptr) {
-tree_learner_ = std::unique_ptr<TreeLearner>(TreeLearner::CreateTreeLearner(new_config->tree_learner_type, &new_config->tree_config));
+tree_learner_ = std::unique_ptr<TreeLearner>(TreeLearner::CreateTreeLearner(new_config->tree_learner_type, new_config->device_type, &new_config->tree_config));
 }
 // init tree learner
-tree_learner_->Init(train_data);
+tree_learner_->Init(train_data, is_constant_hessian_);
 // push training metrics
 training_metrics_.clear();
......
@@ -339,12 +339,10 @@ template class OrderedSparseBin<uint8_t>;
 template class OrderedSparseBin<uint16_t>;
 template class OrderedSparseBin<uint32_t>;
-double BinMapper::kSparseThreshold = 0.8f;
 Bin* Bin::CreateBin(data_size_t num_data, int num_bin, double sparse_rate,
-bool is_enable_sparse, bool* is_sparse) {
+bool is_enable_sparse, double sparse_threshold, bool* is_sparse) {
 // sparse threshold
-if (sparse_rate >= BinMapper::kSparseThreshold && is_enable_sparse) {
+if (sparse_rate >= sparse_threshold && is_enable_sparse) {
 *is_sparse = true;
 return CreateSparseBin(num_data, num_bin);
 } else {
......
@@ -201,6 +201,7 @@ void IOConfig::Set(const std::unordered_map<std::string, std::string>& params) {
 GetInt(params, "bin_construct_sample_cnt", &bin_construct_sample_cnt);
 GetBool(params, "is_pre_partition", &is_pre_partition);
 GetBool(params, "is_enable_sparse", &is_enable_sparse);
+GetDouble(params, "sparse_threshold", &sparse_threshold);
 GetBool(params, "use_two_round_loading", &use_two_round_loading);
 GetBool(params, "is_save_binary_file", &is_save_binary_file);
 GetBool(params, "enable_load_from_binary_file", &enable_load_from_binary_file);
@@ -305,6 +306,9 @@ void TreeConfig::Set(const std::unordered_map<std::string, std::string>& params)
 GetDouble(params, "histogram_pool_size", &histogram_pool_size);
 GetInt(params, "max_depth", &max_depth);
 GetInt(params, "top_k", &top_k);
+GetInt(params, "gpu_platform_id", &gpu_platform_id);
+GetInt(params, "gpu_device_id", &gpu_device_id);
+GetBool(params, "gpu_use_dp", &gpu_use_dp);
 }
@@ -336,6 +340,7 @@ void BoostingConfig::Set(const std::unordered_map<std::string, std::string>& par
 GetBool(params, "boost_from_average", &boost_from_average);
 CHECK(drop_rate <= 1.0 && drop_rate >= 0.0);
 CHECK(skip_drop <= 1.0 && skip_drop >= 0.0);
+GetDeviceType(params);
 GetTreeLearnerType(params);
 tree_config.Set(params);
 }
@@ -346,6 +351,9 @@ void BoostingConfig::GetTreeLearnerType(const std::unordered_map<std::string, st
 std::transform(value.begin(), value.end(), value.begin(), Common::tolower);
 if (value == std::string("serial")) {
 tree_learner_type = "serial";
+} else if (value == std::string("gpu")) {
+tree_learner_type = "serial";
+device_type = "gpu";
 } else if (value == std::string("feature") || value == std::string("feature_parallel")) {
 tree_learner_type = "feature";
 } else if (value == std::string("data") || value == std::string("data_parallel")) {
@@ -358,6 +366,20 @@ void BoostingConfig::GetTreeLearnerType(const std::unordered_map<std::string, st
 }
 }
+void BoostingConfig::GetDeviceType(const std::unordered_map<std::string, std::string>& params) {
+std::string value;
+if (GetString(params, "device", &value)) {
+std::transform(value.begin(), value.end(), value.begin(), Common::tolower);
+if (value == std::string("cpu")) {
+device_type = "cpu";
+} else if (value == std::string("gpu")) {
+device_type = "gpu";
+} else {
+Log::Fatal("Unknown device type %s", value.c_str());
+}
+}
+}
 void NetworkConfig::Set(const std::unordered_map<std::string, std::string>& params) {
 GetInt(params, "num_machines", &num_machines);
 CHECK(num_machines >= 1);
......
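The config changes above accept `tree_learner=gpu` as an alias that keeps the serial learner but switches `device_type` to "gpu". That remapping can be sketched as a pure function (simplified: the real code also handles the feature/data parallel values and a separate `device` parameter):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Mirror of the alias logic in BoostingConfig::GetTreeLearnerType from this
// commit: "gpu" as a tree_learner value selects the serial learner on the
// gpu device; any other value leaves the current device untouched.
std::pair<std::string, std::string> ParseTreeLearner(
    const std::string& value, const std::string& current_device) {
  if (value == "gpu") {
    return {"serial", "gpu"};  // alias: serial learner, gpu device
  }
  return {value, current_device};
}
```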
@@ -50,6 +50,7 @@ void Dataset::Construct(
 size_t,
 const IOConfig& io_config) {
 num_total_features_ = static_cast<int>(bin_mappers.size());
+sparse_threshold_ = io_config.sparse_threshold;
 // get num_features
 std::vector<int> used_features;
 for (int i = 0; i < static_cast<int>(bin_mappers.size()); ++i) {
@@ -85,7 +86,7 @@ void Dataset::Construct(
 ++cur_fidx;
 }
 feature_groups_.emplace_back(std::unique_ptr<FeatureGroup>(
-new FeatureGroup(cur_cnt_features, cur_bin_mappers, num_data_, io_config.is_enable_sparse)));
+new FeatureGroup(cur_cnt_features, cur_bin_mappers, num_data_, sparse_threshold_, io_config.is_enable_sparse)));
 }
 feature_groups_.shrink_to_fit();
 group_bin_boundaries_.clear();
@@ -129,6 +130,7 @@ void Dataset::CopyFeatureMapperFrom(const Dataset* dataset) {
 feature_groups_.clear();
 num_features_ = dataset->num_features_;
 num_groups_ = dataset->num_groups_;
+sparse_threshold_ = dataset->sparse_threshold_;
 bool is_enable_sparse = false;
 for (int i = 0; i < num_groups_; ++i) {
 if (dataset->feature_groups_[i]->is_sparse_) {
@@ -146,6 +148,7 @@ void Dataset::CopyFeatureMapperFrom(const Dataset* dataset) {
 dataset->feature_groups_[i]->num_feature_,
 bin_mappers,
 num_data_,
+dataset->sparse_threshold_,
 is_enable_sparse));
 }
 feature_groups_.shrink_to_fit();
@@ -165,6 +168,7 @@ void Dataset::CreateValid(const Dataset* dataset) {
 feature_groups_.clear();
 num_features_ = dataset->num_features_;
 num_groups_ = num_features_;
+sparse_threshold_ = dataset->sparse_threshold_;
 bool is_enable_sparse = true;
 feature2group_.clear();
 feature2subfeature_.clear();
@@ -176,6 +180,7 @@ void Dataset::CreateValid(const Dataset* dataset) {
 1,
 bin_mappers,
 num_data_,
+dataset->sparse_threshold_,
 is_enable_sparse));
 feature2group_.push_back(i);
 feature2subfeature_.push_back(0);
......
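The hunks above thread a new `sparse_threshold_` parameter through every `FeatureGroup` constructor call. As a rough sketch of what such a knob controls, the fraction of default-bin (zero) entries in a binned feature can be compared against the threshold to pick a sparse or dense storage format. The helper name and the exact decision rule below are assumptions for illustration, not the real `FeatureGroup` logic:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: decide whether a binned feature should use a sparse
// representation, given how many of its entries fall in the default (zero)
// bin and the sparse_threshold knob threaded through the constructors above.
bool ShouldUseSparseBin(const std::vector<unsigned>& bins, double sparse_threshold) {
  std::size_t zero_cnt = 0;
  for (unsigned b : bins) {
    if (b == 0) ++zero_cnt;  // bin 0 plays the role of the default bin here
  }
  double zero_rate = bins.empty() ? 0.0 : static_cast<double>(zero_cnt) / bins.size();
  return zero_rate >= sparse_threshold;  // mostly-zero features go sparse
}
```

A lower threshold makes more features sparse, trading random-access speed for memory, which is presumably why the commit exposes it as a tunable parameter.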
@@ -25,6 +25,7 @@ public:
       bias_ = 0;
     }
   }
+  inline uint32_t RawGet(data_size_t idx) override;
   inline uint32_t Get(data_size_t idx) override;
   inline void Reset(data_size_t) override { }
 private:
@@ -284,6 +285,11 @@ uint32_t DenseBinIterator<VAL_T>::Get(data_size_t idx) {
   }
 }
+template <typename VAL_T>
+inline uint32_t DenseBinIterator<VAL_T>::RawGet(data_size_t idx) {
+  return bin_data_->data_[idx];
+}
 template <typename VAL_T>
 BinIterator* DenseBin<VAL_T>::GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const {
   return new DenseBinIterator<VAL_T>(this, min_bin, max_bin, default_bin);
...
@@ -23,6 +23,7 @@ public:
       bias_ = 0;
     }
   }
+  inline uint32_t RawGet(data_size_t idx) override;
   inline uint32_t Get(data_size_t idx) override;
   inline void Reset(data_size_t) override { }
 private:
@@ -74,7 +75,7 @@ public:
     }
   }
-  BinIterator* GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const override;
+  inline BinIterator* GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const override;
   void ConstructHistogram(const data_size_t* data_indices, data_size_t num_data,
     const score_t* ordered_gradients, const score_t* ordered_hessians,
@@ -357,7 +358,11 @@ uint32_t Dense4bitsBinIterator::Get(data_size_t idx) {
   }
 }
-BinIterator* Dense4bitsBin::GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const {
+uint32_t Dense4bitsBinIterator::RawGet(data_size_t idx) {
+  return (bin_data_->data_[idx >> 1] >> ((idx & 1) << 2)) & 0xf;
+}
+
+inline BinIterator* Dense4bitsBin::GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const {
   return new Dense4bitsBinIterator(this, min_bin, max_bin, default_bin);
 }
...
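The new `Dense4bitsBinIterator::RawGet` above shows the 4-bit bin layout added for the GPU path: two bin values are packed per byte, `idx >> 1` selects the byte, and `(idx & 1) << 2` selects the low or high nibble. A self-contained sketch of that extraction, with a companion setter (the setter is an illustrative assumption; the real class fills bins during construction):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Read a 4-bit bin: same expression as Dense4bitsBinIterator::RawGet above.
uint32_t Get4bit(const std::vector<uint8_t>& data, int idx) {
  return (data[idx >> 1] >> ((idx & 1) << 2)) & 0xf;
}

// Hypothetical companion setter: clear the target nibble, then OR in the
// new value (even idx -> low nibble, odd idx -> high nibble).
void Set4bit(std::vector<uint8_t>& data, int idx, uint32_t val) {
  int shift = (idx & 1) << 2;
  data[idx >> 1] = static_cast<uint8_t>(
      (data[idx >> 1] & ~(0xf << shift)) | ((val & 0xf) << shift));
}
```

Packing two bins per byte halves the memory traffic for features with at most 16 bins, which matters most on the GPU where histogram construction is bandwidth-bound.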
@@ -38,7 +38,8 @@ public:
     Reset(start_idx);
   }
-  inline VAL_T RawGet(data_size_t idx);
+  inline uint32_t RawGet(data_size_t idx) override;
+  inline VAL_T InnerRawGet(data_size_t idx);
   inline uint32_t Get(data_size_t idx) override {
     VAL_T ret = RawGet(idx);
@@ -152,7 +153,7 @@ public:
   }
   for (data_size_t i = 0; i < num_data; ++i) {
     const data_size_t idx = data_indices[i];
-    VAL_T bin = iterator.RawGet(idx);
+    VAL_T bin = iterator.InnerRawGet(idx);
     if (bin > maxb || bin < minb) {
       default_indices[(*default_count)++] = idx;
     } else if (bin > th) {
@@ -168,7 +169,7 @@ public:
   }
   for (data_size_t i = 0; i < num_data; ++i) {
     const data_size_t idx = data_indices[i];
-    VAL_T bin = iterator.RawGet(idx);
+    VAL_T bin = iterator.InnerRawGet(idx);
     if (bin > maxb || bin < minb) {
       default_indices[(*default_count)++] = idx;
     } else if (bin != th) {
@@ -327,7 +328,7 @@ public:
   // transform to delta array
   data_size_t last_idx = 0;
   for (data_size_t i = 0; i < num_used_indices; ++i) {
-    VAL_T bin = iterator.RawGet(used_indices[i]);
+    VAL_T bin = iterator.InnerRawGet(used_indices[i]);
     if (bin > 0) {
       data_size_t cur_delta = i - last_idx;
       while (cur_delta >= 256) {
@@ -363,7 +364,12 @@ protected:
 };
 template <typename VAL_T>
-inline VAL_T SparseBinIterator<VAL_T>::RawGet(data_size_t idx) {
+inline uint32_t SparseBinIterator<VAL_T>::RawGet(data_size_t idx) {
+  return InnerRawGet(idx);
+}
+template <typename VAL_T>
+inline VAL_T SparseBinIterator<VAL_T>::InnerRawGet(data_size_t idx) {
   while (cur_pos_ < idx) {
     bin_data_->NextNonzero(&i_delta_, &cur_pos_);
   }
...
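The sparse-bin hunks above hint at the underlying storage: nonzero entries are kept as deltas from the previous nonzero index, gaps of 256 or more are split up (the `while (cur_delta >= 256)` loop), and `InnerRawGet` walks forward with `NextNonzero` until it reaches the requested row. A hedged, self-contained sketch of that scheme — the record layout and filler convention below are assumptions, not the exact on-disk format:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each stored record: 8-bit delta from the previous nonzero index, plus the
// bin value. Gaps >= 256 are bridged with filler records (delta 255, bin 0),
// in the spirit of the "while (cur_delta >= 256)" loop above.
struct SparseEntry { uint8_t delta; uint32_t bin; };

std::vector<SparseEntry> DeltaEncode(const std::vector<uint32_t>& bins) {
  std::vector<SparseEntry> out;
  std::size_t last = 0;
  for (std::size_t i = 0; i < bins.size(); ++i) {
    if (bins[i] == 0) continue;            // only nonzero bins are stored
    std::size_t delta = i - last;
    while (delta >= 256) {                 // bridge long gaps with fillers
      out.push_back({255, 0});
      delta -= 255;
    }
    out.push_back({static_cast<uint8_t>(delta), bins[i]});
    last = i;
  }
  return out;
}

// Sequential lookup in the spirit of SparseBinIterator::InnerRawGet: advance
// the running position until it reaches (or passes) idx.
uint32_t SparseGet(const std::vector<SparseEntry>& entries, std::size_t idx) {
  std::size_t pos = 0;
  for (const SparseEntry& e : entries) {
    pos += e.delta;
    if (pos == idx && e.bin != 0) return e.bin;
    if (pos > idx) break;
  }
  return 0;  // default bin for rows that were not stored
}
```

Because lookup is a forward scan, the real iterator caches its position (`cur_pos_`, `i_delta_`) so that monotonically increasing queries, the common case during training, stay O(1) amortized.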
@@ -7,54 +7,59 @@
 namespace LightGBM {
-DataParallelTreeLearner::DataParallelTreeLearner(const TreeConfig* tree_config)
-  :SerialTreeLearner(tree_config) {
+template <typename TREELEARNER_T>
+DataParallelTreeLearner<TREELEARNER_T>::DataParallelTreeLearner(const TreeConfig* tree_config)
+  :TREELEARNER_T(tree_config) {
 }
-DataParallelTreeLearner::~DataParallelTreeLearner() {
+template <typename TREELEARNER_T>
+DataParallelTreeLearner<TREELEARNER_T>::~DataParallelTreeLearner() {
 }
-void DataParallelTreeLearner::Init(const Dataset* train_data) {
+template <typename TREELEARNER_T>
+void DataParallelTreeLearner<TREELEARNER_T>::Init(const Dataset* train_data, bool is_constant_hessian) {
   // initialize SerialTreeLearner
-  SerialTreeLearner::Init(train_data);
+  TREELEARNER_T::Init(train_data, is_constant_hessian);
   // Get local rank and global machine size
   rank_ = Network::rank();
   num_machines_ = Network::num_machines();
   // allocate buffer for communication
-  size_t buffer_size = train_data_->NumTotalBin() * sizeof(HistogramBinEntry);
+  size_t buffer_size = this->train_data_->NumTotalBin() * sizeof(HistogramBinEntry);
   input_buffer_.resize(buffer_size);
   output_buffer_.resize(buffer_size);
-  is_feature_aggregated_.resize(num_features_);
+  is_feature_aggregated_.resize(this->num_features_);
   block_start_.resize(num_machines_);
   block_len_.resize(num_machines_);
-  buffer_write_start_pos_.resize(num_features_);
-  buffer_read_start_pos_.resize(num_features_);
-  global_data_count_in_leaf_.resize(tree_config_->num_leaves);
+  buffer_write_start_pos_.resize(this->num_features_);
+  buffer_read_start_pos_.resize(this->num_features_);
+  global_data_count_in_leaf_.resize(this->tree_config_->num_leaves);
 }
-void DataParallelTreeLearner::ResetConfig(const TreeConfig* tree_config) {
-  SerialTreeLearner::ResetConfig(tree_config);
-  global_data_count_in_leaf_.resize(tree_config_->num_leaves);
+template <typename TREELEARNER_T>
+void DataParallelTreeLearner<TREELEARNER_T>::ResetConfig(const TreeConfig* tree_config) {
+  TREELEARNER_T::ResetConfig(tree_config);
+  global_data_count_in_leaf_.resize(this->tree_config_->num_leaves);
 }
-void DataParallelTreeLearner::BeforeTrain() {
-  SerialTreeLearner::BeforeTrain();
+template <typename TREELEARNER_T>
+void DataParallelTreeLearner<TREELEARNER_T>::BeforeTrain() {
+  TREELEARNER_T::BeforeTrain();
   // generate feature partition for current tree
   std::vector<std::vector<int>> feature_distribution(num_machines_, std::vector<int>());
   std::vector<int> num_bins_distributed(num_machines_, 0);
-  for (int i = 0; i < train_data_->num_total_features(); ++i) {
-    int inner_feature_index = train_data_->InnerFeatureIndex(i);
+  for (int i = 0; i < this->train_data_->num_total_features(); ++i) {
+    int inner_feature_index = this->train_data_->InnerFeatureIndex(i);
     if (inner_feature_index == -1) { continue; }
-    if (is_feature_used_[inner_feature_index]) {
+    if (this->is_feature_used_[inner_feature_index]) {
       int cur_min_machine = static_cast<int>(ArrayArgs<int>::ArgMin(num_bins_distributed));
       feature_distribution[cur_min_machine].push_back(inner_feature_index);
-      auto num_bin = train_data_->FeatureNumBin(inner_feature_index);
-      if (train_data_->FeatureBinMapper(inner_feature_index)->GetDefaultBin() == 0) {
+      auto num_bin = this->train_data_->FeatureNumBin(inner_feature_index);
+      if (this->train_data_->FeatureBinMapper(inner_feature_index)->GetDefaultBin() == 0) {
        num_bin -= 1;
      }
      num_bins_distributed[cur_min_machine] += num_bin;
@@ -71,8 +76,8 @@ void DataParallelTreeLearner::BeforeTrain() {
   for (int i = 0; i < num_machines_; ++i) {
     block_len_[i] = 0;
     for (auto fid : feature_distribution[i]) {
-      auto num_bin = train_data_->FeatureNumBin(fid);
-      if (train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) {
+      auto num_bin = this->train_data_->FeatureNumBin(fid);
+      if (this->train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) {
         num_bin -= 1;
       }
       block_len_[i] += num_bin * sizeof(HistogramBinEntry);
@@ -90,8 +95,8 @@ void DataParallelTreeLearner::BeforeTrain() {
   for (int i = 0; i < num_machines_; ++i) {
     for (auto fid : feature_distribution[i]) {
       buffer_write_start_pos_[fid] = bin_size;
-      auto num_bin = train_data_->FeatureNumBin(fid);
-      if (train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) {
+      auto num_bin = this->train_data_->FeatureNumBin(fid);
+      if (this->train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) {
         num_bin -= 1;
       }
       bin_size += num_bin * sizeof(HistogramBinEntry);
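The loops above lay out the reduce-scatter buffer: each machine owns a contiguous block (`block_start_` / `block_len_`), and each feature gets a byte offset (`buffer_write_start_pos_`) inside the block of the machine that aggregates it. A simplified, self-contained sketch of that layout computation — it uses a stand-in entry size instead of `sizeof(HistogramBinEntry)` and omits the "subtract one bin when the default bin is 0" adjustment:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct BufferLayout {
  std::vector<std::size_t> block_start;      // per-machine block offset
  std::vector<std::size_t> block_len;        // per-machine block length
  std::vector<std::size_t> write_start_pos;  // per-feature offset in the buffer
};

// feature_distribution[m] lists the features aggregated on machine m;
// feature_num_bin[f] is the histogram length of feature f.
BufferLayout ComputeLayout(const std::vector<std::vector<int>>& feature_distribution,
                           const std::vector<std::size_t>& feature_num_bin,
                           std::size_t entry_size) {
  std::size_t num_machines = feature_distribution.size();
  BufferLayout layout;
  layout.block_start.assign(num_machines, 0);
  layout.block_len.assign(num_machines, 0);
  layout.write_start_pos.assign(feature_num_bin.size(), 0);
  // per-machine block length: sum of that machine's feature histogram sizes
  for (std::size_t i = 0; i < num_machines; ++i) {
    for (int fid : feature_distribution[i]) {
      layout.block_len[i] += feature_num_bin[fid] * entry_size;
    }
  }
  // block starts are the prefix sums of the block lengths
  for (std::size_t i = 1; i < num_machines; ++i) {
    layout.block_start[i] = layout.block_start[i - 1] + layout.block_len[i - 1];
  }
  // features are laid out machine by machine, in distribution order
  std::size_t bin_size = 0;
  for (std::size_t i = 0; i < num_machines; ++i) {
    for (int fid : feature_distribution[i]) {
      layout.write_start_pos[fid] = bin_size;
      bin_size += feature_num_bin[fid] * entry_size;
    }
  }
  return layout;
}
```

After the reduce-scatter, machine m holds the fully summed histograms for exactly the features in `feature_distribution[m]`, read back through the matching `buffer_read_start_pos_` offsets.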
@@ -102,16 +107,16 @@ void DataParallelTreeLearner::BeforeTrain() {
   bin_size = 0;
   for (auto fid : feature_distribution[rank_]) {
     buffer_read_start_pos_[fid] = bin_size;
-    auto num_bin = train_data_->FeatureNumBin(fid);
-    if (train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) {
+    auto num_bin = this->train_data_->FeatureNumBin(fid);
+    if (this->train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) {
       num_bin -= 1;
     }
     bin_size += num_bin * sizeof(HistogramBinEntry);
   }
   // sync global data sumup info
-  std::tuple<data_size_t, double, double> data(smaller_leaf_splits_->num_data_in_leaf(),
-    smaller_leaf_splits_->sum_gradients(), smaller_leaf_splits_->sum_hessians());
+  std::tuple<data_size_t, double, double> data(this->smaller_leaf_splits_->num_data_in_leaf(),
+    this->smaller_leaf_splits_->sum_gradients(), this->smaller_leaf_splits_->sum_hessians());
   int size = sizeof(data);
   std::memcpy(input_buffer_.data(), &data, size);
   // global sumup reduce
@@ -134,93 +139,95 @@ void DataParallelTreeLearner::BeforeTrain() {
   // copy back
   std::memcpy(&data, output_buffer_.data(), size);
   // set global sumup info
-  smaller_leaf_splits_->Init(std::get<1>(data), std::get<2>(data));
+  this->smaller_leaf_splits_->Init(std::get<1>(data), std::get<2>(data));
   // init global data count in leaf
   global_data_count_in_leaf_[0] = std::get<0>(data);
 }
-void DataParallelTreeLearner::FindBestThresholds() {
-  ConstructHistograms(is_feature_used_, true);
+template <typename TREELEARNER_T>
+void DataParallelTreeLearner<TREELEARNER_T>::FindBestThresholds() {
+  this->ConstructHistograms(this->is_feature_used_, true);
   // construct local histograms
 #pragma omp parallel for schedule(static)
-  for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
-    if ((!is_feature_used_.empty() && is_feature_used_[feature_index] == false)) continue;
+  for (int feature_index = 0; feature_index < this->num_features_; ++feature_index) {
+    if ((!this->is_feature_used_.empty() && this->is_feature_used_[feature_index] == false)) continue;
     // copy to buffer
     std::memcpy(input_buffer_.data() + buffer_write_start_pos_[feature_index],
-      smaller_leaf_histogram_array_[feature_index].RawData(),
-      smaller_leaf_histogram_array_[feature_index].SizeOfHistgram());
+      this->smaller_leaf_histogram_array_[feature_index].RawData(),
+      this->smaller_leaf_histogram_array_[feature_index].SizeOfHistgram());
   }
   // Reduce scatter for histogram
   Network::ReduceScatter(input_buffer_.data(), reduce_scatter_size_, block_start_.data(),
     block_len_.data(), output_buffer_.data(), &HistogramBinEntry::SumReducer);
-  std::vector<SplitInfo> smaller_best(num_threads_, SplitInfo());
-  std::vector<SplitInfo> larger_best(num_threads_, SplitInfo());
+  std::vector<SplitInfo> smaller_best(this->num_threads_, SplitInfo());
+  std::vector<SplitInfo> larger_best(this->num_threads_, SplitInfo());
   OMP_INIT_EX();
 #pragma omp parallel for schedule(static)
-  for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
+  for (int feature_index = 0; feature_index < this->num_features_; ++feature_index) {
     OMP_LOOP_EX_BEGIN();
     if (!is_feature_aggregated_[feature_index]) continue;
     const int tid = omp_get_thread_num();
     // restore global histograms from buffer
-    smaller_leaf_histogram_array_[feature_index].FromMemory(
+    this->smaller_leaf_histogram_array_[feature_index].FromMemory(
       output_buffer_.data() + buffer_read_start_pos_[feature_index]);
-    train_data_->FixHistogram(feature_index,
-      smaller_leaf_splits_->sum_gradients(), smaller_leaf_splits_->sum_hessians(),
-      GetGlobalDataCountInLeaf(smaller_leaf_splits_->LeafIndex()),
-      smaller_leaf_histogram_array_[feature_index].RawData());
+    this->train_data_->FixHistogram(feature_index,
+      this->smaller_leaf_splits_->sum_gradients(), this->smaller_leaf_splits_->sum_hessians(),
+      GetGlobalDataCountInLeaf(this->smaller_leaf_splits_->LeafIndex()),
+      this->smaller_leaf_histogram_array_[feature_index].RawData());
     SplitInfo smaller_split;
     // find best threshold for smaller child
-    smaller_leaf_histogram_array_[feature_index].FindBestThreshold(
-      smaller_leaf_splits_->sum_gradients(),
-      smaller_leaf_splits_->sum_hessians(),
-      GetGlobalDataCountInLeaf(smaller_leaf_splits_->LeafIndex()),
+    this->smaller_leaf_histogram_array_[feature_index].FindBestThreshold(
+      this->smaller_leaf_splits_->sum_gradients(),
+      this->smaller_leaf_splits_->sum_hessians(),
+      GetGlobalDataCountInLeaf(this->smaller_leaf_splits_->LeafIndex()),
      &smaller_split);
     if (smaller_split.gain > smaller_best[tid].gain) {
       smaller_best[tid] = smaller_split;
-      smaller_best[tid].feature = train_data_->RealFeatureIndex(feature_index);
+      smaller_best[tid].feature = this->train_data_->RealFeatureIndex(feature_index);
     }
     // only root leaf
-    if (larger_leaf_splits_ == nullptr || larger_leaf_splits_->LeafIndex() < 0) continue;
+    if (this->larger_leaf_splits_ == nullptr || this->larger_leaf_splits_->LeafIndex() < 0) continue;
     // construct histograms for larger leaf; we init larger leaf as the parent, so we can just subtract the smaller leaf's histograms
-    larger_leaf_histogram_array_[feature_index].Subtract(
-      smaller_leaf_histogram_array_[feature_index]);
+    this->larger_leaf_histogram_array_[feature_index].Subtract(
+      this->smaller_leaf_histogram_array_[feature_index]);
     SplitInfo larger_split;
     // find best threshold for larger child
-    larger_leaf_histogram_array_[feature_index].FindBestThreshold(
-      larger_leaf_splits_->sum_gradients(),
-      larger_leaf_splits_->sum_hessians(),
-      GetGlobalDataCountInLeaf(larger_leaf_splits_->LeafIndex()),
+    this->larger_leaf_histogram_array_[feature_index].FindBestThreshold(
+      this->larger_leaf_splits_->sum_gradients(),
+      this->larger_leaf_splits_->sum_hessians(),
+      GetGlobalDataCountInLeaf(this->larger_leaf_splits_->LeafIndex()),
      &larger_split);
     if (larger_split.gain > larger_best[tid].gain) {
       larger_best[tid] = larger_split;
-      larger_best[tid].feature = train_data_->RealFeatureIndex(feature_index);
+      larger_best[tid].feature = this->train_data_->RealFeatureIndex(feature_index);
     }
     OMP_LOOP_EX_END();
   }
   OMP_THROW_EX();
   auto smaller_best_idx = ArrayArgs<SplitInfo>::ArgMax(smaller_best);
-  int leaf = smaller_leaf_splits_->LeafIndex();
-  best_split_per_leaf_[leaf] = smaller_best[smaller_best_idx];
-  if (larger_leaf_splits_ == nullptr || larger_leaf_splits_->LeafIndex() < 0) { return; }
-  leaf = larger_leaf_splits_->LeafIndex();
+  int leaf = this->smaller_leaf_splits_->LeafIndex();
+  this->best_split_per_leaf_[leaf] = smaller_best[smaller_best_idx];
+  if (this->larger_leaf_splits_ == nullptr || this->larger_leaf_splits_->LeafIndex() < 0) { return; }
+  leaf = this->larger_leaf_splits_->LeafIndex();
   auto larger_best_idx = ArrayArgs<SplitInfo>::ArgMax(larger_best);
-  best_split_per_leaf_[leaf] = larger_best[larger_best_idx];
+  this->best_split_per_leaf_[leaf] = larger_best[larger_best_idx];
 }
-void DataParallelTreeLearner::FindBestSplitsForLeaves() {
+template <typename TREELEARNER_T>
+void DataParallelTreeLearner<TREELEARNER_T>::FindBestSplitsForLeaves() {
   SplitInfo smaller_best, larger_best;
-  smaller_best = best_split_per_leaf_[smaller_leaf_splits_->LeafIndex()];
+  smaller_best = this->best_split_per_leaf_[this->smaller_leaf_splits_->LeafIndex()];
   // find local best split for larger leaf
-  if (larger_leaf_splits_->LeafIndex() >= 0) {
-    larger_best = best_split_per_leaf_[larger_leaf_splits_->LeafIndex()];
+  if (this->larger_leaf_splits_->LeafIndex() >= 0) {
+    larger_best = this->best_split_per_leaf_[this->larger_leaf_splits_->LeafIndex()];
   }
   // sync global best info
@@ -234,19 +241,23 @@ void DataParallelTreeLearner::FindBestSplitsForLeaves() {
   std::memcpy(&larger_best, output_buffer_.data() + sizeof(SplitInfo), sizeof(SplitInfo));
   // set best split
-  best_split_per_leaf_[smaller_leaf_splits_->LeafIndex()] = smaller_best;
-  if (larger_leaf_splits_->LeafIndex() >= 0) {
-    best_split_per_leaf_[larger_leaf_splits_->LeafIndex()] = larger_best;
+  this->best_split_per_leaf_[this->smaller_leaf_splits_->LeafIndex()] = smaller_best;
+  if (this->larger_leaf_splits_->LeafIndex() >= 0) {
+    this->best_split_per_leaf_[this->larger_leaf_splits_->LeafIndex()] = larger_best;
   }
 }
-void DataParallelTreeLearner::Split(Tree* tree, int best_Leaf, int* left_leaf, int* right_leaf) {
-  SerialTreeLearner::Split(tree, best_Leaf, left_leaf, right_leaf);
-  const SplitInfo& best_split_info = best_split_per_leaf_[best_Leaf];
+template <typename TREELEARNER_T>
+void DataParallelTreeLearner<TREELEARNER_T>::Split(Tree* tree, int best_Leaf, int* left_leaf, int* right_leaf) {
+  TREELEARNER_T::Split(tree, best_Leaf, left_leaf, right_leaf);
+  const SplitInfo& best_split_info = this->best_split_per_leaf_[best_Leaf];
   // need update global number of data in leaf
   global_data_count_in_leaf_[*left_leaf] = best_split_info.left_count;
   global_data_count_in_leaf_[*right_leaf] = best_split_info.right_count;
 }
+// instantiate template classes, otherwise linker cannot find the code
+template class DataParallelTreeLearner<GPUTreeLearner>;
+template class DataParallelTreeLearner<SerialTreeLearner>;
 }  // namespace LightGBM
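The structural change running through this file is that `DataParallelTreeLearner` becomes a template over its base class, so the same parallel logic can sit on top of either `SerialTreeLearner` or the new `GPUTreeLearner`. That is why every inherited member now goes through `this->` (dependent-base name lookup) and why explicit instantiations appear at the bottom. A minimal standalone illustration of the pattern, with simplified stand-in class names:

```cpp
#include <cassert>
#include <string>

// Stand-ins for SerialTreeLearner / GPUTreeLearner.
struct CpuBase {
  std::string Backend() const { return "cpu"; }
};
struct GpuBase {
  std::string Backend() const { return "gpu"; }
};

// The parallel layer is templated over its base, like
// DataParallelTreeLearner<TREELEARNER_T> in the diff.
template <typename BASE_T>
struct ParallelLearner : public BASE_T {
  // `this->` forces lookup into the dependent base; without it the
  // compiler would not find Backend() during the first phase of lookup.
  std::string Describe() const { return "parallel+" + this->Backend(); }
};

// Explicit instantiations, as at the bottom of the .cpp above: without
// them the linker cannot find the template's out-of-line definitions.
template struct ParallelLearner<CpuBase>;
template struct ParallelLearner<GpuBase>;
```

The same two-line instantiation trick keeps the template definitions in a .cpp file instead of a header, at the cost of enumerating every supported base.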
@@ -6,15 +6,20 @@
 namespace LightGBM {
-FeatureParallelTreeLearner::FeatureParallelTreeLearner(const TreeConfig* tree_config)
-  :SerialTreeLearner(tree_config) {
+template <typename TREELEARNER_T>
+FeatureParallelTreeLearner<TREELEARNER_T>::FeatureParallelTreeLearner(const TreeConfig* tree_config)
+  :TREELEARNER_T(tree_config) {
 }
-FeatureParallelTreeLearner::~FeatureParallelTreeLearner() {
+template <typename TREELEARNER_T>
+FeatureParallelTreeLearner<TREELEARNER_T>::~FeatureParallelTreeLearner() {
 }
-void FeatureParallelTreeLearner::Init(const Dataset* train_data) {
-  SerialTreeLearner::Init(train_data);
+template <typename TREELEARNER_T>
+void FeatureParallelTreeLearner<TREELEARNER_T>::Init(const Dataset* train_data, bool is_constant_hessian) {
+  TREELEARNER_T::Init(train_data, is_constant_hessian);
   rank_ = Network::rank();
   num_machines_ = Network::num_machines();
   input_buffer_.resize(sizeof(SplitInfo) * 2);
@@ -22,35 +27,36 @@ void FeatureParallelTreeLearner::Init(const Dataset* train_data) {
 }
-void FeatureParallelTreeLearner::BeforeTrain() {
-  SerialTreeLearner::BeforeTrain();
+template <typename TREELEARNER_T>
+void FeatureParallelTreeLearner<TREELEARNER_T>::BeforeTrain() {
+  TREELEARNER_T::BeforeTrain();
   // get feature partition
   std::vector<std::vector<int>> feature_distribution(num_machines_, std::vector<int>());
   std::vector<int> num_bins_distributed(num_machines_, 0);
-  for (int i = 0; i < train_data_->num_total_features(); ++i) {
-    int inner_feature_index = train_data_->InnerFeatureIndex(i);
+  for (int i = 0; i < this->train_data_->num_total_features(); ++i) {
+    int inner_feature_index = this->train_data_->InnerFeatureIndex(i);
     if (inner_feature_index == -1) { continue; }
-    if (is_feature_used_[inner_feature_index]) {
+    if (this->is_feature_used_[inner_feature_index]) {
       int cur_min_machine = static_cast<int>(ArrayArgs<int>::ArgMin(num_bins_distributed));
       feature_distribution[cur_min_machine].push_back(inner_feature_index);
-      num_bins_distributed[cur_min_machine] += train_data_->FeatureNumBin(inner_feature_index);
-      is_feature_used_[inner_feature_index] = false;
+      num_bins_distributed[cur_min_machine] += this->train_data_->FeatureNumBin(inner_feature_index);
+      this->is_feature_used_[inner_feature_index] = false;
    }
   }
   // get local used features
   for (auto fid : feature_distribution[rank_]) {
-    is_feature_used_[fid] = true;
+    this->is_feature_used_[fid] = true;
   }
 }
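`BeforeTrain` above assigns each feature to the machine that currently holds the fewest total bins (`ArrayArgs<int>::ArgMin`), a greedy balancing of histogram work. A self-contained sketch of that partition step, with `ArgMin` re-implemented via `std::min_element`:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Greedy partition in the spirit of the loop above: each feature goes to the
// machine with the smallest running bin total, roughly balancing the
// per-machine histogram construction cost.
std::vector<std::vector<int>> PartitionFeatures(const std::vector<int>& feature_num_bin,
                                                int num_machines) {
  std::vector<std::vector<int>> feature_distribution(num_machines);
  std::vector<int> num_bins_distributed(num_machines, 0);
  for (int fid = 0; fid < static_cast<int>(feature_num_bin.size()); ++fid) {
    auto it = std::min_element(num_bins_distributed.begin(), num_bins_distributed.end());
    int cur_min_machine = static_cast<int>(it - num_bins_distributed.begin());
    feature_distribution[cur_min_machine].push_back(fid);
    num_bins_distributed[cur_min_machine] += feature_num_bin[fid];
  }
  return feature_distribution;
}
```

In feature-parallel mode each machine then flags only its own partition in `is_feature_used_`, so every machine scans all rows but builds histograms for a disjoint feature subset.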
-void FeatureParallelTreeLearner::FindBestSplitsForLeaves() {
+template <typename TREELEARNER_T>
+void FeatureParallelTreeLearner<TREELEARNER_T>::FindBestSplitsForLeaves() {
   SplitInfo smaller_best, larger_best;
   // get best split at smaller leaf
-  smaller_best = best_split_per_leaf_[smaller_leaf_splits_->LeafIndex()];
+  smaller_best = this->best_split_per_leaf_[this->smaller_leaf_splits_->LeafIndex()];
   // find local best split for larger leaf
-  if (larger_leaf_splits_->LeafIndex() >= 0) {
-    larger_best = best_split_per_leaf_[larger_leaf_splits_->LeafIndex()];
+  if (this->larger_leaf_splits_->LeafIndex() >= 0) {
+    larger_best = this->best_split_per_leaf_[this->larger_leaf_splits_->LeafIndex()];
   }
   // sync global best info
   std::memcpy(input_buffer_.data(), &smaller_best, sizeof(SplitInfo));
@@ -62,10 +68,13 @@ void FeatureParallelTreeLearner::FindBestSplitsForLeaves() {
   std::memcpy(&smaller_best, output_buffer_.data(), sizeof(SplitInfo));
   std::memcpy(&larger_best, output_buffer_.data() + sizeof(SplitInfo), sizeof(SplitInfo));
   // update best split
-  best_split_per_leaf_[smaller_leaf_splits_->LeafIndex()] = smaller_best;
-  if (larger_leaf_splits_->LeafIndex() >= 0) {
-    best_split_per_leaf_[larger_leaf_splits_->LeafIndex()] = larger_best;
+  this->best_split_per_leaf_[this->smaller_leaf_splits_->LeafIndex()] = smaller_best;
+  if (this->larger_leaf_splits_->LeafIndex() >= 0) {
+    this->best_split_per_leaf_[this->larger_leaf_splits_->LeafIndex()] = larger_best;
   }
 }
+// instantiate template classes, otherwise linker cannot find the code
+template class FeatureParallelTreeLearner<GPUTreeLearner>;
+template class FeatureParallelTreeLearner<SerialTreeLearner>;
 }  // namespace LightGBM