Commit 0bb4a825 authored by Huan Zhang's avatar Huan Zhang Committed by Guolin Ke

Initial GPU acceleration support for LightGBM (#368)

* add dummy gpu solver code

* initial GPU code

* fix crash bug

* first working version

* use asynchronous copy

* use a better kernel for root

* parallel read histogram

* sparse features now work, but no acceleration; computed on CPU

* compute sparse feature on CPU simultaneously

* fix big bug; add gpu selection; add kernel selection

* better debugging

* clean up

* add feature scatter

* Add sparse_threshold control

* fix a bug in feature scatter

* clean up debug

* temporarily add OpenCL kernels for k=64,256

* fix up CMakeList and definition USE_GPU

* add OpenCL kernels as string literals

* Add boost.compute as a submodule

* add boost dependency into CMakeList

* fix opencl pragma

* use pinned memory for histogram

* use pinned buffer for gradients and hessians

* better debugging message

* add double precision support on GPU

* fix boost version in CMakeList

* Add a README

* reconstruct GPU initialization code for ResetTrainingData

* move data to GPU in parallel

* fix a bug during feature copy

* update gpu kernels

* update gpu code

* initial port to LightGBM v2

* speedup GPU data loading process

* Add 4-bit bin support to GPU

* re-add sparse_threshold parameter

* remove kMaxNumWorkgroups and allows an unlimited number of features

* add feature mask support for skipping unused features

* enable kernel cache

* use GPU kernels without feature masks when all features are used

* README.

* README.

* update README

* fix typos (#349)

* change compiler to gcc on Apple as default

* clean vscode related file

* refine api of constructing from sampling data.

* fix bug in the last commit.

* more efficient algorithm to sample k from n.

* fix bug in filter bin

* change to boost from average output.

* fix tests.

* only stop training when all classes are finished in multi-class.

* limit the max tree output. change hessian in multi-class objective.

* robust tree model loading.

* fix test.

* convert the probabilities to raw score in boost_from_average of classification.

* fix the average label for binary classification.

* Add boost_from_average to docs (#354)

* don't use "ConvertToRawScore" for self-defined objective function.

* boost_from_average doesn't seem to work well in binary classification; remove it.

* For a better jump link (#355)

* Update Python-API.md

* for a better jump in page

A space is needed between `#` and the header content according to GitHub's Markdown format [guideline](https://guides.github.com/features/mastering-markdown/)

After adding the spaces, we can jump to the exact position on the page by clicking the link.

* fixed something mentioned by @wxchan

* Update Python-API.md

* add FitByExistingTree.

* adapt GPU tree learner for FitByExistingTree

* avoid NaN output.

* update boost.compute

* fix typos (#361)

* fix broken links (#359)

* update README

* disable GPU acceleration by default

* fix image url

* cleanup debug macro

* remove old README

* do not save sparse_threshold_ in FeatureGroup

* add details for new GPU settings

* ignore submodule when doing pep8 check

* allocate workspace for at least one thread during building Feature4

* move sparse_threshold to class Dataset

* remove duplicated code in GPUTreeLearner::Split

* Remove duplicated code in FindBestThresholds and BeforeFindBestSplit

* do not rebuild ordered gradients and hessians for sparse features

* support feature groups in GPUTreeLearner

* Initial parallel learners with GPU support

* add option device, cleanup code

* clean up FindBestThresholds; add some omp parallel

* constant hessian optimization for GPU

* Fix GPUTreeLearner crash when there are zero features

* use np.testing.assert_almost_equal() to compare lists of floats in tests

* travis for GPU
parent db3d1f89
[submodule "include/boost/compute"]
path = compute
url = https://github.com/boostorg/compute
...@@ -11,24 +11,48 @@ before_install: ...@@ -11,24 +11,48 @@ before_install:
- export PATH="$HOME/miniconda/bin:$PATH" - export PATH="$HOME/miniconda/bin:$PATH"
- conda config --set always_yes yes --set changeps1 no - conda config --set always_yes yes --set changeps1 no
- conda update -q conda - conda update -q conda
- sudo add-apt-repository ppa:george-edison55/cmake-3.x -y
- sudo apt-get update -q
- bash .travis/amd_sdk.sh;
- tar -xjf AMD-SDK.tar.bz2;
- AMDAPPSDK=${HOME}/AMDAPPSDK;
- export OPENCL_VENDOR_PATH=${AMDAPPSDK}/etc/OpenCL/vendors;
- mkdir -p ${OPENCL_VENDOR_PATH};
- sh AMD-APP-SDK*.sh --tar -xf -C ${AMDAPPSDK};
- echo libamdocl64.so > ${OPENCL_VENDOR_PATH}/amdocl64.icd;
- export LD_LIBRARY_PATH=${AMDAPPSDK}/lib/x86_64:${LD_LIBRARY_PATH};
- chmod +x ${AMDAPPSDK}/bin/x86_64/clinfo;
- ${AMDAPPSDK}/bin/x86_64/clinfo;
- export LIBRARY_PATH="$HOME/miniconda/lib:$LIBRARY_PATH"
- export LD_RUN_PATH="$HOME/miniconda/lib:$LD_RUN_PATH"
- export CPLUS_INCLUDE_PATH="$HOME/miniconda/include:$AMDAPPSDK/include/:$CPLUS_INCLUDE_PATH"
install: install:
- sudo apt-get install -y libopenmpi-dev openmpi-bin build-essential - sudo apt-get install -y libopenmpi-dev openmpi-bin build-essential
- sudo apt-get install -y cmake
- conda install --yes atlas numpy scipy scikit-learn pandas matplotlib - conda install --yes atlas numpy scipy scikit-learn pandas matplotlib
- conda install --yes -c conda-forge boost=1.63.0
- pip install pep8 - pip install pep8
script: script:
- cd $TRAVIS_BUILD_DIR - cd $TRAVIS_BUILD_DIR
- mkdir build && cd build && cmake .. && make -j - mkdir build && cd build && cmake .. && make -j
- cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py - cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py
- cd $TRAVIS_BUILD_DIR/python-package && python setup.py install - cd $TRAVIS_BUILD_DIR/python-package && python setup.py install
- cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py - cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py
- cd $TRAVIS_BUILD_DIR && pep8 --ignore=E501 . - cd $TRAVIS_BUILD_DIR && pep8 --ignore=E501 --exclude=./compute .
- rm -rf build && mkdir build && cd build && cmake -DUSE_MPI=ON ..&& make -j - rm -rf build && mkdir build && cd build && cmake -DUSE_MPI=ON ..&& make -j
- cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py - cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py
- cd $TRAVIS_BUILD_DIR/python-package && python setup.py install - cd $TRAVIS_BUILD_DIR/python-package && python setup.py install
- cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py - cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py
- cd $TRAVIS_BUILD_DIR
- rm -rf build && mkdir build && cd build && cmake -DUSE_GPU=ON -DBOOST_ROOT="$HOME/miniconda/" -DOpenCL_INCLUDE_DIR=$AMDAPPSDK/include/ ..
- sed -i 's/std::string device_type = "cpu";/std::string device_type = "gpu";/' ../include/LightGBM/config.h
- make -j$(nproc)
- sed -i 's/std::string device_type = "gpu";/std::string device_type = "cpu";/' ../include/LightGBM/config.h
- cd $TRAVIS_BUILD_DIR/tests/c_api_test && python test.py
- cd $TRAVIS_BUILD_DIR/python-package && python setup.py install
- cd $TRAVIS_BUILD_DIR/tests/python_package_test && python test_basic.py && python test_engine.py && python test_sklearn.py && python test_plotting.py
notifications: notifications:
email: false email: false
......
#!/bin/bash
# Original script from https://github.com/gregvw/amd_sdk/
# Location from which get nonce and file name from
URL="http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/amd-accelerated-parallel-processing-app-sdk/"
URLDOWN="http://developer.amd.com/amd-license-agreement-appsdk/"
NONCE1_STRING='name="amd_developer_central_downloads_page_nonce"'
FILE_STRING='name="f"'
POSTID_STRING='name="post_id"'
NONCE2_STRING='name="amd_developer_central_nonce"'
#For newest FORM=`wget -qO - $URL | sed -n '/download-2/,/64-bit/p'`
FORM=`wget -qO - $URL | sed -n '/download-5/,/64-bit/p'`
# Get nonce from form
NONCE1=`echo $FORM | awk -F ${NONCE1_STRING} '{print $2}'`
NONCE1=`echo $NONCE1 | awk -F'"' '{print $2}'`
echo $NONCE1
# get the postid
POSTID=`echo $FORM | awk -F ${POSTID_STRING} '{print $2}'`
POSTID=`echo $POSTID | awk -F'"' '{print $2}'`
echo $POSTID
# get file name
FILE=`echo $FORM | awk -F ${FILE_STRING} '{print $2}'`
FILE=`echo $FILE | awk -F'"' '{print $2}'`
echo $FILE
FORM=`wget -qO - $URLDOWN --post-data "amd_developer_central_downloads_page_nonce=${NONCE1}&f=${FILE}&post_id=${POSTID}"`
NONCE2=`echo $FORM | awk -F ${NONCE2_STRING} '{print $2}'`
NONCE2=`echo $NONCE2 | awk -F'"' '{print $2}'`
echo $NONCE2
wget --content-disposition --trust-server-names $URLDOWN --post-data "amd_developer_central_nonce=${NONCE2}&f=${FILE}" -O AMD-SDK.tar.bz2;
...@@ -9,6 +9,7 @@ PROJECT(lightgbm) ...@@ -9,6 +9,7 @@ PROJECT(lightgbm)
OPTION(USE_MPI "MPI based parallel learning" OFF) OPTION(USE_MPI "MPI based parallel learning" OFF)
OPTION(USE_OPENMP "Enable OpenMP" ON) OPTION(USE_OPENMP "Enable OpenMP" ON)
OPTION(USE_GPU "Enable GPU-accelerated training (EXPERIMENTAL)" OFF)
if(APPLE) if(APPLE)
OPTION(APPLE_OUTPUT_DYLIB "Output dylib shared library" OFF) OPTION(APPLE_OUTPUT_DYLIB "Output dylib shared library" OFF)
...@@ -34,8 +35,17 @@ else() ...@@ -34,8 +35,17 @@ else()
endif() endif()
endif(USE_OPENMP) endif(USE_OPENMP)
if(USE_GPU)
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS})
MESSAGE(STATUS "OpenCL include directory:" ${OpenCL_INCLUDE_DIRS})
find_package(Boost 1.56.0 COMPONENTS filesystem system REQUIRED)
include_directories(${Boost_INCLUDE_DIRS})
ADD_DEFINITIONS(-DUSE_GPU)
endif(USE_GPU)
if(UNIX OR MINGW OR CYGWIN) if(UNIX OR MINGW OR CYGWIN)
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread -O3 -Wall -std=c++11") SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread -O3 -Wall -std=c++11 -Wno-ignored-attributes")
endif() endif()
if(MSVC) if(MSVC)
...@@ -65,11 +75,13 @@ endif() ...@@ -65,11 +75,13 @@ endif()
SET(LightGBM_HEADER_DIR ${PROJECT_SOURCE_DIR}/include) SET(LightGBM_HEADER_DIR ${PROJECT_SOURCE_DIR}/include)
SET(BOOST_COMPUTE_HEADER_DIR ${PROJECT_SOURCE_DIR}/compute/include)
SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_SOURCE_DIR}) SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_SOURCE_DIR})
SET(LIBRARY_OUTPUT_PATH ${PROJECT_SOURCE_DIR}) SET(LIBRARY_OUTPUT_PATH ${PROJECT_SOURCE_DIR})
include_directories (${LightGBM_HEADER_DIR}) include_directories (${LightGBM_HEADER_DIR})
include_directories (${BOOST_COMPUTE_HEADER_DIR})
if(APPLE) if(APPLE)
if (APPLE_OUTPUT_DYLIB) if (APPLE_OUTPUT_DYLIB)
...@@ -105,6 +117,11 @@ if(USE_MPI) ...@@ -105,6 +117,11 @@ if(USE_MPI)
TARGET_LINK_LIBRARIES(_lightgbm ${MPI_CXX_LIBRARIES}) TARGET_LINK_LIBRARIES(_lightgbm ${MPI_CXX_LIBRARIES})
endif(USE_MPI) endif(USE_MPI)
if(USE_GPU)
TARGET_LINK_LIBRARIES(lightgbm ${OpenCL_LIBRARY} ${Boost_LIBRARIES})
TARGET_LINK_LIBRARIES(_lightgbm ${OpenCL_LIBRARY} ${Boost_LIBRARIES})
endif(USE_GPU)
if(WIN32 AND (MINGW OR CYGWIN)) if(WIN32 AND (MINGW OR CYGWIN))
TARGET_LINK_LIBRARIES(lightgbm Ws2_32) TARGET_LINK_LIBRARIES(lightgbm Ws2_32)
TARGET_LINK_LIBRARIES(_lightgbm Ws2_32) TARGET_LINK_LIBRARIES(_lightgbm Ws2_32)
......
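The ADD_DEFINITIONS(-DUSE_GPU) line above is what gates the GPU sources at compile time; a minimal sketch of the guard pattern (gpu_tree_learner.cpp further down in this commit is wrapped in exactly this way):

#ifdef USE_GPU
// Compiled only when CMake is invoked with -DUSE_GPU=ON, which adds the
// USE_GPU definition and links OpenCL and Boost as configured above.
#endif  // USE_GPU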
Subproject commit 1380a04582080bbe2364352b336270bc4bfa3025
...@@ -59,7 +59,6 @@ public: ...@@ -59,7 +59,6 @@ public:
explicit BinMapper(const void* memory); explicit BinMapper(const void* memory);
~BinMapper(); ~BinMapper();
static double kSparseThreshold;
bool CheckAlign(const BinMapper& other) const { bool CheckAlign(const BinMapper& other) const {
if (num_bin_ != other.num_bin_) { if (num_bin_ != other.num_bin_) {
return false; return false;
...@@ -258,6 +257,7 @@ public: ...@@ -258,6 +257,7 @@ public:
* \return Bin data * \return Bin data
*/ */
virtual uint32_t Get(data_size_t idx) = 0; virtual uint32_t Get(data_size_t idx) = 0;
virtual uint32_t RawGet(data_size_t idx) = 0;
virtual void Reset(data_size_t idx) = 0; virtual void Reset(data_size_t idx) = 0;
virtual ~BinIterator() = default; virtual ~BinIterator() = default;
}; };
...@@ -383,12 +383,13 @@ public: ...@@ -383,12 +383,13 @@ public:
* \param num_bin Number of bin * \param num_bin Number of bin
* \param sparse_rate Sparse rate of this bins( num_bin0/num_data ) * \param sparse_rate Sparse rate of this bins( num_bin0/num_data )
* \param is_enable_sparse True if enable sparse feature * \param is_enable_sparse True if enable sparse feature
* \param sparse_threshold Threshold for treating a feature as a sparse feature
* \param is_sparse Will set to true if this bin is sparse * \param is_sparse Will set to true if this bin is sparse
* \param default_bin Default bin for zeros value * \param default_bin Default bin for zeros value
* \return The bin data object * \return The bin data object
*/ */
static Bin* CreateBin(data_size_t num_data, int num_bin, static Bin* CreateBin(data_size_t num_data, int num_bin,
double sparse_rate, bool is_enable_sparse, bool* is_sparse); double sparse_rate, bool is_enable_sparse, double sparse_threshold, bool* is_sparse);
/*! /*!
* \brief Create object for bin data of one feature, used for dense feature * \brief Create object for bin data of one feature, used for dense feature
......
...@@ -97,6 +97,11 @@ public: ...@@ -97,6 +97,11 @@ public:
int num_iteration_predict = -1; int num_iteration_predict = -1;
bool is_pre_partition = false; bool is_pre_partition = false;
bool is_enable_sparse = true; bool is_enable_sparse = true;
/*! \brief The threshold of zero elements percentage for treating a feature as a sparse feature.
* Default is 0.8, where a feature is treated as a sparse feature when there are over 80% zeros.
* When setting to 1.0, all features are processed as dense features.
*/
double sparse_threshold = 0.8;
bool use_two_round_loading = false; bool use_two_round_loading = false;
bool is_save_binary_file = false; bool is_save_binary_file = false;
bool enable_load_from_binary_file = true; bool enable_load_from_binary_file = true;
...@@ -188,6 +193,16 @@ public: ...@@ -188,6 +193,16 @@ public:
// max_depth < 0 means no limit // max_depth < 0 means no limit
int max_depth = -1; int max_depth = -1;
int top_k = 20; int top_k = 20;
/*! \brief OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform.
* Default value is -1, using the system-wide default platform
*/
int gpu_platform_id = -1;
/*! \brief OpenCL device ID in the specified platform. Each GPU in the selected platform has a
* unique device ID. Default value is -1, using the default device in the selected platform
*/
int gpu_device_id = -1;
/*! \brief Set to true to use double precision math on GPU (default using single precision) */
bool gpu_use_dp = false;
LIGHTGBM_EXPORT void Set(const std::unordered_map<std::string, std::string>& params) override; LIGHTGBM_EXPORT void Set(const std::unordered_map<std::string, std::string>& params) override;
}; };
...@@ -216,11 +231,14 @@ public: ...@@ -216,11 +231,14 @@ public:
// only used for the regression. Will boost from the average labels. // only used for the regression. Will boost from the average labels.
bool boost_from_average = true; bool boost_from_average = true;
std::string tree_learner_type = "serial"; std::string tree_learner_type = "serial";
std::string device_type = "cpu";
TreeConfig tree_config; TreeConfig tree_config;
LIGHTGBM_EXPORT void Set(const std::unordered_map<std::string, std::string>& params) override; LIGHTGBM_EXPORT void Set(const std::unordered_map<std::string, std::string>& params) override;
private: private:
void GetTreeLearnerType(const std::unordered_map<std::string, void GetTreeLearnerType(const std::unordered_map<std::string,
std::string>& params); std::string>& params);
void GetDeviceType(const std::unordered_map<std::string,
std::string>& params);
}; };
/*! \brief Config for Network */ /*! \brief Config for Network */
......
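A minimal usage sketch (not part of the commit) of the new parameters, assuming the string-map Set() entry points declared above; the key names match those parsed in config.cpp later in this diff:

#include <string>
#include <unordered_map>
#include <LightGBM/config.h>

void ConfigureGpuTraining() {
  std::unordered_map<std::string, std::string> params = {
    {"device", "gpu"},            // run the tree learner on the GPU
    {"gpu_platform_id", "0"},     // OpenCL platform; -1 keeps the system default
    {"gpu_device_id", "0"},       // OpenCL device inside that platform; -1 keeps the default
    {"gpu_use_dp", "false"},      // single-precision GPU math (the default)
    {"sparse_threshold", "0.8"}   // features with >= 80% zeros stay on the CPU as sparse features
  };
  LightGBM::BoostingConfig boosting_config;
  boosting_config.Set(params);    // fills device_type and the gpu_* fields of tree_config
  LightGBM::IOConfig io_config;
  io_config.Set(params);          // fills sparse_threshold
}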
...@@ -355,6 +355,9 @@ public: ...@@ -355,6 +355,9 @@ public:
inline int Feture2SubFeature(int feature_idx) const { inline int Feture2SubFeature(int feature_idx) const {
return feature2subfeature_[feature_idx]; return feature2subfeature_[feature_idx];
} }
inline uint64_t GroupBinBoundary(int group_idx) const {
return group_bin_boundaries_[group_idx];
}
inline uint64_t NumTotalBin() const { inline uint64_t NumTotalBin() const {
return group_bin_boundaries_.back(); return group_bin_boundaries_.back();
} }
...@@ -421,6 +424,10 @@ public: ...@@ -421,6 +424,10 @@ public:
const int sub_feature = feature2subfeature_[i]; const int sub_feature = feature2subfeature_[i];
return feature_groups_[group]->bin_mappers_[sub_feature]->num_bin(); return feature_groups_[group]->bin_mappers_[sub_feature]->num_bin();
} }
inline int FeatureGroupNumBin(int group) const {
return feature_groups_[group]->num_total_bin_;
}
inline const BinMapper* FeatureBinMapper(int i) const { inline const BinMapper* FeatureBinMapper(int i) const {
const int group = feature2group_[i]; const int group = feature2group_[i];
...@@ -428,12 +435,25 @@ public: ...@@ -428,12 +435,25 @@ public:
return feature_groups_[group]->bin_mappers_[sub_feature].get(); return feature_groups_[group]->bin_mappers_[sub_feature].get();
} }
inline const Bin* FeatureBin(int i) const {
const int group = feature2group_[i];
return feature_groups_[group]->bin_data_.get();
}
inline const Bin* FeatureGroupBin(int group) const {
return feature_groups_[group]->bin_data_.get();
}
inline BinIterator* FeatureIterator(int i) const { inline BinIterator* FeatureIterator(int i) const {
const int group = feature2group_[i]; const int group = feature2group_[i];
const int sub_feature = feature2subfeature_[i]; const int sub_feature = feature2subfeature_[i];
return feature_groups_[group]->SubFeatureIterator(sub_feature); return feature_groups_[group]->SubFeatureIterator(sub_feature);
} }
inline BinIterator* FeatureGroupIterator(int group) const {
return feature_groups_[group]->FeatureGroupIterator();
}
inline double RealThreshold(int i, uint32_t threshold) const { inline double RealThreshold(int i, uint32_t threshold) const {
const int group = feature2group_[i]; const int group = feature2group_[i];
const int sub_feature = feature2subfeature_[i]; const int sub_feature = feature2subfeature_[i];
...@@ -461,6 +481,9 @@ public: ...@@ -461,6 +481,9 @@ public:
/*! \brief Get Number of used features */ /*! \brief Get Number of used features */
inline int num_features() const { return num_features_; } inline int num_features() const { return num_features_; }
/*! \brief Get Number of feature groups */
inline int num_feature_groups() const { return num_groups_;}
/*! \brief Get Number of total features */ /*! \brief Get Number of total features */
inline int num_total_features() const { return num_total_features_; } inline int num_total_features() const { return num_total_features_; }
...@@ -516,6 +539,8 @@ private: ...@@ -516,6 +539,8 @@ private:
Metadata metadata_; Metadata metadata_;
/*! \brief index of label column */ /*! \brief index of label column */
int label_idx_ = 0; int label_idx_ = 0;
/*! \brief Threshold for treating a feature as a sparse feature */
double sparse_threshold_;
/*! \brief store feature names */ /*! \brief store feature names */
std::vector<std::string> feature_names_; std::vector<std::string> feature_names_;
/*! \brief store feature names */ /*! \brief store feature names */
......
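A short sketch (illustrative only, not part of the commit) of how the new group-level accessors fit together, roughly the way the GPU learner walks whole feature groups when filling its device buffers; train_data is assumed to be a const Dataset* and num_data its row count:

for (int gid = 0; gid < train_data->num_feature_groups(); ++gid) {
  const uint64_t bin_start = train_data->GroupBinBoundary(gid);  // group's offset in the dataset-wide bin space
  const int group_bins = train_data->FeatureGroupNumBin(gid);    // number of bins owned by this group
  LightGBM::BinIterator* iter = train_data->FeatureGroupIterator(gid);
  for (LightGBM::data_size_t i = 0; i < num_data; ++i) {
    const uint32_t raw_bin = iter->RawGet(i);  // raw value in [0, group_bins)
    // bin_start + raw_bin addresses this value in the concatenated histogram
  }
  delete iter;  // the iterator returned here is owned by the caller
}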
...@@ -25,10 +25,11 @@ public: ...@@ -25,10 +25,11 @@ public:
* \param bin_mappers Bin mapper for features * \param bin_mappers Bin mapper for features
* \param num_data Total number of data * \param num_data Total number of data
* \param is_enable_sparse True if enable sparse feature * \param is_enable_sparse True if enable sparse feature
* \param sparse_threshold Threshold for treating a feature as a sparse feature
*/ */
FeatureGroup(int num_feature, FeatureGroup(int num_feature,
std::vector<std::unique_ptr<BinMapper>>& bin_mappers, std::vector<std::unique_ptr<BinMapper>>& bin_mappers,
data_size_t num_data, bool is_enable_sparse) : num_feature_(num_feature) { data_size_t num_data, double sparse_threshold, bool is_enable_sparse) : num_feature_(num_feature) {
CHECK(static_cast<int>(bin_mappers.size()) == num_feature); CHECK(static_cast<int>(bin_mappers.size()) == num_feature);
// use bin at zero to store default_bin // use bin at zero to store default_bin
num_total_bin_ = 1; num_total_bin_ = 1;
...@@ -46,7 +47,7 @@ public: ...@@ -46,7 +47,7 @@ public:
} }
double sparse_rate = 1.0f - static_cast<double>(cnt_non_zero) / (num_data); double sparse_rate = 1.0f - static_cast<double>(cnt_non_zero) / (num_data);
bin_data_.reset(Bin::CreateBin(num_data, num_total_bin_, bin_data_.reset(Bin::CreateBin(num_data, num_total_bin_,
sparse_rate, is_enable_sparse, &is_sparse_)); sparse_rate, is_enable_sparse, sparse_threshold, &is_sparse_));
} }
/*! /*!
* \brief Constructor from memory * \brief Constructor from memory
...@@ -120,6 +121,18 @@ public: ...@@ -120,6 +121,18 @@ public:
uint32_t default_bin = bin_mappers_[sub_feature]->GetDefaultBin(); uint32_t default_bin = bin_mappers_[sub_feature]->GetDefaultBin();
return bin_data_->GetIterator(min_bin, max_bin, default_bin); return bin_data_->GetIterator(min_bin, max_bin, default_bin);
} }
/*!
* \brief Returns a BinIterator that can access the entire feature group's raw data.
* The RawGet() function of the iterator should be called for best efficiency.
* \return A pointer to the BinIterator object
*/
inline BinIterator* FeatureGroupIterator() {
uint32_t min_bin = bin_offsets_[0];
uint32_t max_bin = bin_offsets_.back() - 1;
uint32_t default_bin = 0;
return bin_data_->GetIterator(min_bin, max_bin, default_bin);
}
inline data_size_t Split( inline data_size_t Split(
int sub_feature, int sub_feature,
......
...@@ -24,8 +24,9 @@ public: ...@@ -24,8 +24,9 @@ public:
/*! /*!
* \brief Initialize tree learner with training dataset * \brief Initialize tree learner with training dataset
* \param train_data The used training data * \param train_data The used training data
* \param is_constant_hessian True if all hessians share the same value
*/ */
virtual void Init(const Dataset* train_data) = 0; virtual void Init(const Dataset* train_data, bool is_constant_hessian) = 0;
virtual void ResetTrainingData(const Dataset* train_data) = 0; virtual void ResetTrainingData(const Dataset* train_data) = 0;
...@@ -71,10 +72,12 @@ public: ...@@ -71,10 +72,12 @@ public:
/*! /*!
* \brief Create object of tree learner * \brief Create object of tree learner
* \param type Type of tree learner * \param learner_type Type of tree learner
* \param device_type Type of device (cpu or gpu)
* \param tree_config config of tree * \param tree_config config of tree
*/ */
static TreeLearner* CreateTreeLearner(const std::string& type, static TreeLearner* CreateTreeLearner(const std::string& learner_type,
const std::string& device_type,
const TreeConfig* tree_config); const TreeConfig* tree_config);
}; };
......
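A minimal sketch (names illustrative) of the updated factory and Init() signatures; GBDT::ResetTrainingData in the next file makes exactly this pair of calls with values taken from the boosting config:

LightGBM::TreeConfig tree_config;   // normally populated via tree_config.Set(params)
std::unique_ptr<LightGBM::TreeLearner> learner(
    LightGBM::TreeLearner::CreateTreeLearner("serial", "gpu", &tree_config));
learner->Init(train_data, /*is_constant_hessian=*/false);  // hessian constness is now passed at init time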
...@@ -92,10 +92,10 @@ void GBDT::ResetTrainingData(const BoostingConfig* config, const Dataset* train_ ...@@ -92,10 +92,10 @@ void GBDT::ResetTrainingData(const BoostingConfig* config, const Dataset* train_
if (train_data_ != train_data && train_data != nullptr) { if (train_data_ != train_data && train_data != nullptr) {
if (tree_learner_ == nullptr) { if (tree_learner_ == nullptr) {
tree_learner_ = std::unique_ptr<TreeLearner>(TreeLearner::CreateTreeLearner(new_config->tree_learner_type, &new_config->tree_config)); tree_learner_ = std::unique_ptr<TreeLearner>(TreeLearner::CreateTreeLearner(new_config->tree_learner_type, new_config->device_type, &new_config->tree_config));
} }
// init tree learner // init tree learner
tree_learner_->Init(train_data); tree_learner_->Init(train_data, is_constant_hessian_);
// push training metrics // push training metrics
training_metrics_.clear(); training_metrics_.clear();
......
...@@ -339,12 +339,10 @@ template class OrderedSparseBin<uint8_t>; ...@@ -339,12 +339,10 @@ template class OrderedSparseBin<uint8_t>;
template class OrderedSparseBin<uint16_t>; template class OrderedSparseBin<uint16_t>;
template class OrderedSparseBin<uint32_t>; template class OrderedSparseBin<uint32_t>;
double BinMapper::kSparseThreshold = 0.8f;
Bin* Bin::CreateBin(data_size_t num_data, int num_bin, double sparse_rate, Bin* Bin::CreateBin(data_size_t num_data, int num_bin, double sparse_rate,
bool is_enable_sparse, bool* is_sparse) { bool is_enable_sparse, double sparse_threshold, bool* is_sparse) {
// sparse threshold // sparse threshold
if (sparse_rate >= BinMapper::kSparseThreshold && is_enable_sparse) { if (sparse_rate >= sparse_threshold && is_enable_sparse) {
*is_sparse = true; *is_sparse = true;
return CreateSparseBin(num_data, num_bin); return CreateSparseBin(num_data, num_bin);
} else { } else {
......
...@@ -201,6 +201,7 @@ void IOConfig::Set(const std::unordered_map<std::string, std::string>& params) { ...@@ -201,6 +201,7 @@ void IOConfig::Set(const std::unordered_map<std::string, std::string>& params) {
GetInt(params, "bin_construct_sample_cnt", &bin_construct_sample_cnt); GetInt(params, "bin_construct_sample_cnt", &bin_construct_sample_cnt);
GetBool(params, "is_pre_partition", &is_pre_partition); GetBool(params, "is_pre_partition", &is_pre_partition);
GetBool(params, "is_enable_sparse", &is_enable_sparse); GetBool(params, "is_enable_sparse", &is_enable_sparse);
GetDouble(params, "sparse_threshold", &sparse_threshold);
GetBool(params, "use_two_round_loading", &use_two_round_loading); GetBool(params, "use_two_round_loading", &use_two_round_loading);
GetBool(params, "is_save_binary_file", &is_save_binary_file); GetBool(params, "is_save_binary_file", &is_save_binary_file);
GetBool(params, "enable_load_from_binary_file", &enable_load_from_binary_file); GetBool(params, "enable_load_from_binary_file", &enable_load_from_binary_file);
...@@ -305,6 +306,9 @@ void TreeConfig::Set(const std::unordered_map<std::string, std::string>& params) ...@@ -305,6 +306,9 @@ void TreeConfig::Set(const std::unordered_map<std::string, std::string>& params)
GetDouble(params, "histogram_pool_size", &histogram_pool_size); GetDouble(params, "histogram_pool_size", &histogram_pool_size);
GetInt(params, "max_depth", &max_depth); GetInt(params, "max_depth", &max_depth);
GetInt(params, "top_k", &top_k); GetInt(params, "top_k", &top_k);
GetInt(params, "gpu_platform_id", &gpu_platform_id);
GetInt(params, "gpu_device_id", &gpu_device_id);
GetBool(params, "gpu_use_dp", &gpu_use_dp);
} }
...@@ -336,6 +340,7 @@ void BoostingConfig::Set(const std::unordered_map<std::string, std::string>& par ...@@ -336,6 +340,7 @@ void BoostingConfig::Set(const std::unordered_map<std::string, std::string>& par
GetBool(params, "boost_from_average", &boost_from_average); GetBool(params, "boost_from_average", &boost_from_average);
CHECK(drop_rate <= 1.0 && drop_rate >= 0.0); CHECK(drop_rate <= 1.0 && drop_rate >= 0.0);
CHECK(skip_drop <= 1.0 && skip_drop >= 0.0); CHECK(skip_drop <= 1.0 && skip_drop >= 0.0);
GetDeviceType(params);
GetTreeLearnerType(params); GetTreeLearnerType(params);
tree_config.Set(params); tree_config.Set(params);
} }
...@@ -346,6 +351,9 @@ void BoostingConfig::GetTreeLearnerType(const std::unordered_map<std::string, st ...@@ -346,6 +351,9 @@ void BoostingConfig::GetTreeLearnerType(const std::unordered_map<std::string, st
std::transform(value.begin(), value.end(), value.begin(), Common::tolower); std::transform(value.begin(), value.end(), value.begin(), Common::tolower);
if (value == std::string("serial")) { if (value == std::string("serial")) {
tree_learner_type = "serial"; tree_learner_type = "serial";
} else if (value == std::string("gpu")) {
tree_learner_type = "serial";
device_type = "gpu";
} else if (value == std::string("feature") || value == std::string("feature_parallel")) { } else if (value == std::string("feature") || value == std::string("feature_parallel")) {
tree_learner_type = "feature"; tree_learner_type = "feature";
} else if (value == std::string("data") || value == std::string("data_parallel")) { } else if (value == std::string("data") || value == std::string("data_parallel")) {
...@@ -358,6 +366,20 @@ void BoostingConfig::GetTreeLearnerType(const std::unordered_map<std::string, st ...@@ -358,6 +366,20 @@ void BoostingConfig::GetTreeLearnerType(const std::unordered_map<std::string, st
} }
} }
void BoostingConfig::GetDeviceType(const std::unordered_map<std::string, std::string>& params) {
std::string value;
if (GetString(params, "device", &value)) {
std::transform(value.begin(), value.end(), value.begin(), Common::tolower);
if (value == std::string("cpu")) {
device_type = "cpu";
} else if (value == std::string("gpu")) {
device_type = "gpu";
} else {
Log::Fatal("Unknown device type %s", value.c_str());
}
}
}
void NetworkConfig::Set(const std::unordered_map<std::string, std::string>& params) { void NetworkConfig::Set(const std::unordered_map<std::string, std::string>& params) {
GetInt(params, "num_machines", &num_machines); GetInt(params, "num_machines", &num_machines);
CHECK(num_machines >= 1); CHECK(num_machines >= 1);
......
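The GetTreeLearnerType/GetDeviceType logic above also keeps a convenience alias; a small sketch (key names exactly as parsed above) where either assignment alone selects the GPU code path:

std::unordered_map<std::string, std::string> params;
params["device"] = "gpu";        // explicit device selection, handled by GetDeviceType()
params["tree_learner"] = "gpu";  // alias: learner type stays "serial", device_type becomes "gpu"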
...@@ -50,6 +50,7 @@ void Dataset::Construct( ...@@ -50,6 +50,7 @@ void Dataset::Construct(
size_t, size_t,
const IOConfig& io_config) { const IOConfig& io_config) {
num_total_features_ = static_cast<int>(bin_mappers.size()); num_total_features_ = static_cast<int>(bin_mappers.size());
sparse_threshold_ = io_config.sparse_threshold;
// get num_features // get num_features
std::vector<int> used_features; std::vector<int> used_features;
for (int i = 0; i < static_cast<int>(bin_mappers.size()); ++i) { for (int i = 0; i < static_cast<int>(bin_mappers.size()); ++i) {
...@@ -85,7 +86,7 @@ void Dataset::Construct( ...@@ -85,7 +86,7 @@ void Dataset::Construct(
++cur_fidx; ++cur_fidx;
} }
feature_groups_.emplace_back(std::unique_ptr<FeatureGroup>( feature_groups_.emplace_back(std::unique_ptr<FeatureGroup>(
new FeatureGroup(cur_cnt_features, cur_bin_mappers, num_data_, io_config.is_enable_sparse))); new FeatureGroup(cur_cnt_features, cur_bin_mappers, num_data_, sparse_threshold_, io_config.is_enable_sparse)));
} }
feature_groups_.shrink_to_fit(); feature_groups_.shrink_to_fit();
group_bin_boundaries_.clear(); group_bin_boundaries_.clear();
...@@ -129,6 +130,7 @@ void Dataset::CopyFeatureMapperFrom(const Dataset* dataset) { ...@@ -129,6 +130,7 @@ void Dataset::CopyFeatureMapperFrom(const Dataset* dataset) {
feature_groups_.clear(); feature_groups_.clear();
num_features_ = dataset->num_features_; num_features_ = dataset->num_features_;
num_groups_ = dataset->num_groups_; num_groups_ = dataset->num_groups_;
sparse_threshold_ = dataset->sparse_threshold_;
bool is_enable_sparse = false; bool is_enable_sparse = false;
for (int i = 0; i < num_groups_; ++i) { for (int i = 0; i < num_groups_; ++i) {
if (dataset->feature_groups_[i]->is_sparse_) { if (dataset->feature_groups_[i]->is_sparse_) {
...@@ -146,6 +148,7 @@ void Dataset::CopyFeatureMapperFrom(const Dataset* dataset) { ...@@ -146,6 +148,7 @@ void Dataset::CopyFeatureMapperFrom(const Dataset* dataset) {
dataset->feature_groups_[i]->num_feature_, dataset->feature_groups_[i]->num_feature_,
bin_mappers, bin_mappers,
num_data_, num_data_,
dataset->sparse_threshold_,
is_enable_sparse)); is_enable_sparse));
} }
feature_groups_.shrink_to_fit(); feature_groups_.shrink_to_fit();
...@@ -165,6 +168,7 @@ void Dataset::CreateValid(const Dataset* dataset) { ...@@ -165,6 +168,7 @@ void Dataset::CreateValid(const Dataset* dataset) {
feature_groups_.clear(); feature_groups_.clear();
num_features_ = dataset->num_features_; num_features_ = dataset->num_features_;
num_groups_ = num_features_; num_groups_ = num_features_;
sparse_threshold_ = dataset->sparse_threshold_;
bool is_enable_sparse = true; bool is_enable_sparse = true;
feature2group_.clear(); feature2group_.clear();
feature2subfeature_.clear(); feature2subfeature_.clear();
...@@ -176,6 +180,7 @@ void Dataset::CreateValid(const Dataset* dataset) { ...@@ -176,6 +180,7 @@ void Dataset::CreateValid(const Dataset* dataset) {
1, 1,
bin_mappers, bin_mappers,
num_data_, num_data_,
dataset->sparse_threshold_,
is_enable_sparse)); is_enable_sparse));
feature2group_.push_back(i); feature2group_.push_back(i);
feature2subfeature_.push_back(0); feature2subfeature_.push_back(0);
......
...@@ -25,6 +25,7 @@ public: ...@@ -25,6 +25,7 @@ public:
bias_ = 0; bias_ = 0;
} }
} }
inline uint32_t RawGet(data_size_t idx) override;
inline uint32_t Get(data_size_t idx) override; inline uint32_t Get(data_size_t idx) override;
inline void Reset(data_size_t) override { } inline void Reset(data_size_t) override { }
private: private:
...@@ -284,6 +285,11 @@ uint32_t DenseBinIterator<VAL_T>::Get(data_size_t idx) { ...@@ -284,6 +285,11 @@ uint32_t DenseBinIterator<VAL_T>::Get(data_size_t idx) {
} }
} }
template <typename VAL_T>
inline uint32_t DenseBinIterator<VAL_T>::RawGet(data_size_t idx) {
return bin_data_->data_[idx];
}
template <typename VAL_T> template <typename VAL_T>
BinIterator* DenseBin<VAL_T>::GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const { BinIterator* DenseBin<VAL_T>::GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const {
return new DenseBinIterator<VAL_T>(this, min_bin, max_bin, default_bin); return new DenseBinIterator<VAL_T>(this, min_bin, max_bin, default_bin);
......
...@@ -23,6 +23,7 @@ public: ...@@ -23,6 +23,7 @@ public:
bias_ = 0; bias_ = 0;
} }
} }
inline uint32_t RawGet(data_size_t idx) override;
inline uint32_t Get(data_size_t idx) override; inline uint32_t Get(data_size_t idx) override;
inline void Reset(data_size_t) override { } inline void Reset(data_size_t) override { }
private: private:
...@@ -74,7 +75,7 @@ public: ...@@ -74,7 +75,7 @@ public:
} }
} }
BinIterator* GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const override; inline BinIterator* GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const override;
void ConstructHistogram(const data_size_t* data_indices, data_size_t num_data, void ConstructHistogram(const data_size_t* data_indices, data_size_t num_data,
const score_t* ordered_gradients, const score_t* ordered_hessians, const score_t* ordered_gradients, const score_t* ordered_hessians,
...@@ -357,7 +358,11 @@ uint32_t Dense4bitsBinIterator::Get(data_size_t idx) { ...@@ -357,7 +358,11 @@ uint32_t Dense4bitsBinIterator::Get(data_size_t idx) {
} }
} }
BinIterator* Dense4bitsBin::GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const { uint32_t Dense4bitsBinIterator::RawGet(data_size_t idx) {
return (bin_data_->data_[idx >> 1] >> ((idx & 1) << 2)) & 0xf;
}
inline BinIterator* Dense4bitsBin::GetIterator(uint32_t min_bin, uint32_t max_bin, uint32_t default_bin) const {
return new Dense4bitsBinIterator(this, min_bin, max_bin, default_bin); return new Dense4bitsBinIterator(this, min_bin, max_bin, default_bin);
} }
......
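A worked example (values invented) of the nibble arithmetic in Dense4bitsBinIterator::RawGet above: every byte of data_ packs two 4-bit bins, with even row indices in the low nibble and odd row indices in the high nibble:

uint8_t byte = 0xA3;                                  // pretend this is data_[idx >> 1]
uint32_t bin_even = (byte >> ((0 & 1) << 2)) & 0xf;   // even idx: shift by 0, yields 0x3
uint32_t bin_odd  = (byte >> ((1 & 1) << 2)) & 0xf;   // odd idx:  shift by 4, yields 0xA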
...@@ -38,7 +38,8 @@ public: ...@@ -38,7 +38,8 @@ public:
Reset(start_idx); Reset(start_idx);
} }
inline VAL_T RawGet(data_size_t idx); inline uint32_t RawGet(data_size_t idx) override;
inline VAL_T InnerRawGet(data_size_t idx);
inline uint32_t Get( data_size_t idx) override { inline uint32_t Get( data_size_t idx) override {
VAL_T ret = RawGet(idx); VAL_T ret = RawGet(idx);
...@@ -152,7 +153,7 @@ public: ...@@ -152,7 +153,7 @@ public:
} }
for (data_size_t i = 0; i < num_data; ++i) { for (data_size_t i = 0; i < num_data; ++i) {
const data_size_t idx = data_indices[i]; const data_size_t idx = data_indices[i];
VAL_T bin = iterator.RawGet(idx); VAL_T bin = iterator.InnerRawGet(idx);
if (bin > maxb || bin < minb) { if (bin > maxb || bin < minb) {
default_indices[(*default_count)++] = idx; default_indices[(*default_count)++] = idx;
} else if (bin > th) { } else if (bin > th) {
...@@ -168,7 +169,7 @@ public: ...@@ -168,7 +169,7 @@ public:
} }
for (data_size_t i = 0; i < num_data; ++i) { for (data_size_t i = 0; i < num_data; ++i) {
const data_size_t idx = data_indices[i]; const data_size_t idx = data_indices[i];
VAL_T bin = iterator.RawGet(idx); VAL_T bin = iterator.InnerRawGet(idx);
if (bin > maxb || bin < minb) { if (bin > maxb || bin < minb) {
default_indices[(*default_count)++] = idx; default_indices[(*default_count)++] = idx;
} else if (bin != th) { } else if (bin != th) {
...@@ -327,7 +328,7 @@ public: ...@@ -327,7 +328,7 @@ public:
// transform to delta array // transform to delta array
data_size_t last_idx = 0; data_size_t last_idx = 0;
for (data_size_t i = 0; i < num_used_indices; ++i) { for (data_size_t i = 0; i < num_used_indices; ++i) {
VAL_T bin = iterator.RawGet(used_indices[i]); VAL_T bin = iterator.InnerRawGet(used_indices[i]);
if (bin > 0) { if (bin > 0) {
data_size_t cur_delta = i - last_idx; data_size_t cur_delta = i - last_idx;
while (cur_delta >= 256) { while (cur_delta >= 256) {
...@@ -363,7 +364,12 @@ protected: ...@@ -363,7 +364,12 @@ protected:
}; };
template <typename VAL_T> template <typename VAL_T>
inline VAL_T SparseBinIterator<VAL_T>::RawGet(data_size_t idx) { inline uint32_t SparseBinIterator<VAL_T>::RawGet(data_size_t idx) {
return InnerRawGet(idx);
}
template <typename VAL_T>
inline VAL_T SparseBinIterator<VAL_T>::InnerRawGet(data_size_t idx) {
while (cur_pos_ < idx) { while (cur_pos_ < idx) {
bin_data_->NextNonzero(&i_delta_, &cur_pos_); bin_data_->NextNonzero(&i_delta_, &cur_pos_);
} }
......
...@@ -7,54 +7,59 @@ ...@@ -7,54 +7,59 @@
namespace LightGBM { namespace LightGBM {
DataParallelTreeLearner::DataParallelTreeLearner(const TreeConfig* tree_config) template <typename TREELEARNER_T>
:SerialTreeLearner(tree_config) { DataParallelTreeLearner<TREELEARNER_T>::DataParallelTreeLearner(const TreeConfig* tree_config)
:TREELEARNER_T(tree_config) {
} }
DataParallelTreeLearner::~DataParallelTreeLearner() { template <typename TREELEARNER_T>
DataParallelTreeLearner<TREELEARNER_T>::~DataParallelTreeLearner() {
} }
void DataParallelTreeLearner::Init(const Dataset* train_data) { template <typename TREELEARNER_T>
void DataParallelTreeLearner<TREELEARNER_T>::Init(const Dataset* train_data, bool is_constant_hessian) {
// initialize SerialTreeLearner // initialize SerialTreeLearner
SerialTreeLearner::Init(train_data); TREELEARNER_T::Init(train_data, is_constant_hessian);
// Get local rank and global machine size // Get local rank and global machine size
rank_ = Network::rank(); rank_ = Network::rank();
num_machines_ = Network::num_machines(); num_machines_ = Network::num_machines();
// allocate buffer for communication // allocate buffer for communication
size_t buffer_size = train_data_->NumTotalBin() * sizeof(HistogramBinEntry); size_t buffer_size = this->train_data_->NumTotalBin() * sizeof(HistogramBinEntry);
input_buffer_.resize(buffer_size); input_buffer_.resize(buffer_size);
output_buffer_.resize(buffer_size); output_buffer_.resize(buffer_size);
is_feature_aggregated_.resize(num_features_); is_feature_aggregated_.resize(this->num_features_);
block_start_.resize(num_machines_); block_start_.resize(num_machines_);
block_len_.resize(num_machines_); block_len_.resize(num_machines_);
buffer_write_start_pos_.resize(num_features_); buffer_write_start_pos_.resize(this->num_features_);
buffer_read_start_pos_.resize(num_features_); buffer_read_start_pos_.resize(this->num_features_);
global_data_count_in_leaf_.resize(tree_config_->num_leaves); global_data_count_in_leaf_.resize(this->tree_config_->num_leaves);
} }
void DataParallelTreeLearner::ResetConfig(const TreeConfig* tree_config) { template <typename TREELEARNER_T>
SerialTreeLearner::ResetConfig(tree_config); void DataParallelTreeLearner<TREELEARNER_T>::ResetConfig(const TreeConfig* tree_config) {
global_data_count_in_leaf_.resize(tree_config_->num_leaves); TREELEARNER_T::ResetConfig(tree_config);
global_data_count_in_leaf_.resize(this->tree_config_->num_leaves);
} }
void DataParallelTreeLearner::BeforeTrain() { template <typename TREELEARNER_T>
SerialTreeLearner::BeforeTrain(); void DataParallelTreeLearner<TREELEARNER_T>::BeforeTrain() {
TREELEARNER_T::BeforeTrain();
// generate feature partition for current tree // generate feature partition for current tree
std::vector<std::vector<int>> feature_distribution(num_machines_, std::vector<int>()); std::vector<std::vector<int>> feature_distribution(num_machines_, std::vector<int>());
std::vector<int> num_bins_distributed(num_machines_, 0); std::vector<int> num_bins_distributed(num_machines_, 0);
for (int i = 0; i < train_data_->num_total_features(); ++i) { for (int i = 0; i < this->train_data_->num_total_features(); ++i) {
int inner_feature_index = train_data_->InnerFeatureIndex(i); int inner_feature_index = this->train_data_->InnerFeatureIndex(i);
if (inner_feature_index == -1) { continue; } if (inner_feature_index == -1) { continue; }
if (is_feature_used_[inner_feature_index]) { if (this->is_feature_used_[inner_feature_index]) {
int cur_min_machine = static_cast<int>(ArrayArgs<int>::ArgMin(num_bins_distributed)); int cur_min_machine = static_cast<int>(ArrayArgs<int>::ArgMin(num_bins_distributed));
feature_distribution[cur_min_machine].push_back(inner_feature_index); feature_distribution[cur_min_machine].push_back(inner_feature_index);
auto num_bin = train_data_->FeatureNumBin(inner_feature_index); auto num_bin = this->train_data_->FeatureNumBin(inner_feature_index);
if (train_data_->FeatureBinMapper(inner_feature_index)->GetDefaultBin() == 0) { if (this->train_data_->FeatureBinMapper(inner_feature_index)->GetDefaultBin() == 0) {
num_bin -= 1; num_bin -= 1;
} }
num_bins_distributed[cur_min_machine] += num_bin; num_bins_distributed[cur_min_machine] += num_bin;
...@@ -71,8 +76,8 @@ void DataParallelTreeLearner::BeforeTrain() { ...@@ -71,8 +76,8 @@ void DataParallelTreeLearner::BeforeTrain() {
for (int i = 0; i < num_machines_; ++i) { for (int i = 0; i < num_machines_; ++i) {
block_len_[i] = 0; block_len_[i] = 0;
for (auto fid : feature_distribution[i]) { for (auto fid : feature_distribution[i]) {
auto num_bin = train_data_->FeatureNumBin(fid); auto num_bin = this->train_data_->FeatureNumBin(fid);
if (train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) { if (this->train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) {
num_bin -= 1; num_bin -= 1;
} }
block_len_[i] += num_bin * sizeof(HistogramBinEntry); block_len_[i] += num_bin * sizeof(HistogramBinEntry);
...@@ -90,8 +95,8 @@ void DataParallelTreeLearner::BeforeTrain() { ...@@ -90,8 +95,8 @@ void DataParallelTreeLearner::BeforeTrain() {
for (int i = 0; i < num_machines_; ++i) { for (int i = 0; i < num_machines_; ++i) {
for (auto fid : feature_distribution[i]) { for (auto fid : feature_distribution[i]) {
buffer_write_start_pos_[fid] = bin_size; buffer_write_start_pos_[fid] = bin_size;
auto num_bin = train_data_->FeatureNumBin(fid); auto num_bin = this->train_data_->FeatureNumBin(fid);
if (train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) { if (this->train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) {
num_bin -= 1; num_bin -= 1;
} }
bin_size += num_bin * sizeof(HistogramBinEntry); bin_size += num_bin * sizeof(HistogramBinEntry);
...@@ -102,16 +107,16 @@ void DataParallelTreeLearner::BeforeTrain() { ...@@ -102,16 +107,16 @@ void DataParallelTreeLearner::BeforeTrain() {
bin_size = 0; bin_size = 0;
for (auto fid : feature_distribution[rank_]) { for (auto fid : feature_distribution[rank_]) {
buffer_read_start_pos_[fid] = bin_size; buffer_read_start_pos_[fid] = bin_size;
auto num_bin = train_data_->FeatureNumBin(fid); auto num_bin = this->train_data_->FeatureNumBin(fid);
if (train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) { if (this->train_data_->FeatureBinMapper(fid)->GetDefaultBin() == 0) {
num_bin -= 1; num_bin -= 1;
} }
bin_size += num_bin * sizeof(HistogramBinEntry); bin_size += num_bin * sizeof(HistogramBinEntry);
} }
// sync global data sumup info // sync global data sumup info
std::tuple<data_size_t, double, double> data(smaller_leaf_splits_->num_data_in_leaf(), std::tuple<data_size_t, double, double> data(this->smaller_leaf_splits_->num_data_in_leaf(),
smaller_leaf_splits_->sum_gradients(), smaller_leaf_splits_->sum_hessians()); this->smaller_leaf_splits_->sum_gradients(), this->smaller_leaf_splits_->sum_hessians());
int size = sizeof(data); int size = sizeof(data);
std::memcpy(input_buffer_.data(), &data, size); std::memcpy(input_buffer_.data(), &data, size);
// global sumup reduce // global sumup reduce
...@@ -134,93 +139,95 @@ void DataParallelTreeLearner::BeforeTrain() { ...@@ -134,93 +139,95 @@ void DataParallelTreeLearner::BeforeTrain() {
// copy back // copy back
std::memcpy(&data, output_buffer_.data(), size); std::memcpy(&data, output_buffer_.data(), size);
// set global sumup info // set global sumup info
smaller_leaf_splits_->Init(std::get<1>(data), std::get<2>(data)); this->smaller_leaf_splits_->Init(std::get<1>(data), std::get<2>(data));
// init global data count in leaf // init global data count in leaf
global_data_count_in_leaf_[0] = std::get<0>(data); global_data_count_in_leaf_[0] = std::get<0>(data);
} }
void DataParallelTreeLearner::FindBestThresholds() { template <typename TREELEARNER_T>
ConstructHistograms(is_feature_used_, true); void DataParallelTreeLearner<TREELEARNER_T>::FindBestThresholds() {
this->ConstructHistograms(this->is_feature_used_, true);
// construct local histograms // construct local histograms
#pragma omp parallel for schedule(static) #pragma omp parallel for schedule(static)
for (int feature_index = 0; feature_index < num_features_; ++feature_index) { for (int feature_index = 0; feature_index < this->num_features_; ++feature_index) {
if ((!is_feature_used_.empty() && is_feature_used_[feature_index] == false)) continue; if ((!this->is_feature_used_.empty() && this->is_feature_used_[feature_index] == false)) continue;
// copy to buffer // copy to buffer
std::memcpy(input_buffer_.data() + buffer_write_start_pos_[feature_index], std::memcpy(input_buffer_.data() + buffer_write_start_pos_[feature_index],
smaller_leaf_histogram_array_[feature_index].RawData(), this->smaller_leaf_histogram_array_[feature_index].RawData(),
smaller_leaf_histogram_array_[feature_index].SizeOfHistgram()); this->smaller_leaf_histogram_array_[feature_index].SizeOfHistgram());
} }
// Reduce scatter for histogram // Reduce scatter for histogram
Network::ReduceScatter(input_buffer_.data(), reduce_scatter_size_, block_start_.data(), Network::ReduceScatter(input_buffer_.data(), reduce_scatter_size_, block_start_.data(),
block_len_.data(), output_buffer_.data(), &HistogramBinEntry::SumReducer); block_len_.data(), output_buffer_.data(), &HistogramBinEntry::SumReducer);
std::vector<SplitInfo> smaller_best(num_threads_, SplitInfo()); std::vector<SplitInfo> smaller_best(this->num_threads_, SplitInfo());
std::vector<SplitInfo> larger_best(num_threads_, SplitInfo()); std::vector<SplitInfo> larger_best(this->num_threads_, SplitInfo());
OMP_INIT_EX(); OMP_INIT_EX();
#pragma omp parallel for schedule(static) #pragma omp parallel for schedule(static)
for (int feature_index = 0; feature_index < num_features_; ++feature_index) { for (int feature_index = 0; feature_index < this->num_features_; ++feature_index) {
OMP_LOOP_EX_BEGIN(); OMP_LOOP_EX_BEGIN();
if (!is_feature_aggregated_[feature_index]) continue; if (!is_feature_aggregated_[feature_index]) continue;
const int tid = omp_get_thread_num(); const int tid = omp_get_thread_num();
// restore global histograms from buffer // restore global histograms from buffer
smaller_leaf_histogram_array_[feature_index].FromMemory( this->smaller_leaf_histogram_array_[feature_index].FromMemory(
output_buffer_.data() + buffer_read_start_pos_[feature_index]); output_buffer_.data() + buffer_read_start_pos_[feature_index]);
train_data_->FixHistogram(feature_index, this->train_data_->FixHistogram(feature_index,
smaller_leaf_splits_->sum_gradients(), smaller_leaf_splits_->sum_hessians(), this->smaller_leaf_splits_->sum_gradients(), this->smaller_leaf_splits_->sum_hessians(),
GetGlobalDataCountInLeaf(smaller_leaf_splits_->LeafIndex()), GetGlobalDataCountInLeaf(this->smaller_leaf_splits_->LeafIndex()),
smaller_leaf_histogram_array_[feature_index].RawData()); this->smaller_leaf_histogram_array_[feature_index].RawData());
SplitInfo smaller_split; SplitInfo smaller_split;
// find best threshold for smaller child // find best threshold for smaller child
smaller_leaf_histogram_array_[feature_index].FindBestThreshold( this->smaller_leaf_histogram_array_[feature_index].FindBestThreshold(
smaller_leaf_splits_->sum_gradients(), this->smaller_leaf_splits_->sum_gradients(),
smaller_leaf_splits_->sum_hessians(), this->smaller_leaf_splits_->sum_hessians(),
GetGlobalDataCountInLeaf(smaller_leaf_splits_->LeafIndex()), GetGlobalDataCountInLeaf(this->smaller_leaf_splits_->LeafIndex()),
&smaller_split); &smaller_split);
if (smaller_split.gain > smaller_best[tid].gain) { if (smaller_split.gain > smaller_best[tid].gain) {
smaller_best[tid] = smaller_split; smaller_best[tid] = smaller_split;
smaller_best[tid].feature = train_data_->RealFeatureIndex(feature_index); smaller_best[tid].feature = this->train_data_->RealFeatureIndex(feature_index);
} }
// only root leaf // only root leaf
if (larger_leaf_splits_ == nullptr || larger_leaf_splits_->LeafIndex() < 0) continue; if (this->larger_leaf_splits_ == nullptr || this->larger_leaf_splits_->LeafIndex() < 0) continue;
// construct histgroms for large leaf, we init larger leaf as the parent, so we can just subtract the smaller leaf's histograms // construct histgroms for large leaf, we init larger leaf as the parent, so we can just subtract the smaller leaf's histograms
larger_leaf_histogram_array_[feature_index].Subtract( this->larger_leaf_histogram_array_[feature_index].Subtract(
smaller_leaf_histogram_array_[feature_index]); this->smaller_leaf_histogram_array_[feature_index]);
SplitInfo larger_split; SplitInfo larger_split;
// find best threshold for larger child // find best threshold for larger child
larger_leaf_histogram_array_[feature_index].FindBestThreshold( this->larger_leaf_histogram_array_[feature_index].FindBestThreshold(
larger_leaf_splits_->sum_gradients(), this->larger_leaf_splits_->sum_gradients(),
larger_leaf_splits_->sum_hessians(), this->larger_leaf_splits_->sum_hessians(),
GetGlobalDataCountInLeaf(larger_leaf_splits_->LeafIndex()), GetGlobalDataCountInLeaf(this->larger_leaf_splits_->LeafIndex()),
&larger_split); &larger_split);
if (larger_split.gain > larger_best[tid].gain) { if (larger_split.gain > larger_best[tid].gain) {
larger_best[tid] = larger_split; larger_best[tid] = larger_split;
larger_best[tid].feature = train_data_->RealFeatureIndex(feature_index); larger_best[tid].feature = this->train_data_->RealFeatureIndex(feature_index);
} }
OMP_LOOP_EX_END(); OMP_LOOP_EX_END();
} }
OMP_THROW_EX(); OMP_THROW_EX();
auto smaller_best_idx = ArrayArgs<SplitInfo>::ArgMax(smaller_best); auto smaller_best_idx = ArrayArgs<SplitInfo>::ArgMax(smaller_best);
int leaf = smaller_leaf_splits_->LeafIndex(); int leaf = this->smaller_leaf_splits_->LeafIndex();
best_split_per_leaf_[leaf] = smaller_best[smaller_best_idx]; this->best_split_per_leaf_[leaf] = smaller_best[smaller_best_idx];
if (larger_leaf_splits_ == nullptr || larger_leaf_splits_->LeafIndex() < 0) { return; } if (this->larger_leaf_splits_ == nullptr || this->larger_leaf_splits_->LeafIndex() < 0) { return; }
leaf = larger_leaf_splits_->LeafIndex(); leaf = this->larger_leaf_splits_->LeafIndex();
auto larger_best_idx = ArrayArgs<SplitInfo>::ArgMax(larger_best); auto larger_best_idx = ArrayArgs<SplitInfo>::ArgMax(larger_best);
best_split_per_leaf_[leaf] = larger_best[larger_best_idx]; this->best_split_per_leaf_[leaf] = larger_best[larger_best_idx];
} }
void DataParallelTreeLearner::FindBestSplitsForLeaves() { template <typename TREELEARNER_T>
void DataParallelTreeLearner<TREELEARNER_T>::FindBestSplitsForLeaves() {
SplitInfo smaller_best, larger_best; SplitInfo smaller_best, larger_best;
smaller_best = best_split_per_leaf_[smaller_leaf_splits_->LeafIndex()]; smaller_best = this->best_split_per_leaf_[this->smaller_leaf_splits_->LeafIndex()];
// find local best split for larger leaf // find local best split for larger leaf
if (larger_leaf_splits_->LeafIndex() >= 0) { if (this->larger_leaf_splits_->LeafIndex() >= 0) {
larger_best = best_split_per_leaf_[larger_leaf_splits_->LeafIndex()]; larger_best = this->best_split_per_leaf_[this->larger_leaf_splits_->LeafIndex()];
} }
// sync global best info // sync global best info
...@@ -234,19 +241,23 @@ void DataParallelTreeLearner::FindBestSplitsForLeaves() { ...@@ -234,19 +241,23 @@ void DataParallelTreeLearner::FindBestSplitsForLeaves() {
std::memcpy(&larger_best, output_buffer_.data() + sizeof(SplitInfo), sizeof(SplitInfo)); std::memcpy(&larger_best, output_buffer_.data() + sizeof(SplitInfo), sizeof(SplitInfo));
// set best split // set best split
best_split_per_leaf_[smaller_leaf_splits_->LeafIndex()] = smaller_best; this->best_split_per_leaf_[this->smaller_leaf_splits_->LeafIndex()] = smaller_best;
if (larger_leaf_splits_->LeafIndex() >= 0) { if (this->larger_leaf_splits_->LeafIndex() >= 0) {
best_split_per_leaf_[larger_leaf_splits_->LeafIndex()] = larger_best; this->best_split_per_leaf_[this->larger_leaf_splits_->LeafIndex()] = larger_best;
} }
} }
void DataParallelTreeLearner::Split(Tree* tree, int best_Leaf, int* left_leaf, int* right_leaf) { template <typename TREELEARNER_T>
SerialTreeLearner::Split(tree, best_Leaf, left_leaf, right_leaf); void DataParallelTreeLearner<TREELEARNER_T>::Split(Tree* tree, int best_Leaf, int* left_leaf, int* right_leaf) {
const SplitInfo& best_split_info = best_split_per_leaf_[best_Leaf]; TREELEARNER_T::Split(tree, best_Leaf, left_leaf, right_leaf);
const SplitInfo& best_split_info = this->best_split_per_leaf_[best_Leaf];
// need update global number of data in leaf // need update global number of data in leaf
global_data_count_in_leaf_[*left_leaf] = best_split_info.left_count; global_data_count_in_leaf_[*left_leaf] = best_split_info.left_count;
global_data_count_in_leaf_[*right_leaf] = best_split_info.right_count; global_data_count_in_leaf_[*right_leaf] = best_split_info.right_count;
} }
// instantiate template classes, otherwise linker cannot find the code
template class DataParallelTreeLearner<GPUTreeLearner>;
template class DataParallelTreeLearner<SerialTreeLearner>;
} // namespace LightGBM } // namespace LightGBM
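This refactoring (mirrored in feature_parallel_tree_learner.cpp below) turns the parallel learners into class templates over the underlying single-machine learner, so the same network logic runs on top of either SerialTreeLearner or GPUTreeLearner. Because the base class is now a dependent type, its members must be reached through this-> (hence the many this->train_data_ style edits), and the explicit instantiations at the end of the translation unit give the linker concrete symbols. A compressed sketch of the pattern, with bodies elided:

template <typename TREELEARNER_T>
class DataParallelTreeLearner : public TREELEARNER_T {
 public:
  explicit DataParallelTreeLearner(const TreeConfig* tree_config) : TREELEARNER_T(tree_config) {}
  void Init(const Dataset* train_data, bool is_constant_hessian) override {
    TREELEARNER_T::Init(train_data, is_constant_hessian);
    // members of the dependent base need explicit qualification:
    input_buffer_.resize(this->train_data_->NumTotalBin() * sizeof(HistogramBinEntry));
  }
 private:
  std::vector<char> input_buffer_;
};
// instantiate template classes, otherwise the linker cannot find the code
template class DataParallelTreeLearner<SerialTreeLearner>;
template class DataParallelTreeLearner<GPUTreeLearner>;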
@@ -6,15 +6,20 @@
namespace LightGBM {
template <typename TREELEARNER_T>
FeatureParallelTreeLearner<TREELEARNER_T>::FeatureParallelTreeLearner(const TreeConfig* tree_config)
:TREELEARNER_T(tree_config) {
}
template <typename TREELEARNER_T>
FeatureParallelTreeLearner<TREELEARNER_T>::~FeatureParallelTreeLearner() {
}
template <typename TREELEARNER_T>
void FeatureParallelTreeLearner<TREELEARNER_T>::Init(const Dataset* train_data, bool is_constant_hessian) {
TREELEARNER_T::Init(train_data, is_constant_hessian);
rank_ = Network::rank();
num_machines_ = Network::num_machines();
input_buffer_.resize(sizeof(SplitInfo) * 2);
@@ -22,35 +27,36 @@ void FeatureParallelTreeLearner::Init(const Dataset* train_data) {
}
template <typename TREELEARNER_T>
void FeatureParallelTreeLearner<TREELEARNER_T>::BeforeTrain() {
TREELEARNER_T::BeforeTrain();
// get feature partition
std::vector<std::vector<int>> feature_distribution(num_machines_, std::vector<int>());
std::vector<int> num_bins_distributed(num_machines_, 0);
for (int i = 0; i < this->train_data_->num_total_features(); ++i) {
int inner_feature_index = this->train_data_->InnerFeatureIndex(i);
if (inner_feature_index == -1) { continue; }
if (this->is_feature_used_[inner_feature_index]) {
int cur_min_machine = static_cast<int>(ArrayArgs<int>::ArgMin(num_bins_distributed));
feature_distribution[cur_min_machine].push_back(inner_feature_index);
num_bins_distributed[cur_min_machine] += this->train_data_->FeatureNumBin(inner_feature_index);
this->is_feature_used_[inner_feature_index] = false;
}
}
// get local used features
for (auto fid : feature_distribution[rank_]) {
this->is_feature_used_[fid] = true;
}
}
template <typename TREELEARNER_T>
void FeatureParallelTreeLearner<TREELEARNER_T>::FindBestSplitsForLeaves() {
SplitInfo smaller_best, larger_best;
// get best split at smaller leaf
smaller_best = this->best_split_per_leaf_[this->smaller_leaf_splits_->LeafIndex()];
// find local best split for larger leaf
if (this->larger_leaf_splits_->LeafIndex() >= 0) {
larger_best = this->best_split_per_leaf_[this->larger_leaf_splits_->LeafIndex()];
}
// sync global best info
std::memcpy(input_buffer_.data(), &smaller_best, sizeof(SplitInfo));
@@ -62,10 +68,13 @@ void FeatureParallelTreeLearner::FindBestSplitsForLeaves() {
std::memcpy(&smaller_best, output_buffer_.data(), sizeof(SplitInfo));
std::memcpy(&larger_best, output_buffer_.data() + sizeof(SplitInfo), sizeof(SplitInfo));
// update best split
this->best_split_per_leaf_[this->smaller_leaf_splits_->LeafIndex()] = smaller_best;
if (this->larger_leaf_splits_->LeafIndex() >= 0) {
this->best_split_per_leaf_[this->larger_leaf_splits_->LeafIndex()] = larger_best;
}
}
// instantiate template classes, otherwise linker cannot find the code
template class FeatureParallelTreeLearner<GPUTreeLearner>;
template class FeatureParallelTreeLearner<SerialTreeLearner>;
} // namespace LightGBM
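Both parallel learners above are now class templates parameterized by the single-machine learner they extend, so the same data-parallel and feature-parallel logic can sit on top of either SerialTreeLearner or GPUTreeLearner. The following is a minimal standalone sketch of that pattern with illustrative class names (not the real LightGBM classes), showing why the explicit "template class" instantiations at the end of each .cpp file are needed when the member definitions live outside a header:

#include <cstdio>

// Stand-ins for SerialTreeLearner / GPUTreeLearner: a base learner with one
// overridable training step.
class CpuLearner {
 public:
  virtual ~CpuLearner() {}
  virtual void BeforeTrain() { std::printf("CPU BeforeTrain\n"); }
};

class GpuLearner : public CpuLearner {
 public:
  void BeforeTrain() override { std::printf("GPU BeforeTrain\n"); }
};

// Parallel wrapper templated on its base learner, mirroring
// "template <typename TREELEARNER_T> class DataParallelTreeLearner : public TREELEARNER_T".
template <typename LEARNER_T>
class DataParallelLearner : public LEARNER_T {
 public:
  void BeforeTrain() override {
    LEARNER_T::BeforeTrain();               // reuse the base implementation
    std::printf("sync across machines\n");  // then do the parallel-only work
  }
};

// When these member functions are defined in a separate .cpp file, explicit
// instantiations like these let the linker find the generated code:
template class DataParallelLearner<CpuLearner>;
template class DataParallelLearner<GpuLearner>;

int main() {
  DataParallelLearner<GpuLearner> learner;
  learner.BeforeTrain();  // "GPU BeforeTrain" followed by "sync across machines"
  return 0;
}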
#ifdef USE_GPU
#include "gpu_tree_learner.h"
#include "../io/dense_bin.hpp"
#include "../io/dense_nbits_bin.hpp"
#include <LightGBM/utils/array_args.h>
#include <LightGBM/network.h>
#include <LightGBM/bin.h>
#include <algorithm>
#include <vector>
#define GPU_DEBUG 0
namespace LightGBM {
GPUTreeLearner::GPUTreeLearner(const TreeConfig* tree_config)
:SerialTreeLearner(tree_config) {
use_bagging_ = false;
Log::Info("This is the GPU trainer!!");
}
GPUTreeLearner::~GPUTreeLearner() {
if (ptr_pinned_gradients_) {
queue_.enqueue_unmap_buffer(pinned_gradients_, ptr_pinned_gradients_);
}
if (ptr_pinned_hessians_) {
queue_.enqueue_unmap_buffer(pinned_hessians_, ptr_pinned_hessians_);
}
if (ptr_pinned_feature_masks_) {
queue_.enqueue_unmap_buffer(pinned_feature_masks_, ptr_pinned_feature_masks_);
}
}
void GPUTreeLearner::Init(const Dataset* train_data, bool is_constant_hessian) {
// initialize SerialTreeLearner
SerialTreeLearner::Init(train_data, is_constant_hessian);
// some additional variables needed for GPU trainer
num_feature_groups_ = train_data_->num_feature_groups();
// Initialize GPU buffers and kernels
InitGPU(tree_config_->gpu_platform_id, tree_config_->gpu_device_id);
}
// some functions used for debugging the GPU histogram construction
#if GPU_DEBUG > 0
void PrintHistograms(HistogramBinEntry* h, size_t size) {
size_t total = 0;
for (size_t i = 0; i < size; ++i) {
printf("%03lu=%9.3g,%9.3g,%7d\t", i, h[i].sum_gradients, h[i].sum_hessians, h[i].cnt);
total += h[i].cnt;
if ((i & 3) == 3)
printf("\n");
}
printf("\nTotal examples: %lu\n", total);
}
union Float_t
{
int64_t i;
double f;
static int64_t ulp_diff(Float_t a, Float_t b) {
return abs(a.i - b.i);
}
};
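The Float_t union above measures the gap between two doubles in units in the last place (ULPs) by reinterpreting their bit patterns as 64-bit integers. Below is a minimal standalone sketch of the same idea (not part of this file; it assumes both inputs have the same sign, which holds for the histogram sums compared here):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// ULP distance between two same-signed doubles, using memcpy instead of a union type-pun.
int64_t UlpDiff(double x, double y) {
  int64_t ix, iy;
  std::memcpy(&ix, &x, sizeof(ix));
  std::memcpy(&iy, &y, sizeof(iy));
  return std::llabs(ix - iy);
}

int main() {
  double a = 1.0;
  double b = std::nextafter(a, 2.0);                          // the next representable double after 1.0
  std::printf("%lld\n", (long long)UlpDiff(a, b));            // prints 1
  std::printf("%lld\n", (long long)UlpDiff(0.1 + 0.2, 0.3));  // a small count such as 1
  return 0;
}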
void CompareHistograms(HistogramBinEntry* h1, HistogramBinEntry* h2, size_t size, int feature_id) {
size_t i;
Float_t a, b;
for (i = 0; i < size; ++i) {
a.f = h1[i].sum_gradients;
b.f = h2[i].sum_gradients;
int32_t ulps = Float_t::ulp_diff(a, b);
if (h1[i].cnt != h2[i].cnt) {
printf("%d != %d\n", h1[i].cnt, h2[i].cnt);
goto err;
}
if (ulps > 0) {
// printf("grad %g != %g (%d ULPs)\n", h1[i].sum_gradients, h2[i].sum_gradients, ulps);
// goto err;
}
a.f = h1[i].sum_hessians;
b.f = h2[i].sum_hessians;
ulps = Float_t::ulp_diff(a, b);
if (ulps > 0) {
// printf("hessian %g != %g (%d ULPs)\n", h1[i].sum_hessians, h2[i].sum_hessians, ulps);
// goto err;
}
}
return;
err:
Log::Warning("Mismatched histograms found for feature %d at location %lu.", feature_id, i);
std::cin.get();
PrintHistograms(h1, size);
printf("\n");
PrintHistograms(h2, size);
std::cin.get();
}
#endif
int GPUTreeLearner::GetNumWorkgroupsPerFeature(data_size_t leaf_num_data) {
// we roughly want 256 workgroups per device, and we have num_dense_feature4_ feature tuples.
// also guarantee that there are at least 2K examples per workgroup
double x = 256.0 / num_dense_feature4_;
int exp_workgroups_per_feature = ceil(log2(x));
double t = leaf_num_data / 1024.0;
#if GPU_DEBUG >= 4
printf("Computing histogram for %d examples and (%d * %d) feature groups\n", leaf_num_data, dword_features_, num_dense_feature4_);
printf("We can have at most %d workgroups per feature4 for efficiency reasons.\n"
"Best workgroup size per feature for full utilization is %d\n", (int)ceil(t), (1 << exp_workgroups_per_feature));
#endif
exp_workgroups_per_feature = std::min(exp_workgroups_per_feature, (int)ceil(log((double)t)/log(2.0)));
if (exp_workgroups_per_feature < 0)
exp_workgroups_per_feature = 0;
if (exp_workgroups_per_feature > kMaxLogWorkgroupsPerFeature)
exp_workgroups_per_feature = kMaxLogWorkgroupsPerFeature;
// return 0;
return exp_workgroups_per_feature;
}
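To make the sizing heuristic above concrete, here is a standalone re-implementation with example numbers (kMaxLogWorkgroupsPerFeature is taken as 10 purely for illustration; the real constant is defined in the GPU tree learner header):

#include <algorithm>
#include <cmath>
#include <cstdio>

// Mirrors the heuristic: aim for roughly 256 workgroups in total, keep enough
// rows per workgroup, and clamp the exponent to a fixed maximum.
int ExpWorkgroupsPerFeature(int num_dense_feature4, int leaf_num_data,
                            int max_log_workgroups_per_feature) {
  double x = 256.0 / num_dense_feature4;
  int exp_wg = static_cast<int>(std::ceil(std::log2(x)));
  double t = leaf_num_data / 1024.0;
  exp_wg = std::min(exp_wg, static_cast<int>(std::ceil(std::log2(t))));
  return std::max(0, std::min(exp_wg, max_log_workgroups_per_feature));
}

int main() {
  // e.g. 28 dense feature groups packed 4-wide -> 7 tuples, 100000 rows in the leaf:
  int exp_wg = ExpWorkgroupsPerFeature(7, 100000, 10);
  std::printf("2^%d = %d workgroups per tuple, %d workgroups in total\n",
              exp_wg, 1 << exp_wg, (1 << exp_wg) * 7);  // 2^6 = 64 per tuple, 448 total
  return 0;
}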
void GPUTreeLearner::GPUHistogram(data_size_t leaf_num_data, bool use_all_features) {
// we have already copied ordered gradients, ordered hessians and indices to GPU
// decide the best number of workgroups working on one feature4 tuple
// set work group size based on feature size
// each 2^exp_workgroups_per_feature workgroups work on a feature4 tuple
int exp_workgroups_per_feature = GetNumWorkgroupsPerFeature(leaf_num_data);
int num_workgroups = (1 << exp_workgroups_per_feature) * num_dense_feature4_;
if (num_workgroups > preallocd_max_num_wg_) {
preallocd_max_num_wg_ = num_workgroups;
Log::Info("Increasing preallocd_max_num_wg_ to %d for launching more workgroups.", preallocd_max_num_wg_);
device_subhistograms_.reset(new boost::compute::vector<char>(
preallocd_max_num_wg_ * dword_features_ * device_bin_size_ * hist_bin_entry_sz_, ctx_));
// we need to refresh the kernel arguments after reallocating
for (int i = 0; i <= kMaxLogWorkgroupsPerFeature; ++i) {
// The only argument that needs to be changed later is num_data_
histogram_kernels_[i].set_arg(7, *device_subhistograms_);
histogram_allfeats_kernels_[i].set_arg(7, *device_subhistograms_);
histogram_fulldata_kernels_[i].set_arg(7, *device_subhistograms_);
}
}
#if GPU_DEBUG >= 4
printf("setting exp_workgroups_per_feature to %d, using %u work groups\n", exp_workgroups_per_feature, num_workgroups);
printf("Constructing histogram with %d examples\n", leaf_num_data);
#endif
// the GPU kernel will process all features in one call, and each
// 2^exp_workgroups_per_feature (compile time constant) workgroup will
// process one feature4 tuple
if (use_all_features) {
histogram_allfeats_kernels_[exp_workgroups_per_feature].set_arg(4, leaf_num_data);
}
else {
histogram_kernels_[exp_workgroups_per_feature].set_arg(4, leaf_num_data);
}
// for the root node, indices are not copied
if (leaf_num_data != num_data_) {
indices_future_.wait();
}
// for constant hessian, hessians are not copied except for the root node
if (!is_constant_hessian_) {
hessians_future_.wait();
}
gradients_future_.wait();
// there will be 2^exp_workgroups_per_feature = num_workgroups / num_dense_feature4_ sub-histograms per feature4 tuple,
// and we will launch num_workgroups workgroups in total, covering all features
// the queue should be asynchronous, and we can call WaitAndGetHistograms() before we start processing dense feature groups
if (leaf_num_data == num_data_) {
kernel_wait_obj_ = boost::compute::wait_list(queue_.enqueue_1d_range_kernel(histogram_fulldata_kernels_[exp_workgroups_per_feature], 0, num_workgroups * 256, 256));
}
else {
if (use_all_features) {
kernel_wait_obj_ = boost::compute::wait_list(
queue_.enqueue_1d_range_kernel(histogram_allfeats_kernels_[exp_workgroups_per_feature], 0, num_workgroups * 256, 256));
}
else {
kernel_wait_obj_ = boost::compute::wait_list(
queue_.enqueue_1d_range_kernel(histogram_kernels_[exp_workgroups_per_feature], 0, num_workgroups * 256, 256));
}
}
// copy the results asynchronously. Size depends on if double precision is used
size_t output_size = num_dense_feature4_ * dword_features_ * device_bin_size_ * hist_bin_entry_sz_;
boost::compute::event histogram_wait_event;
host_histogram_outputs_ = (void*)queue_.enqueue_map_buffer_async(device_histogram_outputs_, boost::compute::command_queue::map_read,
0, output_size, histogram_wait_event, kernel_wait_obj_);
// we will wait for this object in WaitAndGetHistograms
histograms_wait_obj_ = boost::compute::wait_list(histogram_wait_event);
}
template <typename HistType>
void GPUTreeLearner::WaitAndGetHistograms(HistogramBinEntry* histograms, const std::vector<int8_t>& is_feature_used) {
HistType* hist_outputs = (HistType*) host_histogram_outputs_;
// when the output is ready, the computation is done
histograms_wait_obj_.wait();
#pragma omp parallel for schedule(static)
for(int i = 0; i < num_dense_feature_groups_; ++i) {
if (!feature_masks_[i]) {
continue;
}
int dense_group_index = dense_feature_group_map_[i];
auto old_histogram_array = histograms + train_data_->GroupBinBoundary(dense_group_index);
int bin_size = train_data_->FeatureGroupNumBin(dense_group_index);
if (device_bin_mults_[i] == 1) {
for (int j = 0; j < bin_size; ++j) {
old_histogram_array[j].sum_gradients = hist_outputs[i * device_bin_size_+ j].sum_gradients;
old_histogram_array[j].sum_hessians = hist_outputs[i * device_bin_size_ + j].sum_hessians;
old_histogram_array[j].cnt = hist_outputs[i * device_bin_size_ + j].cnt;
}
}
else {
// values of this feature have been redistributed to multiple bins; need a reduction here
int ind = 0;
for (int j = 0; j < bin_size; ++j) {
double sum_g = 0.0, sum_h = 0.0;
size_t cnt = 0;
for (int k = 0; k < device_bin_mults_[i]; ++k) {
sum_g += hist_outputs[i * device_bin_size_+ ind].sum_gradients;
sum_h += hist_outputs[i * device_bin_size_+ ind].sum_hessians;
cnt += hist_outputs[i * device_bin_size_ + ind].cnt;
ind++;
}
old_histogram_array[j].sum_gradients = sum_g;
old_histogram_array[j].sum_hessians = sum_h;
old_histogram_array[j].cnt = cnt;
}
}
}
queue_.enqueue_unmap_buffer(device_histogram_outputs_, host_histogram_outputs_);
}
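When a feature group has far fewer bins than device_bin_size_, its values are spread over device_bin_mults_[i] consecutive device bins, and the loop above folds them back into one histogram entry per original bin. The following is a small standalone sketch of that fold (toy Entry type, not the real HistogramBinEntry):

#include <cstdio>
#include <vector>

struct Entry { double sum_gradients; double sum_hessians; int cnt; };

// Fold `mult` consecutive device bins back into each logical bin,
// as done above for redistributed feature groups.
std::vector<Entry> FoldBins(const std::vector<Entry>& device_hist, int bin_size, int mult) {
  std::vector<Entry> out(bin_size, Entry{0.0, 0.0, 0});
  int ind = 0;
  for (int j = 0; j < bin_size; ++j) {
    for (int k = 0; k < mult; ++k, ++ind) {
      out[j].sum_gradients += device_hist[ind].sum_gradients;
      out[j].sum_hessians  += device_hist[ind].sum_hessians;
      out[j].cnt           += device_hist[ind].cnt;
    }
  }
  return out;
}

int main() {
  // 4 logical bins redistributed with multiplier 2 -> 8 device bins.
  std::vector<Entry> device_hist(8, Entry{1.0, 0.5, 1});
  std::vector<Entry> folded = FoldBins(device_hist, 4, 2);
  std::printf("bin 0: g=%g h=%g cnt=%d\n",
              folded[0].sum_gradients, folded[0].sum_hessians, folded[0].cnt);  // g=2 h=1 cnt=2
  return 0;
}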
void GPUTreeLearner::AllocateGPUMemory() {
num_dense_feature_groups_ = 0;
for (int i = 0; i < num_feature_groups_; ++i) {
if (ordered_bins_[i] == nullptr) {
num_dense_feature_groups_++;
}
}
// how many feature-group tuples we have
num_dense_feature4_ = (num_dense_feature_groups_ + (dword_features_ - 1)) / dword_features_;
// leave some safe margin for prefetching
// 256 work-items per workgroup. Each work-item prefetches one tuple for that feature
int allocated_num_data_ = num_data_ + 256 * (1 << kMaxLogWorkgroupsPerFeature);
// clear sparse/dense maps
dense_feature_group_map_.clear();
device_bin_mults_.clear();
sparse_feature_group_map_.clear();
// do nothing if no features can be processed on GPU
if (!num_dense_feature_groups_) {
Log::Warning("GPU acceleration is disabled because no non-trival dense features can be found");
return;
}
// allocate memory for all features (FIXME: 4 GB barrier on some devices, need to split to multiple buffers)
device_features_.reset();
device_features_ = std::unique_ptr<boost::compute::vector<Feature4>>(new boost::compute::vector<Feature4>(num_dense_feature4_ * num_data_, ctx_));
// unpin old buffer if necessary before destructing them
if (ptr_pinned_gradients_) {
queue_.enqueue_unmap_buffer(pinned_gradients_, ptr_pinned_gradients_);
}
if (ptr_pinned_hessians_) {
queue_.enqueue_unmap_buffer(pinned_hessians_, ptr_pinned_hessians_);
}
if (ptr_pinned_feature_masks_) {
queue_.enqueue_unmap_buffer(pinned_feature_masks_, ptr_pinned_feature_masks_);
}
// make ordered_gradients and hessians larger (including extra room for prefetching), and pin them
ordered_gradients_.reserve(allocated_num_data_);
ordered_hessians_.reserve(allocated_num_data_);
pinned_gradients_ = boost::compute::buffer(); // deallocate
pinned_gradients_ = boost::compute::buffer(ctx_, allocated_num_data_ * sizeof(score_t),
boost::compute::memory_object::read_write | boost::compute::memory_object::use_host_ptr,
ordered_gradients_.data());
ptr_pinned_gradients_ = queue_.enqueue_map_buffer(pinned_gradients_, boost::compute::command_queue::map_write_invalidate_region,
0, allocated_num_data_ * sizeof(score_t));
pinned_hessians_ = boost::compute::buffer(); // deallocate
pinned_hessians_ = boost::compute::buffer(ctx_, allocated_num_data_ * sizeof(score_t),
boost::compute::memory_object::read_write | boost::compute::memory_object::use_host_ptr,
ordered_hessians_.data());
ptr_pinned_hessians_ = queue_.enqueue_map_buffer(pinned_hessians_, boost::compute::command_queue::map_write_invalidate_region,
0, allocated_num_data_ * sizeof(score_t));
// allocate space for gradients and hessians on device
// we will copy gradients and hessians in after ordered_gradients_ and ordered_hessians_ are constructed
device_gradients_ = boost::compute::buffer(); // deallocate
device_gradients_ = boost::compute::buffer(ctx_, allocated_num_data_ * sizeof(score_t),
boost::compute::memory_object::read_only, nullptr);
device_hessians_ = boost::compute::buffer(); // deallocate
device_hessians_ = boost::compute::buffer(ctx_, allocated_num_data_ * sizeof(score_t),
boost::compute::memory_object::read_only, nullptr);
// allocate feature mask, for disabling some feature-groups' histogram calculation
feature_masks_.resize(num_dense_feature4_ * dword_features_);
device_feature_masks_ = boost::compute::buffer(); // deallocate
device_feature_masks_ = boost::compute::buffer(ctx_, num_dense_feature4_ * dword_features_,
boost::compute::memory_object::read_only, nullptr);
pinned_feature_masks_ = boost::compute::buffer(ctx_, num_dense_feature4_ * dword_features_,
boost::compute::memory_object::read_write | boost::compute::memory_object::use_host_ptr,
feature_masks_.data());
ptr_pinned_feature_masks_ = queue_.enqueue_map_buffer(pinned_feature_masks_, boost::compute::command_queue::map_write_invalidate_region,
0, num_dense_feature4_ * dword_features_);
memset(ptr_pinned_feature_masks_, 0, num_dense_feature4_ * dword_features_);
// copy indices to the device
device_data_indices_.reset();
device_data_indices_ = std::unique_ptr<boost::compute::vector<data_size_t>>(new boost::compute::vector<data_size_t>(allocated_num_data_, ctx_));
boost::compute::fill(device_data_indices_->begin(), device_data_indices_->end(), 0, queue_);
// histogram bin entry size depends on the precision (single/double)
hist_bin_entry_sz_ = tree_config_->gpu_use_dp ? sizeof(HistogramBinEntry) : sizeof(GPUHistogramBinEntry);
Log::Info("Size of histogram bin entry: %d", hist_bin_entry_sz_);
// create output buffer, each feature has a histogram with device_bin_size_ bins,
// each work group generates a sub-histogram of dword_features_ features.
if (!device_subhistograms_) {
// only initialize once here, as this will not need to change when ResetTrainingData() is called
device_subhistograms_ = std::unique_ptr<boost::compute::vector<char>>(new boost::compute::vector<char>(
preallocd_max_num_wg_ * dword_features_ * device_bin_size_ * hist_bin_entry_sz_, ctx_));
}
// create atomic counters for inter-group coordination
sync_counters_.reset();
sync_counters_ = std::unique_ptr<boost::compute::vector<int>>(new boost::compute::vector<int>(
num_dense_feature4_, ctx_));
boost::compute::fill(sync_counters_->begin(), sync_counters_->end(), 0, queue_);
// The output buffer is allocated to host directly, to overlap compute and data transfer
device_histogram_outputs_ = boost::compute::buffer(); // deallocate
device_histogram_outputs_ = boost::compute::buffer(ctx_, num_dense_feature4_ * dword_features_ * device_bin_size_ * hist_bin_entry_sz_,
boost::compute::memory_object::write_only | boost::compute::memory_object::alloc_host_ptr, nullptr);
// find the dense feature-groups and group them into the Feature4 data structure (several feature-groups packed into 4 bytes)
int i, k, copied_feature4 = 0, dense_ind[dword_features_];
for (i = 0, k = 0; i < num_feature_groups_; ++i) {
// looking for dword_features_ non-sparse feature-groups
if (ordered_bins_[i] == nullptr) {
dense_ind[k] = i;
// decide if we need to redistribute the bin
double t = device_bin_size_ / (double)train_data_->FeatureGroupNumBin(i);
// multiplier must be a power of 2
device_bin_mults_.push_back((int)round(pow(2, floor(log2(t)))));
// device_bin_mults_.push_back(1);
#if GPU_DEBUG >= 1
printf("feature-group %d using multiplier %d\n", i, device_bin_mults_.back());
#endif
k++;
}
else {
sparse_feature_group_map_.push_back(i);
}
// found
if (k == dword_features_) {
k = 0;
for (int j = 0; j < dword_features_; ++j) {
dense_feature_group_map_.push_back(dense_ind[j]);
}
copied_feature4++;
}
}
// for data transfer time
auto start_time = std::chrono::steady_clock::now();
// Now generate new data structure feature4, and copy data to the device
int nthreads = std::min(omp_get_max_threads(), (int)dense_feature_group_map_.size() / dword_features_);
nthreads = std::max(nthreads, 1);
std::vector<Feature4*> host4_vecs(nthreads);
std::vector<boost::compute::buffer> host4_bufs(nthreads);
std::vector<Feature4*> host4_ptrs(nthreads);
// preallocate arrays for all threads, and pin them
for (int i = 0; i < nthreads; ++i) {
host4_vecs[i] = (Feature4*)boost::alignment::aligned_alloc(4096, num_data_ * sizeof(Feature4));
host4_bufs[i] = boost::compute::buffer(ctx_, num_data_ * sizeof(Feature4),
boost::compute::memory_object::read_write | boost::compute::memory_object::use_host_ptr,
host4_vecs[i]);
host4_ptrs[i] = (Feature4*)queue_.enqueue_map_buffer(host4_bufs[i], boost::compute::command_queue::map_write_invalidate_region,
0, num_data_ * sizeof(Feature4));
}
// building Feature4 bundles; each thread handles dword_features_ features
#pragma omp parallel for schedule(static)
for (unsigned int i = 0; i < dense_feature_group_map_.size() / dword_features_; ++i) {
int tid = omp_get_thread_num();
Feature4* host4 = host4_ptrs[tid];
auto dense_ind = dense_feature_group_map_.begin() + i * dword_features_;
auto dev_bin_mult = device_bin_mults_.begin() + i * dword_features_;
#if GPU_DEBUG >= 1
printf("Copying feature group ");
for (int l = 0; l < dword_features_; ++l) {
printf("%d ", dense_ind[l]);
}
printf("to devices\n");
#endif
if (dword_features_ == 8) {
// one feature datapoint is 4 bits
BinIterator* bin_iters[8];
for (int s_idx = 0; s_idx < 8; ++s_idx) {
bin_iters[s_idx] = train_data_->FeatureGroupIterator(dense_ind[s_idx]);
if (dynamic_cast<Dense4bitsBinIterator*>(bin_iters[s_idx]) == 0) {
Log::Fatal("GPU tree learner assumes that all bins are Dense4bitsBin when num_bin <= 16, but feature %d is not.", dense_ind[s_idx]);
}
}
// this guarantees that the RawGet() function is inlined, rather than using virtual function dispatching
Dense4bitsBinIterator iters[8] = {
*static_cast<Dense4bitsBinIterator*>(bin_iters[0]),
*static_cast<Dense4bitsBinIterator*>(bin_iters[1]),
*static_cast<Dense4bitsBinIterator*>(bin_iters[2]),
*static_cast<Dense4bitsBinIterator*>(bin_iters[3]),
*static_cast<Dense4bitsBinIterator*>(bin_iters[4]),
*static_cast<Dense4bitsBinIterator*>(bin_iters[5]),
*static_cast<Dense4bitsBinIterator*>(bin_iters[6]),
*static_cast<Dense4bitsBinIterator*>(bin_iters[7])};
for (int j = 0; j < num_data_; ++j) {
host4[j].s0 = (iters[0].RawGet(j) * dev_bin_mult[0] + ((j+0) & (dev_bin_mult[0] - 1)))
|((iters[1].RawGet(j) * dev_bin_mult[1] + ((j+1) & (dev_bin_mult[1] - 1))) << 4);
host4[j].s1 = (iters[2].RawGet(j) * dev_bin_mult[2] + ((j+2) & (dev_bin_mult[2] - 1)))
|((iters[3].RawGet(j) * dev_bin_mult[3] + ((j+3) & (dev_bin_mult[3] - 1))) << 4);
host4[j].s2 = (iters[4].RawGet(j) * dev_bin_mult[4] + ((j+4) & (dev_bin_mult[4] - 1)))
|((iters[5].RawGet(j) * dev_bin_mult[5] + ((j+5) & (dev_bin_mult[5] - 1))) << 4);
host4[j].s3 = (iters[6].RawGet(j) * dev_bin_mult[6] + ((j+6) & (dev_bin_mult[6] - 1)))
|((iters[7].RawGet(j) * dev_bin_mult[7] + ((j+7) & (dev_bin_mult[7] - 1))) << 4);
}
}
else if (dword_features_ == 4) {
// one feature datapoint is one byte
for (int s_idx = 0; s_idx < 4; ++s_idx) {
BinIterator* bin_iter = train_data_->FeatureGroupIterator(dense_ind[s_idx]);
// this guarantees that the RawGet() function is inlined, rather than using virtual function dispatching
if (dynamic_cast<DenseBinIterator<uint8_t>*>(bin_iter) != 0) {
// Dense bin
DenseBinIterator<uint8_t> iter = *static_cast<DenseBinIterator<uint8_t>*>(bin_iter);
for (int j = 0; j < num_data_; ++j) {
host4[j].s[s_idx] = iter.RawGet(j) * dev_bin_mult[s_idx] + ((j+s_idx) & (dev_bin_mult[s_idx] - 1));
}
}
else if (dynamic_cast<Dense4bitsBinIterator*>(bin_iter) != 0) {
// Dense 4-bit bin
Dense4bitsBinIterator iter = *static_cast<Dense4bitsBinIterator*>(bin_iter);
for (int j = 0; j < num_data_; ++j) {
host4[j].s[s_idx] = iter.RawGet(j) * dev_bin_mult[s_idx] + ((j+s_idx) & (dev_bin_mult[s_idx] - 1));
}
}
else {
Log::Fatal("Bug in GPU tree builder: only DenseBin and Dense4bitsBin are supported!");
}
}
}
else {
Log::Fatal("Bug in GPU tree builder: dword_features_ can only be 4 or 8!");
}
queue_.enqueue_write_buffer(device_features_->get_buffer(),
i * num_data_ * sizeof(Feature4), num_data_ * sizeof(Feature4), host4);
#if GPU_DEBUG >= 1
printf("first example of feature-group tuple is: %d %d %d %d\n", host4[0].s0, host4[0].s1, host4[0].s2, host4[0].s3);
printf("Feature-groups copied to device with multipliers ");
for (int l = 0; l < dword_features_; ++l) {
printf("%d ", dev_bin_mult[l]);
}
printf("\n");
#endif
}
// working on the remaining (less than dword_features_) feature groups
if (k != 0) {
Feature4* host4 = host4_ptrs[0];
if (dword_features_ == 8) {
memset(host4, 0, num_data_ * sizeof(Feature4));
}
#if GPU_DEBUG >= 1
printf("%d features left\n", k);
#endif
for (i = 0; i < k; ++i) {
if (dword_features_ == 8) {
BinIterator* bin_iter = train_data_->FeatureGroupIterator(dense_ind[i]);
if (dynamic_cast<Dense4bitsBinIterator*>(bin_iter) != 0) {
Dense4bitsBinIterator iter = *static_cast<Dense4bitsBinIterator*>(bin_iter);
#pragma omp parallel for schedule(static)
for (int j = 0; j < num_data_; ++j) {
host4[j].s[i >> 1] |= ((iter.RawGet(j) * device_bin_mults_[copied_feature4 * dword_features_ + i]
+ ((j+i) & (device_bin_mults_[copied_feature4 * dword_features_ + i] - 1)))
<< ((i & 1) << 2));
}
}
else {
Log::Fatal("GPU tree learner assumes that all bins are Dense4bitsBin when num_bin <= 16, but feature %d is not.", dense_ind[i]);
}
}
else if (dword_features_ == 4) {
BinIterator* bin_iter = train_data_->FeatureGroupIterator(dense_ind[i]);
if (dynamic_cast<DenseBinIterator<uint8_t>*>(bin_iter) != 0) {
DenseBinIterator<uint8_t> iter = *static_cast<DenseBinIterator<uint8_t>*>(bin_iter);
#pragma omp parallel for schedule(static)
for (int j = 0; j < num_data_; ++j) {
host4[j].s[i] = iter.RawGet(j) * device_bin_mults_[copied_feature4 * dword_features_ + i]
+ ((j+i) & (device_bin_mults_[copied_feature4 * dword_features_ + i] - 1));
}
}
else if (dynamic_cast<Dense4bitsBinIterator*>(bin_iter) != 0) {
Dense4bitsBinIterator iter = *static_cast<Dense4bitsBinIterator*>(bin_iter);
#pragma omp parallel for schedule(static)
for (int j = 0; j < num_data_; ++j) {
host4[j].s[i] = iter.RawGet(j) * device_bin_mults_[copied_feature4 * dword_features_ + i]
+ ((j+i) & (device_bin_mults_[copied_feature4 * dword_features_ + i] - 1));
}
}
else {
Log::Fatal("BUG in GPU tree builder: only DenseBin and Dense4bitsBin are supported!");
}
}
else {
Log::Fatal("Bug in GPU tree builder: dword_features_ can only be 4 or 8!");
}
}
// fill the leftover features
if (dword_features_ == 8) {
#pragma omp parallel for schedule(static)
for (int j = 0; j < num_data_; ++j) {
for (i = k; i < dword_features_; ++i) {
// fill this empty feature with some "random" value
host4[j].s[i >> 1] |= ((j & 0xf) << ((i & 1) << 2));
}
}
}
else if (dword_features_ == 4) {
#pragma omp parallel for schedule(static)
for (int j = 0; j < num_data_; ++j) {
for (i = k; i < dword_features_; ++i) {
// fill this empty feature with some "random" value
host4[j].s[i] = j;
}
}
}
// copying the last 1 to (dword_features - 1) feature-groups in the last tuple
queue_.enqueue_write_buffer(device_features_->get_buffer(),
(num_dense_feature4_ - 1) * num_data_ * sizeof(Feature4), num_data_ * sizeof(Feature4), host4);
#if GPU_DEBUG >= 1
printf("Last features copied to device\n");
#endif
for (i = 0; i < k; ++i) {
dense_feature_group_map_.push_back(dense_ind[i]);
}
}
// deallocate pinned space for feature copying
for (int i = 0; i < nthreads; ++i) {
queue_.enqueue_unmap_buffer(host4_bufs[i], host4_ptrs[i]);
host4_bufs[i] = boost::compute::buffer();
boost::alignment::aligned_free(host4_vecs[i]);
}
// data transfer time
std::chrono::duration<double, std::milli> end_time = std::chrono::steady_clock::now() - start_time;
Log::Info("%d dense feature groups (%.2f MB) transferred to GPU in %f secs. %d sparse feature groups.",
(int)dense_feature_group_map_.size(), ((dense_feature_group_map_.size() + (dword_features_ - 1)) / dword_features_) * num_data_ * sizeof(Feature4) / (1024.0 * 1024.0),
end_time.count() * 1e-3, (int)sparse_feature_group_map_.size());
#if GPU_DEBUG >= 1
printf("Dense feature group list (size %lu): ", dense_feature_group_map_.size());
for (i = 0; i < num_dense_feature_groups_; ++i) {
printf("%d ", dense_feature_group_map_[i]);
}
printf("\n");
printf("Sparse feature group list (size %lu): ", sparse_feature_group_map_.size());
for (i = 0; i < num_feature_groups_ - num_dense_feature_groups_; ++i) {
printf("%d ", sparse_feature_group_map_[i]);
}
printf("\n");
#endif
}
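AllocateGPUMemory() packs dword_features_ (4 or 8) dense feature-group bin values per data row into one 32-bit Feature4 tuple, so a single load on the device feeds several histograms at once. The following is a toy sketch of the 8-by-4-bit packing used for the 16-bin kernel (illustrative only; the real code above also applies the per-group bin multiplier and row-dependent offset):

#include <cstdint>
#include <cstdio>

// Pack eight 4-bit bin values (0..15) into one 32-bit word, lowest feature first,
// mirroring the s0..s3 nibble packing done for the 4-bit kernel.
uint32_t Pack8x4bit(const uint8_t bins[8]) {
  uint32_t packed = 0;
  for (int f = 0; f < 8; ++f) {
    packed |= static_cast<uint32_t>(bins[f] & 0xF) << (4 * f);
  }
  return packed;
}

int main() {
  uint8_t bins[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  std::printf("packed tuple = 0x%08X\n", Pack8x4bit(bins));  // 0x87654321
  return 0;
}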
void GPUTreeLearner::BuildGPUKernels() {
Log::Info("Compiling OpenCL Kernel with %d bins...", device_bin_size_);
// destroy any old kernels
histogram_kernels_.clear();
histogram_allfeats_kernels_.clear();
histogram_fulldata_kernels_.clear();
// create OpenCL kernels for different number of workgroups per feature
histogram_kernels_.resize(kMaxLogWorkgroupsPerFeature+1);
histogram_allfeats_kernels_.resize(kMaxLogWorkgroupsPerFeature+1);
histogram_fulldata_kernels_.resize(kMaxLogWorkgroupsPerFeature+1);
// currently we don't use constant memory
int use_constants = 0;
#pragma omp parallel for schedule(guided)
for (int i = 0; i <= kMaxLogWorkgroupsPerFeature; ++i) {
boost::compute::program program;
std::ostringstream opts;
// compile the GPU kernel depending if double precision is used, constant hessian is used, etc
opts << " -D POWER_FEATURE_WORKGROUPS=" << i
<< " -D USE_CONSTANT_BUF=" << use_constants << " -D USE_DP_FLOAT=" << int(tree_config_->gpu_use_dp)
<< " -D CONST_HESSIAN=" << int(is_constant_hessian_)
<< " -cl-strict-aliasing -cl-mad-enable -cl-no-signed-zeros -cl-fast-relaxed-math";
#if GPU_DEBUG >= 1
std::cout << "Building GPU kernels with options: " << opts.str() << std::endl;
#endif
// kernel with indices in an array
try {
program = boost::compute::program::build_with_source(kernel_source_, ctx_, opts.str());
}
catch (boost::compute::opencl_error &e) {
if (program.build_log().size() > 0) {
Log::Fatal("GPU program build failure:\n %s", program.build_log().c_str());
}
else {
Log::Fatal("GPU program build failure, log unavailable");
}
}
histogram_kernels_[i] = program.create_kernel(kernel_name_);
// kernel with all features enabled, with eliminated branches
opts << " -D ENABLE_ALL_FEATURES=1";
try {
program = boost::compute::program::build_with_source(kernel_source_, ctx_, opts.str());
}
catch (boost::compute::opencl_error &e) {
if (program.build_log().size() > 0) {
Log::Fatal("GPU program build failure:\n %s", program.build_log().c_str());
}
else {
Log::Fatal("GPU program build failure, log unavailable");
}
}
histogram_allfeats_kernels_[i] = program.create_kernel(kernel_name_);
// kernel with all data indices (for root node, and assumes that root node always uses all features)
opts << " -D IGNORE_INDICES=1";
try {
program = boost::compute::program::build_with_source(kernel_source_, ctx_, opts.str());
}
catch (boost::compute::opencl_error &e) {
if (program.build_log().size() > 0) {
Log::Fatal("GPU program build failure:\n %s", program.build_log().c_str());
}
else {
Log::Fatal("GPU program build failure, log unavailable");
}
}
histogram_fulldata_kernels_[i] = program.create_kernel(kernel_name_);
}
Log::Info("GPU programs have been built");
}
void GPUTreeLearner::SetupKernelArguments() {
// do nothing if no features can be processed on GPU
if (!num_dense_feature_groups_) {
return;
}
for (int i = 0; i <= kMaxLogWorkgroupsPerFeature; ++i) {
// The only argument that needs to be changed later is num_data_
if (is_constant_hessian_) {
// hessian is passed as a parameter, but it is not available now.
// hessian will be set in BeforeTrain()
histogram_kernels_[i].set_args(*device_features_, device_feature_masks_, num_data_,
*device_data_indices_, num_data_, device_gradients_, 0.0f,
*device_subhistograms_, *sync_counters_, device_histogram_outputs_);
histogram_allfeats_kernels_[i].set_args(*device_features_, device_feature_masks_, num_data_,
*device_data_indices_, num_data_, device_gradients_, 0.0f,
*device_subhistograms_, *sync_counters_, device_histogram_outputs_);
histogram_fulldata_kernels_[i].set_args(*device_features_, device_feature_masks_, num_data_,
*device_data_indices_, num_data_, device_gradients_, 0.0f,
*device_subhistograms_, *sync_counters_, device_histogram_outputs_);
}
else {
histogram_kernels_[i].set_args(*device_features_, device_feature_masks_, num_data_,
*device_data_indices_, num_data_, device_gradients_, device_hessians_,
*device_subhistograms_, *sync_counters_, device_histogram_outputs_);
histogram_allfeats_kernels_[i].set_args(*device_features_, device_feature_masks_, num_data_,
*device_data_indices_, num_data_, device_gradients_, device_hessians_,
*device_subhistograms_, *sync_counters_, device_histogram_outputs_);
histogram_fulldata_kernels_[i].set_args(*device_features_, device_feature_masks_, num_data_,
*device_data_indices_, num_data_, device_gradients_, device_hessians_,
*device_subhistograms_, *sync_counters_, device_histogram_outputs_);
}
}
}
void GPUTreeLearner::InitGPU(int platform_id, int device_id) {
// Get the max bin size, used for selecting best GPU kernel
max_num_bin_ = 0;
#if GPU_DEBUG >= 1
printf("bin size: ");
#endif
for (int i = 0; i < num_feature_groups_; ++i) {
#if GPU_DEBUG >= 1
printf("%d, ", train_data_->FeatureGroupNumBin(i));
#endif
max_num_bin_ = std::max(max_num_bin_, train_data_->FeatureGroupNumBin(i));
}
#if GPU_DEBUG >= 1
printf("\n");
#endif
// initialize GPU
dev_ = boost::compute::system::default_device();
if (platform_id >= 0 && device_id >= 0) {
const std::vector<boost::compute::platform> platforms = boost::compute::system::platforms();
if ((int)platforms.size() > platform_id) {
const std::vector<boost::compute::device> platform_devices = platforms[platform_id].devices();
if ((int)platform_devices.size() > device_id) {
Log::Info("Using requested OpenCL platform %d device %d", platform_id, device_id);
dev_ = platform_devices[device_id];
}
}
}
// determine which kernel to use based on the max number of bins
if (max_num_bin_ <= 16) {
kernel_source_ = kernel16_src_;
kernel_name_ = "histogram16";
device_bin_size_ = 16;
dword_features_ = 8;
}
else if (max_num_bin_ <= 64) {
kernel_source_ = kernel64_src_;
kernel_name_ = "histogram64";
device_bin_size_ = 64;
dword_features_ = 4;
}
else if (max_num_bin_ <= 256) {
kernel_source_ = kernel256_src_;
kernel_name_ = "histogram256";
device_bin_size_ = 256;
dword_features_ = 4;
}
else {
Log::Fatal("bin size %d cannot run on GPU", max_num_bin_);
}
if (max_num_bin_ == 65) {
Log::Warning("Setting max_bin to 63 is suggested for best performance");
}
if (max_num_bin_ == 17) {
Log::Warning("Setting max_bin to 15 is suggested for best performance");
}
ctx_ = boost::compute::context(dev_);
queue_ = boost::compute::command_queue(ctx_, dev_);
Log::Info("Using GPU Device: %s, Vendor: %s", dev_.name().c_str(), dev_.vendor().c_str());
BuildGPUKernels();
AllocateGPUMemory();
// set up GPU kernel arguments after allocating all the buffers
SetupKernelArguments();
}
Tree* GPUTreeLearner::Train(const score_t* gradients, const score_t *hessians, bool is_constant_hessian) {
// check if we need to recompile the GPU kernel (is_constant_hessian changed)
// this should rarely occur
if (is_constant_hessian != is_constant_hessian_) {
Log::Info("Recompiling GPU kernel because hessian is %sa constant now", is_constant_hessian ? "" : "not ");
is_constant_hessian_ = is_constant_hessian;
BuildGPUKernels();
SetupKernelArguments();
}
return SerialTreeLearner::Train(gradients, hessians, is_constant_hessian);
}
void GPUTreeLearner::ResetTrainingData(const Dataset* train_data) {
SerialTreeLearner::ResetTrainingData(train_data);
num_feature_groups_ = train_data_->num_feature_groups();
// GPU memory has to be reallocated because the data may have changed
AllocateGPUMemory();
// set up GPU kernel arguments after allocating all the buffers
SetupKernelArguments();
}
void GPUTreeLearner::BeforeTrain() {
#if GPU_DEBUG >= 2
printf("Copying intial full gradients and hessians to device\n");
#endif
// Copy initial full hessians and gradients to GPU.
// We start copying as early as possible, instead of at ConstructHistogram().
if (!use_bagging_ && num_dense_feature_groups_) {
if (!is_constant_hessian_) {
hessians_future_ = queue_.enqueue_write_buffer_async(device_hessians_, 0, num_data_ * sizeof(score_t), hessians_);
}
else {
// setup hessian parameters only
score_t const_hessian = hessians_[0];
for (int i = 0; i <= kMaxLogWorkgroupsPerFeature; ++i) {
// hessian is passed as a parameter
histogram_kernels_[i].set_arg(6, const_hessian);
histogram_allfeats_kernels_[i].set_arg(6, const_hessian);
histogram_fulldata_kernels_[i].set_arg(6, const_hessian);
}
}
gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, num_data_ * sizeof(score_t), gradients_);
}
SerialTreeLearner::BeforeTrain();
// use bagging
if (data_partition_->leaf_count(0) != num_data_ && num_dense_feature_groups_) {
// On GPU, we start copying indices, gradients and hessians now, instead of waiting until ConstructHistogram()
// copy used gradients and hessians to ordered buffer
const data_size_t* indices = data_partition_->indices();
data_size_t cnt = data_partition_->leaf_count(0);
#if GPU_DEBUG > 0
printf("Using bagging, examples count = %d\n", cnt);
#endif
// transfer the indices to GPU
indices_future_ = boost::compute::copy_async(indices, indices + cnt, device_data_indices_->begin(), queue_);
if (!is_constant_hessian_) {
#pragma omp parallel for schedule(static)
for (data_size_t i = 0; i < cnt; ++i) {
ordered_hessians_[i] = hessians_[indices[i]];
}
// transfer hessian to GPU
hessians_future_ = queue_.enqueue_write_buffer_async(device_hessians_, 0, cnt * sizeof(score_t), ordered_hessians_.data());
}
else {
// setup hessian parameters only
score_t const_hessian = hessians_[indices[0]];
for (int i = 0; i <= kMaxLogWorkgroupsPerFeature; ++i) {
// hessian is passed as a parameter
histogram_kernels_[i].set_arg(6, const_hessian);
histogram_allfeats_kernels_[i].set_arg(6, const_hessian);
histogram_fulldata_kernels_[i].set_arg(6, const_hessian);
}
}
#pragma omp parallel for schedule(static)
for (data_size_t i = 0; i < cnt; ++i) {
ordered_gradients_[i] = gradients_[indices[i]];
}
// transfer gradients to GPU
gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, cnt * sizeof(score_t), ordered_gradients_.data());
}
}
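BeforeTrain() starts the gradient, hessian and index transfers as soon as the data is known and only waits on the corresponding futures inside GPUHistogram(), so the copies overlap with CPU-side work. Below is a minimal boost::compute sketch of that copy-early, wait-late pattern (standalone and simplified; the real code above uses pinned host buffers and enqueue_write_buffer_async on preallocated device buffers):

#include <boost/compute/core.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/container/vector.hpp>
#include <vector>

namespace compute = boost::compute;

int main() {
  compute::device dev = compute::system::default_device();
  compute::context ctx(dev);
  compute::command_queue queue(ctx, dev);

  std::vector<float> host_gradients(1 << 20, 1.0f);
  compute::vector<float> device_gradients(host_gradients.size(), ctx);

  // start the host-to-device copy without blocking
  auto copy_future = compute::copy_async(host_gradients.begin(), host_gradients.end(),
                                         device_gradients.begin(), queue);

  // ... CPU-side work runs here while the transfer is in flight ...

  copy_future.wait();  // block only when the device actually needs the data
  return 0;
}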
bool GPUTreeLearner::BeforeFindBestSplit(const Tree* tree, int left_leaf, int right_leaf) {
int smaller_leaf;
data_size_t num_data_in_left_child = GetGlobalDataCountInLeaf(left_leaf);
data_size_t num_data_in_right_child = GetGlobalDataCountInLeaf(right_leaf);
// only have root
if (right_leaf < 0) {
smaller_leaf = -1;
} else if (num_data_in_left_child < num_data_in_right_child) {
smaller_leaf = left_leaf;
} else {
smaller_leaf = right_leaf;
}
// Copy indices, gradients and hessians as early as possible
if (smaller_leaf >= 0 && num_dense_feature_groups_) {
// only need to initialize for smaller leaf
// Get leaf boundary
const data_size_t* indices = data_partition_->indices();
data_size_t begin = data_partition_->leaf_begin(smaller_leaf);
data_size_t end = begin + data_partition_->leaf_count(smaller_leaf);
// copy indices to the GPU:
#if GPU_DEBUG >= 2
Log::Info("Copying indices, gradients and hessians to GPU...");
printf("indices size %d being copied (left = %d, right = %d)\n", end - begin,num_data_in_left_child,num_data_in_right_child);
#endif
indices_future_ = boost::compute::copy_async(indices + begin, indices + end, device_data_indices_->begin(), queue_);
if (!is_constant_hessian_) {
#pragma omp parallel for schedule(static)
for (data_size_t i = begin; i < end; ++i) {
ordered_hessians_[i - begin] = hessians_[indices[i]];
}
// copy ordered hessians to the GPU:
hessians_future_ = queue_.enqueue_write_buffer_async(device_hessians_, 0, (end - begin) * sizeof(score_t), ptr_pinned_hessians_);
}
#pragma omp parallel for schedule(static)
for (data_size_t i = begin; i < end; ++i) {
ordered_gradients_[i - begin] = gradients_[indices[i]];
}
// copy ordered gradients to the GPU:
gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, (end - begin) * sizeof(score_t), ptr_pinned_gradients_);
#if GPU_DEBUG >= 2
Log::Info("gradients/hessians/indiex copied to device with size %d", end - begin);
#endif
}
return SerialTreeLearner::BeforeFindBestSplit(tree, left_leaf, right_leaf);
}
bool GPUTreeLearner::ConstructGPUHistogramsAsync(
const std::vector<int8_t>& is_feature_used,
const data_size_t* data_indices, data_size_t num_data,
const score_t* gradients, const score_t* hessians,
score_t* ordered_gradients, score_t* ordered_hessians) {
if (num_data <= 0) {
return false;
}
// do nothing if no features can be processed on GPU
if (!num_dense_feature_groups_) {
return false;
}
// copy data indices if it is not null
if (data_indices != nullptr && num_data != num_data_) {
indices_future_ = boost::compute::copy_async(data_indices, data_indices + num_data, device_data_indices_->begin(), queue_);
}
// generate and copy ordered_gradients if gradients is not null
if (gradients != nullptr) {
if (num_data != num_data_) {
#pragma omp parallel for schedule(static)
for (data_size_t i = 0; i < num_data; ++i) {
ordered_gradients[i] = gradients[data_indices[i]];
}
gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, num_data * sizeof(score_t), ptr_pinned_gradients_);
}
else {
gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, num_data * sizeof(score_t), gradients);
}
}
// generate and copy ordered_hessians if hessians is not null
if (hessians != nullptr && !is_constant_hessian_) {
if (num_data != num_data_) {
#pragma omp parallel for schedule(static)
for (data_size_t i = 0; i < num_data; ++i) {
ordered_hessians[i] = hessians[data_indices[i]];
}
hessians_future_ = queue_.enqueue_write_buffer_async(device_hessians_, 0, num_data * sizeof(score_t), ptr_pinned_hessians_);
}
else {
hessians_future_ = queue_.enqueue_write_buffer_async(device_hessians_, 0, num_data * sizeof(score_t), hessians);
}
}
// convert indices in is_feature_used to feature-group indices
std::vector<int8_t> is_feature_group_used(num_feature_groups_, 0);
#pragma omp parallel for schedule(static,1024) if (num_features_ >= 2048)
for (int i = 0; i < num_features_; ++i) {
if(is_feature_used[i]) {
is_feature_group_used[train_data_->Feature2Group(i)] = 1;
}
}
// construct the feature masks for dense feature-groups
int used_dense_feature_groups = 0;
#pragma omp parallel for schedule(static,1024) reduction(+:used_dense_feature_groups) if (num_dense_feature_groups_ >= 2048)
for (int i = 0; i < num_dense_feature_groups_; ++i) {
if (is_feature_group_used[dense_feature_group_map_[i]]) {
feature_masks_[i] = 1;
++used_dense_feature_groups;
}
else {
feature_masks_[i] = 0;
}
}
bool use_all_features = used_dense_feature_groups == num_dense_feature_groups_;
// if no feature group is used, just return and do not use GPU
if (used_dense_feature_groups == 0) {
return false;
}
#if GPU_DEBUG >= 1
printf("feature masks:\n");
for (unsigned int i = 0; i < feature_masks_.size(); ++i) {
printf("%d ", feature_masks_[i]);
}
printf("\n");
printf("%d feature groups, %d used, %d\n", num_dense_feature_groups_, used_dense_feature_groups, use_all_features);
#endif
// if not all feature groups are used, we need to transfer the feature mask to GPU
// otherwise, we will use a specialized GPU kernel with all feature groups enabled
if (!use_all_features) {
queue_.enqueue_write_buffer(device_feature_masks_, 0, num_dense_feature4_ * dword_features_, ptr_pinned_feature_masks_);
}
// All data have been prepared, now run the GPU kernel
GPUHistogram(num_data, use_all_features);
return true;
}
void GPUTreeLearner::ConstructHistograms(const std::vector<int8_t>& is_feature_used, bool use_subtract) {
std::vector<int8_t> is_sparse_feature_used(num_features_, 0);
std::vector<int8_t> is_dense_feature_used(num_features_, 0);
#pragma omp parallel for schedule(static)
for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
if (!is_feature_used_[feature_index]) continue;
if (!is_feature_used[feature_index]) continue;
if (ordered_bins_[train_data_->Feature2Group(feature_index)]) {
is_sparse_feature_used[feature_index] = 1;
}
else {
is_dense_feature_used[feature_index] = 1;
}
}
// construct smaller leaf
HistogramBinEntry* ptr_smaller_leaf_hist_data = smaller_leaf_histogram_array_[0].RawData() - 1;
// ConstructGPUHistogramsAsync will return true if there are available feature groups dispatched to the GPU
bool is_gpu_used = ConstructGPUHistogramsAsync(is_feature_used,
nullptr, smaller_leaf_splits_->num_data_in_leaf(),
nullptr, nullptr,
nullptr, nullptr);
// then construct sparse features on CPU
// We set data_indices to null to avoid rebuilding ordered gradients/hessians
train_data_->ConstructHistograms(is_sparse_feature_used,
nullptr, smaller_leaf_splits_->num_data_in_leaf(),
smaller_leaf_splits_->LeafIndex(),
ordered_bins_, gradients_, hessians_,
ordered_gradients_.data(), ordered_hessians_.data(), is_constant_hessian_,
ptr_smaller_leaf_hist_data);
// wait for GPU to finish, only if GPU is actually used
if (is_gpu_used) {
if (tree_config_->gpu_use_dp) {
// use double precision
WaitAndGetHistograms<HistogramBinEntry>(ptr_smaller_leaf_hist_data, is_feature_used);
}
else {
// use single precision
WaitAndGetHistograms<GPUHistogramBinEntry>(ptr_smaller_leaf_hist_data, is_feature_used);
}
}
// Compare GPU histogram with CPU histogram, useful for debugging GPU code problems
// #define GPU_DEBUG_COMPARE
#ifdef GPU_DEBUG_COMPARE
for (int i = 0; i < num_dense_feature_groups_; ++i) {
if (!feature_masks_[i])
continue;
int dense_feature_group_index = dense_feature_group_map_[i];
size_t size = train_data_->FeatureGroupNumBin(dense_feature_group_index);
HistogramBinEntry* ptr_smaller_leaf_hist_data = smaller_leaf_histogram_array_[0].RawData() - 1;
HistogramBinEntry* current_histogram = ptr_smaller_leaf_hist_data + train_data_->GroupBinBoundary(dense_feature_group_index);
HistogramBinEntry* gpu_histogram = new HistogramBinEntry[size];
data_size_t num_data = smaller_leaf_splits_->num_data_in_leaf();
printf("Comparing histogram for feature %d size %d, %lu bins\n", dense_feature_group_index, num_data, size);
std::copy(current_histogram, current_histogram + size, gpu_histogram);
std::memset(current_histogram, 0, train_data_->FeatureGroupNumBin(dense_feature_group_index) * sizeof(HistogramBinEntry));
train_data_->FeatureGroupBin(dense_feature_group_index)->ConstructHistogram(
num_data != num_data_ ? smaller_leaf_splits_->data_indices() : nullptr,
num_data,
num_data != num_data_ ? ordered_gradients_.data() : gradients_,
num_data != num_data_ ? ordered_hessians_.data() : hessians_,
current_histogram);
CompareHistograms(gpu_histogram, current_histogram, size, dense_feature_group_index);
std::copy(gpu_histogram, gpu_histogram + size, current_histogram);
delete [] gpu_histogram;
}
#endif
if (larger_leaf_histogram_array_ != nullptr && !use_subtract) {
// construct larger leaf
HistogramBinEntry* ptr_larger_leaf_hist_data = larger_leaf_histogram_array_[0].RawData() - 1;
is_gpu_used = ConstructGPUHistogramsAsync(is_feature_used,
larger_leaf_splits_->data_indices(), larger_leaf_splits_->num_data_in_leaf(),
gradients_, hessians_,
ordered_gradients_.data(), ordered_hessians_.data());
// then construct sparse features on CPU
// We set data_indices to null to avoid rebuilding ordered gradients/hessians
train_data_->ConstructHistograms(is_sparse_feature_used,
nullptr, larger_leaf_splits_->num_data_in_leaf(),
larger_leaf_splits_->LeafIndex(),
ordered_bins_, gradients_, hessians_,
ordered_gradients_.data(), ordered_hessians_.data(), is_constant_hessian_,
ptr_larger_leaf_hist_data);
// wait for GPU to finish, only if GPU is actually used
if (is_gpu_used) {
if (tree_config_->gpu_use_dp) {
// use double precision
WaitAndGetHistograms<HistogramBinEntry>(ptr_larger_leaf_hist_data, is_feature_used);
}
else {
// use single precision
WaitAndGetHistograms<GPUHistogramBinEntry>(ptr_larger_leaf_hist_data, is_feature_used);
}
}
}
}
void GPUTreeLearner::FindBestThresholds() {
SerialTreeLearner::FindBestThresholds();
#if GPU_DEBUG >= 3
for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
if (!is_feature_used_[feature_index]) continue;
if (parent_leaf_histogram_array_ != nullptr
&& !parent_leaf_histogram_array_[feature_index].is_splittable()) {
smaller_leaf_histogram_array_[feature_index].set_is_splittable(false);
continue;
}
size_t bin_size = train_data_->FeatureNumBin(feature_index) + 1;
printf("feature %d smaller leaf:\n", feature_index);
PrintHistograms(smaller_leaf_histogram_array_[feature_index].RawData() - 1, bin_size);
if (larger_leaf_splits_ == nullptr || larger_leaf_splits_->LeafIndex() < 0) { continue; }
printf("feature %d larger leaf:\n", feature_index);
PrintHistograms(larger_leaf_histogram_array_[feature_index].RawData() - 1, bin_size);
}
#endif
}
void GPUTreeLearner::Split(Tree* tree, int best_Leaf, int* left_leaf, int* right_leaf) {
const SplitInfo& best_split_info = best_split_per_leaf_[best_Leaf];
#if GPU_DEBUG >= 2
printf("spliting leaf %d with feature %d thresh %d gain %f stat %f %f %f %f\n", best_Leaf, best_split_info.feature, best_split_info.threshold, best_split_info.gain, best_split_info.left_sum_gradient, best_split_info.right_sum_gradient, best_split_info.left_sum_hessian, best_split_info.right_sum_hessian);
#endif
SerialTreeLearner::Split(tree, best_Leaf, left_leaf, right_leaf);
if (Network::num_machines() == 1) {
// do some sanity check for the GPU algorithm
if (best_split_info.left_count < best_split_info.right_count) {
if ((best_split_info.left_count != smaller_leaf_splits_->num_data_in_leaf()) ||
(best_split_info.right_count!= larger_leaf_splits_->num_data_in_leaf())) {
Log::Fatal("Bug in GPU histogram! split %d: %d, smaller_leaf: %d, larger_leaf: %d\n", best_split_info.left_count, best_split_info.right_count, smaller_leaf_splits_->num_data_in_leaf(), larger_leaf_splits_->num_data_in_leaf());
}
} else {
smaller_leaf_splits_->Init(*right_leaf, data_partition_.get(), best_split_info.right_sum_gradient, best_split_info.right_sum_hessian);
larger_leaf_splits_->Init(*left_leaf, data_partition_.get(), best_split_info.left_sum_gradient, best_split_info.left_sum_hessian);
if ((best_split_info.left_count != larger_leaf_splits_->num_data_in_leaf()) ||
(best_split_info.right_count!= smaller_leaf_splits_->num_data_in_leaf())) {
Log::Fatal("Bug in GPU histogram! split %d: %d, smaller_leaf: %d, larger_leaf: %d\n", best_split_info.left_count, best_split_info.right_count, smaller_leaf_splits_->num_data_in_leaf(), larger_leaf_splits_->num_data_in_leaf());
}
}
}
}
} // namespace LightGBM
#endif // USE_GPU