Unverified Commit 6b56a90c authored by shiyu1994, committed by GitHub

[CUDA] New CUDA version Part 1 (#4630)



* new cuda framework

* add histogram construction kernel

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to GPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add deconstructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: Yu Shi <shiyu1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
parent b857ee10
@@ -80,7 +80,7 @@ else # Linux
       mv $AMDAPPSDK_PATH/lib/x86_64/sdk/* $AMDAPPSDK_PATH/lib/x86_64/
       echo libamdocl64.so > $OPENCL_VENDOR_PATH/amdocl64.icd
     fi
-    if [[ $TASK == "cuda" ]]; then
+    if [[ $TASK == "cuda" || $TASK == "cuda_exp" ]]; then
         echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
         apt-get update
         apt-get install --no-install-recommends -y \
......
@@ -190,21 +190,41 @@ if [[ $TASK == "gpu" ]]; then
     elif [[ $METHOD == "source" ]]; then
         cmake -DUSE_GPU=ON -DOpenCL_INCLUDE_DIR=$AMDAPPSDK_PATH/include/ ..
     fi
-elif [[ $TASK == "cuda" ]]; then
-    sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' $BUILD_DIRECTORY/include/LightGBM/config.h
-    grep -q 'std::string device_type = "cuda"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
+elif [[ $TASK == "cuda" || $TASK == "cuda_exp" ]]; then
+    if [[ $TASK == "cuda" ]]; then
+        sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' $BUILD_DIRECTORY/include/LightGBM/config.h
+        grep -q 'std::string device_type = "cuda"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
+    else
+        sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda_exp";/' $BUILD_DIRECTORY/include/LightGBM/config.h
+        grep -q 'std::string device_type = "cuda_exp"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
+        # by default ``gpu_use_dp=false`` for efficiency. change to ``true`` here for exact results in ci tests
+        sed -i'.bak' 's/gpu_use_dp = false;/gpu_use_dp = true;/' $BUILD_DIRECTORY/include/LightGBM/config.h
+        grep -q 'gpu_use_dp = true' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
+    fi
     if [[ $METHOD == "pip" ]]; then
         cd $BUILD_DIRECTORY/python-package && python setup.py sdist || exit -1
-        pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda || exit -1
+        if [[ $TASK == "cuda" ]]; then
+            pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda || exit -1
+        else
+            pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda-exp || exit -1
+        fi
         pytest $BUILD_DIRECTORY/tests/python_package_test || exit -1
         exit 0
     elif [[ $METHOD == "wheel" ]]; then
-        cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda || exit -1
+        if [[ $TASK == "cuda" ]]; then
+            cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda || exit -1
+        else
+            cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda-exp || exit -1
+        fi
         pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER*.whl -v || exit -1
         pytest $BUILD_DIRECTORY/tests || exit -1
         exit 0
     elif [[ $METHOD == "source" ]]; then
-        cmake -DUSE_CUDA=ON ..
+        if [[ $TASK == "cuda" ]]; then
+            cmake -DUSE_CUDA=ON ..
+        else
+            cmake -DUSE_CUDA_EXP=ON ..
+        fi
     fi
 elif [[ $TASK == "mpi" ]]; then
     if [[ $METHOD == "pip" ]]; then
......
@@ -16,7 +16,7 @@ env:
 jobs:
   test:
-    name: cuda ${{ matrix.cuda_version }} ${{ matrix.method }} (linux, ${{ matrix.compiler }}, Python ${{ matrix.python_version }})
+    name: ${{ matrix.tree_learner }} ${{ matrix.cuda_version }} ${{ matrix.method }} (linux, ${{ matrix.compiler }}, Python ${{ matrix.python_version }})
     runs-on: [self-hosted, linux]
     timeout-minutes: 60
     strategy:
@@ -27,14 +27,27 @@ jobs:
           compiler: gcc
           python_version: "3.8"
           cuda_version: "11.5.1"
+          tree_learner: cuda
         - method: pip
           compiler: clang
           python_version: "3.9"
           cuda_version: "10.0"
+          tree_learner: cuda
         - method: wheel
           compiler: gcc
           python_version: "3.10"
           cuda_version: "9.0"
+          tree_learner: cuda
+        - method: source
+          compiler: gcc
+          python_version: "3.8"
+          cuda_version: "11.5.1"
+          tree_learner: cuda_exp
+        - method: pip
+          compiler: clang
+          python_version: "3.9"
+          cuda_version: "10.0"
+          tree_learner: cuda_exp
     steps:
       - name: Setup or update software on host machine
         run: |
......
@@ -5,6 +5,7 @@ option(USE_SWIG "Enable SWIG to generate Java API" OFF)
 option(USE_HDFS "Enable HDFS support (EXPERIMENTAL)" OFF)
 option(USE_TIMETAG "Set to ON to output time costs" OFF)
 option(USE_CUDA "Enable CUDA-accelerated training (EXPERIMENTAL)" OFF)
+option(USE_CUDA_EXP "Enable CUDA-accelerated training with more acceleration (EXPERIMENTAL)" OFF)
 option(USE_DEBUG "Set to ON for Debug mode" OFF)
 option(USE_SANITIZER "Use santizer flags" OFF)
 set(
@@ -28,7 +29,7 @@ if(__INTEGRATE_OPENCL)
   cmake_minimum_required(VERSION 3.11)
 elseif(USE_GPU OR APPLE)
   cmake_minimum_required(VERSION 3.2)
-elseif(USE_CUDA)
+elseif(USE_CUDA OR USE_CUDA_EXP)
   cmake_minimum_required(VERSION 3.16)
 else()
   cmake_minimum_required(VERSION 3.0)
@@ -133,7 +134,7 @@ else()
   add_definitions(-DUSE_SOCKET)
 endif()
-if(USE_CUDA)
+if(USE_CUDA OR USE_CUDA_EXP)
   set(CMAKE_CUDA_HOST_COMPILER "${CMAKE_CXX_COMPILER}")
   enable_language(CUDA)
   set(USE_OPENMP ON CACHE BOOL "CUDA requires OpenMP" FORCE)
@@ -171,8 +172,12 @@ if(__INTEGRATE_OPENCL)
   endif()
 endif()
-if(USE_CUDA)
-  find_package(CUDA 9.0 REQUIRED)
+if(USE_CUDA OR USE_CUDA_EXP)
+  if(USE_CUDA)
+    find_package(CUDA 9.0 REQUIRED)
+  else()
+    find_package(CUDA 10.0 REQUIRED)
+  endif()
   include_directories(${CUDA_INCLUDE_DIRS})
   set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler=${OpenMP_CXX_FLAGS} -Xcompiler=-fPIC -Xcompiler=-Wall")
@@ -199,7 +204,12 @@ if(USE_CUDA)
   endif()
   message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")
-  add_definitions(-DUSE_CUDA)
+  if(USE_CUDA)
+    add_definitions(-DUSE_CUDA)
+  elseif(USE_CUDA_EXP)
+    add_definitions(-DUSE_CUDA_EXP)
+  endif()
   if(NOT DEFINED CMAKE_CUDA_STANDARD)
     set(CMAKE_CUDA_STANDARD 11)
     set(CMAKE_CUDA_STANDARD_REQUIRED ON)
@@ -369,9 +379,17 @@ file(
     src/objective/*.cpp
     src/network/*.cpp
     src/treelearner/*.cpp
-    if(USE_CUDA)
+    if(USE_CUDA OR USE_CUDA_EXP)
       src/treelearner/*.cu
     endif()
+    if(USE_CUDA_EXP)
+      src/treelearner/cuda/*.cpp
+      src/treelearner/cuda/*.cu
+      src/io/cuda/*.cu
+      src/io/cuda/*.cpp
+      src/cuda/*.cpp
+      src/cuda/*.cu
+    endif()
 )
 add_library(lightgbm_objs OBJECT ${SOURCES})
@@ -493,7 +511,7 @@ if(__INTEGRATE_OPENCL)
   target_link_libraries(lightgbm_objs PUBLIC ${INTEGRATED_OPENCL_LIBRARIES})
 endif()
-if(USE_CUDA)
+if(USE_CUDA OR USE_CUDA_EXP)
   # Disable cmake warning about policy CMP0104. Refer to issue #3754 and PR #4268.
   # Custom target properties does not propagate, thus we need to specify for
   # each target that contains or depends on cuda source.
@@ -501,6 +519,8 @@ if(USE_CUDA)
   set_target_properties(_lightgbm PROPERTIES CUDA_ARCHITECTURES OFF)
   set_target_properties(lightgbm PROPERTIES CUDA_ARCHITECTURES OFF)
+  set_target_properties(lightgbm_objs PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
   # Device linking is not supported for object libraries.
   # Thus we have to specify them on final targets.
   set_target_properties(lightgbm PROPERTIES CUDA_RESOLVE_DEVICE_SYMBOLS ON)
......
@@ -37,6 +37,10 @@ OBJECTS = \
     io/parser.o \
     io/train_share_states.o \
     io/tree.o \
+    io/dense_bin.o \
+    io/sparse_bin.o \
+    io/multi_val_dense_bin.o \
+    io/multi_val_sparse_bin.o \
     metric/dcg_calculator.o \
     metric/metric.o \
     objective/objective_function.o \
......
@@ -38,6 +38,10 @@ OBJECTS = \
     io/parser.o \
     io/train_share_states.o \
     io/tree.o \
+    io/dense_bin.o \
+    io/sparse_bin.o \
+    io/multi_val_dense_bin.o \
+    io/multi_val_sparse_bin.o \
     metric/dcg_calculator.o \
     metric/metric.o \
     objective/objective_function.o \
......
@@ -636,6 +636,8 @@ To build LightGBM CUDA version, run the following commands:
     cmake -DUSE_CUDA=1 ..
     make -j4
 
+Recently, a new CUDA version with better efficiency was implemented as an experimental feature. To build the new CUDA version, replace ``-DUSE_CUDA`` with ``-DUSE_CUDA_EXP`` in the above commands. Please note that the new version requires **CUDA** 10.0 or later libraries.
+
 **Note**: glibc >= 2.14 is required.
 
 **Note**: In some rare cases you may need to install OpenMP runtime library separately (use your package manager and search for ``lib[g|i]omp`` for doing this).
......
@@ -199,7 +199,7 @@ Core Parameters
 - **Note**: please **don't** change this during training, especially when running multiple jobs simultaneously by external packages, otherwise it may cause undesirable errors
 
-- ``device_type`` :raw-html:`<a id="device_type" title="Permalink to this parameter" href="#device_type">&#x1F517;&#xFE0E;</a>`, default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, ``cuda``, aliases: ``device``
+- ``device_type`` :raw-html:`<a id="device_type" title="Permalink to this parameter" href="#device_type">&#x1F517;&#xFE0E;</a>`, default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, ``cuda``, ``cuda_exp``, aliases: ``device``
 
 - device for the tree learning, you can use GPU to achieve the faster learning
@@ -209,6 +209,10 @@ Core Parameters
 - **Note**: refer to `Installation Guide <./Installation-Guide.rst#build-gpu-version>`__ to build LightGBM with GPU support
 
+- **Note**: ``cuda_exp`` is an experimental CUDA version, the installation guide for ``cuda_exp`` is identical with ``cuda``
+
+- **Note**: ``cuda_exp`` is faster than ``cuda`` and will replace ``cuda`` in the future
+
 - ``seed`` :raw-html:`<a id="seed" title="Permalink to this parameter" href="#seed">&#x1F517;&#xFE0E;</a>`, default = ``None``, type = int, aliases: ``random_seed``, ``random_state``
 
 - this seed is used to generate other seeds, e.g. ``data_random_seed``, ``feature_fraction_seed``, etc.
......
@@ -34,6 +34,8 @@ def get_parameter_infos(
     member_infos: List[List[Dict[str, List]]] = []
     with open(config_hpp) as config_hpp_file:
         for line in config_hpp_file:
+            if line.strip() in {"#ifndef __NVCC__", "#endif // __NVCC__"}:
+                continue
             if "#pragma region Parameters" in line:
                 is_inparameter = True
             elif "#pragma region" in line and "Parameters" in line:
......
@@ -119,6 +119,23 @@ class BinMapper {
     }
   }
+  /*!
+   * \brief Maximum categorical value
+   * \return Maximum categorical value for categorical features, 0 for numerical features
+   */
+  inline int MaxCatValue() const {
+    if (bin_2_categorical_.size() == 0) {
+      return 0;
+    }
+    int max_cat_value = bin_2_categorical_[0];
+    for (size_t i = 1; i < bin_2_categorical_.size(); ++i) {
+      if (bin_2_categorical_[i] > max_cat_value) {
+        max_cat_value = bin_2_categorical_[i];
+      }
+    }
+    return max_cat_value;
+  }
+
   /*!
    * \brief Get sizes in byte of this object
    */
@@ -379,6 +396,10 @@ class Bin {
    * \brief Deep copy the bin
    */
   virtual Bin* Clone() = 0;
+
+  virtual const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, std::vector<BinIterator*>* bin_iterator, const int num_threads) const = 0;
+
+  virtual const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, BinIterator** bin_iterator) const = 0;
 };
@@ -452,6 +473,14 @@ class MultiValBin {
   static constexpr double multi_val_bin_sparse_threshold = 0.25f;
   virtual MultiValBin* Clone() = 0;
+
+  #ifdef USE_CUDA_EXP
+  virtual const void* GetRowWiseData(uint8_t* bit_type,
+    size_t* total_size,
+    bool* is_sparse,
+    const void** out_data_ptr,
+    uint8_t* data_ptr_bit_type) const = 0;
+  #endif // USE_CUDA_EXP
 };
 inline uint32_t BinMapper::ValueToBin(double value) const {
......
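A note on the new ``BinMapper::MaxCatValue()`` above: the CUDA categorical-split code added in this PR represents a split as a bitvector over category values (see the "add bitvectors for categorical split" commit), so the maximum category value determines the bitvector size. The following is an editor's sketch of that sizing logic, not code from the PR; the helper name is hypothetical.

    #include <cstdint>
    #include <vector>

    // Illustrative only: allocate a word-aligned bitset with one bit per
    // possible category value, as a categorical-split bitvector requires.
    std::vector<uint32_t> AllocateCatBitset(int max_cat_value) {
      const int num_bits = max_cat_value + 1;      // category values start at 0
      const int num_words = (num_bits + 31) / 32;  // 32 category bits per word
      return std::vector<uint32_t>(num_words, 0U);
    }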
@@ -81,9 +81,11 @@ struct Config {
   static void KV2Map(std::unordered_map<std::string, std::string>* params, const char* kv);
   static std::unordered_map<std::string, std::string> Str2Map(const char* parameters);
+#ifndef __NVCC__
 #pragma region Parameters
 #pragma region Core Parameters
+#endif // __NVCC__
   // [no-save]
   // [doc-only]
@@ -204,12 +206,14 @@ struct Config {
   // [doc-only]
   // type = enum
-  // options = cpu, gpu, cuda
+  // options = cpu, gpu, cuda, cuda_exp
   // alias = device
   // desc = device for the tree learning, you can use GPU to achieve the faster learning
   // desc = **Note**: it is recommended to use the smaller ``max_bin`` (e.g. 63) to get the better speed up
   // desc = **Note**: for the faster speed, GPU uses 32-bit float point to sum up by default, so this may affect the accuracy for some tasks. You can set ``gpu_use_dp=true`` to enable 64-bit float point, but it will slow down the training
   // desc = **Note**: refer to `Installation Guide <./Installation-Guide.rst#build-gpu-version>`__ to build LightGBM with GPU support
+  // desc = **Note**: ``cuda_exp`` is an experimental CUDA version, the installation guide for ``cuda_exp`` is identical with ``cuda``
+  // desc = **Note**: ``cuda_exp`` is faster than ``cuda`` and will replace ``cuda`` in the future
   std::string device_type = "cpu";
   // [doc-only]
@@ -228,9 +232,11 @@ struct Config {
   // desc = **Note**: to avoid potential instability due to numerical issues, please set ``force_col_wise=true`` or ``force_row_wise=true`` when setting ``deterministic=true``
   bool deterministic = false;
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Learning Control Parameters
+#endif // __NVCC__
   // desc = used only with ``cpu`` device type
   // desc = set this to ``true`` to force col-wise histogram building
@@ -568,11 +574,13 @@ struct Config {
   // desc = **Note**: can be used only in CLI version
   int snapshot_freq = -1;
+#ifndef __NVCC__
 #pragma endregion
 #pragma region IO Parameters
 #pragma region Dataset Parameters
+#endif // __NVCC__
   // alias = linear_trees
   // desc = fit piecewise linear gradient boosting tree
@@ -728,9 +736,11 @@ struct Config {
   // desc = **Note**: ``lightgbm-transform`` is not maintained by LightGBM's maintainers. Bug reports or feature requests should go to `issues page <https://github.com/microsoft/lightgbm-transform/issues>`__
   std::string parser_config_file = "";
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Predict Parameters
+#endif // __NVCC__
   // [no-save]
   // desc = used only in ``prediction`` task
@@ -800,9 +810,11 @@ struct Config {
   // desc = **Note**: can be used only in CLI version
   std::string output_result = "LightGBM_predict_result.txt";
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Convert Parameters
+#endif // __NVCC__
   // [no-save]
   // desc = used only in ``convert_model`` task
@@ -818,11 +830,13 @@ struct Config {
   // desc = **Note**: can be used only in CLI version
   std::string convert_model = "gbdt_prediction.cpp";
+#ifndef __NVCC__
 #pragma endregion
 #pragma endregion
 #pragma region Objective Parameters
+#endif // __NVCC__
   // desc = used only in ``rank_xendcg`` objective
   // desc = random seed for objectives, if random process is needed
@@ -902,9 +916,11 @@ struct Config {
   // desc = separate by ``,``
   std::vector<double> label_gain;
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Metric Parameters
+#endif // __NVCC__
   // [doc-only]
   // alias = metrics, metric_types
@@ -976,9 +992,11 @@ struct Config {
   // desc = if not specified, will use equal weights for all classes
   std::vector<double> auc_mu_weights;
+#ifndef __NVCC__
 #pragma endregion
 #pragma region Network Parameters
+#endif // __NVCC__
   // check = >0
   // alias = num_machine
@@ -1007,9 +1025,11 @@ struct Config {
   // desc = list of machines in the following format: ``ip1:port1,ip2:port2``
   std::string machines = "";
+#ifndef __NVCC__
 #pragma endregion
 #pragma region GPU Parameters
+#endif // __NVCC__
   // desc = OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform
   // desc = ``-1`` means the system-wide default platform
@@ -1030,9 +1050,11 @@ struct Config {
   // desc = **Note**: can be used only in CUDA implementation
   int num_gpu = 1;
+#ifndef __NVCC__
 #pragma endregion
 #pragma endregion
+#endif // __NVCC__
   size_t file_load_progress_interval_bytes = size_t(10) * 1024 * 1024 * 1024;
......
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifndef LIGHTGBM_CUDA_CUDA_ALGORITHMS_HPP_
#define LIGHTGBM_CUDA_CUDA_ALGORITHMS_HPP_
#ifdef USE_CUDA_EXP
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <LightGBM/bin.h>
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/utils/log.h>
#include <algorithm>
#define NUM_BANKS_DATA_PARTITION (16)
#define LOG_NUM_BANKS_DATA_PARTITION (4)
#define GLOBAL_PREFIX_SUM_BLOCK_SIZE (1024)
#define BITONIC_SORT_NUM_ELEMENTS (1024)
#define BITONIC_SORT_DEPTH (11)
#define BITONIC_SORT_QUERY_ITEM_BLOCK_SIZE (10)
#define CONFLICT_FREE_INDEX(n) \
((n) + ((n) >> LOG_NUM_BANKS_DATA_PARTITION))
namespace LightGBM {
template <typename T>
__device__ __forceinline__ T ShufflePrefixSum(T value, T* shared_mem_buffer) {
const uint32_t mask = 0xffffffff;
const uint32_t warpLane = threadIdx.x % warpSize;
const uint32_t warpID = threadIdx.x / warpSize;
const uint32_t num_warp = blockDim.x / warpSize;
for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
const T other_value = __shfl_up_sync(mask, value, offset);
if (warpLane >= offset) {
value += other_value;
}
}
if (warpLane == warpSize - 1) {
shared_mem_buffer[warpID] = value;
}
__syncthreads();
if (warpID == 0) {
T warp_sum = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
const T other_warp_sum = __shfl_up_sync(mask, warp_sum, offset);
if (warpLane >= offset) {
warp_sum += other_warp_sum;
}
}
shared_mem_buffer[warpLane] = warp_sum;
}
__syncthreads();
const T warp_base = warpID == 0 ? 0 : shared_mem_buffer[warpID - 1];
return warp_base + value;
}
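// Editor's illustrative sketch (not part of this commit): a minimal kernel
// showing how ShufflePrefixSum is intended to be called. Every thread of the
// block must reach the call (it synchronizes internally), and one shared
// slot per warp (at most 32 for a 1024-thread block) suffices.
//
//   template <typename T>
//   __global__ void InclusivePrefixSumKernel(T* values, data_size_t len) {
//     __shared__ T shared_mem_buffer[32];
//     T value = static_cast<data_size_t>(threadIdx.x) < len ? values[threadIdx.x] : 0;
//     value = ShufflePrefixSum<T>(value, shared_mem_buffer);
//     if (static_cast<data_size_t>(threadIdx.x) < len) values[threadIdx.x] = value;
//   }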
template <typename T>
__device__ __forceinline__ T ShufflePrefixSumExclusive(T value, T* shared_mem_buffer) {
const uint32_t mask = 0xffffffff;
const uint32_t warpLane = threadIdx.x % warpSize;
const uint32_t warpID = threadIdx.x / warpSize;
const uint32_t num_warp = blockDim.x / warpSize;
for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
const T other_value = __shfl_up_sync(mask, value, offset);
if (warpLane >= offset) {
value += other_value;
}
}
if (warpLane == warpSize - 1) {
shared_mem_buffer[warpID] = value;
}
__syncthreads();
if (warpID == 0) {
T warp_sum = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
for (uint32_t offset = 1; offset < warpSize; offset <<= 1) {
const T other_warp_sum = __shfl_up_sync(mask, warp_sum, offset);
if (warpLane >= offset) {
warp_sum += other_warp_sum;
}
}
shared_mem_buffer[warpLane] = warp_sum;
}
__syncthreads();
const T warp_base = warpID == 0 ? 0 : shared_mem_buffer[warpID - 1];
const T inclusive_result = warp_base + value;
if (threadIdx.x % warpSize == warpSize - 1) {
shared_mem_buffer[warpLane] = inclusive_result;
}
__syncthreads();
T exclusive_result = __shfl_up_sync(mask, inclusive_result, 1);
if (threadIdx.x == 0) {
exclusive_result = 0;
} else if (threadIdx.x % warpSize == 0) {
exclusive_result = shared_mem_buffer[warpLane - 1];
}
return exclusive_result;
}
template <typename T>
void ShufflePrefixSumGlobal(T* values, size_t len, T* block_prefix_sum_buffer);
template <typename T>
__device__ __forceinline__ T ShuffleReduceSumWarp(T value, const data_size_t len) {
if (len > 0) {
const uint32_t mask = 0xffffffff;
for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
value += __shfl_down_sync(mask, value, offset);
}
}
return value;
}
// reduce values from a 1-dimensional block (block size must be no greater than 1024)
template <typename T>
__device__ __forceinline__ T ShuffleReduceSum(T value, T* shared_mem_buffer, const size_t len) {
const uint32_t warpLane = threadIdx.x % warpSize;
const uint32_t warpID = threadIdx.x / warpSize;
const data_size_t warp_len = min(static_cast<data_size_t>(warpSize), static_cast<data_size_t>(len) - static_cast<data_size_t>(warpID * warpSize));
value = ShuffleReduceSumWarp<T>(value, warp_len);
if (warpLane == 0) {
shared_mem_buffer[warpID] = value;
}
__syncthreads();
const data_size_t num_warp = static_cast<data_size_t>((len + warpSize - 1) / warpSize);
if (warpID == 0) {
value = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
value = ShuffleReduceSumWarp<T>(value, num_warp);
}
return value;
}
template <typename T>
__device__ __forceinline__ T ShuffleReduceMaxWarp(T value, const data_size_t len) {
if (len > 0) {
const uint32_t mask = 0xffffffff;
for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
value = max(value, __shfl_down_sync(mask, value, offset));
}
}
return value;
}
// reduce values from a 1-dimensional block (block size must be no greater than 1024)
template <typename T>
__device__ __forceinline__ T ShuffleReduceMax(T value, T* shared_mem_buffer, const size_t len) {
const uint32_t warpLane = threadIdx.x % warpSize;
const uint32_t warpID = threadIdx.x / warpSize;
const data_size_t warp_len = min(static_cast<data_size_t>(warpSize), static_cast<data_size_t>(len) - static_cast<data_size_t>(warpID * warpSize));
value = ShuffleReduceMaxWarp<T>(value, warp_len);
if (warpLane == 0) {
shared_mem_buffer[warpID] = value;
}
__syncthreads();
const data_size_t num_warp = static_cast<data_size_t>((len + warpSize - 1) / warpSize);
if (warpID == 0) {
value = (warpLane < num_warp ? shared_mem_buffer[warpLane] : 0);
value = ShuffleReduceMaxWarp<T>(value, num_warp);
}
return value;
}
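// Editor's illustrative sketch (not part of this commit): a block-level sum
// reduction on top of ShuffleReduceSum. The `len` argument is the number of
// valid values contributed by this block; thread 0 writes one partial sum.
//
//   template <typename T>
//   __global__ void BlockSumKernel(const T* values, data_size_t len, T* block_sums) {
//     __shared__ T shared_mem_buffer[32];
//     const data_size_t index = static_cast<data_size_t>(blockIdx.x * blockDim.x + threadIdx.x);
//     const T value = index < len ? values[index] : 0;
//     const data_size_t block_start = static_cast<data_size_t>(blockIdx.x * blockDim.x);
//     const size_t valid = static_cast<size_t>(min(static_cast<data_size_t>(blockDim.x), len - block_start));
//     const T sum = ShuffleReduceSum<T>(value, shared_mem_buffer, valid);
//     if (threadIdx.x == 0) block_sums[blockIdx.x] = sum;
//   }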
// calculate prefix sum values within a 1-dimensional block in global memory, exclusively
template <typename T>
__device__ __forceinline__ void GlobalMemoryPrefixSum(T* array, const size_t len) {
const size_t num_values_per_thread = (len + blockDim.x - 1) / blockDim.x;
const size_t start = threadIdx.x * num_values_per_thread;
const size_t end = min(start + num_values_per_thread, len);
T thread_sum = 0;
for (size_t index = start; index < end; ++index) {
thread_sum += array[index];
}
__shared__ T shared_mem[32];
const T thread_base = ShufflePrefixSumExclusive<T>(thread_sum, shared_mem);
if (start < end) {
array[start] += thread_base;
}
for (size_t index = start + 1; index < end; ++index) {
array[index] += array[index - 1];
}
}
template <typename VAL_T, typename INDEX_T, bool ASCENDING>
__device__ __forceinline__ void BitonicArgSort_1024(const VAL_T* scores, INDEX_T* indices, const INDEX_T num_items) {
INDEX_T depth = 1;
INDEX_T num_items_aligned = 1;
INDEX_T num_items_ref = num_items - 1;
while (num_items_ref > 0) {
num_items_ref >>= 1;
num_items_aligned <<= 1;
++depth;
}
for (INDEX_T outer_depth = depth - 1; outer_depth >= 1; --outer_depth) {
const INDEX_T outer_segment_length = 1 << (depth - outer_depth);
const INDEX_T outer_segment_index = threadIdx.x / outer_segment_length;
const bool ascending = ASCENDING ? (outer_segment_index % 2 == 0) : (outer_segment_index % 2 > 0);
for (INDEX_T inner_depth = outer_depth; inner_depth < depth; ++inner_depth) {
const INDEX_T segment_length = 1 << (depth - inner_depth);
const INDEX_T half_segment_length = segment_length >> 1;
const INDEX_T half_segment_index = threadIdx.x / half_segment_length;
if (threadIdx.x < num_items_aligned) {
if (half_segment_index % 2 == 0) {
const INDEX_T index_to_compare = threadIdx.x + half_segment_length;
if ((scores[indices[threadIdx.x]] > scores[indices[index_to_compare]]) == ascending) {
const INDEX_T index = indices[threadIdx.x];
indices[threadIdx.x] = indices[index_to_compare];
indices[index_to_compare] = index;
}
}
}
__syncthreads();
}
}
}
template <typename VAL_T, typename INDEX_T, bool ASCENDING, uint32_t BLOCK_DIM, uint32_t MAX_DEPTH>
__device__ void BitonicArgSortDevice(const VAL_T* values, INDEX_T* indices, const int len) {
__shared__ VAL_T shared_values[BLOCK_DIM];
__shared__ INDEX_T shared_indices[BLOCK_DIM];
int len_to_shift = len - 1;
int max_depth = 1;
while (len_to_shift > 0) {
len_to_shift >>= 1;
++max_depth;
}
const int num_blocks = (len + static_cast<int>(BLOCK_DIM) - 1) / static_cast<int>(BLOCK_DIM);
for (int block_index = 0; block_index < num_blocks; ++block_index) {
const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
if (this_index < len) {
shared_values[threadIdx.x] = values[this_index];
shared_indices[threadIdx.x] = this_index;
} else {
shared_indices[threadIdx.x] = len;
}
__syncthreads();
for (int depth = max_depth - 1; depth > max_depth - static_cast<int>(MAX_DEPTH); --depth) {
const int segment_length = (1 << (max_depth - depth));
const int segment_index = this_index / segment_length;
const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
{
const int half_segment_length = (segment_length >> 1);
const int half_segment_index = this_index / half_segment_length;
const int num_total_segment = (len + segment_length - 1) / segment_length;
const int offset = (segment_index == num_total_segment - 1 && ascending == ASCENDING) ?
(num_total_segment * segment_length - len) : 0;
if (half_segment_index % 2 == 0) {
const int segment_start = segment_index * segment_length;
if (this_index >= offset + segment_start) {
const int other_index = static_cast<int>(threadIdx.x) + half_segment_length - offset;
const INDEX_T this_data_index = shared_indices[threadIdx.x];
const INDEX_T other_data_index = shared_indices[other_index];
const VAL_T this_value = shared_values[threadIdx.x];
const VAL_T other_value = shared_values[other_index];
if (other_data_index < len && (this_value > other_value) == ascending) {
shared_indices[threadIdx.x] = other_data_index;
shared_indices[other_index] = this_data_index;
shared_values[threadIdx.x] = other_value;
shared_values[other_index] = this_value;
}
}
}
__syncthreads();
}
for (int inner_depth = depth + 1; inner_depth < max_depth; ++inner_depth) {
const int half_segment_length = (1 << (max_depth - inner_depth - 1));
const int half_segment_index = this_index / half_segment_length;
if (half_segment_index % 2 == 0) {
const int other_index = static_cast<int>(threadIdx.x) + half_segment_length;
const INDEX_T this_data_index = shared_indices[threadIdx.x];
const INDEX_T other_data_index = shared_indices[other_index];
const VAL_T this_value = shared_values[threadIdx.x];
const VAL_T other_value = shared_values[other_index];
if (other_data_index < len && (this_value > other_value) == ascending) {
shared_indices[threadIdx.x] = other_data_index;
shared_indices[other_index] = this_data_index;
shared_values[threadIdx.x] = other_value;
shared_values[other_index] = this_value;
}
}
__syncthreads();
}
}
if (this_index < len) {
indices[this_index] = shared_indices[threadIdx.x];
}
__syncthreads();
}
for (int depth = max_depth - static_cast<int>(MAX_DEPTH); depth >= 1; --depth) {
const int segment_length = (1 << (max_depth - depth));
{
const int num_total_segment = (len + segment_length - 1) / segment_length;
const int half_segment_length = (segment_length >> 1);
for (int block_index = 0; block_index < num_blocks; ++block_index) {
const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
const int segment_index = this_index / segment_length;
const int half_segment_index = this_index / half_segment_length;
const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
const int offset = (segment_index == num_total_segment - 1 && ascending == ASCENDING) ?
(num_total_segment * segment_length - len) : 0;
if (half_segment_index % 2 == 0) {
const int segment_start = segment_index * segment_length;
if (this_index >= offset + segment_start) {
const int other_index = this_index + half_segment_length - offset;
if (other_index < len) {
const INDEX_T this_data_index = indices[this_index];
const INDEX_T other_data_index = indices[other_index];
const VAL_T this_value = values[this_data_index];
const VAL_T other_value = values[other_data_index];
if ((this_value > other_value) == ascending) {
indices[this_index] = other_data_index;
indices[other_index] = this_data_index;
}
}
}
}
}
__syncthreads();
}
for (int inner_depth = depth + 1; inner_depth <= max_depth - static_cast<int>(MAX_DEPTH); ++inner_depth) {
const int half_segment_length = (1 << (max_depth - inner_depth - 1));
for (int block_index = 0; block_index < num_blocks; ++block_index) {
const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
const int segment_index = this_index / segment_length;
const int half_segment_index = this_index / half_segment_length;
const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
if (half_segment_index % 2 == 0) {
const int other_index = this_index + half_segment_length;
if (other_index < len) {
const INDEX_T this_data_index = indices[this_index];
const INDEX_T other_data_index = indices[other_index];
const VAL_T this_value = values[this_data_index];
const VAL_T other_value = values[other_data_index];
if ((this_value > other_value) == ascending) {
indices[this_index] = other_data_index;
indices[other_index] = this_data_index;
}
}
}
__syncthreads();
}
}
for (int block_index = 0; block_index < num_blocks; ++block_index) {
const int this_index = block_index * static_cast<int>(BLOCK_DIM) + static_cast<int>(threadIdx.x);
const int segment_index = this_index / segment_length;
const bool ascending = ASCENDING ? (segment_index % 2 == 0) : (segment_index % 2 == 1);
if (this_index < len) {
const INDEX_T index = indices[this_index];
shared_values[threadIdx.x] = values[index];
shared_indices[threadIdx.x] = index;
} else {
shared_indices[threadIdx.x] = len;
}
__syncthreads();
for (int inner_depth = max_depth - static_cast<int>(MAX_DEPTH) + 1; inner_depth < max_depth; ++inner_depth) {
const int half_segment_length = (1 << (max_depth - inner_depth - 1));
const int half_segment_index = this_index / half_segment_length;
if (half_segment_index % 2 == 0) {
const int other_index = static_cast<int>(threadIdx.x) + half_segment_length;
const INDEX_T this_data_index = shared_indices[threadIdx.x];
const INDEX_T other_data_index = shared_indices[other_index];
const VAL_T this_value = shared_values[threadIdx.x];
const VAL_T other_value = shared_values[other_index];
if (other_data_index < len && (this_value > other_value) == ascending) {
shared_indices[threadIdx.x] = other_data_index;
shared_indices[other_index] = this_data_index;
shared_values[threadIdx.x] = other_value;
shared_values[other_index] = this_value;
}
}
__syncthreads();
}
if (this_index < len) {
indices[this_index] = shared_indices[threadIdx.x];
}
__syncthreads();
}
}
}
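// Editor's illustrative sketch (not part of this commit): BitonicArgSortDevice
// is specialized per use case through its template parameters. For example, a
// descending argsort of double scores, launched as a single block of
// BITONIC_SORT_NUM_ELEMENTS threads, could look like:
//
//   __global__ void ArgSortKernel(const double* scores, data_size_t* indices, const int len) {
//     BitonicArgSortDevice<double, data_size_t, false, BITONIC_SORT_NUM_ELEMENTS, BITONIC_SORT_DEPTH>(scores, indices, len);
//   }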
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_CUDA_CUDA_ALGORITHMS_HPP_
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_COLUMN_DATA_HPP_
#define LIGHTGBM_CUDA_COLUMN_DATA_HPP_
#include <LightGBM/config.h>
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/bin.h>
#include <LightGBM/utils/openmp_wrapper.h>
#include <vector>
namespace LightGBM {
class CUDAColumnData {
public:
CUDAColumnData(const data_size_t num_data, const int gpu_device_id);
~CUDAColumnData();
void Init(const int num_columns,
const std::vector<const void*>& column_data,
const std::vector<BinIterator*>& column_bin_iterator,
const std::vector<uint8_t>& column_bit_type,
const std::vector<uint32_t>& feature_max_bin,
const std::vector<uint32_t>& feature_min_bin,
const std::vector<uint32_t>& feature_offset,
const std::vector<uint32_t>& feature_most_freq_bin,
const std::vector<uint32_t>& feature_default_bin,
const std::vector<uint8_t>& feature_missing_is_zero,
const std::vector<uint8_t>& feature_missing_is_na,
const std::vector<uint8_t>& feature_mfb_is_zero,
const std::vector<uint8_t>& feature_mfb_is_na,
const std::vector<int>& feature_to_column);
const void* GetColumnData(const int column_index) const { return data_by_column_[column_index]; }
void CopySubrow(const CUDAColumnData* full_set, const data_size_t* used_indices, const data_size_t num_used_indices);
void* const* cuda_data_by_column() const { return cuda_data_by_column_; }
uint32_t feature_min_bin(const int feature_index) const { return feature_min_bin_[feature_index]; }
uint32_t feature_max_bin(const int feature_index) const { return feature_max_bin_[feature_index]; }
uint32_t feature_offset(const int feature_index) const { return feature_offset_[feature_index]; }
uint32_t feature_most_freq_bin(const int feature_index) const { return feature_most_freq_bin_[feature_index]; }
uint32_t feature_default_bin(const int feature_index) const { return feature_default_bin_[feature_index]; }
uint8_t feature_missing_is_zero(const int feature_index) const { return feature_missing_is_zero_[feature_index]; }
uint8_t feature_missing_is_na(const int feature_index) const { return feature_missing_is_na_[feature_index]; }
uint8_t feature_mfb_is_zero(const int feature_index) const { return feature_mfb_is_zero_[feature_index]; }
uint8_t feature_mfb_is_na(const int feature_index) const { return feature_mfb_is_na_[feature_index]; }
const uint32_t* cuda_feature_min_bin() const { return cuda_feature_min_bin_; }
const uint32_t* cuda_feature_max_bin() const { return cuda_feature_max_bin_; }
const uint32_t* cuda_feature_offset() const { return cuda_feature_offset_; }
const uint32_t* cuda_feature_most_freq_bin() const { return cuda_feature_most_freq_bin_; }
const uint32_t* cuda_feature_default_bin() const { return cuda_feature_default_bin_; }
const uint8_t* cuda_feature_missing_is_zero() const { return cuda_feature_missing_is_zero_; }
const uint8_t* cuda_feature_missing_is_na() const { return cuda_feature_missing_is_na_; }
const uint8_t* cuda_feature_mfb_is_zero() const { return cuda_feature_mfb_is_zero_; }
const uint8_t* cuda_feature_mfb_is_na() const { return cuda_feature_mfb_is_na_; }
const int* cuda_feature_to_column() const { return cuda_feature_to_column_; }
const uint8_t* cuda_column_bit_type() const { return cuda_column_bit_type_; }
int feature_to_column(const int feature_index) const { return feature_to_column_[feature_index]; }
uint8_t column_bit_type(const int column_index) const { return column_bit_type_[column_index]; }
private:
template <bool IS_SPARSE, bool IS_4BIT, typename BIN_TYPE>
void InitOneColumnData(const void* in_column_data, BinIterator* bin_iterator, void** out_column_data_pointer);
void LaunchCopySubrowKernel(void* const* in_cuda_data_by_column);
void InitColumnMetaInfo();
void ResizeWhenCopySubrow(const data_size_t num_used_indices);
int num_threads_;
data_size_t num_data_;
int num_columns_;
std::vector<uint8_t> column_bit_type_;
std::vector<uint32_t> feature_min_bin_;
std::vector<uint32_t> feature_max_bin_;
std::vector<uint32_t> feature_offset_;
std::vector<uint32_t> feature_most_freq_bin_;
std::vector<uint32_t> feature_default_bin_;
std::vector<uint8_t> feature_missing_is_zero_;
std::vector<uint8_t> feature_missing_is_na_;
std::vector<uint8_t> feature_mfb_is_zero_;
std::vector<uint8_t> feature_mfb_is_na_;
void** cuda_data_by_column_;
std::vector<int> feature_to_column_;
std::vector<void*> data_by_column_;
uint8_t* cuda_column_bit_type_;
uint32_t* cuda_feature_min_bin_;
uint32_t* cuda_feature_max_bin_;
uint32_t* cuda_feature_offset_;
uint32_t* cuda_feature_most_freq_bin_;
uint32_t* cuda_feature_default_bin_;
uint8_t* cuda_feature_missing_is_zero_;
uint8_t* cuda_feature_missing_is_na_;
uint8_t* cuda_feature_mfb_is_zero_;
uint8_t* cuda_feature_mfb_is_na_;
int* cuda_feature_to_column_;
// used when bagging with subset
data_size_t* cuda_used_indices_;
data_size_t num_used_indices_;
data_size_t cur_subset_buffer_size_;
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_COLUMN_DATA_HPP_
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_META_DATA_HPP_
#define LIGHTGBM_CUDA_META_DATA_HPP_
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/meta.h>
#include <vector>
namespace LightGBM {
class CUDAMetadata {
public:
explicit CUDAMetadata(const int gpu_device_id);
~CUDAMetadata();
void Init(const std::vector<label_t>& label,
const std::vector<label_t>& weight,
const std::vector<data_size_t>& query_boundaries,
const std::vector<label_t>& query_weights,
const std::vector<double>& init_score);
void SetLabel(const label_t* label, data_size_t len);
void SetWeights(const label_t* weights, data_size_t len);
void SetQuery(const data_size_t* query, const label_t* query_weights, data_size_t num_queries);
void SetInitScore(const double* init_score, data_size_t len);
const label_t* cuda_label() const { return cuda_label_; }
const label_t* cuda_weights() const { return cuda_weights_; }
const data_size_t* cuda_query_boundaries() const { return cuda_query_boundaries_; }
const label_t* cuda_query_weights() const { return cuda_query_weights_; }
private:
label_t* cuda_label_;
label_t* cuda_weights_;
data_size_t* cuda_query_boundaries_;
label_t* cuda_query_weights_;
double* cuda_init_score_;
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_META_DATA_HPP_
#endif // USE_CUDA_EXP
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifndef LIGHTGBM_CUDA_CUDA_RANDOM_HPP_
#define LIGHTGBM_CUDA_CUDA_RANDOM_HPP_
#ifdef USE_CUDA_EXP
#include <cuda.h>
#include <cuda_runtime.h>
namespace LightGBM {
/*!
* \brief A wrapper for random generator
*/
class CUDARandom {
public:
/*!
* \brief Set specific seed
*/
__device__ void SetSeed(int seed) {
x = seed;
}
/*!
* \brief Generate random integer, int16 range, i.e. [0, 32768)
* \param lower_bound lower bound
* \param upper_bound upper bound
* \return The random integer between [lower_bound, upper_bound)
*/
__device__ inline int NextShort(int lower_bound, int upper_bound) {
return (RandInt16()) % (upper_bound - lower_bound) + lower_bound;
}
/*!
* \brief Generate random integer, int32 range
* \param lower_bound lower bound
* \param upper_bound upper bound
* \return The random integer between [lower_bound, upper_bound)
*/
__device__ inline int NextInt(int lower_bound, int upper_bound) {
return (RandInt32()) % (upper_bound - lower_bound) + lower_bound;
}
/*!
* \brief Generate random float data
* \return The random float between [0.0, 1.0)
*/
__device__ inline float NextFloat() {
// get random float in [0,1)
return static_cast<float>(RandInt16()) / (32768.0f);
}
private:
__device__ inline int RandInt16() {
x = (214013 * x + 2531011);
return static_cast<int>((x >> 16) & 0x7FFF);
}
__device__ inline int RandInt32() {
x = (214013 * x + 2531011);
return static_cast<int>(x & 0x7FFFFFFF);
}
unsigned int x = 123456789;
};
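// Editor's illustrative sketch (not part of this commit): per-thread usage,
// seeding each thread differently so the sequences are decorrelated.
//
//   __global__ void BernoulliKernel(const int base_seed, const float pos_rate, const int len, int8_t* out) {
//     const int index = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
//     if (index < len) {
//       CUDARandom rand_generator;
//       rand_generator.SetSeed(base_seed + index);
//       out[index] = rand_generator.NextFloat() < pos_rate ? 1 : 0;
//     }
//   }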
} // namespace LightGBM
#endif // USE_CUDA_EXP
#endif // LIGHTGBM_CUDA_CUDA_RANDOM_HPP_
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_ROW_DATA_HPP_
#define LIGHTGBM_CUDA_ROW_DATA_HPP_
#include <LightGBM/bin.h>
#include <LightGBM/config.h>
#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/dataset.h>
#include <LightGBM/train_share_states.h>
#include <LightGBM/utils/openmp_wrapper.h>
#include <vector>
#define COPY_SUBROW_BLOCK_SIZE_ROW_DATA (1024)
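// Editor's note: the following values are the number of shared-memory
// histogram entries available per thread block; the double-precision (DP)
// budget is smaller when built against CUDA 10.0, and single precision (SP)
// fits twice as many entries in the same space.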
#if CUDART_VERSION == 10000
#define DP_SHARED_HIST_SIZE (5560)
#else
#define DP_SHARED_HIST_SIZE (6144)
#endif
#define SP_SHARED_HIST_SIZE (DP_SHARED_HIST_SIZE * 2)
namespace LightGBM {
class CUDARowData {
public:
CUDARowData(const Dataset* train_data,
const TrainingShareStates* train_share_state,
const int gpu_device_id,
const bool gpu_use_dp);
~CUDARowData();
void Init(const Dataset* train_data,
TrainingShareStates* train_share_state);
void CopySubrow(const CUDARowData* full_set, const data_size_t* used_indices, const data_size_t num_used_indices);
void CopySubcol(const CUDARowData* full_set, const std::vector<int8_t>& is_feature_used, const Dataset* train_data);
void CopySubrowAndSubcol(const CUDARowData* full_set, const data_size_t* used_indices,
const data_size_t num_used_indices, const std::vector<bool>& is_feature_used, const Dataset* train_data);
template <typename BIN_TYPE>
const BIN_TYPE* GetBin() const;
template <typename PTR_TYPE>
const PTR_TYPE* GetPartitionPtr() const;
template <typename PTR_TYPE>
const PTR_TYPE* GetRowPtr() const;
int NumLargeBinPartition() const { return static_cast<int>(large_bin_partitions_.size()); }
int num_feature_partitions() const { return num_feature_partitions_; }
int max_num_column_per_partition() const { return max_num_column_per_partition_; }
bool is_sparse() const { return is_sparse_; }
uint8_t bit_type() const { return bit_type_; }
uint8_t row_ptr_bit_type() const { return row_ptr_bit_type_; }
const int* cuda_feature_partition_column_index_offsets() const { return cuda_feature_partition_column_index_offsets_; }
const uint32_t* cuda_column_hist_offsets() const { return cuda_column_hist_offsets_; }
const uint32_t* cuda_partition_hist_offsets() const { return cuda_partition_hist_offsets_; }
int shared_hist_size() const { return shared_hist_size_; }
private:
void DivideCUDAFeatureGroups(const Dataset* train_data, TrainingShareStates* share_state);
template <typename BIN_TYPE>
void GetDenseDataPartitioned(const BIN_TYPE* row_wise_data, std::vector<BIN_TYPE>* partitioned_data);
template <typename BIN_TYPE, typename ROW_PTR_TYPE>
void GetSparseDataPartitioned(const BIN_TYPE* row_wise_data,
const ROW_PTR_TYPE* row_ptr,
std::vector<std::vector<BIN_TYPE>>* partitioned_data,
std::vector<std::vector<ROW_PTR_TYPE>>* partitioned_row_ptr,
std::vector<ROW_PTR_TYPE>* partition_ptr);
template <typename BIN_TYPE, typename ROW_PTR_TYPE>
void InitSparseData(const BIN_TYPE* host_data,
const ROW_PTR_TYPE* host_row_ptr,
BIN_TYPE** cuda_data,
ROW_PTR_TYPE** cuda_row_ptr,
ROW_PTR_TYPE** cuda_partition_ptr);
/*! \brief number of threads to use */
int num_threads_;
/*! \brief number of training data */
data_size_t num_data_;
/*! \brief number of bins of all features */
int num_total_bin_;
/*! \brief number of feature groups in dataset */
int num_feature_group_;
/*! \brief number of features in dataset */
int num_feature_;
/*! \brief number of bits used to store each bin value */
uint8_t bit_type_;
/*! \brief number of bits used to store each row pointer value */
uint8_t row_ptr_bit_type_;
/*! \brief is sparse row wise data */
bool is_sparse_;
/*! \brief start column index of each feature partition */
std::vector<int> feature_partition_column_index_offsets_;
/*! \brief histogram offset of each column */
std::vector<uint32_t> column_hist_offsets_;
/*! \brief histogram offset of each partition */
std::vector<uint32_t> partition_hist_offsets_;
/*! \brief maximum number of columns among all feature partitions */
int max_num_column_per_partition_;
/*! \brief number of partitions */
int num_feature_partitions_;
/*! \brief used when bagging with subset, number of used indices */
data_size_t num_used_indices_;
/*! \brief used when bagging with subset, number of total elements */
uint64_t num_total_elements_;
/*! \brief used when bagging with column subset, buffer size for the maximum number of feature partitions */
int cur_num_feature_partition_buffer_size_;
/*! \brief CUDA device ID */
int gpu_device_id_;
/*! \brief indices of partitions whose bins are too large for their histograms to fit into shared memory; each large bin partition contains a single column */
std::vector<int> large_bin_partitions_;
/*! \brief indices of partitions with small bins */
std::vector<int> small_bin_partitions_;
/*! \brief shared memory size used by histogram */
int shared_hist_size_;
/*! \brief whether to use double precision in histograms per block */
bool gpu_use_dp_;
// CUDA memory
/*! \brief row-wise data stored in CUDA, 8 bits */
uint8_t* cuda_data_uint8_t_;
/*! \brief row-wise data stored in CUDA, 16 bits */
uint16_t* cuda_data_uint16_t_;
/*! \brief row-wise data stored in CUDA, 32 bits */
uint32_t* cuda_data_uint32_t_;
/*! \brief row pointer stored in CUDA, 16 bits */
uint16_t* cuda_row_ptr_uint16_t_;
/*! \brief row pointer stored in CUDA, 32 bits */
uint32_t* cuda_row_ptr_uint32_t_;
/*! \brief row pointer stored in CUDA, 64 bits */
uint64_t* cuda_row_ptr_uint64_t_;
/*! \brief partition bin offsets, 16 bits */
uint16_t* cuda_partition_ptr_uint16_t_;
/*! \brief partition bin offsets, 32 bits */
uint32_t* cuda_partition_ptr_uint32_t_;
/*! \brief partition bin offsets, 64 bits */
uint64_t* cuda_partition_ptr_uint64_t_;
/*! \brief start column index of each feature partition */
int* cuda_feature_partition_column_index_offsets_;
/*! \brief histogram offset of each column */
uint32_t* cuda_column_hist_offsets_;
/*! \brief histogram offset of each partition */
uint32_t* cuda_partition_hist_offsets_;
/*! \brief block buffer when calculating prefix sum */
uint16_t* cuda_block_buffer_uint16_t_;
/*! \brief block buffer when calculating prefix sum */
uint32_t* cuda_block_buffer_uint32_t_;
/*! \brief block buffer when calculating prefix sum */
uint64_t* cuda_block_buffer_uint64_t_;
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_ROW_DATA_HPP_
#endif // USE_CUDA_EXP
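// Illustrative caller-side sketch, not from the header above: row data is
// stored at the narrowest width that holds all bin values, and a caller
// dispatches on bit_type() to pick the matching GetBin<BIN_TYPE>()
// instantiation. LaunchHistogramKernel8/16/32 are hypothetical names; the
// launches are left as comments.
inline void DispatchOnBinWidth(const LightGBM::CUDARowData* row_data) {
  if (row_data->bit_type() == 8) {
    const uint8_t* bins = row_data->GetBin<uint8_t>();
    (void)bins;  // LaunchHistogramKernel8(bins, ...);
  } else if (row_data->bit_type() == 16) {
    const uint16_t* bins = row_data->GetBin<uint16_t>();
    (void)bins;  // LaunchHistogramKernel16(bins, ...);
  } else {
    const uint32_t* bins = row_data->GetBin<uint32_t>();
    (void)bins;  // LaunchHistogramKernel32(bins, ...);
  }
}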
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for
* license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_CUDA_SPLIT_INFO_HPP_
#define LIGHTGBM_CUDA_CUDA_SPLIT_INFO_HPP_
#include <LightGBM/meta.h>
namespace LightGBM {
class CUDASplitInfo {
public:
bool is_valid;
int leaf_index;
double gain;
int inner_feature_index;
uint32_t threshold;
bool default_left;
double left_sum_gradients;
double left_sum_hessians;
data_size_t left_count;
double left_gain;
double left_value;
double right_sum_gradients;
double right_sum_hessians;
data_size_t right_count;
double right_gain;
double right_value;
int num_cat_threshold = 0;
uint32_t* cat_threshold = nullptr;
int* cat_threshold_real = nullptr;
__device__ CUDASplitInfo() {
num_cat_threshold = 0;
cat_threshold = nullptr;
cat_threshold_real = nullptr;
}
__device__ ~CUDASplitInfo() {
if (num_cat_threshold > 0) {
// operator= allocates these arrays with device-side new[], so they must
// be released with delete[] rather than cudaFree
if (cat_threshold != nullptr) {
delete[] cat_threshold;
}
if (cat_threshold_real != nullptr) {
delete[] cat_threshold_real;
}
}
}
__device__ CUDASplitInfo& operator=(const CUDASplitInfo& other) {
is_valid = other.is_valid;
leaf_index = other.leaf_index;
gain = other.gain;
inner_feature_index = other.inner_feature_index;
threshold = other.threshold;
default_left = other.default_left;
left_sum_gradients = other.left_sum_gradients;
left_sum_hessians = other.left_sum_hessians;
left_count = other.left_count;
left_gain = other.left_gain;
left_value = other.left_value;
right_sum_gradients = other.right_sum_gradients;
right_sum_hessians = other.right_sum_hessians;
right_count = other.right_count;
right_gain = other.right_gain;
right_value = other.right_value;
num_cat_threshold = other.num_cat_threshold;
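// note: the buffers below are allocated only on first use, which assumes
// num_cat_threshold never grows across assignments to the same object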
if (num_cat_threshold > 0 && cat_threshold == nullptr) {
cat_threshold = new uint32_t[num_cat_threshold];
}
if (num_cat_threshold > 0 && cat_threshold_real == nullptr) {
cat_threshold_real = new int[num_cat_threshold];
}
if (num_cat_threshold > 0) {
if (other.cat_threshold != nullptr) {
for (int i = 0; i < num_cat_threshold; ++i) {
cat_threshold[i] = other.cat_threshold[i];
}
}
if (other.cat_threshold_real != nullptr) {
for (int i = 0; i < num_cat_threshold; ++i) {
cat_threshold_real[i] = other.cat_threshold_real[i];
}
}
}
return *this;
}
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_CUDA_SPLIT_INFO_HPP_
#endif // USE_CUDA_EXP
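// Illustrative device helper, not part of the header above: the comparison a
// best-split reduction needs when merging per-block candidates into a single
// winner. ReplaceIfBetter is a hypothetical name.
__device__ inline void ReplaceIfBetter(LightGBM::CUDASplitInfo* best,
                                       const LightGBM::CUDASplitInfo& candidate) {
  if (candidate.is_valid && (!best->is_valid || candidate.gain > best->gain)) {
    *best = candidate;  // deep-copies categorical thresholds via operator=
  }
}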
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifdef USE_CUDA_EXP
#ifndef LIGHTGBM_CUDA_CUDA_TREE_HPP_
#define LIGHTGBM_CUDA_CUDA_TREE_HPP_
#include <LightGBM/cuda/cuda_column_data.hpp>
#include <LightGBM/cuda/cuda_split_info.hpp>
#include <LightGBM/tree.h>
#include <LightGBM/bin.h>
namespace LightGBM {
__device__ void SetDecisionTypeCUDA(int8_t* decision_type, bool input, int8_t mask);
__device__ void SetMissingTypeCUDA(int8_t* decision_type, int8_t input);
__device__ bool GetDecisionTypeCUDA(int8_t decision_type, int8_t mask);
__device__ int8_t GetMissingTypeCUDA(int8_t decision_type);
__device__ bool IsZeroCUDA(double fval);
class CUDATree : public Tree {
public:
/*!
* \brief Constructor
* \param max_leaves The maximum number of leaves
* \param track_branch_features Whether to keep track of ancestors of leaf nodes
* \param is_linear Whether the tree has linear models at each leaf
* \param gpu_device_id The CUDA device ID on which the tree is stored
* \param has_categorical_feature Whether the dataset contains categorical features
*/
explicit CUDATree(int max_leaves, bool track_branch_features, bool is_linear,
const int gpu_device_id, const bool has_categorical_feature);
explicit CUDATree(const Tree* host_tree);
~CUDATree() noexcept;
int Split(const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info);
int SplitCategorical(
const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
uint32_t* cuda_bitset,
size_t cuda_bitset_len,
uint32_t* cuda_bitset_inner,
size_t cuda_bitset_inner_len);
const int* cuda_leaf_parent() const { return cuda_leaf_parent_; }
const int* cuda_left_child() const { return cuda_left_child_; }
const int* cuda_right_child() const { return cuda_right_child_; }
const int* cuda_split_feature_inner() const { return cuda_split_feature_inner_; }
const int* cuda_split_feature() const { return cuda_split_feature_; }
const uint32_t* cuda_threshold_in_bin() const { return cuda_threshold_in_bin_; }
const double* cuda_threshold() const { return cuda_threshold_; }
const int8_t* cuda_decision_type() const { return cuda_decision_type_; }
const double* cuda_leaf_value() const { return cuda_leaf_value_; }
double* cuda_leaf_value_ref() { return cuda_leaf_value_; }
inline void Shrinkage(double rate) override;
inline void AddBias(double val) override;
void ToHost();
void SyncLeafOutputFromHostToCUDA();
void SyncLeafOutputFromCUDAToHost();
private:
void InitCUDAMemory();
void InitCUDA();
void LaunchSplitKernel(const int leaf_index,
const int real_feature_index,
const double real_threshold,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info);
void LaunchSplitCategoricalKernel(
const int leaf_index,
const int real_feature_index,
const MissingType missing_type,
const CUDASplitInfo* cuda_split_info,
size_t cuda_bitset_len,
size_t cuda_bitset_inner_len);
void LaunchShrinkageKernel(const double rate);
void LaunchAddBiasKernel(const double val);
int* cuda_left_child_;
int* cuda_right_child_;
int* cuda_split_feature_inner_;
int* cuda_split_feature_;
int* cuda_leaf_depth_;
int* cuda_leaf_parent_;
uint32_t* cuda_threshold_in_bin_;
double* cuda_threshold_;
double* cuda_internal_weight_;
double* cuda_internal_value_;
int8_t* cuda_decision_type_;
double* cuda_leaf_value_;
data_size_t* cuda_leaf_count_;
double* cuda_leaf_weight_;
data_size_t* cuda_internal_count_;
float* cuda_split_gain_;
CUDAVector<uint32_t> cuda_bitset_;
CUDAVector<uint32_t> cuda_bitset_inner_;
CUDAVector<int> cuda_cat_boundaries_;
CUDAVector<int> cuda_cat_boundaries_inner_;
cudaStream_t cuda_stream_;
const int num_threads_per_block_add_prediction_to_score_;
};
} // namespace LightGBM
#endif // LIGHTGBM_CUDA_CUDA_TREE_HPP_
#endif // USE_CUDA_EXP
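// Hedged sketch of the split flow a tree learner might drive with this
// class, not from the sources above: apply the device-resident best split to
// the tree; leaf outputs can later be synced back to the host via
// SyncLeafOutputFromCUDAToHost() or ToHost(). Variable names are hypothetical.
inline int ApplyBestSplit(LightGBM::CUDATree* tree,
                          int best_leaf,
                          int real_feature_index,
                          double real_threshold,
                          LightGBM::MissingType missing_type,
                          const LightGBM::CUDASplitInfo* device_best_split) {
  // Split presumably returns the index of the new leaf, mirroring Tree::Split.
  return tree->Split(best_leaf, real_feature_index, real_threshold,
                     missing_type, device_best_split);
}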
/*!
 * Copyright (c) 2020-2021 IBM Corporation, Microsoft Corporation. All rights reserved.
 * Licensed under the MIT License. See LICENSE file in the project root for license information.
 */
#ifndef LIGHTGBM_CUDA_CUDA_UTILS_H_
#define LIGHTGBM_CUDA_CUDA_UTILS_H_
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#endif // USE_CUDA || USE_CUDA_EXP
#ifdef USE_CUDA_EXP
#include <LightGBM/utils/log.h>
#include <vector>
#endif // USE_CUDA_EXP
namespace LightGBM {
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
#define CUDASUCCESS_OR_FATAL(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
if (code != cudaSuccess) {
...
if (abort) exit(code);
}
}
#endif // USE_CUDA || USE_CUDA_EXP
#ifdef USE_CUDA_EXP
#define CUDASUCCESS_OR_FATAL_OUTER(ans) { gpuAssert((ans), file, line); }
void SetCUDADevice(int gpu_device_id, const char* file, int line);
template <typename T>
void AllocateCUDAMemory(T** out_ptr, size_t size, const char* file, const int line) {
void* tmp_ptr = nullptr;
CUDASUCCESS_OR_FATAL_OUTER(cudaMalloc(&tmp_ptr, size * sizeof(T)));
*out_ptr = reinterpret_cast<T*>(tmp_ptr);
}
template <typename T>
void CopyFromHostToCUDADevice(T* dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpy(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyHostToDevice));
}
template <typename T>
void InitCUDAMemoryFromHostMemory(T** dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
AllocateCUDAMemory<T>(dst_ptr, size, file, line);
CopyFromHostToCUDADevice<T>(*dst_ptr, src_ptr, size, file, line);
}
template <typename T>
void CopyFromCUDADeviceToHost(T* dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpy(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyDeviceToHost));
}
template <typename T>
void CopyFromCUDADeviceToHostAsync(T* dst_ptr, const T* src_ptr, size_t size, cudaStream_t stream, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpyAsync(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyDeviceToHost, stream));
}
template <typename T>
void CopyFromCUDADeviceToCUDADevice(T* dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpy(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyDeviceToDevice));
}
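// note: unlike CopyFromCUDADeviceToHostAsync, the overload below takes no
// stream argument, so the copy is issued on the default stream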
template <typename T>
void CopyFromCUDADeviceToCUDADeviceAsync(T* dst_ptr, const T* src_ptr, size_t size, const char* file, const int line) {
void* void_dst_ptr = reinterpret_cast<void*>(dst_ptr);
const void* void_src_ptr = reinterpret_cast<const void*>(src_ptr);
size_t size_in_bytes = size * sizeof(T);
CUDASUCCESS_OR_FATAL_OUTER(cudaMemcpyAsync(void_dst_ptr, void_src_ptr, size_in_bytes, cudaMemcpyDeviceToDevice));
}
void SynchronizeCUDADevice(const char* file, const int line);
template <typename T>
void SetCUDAMemory(T* dst_ptr, int value, size_t size, const char* file, const int line) {
CUDASUCCESS_OR_FATAL_OUTER(cudaMemset(reinterpret_cast<void*>(dst_ptr), value, size * sizeof(T)));
SynchronizeCUDADevice(file, line);
}
template <typename T>
void DeallocateCUDAMemory(T** ptr, const char* file, const int line) {
if (*ptr != nullptr) {
CUDASUCCESS_OR_FATAL_OUTER(cudaFree(reinterpret_cast<void*>(*ptr)));
*ptr = nullptr;
}
}
void PrintLastCUDAError();
template <typename T>
class CUDAVector {
public:
CUDAVector() {
size_ = 0;
data_ = nullptr;
}
explicit CUDAVector(size_t size) {
size_ = size;
AllocateCUDAMemory<T>(&data_, size_, __FILE__, __LINE__);
}
void Resize(size_t size) {
if (size == 0) {
Clear();
return;
}
T* new_data = nullptr;
AllocateCUDAMemory<T>(&new_data, size, __FILE__, __LINE__);
if (size_ > 0 && data_ != nullptr) {
// copy only the elements present in both the old and new buffers
const size_t copy_size = size < size_ ? size : size_;
CopyFromCUDADeviceToCUDADevice<T>(new_data, data_, copy_size, __FILE__, __LINE__);
}
DeallocateCUDAMemory<T>(&data_, __FILE__, __LINE__);
data_ = new_data;
size_ = size;
}
void Clear() {
if (size_ > 0 && data_ != nullptr) {
DeallocateCUDAMemory<T>(&data_, __FILE__, __LINE__);
}
size_ = 0;
}
void PushBack(const T* values, size_t len) {
T* new_data = nullptr;
AllocateCUDAMemory<T>(&new_data, size_ + len, __FILE__, __LINE__);
if (size_ > 0 && data_ != nullptr) {
CopyFromCUDADeviceToCUDADevice<T>(new_data, data_, size_, __FILE__, __LINE__);
}
CopyFromCUDADeviceToCUDADevice<T>(new_data + size_, values, len, __FILE__, __LINE__);
DeallocateCUDAMemory<T>(&data_, __FILE__, __LINE__);
size_ += len;
data_ = new_data;
}
size_t Size() {
return size_;
}
~CUDAVector() {
DeallocateCUDAMemory<T>(&data_, __FILE__, __LINE__);
}
std::vector<T> ToHost() {
std::vector<T> host_vector(size_);
if (size_ > 0 && data_ != nullptr) {
CopyFromCUDADeviceToHost(host_vector.data(), data_, size_, __FILE__, __LINE__);
}
return host_vector;
}
T* RawData() {
return data_;
}
private:
T* data_;
size_t size_;
};
#endif // USE_CUDA_EXP
}  // namespace LightGBM
#endif  // LIGHTGBM_CUDA_CUDA_UTILS_H_
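// Minimal usage sketch for the USE_CUDA_EXP helpers above, not from the
// sources: allocate-and-copy in one call, read results back, then free.
// Passing __FILE__/__LINE__ lets CUDASUCCESS_OR_FATAL_OUTER report the
// caller's location when a CUDA call fails.
inline void CopyRoundTripExample() {
  std::vector<float> host = {1.0f, 2.0f, 3.0f};
  float* device_ptr = nullptr;
  LightGBM::InitCUDAMemoryFromHostMemory<float>(&device_ptr, host.data(), host.size(), __FILE__, __LINE__);
  // ... launch kernels that read or write device_ptr ...
  LightGBM::CopyFromCUDADeviceToHost<float>(host.data(), device_ptr, host.size(), __FILE__, __LINE__);
  LightGBM::DeallocateCUDAMemory<float>(&device_ptr, __FILE__, __LINE__);
}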
/*!
 * Copyright (c) 2020 IBM Corporation, Microsoft Corporation. All rights reserved.
 * Licensed under the MIT License. See LICENSE file in the project root for license information.
 */
#ifndef LIGHTGBM_CUDA_VECTOR_CUDAHOST_H_
...
#include <LightGBM/utils/common.h>
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
#include <cuda.h>
#include <cuda_runtime.h>
#endif
...
T* allocate(std::size_t n) {
T* ptr;
if (n == 0) return NULL;
n = SIZE_ALIGNED(n);
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
if (LGBM_config_::current_device == lgbm_device_cuda) {
cudaError_t ret = cudaHostAlloc(&ptr, n*sizeof(T), cudaHostAllocPortable);
if (ret != cudaSuccess) {
...
void deallocate(T* p, std::size_t n) {
(void)n;  // UNUSED
if (p == NULL) return;
#if defined(USE_CUDA) || defined(USE_CUDA_EXP)
if (LGBM_config_::current_device == lgbm_device_cuda) {
cudaPointerAttributes attributes;
cudaPointerGetAttributes(&attributes, p);
...