    [CUDA] New CUDA version Part 1 (#4630) · 6b56a90c
    shiyu1994 authored
    
    
    * new cuda framework
    
    * add histogram construction kernel
    
    * before removing multi-gpu
    
    * new cuda framework
    
    * tree learner cuda kernels
    
    * single tree framework ready
    
    * single tree training framework
    
    * remove comments
    
    * boosting with cuda
    
    * optimize for best split find
    
    * data split
    
    * move boosting into cuda
    
    * parallel synchronize best split point
    
    * merge split data kernels
    
    * before code refactor
    
    * use tasks instead of features as units for split finding
    
    * refactor cuda best split finder
    
    * fix configuration error with small leaves in data split
    
    * skip histogram construction of too small leaf
    
    * skip split finding of invalid leaves
    
    stop when no leaf to split
    
    * support row wise with CUDA
    
    * copy data for split by column
    
    * copy data from host to device by column for data partition
    
    * add synchronize best splits for one leaf from multiple blocks
    
    * partition dense row data
    
    * fix sync best split from task blocks
    
    * add support for sparse row wise for CUDA
    
    * remove useless code
    
    * add l2 regression objective
    
    * sparse multi value bin enabled for CUDA
    
    * fix cuda ranking objective
    
    * support for number of items <= 2048 per query
    
    * speedup histogram construction by interleaving global memory access
    
    * split optimization
    
    * add cuda tree predictor
    
    * remove comma
    
    * refactor objective and score updater
    
    * before use struct
    
    * use structure for split information
    
    * use structure for leaf splits
    
    * return CUDASplitInfo directly after finding best split
    
    * split with CUDATree directly
    
    * use cuda row data in cuda histogram constructor
    
    * clean src/treelearner/cuda
    
    * gather shared cuda device functions
    
    * put shared CUDA functions into header file
    
    * change smaller leaf from <= back to < for consistent result with CPU
    
    * add tree predictor
    
    * remove useless cuda_tree_predictor
    
    * predict on CUDA with pipeline
    
    * add global sort algorithms
    
    * add global argsort for queries with many items in ranking tasks
    
    * remove limitation of maximum number of items per query in ranking
    
    * add cuda metrics
    
    * fix CUDA AUC
    
    * remove debug code
    
    * add regression metrics
    
    * remove useless file
    
    * don't use mask in shuffle reduce
    
    * add more regression objectives
    
    * fix cuda mape loss
    
    add cuda xentropy loss
    
    * use template for different versions of BitonicArgSortDevice
    
    * add multiclass metrics
    
    * add ndcg metric
    
    * fix cross entropy objectives and metrics
    
    * fix cross entropy and ndcg metrics
    
    * add support for customized objective in CUDA
    
    * complete multiclass ova for CUDA
    
    * separate cuda tree learner
    
    * use shuffle based prefix sum
    
    * clean up cuda_algorithms.hpp
    
    * add copy subset on CUDA
    
    * add bagging for CUDA
    
    * clean up code
    
    * copy gradients from host to device
    
    * support bagging without using subset
    
    * add support of bagging with subset for CUDAColumnData
    
    * add support of bagging with subset for dense CUDARowData
    
    * refactor copy sparse subrow
    
    * use copy subset for column subset
    
    * add reset train data and reset config for CUDA tree learner
    
    add destructors for cuda tree learner
    
    * add USE_CUDA ifdef to cuda tree learner files
    
    * check that dataset doesn't contain CUDA tree learner
    
    * remove printf debug information
    
    * use full new cuda tree learner only when using single GPU
    
    * disable all CUDA code when using CPU version
    
    * recover main.cpp
    
    * add cpp files for multi value bins
    
    * update LightGBM.vcxproj
    
    * update LightGBM.vcxproj
    
    fix lint errors
    
    * fix lint errors
    
    * fix lint errors
    
    * update Makevars
    
    fix lint errors
    
    * fix the case with 0 feature and 0 bin
    
    fix split finding for invalid leaves
    
    create cuda column data when loaded from bin file
    
    * fix lint errors
    
    hide GetRowWiseData when cuda is not used
    
    * recover default device type to cpu
    
    * fix na_as_missing case
    
    fix cuda feature meta information
    
    * fix UpdateDataIndexToLeafIndexKernel
    
    * create CUDA trees when needed in CUDADataPartition::UpdateTrainScore
    
    * add refit by tree for cuda tree learner
    
    * fix test_refit in test_engine.py
    
    * create set of large bin partitions in CUDARowData
    
    * add histogram construction for columns with a large number of bins
    
    * add find best split for categorical features on CUDA
    
    * add bitvectors for categorical split
    
    * cuda data partition split for categorical features
    
    * fix split tree with categorical feature
    
    * fix categorical feature splits
    
    * refactor cuda_data_partition.cu with multi-level templates
    
    * refactor CUDABestSplitFinder by grouping task information into struct
    
    * pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder
    
    * fix misuse of reference
    
    * remove useless changes
    
    * add support for path smoothing
    
    * virtual destructor for LightGBM::Tree
    
    * fix overlapped cat threshold in best split infos
    
    * reset histogram pointers in data partition and split finder in ResetConfig
    
    * comment useless parameter
    
    * fix reverse case when na is missing and default bin is zero
    
    * fix mfb_is_na and mfb_is_zero and is_single_feature_column
    
    * remove debug log
    
    * fix cat_l2 when one-hot
    
    fix gradient copy when data subset is used
    
    * switch shared histogram size according to CUDA version
    
    * gpu_use_dp=true when cuda test
    
    * revert modification in config.h
    
    * fix setting of gpu_use_dp=true in .ci/test.sh
    
    * fix linter errors
    
    * fix linter error
    
    remove useless change
    
    * recover main.cpp
    
    * separate cuda_exp and cuda
    
    * fix ci bash scripts
    
    add description for cuda_exp
    
    * add USE_CUDA_EXP flag
    
    * switch off USE_CUDA_EXP
    
    * revert changes in python-packages
    
    * more careful separation for USE_CUDA_EXP
    
    * fix CUDARowData::DivideCUDAFeatureGroups
    
    fix set fields for cuda metadata
    
    * revert config.h
    
    * fix test settings for cuda experimental version
    
    * skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version
    
    * fix lint issue by adding a blank line
    
    * fix lint errors by resorting imports
    
    * fix lint errors by resorting imports
    
    * fix lint errors by resorting imports
    
    * merge cuda.yml and cuda_exp.yml
    
    * update python version in cuda.yml
    
    * remove cuda_exp.yml
    
    * remove unrelated changes
    
    * fix compilation warnings
    
    fix cuda exp ci task name
    
    * recover task
    
    * use multi-level template in histogram construction
    
    check split only in debug mode
    
    * ignore NVCC related lines in parameter_generator.py
    
    * update job name for CUDA tests
    
    * apply review suggestions
    
    * Update .github/workflows/cuda.yml
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
    
    * Update .github/workflows/cuda.yml
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
    
    * update header
    
    * remove useless TODOs
    
    * remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062
    
    * #include <LightGBM/utils/log.h> for USE_CUDA_EXP only
    
    * fix include order
    
    * fix include order
    
    * remove extra space
    
    * address review comments
    
    * add warning when cuda_exp is used together with deterministic
    
    * add comment about gpu_use_dp in .ci/test.sh
    
    * revert changing order of included headers
    Co-authored-by: Yu Shi <shiyu1994@qq.com>
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
cuda_algorithms.hpp 16 KB