Unverified commit fffd066c authored by Yifei Liu, committed by GitHub

Decouple Boosting Types (fixes #3128) (#4827)



* add parameter data_sample_strategy

* abstract GOSS as a sample strategy (GOSS1), together with original GOSS (normal Bagging has not been abstracted, so do NOT use it now)

* abstract Bagging as a subclass (BAGGING), but original Bagging members in GBDT are still kept

* fix some variables

* remove GOSS (as boost) and Bagging logic in GBDT

* rename GOSS1 to GOSS (as sample strategy)

* add warning about using GOSS as boosting_type

* fix a small `;` bug

* remove CHECK when "gradients != nullptr"

* rename DataSampleStrategy to avoid confusion

* remove and add some comments, following convention

* fix bug about GBDT::ResetConfig (ObjectiveFunction inconsistency bet…

* add std::ignore to avoid compiler warnings (and potential failures)

* update Makevars and vcxproj

* handle constant hessian

move resize of gradient vectors out of sample strategy

* mark override for IsHessianChange

* fix lint errors

* rerun parameter_generator.py

* update config_auto.cpp

* delete redundant blank line

* update num_data_ when train_data_ is updated

set gradients and hessians when GOSS

* check bagging_freq is not zero

* reset config_ value

merge ResetBaggingConfig and ResetGOSS

* remove useless check

* add tests in test_engine.py

* remove whitespace in blank line

* remove arguments verbose_eval and evals_result

* Update tests/python_package_test/test_engine.py

reduce num_boost_round
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update tests/python_package_test/test_engine.py

reduce num_boost_round
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update tests/python_package_test/test_engine.py

reduce num_boost_round
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update tests/python_package_test/test_engine.py

reduce num_boost_round
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update tests/python_package_test/test_engine.py

reduce num_boost_round
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update tests/python_package_test/test_engine.py

reduce num_boost_round
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update src/boosting/sample_strategy.cpp

modify warning about setting goss as `boosting_type`
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update tests/python_package_test/test_engine.py

replace load_boston() with make_regression()

remove value checks of mean_squared_error in test_sample_strategy_with_boosting()

* Update tests/python_package_test/test_engine.py

add value checks of mean_squared_error in test_sample_strategy_with_boosting()

* modify warning about using goss as boosting type

* Update tests/python_package_test/test_engine.py

add random_state=42 for make_regression()

reduce the threshold of mean_squared_error

* Update src/boosting/sample_strategy.cpp
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* remove goss from boosting types in documentation

* Update src/boosting/bagging.hpp
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update src/boosting/bagging.hpp
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update src/boosting/goss.hpp
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update src/boosting/goss.hpp
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* rename GOSS to GOSSStrategy

* update doc

* address comments

* fix table in doc

* Update include/LightGBM/config.h
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update documentation

* update test case

* revert useless change in test_engine.py

* add tests for evaluation results in test_sample_strategy_with_boosting

* include <string>

* change to assert_allclose in test_goss_boosting_and_strategy_equivalent

* more tolerance in result checking, due to minor difference in results of gpu versions

* change == to np.testing.assert_allclose

* fix test case

* set gpu_use_dp to true

* change --report to --report-level for rstcheck

* use gpu_use_dp=true in test_goss_boosting_and_strategy_equivalent

* revert unexpected changes of non-ascii characters

* revert unexpected changes of non-ascii characters

* remove useless changes

* allocate gradients_pointer_ and hessians_pointer_ when necessary

* add spaces

* remove redundant virtual

* include <LightGBM/utils/log.h> for USE_CUDA

* check for  in test_goss_boosting_and_strategy_equivalent

* check for identity in test_sample_strategy_with_boosting

* remove cuda option in test_sample_strategy_with_boosting

* Update tests/python_package_test/test_engine.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update tests/python_package_test/test_engine.py
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* ResetGradientBuffers after ResetSampleConfig

* ResetGradientBuffers after ResetSampleConfig

* ResetGradientBuffers after bagging

* remove useless code

* check objective_function_ instead of gradients

* enable rf with goss

simplify params in test cases

* remove useless changes

* allow rf with feature subsampling alone

* change position of ResetGradientBuffers

* check for dask

* add parameter types for data_sample_strategy
Co-authored-by: Guangda Liu <v-guangdaliu@microsoft.com>
Co-authored-by: Yu Shi <shiyu_k1994@qq.com>
Co-authored-by: GuangdaLiu <90019144+GuangdaLiu@users.noreply.github.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
parent a2ae6b95
@@ -26,6 +26,7 @@ OBJECTS = \
     boosting/gbdt_model_text.o \
     boosting/gbdt_prediction.o \
     boosting/prediction_early_stop.o \
+    boosting/sample_strategy.o \
     io/bin.o \
     io/config.o \
     io/config_auto.o \
......
@@ -27,6 +27,7 @@ OBJECTS = \
     boosting/gbdt_model_text.o \
     boosting/gbdt_prediction.o \
     boosting/prediction_early_stop.o \
+    boosting/sample_strategy.o \
     io/bin.o \
     io/config.o \
     io/config_auto.o \
......
@@ -19,7 +19,7 @@ Important Classes
 +-------------------------+----------------------------------------------------------------------------------------+
 | ``Bin``                 | Data structure used for storing feature discrete values (converted from float values) |
 +-------------------------+----------------------------------------------------------------------------------------+
-| ``Boosting``            | Boosting interface (GBDT, DART, GOSS, etc.)                                            |
+| ``Boosting``            | Boosting interface (GBDT, DART, etc.)                                                  |
 +-------------------------+----------------------------------------------------------------------------------------+
 | ``Config``              | Stores parameters and configurations                                                   |
 +-------------------------+----------------------------------------------------------------------------------------+
......
@@ -127,7 +127,7 @@ Core Parameters
 - label should be ``int`` type, and larger number represents the higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect)

-- ``boosting`` :raw-html:`<a id="boosting" title="Permalink to this parameter" href="#boosting">&#x1F517;&#xFE0E;</a>`, default = ``gbdt``, type = enum, options: ``gbdt``, ``rf``, ``dart``, ``goss``, aliases: ``boosting_type``, ``boost``
+- ``boosting`` :raw-html:`<a id="boosting" title="Permalink to this parameter" href="#boosting">&#x1F517;&#xFE0E;</a>`, default = ``gbdt``, type = enum, options: ``gbdt``, ``rf``, ``dart``, aliases: ``boosting_type``, ``boost``

   - ``gbdt``, traditional Gradient Boosting Decision Tree, aliases: ``gbrt``
@@ -135,10 +135,16 @@ Core Parameters
   - ``dart``, `Dropouts meet Multiple Additive Regression Trees <https://arxiv.org/abs/1505.01866>`__

-  - ``goss``, Gradient-based One-Side Sampling

   - **Note**: internally, LightGBM uses ``gbdt`` mode for the first ``1 / learning_rate`` iterations

+- ``data_sample_strategy`` :raw-html:`<a id="data_sample_strategy" title="Permalink to this parameter" href="#data_sample_strategy">&#x1F517;&#xFE0E;</a>`, default = ``bagging``, type = enum, options: ``bagging``, ``goss``
+
+  - ``bagging``, Randomly Bagging Sampling
+
+    - **Note**: ``bagging`` is only effective when ``bagging_freq > 0`` and ``bagging_fraction < 1.0``
+
+  - ``goss``, Gradient-based One-Side Sampling
+
 - ``data`` :raw-html:`<a id="data" title="Permalink to this parameter" href="#data">&#x1F517;&#xFE0E;</a>`, default = ``""``, type = string, aliases: ``train``, ``train_data``, ``train_data_file``, ``data_filename``

   - path of training data, LightGBM will train from this data
@@ -268,7 +274,7 @@ Learning Control Parameters
   - ``num_threads`` is relatively small, e.g. ``<= 16``

-  - you want to use small ``bagging_fraction`` or ``goss`` boosting to speed up
+  - you want to use small ``bagging_fraction`` or ``goss`` sample strategy to speed up

 - **Note**: setting this to ``true`` will double the memory cost for Dataset object. If you have not enough memory, you can try setting ``force_col_wise=true``
......
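For illustration only, a minimal sketch of using the decoupled parameters documented above from the Python package; the parameter names come straight from these docs, while the synthetic data and all other values are assumptions:

```python
import numpy as np
import lightgbm as lgb

# toy regression data, purely illustrative
rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=1_000)
train_set = lgb.Dataset(X, y)

# GOSS is now a data sampling strategy, combinable with any boosting type
params = {
    'objective': 'regression',
    'boosting': 'gbdt',                # 'goss' is no longer a boosting option
    'data_sample_strategy': 'goss',    # sampling is configured separately
    'verbose': -1,
}
booster = lgb.train(params, train_set, num_boost_round=10)
```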
@@ -153,14 +153,21 @@ struct Config {
   // [doc-only]
   // type = enum
   // alias = boosting_type, boost
-  // options = gbdt, rf, dart, goss
+  // options = gbdt, rf, dart
   // desc = ``gbdt``, traditional Gradient Boosting Decision Tree, aliases: ``gbrt``
   // desc = ``rf``, Random Forest, aliases: ``random_forest``
   // desc = ``dart``, `Dropouts meet Multiple Additive Regression Trees <https://arxiv.org/abs/1505.01866>`__
-  // desc = ``goss``, Gradient-based One-Side Sampling
   // descl2 = **Note**: internally, LightGBM uses ``gbdt`` mode for the first ``1 / learning_rate`` iterations
   std::string boosting = "gbdt";

+  // [doc-only]
+  // type = enum
+  // options = bagging, goss
+  // desc = ``bagging``, Randomly Bagging Sampling
+  // descl2 = **Note**: ``bagging`` is only effective when ``bagging_freq > 0`` and ``bagging_fraction < 1.0``
+  // desc = ``goss``, Gradient-based One-Side Sampling
+  std::string data_sample_strategy = "bagging";
+
   // alias = train, train_data, train_data_file, data_filename
   // desc = path of training data, LightGBM will train from this data
   // desc = **Note**: can be used only in CLI version
@@ -263,7 +270,7 @@ struct Config {
   // desc = enabling this is recommended when:
   // descl2 = the number of data points is large, and the total number of bins is relatively small
   // descl2 = ``num_threads`` is relatively small, e.g. ``<= 16``
-  // descl2 = you want to use small ``bagging_fraction`` or ``goss`` boosting to speed up
+  // descl2 = you want to use small ``bagging_fraction`` or ``goss`` sample strategy to speed up
   // desc = **Note**: setting this to ``true`` will double the memory cost for Dataset object. If you have not enough memory, you can try setting ``force_col_wise=true``
   // desc = **Note**: when both ``force_col_wise`` and ``force_row_wise`` are ``false``, LightGBM will firstly try them both, and then use the faster one. To remove the overhead of testing set the faster one to ``true`` manually
   // desc = **Note**: this parameter cannot be used at the same time with ``force_col_wise``, choose only one of them
......
@@ -10,10 +10,10 @@
 #include <cuda.h>
 #include <cuda_runtime.h>
 #include <stdio.h>
+#include <LightGBM/utils/log.h>
 #endif  // USE_CUDA || USE_CUDA_EXP

 #ifdef USE_CUDA_EXP
-#include <LightGBM/utils/log.h>
 #include <vector>
 #endif  // USE_CUDA_EXP
@@ -124,6 +124,7 @@ class CUDAVector {
   }
   if (size == 0) {
     Clear();
+    return;
   }
   T* new_data = nullptr;
   AllocateCUDAMemory<T>(&new_data, size, __FILE__, __LINE__);
......
/*!
 * Copyright (c) 2021 Microsoft Corporation. All rights reserved.
 * Licensed under the MIT License. See LICENSE file in the project root for license information.
 */
#ifndef LIGHTGBM_SAMPLE_STRATEGY_H_
#define LIGHTGBM_SAMPLE_STRATEGY_H_

#include <LightGBM/cuda/cuda_utils.h>
#include <LightGBM/utils/random.h>
#include <LightGBM/utils/common.h>
#include <LightGBM/utils/threading.h>
#include <LightGBM/config.h>
#include <LightGBM/dataset.h>
#include <LightGBM/tree_learner.h>
#include <LightGBM/objective_function.h>

#include <memory>
#include <vector>

namespace LightGBM {

class SampleStrategy {
 public:
  SampleStrategy() : balanced_bagging_(false), bagging_runner_(0, bagging_rand_block_), need_resize_gradients_(false) {}

  virtual ~SampleStrategy() {}

  static SampleStrategy* CreateSampleStrategy(const Config* config, const Dataset* train_data, const ObjectiveFunction* objective_function, int num_tree_per_iteration);

  virtual void Bagging(int iter, TreeLearner* tree_learner, score_t* gradients, score_t* hessians) = 0;

  virtual void ResetSampleConfig(const Config* config, bool is_change_dataset) = 0;

  bool is_use_subset() const { return is_use_subset_; }

  data_size_t bag_data_cnt() const { return bag_data_cnt_; }

  std::vector<data_size_t, Common::AlignmentAllocator<data_size_t, kAlignedSize>>& bag_data_indices() { return bag_data_indices_; }

#ifdef USE_CUDA_EXP
  CUDAVector<data_size_t>& cuda_bag_data_indices() { return cuda_bag_data_indices_; }
#endif  // USE_CUDA_EXP

  void UpdateObjectiveFunction(const ObjectiveFunction* objective_function) {
    objective_function_ = objective_function;
  }

  void UpdateTrainingData(const Dataset* train_data) {
    train_data_ = train_data;
    num_data_ = train_data->num_data();
  }

  virtual bool IsHessianChange() const = 0;

  bool NeedResizeGradients() const { return need_resize_gradients_; }

 protected:
  const Config* config_;
  const Dataset* train_data_;
  const ObjectiveFunction* objective_function_;
  std::vector<data_size_t, Common::AlignmentAllocator<data_size_t, kAlignedSize>> bag_data_indices_;
  data_size_t bag_data_cnt_;
  data_size_t num_data_;
  int num_tree_per_iteration_;
  std::unique_ptr<Dataset> tmp_subset_;
  bool is_use_subset_;
  bool balanced_bagging_;
  const int bagging_rand_block_ = 1024;
  std::vector<Random> bagging_rands_;
  ParallelPartitionRunner<data_size_t, false> bagging_runner_;
  /*! \brief whether need to resize the gradient vectors */
  bool need_resize_gradients_;

#ifdef USE_CUDA_EXP
  /*! \brief Buffer for bag_data_indices_ on GPU, used only with cuda_exp */
  CUDAVector<data_size_t> cuda_bag_data_indices_;
#endif  // USE_CUDA_EXP
};

}  // namespace LightGBM

#endif  // LIGHTGBM_SAMPLE_STRATEGY_H_
@@ -1042,6 +1042,8 @@ class _DaskLGBMModel:
         eval_at: Optional[Iterable[int]] = None,
         **kwargs: Any
     ) -> "_DaskLGBMModel":
+        if not DASK_INSTALLED:
+            raise LightGBMError('dask is required for lightgbm.dask')
         if not all((DASK_INSTALLED, PANDAS_INSTALLED, SKLEARN_INSTALLED)):
             raise LightGBMError('dask, pandas and scikit-learn are required for lightgbm.dask')
......
@@ -382,7 +382,6 @@ class LGBMModel(_LGBMModelBase):
     boosting_type : str, optional (default='gbdt')
         'gbdt', traditional Gradient Boosting Decision Tree.
         'dart', Dropouts meet Multiple Additive Regression Trees.
-        'goss', Gradient-based One-Side Sampling.
         'rf', Random Forest.
     num_leaves : int, optional (default=31)
         Maximum tree leaves for base learners.
......
/*!
 * Copyright (c) 2021 Microsoft Corporation. All rights reserved.
 * Licensed under the MIT License. See LICENSE file in the project root for license information.
 */
#ifndef LIGHTGBM_BOOSTING_BAGGING_HPP_
#define LIGHTGBM_BOOSTING_BAGGING_HPP_

#include <string>

namespace LightGBM {

class BaggingSampleStrategy : public SampleStrategy {
 public:
  BaggingSampleStrategy(const Config* config, const Dataset* train_data, const ObjectiveFunction* objective_function, int num_tree_per_iteration)
      : need_re_bagging_(false) {
    config_ = config;
    train_data_ = train_data;
    num_data_ = train_data->num_data();
    objective_function_ = objective_function;
    num_tree_per_iteration_ = num_tree_per_iteration;
  }

  ~BaggingSampleStrategy() {}

  void Bagging(int iter, TreeLearner* tree_learner, score_t* /*gradients*/, score_t* /*hessians*/) override {
    Common::FunctionTimer fun_timer("GBDT::Bagging", global_timer);
    // if need bagging
    if ((bag_data_cnt_ < num_data_ && iter % config_->bagging_freq == 0) ||
        need_re_bagging_) {
      need_re_bagging_ = false;
      auto left_cnt = bagging_runner_.Run<true>(
          num_data_,
          [=](int, data_size_t cur_start, data_size_t cur_cnt, data_size_t* left,
              data_size_t*) {
            data_size_t cur_left_count = 0;
            if (balanced_bagging_) {
              cur_left_count =
                  BalancedBaggingHelper(cur_start, cur_cnt, left);
            } else {
              cur_left_count = BaggingHelper(cur_start, cur_cnt, left);
            }
            return cur_left_count;
          },
          bag_data_indices_.data());
      bag_data_cnt_ = left_cnt;
      Log::Debug("Re-bagging, using %d data to train", bag_data_cnt_);
      // set bagging data to tree learner
      if (!is_use_subset_) {
#ifdef USE_CUDA_EXP
        if (config_->device_type == std::string("cuda_exp")) {
          CopyFromHostToCUDADevice<data_size_t>(cuda_bag_data_indices_.RawData(), bag_data_indices_.data(), static_cast<size_t>(num_data_), __FILE__, __LINE__);
          tree_learner->SetBaggingData(nullptr, cuda_bag_data_indices_.RawData(), bag_data_cnt_);
        } else {
#endif  // USE_CUDA_EXP
          tree_learner->SetBaggingData(nullptr, bag_data_indices_.data(), bag_data_cnt_);
#ifdef USE_CUDA_EXP
        }
#endif  // USE_CUDA_EXP
      } else {
        // get subset
        tmp_subset_->ReSize(bag_data_cnt_);
        tmp_subset_->CopySubrow(train_data_, bag_data_indices_.data(),
                                bag_data_cnt_, false);
#ifdef USE_CUDA_EXP
        if (config_->device_type == std::string("cuda_exp")) {
          CopyFromHostToCUDADevice<data_size_t>(cuda_bag_data_indices_.RawData(), bag_data_indices_.data(), static_cast<size_t>(num_data_), __FILE__, __LINE__);
          tree_learner->SetBaggingData(tmp_subset_.get(), cuda_bag_data_indices_.RawData(),
                                       bag_data_cnt_);
        } else {
#endif  // USE_CUDA_EXP
          tree_learner->SetBaggingData(tmp_subset_.get(), bag_data_indices_.data(),
                                       bag_data_cnt_);
#ifdef USE_CUDA_EXP
        }
#endif  // USE_CUDA_EXP
      }
    }
  }

  void ResetSampleConfig(const Config* config, bool is_change_dataset) override {
    need_resize_gradients_ = false;
    // if need bagging, create buffer
    data_size_t num_pos_data = 0;
    if (objective_function_ != nullptr) {
      num_pos_data = objective_function_->NumPositiveData();
    }
    bool balance_bagging_cond = (config->pos_bagging_fraction < 1.0 || config->neg_bagging_fraction < 1.0) && (num_pos_data > 0);
    if ((config->bagging_fraction < 1.0 || balance_bagging_cond) && config->bagging_freq > 0) {
      need_re_bagging_ = false;
      if (!is_change_dataset &&
          config_ != nullptr && config_->bagging_fraction == config->bagging_fraction && config_->bagging_freq == config->bagging_freq &&
          config_->pos_bagging_fraction == config->pos_bagging_fraction && config_->neg_bagging_fraction == config->neg_bagging_fraction) {
        config_ = config;
        return;
      }
      config_ = config;
      if (balance_bagging_cond) {
        balanced_bagging_ = true;
        bag_data_cnt_ = static_cast<data_size_t>(num_pos_data * config_->pos_bagging_fraction)
            + static_cast<data_size_t>((num_data_ - num_pos_data) * config_->neg_bagging_fraction);
      } else {
        bag_data_cnt_ = static_cast<data_size_t>(config_->bagging_fraction * num_data_);
      }
      bag_data_indices_.resize(num_data_);
#ifdef USE_CUDA_EXP
      if (config_->device_type == std::string("cuda_exp")) {
        cuda_bag_data_indices_.Resize(num_data_);
      }
#endif  // USE_CUDA_EXP
      bagging_runner_.ReSize(num_data_);
      bagging_rands_.clear();
      for (int i = 0;
           i < (num_data_ + bagging_rand_block_ - 1) / bagging_rand_block_; ++i) {
        bagging_rands_.emplace_back(config_->bagging_seed + i);
      }

      double average_bag_rate =
          (static_cast<double>(bag_data_cnt_) / num_data_) / config_->bagging_freq;
      is_use_subset_ = false;
      if (config_->device_type != std::string("cuda_exp")) {
        const int group_threshold_usesubset = 100;
        const double average_bag_rate_threshold = 0.5;
        if (average_bag_rate <= average_bag_rate_threshold
            && (train_data_->num_feature_groups() < group_threshold_usesubset)) {
          if (tmp_subset_ == nullptr || is_change_dataset) {
            tmp_subset_.reset(new Dataset(bag_data_cnt_));
            tmp_subset_->CopyFeatureMapperFrom(train_data_);
          }
          is_use_subset_ = true;
          Log::Debug("Use subset for bagging");
        }
      }

      need_re_bagging_ = true;

      if (is_use_subset_ && bag_data_cnt_ < num_data_) {
        // resize gradient vectors to copy the customized gradients for using subset data
        need_resize_gradients_ = true;
      }
    } else {
      bag_data_cnt_ = num_data_;
      bag_data_indices_.clear();
#ifdef USE_CUDA_EXP
      cuda_bag_data_indices_.Clear();
#endif  // USE_CUDA_EXP
      bagging_runner_.ReSize(0);
      is_use_subset_ = false;
    }
  }

  bool IsHessianChange() const override {
    return false;
  }

 private:
  data_size_t BaggingHelper(data_size_t start, data_size_t cnt, data_size_t* buffer) {
    if (cnt <= 0) {
      return 0;
    }
    data_size_t cur_left_cnt = 0;
    data_size_t cur_right_pos = cnt;
    // random bagging, minimal unit is one record
    for (data_size_t i = 0; i < cnt; ++i) {
      auto cur_idx = start + i;
      if (bagging_rands_[cur_idx / bagging_rand_block_].NextFloat() < config_->bagging_fraction) {
        buffer[cur_left_cnt++] = cur_idx;
      } else {
        buffer[--cur_right_pos] = cur_idx;
      }
    }
    return cur_left_cnt;
  }

  data_size_t BalancedBaggingHelper(data_size_t start, data_size_t cnt, data_size_t* buffer) {
    if (cnt <= 0) {
      return 0;
    }
    auto label_ptr = train_data_->metadata().label();
    data_size_t cur_left_cnt = 0;
    data_size_t cur_right_pos = cnt;
    // random bagging, minimal unit is one record
    for (data_size_t i = 0; i < cnt; ++i) {
      auto cur_idx = start + i;
      bool is_pos = label_ptr[start + i] > 0;
      bool is_in_bag = false;
      if (is_pos) {
        is_in_bag = bagging_rands_[cur_idx / bagging_rand_block_].NextFloat() <
                    config_->pos_bagging_fraction;
      } else {
        is_in_bag = bagging_rands_[cur_idx / bagging_rand_block_].NextFloat() <
                    config_->neg_bagging_fraction;
      }
      if (is_in_bag) {
        buffer[cur_left_cnt++] = cur_idx;
      } else {
        buffer[--cur_right_pos] = cur_idx;
      }
    }
    return cur_left_cnt;
  }

  /*! \brief whether need restart bagging in continued training */
  bool need_re_bagging_;
};

}  // namespace LightGBM

#endif  // LIGHTGBM_BOOSTING_BAGGING_HPP_
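As a user-level counterpart to BaggingSampleStrategy above, here is a hedged Python sketch of turning the bagging strategy on; per the docs in this commit, it only takes effect when bagging_freq > 0 and bagging_fraction < 1.0. The parameter names are real LightGBM parameters, the data and values are made up:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10))
y = (X[:, 0] > 0).astype(float)
train_set = lgb.Dataset(X, y)

params = {
    'objective': 'binary',
    'data_sample_strategy': 'bagging',  # the default strategy
    'bagging_freq': 1,        # re-bag every iteration; 0 disables bagging
    'bagging_fraction': 0.5,  # must be < 1.0 for bagging to take effect
    # pos_bagging_fraction / neg_bagging_fraction would instead trigger
    # the balanced-bagging path (BalancedBaggingHelper) shown above
    'verbose': -1,
}
booster = lgb.train(params, train_set, num_boost_round=10)
```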
@@ -6,7 +6,6 @@
 #include "dart.hpp"
 #include "gbdt.h"
-#include "goss.hpp"
 #include "rf.hpp"

 namespace LightGBM {
@@ -39,7 +38,7 @@ Boosting* Boosting::CreateBoosting(const std::string& type, const char* filename
   } else if (type == std::string("dart")) {
     return new DART();
   } else if (type == std::string("goss")) {
-    return new GOSS();
+    return new GBDT();
   } else if (type == std::string("rf")) {
     return new RF();
   } else {
@@ -53,7 +52,7 @@ Boosting* Boosting::CreateBoosting(const std::string& type, const char* filename
   } else if (type == std::string("dart")) {
     ret.reset(new DART());
   } else if (type == std::string("goss")) {
-    ret.reset(new GOSS());
+    ret.reset(new GBDT());
   } else if (type == std::string("rf")) {
     return new RF();
   } else {
......
@@ -11,6 +11,7 @@
 #include <LightGBM/cuda/vector_cudahost.h>
 #include <LightGBM/utils/json11.h>
 #include <LightGBM/utils/threading.h>
+#include <LightGBM/sample_strategy.h>

 #include <string>
 #include <algorithm>
@@ -453,7 +454,7 @@ class GBDT : public GBDTBase {
  protected:
   virtual bool GetIsConstHessian(const ObjectiveFunction* objective_function) {
-    if (objective_function != nullptr) {
+    if (objective_function != nullptr && !data_sample_strategy_->IsHessianChange()) {
       return objective_function->IsConstantHessian();
     } else {
       return false;
@@ -469,18 +470,6 @@
    */
   void ResetBaggingConfig(const Config* config, bool is_change_dataset);

-  /*!
-   * \brief Implement bagging logic
-   * \param iter Current interation
-   */
-  virtual void Bagging(int iter);
-
-  virtual data_size_t BaggingHelper(data_size_t start, data_size_t cnt,
-                                    data_size_t* buffer);
-
-  data_size_t BalancedBaggingHelper(data_size_t start, data_size_t cnt,
-                                    data_size_t* buffer);
-
   /*!
    * \brief calculate the objective function
    */
@@ -508,6 +497,11 @@
   double BoostFromAverage(int class_id, bool update_scorer);

+  /*!
+   * \brief Reset gradient buffers, must be called after sample strategy is reset
+   */
+  void ResetGradientBuffers();
+
   /*! \brief current iteration */
   int iter_;
   /*! \brief Pointer to training data */
@@ -561,18 +555,16 @@
   /*! \brief Whether boosting is done on GPU, used for cuda_exp */
   bool boosting_on_gpu_;
 #ifdef USE_CUDA_EXP
+  /*! \brief Gradient vector on GPU */
+  CUDAVector<score_t> cuda_gradients_;
+  /*! \brief Hessian vector on GPU */
+  CUDAVector<score_t> cuda_hessians_;
   /*! \brief Buffer for scores when boosting is on GPU but evaluation is not, used only with cuda_exp */
   mutable std::vector<double> host_score_;
   /*! \brief Buffer for scores when boosting is not on GPU but evaluation is, used only with cuda_exp */
   mutable CUDAVector<double> cuda_score_;
-  /*! \brief Buffer for bag_data_indices_ on GPU, used only with cuda_exp */
-  CUDAVector<data_size_t> cuda_bag_data_indices_;
 #endif  // USE_CUDA_EXP

-  /*! \brief Store the indices of in-bag data */
-  std::vector<data_size_t, Common::AlignmentAllocator<data_size_t, kAlignedSize>> bag_data_indices_;
-  /*! \brief Number of in-bag data */
-  data_size_t bag_data_cnt_;
   /*! \brief Number of training data */
   data_size_t num_data_;
   /*! \brief Number of trees per iterations */
@@ -592,8 +584,6 @@
   /*! \brief Feature names */
   std::vector<std::string> feature_names_;
   std::vector<std::string> feature_infos_;
-  std::unique_ptr<Dataset> tmp_subset_;
-  bool is_use_subset_;
   std::vector<bool> class_need_train_;
   bool is_constant_hessian_;
   std::unique_ptr<ObjectiveFunction> loaded_objective_;
@@ -602,11 +592,9 @@
   bool balanced_bagging_;
   std::string loaded_parameter_;
   std::vector<int8_t> monotone_constraints_;
-  const int bagging_rand_block_ = 1024;
-  std::vector<Random> bagging_rands_;
-  ParallelPartitionRunner<data_size_t, false> bagging_runner_;
   Json forced_splits_json_;
   bool linear_tree_;
+  std::unique_ptr<SampleStrategy> data_sample_strategy_;
 };

 }  // namespace LightGBM
......
 /*!
- * Copyright (c) 2017 Microsoft Corporation. All rights reserved.
+ * Copyright (c) 2021 Microsoft Corporation. All rights reserved.
  * Licensed under the MIT License. See LICENSE file in the project root for license information.
  */
-#ifndef LIGHTGBM_BOOSTING_GOSS_H_
-#define LIGHTGBM_BOOSTING_GOSS_H_
+#ifndef LIGHTGBM_BOOSTING_GOSS_HPP_
+#define LIGHTGBM_BOOSTING_GOSS_HPP_

-#include <LightGBM/boosting.h>
 #include <LightGBM/utils/array_args.h>
-#include <LightGBM/utils/log.h>
+#include <LightGBM/sample_strategy.h>

-#include <string>
 #include <algorithm>
-#include <chrono>
-#include <cstdio>
-#include <cstdint>
-#include <fstream>
+#include <string>
 #include <vector>

-#include "gbdt.h"
-#include "score_updater.hpp"
-
 namespace LightGBM {

-class GOSS: public GBDT {
+class GOSSStrategy : public SampleStrategy {
  public:
-  /*!
-   * \brief Constructor
-   */
-  GOSS() : GBDT() {
+  GOSSStrategy(const Config* config, const Dataset* train_data, int num_tree_per_iteration) {
+    config_ = config;
+    train_data_ = train_data;
+    num_tree_per_iteration_ = num_tree_per_iteration;
+    num_data_ = train_data->num_data();
   }

-  ~GOSS() {
+  ~GOSSStrategy() {
   }

-  void Init(const Config* config, const Dataset* train_data, const ObjectiveFunction* objective_function,
-            const std::vector<const Metric*>& training_metrics) override {
-    GBDT::Init(config, train_data, objective_function, training_metrics);
-    ResetGoss();
-    if (objective_function_ == nullptr) {
-      // use customized objective function
-      size_t total_size = static_cast<size_t>(num_data_) * num_tree_per_iteration_;
-      gradients_.resize(total_size, 0.0f);
-      hessians_.resize(total_size, 0.0f);
-    }
-  }
-
-  void ResetTrainingData(const Dataset* train_data, const ObjectiveFunction* objective_function,
-                         const std::vector<const Metric*>& training_metrics) override {
-    GBDT::ResetTrainingData(train_data, objective_function, training_metrics);
-    ResetGoss();
-  }
-
-  void ResetConfig(const Config* config) override {
-    GBDT::ResetConfig(config);
-    ResetGoss();
-  }
-
-  bool TrainOneIter(const score_t* gradients, const score_t* hessians) override {
-    if (gradients != nullptr) {
-      // use customized objective function
-      CHECK(hessians != nullptr && objective_function_ == nullptr);
-      int64_t total_size = static_cast<int64_t>(num_data_) * num_tree_per_iteration_;
-      #pragma omp parallel for schedule(static)
-      for (int64_t i = 0; i < total_size; ++i) {
-        gradients_[i] = gradients[i];
-        hessians_[i] = hessians[i];
+  void Bagging(int iter, TreeLearner* tree_learner, score_t* gradients, score_t* hessians) override {
+    bag_data_cnt_ = num_data_;
+    // not subsample for first iterations
+    if (iter < static_cast<int>(1.0f / config_->learning_rate)) { return; }
+    auto left_cnt = bagging_runner_.Run<true>(
+        num_data_,
+        [=](int, data_size_t cur_start, data_size_t cur_cnt, data_size_t* left,
+            data_size_t*) {
+          data_size_t cur_left_count = 0;
+          cur_left_count = Helper(cur_start, cur_cnt, left, gradients, hessians);
+          return cur_left_count;
+        },
+        bag_data_indices_.data());
+    bag_data_cnt_ = left_cnt;
+    // set bagging data to tree learner
+    if (!is_use_subset_) {
+#ifdef USE_CUDA_EXP
+      if (config_->device_type == std::string("cuda_exp")) {
+        CopyFromHostToCUDADevice<data_size_t>(cuda_bag_data_indices_.RawData(), bag_data_indices_.data(), static_cast<size_t>(num_data_), __FILE__, __LINE__);
+        tree_learner->SetBaggingData(nullptr, cuda_bag_data_indices_.RawData(), bag_data_cnt_);
+      } else {
+#endif  // USE_CUDA_EXP
+        tree_learner->SetBaggingData(nullptr, bag_data_indices_.data(), bag_data_cnt_);
+#ifdef USE_CUDA_EXP
       }
-      return GBDT::TrainOneIter(gradients_.data(), hessians_.data());
+#endif  // USE_CUDA_EXP
     } else {
-      CHECK(hessians == nullptr);
-      return GBDT::TrainOneIter(nullptr, nullptr);
+      // get subset
+      tmp_subset_->ReSize(bag_data_cnt_);
+      tmp_subset_->CopySubrow(train_data_, bag_data_indices_.data(),
+                              bag_data_cnt_, false);
+#ifdef USE_CUDA_EXP
+      if (config_->device_type == std::string("cuda_exp")) {
+        CopyFromHostToCUDADevice<data_size_t>(cuda_bag_data_indices_.RawData(), bag_data_indices_.data(), static_cast<size_t>(num_data_), __FILE__, __LINE__);
+        tree_learner->SetBaggingData(tmp_subset_.get(), cuda_bag_data_indices_.RawData(),
+                                     bag_data_cnt_);
+      } else {
+#endif  // USE_CUDA_EXP
+        tree_learner->SetBaggingData(tmp_subset_.get(), bag_data_indices_.data(),
+                                     bag_data_cnt_);
+#ifdef USE_CUDA_EXP
+      }
+#endif  // USE_CUDA_EXP
     }
   }

-  void ResetGoss() {
+  void ResetSampleConfig(const Config* config, bool /*is_change_dataset*/) override {
+    // Cannot use bagging in GOSS
+    config_ = config;
+    need_resize_gradients_ = false;
+    if (objective_function_ == nullptr) {
+      // resize gradient vectors to copy the customized gradients for goss
+      need_resize_gradients_ = true;
+    }
     CHECK_LE(config_->top_rate + config_->other_rate, 1.0f);
     CHECK(config_->top_rate > 0.0f && config_->other_rate > 0.0f);
     if (config_->bagging_freq > 0 && config_->bagging_fraction != 1.0f) {
@@ -100,7 +108,12 @@
     bag_data_cnt_ = num_data_;
   }

-  data_size_t BaggingHelper(data_size_t start, data_size_t cnt, data_size_t* buffer) override {
+  bool IsHessianChange() const override {
+    return true;
+  }
+
+ private:
+  data_size_t Helper(data_size_t start, data_size_t cnt, data_size_t* buffer, score_t* gradients, score_t* hessians) {
     if (cnt <= 0) {
       return 0;
     }
@@ -108,7 +121,7 @@
     for (data_size_t i = 0; i < cnt; ++i) {
       for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
         size_t idx = static_cast<size_t>(cur_tree_id) * num_data_ + start + i;
-        tmp_gradients[i] += std::fabs(gradients_[idx] * hessians_[idx]);
+        tmp_gradients[i] += std::fabs(gradients[idx] * hessians[idx]);
       }
     }
     data_size_t top_k = static_cast<data_size_t>(cnt * config_->top_rate);
@@ -126,7 +139,7 @@
         score_t grad = 0.0f;
         for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
           size_t idx = static_cast<size_t>(cur_tree_id) * num_data_ + cur_idx;
-          grad += std::fabs(gradients_[idx] * hessians_[idx]);
+          grad += std::fabs(gradients[idx] * hessians[idx]);
         }
         if (grad >= threshold) {
           buffer[cur_left_cnt++] = cur_idx;
@@ -140,8 +153,8 @@
           buffer[cur_left_cnt++] = cur_idx;
           for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
             size_t idx = static_cast<size_t>(cur_tree_id) * num_data_ + cur_idx;
-            gradients_[idx] *= multiply;
-            hessians_[idx] *= multiply;
+            gradients[idx] *= multiply;
+            hessians[idx] *= multiply;
           }
         } else {
           buffer[--cur_right_pos] = cur_idx;
@@ -150,58 +163,8 @@
       }
     }
     return cur_left_cnt;
   }

-  void Bagging(int iter) override {
-    bag_data_cnt_ = num_data_;
-    // not subsample for first iterations
-    if (iter < static_cast<int>(1.0f / config_->learning_rate)) { return; }
-    auto left_cnt = bagging_runner_.Run<true>(
-        num_data_,
-        [=](int, data_size_t cur_start, data_size_t cur_cnt, data_size_t* left,
-            data_size_t*) {
-          data_size_t cur_left_count = 0;
-          cur_left_count = BaggingHelper(cur_start, cur_cnt, left);
-          return cur_left_count;
-        },
-        bag_data_indices_.data());
-    bag_data_cnt_ = left_cnt;
-    // set bagging data to tree learner
-    if (!is_use_subset_) {
-#ifdef USE_CUDA_EXP
-      if (config_->device_type == std::string("cuda_exp")) {
-        CopyFromHostToCUDADevice<data_size_t>(cuda_bag_data_indices_.RawData(), bag_data_indices_.data(), static_cast<size_t>(num_data_), __FILE__, __LINE__);
-        tree_learner_->SetBaggingData(nullptr, cuda_bag_data_indices_.RawData(), bag_data_cnt_);
-      } else {
-#endif  // USE_CUDA_EXP
-        tree_learner_->SetBaggingData(nullptr, bag_data_indices_.data(), bag_data_cnt_);
-#ifdef USE_CUDA_EXP
-      }
-#endif  // USE_CUDA_EXP
-    } else {
-      // get subset
-      tmp_subset_->ReSize(bag_data_cnt_);
-      tmp_subset_->CopySubrow(train_data_, bag_data_indices_.data(),
-                              bag_data_cnt_, false);
-#ifdef USE_CUDA_EXP
-      if (config_->device_type == std::string("cuda_exp")) {
-        CopyFromHostToCUDADevice<data_size_t>(cuda_bag_data_indices_.RawData(), bag_data_indices_.data(), static_cast<size_t>(num_data_), __FILE__, __LINE__);
-        tree_learner_->SetBaggingData(tmp_subset_.get(), cuda_bag_data_indices_.RawData(),
-                                      bag_data_cnt_);
-      } else {
-#endif  // USE_CUDA_EXP
-        tree_learner_->SetBaggingData(tmp_subset_.get(), bag_data_indices_.data(),
-                                      bag_data_cnt_);
-#ifdef USE_CUDA_EXP
-      }
-#endif  // USE_CUDA_EXP
-    }
-  }
-
- protected:
-  bool GetIsConstHessian(const ObjectiveFunction*) override {
-    return false;
-  }
 };

 }  // namespace LightGBM

-#endif  // LIGHTGBM_BOOSTING_GOSS_H_
+#endif  // LIGHTGBM_BOOSTING_GOSS_HPP_
@@ -32,8 +32,12 @@ class RF : public GBDT {
   void Init(const Config* config, const Dataset* train_data, const ObjectiveFunction* objective_function,
             const std::vector<const Metric*>& training_metrics) override {
-    CHECK(config->bagging_freq > 0 && config->bagging_fraction < 1.0f && config->bagging_fraction > 0.0f);
-    CHECK(config->feature_fraction <= 1.0f && config->feature_fraction > 0.0f);
+    if (config->data_sample_strategy == std::string("bagging")) {
+      CHECK((config->bagging_freq > 0 && config->bagging_fraction < 1.0f && config->bagging_fraction > 0.0f) ||
+            (config->feature_fraction < 1.0f && config->feature_fraction > 0.0f));
+    } else {
+      CHECK_EQ(config->data_sample_strategy, std::string("goss"));
+    }
     GBDT::Init(config, train_data, objective_function, training_metrics);

     if (num_init_iteration_ > 0) {
@@ -48,15 +52,19 @@
       shrinkage_rate_ = 1.0f;
       // only boosting one time
       Boosting();
-      if (is_use_subset_ && bag_data_cnt_ < num_data_) {
+      if (data_sample_strategy_->is_use_subset() && data_sample_strategy_->bag_data_cnt() < num_data_) {
         tmp_grad_.resize(num_data_);
         tmp_hess_.resize(num_data_);
       }
     }
   }

   void ResetConfig(const Config* config) override {
-    CHECK(config->bagging_freq > 0 && config->bagging_fraction < 1.0f && config->bagging_fraction > 0.0f);
-    CHECK(config->feature_fraction <= 1.0f && config->feature_fraction > 0.0f);
+    if (config->data_sample_strategy == std::string("bagging")) {
+      CHECK((config->bagging_freq > 0 && config->bagging_fraction < 1.0f && config->bagging_fraction > 0.0f) ||
+            (config->feature_fraction < 1.0f && config->feature_fraction > 0.0f));
+    } else {
+      CHECK_EQ(config->data_sample_strategy, std::string("goss"));
+    }
     GBDT::ResetConfig(config);
     // not shrinkage rate for the RF
     shrinkage_rate_ = 1.0f;
@@ -73,7 +81,7 @@
     CHECK_EQ(num_tree_per_iteration_, num_class_);
     // only boosting one time
     Boosting();
-    if (is_use_subset_ && bag_data_cnt_ < num_data_) {
+    if (data_sample_strategy_->is_use_subset() && data_sample_strategy_->bag_data_cnt() < num_data_) {
       tmp_grad_.resize(num_data_);
      tmp_hess_.resize(num_data_);
     }
@@ -102,7 +110,11 @@
   bool TrainOneIter(const score_t* gradients, const score_t* hessians) override {
     // bagging logic
-    Bagging(iter_);
+    data_sample_strategy_->Bagging(iter_, tree_learner_.get(), gradients_.data(), hessians_.data());
+    const bool is_use_subset = data_sample_strategy_->is_use_subset();
+    const data_size_t bag_data_cnt = data_sample_strategy_->bag_data_cnt();
+    const std::vector<data_size_t, Common::AlignmentAllocator<data_size_t, kAlignedSize>>& bag_data_indices = data_sample_strategy_->bag_data_indices();

     CHECK_EQ(gradients, nullptr);
     CHECK_EQ(hessians, nullptr);
@@ -115,11 +127,10 @@
       auto grad = gradients + offset;
       auto hess = hessians + offset;

-      // need to copy gradients for bagging subset.
-      if (is_use_subset_ && bag_data_cnt_ < num_data_ && !boosting_on_gpu_) {
-        for (int i = 0; i < bag_data_cnt_; ++i) {
-          tmp_grad_[i] = grad[bag_data_indices_[i]];
-          tmp_hess_[i] = hess[bag_data_indices_[i]];
+      if (is_use_subset && bag_data_cnt < num_data_ && !boosting_on_gpu_) {
+        for (int i = 0; i < bag_data_cnt; ++i) {
+          tmp_grad_[i] = grad[bag_data_indices[i]];
+          tmp_hess_[i] = hess[bag_data_indices[i]];
         }
         grad = tmp_grad_.data();
         hess = tmp_hess_.data();
@@ -132,7 +143,7 @@
       double pred = init_scores_[cur_tree_id];
       auto residual_getter = [pred](const label_t* label, int i) {return static_cast<double>(label[i]) - pred; };
       tree_learner_->RenewTreeOutput(new_tree.get(), objective_function_, residual_getter,
-                                     num_data_, bag_data_indices_.data(), bag_data_cnt_, train_score_updater_->score());
+                                     num_data_, bag_data_indices.data(), bag_data_cnt, train_score_updater_->score());
       if (std::fabs(init_scores_[cur_tree_id]) > kEpsilon) {
         new_tree->AddBias(init_scores_[cur_tree_id]);
       }
......
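The relaxed checks in RF::Init/ResetConfig above mean random forest mode no longer strictly requires row bagging: per the tests later in this diff, it can run with GOSS sampling, or with feature subsampling alone. A hedged Python sketch (real parameter names, made-up data):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(7)
X = rng.normal(size=(1_000, 10))
y = X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=1_000)
train_set = lgb.Dataset(X, y)

# random forest mode driven by GOSS sampling instead of bagging
rf_goss = lgb.train(
    {'objective': 'regression', 'boosting': 'rf',
     'data_sample_strategy': 'goss', 'verbose': -1},
    train_set, num_boost_round=10)

# random forest mode with feature subsampling alone (no row bagging)
rf_feature = lgb.train(
    {'objective': 'regression', 'boosting': 'rf',
     'feature_fraction': 0.8, 'verbose': -1},
    train_set, num_boost_round=10)
```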
/*!
 * Copyright (c) 2021 Microsoft Corporation. All rights reserved.
 * Licensed under the MIT License. See LICENSE file in the project root for license information.
 */

#include <LightGBM/sample_strategy.h>

#include "goss.hpp"
#include "bagging.hpp"

namespace LightGBM {

SampleStrategy* SampleStrategy::CreateSampleStrategy(
    const Config* config,
    const Dataset* train_data,
    const ObjectiveFunction* objective_function,
    int num_tree_per_iteration) {
  if (config->data_sample_strategy == std::string("goss")) {
    return new GOSSStrategy(config, train_data, num_tree_per_iteration);
  } else {
    return new BaggingSampleStrategy(config, train_data, objective_function, num_tree_per_iteration);
  }
}

}  // namespace LightGBM
@@ -99,6 +99,20 @@ void GetBoostingType(const std::unordered_map<std::string, std::string>& params
   }
 }

+void GetDataSampleStrategy(const std::unordered_map<std::string, std::string>& params, std::string* strategy) {
+  std::string value;
+  if (Config::GetString(params, "data_sample_strategy", &value)) {
+    std::transform(value.begin(), value.end(), value.begin(), Common::tolower);
+    if (value == std::string("goss")) {
+      *strategy = "goss";
+    } else if (value == std::string("bagging")) {
+      *strategy = "bagging";
+    } else {
+      Log::Fatal("Unknown sample strategy %s", value.c_str());
+    }
+  }
+}
+
 void ParseMetrics(const std::string& value, std::vector<std::string>* out_metric) {
   std::unordered_set<std::string> metric_sets;
   out_metric->clear();
@@ -242,6 +256,7 @@ void Config::Set(const std::unordered_map<std::string, std::string>& params) {
   GetTaskType(params, &task);
   GetBoostingType(params, &boosting);
+  GetDataSampleStrategy(params, &data_sample_strategy);
   GetObjectiveType(params, &objective);
   GetMetricType(params, objective, &metric);
   GetDeviceType(params, &device_type);
@@ -423,6 +438,12 @@ void Config::CheckParamConflict() {
                  "Will set min_data_in_leaf to 1.");
     min_data_in_leaf = 1;
   }
+  if (boosting == std::string("goss")) {
+    boosting = std::string("gbdt");
+    data_sample_strategy = std::string("goss");
+    Log::Warning("Found boosting=goss. For backwards compatibility reasons, LightGBM interprets this as boosting=gbdt, data_sample_strategy=goss. "
+                 "To suppress this warning, set data_sample_strategy=goss instead.");
+  }
 }

 std::string Config::ToString() const {
......
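A quick sketch of the backward-compatibility path handled in CheckParamConflict above: the legacy spelling still trains (with the warning), and should behave like the new spelling, mirroring test_goss_boosting_and_strategy_equivalent later in this diff. The data here is made up:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X.sum(axis=1)

# legacy spelling: warns, then runs as boosting=gbdt + data_sample_strategy=goss
legacy = lgb.train({'objective': 'regression', 'boosting': 'goss', 'verbose': -1},
                   lgb.Dataset(X, y), num_boost_round=5)

# new, warning-free spelling
new = lgb.train({'objective': 'regression', 'boosting': 'gbdt',
                 'data_sample_strategy': 'goss', 'verbose': -1},
                lgb.Dataset(X, y), num_boost_round=5)
```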
@@ -186,6 +186,7 @@ const std::unordered_set<std::string>& Config::parameter_set() {
   "task",
   "objective",
   "boosting",
+  "data_sample_strategy",
   "data",
   "valid",
   "num_iterations",
@@ -762,6 +763,7 @@ const std::unordered_map<std::string, std::vector<std::string>>& Config::parameter2aliases() {
   {"task", {"task_type"}},
   {"objective", {"objective_type", "app", "application", "loss"}},
   {"boosting", {"boosting_type", "boost"}},
+  {"data_sample_strategy", {}},
   {"data", {"train", "train_data", "train_data_file", "data_filename"}},
   {"valid", {"test", "valid_data", "valid_data_file", "test_data", "test_data_file", "valid_filenames"}},
   {"num_iterations", {"num_iteration", "n_iter", "num_tree", "num_trees", "num_round", "num_rounds", "nrounds", "num_boost_round", "n_estimators", "max_iter"}},
@@ -899,6 +901,7 @@ const std::unordered_map<std::string, std::string>& Config::ParameterTypes() {
   {"config", "string"},
   {"objective", "string"},
   {"boosting", "string"},
+  {"data_sample_strategy", "string"},
   {"data", "string"},
   {"valid", "vector<string>"},
   {"num_iterations", "int"},
......
@@ -19,7 +19,9 @@ namespace LightGBM {
 ObjectiveFunction* ObjectiveFunction::CreateObjectiveFunction(const std::string& type, const Config& config) {
 #ifdef USE_CUDA_EXP
-  if (config.device_type == std::string("cuda_exp") && config.boosting == std::string("gbdt")) {
+  if (config.device_type == std::string("cuda_exp") &&
+      config.data_sample_strategy != std::string("goss") &&
+      config.boosting != std::string("rf")) {
     if (type == std::string("regression")) {
       return new CUDARegressionL2loss(config);
     } else if (type == std::string("regression_l1")) {
......
@@ -3592,6 +3592,142 @@ def test_force_split_with_feature_fraction(tmp_path):
     assert tree_structure['split_feature'] == 0


+def test_goss_boosting_and_strategy_equivalent():
+    X, y = make_synthetic_regression(n_samples=10_000, n_features=10, n_informative=5, random_state=42)
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
+    lgb_train = lgb.Dataset(X_train, y_train)
+    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
+    base_params = {
+        'metric': 'l2',
+        'verbose': -1,
+        'bagging_seed': 0,
+        'learning_rate': 0.05,
+        'num_threads': 1,
+        'force_row_wise': True,
+        'gpu_use_dp': True,
+    }
+    params1 = {**base_params, 'boosting': 'goss'}
+    evals_result1 = {}
+    lgb.train(params1, lgb_train,
+              num_boost_round=10,
+              valid_sets=lgb_eval,
+              callbacks=[lgb.record_evaluation(evals_result1)])
+    params2 = {**base_params, 'data_sample_strategy': 'goss'}
+    evals_result2 = {}
+    lgb.train(params2, lgb_train,
+              num_boost_round=10,
+              valid_sets=lgb_eval,
+              callbacks=[lgb.record_evaluation(evals_result2)])
+    assert evals_result1['valid_0']['l2'] == evals_result2['valid_0']['l2']
+
+
+def test_sample_strategy_with_boosting():
+    X, y = make_synthetic_regression(n_samples=10_000, n_features=10, n_informative=5, random_state=42)
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
+    lgb_train = lgb.Dataset(X_train, y_train)
+    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
+    base_params = {
+        'metric': 'l2',
+        'verbose': -1,
+        'num_threads': 1,
+        'force_row_wise': True,
+        'gpu_use_dp': True,
+    }
+
+    params1 = {**base_params, 'boosting': 'dart', 'data_sample_strategy': 'goss'}
+    evals_result = {}
+    gbm = lgb.train(params1, lgb_train,
+                    num_boost_round=10,
+                    valid_sets=lgb_eval,
+                    callbacks=[lgb.record_evaluation(evals_result)])
+    eval_res1 = evals_result['valid_0']['l2'][-1]
+    test_res1 = mean_squared_error(y_test, gbm.predict(X_test))
+    assert test_res1 == pytest.approx(3149.393862, abs=1.0)
+    assert eval_res1 == pytest.approx(test_res1)
+
+    params2 = {**base_params, 'boosting': 'gbdt', 'data_sample_strategy': 'goss'}
+    evals_result = {}
+    gbm = lgb.train(params2, lgb_train,
+                    num_boost_round=10,
+                    valid_sets=lgb_eval,
+                    callbacks=[lgb.record_evaluation(evals_result)])
+    eval_res2 = evals_result['valid_0']['l2'][-1]
+    test_res2 = mean_squared_error(y_test, gbm.predict(X_test))
+    assert test_res2 == pytest.approx(2547.715968, abs=1.0)
+    assert eval_res2 == pytest.approx(test_res2)
+
+    params3 = {**base_params, 'boosting': 'goss', 'data_sample_strategy': 'goss'}
+    evals_result = {}
+    gbm = lgb.train(params3, lgb_train,
+                    num_boost_round=10,
+                    valid_sets=lgb_eval,
+                    callbacks=[lgb.record_evaluation(evals_result)])
+    eval_res3 = evals_result['valid_0']['l2'][-1]
+    test_res3 = mean_squared_error(y_test, gbm.predict(X_test))
+    assert test_res3 == pytest.approx(2547.715968, abs=1.0)
+    assert eval_res3 == pytest.approx(test_res3)
+
+    params4 = {**base_params, 'boosting': 'rf', 'data_sample_strategy': 'goss'}
+    evals_result = {}
+    gbm = lgb.train(params4, lgb_train,
+                    num_boost_round=10,
+                    valid_sets=lgb_eval,
+                    callbacks=[lgb.record_evaluation(evals_result)])
+    eval_res4 = evals_result['valid_0']['l2'][-1]
+    test_res4 = mean_squared_error(y_test, gbm.predict(X_test))
+    assert test_res4 == pytest.approx(2095.538735, abs=1.0)
+    assert eval_res4 == pytest.approx(test_res4)
+
+    assert test_res1 != test_res2
+    assert eval_res1 != eval_res2
+    assert test_res2 == test_res3
+    assert eval_res2 == eval_res3
+    assert eval_res1 != eval_res4
+    assert test_res1 != test_res4
+    assert eval_res2 != eval_res4
+    assert test_res2 != test_res4
+
+    params5 = {**base_params, 'boosting': 'dart', 'data_sample_strategy': 'bagging', 'bagging_freq': 1, 'bagging_fraction': 0.5}
+    evals_result = {}
+    gbm = lgb.train(params5, lgb_train,
+                    num_boost_round=10,
+                    valid_sets=lgb_eval,
+                    callbacks=[lgb.record_evaluation(evals_result)])
+    eval_res5 = evals_result['valid_0']['l2'][-1]
+    test_res5 = mean_squared_error(y_test, gbm.predict(X_test))
+    assert test_res5 == pytest.approx(3134.866931, abs=1.0)
+    assert eval_res5 == pytest.approx(test_res5)
+
+    params6 = {**base_params, 'boosting': 'gbdt', 'data_sample_strategy': 'bagging', 'bagging_freq': 1, 'bagging_fraction': 0.5}
+    evals_result = {}
+    gbm = lgb.train(params6, lgb_train,
+                    num_boost_round=10,
+                    valid_sets=lgb_eval,
+                    callbacks=[lgb.record_evaluation(evals_result)])
+    eval_res6 = evals_result['valid_0']['l2'][-1]
+    test_res6 = mean_squared_error(y_test, gbm.predict(X_test))
+    assert test_res6 == pytest.approx(2539.792378, abs=1.0)
+    assert eval_res6 == pytest.approx(test_res6)
+
+    assert test_res5 != test_res6
+    assert eval_res5 != eval_res6
+
+    params7 = {**base_params, 'boosting': 'rf', 'data_sample_strategy': 'bagging', 'bagging_freq': 1, 'bagging_fraction': 0.5}
+    evals_result = {}
+    gbm = lgb.train(params7, lgb_train,
+                    num_boost_round=10,
+                    valid_sets=lgb_eval,
+                    callbacks=[lgb.record_evaluation(evals_result)])
+    eval_res7 = evals_result['valid_0']['l2'][-1]
+    test_res7 = mean_squared_error(y_test, gbm.predict(X_test))
+    assert test_res7 == pytest.approx(1518.704481, abs=1.0)
+    assert eval_res7 == pytest.approx(test_res7)
+
+    assert test_res5 != test_res7
+    assert eval_res5 != eval_res7
+    assert test_res6 != test_res7
+    assert eval_res6 != eval_res7
+
+
 def test_record_evaluation_with_train():
     X, y = make_synthetic_regression()
     ds = lgb.Dataset(X, y)
......