Unverified Commit fcfd4132 authored by Belinda Trotta, committed by GitHub

Trees with linear models at leaves (#3299)
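
This PR lets each leaf hold a small linear model on the numerical features along its branch instead of a single constant, fit with the Eigen library and regularized by the new `linear_lambda` parameter. Roughly, and as a sketch of the model form rather than the exact Eigen-based fitting equations, the output for a sample x reaching leaf l is

```latex
\hat{y}_{\ell}(x) = c_{\ell} + \sum_{j \in F_{\ell}} a_{\ell j}\, x_{j}
```

where F_l is the set of numerical split features on the path to the leaf, and the constant c_l and coefficients a_l are chosen by penalized (ridge-style) least squares with penalty weight `linear_lambda`. If any feature in F_l is NaN at prediction time, the tree falls back to the ordinary constant leaf value (see PredictionFunLinear in tree.cpp below).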

* Add Eigen library.

* Working for simple test.

* Apply changes to config params.

* Handle nan data.

* Update docs.

* Add test.

* Only load raw data if boosting=gbdt_linear

* Remove unneeded code.

* Minor updates.

* Update to work with sk-learn interface.

* Update to work with chunked datasets.

* Throw error if we try to create a Booster with an already-constructed dataset having incompatible parameters.

* Save raw data in binary dataset file.

* Update docs and fix parameter checking.

* Fix dataset loading.

* Add test for regularization.

* Fix bugs when saving and loading tree.

* Add test for load/save linear model.

* Remove unneeded code.

* Fix case where not enough leaf data for linear model.

* Simplify code.

* Speed up code.

* Speed up code.

* Simplify code.

* Speed up code.

* Fix bugs.

* Working version.

* Store feature data column-wise (not fully working yet).

* Fix bugs.

* Speed up.

* Speed up.

* Remove unneeded code.

* Small speedup.

* Speed up.

* Minor updates.

* Remove unneeded code.

* Fix bug.

* Fix bug.

* Speed up.

* Speed up.

* Simplify code.

* Remove unneeded code.

* Fix bug, add more tests.

* Fix bug and add test.

* Only store numerical features

* Fix bug and speed up using templates.

* Speed up prediction.

* Fix bug with regularisation

* Visual Studio files.

* Working version

* Only check nans if necessary

* Store coeff matrix as an array.

* Align cache lines

* Align cache lines

* Preallocate coefficient calculation matrices

* Small speedups

* Small speedup

* Reverse cache alignment changes

* Change to dynamic schedule

* Update docs.

* Refactor so that linear tree learner is not a separate class.

* Add refit capability.

* Speed up

* Small speedups.

* Speed up add prediction to score.

* Fix bug

* Fix bug and speed up.

* Speed up dataload.

* Speed up dataload

* Use vectors instead of pointers

* Fix bug

* Add OMP exception handling.

* Change return type of LGBM_BoosterGetLinear to bool

* Change return type of LGBM_BoosterGetLinear back to int, only parameter type needed to change

* Remove unused internal_parent_ property of tree

* Remove unused parameter to CreateTreeLearner

* Remove reference to LinearTreeLearner

* Minor style issues

* Remove unneeded check

* Reverse temporary testing change

* Fix Visual Studio project files

* Restore LightGBM.vcxproj.filters

* Speed up

* Speed up

* Simplify code

* Update docs

* Simplify code

* Initialise storage space for max num threads

* Move Eigen to include directory and delete unused files

* Remove old files.

* Fix so it compiles with mingw

* Fix gpu tree learner

* Change AddPredictionToScore back to const

* Fix python lint error

* Fix C++ lint errors

* Change eigen to a submodule

* Update comment

* Add the eigen folder

* Try to fix build issues with eigen

* Remove eigen files

* Add eigen as submodule

* Fix include paths

* Exclude eigen files from Python linter

* Ignore eigen folders for pydocstyle

* Fix C++ linting errors

* Fix docs

* Fix docs

* Exclude eigen directories from doxygen

* Update manifest to include eigen

* Update build_r to include eigen files

* Fix compiler warnings

* Store raw feature data as float

* Use float for calculating linear coefficients

* Remove eigen directory from GLOB

* Don't compile linear model code when building R package

* Fix doxygen issue

* Fix lint issue

* Fix lint issue

* Remove unneeded code

* Restore deleted lines

* Restore deleted lines

* Change return type of has_raw to bool

* Update docs

* Rename some variables and functions for readability

* Make tree_learner parameter const in AddScore

* Fix style issues

* Pass vectors as const reference when setting tree properties

* Make temporary storage of serial_tree_learner mutable so we can make the object's methods const

* Remove get_raw_size, use num_numeric_features instead

* Fix typo

* Make contains_nan_ and any_nan_ properties immutable again

* Remove data_has_nan_ property of tree

* Remove temporary test code

* Make linear_tree a dataset param

* Fix lint error

* Make LinearTreeLearner a separate class

* Fix lint errors

* Fix lint error

* Add linear_tree_learner.o

* Simulate omp_get_max_threads if openmp is not available

* Update PushOneData to also store raw data.

* Cast size to int

* Fix bug in ReshapeRaw

* Speed up code with multithreading

* Use OMP_NUM_THREADS

* Speed up with multithreading

* Update to use ArrayToString

* Fix tests

* Fix test

* Fix bug introduced in merge

* Minor updates

* Update docs
parent d90a16d5
@@ -40,6 +40,7 @@ GBDT::GBDT()
      bagging_runner_(0, bagging_rand_block_) {
  average_output_ = false;
  tree_learner_ = nullptr;
  linear_tree_ = false;
}

GBDT::~GBDT() {
@@ -88,7 +89,8 @@ void GBDT::Init(const Config* config, const Dataset* train_data, const Objective
  is_constant_hessian_ = GetIsConstHessian(objective_function);
  tree_learner_ = std::unique_ptr<TreeLearner>(TreeLearner::CreateTreeLearner(config_->tree_learner, config_->device_type,
                                                                              config_.get()));
  // init tree learner
  tree_learner_->Init(train_data_, is_constant_hessian_);
@@ -129,6 +131,10 @@ void GBDT::Init(const Config* config, const Dataset* train_data, const Objective
      class_need_train_[i] = objective_function_->ClassNeedTrain(i);
    }
  }
  if (config_->linear_tree) {
    linear_tree_ = true;
  }
}
void GBDT::AddValidDataset(const Dataset* valid_data,
@@ -282,6 +288,19 @@ void GBDT::RefitTree(const std::vector<std::vector<int>>& tree_leaf_prediction)
  CHECK_EQ(static_cast<size_t>(models_.size()), tree_leaf_prediction[0].size());
  int num_iterations = static_cast<int>(models_.size() / num_tree_per_iteration_);
  std::vector<int> leaf_pred(num_data_);
  if (linear_tree_) {
    std::vector<int> max_leaves_by_thread = std::vector<int>(OMP_NUM_THREADS(), 0);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < static_cast<int>(tree_leaf_prediction.size()); ++i) {
      int tid = omp_get_thread_num();
      for (size_t j = 0; j < tree_leaf_prediction[i].size(); ++j) {
        max_leaves_by_thread[tid] = std::max(max_leaves_by_thread[tid], tree_leaf_prediction[i][j]);
      }
    }
    int max_leaves = *std::max_element(max_leaves_by_thread.begin(), max_leaves_by_thread.end());
    max_leaves += 1;
    tree_learner_->InitLinear(train_data_, max_leaves);
  }
  for (int iter = 0; iter < num_iterations; ++iter) {
    Boosting();
    for (int tree_id = 0; tree_id < num_tree_per_iteration_; ++tree_id) {
@@ -365,7 +384,7 @@ bool GBDT::TrainOneIter(const score_t* gradients, const score_t* hessians) {
  bool should_continue = false;
  for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
    const size_t offset = static_cast<size_t>(cur_tree_id) * num_data_;
    std::unique_ptr<Tree> new_tree(new Tree(2, false, false));
    if (class_need_train_[cur_tree_id] && train_data_->num_features() > 0) {
      auto grad = gradients + offset;
      auto hess = hessians + offset;
@@ -378,7 +397,8 @@ bool GBDT::TrainOneIter(const score_t* gradients, const score_t* hessians) {
        grad = gradients_.data() + offset;
        hess = hessians_.data() + offset;
      }
      bool is_first_tree = models_.size() < static_cast<size_t>(num_tree_per_iteration_);
      new_tree.reset(tree_learner_->Train(grad, hess, is_first_tree));
    }
    if (new_tree->num_leaves() > 1) {
...
@@ -44,6 +44,7 @@ class GBDT : public GBDTBase {
   */
  ~GBDT();

  /*!
   * \brief Initialization logic
   * \param gbdt_config Config for boosting
@@ -391,6 +392,8 @@ class GBDT : public GBDTBase {
   */
  const char* SubModelName() const override { return "tree"; }

  bool IsLinear() const override { return linear_tree_; }

 protected:
  virtual bool GetIsConstHessian(const ObjectiveFunction* objective_function) {
    if (objective_function != nullptr) {
@@ -530,6 +533,7 @@ class GBDT : public GBDTBase {
  std::vector<Random> bagging_rands_;
  ParallelPartitionRunner<data_size_t, false> bagging_runner_;
  Json forced_splits_json_;
  bool linear_tree_;
};

}  // namespace LightGBM
...
@@ -581,6 +581,11 @@ bool GBDT::LoadModelFromString(const char* buffer, size_t len) {
        break;
      } else if (is_inparameter) {
        ss << cur_line << "\n";
        if (Common::StartsWith(cur_line, "[linear_tree: ")) {
          int is_linear = 0;
          Common::Atoi(cur_line.substr(14, 1).c_str(), &is_linear);
          linear_tree_ = static_cast<bool>(is_linear);
        }
      }
    }
    p += line_len;
...
@@ -109,7 +109,7 @@ class RF : public GBDT {
    gradients = gradients_.data();
    hessians = hessians_.data();
    for (int cur_tree_id = 0; cur_tree_id < num_tree_per_iteration_; ++cur_tree_id) {
      std::unique_ptr<Tree> new_tree(new Tree(2, false, false));
      size_t offset = static_cast<size_t>(cur_tree_id) * num_data_;
      if (class_need_train_[cur_tree_id]) {
        auto grad = gradients + offset;
@@ -125,7 +125,7 @@ class RF : public GBDT {
          hess = tmp_hess_.data();
        }
        new_tree.reset(tree_learner_->Train(grad, hess, false));
      }
      if (new_tree->num_leaves() > 1) {
...
@@ -289,6 +289,9 @@ class Booster {
          "You need to set `feature_pre_filter=false` to dynamically change "
          "the `min_data_in_leaf`.");
    }
if (new_param.count("linear_tree") && (new_config.linear_tree != old_config.linear_tree)) {
      Log::Fatal("Cannot change between gbdt_linear boosting and other boosting types after Dataset handle has been constructed.");
}
  }

  void ResetConfig(const char* parameters) {
@@ -960,6 +963,9 @@ int LGBM_DatasetPushRows(DatasetHandle dataset,
  API_BEGIN();
  auto p_dataset = reinterpret_cast<Dataset*>(dataset);
  auto get_row_fun = RowFunctionFromDenseMatric(data, nrow, ncol, data_type, 1);
  if (p_dataset->has_raw()) {
    p_dataset->ResizeRaw(p_dataset->num_numeric_features() + nrow);
  }
  OMP_INIT_EX();
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < nrow; ++i) {
@@ -990,14 +996,16 @@ int LGBM_DatasetPushRowsByCSR(DatasetHandle dataset,
  auto p_dataset = reinterpret_cast<Dataset*>(dataset);
  auto get_row_fun = RowFunctionFromCSR<int>(indptr, indptr_type, indices, data, data_type, nindptr, nelem);
  int32_t nrow = static_cast<int32_t>(nindptr - 1);
  if (p_dataset->has_raw()) {
    p_dataset->ResizeRaw(p_dataset->num_numeric_features() + nrow);
  }
  OMP_INIT_EX();
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < nrow; ++i) {
    OMP_LOOP_EX_BEGIN();
    const int tid = omp_get_thread_num();
    auto one_row = get_row_fun(i);
    p_dataset->PushOneRow(tid, static_cast<data_size_t>(start_row + i), one_row);
    OMP_LOOP_EX_END();
  }
  OMP_THROW_EX();
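
`ResizeRaw` is called before each push loop, but its body is not part of this diff. Given that the raw data is stored column-wise, one `std::vector<float>` per numeric feature (see `raw_data_[feat_ind][i]` in dataset.cpp below), a minimal sketch of the assumed behaviour:

```cpp
// Assumed sketch only, not the PR's actual implementation: make sure every
// raw-data column (one float column per numeric feature) can hold num_rows values.
void Dataset::ResizeRaw(int num_rows) {
  raw_data_.resize(num_numeric_features_);
  for (auto& column : raw_data_) {
    column.resize(num_rows);
  }
}
```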
@@ -1090,6 +1098,9 @@ int LGBM_DatasetCreateFromMats(int32_t nmat,
    ret.reset(new Dataset(total_nrow));
    ret->CreateValid(
      reinterpret_cast<const Dataset*>(reference));
    if (ret->has_raw()) {
      ret->ResizeRaw(total_nrow);
    }
  }
  int32_t start_row = 0;
  for (int j = 0; j < nmat; ++j) {
@@ -1166,6 +1177,9 @@ int LGBM_DatasetCreateFromCSR(const void* indptr,
    ret.reset(new Dataset(nrow));
    ret->CreateValid(
      reinterpret_cast<const Dataset*>(reference));
    if (ret->has_raw()) {
      ret->ResizeRaw(nrow);
    }
  }
  OMP_INIT_EX();
  #pragma omp parallel for schedule(static)
@@ -1234,6 +1248,9 @@ int LGBM_DatasetCreateFromCSRFunc(void* get_row_funptr,
    ret.reset(new Dataset(nrow));
    ret->CreateValid(
      reinterpret_cast<const Dataset*>(reference));
    if (ret->has_raw()) {
      ret->ResizeRaw(nrow);
    }
  }
  OMP_INIT_EX();
@@ -1326,12 +1343,12 @@ int LGBM_DatasetCreateFromCSC(const void* col_ptr,
        row_idx = pair.first;
        // no more data
        if (row_idx < 0) { break; }
        ret->PushOneData(tid, row_idx, group, feature_idx, sub_feature, pair.second);
      }
    } else {
      for (int row_idx = 0; row_idx < nrow; ++row_idx) {
        auto val = col_it.Get(row_idx);
        ret->PushOneData(tid, row_idx, group, feature_idx, sub_feature, val);
      }
    }
    OMP_LOOP_EX_END();
@@ -1600,6 +1617,13 @@ int LGBM_BoosterGetNumClasses(BoosterHandle handle, int* out_len) {
  API_END();
}
int LGBM_BoosterGetLinear(BoosterHandle handle, bool* out) {
API_BEGIN();
Booster* ref_booster = reinterpret_cast<Booster*>(handle);
*out = ref_booster->GetBoosting()->IsLinear();
API_END();
}
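
A hedged usage sketch for the new C-API entry point above; the booster handle setup is elided, and, as with the other `LGBM_*` functions, a zero return value is taken to mean success:

```cpp
bool is_linear = false;
if (LGBM_BoosterGetLinear(booster_handle, &is_linear) == 0) {
  printf("booster uses linear trees: %s\n", is_linear ? "yes" : "no");
}
```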
int LGBM_BoosterRefit(BoosterHandle handle, const int32_t* leaf_preds, int32_t nrow, int32_t ncol) {
  API_BEGIN();
  Booster* ref_booster = reinterpret_cast<Booster*>(handle);
...
@@ -335,13 +335,28 @@ void Config::CheckParamConflict() {
      Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
    }
  }
  // force gpu_use_dp for CUDA
  if (device_type == std::string("cuda") && !gpu_use_dp) {
    Log::Warning("CUDA currently requires double precision calculations.");
    gpu_use_dp = true;
  }
// linear tree learner must be serial type and cpu device
if (linear_tree) {
if (device_type == std::string("gpu")) {
device_type = "cpu";
Log::Warning("Linear tree learner only works with CPU.");
}
if (tree_learner != std::string("serial")) {
tree_learner = "serial";
Log::Warning("Linear tree learner must be serial.");
}
if (zero_as_missing) {
Log::Fatal("zero_as_missing must be false when fitting linear trees.");
}
if (objective == std::string("regresson_l1")) {
Log::Fatal("Cannot use regression_l1 objective when fitting linear trees.");
}
}
  // min_data_in_leaf must be at least 2 if path smoothing is active. This is because when the split is calculated
  // the count is calculated using the proportion of hessian in the leaf which is rounded up to nearest int, so it can
  // be 1 when there is actually no data in the leaf. In rare cases this can cause a bug because with path smoothing the
...
@@ -174,6 +174,7 @@ const std::unordered_set<std::string>& Config::parameter_set() {
  "task",
  "objective",
  "boosting",
  "linear_tree",
  "data",
  "valid",
  "num_iterations",
@@ -205,6 +206,7 @@ const std::unordered_set<std::string>& Config::parameter_set() {
  "max_delta_step",
  "lambda_l1",
  "lambda_l2",
  "linear_lambda",
  "min_gain_to_split",
  "drop_rate",
  "max_drop",
@@ -304,6 +306,8 @@ const std::unordered_set<std::string>& Config::parameter_set() {
void Config::GetMembersFromString(const std::unordered_map<std::string, std::string>& params) {
  std::string tmp_str = "";

  GetBool(params, "linear_tree", &linear_tree);

  GetString(params, "data", &data);

  if (GetString(params, "valid", &tmp_str)) {
@@ -380,6 +384,9 @@ void Config::GetMembersFromString(const std::unordered_map<std::string, std::str
  GetDouble(params, "lambda_l2", &lambda_l2);
  CHECK_GE(lambda_l2, 0.0);

  GetDouble(params, "linear_lambda", &linear_lambda);
  CHECK_GE(linear_lambda, 0.0);

  GetDouble(params, "min_gain_to_split", &min_gain_to_split);
  CHECK_GE(min_gain_to_split, 0.0);
@@ -622,6 +629,7 @@ void Config::GetMembersFromString(const std::unordered_map<std::string, std::str
std::string Config::SaveMembersToString() const {
  std::stringstream str_buf;
  str_buf << "[linear_tree: " << linear_tree << "]\n";
  str_buf << "[data: " << data << "]\n";
  str_buf << "[valid: " << Common::Join(valid, ",") << "]\n";
  str_buf << "[num_iterations: " << num_iterations << "]\n";
@@ -650,6 +658,7 @@ std::string Config::SaveMembersToString() const {
  str_buf << "[max_delta_step: " << max_delta_step << "]\n";
  str_buf << "[lambda_l1: " << lambda_l1 << "]\n";
  str_buf << "[lambda_l2: " << lambda_l2 << "]\n";
  str_buf << "[linear_lambda: " << linear_lambda << "]\n";
  str_buf << "[min_gain_to_split: " << min_gain_to_split << "]\n";
  str_buf << "[drop_rate: " << drop_rate << "]\n";
  str_buf << "[max_drop: " << max_drop << "]\n";
...
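
With these changes, the parameters block written by `SaveMembersToString` (and embedded in saved models) gains two entries; an illustrative excerpt with example values:

```
[linear_tree: 1]
[data: train.txt]
...
[lambda_l1: 0]
[lambda_l2: 0]
[linear_lambda: 0.01]
...
```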
@@ -26,6 +26,7 @@ Dataset::Dataset() {
  data_filename_ = "noname";
  num_data_ = 0;
  is_finish_load_ = false;
  has_raw_ = false;
}

Dataset::Dataset(data_size_t num_data) {
@@ -35,6 +36,7 @@ Dataset::Dataset(data_size_t num_data) {
  metadata_.Init(num_data_, NO_SPECIFIC, NO_SPECIFIC);
  is_finish_load_ = false;
  group_bin_boundaries_.push_back(0);
  has_raw_ = false;
}

Dataset::~Dataset() {}
@@ -411,6 +413,18 @@ void Dataset::Construct(std::vector<std::unique_ptr<BinMapper>>* bin_mappers,
  bin_construct_sample_cnt_ = io_config.bin_construct_sample_cnt;
  use_missing_ = io_config.use_missing;
  zero_as_missing_ = io_config.zero_as_missing;
has_raw_ = false;
if (io_config.linear_tree) {
has_raw_ = true;
}
numeric_feature_map_ = std::vector<int>(num_features_, -1);
num_numeric_features_ = 0;
for (int i = 0; i < num_features_; ++i) {
if (FeatureBinMapper(i)->bin_type() == BinType::NumericalBin) {
numeric_feature_map_[i] = num_numeric_features_;
++num_numeric_features_;
}
}
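
`numeric_feature_map_` gives each numerical feature a dense column index into the raw data, with -1 marking categorical features. A worked example with a hypothetical feature order:

```cpp
// features:             f0 = numeric, f1 = categorical, f2 = numeric, f3 = numeric
// numeric_feature_map_: {0,           -1,               1,            2}
// num_numeric_features_ = 3, so the raw store holds three float columns (f0, f2, f3)
```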
}

void Dataset::FinishLoad() {
@@ -696,6 +710,7 @@ void Dataset::CopyFeatureMapperFrom(const Dataset* dataset) {
  feature_groups_.clear();
  num_features_ = dataset->num_features_;
  num_groups_ = dataset->num_groups_;
  has_raw_ = dataset->has_raw();
  // copy feature bin mapper data
  for (int i = 0; i < num_groups_; ++i) {
    feature_groups_.emplace_back(
@@ -722,7 +737,10 @@ void Dataset::CreateValid(const Dataset* dataset) {
  num_groups_ = num_features_;
  feature2group_.clear();
  feature2subfeature_.clear();
  has_raw_ = dataset->has_raw();
  numeric_feature_map_ = dataset->numeric_feature_map_;
  num_numeric_features_ = dataset->num_numeric_features_;
  // copy feature bin mapper data
  feature_need_push_zeros_.clear();
  group_bin_boundaries_.clear();
  uint64_t num_total_bin = 0;
@@ -785,6 +803,17 @@ void Dataset::CopySubrow(const Dataset* fullset,
    metadata_.Init(fullset->metadata_, used_indices, num_used_indices);
  }
  is_finish_load_ = true;
numeric_feature_map_ = fullset->numeric_feature_map_;
num_numeric_features_ = fullset->num_numeric_features_;
if (has_raw_) {
ResizeRaw(num_used_indices);
#pragma omp parallel for schedule(static)
for (int i = 0; i < num_used_indices; ++i) {
for (int j = 0; j < num_numeric_features_; ++j) {
raw_data_[j][i] = fullset->raw_data_[j][used_indices[i]];
}
}
}
}

bool Dataset::SetFloatField(const char* field_name, const float* field_data,
@@ -922,8 +951,7 @@ void Dataset::SaveBinaryFile(const char* bin_filename) {
      2 * VirtualFileWriter::AlignedSize(sizeof(int) * num_groups_) +
      VirtualFileWriter::AlignedSize(sizeof(int32_t) * num_total_features_) +
      VirtualFileWriter::AlignedSize(sizeof(int)) * 3 +
      VirtualFileWriter::AlignedSize(sizeof(bool)) * 3;
  // size of feature names
  for (int i = 0; i < num_total_features_; ++i) {
    size_of_header +=
@@ -947,6 +975,7 @@ void Dataset::SaveBinaryFile(const char* bin_filename) {
    writer->AlignedWrite(&min_data_in_bin_, sizeof(min_data_in_bin_));
    writer->AlignedWrite(&use_missing_, sizeof(use_missing_));
    writer->AlignedWrite(&zero_as_missing_, sizeof(zero_as_missing_));
    writer->AlignedWrite(&has_raw_, sizeof(has_raw_));
    writer->AlignedWrite(used_feature_map_.data(),
                         sizeof(int) * num_total_features_);
    writer->AlignedWrite(&num_groups_, sizeof(num_groups_));
@@ -998,6 +1027,18 @@ void Dataset::SaveBinaryFile(const char* bin_filename) {
      // write feature
      feature_groups_[i]->SaveBinaryToFile(writer.get());
    }
// write raw data; use row-major order so we can read row-by-row
if (has_raw_) {
for (int i = 0; i < num_data_; ++i) {
for (int j = 0; j < num_features_; ++j) {
int feat_ind = numeric_feature_map_[j];
if (feat_ind > -1) {
writer->Write(&raw_data_[feat_ind][i], sizeof(float));
}
}
}
}
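
For reference, the raw block appended at the end of the binary file is row-major: for each of the `num_data_` rows, one float per numeric feature in increasing feature order. A sketch of the layout, assuming two numeric features f0 and f2:

```cpp
// file tail = [row 0: f0, f2][row 1: f0, f2] ... [row num_data_-1: f0, f2]
// LoadFromBinFile (dataset_loader.cpp below) reads it back row-by-row with
// row_size = num_numeric_features_ * sizeof(float).
```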
  }
}

@@ -1264,6 +1305,9 @@ void Dataset::AddFeaturesFrom(Dataset* other) {
        "Cannot add features from other Dataset with a different number of "
        "rows");
  }
  if (other->has_raw_ != has_raw_) {
    Log::Fatal("Can only add features from other Dataset if both or neither have raw data.");
  }
  int mv_gid = -1;
  int other_mv_gid = -1;
  for (int i = 0; i < num_groups_; ++i) {
@@ -1393,6 +1437,20 @@ void Dataset::AddFeaturesFrom(Dataset* other) {
  PushClearIfEmpty(&max_bin_by_feature_, num_total_features_,
                   other->max_bin_by_feature_, other->num_total_features_, -1);
  num_total_features_ += other->num_total_features_;
for (size_t i = 0; i < (other->numeric_feature_map_).size(); ++i) {
    int feat_ind = other->numeric_feature_map_[i];
if (feat_ind > -1) {
numeric_feature_map_.push_back(feat_ind + num_numeric_features_);
} else {
numeric_feature_map_.push_back(-1);
}
}
num_numeric_features_ += other->num_numeric_features_;
if (has_raw_) {
for (int i = 0; i < other->num_numeric_features_; ++i) {
raw_data_.push_back(other->raw_data_[i]);
}
}
}

}  // namespace LightGBM
@@ -22,6 +22,10 @@ DatasetLoader::DatasetLoader(const Config& io_config, const PredictFunction& pre
  weight_idx_ = NO_SPECIFIC;
  group_idx_ = NO_SPECIFIC;
  SetHeader(filename);
store_raw_ = false;
if (io_config.linear_tree) {
store_raw_ = true;
}
}

DatasetLoader::~DatasetLoader() {
@@ -183,6 +187,9 @@ Dataset* DatasetLoader::LoadFromFile(const char* filename, int rank, int num_mac
    }
  }
  auto dataset = std::unique_ptr<Dataset>(new Dataset());
if (store_raw_) {
dataset->SetHasRaw(true);
}
  data_size_t num_global_data = 0;
  std::vector<data_size_t> used_data_indices;
  auto bin_filename = CheckCanLoadFromBin(filename);
@@ -205,6 +212,9 @@ Dataset* DatasetLoader::LoadFromFile(const char* filename, int rank, int num_mac
                                       static_cast<size_t>(dataset->num_data_));
    // construct feature bin mappers
    ConstructBinMappersFromTextData(rank, num_machines, sample_data, parser.get(), dataset.get());
if (dataset->has_raw()) {
dataset->ResizeRaw(dataset->num_data_);
}
    // initialize label
    dataset->metadata_.Init(dataset->num_data_, weight_idx_, group_idx_);
    // extract features
@@ -222,6 +232,9 @@ Dataset* DatasetLoader::LoadFromFile(const char* filename, int rank, int num_mac
                                       static_cast<size_t>(dataset->num_data_));
    // construct feature bin mappers
    ConstructBinMappersFromTextData(rank, num_machines, sample_data, parser.get(), dataset.get());
if (dataset->has_raw()) {
dataset->ResizeRaw(dataset->num_data_);
}
    // initialize label
    dataset->metadata_.Init(dataset->num_data_, weight_idx_, group_idx_);
    Log::Debug("Making second pass...");
@@ -248,6 +261,9 @@ Dataset* DatasetLoader::LoadFromFileAlignWithOtherDataset(const char* filename,
  data_size_t num_global_data = 0;
  std::vector<data_size_t> used_data_indices;
  auto dataset = std::unique_ptr<Dataset>(new Dataset());
if (store_raw_) {
dataset->SetHasRaw(true);
}
  auto bin_filename = CheckCanLoadFromBin(filename);
  if (bin_filename.size() == 0) {
    auto parser = std::unique_ptr<Parser>(Parser::CreateParser(filename, config_.header, 0, label_idx_));
@@ -264,6 +280,9 @@ Dataset* DatasetLoader::LoadFromFileAlignWithOtherDataset(const char* filename,
    // initialize label
    dataset->metadata_.Init(dataset->num_data_, weight_idx_, group_idx_);
    dataset->CreateValid(train_data);
if (dataset->has_raw()) {
dataset->ResizeRaw(dataset->num_data_);
}
    // extract features
    ExtractFeaturesFromMemory(&text_data, parser.get(), dataset.get());
    text_data.clear();
@@ -275,6 +294,9 @@ Dataset* DatasetLoader::LoadFromFileAlignWithOtherDataset(const char* filename,
    // initialize label
    dataset->metadata_.Init(dataset->num_data_, weight_idx_, group_idx_);
    dataset->CreateValid(train_data);
if (dataset->has_raw()) {
dataset->ResizeRaw(dataset->num_data_);
}
    // extract features
    ExtractFeaturesFromFile(filename, parser.get(), used_data_indices, dataset.get());
  }
@@ -356,6 +378,8 @@ Dataset* DatasetLoader::LoadFromBinFile(const char* data_filename, const char* b
  mem_ptr += VirtualFileWriter::AlignedSize(sizeof(dataset->use_missing_));
  dataset->zero_as_missing_ = *(reinterpret_cast<const bool*>(mem_ptr));
  mem_ptr += VirtualFileWriter::AlignedSize(sizeof(dataset->zero_as_missing_));
  dataset->has_raw_ = *(reinterpret_cast<const bool*>(mem_ptr));
  mem_ptr += VirtualFileWriter::AlignedSize(sizeof(dataset->has_raw_));
  const int* tmp_feature_map = reinterpret_cast<const int*>(mem_ptr);
  dataset->used_feature_map_.clear();
  for (int i = 0; i < dataset->num_total_features_; ++i) {
@@ -550,6 +574,43 @@ Dataset* DatasetLoader::LoadFromBinFile(const char* data_filename, const char* b
                                     *used_data_indices, i)));
  }
  dataset->feature_groups_.shrink_to_fit();
// raw data
  dataset->numeric_feature_map_ = std::vector<int>(dataset->num_features_, -1);
dataset->num_numeric_features_ = 0;
for (int i = 0; i < dataset->num_features_; ++i) {
if (dataset->FeatureBinMapper(i)->bin_type() == BinType::CategoricalBin) {
dataset->numeric_feature_map_[i] = -1;
} else {
dataset->numeric_feature_map_[i] = dataset->num_numeric_features_;
++dataset->num_numeric_features_;
}
}
if (dataset->has_raw()) {
dataset->ResizeRaw(dataset->num_data());
size_t row_size = dataset->num_numeric_features_ * sizeof(float);
if (row_size > buffer_size) {
buffer_size = row_size;
buffer.resize(buffer_size);
}
for (int i = 0; i < dataset->num_data(); ++i) {
read_cnt = reader->Read(buffer.data(), row_size);
if (read_cnt != row_size) {
Log::Fatal("Binary file error: row %d of raw data is incorrect, read count: %d", i, read_cnt);
}
mem_ptr = buffer.data();
const float* tmp_ptr_raw_row = reinterpret_cast<const float*>(mem_ptr);
for (int j = 0; j < dataset->num_features(); ++j) {
int feat_ind = dataset->numeric_feature_map_[j];
if (feat_ind >= 0) {
dataset->raw_data_[feat_ind][i] = tmp_ptr_raw_row[feat_ind];
}
}
mem_ptr += row_size;
}
}
  dataset->is_finish_load_ = true;
  return dataset.release();
}
@@ -704,6 +765,9 @@ Dataset* DatasetLoader::ConstructFromSampleData(double** sample_values,
  }
  auto dataset = std::unique_ptr<Dataset>(new Dataset(num_data));
  dataset->Construct(&bin_mappers, num_total_features, forced_bin_bounds, sample_indices, sample_values, num_per_col, num_col, total_sample_size, config_);
if (dataset->has_raw()) {
dataset->ResizeRaw(num_data);
}
  dataset->set_feature_names(feature_names_);
  return dataset.release();
}
@@ -1061,6 +1125,9 @@ void DatasetLoader::ConstructBinMappersFromTextData(int rank, int num_machines,
  dataset->Construct(&bin_mappers, dataset->num_total_features_, forced_bin_bounds, Common::Vector2Ptr<int>(&sample_indices).data(),
                     Common::Vector2Ptr<double>(&sample_values).data(),
                     Common::VectorSize<int>(sample_indices).data(), static_cast<int>(sample_indices.size()), sample_data.size(), config_);
if (dataset->has_raw()) {
dataset->ResizeRaw(sample_data.size());
}
}

/*! \brief Extract local features from memory */
@@ -1068,10 +1135,11 @@ void DatasetLoader::ExtractFeaturesFromMemory(std::vector<std::string>* text_dat
  std::vector<std::pair<int, double>> oneline_features;
  double tmp_label = 0.0f;
  auto& ref_text_data = *text_data;
std::vector<float> feature_row(dataset->num_features_);
  if (predict_fun_ == nullptr) {
    OMP_INIT_EX();
    // if doesn't need to prediction with initial model
    #pragma omp parallel for schedule(static) private(oneline_features) firstprivate(tmp_label, feature_row)
    for (data_size_t i = 0; i < dataset->num_data_; ++i) {
      OMP_LOOP_EX_BEGIN();
      const int tid = omp_get_thread_num();
@@ -1095,6 +1163,9 @@ void DatasetLoader::ExtractFeaturesFromMemory(std::vector<std::string>* text_dat
          int group = dataset->feature2group_[feature_idx];
          int sub_feature = dataset->feature2subfeature_[feature_idx];
          dataset->feature_groups_[group]->PushData(tid, sub_feature, i, inner_data.second);
if (dataset->has_raw()) {
feature_row[feature_idx] = inner_data.second;
}
        } else {
          if (inner_data.first == weight_idx_) {
            dataset->metadata_.SetWeightAt(i, static_cast<label_t>(inner_data.second));
@@ -1103,6 +1174,14 @@ void DatasetLoader::ExtractFeaturesFromMemory(std::vector<std::string>* text_dat
          }
        }
      }
if (dataset->has_raw()) {
for (size_t j = 0; j < feature_row.size(); ++j) {
int feat_ind = dataset->numeric_feature_map_[j];
if (feat_ind >= 0) {
dataset->raw_data_[feat_ind][i] = feature_row[j];
}
}
}
      dataset->FinishOneRow(tid, i, is_feature_added);
      OMP_LOOP_EX_END();
    }
@@ -1111,7 +1190,7 @@ void DatasetLoader::ExtractFeaturesFromMemory(std::vector<std::string>* text_dat
    OMP_INIT_EX();
    // if need to prediction with initial model
    std::vector<double> init_score(dataset->num_data_ * num_class_);
    #pragma omp parallel for schedule(static) private(oneline_features) firstprivate(tmp_label, feature_row)
    for (data_size_t i = 0; i < dataset->num_data_; ++i) {
      OMP_LOOP_EX_BEGIN();
      const int tid = omp_get_thread_num();
@@ -1141,6 +1220,9 @@ void DatasetLoader::ExtractFeaturesFromMemory(std::vector<std::string>* text_dat
          int group = dataset->feature2group_[feature_idx];
          int sub_feature = dataset->feature2subfeature_[feature_idx];
          dataset->feature_groups_[group]->PushData(tid, sub_feature, i, inner_data.second);
if (dataset->has_raw()) {
feature_row[feature_idx] = inner_data.second;
}
        } else {
          if (inner_data.first == weight_idx_) {
            dataset->metadata_.SetWeightAt(i, static_cast<label_t>(inner_data.second));
@@ -1150,6 +1232,14 @@ void DatasetLoader::ExtractFeaturesFromMemory(std::vector<std::string>* text_dat
          }
        }
      }
      dataset->FinishOneRow(tid, i, is_feature_added);
if (dataset->has_raw()) {
for (size_t j = 0; j < feature_row.size(); ++j) {
int feat_ind = dataset->numeric_feature_map_[j];
if (feat_ind >= 0) {
dataset->raw_data_[feat_ind][i] = feature_row[j];
}
}
}
      OMP_LOOP_EX_END();
    }
    OMP_THROW_EX();
@@ -1173,8 +1263,9 @@ void DatasetLoader::ExtractFeaturesFromFile(const char* filename, const Parser*
      (data_size_t start_idx, const std::vector<std::string>& lines) {
    std::vector<std::pair<int, double>> oneline_features;
    double tmp_label = 0.0f;
std::vector<float> feature_row(dataset->num_features_);
    OMP_INIT_EX();
    #pragma omp parallel for schedule(static) private(oneline_features) firstprivate(tmp_label, feature_row)
    for (data_size_t i = 0; i < static_cast<data_size_t>(lines.size()); ++i) {
      OMP_LOOP_EX_BEGIN();
      const int tid = omp_get_thread_num();
@@ -1202,6 +1293,9 @@ void DatasetLoader::ExtractFeaturesFromFile(const char* filename, const Parser*
          int group = dataset->feature2group_[feature_idx];
          int sub_feature = dataset->feature2subfeature_[feature_idx];
          dataset->feature_groups_[group]->PushData(tid, sub_feature, start_idx + i, inner_data.second);
if (dataset->has_raw()) {
feature_row[feature_idx] = inner_data.second;
}
        } else {
          if (inner_data.first == weight_idx_) {
            dataset->metadata_.SetWeightAt(start_idx + i, static_cast<label_t>(inner_data.second));
@@ -1210,6 +1304,14 @@ void DatasetLoader::ExtractFeaturesFromFile(const char* filename, const Parser*
          }
        }
      }
if (dataset->has_raw()) {
for (size_t j = 0; j < feature_row.size(); ++j) {
int feat_ind = dataset->numeric_feature_map_[j];
if (feat_ind >= 0) {
dataset->raw_data_[feat_ind][i] = feature_row[j];
}
}
}
      dataset->FinishOneRow(tid, i, is_feature_added);
      OMP_LOOP_EX_END();
    }
...
@@ -14,7 +14,7 @@
namespace LightGBM {

Tree::Tree(int max_leaves, bool track_branch_features, bool is_linear)
  :max_leaves_(max_leaves), track_branch_features_(track_branch_features) {
  left_child_.resize(max_leaves_ - 1);
  right_child_.resize(max_leaves_ - 1);
@@ -46,6 +46,13 @@ Tree::Tree(int max_leaves, bool track_branch_features)
  cat_boundaries_.push_back(0);
  cat_boundaries_inner_.push_back(0);
  max_depth_ = -1;
is_linear_ = is_linear;
if (is_linear_) {
leaf_coeff_.resize(max_leaves_);
leaf_const_ = std::vector<double>(max_leaves_, 0);
leaf_features_.resize(max_leaves_);
leaf_features_inner_.resize(max_leaves_);
}
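
A summary of the per-leaf linear-model storage initialized here, with meanings inferred from how the members are used elsewhere in this diff:

```cpp
// leaf_const_[leaf]           double: the model's intercept (add_score starts from it)
// leaf_coeff_[leaf]           std::vector<double>: one coefficient per model feature
// leaf_features_[leaf]        real (outer) feature indices of the leaf's model
// leaf_features_inner_[leaf]  inner feature indices, resolved to raw columns via data->raw_index()
```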
}

int Tree::Split(int leaf, int feature, int real_feature, uint32_t threshold_bin,
@@ -103,8 +110,42 @@ int Tree::SplitCategorical(int leaf, int feature, int real_feature, const uint32
    score[(data_idx)] += static_cast<double>(leaf_value_[~node]); \
  }\
#define PredictionFunLinear(niter, fidx_in_iter, start_pos, decision_fun, \
iter_idx, data_idx) \
std::vector<std::unique_ptr<BinIterator>> iter((niter)); \
for (int i = 0; i < (niter); ++i) { \
iter[i].reset(data->FeatureIterator((fidx_in_iter))); \
iter[i]->Reset((start_pos)); \
} \
for (data_size_t i = start; i < end; ++i) { \
int node = 0; \
while (node >= 0) { \
node = decision_fun(iter[(iter_idx)]->Get((data_idx)), node, \
default_bins[node], max_bins[node]); \
} \
double add_score = leaf_const_[~node]; \
bool nan_found = false; \
const double* coeff_ptr = leaf_coeff_[~node].data(); \
const float** data_ptr = feat_ptr[~node].data(); \
for (size_t j = 0; j < leaf_features_inner_[~node].size(); ++j) { \
float feat_val = data_ptr[j][(data_idx)]; \
if (std::isnan(feat_val)) { \
nan_found = true; \
break; \
} \
add_score += coeff_ptr[j] * feat_val; \
} \
if (nan_found) { \
score[(data_idx)] += leaf_value_[~node]; \
} else { \
score[(data_idx)] += add_score; \
} \
}\
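
The macro above is instantiated four ways in AddPredictionToScore below. For readability, a sketch of the same per-row scoring logic written as a plain function (`LinearLeafScore` is a hypothetical helper, equivalent in behaviour to the macro's inner loop):

```cpp
double Tree::LinearLeafScore(int leaf, data_size_t row,
                             const std::vector<const float*>& cols) const {
  // Start from the leaf model's intercept, then accumulate coefficient * feature.
  double out = leaf_const_[leaf];
  const double* coeff = leaf_coeff_[leaf].data();
  for (size_t j = 0; j < leaf_features_inner_[leaf].size(); ++j) {
    const float v = cols[j][row];
    if (std::isnan(v)) {
      return leaf_value_[leaf];  // any NaN feature: fall back to the constant leaf value
    }
    out += coeff[j] * v;
  }
  return out;
}
```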
void Tree::AddPredictionToScore(const Dataset* data, data_size_t num_data, double* score) const {
  if (!is_linear_ && num_leaves_ <= 1) {
    if (leaf_value_[0] != 0.0f) {
#pragma omp parallel for schedule(static, 512) if (num_data >= 1024)
      for (data_size_t i = 0; i < num_data; ++i) {
@@ -121,37 +162,71 @@ void Tree::AddPredictionToScore(const Dataset* data, data_size_t num_data, doubl
    default_bins[i] = bin_mapper->GetDefaultBin();
    max_bins[i] = bin_mapper->num_bin() - 1;
  }
  if (is_linear_) {
    std::vector<std::vector<const float*>> feat_ptr(num_leaves_);
    for (int leaf_num = 0; leaf_num < num_leaves_; ++leaf_num) {
      for (int feat : leaf_features_inner_[leaf_num]) {
        feat_ptr[leaf_num].push_back(data->raw_index(feat));
      }
    }
    if (num_cat_ > 0) {
      if (data->num_features() > num_leaves_ - 1) {
        Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, &default_bins, &max_bins, &feat_ptr]
        (int, data_size_t start, data_size_t end) {
          PredictionFunLinear(num_leaves_ - 1, split_feature_inner_[i], start, DecisionInner, node, i);
        });
      } else {
        Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, &default_bins, &max_bins, &feat_ptr]
        (int, data_size_t start, data_size_t end) {
          PredictionFunLinear(data->num_features(), i, start, DecisionInner, split_feature_inner_[node], i);
        });
      }
    } else {
      if (data->num_features() > num_leaves_ - 1) {
        Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, &default_bins, &max_bins, &feat_ptr]
        (int, data_size_t start, data_size_t end) {
          PredictionFunLinear(num_leaves_ - 1, split_feature_inner_[i], start, NumericalDecisionInner, node, i);
        });
      } else {
        Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, &default_bins, &max_bins, &feat_ptr]
        (int, data_size_t start, data_size_t end) {
          PredictionFunLinear(data->num_features(), i, start, NumericalDecisionInner, split_feature_inner_[node], i);
        });
      }
    }
  } else {
    if (num_cat_ > 0) {
      if (data->num_features() > num_leaves_ - 1) {
        Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, &default_bins, &max_bins]
        (int, data_size_t start, data_size_t end) {
          PredictionFun(num_leaves_ - 1, split_feature_inner_[i], start, DecisionInner, node, i);
        });
      } else {
        Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, &default_bins, &max_bins]
        (int, data_size_t start, data_size_t end) {
          PredictionFun(data->num_features(), i, start, DecisionInner, split_feature_inner_[node], i);
        });
      }
    } else {
      if (data->num_features() > num_leaves_ - 1) {
        Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, &default_bins, &max_bins]
        (int, data_size_t start, data_size_t end) {
          PredictionFun(num_leaves_ - 1, split_feature_inner_[i], start, NumericalDecisionInner, node, i);
        });
      } else {
        Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, &default_bins, &max_bins]
        (int, data_size_t start, data_size_t end) {
          PredictionFun(data->num_features(), i, start, NumericalDecisionInner, split_feature_inner_[node], i);
        });
      }
    }
  }
}
void Tree::AddPredictionToScore(const Dataset* data,
const data_size_t* used_data_indices,
data_size_t num_data, double* score) const {
if (!is_linear_ && num_leaves_ <= 1) {
if (leaf_value_[0] != 0.0f) {
#pragma omp parallel for schedule(static, 512) if (num_data >= 1024)
for (data_size_t i = 0; i < num_data; ++i) {
...@@ -168,34 +243,72 @@ void Tree::AddPredictionToScore(const Dataset* data,
default_bins[i] = bin_mapper->GetDefaultBin();
max_bins[i] = bin_mapper->num_bin() - 1;
}
if (is_linear_) {
std::vector<std::vector<const float*>> feat_ptr(num_leaves_);
for (int leaf_num = 0; leaf_num < num_leaves_; ++leaf_num) {
for (int feat : leaf_features_inner_[leaf_num]) {
feat_ptr[leaf_num].push_back(data->raw_index(feat));
}
}
if (num_cat_ > 0) {
if (data->num_features() > num_leaves_ - 1) {
Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, used_data_indices, &default_bins, &max_bins, &feat_ptr]
(int, data_size_t start, data_size_t end) {
PredictionFunLinear(num_leaves_ - 1, split_feature_inner_[i], used_data_indices[start], DecisionInner,
node, used_data_indices[i]);
});
} else {
Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, used_data_indices, &default_bins, &max_bins, &feat_ptr]
(int, data_size_t start, data_size_t end) {
PredictionFunLinear(data->num_features(), i, used_data_indices[start], DecisionInner, split_feature_inner_[node], used_data_indices[i]);
});
}
} else {
if (data->num_features() > num_leaves_ - 1) {
Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, used_data_indices, &default_bins, &max_bins, &feat_ptr]
(int, data_size_t start, data_size_t end) {
PredictionFunLinear(num_leaves_ - 1, split_feature_inner_[i], used_data_indices[start], NumericalDecisionInner,
node, used_data_indices[i]);
});
} else {
Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, used_data_indices, &default_bins, &max_bins, &feat_ptr]
(int, data_size_t start, data_size_t end) {
PredictionFunLinear(data->num_features(), i, used_data_indices[start], NumericalDecisionInner,
split_feature_inner_[node], used_data_indices[i]);
});
}
}
} else {
if (num_cat_ > 0) {
if (data->num_features() > num_leaves_ - 1) {
Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, used_data_indices, &default_bins, &max_bins]
(int, data_size_t start, data_size_t end) {
PredictionFun(num_leaves_ - 1, split_feature_inner_[i], used_data_indices[start], DecisionInner, node, used_data_indices[i]);
});
} else {
Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, used_data_indices, &default_bins, &max_bins]
(int, data_size_t start, data_size_t end) {
PredictionFun(data->num_features(), i, used_data_indices[start], DecisionInner, split_feature_inner_[node], used_data_indices[i]);
});
}
} else {
if (data->num_features() > num_leaves_ - 1) {
Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, used_data_indices, &default_bins, &max_bins]
(int, data_size_t start, data_size_t end) {
PredictionFun(num_leaves_ - 1, split_feature_inner_[i], used_data_indices[start], NumericalDecisionInner, node, used_data_indices[i]);
});
} else {
Threading::For<data_size_t>(0, num_data, 512, [this, &data, score, used_data_indices, &default_bins, &max_bins]
(int, data_size_t start, data_size_t end) {
PredictionFun(data->num_features(), i, used_data_indices[start], NumericalDecisionInner, split_feature_inner_[node], used_data_indices[i]);
});
}
}
}
}
#undef PredictionFun
#undef PredictionFunLinear
double Tree::GetUpperBoundValue() const {
double upper_bound = leaf_value_[0];
...@@ -259,8 +372,37 @@ std::string Tree::ToString() const {
str_buf << "cat_threshold="
<< ArrayToString(cat_threshold_, cat_threshold_.size()) << '\n';
}
str_buf << "is_linear=" << is_linear_ << '\n';
if (is_linear_) {
str_buf << "leaf_const="
<< ArrayToString(leaf_const_, num_leaves_) << '\n';
std::vector<int> num_feat(num_leaves_);
for (int i = 0; i < num_leaves_; ++i) {
num_feat[i] = leaf_coeff_[i].size();
}
str_buf << "num_features="
<< ArrayToString(num_feat, num_leaves_) << '\n';
str_buf << "leaf_features=";
for (int i = 0; i < num_leaves_; ++i) {
if (num_feat[i] > 0) {
str_buf << ArrayToString(leaf_features_[i], leaf_features_[i].size()) << ' ';
}
str_buf << ' ';
}
str_buf << '\n';
str_buf << "leaf_coeff=";
for (int i = 0; i < num_leaves_; ++i) {
if (num_feat[i] > 0) {
str_buf << ArrayToString(leaf_coeff_[i], leaf_coeff_[i].size()) << ' ';
}
str_buf << ' ';
}
str_buf << '\n';
}
str_buf << "shrinkage=" << shrinkage_ << '\n'; str_buf << "shrinkage=" << shrinkage_ << '\n';
str_buf << '\n'; str_buf << '\n';
return str_buf.str(); return str_buf.str();
} }
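For readers following the serialization code above, the linear-tree fields in a saved model file come out roughly like the sketch below (values illustrative only; leaf_features and leaf_coeff are the per-leaf arrays written one after another, with an empty leaf contributing just a separator space):

is_linear=1
leaf_const=0.5 -0.25 0.1
num_features=1 0 2
leaf_features=3  1 4
leaf_coeff=0.08  0.2 -0.3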
...@@ -508,7 +650,7 @@ std::string Tree::NodeToIfElseByMap(int index, bool predict_leaf_index) const {
Tree::Tree(const char* str, size_t* used_len) {
auto p = str;
std::unordered_map<std::string, std::string> key_vals;
const int max_num_line = 22;
int read_line = 0;
while (read_line < max_num_line) {
if (*p == '\r' || *p == '\n') break;
...@@ -549,7 +691,13 @@ Tree::Tree(const char* str, size_t* used_len) {
shrinkage_ = 1.0f;
}
if (key_vals.count("is_linear")) {
int is_linear_int;
Common::Atoi(key_vals["is_linear"].c_str(), &is_linear_int);
is_linear_ = static_cast<bool>(is_linear_int);
}
if ((num_leaves_ <= 1) && !is_linear_) { return; }
if (key_vals.count("left_child")) { if (key_vals.count("left_child")) {
left_child_ = CommonC::StringToArrayFast<int>(key_vals["left_child"], num_leaves_ - 1); left_child_ = CommonC::StringToArrayFast<int>(key_vals["left_child"], num_leaves_ - 1);
...@@ -617,6 +765,45 @@ Tree::Tree(const char* str, size_t* used_len) { ...@@ -617,6 +765,45 @@ Tree::Tree(const char* str, size_t* used_len) {
decision_type_ = std::vector<int8_t>(num_leaves_ - 1, 0); decision_type_ = std::vector<int8_t>(num_leaves_ - 1, 0);
} }
if (is_linear_) {
if (key_vals.count("leaf_const")) {
leaf_const_ = Common::StringToArrayFast<double>(key_vals["leaf_const"], num_leaves_);
} else {
leaf_const_.resize(num_leaves_);
}
std::vector<int> num_feat;
if (key_vals.count("num_features")) {
num_feat = Common::StringToArrayFast<int>(key_vals["num_features"], num_leaves_);
}
leaf_coeff_.resize(num_leaves_);
leaf_features_.resize(num_leaves_);
leaf_features_inner_.resize(num_leaves_);
if (num_feat.size() > 0) {
int total_num_feat = 0;
for (size_t i = 0; i < num_feat.size(); ++i) { total_num_feat += num_feat[i]; }
std::vector<int> all_leaf_features;
if (key_vals.count("leaf_features")) {
all_leaf_features = Common::StringToArrayFast<int>(key_vals["leaf_features"], total_num_feat);
}
std::vector<double> all_leaf_coeff;
if (key_vals.count("leaf_coeff")) {
all_leaf_coeff = Common::StringToArrayFast<double>(key_vals["leaf_coeff"], total_num_feat);
}
int sum_num_feat = 0;
for (int i = 0; i < num_leaves_; ++i) {
if (num_feat[i] > 0) {
if (key_vals.count("leaf_features")) {
leaf_features_[i].assign(all_leaf_features.begin() + sum_num_feat, all_leaf_features.begin() + sum_num_feat + num_feat[i]);
}
if (key_vals.count("leaf_coeff")) {
leaf_coeff_[i].assign(all_leaf_coeff.begin() + sum_num_feat, all_leaf_coeff.begin() + sum_num_feat + num_feat[i]);
}
}
sum_num_feat += num_feat[i];
}
}
}
if (num_cat_ > 0) {
if (key_vals.count("cat_boundaries")) {
cat_boundaries_ = CommonC::StringToArrayFast<int>(key_vals["cat_boundaries"], num_cat_ + 1);
......
...@@ -734,8 +734,8 @@ void GPUTreeLearner::InitGPU(int platform_id, int device_id) {
SetupKernelArguments();
}
Tree* GPUTreeLearner::Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) {
return SerialTreeLearner::Train(gradients, hessians, is_first_tree);
}
void GPUTreeLearner::ResetTrainingDataInner(const Dataset* train_data, bool is_constant_hessian, bool reset_multi_val_bin) {
......
...@@ -48,7 +48,7 @@ class GPUTreeLearner: public SerialTreeLearner {
void Init(const Dataset* train_data, bool is_constant_hessian) override;
void ResetTrainingDataInner(const Dataset* train_data, bool is_constant_hessian, bool reset_multi_val_bin) override;
void ResetIsConstantHessian(bool is_constant_hessian) override;
Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) override;
void SetBaggingData(const Dataset* subset, const data_size_t* used_indices, data_size_t num_data) override {
SerialTreeLearner::SetBaggingData(subset, used_indices, num_data);
......
/*!
* Copyright (c) 2016 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#include "linear_tree_learner.h"
#include <algorithm>
#ifndef LGB_R_BUILD
// preprocessor definition ensures we use only MPL2-licensed code
#define EIGEN_MPL2_ONLY
#include <Eigen/Dense>
#endif // !LGB_R_BUILD
namespace LightGBM {
void LinearTreeLearner::Init(const Dataset* train_data, bool is_constant_hessian) {
SerialTreeLearner::Init(train_data, is_constant_hessian);
LinearTreeLearner::InitLinear(train_data, config_->num_leaves);
}
void LinearTreeLearner::InitLinear(const Dataset* train_data, const int max_leaves) {
leaf_map_ = std::vector<int>(train_data->num_data(), -1);
contains_nan_ = std::vector<int8_t>(train_data->num_features(), 0);
// identify features containing nans
#pragma omp parallel for schedule(static)
for (int feat = 0; feat < train_data->num_features(); ++feat) {
auto bin_mapper = train_data_->FeatureBinMapper(feat);
if (bin_mapper->bin_type() == BinType::NumericalBin) {
const float* feat_ptr = train_data_->raw_index(feat);
for (int i = 0; i < train_data->num_data(); ++i) {
if (std::isnan(feat_ptr[i])) {
contains_nan_[feat] = 1;
break;
}
}
}
}
for (int feat = 0; feat < train_data->num_features(); ++feat) {
if (contains_nan_[feat]) {
any_nan_ = true;
break;
}
}
// preallocate the matrix used to calculate linear model coefficients
int max_num_feat = std::min(max_leaves, train_data_->num_numeric_features());
XTHX_.clear();
XTg_.clear();
for (int i = 0; i < max_leaves; ++i) {
// store only upper triangular half of matrix as an array, in row-major order
// this requires (max_num_feat + 1) * (max_num_feat + 2) / 2 entries (including the constant terms of the regression)
// we add another 8 to ensure cache lines are not shared among processors
XTHX_.push_back(std::vector<float>((max_num_feat + 1) * (max_num_feat + 2) / 2 + 8, 0));
XTg_.push_back(std::vector<float>(max_num_feat + 9, 0.0));
}
XTHX_by_thread_.clear();
XTg_by_thread_.clear();
int max_threads = omp_get_max_threads();
for (int i = 0; i < max_threads; ++i) {
XTHX_by_thread_.push_back(XTHX_);
XTg_by_thread_.push_back(XTg_);
}
}
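The packed layout preallocated above can be illustrated with a short sketch (Python, illustrative only, not part of the patch): the upper triangle of the symmetric (num_feat + 1) x (num_feat + 1) matrix is stored row-major in a flat array, which needs (num_feat + 1) * (num_feat + 2) / 2 entries.

import numpy as np

def pack_upper_triangular(mat):
    # flatten the upper triangle (including the diagonal) row by row,
    # matching the j-counter walk over XTHX_ in CalculateLinear below
    n = mat.shape[0]
    return np.array([mat[i, j] for i in range(n) for j in range(i, n)])

num_feat = 3  # features in the leaf; one extra row/column holds the constant term
mat = np.ones((num_feat + 1, num_feat + 1))
packed = pack_upper_triangular(mat)
assert packed.size == (num_feat + 1) * (num_feat + 2) // 2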
Tree* LinearTreeLearner::Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) {
Common::FunctionTimer fun_timer("LinearTreeLearner::Train", global_timer);
gradients_ = gradients;
hessians_ = hessians;
int num_threads = OMP_NUM_THREADS();
if (share_state_->num_threads != num_threads && share_state_->num_threads > 0) {
Log::Warning(
"Detected that num_threads changed during training (from %d to %d), "
"it may cause unexpected errors.",
share_state_->num_threads, num_threads);
}
share_state_->num_threads = num_threads;
// some initial work before training
BeforeTrain();
auto tree = std::unique_ptr<Tree>(new Tree(config_->num_leaves, true, true));
auto tree_ptr = tree.get();
constraints_->ShareTreePointer(tree_ptr);
// root leaf
int left_leaf = 0;
int cur_depth = 1;
// only the root leaf can be split the first time
int right_leaf = -1;
int init_splits = ForceSplits(tree_ptr, &left_leaf, &right_leaf, &cur_depth);
for (int split = init_splits; split < config_->num_leaves - 1; ++split) {
// some initial work before finding the best split
if (BeforeFindBestSplit(tree_ptr, left_leaf, right_leaf)) {
// find best threshold for every feature
FindBestSplits(tree_ptr);
}
// Get a leaf with max split gain
int best_leaf = static_cast<int>(ArrayArgs<SplitInfo>::ArgMax(best_split_per_leaf_));
// Get split information for best leaf
const SplitInfo& best_leaf_SplitInfo = best_split_per_leaf_[best_leaf];
// cannot split, quit
if (best_leaf_SplitInfo.gain <= 0.0) {
Log::Warning("No further splits with positive gain, best gain: %f", best_leaf_SplitInfo.gain);
break;
}
// split tree with best leaf
Split(tree_ptr, best_leaf, &left_leaf, &right_leaf);
cur_depth = std::max(cur_depth, tree->leaf_depth(left_leaf));
}
bool has_nan = false;
if (any_nan_) {
for (int i = 0; i < tree->num_leaves() - 1; ++i) {
if (contains_nan_[tree_ptr->split_feature_inner(i)]) {
has_nan = true;
break;
}
}
}
GetLeafMap(tree_ptr);
if (has_nan) {
CalculateLinear<true>(tree_ptr, false, gradients_, hessians_, is_first_tree);
} else {
CalculateLinear<false>(tree_ptr, false, gradients_, hessians_, is_first_tree);
}
Log::Debug("Trained a tree with leaves = %d and max_depth = %d", tree->num_leaves(), cur_depth);
return tree.release();
}
Tree* LinearTreeLearner::FitByExistingTree(const Tree* old_tree, const score_t* gradients, const score_t *hessians) const {
auto tree = SerialTreeLearner::FitByExistingTree(old_tree, gradients, hessians);
bool has_nan = false;
if (any_nan_) {
for (int i = 0; i < tree->num_leaves() - 1; ++i) {
if (contains_nan_[train_data_->InnerFeatureIndex(tree->split_feature(i))]) {
has_nan = true;
break;
}
}
}
GetLeafMap(tree);
if (has_nan) {
CalculateLinear<true>(tree, true, gradients, hessians, false);
} else {
CalculateLinear<false>(tree, true, gradients, hessians, false);
}
return tree;
}
Tree* LinearTreeLearner::FitByExistingTree(const Tree* old_tree, const std::vector<int>& leaf_pred,
const score_t* gradients, const score_t *hessians) const {
data_partition_->ResetByLeafPred(leaf_pred, old_tree->num_leaves());
return LinearTreeLearner::FitByExistingTree(old_tree, gradients, hessians);
}
void LinearTreeLearner::GetLeafMap(Tree* tree) const {
std::fill(leaf_map_.begin(), leaf_map_.end(), -1);
// map data to leaf number
const data_size_t* ind = data_partition_->indices();
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < tree->num_leaves(); ++i) {
data_size_t idx = data_partition_->leaf_begin(i);
for (int j = 0; j < data_partition_->leaf_count(i); ++j) {
leaf_map_[ind[idx + j]] = i;
}
}
}
#ifdef LGB_R_BUILD
template<bool HAS_NAN>
void LinearTreeLearner::CalculateLinear(Tree* tree, bool is_refit, const score_t* gradients, const score_t* hessians, bool is_first_tree) const {
Log::Fatal("Linear tree learner does not work with R package.");
}
#else
template<bool HAS_NAN>
void LinearTreeLearner::CalculateLinear(Tree* tree, bool is_refit, const score_t* gradients, const score_t* hessians, bool is_first_tree) const {
tree->SetIsLinear(true);
int num_leaves = tree->num_leaves();
int num_threads = OMP_NUM_THREADS();
if (is_first_tree) {
for (int leaf_num = 0; leaf_num < num_leaves; ++leaf_num) {
tree->SetLeafConst(leaf_num, tree->LeafOutput(leaf_num));
}
return;
}
// calculate coefficients using the method described in Eq 3 of https://arxiv.org/pdf/1802.05640.pdf
// the coefficients vector is given by
// - (X_T * H * X + lambda) ^ (-1) * (X_T * g)
// where:
// X is the matrix of the leaf's feature values, with a final column of all ones,
// H is the diagonal matrix of the hessian,
// lambda is the diagonal matrix with diagonal entries equal to the regularisation term linear_lambda
// g is the vector of gradients
// the subscript _T denotes the transpose
// create array of pointers to raw data, and coefficient matrices, for each leaf
std::vector<std::vector<int>> leaf_features;
std::vector<int> leaf_num_features;
std::vector<std::vector<const float*>> raw_data_ptr;
int max_num_features = 0;
for (int i = 0; i < num_leaves; ++i) {
std::vector<int> raw_features;
if (is_refit) {
raw_features = tree->LeafFeatures(i);
} else {
raw_features = tree->branch_features(i);
}
std::sort(raw_features.begin(), raw_features.end());
auto new_end = std::unique(raw_features.begin(), raw_features.end());
raw_features.erase(new_end, raw_features.end());
std::vector<int> numerical_features;
std::vector<const float*> data_ptr;
for (size_t j = 0; j < raw_features.size(); ++j) {
int feat = train_data_->InnerFeatureIndex(raw_features[j]);
auto bin_mapper = train_data_->FeatureBinMapper(feat);
if (bin_mapper->bin_type() == BinType::NumericalBin) {
numerical_features.push_back(feat);
data_ptr.push_back(train_data_->raw_index(feat));
}
}
leaf_features.push_back(numerical_features);
raw_data_ptr.push_back(data_ptr);
leaf_num_features.push_back(numerical_features.size());
if (static_cast<int>(numerical_features.size()) > max_num_features) {
max_num_features = numerical_features.size();
}
}
// clear the coefficient matrices
#pragma omp parallel for schedule(static)
for (int i = 0; i < num_threads; ++i) {
for (int leaf_num = 0; leaf_num < num_leaves; ++leaf_num) {
int num_feat = leaf_features[leaf_num].size();
std::fill(XTHX_by_thread_[i][leaf_num].begin(), XTHX_by_thread_[i][leaf_num].begin() + (num_feat + 1) * (num_feat + 2) / 2, 0);
std::fill(XTg_by_thread_[i][leaf_num].begin(), XTg_by_thread_[i][leaf_num].begin() + num_feat + 1, 0);
}
}
#pragma omp parallel for schedule(static)
for (int leaf_num = 0; leaf_num < num_leaves; ++leaf_num) {
int num_feat = leaf_features[leaf_num].size();
std::fill(XTHX_[leaf_num].begin(), XTHX_[leaf_num].begin() + (num_feat + 1) * (num_feat + 2) / 2, 0);
std::fill(XTg_[leaf_num].begin(), XTg_[leaf_num].begin() + num_feat + 1, 0);
}
std::vector<std::vector<int>> num_nonzero;
for (int i = 0; i < num_threads; ++i) {
if (HAS_NAN) {
num_nonzero.push_back(std::vector<int>(num_leaves, 0));
}
}
OMP_INIT_EX();
#pragma omp parallel if (num_data_ > 1024)
{
std::vector<float> curr_row(max_num_features + 1);
int tid = omp_get_thread_num();
#pragma omp for schedule(static)
for (int i = 0; i < num_data_; ++i) {
OMP_LOOP_EX_BEGIN();
int leaf_num = leaf_map_[i];
if (leaf_num < 0) {
continue;
}
bool nan_found = false;
int num_feat = leaf_num_features[leaf_num];
for (int feat = 0; feat < num_feat; ++feat) {
if (HAS_NAN) {
float val = raw_data_ptr[leaf_num][feat][i];
if (std::isnan(val)) {
nan_found = true;
break;
}
num_nonzero[tid][leaf_num] += 1;
curr_row[feat] = val;
} else {
curr_row[feat] = raw_data_ptr[leaf_num][feat][i];
}
}
if (HAS_NAN) {
if (nan_found) {
continue;
}
}
curr_row[num_feat] = 1.0;
double h = hessians[i];
double g = gradients[i];
int j = 0;
for (int feat1 = 0; feat1 < num_feat + 1; ++feat1) {
double f1_val = curr_row[feat1];
XTg_by_thread_[tid][leaf_num][feat1] += f1_val * g;
f1_val *= h;
for (int feat2 = feat1; feat2 < num_feat + 1; ++feat2) {
XTHX_by_thread_[tid][leaf_num][j] += f1_val * curr_row[feat2];
++j;
}
}
OMP_LOOP_EX_END();
}
}
OMP_THROW_EX();
auto total_nonzero = std::vector<int>(tree->num_leaves());
// aggregate results from different threads
for (int tid = 0; tid < num_threads; ++tid) {
#pragma omp parallel for schedule(static)
for (int leaf_num = 0; leaf_num < num_leaves; ++leaf_num) {
int num_feat = leaf_features[leaf_num].size();
for (int j = 0; j < (num_feat + 1) * (num_feat + 2) / 2; ++j) {
XTHX_[leaf_num][j] += XTHX_by_thread_[tid][leaf_num][j];
}
for (int feat1 = 0; feat1 < num_feat + 1; ++feat1) {
XTg_[leaf_num][feat1] += XTg_by_thread_[tid][leaf_num][feat1];
}
if (HAS_NAN) {
total_nonzero[leaf_num] += num_nonzero[tid][leaf_num];
}
}
}
if (!HAS_NAN) {
for (int leaf_num = 0; leaf_num < num_leaves; ++leaf_num) {
total_nonzero[leaf_num] = data_partition_->leaf_count(leaf_num);
}
}
double shrinkage = tree->shrinkage();
double decay_rate = config_->refit_decay_rate;
// copy into eigen matrices and solve
#pragma omp parallel for schedule(static)
for (int leaf_num = 0; leaf_num < num_leaves; ++leaf_num) {
if (total_nonzero[leaf_num] < static_cast<int>(leaf_features[leaf_num].size()) + 1) {
if (is_refit) {
double old_const = tree->LeafConst(leaf_num);
tree->SetLeafConst(leaf_num, decay_rate * old_const + (1.0 - decay_rate) * tree->LeafOutput(leaf_num) * shrinkage);
tree->SetLeafCoeffs(leaf_num, std::vector<double>(leaf_features[leaf_num].size(), 0));
tree->SetLeafFeaturesInner(leaf_num, leaf_features[leaf_num]);
} else {
tree->SetLeafConst(leaf_num, tree->LeafOutput(leaf_num));
}
continue;
}
int num_feat = leaf_features[leaf_num].size();
Eigen::MatrixXd XTHX_mat(num_feat + 1, num_feat + 1);
Eigen::MatrixXd XTg_mat(num_feat + 1, 1);
int j = 0;
for (int feat1 = 0; feat1 < num_feat + 1; ++feat1) {
for (int feat2 = feat1; feat2 < num_feat + 1; ++feat2) {
XTHX_mat(feat1, feat2) = XTHX_[leaf_num][j];
XTHX_mat(feat2, feat1) = XTHX_mat(feat1, feat2);
if ((feat1 == feat2) && (feat1 < num_feat)) {
XTHX_mat(feat1, feat2) += config_->linear_lambda;
}
++j;
}
XTg_mat(feat1) = XTg_[leaf_num][feat1];
}
Eigen::MatrixXd coeffs = - XTHX_mat.fullPivLu().inverse() * XTg_mat;
std::vector<double> coeffs_vec;
std::vector<int> features_new;
std::vector<double> old_coeffs = tree->LeafCoeffs(leaf_num);
for (size_t i = 0; i < leaf_features[leaf_num].size(); ++i) {
if (is_refit) {
features_new.push_back(leaf_features[leaf_num][i]);
coeffs_vec.push_back(decay_rate * old_coeffs[i] + (1.0 - decay_rate) * coeffs(i) * shrinkage);
} else {
if (coeffs(i) < -kZeroThreshold || coeffs(i) > kZeroThreshold) {
coeffs_vec.push_back(coeffs(i));
int feat = leaf_features[leaf_num][i];
features_new.push_back(feat);
}
}
}
// update the tree properties
tree->SetLeafFeaturesInner(leaf_num, features_new);
std::vector<int> features_raw(features_new.size());
for (size_t i = 0; i < features_new.size(); ++i) {
features_raw[i] = train_data_->RealFeatureIndex(features_new[i]);
}
tree->SetLeafFeatures(leaf_num, features_raw);
tree->SetLeafCoeffs(leaf_num, coeffs_vec);
if (is_refit) {
double old_const = tree->LeafConst(leaf_num);
tree->SetLeafConst(leaf_num, decay_rate * old_const + (1.0 - decay_rate) * coeffs(num_feat) * shrinkage);
} else {
tree->SetLeafConst(leaf_num, coeffs(num_feat));
}
}
}
#endif // LGB_R_BUILD
} // namespace LightGBM
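Written out as a formula, the solve performed in CalculateLinear above is the following, with X the leaf's feature matrix augmented by a column of ones, H = diag(h_1, ..., h_n), g the vector of gradients, and lambda the linear_lambda regularizer:

\hat{\beta} = -\left( X^{\top} H X + \lambda I \right)^{-1} X^{\top} g

Note that in the code lambda is added only to the diagonal entries belonging to features (feat1 < num_feat), so the constant term of the regression is left unregularized.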
/*!
* Copyright (c) 2016 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/
#ifndef LIGHTGBM_TREELEARNER_LINEAR_TREE_LEARNER_H_
#define LIGHTGBM_TREELEARNER_LINEAR_TREE_LEARNER_H_
#include <string>
#include <cmath>
#include <cstdio>
#include <memory>
#include <random>
#include <vector>
#include "serial_tree_learner.h"
namespace LightGBM {
class LinearTreeLearner: public SerialTreeLearner {
public:
explicit LinearTreeLearner(const Config* config) : SerialTreeLearner(config) {}
void Init(const Dataset* train_data, bool is_constant_hessian) override;
void InitLinear(const Dataset* train_data, const int max_leaves) override;
Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) override;
/*! \brief Create array mapping dataset to leaf index, used for linear trees */
void GetLeafMap(Tree* tree) const;
template<bool HAS_NAN>
void CalculateLinear(Tree* tree, bool is_refit, const score_t* gradients, const score_t* hessians, bool is_first_tree) const;
Tree* FitByExistingTree(const Tree* old_tree, const score_t* gradients, const score_t* hessians) const override;
Tree* FitByExistingTree(const Tree* old_tree, const std::vector<int>& leaf_pred,
const score_t* gradients, const score_t* hessians) const override;
void AddPredictionToScore(const Tree* tree,
double* out_score) const override {
CHECK_LE(tree->num_leaves(), data_partition_->num_leaves());
bool has_nan = false;
if (any_nan_) {
for (int i = 0; i < tree->num_leaves() - 1; ++i) {
// use split_feature because split_feature_inner doesn't work when refitting existing tree
if (contains_nan_[train_data_->InnerFeatureIndex(tree->split_feature(i))]) {
has_nan = true;
break;
}
}
}
if (has_nan) {
AddPredictionToScoreInner<true>(tree, out_score);
} else {
AddPredictionToScoreInner<false>(tree, out_score);
}
}
template<bool HAS_NAN>
void AddPredictionToScoreInner(const Tree* tree, double* out_score) const {
int num_leaves = tree->num_leaves();
std::vector<double> leaf_const(num_leaves);
std::vector<std::vector<double>> leaf_coeff(num_leaves);
std::vector<std::vector<const float*>> feat_ptr(num_leaves);
std::vector<double> leaf_output(num_leaves);
std::vector<int> leaf_num_features(num_leaves);
for (int leaf_num = 0; leaf_num < num_leaves; ++leaf_num) {
leaf_const[leaf_num] = tree->LeafConst(leaf_num);
leaf_coeff[leaf_num] = tree->LeafCoeffs(leaf_num);
leaf_output[leaf_num] = tree->LeafOutput(leaf_num);
for (int feat : tree->LeafFeaturesInner(leaf_num)) {
feat_ptr[leaf_num].push_back(train_data_->raw_index(feat));
}
leaf_num_features[leaf_num] = feat_ptr[leaf_num].size();
}
OMP_INIT_EX();
#pragma omp parallel for schedule(static) if (num_data_ > 1024)
for (int i = 0; i < num_data_; ++i) {
OMP_LOOP_EX_BEGIN();
int leaf_num = leaf_map_[i];
if (leaf_num < 0) {
continue;
}
double output = leaf_const[leaf_num];
int num_feat = leaf_num_features[leaf_num];
if (HAS_NAN) {
bool nan_found = false;
for (int feat_ind = 0; feat_ind < num_feat; ++feat_ind) {
float val = feat_ptr[leaf_num][feat_ind][i];
if (std::isnan(val)) {
nan_found = true;
break;
}
output += val * leaf_coeff[leaf_num][feat_ind];
}
if (nan_found) {
out_score[i] += leaf_output[leaf_num];
} else {
out_score[i] += output;
}
} else {
for (int feat_ind = 0; feat_ind < num_feat; ++feat_ind) {
output += feat_ptr[leaf_num][feat_ind][i] * leaf_coeff[leaf_num][feat_ind];
}
out_score[i] += output;
}
OMP_LOOP_EX_END();
}
OMP_THROW_EX();
}
protected:
/*! \brief whether numerical features contain any nan values */
std::vector<int8_t> contains_nan_;
/*! \brief whether any numerical feature contains a nan value */
bool any_nan_ = false;
/*! \brief map dataset to leaves */
mutable std::vector<int> leaf_map_;
/*! \brief temporary storage for calculating linear model coefficients */
mutable std::vector<std::vector<float>> XTHX_;
mutable std::vector<std::vector<float>> XTg_;
mutable std::vector<std::vector<std::vector<float>>> XTHX_by_thread_;
mutable std::vector<std::vector<std::vector<float>>> XTg_by_thread_;
};
} // namespace LightGBM
#endif  // LIGHTGBM_TREELEARNER_LINEAR_TREE_LEARNER_H_
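A minimal end-to-end sketch of the feature these classes implement, using the public parameter added by this PR (the data here is synthetic and illustrative):

import numpy as np
import lightgbm as lgb

np.random.seed(0)
# data with a linear trend, which linear leaves can fit with few trees
x = np.linspace(0, 10, 1000)[:, np.newaxis]
y = 3.0 * x[:, 0] + np.random.normal(0, 0.1, x.shape[0])

# linear_tree must be set on the Dataset so raw feature values are kept
train_data = lgb.Dataset(x, label=y, params={'linear_tree': True})
booster = lgb.train({'linear_tree': True, 'num_leaves': 4, 'verbose': -1},
                    train_data, num_boost_round=10)
pred = booster.predict(x)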
...@@ -155,7 +155,7 @@ void SerialTreeLearner::ResetConfig(const Config* config) {
constraints_.reset(LeafConstraintsBase::Create(config_, config_->num_leaves, train_data_->num_features()));
}
Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians, bool /*is_first_tree*/) {
Common::FunctionTimer fun_timer("SerialTreeLearner::Train", global_timer);
gradients_ = gradients;
hessians_ = hessians;
...@@ -172,7 +172,7 @@ Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians
BeforeTrain();
bool track_branch_features = !(config_->interaction_constraints_vector.empty());
auto tree = std::unique_ptr<Tree>(new Tree(config_->num_leaves, track_branch_features, false));
auto tree_ptr = tree.get();
constraints_->ShareTreePointer(tree_ptr);
...@@ -203,6 +203,7 @@ Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians
Split(tree_ptr, best_leaf, &left_leaf, &right_leaf);
cur_depth = std::max(cur_depth, tree->leaf_depth(left_leaf));
}
Log::Debug("Trained a tree with leaves = %d and max_depth = %d", tree->num_leaves(), cur_depth);
return tree.release();
}
...@@ -242,7 +243,8 @@ Tree* SerialTreeLearner::FitByExistingTree(const Tree* old_tree, const score_t*
return tree.release();
}
Tree* SerialTreeLearner::FitByExistingTree(const Tree* old_tree, const std::vector<int>& leaf_pred,
const score_t* gradients, const score_t *hessians) const {
data_partition_->ResetByLeafPred(leaf_pred, old_tree->num_leaves());
return FitByExistingTree(old_tree, gradients, hessians);
}
......
...@@ -39,6 +39,7 @@ using json11::Json;
/*! \brief forward declaration */
class CostEfficientGradientBoosting;
/*!
* \brief Used for learning a tree by single machine
*/
...@@ -74,12 +75,12 @@ class SerialTreeLearner: public TreeLearner {
}
}
Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) override;
Tree* FitByExistingTree(const Tree* old_tree, const score_t* gradients, const score_t* hessians) const override;
Tree* FitByExistingTree(const Tree* old_tree, const std::vector<int>& leaf_pred,
const score_t* gradients, const score_t* hessians) const override;
void SetBaggingData(const Dataset* subset, const data_size_t* used_indices, data_size_t num_data) override {
if (subset == nullptr) {
...@@ -96,10 +97,10 @@ class SerialTreeLearner: public TreeLearner {
void AddPredictionToScore(const Tree* tree,
double* out_score) const override {
CHECK_LE(tree->num_leaves(), data_partition_->num_leaves());
if (tree->num_leaves() <= 1) {
return;
}
#pragma omp parallel for schedule(static, 1)
for (int i = 0; i < tree->num_leaves(); ++i) {
double output = static_cast<double>(tree->LeafOutput(i));
...@@ -189,7 +190,6 @@ class SerialTreeLearner: public TreeLearner {
FeatureHistogram* smaller_leaf_histogram_array_;
/*! \brief pointer to histograms array of larger leaf */
FeatureHistogram* larger_leaf_histogram_array_;
/*! \brief store best split points for all leaves */
std::vector<SplitInfo> best_split_per_leaf_;
/*! \brief store best split per feature for all leaves */
......
...@@ -8,13 +8,22 @@
#include "gpu_tree_learner.h"
#include "parallel_tree_learner.h"
#include "serial_tree_learner.h"
#include "linear_tree_learner.h"
namespace LightGBM {
TreeLearner* TreeLearner::CreateTreeLearner(const std::string& learner_type, const std::string& device_type,
const Config* config) {
if (device_type == std::string("cpu")) {
if (learner_type == std::string("serial")) {
if (config->linear_tree) {
#ifdef LGB_R_BUILD
Log::Fatal("Linear tree learner does not work with R package.");
#endif // LGB_R_BUILD
return new LinearTreeLearner(config);
} else {
return new SerialTreeLearner(config);
}
} else if (learner_type == std::string("feature")) {
return new FeatureParallelTreeLearner<SerialTreeLearner>(config);
} else if (learner_type == std::string("data")) {
......
...@@ -99,6 +99,38 @@ class TestBasic(unittest.TestCase):
train_data.construct()
valid_data.construct()
def test_chunked_dataset_linear(self):
X_train, X_test, y_train, y_test = train_test_split(*load_breast_cancer(return_X_y=True), test_size=0.1,
random_state=2)
chunk_size = X_train.shape[0] // 10 + 1
X_train = [X_train[i * chunk_size:(i + 1) * chunk_size, :] for i in range(X_train.shape[0] // chunk_size + 1)]
X_test = [X_test[i * chunk_size:(i + 1) * chunk_size, :] for i in range(X_test.shape[0] // chunk_size + 1)]
params = {"bin_construct_sample_cnt": 100, 'linear_tree': True}
train_data = lgb.Dataset(X_train, label=y_train, params=params)
valid_data = train_data.create_valid(X_test, label=y_test, params=params)
train_data.construct()
valid_data.construct()
def test_save_and_load_linear(self):
X_train, X_test, y_train, y_test = train_test_split(*load_breast_cancer(return_X_y=True), test_size=0.1,
random_state=2)
X_train = np.concatenate([np.ones((X_train.shape[0], 1)), X_train], 1)
X_train[:X_train.shape[0] // 2, 0] = 0
y_train[:X_train.shape[0] // 2] = 1
params = {'linear_tree': True}
train_data = lgb.Dataset(X_train, label=y_train, params=params)
est = lgb.train(params, train_data, num_boost_round=10, categorical_feature=[0])
pred1 = est.predict(X_train)
train_data.save_binary('temp_dataset.bin')
train_data_2 = lgb.Dataset('temp_dataset.bin')
est = lgb.train(params, train_data_2, num_boost_round=10)
pred2 = est.predict(X_train)
np.testing.assert_allclose(pred1, pred2)
est.save_model('temp_model.txt')
est2 = lgb.Booster(model_file='temp_model.txt')
pred3 = est2.predict(X_train)
np.testing.assert_allclose(pred2, pred3)
def test_subset_group(self):
X_train, y_train = load_svmlight_file(os.path.join(os.path.dirname(os.path.realpath(__file__)),
'../../examples/lambdarank/rank.train'))
......
...@@ -78,6 +78,18 @@ class TestEngine(unittest.TestCase):
fd.train_predict_check(lgb_train, X_test, X_test_fn, sk_pred)
fd.file_load_check(lgb_train, '.train')
def test_binary_linear(self):
fd = FileLoader('../../examples/binary_classification', 'binary', 'train_linear.conf')
X_train, y_train, _ = fd.load_dataset('.train')
X_test, _, X_test_fn = fd.load_dataset('.test')
weight_train = fd.load_field('.train.weight')
lgb_train = lgb.Dataset(X_train, y_train, params=fd.params, weight=weight_train)
gbm = lgb.LGBMClassifier(**fd.params)
gbm.fit(X_train, y_train, sample_weight=weight_train)
sk_pred = gbm.predict_proba(X_test)[:, 1]
fd.train_predict_check(lgb_train, X_test, X_test_fn, sk_pred)
fd.file_load_check(lgb_train, '.train')
def test_multiclass(self):
fd = FileLoader('../../examples/multiclass_classification', 'multiclass')
X_train, y_train, _ = fd.load_dataset('.train')
......
...@@ -2420,6 +2420,86 @@ class TestEngine(unittest.TestCase):
[1] + list(range(2, num_features))]),
train_data, num_boost_round=10)
def test_linear(self):
# check that training with linear_tree=True fits better than ordinary gbdt when the data has a linear relationship
np.random.seed(0)
x = np.arange(0, 100, 0.1)
y = 2 * x + np.random.normal(0, 0.1, len(x))
lgb_train = lgb.Dataset(x[:, np.newaxis], label=y)
params = {'verbose': -1,
'metric': 'mse',
'seed': 0,
'num_leaves': 2}
est = lgb.train(params, lgb_train, num_boost_round=10)
pred1 = est.predict(x[:, np.newaxis])
lgb_train = lgb.Dataset(x[:, np.newaxis], label=y)
res = {}
est = lgb.train(dict(params, linear_tree=True), lgb_train, num_boost_round=10, evals_result=res,
valid_sets=[lgb_train], valid_names=['train'])
pred2 = est.predict(x[:, np.newaxis])
np.testing.assert_allclose(res['train']['l2'][-1], mean_squared_error(y, pred2), atol=10**(-1))
self.assertLess(mean_squared_error(y, pred2), mean_squared_error(y, pred1))
# test again with nans in data
x[:10] = np.nan
lgb_train = lgb.Dataset(x[:, np.newaxis], label=y)
est = lgb.train(params, lgb_train, num_boost_round=10)
pred1 = est.predict(x[:, np.newaxis])
lgb_train = lgb.Dataset(x[:, np.newaxis], label=y)
res = {}
est = lgb.train(dict(params, linear_tree=True), lgb_train, num_boost_round=10, evals_result=res,
valid_sets=[lgb_train], valid_names=['train'])
pred2 = est.predict(x[:, np.newaxis])
np.testing.assert_allclose(res['train']['l2'][-1], mean_squared_error(y, pred2), atol=10**(-1))
self.assertLess(mean_squared_error(y, pred2), mean_squared_error(y, pred1))
# test again with bagging
res = {}
est = lgb.train(dict(params, linear_tree=True, subsample=0.8, bagging_freq=1), lgb_train,
num_boost_round=10, evals_result=res, valid_sets=[lgb_train], valid_names=['train'])
pred = est.predict(x[:, np.newaxis])
np.testing.assert_allclose(res['train']['l2'][-1], mean_squared_error(y, pred), atol=10**(-1))
# test with a feature that has only one non-nan value
x = np.concatenate([np.ones([x.shape[0], 1]), x[:, np.newaxis]], 1)
x[500:, 1] = np.nan
y[500:] += 10
lgb_train = lgb.Dataset(x, label=y)
res = {}
est = lgb.train(dict(params, linear_tree=True, subsample=0.8, bagging_freq=1), lgb_train,
num_boost_round=10, evals_result=res, valid_sets=[lgb_train], valid_names=['train'])
pred = est.predict(x)
np.testing.assert_allclose(res['train']['l2'][-1], mean_squared_error(y, pred), atol=10**(-1))
# test with a categorical feature
x[:250, 0] = 0
y[:250] += 10
lgb_train = lgb.Dataset(x, label=y)
est = lgb.train(dict(params, linear_tree=True, subsample=0.8, bagging_freq=1), lgb_train,
num_boost_round=10, categorical_feature=[0])
# test refit: same results on same data
est2 = est.refit(x, label=y)
p1 = est.predict(x)
p2 = est2.predict(x)
self.assertLess(np.mean(np.abs(p1 - p2)), 2)
# test refit with save and load
est.save_model('temp_model.txt')
est2 = lgb.Booster(model_file='temp_model.txt')
est2 = est2.refit(x, label=y)
p1 = est.predict(x)
p2 = est2.predict(x)
self.assertLess(np.mean(np.abs(p1 - p2)), 2)
# test refit: different results training on different data
est2 = est.refit(x[:100, :], label=y[:100])
p3 = est2.predict(x)
self.assertGreater(np.mean(np.abs(p3 - p1)), np.abs(np.max(p2 - p1)))
# test when num_leaves - 1 < num_features and when num_leaves - 1 > num_features
X_train, X_test, y_train, y_test = train_test_split(*load_breast_cancer(return_X_y=True), test_size=0.1, random_state=2)
params = {'linear_tree': True,
'verbose': -1,
'metric': 'mse',
'seed': 0}
train_data = lgb.Dataset(X_train, label=y_train, params=dict(params, num_leaves=2))
est = lgb.train(params, train_data, num_boost_round=10, categorical_feature=[0])
train_data = lgb.Dataset(X_train, label=y_train, params=dict(params, num_leaves=60))
est = lgb.train(params, train_data, num_boost_round=10, categorical_feature=[0])
def test_predict_with_start_iteration(self):
def inner_test(X, y, params, early_stopping_rounds):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
......