Commits · 5c8a331bf5966f1df546f154f64ed9b8856a90ee · tianlh / LightGBM-DCU

05 Oct, 2021 2 commits
- add param aliases from scikit-learn (#4637) · e95d5ab8
  Nikita Titov authored Oct 05, 2021
  
  e95d5ab8
- remove unused BinMapper::SizeForSpecificBin() (#4643) · e81eaaaf
  James Lamb authored Oct 04, 2021
```
Co-authored-by: Nikita Titov <nekit94-12@hotmail.com>
```
  e81eaaaf
25 Aug, 2021 1 commit

[docs] Clarify the fact that predict() on a file does not support saved... · 417ba192

James Lamb authored Aug 25, 2021


[docs] Clarify the fact that predict() on a file does not support saved Datasets (fixes #4034) (#4545)

* documentation changes

* add list of supported formats to error message

* add unit tests

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update per review comments

* make references consistent
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

417ba192

22 Aug, 2021 1 commit

factor out .size() checks in GetDataType() (#4541) · 4db10d86

James Lamb authored Aug 22, 2021



* factor out .size() checks in GetDataType()

* Update src/io/parser.cpp
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

4db10d86

26 Jun, 2021 1 commit
- fix param aliases (#4387) · aab8fc18
  Nikita Titov authored Jun 26, 2021
  
  aab8fc18
03 Jun, 2021 2 commits

Add linear leaf models to json output (fixes #4186) (#4329) · 1b5bec00

Belinda Trotta authored Jun 03, 2021



* Add linear leaf models to json output

* Add closing bracket

* Move test into test_engine.py and add asserts

* Update tests/python_package_test/test_engine.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update tests/python_package_test/test_engine.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update tests/python_package_test/test_engine.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

1b5bec00

skip empty bin when calculating cnt_in_bin in BinMapper::FindBin (fix #4301) (#4325) · 3dd4a3f9
shiyu1994 authored Jun 03, 2021

3dd4a3f9

10 May, 2021 1 commit
- [docs] remove extra spaces in comments and docs (#4269) · a8ee487a
  James Lamb authored May 10, 2021
  
  a8ee487a
07 May, 2021 1 commit

Precise text file parsing (#4081) · f8318088

Chen Yufei authored May 07, 2021



* New build option: USE_PRECISE_TEXT_PARSER.

Use fast_double_parser for text file parsing. For each number, fallback
to strtod in case of parse failure.

* Add benchmark for CSVParser with Atof and AtofPrecise.

* Fix lint complaint.

* Fix typo in open result error message.

* Revert "Fix lint complaint."

This reverts commit 92ab0b6bce9f17d7be9eaeb20f19d4a0a36f0387.

* Revert "Add benchmark for CSVParser with Atof and AtofPrecise."

This reverts commit 4f8639abd06c679d4382eb715a1793afd94df3d2.

* Use AtofPrecise in Common::__StringToTHelper.

* [option] precise_float_parser: precise float number parsing for text input.

* Remove USE_PRECISE_TEXT_PARSER compile option.

* test: add test for Common::AtofPrecise.

* test: remove ChunkedArrayTest with 0 length.

This triggers Log::Fatal which aborts the test program.

* fix lint, add copyright.

* Revert "test: remove ChunkedArrayTest with 0 length."

This reverts commit 346c76affe9e78b6ca2738c4a56dbb9c00f31102.

* Use LightGBM::Common::Sign

* save precise_float_parser in model file.

* Fix error checking in AtofPrecise. Add more test cases.

* Remove test case that can't pass under macOS.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

f8318088
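
The first bullet of the commit above describes a try-fast-then-fall-back shape: parse each number with a fast routine and use `strtod` only when that routine reports failure. The sketch below illustrates that shape in Python. It is not fast_double_parser's algorithm: the fast path shown is the classic exact-division shortcut, `parse_fast` and `parse_number` are hypothetical names, and Python's `float()` stands in for `strtod`.

```python
import re

# Simple decimals only: optional sign, digits, optional fraction, no exponent.
_SIMPLE_DECIMAL = re.compile(r"([+-]?)([0-9]+)(?:\.([0-9]+))?")


def parse_fast(text):
    """Return a float for simple decimals, or None when the fast path does not apply."""
    match = _SIMPLE_DECIMAL.fullmatch(text)
    if match is None:
        return None
    sign, int_part, frac_part = match.group(1), match.group(2), match.group(3) or ""
    mantissa = int(int_part + frac_part)
    scale = len(frac_part)
    # Both operands must be exactly representable as doubles so that the
    # single division below is correctly rounded.
    if mantissa >= 2 ** 53 or scale > 22:
        return None
    value = float(mantissa) / float(10 ** scale)
    return -value if sign == "-" else value


def parse_number(text):
    """Try the fast path first; fall back to the general parser on failure."""
    value = parse_fast(text)
    if value is None:
        value = float(text)  # stands in for strtod in this sketch
    return value
```

Inputs with an exponent or with too many digits make `parse_fast` return `None`, so `parse_number` handles them through the fallback.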

04 May, 2021 1 commit

Correct spelling (#4250) · e79716e0

Andrew Ziem authored May 04, 2021



* Correct spelling

Most changes were in comments, and there were a few changes to literals for log output.

There were no changes to variable names, function names, IDs, or functionality.

* Clarify a phrase in a comment
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Clarify a phrase in a comment
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Clarify a phrase in a comment
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Correct spelling

Most are code comments, but one case is a literal in a logging message.

There are a few grammar fixes too.
Co-authored-by: James Lamb <jaylamb20@gmail.com>

e79716e0

27 Apr, 2021 1 commit
- Fix typo in binary file already exists error message. (#4231) · d5c2c556
  Chen Yufei authored Apr 27, 2021
  
  d5c2c556
23 Apr, 2021 1 commit
- added aliases to params (#4205) · 8b477ba3
  Nikita Titov authored Apr 23, 2021
  
  8b477ba3
15 Apr, 2021 1 commit
- fix: Dataset::CreateValid init fields which saves to binary (#4177) · 98e5a210
  Chen Yufei authored Apr 16, 2021
  
  98e5a210
17 Mar, 2021 1 commit

Range check for DCG position discount lookup (#4069) · 4580393f

ashok-ponnuswami-msft authored Mar 17, 2021

* Add check to prevent out of index lookup in the position discount table. Add debug logging to report number of queries found in the data.

* Change debug logging location so that we can print the data file name as well.

* Revert "Change debug logging location so that we can print the data file name as well."

This reverts commit 3981b34bd6e0530f89c4733e78e6b6603bf50d48.

* Add data file name to debug logging.

* Move log line to a place where it is output even when query IDs are read from a separate file.

* Also add the out-of-range check to rank metrics.

* Perform check after number of queries is initialized.

* Update

4580393f
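
The range check described in the commit above can be pictured as a guard in front of a precomputed discount table. This is a minimal sketch, not LightGBM's code: `MAX_POSITION`, `DISCOUNT` and `dcg_at_k` are hypothetical names, the table is deliberately tiny, and the gain `2^label - 1` with discount `1 / log2(2 + position)` is assumed as the usual DCG definition.

```python
import math

MAX_POSITION = 10  # hypothetical table size, kept small for the example

# Precomputed discount for each rank position: 1 / log2(2 + position).
DISCOUNT = [1.0 / math.log2(2.0 + i) for i in range(MAX_POSITION)]


def dcg_at_k(labels, k):
    """DCG of the first k labels, refusing positions beyond the discount table."""
    k = min(k, len(labels))
    if k > len(DISCOUNT):
        # Without this check, DISCOUNT[i] below would be indexed out of range.
        raise ValueError(
            f"position {k} exceeds discount table size {len(DISCOUNT)}"
        )
    return sum((2 ** labels[i] - 1) * DISCOUNT[i] for i in range(k))
```

Checking once before the loop keeps the inner lookup free of per-element bounds tests while still failing with a readable message.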

12 Mar, 2021 1 commit
- set is_linear_ to false when it is absent from the model file (fix #3778) (#4056) · ec4bd1e0
  shiyu1994 authored Mar 13, 2021
  
  ec4bd1e0
21 Feb, 2021 1 commit

Fix evaluation of linear trees with a single leaf. (#3987) · 605c97b5

mjmckp authored Feb 22, 2021



* Fix index out-of-range exception generated by BaggingHelper on small datasets.

Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.

* Update goss.hpp

* Update goss.hpp

* Add API method LGBM_BoosterPredictForMats which runs prediction on a data set given as an array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array)

* Fix incorrect upstream merge

* Add link to LightGBM.NET

* Fix indenting to 2 spaces

* Dummy edit to trigger CI

* Dummy edit to trigger CI

* remove duplicate functions from merge

* Fix evaluation of linear trees with a single leaf.

Note that trees without linear models at the leaf always handle num_leaves = 1 as a special case and directly output the leaf value.  Linear trees were missing this special case handling, and hence would have the following issues:
 * Calling Tree::Predict or Tree::PredictByMap would cause an access violation exception attempting to access the first value of the empty split_feature_ array in GetLeaf.
 * PredictionFunLinear would either cause an access violation or go into an infinite loop when attempting to do the equivalent of GetLeaf.

Note also that PredictionFun does not need the same changes as PredictionFunLinear, since both are only called by Tree::AddPredictionToScore, which has a special case for (!is_linear_ && num_leaves_ <= 1) that precludes calling PredictionFun.
Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>

605c97b5
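
The single-leaf special case described in the commit above can be sketched with a toy tree. Everything here is an assumption made for illustration: `ToyLinearTree` and its fields are hypothetical, and encoding leaves as negative child indices (`~leaf`) is a convention chosen for this example, not something the commit message states.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyLinearTree:
    """Hypothetical, simplified tree with a linear model at each leaf."""
    num_leaves: int
    split_feature: List[int] = field(default_factory=list)
    threshold: List[float] = field(default_factory=list)
    left_child: List[int] = field(default_factory=list)
    right_child: List[int] = field(default_factory=list)
    leaf_const: List[float] = field(default_factory=list)
    leaf_features: List[List[int]] = field(default_factory=list)
    leaf_coeff: List[List[float]] = field(default_factory=list)

    def get_leaf(self, row):
        # Special case: with one leaf there are no splits, so the split
        # arrays are empty and must not be touched.
        if self.num_leaves <= 1:
            return 0
        node = 0
        while node >= 0:  # negative values encode leaves as ~leaf_index
            if row[self.split_feature[node]] <= self.threshold[node]:
                node = self.left_child[node]
            else:
                node = self.right_child[node]
        return ~node

    def predict(self, row):
        leaf = self.get_leaf(row)
        output = self.leaf_const[leaf]
        for feature, coeff in zip(self.leaf_features[leaf], self.leaf_coeff[leaf]):
            output += coeff * row[feature]
        return output
```

Without the early return, `get_leaf` on a single-leaf tree would index the empty `split_feature` list, which mirrors the access violation the commit message describes.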

19 Feb, 2021 2 commits

Use high precision conversion from double to string in Tree::ToString() for... · 7f91dc66

mjmckp authored Feb 20, 2021


Use high precision conversion from double to string in Tree::ToString() for new linear tree members (#3938)

* Fix index out-of-range exception generated by BaggingHelper on small datasets.

Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.

* Update goss.hpp

* Update goss.hpp

* Add API method LGBM_BoosterPredictForMats which runs prediction on a data set given as an array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array)

* Fix incorrect upstream merge

* Add link to LightGBM.NET

* Fix indenting to 2 spaces

* Dummy edit to trigger CI

* Dummy edit to trigger CI

* remove duplicate functions from merge

* In Tree::ToString() method, print double values for linear tree models with high precision, so that the tree may be accurately reproduced elsewhere (LightGBM.Net in particular)

* Need to use more precise StringToArray instead of StringToArrayFast when parsing double valued arrays for linear trees, to ensure models round-trip via string or file correctly.
Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>

7f91dc66
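
The round-trip concern in the commit above is easy to reproduce outside LightGBM. The sketch below is only an illustration in Python: the digit counts are assumptions chosen for the example (the commit message does not state which precision `Tree::ToString()` uses), and `to_string_short` and `to_string_precise` are hypothetical helpers.

```python
def to_string_short(value):
    """Format with 6 significant digits; compact but lossy."""
    return format(value, ".6g")


def to_string_precise(value):
    """Format with 17 significant digits, enough to round-trip any IEEE 754 double."""
    return format(value, ".17g")


coefficient = 0.1 + 0.2  # not exactly 0.3 in binary floating point

short_text = to_string_short(coefficient)      # "0.3"
precise_text = to_string_precise(coefficient)

lossy = float(short_text) != coefficient       # True: the value changed
exact = float(precise_text) == coefficient     # True: the value survived
```

A model written with the short form would load with slightly different linear coefficients, which is the kind of mismatch the commit aims to avoid when trees are reproduced elsewhere.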

[docs] Change some 'parallel learning' references to 'distributed learning' (#4000) · 7880b79f
James Lamb authored Feb 19, 2021
```
* [docs] Change some 'parallel learning' references to 'distributed learning'

* found a few more

* one more reference
```
7880b79f

06 Feb, 2021 1 commit
- fix typos in log messages (#3914) · e31244cf
  James Lamb authored Feb 06, 2021
  
  e31244cf
03 Feb, 2021 1 commit
- Add new task type: "save_binary" (#3651) · 111d0c80
  Chen Yufei authored Feb 03, 2021
```
* Add new task type: "save_binary".

* Document for task "save_binary".
```
  111d0c80
25 Jan, 2021 1 commit
- change Dataset::CopySubrow from group wise to column wise (#3720) · 36531679
  shiyu1994 authored Jan 25, 2021
  
  36531679
11 Jan, 2021 1 commit
- fix bug in corner case of hist bin mismatch (#3694) · a86a211b
  shiyu1994 authored Jan 11, 2021
  
  a86a211b
09 Jan, 2021 1 commit
- move CheckParamConflict() after LogLevel processing (#3742) · d6f6abf6
  h-vetinari authored Jan 09, 2021
  
  d6f6abf6
07 Jan, 2021 2 commits
- fix bug in ExtractFeaturesFromMemory when predict_fun_ is used (#3721) · 31bc196a
  shiyu1994 authored Jan 07, 2021
  
  31bc196a
- Fix compiler warnings caused by implicit type conversion (fixes #3677) (#3729) · 753b0e9c
  Belinda Trotta authored Jan 07, 2021
```
* Fix compiler warnings caused by implicit type conversion

* Fix more warnings

* Fix more warnings
```
  753b0e9c
28 Dec, 2020 1 commit

small code and docs refactoring (#3681) · 5a460846

Nikita Titov authored Dec 29, 2020

* small code and docs refactoring

* Update CMakeLists.txt

* Update .vsts-ci.yml

* Update test.sh

* continue

* continue

* revert stable sort for all-unique values

5a460846

24 Dec, 2020 1 commit

Trees with linear models at leaves (#3299) · fcfd4132

Belinda Trotta authored Dec 24, 2020

* Add Eigen library.

* Working for simple test.

* Apply changes to config params.

* Handle nan data.

* Update docs.

* Add test.

* Only load raw data if boosting=gbdt_linear

* Remove unneeded code.

* Minor updates.

* Update to work with sk-learn interface.

* Update to work with chunked datasets.

* Throw error if we try to create a Booster with an already-constructed dataset having incompatible parameters.

* Save raw data in binary dataset file.

* Update docs and fix parameter checking.

* Fix dataset loading.

* Add test for regularization.

* Fix bugs when saving and loading tree.

* Add test for load/save linear model.

* Remove unneeded code.

* Fix case where not enough leaf data for linear model.

* Simplify code.

* Speed up code.

* Speed up code.

* Simplify code.

* Speed up code.

* Fix bugs.

* Working version.

* Store feature data column-wise (not fully working yet).

* Fix bugs.

* Speed up.

* Speed up.

* Remove unneeded code.

* Small speedup.

* Speed up.

* Minor updates.

* Remove unneeded code.

* Fix bug.

* Fix bug.

* Speed up.

* Speed up.

* Simplify code.

* Remove unneeded code.

* Fix bug, add more tests.

* Fix bug and add test.

* Only store numerical features

* Fix bug and speed up using templates.

* Speed up prediction.

* Fix bug with regularisation

* Visual studio files.

* Working version

* Only check nans if necessary

* Store coeff matrix as an array.

* Align cache lines

* Align cache lines

* Preallocate coefficient calculation matrices

* Small speedups

* Small speedup

* Reverse cache alignment changes

* Change to dynamic schedule

* Update docs.

* Refactor so that linear tree learner is not a separate class.

* Add refit capability.

* Speed up

* Small speedups.

* Speed up add prediction to score.

* Fix bug

* Fix bug and speed up.

* Speed up dataload.

* Speed up dataload

* Use vectors instead of pointers

* Fix bug

* Add OMP exception handling.

* Change return type of LGBM_BoosterGetLinear to bool

* Change return type of LGBM_BoosterGetLinear back to int, only parameter type needed to change

* Remove unused internal_parent_ property of tree

* Remove unused parameter to CreateTreeLearner

* Remove reference to LinearTreeLearner

* Minor style issues

* Remove unneeded check

* Reverse temporary testing change

* Fix Visual Studio project files

* Restore LightGBM.vcxproj.filters

* Speed up

* Speed up

* Simplify code

* Update docs

* Simplify code

* Initialise storage space for max num threads

* Move Eigen to include directory and delete unused files

* Remove old files.

* Fix so it compiles with mingw

* Fix gpu tree learner

* Change AddPredictionToScore back to const

* Fix python lint error

* Fix C++ lint errors

* Change eigen to a submodule

* Update comment

* Add the eigen folder

* Try to fix build issues with eigen

* Remove eigen files

* Add eigen as submodule

* Fix include paths

* Exclude eigen files from Python linter

* Ignore eigen folders for pydocstyle

* Fix C++ linting errors

* Fix docs

* Fix docs

* Exclude eigen directories from doxygen

* Update manifest to include eigen

* Update build_r to include eigen files

* Fix compiler warnings

* Store raw feature data as float

* Use float for calculating linear coefficients

* Remove eigen directory from GLOB

* Don't compile linear model code when building R package

* Fix doxygen issue

* Fix lint issue

* Fix lint issue

* Remove unneeded code

* Restore deleted lines

* Restore deleted lines

* Change return type of has_raw to bool

* Update docs

* Rename some variables and functions for readability

* Make tree_learner parameter const in AddScore

* Fix style issues

* Pass vectors as const reference when setting tree properties

* Make temporary storage of serial_tree_learner mutable so we can make the object's methods const

* Remove get_raw_size, use num_numeric_features instead

* Fix typo

* Make contains_nan_ and any_nan_ properties immutable again

* Remove data_has_nan_ property of tree

* Remove temporary test code

* Make linear_tree a dataset param

* Fix lint error

* Make LinearTreeLearner a separate class

* Fix lint errors

* Fix lint error

* Add linear_tree_learner.o

* Simulate omp_get_max_threads if openmp is not available

* Update PushOneData to also store raw data.

* Cast size to int

* Fix bug in ReshapeRaw

* Speed up code with multithreading

* Use OMP_NUM_THREADS

* Speed up with multithreading

* Update to use ArrayToString

* Fix tests

* Fix test

* Fix bug introduced in merge

* Minor updates

* Update docs

fcfd4132

08 Dec, 2020 1 commit

Fix model locale issue and improve model R/W performance. (#3405) · 792c9303

Alberto Ferreira authored Dec 08, 2020

* Fix LightGBM models locale sensitivity and improve R/W performance.

When Java is used, the default C++ locale is broken. This is true for
Java providers that use the C API or even Python models that require JEP.

This patch solves that issue by making the model reads/writes insensitive
to such settings.
To achieve it, within the model read/write codebase:
 - C++ streams are imbued with the classic locale
 - Calls to functions that are dependent on the locale are replaced
 - The default locale is not changed!

This approach means:
 - The user's locale is never tampered with, avoiding issues such as
    https://github.com/microsoft/LightGBM/issues/2979 with the previous
    approach https://github.com/microsoft/LightGBM/pull/2891
 - Datasets can still be read according to the user's locale
 - The model file has a single format independent of locale

Changes:
 - Add CommonC namespace which provides faster locale-independent versions of Common's methods
 - Model code makes conversions through CommonC
 - Cleanup unused Common methods
 - Performance improvements. Use fast libraries for locale-agnostic conversion:
   - value->string: https://github.com/fmtlib/fmt
   - string->double: https://github.com/lemire/fast_double_parser (10x
      faster double parsing according to their benchmark)

Bugfixes:
 - https://github.com/microsoft/LightGBM/issues/2500
 - https://github.com/microsoft/LightGBM/issues/2890
 - https://github.com/ninia/jep/issues/205 (as it is related to LGBM as well)

* Align CommonC namespace

* Add new external_libs/ to python setup

* Try fast_double_parser fix #1

Testing commit e09e5aad828bcb16bea7ed0ed8322e019112fdbe

If it works it should fix more LGBM builds

* CMake: Attempt to link fmt without explicit PUBLIC tag

* Exclude external_libs from linting

* Add external_libs to MANIFEST.in

* Set dynamic linking option for fmt.

* linting issues

* Try to fix lint includes

* Try to pass fPIC with static fmt lib

* Try CMake P_I_C option with fmt library

* [R-package] Add CMake support for R and CRAN

* Cleanup CMakeLists

* Try fmt hack to remove stdout

* Switch to header-only mode

* Add PRIVATE argument to target_link_libraries

* use fmt in header-only mode

* Remove CMakeLists comment

* Change OpenMP to PUBLIC linking in Mac

* Update fmt submodule to 7.1.2

* Use fmt in header-only-mode

* Remove fmt from CMakeLists.txt

* Upgrade fast_double_parser to v0.2.0

* Revert "Add PRIVATE argument to target_link_libraries"

This reverts commit 3dd45dde7b92531b2530ab54522bb843c56227a7.

* Address James Lamb's comments

* Update R-package/.Rbuildignore
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Upgrade to fast_double_parser v0.3.0 - Solaris support

* Use legacy code only in Solaris

* Fix lint issues

* Fix comment

* Address StrikerRUS's comments (solaris ifdef).

* Change header guards
Co-authored-by: James Lamb <jaylamb20@gmail.com>

792c9303

07 Dec, 2020 1 commit
- fix typo in dataset checks (#3631) · bcdf1162
  Nikita Titov authored Dec 07, 2020
  
  bcdf1162
05 Dec, 2020 1 commit

Check max_bin, etc. match config when using binary (#3592) · 2c958dd4

Chen Yufei authored Dec 05, 2020

* Check max_bin, etc. match config when using binary.

* Check max_bin_by_feature, bin_construct_sample_cnt matching config.

2c958dd4

24 Nov, 2020 1 commit

Fix #3557 and potential issue with dense multi-val feature groups. (#3590) · 530b5cef

shiyu1994 authored Nov 24, 2020

Fix num_total_bin_ and bin_offsets_ of FeatureGroup
if a dense multi val feature group with non zero most freq bin
is the first feature group of the dataset.

530b5cef

23 Nov, 2020 1 commit

fix max_block_size in train states (fix #3570) (#3575) · d6f20e37

shiyu1994 authored Nov 23, 2020

* remove max_block_size_ in train states (fix #3570)

* avoid zero elements per row

* add min constraint for min_block_size_

d6f20e37

14 Nov, 2020 1 commit
- fix warnings · 2f4ce973
  Guolin Ke authored Nov 14, 2020
  
  2f4ce973
13 Nov, 2020 1 commit

Optimization of row-wise histogram construction (#3522) · 0655d67c

shiyu1994 authored Nov 13, 2020



* store without offset in multi_val_dense_bin

* fix offset bug

* add comment for offset

* add comment for bin type selection

* faster operations for offset

* keep most freq bin in histogram for multi val dense

* use original feature iterators

* consider 9 cases (3 x 3) for multi val bin construction

* fix dense bin setting

* fix bin data in multi val group

* fix offset of the first feature histogram

* use float hist buf

* avx in histogram construction

* use avx for hist construction without prefetch

* vectorize bin extraction

* use only 128 vec

* use avx2

* use vectorization for sparse row wise

* add bit size for multi val dense bin

* float with no vectorization

* change multithreading strategy to dynamic

* remove intrinsic header

* fix dense multi val col copy

* remove bit size

* use large enough block size when the bin number is large

* calc min block size by sparsity

* rescale gradients

* rollback gradients scaling

* single precision histogram buffer as an option

* add float hist buffer with thread buffer

* fix setting zero in hist data

* fix hist begin pointer in tree learners

* remove debug logs

* remove omp simd

* update Makevars of R-package

* fix feature group binary storing

* two row wise for double hist buffer

* add subfeature for two row wise

* remove useless code and fix two row wise

* refactor code

* grouping the dense feature groups can get sparse multi val bin

* clean format problems

* one thread for two blocks in sep row wise

* use ordered gradients for sep row wise

* fix grad ptr

* ordered grad with combined block for sep row wise

* fix block threading

* use the same min block size

* rollback share min block size

* remove logs

* Update src/io/dataset.cpp
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>

* fix parameter description

* remove sep_row_wise

* remove check codes

* add check for empty multi val bin

* fix lint error

* rollback changes in config.h

* Apply suggestions from code review
Co-authored-by: Ubuntu <shiyu@gbdt-04.ren3kv4wanvufliwrpy4k03lsf.xx.internal.cloudapp.net>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>

0655d67c

08 Nov, 2020 1 commit

Fix #2898: Clearer warning message for user (2^max_depth > num_leaves) (#3537) · a5448233

Alberto Ferreira authored Nov 08, 2020



* Fix #2898: Clearer warning message (2^max_depth > num_leaves).

* Update src/io/config.cpp
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update src/io/config.cpp
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

a5448233
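
The inequality in the commit title above compares the leaf capacity of a depth-limited binary tree with `num_leaves`. The sketch below shows only that arithmetic. The exact condition and message text used in `src/io/config.cpp` are not shown in this log, `depth_allows_more_leaves` is a hypothetical helper, and treating a non-positive `max_depth` as "no limit" is an assumption based on the parameter's documented meaning.

```python
def depth_allows_more_leaves(max_depth, num_leaves):
    """Return True when a tree of depth max_depth could hold more leaves than num_leaves."""
    if max_depth <= 0:  # assumed to mean "no depth limit"
        return False
    # A binary tree of depth d has at most 2**d leaves.
    return 2 ** max_depth > num_leaves


# With max_depth=6 a tree could have up to 64 leaves, so num_leaves=31
# is the binding limit, not the depth.
binding = depth_allows_more_leaves(6, 31)
```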

06 Nov, 2020 1 commit

better document for bin_construct_sample_cnt (#3521) · bee732af

Guolin Ke authored Nov 06, 2020



* better document for bin_construct_sample_cnt

* add warnings
Co-authored-by: StrikerRUS <nekit94-12@hotmail.com>

bee732af

01 Nov, 2020 1 commit

Support deterministic (#3494) · c39afb9d

Guolin Ke authored Nov 01, 2020



* implement

* fix compilation

* Update config.cpp

* unify wordings
Co-authored-by: StrikerRUS <nekit94-12@hotmail.com>

c39afb9d

28 Oct, 2020 1 commit
- avoid min_data and min_hessian being zero at the same time (#3492) · 56c1e2ed
  Guolin Ke authored Oct 28, 2020
```
* check min_data and min_hessian

* Apply suggestions from code review
```
  56c1e2ed
26 Oct, 2020 1 commit

Fix add features (#2754) · 53977f36

Guolin Ke authored Oct 27, 2020



* fix subset bug

* typo

* add fixme tag

* bin mapper

* fix test

* fix add_features_from

* Update dataset.cpp

* fix merge bug

* added Python merge code

* added test for add_features

* Update dataset.cpp

* Update src/io/dataset.cpp

* continue implementing

* warn users about categorical features
Co-authored-by: StrikerRUS <nekit94-12@hotmail.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

53977f36

09 Oct, 2020 1 commit

Move Tree destructor to header file (#3417) · f1aaa9b9

Lucas David authored Oct 09, 2020



~ Added 'noexcept' specifier and defaulted destructor.
Co-authored-by: Lucas DAVID <lucas@isdom.isoft.fr>

f1aaa9b9