Commits · 765ceadc2ed059522c6107214b0b4809183e3c97 · tianlh / LightGBM-DCU

28 Oct, 2021 1 commit

Improve warning wordings (#4731) · 765ceadc

Nikita Titov authored Oct 28, 2021

* Update dataset_loader.cpp

* Update dataset_loader.cpp

* Update dataset_loader.cpp

765ceadc

27 Oct, 2021 1 commit
- Add some warnings when loading dataset from binary file (#4724) · 5fbfa00b
  Nikita Titov authored Oct 28, 2021
  
  5fbfa00b
25 Oct, 2021 1 commit
- Fix some parameter hints when loading from binary file (#4701) · dc02dcaf
  Zhiyuan He authored Oct 25, 2021
```
Co-authored-by: hzy46 <email@example.com>
```
  dc02dcaf
20 Oct, 2021 1 commit
- Fix ASAN issues with `std::function` usage (#4673) · 13ed38ca
  david-cortes authored Oct 20, 2021
```
* don't compare std::function to nullptr ref #4633

* Update dataset_loader.h
```
  13ed38ca
13 Oct, 2021 1 commit
- fix behavior for default objective and metric (#4660) · d130bb19
  Nikita Titov authored Oct 13, 2021
  
  d130bb19
08 Oct, 2021 1 commit
- fix possible precision loss in xentropy and fair loss objectives (#4651) · 1c558a54
  James Lamb authored Oct 07, 2021
  
  1c558a54
05 Oct, 2021 3 commits
- remove unused `DCGCalculator::CalDCGAtK()` (#4650) · df8c10ba
  James Lamb authored Oct 05, 2021
  
  df8c10ba
- add param aliases from scikit-learn (#4637) · e95d5ab8
  Nikita Titov authored Oct 05, 2021
  
  e95d5ab8
- remove unused BinMapper::SizeForSpecificBin() (#4643) · e81eaaaf
  James Lamb authored Oct 04, 2021
```
Co-authored-by: Nikita Titov <nekit94-12@hotmail.com>
```
  e81eaaaf
23 Sep, 2021 2 commits
- move Network method implementations from network.h to network.cpp (fixes #4464) (#4496) · e1572794
  James Lamb authored Sep 22, 2021
  
  e1572794
- simplify and speed up comparisons for splits with identical gains (#4542) · b52ecb16
  James Lamb authored Sep 22, 2021
```
* fix incorrect behavior of SplitInfo == operator for splits with identical gains

* LightSplitInfo too, and improve comment

* dont check features unnecessarily

* update LightSplitInfo too
```
  b52ecb16
25 Aug, 2021 1 commit

[docs] Clarify the fact that predict() on a file does not support saved Datasets (fixes #4034) (#4545) · 417ba192

James Lamb authored Aug 25, 2021

* documentation changes

* add list of supported formats to error message

* add unit tests

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update per review comments

* make references consistent
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

417ba192

22 Aug, 2021 1 commit

factor out .size() checks in GetDataType() (#4541) · 4db10d86

James Lamb authored Aug 22, 2021



* factor out .size() checks in GetDataType()

* Update src/io/parser.cpp
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

4db10d86

23 Jul, 2021 1 commit
- [refactor] Use `CreateSampleIndices()` in `c_api.cpp` (#4478) · 3be611e7
  Chen Yufei authored Jul 23, 2021
```
This removes code duplication for creating sample indices.
```
  3be611e7
02 Jul, 2021 1 commit

[python-package] Create Dataset from multiple data files (#4089) · c359896e

Chen Yufei authored Jul 02, 2021

* [python-package] create Dataset from sampled data.

* [python-package] create Dataset from List[Sequence].

1. Use random access for data sampling
2. Support read data from multiple input files
3. Read data in batch so no need to hold all data in memory

* [python-package] example: create Dataset from multiple HDF5 file.

* fix: revert is_class implementation for seq

* fix: unwanted memory view reference for seq

* fix: seq is_class accepts sklearn matrices

* fix: requirements for example

* fix: pycode

* feat: print static code linting stage

* fix: linting: avoid shell str regex conversion

* code style: doc style

* code style: isort

* fix ci dependency: h5py on windows

* [py] remove rm files in test seq
https://github.com/microsoft/LightGBM/pull/4089#discussion_r612929623

* docs(python): init_from_sample summary

https://github.com/microsoft/LightGBM/pull/4089#discussion_r612903389



* remove dataset dump sample data debugging code.

* remove typo fix.

Create separate PR for this.

* fix typo in src/c_api.cpp
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* style(linting): py3 type hint for seq

* test(basic): os.path style path handling

* Revert "feat: print static code linting stage"

This reverts commit 10bd79f7f8258bea8e61c3abb8c9c7e4456a916d.

* feat(python): sequence on validation set

* minor(python): comment

* minor(python): test option hint

* style(python): fix code linting

* style(python): add pydoc for ref_dataset

* doc(python): sequence
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>

* revert(python): sequence class abc

* chore(python): remove rm_files

* Remove useless static_assert.

* refactor: test_basic test for sequence.

* fix lint complaint.

* remove dataset._dump_text in sequence test.

* Fix reverting typo fix.

* Apply suggestions from code review
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Fix type hint, code and doc style.

* fix failing test_basic.

* Remove TODO about keep constant in sync with cpp.

* Install h5py only when running python-examples.

* Fix lint complaint.

* Apply suggestions from code review
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Doc fixes, remove unused params_str in __init_from_seqs.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Remove unnecessary conda install in windows ci script.

* Keep param as example in dataset_from_multi_hdf5.py

* Add _get_sample_count function to remove code duplication.

* Use batch_size parameter in generate_hdf.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Fix after applying suggestions.

* Fix test, check idx is instance of numbers.Integral.

* Update python-package/lightgbm/basic.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Expose Sequence class in Python-API doc.

* Handle Sequence object not having batch_size.

* Fix isort lint complaint.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update docstring to mention Sequence as data input.

* Remove get_one_line in test_basic.py

* Make Sequence an abstract class.

* Reduce number of tests for test_sequence.

* Add c_api: LGBM_SampleCount, fix potential bug in LGBMSampleIndices.

* empty commit to trigger ci

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Rename to LGBM_GetSampleCount, change LGBM_SampleIndices out_len to int32_t.

Also rename total_nrow to num_total_row in c_api.h for consistency.

* Doc about Sequence in docs/Python-Intro.rst.

* Fix: basic.py change LGBM_SampleIndices out_len to int32.

* Add create_valid test case with Dataset from Sequence.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Apply suggestions from code review
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>

* Remove no longer used DEFAULT_BIN_CONSTRUCT_SAMPLE_CNT.

* Update python-package/lightgbm/basic.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Willian Zhang <willian@willian.email>
Co-authored-by: Willian Z <Willian@Willian-Zhang.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

c359896e

28 Jun, 2021 1 commit
- [CUDA] fix CUDA memory error by reducing block number (fixed #4315) (#4327) · 77d9529d
  Robin Dong authored Jun 28, 2021
  
  77d9529d
26 Jun, 2021 1 commit
- fix param aliases (#4387) · aab8fc18
  Nikita Titov authored Jun 26, 2021
  
  aab8fc18
25 Jun, 2021 1 commit
- sync for init score of binary objective function (#4332) · 0701a32d
  Arcs authored Jun 25, 2021
```
Co-authored-by: 未闲 <weixian.lzf@antfin.com>
```
  0701a32d
03 Jun, 2021 2 commits

Add linear leaf models to json output (fixes #4186) (#4329) · 1b5bec00

Belinda Trotta authored Jun 03, 2021



* Add linear leaf models to json output

* Add closing bracket

* Move test into test_engine.py and add asserts

* Update tests/python_package_test/test_engine.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update tests/python_package_test/test_engine.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update tests/python_package_test/test_engine.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

1b5bec00

skip empty bin when calculating cnt_in_bin in BinMapper::FindBin (fix #4301) (#4325) · 3dd4a3f9
shiyu1994 authored Jun 03, 2021

3dd4a3f9

26 May, 2021 1 commit
- fix GatherInfoForThresholdNumerical boundary (fix #4286) (#4322) · 346f8839
  shiyu1994 authored May 26, 2021
  
  346f8839
21 May, 2021 1 commit

fix calculation of weighted gamma loss (fixes #4174) (#4283) · 4b1b4124

Michael Mayer authored May 21, 2021

* fixed weighted gamma obj

* added unit tests

* fixing linter errors

* another linter

* set seed

* fix linter (integer seed)

4b1b4124

18 May, 2021 1 commit
- Replace division of exponential in Gamma loss (#4289) · 32fec820
  Christian Lorentzen authored May 18, 2021
  
  32fec820
10 May, 2021 1 commit
- [docs] remove extra spaces in comments and docs (#4269) · a8ee487a
  James Lamb authored May 10, 2021
  
  a8ee487a
07 May, 2021 1 commit

Precise text file parsing (#4081) · f8318088

Chen Yufei authored May 07, 2021



* New build option: USE_PRECISE_TEXT_PARSER.

Use fast_double_parser for text file parsing. For each number, fallback
to strtod in case of parse failure.

* Add benchmark for CSVParser with Atof and AtofPrecise.

* Fix lint complaint.

* Fix typo in open result error message.

* Revert "Fix lint complaint."

This reverts commit 92ab0b6bce9f17d7be9eaeb20f19d4a0a36f0387.

* Revert "Add benchmark for CSVParser with Atof and AtofPrecise."

This reverts commit 4f8639abd06c679d4382eb715a1793afd94df3d2.

* Use AtofPrecise in Common::__StringToTHelper.

* [option] precise_float_parser: precise float number parsing for text input.

* Remove USE_PRECISE_TEXT_PARSER compile option.

* test: add test for Common::AtofPrecise.

* test: remove ChunkedArrayTest with 0 length.

This triggers Log::Fatal which aborts the test program.

* fix lint, add copyright.

* Revert "test: remove ChunkedArrayTest with 0 length."

This reverts commit 346c76affe9e78b6ca2738c4a56dbb9c00f31102.

* Use LightGBM::Common::Sign

* save precise_float_parser in model file.

* Fix error checking in AtofPrecise. Add more test cases.

* Remove test case that can't pass under macOS.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

f8318088
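
A minimal C++ sketch of the "fast parser first, strtod on failure" pattern described in the first bullet of this commit. It does not use fast_double_parser; `TryParseSimpleDecimal` and `ParseDouble` are hypothetical names, and the fast path here is a deliberately narrow stand-in that only accepts plain decimals with at most 15 digits.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical fast path (a stand-in, not fast_double_parser): accepts plain
// decimals such as "-12.375" with at most 15 digits in total. In that range the
// digits fit exactly in a double and one division by an exact power of ten is
// correctly rounded. Returns false for anything else (exponents, inf, nan, junk).
bool TryParseSimpleDecimal(const char* text, double* out) {
  static const double kPow10[] = {1e0, 1e1, 1e2,  1e3,  1e4,  1e5,  1e6,  1e7,
                                  1e8, 1e9, 1e10, 1e11, 1e12, 1e13, 1e14, 1e15};
  const char* p = text;
  bool negative = false;
  if (*p == '-') {
    negative = true;
    ++p;
  }
  std::uint64_t mantissa = 0;
  int digits = 0;
  int frac_digits = 0;
  bool seen_point = false;
  for (; *p != '\0'; ++p) {
    if (*p == '.' && !seen_point) {
      seen_point = true;
      continue;
    }
    if (*p < '0' || *p > '9') return false;  // not a plain decimal
    if (++digits > 15) return false;         // keep the mantissa below 2^53
    mantissa = mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
    if (seen_point) ++frac_digits;
  }
  if (digits == 0) return false;
  const double value = static_cast<double>(mantissa) / kPow10[frac_digits];
  *out = negative ? -value : value;
  return true;
}

// Try the fast path; fall back to strtod when it declines the input.
double ParseDouble(const char* text) {
  double value = 0.0;
  if (TryParseSimpleDecimal(text, &value)) return value;
  return std::strtod(text, nullptr);
}
```

With this sketch, `ParseDouble("12.375")` is handled by the fast path, while `ParseDouble("1e-3")` is rejected by it and goes through `strtod`.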

04 May, 2021 2 commits

fix param name (#4253) · fcd24535
Nikita Titov authored May 05, 2021
```
* fix param name

* Update gpu_tree_learner.h

* Update gbdt.h
```
fcd24535

Correct spelling (#4250) · e79716e0

Andrew Ziem authored May 04, 2021



* Correct spelling

Most changes were in comments, and there were a few changes to literals for log output.

There were no changes to variable names, function names, IDs, or functionality.

* Clarify a phrase in a comment
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Clarify a phrase in a comment
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Clarify a phrase in a comment
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Correct spelling

Most are code comments, but one case is a literal in a logging message.

There are a few grammar fixes too.
Co-authored-by: James Lamb <jaylamb20@gmail.com>

e79716e0

29 Apr, 2021 1 commit
- show specific error message in TCP accept/send/receive logs (#4128) · f97aa86e
  James Lamb authored Apr 28, 2021
  
  f97aa86e
27 Apr, 2021 1 commit
- Fix typo in binary file already exists error message. (#4231) · d5c2c556
  Chen Yufei authored Apr 27, 2021
  
  d5c2c556
23 Apr, 2021 1 commit
- added aliases to params (#4205) · 8b477ba3
  Nikita Titov authored Apr 23, 2021
  
  8b477ba3
22 Apr, 2021 1 commit
- when a leaf has no local data, its histogram should be cleared (#4185) · 0a847efe
  shiyu1994 authored Apr 22, 2021
  
  0a847efe
15 Apr, 2021 1 commit
- fix: Dataset::CreateValid init fields which saves to binary (#4177) · 98e5a210
  Chen Yufei authored Apr 16, 2021
  
  98e5a210
11 Apr, 2021 1 commit

enforce interaction constraints with monotone_constraints_method = intermediate/advanced (#4043) · 9e1d7fa1

Christoph Aymanns authored Apr 11, 2021



* add test for interaction constraints and monotone constraints

* enforce interaction constraints in RecomputeBestSplitForLeaf

* code formatting

* code formatting

* move interaction constraint test to test_engine

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

9e1d7fa1

05 Apr, 2021 1 commit
- clarify DEBUG-level log about tree depth (#4126) · 6d825cd3
  James Lamb authored Apr 05, 2021
```
* clarify DEBUG-level log about tree depth

* more places
```
  6d825cd3
24 Mar, 2021 1 commit
- fix tcp_no_deplay type by using int (#4058) · c591b77e
  htgeis authored Mar 25, 2021
  
  c591b77e
17 Mar, 2021 1 commit

Range check for DCG position discount lookup (#4069) · 4580393f

ashok-ponnuswami-msft authored Mar 17, 2021

* Add check to prevent out of index lookup in the position discount table. Add debug logging to report number of queries found in the data.

* Change debug logging location so that we can print the data file name as well.

* Revert "Change debug logging location so that we can print the data file name as well."

This reverts commit 3981b34bd6e0530f89c4733e78e6b6603bf50d48.

* Add data file name to debug logging.

* Move log line to a place where it is output even when query IDs are read from a separate file.

* Also add the out-of-range check to rank metrics.

* Perform check after number of queries is initialized.

* Update

4580393f
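
A minimal C++ sketch of the kind of range check the first bullet of this commit describes: refuse to read past the end of a precomputed position discount table. `BuildDiscountTable` and `DiscountAt` are hypothetical names, and the discount formula 1 / log2(2 + i) is the common DCG convention, assumed here rather than taken from the commit.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: precompute discounts for positions 0..max_position-1,
// assuming the common DCG discount 1 / log2(2 + position).
std::vector<double> BuildDiscountTable(std::size_t max_position) {
  std::vector<double> table(max_position);
  for (std::size_t i = 0; i < max_position; ++i) {
    table[i] = 1.0 / std::log2(2.0 + static_cast<double>(i));
  }
  return table;
}

// Hypothetical checked lookup: fail loudly instead of indexing out of range.
double DiscountAt(const std::vector<double>& table, std::size_t position) {
  if (position >= table.size()) {
    throw std::out_of_range("position " + std::to_string(position) +
                            " is outside the discount table of size " +
                            std::to_string(table.size()));
  }
  return table[position];
}
```

In this sketch a query with more items than the table has positions raises an error at lookup time rather than reading invalid memory.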

12 Mar, 2021 1 commit
- set is_linear_ to false when it is absent from the model file (fix #3778) (#4056) · ec4bd1e0
  shiyu1994 authored Mar 13, 2021
  
  ec4bd1e0
23 Feb, 2021 1 commit
- [DOCS] Update docs to note that pred_contrib is not available for linear trees (#4006) · b09c1ff7
  Belinda Trotta authored Feb 24, 2021
```
* Update docs to note that pred_contrib is not available for linear trees

* Add warning in code

* Change warning to error
```
  b09c1ff7
21 Feb, 2021 1 commit

Fix evaluation of linear trees with a single leaf. (#3987) · 605c97b5

mjmckp authored Feb 22, 2021



* Fix index out-of-range exception generated by BaggingHelper on small datasets.

Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.

* Update goss.hpp

* Update goss.hpp

* Add API method LGBM_BoosterPredictForMats, which runs prediction on a data set given as an array of pointers to rows (as opposed to the existing method LGBM_BoosterPredictForMat, which requires data given as a contiguous array)

* Fix incorrect upstream merge

* Add link to LightGBM.NET

* Fix indenting to 2 spaces

* Dummy edit to trigger CI

* Dummy edit to trigger CI

* remove duplicate functions from merge

* Fix evaluation of linear trees with a single leaf.

Note that trees without linear models at the leaf always handle num_leaves = 1 as a special case and directly output the leaf value.  Linear trees were missing this special case handling, and hence would have the following issues:
 * Calling Tree::Predict or Tree::PredictByMap would cause an access violation exception attempting to access the first value of the empty split_feature_ array in GetLeaf.
 * PredictionFunLinear would either cause an access violation or go into an infinite loop when attempting to do the equivalent of GetLeaf.

Note also that PredictionFun does not need the same changes as PredictionFunLinear, since both are only called by Tree::AddPredictionToScore, which has a special case for (!is_linear_ && num_leaves_ <= 1) that precludes calling PredictionFun.
Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>

605c97b5
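
A minimal C++ sketch of the single-leaf special case described in this commit, using a simplified, hypothetical `TinyLinearTree` rather than LightGBM's `Tree` class. With one leaf the split arrays are empty, so the traversal in `GetLeaf` must be skipped. The child encoding (negative values store `~leaf_index`) is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified linear tree; field names are illustrative only.
struct TinyLinearTree {
  int num_leaves = 1;
  std::vector<int> split_feature;  // one entry per internal node; empty for a single leaf
  std::vector<double> threshold;
  std::vector<int> left_child;     // >= 0: internal node index, < 0: ~leaf_index
  std::vector<int> right_child;
  std::vector<double> leaf_const;                 // per-leaf intercept
  std::vector<std::vector<int>> leaf_features;    // per-leaf feature indices
  std::vector<std::vector<double>> leaf_coeff;    // per-leaf coefficients
};

// Walks the splits; reads split_feature[0] first, so it requires num_leaves > 1.
int GetLeaf(const TinyLinearTree& tree, const std::vector<double>& row) {
  int node = 0;
  while (node >= 0) {
    const std::size_t n = static_cast<std::size_t>(node);
    const double value = row[static_cast<std::size_t>(tree.split_feature[n])];
    node = (value <= tree.threshold[n]) ? tree.left_child[n] : tree.right_child[n];
  }
  return ~node;
}

double PredictLinear(const TinyLinearTree& tree, const std::vector<double>& row) {
  // Special case: a single-leaf tree has no splits, so go straight to leaf 0.
  const std::size_t leaf =
      (tree.num_leaves <= 1) ? 0 : static_cast<std::size_t>(GetLeaf(tree, row));
  double output = tree.leaf_const[leaf];
  for (std::size_t j = 0; j < tree.leaf_features[leaf].size(); ++j) {
    const std::size_t feature = static_cast<std::size_t>(tree.leaf_features[leaf][j]);
    output += tree.leaf_coeff[leaf][j] * row[feature];
  }
  return output;
}
```

Without the `num_leaves <= 1` branch, `GetLeaf` would index the empty `split_feature` vector, which mirrors the access violation the commit message reports.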

19 Feb, 2021 1 commit

Use high precision conversion from double to string in Tree::ToString() for new linear tree members (#3938) · 7f91dc66

mjmckp authored Feb 20, 2021

* Fix index out-of-range exception generated by BaggingHelper on small datasets.

Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.

* Update goss.hpp

* Update goss.hpp

* Add API method LGBM_BoosterPredictForMats, which runs prediction on a data set given as an array of pointers to rows (as opposed to the existing method LGBM_BoosterPredictForMat, which requires data given as a contiguous array)

* Fix incorrect upstream merge

* Add link to LightGBM.NET

* Fix indenting to 2 spaces

* Dummy edit to trigger CI

* Dummy edit to trigger CI

* remove duplicate functions from merge

* In Tree::ToString() method, print double values for linear tree models with high precision, so that the tree may be accurately reproduced elsewhere (LightGBM.Net in particular)

* Need to use more precise StringToArray instead of StringToArrayFast when parsing double valued arrays for linear trees, to ensure models round-trip via string or file correctly.
Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>
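
A minimal C++ sketch of one way to guard the `tmp_gradients[top_k - 1]` lookup mentioned in the BaggingHelper bullet above. `TopKThreshold` is a hypothetical helper, and returning 0 for empty input is an assumption of this sketch, not necessarily what the actual fix in goss.hpp does.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: returns the top_k-th largest value of `values`.
// Takes the vector by value because nth_element reorders its input.
float TopKThreshold(std::vector<float> values, std::size_t top_k) {
  // Guard: with no data (cnt == 0) there is no element at index top_k - 1.
  if (values.empty() || top_k == 0) return 0.0f;
  top_k = std::min(top_k, values.size());
  std::nth_element(values.begin(),
                   values.begin() + static_cast<std::ptrdiff_t>(top_k - 1),
                   values.end(), std::greater<float>());
  return values[top_k - 1];
}
```

The early return is the point of the sketch: the empty case is handled before any indexing happens.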

7f91dc66
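
A minimal C++ sketch of why high-precision formatting matters for round-tripping doubles through a string, as the last two bullets of this commit describe. It uses only the standard library and is not LightGBM's conversion code; `ToStringPrecise`, `ToStringDefault`, and `FromString` are hypothetical names.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Formats with max_digits10 significant digits (17 for IEEE-754 double),
// which is enough for the text to parse back to the same double.
std::string ToStringPrecise(double value) {
  std::ostringstream out;
  out << std::setprecision(std::numeric_limits<double>::max_digits10) << value;
  return out.str();
}

// Formats with the stream's default precision (6 significant digits),
// which can lose information.
std::string ToStringDefault(double value) {
  std::ostringstream out;
  out << value;
  return out.str();
}

// Parses the text back into a double.
double FromString(const std::string& text) { return std::stod(text); }
```

For a value such as `0.1 + 0.2`, the default formatting writes `0.3`, which parses back to a different double, while the 17-digit form parses back to the original value.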