Commits · 680f4b081ed73fb16cf59b86a76ff267a85ca5a5 · tianlh / LightGBM-DCU

30 Jul, 2022 1 commit

reproducible parameter alias resolution for wrappers (fixes #5304) (#5338) · 83627ff0

José Morales authored Jul 30, 2022

* dump sorted parameter aliases

* update lgb.check.wrapper_param

* update _choose_param_value to look like lgb.check.wrapper_param

* apply suggestions from review

* reduce diff

* move DumpAliases to config

* remove unnecessary check

* restore parameter check

83627ff0

29 Jul, 2022 1 commit

[CUDA] Initial work for boosting and evaluation with CUDA (#5279) · e0af160a

shiyu1994 authored Jul 29, 2022

* initial work for boosting and evaluation with CUDA

* fix compatibility with CPU code

* fix creating objective without USE_CUDA_EXP

* fix static analysis errors

* fix static analysis errors

e0af160a

21 Jul, 2022 1 commit

fix: Adjust LGBM_DatasetCreateFromSampledColumn to handle distributed data (#5344) · f94050a4

Scott Votaw authored Jul 21, 2022

* Adjust LGBM_DatasetCreateFromSampledColumn to handle distributed data better

* linting fix

* switch to 1 API with breaking change

* Fix python native call

* more python test fixes

f94050a4

27 Jun, 2022 1 commit

[python-package] check feature names in predict with dataframe (fixes #812) (#4909) · bdb02e05

José Morales authored Jun 27, 2022



* check feature names and order in predict with dataframe

* slice df in predict to remove the target

* scramble features

* handle int column names

* only change column order when needed

* include validate_features param in booster and sklearn estimators

* document validate_features argument

* use all_close in preds checks and check for assertion error to compare different arrays

* perform remapping and checks in cpp

* remove extra logs

* fixes

* revert cpp

* proposal

* remove extra arg

* lint

* restore _data_from_pandas arguments

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* move data conversion to Predictor.predict

* use Vector2Ptr
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

bdb02e05

29 May, 2022 1 commit

Remove leftovers after the drop of Solaris support (#5248) · fb37e507

Nikita Titov authored May 29, 2022

* Update tree.cpp

* Update common.h

* Update common.h

fb37e507

22 May, 2022 1 commit

remove support for Solaris (fixes #5216) (#5226) · b0774151

James Lamb authored May 21, 2022

b0774151

15 Apr, 2022 1 commit

[docs] Fix formula in path smoothing docs (fixes #5139)(#5154) · fc0c8fd4

Samuel Wilson authored Apr 15, 2022

fc0c8fd4

10 Apr, 2022 1 commit

[docs] Document behaviour of the first linear estimator (#5132) · 5f57d6c6

Pablo Dávila Herrero authored Apr 10, 2022



* Document behaviour of the first linear estimator

* Properly update docs
Co-authored-by: Pablo-Davila <Pablo-Davila@users.noreply.github.com>

5f57d6c6

27 Mar, 2022 1 commit

Log warnings for number of bins of categorical features (#4448) · d163c2c1

shiyu1994 authored Mar 28, 2022

* log warnings when number of bins of categorical features exceeds the configured maximum number of bins

* log only one warning information for all categorical features

* Add #include <memory> for unique_ptr

* remove useless param description

d163c2c1

26 Mar, 2022 1 commit

Load initial scores with binary data files in CLI version (#4807) · 17d4e007

shiyu1994 authored Mar 27, 2022

17d4e007

23 Mar, 2022 1 commit

[CUDA] New CUDA version Part 1 (#4630) · 6b56a90c

shiyu1994 authored Mar 23, 2022



* new cuda framework

* add histogram construction kernel

* before removing multi-gpu

* new cuda framework

* tree learner cuda kernels

* single tree framework ready

* single tree training framework

* remove comments

* boosting with cuda

* optimize for best split find

* data split

* move boosting into cuda

* parallel synchronize best split point

* merge split data kernels

* before code refactor

* use tasks instead of features as units for split finding

* refactor cuda best split finder

* fix configuration error with small leaves in data split

* skip histogram construction of too small leaf

* skip split finding of invalid leaves

stop when no leaf to split

* support row wise with CUDA

* copy data for split by column

* copy data from host to CPU by column for data partition

* add synchronize best splits for one leaf from multiple blocks

* partition dense row data

* fix sync best split from task blocks

* add support for sparse row wise for CUDA

* remove useless code

* add l2 regression objective

* sparse multi value bin enabled for CUDA

* fix cuda ranking objective

* support for number of items <= 2048 per query

* speedup histogram construction by interleaving global memory access

* split optimization

* add cuda tree predictor

* remove comma

* refactor objective and score updater

* before use struct

* use structure for split information

* use structure for leaf splits

* return CUDASplitInfo directly after finding best split

* split with CUDATree directly

* use cuda row data in cuda histogram constructor

* clean src/treelearner/cuda

* gather shared cuda device functions

* put shared CUDA functions into header file

* change smaller leaf from <= back to < for consistent result with CPU

* add tree predictor

* remove useless cuda_tree_predictor

* predict on CUDA with pipeline

* add global sort algorithms

* add global argsort for queries with many items in ranking tasks

* remove limitation of maximum number of items per query in ranking

* add cuda metrics

* fix CUDA AUC

* remove debug code

* add regression metrics

* remove useless file

* don't use mask in shuffle reduce

* add more regression objectives

* fix cuda mape loss

add cuda xentropy loss

* use template for different versions of BitonicArgSortDevice

* add multiclass metrics

* add ndcg metric

* fix cross entropy objectives and metrics

* fix cross entropy and ndcg metrics

* add support for customized objective in CUDA

* complete multiclass ova for CUDA

* separate cuda tree learner

* use shuffle based prefix sum

* clean up cuda_algorithms.hpp

* add copy subset on CUDA

* add bagging for CUDA

* clean up code

* copy gradients from host to device

* support bagging without using subset

* add support of bagging with subset for CUDAColumnData

* add support of bagging with subset for dense CUDARowData

* refactor copy sparse subrow

* use copy subset for column subset

* add reset train data and reset config for CUDA tree learner

add destructors for cuda tree learner

* add USE_CUDA ifdef to cuda tree learner files

* check that dataset doesn't contain CUDA tree learner

* remove printf debug information

* use full new cuda tree learner only when using single GPU

* disable all CUDA code when using CPU version

* recover main.cpp

* add cpp files for multi value bins

* update LightGBM.vcxproj

* update LightGBM.vcxproj

fix lint errors

* fix lint errors

* fix lint errors

* update Makevars

fix lint errors

* fix the case with 0 feature and 0 bin

fix split finding for invalid leaves

create cuda column data when loaded from bin file

* fix lint errors

hide GetRowWiseData when cuda is not used

* recover default device type to cpu

* fix na_as_missing case

fix cuda feature meta information

* fix UpdateDataIndexToLeafIndexKernel

* create CUDA trees when needed in CUDADataPartition::UpdateTrainScore

* add refit by tree for cuda tree learner

* fix test_refit in test_engine.py

* create set of large bin partitions in CUDARowData

* add histogram construction for columns with a large number of bins

* add find best split for categorical features on CUDA

* add bitvectors for categorical split

* cuda data partition split for categorical features

* fix split tree with categorical feature

* fix categorical feature splits

* refactor cuda_data_partition.cu with multi-level templates

* refactor CUDABestSplitFinder by grouping task information into struct

* pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder

* fix misuse of reference

* remove useless changes

* add support for path smoothing

* virtual destructor for LightGBM::Tree

* fix overlapped cat threshold in best split infos

* reset histogram pointers in data partition and split finder in ResetConfig

* comment useless parameter

* fix reverse case when na is missing and default bin is zero

* fix mfb_is_na and mfb_is_zero and is_single_feature_column

* remove debug log

* fix cat_l2 when one-hot

fix gradient copy when data subset is used

* switch shared histogram size according to CUDA version

* gpu_use_dp=true when cuda test

* revert modification in config.h

* fix setting of gpu_use_dp=true in .ci/test.sh

* fix linter errors

* fix linter error

remove useless change

* recover main.cpp

* separate cuda_exp and cuda

* fix ci bash scripts

add description for cuda_exp

* add USE_CUDA_EXP flag

* switch off USE_CUDA_EXP

* revert changes in python-packages

* more careful separation for USE_CUDA_EXP

* fix CUDARowData::DivideCUDAFeatureGroups

fix set fields for cuda metadata

* revert config.h

* fix test settings for cuda experimental version

* skip some tests due to unsupported features or differences in implementation details for CUDA Experimental version

* fix lint issue by adding a blank line

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* fix lint errors by resorting imports

* merge cuda.yml and cuda_exp.yml

* update python version in cuda.yml

* remove cuda_exp.yml

* remove unrelated changes

* fix compilation warnings

fix cuda exp ci task name

* recover task

* use multi-level template in histogram construction

check split only in debug mode

* ignore NVCC related lines in parameter_generator.py

* update job name for CUDA tests

* apply review suggestions

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .github/workflows/cuda.yml
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update header

* remove useless TODOs

* remove [TODO(shiyu1994): constrain the split with min_data_in_group] and record in #5062

* #include <LightGBM/utils/log.h> for USE_CUDA_EXP only

* fix include order

* fix include order

* remove extra space

* address review comments

* add warning when cuda_exp is used together with deterministic

* add comment about gpu_use_dp in .ci/test.sh

* revert changing order of included headers
Co-authored-by: Yu Shi <shiyu1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

6b56a90c

15 Mar, 2022 1 commit

[c-api][python-package][R-package] expose feature num bin (#5048) · d10372e2

José Morales authored Mar 14, 2022



* expose FeatureNumBin in C api

* parametrize min_data_in_bin and add test with max_bin_by_feature

* include feature_num_bin in R package

* add suggestion from review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update error message and lint

* lint

* add call method

* minor improvements in tests

* add suggestions from review

* lint

* rename argument to feature in python and r packages
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

d10372e2

24 Feb, 2022 1 commit

Correct documentation for sparse predictions (#4979) · 7e478047

david-cortes authored Feb 24, 2022

* Correct documentation for sparse predictions

The documentation says that the parameter `nindptr` for `LGBM_BoosterPredictSparseOutput` should be the number of rows plus one, but this is incorrect when the input type is CSC. This PR fixes it.

* Update c_api.h

* Update c_api.h

* Update c_api.h

7e478047
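
The `nindptr` correction above follows from how compressed sparse formats store their index pointer: one entry per compressed slice, plus one. A minimal sketch in plain Python, not using LightGBM or its C API (the helper `indptr_for` and the sample matrix are hypothetical, for illustration only), shows that the length is rows plus one for CSR but columns plus one for CSC:

```python
def indptr_for(dense, axis):
    """Build the index pointer of a compressed sparse layout for a dense 2-D list.

    axis=0 compresses along rows (CSR); axis=1 compresses along columns (CSC).
    Hypothetical helper for illustration; not part of LightGBM.
    """
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    if axis == 0:
        slices = dense
    else:
        slices = [[dense[r][c] for r in range(n_rows)] for c in range(n_cols)]
    indptr = [0]
    for s in slices:
        # Each entry is the running count of non-zero values seen so far.
        indptr.append(indptr[-1] + sum(1 for v in s if v != 0))
    return indptr


dense = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0, 4.0],
]
csr_indptr = indptr_for(dense, axis=0)  # one entry per row, plus one
csc_indptr = indptr_for(dense, axis=1)  # one entry per column, plus one
print(len(csr_indptr), len(csc_indptr))
```

For this 2 x 4 matrix the CSR pointer has 3 entries and the CSC pointer has 5, which is why a "number of rows plus one" rule only holds for CSR input.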

23 Feb, 2022 1 commit

[Docs] Weights non-negative for train data (#5013) · 6ced58ad

Miguel Trejo Marrufo authored Feb 22, 2022

* docs: weight parameter non-negative

* docs: weights non negative only for train data

* docs: weights should be non negative for validation data

* typo in html render

* docs: brief weights non-negative description

6ced58ad

20 Feb, 2022 1 commit

[docs] clarify that categorical features will be converted to integers internally (#4959) · 820ae7e6

José Morales authored Feb 20, 2022

* clarify that categoricals will be converted to ints and not that they should be ints in the input data

* update remaining sections

* update config.h

* add suggestions

820ae7e6

14 Feb, 2022 1 commit

document rounding behavior of floating point numbers in categorical features · 2d1caf14

Yu Shi authored Feb 14, 2022

2d1caf14

30 Dec, 2021 1 commit

[python] raise an informative error instead of segfaulting when custom... · af5b40e1

Yaqub Alwan authored Dec 30, 2021


[python] raise an informative error instead of segfaulting when custom objective produces incorrect output (#4815)

* fix for bad grads causing segfault

* adjust checking criteria to properly reflect reality of multi-class classifiers

* fix styling

* Line break before operator

* Update python-package/lightgbm/basic.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update python-package/lightgbm/basic.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* add a note to the C-API docs

* rearrange text slightly

* add some tests to python package

* Update include/LightGBM/c_api.h
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* PR comments

* match argument is a regex and our expression has brackets ..

* rework tests

* isorting imports

* updating test to reflect that the python API does not take preds/labels as a fobj function
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

af5b40e1

03 Dec, 2021 1 commit

Add C API function that returns all parameter names with their aliases (#4829) · cf38071b

Nikita Titov authored Dec 03, 2021



* add C API function that returns all param names with aliases

* add C API function that returns all param names with aliases

* add R code

* test R code

* remove debug CI

* fix R lint

* refactor

* run CI

* fix R

* fix

* revert CI checks

* revert changes in docs

* Try to make function `const`
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* add `const` in cpp file

* address review comments and sync with `master`
Co-authored-by: James Lamb <jaylamb20@gmail.com>

cf38071b

29 Nov, 2021 1 commit

[docs] document that `pred_early_stop` can be used only in normal and raw scores prediction (#4823) · 67b4205c

Nikita Titov authored Nov 29, 2021

67b4205c

16 Nov, 2021 1 commit

Add customized parser support (#4782) · b0137deb

chjinche authored Nov 16, 2021

* add customized parser support

* fix typo of parser_config_file description

* make delimiter as parameter of JoinedLines

b0137deb

15 Nov, 2021 1 commit

[c_api] Improve ANSI compatibility by avoiding <stdbool.h> (#4697) · bfb346c1

Drew Miller authored Nov 15, 2021

* [c_api] Improve ANSI compatibility by avoiding <stdbool.h>

* fixes in response to CI linting

* inline NOLINT instead of separate test

* moving length declaration to non-ANSI C conditional

* [c_api] Align expected return type in `basic.py` with new c_api type.

bfb346c1

11 Nov, 2021 1 commit

Add 'nrounds' as an alias for 'num_iterations' (fixes #4743) (#4746) · 3b6ebd79

Michael Mahoney authored Nov 10, 2021

* Add 'nrounds' as an alias for 'num_iterations'

* Improve tests

* Compare against nrounds directly

* Fix whitespace lints

3b6ebd79

30 Oct, 2021 1 commit

[docs] improve docs about `nthreads` parameter (#4756) · dac0dffe

Nikita Titov authored Oct 31, 2021

* in predict(), respect params set via `set_params()` after fit()

* extract docs changes

dac0dffe

28 Oct, 2021 1 commit

Reset OpenMP thread number if num_threads <= 0 (#4704) · 42914830

Zhiyuan He authored Oct 29, 2021

* mock func for no openmp

* use omp_get_max_threads
Co-authored-by: hzy46 <email@example.com>

42914830

25 Oct, 2021 1 commit

Fix some parameter hints when loading from binary file (#4701) · dc02dcaf

Zhiyuan He authored Oct 25, 2021

Co-authored-by: hzy46 <email@example.com>

dc02dcaf

21 Oct, 2021 1 commit

[docs] fix C API docs rendering (#4688) · d88b4456

Nikita Titov authored Oct 22, 2021

* fix C API docs rendering

* place comments before members they describe

d88b4456

20 Oct, 2021 1 commit

Fix ASAN issues with `std::function` usage (#4673) · 13ed38ca

david-cortes authored Oct 20, 2021

* don't compare std::function to nullptr ref #4633

* Update dataset_loader.h

13ed38ca

05 Oct, 2021 4 commits

remove unused `DCGCalculator::CalDCGAtK()` (#4650) · df8c10ba

James Lamb authored Oct 05, 2021

df8c10ba

allow inclusion in C programs (#4608) · f3037c18

Drew Miller authored Oct 05, 2021

* allow inclusion in C programs

* adding documentation to macro

* Support for ANSI C, _Thread_local where available.

* fix macro for docs

f3037c18

add param aliases from scikit-learn (#4637) · e95d5ab8

Nikita Titov authored Oct 05, 2021

e95d5ab8

remove unused BinMapper::SizeForSpecificBin() (#4643) · e81eaaaf

James Lamb authored Oct 04, 2021

Co-authored-by: Nikita Titov <nekit94-12@hotmail.com>

e81eaaaf

23 Sep, 2021 1 commit

move Network method implementations from network.h to network.cpp (fixes #4464) (#4496) · e1572794

James Lamb authored Sep 22, 2021

e1572794

17 Sep, 2021 1 commit

[R-package] Fix R memory leaks (fixes #4282, fixes #3462) (#4597) · eda0d3ca

david-cortes authored Sep 17, 2021

* fix R memory leaks

* attempt at solving linter complaints

* fix compilation on windows

* move R_API_BEGIN to correct place

* make sure exception objects reach out of scope

* better way to solve rchk complaints

* remove goto statement

eda0d3ca

20 Aug, 2021 1 commit

consolidate duplicate conditions in TextReader (#4530) · a926d4fe

James Lamb authored Aug 20, 2021

a926d4fe

03 Aug, 2021 1 commit

Update c_api LGBM_SampleIndices() comment. (#4490) · 1dbf4382

Chen Yufei authored Aug 04, 2021



* Update c_api LGBM_SampleIndices() comment.

rand.Sample() now returns exactly given number of samples, thus the
comment should be fixed.

* Update include/LightGBM/c_api.h
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

1dbf4382
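
The commit above notes that `rand.Sample()` now returns exactly the requested number of samples. The source does not show the code in `random.h`, so the following is only a sketch of one standard technique with that property, reservoir sampling (Algorithm R); the function name `reservoir_sample` and its `seed` parameter are assumptions made for illustration, not LightGBM's API:

```python
import random


def reservoir_sample(n, k, seed=0):
    """Return exactly min(n, k) distinct indices drawn from range(n), sorted.

    Illustrative Algorithm R sketch; not LightGBM's implementation.
    """
    rng = random.Random(seed)
    # Fill the reservoir with the first k indices (or all n if n < k).
    reservoir = list(range(min(n, k)))
    for i in range(k, n):
        # Pick a slot in [0, i]; index i replaces an existing item
        # only when the slot falls inside the reservoir.
        j = rng.randint(0, i)
        if j < k:
            reservoir[j] = i
    return sorted(reservoir)


sample = reservoir_sample(n=1000, k=10)
print(len(sample))
```

Because the reservoir is filled up front and items are only ever replaced, the output size is fixed by `k` (or by `n` when fewer rows exist), unlike per-row coin-flip sampling, whose output size varies from run to run.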

25 Jul, 2021 1 commit

[docs] document CLI behavior when label_column is omitted (#4485) · fdc582ea

James Lamb authored Jul 24, 2021

fdc582ea

21 Jul, 2021 1 commit

Fix undefined behavior with NaN input in `CategoricalDecision()` (#4468) · 0012fc28

Philip Hyunsu Cho authored Jul 21, 2021

* Fix undefined behavior with NaN input in CategoricalDecision()

* Always associate the right child with NaN inputs

0012fc28

09 Jul, 2021 1 commit

[docs] clarify description of prediction early stopping (#4411) · 0d1d12fb

Nikita Titov authored Jul 09, 2021

0d1d12fb

07 Jul, 2021 1 commit

fix Reservoir Sampling in Sample of random.h (#4450) · a06899ab

shiyu1994 authored Jul 07, 2021

a06899ab

02 Jul, 2021 1 commit

[python-package] Create Dataset from multiple data files (#4089) · c359896e

Chen Yufei authored Jul 02, 2021

* [python-package] create Dataset from sampled data.

* [python-package] create Dataset from List[Sequence].

1. Use random access for data sampling
2. Support read data from multiple input files
3. Read data in batch so no need to hold all data in memory

* [python-package] example: create Dataset from multiple HDF5 file.

* fix: revert is_class implementation for seq

* fix: unwanted memory view reference for seq

* fix: seq is_class accepts sklearn matrices

* fix: requirements for example

* fix: pycode

* feat: print static code linting stage

* fix: linting: avoid shell str regex conversion

* code style: doc style

* code style: isort

* fix ci dependency: h5py on windows

* [py] remove rm files in test seq
https://github.com/microsoft/LightGBM/pull/4089#discussion_r612929623

* docs(python): init_from_sample summary

https://github.com/microsoft/LightGBM/pull/4089#discussion_r612903389



* remove dataset dump sample data debugging code.

* remove typo fix.

Create separate PR for this.

* fix typo in src/c_api.cpp
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* style(linting): py3 type hint for seq

* test(basic): os.path style path handling

* Revert "feat: print static code linting stage"

This reverts commit 10bd79f7f8258bea8e61c3abb8c9c7e4456a916d.

* feat(python): sequence on validation set

* minor(python): comment

* minor(python): test option hint

* style(python): fix code linting

* style(python): add pydoc for ref_dataset

* doc(python): sequence
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>

* revert(python): sequence class abc

* chore(python): remove rm_files

* Remove useless static_assert.

* refactor: test_basic test for sequence.

* fix lint complaint.

* remove dataset._dump_text in sequence test.

* Fix reverting typo fix.

* Apply suggestions from code review
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Fix type hint, code and doc style.

* fix failing test_basic.

* Remove TODO about keep constant in sync with cpp.

* Install h5py only when running python-examples.

* Fix lint complaint.

* Apply suggestions from code review
Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Doc fixes, remove unused params_str in __init_from_seqs.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Remove unnecessary conda install in windows ci script.

* Keep param as example in dataset_from_multi_hdf5.py

* Add _get_sample_count function to remove code duplication.

* Use batch_size parameter in generate_hdf.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Fix after applying suggestions.

* Fix test, check idx is instance of numbers.Integral.

* Update python-package/lightgbm/basic.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Expose Sequence class in Python-API doc.

* Handle Sequence object not having batch_size.

* Fix isort lint complaint.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update docstring to mention Sequence as data input.

* Remove get_one_line in test_basic.py

* Make Sequence an abstract class.

* Reduce number of tests for test_sequence.

* Add c_api: LGBM_SampleCount, fix potential bug in LGBMSampleIndices.

* empty commit to trigger ci

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Rename to LGBM_GetSampleCount, change LGBM_SampleIndices out_len to int32_t.

Also rename total_nrow to num_total_row in c_api.h for consistency.

* Doc about Sequence in docs/Python-Intro.rst.

* Fix: basic.py change LGBM_SampleIndices out_len to int32.

* Add create_valid test case with Dataset from Sequence.

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Apply suggestions from code review
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>

* Remove no longer used DEFAULT_BIN_CONSTRUCT_SAMPLE_CNT.

* Update python-package/lightgbm/basic.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Willian Zhang <willian@willian.email>
Co-authored-by: Willian Z <Willian@Willian-Zhang.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

c359896e