1. 30 Dec, 2021 1 commit
    • Yaqub Alwan's avatar
      [python] raise an informative error instead of segfaulting when custom... · af5b40e1
      Yaqub Alwan authored
      
      [python] raise an informative error instead of segfaulting when custom objective produces incorrect output (#4815)
      
      * fix for bad grads causing segfault
      
      * adjust checking criteria to properly reflect reality of multi-class classifiers
      
      * fix styling
      
      * Line break before operator
      
      * Update python-package/lightgbm/basic.py
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Update python-package/lightgbm/basic.py
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * add a note to the C-API docs
      
      * rearrange text s;ightly
      
      * add some tests to python package
      
      * Update include/LightGBM/c_api.h
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * PR comments
      
      * match argument is a regex and our expression has brackets ..
      
      * rework tests
      
      * isorting imports
      
      * updating test to relfect that the python APi does not take pres/labels as a fobj function
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      af5b40e1
  2. 03 Dec, 2021 1 commit
  3. 16 Nov, 2021 1 commit
  4. 07 Oct, 2021 1 commit
  5. 17 Sep, 2021 1 commit
    • José Morales's avatar
      [python-package] Support 2d collections as input for `init_score` in... · f1f5ba15
      José Morales authored
      
      [python-package] Support 2d collections as input for `init_score` in multiclass classification task (#4150)
      
      * initial implementation of init_score for multiclass classification
      
      * check for 1d or 2d collection in init_score
      
      * remove dataset import
      
      * initial comments
      
      * update dask test and docstrings
      
      * update docstrings
      
      * move logic to set_field. reshape back on get_field
      
      * add type hints and update docstrings for dask. fix Dataset.set_field
      
      * revert wrong docstrings and type hints
      
      * add extra comma for consistency
      
      * prefix private functions with underscore
      
      add type hints to new functions
      
      make commas consistent in dask and basic
      
      * add missing spaces after type hint
      
      * remove shape condition for dataframe in is_2d_collection
      Co-authored-by: default avatarNikita Titov <nekit94-12@hotmail.com>
      f1f5ba15
  6. 31 Jul, 2021 1 commit
  7. 30 Jul, 2021 1 commit
  8. 07 Jul, 2021 1 commit
  9. 05 Jul, 2021 1 commit
  10. 04 Jul, 2021 2 commits
  11. 02 Jul, 2021 1 commit
    • Chen Yufei's avatar
      [python-package] Create Dataset from multiple data files (#4089) · c359896e
      Chen Yufei authored
      * [python-package] create Dataset from sampled data.
      
      * [python-package] create Dataset from List[Sequence].
      
      1. Use random access for data sampling
      2. Support read data from multiple input files
      3. Read data in batch so no need to hold all data in memory
      
      * [python-package] example: create Dataset from multiple HDF5 file.
      
      * fix: revert is_class implementation for seq
      
      * fix: unwanted memory view reference for seq
      
      * fix: seq is_class accepts sklearn matrices
      
      * fix: requirements for example
      
      * fix: pycode
      
      * feat: print static code linting stage
      
      * fix: linting: avoid shell str regex conversion
      
      * code style: doc style
      
      * code style: isort
      
      * fix ci dependency: h5py on windows
      
      * [py] remove rm files in test seq
      https://github.com/microsoft/LightGBM/pull/4089#discussion_r612929623
      
      * docs(python): init_from_sample summary
      
      https://github.com/microsoft/LightGBM/pull/4089#discussion_r612903389
      
      
      
      * remove dataset dump sample data debugging code.
      
      * remove typo fix.
      
      Create separate PR for this.
      
      * fix typo in src/c_api.cpp
      Co-authored-by: default avatarJames Lamb <jaylamb20@gmail.com>
      
      * style(linting): py3 type hint for seq
      
      * test(basic): os.path style path handling
      
      * Revert "feat: print static code linting stage"
      
      This reverts commit 10bd79f7f8258bea8e61c3abb8c9c7e4456a916d.
      
      * feat(python): sequence on validation set
      
      * minor(python): comment
      
      * minor(python): test option hint
      
      * style(python): fix code linting
      
      * style(python): add pydoc for ref_dataset
      
      * doc(python): sequence
      Co-authored-by: default avatarshiyu1994 <shiyu_k1994@qq.com>
      
      * revert(python): sequence class abc
      
      * chore(python): remove rm_files
      
      * Remove useless static_assert.
      
      * refactor: test_basic test for sequence.
      
      * fix lint complaint.
      
      * remove dataset._dump_text in sequence test.
      
      * Fix reverting typo fix.
      
      * Apply suggestions from code review
      Co-authored-by: default avatarJames Lamb <jaylamb20@gmail.com>
      
      * Fix type hint, code and doc style.
      
      * fix failing test_basic.
      
      * Remove TODO about keep constant in sync with cpp.
      
      * Install h5py only when running python-examples.
      
      * Fix lint complaint.
      
      * Apply suggestions from code review
      Co-authored-by: default avatarJames Lamb <jaylamb20@gmail.com>
      
      * Doc fixes, remove unused params_str in __init_from_seqs.
      
      * Apply suggestions from code review
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Remove unnecessary conda install in windows ci script.
      
      * Keep param as example in dataset_from_multi_hdf5.py
      
      * Add _get_sample_count function to remove code duplication.
      
      * Use batch_size parameter in generate_hdf.
      
      * Apply suggestions from code review
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Fix after applying suggestions.
      
      * Fix test, check idx is instance of numbers.Integral.
      
      * Update python-package/lightgbm/basic.py
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Expose Sequence class in Python-API doc.
      
      * Handle Sequence object not having batch_size.
      
      * Fix isort lint complaint.
      
      * Apply suggestions from code review
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Update docstring to mention Sequence as data input.
      
      * Remove get_one_line in test_basic.py
      
      * Make Sequence an abstract class.
      
      * Reduce number of tests for test_sequence.
      
      * Add c_api: LGBM_SampleCount, fix potential bug in LGBMSampleIndices.
      
      * empty commit to trigger ci
      
      * Apply suggestions from code review
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Rename to LGBM_GetSampleCount, change LGBM_SampleIndices out_len to int32_t.
      
      Also rename total_nrow to num_total_row in c_api.h for consistency.
      
      * Doc about Sequence in docs/Python-Intro.rst.
      
      * Fix: basic.py change LGBM_SampleIndices out_len to int32.
      
      * Add create_valid test case with Dataset from Sequence.
      
      * Apply suggestions from code review
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Apply suggestions from code review
      Co-authored-by: default avatarshiyu1994 <shiyu_k1994@qq.com>
      
      * Remove no longer used DEFAULT_BIN_CONSTRUCT_SAMPLE_CNT.
      
      * Update python-package/lightgbm/basic.py
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      Co-authored-by: default avatarWillian Zhang <willian@willian.email>
      Co-authored-by: default avatarWillian Z <Willian@Willian-Zhang.com>
      Co-authored-by: default avatarJames Lamb <jaylamb20@gmail.com>
      Co-authored-by: default avatarshiyu1994 <shiyu_k1994@qq.com>
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      c359896e
  12. 21 May, 2021 2 commits
  13. 24 Feb, 2021 1 commit
    • jmoralez's avatar
      [dask][python-package] include support for column array as label (#3943) · 5dacd603
      jmoralez authored
      * include support for column array as label
      
      * remove nested ifs
      
      * fix linting errors
      
      * include tests for sklearn regressors
      
      * include docstring for numpy_1d_array_to_dtype
      
      * include . at end of docstring
      
      * remove pandas import and test for regression, classification and ranking
      
      * check predictions of sklearn models as well
      
      * test training only in dask. drop pandas series tests
      
      * use PANDAS_INSTALLED and pd_Series
      
      * inline imports
      
      * use col array in fit for test_dask
      
      * include review comments
      5dacd603
  14. 16 Feb, 2021 1 commit
  15. 26 Jan, 2021 1 commit
  16. 23 Jan, 2021 1 commit
  17. 15 Jan, 2021 2 commits
  18. 04 Jan, 2021 1 commit
  19. 28 Dec, 2020 1 commit
    • Nikita Titov's avatar
      small code and docs refactoring (#3681) · 5a460846
      Nikita Titov authored
      * small code and docs refactoring
      
      * Update CMakeLists.txt
      
      * Update .vsts-ci.yml
      
      * Update test.sh
      
      * continue
      
      * continue
      
      * revert stable sort for all-unique values
      5a460846
  20. 24 Dec, 2020 1 commit
    • Belinda Trotta's avatar
      Trees with linear models at leaves (#3299) · fcfd4132
      Belinda Trotta authored
      * Add Eigen library.
      
      * Working for simple test.
      
      * Apply changes to config params.
      
      * Handle nan data.
      
      * Update docs.
      
      * Add test.
      
      * Only load raw data if boosting=gbdt_linear
      
      * Remove unneeded code.
      
      * Minor updates.
      
      * Update to work with sk-learn interface.
      
      * Update to work with chunked datasets.
      
      * Throw error if we try to create a Booster with an already-constructed dataset having incompatible parameters.
      
      * Save raw data in binary dataset file.
      
      * Update docs and fix parameter checking.
      
      * Fix dataset loading.
      
      * Add test for regularization.
      
      * Fix bugs when saving and loading tree.
      
      * Add test for load/save linear model.
      
      * Remove unneeded code.
      
      * Fix case where not enough leaf data for linear model.
      
      * Simplify code.
      
      * Speed up code.
      
      * Speed up code.
      
      * Simplify code.
      
      * Speed up code.
      
      * Fix bugs.
      
      * Working version.
      
      * Store feature data column-wise (not fully working yet).
      
      * Fix bugs.
      
      * Speed up.
      
      * Speed up.
      
      * Remove unneeded code.
      
      * Small speedup.
      
      * Speed up.
      
      * Minor updates.
      
      * Remove unneeded code.
      
      * Fix bug.
      
      * Fix bug.
      
      * Speed up.
      
      * Speed up.
      
      * Simplify code.
      
      * Remove unneeded code.
      
      * Fix bug, add more tests.
      
      * Fix bug and add test.
      
      * Only store numerical features
      
      * Fix bug and speed up using templates.
      
      * Speed up prediction.
      
      * Fix bug with regularisation
      
      * Visual studio files.
      
      * Working version
      
      * Only check nans if necessary
      
      * Store coeff matrix as an array.
      
      * Align cache lines
      
      * Align cache lines
      
      * Preallocation coefficient calculation matrices
      
      * Small speedups
      
      * Small speedup
      
      * Reverse cache alignment changes
      
      * Change to dynamic schedule
      
      * Update docs.
      
      * Refactor so that linear tree learner is not a separate class.
      
      * Add refit capability.
      
      * Speed up
      
      * Small speedups.
      
      * Speed up add prediction to score.
      
      * Fix bug
      
      * Fix bug and speed up.
      
      * Speed up dataload.
      
      * Speed up dataload
      
      * Use vectors instead of pointers
      
      * Fix bug
      
      * Add OMP exception handling.
      
      * Change return type of LGBM_BoosterGetLinear to bool
      
      * Change return type of LGBM_BoosterGetLinear back to int, only parameter type needed to change
      
      * Remove unused internal_parent_ property of tree
      
      * Remove unused parameter to CreateTreeLearner
      
      * Remove reference to LinearTreeLearner
      
      * Minor style issues
      
      * Remove unneeded check
      
      * Reverse temporary testing change
      
      * Fix Visual Studio project files
      
      * Restore LightGBM.vcxproj.filters
      
      * Speed up
      
      * Speed up
      
      * Simplify code
      
      * Update docs
      
      * Simplify code
      
      * Initialise storage space for max num threads
      
      * Move Eigen to include directory and delete unused files
      
      * Remove old files.
      
      * Fix so it compiles with mingw
      
      * Fix gpu tree learner
      
      * Change AddPredictionToScore back to const
      
      * Fix python lint error
      
      * Fix C++ lint errors
      
      * Change eigen to a submodule
      
      * Update comment
      
      * Add the eigen folder
      
      * Try to fix build issues with eigen
      
      * Remove eigen files
      
      * Add eigen as submodule
      
      * Fix include paths
      
      * Exclude eigen files from Python linter
      
      * Ignore eigen folders for pydocstyle
      
      * Fix C++ linting errors
      
      * Fix docs
      
      * Fix docs
      
      * Exclude eigen directories from doxygen
      
      * Update manifest to include eigen
      
      * Update build_r to include eigen files
      
      * Fix compiler warnings
      
      * Store raw feature data as float
      
      * Use float for calculating linear coefficients
      
      * Remove eigen directory from GLOB
      
      * Don't compile linear model code when building R package
      
      * Fix doxygen issue
      
      * Fix lint issue
      
      * Fix lint issue
      
      * Remove uneeded code
      
      * Restore delected lines
      
      * Restore delected lines
      
      * Change return type of has_raw to bool
      
      * Update docs
      
      * Rename some variables and functions for readability
      
      * Make tree_learner parameter const in AddScore
      
      * Fix style issues
      
      * Pass vectors as const reference when setting tree properties
      
      * Make temporary storage of serial_tree_learner mutable so we can make the object's methods const
      
      * Remove get_raw_size, use num_numeric_features instead
      
      * Fix typo
      
      * Make contains_nan_ and any_nan_ properties immutable again
      
      * Remove data_has_nan_ property of tree
      
      * Remove temporary test code
      
      * Make linear_tree a dataset param
      
      * Fix lint error
      
      * Make LinearTreeLearner a separate class
      
      * Fix lint errors
      
      * Fix lint error
      
      * Add linear_tree_learner.o
      
      * Simulate omp_get_max_threads if openmp is not available
      
      * Update PushOneData to also store raw data.
      
      * Cast size to int
      
      * Fix bug in ReshapeRaw
      
      * Speed up code with multithreading
      
      * Use OMP_NUM_THREADS
      
      * Speed up with multithreading
      
      * Update to use ArrayToString
      
      * Fix tests
      
      * Fix test
      
      * Fix bug introduced in merge
      
      * Minor updates
      
      * Update docs
      fcfd4132
  21. 30 Oct, 2020 1 commit
  22. 29 Oct, 2020 1 commit
  23. 26 Oct, 2020 1 commit
    • Guolin Ke's avatar
      Fix add features (#2754) · 53977f36
      Guolin Ke authored
      
      
      * fix subset bug
      
      * typo
      
      * add fixme tag
      
      * bin mapper
      
      * fix test
      
      * fix add_features_from
      
      * Update dataset.cpp
      
      * fix merge bug
      
      * added Python merge code
      
      * added test for add_features
      
      * Update dataset.cpp
      
      * Update src/io/dataset.cpp
      
      * continue implementing
      
      * warn users about categorical features
      Co-authored-by: default avatarStrikerRUS <nekit94-12@hotmail.com>
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      53977f36
  24. 11 Aug, 2020 1 commit
  25. 11 Jun, 2020 1 commit
  26. 20 Feb, 2020 1 commit
    • Joan Fontanals's avatar
      Add capability to get possible max and min values for a model (#2737) · 18e7de4f
      Joan Fontanals authored
      
      
      * Add capability to get possible max and min values for a model
      
      * Change implementation to have return value in tree.cpp, change naming to upper and lower bound, move implementation to gdbt.cpp
      
      * Update include/LightGBM/c_api.h
      Co-Authored-By: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Change iteration to avoid potential overflow, add bindings to R and Python and a basic test
      
      * Adjust test values
      
      * Consider const correctness and multithreading protection
      
      * Update test values
      
      * Update test values
      
      * Add test to check that model is exactly the same in all platforms
      
      * Try to parse the model to get the expected values
      
      * Try to parse the model to get the expected values
      
      * Fix implementation, num_leaves can be lower than the leaf_value_ size
      
      * Do not check for num_leaves to be smaller than actual size and get back to test with hardcoded value
      
      * Change test order
      
      * Add gpu_use_dp option in test
      
      * Remove helper test method
      
      * Update src/c_api.cpp
      Co-Authored-By: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Update src/io/tree.cpp
      Co-Authored-By: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Update src/io/tree.cpp
      Co-Authored-By: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Update tests/python_package_test/test_basic.py
      Co-Authored-By: default avatarNikita Titov <nekit94-08@mail.ru>
      
      * Remoove imports
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      18e7de4f
  27. 19 Feb, 2020 1 commit
    • Guolin Ke's avatar
      [python] [R-package] refine the parameters for Dataset (#2594) · 9f79e840
      Guolin Ke authored
      
      
      * reset
      
      * fix a bug
      
      * fix test
      
      * Update c_api.h
      
      * support to no filter features by min_data
      
      * add warning in reset config
      
      * refine warnings for override dataset's parameter
      
      * some cleans
      
      * clean code
      
      * clean code
      
      * refine C API function doxygen comments
      
      * refined new param description
      
      * refined doxygen comments for R API function
      
      * removed stuff related to int8
      
      * break long line in warning message
      
      * removed tests which results cannot be validated anymore
      
      * added test for warnings about unchangeable params
      
      * write parameter from dataset to booster
      
      * consider free_raw_data.
      
      * fix params
      
      * fix bug
      
      * implementing R
      
      * fix typo
      
      * filter params in R
      
      * fix R
      
      * not min_data
      
      * refined tests
      
      * fixed linting
      
      * refine
      
      * pilint
      
      * add docstring
      
      * fix docstring
      
      * R lint
      
      * updated description for C API function
      
      * use param aliases in Python
      
      * fixed typo
      
      * fixed typo
      
      * added more params to test
      
      * removed debug print
      
      * fix dataset construct place
      
      * fix merge bug
      
      * Update feature_histogram.hpp
      
      * add is_sparse back
      
      * remove unused parameters
      
      * fix lint
      
      * add data random seed
      
      * update
      
      * [R-package] centrallized Dataset parameter aliases and added tests on Dataset parameter updating (#2767)
      Co-authored-by: default avatarNikita Titov <nekit94-08@mail.ru>
      Co-authored-by: default avatarJames Lamb <jaylamb20@gmail.com>
      9f79e840
  28. 14 Jan, 2020 1 commit
  29. 10 Jan, 2020 1 commit
    • Patrick Ford's avatar
      [python] Output model to a pandas DataFrame (#2592) · 301402c8
      Patrick Ford authored
      * trees_to_df method and unit test added. PEP 8 fixes for integration.
      
      * Co-Authored-By: Nikita Titov <nekit94-08@mail.ru>
      
      Post-review changes
      
      * changes from second round of reviews from striker
      
      * third round of review. formatting and added 2 more tests
      
      * replaced pandas dot attribute accessor with string attribute accessor
      
      * dealt with single tree edge case and minor refactor of tests
      
      * slight refactor for checking if tree is a single node
      301402c8
  30. 27 Oct, 2019 2 commits
  31. 03 Oct, 2019 1 commit
  32. 26 Sep, 2019 1 commit
  33. 09 Sep, 2019 1 commit
  34. 20 Jun, 2019 1 commit
  35. 04 Apr, 2019 1 commit
    • remcob-gr's avatar
      Add Cost Effective Gradient Boosting (#2014) · 76102284
      remcob-gr authored
      * Add configuration parameters for CEGB.
      
      * Add skeleton CEGB tree learner
      
      Like the original CEGB version, this inherits from SerialTreeLearner.
      Currently, it changes nothing from the original.
      
      * Track features used in CEGB tree learner.
      
      * Pull CEGB tradeoff and coupled feature penalty from config.
      
      * Implement finding best splits for CEGB
      
      This is heavily based on the serial version, but just adds using the coupled penalties.
      
      * Set proper defaults for cegb parameters.
      
      * Ensure sanity checks don't switch off CEGB.
      
      * Implement per-data-point feature penalties in CEGB.
      
      * Implement split penalty and remove unused parameters.
      
      * Merge changes from CEGB tree learner into serial tree learner
      
      * Represent features_used_in_data by a bitset, to reduce the memory overhead of CEGB, and add sanity checks for the lengths of the penalty vectors.
      
      * Fix bug where CEGB would incorrectly penalise a previously used feature
      
      The tree learner did not update the gains of previously computed leaf splits when splitting a leaf elsewhere in the tree.
      This caused it to prefer new features due to incorrectly penalising splitting on previously used features.
      
      * Document CEGB parameters and add them to the appropriate section.
      
      * Remove leftover reference to cegb tree learner.
      
      * Remove outdated diff.
      
      * Fix warnings
      
      * Fix minor issues identified by @StrikerRUS.
      
      * Add docs section on CEGB, including citation.
      
      * Fix link.
      
      * Fix CI failure.
      
      * Add some unit tests
      
      * Fix pylint issues.
      
      * Fix remaining pylint issue
      76102284
  36. 07 Mar, 2019 1 commit