    [python-package] Create Dataset from multiple data files (#4089) · c359896e
    Chen Yufei authored
    * [python-package] create Dataset from sampled data.
    
    * [python-package] create Dataset from List[Sequence].
    
    1. Use random access for data sampling
    2. Support reading data from multiple input files
    3. Read data in batches, so there is no need to hold all data in memory
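
The three points above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code added by this commit: `ListBackedSequence`, `sample_rows`, and `iter_batches` are hypothetical names, the in-memory list stands in for a file such as an HDF5 dataset, and the 4096 fallback assumes the default `batch_size` of `lightgbm.Sequence`. In the real package one would presumably subclass `lightgbm.Sequence` and pass a list of such objects to `lightgbm.Dataset`.

```python
import numbers
import random


class ListBackedSequence:
    """Hypothetical stand-in for one data file (not lightgbm.Sequence itself)."""

    batch_size = 4  # rows fetched per slice; deliberately tiny for this demo

    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        # An integer index returns one row (random access, used for sampling);
        # a slice returns a contiguous batch of rows.
        if isinstance(idx, (numbers.Integral, slice)):
            return self._rows[idx]
        raise TypeError(f"index must be an integer or a slice, got {type(idx).__name__}")


def sample_rows(seqs, sample_cnt, seed=0):
    """Draw sample_cnt rows across all sequences using random access only."""
    total = sum(len(s) for s in seqs)
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(total), min(sample_cnt, total)))
    sampled = []
    for global_idx in picked:
        # Translate the global row index into (sequence, local index).
        for s in seqs:
            if global_idx < len(s):
                sampled.append(s[global_idx])
                break
            global_idx -= len(s)
    return sampled


def iter_batches(seqs):
    """Yield rows batch by batch, so only one batch is held at a time."""
    for s in seqs:
        # Assumed fallback of 4096 when a sequence defines no batch_size.
        size = getattr(s, "batch_size", None) or 4096
        for start in range(0, len(s), size):
            yield s[start:start + size]


# Two "files" holding 10 and 5 rows respectively.
seqs = [
    ListBackedSequence([[i, i * 2] for i in range(10)]),
    ListBackedSequence([[i, i * 2] for i in range(10, 15)]),
]
sample = sample_rows(seqs, sample_cnt=5)
batches = list(iter_batches(seqs))
```

Sampling touches only the selected rows, while the full pass reads each input in slices of at most `batch_size` rows.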
    
    * [python-package] example: create Dataset from multiple HDF5 files.
    
    * fix: revert is_class implementation for seq
    
    * fix: unwanted memory view reference for seq
    
    * fix: seq is_class accepts sklearn matrices
    
    * fix: requirements for example
    
    * fix: pycode
    
    * feat: print static code linting stage
    
    * fix: linting: avoid shell str regex conversion
    
    * code style: doc style
    
    * code style: isort
    
    * fix ci dependency: h5py on windows
    
    * [py] remove rm files in test seq
    https://github.com/microsoft/LightGBM/pull/4089#discussion_r612929623
    
    * docs(python): init_from_sample summary
    
    https://github.com/microsoft/LightGBM/pull/4089#discussion_r612903389

    * remove dataset dump sample data debugging code.
    
    * remove typo fix.
    
    Create separate PR for this.
    
    * fix typo in src/c_api.cpp
    Co-authored-by: James Lamb <jaylamb20@gmail.com>
    
    * style(linting): py3 type hint for seq
    
    * test(basic): os.path style path handling
    
    * Revert "feat: print static code linting stage"
    
    This reverts commit 10bd79f7f8258bea8e61c3abb8c9c7e4456a916d.
    
    * feat(python): sequence on validation set
    
    * minor(python): comment
    
    * minor(python): test option hint
    
    * style(python): fix code linting
    
    * style(python): add pydoc for ref_dataset
    
    * doc(python): sequence
    Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
    
    * revert(python): sequence class abc
    
    * chore(python): remove rm_files
    
    * Remove useless static_assert.
    
    * refactor: test_basic test for sequence.
    
    * fix lint complaint.
    
    * remove dataset._dump_text in sequence test.
    
    * Fix reverting typo fix.
    
    * Apply suggestions from code review
    Co-authored-by: James Lamb <jaylamb20@gmail.com>
    
    * Fix type hint, code and doc style.
    
    * fix failing test_basic.
    
    * Remove TODO about keeping constant in sync with cpp.
    
    * Install h5py only when running python-examples.
    
    * Fix lint complaint.
    
    * Apply suggestions from code review
    Co-authored-by: James Lamb <jaylamb20@gmail.com>
    
    * Doc fixes, remove unused params_str in __init_from_seqs.
    
    * Apply suggestions from code review
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
    
    * Remove unnecessary conda install in windows ci script.
    
    * Keep param as example in dataset_from_multi_hdf5.py
    
    * Add _get_sample_count function to remove code duplication.
    
    * Use batch_size parameter in generate_hdf.
    
    * Apply suggestions from code review
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
    
    * Fix after applying suggestions.
    
    * Fix test, check idx is instance of numbers.Integral.
    
    * Update python-package/lightgbm/basic.py
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
    
    * Expose Sequence class in Python-API doc.
    
    * Handle Sequence object not having batch_size.
    
    * Fix isort lint complaint.
    
    * Apply suggestions from code review
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
    
    * Update docstring to mention Sequence as data input.
    
    * Remove get_one_line in test_basic.py
    
    * Make Sequence an abstract class.
    
    * Reduce number of tests for test_sequence.
    
    * Add c_api: LGBM_SampleCount, fix potential bug in LGBM_SampleIndices.
    
    * empty commit to trigger ci
    
    * Apply suggestions from code review
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
    
    * Rename to LGBM_GetSampleCount, change LGBM_SampleIndices out_len to int32_t.
    
    Also rename total_nrow to num_total_row in c_api.h for consistency.
    
    * Doc about Sequence in docs/Python-Intro.rst.
    
    * Fix: basic.py change LGBM_SampleIndices out_len to int32.
    
    * Add create_valid test case with Dataset from Sequence.
    
    * Apply suggestions from code review
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
    
    * Apply suggestions from code review
    Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
    
    * Remove no longer used DEFAULT_BIN_CONSTRUCT_SAMPLE_CNT.
    
    * Update python-package/lightgbm/basic.py
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

    Co-authored-by: Willian Zhang <willian@willian.email>
    Co-authored-by: Willian Z <Willian@Willian-Zhang.com>
    Co-authored-by: James Lamb <jaylamb20@gmail.com>
    Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
    Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
test_basic.py 19.8 KB