- 04 Mar, 2025 1 commit
-
-
myhloli authored
- Optimize paragraph splitting algorithm for better text block separation - Update fast-langdetect dependency to ensure compatibility
-
- 24 Dec, 2024 1 commit
-
-
myhloli authored
- Add LLM-aided formula and text correction functionality - Update config reader to include LLM-aided settings - Create new LLM-aided processing module - Update main processing script to incorporate LLM-aided corrections - Modify download scripts to check for new config version
-
- 30 Nov, 2024 1 commit
-
-
myhloli authored
- Decrease the line height multiplier from 0.8 to 0.7 for both left and right sides - This modification aims to improve the accuracy of paragraph splitting
-
- 28 Nov, 2024 1 commit
-
-
myhloli authored
- Add language detection for each block of text - Implement language-specific logic for right margin alignment - Introduce logging for debugging purposes
-
- 25 Nov, 2024 1 commit
-
-
myhloli authored
- Add checks for uppercase character start in the first span of a block
-
- 22 Nov, 2024 1 commit
-
-
myhloli authored
- Add '-' and '–' to LINE_STOP_FLAG in pdf_parse_union_core_v2.py - Remove unused debug_mode parameter from para_split function in para_split_v3.py
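A minimal sketch of how a tuple of stop characters like LINE_STOP_FLAG could be used to test whether a line ends with one of them. The tuple below is an illustrative subset, not the real contents of LINE_STOP_FLAG in pdf_parse_union_core_v2.py, and the helper name ends_with_stop_flag is hypothetical.

```python
# Illustrative subset only (assumption): the full LINE_STOP_FLAG tuple is not
# shown in this log. '-' and '–' are the two characters this commit adds.
LINE_STOP_FLAG = (".", "!", "?", "-", "–")


def ends_with_stop_flag(line_text: str) -> bool:
    """Return True if the line, ignoring trailing whitespace, ends with a stop flag."""
    # str.endswith accepts a tuple and matches any of its members.
    return line_text.rstrip().endswith(LINE_STOP_FLAG)
```

Keeping the flags in a tuple means a new character can be added without touching the check itself.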
-
- 19 Nov, 2024 1 commit
-
-
icecraft authored
-
- 18 Nov, 2024 2 commits
-
-
myhloli authored
- Introduce a variable threshold for right margin based on block width - Use 0.26 * block_weight for wider blocks (block_weight_radio >= 0.5) - Use 0.36 * block_weight for narrower blocks - This change aims to improve paragraph splitting accuracy for different block widths
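The threshold rule described in this commit can be sketched roughly as follows. The names block_weight and block_weight_radio come from the commit message. The function name, the (x0, y0, x1, y1) bounding-box layout, and the assumption that the ratio is block width divided by page width are illustrative, not taken from para_split_v3.py.

```python
def right_margin_threshold(block_bbox, page_width):
    """Return the right-margin threshold for a block.

    block_bbox is assumed to be (x0, y0, x1, y1); page_width is in the same units.
    """
    if page_width <= 0:
        raise ValueError("page_width must be positive")

    x0, _, x1, _ = block_bbox
    block_weight = x1 - x0                          # block width
    block_weight_radio = block_weight / page_width  # width ratio (assumed definition)

    # Wider blocks use the smaller factor, narrower blocks the larger one.
    factor = 0.26 if block_weight_radio >= 0.5 else 0.36
    return factor * block_weight
```

For example, a 300-unit-wide block on a 500-unit-wide page has a ratio of 0.6, so the threshold is 0.26 times the block width.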
-
myhloli authored
- Add page size information to blocks - Calculate block width ratio relative to page width - Adjust threshold for determining right side indentation - Implement additional checks for merging blocks across pages - Improve logic for identifying list structures
-
- 11 Nov, 2024 1 commit
-
-
hyastar authored
-
- 03 Nov, 2024 1 commit
-
-
myhloli authored
- Add block_height calculation to determine block aspect ratio - Update list identification condition to include aspect ratio check - Improve code readability with better formatting and line breaks
-
- 02 Nov, 2024 2 commits
-
-
myhloli authored
feat(list): improve list detection algorithm - Add center_close_num and external_sides_not_close_num variables to analyze line positioning - Implement new list detection condition for centered lines - Enhance existing list detection logic with additional checks
-
myhloli authored
fix(list): improve list identification accuracy - Adjust the threshold for determining right-side spacing to 0.26 * block_weight - Add TODO comment for special list identification with all centered lines - Modify the condition for recognizing short item lists with left alignment - Update the condition for identifying the end of a list item
-
- 21 Oct, 2024 1 commit
-
-
myhloli authored
- Adjust the threshold for identifying index blocks from 3 lines to 2 lines - Add a new function __is_list_group to detect if a group of blocks is a list - Modify the paragraph merging logic to handle list groups differently
-
- 15 Oct, 2024 3 commits
-
-
myhloli authored
- Update list block detection logic to require at least 2 numeric start lines - Ensure the number of numeric start lines matches the number of end lines - Remove detection of non-border starting lines for simplicity
-
myhloli authored
-
myhloli authored
- Combine __is_list_block() and __is_index_block() into a single function __is_list_or_index_block() - Simplify block type determination logic - Remove redundant code and improve readability - Optimize block merging process
-
- 14 Oct, 2024 1 commit
-
-
myhloli authored
- Add detection for list and index blocks in OCR processing - Implement merging of list and index blocks across pages - Update block types to include list and index categories - Adjust text merging logic to handle new block types - Modify layout drawing to distinguish list and index blocks
-
- 10 Oct, 2024 2 commits
-
-
myhloli authored
-
myhloli authored
- Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process - Add support for specifying page range in doc_analyze_by_custom_model - Implement garbage collection and memory cleaning after processing - Refine image loading from PDF, including handling out-of-range pages
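A minimal sketch of the page-range handling and post-processing cleanup this commit describes, assuming zero-based inclusive page indices. The names resolve_page_range, process_pages, start_page_id, and end_page_id are hypothetical and do not reflect the actual signature of doc_analyze_by_custom_model.

```python
import gc


def resolve_page_range(page_count, start_page_id=0, end_page_id=None):
    """Clamp a requested inclusive page range to the pages that exist."""
    if page_count <= 0:
        return range(0)
    last = page_count - 1
    # A missing, negative, or too-large end falls back to the last page.
    if end_page_id is None or end_page_id < 0 or end_page_id > last:
        end_page_id = last
    start = max(0, start_page_id)
    if start > end_page_id:
        return range(0)  # requested range lies entirely outside the document
    return range(start, end_page_id + 1)


def process_pages(pages, handle_page, start_page_id=0, end_page_id=None):
    """Apply handle_page to each page in range, then trigger garbage collection."""
    results = []
    try:
        for index in resolve_page_range(len(pages), start_page_id, end_page_id):
            results.append(handle_page(pages[index]))
    finally:
        gc.collect()  # release unreachable objects even if a page fails
    return results
```

Running the cleanup in a finally block is a design choice of this sketch, so memory is reclaimed whether or not processing succeeds.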
-