- 21 Nov, 2024 5 commits
-
-
myhloli authored
- Update OCR utils to handle different box formats and improve angle calculation - Modify PDF extraction kit to support OCR option and optimize processing flow - Enhance PPOCR model to sort and filter detection boxes, improving text splitting accuracy
-
myhloli authored
- Improve logic to skip dropped spans in overlap detection - Enhance efficiency by avoiding unnecessary comparisons
-
myhloli authored
- fix the bug where hyphens in the middle of a line are being discarded
-
myhloli authored
- Add threshold parameter to merge_spans_to_line function - Make threshold configurable for y-axis overlap check - Improve flexibility and accuracy of line merging algorithm
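A minimal sketch of line merging with a configurable y-axis overlap threshold, as described in the commit above. The span layout (`{"bbox": (x0, y0, x1, y1)}`), the helper `y_overlap_ratio`, and the default value `0.6` are assumptions for illustration, not the repository's actual code.

```python
def y_overlap_ratio(bbox1, bbox2):
    """Vertical overlap divided by the height of the shorter box."""
    _, top1, _, bottom1 = bbox1
    _, top2, _, bottom2 = bbox2
    overlap = max(0, min(bottom1, bottom2) - max(top1, top2))
    min_height = min(bottom1 - top1, bottom2 - top2)
    return overlap / min_height if min_height > 0 else 0.0


def merge_spans_to_line(spans, threshold=0.6):
    """Group spans into lines; threshold is the minimum y-overlap ratio.

    The default of 0.6 is an assumed value, not taken from the commit.
    """
    if not spans:
        return []
    # Walk spans from top to bottom and compare each with the last span
    # of the line currently being built.
    ordered = sorted(spans, key=lambda s: s["bbox"][1])
    lines = [[ordered[0]]]
    for span in ordered[1:]:
        if y_overlap_ratio(span["bbox"], lines[-1][-1]["bbox"]) > threshold:
            lines[-1].append(span)
        else:
            lines.append([span])
    return lines
```

Raising the threshold makes merging stricter: spans that overlap only slightly on the y-axis then start a new line instead of joining the current one.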
-
myhloli authored
- Check if language string is empty and set it to None - This prevents potential errors when an empty language string is passed
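The guard described above could look roughly like this. The function name `normalize_lang` is hypothetical, and treating whitespace-only strings as empty is an added assumption.

```python
def normalize_lang(lang):
    """Return None for an empty language string, otherwise the string unchanged."""
    # Whitespace-only input is treated as empty (assumption, not in the commit).
    if isinstance(lang, str) and lang.strip() == "":
        return None
    return lang
```

Downstream code can then rely on a single "no language given" value (`None`) instead of checking for both `None` and `""`.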
-
- 18 Nov, 2024 3 commits
-
-
myhloli authored
- Introduce a variable threshold for right margin based on block width - Use 0.26 * block_weight for wider blocks (block_weight_radio >= 0.5) - Use 0.36 * block_weight for narrower blocks - This change aims to improve paragraph splitting accuracy for different block widths
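The width-dependent threshold above can be sketched as follows. The factors 0.26 and 0.36 and the 0.5 cutoff come from the commit message, and the identifiers `block_weight` and `block_weight_radio` are kept as spelled there. The function name `right_margin_threshold` is hypothetical.

```python
def right_margin_threshold(block_weight, block_weight_radio):
    """Right-margin threshold that scales with the block width.

    block_weight: width of the block (name as spelled in the commit).
    block_weight_radio: block width relative to the page width.
    """
    # Wider blocks (at least half the page) get the tighter factor.
    factor = 0.26 if block_weight_radio >= 0.5 else 0.36
    return factor * block_weight
```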
-
myhloli authored
- Add page size information to blocks - Calculate block width ratio relative to page width - Adjust threshold for determining right side indentation - Implement additional checks for merging blocks across pages - Improve logic for identifying list structures
-
myhloli authored
- Add calculate_is_angle function to detect angled text boxes - Update update_det_boxes and merge_det_boxes functions to handle angled text boxes - Modify angle detection logic in various parts of the code
-
- 15 Nov, 2024 1 commit
-
-
myhloli authored
-
- 14 Nov, 2024 1 commit
-
-
myhloli authored
fix(parse_pipeline): Resolve post-processing exceptions caused by partial PDFs due to file corruption or non-standard format by forcing a re-print.
-
- 13 Nov, 2024 1 commit
-
-
myhloli authored
- Add digit check for single-character content to avoid adding unnecessary spaces
-
- 11 Nov, 2024 1 commit
-
-
hyastar authored
-
- 08 Nov, 2024 5 commits
-
-
myhloli authored
- Integrate RapidOCR with RapidTable model for table recognition - Improve memory management for devices with <= 8GB VRAM - Update table recognition process to use RapidOCR for RapidTable - Add rapidocr-paddle dependency in setup.py
-
myhloli authored
- Change the default table model from TABLE_MASTER to RAPID_TABLE
-
myhloli authored
- Add RapidTable model support for table recognition - Update table model configuration and initialization - Modify table recognition process to use RapidTable when specified - Add RapidTable dependency to setup.py
-
myhloli authored
- Lower the line count threshold from 316 to 200 to ensure compatibility - This change aims to prevent potential issues with layoutreader's maximum line support
-
myhloli authored
- Decrease the maximum line count from 512 to 316 for layoutreader
-
- 07 Nov, 2024 1 commit
-
-
myhloli authored
- Implement xycut algorithm to sort blocks when layoutreader fails - Add recursive_xy_cut function to perform the xycut algorithm - Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails - Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks
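For readers unfamiliar with XY-cut, here is a simplified pure-Python sketch of the idea: project the boxes onto one axis, cut at empty gaps, then recurse on each piece along the other axis. It is not the repository's `recursive_xy_cut`; the names `xy_cut_order` and `_split_by_gaps` and the `(x0, y0, x1, y1)` box format are assumptions.

```python
def _split_by_gaps(boxes, indices, axis):
    """Group indices into runs separated by empty gaps along one axis.

    axis=0 uses (x0, x1); axis=1 uses (y0, y1).
    """
    lo, hi = axis, axis + 2
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][hi]  # farthest extent covered so far
    for i in ordered[1:]:
        if boxes[i][lo] >= reach:
            groups.append([i])     # empty gap found: cut here
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][hi])
    return groups


def xy_cut_order(boxes, indices=None, axis=1):
    """Return box indices in reading order using recursive XY-cut."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    groups = _split_by_gaps(boxes, indices, axis)
    if len(groups) == 1:
        # No cut on this axis; try the other one.
        groups = _split_by_gaps(boxes, indices, 1 - axis)
        axis = 1 - axis
        if len(groups) == 1:
            # No cut possible at all: fall back to top-left ordering.
            return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
    order = []
    for group in groups:
        order.extend(xy_cut_order(boxes, group, 1 - axis))
    return order
```

On a page with a full-width title above two columns, the first horizontal cut separates the title, a vertical cut then separates the columns, and each column is read top to bottom.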
-
- 06 Nov, 2024 1 commit
-
-
myhloli authored
- Remove unused code for copying detection and recognition models - Simplify OCR model initialization using atom_model_manager - Delete unnecessary comments and empty lines
-
- 05 Nov, 2024 1 commit
-
-
myhloli authored
- Replace np.array with np.asarray for better performance - Add image color conversion from RGB to BGR using OpenCV
-
- 04 Nov, 2024 4 commits
-
-
myhloli authored
- Implement __replace_ligatures function to split ligature characters - Integrate ligature replacement into the merge_para_with_text function - Handle common ligatures such as fi, fl, ff, ffi, and ffl
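A minimal sketch of ligature splitting for the five ligatures named above, using their Unicode presentation-form code points (U+FB00 to U+FB04). The public name `replace_ligatures` and the regex-based approach are assumptions; the repository's `__replace_ligatures` may differ.

```python
import re

# Unicode ligature code points mapped to their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

_LIGATURE_RE = re.compile("|".join(map(re.escape, LIGATURES)))


def replace_ligatures(text):
    """Replace each ligature character with its letter sequence."""
    return _LIGATURE_RE.sub(lambda m: LIGATURES[m.group()], text)


print(replace_ligatures("e\ufb03cient \ufb01le"))  # efficient file
```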
-
myhloli authored
- Import 're' module for regular expression operations - Implement HTML minification for 'output_format=html' - Add 'minify_html' method to remove unnecessary whitespace and format HTML
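The whitespace-stripping step could be approximated with two regular expressions, as sketched below. This is an assumed implementation, not the repository's `minify_html` method, and it is only safe for markup without whitespace-sensitive elements such as `<pre>`.

```python
import re


def minify_html(html):
    """Collapse whitespace runs and drop whitespace between adjacent tags."""
    html = re.sub(r"\s+", " ", html)     # newlines, tabs, repeated spaces -> one space
    html = re.sub(r">\s+<", "><", html)  # remove whitespace between tags
    return html.strip()
```

Text inside a cell keeps a single space between words; only the formatting whitespace between tags is removed.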
-
myhloli authored
- Comment out an unused code block in the ppTableModel.py file - Improve code readability and maintainability by removing unnecessary code
-
myhloli authored
- Update StructTableModel to use the latest struct-eqtable library - Add support for HTML table extraction in PDF Extract Kit - Improve error handling and model initialization - Update dependencies in setup.py for struct-eqtable
-
- 03 Nov, 2024 2 commits
-
-
myhloli authored
- Optimize content stripping and checking logic - Add special case handling for single-character content - Adjust spacing rules for different content types
-
myhloli authored
- Add block_height calculation to determine block aspect ratio - Update list identification condition to include aspect ratio check - Improve code readability with better formatting and line breaks
-
- 02 Nov, 2024 2 commits
-
-
myhloli authored
feat(list): improve list detection algorithm - Add center_close_num and external_sides_not_close_num variables to analyze line positioning - Implement new list detection condition for centered lines - Enhance existing list detection logic with additional checks
-
myhloli authored
fix(list): improve list identification accuracy - Adjust the threshold for determining right-side spacing to 0.26 * block_weight - Add TODO comment for special list identification with all centered lines - Modify the condition for recognizing short item lists with left alignment - Update the condition for identifying the end of a list item
-
- 01 Nov, 2024 8 commits
-
-
myhloli authored
- Include InlineEquation in the condition for handling text content - Remove separate block for InlineEquation processing - Ensures consistent handling of inline equations and text, improving content formatting
-
myhloli authored
fix(ocr_mkcontent): improve content handling for different languages and equation types - Adjust content formatting for Chinese, Japanese, Korean, and Western languages - Implement proper spacing rules around inline equations - Remove unnecessary empty lines in paragraph text
-
myhloli authored
- Refactor remove_outside_spans function to filter spans more accurately - Add image_footnote, index, and list block types to output file documentation - Update draw_span_bbox to use preproc_blocks instead of para_blocks - Bump version to 0.9.0
-
icecraft authored
-
xu rui authored
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
- 31 Oct, 2024 1 commit
-
-
myhloli authored
- Add new function `remove_outside_spans` to filter spans based on image and table blocks - Reorder span processing steps to improve efficiency - Update imports to include `calculate_overlap_area_in_bbox1_area_ratio`
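One plausible reading of this filter: keep a span only when most of its own area lies inside some image or table block. The sketch below follows that reading. The helper name `overlap_ratio_in_bbox1`, the simplified signature of `remove_outside_spans`, and the `0.5` cutoff are all assumptions; the actual functions in the repository may behave differently.

```python
def overlap_ratio_in_bbox1(bbox1, bbox2):
    """Intersection area of the two boxes divided by the area of bbox1."""
    x0 = max(bbox1[0], bbox2[0])
    y0 = max(bbox1[1], bbox2[1])
    x1 = min(bbox1[2], bbox2[2])
    y1 = min(bbox1[3], bbox2[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0  # boxes do not intersect
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    if area1 <= 0:
        return 0.0  # degenerate span box
    return (x1 - x0) * (y1 - y0) / area1


def remove_outside_spans(spans, block_bboxes, min_ratio=0.5):
    """Keep spans whose area falls mostly inside at least one block."""
    return [
        span
        for span in spans
        if any(
            overlap_ratio_in_bbox1(span["bbox"], block) >= min_ratio
            for block in block_bboxes
        )
    ]
```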
-
- 30 Oct, 2024 2 commits
-
-
myhloli authored
- Add check for 'image_path' in spans to avoid errors when it's missing - Update image handling in both paragraph text and content dictionary - Improve error handling and make the code more robust
-
myhloli authored
- Update image content extraction to iterate through all spans in a block - Add support for extracting table content from spans within a block - Handle multiple content types within table spans (latex, html, image) - Refactor code to be more modular and easier to maintain
-
- 28 Oct, 2024 1 commit
-
-
myhloli authored
- Remove import and usage of StructTableModel - Add support for TableMaster model - Update table model initialization logic to support TableMaster - Log error and exit if StructEqTable is selected, as it's under upgrade - Update README files to reflect changes in table parsing capabilities
-