- 03 Nov, 2024 2 commits
-
-
myhloli authored
- Optimize content stripping and checking logic - Add special case handling for single-character content - Adjust spacing rules for different content types
-
myhloli authored
- Add block_height calculation to determine block aspect ratio - Update list identification condition to include aspect ratio check - Improve code readability with better formatting and line breaks
-
- 02 Nov, 2024 2 commits
-
-
myhloli authored
feat(list): improve list detection algorithm - Add center_close_num and external_sides_not_close_num variables to analyze line positioning - Implement new list detection condition for centered lines - Enhance existing list detection logic with additional checks
-
myhloli authored
fix(list): improve list identification accuracy - Adjust the threshold for determining right-side spacing to 0.26 * block_weight - Add TODO comment for special list identification with all centered lines - Modify the condition for recognizing short item lists with left alignment - Update the condition for identifying the end of a list item
-
- 01 Nov, 2024 8 commits
-
-
myhloli authored
- Include InlineEquation in the condition for handling text content - Remove separate block for InlineEquation processing - Ensures consistent handling of inline equations and text, improving content formatting
-
myhloli authored
fix(ocr_mkcontent): improve content handling for different languages and equation types - Adjust content formatting for Chinese, Japanese, Korean, and Western languages - Implement proper spacing rules around inline equations - Remove unnecessary empty lines in paragraph text
-
myhloli authored
- Refactor remove_outside_spans function to filter spans more accurately - Add image_footnote, index, and list block types to output file documentation - Update draw_span_bbox to use preproc_blocks instead of para_blocks - Bump version to 0.9.0
-
icecraft authored
-
xu rui authored
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
- 31 Oct, 2024 1 commit
-
-
myhloli authored
- Add new function `remove_outside_spans` to filter spans based on image and table blocks - Reorder span processing steps to improve efficiency - Update imports to include `calculate_overlap_area_in_bbox1_area_ratio`
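The overlap-based filtering described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the helper names `overlap_area_in_bbox1_ratio` and `keep_spans_inside_blocks`, the `(x0, y0, x1, y1)` box layout, and the `min_ratio=0.5` cutoff are all assumptions made for the example.

```python
def overlap_area_in_bbox1_ratio(bbox1, bbox2):
    """Return intersection area divided by the area of bbox1.

    Boxes are assumed to be (x0, y0, x1, y1) tuples (hypothetical layout).
    """
    x0 = max(bbox1[0], bbox2[0])
    y0 = max(bbox1[1], bbox2[1])
    x1 = min(bbox1[2], bbox2[2])
    y1 = min(bbox1[3], bbox2[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0  # the boxes do not intersect
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    if area1 <= 0:
        return 0.0  # degenerate bbox1
    return (x1 - x0) * (y1 - y0) / area1


def keep_spans_inside_blocks(spans, blocks, min_ratio=0.5):
    """Keep spans that sit mostly inside at least one block.

    min_ratio is an assumed cutoff chosen for illustration only.
    """
    return [
        span
        for span in spans
        if any(
            overlap_area_in_bbox1_ratio(span["bbox"], block["bbox"]) >= min_ratio
            for block in blocks
        )
    ]
```

Dividing by the span's own area rather than the union means a small span fully contained in a large image or table block scores 1.0, which is usually what a containment check needs.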
-
- 30 Oct, 2024 2 commits
-
-
myhloli authored
- Add check for 'image_path' in spans to avoid errors when it's missing - Update image handling in both paragraph text and content dictionary - Improve error handling and make the code more robust
-
myhloli authored
- Update image content extraction to iterate through all spans in a block - Add support for extracting table content from spans within a block - Handle multiple content types within table spans (latex, html, image) - Refactor code to be more modular and easier to maintain
-
- 28 Oct, 2024 5 commits
-
-
myhloli authored
- Remove import and usage of StructTableModel - Add support for TableMaster model - Update table model initialization logic to support TableMaster - Log error and exit if StructEqTable is selected, as it's under upgrade - Update README files to reflect changes in table parsing capabilities
-
icecraft authored
-
liukaiwen authored
-
liukaiwen authored
-
icecraft authored
-
- 27 Oct, 2024 2 commits
-
-
myhloli authored
- Modify the logic for splitting wide blocks exceeding 0.4 page width - Remove the specific case for blocks exceeding 0.25 page width - Add comments to explain the reasoning behind different splitting strategies
-
myhloli authored
- Update model download instructions for versions 0.9.x and later - Simplify demo scripts by removing unnecessary model configuration - Add visualization function to draw bounding boxes - Update CLI help message with new URL
-
- 26 Oct, 2024 1 commit
-
-
myhloli authored
- Add support for drawing bounding boxes of table and image sub-blocks - Implement sorting of table blocks based on type order - Update bounding box drawing for text and title blocks - Refactor code to handle different block types and their sub-blocks
-
- 25 Oct, 2024 7 commits
-
-
myhloli authored
-
myhloli authored
-
icecraft authored
-
myhloli authored
- Lower the Y-axis overlap threshold for merging spans into lines from 0.6 to 0.5 - Reduce the unclip ratio for OCR detection from 2.4 to 1.8
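To illustrate what lowering the Y-axis overlap threshold does, here is a minimal sketch of threshold-based line merging. The function names, the span dictionary layout, and the choice to normalise the overlap by the shorter box's height are assumptions for this example; the project's actual merging code may differ.

```python
def y_overlap_ratio(bbox_a, bbox_b):
    """Vertical overlap divided by the height of the shorter box.

    Boxes are assumed to be (x0, y0, x1, y1) with y growing downward.
    """
    overlap = min(bbox_a[3], bbox_b[3]) - max(bbox_a[1], bbox_b[1])
    min_height = min(bbox_a[3] - bbox_a[1], bbox_b[3] - bbox_b[1])
    if overlap <= 0 or min_height <= 0:
        return 0.0
    return overlap / min_height


def merge_spans_into_lines(spans, threshold=0.5):
    """Group spans into lines when they overlap vertically by more than threshold."""
    lines = []
    # Visit spans top to bottom so each one is compared with the latest line.
    for span in sorted(spans, key=lambda s: s["bbox"][1]):
        if lines and y_overlap_ratio(lines[-1][-1]["bbox"], span["bbox"]) > threshold:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Order the spans of each line left to right.
    for line in lines:
        line.sort(key=lambda s: s["bbox"][0])
    return lines
```

With two spans whose vertical overlap ratio is 0.55, a threshold of 0.6 keeps them on separate lines while 0.5 merges them, so the lower value is more tolerant of spans that are slightly offset vertically.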
-
myhloli authored
- Split image and table blocks into separate categories - Add group_id to image and table blocks - Update block processing logic to handle new categories - Modify layout splitting and span filling to accommodate new block types - Adjust block indexing and sorting to consider new structures
-
icecraft authored
-
icecraft authored
-
- 24 Oct, 2024 3 commits
- 23 Oct, 2024 1 commit
-
-
myhloli authored
- Add new layout model option: DocLayout-YOLO - Implement model initialization and prediction for DocLayout-YOLO - Update configuration options to include new model - Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models - Update Gradio app to support more Custom Switch
-
- 21 Oct, 2024 2 commits
-
-
myhloli authored
- Modified the condition to include List and Index block types - This change enhances the function's capability to process different paragraph types
-
myhloli authored
- Adjust the threshold for identifying index blocks from 3 lines to 2 lines - Add a new function __is_list_group to detect if a group of blocks is a list - Modify the paragraph merging logic to handle list groups differently
-
- 18 Oct, 2024 1 commit
-
-
myhloli authored
- Remove unused parameters parse_type and lang from various functions - Simplify function calls by removing unnecessary arguments - Update related files to reflect these changes
-
- 17 Oct, 2024 2 commits
-
-
liukaiwen authored
-
myhloli authored
- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc. - Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic - Remove wordninja dependency from requirements - Update ocr_model_init to include additional parameters for OCR model configuration
-
- 15 Oct, 2024 1 commit
-
-
myhloli authored
- Update list block detection logic to require at least 2 numeric start lines - Ensure the number of numeric start lines matches the number of end lines - Remove detection of non-border starting lines for simplicity
-