- 01 Nov, 2024 4 commits
-
-
myhloli authored
- Refactor remove_outside_spans function to filter spans more accurately - Add image_footnote, index, and list block types to output file documentation - Update draw_span_bbox to use preproc_blocks instead of para_blocks - Bump version to 0.9.0
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
- 31 Oct, 2024 1 commit
-
-
myhloli authored
- Add new function `remove_outside_spans` to filter spans based on image and table blocks - Reorder span processing steps to improve efficiency - Update imports to include `calculate_overlap_area_in_bbox1_area_ratio`
-
- 30 Oct, 2024 2 commits
-
-
myhloli authored
- Add check for 'image_path' in spans to avoid errors when it's missing - Update image handling in both paragraph text and content dictionary - Improve error handling and make the code more robust
-
myhloli authored
- Update image content extraction to iterate through all spans in a block - Add support for extracting table content from spans within a block - Handle multiple content types within table spans (latex, html, image) - Refactor code to be more modular and easier to maintain
-
- 28 Oct, 2024 5 commits
-
-
myhloli authored
- Remove import and usage of StructTableModel - Add support for TableMaster model - Update table model initialization logic to support TableMaster - Log error and exit if StructEqTable is selected, as it's under upgrade - Update README files to reflect changes in table parsing capabilities
-
icecraft authored
-
liukaiwen authored
-
liukaiwen authored
-
icecraft authored
-
- 27 Oct, 2024 2 commits
-
-
myhloli authored
- Modify the logic for splitting wide blocks exceeding 0.4 page width - Remove the specific case for blocks exceeding 0.25 page width - Add comments to explain the reasoning behind different splitting strategies
-
myhloli authored
- Update model download instructions for versions 0.9.x and later - Simplify demo scripts by removing unnecessary model configuration - Add visualization function to draw bounding boxes - Update CLI help message with new URL
-
- 26 Oct, 2024 1 commit
-
-
myhloli authored
- Add support for drawing bounding boxes of table and image sub-blocks - Implement sorting of table blocks based on type order - Update bounding box drawing for text and title blocks - Refactor code to handle different block types and their sub-blocks
-
- 25 Oct, 2024 7 commits
-
-
myhloli authored
-
myhloli authored
-
icecraft authored
-
myhloli authored
- Lower the Y-axis overlap threshold for merging spans into lines from 0.6 to 0.5 - Reduce the unclip ratio for OCR detection from 2.4 to 1.8
-
myhloli authored
- Split image and table blocks into separate categories - Add group_id to image and table blocks - Update block processing logic to handle new categories - Modify layout splitting and span filling to accommodate new block types - Adjust block indexing and sorting to consider new structures
-
icecraft authored
-
icecraft authored
-
- 24 Oct, 2024 3 commits
- 23 Oct, 2024 1 commit
-
-
myhloli authored
- Add new layout model option: DocLayout-YOLO - Implement model initialization and prediction for DocLayout-YOLO - Update configuration options to include new model - Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models - Update Gradio app to support more custom switches
-
- 21 Oct, 2024 2 commits
-
-
myhloli authored
- Modified the condition to include List and Index block types - This change enhances the function's capability to process different paragraph types
-
myhloli authored
- Adjust the threshold for identifying index blocks from 3 lines to 2 lines - Add a new function __is_list_group to detect if a group of blocks is a list - Modify the paragraph merging logic to handle list groups differently
-
- 18 Oct, 2024 1 commit
-
-
myhloli authored
- Remove unused parameters parse_type and lang from various functions - Simplify function calls by removing unnecessary arguments - Update related files to reflect these changes
-
- 17 Oct, 2024 2 commits
-
-
liukaiwen authored
-
myhloli authored
- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc. - Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic - Remove wordninja dependency from requirements - Update ocr_model_init to include additional parameters for OCR model configuration
-
- 15 Oct, 2024 4 commits
-
-
myhloli authored
- Update list block detection logic to require at least 2 numeric start lines - Ensure the number of numeric start lines matches the number of end lines - Remove detection of non-border starting lines for simplicity
-
myhloli authored
-
myhloli authored
Increased the threshold for filling spans in blocks from 0.3 to 0.5 to improve the accuracy of block formation. This change helps refine the grouping of spans into blocks, potentially enhancing the overall structure and readability of the PDF content.
-
myhloli authored
- Combine __is_list_block() and __is_index_block() into a single function __is_list_or_index_block() - Simplify block type determination logic - Remove redundant code and improve readability - Optimize block merging process
-
- 14 Oct, 2024 2 commits
-
-
myhloli authored
Add List and Index to the list of block types being processed in the draw_bbox.py file. This inclusion ensures that these block types are handled similarly to other text-containing blocks, improving the overall document processing accuracy and consistency.
-
myhloli authored
- Add detection for list and index blocks in OCR processing - Implement merging of list and index blocks across pages - Update block types to include list and index categories - Adjust text merging logic to handle new block types - Modify layout drawing to distinguish list and index blocks
-
- 10 Oct, 2024 2 commits
-
-
myhloli authored
-
myhloli authored
- Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process - Add support for specifying page range in doc_analyze_by_custom_model - Implement garbage collection and memory cleaning after processing - Refine image loading from PDF, including handling out-of-range pages
-
- 08 Oct, 2024 1 commit
-
-
myhloli authored
- Add function to get local LayoutReader model directory - Check and use local model directory if available - Fall back to online model if local directory not found - Update model initialization to support local path - Refactor model loading in singleton class
-