- 26 Nov, 2024 3 commits
- 25 Nov, 2024 1 commit
-
-
myhloli authored
-
- 22 Nov, 2024 1 commit
-
-
myhloli authored
-
- 21 Nov, 2024 1 commit
-
-
myhloli authored
- Implement new text extraction method (txt_spans_extract_v2) to enhance accuracy - Add character filling in spans for better text reconstruction - Introduce empty span handling using OCR for missed text - Optimize span filtering and overlap removal
-
- 19 Nov, 2024 1 commit
-
-
icecraft authored
-
- 18 Nov, 2024 1 commit
-
-
icecraft authored
-
- 15 Nov, 2024 1 commit
-
-
myhloli authored
-
- 08 Nov, 2024 2 commits
-
-
myhloli authored
- Change the default table model from TABLE_MASTER to RAPID_TABLE
-
myhloli authored
- Add RapidTable model support for table recognition - Update table model configuration and initialization - Modify table recognition process to use RapidTable when specified - Add RapidTable dependency to setup.py
-
- 07 Nov, 2024 1 commit
-
-
myhloli authored
- Implement xycut algorithm to sort blocks when layoutreader fails - Add recursive_xy_cut function to perform the xycut algorithm - Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails - Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks
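The fallback described in this commit can be illustrated with a minimal sketch of a recursive XY-cut. This is not the repository's `recursive_xy_cut`; the function names `xy_cut_order` and `_split_by_gaps`, the `(x0, y0, x1, y1)` box layout, and the choice to try horizontal cuts first are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def _split_by_gaps(boxes: Sequence[Box], indices: List[int], axis: int) -> List[List[int]]:
    """Group box indices into runs separated by empty gaps along one axis.

    axis 0 projects onto x (vertical cuts), axis 1 projects onto y (horizontal cuts).
    """
    ordered = sorted(indices, key=lambda i: boxes[i][axis])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][axis + 2]  # farthest end of the current run
    for i in ordered[1:]:
        if boxes[i][axis] >= reach:
            groups.append([i])  # a gap was found: start a new run
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][axis + 2])
    return groups


def xy_cut_order(boxes: Sequence[Box]) -> List[int]:
    """Return box indices in an approximate reading order."""

    def recurse(indices: List[int], axis: int) -> List[int]:
        if len(indices) <= 1:
            return list(indices)
        groups = _split_by_gaps(boxes, indices, axis)
        if len(groups) == 1:
            # No cut on this axis: try the other one.
            axis = 1 - axis
            groups = _split_by_gaps(boxes, indices, axis)
            if len(groups) == 1:
                # No cut on either axis: sort top-to-bottom, then left-to-right.
                return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
        order: List[int] = []
        for group in groups:
            order.extend(recurse(group, 1 - axis))  # alternate the cut direction
        return order

    return recurse(list(range(len(boxes))), 1)


# A full-width title above two columns, given in shuffled order.
page = [
    (55, 45, 100, 90),  # 0: right column, second paragraph
    (0, 20, 45, 55),    # 1: left column, first paragraph
    (0, 0, 100, 10),    # 2: title
    (55, 20, 100, 40),  # 3: right column, first paragraph
    (0, 60, 45, 90),    # 4: left column, second paragraph
]
print(xy_cut_order(page))  # [2, 1, 4, 3, 0]
```

Because this sketch always tries a horizontal cut first, columns whose paragraph gaps line up exactly would be read row by row; a production implementation would need a rule for choosing between competing cuts.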
-
- 06 Nov, 2024 2 commits
- 01 Nov, 2024 1 commit
-
-
myhloli authored
- Refactor remove_outside_spans function to filter spans more accurately - Add image_footnote, index, and list block types to output file documentation - Update draw_span_bbox to use preproc_blocks instead of para_blocks - Bump version to 0.9.0
-
- 28 Oct, 2024 1 commit
-
-
liukaiwen authored
-
- 26 Oct, 2024 1 commit
-
-
myhloli authored
- Add support for drawing bounding boxes of table and image sub-blocks - Implement sorting of table blocks based on type order - Update bounding box drawing for text and title blocks - Refactor code to handle different block types and their sub-blocks
-
- 24 Oct, 2024 1 commit
-
-
icecraft authored
feat: add Data api
-
- 23 Oct, 2024 1 commit
-
-
myhloli authored
- Add new layout model option: DocLayout-YOLO - Implement model initialization and prediction for DocLayout-YOLO - Update configuration options to include new model - Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models - Update Gradio app to support more custom switches
-
- 17 Oct, 2024 1 commit
-
-
liukaiwen authored
-
- 14 Oct, 2024 2 commits
-
-
myhloli authored
Add List and Index to the list of block types being processed in the draw_bbox.py file. This inclusion ensures that these block types are handled similarly to other text-containing blocks, improving the overall document processing accuracy and consistency.
-
myhloli authored
- Add detection for list and index blocks in OCR processing - Implement merging of list and index blocks across pages - Update block types to include list and index categories - Adjust text merging logic to handle new block types - Modify layout drawing to distinguish list and index blocks
-
- 08 Oct, 2024 1 commit
-
-
myhloli authored
- Add function to get local LayoutReader model directory - Check and use local model directory if available - Fall back to online model if local directory not found - Update model initialization to support local path - Refactor model loading in singleton class
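The local-first lookup with an online fallback can be sketched as follows. The helper name `resolve_model_source` and the example identifiers are hypothetical and do not come from the repository; the sketch only shows the decision, not the model loading itself.

```python
import os
from typing import Optional


def resolve_model_source(local_dir: Optional[str], online_id: str) -> str:
    """Prefer an existing local model directory, otherwise use the online identifier."""
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    return online_id  # local directory missing or not configured
```

The returned string can then be handed to whatever loader initializes the model, so the loader does not need to know which of the two sources was chosen.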
-
- 29 Sep, 2024 2 commits
-
-
myhloli authored
- Insert lines into blocks based on median line height - Calculate block index using line indices median - Remove virtual line information for table and image blocks - Enhance line sorting algorithm for different block types - Add line height calculation function
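One way to read "insert lines into blocks based on median line height" is sketched below, under stated assumptions: boxes are `(x0, y0, x1, y1)` tuples, and a block is divided into equal-height virtual lines whose count is the block height divided by the median line height, rounded. The function names are hypothetical and the repository's actual rules may differ.

```python
from statistics import median
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def median_line_height(line_bboxes: Sequence[Box]) -> float:
    """Median height of the given line boxes."""
    return median(y1 - y0 for _, y0, _, y1 in line_bboxes)


def split_block_into_lines(block_bbox: Box, line_height: float) -> List[Box]:
    """Divide a block into equal-height virtual lines of roughly line_height."""
    if line_height <= 0:
        raise ValueError("line_height must be positive")
    x0, y0, x1, y1 = block_bbox
    height = y1 - y0
    count = max(1, round(height / line_height))  # at least one line per block
    step = height / count
    return [(x0, y0 + i * step, x1, y0 + (i + 1) * step) for i in range(count)]


lines = [(0, 0, 100, 10), (0, 12, 100, 24), (0, 26, 100, 37)]
h = median_line_height(lines)                         # heights 10, 12, 11 -> 11
virtual = split_block_into_lines((0, 0, 100, 33), h)  # three lines of height 11
```

Using the median rather than the mean keeps a single oversized line, such as a heading, from skewing the estimate.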
-
myhloli authored
The clean_memory function has been removed from pdf_parse_union_core_v2.py due to it not being used. This change streamlines the code and prevents potential confusion regarding its purpose.
-
- 27 Sep, 2024 6 commits
-
-
myhloli authored
refactor(draw_bbox): remove commented-out code and streamline bbox drawing
Removed legacy commented-out code related to layout_bbox_list from draw_bbox.py, which was used for diagnostic purposes and was no longer necessary. This change streamlines the codebase and clarifies the drawing process of bounding boxes on PDF pages. The update also adjusts the order of operations slightly for improved readability without altering the functionality.
-
myhloli authored
Refactor the draw bbox functions by removing unused imports and simplifying the code logic for drawing layout and line sorting bounding boxes. Adjust the debug configuration to enable content list dumping and disable markdown making mode.
-
myhloli authored
Introduce an additional argument `draw_bbox` in the `draw_bbox_with_number` function to enable toggling the drawing of bounding boxes on or off. When set to `False`, no bounding box will be drawn, allowing for situations where only text
-
myhloli authored
Remove debug code related to layout bbox visualization and adjust drawing functions to support optional line sorting bboxes. This change includes the removal of `draw_layout_bbox` function and updates to `draw_bbox_with_number` to support variable line width for bbox drawing.
-
myhloli authored
Add a new function `draw_line_sort_bbox` to visualize the sorting of lines on each page. This includes indexing lines and handling both text and non-text elements such as tables and images for better content organization. Also, comment out GPU-related code for flexibility and remove overlaps in bounding box detection, which improves the accuracy of layout splitting.
-
myhloli authored
- Added CUDA cache clearing after layoutreader prediction to free up GPU memory. - Modified the bbox sorting logic to sort text and title blocks separately. - Adjusted drawing colors for better distinction in debug visualizations.
-
- 26 Sep, 2024 2 commits
-
-
myhloli authored
- Added CUDA cache clearing after layoutreader prediction to free up GPU memory. - Modified the bbox sorting logic to sort text and title blocks separately. - Adjusted drawing colors for better distinction in debug visualizations.
-
myhloli authored
Implement a new function `draw_layout_sort_bbox` in `draw_bbox.py` to visualize the layout sorting results using the `LayoutLMv3ForTokenClassification` model. This function predicts the order of layout elements and draws them in the sorted sequence on the PDF pages.
-
- 25 Sep, 2024 1 commit
-
-
myhloli authored
Implement a new function `draw_layout_sort_bbox` in `draw_bbox.py` to visualize the layout sorting results using the `LayoutLMv3ForTokenClassification` model. This function predicts the order of layout elements and draws them in the sorted sequence on the PDF pages.
-
- 18 Sep, 2024 1 commit
-
-
myhloli authored
feat(ocr_mkcontent): support drop reason in none_with_reason mode
Enable the `NONE_WITH_REASON` drop mode in `para_to_standard_format_v2` by updating the function signature to include the `drop_reason` parameter and handling it within the function logic. This enhancement allows the function to convey the reason for dropping content in the output.
-
- 12 Sep, 2024 5 commits