Commits · 1279f2cd0f2639dcad2fb3437fceda7539455ae2 · wangsen / MinerU

23 Oct, 2024 1 commit

feat(model): add support for DocLayout-YOLO model · 1279f2cd

myhloli authored Oct 23, 2024

- Add new layout model option: DocLayout-YOLO
- Implement model initialization and prediction for DocLayout-YOLO
- Update configuration options to include new model
- Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
- Update Gradio app to support more custom switches

1279f2cd

21 Oct, 2024 2 commits

fix(ocr_mkcontent): expand para_to_standard_format_v2 to handle list and index blocks · 64408576

myhloli authored Oct 21, 2024

- Modified the condition to include List and Index block types
- This change enhances the function's capability to process different paragraph types

64408576

refactor(para): improve paragraph splitting algorithm · 8cc76c49

myhloli authored Oct 21, 2024

- Adjust the threshold for identifying index blocks from 3 lines to 2 lines
- Add a new function __is_list_group to detect if a group of blocks is a list
- Modify the paragraph merging logic to handle list groups differently

8cc76c49

18 Oct, 2024 1 commit

refactor(magic_pdf): remove unused parameters and simplify functions · fc49f5c4

myhloli authored Oct 18, 2024

- Remove unused parameters parse_type and lang from various functions
- Simplify function calls by removing unnecessary arguments
- Update related files to reflect these changes

fc49f5c4

17 Oct, 2024 1 commit

refactor(ocr): Increase the dilation factor in OCR to address the issue of word concatenation. · 011a1b97

myhloli authored Oct 17, 2024

- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc.
- Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic
- Remove wordninja dependency from requirements
- Update ocr_model_init to include additional parameters for OCR model configuration

011a1b97

15 Oct, 2024 4 commits

refactor(para_split_v3): refine list block detection in paragraph splitting · 81b9fd7b

myhloli authored Oct 15, 2024

- Update list block detection logic to require at least 2 numeric start lines
- Ensure the number of numeric start lines matches the number of end lines
- Remove detection of non-border starting lines for simplicity

81b9fd7b
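
The two conditions listed in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from `para_split_v3`: the function name `looks_like_list_block`, the `(text, x1)` line representation, the `tolerance` parameter, and the reading of "end lines" as lines that stop short of the block's right edge are all assumptions made for the example.

```python
import re

# Hypothetical pattern: a line that starts with one or more digits.
NUMERIC_START = re.compile(r"^\s*\d+")


def looks_like_list_block(lines, block_right, tolerance=2.0):
    """Guess whether a block is a list.

    lines: list of (text, x1) tuples, where x1 is the line's right edge.
    block_right: right edge of the enclosing block.
    """
    # Count lines that begin with a number.
    numeric_starts = sum(1 for text, _ in lines if NUMERIC_START.match(text))
    # Count lines that end before the block's right edge (assumed "end lines").
    end_lines = sum(1 for _, x1 in lines if block_right - x1 > tolerance)
    # Require at least two numeric starts and a matching number of end lines.
    return numeric_starts >= 2 and numeric_starts == end_lines
```

Under these assumptions, a block with a single numbered line is never treated as a list, which mirrors the stricter "at least 2" requirement described above.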

fix(split_v3): Fix the rule adaptation for some special list samples. · 244b8684

myhloli authored Oct 15, 2024

244b8684

refactor(pdf): adjust span filling threshold in block construction · 7e301b84

myhloli authored Oct 15, 2024

Increased the threshold for filling spans in blocks from 0.3 to 0.5 to improve the accuracy of block formation. This change helps refine the grouping of spans into blocks, potentially enhancing the overall structure and readability of the PDF content.

7e301b84
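
To illustrate what raising the threshold from 0.3 to 0.5 means, here is a minimal sketch of overlap-based span filling. It assumes boxes are `(x0, y0, x1, y1)` tuples and that a span is assigned to a block when the fraction of the span's area inside the block exceeds the threshold; the names `overlap_ratio` and `fill_spans` are hypothetical and the project's actual criterion may differ.

```python
def overlap_ratio(span, block):
    """Fraction of the span's area that lies inside the block."""
    ix0, iy0 = max(span[0], block[0]), max(span[1], block[1])
    ix1, iy1 = min(span[2], block[2]), min(span[3], block[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (span[2] - span[0]) * (span[3] - span[1])
    return inter / area if area > 0 else 0.0


def fill_spans(blocks, spans, threshold=0.5):
    """Assign each span to the first block it overlaps by more than threshold."""
    filled = {i: [] for i in range(len(blocks))}
    for span in spans:
        for i, block in enumerate(blocks):
            if overlap_ratio(span, block) > threshold:
                filled[i].append(span)
                break
    return filled
```

In this sketch, a span with 40% of its area inside a block would be attached at a threshold of 0.3 but left out at 0.5, so the higher value keeps loosely overlapping spans out of a block.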

refactor(para_split_v3): merge list and index block detection · fdcb49d3

myhloli authored Oct 15, 2024

- Combine __is_list_block() and __is_index_block() into a single function __is_list_or_index_block()
- Simplify block type determination logic
- Remove redundant code and improve readability
- Optimize block merging process

fdcb49d3

14 Oct, 2024 2 commits

fix(magic_pdf): include List and Index block types in processing · 0a9a6d3e

myhloli authored Oct 15, 2024

Add List and Index to the list of block types being processed in the draw_bbox.py file. This inclusion ensures that these block types are handled similarly to other text-containing blocks, improving the overall document processing accuracy and consistency.

0a9a6d3e

feat(list&index block): detect and merge list and index blocks · 1f1dd353

myhloli authored Oct 15, 2024

- Add detection for list and index blocks in OCR processing
- Implement merging of list and index blocks across pages
- Update block types to include list and index categories
- Adjust text merging logic to handle new block types
- Modify layout drawing to distinguish list and index blocks

1f1dd353

10 Oct, 2024 2 commits

fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks · 7b42d5a0

myhloli authored Oct 10, 2024

7b42d5a0

feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support · 6f63e70e

myhloli authored Oct 10, 2024

- Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process
- Add support for specifying page range in doc_analyze_by_custom_model
- Implement garbage collection and memory cleaning after processing
- Refine image loading from PDF, including handling out-of-range pages

6f63e70e
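
One way to handle a requested page range, including out-of-range values, is to clamp it to the document before loading images. The sketch below is an assumption about how that could look, not the signature of `doc_analyze_by_custom_model`; `resolve_page_range` and its parameter names are hypothetical.

```python
def resolve_page_range(page_count, start_page_id=0, end_page_id=None):
    """Return an inclusive (start, end) page range clamped to the document."""
    last = page_count - 1
    # A missing, negative, or too-large end falls back to the last page.
    if end_page_id is None or end_page_id < 0 or end_page_id > last:
        end_page_id = last
    start_page_id = max(0, start_page_id)
    if start_page_id > end_page_id:
        raise ValueError("start page is beyond the end of the range")
    return start_page_id, end_page_id
```

For a 10-page document, `resolve_page_range(10, 2, 50)` yields pages 2 through 9 instead of failing on the out-of-range end.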

08 Oct, 2024 5 commits

feat(layoutreader): support local model directory and improve model loading · ded2818a

myhloli authored Oct 08, 2024

- Add function to get local LayoutReader model directory
- Check and use local model directory if available
- Fall back to online model if local directory not found
- Update model initialization to support local path
- Refactor model loading in singleton class

ded2818a
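
The local-first, online-fallback behaviour described in this commit can be sketched as below. The helper name `resolve_model_source` is hypothetical, and how the local directory is configured is not shown in this log.

```python
import os


def resolve_model_source(local_dir, online_id):
    """Prefer a local model directory; fall back to the online identifier."""
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    return online_id
```

The returned value would then be passed to whatever loader the model uses, so the rest of the initialization code does not need to know which source was chosen.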

fix: caption|footnote match algorithm · f31433b8

icecraft authored Oct 08, 2024

f31433b8

fix: caption or footnote match algorithm · ef45ad08

icecraft authored Oct 08, 2024

ef45ad08

perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity · fb9949c4

myhloli authored Oct 08, 2024

- Introduce a conditional memory cleanup step in the PDF extraction process
- Assess available GPU memory before deciding to perform memory cleanup
- Log the time taken for garbage collection when it occurs
- This optimization helps to balance performance and resource utilization

fb9949c4
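
A conditional cleanup step of this kind might look like the sketch below. The function name `maybe_clean_memory`, the 8 GB threshold, and passing the GPU memory size in as a plain number are assumptions for illustration; a real implementation would query the GPU runtime and would likely also release cached GPU memory.

```python
import gc
import time


def maybe_clean_memory(gpu_memory_gb, threshold_gb=8.0, log=print):
    """Run garbage collection only when GPU memory is at or below threshold_gb.

    Returns True if cleanup ran, False if it was skipped.
    """
    if gpu_memory_gb > threshold_gb:
        # Plenty of memory: skip cleanup to avoid the extra latency.
        return False
    start = time.perf_counter()
    gc.collect()
    log(f"gc time: {time.perf_counter() - start:.2f}s")
    return True
```

The trade-off is the one the commit names: cleanup costs time on every document, so it is only worth paying on devices where memory is tight.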

feat: add arXiv paper link to header and adjust PDF parsing logic- Add arXiv... · a71db703

myhloli authored Oct 08, 2024

feat: add arXiv paper link to header and adjust PDF parsing logic

- Add arXiv paper link to the header template for easy access to the latest research paper.
- Modify the PDF parsing logic to handle edge cases more accurately, particularly in determining the number of lines in a block based on its height.

a71db703

06 Oct, 2024 1 commit

refactor(model): improve timing information and performance · be1b1ae7

myhloli authored Oct 06, 2024

- Enhance timing output precision to two decimal places for better readability
- Calculate and log document analysis speed in pages per second
- Optimize logging for YOLO and table recognition processes
- Remove unnecessary comments and improve code efficiency

be1b1ae7
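
The timing output described here comes down to two-decimal formatting plus a pages-per-second figure. A minimal sketch, with a hypothetical function name and message wording:

```python
def format_doc_analyze_stats(page_count, elapsed_seconds):
    """Build a log message with elapsed time and throughput, both to 2 decimals."""
    speed = page_count / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return (
        f"doc analyze time: {elapsed_seconds:.2f}s, "
        f"speed: {speed:.2f} pages/second"
    )
```

For example, 10 pages processed in 4 seconds would be reported as `2.50 pages/second`.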

30 Sep, 2024 1 commit

chore: remove useless files · fcf24242

myhloli authored Sep 30, 2024

fcf24242

29 Sep, 2024 2 commits

refactor(magic_pdf): improve line sorting and block indexing · 564c4ce1

myhloli authored Sep 30, 2024

- Insert lines into blocks based on median line height
- Calculate block index using line indices median
- Remove virtual line information for table and image blocks
- Enhance line sorting algorithm for different block types
- Add line height calculation function

564c4ce1
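
Computing a block's index from the median of its line indices can be sketched as follows. The dictionary layout (`"lines"` holding items with an `"index"` key) and the function names are assumptions for the example, not the project's data model.

```python
from statistics import median


def block_index(block):
    """Median of the sorted-line indices belonging to a block."""
    return median(line["index"] for line in block["lines"])


def sort_blocks(blocks):
    """Order blocks by the median index of their lines."""
    return sorted(blocks, key=block_index)
```

Using the median rather than the first or last line index makes the block's position less sensitive to a single line that was ordered far from the rest.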

refactor(memory management): remove unused clean_memory function · 4c9bf8ab

myhloli authored Sep 29, 2024

The clean_memory function has been removed from pdf_parse_union_core_v2.py because it was unused.
This change streamlines the code and avoids confusion about the function's purpose.

4c9bf8ab

28 Sep, 2024 3 commits

refactor(magic_pdf): import model helpers directly for clarity · 42a7d792

myhloli authored Sep 28, 2024

Update import statements in `pdf_parse_union_core_v2.py` to directly import
`prepare_inputs`, `boxes2inputs`, and `parse_logits` from `magic_pdf.model.v3.helpers`
instead of from `magic_pdf.model.v3`. This change streamlines the imports, making the
code more readable and maintaining a cleaner approach to modular design.

42a7d792

refactor(pdf_parse_union_core_v2): update import paths to use new package structure · 5522d0a3

myhloli authored Sep 28, 2024

Adapt import statements in `pdf_parse_union_core_v2.py` to reflect the updated package
structure, changing from the `magic_pdf.v3.helpers` module to the `magic_pdf.model.v3`
module. This ensures compatibility with the revised directory layout.

5522d0a3

fix(pdf_parse): handle blocks without lines and enable bf16 on compatible devices · 2145a8b6

myhloli authored Sep 28, 2024

Blocks without lines are now correctly indexed even when they contain textual content rendered
as images. The sorting logic has been updated to accommodate this scenario. Additionally, the
LayoutLMv3 model initialization has been enhanced to utilize bfloat16 precision on devices that
support it, offering potential performance benefits on supported hardware.

2145a8b6

27 Sep, 2024 9 commits

refactor(pdf_parse): remove redundant sorting and optimize block indexing · 177ab08e

myhloli authored Sep 28, 2024

Removed redundant sorting of lines by model and optimized calculation of block
indexes by using a single pass through the sorted lines. This change simplifies the
code and potentially improves performance by reducing the number of sorting
operations and unnecessary iterations over blocks without lines.

177ab08e

refactor(draw_bbox): remove commented-out code and streamline bbox... · 83c07387

myhloli authored Sep 28, 2024

refactor(draw_bbox): remove commented-out code and streamline bbox drawing

Removed legacy commented-out code related to layout_bbox_list from draw_bbox.py, which
was used for diagnostic purposes and was no longer necessary. This change streamlines
the codebase and clarifies the drawing process of bounding boxes on PDF pages. The update
also adjusts the order of operations slightly for improved readability without altering
the functionality.

83c07387

refactor(pdf_parse_union_core_v2): implement model initialization within... · b9dfdea3

myhloli authored Sep 28, 2024

refactor(pdf_parse_union_core_v2): implement model initialization within class

Refactored model initialization to be handled by a singleton class to ensure that model
instances are reused across calls, avoiding redundant initializations. Removed logger
information that was commented out and ensured consistency in logging behavior.

b9dfdea3
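
The singleton approach described in this commit, where model instances are created once and reused across calls, can be sketched as below. The class name `ModelSingleton`, the `get_model` method, and the `loader` callback are hypothetical and stand in for whatever the project uses.

```python
class ModelSingleton:
    """Process-wide holder that creates each model once and then reuses it."""

    _instance = None
    _models = {}

    def __new__(cls):
        # Always hand back the same instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, name, loader):
        """Return the cached model for name, calling loader() only on first use."""
        if name not in self._models:
            self._models[name] = loader()
        return self._models[name]
```

With this shape, repeated parsing calls in one process pay the model loading cost only once, which is the redundancy the commit set out to remove.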

refactor(drawing): simplify draw bbox functions and adjust debug config · b2790f6f

myhloli authored Sep 28, 2024

Refactor the draw bbox functions by removing unused imports and simplifying the
code logic for drawing layout and line sorting bounding boxes. Adjust the debug
configuration to enable content list dumping and disable markdown making mode.

b2790f6f

feat(draw_bbox): add option to toggle bounding box drawing · 43a57d56

myhloli authored Sep 27, 2024

Introduce an additional argument `draw_bbox` in the `draw_bbox_with_number` function to
enable toggling the drawing of bounding boxes on or off. When set to `False`, no bounding
box will be drawn, allowing for situations where only text

43a57d56

refactor(draw_bbox): remove conditional layout bbox drawing · c56de493

myhloli authored Sep 27, 2024

Remove debug code related to layout bbox visualization and adjust drawing functions to
support optional line sorting bboxes. This change includes the removal of `draw_layout_bbox`
function and updates to `draw_bbox_with_number` to support variable line width for bbox drawing.

c56de493

refactor(draw_bbox): add line sorting visualization · 34f89650

myhloli authored Sep 27, 2024

Add a new function `draw_line_sort_bbox` to visualize the sorting of lines on each page.
This includes indexing lines and handling both text and non-text elements such as tables
and images for better content organization.

Also, comment out GPU-related code for flexibility and remove overlaps in bounding box
detection, which improves the accuracy of layout splitting.

34f89650

refactor(pdf_parse_union): integrate LayoutLMv3 for block orderingReplace the... · 1efebe42

myhloli authored Sep 27, 2024

refactor(pdf_parse_union): integrate LayoutLMv3 for block ordering

Replace the heuristic-based block ordering algorithm with LayoutLMv3 model predictions to
improve the accuracy of block ordering on PDF pages. Additionally, refactor the span
handling during block filling to ensure spans are correctly assigned.

- Introduce LayoutLMv3ForTokenClassification from 'hantian/layoutreader' to predict block
  order.
- Implement span replacement strategy to use pymu spans for non-OCR content.
- Enhance cleanup process to free GPU memory more effectively after model use.
- Adjust block ordering logic to use median line index for text, title, and interline equation blocks.
- Refactor page parsing core logic for better maintainability.

BREAKING CHANGE: The integration of LayoutLMv3 changes the internal block handling and
ordering mechanism, which may affect downstream systems relying on the previous
implementation. Test thoroughly before deployment.

1efebe42

refactor(draw_bbox): clear cuda cache and update bbox sorting · 36220d69

myhloli authored Sep 27, 2024

- Added CUDA cache clearing after layoutreader prediction to free up GPU memory.
- Modified the bbox sorting logic to sort text and title blocks separately.
- Adjusted drawing colors for better distinction in debug visualizations.

36220d69

26 Sep, 2024 2 commits

refactor(draw_bbox): clear cuda cache and update bbox sorting · 00cda7a6

myhloli authored Sep 26, 2024

- Added CUDA cache clearing after layoutreader prediction to free up GPU memory.
- Modified the bbox sorting logic to sort text and title blocks separately.
- Adjusted drawing colors for better distinction in debug visualizations.

00cda7a6

feat(draw_bbox): add layout sorting visualization · 270ffb02

myhloli authored Sep 26, 2024

Implement a new function `draw_layout_sort_bbox` in `draw_bbox.py` to visualize the
layout sorting results using the `LayoutLMv3ForTokenClassification` model. This function
predicts the order of layout elements and draws them in the sorted sequence on the PDF pages.

270ffb02

25 Sep, 2024 1 commit

feat(draw_bbox): add layout sorting visualization · 3cbcf2de

myhloli authored Sep 25, 2024

3cbcf2de

20 Sep, 2024 1 commit

fix(pdf_extract_kit): change unimernet base -> small · f2a3a495

myhloli authored Sep 20, 2024

f2a3a495

19 Sep, 2024 2 commits

fix(pdf-extract): ensure model is set to evaluation mode before processing · 4811a3d1

myhloli authored Sep 19, 2024

Add model.eval() invocation to pdf_extract_kit initialization sequence to ensure the
model is set to evaluation mode. This is critical for proper inference and performance
metrics when processing PDF content.

4811a3d1

refactor(pdf_extract): use Image.crop directly with layout detection · c36fa049

myhloli authored Sep 19, 2024

c36fa049