Commits · bedefd8d7882bda8a7614c7c0068451ac1fbda56 · wangsen / MinerU

28 Oct, 2024 1 commit
- fix: pattern match algorithm · f09148b9
  icecraft authored Oct 28, 2024
  
  f09148b9
27 Oct, 2024 2 commits

refactor(pdf_parse): adjust block splitting logic for wide blocks · 4cf7e9a2

myhloli authored Oct 27, 2024

- Modify the logic for splitting wide blocks exceeding 0.4 page width
- Remove the specific case for blocks exceeding 0.25 page width
- Add comments to explain the reasoning behind different splitting strategies

4cf7e9a2

docs: update model download instructions and simplify demo scripts · acab8de5

myhloli authored Oct 27, 2024

- Update model download instructions for versions 0.9.x and later
- Simplify demo scripts by removing unnecessary model configuration
- Add visualization function to draw bounding boxes
- Update CLI help message with new URL

acab8de5

26 Oct, 2024 1 commit

feat(draw_bbox): update bounding box drawing for tables and images · 0e8d5893

myhloli authored Oct 26, 2024

- Add support for drawing bounding boxes of table and image sub-blocks
- Implement sorting of table blocks based on type order
- Update bounding box drawing for text and title blocks
- Refactor code to handle different block types and their sub-blocks

0e8d5893

25 Oct, 2024 7 commits
- add init to magic_pdf.utils · 9cda7051
  myhloli authored Oct 26, 2024
  
  9cda7051
- add init to magic_pdf.config · 02b79992
  myhloli authored Oct 26, 2024
  
  02b79992
- fix: incorrect pair match · 969101dd
  icecraft authored Oct 25, 2024
  
  969101dd
- refactor(ocr): adjust OCR processing parameters · 1807126e
  myhloli authored Oct 25, 2024
```
- Lower the Y-axis overlap threshold for merging spans into lines from 0.6 to 0.5
- Reduce the unclip ratio for OCR detection from 2.4 to 1.8
```
  1807126e
- refactor(ocr): improve image and table block handling · c34c9d21
  myhloli authored Oct 25, 2024
```
- Split image and table blocks into separate categories
- Add group_id to image and table blocks
- Update block processing logic to handle new categories
- Modify layout splitting and span filling to accommodate new block types
- Adjust block indexing and sorting to consider new structures
```
  c34c9d21
- feat: update return result · 2c60172b
  icecraft authored Oct 25, 2024
  
  2c60172b
- feat: update table match caption algorithm · 92579040
  icecraft authored Oct 25, 2024
  
  92579040
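
The span-merging threshold change in 1807126e (0.6 to 0.5) can be pictured with a minimal sketch. This is not the project's code: the helper names, the `(x0, y0, x1, y1)` box layout, and the choice to measure overlap against the shorter box are all assumptions made for illustration.

```python
def y_overlap_ratio(a, b):
    """Vertical overlap of two (x0, y0, x1, y1) boxes, relative to the shorter box."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(overlap, 0.0) / shorter if shorter > 0 else 0.0


def merge_spans_into_lines(spans, threshold=0.5):
    """Group spans into lines (hypothetical helper).

    A span joins the current line when its vertical overlap with the
    line's most recent span exceeds `threshold`.
    """
    lines = []
    for span in sorted(spans, key=lambda s: (s[1], s[0])):
        if lines and y_overlap_ratio(lines[-1][-1], span) > threshold:
            lines[-1].append(span)
        else:
            lines.append([span])
    return lines


# The first two spans overlap vertically by 55% of their height.
spans = [(0, 0, 50, 10), (55, 4.5, 100, 14.5), (0, 30, 50, 40)]
lines_new = merge_spans_into_lines(spans, threshold=0.5)  # first two spans share a line
lines_old = merge_spans_into_lines(spans, threshold=0.6)  # every span is its own line
```

Under these assumptions, a lower threshold merges slightly misaligned spans that the stricter value would have kept apart.
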
24 Oct, 2024 3 commits
- refactor(magic_pdf): adjust confidence threshold for DocLayout_YOLO model · ce72cf05
  myhloli authored Oct 24, 2024
```
- Changed the confidence threshold from 0.15 to 0.25 in the DocLayout_YOLO model prediction
- This adjustment aims to improve the accuracy of layout detection by filtering out low-confidence predictions
```
  ce72cf05
- style: remove unused log info · c200effc
  icecraft authored Oct 24, 2024
  
  c200effc
- feat: add [figure | table] match [caption | footnote] match algorithm v2 · 283b597a
  icecraft authored Oct 19, 2024
```
feat: add Data api
```
  283b597a
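
The effect of the confidence threshold change in ce72cf05 (0.15 to 0.25) can be sketched as a simple filter. The detection dictionaries and the `filter_detections` helper are hypothetical; the actual model wrapper may apply the threshold inside the prediction call instead.

```python
def filter_detections(detections, conf_threshold=0.25):
    """Keep only detections whose score reaches the threshold (hypothetical helper)."""
    return [d for d in detections if d["score"] >= conf_threshold]


# Made-up detections, for illustration only.
detections = [
    {"label": "text", "score": 0.91},
    {"label": "figure", "score": 0.20},
    {"label": "title", "score": 0.12},
]

kept_old = filter_detections(detections, conf_threshold=0.15)  # text and figure
kept_new = filter_detections(detections, conf_threshold=0.25)  # text only
```
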
23 Oct, 2024 1 commit

feat(model): add support for DocLayout-YOLO model · 1279f2cd

myhloli authored Oct 23, 2024

- Add new layout model option: DocLayout-YOLO
- Implement model initialization and prediction for DocLayout-YOLO
- Update configuration options to include new model
- Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
- Update Gradio app to support more custom switches

1279f2cd

21 Oct, 2024 2 commits

fix(ocr_mkcontent): expand para_to_standard_format_v2 to handle list and index blocks · 64408576

myhloli authored Oct 21, 2024

- Modified the condition to include List and Index block types
- This change enhances the function's capability to process different paragraph types

64408576

refactor(para): improve paragraph splitting algorithm · 8cc76c49

myhloli authored Oct 21, 2024

- Adjust the threshold for identifying index blocks from 3 lines to 2 lines
- Add a new function __is_list_group to detect if a group of blocks is a list
- Modify the paragraph merging logic to handle list groups differently

8cc76c49

18 Oct, 2024 1 commit

refactor(magic_pdf): remove unused parameters and simplify functions · fc49f5c4

myhloli authored Oct 18, 2024

- Remove unused parameters parse_type and lang from various functions
- Simplify function calls by removing unnecessary arguments
- Update related files to reflect these changes

fc49f5c4

17 Oct, 2024 1 commit

refactor(ocr): Increase the dilation factor in OCR to address the issue of word concatenation. · 011a1b97

myhloli authored Oct 17, 2024

- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc.
- Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic
- Remove wordninja dependency from requirements
- Update ocr_model_init to include additional parameters for OCR model configuration

011a1b97

15 Oct, 2024 4 commits

refactor(para_split_v3): refine list block detection in paragraph splitting · 81b9fd7b

myhloli authored Oct 15, 2024

- Update list block detection logic to require at least 2 numeric start lines
- Ensure the number of numeric start lines matches the number of end lines
- Remove detection of non-border starting lines for simplicity

81b9fd7b

fix(split_v3): Fix the rule adaptation for some special list samples. · 244b8684

myhloli authored Oct 15, 2024

244b8684

refactor(pdf): adjust span filling threshold in block construction · 7e301b84

myhloli authored Oct 15, 2024

Increased the threshold for filling spans in blocks from 0.3 to 0.5 to improve the accuracy of block formation. This change helps refine the grouping of spans into blocks, potentially enhancing the overall structure and readability of the PDF content.

7e301b84
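
One way to read the span-filling threshold in 7e301b84 is as a minimum fraction of a span's area that must fall inside a block. The sketch below assumes that reading; `overlap_fraction`, `fill_spans`, and the `(x0, y0, x1, y1)` box layout are hypothetical and not taken from the repository.

```python
def overlap_fraction(span, block):
    """Fraction of the span's area that lies inside the block."""
    width = min(span[2], block[2]) - max(span[0], block[0])
    height = min(span[3], block[3]) - max(span[1], block[1])
    area = (span[2] - span[0]) * (span[3] - span[1])
    if width <= 0 or height <= 0 or area <= 0:
        return 0.0
    return (width * height) / area


def fill_spans(block, spans, threshold=0.5):
    """Return the spans that belong to the block under the given threshold."""
    return [s for s in spans if overlap_fraction(s, block) > threshold]


block = (0, 0, 100, 50)
spans = [
    (10, 10, 40, 20),   # fully inside the block
    (80, 10, 130, 20),  # 40% inside
    (90, 10, 140, 20),  # 20% inside
]

filled_old = fill_spans(block, spans, threshold=0.3)  # first two spans
filled_new = fill_spans(block, spans, threshold=0.5)  # first span only
```

With the higher value, a span that mostly sits outside the block is no longer pulled into it.
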

refactor(para_split_v3): merge list and index block detection · fdcb49d3

myhloli authored Oct 15, 2024

- Combine __is_list_block() and __is_index_block() into a single function __is_list_or_index_block()
- Simplify block type determination logic
- Remove redundant code and improve readability
- Optimize block merging process

fdcb49d3

14 Oct, 2024 2 commits

fix(magic_pdf): include List and Index block types in processing · 0a9a6d3e

myhloli authored Oct 15, 2024

Add List and Index to the list of block types being processed in the draw_bbox.py file. This inclusion ensures that these block types are handled similarly to other text-containing blocks, improving the overall document processing accuracy and consistency.

0a9a6d3e

feat(list&index block): detect and merge list and index blocks · 1f1dd353

myhloli authored Oct 15, 2024

- Add detection for list and index blocks in OCR processing
- Implement merging of list and index blocks across pages
- Update block types to include list and index categories
- Adjust text merging logic to handle new block types
- Modify layout drawing to distinguish list and index blocks

1f1dd353

10 Oct, 2024 2 commits

fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks · 7b42d5a0

myhloli authored Oct 10, 2024

7b42d5a0

feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support · 6f63e70e

myhloli authored Oct 10, 2024

- Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process
- Add support for specifying page range in doc_analyze_by_custom_model
- Implement garbage collection and memory cleaning after processing
- Refine image loading from PDF, including handling out-of-range pages

6f63e70e
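
The page-range handling described in 6f63e70e might look roughly like the sketch below. The function name, the 0-based inclusive convention, and the clamping behaviour are assumptions for illustration; the actual signature of `doc_analyze_by_custom_model` is not shown in this log.

```python
def resolve_page_range(page_count, start_page=0, end_page=None):
    """Clamp a 0-based, inclusive page range to the document (hypothetical helper).

    An `end_page` that is missing, negative, or past the last page is
    treated as "up to the last page". An out-of-range start yields an
    empty range.
    """
    if page_count <= 0:
        return range(0)
    if end_page is None or end_page < 0 or end_page >= page_count:
        end_page = page_count - 1
    start_page = max(start_page, 0)
    return range(start_page, end_page + 1)


middle = list(resolve_page_range(10, 2, 4))   # pages 2, 3, 4
clamped = list(resolve_page_range(3, 1, 99))  # end clamped to the last page
empty = list(resolve_page_range(3, 5))        # start is past the last page
```
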

08 Oct, 2024 5 commits

feat(layoutreader): support local model directory and improve model loading · ded2818a

myhloli authored Oct 08, 2024

- Add function to get local LayoutReader model directory
- Check and use local model directory if available
- Fall back to online model if local directory not found
- Update model initialization to support local path
- Refactor model loading in singleton class

ded2818a
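
The local-first loading order described in ded2818a can be sketched as follows. `resolve_model_source` and the online identifier are hypothetical names used only for this example; how the project actually locates its local directory is not shown in this log.

```python
import os
import tempfile


def resolve_model_source(local_dir, online_id):
    """Return the local directory if it exists, otherwise the online identifier."""
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    return online_id


# Hypothetical identifier, used only for this illustration.
ONLINE_ID = "example-org/layoutreader"

with tempfile.TemporaryDirectory() as existing_dir:
    found = resolve_model_source(existing_dir, ONLINE_ID)        # local directory wins
missing = resolve_model_source("/no/such/model/dir", ONLINE_ID)  # falls back to online
```
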

fix: caption|footnote match algorithm · f31433b8

icecraft authored Oct 08, 2024

f31433b8

fix: caption or footnote match algorithm · ef45ad08

icecraft authored Oct 08, 2024

ef45ad08

perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity · fb9949c4

myhloli authored Oct 08, 2024

- Introduce a conditional memory cleanup step in the PDF extraction process
- Assess available GPU memory before deciding to perform memory cleanup
- Log the time taken for garbage collection when it occurs
- This optimization helps to balance performance and resource utilization

fb9949c4
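
A minimal sketch of the conditional cleanup idea in fb9949c4, assuming the decision is a simple comparison against a memory limit. The `maybe_clean_memory` helper and the 8 GB default are assumptions, and the available memory is passed in as a plain number so the example runs without a GPU.

```python
import gc
import time


def maybe_clean_memory(available_gpu_memory_gb, limit_gb=8.0):
    """Run garbage collection only when GPU memory is scarce (hypothetical helper).

    Returns the elapsed time in seconds, or None when cleanup was skipped.
    The 8 GB default is an assumption, not a value taken from the commit.
    """
    if available_gpu_memory_gb >= limit_gb:
        return None
    start = time.perf_counter()
    gc.collect()  # a GPU build would typically also release cached device memory here
    elapsed = time.perf_counter() - start
    print(f"gc time: {elapsed:.2f} s")
    return elapsed


skipped = maybe_clean_memory(24.0)  # plenty of memory: no cleanup
cleaned = maybe_clean_memory(4.0)   # small device: cleanup runs and is timed
```

Skipping the collection on large devices avoids paying the cleanup cost when memory pressure is unlikely.
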

feat: add arXiv paper link to header and adjust PDF parsing logic- Add arXiv... · a71db703

myhloli authored Oct 08, 2024

- Add arXiv paper link to the header template for easy access to the latest research paper.
- Modify the PDF parsing logic to handle edge cases more accurately, particularly in determining the number of lines in a block based on its height.

a71db703

06 Oct, 2024 1 commit

refactor(model): improve timing information and performance · be1b1ae7

myhloli authored Oct 06, 2024

- Enhance timing output precision to two decimal places for better readability
- Calculate and log document analysis speed in pages per second
- Optimize logging for YOLO and table recognition processes
- Remove unnecessary comments and improve code efficiency

be1b1ae7
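
The timing output described in be1b1ae7 (two decimal places, pages per second) can be illustrated with a small formatter. The function name and message wording are hypothetical; only the precision and the unit come from the commit message.

```python
def format_speed(page_count, elapsed_seconds):
    """Format analysis time and throughput with two decimal places (hypothetical helper)."""
    if elapsed_seconds <= 0:
        return "doc analyze speed: n/a"
    speed = page_count / elapsed_seconds
    return (
        f"doc analyze time: {elapsed_seconds:.2f} s, "
        f"speed: {speed:.2f} pages/second"
    )


message = format_speed(12, 4.8)
```
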

30 Sep, 2024 1 commit
- chore: remove useless files · fcf24242
  myhloli authored Sep 30, 2024
  
  fcf24242
29 Sep, 2024 2 commits

refactor(magic_pdf): improve line sorting and block indexing · 564c4ce1

myhloli authored Sep 30, 2024

- Insert lines into blocks based on median line height
- Calculate block index using line indices median
- Remove virtual line information for table and image blocks
- Enhance line sorting algorithm for different block types
- Add line height calculation function

564c4ce1
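
The "block index from the median of line indices" idea in 564c4ce1 can be sketched as below. The `block_index` helper and the sample data are hypothetical, and the handling of blocks without lines is an assumption.

```python
from statistics import median


def block_index(line_indices):
    """Reading-order index of a block: the median of its line indices.

    Returns None for a block without lines (an assumption in this sketch).
    """
    return median(line_indices) if line_indices else None


# Made-up blocks mapped to the reading-order indices of their lines.
blocks = {"a": [4, 5, 6], "b": [1, 2], "c": [7, 8]}
order = sorted(blocks, key=lambda name: block_index(blocks[name]))
```

Compared with taking the first or last line index, a median is less sensitive to a single misplaced line within a block.
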

refactor(memory management): remove unused clean_memory function · 4c9bf8ab

myhloli authored Sep 29, 2024

The clean_memory function has been removed from pdf_parse_union_core_v2.py due to it not being used.
This change streamlines the code and prevents potential confusion regarding its purpose.

4c9bf8ab

28 Sep, 2024 3 commits

refactor(magic_pdf): import model helpers directly for clarity · 42a7d792

myhloli authored Sep 28, 2024

Update import statements in `pdf_parse_union_core_v2.py` to directly import
`prepare_inputs`, `boxes2inputs`, and `parse_logits` from `magic_pdf.model.v3.helpers`
instead of from `magic_pdf.model.v3`. This change streamlines the imports, making the
code more readable and maintaining a cleaner approach to modular design.

42a7d792

refactor(pdf_parse_union_core_v2): update import paths to use new package structure · 5522d0a3

myhloli authored Sep 28, 2024

Adapt import statements in `pdf_parse_union_core_v2.py` to reflect the updated package
structure, changing from the `magic_pdf.v3.helpers` module to the `magic_pdf.model.v3`
module. This ensures compatibility with the revised directory layout.

5522d0a3

fix(pdf_parse): handle blocks without lines and enable bf16 on compatible devices · 2145a8b6

myhloli authored Sep 28, 2024

Blocks without lines are now correctly indexed even when they contain textual content rendered
as images. The sorting logic has been updated to accommodate this scenario. Additionally, the
LayoutLMv3 model initialization has been enhanced to utilize bfloat16 precision on devices that
support it, offering potential performance benefits on supported hardware.

2145a8b6

27 Sep, 2024 1 commit

refactor(pdf_parse): remove redundant sorting and optimize block indexing · 177ab08e

myhloli authored Sep 28, 2024

Removed redundant sorting of lines by model and optimized calculation of block
indexes by using a single pass through the sorted lines. This change simplifies the
code and potentially improves performance by reducing the number of sorting
operations and unnecessary iterations over blocks without lines.

177ab08e