Commits · ded2818ac2384fe344761a1741bbe9b33244e6d2 · wangsen / MinerU

08 Oct, 2024 5 commits

feat(layoutreader): support local model directory and improve model loading · ded2818a

myhloli authored Oct 08, 2024

- Add function to get local LayoutReader model directory
- Check and use local model directory if available
- Fall back to online model if local directory not found
- Update model initialization to support local path
- Refactor model loading in singleton class

ded2818a
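
The local-first lookup described in commit ded2818a can be sketched roughly as below. This is a minimal illustration, not the project's code: the function name `resolve_layoutreader_source` and its `local_dir` parameter are hypothetical, and the online identifier `hantian/layoutreader` is taken from the 1efebe42 commit message further down this log.

```python
import os

# Online model identifier, as named in the 1efebe42 commit message.
ONLINE_MODEL_ID = "hantian/layoutreader"


def resolve_layoutreader_source(local_dir):
    """Return the local model directory if it exists, else the online model id.

    `local_dir` is a hypothetical parameter; how the real code discovers the
    directory (config file, environment variable, ...) is not stated in the log.
    """
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    # Fall back to the online model when no local directory is found.
    return ONLINE_MODEL_ID
```

The returned string could then be handed to whatever loader the pipeline uses, so the rest of the initialization does not need to know where the weights came from.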

fix: caption|footnote match algorithm · f31433b8
icecraft authored Oct 08, 2024

f31433b8
fix: caption or footnote match algorithm · ef45ad08
icecraft authored Oct 08, 2024

ef45ad08

perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity · fb9949c4

myhloli authored Oct 08, 2024

- Introduce a conditional memory cleanup step in the PDF extraction process
- Assess available GPU memory before deciding to perform memory cleanup
- Log the time taken for garbage collection when it occurs
- This optimization helps to balance performance and resource utilization

fb9949c4
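
A rough sketch of the conditional cleanup idea in commit fb9949c4 follows. The function name, the `gpu_memory_gb` parameter, and the 8 GB threshold are assumptions made for illustration; the log does not state the actual threshold or how memory is measured. In a PyTorch setup the figure might come from `torch.cuda.mem_get_info()`, but the sketch takes it as a plain number so it runs without a GPU.

```python
import gc
import time


def clean_memory_if_needed(gpu_memory_gb, threshold_gb=8.0):
    """Run garbage collection only when GPU memory is at or below a threshold.

    Returns the elapsed time in seconds, or None when cleanup is skipped.
    Both the parameter and the default threshold are hypothetical.
    """
    if gpu_memory_gb > threshold_gb:
        # Enough memory: skip the cleanup to save time.
        return None
    start = time.perf_counter()
    gc.collect()
    elapsed = time.perf_counter() - start
    # Log how long the collection took, as the commit describes.
    print(f"gc time: {elapsed:.2f}s")
    return elapsed
```

Skipping the collection on large-memory devices trades a little retained memory for lower per-page latency, which appears to be the balance the commit message refers to.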

feat: add arXiv paper link to header and adjust PDF parsing logic- Add arXiv... · a71db703

myhloli authored Oct 08, 2024

feat: add arXiv paper link to header and adjust PDF parsing logic

- Add arXiv paper link to the header template for easy access to the latest research paper.
- Modify the PDF parsing logic to handle edge cases more accurately, particularly in determining the number of lines in a block based on its height.

a71db703

06 Oct, 2024 1 commit

refactor(model): improve timing information and performance · be1b1ae7

myhloli authored Oct 06, 2024

- Enhance timing output precision to two decimal places for better readability
- Calculate and log document analysis speed in pages per second
- Optimize logging for YOLO and table recognition processes
- Remove unnecessary comments and improve code efficiency

be1b1ae7
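
The timing output described in commit be1b1ae7 (two decimal places plus pages per second) might look something like this. The function name and the exact message wording are assumptions, not the project's actual log format.

```python
def format_doc_analyze_timing(page_count, elapsed_seconds):
    """Format elapsed time and throughput with two decimal places."""
    # Guard against a zero or negative duration to avoid division by zero.
    speed = page_count / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return f"doc analyze time: {elapsed_seconds:.2f}s, speed: {speed:.2f} pages/s"


print(format_doc_analyze_timing(10, 4.0))
# prints "doc analyze time: 4.00s, speed: 2.50 pages/s"
```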

30 Sep, 2024 1 commit
- chore: remove useless files · fcf24242
  myhloli authored Sep 30, 2024
  
  fcf24242
29 Sep, 2024 2 commits

refactor(magic_pdf): improve line sorting and block indexing · 564c4ce1

myhloli authored Sep 30, 2024

- Insert lines into blocks based on median line height
- Calculate block index using line indices median
- Remove virtual line information for table and image blocks
- Enhance line sorting algorithm for different block types
- Add line height calculation function

564c4ce1
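
The median-based block indexing mentioned in commit 564c4ce1 can be illustrated with a small sketch. The dictionary layout (`lines`, `index`) and the function name are hypothetical stand-ins for the project's real data structures, and the sketch assumes every block has at least one line.

```python
from statistics import median


def assign_block_indices(blocks):
    """Set each block's index to the median of its line indices, then sort.

    Assumes each block holds a non-empty "lines" list whose items carry an
    integer "index" assigned by an earlier line-sorting step.
    """
    for block in blocks:
        line_indices = [line["index"] for line in block["lines"]]
        block["index"] = median(line_indices)
    # Blocks are ordered by their median line index.
    return sorted(blocks, key=lambda b: b["index"])
```

Using the median rather than the first or last line index makes a block's position less sensitive to a single line that was sorted out of place.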

refactor(memory management): remove unused clean_memory function · 4c9bf8ab

myhloli authored Sep 29, 2024

The clean_memory function has been removed from pdf_parse_union_core_v2.py because it was unused.
This change streamlines the code and prevents potential confusion regarding its purpose.

4c9bf8ab

28 Sep, 2024 3 commits

refactor(magic_pdf): import model helpers directly for clarity · 42a7d792

myhloli authored Sep 28, 2024

Update import statements in `pdf_parse_union_core_v2.py` to directly import
`prepare_inputs`, `boxes2inputs`, and `parse_logits` from `magic_pdf.model.v3.helpers`
instead of from `magic_pdf.model.v3`. This change streamlines the imports, making the
code more readable and maintaining a cleaner approach to modular design.

42a7d792

refactor(pdf_parse_union_core_v2): update import paths to use new package structure · 5522d0a3

myhloli authored Sep 28, 2024

Adapt import statements in `pdf_parse_union_core_v2.py` to reflect the updated package
structure, changing from the `magic_pdf.v3.helpers` module to the `magic_pdf.model.v3`
module. This ensures compatibility with the revised directory layout.

5522d0a3

fix(pdf_parse): handle blocks without lines and enable bf16 on compatible devices · 2145a8b6

myhloli authored Sep 28, 2024

Blocks without lines are now correctly indexed even when they contain textual content rendered
as images. The sorting logic has been updated to accommodate this scenario. Additionally, the
LayoutLMv3 model initialization has been enhanced to utilize bfloat16 precision on devices that
support it, offering potential performance benefits on supported hardware.

2145a8b6

27 Sep, 2024 9 commits

refactor(pdf_parse): remove redundant sorting and optimize block indexing · 177ab08e

myhloli authored Sep 28, 2024

Removed redundant sorting of lines by model and optimized calculation of block
indexes by using a single pass through the sorted lines. This change simplifies the
code and potentially improves performance by reducing the number of sorting
operations and unnecessary iterations over blocks without lines.

177ab08e

refactor(draw_bbox): remove commented-out code and streamline bbox... · 83c07387

myhloli authored Sep 28, 2024

refactor(draw_bbox): remove commented-out code and streamline bbox drawing

Removed legacy commented-out code related to layout_bbox_list from draw_bbox.py, which
was used for diagnostic purposes and was no longer necessary. This change streamlines
the codebase and clarifies the drawing process of bounding boxes on PDF pages. The update
also adjusts the order of operations slightly for improved readability without altering
the functionality.

83c07387

refactor(pdf_parse_union_core_v2): implement model initialization within... · b9dfdea3

myhloli authored Sep 28, 2024

refactor(pdf_parse_union_core_v2): implement model initialization within class

Refactored model initialization to be handled by a singleton class to ensure that model
instances are reused across calls, avoiding redundant initializations. Removed logger
information that was commented out and ensured consistency in logging behavior.

b9dfdea3
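
The singleton-based model reuse described in commit b9dfdea3 could be sketched as follows. The class name `ModelSingleton`, the `get_model` method, and the injected `loader` callable are illustrative assumptions; the log does not show the actual class.

```python
class ModelSingleton:
    """Process-wide holder that loads each model at most once (hypothetical)."""

    _instance = None
    _models = {}

    def __new__(cls):
        # Always hand back the same instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, name, loader):
        # Call the loader only on first request; reuse the cached model after.
        if name not in self._models:
            self._models[name] = loader(name)
        return self._models[name]
```

Passing the loader in as a callable keeps the sketch independent of any particular model library; repeated calls with the same name return the cached object without re-initializing it.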

refactor(drawing): simplify draw bbox functions and adjust debug config · b2790f6f

myhloli authored Sep 28, 2024

Refactor the draw bbox functions by removing unused imports and simplifying the
code logic for drawing layout and line sorting bounding boxes. Adjust the debug
configuration to enable content list dumping and disable markdown making mode.

b2790f6f

feat(draw_bbox): add option to toggle bounding box drawing · 43a57d56

myhloli authored Sep 27, 2024

Introduce an additional argument `draw_bbox` in the `draw_bbox_with_number` function to
enable toggling the drawing of bounding boxes on or off. When set to `False`, no bounding
box will be drawn, allowing for situations where only text

43a57d56

refactor(draw_bbox): remove conditional layout bbox drawing · c56de493

myhloli authored Sep 27, 2024

Remove debug code related to layout bbox visualization and adjust drawing functions to
support optional line sorting bboxes. This change includes the removal of `draw_layout_bbox`
function and updates to `draw_bbox_with_number` to support variable line width for bbox drawing.

c56de493

refactor(draw_bbox): add line sorting visualization · 34f89650

myhloli authored Sep 27, 2024

Add a new function `draw_line_sort_bbox` to visualize the sorting of lines on each page.
This includes indexing lines and handling both text and non-text elements such as tables
and images for better content organization.

Also, comment out GPU-related code for flexibility and remove overlaps in bounding box
detection, which improves the accuracy of layout splitting.

34f89650

refactor(pdf_parse_union): integrate LayoutLMv3 for block orderingReplace the... · 1efebe42

myhloli authored Sep 27, 2024

refactor(pdf_parse_union): integrate LayoutLMv3 for block ordering

Replace the heuristic-based block ordering algorithm with LayoutLMv3 model predictions to
improve the accuracy of block ordering on PDF pages. Additionally, refactor the span
handling during block filling to ensure spans are correctly assigned.

- Introduce LayoutLMv3ForTokenClassification from 'hantian/layoutreader' to predict block
  order.
- Implement span replacement strategy to use pymu spans for non-OCR content.
- Enhance cleanup process to free GPU memory more effectively after model use.
- Adjust block ordering logic to use median line index for text, title, and interline equation blocks.
- Refactor page parsing core logic for better maintainability.

BREAKING CHANGE: The integration of LayoutLMv3 changes the internal block handling and
ordering mechanism, which may affect downstream systems relying on the previous
implementation. Ensure to test thoroughly before deployment.

1efebe42

refactor(draw_bbox): clear cuda cache and update bbox sorting · 36220d69

myhloli authored Sep 27, 2024

- Added CUDA cache clearing after layoutreader prediction to free up GPU memory.
- Modified the bbox sorting logic to sort text and title blocks separately.
- Adjusted drawing colors for better distinction in debug visualizations.

36220d69

26 Sep, 2024 2 commits

refactor(draw_bbox): clear cuda cache and update bbox sorting · 00cda7a6

myhloli authored Sep 26, 2024

- Added CUDA cache clearing after layoutreader prediction to free up GPU memory.
- Modified the bbox sorting logic to sort text and title blocks separately.
- Adjusted drawing colors for better distinction in debug visualizations.

00cda7a6

feat(draw_bbox): add layout sorting visualization · 270ffb02

myhloli authored Sep 26, 2024

Implement a new function `draw_layout_sort_bbox` in `draw_bbox.py` to visualize the
layout sorting results using the `LayoutLMv3ForTokenClassification` model. This function
predicts the order of layout elements and draws them in the sorted sequence on the PDF pages.

270ffb02

25 Sep, 2024 1 commit

feat(draw_bbox): add layout sorting visualization · 3cbcf2de

myhloli authored Sep 25, 2024

3cbcf2de

20 Sep, 2024 1 commit
- fix(pdf_extract_kit):change unimernet base -> small · f2a3a495
  myhloli authored Sep 20, 2024
  
  f2a3a495
19 Sep, 2024 2 commits
- fix(pdf-extract): ensure model is set to evaluation mode before processing · 4811a3d1
  myhloli authored Sep 19, 2024
```
Add model.eval() invocation to pdf_extract_kit initialization sequence to ensure the
model is set to evaluation mode. This is critical for proper inference and performance
metrics when processing PDF content.
```
  4811a3d1
- refactor(pdf_extract): use Image.crop directly with layout detection · c36fa049
  myhloli authored Sep 19, 2024
  
  c36fa049
18 Sep, 2024 4 commits
- feat(UNIPipe): change default drop_mode to NONE_WITH_REASON · 23b621e0
  myhloli authored Sep 18, 2024
  
  23b621e0
- fix(ocr_mkcontent): streamline drop reason handling · 16699a9a
  myhloli authored Sep 18, 2024
  
  16699a9a
- fix(ocr_mkcontent): correct drop mode handling for pages with drop reasons · 196de029
  myhloli authored Sep 18, 2024
  
  196de029
- feat(ocr_mkcontent): support drop reason in none_with_reason modeEnable the... · 37fbe998
  myhloli authored Sep 18, 2024
```
feat(ocr_mkcontent): support drop reason in none_with_reason mode

Enable the `NONE_WITH_REASON` drop mode in `para_to_standard_format_v2` by updating the
function signature to include the `drop_reason` parameter and handling it within the
function logic. This enhancement allows the function to convey the reason for dropping
content in the output.
```
  37fbe998
12 Sep, 2024 7 commits
- Delete magic_pdf/__pycache__ directory · 4ec0373c
  Xiaomeng Zhao authored Sep 12, 2024
  
  4ec0373c
- fix: solve conflicts · a4c72e2e
  myhloli authored Sep 12, 2024
  
  a4c72e2e
- fix: recovert the lang option in tools/cli.py · 78bdf53e
  icecraft authored Sep 12, 2024
  
  78bdf53e
- fix: 1. resolve incorrect pair relation of figure and footnote, 2. resolve... · 6cc8cbca
  icecraft authored Sep 12, 2024
```
fix: 1. resolve incorrect pair relation of figure and footnote, 2. resolve incorrect pair relation of table and caption #590
```
  6cc8cbca
- feat: add magic-pdf-dev case · fea2b7bd
  quyuan authored Sep 12, 2024
  
  fea2b7bd
- feat: add magic-pdf-dev case · 65734029
  quyuan authored Sep 12, 2024
  
  65734029
- feat(pipeline): pass language parameter for parsing and markdown conversion · 6062862c
  myhloli authored Sep 12, 2024
```
The pipeline now supports passing the language parameter to parsing functions and
during markdown conversion to optimize processing based on the specified language.
This enhancement allows for more accurate parsing and markdown generation, particularly
when dealing with non-English content.
```
  6062862c
10 Sep, 2024 2 commits

Update version.py with new version · 1df51694
myhloli authored Sep 10, 2024

1df51694

Release 0.8.0 (#587) · 55404808

drunkpig authored Sep 10, 2024



* release: release 0.7.1 version (#526)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Hotfix readme 0.7.1 (#528)

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

add HF, modelscope, colab url

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Rename README.md to README_zh-CN.md

* Create readme.md

* Rename readme.md to README.md

* Rename README.md to README_zh-CN.md

* Update README_zh-CN.md

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Create download_models_hf.py

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* Update FAQ_zh_cn.md

* Update FAQ_en_us.md

* Update FAQ_zh_cn.md

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#573)

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* Update README_zh-CN.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* add rag data api

* Update README_zh-CN.md

update rag api image

* Update README.md

docs: remove RAG related release notes

* Update README_zh-CN.md

docs: remove RAG related release notes

* Update README_zh-CN.md

update changelog

---------
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

55404808