Unverified Commit 9f352df0 authored by drunkpig, committed by GitHub

Release 0.8.0 (#586)



* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* fix(ocr_mkcontent): improve language detection and content formatting (#458)

Optimize the language detection logic to improve content formatting. This change
addresses issues with long-word segmentation. Language detection now uses a threshold
on the proportion of English characters to determine the language of a text.
Formatting rules have been updated to consult a list of languages (initially
including Chinese, Japanese, and Korean) for which no space is added between
inline-equation and text spans, improving the handling of Asian languages.

The impact of these changes includes improved accuracy in language detection, better
segmentation of long words, and more appropriate spacing in content formatting for multiple
languages.
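A minimal sketch of the described approach — classify by the proportion of English characters against a threshold, then join spans without spaces for CJK languages. Function names and the 0.5 threshold are illustrative assumptions, not the project's actual API:

```python
def detect_lang(text: str, en_threshold: float = 0.5) -> str:
    """Classify text as 'en' when the proportion of ASCII letters among
    non-space characters exceeds the threshold (labels are hypothetical)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return "unknown"
    en_count = sum(1 for ch in chars if "a" <= ch.lower() <= "z")
    return "en" if en_count / len(chars) > en_threshold else "unknown"

def join_spans(spans, lang):
    # For CJK languages, concatenate spans without inserting spaces.
    sep = "" if lang in ("zh", "ja", "ko") else " "
    return sep.join(spans)
```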

* fix(self_modify): merge detection boxes for optimized text region detection (#448)

Merge adjacent and overlapping detection boxes to optimize text region detection in
the document. Post-processing of text boxes is enhanced by consolidating them into
larger text lines, taking into account their vertical and horizontal alignment. This
improvement reduces fragmentation and improves the readability of detected text blocks.
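The merging step above can be sketched as repeatedly replacing any two overlapping boxes with their bounding union until no overlaps remain (a simplified illustration; the real implementation also weighs vertical and horizontal alignment):

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes given as (x0, y0, x1, y1)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge_boxes(boxes):
    """Repeatedly merge overlapping boxes into their bounding union."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if boxes_overlap(merged[i], merged[j]):
                    a, b = merged[i], merged[j]
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```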

* fix(pdf-extract): adjust box threshold for OCR detection (#447)

Tuned the detection box threshold parameter in the OCR model initialization to improve the
accuracy of text extraction from images. The threshold was lowered from 0.6 to 0.3,
adjusting how smaller detection boxes are filtered, which is expected to enhance the
quality of the extracted text by reducing noise and false positives in the OCR process.
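A minimal sketch of how such a score threshold typically acts (the parameter name `det_box_thresh` follows PaddleOCR's naming convention and is an assumption here):

```python
def filter_det_boxes(boxes_with_scores, det_box_thresh=0.3):
    """Keep only detection boxes whose confidence meets the threshold;
    lowering the threshold (0.6 -> 0.3) retains more candidate boxes."""
    return [box for box, score in boxes_with_scores if score >= det_box_thresh]
```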

* feat: rename the file generated by command line tools (#401)

* feat: rename the file generated by command line tools

* feat: add pdf filename as prefix to {span,layout,model}.pdf

---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* fix(ocr_mkcontent): revise table caption output (#397)

* fix(ocr_mkcontent): revise table caption output

- Ensure that table captions are properly included in the output.
- Remove the redundant `table_caption` variable.

* Update cla.yml

* Update bug_report.yml

* feat(cli): add debug option for detailed error handling

Enable users to invoke the CLI command with a new debug flag to get detailed debugging information.
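The debug flag pattern can be sketched with the standard library as follows (the real CLI uses its own option parser; `--debug`'s exact wiring here is an illustrative stand-in):

```python
import argparse
import traceback

def build_parser():
    parser = argparse.ArgumentParser(prog="magic-pdf")
    parser.add_argument("--debug", action="store_true",
                        help="print full tracebacks instead of a short error line")
    return parser

def run(argv):
    args = build_parser().parse_args(argv)
    try:
        raise RuntimeError("parse failed")   # stand-in for the real pipeline
    except Exception as exc:
        if args.debug:
            return traceback.format_exc()    # detailed debugging information
        return f"error: {exc}"               # terse message by default
```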

* fix(pdf-extract-kit): adjust crop_paste parameters for better accuracy

The crop_paste_x and crop_paste_y values in pdf_extract_kit.py have been modified
to improve the accuracy and consistency of OCR processing. The new values are set to 25
to ensure more precise image cropping and pasting, which leads to better OCR recognition
results.

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* fix(pdf-extract-kit): increase crop_paste margin for OCR processing

Double the crop_paste margin from 25 to 50 to ensure better OCR accuracy and
handling of border cases. This change helps improve the overall quality of
OCR'ed text by providing more context around the detected text areas.
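The margin expansion described here amounts to growing each box by a fixed number of pixels on every side, clamped to the page bounds (a sketch; the helper name is assumed):

```python
def crop_with_margin(box, page_w, page_h, margin=50):
    """Expand a text box (x0, y0, x1, y1) by `margin` pixels on every side,
    clamped to the page, so OCR sees more context around the text."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(page_w, x1 + margin), min(page_h, y1 + margin))
```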

* fix(common): deep copy model list before drawing model bbox

Use a deep copy of the original model list in `drow_model_bbox` to avoid potential
modifications to the source data. This ensures the integrity of the original models
is maintained while generating the model bounding boxes visualization.
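The deep-copy guard works like this (a sketch with a hypothetical mutation; the actual function in the codebase is `drow_model_bbox`):

```python
import copy

def draw_model_bbox(model_list):
    """Visualize bboxes on a deep copy so the caller's data stays intact."""
    models = copy.deepcopy(model_list)
    for model in models:
        model["bbox"] = [round(v) for v in model["bbox"]]  # mutate the copy only
    return models
```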

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* build(docker): update docker build step (#471)

* build(docker): update base image to Ubuntu 22.04 and install PaddlePaddle

Upgrade the Docker base image from ubuntu:latest to ubuntu:22.04 for improved
performance and stability.

Additionally, integrate PaddlePaddle GPU version 3.0.0b1
into the Docker build for enhanced AI capabilities. The MinIO configuration file has
also been updated to the latest version.

* build(dockerfile): Updated the Dockerfile

* build(Dockerfile): update Dockerfile

* docs(docker): add instructions for quick deployment with Docker

Include Docker-based deployment instructions in the README for both English and
Chinese locales. This update provides users with a quick-start guide to using Docker for
deployment, with notes on GPU VRAM requirements and default acceleration features.

* build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself.


* upload an introduction about chemical formula and update readme.md (#489)

* upload an introduction about chemical formula

* rename 2 files

* update readme.md at TODO in chemistry

* rename 2 files and update readme.md at TODO in chemistry

* update README_zh-CN.md at TODO in chemistry

* fix: remove the default value of output option in tools/cli.py and tools/cli_dev.py (#494)
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* feat: add test case (#499)
Co-authored-by: quyuan <quyuan@pjlab.org>

* Update cla.yml

* Update gpu-ci.yml

* Update cli.yml

* Delete .github/workflows/gpu-ci.yml

* fix(pdf-parse-union-core): #492 decrease span threshold for block filling (#500)

Reduce the span threshold used in fill_spans_in_blocks from 0.6 to 0.3 to
improve the accuracy of block filling based on layout analysis.
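The span-to-block assignment can be sketched as an overlap-ratio test against the threshold (an illustration; the real `fill_spans_in_blocks` operates on richer structures):

```python
def overlap_ratio(span, block):
    """Area of span∩block divided by the span's own area."""
    x0 = max(span[0], block[0]); y0 = max(span[1], block[1])
    x1 = min(span[2], block[2]); y1 = min(span[3], block[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    span_area = (span[2] - span[0]) * (span[3] - span[1])
    return inter / span_area if span_area else 0.0

def fill_spans_in_blocks(blocks, spans, threshold=0.3):
    """Assign each span to the first block it overlaps by at least `threshold`."""
    filled = {i: [] for i in range(len(blocks))}
    for span in spans:
        for i, block in enumerate(blocks):
            if overlap_ratio(span, block) >= threshold:
                filled[i].append(span)
                break
    return filled
```

Lowering the threshold from 0.6 to 0.3 lets spans that only partially overlap a layout block (e.g. spans straddling a block edge) still be assigned to it.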

* fix(detect_all_bboxes): remove small overlapping blocks by merging (#501)

Previously, small blocks that overlapped with larger ones were merely removed. This fix
changes the approach to merge smaller blocks into the larger block instead, ensuring that
no information is lost and the larger block encompasses all the text content fully.

* feat(cli&analyze&pipeline): add start_page and end_page args for pagination (#507)

* feat(cli&analyze&pipeline): add start_page and end_page args for pagination

Add start_page_id and end_page_id arguments to various components of the PDF parsing
pipeline to support pagination. This feature allows users to specify the range of pages
to be processed, enhancing the efficiency and flexibility of the system.
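The page-range behavior can be sketched as an inclusive slice, where an unset or negative end means "through the last page" (names mirror the commit; the function itself is illustrative):

```python
def select_pages(pages, start_page_id=0, end_page_id=None):
    """Return the inclusive [start_page_id, end_page_id] slice of pages.
    end_page_id=None (or negative) means 'through the last page'."""
    if end_page_id is None or end_page_id < 0:
        end_page_id = len(pages) - 1
    return pages[start_page_id:end_page_id + 1]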

* Feat/support rag (#510)

* Create requirements-docker.txt

* feat: update deps to support rag

* feat: add support to rag, add rag_data_reader api for rag integration

* feat: let user retrieve the filename of the processed file

* feat: add projects demo for rag integrations

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* Update Dockerfile

* feat(gradio): add app by gradio (#512)

* fix: replace \u0002, \u0003 in common text (#521)

* fix replace \u0002, \u0003 in common text

* fix(para): When an English line ends with a hyphen, do not add a space at the end.

* fix(end_page_id): fix the issue where end_page_id is corrected to len-1 when its input is 0 (#518)
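The bug here is the classic falsy-zero trap: a check like `if not end_page_id` also fires for a legitimate value of 0 and rewrites it to the last page. A sketch of the corrected check (the helper name is hypothetical):

```python
def normalize_end_page(end_page_id, page_count):
    # Buggy variant: `if not end_page_id: return page_count - 1`
    # also triggers on 0 and silently processes the whole document.
    # Correct: only treat None/negative as "use the last page".
    if end_page_id is None or end_page_id < 0:
        return page_count - 1
    return end_page_id
```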

* fix(para): When an English line ends with a hyphen, do not add a space at the end. (#523)

* fix: delete hyphen at end of line
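Both text-cleaning fixes above can be sketched together — stripping the stray \u0002/\u0003 control characters and joining lines without an extra space after a trailing hyphen (illustrative helpers, not the project's actual functions):

```python
def clean_text(text: str) -> str:
    """Remove stray \u0002/\u0003 control characters left by extraction."""
    return text.replace("\u0002", "").replace("\u0003", "")

def join_lines(lines):
    """Join lines with spaces, except after a line ending in a hyphen:
    then concatenate directly so hyphenated words are not split."""
    out = ""
    for line in lines:
        if out.endswith("-"):
            out += line          # no space after a hyphenated line break
        elif out:
            out += " " + line
        else:
            out = line
    return out
```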

* Release: release 0.7.1 version, update dev (#527)

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Hotfix readme 0.7.1 (#529)

* release: release 0.7.1 version (#526)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

---------
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Update README_zh-CN.md

delete Known issue about table recognition

* Update Dockerfile

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#542)

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: typo error in markdown (#536)
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* fix(gradio): remove unused imports and simplify pdf display (#534)

Removed the previously used gradio and gradio-pdf imports which were not leveraged in the code. Also,
replaced the custom `show_pdf` function with direct use of the `PDF` component from gradio for a simpler
and more integrated PDF upload and display solution, improving code maintainability and readability.

* Feat/support footnote in figure (#532)

* feat: support figure footnote

* feat: using the relative position to combine footnote, table, image

* feat: add the readme of projects

* fix: code spell in unittest

---------
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* refactor(pdf_extract_kit): implement singleton pattern for atomic models (#533)

Refactor the pdf_extract_kit module to utilize a singleton pattern when initializing
atomic models. This change ensures that atomic models are instantiated at most once,
optimizing memory usage and reducing redundant initialization steps. The AtomModelSingleton
class now manages the instantiation and retrieval of atomic models, improving the
overall structure and efficiency of the codebase.
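The singleton caching described here can be sketched as follows — one shared registry that builds each atomic model at most once (the `get_model` signature is an illustrative assumption):

```python
class AtomModelSingleton:
    """Cache model instances so each atomic model is built at most once."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, name, factory):
        if name not in self._models:
            self._models[name] = factory()   # expensive init runs only once
        return self._models[name]
```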

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

add HF, ModelScope, and Colab URLs

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Rename README.md to README_zh-CN.md

* Create readme.md

* Rename readme.md to README.md

* Rename README.md to README_zh-CN.md

* Update README_zh-CN.md

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#573)

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* Update README_zh-CN.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* add rag data api

* Update README_zh-CN.md

update rag api image

* Update README.md

docs: remove RAG related release notes

* Update README_zh-CN.md

docs: remove RAG related release notes

* Update README_zh-CN.md

update the changelog

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: icecraft <tmortred@163.com>
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Siyu Hao <131659128+GDDGCZ518@users.noreply.github.com>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: quyuan <quyuan@pjlab.org>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>
parent b6633cd6
...@@ -6,20 +6,24 @@ on: ...@@ -6,20 +6,24 @@ on:
push: push:
branches: branches:
- "master" - "master"
- "dev"
paths-ignore: paths-ignore:
- "cmds/**" - "cmds/**"
- "**.md" - "**.md"
- "**.yml"
pull_request: pull_request:
branches: branches:
- "master" - "master"
- "dev"
paths-ignore: paths-ignore:
- "cmds/**" - "cmds/**"
- "**.md" - "**.md"
- "**.yml"
workflow_dispatch: workflow_dispatch:
jobs: jobs:
cli-test: cli-test:
runs-on: ubuntu-latest runs-on: pdf
timeout-minutes: 40 timeout-minutes: 120
strategy: strategy:
fail-fast: true fail-fast: true
@@ -28,27 +32,22 @@ jobs:
        uses: actions/checkout@v3
        with:
          fetch-depth: 2
-      - name: check-requirements
-        run: |
-          pip install -r requirements.txt
-          pip install -r requirements-qa.txt
-          pip install magic-pdf
-      - name: test_cli
-        run: |
-          cp magic-pdf.template.json ~/magic-pdf.json
-          echo $GITHUB_WORKSPACE
-          cd $GITHUB_WORKSPACE && export PYTHONPATH=. && pytest -s -v tests/test_unit.py
-          cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli.py
-      - name: benchmark
-        run: |
-          cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_bench.py
+      - name: install
+        run: |
+          echo $GITHUB_WORKSPACE && sh tests/retry_env.sh
+      - name: unit test
+        run: |
+          cd $GITHUB_WORKSPACE && export PYTHONPATH=. && coverage run -m pytest tests/test_unit.py --cov=magic_pdf/ --cov-report term-missing --cov-report html
+          cd $GITHUB_WORKSPACE && python tests/get_coverage.py
+      - name: cli test
+        run: |
+          cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli_sdk.py
  notify_to_feishu:
    if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'master') }}
-    needs: [cli-test]
-    runs-on: ubuntu-latest
+    needs: cli-test
+    runs-on: pdf
    steps:
      - name: get_actor
        run: |
@@ -67,9 +66,5 @@ jobs:
      - name: notify
        run: |
-          curl ${{ secrets.WEBHOOK_URL }} -H 'Content-Type: application/json' -d '{
-            "msgtype": "text",
-            "text": {
-              "mentioned_list": ["${{ env.METIONS }}"], "content": "'${{ github.repository }}' GitHubAction Failed!\n For details, see: https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"
-            }
-          }'
+          echo ${{ secrets.USER_ID }}
+          curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"'${{ github.repository }}' GitHubAction Failed","content":[[{"tag":"text","text":""},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"},{"tag":"at","user_id":"'${{ secrets.USER_ID }}'"}]]}}}}' ${{ secrets.WEBHOOK_URL }}
\ No newline at end of file
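The new notify step packs the whole Feishu "post" message into a single shell-quoted curl argument, which is hard to read. As a sketch only, the same payload structure can be written out in Python; `repository`, `run_id`, and `user_id` are hypothetical placeholders for values that, in CI, come from the GitHub context and repository secrets.

```python
import json

# Hypothetical placeholders; in the workflow these come from
# github.repository, GITHUB_RUN_ID, and secrets.USER_ID.
repository = "opendatalab/MinerU"
run_id = "123456"
user_id = "dummy-user-id"

payload = {
    "msg_type": "post",
    "content": {
        "post": {
            "zh_cn": {
                "title": f"{repository} GitHubAction Failed",
                "content": [[
                    {"tag": "text", "text": ""},
                    {
                        "tag": "a",
                        "text": "Please click here for details ",
                        "href": f"https://github.com/{repository}/actions/runs/{run_id}",
                    },
                    {"tag": "at", "user_id": user_id},
                ]],
            }
        }
    },
}

# This JSON string is what the curl command posts to the webhook URL.
body = json.dumps(payload, ensure_ascii=False)
print(body)
```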
@@ -30,10 +30,10 @@ tmp/
tmp
.vscode
.vscode/
-/tests/
ocr_demo
/app/common/__init__.py
/magic_pdf/config/__init__.py
source.dev.env
+tmp
@@ -3,6 +3,7 @@ repos:
    rev: 5.0.4
    hooks:
      - id: flake8
+        args: ["--max-line-length=120", "--ignore=E131,E125,W503,W504,E203"]
  - repo: https://github.com/PyCQA/isort
    rev: 5.11.5
    hooks:
@@ -11,6 +12,7 @@ repos:
    rev: v0.32.0
    hooks:
      - id: yapf
+        args: ["--style={based_on_style: google, column_limit: 120, indent_width: 4}"]
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.1
    hooks:
@@ -41,4 +43,4 @@ repos:
    rev: v1.3.1
    hooks:
      - id: docformatter
-        args: ["--in-place", "--wrap-descriptions", "79"]
+        args: ["--in-place", "--wrap-descriptions", "119"]
# Use the official Ubuntu base image
-FROM ubuntu:latest
+FROM ubuntu:22.04

# Set environment variables to non-interactive to avoid prompts during installation
ENV DEBIAN_FRONTEND=noninteractive
@@ -29,17 +29,23 @@ RUN python3 -m venv /opt/mineru_venv

# Activate the virtual environment and install necessary Python packages
RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
-    pip install --upgrade pip && \
-    pip install magic-pdf[full-cpu] detectron2 --extra-index-url https://myhloli.github.io/wheels/"
+    pip3 install --upgrade pip && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/requirements-docker.txt && \
+    pip3 install -r requirements-docker.txt --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple && \
+    pip3 install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/"

-# Copy the configuration file template and set up the model directory
-COPY magic-pdf.template.json /root/magic-pdf.json
-# Set the models directory in the configuration file (adjust the path as needed)
-RUN sed -i 's|/tmp/models|/opt/models|g' /root/magic-pdf.json
-# Create the models directory
-RUN mkdir -p /opt/models
+# Copy the configuration file template and install the latest magic-pdf
+RUN /bin/bash -c "wget https://gitee.com/myhloli/MinerU/raw/master/magic-pdf.template.json && \
+    cp magic-pdf.template.json /root/magic-pdf.json && \
+    source /opt/mineru_venv/bin/activate && \
+    pip3 install -U magic-pdf"
+
+# Download models and update the configuration file
+RUN /bin/bash -c "pip3 install modelscope && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models.py && \
+    python3 download_models.py && \
+    sed -i 's|/tmp/models|/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models|g' /root/magic-pdf.json && \
+    sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
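The two `sed` edits in the new Dockerfile rewrite plain strings inside `magic-pdf.json`. A structural alternative is to edit the JSON itself; this is only a sketch (not what the image runs), and the inline `template` string with its key names is an assumption standing in for the real config file described elsewhere in this release.

```python
import json

# Stand-in for magic-pdf.json; the key names here are assumptions for
# illustration, not a guaranteed match for the real template.
template = '{"models-dir": "/tmp/models", "device-mode": "cpu"}'

cfg = json.loads(template)
# Mirrors: sed -i 's|/tmp/models|...PDF-Extract-Kit/models|g'
cfg["models-dir"] = "/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models"
# Mirrors: sed -i 's|cpu|cuda|g'
cfg["device-mode"] = "cuda"

updated = json.dumps(cfg, indent=2)
print(updated)
```

Editing the parsed structure avoids the main pitfall of `sed` here: a global `s|cpu|cuda|g` would also rewrite any other value that happens to contain the substring "cpu".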
@@ -5,6 +5,7 @@
</p>

<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
@@ -12,17 +13,26 @@
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Demo-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7W
g9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/ModelScope-Demo-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](#)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

<!-- language -->
[English](README.md) | [简体中文](README_zh-CN.md)

<!-- hot link -->
<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
</p>

<!-- join us -->
<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>
@@ -30,12 +40,14 @@
</div>

# Changelog
+- 2024/09/09: Version 0.8.0 released, supporting fast deployment with a Dockerfile and launching demos on Hugging Face and ModelScope.
- 2024/08/30: Version 0.7.1 released, added the paddle TableMaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified the installation process and added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release

<!-- TABLE OF CONTENT -->
<details open="open">
  <summary><h2 style="display: inline-block">Table of Contents</h2></summary>
  <ol>
@@ -74,10 +86,10 @@
  </ol>
</details>

# MinerU

## Project Introduction

MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format.

MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models.

Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.
@@ -101,6 +113,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c

If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a>. </br>
If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br>
There are three different ways to experience MinerU:

- [Online Demo (No Installation Required)](#online-demo)
- [Quick CPU Demo (Windows, Linux, Mac)](#quick-cpu-demo)
- [Linux/Windows + CUDA](#Using-GPU)
@@ -171,33 +184,41 @@ In non-mainline environments, due to the diversity of hardware and software conf
### Quick CPU Demo

#### 1. Install magic-pdf

```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
```

#### 2. Download model weight files

Refer to [How to Download Model Files](docs/how_to_download_models_en.md) for detailed instructions.

> ❗️After downloading the models, please make sure to verify the completeness of the model files.
>
> Check if the model file sizes match the description on the webpage. If possible, use sha256 to verify the integrity of the files.
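The sha256 check suggested above is easy to script. A minimal sketch, using only the standard library; the demo file written at the end is a throwaway placeholder standing in for a real model weight file.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through sha256 so large model weights never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small throwaway file; point this at a real weight file instead,
# then compare the printed digest against the published checksum.
demo = Path("demo_weight.bin")
demo.write_bytes(b"example bytes")
print(sha256_of(demo))
```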
#### 3. Copy and configure the template file

You can find the `magic-pdf.template.json` template configuration file in the root directory of the repository.

> ❗️Make sure to execute the following command to copy the configuration file to your **user directory**; otherwise, the program will not run.
>
> The user directory on Windows is `C:\Users\YourUsername`, on Linux `/home/YourUsername`, and on macOS `/Users/YourUsername`.

```bash
cp magic-pdf.template.json ~/magic-pdf.json
```

Find the `magic-pdf.json` file in your user directory and configure the "models-dir" path to point to the directory where the model weight files were downloaded in [Step 2](#2-download-model-weight-files).

> ❗️Make sure to correctly configure the **absolute path** to the model weights directory; otherwise the program will not run because it cannot find the model files.
>
> On Windows, this path should include the drive letter, and all backslashes (`\`) in the path should be replaced with forward slashes (`/`) to avoid syntax errors in the JSON file due to escape sequences.
>
> For example: if the models are stored in the "models" directory at the root of the D drive, the "models-dir" value should be `D:/models`.

```json
{
  // other config
@@ -210,13 +231,26 @@ Find the `magic-pdf.json` file in your user directory and configure the "models-
}
```
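The backslash-to-forward-slash conversion described above can also be done programmatically with `pathlib`; `PureWindowsPath` parses Windows-style paths identically on any host OS, so this sketch runs anywhere.

```python
from pathlib import PureWindowsPath

# Convert a Windows-style models path into the forward-slash form
# the JSON config expects.
win_path = r"D:\models"
json_safe = PureWindowsPath(win_path).as_posix()
print(json_safe)  # D:/models
```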
### Using GPU

If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please select the appropriate guide for your system:

- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_en_US.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
- Quick deployment with Docker
  > Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
  >
  > Before running this Docker image, you can use the following command to check whether your device supports CUDA acceleration under Docker.
  >
  > ```bash
  > docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
  > ```

  ```bash
  wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
  docker build -t mineru:latest .
  docker run --rm -it --gpus=all mineru:latest /bin/bash
  magic-pdf --help
  ```
## Usage

@@ -230,12 +264,12 @@ Options:
  -v, --version                display the version and exit
  -p, --path PATH              local pdf filepath or directory  [required]
  -o, --output-dir TEXT        output local directory
  -m, --method [ocr|txt|auto]  the method for parsing pdf.
                               ocr: using ocr technique to extract information from pdf,
                               txt: suitable for the text-based pdf only and outperform ocr,
                               auto: automatically choose the best method for parsing pdf
                               from ocr and txt.
                               without method specified, auto will be used by default.
  --help                       Show this message and exit.
@@ -250,13 +284,13 @@ magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
The results will be saved in the `{some_output_dir}` directory. The output file list is as follows:

```text
├── some_pdf.md                 # markdown file
├── images                      # directory for storing images
-├── layout.pdf                  # layout diagram
-├── middle.json                 # MinerU intermediate processing result
-├── model.json                  # model inference result
-├── origin.pdf                  # original PDF file
-└── spans.pdf                   # smallest granularity bbox position information diagram
+├── some_pdf_layout.pdf         # layout diagram
+├── some_pdf_middle.json        # MinerU intermediate processing result
+├── some_pdf_model.json         # model inference result
+├── some_pdf_origin.pdf         # original PDF file
+└── some_pdf_spans.pdf          # smallest granularity bbox position information diagram
```

For more information about the output files, please refer to the [Output File Description](docs/output_file_en_us.md).
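Since this release renames the outputs to carry the source pdf's stem, a quick sketch for checking that a run produced the files listed above; `out_dir` and `stem` are hypothetical placeholders for a real run's output directory and pdf name.

```python
from pathlib import Path

# Hypothetical output location; adjust to a real magic-pdf run.
out_dir = Path("some_output_dir")
stem = "some_pdf"

# The file list from the README above, with the new stem-prefixed names.
expected = [
    f"{stem}.md",
    "images",
    f"{stem}_layout.pdf",
    f"{stem}_middle.json",
    f"{stem}_model.json",
    f"{stem}_origin.pdf",
    f"{stem}_spans.pdf",
]

missing = [name for name in expected if not (out_dir / name).exists()]
print(f"missing {len(missing)} of {len(expected)} expected outputs")
```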
@@ -264,6 +298,7 @@ For more information about the output files, please refer to the [Output File De
### API

Processing files from local disk

```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
@@ -276,6 +311,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```

Processing files from object storage

```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
@@ -290,10 +326,10 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```

For detailed implementation, refer to:

- [demo.py Simplest Processing Method](demo/demo.py)
- [magic_pdf_parse_main.py More Detailed Processing Workflow](demo/magic_pdf_parse_main.py)

### Development Guide

TODO
@@ -305,10 +341,11 @@ TODO
- [ ] Code block recognition within the text
- [ ] Table of contents recognition
- [x] Table recognition
-- [ ] Chemical formula recognition
+- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] Geometric shape recognition

# Known Issues

- Reading order is segmented based on rules, which can cause disordered sequences in some cases
- Vertical text is not supported
- Lists, code blocks, and tables of contents are not yet supported by the layout model
@@ -318,11 +355,11 @@ TODO

# FAQ

[FAQ in Chinese](docs/FAQ_zh_cn.md)

[FAQ in English](docs/FAQ_en_us.md)

# All Thanks To Our Contributors

<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
@@ -335,8 +372,8 @@ TODO
This project currently uses PyMuPDF to achieve advanced functionality. However, since it adheres to the AGPL license, it may impose restrictions on certain usage scenarios. In future iterations, we plan to explore replacing it with a more permissive PDF processing library to enhance user-friendliness and flexibility.

# Acknowledgments

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
@@ -373,9 +410,11 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
</a>

# Magic-doc

[Magic-Doc](https://github.com/InternLM/magic-doc): a fast ppt/pptx/doc/docx/pdf extraction tool

# Magic-html

[Magic-HTML](https://github.com/opendatalab/magic-html): a mixed web page extraction tool

# Links
......
@@ -4,8 +4,8 @@
  <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p>

<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
@@ -13,33 +13,41 @@
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Demo-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7W
g9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/ModelScope-Demo-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](#)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

<!-- language -->

[English](README.md) | [简体中文](README_zh-CN.md)

<!-- hot link -->

<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: a high-quality PDF parsing toolkit</a>🔥🔥🔥
</p>

<!-- join us -->

<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>

</div>
# Changelog

- 2024/09/09 Release 0.8.0: quick deployment via Dockerfile is now supported, and the Hugging Face and ModelScope demos are live
- 2024/08/30 Release 0.7.1: integrated Paddle TableMaster table recognition
- 2024/08/09 Release 0.7.0b1: simplified installation for better usability, added table recognition
- 2024/08/01 Release 0.6.2b1: resolved dependency conflicts and improved the installation docs
- 2024/07/05 Initial open-source release
<!-- TABLE OF CONTENT -->

<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
</ol>
</details>
# MinerU

## Introduction

MinerU is a tool that converts PDFs into machine-readable formats (e.g. markdown, JSON), making it easy to extract content into any target format.

MinerU was born during the pre-training of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol-conversion problems in scientific literature, hoping to contribute to technological progress in the era of large models.

Compared with well-known commercial products at home and abroad, MinerU is still young. If you run into problems or the results fall short of expectations, please file an [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.
https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
- Supports both CPU and GPU environments
- Supports Windows, Linux, and macOS

## Quick Start

If you run into any installation issue, please check the <a href="#faq">FAQ</a> first.<br>
If the parsing results are not as expected, refer to <a href="#known-issues">Known Issues</a>.<br>

There are three ways to try out MinerU:

- [Online demo (no installation required)](#在线体验)
- [Quick CPU demo (Windows, Linux, Mac)](#quick-cpu-demo)
- [Linux/Windows + CUDA](#using-gpu)

**⚠️ Read this before installing: software and hardware environment support**

To ensure stability and reliability, we only optimize and test for specific software and hardware environments during development. Users who deploy and run the project on the recommended configurations will get the best performance with the fewest compatibility issues.
[Click here for the online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
### Quick CPU Demo

#### 1. Install magic-pdf

The latest release may take a while to sync to Chinese mirror sources; please be patient.

```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
```
#### 2. Download model weight files

See [How to download model files](docs/how_to_download_models_zh_cn.md) for details.

> ❗️After downloading, make sure the model files are complete.
>
> Check that the model file sizes match the description on the page; if possible, verify integrity with a sha256 checksum.
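The sha256 check suggested above can be scripted; this is a minimal sketch, and the model file path in the comment is only a placeholder:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the sha256 digest of a file, reading in chunks to keep memory flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


# Compare the result against the checksum published on the model page, e.g.:
# print(sha256_of("models/some_model_file.pth"))
```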
#### 3. Copy and edit the config file

You can find the [magic-pdf.template.json](magic-pdf.template.json) template file in the repository root.

> ❗️Be sure to copy the config file to your **home directory** with the command below, otherwise the program will not run.
>
> The home directory is "C:\Users\username" on Windows, "/home/username" on Linux, and "/Users/username" on macOS.

```bash
cp magic-pdf.template.json ~/magic-pdf.json
```

Find magic-pdf.json in your home directory and set "models-dir" to the directory containing the model weights downloaded in [2. Download model weight files](#2-download-model-weight-files).

> ❗️Be sure to configure the **absolute path** of the model weights directory, otherwise the program will fail because the model files cannot be found.
>
> On Windows this path must include the drive letter, and every `"\"` in the path must be replaced with `"/"`, otherwise escaping will make the JSON file syntactically invalid.
>
> For example: if the models are in the "models" directory at the root of drive D, "models-dir" should be "D:/models".
```json
{
  // other config
  // ...
}
```
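Because every `"\"` in a Windows path must become `"/"` in the JSON, a small helper can normalize the path while writing the config. This is a sketch only: it assumes the template file is plain JSON without comments, and the file names follow the steps above.

```python
import json
from pathlib import PureWindowsPath


def write_config(template_path: str, models_dir: str, out_path: str) -> None:
    """Copy the template config, setting models-dir with forward slashes."""
    with open(template_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # JSON escaping: use "/" instead of "\" so the file stays valid on Windows
    config["models-dir"] = PureWindowsPath(models_dir).as_posix()
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=4)
```

For example, `write_config("magic-pdf.template.json", "D:\\models", "~/magic-pdf.json")` would store `"D:/models"` in the config.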
### Using GPU

If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please choose the guide for your system:

- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
- Quick deployment with Docker

> Docker requires a GPU with at least 16 GB of VRAM; all acceleration features are enabled by default.
>
> Before running this Docker image, you can use the following command to check whether your device supports CUDA acceleration in Docker:
>
> ```bash
> docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
> ```
```bash
wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
docker build -t mineru:latest .
docker run --rm -it --gpus=all mineru:latest /bin/bash
magic-pdf --help
```
## Usage
Options:
  -v, --version                display the version and exit
  -p, --path PATH              local pdf filepath or directory  [required]
  -o, --output-dir TEXT        output local directory
  -m, --method [ocr|txt|auto]  the method for parsing pdf.
                               ocr: using ocr technique to extract information from pdf,
                               txt: suitable for the text-based pdf only and outperform ocr,
                               auto: automatically choose the best method for parsing pdf
                               from ocr and txt.
                               without method specified, auto will be used by default.
  --help                       Show this message and exit.
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
After the command finishes, the results are saved in the `{some_output_dir}` directory; the output files are listed below:

```text
├── some_pdf.md                 # markdown file
├── images                      # directory of extracted images
├── some_pdf_layout.pdf         # layout rendering
├── some_pdf_middle.json        # MinerU intermediate processing result
├── some_pdf_model.json         # model inference result
├── some_pdf_origin.pdf         # original pdf file
└── some_pdf_spans.pdf          # rendering of the finest-grained bbox positions
```
For more information about the output files, see the [output file description](docs/output_file_zh_cn.md).
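As a quick sanity check after a run, you can verify that the files listed above were produced. The naming scheme follows the listing; the exact directory layout may differ between versions, so treat this as a sketch:

```python
import os


def check_outputs(output_dir: str, stem: str) -> list[str]:
    """Return the expected output files (per the listing above) that are missing."""
    expected = [
        f"{stem}.md",
        f"{stem}_layout.pdf",
        f"{stem}_middle.json",
        f"{stem}_model.json",
        f"{stem}_origin.pdf",
        f"{stem}_spans.pdf",
    ]
    return [name for name in expected
            if not os.path.exists(os.path.join(output_dir, name))]
```

An empty return value means every expected file is present.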
### API

Process files from local disk:

```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
# ...
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
Process files from object storage:

```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
# ...
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
For full implementations, see:

- [demo.py: the simplest processing flow](demo/demo.py)
- [magic_pdf_parse_main.py: shows the processing pipeline more clearly](demo/magic_pdf_parse_main.py)

### Further Development

TODO
- [ ] Code block recognition in body text
- [ ] Table of contents recognition
- [x] Table recognition
- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] Geometric shape recognition
# Known Issues

- Reading order is based on rule-based segmentation and can be wrong in some cases
- Vertical text is not supported
- Lists, code blocks, and tables of contents are not yet supported by the layout model
# FAQ

[FAQ in Chinese](docs/FAQ_zh_cn.md)

[FAQ in English](docs/FAQ_en_us.md)
# All Thanks To Our Contributors
This project currently uses PyMuPDF for advanced functionality. Because PyMuPDF is licensed under the AGPL, this may restrict some usage scenarios. In future releases we plan to explore a switch to a PDF processing library with more permissive licensing, to improve user-friendliness and flexibility.

# Acknowledgments

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
</a>
# Magic-doc

[Magic-Doc](https://github.com/InternLM/magic-doc) A fast ppt/pptx/doc/docx/pdf extraction tool

# Magic-html

[Magic-HTML](https://github.com/opendatalab/magic-html) A mixed web page extraction tool

# Links
# Copyright (c) Opendatalab. All rights reserved.
import base64
import os
import time
import zipfile
from pathlib import Path
import re
from loguru import logger
from magic_pdf.libs.hash_utils import compute_sha256
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, prepare_env
# Install the UI dependencies at runtime (e.g. on a hosted demo environment)
os.system("pip install gradio")
os.system("pip install gradio-pdf")
import gradio as gr
from gradio_pdf import PDF
def read_fn(path):
    disk_rw = DiskReaderWriter(os.path.dirname(path))
    return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
def parse_pdf(doc_path, output_dir, end_page_id):
    os.makedirs(output_dir, exist_ok=True)
    try:
        file_name = f"{str(Path(doc_path).stem)}_{time.time()}"
        pdf_data = read_fn(doc_path)
        parse_method = "auto"
        local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
        do_parse(
            output_dir,
            file_name,
            pdf_data,
            [],
            parse_method,
            False,
            end_page_id=end_page_id,
        )
        return local_md_dir, file_name
    except Exception as e:
        logger.exception(e)
def compress_directory_to_zip(directory_path, output_zip_path):
    """
    Compress the given directory into a ZIP file.

    :param directory_path: path of the directory to compress
    :param output_zip_path: path of the output ZIP file
    """
    try:
        with zipfile.ZipFile(output_zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            # Walk all files and subdirectories in the directory
            for root, dirs, files in os.walk(directory_path):
                for file in files:
                    # Build the full file path
                    file_path = os.path.join(root, file)
                    # Compute the path relative to the directory root
                    arcname = os.path.relpath(file_path, directory_path)
                    # Add the file to the ZIP archive
                    zipf.write(file_path, arcname)
        return 0
    except Exception as e:
        logger.exception(e)
        return -1
def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
def replace_image_with_base64(markdown_text, image_dir_path):
    # Match image tags in the Markdown text
    pattern = r'\!\[(?:[^\]]*)\]\(([^)]+)\)'

    # Replace an image link with an inline base64 data URI
    def replace(match):
        relative_path = match.group(1)
        full_path = os.path.join(image_dir_path, relative_path)
        base64_image = image_to_base64(full_path)
        return f"![{relative_path}](data:image/jpeg;base64,{base64_image})"

    # Apply the replacement
    return re.sub(pattern, replace, markdown_text)
def to_markdown(file_path, end_pages):
    # Parse the PDF and get the markdown output directory and file name
    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1)
    archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip")
    zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
    if zip_archive_success == 0:
        logger.info("Compression succeeded")
    else:
        logger.error("Compression failed")
    md_path = os.path.join(local_md_dir, file_name + ".md")
    with open(md_path, 'r', encoding='utf-8') as f:
        txt_content = f.read()
    md_content = replace_image_with_base64(txt_content, local_md_dir)
    # Return the path of the layout-annotated PDF
    new_pdf_path = os.path.join(local_md_dir, file_name + "_layout.pdf")
    return md_content, txt_content, archive_zip_path, new_pdf_path
# def show_pdf(file_path):
#     with open(file_path, "rb") as f:
#         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
#     pdf_display = f'<embed src="data:application/pdf;base64,{base64_pdf}" ' \
#                   f'width="100%" height="1000" type="application/pdf">'
#     return pdf_display
latex_delimiters = [{"left": "$$", "right": "$$", "display": True},
                    {"left": '$', "right": '$', "display": False}]
def init_model():
    from magic_pdf.model.doc_analyze_by_custom_model import ModelSingleton
    try:
        model_manager = ModelSingleton()
        txt_model = model_manager.get_model(False, False)
        logger.info("txt_model init final")
        ocr_model = model_manager.get_model(True, False)
        logger.info("ocr_model init final")
        return 0
    except Exception as e:
        logger.exception(e)
        return -1


model_init = init_model()
logger.info(f"model_init: {model_init}")
if __name__ == "__main__":
    with gr.Blocks() as demo:
        with gr.Row():
            with gr.Column(variant='panel', scale=5):
                pdf_show = gr.Markdown()
                max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages")
                with gr.Row() as bu_flow:
                    change_bu = gr.Button("Convert")
                    clear_bu = gr.ClearButton([pdf_show], value="Clear")
                pdf_show = PDF(label="Please upload pdf", interactive=True, height=800)
            with gr.Column(variant='panel', scale=5):
                output_file = gr.File(label="convert result", interactive=False)
                with gr.Tabs():
                    with gr.Tab("Markdown rendering"):
                        md = gr.Markdown(label="Markdown rendering", height=900, show_copy_button=True,
                                         latex_delimiters=latex_delimiters, line_breaks=True)
                    with gr.Tab("Markdown text"):
                        md_text = gr.TextArea(lines=45, show_copy_button=True)
        change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages], outputs=[md, md_text, output_file, pdf_show])
        clear_bu.add([md, pdf_show, md_text, output_file])

    demo.launch()
## Overview

After executing the `magic-pdf` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
### some_pdf_layout.pdf

Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `some_pdf_layout.pdf`, different content blocks are highlighted with different background colors.

![layout example](images/layout_example.png)
### some_pdf_spans.pdf

All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.

![spans example](images/spans_example.png)
### some_pdf_model.json

#### Structure Definition
```python
from pydantic import BaseModel, Field
from enum import IntEnum


class CategoryType(IntEnum):
    # ...
    table_footnote = 7   # Table footnote
    isolate_formula = 8  # Block formula
    formula_caption = 9  # Formula label
    embedding = 13       # Inline formula
    isolated = 14        # Block formula
    text = 15            # OCR recognition result


class PageInfo(BaseModel):
    page_no: int = Field(description="Page number, the first page is 0", ge=0)
    height: int = Field(description="Page height", gt=0)
    # ...


class ObjectInferenceResult(BaseModel):
    # ...
    score: float = Field(description="Confidence of the inference result")
    latex: str | None = Field(description="LaTeX parsing result", default=None)
    html: str | None = Field(description="HTML parsing result", default=None)


class PageInferenceResults(BaseModel):
    layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
    page_info: PageInfo = Field(description="Page metadata")


# The inference results of all pages, ordered by page number,
# are stored in a list as the inference results of MinerU
inference_result: list[PageInferenceResults] = []
```
The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively.

![Poly Coordinate Diagram](images/poly.png)
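Since the poly stores the four corners in a fixed order, an axis-aligned bounding box `[xmin, ymin, xmax, ymax]` can be recovered by taking coordinate extremes; this helper is an illustrative sketch, not part of the MinerU API:

```python
def poly_to_bbox(poly: list[float]) -> list[float]:
    """Convert [x0, y0, x1, y1, x2, y2, x3, y3] corner coordinates
    (top-left, top-right, bottom-right, bottom-left) to [xmin, ymin, xmax, ymax]."""
    xs = poly[0::2]  # x coordinates of the four corners
    ys = poly[1::2]  # y coordinates of the four corners
    return [min(xs), min(ys), max(xs), max(ys)]


# For an axis-aligned box the corners already agree with the extremes:
# poly_to_bbox([10, 20, 110, 20, 110, 220, 10, 220]) -> [10, 20, 110, 220]
```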
#### example

```json
...
]
```
### some_pdf_middle.json

| Field Name     | Description                                                                                                    |
| :------------- | :------------------------------------------------------------------------------------------------------------- |
| pdf_info       | list, each element is a dict representing the parsing result of each PDF page, see the table below for details  |
| \_parse_type   | ocr \| txt, used to indicate the mode used in this intermediate parsing state                                   |
| \_version_name | string, indicates the version of magic-pdf used in this parsing                                                 |

<br>
Field structure description

| Field Name          | Description                                                                                                        |
| :------------------ | :----------------------------------------------------------------------------------------------------------------- |
| preproc_blocks      | Intermediate result after PDF preprocessing, not yet segmented                                                      |
| layout_bboxes       | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order  |
| page_idx            | Page number, starting from 0                                                                                        |
| page_size           | Page width and height                                                                                               |
| \_layout_tree       | Layout tree structure                                                                                               |
| images              | list, each element is a dict representing an img_block                                                              |
| tables              | list, each element is a dict representing a table_block                                                             |
| interline_equations | list, each element is a dict representing an interline_equation_block                                               |
| discarded_blocks    | List, block information returned by the model that needs to be dropped                                              |
| para_blocks         | Result after segmenting preproc_blocks                                                                              |

In the above table, `para_blocks` is an array of dicts, each dict representing a block structure. A block can support up to one level of nesting.
The outer block is referred to as a first-level block, and the fields in the first-level block include:

| Field Name | Description                                                    |
| :--------- | :------------------------------------------------------------- |
| type       | Block type (table\|image)                                      |
| bbox       | Block bounding box coordinates                                 |
| blocks     | list, each element is a dict representing a second-level block |

<br>
There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks.

The fields in a second-level block include:

| Field Name | Description                                                                                                 |
| :--------- | :---------------------------------------------------------------------------------------------------------- |
| type       | Block type                                                                                                   |
| bbox       | Block bounding box coordinates                                                                               |
| lines      | list, each element is a dict representing a line, used to describe the composition of a line of information  |

Detailed explanation of second-level block types

| type               | Description            |
| :----------------- | :--------------------- |
| image_body         | Main body of the image |
| image_caption      | Image description text |
| table_body         | Main body of the table |
| table_caption      | Table description text |
| table_footnote     | Table footnote         |
| text               | Text block             |
| title              | Title block            |
| interline_equation | Block formula          |

<br>
The field format of a line is as follows:

| Field Name | Description                                                                                             |
| :--------- | :------------------------------------------------------------------------------------------------------ |
| bbox       | Bounding box coordinates of the line                                                                     |
| spans      | list, each element is a dict representing a span, used to describe the composition of the smallest unit  |

<br>

**span**

| Field Name          | Description                                                                                              |
| :------------------ | :-------------------------------------------------------------------------------------------------------- |
| bbox                | Bounding box coordinates of the span                                                                       |
| type                | Type of the span                                                                                           |
| content \| img_path | Text spans use content, chart spans use img_path to store the actual text or screenshot path information   |

The types of spans are as follows:

| type               | Description    |
| :----------------- | :------------- |
| image              | Image          |
| table              | Table          |
| text               | Text           |
| inline_equation    | Inline formula |
| interline_equation | Block formula  |

**Summary**
The block structure is as follows:

First-level block (if any) -> Second-level block -> Line -> Span
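Walking that hierarchy is enough to pull plain text out of `para_blocks`. The field names below follow the tables above; the traversal itself is an illustrative sketch, not an official API:

```python
def extract_text(para_blocks: list[dict]) -> list[str]:
    """Collect the content of text-like spans by walking
    block -> (nested second-level) blocks -> lines -> spans."""
    texts: list[str] = []

    def walk(block: dict) -> None:
        # First-level table/image blocks nest second-level blocks under "blocks"
        for child in block.get("blocks", []):
            walk(child)
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                # Text-like spans carry "content"; image/table spans carry "img_path"
                if "content" in span:
                    texts.append(span["content"])

    for block in para_blocks:
        walk(block)
    return texts
```

Applied to a `para_blocks` array with one text block and one image block with a caption, this returns the body text followed by the caption text.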
#### example

```json
...
```
## Overview

After running the `magic-pdf` command, in addition to the markdown-related output, several files unrelated to markdown are also generated. They are introduced one by one below.
### some_pdf_layout.pdf

Each page's layout consists of one or more boxes. The number at the top-left corner of each box is its sequence number. In addition, different content blocks are marked with different background colors in `some_pdf_layout.pdf`.

![layout page example](images/layout_example.png)
### some_pdf_spans.pdf

All spans on the page are drawn with line frames in different colors according to span type. This file can be used for quality control: issues such as missing text or unrecognized inline formulas can be spotted quickly.

![spans page example](images/spans_example.png)
### some_pdf_model.json

#### Structure Definition
```python ```python
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
from enum import IntEnum from enum import IntEnum
...@@ -33,13 +32,13 @@ class CategoryType(IntEnum): ...@@ -33,13 +32,13 @@ class CategoryType(IntEnum):
    table_caption = 6    # table caption
    table_footnote = 7   # table footnote
    isolate_formula = 8  # interline formula
    formula_caption = 9  # interline formula label
    embedding = 13       # inline formula
    isolated = 14        # interline formula
    text = 15            # OCR recognition result


class PageInfo(BaseModel):
    page_no: int = Field(description="Page number; the first page is 0", ge=0)
    height: int = Field(description="Page height", gt=0)


class ObjectInferenceResult(BaseModel):
    # ... (leading fields elided in this excerpt)
    score: float = Field(description="Confidence of the inference result")
    latex: str | None = Field(description="LaTeX parsing result", default=None)
    html: str | None = Field(description="HTML parsing result", default=None)


class PageInferenceResults(BaseModel):
    layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
    page_info: PageInfo = Field(description="Page meta information")


# The inference results of all pages, placed in a list in page order,
# form the minerU inference result
inference_result: list[PageInferenceResults] = []
```
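To make the shape above concrete, here is a minimal sketch (not part of magic-pdf; the sample detection values are invented for illustration) that walks a model.json-style structure and groups high-confidence detections by `category_id`:

```python
from collections import defaultdict

# One entry per page, shaped like the structure definition above
# (hypothetical sample data).
inference_result = [
    {
        "layout_dets": [
            {"category_id": 15, "poly": [10, 20, 110, 20, 110, 70, 10, 70],
             "score": 0.98, "latex": None, "html": None},
            {"category_id": 8, "poly": [10, 90, 110, 90, 110, 140, 10, 140],
             "score": 0.41, "latex": None, "html": None},
        ],
        "page_info": {"page_no": 0, "height": 842, "width": 595},
    },
]

by_category = defaultdict(list)
for page in inference_result:
    for det in page["layout_dets"]:
        if det["score"] >= 0.5:  # drop low-confidence detections
            by_category[det["category_id"]].append(det)

print({k: len(v) for k, v in by_category.items()})  # {15: 1}
```

The 0.5 cut-off here is arbitrary; the appropriate threshold depends on the model and use case.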
The poly coordinate format is \[x0, y0, x1, y1, x2, y2, x3, y3\], giving the coordinates of the top-left, top-right, bottom-right, and bottom-left corners respectively.

![poly coordinate diagram](images/poly.png)
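As a small illustration of this corner ordering, a helper (a sketch, not part of magic-pdf) can recover an axis-aligned bbox from a poly:

```python
def poly_to_bbox(poly: list[float]) -> list[float]:
    """Collapse an 8-value poly (top-left, top-right, bottom-right,
    bottom-left corner coordinates) into an axis-aligned [x0, y0, x1, y1]."""
    xs = poly[0::2]  # the four x coordinates
    ys = poly[1::2]  # the four y coordinates
    return [min(xs), min(ys), max(xs), max(ys)]


print(poly_to_bbox([10, 20, 110, 20, 110, 70, 10, 70]))  # [10, 20, 110, 70]
```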
#### Example data

```json
......
]
```
### some_pdf_middle.json

| Field          | Description                                                                                  |
| :------------- | :------------------------------------------------------------------------------------------- |
| pdf_info       | list, each element is a dict holding the parsing result of one pdf page; see the table below  |
| \_parse_type   | ocr \| txt, identifies the mode used to produce this intermediate result                      |
| \_version_name | string, the magic-pdf version used for this parse                                             |
<br>

**pdf_info**

Field descriptions:

| Field               | Description                                                                                                           |
| :------------------ | :-------------------------------------------------------------------------------------------------------------------- |
| preproc_blocks      | Intermediate result after pdf preprocessing, not yet segmented into paragraphs                                         |
| layout_bboxes       | Layout segmentation results, containing the layout direction (vertical, horizontal) and bbox, sorted in reading order  |
| page_idx            | Page index, starting from 0                                                                                            |
| page_size           | Width and height of the page                                                                                           |
| \_layout_tree       | Layout tree structure                                                                                                  |
| images              | list, each element is a dict representing an img_block                                                                 |
| tables              | list, each element is a dict representing a table_block                                                                |
| interline_equations | list, each element is a dict representing an interline_equation_block                                                  |
| discarded_blocks    | list, block information returned by the model that needs to be dropped                                                 |
| para_blocks         | The result of segmenting preproc_blocks into paragraphs                                                                |

In the table above, `para_blocks` is an array of dicts; each dict is a block structure. A block supports at most one level of nesting.
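This one-level nesting can be flattened with a short sketch (hypothetical data, shaped like the block structure described here): descend into second-level blocks when a first-level block is present, then walk lines down to spans.

```python
def iter_spans(block):
    """Yield all spans of a block, handling the one allowed nesting level."""
    if "blocks" in block:  # first-level block (table/image): recurse
        for sub in block["blocks"]:
            yield from iter_spans(sub)
    else:  # second-level block: lines -> spans
        for line in block.get("lines", []):
            yield from line.get("spans", [])


# Invented minimal page, shaped like one entry of pdf_info
page = {
    "para_blocks": [
        {"type": "text", "bbox": [0, 0, 100, 20],
         "lines": [{"bbox": [0, 0, 100, 20],
                    "spans": [{"type": "text", "content": "hello"}]}]},
    ]
}

spans = [s for b in page["para_blocks"] for s in iter_spans(b)]
print([s["content"] for s in spans])  # ['hello']
```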
The outer block is called a first-level block. The fields of a first-level block include:

| Field  | Description                                             |
| :----- | :------------------------------------------------------ |
| type   | Block type (table\|image)                               |
| bbox   | Bounding-box coordinates of the block                   |
| blocks | list, each element is a second-level block in dict form |

<br>

There are only two first-level block types, "table" and "image"; all other blocks are second-level blocks.
The fields of a second-level block include:

| Field | Description                                                                    |
| :---- | :------------------------------------------------------------------------------ |
| type  | Block type                                                                      |
| bbox  | Bounding-box coordinates of the block                                           |
| lines | list, each element is a line in dict form, describing the content of one line   |

Second-level block types in detail:

| type               | desc                   |
| :----------------- | :--------------------- |
| image_body         | The image itself       |
| image_caption      | The image caption text |
| table_body         | The table itself       |
| table_caption      | The table caption text |
| table_footnote     | The table footnote     |
| text               | Text block             |
| title              | Title block            |
| interline_equation | Block formula          |
<br>
The fields of a line are as follows:

| Field | Description                                                                        |
| :---- | :---------------------------------------------------------------------------------- |
| bbox  | Bounding-box coordinates of the line                                                |
| spans | list, each element is a span in dict form, describing the smallest composition unit |

<br>
**span**

| Field               | Description                                                                                                |
| :------------------ | :---------------------------------------------------------------------------------------------------------- |
| bbox                | Bounding-box coordinates of the span                                                                        |
| type                | Type of the span                                                                                            |
| content \| img_path | Text spans use content; image and table spans use img_path to store the actual text or the screenshot path  |

The span types are as follows:

| type               | desc           |
| :----------------- | :------------- |
| image              | Image          |
| table              | Table          |
| text               | Text           |
| inline_equation    | Inline formula |
| interline_equation | Block formula  |

**Summary**

A span is the smallest storage unit of all elements.
The elements stored in para_blocks are block information.
First-level block (if any) -> Second-level block -> Line -> Span

#### Example data

```json
"_parse_type": "txt", "_parse_type": "txt",
"_version_name": "0.6.1" "_version_name": "0.6.1"
} }
``` ```
import os
from pathlib import Path

from loguru import logger

from magic_pdf.integrations.rag.type import (ElementRelation, LayoutElements,
                                             Node)
from magic_pdf.integrations.rag.utils import inference


class RagPageReader:

    def __init__(self, pagedata: LayoutElements):
        self.o = [
            Node(
                category_type=v.category_type,
                text=v.text,
                image_path=v.image_path,
                anno_id=v.anno_id,
                latex=v.latex,
                html=v.html,
            ) for v in pagedata.layout_dets
        ]
        self.pagedata = pagedata

    def __iter__(self):
        return iter(self.o)

    def get_rel_map(self) -> list[ElementRelation]:
        return self.pagedata.extra.element_relation


class RagDocumentReader:

    def __init__(self, ragdata: list[LayoutElements]):
        self.o = [RagPageReader(v) for v in ragdata]

    def __iter__(self):
        return iter(self.o)


class DataReader:

    def __init__(self, path_or_directory: str, method: str, output_dir: str):
        self.path_or_directory = path_or_directory
        self.method = method
        self.output_dir = output_dir
        self.pdfs = []
        if os.path.isdir(path_or_directory):
            for doc_path in Path(path_or_directory).glob('*.pdf'):
                self.pdfs.append(doc_path)
        else:
            assert path_or_directory.endswith('.pdf')
            self.pdfs.append(Path(path_or_directory))

    def get_documents_count(self) -> int:
        """Returns the number of documents in the directory."""
        return len(self.pdfs)

    def get_document_result(self, idx: int) -> RagDocumentReader | None:
        """
        Args:
            idx (int): the index of documents under the
                directory path_or_directory

        Returns:
            RagDocumentReader | None: RagDocumentReader is an iterable object,
                more details @RagDocumentReader
        """
        if idx >= self.get_documents_count() or idx < 0:
            logger.error(f'invalid idx: {idx}')
            return None
        res = inference(str(self.pdfs[idx]), self.output_dir, self.method)
        if res is None:
            logger.warning(f'failed to inference pdf {self.pdfs[idx]}')
            return None
        return RagDocumentReader(res)

    def get_document_filename(self, idx: int) -> Path:
        """Get the filename of the document."""
        return self.pdfs[idx]
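The PDF-collection logic in `DataReader.__init__` (directory glob vs. single file) can be exercised on its own; the sketch below mirrors that logic in a standalone function so it runs without the magic_pdf package:

```python
import os
import tempfile
from pathlib import Path


def collect_pdfs(path_or_directory: str) -> list[Path]:
    """Mirror of DataReader.__init__'s PDF discovery (a sketch, not the
    library API): glob a directory for *.pdf, or accept a single .pdf path."""
    if os.path.isdir(path_or_directory):
        return sorted(Path(path_or_directory).glob('*.pdf'))
    assert path_or_directory.endswith('.pdf')
    return [Path(path_or_directory)]


with tempfile.TemporaryDirectory() as d:
    (Path(d) / 'a.pdf').touch()
    (Path(d) / 'b.txt').touch()  # non-PDF files are ignored
    names = [p.name for p in collect_pdfs(d)]

print(names)  # ['a.pdf']
```

Note that, like the original, this accepts a single file only if its name ends in `.pdf`.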
from enum import Enum

from pydantic import BaseModel, Field


# rag
class CategoryType(Enum):  # py310 does not support StrEnum
    text = 'text'
    title = 'title'
    interline_equation = 'interline_equation'
    image = 'image'
    image_body = 'image_body'
    image_caption = 'image_caption'
    table = 'table'
    table_body = 'table_body'
    table_caption = 'table_caption'
    table_footnote = 'table_footnote'


class ElementRelType(Enum):
    sibling = 'sibling'


class PageInfo(BaseModel):
    page_no: int = Field(description='the index of page, start from zero',
                         ge=0)
    height: int = Field(description='the height of page', gt=0)
    width: int = Field(description='the width of page', ge=0)
    image_path: str | None = Field(description='the image of this page',
                                   default=None)


class ContentObject(BaseModel):
    category_type: CategoryType = Field(description='category')
    poly: list[float] = Field(
        description=('Coordinates, need to convert back to PDF coordinates,'
                     ' order is top-left, top-right, bottom-right, bottom-left'
                     ' x,y coordinates'))
    ignore: bool = Field(description='whether to ignore this object',
                         default=False)
    text: str | None = Field(description='text content of the object',
                             default=None)
    image_path: str | None = Field(description='path of embedded image',
                                   default=None)
    order: int = Field(description='the order of this object within a page',
                       default=-1)
    anno_id: int = Field(description='unique id', default=-1)
    latex: str | None = Field(description='latex result', default=None)
    html: str | None = Field(description='html result', default=None)


class ElementRelation(BaseModel):
    source_anno_id: int = Field(description='unique id of the source object',
                                default=-1)
    target_anno_id: int = Field(description='unique id of the target object',
                                default=-1)
    relation: ElementRelType = Field(
        description='the relation between source and target element')


class LayoutElementsExtra(BaseModel):
    element_relation: list[ElementRelation] = Field(
        description='the relation between source and target element')


class LayoutElements(BaseModel):
    layout_dets: list[ContentObject] = Field(
        description='layout element details')
    page_info: PageInfo = Field(description='page info')
    extra: LayoutElementsExtra = Field(description='extra information')


# iter data format
class Node(BaseModel):
    category_type: CategoryType = Field(description='category')
    text: str | None = Field(description='text content of the object',
                             default=None)
    image_path: str | None = Field(description='path of embedded image',
                                   default=None)
    anno_id: int = Field(description='unique id', default=-1)
    latex: str | None = Field(description='latex result', default=None)
    html: str | None = Field(description='html result', default=None)
import json
import os
from pathlib import Path

from loguru import logger

import magic_pdf.model as model_config
from magic_pdf.dict2md.ocr_mkcontent import merge_para_with_text
from magic_pdf.integrations.rag.type import (CategoryType, ContentObject,
                                             ElementRelation, ElementRelType,
                                             LayoutElements,
                                             LayoutElementsExtra, PageInfo)
from magic_pdf.libs.ocr_content_type import BlockType, ContentType
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, prepare_env
def convert_middle_json_to_layout_elements(
    json_data: dict,
    output_dir: str,
) -> list[LayoutElements]:
    uniq_anno_id = 0
    res: list[LayoutElements] = []
    for page_no, page_data in enumerate(json_data['pdf_info']):
        order_id = 0
        page_info = PageInfo(
            height=int(page_data['page_size'][1]),
            width=int(page_data['page_size'][0]),
            page_no=page_no,
        )
        layout_dets: list[ContentObject] = []
        extra_element_relation: list[ElementRelation] = []

        for para_block in page_data['para_blocks']:
            para_text = ''
            para_type = para_block['type']
            if para_type == BlockType.Text:
                para_text = merge_para_with_text(para_block)
                x0, y0, x1, y1 = para_block['bbox']
                content = ContentObject(
                    anno_id=uniq_anno_id,
                    category_type=CategoryType.text,
                    text=para_text,
                    order=order_id,
                    poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                )
                uniq_anno_id += 1
                order_id += 1
                layout_dets.append(content)
            elif para_type == BlockType.Title:
                para_text = merge_para_with_text(para_block)
                x0, y0, x1, y1 = para_block['bbox']
                content = ContentObject(
                    anno_id=uniq_anno_id,
                    category_type=CategoryType.title,
                    text=para_text,
                    order=order_id,
                    poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                )
                uniq_anno_id += 1
                order_id += 1
                layout_dets.append(content)
            elif para_type == BlockType.InterlineEquation:
                para_text = merge_para_with_text(para_block)
                x0, y0, x1, y1 = para_block['bbox']
                content = ContentObject(
                    anno_id=uniq_anno_id,
                    category_type=CategoryType.interline_equation,
                    text=para_text,
                    order=order_id,
                    poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                )
                uniq_anno_id += 1
                order_id += 1
                layout_dets.append(content)
            elif para_type == BlockType.Image:
                body_anno_id = -1
                caption_anno_id = -1
                for block in para_block['blocks']:
                    if block['type'] == BlockType.ImageBody:
                        for line in block['lines']:
                            for span in line['spans']:
                                if span['type'] == ContentType.Image:
                                    x0, y0, x1, y1 = block['bbox']
                                    content = ContentObject(
                                        anno_id=uniq_anno_id,
                                        category_type=CategoryType.image_body,
                                        image_path=os.path.join(
                                            output_dir, span['image_path']),
                                        order=order_id,
                                        poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                                    )
                                    body_anno_id = uniq_anno_id
                                    uniq_anno_id += 1
                                    order_id += 1
                                    layout_dets.append(content)
                for block in para_block['blocks']:
                    if block['type'] == BlockType.ImageCaption:
                        para_text += merge_para_with_text(block)
                        x0, y0, x1, y1 = block['bbox']
                        content = ContentObject(
                            anno_id=uniq_anno_id,
                            category_type=CategoryType.image_caption,
                            text=para_text,
                            order=order_id,
                            poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                        )
                        caption_anno_id = uniq_anno_id
                        uniq_anno_id += 1
                        order_id += 1
                        layout_dets.append(content)
                # anno ids start at 0, so compare against the -1 sentinel
                # (as in the table branch below) rather than `> 0`
                if body_anno_id != -1 and caption_anno_id != -1:
                    element_relation = ElementRelation(
                        relation=ElementRelType.sibling,
                        source_anno_id=body_anno_id,
                        target_anno_id=caption_anno_id,
                    )
                    extra_element_relation.append(element_relation)
            elif para_type == BlockType.Table:
                body_anno_id, caption_anno_id, footnote_anno_id = -1, -1, -1
                for block in para_block['blocks']:
                    if block['type'] == BlockType.TableCaption:
                        para_text += merge_para_with_text(block)
                        x0, y0, x1, y1 = block['bbox']
                        content = ContentObject(
                            anno_id=uniq_anno_id,
                            category_type=CategoryType.table_caption,
                            text=para_text,
                            order=order_id,
                            poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                        )
                        caption_anno_id = uniq_anno_id
                        uniq_anno_id += 1
                        order_id += 1
                        layout_dets.append(content)
                for block in para_block['blocks']:
                    if block['type'] == BlockType.TableBody:
                        for line in block['lines']:
                            for span in line['spans']:
                                if span['type'] == ContentType.Table:
                                    x0, y0, x1, y1 = para_block['bbox']
                                    content = ContentObject(
                                        anno_id=uniq_anno_id,
                                        category_type=CategoryType.table_body,
                                        order=order_id,
                                        poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                                    )
                                    body_anno_id = uniq_anno_id
                                    uniq_anno_id += 1
                                    order_id += 1
                                    # if processed by the table model, keep
                                    # the latex; otherwise fall back to the
                                    # table screenshot
                                    if span.get('latex', ''):
                                        content.latex = span['latex']
                                    else:
                                        content.image_path = os.path.join(
                                            output_dir, span['image_path'])
                                    layout_dets.append(content)
                for block in para_block['blocks']:
                    if block['type'] == BlockType.TableFootnote:
                        para_text += merge_para_with_text(block)
                        x0, y0, x1, y1 = block['bbox']
                        content = ContentObject(
                            anno_id=uniq_anno_id,
                            category_type=CategoryType.table_footnote,
                            text=para_text,
                            order=order_id,
                            poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                        )
                        footnote_anno_id = uniq_anno_id
                        uniq_anno_id += 1
                        order_id += 1
                        layout_dets.append(content)
                if caption_anno_id != -1 and body_anno_id != -1:
                    element_relation = ElementRelation(
                        relation=ElementRelType.sibling,
                        source_anno_id=body_anno_id,
                        target_anno_id=caption_anno_id,
                    )
                    extra_element_relation.append(element_relation)
                if footnote_anno_id != -1 and body_anno_id != -1:
                    element_relation = ElementRelation(
                        relation=ElementRelType.sibling,
                        source_anno_id=body_anno_id,
                        target_anno_id=footnote_anno_id,
                    )
                    extra_element_relation.append(element_relation)

        res.append(
            LayoutElements(
                page_info=page_info,
                layout_dets=layout_dets,
                extra=LayoutElementsExtra(
                    element_relation=extra_element_relation),
            ))
    return res
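The conversion above repeatedly expands a `[x0, y0, x1, y1]` bbox into the 8-value poly order (top-left, top-right, bottom-right, bottom-left); the pattern can be factored into a small helper, sketched here:

```python
def bbox_to_poly(bbox: list[float]) -> list[float]:
    """Expand an axis-aligned [x0, y0, x1, y1] bbox into the 8-value poly
    order: top-left, top-right, bottom-right, bottom-left."""
    x0, y0, x1, y1 = bbox
    return [x0, y0, x1, y0, x1, y1, x0, y1]


print(bbox_to_poly([10, 20, 110, 70]))
# [10, 20, 110, 20, 110, 70, 10, 70]
```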
def inference(path, output_dir, method):
    model_config.__use_inside_model__ = True
    model_config.__model_mode__ = 'full'
    if output_dir == '':
        if os.path.isdir(path):
            output_dir = os.path.join(path, 'output')
        else:
            output_dir = os.path.join(os.path.dirname(path), 'output')

    local_image_dir, local_md_dir = prepare_env(output_dir,
                                                str(Path(path).stem), method)

    def read_fn(path):
        disk_rw = DiskReaderWriter(os.path.dirname(path))
        return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)

    def parse_doc(doc_path: str):
        try:
            file_name = str(Path(doc_path).stem)
            pdf_data = read_fn(doc_path)
            do_parse(
                output_dir,
                file_name,
                pdf_data,
                [],
                method,
                False,
                f_draw_span_bbox=False,
                f_draw_layout_bbox=False,
                f_dump_md=False,
                f_dump_middle_json=True,
                f_dump_model_json=False,
                f_dump_orig_pdf=False,
                f_dump_content_list=False,
                f_draw_model_bbox=False,
            )
            middle_json_fn = os.path.join(local_md_dir,
                                          f'{file_name}_middle.json')
            with open(middle_json_fn) as fd:
                jso = json.load(fd)
            os.remove(middle_json_fn)
            return convert_middle_json_to_layout_elements(jso, local_image_dir)
        except Exception as e:
            logger.exception(e)
            return None  # explicit: parsing failed, callers check for None

    return parse_doc(path)
if __name__ == '__main__':
    import pprint

    base_dir = '/opt/data/pdf/resources/samples/'
    if 0:
        with open(base_dir + 'json_outputs/middle.json') as f:
            d = json.load(f)
        result = convert_middle_json_to_layout_elements(d, '/tmp')
        pprint.pp(result)
    if 0:
        with open(base_dir + 'json_outputs/middle.3.json') as f:
            d = json.load(f)
        result = convert_middle_json_to_layout_elements(d, '/tmp')
        pprint.pp(result)
    if 1:
        res = inference(
            base_dir + 'samples/pdf/one_page_with_table_image.pdf',
            '/tmp/output',
            'ocr',
        )
        pprint.pp(res)