Unverified Commit 9f352df0 authored by drunkpig, committed by GitHub

Release 0.8.0 (#586)



* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* fix(ocr_mkcontent): improve language detection and content formatting (#458)

Optimize the language detection logic to improve content formatting. This change
addresses issues with long-word segmentation: language detection now applies a
threshold to the proportion of English characters in a text. The content formatting
rules now consult a list of languages (initially Chinese, Japanese, and Korean) in
which no space is inserted between adjacent inline-equation and text spans,
improving the handling of Asian languages.

The impact of these changes includes more accurate language detection, better
segmentation of long words, and more appropriate spacing in content formatting
across multiple languages.
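
A minimal sketch of the threshold logic described above; the function names, the 0.5 cutoff, and the language codes are illustrative assumptions, not the actual MinerU code:

```python
NO_SPACE_LANGS = {"zh", "ja", "ko"}  # no space between adjacent spans

def detect_lang(text: str, en_threshold: float = 0.5) -> str:
    """Classify as English when the share of ASCII letters exceeds the threshold."""
    if not text:
        return "empty"
    en_chars = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "en" if en_chars / len(text) > en_threshold else "unknown"

def join_spans(spans: list[str], lang: str) -> str:
    sep = "" if lang in NO_SPACE_LANGS else " "
    return sep.join(spans)

print(join_spans(["你好", "世界"], "zh"))    # 你好世界
print(join_spans(["hello", "world"], "en"))  # hello world
```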

* fix(self_modify): merge detection boxes for optimized text region detection (#448)

Merge adjacent and overlapping detection boxes to optimize text-region detection in
the document. Post-processing of text boxes is enhanced by consolidating them into
larger text lines, taking their vertical and horizontal alignment into account. This
reduces fragmentation and improves the readability of detected text blocks.
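
A rough sketch of the consolidation step; the alignment tolerances and names are assumptions for illustration:

```python
def boxes_mergeable(a, b, y_tol=5, x_gap=10):
    """Boxes are (x0, y0, x1, y1). Two boxes belong to the same text line when
    their top and bottom edges roughly align and they overlap, or sit within
    x_gap pixels of each other, horizontally."""
    same_line = abs(a[1] - b[1]) <= y_tol and abs(a[3] - b[3]) <= y_tol
    close_x = a[0] <= b[2] + x_gap and b[0] <= a[2] + x_gap
    return same_line and close_x

def merge_boxes(a, b):
    """Union of two boxes: the smallest box covering both."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

assert boxes_mergeable((0, 0, 10, 12), (12, 1, 30, 11))
assert merge_boxes((0, 0, 10, 12), (12, 1, 30, 11)) == (0, 0, 30, 12)
```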

* fix(pdf-extract): adjust box threshold for OCR detection (#447)

Tuned the detection-box threshold parameter in the OCR model initialization to improve the
accuracy of text extraction from images. The threshold was lowered from 0.6 to 0.3 so
that smaller, lower-confidence detection boxes are no longer discarded prematurely, which
is expected to enhance the quality of the extracted text in the OCR process.
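
MinerU's OCR stack builds on PaddleOCR, so for reference this is roughly how such a threshold is passed at initialization; the keyword below is PaddleOCR's, and treating it as the parameter MinerU tunes is an assumption:

```python
from paddleocr import PaddleOCR  # assumes the paddleocr package is installed

# In PaddleOCR's DB detector, det_db_box_thresh is the minimum mean score a
# candidate box needs in order to be kept; 0.3 mirrors the tuned value above.
ocr = PaddleOCR(det_db_box_thresh=0.3, lang="ch")
result = ocr.ocr("page.png")  # detection + recognition on a page image
```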

* feat: rename the file generated by command line tools (#401)

* feat: rename the file generated by command line tools

* feat: add pdf filename as prefix to {span,layout,model}.pdf

---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* fix(ocr_mkcontent): revise table caption output (#397)

* fix(ocr_mkcontent): revise table caption output

- Ensure that table captions are properly included in the output.
- Remove the redundant `table_caption` variable.

* Update cla.yml

* Update bug_report.yml

* feat(cli): add debug option for detailed error handling

Enable users to invoke the CLI command with a new debug flag to get detailed debugging information.
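
A hypothetical shape for such a flag using click (the style of CLI shown later in this release); the option name and error handling here are illustrative, not the exact MinerU definition:

```python
import click

def run_pipeline():
    """Stand-in for the real parsing entry point."""
    raise RuntimeError("simulated failure")

@click.command()
@click.option("--debug", "debug_mode", is_flag=True, default=False,
              help="show the full traceback on failure")
def cli(debug_mode: bool):
    try:
        run_pipeline()
    except Exception:
        if debug_mode:
            raise  # full traceback for debugging
        click.echo("processing failed; rerun with --debug for details")

if __name__ == "__main__":
    cli()
```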

* fix(pdf-extract-kit): adjust crop_paste parameters for better accuracy

The crop_paste_x and crop_paste_y values in pdf_extract_kit.py have been modified
to improve the accuracy and consistency of OCR processing. The new values are set to 25
to ensure more precise image cropping and pasting, which leads to better OCR recognition
results.

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* fix(pdf-extract-kit): increase crop_paste margin for OCR processing

Double the crop_paste margin from 25 to 50 to ensure better OCR accuracy and
handling of border cases. This change helps improve the overall quality of
OCR'd text by providing more context around the detected text areas.

* fix(common): deep copy model list before drawing model bbox

Use a deep copy of the original model list in `drow_model_bbox` to avoid potential
modifications to the source data. This ensures the integrity of the original models
is maintained while generating the model bounding boxes visualization.
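
The guard described above amounts to one line; a sketch with an assumed signature (the `drow_model_bbox` name is the function's actual, if misspelled, name):

```python
import copy

def drow_model_bbox(model_list, pdf_bytes, out_path):  # signature assumed
    # Draw on a deep copy so coordinate munging done for visualization
    # never leaks back into the caller's model data.
    model_list = copy.deepcopy(model_list)
    ...  # render bounding boxes from model_list onto the PDF pages
```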

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* build(docker): update docker build step (#471)

* build(docker): update base image to Ubuntu 22.04 and install PaddlePaddle

Upgrade the Docker base image from ubuntu:latest to ubuntu:22.04 for improved
performance and stability.

Additionally, integrate PaddlePaddle GPU version 3.0.0b1
into the Docker build for enhanced AI capabilities. The MinerU configuration file has
also been updated to the latest version.

* build(dockerfile): Updated the Dockerfile

* build(Dockerfile): update Dockerfile

* docs(docker): add instructions for quick deployment with Docker

Include Docker-based deployment instructions in the README for both the English and
Chinese locales. This update gives users a quick-start guide to deploying with Docker,
with notes on GPU VRAM requirements and the acceleration features enabled by default.

* build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself.


* upload an introduction about chemical formula and update readme.md (#489)

* upload an introduction about chemical formula

* rename 2 files

* update readme.md at TODO in chemistry

* rename 2 files and update readme.md at TODO in chemistry

* update README_zh-CN.md at TODO in chemistry

* fix: remove the default value of output option in tools/cli.py and tools/cli_dev.py (#494)
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* feat: add test case (#499)
Co-authored-by: quyuan <quyuan@pjlab.org>

* Update cla.yml

* Update gpu-ci.yml

* Update cli.yml

* Delete .github/workflows/gpu-ci.yml

* fix(pdf-parse-union-core): #492 decrease span threshold for block filling (#500)

Reduce the span threshold used in fill_spans_in_blocks from 0.6 to 0.3 to
improve the accuracy of block filling based on layout analysis.
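
A sketch of ratio-based span assignment under assumed names; the 0.3 default mirrors the new threshold:

```python
def overlap_ratio(span_bbox, block_bbox):
    """Intersection area divided by the span's own area."""
    x0 = max(span_bbox[0], block_bbox[0])
    y0 = max(span_bbox[1], block_bbox[1])
    x1 = min(span_bbox[2], block_bbox[2])
    y1 = min(span_bbox[3], block_bbox[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = (span_bbox[2] - span_bbox[0]) * (span_bbox[3] - span_bbox[1])
    return inter / area if area else 0.0

def fill_spans_in_blocks(blocks, spans, threshold=0.3):
    # A span joins a block once enough of it lies inside the block's bbox;
    # the lower threshold tolerates imperfect layout boxes.
    for block in blocks:
        block["spans"] = [s for s in spans
                          if overlap_ratio(s["bbox"], block["bbox"]) > threshold]
    return blocks
```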

* fix(detect_all_bboxes): remove small overlapping blocks by merging (#501)

Previously, small blocks that overlapped larger ones were simply removed. This fix
merges smaller blocks into the larger block instead, ensuring that no information is
lost and that the larger block fully encompasses the text content.
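
One way to express the merge-instead-of-drop behavior; the helper names and the 0.8 containment ratio are assumptions:

```python
def bbox_area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def contained_ratio(small, big):
    """Share of the small box's area that lies inside the big box."""
    x0, y0 = max(small[0], big[0]), max(small[1], big[1])
    x1, y1 = min(small[2], big[2]), min(small[3], big[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    return inter / bbox_area(small) if bbox_area(small) else 0.0

def absorb_small_blocks(blocks, ratio=0.8):
    """Largest blocks first; a small block mostly inside a kept block grows
    that block's bbox instead of being discarded."""
    kept = []
    for blk in sorted(blocks, key=lambda b: bbox_area(b["bbox"]), reverse=True):
        for big in kept:
            if contained_ratio(blk["bbox"], big["bbox"]) > ratio:
                bb, sb = big["bbox"], blk["bbox"]
                big["bbox"] = [min(bb[0], sb[0]), min(bb[1], sb[1]),
                               max(bb[2], sb[2]), max(bb[3], sb[3])]
                break
        else:
            kept.append(blk)
    return kept
```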

* feat(cli&analyze&pipeline): add start_page and end_page args for pagination (#507)

* feat(cli&analyze&pipeline): add start_page and end_page args for pagination

Add start_page_id and end_page_id arguments to the relevant components of the PDF
parsing pipeline to support pagination. This feature lets users specify the range of
pages to process, improving the efficiency and flexibility of the system.

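A sketch of how such a range can be threaded through the pipeline; the function name and defaults are assumptions, while the parameter names follow the commit message:

```python
def select_pages(pages, start_page_id=0, end_page_id=None):
    """Process only the requested page range; None means 'through the last page'."""
    # Test 'is None' rather than truthiness so that end_page_id=0 (stop after
    # the first page) is honored; see the #518 fix further down.
    end = len(pages) - 1 if end_page_id is None else min(end_page_id, len(pages) - 1)
    return pages[start_page_id:end + 1]

assert select_pages(list(range(10)), 2, 5) == [2, 3, 4, 5]
assert select_pages(list(range(10)), end_page_id=0) == [0]
```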

* Feat/support rag (#510)

* Create requirements-docker.txt

* feat: update deps to support rag

* feat: add support to rag, add rag_data_reader api for rag integration

* feat: let user retrieve the filename of the processed file

* feat: add projects demo for rag integrations

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* Update Dockerfile

* feat(gradio): add app by gradio (#512)

* fix: replace \u0002, \u0003 in common text (#521)

* fix replace \u0002, \u0003 in common text

* fix(para): When an English line ends with a hyphen, do not add a space at the end.

* fix(end_page_id): fix the issue where end_page_id was corrected to len-1 when its input was 0. (#518)

* fix(para): When an English line ends with a hyphen, do not add a space at the end. (#523)

* fix replace \u0002, \u0003 in common text

* fix(para): When an English line ends with a hyphen, do not add a space at the end.

* fix: delete hyphen at end of line
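
Both of these fixes are small string transformations; a combined sketch with assumed function names:

```python
def clean_control_chars(text: str) -> str:
    """Strip the \\u0002 / \\u0003 control characters that leak into
    extracted text from some PDFs."""
    return text.replace("\u0002", "").replace("\u0003", "")

def join_english_lines(prev: str, nxt: str) -> str:
    """A line ending in '-' continues a hyphenated word: join without a
    space and drop the hyphen; otherwise join with a single space."""
    if prev.endswith("-"):
        return prev[:-1] + nxt
    return prev + " " + nxt

assert join_english_lines("interest-", "ing") == "interesting"
assert join_english_lines("hello", "world") == "hello world"
```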

* Release: release 0.7.1 version, update dev (#527)

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Hotfix readme 0.7.1 (#529)

* release: release 0.7.1 version (#526)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

---------
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Update README_zh-CN.md

delete Known issue about table recognition

* Update Dockerfile

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#542)

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: typo error in markdown (#536)
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* fix(gradio): remove unused imports and simplify pdf display (#534)

Removed the gradio and gradio-pdf imports that were no longer used in the code. Also
replaced the custom `show_pdf` function with direct use of the `PDF` component for a
simpler and more integrated PDF upload-and-display solution, improving code
maintainability and readability.
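
A minimal sketch of the simplified wiring, using the `PDF` component from the gradio-pdf package; the callback and the event name used here are illustrative:

```python
import gradio as gr
from gradio_pdf import PDF  # pip install gradio-pdf

def convert(pdf_path: str) -> str:
    # Stand-in for the real MinerU parsing pipeline.
    return f"parsed: {pdf_path}"

with gr.Blocks() as demo:
    pdf = PDF(label="PDF")  # upload and inline preview in one component
    output = gr.Markdown()
    pdf.change(convert, inputs=pdf, outputs=output)

demo.launch()
```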

* Feat/support footnote in figure (#532)

* feat: support figure footnote

* feat: using the relative position to combine footnote, table, image

* feat: add the readme of projects

* fix: code spell in unittest

---------
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* refactor(pdf_extract_kit): implement singleton pattern for atomic models (#533)

Refactor the pdf_extract_kit module to utilize a singleton pattern when initializing
atomic models. This change ensures that atomic models are instantiated at most once,
optimizing memory usage and reducing redundant initialization steps. The AtomModelSingleton
class now manages the instantiation and retrieval of atomic models, improving the
overall structure and efficiency of the codebase.
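
The shape of that pattern, sketched; the class name follows the commit message, while the method name and cache key are assumptions:

```python
class AtomModelSingleton:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_atom_model(self, key, init_fn, **kwargs):
        # Each atomic model is built at most once; later calls with the
        # same key return the cached instance.
        if key not in self._models:
            self._models[key] = init_fn(**kwargs)
        return self._models[key]

mgr = AtomModelSingleton()
assert mgr is AtomModelSingleton()  # one manager process-wide
```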

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

add HF, ModelScope, and Colab URLs

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Rename README.md to README_zh-CN.md

* Create readme.md

* Rename readme.md to README.md

* Rename README.md to README_zh-CN.md

* Update README_zh-CN.md

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#573)

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* Update README_zh-CN.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* add rag data api

* Update README_zh-CN.md

update rag api image

* Update README.md

docs: remove RAG related release notes

* Update README_zh-CN.md

docs: remove RAG related release notes

* Update README_zh-CN.md

update the changelog

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: icecraft <tmortred@163.com>
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Siyu Hao <131659128+GDDGCZ518@users.noreply.github.com>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: quyuan <quyuan@pjlab.org>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>
parent b6633cd6
@@ -6,20 +6,24 @@ on:
push:
branches:
- "master"
- "dev"
paths-ignore:
- "cmds/**"
- "**.md"
- "**.yml"
pull_request:
branches:
- "master"
- "dev"
paths-ignore:
- "cmds/**"
- "**.md"
- "**.yml"
workflow_dispatch:
jobs:
cli-test:
runs-on: ubuntu-latest
timeout-minutes: 40
runs-on: pdf
timeout-minutes: 120
strategy:
fail-fast: true
@@ -28,27 +32,22 @@ jobs:
uses: actions/checkout@v3
with:
fetch-depth: 2
- name: check-requirements
run: |
pip install -r requirements.txt
pip install -r requirements-qa.txt
pip install magic-pdf
- name: test_cli
- name: install
run: |
cp magic-pdf.template.json ~/magic-pdf.json
echo $GITHUB_WORKSPACE
cd $GITHUB_WORKSPACE && export PYTHONPATH=. && pytest -s -v tests/test_unit.py
cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli.py
- name: benchmark
echo $GITHUB_WORKSPACE && sh tests/retry_env.sh
- name: unit test
run: |
cd $GITHUB_WORKSPACE && export PYTHONPATH=. && coverage run -m pytest tests/test_unit.py --cov=magic_pdf/ --cov-report term-missing --cov-report html
cd $GITHUB_WORKSPACE && python tests/get_coverage.py
- name: cli test
run: |
cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_bench.py
cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli_sdk.py
notify_to_feishu:
if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'master') }}
needs: [cli-test]
runs-on: ubuntu-latest
needs: cli-test
runs-on: pdf
steps:
- name: get_actor
run: |
@@ -67,9 +66,5 @@ jobs:
- name: notify
run: |
curl ${{ secrets.WEBHOOK_URL }} -H 'Content-Type: application/json' -d '{
"msgtype": "text",
"text": {
"mentioned_list": ["${{ env.METIONS }}"] , "content": "'${{ github.repository }}' GitHubAction Failed!\n 细节请查看:https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"
}
}'
\ No newline at end of file
echo ${{ secrets.USER_ID }}
curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"'${{ github.repository }}' GitHubAction Failed","content":[[{"tag":"text","text":""},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"},{"tag":"at","user_id":"'${{ secrets.USER_ID }}'"}]]}}}}' ${{ secrets.WEBHOOK_URL }}
@@ -30,10 +30,10 @@ tmp/
tmp
.vscode
.vscode/
/tests/
ocr_demo
/app/common/__init__.py
/magic_pdf/config/__init__.py
source.dev.env
tmp
@@ -3,6 +3,7 @@ repos:
rev: 5.0.4
hooks:
- id: flake8
args: ["--max-line-length=120", "--ignore=E131,E125,W503,W504,E203"]
- repo: https://github.com/PyCQA/isort
rev: 5.11.5
hooks:
@@ -11,6 +12,7 @@ repos:
rev: v0.32.0
hooks:
- id: yapf
args: ["--style={based_on_style: google, column_limit: 120, indent_width: 4}"]
- repo: https://github.com/codespell-project/codespell
rev: v2.2.1
hooks:
@@ -41,4 +43,4 @@ repos:
rev: v1.3.1
hooks:
- id: docformatter
args: ["--in-place", "--wrap-descriptions", "79"]
args: ["--in-place", "--wrap-descriptions", "119"]
# Use the official Ubuntu base image
FROM ubuntu:latest
FROM ubuntu:22.04
# Set environment variables to non-interactive to avoid prompts during installation
ENV DEBIAN_FRONTEND=noninteractive
@@ -29,17 +29,23 @@ RUN python3 -m venv /opt/mineru_venv
# Activate the virtual environment and install necessary Python packages
RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
pip install --upgrade pip && \
pip install magic-pdf[full-cpu] detectron2 --extra-index-url https://myhloli.github.io/wheels/"
# Copy the configuration file template and set up the model directory
COPY magic-pdf.template.json /root/magic-pdf.json
# Set the models directory in the configuration file (adjust the path as needed)
RUN sed -i 's|/tmp/models|/opt/models|g' /root/magic-pdf.json
# Create the models directory
RUN mkdir -p /opt/models
pip3 install --upgrade pip && \
wget https://gitee.com/myhloli/MinerU/raw/master/requirements-docker.txt && \
pip3 install -r requirements-docker.txt --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple && \
pip3 install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/"
# Copy the configuration file template and install magic-pdf latest
RUN /bin/bash -c "wget https://gitee.com/myhloli/MinerU/raw/master/magic-pdf.template.json && \
cp magic-pdf.template.json /root/magic-pdf.json && \
source /opt/mineru_venv/bin/activate && \
pip3 install -U magic-pdf"
# Download models and update the configuration file
RUN /bin/bash -c "pip3 install modelscope && \
wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models.py && \
python3 download_models.py && \
sed -i 's|/tmp/models|/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models|g' /root/magic-pdf.json && \
sed -i 's|cpu|cuda|g' /root/magic-pdf.json"
# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
@@ -5,6 +5,7 @@
</p>
<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
@@ -12,17 +13,26 @@
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Demo-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7Wg9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/ModelScope-Demo-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](#)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<!-- language -->
[English](README.md) | [简体中文](README_zh-CN.md)
<!-- hot link -->
<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
</p>
<!-- join us -->
<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>
@@ -30,12 +40,14 @@
</div>
# Changelog
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.
- 2024/08/30: Version 0.7.1 released, add paddle tablemaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release
<!-- TABLE OF CONTENT -->
<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
@@ -74,10 +86,10 @@
</ol>
</details>
# MinerU
## Project Introduction
MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format.
MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models.
Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an issue on [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.
@@ -101,6 +113,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a>. </br>
If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br>
There are three different ways to experience MinerU:
- [Online Demo (No Installation Required)](#online-demo)
- [Quick CPU Demo (Windows, Linux, Mac)](#quick-cpu-demo)
- [Linux/Windows + CUDA](#Using-GPU)
@@ -171,33 +184,41 @@ In non-mainline environments, due to the diversity of hardware and software conf
### Quick CPU Demo
#### 1. Install magic-pdf
```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
```
#### 2. Download model weight files
Refer to [How to Download Model Files](docs/how_to_download_models_en.md) for detailed instructions.
> ❗️After downloading the models, please make sure to verify the completeness of the model files.
>
>
> Check if the model file sizes match the description on the webpage. If possible, use sha256 to verify the integrity of the files.
#### 3. Copy and configure the template file
You can find the `magic-pdf.template.json` template configuration file in the root directory of the repository.
> ❗️Make sure to execute the following command to copy the configuration file to your **user directory**; otherwise, the program will not run.
>
>
> The user directory for Windows is `C:\Users\YourUsername`, for Linux it is `/home/YourUsername`, and for macOS it is `/Users/YourUsername`.
```bash
cp magic-pdf.template.json ~/magic-pdf.json
```
Find the `magic-pdf.json` file in your user directory and configure the "models-dir" path to point to the directory where the model weight files were downloaded in [Step 2](#2-download-model-weight-files).
> ❗️Make sure to correctly configure the **absolute path** to the model weight files directory, otherwise the program will not run because it can't find the model files.
>
> On Windows, this path should include the drive letter and all backslashes (`\`) in the path should be replaced with forward slashes (`/`) to avoid syntax errors in the JSON file due to escape sequences.
>
>
> For example: If the models are stored in the "models" directory at the root of the D drive, the "model-dir" value should be `D:/models`.
```json
{
// other config
@@ -210,13 +231,26 @@ Find the `magic-pdf.json` file in your user directory and configure the "models-
}
```
### Using GPU
If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please select the appropriate guide based on your system:
- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_en_US.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
- Quick Deployment with Docker
> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
>
> Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
>
> ```bash
> docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
> ```
```bash
wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
docker build -t mineru:latest .
docker run --rm -it --gpus=all mineru:latest /bin/bash
magic-pdf --help
```
## Usage
@@ -230,12 +264,12 @@ Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir TEXT output local directory
-m, --method [ocr|txt|auto] the method for parsing pdf.
ocr: using ocr technique to extract information from pdf,
txt: suitable for the text-based pdf only and outperform ocr,
auto: automatically choose the best method for parsing pdf
from ocr and txt.
without method specified, auto will be used by default.
--help Show this message and exit.
@@ -250,13 +284,13 @@ magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
The results will be saved in the `{some_output_dir}` directory. The output file list is as follows:
```text
├── some_pdf.md # markdown file
├── images # directory for storing images
├── layout.pdf # layout diagram
├── middle.json # MinerU intermediate processing result
├── model.json # model inference result
├── origin.pdf # original PDF file
└── spans.pdf # smallest granularity bbox position information diagram
├── some_pdf.md # markdown file
├── images # directory for storing images
├── some_pdf_layout.pdf # layout diagram
├── some_pdf_middle.json # MinerU intermediate processing result
├── some_pdf_model.json # model inference result
├── some_pdf_origin.pdf # original PDF file
└── some_pdf_spans.pdf # smallest granularity bbox position information diagram
```
For more information about the output files, please refer to the [Output File Description](docs/output_file_en_us.md).
@@ -264,6 +298,7 @@ For more information about the output files, please refer to the [Output File De
### API
Processing files from local disk
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
@@ -276,6 +311,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
Processing files from object storage
```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
@@ -290,10 +326,10 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
For detailed implementation, refer to:
- [demo.py Simplest Processing Method](demo/demo.py)
- [magic_pdf_parse_main.py More Detailed Processing Workflow](demo/magic_pdf_parse_main.py)
### Development Guide
TODO
@@ -305,10 +341,11 @@ TODO
- [ ] Code block recognition within the text
- [ ] Table of contents recognition
- [x] Table recognition
- [ ] Chemical formula recognition
- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] Geometric shape recognition
# Known Issues
- Reading order is segmented based on rules, which can cause disordered sequences in some cases
- Vertical text is not supported
- Lists, code blocks, and table of contents are not yet supported in the layout model
@@ -318,11 +355,11 @@ TODO
# FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md)
[FAQ in English](docs/FAQ_en_us.md)
# All Thanks To Our Contributors
<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
@@ -335,8 +372,8 @@ TODO
This project currently uses PyMuPDF to achieve advanced functionality. However, since it adheres to the AGPL license, it may impose restrictions on certain usage scenarios. In future iterations, we plan to explore and replace it with a more permissive PDF processing library to enhance user-friendliness and flexibility.
# Acknowledgments
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
@@ -373,9 +410,11 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
</a>
# Magic-doc
[Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool
# Magic-html
[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
# Links
@@ -4,8 +4,8 @@
<img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p>
<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
@@ -13,33 +13,41 @@
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Demo-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7Wg9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/ModelScope-Demo-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](#)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<!-- language -->
[English](README.md) | [简体中文](README_zh-CN.md)
<!-- hot link -->
<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
</p>
<!-- join us -->
<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>
</div>
# Changelog
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Hugging Face and ModelScope
- 2024/08/30: Version 0.7.1 released, integrating the paddle tablemaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release
<!-- TABLE OF CONTENT -->
<details open="open">
<summary><h2 style="display: inline-block">文档目录</h2></summary>
<ol>
......@@ -78,10 +86,10 @@
</ol>
</details>
# MinerU
## 项目简介
MinerU是一款将PDF转化为机器可读格式的工具(如markdown、json),可以很方便地抽取为任意格式。
MinerU诞生于[书生-浦语](https://github.com/InternLM/InternLM)的预训练过程中,我们将会集中精力解决科技文献中的符号转化问题,希望在大模型时代为科技发展做出贡献。
相比国内外知名商用产品MinerU还很年轻,如果遇到问题或者结果不及预期请到[issue](https://github.com/opendatalab/MinerU/issues)提交问题,同时**附上相关PDF**。
......@@ -100,17 +108,16 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
- 支持CPU和GPU环境
- 支持windows/linux/mac平台
## 快速开始
如果遇到任何安装问题,请先查询 <a href="#faq">FAQ</a> <br/>
如果遇到解析效果不及预期,参考 <a href="#known-issues">Known Issues</a><br/>
有3种不同方式可以体验MinerU的效果:
- [在线体验(无需任何安装)](#在线体验)
- [使用CPU快速体验(Windows,Linux,Mac)](#使用cpu快速体验)
- [Linux/Windows + CUDA](#使用gpu)
**⚠️安装前必看——软硬件环境支持说明**
为了确保项目的稳定性和可靠性,我们在开发过程中仅对特定的软硬件环境进行优化和测试。这样当用户在推荐的系统配置上部署和运行项目时,能够获得最佳的性能表现和最少的兼容性问题。
......@@ -174,38 +181,47 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
[在线体验点击这里](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
### 使用CPU快速体验
#### 1. 安装magic-pdf
最新版本国内镜像源同步可能会有延迟,请耐心等待
```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
```
#### 2. 下载模型权重文件
详细参考 [如何下载模型文件](docs/how_to_download_models_zh_cn.md)
> ❗️模型下载后请务必检查模型文件是否下载完整
>
> 请检查目录下的模型文件大小与网页上描述是否一致,如果可以的话,最好通过sha256校验模型是否下载完整
#### 3. 拷贝配置文件并进行配置
在仓库根目录可以获得 [magic-pdf.template.json](magic-pdf.template.json) 配置模版文件
> ❗️务必执行以下命令将配置文件拷贝到【用户目录】下,否则程序将无法运行
>
> windows的用户目录为 "C:\Users\用户名", linux用户目录为 "/home/用户名", macOS用户目录为 "/Users/用户名"
```bash
cp magic-pdf.template.json ~/magic-pdf.json
```
在用户目录中找到magic-pdf.json文件并配置"models-dir"为[2. 下载模型权重文件](#2-下载模型权重文件)中下载的模型权重文件所在目录
> ❗️务必正确配置模型权重文件所在目录的【绝对路径】,否则会因为找不到模型文件而导致程序无法运行
>
> windows系统中此路径应包含盘符,且需把路径中所有的`"\"`替换为`"/"`,否则会因为转义原因导致json文件语法错误。
>
> 例如:模型放在D盘根目录的models目录,则model-dir的值应为"D:/models"
```json
{
// other config
......@@ -218,13 +234,27 @@ cp magic-pdf.template.json ~/magic-pdf.json
}
```
### 使用GPU
如果您的设备支持CUDA,且满足主线环境中的显卡要求,则可以使用GPU加速,请根据自己的系统选择适合的教程:
- [Ubuntu22.04LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md)
- [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
- 使用Docker快速部署
> Docker 需设备GPU显存大于等于16GB,默认开启所有加速功能
>
> 运行本docker前可以通过以下命令检测自己的设备是否支持在docker上使用CUDA加速
>
> ```bash
> docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
> ```
```bash
wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
docker build -t mineru:latest .
docker run --rm -it --gpus=all mineru:latest /bin/bash
magic-pdf --help
```
## 使用
......@@ -238,12 +268,12 @@ Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir TEXT output local directory
-m, --method [ocr|txt|auto] the method for parsing pdf.
ocr: using ocr technique to extract information from pdf,
txt: suitable for the text-based pdf only and outperform ocr,
auto: automatically choose the best method for parsing pdf
from ocr and txt.
without method specified, auto will be used by default.
--help Show this message and exit.
......@@ -258,21 +288,21 @@ magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
运行完命令后输出的结果会保存在`{some_output_dir}`目录下, 输出的文件列表如下
```text
├── some_pdf.md # markdown 文件
├── images # 存放图片目录
├── some_pdf_layout.pdf # layout 绘图
├── some_pdf_middle.json # minerU 中间处理结果
├── some_pdf_model.json # 模型推理结果
├── some_pdf_origin.pdf # 原 pdf 文件
└── some_pdf_spans.pdf # 最小粒度的bbox位置信息绘图
```
更多有关输出文件的信息,请参考[输出文件说明](docs/output_file_zh_cn.md)
### API
处理本地磁盘上的文件
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
......@@ -285,6 +315,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
处理对象存储上的文件
```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
......@@ -298,11 +329,11 @@ pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
详细实现可参考
- [demo.py 最简单的处理方式](demo/demo.py)
- [magic_pdf_parse_main.py 能够更清晰看到处理流程](demo/magic_pdf_parse_main.py)
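下面给出一段端到端的最小示意,把上面的片段补全为可直接运行的脚本(假设条件:已按前文完成安装与模型配置;`UNIPipe` 的导入路径 `magic_pdf.pipe.UNIPipe` 以所装版本为准,若不一致请以 demo.py 中的导入为准;`demo.pdf` 与输出路径均为示意值):
```python
import os

from magic_pdf.pipe.UNIPipe import UNIPipe  # 导入路径为假设,以 demo.py 为准
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

local_image_dir = './output/images'  # 示意的图片输出目录
os.makedirs(local_image_dir, exist_ok=True)

with open('demo.pdf', 'rb') as f:  # 示意的输入文件
    pdf_bytes = f.read()

image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": []}

pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()  # 判定 pdf 类型(txt/ocr)
pipe.pipe_analyze()   # 模型推理
pipe.pipe_parse()     # 解析为中间结构
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")

with open('demo.md', 'w', encoding='utf-8') as f:
    f.write(md_content)
```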
### 二次开发
TODO
......@@ -314,11 +345,11 @@ TODO
- [ ] 正文中代码块识别
- [ ] 目录识别
- [x] 表格识别
- [ ] [化学式识别](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] 几何图形识别
# Known Issues
- 阅读顺序基于规则的分割,在一些情况下会乱序
- 不支持竖排文字
- 列表、代码块、目录在layout模型里还没有支持
......@@ -328,10 +359,11 @@ TODO
# FAQ
[常见问题](docs/FAQ_zh_cn.md)
[FAQ](docs/FAQ_en_us.md)
# All Thanks To Our Contributors
......@@ -346,6 +378,7 @@ TODO
本项目目前采用PyMuPDF以实现高级功能,但因其遵循AGPL协议,可能对某些使用场景构成限制。未来版本迭代中,我们计划探索并替换为许可条款更为宽松的PDF处理库,以提升用户友好度及灵活性。
# Acknowledgments
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
......@@ -382,9 +415,11 @@ TODO
</a>
# Magic-doc
[Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool
# Magic-html
[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
# Links
......
<div id="top">
<div align="center" xmlns="http://www.w3.org/1999/html">
<!-- logo -->
<p align="center">
<img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p>
</div>
<div align="center">
<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
......@@ -14,289 +15,382 @@
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
[English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)
<!-- language -->
</div>
<!-- hot link -->
<div align="center">
<p align="center">
<a href="https://github.com/opendatalab/MinerU">MinerU: 端到端的PDF解析工具(基于PDF-Extract-Kit)支持PDF转Markdown</a>🚀🚀🚀<br>
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: 高质量PDF解析工具箱</a>🔥🔥🔥
</p>
<!-- join us -->
<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>
</div>
# MinerU
## 简介
MinerU 是一款一站式、开源、高质量的数据提取工具,主要包含以下功能:
- [Magic-PDF](#Magic-PDF) PDF文档提取
- [Magic-Doc](#Magic-Doc) 网页与电子书提取
# Magic-PDF
</div>
## 简介
Magic-PDF 是一款将 PDF 转化为 markdown 格式的工具。支持转换本地文档或者位于支持S3协议对象存储上的文件。
主要功能包含
- 支持多种前端模型输入
- 删除页眉、页脚、脚注、页码等元素
- 符合人类阅读顺序的排版格式
- 保留原文档的结构和格式,包括标题、段落、列表等
- 提取图像和表格并在markdown中展示
- 将公式转换成latex
- 乱码PDF自动识别并转换
- 支持cpu和gpu环境
- 支持windows/linux/mac平台
# 更新记录
- 2024/08/09 0.7.0b1发布,简化安装步骤提升易用性,加入表格识别功能
- 2024/08/01 0.6.2b1发布,优化了依赖冲突问题和安装文档
- 2024/07/05 首次开源
<!-- TABLE OF CONTENT -->
<details open="open">
<summary><h2 style="display: inline-block">文档目录</h2></summary>
<ol>
<li>
<a href="#mineru">MinerU</a>
<ul>
<li><a href="#项目简介">项目简介</a></li>
<li><a href="#主要功能">主要功能</a></li>
<li><a href="#快速开始">快速开始</a>
<ul>
<li><a href="#在线体验">在线体验</a></li>
<li><a href="#使用CPU快速体验">使用CPU快速体验</a></li>
<li><a href="#使用GPU">使用GPU</a></li>
</ul>
</li>
<li><a href="#使用">使用方式</a>
<ul>
<li><a href="#命令行">命令行</a></li>
<li><a href="#api">API</a></li>
<li><a href="#二次开发">二次开发</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#todo">TODO</a></li>
<li><a href="#known-issues">Known Issues</a></li>
<li><a href="#faq">FAQ</a></li>
<li><a href="#all-thanks-to-our-contributors">Contributors</a></li>
<li><a href="#license-information">License Information</a></li>
<li><a href="#acknowledgments">Acknowledgements</a></li>
<li><a href="#citation">Citation</a></li>
<li><a href="#star-history">Star History</a></li>
<li><a href="#magic-doc">magic-doc快速提取PPT/DOC/PDF</a></li>
<li><a href="#magic-html">magic-html提取混合网页内容</a></li>
<li><a href="#links">Links</a></li>
</ol>
</details>
# MinerU
## 项目简介
MinerU是一款将PDF转化为机器可读格式的工具(如markdown、json),可以很方便地抽取为任意格式。
MinerU诞生于[书生-浦语](https://github.com/InternLM/InternLM)的预训练过程中,我们将会集中精力解决科技文献中的符号转化问题,希望在大模型时代为科技发展做出贡献。
相比国内外知名商用产品MinerU还很年轻,如果遇到问题或者结果不及预期请到[issue](https://github.com/opendatalab/MinerU/issues)提交问题,同时**附上相关PDF**。
https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
## 主要功能
- 删除页眉、页脚、脚注、页码等元素,保持语义连贯
- 对多栏输出符合人类阅读顺序的文本
- 保留原文档的结构,包括标题、段落、列表等
- 提取图像、图片标题、表格、表格标题
- 自动识别文档中的公式并将公式转换成latex
- 自动识别文档中的表格并将表格转换成latex
- 乱码PDF自动检测并启用OCR
- 支持CPU和GPU环境
- 支持windows/linux/mac平台
## 项目全景
![项目全景图](docs/images/project_panorama_zh_cn.png)
## 流程图
![流程图](docs/images/flowchart_zh_cn.png)
### 子模块仓库
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- 高质量的PDF内容提取工具包
## 上手指南
### 配置要求
python >= 3.9
## 快速开始
如果遇到任何安装问题,请先查询 <a href="#faq">FAQ</a> <br/>
如果遇到解析效果不及预期,参考 <a href="#known-issues">Known Issues</a><br/>
有3种不同方式可以体验MinerU的效果:
- [在线体验(无需任何安装)](#在线体验)
- [使用CPU快速体验(Windows,Linux,Mac)](#使用cpu快速体验)
- [Linux/Windows + CUDA](#使用gpu)
**⚠️安装前必看——软硬件环境支持说明**
为了确保项目的稳定性和可靠性,我们在开发过程中仅对特定的软硬件环境进行优化和测试。这样当用户在推荐的系统配置上部署和运行项目时,能够获得最佳的性能表现和最少的兼容性问题。
通过集中资源和精力于主线环境,我们团队能够更高效地解决潜在的BUG,及时开发新功能。
在非主线环境中,由于硬件、软件配置的多样性,以及第三方依赖项的兼容性问题,我们无法100%保证项目的完全可用性。因此,对于希望在非推荐环境中使用本项目的用户,我们建议先仔细阅读文档以及FAQ,大多数问题已经在FAQ中有对应的解决方案,除此之外我们鼓励社区反馈问题,以便我们能够逐步扩大支持范围。
<table>
<tr>
<td colspan="3" rowspan="2">操作系统</td>
</tr>
<tr>
<td>Ubuntu 22.04 LTS</td>
<td>Windows 10 / 11</td>
<td>macOS 11+</td>
</tr>
<tr>
<td colspan="3">CPU</td>
<td>x86_64</td>
<td>x86_64</td>
<td>x86_64 / arm64</td>
</tr>
<tr>
<td colspan="3">内存</td>
<td colspan="3">大于等于16GB,推荐32G以上</td>
</tr>
<tr>
<td colspan="3">python版本</td>
<td colspan="3">3.10</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver 版本</td>
<td>latest(专有驱动)</td>
<td>latest</td>
<td>None</td>
</tr>
<tr>
<td colspan="3">CUDA环境</td>
<td>自动安装[12.1(pytorch)+11.8(paddle)]</td>
<td>11.8(手动安装)+cuDNN v8.7.0(手动安装)</td>
<td>None</td>
</tr>
<tr>
<td rowspan="2">GPU硬件支持列表</td>
<td colspan="2">最低要求 8G+显存</td>
<td colspan="2">3060ti/3070/3080/3080ti/4060/4070/4070ti<br>
8G显存仅可开启layout和公式识别加速</td>
<td rowspan="2">None</td>
</tr>
<tr>
<td colspan="2">推荐配置 16G+显存</td>
<td colspan="2">3090/3090ti/4070tisuper/4080/4090<br>
16G及以上可以同时开启layout,公式识别和ocr加速</td>
</tr>
</table>
### 在线体验
[在线体验点击这里](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
### 使用CPU快速体验
#### 1. 安装magic-pdf
最新版本国内镜像源同步可能会有延迟,请耐心等待
推荐使用虚拟环境,以避免可能发生的依赖冲突,venv和conda均可使用。
例如:
```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
```
开发基于python 3.10,如果在其他版本python出现问题请切换至3.10。
### 安装配置
#### 1. 安装Magic-PDF
**1.安装依赖**
完整功能包依赖detectron2,该库需要编译安装,如需自行编译,请参考 https://github.com/facebookresearch/detectron2/issues/5114
或是直接使用我们预编译的whl包:
> ❗️预编译版本仅支持64位系统(windows/linux/macOS)+python 3.10平台;不支持任何32位系统和非mac的arm平台,如系统不支持请自行编译安装。
```bash
pip install detectron2 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
```
**2.使用pip安装完整功能包**
> 受pypi限制,pip安装的完整功能包仅支持cpu推理,建议只用于快速测试解析能力。
>
> 如需在生产环境使用CUDA/MPS加速请参考[使用CUDA或MPS加速推理](#4-使用CUDA或MPS加速推理)
```bash
pip install magic-pdf[full]==0.6.2b1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
> ❗️❗️❗️
> 我们预发布了0.6.2beta版本,该版本解决了很多issue中提出的问题,同时提高了安装成功率。但是该版本未经过完整的QA测试,不代表最终正式发布的质量水平。如果你遇到任何问题,请通过提交issue的方式及时向我们反馈,或者回退到使用0.6.1版本。
> ```bash
> pip install magic-pdf[full-cpu]==0.6.1
> ```
#### 2. 下载模型权重文件
详细参考 [如何下载模型文件](docs/how_to_download_models_zh_cn.md)
> ❗️模型下载后请务必检查模型文件是否下载完整
>
> 请检查目录下的模型文件大小与网页上描述是否一致,如果可以的话,最好通过sha256校验模型是否下载完整
#### 3. 拷贝配置文件并进行配置
在仓库根目录可以获得 [magic-pdf.template.json](magic-pdf.template.json) 配置模版文件
> ❗️务必执行以下命令将配置文件拷贝到【用户目录】下,否则程序将无法运行
>
> windows的用户目录为 "C:\Users\用户名", linux用户目录为 "/home/用户名", macOS用户目录为 "/Users/用户名"
```bash
cp magic-pdf.template.json ~/magic-pdf.json
```
在用户目录中找到magic-pdf.json文件并配置"models-dir"为[2. 下载模型权重文件](#2-下载模型权重文件)中下载的模型权重文件所在目录
> ❗️务必正确配置模型权重文件所在目录的【绝对路径】,否则会因为找不到模型文件而导致程序无法运行
>
> windows系统中此路径应包含盘符,且需把路径中所有的`"\"`替换为`"/"`,否则会因为转义原因导致json文件语法错误。
>
> 例如:模型放在D盘根目录的models目录,则model-dir的值应为"D:/models"
```json
{
"models-dir": "/tmp/models"
}
```
#### 4. 使用CUDA或MPS加速推理
如您有可用的Nvidia显卡或在使用Apple Silicon的Mac,可以使用CUDA或MPS进行加速
##### CUDA
需要根据自己的CUDA版本安装对应的pytorch版本
以下是对应CUDA 11.8版本的安装命令,更多信息请参考 https://pytorch.org/get-started/locally/
```bash
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
```
> ❗️务必在命令中指定以下版本
> ```bash
> torch==2.3.1 torchvision==0.18.1
> ```
> 这是我们支持的最高版本,如果不指定版本会自动安装更高版本导致程序无法运行
同时需要修改【用户目录】中配置文件magic-pdf.json中"device-mode"的值
```json
{
"device-mode":"cuda"
// other config
"models-dir": "D:/models",
"table-config": {
"is_table_recog_enable": false, // 表格识别功能默认是关闭的,如果需要修改此处的值
"max_time": 400
}
}
```
##### MPS
使用macOS(M系列芯片设备)可以使用MPS进行推理加速
需要修改配置文件magic-pdf.json中"device-mode"的值
```json
{
"device-mode":"mps"
}
```
### 使用GPU
如果您的设备支持CUDA,且满足主线环境中的显卡要求,则可以使用GPU加速,请根据自己的系统选择适合的教程:
- [Ubuntu22.04LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md)
- [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
- 使用Docker快速部署
> Docker 需设备GPU显存大于等于16GB,默认开启所有加速功能
```bash
wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
docker build -t mineru:0.7.0b1 .
docker run --rm -it --gpus=all mineru:0.7.0b1 /bin/bash
magic-pdf --help
```
## 使用
### 命令行
```bash
magic-pdf --help
Usage: magic-pdf [OPTIONS]
Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir TEXT output local directory
-m, --method [ocr|txt|auto] the method for parsing pdf.
ocr: using ocr technique to extract information from pdf,
txt: suitable for the text-based pdf only and outperform ocr,
auto: automatically choose the best method for parsing pdf
from ocr and txt.
without method specified, auto will be used by default.
--help Show this message and exit.
## show version
magic-pdf -v
## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
```
其中 `{some_pdf}` 可以是单个pdf文件,也可以是一个包含多个pdf文件的目录。
运行完命令后输出的结果会保存在`{some_output_dir}`目录下, 输出的文件列表如下
```text
├── some_pdf.md # markdown 文件
├── images # 存放图片目录
├── some_pdf_layout.pdf # layout 绘图
├── some_pdf_middle.json # minerU 中间处理结果
├── some_pdf_model.json # 模型推理结果
├── some_pdf_origin.pdf # 原 pdf 文件
└── some_pdf_spans.pdf # 最小粒度的bbox位置信息绘图
```
更多有关输出文件的信息,请参考[输出文件说明](docs/output_file_zh_cn.md)
### API
处理本地磁盘上的文件
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
处理对象存储上的文件
```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
详细实现可参考
- [demo.py 最简单的处理方式](demo/demo.py)
- [magic_pdf_parse_main.py 能够更清晰看到处理流程](demo/magic_pdf_parse_main.py)
### 二次开发
TODO
# TODO
- [ ] 基于语义的阅读顺序
- [ ] 正文中列表识别
- [ ] 正文中代码块识别
- [ ] 目录识别
- [x] 表格识别
- [ ] [化学式识别](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] 几何图形识别
# Known Issues
- 阅读顺序基于规则的分割,在一些情况下会乱序
- 不支持竖排文字
- 列表、代码块、目录在layout模型里还没有支持
- 漫画书、艺术图册、小学教材、习题尚不能很好解析
- 如果您要处理包含大量公式的pdf,强烈建议开启OCR功能。使用pymuPDF提取文字的时候会出现文本行互相重叠的情况导致公式插入位置不准确。
- **表格识别**目前处于测试阶段,识别速度较慢,识别准确度有待提升。以下是我们在Ubuntu 22.04 LTS + Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz + NVIDIA GeForce RTX 4090环境下的一些性能测试结果,可供参考。
| 表格大小 | 解析耗时 |
| ------------ | -------- |
| 6\*5 55kb | 37s |
| 16\*12 284kb | 3m18s |
| 44\*7 559kb | 4m12s |
# FAQ
[常见问题](docs/FAQ_zh_cn.md)
[FAQ](docs/FAQ_en_us.md)
# All Thanks To Our Contributors
<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
<img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>
# License Information
[LICENSE.md](LICENSE.md)
本项目目前采用PyMuPDF以实现高级功能,但因其遵循AGPL协议,可能对某些使用场景构成限制。未来版本迭代中,我们计划探索并替换为许可条款更为宽松的PDF处理库,以提升用户友好度及灵活性。
# Acknowledgments
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
# Citation
```bibtex
@article{he2024opendatalab,
title={Opendatalab: Empowering general artificial intelligence with open datasets},
author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
journal={arXiv preprint arXiv:2407.13773},
year={2024}
}
@misc{2024mineru,
title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
author={MinerU Contributors},
......@@ -305,7 +399,6 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7
}
```
# Star History
<a>
......@@ -316,7 +409,16 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7
</picture>
</a>
# Magic-doc
[Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool
# Magic-html
[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
# Links
- [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
- [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
- [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
# Copyright (c) Opendatalab. All rights reserved.
import base64
import os
import time
import zipfile
from pathlib import Path
import re
from loguru import logger
from magic_pdf.libs.hash_utils import compute_sha256
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, prepare_env
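# 注:以下在运行时通过 pip 安装 gradio,是面向在线 demo 托管环境的权宜做法;
# 本地部署时更稳妥的方式是预先安装好这些依赖。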
os.system("pip install gradio")
os.system("pip install gradio-pdf")
import gradio as gr
from gradio_pdf import PDF
def read_fn(path):
disk_rw = DiskReaderWriter(os.path.dirname(path))
return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
def parse_pdf(doc_path, output_dir, end_page_id):
os.makedirs(output_dir, exist_ok=True)
try:
file_name = f"{str(Path(doc_path).stem)}_{time.time()}"
pdf_data = read_fn(doc_path)
parse_method = "auto"
local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
do_parse(
output_dir,
file_name,
pdf_data,
[],
parse_method,
False,
end_page_id=end_page_id,
)
return local_md_dir, file_name
except Exception as e:
logger.exception(e)
def compress_directory_to_zip(directory_path, output_zip_path):
"""
压缩指定目录到一个 ZIP 文件。
:param directory_path: 要压缩的目录路径
:param output_zip_path: 输出的 ZIP 文件路径
"""
try:
with zipfile.ZipFile(output_zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
# 遍历目录中的所有文件和子目录
for root, dirs, files in os.walk(directory_path):
for file in files:
# 构建完整的文件路径
file_path = os.path.join(root, file)
# 计算相对路径
arcname = os.path.relpath(file_path, directory_path)
# 添加文件到 ZIP 文件
zipf.write(file_path, arcname)
return 0
except Exception as e:
logger.exception(e)
return -1
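# 使用示意(路径为假设值):
# compress_directory_to_zip('./output/some_pdf', './output/some_pdf.zip')
# 成功返回 0,失败返回 -1(异常信息会记录到日志)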
def image_to_base64(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
def replace_image_with_base64(markdown_text, image_dir_path):
# 匹配Markdown中的图片标签
pattern = r'\!\[(?:[^\]]*)\]\(([^)]+)\)'
# 替换图片链接
def replace(match):
relative_path = match.group(1)
full_path = os.path.join(image_dir_path, relative_path)
base64_image = image_to_base64(full_path)
return f"![{relative_path}](data:image/jpeg;base64,{base64_image})"
# 应用替换
return re.sub(pattern, replace, markdown_text)
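# 使用示意(路径为假设值):markdown 中的 ![](images/1.jpg) 会被替换为
# ![images/1.jpg](data:image/jpeg;base64,...),前提是 image_dir_path/images/1.jpg 存在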
def to_markdown(file_path, end_pages):
# 获取识别的md文件以及压缩包文件路径
local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1)
archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip")
zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
if zip_archive_success == 0:
logger.info("压缩成功")
else:
logger.error("压缩失败")
md_path = os.path.join(local_md_dir, file_name + ".md")
with open(md_path, 'r', encoding='utf-8') as f:
txt_content = f.read()
md_content = replace_image_with_base64(txt_content, local_md_dir)
# 返回转换后的PDF路径
new_pdf_path = os.path.join(local_md_dir, file_name + "_layout.pdf")
return md_content, txt_content, archive_zip_path, new_pdf_path
# def show_pdf(file_path):
# with open(file_path, "rb") as f:
# base64_pdf = base64.b64encode(f.read()).decode('utf-8')
# pdf_display = f'<embed src="data:application/pdf;base64,{base64_pdf}" ' \
# f'width="100%" height="1000" type="application/pdf">'
# return pdf_display
latex_delimiters = [{"left": "$$", "right": "$$", "display": True},
{"left": '$', "right": '$', "display": False}]
def init_model():
from magic_pdf.model.doc_analyze_by_custom_model import ModelSingleton
try:
model_manager = ModelSingleton()
txt_model = model_manager.get_model(False, False)
logger.info(f"txt_model init final")
ocr_model = model_manager.get_model(True, False)
logger.info(f"ocr_model init final")
return 0
except Exception as e:
logger.exception(e)
return -1
model_init = init_model()
logger.info(f"model_init: {model_init}")
if __name__ == "__main__":
with gr.Blocks() as demo:
with gr.Row():
with gr.Column(variant='panel', scale=5):
pdf_show = gr.Markdown()
max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages")
with gr.Row() as bu_flow:
change_bu = gr.Button("Convert")
clear_bu = gr.ClearButton([pdf_show], value="Clear")
pdf_show = PDF(label="Please upload pdf", interactive=True, height=800)
with gr.Column(variant='panel', scale=5):
output_file = gr.File(label="convert result", interactive=False)
with gr.Tabs():
with gr.Tab("Markdown rendering"):
md = gr.Markdown(label="Markdown rendering", height=900, show_copy_button=True,
latex_delimiters=latex_delimiters, line_breaks=True)
with gr.Tab("Markdown text"):
md_text = gr.TextArea(lines=45, show_copy_button=True)
change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages], outputs=[md, md_text, output_file, pdf_show])
clear_bu.add([md, pdf_show, md_text, output_file])
demo.launch()
## Overview
After executing the `magic-pdf` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
### some_pdf_layout.pdf
Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `some_pdf_layout.pdf`, different content blocks are highlighted with different background colors.
![layout example](images/layout_example.png)
### some_pdf_spans.pdf
All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.
![spans example](images/spans_example.png)
### some_pdf_model.json
#### Structure Definition
```python
from pydantic import BaseModel, Field
from enum import IntEnum
......@@ -34,12 +33,12 @@ class CategoryType(IntEnum):
table_footnote = 7 # Table footnote
isolate_formula = 8 # Block formula
formula_caption = 9 # Formula label
embedding = 13 # Inline formula
isolated = 14 # Block formula
text = 15 # OCR recognition result
class PageInfo(BaseModel):
page_no: int = Field(description="Page number, the first page is 0", ge=0)
height: int = Field(description="Page height", gt=0)
......@@ -51,22 +50,20 @@ class ObjectInferenceResult(BaseModel):
score: float = Field(description="Confidence of the inference result")
latex: str | None = Field(description="LaTeX parsing result", default=None)
html: str | None = Field(description="HTML parsing result", default=None)
class PageInferenceResults(BaseModel):
layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
page_info: PageInfo = Field(description="Page metadata")
# The inference results of all pages, ordered by page number, are stored in a list as the inference results of MinerU
inference_result: list[PageInferenceResults] = []
```
The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively.
![Poly Coordinate Diagram](images/poly.png)
#### example
```json
......@@ -120,15 +117,13 @@ The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], represen
]
```
### some_pdf_middle.json
| Field Name | Description |
| :------------- | :------------------------------------------------------------------------------------------------------------- |
| pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
| \_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
| \_version_name | string, indicates the version of magic-pdf used in this parsing |
<br>
......@@ -136,18 +131,18 @@ The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], represen
Field structure description
| Field Name | Description |
| :------------------ | :----------------------------------------------------------------------------------------------------------------- |
| preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented |
| layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order |
| page_idx | Page number, starting from 0 |
| page_size | Page width and height |
| \_layout_tree | Layout tree structure |
| images | list, each element is a dict representing an img_block |
| tables | list, each element is a dict representing a table_block |
| interline_equations | list, each element is a dict representing an interline_equation_block |
| discarded_blocks | List, block information returned by the model that needs to be dropped |
| para_blocks | Result after segmenting preproc_blocks |
In the above table, `para_blocks` is an array of dicts, each dict representing a block structure. A block can support up to one level of nesting.
......@@ -157,35 +152,35 @@ In the above table, `para_blocks` is an array of dicts, each dict representing a
The outer block is referred to as a first-level block, and the fields in the first-level block include:
| Field Name | Description |
| :--------- | :------------------------------------------------------------- |
| type | Block type (table\|image) |
| bbox | Block bounding box coordinates |
| blocks | list, each element is a dict representing a second-level block |
<br>
There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks.
The fields in a second-level block include:
| Field Name | Description |
| :--------- | :---------------------------------------------------------------------------------------------------------- |
| type | Block type |
| bbox | Block bounding box coordinates |
| lines | list, each element is a dict representing a line, used to describe the composition of a line of information |
Detailed explanation of second-level block types
| type | Description |
| :----------------- | :--------------------- |
| image_body | Main body of the image |
| image_caption | Image description text |
| table_body | Main body of the table |
| table_caption | Table description text |
| table_footnote | Table footnote |
| text | Text block |
| title | Title block |
| interline_equation | Block formula |
<br>
......@@ -193,31 +188,30 @@ Detailed explanation of second-level block types
The field format of a line is as follows:
| Field Name | Description |
| :--------- | :------------------------------------------------------------------------------------------------------ |
| bbox | Bounding box coordinates of the line |
| spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit |
<br>
**span**
| Field Name | Description |
| :------------------ | :------------------------------------------------------------------------------------------------------- |
| bbox | Bounding box coordinates of the span |
| type | Type of the span |
| content \| img_path | Text spans use content, chart spans use img_path to store the actual text or screenshot path information |
The types of spans are as follows:
| type | Description |
| :----------------- | :------------- |
| image | Image |
| table | Table |
| text | Text |
| inline_equation | Inline formula |
| interline_equation | Block formula |
**Summary**
......@@ -229,7 +223,6 @@ The block structure is as follows:
First-level block (if any) -> Second-level block -> Line -> Span
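To make that nesting concrete, here is a minimal hand-written sketch of a single `para_blocks` entry (coordinates and text are invented for illustration; real output carries the full field set described above):
```python
# A hypothetical text block: block -> line -> span
para_block = {
    'type': 'text',
    'bbox': [52, 61, 543, 102],           # block bounding box
    'lines': [{
        'bbox': [52, 61, 543, 80],        # line bounding box
        'spans': [{
            'bbox': [54, 61, 296, 80],    # span bounding box
            'type': 'text',               # span type
            'content': 'An example sentence.',
        }],
    }],
}
```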
#### example
```json
......
## 概览
`magic-pdf` 命令执行后除了输出和 markdown 有关的文件以外,还会生成若干个和 markdown 无关的文件。现在将一一介绍这些文件
### some_pdf_layout.pdf
每一页的 layout 均由一个或多个框组成。每个框左上角的数字表明它们的序号。此外 some_pdf_layout.pdf 框内用不同的背景色块圈定不同的内容块。
![layout 页面示例](images/layout_example.png)
### some_pdf_spans.pdf
根据 span 类型的不同,采用不同颜色线框绘制页面上所有 span。该文件可以用于质检,可以快速排查出文本丢失、行间公式未识别等问题。
![span 页面示例](images/spans_example.png)
### some_pdf_model.json
#### 结构定义
```python
from pydantic import BaseModel, Field
from enum import IntEnum
......@@ -33,13 +32,13 @@ class CategoryType(IntEnum):
table_caption = 6 # 表格描述
table_footnote = 7 # 表格注释
isolate_formula = 8 # 行间公式
formula_caption = 9 # 行间公式的标号
embedding = 13 # 行内公式
isolated = 14 # 行间公式
text = 15 # ocr 识别结果
class PageInfo(BaseModel):
page_no: int = Field(description="页码序号,第一页的序号是 0", ge=0)
height: int = Field(description="页面高度", gt=0)
......@@ -51,21 +50,20 @@ class ObjectInferenceResult(BaseModel):
score: float = Field(description="推理结果的置信度")
latex: str | None = Field(description="latex 解析结果", default=None)
html: str | None = Field(description="html 解析结果", default=None)
class PageInferenceResults(BaseModel):
layout_dets: list[ObjectInferenceResult] = Field(description="页面识别结果", ge=0)
page_info: PageInfo = Field(description="页面元信息")
# 所有页面的推理结果按照页码顺序依次放到列表中即为 minerU 推理结果
inference_result: list[PageInferenceResults] = []
```
poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右上、右下、左下四点的坐标
![poly 坐标示意图](images/poly.png)
#### 示例数据
```json
......@@ -119,32 +117,31 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
]
```
### some_pdf_middle.json
| 字段名 | 解释 |
| :------------- | :----------------------------------------------------------------- |
| pdf_info | list,每个元素都是一个dict,这个dict是每一页pdf的解析结果,详见下表 |
| \_parse_type | ocr \| txt,用来标识本次解析的中间态使用的模式 |
| \_version_name | string, 表示本次解析使用的 magic-pdf 的版本号 |
<br>
**pdf_info**
字段结构说明
| 字段名 | 解释 |
| :------------------ | :------------------------------------------------------------------- |
| preproc_blocks | pdf预处理后,未分段的中间结果 |
| layout_bboxes | 布局分割的结果,含有布局的方向(垂直、水平),和bbox,按阅读顺序排序 |
| page_idx | 页码,从0开始 |
| page_size | 页面的宽度和高度 |
| \_layout_tree | 布局树状结构 |
| images | list,每个元素是一个dict,每个dict表示一个img_block |
| tables | list,每个元素是一个dict,每个dict表示一个table_block |
| interline_equations | list,每个元素是一个dict,每个dict表示一个interline_equation_block |
| discarded_blocks | List, 模型返回的需要drop的block信息 |
| para_blocks | 将preproc_blocks进行分段之后的结果 |
上表中 `para_blocks` 是个dict的数组,每个dict是一个block结构,block最多支持一次嵌套
......@@ -154,35 +151,35 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
外层block被称为一级block,一级block中的字段包括
| 字段名 | 解释 |
| :----- | :---------------------------------------------- |
| type | block类型(table\|image) |
| bbox | block矩形框坐标 |
| blocks | list,里面的每个元素都是一个dict格式的二级block |
<br>
一级block只有"table"和"image"两种类型,其余block均为二级block
二级block中的字段包括
| 字段名 | 解释 |
| :----- | :----------------------------------------------------------- |
| type | block类型 |
| bbox | block矩形框坐标 |
| lines | list,每个元素都是一个dict表示的line,用来描述一行信息的构成 |
二级block的类型详解
| type | desc |
| :----------------- | :------------- |
| image_body | 图像的本体 |
| image_caption | 图像的描述文本 |
| table_body | 表格本体 |
| table_caption | 表格的描述文本 |
| table_footnote | 表格的脚注 |
| text | 文本块 |
| title | 标题块 |
| interline_equation | 行间公式块 |
<br>
......@@ -190,33 +187,31 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
line 的 字段格式如下
| 字段名 | 解释 |
| :----- | :------------------------------------------------------------------- |
| bbox | line的矩形框坐标 |
| spans | list,每个元素都是一个dict表示的span,用来描述一个最小组成单元的构成 |
<br>
**span**
| 字段名 | 解释 |
| :------------------ | :------------------------------------------------------------------------------- |
| bbox | span的矩形框坐标 |
| type | span的类型 |
| content \| img_path | 文本类型的span使用content,图表类使用img_path 用来存储实际的文本或者截图路径信息 |
span 的类型有如下几种
| type | desc |
| :----------------- | :------- |
| image | 图片 |
| table | 表格 |
| text | 文本 |
| inline_equation | 行内公式 |
| interline_equation | 行间公式 |
**总结**
span是所有元素的最小存储单元
......@@ -227,7 +222,6 @@ para_blocks内存储的元素为区块信息
一级block(如有)->二级block->line->span
#### 示例数据
```json
......@@ -329,4 +323,4 @@ para_blocks内存储的元素为区块信息
"_parse_type": "txt",
"_version_name": "0.6.1"
}
```
import re

import wordninja
from loguru import logger

from magic_pdf.libs.commons import join_path
from magic_pdf.libs.language import detect_lang
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.libs.markdown_utils import ocr_escape_special_markdown_char
from magic_pdf.libs.ocr_content_type import BlockType, ContentType
def __is_hyphen_at_line_end(line):
"""
Check if a line ends with one or more letters followed by a hyphen.
Args:
line (str): The line of text to check.
Returns:
bool: True if the line ends with one or more letters followed by a hyphen, False otherwise.
"""
# Use regex to check if the line ends with one or more letters followed by a hyphen
return bool(re.search(r'[A-Za-z]+-\s*$', line))
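# Behavior sketch (hypothetical inputs):
# __is_hyphen_at_line_end('mathema-')         -> True  (letters + trailing hyphen)
# __is_hyphen_at_line_end('state-of-the-art') -> False (mid-word hyphens are ignored)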
def split_long_words(text):
......@@ -14,7 +29,7 @@ def split_long_words(text):
for i in range(len(segments)):
words = re.findall(r'\w+|[^\w]', segments[i], re.UNICODE)
for j in range(len(words)):
if len(words[j]) > 10:
words[j] = ' '.join(wordninja.split(words[j]))
segments[i] = ''.join(words)
return ' '.join(segments)
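# 行为示意(输入为假设值,具体切分结果取决于 wordninja 的词表):
# split_long_words('thisisaverylongword') -> 'this is a very long word'(仅长度 > 10 的词被切分)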
......@@ -23,8 +38,9 @@ def split_long_words(text):
def ocr_mk_mm_markdown_with_para(pdf_info_list: list, img_buket_path):
markdown = []
for page_info in pdf_info_list:
paras_of_layout = page_info.get("para_blocks")
page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "mm", img_buket_path)
paras_of_layout = page_info.get('para_blocks')
page_markdown = ocr_mk_markdown_with_para_core_v2(
paras_of_layout, 'mm', img_buket_path)
markdown.extend(page_markdown)
return '\n\n'.join(markdown)
......@@ -32,29 +48,34 @@ def ocr_mk_mm_markdown_with_para(pdf_info_list: list, img_buket_path):
def ocr_mk_nlp_markdown_with_para(pdf_info_dict: list):
markdown = []
for page_info in pdf_info_dict:
paras_of_layout = page_info.get("para_blocks")
page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "nlp")
paras_of_layout = page_info.get('para_blocks')
page_markdown = ocr_mk_markdown_with_para_core_v2(
paras_of_layout, 'nlp')
markdown.extend(page_markdown)
return '\n\n'.join(markdown)
def ocr_mk_mm_markdown_with_para_and_pagination(pdf_info_dict: list,
img_buket_path):
markdown_with_para_and_pagination = []
page_no = 0
for page_info in pdf_info_dict:
paras_of_layout = page_info.get("para_blocks")
paras_of_layout = page_info.get('para_blocks')
if not paras_of_layout:
continue
page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "mm", img_buket_path)
page_markdown = ocr_mk_markdown_with_para_core_v2(
paras_of_layout, 'mm', img_buket_path)
markdown_with_para_and_pagination.append({
'page_no': page_no,
'md_content': '\n\n'.join(page_markdown)
})
page_no += 1
return markdown_with_para_and_pagination
def ocr_mk_markdown_with_para_core(paras_of_layout, mode, img_buket_path=""):
def ocr_mk_markdown_with_para_core(paras_of_layout, mode, img_buket_path=''):
page_markdown = []
for paras in paras_of_layout:
for para in paras:
......@@ -67,8 +88,9 @@ def ocr_mk_markdown_with_para_core(paras_of_layout, mode, img_buket_path=""):
if span_type == ContentType.Text:
content = span['content']
language = detect_lang(content)
if language == 'en': # 只对英文长词进行分词处理,中文分词会丢失文本
content = ocr_escape_special_markdown_char(
split_long_words(content))
else:
content = ocr_escape_special_markdown_char(content)
elif span_type == ContentType.InlineEquation:
......@@ -92,7 +114,9 @@ def ocr_mk_markdown_with_para_core(paras_of_layout, mode, img_buket_path=""):
return page_markdown
def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
mode,
img_buket_path=''):
page_markdown = []
for para_block in paras_of_layout:
para_text = ''
......@@ -100,7 +124,7 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout, mode, img_buket_path=""):
if para_type == BlockType.Text:
para_text = merge_para_with_text(para_block)
elif para_type == BlockType.Title:
para_text = f"# {merge_para_with_text(para_block)}"
para_text = f'# {merge_para_with_text(para_block)}'
elif para_type == BlockType.InterlineEquation:
para_text = merge_para_with_text(para_block)
elif para_type == BlockType.Image:
......@@ -116,14 +140,16 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout, mode, img_buket_path=""):
for block in para_block['blocks']: # 2nd.拼image_caption
if block['type'] == BlockType.ImageCaption:
para_text += merge_para_with_text(block)
for block in para_block['blocks']: # 3rd.拼image_footnote
if block['type'] == BlockType.ImageFootnote:
para_text += merge_para_with_text(block)
elif para_type == BlockType.Table:
if mode == 'nlp':
continue
elif mode == 'mm':
for block in para_block['blocks']: # 1st.拼table_caption
if block['type'] == BlockType.TableCaption:
para_text += merge_para_with_text(block)
for block in para_block['blocks']: # 2nd.拼table_body
if block['type'] == BlockType.TableBody:
for line in block['lines']:
......@@ -135,7 +161,7 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout, mode, img_buket_path=""):
elif span.get('html', ''):
para_text += f"\n\n{span['html']}\n\n"
else:
para_text += f"\n![{table_caption}]({join_path(img_buket_path, span['image_path'])}) \n"
para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"
for block in para_block['blocks']: # 3rd.拼table_footnote
if block['type'] == BlockType.TableFootnote:
para_text += merge_para_with_text(block)
......@@ -149,24 +175,39 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout, mode, img_buket_path=""):
def merge_para_with_text(para_block):
def detect_language(text):
en_pattern = r'[a-zA-Z]+'
en_matches = re.findall(en_pattern, text)
en_length = sum(len(match) for match in en_matches)
if len(text) > 0:
if en_length / len(text) >= 0.5:
return 'en'
else:
return 'unknown'
else:
return 'empty'
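# 阈值行为示意(英文字符占比 >= 0.5 即判为 'en';输入为假设值):
# detect_language('hello world') -> 'en'(10/11 ≈ 0.91)
# detect_language('你好世界') -> 'unknown'
# detect_language('') -> 'empty'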
para_text = ''
for line in para_block['lines']:
line_text = ""
line_lang = ""
line_text = ''
line_lang = ''
for span in line['spans']:
span_type = span['type']
if span_type == ContentType.Text:
line_text += span['content'].strip()
if line_text != "":
if line_text != '':
line_lang = detect_lang(line_text)
for span in line['spans']:
span_type = span['type']
content = ''
if span_type == ContentType.Text:
content = span['content']
language = detect_language(content)
if language == 'en': # 只对英文长词进行分词处理,中文分词会丢失文本
content = ocr_escape_special_markdown_char(
split_long_words(content))
else:
content = ocr_escape_special_markdown_char(content)
elif span_type == ContentType.InlineEquation:
......@@ -175,10 +216,17 @@ def merge_para_with_text(para_block):
content = f"\n$$\n{span['content']}\n$$\n"
if content != '':
langs = ['zh', 'ja', 'ko']
if line_lang in langs: # 遇到一些一个字一个span的文档,这种单字语言判断不准,需要用整行文本判断
para_text += content # 中文/日语/韩文语境下,content间不需要空格分隔
elif line_lang == 'en':
# 如果是前一行带有-连字符,那么末尾不应该加空格
if __is_hyphen_at_line_end(content):
para_text += content[:-1]
else:
para_text += content + ' '
else:
para_text += content + ' ' # 西方文本语境下 content间需要空格分隔
return para_text
......@@ -193,18 +241,18 @@ def para_to_standard_format(para, img_buket_path):
for span in line['spans']:
language = ''
span_type = span.get('type')
content = ""
content = ''
if span_type == ContentType.Text:
content = span['content']
language = detect_lang(content)
if language == 'en': # 只对英文长词进行分词处理,中文分词会丢失文本
content = ocr_escape_special_markdown_char(
split_long_words(content))
else:
content = ocr_escape_special_markdown_char(content)
elif span_type == ContentType.InlineEquation:
content = f"${span['content']}$"
inline_equation_num += 1
if language == 'en': # 英文语境下 content间需要空格分隔
para_text += content + ' '
else: # 中文语境下,content间不需要空格分隔
......@@ -212,7 +260,7 @@ def para_to_standard_format(para, img_buket_path):
para_content = {
'type': 'text',
'text': para_text,
'inline_equation_num': inline_equation_num,
}
return para_content
......@@ -223,37 +271,35 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
para_content = {
'type': 'text',
'text': merge_para_with_text(para_block),
'page_idx': page_idx,
}
elif para_type == BlockType.Title:
para_content = {
'type': 'text',
'text': merge_para_with_text(para_block),
'text_level': 1,
'page_idx': page_idx,
}
elif para_type == BlockType.InterlineEquation:
para_content = {
'type': 'equation',
'text': merge_para_with_text(para_block),
'text_format': "latex",
'page_idx': page_idx
'text_format': 'latex',
'page_idx': page_idx,
}
elif para_type == BlockType.Image:
para_content = {'type': 'image', 'page_idx': page_idx}
for block in para_block['blocks']:
if block['type'] == BlockType.ImageBody:
para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path'])
para_content['img_path'] = join_path(
img_buket_path,
block['lines'][0]['spans'][0]['image_path'])
if block['type'] == BlockType.ImageCaption:
para_content['img_caption'] = merge_para_with_text(block)
if block['type'] == BlockType.ImageFootnote:
para_content['img_footnote'] = merge_para_with_text(block)
elif para_type == BlockType.Table:
para_content = {'type': 'table', 'page_idx': page_idx}
for block in para_block['blocks']:
if block['type'] == BlockType.TableBody:
if block["lines"][0]["spans"][0].get('latex', ''):
......@@ -272,17 +318,18 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
def make_standard_format_with_para(pdf_info_dict: list, img_buket_path: str):
content_list = []
for page_info in pdf_info_dict:
paras_of_layout = page_info.get("para_blocks")
paras_of_layout = page_info.get('para_blocks')
if not paras_of_layout:
continue
for para_block in paras_of_layout:
para_content = para_to_standard_format_v2(para_block,
img_buket_path)
content_list.append(para_content)
return content_list
def line_to_standard_format(line, img_buket_path):
line_text = ""
line_text = ''
inline_equation_num = 0
for span in line['spans']:
if not span.get('content'):
......@@ -292,13 +339,15 @@ def line_to_standard_format(line, img_buket_path):
if span['type'] == ContentType.Image:
content = {
'type': 'image',
'img_path': join_path(img_buket_path,
span['image_path']),
}
return content
elif span['type'] == ContentType.Table:
content = {
'type': 'table',
'img_path': join_path(img_buket_path,
span['image_path']),
}
return content
else:
......@@ -306,36 +355,33 @@ def line_to_standard_format(line, img_buket_path):
interline_equation = span['content']
content = {
'type': 'equation',
'latex': f"$$\n{interline_equation}\n$$"
'latex': f'$$\n{interline_equation}\n$$'
}
return content
elif span['type'] == ContentType.InlineEquation:
inline_equation = span['content']
line_text += f"${inline_equation}$"
line_text += f'${inline_equation}$'
inline_equation_num += 1
elif span['type'] == ContentType.Text:
text_content = ocr_escape_special_markdown_char(
span['content']) # 转义特殊符号
line_text += text_content
content = {
'type': 'text',
'text': line_text,
'inline_equation_num': inline_equation_num,
}
return content
def ocr_mk_mm_standard_format(pdf_info_dict: list):
"""
content_list
type string image/text/table/equation(行间的单独拿出来,行内的和text合并)
latex string latex文本字段。
text string 纯文本格式的文本数据。
md string markdown格式的文本数据。
img_path string s3://full/path/to/img.jpg
"""
"""content_list type string
image/text/table/equation(行间的单独拿出来,行内的和text合并) latex string
latex文本字段。 text string 纯文本格式的文本数据。 md string
markdown格式的文本数据。 img_path string s3://full/path/to/img.jpg."""
content_list = []
for page_info in pdf_info_dict:
blocks = page_info.get("preproc_blocks")
blocks = page_info.get('preproc_blocks')
if not blocks:
continue
for block in blocks:
......@@ -345,34 +391,42 @@ def ocr_mk_mm_standard_format(pdf_info_dict: list):
return content_list
def union_make(pdf_info_dict: list,
make_mode: str,
drop_mode: str,
img_buket_path: str = ''):
output_content = []
for page_info in pdf_info_dict:
if page_info.get("need_drop", False):
drop_reason = page_info.get("drop_reason")
if page_info.get('need_drop', False):
drop_reason = page_info.get('drop_reason')
if drop_mode == DropMode.NONE:
pass
elif drop_mode == DropMode.WHOLE_PDF:
raise Exception(f"drop_mode is {DropMode.WHOLE_PDF} , drop_reason is {drop_reason}")
raise Exception((f'drop_mode is {DropMode.WHOLE_PDF} ,'
f'drop_reason is {drop_reason}'))
elif drop_mode == DropMode.SINGLE_PAGE:
logger.warning(f"drop_mode is {DropMode.SINGLE_PAGE} , drop_reason is {drop_reason}")
logger.warning((f'drop_mode is {DropMode.SINGLE_PAGE} ,'
f'drop_reason is {drop_reason}'))
continue
else:
raise Exception(f"drop_mode can not be null")
raise Exception('drop_mode can not be null')
paras_of_layout = page_info.get("para_blocks")
page_idx = page_info.get("page_idx")
paras_of_layout = page_info.get('para_blocks')
page_idx = page_info.get('page_idx')
if not paras_of_layout:
continue
if make_mode == MakeMode.MM_MD:
page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "mm", img_buket_path)
page_markdown = ocr_mk_markdown_with_para_core_v2(
paras_of_layout, 'mm', img_buket_path)
output_content.extend(page_markdown)
elif make_mode == MakeMode.NLP_MD:
page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "nlp")
page_markdown = ocr_mk_markdown_with_para_core_v2(
paras_of_layout, 'nlp')
output_content.extend(page_markdown)
elif make_mode == MakeMode.STANDARD_FORMAT:
for para_block in paras_of_layout:
para_content = para_to_standard_format_v2(
para_block, img_buket_path, page_idx)
output_content.append(para_content)
if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
return '\n\n'.join(output_content)
......
import os
from pathlib import Path
from loguru import logger
from magic_pdf.integrations.rag.type import (ElementRelation, LayoutElements,
Node)
from magic_pdf.integrations.rag.utils import inference
class RagPageReader:
def __init__(self, pagedata: LayoutElements):
self.o = [
Node(
category_type=v.category_type,
text=v.text,
image_path=v.image_path,
anno_id=v.anno_id,
latex=v.latex,
html=v.html,
) for v in pagedata.layout_dets
]
self.pagedata = pagedata
def __iter__(self):
return iter(self.o)
def get_rel_map(self) -> list[ElementRelation]:
return self.pagedata.extra.element_relation
class RagDocumentReader:
def __init__(self, ragdata: list[LayoutElements]):
self.o = [RagPageReader(v) for v in ragdata]
def __iter__(self):
return iter(self.o)
class DataReader:
def __init__(self, path_or_directory: str, method: str, output_dir: str):
self.path_or_directory = path_or_directory
self.method = method
self.output_dir = output_dir
self.pdfs = []
if os.path.isdir(path_or_directory):
for doc_path in Path(path_or_directory).glob('*.pdf'):
self.pdfs.append(doc_path)
else:
assert path_or_directory.endswith('.pdf')
self.pdfs.append(Path(path_or_directory))
def get_documents_count(self) -> int:
"""Returns the number of documents in the directory."""
return len(self.pdfs)
def get_document_result(self, idx: int) -> RagDocumentReader | None:
"""
Args:
idx (int): the index of documents under the
directory path_or_directory
Returns:
RagDocumentReader | None: RagDocumentReader is an iterable object,
more details @RagDocumentReader
"""
if idx >= self.get_documents_count() or idx < 0:
logger.error(f'invalid idx: {idx}')
return None
res = inference(str(self.pdfs[idx]), self.output_dir, self.method)
if res is None:
logger.warning(f'failed to inference pdf {self.pdfs[idx]}')
return None
return RagDocumentReader(res)
def get_document_filename(self, idx: int) -> Path:
"""get the filename of the document."""
return self.pdfs[idx]
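# --- Editor's usage sketch (hedged, not part of the commit); the directory and
# --- output paths below are illustrative only.
if __name__ == '__main__':
    reader = DataReader('/path/to/pdfs', 'ocr', '/tmp/rag_output')
    for i in range(reader.get_documents_count()):
        doc = reader.get_document_result(i)
        if doc is None:  # inference failed or idx invalid
            continue
        for page in doc:  # each page is a RagPageReader
            for node in page:  # each node is a Node
                print(node.category_type, (node.text or '')[:40])
            print(page.get_rel_map())  # sibling relations (body/caption/footnote)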
from enum import Enum
from pydantic import BaseModel, Field
# rag
class CategoryType(Enum):  # py3.10 does not support StrEnum
text = 'text'
title = 'title'
interline_equation = 'interline_equation'
image = 'image'
image_body = 'image_body'
image_caption = 'image_caption'
table = 'table'
table_body = 'table_body'
table_caption = 'table_caption'
table_footnote = 'table_footnote'
class ElementRelType(Enum):
sibling = 'sibling'
class PageInfo(BaseModel):
    page_no: int = Field(description='the index of the page, starting from zero',
                         ge=0)
height: int = Field(description='the height of page', gt=0)
width: int = Field(description='the width of page', ge=0)
image_path: str | None = Field(description='the image of this page',
default=None)
class ContentObject(BaseModel):
    category_type: CategoryType = Field(description='category')
poly: list[float] = Field(
description=('Coordinates, need to convert back to PDF coordinates,'
' order is top-left, top-right, bottom-right, bottom-left'
' x,y coordinates'))
ignore: bool = Field(description='whether ignore this object',
default=False)
text: str | None = Field(description='text content of the object',
default=None)
image_path: str | None = Field(description='path of embedded image',
default=None)
order: int = Field(description='the order of this object within a page',
default=-1)
anno_id: int = Field(description='unique id', default=-1)
latex: str | None = Field(description='latex result', default=None)
html: str | None = Field(description='html result', default=None)
class ElementRelation(BaseModel):
source_anno_id: int = Field(description='unique id of the source object',
default=-1)
target_anno_id: int = Field(description='unique id of the target object',
default=-1)
relation: ElementRelType = Field(
description='the relation between source and target element')
class LayoutElementsExtra(BaseModel):
element_relation: list[ElementRelation] = Field(
description='the relation between source and target element')
class LayoutElements(BaseModel):
layout_dets: list[ContentObject] = Field(
description='layout element details')
page_info: PageInfo = Field(description='page info')
extra: LayoutElementsExtra = Field(description='extra information')
# iter data format
class Node(BaseModel):
    category_type: CategoryType = Field(description='category')
text: str | None = Field(description='text content of the object',
default=None)
image_path: str | None = Field(description='path of embedded image',
default=None)
anno_id: int = Field(description='unique id', default=-1)
latex: str | None = Field(description='latex result', default=None)
html: str | None = Field(description='html result', default=None)
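# --- Editor's sketch (hedged, not part of the commit): how the models above nest.
# --- LayoutElements holds a list of ContentObject plus a PageInfo and a
# --- LayoutElementsExtra. A minimal hand-built page for illustration:
_example_page = LayoutElements(
    layout_dets=[
        ContentObject(category_type=CategoryType.text,
                      poly=[0, 0, 100, 0, 100, 20, 0, 20],
                      text='hello', order=0, anno_id=0),
    ],
    page_info=PageInfo(page_no=0, height=842, width=595),
    extra=LayoutElementsExtra(element_relation=[]),
)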
import json
import os
from pathlib import Path
from loguru import logger
import magic_pdf.model as model_config
from magic_pdf.dict2md.ocr_mkcontent import merge_para_with_text
from magic_pdf.integrations.rag.type import (CategoryType, ContentObject,
ElementRelation, ElementRelType,
LayoutElements,
LayoutElementsExtra, PageInfo)
from magic_pdf.libs.ocr_content_type import BlockType, ContentType
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, prepare_env
def convert_middle_json_to_layout_elements(
json_data: dict,
output_dir: str,
) -> list[LayoutElements]:
uniq_anno_id = 0
res: list[LayoutElements] = []
for page_no, page_data in enumerate(json_data['pdf_info']):
order_id = 0
page_info = PageInfo(
height=int(page_data['page_size'][1]),
width=int(page_data['page_size'][0]),
page_no=page_no,
)
layout_dets: list[ContentObject] = []
extra_element_relation: list[ElementRelation] = []
for para_block in page_data['para_blocks']:
para_text = ''
para_type = para_block['type']
if para_type == BlockType.Text:
para_text = merge_para_with_text(para_block)
x0, y0, x1, y1 = para_block['bbox']
content = ContentObject(
anno_id=uniq_anno_id,
category_type=CategoryType.text,
text=para_text,
order=order_id,
poly=[x0, y0, x1, y0, x1, y1, x0, y1],
)
uniq_anno_id += 1
order_id += 1
layout_dets.append(content)
elif para_type == BlockType.Title:
para_text = merge_para_with_text(para_block)
x0, y0, x1, y1 = para_block['bbox']
content = ContentObject(
anno_id=uniq_anno_id,
category_type=CategoryType.title,
text=para_text,
order=order_id,
poly=[x0, y0, x1, y0, x1, y1, x0, y1],
)
uniq_anno_id += 1
order_id += 1
layout_dets.append(content)
elif para_type == BlockType.InterlineEquation:
para_text = merge_para_with_text(para_block)
x0, y0, x1, y1 = para_block['bbox']
content = ContentObject(
anno_id=uniq_anno_id,
category_type=CategoryType.interline_equation,
text=para_text,
order=order_id,
poly=[x0, y0, x1, y0, x1, y1, x0, y1],
)
uniq_anno_id += 1
order_id += 1
layout_dets.append(content)
elif para_type == BlockType.Image:
body_anno_id = -1
caption_anno_id = -1
for block in para_block['blocks']:
if block['type'] == BlockType.ImageBody:
for line in block['lines']:
for span in line['spans']:
if span['type'] == ContentType.Image:
x0, y0, x1, y1 = block['bbox']
content = ContentObject(
anno_id=uniq_anno_id,
category_type=CategoryType.image_body,
image_path=os.path.join(
output_dir, span['image_path']),
order=order_id,
poly=[x0, y0, x1, y0, x1, y1, x0, y1],
)
body_anno_id = uniq_anno_id
uniq_anno_id += 1
order_id += 1
layout_dets.append(content)
for block in para_block['blocks']:
if block['type'] == BlockType.ImageCaption:
para_text += merge_para_with_text(block)
x0, y0, x1, y1 = block['bbox']
content = ContentObject(
anno_id=uniq_anno_id,
category_type=CategoryType.image_caption,
text=para_text,
order=order_id,
poly=[x0, y0, x1, y0, x1, y1, x0, y1],
)
caption_anno_id = uniq_anno_id
uniq_anno_id += 1
order_id += 1
layout_dets.append(content)
if body_anno_id > 0 and caption_anno_id > 0:
element_relation = ElementRelation(
relation=ElementRelType.sibling,
source_anno_id=body_anno_id,
target_anno_id=caption_anno_id,
)
extra_element_relation.append(element_relation)
elif para_type == BlockType.Table:
body_anno_id, caption_anno_id, footnote_anno_id = -1, -1, -1
for block in para_block['blocks']:
if block['type'] == BlockType.TableCaption:
para_text += merge_para_with_text(block)
x0, y0, x1, y1 = block['bbox']
content = ContentObject(
anno_id=uniq_anno_id,
category_type=CategoryType.table_caption,
text=para_text,
order=order_id,
poly=[x0, y0, x1, y0, x1, y1, x0, y1],
)
caption_anno_id = uniq_anno_id
uniq_anno_id += 1
order_id += 1
layout_dets.append(content)
for block in para_block['blocks']:
if block['type'] == BlockType.TableBody:
for line in block['lines']:
for span in line['spans']:
if span['type'] == ContentType.Table:
x0, y0, x1, y1 = para_block['bbox']
content = ContentObject(
anno_id=uniq_anno_id,
category_type=CategoryType.table_body,
order=order_id,
poly=[x0, y0, x1, y0, x1, y1, x0, y1],
)
body_anno_id = uniq_anno_id
uniq_anno_id += 1
order_id += 1
# if processed by table model
if span.get('latex', ''):
content.latex = span['latex']
else:
content.image_path = os.path.join(
output_dir, span['image_path'])
layout_dets.append(content)
for block in para_block['blocks']:
if block['type'] == BlockType.TableFootnote:
para_text += merge_para_with_text(block)
x0, y0, x1, y1 = block['bbox']
content = ContentObject(
anno_id=uniq_anno_id,
category_type=CategoryType.table_footnote,
text=para_text,
order=order_id,
poly=[x0, y0, x1, y0, x1, y1, x0, y1],
)
footnote_anno_id = uniq_anno_id
uniq_anno_id += 1
order_id += 1
layout_dets.append(content)
if caption_anno_id != -1 and body_anno_id != -1:
element_relation = ElementRelation(
relation=ElementRelType.sibling,
source_anno_id=body_anno_id,
target_anno_id=caption_anno_id,
)
extra_element_relation.append(element_relation)
if footnote_anno_id != -1 and body_anno_id != -1:
element_relation = ElementRelation(
relation=ElementRelType.sibling,
source_anno_id=body_anno_id,
target_anno_id=footnote_anno_id,
)
extra_element_relation.append(element_relation)
res.append(
LayoutElements(
page_info=page_info,
layout_dets=layout_dets,
extra=LayoutElementsExtra(
element_relation=extra_element_relation),
))
return res
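# --- Editor's note (hedged, not part of the commit): every branch above turns an
# --- axis-aligned bbox [x0, y0, x1, y1] into an 8-value poly ordered top-left,
# --- top-right, bottom-right, bottom-left. An equivalent helper:
def _bbox_to_poly(bbox):
    x0, y0, x1, y1 = bbox
    return [x0, y0, x1, y0, x1, y1, x0, y1]

assert _bbox_to_poly([0, 0, 10, 20]) == [0, 0, 10, 0, 10, 20, 0, 20]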
def inference(path, output_dir, method):
model_config.__use_inside_model__ = True
model_config.__model_mode__ = 'full'
if output_dir == '':
if os.path.isdir(path):
output_dir = os.path.join(path, 'output')
else:
output_dir = os.path.join(os.path.dirname(path), 'output')
local_image_dir, local_md_dir = prepare_env(output_dir,
str(Path(path).stem), method)
def read_fn(path):
disk_rw = DiskReaderWriter(os.path.dirname(path))
return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
def parse_doc(doc_path: str):
try:
file_name = str(Path(doc_path).stem)
pdf_data = read_fn(doc_path)
do_parse(
output_dir,
file_name,
pdf_data,
[],
method,
False,
f_draw_span_bbox=False,
f_draw_layout_bbox=False,
f_dump_md=False,
f_dump_middle_json=True,
f_dump_model_json=False,
f_dump_orig_pdf=False,
f_dump_content_list=False,
f_draw_model_bbox=False,
)
middle_json_fn = os.path.join(local_md_dir,
f'{file_name}_middle.json')
with open(middle_json_fn) as fd:
jso = json.load(fd)
os.remove(middle_json_fn)
return convert_middle_json_to_layout_elements(jso, local_image_dir)
except Exception as e:
logger.exception(e)
return parse_doc(path)
if __name__ == '__main__':
import pprint
base_dir = '/opt/data/pdf/resources/samples/'
if 0:
with open(base_dir + 'json_outputs/middle.json') as f:
d = json.load(f)
result = convert_middle_json_to_layout_elements(d, '/tmp')
pprint.pp(result)
if 0:
with open(base_dir + 'json_outputs/middle.3.json') as f:
d = json.load(f)
result = convert_middle_json_to_layout_elements(d, '/tmp')
pprint.pp(result)
if 1:
res = inference(
base_dir + 'samples/pdf/one_page_with_table_image.pdf',
'/tmp/output',
'ocr',
)
pprint.pp(res)
"""
对pdf上的box进行layout识别,并对内部组成的box进行排序
"""
"""对pdf上的box进行layout识别,并对内部组成的box进行排序."""
from loguru import logger
from magic_pdf.layout.bbox_sort import CONTENT_IDX, CONTENT_TYPE_IDX, X0_EXT_IDX, X0_IDX, X1_EXT_IDX, X1_IDX, Y0_EXT_IDX, Y0_IDX, Y1_EXT_IDX, Y1_IDX, paper_bbox_sort
from magic_pdf.layout.layout_det_utils import find_all_left_bbox_direct, find_all_right_bbox_direct, find_bottom_bbox_direct_from_left_edge, find_bottom_bbox_direct_from_right_edge, find_top_bbox_direct_from_left_edge, find_top_bbox_direct_from_right_edge, find_all_top_bbox_direct, find_all_bottom_bbox_direct, get_left_edge_bboxes, get_right_edge_bboxes
from magic_pdf.libs.boxbase import get_bbox_in_boundry
from magic_pdf.layout.bbox_sort import (CONTENT_IDX, CONTENT_TYPE_IDX,
X0_EXT_IDX, X0_IDX, X1_EXT_IDX, X1_IDX,
Y0_EXT_IDX, Y0_IDX, Y1_EXT_IDX, Y1_IDX,
paper_bbox_sort)
from magic_pdf.layout.layout_det_utils import (
find_all_bottom_bbox_direct, find_all_left_bbox_direct,
find_all_right_bbox_direct, find_all_top_bbox_direct,
find_bottom_bbox_direct_from_left_edge,
find_bottom_bbox_direct_from_right_edge,
find_top_bbox_direct_from_left_edge, find_top_bbox_direct_from_right_edge,
get_left_edge_bboxes, get_right_edge_bboxes)
from magic_pdf.libs.boxbase import get_bbox_in_boundary
LAYOUT_V = 'V'
LAYOUT_H = 'H'
LAYOUT_UNPROC = 'U'
LAYOUT_BAD = 'B'
LAYOUT_V = "V"
LAYOUT_H = "H"
LAYOUT_UNPROC = "U"
LAYOUT_BAD = "B"
def _is_single_line_text(bbox):
    """Check whether the text inside the bbox is only a single line."""
    return True  # TODO
    box_type = bbox[CONTENT_TYPE_IDX]
    if box_type != 'text':
        return False
    paras = bbox[CONTENT_IDX]['paras']
    text_content = ''
    for para_id, para in paras.items():  # assemble the paragraph texts inside
        is_title = para['is_title']
        if is_title != 0:
            text_content += f"## {para['text']}"
        else:
            text_content += para['text']
        text_content += '\n\n'

    return bbox[CONTENT_TYPE_IDX] == 'text' and len(text_content.split('\n\n')) <= 1
def _horizontal_split(bboxes: list, boundary: tuple, avg_font_size=20) -> list:
    """
    Split the bboxes horizontally.
    Method: find the boxes that are directly blocked on neither the left nor the
    right, expand them, and then cut along them.
    return:
        several large layout regions [[x0, y0, x1, y1, "h|u|v"], ]; h means
        horizontal, u means undetected, v means vertical layout
    """
    sorted_layout_blocks = []  # the final return value

    bound_x0, bound_y0, bound_x1, bound_y1 = boundary
    all_bboxes = get_bbox_in_boundary(bboxes, boundary)
    # all_bboxes = paper_bbox_sort(all_bboxes, abs(bound_x1-bound_x0), abs(bound_y1-bound_x0))  # rough pre-sort, based on direct occlusion
    """
    First expand, in the horizontal direction, the bboxes that occupy a whole line on their own.
    """
    last_h_split_line_y1 = bound_y0  # record the previous horizontal split line
    for i, bbox in enumerate(all_bboxes):
        left_nearest_bbox = find_all_left_bbox_direct(bbox, all_bboxes)  # non-extended edge
        right_nearest_bbox = find_all_right_bbox_direct(bbox, all_bboxes)
        if left_nearest_bbox is None and right_nearest_bbox is None:  # occupies a whole line
            """
            However, if it is only an isolated line of text, these extra conditions must hold:
            1. the bbox crosses the center line. Or
@@ -62,16 +68,20 @@ def _horizontal_split(bboxes:list, boundry:tuple, avg_font_size=20)-> list:
            3. TODO stronger condition: if the same column sits both above and below this bbox, it cannot count as occupying a whole line
            """
            # First check whether this bbox contains only one line of text
            # is_single_line = _is_single_line_text(bbox)
            """
            One thing to note: when the page content is not centered, the first call
            passes the page boundary, so mid_x is then not the real center line.
            Compute the tightest boundary first and derive mid_x from it.
            """
            boundary_real_x0, boundary_real_x1 = min(
                [bbox[X0_IDX] for bbox in all_bboxes]
            ), max([bbox[X1_IDX] for bbox in all_bboxes])
            mid_x = (boundary_real_x0 + boundary_real_x1) / 2
            # Check whether this box's content crosses the center line,
            # by at least 2 character widths
            is_cross_boundary_mid_line = (
                min(mid_x - bbox[X0_IDX], bbox[X1_IDX] - mid_x) > avg_font_size * 2
            )
            """
            Check condition 2
            """
@@ -84,50 +94,78 @@ def _horizontal_split(bboxes:list, boundry:tuple, avg_font_size=20)-> list:
            """
            Search upward iteratively; the search range is [bound_x0, last_h_sp, bound_x1, bbox[Y0_IDX]]
            """
            # First fix the upper y0, y1
            b_y0, b_y1 = last_h_split_line_y1, bbox[Y0_IDX]
            # Then, starting from this box, walk upward and collect every box that overlaps it on x
            box_to_check = [bound_x0, b_y0, bound_x1, b_y1]
            bbox_in_bound_check = get_bbox_in_boundary(all_bboxes, box_to_check)

            bboxes_on_top = []
            virtual_box = bbox
            while True:
                b_on_top = find_all_top_bbox_direct(virtual_box, bbox_in_bound_check)
                if b_on_top is not None:
                    bboxes_on_top.append(b_on_top)
                    virtual_box = [
                        min([virtual_box[X0_IDX], b_on_top[X0_IDX]]),
                        min(virtual_box[Y0_IDX], b_on_top[Y0_IDX]),
                        max([virtual_box[X1_IDX], b_on_top[X1_IDX]]),
                        b_y1,
                    ]
                else:
                    break

            # Then determine the minimum x0 and maximum x1 of these boxes
            if len(bboxes_on_top) > 0 and len(bboxes_on_top) != len(
                bbox_in_bound_check
            ):  # virtual_box may have grown to fill the whole region, in which case it can no longer belong to one column
                min_x0, max_x1 = virtual_box[X0_IDX], virtual_box[X1_IDX]
                # Then, rather roughly, check whether min_x0/max_x1 intersect any box inside [bound_x0, last_h_sp, bound_x1, bbox[Y0_IDX]]
                if not any(
                    [
                        b[X0_IDX] <= min_x0 - 1 <= b[X1_IDX]
                        or b[X0_IDX] <= max_x1 + 1 <= b[X1_IDX]
                        for b in bbox_in_bound_check
                    ]
                ):
                    # Neither its top nor bottom neighbor may expand into a full line; for now only the top is checked TODO
                    top_nearest_bbox = find_all_top_bbox_direct(bbox, bboxes)
                    bottom_nearest_bbox = find_all_bottom_bbox_direct(bbox, bboxes)
                    if not any(
                        [
                            top_nearest_bbox is not None
                            and (
                                find_all_left_bbox_direct(top_nearest_bbox, bboxes)
                                is None
                                and find_all_right_bbox_direct(top_nearest_bbox, bboxes)
                                is None
                            ),
                            bottom_nearest_bbox is not None
                            and (
                                find_all_left_bbox_direct(bottom_nearest_bbox, bboxes)
                                is None
                                and find_all_right_bbox_direct(
                                    bottom_nearest_bbox, bboxes
                                )
                                is None
                            ),
                            top_nearest_bbox is None or bottom_nearest_bbox is None,
                        ]
                    ):
                        is_belong_to_col = True

            # Check whether it can be absorbed by the column below TODO
            """
            Why is there no is_cross_boundary_mid_line condition here?
            Some magazines indeed have two columns of unequal width.
            """
            if not is_belong_to_col or is_cross_boundary_mid_line:
                bbox[X0_EXT_IDX] = bound_x0
                bbox[Y0_EXT_IDX] = bbox[Y0_IDX]
                bbox[X1_EXT_IDX] = bound_x1
                bbox[Y1_EXT_IDX] = bbox[Y1_IDX]
                last_h_split_line_y1 = bbox[Y1_IDX]  # update this line
            else:
                continue
"""
......@@ -142,13 +180,12 @@ def _horizontal_split(bboxes:list, boundry:tuple, avg_font_size=20)-> list:
        if bbox[X0_EXT_IDX] == bound_x0 and bbox[X1_EXT_IDX] == bound_x1:
            h_bbox_group.append(bbox)
        else:
            if len(h_bbox_group) > 0:
                h_bboxes.append(h_bbox_group)
                h_bbox_group = []
    # the last group
    if len(h_bbox_group) > 0:
        h_bboxes.append(h_bbox_group)
"""
现在h_bboxes里面是所有的group了,每个group都是一个list
对h_bboxes里的每个group进行计算放回到sorted_layouts里
......@@ -157,52 +194,67 @@ def _horizontal_split(bboxes:list, boundry:tuple, avg_font_size=20)-> list:
    for gp in h_bboxes:
        gp.sort(key=lambda x: x[Y0_IDX])
        # Then compute the group's layout_bbox: the minimum x0, y0 and maximum x1, y1
        x0, y0, x1, y1 = (
            gp[0][X0_EXT_IDX],
            gp[0][Y0_EXT_IDX],
            gp[-1][X1_EXT_IDX],
            gp[-1][Y1_EXT_IDX],
        )
        h_layouts.append([x0, y0, x1, y1, LAYOUT_H])  # horizontal layout
"""
接下来利用这些连续的水平bbox的layout_bbox的y0, y1,从水平上切分开其余的为几个部分
"""
h_split_lines = [bound_y0]
for gp in h_bboxes: # gp是一个list[bbox_list]
for gp in h_bboxes: # gp是一个list[bbox_list]
y0, y1 = gp[0][1], gp[-1][3]
h_split_lines.append(y0)
h_split_lines.append(y1)
h_split_lines.append(bound_y1)
    unsplited_bboxes = []
    for i in range(0, len(h_split_lines), 2):
        start_y0, start_y1 = h_split_lines[i : i + 2]
        # Find the other bboxes between [start_y0, start_y1]; they form an unsplit section
        bboxes_in_block = [
            bbox
            for bbox in all_bboxes
            if bbox[Y0_IDX] >= start_y0 and bbox[Y1_IDX] <= start_y1
        ]
        unsplited_bboxes.append(bboxes_in_block)
    # Then append the unprocessed sections to h_layouts
    for bboxes_in_block in unsplited_bboxes:
        if len(bboxes_in_block) == 0:
            continue
        x0, y0, x1, y1 = (
            bound_x0,
            min([bbox[Y0_IDX] for bbox in bboxes_in_block]),
            bound_x1,
            max([bbox[Y1_IDX] for bbox in bboxes_in_block]),
        )
        h_layouts.append([x0, y0, x1, y1, LAYOUT_UNPROC])
    h_layouts.sort(key=lambda x: x[1])  # sort by y0, i.e. top to bottom

    """
    Convert to the following format and return
    """
    for layout in h_layouts:
        sorted_layout_blocks.append(
            {
                'layout_bbox': layout[:4],
                'layout_label': layout[4],
                'sub_layout': [],
            }
        )
    return sorted_layout_blocks
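# --- Editor's sketch (hedged, not part of the commit): the shape _horizontal_split
# --- returns. A page with one full-width title box above two columns typically gives
# [
#     {'layout_bbox': [x0, y0, x1, y_title], 'layout_label': 'H', 'sub_layout': []},
#     {'layout_bbox': [x0, y_title, x1, y1], 'layout_label': 'U', 'sub_layout': []},
# ]
# where the 'U' block is handed to _vertical_split later.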
###############################################################################################
#
# Processing in the vertical direction
#
###############################################################################################
def _vertical_align_split_v1(bboxes: list, boundary: tuple) -> list:
"""
计算垂直方向上的对齐, 并分割bboxes成layout。负责对一列多行的进行列维度分割。
如果不能完全分割,剩余部分作为layout_lable为u的layout返回
......@@ -214,85 +266,124 @@ def _vertical_align_split_v1(bboxes:list, boundry:tuple)-> list:
-------------------------
此函数会将:以上布局将会切分出来2列
"""
    sorted_layout_blocks = []  # the final return value
    new_boundary = [boundary[0], boundary[1], boundary[2], boundary[3]]
    v_blocks = []
    """
    First cut from left to right
    """
    while True:
        all_bboxes = get_bbox_in_boundary(bboxes, new_boundary)
        left_edge_bboxes = get_left_edge_bboxes(all_bboxes)
        if len(left_edge_bboxes) == 0:
            break
        right_split_line_x1 = max([bbox[X1_IDX] for bbox in left_edge_bboxes]) + 1
        # Then check whether this line intersects or touches the left edge of any other bbox
        if any(
            [bbox[X0_IDX] <= right_split_line_x1 <= bbox[X1_IDX] for bbox in all_bboxes]
        ):
            # The vertical split line cuts some boxes, so a full vertical cut is impossible.
            break
        else:  # a column was cut off successfully
            # Use the leftmost left edge among these bboxes as the layout's x0
            layout_x0 = min(
                [bbox[X0_IDX] for bbox in left_edge_bboxes]
            )  # mainly so the drawing keeps some spacing
            v_blocks.append(
                [
                    layout_x0,
                    new_boundary[1],
                    right_split_line_x1,
                    new_boundary[3],
                    LAYOUT_V,
                ]
            )
            new_boundary[0] = right_split_line_x1  # update the boundary
"""
再从右到左切, 此时如果还是无法完全切分,那么剩余部分作为layout_lable为u的layout返回
"""
unsplited_block = []
while True:
all_bboxes = get_bbox_in_boundry(bboxes, new_boundry)
all_bboxes = get_bbox_in_boundary(bboxes, new_boundary)
right_edge_bboxes = get_right_edge_bboxes(all_bboxes)
if len(right_edge_bboxes) == 0:
break
left_split_line_x0 = min([bbox[X0_IDX] for bbox in right_edge_bboxes])-1
left_split_line_x0 = min([bbox[X0_IDX] for bbox in right_edge_bboxes]) - 1
# 然后检查这条线能不与其他bbox的左边界相交或者重合
if any([bbox[X0_IDX] <= left_split_line_x0 <= bbox[X1_IDX] for bbox in all_bboxes]):
if any(
[bbox[X0_IDX] <= left_split_line_x0 <= bbox[X1_IDX] for bbox in all_bboxes]
):
# 这里是余下的
unsplited_block.append([new_boundry[0], new_boundry[1], new_boundry[2], new_boundry[3], LAYOUT_UNPROC])
unsplited_block.append(
[
new_boundary[0],
new_boundary[1],
new_boundary[2],
new_boundary[3],
LAYOUT_UNPROC,
]
)
break
else:
# 找到右侧边界最靠右的bbox作为layout的x1
layout_x1 = max([bbox[X1_IDX] for bbox in right_edge_bboxes])
v_blocks.append([left_split_line_x0, new_boundry[1], layout_x1, new_boundry[3], LAYOUT_V])
new_boundry[2] = left_split_line_x0 # 更新右边界
v_blocks.append(
[
left_split_line_x0,
new_boundary[1],
layout_x1,
new_boundary[3],
LAYOUT_V,
]
)
new_boundary[2] = left_split_line_x0 # 更新右边界
"""
最后拼装成layout格式返回
"""
for block in v_blocks:
sorted_layout_blocks.append({
"layout_bbox": block[:4],
"layout_label":block[4],
"sub_layout":[],
})
sorted_layout_blocks.append(
{
'layout_bbox': block[:4],
'layout_label': block[4],
'sub_layout': [],
}
)
for block in unsplited_block:
sorted_layout_blocks.append({
"layout_bbox": block[:4],
"layout_label":block[4],
"sub_layout":[],
})
sorted_layout_blocks.append(
{
'layout_bbox': block[:4],
'layout_label': block[4],
'sub_layout': [],
}
)
# 按照x0排序
sorted_layout_blocks.sort(key=lambda x: x['layout_bbox'][0])
return sorted_layout_blocks
def _vertical_align_split_v2(bboxes: list, boundary: tuple) -> list:
    """Improved _vertical_align_split algorithm.

    The original algorithm could treat a box of the second column as part of the
    left column because nothing blocked it on the left, so a multi-column layout
    was recognized as a single column. This version starts from the top-left box
    and looks downward, expanding w_x0, w_x1 until it can expand no further or
    reaches the lower boundary."""
    sorted_layout_blocks = []  # the final return value
    new_boundary = [boundary[0], boundary[1], boundary[2], boundary[3]]
    bad_boxes = []  # boxes cut through by a split line
    v_blocks = []
    while True:
        all_bboxes = get_bbox_in_boundary(bboxes, new_boundary)
        if len(all_bboxes) == 0:
            break
        left_top_box = min(
            all_bboxes, key=lambda x: (x[X0_IDX], x[Y0_IDX])
        )  # should be strengthened: verify it really sits in the first column TODO
        start_box = [
            left_top_box[X0_IDX],
            left_top_box[Y0_IDX],
            left_top_box[X1_IDX],
            left_top_box[Y1_IDX],
        ]
        w_x0, w_x1 = left_top_box[X0_IDX], left_top_box[X1_IDX]
"""
然后沿着这个box线向下找最近的那个box, 然后扩展w_x0, w_x1
......@@ -301,98 +392,140 @@ def _vertical_align_split_v2(bboxes:list, boundry:tuple)-> list:
1. 达到,那么更新左边界继续分下一个列
2. 没有达到,那么此时开始从右侧切分进入下面的循环里
"""
        while left_top_box is not None:  # walk downward
            virtual_box = [w_x0, left_top_box[Y0_IDX], w_x1, left_top_box[Y1_IDX]]
            left_top_box = find_bottom_bbox_direct_from_left_edge(
                virtual_box, all_bboxes
            )
            if left_top_box:
                w_x0, w_x1 = min(virtual_box[X0_IDX], left_top_box[X0_IDX]), max(
                    [virtual_box[X1_IDX], left_top_box[X1_IDX]]
                )
        # In case the initial box sits in the middle of the column, also look upward
        start_box = [
            w_x0,
            start_box[Y0_IDX],
            w_x1,
            start_box[Y1_IDX],
        ]  # widen it a bit for robustness
        left_top_box = find_top_bbox_direct_from_left_edge(start_box, all_bboxes)
        while left_top_box is not None:  # walk upward
            virtual_box = [w_x0, left_top_box[Y0_IDX], w_x1, left_top_box[Y1_IDX]]
            left_top_box = find_top_bbox_direct_from_left_edge(virtual_box, all_bboxes)
            if left_top_box:
                w_x0, w_x1 = min(virtual_box[X0_IDX], left_top_box[X0_IDX]), max(
                    [virtual_box[X1_IDX], left_top_box[X1_IDX]]
                )
        # check for intersections
        if any([bbox[X0_IDX] <= w_x1 + 1 <= bbox[X1_IDX] for bbox in all_bboxes]):
            for b in all_bboxes:
                if b[X0_IDX] <= w_x1 + 1 <= b[X1_IDX]:
                    bad_boxes.append([b[X0_IDX], b[Y0_IDX], b[X1_IDX], b[Y1_IDX]])
            break
        else:  # a column was cut off successfully
            v_blocks.append([w_x0, new_boundary[1], w_x1, new_boundary[3], LAYOUT_V])
            new_boundary[0] = w_x1  # update the boundary
"""
接着开始从右上角的box扫描
"""
w_x0 , w_x1 = 0, 0
w_x0, w_x1 = 0, 0
    unsplited_block = []
    while True:
        all_bboxes = get_bbox_in_boundary(bboxes, new_boundary)
        if len(all_bboxes) == 0:
            break
        # First find the one with the largest X1
        bbox_list_sorted = sorted(
            all_bboxes, key=lambda bbox: bbox[X1_IDX], reverse=True
        )
        # Then, find the boxes with the smallest Y0 value
        bigest_x1 = bbox_list_sorted[0][X1_IDX]
        boxes_with_bigest_x1 = [
            bbox for bbox in bbox_list_sorted if bbox[X1_IDX] == bigest_x1
        ]  # i.e. the rightmost ones
        right_top_box = min(
            boxes_with_bigest_x1, key=lambda bbox: bbox[Y0_IDX]
        )  # the one with the smallest y0
        start_box = [
            right_top_box[X0_IDX],
            right_top_box[Y0_IDX],
            right_top_box[X1_IDX],
            right_top_box[Y1_IDX],
        ]
        w_x0, w_x1 = right_top_box[X0_IDX], right_top_box[X1_IDX]
        while right_top_box is not None:
            virtual_box = [w_x0, right_top_box[Y0_IDX], w_x1, right_top_box[Y1_IDX]]
            right_top_box = find_bottom_bbox_direct_from_right_edge(
                virtual_box, all_bboxes
            )
            if right_top_box:
                w_x0, w_x1 = min([w_x0, right_top_box[X0_IDX]]), max(
                    [w_x1, right_top_box[X1_IDX]]
                )
        # scan upward as well
        start_box = [
            w_x0,
            start_box[Y0_IDX],
            w_x1,
            start_box[Y1_IDX],
        ]  # widen it a bit for robustness
        right_top_box = find_top_bbox_direct_from_right_edge(start_box, all_bboxes)
        while right_top_box is not None:
            virtual_box = [w_x0, right_top_box[Y0_IDX], w_x1, right_top_box[Y1_IDX]]
            right_top_box = find_top_bbox_direct_from_right_edge(
                virtual_box, all_bboxes
            )
            if right_top_box:
                w_x0, w_x1 = min([w_x0, right_top_box[X0_IDX]]), max(
                    [w_x1, right_top_box[X1_IDX]]
                )
        # Check for intersections with other boxes; if the vertical split line cuts a box, a full vertical cut is impossible.
        if any([bbox[X0_IDX] <= w_x0 - 1 <= bbox[X1_IDX] for bbox in all_bboxes]):
            unsplited_block.append(
                [
                    new_boundary[0],
                    new_boundary[1],
                    new_boundary[2],
                    new_boundary[3],
                    LAYOUT_UNPROC,
                ]
            )
            for b in all_bboxes:
                if b[X0_IDX] <= w_x0 - 1 <= b[X1_IDX]:
                    bad_boxes.append([b[X0_IDX], b[Y0_IDX], b[X1_IDX], b[Y1_IDX]])
            break
        else:  # a column was cut off successfully
            v_blocks.append([w_x0, new_boundary[1], w_x1, new_boundary[3], LAYOUT_V])
            new_boundary[2] = w_x0
"""转换数据结构"""
for block in v_blocks:
sorted_layout_blocks.append({
"layout_bbox": block[:4],
"layout_label":block[4],
"sub_layout":[],
})
sorted_layout_blocks.append(
{
'layout_bbox': block[:4],
'layout_label': block[4],
'sub_layout': [],
}
)
for block in unsplited_block:
sorted_layout_blocks.append({
"layout_bbox": block[:4],
"layout_label":block[4],
"sub_layout":[],
"bad_boxes": bad_boxes # 记录下来,这个box是被割中的
})
sorted_layout_blocks.append(
{
'layout_bbox': block[:4],
'layout_label': block[4],
'sub_layout': [],
'bad_boxes': bad_boxes, # 记录下来,这个box是被割中的
}
)
# 按照x0排序
sorted_layout_blocks.sort(key=lambda x: x['layout_bbox'][0])
return sorted_layout_blocks
def _try_horizontal_mult_column_split(bboxes: list, boundary: tuple) -> list:
    """
    Try a horizontal cut; if nothing can be cut, return it as a BAD_LAYOUT
    ------------------
@@ -406,51 +539,58 @@ def _try_horizontal_mult_column_split(bboxes:list, boundry:tuple)-> list:
    pass
def _vertical_split(bboxes: list, boundary: tuple) -> list:
    """
    Cut along the vertical direction into blocks.
    In this version, if a vertical cut is impossible, return it as a BAD_LAYOUT
    --------------------------
    | | |
    | | |
    | |
    this kind of column is what this function cuts -> | |
    | |
    | | |
    | | |
    -------------------------
    """
    sorted_layout_blocks = []  # the final return value
    bound_x0, bound_y0, bound_x1, bound_y1 = boundary
    all_bboxes = get_bbox_in_boundary(bboxes, boundary)
    """
    all_bboxes = fix_vertical_bbox_pos(all_bboxes)  # resolve overlaps vertically
    all_bboxes = fix_hor_bbox_pos(all_bboxes)  # resolve overlaps horizontally
    These two lines are not executed for now: formula and table detection are not
    mature yet, so far too many textblocks would join the computation and the time
    cost would be too high. What they do: when bboxes overlap each other, compress
    the box with the smaller area to avoid the overlap, which gives positive
    feedback to the layout cut.
    """
    # all_bboxes = paper_bbox_sort(all_bboxes, abs(bound_x1-bound_x0), abs(bound_y1-bound_x0))  # rough pre-sort, based on direct occlusion
    """
    First expand, in the vertical direction, the bboxes that occupy a whole column on their own
    """
    for bbox in all_bboxes:
        top_nearest_bbox = find_all_top_bbox_direct(bbox, all_bboxes)  # non-extended edge
        bottom_nearest_bbox = find_all_bottom_bbox_direct(bbox, all_bboxes)
        if (
            top_nearest_bbox is None
            and bottom_nearest_bbox is None
            and not any(
                [
                    b[X0_IDX] < bbox[X1_IDX] < b[X1_IDX]
                    or b[X0_IDX] < bbox[X0_IDX] < b[X1_IDX]
                    for b in all_bboxes
                ]
            )
        ):  # occupies a whole column and overlaps no other box
            bbox[X0_EXT_IDX] = bbox[X0_IDX]
            bbox[Y0_EXT_IDX] = bound_y0
            bbox[X1_EXT_IDX] = bbox[X1_IDX]
            bbox[Y1_EXT_IDX] = bound_y1
"""
"""
此时独占一列的被成功扩展到指定的边界上,这个时候利用边界条件合并连续的bbox,成为一个group
然后合并所有连续垂直方向的bbox.
"""
......@@ -458,20 +598,23 @@ def _vertical_split(bboxes:list, boundry:tuple)-> list:
# fix: 这里水平方向的列不要合并成一个行,因为需要保证返回给下游的最小block,总是可以无脑从上到下阅读文字。
v_bboxes = []
for box in all_bboxes:
if box[Y0_EXT_IDX] == bound_y0 and box[Y1_EXT_IDX] == bound_y1:
if box[Y0_EXT_IDX] == bound_y0 and box[Y1_EXT_IDX] == bound_y1:
v_bboxes.append(box)
"""
现在v_bboxes里面是所有的group了,每个group都是一个list
对v_bboxes里的每个group进行计算放回到sorted_layouts里
"""
v_layouts = []
for vbox in v_bboxes:
#gp.sort(key=lambda x: x[X0_IDX])
# gp.sort(key=lambda x: x[X0_IDX])
# 然后计算这个group的layout_bbox,也就是最小的x0,y0, 最大的x1,y1
x0, y0, x1, y1 = vbox[X0_EXT_IDX], vbox[Y0_EXT_IDX], vbox[X1_EXT_IDX], vbox[Y1_EXT_IDX]
v_layouts.append([x0, y0, x1, y1, LAYOUT_V]) # 垂直的布局
x0, y0, x1, y1 = (
vbox[X0_EXT_IDX],
vbox[Y0_EXT_IDX],
vbox[X1_EXT_IDX],
vbox[Y1_EXT_IDX],
)
v_layouts.append([x0, y0, x1, y1, LAYOUT_V]) # 垂直的布局
"""
接下来利用这些连续的垂直bbox的layout_bbox的x0, x1,从垂直上切分开其余的为几个部分
"""
......@@ -481,29 +624,41 @@ def _vertical_split(bboxes:list, boundry:tuple)-> list:
v_split_lines.append(x0)
v_split_lines.append(x1)
v_split_lines.append(bound_x1)
    unsplited_bboxes = []
    for i in range(0, len(v_split_lines), 2):
        start_x0, start_x1 = v_split_lines[i : i + 2]
        # Find the other bboxes between [start_x0, start_x1]; they form an unsplit section
        bboxes_in_block = [
            bbox
            for bbox in all_bboxes
            if bbox[X0_IDX] >= start_x0 and bbox[X1_IDX] <= start_x1
        ]
        unsplited_bboxes.append(bboxes_in_block)
    # Then append the unprocessed sections to v_layouts
    for bboxes_in_block in unsplited_bboxes:
        if len(bboxes_in_block) == 0:
            continue
        x0, y0, x1, y1 = (
            min([bbox[X0_IDX] for bbox in bboxes_in_block]),
            bound_y0,
            max([bbox[X1_IDX] for bbox in bboxes_in_block]),
            bound_y1,
        )
        v_layouts.append(
            [x0, y0, x1, y1, LAYOUT_UNPROC]
        )  # this region could not be resolved into a reliable layout
    v_layouts.sort(key=lambda x: x[0])  # sort by x0, i.e. left to right
    for layout in v_layouts:
        sorted_layout_blocks.append(
            {
                'layout_bbox': layout[:4],
                'layout_label': layout[4],
                'sub_layout': [],
            }
        )
"""
至此,垂直方向切成了2种类型,其一是独占一列的,其二是未处理的。
下面对这些未处理的进行垂直方向切分,这个切分要切出来类似“吕”这种类型的垂直方向的布局
......@@ -513,24 +668,32 @@ def _vertical_split(bboxes:list, boundry:tuple)-> list:
        x0, y0, x1, y1 = layout['layout_bbox']
        v_split_layouts = _vertical_align_split_v2(bboxes, [x0, y0, x1, y1])
        sorted_layout_blocks[i] = {
            'layout_bbox': [x0, y0, x1, y1],
            'layout_label': LAYOUT_H,
            'sub_layout': v_split_layouts,
        }
        layout['layout_label'] = LAYOUT_H  # cut by a vertical line into a horizontal layout
    return sorted_layout_blocks
def split_layout(bboxes: list, boundary: tuple, page_num: int) -> list:
    """
    Cut the bboxes into layouts
    return:
    [
        {
            "layout_bbox": [x0,y0,x1,y1],
            "layout_label":"u|v|h|b", unprocessed|vertical|horizontal|BAD_LAYOUT
            "sub_layout":[] # each element is [
                x0,y0,
                x1,y1,
                block_content,
                idx_x,idx_y,
                content_type,
                ext_x0,ext_y0,
                ext_x1,ext_y1
            ], in reading order
        }
    ]
    example:
@@ -539,7 +702,7 @@ def split_layout(bboxes:list, boundry:tuple, page_num:int)-> list:
"layout_bbox": [0, 0, 100, 100],
"layout_label":"u|v|h|b",
"sub_layout":[
]
},
{
@@ -559,35 +722,35 @@ def split_layout(bboxes:list, boundry:tuple, page_num:int)-> list:
            "layout_bbox": [0, 0, 100, 100],
            "layout_label":"u|v|h|b",
            "sub_layout":[
            ]
        }
    ]
    """
    sorted_layouts = []  # the final result to return
    boundary_x0, boundary_y0, boundary_x1, boundary_y1 = boundary
    if len(bboxes) <= 1:
        return [
            {
                'layout_bbox': [boundary_x0, boundary_y0, boundary_x1, boundary_y1],
                'layout_label': LAYOUT_V,
                'sub_layout': [],
            }
        ]
"""
接下来按照先水平后垂直的顺序进行切分
"""
bboxes = paper_bbox_sort(bboxes, boundry_x1-boundry_x0, boundry_y1-boundry_y0)
sorted_layouts = _horizontal_split(bboxes, boundry) # 通过水平分割出来的layout
bboxes = paper_bbox_sort(
bboxes, boundary_x1 - boundary_x0, boundary_y1 - boundary_y0
)
sorted_layouts = _horizontal_split(bboxes, boundary) # 通过水平分割出来的layout
    for i, layout in enumerate(sorted_layouts):
        x0, y0, x1, y1 = layout['layout_bbox']
        layout_type = layout['layout_label']
        if layout_type == LAYOUT_UNPROC:  # not an exclusive single line; these need a vertical cut
            v_split_layouts = _vertical_split(bboxes, [x0, y0, x1, y1])
            """
            One logical catch at the end: if this function separates only a single
            column layout, the cut is definitely beyond the algorithm's ability. We
            assume the incoming boxes already have all full-width rows stripped, so
            there must be multiple columns here. If only one layout comes out and it
            holds multiple boxes, that layout cannot be cut; mark it LAYOUT_UNPROC
@@ -596,28 +759,26 @@ def split_layout(bboxes:list, boundry:tuple, page_num:int)-> list:
            if len(v_split_layouts) == 1:
                if len(v_split_layouts[0]['sub_layout']) == 0:
                    layout_label = LAYOUT_UNPROC
                    # logger.warning(f"WARNING: pageno={page_num}, unsplittable layout: ", v_split_layouts)
"""
组合起来最终的layout
"""
sorted_layouts[i] = {
"layout_bbox": [x0, y0, x1, y1],
"layout_label": layout_label,
"sub_layout": v_split_layouts
'layout_bbox': [x0, y0, x1, y1],
'layout_label': layout_label,
'sub_layout': v_split_layouts,
}
layout['layout_label'] = LAYOUT_H
"""
水平和垂直方向都切分完毕了。此时还有一些未处理的,这些未处理的可能是因为水平和垂直方向都无法切分。
这些最后调用_try_horizontal_mult_block_split做一次水平多个block的联合切分,如果也不能切分最终就当做BAD_LAYOUT返回
"""
# TODO
return sorted_layouts
def get_bboxes_layout(all_boxes: list, boundary: tuple, page_id: int):
    """
    Sort the boxes that have already been ordered by layout
    return:
@@ -628,10 +789,10 @@ def get_bboxes_layout(all_boxes:list, boundry:tuple, page_id:int):
        },
    ]
    """
    def _preorder_traversal(layout):
        """Sort the leaf nodes of sorted_layouts, i.e. the nodes with
        len(sub_layout)==0, following a preorder traversal: top to bottom, left to
        right."""
        sorted_layout_blocks = []
        for layout in layout:
            sub_layout = layout['sub_layout']
@@ -641,71 +802,89 @@ def get_bboxes_layout(all_boxes:list, boundry:tuple, page_id:int):
            s = _preorder_traversal(sub_layout)
            sorted_layout_blocks.extend(s)
        return sorted_layout_blocks
    # -------------------------------------------------------------------------------------------------------------------------
    sorted_layouts = split_layout(
        all_boxes, boundary, page_id
    )  # first cut into layouts, producing a tree
    total_sorted_layout_blocks = _preorder_traversal(sorted_layouts)
    return total_sorted_layout_blocks, sorted_layouts
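# --- Editor's usage sketch (hedged, not part of the commit): all_boxes entries are
# --- the extended records used above (x0, y0, x1, y1, content, idx_x, idx_y,
# --- content_type, ext_x0, ext_y0, ext_x1, ext_y1, ...); the page size is illustrative.
# flat_leaves, layout_tree = get_bboxes_layout(all_boxes, (0, 0, 595, 842), page_id=0)
# flat_leaves: leaf layouts in reading order (preorder traversal)
# layout_tree: the nested H/V tree built by split_layout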
def get_columns_cnt_of_layout(layout_tree):
    """Get the width (column count) of a layout."""
    max_width_list = [0]  # seed one element so max()/min() cannot raise
    for items in layout_tree:  # count columns per (horizontal) level; a horizontal strip counts as one column
        layout_type = items['layout_label']
        sub_layouts = items['sub_layout']
        if len(sub_layouts) == 0:
            max_width_list.append(1)
        else:
            if layout_type == LAYOUT_H:
                max_width_list.append(1)
            else:
                width = 0
                for sub_layout in sub_layouts:
                    if len(sub_layout['sub_layout']) == 0:
                        width += 1
                    else:
                        for lay in sub_layout['sub_layout']:
width += get_columns_cnt_of_layout([lay])
max_width_list.append(width)
return max(max_width_list)
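# --- Editor's worked example (hedged, not part of the commit): one vertical level
# --- holding two leaf columns counts as 2 columns.
assert get_columns_cnt_of_layout([{
    'layout_label': 'V',
    'sub_layout': [
        {'layout_label': 'V', 'sub_layout': []},
        {'layout_label': 'V', 'sub_layout': []},
    ],
}]) == 2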
def sort_with_layout(bboxes: list, page_width, page_height) -> (list, list):
    """Input is a list of bboxes.

    Split the input into layouts first, then sort the bboxes within them. Returns
    the sorted bboxes.
    """
    new_bboxes = []
    for box in bboxes:
        # new_bboxes.append([box[0], box[1], box[2], box[3], None, None, None, 'text', None, None, None, None])
        new_bboxes.append(
            [
                box[0],
                box[1],
                box[2],
                box[3],
                None,
                None,
                None,
                'text',
                None,
                None,
                None,
                None,
                box[4],
            ]
        )

    layout_bboxes, _ = get_bboxes_layout(
        new_bboxes, tuple([0, 0, page_width, page_height]), 0
    )
    if any([lay['layout_label'] == LAYOUT_UNPROC for lay in layout_bboxes]):
        logger.warning('drop this pdf, reason: complex layout')
        return None, None

    sorted_bboxes = []
    # use each layout bbox to frame some boxes, then sort within it
    for layout in layout_bboxes:
        lbox = layout['layout_bbox']
        bbox_in_layout = get_bbox_in_boundary(new_bboxes, lbox)
        sorted_bbox = paper_bbox_sort(
            bbox_in_layout, lbox[2] - lbox[0], lbox[3] - lbox[1]
        )
        sorted_bboxes.extend(sorted_bbox)
    return sorted_bboxes, layout_bboxes
def sort_text_block(text_block, layout_bboxes):
    """Sort the text_blocks of one page."""
    sorted_text_bbox = []
    all_text_bbox = []
    # build a box => text mapping
@@ -714,19 +893,29 @@ def sort_text_block(text_block, layout_bboxes):
        box = blk['bbox']
        box_to_text[(box[0], box[1], box[2], box[3])] = blk
        all_text_bbox.append(box)

    # text_blocks_to_sort = []
    # for box in box_to_text.keys():
    #     text_blocks_to_sort.append([box[0], box[1], box[2], box[3], None, None, None, 'text', None, None, None, None])

    # sort the text_blocks following the order of layout_bboxes
    for layout in layout_bboxes:
        layout_box = layout['layout_bbox']
        text_bbox_in_layout = get_bbox_in_boundary(
            all_text_bbox,
            [
                layout_box[0] - 1,
                layout_box[1] - 1,
                layout_box[2] + 1,
                layout_box[3] + 1,
            ],
        )
        # sorted_bbox = paper_bbox_sort(text_bbox_in_layout, layout_box[2]-layout_box[0], layout_box[3]-layout_box[1])
        text_bbox_in_layout.sort(
            key=lambda x: x[1]
        )  # boxes inside one layout are sorted top-down by y0
        # sorted_bbox = [[b] for b in text_blocks_to_sort]
        for sb in text_bbox_in_layout:
            sorted_text_bbox.append(box_to_text[(sb[0], sb[1], sb[2], sb[3])])
    return sorted_text_bbox
from loguru import logger
import math
def _is_in_or_part_overlap(box1, box2) -> bool:
    """Whether two bboxes partially overlap or one contains the other."""
    if box1 is None or box2 is None:
        return False

    x0_1, y0_1, x1_1, y1_1 = box1
    x0_2, y0_2, x1_2, y1_2 = box2

    return not (x1_1 < x0_2 or  # box1 is left of box2
                x0_1 > x1_2 or  # box1 is right of box2
                y1_1 < y0_2 or  # box1 is above box2
                y0_1 > y1_2)  # box1 is below box2
def _is_in_or_part_overlap_with_area_ratio(box1,
                                           box2,
                                           area_ratio_threshold=0.6):
    """Whether box1 is inside box2, or the two partially overlap with the overlap
    area covering more than area_ratio_threshold of box1's area."""
    if box1 is None or box2 is None:
        return False

    x0_1, y0_1, x1_1, y1_1 = box1
    x0_2, y0_2, x1_2, y1_2 = box2

    if not _is_in_or_part_overlap(box1, box2):
        return False

    # compute the overlap area
    x_left = max(x0_1, x0_2)
    y_top = max(y0_1, y0_2)
    x_right = min(x1_1, x1_2)
    y_bottom = min(y1_1, y1_2)
    overlap_area = (x_right - x_left) * (y_bottom - y_top)

    # compute box1's area
    box1_area = (x1_1 - x0_1) * (y1_1 - y0_1)

    return overlap_area / box1_area > area_ratio_threshold
def _is_in(box1, box2) -> bool:
    """Whether box1 is completely inside box2."""
    x0_1, y0_1, x1_1, y1_1 = box1
    x0_2, y0_2, x1_2, y1_2 = box2

    return (x0_1 >= x0_2 and  # box1's left edge is not left of box2's
            y0_1 >= y0_2 and  # box1's top edge is not above box2's
            x1_1 <= x1_2 and  # box1's right edge is not right of box2's
            y1_1 <= y1_2)  # box1's bottom edge is not below box2's
def _is_part_overlap(box1, box2) -> bool:
    """Whether two bboxes partially overlap without full containment."""
if box1 is None or box2 is None:
return False
return _is_in_or_part_overlap(box1, box2) and not _is_in(box1, box2)
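# --- Editor's sanity checks (hedged, not part of the commit): how the three
# --- predicates above relate on toy boxes.
assert _is_in_or_part_overlap([0, 0, 10, 10], [5, 5, 15, 15])   # partial overlap
assert _is_part_overlap([0, 0, 10, 10], [5, 5, 15, 15])         # overlap without containment
assert _is_in([2, 2, 4, 4], [0, 0, 10, 10])                     # full containment
assert not _is_part_overlap([2, 2, 4, 4], [0, 0, 10, 10])       # contained, hence not partial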
def _left_intersect(left_box, right_box):
"检查两个box的左边界是否有交集,也就是left_box的右边界是否在right_box的左边界内"
"""检查两个box的左边界是否有交集,也就是left_box的右边界是否在right_box的左边界内."""
if left_box is None or right_box is None:
return False
x0_1, y0_1, x1_1, y1_1 = left_box
x0_2, y0_2, x1_2, y1_2 = right_box
return x1_1>x0_2 and x0_1<x0_2 and (y0_1<=y0_2<=y1_1 or y0_1<=y1_2<=y1_1)
return x1_1 > x0_2 and x0_1 < x0_2 and (y0_1 <= y0_2 <= y1_1
or y0_1 <= y1_2 <= y1_1)
def _right_intersect(left_box, right_box):
"""
检查box是否在右侧边界有交集,也就是left_box的左边界是否在right_box的右边界内
"""
"""检查box是否在右侧边界有交集,也就是left_box的左边界是否在right_box的右边界内."""
if left_box is None or right_box is None:
return False
x0_1, y0_1, x1_1, y1_1 = left_box
x0_2, y0_2, x1_2, y1_2 = right_box
return x0_1<x1_2 and x1_1>x1_2 and (y0_1<=y0_2<=y1_1 or y0_1<=y1_2<=y1_1)
return x0_1 < x1_2 and x1_1 > x1_2 and (y0_1 <= y0_2 <= y1_1
or y0_1 <= y1_2 <= y1_1)
def _is_vertical_full_overlap(box1, box2, x_torlence=2):
    """On x: either box1 contains box2 or box2 contains box1, partial containment
    is not allowed. On y: box1 and box2 overlap."""
    # parse the box coordinates
    x11, y11, x12, y12 = box1  # top-left and bottom-right (x1, y1, x2, y2)
    x21, y21, x22, y22 = box2

    # on x, does box1 contain box2 or box2 contain box1
    contains_in_x = (x11 - x_torlence <= x21 and x12 + x_torlence >= x22) or (
        x21 - x_torlence <= x11 and x22 + x_torlence >= x12)

    # on y, do box1 and box2 overlap
    overlap_in_y = not (y12 < y21 or y11 > y22)

    return contains_in_x and overlap_in_y
def _is_bottom_full_overlap(box1, box2, y_tolerance=2):
    """Check whether box1's bottom slightly overlaps box2's top, with the amount
    bounded by y_tolerance. Unlike _is_vertical_full_overlap, this allows box1 and
    box2 to overlap slightly on x, i.e. a degree of fuzziness."""
    if box1 is None or box2 is None:
        return False

    x0_1, y0_1, x1_1, y1_1 = box1
    x0_2, y0_2, x1_2, y1_2 = box2
    tolerance_margin = 2
    is_xdir_full_overlap = (
        (x0_1 - tolerance_margin <= x0_2 <= x1_1 + tolerance_margin
         and x0_1 - tolerance_margin <= x1_2 <= x1_1 + tolerance_margin)
        or (x0_2 - tolerance_margin <= x0_1 <= x1_2 + tolerance_margin
            and x0_2 - tolerance_margin <= x1_1 <= x1_2 + tolerance_margin))

    return y0_2 < y1_1 and 0 < (y1_1 -
                                y0_2) < y_tolerance and is_xdir_full_overlap
def _is_left_overlap(
    box1,
    box2,
):
    """Check whether box1's left side overlaps box2. On Y the overlap may be
    partial or full, and the vertical order of box1 and box2 does not matter:
    the overlap is detected whichever box is on top. On X."""
def __overlap_y(Ay1, Ay2, By1, By2):
return max(0, min(Ay2, By2) - max(Ay1, By1))
if box1 is None or box2 is None:
return False
x0_1, y0_1, x1_1, y1_1 = box1
x0_2, y0_2, x1_2, y1_2 = box2
y_overlap_len = __overlap_y(y0_1, y1_1, y0_2, y1_2)
    ratio_1 = 1.0 * y_overlap_len / (y1_1 - y0_1) if y1_1 - y0_1 != 0 else 0
    ratio_2 = 1.0 * y_overlap_len / (y1_2 - y0_2) if y1_2 - y0_2 != 0 else 0
    vertical_overlap_cond = ratio_1 >= 0.5 or ratio_2 >= 0.5

    # vertical_overlap_cond = y0_1<=y0_2<=y1_1 or y0_1<=y1_2<=y1_1 or y0_2<=y0_1<=y1_2 or y0_2<=y1_1<=y1_2
    return x0_1 <= x0_2 <= x1_1 and vertical_overlap_cond
def __is_overlaps_y_exceeds_threshold(bbox1,
                                      bbox2,
                                      overlap_ratio_threshold=0.8):
    """Check whether two bboxes overlap on the y axis and the overlap height
    exceeds 80% of the shorter bbox's height."""
_, y0_1, _, y1_1 = bbox1
_, y0_2, _, y1_2 = bbox2
overlap = max(0, min(y1_1, y1_2) - max(y0_1, y0_2))
height1, height2 = y1_1 - y0_1, y1_2 - y0_2
    # max_height = max(height1, height2)
    min_height = min(height1, height2)
return (overlap / min_height) > overlap_ratio_threshold
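# --- Editor's worked example (hedged, not part of the commit): two 10-unit-tall
# --- spans offset by 1 unit overlap by 9 units on y; 9/10 = 0.9 > 0.8, so they
# --- count as the same text line.
assert __is_overlaps_y_exceeds_threshold([0, 0, 5, 10], [6, 1, 12, 11])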
def calculate_iou(bbox1, bbox2):
    """Compute the intersection over union (IOU) of two bounding boxes.

    Args:
        bbox1 (list[float]): coordinates of the first box, [x1, y1, x2, y2], with (x1, y1) the top-left and (x2, y2) the bottom-right corner.
@@ -170,7 +168,6 @@ def calculate_iou(bbox1, bbox2):
    Returns:
        float: the IOU of the two boxes, in the range [0, 1].
    """
# Determine the coordinates of the intersection rectangle
x_left = max(bbox1[0], bbox2[0])
@@ -188,16 +185,15 @@ def calculate_iou(bbox1, bbox2):
bbox1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
bbox2_area = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
    # Compute the intersection over union by taking the intersection area
    # and dividing it by the sum of both areas minus the intersection area
    iou = intersection_area / float(bbox1_area + bbox2_area -
                                    intersection_area)
    return iou
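# --- Editor's worked example (hedged; part of the intersection code is elided
# --- above): two 10x10 boxes sharing a 5x5 corner give
# --- IOU = 25 / (100 + 100 - 25) = 25/175 ≈ 0.143.
# assert abs(calculate_iou([0, 0, 10, 10], [5, 5, 15, 15]) - 25 / 175) < 1e-9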
def calculate_overlap_area_2_minbox_area_ratio(bbox1, bbox2):
    """Compute the ratio of the overlap area of box1 and box2 to the area of the
    smaller box."""
# Determine the coordinates of the intersection rectangle
x_left = max(bbox1[0], bbox2[0])
y_top = max(bbox1[1], bbox2[1])
......@@ -209,16 +205,16 @@ def calculate_overlap_area_2_minbox_area_ratio(bbox1, bbox2):
# The area of overlap area
intersection_area = (x_right - x_left) * (y_bottom - y_top)
    min_box_area = min([(bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1]),
                        (bbox2[3] - bbox2[1]) * (bbox2[2] - bbox2[0])])
    if min_box_area == 0:
return 0
else:
return intersection_area / min_box_area
def calculate_overlap_area_in_bbox1_area_ratio(bbox1, bbox2):
    """Compute the ratio of the overlap area of box1 and box2 to bbox1's area."""
# Determine the coordinates of the intersection rectangle
x_left = max(bbox1[0], bbox2[0])
y_top = max(bbox1[1], bbox2[1])
@@ -230,7 +226,7 @@ def calculate_overlap_area_in_bbox1_area_ratio(bbox1, bbox2):
# The area of overlap area
intersection_area = (x_right - x_left) * (y_bottom - y_top)
    bbox1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
if bbox1_area == 0:
return 0
else:
@@ -238,11 +234,8 @@ def calculate_overlap_area_in_bbox1_area_ratio(bbox1, bbox2):
def get_minbox_if_overlap_by_ratio(bbox1, bbox2, ratio):
    """Use calculate_overlap_area_2_minbox_area_ratio to compute the ratio of the
    overlap area to the smaller box's area; if it exceeds `ratio`, return the
    smaller bbox, otherwise return None."""
    x1_min, y1_min, x1_max, y1_max = bbox1
    x2_min, y2_min, x2_max, y2_max = bbox2
    area1 = (x1_max - x1_min) * (y1_max - y1_min)
@@ -256,89 +249,118 @@ def get_minbox_if_overlap_by_ratio(bbox1, bbox2, ratio):
    else:
        return None
def get_bbox_in_boundary(bboxes: list, boundary: tuple) -> list:
    x0, y0, x1, y1 = boundary
    new_boxes = [
        box for box in bboxes
        if box[0] >= x0 and box[1] >= y0 and box[2] <= x1 and box[3] <= y1
    ]
    return new_boxes
def is_vbox_on_side(bbox, width, height, side_threshold=0.2):
    """Whether a bbox sits at the edge of the pdf page."""
    x0, x1 = bbox[0], bbox[2]
    if x1 <= width * side_threshold or x0 >= width * (1 - side_threshold):
        return True
    return False
def find_top_nearest_text_bbox(pymu_blocks, obj_bbox):
tolerance_margin = 4
    top_boxes = [
        box for box in pymu_blocks
        if obj_bbox[1] - box['bbox'][3] >= -tolerance_margin
        and not _is_in(box['bbox'], obj_bbox)
    ]

    # then keep those that overlap on X
    top_boxes = [
        box for box in top_boxes if any([
            obj_bbox[0] - tolerance_margin <= box['bbox'][0] <= obj_bbox[2] +
            tolerance_margin, obj_bbox[0] -
            tolerance_margin <= box['bbox'][2] <= obj_bbox[2] +
            tolerance_margin, box['bbox'][0] -
            tolerance_margin <= obj_bbox[0] <= box['bbox'][2] +
            tolerance_margin, box['bbox'][0] -
            tolerance_margin <= obj_bbox[2] <= box['bbox'][2] +
            tolerance_margin
        ])
    ]

    # then take the one with the largest y1
    if len(top_boxes) > 0:
top_boxes.sort(key=lambda x: x['bbox'][3], reverse=True)
return top_boxes[0]
else:
return None
def find_bottom_nearest_text_bbox(pymu_blocks, obj_bbox):
    bottom_boxes = [
        box for box in pymu_blocks if box['bbox'][1] -
        obj_bbox[3] >= -2 and not _is_in(box['bbox'], obj_bbox)
    ]

    # then keep those that overlap on X
    bottom_boxes = [
        box for box in bottom_boxes if any([
            obj_bbox[0] - 2 <= box['bbox'][0] <= obj_bbox[2] + 2, obj_bbox[0] -
            2 <= box['bbox'][2] <= obj_bbox[2] + 2, box['bbox'][0] -
            2 <= obj_bbox[0] <= box['bbox'][2] + 2, box['bbox'][0] -
            2 <= obj_bbox[2] <= box['bbox'][2] + 2
        ])
    ]

    # then take the one with the smallest y0
    if len(bottom_boxes) > 0:
bottom_boxes.sort(key=lambda x: x['bbox'][1], reverse=False)
return bottom_boxes[0]
else:
return None
def find_left_nearest_text_bbox(pymu_blocks, obj_bbox):
"""
寻找左侧最近的文本block
"""
left_boxes = [box for box in pymu_blocks if obj_bbox[0]-box['bbox'][2]>=-2 and not _is_in(box['bbox'], obj_bbox)]
"""寻找左侧最近的文本block."""
left_boxes = [
box for box in pymu_blocks if obj_bbox[0] -
box['bbox'][2] >= -2 and not _is_in(box['bbox'], obj_bbox)
]
# 然后找到X方向上有互相重叠的
left_boxes = [box for box in left_boxes if any([obj_bbox[1]-2 <=box['bbox'][1]<=obj_bbox[3]+2,
obj_bbox[1]-2 <=box['bbox'][3]<=obj_bbox[3]+2,
box['bbox'][1]-2 <=obj_bbox[1]<=box['bbox'][3]+2,
box['bbox'][1]-2 <=obj_bbox[3]<=box['bbox'][3]+2
])]
left_boxes = [
box for box in left_boxes if any([
obj_bbox[1] - 2 <= box['bbox'][1] <= obj_bbox[3] + 2, obj_bbox[1] -
2 <= box['bbox'][3] <= obj_bbox[3] + 2, box['bbox'][1] -
2 <= obj_bbox[1] <= box['bbox'][3] + 2, box['bbox'][1] -
2 <= obj_bbox[3] <= box['bbox'][3] + 2
])
]
# 然后找到x1最大的那个
if len(left_boxes)>0:
if len(left_boxes) > 0:
left_boxes.sort(key=lambda x: x['bbox'][2], reverse=True)
return left_boxes[0]
else:
return None
def find_right_nearest_text_bbox(pymu_blocks, obj_bbox):
"""
寻找右侧最近的文本block
"""
right_boxes = [box for box in pymu_blocks if box['bbox'][0]-obj_bbox[2]>=-2 and not _is_in(box['bbox'], obj_bbox)]
"""寻找右侧最近的文本block."""
right_boxes = [
box for box in pymu_blocks if box['bbox'][0] -
obj_bbox[2] >= -2 and not _is_in(box['bbox'], obj_bbox)
]
# 然后找到X方向上有互相重叠的
right_boxes = [box for box in right_boxes if any([obj_bbox[1]-2 <=box['bbox'][1]<=obj_bbox[3]+2,
obj_bbox[1]-2 <=box['bbox'][3]<=obj_bbox[3]+2,
box['bbox'][1]-2 <=obj_bbox[1]<=box['bbox'][3]+2,
box['bbox'][1]-2 <=obj_bbox[3]<=box['bbox'][3]+2
])]
right_boxes = [
box for box in right_boxes if any([
obj_bbox[1] - 2 <= box['bbox'][1] <= obj_bbox[3] + 2, obj_bbox[1] -
2 <= box['bbox'][3] <= obj_bbox[3] + 2, box['bbox'][1] -
2 <= obj_bbox[1] <= box['bbox'][3] + 2, box['bbox'][1] -
2 <= obj_bbox[3] <= box['bbox'][3] + 2
])
]
# 然后找到x0最小的那个
if len(right_boxes)>0:
if len(right_boxes) > 0:
right_boxes.sort(key=lambda x: x['bbox'][0], reverse=False)
return right_boxes[0]
else:
@@ -346,8 +368,7 @@ def find_right_nearest_text_bbox(pymu_blocks, obj_bbox):
def bbox_relative_pos(bbox1, bbox2):
    """Determine the relative position of two rectangles.

    Args:
        bbox1: a 4-tuple (x1, y1, x1b, y1b) with the top-left and bottom-right coordinates of the first rectangle
@@ -357,20 +378,19 @@ def bbox_relative_pos(bbox1, bbox2):
        a 4-tuple (left, right, bottom, top) giving rectangle 1's position relative
        to rectangle 2: left means rectangle 1 is on the left side of rectangle 2,
        right on the right side, bottom below, top above
    """
x1, y1, x1b, y1b = bbox1
x2, y2, x2b, y2b = bbox2
left = x2b < x1
right = x1b < x2
bottom = y2b < y1
top = y1b < y2
return left, right, bottom, top
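# --- Editor's worked example (hedged, not part of the commit): for
# --- bbox1 = (0, 0, 10, 10) and bbox2 = (20, 20, 30, 30):
# --- left = (30 < 0) = False, right = (10 < 20) = True,
# --- bottom = (30 < 0) = False, top = (10 < 20) = True.
assert bbox_relative_pos((0, 0, 10, 10), (20, 20, 30, 30)) == (False, True, False, True)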
def bbox_distance(bbox1, bbox2):
"""
计算两个矩形框的距离。
"""计算两个矩形框的距离。
Args:
bbox1 (tuple): 第一个矩形框的坐标,格式为 (x1, y1, x2, y2),其中 (x1, y1) 为左上角坐标,(x2, y2) 为右下角坐标。
......@@ -378,16 +398,17 @@ def bbox_distance(bbox1, bbox2):
Returns:
float: 矩形框之间的距离。
"""
def dist(point1, point2):
return math.sqrt((point1[0]-point2[0])**2 + (point1[1]-point2[1])**2)
return math.sqrt((point1[0] - point2[0])**2 +
(point1[1] - point2[1])**2)
x1, y1, x1b, y1b = bbox1
x2, y2, x2b, y2b = bbox2
left, right, bottom, top = bbox_relative_pos(bbox1, bbox2)
if top and left:
return dist((x1, y1b), (x2b, y2))
elif left and bottom:
......@@ -404,5 +425,4 @@ def bbox_distance(bbox1, bbox2):
return y1 - y2b
elif top:
return y2 - y1b
else: # rectangles intersect
        return 0.0
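# --- Editor's worked example (hedged; the corner branches are elided above, but
# --- assuming they return the nearest-corner Euclidean distance, two boxes whose
# --- nearest corners are 3 apart in x and 4 apart in y are at distance 5):
# assert bbox_distance((0, 0, 10, 10), (13, 14, 20, 20)) == 5.0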