Unverified Commit 9f352df0 authored by drunkpig, committed by GitHub

Release 0.8.0 (#586)



* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* fix(ocr_mkcontent): improve language detection and content formatting (#458)

Optimize the language detection logic to improve content formatting. This change
addresses issues with long-word segmentation. Language detection now uses a threshold
on the proportion of English characters to determine the language of a text.
Formatting rules have been updated to consult a list of languages (initially
including Chinese, Japanese, and Korean) for which no space is added between
inline-equation and text spans, improving the handling of Asian languages.

The impact of these changes includes improved accuracy in language detection, better
segmentation of long words, and more appropriate spacing in content formatting for multiple
languages.
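A minimal sketch of the described approach — classify by the proportion of English characters against a threshold, then join spans without spaces for CJK languages. Function names and the 0.5 threshold are illustrative assumptions, not the project's actual API:

```python
def detect_lang(text: str, en_threshold: float = 0.5) -> str:
    """Classify text as 'en' when the proportion of ASCII letters among
    non-space characters exceeds the threshold (labels are hypothetical)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return "unknown"
    en_count = sum(1 for ch in chars if "a" <= ch.lower() <= "z")
    return "en" if en_count / len(chars) > en_threshold else "unknown"

def join_spans(spans, lang):
    # For CJK languages, concatenate spans without inserting spaces.
    sep = "" if lang in ("zh", "ja", "ko") else " "
    return sep.join(spans)
```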

* fix(self_modify): merge detection boxes for optimized text region detection (#448)

Merge adjacent and overlapping detection boxes to optimize text region detection in
the document. Post-processing of text boxes is enhanced by consolidating them into
larger text lines, taking into account their vertical and horizontal alignment. This
improvement reduces fragmentation and improves the readability of detected text blocks.
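The merging step above can be sketched as repeatedly replacing any two overlapping boxes with their bounding union until no overlaps remain (a simplified illustration; the real implementation also weighs vertical and horizontal alignment):

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes given as (x0, y0, x1, y1)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge_boxes(boxes):
    """Repeatedly merge overlapping boxes into their bounding union."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if boxes_overlap(merged[i], merged[j]):
                    a, b = merged[i], merged[j]
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```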

* fix(pdf-extract): adjust box threshold for OCR detection (#447)

Tuned the detection box threshold parameter in the OCR model initialization to improve the
accuracy of text extraction from images. The threshold was lowered from 0.6 to 0.3,
adjusting how smaller detection boxes are filtered, which is expected to enhance the
quality of the extracted text by reducing noise and false positives in the OCR process.
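A minimal sketch of how such a score threshold typically acts (the parameter name `det_box_thresh` follows PaddleOCR's naming convention and is an assumption here):

```python
def filter_det_boxes(boxes_with_scores, det_box_thresh=0.3):
    """Keep only detection boxes whose confidence meets the threshold;
    lowering the threshold (0.6 -> 0.3) retains more candidate boxes."""
    return [box for box, score in boxes_with_scores if score >= det_box_thresh]
```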

* feat: rename the file generated by command line tools (#401)

* feat: rename the file generated by command line tools

* feat: add pdf filename as prefix to {span,layout,model}.pdf

---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* fix(ocr_mkcontent): revise table caption output (#397)

* fix(ocr_mkcontent): revise table caption output

- Ensure that table captions are properly included in the output.
- Remove the redundant `table_caption` variable.

* Update cla.yml

* Update bug_report.yml

* feat(cli): add debug option for detailed error handling

Enable users to invoke the CLI command with a new debug flag to get detailed debugging information.
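The debug flag pattern can be sketched with the standard library as follows (the real CLI uses its own option parser; `--debug`'s exact wiring here is an illustrative stand-in):

```python
import argparse
import traceback

def build_parser():
    parser = argparse.ArgumentParser(prog="magic-pdf")
    parser.add_argument("--debug", action="store_true",
                        help="print full tracebacks instead of a short error line")
    return parser

def run(argv):
    args = build_parser().parse_args(argv)
    try:
        raise RuntimeError("parse failed")   # stand-in for the real pipeline
    except Exception as exc:
        if args.debug:
            return traceback.format_exc()    # detailed debugging information
        return f"error: {exc}"               # terse message by default
```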

* fix(pdf-extract-kit): adjust crop_paste parameters for better accuracy

The crop_paste_x and crop_paste_y values in pdf_extract_kit.py have been modified
to improve the accuracy and consistency of OCR processing. The new values are set to 25
to ensure more precise image cropping and pasting, which leads to better OCR recognition
results.

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* fix(pdf-extract-kit): increase crop_paste margin for OCR processing

Double the crop_paste margin from 25 to 50 to ensure better OCR accuracy and
handling of border cases. This change helps improve the overall quality of
OCR'ed text by providing more context around the detected text areas.
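The margin expansion described here amounts to growing each box by a fixed number of pixels on every side, clamped to the page bounds (a sketch; the helper name is assumed):

```python
def crop_with_margin(box, page_w, page_h, margin=50):
    """Expand a text box (x0, y0, x1, y1) by `margin` pixels on every side,
    clamped to the page, so OCR sees more context around the text."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(page_w, x1 + margin), min(page_h, y1 + margin))
```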

* fix(common): deep copy model list before drawing model bbox

Use a deep copy of the original model list in `drow_model_bbox` to avoid potential
modifications to the source data. This ensures the integrity of the original models
is maintained while generating the model bounding boxes visualization.
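The deep-copy guard works like this (a sketch with a hypothetical mutation; the actual function in the codebase is `drow_model_bbox`):

```python
import copy

def draw_model_bbox(model_list):
    """Visualize bboxes on a deep copy so the caller's data stays intact."""
    models = copy.deepcopy(model_list)
    for model in models:
        model["bbox"] = [round(v) for v in model["bbox"]]  # mutate the copy only
    return models
```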

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* build(docker): update docker build step (#471)

* build(docker): update base image to Ubuntu 22.04 and install PaddlePaddle

Upgrade the Docker base image from ubuntu:latest to ubuntu:22.04 for improved
performance and stability.

Additionally, integrate PaddlePaddle GPU version 3.0.0b1
into the Docker build for enhanced AI capabilities. The MinIO configuration file has
also been updated to the latest version.

* build(dockerfile): Updated the Dockerfile

* build(Dockerfile): update Dockerfile

* docs(docker): add instructions for quick deployment with Docker

Include Docker-based deployment instructions in the README for both English and
Chinese locales. This update provides users with a quick-start guide to using Docker for
deployment, with notes on GPU VRAM requirements and default acceleration features.

* build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself.


* upload an introduction about chemical formula and update readme.md (#489)

* upload an introduction about chemical formula

* rename 2 files

* update readme.md at TODO in chemistry

* rename 2 files and update readme.md at TODO in chemistry

* update README_zh-CN.md at TODO in chemistry

* fix: remove the default value of output option in tools/cli.py and tools/cli_dev.py (#494)
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* feat: add test case (#499)
Co-authored-by: quyuan <quyuan@pjlab.org>

* Update cla.yml

* Update gpu-ci.yml

* Update cli.yml

* Delete .github/workflows/gpu-ci.yml

* fix(pdf-parse-union-core): #492 decrease span threshold for block filling (#500)

Reduce the span threshold used in fill_spans_in_blocks from 0.6 to 0.3 to
improve the accuracy of block filling based on layout analysis.
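The span-to-block assignment can be sketched as an overlap-ratio test against the threshold (an illustration; the real `fill_spans_in_blocks` operates on richer structures):

```python
def overlap_ratio(span, block):
    """Area of span∩block divided by the span's own area."""
    x0 = max(span[0], block[0]); y0 = max(span[1], block[1])
    x1 = min(span[2], block[2]); y1 = min(span[3], block[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    span_area = (span[2] - span[0]) * (span[3] - span[1])
    return inter / span_area if span_area else 0.0

def fill_spans_in_blocks(blocks, spans, threshold=0.3):
    """Assign each span to the first block it overlaps by at least `threshold`."""
    filled = {i: [] for i in range(len(blocks))}
    for span in spans:
        for i, block in enumerate(blocks):
            if overlap_ratio(span, block) >= threshold:
                filled[i].append(span)
                break
    return filled
```

Lowering the threshold from 0.6 to 0.3 lets spans that only partially overlap a layout block (e.g. spans straddling a block edge) still be assigned to it.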

* fix(detect_all_bboxes): remove small overlapping blocks by merging (#501)

Previously, small blocks that overlapped with larger ones were merely removed. This fix
changes the approach to merge smaller blocks into the larger block instead, ensuring that
no information is lost and the larger block encompasses all the text content fully.

* feat(cli&analyze&pipeline): add start_page and end_page args for pagination (#507)

* feat(cli&analyze&pipeline): add start_page and end_page args for pagination

Add start_page_id and end_page_id arguments to various components of the PDF parsing
pipeline to support pagination. This feature allows users to specify the range of pages
to be processed, enhancing the efficiency and flexibility of the system.
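The page-range behavior can be sketched as an inclusive slice, where an unset or negative end means "through the last page" (names mirror the commit; the function itself is illustrative):

```python
def select_pages(pages, start_page_id=0, end_page_id=None):
    """Return the inclusive [start_page_id, end_page_id] slice of pages.
    end_page_id=None (or negative) means 'through the last page'."""
    if end_page_id is None or end_page_id < 0:
        end_page_id = len(pages) - 1
    return pages[start_page_id:end_page_id + 1]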

* Feat/support rag (#510)

* Create requirements-docker.txt

* feat: update deps to support rag

* feat: add support to rag, add rag_data_reader api for rag integration

* feat: let user retrieve the filename of the processed file

* feat: add projects demo for rag integrations

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* Update Dockerfile

* feat(gradio): add app by gradio (#512)

* fix: replace \u0002, \u0003 in common text (#521)

* fix replace \u0002, \u0003 in common text

* fix(para): When an English line ends with a hyphen, do not add a space at the end.

* fix(end_page_id): fix the issue where end_page_id is corrected to len-1 when its input is 0 (#518)
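The bug here is the classic falsy-zero trap: a check like `if not end_page_id` also fires for a legitimate value of 0 and rewrites it to the last page. A sketch of the corrected check (the helper name is hypothetical):

```python
def normalize_end_page(end_page_id, page_count):
    # Buggy variant: `if not end_page_id: return page_count - 1`
    # also triggers on 0 and silently processes the whole document.
    # Correct: only treat None/negative as "use the last page".
    if end_page_id is None or end_page_id < 0:
        return page_count - 1
    return end_page_id
```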

* fix(para): When an English line ends with a hyphen, do not add a space at the end. (#523)

* fix: delete hyphen at end of line
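Both text-cleaning fixes above can be sketched together — stripping the stray \u0002/\u0003 control characters and joining lines without an extra space after a trailing hyphen (illustrative helpers, not the project's actual functions):

```python
def clean_text(text: str) -> str:
    """Remove stray \u0002/\u0003 control characters left by extraction."""
    return text.replace("\u0002", "").replace("\u0003", "")

def join_lines(lines):
    """Join lines with spaces, except after a line ending in a hyphen:
    then concatenate directly so hyphenated words are not split."""
    out = ""
    for line in lines:
        if out.endswith("-"):
            out += line          # no space after a hyphenated line break
        elif out:
            out += " " + line
        else:
            out = line
    return out
```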

* Release: release 0.7.1 version, update dev (#527)

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Hotfix readme 0.7.1 (#529)

* release: release 0.7.1 version (#526)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

---------
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Update README_zh-CN.md

delete Known issue about table recognition

* Update Dockerfile

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#542)

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: typo error in markdown (#536)
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* fix(gradio): remove unused imports and simplify pdf display (#534)

Removed the previously used gradio and gradio-pdf imports which were not leveraged in the code. Also,
replaced the custom `show_pdf` function with direct use of the `PDF` component from gradio for a simpler
and more integrated PDF upload and display solution, improving code maintainability and readability.

* Feat/support footnote in figure (#532)

* feat: support figure footnote

* feat: using the relative position to combine footnote, table, image

* feat: add the readme of projects

* fix: code spell in unittest

---------
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

* refactor(pdf_extract_kit): implement singleton pattern for atomic models (#533)

Refactor the pdf_extract_kit module to utilize a singleton pattern when initializing
atomic models. This change ensures that atomic models are instantiated at most once,
optimizing memory usage and reducing redundant initialization steps. The AtomModelSingleton
class now manages the instantiation and retrieval of atomic models, improving the
overall structure and efficiency of the codebase.
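The singleton caching described here can be sketched as follows — one shared registry that builds each atomic model at most once (the `get_model` signature is an illustrative assumption):

```python
class AtomModelSingleton:
    """Cache model instances so each atomic model is built at most once."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, name, factory):
        if name not in self._models:
            self._models[name] = factory()   # expensive init runs only once
        return self._models[name]
```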

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

add HF, ModelScope, and Colab URLs

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Rename README.md to README_zh-CN.md

* Create readme.md

* Rename readme.md to README.md

* Rename README.md to README_zh-CN.md

* Update README_zh-CN.md

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#573)

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* Update README_zh-CN.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* add rag data api

* Update README_zh-CN.md

update rag api image

* Update README.md

docs: remove RAG related release notes

* Update README_zh-CN.md

docs: remove RAG related release notes

* Update README_zh-CN.md

update the changelog

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: icecraft <tmortred@163.com>
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Siyu Hao <131659128+GDDGCZ518@users.noreply.github.com>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: quyuan <quyuan@pjlab.org>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>
parent b6633cd6
...@@ -6,20 +6,24 @@ on: ...@@ -6,20 +6,24 @@ on:
push: push:
branches: branches:
- "master" - "master"
- "dev"
paths-ignore: paths-ignore:
- "cmds/**" - "cmds/**"
- "**.md" - "**.md"
- "**.yml"
pull_request: pull_request:
branches: branches:
- "master" - "master"
- "dev"
paths-ignore: paths-ignore:
- "cmds/**" - "cmds/**"
- "**.md" - "**.md"
- "**.yml"
workflow_dispatch: workflow_dispatch:
jobs: jobs:
cli-test: cli-test:
runs-on: ubuntu-latest runs-on: pdf
timeout-minutes: 40 timeout-minutes: 120
strategy: strategy:
fail-fast: true fail-fast: true
@@ -28,27 +32,22 @@ jobs:
        uses: actions/checkout@v3
        with:
          fetch-depth: 2
-      - name: check-requirements
-        run: |
-          pip install -r requirements.txt
-          pip install -r requirements-qa.txt
-          pip install magic-pdf
-      - name: test_cli
-        run: |
-          cp magic-pdf.template.json ~/magic-pdf.json
-          echo $GITHUB_WORKSPACE
-          cd $GITHUB_WORKSPACE && export PYTHONPATH=. && pytest -s -v tests/test_unit.py
-          cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli.py
-      - name: benchmark
-        run: |
-          cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_bench.py
+      - name: install
+        run: |
+          echo $GITHUB_WORKSPACE && sh tests/retry_env.sh
+      - name: unit test
+        run: |
+          cd $GITHUB_WORKSPACE && export PYTHONPATH=. && coverage run -m pytest tests/test_unit.py --cov=magic_pdf/ --cov-report term-missing --cov-report html
+          cd $GITHUB_WORKSPACE && python tests/get_coverage.py
+      - name: cli test
+        run: |
+          cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli_sdk.py
  notify_to_feishu:
    if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'master') }}
-    needs: [cli-test]
-    runs-on: ubuntu-latest
+    needs: cli-test
+    runs-on: pdf
    steps:
      - name: get_actor
        run: |
@@ -67,9 +66,5 @@ jobs:
      - name: notify
        run: |
-          curl ${{ secrets.WEBHOOK_URL }} -H 'Content-Type: application/json' -d '{
-            "msgtype": "text",
-            "text": {
-              "mentioned_list": ["${{ env.METIONS }}"], "content": "'${{ github.repository }}' GitHubAction Failed!\n For details, see: https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"
-            }
-          }'
+          echo ${{ secrets.USER_ID }}
+          curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"'${{ github.repository }}' GitHubAction Failed","content":[[{"tag":"text","text":""},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"},{"tag":"at","user_id":"'${{ secrets.USER_ID }}'"}]]}}}}' ${{ secrets.WEBHOOK_URL }}
\ No newline at end of file
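The new notify step packs the whole Feishu "post" message into a single shell-quoted curl argument, which is hard to read. As a sketch only, the same payload structure can be written out in Python; `repository`, `run_id`, and `user_id` are hypothetical placeholders for values that, in CI, come from the GitHub context and repository secrets.

```python
import json

# Hypothetical placeholders; in the workflow these come from
# github.repository, GITHUB_RUN_ID, and secrets.USER_ID.
repository = "opendatalab/MinerU"
run_id = "123456"
user_id = "dummy-user-id"

payload = {
    "msg_type": "post",
    "content": {
        "post": {
            "zh_cn": {
                "title": f"{repository} GitHubAction Failed",
                "content": [[
                    {"tag": "text", "text": ""},
                    {
                        "tag": "a",
                        "text": "Please click here for details ",
                        "href": f"https://github.com/{repository}/actions/runs/{run_id}",
                    },
                    {"tag": "at", "user_id": user_id},
                ]],
            }
        }
    },
}

# This JSON string is what the curl command posts to the webhook URL.
body = json.dumps(payload, ensure_ascii=False)
print(body)
```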
@@ -30,10 +30,10 @@ tmp/
tmp
.vscode
.vscode/
-/tests/
ocr_demo
/app/common/__init__.py
/magic_pdf/config/__init__.py
source.dev.env
+tmp
@@ -3,6 +3,7 @@ repos:
    rev: 5.0.4
    hooks:
      - id: flake8
+        args: ["--max-line-length=120", "--ignore=E131,E125,W503,W504,E203"]
  - repo: https://github.com/PyCQA/isort
    rev: 5.11.5
    hooks:
@@ -11,6 +12,7 @@ repos:
    rev: v0.32.0
    hooks:
      - id: yapf
+        args: ["--style={based_on_style: google, column_limit: 120, indent_width: 4}"]
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.1
    hooks:
@@ -41,4 +43,4 @@ repos:
    rev: v1.3.1
    hooks:
      - id: docformatter
-        args: ["--in-place", "--wrap-descriptions", "79"]
+        args: ["--in-place", "--wrap-descriptions", "119"]
# Use the official Ubuntu base image
-FROM ubuntu:latest
+FROM ubuntu:22.04

# Set environment variables to non-interactive to avoid prompts during installation
ENV DEBIAN_FRONTEND=noninteractive
@@ -29,17 +29,23 @@ RUN python3 -m venv /opt/mineru_venv

# Activate the virtual environment and install necessary Python packages
RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
-    pip install --upgrade pip && \
-    pip install magic-pdf[full-cpu] detectron2 --extra-index-url https://myhloli.github.io/wheels/"
+    pip3 install --upgrade pip && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/requirements-docker.txt && \
+    pip3 install -r requirements-docker.txt --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple && \
+    pip3 install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/"

-# Copy the configuration file template and set up the model directory
-COPY magic-pdf.template.json /root/magic-pdf.json
-# Set the models directory in the configuration file (adjust the path as needed)
-RUN sed -i 's|/tmp/models|/opt/models|g' /root/magic-pdf.json
-# Create the models directory
-RUN mkdir -p /opt/models
+# Copy the configuration file template and install the latest magic-pdf
+RUN /bin/bash -c "wget https://gitee.com/myhloli/MinerU/raw/master/magic-pdf.template.json && \
+    cp magic-pdf.template.json /root/magic-pdf.json && \
+    source /opt/mineru_venv/bin/activate && \
+    pip3 install -U magic-pdf"
+
+# Download models and update the configuration file
+RUN /bin/bash -c "pip3 install modelscope && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models.py && \
+    python3 download_models.py && \
+    sed -i 's|/tmp/models|/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models|g' /root/magic-pdf.json && \
+    sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
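The two `sed` edits in the new Dockerfile rewrite plain strings inside `magic-pdf.json`. A structural alternative is to edit the JSON itself; this is only a sketch (not what the image runs), and the inline `template` string with its key names is an assumption standing in for the real config file described elsewhere in this release.

```python
import json

# Stand-in for magic-pdf.json; the key names here are assumptions for
# illustration, not a guaranteed match for the real template.
template = '{"models-dir": "/tmp/models", "device-mode": "cpu"}'

cfg = json.loads(template)
# Mirrors: sed -i 's|/tmp/models|...PDF-Extract-Kit/models|g'
cfg["models-dir"] = "/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models"
# Mirrors: sed -i 's|cpu|cuda|g'
cfg["device-mode"] = "cuda"

updated = json.dumps(cfg, indent=2)
print(updated)
```

Editing the parsed structure avoids the main pitfall of `sed` here: a global `s|cpu|cuda|g` would also rewrite any other value that happens to contain the substring "cpu".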
@@ -5,6 +5,7 @@
</p>

<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
@@ -12,17 +13,26 @@
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Demo-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7W
g9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/ModelScope-Demo-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](#)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

<!-- language -->
[English](README.md) | [简体中文](README_zh-CN.md)

<!-- hot link -->
<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
</p>

<!-- join us -->
<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>
@@ -30,12 +40,14 @@
</div>

# Changelog
+- 2024/09/09: Version 0.8.0 released, supporting fast deployment with a Dockerfile and launching demos on Hugging Face and ModelScope.
- 2024/08/30: Version 0.7.1 released, added the paddle TableMaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified the installation process and added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release

<!-- TABLE OF CONTENT -->
<details open="open">
  <summary><h2 style="display: inline-block">Table of Contents</h2></summary>
  <ol>
@@ -74,10 +86,10 @@
  </ol>
</details>

# MinerU

## Project Introduction

MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format.

MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models.

Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.
@@ -101,6 +113,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c

If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a>. </br>
If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br>
There are three different ways to experience MinerU:

- [Online Demo (No Installation Required)](#online-demo)
- [Quick CPU Demo (Windows, Linux, Mac)](#quick-cpu-demo)
- [Linux/Windows + CUDA](#Using-GPU)
@@ -171,33 +184,41 @@ In non-mainline environments, due to the diversity of hardware and software conf
### Quick CPU Demo

#### 1. Install magic-pdf

```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
```

#### 2. Download model weight files

Refer to [How to Download Model Files](docs/how_to_download_models_en.md) for detailed instructions.

> ❗️After downloading the models, please make sure to verify the completeness of the model files.
>
> Check if the model file sizes match the description on the webpage. If possible, use sha256 to verify the integrity of the files.
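The sha256 check suggested above is easy to script. A minimal sketch, using only the standard library; the demo file written at the end is a throwaway placeholder standing in for a real model weight file.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through sha256 so large model weights never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small throwaway file; point this at a real weight file instead,
# then compare the printed digest against the published checksum.
demo = Path("demo_weight.bin")
demo.write_bytes(b"example bytes")
print(sha256_of(demo))
```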
#### 3. Copy and configure the template file

You can find the `magic-pdf.template.json` template configuration file in the root directory of the repository.

> ❗️Make sure to execute the following command to copy the configuration file to your **user directory**; otherwise, the program will not run.
>
> The user directory on Windows is `C:\Users\YourUsername`, on Linux `/home/YourUsername`, and on macOS `/Users/YourUsername`.

```bash
cp magic-pdf.template.json ~/magic-pdf.json
```

Find the `magic-pdf.json` file in your user directory and configure the "models-dir" path to point to the directory where the model weight files were downloaded in [Step 2](#2-download-model-weight-files).

> ❗️Make sure to correctly configure the **absolute path** to the model weights directory; otherwise the program will not run because it cannot find the model files.
>
> On Windows, this path should include the drive letter, and all backslashes (`\`) in the path should be replaced with forward slashes (`/`) to avoid syntax errors in the JSON file due to escape sequences.
>
> For example: if the models are stored in the "models" directory at the root of the D drive, the "models-dir" value should be `D:/models`.

```json
{
  // other config
@@ -210,13 +231,26 @@ Find the `magic-pdf.json` file in your user directory and configure the "models-
}
```
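The backslash-to-forward-slash conversion described above can also be done programmatically with `pathlib`; `PureWindowsPath` parses Windows-style paths identically on any host OS, so this sketch runs anywhere.

```python
from pathlib import PureWindowsPath

# Convert a Windows-style models path into the forward-slash form
# the JSON config expects.
win_path = r"D:\models"
json_safe = PureWindowsPath(win_path).as_posix()
print(json_safe)  # D:/models
```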
### Using GPU

If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please select the appropriate guide for your system:

- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_en_US.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
- Quick deployment with Docker
  > Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
  >
  > Before running this Docker image, you can use the following command to check whether your device supports CUDA acceleration under Docker.
  >
  > ```bash
  > docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
  > ```

  ```bash
  wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
  docker build -t mineru:latest .
  docker run --rm -it --gpus=all mineru:latest /bin/bash
  magic-pdf --help
  ```
## Usage

@@ -230,12 +264,12 @@ Options:
  -v, --version                display the version and exit
  -p, --path PATH              local pdf filepath or directory  [required]
  -o, --output-dir TEXT        output local directory
  -m, --method [ocr|txt|auto]  the method for parsing pdf.
                               ocr: using ocr technique to extract information from pdf,
                               txt: suitable for the text-based pdf only and outperform ocr,
                               auto: automatically choose the best method for parsing pdf
                               from ocr and txt.
                               without method specified, auto will be used by default.
  --help                       Show this message and exit.
@@ -250,13 +284,13 @@ magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
The results will be saved in the `{some_output_dir}` directory. The output file list is as follows:

```text
├── some_pdf.md                 # markdown file
├── images                      # directory for storing images
-├── layout.pdf                  # layout diagram
-├── middle.json                 # MinerU intermediate processing result
-├── model.json                  # model inference result
-├── origin.pdf                  # original PDF file
-└── spans.pdf                   # smallest granularity bbox position information diagram
+├── some_pdf_layout.pdf         # layout diagram
+├── some_pdf_middle.json        # MinerU intermediate processing result
+├── some_pdf_model.json         # model inference result
+├── some_pdf_origin.pdf         # original PDF file
+└── some_pdf_spans.pdf          # smallest granularity bbox position information diagram
```

For more information about the output files, please refer to the [Output File Description](docs/output_file_en_us.md).
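Since this release renames the outputs to carry the source pdf's stem, a quick sketch for checking that a run produced the files listed above; `out_dir` and `stem` are hypothetical placeholders for a real run's output directory and pdf name.

```python
from pathlib import Path

# Hypothetical output location; adjust to a real magic-pdf run.
out_dir = Path("some_output_dir")
stem = "some_pdf"

# The file list from the README above, with the new stem-prefixed names.
expected = [
    f"{stem}.md",
    "images",
    f"{stem}_layout.pdf",
    f"{stem}_middle.json",
    f"{stem}_model.json",
    f"{stem}_origin.pdf",
    f"{stem}_spans.pdf",
]

missing = [name for name in expected if not (out_dir / name).exists()]
print(f"missing {len(missing)} of {len(expected)} expected outputs")
```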
@@ -264,6 +298,7 @@ For more information about the output files, please refer to the [Output File De
### API

Processing files from local disk

```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
@@ -276,6 +311,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```

Processing files from object storage

```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
@@ -290,10 +326,10 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```

For detailed implementation, refer to:

- [demo.py Simplest Processing Method](demo/demo.py)
- [magic_pdf_parse_main.py More Detailed Processing Workflow](demo/magic_pdf_parse_main.py)

### Development Guide

TODO
@@ -305,10 +341,11 @@ TODO
- [ ] Code block recognition within the text
- [ ] Table of contents recognition
- [x] Table recognition
-- [ ] Chemical formula recognition
+- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] Geometric shape recognition

# Known Issues

- Reading order is segmented based on rules, which can cause disordered sequences in some cases
- Vertical text is not supported
- Lists, code blocks, and tables of contents are not yet supported by the layout model
@@ -318,11 +355,11 @@ TODO

# FAQ

[FAQ in Chinese](docs/FAQ_zh_cn.md)

[FAQ in English](docs/FAQ_en_us.md)

# All Thanks To Our Contributors

<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
@@ -335,8 +372,8 @@ TODO
This project currently uses PyMuPDF to achieve advanced functionality. However, since it adheres to the AGPL license, it may impose restrictions on certain usage scenarios. In future iterations, we plan to explore replacing it with a more permissive PDF processing library to enhance user-friendliness and flexibility.

# Acknowledgments

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
@@ -373,9 +410,11 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
</a>

# Magic-doc

[Magic-Doc](https://github.com/InternLM/magic-doc): a fast ppt/pptx/doc/docx/pdf extraction tool

# Magic-html

[Magic-HTML](https://github.com/opendatalab/magic-html): a mixed web page extraction tool

# Links
......
@@ -4,8 +4,8 @@
  <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p>

<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
@@ -13,33 +13,41 @@
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Demo-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7W
g9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/ModelScope-Demo-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](#)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

<!-- language -->

[English](README.md) | [简体中文](README_zh-CN.md)

<!-- hot link -->

<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: a high-quality PDF parsing toolkit</a>🔥🔥🔥
</p>

<!-- join us -->

<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>

</div>
# Changelog

- 2024/09/09 Release 0.8.0: quick deployment via Dockerfile is now supported, and the Hugging Face and ModelScope demos are live
- 2024/08/30 Release 0.7.1: integrated Paddle TableMaster table recognition
- 2024/08/09 Release 0.7.0b1: simplified installation for better usability, added table recognition
- 2024/08/01 Release 0.6.2b1: resolved dependency conflicts and improved the installation docs
- 2024/07/05 Initial open-source release
<!-- TABLE OF CONTENT -->

<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
</ol>
</details>
# MinerU

## Introduction

MinerU is a tool that converts PDFs into machine-readable formats (e.g. markdown, JSON), making it easy to extract content into any target format.

MinerU was born during the pre-training of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol-conversion problems in scientific literature, hoping to contribute to technological progress in the era of large models.

Compared with well-known commercial products at home and abroad, MinerU is still young. If you run into problems or the results fall short of expectations, please file an [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.
https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
- Supports both CPU and GPU environments
- Supports Windows, Linux, and macOS

## Quick Start

If you run into any installation issue, please check the <a href="#faq">FAQ</a> first.<br>
If the parsing results are not as expected, refer to <a href="#known-issues">Known Issues</a>.<br>

There are three ways to try out MinerU:

- [Online demo (no installation required)](#在线体验)
- [Quick CPU demo (Windows, Linux, Mac)](#quick-cpu-demo)
- [Linux/Windows + CUDA](#using-gpu)

**⚠️ Read this before installing: software and hardware environment support**

To ensure stability and reliability, we only optimize and test for specific software and hardware environments during development. Users who deploy and run the project on the recommended configurations will get the best performance with the fewest compatibility issues.
[Click here for the online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
### Quick CPU Demo

#### 1. Install magic-pdf

The latest release may take a while to sync to Chinese mirror sources; please be patient.

```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
```
#### 2. Download model weight files

See [How to download model files](docs/how_to_download_models_zh_cn.md) for details.

> ❗️After downloading, make sure the model files are complete.
>
> Check that the model file sizes match the description on the page; if possible, verify integrity with a sha256 checksum.
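The sha256 check suggested above can be scripted; this is a minimal sketch, and the model file path in the comment is only a placeholder:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the sha256 digest of a file, reading in chunks to keep memory flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


# Compare the result against the checksum published on the model page, e.g.:
# print(sha256_of("models/some_model_file.pth"))
```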
#### 3. Copy and edit the config file

You can find the [magic-pdf.template.json](magic-pdf.template.json) template file in the repository root.

> ❗️Be sure to copy the config file to your **home directory** with the command below, otherwise the program will not run.
>
> The home directory is "C:\Users\username" on Windows, "/home/username" on Linux, and "/Users/username" on macOS.

```bash
cp magic-pdf.template.json ~/magic-pdf.json
```

Find magic-pdf.json in your home directory and set "models-dir" to the directory containing the model weights downloaded in [2. Download model weight files](#2-download-model-weight-files).

> ❗️Be sure to configure the **absolute path** of the model weights directory, otherwise the program will fail because the model files cannot be found.
>
> On Windows this path must include the drive letter, and every `"\"` in the path must be replaced with `"/"`, otherwise escaping will make the JSON file syntactically invalid.
>
> For example: if the models are in the "models" directory at the root of drive D, "models-dir" should be "D:/models".
```json
{
  // other config
  // ...
}
```
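Because every `"\"` in a Windows path must become `"/"` in the JSON, a small helper can normalize the path while writing the config. This is a sketch only: it assumes the template file is plain JSON without comments, and the file names follow the steps above.

```python
import json
from pathlib import PureWindowsPath


def write_config(template_path: str, models_dir: str, out_path: str) -> None:
    """Copy the template config, setting models-dir with forward slashes."""
    with open(template_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # JSON escaping: use "/" instead of "\" so the file stays valid on Windows
    config["models-dir"] = PureWindowsPath(models_dir).as_posix()
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=4)
```

For example, `write_config("magic-pdf.template.json", "D:\\models", "~/magic-pdf.json")` would store `"D:/models"` in the config.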
### Using GPU

If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please choose the guide for your system:

- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
- Quick deployment with Docker

> Docker requires a GPU with at least 16 GB of VRAM; all acceleration features are enabled by default.
>
> Before running this Docker image, you can use the following command to check whether your device supports CUDA acceleration in Docker:
>
> ```bash
> docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
> ```
```bash
wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
docker build -t mineru:latest .
docker run --rm -it --gpus=all mineru:latest /bin/bash
magic-pdf --help
```
## Usage
Options:
  -v, --version                display the version and exit
  -p, --path PATH              local pdf filepath or directory  [required]
  -o, --output-dir TEXT        output local directory
  -m, --method [ocr|txt|auto]  the method for parsing pdf.
                               ocr: using ocr technique to extract information from pdf,
                               txt: suitable for the text-based pdf only and outperform ocr,
                               auto: automatically choose the best method for parsing pdf
                               from ocr and txt.
                               without method specified, auto will be used by default.
  --help                       Show this message and exit.
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
After the command finishes, the results are saved in the `{some_output_dir}` directory; the output files are listed below:

```text
├── some_pdf.md                 # markdown file
├── images                      # directory of extracted images
├── some_pdf_layout.pdf         # layout rendering
├── some_pdf_middle.json        # MinerU intermediate processing result
├── some_pdf_model.json         # model inference result
├── some_pdf_origin.pdf         # original pdf file
└── some_pdf_spans.pdf          # rendering of the finest-grained bbox positions
```
For more information about the output files, see the [output file description](docs/output_file_zh_cn.md).
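As a quick sanity check after a run, you can verify that the files listed above were produced. The naming scheme follows the listing; the exact directory layout may differ between versions, so treat this as a sketch:

```python
import os


def check_outputs(output_dir: str, stem: str) -> list[str]:
    """Return the expected output files (per the listing above) that are missing."""
    expected = [
        f"{stem}.md",
        f"{stem}_layout.pdf",
        f"{stem}_middle.json",
        f"{stem}_model.json",
        f"{stem}_origin.pdf",
        f"{stem}_spans.pdf",
    ]
    return [name for name in expected
            if not os.path.exists(os.path.join(output_dir, name))]
```

An empty return value means every expected file is present.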
### API

Process files from local disk:

```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
# ...
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
Process files from object storage:

```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
# ...
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
For full implementations, see:

- [demo.py: the simplest processing flow](demo/demo.py)
- [magic_pdf_parse_main.py: shows the processing pipeline more clearly](demo/magic_pdf_parse_main.py)

### Further Development

TODO
- [ ] Code block recognition in body text
- [ ] Table of contents recognition
- [x] Table recognition
- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] Geometric shape recognition
# Known Issues

- Reading order is based on rule-based segmentation and can be wrong in some cases
- Vertical text is not supported
- Lists, code blocks, and tables of contents are not yet supported by the layout model
# FAQ

[FAQ in Chinese](docs/FAQ_zh_cn.md)

[FAQ in English](docs/FAQ_en_us.md)
# All Thanks To Our Contributors
This project currently uses PyMuPDF for advanced functionality. Because PyMuPDF is licensed under the AGPL, this may restrict some usage scenarios. In future releases we plan to explore a switch to a PDF processing library with more permissive licensing, to improve user-friendliness and flexibility.

# Acknowledgments

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
</a>
# Magic-doc

[Magic-Doc](https://github.com/InternLM/magic-doc) A fast ppt/pptx/doc/docx/pdf extraction tool

# Magic-html

[Magic-HTML](https://github.com/opendatalab/magic-html) A mixed web page extraction tool

# Links
# Copyright (c) Opendatalab. All rights reserved.
import base64
import os
import time
import zipfile
from pathlib import Path
import re
from loguru import logger
from magic_pdf.libs.hash_utils import compute_sha256
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, prepare_env
# Install the UI dependencies at runtime (e.g. on a hosted demo environment)
os.system("pip install gradio")
os.system("pip install gradio-pdf")
import gradio as gr
from gradio_pdf import PDF
def read_fn(path):
    disk_rw = DiskReaderWriter(os.path.dirname(path))
    return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
def parse_pdf(doc_path, output_dir, end_page_id):
    os.makedirs(output_dir, exist_ok=True)
    try:
        file_name = f"{str(Path(doc_path).stem)}_{time.time()}"
        pdf_data = read_fn(doc_path)
        parse_method = "auto"
        local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
        do_parse(
            output_dir,
            file_name,
            pdf_data,
            [],
            parse_method,
            False,
            end_page_id=end_page_id,
        )
        return local_md_dir, file_name
    except Exception as e:
        logger.exception(e)
def compress_directory_to_zip(directory_path, output_zip_path):
    """
    Compress the given directory into a ZIP file.

    :param directory_path: path of the directory to compress
    :param output_zip_path: path of the output ZIP file
    """
    try:
        with zipfile.ZipFile(output_zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            # Walk all files and subdirectories in the directory
            for root, dirs, files in os.walk(directory_path):
                for file in files:
                    # Build the full file path
                    file_path = os.path.join(root, file)
                    # Compute the path relative to the directory root
                    arcname = os.path.relpath(file_path, directory_path)
                    # Add the file to the ZIP archive
                    zipf.write(file_path, arcname)
        return 0
    except Exception as e:
        logger.exception(e)
        return -1
def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
def replace_image_with_base64(markdown_text, image_dir_path):
    # Match image tags in the Markdown text
    pattern = r'\!\[(?:[^\]]*)\]\(([^)]+)\)'

    # Replace an image link with an inline base64 data URI
    def replace(match):
        relative_path = match.group(1)
        full_path = os.path.join(image_dir_path, relative_path)
        base64_image = image_to_base64(full_path)
        return f"![{relative_path}](data:image/jpeg;base64,{base64_image})"

    # Apply the replacement
    return re.sub(pattern, replace, markdown_text)
def to_markdown(file_path, end_pages):
    # Parse the PDF and get the markdown output directory and file name
    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1)
    archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip")
    zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
    if zip_archive_success == 0:
        logger.info("Compression succeeded")
    else:
        logger.error("Compression failed")
    md_path = os.path.join(local_md_dir, file_name + ".md")
    with open(md_path, 'r', encoding='utf-8') as f:
        txt_content = f.read()
    md_content = replace_image_with_base64(txt_content, local_md_dir)
    # Return the path of the layout-annotated PDF
    new_pdf_path = os.path.join(local_md_dir, file_name + "_layout.pdf")
    return md_content, txt_content, archive_zip_path, new_pdf_path
# def show_pdf(file_path):
#     with open(file_path, "rb") as f:
#         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
#     pdf_display = f'<embed src="data:application/pdf;base64,{base64_pdf}" ' \
#                   f'width="100%" height="1000" type="application/pdf">'
#     return pdf_display
latex_delimiters = [{"left": "$$", "right": "$$", "display": True},
                    {"left": '$', "right": '$', "display": False}]
def init_model():
    from magic_pdf.model.doc_analyze_by_custom_model import ModelSingleton
    try:
        model_manager = ModelSingleton()
        txt_model = model_manager.get_model(False, False)
        logger.info("txt_model init final")
        ocr_model = model_manager.get_model(True, False)
        logger.info("ocr_model init final")
        return 0
    except Exception as e:
        logger.exception(e)
        return -1


model_init = init_model()
logger.info(f"model_init: {model_init}")
if __name__ == "__main__":
    with gr.Blocks() as demo:
        with gr.Row():
            with gr.Column(variant='panel', scale=5):
                pdf_show = gr.Markdown()
                max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages")
                with gr.Row() as bu_flow:
                    change_bu = gr.Button("Convert")
                    clear_bu = gr.ClearButton([pdf_show], value="Clear")
                pdf_show = PDF(label="Please upload pdf", interactive=True, height=800)
            with gr.Column(variant='panel', scale=5):
                output_file = gr.File(label="convert result", interactive=False)
                with gr.Tabs():
                    with gr.Tab("Markdown rendering"):
                        md = gr.Markdown(label="Markdown rendering", height=900, show_copy_button=True,
                                         latex_delimiters=latex_delimiters, line_breaks=True)
                    with gr.Tab("Markdown text"):
                        md_text = gr.TextArea(lines=45, show_copy_button=True)
        change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages], outputs=[md, md_text, output_file, pdf_show])
        clear_bu.add([md, pdf_show, md_text, output_file])

    demo.launch()
## Overview

After executing the `magic-pdf` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
### some_pdf_layout.pdf

Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `some_pdf_layout.pdf`, different content blocks are highlighted with different background colors.

![layout example](images/layout_example.png)
### some_pdf_spans.pdf

All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.

![spans example](images/spans_example.png)
### some_pdf_model.json

#### Structure Definition
```python
from pydantic import BaseModel, Field
from enum import IntEnum


class CategoryType(IntEnum):
    # ...
    table_footnote = 7   # Table footnote
    isolate_formula = 8  # Block formula
    formula_caption = 9  # Formula label
    embedding = 13       # Inline formula
    isolated = 14        # Block formula
    text = 15            # OCR recognition result


class PageInfo(BaseModel):
    page_no: int = Field(description="Page number, the first page is 0", ge=0)
    height: int = Field(description="Page height", gt=0)
    # ...


class ObjectInferenceResult(BaseModel):
    # ...
    score: float = Field(description="Confidence of the inference result")
    latex: str | None = Field(description="LaTeX parsing result", default=None)
    html: str | None = Field(description="HTML parsing result", default=None)


class PageInferenceResults(BaseModel):
    layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
    page_info: PageInfo = Field(description="Page metadata")


# The inference results of all pages, ordered by page number,
# are stored in a list as the inference results of MinerU
inference_result: list[PageInferenceResults] = []
```
The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively.

![Poly Coordinate Diagram](images/poly.png)
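Since the poly stores the four corners in a fixed order, an axis-aligned bounding box `[xmin, ymin, xmax, ymax]` can be recovered by taking coordinate extremes; this helper is an illustrative sketch, not part of the MinerU API:

```python
def poly_to_bbox(poly: list[float]) -> list[float]:
    """Convert [x0, y0, x1, y1, x2, y2, x3, y3] corner coordinates
    (top-left, top-right, bottom-right, bottom-left) to [xmin, ymin, xmax, ymax]."""
    xs = poly[0::2]  # x coordinates of the four corners
    ys = poly[1::2]  # y coordinates of the four corners
    return [min(xs), min(ys), max(xs), max(ys)]


# For an axis-aligned box the corners already agree with the extremes:
# poly_to_bbox([10, 20, 110, 20, 110, 220, 10, 220]) -> [10, 20, 110, 220]
```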
#### example

```json
...
]
```
### some_pdf_middle.json

| Field Name     | Description                                                                                                    |
| :------------- | :------------------------------------------------------------------------------------------------------------- |
| pdf_info       | list, each element is a dict representing the parsing result of each PDF page, see the table below for details  |
| \_parse_type   | ocr \| txt, used to indicate the mode used in this intermediate parsing state                                   |
| \_version_name | string, indicates the version of magic-pdf used in this parsing                                                 |

<br>
Field structure description

| Field Name          | Description                                                                                                        |
| :------------------ | :----------------------------------------------------------------------------------------------------------------- |
| preproc_blocks      | Intermediate result after PDF preprocessing, not yet segmented                                                      |
| layout_bboxes       | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order  |
| page_idx            | Page number, starting from 0                                                                                        |
| page_size           | Page width and height                                                                                               |
| \_layout_tree       | Layout tree structure                                                                                               |
| images              | list, each element is a dict representing an img_block                                                              |
| tables              | list, each element is a dict representing a table_block                                                             |
| interline_equations | list, each element is a dict representing an interline_equation_block                                               |
| discarded_blocks    | List, block information returned by the model that needs to be dropped                                              |
| para_blocks         | Result after segmenting preproc_blocks                                                                              |

In the above table, `para_blocks` is an array of dicts, each dict representing a block structure. A block can support up to one level of nesting.
The outer block is referred to as a first-level block, and the fields in the first-level block include:

| Field Name | Description                                                    |
| :--------- | :------------------------------------------------------------- |
| type       | Block type (table\|image)                                      |
| bbox       | Block bounding box coordinates                                 |
| blocks     | list, each element is a dict representing a second-level block |

<br>
There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks.

The fields in a second-level block include:

| Field Name | Description                                                                                                 |
| :--------- | :---------------------------------------------------------------------------------------------------------- |
| type       | Block type                                                                                                   |
| bbox       | Block bounding box coordinates                                                                               |
| lines      | list, each element is a dict representing a line, used to describe the composition of a line of information  |

Detailed explanation of second-level block types

| type               | Description            |
| :----------------- | :--------------------- |
| image_body         | Main body of the image |
| image_caption      | Image description text |
| table_body         | Main body of the table |
| table_caption      | Table description text |
| table_footnote     | Table footnote         |
| text               | Text block             |
| title              | Title block            |
| interline_equation | Block formula          |

<br>
The field format of a line is as follows:

| Field Name | Description                                                                                             |
| :--------- | :------------------------------------------------------------------------------------------------------ |
| bbox       | Bounding box coordinates of the line                                                                     |
| spans      | list, each element is a dict representing a span, used to describe the composition of the smallest unit  |

<br>

**span**

| Field Name          | Description                                                                                              |
| :------------------ | :-------------------------------------------------------------------------------------------------------- |
| bbox                | Bounding box coordinates of the span                                                                       |
| type                | Type of the span                                                                                           |
| content \| img_path | Text spans use content, chart spans use img_path to store the actual text or screenshot path information   |

The types of spans are as follows:

| type               | Description    |
| :----------------- | :------------- |
| image              | Image          |
| table              | Table          |
| text               | Text           |
| inline_equation    | Inline formula |
| interline_equation | Block formula  |

**Summary**
The block structure is as follows:

First-level block (if any) -> Second-level block -> Line -> Span
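Walking that hierarchy is enough to pull plain text out of `para_blocks`. The field names below follow the tables above; the traversal itself is an illustrative sketch, not an official API:

```python
def extract_text(para_blocks: list[dict]) -> list[str]:
    """Collect the content of text-like spans by walking
    block -> (nested second-level) blocks -> lines -> spans."""
    texts: list[str] = []

    def walk(block: dict) -> None:
        # First-level table/image blocks nest second-level blocks under "blocks"
        for child in block.get("blocks", []):
            walk(child)
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                # Text-like spans carry "content"; image/table spans carry "img_path"
                if "content" in span:
                    texts.append(span["content"])

    for block in para_blocks:
        walk(block)
    return texts
```

Applied to a `para_blocks` array with one text block and one image block with a caption, this returns the body text followed by the caption text.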
#### example

```json
...
```
## Overview

After running the `magic-pdf` command, in addition to the markdown-related output, several files unrelated to markdown are also generated. They are introduced one by one below.
### some_pdf_layout.pdf

Each page's layout consists of one or more boxes. The number at the top-left corner of each box is its sequence number. In addition, different content blocks are marked with different background colors in `some_pdf_layout.pdf`.

![layout page example](images/layout_example.png)
### some_pdf_spans.pdf

All spans on the page are drawn with line frames in different colors according to span type. This file can be used for quality control: issues such as missing text or unrecognized inline formulas can be spotted quickly.

![spans page example](images/spans_example.png)
### some_pdf_model.json

#### Structure Definition
```python ```python
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
from enum import IntEnum from enum import IntEnum
...@@ -33,13 +32,13 @@ class CategoryType(IntEnum): ...@@ -33,13 +32,13 @@ class CategoryType(IntEnum):
    table_caption = 6    # table caption
    table_footnote = 7   # table footnote
    isolate_formula = 8  # interline formula
    formula_caption = 9  # interline formula label
    embedding = 13       # inline formula
    isolated = 14        # interline formula
    text = 15            # OCR recognition result


class PageInfo(BaseModel):
    page_no: int = Field(description="Page number; the first page is 0", ge=0)
    height: int = Field(description="Page height", gt=0)


class ObjectInferenceResult(BaseModel):
    # ... (leading fields elided in this excerpt)
    score: float = Field(description="Confidence of the inference result")
    latex: str | None = Field(description="LaTeX parsing result", default=None)
    html: str | None = Field(description="HTML parsing result", default=None)


class PageInferenceResults(BaseModel):
    layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
    page_info: PageInfo = Field(description="Page meta information")


# The inference results of all pages, placed in a list in page order,
# form the minerU inference result
inference_result: list[PageInferenceResults] = []
```
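To make the shape above concrete, here is a minimal sketch (not part of magic-pdf; the sample detection values are invented for illustration) that walks a model.json-style structure and groups high-confidence detections by `category_id`:

```python
from collections import defaultdict

# One entry per page, shaped like the structure definition above
# (hypothetical sample data).
inference_result = [
    {
        "layout_dets": [
            {"category_id": 15, "poly": [10, 20, 110, 20, 110, 70, 10, 70],
             "score": 0.98, "latex": None, "html": None},
            {"category_id": 8, "poly": [10, 90, 110, 90, 110, 140, 10, 140],
             "score": 0.41, "latex": None, "html": None},
        ],
        "page_info": {"page_no": 0, "height": 842, "width": 595},
    },
]

by_category = defaultdict(list)
for page in inference_result:
    for det in page["layout_dets"]:
        if det["score"] >= 0.5:  # drop low-confidence detections
            by_category[det["category_id"]].append(det)

print({k: len(v) for k, v in by_category.items()})  # {15: 1}
```

The 0.5 cut-off here is arbitrary; the appropriate threshold depends on the model and use case.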
The poly coordinate format is \[x0, y0, x1, y1, x2, y2, x3, y3\], giving the coordinates of the top-left, top-right, bottom-right, and bottom-left corners respectively.

![poly coordinate diagram](images/poly.png)
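As a small illustration of this corner ordering, a helper (a sketch, not part of magic-pdf) can recover an axis-aligned bbox from a poly:

```python
def poly_to_bbox(poly: list[float]) -> list[float]:
    """Collapse an 8-value poly (top-left, top-right, bottom-right,
    bottom-left corner coordinates) into an axis-aligned [x0, y0, x1, y1]."""
    xs = poly[0::2]  # the four x coordinates
    ys = poly[1::2]  # the four y coordinates
    return [min(xs), min(ys), max(xs), max(ys)]


print(poly_to_bbox([10, 20, 110, 20, 110, 70, 10, 70]))  # [10, 20, 110, 70]
```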
#### Example data

```json
......
]
```
### some_pdf_middle.json

| Field          | Description                                                                                  |
| :------------- | :------------------------------------------------------------------------------------------- |
| pdf_info       | list, each element is a dict holding the parsing result of one pdf page; see the table below  |
| \_parse_type   | ocr \| txt, identifies the mode used to produce this intermediate result                      |
| \_version_name | string, the magic-pdf version used for this parse                                             |
<br>

**pdf_info**

Field descriptions:

| Field               | Description                                                                                                           |
| :------------------ | :-------------------------------------------------------------------------------------------------------------------- |
| preproc_blocks      | Intermediate result after pdf preprocessing, not yet segmented into paragraphs                                         |
| layout_bboxes       | Layout segmentation results, containing the layout direction (vertical, horizontal) and bbox, sorted in reading order  |
| page_idx            | Page index, starting from 0                                                                                            |
| page_size           | Width and height of the page                                                                                           |
| \_layout_tree       | Layout tree structure                                                                                                  |
| images              | list, each element is a dict representing an img_block                                                                 |
| tables              | list, each element is a dict representing a table_block                                                                |
| interline_equations | list, each element is a dict representing an interline_equation_block                                                  |
| discarded_blocks    | list, block information returned by the model that needs to be dropped                                                 |
| para_blocks         | The result of segmenting preproc_blocks into paragraphs                                                                |

In the table above, `para_blocks` is an array of dicts; each dict is a block structure. A block supports at most one level of nesting.
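This one-level nesting can be flattened with a short sketch (hypothetical data, shaped like the block structure described here): descend into second-level blocks when a first-level block is present, then walk lines down to spans.

```python
def iter_spans(block):
    """Yield all spans of a block, handling the one allowed nesting level."""
    if "blocks" in block:  # first-level block (table/image): recurse
        for sub in block["blocks"]:
            yield from iter_spans(sub)
    else:  # second-level block: lines -> spans
        for line in block.get("lines", []):
            yield from line.get("spans", [])


# Invented minimal page, shaped like one entry of pdf_info
page = {
    "para_blocks": [
        {"type": "text", "bbox": [0, 0, 100, 20],
         "lines": [{"bbox": [0, 0, 100, 20],
                    "spans": [{"type": "text", "content": "hello"}]}]},
    ]
}

spans = [s for b in page["para_blocks"] for s in iter_spans(b)]
print([s["content"] for s in spans])  # ['hello']
```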
The outer block is called a first-level block. The fields of a first-level block include:

| Field  | Description                                             |
| :----- | :------------------------------------------------------ |
| type   | Block type (table\|image)                               |
| bbox   | Bounding-box coordinates of the block                   |
| blocks | list, each element is a second-level block in dict form |

<br>

There are only two first-level block types, "table" and "image"; all other blocks are second-level blocks.
The fields of a second-level block include:

| Field | Description                                                                    |
| :---- | :------------------------------------------------------------------------------ |
| type  | Block type                                                                      |
| bbox  | Bounding-box coordinates of the block                                           |
| lines | list, each element is a line in dict form, describing the content of one line   |

Second-level block types in detail:

| type               | desc                   |
| :----------------- | :--------------------- |
| image_body         | The image itself       |
| image_caption      | The image caption text |
| table_body         | The table itself       |
| table_caption      | The table caption text |
| table_footnote     | The table footnote     |
| text               | Text block             |
| title              | Title block            |
| interline_equation | Block formula          |
<br>
The fields of a line are as follows:

| Field | Description                                                                        |
| :---- | :---------------------------------------------------------------------------------- |
| bbox  | Bounding-box coordinates of the line                                                |
| spans | list, each element is a span in dict form, describing the smallest composition unit |

<br>
**span**

| Field               | Description                                                                                                |
| :------------------ | :---------------------------------------------------------------------------------------------------------- |
| bbox                | Bounding-box coordinates of the span                                                                        |
| type                | Type of the span                                                                                            |
| content \| img_path | Text spans use content; image and table spans use img_path to store the actual text or the screenshot path  |

The span types are as follows:

| type               | desc           |
| :----------------- | :------------- |
| image              | Image          |
| table              | Table          |
| text               | Text           |
| inline_equation    | Inline formula |
| interline_equation | Block formula  |

**Summary**

A span is the smallest storage unit of all elements.
The elements stored in para_blocks are block information.
First-level block (if any) -> Second-level block -> Line -> Span

#### Example data

```json
"_parse_type": "txt", "_parse_type": "txt",
"_version_name": "0.6.1" "_version_name": "0.6.1"
} }
``` ```
import os
from pathlib import Path

from loguru import logger

from magic_pdf.integrations.rag.type import (ElementRelation, LayoutElements,
                                             Node)
from magic_pdf.integrations.rag.utils import inference


class RagPageReader:

    def __init__(self, pagedata: LayoutElements):
        self.o = [
            Node(
                category_type=v.category_type,
                text=v.text,
                image_path=v.image_path,
                anno_id=v.anno_id,
                latex=v.latex,
                html=v.html,
            ) for v in pagedata.layout_dets
        ]
        self.pagedata = pagedata

    def __iter__(self):
        return iter(self.o)

    def get_rel_map(self) -> list[ElementRelation]:
        return self.pagedata.extra.element_relation


class RagDocumentReader:

    def __init__(self, ragdata: list[LayoutElements]):
        self.o = [RagPageReader(v) for v in ragdata]

    def __iter__(self):
        return iter(self.o)


class DataReader:

    def __init__(self, path_or_directory: str, method: str, output_dir: str):
        self.path_or_directory = path_or_directory
        self.method = method
        self.output_dir = output_dir
        self.pdfs = []
        if os.path.isdir(path_or_directory):
            for doc_path in Path(path_or_directory).glob('*.pdf'):
                self.pdfs.append(doc_path)
        else:
            assert path_or_directory.endswith('.pdf')
            self.pdfs.append(Path(path_or_directory))

    def get_documents_count(self) -> int:
        """Returns the number of documents in the directory."""
        return len(self.pdfs)

    def get_document_result(self, idx: int) -> RagDocumentReader | None:
        """
        Args:
            idx (int): the index of documents under the
                directory path_or_directory

        Returns:
            RagDocumentReader | None: RagDocumentReader is an iterable object,
                more details @RagDocumentReader
        """
        if idx >= self.get_documents_count() or idx < 0:
            logger.error(f'invalid idx: {idx}')
            return None
        res = inference(str(self.pdfs[idx]), self.output_dir, self.method)
        if res is None:
            logger.warning(f'failed to inference pdf {self.pdfs[idx]}')
            return None
        return RagDocumentReader(res)

    def get_document_filename(self, idx: int) -> Path:
        """Get the filename of the document."""
        return self.pdfs[idx]
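The PDF-collection logic in `DataReader.__init__` (directory glob vs. single file) can be exercised on its own; the sketch below mirrors that logic in a standalone function so it runs without the magic_pdf package:

```python
import os
import tempfile
from pathlib import Path


def collect_pdfs(path_or_directory: str) -> list[Path]:
    """Mirror of DataReader.__init__'s PDF discovery (a sketch, not the
    library API): glob a directory for *.pdf, or accept a single .pdf path."""
    if os.path.isdir(path_or_directory):
        return sorted(Path(path_or_directory).glob('*.pdf'))
    assert path_or_directory.endswith('.pdf')
    return [Path(path_or_directory)]


with tempfile.TemporaryDirectory() as d:
    (Path(d) / 'a.pdf').touch()
    (Path(d) / 'b.txt').touch()  # non-PDF files are ignored
    names = [p.name for p in collect_pdfs(d)]

print(names)  # ['a.pdf']
```

Note that, like the original, this accepts a single file only if its name ends in `.pdf`.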
from enum import Enum

from pydantic import BaseModel, Field


# rag
class CategoryType(Enum):  # py310 does not support StrEnum
    text = 'text'
    title = 'title'
    interline_equation = 'interline_equation'
    image = 'image'
    image_body = 'image_body'
    image_caption = 'image_caption'
    table = 'table'
    table_body = 'table_body'
    table_caption = 'table_caption'
    table_footnote = 'table_footnote'


class ElementRelType(Enum):
    sibling = 'sibling'


class PageInfo(BaseModel):
    page_no: int = Field(description='the index of page, start from zero',
                         ge=0)
    height: int = Field(description='the height of page', gt=0)
    width: int = Field(description='the width of page', ge=0)
    image_path: str | None = Field(description='the image of this page',
                                   default=None)


class ContentObject(BaseModel):
    category_type: CategoryType = Field(description='category')
    poly: list[float] = Field(
        description=('Coordinates, need to convert back to PDF coordinates,'
                     ' order is top-left, top-right, bottom-right, bottom-left'
                     ' x,y coordinates'))
    ignore: bool = Field(description='whether to ignore this object',
                         default=False)
    text: str | None = Field(description='text content of the object',
                             default=None)
    image_path: str | None = Field(description='path of embedded image',
                                   default=None)
    order: int = Field(description='the order of this object within a page',
                       default=-1)
    anno_id: int = Field(description='unique id', default=-1)
    latex: str | None = Field(description='latex result', default=None)
    html: str | None = Field(description='html result', default=None)


class ElementRelation(BaseModel):
    source_anno_id: int = Field(description='unique id of the source object',
                                default=-1)
    target_anno_id: int = Field(description='unique id of the target object',
                                default=-1)
    relation: ElementRelType = Field(
        description='the relation between source and target element')


class LayoutElementsExtra(BaseModel):
    element_relation: list[ElementRelation] = Field(
        description='the relation between source and target element')


class LayoutElements(BaseModel):
    layout_dets: list[ContentObject] = Field(
        description='layout element details')
    page_info: PageInfo = Field(description='page info')
    extra: LayoutElementsExtra = Field(description='extra information')


# iter data format
class Node(BaseModel):
    category_type: CategoryType = Field(description='category')
    text: str | None = Field(description='text content of the object',
                             default=None)
    image_path: str | None = Field(description='path of embedded image',
                                   default=None)
    anno_id: int = Field(description='unique id', default=-1)
    latex: str | None = Field(description='latex result', default=None)
    html: str | None = Field(description='html result', default=None)
import json
import os
from pathlib import Path

from loguru import logger

import magic_pdf.model as model_config
from magic_pdf.dict2md.ocr_mkcontent import merge_para_with_text
from magic_pdf.integrations.rag.type import (CategoryType, ContentObject,
                                             ElementRelation, ElementRelType,
                                             LayoutElements,
                                             LayoutElementsExtra, PageInfo)
from magic_pdf.libs.ocr_content_type import BlockType, ContentType
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, prepare_env
def convert_middle_json_to_layout_elements(
    json_data: dict,
    output_dir: str,
) -> list[LayoutElements]:
    uniq_anno_id = 0
    res: list[LayoutElements] = []
    for page_no, page_data in enumerate(json_data['pdf_info']):
        order_id = 0
        page_info = PageInfo(
            height=int(page_data['page_size'][1]),
            width=int(page_data['page_size'][0]),
            page_no=page_no,
        )
        layout_dets: list[ContentObject] = []
        extra_element_relation: list[ElementRelation] = []

        for para_block in page_data['para_blocks']:
            para_text = ''
            para_type = para_block['type']
            if para_type == BlockType.Text:
                para_text = merge_para_with_text(para_block)
                x0, y0, x1, y1 = para_block['bbox']
                content = ContentObject(
                    anno_id=uniq_anno_id,
                    category_type=CategoryType.text,
                    text=para_text,
                    order=order_id,
                    poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                )
                uniq_anno_id += 1
                order_id += 1
                layout_dets.append(content)
            elif para_type == BlockType.Title:
                para_text = merge_para_with_text(para_block)
                x0, y0, x1, y1 = para_block['bbox']
                content = ContentObject(
                    anno_id=uniq_anno_id,
                    category_type=CategoryType.title,
                    text=para_text,
                    order=order_id,
                    poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                )
                uniq_anno_id += 1
                order_id += 1
                layout_dets.append(content)
            elif para_type == BlockType.InterlineEquation:
                para_text = merge_para_with_text(para_block)
                x0, y0, x1, y1 = para_block['bbox']
                content = ContentObject(
                    anno_id=uniq_anno_id,
                    category_type=CategoryType.interline_equation,
                    text=para_text,
                    order=order_id,
                    poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                )
                uniq_anno_id += 1
                order_id += 1
                layout_dets.append(content)
            elif para_type == BlockType.Image:
                body_anno_id = -1
                caption_anno_id = -1
                for block in para_block['blocks']:
                    if block['type'] == BlockType.ImageBody:
                        for line in block['lines']:
                            for span in line['spans']:
                                if span['type'] == ContentType.Image:
                                    x0, y0, x1, y1 = block['bbox']
                                    content = ContentObject(
                                        anno_id=uniq_anno_id,
                                        category_type=CategoryType.image_body,
                                        image_path=os.path.join(
                                            output_dir, span['image_path']),
                                        order=order_id,
                                        poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                                    )
                                    body_anno_id = uniq_anno_id
                                    uniq_anno_id += 1
                                    order_id += 1
                                    layout_dets.append(content)
                for block in para_block['blocks']:
                    if block['type'] == BlockType.ImageCaption:
                        para_text += merge_para_with_text(block)
                        x0, y0, x1, y1 = block['bbox']
                        content = ContentObject(
                            anno_id=uniq_anno_id,
                            category_type=CategoryType.image_caption,
                            text=para_text,
                            order=order_id,
                            poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                        )
                        caption_anno_id = uniq_anno_id
                        uniq_anno_id += 1
                        order_id += 1
                        layout_dets.append(content)
                # anno ids start at 0, so compare against the -1 sentinel
                # (as in the table branch below) rather than `> 0`
                if body_anno_id != -1 and caption_anno_id != -1:
                    element_relation = ElementRelation(
                        relation=ElementRelType.sibling,
                        source_anno_id=body_anno_id,
                        target_anno_id=caption_anno_id,
                    )
                    extra_element_relation.append(element_relation)
            elif para_type == BlockType.Table:
                body_anno_id, caption_anno_id, footnote_anno_id = -1, -1, -1
                for block in para_block['blocks']:
                    if block['type'] == BlockType.TableCaption:
                        para_text += merge_para_with_text(block)
                        x0, y0, x1, y1 = block['bbox']
                        content = ContentObject(
                            anno_id=uniq_anno_id,
                            category_type=CategoryType.table_caption,
                            text=para_text,
                            order=order_id,
                            poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                        )
                        caption_anno_id = uniq_anno_id
                        uniq_anno_id += 1
                        order_id += 1
                        layout_dets.append(content)
                for block in para_block['blocks']:
                    if block['type'] == BlockType.TableBody:
                        for line in block['lines']:
                            for span in line['spans']:
                                if span['type'] == ContentType.Table:
                                    x0, y0, x1, y1 = para_block['bbox']
                                    content = ContentObject(
                                        anno_id=uniq_anno_id,
                                        category_type=CategoryType.table_body,
                                        order=order_id,
                                        poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                                    )
                                    body_anno_id = uniq_anno_id
                                    uniq_anno_id += 1
                                    order_id += 1
                                    # if processed by the table model, keep
                                    # the latex; otherwise fall back to the
                                    # table screenshot
                                    if span.get('latex', ''):
                                        content.latex = span['latex']
                                    else:
                                        content.image_path = os.path.join(
                                            output_dir, span['image_path'])
                                    layout_dets.append(content)
                for block in para_block['blocks']:
                    if block['type'] == BlockType.TableFootnote:
                        para_text += merge_para_with_text(block)
                        x0, y0, x1, y1 = block['bbox']
                        content = ContentObject(
                            anno_id=uniq_anno_id,
                            category_type=CategoryType.table_footnote,
                            text=para_text,
                            order=order_id,
                            poly=[x0, y0, x1, y0, x1, y1, x0, y1],
                        )
                        footnote_anno_id = uniq_anno_id
                        uniq_anno_id += 1
                        order_id += 1
                        layout_dets.append(content)
                if caption_anno_id != -1 and body_anno_id != -1:
                    element_relation = ElementRelation(
                        relation=ElementRelType.sibling,
                        source_anno_id=body_anno_id,
                        target_anno_id=caption_anno_id,
                    )
                    extra_element_relation.append(element_relation)
                if footnote_anno_id != -1 and body_anno_id != -1:
                    element_relation = ElementRelation(
                        relation=ElementRelType.sibling,
                        source_anno_id=body_anno_id,
                        target_anno_id=footnote_anno_id,
                    )
                    extra_element_relation.append(element_relation)

        res.append(
            LayoutElements(
                page_info=page_info,
                layout_dets=layout_dets,
                extra=LayoutElementsExtra(
                    element_relation=extra_element_relation),
            ))
    return res
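The conversion above repeatedly expands a `[x0, y0, x1, y1]` bbox into the 8-value poly order (top-left, top-right, bottom-right, bottom-left); the pattern can be factored into a small helper, sketched here:

```python
def bbox_to_poly(bbox: list[float]) -> list[float]:
    """Expand an axis-aligned [x0, y0, x1, y1] bbox into the 8-value poly
    order: top-left, top-right, bottom-right, bottom-left."""
    x0, y0, x1, y1 = bbox
    return [x0, y0, x1, y0, x1, y1, x0, y1]


print(bbox_to_poly([10, 20, 110, 70]))
# [10, 20, 110, 20, 110, 70, 10, 70]
```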
def inference(path, output_dir, method):
    model_config.__use_inside_model__ = True
    model_config.__model_mode__ = 'full'
    if output_dir == '':
        if os.path.isdir(path):
            output_dir = os.path.join(path, 'output')
        else:
            output_dir = os.path.join(os.path.dirname(path), 'output')

    local_image_dir, local_md_dir = prepare_env(output_dir,
                                                str(Path(path).stem), method)

    def read_fn(path):
        disk_rw = DiskReaderWriter(os.path.dirname(path))
        return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)

    def parse_doc(doc_path: str):
        try:
            file_name = str(Path(doc_path).stem)
            pdf_data = read_fn(doc_path)
            do_parse(
                output_dir,
                file_name,
                pdf_data,
                [],
                method,
                False,
                f_draw_span_bbox=False,
                f_draw_layout_bbox=False,
                f_dump_md=False,
                f_dump_middle_json=True,
                f_dump_model_json=False,
                f_dump_orig_pdf=False,
                f_dump_content_list=False,
                f_draw_model_bbox=False,
            )
            middle_json_fn = os.path.join(local_md_dir,
                                          f'{file_name}_middle.json')
            with open(middle_json_fn) as fd:
                jso = json.load(fd)
            os.remove(middle_json_fn)
            return convert_middle_json_to_layout_elements(jso, local_image_dir)
        except Exception as e:
            logger.exception(e)
            return None  # explicit: parsing failed, callers check for None

    return parse_doc(path)
if __name__ == '__main__':
    import pprint

    base_dir = '/opt/data/pdf/resources/samples/'
    if 0:
        with open(base_dir + 'json_outputs/middle.json') as f:
            d = json.load(f)
        result = convert_middle_json_to_layout_elements(d, '/tmp')
        pprint.pp(result)
    if 0:
        with open(base_dir + 'json_outputs/middle.3.json') as f:
            d = json.load(f)
        result = convert_middle_json_to_layout_elements(d, '/tmp')
        pprint.pp(result)
    if 1:
        res = inference(
            base_dir + 'samples/pdf/one_page_with_table_image.pdf',
            '/tmp/output',
            'ocr',
        )
        pprint.pp(res)