Commits · d714ac8b76b8af7d572f650c4ea35285a14a0ab4 · wangsen / MinerU

02 Sep, 2024 2 commits

Release: Release 0.7.1 version, update dev (#527) · d714ac8b

yyy authored Sep 02, 2024



* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

d714ac8b

fix(end_page_id): Fix the issue where end_page_id is corrected to len-1 when its input is 0. (#518) · 068fab7f
Xiaomeng Zhao authored Sep 02, 2024

068fab7f
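
The kind of bug described in this commit title can be illustrated with a minimal sketch. It assumes the end page was originally resolved with a truthiness check, which the source does not confirm; the function names `resolve_end_page_buggy` and `resolve_end_page` are hypothetical.

```python
def resolve_end_page_buggy(end_page_id, page_count):
    # Truthiness check: 0 is falsy, so a request ending at page 0
    # is silently replaced by the last page index (len - 1).
    return end_page_id if end_page_id else page_count - 1


def resolve_end_page(end_page_id, page_count):
    # Fall back to the last page only when no valid index was given.
    if end_page_id is None or end_page_id < 0:
        return page_count - 1
    # Clamp indexes that run past the end of the document.
    return min(end_page_id, page_count - 1)
```

With a 10-page document, the first variant turns an input of 0 into 9, while the second keeps it at 0.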

30 Aug, 2024 1 commit

feat(cli&analyze&pipeline): add start_page and end_page args for pagination (#507) · 0f91fcf6

Xiaomeng Zhao authored Aug 30, 2024

* feat(cli&analyze&pipeline): add start_page and end_page args for pagination

Add start_page_id and end_page_id arguments to various components of the PDF parsing
pipeline to support pagination functionality. This feature allows users to specify the
range of pages to be processed, enhancing the efficiency and flexibility of the system.

0f91fcf6
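
A page range of this kind could be applied roughly as follows. This is a sketch only: `select_pages` is a hypothetical helper, and treating the range as inclusive with `None` meaning "through the last page" is an assumption rather than something the commit message states.

```python
def select_pages(pages, start_page_id=0, end_page_id=None):
    """Return the pages in the inclusive range [start_page_id, end_page_id]."""
    if not pages:
        return []
    last = len(pages) - 1
    start = max(start_page_id, 0)
    # None or a negative value means "process through the last page".
    if end_page_id is None or end_page_id < 0:
        end = last
    else:
        end = min(end_page_id, last)
    # Slicing with an exclusive upper bound, hence end + 1.
    return pages[start:end + 1]
```

Calling `select_pages(pages, 1, 3)` on a five-page list returns the second through fourth pages; an out-of-range end index is clamped to the last page.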

20 Aug, 2024 2 commits

fix(pdf-extract): adjust box threshold for OCR detection (#447) · 041b9465

Xiaomeng Zhao authored Aug 20, 2024

Tuned the detection box threshold parameter in the OCR model initialization to improve the
accuracy of text extraction from images. The threshold was modified from 0.6 to
0.3 to filter out smaller detection boxes, which is expected to enhance the quality of the extracted
text by reducing noise and false positives in the OCR process.

041b9465

fix(self_modify): merge detection boxes for optimized text region detection (#448) · 3da5c411

Xiaomeng Zhao authored Aug 20, 2024

Merge adjacent and overlapping detection boxes to optimize text region detection in
the document. Post processing of text boxes is enhanced by consolidating them into
larger text lines, taking into account their vertical and horizontal alignment. This
improvement reduces fragmentation and improves the readability of detected text blocks.

3da5c411
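
One simple way to consolidate boxes into text lines is sketched below. It is not the project's implementation: the `(x0, y0, x1, y1)` box layout, the greedy left-to-right pass, and the `max_gap` and `min_overlap` parameters are all assumptions made for illustration.

```python
def vertical_overlap_ratio(a, b):
    # Overlap of the two boxes' vertical extents, relative to the shorter box.
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return overlap / shorter if shorter > 0 else 0.0


def merge_boxes_into_lines(boxes, max_gap=10, min_overlap=0.5):
    """Greedily merge boxes (x0, y0, x1, y1) that sit on the same row."""
    lines = []
    # Visit boxes left to right so each one can extend an existing line.
    for box in sorted(boxes, key=lambda b: (b[0], b[1])):
        for i, line in enumerate(lines):
            same_row = vertical_overlap_ratio(line, box) >= min_overlap
            # A negative gap means the boxes overlap horizontally.
            close = box[0] - line[2] <= max_gap
            if same_row and close:
                lines[i] = (min(line[0], box[0]), min(line[1], box[1]),
                            max(line[2], box[2]), max(line[3], box[3]))
                break
        else:
            lines.append(tuple(box))
    # Return lines in reading order: top to bottom, then left to right.
    return sorted(lines, key=lambda b: (b[1], b[0]))
```

A single greedy pass keeps the sketch short; a production version would likely need to re-check merged lines against each other.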

09 Aug, 2024 3 commits

fix(doc-analyze): adjust image scaling limit to 9000 pixels · 445a397f

myhloli authored Aug 09, 2024

Previously, images were not enlarged if their width or height exceeded 3000 pixels.
This threshold has been increased to 9000 pixels to better handle high-resolution
scans and improve the analysis of documents with larger dimensions.

445a397f
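
The size check described above might look like the following sketch. The 9000-pixel limit comes from the commit message; the enlargement factor of 2 and the function name `target_size` are hypothetical, since the source does not state how much images are enlarged.

```python
MAX_SIDE_FOR_UPSCALE = 9000  # limit named in the commit (previously 3000)
UPSCALE_FACTOR = 2           # hypothetical; the source does not give the factor


def target_size(width, height, factor=UPSCALE_FACTOR, limit=MAX_SIDE_FOR_UPSCALE):
    # Images already larger than the limit on either side are left as they are.
    if width > limit or height > limit:
        return width, height
    return width * factor, height * factor
```

Under the old limit of 3000, a 2000 x 4000 scan would have been left at its original size; under the new limit it is enlarged.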

fix(pdf-extract-kit): ensure table extraction success with additional ending... · 334ccac2

myhloli authored Aug 09, 2024

fix(pdf-extract-kit): ensure table extraction success with additional ending condition

Add an additional condition to determine the success of table extraction by checking
if the latex_code ends with 'end{table}'. This extends the validation to cover table
environments that may not strictly end with 'end{tabular}', thus improving the robustness
of table recognition processing.

334ccac2
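
The ending check described in this commit can be sketched in a few lines. The variable name `latex_code` and the two suffixes come from the commit message; the function name and the whitespace stripping are assumptions.

```python
def table_extraction_succeeded(latex_code):
    # The output counts as complete if it closes either a tabular
    # or a table environment; trailing whitespace is ignored.
    return latex_code.strip().endswith(("end{tabular}", "end{table}"))
```

Passing a tuple to `str.endswith` tests both suffixes in a single call, so further accepted endings can be added without extra branching.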

refactor(pdf_extract_kit): optimize image processing and table recognition... · 29e590a7

myhloli authored Aug 09, 2024

refactor(pdf_extract_kit): optimize image processing and table recognition logic

Refactor the image processing logic for OCR and table recognition to ensure
consistency and improve performance. Remove redundant initialization of PIL images,
unify image cropping logic, and streamline the handling of formula detection results.
Also, adjust the table recognition process to improve integration with the updated image
processing logic and enhance overall efficiency.

29e590a7

07 Aug, 2024 2 commits
- add table recognition success detect · 377b49eb
  liukaiwen authored Aug 07, 2024
  
  377b49eb
- add table recognition success detect · b18496b0
  liukaiwen authored Aug 07, 2024
  
  b18496b0
05 Aug, 2024 1 commit
- fix table recognition bug#321 · cae215bb
  liukaiwen authored Aug 05, 2024
  
  cae215bb
04 Aug, 2024 1 commit

fix(pdf-extract): ensure table recognition config defaults to disabled · 52156eae

myhloli authored Aug 04, 2024

If 'table-config' is not present in the configuration file, the table recognition
feature will default to being disabled to ensure consistent behavior. This change
adds a warning log and sets a default configuration for table recognition when the
expected config is missing.

52156eae
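
A fallback of this kind could be written as in the sketch below. Only the 'table-config' key is taken from the commit message; the shape of the default (`{"enable": False}`) and the function name `get_table_config` are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical default: the source says only that recognition is disabled.
DEFAULT_TABLE_CONFIG = {"enable": False}


def get_table_config(config):
    table_config = config.get("table-config")
    if table_config is None:
        # Warn rather than fail, so older config files keep working.
        logger.warning("'table-config' not found in config; "
                       "table recognition is disabled by default")
        return dict(DEFAULT_TABLE_CONFIG)
    return table_config
```

Returning a copy of the default keeps callers from accidentally mutating the shared dictionary.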

02 Aug, 2024 1 commit

feat(model inference): add table recognition and conversion to LaTeX (#284) · 37925f36

Kaiwen Liu authored Aug 02, 2024

* # add table recognition using struct-eqtable
## Changelog
31/07/2024
- Support table recognition. Table images will be converted into HTML.

### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json

### caution:
it takes 200s to 500s to convert a single table image using cpu

* # add table recognition using struct-eqtable
## Changelog
31/07/2024
- Support table recognition. Table images will be converted into LaTeX.

### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json

### caution:
it takes 200s to 500s to convert a single table image using cpu

* # feat(model inference): add table recognition and conversion to LaTeX

# What's Changed

### New Features

- Add table content recognition. We use the weights of [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) to convert table images to LaTeX.

### Instruction

- pip install pypandoc struct-eqtable==0.1.0
- Download [StructEqTable weights](https://huggingface.co/wanderkid/PDF-Extract-Kit/tree/main/models/TabRec) and put them under the models/ directory.
- Edit the 'table-mode' value to turn on table recognition, which is turned off by default.
- If you have not downloaded any models before, refer to [how to download models](docs/how_to_download_models_zh_cn.md).

* add table recognition and conversion to LaTeX

* add table recognition and conversion to LaTeX

* add table recognition and conversion to LaTeX

* add table recognition and conversion to LaTeX

---------
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

37925f36

01 Aug, 2024 4 commits
- add table recognition and conversion to LaTeX · b9667fd3
  liukaiwen authored Aug 01, 2024
  
  b9667fd3
- add table recognition and conversion to LaTeX · dbe628ee
  liukaiwen authored Aug 01, 2024
  
  dbe628ee
- add table recognition and conversion to LaTeX · 78238f39
  liukaiwen authored Aug 01, 2024
  
  78238f39
- add table recognition and conversion to LaTeX · 4c096443
  liukaiwen authored Aug 01, 2024
  
  4c096443
31 Jul, 2024 2 commits

# add table recognition using struct-eqtable · d6c58ecc

liukaiwen authored Jul 31, 2024

## Changelog
31/07/2024
- Support table recognition. Table images will be converted into LaTeX.

### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json

### caution:
it takes 200s to 500s to convert a single table image using cpu

d6c58ecc

# add table recognition using struct-eqtable · b29badc1

liukaiwen authored Jul 31, 2024

## Changelog
31/07/2024
- Support table recognition. Table images will be converted into HTML.

### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json

### caution:
it takes 200s to 500s to convert a single table image using cpu

b29badc1

30 Jul, 2024 1 commit
- fix(magic_pdf): add warning for Lite model usage in doc analysis · 5be6ee8f
  myhloli authored Jul 30, 2024
  
  5be6ee8f
29 Jul, 2024 1 commit
- fix(magic_pdf):disable torchtext deprecation warning >=0.18.0 · 00ad8e67
  myhloli authored Jul 29, 2024
  
  00ad8e67
28 Jul, 2024 1 commit
- fix(magic_pdf): remove unused import from pdf_extract_kit · 7ecc82da
  myhloli authored Jul 28, 2024
  
  7ecc82da
25 Jul, 2024 1 commit

fix(pdf_extract_kit): specify utf-8 encoding when reading model configEnsure... · 20499ec3

myhloli authored Jul 25, 2024

fix(pdf_extract_kit): specify utf-8 encoding when reading model config

Ensure the model configuration file is read with utf-8 encoding to support
non-ASCII characters and prevent potential encoding errors.

20499ec3
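
The fix amounts to passing an explicit encoding when opening the file. The sketch below uses JSON so that it needs only the standard library; the actual config format and the function name `load_model_config` are assumptions.

```python
import json


def load_model_config(path):
    # Without an explicit encoding, open() falls back to the platform
    # default, which may not be UTF-8 (for example on some Windows setups).
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

With the encoding fixed, a config containing non-ASCII text should load the same way regardless of the machine's locale.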

24 Jul, 2024 3 commits
- feat(magic-pdf): add conditional application of formula detection and recognition · 4c39bcd3
  赵小蒙 authored Jul 24, 2024
  
  4c39bcd3
- fix(magic-pdf): add default values and improve warning logs for config... · 30ac6f22
  myhloli authored Jul 24, 2024
```
fix(magic-pdf): add default values and improve warning logs for config options

Ensure that 'temp-output-dir', 'models-dir', and 'device-mode' have sensible default
values in case they are not specified in the config file.
```
  30ac6f22
- fix(magic-pdf): prevent Albumentations update check · 4bf58088
  myhloli authored Jul 24, 2024
  
  4bf58088
23 Jul, 2024 3 commits
- feat(magic_pdf): update installation commands for simplified dependency options · f6f1d00d
  myhloli authored Jul 24, 2024
  
  f6f1d00d
- refactor(magic_pdf): replace math module with local_math · 12bec17e
  myhloli authored Jul 23, 2024
  
  12bec17e
- fix(magic_pdf): filter out formulas outside image bounds during cropped_img · ee81b339
  myhloli authored Jul 23, 2024
  
  ee81b339
22 Jul, 2024 2 commits
- fix(magic_pdf): correct color channel conversion for OCR in PDF extract · c9059987
  myhloli authored Jul 22, 2024
  
  c9059987
- fix(magic_pdf): optimize formula area selection for OCR · e7ce3051
  myhloli authored Jul 22, 2024
  
  e7ce3051
18 Jul, 2024 1 commit
- fix(magic_pdf): handle import errors with exception logging · 46cacbc0
  myhloli authored Jul 18, 2024
  
  46cacbc0
17 Jul, 2024 2 commits
- fix: object cluster algorithm · ddff4b42
  blue authored Jul 17, 2024
  
  ddff4b42
- feat(magic_pdf): enable inside model usage by default · 6d65855c
  myhloli authored Jul 17, 2024
  
  6d65855c
15 Jul, 2024 1 commit
- refactor(layoutlmv3): remove outdated COCO instances registration · 724001df
  myhloli authored Jul 15, 2024
  
  724001df
14 Jul, 2024 1 commit

refactor(magic_pdf): implement model singleton pattern for custom models · 054abe33

myhloli authored Jul 14, 2024

Introduce a Singleton pattern to manage custom models in the magic_pdf module.
This change improves the efficiency by ensuring that a single instance of the
custom model is created and reused, thereby reducing the overhead of repeated
instantiation for the same model configuration.

054abe33
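
A singleton that caches one model per configuration could be sketched as follows. The class name `ModelSingleton`, the `get_model` method, and the use of a `factory` callable are illustrative assumptions, not the module's confirmed API.

```python
class ModelSingleton:
    _instance = None
    _models = {}  # cache shared by the single instance, keyed by configuration

    def __new__(cls):
        # Create the instance once and hand back the same object afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, key, factory):
        # Build the model only the first time this configuration is requested.
        if key not in self._models:
            self._models[key] = factory()
        return self._models[key]
```

Here `key` would be a hashable summary of the model configuration, such as a tuple of options, and `factory` a zero-argument callable that performs the expensive load.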

12 Jul, 2024 4 commits
- update error output · 278ba2f6
  myhloli authored Jul 13, 2024
  
  278ba2f6
- refactor(model): update init methods and improve model loading logic · 4101c357
  zhaoxiaomeng authored Jul 12, 2024
  
  4101c357
- remove useless files · 9515c2aa
  myhloli authored Jul 12, 2024
  
  9515c2aa
- feat(config-reader): add models-dir and device-mode configurations · 695b3579
  myhloli authored Jul 12, 2024
```
Add new configuration options for custom model directories and device mode
selection. This allows users to specify the directory where models are stored
and choose between CPU and GPU modes for model inference. The configurations
are read from a JSON file and can be easily extended to support additional
options in the future.
```
  695b3579