Commits · 4f340c442985c391fa0f59e2f8345b11e9ccae20 · wangsen / MinerU

10 Sep, 2024 1 commit

refactor(pdf_extract_kit): update model config and weight paths for UniMERNet-0.2.0 · 4f340c44

myhloli authored Sep 10, 2024

Update the paths to model weights and configuration files for the UniMERNet architecture
in both the demo.yaml and model_configs.yaml files. Adjust the mfr_model_init function to
reflect the new weight and configuration paths. The changes include specifying more detailed
paths to the unimernet_base directory and changing the weight file extension to .pth.

4f340c44

03 Sep, 2024 1 commit

refactor(pdf_extract_kit): implement singleton pattern for atomic models (#533) · aac91094

Xiaomeng Zhao authored Sep 03, 2024

Refactor the pdf_extract_kit module to utilize a singleton pattern when initializing
atomic models. This change ensures that atomic models are instantiated at most once,
optimizing memory usage and reducing redundant initialization steps. The AtomModelSingleton
class now manages the instantiation and retrieval of atomic models, improving the
overall structure and efficiency of the codebase.

aac91094
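
The singleton approach described in aac91094 can be sketched roughly as follows. Only the class name `AtomModelSingleton` comes from the commit message; the method name `get_atom_model`, the `factory` argument, and the stand-in loader are assumptions made for illustration and may not match the project's real interface.

```python
import threading


class AtomModelSingleton:
    """Registry that builds each named atomic model at most once per process."""

    _instance = None
    _lock = threading.RLock()

    def __new__(cls):
        # Double-checked locking: every caller ends up sharing one registry object.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._models = {}
                    cls._instance = instance
        return cls._instance

    def get_atom_model(self, name, factory):
        """Return the cached model for `name`; call `factory` only on first request.

        Both parameters are hypothetical, chosen for this sketch.
        """
        with self._lock:
            if name not in self._models:
                self._models[name] = factory()
            return self._models[name]


# Usage: the stand-in loader runs once even though the model is requested twice.
load_calls = []


def load_layout_model():
    load_calls.append("layout")
    return object()  # placeholder for a real model object


first = AtomModelSingleton().get_atom_model("layout", load_layout_model)
second = AtomModelSingleton().get_atom_model("layout", load_layout_model)
```

Keying the cache by name keeps expensive weights in memory once, which is the memory and start-up saving the commit message points to.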

02 Sep, 2024 1 commit

Release: Release 0.7.1 version, update dev (#527) · d714ac8b

yyy authored Sep 02, 2024



* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

d714ac8b

20 Aug, 2024 1 commit

fix(pdf-extract): adjust box threshold for OCR detection (#447) · 041b9465

Xiaomeng Zhao authored Aug 20, 2024

Tuned the detection box threshold parameter in the OCR model initialization to improve the
accuracy of text extraction from images. The threshold was modified from 0.6 to
0.3 to filter out smaller detection boxes, which is expected to enhance the quality of the extracted
text by reducing noise and false positives in the OCR process.

041b9465

09 Aug, 2024 2 commits

fix(pdf-extract-kit): ensure table extraction success with additional ending... · 334ccac2

myhloli authored Aug 09, 2024

fix(pdf-extract-kit): ensure table extraction success with additional ending condition

Add an additional condition to determine the success of table extraction by checking
if the latex_code ends with 'end{table}'. This extends the validation to cover table
environments that may not strictly end with 'end{tabular}', thus improving the robustness
of table recognition processing.

334ccac2
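
The success check described in 334ccac2 amounts to a suffix test on the generated LaTeX. A minimal sketch follows, assuming the output is a plain string; the function name `table_extraction_succeeded` and the whitespace stripping are assumptions, not taken from the repository.

```python
def table_extraction_succeeded(latex_code: str) -> bool:
    """Treat the result as successful if it closes a tabular or table environment."""
    # Stripping trailing whitespace is an assumption made for this sketch.
    stripped = latex_code.strip()
    # str.endswith accepts a tuple, so both endings are checked in one call.
    return stripped.endswith(("end{tabular}", "end{table}"))


complete_tabular = "\\begin{tabular}{c} a \\end{tabular}"
complete_table = "\\begin{table}\\begin{tabular}{c} a \\end{tabular}\\end{table}\n"
truncated = "\\begin{tabular}{c} a"
```

A truncated generation typically stops before the closing environment, which is why a simple suffix test can serve as a cheap completeness signal.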

refactor(pdf_extract_kit): optimize image processing and table recognition... · 29e590a7

myhloli authored Aug 09, 2024

refactor(pdf_extract_kit): optimize image processing and table recognition logic

Refactor the image processing logic for OCR and table recognition to ensure
consistency and improve performance. Remove redundant initialization of PIL images,
unify image cropping logic, and streamline the handling of formula detection results.
Also, adjust the table recognition process to improve integration with the updated image
processing logic and enhance overall efficiency.

29e590a7

07 Aug, 2024 2 commits
- add table recognition success detect · 377b49eb
  liukaiwen authored Aug 07, 2024
  
  377b49eb
- add table recognition success detect · b18496b0
  liukaiwen authored Aug 07, 2024
  
  b18496b0
05 Aug, 2024 1 commit
- fix table recognition bug#321 · cae215bb
  liukaiwen authored Aug 05, 2024
  
  cae215bb
04 Aug, 2024 1 commit

fix(pdf-extract): ensure table recognition config defaults to disabled · 52156eae

myhloli authored Aug 04, 2024

If 'table-config' is not present in the configuration file, the table recognition
feature will default to being disabled to ensure consistent behavior. This change
adds a warning log and sets a default configuration for table recognition when the
expected config is missing.

52156eae
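
The fallback behaviour described in 52156eae could look something like the sketch below. The 'table-config' key is from the commit message; the function name `get_table_recog_config` and the `enabled` field are hypothetical and only illustrate the warn-then-default pattern.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical default: the commit only states that recognition is disabled
# when the section is missing, not what the default dictionary contains.
DEFAULT_TABLE_CONFIG = {"enabled": False}


def get_table_recog_config(config: dict) -> dict:
    """Return the 'table-config' section, or a disabled default with a warning."""
    table_config = config.get("table-config")
    if table_config is None:
        logger.warning(
            "'table-config' not found in config file; table recognition is disabled"
        )
        # Return a copy so callers cannot mutate the shared default.
        return dict(DEFAULT_TABLE_CONFIG)
    return table_config


missing = get_table_recog_config({})
present = get_table_recog_config({"table-config": {"enabled": True}})
```

Defaulting to disabled keeps older configuration files working without silently turning on a slow feature.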

02 Aug, 2024 1 commit

feat(model inference): add table recognition and conversion to LaTeX (#284) · 37925f36

Kaiwen Liu authored Aug 02, 2024

* # add table recognition using struct-eqtable
## Changelog
31/07/2024
- Support table recognition. Table images will be converted into html.

### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json

### caution:
it takes 200s to 500s to convert a single table image using cpu

* # add table recognition using struct-eqtable
## Changelog
31/07/2024
- Support table recognition. Table images will be converted into LaTeX.

### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json

### caution:
it takes 200s to 500s to convert a single table image using cpu

* # feat(model inference): add table recognition and conversion to LaTeX

# What's Changed

### New Features

- Add table content recognition, using the weights of [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) to convert table images to LaTeX.

### Instruction

- pip install pypandoc struct-eqtable==0.1.0
- Download [StructEqTable weights](https://huggingface.co/wanderkid/PDF-Extract-Kit/tree/main/models/TabRec) and put them under the models/ directory.
- Edit the 'table-mode' value to turn on the table recognition function, which is turned off by default.
- If you have not downloaded any models before, refer to [how to download models](docs/how_to_download_models_zh_cn.md).

* add table recognition and conversion to LaTeX

* add table recognition and conversion to LaTeX

* add table recognition and conversion to LaTeX

* add table recognition and conversion to LaTeX

---------
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

37925f36

01 Aug, 2024 3 commits
- add table recognition and conversion to LaTeX · b9667fd3
  liukaiwen authored Aug 01, 2024
  
  b9667fd3
- add table recognition and conversion to LaTeX · dbe628ee
  liukaiwen authored Aug 01, 2024
  
  dbe628ee
- add table recognition and conversion to LaTeX · 4c096443
  liukaiwen authored Aug 01, 2024
  
  4c096443
31 Jul, 2024 2 commits

# add table recognition using struct-eqtable · d6c58ecc

liukaiwen authored Jul 31, 2024

## Changelog
31/07/2024
- Support table recognition. Table images will be converted into LaTeX.

### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json

### caution:
it takes 200s to 500s to convert a single table image using cpu

d6c58ecc

# add table recognition using struct-eqtable · b29badc1

liukaiwen authored Jul 31, 2024

## Changelog
31/07/2024
- Support table recognition. Table images will be converted into html.

### how to use the new feature:
set the attribute 'table-mode' to 'true' in magic-pdf.json

### caution:
it takes 200s to 500s to convert a single table image using cpu

b29badc1

29 Jul, 2024 1 commit
- fix(magic_pdf):disable torchtext deprecation warning >=0.18.0 · 00ad8e67
  myhloli authored Jul 29, 2024
  
  00ad8e67
28 Jul, 2024 1 commit
- fix(magic_pdf): remove unused import from pdf_extract_kit · 7ecc82da
  myhloli authored Jul 28, 2024
  
  7ecc82da
25 Jul, 2024 1 commit

fix(pdf_extract_kit): specify utf-8 encoding when reading model configEnsure... · 20499ec3

myhloli authored Jul 25, 2024

fix(pdf_extract_kit): specify utf-8 encoding when reading model config

Ensure the model configuration file is read with utf-8 encoding to support
non-ASCII characters and prevent potential encoding errors.

20499ec3
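
The fix in 20499ec3 comes down to passing an explicit encoding when opening the file, since the platform default is not always UTF-8. A small sketch follows; the function name `read_model_config_text` and the sample file contents are made up for illustration, and parsing the text is left out.

```python
import tempfile
from pathlib import Path


def read_model_config_text(path) -> str:
    """Read a config file as UTF-8 regardless of the platform's default encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Usage: a config file with a non-ASCII comment round-trips intact.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "model_configs.yaml"
    sample.write_text("device: cpu  # 设备\n", encoding="utf-8")
    text = read_model_config_text(sample)
```

Without the explicit `encoding` argument, `open` falls back to the locale's preferred encoding, which on some Windows setups cannot decode such a file.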

24 Jul, 2024 3 commits
- feat(magic-pdf): add conditional application of formula detection and recognition · 4c39bcd3
  赵小蒙 authored Jul 24, 2024
  
  4c39bcd3
- fix(magic-pdf): add default values and improve warning logs for config... · 30ac6f22
  myhloli authored Jul 24, 2024
```
fix(magic-pdf): add default values and improve warning logs for config options

Ensure that 'temp-output-dir', 'models-dir', and 'device-mode' have sensible default
values in case they are not specified in the config file.
```
  30ac6f22
- fix(magic-pdf): prevent Albumentations update check · 4bf58088
  myhloli authored Jul 24, 2024
  
  4bf58088
23 Jul, 2024 2 commits
- feat(magic_pdf): update installation commands for simplified dependency options · f6f1d00d
  myhloli authored Jul 24, 2024
  
  f6f1d00d
- fix(magic_pdf): filter out formulas outside image bounds during cropped_img · ee81b339
  myhloli authored Jul 23, 2024
  
  ee81b339
22 Jul, 2024 2 commits
- fix(magic_pdf): correct color channel conversion for OCR in PDF extract · c9059987
  myhloli authored Jul 22, 2024
  
  c9059987
- fix(magic_pdf): optimize formula area selection for OCR · e7ce3051
  myhloli authored Jul 22, 2024
  
  e7ce3051
18 Jul, 2024 1 commit
- fix(magic_pdf): handle import errors with exception logging · 46cacbc0
  myhloli authored Jul 18, 2024
  
  46cacbc0
12 Jul, 2024 5 commits
- update error output · 278ba2f6
  myhloli authored Jul 13, 2024
  
  278ba2f6
- refactor(model): update init methods and improve model loading logic · 4101c357
  zhaoxiaomeng authored Jul 12, 2024
  
  4101c357
- remove useless files · 9515c2aa
  myhloli authored Jul 12, 2024
  
  9515c2aa
- feat(config-reader): add models-dir and device-mode configurations · 695b3579
  myhloli authored Jul 12, 2024
```
Add new configuration options for custom model directories and device mode
selection. This allows users to specify the directory where models are stored
and choose between CPU and GPU modes for model inference. The configurations
are read from a JSON file and can be easily extended to support additional
options in the future.
```
  695b3579
- feat(model-config): Unify all device selections through a single YAML file · 45e7fbd2
  myhloli authored Jul 12, 2024
  
  45e7fbd2
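
The 'models-dir' and 'device-mode' options added in 695b3579 are read from a JSON file. The sketch below shows one way such a reader might work; the function names, the fallback values, and the set of accepted device modes are all assumptions for illustration and are not confirmed by the commit messages.

```python
import json
import tempfile
from pathlib import Path

# Assumed values for this sketch only.
DEFAULT_MODELS_DIR = "/tmp/models"
DEFAULT_DEVICE_MODE = "cpu"
ALLOWED_DEVICE_MODES = {"cpu", "cuda"}


def read_config(path) -> dict:
    """Load the JSON configuration file as a dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def get_models_dir(config: dict) -> str:
    """Return the configured model directory, or the assumed default."""
    return config.get("models-dir", DEFAULT_MODELS_DIR)


def get_device_mode(config: dict) -> str:
    """Return the configured device mode after a basic sanity check."""
    mode = config.get("device-mode", DEFAULT_DEVICE_MODE)
    if mode not in ALLOWED_DEVICE_MODES:
        raise ValueError(f"unsupported device-mode: {mode!r}")
    return mode


# Usage: only 'device-mode' is set, so 'models-dir' falls back to the default.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "magic-pdf.json"
    config_path.write_text(json.dumps({"device-mode": "cuda"}), encoding="utf-8")
    config = read_config(config_path)

models_dir = get_models_dir(config)
device_mode = get_device_mode(config)
```

Giving each option its own small accessor makes it straightforward to add further options later, in line with the extensibility the commit message mentions.
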
11 Jul, 2024 1 commit
- update:Modify the PEK module to parse page by page. · 2b8db660
  myhloli authored Jul 11, 2024
  
  2b8db660
10 Jul, 2024 2 commits
- update: add mfr cost time each batch of dataloader · 84b3c3bb
  zhaoxiaomeng authored Jul 10, 2024
  
  84b3c3bb
- small fix · 14f45075
  myhloli authored Jul 10, 2024
  
  14f45075
09 Jul, 2024 2 commits
- update:Complete the parsing logic of PEK · 831db2e0
  myhloli authored Jul 09, 2024
  
  831db2e0
- update:Integrate the PDF-Extract-Kit inside · 1fac6aa7
  myhloli authored Jul 09, 2024
  
  1fac6aa7
08 Jul, 2024 1 commit

update: · 1ee81a9a

赵小蒙 authored Jul 08, 2024

1. Disable scaling when loading large images.
2. Move the logic for channel conversion in image processing.

1ee81a9a