Commits · 4b372f3f7e4455fa96253089b6a592adc664f3e2 · wangsen / MinerU

09 Sep, 2024 1 commit

feat(ocr): pass language parameter for custom model init · 4b372f3f

myhloli authored Sep 09, 2024

Pass the `lang` parameter to `custom_model_init` in `doc_analyze` to support language-specific OCR configurations. This enhancement allows the use of language information to improve OCR accuracy when processing PDFs.
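The change can be pictured with a minimal sketch. Only the names `lang`, `custom_model_init`, and `doc_analyze` come from the commit message; the signatures, the `ocr` flag, and the return values below are assumptions made for illustration and are not the project's actual API.

```python
from typing import Any, Dict, Optional


def custom_model_init(ocr: bool = False, lang: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical stand-in for the real initializer: records the settings it receives."""
    return {"ocr": ocr, "lang": lang}


def doc_analyze(pdf_bytes: bytes, ocr: bool = False, lang: Optional[str] = None) -> Dict[str, Any]:
    """Forward `lang` to the model initializer so OCR can be configured per language."""
    model = custom_model_init(ocr=ocr, lang=lang)
    return {"model": model, "num_bytes": len(pdf_bytes)}


# The language hint travels from the caller down to the model configuration.
result = doc_analyze(b"%PDF-1.4", ocr=True, lang="en")
```

Keeping `lang` optional means existing callers that do not pass a language keep working unchanged.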

4b372f3f

02 Sep, 2024 1 commit
- fix(end_page_id): fix the issue where end_page_id is corrected to len-1 when its input is 0 (#518) · 068fab7f
  Xiaomeng Zhao authored Sep 02, 2024
  
  068fab7f
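The commit does not say what caused the bug. One plausible cause, shown here purely as an assumption, is a truthiness check that treats `0` like a missing value. The function names and the rule that `None` or a negative value means "through the last page" are hypothetical.

```python
from typing import Optional


def resolve_end_page_id_buggy(end_page_id: Optional[int], page_count: int) -> int:
    # Truthiness check: 0 is falsy, so a request for page 0 falls through to the last page.
    return end_page_id if end_page_id else page_count - 1


def resolve_end_page_id(end_page_id: Optional[int], page_count: int) -> int:
    # Explicit checks: only None or a negative value selects the last page.
    if end_page_id is None or end_page_id < 0:
        return page_count - 1
    # Clamp values past the end of the document to the last valid index.
    return min(end_page_id, page_count - 1)
```

With a 10-page document, the buggy variant maps an input of `0` to `9`, while the explicit variant keeps it at `0`.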
30 Aug, 2024 1 commit

feat(cli&analyze&pipeline): add start_page and end_page args for pagination (#507) · 0f91fcf6

Xiaomeng Zhao authored Aug 30, 2024

* feat(cli&analyze&pipeline): add start_page and end_page args for pagination

Add start_page_id and end_page_id arguments to various components of the PDF parsing
pipeline to support pagination. This lets users restrict processing to a range of
pages instead of parsing the whole document.
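A minimal sketch of page-range selection follows. The parameter names `start_page_id` and `end_page_id` come from the commit message; the helper `select_pages`, the zero-based indexing, and the inclusive end are assumptions for illustration.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_pages(
    pages: Sequence[T],
    start_page_id: int = 0,
    end_page_id: Optional[int] = None,
) -> List[T]:
    """Return pages[start_page_id..end_page_id], treating the end index as inclusive."""
    last = len(pages) - 1
    # A missing, negative, or out-of-range end means "through the last page".
    if end_page_id is None or end_page_id < 0 or end_page_id > last:
        end_page_id = last
    # A negative start is clamped to the first page.
    start_page_id = max(start_page_id, 0)
    return list(pages[start_page_id:end_page_id + 1])


pages = ["p0", "p1", "p2", "p3", "p4"]
middle = select_pages(pages, start_page_id=1, end_page_id=3)
```

Only the selected pages would then be handed to the later, more expensive analysis steps.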

0f91fcf6

09 Aug, 2024 1 commit

fix(doc-analyze): adjust image scaling limit to 9000 pixels · 445a397f

myhloli authored Aug 09, 2024

Previously, images were not enlarged if their width or height exceeded 3000 pixels.
This threshold has been raised to 9000 pixels to better handle high-resolution
scans and improve the analysis of documents with larger dimensions.
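The threshold logic can be sketched as below. The 3000 and 9000 pixel limits come from the commit message; the function name, the scale factor, and the choice to test the enlarged size (rather than the original size) are assumptions, since the commit does not state them.

```python
from typing import Tuple

MAX_SIDE_FOR_UPSCALE = 9000  # limit from the commit message (previously 3000)


def render_size(
    width: int,
    height: int,
    scale: float = 2.0,  # hypothetical enlargement factor
    limit: int = MAX_SIDE_FOR_UPSCALE,
) -> Tuple[int, int]:
    """Return the enlarged size, or the original size if enlarging would pass `limit`."""
    scaled_w, scaled_h = round(width * scale), round(height * scale)
    if scaled_w > limit or scaled_h > limit:
        # Too large to enlarge: fall back to the original dimensions.
        return width, height
    return scaled_w, scaled_h
```

Under these assumptions a 2000 x 1000 page is left at its original size with the old limit of 3000 but is enlarged to 4000 x 2000 with the new limit of 9000.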

445a397f

02 Aug, 2024 1 commit

feat(model inference): add table recognition and conversion to LaTeX (#284) · 37925f36

Kaiwen Liu authored Aug 02, 2024

* # add table recognition using struct-eqtable
## Changelog
31/07/2024
- Support table recognition. Table images will be converted into HTML.

### How to use the new feature:
Set the attribute 'table-mode' to 'true' in magic-pdf.json.

### Caution:
It takes 200 s to 500 s to convert a single table image on CPU.

* # add table recognition using struct-eqtable
## Changelog
31/07/2024
- Support table recognition. Table images will be converted into LaTeX.

### How to use the new feature:
Set the attribute 'table-mode' to 'true' in magic-pdf.json.

### Caution:
It takes 200 s to 500 s to convert a single table image on CPU.

* # feat(model inference): add table recognition and conversion to LaTeX

# What's Changed

### New Features

- Add table content recognition, using the weights of [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) to convert table images to LaTeX.

### Instruction

- Run `pip install pypandoc struct-eqtable==0.1.0`.
- Download the [StructEqTable weights](https://huggingface.co/wanderkid/PDF-Extract-Kit/tree/main/models/TabRec) and put them under the models/ directory.
- Set the 'table-mode' value to turn on table recognition, which is off by default.
- If you have not downloaded any models before, refer to [how to download models](docs/how_to_download_models_zh_cn.md).
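As a rough sketch of how the 'table-mode' switch might be interpreted: the key name and the off-by-default behavior come from the instructions above, while the helper `table_mode_enabled` and the acceptance of both a JSON boolean and the string 'true' are assumptions, because the commit does not say which form magic-pdf.json expects.

```python
import json
from typing import Any, Dict


def table_mode_enabled(config: Dict[str, Any]) -> bool:
    """Return True only when 'table-mode' is switched on; absent means off."""
    value = config.get("table-mode", False)
    if isinstance(value, str):
        # Tolerate the string form "true" as well as a JSON boolean.
        return value.strip().lower() == "true"
    return bool(value)


# Example fragment of a magic-pdf.json-style config (contents are illustrative).
config = json.loads('{"table-mode": true}')
enabled = table_mode_enabled(config)
```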

* add table recognition and conversion to LaTeX

---------
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

37925f36

01 Aug, 2024 1 commit
- add table recognition and conversion to LaTeX · dbe628ee
  liukaiwen authored Aug 01, 2024
  
  dbe628ee
31 Jul, 2024 1 commit

# add table recognition using struct-eqtable · b29badc1

liukaiwen authored Jul 31, 2024

## Changelog
31/07/2024
- Support table recognition. Table images will be converted into HTML.

### How to use the new feature:
Set the attribute 'table-mode' to 'true' in magic-pdf.json.

### Caution:
It takes 200 s to 500 s to convert a single table image on CPU.

b29badc1

30 Jul, 2024 1 commit
- fix(magic_pdf): add warning for Lite model usage in doc analysis · 5be6ee8f
  myhloli authored Jul 30, 2024
  
  5be6ee8f
14 Jul, 2024 1 commit

refactor(magic_pdf): implement model singleton pattern for custom models · 054abe33

myhloli authored Jul 14, 2024

Introduce a Singleton pattern to manage custom models in the magic_pdf module.
This improves efficiency by ensuring that a single instance of the custom model
is created and reused, reducing the overhead of repeated instantiation for the
same model configuration.
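The idea can be illustrated with a small sketch of a singleton that caches one model per configuration. The class name `ModelSingleton`, the `get_model` method, the `build_model` factory, and the `ocr` and `show_log` options are all hypothetical stand-ins, not the project's confirmed code.

```python
from typing import Any, Dict, List, Optional, Tuple

build_calls: List[Dict[str, Any]] = []  # records how often a model is actually built


def build_model(**config: Any) -> Dict[str, Any]:
    """Hypothetical expensive model constructor."""
    build_calls.append(config)
    return {"config": config}


class ModelSingleton:
    _instance: Optional["ModelSingleton"] = None

    def __new__(cls) -> "ModelSingleton":
        # Create the single shared instance on first use, then always return it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, **config: Any) -> Dict[str, Any]:
        # Cache key is order-independent, so equal configurations share one model.
        key: Tuple = tuple(sorted(config.items()))
        if key not in self._models:
            self._models[key] = build_model(**config)
        return self._models[key]


first = ModelSingleton().get_model(ocr=True, show_log=False)
second = ModelSingleton().get_model(show_log=False, ocr=True)
```

Here `first` and `second` refer to the same object, and `build_model` runs only once for that configuration.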

054abe33

12 Jul, 2024 1 commit

feat(config-reader): add models-dir and device-mode configurations · 695b3579

myhloli authored Jul 12, 2024

Add new configuration options for custom model directories and device mode selection. This allows users to specify the directory where models are stored
and choose between CPU and GPU modes for model inference. The configurations
are read from a JSON file and can be easily extended to support additional
options in the future.
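A minimal sketch of such a reader is shown below. The key names 'models-dir' and 'device-mode' follow the commit title; the function names, the default directory, and the accepted device values ("cpu" and "cuda") are assumptions for illustration.

```python
import json
from typing import Any, Dict

DEFAULT_MODELS_DIR = "/tmp/models"  # hypothetical default location
DEFAULT_DEVICE_MODE = "cpu"         # assumed fallback when no device is configured


def read_config(path: str) -> Dict[str, Any]:
    """Load the JSON configuration file into a dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def get_models_dir(config: Dict[str, Any]) -> str:
    return config.get("models-dir", DEFAULT_MODELS_DIR)


def get_device_mode(config: Dict[str, Any]) -> str:
    device = str(config.get("device-mode", DEFAULT_DEVICE_MODE)).lower()
    if device not in {"cpu", "cuda"}:
        raise ValueError(f"unsupported device-mode: {device!r}")
    return device
```

Supporting a further option would then only need another small getter with its own default.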

695b3579

11 Jul, 2024 2 commits

feat(model): add model mode selection for PDF analysis · bc0f6932

myhloli authored Jul 11, 2024

Introduce a new feature that allows users to choose between a "lite" and a "full"
model mode for PDF document analysis. The "lite" mode uses a faster, less
accurate model, while the "full" mode employs a higher-precision model at the
cost of speed. This selection can be made through the CLI or API, providing
flexibility for different use cases.
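The mode switch might look roughly like the sketch below. The mode names "lite" and "full" come from the commit message; the `ModelMode` enum, the `describe_mode` helper, and the returned descriptions are hypothetical.

```python
from enum import Enum


class ModelMode(str, Enum):
    LITE = "lite"  # faster, less accurate
    FULL = "full"  # slower, higher precision


def describe_mode(mode: str) -> str:
    """Validate the requested mode and report the trade-off it selects."""
    selected = ModelMode(mode.lower())  # raises ValueError for unknown modes
    if selected is ModelMode.LITE:
        return "fast model, lower accuracy"
    return "high-precision model, slower"


choice = describe_mode("lite")
```

Validating the mode in one place lets a CLI flag and an API argument share the same set of accepted values.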

bc0f6932

update: modify the PEK module to parse page by page · 2b8db660

myhloli authored Jul 11, 2024

2b8db660

10 Jul, 2024 1 commit
- small fix · 14f45075
  myhloli authored Jul 10, 2024
  
  14f45075
09 Jul, 2024 1 commit
- update: integrate PDF-Extract-Kit internally · 1fac6aa7
  myhloli authored Jul 09, 2024
  
  1fac6aa7
08 Jul, 2024 1 commit

update: · 1ee81a9a

赵小蒙 authored Jul 08, 2024

1. Disable scaling when loading large images.
2. Move the channel-conversion logic in image processing.

1ee81a9a

28 Jun, 2024 1 commit
- fix: add guarded (try) imports for opencv-python and Pillow · 53ccd5a6
  赵小蒙 authored Jun 28, 2024
  
  53ccd5a6
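A guarded import of this kind is commonly written as below. This is a generic sketch of the pattern, not the commit's actual code; the helper `available_image_backends` is hypothetical.

```python
from typing import List

try:
    import cv2  # provided by the opencv-python package
except ImportError:
    cv2 = None

try:
    from PIL import Image  # provided by the Pillow package
except ImportError:
    Image = None


def available_image_backends() -> List[str]:
    """Report which optional imaging packages were importable."""
    backends = []
    if cv2 is not None:
        backends.append("opencv-python")
    if Image is not None:
        backends.append("Pillow")
    return backends
```

Code that needs one of these packages can then check for `None` and raise a clear error message instead of failing at import time.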
26 Jun, 2024 1 commit
- update: fix CLI and internal model usage logic · aad5652c
  赵小蒙 authored Jun 26, 2024
  
  aad5652c
18 Jun, 2024 1 commit
- update custom model framework · 389826c5
  赵小蒙 authored Jun 18, 2024
  
  389826c5