Commits · 3da5c411152471d3005f31057d4a8c950b122caa · wangsen / MinerU

20 Aug, 2024 1 commit

fix(self_modify): merge detection boxes for optimized text region detection (#448) · 3da5c411

Xiaomeng Zhao authored Aug 20, 2024

Merge adjacent and overlapping detection boxes to optimize text region detection in
the document. Post-processing of text boxes now consolidates them into larger text
lines, taking into account their vertical and horizontal alignment. This reduces
fragmentation and improves the readability of detected text blocks.
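
The merging described above can be illustrated with a minimal sketch. It is not the code from this commit: the box format `(x0, y0, x1, y1)`, the function names `same_line` and `merge_boxes`, and the thresholds `min_overlap` and `max_gap` are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1) with the origin at the top-left.
Box = Tuple[float, float, float, float]


def same_line(a: Box, b: Box, min_overlap: float = 0.5) -> bool:
    """Return True when a and b overlap vertically by at least
    min_overlap of the shorter box's height (hypothetical threshold)."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return shorter > 0 and overlap / shorter >= min_overlap


def merge_boxes(boxes: List[Box], max_gap: float = 10.0) -> List[Box]:
    """Greedily merge boxes that share a line and overlap, touch, or sit
    within max_gap of each other horizontally."""
    lines: List[Box] = []
    # Visit boxes left to right so each box only needs to look leftwards.
    for box in sorted(boxes, key=lambda b: b[0]):
        for i, line in enumerate(lines):
            gap = box[0] - line[2]  # negative when the boxes overlap
            if same_line(line, box) and gap <= max_gap:
                # Replace the line with the union of the two rectangles.
                lines[i] = (
                    min(line[0], box[0]),
                    min(line[1], box[1]),
                    max(line[2], box[2]),
                    max(line[3], box[3]),
                )
                break
        else:
            lines.append(box)
    # Return the merged lines in reading order: top to bottom, then left to right.
    return sorted(lines, key=lambda b: (b[1], b[0]))
```

A single left-to-right pass keeps the sketch short. A real implementation may need repeated passes or different thresholds for multi-column layouts.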


02 Aug, 2024 1 commit

feat(model inference): add table recognition and conversion to LaTeX (#284) · 37925f36

Kaiwen Liu authored Aug 02, 2024

* # add table recognition using struct-eqtable
## Changelog
31/07/2024
- Support table recognition. Table images will be converted into HTML.

### How to use the new feature:
Set the attribute 'table-mode' to 'true' in magic-pdf.json.

### Caution:
Converting a single table image takes 200s to 500s on CPU.

* # add table recognition using struct-eqtable
## Changelog
31/07/2024
- Support table recognition. Table images will be converted into LaTeX.

### How to use the new feature:
Set the attribute 'table-mode' to 'true' in magic-pdf.json.

### Caution:
Converting a single table image takes 200s to 500s on CPU.

* # feat(model inference): add table recognition and conversion to LaTeX

# What's Changed

### New Features

- Add table content recognition, using the weights of [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) to convert table images to LaTeX.

### Instruction

- pip install pypandoc struct-eqtable==0.1.0
- Download the [StructEqTable weights](https://huggingface.co/wanderkid/PDF-Extract-Kit/tree/main/models/TabRec) and put them under the models/ directory.
- Edit the 'table-mode' value to turn on table recognition, which is off by default.
- If you have not downloaded any models before, refer to [how to download models](docs/how_to_download_models_zh_cn.md).
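
A minimal sketch of the 'table-mode' edit, assuming magic-pdf.json is a JSON object with a top-level 'table-mode' key and that the value is the string 'true' as quoted in the commit message. The helper name `set_table_mode` and the example path in the comment are hypothetical, not part of the project.

```python
import json
from pathlib import Path


def set_table_mode(config_path: Path, value: str = "true") -> dict:
    """Set 'table-mode' in a JSON config file and return the updated config.

    Assumption: the value is stored as a string ('true' / 'false'),
    matching how the commit message quotes it.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["table-mode"] = value
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config


# Example call (the file location is an assumption; adjust to your setup):
# set_table_mode(Path.home() / "magic-pdf.json")
```

Editing the file by hand works equally well. The helper only shows that a single key needs to change and that the other keys are preserved.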

* add table recognition and conversion to LaTeX

* add table recognition and conversion to LaTeX

* add table recognition and conversion to LaTeX

* add table recognition and conversion to LaTeX

---------
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>


01 Aug, 2024 2 commits
- add table recognition and conversion to LaTeX · 78238f39
  liukaiwen authored Aug 01, 2024
  
- add table recognition and conversion to LaTeX · 4c096443
  liukaiwen authored Aug 01, 2024
  
31 Jul, 2024 1 commit

# add table recognition using struct-eqtable · b29badc1

liukaiwen authored Jul 31, 2024

## Changelog
31/07/2024
- Support table recognition. Table images will be converted into HTML.

### How to use the new feature:
Set the attribute 'table-mode' to 'true' in magic-pdf.json.

### Caution:
Converting a single table image takes 200s to 500s on CPU.


22 Jul, 2024 2 commits
- fix(magic_pdf): correct color channel conversion for OCR in PDF extract · c9059987
  myhloli authored Jul 22, 2024
  
- fix(magic_pdf): optimize formula area selection for OCR · e7ce3051
  myhloli authored Jul 22, 2024
  
15 Jul, 2024 1 commit
- refactor(layoutlmv3): remove outdated COCO instances registration · 724001df
  myhloli authored Jul 15, 2024
  
12 Jul, 2024 1 commit
- feat(model-config): Unify all device selections through a single YAML file · 45e7fbd2
  myhloli authored Jul 12, 2024
  
11 Jul, 2024 1 commit
- update: Modify the PEK module to parse page by page · 2b8db660
  myhloli authored Jul 11, 2024
  
09 Jul, 2024 1 commit
- update: Integrate the PDF-Extract-Kit inside · 1fac6aa7
  myhloli authored Jul 09, 2024
  