Commits · c638fc5d1f7cbe7101a7e774ab31b8d4a3eb5d5b · wangsen / MinerU

13 Dec, 2024 1 commit

fix(pdf): improve ligature handling and text extraction · c638fc5d

myhloli authored Dec 13, 2024

- Move ligature replacement function to pdf_parse_union_core_v2.py
- Optimize ligature replacement using a more efficient approach
- Modify text extraction flags to preserve ligatures in PDF content
- Remove unnecessary function from ocr_mkcontent.py

c638fc5d
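
The commit message does not say which "more efficient approach" was chosen for ligature replacement. As a minimal sketch, assuming the goal is to expand Unicode presentation-form ligatures in one pass over the text, a precomputed translation table would do it. The names `LIGATURES` and `replace_ligatures` are illustrative and are not taken from `pdf_parse_union_core_v2.py`.

```python
# Hypothetical sketch: expand common Latin ligatures in a single pass.
# The mapping follows the Unicode compatibility decompositions for U+FB00..U+FB06.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}

# Build the table once so each call is a single C-level translate().
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Return text with each ligature code point replaced by plain letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    print(replace_ligatures("e\ufb03cient \ufb01le handling"))
```

Building the table once avoids chaining one `str.replace` call per ligature, which rescans the text for every entry.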

12 Dec, 2024 1 commit

perf(layout): optimize layout detection for PDF extraction · 6a75d7dc

myhloli authored Dec 12, 2024

- Add initial setup for layout detection
- Implement conditional cropping for tall images
- Skip cropping for wide images to improve performance
- Reuse Image object across layout detection steps

6a75d7dc
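
The conditional behaviour described here, together with the crop-and-paste step and coordinate adjustment from f5d812b3 below, can be sketched without any imaging library. This is an assumption-laden illustration: the "tall" test (`height > width`), the padding amount (`width // 2` per side), and the function names are all hypothetical, not read from the MinerU source.

```python
# Hypothetical sketch of conditional padding for tall page images.
# A tall image is pasted onto a wider canvas before detection; a wide image
# is passed through unchanged, which skips the extra copy.

def plan_canvas(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (canvas_w, canvas_h, paste_x, paste_y) for a page image."""
    if height > width:  # assumed rule for "tall"
        pad = width // 2  # assumed horizontal padding on each side
        return width + 2 * pad, height, pad, 0
    return width, height, 0, 0  # wide image: no padding, no offset


def to_original_coords(poly: list[int], paste_x: int, paste_y: int) -> list[int]:
    """Shift a flat [x1, y1, x2, y2, ...] polygon back to page coordinates."""
    return [
        value - (paste_x if index % 2 == 0 else paste_y)
        for index, value in enumerate(poly)
    ]


if __name__ == "__main__":
    canvas_w, canvas_h, paste_x, paste_y = plan_canvas(100, 300)
    detected = [60, 10, 160, 10, 160, 40, 60, 40]  # polygon on the padded canvas
    print(to_original_coords(detected, paste_x, paste_y))
```

Keeping the offsets next to the canvas size means the same values used for pasting are reused when mapping detections back, so the two steps cannot drift apart.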

11 Dec, 2024 32 commits
- Merge pull request #1268 from icecraft/fix/classify · 56b0e18b
  Xiaomeng Zhao authored Dec 12, 2024
  
  56b0e18b
- fix: classify pdf type · 712d7d4a
  xu rui authored Dec 11, 2024
  
  712d7d4a
- Merge pull request #1257 from icecraft/docs/refactor_en_docs · bdacf291
  Xiaomeng Zhao authored Dec 11, 2024
```
Docs/refactor en docs
```
  bdacf291
- Merge pull request #1267 from opendatalab/master · 2df3e901
  Xiaomeng Zhao authored Dec 11, 2024
```
master->dev
```
  2df3e901
- Update version.py with new version · 391a9986
  myhloli authored Dec 11, 2024
  
  391a9986
- Merge pull request #1266 from opendatalab/release-0.10.6 · 613074b8
  Xiaomeng Zhao authored Dec 11, 2024
```
Release 0.10.6
```
  613074b8
- Merge pull request #1265 from opendatalab/dev · 1c29f99e
  Xiaomeng Zhao authored Dec 11, 2024
```
Dev->release
```
  1c29f99e
- Merge pull request #1264 from myhloli/dev · a502c5c9
  Xiaomeng Zhao authored Dec 11, 2024
```
build(docker): add torch and torchvision dependencies
```
  a502c5c9
- build(docker): add torch and torchvision dependencies · 28dae588
  myhloli authored Dec 11, 2024
```
- Add torch>=2.2.2,<=2.3.1 to requirements-docker.txt
- Add torchvision>=0.17.2,<=0.18.1 to requirements-docker.txt
```
  28dae588
- Merge pull request #1261 from opendatalab/release-0.10.6 · b4f7b53e
  Xiaomeng Zhao authored Dec 11, 2024
```
Release 0.10.6
```
  b4f7b53e
- Merge pull request #1263 from opendatalab/dev · d3b51aa5
  Xiaomeng Zhao authored Dec 11, 2024
```
refactor(draw_bbox): remove redundant '_line_sort' suffix from output filename
```
  d3b51aa5
- Merge pull request #1262 from myhloli/dev · fcba88b5
  Xiaomeng Zhao authored Dec 11, 2024
```
refactor(draw_bbox): remove redundant '_line_sort' suffix from output filename
```
  fcba88b5
- refactor(draw_bbox): remove redundant '_line_sort' suffix from output filename · ef78819a
  myhloli authored Dec 11, 2024
```
- Updated the filename generation logic in the draw_bbox function
- Removed the unnecessary '_line_sort' suffix from the output PDF filename
```
  ef78819a
- refactor(magic_pdf): remove unused import in pdf_parse_union_core_v2.py · 9efc35ec
  myhloli authored Dec 11, 2024
```
- Remove unused import of ocr_model_init from magic_pdf.model.sub_modules.model_init
- Keep existing functionality and structure intact
```
  9efc35ec
- Merge pull request #1260 from opendatalab/dev · 0440ee87
  Xiaomeng Zhao authored Dec 11, 2024
```
fix: dup classify pdf type & improve layout detection for DocLayout_YOLO model 
```
  0440ee87
- Merge pull request #1259 from myhloli/dev · 327fdf90
  Xiaomeng Zhao authored Dec 11, 2024
```
feat(layout): improve layout detection for DocLayout_YOLO model
```
  327fdf90
- feat(layout): improve layout detection for DocLayout_YOLO model · f5d812b3
  myhloli authored Dec 11, 2024
```
- Implement image cropping and pasting technique to enhance layout detection
- Adjust detected polygons to original image coordinates
- Add comments for better code readability
```
  f5d812b3
- feat: remove pipe_auto_mode · 302a6950
  xu rui authored Dec 11, 2024
  
  302a6950
- fix: fix ut · 3062217d
  icecraft authored Dec 11, 2024
  
  3062217d
- docs: check links in doc · b04867f9
  xu rui authored Dec 11, 2024
  
  b04867f9
- feat: support ms-office and images file in command line tools · cece8f53
  xu rui authored Dec 11, 2024
  
  cece8f53
- docs: add quick_start example · 7dc3b0a9
  xu rui authored Dec 10, 2024
  
  7dc3b0a9
- fix: not create empty directory · 1d32722f
  xu rui authored Dec 10, 2024
  
  1d32722f
- feat: support convert ppt/pptx/doc/docx · f6af67eb
  xu rui authored Dec 10, 2024
  
  f6af67eb
- fix: read_api list files · f3ceebc4
  xu rui authored Dec 10, 2024
  
  f3ceebc4
- feat: rewrite code snippet · 3cd51d49
  xu rui authored Dec 09, 2024
  
  3cd51d49
- docs: rewrite install and usage docs · 6ca86bea
  xu rui authored Dec 09, 2024
  
  6ca86bea
- Merge pull request #1258 from icecraft/fix/dup_classify · fd2f3c58
  Xiaomeng Zhao authored Dec 11, 2024
```
fix: dup classify pdf type
```
  fd2f3c58
- fix: dup classify pdf type · 4e7511fb
  icecraft authored Dec 11, 2024
  
  4e7511fb
- Merge pull request #1256 from opendatalab/dev · fb468671
  Xiaomeng Zhao authored Dec 11, 2024
```
build(deps): update torch and torchvision version requirements
```
  fb468671
- Merge pull request #1255 from myhloli/dev · 168a1115
  Xiaomeng Zhao authored Dec 11, 2024
```
build(deps): update torch and torchvision version requirements
```
  168a1115
- build(deps): update torch and torchvision version requirements · 9a96362d
  myhloli authored Dec 11, 2024
```
- Specify torch==2.3.1 and torchvision==0.18.1 for Windows CUDA installation
- Add torch and torchvision version constraints in setup.py:
  - torch>=2.2.2,<=2.3.1
  - torchvision>=0.17.2,<=0.18.1
- Update installation instructions in both English and Chinese README files
```
  9a96362d
10 Dec, 2024 6 commits

Merge pull request #1252 from myhloli/dev · fdf15a45
Xiaomeng Zhao authored Dec 11, 2024
```
fix(detect_invalid_chars): fix the stack error caused by multiple memory releases in PyMuPDF
```
fdf15a45

build: enable pdfminer.six dependency · 023ed9c8

myhloli authored Dec 11, 2024

- Uncomment pdfminer.six in requirements.txt
- Specify version 20231228 for pdfminer.six

023ed9c8

refactor(model): update import paths for PaddleOCR modules · 061c03a0

myhloli authored Dec 11, 2024

- Change import paths from paddleocr.ppocr to ppocr for utility functions
- Update import paths for logging and utility modules in ppocr_273_mod.py
- Modify import paths for tablemaster_paddle.py to use ppstructure instead of paddleocr.ppstructure

061c03a0

refactor(magic_pdf): switch to pdfminer for invalid character detection · e1be7da6

myhloli authored Dec 11, 2024

- Replace MuPDF with pdfminer for detecting invalid characters in PDFs
- Uncomment and update the detect_invalid_chars function to use pdfminer
- Update the check_invalid_chars function in pdf_meta_scan.py to use the new implementation

e1be7da6
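
The commit does not spell out how invalid characters are detected. pdfminer.six writes glyphs it cannot map to Unicode as `(cid:N)` placeholders, so one plausible check is to measure how much of the extracted text consists of such placeholders. The sketch below works on text that has already been extracted (for example with `pdfminer.high_level.extract_text`). The 5% threshold and the function names are assumptions for illustration only.

```python
import re

# Placeholder emitted for glyphs that have no Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Fraction of characters that are (cid:N) placeholders.

    Each placeholder counts as one character, not as its printed length.
    """
    text = text.replace("\n", "")
    matches = CID_PATTERN.findall(text)
    cid_count = len(matches)
    cid_len = sum(len(match) for match in matches)
    total_chars = cid_count + len(text) - cid_len
    return cid_count / total_chars if total_chars else 0.0


def has_invalid_chars(text: str, threshold: float = 0.05) -> bool:
    """Return True when the placeholder share exceeds the assumed threshold."""
    return cid_ratio(text) > threshold


if __name__ == "__main__":
    sample = "ab(cid:12)cd"
    print(cid_ratio(sample), has_invalid_chars(sample))
```

A document flagged this way would typically be routed to OCR instead of direct text extraction, though that routing is outside what this commit message describes.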

refactor(tablemaster): update import paths for TableSystem and init_args · 01cd633d

myhloli authored Dec 11, 2024

- Change import path for TableSystem from 'ppstructure.table.predict_table' to 'paddleocr.ppstructure.table.predict_table'
- Change import path for init_args from 'ppstructure.utility' to 'paddleocr.ppstructure.utility'

01cd633d

refactor(magic_pdf): update paddleocr module import paths · 56fad23d

myhloli authored Dec 11, 2024

- Modify import paths for paddleocr utilities in ocr_utils.py and ppocr_273_mod.py
- Change from `ppocr.utils.utility` to `paddleocr.ppocr.utils.utility`
- Update related import statements in two files to reflect the new path

56fad23d
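
The three PaddleOCR refactors above move the same modules back and forth between `ppocr.*` / `ppstructure.*` and `paddleocr.ppocr.*` / `paddleocr.ppstructure.*`. These commits simply rewrite the import statements. As an alternative sketch, and not what the commits do, a small helper can try several candidate module paths in order so that either package layout works. `import_first` is a hypothetical name.

```python
import importlib
from types import ModuleType


def import_first(*candidates: str) -> ModuleType:
    """Import and return the first module path that resolves.

    Raises ImportError listing every failure if none can be imported.
    """
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate could be imported: " + "; ".join(errors))


if __name__ == "__main__":
    # With PaddleOCR installed, the candidates would be the two paths named
    # in the commits; a stdlib module stands in here so the demo runs anywhere.
    module = import_first("no_such_package.utility", "json")
    print(module.__name__)
```

The trade-off is that a fallback hides which layout is actually in use, so pinning the dependency version and using one fixed path, as the commits do, is easier to reason about.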