Commits · 7a0b87d53b6b3f6528bf9b763fb7669d0cfd7d47 · wangsen / MinerU

02 Apr, 2025 17 commits

docs: add RapidOCR and PaddleOCR2Pytorch to Acknowledgments list · 7a0b87d5

myhloli authored Apr 02, 2025

- Add RapidOCR and PaddleOCR2Pytorch to the Acknowledgments list in README.md
- Add RapidOCR and PaddleOCR2Pytorch to the Acknowledgments list in README_zh-CN.md

feat(README): update changelog for version 1.3.0 release · 0eff993a

myhloli authored Apr 02, 2025

- Installation and compatibility optimizations:
  - Replace PaddleOCR with paddleocr2torch to resolve conflicts between Paddle and PyTorch
  - Remove layoutlmv3 usage to solve compatibility issues with detectron2
  - Extend PyTorch version compatibility to 2.2~2.6
  - Extend CUDA compatibility to 11.8~12.6
  - Extend Python version compatibility to 3.10~3.12

- Performance optimizations:
  - Support batch processing for multiple PDF files
  - Optimize mfr model loading and usage to reduce memory consumption and improve speed
  - Reduce minimum memory requirement to 6GB
  - Improve running speed on MPS devices

- Parsing effect optimization:
  - Update mfr model to unimernet(2503) to fix line break issues in multi-line formulas

docs(gpu): update CUDA acceleration documentation · a778645b

myhloli authored Apr 02, 2025

- Update CUDA version requirements to 12.4
- Recommend nvidia-driver-570-server for Ubuntu
- Remove Python version specification for conda environment
- Update magic-pdf version requirement to 1.3.0
- Simplify CUDA acceleration testing instructions
- Remove OCR acceleration with paddlepaddle-gpu
- Update torch and torchvision installation instructions for Windows

docs(README): update system requirements and GPU support · 298305dd

myhloli authored Apr 02, 2025

- Update Python version requirement to 3.10-3.12
- Expand CUDA environment options to 11.8/12.4/12.6
- Update GPU VRAM requirement to 6GB or more

build(deps): update package versions for linux and macos · cb3a4314

myhloli authored Apr 02, 2025

- Update matplotlib minimum version to 3.10 for Linux and MacOS
- Specify version ranges for PyYAML, ftfy, openai, shapely, pyclipper, and omegaconf
- Update dill to version <1 for compatibility

build(dependencies): update PyMuPDF, pydantic and transformers · 90321855

myhloli authored Apr 02, 2025

- Update PyMuPDF to version <1.25.0
- Update pydantic to version <2.11
- Update transformers to version <5.0.0
- Remove always_apply parameter from alb.ToGray in image processing

feat(ocr): update OCR utility and dependencies · d09464be

myhloli authored Apr 02, 2025

- Update the default configuration path in pytorchocr_utility.py
- Add required dependencies for paddleocr2pytorch in setup.py:
  - shapely
  - pyclipper
  - omegaconf

refactor(model): update OCR model and remove unused configs · c45a706c

myhloli authored Apr 02, 2025

- Remove unused UniMERNet and LayoutLMv3 model configurations
- Update OCR model path and dictionary path for PaddleOCR
- Modify README to update system requirements and installation instructions
- Update setup.py to include new package data

refactor(magic_pdf): remove unused imports and update dependencies · 243bc58c

myhloli authored Apr 02, 2025

- Remove unused imports for concurrent.futures, multiprocessing, and paddle
- Delete commented-out code
- Update numpy dependency to remove upper version limit
- Remove InferenceResult import that was commented out

refactor(docker): remove unused packages and simplify Dockerfile commands · ddaa7158

myhloli authored Apr 02, 2025

- Remove paddleocr, paddlepaddle, rapidocr-paddle, and rapidocr-onnxruntime from requirements.txt files
- Simplify pip install commands in Dockerfiles
- Remove installation of paddlepaddle-gpu in china and global Dockerfiles
- Update requirements.txt files across all Docker configurations

refactor: comment out paddleocr model copying code · 3bd1e0e4

myhloli authored Apr 02, 2025

- Commented out the code that copies the paddleocr model to user directory
- This change affects both download_models.py and download_models_hf.py scripts

fix(scripts): update model download scripts for OCR · 5237a385

myhloli authored Apr 02, 2025

- Update download_models.py and download_models_hf.py scripts
- Change OCR model path from paddleocr to paddleocr_torch

chore: update dictionary files · 3b5d3fc8

myhloli authored Apr 02, 2025

- Add newline at the beginning of arabic_dict.txt
- Change mode of multiple dictionary files

refactor(model): remove unused OCR and table models · d8ebd92f

myhloli authored Apr 02, 2025

- Remove OCR utils, modified PaddleOCR, and StructEqTable model
- Delete related import statements and model definitions
- Update dependencies in setup.py to remove paddlepaddle and related OCR packages

refactor(ocr): comment out print statements and update table model initialization · 5252c46e

myhloli authored Apr 02, 2025

- Comment out print statements in base_ocr_v20.py and pytorch_paddle.py
- Update table model initialization to use lang parameter instead of ocr_engine
- Remove unused RapidOCR initialization in rapid_table.py

Merge remote-tracking branch 'origin/dev' into dev · 9b3339f1

myhloli authored Apr 02, 2025

feat(ocr): implement dynamic OCR processing for text spans with low contrast · a024c30f

myhloli authored Apr 02, 2025

- Comment out OCR model initialization and execution for low-contrast spans
- Add batch OCR processing for collected image spans
- Adjust contrast threshold for OCR processing
- Remove unnecessary OCR processing for high-contrast spans
- Implement more efficient OCR workflow by processing multiple spans at once
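
The workflow in this commit message, collecting low-contrast spans and recognizing them in one batch, might look roughly like the sketch below. The span dictionaries, the `recognize_batch` callable, and the `0.1` threshold are hypothetical stand-ins for illustration; none of them are taken from the repository.

```python
def ocr_low_contrast_spans(spans, recognize_batch, threshold=0.1):
    """Fill in text for low-contrast spans with a single batched OCR call.

    All names and the default threshold are illustrative assumptions.
    """
    # Collect only spans whose contrast falls below the threshold;
    # high-contrast spans keep their existing text and skip OCR entirely.
    pending = [s for s in spans if s["contrast"] < threshold]
    if pending:
        # One call for the whole batch instead of one call per span.
        texts = recognize_batch([s["image"] for s in pending])
        for span, text in zip(pending, texts):
            span["content"] = text
    return spans
```

Batching this way means the recognizer runs once per page or document rather than once per span, which is usually where the speedup would come from.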

01 Apr, 2025 5 commits

Merge remote-tracking branch 'origin/dev' into dev · 62b7582f

myhloli authored Apr 01, 2025

feat(performance_stats): improve function identification in execution time logging · 978ef41c

myhloli authored Apr 01, 2025

- Enhance the logging of execution times by adding more detailed function identification
- Implement class name and module name inclusion for better traceability
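
One way to get module and class names into timing logs is a decorator that keys each measurement by `__module__` and `__qualname__`. The following is a minimal sketch under that assumption; `measure_time`, `STATS`, and `Parser` are hypothetical names, not the project's actual implementation.

```python
import functools
import time

# Hypothetical in-memory store: "module.Class.function" -> list of durations (s).
STATS = {}


def measure_time(func):
    """Record wall-clock time per call, keyed by module and qualified name."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            # __qualname__ already contains the class name for methods,
            # so two classes with a same-named method stay distinguishable.
            key = f"{func.__module__}.{func.__qualname__}"
            STATS.setdefault(key, []).append(elapsed)
    return wrapper


class Parser:
    @measure_time
    def run(self, n):
        return sum(range(n))
```

Calling `Parser().run(10)` adds one entry to `STATS` under a key ending in `Parser.run`.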

refactor(ocr): remove unused OCR dictionaries and update model configurations · 41f1fb8a

myhloli authored Apr 01, 2025

- Remove unused OCR dictionaries for Arabic, Belarusian, Bulgarian and Armenian languages
- Update model configurations in arch_config.yaml:
  - Comment out 'out_channels' for various language models
  - Rename Arabic, Korean, Japanese, Tamil and Devanagari model configurations to use 'v3' instead of 'v4'
- Delete ar_dict.txt, be_dict.txt and bg_dict.txt files
- Update arabic_dict.txt to remove blank line at the start

refactor(ocr): remove unused code and simplify model architecture · b3d6785d

myhloli authored Apr 01, 2025

- Remove unused imports and code
- Simplify model architecture by removing unnecessary components
- Update initialization and forward pass logic
- Rename variables for consistency

fix(pdf_parse_union_core_v2): suppress FutureWarning from transformers · 3cb156f5

myhloli authored Apr 01, 2025

- Added warnings module to import list
- Implemented a warning catcher to ignore FutureWarning from the transformers module
- This change prevents unnecessary warning messages during model inference
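
A warning catcher of the kind described here can be built from the standard `warnings` module. The sketch below is illustrative only: `noisy_inference` and `run_quietly` are hypothetical names, and the filter matches every `FutureWarning` rather than only those raised from `transformers`.

```python
import warnings


def noisy_inference():
    # Hypothetical stand-in for a model call that emits a FutureWarning.
    warnings.warn("this API will change", FutureWarning)
    return "result"


def run_quietly(fn):
    # Suppress FutureWarning only for the duration of the call; the previous
    # filter state is restored when the context manager exits.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return fn()
```

To narrow the filter to a single package, `warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")` can replace the `simplefilter` call; its `module` argument is a regular expression matched against the name of the module that issues the warning.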

31 Mar, 2025 3 commits

refactor(model): integrate AtomModelSingleton for OCR and improve OCR result handling · 59d6b195

myhloli authored Mar 31, 2025

- Replace direct OCR model access with AtomModelSingleton for better model management
- Round OCR scores to 2 decimal places for consistency
- Improve error handling and logging in batch analysis
- Simplify OCR result processing in pdf_parse_union_core_v2.py
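
The two ideas in this commit, a shared model manager and scores rounded to 2 decimal places, can be illustrated with a generic keyed singleton. This is a sketch under stated assumptions: `ModelSingleton`, `get_model`, and `format_ocr_result` are hypothetical names and do not reproduce the `AtomModelSingleton` API.

```python
class ModelSingleton:
    """Hypothetical keyed singleton: one shared instance, one model per key."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, name, factory, **kwargs):
        # Cache by model name plus its keyword arguments (values must be
        # hashable), so each distinct configuration is built only once.
        key = (name, tuple(sorted(kwargs.items())))
        if key not in self._models:
            self._models[key] = factory(**kwargs)
        return self._models[key]


def format_ocr_result(text, score):
    # Round the confidence score to 2 decimal places for consistent output.
    return {"text": text, "score": round(score, 2)}
```

Routing every lookup through one manager avoids loading the same model twice when several code paths request it with identical settings.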

feat(ocr): implement language-specific OCR processing · d7d85a28

myhloli authored Mar 31, 2025

- Add support for multiple languages in OCR processing
- Create separate lists for each language to improve processing efficiency
- Update OCR model initialization to use PytorchPaddleOCR instead of ModifiedPaddleOCR
- Modify get_ocr_result_list function to include language information
- Improve logging for OCR detection and recognition
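
Keeping a separate list per language can be sketched with a `defaultdict`. The function name and the `(lang, payload)` input shape below are assumptions made for illustration, not the project's data structures.

```python
from collections import defaultdict


def group_by_language(items):
    """Bucket (lang, payload) pairs so each language is processed as one batch."""
    groups = defaultdict(list)
    for lang, payload in items:
        groups[lang].append(payload)
    # Return a plain dict so later lookups of a missing language raise KeyError
    # instead of silently creating an empty bucket.
    return dict(groups)
```

With the items grouped, a language-specific OCR model would need to be selected once per bucket rather than once per item.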

feat(ocr): implement separate detection and recognition processes · a330651d

myhloli authored Mar 31, 2025

- Split OCR process into detection and recognition stages
- Update batch analysis and document analysis pipelines
- Modify OCR result formatting and handling
- Remove unused imports and optimize code structure

27 Mar, 2025 4 commits

Merge branch 'opendatalab:dev' into dev · a9b37b71

Xiaomeng Zhao authored Mar 27, 2025

Merge remote-tracking branch 'origin/dev' into dev · 3c69c569

myhloli authored Mar 27, 2025

feat(model): add OCR model base structure and utilities · a7a899f6

myhloli authored Mar 27, 2025

- Add base model structure for OCR in pytorch
- Implement data augmentation and transformation modules
- Create utilities for dictionary handling and state dict conversion
- Include post-processing modules for OCR
- Add weight initialization and loading functions

Merge pull request #2004 from icecraft/feat/remove_old_inference_code · ec566d22

Xiaomeng Zhao authored Mar 27, 2025

feat: remove old inference code

26 Mar, 2025 3 commits

feat: remove old inference code · 4fbc3689

icecraft authored Mar 26, 2025

Merge pull request #2003 from icecraft/feat/batch_analyze_with_ocr_and_lang · f6bc4f70

Xiaomeng Zhao authored Mar 26, 2025

feat: batch inference with ocr and lang flag

feat: batch inference with ocr and lang flag · bbba2a12

icecraft authored Mar 26, 2025

24 Mar, 2025 8 commits

Merge pull request #1986 from myhloli/dev · 2c8470b0

Xiaomeng Zhao authored Mar 25, 2025

refactor(pdf_parse): adjust line calculation for block height

refactor(pdf_parse): adjust line calculation for block height · 72e66c2d

myhloli authored Mar 25, 2025

- Remove unnecessary addition of 1 when calculating lines for block height
- This change affects the logic for both potential double-column and triple-column structures

Merge pull request #1985 from myhloli/dev · 26777d25

Xiaomeng Zhao authored Mar 25, 2025

refactor(pdf_parse): adjust line calculation for block height

refactor(pdf_parse): adjust line calculation for block height · 71efb101

myhloli authored Mar 25, 2025

- Remove unnecessary addition of 1 when calculating lines for block height
- This change affects the logic for both potential double-column and triple-column structures

Merge pull request #1984 from myhloli/dev · 9b048589

Xiaomeng Zhao authored Mar 25, 2025

fix(pre_proc): improve character overlap handling in OCR processing

fix(pre_proc): improve character overlap handling in OCR processing · be505a95

myhloli authored Mar 25, 2025

- Add condition to check for identical or space characters when resolving overlaps
- Skip non-conflicting character pairs to prevent unnecessary removals

Merge pull request #1981 from icecraft/fix/auto_lang_fix · 59e99fcf

Xiaomeng Zhao authored Mar 24, 2025

fix: support auto method and auto lang

fix: support auto method and auto lang · adbf4921

icecraft authored Mar 24, 2025