Commits · f4ffdfe8ef9c7242d4c2e256d76f1aeef0ccc823 · wangsen / MinerU

02 Apr, 2025 7 commits

build(dependencies): update PyMuPDF, pydantic and transformers · 90321855

myhloli authored Apr 02, 2025

- Update PyMuPDF to version <1.25.0
- Update pydantic to version <2.11
- Update transformers to version < 5.0.0
- Remove always_apply parameter from alb.ToGray in image processing

90321855

feat(ocr): update OCR utility and dependencies · d09464be

myhloli authored Apr 02, 2025

- Update the default configuration path in pytorchocr_utility.py
- Add required dependencies for paddleocr2pytorch in setup.py:
  - shapely
  - pyclipper
  - omegaconf

d09464be

refactor(model): update OCR model and remove unused configs · c45a706c

myhloli authored Apr 02, 2025

- Remove unused UniMERNet and LayoutLMv3 model configurations
- Update OCR model path and dictionary path for PaddleOCR
- Modify README to update system requirements and installation instructions
- Update setup.py to include new package data

c45a706c

refactor(magic_pdf): remove unused imports and update dependencies · 243bc58c

myhloli authored Apr 02, 2025

- Remove unused imports for concurrent.futures, multiprocessing, and paddle
- Delete commented-out code
- Update numpy dependency to remove upper version limit
- Remove InferenceResult import that was commented out

243bc58c

chore: update dictionary files · 3b5d3fc8

myhloli authored Apr 02, 2025

- Add newline at the beginning of arabic_dict.txt
- Change mode of multiple dictionary files

3b5d3fc8

refactor(model): remove unused OCR and table models · d8ebd92f

myhloli authored Apr 02, 2025

- Remove OCR utils, modified PaddleOCR, and StructEqTable model
- Delete related import statements and model definitions
- Update dependencies in setup.py to remove paddlepaddle and related OCR packages

d8ebd92f

refactor(ocr): comment out print statements and update table model initialization · 5252c46e

myhloli authored Apr 02, 2025

- Comment out print statements in base_ocr_v20.py and pytorch_paddle.py
- Update table model initialization to use lang parameter instead of ocr_engine
- Remove unused RapidOCR initialization in rapid_table.py

5252c46e

01 Apr, 2025 2 commits

refactor(ocr): remove unused OCR dictionaries and update model configurations · 41f1fb8a

myhloli authored Apr 01, 2025

- Remove unused OCR dictionaries for Arabic, Belarusian, Bulgarian and Armenian languages
- Update model configurations in arch_config.yaml:
  - Comment out 'out_channels' for various language models
  - Rename Arabic, Korean, Japanese, Tamil and Devanagari model configurations to use 'v3' instead of 'v4'
- Delete ar_dict.txt, be_dict.txt and bg_dict.txt files
- Update arabic_dict.txt to remove blank line at the start

41f1fb8a

refactor(ocr): remove unused code and simplify model architecture · b3d6785d

myhloli authored Apr 01, 2025

- Remove unused imports and code
- Simplify model architecture by removing unnecessary components
- Update initialization and forward pass logic
- Rename variables for consistency

b3d6785d

31 Mar, 2025 3 commits

refactor(model): integrate AtomModelSingleton for OCR and improve OCR result handling · 59d6b195

myhloli authored Mar 31, 2025

- Replace direct OCR model access with AtomModelSingleton for better model management
- Round OCR scores to 2 decimal places for consistency
- Improve error handling and logging in batch analysis
- Simplify OCR result processing in pdf_parse_union_core_v2.py

59d6b195

feat(ocr): implement language-specific OCR processing · d7d85a28

myhloli authored Mar 31, 2025

- Add support for multiple languages in OCR processing
- Create separate lists for each language to improve processing efficiency
- Update OCR model initialization to use PytorchPaddleOCR instead of ModifiedPaddleOCR
- Modify get_ocr_result_list function to include language information
- Improve logging for OCR detection and recognition
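
The per-language lists mentioned above could look roughly like the sketch below. The record layout (a dict with a `lang` key) and the name `group_by_language` are assumptions made for illustration; they are not taken from the repository.

```python
from collections import defaultdict


def group_by_language(crops):
    """Bucket OCR crops by language, keeping each crop's original index.

    `crops` is assumed to be a list of dicts with at least a "lang" key.
    Keeping the index lets results be written back in page order after
    each language batch has been processed by its own model.
    """
    groups = defaultdict(list)
    for index, crop in enumerate(crops):
        groups[crop["lang"]].append((index, crop))
    return dict(groups)


crops = [
    {"lang": "ch", "img": "crop-a"},
    {"lang": "en", "img": "crop-b"},
    {"lang": "ch", "img": "crop-c"},
]
for lang, items in group_by_language(crops).items():
    print(lang, [index for index, _ in items])
```

Grouping first means each language-specific model only needs to be loaded and run once per batch, which is presumably the efficiency gain the commit refers to.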

d7d85a28

feat(ocr): implement separate detection and recognition processes · a330651d

myhloli authored Mar 31, 2025

- Split OCR process into detection and recognition stages
- Update batch analysis and document analysis pipelines
- Modify OCR result formatting and handling
- Remove unused imports and optimize code structure

a330651d

27 Mar, 2025 1 commit

feat(model): add OCR model base structure and utilities · a7a899f6

myhloli authored Mar 27, 2025

- Add base model structure for OCR in pytorch
- Implement data augmentation and transformation modules
- Create utilities for dictionary handling and state dict conversion
- Include post-processing modules for OCR
- Add weight initialization and loading functions

a7a899f6

26 Mar, 2025 2 commits
- feat: remove old inference code · 4fbc3689
  icecraft authored Mar 26, 2025
  
  4fbc3689
- feat: batch inference with ocr and lang flag · bbba2a12
  icecraft authored Mar 26, 2025
  
  bbba2a12
24 Mar, 2025 2 commits
- fix: support auto method and auto lang · adbf4921
  icecraft authored Mar 24, 2025
  
  adbf4921
- fix(magic_pdf): improve image resizing and padding in UnimerSwinn model · 86d83c01
  myhloli authored Mar 24, 2025
  - Comment out margin cropping to prevent errors with broken files
  - Refactor image resizing to preserve aspect ratio
  - Update padding calculation and application using OpenCV
  86d83c01
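
For commit 86d83c01, the aspect-ratio-preserving resize and the padding calculation can be sketched as plain arithmetic. This is a hedged illustration, not the repository's code: the function name `fit_with_padding` and the choice to center the image (splitting padding evenly) are assumptions.

```python
def fit_with_padding(width, height, target_width, target_height):
    """Compute a resize that preserves aspect ratio, plus the border padding.

    Returns ((new_w, new_h), (top, bottom, left, right)). Centering the
    resized image inside the target canvas is an assumption of this sketch.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Use the smaller scale factor so both sides fit inside the target.
    scale = min(target_width / width, target_height / height)
    new_w = min(target_width, max(1, round(width * scale)))
    new_h = min(target_height, max(1, round(height * scale)))
    pad_w = target_width - new_w
    pad_h = target_height - new_h
    top, left = pad_h // 2, pad_w // 2
    return (new_w, new_h), (top, pad_h - top, left, pad_w - left)


size, padding = fit_with_padding(200, 100, 100, 100)
print(size, padding)
```

With OpenCV, the returned values would map onto `cv2.resize(img, (new_w, new_h))` followed by `cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=...)`; the fill value is left open here because the commit message does not state it.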
22 Mar, 2025 1 commit

refactor(ocr): improve ONNX model initialization and resource handling · cebcd2ad

myhloli authored Mar 22, 2025

- Replace deprecated importlib.resources.path with importlib.resources.files
- Simplify code structure and improve readability
- Remove unnecessary comments and empty lines
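
The `importlib.resources` migration named in the first bullet follows a standard pattern, sketched below. The helper name `resource_path` is hypothetical, and the standard-library `json` package stands in for the real package so the snippet runs anywhere; the actual package and file names used by the ONNX loader are not shown in this log.

```python
from importlib.resources import as_file, files


def resource_path(package, name):
    """Return a Traversable for a data file bundled inside `package`.

    files() (Python 3.9+) replaces the deprecated
    importlib.resources.path(package, name) context manager.
    """
    return files(package).joinpath(name)


# Stand-in example: locate a file inside the stdlib "json" package.
init_file = resource_path("json", "__init__.py")
print(init_file.is_file())

# When a library needs a real filesystem path (for example to load a
# model file), as_file() yields one even for zipped packages.
with as_file(init_file) as concrete_path:
    print(concrete_path.name)
```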

cebcd2ad

21 Mar, 2025 1 commit

refactor(model): update model downloads and disable unused models · dba28389

myhloli authored Mar 21, 2025

- Comment out LayoutLMv3, TableMaster, and StructEqTable models
- Update MFR model path to unimernet_hf_small_2503
- Remove unused import in Unimernet.py

dba28389

20 Mar, 2025 6 commits

perf(inference): adjust batch ratio for GPU memory sizes · 2f40fa7d

myhloli authored Mar 20, 2025

- Remove separate condition for GPU memory >= 24GB
- Simplify logic to use a single threshold of 16GB

2f40fa7d

perf(inference): adjust batch ratio thresholds for GPU memory sizes · 74e954da

myhloli authored Mar 20, 2025

- Increase batch ratio to 32 for GPU memory >= 24GB
- Set batch ratio to 16 for GPU memory >= 16GB
- Reduce batch ratio to 8 for GPU memory >= 12GB
- Lower batch ratio to 4 for GPU memory >= 8GB
- Set batch ratio to 2 for GPU memory >= 6GB
- Keep batch ratio at 1 for lower GPU memory sizes
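
The tiers listed in this commit message can be written as a small lookup. This is a minimal sketch of the thresholds exactly as stated above; the function name `batch_ratio_for` and the table-driven layout are assumptions, and the later commit 2f40fa7d removes the separate 24GB tier.

```python
# (minimum GPU memory in GB, batch ratio), ordered from largest to smallest.
BATCH_RATIO_TIERS = [(24, 32), (16, 16), (12, 8), (8, 4), (6, 2)]


def batch_ratio_for(gpu_memory_gb):
    """Return the batch ratio for a given amount of GPU memory (in GB)."""
    for min_gb, ratio in BATCH_RATIO_TIERS:
        if gpu_memory_gb >= min_gb:
            return ratio
    # Below 6GB the ratio stays at 1.
    return 1


for memory in (4, 6, 8, 12, 16, 24):
    print(memory, batch_ratio_for(memory))
```

Keeping the tiers in one ordered table makes changes like the ones in this series of commits a one-line edit instead of a rewrite of an if/elif chain.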

74e954da

refactor: remove torchtext deprecation warning handling · cf4ea78d

myhloli authored Mar 20, 2025

- Remove torchtext version check and deprecation warning handling from multiple files
- This code was unnecessary and potentially caused issues when torchtext was not installed

cf4ea78d

refactor(magic_pdf): remove unnecessary half() calls for CPU devices · 27281c92

myhloli authored Mar 20, 2025

- Remove half() calls for DocLayoutYOLO and YOLOv8 models
- This change prevents potential errors when running models on CPU

27281c92

refactor(model): update model initialization and dependencies · 2f3b66a5

myhloli authored Mar 20, 2025

- Update config version to 1.2.0
- Refactor model initialization in model_init.py
- Update dependencies in requirements.txt files
- Remove unused imports and models
- Add conditional imports for table models

2f3b66a5

refactor(magic_pdf): support mps device and optimize image processing · af27c0cc

myhloli authored Mar 20, 2025

- Add support for Apple M1 chips (mps device)
- Refactor image processing for better performance and compatibility
- Update model loading and inference for various devices
- Adjust batch processing and memory management

af27c0cc

19 Mar, 2025 2 commits
- feat(model): add UniMERNet model configuration and processing files · 31ebceb5
  myhloli authored Mar 19, 2025
  - Add UnimerMBartConfig and UnimerSwinConfig classes
  - Implement UnimerSwinImageProcessor for image preprocessing
  - Create necessary __init__.py files for module structure
  31ebceb5
- style: remove unused code · e9c24739
  icecraft authored Mar 19, 2025
  
  e9c24739
13 Mar, 2025 4 commits
- fix: import ppstructure error · c67a4793
  icecraft authored Mar 13, 2025
  
  c67a4793
- fix: fix ci error: no module found of ppstructure · 6aa1d88b
  icecraft authored Mar 13, 2025
  
  6aa1d88b
- feat: add parallel evaluation · b50f742f
  icecraft authored Mar 13, 2025
  
  b50f742f
- feat: add parallel evaluation · 3a2f86a1
  icecraft authored Mar 13, 2025
  
  3a2f86a1
12 Mar, 2025 1 commit

refactor(mfr): optimize image processing in Unimernet · 67b030eb

myhloli authored Mar 12, 2025

- Remove unnecessary __getitem__ method
- Simplify image cropping in detect_math_formula_region
- Improve code readability and efficiency

67b030eb

11 Mar, 2025 1 commit
- perf(inference): optimize batch processing for different GPU memory sizes · 6116488d
  myhloli authored Mar 11, 2025
  - Set NPUDTCompile to false for better performance on NPU
  - Adjust batch ratio
  6116488d
10 Mar, 2025 1 commit

refactor(data/utils.py): remove unnecessary decorator and improve image loading · 4f7ef05d

myhloli authored Mar 10, 2025

- Remove unused @ImportPIL decorator from load_images_from_pdf function
- Update image shape handling in YOLOv11.py for better compatibility

These changes improve code readability and performance without altering the original functionality.

4f7ef05d

07 Mar, 2025 2 commits

refactor(YOLOv11): handle image processing and resizing improvements · 0a1fb1e4

myhloli authored Mar 07, 2025

- Replace PIL with cv2 for image processing
- Fix issues with image cropping and resizing
- Add boundary checks and error handling
- Optimize code for better performance and readability

0a1fb1e4

refactor(magic_pdf): replace PIL with NumPy for image processing · 1b34f7e4

myhloli authored Mar 07, 2025

- Remove PIL usage across multiple files
- Convert image processing functions to use NumPy arrays
- Update crop_img function to work with NumPy arrays
- Modify image loading and resizing to use NumPy and OpenCV
- Clean up unused imports and comments related to PIL

1b34f7e4

03 Mar, 2025 4 commits
- perf(inference): adjust batch ratio for high GPU memory · 0b05dff7
  myhloli authored Mar 03, 2025
  - Increase batch ratio to 8 for GPU memory >= 16GB
  - Improve inference performance on systems with higher GPU memory
  0b05dff7
- fix: caption match · fb02be19
  icecraft authored Mar 03, 2025
  
  fb02be19
- perf(inference): adjust batch ratio for GPU memory sizes · 58b6ad8c
  myhloli authored Mar 03, 2025
  - Simplify batch ratio logic for GPU memory >= 16GB
  - Remove unnecessary conditions for 20GB and 40GB memory
  58b6ad8c
- perf(inference): adjust batch ratio for GPU memory sizes · 0d3304d7
  myhloli authored Mar 03, 2025
  - Simplify batch ratio logic for GPU memory >= 16GB
  - Remove unnecessary conditions for 20GB and 40GB memory
  0d3304d7