Commits · a024c30fc489589eb39028cb43473d033669a7a8 · wangsen / MinerU

02 Apr, 2025 1 commit

feat(ocr): implement dynamic OCR processing for text spans with low contrast · a024c30f

myhloli authored Apr 02, 2025

- Comment out OCR model initialization and execution for low-contrast spans
- Add batch OCR processing for collected image spans
- Adjust contrast threshold for OCR processing
- Remove unnecessary OCR processing for high-contrast spans
- Implement more efficient OCR workflow by processing multiple spans at once

a024c30f

01 Apr, 2025 3 commits

refactor(ocr): remove unused OCR dictionaries and update model configurations · 41f1fb8a

myhloli authored Apr 01, 2025

- Remove unused OCR dictionaries for Arabic, Belarusian, Bulgarian and Armenian languages
- Update model configurations in arch_config.yaml:
  - Comment out 'out_channels' for various language models
  - Rename Arabic, Korean, Japanese, Tamil and Devanagari model configurations to use 'v3' instead of 'v4'
- Delete ar_dict.txt, be_dict.txt and bg_dict.txt files
- Update arabic_dict.txt to remove blank line at the start

41f1fb8a

refactor(ocr): remove unused code and simplify model architecture · b3d6785d

myhloli authored Apr 01, 2025

- Remove unused imports and code
- Simplify model architecture by removing unnecessary components
- Update initialization and forward pass logic
- Rename variables for consistency

b3d6785d

fix(pdf_parse_union_core_v2): suppress FutureWarning from transformers · 3cb156f5

myhloli authored Apr 01, 2025

- Added warnings module to import list
- Implemented a warning catcher to ignore FutureWarning from the transformers module
- This change prevents unnecessary warning messages during model inference

3cb156f5
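
A minimal sketch of the warning catcher this commit describes, assuming it relies on the standard `warnings` module. The names `run_quietly` and `noisy_inference` are hypothetical and are not taken from `pdf_parse_union_core_v2.py`.

```python
import warnings


def run_quietly(fn, *args, **kwargs):
    """Call fn with FutureWarning ignored; other warning categories still surface."""
    with warnings.catch_warnings():
        # The filter is active only inside this block; the previous filters are
        # restored on exit. Passing module=r"transformers" as well would narrow
        # the filter to warnings issued from that package (an assumption about
        # how the commit scopes it).
        warnings.filterwarnings("ignore", category=FutureWarning)
        return fn(*args, **kwargs)


def noisy_inference(x):
    # Hypothetical stand-in for a model call that emits a FutureWarning.
    warnings.warn("this API will change", FutureWarning)
    return x * 2
```

Scoping the filter with a context manager keeps the suppression local to the inference call, so unrelated warnings elsewhere in the program are unaffected.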

31 Mar, 2025 3 commits

refactor(model): integrate AtomModelSingleton for OCR and improve OCR result handling · 59d6b195

myhloli authored Mar 31, 2025

- Replace direct OCR model access with AtomModelSingleton for better model management
- Round OCR scores to 2 decimal places for consistency
- Improve error handling and logging in batch analysis
- Simplify OCR result processing in pdf_parse_union_core_v2.py

59d6b195

feat(ocr): implement language-specific OCR processing · d7d85a28

myhloli authored Mar 31, 2025

- Add support for multiple languages in OCR processing
- Create separate lists for each language to improve processing efficiency
- Update OCR model initialization to use PytorchPaddleOCR instead of ModifiedPaddleOCR
- Modify get_ocr_result_list function to include language information
- Improve logging for OCR detection and recognition

d7d85a28

feat(ocr): implement separate detection and recognition processes · a330651d

myhloli authored Mar 31, 2025

- Split OCR process into detection and recognition stages
- Update batch analysis and document analysis pipelines
- Modify OCR result formatting and handling
- Remove unused imports and optimize code structure

a330651d

27 Mar, 2025 4 commits
- Merge branch 'opendatalab:dev' into dev · a9b37b71
  Xiaomeng Zhao authored Mar 27, 2025
  
  a9b37b71
- Merge remote-tracking branch 'origin/dev' into dev · 3c69c569
  myhloli authored Mar 27, 2025
  
  3c69c569
- feat(model): add OCR model base structure and utilities · a7a899f6
  myhloli authored Mar 27, 2025
```
- Add base model structure for OCR in pytorch
- Implement data augmentation and transformation modules
- Create utilities for dictionary handling and state dict conversion
- Include post-processing modules for OCR
- Add weight initialization and loading functions
```
  a7a899f6
- Merge pull request #2004 from icecraft/feat/remove_old_inference_code · ec566d22
  Xiaomeng Zhao authored Mar 27, 2025
```
feat: remove old inference code
```
  ec566d22

26 Mar, 2025 3 commits
- feat: remove old inference code · 4fbc3689
  icecraft authored Mar 26, 2025
  
  4fbc3689
- Merge pull request #2003 from icecraft/feat/batch_analyze_with_ocr_and_lang · f6bc4f70
  Xiaomeng Zhao authored Mar 26, 2025
```
feat: batch inference with ocr and lang flag
```
  f6bc4f70
- feat: batch inference with ocr and lang flag · bbba2a12
  icecraft authored Mar 26, 2025
  
  bbba2a12

24 Mar, 2025 10 commits
- Merge pull request #1986 from myhloli/dev · 2c8470b0
  Xiaomeng Zhao authored Mar 25, 2025
```
refactor(pdf_parse): adjust line calculation for block height
```
  2c8470b0
- refactor(pdf_parse): adjust line calculation for block height · 72e66c2d
  myhloli authored Mar 25, 2025
```
- Remove unnecessary addition of 1 when calculating lines for block height
- This change affects the logic for both potential double-column and triple-column structures
```
  72e66c2d
- Merge pull request #1985 from myhloli/dev · 26777d25
  Xiaomeng Zhao authored Mar 25, 2025
```
refactor(pdf_parse): adjust line calculation for block height
```
  26777d25
- refactor(pdf_parse): adjust line calculation for block height · 71efb101
  myhloli authored Mar 25, 2025
```
- Remove unnecessary addition of 1 when calculating lines for block height
- This change affects the logic for both potential double-column and triple-column structures
```
  71efb101
- Merge pull request #1984 from myhloli/dev · 9b048589
  Xiaomeng Zhao authored Mar 25, 2025
```
fix(pre_proc): improve character overlap handling in OCR processing
```
  9b048589
- fix(pre_proc): improve character overlap handling in OCR processing · be505a95
  myhloli authored Mar 25, 2025
```
- Add condition to check for identical or space characters when resolving overlaps
- Skip non-conflicting character pairs to prevent unnecessary removals
```
  be505a95
- Merge pull request #1981 from icecraft/fix/auto_lang_fix · 59e99fcf
  Xiaomeng Zhao authored Mar 24, 2025
```
fix: support auto method and auto lang
```
  59e99fcf
- fix: support auto method and auto lang · adbf4921
  icecraft authored Mar 24, 2025
  
  adbf4921
- Merge pull request #1980 from myhloli/dev · dc9322cc
  Xiaomeng Zhao authored Mar 24, 2025
```
fix(magic_pdf): improve image resizing and padding in UnimerSwinn model
```
  dc9322cc
- fix(magic_pdf): improve image resizing and padding in UnimerSwinn model · 86d83c01
  myhloli authored Mar 24, 2025
```
- Comment out margin cropping to prevent errors with broken files
- Refactor image resizing to preserve aspect ratio
- Update padding calculation and application using OpenCV
```
  86d83c01

22 Mar, 2025 2 commits

Merge pull request #1974 from myhloli/dev · eb02736a
Xiaomeng Zhao authored Mar 22, 2025
```
refactor(ocr): improve ONNX model initialization and resource handling
```
eb02736a

refactor(ocr): improve ONNX model initialization and resource handling · cebcd2ad

myhloli authored Mar 22, 2025

- Replace deprecated importlib.resources.path with importlib.resources.files
- Simplify code structure and improve readability
- Remove unnecessary comments and empty lines

cebcd2ad
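
The move from `importlib.resources.path` to `importlib.resources.files` can be sketched as follows. This is an illustration only: the helper name `resource_is_available` is hypothetical, and the package and file names in the usage note are placeholders, not the paths used by the ONNX model loader.

```python
from importlib.resources import as_file, files


def resource_is_available(package: str, name: str) -> bool:
    """Locate a file bundled inside a package and check that it is readable as a file."""
    # files() returns a Traversable rooted at the package; joinpath() selects the resource.
    resource = files(package).joinpath(name)
    # as_file() yields a real filesystem path, which libraries that expect a
    # path string (for example an ONNX runtime session) can consume.
    with as_file(resource) as path:
        return path.is_file()
```

For example, `resource_is_available("json", "__init__.py")` checks a file shipped with the standard library's `json` package; a model loader would pass its own package and weight file instead.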

21 Mar, 2025 4 commits

Merge pull request #1970 from myhloli/dev · 6a3cdb8d
Xiaomeng Zhao authored Mar 21, 2025
```
feat(pre_proc): add function to remove x-overlapping characters in spans
```
6a3cdb8d

Merge remote-tracking branch 'origin/dev' into dev · a2808f3a
myhloli authored Mar 21, 2025

a2808f3a

feat(pre_proc): add function to remove x-overlapping characters in spans · 3f2bafa8

myhloli authored Mar 21, 2025

- Implement `remove_x_overlapping_chars` function in `ocr_span_list_modify.py`
- Integrate the new function in `pdf_parse_union_core_v2.py` to process spans
- Remove unnecessary character replacement functions and comments

3f2bafa8
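
A rough sketch of what removing x-overlapping characters could look like. Everything here is an assumption made for illustration: the function name `drop_x_overlapping_chars`, the `bbox`/`c` character layout, the `min_overlap` parameter and the keep/drop policy are hypothetical, and the actual `remove_x_overlapping_chars` may differ. The rule that only identical or space characters count as conflicts follows the later fix be505a95 listed in this log.

```python
def drop_x_overlapping_chars(chars, min_overlap):
    """Return a new list without characters that conflict with their left neighbour.

    Each char is a dict with "bbox" = (x0, y0, x1, y1) and "c" = the character.
    Illustrative only; not the project's implementation.
    """
    kept = []
    for ch in chars:
        if kept:
            prev = kept[-1]
            # Width of the horizontal intersection (negative when the boxes are disjoint).
            overlap = min(prev["bbox"][2], ch["bbox"][2]) - max(prev["bbox"][0], ch["bbox"][0])
            # Only identical characters, or pairs involving a space, are treated as conflicts.
            conflicting = prev["c"] == ch["c"] or " " in (prev["c"], ch["c"])
            if overlap > min_overlap and conflicting:
                # Prefer the non-space character; for identical characters keep the earlier one.
                if prev["c"] == " " and ch["c"] != " ":
                    kept[-1] = ch
                continue
        kept.append(ch)
    return kept
```

Non-conflicting pairs such as two different letters are left untouched even when their boxes overlap, which matches the stated goal of avoiding unnecessary removals.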

refactor(model): update model downloads and disable unused models · dba28389

myhloli authored Mar 21, 2025

- Comment out LayoutLMv3, TableMaster, and StructEqTable models
- Update MFR model path to unimernet_hf_small_2503
- Remove unused import in Unimernet.py

dba28389

20 Mar, 2025 10 commits

Merge pull request #1959 from myhloli/dev · 07eaa2d7
Xiaomeng Zhao authored Mar 20, 2025
```
Dev push
```
07eaa2d7

perf(inference): adjust batch ratio thresholds for GPU memory sizes · 2f40fa7d
myhloli authored Mar 20, 2025
```
- Remove separate condition for GPU memory >= 24GB
- Simplify logic to use a single threshold of 16GB
```
2f40fa7d

perf(inference): adjust batch ratio thresholds for GPU memory sizes · 74e954da

myhloli authored Mar 20, 2025

- Increase batch ratio to 32 for GPU memory >= 24GB
- Set batch ratio to 16 for GPU memory >= 16GB
- Reduce batch ratio to 8 for GPU memory >= 12GB
- Lower batch ratio to 4 for GPU memory >= 8GB
- Set batch ratio to 2 for GPU memory >= 6GB
- Keep batch ratio at 1 for lower GPU memory sizes

74e954da
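
The tiers listed in this commit amount to a simple lookup from GPU memory to batch ratio. The sketch below mirrors those numbers only; the names `BATCH_RATIO_TIERS` and `batch_ratio_for` are hypothetical, and the follow-up commit 2f40fa7d changes the thresholds again.

```python
# (minimum GPU memory in GB, batch ratio), ordered from largest to smallest tier.
BATCH_RATIO_TIERS = [(24, 32), (16, 16), (12, 8), (8, 4), (6, 2)]


def batch_ratio_for(gpu_memory_gb: float) -> int:
    """Return the batch ratio of the first tier whose memory floor is met."""
    for min_gb, ratio in BATCH_RATIO_TIERS:
        if gpu_memory_gb >= min_gb:
            return ratio
    # Below 6 GB the ratio stays at 1.
    return 1
```

Keeping the tiers in a table rather than an if/elif chain makes later threshold changes a one-line edit.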

perf(model): enable bfloat16 for layoutreader on supported devices · 7210f7a6

myhloli authored Mar 20, 2025

- Add bf_16_support check for CUDA and MPS devices
- Use bfloat16 precision for layoutreader model on supported devices
- Improve performance on devices with bf_16 support

7210f7a6

Merge pull request #1958 from myhloli/dev · 132c16ad
Xiaomeng Zhao authored Mar 20, 2025
```
refactor: remove torchtext deprecation warning handling
```
132c16ad

refactor: remove torchtext deprecation warning handling · cf4ea78d

myhloli authored Mar 20, 2025

- Remove torchtext version check and deprecation warning handling from multiple files
- This code was unnecessary and potentially caused issues when torchtext was not installed

cf4ea78d

Merge remote-tracking branch 'origin/dev' into dev · 9ce72d78
myhloli authored Mar 20, 2025

9ce72d78

Merge pull request #1957 from myhloli/dev · e4074828
Xiaomeng Zhao authored Mar 20, 2025
```
refactor(magic_pdf): remove unnecessary half() calls for CPU devices
```
e4074828

refactor(magic_pdf): remove unnecessary half() calls for CPU devices · 27281c92

myhloli authored Mar 20, 2025

- Remove half() calls for DocLayoutYOLO and YOLOv8 models
- This change prevents potential errors when running models on CPU

27281c92

Merge pull request #1956 from myhloli/dev · 5a3283c8
Xiaomeng Zhao authored Mar 20, 2025
```
build(docker&setup): add ftfy package
```
5a3283c8