Commits · 978ef41cdd9fc7fc5d48cff652af7d2da8c42877 · wangsen / MinerU

01 Apr, 2025 1 commit

feat(performance_stats): improve function identification in execution time logging · 978ef41c

myhloli authored Apr 01, 2025

- Enhance the logging of execution times by adding more detailed function identification
- Implement class name and module name inclusion for better traceability

978ef41c

24 Mar, 2025 5 commits

refactor(pdf_parse): adjust line calculation for block height · 72e66c2d

myhloli authored Mar 25, 2025

- Remove unnecessary addition of 1 when calculating lines for block height
- This change affects the logic for both potential double-column and triple-column structures

72e66c2d

refactor(pdf_parse): adjust line calculation for block height · 71efb101

myhloli authored Mar 25, 2025

- Remove unnecessary addition of 1 when calculating lines for block height
- This change affects the logic for both potential double-column and triple-column structures

71efb101

fix(pre_proc): improve character overlap handling in OCR processing · be505a95

myhloli authored Mar 25, 2025

- Add condition to check for identical or space characters when resolving overlaps
- Skip non-conflicting character pairs to prevent unnecessary removals

be505a95
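
The overlap rule described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function name `drop_overlapping_chars`, the dict layout (`c`, `bbox`), and the 0.5 overlap ratio are all hypothetical.

```python
def drop_overlapping_chars(chars, overlap_ratio=0.5):
    """Drop a character only when it heavily overlaps its left neighbour
    on the x-axis AND the pair is identical or one of them is a space.

    chars: list of dicts with 'c' (the character) and
           'bbox' = (x0, y0, x1, y1). Layout is an assumption.
    """
    kept = []
    for ch in sorted(chars, key=lambda c: c["bbox"][0]):
        if kept:
            prev = kept[-1]
            px0, _, px1, _ = prev["bbox"]
            cx0, _, cx1, _ = ch["bbox"]
            overlap = min(px1, cx1) - max(px0, cx0)
            min_width = min(px1 - px0, cx1 - cx0)
            heavy = min_width > 0 and overlap / min_width > overlap_ratio
            if heavy:
                if ch["c"] == prev["c"] or ch["c"] == " ":
                    continue  # duplicate or redundant space: drop it
                if prev["c"] == " ":
                    kept[-1] = ch  # the space was the redundant one
                    continue
            # Different, non-space characters are not in conflict: keep both.
        kept.append(ch)
    return kept
```

Overlapping but distinct glyphs, such as tightly kerned letters, survive under this rule, which is the "skip non-conflicting pairs" behaviour the commit describes.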

fix: support auto method and auto lang · adbf4921

icecraft authored Mar 24, 2025

adbf4921

fix(magic_pdf): improve image resizing and padding in UnimerSwin model · 86d83c01

myhloli authored Mar 24, 2025

- Comment out margin cropping to prevent errors with broken files
- Refactor image resizing to preserve aspect ratio
- Update padding calculation and application using OpenCV

86d83c01
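
Aspect-preserving resizing with padding comes down to a small piece of arithmetic, sketched below. The function name `fit_with_padding` and the centred padding are assumptions, since the commit does not say how the padding is distributed. The returned padding tuple is ordered as `(top, bottom, left, right)`, the order OpenCV's `cv2.copyMakeBorder` takes.

```python
def fit_with_padding(width, height, target_w, target_h):
    """Return the resized size and the padding needed to reach the target.

    The image is scaled by a single factor so its aspect ratio is kept,
    then padded (centred here, by assumption) up to target_w x target_h.
    """
    scale = min(target_w / width, target_h / height)
    # Clamp to guard against floating-point rounding past the target.
    new_w = min(target_w, max(1, round(width * scale)))
    new_h = min(target_h, max(1, round(height * scale)))

    pad_w = target_w - new_w
    pad_h = target_h - new_h
    top, left = pad_h // 2, pad_w // 2
    bottom, right = pad_h - top, pad_w - left
    return (new_w, new_h), (top, bottom, left, right)
```

For example, a 200x100 image fitted into a 100x100 target is resized to 100x50 and padded by 25 pixels above and below.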

22 Mar, 2025 1 commit

refactor(ocr): improve ONNX model initialization and resource handling · cebcd2ad

myhloli authored Mar 22, 2025

- Replace deprecated importlib.resources.path with importlib.resources.files
- Simplify code structure and improve readability
- Remove unnecessary comments and empty lines

cebcd2ad
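
For readers unfamiliar with the replacement API, here is a minimal sketch of loading a packaged resource with `importlib.resources.files` (Python 3.9+). The helper name `read_resource_bytes` is hypothetical, and the standard-library `json` package stands in for a package that ships model files.

```python
from importlib.resources import as_file, files


def read_resource_bytes(package, name):
    """Read a file shipped inside an importable package."""
    resource = files(package).joinpath(name)
    # as_file() yields a real filesystem path, even when the package is
    # imported from a zip archive; useful for libraries that need a path.
    with as_file(resource) as path:
        return path.read_bytes()


# Stand-in example: read a file from the stdlib 'json' package.
data = read_resource_bytes("json", "__init__.py")
```

Unlike the deprecated `importlib.resources.path`, `files()` returns a traversable object that can also address resources in subdirectories.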

21 Mar, 2025 2 commits

feat(pre_proc): add function to remove x-overlapping characters in spans · 3f2bafa8

myhloli authored Mar 21, 2025

- Implement `remove_x_overlapping_chars` function in `ocr_span_list_modify.py`
- Integrate the new function in `pdf_parse_union_core_v2.py` to process spans
- Remove unnecessary character replacement functions and comments

3f2bafa8

refactor(model): update model downloads and disable unused models · dba28389

myhloli authored Mar 21, 2025

- Comment out LayoutLMv3, TableMaster, and StructEqTable models
- Update MFR model path to unimernet_hf_small_2503
- Remove unused import in Unimernet.py

dba28389

20 Mar, 2025 7 commits

perf(inference): adjust batch ratio for GPU memory sizes · 2f40fa7d

myhloli authored Mar 20, 2025

- Remove separate condition for GPU memory >= 24GB
- Simplify logic to use a single threshold of 16GB

2f40fa7d

perf(inference): adjust batch ratio thresholds for GPU memory sizes · 74e954da

myhloli authored Mar 20, 2025

- Increase batch ratio to 32 for GPU memory >= 24GB
- Set batch ratio to 16 for GPU memory >= 16GB
- Reduce batch ratio to 8 for GPU memory >= 12GB
- Lower batch ratio to 4 for GPU memory >= 8GB
- Set batch ratio to 2 for GPU memory >= 6GB
- Keep batch ratio at 1 for lower GPU memory sizes

74e954da
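
The threshold ladder listed in commit 74e954da can be read as a lookup from available GPU memory to a batch ratio. A minimal sketch, assuming memory is expressed in GB; the function name `batch_ratio_for` is hypothetical and not taken from the repository.

```python
# (minimum GPU memory in GB, batch ratio), highest threshold first,
# as listed in the commit message above.
BATCH_RATIO_THRESHOLDS = [(24, 32), (16, 16), (12, 8), (8, 4), (6, 2)]


def batch_ratio_for(gpu_memory_gb):
    """Return the batch ratio for the first threshold the GPU meets."""
    for min_gb, ratio in BATCH_RATIO_THRESHOLDS:
        if gpu_memory_gb >= min_gb:
            return ratio
    return 1  # below 6GB the ratio stays at 1
```

Keeping the thresholds in a table makes a later adjustment, such as the removal of the 24GB tier in commit 2f40fa7d, a one-line change.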

perf(model): enable bfloat16 for layoutreader on supported devices · 7210f7a6

myhloli authored Mar 20, 2025

- Add bf_16_support check for CUDA and MPS devices
- Use bfloat16 precision for layoutreader model on supported devices
- Improve performance on devices with bf_16 support

7210f7a6

refactor: remove torchtext deprecation warning handling · cf4ea78d

myhloli authored Mar 20, 2025

- Remove torchtext version check and deprecation warning handling from multiple files
- This code was unnecessary and potentially caused issues when torchtext was not installed

cf4ea78d

refactor(magic_pdf): remove unnecessary half() calls for CPU devices · 27281c92

myhloli authored Mar 20, 2025

- Remove half() calls for DocLayoutYOLO and YOLOv8 models
- This change prevents potential errors when running models on CPU

27281c92

refactor(model): update model initialization and dependencies · 2f3b66a5

myhloli authored Mar 20, 2025

- Update config version to 1.2.0
- Refactor model initialization in model_init.py
- Update dependencies in requirements.txt files
- Remove unused imports and models
- Add conditional imports for table models

2f3b66a5

refactor(magic_pdf): support mps device and optimize image processing · af27c0cc

myhloli authored Mar 20, 2025

- Add support for Apple M1 chips (mps device)
- Refactor image processing for better performance and compatibility
- Update model loading and inference for various devices
- Adjust batch processing and memory management

af27c0cc

19 Mar, 2025 2 commits

feat(model): add UniMERNet model configuration and processing files · 31ebceb5

myhloli authored Mar 19, 2025

- Add UnimerMBartConfig and UnimerSwinConfig classes
- Implement UnimerSwinImageProcessor for image preprocessing
- Create necessary __init__.py files for module structure

31ebceb5

style: remove unused code · e9c24739

icecraft authored Mar 19, 2025

e9c24739

17 Mar, 2025 1 commit

refactor(ocr_mkcontent): improve title level handling and formatting · c46d3373

myhloli authored Mar 17, 2025

- Move title level determination to the beginning of the Title block processing
- Add condition to include text_level only if it's not 0
- Adjust title level to 0 instead of 1 when it's less than 1

c46d3373
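
The title-level rule from this commit can be sketched as below. The function name `title_content` and the dict keys other than `text_level` are hypothetical, and the real block structure in ocr_mkcontent may differ.

```python
def title_content(text, level):
    """Build a content entry for a title block.

    Levels below 1 are normalised to 0, and text_level is only
    included when the resulting level is non-zero.
    """
    if level < 1:
        level = 0
    content = {"type": "text", "text": text}
    if level != 0:
        content["text_level"] = level
    return content
```

Under this sketch, a title with an undetermined level yields a plain text entry with no `text_level` key, rather than being promoted to level 1.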

13 Mar, 2025 5 commits

fix: import ppstructure error · c67a4793

icecraft authored Mar 13, 2025

c67a4793

fix: fix ci error: no module found of ppstructure · 6aa1d88b

icecraft authored Mar 13, 2025

6aa1d88b

doc: remove dummy log · 95f334fb

icecraft authored Mar 13, 2025

95f334fb

feat: add parallel evaluation · b50f742f

icecraft authored Mar 13, 2025

b50f742f

feat: add parallel evaluation · 3a2f86a1

icecraft authored Mar 13, 2025

3a2f86a1

12 Mar, 2025 1 commit

refactor(mfr): optimize image processing in Unimernet · 67b030eb

myhloli authored Mar 12, 2025

- Remove unnecessary __getitem__ method
- Simplify image cropping in detect_math_formula_region
- Improve code readability and efficiency

67b030eb

11 Mar, 2025 2 commits

perf(inference): optimize batch processing for different GPU memory sizes · 6116488d

myhloli authored Mar 11, 2025

- Set NPUDTCompile to false for better performance on NPU
- Adjust batch ratio

6116488d

fix(pre_proc): add Discarded block type to span block type compatibility · 7a856804

myhloli authored Mar 11, 2025

- Include BlockType.Discarded in the list of compatible block types for ContentType.Text and ContentType.InlineEquation
- This change improves the OCR dictionary merging process by handling discarded blocks more effectively

7a856804

10 Mar, 2025 1 commit

refactor(data/utils.py): remove unnecessary decorator and improve image loading · 4f7ef05d

myhloli authored Mar 10, 2025

- Remove unused @ImportPIL decorator from load_images_from_pdf function
- Update image shape handling in YOLOv11.py for better compatibility

These changes improve code readability and performance without altering the original functionality.

4f7ef05d

07 Mar, 2025 2 commits

refactor(YOLOv11): handle image processing and resizing improvements · 0a1fb1e4

myhloli authored Mar 07, 2025

- Replace PIL with cv2 for image processing
- Fix issues with image cropping and resizing
- Add boundary checks and error handling
- Optimize code for better performance and readability

0a1fb1e4

refactor(magic_pdf): replace PIL with NumPy for image processing · 1b34f7e4

myhloli authored Mar 07, 2025

- Remove PIL usage across multiple files
- Convert image processing functions to use NumPy arrays
- Update crop_img function to work with NumPy arrays
- Modify image loading and resizing to use NumPy and OpenCV
- Clean up unused imports and comments related to PIL

1b34f7e4

04 Mar, 2025 2 commits

Update version.py with new version · 4da3c0f5

myhloli authored Mar 04, 2025

4da3c0f5

refactor(magic_pdf): improve paragraph splitting logic and update dependencies · 842483cc

myhloli authored Mar 04, 2025

- Optimize paragraph splitting algorithm for better text block separation
- Update fast-langdetect dependency to ensure compatibility

842483cc

03 Mar, 2025 8 commits

Update version.py with new version · da0c2eaa

myhloli authored Mar 03, 2025

da0c2eaa

perf(inference): adjust batch ratio for high GPU memory · 0b05dff7

myhloli authored Mar 03, 2025

- Increase batch ratio to 8 for GPU memory >=16GB
- Improve inference performance on systems with higher GPU memory

0b05dff7

refactor(pre_proc): allow interline equations to be associated with text blocks · 083b787c

myhloli authored Mar 03, 2025

- Update OCR dictionary merge logic to include text blocks when processing interline equations
- This change improves the handling of equations that may be embedded within text content

083b787c

fix: caption match · fb02be19

icecraft authored Mar 03, 2025

fb02be19

perf(inference): adjust batch ratio for GPU memory sizes · 58b6ad8c

myhloli authored Mar 03, 2025

- Simplify batch ratio logic for GPU memory >= 16GB
- Remove unnecessary conditions for 20GB and 40GB memory

58b6ad8c

perf(inference): adjust batch ratio for GPU memory sizes · 0d3304d7

myhloli authored Mar 03, 2025

- Simplify batch ratio logic for GPU memory >= 16GB
- Remove unnecessary conditions for 20GB and 40GB memory

0d3304d7

perf(mfr): improve Math Formula Recognition by sorting images by area · 59fc80d4

myhloli authored Mar 03, 2025

- Sort detected images by area before processing to enhance MFR accuracy
- Implement stable sorting to maintain original order of images with equal area

59fc80d4
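
A minimal sketch of area-based stable sorting follows, assuming each image is represented by its `(width, height)`. The helper names are hypothetical, and restoring results to the input order afterwards is an assumption about how such a sort would be used, not something the commit states.

```python
def order_by_area(sizes):
    """Return indices of sizes sorted by area.

    Python's sorted() is stable, so images with equal area keep
    their original relative order.
    """
    return sorted(range(len(sizes)), key=lambda i: sizes[i][0] * sizes[i][1])


def restore_order(order, results):
    """Map results computed in sorted order back to the input order."""
    restored = [None] * len(results)
    for position, original_index in enumerate(order):
        restored[original_index] = results[position]
    return restored


sizes = [(4, 4), (2, 2), (1, 4), (3, 1)]          # areas: 16, 4, 4, 3
order = order_by_area(sizes)                       # indices by ascending area
results = [f"formula_{i}" for i in order]          # stand-in for recognition
restored = restore_order(order, results)
```

Sorting by area groups similarly sized crops together, which tends to reduce padding waste when the sorted list is split into batches.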

refactor(pdf_parse): comment out performance measurement and logging · 6bfc1711

myhloli authored Mar 03, 2025

- Comment out @measure_time decorator for txt_spans_extract_v2 and sort_lines_by_model functions
- Remove logger.info for page_process_time
- Comment out PerformanceStats.print_stats call

6bfc1711