Commits · 73ccfbbfbed48356cedb826cd714c5d718c77189 · wangsen / MinerU

14 Apr, 2025 3 commits

Update version.py with new version · 29b47466
myhloli authored Apr 14, 2025

29b47466

fix(magic_pdf): correct range for images in document analysis · 67b31a78

myhloli authored Apr 14, 2025

- Update the range used to generate images_with_extra_info to match the number of images
- This fixes a potential IndexError when the number of images differs from the dataset length

67b31a78
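
The IndexError described in 67b31a78 follows a common pattern: a loop range is taken from one length while a different, shorter list is indexed. The sketch below illustrates that pattern only. `build_images_with_extra_info` and `dataset_len` are hypothetical names, and this is not the actual MinerU code.

```python
def build_images_with_extra_info(images, ocr_enabled, lang):
    """Attach per-image settings, bounding the loop by the image list itself."""
    # Taking the range from len(images), rather than from a separately
    # tracked dataset length, means the index can never run past the list.
    return [(images[i], ocr_enabled, lang) for i in range(len(images))]


images = ["page0.png", "page1.png"]
dataset_len = 3  # hypothetical mismatch: the dataset reports one more page

# Looping over range(dataset_len) instead would raise IndexError at i == 2.
images_with_extra_info = build_images_with_extra_info(images, True, "en")
```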

refactor(footnote_detection): adjust footnote detection threshold · 8caf59f7

myhloli authored Apr 14, 2025

- Change footnote detection threshold from 50% of page height to 30%
- Improve accuracy of footnote identification in PDF processing

8caf59f7

12 Apr, 2025 2 commits

Update version.py with new version · 5957cb65
myhloli authored Apr 12, 2025

5957cb65

feat(magic_pdf): add logging for batch image processing · afe1b02c

myhloli authored Apr 12, 2025

- Add batch processing logs to track the progress of image analysis
- Display the current batch number, total batches, and the number of processed pages

afe1b02c

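
The batch logging described in afe1b02c can be sketched roughly as follows. This is a minimal illustration using the standard `logging` module. `process_in_batches` and the `analyze` callback are hypothetical names, and the real project may use a different logger and message format.

```python
import logging

logger = logging.getLogger("batch_demo")


def process_in_batches(pages, batch_size, analyze):
    """Run analyze() on fixed-size batches and log progress after each one."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total_batches = (len(pages) + batch_size - 1) // batch_size  # ceiling division
    results = []
    processed = 0
    for index in range(total_batches):
        batch = pages[index * batch_size:(index + 1) * batch_size]
        results.extend(analyze(batch))
        processed += len(batch)
        # Report the current batch number, total batches, and processed pages.
        logger.info(
            "Batch %d/%d done, %d/%d pages processed",
            index + 1, total_batches, processed, len(pages),
        )
    return results


doubled = process_in_batches(list(range(5)), 2, lambda batch: [p * 2 for p in batch])
```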
11 Apr, 2025 2 commits

refactor(tools): improve code readability and maintainability · 54ce594b

myhloli authored Apr 11, 2025

- Remove unnecessary line breaks and adjust indentation
- Update function call to use named arguments for better readability
- Modify _do_parse function call to use MakeMode.MM_MD instead of

54ce594b

refactor(model): optimize batch processing and inference · d2fc9dab

myhloli authored Apr 11, 2025

- Update batch processing logic for improved efficiency
- Refactor image analysis and inference methods
- Optimize dataset handling and image retrieval
- Improve error handling and logging in batch processes

d2fc9dab

10 Apr, 2025 1 commit

feat: inference with iter style · 43164533
icecraft authored Apr 10, 2025

43164533

09 Apr, 2025 7 commits

refactor(ocr): comment out det_count update and update OCR models · f8323ae0

myhloli authored Apr 09, 2025

- Comment out the line that updates det_count in batch_analyze.py
- Add a new OCR model configuration for Chinese (ch_lite) in models_config.yml
- Update the Chinese OCR model configuration to use a different recognition model

f8323ae0

fix(dataset): correct variable for language detection · 814bd4ea

myhloli authored Apr 09, 2025

- Change `bits` to `self._data_bits` for language detection
- This fixes the TypeError when opening PDF files

814bd4ea

perf(table): optimize aspect ratio calculation for text boxes · 4afdba36

myhloli authored Apr 09, 2025

- Simplify aspect ratio calculation using direct coordinate subtraction
- Remove unnecessary list append operation
- Improve code readability and performance in table rotation detection

4afdba36

feat(table): add orientation detection and rotation for portrait tables · ac893f32

myhloli authored Apr 09, 2025

- Implement table orientation detection to identify if a table is in portrait mode
- Add rotation logic to turn portrait tables 90 degrees clockwise before OCR
- Update OCR processing to work with potentially rotated images
- Improve text box analysis to determine if a table is rotated

ac893f32
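
One way to picture the orientation check from ac893f32 and 4afdba36: compute each text box's aspect ratio by direct coordinate subtraction, and treat the table as rotated when most boxes are taller than they are wide. The sketch below assumes boxes given as `(x0, y0, x1, y1)`. The function names and the 0.5 vote share are hypothetical, not the values used in MinerU.

```python
def looks_rotated(text_boxes, min_vertical_share=0.5):
    """Return True when most text boxes are taller than wide."""
    if not text_boxes:
        return False
    vertical = 0
    for x0, y0, x1, y1 in text_boxes:
        width = x1 - x0   # direct coordinate subtraction
        height = y1 - y0
        if width > 0 and height / width > 1.0:
            vertical += 1
    return vertical / len(text_boxes) > min_vertical_share


def rotate_clockwise(image):
    """Rotate a row-major 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


boxes = [(0, 0, 10, 40), (20, 0, 30, 50), (0, 60, 40, 70)]
grid = [[1, 2], [3, 4]]
if looks_rotated(boxes):
    grid = rotate_clockwise(grid)
```

A real implementation would rotate an image array before handing it to OCR; the nested list here only keeps the example dependency-free.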

fix(ocr): handle NaN values in recognition scores · c97959e4

myhloli authored Apr 09, 2025

- Update predict_rec.py to check for NaN values in recognition results
- Replace NaN scores with 0.0 to ensure stability and consistency

c97959e4
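
The NaN handling in c97959e4 amounts to a guard like the one below. This is a minimal sketch, assuming recognition results are `(text, score)` pairs. `sanitize_scores` is a hypothetical helper, not a function from predict_rec.py.

```python
import math


def sanitize_scores(rec_results):
    """Replace NaN confidence scores with 0.0, leaving other scores untouched."""
    return [
        (text, 0.0 if math.isnan(score) else score)
        for text, score in rec_results
    ]


cleaned = sanitize_scores([("abc", 0.9), ("", float("nan"))])
```

NaN compares unequal to everything, including itself, so an explicit `math.isnan` check is needed; a test such as `score == float("nan")` would never match.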

feat(model): improve table recognition by merging and filtering tables · df7ae404

myhloli authored Apr 09, 2025

- Add functions to calculate IoU, check if tables are inside each other, and merge tables
- Implement table merging for high IoU tables
- Add filtering to remove nested tables that don't overlap but cover a large area
- Update table_res_list and layout_res to reflect these changes

df7ae404
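
The geometric helpers mentioned in df7ae404 (IoU, containment, merging) can be sketched as below for axis-aligned boxes `(x0, y0, x1, y1)`. The function names and the 0.9 containment threshold are assumptions for illustration, and the commit's nested-table filtering rule is not reproduced here.

```python
def box_area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection_area(a, b):
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def iou(a, b):
    """Intersection over union of two boxes; 0.0 when both are empty."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_inside(inner, outer, overlap_threshold=0.9):
    """True when at least overlap_threshold of inner's area lies within outer."""
    area = box_area(inner)
    return area > 0 and intersection_area(inner, outer) / area >= overlap_threshold


def merge_boxes(a, b):
    """Smallest box covering both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
```

For example, two 10x10 boxes offset by 5 horizontally share an area of 50 out of a union of 150, so their IoU is 1/3.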

fix: support page range · 29c42a1a
icecraft authored Apr 09, 2025

29c42a1a

08 Apr, 2025 3 commits

refactor(ocr): improve OCR score precision to three decimal places · ea730ae2

myhloli authored Apr 08, 2025

- Update OCR score formatting in batch_analyze.py and pdf_parse_union_core_v2.py
- Change score rounding method to preserve three decimal places
- Enhance accuracy representation without significantly altering the score value

ea730ae2
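
The commit message for ea730ae2 does not say which rounding method was adopted. As a hedged illustration only, here are two common ways to keep three decimal places in Python; both helper names are hypothetical.

```python
def round_score(score):
    """Round to the nearest value with three decimal places."""
    return round(score, 3)


def truncate_score(score):
    """Drop everything past the third decimal place (for non-negative scores)."""
    return int(score * 1000) / 1000


rounded = round_score(0.98768)      # 0.988
truncated = truncate_score(0.98768)  # 0.987
```

Rounding can nudge a score upward, while truncation never reports a value higher than the original.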

Update version.py with new version · 79feb926
myhloli authored Apr 08, 2025

79feb926

fix(table): add model path for slanet-plus to resolve RapidTableError · e327e9ba

myhloli authored Apr 08, 2025

- Import os and pathlib modules to handle file paths
- Define the path to the slanet-plus model
- Update RapidTableInput initialization to include the model path

e327e9ba

07 Apr, 2025 2 commits

fix(model): improve VRAM detection and handling · d32a63ca

myhloli authored Apr 07, 2025

- Refactor VRAM detection logic for better readability and efficiency
- Add fallback mechanism for unknown VRAM sizes
- Improve device checking in get_vram function

d32a63ca
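
A fallback for unknown VRAM sizes, as described in d32a63ca, might look like the sketch below. It assumes PyTorch is the backend and degrades gracefully when it is not installed. `get_vram_gb`, `choose_batch_ratio`, and the size thresholds are hypothetical and are not taken from the MinerU source.

```python
def get_vram_gb(device="cuda"):
    """Return total VRAM in GiB for a CUDA device, or None if it is unknown."""
    if not str(device).startswith("cuda"):
        return None  # only CUDA devices are handled in this sketch
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device)
    return round(props.total_memory / (1024 ** 3))


def choose_batch_ratio(vram_gb, default=1):
    """Pick a batch multiplier, falling back to a safe default when VRAM is unknown."""
    if vram_gb is None:
        return default
    if vram_gb >= 16:  # hypothetical threshold
        return 8
    if vram_gb >= 8:   # hypothetical threshold
        return 4
    return default


batch_ratio = choose_batch_ratio(get_vram_gb("cpu"))
```

Returning `None` for "unknown" keeps the detection step separate from the policy decision, so the fallback lives in one place.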

fix: image dataset add lang field · e36a083d
icecraft authored Apr 07, 2025

e36a083d

03 Apr, 2025 7 commits

Update version.py with new version · d629ce04
myhloli authored Apr 03, 2025

d629ce04

fix: convert image with pymupdf · 3e8ee23e
icecraft authored Apr 03, 2025

3e8ee23e

fix: support non-pdf file in batch mode · 3379f3b3
icecraft authored Apr 03, 2025

3379f3b3

refactor(magic_pdf): optimize table recognition and layout detection · 1fd72f5f

myhloli authored Apr 03, 2025

- Update table recognition logic to process each table individually
- Refactor layout detection to use tqdm for progress tracking
- Optimize OCR recognition by using a single tqdm wrapper
- Improve MFR prediction with a more accurate progress bar
- Simplify MFD prediction by removing unnecessary total calculation

1fd72f5f

refactor(magic_pdf): remove OCR timing measurement code · 795233d1

myhloli authored Apr 03, 2025

- Comment out OCR timing measurement code to improve readability and performance
- Remove unnecessary logging of OCR processing time

795233d1

refactor(magic_pdf): optimize code and improve logging · 553f250f

myhloli authored Apr 03, 2025

- Remove unused imports and comments
- Increase MIN_BATCH_INFERENCE_SIZE from 100 to 200
- Comment out VRAM cleaning and logging in batch_analyze.py
- Simplify code in doc_analyze_by_custom_model.py
- Add tqdm progress bar in pdf_parse_union_core_v2.py
- Enable tqdm in OCR processing

553f250f

feat(model): add tqdm progress bar to model prediction loops · 8e1c2339

myhloli authored Apr 03, 2025

- Add tqdm progress bar to batch prediction loops in multiple model modules
- Improve logging and error handling in batch analysis script
- Update table model initialization to use default sub-model if none specified
- Add tqdm dependency to requirements.txt

8e1c2339

02 Apr, 2025 11 commits

feat(model): update Chinese OCR detection model to PP-OCRv3 · ddfeea94

myhloli authored Apr 03, 2025

- Replace ch_PP-OCRv4_det_infer.pth with ch_PP-OCRv3_det_infer.pth in models_config.yml
- Add new ch_PP-OCRv3_det_infer model configuration in arch_config.yaml

ddfeea94

refactor(ocr): remove redundant code and improve code quality · c4010ae0

myhloli authored Apr 03, 2025

- Remove unnecessary GPU checks and cuda() calls
- Consolidate tensor device placement using .to(self.device)
- Add warning suppression for cleaner output
- Refactor conditional logic for better readability

c4010ae0

refactor(demo): simplify batch_demo.py and update demo.py · b0e220c5

myhloli authored Apr 02, 2025

- Remove unnecessary imports and code in batch_demo.py
- Update demo.py to use relative paths and improve code structure
- Adjust output directory structure in both scripts
- Remove redundant code and simplify functions

b0e220c5

build(dependencies): update PyMuPDF, pydantic and transformers · 90321855

myhloli authored Apr 02, 2025

- Update PyMuPDF to version <1.25.0
- Update pydantic to version <2.11
- Update transformers to version < 5.0.0
- Remove always_apply parameter from alb.ToGray in image processing

90321855

feat(ocr): update OCR utility and dependencies · d09464be

myhloli authored Apr 02, 2025

- Update the default configuration path in pytorchocr_utility.py
- Add required dependencies for paddleocr2pytorch in setup.py:
  - shapely
  - pyclipper
  - omegaconf

d09464be

refactor(model): update OCR model and remove unused configs · c45a706c

myhloli authored Apr 02, 2025

- Remove unused UniMERNet and LayoutLMv3 model configurations
- Update OCR model path and dictionary path for PaddleOCR
- Modify README to update system requirements and installation instructions
- Update setup.py to include new package data

c45a706c

refactor(magic_pdf): remove unused imports and update dependencies · 243bc58c

myhloli authored Apr 02, 2025

- Remove unused imports for concurrent.futures, multiprocessing, and paddle
- Delete commented-out code
- Update numpy dependency to remove upper version limit
- Remove InferenceResult import that was commented out

243bc58c

chore: update dictionary files · 3b5d3fc8

myhloli authored Apr 02, 2025

- Add newline at the beginning of arabic_dict.txt
- Change mode of multiple dictionary files

3b5d3fc8

refactor(model): remove unused OCR and table models · d8ebd92f

myhloli authored Apr 02, 2025

- Remove OCR utils, modified PaddleOCR, and StructEqTable model
- Delete related import statements and model definitions
- Update dependencies in setup.py to remove paddlepaddle and related OCR packages

d8ebd92f

refactor(ocr): comment out print statements and update table model initialization · 5252c46e

myhloli authored Apr 02, 2025

- Comment out print statements in base_ocr_v20.py and pytorch_paddle.py
- Update table model initialization to use lang parameter instead of ocr_engine
- Remove unused RapidOCR initialization in rapid_table.py

5252c46e

feat(ocr): implement dynamic OCR processing for text spans with low contrast · a024c30f

myhloli authored Apr 02, 2025

- Comment out OCR model initialization and execution for low-contrast spans
- Add batch OCR processing for collected image spans
- Adjust contrast threshold for OCR processing
- Remove unnecessary OCR processing for high-contrast spans
- Implement more efficient OCR workflow by processing multiple spans at once

a024c30f

01 Apr, 2025 2 commits

feat(performance_stats): improve function identification in execution time logging · 978ef41c

myhloli authored Apr 01, 2025

- Enhance the logging of execution times by adding more detailed function identification
- Implement class name and module name inclusion for better traceability

978ef41c
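
Including the module and class name in timing logs, as 978ef41c describes, can be done with a decorator along these lines. This is a sketch under assumptions: `measure_time`, the `timings` registry, and the `Detector` class are hypothetical, and the project's own performance_stats module may be organised differently.

```python
import functools
import time
from collections import defaultdict

timings = defaultdict(list)  # maps "module.Class.method" to elapsed seconds


def measure_time(func):
    """Record execution time under a key that identifies module and class."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # __qualname__ includes the enclosing class, e.g. "Detector.predict".
            key = f"{func.__module__}.{func.__qualname__}"
            timings[key].append(time.perf_counter() - start)
    return wrapper


class Detector:
    @measure_time
    def predict(self, value):
        return value * 2


result = Detector().predict(3)
```

Using `__qualname__` rather than `__name__` distinguishes methods that share a name across classes, such as several `predict` methods.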

refactor(ocr): remove unused OCR dictionaries and update model configurations · 41f1fb8a

myhloli authored Apr 01, 2025

- Remove unused OCR dictionaries for Arabic, Belarusian, Bulgarian and Armenian languages
- Update model configurations in arch_config.yaml:
  - Comment out 'out_channels' for various language models
  - Rename Arabic, Korean, Japanese, Tamil and Devanagari model configurations to use 'v3' instead of 'v4'
- Delete ar_dict.txt, be_dict.txt and bg_dict.txt files
- Update arabic_dict.txt to remove blank line at the start

41f1fb8a