Commits · 49a8f8be0a5956e2b0f86a71a131878bdc3cf03d · wangsen / MinerU

29 Apr, 2025 2 commits

feat(model_utils): adjust table detection threshold and add features · 49a8f8be

myhloli authored Apr 29, 2025

- Adjust the threshold for considering tables inside other tables from 2 to 3
- Add support for custom formula delimiters through user configuration
- Pin pdfminer.six to version 20250324 to prevent parsing failures

49a8f8be

fix(mfr): add LaTeX symbol replacements for fint and up · dfd13fa2

myhloli authored Apr 29, 2025

- Add regex patterns for replacing LaTeX symbols \fint and \up with their Unicode equivalents

dfd13fa2

27 Apr, 2025 3 commits

fix(mfr): add underscore symbol to unimernet · 7d77d614

myhloli authored Apr 28, 2025

- Add \textunderscore to the list of LaTeX patterns
- This allows the model to properly render underscore characters

7d77d614

fix(mfr): optimize LaTeX formula repair functionality · 2d1a0f2c

myhloli authored Apr 27, 2025

- Improve \left and \right command handling in LaTeX formulas
- Enhance environment type matching for array, matrix, and other structures
- Refactor code for better readability and maintainability

2d1a0f2c

fix(magic_pdf): improve LaTeX formula processing and environment handling · c8747cff

myhloli authored Apr 27, 2025

- Refactor LaTeX left/right pair fixing logic for better balance
- Add environment detection and correction for common math environments
- Implement more robust whitespace handling and command substitution
- Optimize regex patterns for improved performance and readability

c8747cff

25 Apr, 2025 2 commits

fix(mfr): improve LaTeX formula processing and repair · 2e91fb3f

myhloli authored Apr 25, 2025

- Add functions to fix LaTeX left and right commands
- Implement brace matching and repair in LaTeX formulas
- Remove unnecessary whitespace and repair LaTeX code
- Replace specific LaTeX commands with appropriate alternatives
- Add logging for debugging purposes

2e91fb3f

fix(mfr): improve LaTeX formula processing and repair · 6c151151

myhloli authored Apr 25, 2025

- Add functions to fix LaTeX left and right commands
- Implement brace matching and repair in LaTeX formulas
- Remove unnecessary whitespace and repair LaTeX code
- Replace specific LaTeX commands with appropriate alternatives
- Add logging for debugging purposes

6c151151
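
The brace matching and repair step mentioned in the two commits above can be illustrated with a minimal sketch. This is not the code from the commit: the function name `repair_braces` and the repair policy (drop unmatched closing braces, append missing closing braces at the end) are assumptions made for illustration.

```python
def repair_braces(latex: str) -> str:
    """Balance unescaped { } in a LaTeX string (hypothetical helper)."""
    out = []
    depth = 0  # number of currently open, unescaped braces
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\" and i + 1 < len(latex):
            # A backslash escapes the next character (for example \{ or \}),
            # so copy both characters through without counting them.
            out.append(latex[i:i + 2])
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                # Closing brace with no matching opener: drop it.
                i += 1
                continue
            depth -= 1
        out.append(ch)
        i += 1
    # Close any braces that were opened but never closed.
    out.append("}" * depth)
    return "".join(out)
```

For example, `repair_braces(r"\frac{a}{b")` closes the dangling group, while escaped braces such as `\{` are left untouched.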

24 Apr, 2025 1 commit

fix(mfr): improve LaTeX whitespace handling in unimernet model · bfb80cb2

myhloli authored Apr 24, 2025

- Preserve "\ " sequences during whitespace removal
- Add temporary substitution to prevent incorrect processing of "\ " sequences
- Restore "\ " sequences after removing unnecessary whitespace

bfb80cb2
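
The protect, clean, restore sequence described above can be sketched as follows. The placeholder character, the function name, and the rule for what counts as "unnecessary" whitespace (anything not separating two letters) are assumptions for illustration, not the logic used in the unimernet model.

```python
import re

# Hypothetical sentinel: a private-use code point unlikely to occur in LaTeX.
PLACEHOLDER = "\uE000"


def _keep_or_drop(match: re.Match) -> str:
    """Keep one space only when the whitespace run separates two letters."""
    text = match.string
    before = text[match.start() - 1] if match.start() > 0 else ""
    after = text[match.end()] if match.end() < len(text) else ""
    return " " if before.isalpha() and after.isalpha() else ""


def remove_whitespace(latex: str) -> str:
    # 1. Temporarily substitute "\ " (LaTeX control space) so it survives cleanup.
    protected = latex.replace("\\ ", PLACEHOLDER)
    # 2. Remove whitespace runs that do not separate two letters.
    cleaned = re.sub(r"\s+", _keep_or_drop, protected)
    # 3. Restore the protected "\ " sequences.
    return cleaned.replace(PLACEHOLDER, "\\ ")
```

Without step 1, the space in `\ ` would be treated like any other whitespace and removed, leaving a dangling backslash.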

23 Apr, 2025 2 commits

refactor(ocr): update device parameter handling in paddleocr2pytorch · 45f50826

myhloli authored Apr 23, 2025

- Replace get_device() function call with direct 'device' variable usage
- Simplify device configuration in OCR model initialization

45f50826

feat(ocr): add new Chinese OCR model and update language support · 4f88fcaa

myhloli authored Apr 23, 2025

- Add new Chinese OCR model (ch_PP-OCRv4_rec_server_doc_infer) for server-side use
- Update language support in app.py to include new Chinese model
- Modify models_config.yml to add new model configuration

4f88fcaa

22 Apr, 2025 1 commit

fix(ocr): switch to ch_lite model for Chinese OCR on CPU · 69cdea90

myhloli authored Apr 22, 2025

- Automatically change to ch_lite model when using CPU for Chinese OCR
- This modification improves performance on CPU devices

69cdea90
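
A minimal sketch of the selection rule described above. Only the `ch_lite` name comes from the commit message; the function name, the `"ch"` language key, and the device string format are assumptions.

```python
def resolve_ocr_lang(lang: str, device: str) -> str:
    """Pick the OCR model key for a language and device (hypothetical helper)."""
    # Assumption: Chinese is requested as "ch" and CPU devices are named "cpu".
    if lang == "ch" and device.startswith("cpu"):
        return "ch_lite"  # lighter model, intended to run faster on CPU
    return lang
```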

15 Apr, 2025 1 commit

feat(model): add text region handling and improve overlap resolution · 07edefaa

myhloli authored Apr 15, 2025

- Add text region handling in get_res_list_from_layout_res function
- Implement remove_overlaps_min_blocks function to handle overlapping blocks
- Update OCR region handling to include text regions
- Improve overlap resolution for all regions in layout results

07edefaa

09 Apr, 2025 5 commits

refactor(ocr): comment out det_count update and update OCR models · f8323ae0

myhloli authored Apr 09, 2025

- Comment out the line that updates det_count in batch_analyze.py
- Add a new OCR model configuration for Chinese (ch_lite) in models_config.yml
- Update the Chinese OCR model configuration to use a different recognition model

f8323ae0

perf(table): optimize aspect ratio calculation for text boxes · 4afdba36

myhloli authored Apr 09, 2025

- Simplify aspect ratio calculation using direct coordinate subtraction
- Remove unnecessary list append operation
- Improve code readability and performance in table rotation detection

4afdba36

feat(table): add orientation detection and rotation for portrait tables · ac893f32

myhloli authored Apr 09, 2025

- Implement table orientation detection to identify if a table is in portrait mode
- Add rotation logic to turn portrait tables 90 degrees clockwise before OCR
- Update OCR processing to work with potentially rotated images
- Improve text box analysis to determine if a table is rotated

ac893f32
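
One way to picture the detection and rotation steps above is the sketch below. It assumes that a table counts as rotated when a sufficient share of detected text boxes are taller than they are wide. The thresholds (`max_aspect`, `min_fraction`), the box format, and both function names are hypothetical, and the image is a plain nested list rather than whatever image type the pipeline uses.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x0, y0, x1, y1)


def looks_rotated(boxes: Sequence[Box],
                  max_aspect: float = 0.8,
                  min_fraction: float = 0.3) -> bool:
    """Return True when enough text boxes are taller than wide."""
    if not boxes:
        return False
    tall = 0
    for x0, y0, x1, y1 in boxes:
        width, height = x1 - x0, y1 - y0
        # Aspect ratio from direct coordinate subtraction.
        if height > 0 and width / height < max_aspect:
            tall += 1
    return tall >= len(boxes) * min_fraction


def rotate_90_clockwise(image: List[List[int]]) -> List[List[int]]:
    """Rotate a row-major 2D image 90 degrees clockwise."""
    # Reversing the rows and transposing turns the top row into the right column.
    return [list(row) for row in zip(*image[::-1])]
```

Usage would be to run text detection on the table crop, call `looks_rotated` on the resulting boxes, and rotate the crop before recognition when it returns `True`.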

fix(ocr): handle NaN values in recognition scores · c97959e4

myhloli authored Apr 09, 2025

- Update predict_rec.py to check for NaN values in recognition results
- Replace NaN scores with 0.0 to ensure stability and consistency

c97959e4
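
The NaN replacement above amounts to a small guard on each score. The sketch below assumes recognition results are `(text, score)` pairs; the function name is hypothetical and this is not the code in predict_rec.py.

```python
import math
from typing import List, Tuple


def sanitize_scores(results: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Replace NaN recognition scores with 0.0 and leave other scores unchanged."""
    return [
        (text, 0.0 if math.isnan(score) else score)
        for text, score in results
    ]
```

A NaN score compares false against every threshold, so downstream filters such as `score > 0.5` behave unpredictably; mapping it to 0.0 makes such results consistently rank lowest.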

feat(model): improve table recognition by merging and filtering tables · df7ae404

myhloli authored Apr 09, 2025

- Add functions to calculate IoU, check if tables are inside each other, and merge tables
- Implement table merging for high IoU tables
- Add filtering to remove nested tables that don't overlap but cover a large area
- Update table_res_list and layout_res to reflect these changes

df7ae404
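
The three helpers named above (IoU, containment check, merge) can be sketched for axis-aligned boxes as follows. The box format `(x0, y0, x1, y1)` and all function names are assumptions. The commit's actual thresholds and merge policy are not shown in the message, so none are encoded here.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x0, y0, x1, y1)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def is_inside(inner: Box, outer: Box) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def merge(a: Box, b: Box) -> Box:
    """Smallest box that covers both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))
```

A caller would merge pairs whose `iou` exceeds some chosen threshold and use `is_inside` to find nested tables before filtering them.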

08 Apr, 2025 1 commit

fix(table): add model path for slanet-plus to resolve RapidTableError · e327e9ba

myhloli authored Apr 08, 2025

- Import os and pathlib modules to handle file paths
- Define the path to the slanet-plus model
- Update RapidTableInput initialization to include the model path

e327e9ba

07 Apr, 2025 1 commit

fix(model): improve VRAM detection and handling · d32a63ca

myhloli authored Apr 07, 2025

- Refactor VRAM detection logic for better readability and efficiency
- Add fallback mechanism for unknown VRAM sizes
- Improve device checking in get_vram function

d32a63ca

03 Apr, 2025 2 commits

refactor(magic_pdf): optimize table recognition and layout detection · 1fd72f5f

myhloli authored Apr 03, 2025

- Update table recognition logic to process each table individually
- Refactor layout detection to use tqdm for progress tracking
- Optimize OCR recognition by using a single tqdm wrapper
- Improve MFR prediction with a more accurate progress bar
- Simplify MFD prediction by removing unnecessary total calculation

1fd72f5f

feat(model): add tqdm progress bar to model prediction loops · 8e1c2339

myhloli authored Apr 03, 2025

- Add tqdm progress bar to batch prediction loops in multiple model modules
- Improve logging and error handling in batch analysis script
- Update table model initialization to use default sub-model if none specified
- Add tqdm dependency to requirements.txt

8e1c2339

02 Apr, 2025 9 commits

feat(model): update Chinese OCR detection model to PP-OCRv3 · ddfeea94

myhloli authored Apr 03, 2025

- Replace ch_PP-OCRv4_det_infer.pth with ch_PP-OCRv3_det_infer.pth in models_config.yml
- Add new ch_PP-OCRv3_det_infer model configuration in arch_config.yaml

ddfeea94

refactor(ocr): remove redundant code and improve code quality · c4010ae0

myhloli authored Apr 03, 2025

- Remove unnecessary GPU checks and cuda() calls
- Consolidate tensor device placement using .to(self.device)
- Add warning suppression for cleaner output
- Refactor conditional logic for better readability

c4010ae0

refactor(demo): simplify batch_demo.py and update demo.py · b0e220c5

myhloli authored Apr 02, 2025

- Remove unnecessary imports and code in batch_demo.py
- Update demo.py to use relative paths and improve code structure
- Adjust output directory structure in both scripts
- Remove redundant code and simplify functions

b0e220c5

build(dependencies): update PyMuPDF, pydantic and transformers · 90321855

myhloli authored Apr 02, 2025

- Update PyMuPDF to version <1.25.0
- Update pydantic to version <2.11
- Update transformers to version < 5.0.0
- Remove always_apply parameter from alb.ToGray in image processing

90321855

feat(ocr): update OCR utility and dependencies · d09464be

myhloli authored Apr 02, 2025

- Update the default configuration path in pytorchocr_utility.py
- Add required dependencies for paddleocr2pytorch in setup.py:
  - shapely
  - pyclipper
  - omegaconf

d09464be

refactor(model): update OCR model and remove unused configs · c45a706c

myhloli authored Apr 02, 2025

- Remove unused UniMERNet and LayoutLMv3 model configurations
- Update OCR model path and dictionary path for PaddleOCR
- Modify README to update system requirements and installation instructions
- Update setup.py to include new package data

c45a706c

chore: update dictionary files · 3b5d3fc8

myhloli authored Apr 02, 2025

- Add newline at the beginning of arabic_dict.txt
- Change mode of multiple dictionary files

3b5d3fc8

refactor(model): remove unused OCR and table models · d8ebd92f

myhloli authored Apr 02, 2025

- Remove OCR utils, modified PaddleOCR, and StructEqTable model
- Delete related import statements and model definitions
- Update dependencies in setup.py to remove paddlepaddle and related OCR packages

d8ebd92f

refactor(ocr): comment out print statements and update table model initialization · 5252c46e

myhloli authored Apr 02, 2025

- Comment out print statements in base_ocr_v20.py and pytorch_paddle.py
- Update table model initialization to use lang parameter instead of ocr_engine
- Remove unused RapidOCR initialization in rapid_table.py

5252c46e

01 Apr, 2025 2 commits

refactor(ocr): remove unused OCR dictionaries and update model configurations · 41f1fb8a

myhloli authored Apr 01, 2025

- Remove unused OCR dictionaries for Arabic, Belarusian, Bulgarian and Armenian languages
- Update model configurations in arch_config.yaml:
  - Comment out 'out_channels' for various language models
  - Rename Arabic, Korean, Japanese, Tamil and Devanagari model configurations to use 'v3' instead of 'v4'
- Delete ar_dict.txt, be_dict.txt and bg_dict.txt files
- Update arabic_dict.txt to remove blank line at the start

41f1fb8a

refactor(ocr): remove unused code and simplify model architecture · b3d6785d

myhloli authored Apr 01, 2025

- Remove unused imports and code
- Simplify model architecture by removing unnecessary components
- Update initialization and forward pass logic
- Rename variables for consistency

b3d6785d

31 Mar, 2025 2 commits

feat(ocr): implement language-specific OCR processing · d7d85a28

myhloli authored Mar 31, 2025

- Add support for multiple languages in OCR processing
- Create separate lists for each language to improve processing efficiency
- Update OCR model initialization to use PytorchPaddleOCR instead of ModifiedPaddleOCR
- Modify get_ocr_result_list function to include language information
- Improve logging for OCR detection and recognition

d7d85a28
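
The per-language lists described above can be pictured as a simple grouping step, so that each language's model processes its regions in one pass. The region dictionaries, the `"lang"` key, and the function name below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_language(regions: List[dict]) -> Dict[str, List[dict]]:
    """Group OCR regions by their language tag, preserving input order."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for region in regions:
        groups[region["lang"]].append(region)
    return dict(groups)
```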

feat(ocr): implement separate detection and recognition processes · a330651d

myhloli authored Mar 31, 2025

- Split OCR process into detection and recognition stages
- Update batch analysis and document analysis pipelines
- Modify OCR result formatting and handling
- Remove unused imports and optimize code structure

a330651d

27 Mar, 2025 1 commit

feat(model): add OCR model base structure and utilities · a7a899f6

myhloli authored Mar 27, 2025

- Add base model structure for OCR in pytorch
- Implement data augmentation and transformation modules
- Create utilities for dictionary handling and state dict conversion
- Include post-processing modules for OCR
- Add weight initialization and loading functions

a7a899f6

24 Mar, 2025 1 commit

fix(magic_pdf): improve image resizing and padding in UnimerSwinn model · 86d83c01

myhloli authored Mar 24, 2025

- Comment out margin cropping to prevent errors with broken files
- Refactor image resizing to preserve aspect ratio
- Update padding calculation and application using OpenCV

86d83c01
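
The arithmetic behind aspect-preserving resizing with padding can be sketched without any image library. The function name and the choice to centre the image are assumptions. The padding tuple is ordered `(top, bottom, left, right)`, which matches the argument order of OpenCV's `cv2.copyMakeBorder`.

```python
from typing import Tuple


def fit_and_pad(src_w: int, src_h: int,
                dst_w: int, dst_h: int) -> Tuple[Tuple[int, int],
                                                 Tuple[int, int, int, int]]:
    """Return the resized (width, height) and the (top, bottom, left, right) padding."""
    # One scale factor for both axes keeps the aspect ratio unchanged.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = min(dst_w, max(1, round(src_w * scale)))
    new_h = min(dst_h, max(1, round(src_h * scale)))
    # Whatever space remains is split between the two sides (assumed centring).
    pad_w, pad_h = dst_w - new_w, dst_h - new_h
    top, left = pad_h // 2, pad_w // 2
    return (new_w, new_h), (top, pad_h - top, left, pad_w - left)
```

For a 200x100 source and a 100x100 target, the image is scaled to 100x50 and receives 25 rows of padding above and below.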

22 Mar, 2025 1 commit

refactor(ocr): improve ONNX model initialization and resource handling · cebcd2ad

myhloli authored Mar 22, 2025

- Replace deprecated importlib.resources.path with importlib.resources.files
- Simplify code structure and improve readability
- Remove unnecessary comments and empty lines

cebcd2ad
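
For context on the first bullet: `importlib.resources.path` is deprecated since Python 3.11, and `importlib.resources.files` (available since Python 3.9) is the replacement. The sketch below shows the newer pattern against a standard-library package; the helper names are hypothetical and this is not the code from the commit.

```python
from importlib.resources import as_file, files


def read_packaged_text(package: str, resource: str) -> str:
    """Read a text resource that ships inside an importable package."""
    return files(package).joinpath(resource).read_text(encoding="utf-8")


def packaged_file_exists(package: str, resource: str) -> bool:
    """Check for a resource, materialising it on disk if the package is zipped."""
    # as_file yields a real filesystem path, which libraries that need a
    # file path (for example a model loader) can consume directly.
    with as_file(files(package).joinpath(resource)) as path:
        return path.is_file()
```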

21 Mar, 2025 1 commit

refactor(model): update model downloads and disable unused models · dba28389

myhloli authored Mar 21, 2025

- Comment out LayoutLMv3, TableMaster, and StructEqTable models
- Update MFR model path to unimernet_hf_small_2503
- Remove unused import in Unimernet.py

dba28389

20 Mar, 2025 2 commits

refactor(magic_pdf): remove unnecessary half() calls for CPU devices · 27281c92

myhloli authored Mar 20, 2025

- Remove half() calls for DocLayoutYOLO and YOLOv8 models
- This change prevents potential errors when running models on CPU

27281c92

refactor(model): update model initialization and dependencies · 2f3b66a5

myhloli authored Mar 20, 2025

- Update config version to 1.2.0
- Refactor model initialization in model_init.py
- Update dependencies in requirements.txt files
- Remove unused imports and models
- Add conditional imports for table models

2f3b66a5