Commits · 2a3a006f4db8c7a6830d69f3d26919409515c71a · wangsen / MinerU

21 Jan, 2025 1 commit

fix(models): update unimernet_small model path · 2a3a006f

myhloli authored Jan 21, 2025

- Update model path from 'unimernet_small' to 'unimernet_small_2501' in multiple scripts and configuration files
- This change affects download_models.py, download_models_hf.py, and model_configs.yaml

2a3a006f

20 Jan, 2025 3 commits

fix(ocr): improve ONNX model initialization and error handling · b3d60b96

myhloli authored Jan 20, 2025

- Add key length validation for ONNX model initialization
- Move import statements to the top of the file
- Wrap model initialization in a try-except block for better error handling
- Refactor code to improve readability and maintainability

b3d60b96

feat(pdf_parse): remove tilted lines for better text extraction · ba6c17a9

myhloli authored Jan 20, 2025

- Add remove_tilted_line function to filter out lines with angles between 2 and 88 degrees
- Integrate the new function into the text extraction process
- Improve the accuracy of text block processing by removing non-horizontal/vertical lines

ba6c17a9
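
A minimal sketch of the angle filter described in ba6c17a9, not the commit's actual code. It assumes each line is a dict carrying a `dir` entry of `(cos, sin)` for its writing direction; the function name, parameters, and data layout are illustrative.

```python
import math


def remove_tilted_lines(lines, low=2.0, high=88.0):
    """Drop lines whose angle lies strictly between `low` and `high` degrees."""
    kept = []
    for line in lines:
        cos_a, sin_a = line["dir"]  # assumed layout: (cos, sin) of the line direction
        # Fold the angle into [0, 90] so sign and reading direction do not matter.
        angle = math.degrees(math.atan2(abs(sin_a), abs(cos_a)))
        if low < angle < high:
            continue  # tilted: neither near-horizontal nor near-vertical
        kept.append(line)
    return kept


lines = [
    {"dir": (1.0, 0.0)},   # horizontal
    {"dir": (0.0, 1.0)},   # vertical
    {"dir": (math.cos(math.radians(45)), math.sin(math.radians(45)))},  # tilted
    {"dir": (-1.0, 0.0)},  # horizontal, reversed
]
kept = remove_tilted_lines(lines)
```

Folding the angle into a single quadrant keeps the check to one comparison regardless of which way the line runs.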

Fix ocr utils · fbf1c4bf

陆逊 authored Jan 20, 2025

fbf1c4bf

17 Jan, 2025 3 commits

feat(llm_aided): add reasonability check and fine-tuning guidelines · d986e393

myhloli authored Jan 17, 2025

- Added instructions for checking the reasonability of heading levels
- Included guidelines for making fine adjustments based on context and logic
- Emphasized the importance of aligning the final result with the document's actual structure

d986e393

fix(magic_pdf): limit batch ratio for GPU memory · db8be974

myhloli authored Jan 17, 2025

- Commented out the original batch ratio calculation
- Set a fixed batch ratio of 2 for GPUs with less than 8 GB memory
- Increased batch ratio to 4 for GPUs with 8 GB or more memory

db8be974
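
The rule in db8be974 can be sketched as a single threshold check. The function name and the gigabyte unit of the argument are assumptions made for illustration; only the 8 GB cut-off and the ratios 2 and 4 come from the commit message.

```python
def batch_ratio_for(gpu_memory_gb: float) -> int:
    """Return a fixed batch ratio: 2 below 8 GB of GPU memory, 4 otherwise."""
    return 4 if gpu_memory_gb >= 8 else 2


ratios = [batch_ratio_for(gb) for gb in (6, 8, 24)]
```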

refactor(table): add device configuration for Unitable model · e64d4fed

myhloli authored Jan 17, 2025

- Import get_device function from magic_pdf.libs.config_reader
- Update RapidTableModel initialization to include device parameter for Unitable model

e64d4fed

16 Jan, 2025 3 commits

refactor(model): update batch analyze logic for rapid table model · 452a9c0b

myhloli authored Jan 16, 2025

- Modify the batch analyze process to handle the rapid table model's output
- Add logic_points variable to capture additional output from rapid table prediction

452a9c0b

feat(table): upgrade RapidTable to 1.0.3 and add sub-model support · 79c8a5c8

myhloli authored Jan 16, 2025

- Update RapidTable dependency to version 1.0.3
- Add support for sub-models in RapidTable
- Update magic-pdf configuration to include table sub-model
- Modify table model initialization to support sub-models
- Update table prediction logic to handle new output format

79c8a5c8

fix(magic_pdf): correct end page index and improve error handling · f209ddea

myhloli authored Jan 16, 2025

- Adjust end_page_id calculation to prevent IndexError when accessing pages
- Enhance error handling in LLM post-processing by specifically catching JSONDecodeError

f209ddea
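
One way an end page index can be kept in range, sketched for f209ddea. The commit message does not show the actual calculation; the helper name and the convention that `None` or a negative value means "through the last page" are assumptions.

```python
def resolve_end_page_id(end_page_id, page_count):
    """Clamp the end page index to the last valid zero-based index."""
    last = page_count - 1
    # Assumed convention: None or a negative value means "up to the last page".
    if end_page_id is None or end_page_id < 0 or end_page_id > last:
        return last
    return end_page_id


end = resolve_end_page_id(10, page_count=5)
```

With the index clamped this way, a loop over `range(start, end + 1)` cannot index past the final page.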

15 Jan, 2025 5 commits

refactor(magic_pdf): improve title block merging logic · 8570e006

myhloli authored Jan 15, 2025

- Rename and update merge_title_blocks function
- Implement merge_two_bbox helper function
- Refactor merging logic to preserve original block structure
- Update function calls and integrate with existing pipeline

8570e006
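
A helper such as the `merge_two_bbox` named in 8570e006 would typically return the smallest box enclosing both inputs. The sketch below assumes boxes are `(x0, y0, x1, y1)` tuples; the signature is a guess, not taken from the repository.

```python
def merge_two_bbox(b1, b2):
    """Return the smallest (x0, y0, x1, y1) box that contains both inputs."""
    return (
        min(b1[0], b2[0]),
        min(b1[1], b2[1]),
        max(b1[2], b2[2]),
        max(b1[3], b2[3]),
    )


merged = merge_two_bbox((0, 0, 5, 5), (3, 2, 10, 4))
```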

feat(model): improve batch analysis logic and support npu · f3502226

myhloli authored Jan 15, 2025

- Add support for NPU (Neural Processing Unit) when available
- Implement batch analysis for GPU and NPU devices
- Optimize memory usage and improve performance
- Update logging and error handling

f3502226

fix(language): remove invalid UTF-16 surrogate pairs from input text · 1a549a0e

myhloli authored Jan 15, 2025

- Add `remove_invalid_surrogates` function to filter out invalid UTF-16 surrogate pairs
- Integrate the new function into the `detect_lang` workflow
- Include a test case with UTF-16 surrogates to verify the fix

1a549a0e
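
A sketch of what a function like `remove_invalid_surrogates` in 1a549a0e might do. In a Python `str`, correctly decoded surrogate pairs already appear as single code points above U+FFFF, so any code point left in the range U+D800 to U+DFFF is a stray surrogate. The body is an assumption; only the function name comes from the commit message.

```python
def remove_invalid_surrogates(text: str) -> str:
    """Drop stray UTF-16 surrogate code points (U+D800 to U+DFFF)."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


cleaned = remove_invalid_surrogates("abc\ud800def")
# Once the stray surrogate is removed, the text can be encoded as UTF-8.
encoded = cleaned.encode("utf-8")
```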

docs(magic_pdf): update llm_aided.py prompt for title list optimization · 916ced9f

myhloli authored Jan 15, 2025

- Clarify the expected format for the optimized title list JSON output
- Emphasize the need to return only the title levels in the specified format

916ced9f

refactor(pre_proc): adjust IOU threshold for character overlap detection · f37b14bc

myhloli authored Jan 15, 2025

- Modified the IOU threshold in ocr_span_list_modify.py from 0.9 to 0.35
- This change aims to improve the detection of overlapping characters in OCR processed PDFs

f37b14bc
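
To illustrate the effect of the threshold change in f37b14bc, here is a generic intersection-over-union check. It is not the code from ocr_span_list_modify.py; the function names and the `(x0, y0, x1, y1)` box layout are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_overlapping(a, b, threshold=0.35):
    """Treat two boxes as overlapping when their IoU exceeds the threshold."""
    return iou(a, b) > threshold


box_a, box_b = (0, 0, 10, 10), (4, 0, 14, 10)
flagged_new = is_overlapping(box_a, box_b, threshold=0.35)
flagged_old = is_overlapping(box_a, box_b, threshold=0.9)
```

These two boxes share 60 of 140 units of combined area, so the pair is flagged at 0.35 but would have been missed at 0.9.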

14 Jan, 2025 4 commits

feat(post_proc): enhance title block processing with average line height · bbd86955

myhloli authored Jan 14, 2025

- Add average line height calculation for title blocks
- Include page number in title dictionary
- Improve title optimization prompt for better hierarchy
- Implement retry mechanism for JSON decoding errors
- Add error logging for title count mismatch

bbd86955
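
The retry mechanism mentioned in bbd86955 could look roughly like the following. `call_llm` is a hypothetical zero-argument callable that returns the raw model reply; the retry count and the `None` fallback are assumptions, not the commit's behaviour.

```python
import json


def parse_with_retry(call_llm, max_retries=3):
    """Call the model until its reply parses as JSON, or give up and return None."""
    for _ in range(max_retries):
        raw = call_llm()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed reply: ask again
    return None


replies = iter(["not json", '{"0": 1}'])
result = parse_with_retry(lambda: next(replies))
```

Catching only `json.JSONDecodeError` lets unrelated failures, such as network errors, surface instead of being retried silently.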

refactor(BatchAnalyze): comment out image rotation logic in doclayout_yolo · 902dcd2c

myhloli authored Jan 14, 2025

902dcd2c

feat(layout): improve title block handling and layout detection · c20e9a1e

myhloli authored Jan 14, 2025

- Merge title blocks that are close to each other horizontally
- Adjust line insertion logic for title blocks
- Increase image size and decrease confidence threshold for layout detection
- Update DocLayoutYOLO model weights
- Refactor drawing of bounding boxes for different block types

c20e9a1e

Update pdf_parse_union_core_v2.py · 9f12c398

Xiaomeng Zhao authored Jan 14, 2025

9f12c398

10 Jan, 2025 4 commits

Update version.py with new version · 67c9fdac

myhloli authored Jan 10, 2025

67c9fdac

fix(llm_aided): add enable flag check for LLM aided optimizations · aaff1a26

myhloli authored Jan 10, 2025

- Add enable flag check for formula, text, and title optimizations

aaff1a26

Update version.py with new version · 2c4a586e

myhloli authored Jan 10, 2025

2c4a586e

fix(device): enable MPS support and fix related issues · 203b8f90

myhloli authored Jan 10, 2025

- Add MPS support for Apple Silicon devices
- Implement empty_cache() for MPS devices
- Set PYTORCH_ENABLE_MPS_FALLBACK environment variable
- Adjust MFR model device allocation for MPS

203b8f90

09 Jan, 2025 5 commits

fix(language): enhance language detection and text processing · 29681c4f

myhloli authored Jan 09, 2025

- Improve language detection by removing newline characters from the input text
- Add error handling and fallback mechanism to deal with text containing control characters

29681c4f
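
The two steps in 29681c4f (strip newlines, then fall back when control characters break detection) might be combined as in this sketch. `detector` stands in for whatever language detection callable is in use; the helper names and the empty-string default are assumptions.

```python
import unicodedata


def strip_control_chars(text):
    """Remove characters in Unicode category Cc (control characters)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


def detect_lang_safe(text, detector, default=""):
    """Detect language on newline-free text, retrying without control characters."""
    cleaned = text.replace("\n", "")
    try:
        return detector(cleaned)
    except Exception:
        # Fallback: strip control characters and try once more.
        try:
            return detector(strip_control_chars(cleaned))
        except Exception:
            return default


def fake_detector(text):
    """Stand-in detector that rejects any control character."""
    if any(unicodedata.category(ch) == "Cc" for ch in text):
        raise ValueError("control character in input")
    return "en"


lang = detect_lang_safe("hello\x00world\n", fake_detector)
```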

refactor(magic_pdf): update OCR engine selection in RapidTableModel · bd1b7677

myhloli authored Jan 09, 2025

- Remove conditional logic for OCR engine selection
- Always use RapidOCR as the OCR engine
- Simplify the __init__ method by removing unused code

bd1b7677

refactor(model): remove unused YOLO v11 language detection model · a80ff051

myhloli authored Jan 09, 2025

- Remove YOLO v11 language detection model from model_configs.yaml
- Update language detection utils to use a fixed model path instead of dynamic configuration
- Remove unused model weight parameter for YOLO v11 language detection

a80ff051

feat(pdf_parse): add internal block sorting for images and tables · 3f93b895

myhloli authored Jan 09, 2025

- Implement block sorting within image and table blocks
- Ensure correct order of captions and footnotes within blocks
- Improve overall document structure and parsing accuracy

3f93b895
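
As a rough illustration of the internal sorting in 3f93b895, sub-blocks can be ordered top to bottom, then left to right, by their bounding boxes. The nested `blocks` key, the `type` labels, and the sort key are assumptions; the commit message does not state the actual ordering rule.

```python
def sort_sub_blocks(block):
    """Sort a block's sub-blocks by the top edge, then the left edge, of their bbox."""
    block["blocks"] = sorted(
        block["blocks"], key=lambda sub: (sub["bbox"][1], sub["bbox"][0])
    )
    return block


table = {
    "type": "table",
    "blocks": [
        {"type": "table_footnote", "bbox": [0, 200, 100, 220]},
        {"type": "table_body", "bbox": [0, 50, 100, 190]},
        {"type": "table_caption", "bbox": [0, 10, 100, 40]},
    ],
}
order = [sub["type"] for sub in sort_sub_blocks(table)["blocks"]]
```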

refactor(langdetect): simplify language detection model and improve logging · 3271cf75

myhloli authored Jan 09, 2025

- Remove LangDetectMode and related conditional logic
- Use a single model weight for language detection
- Add logging for language detection results
- Update model initialization and prediction methods

3271cf75

08 Jan, 2025 3 commits

feat(model): add language detection model and update related modules · 735f3a70

myhloli authored Jan 08, 2025

- Add language detection model initialization and integration
- Update model list to include language detection
- Refactor language detection utils for better model management

735f3a70

feat(language-detection): improve language detection accuracy for specific languages · 356cb1f2

myhloli authored Jan 08, 2025

- Add separate models for Chinese/Japanese and English/French/German detection
- Implement mode-based detection to use appropriate models for different languages
- Update language detection process to use higher DPI for better accuracy
- Modify model initialization and prediction logic to support new language-specific models

356cb1f2

fix(pdf_parse): ensure block bounding boxes do not have negative values · 6b55fcfd

myhloli authored Jan 08, 2025

- Add logic to set any negative values in block['bbox'] to 0
- This prevents potential errors when processing PDF blocks

6b55fcfd
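
The clamping described in 6b55fcfd amounts to replacing each negative coordinate with 0. A minimal sketch, with a hypothetical helper name:

```python
def clamp_bbox(bbox):
    """Replace any negative coordinate in a bbox with 0."""
    return [max(0, value) for value in bbox]


block = {"bbox": [-3, 10, 50, -0.5]}
block["bbox"] = clamp_bbox(block["bbox"])
```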

07 Jan, 2025 1 commit

feat(api): simplify markdown and content list generation · 52efe94d

myhloli authored Jan 07, 2025

- Remove DropMode and MakeMode imports from user code
- Set default drop_mode to DropMode.NONE in get_markdown and get_content_list methods
- Remove md_make_mode parameter from get_content_list method
- Add dump_middle_json method to PipeResult
- Update examples in API documentation and demo script

52efe94d

06 Jan, 2025 3 commits

Delete magic_pdf/pipe/AbsPipe.py · 43cdaa55

Xiaomeng Zhao authored Jan 06, 2025

43cdaa55

fix(table): handle empty OCR result in rapidtable · 12caa784

myhloli authored Jan 06, 2025

- Add check for empty OCR result when using PaddleOCR model
- Assign None to ocr_result if no text is detected, preventing further errors

12caa784

refactor: remove unused method in MagicModel class · d13f3c6d

icecraft authored Jan 06, 2025

d13f3c6d

05 Jan, 2025 3 commits

feat(tools): add character bounding box drawing functionality · f911a102

myhloli authored Jan 05, 2025

- Add `draw_char_bbox` function to `draw_bbox.py` for drawing character bounding boxes
- Integrate `draw_char_bbox` into `common.py` for use in PDF processing pipeline
- Include option to draw character bounding boxes in debug mode

f911a102

style(pdf_parse_union_core_v2): remove unnecessary spaces and improve code formatting · 9951a170

myhloli authored Jan 05, 2025

- Remove extra space in conditional statement for character spacing logic
- Adjust spacing in trigonometric checks for line direction
- Improve overall code readability and consistency

9951a170

fix(magic-pdf): update OCR model selection logic · 16a0a350

myhloli authored Jan 05, 2025

- Add missing 'else' statement in OCR model selection logic
- Ensure consistent formatting of 'if' statements for better readability
- Remove unnecessary empty line in the 'app.py' file

16a0a350

03 Jan, 2025 2 commits

refactor(ocr): comment out unnecessary log statement · 04febf52

myhloli authored Jan 03, 2025

- Remove logger.info() call for additional_ocr_params to reduce log verbosity

04febf52

feat(model): add onnxruntime support for paddleocr on cpu · 512adb67

myhloli authored Jan 03, 2025

- Implement ONNXModelSingleton to manage ONNX models
- Modify ModifiedPaddleOCR to use ONNX models on ARM CPUs without CUDA
- Update RapidTableModel to use RapidOCR with ONNXRuntime on CPU
- Add rapidocr_onnxruntime dependency in setup.py

512adb67
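
A singleton that caches loaded models, in the spirit of the `ONNXModelSingleton` named in 512adb67, can be sketched as below. Only the class name comes from the commit message; the `get_model` method, the key shape, and the `loader` callable are hypothetical, and no ONNX runtime is involved in this sketch.

```python
class ONNXModelSingleton:
    """Process-wide cache so each model is loaded at most once per key."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, key, loader):
        """Return the cached model for `key`, calling `loader()` on first use."""
        if key not in self._models:
            self._models[key] = loader()
        return self._models[key]


load_calls = []


def fake_loader():
    """Stand-in for an expensive model load."""
    load_calls.append(1)
    return object()


first = ONNXModelSingleton().get_model(("en", "det"), fake_loader)
second = ONNXModelSingleton().get_model(("en", "det"), fake_loader)
```

Because every `ONNXModelSingleton()` call returns the same instance, repeated requests for one key reuse the already loaded model.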