Commits · 46ce94ebb5c7da4e8e52b143c93212fee73dc6ab · wangsen / MinerU

16 Jan, 2025 1 commit

fix(magic_pdf): correct end page index and improve error handling · f209ddea

myhloli authored Jan 16, 2025

- Adjust end_page_id calculation to prevent IndexError when accessing pages
- Enhance error handling in LLM post-processing by specifically catching JSONDecodeError

f209ddea
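
The end-page fix described above can be illustrated as a clamp on the index. This is a minimal sketch, not the code in the commit: the helper name `clamp_end_page_id` and its parameters are assumptions made for illustration.

```python
def clamp_end_page_id(end_page_id, page_count):
    """Return a valid zero-based index of the last page to process.

    Hypothetical helper: falls back to the last page when the requested
    index is missing, negative, or past the end of the document.
    """
    last_index = page_count - 1
    if end_page_id is None or end_page_id < 0 or end_page_id > last_index:
        return last_index
    return end_page_id
```

With a 5-page document, a requested end page of 10 would be clamped to 4, so indexing the page list no longer raises IndexError.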

15 Jan, 2025 5 commits

refactor(magic_pdf): improve title block merging logic · 8570e006

myhloli authored Jan 15, 2025

- Rename and update merge_title_blocks function
- Implement merge_two_bbox helper function
- Refactor merging logic to preserve original block structure
- Update function calls and integrate with existing pipeline

8570e006
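
A bounding-box merge helper of this kind is usually a union of two rectangles. The sketch below borrows the name `merge_two_bbox` from the commit message, but the signature and the `(x0, y0, x1, y1)` layout are assumptions; the actual implementation may differ.

```python
def merge_two_bbox(bbox_a, bbox_b):
    """Return the smallest (x0, y0, x1, y1) box that contains both inputs.

    Assumed layout: x0/y0 is the top-left corner, x1/y1 the bottom-right.
    """
    return (
        min(bbox_a[0], bbox_b[0]),
        min(bbox_a[1], bbox_b[1]),
        max(bbox_a[2], bbox_b[2]),
        max(bbox_a[3], bbox_b[3]),
    )
```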

feat(model): improve batch analysis logic and support npu · f3502226

myhloli authored Jan 15, 2025

- Add support for NPU (Neural Processing Unit) when available
- Implement batch analysis for GPU and NPU devices
- Optimize memory usage and improve performance
- Update logging and error handling

f3502226

fix(language): remove invalid UTF-16 surrogate pairs from input text · 1a549a0e

myhloli authored Jan 15, 2025

- Add `remove_invalid_surrogates` function to filter out invalid UTF-16 surrogate pairs
- Integrate the new function into the `detect_lang` workflow
- Include a test case with UTF-16 surrogates to verify the fix

1a549a0e
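
One way to implement such a filter is shown below. This is a sketch under assumptions: only the function name comes from the commit message, and the body is illustrative. In a Python `str`, a valid surrogate pair has already been decoded to a single code point, so any code point left in the range U+D800 to U+DFFF is an unpaired surrogate and cannot be encoded as UTF-8.

```python
def remove_invalid_surrogates(text):
    """Drop lone UTF-16 surrogate code points (U+D800..U+DFFF) from text."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)
```

Characters outside the surrogate range, including emoji and other non-BMP characters, pass through unchanged.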

docs(magic_pdf): update llm_aided.py prompt for title list optimization · 916ced9f

myhloli authored Jan 15, 2025

- Clarify the expected format for the optimized title list JSON output
- Emphasize the need to return only the title levels in the specified format

916ced9f

refactor(pre_proc): adjust IOU threshold for character overlap detection · f37b14bc

myhloli authored Jan 15, 2025

- Modified the IOU threshold in ocr_span_list_modify.py from 0.9 to 0.35
- This change aims to improve the detection of overlapping characters in OCR processed PDFs

f37b14bc
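
To show what the lower threshold changes, here is a minimal intersection-over-union sketch. The function names, the `(x0, y0, x1, y1)` box layout, and the strict `>` comparison are assumptions; only the values 0.9 and 0.35 come from the commit message.

```python
IOU_THRESHOLD = 0.35  # value after this commit; previously 0.9


def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_overlapping(box_a, box_b, threshold=IOU_THRESHOLD):
    """Treat two boxes as overlapping when their IoU exceeds the threshold."""
    return iou(box_a, box_b) > threshold
```

Two 10×10 boxes offset by 4 units horizontally have an IoU of 60/140, about 0.43: they count as overlapping at 0.35 but would not have at 0.9.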

14 Jan, 2025 4 commits

feat(post_proc): enhance title block processing with average line height · bbd86955

myhloli authored Jan 14, 2025

- Add average line height calculation for title blocks
- Include page number in title dictionary
- Improve title optimization prompt for better hierarchy
- Implement retry mechanism for JSON decoding errors
- Add error logging for title count mismatch

bbd86955
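
A retry loop around JSON decoding might look like the sketch below. It is illustrative only: `parse_json_with_retry`, the `call_llm` callable, and the `max_retries` default are hypothetical names, not taken from the repository.

```python
import json


def parse_json_with_retry(call_llm, max_retries=3):
    """Call call_llm() until its output parses as JSON, up to max_retries times.

    call_llm is any zero-argument callable returning a string (hypothetical
    stand-in for the LLM request). Returns the parsed value, or None if
    every attempt fails to decode.
    """
    for _ in range(max_retries):
        raw = call_llm()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: ask again
    return None
```

Catching `json.JSONDecodeError` specifically, rather than a bare `Exception`, keeps unrelated failures such as network errors visible to the caller.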

refactor(BatchAnalyze): comment out image rotation logic in doclayout_yolo · 902dcd2c

myhloli authored Jan 14, 2025

902dcd2c

feat(layout): improve title block handling and layout detection · c20e9a1e

myhloli authored Jan 14, 2025

- Merge title blocks that are close to each other horizontally
- Adjust line insertion logic for title blocks
- Increase image size and decrease confidence threshold for layout detection
- Update DocLayoutYOLO model weights
- Refactor drawing of bounding boxes for different block types

c20e9a1e

Update pdf_parse_union_core_v2.py · 9f12c398

Xiaomeng Zhao authored Jan 14, 2025

9f12c398

10 Jan, 2025 4 commits

Update version.py with new version · 67c9fdac

myhloli authored Jan 10, 2025

67c9fdac

fix(llm_aided): add enable flag check for LLM aided optimizations · aaff1a26

myhloli authored Jan 10, 2025

- Add enable flag check for formula, text, and title optimizations

aaff1a26

Update version.py with new version · 2c4a586e

myhloli authored Jan 10, 2025

2c4a586e

fix(device): enable MPS support and fix related issues · 203b8f90

myhloli authored Jan 10, 2025

- Add MPS support for Apple Silicon devices
- Implement empty_cache() for MPS devices
- Set PYTORCH_ENABLE_MPS_FALLBACK environment variable
- Adjust MFR model device allocation for MPS

203b8f90

09 Jan, 2025 5 commits

fix(language): enhance language detection and text processing · 29681c4f

myhloli authored Jan 09, 2025

- Improve language detection by removing newline characters from the input text
- Add error handling and fallback mechanism to deal with text containing control characters

29681c4f

refactor(magic_pdf): update OCR engine selection in RapidTableModel · bd1b7677

myhloli authored Jan 09, 2025

- Remove conditional logic for OCR engine selection
- Always use RapidOCR as the OCR engine
- Simplify the __init__ method by removing unused code

bd1b7677

refactor(model): remove unused YOLO v11 language detection model · a80ff051

myhloli authored Jan 09, 2025

- Remove YOLO v11 language detection model from model_configs.yaml
- Update language detection utils to use a fixed model path instead of dynamic configuration
- Remove unused model weight parameter for YOLO v11 language detection

a80ff051

feat(pdf_parse): add internal block sorting for images and tables · 3f93b895

myhloli authored Jan 09, 2025

- Implement block sorting within image and table blocks
- Ensure correct order of captions and footnotes within blocks
- Improve overall document structure and parsing accuracy

3f93b895

refactor(langdetect): simplify language detection model and improve logging · 3271cf75

myhloli authored Jan 09, 2025

- Remove LangDetectMode and related conditional logic
- Use a single model weight for language detection
- Add logging for language detection results
- Update model initialization and prediction methods

3271cf75

08 Jan, 2025 3 commits

feat(model): add language detection model and update related modules · 735f3a70

myhloli authored Jan 08, 2025

- Add language detection model initialization and integration
- Update model list to include language detection
- Refactor language detection utils for better model management

735f3a70

feat(language-detection): improve language detection accuracy for specific languages · 356cb1f2

myhloli authored Jan 08, 2025

- Add separate models for Chinese/Japanese and English/French/German detection
- Implement mode-based detection to use appropriate models for different languages
- Update language detection process to use higher DPI for better accuracy
- Modify model initialization and prediction logic to support new language-specific models

356cb1f2

fix(pdf_parse): ensure block bounding boxes do not have negative values · 6b55fcfd

myhloli authored Jan 08, 2025

- Add logic to set any negative values in block['bbox'] to 0
- This prevents potential errors when processing PDF blocks

6b55fcfd
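
The clamping step amounts to replacing each negative coordinate with 0. A minimal sketch follows, assuming `block['bbox']` is a list of four numbers; the helper name `clamp_bbox` is hypothetical.

```python
def clamp_bbox(bbox):
    """Return a copy of bbox with any negative coordinate replaced by 0."""
    return [max(0, value) for value in bbox]


# Usage on a block dictionary shaped like the one named in the commit message.
block = {"bbox": [-1.5, 0, 10, -3]}
block["bbox"] = clamp_bbox(block["bbox"])
```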

07 Jan, 2025 1 commit

feat(api): simplify markdown and content list generation · 52efe94d

myhloli authored Jan 07, 2025

- Remove DropMode and MakeMode imports from user code
- Set default drop_mode to DropMode.NONE in get_markdown and get_content_list methods
- Remove md_make_mode parameter from get_content_list method
- Add dump_middle_json method to PipeResult
- Update examples in API documentation and demo script

52efe94d

06 Jan, 2025 3 commits

Delete magic_pdf/pipe/AbsPipe.py · 43cdaa55

Xiaomeng Zhao authored Jan 06, 2025

43cdaa55

fix(table): handle empty OCR result in rapidtable · 12caa784

myhloli authored Jan 06, 2025

- Add check for empty OCR result when using PaddleOCR model
- Assign None to ocr_result if no text is detected, preventing further errors

12caa784

refactor: remove unused method in MagicModel class · d13f3c6d

icecraft authored Jan 06, 2025

d13f3c6d

05 Jan, 2025 3 commits

feat(tools): add character bounding box drawing functionality · f911a102

myhloli authored Jan 05, 2025

- Add `draw_char_bbox` function to `draw_bbox.py` for drawing character bounding boxes
- Integrate `draw_char_bbox` into `common.py` for use in PDF processing pipeline
- Include option to draw character bounding boxes in debug mode

f911a102

style(pdf_parse_union_core_v2): remove unnecessary spaces and improve code formatting · 9951a170

myhloli authored Jan 05, 2025

- Remove extra space in conditional statement for character spacing logic
- Adjust spacing in trigonometric checks for line direction
- Improve overall code readability and consistency

9951a170

fix(magic-pdf): update OCR model selection logic · 16a0a350

myhloli authored Jan 05, 2025

- Add missing 'else' statement in OCR model selection logic
- Ensure consistent formatting of 'if' statements for better readability
- Remove unnecessary empty line in the 'app.py' file

16a0a350

03 Jan, 2025 2 commits

refactor(ocr): comment out unnecessary log statement · 04febf52

myhloli authored Jan 03, 2025

- Remove logger.info() call for additional_ocr_params to reduce log verbosity

04febf52

feat(model): add onnxruntime support for paddleocr on cpu · 512adb67

myhloli authored Jan 03, 2025

- Implement ONNXModelSingleton to manage ONNX models
- Modify ModifiedPaddleOCR to use ONNX models on ARM CPUs without CUDA
- Update RapidTableModel to use RapidOCR with ONNXRuntime on CPU
- Add rapidocr_onnxruntime dependency in setup.py

512adb67

02 Jan, 2025 2 commits

refactor(pdf_parse): improve character spacing handling in PDF text extraction · c93950dc

myhloli authored Jan 02, 2025

- Update the logic for inserting spaces between characters
- Consider the next character's position instead of the previous one
- Adjust the spacing threshold to 25% of the average character width
- Ignore spaces at the end of lines to prevent double spaces

c93950dc

refactor(pdf_parse): improve character spacing handling in PDF text extraction · 7c5cdcd4

myhloli authored Jan 02, 2025

- Update the logic for inserting spaces between characters
- Consider the next character's position instead of the previous one
- Adjust the spacing threshold to 25% of the average character width
- Ignore spaces at the end of lines to prevent double spaces

7c5cdcd4
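
The spacing rule described in these two commits can be sketched as follows. This is an assumption-laden illustration, not the repository code: the function name `join_chars`, the character dictionaries with `c` and `bbox` keys, and the strict `>` comparison are all hypothetical. Only the 25% threshold, the look-ahead to the next character, and the trailing-space handling come from the commit messages.

```python
def join_chars(chars):
    """Join characters of one line, inserting spaces at wide gaps.

    chars: list of {"c": str, "bbox": (x0, y0, x1, y1)} sorted left to right.
    A space is inserted when the gap to the NEXT character exceeds 25% of
    the average character width on the line.
    """
    if not chars:
        return ""
    avg_width = sum(ch["bbox"][2] - ch["bbox"][0] for ch in chars) / len(chars)
    threshold = 0.25 * avg_width
    parts = []
    for i, ch in enumerate(chars):
        parts.append(ch["c"])
        if i + 1 < len(chars):
            nxt = chars[i + 1]
            gap = nxt["bbox"][0] - ch["bbox"][2]
            # Avoid doubling up when either neighbour is already a space.
            if gap > threshold and ch["c"] != " " and nxt["c"] != " ":
                parts.append(" ")
    # Ignore spaces at the end of the line.
    return "".join(parts).rstrip()
```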

30 Dec, 2024 2 commits

refactor(magic_pdf): comment out npu-related code · 88b909e2

myhloli authored Dec 30, 2024

- Remove use_npu variable initialization
- Comment out device assignment and npu check
- Comment out use_npu parameter in ModifiedPaddleOCR constructor

88b909e2

fix(npu): correct module name for NPU operations · 2684e775

myhloli authored Dec 30, 2024

- Update `clean_memory.py` to use `torch_npu.npu` instead of `torch.npu`
- Update `model_utils.py` to use `torch_npu.npu` instead of `torch.npu`
- Simplify NPU availability check and bfloat16 support in `pdf_parse_union_core_v2.py`

2684e775

27 Dec, 2024 1 commit

fix: s3 path join method · d637dab3

icecraft authored Dec 27, 2024

d637dab3

26 Dec, 2024 2 commits

refactor(device): optimize memory cleaning and device selection · 50f48417

myhloli authored Dec 26, 2024

- Update clean_memory function to support both CUDA and NPU devices
- Implement get_device function to centralize device selection logic
- Modify model initialization and memory cleaning to use the selected device
- Update RapidTableModel to support both RapidOCR and PaddleOCR engines

50f48417

feat(model): add npu support and optimize table model · 7990e7df

myhloli authored Dec 26, 2024

- Add NPU support for memory cleaning and model initialization
- Optimize table model initialization and prediction process
- Update memory utils to support NPU
- Add language parameter for table model

7990e7df

25 Dec, 2024 2 commits

refactor(magic_pdf): remove unnecessary logging statements · 192047a1

myhloli authored Dec 25, 2024

- Comment out logging statements for title list, title completion, and length comparison
- Improve code readability and reduce clutter by removing unused debug information

192047a1

feat(llm_aided): add title optimization feature · 0a468eca

myhloli authored Dec 25, 2024

- Implement llm_aided_title function to optimize document titles using LLM
- Update pdf_parse_union_core_v2.py to include title optimization
- Modify ocr_mkcontent.py to use optimized title levels
- Add openai SDK dependency in setup.py

0a468eca