Commits · 356cb1f2de0bb3faae25e6b2f85574921611649e · wangsen / MinerU

08 Jan, 2025 2 commits

feat(language-detection): improve language detection accuracy for specific languages · 356cb1f2

myhloli authored Jan 08, 2025

- Add separate models for Chinese/Japanese and English/French/German detection
- Implement mode-based detection to use appropriate models for different languages
- Update language detection process to use higher DPI for better accuracy
- Modify model initialization and prediction logic to support new language-specific models

356cb1f2

fix(pdf_parse): ensure block bounding boxes do not have negative values · 6b55fcfd

myhloli authored Jan 08, 2025

- Add logic to set any negative values in block['bbox'] to 0
- This prevents potential errors when processing PDF blocks

6b55fcfd
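
As a rough illustration of the bounding-box fix described above, negative coordinates can be clamped to 0 with a single comprehension. This is a minimal sketch, not the code from the commit: the helper name `clamp_bbox` is hypothetical, and it assumes `block['bbox']` is a flat list of four numbers.

```python
def clamp_bbox(block: dict) -> dict:
    """Replace any negative coordinate in block['bbox'] with 0 (in place)."""
    # Hypothetical helper; the commit only states that negative values are set to 0.
    block["bbox"] = [max(0, v) for v in block["bbox"]]
    return block


block = {"bbox": [-3.5, 12.0, 80.0, -0.2]}
clamp_bbox(block)
print(block["bbox"])  # [0, 12.0, 80.0, 0]
```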

07 Jan, 2025 1 commit

feat(api): simplify markdown and content list generation · 52efe94d

myhloli authored Jan 07, 2025

- Remove DropMode and MakeMode imports from user code
- Set default drop_mode to DropMode.NONE in get_markdown and get_content_list methods
- Remove md_make_mode parameter from get_content_list method
- Add dump_middle_json method to PipeResult
- Update examples in API documentation and demo script

52efe94d

06 Jan, 2025 2 commits
- fix(table): handle empty OCR result in rapidtable · 12caa784
  myhloli authored Jan 06, 2025
```
- Add check for empty OCR result when using PaddleOCR model
- Assign None to ocr_result if no text is detected, preventing further errors
```
  12caa784
- refactor: remove unused method in MagicModel class · d13f3c6d
  icecraft authored Jan 06, 2025
  
  d13f3c6d

05 Jan, 2025 3 commits

feat(tools): add character bounding box drawing functionality · f911a102

myhloli authored Jan 05, 2025

- Add `draw_char_bbox` function to `draw_bbox.py` for drawing character bounding boxes
- Integrate `draw_char_bbox` into `common.py` for use in PDF processing pipeline
- Include option to draw character bounding boxes in debug mode

f911a102

style(pdf_parse_union_core_v2): remove unnecessary spaces and improve code... · 9951a170

myhloli authored Jan 05, 2025

style(pdf_parse_union_core_v2): remove unnecessary spaces and improve code formatting

- Remove extra space in conditional statement for character spacing logic
- Adjust spacing in trigonometric checks for line direction
- Improve overall code readability and consistency

9951a170

fix(magic-pdf): update OCR model selection logic · 16a0a350

myhloli authored Jan 05, 2025

- Add missing 'else' statement in OCR model selection logic
- Ensure consistent formatting of 'if' statements for better readability
- Remove unnecessary empty line in the 'app.py' file

16a0a350

03 Jan, 2025 2 commits

refactor(ocr): comment out unnecessary log statement · 04febf52
myhloli authored Jan 03, 2025
```
- Remove logger.info() call for additional_ocr_params to reduce log verbosity
```
04febf52

feat(model): add onnxruntime support for paddleocr on cpu · 512adb67

myhloli authored Jan 03, 2025

- Implement ONNXModelSingleton to manage ONNX models
- Modify ModifiedPaddleOCR to use ONNX models on ARM CPUs without CUDA
- Update RapidTableModel to use RapidOCR with ONNXRuntime on CPU
- Add rapidocr_onnxruntime dependency in setup.py

512adb67

02 Jan, 2025 2 commits

refactor(pdf_parse): improve character spacing handling in PDF text extraction · c93950dc

myhloli authored Jan 02, 2025

- Update the logic for inserting spaces between characters
- Consider the next character's position instead of the previous one
- Adjust the spacing threshold to 25% of the average character width
- Ignore spaces at the end of lines to prevent double spaces

c93950dc

refactor(pdf_parse): improve character spacing handling in PDF text extraction · 7c5cdcd4

myhloli authored Jan 02, 2025

- Update the logic for inserting spaces between characters
- Consider the next character's position instead of the previous one
- Adjust the spacing threshold to 25% of the average character width
- Ignore spaces at the end of lines to prevent double spaces

7c5cdcd4
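
One way to read the spacing rule in the two commits above is sketched here: a space is inserted after a character when the gap to the next character exceeds 25% of the average character width, and trailing spaces are dropped. The function name `line_text`, the character record layout (`c` plus an `(x0, y0, x1, y1)` `bbox`), and the interpretation of "ignore spaces at the end of lines" as a trailing strip are assumptions, not the project's code.

```python
def line_text(chars, ratio=0.25):
    """Join characters of one line, inserting spaces where the gap is wide."""
    if not chars:
        return ""
    # Average width over all characters in the line.
    avg_width = sum(c["bbox"][2] - c["bbox"][0] for c in chars) / len(chars)
    out = []
    for i, ch in enumerate(chars):
        out.append(ch["c"])
        if i + 1 < len(chars):
            nxt = chars[i + 1]
            # Look ahead: distance from this character's right edge
            # to the next character's left edge.
            gap = nxt["bbox"][0] - ch["bbox"][2]
            if gap > ratio * avg_width and ch["c"] != " " and nxt["c"] != " ":
                out.append(" ")
    # Drop spaces at the end of the line so joined lines do not get double spaces.
    return "".join(out).rstrip(" ")


chars = [
    {"c": "a", "bbox": (0, 0, 10, 10)},
    {"c": "b", "bbox": (10, 0, 20, 10)},
    {"c": "c", "bbox": (25, 0, 35, 10)},
    {"c": " ", "bbox": (35, 0, 40, 10)},
]
print(repr(line_text(chars)))  # 'ab c'
```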

30 Dec, 2024 2 commits

refactor(magic_pdf): comment out npu-related code · 88b909e2

myhloli authored Dec 30, 2024

- Remove use_npu variable initialization
- Comment out device assignment and npu check
- Comment out use_npu parameter in ModifiedPaddleOCR constructor

88b909e2

fix(npu): correct module name for NPU operations · 2684e775

myhloli authored Dec 30, 2024

- Update `clean_memory.py` to use `torch_npu.npu` instead of `torch.npu`
- Update `model_utils.py` to use `torch_npu.npu` instead of `torch.npu`
- Simplify NPU availability check and bfloat16 support in `pdf_parse_union_core_v2.py`

2684e775

27 Dec, 2024 1 commit
- fix: s3 path join method · d637dab3
  icecraft authored Dec 27, 2024
  
  d637dab3

26 Dec, 2024 2 commits

refactor(device): optimize memory cleaning and device selection · 50f48417

myhloli authored Dec 26, 2024

- Update clean_memory function to support both CUDA and NPU devices
- Implement get_device function to centralize device selection logic
- Modify model initialization and memory cleaning to use the selected device
- Update RapidTableModel to support both RapidOCR and PaddleOCR engines

50f48417
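
A centralized device selector and a device-aware cleanup routine could look roughly like the following. This is a sketch under assumptions: the bodies are illustrative rather than taken from the repository, the priority order (CUDA, then NPU, then CPU) is a guess, and both `torch` and `torch_npu` are treated as optional so the code falls back to `"cpu"` when they are missing.

```python
import gc


def get_device() -> str:
    """Return 'cuda', 'npu', or 'cpu', whichever is available first."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    try:
        import torch_npu  # optional Ascend NPU plugin

        if torch_npu.npu.is_available():
            return "npu"
    except Exception:
        # Plugin missing or failed to initialise: fall through to CPU.
        pass
    return "cpu"


def clean_memory(device: str = "cpu") -> None:
    """Release cached accelerator memory for the given device, then run the GC."""
    if device.startswith("cuda"):
        import torch

        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
    elif device.startswith("npu"):
        import torch_npu

        torch_npu.npu.empty_cache()
    gc.collect()


device = get_device()
clean_memory(device)
print(device)
```

Keeping the choice in one function means model initialization and memory cleaning agree on the device without each repeating the availability checks.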

feat(model): add npu support and optimize table model · 7990e7df

myhloli authored Dec 26, 2024

- Add NPU support for memory cleaning and model initialization
- Optimize table model initialization and prediction process
- Update memory utils to support NPU
- Add language parameter for table model

7990e7df

25 Dec, 2024 2 commits

refactor(magic_pdf): remove unnecessary logging statements · 192047a1

myhloli authored Dec 25, 2024

- Comment out logging statements for title list, title completion, and length comparison
- Improve code readability and reduce clutter by removing unused debug information

192047a1

feat(llm_aided): add title optimization feature · 0a468eca

myhloli authored Dec 25, 2024

- Implement llm_aided_title function to optimize document titles using LLM
- Update pdf_parse_union_core_v2.py to include title optimization
- Modify ocr_mkcontent.py to use optimized title levels
- Add openai SDK dependency in setup.py

0a468eca

24 Dec, 2024 1 commit

feat(llm): add LLM-aided formula and text correction · c660fdc8

myhloli authored Dec 24, 2024

- Add LLM-aided formula and text correction functionality
- Update config reader to include LLM-aided settings
- Create new LLM-aided processing module
- Update main processing script to incorporate LLM-aided corrections
- Modify download scripts to check for new config version

c660fdc8

20 Dec, 2024 1 commit

refactor(pre_proc): improve character overlap handling in spans · 15e87667

myhloli authored Dec 20, 2024

- Remove remove_overlaps_chars function
- Add check_chars_is_overlap_in_span function
- Update span processing logic to handle character overlaps
- Improve efficiency and readability of overlap detection

15e87667
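
For orientation, an overlap check over the characters of a span might be sketched as below. The log does not show how `check_chars_is_overlap_in_span` works, so everything here is an assumption: the function names, the comparison of consecutive characters only, the overlap measure (intersection area over the smaller box), and the 0.35 threshold.

```python
def overlap_ratio(a, b):
    """Intersection area of boxes a and b divided by the smaller box's area."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (iw * ih) / smaller if smaller > 0 else 0.0


def chars_overlap_in_span(chars, threshold=0.35):
    """Return True if any two consecutive characters overlap above the threshold."""
    # The threshold value is hypothetical, chosen only for this example.
    for cur, nxt in zip(chars, chars[1:]):
        if overlap_ratio(cur["bbox"], nxt["bbox"]) > threshold:
            return True
    return False


overlapping = [{"bbox": (0, 0, 10, 10)}, {"bbox": (5, 0, 15, 10)}]
adjacent = [{"bbox": (0, 0, 10, 10)}, {"bbox": (10, 0, 20, 10)}]
print(chars_overlap_in_span(overlapping), chars_overlap_in_span(adjacent))  # True False
```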

19 Dec, 2024 1 commit

feat(pre_proc): add function to remove overlapping characters in spans · 2f4d4b0c

myhloli authored Dec 19, 2024

- Implement remove_overlaps_chars function to detect and remove overlapping characters within spans
- Integrate remove_overlaps_chars function into the PDF parsing process
- Improve character-level processing and reduce redundancy in OCR results

2f4d4b0c

18 Dec, 2024 7 commits
- fix: drop reason append error · 1e6de549
  pangguosheng authored Dec 19, 2024
  
  1e6de549
- fix: skip the char corresponding to invalid bounding boxes · 51b8c57d
  pangguosheng authored Dec 19, 2024
  
  51b8c57d
- feat(gradio-app): improve PDF conversion and UI functionalities · bf2ff5a2
  myhloli authored Dec 18, 2024
```
- Add automatic conversion of uploaded files to PDF
- Update max page slider range and default value
- Prevent interaction with PDF preview to avoid errors
- Increase Markdown rendering height for better visibility
- Update file change event handling for PDF conversion
- Modify supported image suffixes for file upload
```
  bf2ff5a2
- refactor(magic_pdf): move model config variables · 489f70e9
  myhloli authored Dec 18, 2024
```
- Move __use_inside_model__ and __model_mode__ from operators/__init__.py to model/__init__.py
- These variables are more appropriately located in the model module since they relate to model configuration
```
  489f70e9
- fix: remove pipe_auto_method · c968ce86
  icecraft authored Dec 18, 2024
  
  c968ce86
- docs: make sure the generate process of docs work properly · cd11ddcd
  xu rui authored Dec 18, 2024
  
  cd11ddcd
- refactor: refactor code · b2887ca0
  icecraft authored Dec 18, 2024
  
  b2887ca0

17 Dec, 2024 3 commits
- fix: AbsPipe initial method · 78f56a1e
  icecraft authored Dec 17, 2024
  
  78f56a1e
- feat(language-detection): add YOLOv11 language detection model · 20438bd2
  myhloli authored Dec 17, 2024
```
- Add YOLOv11 language detection model for PDF documents
- Implement language detection in PymuDocDataset
- Update app.py to include 'auto' language option
- Create language detection utilities and constants
```
  20438bd2
- feat: add get_middle_json method · e9d36221
  icecraft authored Dec 17, 2024
  
  e9d36221

16 Dec, 2024 1 commit

refactor(magic_pdf): remove YOLO_VERBOSE setting and update YOLOv8 prediction verbosity · 9e4ebea9

myhloli authored Dec 16, 2024

- Remove YOLO_VERBOSE environment variable from multiple files
- Set verbose=False in YOLOv8 prediction method to suppress logger output

9e4ebea9

13 Dec, 2024 3 commits
- fix(pdf): improve ligature handling and text extraction · c638fc5d
  myhloli authored Dec 13, 2024
```
- Move ligature replacement function to pdf_parse_union_core_v2.py
- Optimize ligature replacement using a more efficient approach
- Modify text extraction flags to preserve ligatures in PDF content
- Remove unnecessary function from ocr_mkcontent.py
```
  c638fc5d
- feat: add logging for detection time in BatchAnalyze when OCR is not applied · be010394
  Suven authored Dec 13, 2024
  
  be010394
- feat: enhance batch processing in BatchAnalyze with layout and OCR timing logs · 49bfdf07
  Suven authored Dec 13, 2024
  
  49bfdf07
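
The ligature fix (c638fc5d) in the list above mentions a more efficient replacement approach without showing it. One common option, sketched here as an assumption rather than the project's implementation, is a single `str.translate` pass with a precomputed table; the table contents and the name `replace_ligatures` are illustrative.

```python
# Translation table built once at import time: each ligature code point
# maps to its plain-letter expansion.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
})


def replace_ligatures(text: str) -> str:
    """Expand Latin ligature characters in one pass over the string."""
    return text.translate(LIGATURES)


print(replace_ligatures("\ufb01nal o\ufb03ce"))  # final office
```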
12 Dec, 2024 4 commits
- fix: batch methods in DocLayoutYOLO and YOLOv8 models · 4fd1e41e
  Suven authored Dec 12, 2024
  
  4fd1e41e
- feat: add batch prediction methods for YOLOv8 and Unimernet models · 7ce9edc6
  Suven authored Dec 12, 2024
  
  7ce9edc6
- fix: projects · 440fd0c7
  icecraft authored Dec 12, 2024
  
  440fd0c7
- perf(layout): optimize layout detection for PDF extraction · 6a75d7dc
  myhloli authored Dec 12, 2024
```
- Add initial setup for layout detection
- Implement conditional cropping for tall images
- Skip cropping for wide images to improve performance
- Reuse Image object across layout detection steps
```
  6a75d7dc