Commits · 1b34f7e4ffa1202ce3a4edd8fba37342fac9b3ff · wangsen / MinerU

07 Mar, 2025 1 commit

refactor(magic_pdf): replace PIL with NumPy for image processing · 1b34f7e4

myhloli authored Mar 07, 2025

- Remove PIL usage across multiple files
- Convert image processing functions to use NumPy arrays
- Update crop_img function to work with NumPy arrays
- Modify image loading and resizing to use NumPy and OpenCV
- Clean up unused imports and comments related to PIL

1b34f7e4

11 Feb, 2025 1 commit

fix(model): move environment variable settings to global scope · f5112e21

myhloli authored Feb 11, 2025

- Move environment variable settings for NPU, MPS, and other configurations to the global scope in doc_analyze_by_custom_model.py
- Remove redundant environment variable settings in pdf_extract_kit.py
- This change ensures consistent configuration across the application and avoids potential conflicts or duplicate settings

f5112e21

16 Jan, 2025 1 commit

feat(table): upgrade RapidTable to 1.0.3 and add sub-model support · 79c8a5c8

myhloli authored Jan 16, 2025

- Update RapidTable dependency to version 1.0.3
- Add support for sub-models in RapidTable
- Update magic-pdf configuration to include table sub-model
- Modify table model initialization to support sub-models
- Update table prediction logic to handle new output format

79c8a5c8

15 Jan, 2025 1 commit

feat(model): improve batch analysis logic and support npu · f3502226

myhloli authored Jan 15, 2025

- Add support for NPU (Neural Processing Unit) when available
- Implement batch analysis for GPU and NPU devices
- Optimize memory usage and improve performance
- Update logging and error handling

f3502226

14 Jan, 2025 1 commit

feat(layout): improve title block handling and layout detection · c20e9a1e

myhloli authored Jan 14, 2025

- Merge title blocks that are close to each other horizontally
- Adjust line insertion logic for title blocks
- Increase image size and decrease confidence threshold for layout detection
- Update DocLayoutYOLO model weights
- Refactor drawing of bounding boxes for different block types

c20e9a1e

10 Jan, 2025 1 commit

fix(device): enable MPS support and fix related issues · 203b8f90

myhloli authored Jan 10, 2025

- Add MPS support for Apple Silicon devices
- Implement empty_cache() for MPS devices
- Set PYTORCH_ENABLE_MPS_FALLBACK environment variable
- Adjust MFR model device allocation for MPS

203b8f90
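
The MPS-related steps listed in this commit might look roughly like the following. This is a minimal sketch, not the code from the commit: `PYTORCH_ENABLE_MPS_FALLBACK` is the PyTorch environment variable that lets operators unsupported on MPS fall back to the CPU, and `torch.mps.empty_cache()` is available in PyTorch 2.x. The helper name `empty_mps_cache` and the guarded import are assumptions made so the snippet also runs on machines without PyTorch.

```python
import os

# Set the fallback flag early, before torch is imported (the commonly
# recommended order), so unsupported MPS operators can run on the CPU.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

try:
    import torch
except ImportError:  # keep the sketch runnable without PyTorch installed
    torch = None


def empty_mps_cache() -> bool:
    """Release cached MPS memory; return True only if the cache was cleared.

    Hypothetical helper name, used here for illustration only.
    """
    if torch is None or not hasattr(torch, "mps"):
        return False
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is None or not mps_backend.is_available():
        return False
    torch.mps.empty_cache()
    return True
```

On a machine without an Apple Silicon GPU the helper simply returns `False`, so callers can invoke it unconditionally.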

26 Dec, 2024 2 commits

refactor(device): optimize memory cleaning and device selection · 50f48417

myhloli authored Dec 26, 2024

- Update clean_memory function to support both CUDA and NPU devices
- Implement get_device function to centralize device selection logic
- Modify model initialization and memory cleaning to use the selected device
- Update RapidTableModel to support both RapidOCR and PaddleOCR engines

50f48417
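
A centralized device selector and a device-aware memory cleaner, as described in this commit, could be sketched as follows. The function names `get_device` and `clean_memory` come from the commit message, but the bodies are assumptions rather than the repository's implementation. The sketch also assumes that the Ascend `torch_npu` plugin registers a `torch.npu` namespace when imported, and it falls back to `"cpu"` when PyTorch is not installed.

```python
import gc

try:
    import torch
except ImportError:  # keep the sketch runnable without PyTorch installed
    torch = None


def get_device() -> str:
    """Pick a device string in one place: CUDA first, then NPU, else CPU."""
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    try:
        import torch_npu  # noqa: F401  (assumed to register torch.npu)
        if torch.npu.is_available():
            return "npu"
    except Exception:
        # Plugin missing or not usable on this machine: treat as no NPU.
        pass
    return "cpu"


def clean_memory(device: str = "cuda") -> None:
    """Free cached accelerator memory for `device`, then run Python GC."""
    if torch is not None:
        if device.startswith("cuda") and torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
        elif device.startswith("npu") and hasattr(torch, "npu"):
            torch.npu.empty_cache()
    gc.collect()
```

Calling `clean_memory(get_device())` keeps the device decision in one place, so model initialization and cleanup cannot disagree about which accelerator is in use.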

feat(model): add npu support and optimize table model · 7990e7df

myhloli authored Dec 26, 2024

- Add NPU support for memory cleaning and model initialization
- Optimize table model initialization and prediction process
- Update memory utils to support NPU
- Add language parameter for table model

7990e7df

16 Dec, 2024 1 commit

refactor(magic_pdf): remove YOLO_VERBOSE setting and update YOLOv8 prediction verbosity · 9e4ebea9

myhloli authored Dec 16, 2024

- Remove YOLO_VERBOSE environment variable from multiple files
- Set verbose=False in YOLOv8 prediction method to suppress logger output

9e4ebea9

12 Dec, 2024 1 commit

perf(layout): optimize layout detection for PDF extraction · 6a75d7dc

myhloli authored Dec 12, 2024

- Add initial setup for layout detection
- Implement conditional cropping for tall images
- Skip cropping for wide images to improve performance
- Reuse Image object across layout detection steps

6a75d7dc

11 Dec, 2024 1 commit

feat(layout): improve layout detection for DocLayout_YOLO model · f5d812b3

myhloli authored Dec 11, 2024

- Implement image cropping and pasting technique to enhance layout detection
- Adjust detected polygons to original image coordinates
- Add comments for better code readability

f5d812b3
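
The coordinate adjustment mentioned in this commit can be illustrated with a small sketch. It assumes that detection ran on a crop whose top-left corner sits at a known offset in the original image, optionally resized by a uniform factor. The names `to_original_coords`, `crop_origin`, and `scale` are hypothetical and do not come from the repository.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def to_original_coords(
    polygon: Polygon, crop_origin: Point, scale: float = 1.0
) -> Polygon:
    """Map a polygon detected on a crop back to original-image coordinates.

    crop_origin is the (x, y) of the crop's top-left corner in the original
    image; scale is crop_pixels / original_pixels if the crop was resized.
    """
    ox, oy = crop_origin
    # Undo the resize first, then shift by the crop offset.
    return [(x / scale + ox, y / scale + oy) for x, y in polygon]


box_on_crop = [(10.0, 20.0), (110.0, 20.0), (110.0, 60.0), (10.0, 60.0)]
print(to_original_coords(box_on_crop, crop_origin=(50.0, 300.0)))
```

With a crop origin of `(50, 300)` and no resizing, the first corner `(10, 20)` maps to `(60, 320)` in the original image.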

06 Dec, 2024 7 commits

refactor(magic-pdf): optimize model initialization and concurrency control · 012a46e0

myhloli authored Dec 06, 2024

- Remove concurrency limit logic from app.py
- Update model initialization process in various modules
- Remove unused VRAM check for concurrency limit
- Refactor OCR model initialization in pdf_extract_kit.py
- Update txt_spans_extract_v2 function to use lang parameter instead of ocr_model

012a46e0

refactor(ocr): replace AtomModelSingleton with ocr_model_init for OCR model instantiation · 47a83d28

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model creation
- Add ocr_model_init function to initialize OCR model
- Update OCR model initialization in pdf_extract_kit.py and pdf_parse_union_core_v2.py
- Modify txt_spans_extract_v2 function to accept ocr_model as a parameter
- Update parse_page_core function to use ocr_model instead of lang for OCR processing

47a83d28

refactor(model): implement thread-safe OCR model initialization · f2a92d57

myhloli authored Dec 06, 2024

- Add threading support for OCR model initialization
- Modify AtomModelSingleton to handle thread-specific instances
- Update PDFExtractKit and PDFParseUnionCoreV2 to use new thread-safe OCR initialization

f2a92d57

refactor(magic_pdf): remove unused threading lock and model initialization code · a1744b77

myhloli authored Dec 06, 2024

- Remove threading.Lock import and usage
- Delete unused model initialization comments and code
- Simplify OCR model initialization in both pdf_extract_kit.py and pdf_parse_union_core_v2.py

a1744b77

refactor(model): replace AtomModelSingleton with ocr_model_init for OCR model initialization · 488660dd

myhloli authored Dec 06, 2024

- Remove usage of AtomModelSingleton for OCR model initialization
- Add import of ocr_model_init from model_init module
- Update OCR model initialization process to use ocr_model_init function
- Remove lock for OCR processing as it's no longer needed

488660dd

refactor(model): replace ModelSingleton with direct model initialization and improve threading · 6f636b6e

myhloli authored Dec 06, 2024

- Remove usage of ModelSingleton class
- Initialize model directly using custom_model_init function
- Add self._lock attribute to PDFExtractKit class for thread safety
- Replace local lock with self._lock for OCR processing

6f636b6e

refactor(magic_pdf): optimize model initialization and threading · 878f3de0

赵小蒙 authored Dec 06, 2024

- Remove unnecessary threading.Lock in AtomModelSingleton
- Add threading.Lock to CustomPEKModel for OCR processing
- Simplify model initialization logic in AtomModelSingleton

878f3de0

22 Nov, 2024 1 commit

refactor(model): move page total time logging to custom model analysis · f1e2f084

myhloli authored Nov 22, 2024

- Move page total time logging to doc_analyze_by_custom_model.py
- Remove page total time logging from pdf_extract_kit.py
- Add page_start timing variable to custom model analysis
- Update logger output format for page total time

f1e2f084

21 Nov, 2024 1 commit

feat(ocr): improve text detection and OCR accuracy · b2e37a2d

myhloli authored Nov 21, 2024

- Update OCR utils to handle different box formats and improve angle calculation
- Modify PDF extraction kit to support OCR option and optimize processing flow
- Enhance PPOCR model to sort and filter detection boxes, improving text splitting accuracy

b2e37a2d

19 Nov, 2024 1 commit

refactor: move some constants or enums defs to config folder · b492c19c

icecraft authored Nov 19, 2024

b492c19c

15 Nov, 2024 1 commit

refactor(model): rename and restructure model modules · 08f46125

myhloli authored Nov 15, 2024

08f46125

08 Nov, 2024 2 commits

feat(table): add RapidOCR support for RapidTable model · fe2c2c0d

myhloli authored Nov 09, 2024

- Integrate RapidOCR with RapidTable model for table recognition
- Improve memory management for devices with <= 8GB VRAM
- Update table recognition process to use RapidOCR for RapidTable
- Add rapidocr-paddle dependency in setup.py

fe2c2c0d

feat(table): integrate RapidTable model for table recognition · 240fe99e

myhloli authored Nov 08, 2024

- Add RapidTable model support for table recognition
- Update table model configuration and initialization
- Modify table recognition process to use RapidTable when specified
- Add RapidTable dependency to setup.py

240fe99e

06 Nov, 2024 1 commit

refactor(model): remove unused code and simplify OCR model initialization · 4b0f1176

myhloli authored Nov 06, 2024

- Remove unused code for copying detection and recognition models
- Simplify OCR model initialization using atom_model_manager
- Delete unnecessary comments and empty lines

4b0f1176

04 Nov, 2024 2 commits

feat(table): upgrade StructEqTable model and integrate into PDF Extract Kit · 11f23843

myhloli authored Nov 04, 2024

- Update StructTableModel to use the latest struct-eqtable library
- Add support for HTML table extraction in PDF Extract Kit
- Improve error handling and model initialization
- Update dependencies in setup.py for struct-eqtable

11f23843

Update pdf_extract_kit.py · fb6cb8b0

ciaran authored Nov 04, 2024

Modify line 397 so the code runs on CPU. Previously, specifying 'cpu' in config.json still raised a ValueError during demo execution, because a cuda device was expected but 'cpu' was given.

fb6cb8b0

28 Oct, 2024 3 commits

refactor(table): disable StructEqTable support and add TableMaster support · 377b09cf

myhloli authored Oct 28, 2024

- Remove import and usage of StructTableModel
- Add support for TableMaster model
- Update table model initialization logic to support TableMaster
- Log error and exit if StructEqTable is selected, as it's under upgrade
- Update README files to reflect changes in table parsing capabilities

377b09cf

perf: table model update with PP OCRv4 · 4949408c

liukaiwen authored Oct 28, 2024

4949408c

feat: table model update with paddle recognition v4 · a0eff3be

liukaiwen authored Oct 28, 2024

a0eff3be

25 Oct, 2024 1 commit

refactor(ocr): adjust OCR processing parameters · 1807126e

myhloli authored Oct 25, 2024

- Lower the Y-axis overlap threshold for merging spans into lines from 0.6 to 0.5
- Reduce the unclip ratio for OCR detection from 2.4 to 1.8

1807126e
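
To show what a Y-axis overlap threshold controls, here is a minimal sketch of merging spans into lines. It assumes the overlap ratio is the vertical intersection divided by the height of the shorter box. That definition, and the names `y_overlap_ratio` and `merge_spans_to_lines`, are illustrative assumptions and not taken from the repository.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def y_overlap_ratio(a: BBox, b: BBox) -> float:
    """Vertical overlap divided by the height of the shorter box."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    min_height = min(a[3] - a[1], b[3] - b[1])
    if overlap <= 0 or min_height <= 0:
        return 0.0
    return overlap / min_height


def merge_spans_to_lines(
    spans: List[BBox], threshold: float = 0.5
) -> List[List[BBox]]:
    """Group spans into lines.

    A span joins the current line when it overlaps that line's last span
    vertically by more than `threshold`; otherwise it starts a new line.
    """
    lines: List[List[BBox]] = []
    for span in sorted(spans, key=lambda s: (s[1], s[0])):
        if lines and y_overlap_ratio(lines[-1][-1], span) > threshold:
            lines[-1].append(span)
        else:
            lines.append([span])
    return lines
```

Under this definition, two spans with an overlap ratio of 0.55 end up on the same line at a threshold of 0.5 but on separate lines at 0.6, so lowering the threshold makes line merging more permissive.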

24 Oct, 2024 1 commit

refactor(magic_pdf): adjust confidence threshold for DocLayout_YOLO model · ce72cf05

myhloli authored Oct 24, 2024

- Changed the confidence threshold from 0.15 to 0.25 in the DocLayout_YOLO model prediction
- This adjustment aims to improve the accuracy of layout detection by filtering out low-confidence predictions

ce72cf05

23 Oct, 2024 1 commit

feat(model): add support for DocLayout-YOLO model · 1279f2cd

myhloli authored Oct 23, 2024

- Add new layout model option: DocLayout-YOLO
- Implement model initialization and prediction for DocLayout-YOLO
- Update configuration options to include new model
- Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
- Update Gradio app to support more custom switches

1279f2cd

17 Oct, 2024 2 commits

feat: merge formula update · 51f56aa3

liukaiwen authored Oct 17, 2024

51f56aa3

refactor(ocr): increase the dilation factor in OCR to address the issue of word concatenation · 011a1b97

myhloli authored Oct 17, 2024

- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc.
- Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic
- Remove wordninja dependency from requirements
- Update ocr_model_init to include additional parameters for OCR model configuration

011a1b97

14 Oct, 2024 1 commit

feat(list&index block): detect and merge list and index blocks · 1f1dd353

myhloli authored Oct 15, 2024

- Add detection for list and index blocks in OCR processing
- Implement merging of list and index blocks across pages
- Update block types to include list and index categories
- Adjust text merging logic to handle new block types
- Modify layout drawing to distinguish list and index blocks

1f1dd353

08 Oct, 2024 2 commits

feat: merge formula update · a3358878

liukaiwen authored Oct 08, 2024

a3358878

perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity · fb9949c4

myhloli authored Oct 08, 2024

- Introduce a conditional memory cleanup step in the PDF extraction process
- Assess available GPU memory before deciding to perform memory cleanup
- Log the time taken for garbage collection when it occurs
- This optimization helps to balance performance and resource utilization

fb9949c4
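
The conditional cleanup described in this commit could be sketched as below. The memory size is passed in as a plain number so the example runs without a GPU. The 8 GB default mirrors the "<= 8GB VRAM" wording of a later commit in this log, but the function name `maybe_clean_memory`, its parameters, and the log format are assumptions.

```python
import gc
import time


def maybe_clean_memory(total_vram_gb: float, threshold_gb: float = 8.0) -> float:
    """Run garbage collection only on low-memory GPUs.

    Returns the seconds spent collecting, or 0.0 when cleanup is skipped.
    """
    if total_vram_gb > threshold_gb:
        return 0.0  # enough memory: skip the cleanup cost
    start = time.perf_counter()
    gc.collect()
    elapsed = time.perf_counter() - start
    print(f"gc time: {elapsed:.2f}s")  # log how long the cleanup took
    return elapsed
```

Skipping the collection on large-memory devices avoids paying the cleanup latency on every page when there is no memory pressure to relieve.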

06 Oct, 2024 1 commit

refactor(model): improve timing information and performance · be1b1ae7

myhloli authored Oct 06, 2024

- Enhance timing output precision to two decimal places for better readability
- Calculate and log document analysis speed in pages per second
- Optimize logging for YOLO and table recognition processes
- Remove unnecessary comments and improve code efficiency

be1b1ae7

29 Sep, 2024 1 commit

refactor(memory management): remove unused clean_memory function · 4c9bf8ab

myhloli authored Sep 29, 2024

The clean_memory function has been removed from pdf_parse_union_core_v2.py because it was unused.
This change streamlines the code and avoids confusion about its purpose.

4c9bf8ab