Commits · f27320c2c898cbd46f07b122da21daa2de365844 · wangsen / MinerU

21 Feb, 2025 3 commits

fix(model): handle import errors and improve exception logging · 66f0899a

myhloli authored Feb 21, 2025

- Add ImportError handling to silence known import-related exceptions
- Improve generic exception handling to log error messages
- Maintain existing specific exception handlers for license-related issues

66f0899a
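
A minimal sketch of the handling pattern this commit describes: a missing optional plugin is silenced, while any other failure is logged before falling back to CPU. The function name `select_backend` and the `"npu"`/`"cpu"` return values are illustrative assumptions, not MinerU's actual code, and the license-specific handlers mentioned above are omitted.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def select_backend(plugin_module: str) -> str:
    """Return "npu" if the optional plugin imports cleanly, else "cpu".

    Hypothetical helper: the real project wires this into model init.
    """
    try:
        importlib.import_module(plugin_module)
    except ImportError:
        # Plugin not installed: a known, expected case, so stay silent.
        return "cpu"
    except Exception as exc:
        # Anything else is unexpected: log the message, then fall back.
        logger.error("plugin initialisation failed: %s", exc)
        return "cpu"
    return "npu"
```

Calling `select_backend` with the name of a module that is not installed returns `"cpu"` without emitting any log output.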

feat(model_init): implement license verification for Ascend plugin · d5f6fbc6

myhloli authored Feb 21, 2025

- Add license verification logic for Ascend plugin
- Handle different license-related exceptions with appropriate error messages
- Log success message with license expiration date if verification passes
- Fall back to CPU model if license verification fails or plugin is not available

d5f6fbc6

refactor(magic_pdf): improve title optimization process · 54940c61

myhloli authored Feb 21, 2025

- Update instructions for AI-generated titles optimization
- Use ast.literal_eval() instead of json.loads() for parsing completion content
- Refactor variable names and logging for better code readability
- Add error handling for JSON decoding issues

54940c61
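
One plausible reason for the switch to `ast.literal_eval()` is that a completion may come back as a Python-style literal (integer keys, single quotes) that `json.loads()` rejects. The sketch below illustrates that difference; `parse_title_levels` and the sample string are assumptions made for illustration and do not reproduce the project's code.

```python
import ast
import json


def parse_title_levels(completion_text: str):
    """Parse a completion as a Python literal; return None if it is not one."""
    try:
        return ast.literal_eval(completion_text.strip())
    except (ValueError, SyntaxError, TypeError):
        # Not a valid literal (free text, truncated output, and so on).
        return None


sample = "{0: 1, 1: 2, 2: 2}"  # hypothetical completion: title index -> level

try:
    json.loads(sample)
    json_accepts_sample = True
except json.JSONDecodeError:
    # JSON requires string keys, so this literal is rejected.
    json_accepts_sample = False

levels = parse_title_levels(sample)
```

Here `levels` is an ordinary `dict` with integer keys, while `json_accepts_sample` ends up `False`.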

18 Feb, 2025 3 commits
- fix: update figure caption match algorithm · f731fcab
  icecraft authored Feb 18, 2025
  
  f731fcab
- fix: update figure caption match algorithm · 0793da41
  icecraft authored Feb 18, 2025
  
  0793da41
- fix: caption match algorithm · daf0593b
  icecraft authored Feb 18, 2025
  
  daf0593b
14 Feb, 2025 1 commit
- fix(pdf_parse): Fixed the issue where some headings were missing in certain complex layouts. · 30bd3a83
  myhloli authored Feb 14, 2025
  
  30bd3a83
11 Feb, 2025 2 commits

fix(model): move environment variable settings to global scope · f5112e21

myhloli authored Feb 11, 2025

- Move environment variable settings for NPU, MPS, and other configurations to the global scope in doc_analyze_by_custom_model.py
- Remove redundant environment variable settings in pdf_extract_kit.py
- This change ensures consistent configuration across the application and avoids potential conflicts or duplicate settings

f5112e21

refactor(magic_pdf): improve code structure and memory safety · 4021abeb

myhloli authored Feb 11, 2025

4021abeb

10 Feb, 2025 2 commits

refactor(model_init): adjust table model import order and remove redundant imports · 4c0af020

myhloli authored Feb 10, 2025

- Remove redundant imports for StructTableModel and TableMasterPaddleModel
- Reorder imports to group related modules together
- Update import structure for better readability and maintainability

4c0af020

refactor(model): integrate Ascend plugin for NPU support · 7c76d361

myhloli authored Feb 10, 2025

- Remove unused utility functions
- Update import statements for better readability
- Add conditional imports for Ascend plugin
- Refactor table model initialization to support NPU

7c76d361

09 Feb, 2025 4 commits
- fix(pdf_parse): improve image processing and OCR accuracy · 5561ac95
  myhloli authored Feb 09, 2025
  - Update calculate_contrast function to support both RGB and BGR image modes
  - Add input validation for image mode in calculate_contrast function
  - Modify usage of calculate_contrast function in OCR processing to specify image mode
  5561ac95
- perf(language_detection): optimize batch size for language detection model · e4e4eef1
  myhloli authored Feb 09, 2025
  - Increase batch size from 8 to 256 for language detection inference
  - Add timing measurement for language detection process
  e4e4eef1
- fix(filter): toggle invalid character detection method · a5342950
  myhloli authored Feb 09, 2025
  
  a5342950
- refactor(filter): remove unused text layout analysis for PDF classification · f35a6c08
  myhloli authored Feb 09, 2025
  
  f35a6c08
08 Feb, 2025 2 commits

feat(pdf_parse): improve OCR processing and contrast filtering · 9f18ca20

myhloli authored Feb 08, 2025

- Rename empty_spans to need_ocr_spans for better clarity
- Add calculate_contrast function to measure image contrast
- Filter out low-contrast spans to improve OCR accuracy
- Update OCR processing workflow to use new filtering method

9f18ca20
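
The commit does not state how contrast is measured. As a rough illustration only, the sketch below assumes RMS contrast (standard deviation divided by mean) over grayscale values and a hypothetical threshold of 0.1; the names `rms_contrast` and `is_low_contrast` are not taken from the project.

```python
from statistics import fmean, pstdev


def rms_contrast(gray_pixels):
    """RMS contrast of a non-empty 2D grid of grayscale values (0-255)."""
    values = [value for row in gray_pixels for value in row]
    mean = fmean(values)
    # Small epsilon avoids dividing by zero on an all-black crop.
    return pstdev(values) / (mean + 1e-6)


def is_low_contrast(gray_pixels, threshold=0.1):
    """True when a span's crop is too flat to be worth sending to OCR.

    The threshold is a made-up value for illustration.
    """
    return rms_contrast(gray_pixels) < threshold


flat_crop = [[128, 128], [128, 128]]    # uniform gray: no visible text
checker_crop = [[0, 255], [255, 0]]     # strong light/dark variation
```

With these inputs, `is_low_contrast(flat_crop)` is true and `is_low_contrast(checker_crop)` is false, so only the second crop would be kept for OCR.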

refactor(magic_pdf): update invalid character detection logic · 5aa809ff

myhloli authored Feb 08, 2025

- Uncomment detect_invalid_chars_by_pymupdf function call
- Comment out detect_invalid_chars function call

5aa809ff

07 Feb, 2025 1 commit

perf(model): optimize batch ratio for different GPU memory sizes · b1ac7afd

myhloli authored Feb 07, 2025

- Update batch ratio calculation logic to better utilize available GPU memory
- Improve logging for all GPU memory sizes

b1ac7afd

27 Jan, 2025 2 commits
- perf(model): adjust batch ratio for different GPU memory sizes · 29e7a948
  myhloli authored Jan 27, 2025
  
  29e7a948
- perf(model): adjust batch ratio for GPU memory range · d1af4566
  myhloli authored Jan 27, 2025
  - Update batch ratio calculation for GPU memory range
  - Raise the upper GPU memory limit for batch ratio 16 from 24 GB to 32 GB
  d1af4566
23 Jan, 2025 1 commit
- Update version.py with new version · 4211c74c
  myhloli authored Jan 23, 2025
  
  4211c74c
22 Jan, 2025 3 commits

feat(pdf_parse_union_core_v2): add timing log for LLM aided processes · 10e848b3

myhloli authored Jan 22, 2025

- Add timing measurement for formula, text, and title optimization using LLM
- Log the execution time for each LLM aided process

10e848b3

fix(boxbase): handle cases where bounding box area is zero · c38060d5

myhloli authored Jan 22, 2025

- Add a check to return 0 when either bbox1_area or bbox2_area is zero
- This prevents division by zero errors when calculating IoU

c38060d5
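
The zero-area guard described in this commit can be sketched as follows. This is a generic IoU routine written for illustration, assuming boxes are `(x0, y0, x1, y1)` tuples; only the names `bbox1_area` and `bbox2_area` come from the commit message.

```python
def calculate_iou(bbox1, bbox2):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    x_left = max(bbox1[0], bbox2[0])
    y_top = max(bbox1[1], bbox2[1])
    x_right = min(bbox1[2], bbox2[2])
    y_bottom = min(bbox1[3], bbox2[3])

    # No overlap at all.
    if x_right < x_left or y_bottom < y_top:
        return 0.0

    intersection_area = (x_right - x_left) * (y_bottom - y_top)
    bbox1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    bbox2_area = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])

    # Guard from the commit: a degenerate box would otherwise make the
    # denominator zero when both boxes collapse to the same point.
    if bbox1_area == 0 or bbox2_area == 0:
        return 0.0

    return intersection_area / (bbox1_area + bbox2_area - intersection_area)
```

Without the guard, two identical zero-size boxes such as `(0, 0, 0, 0)` would reach the final division with a denominator of zero.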

refactor(pdf_parse): uncomment char bbox validation logic · 1d08865f

myhloli authored Jan 22, 2025

- Restore commented code for filtering out characters with invalid bounding boxes
- This change may affect the filtering of unnecessary characters in PDF parsing

1d08865f

21 Jan, 2025 7 commits

fix(magic_pdf): correct batch ratio conditions for GPU memory · b6710b99

myhloli authored Jan 21, 2025

- Update conditions for batch ratio assignment:
  - 8 <= gpu_memory < 10: batch_ratio = 2
  - 10 <= gpu_memory <= 12: batch_ratio = 4
- This fix ensures proper batch ratio selection for GPU memory sizes

b6710b99
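
The two conditions listed in this commit translate into a lookup like the one below. Only those two ranges come from the commit message; the function name `batch_ratio_for` and the `default` fallback for every other memory size are assumptions, since the remaining ranges are not stated here.

```python
def batch_ratio_for(gpu_memory, default=1):
    """Map GPU memory (in GB) to a batch ratio.

    Only the two ranges below are taken from the commit message;
    `default` is a placeholder for ranges it does not describe.
    """
    if 8 <= gpu_memory < 10:
        return 2
    if 10 <= gpu_memory <= 12:
        return 4
    return default
```

Note that the boundary at 10 GB belongs to the second range, so a 10 GB card gets a ratio of 4 rather than 2.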

perf(magic_pdf): optimize batch processing for GPU · 55447c8b

myhloli authored Jan 21, 2025

- Improve batch ratio calculation based on GPU memory
- Enhance performance for devices with 8GB or more VRAM

55447c8b

perf(magic_pdf): adjust batch ratio calculation for GPU memory · 037736fb

myhloli authored Jan 21, 2025

- Reduce batch_ratio by 1 for better performance and stability
- This change ensures more consistent memory usage when processing documents

037736fb

refactor(magic_pdf): adjust VRAM allocation and MFR batch size · e74a2960

myhloli authored Jan 21, 2025

- Update VRAM allocation logic to use 'VIRTUAL_VRAM_SIZE' environment variable
- Reduce MFR (Math Formula Recognition) batch size from 64 to 32

e74a2960

perf(magic_pdf): optimize batch ratio calculation for GPU · 052a4d72

myhloli authored Jan 21, 2025

- Update GPU memory check and batch ratio calculation logic
- Add support for virtual VRAM size environment variable
- Improve logging for GPU memory and batch ratio

052a4d72
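
A sketch of how an environment variable could override the detected memory size before the batch ratio is chosen. The variable name `VIRTUAL_VRAM_SIZE` appears in the related commit e74a2960; the function `effective_vram`, the assumption that the value is a number in GB, and the fallback on malformed input are all illustrative guesses.

```python
import os


def effective_vram(detected_gb, env=None):
    """Return the VRAM size to plan batches for.

    Prefers the VIRTUAL_VRAM_SIZE override when it is set and numeric,
    otherwise keeps the detected value. Units are assumed to be GB.
    """
    env = os.environ if env is None else env
    raw = env.get("VIRTUAL_VRAM_SIZE")
    if raw is None:
        return detected_gb
    try:
        return float(raw)
    except ValueError:
        # Malformed override: ignore it rather than fail model init.
        return detected_gb
```

Passing a plain dictionary as `env` makes the behaviour easy to check without touching the real process environment, for example `effective_vram(24, env={"VIRTUAL_VRAM_SIZE": "8"})`.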

perf(model): adjust batch size for layout and formula detection · 49d140c5

myhloli authored Jan 21, 2025

- Reduce YOLO_LAYOUT_BASE_BATCH_SIZE from 4 to 1
- Simplify batch ratio calculation for formula detection
- Remove unused conditional logic in batch ratio determination

49d140c5

fix(models): update unimernet_small model path · 2a3a006f

myhloli authored Jan 21, 2025

- Update model path from 'unimernet_small' to 'unimernet_small_2501' in multiple scripts and configuration files
- This change affects download_models.py, download_models_hf.py, and model_configs.yaml

2a3a006f

20 Jan, 2025 3 commits

fix(ocr): improve ONNX model initialization and error handling · b3d60b96

myhloli authored Jan 20, 2025

- Add key length validation for ONNX model initialization
- Move import statements to the top of the file
- Wrap model initialization in a try-except block for better error handling
- Refactor code to improve readability and maintainability

b3d60b96

feat(pdf_parse): remove tilted lines for better text extraction · ba6c17a9

myhloli authored Jan 20, 2025

- Add remove_tilted_line function to filter out lines with angles between 2 and 88 degrees
- Integrate the new function into the text extraction process
- Improve the accuracy of text block processing by removing non-horizontal/vertical lines

ba6c17a9
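
The filter described in this commit can be approximated as below. The sketch assumes each line carries a `"dir"` direction vector `(dx, dy)` and treats the 2 and 88 degree bounds as exclusive; neither detail is confirmed by the commit message, and `remove_tilted_lines` is a stand-in rather than the project's `remove_tilted_line`.

```python
import math


def tilt_angle_degrees(direction):
    """Angle between a direction vector and the horizontal axis, in [0, 90]."""
    dx, dy = direction
    angle = abs(math.degrees(math.atan2(dy, dx)))  # 0..180
    if angle > 90:
        angle = 180 - angle  # fold right-to-left text back onto 0..90
    return angle


def remove_tilted_lines(lines, low=2.0, high=88.0):
    """Keep only lines that are close to horizontal or vertical."""
    return [
        line for line in lines
        if not (low < tilt_angle_degrees(line["dir"]) < high)
    ]


lines = [
    {"text": "horizontal", "dir": (1, 0)},
    {"text": "vertical", "dir": (0, 1)},
    {"text": "watermark", "dir": (1, 1)},    # 45 degrees: dropped
    {"text": "reversed", "dir": (-1, 0)},
]
kept = [line["text"] for line in remove_tilted_lines(lines)]
```

In this example the diagonal "watermark" line is the only one removed, which matches the stated goal of discarding text that is neither horizontal nor vertical.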

Fix OCR utils · fbf1c4bf

陆逊 authored Jan 20, 2025

fbf1c4bf

17 Jan, 2025 3 commits

feat(llm_aided): add reasonability check and fine-tuning guidelines · d986e393

myhloli authored Jan 17, 2025

- Added instructions for checking the reasonability of heading levels
- Included guidelines for making fine adjustments based on context and logic
- Emphasized the importance of aligning the final result with the document's actual structure

d986e393

fix(magic_pdf): limit batch ratio for GPU memory · db8be974

myhloli authored Jan 17, 2025

- Commented out the original batch ratio calculation
- Set a fixed batch ratio of 2 for GPUs with less than 8 GB memory
- Increased batch ratio to 4 for GPUs with 8 GB or more memory

db8be974

refactor(table): add device configuration for Unitable model · e64d4fed

myhloli authored Jan 17, 2025

- Import get_device function from magic_pdf.libs.config_reader
- Update RapidTableModel initialization to include device parameter for Unitable model

e64d4fed

16 Jan, 2025 3 commits

refactor(model): update batch analyze logic for rapid table model · 452a9c0b

myhloli authored Jan 16, 2025

- Modify the batch analyze process to handle the rapid table model's output
- Add logic_points variable to capture additional output from rapid table prediction

452a9c0b

feat(table): upgrade RapidTable to 1.0.3 and add sub-model support · 79c8a5c8

myhloli authored Jan 16, 2025

- Update RapidTable dependency to version 1.0.3
- Add support for sub-models in RapidTable
- Update magic-pdf configuration to include table sub-model
- Modify table model initialization to support sub-models
- Update table prediction logic to handle new output format

79c8a5c8

fix(magic_pdf): correct end page index and improve error handling · f209ddea

myhloli authored Jan 16, 2025

- Adjust end_page_id calculation to prevent IndexError when accessing pages
- Enhance error handling in LLM post-processing by specifically catching JSONDecodeError

f209ddea
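
One way to keep `end_page_id` inside the document is to clamp it to the last valid index, as sketched below. The helper name `resolve_end_page_id` and the treatment of `None` and negative values as "process to the last page" are assumptions for illustration, not the commit's exact logic.

```python
def resolve_end_page_id(end_page_id, page_count):
    """Clamp a requested end page to the last valid zero-based index.

    Assumes page_count >= 1. None or a negative value means "to the end".
    """
    last_index = page_count - 1
    if end_page_id is None or end_page_id < 0:
        return last_index
    # Requests past the end would raise IndexError when pages are accessed.
    return min(end_page_id, last_index)
```

For a five-page document, a request for page index 10 resolves to 4, so a later loop over `range(start, end + 1)` never indexes past the final page.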