Commits · 1b34f7e4ffa1202ce3a4edd8fba37342fac9b3ff · wangsen / MinerU

07 Mar, 2025 1 commit

refactor(magic_pdf): replace PIL with NumPy for image processing · 1b34f7e4

myhloli authored Mar 07, 2025

- Remove PIL usage across multiple files
- Convert image processing functions to use NumPy arrays
- Update crop_img function to work with NumPy arrays
- Modify image loading and resizing to use NumPy and OpenCV
- Clean up unused imports and comments related to PIL

1b34f7e4
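
The `crop_img` change in this commit can be pictured as plain array slicing. The block below is a minimal sketch, not the repository's implementation: the helper name `crop_array`, the `(x0, y0, x1, y1)` box convention, and the clamping behaviour are all assumptions made for illustration.

```python
import numpy as np


def crop_array(image: np.ndarray, bbox: tuple) -> np.ndarray:
    """Crop an H x W x C image array to bbox = (x0, y0, x1, y1).

    Hypothetical helper: coordinates are clamped to the image bounds so an
    oversized box neither raises nor wraps around via negative indices.
    """
    height, width = image.shape[:2]
    x0, y0, x1, y1 = bbox
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    # NumPy indexes rows (y) first, then columns (x).
    return image[y0:y1, x0:x1].copy()


page = np.zeros((100, 200, 3), dtype=np.uint8)
patch = crop_array(page, (10, 20, 60, 50))
print(patch.shape)  # (30, 50, 3)
```

Returning a copy rather than a view keeps later edits to the patch from changing the source page, at the cost of extra memory.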

03 Mar, 2025 5 commits
- perf(inference): adjust batch ratio for high GPU memory · 0b05dff7
  myhloli authored Mar 03, 2025
  - Increase batch ratio to 8 for GPU memory >= 16 GB
  - Improve inference performance on systems with higher GPU memory
  0b05dff7
- fix: caption match · fb02be19
  icecraft authored Mar 03, 2025
  
  fb02be19
- perf(inference): adjust batch ratio for GPU memory sizes · 58b6ad8c
  myhloli authored Mar 03, 2025
  - Simplify batch ratio logic for GPU memory >= 16 GB
  - Remove unnecessary conditions for 20 GB and 40 GB memory
  58b6ad8c
- perf(inference): adjust batch ratio for GPU memory sizes · 0d3304d7
  myhloli authored Mar 03, 2025
  - Simplify batch ratio logic for GPU memory >= 16 GB
  - Remove unnecessary conditions for 20 GB and 40 GB memory
  0d3304d7
- perf(mfr): improve Math Formula Recognition by sorting images by area · 59fc80d4
  myhloli authored Mar 03, 2025
  - Sort detected images by area before processing to enhance MFR accuracy
  - Implement stable sorting to maintain original order of images with equal area
  59fc80d4
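
Sorting by area with a stable sort, as commit 59fc80d4 describes, can be sketched as follows. This is an illustration under assumptions: the function name, the `(x0, y0, x1, y1)` box format, and the ascending order are not taken from the repository. Python's built-in `sorted` is stable, so items with equal keys keep their original relative order.

```python
def sort_by_area(boxes):
    """Return (original_index, box) pairs ordered by ascending box area.

    Keeping the original index makes it possible to put results back into
    detection order after batched processing.
    """
    def area(item):
        _, (x0, y0, x1, y1) = item
        return (x1 - x0) * (y1 - y0)

    return sorted(enumerate(boxes), key=area)


boxes = [(0, 0, 10, 10), (0, 0, 2, 5), (5, 5, 7, 10), (0, 0, 3, 3)]
ordered = sort_by_area(boxes)
print([index for index, _ in ordered])  # [3, 1, 2, 0]
```

Boxes 1 and 2 both have area 10 and stay in their original order.
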
26 Feb, 2025 1 commit
- fix: match multiple captions · 15cd97ff
  icecraft authored Feb 26, 2025
  
  15cd97ff
25 Feb, 2025 1 commit

perf(model): optimize batch analyze process · 6753df8d

myhloli authored Feb 25, 2025

- Move batch model initialization outside the loop
- Collect page dimensions before analyzing
- Update page info dictionary structure
- Add null dimensions for non-analyzed pages

6753df8d
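
The restructuring listed in this commit can be sketched roughly as below. Everything here is hypothetical: the function name, the dictionary keys, and the use of `None` for the "null dimensions" are assumptions, since the log does not show the actual structure.

```python
def analyze_range(pages, start_page_id, end_page_id, build_model):
    """Analyze pages in [start_page_id, end_page_id]; others get placeholders."""
    model = build_model()  # created once, not once per page

    # Collect every page's dimensions up front.
    dimensions = [(page["width"], page["height"]) for page in pages]

    results = []
    for index, page in enumerate(pages):
        if start_page_id <= index <= end_page_id:
            width, height = dimensions[index]
            detections = model(page)
        else:
            # Assumption: None stands in for the "null dimensions".
            width, height = None, None
            detections = []
        results.append({
            "layout_dets": detections,
            "page_info": {"page_no": index, "width": width, "height": height},
        })
    return results


built = []


def build_fake_model():
    built.append(1)
    return lambda page: [f"block on {page['width']}x{page['height']} page"]


pages = [{"width": 595, "height": 842}] * 3
output = analyze_range(pages, 0, 1, build_fake_model)
print(len(built))              # 1
print(output[2]["page_info"])  # {'page_no': 2, 'width': None, 'height': None}
```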

24 Feb, 2025 1 commit

fix(magic_pdf): correct negative indexing for `end_page_id` · 90a27ecd

myhloli authored Feb 24, 2025

- Update the logic for determining `end_page_id` to handle negative values
- This change ensures proper behavior when `end_page_id` is set to -1 or other negative values

90a27ecd
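
One way to read this fix is that any negative `end_page_id` means "through the last page". The sketch below assumes that interpretation, along with the helper name and the clamping of oversized values. None of it is confirmed by the log.

```python
def resolve_end_page_id(end_page_id, page_count):
    """Map an optional, possibly negative end_page_id to a valid last index."""
    last_index = page_count - 1
    if end_page_id is None or end_page_id < 0:
        # Assumption: None and every negative value select the final page.
        return last_index
    # Clamp so a too-large value cannot index past the document.
    return min(end_page_id, last_index)


print(resolve_end_page_id(-1, 10))  # 9
print(resolve_end_page_id(3, 10))   # 3
```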

23 Feb, 2025 1 commit

chore(magic_pdf): enhance license logging information · 3fe315d8

myhloli authored Feb 23, 2025

- Add license ID information to the log for better traceability
- Improve logging format to include both license ID and expiration date

3fe315d8

22 Feb, 2025 1 commit
- fix doc_analyze first page only · 37f3e200
  Nathan Dahlberg authored Feb 22, 2025
  
  37f3e200
21 Feb, 2025 2 commits

fix(model): handle import errors and improve exception logging · 66f0899a

myhloli authored Feb 21, 2025

- Add ImportError handling to silence known import-related exceptions
- Improve generic exception handling to log error messages
- Maintain existing specific exception handlers for license-related issues

66f0899a

feat(model_init): implement license verification for Ascend plugin · d5f6fbc6

myhloli authored Feb 21, 2025

- Add license verification logic for Ascend plugin
- Handle different license-related exceptions with appropriate error messages
- Log success message with license expiration date if verification passes
- Fall back to CPU model if license verification fails or plugin is not available

d5f6fbc6
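
The fallback behaviour described in the two commits above (silent `ImportError`, specific license handlers, logged generic errors, CPU fallback) might look like the sketch below. The exception classes, the loader callables, and the log messages are hypothetical stand-ins. The real plugin's API is not shown in this log.

```python
import logging

logger = logging.getLogger("license_sketch")


class LicenseExpiredError(Exception):
    """Hypothetical: the plugin license is past its expiration date."""


class LicenseInvalidError(Exception):
    """Hypothetical: the plugin license failed validation."""


def init_model(load_plugin_model, load_cpu_model):
    """Try the accelerator plugin first and fall back to the CPU model.

    load_plugin_model is expected to return (model, expiration_date) and may
    raise ImportError when the plugin is not installed.
    """
    try:
        model, expiration_date = load_plugin_model()
    except ImportError:
        # A missing plugin is an expected situation, so stay quiet.
        pass
    except LicenseExpiredError:
        logger.warning("Plugin license has expired; using the CPU model.")
    except LicenseInvalidError:
        logger.warning("Plugin license is invalid; using the CPU model.")
    except Exception as exc:
        # Anything unexpected is logged with its message rather than hidden.
        logger.error("Plugin initialization failed: %s", exc)
    else:
        logger.info("License verified; expires on %s.", expiration_date)
        return model
    return load_cpu_model()


def missing_plugin():
    raise ImportError("plugin not installed")


print(init_model(missing_plugin, lambda: "cpu-model"))  # cpu-model
```

Listing the specific handlers before the generic `except Exception` matters, because Python uses the first matching clause.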

18 Feb, 2025 3 commits
- fix: update figure caption match algorithm · f731fcab
  icecraft authored Feb 18, 2025
  
  f731fcab
- fix: update figure caption match algorithm · 0793da41
  icecraft authored Feb 18, 2025
  
  0793da41
- fix: caption match algorithm · daf0593b
  icecraft authored Feb 18, 2025
  
  daf0593b
11 Feb, 2025 2 commits

fix(model): move environment variable settings to global scope · f5112e21

myhloli authored Feb 11, 2025

- Move environment variable settings for NPU, MPS, and other configurations to the global scope in doc_analyze_by_custom_model.py
- Remove redundant environment variable settings in pdf_extract_kit.py
- This change ensures consistent configuration across the application and avoids potential conflicts or duplicate settings

f5112e21

refactor(magic_pdf): improve code structure and memory safety · 4021abeb
myhloli authored Feb 11, 2025

4021abeb

10 Feb, 2025 2 commits

refactor(model_init): adjust table model import order and remove redundant imports · 4c0af020

myhloli authored Feb 10, 2025

- Remove redundant imports for StructTableModel and TableMasterPaddleModel
- Reorder imports to group related modules together
- Update import structure for better readability and maintainability

4c0af020

refactor(model): integrate Ascend plugin for NPU support · 7c76d361

myhloli authored Feb 10, 2025

- Remove unused utility functions
- Update import statements for better readability
- Add conditional imports for Ascend plugin
- Refactor table model initialization to support NPU

7c76d361

09 Feb, 2025 1 commit

perf(language_detection): optimize batch size for language detection model · e4e4eef1

myhloli authored Feb 09, 2025

- Increase batch size from 8 to 256 for language detection inference
- Add timing measurement for language detection process

e4e4eef1
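
Batched inference with a timing measurement, as this commit describes, can be sketched as follows. The function names and the fake predictor are illustrative assumptions. Only the batch size of 256 comes from the commit message.

```python
import time


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_batched(items, predict_batch, batch_size=256):
    """Run predict_batch over all items and report the elapsed seconds."""
    started = time.perf_counter()
    results = []
    for batch in iter_batches(items, batch_size):
        results.extend(predict_batch(batch))
    elapsed = time.perf_counter() - started
    return results, elapsed


calls = []


def fake_predict(batch):
    # Stand-in for a language detection model: records the batch size.
    calls.append(len(batch))
    return ["en"] * len(batch)


texts = [f"sample {i}" for i in range(600)]
labels, seconds = run_batched(texts, fake_predict)
print(calls)  # [256, 256, 88]
```

With 600 inputs, a batch size of 8 would need 75 model calls, while 256 needs 3.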

07 Feb, 2025 1 commit

perf(model): optimize batch ratio for different GPU memory sizes · b1ac7afd

myhloli authored Feb 07, 2025

- Update batch ratio calculation logic to better utilize available GPU memory
- Improve logging for all GPU memory sizes

b1ac7afd

27 Jan, 2025 2 commits
- perf(model): adjust batch ratio for different GPU memory sizes · 29e7a948
  myhloli authored Jan 27, 2025
  
  29e7a948
- perf(model): adjust batch ratio for GPU memory range · d1af4566
  myhloli authored Jan 27, 2025
  - Update batch ratio calculation for GPU memory range
  - Increase upper limit for batch ratio 16 from 24 to 32 GB
  d1af4566
21 Jan, 2025 6 commits

fix(magic_pdf): correct batch ratio conditions for GPU memory · b6710b99

myhloli authored Jan 21, 2025

- Update conditions for batch ratio assignment:
  - `8 <= gpu_memory < 10`: `batch_ratio = 2`
  - `10 <= gpu_memory <= 12`: `batch_ratio = 4`
- This fix ensures proper batch ratio selection for GPU memory sizes

b6710b99
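
The two conditions named in this commit can be written as a small lookup. This sketch covers only those two ranges. The fallback value of 1 for every other memory size is an assumption, because this commit message does not state the remaining tiers.

```python
def batch_ratio_for(gpu_memory_gb: float) -> int:
    """Return the batch ratio for the two ranges listed in commit b6710b99."""
    if 8 <= gpu_memory_gb < 10:
        return 2
    if 10 <= gpu_memory_gb <= 12:
        return 4
    # Assumption: sizes outside the listed ranges fall back to 1 here.
    return 1


for memory in (8, 9.5, 10, 12):
    print(memory, batch_ratio_for(memory))
```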

perf(magic_pdf): optimize batch processing for GPU · 55447c8b

myhloli authored Jan 21, 2025

- Improve batch ratio calculation based on GPU memory
- Enhance performance for devices with 8GB or more VRAM

55447c8b

perf(magic_pdf): adjust batch ratio calculation for GPU memory · 037736fb

myhloli authored Jan 21, 2025

- Reduce batch_ratio by 1 for better performance and stability
- This change ensures more consistent memory usage when processing documents

037736fb

refactor(magic_pdf): adjust VRAM allocation and MFR batch size · e74a2960

myhloli authored Jan 21, 2025

- Update VRAM allocation logic to use 'VIRTUAL_VRAM_SIZE' environment variable
- Reduce MFR (Math Formula Recognition) batch size from 64 to 32

e74a2960
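
An environment-variable override of this kind could be sketched as below. Only the variable name `VIRTUAL_VRAM_SIZE` comes from the commit. The helper name, the unit (GB), and the fall-back to the detected value on a missing or malformed setting are assumptions.

```python
import os


def effective_vram_gb(detected_gb, environ=os.environ):
    """Return the VRAM size to plan batches for, honouring an override."""
    raw = environ.get("VIRTUAL_VRAM_SIZE")
    if raw is None:
        return detected_gb
    try:
        # Assumption: the override is a number of gigabytes.
        return float(raw)
    except ValueError:
        # A malformed override is ignored rather than treated as fatal.
        return detected_gb


print(effective_vram_gb(24, {"VIRTUAL_VRAM_SIZE": "8"}))  # 8.0
print(effective_vram_gb(24, {}))                          # 24
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.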

perf(magic_pdf): optimize batch ratio calculation for GPU · 052a4d72

myhloli authored Jan 21, 2025

- Update GPU memory check and batch ratio calculation logic
- Add support for virtual VRAM size environment variable
- Improve logging for GPU memory and batch ratio

052a4d72

perf(model): adjust batch size for layout and formula detection · 49d140c5

myhloli authored Jan 21, 2025

- Reduce YOLO_LAYOUT_BASE_BATCH_SIZE from 4 to 1
- Simplify batch ratio calculation for formula detection
- Remove unused conditional logic in batch ratio determination

49d140c5

20 Jan, 2025 2 commits

fix(ocr): improve ONNX model initialization and error handling · b3d60b96

myhloli authored Jan 20, 2025

- Add key length validation for ONNX model initialization
- Move import statements to the top of the file
- Wrap model initialization in a try-except block for better error handling
- Refactor code to improve readability and maintainability

b3d60b96

Fix OCR utils · fbf1c4bf
陆逊 authored Jan 20, 2025

fbf1c4bf

17 Jan, 2025 2 commits

fix(magic_pdf): limit batch ratio for GPU memory · db8be974

myhloli authored Jan 17, 2025

- Commented out the original batch ratio calculation
- Set a fixed batch ratio of 2 for GPUs with less than 8 GB memory
- Increased batch ratio to 4 for GPUs with 8 GB or more memory

db8be974

refactor(table): add device configuration for Unitable model · e64d4fed

myhloli authored Jan 17, 2025

- Import get_device function from magic_pdf.libs.config_reader
- Update RapidTableModel initialization to include device parameter for Unitable model

e64d4fed

16 Jan, 2025 3 commits

refactor(model): update batch analyze logic for rapid table model · 452a9c0b

myhloli authored Jan 16, 2025

- Modify the batch analyze process to handle the rapid table model's output
- Add logic_points variable to capture additional output from rapid table prediction

452a9c0b

feat(table): upgrade RapidTable to 1.0.3 and add sub-model support · 79c8a5c8

myhloli authored Jan 16, 2025

- Update RapidTable dependency to version 1.0.3
- Add support for sub-models in RapidTable
- Update magic-pdf configuration to include table sub-model
- Modify table model initialization to support sub-models
- Update table prediction logic to handle new output format

79c8a5c8

fix(magic_pdf): correct end page index and improve error handling · f209ddea

myhloli authored Jan 16, 2025

- Adjust end_page_id calculation to prevent IndexError when accessing pages
- Enhance error handling in LLM post-processing by specifically catching JSONDecodeError

f209ddea
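
Catching `JSONDecodeError` specifically, rather than a bare `Exception`, can be sketched like this. The helper name and the `default` parameter are illustrative assumptions. `json.JSONDecodeError` is the standard-library exception raised by `json.loads` on malformed input.

```python
import json


def parse_llm_json(raw_text, default=None):
    """Parse model output as JSON, returning default when it is malformed."""
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        # Only malformed JSON is handled; other errors still propagate.
        return default


print(parse_llm_json('{"title": "Intro"}'))     # {'title': 'Intro'}
print(parse_llm_json("not json", default={}))   # {}
```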

15 Jan, 2025 1 commit

feat(model): improve batch analysis logic and support npu · f3502226

myhloli authored Jan 15, 2025

- Add support for NPU (Neural Processing Unit) when available
- Implement batch analysis for GPU and NPU devices
- Optimize memory usage and improve performance
- Update logging and error handling

f3502226

14 Jan, 2025 2 commits

refactor(BatchAnalyze): comment out image rotation logic in doclayout_yolo · 902dcd2c
myhloli authored Jan 14, 2025

902dcd2c

feat(layout): improve title block handling and layout detection · c20e9a1e

myhloli authored Jan 14, 2025

- Merge title blocks that are close to each other horizontally
- Adjust line insertion logic for title blocks
- Increase image size and decrease confidence threshold for layout detection
- Update DocLayoutYOLO model weights
- Refactor drawing of bounding boxes for different block types

c20e9a1e
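
Merging title blocks that sit close together horizontally can be illustrated with a simple greedy pass. This is a sketch under assumptions, not the commit's algorithm: the `(x0, y0, x1, y1)` box format, the vertical-overlap test for "same line", and the `max_gap` threshold of 10 are all made up for the example.

```python
def vertical_overlap(a, b):
    """True when boxes a and b share some vertical extent."""
    return min(a[3], b[3]) - max(a[1], b[1]) > 0


def merge_close_titles(boxes, max_gap=10):
    """Merge boxes on the same line whose horizontal gap is at most max_gap."""
    merged = []
    for box in sorted(boxes):  # left to right by x0
        for i, current in enumerate(merged):
            gap = box[0] - current[2]
            if vertical_overlap(current, box) and gap <= max_gap:
                # Replace the existing box with the union of both.
                merged[i] = (
                    min(current[0], box[0]), min(current[1], box[1]),
                    max(current[2], box[2]), max(current[3], box[3]),
                )
                break
        else:
            merged.append(box)
    return merged


titles = [(0, 0, 40, 10), (45, 1, 90, 11), (200, 0, 240, 10), (0, 50, 40, 60)]
print(merge_close_titles(titles))
# [(0, 0, 90, 11), (0, 50, 40, 60), (200, 0, 240, 10)]
```

The first two boxes are 5 units apart on the same line and are merged. The third is 110 units away and the fourth is on another line, so both stay separate.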