Commits · af27c0cc81e76199cfbfb5f1ca4cf1a360802fe4 · wangsen / MinerU

20 Mar, 2025 1 commit

refactor(magic_pdf): support mps device and optimize image processing · af27c0cc

myhloli authored Mar 20, 2025

- Add support for Apple M1 chips (mps device)
- Refactor image processing for better performance and compatibility
- Update model loading and inference for various devices
- Adjust batch processing and memory management

af27c0cc

19 Mar, 2025 1 commit

style: remove unused code · e9c24739

icecraft authored Mar 19, 2025

e9c24739

13 Mar, 2025 2 commits

feat: add parallel evaluation · b50f742f

icecraft authored Mar 13, 2025

b50f742f

feat: add parallel evaluation · 3a2f86a1

icecraft authored Mar 13, 2025

3a2f86a1

11 Mar, 2025 1 commit

perf(inference): optimize batch processing for different GPU memory sizes · 6116488d

myhloli authored Mar 11, 2025

- Set NPUDTCompile to false for better performance on NPU
- Adjust batch ratio

6116488d

03 Mar, 2025 3 commits

perf(inference): adjust batch ratio for high GPU memory · 0b05dff7

myhloli authored Mar 03, 2025

- Increase batch ratio to 8 for GPU memory >=16GB
- Improve inference performance on systems with higher GPU memory

0b05dff7

perf(inference): adjust batch ratio for GPU memory sizes · 58b6ad8c

myhloli authored Mar 03, 2025

- Simplify batch ratio logic for GPU memory >= 16GB
- Remove unnecessary conditions for 20GB and 40GB memory

58b6ad8c

perf(inference): adjust batch ratio for GPU memory sizes · 0d3304d7

myhloli authored Mar 03, 2025

- Simplify batch ratio logic for GPU memory >= 16GB
- Remove unnecessary conditions for 20GB and 40GB memory

0d3304d7

25 Feb, 2025 1 commit

perf(model): optimize batch analyze process · 6753df8d

myhloli authored Feb 25, 2025

- Move batch model initialization outside the loop
- Collect page dimensions before analyzing
- Update page info dictionary structure
- Add null dimensions for non-analyzed pages

6753df8d

24 Feb, 2025 1 commit

fix(magic_pdf): correct negative indexing for `end_page_id` · 90a27ecd

myhloli authored Feb 24, 2025

- Update the logic for determining `end_page_id` to handle negative values
- This change ensures proper behavior when `end_page_id` is set to -1 or other negative values

90a27ecd
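
The negative-index handling described in this commit can be sketched roughly as follows. This is an illustrative reconstruction, not the code from the commit: the function name `resolve_end_page_id`, its parameters, and the clamping of out-of-range values are assumptions.

```python
def resolve_end_page_id(end_page_id, page_count):
    """Map a possibly negative or missing end page index to a valid 0-based index.

    Hypothetical helper: the name and the clamping behavior are assumptions,
    not taken from the MinerU source.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    last = page_count - 1
    if end_page_id is None:
        # No end page given: process through the last page.
        return last
    if end_page_id < 0:
        # Python-style negative indexing: -1 is the last page, -2 the one before.
        end_page_id = page_count + end_page_id
    # Clamp into [0, last] so later page access cannot raise IndexError.
    return max(0, min(end_page_id, last))
```

With a 10-page document, `resolve_end_page_id(-1, 10)` resolves to the last page index, and an oversized value such as 50 is clamped to the same index.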

22 Feb, 2025 1 commit

fix doc_analyze first page only · 37f3e200

Nathan Dahlberg authored Feb 22, 2025

37f3e200

11 Feb, 2025 2 commits

fix(model): move environment variable settings to global scope · f5112e21

myhloli authored Feb 11, 2025

- Move environment variable settings for NPU, MPS, and other configurations to the global scope in doc_analyze_by_custom_model.py
- Remove redundant environment variable settings in pdf_extract_kit.py
- This change ensures consistent configuration across the application and avoids potential conflicts or duplicate settings

f5112e21

refactor(magic_pdf): improve code structure and memory safety · 4021abeb

myhloli authored Feb 11, 2025

4021abeb

10 Feb, 2025 1 commit

refactor(model): integrate Ascend plugin for NPU support · 7c76d361

myhloli authored Feb 10, 2025

- Remove unused utility functions
- Update import statements for better readability
- Add conditional imports for Ascend plugin
- Refactor table model initialization to support NPU

7c76d361

07 Feb, 2025 1 commit

perf(model): optimize batch ratio for different GPU memory sizes · b1ac7afd

myhloli authored Feb 07, 2025

- Update batch ratio calculation logic to better utilize available GPU memory
- Improve logging for all GPU memory sizes

b1ac7afd

27 Jan, 2025 2 commits

perf(model): adjust batch ratio for different GPU memory sizes · 29e7a948

myhloli authored Jan 27, 2025

29e7a948

perf(model): adjust batch ratio for GPU memory range · d1af4566

myhloli authored Jan 27, 2025

- Update batch ratio calculation for GPU memory range
- Increase the upper memory limit for batch ratio 16 from 24 GB to 32 GB

d1af4566

21 Jan, 2025 6 commits

fix(magic_pdf): correct batch ratio conditions for GPU memory · b6710b99

myhloli authored Jan 21, 2025

- Update conditions for batch ratio assignment:
  - 8 <= gpu_memory < 10: batch_ratio = 2
  - 10 <= gpu_memory <= 12: batch_ratio = 4
- This fix ensures proper batch ratio selection for GPU memory sizes

b6710b99
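
As a minimal sketch, the two conditions listed in this commit message could be written as below. Only the two ranges come from the message; the function name `batch_ratio_for` and the `default` fallback for memory sizes outside those ranges are assumptions, since the message does not describe the other tiers.

```python
def batch_ratio_for(gpu_memory_gb, default=1):
    """Pick a batch ratio from GPU memory in GB.

    Only the two ranges below are stated in the commit message; `default`
    is a hypothetical placeholder for every other memory size.
    """
    if 8 <= gpu_memory_gb < 10:
        return 2
    if 10 <= gpu_memory_gb <= 12:
        return 4
    # Assumption: tiers outside 8-12 GB are not covered by this message.
    return default
```

Note the inclusive upper bound at 12 GB and the exclusive bound at 10 GB, which is the kind of boundary detail this fix is about.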

perf(magic_pdf): optimize batch processing for GPU · 55447c8b

myhloli authored Jan 21, 2025

- Improve batch ratio calculation based on GPU memory
- Enhance performance for devices with 8GB or more VRAM

55447c8b

perf(magic_pdf): adjust batch ratio calculation for GPU memory · 037736fb

myhloli authored Jan 21, 2025

- Reduce batch_ratio by 1 for better performance and stability
- This change ensures more consistent memory usage when processing documents

037736fb

refactor(magic_pdf): adjust VRAM allocation and MFR batch size · e74a2960

myhloli authored Jan 21, 2025

- Update VRAM allocation logic to use 'VIRTUAL_VRAM_SIZE' environment variable
- Reduce MFR (Math Formula Recognition) batch size from 64 to 32

e74a2960
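
One way an environment override such as `VIRTUAL_VRAM_SIZE` might be honored is sketched below. Only the variable name comes from the commit message; the helper `effective_vram_gb`, the assumption that the value is expressed in GB, and the fallback to the detected size on invalid input are all illustrative.

```python
import os


def effective_vram_gb(detected_gb, env=None):
    """Return the VRAM size to plan batches against.

    Hypothetical helper: prefers VIRTUAL_VRAM_SIZE when it holds a positive
    number (assumed to be GB) and otherwise falls back to the detected size.
    """
    env = os.environ if env is None else env
    raw = env.get("VIRTUAL_VRAM_SIZE")
    if raw is None or not raw.strip():
        return detected_gb
    try:
        value = float(raw)
    except ValueError:
        # Unparseable override: ignore it rather than fail.
        return detected_gb
    return value if value > 0 else detected_gb
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.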

perf(magic_pdf): optimize batch ratio calculation for GPU · 052a4d72

myhloli authored Jan 21, 2025

- Update GPU memory check and batch ratio calculation logic
- Add support for virtual VRAM size environment variable
- Improve logging for GPU memory and batch ratio

052a4d72

perf(model): adjust batch size for layout and formula detection · 49d140c5

myhloli authored Jan 21, 2025

- Reduce YOLO_LAYOUT_BASE_BATCH_SIZE from 4 to 1
- Simplify batch ratio calculation for formula detection
- Remove unused conditional logic in batch ratio determination

49d140c5

17 Jan, 2025 1 commit

fix(magic_pdf): limit batch ratio for GPU memory · db8be974

myhloli authored Jan 17, 2025

- Commented out the original batch ratio calculation
- Set a fixed batch ratio of 2 for GPUs with less than 8 GB memory
- Increased batch ratio to 4 for GPUs with 8 GB or more memory

db8be974

16 Jan, 2025 1 commit

fix(magic_pdf): correct end page index and improve error handling · f209ddea

myhloli authored Jan 16, 2025

- Adjust end_page_id calculation to prevent IndexError when accessing pages
- Enhance error handling in LLM post-processing by specifically catching JSONDecodeError

f209ddea

15 Jan, 2025 1 commit

feat(model): improve batch analysis logic and support npu · f3502226

myhloli authored Jan 15, 2025

- Add support for NPU (Neural Processing Unit) when available
- Implement batch analysis for GPU and NPU devices
- Optimize memory usage and improve performance
- Update logging and error handling

f3502226

26 Dec, 2024 1 commit

refactor(device): optimize memory cleaning and device selection · 50f48417

myhloli authored Dec 26, 2024

- Update clean_memory function to support both CUDA and NPU devices
- Implement get_device function to centralize device selection logic
- Modify model initialization and memory cleaning to use the selected device
- Update RapidTableModel to support both RapidOCR and PaddleOCR engines

50f48417

18 Dec, 2024 1 commit

refactor: refactor code · b2887ca0

icecraft authored Dec 18, 2024

b2887ca0

17 Dec, 2024 1 commit

feat(language-detection): add YOLOv11 language detection model · 20438bd2

myhloli authored Dec 17, 2024

- Add YOLOv11 language detection model for PDF documents
- Implement language detection in PymuDocDataset
- Update app.py to include 'auto' language option
- Create language detection utilities and constants

20438bd2

16 Dec, 2024 1 commit

refactor(magic_pdf): remove YOLO_VERBOSE setting and update YOLOv8 prediction verbosity · 9e4ebea9

myhloli authored Dec 16, 2024

- Remove YOLO_VERBOSE environment variable from multiple files
- Set verbose=False in YOLOv8 prediction method to suppress logger output

9e4ebea9

10 Dec, 2024 1 commit

fix(magic_pdf): disable PaddlePaddle signal handler · dd7f6781

myhloli authored Dec 10, 2024

- Import paddle module and disable its signal handler to prevent interference with other components
- This change addresses potential conflicts between PaddlePaddle and other libraries or system signals

dd7f6781

09 Dec, 2024 1 commit

refactor(magic_pdf): optimize environment setup and dependencies · a296ea41

myhloli authored Dec 09, 2024

- Add environment variables to disable albumentations and yolo updates
- Import torchtext and disable deprecation warnings
- Update unimernet to 0.2.2
- Specify ultralytics version as >=8.3.48
- Remove upper version limit for torch

a296ea41

06 Dec, 2024 2 commits

refactor(magic-pdf): optimize model initialization and concurrency control · 012a46e0

myhloli authored Dec 06, 2024

- Remove concurrency limit logic from app.py
- Update model initialization process in various modules
- Remove unused VRAM check for concurrency limit
- Refactor OCR model initialization in pdf_extract_kit.py
- Update txt_spans_extract_v2 function to use lang parameter instead of ocr_model

012a46e0

refactor(model): replace ModelSingleton with direct model initialization and improve threading · 6f636b6e

myhloli authored Dec 06, 2024

- Remove usage of ModelSingleton class
- Initialize model directly using custom_model_init function
- Add self._lock attribute to PDFExtractKit class for thread safety
- Replace local lock with self._lock for OCR processing

6f636b6e

03 Dec, 2024 2 commits

feat: add function definitions · 4a82d6a0

icecraft authored Nov 28, 2024

4a82d6a0

refactor: isolate inference and pipeline · a3a720ea

icecraft authored Nov 27, 2024

a3a720ea

26 Nov, 2024 1 commit

perf(image_processing): reduce maximum image size for analysis · b3644157

myhloli authored Nov 26, 2024

- Decrease the maximum image size threshold from 9000 to 4500 pixels
- This change aims to improve performance and reduce memory usage
- Affects the custom model document analysis process

b3644157

22 Nov, 2024 1 commit

refactor(model): move page total time logging to custom model analysis · f1e2f084

myhloli authored Nov 22, 2024

- Move page total time logging to doc_analyze_by_custom_model.py
- Remove page total time logging from pdf_extract_kit.py
- Add page_start timing variable to custom model analysis
- Update logger output format for page total time

f1e2f084

23 Oct, 2024 1 commit

feat(model): add support for DocLayout-YOLO model · 1279f2cd

myhloli authored Oct 23, 2024

- Add new layout model option: DocLayout-YOLO
- Implement model initialization and prediction for DocLayout-YOLO
- Update configuration options to include new model
- Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
- Update Gradio app to support more custom switches

1279f2cd

10 Oct, 2024 1 commit

feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support · 6f63e70e

myhloli authored Oct 10, 2024

- Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process
- Add support for specifying page range in doc_analyze_by_custom_model
- Implement garbage collection and memory cleaning after processing
- Refine image loading from PDF, including handling out-of-range pages

6f63e70e