- 02 Apr, 2025 1 commit

myhloli authored
- Remove unused imports for concurrent.futures, multiprocessing, and paddle
- Delete commented-out code
- Update the numpy dependency to remove the upper version limit
- Remove the commented-out InferenceResult import

- 31 Mar, 2025 1 commit

myhloli authored
- Split the OCR process into detection and recognition stages
- Update the batch analysis and document analysis pipelines
- Modify OCR result formatting and handling
- Remove unused imports and optimize code structure

- 26 Mar, 2025 2 commits

- 24 Mar, 2025 1 commit

icecraft authored

- 20 Mar, 2025 4 commits

myhloli authored
- Remove the separate condition for GPU memory >= 24GB
- Simplify the logic to use a single 16GB threshold

myhloli authored
- Increase batch ratio to 32 for GPU memory >= 24GB
- Set batch ratio to 16 for GPU memory >= 16GB
- Reduce batch ratio to 8 for GPU memory >= 12GB
- Lower batch ratio to 4 for GPU memory >= 8GB
- Set batch ratio to 2 for GPU memory >= 6GB
- Keep batch ratio at 1 for lower GPU memory sizes
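The tiers above form a simple threshold table. A minimal sketch of what such a mapping could look like (the standalone `get_batch_ratio` function is illustrative, not the project's actual code):

```python
def get_batch_ratio(gpu_memory_gb: float) -> int:
    """Map available GPU memory (in GB) to a batch ratio, per the tiers above."""
    # Check thresholds from largest to smallest so the highest matching tier wins.
    tiers = [(24, 32), (16, 16), (12, 8), (8, 4), (6, 2)]
    for threshold_gb, ratio in tiers:
        if gpu_memory_gb >= threshold_gb:
            return ratio
    return 1  # below 6 GB, keep the batch ratio at 1


print(get_batch_ratio(24))  # → 32
print(get_batch_ratio(10))  # → 4
```

Ordering the checks from largest to smallest keeps each tier's condition to a single comparison instead of a pair of bounds.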

myhloli authored
- Remove the torchtext version check and deprecation-warning handling from multiple files
- This code was unnecessary and could cause issues when torchtext was not installed

myhloli authored
- Add support for Apple M1 chips (the mps device)
- Refactor image processing for better performance and compatibility
- Update model loading and inference for various devices
- Adjust batch processing and memory management

- 19 Mar, 2025 1 commit

icecraft authored

- 13 Mar, 2025 2 commits

- 11 Mar, 2025 1 commit

myhloli authored
- Set NPUDTCompile to false for better performance on NPU
- Adjust the batch ratio

- 03 Mar, 2025 3 commits

myhloli authored
- Increase batch ratio to 8 for GPU memory >= 16GB
- Improve inference performance on systems with more GPU memory

myhloli authored
- Simplify batch ratio logic for GPU memory >= 16GB
- Remove unnecessary conditions for 20GB and 40GB memory

myhloli authored
- Simplify batch ratio logic for GPU memory >= 16GB
- Remove unnecessary conditions for 20GB and 40GB memory

- 25 Feb, 2025 1 commit

myhloli authored
- Move batch model initialization outside the loop
- Collect page dimensions before analyzing
- Update the page info dictionary structure
- Add null dimensions for non-analyzed pages

- 24 Feb, 2025 1 commit

myhloli authored
- Update the logic for determining `end_page_id` to handle negative values
- This change ensures correct behavior when `end_page_id` is set to -1 or another negative value
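Treating -1 (or any other negative value) as "through the last page" usually reduces to a small normalization step. A hypothetical sketch (the helper name and the clamping of too-large values are assumptions, not the project's actual code):

```python
def resolve_end_page_id(end_page_id, page_count: int) -> int:
    """Normalize end_page_id: None or any negative value means the last page."""
    if end_page_id is None or end_page_id < 0:
        return page_count - 1
    # Clamp values past the end so later page access cannot raise an IndexError.
    return min(end_page_id, page_count - 1)


print(resolve_end_page_id(-1, 10))  # → 9
```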

- 22 Feb, 2025 1 commit

Nathan Dahlberg authored

- 11 Feb, 2025 2 commits

myhloli authored
- Move environment variable settings for NPU, MPS, and other configurations to the global scope in doc_analyze_by_custom_model.py
- Remove redundant environment variable settings in pdf_extract_kit.py
- This ensures consistent configuration across the application and avoids conflicting or duplicate settings

myhloli authored

- 10 Feb, 2025 1 commit

myhloli authored
- Remove unused utility functions
- Update import statements for better readability
- Add conditional imports for the Ascend plugin
- Refactor table model initialization to support NPU

- 07 Feb, 2025 1 commit

myhloli authored
- Update the batch ratio calculation logic to better utilize available GPU memory
- Improve logging for all GPU memory sizes

- 27 Jan, 2025 2 commits

- 21 Jan, 2025 6 commits

myhloli authored
- Update the conditions for batch ratio assignment:
  - 8 <= gpu_memory < 10: batch_ratio = 2
  - 10 <= gpu_memory <= 12: batch_ratio = 4
- This fix ensures the correct batch ratio is selected for these GPU memory sizes

myhloli authored
- Improve batch ratio calculation based on GPU memory
- Enhance performance for devices with 8GB or more VRAM

myhloli authored
- Reduce batch_ratio by 1 for better performance and stability
- This change ensures more consistent memory usage when processing documents

myhloli authored
refactor(magic_pdf): adjust VRAM allocation and MFR batch size
- Update VRAM allocation logic to use the 'VIRTUAL_VRAM_SIZE' environment variable
- Reduce the MFR (Math Formula Recognition) batch size from 64 to 32

myhloli authored
- Update the GPU memory check and batch ratio calculation logic
- Add support for a virtual VRAM size environment variable
- Improve logging for GPU memory and batch ratio
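An environment-variable override like this typically just takes precedence over the detected VRAM size. A sketch assuming `VIRTUAL_VRAM_SIZE` holds a size in gigabytes (the surrounding function is hypothetical):

```python
import os


def effective_vram_gb(detected_vram_gb: float) -> float:
    """Return the VRAM size to plan batches against, letting the
    VIRTUAL_VRAM_SIZE environment variable override the detected value."""
    override = os.getenv('VIRTUAL_VRAM_SIZE')
    if override is not None:
        return float(override)  # the user-supplied virtual size wins
    return detected_vram_gb
```

Such an override is handy for capping memory use on shared GPUs, or for pretending a larger card is present when testing the batch-ratio tiers.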

myhloli authored
- Reduce YOLO_LAYOUT_BASE_BATCH_SIZE from 4 to 1
- Simplify the batch ratio calculation for formula detection
- Remove unused conditional logic in batch ratio determination

- 17 Jan, 2025 1 commit

myhloli authored
- Commented out the original batch ratio calculation
- Set a fixed batch ratio of 2 for GPUs with less than 8 GB of memory
- Increased the batch ratio to 4 for GPUs with 8 GB or more

- 16 Jan, 2025 1 commit

myhloli authored
- Adjust the end_page_id calculation to prevent an IndexError when accessing pages
- Improve error handling in LLM post-processing by specifically catching JSONDecodeError
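Catching `json.JSONDecodeError` specifically, rather than a bare `except`, lets malformed LLM output fall back gracefully while unrelated errors still propagate. A hedged sketch (the function name and fallback behavior are illustrative, not the project's code):

```python
import json


def parse_llm_json(raw: str, fallback=None):
    """Parse an LLM response as JSON; return `fallback` only when the
    text is not valid JSON, letting other exceptions propagate."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return fallback


print(parse_llm_json('{"a": 1}'))  # → {'a': 1}
```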

- 15 Jan, 2025 1 commit

myhloli authored
- Add support for NPU (Neural Processing Unit) devices when available
- Implement batch analysis for GPU and NPU devices
- Optimize memory usage and improve performance
- Update logging and error handling

- 26 Dec, 2024 1 commit

myhloli authored
- Update the clean_memory function to support both CUDA and NPU devices
- Implement a get_device function to centralize device selection logic
- Modify model initialization and memory cleaning to use the selected device
- Update RapidTableModel to support both the RapidOCR and PaddleOCR engines
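Centralized device selection usually reduces to a preference-ordered probe. A simplified sketch with the availability checks passed in as flags (the real code would query e.g. `torch.cuda.is_available()`, and the preference order shown here is an assumption):

```python
def get_device(cuda_available: bool, npu_available: bool,
               mps_available: bool) -> str:
    """Pick one device string, preferring cuda, then npu, then mps, else cpu."""
    if cuda_available:
        return 'cuda'
    if npu_available:   # Ascend NPU, exposed via the optional torch_npu plugin
        return 'npu'
    if mps_available:   # Apple silicon
        return 'mps'
    return 'cpu'


print(get_device(False, False, True))  # → mps
```

Keeping this logic in one function means clean_memory and model initialization can agree on the device instead of each re-probing the hardware.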

- 18 Dec, 2024 1 commit

icecraft authored

- 17 Dec, 2024 1 commit

myhloli authored
- Add a YOLOv11 language detection model for PDF documents
- Implement language detection in PymuDocDataset
- Update app.py to include an 'auto' language option
- Create language detection utilities and constants

- 16 Dec, 2024 1 commit

myhloli authored
- Remove the YOLO_VERBOSE environment variable from multiple files
- Set verbose=False in the YOLOv8 prediction method to suppress logger output

- 10 Dec, 2024 1 commit

myhloli authored
- Import the paddle module and disable its signal handler to prevent interference with other components
- This addresses potential conflicts between PaddlePaddle and other libraries or system signals
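PaddlePaddle registers its own signal handlers at import time, and `paddle.disable_signal_handler()` turns them off. A guarded sketch (the wrapper function is illustrative; only the paddle call itself comes from the log above):

```python
def disable_paddle_signal_handler() -> bool:
    """Disable PaddlePaddle's signal handler if paddle is importable.

    Returns True when the handler was disabled, False when paddle is absent."""
    try:
        import paddle
    except ImportError:
        return False  # paddle not installed; nothing to disable
    paddle.disable_signal_handler()
    return True
```

Guarding the import keeps the call harmless in environments where Paddle is an optional dependency.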

- 09 Dec, 2024 1 commit

myhloli authored
- Add environment variables to disable albumentations and YOLO update checks
- Import torchtext and disable its deprecation warnings
- Update unimernet to 0.2.2
- Specify the ultralytics version as >=8.3.48
- Remove the upper version limit for torch